Data Sources

Upload, organize, and manage documents for your RAG system

The Data Sources step is where you upload and organize the documents that will power your RAG system. This step handles file processing, metadata extraction, and initial document validation to ensure your content is ready for embedding and retrieval.

Supported File Formats

Document Types

The RAG Wizard supports the following document formats:

Text Documents

  • .txt - Plain text files
  • .md - Markdown documents
  • .rtf - Rich Text Format

PDF Documents

  • Standard PDFs with extractable text
  • Scanned PDFs (OCR processing available)
  • Password-protected PDFs (when you supply the password)
  • Multi-page documents with complex layouts

Microsoft Office Documents

  • .docx - Word documents
  • .xlsx - Excel spreadsheets
  • .pptx - PowerPoint presentations

Structured Data

  • .csv - Comma-separated values
  • .json - JSON data files
  • .xml - XML structured documents

Web Content

  • .html - HTML documents
  • .htm - Web pages
  • URL imports (web scraping)

File Size Limits

  • Individual File: Up to 25 MB per document
  • Batch Upload: Up to 100 files simultaneously
  • Total Project Storage: Varies by subscription tier
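
The format list and limits above can be checked locally before anything is uploaded. The following is a minimal sketch, not the RAG Wizard's own validator: the function names are hypothetical, `.pdf` is assumed as the extension for the PDF types listed, and 1 MB is assumed to mean 1024 × 1024 bytes.

```python
from pathlib import Path

# Extensions taken from "Supported File Formats"; ".pdf" is assumed for PDFs.
SUPPORTED_EXTENSIONS = {
    ".txt", ".md", ".rtf", ".pdf", ".docx", ".xlsx", ".pptx",
    ".csv", ".json", ".xml", ".html", ".htm",
}
MAX_FILE_BYTES = 25 * 1024 * 1024  # assumption: 1 MB = 1024 * 1024 bytes
MAX_BATCH_FILES = 100


def check_file(name: str, size_bytes: int) -> list[str]:
    """Return the problems found for one file; an empty list means it passes."""
    problems = []
    ext = Path(name).suffix.lower()
    if ext not in SUPPORTED_EXTENSIONS:
        problems.append(f"unsupported format: {ext or '(none)'}")
    if size_bytes > MAX_FILE_BYTES:
        problems.append("larger than 25 MB")
    return problems


def check_batch(files: dict[str, int]) -> dict[str, list[str]]:
    """Map each rejected file name to its problems.

    `files` maps a file name to its size in bytes.
    """
    if len(files) > MAX_BATCH_FILES:
        raise ValueError(
            f"batch has {len(files)} files; the limit is {MAX_BATCH_FILES}"
        )
    report = {name: check_file(name, size) for name, size in files.items()}
    return {name: problems for name, problems in report.items() if problems}
```

Running `check_batch` on a mapping of file names to sizes returns only the files that would be rejected, so an empty result means the batch is within the documented limits.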

Upload Methods

Local File Upload

Drag & Drop Interface

  • Drag files directly into the upload zone
  • Visual feedback for file validation
  • Automatic format detection and preview

File Browser Selection

  • Traditional file picker interface
  • Multi-file selection support
  • Folder structure preservation options

Remote Content Import

URL Import

  • Import content directly from web URLs
  • Automatic content extraction and cleaning
  • Support for authentication headers

Cloud Storage Integration

  • Google Drive integration
  • Dropbox synchronization
  • OneDrive connectivity
  • AWS S3 bucket imports

Bulk Import Options

Archive Upload

  • ZIP file extraction and processing
  • Maintains folder structure
  • Batch metadata application
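
To illustrate what preserving folder structure and applying batch metadata can look like, here is a small sketch that uses Python's standard `zipfile` module. It only inspects an archive and builds one record per file; the record layout and the function name are assumptions made for illustration, not the product's internal format.

```python
import zipfile
from pathlib import PurePosixPath


def list_archive_documents(zip_path, batch_metadata):
    """Return one record per file in the archive, keeping its folder path.

    `zip_path` may be a file path or a binary file-like object.
    `batch_metadata` is copied onto every record.
    """
    records = []
    with zipfile.ZipFile(zip_path) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue  # directory entries carry no document content
            path = PurePosixPath(info.filename)  # ZIP paths always use "/"
            records.append({
                "path": str(path),
                "folder": str(path.parent),  # "." for files at the archive root
                "size_bytes": info.file_size,
                "metadata": dict(batch_metadata),  # copy keeps records independent
            })
    return records
```

Keeping the folder as its own field means it can later be used as a category or tag without re-parsing paths.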

API-Based Import

  • Programmatic document upload
  • Integration with existing systems
  • Automated workflow support

Document Processing

Automatic Processing

Upon upload, documents undergo automatic processing:

Content Extraction

  • Text extraction from various formats
  • Image and table recognition (OCR)
  • Metadata preservation and enhancement
  • Language detection and encoding normalization

Quality Assessment

  • Content quality scoring
  • Duplicate detection and handling
  • Format validation and error reporting
  • Readability analysis

Pre-processing Options

  • HTML tag cleaning and normalization
  • Special character handling
  • Encoding standardization
  • Whitespace normalization
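
The four pre-processing options above can be sketched with the standard library alone. This is an illustrative approximation rather than the Wizard's actual pipeline: it assumes UTF-8 input by default, uses Unicode NFKC normalization as one possible reading of "special character handling", and the `clean_text` name is hypothetical.

```python
import re
import unicodedata
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text nodes and skips the contents of <script> and <style>."""

    def __init__(self):
        super().__init__(convert_charrefs=True)  # decode entities such as &nbsp;
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def clean_text(raw: bytes, encoding: str = "utf-8") -> str:
    """Decode bytes, drop HTML tags, normalize characters and whitespace."""
    # Encoding standardization: undecodable bytes become U+FFFD, not errors.
    text = raw.decode(encoding, errors="replace")

    # HTML tag cleaning: keep text nodes only.
    parser = _TextExtractor()
    parser.feed(text)
    parser.close()
    # Joining with a space keeps adjacent blocks apart, at the cost of
    # splitting a word that has inline markup in the middle of it.
    text = " ".join(parser.parts)

    # Special character handling: fold compatibility forms (for example a
    # non-breaking space becomes a regular space).
    text = unicodedata.normalize("NFKC", text)

    # Whitespace normalization: collapse runs to a single space.
    return re.sub(r"\s+", " ", text).strip()
```

For example, `clean_text(b"<p>Hello,&nbsp;<b>world</b></p>")` returns `"Hello, world"`.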

Manual Processing Controls

Document Settings per File

  • Custom metadata assignment
  • Processing method selection
  • Quality thresholds adjustment
  • Inclusion/exclusion rules

Batch Operations

  • Apply settings to multiple documents
  • Bulk metadata updates
  • Mass processing configuration
  • Group management tools

Document Organization

Metadata Management

Automatic Metadata Extraction

  • File properties (creation date, author, size)
  • Content analysis (topic, keywords, language)
  • Document structure (headings, sections, pages)
  • Quality metrics (readability, completeness)

Custom Metadata Fields

  • Add domain-specific metadata
  • Create custom taxonomies
  • Tag-based organization
  • Hierarchical categorization

Metadata Templates

  • Reusable metadata schemas
  • Industry-standard templates
  • Custom field definitions
  • Validation rules and constraints
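
A reusable metadata schema with validation rules could be modeled roughly as follows. The class names, the rule fields, and the error messages are all hypothetical; the sketch only shows how custom field definitions and constraints fit together.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class FieldRule:
    """One custom field definition (hypothetical structure)."""
    name: str
    kind: type                      # expected Python type of the value
    required: bool = False
    allowed: Optional[set] = None   # constraint: value must be one of these


@dataclass
class MetadataTemplate:
    """A named, reusable set of field rules."""
    name: str
    rules: list[FieldRule] = field(default_factory=list)

    def validate(self, metadata: dict) -> list[str]:
        """Return validation errors; an empty list means the metadata conforms."""
        errors = []
        known = {rule.name for rule in self.rules}
        for rule in self.rules:
            if rule.name not in metadata:
                if rule.required:
                    errors.append(f"missing required field: {rule.name}")
                continue
            value = metadata[rule.name]
            if not isinstance(value, rule.kind):
                errors.append(f"{rule.name}: expected {rule.kind.__name__}")
            elif rule.allowed is not None and value not in rule.allowed:
                errors.append(f"{rule.name}: {value!r} is not an allowed value")
        for key in metadata:
            if key not in known:
                errors.append(f"unknown field: {key}")
        return errors


# Example template: a required department drawn from a fixed taxonomy,
# plus an optional integer year.
policy_template = MetadataTemplate("policy-docs", [
    FieldRule("department", str, required=True, allowed={"hr", "legal", "finance"}),
    FieldRule("year", int),
])
```

Returning a list of messages instead of raising on the first failure lets a bulk review show every problem for a document at once.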

Document Categorization

Automatic Categorization

  • Content-based grouping
  • Topic modeling and clustering
  • Similarity-based organization
  • Language and format grouping

Manual Organization

  • Folder-like structures
  • Tag-based systems
  • Priority and status indicators
  • Custom classification schemes

Validation and Quality Control

Content Validation

Format Validation

  • File integrity checks
  • Format compatibility verification
  • Encoding validation
  • Structure analysis

Quality Assessment

  • Content completeness scoring
  • Text quality evaluation
  • Information density analysis
  • Relevance scoring

Error Handling

  • Automatic error detection
  • Repair suggestions and options
  • Manual review workflows
  • Quarantine and resolution systems

Duplicate Detection

Automatic Detection

  • Content-based duplicate identification
  • Hash-based exact matching
  • Similarity-based near-duplicate detection
  • Cross-format duplicate recognition
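
The difference between exact and near-duplicate matching is easiest to see in code. The sketch below hashes normalized text for exact matches and compares word 3-gram sets with Jaccard similarity for near-duplicates. The documentation does not state which algorithms the Wizard uses, so the technique, the 0.8 threshold, and the function names are assumptions chosen for illustration.

```python
import hashlib
import re


def content_hash(text: str) -> str:
    """SHA-256 of the text after lowercasing and collapsing whitespace.

    Normalizing first is a design choice: it treats copies that differ only
    in case or spacing as exact duplicates.
    """
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def shingles(text: str, k: int = 3) -> set:
    """Set of overlapping k-word sequences from the text."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a: set, b: set) -> float:
    """Size of the intersection divided by size of the union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def classify_pair(text_a: str, text_b: str, threshold: float = 0.8) -> str:
    """Return "exact", "near", or "distinct" for two extracted texts."""
    if content_hash(text_a) == content_hash(text_b):
        return "exact"
    if jaccard(shingles(text_a), shingles(text_b)) >= threshold:
        return "near"
    return "distinct"
```

Because the comparison runs on extracted text rather than raw bytes, the same approach also covers the cross-format case, such as a .docx file and a PDF export of it. Comparing every pair is quadratic in the number of documents, so large collections usually need an indexing step that this sketch omits.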

Resolution Options

  • Merge duplicate documents
  • Keep best version based on quality
  • Manual review and selection
  • Custom resolution rules

Document Preview and Review

Content Preview

Rich Document Viewer

  • In-browser document preview
  • Highlighting of extracted content
  • Metadata display and editing
  • Structure visualization

Processing Results

  • Extraction quality indicators
  • Detected metadata display
  • Processing warnings and errors
  • Optimization suggestions

Bulk Review Mode

Overview Dashboard

  • Document status summary
  • Processing progress tracking
  • Quality metrics overview
  • Error and warning aggregation

Batch Operations

  • Select multiple documents for review
  • Apply bulk actions and settings
  • Mass approval workflows
  • Batch error resolution

Advanced Features

Version Control

Document Versioning

  • Track document updates and changes
  • Compare versions and differences
  • Rollback to previous versions
  • Change history and audit trails

Update Management

  • Automatic update detection
  • Selective re-processing options
  • Incremental update handling
  • Version conflict resolution
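
One common way to implement update detection and selective re-processing is to store a content fingerprint for each document and compare it on the next import. The sketch below assumes that approach; the source does not say how the Wizard detects updates, and the function names and the shape of the returned plan are hypothetical.

```python
import hashlib


def fingerprint(content: bytes) -> str:
    """Stable identifier for a document's exact byte content."""
    return hashlib.sha256(content).hexdigest()


def plan_update(previous: dict[str, str],
                current: dict[str, bytes]) -> dict[str, list[str]]:
    """Decide what to re-process.

    `previous` maps document name -> fingerprint stored at the last import.
    `current` maps document name -> content of the new upload.
    """
    plan = {"added": [], "changed": [], "unchanged": [], "removed": []}
    for name, content in current.items():
        if name not in previous:
            plan["added"].append(name)
        elif previous[name] != fingerprint(content):
            plan["changed"].append(name)
        else:
            plan["unchanged"].append(name)
    plan["removed"] = [name for name in previous if name not in current]
    return plan
```

Only the documents under "added" and "changed" need to be processed again, which is what makes the update incremental.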

Integration Options

API Access

  • Programmatic document management
  • Webhook notifications for processing events
  • Batch processing APIs
  • Custom integration endpoints

Workflow Integration

  • Connect with existing document workflows
  • Automated processing pipelines
  • Event-driven processing triggers
  • Custom business logic integration

Best Practices

Document Preparation

  1. Clean Source Documents: Remove unnecessary formatting and content
  2. Consistent Structure: Use standardized document formats when possible
  3. Clear Metadata: Provide descriptive filenames and metadata
  4. Quality Content: Ensure documents contain relevant, high-quality information

Organization Strategy

  1. Logical Grouping: Organize documents by topic, department, or use case
  2. Consistent Naming: Use clear, descriptive naming conventions
  3. Regular Review: Periodically review and update document collections
  4. Version Management: Keep track of document versions and updates

Performance Optimization

  1. Batch Processing: Upload related documents together for efficiency
  2. Quality Filtering: Remove low-quality or irrelevant documents early
  3. Size Management: Balance document size with processing requirements
  4. Regular Maintenance: Clean up outdated or duplicate documents

Prerequisites for Next Step

Before proceeding to Document Processing & Chunking:

  • ✅ At least one document successfully uploaded and processed
  • ✅ Document validation completed without critical errors
  • ✅ Metadata extraction and organization completed
  • ✅ Quality assessment passed minimum thresholds
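
The checklist above can be expressed as a simple gate. This is a sketch under stated assumptions: the `DocumentStatus` fields, the 0.0 to 1.0 quality scale, and the default threshold of 0.5 are all hypothetical, since the actual thresholds are configured in the product.

```python
from dataclasses import dataclass


@dataclass
class DocumentStatus:
    """Per-document state (hypothetical fields for illustration)."""
    name: str
    processed: bool
    critical_errors: int
    metadata_complete: bool
    quality_score: float  # assumed scale: 0.0 (worst) to 1.0 (best)


def unmet_prerequisites(documents, min_quality: float = 0.5) -> list[str]:
    """Return the reasons the project cannot move on; empty means ready."""
    processed = [d for d in documents if d.processed]
    if not processed:
        return ["no document has been uploaded and processed"]

    unmet = []
    for d in processed:
        if d.critical_errors:
            unmet.append(f"{d.name}: {d.critical_errors} critical validation error(s)")
        if not d.metadata_complete:
            unmet.append(f"{d.name}: metadata extraction incomplete")
        if d.quality_score < min_quality:
            unmet.append(
                f"{d.name}: quality score {d.quality_score:.2f} "
                f"is below {min_quality:.2f}"
            )
    return unmet
```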

The documents uploaded in this step will be processed and chunked in the next phase, where you'll configure how content is divided and prepared for embedding.