Data Sources
Upload, organize, and manage documents for your RAG system
The Data Sources step is where you upload and organize the documents that will power your RAG system. This step handles file processing, metadata extraction, and initial document validation to ensure your content is ready for embedding and retrieval.
Supported File Formats
Document Types
The RAG Wizard supports the following document formats:
Text Documents
- .txt - Plain text files
- .md - Markdown documents
- .rtf - Rich Text Format
PDF Documents
- Standard PDFs with extractable text
- Scanned PDFs (OCR processing available)
- Password-protected PDFs (you provide the password at upload)
- Multi-page documents with complex layouts
Microsoft Office Documents
- .docx - Word documents
- .xlsx - Excel spreadsheets
- .pptx - PowerPoint presentations
Structured Data
- .csv - Comma-separated values
- .json - JSON data files
- .xml - XML structured documents
Web Content
- .html - HTML documents
- .htm - Web pages
- URL imports (web scraping)
File Size Limits
- Individual File: Up to 25MB per document
- Batch Upload: Up to 100 files simultaneously
- Total Project Storage: Varies by subscription tier
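The format list and size limits above can be checked locally before an upload is attempted. The following is a minimal sketch, not the Wizard's own validation: the function name `check_batch` is hypothetical, ".pdf" is assumed as the extension for PDF documents, and "25MB" is assumed to mean 25 × 1024 × 1024 bytes.

```python
from pathlib import PurePath

# Extensions taken from the Supported File Formats list (".pdf" is assumed).
SUPPORTED_EXTENSIONS = {
    ".txt", ".md", ".rtf", ".pdf", ".docx", ".xlsx", ".pptx",
    ".csv", ".json", ".xml", ".html", ".htm",
}
# Assumption: "25MB" is interpreted as 25 * 1024 * 1024 bytes.
MAX_FILE_BYTES = 25 * 1024 * 1024
MAX_BATCH_FILES = 100


def check_batch(files):
    """Return (filename, problem) pairs for a proposed upload batch.

    `files` is a list of (filename, size_in_bytes) tuples.
    An empty result means no problem was found.
    """
    problems = []
    if len(files) > MAX_BATCH_FILES:
        problems.append(
            ("<batch>", f"{len(files)} files exceeds the limit of {MAX_BATCH_FILES}")
        )
    for name, size in files:
        ext = PurePath(name).suffix.lower()
        if ext not in SUPPORTED_EXTENSIONS:
            problems.append((name, f"unsupported extension {ext!r}"))
        if size > MAX_FILE_BYTES:
            problems.append((name, "larger than 25 MB"))
    return problems
```

Running a check like this on the client side avoids waiting for a rejected upload; the project storage limit is not covered because it depends on the subscription tier.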
Upload Methods
Local File Upload
Drag & Drop Interface
- Drag files directly into the upload zone
- Visual feedback for file validation
- Automatic format detection and preview
File Browser Selection
- Traditional file picker interface
- Multi-file selection support
- Folder structure preservation options
Remote Content Import
URL Import
- Import content directly from web URLs
- Automatic content extraction and cleaning
- Support for authentication headers
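To illustrate what "authentication headers" means for a URL import, the sketch below builds (but does not send) an HTTP request with Python's standard library. The helper name `build_import_request`, the User-Agent string, and the choice of a Bearer token are assumptions for illustration; the scheme your source actually requires may differ.

```python
import urllib.request


def build_import_request(url, bearer_token=None):
    """Build a GET request for a page, optionally with an Authorization header.

    Assumption: the protected source accepts a Bearer token.
    """
    headers = {"User-Agent": "rag-import-sketch/0.1"}  # hypothetical client name
    if bearer_token:
        headers["Authorization"] = f"Bearer {bearer_token}"
    return urllib.request.Request(url, headers=headers)


request = build_import_request("https://example.com/docs/page", "my-token")
# urllib.request.urlopen(request) would perform the actual fetch.
```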
Cloud Storage Integration
- Google Drive integration
- Dropbox synchronization
- OneDrive connectivity
- AWS S3 bucket imports
Bulk Import Options
Archive Upload
- ZIP file extraction and processing
- Maintains folder structure
- Batch metadata application
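One way to picture folder-structure preservation is to list each file in an archive together with the folder it came from. This is a rough sketch using Python's `zipfile` module; `list_archive_documents` is a hypothetical helper, not part of the RAG Wizard.

```python
import io
import zipfile
from pathlib import PurePosixPath


def list_archive_documents(zip_bytes):
    """Return (folder, filename) pairs for every file entry in a ZIP archive."""
    entries = []
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue  # directory entries carry no document content
            path = PurePosixPath(info.filename)  # ZIP paths always use "/"
            entries.append((str(path.parent), path.name))
    return entries
```

Files at the top level of the archive are reported with the folder ".". The folder value could then serve as a key for applying the same metadata to every document in that folder.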
API-Based Import
- Programmatic document upload
- Integration with existing systems
- Automated workflow support
Document Processing
Automatic Processing
Upon upload, documents undergo automatic processing:
Content Extraction
- Text extraction from various formats
- Image and table recognition (OCR)
- Metadata preservation and enhancement
- Language detection and encoding normalization
Quality Assessment
- Content quality scoring
- Duplicate detection and handling
- Format validation and error reporting
- Readability analysis
Pre-processing Options
- HTML tag cleaning and normalization
- Special character handling
- Encoding standardization
- Whitespace normalization
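The pre-processing options above can be approximated with the standard library. The sketch below is illustrative only and makes several assumptions: it treats "special character handling" as Unicode NFKC normalization, drops `<script>` and `<style>` content, and joins text nodes with a space (which can split a word that spans inline tags). The Wizard's actual pipeline is not documented here.

```python
import re
import unicodedata
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text nodes and skips tags, scripts, and styles."""

    def __init__(self):
        super().__init__(convert_charrefs=True)  # decode entities like &amp;
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def preprocess(raw_html):
    """Strip HTML tags, normalize Unicode forms, and collapse whitespace."""
    parser = _TextExtractor()
    parser.feed(raw_html)
    parser.close()
    text = " ".join(parser.parts)
    text = unicodedata.normalize("NFKC", text)  # e.g. non-breaking space -> space
    return re.sub(r"\s+", " ", text).strip()
```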
Manual Processing Controls
Document Settings per File
- Custom metadata assignment
- Processing method selection
- Quality threshold adjustment
- Inclusion/exclusion rules
Batch Operations
- Apply settings to multiple documents
- Bulk metadata updates
- Mass processing configuration
- Group management tools
Document Organization
Metadata Management
Automatic Metadata Extraction
- File properties (creation date, author, size)
- Content analysis (topic, keywords, language)
- Document structure (headings, sections, pages)
- Quality metrics (readability, completeness)
Custom Metadata Fields
- Add domain-specific metadata
- Create custom taxonomies
- Tag-based organization
- Hierarchical categorization
Metadata Templates
- Reusable metadata schemas
- Industry-standard templates
- Custom field definitions
- Validation rules and constraints
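A metadata template can be thought of as a list of field definitions, each with a type, a required flag, and an optional constraint. The sketch below shows that idea under stated assumptions: `FieldSpec`, `POLICY_TEMPLATE`, and `validate_metadata` are hypothetical names, and the example fields are invented for illustration.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional


@dataclass(frozen=True)
class FieldSpec:
    name: str
    kind: type                     # expected Python type of the value
    required: bool = False
    check: Optional[Callable[[Any], bool]] = None  # extra constraint


# Hypothetical template for policy documents.
POLICY_TEMPLATE = [
    FieldSpec("department", str, required=True),
    FieldSpec("tags", list),
    FieldSpec("priority", int, check=lambda v: 1 <= v <= 5),
]


def validate_metadata(metadata, template):
    """Return a list of error strings; an empty list means the metadata is valid."""
    errors = []
    for spec in template:
        if spec.name not in metadata:
            if spec.required:
                errors.append(f"{spec.name}: missing required field")
            continue
        value = metadata[spec.name]
        if not isinstance(value, spec.kind):
            errors.append(f"{spec.name}: expected {spec.kind.__name__}")
        elif spec.check is not None and not spec.check(value):
            errors.append(f"{spec.name}: failed constraint")
    return errors
```

Because the template is plain data, the same definition can be reused across projects and applied to documents in bulk.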
Document Categorization
Automatic Categorization
- Content-based grouping
- Topic modeling and clustering
- Similarity-based organization
- Language and format grouping
Manual Organization
- Folder-like structures
- Tag-based systems
- Priority and status indicators
- Custom classification schemes
Validation and Quality Control
Content Validation
Format Validation
- File integrity checks
- Format compatibility verification
- Encoding validation
- Structure analysis
Quality Assessment
- Content completeness scoring
- Text quality evaluation
- Information density analysis
- Relevance scoring
Error Handling
- Automatic error detection
- Repair suggestions and options
- Manual review workflows
- Quarantine and resolution systems
Duplicate Detection
Automatic Detection
- Content-based duplicate identification
- Hash-based exact matching
- Similarity-based near-duplicate detection
- Cross-format duplicate recognition
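The two main detection strategies listed above can be sketched briefly. Exact matching is shown here as a SHA-256 hash over normalized text, and near-duplicate detection as Jaccard similarity over word shingles. These are common techniques chosen for illustration; the source does not say which algorithms the Wizard uses, and the function names and the shingle size of 3 are assumptions.

```python
import hashlib
import re


def content_hash(text):
    """SHA-256 of lowercased, whitespace-normalized text (exact matching)."""
    normalized = re.sub(r"\s+", " ", text).strip().lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def shingles(text, size=3):
    """Return the set of overlapping word n-grams in the text."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < size:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def jaccard(a, b):
    """Similarity in [0, 1]: shared shingles divided by all distinct shingles."""
    sa, sb = shingles(a), shingles(b)
    if not sa and not sb:
        return 1.0  # two empty texts are treated as identical
    return len(sa & sb) / len(sa | sb)
```

Two documents with equal `content_hash` values are exact duplicates after normalization. A `jaccard` score above a chosen cutoff (for example 0.8, itself an assumption) could flag a pair for manual review. Hashing the extracted text rather than the file bytes is also what makes cross-format matches possible.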
Resolution Options
- Merge duplicate documents
- Keep best version based on quality
- Manual review and selection
- Custom resolution rules
Document Preview and Review
Content Preview
Rich Document Viewer
- In-browser document preview
- Highlighting of extracted content
- Metadata display and editing
- Structure visualization
Processing Results
- Extraction quality indicators
- Detected metadata display
- Processing warnings and errors
- Optimization suggestions
Bulk Review Mode
Overview Dashboard
- Document status summary
- Processing progress tracking
- Quality metrics overview
- Error and warning aggregation
Batch Operations
- Select multiple documents for review
- Apply bulk actions and settings
- Mass approval workflows
- Batch error resolution
Advanced Features
Version Control
Document Versioning
- Track document updates and changes
- Compare versions and differences
- Rollback to previous versions
- Change history and audit trails
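Comparing two versions of a document's extracted text can be illustrated with Python's `difflib`. This is only a sketch of the idea; `diff_versions` is a hypothetical helper and the version labels are placeholders.

```python
import difflib


def diff_versions(old_text, new_text, old_label="v1", new_label="v2"):
    """Return a unified diff between two versions as a list of lines."""
    return list(
        difflib.unified_diff(
            old_text.splitlines(),
            new_text.splitlines(),
            fromfile=old_label,
            tofile=new_label,
            lineterm="",  # inputs have no trailing newlines
        )
    )


for line in diff_versions("Refunds: 14 days\nContact: email", "Refunds: 30 days\nContact: email"):
    print(line)
```

An empty result means the two versions are identical, which is a cheap way to decide whether re-processing is needed at all.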
Update Management
- Automatic update detection
- Selective re-processing options
- Incremental update handling
- Version conflict resolution
Integration Options
API Access
- Programmatic document management
- Webhook notifications for processing events
- Batch processing APIs
- Custom integration endpoints
Workflow Integration
- Connect with existing document workflows
- Automated processing pipelines
- Event-driven processing triggers
- Custom business logic integration
Best Practices
Document Preparation
- Clean Source Documents: Remove unnecessary formatting and content
- Consistent Structure: Use standardized document formats when possible
- Clear Metadata: Provide descriptive filenames and metadata
- Quality Content: Ensure documents contain relevant, high-quality information
Organization Strategy
- Logical Grouping: Organize documents by topic, department, or use case
- Consistent Naming: Use clear, descriptive naming conventions
- Regular Review: Periodically review and update document collections
- Version Management: Keep track of document versions and updates
Performance Optimization
- Batch Processing: Upload related documents together for efficiency
- Quality Filtering: Remove low-quality or irrelevant documents early
- Size Management: Balance document size with processing requirements
- Regular Maintenance: Clean up outdated or duplicate documents
Prerequisites for Next Step
Before proceeding to Document Processing & Chunking:
- ✅ At least one document successfully uploaded and processed
- ✅ Document validation completed without critical errors
- ✅ Metadata extraction and organization completed
- ✅ Quality assessment passed minimum thresholds
The documents uploaded in this step will be processed and chunked in the next phase, where you'll configure how content is divided and prepared for embedding.
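The prerequisite checklist above can be expressed as a simple readiness check. The sketch below is hypothetical throughout: `DocumentStatus`, its fields, and the default `min_quality` of 0.5 are assumptions for illustration, since the actual thresholds and status model are not specified here.

```python
from dataclasses import dataclass


@dataclass
class DocumentStatus:
    name: str
    processed: bool
    critical_errors: int = 0
    metadata_complete: bool = False
    quality_score: float = 0.0  # assumed scale: 0.0 to 1.0


def unmet_prerequisites(documents, min_quality=0.5):
    """Return the checklist items that are not yet satisfied."""
    unmet = []
    processed = [d for d in documents if d.processed]
    if not processed:
        unmet.append("no document has been uploaded and processed")
    if any(d.critical_errors for d in documents):
        unmet.append("validation reported critical errors")
    if any(not d.metadata_complete for d in processed):
        unmet.append("metadata extraction is incomplete")
    if any(d.quality_score < min_quality for d in processed):
        unmet.append("quality score below minimum threshold")
    return unmet
```

An empty list indicates that all four conditions hold and the project can move on to Document Processing & Chunking.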