Data Sources

Upload, organize, and manage documents for your RAG system

The Data Sources step is where you upload and organize the documents that will power your RAG system. This step handles file processing, metadata extraction, and initial document validation to ensure your content is ready for embedding and retrieval.

Supported File Formats

Document Types

The RAG Wizard supports the following document formats:

Text Documents

.txt - Plain text files
.md - Markdown documents
.rtf - Rich Text Format

PDF Documents

Standard PDFs with extractable text
Scanned PDFs (OCR processing available)
Password-protected PDFs (when you supply the password)
Multi-page documents with complex layouts

Microsoft Office Documents

.docx - Word documents
.xlsx - Excel spreadsheets
.pptx - PowerPoint presentations

Structured Data

.csv - Comma-separated values
.json - JSON data files
.xml - XML structured documents

Web Content

.html - HTML documents
.htm - Web pages
URL imports (web scraping)

File Size Limits

Individual File: Up to 25 MB per document
Batch Upload: Up to 100 files per batch
Total Project Storage: Varies by subscription tier
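
The limits above can be checked on the client before anything is uploaded. The following is a minimal sketch, not part of the RAG Wizard itself: the function name validate_batch is hypothetical, the extension set simply mirrors the formats listed on this page (with .pdf assumed for PDF documents), and "25 MB" is assumed to mean 25 × 1024 × 1024 bytes.

```python
from pathlib import Path

# Limits taken from this page; the exact byte interpretation is an assumption.
MAX_FILE_BYTES = 25 * 1024 * 1024
MAX_BATCH_FILES = 100
SUPPORTED_EXTENSIONS = {
    ".txt", ".md", ".rtf", ".pdf", ".docx", ".xlsx", ".pptx",
    ".csv", ".json", ".xml", ".html", ".htm",
}


def validate_batch(files):
    """Return a list of (name, problem) pairs; an empty list means no problems.

    `files` is an iterable of (filename, size_in_bytes) pairs.
    """
    files = list(files)
    problems = []
    if len(files) > MAX_BATCH_FILES:
        problems.append(
            ("<batch>", f"{len(files)} files exceeds the limit of {MAX_BATCH_FILES}")
        )
    for name, size in files:
        ext = Path(name).suffix.lower()  # compare extensions case-insensitively
        if ext not in SUPPORTED_EXTENSIONS:
            problems.append((name, f"unsupported extension {ext!r}"))
        if size > MAX_FILE_BYTES:
            problems.append((name, "larger than 25 MB"))
    return problems
```

Checking locally avoids waiting for a large upload only to have it rejected; the server-side validation remains the source of truth.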

Upload Methods

Local File Upload

Drag & Drop Interface

Drag files directly into the upload zone
Visual feedback for file validation
Automatic format detection and preview

File Browser Selection

Traditional file picker interface
Multi-file selection support
Folder structure preservation options

Remote Content Import

URL Import

Import content directly from web URLs
Automatic content extraction and cleaning
Support for authentication headers

Cloud Storage Integration

Google Drive integration
Dropbox synchronization
OneDrive connectivity
AWS S3 bucket imports

Bulk Import Options

Archive Upload

ZIP file extraction and processing
Maintains folder structure
Batch metadata application
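
To illustrate how folder structure in an archive can carry over into metadata, here is a small sketch using Python's standard zipfile module. The function names and the "folder" metadata key are assumptions made for this example; they do not describe the RAG Wizard's internal behavior.

```python
import io
import zipfile
from pathlib import PurePosixPath


def group_archive_members(archive_file):
    """Map each folder inside a ZIP archive to the file names it contains.

    `archive_file` is a path or a binary file-like object.
    """
    by_folder = {}
    with zipfile.ZipFile(archive_file) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue
            member = PurePosixPath(info.filename)
            # Skip entries that could escape the extraction folder.
            if member.is_absolute() or ".." in member.parts:
                continue
            by_folder.setdefault(str(member.parent), []).append(member.name)
    return by_folder


def folder_metadata(by_folder):
    """Batch metadata: tag every file with the folder it came from."""
    metadata = {}
    for folder, names in by_folder.items():
        for name in names:
            path = name if folder == "." else f"{folder}/{name}"
            metadata[path] = {"folder": folder}
    return metadata


# Build a small in-memory archive to try it out.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as demo:
    demo.writestr("policies/leave.md", "# Leave policy")
    demo.writestr("policies/travel.md", "# Travel policy")
    demo.writestr("readme.txt", "Top-level file")
buffer.seek(0)
grouped = group_archive_members(buffer)
```

Files at the top level of the archive are grouped under ".", which is how PurePosixPath reports an empty parent.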

API-Based Import

Programmatic document upload
Integration with existing systems
Automated workflow support

Document Processing

Automatic Processing

Upon upload, documents undergo automatic processing:

Content Extraction

Text extraction from various formats
Image and table recognition (OCR)
Metadata preservation and enhancement
Language detection and encoding normalization

Quality Assessment

Content quality scoring
Duplicate detection and handling
Format validation and error reporting
Readability analysis

Pre-processing Options

HTML tag cleaning and normalization
Special character handling
Encoding standardization
Whitespace normalization
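
The four pre-processing options can be pictured as a short pipeline. The sketch below uses only the Python standard library and is an illustration under stated assumptions: the preprocess function is hypothetical, the UTF-8 then Latin-1 fallback is one simple decoding strategy among many, and NFKC is assumed as the Unicode normalization form.

```python
import re
import unicodedata
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text nodes and skips the contents of <script> and <style>."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def preprocess(raw: bytes) -> str:
    # Encoding standardization: try UTF-8, fall back to Latin-1 (never fails).
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        text = raw.decode("latin-1")
    # HTML tag cleaning: keep text nodes only, joined by spaces so that
    # adjacent block elements do not run together.
    parser = _TextExtractor()
    parser.feed(text)
    parser.close()
    text = " ".join(parser.parts)
    # Special character handling: fold ligatures, non-breaking spaces, etc.
    text = unicodedata.normalize("NFKC", text)
    # Whitespace normalization: collapse runs of whitespace to one space.
    return re.sub(r"\s+", " ", text).strip()
```

Joining text nodes with a space keeps neighboring paragraphs apart, at the cost of inserting a space where an inline tag splits a word.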

Manual Processing Controls

Document Settings per File

Custom metadata assignment
Processing method selection
Quality threshold adjustment
Inclusion/exclusion rules

Batch Operations

Apply settings to multiple documents
Bulk metadata updates
Mass processing configuration
Group management tools

Document Organization

Metadata Management

Automatic Metadata Extraction

File properties (creation date, author, size)
Content analysis (topic, keywords, language)
Document structure (headings, sections, pages)
Quality metrics (readability, completeness)
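
As a rough picture of what automatic extraction produces, the sketch below derives a few of these fields from a Markdown document. Everything here is an assumption for illustration: the extract_metadata function, the field names, and the naive frequency-based keyword pick are not the RAG Wizard's actual extraction logic.

```python
import re
from collections import Counter


def extract_metadata(filename: str, text: str, top_k: int = 3) -> dict:
    """Derive simple metadata from Markdown text."""
    # Document structure: ATX-style headings ("# Title", "## Section", ...).
    headings = [
        match.group(1).strip()
        for match in re.finditer(r"^#{1,6}\s+(.+)$", text, re.MULTILINE)
    ]
    # Content analysis: most frequent words of four or more letters.
    words = re.findall(r"[a-z]{4,}", text.lower())
    keywords = [word for word, _ in Counter(words).most_common(top_k)]
    return {
        "filename": filename,
        "size_bytes": len(text.encode("utf-8")),  # file property
        "headings": headings,
        "keywords": keywords,
        "word_count": len(text.split()),
    }
```

A production extractor would typically remove stop words and use a language-aware tokenizer; the frequency count here only shows the shape of the output.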

Custom Metadata Fields

Add domain-specific metadata
Create custom taxonomies
Tag-based organization
Hierarchical categorization

Metadata Templates

Reusable metadata schemas
Industry-standard templates
Custom field definitions
Validation rules and constraints
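
A metadata template is essentially a list of field definitions plus rules that each document's metadata must satisfy. The sketch below shows one way to model that in Python. FieldRule, KB_TEMPLATE, and the sample fields (title, department, tags, priority) are hypothetical and serve only to make the idea concrete.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Optional


@dataclass(frozen=True)
class FieldRule:
    name: str
    type_: type
    required: bool = True
    check: Optional[Callable[[Any], bool]] = None  # extra constraint, if any


# Hypothetical template for an internal knowledge base.
KB_TEMPLATE = [
    FieldRule("title", str),
    FieldRule("department", str, check=lambda v: v in {"hr", "legal", "engineering"}),
    FieldRule("tags", list, required=False),
    FieldRule("priority", int, required=False, check=lambda v: 1 <= v <= 5),
]


def validate_metadata(metadata: dict, template: List[FieldRule]) -> List[str]:
    """Return human-readable errors; an empty list means the metadata is valid."""
    errors = []
    known = {rule.name for rule in template}
    for rule in template:
        if rule.name not in metadata:
            if rule.required:
                errors.append(f"missing required field: {rule.name}")
            continue
        value = metadata[rule.name]
        if not isinstance(value, rule.type_):
            errors.append(f"{rule.name}: expected {rule.type_.__name__}")
        elif rule.check is not None and not rule.check(value):
            errors.append(f"{rule.name}: value {value!r} violates a constraint")
    # Fields that the template does not define are reported, not dropped.
    for extra in sorted(set(metadata) - known):
        errors.append(f"unknown field: {extra}")
    return errors
```

Returning all errors at once, rather than stopping at the first, suits bulk metadata updates where many documents are fixed in one pass.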

Document Categorization

Automatic Categorization

Content-based grouping
Topic modeling and clustering
Similarity-based organization
Language and format grouping

Manual Organization

Folder-like structures
Tag-based systems
Priority and status indicators
Custom classification schemes

Validation and Quality Control

Content Validation

Format Validation

File integrity checks
Format compatibility verification
Encoding validation
Structure analysis

Quality Assessment

Content completeness scoring
Text quality evaluation
Information density analysis
Relevance scoring

Error Handling

Automatic error detection
Repair suggestions and options
Manual review workflows
Quarantine and resolution systems

Duplicate Detection

Automatic Detection

Content-based duplicate identification
Hash-based exact matching
Similarity-based near-duplicate detection
Cross-format duplicate recognition
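
The difference between exact and near-duplicate detection is easiest to see in code. The sketch below pairs SHA-256 hashing for exact matches with Jaccard similarity over three-word shingles for near-duplicates. The shingle size, the 0.8 default threshold, and the function names are assumptions chosen for illustration, not the product's documented settings.

```python
import hashlib
from itertools import combinations


def exact_fingerprint(text: str) -> str:
    """Hash-based exact matching: identical text gives an identical digest."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def shingles(text: str, k: int = 3) -> set:
    """Set of overlapping k-word sequences, lowercased."""
    words = text.lower().split()
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a: set, b: set) -> float:
    """Size of the intersection divided by size of the union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def find_duplicates(docs: dict, threshold: float = 0.8) -> list:
    """Return (name_a, name_b, kind) for each duplicate pair in `docs`.

    `docs` maps a document name to its extracted text.
    """
    results = []
    for a, b in combinations(sorted(docs), 2):
        if exact_fingerprint(docs[a]) == exact_fingerprint(docs[b]):
            results.append((a, b, "exact"))
        elif jaccard(shingles(docs[a]), shingles(docs[b])) >= threshold:
            results.append((a, b, "near"))
    return results
```

Because the comparison runs on extracted text rather than raw bytes, the same approach also covers cross-format duplicates, such as a .docx file and its PDF export. Comparing every pair is quadratic in the number of documents, so large collections usually need an indexing technique such as MinHash.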

Resolution Options

Merge duplicate documents
Keep best version based on quality
Manual review and selection
Custom resolution rules

Document Preview and Review

Content Preview

Rich Document Viewer

In-browser document preview
Highlighting of extracted content
Metadata display and editing
Structure visualization

Processing Results

Extraction quality indicators
Detected metadata display
Processing warnings and errors
Optimization suggestions

Bulk Review Mode

Overview Dashboard

Document status summary
Processing progress tracking
Quality metrics overview
Error and warning aggregation

Batch Operations

Select multiple documents for review
Apply bulk actions and settings
Mass approval workflows
Batch error resolution

Advanced Features

Version Control

Document Versioning

Track document updates and changes
Compare versions and differences
Roll back to previous versions
Change history and audit trails
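
Comparing two versions of a document's extracted text can be sketched with Python's standard difflib module. The compare_versions helper and its labels are hypothetical; the RAG Wizard's own comparison view may present differences in another way.

```python
import difflib


def compare_versions(old: str, new: str, name: str = "document") -> str:
    """Return a unified diff of two text versions, or "" if they are identical."""
    diff = difflib.unified_diff(
        old.splitlines(),
        new.splitlines(),
        fromfile=f"{name} (previous)",
        tofile=f"{name} (current)",
        lineterm="",
    )
    return "\n".join(diff)
```

Lines that start with "-" exist only in the previous version, and lines that start with "+" exist only in the current one.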

Update Management

Automatic update detection
Selective re-processing options
Incremental update handling
Version conflict resolution
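
Update detection and selective re-processing come down to comparing what was processed last time with what exists now. The sketch below assumes a stored manifest that maps each document name to a SHA-256 fingerprint; the manifest format and the plan_updates function are illustrative assumptions.

```python
import hashlib


def fingerprint(content: bytes) -> str:
    return hashlib.sha256(content).hexdigest()


def plan_updates(previous: dict, current: dict) -> dict:
    """Decide which documents need re-processing.

    `previous` maps name -> fingerprint recorded at the last run.
    `current` maps name -> raw bytes of the document as it exists now.
    """
    current_prints = {name: fingerprint(data) for name, data in current.items()}
    added = sorted(set(current_prints) - set(previous))
    removed = sorted(set(previous) - set(current_prints))
    changed = sorted(
        name
        for name, digest in current_prints.items()
        if name in previous and previous[name] != digest
    )
    return {
        "added": added,
        "changed": changed,
        "removed": removed,
        # Incremental handling: unchanged documents are left alone.
        "reprocess": sorted(added + changed),
    }
```

Only the documents under "reprocess" need to go through extraction again, which keeps updates to a large collection cheap.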

Integration Options

API Access

Programmatic document management
Webhook notifications for processing events
Batch processing APIs
Custom integration endpoints

Workflow Integration

Connect with existing document workflows
Automated processing pipelines
Event-driven processing triggers
Custom business logic integration

Best Practices

Document Preparation

Clean Source Documents: Remove unnecessary formatting and content
Consistent Structure: Use standardized document formats when possible
Clear Metadata: Provide descriptive filenames and metadata
Quality Content: Ensure documents contain relevant, high-quality information

Organization Strategy

Logical Grouping: Organize documents by topic, department, or use case
Consistent Naming: Use clear, descriptive naming conventions
Regular Review: Periodically review and update document collections
Version Management: Keep track of document versions and updates

Performance Optimization

Batch Processing: Upload related documents together for efficiency
Quality Filtering: Remove low-quality or irrelevant documents early
Size Management: Balance document size with processing requirements
Regular Maintenance: Clean up outdated or duplicate documents

Prerequisites for Next Step

Before proceeding to Document Processing & Chunking:

✅ At least one document successfully uploaded and processed
✅ Document validation completed without critical errors
✅ Metadata extraction and organization completed
✅ Quality assessment passed minimum thresholds
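
The checklist above can be expressed as a single readiness check. The sketch below is an illustration only: the DocumentStatus fields, the 0.0 to 1.0 quality scale, and the 0.5 default threshold are assumptions, since this page does not specify how the RAG Wizard scores quality.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class DocumentStatus:
    name: str
    processed: bool
    critical_errors: int
    metadata_complete: bool
    quality_score: float  # assumed to be in the range 0.0 to 1.0


def ready_for_chunking(
    docs: List[DocumentStatus], min_quality: float = 0.5
) -> Tuple[bool, List[str]]:
    """Return (ready, blockers); `blockers` explains every failed prerequisite."""
    blockers = []
    processed = [d for d in docs if d.processed]
    if not processed:
        blockers.append("no document has been uploaded and processed")
    for d in processed:
        if d.critical_errors:
            blockers.append(f"{d.name}: {d.critical_errors} critical validation error(s)")
        if not d.metadata_complete:
            blockers.append(f"{d.name}: metadata incomplete")
        if d.quality_score < min_quality:
            blockers.append(
                f"{d.name}: quality {d.quality_score:.2f} below {min_quality:.2f}"
            )
    return (not blockers, blockers)
```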

The documents uploaded in this step are processed and chunked in the next step, where you configure how content is divided and prepared for embedding.
