Document Processing & Chunking
Configure text processing pipeline and chunking strategies for optimal retrieval
The Document Processing & Chunking step configures how your uploaded documents are processed, divided into chunks, and prepared for embedding. This critical step directly impacts retrieval quality and system performance.
Chunking Strategy Overview
Chunking divides large documents into smaller, semantically meaningful pieces that can be efficiently processed and retrieved. The choice of chunking strategy significantly affects your RAG system's performance and accuracy.
Why Chunking Matters
- Embedding Efficiency: Embedding models have input token limits (typically 512-8192 tokens), so chunks must fit within them
- Retrieval Precision: Smaller chunks provide more focused context
- Response Quality: Well-chunked content produces more relevant responses
- Performance: Optimized chunks improve search and retrieval speed
Chunking Methods
Fixed-Size Chunking
Description: Divides text into chunks of predetermined size with configurable overlap.
Best For:
- Consistent processing requirements
- Uniform document types
- High-volume processing needs
- Predictable performance characteristics
Configuration Options:
- Chunk Size: 100-8192 tokens (recommended: 512-1024)
- Overlap Percentage: 0-50% (recommended: 10-20%)
- Boundary Respect: Sentence or paragraph boundaries
- Minimum Chunk Size: Prevent overly small chunks
Example Use Cases:
- Large PDF documents
- Legal documents with uniform structure
- Technical manuals and documentation
- Academic papers and research documents
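The fixed-size strategy above can be sketched in a few lines. This is a minimal illustration that uses whitespace-split words as a stand-in for real tokenizer tokens; a production pipeline would count tokens with the embedding model's own tokenizer.

```python
def chunk_fixed(tokens, chunk_size=512, overlap_pct=0.15):
    """Split a token list into fixed-size chunks with fractional overlap.

    The stride between chunk starts is chunk_size * (1 - overlap_pct),
    so consecutive chunks share roughly overlap_pct of their content.
    """
    step = max(1, int(chunk_size * (1 - overlap_pct)))
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the final window already covers the tail
    return chunks
```

With `chunk_size=100` and `overlap_pct=0.2`, each chunk starts 80 tokens after the previous one, so adjacent chunks share their last/first 20 tokens.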
Semantic Chunking
Description: Divides text based on semantic boundaries and topic coherence.
Best For:
- Documents with clear topic transitions
- Content requiring contextual integrity
- Mixed content types
- Use cases where retrieval quality matters more than processing throughput
Configuration Options:
- Semantic Threshold: Similarity cutoff for topic boundaries
- Maximum Chunk Size: Upper limit for chunk length
- Minimum Chunk Size: Lower limit to ensure content adequacy
- Topic Modeling: Advanced semantic boundary detection
Example Use Cases:
- Blog posts and articles
- Multi-topic research papers
- Meeting transcripts and conversations
- Educational content with distinct sections
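A semantic chunker splits where adjacent sentences stop resembling each other. The sketch below uses bag-of-words cosine similarity as a deliberately simple stand-in for real sentence embeddings; the threshold value and the word-count vectors are illustrative assumptions, not a production similarity model.

```python
import math
from collections import Counter

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def semantic_chunks(sentences, threshold=0.2):
    """Start a new chunk whenever adjacent sentences fall below the threshold."""
    chunks, current = [], [sentences[0]]
    for prev, sent in zip(sentences, sentences[1:]):
        sim = cosine(Counter(prev.lower().split()), Counter(sent.lower().split()))
        if sim < threshold:
            chunks.append(" ".join(current))
            current = []
        current.append(sent)
    chunks.append(" ".join(current))
    return chunks
```

Swapping the `Counter` vectors for embedding-model vectors gives the same control flow with much better boundary detection.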
Recursive Chunking
Description: Hierarchical chunking that creates nested chunks at multiple levels.
Best For:
- Complex document structures
- Hierarchical content organization
- Multi-level information retrieval
- Comprehensive context preservation
Configuration Options:
- Hierarchy Levels: Number of chunking levels (typically 2-4)
- Size Ratios: Relative sizes for each hierarchy level
- Overlap Strategy: How overlap applies across levels
- Parent-Child Linking: Relationship preservation between levels
Example Use Cases:
- Technical specifications with nested sections
- Books and long-form content
- Policy documents with hierarchical structure
- Complex research papers with multiple subsections
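Recursive chunking can be sketched as nested fixed-size passes with explicit parent-child links. This two-level version uses word counts as a token stand-in and hypothetical `id`/`parent` field names; real systems typically store these links in the vector index metadata.

```python
def recursive_chunks(text, sizes=(400, 100)):
    """Build a two-level hierarchy: parents of ~sizes[0] words,
    each split into children of ~sizes[1] words with a back-reference."""
    words = text.split()
    nodes = []
    for p_start in range(0, len(words), sizes[0]):
        parent_words = words[p_start:p_start + sizes[0]]
        parent = {"id": f"p{p_start}", "text": " ".join(parent_words), "children": []}
        for c_start in range(0, len(parent_words), sizes[1]):
            child_words = parent_words[c_start:c_start + sizes[1]]
            parent["children"].append({
                "id": f"p{p_start}-c{c_start}",
                "parent": parent["id"],
                "text": " ".join(child_words),
            })
        nodes.append(parent)
    return nodes
```

At query time you can match against the small children for precision, then hand the linked parent to the generator for fuller context.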
Document-Based Chunking
Description: Treats entire documents or major sections as chunks.
Best For:
- Short documents that shouldn't be divided
- Documents with strong internal coherence
- Reference materials requiring complete context
- High-level document classification needs
Configuration Options:
- Document Grouping: Combine related small documents
- Section Detection: Automatic major section identification
- Metadata Preservation: Enhanced metadata at document level
- Quality Filtering: Document-level quality assessment
Example Use Cases:
- FAQ entries and knowledge base articles
- Product descriptions and specifications
- Short reports and summaries
- Individual emails or messages
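Document grouping can be sketched as greedy bin packing: whole documents are combined into a chunk until the next one would exceed the size limit, and no document is ever split. Whitespace word counts stand in for real token counts here.

```python
def group_documents(docs, max_tokens=512):
    """Pack whole short documents into combined chunks without splitting any."""
    groups, current, current_len = [], [], 0
    for doc in docs:
        n = len(doc.split())  # crude token count
        if current and current_len + n > max_tokens:
            groups.append(current)       # close the full group
            current, current_len = [], 0
        current.append(doc)
        current_len += n
    if current:
        groups.append(current)
    return groups
```

A document larger than `max_tokens` simply becomes a group of its own; you would route such outliers to one of the splitting strategies above.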
Advanced Processing Options
Sentence Boundary Respect
Enable When:
- Content quality is more important than strict size limits
- Working with well-structured prose
- Need to maintain grammatical coherence
Impact:
- Prevents mid-sentence breaks
- May result in variable chunk sizes
- Improves readability and context quality
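Sentence-boundary respect trades exact sizes for grammatical coherence, as in this sketch: whole sentences are packed greedily up to a target size and never broken mid-sentence. The regex splitter is a naive assumption; real pipelines use a proper sentence segmenter.

```python
import re

def sentence_chunks(text, target_tokens=100):
    """Greedily pack whole sentences up to a target size; never break mid-sentence."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current, count = [], [], 0
    for sent in sentences:
        n = len(sent.split())
        if current and count + n > target_tokens:
            chunks.append(" ".join(current))  # chunk sizes vary, boundaries stay clean
            current, count = [], 0
        current.append(sent)
        count += n
    if current:
        chunks.append(" ".join(current))
    return chunks
```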
Context Coherence Maintenance
Enable When:
- Documents contain interconnected concepts
- Cross-references are important
- Topic continuity matters for understanding
Features:
- Preserves conceptual relationships
- Maintains narrative flow
- Handles transitional phrases and connections
Section Header Preservation
Enable When:
- Documents have clear hierarchical structure
- Headers provide important context
- Navigation and organization are priorities
Benefits:
- Maintains document structure information
- Improves chunk categorization
- Enables hierarchical retrieval strategies
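One common way to preserve header context is to carry the most recent heading alongside each chunk. The sketch below assumes markdown-style `#` headings and paragraph breaks; the `header`/`text` field names are illustrative.

```python
import re

def chunks_with_headers(markdown_text):
    """Split on blank lines and attach the current section header to each chunk."""
    chunks, header = [], ""
    for block in markdown_text.split("\n\n"):
        block = block.strip()
        if not block:
            continue
        m = re.match(r"#+\s+(.*)", block)
        if m:
            header = m.group(1)  # remember the heading, don't emit it as a chunk
            continue
        chunks.append({"header": header, "text": block})
    return chunks
```

Prepending `header` to the chunk text before embedding often improves retrieval for queries that mention the section topic rather than the body wording.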
Cross-Reference Handling
Enable When:
- Documents contain internal references
- Citations and links are important
- Reference resolution is needed
Capabilities:
- Link resolution and context inclusion
- Citation tracking and management
- Reference validation and updating
Code Block Handling
Enable When:
- Technical documents with code examples
- API documentation
- Programming tutorials and guides
Features:
- Syntax-aware chunking
- Language detection and highlighting
- Separate processing for code vs. prose
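Separate handling of code and prose starts with detecting the boundaries. This minimal sketch splits on markdown-style triple-backtick fences so each segment can be routed to its own chunker (e.g., keep code blocks intact, apply sentence-aware chunking to prose).

```python
def split_code_and_prose(text):
    """Separate fenced code blocks from prose so each can be chunked on its own."""
    segments, buf, in_code = [], [], False
    for line in text.splitlines():
        if line.strip().startswith("```"):
            if buf:
                segments.append(("code" if in_code else "prose", "\n".join(buf)))
                buf = []
            in_code = not in_code  # toggle state at each fence
        else:
            buf.append(line)
    if buf:
        segments.append(("code" if in_code else "prose", "\n".join(buf)))
    return segments
```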
Content Filtering Options
Low Information Content Filtering
Skip chunks containing:
- Repetitive boilerplate text
- Navigation elements and menus
- Copyright notices and disclaimers
- Formatting artifacts and noise
Metadata-Only Chunk Filtering
Skip chunks that contain:
- Only file metadata
- Pure formatting information
- Empty or whitespace-only content
- System-generated content without value
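Both filters above reduce to a per-chunk predicate. This sketch combines an empty/too-short check with a boilerplate-marker check; the marker list is a small illustrative sample, not an exhaustive production ruleset.

```python
BOILERPLATE_MARKERS = ("all rights reserved", "terms of service", "privacy policy")

def is_low_value(chunk, min_tokens=10):
    """Flag chunks that are empty, too short, or dominated by boilerplate markers."""
    text = chunk.strip()
    if not text:
        return True  # whitespace-only
    if len(text.split()) < min_tokens:
        return True  # too short to carry meaning
    lowered = text.lower()
    return any(marker in lowered for marker in BOILERPLATE_MARKERS)
```

Chunks for which `is_low_value` returns `True` are dropped before embedding, which also reduces index size and noise in retrieval results.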
Non-English Content Handling
Options:
- Skip: Exclude non-English chunks entirely
- Process: Include with language identification
- Translate: Auto-translate before processing
- Flag: Mark for manual review
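These options can be wired up as a single dispatch function. The detector below is a deliberately naive ASCII-letter-ratio heuristic standing in for a real language-identification model, and the `action` values mirror the option names above; all field names are illustrative.

```python
def handle_language(chunk, mode="flag", ascii_threshold=0.9):
    """Route a chunk based on a naive English check.

    mode is one of "skip", "process", "translate", "flag".
    A production system would use a language-ID model here.
    """
    letters = [c for c in chunk if c.isalpha()]
    ratio = sum(c.isascii() for c in letters) / len(letters) if letters else 1.0
    if ratio >= ascii_threshold:
        return {"text": chunk, "action": "process"}
    if mode == "skip":
        return None  # exclude non-English chunks entirely
    return {"text": chunk, "action": mode}  # downstream handles translate/flag
```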
Configuration Best Practices
Choosing Chunk Size
Small Chunks (100-300 tokens)
- Pros: Precise retrieval, specific context
- Cons: May lose broader context, requires more chunks
- Best for: FAQ systems, specific fact retrieval
Medium Chunks (300-800 tokens)
- Pros: Balance of precision and context
- Cons: May still fragment complex topics
- Best for: General knowledge bases, mixed content
Large Chunks (800-2048 tokens)
- Pros: Rich context, fewer total chunks
- Cons: Less precise retrieval, higher processing costs
- Best for: Complex analysis, long-form content
Overlap Configuration
Low Overlap (5-10%)
- Minimal redundancy
- Efficient storage
- Risk of losing context at boundaries
Medium Overlap (10-20%)
- Good balance of efficiency and context
- Recommended for most use cases
- Handles boundary issues effectively
High Overlap (20-50%)
- Maximum context preservation
- Higher storage requirements
- Best for critical context preservation
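The storage cost of overlap is easy to estimate: with chunk size S and overlap fraction o, chunk starts are S(1-o) tokens apart, so a document of N tokens yields roughly ceil((N - S) / (S(1-o))) + 1 chunks. A quick sketch:

```python
import math

def estimated_chunk_count(n_tokens, chunk_size, overlap_pct):
    """Estimate how many chunks a document yields for a given size/overlap setting."""
    if n_tokens <= chunk_size:
        return 1
    stride = chunk_size * (1 - overlap_pct)  # distance between chunk starts
    return math.ceil((n_tokens - chunk_size) / stride) + 1

for pct in (0.05, 0.15, 0.30):
    print(f"{pct:.0%} overlap:", estimated_chunk_count(50_000, 512, pct), "chunks")
```

For a 50k-token document with 512-token chunks, moving from 5% to 30% overlap grows the index from about 103 to about 140 chunks, a direct storage and search-time cost to weigh against boundary-context quality.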
Quality Thresholds
Minimum Chunk Length
- Set based on meaningful content requirements
- Typically 50-100 tokens minimum
- Prevents noise from very short chunks
Maximum Chunk Length
- Model-dependent (respect embedding limits)
- Consider processing performance
- Balance with retrieval precision needs
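Both thresholds combine into a simple length-band predicate applied after chunking. Whitespace word counts stand in for real tokenizer counts here; in practice you would count with the embedding model's tokenizer so the upper bound actually respects its limit.

```python
def passes_quality(chunk, min_tokens=50, max_tokens=2048):
    """Keep only chunks inside the configured length band."""
    n = len(chunk.split())  # crude token count; use the model tokenizer in production
    return min_tokens <= n <= max_tokens
```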
Validation and Testing
Chunk Quality Assessment
Automatic Validation:
- Content completeness checking
- Semantic coherence scoring
- Boundary quality evaluation
- Overlap effectiveness analysis
Manual Review Tools:
- Chunk preview and inspection
- Boundary adjustment capabilities
- Quality scoring and feedback
- Batch review and approval workflows
Processing Pipeline Testing
Sample Processing:
- Test configurations on sample documents
- Performance benchmarking
- Quality metric evaluation
- A/B testing of different strategies
Iterative Optimization:
- Performance monitoring and adjustment
- Quality feedback incorporation
- Configuration refinement tools
- Automated optimization suggestions
Performance Considerations
Processing Speed
Factors Affecting Speed:
- Document size and complexity
- Chunking method selection
- Processing option complexity
- Hardware and resource allocation
Optimization Strategies:
- Batch processing for efficiency
- Parallel processing where possible
- Caching of intermediate results
- Progressive processing for large collections
Storage Efficiency
Chunk Storage Optimization:
- Deduplication of similar chunks
- Compression of text content
- Efficient metadata storage
- Hierarchical storage management
Retrieval Performance
Chunk Design for Retrieval:
- Optimal chunk size for embedding models
- Effective overlap for context preservation
- Quality filtering for relevant results
- Indexing strategy alignment
Prerequisites for Next Step
Before proceeding to Pipeline Configuration:
- ✅ Chunking strategy selected and configured
- ✅ Processing options configured appropriately
- ✅ Sample processing completed successfully
- ✅ Chunk quality validation passed
- ✅ Performance requirements satisfied
The chunks created in this step will be embedded and indexed in the Pipeline Configuration phase, where you'll set up the retrieval and generation components of your RAG system.