Step 2 - Test Embedding Search

Verify your embedding model produces good similarity scores before deployment

Purpose

Before deploying your RAG system, verify that your chosen embedding model produces good similarity scores for your typical queries. This testing step is critical because:

  • Catches poor configurations early before users experience bad results
  • Validates embedding model choice for your specific content domain
  • Provides baseline metrics for future benchmarking
  • Prevents costly rework after deployment

Entry Point: Document Processing tab → "Test Embedding Search" button

Prerequisites: Documents must be processed (Step 1 complete)

Expected Outcome: Confirmation that similarity scores are acceptable (0.7 or higher)

Location

Navigate to your RAG project and find the Document Processing tab. Click the "Test Embedding Search" button to open the testing interface.

UI Overview

The Embedding Search interface includes:

  1. Query Input Box - Enter your test queries
  2. Results Panel - Displays retrieved chunks with similarity scores
  3. Score Indicator - Visual representation of score quality
  4. Document Filter - Optional filter by specific documents

How Embedding Search Works

User enters query → System converts to vector → Searches chunks → Returns similarity scores
  1. Your query is converted to a vector using the selected embedding model
  2. The system searches all processed chunks for similar vectors
  3. Results are ranked by similarity score (0.0 to 1.0)
  4. Top results are displayed with their scores and source information
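
Under the hood this is vector similarity. As a minimal sketch (assuming the sentence-transformers package and the all-MiniLM-L6-v2 model mentioned later on this page; your project's configured model may differ):

# Minimal sketch of embedding search: cosine similarity between the
# query vector and every chunk vector.
from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer("all-MiniLM-L6-v2")

chunks = [
    "Electronics returns accepted within 30 days...",
    "Refund processing timeline...",
]
chunk_vecs = model.encode(chunks, normalize_embeddings=True)
query_vec = model.encode("What is the return policy?", normalize_embeddings=True)

# With unit-normalized vectors, cosine similarity is just a dot product.
scores = chunk_vecs @ query_vec

# Rank chunks by similarity score, highest first.
for idx in np.argsort(scores)[::-1]:
    print(f"[Score: {scores[idx]:.2f}] {chunks[idx]}")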

Understanding Similarity Scores

Similarity scores (0.0 to 1.0) indicate how well each chunk matches your query:

Score Range   Quality     Action Required
──────────────────────────────────────────────────────────
0.8 - 1.0     Excellent   Ready to proceed
0.7 - 0.8     Good        Acceptable for most use cases
0.5 - 0.7     Fair        Consider a different embedding model
Below 0.5     Poor        Change the embedding model
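
If you script your evaluation, a small helper keeps these bands consistent. The thresholds below come straight from the table above:

def score_quality(score: float) -> str:
    """Map a similarity score to the quality bands in the table above."""
    if score >= 0.8:
        return "Excellent - ready to proceed"
    if score >= 0.7:
        return "Good - acceptable for most use cases"
    if score >= 0.5:
        return "Fair - consider a different embedding model"
    return "Poor - change the embedding model"

print(score_quality(0.82))  # Excellent - ready to proceed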

Score Interpretation

Score: 0.8 - 1.0 (Excellent)

The chunk is highly relevant to your query. API responses using this configuration will return accurate, on-topic results.

Score: 0.7 - 0.8 (Good)

The chunk is moderately relevant. Acceptable for most production use cases.

Score: 0.5 - 0.7 (Fair)

The chunk has some relevance but may not be what users expect. Consider adjusting your embedding model or chunking settings.

Score: Below 0.5 (Poor)

The chunk is not relevant. Your RAG system will return poor-quality responses with this configuration. Change the embedding model before proceeding.

Testing Workflow

Step 1: Prepare Test Queries

Create 5-10 representative queries your users will ask:

Query Type        Description                          Example
────────────────────────────────────────────────────────────────────────────────
Simple Factual    Direct question with single answer   "What is the return policy?"
Multi-Part        Question with multiple components    "What are the return policy and refund timeline?"
Domain-Specific   Uses industry terminology            "What is the SLA for enterprise tier?"
Edge Case         Unusual or boundary query            "Can I return opened software?"

Tips for Test Queries:

  • Include queries from each category
  • Use actual user questions if available
  • Cover the full range of expected query types

Step 2: Run Tests

For each query (a scripted version of this workflow follows the list):

  1. Enter query in the Embedding Search input box
  2. Press Enter or click Search
  3. Review top 5 results for relevance
  4. Note similarity scores for each result
  5. Mark results as relevant or irrelevant
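
The same loop can be scripted if your project exposes a search client. In this sketch, search(query, top_k) is a hypothetical stand-in that returns (score, chunk_text, source) tuples; substitute whatever client your deployment provides:

# Scripted version of the manual workflow above.
# NOTE: `search` is a hypothetical stand-in for your search client.
test_queries = [
    "What is the return policy?",                        # simple factual
    "What are the return policy and refund timeline?",   # multi-part
    "What is the SLA for enterprise tier?",              # domain-specific
    "Can I return opened software?",                     # edge case
]

all_scores = {}
for query in test_queries:
    results = search(query, top_k=5)  # -> [(score, chunk_text, source), ...]
    all_scores[query] = [score for score, _, _ in results]
    print(f"\nQuery: {query}")
    for score, chunk, source in results:
        print(f"  [Score: {score:.2f}] {chunk[:60]}... ({source})")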

Step 3: Evaluate Results

Ask yourself these questions:

Relevance Check:

  • Are the top 3 results actually relevant to the query?
  • Do similarity scores match your intuition about relevance?
  • Are important documents appearing in results?

Score Distribution:

  • What is the average similarity score across all queries?
  • Are scores consistently above 0.7?
  • Are there any queries with all scores below 0.5?

Coverage Check:

  • Do results cover all your key documents?
  • Are critical documents appearing for relevant queries?
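
If you collected scores with the sketch in Step 2, the score-distribution checks can be automated (the all_scores dictionary comes from that sketch):

# Score-distribution checks over the per-query scores collected above.
flat = [s for scores in all_scores.values() for s in scores]
print(f"Average similarity score: {sum(flat) / len(flat):.2f}")

# Queries where every result landed in the 'Poor' band (below 0.5).
dead = [q for q, scores in all_scores.items() if all(s < 0.5 for s in scores)]
print(f"Queries with all scores below 0.5: {dead or 'none'}")

# Queries whose best result is still under the 0.7 target.
weak = [q for q, scores in all_scores.items() if max(scores) < 0.7]
print(f"Queries with no result at 0.7 or above: {weak or 'none'}")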

Example Test Session

Query: "What is the return policy for electronics?"

Results:
─────────────────────────────────────────────────
1. [Score: 0.89] "Electronics returns accepted within 30 days..."
   Source: policy.pdf, Chunk 3
   
2. [Score: 0.85] "Return policy overview: All products..."
   Source: policy.pdf, Chunk 1
   
3. [Score: 0.82] "Electronics category specific rules..."
   Source: electronics-faq.md, Chunk 2
   
4. [Score: 0.78] "Refund processing timeline..."
   Source: policy.pdf, Chunk 5
   
5. [Score: 0.75] "Exception items: Software, DVDs..."
   Source: returns.md, Chunk 4

Assessment: ✓ PASS
- All top 5 results are relevant
- Average score: 0.82 (Excellent)
- Key documents appearing in results

Tips for Better Results

If Scores Are Too Low

Option 1: Try a Larger Embedding Model

Current Model                    Upgrade To                       Expected Improvement
──────────────────────────────────────────────────────────────────────────────────────
all-MiniLM-L6-v2 (384D)          text-embedding-3-small (1536D)   +0.1-0.15 scores
text-embedding-3-small (1536D)   text-embedding-3-large (3072D)   +0.05-0.1 scores

Expected API Improvement:

  • 10-20% increase in retrieval accuracy
  • Better handling of complex, multi-part queries
  • More relevant similarity_score rankings
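
To preview whether an upgrade helps before changing your project configuration, you can score one of your query-chunk pairs under both models. This sketch assumes the sentence-transformers package for the local model and the official openai package for the hosted one; adapt it to however your deployment accesses embeddings:

# Compare one query/chunk pair under the current and candidate models.
import numpy as np
from openai import OpenAI
from sentence_transformers import SentenceTransformer

query = "What is the SLA for enterprise tier?"
chunk = "Enterprise tier includes a 99.9% uptime service level agreement..."

# Current model: all-MiniLM-L6-v2 (384 dimensions).
local = SentenceTransformer("all-MiniLM-L6-v2")
q, c = local.encode([query, chunk], normalize_embeddings=True)
print(f"all-MiniLM-L6-v2:       {float(q @ c):.2f}")

# Candidate model: text-embedding-3-small (1536 dimensions).
client = OpenAI()  # reads OPENAI_API_KEY from the environment
resp = client.embeddings.create(model="text-embedding-3-small",
                                input=[query, chunk])
q2, c2 = (np.array(d.embedding) for d in resp.data)
q2, c2 = q2 / np.linalg.norm(q2), c2 / np.linalg.norm(c2)
print(f"text-embedding-3-small: {float(q2 @ c2):.2f}")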

Option 2: Adjust Chunk Size

Chunk Size                    Effect               Recommendation
──────────────────────────────────────────────────────────────────
Too small (< 256 tokens)      May lose context     Increase to 512
Too large (> 1024 tokens)     Diluted embeddings   Decrease to 768
Sweet spot (512-768 tokens)   Balanced             Use for most cases
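
To experiment with chunk size outside the UI, a rough word-based chunker is enough to see the effect on scores (real token counts depend on your embedding model's tokenizer; this is an approximation):

def chunk_words(text: str, chunk_size: int = 512, overlap: int = 64):
    """Split text into overlapping word-based chunks (approximates tokens)."""
    words = text.split()
    step = chunk_size - overlap
    return [" ".join(words[i:i + chunk_size])
            for i in range(0, max(len(words) - overlap, 1), step)]

# Re-embed and re-test the same queries at, say, 512 vs 1024 words
# to see how chunk size moves your similarity scores.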

Option 3: Enable BM25 Hybrid Search

Enable BM25 when your content includes:

  • Technical documents with proper nouns
  • Code repositories with function names
  • Legal documents with specific terms

Benefits:

  • Combines keyword + semantic matching
  • Helps when exact terms matter
  • Improves results for terminology-heavy content
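
Conceptually, hybrid search blends a keyword score with the semantic score. Here is a minimal sketch using the rank_bm25 package; the 50/50 blend weight and the placeholder semantic scores are assumptions, not your pipeline's actual values:

# Blend BM25 keyword scores with embedding similarity scores.
import numpy as np
from rank_bm25 import BM25Okapi

chunks = [
    "Electronics returns accepted within 30 days of purchase.",
    "The process_refund endpoint issues refunds for returned orders.",
]
query = "process_refund endpoint"

bm25 = BM25Okapi([c.lower().split() for c in chunks])
keyword = np.array(bm25.get_scores(query.lower().split()))
if keyword.max() > 0:
    keyword = keyword / keyword.max()  # scale into 0-1 like cosine scores

semantic = np.array([0.48, 0.61])  # placeholder: from your embedding search
hybrid = 0.5 * keyword + 0.5 * semantic
print(hybrid)  # the exact-term match lifts the second chunk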

If Results Are Inconsistent

Problem: Some queries get good scores, others get poor scores

Solutions:

  1. Check if poor-scoring queries use different terminology
  2. Verify all relevant documents are processed
  3. Consider using a more general embedding model
  4. Add synonyms or alternative phrasings to documents

Expected API Behavior

The similarity scores you see in Embedding Search directly translate to API responses:

API Response Example

{
  "query": "What is the return policy?",
  "results": [
    {
      "content": "Returns accepted within 30 days...",
      "similarity_score": 0.87,
      "source": "policy.pdf"
    },
    {
      "content": "Return policy overview...",
      "similarity_score": 0.82,
      "source": "policy.pdf"
    }
  ]
}

Key Points:

  • The similarity_score field in API responses matches the scores shown in Embedding Search
  • The same ranking algorithm is used in both places
  • What you see in testing is what the API returns
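
A quick scripted spot-check against the API might look like this; the endpoint path and payload shape here are illustrative assumptions, so consult your deployment's API reference for the actual route:

# Illustrative only: endpoint and payload shape are assumptions.
import requests

resp = requests.post(
    "https://your-deployment.example.com/api/search",
    json={"query": "What is the return policy?", "top_k": 5},
)
resp.raise_for_status()

for result in resp.json()["results"]:
    # These scores should match what Embedding Search showed in testing.
    print(f"[{result['similarity_score']:.2f}] {result['source']}")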

Decision Criteria

When to Proceed to Step 3

Proceed to Pipeline Configuration when:

  • ✅ Average similarity score > 0.7 across all test queries
  • ✅ Top 3 results are relevant for most queries
  • ✅ Key documents appear in results for relevant queries
  • ✅ No queries have all scores below 0.5
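
If you scripted the earlier steps, the quantitative criteria can be checked automatically; relevance and coverage remain manual judgments:

def ready_for_step_3(all_scores: dict) -> bool:
    """Apply the quantitative proceed criteria to per-query score lists."""
    flat = [s for scores in all_scores.values() for s in scores]
    average_ok = sum(flat) / len(flat) > 0.7
    no_dead_queries = all(max(scores) >= 0.5 for scores in all_scores.values())
    # Top-3 relevance and key-document coverage still need human review.
    return average_ok and no_dead_queries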

When to Iterate

Stay in Step 2 and iterate when:

  • ❌ Average similarity score < 0.5
  • ❌ Top results frequently irrelevant
  • ❌ Important documents missing from results
  • ❌ Scores vary wildly between queries

Common Issues and Solutions

Issue                           Possible Cause           Solution
───────────────────────────────────────────────────────────────────────────
All scores below 0.5            Wrong embedding model    Upgrade to a larger model
Irrelevant top results          Chunk size too large     Reduce to 512 tokens
Missing key documents           Document not processed   Check processing status
Inconsistent scores             Mixed content types      Enable BM25 hybrid search
Good scores but wrong answers   Chunk too small          Increase chunk size

Next Step: Configure RAG Pipeline

Once similarity scores are acceptable, proceed to Step 3: Configure RAG Pipeline to test how embedding settings work together with retrieval configuration.

What to Bring to Step 3

  1. Your test queries from this step
  2. Baseline scores for comparison
  3. Any notes on problematic queries
  4. Selected embedding model confirmation

Tips for Success

  1. Test with Real Queries: Use actual user questions when available
  2. Document Your Baseline: Record scores for future comparison
  3. Test Edge Cases: Include unusual queries in your test set
  4. Iterate Quickly: Don't hesitate to try different models
  5. Trust the Scores: Low scores predict poor API performance