Step 2 - Test Embedding Search

Verify your embedding model produces good similarity scores before deployment

Purpose

Before deploying your RAG system, verify that your chosen embedding model produces good similarity scores for your typical queries. This testing step is critical because:

  • Catches poor configurations early before users experience bad results
  • Validates embedding model choice for your specific content domain
  • Provides baseline metrics for future benchmarking
  • Prevents costly rework after deployment

Entry Point: Document Processing tab → "Test Embedding Search" button

Prerequisites: Documents must be processed (Step 1 complete)

Expected Outcome: Confirmation that similarity scores are acceptable (0.7 or higher)

Location

Navigate to your RAG project and find the Document Processing tab. Click the "Test Embedding Search" button to open the testing interface.

UI Overview

The Embedding Search interface includes:

  1. Query Input Box - Enter your test queries
  2. Results Panel - Displays retrieved chunks with similarity scores
  3. Score Indicator - Visual representation of score quality
  4. Document Filter - Optional filter by specific documents

How Embedding Search Works

User enters query → System converts to vector → Searches chunks → Returns similarity scores
  1. Your query is converted to a vector using the selected embedding model
  2. The system searches all processed chunks for similar vectors
  3. Results are ranked by similarity score (0.0 to 1.0)
  4. Top results are displayed with their scores and source information
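
Under the hood this is vector similarity. As a minimal sketch (assuming the sentence-transformers package and the all-MiniLM-L6-v2 model mentioned later on this page; your project's configured model may differ):

# Minimal sketch of embedding search: cosine similarity between the
# query vector and every chunk vector.
from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer("all-MiniLM-L6-v2")

chunks = [
    "Electronics returns accepted within 30 days...",
    "Refund processing timeline...",
]
chunk_vecs = model.encode(chunks, normalize_embeddings=True)
query_vec = model.encode("What is the return policy?", normalize_embeddings=True)

# With unit-normalized vectors, cosine similarity is just a dot product.
scores = chunk_vecs @ query_vec

# Rank chunks by similarity score, highest first.
for idx in np.argsort(scores)[::-1]:
    print(f"[Score: {scores[idx]:.2f}] {chunks[idx]}")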

Understanding Similarity Scores

Similarity scores (0.0 to 1.0) indicate how well each chunk matches your query:

Score Range   Quality     Action Required
──────────────────────────────────────────────────────────
0.8 - 1.0     Excellent   Ready to proceed
0.7 - 0.8     Good        Acceptable for most use cases
0.5 - 0.7     Fair        Consider a different embedding model
Below 0.5     Poor        Change the embedding model
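
If you script your evaluation, a small helper keeps these bands consistent. The thresholds below come straight from the table above:

def score_quality(score: float) -> str:
    """Map a similarity score to the quality bands in the table above."""
    if score >= 0.8:
        return "Excellent - ready to proceed"
    if score >= 0.7:
        return "Good - acceptable for most use cases"
    if score >= 0.5:
        return "Fair - consider a different embedding model"
    return "Poor - change the embedding model"

print(score_quality(0.82))  # Excellent - ready to proceed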

Score Interpretation

Score: 0.8 - 1.0 (Excellent)

The chunk is highly relevant to your query. API responses using this configuration will return accurate, on-topic results.

Score: 0.7 - 0.8 (Good)

The chunk is moderately relevant. Acceptable for most production use cases.

Score: 0.5 - 0.7 (Fair)

The chunk has some relevance but may not be what users expect. Consider adjusting your embedding model or chunking settings.

Score: Below 0.5 (Poor)

The chunk is not relevant. Your RAG system will return poor-quality responses with this configuration. Change the embedding model before proceeding.

Testing Workflow

Step 1: Prepare Test Queries

Create 5-10 representative queries your users will ask:

Query Type        Description                          Example
────────────────────────────────────────────────────────────────────────────────
Simple Factual    Direct question with single answer   "What is the return policy?"
Multi-Part        Question with multiple components    "What are the return policy and refund timeline?"
Domain-Specific   Uses industry terminology            "What is the SLA for enterprise tier?"
Edge Case         Unusual or boundary query            "Can I return opened software?"

Tips for Test Queries:

  • Include queries from each category
  • Use actual user questions if available
  • Cover the full range of expected query types

Step 2: Run Tests

For each query (a scripted version of this workflow follows the list):

  1. Enter query in the Embedding Search input box
  2. Press Enter or click Search
  3. Review top 5 results for relevance
  4. Note similarity scores for each result
  5. Mark results as relevant or irrelevant
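
The same loop can be scripted if your project exposes a search client. In this sketch, search(query, top_k) is a hypothetical stand-in that returns (score, chunk_text, source) tuples; substitute whatever client your deployment provides:

# Scripted version of the manual workflow above.
# NOTE: `search` is a hypothetical stand-in for your search client.
test_queries = [
    "What is the return policy?",                        # simple factual
    "What are the return policy and refund timeline?",   # multi-part
    "What is the SLA for enterprise tier?",              # domain-specific
    "Can I return opened software?",                     # edge case
]

all_scores = {}
for query in test_queries:
    results = search(query, top_k=5)  # -> [(score, chunk_text, source), ...]
    all_scores[query] = [score for score, _, _ in results]
    print(f"\nQuery: {query}")
    for score, chunk, source in results:
        print(f"  [Score: {score:.2f}] {chunk[:60]}... ({source})")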

Step 3: Evaluate Results

Ask yourself these questions:

Relevance Check:

  • Are the top 3 results actually relevant to the query?
  • Do similarity scores match your intuition about relevance?
  • Are important documents appearing in results?

Score Distribution:

  • What is the average similarity score across all queries?
  • Are scores consistently above 0.7?
  • Are there any queries with all scores below 0.5?

Coverage Check:

  • Do results cover all your key documents?
  • Are critical documents appearing for relevant queries?
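
If you collected scores with the sketch in Step 2, the score-distribution checks can be automated (the all_scores dictionary comes from that sketch):

# Score-distribution checks over the per-query scores collected above.
flat = [s for scores in all_scores.values() for s in scores]
print(f"Average similarity score: {sum(flat) / len(flat):.2f}")

# Queries where every result landed in the 'Poor' band (below 0.5).
dead = [q for q, scores in all_scores.items() if all(s < 0.5 for s in scores)]
print(f"Queries with all scores below 0.5: {dead or 'none'}")

# Queries whose best result is still under the 0.7 target.
weak = [q for q, scores in all_scores.items() if max(scores) < 0.7]
print(f"Queries with no result at 0.7 or above: {weak or 'none'}")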

Example Test Session

Query: "What is the return policy for electronics?"

Results:
─────────────────────────────────────────────────
1. [Score: 0.89] "Electronics returns accepted within 30 days..."
   Source: policy.pdf, Chunk 3
   
2. [Score: 0.85] "Return policy overview: All products..."
   Source: policy.pdf, Chunk 1
   
3. [Score: 0.82] "Electronics category specific rules..."
   Source: electronics-faq.md, Chunk 2
   
4. [Score: 0.78] "Refund processing timeline..."
   Source: policy.pdf, Chunk 5
   
5. [Score: 0.75] "Exception items: Software, DVDs..."
   Source: returns.md, Chunk 4

Assessment: ✓ PASS
- All top 5 results are relevant
- Average score: 0.82 (Excellent)
- Key documents appearing in results

Tips for Better Results

If Scores Are Too Low

Option 1: Try a Larger Embedding Model

Current Model                    Upgrade To                       Expected Improvement
──────────────────────────────────────────────────────────────────────────────────────
all-MiniLM-L6-v2 (384D)          text-embedding-3-small (1536D)   +0.1-0.15 scores
text-embedding-3-small (1536D)   text-embedding-3-large (3072D)   +0.05-0.1 scores

Expected API Improvement:

  • 10-20% increase in retrieval accuracy
  • Better handling of complex, multi-part queries
  • More relevant similarity_score rankings
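
To preview whether an upgrade helps before changing your project configuration, you can score one of your query-chunk pairs under both models. This sketch assumes the sentence-transformers package for the local model and the official openai package for the hosted one; adapt it to however your deployment accesses embeddings:

# Compare one query/chunk pair under the current and candidate models.
import numpy as np
from openai import OpenAI
from sentence_transformers import SentenceTransformer

query = "What is the SLA for enterprise tier?"
chunk = "Enterprise tier includes a 99.9% uptime service level agreement..."

# Current model: all-MiniLM-L6-v2 (384 dimensions).
local = SentenceTransformer("all-MiniLM-L6-v2")
q, c = local.encode([query, chunk], normalize_embeddings=True)
print(f"all-MiniLM-L6-v2:       {float(q @ c):.2f}")

# Candidate model: text-embedding-3-small (1536 dimensions).
client = OpenAI()  # reads OPENAI_API_KEY from the environment
resp = client.embeddings.create(model="text-embedding-3-small",
                                input=[query, chunk])
q2, c2 = (np.array(d.embedding) for d in resp.data)
q2, c2 = q2 / np.linalg.norm(q2), c2 / np.linalg.norm(c2)
print(f"text-embedding-3-small: {float(q2 @ c2):.2f}")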

Option 2: Adjust Chunk Size

Chunk Size                    Effect               Recommendation
──────────────────────────────────────────────────────────────────
Too small (< 256 tokens)      May lose context     Increase to 512
Too large (> 1024 tokens)     Diluted embeddings   Decrease to 768
Sweet spot (512-768 tokens)   Balanced             Use for most cases
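
To experiment with chunk size outside the UI, a rough word-based chunker is enough to see the effect on scores (real token counts depend on your embedding model's tokenizer; this is an approximation):

def chunk_words(text: str, chunk_size: int = 512, overlap: int = 64):
    """Split text into overlapping word-based chunks (approximates tokens)."""
    words = text.split()
    step = chunk_size - overlap
    return [" ".join(words[i:i + chunk_size])
            for i in range(0, max(len(words) - overlap, 1), step)]

# Re-embed and re-test the same queries at, say, 512 vs 1024 words
# to see how chunk size moves your similarity scores.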

Option 3: Enable BM25 Hybrid Search

Enable BM25 when your content includes:

  • Technical documents with proper nouns
  • Code repositories with function names
  • Legal documents with specific terms

Benefits:

  • Combines keyword + semantic matching
  • Helps when exact terms matter
  • Improves results for terminology-heavy content
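
Conceptually, hybrid search blends a keyword score with the semantic score. Here is a minimal sketch using the rank_bm25 package; the 50/50 blend weight and the placeholder semantic scores are assumptions, not your pipeline's actual values:

# Blend BM25 keyword scores with embedding similarity scores.
import numpy as np
from rank_bm25 import BM25Okapi

chunks = [
    "Electronics returns accepted within 30 days of purchase.",
    "The process_refund endpoint issues refunds for returned orders.",
]
query = "process_refund endpoint"

bm25 = BM25Okapi([c.lower().split() for c in chunks])
keyword = np.array(bm25.get_scores(query.lower().split()))
if keyword.max() > 0:
    keyword = keyword / keyword.max()  # scale into 0-1 like cosine scores

semantic = np.array([0.48, 0.61])  # placeholder: from your embedding search
hybrid = 0.5 * keyword + 0.5 * semantic
print(hybrid)  # the exact-term match lifts the second chunk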

If Results Are Inconsistent

Problem: Some queries get good scores, others get poor scores

Solutions:

  1. Check if poor-scoring queries use different terminology
  2. Verify all relevant documents are processed
  3. Consider using a more general embedding model
  4. Add synonyms or alternative phrasings to documents

Expected API Behavior

The similarity scores you see in Embedding Search directly translate to API responses:

API Response Example

{
  "query": "What is the return policy?",
  "results": [
    {
      "content": "Returns accepted within 30 days...",
      "similarity_score": 0.87,
      "source": "policy.pdf"
    },
    {
      "content": "Return policy overview...",
      "similarity_score": 0.82,
      "source": "policy.pdf"
    }
  ]
}

Key Points:

  • The similarity_score field in API responses matches the scores shown in Embedding Search
  • The same ranking algorithm is used in both places
  • What you see in testing is what the API returns
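
A quick scripted spot-check against the API might look like this; the endpoint path and payload shape here are illustrative assumptions, so consult your deployment's API reference for the actual route:

# Illustrative only: endpoint and payload shape are assumptions.
import requests

resp = requests.post(
    "https://your-deployment.example.com/api/search",
    json={"query": "What is the return policy?", "top_k": 5},
)
resp.raise_for_status()

for result in resp.json()["results"]:
    # These scores should match what Embedding Search showed in testing.
    print(f"[{result['similarity_score']:.2f}] {result['source']}")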

Decision Criteria

When to Proceed to Step 3

Proceed to Pipeline Configuration when:

  • ✅ Average similarity score > 0.7 across all test queries
  • ✅ Top 3 results are relevant for most queries
  • ✅ Key documents appear in results for relevant queries
  • ✅ No queries have all scores below 0.5
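
If you scripted the earlier steps, the quantitative criteria can be checked automatically; relevance and coverage remain manual judgments:

def ready_for_step_3(all_scores: dict) -> bool:
    """Apply the quantitative proceed criteria to per-query score lists."""
    flat = [s for scores in all_scores.values() for s in scores]
    average_ok = sum(flat) / len(flat) > 0.7
    no_dead_queries = all(max(scores) >= 0.5 for scores in all_scores.values())
    # Top-3 relevance and key-document coverage still need human review.
    return average_ok and no_dead_queries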

When to Iterate

Stay in Step 2 and iterate when:

  • ❌ Average similarity score < 0.5
  • ❌ Top results frequently irrelevant
  • ❌ Important documents missing from results
  • ❌ Scores vary wildly between queries

Common Issues and Solutions

Issue                           Possible Cause           Solution
───────────────────────────────────────────────────────────────────────────
All scores below 0.5            Wrong embedding model    Upgrade to a larger model
Irrelevant top results          Chunk size too large     Reduce to 512 tokens
Missing key documents           Document not processed   Check processing status
Inconsistent scores             Mixed content types      Enable BM25 hybrid search
Good scores but wrong answers   Chunk too small          Increase chunk size

Next Step: Configure RAG Pipeline

Once similarity scores are acceptable, proceed to Step 3: Configure RAG Pipeline to test how embedding settings work together with retrieval configuration.

What to Bring to Step 3

  1. Your test queries from this step
  2. Baseline scores for comparison
  3. Any notes on problematic queries
  4. Selected embedding model confirmation

Tips for Success

  1. Test with Real Queries: Use actual user questions when available
  2. Document Your Baseline: Record scores for future comparison
  3. Test Edge Cases: Include unusual queries in your test set
  4. Iterate Quickly: Don't hesitate to try different models
  5. Trust the Scores: Low scores predict poor API performance