Step 6: Benchmarking & Iteration

Set up KPIs and track RAG quality over time for confident iteration

Purpose

Set up KPIs and track RAG quality over time using business-specific queries and measurable benchmarks. This step enables confident iteration and continuous improvement.

Entry Point: RAG project → Benchmark tab

Prerequisites: Configured RAG system (Steps 1-5 complete)

Expected Outcome: Measurable KPIs, baseline established, iteration workflow ready

Why Benchmarking Matters

Without Benchmarking

❌ Changes are based on intuition
❌ Can't measure whether improvements actually help
❌ Risk of breaking existing functionality
❌ No way to track progress over time

With Benchmarking

✅ Every change is measured against KPIs
✅ Confidence in improvements
✅ Catch regressions before production
✅ Data-driven decision making

Setting Up Your Benchmark

Step 1: Define Business Queries

Create a list of 20-50 queries representing real user questions:

| Query ID | Query                                 | Category | Priority |
|----------|---------------------------------------|----------|----------|
| Q001     | "What is the return policy?"          | Policy   | High     |
| Q002     | "How do I reset my password?"         | Support  | High     |
| Q003     | "What products integrate with Slack?" | Product  | Medium   |
| Q004     | "What is the enterprise pricing?"     | Sales    | High     |
| Q005     | "How do I export my data?"            | Features | Medium   |

Tips for Query Selection:

  • Include queries from each category (Policy, Support, Product, etc.)
  • Prioritize high-frequency user questions
  • Include edge cases and complex queries
  • Use actual user queries from logs if available
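A query set can live in any structured format (CSV, YAML, a database table). Here is a minimal sketch in Python; the field names mirror the table above and are illustrative, not a required schema:

```python
# Illustrative query-set definition; field names follow the table above.
BENCHMARK_QUERIES = [
    {"id": "Q001", "query": "What is the return policy?",          "category": "Policy",   "priority": "High"},
    {"id": "Q002", "query": "How do I reset my password?",         "category": "Support",  "priority": "High"},
    {"id": "Q003", "query": "What products integrate with Slack?", "category": "Product",  "priority": "Medium"},
    {"id": "Q004", "query": "What is the enterprise pricing?",     "category": "Sales",    "priority": "High"},
    {"id": "Q005", "query": "How do I export my data?",            "category": "Features", "priority": "Medium"},
]
```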

Step 2: Define Expected Answers

For each query, document what a good answer should include:

Query Q001: "What is the return policy?"

Expected Answer Should Include:
- Return window (30 days)
- Condition requirements (unused, original packaging)
- Refund timeline (5-7 business days)
- Exception items (software, DVDs)

Accuracy Threshold: 75% (3 of 4 elements)

Query Q002: "How do I reset my password?"

Expected Answer Should Include:
- Password reset link location
- Step-by-step instructions
- Support contact if issues

Accuracy Threshold: 100% (all elements required)
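To make these expectations machine-checkable, one option is to store the required elements next to each query and score an answer by the fraction of elements it mentions. A minimal sketch, assuming simple case-insensitive keyword matching (the match strings are illustrative; real evaluations often use a human or LLM judge, as noted in the KPI table below):

```python
# Hypothetical expected-answer spec; match strings are illustrative.
EXPECTED_ANSWERS = {
    "Q001": {"elements": ["30 days", "original packaging", "5-7 business days", "software"],
             "threshold": 0.75},   # 3 of 4 elements
    "Q002": {"elements": ["reset link", "step", "support"],
             "threshold": 1.0},    # all elements required
}

def answer_accuracy(answer: str, expected: dict) -> float:
    """Fraction of expected elements present in the answer (case-insensitive substring match)."""
    hits = sum(element.lower() in answer.lower() for element in expected["elements"])
    return hits / len(expected["elements"])

def meets_threshold(answer: str, expected: dict) -> bool:
    return answer_accuracy(answer, expected) >= expected["threshold"]
```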

Step 3: Set KPI Targets

| KPI                  | Target      | How It's Measured             |
|----------------------|-------------|-------------------------------|
| Answer Accuracy      | > 90%       | Human evaluation or LLM judge |
| Avg Similarity Score | > 0.75      | Average of top result scores  |
| Response Time        | < 2 seconds | API response latency          |
| Source Coverage      | > 80%       | Relevant sources retrieved    |

KPI Definitions:

  • Answer Accuracy: Percentage of expected elements present in answer
  • Avg Similarity Score: Average similarity score of retrieved chunks
  • Response Time: Time from query to complete response
  • Source Coverage: Percentage of relevant documents retrieved
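Read literally, the four definitions aggregate like this. A sketch, assuming each per-query result records its element accuracy, retrieved-chunk scores, latency, and retrieved vs. known-relevant sources:

```python
from statistics import mean

def compute_kpis(results: list[dict]) -> dict:
    """Aggregate per-query benchmark results into the four KPIs defined above."""
    return {
        # Answer Accuracy: mean fraction of expected elements present per answer
        "answer_accuracy": mean(r["accuracy"] for r in results),
        # Avg Similarity Score: mean of each query's average retrieved-chunk score
        "avg_similarity": mean(mean(r["chunk_scores"]) for r in results),
        # Response Time: mean end-to-end latency in seconds
        "response_time": mean(r["latency_s"] for r in results),
        # Source Coverage: share of known-relevant documents that were retrieved
        "source_coverage": mean(
            len(set(r["retrieved_docs"]) & set(r["relevant_docs"])) / len(r["relevant_docs"])
            for r in results
        ),
    }
```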

Running Benchmarks

Benchmark Execution Flow

1. Load Query Set
   ↓
2. Execute Each Query Through RAG Pipeline
   ↓
3. Collect Results (answers, scores, sources)
   ↓
4. Compare Against Expected Answers
   ↓
5. Calculate KPIs
   ↓
6. Generate Report
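In code, this flow collapses to a loop over the query set. A sketch wiring together the earlier helpers; `rag_pipeline` is a hypothetical entry point returning an answer, chunk scores, and a source list, and the expected-answer spec is assumed to be extended with a `relevant_docs` list for coverage:

```python
import time

def run_benchmark(queries: list[dict], expected_answers: dict, rag_pipeline) -> tuple[list[dict], dict]:
    """Execute the flow above: run every query, collect results, compute KPIs."""
    results = []
    for q in queries:                                    # 2. execute each query
        start = time.perf_counter()
        response = rag_pipeline(q["query"])              # hypothetical RAG entry point
        latency = time.perf_counter() - start
        expected = expected_answers[q["id"]]
        results.append({                                 # 3. collect results
            "id": q["id"],
            "accuracy": answer_accuracy(response["answer"], expected),   # 4. compare
            "chunk_scores": response["chunk_scores"],
            "latency_s": latency,
            "retrieved_docs": response["sources"],
            "relevant_docs": expected["relevant_docs"],  # assumed extension of the spec
        })
    return results, compute_kpis(results)                # 5. calculate KPIs
```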

Benchmark Report Example

Benchmark Report - Project: Customer Support RAG
Date: 2024-01-15
Query Set: 50 business queries
─────────────────────────────────────────────────

Overall KPIs:
┌─────────────────────┬────────┬──────────┬─────────┐
│ KPI                 │ Target │ Actual   │ Status  │
├─────────────────────┼────────┼──────────┼─────────┤
│ Answer Accuracy     │ > 90%  │ 92%      │ ✓ Pass  │
│ Avg Similarity      │ > 0.75 │ 0.82     │ ✓ Pass  │
│ Response Time       │ < 2s   │ 1.4s     │ ✓ Pass  │
│ Source Coverage     │ > 80%  │ 85%      │ ✓ Pass  │
└─────────────────────┴────────┴──────────┴─────────┘

Query-Level Details:
┌─────────┬────────────┬───────────┬───────────┬──────────┐
│ Query   │ Category   │ Accuracy  │ Sim Score │ Status   │
├─────────┼────────────┼───────────┼───────────┼──────────┤
│ Q001    │ Policy     │ 95%       │ 0.89      │ ✓ Pass   │
│ Q002    │ Support    │ 88%       │ 0.76      │ ⚠ Review │
│ Q003    │ Product    │ 94%       │ 0.85      │ ✓ Pass   │
│ Q004    │ Sales      │ 91%       │ 0.81      │ ✓ Pass   │
│ Q005    │ Features   │ 89%       │ 0.78      │ ✓ Pass   │
└─────────┴────────────┴───────────┴───────────┴──────────┘

Failed/Weak Queries:
- Q002: Answer missing password reset link instructions
  Recommendation: Add password_reset.md to document sources

Interpreting Results

Overall KPIs:

  • Green (✓ Pass): Meeting or exceeding target
  • Yellow (⚠ Review): Close to target, monitor
  • Red (✗ Fail): Below target, needs attention

Query-Level Details:

  • Identify specific queries causing issues
  • Pattern analysis by category
  • Prioritize fixes by query priority

Iteration Workflow

Making Confident Changes

Current State: Baseline KPIs measured
   ↓
Proposed Change: "Switch to text-embedding-3-large"
   ↓
Run Benchmark: Execute same query set
   ↓
Compare Results:
┌─────────────┬───────────┬───────────┬────────────┐
│ KPI         │ Before    │ After     │ Change     │
├─────────────┼───────────┼───────────┼────────────┤
│ Accuracy    │ 92%       │ 94%       │ +2% ✓      │
│ Sim Score   │ 0.82      │ 0.87      │ +0.05 ✓    │
│ Resp Time   │ 1.4s      │ 1.8s      │ +0.4s ⚠    │
└─────────────┴───────────┴───────────┴────────────┘
   ↓
Decision: Accuracy improvement worth slight latency increase
   ↓
Deploy Change

Common Iterations

| Change                  | Expected KPI Impact                 | When to Do               |
|-------------------------|-------------------------------------|--------------------------|
| Upgrade embedding model | ↑ Accuracy, ↑ Similarity, ↑ Latency | When accuracy < target   |
| Increase Top-K          | ↑ Context coverage, ↑ Latency       | When answers lack detail |
| Enable BM25             | ↑ Accuracy for technical terms      | Technical documentation  |
| Add documents           | ↑ Source coverage                   | When queries miss info   |
| Adjust chunk size       | Variable - test with benchmark      | When scores inconsistent |

Iteration Decision Framework

Deploy When:

  • Primary KPIs improve (accuracy, similarity)
  • Secondary KPI impact acceptable (latency)
  • No regressions in critical queries

Don't Deploy When:

  • Primary KPIs decrease
  • Latency increase unacceptable
  • Critical queries regress
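These rules are mechanical enough to encode directly. A sketch over the KPI dictionaries produced by `compute_kpis`, with the acceptable latency increase as an explicit, illustrative parameter:

```python
def should_deploy(before: dict, after: dict, critical_regressed: bool,
                  max_latency_increase_s: float = 0.5) -> bool:
    """Deploy only if primary KPIs improve, latency stays acceptable, and no critical query regressed."""
    primary_improved = (after["answer_accuracy"] >= before["answer_accuracy"]
                        and after["avg_similarity"] >= before["avg_similarity"])
    latency_acceptable = (after["response_time"] - before["response_time"]) <= max_latency_increase_s
    return primary_improved and latency_acceptable and not critical_regressed
```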

Benchmarking Best Practices

Do's

✅ Run benchmarks after every significant change
✅ Include diverse query types (simple, complex, edge cases)
✅ Track benchmarks over time (trend analysis)
✅ Set realistic KPI targets based on use case
✅ Document benchmark changes (query additions, removals)

Don'ts

❌ Don't change benchmark queries frequently (breaks trend comparison)
❌ Don't optimize for KPIs at the expense of user experience
❌ Don't ignore latency KPIs (accuracy isn't everything)
❌ Don't skip the benchmark before a production deployment

Integration with Development Workflow

Before Production Deployment

Development → Benchmark Test → KPI Threshold Check → Deploy
                                    ↓
                            If KPIs below threshold:
                            - Block deployment
                            - Investigate regression
                            - Fix and re-test
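In CI, the threshold check can be a script that exits non-zero when any KPI misses its target, which blocks the deploy step. A sketch using the example targets from Step 3:

```python
import sys

KPI_TARGETS = {
    "answer_accuracy": ("min", 0.90),   # > 90%
    "avg_similarity":  ("min", 0.75),   # > 0.75
    "response_time":   ("max", 2.0),    # < 2 seconds
    "source_coverage": ("min", 0.80),   # > 80%
}

def gate(kpis: dict) -> None:
    """Exit non-zero (blocking deployment) if any KPI misses its target."""
    failures = []
    for name, (kind, target) in KPI_TARGETS.items():
        ok = kpis[name] >= target if kind == "min" else kpis[name] <= target
        if not ok:
            failures.append(f"{name}: {kpis[name]} vs target {target}")
    if failures:
        print("KPIs below threshold; blocking deployment:")
        print("\n".join(f"  {f}" for f in failures))
        sys.exit(1)
    print("All KPIs within targets.")
```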

Scheduled Benchmarking

| Frequency | Scope                 | Purpose                 |
|-----------|-----------------------|-------------------------|
| Daily     | Critical queries (10) | Catch major regressions |
| Weekly    | Full query set (50+)  | Track trends            |
| Monthly   | Extended set (100+)   | Comprehensive analysis  |

Trend Analysis

Track KPIs over time to identify:

  • Gradual improvements or degradations
  • Impact of document additions
  • Seasonal patterns in query types

Accuracy Trend (Last 4 Weeks):
Week 1: 89%
Week 2: 91% (+2%)
Week 3: 92% (+1%)
Week 4: 94% (+2%)

Trend: Improving ✓
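The readout above is just week-over-week deltas on the accuracy KPI. A sketch:

```python
def accuracy_trend(weekly: list[float]) -> str:
    """Render week-over-week accuracy deltas and the overall direction."""
    lines = [f"Week 1: {weekly[0]:.0%}"]
    for week, (prev, curr) in enumerate(zip(weekly, weekly[1:]), start=2):
        lines.append(f"Week {week}: {curr:.0%} ({curr - prev:+.0%})")
    direction = "Improving ✓" if weekly[-1] > weekly[0] else "Flat/Declining ⚠"
    return "\n".join(lines + [f"\nTrend: {direction}"])

print(accuracy_trend([0.89, 0.91, 0.92, 0.94]))
```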

Continuous Improvement

Benchmarking is not a one-time activity. Make it part of your regular RAG development workflow:

Weekly

  • Review benchmark trends
  • Identify queries needing attention
  • Plan improvements

Per Change

  • Run benchmark before deploying
  • Compare against baseline
  • Document results

Monthly

  • Add new queries based on user feedback
  • Review and update KPI targets
  • Comprehensive analysis

Quarterly

  • Full benchmark review
  • Strategic planning
  • Goal setting for next quarter

Next Steps

After establishing benchmarking:

  1. Deploy to Production with confidence
  2. Monitor Continuously using established KPIs
  3. Iterate Regularly based on benchmark results
  4. Expand Benchmark as new use cases emerge

Tips for Success

  1. Start Small: Begin with 10-20 critical queries
  2. Automate: Run benchmarks automatically on changes
  3. Visualize: Use charts for trend analysis
  4. Share: Make benchmarks visible to team
  5. Act: Use results to drive improvements