Step 6 - Benchmarking & Iteration
Set up KPIs and track RAG quality over time for confident iteration
Purpose
Set up KPIs and track RAG quality over time using business-specific queries and measurable benchmarks. This step enables confident iteration and continuous improvement.
Entry Point: RAG project → Benchmark tab
Prerequisites: Configured RAG system (Steps 1-5 complete)
Expected Outcome: Measurable KPIs, baseline established, iteration workflow ready
Why Benchmarking Matters
Without Benchmarking
- ❌ Changes are based on intuition
- ❌ Can't measure whether improvements actually help
- ❌ Risk of breaking existing functionality
- ❌ No way to track progress over time
With Benchmarking
- ✅ Every change is measured against KPIs
- ✅ Confidence in improvements
- ✅ Catch regressions before production
- ✅ Data-driven decision making
Setting Up Your Benchmark
Step 1: Define Business Queries
Create a list of 20-50 queries representing real user questions (a structured sketch of this set follows the tips below):
| Query ID | Query | Category | Priority |
|---|---|---|---|
| Q001 | "What is the return policy?" | Policy | High |
| Q002 | "How do I reset my password?" | Support | High |
| Q003 | "What products integrate with Slack?" | Product | Medium |
| Q004 | "What is the enterprise pricing?" | Sales | High |
| Q005 | "How do I export my data?" | Features | Medium |
Tips for Query Selection:
- Include queries from each category (Policy, Support, Product, etc.)
- Prioritize high-frequency user questions
- Include edge cases and complex queries
- Use actual user queries from logs if available
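Keeping the query set as plain data lets it live in version control next to your RAG configuration. A minimal Python sketch of the table above; the field names (query_id, category, priority) are illustrative, not a required schema:

```python
# The benchmark query set as plain, versionable data.
BENCHMARK_QUERIES = [
    {"query_id": "Q001", "query": "What is the return policy?",          "category": "Policy",   "priority": "High"},
    {"query_id": "Q002", "query": "How do I reset my password?",         "category": "Support",  "priority": "High"},
    {"query_id": "Q003", "query": "What products integrate with Slack?", "category": "Product",  "priority": "Medium"},
    {"query_id": "Q004", "query": "What is the enterprise pricing?",     "category": "Sales",    "priority": "High"},
    {"query_id": "Q005", "query": "How do I export my data?",            "category": "Features", "priority": "Medium"},
]
```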
Step 2: Define Expected Answers
For each query, document what a good answer should include:
Query Q001: "What is the return policy?"
Expected Answer Should Include:
- Return window (30 days)
- Condition requirements (unused, original packaging)
- Refund timeline (5-7 business days)
- Exception items (software, DVDs)
Accuracy Threshold: 75% (at least 3 of the 4 elements)
Query Q002: "How do I reset my password?"
Expected Answer Should Include:
- Password reset link location
- Step-by-step instructions
- Support contact if issues
Accuracy Threshold: 100% (all elements required)
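The expected-answer definitions can sit alongside the queries in the same file. A sketch of the two examples above with per-query thresholds; the structure and field names are assumptions, not a fixed format:

```python
# Hypothetical structure pairing each query with its expected answer elements.
# Thresholds are per-query: Q001 tolerates one missing element, Q002 requires all.
EXPECTED_ANSWERS = {
    "Q001": {
        "elements": [
            "return window (30 days)",
            "condition requirements (unused, original packaging)",
            "refund timeline (5-7 business days)",
            "exception items (software, DVDs)",
        ],
        "accuracy_threshold": 0.75,  # at least 3 of the 4 elements
    },
    "Q002": {
        "elements": [
            "password reset link location",
            "step-by-step instructions",
            "support contact if issues",
        ],
        "accuracy_threshold": 1.0,   # all elements required
    },
}
```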
Step 3: Set KPI Targets
| KPI | Target | How It's Measured |
|---|---|---|
| Answer Accuracy | > 90% | Human evaluation or LLM judge |
| Avg Similarity Score | > 0.75 | Average of top result scores |
| Response Time | < 2 seconds | API response latency |
| Source Coverage | > 80% | Relevant sources retrieved |
KPI Definitions:
- Answer Accuracy: Percentage of expected elements present in answer
- Avg Similarity Score: Average similarity score of retrieved chunks
- Response Time: Time from query to complete response
- Source Coverage: Percentage of relevant documents retrieved
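Two of these KPIs are straightforward to compute once results are collected. A minimal sketch; the substring check is a naive stand-in for the human evaluation or LLM judge the table calls for:

```python
from statistics import mean

def answer_accuracy(answer: str, expected_elements: list[str]) -> float:
    """Answer Accuracy: fraction of expected elements present in the answer.
    Naive substring matching stands in for human/LLM-judge evaluation."""
    if not expected_elements:
        return 0.0  # nothing defined to check against; flag for review
    present = sum(1 for el in expected_elements if el.lower() in answer.lower())
    return present / len(expected_elements)

def avg_similarity(top_scores: list[float]) -> float:
    """Avg Similarity Score: mean similarity of the retrieved chunks."""
    return mean(top_scores)
```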
Running Benchmarks
Benchmark Execution Flow
1. Load Query Set
↓
2. Execute Each Query Through RAG Pipeline
↓
3. Collect Results (answers, scores, sources)
↓
4. Compare Against Expected Answers
↓
5. Calculate KPIs
↓
6. Generate Report
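The flow reduces to a small loop. A sketch that reuses the helpers from the previous snippet; `rag_query` is a hypothetical stand-in for your RAG pipeline's query function, assumed to return an answer, the top similarity scores, and the retrieved sources:

```python
import time
from statistics import mean

def run_benchmark(queries, expected, rag_query):
    """Steps 1-6 above as a loop over the query set."""
    rows = []
    for q in queries:
        start = time.perf_counter()
        answer, scores, sources = rag_query(q["query"])   # 2. execute query
        latency = time.perf_counter() - start
        exp = expected.get(q["query_id"], {})
        rows.append({                                     # 3.-4. collect & compare
            "query_id": q["query_id"],
            "category": q["category"],
            "accuracy": answer_accuracy(answer, exp.get("elements", [])),
            "sim_score": avg_similarity(scores),
            "latency_s": latency,
            "sources": sources,
        })
    return {                                              # 5.-6. KPIs & report
        "answer_accuracy": mean(r["accuracy"] for r in rows),
        "avg_similarity": mean(r["sim_score"] for r in rows),
        "avg_latency_s": mean(r["latency_s"] for r in rows),
        "rows": rows,
    }
```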
Benchmark Report Example
Benchmark Report - Project: Customer Support RAG
Date: 2024-01-15
Query Set: 50 business queries
─────────────────────────────────────────────────
Overall KPIs:
┌─────────────────────┬────────┬──────────┬─────────┐
│ KPI                 │ Target │ Actual   │ Status  │
├─────────────────────┼────────┼──────────┼─────────┤
│ Answer Accuracy     │ > 90%  │ 92%      │ ✓ Pass  │
│ Avg Similarity      │ > 0.75 │ 0.82     │ ✓ Pass  │
│ Response Time       │ < 2s   │ 1.4s     │ ✓ Pass  │
│ Source Coverage     │ > 80%  │ 85%      │ ✓ Pass  │
└─────────────────────┴────────┴──────────┴─────────┘
Query-Level Details:
┌─────────┬────────────┬───────────┬───────────┬──────────┐
│ Query   │ Category   │ Accuracy  │ Sim Score │ Status   │
├─────────┼────────────┼───────────┼───────────┼──────────┤
│ Q001    │ Policy     │ 95%       │ 0.89      │ ✓ Pass   │
│ Q002    │ Support    │ 88%       │ 0.76      │ ⚠ Review │
│ Q003    │ Product    │ 94%       │ 0.85      │ ✓ Pass   │
│ Q004    │ Sales      │ 91%       │ 0.81      │ ✓ Pass   │
│ Q005    │ Features   │ 89%       │ 0.78      │ ✓ Pass   │
└─────────┴────────────┴───────────┴───────────┴──────────┘
Failed/Weak Queries:
- Q002: Answer missing password reset link instructions
Recommendation: Add password_reset.md to document sources
Interpreting Results
Overall KPIs:
- Green (✓ Pass): Meeting or exceeding target
- Yellow (⚠ Review): Close to target, monitor
- Red (✗ Fail): Below target, needs attention
Query-Level Details:
- Identify specific queries causing issues
- Pattern analysis by category
- Prioritize fixes by query priority
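The Pass/Review/Fail scheme can be encoded as a simple threshold check. A sketch, assuming a 5% tolerance band for the "Review" state (tune it per KPI):

```python
def kpi_status(actual: float, target: float, higher_is_better: bool = True,
               review_band: float = 0.05) -> str:
    """Map an actual KPI value to the Pass / Review / Fail scheme above."""
    gap = actual - target if higher_is_better else target - actual
    if gap >= 0:
        return "✓ Pass"
    if abs(gap) <= review_band * target:
        return "⚠ Review"
    return "✗ Fail"
```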
Iteration Workflow
Making Confident Changes
Current State: Baseline KPIs measured
↓
Proposed Change: "Switch to text-embedding-3-large"
↓
Run Benchmark: Execute same query set
↓
Compare Results:
┌─────────────┬───────────┬───────────┬────────────┐
│ KPI         │ Before    │ After     │ Change     │
├─────────────┼───────────┼───────────┼────────────┤
│ Accuracy    │ 92%       │ 94%       │ +2% ✓      │
│ Sim Score   │ 0.82      │ 0.87      │ +0.05 ✓    │
│ Resp Time   │ 1.4s      │ 1.8s      │ +0.4s ⚠    │
└─────────────┴───────────┴───────────┴────────────┘
↓
Decision: Accuracy improvement worth slight latency increase
↓
Deploy Change
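Comparing a candidate run against the baseline amounts to diffing two benchmark reports. A sketch, assuming the report structure from the runner above and that latency is the only KPI where lower is better:

```python
def compare_kpis(before: dict, after: dict,
                 lower_is_better: tuple = ("avg_latency_s",)) -> dict:
    """Diff two benchmark reports KPI by KPI, flagging improvements vs. regressions."""
    diff = {}
    for kpi, old in before.items():
        if kpi == "rows":  # skip per-query details
            continue
        delta = after[kpi] - old
        improved = delta <= 0 if kpi in lower_is_better else delta >= 0
        diff[kpi] = {"before": old, "after": after[kpi],
                     "change": delta, "flag": "✓" if improved else "⚠"}
    return diff
```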
Common Iterations
| Change | Expected KPI Impact | When to Do |
|---|---|---|
| Upgrade embedding model | ↑ Accuracy, ↑ Similarity, ↑ Latency | When accuracy < target |
| Increase Top-K | ↑ Context coverage, ↑ Latency | When answers lack detail |
| Enable BM25 | ↑ Accuracy for technical terms | Technical documentation |
| Add documents | ↑ Source coverage | When queries miss info |
| Adjust chunk size | Variable - test with benchmark | When scores inconsistent |
Iteration Decision Framework
Deploy When:
- Primary KPIs improve (accuracy, similarity)
- Secondary KPI impact acceptable (latency)
- No regressions in critical queries
Don't Deploy When:
- Primary KPIs decrease
- Latency increase unacceptable
- Critical queries regress
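This framework can act as an automated gate. A sketch built on the diff from the previous snippet; the 0.5 s latency budget is an assumed value, not a recommendation:

```python
def should_deploy(diff: dict, before_rows: list, after_rows: list,
                  critical_ids: set, max_latency_increase_s: float = 0.5) -> bool:
    """Encode the deploy / don't-deploy rules above."""
    # Primary KPIs (accuracy, similarity) must not decrease.
    if diff["answer_accuracy"]["change"] < 0 or diff["avg_similarity"]["change"] < 0:
        return False
    # Secondary KPI impact (latency) must stay within budget.
    if diff["avg_latency_s"]["change"] > max_latency_increase_s:
        return False
    # No regressions in critical queries.
    baseline = {r["query_id"]: r["accuracy"] for r in before_rows}
    return all(r["accuracy"] >= baseline.get(r["query_id"], 0.0)
               for r in after_rows if r["query_id"] in critical_ids)
```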
Benchmarking Best Practices
Do's
- ✅ Run benchmarks after every significant change
- ✅ Include diverse query types (simple, complex, edge cases)
- ✅ Track benchmarks over time (trend analysis)
- ✅ Set realistic KPI targets based on use case
- ✅ Document benchmark changes (query additions, removals)
Don'ts
- ❌ Don't change benchmark queries frequently (it breaks trend comparison)
- ❌ Don't optimize for KPIs at the expense of user experience
- ❌ Don't ignore latency KPIs (accuracy isn't everything)
- ❌ Don't skip the benchmark before a production deployment
Integration with Development Workflow
Before Production Deployment
Development → Benchmark Test → KPI Threshold Check → Deploy
↓
If KPIs below threshold:
- Block deployment
- Investigate regression
- Fix and re-test
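In CI, the threshold check can block deployment by exiting nonzero. A hypothetical gate script, with thresholds mirroring the Step 3 targets:

```python
import sys

# Hypothetical CI gate: exit nonzero when KPIs miss their thresholds,
# which blocks the deployment step.
THRESHOLDS = {"answer_accuracy": 0.90, "avg_similarity": 0.75}
MAX_LATENCY_S = 2.0

def gate(report: dict) -> None:
    failures = [kpi for kpi, target in THRESHOLDS.items() if report[kpi] < target]
    if report["avg_latency_s"] > MAX_LATENCY_S:
        failures.append("avg_latency_s")
    if failures:
        print(f"Deployment blocked; investigate regression in: {failures}")
        sys.exit(1)
    print("All KPIs meet thresholds; deployment may proceed.")
```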
Scheduled Benchmarking
| Frequency | Scope | Purpose |
|---|---|---|
| Daily | Critical queries (10) | Catch major regressions |
| Weekly | Full query set (50+) | Track trends |
| Monthly | Extended set (100+) | Comprehensive analysis |
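Scope selection can key off query priority. A sketch, assuming the daily critical set corresponds to the high-priority queries (the extended monthly set would add queries beyond the base list):

```python
def select_queries(queries: list, scope: str) -> list:
    """Pick the query subset for a scheduled run."""
    if scope == "daily":
        return [q for q in queries if q["priority"] == "High"]
    return queries  # weekly/monthly: full set
```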
Trend Analysis
Track KPIs over time to identify:
- Gradual improvements or degradations
- Impact of document additions
- Seasonal patterns in query types
Accuracy Trend (Last 4 Weeks):
Week 1: 89%
Week 2: 91% (+2%)
Week 3: 92% (+1%)
Week 4: 94% (+2%)
Trend: Improving ✓
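Trend direction can be computed directly from the weekly accuracy series. A sketch matching the example above:

```python
def weekly_trend(accuracies: list[float]) -> str:
    """Classify the direction of a weekly accuracy series, e.g.
    [0.89, 0.91, 0.92, 0.94] -> 'Improving ✓'."""
    if len(accuracies) < 2:
        return "Insufficient data"
    deltas = [b - a for a, b in zip(accuracies, accuracies[1:])]
    if all(d >= 0 for d in deltas) and deltas[-1] > 0:
        return "Improving ✓"
    if all(d <= 0 for d in deltas):
        return "Degrading ✗"
    return "Mixed ⚠"
```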
Continuous Improvement
Benchmarking is not a one-time activity. Make it part of your regular RAG development workflow:
Weekly
- Review benchmark trends
- Identify queries needing attention
- Plan improvements
Per Change
- Run benchmark before deploying
- Compare against baseline
- Document results
Monthly
- Add new queries based on user feedback
- Review and update KPI targets
- Comprehensive analysis
Quarterly
- Full benchmark review
- Strategic planning
- Goal setting for next quarter
Next Steps
After establishing benchmarking:
- Deploy to Production with confidence
- Monitor Continuously using established KPIs
- Iterate Regularly based on benchmark results
- Expand Benchmark as new use cases emerge
Tips for Success
- Start Small: Begin with 10-20 critical queries
- Automate: Run benchmarks automatically on changes
- Visualize: Use charts for trend analysis
- Share: Make benchmarks visible to team
- Act: Use results to drive improvements
