
Eval-Gated RAG Platform with CI/CD

Production RAG system with evaluation-driven releases, gold-set testing, and CI gates to prevent hallucinations. Self-hosted on Kubernetes with cost and accuracy SLOs.

RAG · LLM · Evaluation · CI/CD · Kubernetes · Vector Database

The Problem

Organizations building RAG systems lack confidence in answer quality and have no systematic way to prevent regressions when updating retrieval strategies, prompts, or models.

Key Constraints

  • Cannot use external LLM APIs (compliance/cost)
  • Must prove accuracy to non-technical stakeholders
  • Need to detect and prevent hallucinations systematically
  • Updates must be testable before production

Results

Before

  • Answer Accuracy: Unknown (no measurement)
  • Hallucination Rate: ~12% (manual spot-checks)
  • Time to Deploy Changes: 2-3 days (manual testing)
  • Cost per Query: $0.08 (OpenAI API)

After

  • Answer Accuracy: 87% (measured on gold-set)
  • Hallucination Rate: 3% (eval-gated)
  • Time to Deploy Changes: 30 minutes (CI/CD)
  • Cost per Query: $0.006 (self-hosted)

Overview

A production RAG (Retrieval-Augmented Generation) platform with systematic evaluation and CI/CD gating. This system ensures every change—whether to retrieval strategy, prompt templates, or models—is tested against a gold-set before reaching production.

GitHub Repository: github.com/yrgenkuci/eval-gated-rag-platform

The Problem

Most RAG systems are deployed with hope, not confidence. Here's why:

  • No measurement: Teams don't know if answers are actually accurate
  • No regression prevention: Updates can silently make things worse
  • No systematic testing: Spot-checking 10 queries doesn't prove anything
  • High costs: Using OpenAI API at scale gets expensive fast
  • Compliance blockers: Can't send company data to external APIs

Without evaluation infrastructure, RAG becomes a black box that occasionally returns confidently wrong, hallucinated answers.

Solution Architecture

Built a self-hosted RAG platform with an eval-first design:

1. Core RAG Pipeline

  • Vector Database: Qdrant (self-hosted)
  • Embedding Model: BGE-large (open-source)
  • LLM: Llama 3 8B (self-hosted on GPU)
  • Chunking Strategy: Configurable with semantic splitting
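
To make the pipeline concrete, here is a minimal sketch of the retrieval-and-generation path. It assumes a Qdrant collection named "docs" with chunk text stored in the payload and a vLLM server exposing its OpenAI-compatible API; the production service adds configuration, citations, and error handling, but the flow is the same.

# minimal retrieval + generation sketch; collection name, hosts, payload key,
# and prompt wording are illustrative assumptions, not the production config
import requests
from qdrant_client import QdrantClient
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("BAAI/bge-large-en-v1.5")  # BGE-large embeddings
qdrant = QdrantClient(url="http://qdrant:6333")           # self-hosted Qdrant

def answer(query: str, top_k: int = 5) -> str:
    # 1) embed the query and retrieve the most similar chunks
    vector = embedder.encode(query).tolist()
    hits = qdrant.search(collection_name="docs", query_vector=vector, limit=top_k)
    context = "\n\n".join(hit.payload["text"] for hit in hits)

    # 2) generate with the self-hosted Llama 3 8B behind vLLM's OpenAI-compatible API
    resp = requests.post(
        "http://vllm:8000/v1/chat/completions",
        json={
            "model": "meta-llama/Meta-Llama-3-8B-Instruct",
            "messages": [
                {"role": "system", "content": "Answer only from the provided context."},
                {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {query}"},
            ],
        },
        timeout=60,
    )
    return resp.json()["choices"][0]["message"]["content"]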

2. Evaluation Harness

  • Gold-set management: Versioned test cases with expected answers
  • Automated metrics: ROUGE, BLEU, semantic similarity, custom validators
  • Drift detection: Compares current vs historical performance
  • Per-query confidence scores: Flags low-confidence answers for review
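
Drift detection, in particular, does not need heavy machinery: persist the aggregate metrics from each run and compare the next run against them. A minimal sketch, assuming a small JSON baseline file (the path, metric names, and tolerance are illustrative):

# compare this run's aggregate metrics against the last recorded baseline;
# file location, metric names, and tolerance are assumed for illustration
import json
from pathlib import Path

BASELINE = Path("eval/baseline_metrics.json")
TOLERANCE = 0.02  # allowed absolute drop before we call it drift

def check_drift(current: dict) -> list[str]:
    """Return the metrics that regressed beyond the tolerance."""
    if not BASELINE.exists():
        return []  # first run: nothing to compare against
    baseline = json.loads(BASELINE.read_text())
    return [
        name
        for name, value in current.items()
        if name in baseline and value < baseline[name] - TOLERANCE
    ]

# example: this week's run vs last week's recorded baseline
regressed = check_drift({"accuracy": 0.83, "groundedness": 0.95})
if regressed:
    raise SystemExit(f"Drift detected on: {', '.join(regressed)}")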

3. CI/CD Pipeline

  • Pre-merge testing: Every PR runs full eval suite
  • SLO gates: Changes that drop accuracy below threshold are blocked
  • Canary deployments: New versions tested on 10% of traffic first
  • Rollback triggers: Auto-rollback if live metrics degrade
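
The rollback trigger boils down to comparing the canary's live metrics against the stable track before promotion. The sketch below queries Prometheus directly; the metric names, labels, and threshold are assumptions for illustration, and in practice the check runs as a step in the rollout pipeline.

# compare canary vs stable flagged-answer rate from Prometheus;
# metric names, labels, and the 1.5x threshold are illustrative assumptions
import sys
import requests

PROM = "http://prometheus:9090/api/v1/query"

def metric(promql: str) -> float:
    r = requests.get(PROM, params={"query": promql}, timeout=10)
    return float(r.json()["data"]["result"][0]["value"][1])

canary = metric(
    'sum(rate(rag_flagged_answers_total{track="canary"}[15m]))'
    ' / sum(rate(rag_answers_total{track="canary"}[15m]))'
)
stable = metric(
    'sum(rate(rag_flagged_answers_total{track="stable"}[15m]))'
    ' / sum(rate(rag_answers_total{track="stable"}[15m]))'
)

if canary > stable * 1.5:  # canary noticeably worse than stable
    print("Canary degraded, rolling back")
    sys.exit(1)  # non-zero exit fails the rollout step and triggers rollback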

Technical Stack

  • Orchestration: Kubernetes
  • Vector DB: Qdrant
  • LLM Runtime: vLLM (GPU inference server)
  • Embeddings: BGE-large (multilingual)
  • Eval Framework: Custom harness with pytest
  • CI/CD: GitHub Actions + ArgoCD
  • Observability: Prometheus + Grafana

Implementation Highlights

Gold-Set Driven Development

Every feature starts with test cases:

# example gold-set entry
- query: "What is our company's data retention policy?"
  expected_chunks: ["docs/policies/data-retention.md"]
  expected_answer_contains:
    - "90 days for logs"
    - "7 years for financial records"
  must_not_contain:
    - "indefinitely"

The eval harness runs these automatically and fails CI if accuracy drops below 85%.
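
In harness terms, the gate is a test that loads the versioned gold-set, queries the service, and asserts the aggregate accuracy SLO. A simplified sketch of that shape is below; the file path and the rag_client fixture are assumptions, and the real harness takes the gold-set name and threshold from the pytest options shown in the next section.

# simplified shape of the eval gate; the gold-set path and the rag_client
# fixture are illustrative assumptions
import yaml

with open("goldsets/prod.yaml") as f:
    GOLD_SET = yaml.safe_load(f)

def test_gold_set_accuracy(rag_client):
    results = []
    for entry in GOLD_SET:
        answer = rag_client.ask(entry["query"])
        ok = all(s in answer for s in entry.get("expected_answer_contains", []))
        ok = ok and not any(s in answer for s in entry.get("must_not_contain", []))
        results.append(ok)
    accuracy = sum(results) / len(results)
    assert accuracy >= 0.85, f"Gold-set accuracy {accuracy:.0%} is below the 85% SLO"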

CI/CD Gate Example

# .github/workflows/eval-gate.yml
name: eval-gate
on: pull_request
jobs:
  eval:
    runs-on: self-hosted  # runner label is illustrative
    steps:
      - uses: actions/checkout@v4
      - name: Run Eval Suite
        run: pytest tests/eval/ --gold-set=prod --threshold=0.85
      - name: Block if Failed
        if: failure()
        run: |
          echo "Eval failed. Accuracy below SLO. Blocking merge."
          exit 1

Result: No broken changes reach production.

Self-Hosted LLM Cost Control

Running Llama 3 8B on NVIDIA L4 GPU:

  • Throughput: 50 tokens/sec
  • Cost: ~$0.006 per query (amortized GPU cost)
  • vs OpenAI: ~$0.08 per query

13x cost reduction while keeping data private.
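
The arithmetic behind that figure is easy to sanity-check. The inputs below (GPU price, tokens per answer, average utilization) are illustrative assumptions rather than the measured production values, but they land in the same ballpark:

# back-of-envelope amortized cost; every input here is an assumption for illustration
GPU_COST_PER_HOUR = 0.80   # assumed all-in hourly cost of one NVIDIA L4
TOKENS_PER_SECOND = 50     # throughput figure from this case study
TOKENS_PER_ANSWER = 300    # assumed average generated tokens per query
UTILIZATION = 0.30         # assumed fraction of GPU time doing useful work

seconds_per_query = TOKENS_PER_ANSWER / TOKENS_PER_SECOND       # 6.0 s of generation
billed_gpu_seconds = seconds_per_query / UTILIZATION            # ~20 s of paid GPU time
cost_per_query = GPU_COST_PER_HOUR / 3600 * billed_gpu_seconds  # ~$0.0044

print(f"~${cost_per_query:.4f} per query")  # same ballpark as the ~$0.006 quoted above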

Hallucination Prevention

Three layers of protection:

  1. Source attribution: Every answer cites specific chunks
  2. Confidence thresholds: Low-confidence answers trigger human review
  3. Eval gate: Hallucination rate tracked per deployment

This reduced hallucinations from ~12% to 3% through systematic measurement.
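
The first two layers can be approximated with the embedding model already in the stack: score how well each sentence of the answer is supported by the retrieved chunks, and route weakly supported answers to human review. A minimal sketch, with naive sentence splitting and an assumed threshold:

# groundedness scoring sketch; the 0.55 threshold and the sentence splitting
# are illustrative assumptions, tuned against the gold-set in practice
from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer("BAAI/bge-large-en-v1.5")

def support_score(answer: str, chunks: list[str]) -> float:
    sentences = [s.strip() for s in answer.split(".") if s.strip()]
    sent_emb = embedder.encode(sentences, convert_to_tensor=True)
    chunk_emb = embedder.encode(chunks, convert_to_tensor=True)
    sims = util.cos_sim(sent_emb, chunk_emb)  # [num_sentences, num_chunks]
    # each sentence gets its best-matching chunk; the answer's confidence is
    # the weakest of those per-sentence scores
    return float(sims.max(dim=1).values.min())

def route(answer: str, chunks: list[str]) -> str:
    if support_score(answer, chunks) < 0.55:
        return "needs_human_review"
    return "serve_with_citations"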

Results

Deployed for a healthcare organization handling 10k+ queries/day:

Metric | Before | After | Improvement
--- | --- | --- | ---
Answer Accuracy | Unknown | 87% (measured) | Baseline established
Hallucination Rate | ~12% | 3% | 75% reduction
Deploy Confidence | "Hope it works" | Eval-gated | Systematic quality
Cost per Query | $0.08 | $0.006 | 93% cost reduction
Change Velocity | 2-3 days | 30 minutes | ~100x faster iteration

Key Learnings

  1. Evaluation isn't optional: You can't improve what you don't measure
  2. Gold-sets need maintenance: Invest in curating high-quality test cases
  3. CI gates prevent regressions: Automated testing catches issues before users do
  4. Self-hosting pays off at scale: Break-even point is ~50k queries/month
  5. Observability = trust: Dashboards showing real metrics build stakeholder confidence

How This Can Work for You

This platform is the foundation of two services:

1. RAG Review & Hardening (1-2 weeks)

If you already have a RAG system:

  • Audit current implementation
  • Build gold-set for your domain
  • Add eval harness and CI gates
  • Measure and report baseline metrics

2. Eval-Driven RAG Platform (2-3 weeks)

Starting from scratch? We'll deploy the complete stack:

  • Complete deployment (vector DB, LLM, eval harness)
  • Configuration for your documents and use cases
  • CI/CD setup with quality gates
  • Team training on gold-set management

Both approaches use the same reusable components from this reference implementation.

Interested? Book an architecture review to discuss your RAG requirements and accuracy targets.

Related Resources

  • Technical Deep-Dive: How to build eval harnesses for LLM applications (in progress)
  • Gold-Set Best Practices: Curating test cases that actually catch regressions (in progress)
  • Cost Analysis: When self-hosting LLMs makes financial sense (in progress)

Want similar results for your project?

This case study demonstrates the approach I take with clients. Book a call to discuss your specific requirements.
