GG Nagarkar

Agentic AI · 2025

Gen-AI and Agentic Framework QA Standards

Recent Phase: Agentic AI and Intelligence

Production-grade QA standards for LLM and RAG pipelines.

RAGAS · Langfuse · CI/CD · RAG Evaluation · LLM QA · Agentic AI

Architecture Responsibility

Responsible for technology architecture and hands-on delivery direction across system design, deployment, DevOps, cost, scale, reliability, and production readiness.

Outcome

Established automated QA gatekeeping for LLM pipelines, moving beyond basic unit tests toward measurable recall, faithfulness, and relevance standards.

Scale

Designed as a production QA standard for Gen-AI and agentic delivery pipelines.

Architecture

  • Integrated RAGAS into CI/CD pipelines to score retrieval recall, faithfulness, and answer relevance.
  • Configured hard build gates on quality thresholds, with targets such as 0.90 for recall and 0.85 for faithfulness and relevance.
  • Used Langfuse for live tracing, latency monitoring, and token cost tracking in production.
  • Connected evaluation signals to release decisions so quality regressions could block merges.
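
To illustrate the gating idea in the list above, here is a minimal sketch of a threshold check that a CI step might run after an evaluation job. It is not the project's actual implementation: the function names (aggregate, failed_gates, gate_exit_code) are hypothetical, the metric keys are assumed labels for recall, faithfulness, and relevance, and the per-sample scores are assumed to have been produced already by the evaluation run. Only the 0.90 and 0.85 targets come from the text.

```python
from statistics import mean

# Targets quoted in the Architecture list; the metric key names are assumptions.
THRESHOLDS = {
    "context_recall": 0.90,
    "faithfulness": 0.85,
    "answer_relevancy": 0.85,
}


def aggregate(per_sample):
    """Average each gated metric across all evaluated samples."""
    return {m: mean(row[m] for row in per_sample) for m in THRESHOLDS}


def failed_gates(scores, thresholds=THRESHOLDS):
    """Return (metric, score, threshold) for every metric below its threshold."""
    return [
        (metric, scores[metric], limit)
        for metric, limit in thresholds.items()
        if scores[metric] < limit
    ]


def gate_exit_code(per_sample):
    """Print any failing gates and return a process exit code (0 = pass, 1 = block)."""
    failures = failed_gates(aggregate(per_sample))
    for metric, score, limit in failures:
        print(f"FAIL {metric}: {score:.3f} < {limit:.2f}")
    return 1 if failures else 0
```

A CI job could pass the result to sys.exit(gate_exit_code(per_sample_scores)), so that a non-zero exit code fails the build and blocks the merge. Averaging per-sample scores is one possible aggregation; a stricter gate might also check the worst-scoring samples.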

Lessons Learned

  • Gen-AI systems need evaluation architecture, not only unit tests, because quality depends on retrieval, prompts, grounding, and model behavior.
  • RAG pipelines should have measurable CI gates for recall, faithfulness, and relevance so quality regressions block release.
  • Tracing, evaluation, cost monitoring, and release control must work together before LLM systems can be treated as production-grade.