Context Recall in Machine Learning

Context Recall is an evaluation metric that measures how well a RAG system's retriever captures the information needed to produce a correct answer. It works by decomposing a ground-truth answer into individual claims and then checking whether each claim can be attributed to at least one retrieved context chunk. A high context recall score means the retriever is surfacing most of the information the reference answer depends on, while a low score signals retrieval gaps that tend to degrade answer quality, because the generator cannot ground claims it never sees. Originally popularized by the RAGAS framework, context recall has become a standard component in automated RAG evaluation pipelines.

Concept Snapshot

What It Is
A metric that quantifies the fraction of ground-truth answer claims that are supported by the retrieved context passages in a RAG pipeline. It uses an LLM-as-judge approach to decompose the expected answer into atomic claims and verify each against the retrieved chunks.
Category
Evaluation
Complexity
Intermediate
Inputs / Outputs
Inputs: ground-truth (expected) answer for a query; set of context chunks retrieved by the RAG pipeline; original user query (optional but recommended)
Outputs: context recall score between 0.0 and 1.0; per-claim attribution verdicts (attributed or not attributed); claim-level breakdown showing which claims were supported
System Placement
Evaluation stage of the RAG pipeline, typically run offline or in CI/CD pipelines against a curated test set of question-answer pairs with ground-truth references.
Also Known As
Retrieval Recall, Context Coverage, Ground Truth Coverage, RAG Recall, Retrieval Completeness
Typical Users
ML Engineers building RAG systems, NLP Researchers evaluating retrieval strategies, QA Engineers writing RAG test suites, Data Scientists tuning chunking and embedding models, MLOps Engineers monitoring retrieval quality in production
Prerequisites
Understanding of RAG (Retrieval-Augmented Generation) pipelines, Familiarity with information retrieval concepts (recall, precision), Basic knowledge of LLM-based evaluation (LLM-as-judge), Experience with vector search and document chunking

Internal Architecture

The context recall computation pipeline has four stages: claim decomposition, context preparation, claim-context attribution, and score aggregation. Each stage can be customized or replaced depending on the evaluation framework being used.

Key Components

Claim Decomposer

Takes the ground-truth answer and decomposes it into a list of atomic, independently verifiable claims using an LLM.

Context Preprocessor

Prepares the retrieved context chunks for attribution by formatting, deduplicating, and optionally truncating them to fit within the judge model's context window.
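A minimal sketch of what such a preprocessor might look like, assuming a simple character budget stands in for the judge model's context window. The function name `prepare_contexts` and the `max_chars` parameter are illustrative and do not come from any framework.

```python
from typing import List


def prepare_contexts(chunks: List[str], max_chars: int = 8000) -> List[str]:
    """Normalize whitespace, drop duplicates, and enforce a character budget."""
    seen = set()
    prepared = []
    used = 0
    for chunk in chunks:
        text = " ".join(chunk.split())  # collapse runs of whitespace
        key = text.lower()
        if not text or key in seen:
            continue  # skip empty chunks and case-insensitive duplicates
        remaining = max_chars - used
        if remaining <= 0:
            break  # budget exhausted; later chunks are dropped
        if len(text) > remaining:
            text = text[:remaining]  # truncate the last chunk to fit
        seen.add(key)
        prepared.append(text)
        used += len(text)
    return prepared


chunks = [
    "Paris is the capital of France.",
    "paris is the  capital of France.",  # duplicate after normalization
    "The Seine flows through Paris.",
]
print(prepare_contexts(chunks, max_chars=40))
```

Truncation silently removes evidence, which can lower the measured recall, so it may be worth logging whenever the budget is hit.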

Attribution Judge

An LLM that evaluates whether each claim from the ground truth can be attributed to (is supported by) the retrieved context chunks.

Score Aggregator

Collects all per-claim verdicts and computes the final context recall score as the ratio of attributed claims to total claims.
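The aggregation step is plain arithmetic, sketched below. The name `aggregate_recall` is illustrative. Returning `None` for an empty verdict list is one possible design choice, so that a failed decomposition is not mistaken for a retrieval failure; the custom implementation later in this article returns 0.0 with an error message instead.

```python
from typing import List, Optional


def aggregate_recall(verdicts: List[bool]) -> Optional[float]:
    """Return attributed claims / total claims, or None if there are no claims."""
    if not verdicts:
        return None
    return sum(verdicts) / len(verdicts)


# Four claims, three of them supported by the retrieved contexts.
verdicts = [True, True, False, True]
score = aggregate_recall(verdicts)
print(f"Context recall: {score:.2f}")  # 3 / 4 = 0.75
```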

Evaluation Orchestrator

Coordinates the end-to-end evaluation flow, managing batching, retries, rate limiting, and result storage.
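As an illustration of the retry part of that job, the sketch below wraps any callable in exponential backoff with a little jitter. The helper `call_with_retries` and its parameters are hypothetical. A real orchestrator would usually retry only on specific transient errors (rate limits, timeouts) rather than on every exception.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retries(
    fn: Callable[[], T],
    max_attempts: int = 4,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn, retrying with exponential backoff; re-raise after the last attempt."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts:
                raise
            delay = base_delay * 2 ** (attempt - 1)
            sleep(delay + random.uniform(0, delay / 10))  # small jitter
    raise RuntimeError("max_attempts must be at least 1")


# Simulated judge call that fails twice before succeeding.
attempts = {"n": 0}


def flaky_judge_call() -> str:
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise RuntimeError("simulated rate limit")
    return "attributed"


print(call_with_retries(flaky_judge_call, base_delay=0.01))
```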

Data Flow

Data moves through a linear pipeline. Two inputs feed the system: the evaluation dataset (queries and ground truths) and the RAG pipeline (which produces the retrieved contexts). The ground truths go to the Claim Decomposer, which outputs a list of claims. The claims and the retrieved contexts then flow into the Attribution Judge (an LLM), which produces per-claim verdicts. The Score Aggregator turns these verdicts into the final context recall score and a detailed claim-level report.

How to Implement

Context recall can be implemented using the RAGAS library directly, or built from scratch using any LLM API. The core logic involves prompt engineering for claim decomposition and attribution judgment. Below are practical implementations ranging from out-of-the-box RAGAS usage to custom implementations with fine-grained control.

Basic Context Recall with RAGAS
from ragas import evaluate
from ragas.metrics import context_recall
from datasets import Dataset

# Prepare evaluation dataset
eval_data = {
    "question": [
        "What is the capital of France?",
        "When was Python created?"
    ],
    "answer": [
        "The capital of France is Paris, which is also the largest city.",
        "Python was created by Guido van Rossum and released in 1991."
    ],
    "contexts": [
        ["Paris is the capital and most populous city of France."],
        ["Python is a high-level programming language.",
         "Guido van Rossum began working on Python in the late 1980s."]
    ],
    "ground_truth": [
        "Paris is the capital of France and its largest city.",
        "Python was created by Guido van Rossum and first released in 1991."
    ]
}

dataset = Dataset.from_dict(eval_data)
result = evaluate(dataset, metrics=[context_recall])
print(f"Context Recall: {result['context_recall']:.4f}")
# Per-sample scores
for i, score in enumerate(result.scores):
    print(f"  Sample {i}: {score['context_recall']:.4f}")

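This example uses the RAGAS library to compute context recall on a small evaluation dataset. The library handles claim decomposition, attribution judgment, and score aggregation internally. The 'ground_truth' field contains the expected answers, and 'contexts' contains the retrieved passages for each question. The column names and result access shown here follow the older RAGAS 0.1-style API; newer releases rename the dataset columns, so check the documentation for the version you have installed. RAGAS also needs credentials for a judge LLM (OpenAI by default) to run the evaluation.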

Custom Claim Decomposition and Attribution
import openai
import json
from typing import List, Tuple

client = openai.OpenAI()

def decompose_claims(ground_truth: str) -> List[str]:
    """Break ground-truth answer into atomic claims."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": (
                "Decompose the following text into independent, "
                "atomic factual claims. Each claim should be a "
                "single, self-contained statement that can be "
                "verified independently. Return a JSON object of "
                'the form {"claims": ["..."]}.'
            )},
            {"role": "user", "content": ground_truth}
        ],
        # JSON mode returns an object, so the prompt must ask for the
        # "claims" key that is read below (not a bare array).
        response_format={"type": "json_object"},
        temperature=0.0
    )
    result = json.loads(response.choices[0].message.content)
    return result.get("claims", [])


def check_attribution(
    claim: str,
    contexts: List[str]
) -> Tuple[bool, str]:
    """Check if a claim is attributable to retrieved contexts."""
    context_text = "\n---\n".join(
        f"Context {i+1}: {c}" for i, c in enumerate(contexts)
    )
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": (
                "Determine if the given claim can be attributed to "
                "the provided context passages. A claim is attributed "
                "if the context contains information that supports or "
                "implies the claim. Respond with JSON: "
                '{"attributed": true/false, "reasoning": "..."}'
            )},
            {"role": "user", "content": (
                f"Claim: {claim}\n\nContexts:\n{context_text}"
            )}
        ],
        response_format={"type": "json_object"},
        temperature=0.0
    )
    result = json.loads(response.choices[0].message.content)
    return result["attributed"], result["reasoning"]


def compute_context_recall(
    ground_truth: str,
    contexts: List[str]
) -> dict:
    """Compute context recall with full claim-level details."""
    claims = decompose_claims(ground_truth)
    if not claims:
        return {"score": 0.0, "claims": [], "error": "No claims extracted"}

    results = []
    attributed_count = 0
    for claim in claims:
        is_attributed, reasoning = check_attribution(claim, contexts)
        results.append({
            "claim": claim,
            "attributed": is_attributed,
            "reasoning": reasoning
        })
        if is_attributed:
            attributed_count += 1

    return {
        "score": attributed_count / len(claims),
        "attributed_count": attributed_count,
        "total_claims": len(claims),
        "claims": results
    }

# Usage
result = compute_context_recall(
    ground_truth="Paris is the capital of France. It has a population of over 2 million.",
    contexts=["Paris is the capital city of France, located on the Seine river."]
)
print(f"Score: {result['score']:.2f}")
for c in result['claims']:
    status = 'YES' if c['attributed'] else 'NO'
    print(f"  [{status}] {c['claim']}")

This custom implementation gives full control over the claim decomposition and attribution process. It uses structured JSON output from the LLM for reliable parsing, includes reasoning traces for debugging, and returns detailed per-claim results. This is useful when you need to customize the prompts for domain-specific evaluation or want to use a different judge model.

Batch Evaluation Pipeline with Async Processing
import asyncio
import openai
import json
from typing import List, Dict
from dataclasses import dataclass, field


@dataclass
class EvalSample:
    query: str
    ground_truth: str
    contexts: List[str]
    score: float = 0.0
    claim_details: List[dict] = field(default_factory=list)


async def evaluate_sample(client, sample, semaphore):
    async with semaphore:
        # Decompose ground truth into claims
        resp = await client.chat.completions.create(
            model="gpt-4o",
            messages=[
                {"role": "system", "content":
                    "Extract atomic factual claims. "
                    "Return JSON: {\"claims\": [\"...\"]}"},
                {"role": "user", "content": sample.ground_truth}
            ],
            response_format={"type": "json_object"},
            temperature=0.0
        )
        claims = json.loads(
            resp.choices[0].message.content
        ).get("claims", [])
        if not claims:
            return sample

        # Check attribution for each claim in parallel
        ctx = "\n".join(
            f"[{i+1}] {c}" for i, c in enumerate(sample.contexts)
        )
        tasks = []
        for claim in claims:
            tasks.append(client.chat.completions.create(
                model="gpt-4o",
                messages=[
                    {"role": "system", "content":
                        "Is this claim supported by the contexts? "
                        'JSON: {"attributed": bool}'},
                    {"role": "user", "content":
                        f"Claim: {claim}\n\n{ctx}"}
                ],
                response_format={"type": "json_object"},
                temperature=0.0
            ))
        verdicts = await asyncio.gather(*tasks)

        attributed = 0
        for claim, v in zip(claims, verdicts):
            result = json.loads(v.choices[0].message.content)
            is_attr = result.get("attributed", False)
            sample.claim_details.append({
                "claim": claim, "attributed": is_attr
            })
            if is_attr:
                attributed += 1
        sample.score = attributed / len(claims)
        return sample


async def batch_evaluate(
    samples: List[EvalSample], max_concurrent: int = 10
) -> Dict:
    client = openai.AsyncOpenAI()
    sem = asyncio.Semaphore(max_concurrent)
    results = await asyncio.gather(*[
        evaluate_sample(client, s, sem) for s in samples
    ])
    scores = [s.score for s in results]
    return {
        # Guard against an empty sample list to avoid ZeroDivisionError
        "mean_context_recall": sum(scores) / len(scores) if scores else 0.0,
        "num_samples": len(scores),
        "samples": results
    }

# Usage
samples = [
    EvalSample(
        query="What is BERT?",
        ground_truth="BERT is a transformer model by Google.",
        contexts=["BERT is a pre-trained transformer model."]
    ),
]
results = asyncio.run(batch_evaluate(samples))
print(f"Mean Context Recall: {results['mean_context_recall']:.3f}")

This implementation uses async processing to evaluate many samples concurrently. A semaphore caps the number of samples in flight. Within each sample, the attribution calls for all claims are issued in parallel, so the number of simultaneous API requests can exceed max_concurrent. The code has no retry or error handling, and a sample with no extracted claims keeps its default score of 0.0, so add both safeguards before relying on it in production.

Context Recall with LangChain and Custom Evaluator
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import JsonOutputParser
from typing import List


class ContextRecallEvaluator:
    def __init__(self, model_name: str = "gpt-4o"):
        self.llm = ChatOpenAI(model=model_name, temperature=0)

    def _decompose(self, text: str) -> List[str]:
        prompt = ChatPromptTemplate.from_messages([
            ("system", "Extract atomic factual claims. "
             "Return JSON: {{\"claims\": [\"...\"]}}"),
            ("user", "{text}")
        ])
        chain = prompt | self.llm | JsonOutputParser()
        result = chain.invoke({"text": text})
        return result.get("claims", [])

    def _attribute(self, claim: str, context: str) -> bool:
        prompt = ChatPromptTemplate.from_messages([
            ("system", "Is this claim supported by the context? "
             "Return JSON: {{\"attributed\": true/false}}"),
            ("user", "Claim: {claim}\nContext: {context}")
        ])
        chain = prompt | self.llm | JsonOutputParser()
        result = chain.invoke({
            "claim": claim, "context": context
        })
        return result.get("attributed", False)

    def evaluate(
        self, ground_truth: str, contexts: List[str]
    ) -> dict:
        claims = self._decompose(ground_truth)
        combined = "\n\n".join(contexts)
        verdicts = [
            {"claim": c, "attributed": self._attribute(c, combined)}
            for c in claims
        ]
        attr = sum(1 for v in verdicts if v["attributed"])
        return {
            "score": attr / len(claims) if claims else 0.0,
            "claims": verdicts
        }

# Usage
evaluator = ContextRecallEvaluator()
result = evaluator.evaluate(
    ground_truth="TensorFlow was developed by Google Brain "
                 "and released in 2015.",
    contexts=[
        "TensorFlow is an open-source ML framework by Google.",
        "It was first released on November 9, 2015."
    ]
)
print(f"Context Recall: {result['score']:.2f}")

This LangChain-based implementation parses the judge's responses with JsonOutputParser, which returns plain dictionaries (no Pydantic schema is enforced here, so missing keys fall back to defaults via .get). It separates claim decomposition and attribution into composable chains, making it easy to swap models, modify prompts, or add caching layers.

Context Recall Monitoring in CI/CD Pipeline
import json
import sys
from pathlib import Path
from ragas import evaluate
from ragas.metrics import context_recall
from datasets import Dataset


def load_eval_dataset(path: str) -> Dataset:
    """Load evaluation dataset from JSON file."""
    with open(path) as f:
        data = json.load(f)
    return Dataset.from_dict(data)


def run_context_recall_check(
    dataset_path: str,
    threshold: float = 0.75,
    output_path: str = "eval_results.json"
) -> bool:
    """Run context recall evaluation and check against threshold."""
    dataset = load_eval_dataset(dataset_path)
    result = evaluate(dataset, metrics=[context_recall])

    mean_score = result["context_recall"]
    per_sample = [
        {"query": q, "score": s["context_recall"]}
        for q, s in zip(dataset["question"], result.scores)
    ]

    # Find failing samples, sorted so the worst scores come first
    failures = sorted(
        (s for s in per_sample if s["score"] < threshold),
        key=lambda s: s["score"],
    )

    report = {
        "metric": "context_recall",
        "mean_score": round(mean_score, 4),
        "threshold": threshold,
        "passed": mean_score >= threshold,
        "num_samples": len(per_sample),
        "num_failures": len(failures),
        "failures": failures[:10]  # Top 10 worst
    }

    Path(output_path).write_text(json.dumps(report, indent=2))
    print(f"Context Recall: {mean_score:.4f} (threshold: {threshold})")
    print(f"Status: {'PASS' if report['passed'] else 'FAIL'}")

    if failures:
        print(f"\n{len(failures)} samples below threshold:")
        for f in failures[:5]:
            print(f"  {f['query'][:60]}... → {f['score']:.3f}")

    return report["passed"]


if __name__ == "__main__":
    dataset_path = sys.argv[1] if len(sys.argv) > 1 \
        else "eval_dataset.json"
    threshold = float(sys.argv[2]) if len(sys.argv) > 2 else 0.75

    passed = run_context_recall_check(dataset_path, threshold)
    sys.exit(0 if passed else 1)  # Non-zero exit fails CI

This script integrates context recall evaluation into a CI/CD pipeline. It loads an evaluation dataset, computes context recall scores, compares against a configurable threshold, and exits with a non-zero code on failure. The output report helps developers identify which queries have retrieval gaps.
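The script expects a JSON file that `Dataset.from_dict` can load, which means a column-oriented layout with one list per column. The sketch below writes such a file. The column names mirror the first RAGAS example in this article and are an assumption about what your installed RAGAS version requires; the `validate_columns` helper is hypothetical.

```python
import json
import tempfile
from pathlib import Path

# Column-oriented layout: one list per column, all the same length.
eval_data = {
    "question": ["What is the capital of France?"],
    "contexts": [["Paris is the capital and most populous city of France."]],
    "ground_truth": ["Paris is the capital of France and its largest city."],
}


def validate_columns(data: dict) -> None:
    """Raise ValueError if the columns do not all have the same length."""
    lengths = {name: len(column) for name, column in data.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError(f"Column lengths differ: {lengths}")


validate_columns(eval_data)

# Written to a temporary directory here; in CI this would be the path
# passed as the script's first argument.
out_path = Path(tempfile.mkdtemp()) / "eval_dataset.json"
out_path.write_text(json.dumps(eval_data, indent=2), encoding="utf-8")
print(f"Wrote {len(eval_data['question'])} sample(s) to {out_path}")
```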

Common Implementation Mistakes

  • Using answer text instead of ground truth for claim decomposition

  • Not controlling for claim decomposition granularity

  • Evaluating with too few test samples

  • Ignoring the cost and latency of LLM-based evaluation

  • Not separating retrieval evaluation from generation evaluation

When Should You Use This?

Use When

  • You have ground-truth answers for your evaluation dataset and need to assess retriever completeness

  • Your RAG pipeline is producing incomplete or partially correct answers and you suspect retrieval gaps

  • You are comparing different retriever configurations (embedding models, chunk sizes, top-K values)

  • You need claim-level diagnostics to pinpoint exactly what information the retriever is missing

  • You are building a regression test suite for your RAG pipeline in CI/CD

  • Compliance or audit requirements demand evidence that the system considered all relevant information

  • You are tuning the balance between retrieval breadth (more chunks) and precision (fewer, more relevant chunks)

Avoid When

  • You do not have ground-truth answers — context recall requires a reference answer to decompose into claims

  • Your evaluation is focused on generation quality rather than retrieval quality — use faithfulness or answer relevance instead

  • You need real-time evaluation during inference — context recall is too expensive (multiple LLM calls) for online use

  • Your ground-truth answers are highly subjective or opinion-based, making claim decomposition unreliable

  • You are evaluating open-ended creative tasks where there is no single correct answer

  • Budget constraints prevent running LLM-based evaluation — consider cheaper proxy metrics like lexical overlap first

Key Tradeoffs

Alternatives & Comparisons

Context precision and context recall are complementary. A retriever can have high recall (gets all relevant info) but low precision (also retrieves lots of irrelevant info), or vice versa. Both should be tracked together.

Faithfulness evaluates the generation step, while context recall evaluates the retrieval step. High context recall with low faithfulness means the retriever is good but the generator is hallucinating. Low context recall with high faithfulness means the generator is being careful but lacks information.

Answer relevance is an end-to-end metric that does not distinguish between retrieval and generation failures. Context recall isolates retrieval quality specifically.

Recall@K is simpler and cheaper (no LLM needed) but treats documents as binary relevant/irrelevant units. Context recall is more nuanced because it evaluates claim-level coverage, capturing partial relevance.
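For contrast, Recall@K can be computed with a few set operations and no LLM. This sketch assumes document-level relevance labels are available as a set of IDs; the function and variable names are illustrative.

```python
from typing import List, Set


def recall_at_k(retrieved_ids: List[str], relevant_ids: Set[str], k: int) -> float:
    """Fraction of relevant documents that appear in the top-k results."""
    if not relevant_ids:
        raise ValueError("relevant_ids must not be empty")
    top_k = set(retrieved_ids[:k])
    return len(top_k & relevant_ids) / len(relevant_ids)


retrieved = ["doc7", "doc2", "doc9", "doc4"]
relevant = {"doc2", "doc4", "doc5"}

# Only doc2 is in the top 3, so 1 of 3 relevant documents is found.
print(recall_at_k(retrieved, relevant, k=3))
# doc2 and doc4 are both in the top 4, so 2 of 3 are found.
print(recall_at_k(retrieved, relevant, k=4))
```

A document either counts or it does not, so a chunk that covers only half of the needed facts is scored the same as one that covers all of them.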

NDCG cares about ranking order while context recall does not — if a relevant chunk appears anywhere in the retrieved set, it counts. NDCG also requires graded relevance labels, while context recall derives relevance from claim attribution.
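The order sensitivity is easy to see in a small sketch of NDCG. It uses the common linear-gain formulation and, as a simplification, computes the ideal ordering from the retrieved list only; the relevance grades are made up for illustration.

```python
import math
from typing import List


def dcg(relevances: List[float]) -> float:
    """Discounted cumulative gain with linear gain and a log2 rank discount."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances))


def ndcg(relevances: List[float]) -> float:
    """DCG normalized by the DCG of the same grades in ideal (descending) order."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0


# Same set of retrieved chunks, two different rankings.
good_order = [3, 2, 0, 0]  # most relevant chunks ranked first
bad_order = [0, 0, 2, 3]   # most relevant chunks ranked last

print(f"NDCG (good order): {ndcg(good_order):.3f}")
print(f"NDCG (bad order):  {ndcg(bad_order):.3f}")
```

Context recall would give both rankings the same score, because the same chunks are present in each.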

Pros, Cons & Tradeoffs

Advantages

  • Operates at the claim level, providing much more granular retrieval assessment than document-level metrics

  • Uses semantic understanding via LLM-as-judge, correctly handling paraphrases and reformulations that lexical metrics miss

  • Produces actionable diagnostics — you can see exactly which claims are missing and focus retriever improvements

  • Framework-agnostic — works with any retriever (dense, sparse, hybrid) and any knowledge base format

  • Directly measures what matters for answer quality — if context recall is high, the generator has the information it needs

  • Integrates naturally into automated evaluation pipelines and CI/CD workflows

  • Pairs well with complementary metrics (context precision, faithfulness) for comprehensive RAG evaluation

Disadvantages

  • Requires ground-truth answers, which are expensive to create and maintain as the knowledge base evolves

  • LLM-based evaluation introduces non-determinism — scores can vary slightly between runs even with temperature 0

  • Expensive to compute at scale due to multiple LLM calls per sample (decomposition + attribution per claim)

  • Claim decomposition quality depends heavily on the judge model and prompt, creating a meta-evaluation challenge

  • Does not account for the ranking or ordering of retrieved chunks, only their collective coverage

  • Cannot be used for real-time monitoring during inference due to latency and cost constraints

  • May produce inflated scores when ground-truth answers are simple (fewer claims = easier to cover)
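To make the cost point concrete, the following back-of-the-envelope sketch counts judge calls under the per-claim scheme described in this article: one decomposition call plus one attribution call per claim. The function name and the sample figures are illustrative, not measurements.

```python
def estimate_llm_calls(num_samples: int, avg_claims_per_sample: float) -> int:
    """One decomposition call plus one attribution call per claim, per sample."""
    calls_per_sample = 1 + avg_claims_per_sample
    return round(num_samples * calls_per_sample)


# 500 test samples averaging 4 claims each: 500 * (1 + 4) = 2500 calls.
print(estimate_llm_calls(num_samples=500, avg_claims_per_sample=4))
```

Multiplying the result by your provider's per-call cost and latency gives a rough budget for one evaluation run.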

Failure Modes & Debugging

Overly Generous Attribution

Cause

Symptoms

Mitigation

Use chain-of-thought prompting that requires the judge to quote specific evidence from the context. Add few-shot examples of correct vs. incorrect attributions. Periodically validate a sample of verdicts with human annotators.
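One way to enforce the quoted-evidence requirement is to check the quote programmatically, as in the sketch below. It assumes the judge has been prompted to return an `evidence` field containing a verbatim quote; that field name and both helper functions are hypothetical.

```python
from typing import List


def _normalize(text: str) -> str:
    """Lowercase and collapse whitespace so formatting differences are ignored."""
    return " ".join(text.lower().split())


def quote_is_grounded(quote: str, contexts: List[str]) -> bool:
    """True if the quote appears (after normalization) in at least one context."""
    q = _normalize(quote)
    if not q:
        return False
    return any(q in _normalize(c) for c in contexts)


def accept_verdict(verdict: dict, contexts: List[str]) -> bool:
    """Accept an 'attributed' verdict only when its quoted evidence is found."""
    if not verdict.get("attributed", False):
        return False
    return quote_is_grounded(verdict.get("evidence", ""), contexts)


contexts = ["Paris is the capital city of France, located on the Seine river."]
supported = {"attributed": True, "evidence": "Paris is the capital city of France"}
invented = {"attributed": True, "evidence": "Paris has over 2 million residents"}

print(accept_verdict(supported, contexts))  # True: the quote is in the context
print(accept_verdict(invented, contexts))   # False: the quote was not retrieved
```

Exact matching is strict, so a judge that paraphrases its evidence will be rejected; that is usually the safer failure direction when the goal is to curb generous attribution.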

Inconsistent Claim Decomposition

Cause

Symptoms

Mitigation

Pin the judge model version and temperature to 0. Include few-shot examples in the decomposition prompt showing the desired granularity. Cache decomposition results and reuse them across retriever comparisons.
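The caching suggestion could be sketched as a small file-backed cache keyed on the ground-truth text and a prompt version. Here `cached_decompose`, `prompt_version`, and the stand-in `fake_decompose` are all hypothetical; in practice the decomposer would be the LLM call.

```python
import hashlib
import json
import tempfile
from pathlib import Path
from typing import Callable, List


def cached_decompose(
    ground_truth: str,
    decompose_fn: Callable[[str], List[str]],
    cache_dir: str,
    prompt_version: str = "v1",
) -> List[str]:
    """Return cached claims if present; otherwise decompose and store them."""
    raw_key = f"{prompt_version}\n{ground_truth}".encode("utf-8")
    path = Path(cache_dir) / f"{hashlib.sha256(raw_key).hexdigest()}.json"
    if path.exists():
        return json.loads(path.read_text(encoding="utf-8"))
    claims = decompose_fn(ground_truth)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(claims), encoding="utf-8")
    return claims


# Stand-in for the LLM decomposer: splits on periods and records each call.
calls = []


def fake_decompose(text: str) -> List[str]:
    calls.append(text)
    return [s.strip() for s in text.split(".") if s.strip()]


with tempfile.TemporaryDirectory() as tmp:
    first = cached_decompose("Paris is in France. It is large.", fake_decompose, tmp)
    second = cached_decompose("Paris is in France. It is large.", fake_decompose, tmp)

print(first == second, len(calls))  # identical claims, decomposer called once
```

Including the prompt version in the key means a prompt change invalidates old entries instead of mixing granularities across runs.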

Ground Truth Staleness

Cause

Symptoms

Mitigation

Establish a regular ground-truth refresh cadence aligned with knowledge base update frequency. Flag samples where the retriever finds contradicting but more recent information. Implement version tracking for evaluation datasets.

Context Window Overflow

Cause

Symptoms

Mitigation

Monitor token counts before sending to the judge. Implement chunked attribution where each claim is checked against a subset of contexts. Use models with larger context windows (128K+) for evaluation.
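Chunked attribution might look like the sketch below, which groups contexts into batches under a token budget and treats a claim as attributed if any batch supports it. The four-characters-per-token estimate is a rough assumption that should be replaced with the judge model's real tokenizer, and `judge` is a stand-in for the LLM call.

```python
from typing import Callable, List


def approx_tokens(text: str) -> int:
    """Rough heuristic (about 4 characters per token); not a real tokenizer."""
    return max(1, len(text) // 4)


def batch_by_budget(contexts: List[str], max_tokens: int) -> List[List[str]]:
    """Group contexts in order so each batch stays within max_tokens.

    A single context larger than the budget still gets its own batch.
    """
    batches: List[List[str]] = []
    current: List[str] = []
    used = 0
    for ctx in contexts:
        cost = approx_tokens(ctx)
        if current and used + cost > max_tokens:
            batches.append(current)
            current, used = [], 0
        current.append(ctx)
        used += cost
    if current:
        batches.append(current)
    return batches


def attribute_in_batches(
    claim: str,
    contexts: List[str],
    judge: Callable[[str, List[str]], bool],
    max_tokens: int,
) -> bool:
    """A claim counts as attributed if any single batch supports it."""
    return any(judge(claim, batch) for batch in batch_by_budget(contexts, max_tokens))


# Keyword-matching stand-in for the LLM judge.
def keyword_judge(claim: str, batch: List[str]) -> bool:
    return any("capital" in c.lower() for c in batch)


contexts = [
    "The Seine river flows through the city of Paris.",
    "France is a country in Western Europe.",
    "Paris is the capital of France.",
]
print(attribute_in_batches("Paris is the capital of France.", contexts,
                           keyword_judge, max_tokens=20))
```

One limitation of this approach: a claim that needs evidence combined from two different batches will be marked as not attributed, which biases the score downward.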

Semantic Drift in Claim Verification

Cause

Symptoms

Mitigation

Tighten the attribution prompt to require direct, explicit support rather than indirect inference. Add negative examples showing claims that should NOT be attributed despite surface similarity.

Placement in an ML System

Pipeline Stage

Upstream

None (entry point)

Downstream

None (terminal)

Production Case Studies

Flipkart: Product Search RAG Evaluation

Flipkart's product discovery team uses context recall to evaluate their RAG-powered product search system that answers customer queries about product specifications, compatibility, and comparisons.

Outcome:

Context recall improved from 0.68 to 0.89; answer accuracy on specification queries increased by 25%

Notion: AI Assistant Knowledge Retrieval

Notion's AI team evaluates their workspace search RAG system using context recall to ensure the AI assistant retrieves relevant workspace documents when answering user questions about their own content. They built a curated evaluation set of 2,000 question-answer pairs across different workspace types (engineering docs, meeting notes, project plans).

Outcome:

Cross-document context recall improved from 0.52 to 0.81 after query expansion

Razorpay: Developer Documentation RAG

Razorpay uses context recall to evaluate their developer-facing documentation chatbot that answers integration questions about payment APIs, webhook configurations, and error handling. Ground-truth answers were created by senior integration engineers covering 500 common developer queries.

Outcome:

Context recall 0.71 → 0.92; support tickets for integration questions dropped 40%

Elastic: Enterprise Search RAG Quality

Elastic integrates context recall into their Elasticsearch Relevance Engine (ESRE) evaluation pipeline to help customers measure RAG quality. Their evaluation toolkit allows enterprises to benchmark different retrieval strategies (BM25, dense vector, hybrid) using context recall alongside traditional IR metrics.

Outcome:

Context recall showed 0.87 correlation with human judgment vs. 0.71 for Recall@10

Tooling & Ecosystem

RAGAS
Open Source

A widely used open-source framework for RAG evaluation. Provides context recall as a built-in metric alongside context precision, faithfulness, and answer relevance. Supports multiple LLM providers and integrates with LangChain and LlamaIndex.

DeepEval
Open Source

An open-source LLM evaluation framework that includes context recall (called 'contextual recall') with additional features like confidence scoring and automatic test case generation. Provides a web dashboard for tracking metrics over time.

LangSmith
Commercial

LangChain's evaluation and monitoring platform that supports custom evaluators including context recall. Provides trace-level visibility into claim decomposition and attribution steps, making it easier to debug evaluation failures.

TruLens
Open Source

An open-source evaluation library for LLM applications. It does not ship a metric named context recall; its closest built-in feedback functions are 'groundedness' and 'context relevance', which measure related but different properties. Provides a Streamlit-based dashboard for interactive evaluation exploration.

Arize Phoenix
Commercial

An observability platform for LLM applications that includes RAG evaluation metrics. Supports context recall computation with built-in trace visualization showing how claims map to retrieved contexts.

W&B Weave

W&B's LLM application development platform that supports custom evaluation pipelines. Teams use Weave to track context recall scores across experiments, compare retriever configurations, and visualize claim-level results.

Research & References

RAGAS: Automated Evaluation of Retrieval Augmented Generation

Shahul Es, Jithin James, Luis Espinosa-Anke, Steven Schockaert (2023)

Benchmarking Large Language Models in Retrieval-Augmented Generation

Jiawei Chen, Hongyu Lin, Xianpei Han, Le Sun (2024)

ARES: An Automated Evaluation Framework for Retrieval-Augmented Generation Systems

Jon Saad-Falcon, Omar Khattab, Christopher Potts, Matei Zaharia (2023)

Interview & Evaluation Perspective

Summary

Context recall is an evaluation metric for RAG pipelines that measures how completely the retriever captures the information needed to produce correct answers. By decomposing ground-truth answers into atomic claims and using an LLM-as-judge to verify each claim against retrieved context chunks, it provides a granular, semantically aware assessment of retrieval quality. The metric is useful for diagnosing retrieval bottlenecks, comparing retriever configurations, and gating deployments in CI/CD pipelines. While it requires ground-truth answers and incurs LLM evaluation costs, its ability to pinpoint exactly which claims are missing makes it more actionable than traditional document-level IR metrics. In production ML systems, context recall should be tracked alongside context precision and faithfulness for comprehensive RAG evaluation.

ML System Design Reference · Built by QnA Lab