Context Recall in Machine Learning
Context Recall is an evaluation metric that measures how well the retriever in a retrieval-augmented generation (RAG) system captures the information needed to produce a correct answer. It works by decomposing a ground-truth answer into individual claims and then checking whether each claim can be attributed to at least one retrieved context chunk. A high context recall score means the retriever is surfacing most of the relevant information from the knowledge base, while a low score signals retrieval gaps that tend to degrade answer quality, because the generator cannot ground claims in context it never received. Originally popularized by the RAGAS framework, context recall has become a standard component in automated RAG evaluation pipelines across production ML systems.
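The definition above can be written as a simple ratio. The notation here is illustrative rather than taken from any particular framework: $C$ is the set of claims extracted from the ground-truth answer and $R$ is the set of retrieved chunks.

```latex
\[
\text{ContextRecall}
  = \frac{\bigl|\{\, c \in C : c \text{ is attributable to at least one chunk in } R \,\}\bigr|}{|C|}
\]
% Worked example: a ground truth with |C| = 3 claims, 2 of which are supported by R:
\[
\text{ContextRecall} = \frac{2}{3} \approx 0.67
\]
```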
Concept Snapshot
- What It Is
- A metric that quantifies the fraction of ground-truth answer claims that are supported by the retrieved context passages in a RAG pipeline. It uses an LLM-as-judge approach to decompose the expected answer into atomic claims and verify each against the retrieved chunks.
- Category
- Evaluation
- Complexity
- Intermediate
- Inputs / Outputs
- Inputs: the ground-truth (expected) answer for a query; the set of context chunks retrieved by the RAG pipeline; the original user query (optional but recommended). Outputs: a context recall score between 0.0 and 1.0; per-claim attribution verdicts (attributed or not attributed); a claim-level breakdown showing which claims were supported.
- System Placement
- Evaluation stage of the RAG pipeline, typically run offline or in CI/CD pipelines against a curated test set of question-answer pairs with ground-truth references.
- Also Known As
- Retrieval Recall, Context Coverage, Ground Truth Coverage, RAG Recall, Retrieval Completeness
- Typical Users
- ML Engineers building RAG systems, NLP Researchers evaluating retrieval strategies, QA Engineers writing RAG test suites, Data Scientists tuning chunking and embedding models, MLOps Engineers monitoring retrieval quality in production
- Prerequisites
- Understanding of RAG (Retrieval-Augmented Generation) pipelines, Familiarity with information retrieval concepts (recall, precision), Basic knowledge of LLM-based evaluation (LLM-as-judge), Experience with vector search and document chunking
Internal Architecture
The context recall computation pipeline has four stages: claim decomposition, context preparation, claim-context attribution, and score aggregation. Each stage can be customized or replaced depending on the evaluation framework being used.
Key Components
Claim Decomposer
Takes the ground-truth answer and decomposes it into a list of atomic, independently verifiable claims using an LLM.
Context Preprocessor
Prepares the retrieved context chunks for attribution by formatting, deduplicating, and optionally truncating them to fit within the judge model's context window.
Attribution Judge
An LLM that evaluates whether each claim from the ground truth can be attributed to (is supported by) the retrieved context chunks.
Score Aggregator
Collects all per-claim verdicts and computes the final context recall score as the ratio of attributed claims to total claims.
Evaluation Orchestrator
Coordinates the end-to-end evaluation flow, managing batching, retries, rate limiting, and result storage.
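As a rough illustration of the Context Preprocessor role described above, the sketch below deduplicates chunks and caps their combined size. The function name `prepare_contexts` and the character budget are assumptions made for this example; a real preprocessor would more likely count tokens with the judge model's tokenizer.

```python
from typing import List


def prepare_contexts(chunks: List[str], max_chars: int = 8000) -> List[str]:
    """Deduplicate chunks and cap their total length.

    Hypothetical helper: a character budget stands in for a token budget.
    """
    seen = set()
    prepared: List[str] = []
    used = 0
    for chunk in chunks:
        text = " ".join(chunk.split())  # collapse runs of whitespace
        key = text.lower()              # case-insensitive duplicate check
        if not text or key in seen:
            continue
        remaining = max_chars - used
        if remaining <= 0:
            break
        if len(text) > remaining:
            text = text[:remaining]     # truncate the last chunk to fit
        seen.add(key)
        prepared.append(text)
        used += len(text)
    return prepared


chunks = [
    "Paris is the capital of France.",
    "paris is   the capital of France.",  # duplicate after normalization
    "The Seine flows through Paris.",
]
print(prepare_contexts(chunks, max_chars=40))
```

Truncating mid-chunk can cut away the very sentence that supports a claim, so the budget is best set well above the typical retrieved context size.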
Data Flow
Two inputs feed the pipeline: the evaluation dataset, which supplies the queries and ground-truth answers, and the RAG pipeline under test, which supplies the retrieved contexts for each query. The ground-truth answer goes to the Claim Decomposer, which outputs a list of claims. The claims and the retrieved contexts then go to the Attribution Judge (an LLM), which produces a verdict for each claim. The Score Aggregator collects these verdicts and outputs the final context recall score together with a detailed claim-level report.
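The flow described above can be sketched with the decomposer and judge passed in as plain functions. Everything named here (`context_recall_pipeline`, `sentence_decomposer`, `substring_judge`) is hypothetical, and the two stubs are deliberately naive stand-ins for LLM calls so the example runs offline.

```python
from typing import Callable, Dict, List

Decomposer = Callable[[str], List[str]]
Judge = Callable[[str, List[str]], bool]


def context_recall_pipeline(
    ground_truth: str,
    contexts: List[str],
    decompose: Decomposer,
    judge: Judge,
) -> Dict:
    """Decompose, attribute each claim, then aggregate into a score."""
    claims = decompose(ground_truth)
    verdicts = [
        {"claim": claim, "attributed": judge(claim, contexts)}
        for claim in claims
    ]
    supported = sum(1 for v in verdicts if v["attributed"])
    score = supported / len(claims) if claims else 0.0
    return {"score": score, "verdicts": verdicts}


def sentence_decomposer(text: str) -> List[str]:
    """Stub: treat each sentence as one claim (an LLM would do this step)."""
    return [s.strip() for s in text.split(".") if s.strip()]


def substring_judge(claim: str, contexts: List[str]) -> bool:
    """Stub: exact substring match (an LLM judge would handle paraphrase)."""
    return any(claim.lower() in c.lower() for c in contexts)


report = context_recall_pipeline(
    "Paris is the capital of France. Paris has over 2 million residents.",
    ["Paris is the capital of France, located on the Seine."],
    sentence_decomposer,
    substring_judge,
)
print(report["score"])
```

Keeping the stages as injectable callables makes it possible to swap the judge model or unit-test the aggregation logic without any API calls.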
How to Implement
Context recall can be implemented using the RAGAS library directly, or built from scratch using any LLM API. The core logic involves prompt engineering for claim decomposition and attribution judgment. Below are practical implementations ranging from out-of-the-box RAGAS usage to custom implementations with fine-grained control.
from ragas import evaluate
from ragas.metrics import context_recall
from datasets import Dataset

# Prepare evaluation dataset
eval_data = {
    "question": [
        "What is the capital of France?",
        "When was Python created?"
    ],
    "answer": [
        "The capital of France is Paris, which is also the largest city.",
        "Python was created by Guido van Rossum and released in 1991."
    ],
    "contexts": [
        ["Paris is the capital and most populous city of France."],
        ["Python is a high-level programming language.",
         "Guido van Rossum began working on Python in the late 1980s."]
    ],
    "ground_truth": [
        "Paris is the capital of France and its largest city.",
        "Python was created by Guido van Rossum and first released in 1991."
    ]
}

dataset = Dataset.from_dict(eval_data)
result = evaluate(dataset, metrics=[context_recall])
print(f"Context Recall: {result['context_recall']:.4f}")

# Per-sample scores
for i, score in enumerate(result.scores):
    print(f"  Sample {i}: {score['context_recall']:.4f}")

This example uses the RAGAS library to compute context recall on a small evaluation dataset. The library handles claim decomposition, attribution judgment, and score aggregation internally. The 'ground_truth' field contains the expected answers, and 'contexts' contains the retrieved passages for each question. Column names and the shape of the result object differ between RAGAS releases, so check them against the version you have installed.
import openai
import json
from typing import List, Tuple

client = openai.OpenAI()


def decompose_claims(ground_truth: str) -> List[str]:
    """Break ground-truth answer into atomic claims."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": (
                "Decompose the following text into independent, "
                "atomic factual claims. Each claim should be a "
                "single, self-contained statement that can be "
                "verified independently. Return a JSON object of "
                'the form {"claims": ["..."]}.'
            )},
            {"role": "user", "content": ground_truth}
        ],
        # JSON mode yields an object, so the prompt asks for a "claims" key
        response_format={"type": "json_object"},
        temperature=0.0
    )
    result = json.loads(response.choices[0].message.content)
    return result.get("claims", [])


def check_attribution(
    claim: str,
    contexts: List[str]
) -> Tuple[bool, str]:
    """Check if a claim is attributable to retrieved contexts."""
    context_text = "\n---\n".join(
        f"Context {i+1}: {c}" for i, c in enumerate(contexts)
    )
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": (
                "Determine if the given claim can be attributed to "
                "the provided context passages. A claim is attributed "
                "if the context contains information that supports or "
                "implies the claim. Respond with JSON: "
                '{"attributed": true/false, "reasoning": "..."}'
            )},
            {"role": "user", "content": (
                f"Claim: {claim}\n\nContexts:\n{context_text}"
            )}
        ],
        response_format={"type": "json_object"},
        temperature=0.0
    )
    result = json.loads(response.choices[0].message.content)
    # Treat a missing key as "not attributed" rather than raising KeyError
    return bool(result.get("attributed", False)), result.get("reasoning", "")


def compute_context_recall(
    ground_truth: str,
    contexts: List[str]
) -> dict:
    """Compute context recall with full claim-level details."""
    claims = decompose_claims(ground_truth)
    if not claims:
        return {"score": 0.0, "claims": [], "error": "No claims extracted"}
    results = []
    attributed_count = 0
    for claim in claims:
        is_attributed, reasoning = check_attribution(claim, contexts)
        results.append({
            "claim": claim,
            "attributed": is_attributed,
            "reasoning": reasoning
        })
        if is_attributed:
            attributed_count += 1
    return {
        "score": attributed_count / len(claims),
        "attributed_count": attributed_count,
        "total_claims": len(claims),
        "claims": results
    }


# Usage
result = compute_context_recall(
    ground_truth="Paris is the capital of France. It has a population of over 2 million.",
    contexts=["Paris is the capital city of France, located on the Seine river."]
)
print(f"Score: {result['score']:.2f}")
for c in result['claims']:
    status = 'YES' if c['attributed'] else 'NO'
    print(f"  [{status}] {c['claim']}")

This custom implementation gives full control over the claim decomposition and attribution process. It uses structured JSON output from the LLM for reliable parsing, includes reasoning traces for debugging, and returns detailed per-claim results. This is useful when you need to customize the prompts for domain-specific evaluation or want to use a different judge model.
import asyncio
import openai
import json
from typing import List, Dict
from dataclasses import dataclass, field


@dataclass
class EvalSample:
    query: str
    ground_truth: str
    contexts: List[str]
    score: float = 0.0
    claim_details: List[dict] = field(default_factory=list)


async def evaluate_sample(client, sample, semaphore):
    async with semaphore:
        # Decompose ground truth into claims
        resp = await client.chat.completions.create(
            model="gpt-4o",
            messages=[
                {"role": "system", "content":
                    "Extract atomic factual claims. "
                    "Return JSON: {\"claims\": [\"...\"]}"},
                {"role": "user", "content": sample.ground_truth}
            ],
            response_format={"type": "json_object"},
            temperature=0.0
        )
        claims = json.loads(
            resp.choices[0].message.content
        ).get("claims", [])
        if not claims:
            return sample
        # Check attribution for each claim in parallel
        ctx = "\n".join(
            f"[{i+1}] {c}" for i, c in enumerate(sample.contexts)
        )
        tasks = []
        for claim in claims:
            tasks.append(client.chat.completions.create(
                model="gpt-4o",
                messages=[
                    {"role": "system", "content":
                        "Is this claim supported by the contexts? "
                        'JSON: {"attributed": bool}'},
                    {"role": "user", "content":
                        f"Claim: {claim}\n\n{ctx}"}
                ],
                response_format={"type": "json_object"},
                temperature=0.0
            ))
        verdicts = await asyncio.gather(*tasks)
        attributed = 0
        for claim, v in zip(claims, verdicts):
            result = json.loads(v.choices[0].message.content)
            is_attr = result.get("attributed", False)
            sample.claim_details.append({
                "claim": claim, "attributed": is_attr
            })
            if is_attr:
                attributed += 1
        sample.score = attributed / len(claims)
        return sample


async def batch_evaluate(
    samples: List[EvalSample], max_concurrent: int = 10
) -> Dict:
    client = openai.AsyncOpenAI()
    sem = asyncio.Semaphore(max_concurrent)
    results = await asyncio.gather(*[
        evaluate_sample(client, s, sem) for s in samples
    ])
    scores = [s.score for s in results]
    return {
        # Guard against an empty sample list
        "mean_context_recall": sum(scores) / len(scores) if scores else 0.0,
        "num_samples": len(scores),
        "samples": results
    }


# Usage
samples = [
    EvalSample(
        query="What is BERT?",
        ground_truth="BERT is a transformer model by Google.",
        contexts=["BERT is a pre-trained transformer model."]
    ),
]
results = asyncio.run(batch_evaluate(samples))
print(f"Mean Context Recall: {results['mean_context_recall']:.3f}")

This asynchronous implementation is intended for evaluating hundreds of samples efficiently. A semaphore limits how many samples are processed at once, which helps stay within API rate limits. Each sample is processed independently, with claim decomposition followed by parallel attribution checks for all claims within a sample. Because the attribution calls for one sample run concurrently, the number of in-flight requests can exceed max_concurrent when samples have many claims.
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import JsonOutputParser
from typing import List


class ContextRecallEvaluator:
    def __init__(self, model_name: str = "gpt-4o"):
        self.llm = ChatOpenAI(model=model_name, temperature=0)

    def _decompose(self, text: str) -> List[str]:
        prompt = ChatPromptTemplate.from_messages([
            ("system", "Extract atomic factual claims. "
                       "Return JSON: {{\"claims\": [\"...\"]}}"),
            ("user", "{text}")
        ])
        chain = prompt | self.llm | JsonOutputParser()
        result = chain.invoke({"text": text})
        return result.get("claims", [])

    def _attribute(self, claim: str, context: str) -> bool:
        prompt = ChatPromptTemplate.from_messages([
            ("system", "Is this claim supported by the context? "
                       "Return JSON: {{\"attributed\": true/false}}"),
            ("user", "Claim: {claim}\nContext: {context}")
        ])
        chain = prompt | self.llm | JsonOutputParser()
        result = chain.invoke({
            "claim": claim, "context": context
        })
        return result.get("attributed", False)

    def evaluate(
        self, ground_truth: str, contexts: List[str]
    ) -> dict:
        claims = self._decompose(ground_truth)
        combined = "\n\n".join(contexts)
        verdicts = [
            {"claim": c, "attributed": self._attribute(c, combined)}
            for c in claims
        ]
        attr = sum(1 for v in verdicts if v["attributed"])
        return {
            "score": attr / len(claims) if claims else 0.0,
            "claims": verdicts
        }


# Usage
evaluator = ContextRecallEvaluator()
result = evaluator.evaluate(
    ground_truth="TensorFlow was developed by Google Brain "
                 "and released in 2015.",
    contexts=[
        "TensorFlow is an open-source ML framework by Google.",
        "It was first released on November 9, 2015."
    ]
)
print(f"Context Recall: {result['score']:.2f}")

This LangChain-based implementation parses the model's JSON responses with JsonOutputParser. It separates claim decomposition and attribution into composable chains, making it easy to swap models, modify prompts, or add caching layers.
import json
import sys
from pathlib import Path
from ragas import evaluate
from ragas.metrics import context_recall
from datasets import Dataset


def load_eval_dataset(path: str) -> Dataset:
    """Load evaluation dataset from JSON file."""
    with open(path) as f:
        data = json.load(f)
    return Dataset.from_dict(data)


def run_context_recall_check(
    dataset_path: str,
    threshold: float = 0.75,
    output_path: str = "eval_results.json"
) -> bool:
    """Run context recall evaluation and check against threshold."""
    dataset = load_eval_dataset(dataset_path)
    result = evaluate(dataset, metrics=[context_recall])
    mean_score = result["context_recall"]
    per_sample = [
        {"query": q, "score": s["context_recall"]}
        for q, s in zip(dataset["question"], result.scores)
    ]
    # Find failing samples, lowest score first
    failures = sorted(
        (s for s in per_sample if s["score"] < threshold),
        key=lambda s: s["score"]
    )
    report = {
        "metric": "context_recall",
        "mean_score": round(mean_score, 4),
        "threshold": threshold,
        "passed": mean_score >= threshold,
        "num_samples": len(per_sample),
        "num_failures": len(failures),
        "failures": failures[:10]  # Top 10 worst
    }
    Path(output_path).write_text(json.dumps(report, indent=2))
    print(f"Context Recall: {mean_score:.4f} (threshold: {threshold})")
    print(f"Status: {'PASS' if report['passed'] else 'FAIL'}")
    if failures:
        print(f"\n{len(failures)} samples below threshold:")
        for f in failures[:5]:
            print(f"  {f['query'][:60]}... → {f['score']:.3f}")
    return report["passed"]


if __name__ == "__main__":
    dataset_path = sys.argv[1] if len(sys.argv) > 1 else "eval_dataset.json"
    threshold = float(sys.argv[2]) if len(sys.argv) > 2 else 0.75
    passed = run_context_recall_check(dataset_path, threshold)
    sys.exit(0 if passed else 1)  # Non-zero exit fails CI

This script integrates context recall evaluation into a CI/CD pipeline. It loads an evaluation dataset, computes context recall scores, compares the mean score against a configurable threshold, and exits with a non-zero code on failure. The output report lists the lowest-scoring queries, which helps developers identify where retrieval gaps occur.
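The CI script reads its dataset with `Dataset.from_dict`, which expects a column-oriented mapping (one list per column) rather than a list of row objects. The sketch below writes a small file in that layout. The column names mirror the earlier RAGAS example and the `check_columns` helper is hypothetical, so adjust both to the schema your RAGAS version expects.

```python
import json
import tempfile
from pathlib import Path

# Column-oriented layout: every key maps to a list with one entry per sample.
eval_data = {
    "question": ["What is the capital of France?"],
    "answer": ["The capital of France is Paris."],
    "contexts": [["Paris is the capital and most populous city of France."]],
    "ground_truth": ["Paris is the capital of France and its largest city."],
}


def check_columns(data: dict) -> int:
    """Hypothetical sanity check: required columns exist and lengths agree."""
    required = {"question", "contexts", "ground_truth"}  # assumed column set
    missing = required - data.keys()
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")
    lengths = {len(column) for column in data.values()}
    if len(lengths) != 1:
        raise ValueError("all columns must have the same number of rows")
    return lengths.pop()


num_rows = check_columns(eval_data)
out_path = Path(tempfile.mkdtemp()) / "eval_dataset.json"
out_path.write_text(json.dumps(eval_data, indent=2))
print(f"Wrote {num_rows} sample(s) to {out_path}")
```

Validating the file before calling the evaluator fails fast on malformed datasets instead of spending LLM calls first.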
Common Implementation Mistakes
- Using answer text instead of ground truth for claim decomposition
- Not controlling for claim decomposition granularity
- Evaluating with too few test samples
- Ignoring the cost and latency of LLM-based evaluation
- Not separating retrieval evaluation from generation evaluation
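One way to see why a small test set is a problem is to put an interval around the mean score. The sketch below uses a percentile bootstrap over per-sample context recall scores. The function name `bootstrap_mean_ci` and its defaults are assumptions for illustration, and the scores are made-up values.

```python
import random
from typing import List, Tuple


def bootstrap_mean_ci(
    scores: List[float],
    n_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float]:
    """Percentile-bootstrap confidence interval for the mean score."""
    if not scores:
        raise ValueError("scores must not be empty")
    rng = random.Random(seed)  # fixed seed so reruns give the same interval
    n = len(scores)
    means = sorted(
        sum(rng.choices(scores, k=n)) / n  # resample with replacement
        for _ in range(n_resamples)
    )
    lo_idx = int((alpha / 2) * n_resamples)
    hi_idx = min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))
    return means[lo_idx], means[hi_idx]


# Illustrative per-sample scores, not real results
scores = [1.0, 0.5, 0.67, 1.0, 0.0, 0.75, 1.0, 0.5]
low, high = bootstrap_mean_ci(scores)
print(f"mean={sum(scores) / len(scores):.3f}, 95% CI=({low:.3f}, {high:.3f})")
```

With only a handful of samples the interval is typically wide enough that two retriever configurations cannot be told apart, which is a signal to grow the test set before drawing conclusions.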
When Should You Use This?
Use When
You have ground-truth answers for your evaluation dataset and need to assess retriever completeness
Your RAG pipeline is producing incomplete or partially correct answers and you suspect retrieval gaps
You are comparing different retriever configurations (embedding models, chunk sizes, top-K values)
You need claim-level diagnostics to pinpoint exactly what information the retriever is missing
You are building a regression test suite for your RAG pipeline in CI/CD
Compliance or audit requirements demand evidence that the system considered all relevant information
You are tuning the balance between retrieval breadth (more chunks) and precision (fewer, more relevant chunks)
Avoid When
You do not have ground-truth answers — context recall requires a reference answer to decompose into claims
Your evaluation is focused on generation quality rather than retrieval quality — use faithfulness or answer relevance instead
You need real-time evaluation during inference — context recall is too expensive (multiple LLM calls) for online use
Your ground-truth answers are highly subjective or opinion-based, making claim decomposition unreliable
You are evaluating open-ended creative tasks where there is no single correct answer
Budget constraints prevent running LLM-based evaluation — consider cheaper proxy metrics like lexical overlap first
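As a sketch of the cheaper lexical-overlap proxy mentioned in the last point, the function below treats each ground-truth sentence as a claim and counts it as covered when enough of its tokens appear in the retrieved contexts. The name `lexical_recall_proxy` and the 0.6 threshold are arbitrary choices for this example, and the proxy will miss paraphrases that an LLM judge would accept.

```python
import re
from typing import List, Set


def _tokens(text: str) -> Set[str]:
    """Lowercased alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def lexical_recall_proxy(
    ground_truth: str, contexts: List[str], min_overlap: float = 0.6
) -> float:
    """Fraction of ground-truth sentences whose tokens mostly occur in contexts."""
    sentences = [
        s for s in re.split(r"(?<=[.!?])\s+", ground_truth.strip()) if s
    ]
    if not sentences:
        return 0.0
    context_tokens: Set[str] = set().union(*(_tokens(c) for c in contexts))
    covered = 0
    for sentence in sentences:
        toks = _tokens(sentence)
        if toks and len(toks & context_tokens) / len(toks) >= min_overlap:
            covered += 1
    return covered / len(sentences)


score = lexical_recall_proxy(
    "Paris is the capital of France. It has a population of over 2 million.",
    ["Paris is the capital city of France, located on the Seine river."],
)
print(f"Lexical proxy: {score:.2f}")
```

A proxy like this is best used as a cheap first filter, with the LLM-based metric reserved for the samples or releases that matter most.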
Key Tradeoffs
Alternatives & Comparisons
Context precision and context recall are complementary. A retriever can have high recall (gets all relevant info) but low precision (also retrieves lots of irrelevant info), or vice versa. Both should be tracked together.
Faithfulness evaluates the generation step, while context recall evaluates the retrieval step. High context recall with low faithfulness means the retriever is good but the generator is hallucinating. Low context recall with high faithfulness means the generator is being careful but lacks information.
Answer relevance is an end-to-end metric that does not distinguish between retrieval and generation failures. Context recall isolates retrieval quality specifically.
Recall@K is simpler and cheaper (no LLM needed) but treats documents as binary relevant/irrelevant units. Context recall is more nuanced because it evaluates claim-level coverage, capturing partial relevance.
NDCG cares about ranking order while context recall does not — if a relevant chunk appears anywhere in the retrieved set, it counts. NDCG also requires graded relevance labels, while context recall derives relevance from claim attribution.
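The difference between document-level Recall@K and claim-level coverage can be shown on a toy example. The document IDs, relevance labels, and claim-to-document support map below are hand-made for illustration, and in practice the support map would come from the attribution judge.

```python
from typing import Dict, List, Sequence


def recall_at_k(retrieved: Sequence[str], relevant: Sequence[str], k: int) -> float:
    """Fraction of labeled-relevant documents found in the top-k results."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & set(relevant)) / len(relevant)


def claim_coverage(
    retrieved: Sequence[str], claims: List[str], support: Dict[str, List[str]]
) -> float:
    """Fraction of claims supported by at least one retrieved document."""
    if not claims:
        return 0.0
    covered = {c for doc in retrieved for c in support.get(doc, [])}
    return len(covered & set(claims)) / len(claims)


retrieved = ["d1", "d3", "d4"]
relevant = ["d1", "d2"]  # binary relevance labels
claims = ["c1", "c2", "c3"]
support = {"d1": ["c1", "c2"], "d2": ["c3"], "d3": ["c3"]}  # d3 is partially relevant

r_at_3 = recall_at_k(retrieved, relevant, k=3)
coverage = claim_coverage(retrieved, claims, support)
print(f"Recall@3={r_at_3:.2f}, claim coverage={coverage:.2f}")
```

Here the retriever misses the labeled document d2, so Recall@3 is penalized, yet every claim is still covered because the unlabeled document d3 carries the missing fact.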
Pros, Cons & Tradeoffs
Advantages
Operates at the claim level, providing much more granular retrieval assessment than document-level metrics
Uses semantic understanding via LLM-as-judge, correctly handling paraphrases and reformulations that lexical metrics miss
Produces actionable diagnostics — you can see exactly which claims are missing and focus retriever improvements
Framework-agnostic — works with any retriever (dense, sparse, hybrid) and any knowledge base format
Directly measures what matters for answer quality — if context recall is high, the generator has the information it needs
Integrates naturally into automated evaluation pipelines and CI/CD workflows
Pairs well with complementary metrics (context precision, faithfulness) for comprehensive RAG evaluation
Disadvantages
Requires ground-truth answers, which are expensive to create and maintain as the knowledge base evolves
LLM-based evaluation introduces non-determinism — scores can vary slightly between runs even with temperature 0
Expensive to compute at scale due to multiple LLM calls per sample (decomposition + attribution per claim)
Claim decomposition quality depends heavily on the judge model and prompt, creating a meta-evaluation challenge
Does not account for the ranking or ordering of retrieved chunks, only their collective coverage
Cannot be used for real-time monitoring during inference due to latency and cost constraints
May produce inflated scores when ground-truth answers are simple (fewer claims = easier to cover)
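Two of the disadvantages above, cost and inflation from simple ground truths, can be made concrete with a little arithmetic. The sketch below assumes one decomposition call plus one attribution call per claim, as in the custom implementations earlier. The claim-weighted (micro) average is offered as one possible cross-check, not as part of the standard metric, and `summarize` is a hypothetical helper.

```python
from typing import Dict, List, Tuple


def summarize(counts: List[Tuple[int, int]]) -> Dict[str, float]:
    """counts holds (attributed_claims, total_claims) for each sample."""
    scored = [(a, t) for a, t in counts if t > 0]
    if not scored:
        raise ValueError("no samples with at least one claim")
    # Macro: mean of per-sample scores (each sample weighs the same)
    macro = sum(a / t for a, t in scored) / len(scored)
    # Micro: pooled over claims (samples with more claims weigh more)
    micro = sum(a for a, _ in scored) / sum(t for _, t in scored)
    # Assumed cost model: 1 decomposition call + 1 attribution call per claim
    llm_calls = sum(1 + t for _, t in counts)
    return {"macro_recall": macro, "micro_recall": micro, "llm_calls": llm_calls}


# One single-claim sample fully covered, one six-claim sample with 2 claims covered
summary = summarize([(1, 1), (2, 6)])
print(summary)
```

A large gap between the two averages suggests that short, easy ground truths are lifting the headline number.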
Failure Modes & Debugging
Overly Generous Attribution
Cause
Symptoms
Mitigation
Use chain-of-thought prompting that requires the judge to quote specific evidence from the context. Add few-shot examples of correct vs. incorrect attributions. Periodically validate a sample of verdicts with human annotators.
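If the judge is asked to quote its evidence, the quote can be checked mechanically before the verdict is accepted. The sketch below assumes the judge returns a JSON object with a hypothetical `evidence` field next to `attributed`, and it only accepts a positive verdict when the quote occurs in one of the contexts after whitespace and case normalization.

```python
from typing import Dict, List


def _norm(text: str) -> str:
    """Lowercase and collapse whitespace for a tolerant substring match."""
    return " ".join(text.lower().split())


def verify_quoted_evidence(verdict: Dict, contexts: List[str]) -> bool:
    """Accept an 'attributed' verdict only if its quote appears in a context."""
    if not verdict.get("attributed"):
        return False
    quote = _norm(verdict.get("evidence", ""))  # "evidence" is an assumed field
    if not quote:
        return False  # positive verdict without a quote is rejected
    return any(quote in _norm(c) for c in contexts)


contexts = ["Paris is the capital city of France, located on the Seine river."]
good = {"attributed": True, "evidence": "Paris is the capital city of France"}
bad = {"attributed": True, "evidence": "Paris has over 2 million residents"}

print(verify_quoted_evidence(good, contexts))
print(verify_quoted_evidence(bad, contexts))
```

This check catches fabricated quotes but not quotes that are real yet irrelevant to the claim, so periodic human review of a sample remains useful.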
Inconsistent Claim Decomposition
Cause
Symptoms
Mitigation
Pin the judge model version and temperature to 0. Include few-shot examples in the decomposition prompt showing the desired granularity. Cache decomposition results and reuse them across retriever comparisons.
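Caching decomposition results, as suggested above, can be as simple as keying the claims by a hash of the ground truth and the pinned model version. The `ClaimCache` class below is a hypothetical in-memory sketch, and the counting stub stands in for an LLM-backed decomposer.

```python
import hashlib
from typing import Callable, Dict, List


class ClaimCache:
    """Reuse claim decompositions across retriever comparisons."""

    def __init__(self, decompose: Callable[[str], List[str]], model_version: str):
        self._decompose = decompose
        self._model_version = model_version
        self._store: Dict[str, List[str]] = {}

    def _key(self, ground_truth: str) -> str:
        # Include the model version so a judge upgrade invalidates old entries
        payload = f"{self._model_version}\n{ground_truth}"
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def claims_for(self, ground_truth: str) -> List[str]:
        key = self._key(ground_truth)
        if key not in self._store:
            self._store[key] = self._decompose(ground_truth)
        return list(self._store[key])  # copy so callers cannot mutate the cache


calls = {"n": 0}


def counting_decomposer(text: str) -> List[str]:
    """Stub decomposer that records how often it is invoked."""
    calls["n"] += 1
    return [s.strip() for s in text.split(".") if s.strip()]


cache = ClaimCache(counting_decomposer, model_version="judge-v1")
first = cache.claims_for("BERT is a transformer model. It was built by Google.")
second = cache.claims_for("BERT is a transformer model. It was built by Google.")
print(first == second, calls["n"])
```

Because every retriever configuration is then scored against the same claims, differences in context recall reflect retrieval changes rather than decomposition noise.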
Ground Truth Staleness
Cause
Symptoms
Mitigation
Establish a regular ground-truth refresh cadence aligned with knowledge base update frequency. Flag samples where the retriever finds contradicting but more recent information. Implement version tracking for evaluation datasets.
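For the version-tracking part of this mitigation, one lightweight option is to store a content hash of the evaluation dataset next to each set of results. The `dataset_fingerprint` function below is a hypothetical sketch that assumes the dataset is a JSON-serializable dict of columns.

```python
import hashlib
import json


def dataset_fingerprint(data: dict) -> str:
    """SHA-256 over a canonical JSON encoding of the dataset."""
    canonical = json.dumps(
        data, sort_keys=True, ensure_ascii=False, separators=(",", ":")
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


v1 = {
    "question": ["When was Python first released?"],
    "ground_truth": ["Python was first released in 1991."],
}
# Same content, different key order: the fingerprint should not change
v1_reordered = {
    "ground_truth": ["Python was first released in 1991."],
    "question": ["When was Python first released?"],
}
# Edited ground truth: the fingerprint should change
v2 = {
    "question": ["When was Python first released?"],
    "ground_truth": ["Python 0.9.0 was first released in February 1991."],
}

print(dataset_fingerprint(v1)[:12], dataset_fingerprint(v2)[:12])
```

Key order is normalized but row order is not, so reordering samples produces a new fingerprint. Recording the fingerprint with each score makes it clear when two runs were evaluated against different ground truths.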
Context Window Overflow
Cause
Symptoms
Mitigation
Monitor token counts before sending to the judge. Implement chunked attribution where each claim is checked against a subset of contexts. Use models with larger context windows (128K+) for evaluation.
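Chunked attribution can be sketched by packing contexts into batches under a token budget and counting a claim as attributed if any batch supports it. The whitespace-based `approx_tokens` is a crude stand-in for a real tokenizer, and the judge is passed in as a callable so the example runs without an LLM. All names here are hypothetical.

```python
from typing import Callable, List


def approx_tokens(text: str) -> int:
    """Crude token estimate; replace with the judge model's tokenizer."""
    return len(text.split())


def batch_contexts(contexts: List[str], max_tokens: int) -> List[List[str]]:
    """Greedily pack contexts into batches that fit the token budget."""
    batches: List[List[str]] = []
    current: List[str] = []
    used = 0
    for context in contexts:
        n = approx_tokens(context)
        if current and used + n > max_tokens:
            batches.append(current)
            current, used = [], 0
        # A single oversized context still gets its own batch
        current.append(context)
        used += n
    if current:
        batches.append(current)
    return batches


def attribute_in_batches(
    claim: str,
    contexts: List[str],
    judge: Callable[[str, List[str]], bool],
    max_tokens: int,
) -> bool:
    """A claim counts as attributed if any batch of contexts supports it."""
    return any(judge(claim, batch) for batch in batch_contexts(contexts, max_tokens))


def keyword_judge(claim: str, batch: List[str]) -> bool:
    """Stub judge: substring match instead of an LLM call."""
    return any(claim.lower() in c.lower() for c in batch)


contexts = ["alpha beta gamma", "delta epsilon zeta", "eta theta iota", "kappa lambda mu"]
print(batch_contexts(contexts, max_tokens=6))
print(attribute_in_batches("theta", contexts, keyword_judge, max_tokens=6))
```

The tradeoff is that a claim whose support is split across two batches may be judged unsupported, so batches should be as large as the judge's context window comfortably allows.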
Semantic Drift in Claim Verification
Cause
Symptoms
Mitigation
Tighten the attribution prompt to require direct, explicit support rather than indirect inference. Add negative examples showing claims that should NOT be attributed despite surface similarity.
Placement in an ML System
Pipeline Stage
Upstream
None (entry point)
Downstream
None (terminal)
Production Case Studies
Flipkart's product discovery team uses context recall to evaluate their RAG-powered product search system that answers customer queries about product specifications, compatibility, and comparisons.
Context recall improved from 0.68 to 0.89; answer accuracy on specification queries increased by 25%
Notion's AI team evaluates their workspace search RAG system using context recall to ensure the AI assistant retrieves relevant workspace documents when answering user questions about their own content. They built a curated evaluation set of 2,000 question-answer pairs across different workspace types (engineering docs, meeting notes, project plans).
Cross-document context recall improved from 0.52 to 0.81 after query expansion
Razorpay uses context recall to evaluate their developer-facing documentation chatbot that answers integration questions about payment APIs, webhook configurations, and error handling. Ground-truth answers were created by senior integration engineers covering 500 common developer queries.
Context recall 0.71 → 0.92; support tickets for integration questions dropped 40%
Elastic integrates context recall into their Elasticsearch Relevance Engine (ESRE) evaluation pipeline to help customers measure RAG quality. Their evaluation toolkit allows enterprises to benchmark different retrieval strategies (BM25, dense vector, hybrid) using context recall alongside traditional IR metrics.
Context recall showed 0.87 correlation with human judgment vs. 0.71 for Recall@10
Tooling & Ecosystem
The de facto standard framework for RAG evaluation. Provides context recall as a built-in metric alongside context precision, faithfulness, and answer relevance. Supports multiple LLM providers and integrates with LangChain and LlamaIndex.
An open-source LLM evaluation framework that includes context recall (called 'contextual recall') with additional features like confidence scoring and automatic test case generation. Provides a web dashboard for tracking metrics over time.
LangChain's evaluation and monitoring platform that supports custom evaluators including context recall. Provides trace-level visibility into claim decomposition and attribution steps, making it easier to debug evaluation failures.
An open-source evaluation library for LLM applications that includes context recall (as 'groundedness' and 'context relevance' metrics). Provides a Streamlit-based dashboard for interactive evaluation exploration.
An observability platform for LLM applications that includes RAG evaluation metrics. Supports context recall computation with built-in trace visualization showing how claims map to retrieved contexts.
W&B's LLM application development platform that supports custom evaluation pipelines. Teams use Weave to track context recall scores across experiments, compare retriever configurations, and visualize claim-level results.
Interview & Evaluation Perspective
Summary
Context recall is a critical evaluation metric for RAG pipelines that measures how completely the retriever captures the information needed to produce correct answers. By decomposing ground-truth answers into atomic claims and using an LLM-as-judge to verify each claim against retrieved context chunks, it provides a granular, semantically aware assessment of retrieval quality. The metric is essential for diagnosing retrieval bottlenecks, comparing retriever configurations, and gating deployments in CI/CD pipelines. While it requires ground-truth answers and incurs LLM evaluation costs, its ability to pinpoint exactly which claims are missing makes it far more actionable than traditional document-level IR metrics. In production ML systems, context recall should be tracked alongside context precision and faithfulness for comprehensive RAG evaluation.