What is Mean Reciprocal Rank (MRR) in simple terms?

MRR measures how quickly your search system finds the right answer. For each query, it looks at the position of the first correct result and takes the inverse: position 1 gives a score of 1.0, position 2 gives 0.5, position 3 gives 0.33, and so on. MRR is the average of these scores across all queries.

Think of it like this: you ask a search engine 100 questions. For each question, you note where the correct answer appears. If the answer is usually at position 1 or 2, MRR will be high (close to 1.0). If the answer is usually buried at position 10, MRR will be low (close to 0.1). An MRR of 0.67 means the correct answer is typically at position 1 or 2, which is pretty good.

How is MRR different from NDCG and MAP?

The three metrics answer different questions:

- **MRR** asks: 'Where is the **first** relevant result?' -- It ignores everything after the first relevant result. Best for single-answer tasks (QA, navigational search).
- **MAP** asks: 'Where are **all** the relevant results?' -- It computes precision at each relevant position and averages. Best for multi-answer tasks with binary relevance labels.
- **NDCG** asks: 'How good is the **entire ranking** considering graded relevance?' -- It assigns different scores to 'perfect', 'good', 'fair' results and discounts by position. Best for complex ranking with graded relevance.

The key distinction: MRR only cares about one result per query. MAP cares about all relevant results. NDCG cares about all results *and* their relevance gradations. Use MRR when the user stops after finding one answer. Use NDCG when the user browses the full list.
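To make the distinction concrete, here is a minimal sketch that scores the same two rankings with all three metrics. The grade scale (0 to 3), the `2^grade - 1` gain, and the function names are assumptions chosen for illustration; other NDCG gain conventions exist.

```python
import math
from typing import List

def reciprocal_rank(grades: List[int]) -> float:
    """1/rank of the first result with grade > 0, or 0.0 if there is none."""
    for rank, g in enumerate(grades, start=1):
        if g > 0:
            return 1.0 / rank
    return 0.0

def average_precision(grades: List[int]) -> float:
    """Mean of precision@rank over the ranks that hold a relevant result.

    Assumption: every relevant item for the query appears somewhere in `grades`.
    """
    hits, precisions = 0, []
    for rank, g in enumerate(grades, start=1):
        if g > 0:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0

def ndcg(grades: List[int]) -> float:
    """DCG (gain 2^grade - 1, discount log2(rank + 1)) divided by the ideal DCG."""
    def dcg(gs: List[int]) -> float:
        return sum((2 ** g - 1) / math.log2(rank + 1)
                   for rank, g in enumerate(gs, start=1))
    ideal = dcg(sorted(grades, reverse=True))
    return dcg(grades) / ideal if ideal > 0 else 0.0

# Graded labels for one ranked list: 0 = irrelevant, 1 = fair, 3 = perfect
ranking_a = [0, 1, 0, 3, 0]
ranking_b = [0, 3, 0, 1, 0]  # same items, but the perfect result is ranked higher

for name, grades in [("A", ranking_a), ("B", ranking_b)]:
    print(name,
          round(reciprocal_rank(grades), 3),
          round(average_precision(grades), 3),
          round(ndcg(grades), 3))
```

Both rankings get the same reciprocal rank (0.5) and the same average precision (0.5), because those metrics only see relevant vs. irrelevant positions. Only NDCG scores ranking B higher for placing the 'perfect' result above the 'fair' one.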

What does MRR@K mean?

MRR@K means you only consider the top K results when computing MRR. If the first relevant result appears beyond position K, that query contributes 0 to the MRR average. For example, MRR@10 only looks at the top 10 results. If your system finds the correct answer at position 15, that query gets a reciprocal rank of 0 (not 1/15). This makes sense because users rarely scroll past 10 results.

Common K values:

- **MRR@1**: Only counts if the first result is relevant (same as Precision@1)
- **MRR@5**: For mobile search UIs showing ~5 results
- **MRR@10**: MS MARCO passage ranking standard; typical desktop search page
- **MRR@100**: Used for MS MARCO document ranking, where deeper result lists are evaluated

Knowledge graph benchmarks (FB15k-237, WN18RR) usually report MRR over the full entity ranking with no cutoff.
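The cutoff rule can be sketched in a few lines. This is an illustrative helper, not a library API: the name `mrr_at_k` and the input format (the 1-based rank of the first relevant result per query, or `None` when there is none) are assumptions.

```python
from typing import List, Optional

def mrr_at_k(first_relevant_ranks: List[Optional[int]], k: int) -> float:
    """MRR@K from the 1-based rank of the first relevant result per query.

    None means the query had no relevant result at all.
    """
    if not first_relevant_ranks:
        return 0.0
    total = 0.0
    for rank in first_relevant_ranks:
        # Ranks beyond the cutoff are treated exactly like "not found".
        if rank is not None and rank <= k:
            total += 1.0 / rank
    return total / len(first_relevant_ranks)

ranks = [1, 2, 15, None]  # hypothetical results for four queries
print(mrr_at_k(ranks, k=10))   # (1 + 0.5 + 0 + 0) / 4 = 0.375
print(mrr_at_k(ranks, k=100))  # the rank-15 query now contributes 1/15
```

The query answered at position 15 contributes nothing at K=10 but 1/15 at K=100, which is why scores reported at different cutoffs are not directly comparable.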

What is a good MRR score?

It depends heavily on the task difficulty and baseline:

- **MRR@10 > 0.90**: Excellent. The relevant result is almost always at position 1. Typical for navigational queries ('Flipkart login').
- **MRR@10 ~ 0.50-0.80**: Good. The relevant result is usually in the top 2-3 positions. Competitive for most search and QA tasks.
- **MRR@10 ~ 0.30-0.50**: Moderate. The relevant result is typically at positions 2-4. Room for improvement.
- **MRR@10 < 0.30**: Poor. The relevant result is often below position 4 or missing entirely.

For context, on MS MARCO passage ranking, BM25 achieves MRR@10 of ~0.187, while state-of-the-art neural rankers achieve ~0.42. On FB15k-237 (knowledge graph), top models achieve MRR of ~0.35-0.40.

Always compare against a baseline rather than evaluating in isolation. An MRR improvement from 0.30 to 0.35 is significant and typically measurable in user satisfaction.

Why does MS MARCO use MRR@10 instead of NDCG?

MS MARCO's relevance labels are **binary** (relevant/not) and **sparse** (typically 1-2 relevant passages per query). These two properties make MRR a better choice than NDCG:

1. **Binary labels**: NDCG's advantage is graded relevance (0-4 scale). With binary labels, NDCG reduces to a simpler form and its discriminative power over MRR diminishes. You are paying the complexity cost of NDCG without getting the benefit.
2. **Sparse labels**: With only 1-2 relevant passages per query, MRR's focus on the first relevant result is appropriate -- there is not much to evaluate beyond the first hit. MAP would be slightly more comprehensive but noisier with such sparse judgments.
3. **User behavior**: For passage ranking (answering a question), users typically want one authoritative answer, not a list of 10 partially relevant passages. MRR's single-answer user model matches this behavior.

The MS MARCO team explicitly discusses this design choice in their SIGIR 2021 paper.

How is MRR used in knowledge graph evaluation?

In knowledge graph (KG) evaluation, MRR measures how well embedding models (TransE, RotatE, ComplEx) predict missing links. For each test triple (head, relation, tail), the model ranks all possible tail entities, and MRR measures where the true tail entity appears.

**The evaluation protocol**:

1. For test triple (Mumbai, located_in, India), replace 'India' with each of the ~15,000 entities
2. Score each candidate: (Mumbai, located_in, Delhi), (Mumbai, located_in, Paris), etc.
3. Rank entities by score, find where 'India' ranks
4. Reciprocal rank = 1/rank of 'India'

The same procedure is usually repeated with the head entity replaced, and both sets of ranks go into the average.

**The filtered setting** (crucial): Before ranking, remove all *other* candidates that form a known true triple with the same head and relation. If (Mumbai, located_in, Maharashtra) is also true, remove Maharashtra from the candidate list. This prevents penalizing the model for ranking other correct answers above the test answer.

MRR on FB15k-237 ranges from ~0.23 (TransE) to ~0.40+ (modern models). Alongside MRR, KG evaluations usually report Hits@1, Hits@3, and Hits@10.

Can I use MRR for evaluating a RAG pipeline?

Yes, and it is one of the best metrics for RAG retrieval evaluation -- with a caveat.

**When MRR works well for RAG**: If your RAG pipeline answers factoid questions ("What is RLHF?"), there is typically one key context chunk that contains the answer. MRR measures how quickly the retriever surfaces that chunk. If MRR is high, the LLM usually sees the relevant context and can ground its answer in it. If MRR is low, the relevant chunk is often missing or buried, and the LLM is more likely to hallucinate.

**When MRR falls short for RAG**: If the question requires synthesizing information from multiple documents ("Compare transformer and RNN architectures"), MRR only credits the first relevant document. You need Recall@K to check if *all* necessary documents are retrieved, and MRR to check if the most important one is near the top.

**Practical recommendation**: Use MRR@5 + Recall@10 as a complementary pair for RAG evaluation. MRR@5 captures 'is the best context at the top?' and Recall@10 captures 'are all needed contexts in the retrieval window?'. Frameworks like RAGAS and DeepEval take a similar two-sided view, pairing rank-sensitive and coverage-oriented retrieval metrics.
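A minimal sketch of that pairing is shown below. The function names and document IDs are hypothetical, and the example deliberately uses a multi-document question so the two metrics disagree.

```python
from typing import List, Set

def reciprocal_rank_at_k(retrieved: List[str], relevant: Set[str], k: int) -> float:
    """1/rank of the first relevant doc within the top k, else 0.0."""
    for rank, doc_id in enumerate(retrieved[:k], start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0

def recall_at_k(retrieved: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of the relevant docs that appear within the top k."""
    if not relevant:
        return 0.0
    found = relevant.intersection(retrieved[:k])
    return len(found) / len(relevant)

# Hypothetical multi-document question: three chunks are needed to answer it.
retrieved = ["doc_a", "doc_x", "doc_y", "doc_b", "doc_z"]
relevant = {"doc_a", "doc_b", "doc_c"}

rr = reciprocal_rank_at_k(retrieved, relevant, k=5)
recall = recall_at_k(retrieved, relevant, k=10)
print(f"RR@5 = {rr:.3f}, Recall@10 = {recall:.3f}")
```

Here the reciprocal rank is perfect (a relevant chunk is at position 1), yet recall is only 2/3 because `doc_c` was never retrieved. Looking at MRR alone would hide that gap.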

What is the relationship between MRR and Hit Rate (Success@K)?

Hit Rate (Success@K) is a coarser version of MRR. Hit Rate only checks *whether* any relevant result exists in the top K: it returns 1 if yes, 0 if no. MRR also tells you *where* that result appears.

**Example**: Two ranked lists with K=5:

- List A: [Relevant, Irrelevant, Irrelevant, Irrelevant, Irrelevant] -- Hit@5 = 1, MRR = 1.0
- List B: [Irrelevant, Irrelevant, Irrelevant, Irrelevant, Relevant] -- Hit@5 = 1, MRR = 0.2

Hit Rate treats both lists identically (both have a relevant result in top 5). MRR distinguishes them: List A scores 5x higher because the relevant result is at position 1 instead of position 5.

In other words: a query counts as a hit at K exactly when its reciprocal rank at K is greater than 0, so Hit Rate@K is always at least as large as MRR@K, but Hit Rate loses the position information. If position does not matter (you just need *any* relevant document to feed to an LLM), Hit Rate suffices. If position matters (the user sees results in order), use MRR.
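The two lists from the example can be checked directly. This is a small illustrative sketch; `hit_at_k` and `rr_at_k` are names chosen here, not functions from a particular library.

```python
from typing import List

def hit_at_k(ranked: List[bool], k: int) -> float:
    """1.0 if any of the top-k results is relevant, else 0.0."""
    return 1.0 if any(ranked[:k]) else 0.0

def rr_at_k(ranked: List[bool], k: int) -> float:
    """Reciprocal rank of the first relevant result within the top k, else 0.0."""
    for rank, is_relevant in enumerate(ranked[:k], start=1):
        if is_relevant:
            return 1.0 / rank
    return 0.0

list_a = [True, False, False, False, False]
list_b = [False, False, False, False, True]

for name, ranked in [("A", list_a), ("B", list_b)]:
    print(name, "Hit@5 =", hit_at_k(ranked, 5), "RR@5 =", rr_at_k(ranked, 5))
```

Per query, the reciprocal rank can never exceed the hit indicator, which is why the averaged MRR@K is bounded above by Hit Rate@K.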

How much does MRR evaluation cost in India?

The MRR computation itself is free (pure math). The cost is in **relevance label collection**:

**Binary labels (required for MRR)**:

- Crowdsourced: INR 20-50 per query-result pair (cheaper than graded labels)
- In-house annotators in India: INR 30-80 per label depending on domain expertise needed
- Typical evaluation set: 1000 queries x 10 results = 10,000 labels
- **Total: INR 2-5 lakh** for a solid evaluation set

**Comparison to NDCG annotation costs**:

- NDCG requires graded labels (0-4 scale): INR 50-150 per label
- Same evaluation set: INR 5-15 lakh
- MRR is **2-3x cheaper** to evaluate than NDCG

**For knowledge graph evaluation**: Labels are free -- they come from the graph itself (known true triples). The cost is compute: evaluating a KGE model on FB15k-237 requires a GPU for ~5 minutes. On Wikidata-scale graphs, budget INR 500-2000 per evaluation run for cloud GPU costs.

**For RAG evaluation**: You can bootstrap labels cheaply using LLM-as-judge (GPT-4 labeling relevance at INR 1-5 per label), then validate a subset with human annotators. Budget INR 50,000-1 lakh for an initial RAG evaluation set.

MRR in Machine Learning

Here is the simplest question you can ask about a search engine: "How far down the page did the user have to scroll to find their answer?" That single question, formalized into a number, is what Mean Reciprocal Rank (MRR) measures.

MRR is a ranking evaluation metric designed for tasks where there is typically one correct (or one most-relevant) answer per query, and your job is to figure out how quickly your system surfaces it. If the correct answer is at position 1, the reciprocal rank is 1. If it is at position 3, the reciprocal rank is 1/3. Average those values across all your queries, and you have MRR.

What makes MRR special is its radical simplicity. While NDCG juggles graded relevance and logarithmic discounts, and MAP cares about the full set of relevant documents, MRR focuses on a single data point per query: where does the first relevant result appear? That focus is a feature, not a bug -- it aligns perfectly with how users behave when they have a specific question and want a single authoritative answer.

MRR originated in the late 1990s at the TREC Question Answering track, where the goal was to evaluate systems returning short, direct answers to factoid questions. Since then, it has become one of the most widely used metrics in information retrieval, question answering, knowledge graph evaluation, and RAG pipeline assessment. Microsoft's MS MARCO passage ranking benchmark -- one of the most influential IR benchmarks of the past decade -- uses MRR@10 as its primary metric. MRR is a common choice wherever a system has to put one right answer on top, from web search to restaurant feed ranking.

Concept Snapshot

What It Is: A ranking evaluation metric that measures how quickly a system returns the first relevant result by computing the average of the reciprocal of the rank position of the first correct answer across all queries.
Category: Evaluation
Complexity: Beginner
Inputs / Outputs: Inputs: a ranked list of results per query and a set of relevant (ground-truth) items per query. Outputs: a single score between 0 and 1, where 1 means the first relevant result is always at position 1.
System Placement: Used offline during model evaluation, online for A/B testing, and as a model-selection signal during training (MRR itself is not differentiable, so ranking models are usually trained on a surrogate loss). Evaluates any system that produces ranked lists -- search engines, QA systems, knowledge graph link predictors, recommendation engines.
Also Known As: Mean Reciprocal Rank, MRR, MRR@K, Average Reciprocal Rank
Typical Users: ML engineers, Search engineers, NLP researchers, Knowledge graph researchers, RAG pipeline developers, Recommendation system engineers
Prerequisites: Basic ranking concepts, Understanding of relevance in information retrieval, Arithmetic mean
Key Terms: reciprocal rank, first relevant result, MRR@K, binary relevance, rank position, query set, navigational query, factoid QA

Why This Concept Exists

The Problem: One Question, One Answer

Not every search task is exploratory. When a user types "What is the capital of France?" into a search engine, they do not want a ranked list of ten slightly relevant results. They want a single, correct answer -- ideally at position 1. If the answer appears at position 5, that is a failure.

Early IR metrics were designed for a different world. Precision and recall measured how many relevant documents you found, not where you found them. Even position-aware metrics like NDCG assumed multiple relevant documents at different relevance levels. But for factoid QA, navigational search ("Flipkart login page"), and entity lookup ("CEO of Infosys"), the relevance landscape is flat: one right answer, everything else wrong.

Origin: TREC Question Answering Track (1999)

The metric gained prominence through the TREC-8 Question Answering Track in 1999, organized by Ellen Voorhees at NIST. The QA track was revolutionary: instead of evaluating document retrieval, it evaluated direct answer extraction. Systems received factoid questions like "Who invented the telephone?" and returned short text snippets ranked by confidence.

The organizers needed a metric that captured one thing: how quickly does the system return the correct answer? They adopted reciprocal rank -- the inverse of the position of the first correct response -- and averaged it across all questions. MRR was born.

Why It Endures: Simplicity as a Feature

MRR has survived for over 25 years because it captures a genuinely useful signal with minimal assumptions:

- Cheap to compute: O(K) per query to find the first relevant result
- Cheap to annotate: binary labels (relevant/not) are far less expensive than 5-level graded judgments
- Easy to interpret: MRR = 0.5 is the score you get when the first correct answer sits at position 2 for every query (strictly, 1/MRR is the harmonic mean of the first-relevant ranks, not their arithmetic mean)
- Robust to annotation disagreement: binary labels tend to have higher inter-annotator agreement than graded scales

Key Takeaway: MRR exists because many real-world tasks have a single correct answer, and the only question that matters is where it appears in the ranked list.

Core Intuition & Mental Model

The Analogy: Looking for Your Keys

Imagine you have lost your keys and you are checking pockets in order. Left jacket pocket, right jacket pocket, trouser pockets, bag.

If the keys are in the first pocket you check: reciprocal rank = 1/1 = 1.0. If in the third pocket: 1/3 = 0.33. If not found: 0.

Now imagine repeating this 100 times. Average all reciprocal ranks, and you get MRR. An MRR of 0.8 means you usually find the keys in the first or second place you check.

The Single-Answer Assumption

The most important thing about MRR is its user model: the user stops as soon as they find one relevant result. This is realistic for:

- Factoid QA: "What year was India's independence?" -- one answer, done
- Navigational search: "Zerodha login" -- one URL, done
- Entity lookup: "Population of Bengaluru" -- one number, done
- Knowledge graph link prediction: "(Mumbai, capital_of, ?)" -- one correct tail entity

But unrealistic for exploratory search ("Best restaurants in Mumbai"), product search ("Running shoes under 5000"), or literature reviews. For those tasks, use NDCG or MAP.

Why Reciprocal (1/rank)?

The reciprocal function penalizes a one-position slip far more harshly near the top of the list than further down:

- Rank 1 to 2: score drops from 1.0 to 0.5 (50% drop)
- Rank 5 to 6: drops from 0.2 to 0.167 (17% drop)
- Rank 10 to 11: drops from 0.1 to 0.091 (9% drop)

This matches user behavior: the difference between position 1 and 2 is enormous. The difference between position 10 and 11? The user has already given up.
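The percentages above are relative drops in the reciprocal-rank score, and they can be reproduced with a couple of lines. The helper name `relative_drop` is an illustrative choice.

```python
def relative_drop(rank: int) -> float:
    """Fraction of the reciprocal-rank score lost when moving from `rank` to `rank + 1`."""
    before = 1.0 / rank
    after = 1.0 / (rank + 1)
    return (before - after) / before  # simplifies to 1 / (rank + 1)

for rank in (1, 5, 10):
    print(f"rank {rank} -> {rank + 1}: {relative_drop(rank):.1%} drop")
```

Because the relative drop simplifies to 1/(rank + 1), each additional position costs less than the one before it, which is the diminishing-penalty behavior described above.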

Mental Model: MRR answers: "On average, how quickly does my system find the needle in the haystack?" MRR of 1.0 means the needle is always on top. MRR of 0.33 is what you get if the needle is consistently the third item you pick up.

Technical Foundations

Building Up the Formula

Let's formalize MRR step by step, starting from a single query and building to the full metric.

Step 1: Reciprocal Rank for a Single Query

For a query $q_i$, let $\text{rank}_i$ be the position of the first relevant result in the ranked list returned by the system. The reciprocal rank is:

$\text{RR}_i = \frac{1}{\text{rank}_i}$

If no relevant result appears in the ranked list, $\text{RR}_i = 0$.

Example: The system returns [irrelevant, irrelevant, relevant, irrelevant, relevant] for query $q_i$. The first relevant result is at position 3, so $\text{RR}_i = 1/3 \approx 0.333$. Note that the second relevant result at position 5 is completely ignored.

Step 2: Mean Reciprocal Rank

Given a set of $|Q|$ queries $Q = \{q_1, q_2, \ldots, q_{|Q|}\}$, MRR is the arithmetic mean of the reciprocal ranks (with $1/\text{rank}_i$ taken as 0 for queries that have no relevant result in the list):

$\text{MRR} = \frac{1}{|Q|} \sum_{i=1}^{|Q|} \frac{1}{\text{rank}_i}$

Properties:

- $0 \leq \text{MRR} \leq 1$
- MRR = 1 if and only if the first relevant result is at position 1 for every query
- MRR = 0 if and only if no relevant result appears in the ranked list for any query
- MRR is undefined for an empty query set ($|Q| = 0$); the usual convention is to return 0

Step 3: MRR@K (Cutoff Variant)

In practice, we often only evaluate the top $K$ results. If the first relevant result appears beyond position $K$, we treat it as absent:

$\text{MRR@K} = \frac{1}{|Q|} \sum_{i=1}^{|Q|} \begin{cases} \frac{1}{\text{rank}_i} & \text{if } \text{rank}_i \leq K \\ 0 & \text{otherwise} \end{cases}$

MS MARCO's benchmark uses MRR@10, meaning only the top 10 results are considered. If the first relevant passage is at position 15, that query contributes 0 to MRR@10.

Worked Example

Suppose we have 4 queries with the following ranked results (R = relevant, X = irrelevant):

| Query | Ranked List | First Relevant Rank | Reciprocal Rank |
| --- | --- | --- | --- |
| $q_1$ | [R, X, X, X, X] | 1 | 1/1 = 1.000 |
| $q_2$ | [X, X, R, X, R] | 3 | 1/3 = 0.333 |
| $q_3$ | [X, R, X, X, X] | 2 | 1/2 = 0.500 |
| $q_4$ | [X, X, X, X, X] | None | 0.000 |

$\text{MRR} = \frac{1}{4}(1.000 + 0.333 + 0.500 + 0.000) = \frac{1.833}{4} = 0.458$

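Interpretation: $1/0.458 \approx 2.2$, so the system scores as if the first relevant result sat just past position 2 on every query. This is a harmonic-mean view of the ranks, not an arithmetic average of positions. Query $q_4$ has no relevant result and contributes 0, pulling the average down.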

Relationship to MAP

When every query has exactly one relevant document, MRR and MAP are equivalent:

$\text{MRR} = \text{MAP} \quad \text{when } |\text{relevant}(q_i)| = 1 \text{ for all } i$

This is because Average Precision for a single relevant document reduces to the reciprocal rank of that document. The distinction matters only when multiple relevant documents exist.
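The reduction can be written out in one line. The sketch below assumes the usual definition of Average Precision, which this section does not spell out; $P@k$ (precision within the top $k$) and $\text{rel}_i(k)$ (1 if the result at position $k$ is relevant for $q_i$, else 0) are notation introduced here for illustration.

$\text{AP}_i = \frac{1}{|\text{relevant}(q_i)|} \sum_{k=1}^{K} P@k \cdot \text{rel}_i(k)$

With a single relevant document at position $\text{rank}_i$, the sum has exactly one non-zero term, and $P@\text{rank}_i = 1/\text{rank}_i$ because one of the top $\text{rank}_i$ results is relevant. Therefore:

$\text{AP}_i = \frac{1}{1} \cdot \frac{1}{\text{rank}_i} = \text{RR}_i$

Averaging both sides over all queries gives $\text{MAP} = \text{MRR}$.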

Relationship to Success@K (Hit Rate)

Hit Rate (Success@K) is a binary version of MRR: it checks whether any relevant result appears in the top $K$. MRR is strictly more informative because it also tells you where that result appears:

- Hit@5 = 1 for both [R, X, X, X, X] and [X, X, X, X, R]
- MRR gives 1.0 for the first and 0.2 for the second

Implementation Note: MRR@10 is the default metric for MS MARCO passage ranking. Knowledge graph evaluation usually reports MRR over the full candidate ranking, with no cutoff. Choose K based on how many results your users actually see.

Internal Architecture

MRR is a metric, not a deployable system. But there is a well-defined computational pipeline for how MRR is calculated and integrated into ML evaluation workflows. The flow is straightforward: a ranking system produces ordered results, ground-truth labels identify which results are relevant, and the MRR calculator finds the first relevant result per query and averages the reciprocal ranks.

Key Components

Ranking System

Produces an ordered list of candidate results for each query. Could be a search engine (BM25, neural ranker), a QA model, a knowledge graph link predictor, or a RAG retrieval pipeline.

Ground Truth Labels

Binary relevance labels (relevant/not) for query-result pairs. For QA tasks, these are the known correct answers. For knowledge graph evaluation, these are the true triples. Simpler than graded labels (0-4 scale) used by NDCG.

First-Relevant Finder

Scans each ranked list from position 1 downward until it finds the first relevant result. Records the rank position. If no relevant result exists in the top K, returns infinity (treated as 0 reciprocal rank).

Reciprocal Rank Calculator

Computes 1/rank for each query's first relevant result. Queries with no relevant result receive 0. This is the core transformation that converts rank positions into scores.

Aggregation Layer

Averages reciprocal ranks across all queries to produce the final MRR score. May also compute per-category MRR (e.g., MRR for head queries vs. tail queries) for drill-down analysis.
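The drill-down step of the aggregation layer might look like the following sketch. It assumes each query already carries a category label and a precomputed reciprocal rank; the labels "head" and "tail" and the function name are illustrative.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

def mrr_by_category(records: List[Tuple[str, float]]) -> Dict[str, float]:
    """Average reciprocal ranks separately for each category label.

    records: (category, reciprocal_rank) pairs, one per query.
    """
    buckets: Dict[str, List[float]] = defaultdict(list)
    for category, rr in records:
        buckets[category].append(rr)
    return {category: sum(values) / len(values) for category, values in buckets.items()}

# Hypothetical per-query results: frequent ("head") vs. rare ("tail") queries
records = [("head", 1.0), ("head", 0.5), ("tail", 0.0), ("tail", 1 / 3)]

per_category = mrr_by_category(records)
overall = sum(rr for _, rr in records) / len(records)
print(per_category, f"overall={overall:.4f}")
```

The overall MRR is the query-count-weighted average of the per-category values, so a large category with a weak score can be masked by a strong one unless the breakdown is reported.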

Data Flow

Here is the data flow in a typical offline evaluation:

Input: A test set of $|Q|$ queries with ground-truth relevant items for each query, and a ranking system to evaluate.

For each query $q_i$:

1. The ranking system produces a ranked list $R_i = [r_1, r_2, \ldots, r_K]$
2. Scan $R_i$ from position 1 to K until the first relevant result is found at position $\text{rank}_i$
3. Compute $\text{RR}_i = 1/\text{rank}_i$ (or 0 if no relevant result found)

Output: $\text{MRR} = \frac{1}{|Q|} \sum_{i} \text{RR}_i$

The entire computation is embarrassingly parallel across queries. For 1 million queries with K=10, a vectorized implementation typically finishes in well under a second on a single CPU core. The metric itself is never the bottleneck.
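Because each query is independent, the whole data flow collapses into a few array operations. The sketch below assumes relevance has already been laid out as a boolean matrix with one row per query and one column per rank position; the function name `mrr_vectorized` is illustrative.

```python
import numpy as np

def mrr_vectorized(relevance: np.ndarray) -> float:
    """MRR@K for a (num_queries, K) boolean matrix; column j holds rank j + 1."""
    relevance = np.asarray(relevance, dtype=bool)
    if relevance.size == 0:
        return 0.0
    has_hit = relevance.any(axis=1)        # does the query have any relevant result?
    first_idx = relevance.argmax(axis=1)   # index of the first True (0 if there is none)
    rr = np.where(has_hit, 1.0 / (first_idx + 1), 0.0)
    return float(rr.mean())

# The four queries from the worked example (R = True, X = False)
matrix = np.array([
    [True,  False, False, False, False],
    [False, False, True,  False, True],
    [False, True,  False, False, False],
    [False, False, False, False, False],
])
print(f"{mrr_vectorized(matrix):.4f}")
```

`argmax` returns 0 for an all-False row, so the `has_hit` mask is what keeps queries without a relevant result from being scored as rank 1.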

Figure: MRR evaluation data flow. 'Query Set' and 'Ranking System' produce 'Ranked Results per Query'. Ground Truth Labels and Ranked Results feed into the 'MRR Calculator', which finds the first relevant result per query, computes reciprocal ranks, averages them into a final MRR Score, and sends the result to a Report/Dashboard.

How to Implement

Three Ways to Compute MRR

MRR is one of the easiest metrics to implement. You have three practical options:

Option A: From scratch -- literally 10 lines of Python. Because MRR is so simple, a custom implementation is often preferable to importing a library. You understand exactly what it does, and there are no hidden defaults.

Option B: Use a metrics library (torchmetrics, ir_measures) -- useful when you are computing MRR alongside other metrics (NDCG, MAP, Precision@K) in a standardized evaluation pipeline. Note that scikit-learn does not ship a dedicated MRR function, so it is not a drop-in option here.

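Option C: Use a RAG evaluation framework (RAGAS, DeepEval, LangChain) -- these frameworks focus on RAG-specific metrics like faithfulness and answer relevance. Whether MRR is built in varies by framework and version, so check the documentation and fall back to Option A or B for the MRR number if needed. Best for end-to-end RAG pipeline evaluation.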

Regardless of the approach, the core logic is identical: find the first relevant result, take its reciprocal rank, average across queries.

Cost Note: MRR evaluation itself is essentially free (pure computation). The cost is in label collection: binary relevance labels cost INR 20-50 per query-result pair using crowdsourcing (cheaper than NDCG's graded labels at INR 50-150). For 1000 queries x 10 results = 10,000 labels, budget INR 2-5 lakh.

From Scratch -- Pure Python MRR Implementation

import numpy as np
from typing import List, Optional

def reciprocal_rank(ranked_list: List[bool], k: Optional[int] = None) -> float:
    """Compute reciprocal rank for a single query.
    
    Args:
        ranked_list: Boolean list where True = relevant, False = irrelevant
        k: Optional cutoff. Only consider top-k results.
    
    Returns:
        Reciprocal rank (0 if no relevant result found)
    """
    if k is not None:
        ranked_list = ranked_list[:k]
    
    for i, is_relevant in enumerate(ranked_list, start=1):
        if is_relevant:
            return 1.0 / i
    return 0.0


def mrr(queries_ranked_lists: List[List[bool]], k: Optional[int] = None) -> float:
    """Compute Mean Reciprocal Rank across multiple queries.
    
    Args:
        queries_ranked_lists: List of boolean ranked lists, one per query
        k: Optional cutoff (MRR@K)
    
    Returns:
        MRR score between 0 and 1
    """
    if not queries_ranked_lists:
        return 0.0
    
    rr_scores = [reciprocal_rank(rl, k) for rl in queries_ranked_lists]
    return float(np.mean(rr_scores))


# Example usage
results = [
    [True, False, False, False, False],   # RR = 1.0 (relevant at position 1)
    [False, False, True, False, True],    # RR = 1/3 (first relevant at position 3)
    [False, True, False, False, False],   # RR = 1/2 (relevant at position 2)
    [False, False, False, False, False],  # RR = 0   (no relevant result)
]

print(f"MRR:    {mrr(results):.4f}")      # 0.4583
print(f"MRR@3:  {mrr(results, k=3):.4f}") # 0.4583 (same here)
print(f"MRR@1:  {mrr(results, k=1):.4f}") # 0.2500 (only q1 has relevant at pos 1)

This is a minimal, readable implementation of MRR. For each query, scan the ranked list until you find the first relevant result (True) and return 1/position. If no relevant result exists, return 0. Average across all queries. The optional k parameter implements MRR@K by truncating the ranked list before scanning. The logic is simple enough to reuse as-is, but validate inputs before relying on it in production: an empty query list returns 0, and k is assumed to be a positive integer.

PyTorch Metrics -- MRR for Large-Scale Evaluation

from torchmetrics.retrieval import RetrievalMRR
import torch

# Initialize metric with K=10 cutoff
mrr_metric = RetrievalMRR(top_k=10)

# Simulate evaluation data
# indexes: which query each result belongs to
# preds: model confidence scores (used to rank)
# target: binary relevance labels
indexes = torch.tensor([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])
preds   = torch.tensor([0.9, 0.7, 0.5, 0.3, 0.1, 0.8, 0.6, 0.4, 0.2, 0.05])
target  = torch.tensor([0, 0, 1, 0, 0, 1, 0, 0, 0, 0])

# Compute MRR
result = mrr_metric(preds, target, indexes)
print(f"MRR@10: {result:.4f}")
# Query 0: first relevant at predicted rank 3 -> RR = 1/3
# Query 1: first relevant at predicted rank 1 -> RR = 1/1
# MRR = (1/3 + 1) / 2 = 0.6667

TorchMetrics provides a GPU-accelerated MRR implementation that handles the prediction-to-rank conversion automatically. You pass raw model scores (preds), binary labels (target), and query group assignments (indexes). The library sorts by predicted scores within each query group, finds the first relevant result, and computes the reciprocal rank. The top_k parameter implements MRR@K; if your installed TorchMetrics version does not accept top_k on RetrievalMRR, upgrade or truncate the candidates per query yourself. This is ideal for evaluating neural ranking models in PyTorch training loops.

Knowledge Graph Link Prediction -- MRR Evaluation

import numpy as np
from typing import List, Tuple, Dict

def evaluate_link_prediction(
    test_triples: List[Tuple[int, int, int]],
    score_fn,
    num_entities: int,
    filter_triples: set,
    k_values: Tuple[int, ...] = (1, 3, 10)  # tuple default avoids a shared mutable default
) -> Dict[str, float]:
    """Evaluate knowledge graph embedding with MRR and Hits@K.
    
    Args:
        test_triples: List of (head, relation, tail) triples to evaluate
        score_fn: Function(head, relation, candidate_tails) -> scores
        num_entities: Total number of entities in the KG
        filter_triples: Set of all known true triples (for filtered setting)
        k_values: K values for Hits@K
    
    Returns:
        Dict with 'mrr' and 'hits@k' metrics
    """
    reciprocal_ranks = []
    hits = {k: [] for k in k_values}
    
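    for head, rel, true_tail in test_triples:
        # Score all possible tail entities (assumes higher score = more plausible)
        all_tails = np.arange(num_entities)
        # Copy as float so -inf masking works and the model's own array is not mutated
        scores = np.array(score_fn(head, rel, all_tails), dtype=float)  # shape: (num_entities,)
        
        # Filtered setting: mask other known true tails for this (head, relation)
        for t in range(num_entities):
            if t != true_tail and (head, rel, t) in filter_triples:
                scores[t] = -np.inf
        
        # Rank the true tail entity (1-indexed). Ties are resolved optimistically:
        # candidates that score exactly the same as the true tail are not counted ahead of it.
        rank = int((scores > scores[true_tail]).sum()) + 1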
        
        reciprocal_ranks.append(1.0 / rank)
        for k in k_values:
            hits[k].append(1.0 if rank <= k else 0.0)
    
    results = {'mrr': float(np.mean(reciprocal_ranks))}
    for k in k_values:
        results[f'hits@{k}'] = float(np.mean(hits[k]))
    
    return results

# Example usage (pseudo)
# results = evaluate_link_prediction(
#     test_triples=[(0, 1, 42), (5, 2, 17), ...],
#     score_fn=transe_model.score,
#     num_entities=14541,  # FB15k-237
#     filter_triples=all_known_triples,
# )
# print(f"MRR: {results['mrr']:.4f}")
# print(f"Hits@1: {results['hits@1']:.4f}")
# print(f"Hits@10: {results['hits@10']:.4f}")

In knowledge graph evaluation, MRR is the primary metric for link prediction tasks. For each test triple (head, relation, tail), the model scores all possible tail entities, and MRR measures where the true tail entity ranks. The filtered setting (standard practice since Bordes et al. 2013) removes other known true triples from the ranking to avoid penalizing correct predictions. This code sketches the tail-prediction half of the protocol used in TransE, RotatE, and other KGE model papers; full evaluations typically also rank candidate heads, and score_fn is assumed to return higher values for more plausible triples.
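On large graphs, checking every entity against the filter set for every test triple becomes the slow part. One possible refinement, sketched here under the same higher-score-is-better assumption, is to index known tails by (head, relation) once and mask only those. The helper names and the toy scores are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple

import numpy as np

Triple = Tuple[int, int, int]

def build_tail_index(known_triples: Iterable[Triple]) -> Dict[Tuple[int, int], Set[int]]:
    """Map each (head, relation) pair to the set of tails known to be true."""
    index: Dict[Tuple[int, int], Set[int]] = defaultdict(set)
    for head, rel, tail in known_triples:
        index[(head, rel)].add(tail)
    return dict(index)

def filtered_rank(scores: np.ndarray, true_tail: int, known_tails: Set[int]) -> int:
    """1-based rank of true_tail after masking the other known true tails."""
    scores = np.array(scores, dtype=float)  # copy so the caller's array is untouched
    others = [t for t in known_tails if t != true_tail]
    if others:
        scores[others] = -np.inf
    return int((scores > scores[true_tail]).sum()) + 1

# Toy example: 5 entities, test triple (0, 0, 2); (0, 0, 1) is also a known true triple
toy_scores = np.array([0.1, 0.9, 0.8, 0.3, 0.2])
tail_index = build_tail_index([(0, 0, 1), (0, 0, 2)])

raw_rank = filtered_rank(toy_scores, true_tail=2, known_tails=set())
filt_rank = filtered_rank(toy_scores, true_tail=2, known_tails=tail_index.get((0, 0), set()))
print(raw_rank, filt_rank)  # raw rank 2 becomes filtered rank 1
```

Entity 1 outranks the true tail in the raw ranking, but it is itself a correct answer, so the filtered rank moves from 2 to 1 and the reciprocal rank from 0.5 to 1.0.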

RAG Pipeline Evaluation -- MRR for Retrieved Contexts

from typing import List, Set

def evaluate_rag_retrieval(
    queries: List[str],
    retrieved_doc_ids: List[List[str]],
    relevant_doc_ids: List[Set[str]],
    k: int = 10
) -> dict:
    """Evaluate RAG retrieval quality using MRR and Hit Rate.
    
    Args:
        queries: List of user queries
        retrieved_doc_ids: Ranked list of retrieved document IDs per query
        relevant_doc_ids: Set of ground-truth relevant doc IDs per query
        k: Cutoff for MRR@K
    
    Returns:
        Dict with MRR@K, Hit Rate@K, and per-query details
    """
    reciprocal_ranks = []
    hits = []
    details = []
    
    for query, retrieved, relevant in zip(queries, retrieved_doc_ids, relevant_doc_ids):
        rr = 0.0
        hit = False
        first_rank = None
        
        for rank, doc_id in enumerate(retrieved[:k], start=1):
            if doc_id in relevant:
                rr = 1.0 / rank
                hit = True
                first_rank = rank
                break
        
        reciprocal_ranks.append(rr)
        hits.append(1.0 if hit else 0.0)
        details.append({
            'query': query,
            'reciprocal_rank': rr,
            'first_relevant_rank': first_rank,
        })
    
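    # Guard against empty input so the averages below never divide by zero
    num_queries = len(reciprocal_ranks)
    if num_queries == 0:
        raise ValueError("evaluate_rag_retrieval needs at least one query")

    return {
        f'mrr@{k}': sum(reciprocal_ranks) / num_queries,
        f'hit_rate@{k}': sum(hits) / num_queries,
        'per_query': details,
    }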

# Example: evaluate a vector search retriever
results = evaluate_rag_retrieval(
    queries=["What is RLHF?", "Explain attention mechanism", "BERT architecture"],
    retrieved_doc_ids=[
        ["doc_7", "doc_3", "doc_12", "doc_1", "doc_5"],
        ["doc_22", "doc_11", "doc_8", "doc_3", "doc_15"],
        ["doc_4", "doc_9", "doc_1", "doc_2", "doc_6"],
    ],
    relevant_doc_ids=[
        {"doc_3", "doc_1"},
        {"doc_8"},
        {"doc_4", "doc_1"},
    ],
    k=5
)

print(f"MRR@5: {results['mrr@5']:.4f}")
print(f"Hit Rate@5: {results['hit_rate@5']:.4f}")
for d in results['per_query']:
    print(f"  {d['query']}: RR={d['reciprocal_rank']:.3f} (rank {d['first_relevant_rank']})")

This example evaluates a RAG retrieval pipeline using MRR@K alongside Hit Rate@K. For each query, the retriever returns ranked document IDs, and we check where the first relevant document appears. The per-query details are useful for debugging: you can immediately see which queries have low reciprocal rank and investigate why the retriever failed. In production RAG systems, Hit Rate@5 or Hit Rate@10 tells you how often the context fed to the LLM contains the relevant information at all, and MRR@5 or MRR@10 tells you how close to the top of that context it sits.

Configuration Example

# MRR evaluation config (YAML)
evaluation:
  metric: mrr
  cutoff_k: 10
  relevance_threshold: 1  # binary: 0 or 1
  handle_no_relevant: include_as_zero  # or 'exclude'
  report:
    - mrr@1
    - mrr@3
    - mrr@5
    - mrr@10
    - hit_rate@10
  segment_by:
    - query_type  # navigational, informational, transactional
    - query_frequency  # head, torso, tail
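The `segment_by` setting in the config above implies computing MRR separately per query segment. A minimal sketch of that grouping step follows; the function name `mrr_by_segment` and the sample records are hypothetical, and the config keys themselves are not tied to any particular library.

```python
from collections import defaultdict


def mrr_by_segment(records):
    """Group per-query reciprocal ranks by segment label and average each group.

    `records` is an iterable of (segment, reciprocal_rank) pairs.
    """
    buckets = defaultdict(list)
    for segment, rr in records:
        buckets[segment].append(rr)
    return {segment: sum(rrs) / len(rrs) for segment, rrs in buckets.items()}


# Hypothetical per-query results tagged with a query_type segment
records = [
    ("navigational", 1.0),
    ("navigational", 0.5),
    ("informational", 0.25),
    ("informational", 0.0),
]
segment_mrr = mrr_by_segment(records)
for segment, value in segment_mrr.items():
    print(f"{segment}: MRR = {value:.3f}")
```

Segment-level numbers like these often explain a flat overall MRR: one segment can improve while another regresses.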

Common Implementation Mistakes

● Counting multiple relevant results: MRR only considers the first relevant result. If your evaluation code sums reciprocal ranks of all relevant results, you are computing something else entirely (closer to MAP). Double-check that you break out of the loop after finding the first relevant item.
● Forgetting the K cutoff: Computing MRR over a very deep ranked list (say K=1000) gives credit for results no user would ever reach, because even a terrible ranker will eventually hit a relevant result. For search, use MRR@K with a realistic K; K=10 is standard. Knowledge graph link prediction is the exception: the convention there is to rank against all entities and report Hits@K alongside MRR.
● Not handling zero-relevant queries: If a query has no relevant documents in the corpus (not just the ranked list), should you include it? The standard convention is to assign RR=0 and include it in the average, which penalizes the system. But some benchmarks exclude such queries. Be explicit about your choice.
● Using MRR when multiple results matter: If users need to see several relevant items (product search, literature review), MRR is the wrong metric. It will reward a system that puts one relevant item at position 1 and trash everywhere else. Use NDCG or MAP instead.
● Confusing predicted scores with rank positions: Some libraries take raw model scores and internally sort to determine ranks, while others expect pre-sorted ranked lists. Passing scores to a function that expects ranks (or vice versa) will produce garbage. Always read the API documentation.
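Two of the mistakes above are easy to show in a few lines. The sketch below contrasts the correct first-hit reciprocal rank with the "sum over all relevant hits" bug, and shows how the two zero-relevant conventions change the average. Function names are hypothetical; the `include_as_zero` and `exclude` labels mirror the `handle_no_relevant` option in the config example.

```python
def reciprocal_rank(ranked_ids, relevant_ids, k):
    """1/rank of the first relevant ID within the top k, else 0.0."""
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank  # stop at the first relevant result
    return 0.0


def summed_reciprocal_ranks(ranked_ids, relevant_ids, k):
    """The mistake: adds 1/rank for every relevant hit, which is not MRR."""
    return sum(
        1.0 / rank
        for rank, doc_id in enumerate(ranked_ids[:k], start=1)
        if doc_id in relevant_ids
    )


def mean_rr(runs, k, handle_no_relevant="include_as_zero"):
    """Average RR over (ranked_ids, relevant_ids) pairs.

    With handle_no_relevant="exclude", queries whose relevant set is empty
    are dropped before averaging; otherwise they count as RR = 0.
    """
    if handle_no_relevant == "exclude":
        runs = [(ranked, rel) for ranked, rel in runs if rel]
    if not runs:
        return 0.0
    return sum(reciprocal_rank(ranked, rel, k) for ranked, rel in runs) / len(runs)


ranked = ["a", "b", "c"]
correct = reciprocal_rank(ranked, {"a", "c"}, k=3)
mistaken = summed_reciprocal_ranks(ranked, {"a", "c"}, k=3)

# Second query has no relevant documents anywhere in the corpus
runs = [(["x", "y"], {"y"}), (["x", "y"], set())]
included = mean_rr(runs, k=10)
excluded = mean_rr(runs, k=10, handle_no_relevant="exclude")
print(correct, mistaken, included, excluded)
```

Note that the mistaken sum can exceed 1.0, which is a quick sanity check: a per-query reciprocal rank above 1 always signals a bug.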

When Should You Use This?

Use When

Your task has a single correct answer per query -- factoid QA, entity lookup, navigational search, knowledge graph link prediction
You care most about where the first relevant result appears, not about the full ranking quality below it
You need a simple, interpretable metric that stakeholders (product managers, executives) can understand: 'MRR = 0.7 means the answer is typically at position 1 or 2'
Your relevance labels are binary (relevant/not) and you cannot afford or do not need graded annotations (0-4 scale)
You are evaluating a RAG retrieval pipeline and want to know how often the correct context chunk is in the top few results
You are working with knowledge graph embeddings (TransE, RotatE, ComplEx) where MRR is the standard benchmark metric

Avoid When

Multiple relevant results matter and their relative order is important -- use NDCG instead (e.g., product search, recommendation carousels)
You need to evaluate ranking quality across all relevant documents, not just the first one -- use MAP instead
Your relevance is graded (perfect, good, fair, poor) and you want the metric to distinguish between them -- use NDCG with graded labels
You care about coverage (how many of the total relevant items did you retrieve?) -- use Recall@K instead
All items in your top-K are equally important and position within the K does not matter -- use Precision@K or Hit Rate instead
Your search task is exploratory (users browse many results) rather than known-item search (users seek one specific answer)

Key Tradeoffs

The Core Tradeoff: Simplicity vs. Completeness

MRR is the simplest position-aware ranking metric. That simplicity comes at a cost: it ignores everything after the first relevant result. Here is a concrete example of why that matters:

Ranking A: [Relevant, Irrelevant, Irrelevant, Irrelevant, Irrelevant] -- MRR contribution: 1.0
Ranking B: [Relevant, Relevant, Relevant, Relevant, Relevant] -- MRR contribution: 1.0

MRR gives both rankings the same score, but Ranking B is objectively better -- it has five relevant results instead of one. If your task values multiple relevant results, MRR is blind to that.
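The same comparison can be checked numerically. This is a minimal sketch using 0/1 relevance flags for the two rankings; Precision@5 is included only as one example of a metric that does see the difference.

```python
def reciprocal_rank(labels):
    """labels: 0/1 relevance flags in ranked order. Returns 1/rank of the first 1."""
    for rank, is_relevant in enumerate(labels, start=1):
        if is_relevant:
            return 1.0 / rank
    return 0.0


def precision_at_k(labels, k):
    """Fraction of the top k results that are relevant."""
    return sum(labels[:k]) / k


ranking_a = [1, 0, 0, 0, 0]
ranking_b = [1, 1, 1, 1, 1]

rr_a, rr_b = reciprocal_rank(ranking_a), reciprocal_rank(ranking_b)
p5_a, p5_b = precision_at_k(ranking_a, 5), precision_at_k(ranking_b, 5)
print(f"A: RR={rr_a}, P@5={p5_a}")
print(f"B: RR={rr_b}, P@5={p5_b}")
```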

MRR vs. MAP vs. NDCG: When to Use Which

Metric	Relevance Type	Focus	Best For
MRR	Binary	First relevant result	Single-answer tasks (QA, entity lookup)
MAP	Binary	All relevant results	Multi-answer tasks with binary labels
NDCG	Graded (0-4)	All results with position weighting	Complex ranking with graded relevance

Rule of thumb: If your user stops after finding one answer, use MRR. If they want several answers, use MAP. If they want the best answers first, use NDCG.

MRR@K: Picking the Right K

The cutoff K should reflect user behavior:

MRR@1: Equivalent to Precision@1. Useful for voice assistants ("Hey Siri, what's...") where only the top result is spoken
MRR@5: Mobile search where viewport shows ~5 results
MRR@10: MS MARCO benchmark standard; desktop search with 10 results per page
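MRR@100: Deep rankings where a downstream stage consumes a long candidate list. Knowledge graph benchmarks (FB15k-237, WN18RR) usually go further and rank the true entity against all candidates with no cutoff, reporting Hits@K alongside MRR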
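To see how the choice of K moves the score, the sketch below sweeps several cutoffs over one set of hypothetical first-relevant ranks (`None` means nothing relevant was retrieved). The ranks are invented for illustration.

```python
def mrr_at_k(first_relevant_ranks, k):
    """MRR@K from the 1-indexed rank of the first relevant result per query.

    A rank of None, or any rank beyond k, contributes 0.
    """
    total = sum(
        1.0 / rank
        for rank in first_relevant_ranks
        if rank is not None and rank <= k
    )
    return total / len(first_relevant_ranks)


# Hypothetical first-relevant ranks for five queries
first_relevant_ranks = [1, 3, 7, None, 15]

sweep = {k: mrr_at_k(first_relevant_ranks, k) for k in (1, 5, 10, 100)}
for k, value in sweep.items():
    print(f"MRR@{k}: {value:.4f}")
```

The score can only grow as K grows, and the increments shrink quickly because deep hits contribute at most 1/K each.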

Key Insight: MRR is not better or worse than NDCG or MAP -- it measures a different thing. The choice depends on whether you have a single-answer or multi-answer task. Using MRR for a multi-answer task is like grading an essay with a multiple-choice rubric: technically possible, but you are throwing away information.

Alternatives & Comparisons

MAP (Mean Average Precision)

MAP computes precision at each position where a relevant document appears and averages across positions and queries. Unlike MRR, MAP considers all relevant documents in the ranking, not just the first. Use MAP when multiple relevant documents exist per query and you care about their positions. MRR is simpler and more appropriate when there is one correct answer; MAP is more comprehensive for multi-answer retrieval. When each query has exactly one relevant document, MAP and MRR are mathematically equivalent.
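The equivalence in the single-relevant case can be verified directly. The sketch below assumes every relevant document appears somewhere in the ranked list, so average precision is normalized by the number of relevant items found; the label lists are hypothetical.

```python
def reciprocal_rank(labels):
    """1/rank of the first relevant item in a list of 0/1 flags, else 0.0."""
    for rank, is_relevant in enumerate(labels, start=1):
        if is_relevant:
            return 1.0 / rank
    return 0.0


def average_precision(labels):
    """Mean of precision@rank taken at each rank where a relevant item appears.

    Assumes all relevant items are present in `labels`.
    """
    hits = 0
    precisions = []
    for rank, is_relevant in enumerate(labels, start=1):
        if is_relevant:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0


single = [0, 0, 1, 0, 0]  # exactly one relevant document, at rank 3
multi = [1, 0, 0, 1, 0]   # two relevant documents, at ranks 1 and 4

print(reciprocal_rank(single), average_precision(single))
print(reciprocal_rank(multi), average_precision(multi))
```

With one relevant document the only precision term is 1/rank, so the two metrics coincide; with several, AP also reflects where the later relevant documents land.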

NDCG (Normalized Discounted Cumulative Gain)

NDCG supports graded relevance (0-4 scale) and evaluates the full ranking with position-dependent discounting. MRR only uses binary relevance and only cares about the first relevant result. Use NDCG when relevance has meaningful gradations ("perfect match" vs. "good match" vs. "acceptable") and you want to evaluate the entire ranked list, not just the first hit. MRR is better for single-answer tasks; NDCG is better for complex ranking where both relevance levels and positions matter.

Precision@K

Precision@K measures the fraction of top-K results that are relevant, but it is position-unaware within those K results. MRR is position-aware: it distinguishes between a relevant result at position 1 and one at position 5. Use Precision@K when you care about the count of relevant items in the top K, regardless of their order. Use MRR when the position of the first relevant result matters.

Hit Rate (Success@K)

Hit Rate is a binary version of MRR: it only checks whether any relevant result exists in the top K (1 if yes, 0 if no). MRR is strictly more informative -- it tells you not just whether a relevant result exists, but where it appears. Hit Rate = 1 for both 'relevant at position 1' and 'relevant at position K', while MRR gives 1.0 and 1/K respectively. Use Hit Rate when you only need a coverage check; use MRR when position matters.
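A short sketch of that contrast: two hypothetical systems that always retrieve a relevant result within the top 5, one at position 1 and the other at position 5. The system names and ranks are invented for illustration.

```python
def hit_rate_at_k(first_relevant_ranks, k):
    """Share of queries with a relevant result at rank <= k."""
    hits = sum(1 for r in first_relevant_ranks if r is not None and r <= k)
    return hits / len(first_relevant_ranks)


def mrr_at_k(first_relevant_ranks, k):
    """Mean of 1/rank for first-relevant ranks within k; misses contribute 0."""
    total = sum(1.0 / r for r in first_relevant_ranks if r is not None and r <= k)
    return total / len(first_relevant_ranks)


system_top = [1, 1, 1]     # relevant result always first
system_bottom = [5, 5, 5]  # relevant result always fifth

for name, ranks in (("top", system_top), ("bottom", system_bottom)):
    print(name, hit_rate_at_k(ranks, 5), mrr_at_k(ranks, 5))
```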

Pros, Cons & Tradeoffs

Advantages

Extremely simple to understand and compute: the formula is literally 'average of 1/rank'. You can explain MRR to a product manager in 30 seconds. No logarithms, no normalization constants, no graded relevance scales.
Cheap annotation: MRR requires only binary labels (relevant/not), which are faster and cheaper to collect than NDCG's graded labels. Binary annotation costs INR 20-50 per label vs. INR 50-150 for graded. Inter-annotator agreement is also higher for binary judgments.
Aligns perfectly with single-answer tasks: for QA, navigational search, and entity lookup, MRR measures exactly what you care about -- how quickly the system finds the one right answer. No wasted signal on irrelevant aspects of the ranking.
Position-aware at the top of the ranking: MRR sharply penalizes pushing the first relevant result down even one position (1.0 to 0.5 is a 50% drop), which aligns with the extreme top-heaviness of user attention in search.
Industry-standard for key benchmarks: MRR@10 is the primary metric for MS MARCO passage ranking, the most influential IR benchmark. MRR is the standard for knowledge graph evaluation (FB15k-237, WN18RR). If you publish results, you need MRR.
Mathematically well-behaved: bounded between 0 and 1, easy to average across query subsets, decomposes cleanly for per-category analysis. No division-by-zero issues (unlike NDCG with zero IDCG).
Fast to compute at scale: O(K) per query, trivially parallelizable across queries. Computing MRR for 10 million queries takes seconds.

Disadvantages

Ignores all results after the first relevant one: a ranking with one relevant result at position 1 and garbage everywhere else scores the same (MRR=1.0) as a perfect ranking with all relevant results at the top. This is a fundamental blind spot.
Binary relevance only: MRR cannot distinguish between a 'perfectly relevant' result and a 'somewhat relevant' one. If your task has meaningful relevance gradations, MRR throws away that information. NDCG handles this; MRR does not.
Penalizes systems with multiple equally-good answers: if a knowledge graph has 3 correct tail entities for a query, MRR only credits the first one found. The system gets no credit for ranking the other 2 correct entities highly.
Sensitive to a single query's failure: one query where the relevant result is at position 100 (RR=0.01) can drag down the average significantly, especially with small query sets. Median RR is more robust but less commonly reported.
Does not capture ranking quality below the first hit: after finding the first relevant result, MRR stops. If positions 2-10 are terrible (or excellent), MRR cannot tell the difference. For tasks where users scan beyond the first result, this is a limitation.
Not suitable for exploratory search: when users want to browse multiple options (product search, restaurant search, job search), MRR fundamentally misrepresents system quality because it only cares about one item.

Report both mean and median reciprocal rank. Median RR is robust to outliers and gives a better picture of 'typical' performance. Also compute confidence intervals via bootstrapping (sample queries with replacement, compute MRR on each sample, report 95% CI). Use at least 500-1000 queries for stable MRR estimates.
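The reporting advice above can be sketched with the standard library alone. This uses a simple percentile bootstrap; the function name, the number of resamples, and the sample reciprocal ranks are all assumptions for illustration, and other interval constructions (for example BCa) may be preferable in practice.

```python
import random
import statistics


def bootstrap_mrr_ci(reciprocal_ranks, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for MRR.

    Resamples queries with replacement, recomputes the mean each time,
    and returns the (alpha/2, 1 - alpha/2) percentiles of those means.
    """
    rng = random.Random(seed)
    n = len(reciprocal_ranks)
    means = []
    for _ in range(n_resamples):
        sample = [reciprocal_ranks[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    low = means[int((alpha / 2) * n_resamples)]
    high = means[int((1 - alpha / 2) * n_resamples) - 1]
    return low, high


# Hypothetical per-query reciprocal ranks for 500 queries
rrs = [1.0, 0.5, 1.0, 0.0, 0.33, 1.0, 0.25, 0.5, 1.0, 0.1] * 50

mean_rr = sum(rrs) / len(rrs)
median_rr = statistics.median(rrs)
ci_low, ci_high = bootstrap_mrr_ci(rrs)
print(f"MRR={mean_rr:.3f}, median RR={median_rr:.3f}, 95% CI=({ci_low:.3f}, {ci_high:.3f})")
```

A fixed seed keeps the interval reproducible between runs, which matters when the CI is used to compare two systems.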

Placement in an ML System

Where Does MRR Sit in the Pipeline?

MRR is a metric, not a serving component. It lives in the evaluation and monitoring layer, separate from the inference path. Here is how it fits into different system architectures:

Search / QA Systems: After the ranking model produces results, MRR is computed offline on a test set to assess whether the correct answer is surfaced quickly. Online, MRR is estimated from click data (the first clicked result proxies for the first relevant result). Swiggy uses MRR alongside NDCG and median click depth to evaluate their restaurant feed ranking.

Knowledge Graph Systems: After a knowledge graph embedding model (TransE, RotatE, ComplEx) is trained, MRR is computed on a held-out set of test triples. For each test triple (h, r, ?), the model ranks all entities, and MRR measures where the true tail entity falls. MRR is the primary metric on benchmarks like FB15k-237 and WN18RR.

RAG Pipelines: MRR evaluates the retrieval component -- given a user question, where does the first relevant context chunk appear in the retrieved results? If MRR is low, the LLM never sees the relevant context and will hallucinate or give a poor answer. MRR is a leading indicator of downstream generation quality.
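For the click-based online estimate mentioned under search systems, a minimal sketch follows. It assumes each session is logged as a list of 1-indexed clicked positions and treats the highest-ranked click as the proxy for the first relevant result; that log format is hypothetical, and no position-bias correction is applied.

```python
def click_reciprocal_rank(clicked_positions):
    """Reciprocal of the top-most clicked position; 0.0 for sessions without clicks."""
    if not clicked_positions:
        return 0.0
    return 1.0 / min(clicked_positions)


# Hypothetical click logs: one list of clicked positions per search session
sessions = [[1], [3, 5], [], [2]]

online_mrr = sum(click_reciprocal_rank(s) for s in sessions) / len(sessions)
print(f"click-based MRR estimate: {online_mrr:.4f}")
```

Sessions without clicks count as zero here; whether abandoned sessions should be included is a design choice worth stating alongside the number.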

Key Insight: MRR is the canary in the coal mine for single-answer systems. If MRR drops, the user is scrolling further to find their answer -- and in many cases, they will just leave instead.

Pipeline Stage

Evaluation / Metrics

Upstream

Search Engine
QA Model
Knowledge Graph Link Predictor
RAG Retriever
Recommendation System
Ground Truth Annotation Pipeline

Downstream

Model Selection
Hyperparameter Tuning
A/B Testing Framework
Monitoring Dashboard
Ranking Model Training (as objective)

Scaling Bottlenecks

Where It Gets Tight

MRR computation itself is never the bottleneck. Finding the first relevant result in a top-K list is O(K) per query, and averaging is O(|Q|). For 10 million queries with K=10, this takes under a second on a single core.

The real bottlenecks are:

1. Label collection: Binary labels cost INR 20-50 per query-result pair. For 10,000 queries x 10 results = 100,000 labels, budget INR 20-50 lakh. For knowledge graph evaluation, labels come from the graph itself (true triples), so this is free.

2. Candidate generation for KG evaluation: In knowledge graph link prediction, MRR evaluation requires scoring all entities as candidate tails for each test triple. For FB15k-237 with 14,541 entities and 20,466 test triples, that is 14,541 x 20,466 = ~300 million scoring operations. On GPU, this takes minutes; on CPU, it can take hours.

3. Online MRR monitoring: Computing MRR from live traffic requires real-time relevance signals (clicks, dwell time). The signal is noisy and position-biased. De-biasing adds computational overhead but is essential for accurate online MRR.
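The arithmetic behind the first two bottlenecks can be reproduced in a few lines. All inputs are the figures quoted above; the doubling for head-and-tail prediction is an added assumption about protocols that rank both directions.

```python
LAKH = 100_000  # 1 lakh = 100,000

# 1. Label collection: 10,000 queries x 10 results at INR 20-50 per label
num_labels = 10_000 * 10
cost_low_lakh = num_labels * 20 / LAKH
cost_high_lakh = num_labels * 50 / LAKH

# 2. KG evaluation on FB15k-237: score every entity for every test triple
num_entities = 14_541
num_test_triples = 20_466
tail_scoring_ops = num_entities * num_test_triples
both_directions_ops = 2 * tail_scoring_ops  # if heads are ranked as well

print(f"labels: {num_labels:,}, cost: INR {cost_low_lakh:.0f}-{cost_high_lakh:.0f} lakh")
print(f"tail scoring operations: {tail_scoring_ops:,}")
print(f"head + tail scoring operations: {both_directions_ops:,}")
```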

Practical Throughput Numbers

Offline MRR on 1M queries: < 1 second (CPU)
KG MRR on FB15k-237 (14K entities, 20K test triples): ~5 minutes (single GPU)
KG MRR on Wikidata5M (5M entities, 5K test triples): ~1 hour (single GPU, scoring bottleneck)

Production Case Studies

Microsoft (MS MARCO) -- Search / IR Benchmarks

Microsoft's MS MARCO passage ranking benchmark uses MRR@10 as its primary evaluation metric. The benchmark contains 8.8 million passages and ~550,000 queries with sparse binary relevance labels (typically 1-2 relevant passages per query). MRR@10 was chosen specifically because the labels are binary and sparse -- NDCG's graded relevance advantage does not apply, and MAP's full-ranking evaluation adds noise with sparse labels. The MS MARCO leaderboard has been the most influential IR benchmark since 2018, driving progress in neural passage ranking from BM25 (MRR@10 ~0.187) to state-of-the-art neural models (MRR@10 ~0.42).

Outcome:

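MRR@10 enabled standardized comparison of hundreds of passage ranking models. The improvement in MRR@10 from 0.187 (BM25 baseline) to 0.42 (modern neural rankers) is more than 2x. As a loose intuition, taking the reciprocal of each score suggests the relevant passage moving from around position 5 to around position 2 -- a dramatic improvement in user experience. MRR averages reciprocal ranks (including zeros for misses), so these positions are a rough guide and not a literal average rank.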

Swiggy -- Food Delivery (India)

Swiggy's feed ranking team uses MRR alongside NDCG, median click depth, and median ordered-click depth to evaluate their restaurant ranking algorithms. When a user opens the Swiggy app, the feed shows a ranked list of restaurants. MRR measures how quickly the restaurant the user actually orders from appears in the feed. The team tracks MRR as one of several metrics, recognizing that a single metric cannot capture the full quality of a multi-faceted ranking problem. Their engineering blog details how they evolved from simple heuristic ranking to ML-based models, with MRR improving at each iteration.

Outcome:

MRR tracking helped Swiggy identify that their initial ranking model pushed popular chains to the top but buried niche restaurants that specific users preferred. Per-user MRR analysis revealed that personalization (showing different rankings to different users) improved MRR by 15-20% over a one-size-fits-all ranking.

Meta AI (Knowledge Graph Evaluation) -- AI Research

The TransE paper (Bordes et al., NeurIPS 2013) introduced the evaluation protocol that knowledge graph embedding research still follows. TransE models relationships as translations in embedding space (h + r approximately equals t) and evaluates on FB15k and WN18 in both raw and filtered settings, reporting mean rank and Hits@10. Later work added MRR, which is less sensitive than mean rank to a few very poor ranks, and most major KGE models since -- RotatE, ComplEx, DistMult, TuckER -- report filtered MRR alongside Hits@K. Reported MRR on FB15k-237 has improved from ~0.23 (TransE) to ~0.35 (RotatE, 2019) to ~0.40+ (recent models, 2024).

Outcome:

MRR became the universal comparison metric for knowledge graph embedding research, enabling direct comparison across 100+ models over a decade. The filtered protocol (removing other known true triples from the candidate ranking) became the gold standard, so models are not penalized for ranking a different correct entity above the test entity.

LinkedIn -- Professional Network / Customer Service

LinkedIn's customer service team built a RAG-based QA system that integrates knowledge graphs with retrieval-augmented generation. They used MRR as the primary retrieval evaluation metric to measure how often the correct answer document was ranked in the top positions by their hybrid retriever (combining sparse BM25 search with dense embedding retrieval over their internal knowledge base). The system serves LinkedIn's customer service agents, who need precise answers to user questions about billing, account settings, and platform features.

Outcome:

The knowledge graph-integrated RAG system achieved a 77.6% improvement in retrieval MRR compared to the baseline retriever, and a 28.6% reduction in customer service resolution time. MRR directly correlated with agent efficiency -- higher MRR meant agents spent less time scrolling through retrieved documents to find the right answer.

Tooling & Ecosystem

TorchMetrics

Python -- Open Source

PyTorch-native metric library providing RetrievalMRR with GPU acceleration. Integrates seamlessly with PyTorch Lightning training loops. Supports top_k cutoff and handles query grouping via index tensors.

ir_measures

Python -- Open Source

Unified Python interface for 20+ IR metrics including MRR, MAP, NDCG, Recall@K, and more. Based on the pytrec_eval library. Ideal for benchmarking across multiple metrics simultaneously on TREC-format data.

RAGAS

Python -- Open Source

RAG evaluation framework that computes retrieval metrics (MRR, context precision) alongside generation metrics (faithfulness, answer relevance). Specifically designed for evaluating RAG pipelines end-to-end.

Pyserini

Python / Java -- Open Source

Python toolkit for reproducible IR research built on Apache Lucene. Includes utilities for computing MRR and other metrics on standard benchmarks (MS MARCO, TREC). The go-to tool for reproducing published IR results.

ranx

Python -- Open Source

Python library for information retrieval evaluation and comparison. Provides MRR, MAP, NDCG, and statistical significance tests (paired t-test, bootstrap). Supports TREC and JSON run formats. Excellent for comparing multiple ranking models.

Evidently AI

Python -- Open Source

ML monitoring platform that includes ranking metric computation and drift detection. Provides MRR alongside other retrieval metrics for production monitoring of search and recommendation systems.

PyKEEN

Python -- Open Source

Python library for knowledge graph embeddings that computes MRR, Hits@K, Mean Rank, and other KG-specific metrics. Implements the filtered evaluation protocol. Supports 30+ KGE models (TransE, RotatE, ComplEx, etc.).

Research & References

The TREC-8 Question Answering Track Report

Voorhees, E.M. (1999). TREC-8 Proceedings / LREC 2000

The foundational paper that introduced MRR as the primary evaluation metric for question answering. Established the TREC QA track where systems return short, ranked answers to factoid questions. MRR measured how quickly the first correct answer appeared.

Mean Reciprocal Rank

Craswell, N. (2009). Encyclopedia of Database Systems, Springer

The canonical reference definition of MRR in the Encyclopedia of Database Systems. Formalizes MRR as the mean of reciprocal ranks over binary relevance judgments and establishes its equivalence to MAP when each query has exactly one relevant document.

MS MARCO: Benchmarking Ranking Models in the Large-Data Regime

Craswell, N., Mitra, B., Yilmaz, E., Campos, D. & Lin, J. (2021). SIGIR 2021

Describes the MS MARCO benchmark design and justifies MRR@10 as the primary metric for passage ranking. Explains why MRR was chosen over NDCG (sparse binary labels) and MAP (noisy with incomplete judgments). The most influential passage ranking benchmark of the 2018-2025 era.

Translating Embeddings for Modeling Multi-relational Data

Bordes, A., Usunier, N., Garcia-Duran, A., Weston, J. & Yakhnenko, O. (2013). NeurIPS 2013

Introduced TransE and the filtered evaluation protocol for knowledge graph link prediction. Reported mean rank and Hits@10 on FB15k and WN18; the filtered ranking setup carried over to the MRR and Hits@K reporting used by virtually every subsequent KGE paper.

Expected Reciprocal Rank for Graded Relevance

Chapelle, O., Metzler, D., Zhang, Y. & Grinspan, P. (2009). CIKM 2009

Introduced ERR (Expected Reciprocal Rank), an extension of MRR that supports graded relevance by modeling the probability of a user stopping at each position as a function of relevance. ERR bridges MRR's simplicity with NDCG's graded relevance support.

A Unified Framework for Rank-based Evaluation Metrics for Link Prediction in Knowledge Graphs

Berrendorf, M., Galkin, M., Hoyt, C.T. (2022). ICLR 2022 Workshop on Graph Learning Benchmarks

Provides a unified mathematical framework for MRR, Hits@K, Mean Rank, and other rank-based metrics used in knowledge graph evaluation. Analyzes the relationships between metrics and proposes best practices for reporting KGE results.

Evaluating Retrieval Quality in Retrieval-Augmented Generation

Salemi, A. & Zamani, H. (2024). SIGIR 2024

Analyzes how retrieval metrics (MRR, Recall@K, NDCG) correlate with downstream RAG generation quality. Finds that MRR is a strong predictor of answer correctness in single-answer QA tasks, confirming its relevance for RAG pipeline evaluation.

Interview & Evaluation Perspective

Common Interview Questions

● What is Mean Reciprocal Rank and when would you use it?
● How is MRR different from NDCG and MAP? When would you choose one over the others?
● Your MRR@10 is 0.45. What does that mean in practical terms?
● How would you compute MRR for a knowledge graph link prediction task?
● What are the limitations of MRR? When should you NOT use it?
● You are building a RAG pipeline. Which retrieval metric would you use and why?
● Explain the relationship between MRR and Hit Rate.

Key Points to Mention

● MRR measures the average reciprocal rank of the first relevant result across queries. It answers: 'How quickly does the system find the answer?'
● MRR is bounded between 0 and 1. MRR = 1.0 means the first relevant result is always at position 1. MRR = 0.5 is what you get if it is always at position 2, but the same score also arises from a mix of position-1 hits and misses, so treat it as a rough guide and not an average position.
● MRR is ideal for single-answer tasks (QA, navigational search, entity lookup, KG link prediction) where the user stops after finding one relevant result.
● MRR@K should match user behavior: MRR@10 for MS MARCO, MRR@5 for mobile search. Knowledge graph evaluation usually ranks against all entities with no cutoff and reports Hits@K alongside. Always justify your K choice.
● MRR equals MAP when each query has exactly one relevant document. This is why MS MARCO uses MRR (sparse binary labels, ~1 relevant passage per query).
● For KG evaluation, MRR uses the filtered setting: known true triples are removed from candidate rankings to avoid penalizing correct predictions.
● MRR is cheap to annotate (binary labels) and cheap to compute (O(K) per query). This makes it practical for large-scale evaluation.
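When interpreting an MRR value as a position, it helps to remember that different rank distributions can share the same score. A minimal sketch with two hypothetical sets of first-relevant ranks:

```python
def mrr(first_relevant_ranks):
    """Mean of 1/rank over 1-indexed first-relevant ranks."""
    return sum(1.0 / r for r in first_relevant_ranks) / len(first_relevant_ranks)


def mean_rank(first_relevant_ranks):
    """Plain arithmetic mean of the ranks."""
    return sum(first_relevant_ranks) / len(first_relevant_ranks)


always_second = [2, 2, 2]
spread_out = [1, 3, 6]

for name, ranks in (("always_second", always_second), ("spread_out", spread_out)):
    print(f"{name}: MRR={mrr(ranks):.3f}, mean rank={mean_rank(ranks):.2f}")
```

Both lists give MRR = 0.5, yet their mean ranks differ (2 versus about 3.33), because MRR rewards the rank-1 hit far more than it penalizes the rank-6 one.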

Pitfalls to Avoid

● Claiming MRR evaluates the full ranked list -- it only cares about the first relevant result. Everything after that is invisible to MRR. If an interviewer asks about multi-answer tasks, immediately pivot to MAP or NDCG.
● Using MRR for product search or recommendation systems where users browse multiple results. MRR is fundamentally wrong for these tasks -- it gives full credit to a list with one relevant item at position 1 and trash everywhere else.
● Not mentioning the K cutoff. Saying 'I would use MRR' without specifying K shows you have not thought about the application. Always pair MRR with a K value and justify it.
● Confusing MRR with Hit Rate. Hit Rate is binary (did a relevant result appear in top K?), while MRR is position-sensitive (where did it appear?). MRR is strictly more informative.
● Forgetting to mention the filtered setting in knowledge graph evaluation. Raw (unfiltered) MRR understates model quality for KG link prediction because other correct triples ranked above the test triple are counted as errors.

Senior-Level Expectation

A senior candidate should discuss MRR in the context of the broader metric landscape: when MRR is the right choice (single-answer tasks, binary labels, sparse judgments), when to switch to MAP (multiple relevant docs) or NDCG (graded relevance), and when to use complementary metrics (MRR + Recall@K for RAG, MRR + Hits@K for KG). They should know that MRR equals MAP for single-relevant-document queries, explain the filtered vs. unfiltered setting for KG evaluation, and discuss practical issues like label sparsity in MS MARCO (why MRR was chosen over NDCG), K selection based on UI viewport, and per-segment MRR analysis (head vs. tail queries). Senior engineers think about MRR as one tool in a metric toolkit, not the only metric -- and they can articulate why complementary metrics are needed.

Summary

Let's recap what we covered:

Mean Reciprocal Rank (MRR) measures how quickly a ranking system surfaces the first relevant result. The formula is simple: average of 1/rank across all queries. A score of 1.0 means the answer is always at position 1; a score of 0.5 is what you get when it sits at position 2 for every query, or from a mix of top hits and misses.
MRR is the right metric for single-answer tasks: factoid QA, navigational search, entity lookup, and knowledge graph link prediction. It aligns with user behavior in scenarios where people want one answer and stop looking once they find it. It is the wrong metric for exploratory search, product browsing, or any task where multiple relevant results matter -- use NDCG or MAP instead.
MRR@K adds a cutoff: only the top K results are evaluated. MS MARCO uses MRR@10; knowledge graph benchmarks usually rank against all entities with no cutoff and report Hits@K alongside. Choose K based on how many results your users actually see.
MRR requires only binary relevance labels (relevant/not), making it 2-3x cheaper to annotate than NDCG's graded labels. For knowledge graph evaluation, labels come from the graph itself at zero annotation cost.
Key relationships: MRR equals MAP when each query has exactly one relevant document. MRR is strictly more informative than Hit Rate (which only checks presence, not position). ERR extends MRR to graded relevance.
In knowledge graph evaluation, MRR is computed with the filtered setting (removing known true triples from candidates), which is essential to avoid penalizing correct predictions.

MRR's power is its radical simplicity. It answers one question -- 'how far down does the user have to look?' -- and answers it cleanly. That focus makes it the metric of choice for the vast class of retrieval tasks where there is one right answer and position is everything.
