What is MAP (Mean Average Precision) in simple terms?

MAP is a metric that measures how good a retrieval system is at ranking relevant items at the top of its results. For each query, it computes Average Precision (AP): look at each position where a relevant item appears, note the precision (fraction of relevant items in the list up to that point), and average those precision values. Then MAP averages this AP score across all queries. Think of it as a grade on a series of scavenger hunts. Each hunt (query) has some treasures (relevant documents) hidden in a list of items. The more treasures you find early, the higher your grade. MAP is your average grade across all hunts. A MAP of 1.0 means you put every treasure first in every hunt -- perfect. A MAP of 0.5 means you are decent but have room for improvement.

What is the difference between MAP and NDCG?

The fundamental difference is **relevance type**: - **MAP** assumes **binary relevance**: a document is either relevant (1) or not (0). It computes the area under the precision-recall curve. - **NDCG** supports **graded relevance**: a document can be rated on a scale (e.g., 0-4). It applies a logarithmic position discount and normalizes against the ideal ranking. In practice, use MAP when your relevance labels are binary and you care about finding all relevant documents (MAP penalizes low recall). Use NDCG when you have graded labels and the distinction between 'somewhat relevant' and 'highly relevant' matters. For example, in a legal document retrieval system where a document is either relevant to a case or not, MAP is the right choice. In a web search engine where a product page might be perfectly relevant, somewhat relevant, or barely relevant, NDCG captures those distinctions better. Another practical difference: binary labels for MAP are cheaper to collect (INR 30-80 per judgment) than graded labels for NDCG (INR 50-150 per judgment). If budget is tight, MAP with binary labels can be the pragmatic choice.

What is the difference between interpolated and non-interpolated Average Precision?

**Non-interpolated AP** (the standard) computes precision at exactly the positions where relevant documents appear and averages them. This is what `trec_eval` computes and what most modern papers report. **Interpolated AP (11-point)** evaluates precision at 11 fixed recall levels (0.0, 0.1, 0.2, ..., 1.0). At each recall level, the interpolated precision is the maximum precision at any recall level greater than or equal to the current level. This smooths out the 'zigzag' pattern in the precision-recall curve. The interpolated version was used in early TREC evaluations and is still seen in some textbooks. However, it overestimates precision (by taking the maximum) and produces higher AP values than non-interpolated AP. Modern practice uses non-interpolated AP exclusively. In interviews, mention both versions and explain that non-interpolated is the standard. If you see 'MAP' without qualification, it means non-interpolated.

How does MAP handle queries with different numbers of relevant documents?

This is one of MAP's strengths. Each query's AP is computed with its own denominator (the total number of relevant documents for that query). A query with 3 relevant documents and one with 300 relevant documents both produce AP values between 0 and 1, and both contribute equally to the mean. For example: - Query A has 3 relevant docs, all retrieved at positions 1, 2, 3: AP = (1/1 + 2/2 + 3/3) / 3 = 1.0 - Query B has 100 relevant docs, all retrieved at positions 1-100 in perfect order: AP = 1.0 Both get AP = 1.0 despite having very different numbers of relevant documents. This normalization is what makes MAP comparable across queries of varying difficulty. However, this equal weighting can be a limitation: a query with 3 relevant documents (easy) counts the same as a query with 300 (hard). If you want to emphasize improvements on hard queries, consider GMAP (geometric mean of AP values) instead.

What is MAP@K and how should I choose K?

MAP@K is MAP computed only on the top K results. Instead of evaluating the entire ranked list, you truncate at position K and compute AP on just those K results. Choosing K depends on your application: - **Mobile search**: K=5 or K=10 (users see 5-10 results on screen) - **Desktop search**: K=10 to K=20 (first page of results) - **Candidate retrieval (for re-ranking)**: K=100 or K=1000 (you want to evaluate recall at the candidate stage) - **Academic benchmarks**: K=1000 (TREC standard for ad-hoc retrieval) A subtlety: the denominator in AP@K can either be `R` (total relevant docs) or `min(R, K)`. The TREC convention uses `R`, which means AP@K is always less than or equal to full-list AP. Some implementations use `min(R, K)` for a more optimistic score. Be explicit about which convention you use. In practice, report multiple K values (MAP@10, MAP@100, MAP@1000) to understand quality at different depths. If MAP@10 is high but MAP@1000 is low, your system ranks a few relevant items well but misses many others.

Why is MAP called 'Mean Average Precision' -- isn't that redundant?

The name sounds redundant but describes two distinct averaging operations: 1. **Average Precision (AP)**: The first 'average' is within a single query. It averages the precision values computed at each position where a relevant document appears. This summarizes the quality of one ranked list. 2. **Mean** Average Precision: The second average ('mean') is across queries. It takes the arithmetic mean of AP values across all queries in the evaluation set. So the structure is: Mean(over queries) of Average(precision at relevant positions). Two levels of averaging at different granularities. The 'Mean' handles variation across queries; the 'Average Precision' handles variation within a single ranked list. This two-level structure is what gives MAP its robustness: AP normalizes within a query (handling different numbers of relevant documents), and the mean normalizes across queries (handling different query difficulties).

How much does it cost to collect relevance labels for MAP evaluation in India?

Binary relevance labels (relevant/not relevant) are the cheapest type of relevance annotation. Here are approximate costs for crowdsourced annotation in India: - **Per-judgment cost**: INR 30-80 (~$0.35-$1.00 USD) depending on domain complexity - **Simple domains** (e-commerce product search): INR 30-50 per judgment - **Complex domains** (legal, medical, scientific): INR 60-80 per judgment - **Quality control overhead**: Add 20-30% for gold questions, redundant annotations, and annotator training **Example budget for a standard evaluation set**: - 500 queries x 50 judged documents = 25,000 judgments - At INR 50 per judgment: INR 12.5 lakh (~$15,000 USD) - With quality control: INR 15-17 lakh (~$18,000-$20,000 USD) - Timeline: 2-4 weeks with 10-20 annotators This is roughly 50% cheaper than graded (NDCG) annotation, which requires INR 50-150 per judgment on a 0-4 scale. If you are budget-constrained, MAP with binary labels is the practical choice. For ongoing evaluation, budget INR 3-5 lakh per quarter ($3,600-$6,000 USD) for continuous annotation of new queries as your system evolves.

Is MAP the same as mAP in object detection?

No, they are **completely different metrics** that share the same abbreviation. This is one of the most common sources of confusion in ML. **MAP in information retrieval**: - Evaluates ranked lists of retrieved documents - Uses binary relevance labels (relevant=1, not=0) - Computes precision at each rank where a relevant document appears - Score depends on ranking order **mAP in object detection (PASCAL VOC / COCO)**: - Evaluates predicted bounding boxes against ground-truth boxes - Uses IoU (Intersection over Union) to determine if a detection 'matches' a ground truth - Sweeps over confidence thresholds to compute precision-recall curves per class - Averages AP across object classes (not queries) - Often uses specific IoU thresholds like AP@0.5, AP@0.75, or AP@[0.5:0.95] The core idea (area under precision-recall curve) is similar, but the definitions of 'relevant,' 'retrieved,' and 'precision' differ entirely. Always specify whether you mean IR MAP or object detection mAP. In this guide, we are discussing IR MAP exclusively.

Can MAP be used for recommendation systems?

Yes, MAP is widely used for evaluating recommendation systems, especially when relevance is binary (user interacted/did not interact with an item). Common setups: - **E-commerce**: An item the user purchased is relevant=1, everything else is relevant=0. MAP@K where K is the number of visible recommendations. - **Music/Video**: A track the user played to completion is relevant=1, skipped tracks are relevant=0. - **Content feed**: An article the user clicked and read (dwell time > threshold) is relevant=1. MAP works well for recommendations when you want to evaluate both the **quantity** of relevant items retrieved and their **ranking**. It penalizes a system that buries relevant items at the bottom of the list. However, for recommendations with nuanced relevance (e.g., how much the user enjoyed a movie, not just whether they watched it), NDCG with graded relevance is more appropriate. The choice depends on whether your feedback signal is naturally binary (purchase/no-purchase) or graded (1-5 star rating).

Evaluation

MAP in Machine Learning

Here is a question that keeps coming up in every ML system design interview involving search or recommendations: you have a ranked list of results, some relevant and some not -- how do you measure whether the ranking is any good?

You could compute precision at some cutoff K, but that throws away information about where the relevant items sit within the top K. You could look at recall, but that ignores ranking entirely. You need a metric that captures both how many relevant items you retrieved and how high you ranked them.

Mean Average Precision (MAP) is exactly that metric. It has been the gold standard in information retrieval evaluation since the early TREC conferences in the 1990s, and it remains one of the most widely reported metrics in academic IR benchmarks, recommendation system evaluations, and search quality measurement.

MAP works by computing Average Precision (AP) for each query -- which summarizes the entire precision-recall curve into a single number -- and then averaging AP across all queries. The result is a value between 0 and 1 that rewards systems which place relevant documents at the top of the ranked list. A MAP of 1.0 means every relevant document appears before every irrelevant one, for every query.

From TREC's ad-hoc retrieval tracks to modern dense retrieval benchmarks like BEIR, from Flipkart's product search to JioSaavn's music recommendations -- if you are evaluating a retrieval system with binary relevance labels, MAP is likely the metric you should reach for first.

Concept Snapshot

What It Is: A ranking evaluation metric that averages the precision values computed at each position where a relevant document is retrieved, then averages that score across multiple queries to measure overall retrieval quality with binary relevance.
Category: Evaluation
Complexity: Intermediate
Inputs / Outputs: Inputs: a ranked list of retrieved items and binary relevance labels (relevant=1, not relevant=0) for each item. Outputs: a single score between 0 and 1, where 1 represents perfect retrieval ordering.
System Placement: Used in offline evaluation during model development, online A/B testing of ranking algorithms, and as a standard reporting metric in academic IR benchmarks (TREC, BEIR, MS MARCO).
Also Known As: mAP, Mean AP, MAP@K, Mean Average Precision at K
Typical Users: ML engineers, search engineers, IR researchers, recommendation system developers, NLP engineers working on retrieval
Prerequisites: Precision and recall fundamentals, Ranked retrieval concepts, Binary relevance judgments, Basic probability and averaging
Key Terms: Average Precision (AP)precision at rank krecallbinary relevanceinterpolated precisionnon-interpolated APMAP@Kprecision-recall curve

Why This Concept Exists

The Problem with Single-Point Metrics

Early information retrieval systems were evaluated with precision and recall -- two metrics that answer fundamentally different questions. Precision asks: "Of the items I retrieved, how many are relevant?" Recall asks: "Of all the relevant items, how many did I retrieve?"

The problem is that these metrics conflict. You can trivially achieve perfect recall by returning every document in the collection (precision drops to near zero). You can achieve perfect precision by returning only the single most confident result (recall drops to near zero). Researchers needed a way to capture the tradeoff between precision and recall across the entire ranked list.

The Precision-Recall Curve and Its Limitations

The precision-recall (PR) curve plots precision against recall at every rank position. It captures the full picture: how precision degrades as you retrieve more documents and recall increases. But a curve is not a number. You cannot say "System A is better than System B" by looking at two crossing curves. You need a single scalar that summarizes the curve.

The Birth of Average Precision

The solution emerged from the TREC (Text REtrieval Conference) community in the early 1990s. TREC, organized by NIST (National Institute of Standards and Technology) and led by Donna Harman and Ellen Voorhees, needed a standard metric to compare dozens of retrieval systems submitted by research groups worldwide.

Average Precision (AP) was the answer: it computes the area under the precision-recall curve (approximately) by averaging precision at every position where a relevant document is retrieved. This single number captures both precision and recall -- it rewards systems that retrieve relevant documents early and penalizes those that bury relevant documents deep in the list.

Mean Average Precision (MAP) is simply the mean of AP across all queries in the evaluation set. This handles the fact that different queries have different numbers of relevant documents and different difficulty levels. By averaging AP across queries, MAP provides a stable, robust measure of overall system quality.

Why MAP Became the Standard

Buckley and Voorhees (2004) showed that MAP has especially good discrimination (it can distinguish between systems with small quality differences) and stability (rankings of systems are consistent across different query sets). These properties made MAP the default metric in TREC evaluations for over two decades.

Key Insight: MAP exists because we need a single number that captures the full precision-recall tradeoff across a ranked list, averaged across queries of varying difficulty. It is the natural summary of the precision-recall curve, and its statistical properties make it ideal for comparing retrieval systems.

Core Intuition & Mental Model

The Elevator Pitch

Imagine you are a librarian helping patrons find books. For each request (query), you pull books off the shelves and stack them in a pile, with your best guesses on top.

MAP measures how good you are at this job across many patrons. Specifically, every time you place a relevant book in the stack, it checks: "What fraction of the books above this one (inclusive) are relevant?" Then it averages those fractions. If all the relevant books are at the very top of every stack, you get a perfect score of 1.0.

Walking Through AP Step by Step

Let's say a user searches for "machine learning tutorials" and there are 4 relevant documents in the collection. Your system returns 10 results:

Position: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10] Relevant?: [1, 0, 1, 0, 0, 1, 0, 0, 0, 1]

Now, compute precision at each position where a relevant document appears:

Position 1: Precision = 1/1 = 1.00 (1 relevant out of 1 seen)
Position 3: Precision = 2/3 = 0.67 (2 relevant out of 3 seen)
Position 6: Precision = 3/6 = 0.50 (3 relevant out of 6 seen)
Position 10: Precision = 4/10 = 0.40 (4 relevant out of 10 seen)

Average Precision = (1.00 + 0.67 + 0.50 + 0.40) / 4 = 0.64

Notice what happens: the first relevant document (position 1) contributes a full 1.00 to the average, but the last one (position 10) only contributes 0.40. Relevant documents ranked higher contribute more to AP. This is why MAP rewards systems that put relevant items first.

Why "Average" Appears Twice

The name "Mean Average Precision" sounds redundant, but the two averages operate at different levels:

Average Precision (AP): averages precision values within a single query (across relevant document positions)
Mean AP: averages AP values across multiple queries

This two-level averaging is what makes MAP robust: AP handles within-query variation (different numbers of relevant documents), and the mean handles across-query variation (different query difficulties).

The Area-Under-the-Curve Connection

Here is an elegant way to think about AP: it approximates the area under the precision-recall curve. Each relevant document you find increases recall by $1/R$ (where $R$ is the total number of relevant documents), and the precision at that point is the "height" of the curve. Summing precision $\times$ recall-increment gives you the area. This is why AP is sometimes called "area under the PR curve" -- the approximation is exact for non-interpolated AP.

Mental Model: Think of AP as a grade on a scavenger hunt. Finding treasures early earns you high marks. Finding them late earns low marks. Missing them entirely earns zero. MAP is your grade averaged across many scavenger hunts.

Technical Foundations

Building MAP From First Principles

Let's formalize MAP step by step, building from precision through AP to MAP.

Step 1: Precision at Rank $k$

For a query $q$ , given a ranked list of retrieved documents $[d_1, d_2, \ldots, d_n]$ , precision at rank $k$ is:

$P(k) = \frac{\text{number of relevant documents in top } k}{k}$

Or equivalently:

$P(k) = \frac{\sum_{i=1}^{k} \text{rel}(d_i)}{k}$

where $\text{rel}(d_i) \in \{0, 1\}$ is the binary relevance label of document $d_i$ .

Step 2: Average Precision (AP)

Average Precision for a single query $q$ with $R$ total relevant documents:

$\text{AP}(q) = \frac{1}{R} \sum_{k=1}^{n} P(k) \cdot \text{rel}(d_k)$

Breaking this down:

The sum iterates over all positions $k$ in the ranked list
$\text{rel}(d_k)$ acts as a selector: we only compute precision at positions where a relevant document appears
$P(k)$ is the precision at that position
We divide by $R$ (total relevant documents, including those not retrieved) to penalize missed relevant documents

Critical detail: The denominator is $R$ (total relevant documents in the collection), not the number of relevant documents actually retrieved. If your system retrieves only 3 of 10 relevant documents, the denominator is 10. The 7 unretreived relevant documents implicitly contribute precision of 0, dragging AP down. This is how MAP penalizes low recall.

Step 3: MAP (Mean Average Precision)

Given a set of queries $Q = \{q_1, q_2, \ldots, q_m\}$ :

$\text{MAP} = \frac{1}{|Q|} \sum_{j=1}^{|Q|} \text{AP}(q_j)$

Simply the arithmetic mean of AP across all queries.

Step 4: MAP@K (MAP at Cutoff K)

In practice, we often evaluate only the top $K$ results:

$\text{AP@K}(q) = \frac{1}{\min(R, K)} \sum_{k=1}^{K} P(k) \cdot \text{rel}(d_k)$

$\text{MAP@K} = \frac{1}{|Q|} \sum_{j=1}^{|Q|} \text{AP@K}(q_j)$

Note: Some implementations use $R$ (total relevant docs) as the denominator even for AP@K, while others use $\min(R, K)$ . The TREC convention uses $R$ , which means AP@K is always $\leq$ AP for the full list. Be explicit about which convention your implementation follows.

Worked Example

Suppose we have 2 queries, each with 5 relevant documents in the collection.

Query 1: returns [R, N, R, R, N, N, R, N, N, N] (R=relevant, N=not relevant)

Position 1: P(1) = 1/1 = 1.00
Position 3: P(3) = 2/3 = 0.667
Position 4: P(4) = 3/4 = 0.75
Position 7: P(7) = 4/7 = 0.571
Only 4 of 5 relevant docs retrieved, so 5th contributes 0
AP = (1.00 + 0.667 + 0.75 + 0.571 + 0) / 5 = 0.598

Query 2: returns [R, R, R, R, R, N, N, N, N, N]

Position 1: P(1) = 1.00
Position 2: P(2) = 1.00
Position 3: P(3) = 1.00
Position 4: P(4) = 1.00
Position 5: P(5) = 1.00
All 5 relevant docs retrieved at top 5
AP = (1.00 + 1.00 + 1.00 + 1.00 + 1.00) / 5 = 1.00

MAP = (0.598 + 1.00) / 2 = 0.799

Interpolated vs Non-Interpolated AP

There are two variants of Average Precision, and the distinction matters:

Non-interpolated AP (TREC standard): Compute precision exactly at each position where a relevant document is found. This is what we defined above and what trec_eval computes. It is the standard in modern IR evaluation.

Interpolated AP (11-point): Compute precision at 11 fixed recall levels (0.0, 0.1, 0.2, ..., 1.0), where the interpolated precision at recall level $r$ is:

$P_{\text{interp}}(r) = \max_{r' \geq r} P(r')$

The interpolated precision at each recall level is the maximum precision found at any recall level greater than or equal to the current one. This smooths out the "zigzag" in the PR curve. The 11-point interpolated AP was used in early TREC evaluations but has largely been replaced by non-interpolated AP because interpolation can overestimate precision.

Implementation Note: Always use non-interpolated AP unless you are explicitly reproducing results from a paper that uses 11-point interpolation. Modern tools like trec_eval, pytrec_eval, and ranx default to non-interpolated AP.

Internal Architecture

MAP is a metric, not a deployable system component. But there is a well-defined computational pipeline for how MAP is calculated and integrated into ML evaluation workflows. Here is the architecture of a MAP evaluation pipeline.

MAP (Mean Average Precision) in ML Systems Architecture — A directed flow diagram showing 'Query Set' and 'Relevance Judgments' as two inputs feeding into ...

The pipeline operates in two phases. First, the retrieval system produces ranked results for each query. Second, the MAP calculator matches these results against ground-truth relevance judgments, computes AP per query, and aggregates to produce the final MAP score. This architecture is identical whether you are evaluating a sparse retriever (BM25), a dense retriever (DPR, ColBERT), or a learned re-ranker.

Key Components

Query Set

A collection of evaluation queries (typically 50-1000 queries for TREC-style benchmarks). Each query has a known set of relevant documents in the collection. The query set should represent the distribution of real user queries.

Retrieval System

The system under evaluation. Could be BM25, a neural dense retriever (DPR, ColBERT), a re-ranker, or any pipeline that produces a ranked list of documents for each query. MAP is agnostic to the retrieval method.

Relevance Judgments (qrels)

Ground-truth binary relevance labels for query-document pairs. In TREC format, these are stored as 'qrels' files: (query_id, iteration, doc_id, relevance). Relevance is binary: 0 (not relevant) or 1 (relevant). Documents not in the qrels file are assumed irrelevant.

Per-Query AP Calculator

For each query, walks down the ranked list, computes precision at every position where a relevant document appears, and averages these precision values divided by the total number of relevant documents for that query.

Mean Aggregation

Computes the arithmetic mean of AP values across all queries. This is the final MAP score. May also compute confidence intervals via bootstrapping for statistical significance testing.

Model Comparison Layer

Uses MAP scores (and statistical tests like paired t-test or bootstrap test) to determine whether one retrieval system is significantly better than another. Feeds into A/B testing decisions and model selection.

Data Flow

Here is the data flow for a standard offline evaluation:

Input: A test collection consisting of (1) a document corpus, (2) a set of queries $Q = \{q_1, q_2, \ldots, q_m\}$ , and (3) relevance judgments (qrels) mapping each query to its set of relevant documents.

For each query $q_j$ :

The retrieval system returns a ranked list $R_j = [d_1, d_2, \ldots, d_n]$ sorted by descending relevance score
For each position $k$ in $R_j$ , look up whether $d_k$ is relevant according to qrels
Compute $P(k)$ at each position where $\text{rel}(d_k) = 1$
Compute $\text{AP}(q_j) = \frac{1}{R_j} \sum_{k} P(k) \cdot \text{rel}(d_k)$ where $R_j$ is the total number of relevant documents for query $q_j$

Output: $\text{MAP} = \frac{1}{m} \sum_{j=1}^{m} \text{AP}(q_j)$

For online evaluation (A/B testing), the flow is similar but relevance judgments come from user behavior (clicks, purchases, engagement) rather than human annotations, and MAP is computed over cohorts of live traffic.

A directed flow diagram showing 'Query Set' and 'Relevance Judgments' as two inputs feeding into a 'MAP Calculator' that also receives 'Ranked Results' from the 'Retrieval System'. The calculator flows through 'Per-Query AP' to 'Mean Aggregation' to produce a 'MAP Score' that feeds 'Model Comparison / A/B Testing'.

How to Implement

Three Approaches to Computing MAP

There are three common ways to compute MAP in practice, each suited to different contexts:

Option A: Use ranx or ir_measures -- purpose-built IR evaluation libraries that handle TREC-format qrels, support MAP and dozens of other metrics, and are optimized with Numba for speed. Best for IR research and benchmarking.

Option B: Use scikit-learn's average_precision_score -- works for binary classification contexts where you compute AP from prediction scores and binary labels. Not designed for ranked retrieval but can be adapted. Best for quick prototyping.

Option C: Implement from scratch -- necessary when you need full control, custom cutoffs, or integration with proprietary evaluation pipelines. The formula is simple enough that a from-scratch implementation is practical and recommended for understanding.

For production evaluation pipelines, Option A is strongly recommended. For ML system design interviews, Option C demonstrates understanding. Let's walk through all three.

Cost Note: Computing MAP is computationally trivial (linear in the number of retrieved documents per query). The real cost is relevance judgment collection. For binary labels, expect INR 30-80 per query-document pair using crowdsourcing platforms. For 500 queries with 50 judged documents each = 25,000 labels, budget INR 7.5-20 lakh including quality control. Binary labels are cheaper than graded (0-4) labels because annotators only make a yes/no decision.

From Scratch — Complete MAP Implementation94 lines

import numpy as np
from typing import List, Dict, Tuple

def average_precision(ranked_list: List[int], num_relevant: int) -> float:
    """Compute Average Precision for a single query.
    
    Args:
        ranked_list: Binary relevance labels in ranked order.
                     e.g., [1, 0, 1, 0, 0, 1] means relevant at positions 1, 3, 6.
        num_relevant: Total number of relevant documents for this query
                      (including those NOT retrieved).
    
    Returns:
        AP score between 0 and 1.
    """
    if num_relevant == 0:
        return 0.0
    
    cumulative_relevant = 0
    precision_sum = 0.0
    
    for k, is_relevant in enumerate(ranked_list, start=1):
        if is_relevant:
            cumulative_relevant += 1
            precision_at_k = cumulative_relevant / k
            precision_sum += precision_at_k
    
    return precision_sum / num_relevant


def mean_average_precision(
    queries: Dict[str, Tuple[List[int], int]]
) -> float:
    """Compute MAP across multiple queries.
    
    Args:
        queries: Dict mapping query_id to (ranked_list, num_relevant).
    
    Returns:
        MAP score between 0 and 1.
    """
    if not queries:
        return 0.0
    
    ap_scores = []
    for query_id, (ranked_list, num_relevant) in queries.items():
        ap = average_precision(ranked_list, num_relevant)
        ap_scores.append(ap)
        print(f"Query {query_id}: AP = {ap:.4f}")
    
    map_score = np.mean(ap_scores)
    print(f"\nMAP = {map_score:.4f}")
    return map_score


def average_precision_at_k(
    ranked_list: List[int], num_relevant: int, k: int
) -> float:
    """Compute AP@K (Average Precision at cutoff K).
    
    Only considers the top K results.
    Uses min(num_relevant, K) as denominator (common convention).
    """
    if num_relevant == 0 or k == 0:
        return 0.0
    
    ranked_list = ranked_list[:k]
    cumulative_relevant = 0
    precision_sum = 0.0
    
    for i, is_relevant in enumerate(ranked_list, start=1):
        if is_relevant:
            cumulative_relevant += 1
            precision_at_i = cumulative_relevant / i
            precision_sum += precision_at_i
    
    # TREC convention: divide by num_relevant (penalizes low recall)
    # Alternative: divide by min(num_relevant, k)
    return precision_sum / num_relevant


# --- Example Usage ---
queries = {
    "q1": ([1, 0, 1, 1, 0, 0, 1, 0, 0, 0], 5),  # 4 of 5 relevant found
    "q2": ([1, 1, 1, 1, 1, 0, 0, 0, 0, 0], 5),  # 5 of 5 found (perfect)
    "q3": ([0, 0, 0, 1, 0, 1, 0, 0, 1, 0], 3),  # 3 of 3 found (late)
}

map_score = mean_average_precision(queries)
# Output:
# Query q1: AP = 0.6533
# Query q2: AP = 1.0000
# Query q3: AP = 0.4278
# MAP = 0.6937

This implementation follows the standard non-interpolated AP formula used by TREC. The key design decision is the denominator: we use num_relevant (total relevant documents in the collection for this query), not just the count of relevant documents retrieved. This means if your system misses relevant documents, AP is penalized -- it implicitly captures recall. The average_precision_at_k variant only considers the top K results, useful when you only care about the first page of search results.

ranx — Production-Grade MAP Evaluation with TREC Format47 lines

from ranx import Qrels, Run, evaluate

# Define ground truth relevance judgments (qrels)
# Format: {query_id: {doc_id: relevance_score}}
qrels_dict = {
    "q1": {"doc_a": 1, "doc_b": 0, "doc_c": 1, "doc_d": 1, "doc_e": 0},
    "q2": {"doc_f": 1, "doc_g": 1, "doc_h": 0, "doc_i": 1, "doc_j": 0},
    "q3": {"doc_k": 0, "doc_l": 1, "doc_m": 1, "doc_n": 0, "doc_o": 1},
}
qrels = Qrels(qrels_dict)

# Define system results (run)
# Format: {query_id: {doc_id: retrieval_score}}
# Higher score = ranked higher
run_dict = {
    "q1": {"doc_a": 0.95, "doc_c": 0.80, "doc_d": 0.70, "doc_b": 0.50, "doc_e": 0.30},
    "q2": {"doc_f": 0.90, "doc_g": 0.85, "doc_i": 0.75, "doc_h": 0.60, "doc_j": 0.40},
    "q3": {"doc_l": 0.88, "doc_o": 0.82, "doc_m": 0.70, "doc_k": 0.55, "doc_n": 0.35},
}
run = Run(run_dict)

# Compute MAP and MAP@K
results = evaluate(
    qrels,
    run,
    metrics=["map", "map@5", "map@10", "mrr", "ndcg@10"]
)

print("MAP:", results["map"])
print("MAP@5:", results["map@5"])
print("MAP@10:", results["map@10"])
print("MRR:", results["mrr"])
print("NDCG@10:", results["ndcg@10"])

# Compare two systems with statistical testing
from ranx import compare

run_bm25 = Run(run_dict, name="BM25")
run_neural = Run(run_dict, name="NeuralRetriever")  # different scores

report = compare(
    qrels=qrels,
    runs=[run_bm25, run_neural],
    metrics=["map@100", "ndcg@10", "mrr@10"],
    stat_test="paired_t",
)
print(report)

The ranx library is purpose-built for IR evaluation. It handles TREC-format qrels natively, computes MAP and 20+ other metrics efficiently (using Numba JIT compilation), and includes statistical significance testing (paired t-test, bootstrap test, Fisher randomization) out of the box. The compare function generates a formatted report showing which system is significantly better. This is the recommended approach for production evaluation pipelines.

pytrec_eval — Python Wrapper Around TREC's Official trec_eval39 lines

import pytrec_eval
import json

# Qrels in TREC format: {query_id: {doc_id: relevance}}
qrels = {
    "q1": {"d1": 1, "d2": 0, "d3": 1, "d4": 1, "d5": 0, "d6": 1, "d7": 0},
    "q2": {"d8": 1, "d9": 1, "d10": 0, "d11": 1, "d12": 0, "d13": 0, "d14": 1},
}

# Run in TREC format: {query_id: {doc_id: score}}
run = {
    "q1": {"d1": 5.0, "d3": 4.5, "d2": 4.0, "d4": 3.5, "d5": 3.0, "d6": 2.5, "d7": 2.0},
    "q2": {"d8": 5.0, "d9": 4.8, "d11": 4.5, "d14": 4.0, "d10": 3.5, "d12": 3.0, "d13": 2.5},
}

# Create evaluator with desired metrics
evaluator = pytrec_eval.RelevanceEvaluator(
    qrels,
    {"map", "map_cut_5", "map_cut_10", "P_5", "recall_10", "ndcg_cut_10"}
)

# Evaluate
results = evaluator.evaluate(run)

# Print per-query results
for query_id, metrics in sorted(results.items()):
    print(f"\n{query_id}:")
    print(f"  MAP: {metrics['map']:.4f}")
    print(f"  MAP@5: {metrics['map_cut_5']:.4f}")
    print(f"  MAP@10: {metrics['map_cut_10']:.4f}")
    print(f"  P@5: {metrics['P_5']:.4f}")
    print(f"  Recall@10: {metrics['recall_10']:.4f}")
    print(f"  NDCG@10: {metrics['ndcg_cut_10']:.4f}")

# Compute mean across queries
import numpy as np
for metric in ["map", "map_cut_5", "map_cut_10"]:
    values = [results[q][metric] for q in results]
    print(f"\nMean {metric}: {np.mean(values):.4f}")

pytrec_eval is a Python wrapper around the official TREC evaluation tool (trec_eval). It produces identical results to the C implementation used by TREC organizers, making it the gold standard for reproducible evaluation. The metric names follow TREC conventions: map is non-interpolated MAP, map_cut_5 is MAP@5, P_5 is Precision@5. Use this when you need exact compatibility with TREC evaluation.

scikit-learn — Average Precision for Binary Classification43 lines

from sklearn.metrics import average_precision_score
import numpy as np

# In sklearn, AP is computed from prediction scores and binary labels
# This is for binary classification, not ranked retrieval
# But it's equivalent when you have one query

# Ground truth: binary relevance labels
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0, 0, 0])

# Predicted scores: higher = more likely relevant
y_scores = np.array([0.95, 0.80, 0.78, 0.70, 0.55, 0.40, 0.35, 0.25, 0.15, 0.05])

# Compute AP (area under the precision-recall curve)
ap = average_precision_score(y_true, y_scores)
print(f"Average Precision: {ap:.4f}")
# Output: Average Precision: 0.8393

# For MAP across multiple queries, compute AP per query and average
def compute_map_sklearn(queries):
    """Compute MAP using sklearn.
    
    Args:
        queries: list of (y_true, y_scores) tuples, one per query.
    """
    ap_scores = []
    for y_true, y_scores in queries:
        ap = average_precision_score(y_true, y_scores)
        ap_scores.append(ap)
    return np.mean(ap_scores)

queries = [
    (np.array([1, 0, 1, 1, 0]), np.array([0.9, 0.8, 0.7, 0.6, 0.5])),
    (np.array([1, 1, 0, 0, 1]), np.array([0.95, 0.85, 0.75, 0.3, 0.2])),
]

map_score = compute_map_sklearn(queries)
print(f"MAP: {map_score:.4f}")

# NOTE: sklearn's AP uses a slightly different formula than trec_eval
# sklearn uses trapezoidal approximation of the PR curve area
# trec_eval uses the sum-of-precisions formula
# Results differ slightly -- use pytrec_eval for TREC-compatible MAP

Scikit-learn's average_precision_score computes the area under the precision-recall curve using trapezoidal integration. This is slightly different from the TREC-standard AP formula (sum of precisions at relevant positions divided by total relevant documents). For most practical purposes the difference is small, but if you need exact TREC compatibility, use pytrec_eval or ranx instead. Sklearn's version is convenient for quick prototyping but is not designed for multi-query retrieval evaluation.

Configuration Example25 lines

# Example: pytrec_eval metrics configuration
# Standard TREC evaluation metrics set
metrics:
  primary:
    - map                # Non-interpolated MAP (full list)
    - map_cut_10         # MAP@10
    - map_cut_100        # MAP@100
  secondary:
    - P_5                # Precision@5
    - P_10               # Precision@10
    - recall_100         # Recall@100
    - ndcg_cut_10        # NDCG@10 (for comparison)
    - recip_rank         # MRR (for comparison)
  
# Statistical significance testing
significance:
  test: paired_t         # or bootstrap, fisher_randomization
  alpha: 0.05
  correction: bonferroni # for multiple comparisons

# Evaluation parameters
evaluation:
  qrels_format: trec     # TREC qrels format
  depth: 1000            # Evaluate top 1000 results
  relevance_threshold: 1 # Binary: >= 1 is relevant

Common Implementation Mistakes

●
Using the wrong denominator in AP: The denominator should be the total number of relevant documents for the query (including those not retrieved), not just the count of relevant documents in the returned list. Using the wrong denominator inflates AP for systems with low recall.
●
Confusing MAP with mAP in object detection: In computer vision, mAP (mean Average Precision) uses IoU-based matching with precision-recall curves over confidence thresholds. In information retrieval, MAP uses rank-based precision over binary relevance. Same name, different computation. Be explicit about which domain you mean.
●
Assuming MAP handles graded relevance: MAP is strictly a binary relevance metric. If you have graded labels (0-4 scale), MAP treats everything above your threshold as relevant=1 and everything at or below as relevant=0. For graded relevance, use NDCG instead.
●
Not accounting for unjudged documents: In TREC evaluations, documents not in the qrels file are assumed irrelevant. If your retrieval system returns novel documents that happen to be relevant but were not judged, MAP will penalize them. This is the 'pooling bias' problem.
●
Comparing MAP across datasets with different recall bases: MAP of 0.40 on a dataset with 100 relevant documents per query is very different from MAP of 0.40 on a dataset with 3 relevant documents per query. Always report the dataset and its characteristics alongside MAP scores.
●
Using sklearn's AP when TREC-standard AP is needed: Sklearn uses trapezoidal PR curve area, not the sum-of-precisions formula. The difference is usually small but can matter for reproducibility. Use pytrec_eval or ranx for TREC-compatible MAP.

When Should You Use This?

Use When

Your relevance labels are binary (relevant/not relevant) and you do not have or do not need graded relevance judgments -- MAP is the natural metric for binary retrieval evaluation
You care about both precision and recall across the entire ranked list, not just a fixed cutoff -- MAP summarizes the full precision-recall tradeoff in a single number
You are evaluating a retrieval system for academic benchmarks or TREC-style evaluations where MAP is the standard primary metric
You want to compare multiple retrieval systems on the same dataset with statistical significance testing -- MAP has well-understood statistical properties (stability and discrimination)
You need to penalize systems that miss relevant documents (low recall) -- MAP's denominator naturally penalizes incomplete retrieval, unlike Precision@K
Your evaluation dataset has queries with varying numbers of relevant documents and you need a metric that handles this variation robustly

Avoid When

Relevance is graded (e.g., 0-4 scale) rather than binary -- MAP cannot distinguish between 'highly relevant' and 'somewhat relevant' documents. Use NDCG instead.
You only care about the first relevant result (navigational queries, QA systems) -- use MRR (Mean Reciprocal Rank) instead, which is simpler and more interpretable for this case
Your task requires position-weighted scoring with logarithmic discount -- MAP weights by rank position but not with the specific log discount that NDCG uses. For web search where top-3 matters far more than positions 8-10, NDCG may better match user behavior
You are working in object detection or computer vision -- the mAP used there (IoU-based, confidence threshold sweep) is a different metric. Do not mix them up.
You do not have ground-truth relevance labels and cannot afford to collect them -- MAP requires labels. Consider implicit metrics like click-through rate instead.
Your system returns a fixed-size set without meaningful ordering (e.g., set retrieval) -- MAP rewards ordering, so it adds no value over set-based metrics like F1

Key Tradeoffs

Binary vs. Graded Relevance: The Fundamental Tradeoff

The biggest decision point is whether your relevance judgments are binary or graded. MAP assumes binary relevance: a document is either relevant (1) or not (0). This makes annotation cheaper and faster (annotators make a yes/no decision instead of choosing from 5 levels), but it loses information.

Consider product search: a user searching for "iPhone 15 Pro" finds (a) the exact product, (b) an iPhone 15 Pro case, and (c) an iPhone 14 Pro. All three might be marked "relevant" in a binary scheme, but they clearly have different relevance levels. MAP treats them identically. NDCG would distinguish them.

Rule of thumb: If relevance gradations meaningfully affect user satisfaction (search engines, product recommendations), invest in graded labels and use NDCG. If relevance is naturally binary (duplicate detection, known-item retrieval, QA correctness), use MAP.

MAP vs. MAP@K: Full List vs. Cutoff

Variant	Pros	Cons
MAP (full list)	Captures full retrieval quality including recall	Sensitive to deep-list behavior that users never see
MAP@10	Focuses on user-visible results	Ignores recall beyond position 10
MAP@100	Good compromise for re-ranking pipelines	Still evaluates positions users rarely reach
MAP@1000	TREC standard for ad-hoc retrieval	Includes very deep positions

Choose the cutoff based on your application: MAP@10 for user-facing search, MAP@100 for candidate retrieval stages, MAP@1000 for academic benchmarks.

Cost of Annotation

Binary labels cost roughly INR 30-80 per judgment (about $0.35-$ 1.00 USD) on crowdsourcing platforms. Graded labels (0-4) cost INR 50-150 per judgment ( $0.60-$ 1.80 USD) because annotators need more time and training. For a 500-query evaluation set with 50 judged documents per query:

Binary labels: 25,000 judgments at ~~INR 50 = INR 12.5 lakh (~~$15,000 USD)
Graded labels: 25,000 judgments at ~~INR 100 = INR 25 lakh (~~$30,000 USD)

If budget is limited and binary relevance is a reasonable approximation, MAP saves you 50% on annotation costs compared to NDCG.

Key Insight: MAP is the right metric when relevance is genuinely binary, annotation budget is constrained, and you need a well-understood metric with strong statistical properties. When relevance has meaningful gradations and annotation budget allows, NDCG is more informative.

Alternatives & Comparisons

NDCG (Normalized Discounted Cumulative Gain)

NDCG supports graded relevance (0-4 scale) and applies a logarithmic position discount, making it more informative when relevance has meaningful gradations. MAP is limited to binary relevance but has stronger theoretical properties for binary tasks (it equals the area under the precision-recall curve). Use NDCG for web search and recommendations with graded labels; use MAP for binary retrieval tasks and TREC-style evaluations.

MRR (Mean Reciprocal Rank)

MRR measures the average inverse rank of the first relevant result (1/rank). It is ideal for tasks with a single correct answer (navigational search, factoid QA, known-item retrieval) but ignores all relevant documents after the first. MAP considers all relevant documents and their positions. Use MRR when there is one correct answer; use MAP when multiple relevant documents matter.

Precision@K

Precision@K counts the fraction of relevant items in the top K results. It is position-unaware within the top K: swapping items within K does not change the score. MAP is position-aware -- relevant items ranked higher contribute more to the score. Precision@K is simpler and more interpretable ('3 of 5 results were relevant') but less informative. Use Precision@K for quick sanity checks; use MAP for rigorous evaluation.

Recall@K

Recall@K measures what fraction of all relevant documents appear in the top K. It focuses purely on coverage (did you find them?) without considering order. MAP naturally incorporates recall (its denominator penalizes missed relevant documents) while also rewarding good ordering. Use Recall@K when evaluating a candidate retrieval stage where order does not matter yet; use MAP when evaluating the final ranked list.

Pros, Cons & Tradeoffs

Advantages

Captures the full precision-recall tradeoff in a single number: MAP approximates the area under the precision-recall curve, giving you a holistic view of retrieval quality without having to inspect the full PR curve
Penalizes missed relevant documents: Because the AP denominator is the total number of relevant documents (not just those retrieved), systems with low recall are naturally penalized -- you cannot cheat MAP by only returning easy-to-find documents
Position-sensitive: Relevant documents ranked higher contribute more to AP than those ranked lower. This aligns with user behavior -- users examine results from top to bottom
Strong statistical properties: Buckley and Voorhees (2004) showed MAP has excellent discrimination (can distinguish systems with small quality differences) and stability (system rankings are consistent across different query subsets). This makes it reliable for system comparison
Well-understood and widely adopted: MAP has been the primary metric in TREC evaluations since the 1990s. Extensive tooling support (trec_eval, pytrec_eval, ranx, ir_measures) and decades of published results make it easy to benchmark against prior work
Cheap to annotate: Binary relevance labels (relevant/not) are faster and cheaper to collect than graded labels (0-4 scale), making MAP practical for resource-constrained evaluation campaigns
Handles varying query difficulty: By averaging AP across queries, MAP normalizes for the fact that some queries have 3 relevant documents while others have 300. Each query contributes equally to the final score

Disadvantages

Binary relevance only: MAP cannot distinguish between 'somewhat relevant' and 'highly relevant' documents. A barely relevant document at position 1 contributes the same as a perfect match. For tasks where relevance gradations matter, this is a significant limitation
Sensitive to the total number of relevant documents: If the qrels (relevance judgments) are incomplete -- common in large collections where not all documents can be judged -- MAP can be misleading. Unjudged documents are assumed irrelevant, penalizing systems that retrieve novel relevant documents outside the judgment pool
Does not have a natural logarithmic discount: Unlike NDCG, MAP does not apply an explicit position discount function. The position sensitivity comes indirectly from precision computation, which is less intuitive and does not model user attention decay as explicitly
Heavily penalizes early mistakes: A single irrelevant document at position 1 drastically reduces all subsequent precision values, cascading through the AP computation. This can make MAP unstable for queries where the top result is ambiguous
Not ideal for modern web search UX: In modern search interfaces where users primarily see 3-5 results on mobile, MAP's full-list evaluation may not align with actual user experience. NDCG@5 or NDCG@10 may be more appropriate
Requires ground-truth labels: Like all offline metrics, MAP needs human-annotated relevance judgments. You cannot compute MAP from user behavior alone without converting clicks/engagement to binary relevance labels (which introduces bias)

Always report Recall@K alongside MAP. Use MAP in conjunction with recall metrics to get a complete picture. For retrieval stages that feed into re-rankers, prioritize recall-focused metrics. Consider using MAP with a deep cutoff (MAP@1000) to capture recall behavior.

Placement in an ML System

Where Does MAP Fit in the ML Pipeline?

MAP is an evaluation metric, not a serving component. It sits in the evaluation and monitoring layer, outside the inference path. Here is where it appears:

Offline model development: After training a retrieval model (BM25, dense retriever, learning-to-rank), compute MAP on a held-out test set to measure quality. MAP is the primary metric for model selection: the model with the highest MAP (statistically significantly) on the test set gets promoted.

Benchmark evaluation: When publishing results on IR benchmarks (TREC, BEIR, MS MARCO), report MAP alongside other metrics (NDCG, MRR, Recall). MAP is often the primary metric in TREC ad-hoc tracks and is required for BEIR evaluation.

A/B testing: When deploying a new retrieval model, compute MAP on a sample of live traffic (using post-hoc relevance judgments from clicks or annotations). Compare with the baseline model's MAP to decide whether to promote the new model.

Monitoring: Track MAP on a fixed set of 'canary queries' with known relevance labels. If MAP drops below a threshold, trigger an alert -- the retrieval model may have degraded due to index corruption, feature drift, or infrastructure issues.

Relationship to other metrics: MAP and NDCG are often computed side by side. MAP is the primary metric when relevance is binary; NDCG is primary when relevance is graded. For a comprehensive evaluation, report both along with Precision@K and Recall@K to give a complete picture.

Key Insight: MAP is to binary retrieval evaluation what NDCG is to graded ranking evaluation. It is the single most informative metric for binary relevance tasks, and it has served as the primary evaluation metric in the IR research community for over 30 years.

Pipeline Stage

Evaluation / Metrics

Upstream

precision-at-k
recall-at-k
search-engine
semantic-search
vector-store

Downstream

ndcg-metric
mrr-metric

Scaling Bottlenecks

Computational Cost

MAP computation itself is extremely fast: $O(n)$ per query where $n$ is the length of the ranked list (a single pass through the list, checking relevance at each position). For 10,000 queries with 1,000 results each, MAP computation takes well under a second on a single CPU core. This is never the bottleneck.

The real bottleneck is relevance judgment acquisition:

Binary annotations via crowdsourcing: INR 30-80 per judgment (~ $0.35-$ 1.00 USD)
For 500 queries x 50 judged docs = 25,000 judgments
Cost: INR 7.5-20 lakh ( $9,000-$ 24,000 USD)
Time: 2-4 weeks with a team of 10-20 annotators
Quality control (gold questions, inter-annotator agreement): adds 20-30% overhead

Scaling Annotation for Production Systems

For systems with millions of queries, exhaustive annotation is impossible. Strategies:

Traffic-weighted sampling: Sample queries proportional to their frequency. The top 1% of queries often account for 30-50% of traffic (power law). Focus annotation effort there.
Active learning for annotation: Train an initial model, identify queries where the model is uncertain or where system variants disagree, and preferentially annotate those.
Implicit feedback as proxy: Use clicks, purchases, or dwell time as noisy binary relevance labels. A click implies relevance, no click implies irrelevance (with position bias correction).
Periodic re-annotation: Budget a fixed annotation spend per quarter (e.g., INR 5 lakh/quarter) and continuously annotate new queries to keep the evaluation set fresh.

Online MAP Estimation

For A/B testing, you can estimate MAP online using interleaved experiments. Serve results from two systems interleaved and measure which system's results users prefer (click on). This avoids the need for explicit relevance annotations but requires careful handling of position bias.

Production Case Studies

NIST / TREC Ad-Hoc TracksInformation Retrieval Research

TREC (Text REtrieval Conference) has used MAP as its primary evaluation metric for ad-hoc retrieval tasks since the early 1990s. Research groups from around the world submit retrieval runs on standardized document collections (e.g., Robust04, GOV2), and MAP is used to rank systems and identify significant improvements. The TREC Robust Track specifically chose MAP because of its demonstrated stability and discrimination properties, enabling reliable comparison of hundreds of systems across decades of evaluations.

Outcome:

MAP has enabled systematic tracking of IR progress over 30+ years. Retrieval effectiveness as measured by MAP has approximately doubled since TREC-1 in 1992, validating MAP as a reliable progress metric for the field.

BEIR Benchmark (NeurIPS 2021)Information Retrieval / NLP Research

The BEIR (Benchmarking IR) benchmark evaluates 10 state-of-the-art retrieval models across 18 diverse datasets in a zero-shot setting. BEIR uses NDCG@10 as its primary metric but also reports MAP alongside Recall@100 and Precision@10. The benchmark showed that BM25 remains a strong baseline (often outperforming dense retrievers on out-of-domain data), with MAP scores ranging from 0.15 to 0.55 depending on dataset difficulty. Late-interaction models like ColBERT achieved the best MAP on average.

Outcome:

BEIR established that zero-shot generalization of neural retrievers is a major challenge. MAP and NDCG together revealed that dense retrievers excel on in-domain data but struggle on out-of-distribution tasks, guiding research toward more robust retrieval architectures.

Facebook AI Research (DPR)Open-Domain Question Answering

Facebook's Dense Passage Retrieval (DPR) system was evaluated using MAP alongside top-K retrieval accuracy on Natural Questions, TriviaQA, and other QA datasets. DPR used dual-encoder BERT models to encode questions and passages into 768-dimensional vectors. The system achieved MAP improvements of 9-19% over BM25 on multiple datasets, demonstrating that dense retrieval can substantially outperform sparse methods for QA-style retrieval where semantic understanding matters more than keyword matching.

Outcome:

DPR's MAP improvements over BM25 validated dense retrieval as a viable alternative to sparse methods for QA. The paper (Karpukhin et al., EMNLP 2020) has been cited over 4,000 times and catalyzed the dense retrieval revolution that underpins modern RAG pipelines.

NetflixMusic Streaming (India)

Netflix's recommendation system uses multiple evaluation metrics including precision and ranking quality measures to optimize content recommendations. They employ extensive A/B testing and offline metrics before deploying models to production.

Outcome:

Netflix's multi-layer metric approach includes training objectives, offline metrics like MAP (Mean Average Precision), online A/B tests, and long-term member joy optimization. Their hybrid recommendation system combines pre-computed recommendations with real-time re-ranking.

Microsoft BingWeb Search

Microsoft Research created the MSLR (Microsoft Learning to Rank) datasets, which include MSLR-WEB10K and MSLR-WEB30K, containing features extracted from Bing's query-URL pairs. These datasets are evaluated using both MAP and NDCG. Microsoft's research demonstrated that LambdaMART models trained on these features achieved state-of-the-art MAP scores, and their internal A/B testing pipeline uses MAP alongside NDCG to evaluate ranking model candidates before deploying to Bing's live traffic.

Outcome:

The MSLR datasets with MAP evaluation have become standard benchmarks for learning-to-rank research, with over 2,000 citations. MAP helped identify that feature engineering (combining hundreds of signals) matters more than the choice of ranking algorithm for web search quality.

Tooling & Ecosystem

trec_eval

COpen Source

The official TREC evaluation tool written in C. Computes MAP and 50+ other retrieval metrics from TREC-format qrels and run files. This is the reference implementation -- all other tools aim to produce identical results. Used by virtually every IR research group.

pytrec_eval

Python / COpen Source

Python wrapper around the official trec_eval C library. Provides identical results to trec_eval with a Pythonic API. Supports MAP, MAP@K, NDCG, MRR, Precision, Recall, and all standard TREC metrics. Best for Python-based evaluation pipelines that need TREC-compatible results.

ranx

PythonOpen Source

A blazing-fast Python library for ranking evaluation and comparison, built on Numba for JIT-compiled speed. Supports MAP, NDCG, MRR, and 20+ metrics. Includes statistical significance testing (paired t-test, bootstrap, Fisher randomization) and formatted comparison reports. Best for production evaluation pipelines and benchmarking multiple systems.

ir_measures

PythonOpen Source

Unified Python interface to compute 20+ IR metrics by wrapping pytrec_eval, gdeval, and other providers. Standardizes metric names and parameters across different evaluation tools. Integrates seamlessly with PyTerrier for end-to-end retrieval experiments.

scikit-learn

PythonOpen Source

Python library providing average_precision_score for computing AP from prediction scores and binary labels. Uses trapezoidal PR curve approximation (slightly different from TREC-standard AP). Best for quick prototyping and binary classification contexts. Not recommended for multi-query retrieval evaluation.

Pyserini

Python / JavaOpen Source

Python toolkit for reproducible IR research built on Apache Lucene. Includes utilities for running BM25 and dense retrieval experiments and evaluating with MAP, NDCG, and other metrics on standard benchmarks (MS MARCO, TREC, BEIR). Provides end-to-end retrieval + evaluation workflows.

Research & References

Retrieval Evaluation with Incomplete Information

Buckley, C. & Voorhees, E. M. (2004)SIGIR 2004

The foundational paper on MAP's statistical properties. Showed that MAP has excellent discrimination and stability compared to other metrics, making it the standard for TREC evaluation. Also introduced bpref, a metric robust to incomplete judgments. Essential reading for understanding why MAP became the default IR metric.

BEIR: A Heterogeneous Benchmark for Zero-shot Evaluation of Information Retrieval Models

Thakur, N., Reimers, N., Rücklé, A., Srivastava, A. & Gurevych, I. (2021)NeurIPS 2021 (Datasets and Benchmarks Track)

Introduced the BEIR benchmark for zero-shot IR evaluation across 18 diverse datasets. Uses MAP as one of the standard evaluation metrics alongside NDCG@10. Showed that dense retrieval models struggle with out-of-distribution generalization, a finding enabled by consistent MAP evaluation across datasets.

Dense Passage Retrieval for Open-Domain Question Answering

Karpukhin, V., Oguz, B., Min, S., Lewis, P., Wu, L., Edunov, S., Chen, D. & Yih, W. (2020)EMNLP 2020

Introduced DPR (Dense Passage Retrieval) using dual-encoder BERT models. Evaluated using top-K accuracy and MAP on Natural Questions and TriviaQA. Demonstrated 9-19% MAP improvements over BM25, establishing dense retrieval as a viable alternative for open-domain QA and catalyzing the modern RAG pipeline.

Cumulated Gain-Based Evaluation of IR Techniques

Järvelin, K. & Kekäläinen, J. (2002)ACM Transactions on Information Systems (TOIS), Vol. 20, No. 4

The paper that introduced NDCG, MAP's primary alternative. Argued that binary relevance (as used in MAP) is insufficient for real-world IR evaluation and proposed graded relevance with position discounting. Understanding this paper is essential for knowing when to choose NDCG over MAP.

Introduction to Information Retrieval (Chapter 8: Evaluation in Information Retrieval)

Manning, C. D., Raghavan, P. & Schütze, H. (2008)Cambridge University Press

The standard textbook treatment of MAP and other IR evaluation metrics. Chapter 8 provides clear derivations of AP, MAP, interpolated precision, and the 11-point precision-recall curve. Available free online. The definitive pedagogical reference for MAP.

From RankNet to LambdaRank to LambdaMART: An Overview

Burges, C. J. (2010)Microsoft Research Technical Report MSR-TR-2010-82

Describes how learning-to-rank models can optimize ranking metrics including MAP. While LambdaMART primarily targets NDCG, the lambda gradient framework can also optimize MAP. Understanding this paper connects MAP evaluation to learning-to-rank training objectives.

Interview & Evaluation Perspective

Common Interview Questions

●
What is MAP and how is it different from Precision@K?
●
Walk me through the AP formula for a specific ranked list.
●
When would you use MAP vs. NDCG? What about MAP vs. MRR?
●
How does MAP handle queries with different numbers of relevant documents?
●
What happens to MAP when your relevance judgments are incomplete?
●
Your MAP is 0.45. Is that good? How would you improve it?
●
Explain the difference between interpolated and non-interpolated Average Precision.
●
How would you collect relevance labels for MAP evaluation in a production system?

Key Points to Mention

●
MAP summarizes the entire precision-recall curve into a single number by averaging precision at every rank where a relevant document appears, then averaging across queries. It rewards systems that rank relevant documents early.
●
The denominator in AP is the total number of relevant documents (not just those retrieved), which means MAP naturally penalizes low recall. This is a key advantage over Precision@K.
●
MAP assumes binary relevance. For graded relevance (0-4 scale), use NDCG instead. This is the most important decision point when choosing between the two metrics.
●
MAP has been the standard metric in TREC evaluations since the 1990s because of its strong statistical properties: high discrimination (can detect small quality differences) and stability (consistent system rankings across query subsets).
●
Always report the cutoff: MAP@10 for user-facing search, MAP@100 for retrieval pipelines, MAP@1000 for academic benchmarks. The cutoff should match how users interact with your system.
●
Binary annotation is cheaper than graded annotation (INR 30-80 vs. INR 50-150 per label), making MAP more practical when annotation budget is limited.

Pitfalls to Avoid

●
Confusing IR MAP with object detection mAP -- they are computed completely differently. Always clarify which domain you mean.
●
Claiming MAP works with graded relevance -- it does not. MAP is strictly binary. You can binarize graded labels but that loses information.
●
Forgetting that AP's denominator is total relevant documents (not just retrieved relevant). Getting this wrong shows a fundamental misunderstanding of the metric.
●
Not mentioning the pooling bias problem -- in TREC evaluations, unjudged documents are assumed irrelevant. A senior interviewer will expect you to know this limitation.
●
Saying 'MAP and NDCG are interchangeable' -- they are not. MAP is for binary relevance; NDCG is for graded relevance. The choice depends on your data and task.

Senior-Level Expectation

A senior candidate should discuss the full evaluation design: how to collect binary relevance labels cost-effectively (crowdsourcing strategy, annotation guidelines, inter-annotator agreement via Cohen's kappa), choice of evaluation cutoff (MAP@K) based on application UX, the pooling bias problem and how to mitigate it (deeper pools, infAP, manual re-annotation of novel results), when to use MAP vs. NDCG vs. MRR (binary vs. graded vs. single-answer tasks), statistical significance testing for comparing systems (paired t-test, bootstrap), and the relationship between MAP and the precision-recall curve (AP approximates the area under the PR curve). Senior engineers should also explain why MAP was the TREC standard for 30 years and when to consider alternatives like GMAP (geometric mean AP) for emphasizing hard queries or bpref for incomplete judgments.

Summary

Let us recap the key points about Mean Average Precision (MAP):

MAP (Mean Average Precision) is a ranking evaluation metric that summarizes the precision-recall tradeoff for binary retrieval tasks. It computes Average Precision (AP) for each query -- the mean of precision values at each position where a relevant document appears, divided by the total number of relevant documents -- and then averages AP across all queries.
The formula is $\text{AP}(q) = \frac{1}{R} \sum_{k=1}^{n} P(k) \cdot \text{rel}(d_k)$ and $\text{MAP} = \frac{1}{|Q|} \sum_{j} \text{AP}(q_j)$ , producing a score between 0 and 1 where 1 means perfect retrieval.
MAP assumes binary relevance: documents are either relevant (1) or not (0). This is its key distinction from NDCG, which supports graded relevance. Use MAP when labels are binary; use NDCG when labels are graded.
MAP naturally penalizes low recall because the denominator is the total number of relevant documents, not just those retrieved. Missing relevant documents implicitly contribute zero to the AP sum.
MAP has strong statistical properties: Buckley and Voorhees (2004) showed it has excellent discrimination and stability, which is why it has been the standard metric in TREC evaluations for over 30 years.
Non-interpolated AP is the modern standard (used by trec_eval, pytrec_eval, ranx). Interpolated 11-point AP is historical and overestimates precision.
Choose the cutoff K based on your application: MAP@10 for user-facing search, MAP@100 for retrieval pipelines, MAP@1000 for academic benchmarks.
Binary labels are cheaper: INR 30-80 per judgment vs. INR 50-150 for graded labels. When budget is constrained and binary relevance is a reasonable approximation, MAP is the pragmatic choice.

MAP is the gold standard for binary retrieval evaluation. It has served as the primary metric in the IR research community for three decades because it captures the full precision-recall tradeoff in a single, statistically robust number. If your relevance labels are binary and you are evaluating a retrieval system, MAP should be your starting point.

Concept Snapshot

Why This Concept Exists

The Problem with Single-Point Metrics

The Precision-Recall Curve and Its Limitations

The Birth of Average Precision

Why MAP Became the Standard

Core Intuition & Mental Model

The Elevator Pitch

Walking Through AP Step by Step

Why "Average" Appears Twice

The Area-Under-the-Curve Connection

Technical Foundations

Building MAP From First Principles

Step 1: Precision at Rank kkk

Step 2: Average Precision (AP)

Step 3: MAP (Mean Average Precision)

Step 4: MAP@K (MAP at Cutoff K)

Worked Example

Interpolated vs Non-Interpolated AP

Internal Architecture

Key Components

Data Flow

How to Implement

Three Approaches to Computing MAP

Common Implementation Mistakes

When Should You Use This?

Use When

Avoid When

Key Tradeoffs

Binary vs. Graded Relevance: The Fundamental Tradeoff

MAP vs. MAP@K: Full List vs. Cutoff

Cost of Annotation

Alternatives & Comparisons

Pros, Cons & Tradeoffs

Advantages

Disadvantages

Failure Modes & Debugging

Incomplete relevance judgments (pooling bias)

Zero relevant documents for a query

Binary threshold choice for graded relevance

Dominance by easy or hard queries

Position sensitivity masking recall issues

Placement in an ML System

Where Does MAP Fit in the ML Pipeline?

Pipeline Stage

Upstream

Downstream

Scaling Bottlenecks

Production Case Studies

Tooling & Ecosystem

Research & References

Interview & Evaluation Perspective

Common Interview Questions

Key Points to Mention

Pitfalls to Avoid

Senior-Level Expectation

Summary

Related Blocks & Further Reading

Related ML Blocks

Further Reading

Step 1: Precision at Rank $k$