Why Evaluation Metrics Exist — and Why Getting Them Wrong Is Expensive
Goodhart's Law and the Metric Trap
When a measure becomes a target, it ceases to be a good measure. This is Goodhart's Law, and it is the founding reason evaluation metrics must be chosen carefully. In 2012, a spam filter team at a major email provider optimised exclusively for accuracy. Their model hit 99% accuracy — because 99% of emails were not spam. The model learned to label everything as non-spam and was useless. The metric was technically correct. The model was not.
A Short History
Early ML practitioners borrowed metrics from statistics: mean squared error (1805, Legendre), correlation coefficients (1880s, Galton). Classification metrics emerged with the confusion matrix in the 1950s from signal detection theory. ROC curves were originally developed by radar engineers in World War II to measure how well operators distinguished signals from noise. Precision and recall came from information retrieval in the 1960s. BLEU arrived in 2002 for machine translation. BERTScore and model-graded evaluation are products of the transformer era post-2018. Each new paradigm forced new measurement.
The Core Question
Every metric answers a specific question about your model's behaviour. Before choosing one, ask:
- What does a wrong prediction cost — and does it cost the same in both directions?
- Is the problem constrained by a threshold (yes/no) or by a ranking (best 10 results)?
- Are the classes balanced or imbalanced?
- Are you measuring a model in isolation or in the context of a downstream business goal?
The metric you choose shapes every training decision downstream. Choose it before you touch the data.
The Confusion Matrix — The Foundation of Everything
What It Is
For any binary classifier, every prediction falls into one of four buckets. The confusion matrix makes all four explicit at once.
- True Positive (TP): Model said positive. It was positive. Correct.
- True Negative (TN): Model said negative. It was negative. Correct.
- False Positive (FP) — Type I Error: Model said positive. It was negative. False alarm.
- False Negative (FN) — Type II Error: Model said negative. It was positive. Miss.
Why This Matters Before Any Other Metric
Every single classification metric derives from these four numbers. Accuracy, precision, recall, F1, and MCC are all functions of TP, TN, FP, and FN, each weighting the four cells differently. Understanding the confusion matrix means you understand what every other metric is hiding or highlighting.
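As a minimal sketch of that claim, the snippet below derives four common metrics from the raw counts alone. The counts are illustrative values assumed for this example, not results from a real model.

```python
# Illustrative confusion-matrix counts (assumed for this sketch).
tp, tn, fp, fn = 3, 3, 1, 1

total = tp + tn + fp + fn
accuracy = (tp + tn) / total      # fraction of all predictions that were correct
precision = tp / (tp + fp)        # of the predicted positives, how many were real
recall = tp / (tp + fn)           # of the real positives, how many were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(f"accuracy={accuracy:.2f}, precision={precision:.2f}, recall={recall:.2f}, f1={f1:.2f}")
```

Changing a single count and re-running shows which metrics react and which ignore the change, for example precision never looks at FN or TN.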
from sklearn.metrics import confusion_matrix
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"TP={tp}, TN={tn}, FP={fp}, FN={fn}")
Classification Metrics
Accuracy
Accuracy tells you the fraction of all predictions that were correct. It is intuitive and easy to report. It is also the most misused metric in ML. On an imbalanced dataset — say, 1% fraud, 99% legitimate — a model that predicts "not fraud" for every transaction achieves 99% accuracy while being completely useless. Never use accuracy as your primary metric on imbalanced data. It should only be your primary metric when classes are roughly balanced and both types of errors cost the same.
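The imbalance problem is easy to reproduce. The sketch below uses synthetic labels (an assumption for illustration: 10 fraud cases in 1,000 transactions) and a "model" that always predicts the majority class.

```python
from sklearn.metrics import accuracy_score, recall_score

# Synthetic labels (assumed for illustration): 10 fraud cases in 1,000 transactions.
y_true_imb = [1] * 10 + [0] * 990
# A "model" that predicts "not fraud" for every transaction.
y_pred_majority = [0] * 1000

acc = accuracy_score(y_true_imb, y_pred_majority)
rec = recall_score(y_true_imb, y_pred_majority)
print(f"Accuracy: {acc:.2f}, Recall: {rec:.2f}")
```

Accuracy comes out at 0.99 while recall is 0.00: the model never finds a single fraud case.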
Precision and Recall
Precision answers "when you raise the alarm, how often are you right?" A spam filter with high precision almost never sends real emails to spam. A cancer screening tool with high precision almost never sends a healthy patient to biopsy.
Recall answers "how many of the real positives did you find?" A cancer screening tool with high recall almost never misses a real tumour. A fraud detector with high recall almost never lets fraud through.
The Fundamental Tradeoff
Precision and recall trade off directly: lowering your decision threshold captures more true positives (higher recall) but also more false positives (lower precision). Raising the threshold does the opposite.
- When false negatives are more costly (cancer detection, fraud, structural failure): optimise for recall. Missing a real case is worse than a false alarm.
- When false positives are more costly (spam filter, content moderation, treatment approval): optimise for precision. A false alarm is worse than a miss.
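The tradeoff can be seen by sweeping the threshold over a fixed set of scores. This is a small sketch on toy labels and scores (assumed for illustration); the helper name `precision_recall_at` is hypothetical.

```python
# Toy labels and model scores (assumed for illustration).
labels = [1, 0, 1, 1, 0, 1, 0, 0]
scores = [0.9, 0.3, 0.8, 0.4, 0.2, 0.95, 0.6, 0.1]

def precision_recall_at(threshold):
    """Precision and recall when every score >= threshold is predicted positive."""
    preds = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

for t in (0.25, 0.5, 0.85):
    p, r = precision_recall_at(t)
    print(f"threshold={t:.2f}  precision={p:.2f}  recall={r:.2f}")
```

On this toy data, recall falls from 1.00 to 0.50 as the threshold rises, while precision climbs from 0.67 to 1.00.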
from sklearn.metrics import precision_score, recall_score
precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)
print(f"Precision: {precision:.2f}, Recall: {recall:.2f}")
F1-Score and the F-Beta Family
F1 is the harmonic mean of precision and recall. It is penalised hard when either one is low — a model with precision 0.9 and recall 0.1 scores only F1 = 0.18. This makes F1 a useful single-number summary when both errors matter, especially on imbalanced classes.
F-Beta generalises this with a parameter β, chosen so that recall counts β times as much as precision:
- F2 (β = 2) weights recall twice as heavily. Use this in medical screening or fraud detection where missing a positive is catastrophic.
- F0.5 (β = 0.5) weights precision more heavily. Use this in recommendation systems where irrelevant suggestions are costly.
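To make the weighting concrete, here is a small sketch using the standard definition F_β = (1 + β²)·P·R / (β²·P + R). The helper name `f_beta` is hypothetical; the precision and recall values are the ones from the F1 example above.

```python
def f_beta(precision, recall, beta=1.0):
    """F-beta from precision and recall; beta > 1 favours recall, beta < 1 favours precision."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

p, r = 0.9, 0.1  # high precision, low recall
print(f"F1   = {f_beta(p, r, 1.0):.2f}")
print(f"F2   = {f_beta(p, r, 2.0):.2f}")
print(f"F0.5 = {f_beta(p, r, 0.5):.2f}")
```

Because recall is the weak side here, the recall-heavy F2 scores lowest and the precision-heavy F0.5 scores highest.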
from sklearn.metrics import fbeta_score
f1 = fbeta_score(y_true, y_pred, beta=1)
f2 = fbeta_score(y_true, y_pred, beta=2) # recall-heavy
f05 = fbeta_score(y_true, y_pred, beta=0.5) # precision-heavy
ROC Curve and AUC
ROC (Receiver Operating Characteristic) originated in radar signal detection during World War II, where operators needed to distinguish real aircraft from noise. The curve plots True Positive Rate (Recall) on the y-axis against False Positive Rate on the x-axis as the decision threshold sweeps from 0 to 1.
AUC = 0.5 means random. AUC = 1.0 means perfect. AUC = 0.85 means the model ranks a random positive above a random negative 85% of the time. AUC is threshold-independent and works reasonably on moderately imbalanced data, but misleads on severely imbalanced data: the huge number of true negatives keeps FPR small, so the curve looks good even when most of the model's positive predictions are false alarms.
When to prefer the PR curve over ROC
On highly imbalanced datasets (fraud: 0.1%, rare disease: 0.01%), use the Precision-Recall curve instead. It focuses on the positive class only and does not benefit from the large number of true negatives. A model with AUC-ROC of 0.97 can have AUC-PR of 0.10 on the same dataset — the second number is the honest one.
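The gap between the two summaries can be reproduced on synthetic data. The sketch below assumes Gaussian scores with 50 positives among 50,000 negatives (about 0.1% prevalence); the distributions and seed are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

rng = np.random.default_rng(0)

# Synthetic scores (assumed): 50 positives among 50,000 negatives.
neg_scores = rng.normal(loc=0.0, scale=1.0, size=50_000)
pos_scores = rng.normal(loc=2.5, scale=1.0, size=50)

y_imb = np.concatenate([np.zeros(50_000, dtype=int), np.ones(50, dtype=int)])
s_imb = np.concatenate([neg_scores, pos_scores])

roc_auc_imb = roc_auc_score(y_imb, s_imb)
pr_auc_imb = average_precision_score(y_imb, s_imb)
print(f"ROC-AUC: {roc_auc_imb:.3f}, PR-AUC: {pr_auc_imb:.3f}")
```

The exact numbers depend on the seed, but on data like this ROC-AUC stays high while average precision is far lower, because even a tiny false positive rate produces many more false alarms than there are positives.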
from sklearn.metrics import roc_auc_score, average_precision_score
y_scores = [0.9, 0.3, 0.8, 0.4, 0.2, 0.95, 0.6, 0.1]
roc_auc = roc_auc_score(y_true, y_scores) # threshold-independent
pr_auc = average_precision_score(y_true, y_scores) # better for imbalanced
print(f"ROC-AUC: {roc_auc:.3f}, PR-AUC: {pr_auc:.3f}")
Log Loss (Cross-Entropy Loss)
Log loss measures the quality of a model's probability estimates, not just its binary decisions. A model that predicts 0.51 when the true label is 1 is penalised much less than a model that predicts 0.99 when the true label is 0. Use log loss when probability calibration matters — risk scoring, clinical decision support, ad bid pricing. A model that is directionally right but overconfident can have higher log loss than a more calibrated but less "accurate" model.
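The two cases in that comparison can be worked through directly. This sketch computes the per-example binary cross-entropy with the natural logarithm; the helper name `log_loss_single` is hypothetical.

```python
import math

def log_loss_single(y, p):
    """Binary cross-entropy for one example: y is the true label, p is the predicted P(y = 1)."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

hesitant_right = log_loss_single(1, 0.51)   # barely correct, low confidence
confident_wrong = log_loss_single(0, 0.99)  # confidently wrong

print(f"p=0.51, label=1: {hesitant_right:.3f}")
print(f"p=0.99, label=0: {confident_wrong:.3f}")
```

The confident mistake costs roughly seven times as much as the hesitant correct prediction (about 4.6 versus 0.67).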
Matthews Correlation Coefficient (MCC)
MCC ranges from -1 (completely wrong) to +1 (perfect) with 0 being random. It is widely considered one of the most informative single metrics for binary classification because it uses all four confusion matrix cells symmetrically. Unlike F1 and accuracy, MCC only reports a high score when a model performs well across all four cells simultaneously. A 2020 paper by Chicco and Jurman in BMC Genomics argued that accuracy and F1 can both mislead on imbalanced data while MCC is more reliable. Use MCC when you want a single number that cannot be gamed by predicting the majority class.
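For reference, the standard definition in terms of the four confusion-matrix cells is given below. The worked numbers use illustrative counts (TP = 3, TN = 3, FP = 1, FN = 1) assumed for this example.

$$\text{MCC} = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}}$$

$$\text{MCC} = \frac{3 \cdot 3 - 1 \cdot 1}{\sqrt{4 \cdot 4 \cdot 4 \cdot 4}} = \frac{8}{16} = 0.5$$

A model that always predicts the majority (negative) class has TP = FP = 0, so the numerator is zero; the denominator is zero as well, and MCC is conventionally reported as 0 in that case, whatever the accuracy.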
from sklearn.metrics import matthews_corrcoef
mcc = matthews_corrcoef(y_true, y_pred)
print(f"MCC: {mcc:.3f}") # 0 = random, 1 = perfect, -1 = perfectly wrong
Regression Metrics
MAE — Mean Absolute Error
MAE is the average size of your errors, in the same units as your target variable. It weights errors linearly: being off by 10 is exactly twice as bad as being off by 5. It is more robust to outliers than squared-error metrics and easy to explain to stakeholders. If you are predicting delivery times in minutes, an MAE of 8 means your predictions are off by 8 minutes on average. Use MAE when outliers are common and you do not want them to dominate the metric.
MSE and RMSE — Mean Squared Error and Root MSE
MSE squares each error before averaging, which means large errors are penalised disproportionately. A prediction that is off by 20 contributes 4× as much as one off by 10. RMSE returns this to the original units. Use MSE/RMSE when large errors are especially undesirable — financial forecasting, energy grid prediction, or any domain where a single catastrophic miss is worse than many small misses. The tradeoff: outliers in your dataset will inflate RMSE significantly, possibly making a good model look worse than it is.
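A quick way to see the difference is to compare the two metrics on the same errors with and without one large miss. The error values below are made up for illustration.

```python
import math

def mae(errors):
    """Mean absolute error from a list of signed errors."""
    return sum(abs(e) for e in errors) / len(errors)

def rmse(errors):
    """Root mean squared error from a list of signed errors."""
    return math.sqrt(sum(e * e for e in errors) / len(errors))

uniform_errors = [10, 10, 10, 10, 10]   # every prediction off by 10
one_big_miss = [10, 10, 10, 10, 100]    # same, but one catastrophic miss

for name, errs in [("uniform", uniform_errors), ("one big miss", one_big_miss)]:
    print(f"{name:>12}: MAE={mae(errs):.1f}, RMSE={rmse(errs):.1f}")
```

With uniform errors both metrics equal 10. The single large miss moves MAE to 28 but RMSE to about 45.6, which is the disproportionate penalty described above.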
from sklearn.metrics import mean_absolute_error, mean_squared_error
import numpy as np
y_true_reg = [100, 200, 150, 300, 250]
y_pred_reg = [110, 190, 160, 280, 260]
mae = mean_absolute_error(y_true_reg, y_pred_reg)
rmse = np.sqrt(mean_squared_error(y_true_reg, y_pred_reg))
print(f"MAE: {mae:.1f}, RMSE: {rmse:.1f}")
MAPE — Mean Absolute Percentage Error
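MAPE expresses error as a percentage of the actual value, making it scale-independent and easy to communicate across domains. A MAPE of 5% means you are typically off by 5% of the real value. The critical weaknesses: MAPE is undefined when any actual value is zero (division by zero), it explodes when actual values are close to zero, and it is asymmetric, because an underestimate of a positive value can cost at most 100% (for non-negative forecasts) while an overestimate is unbounded. Symmetric MAPE (sMAPE) reduces the asymmetry; one common definition is:
$$\text{sMAPE} = \frac{100\%}{n} \sum_{i=1}^{n} \frac{|y_i - \hat{y}_i|}{(|y_i| + |\hat{y}_i|)/2}$$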
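The sketch below compares how the two measures respond when the actual and forecast values swap roles. It uses one common sMAPE variant (absolute error divided by the mean of the absolute actual and forecast values); other variants exist, so treat the definition as an assumption.

```python
def mape(actual, forecast):
    """Mean absolute percentage error; assumes no actual value is zero."""
    return 100 * sum(abs(a - f) / abs(a) for a, f in zip(actual, forecast)) / len(actual)

def smape(actual, forecast):
    """One common sMAPE variant: absolute error over the mean of |actual| and |forecast|."""
    return 100 * sum(
        abs(a - f) / ((abs(a) + abs(f)) / 2) for a, f in zip(actual, forecast)
    ) / len(actual)

# The same absolute miss of 50 units, in both directions.
print(f"MAPE  actual=100, forecast=150: {mape([100], [150]):.1f}%")
print(f"MAPE  actual=150, forecast=100: {mape([150], [100]):.1f}%")
print(f"sMAPE actual=100, forecast=150: {smape([100], [150]):.1f}%")
print(f"sMAPE actual=150, forecast=100: {smape([150], [100]):.1f}%")
```

MAPE reports 50.0% in one direction and 33.3% in the other for the same 50-unit miss, while this sMAPE variant reports 40.0% both ways.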
R² (Coefficient of Determination)
R² tells you what fraction of variance in the target the model explains. R² = 0.87 means the model explains 87% of the variance in the data. R² = 0 means the model is no better than predicting the mean. R² can be negative — this happens when your model is literally worse than just predicting the mean for every input.
The R² trap: on training data, R² never decreases when you add more features, even noise features. Adjusted R² corrects for this by penalising model complexity. Do not compare R² across different datasets: the value depends on how much variance the target has to begin with. Zillow's Zestimate had an R² above 0.95 on historical training data. Their $881M write-down happened anyway because R² on the training distribution does not predict behaviour during market regime changes.
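As a small sketch of the complexity penalty, the function below applies the usual adjusted R² formula for a linear model with an intercept, 1 - (1 - R²)(n - 1)/(n - p - 1). The sample size and feature counts are hypothetical.

```python
def adjusted_r2(r2, n_samples, n_features):
    """Adjusted R^2 for a linear model with an intercept (needs n_samples > n_features + 1)."""
    return 1 - (1 - r2) * (n_samples - 1) / (n_samples - n_features - 1)

r2 = 0.87
for p in (5, 40):
    print(f"R2={r2}, n=100, features={p}: adjusted R2={adjusted_r2(r2, 100, p):.3f}")
```

The same raw R² of 0.87 is worth about 0.863 with 5 features but only about 0.782 with 40, so a model that buys its fit with many features is marked down.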
Residual Analysis — The Check Behind the Numbers
A single aggregate metric (even a good R²) can hide serious model pathologies. Plot residuals (actual minus predicted, $y_i - \hat{y}_i$) against fitted values and look for:
- Systematic patterns: A curved residual plot means the model has not captured non-linearity.
- Heteroscedasticity: Residuals that fan out as fitted values increase mean variance is not constant — predictions are less reliable at higher values.
- Outlier clusters: A handful of extreme residuals may point to data quality issues or a subpopulation the model ignores.
Anscombe's Quartet (1973) is the canonical demonstration: four datasets with nearly identical means, variances, correlation, fitted regression line, and R², but completely different shapes, three of which violate the assumptions of a linear model. A residual plot exposes every one of those problems; the summary statistics expose none.
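The first pattern in the list is easy to reproduce numerically, without a plot. The sketch below fits a straight line to synthetic, noise-free quadratic data (an assumption chosen to make the effect obvious) and inspects the residuals at the ends and the middle of the range.

```python
import numpy as np

# Synthetic data (assumed): a purely quadratic relationship, no noise.
x = np.linspace(0, 10, 51)
y = x ** 2

# Fit a straight line, deliberately the wrong functional form.
slope, intercept = np.polyfit(x, y, deg=1)
fitted = slope * x + intercept
residuals = y - fitted

ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r2_line = 1 - ss_res / ss_tot

print(f"R2 of the straight-line fit: {r2_line:.3f}")
print(f"residual at low x:  {residuals[0]:+.1f}")
print(f"residual at mid x:  {residuals[25]:+.1f}")
print(f"residual at high x: {residuals[-1]:+.1f}")
```

R² is above 0.9, yet the residuals are positive at both ends and negative in the middle: a curved pattern that the aggregate number does not reveal.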
Clustering Metrics
Clustering is harder to evaluate than supervised tasks because there is no ground truth label to compare against (except in benchmarking settings). Metrics fall into two families:
- Intrinsic: use only the data and cluster assignments
- Extrinsic: compare to known ground-truth labels
Silhouette Score (Intrinsic)
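For each point $i$, $a(i)$ is the mean distance to all other points in its cluster (cohesion), and $b(i)$ is the mean distance to all points in the nearest other cluster (separation). The silhouette of the point is $s(i) = \frac{b(i) - a(i)}{\max(a(i),\ b(i))}$, and the overall score is the mean of $s(i)$ over all points. The score ranges from -1 to +1: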
- Near +1: the point is well-matched to its own cluster and far from others.
- Near 0: the point sits on a cluster boundary.
- Negative: the point may have been assigned to the wrong cluster.
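from sklearn.metrics import silhouette_score
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
# Synthetic data for illustration (an assumption); substitute your own feature matrix X
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)
score = silhouette_score(X, kmeans.labels_)
print(f"Silhouette: {score:.3f}") # closer to 1 is better
Davies-Bouldin Index (Intrinsic)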
Davies-Bouldin measures the average ratio of within-cluster scatter to between-cluster separation. Lower is better (0 is perfect). It is computationally cheaper than silhouette and prefers compact, well-separated clusters. It tends to favour spherical clusters, so it can mislead with non-convex shapes.
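A minimal sketch using scikit-learn's `davies_bouldin_score` on synthetic blobs (the data, cluster counts, and seeds are assumptions for illustration):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import davies_bouldin_score

# Synthetic blobs (assumed for illustration); substitute your own feature matrix.
X_db, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.8, random_state=0)

db_scores = {}
for k in (2, 3, 6):
    labels_k = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X_db)
    db_scores[k] = davies_bouldin_score(X_db, labels_k)
    print(f"k={k}: Davies-Bouldin = {db_scores[k]:.3f}")
```

When comparing candidate clusterings of the same data, the one with the lowest index is preferred.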
Adjusted Rand Index — ARI (Extrinsic)
ARI compares your cluster assignments to ground-truth labels by measuring the fraction of pairs that are either in the same cluster in both, or in different clusters in both. The "adjusted" part corrects for chance — a random clustering scores near 0. ARI = 1 means perfect agreement. Use ARI when you have ground truth labels (e.g., in benchmarking) and want a robust, chance-corrected measure. NMI (Normalised Mutual Information) is an alternative that measures information overlap between the two labellings.
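Both measures compare partitions, not label names, which the following sketch illustrates with toy labellings (assumed for this example):

```python
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score

truth = [0, 0, 1, 1, 2, 2]
same_groups_renamed = [1, 1, 0, 0, 2, 2]   # identical partition, different label names
one_point_moved = [0, 0, 1, 2, 2, 2]       # one point assigned to the wrong group

for name, pred in [("renamed", same_groups_renamed), ("one moved", one_point_moved)]:
    ari = adjusted_rand_score(truth, pred)
    nmi = normalized_mutual_info_score(truth, pred)
    print(f"{name:>9}: ARI={ari:.3f}, NMI={nmi:.3f}")
```

The renamed labelling scores a perfect 1.0 on both measures, while moving a single point lowers both.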
Ranking and Recommendation Metrics
Search engines, recommendation systems, and ad platforms do not ask "was the prediction correct?" — they ask "was the right item ranked high enough that the user found it?" All ranking metrics penalise putting relevant items lower in the list.
Precision@K and Recall@K
Precision@10 asks: of the 10 items the system showed, how many were relevant? Recall@10 asks: of all the relevant items in the catalogue, how many appeared in the top 10? Which one matters depends on the product: a streaming service like Netflix cares about something closer to Recall@K (did we put a show you'll enjoy somewhere in your top 20?), while a product like Google Maps cares about Precision@1 (is the top result correct?).
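Both quantities are a few lines of code. In this sketch the item IDs and the relevant set are hypothetical, and the function names are illustrative.

```python
def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k ranked items that are relevant."""
    top_k = ranked[:k]
    return sum(1 for item in top_k if item in relevant) / k

def recall_at_k(ranked, relevant, k):
    """Fraction of all relevant items that appear in the top k."""
    top_k = ranked[:k]
    return sum(1 for item in top_k if item in relevant) / len(relevant)

ranked_items = ["a", "b", "c", "d", "e"]   # system output, best first (hypothetical IDs)
relevant_items = {"a", "c", "f", "g"}      # ground-truth relevant set (hypothetical)

print(f"P@3 = {precision_at_k(ranked_items, relevant_items, 3):.2f}")
print(f"R@3 = {recall_at_k(ranked_items, relevant_items, 3):.2f}")
```

Two of the top three items are relevant (P@3 = 0.67), but only two of the four relevant items were surfaced (R@3 = 0.50).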
NDCG — Normalised Discounted Cumulative Gain
NDCG extends P@K by weighting relevance by position: a relevant item at rank 1 contributes more than at rank 5, which contributes more than at rank 10. The log discount models the real-world observation that users click less on lower-ranked results. IDCG (Ideal DCG) is the score of the perfect ordering. NDCG = 1.0 means your ranking is identical to perfect. NDCG is the standard metric for search engines (used in TREC benchmarks), learning-to-rank systems, and recommendation systems where relevance is graded (not binary).
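A minimal sketch of the computation, assuming the linear-gain form DCG = Σ rel_i / log2(i + 1) with ranks starting at 1 (some systems use 2^rel - 1 as the gain instead). The relevance grades are hypothetical.

```python
import math

def dcg(relevances):
    """Discounted cumulative gain with a log2 position discount (rank starts at 1)."""
    return sum(rel / math.log2(rank + 1) for rank, rel in enumerate(relevances, start=1))

def ndcg(relevances):
    """DCG divided by the DCG of the ideal (descending) ordering."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

graded = [3, 2, 0, 1]   # hypothetical graded relevance of the results, in ranked order
print(f"NDCG = {ndcg(graded):.3f}")
print(f"NDCG of the ideal order = {ndcg(sorted(graded, reverse=True)):.3f}")
```

Swapping the last two results costs only about 1.5% of the ideal score here, because the discount makes mistakes near the bottom of the list cheap.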
MRR — Mean Reciprocal Rank
MRR cares only about the rank of the first relevant result. If the first correct answer is at rank 1, 2, or 3, MRR assigns reciprocals 1, 0.5, 0.33. It is ideal for question-answering systems and voice assistants where users want one correct answer fast. MRR ignores whether there are other relevant items further down the list — that is a feature, not a bug, for single-answer retrieval tasks.
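In code, MRR is the mean of those reciprocals over a set of queries. The sketch below assumes that a query with no relevant result contributes 0, a common convention; the ranks are hypothetical.

```python
def mean_reciprocal_rank(first_relevant_ranks):
    """Mean of 1/rank over queries; None means no relevant result was returned (scores 0)."""
    reciprocals = [0.0 if rank is None else 1.0 / rank for rank in first_relevant_ranks]
    return sum(reciprocals) / len(reciprocals)

# Rank of the first correct answer for three hypothetical queries.
print(f"MRR = {mean_reciprocal_rank([1, 2, 3]):.3f}")
```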
Generation Metrics — From BLEU to the LLM Era
BLEU and ROUGE (2002–2003)
BLEU (Bilingual Evaluation Understudy) was introduced by Papineni et al. in 2002 for machine translation. It counts n-gram overlaps between the model output and one or more reference translations, with a brevity penalty for short outputs. ROUGE (Recall-Oriented Understudy for Gisting Evaluation) is the recall-focused counterpart, used for summarisation.
Both are fast, cheap, and correlate reasonably with human judgement at the corpus level. Both are notoriously bad at evaluating individual outputs — a translation can score high BLEU while being grammatically malformed, and a summary can score low ROUGE while being more factually accurate than the reference.
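To show what n-gram overlap means in practice, here is a sketch of clipped n-gram precision against a single reference, the building block of BLEU. It is a simplification: full BLEU combines several n-gram orders with a geometric mean and applies the brevity penalty, both omitted here. The function name is hypothetical.

```python
from collections import Counter

def clipped_ngram_precision(candidate, reference, n=1):
    """Clipped n-gram precision against a single reference (whitespace tokenisation)."""
    def ngrams(tokens):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    cand_counts = ngrams(candidate.split())
    ref_counts = ngrams(reference.split())
    total = sum(cand_counts.values())
    if total == 0:
        return 0.0
    # Each candidate n-gram is credited at most as often as it occurs in the reference.
    matched = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
    return matched / total

degenerate = "the the the the the the the"
reference = "the cat is on the mat"
print(f"unigram precision = {clipped_ngram_precision(degenerate, reference):.3f}")
```

Without clipping, the degenerate candidate would score a perfect 7/7; with clipping it gets 2/7, since "the" appears only twice in the reference.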
BERTScore (2019)
$$\text{BERTScore} = F_1\bigl(\text{cosine\_sim}(\text{BERT}(y),\ \text{BERT}(\hat{y}))\bigr)$$
BERTScore replaces exact n-gram matching with semantic similarity. It embeds both the reference and hypothesis using a pretrained BERT model and computes token-level cosine similarities using greedy matching. A sentence that uses different words to express the same meaning now scores well; BLEU would penalise it. BERTScore correlates better with human judgements than BLEU/ROUGE on most NLG benchmarks, but requires running inference through a large model, which is slower and not always available in constrained environments.
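The greedy-matching step can be sketched without a real model. The example below uses toy 2-D vectors as hypothetical stand-ins for contextual BERT embeddings, and leaves out refinements such as idf weighting and baseline rescaling, so it illustrates the matching logic only.

```python
import numpy as np

def greedy_match_f1(ref_emb, cand_emb):
    """F1 from greedy cosine matching between two sets of token embeddings."""
    ref = ref_emb / np.linalg.norm(ref_emb, axis=1, keepdims=True)
    cand = cand_emb / np.linalg.norm(cand_emb, axis=1, keepdims=True)
    sim = ref @ cand.T                   # sim[i, j] = cosine(ref token i, cand token j)
    recall = sim.max(axis=1).mean()      # each reference token takes its best candidate match
    precision = sim.max(axis=0).mean()   # each candidate token takes its best reference match
    return 2 * precision * recall / (precision + recall)

# Toy 2-D "embeddings" (hypothetical stand-ins for contextual BERT vectors).
reference_tokens = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
paraphrase_tokens = np.array([[0.9, 0.1], [0.1, 0.9], [1.0, 0.8]])

identical_score = greedy_match_f1(reference_tokens, reference_tokens)
paraphrase_score = greedy_match_f1(reference_tokens, paraphrase_tokens)
print(f"identical:  {identical_score:.3f}")
print(f"paraphrase: {paraphrase_score:.3f}")
```

Tokens that point in nearly the same direction still match strongly, which is how a paraphrase can score close to 1 without sharing exact words.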
LLM-Era Metrics: Faithfulness, Relevance, G-Eval
The fundamental problem with reference-based metrics
They assume you have a correct reference answer to compare against. For open-ended LLM outputs — a summary, a multi-step reasoning response, a code generation — the reference is either unavailable or one of many acceptable answers. The field has shifted to reference-free evaluation.
RAG-specific metrics
For Retrieval-Augmented Generation systems, three metrics matter independently:
- Faithfulness: Is every claim in the generated answer supported by the retrieved context? Measures hallucination at the claim level.
- Answer Relevance: Does the answer actually address the question? A faithful but off-topic answer fails here.
- Context Recall: Did the retrieval step bring back the documents needed to answer the question? Measures retrieval quality, not generation quality.
G-Eval and LLM-as-Judge (2023)
G-Eval (Liu et al., 2023) uses a chain-of-thought LLM prompt to score outputs on dimensions like coherence, consistency, and fluency. It outperforms BLEU and ROUGE in correlation with human judgements on summarisation benchmarks by a wide margin. The core idea: if LLMs are generating the outputs, LLMs can evaluate them — they have semantic understanding that n-gram metrics lack.
The risks: LLM judges exhibit length bias (preferring longer outputs), self-preference bias (an OpenAI model scores OpenAI outputs higher), and are not deterministic. Use them with a rubric, multiple runs, and a calibration set of human-labelled examples.
Perplexity
Perplexity measures how well a language model predicts a sequence. Lower perplexity means the model assigned higher probability to the actual tokens. It is useful for comparing language models of the same architecture on the same vocabulary, and for detecting out-of-distribution inputs (high perplexity signals the model is in unfamiliar territory). It is not useful for comparing models with different tokenisers or for evaluating factual accuracy.
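Concretely, perplexity is the exponential of the average negative log-probability per token. The per-token probabilities in this sketch are hypothetical.

```python
import math

def perplexity(token_probs):
    """exp of the average negative log-probability assigned to the observed tokens."""
    avg_neg_log_prob = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_neg_log_prob)

# Hypothetical per-token probabilities from two models on the same 4-token sequence.
print(f"uniform over 4 choices: {perplexity([0.25, 0.25, 0.25, 0.25]):.2f}")
print(f"more confident model:   {perplexity([0.9, 0.8, 0.7, 0.6]):.2f}")
```

A model that is always choosing uniformly among 4 options has perplexity 4, which is why perplexity is often read as an effective branching factor.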
Choosing the Right Metric — A Decision Framework
Start with the problem type, then layer in the constraints.
Classification Problems
- Are classes balanced (roughly equal)? → Accuracy is acceptable as a secondary metric. F1 or MCC are still safer primaries.
- Is the positive class rare (<10%)? → Drop accuracy. Use F1, PR-AUC, or MCC.
- Does missing a positive cost more than a false alarm? (medical screening, fraud) → Prioritise Recall or F2.
- Does a false alarm cost more than a miss? (spam filter, content moderation) → Prioritise Precision or F0.5.
- Do you need calibrated probabilities? (risk scoring, ad bidding) → Log Loss.
- Do you need a single threshold-independent number? Balanced data → ROC-AUC. Imbalanced → PR-AUC or MCC.
Regression Problems
- Are large errors especially bad? → MSE/RMSE. They penalise outlier predictions aggressively.
- Are there real outliers in the target distribution? → MAE. It will not be dominated by them.
- Do you need scale-independent comparison across datasets? → MAPE (if no zeros in target).
- Are you reporting to a non-technical stakeholder? → MAE (easy: "off by X units on average") or MAPE ("off by X%").
- Always also run residual analysis. A good RMSE does not rule out systematic bias in a subpopulation.
Clustering Problems
- No ground truth labels available? → Silhouette score (general), Davies-Bouldin (spherical clusters), Calinski-Harabasz (fast, large datasets).
- Ground truth labels available (benchmarking)? → ARI or NMI.
- Choosing K? → Elbow method (plot inertia vs K) + Silhouette score confirmation. Neither is definitive alone.
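The K-selection step above can be sketched as a loop over candidate values, recording inertia (for the elbow) and silhouette (for confirmation). The data here is synthetic, with three well-separated blobs assumed for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data (assumed): three well-separated blobs.
X_k, _ = make_blobs(
    n_samples=300, centers=[(-5, -5), (0, 5), (5, -5)], cluster_std=0.6, random_state=0
)

results = {}
for k in range(2, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_k)
    results[k] = (model.inertia_, silhouette_score(X_k, model.labels_))
    print(f"k={k}: inertia={results[k][0]:.1f}, silhouette={results[k][1]:.3f}")

best_k = max(results, key=lambda k: results[k][1])
print(f"best k by silhouette: {best_k}")
```

Inertia keeps falling as K grows, so it can only suggest an elbow; the silhouette peak gives a second opinion. On real data the two may disagree, which is a signal to inspect the clusters directly.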
Ranking and Generation Problems
- Single correct answer (QA, search)? → MRR.
- Multiple relevant items, position matters (search, recommendation)? → NDCG.
- Summarisation / translation with a reference? → BERTScore > ROUGE > BLEU for quality; ROUGE is still standard for reporting.
- RAG or open-ended LLM output (no reference)? → Faithfulness + Answer Relevance + Context Recall. Consider G-Eval for holistic scoring.
Common Mistakes to Avoid
- Choosing the metric after training: If you choose your metric after seeing results, you are not evaluating — you are justifying. Pick the metric before the first model run.
- Optimising a proxy metric instead of the business goal: Engagement rate (clicks) is not the same as user satisfaction. CTR is not revenue. Define the true north metric and track whether your proxy correlates with it.
- Using accuracy on imbalanced data: This is the most common beginner mistake. A model predicting the majority class for every input will report high accuracy and be completely useless.
- Ignoring the test set distribution: A metric evaluated on a non-representative test set is meaningless. Evaluate on the distribution your model will face in production. Temporal splits matter: always use future data as your test set for time-dependent problems.
- Single metric tunnel vision: Report the metric, but also inspect: confusion matrix (classification), residual plots (regression), and a sample of the worst predictions. Aggregate metrics hide subpopulation failures.
- Confusing training loss with evaluation metric: Cross-entropy loss is what the model minimises during training. F1 or NDCG is what you care about. They are related but not the same. A model can improve training loss while degrading your target metric — track both separately.

