Understand Precision and Recall, Once and for All
February 14, 2026
Paper: Hand, D. J., & Christen, P. (2018). A note on using the F-measure for evaluating record linkage algorithms. Statistics and Computing, 28(3), 539–547.
Why Evaluation Metrics Matter
You train a model, run it on a test set, and get an accuracy of 98%. Impressive — until you realize that 98% of your samples are negative and your model predicts negative every single time. Accuracy told you nothing useful.
This is the classic trap, and it shows up everywhere in computational biology: predicting protein-protein interactions, identifying disease-associated genes, classifying cell types from single-cell data. The classes are almost always imbalanced. A metric that ignores this isn’t a metric — it’s a lie.
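The trap is easy to reproduce in a few lines. A toy sketch with 98 negatives, 2 positives, and a model that always predicts negative:

```python
# The accuracy trap: 98% negatives, model always says "no".
y_true = [1] * 2 + [0] * 98   # 2 positives, 98 negatives
y_pred = [0] * 100            # the "always negative" model

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))

print(accuracy)        # 0.98 -- looks great
print(tp / (tp + fn))  # recall = 0.0 -- the model found no positives at all
```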
Precision and Recall are the standard answer. They have been around for decades (originating in information retrieval), but they are surprisingly easy to misinterpret and misapply. This post aims to make them click permanently.
The Confusion Matrix
Everything starts here. For a binary classifier, every prediction falls into one of four cells:
| | Predicted Positive | Predicted Negative |
|---|---|---|
| Actually Positive | True Positive (TP) | False Negative (FN) |
| Actually Negative | False Positive (FP) | True Negative (TN) |
- TP — correctly flagged a positive. You said “yes” and it was “yes”.
- FP — a false alarm. You said “yes” but it was “no”. Also called a Type I error.
- FN — a miss. You said “no” but it was “yes”. Also called a Type II error.
- TN — correctly dismissed a negative. You said “no” and it was “no”.
Every metric is just a different way of combining these four numbers.
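As a minimal sketch, the four cells can be tallied directly from paired labels and predictions, with no library assumed:

```python
# Tally the four confusion-matrix cells from paired labels and predictions.
def confusion_counts(y_true, y_pred):
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))  # hits
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))  # false alarms
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))  # misses
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))  # correct rejections
    return tp, fp, fn, tn

print(confusion_counts([1, 1, 0, 0, 1], [1, 0, 0, 1, 1]))  # (2, 1, 1, 1)
```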
Precision
$$\text{Precision} = \frac{TP}{TP + FP}$$

Interpretation: Of everything you predicted as positive, what fraction actually was?
Precision answers the question: “Can I trust a positive prediction?” A precision of 1.0 means every positive prediction was correct — you never cried wolf. A precision of 0.1 means 9 out of 10 of your positive predictions were false alarms.
High precision matters when false positives are costly. In drug discovery, if your model flags 1000 candidate compounds and only 10 are real, you’ve wasted enormous experimental resources chasing false leads. You want your positives to be trustworthy.
Recall
Also called sensitivity or true positive rate.
$$\text{Recall} = \frac{TP}{TP + FN}$$

Interpretation: Of all the actual positives, what fraction did you catch?
Recall answers: “Am I finding everything that matters?” A recall of 1.0 means you found every positive example — none slipped through. A recall of 0.3 means you missed 70% of them.
High recall matters when false negatives are costly. In cancer screening, missing a tumor (FN) is far worse than a false alarm that leads to a follow-up test. You need to catch everything. Similarly, in network biology, if you’re mapping a signaling pathway and miss half the interactions, your downstream analysis is built on an incomplete picture.
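Plugging the drug-discovery numbers from above into both formulas makes the asymmetry concrete (the FN count of 5 is an assumed number, added for illustration):

```python
# Drug-screen example: 1000 compounds flagged, only 10 real.
# FN = 5 is an assumed count of real hits the screen missed.
tp, fp, fn = 10, 990, 5

precision = tp / (tp + fp)  # 0.01 -- 99% of flagged compounds are false leads
recall = tp / (tp + fn)     # ~0.67 -- yet two-thirds of the true hits were caught
print(precision, round(recall, 2))
```

The same classifier can look terrible by one metric and decent by the other, which is exactly why both are reported.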
The Precision–Recall Trade-off
Here is the core tension: you usually cannot maximize both simultaneously.
Most classifiers output a score (a probability, a logit, a distance). You choose a threshold $t$: predict positive if score $\geq t$, negative otherwise. As you move $t$:
- Lower $t$ → more positives predicted → recall goes up, precision goes down
- Higher $t$ → fewer positives predicted → precision goes up, recall goes down
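A quick sketch with made-up scores shows both directions of the trade-off:

```python
# Sweep a decision threshold over toy scores; both labels and scores are made up.
y_true   = [1, 1, 1, 0, 1, 0, 0, 0]
y_scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]

def pr_at_threshold(t):
    """Precision and recall when predicting positive for score >= t."""
    y_pred = [int(s >= t) for s in y_scores]
    tp = sum(yt == 1 and yp == 1 for yt, yp in zip(y_true, y_pred))
    fp = sum(yt == 0 and yp == 1 for yt, yp in zip(y_true, y_pred))
    fn = sum(yt == 1 and yp == 0 for yt, yp in zip(y_true, y_pred))
    return tp / (tp + fp), tp / (tp + fn)

for t in (0.35, 0.55, 0.75):
    p, r = pr_at_threshold(t)
    print(f"t={t}: precision={p:.2f}, recall={r:.2f}")
# As t rises, precision climbs (1.00 at t=0.75) while recall falls (0.50 at t=0.75).
```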
You can visualize this as a Precision–Recall curve (PR curve): plot precision on the $y$-axis and recall on the $x$-axis as you sweep $t$ from 1 to 0.
```python
from sklearn.metrics import precision_recall_curve
import matplotlib.pyplot as plt

# y_true: binary labels; y_scores: model scores or probabilities
precision, recall, thresholds = precision_recall_curve(y_true, y_scores)

plt.plot(recall, precision)
plt.xlabel("Recall")
plt.ylabel("Precision")
plt.title("Precision–Recall Curve")
plt.show()
```
A model in the top-right corner — high precision and high recall — is what you want. The area under the PR curve (AUPRC) summarizes this into a single number. Unlike AUROC, AUPRC is sensitive to class imbalance in exactly the right way: it weighs performance on the positive class heavily, which is usually what matters.
The F Score: Combining Both
When you need a single number that balances precision and recall, the standard choice is the F-measure:

$$F_\beta = (1 + \beta^2) \cdot \frac{\text{Precision} \cdot \text{Recall}}{\beta^2 \cdot \text{Precision} + \text{Recall}}$$

$\beta$ controls the trade-off:
- $\beta = 1$ (F1 score): Equal weight to precision and recall. The harmonic mean of the two.
- $\beta < 1$ (e.g., F0.5): Weights precision more. Use when false positives are more costly.
- $\beta > 1$ (e.g., F2): Weights recall more. Use when false negatives are more costly.

The F1 score is the most commonly reported variant:

$$F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} = \frac{2TP}{2TP + FP + FN}$$
Note that TN does not appear in F1 at all. This is intentional and important: in imbalanced problems, a huge TN count can inflate accuracy while F1 remains honest about how well you’re handling the positive class.
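A small sketch with hypothetical counts shows TN inflating accuracy while F1 stays put:

```python
# Same positive-class performance (TP=10, FP=5, FN=5), wildly different TN counts.
tp, fp, fn = 10, 5, 5

def metrics(tn):
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    f1 = 2 * tp / (2 * tp + fp + fn)   # TN never enters the F1 formula
    return accuracy, f1

print(metrics(10))       # accuracy ~0.67, F1 ~0.67
print(metrics(10_000))   # accuracy ~0.999, F1 still ~0.67
```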
What Hand & Christen Actually Argue
The paper is specifically a critique of using F1 (and the broader F-measure family) for record linkage — the task of identifying which records in different databases refer to the same real-world entity.
Their main points are worth understanding because they generalize:
1. F Conflates Two Different Error Types
FP and FN are not the same kind of error. Conflating them into a single number obscures which type of mistake your algorithm is making. Depending on the downstream use, one matters far more than the other.
2. F Has No Probabilistic Justification
Unlike proper scoring rules (log-loss, Brier score), the F-measure has no derivation from a probability model or a decision-theoretic framework. It is a heuristic combination. This means you cannot directly interpret it as the expected loss under any natural cost structure.
3. The Implicit Cost Structure is Hidden and Arbitrary
Every choice of $\beta$ implicitly assigns a cost ratio to FP vs FN. But this cost ratio is rarely made explicit, and the "standard" choice of $\beta = 1$ (equal weight) is almost never the right choice for any specific application.
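To see how $\beta$ encodes the cost trade-off, here is the $F_\beta$ formula evaluated for a hypothetical model that is precise but low-recall:

```python
# F_beta computed directly from the formula; beta^2 weights recall vs precision.
def f_beta(precision, recall, beta):
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

p, r = 0.9, 0.3   # hypothetical: very precise, poor recall

print(round(f_beta(p, r, 0.5), 3))  # F0.5 rewards the high precision
print(round(f_beta(p, r, 1.0), 3))  # F1 = 0.45, the harmonic mean
print(round(f_beta(p, r, 2.0), 3))  # F2 punishes the low recall
```

The same model scores three different numbers depending on $\beta$, which is exactly the hidden cost structure Hand & Christen object to.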
Their Recommendation
Rather than reporting a single F-score, they argue for:
- Reporting precision and recall separately and letting the reader apply their own cost weighting.
- Plotting the PR curve so the full operating characteristic is visible.
- Making cost assumptions explicit if a single summary metric is needed.
Practical Guidance
Here is how I think about choosing metrics:
| Situation | Recommended Metric |
|---|---|
| Balanced classes, accuracy is meaningful | Accuracy or F1 |
| Imbalanced, FP and FN equally bad | F1 + AUPRC |
| Imbalanced, FP costly (e.g., drug screening) | Precision at fixed recall |
| Imbalanced, FN costly (e.g., disease screening) | Recall at fixed precision |
| Ranking/scoring task | AUPRC or AUROC |
| You want the full picture | Plot the PR curve |
A pattern I find useful in research: always report AUPRC alongside AUROC. AUROC can look great on imbalanced datasets because it measures the rank ordering of all samples, including the vast majority of negatives. AUPRC is more pessimistic and more honest about how well you’re doing on the rare positives.
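One way to implement "precision at fixed recall" from the table above is to walk down the score-sorted predictions and stop once the target recall is reached. A sketch on toy data (score ties are ignored here for simplicity):

```python
# Precision at the operating point where recall first reaches the target.
def precision_at_recall(y_true, y_scores, target_recall):
    pairs = sorted(zip(y_scores, y_true), reverse=True)  # highest score first
    n_pos = sum(y_true)
    tp = fp = 0
    for score, label in pairs:
        tp += label
        fp += 1 - label
        if tp / n_pos >= target_recall:
            return tp / (tp + fp)
    return 0.0

y_true   = [1, 1, 1, 0, 1, 0, 0, 0]
y_scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]
print(precision_at_recall(y_true, y_scores, 1.0))   # 0.8 -- catching everything costs one FP
print(precision_at_recall(y_true, y_scores, 0.75))  # 1.0 -- 75% recall is free of false alarms
```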
Computing These in Python
```python
from sklearn.metrics import (
    precision_score,
    recall_score,
    f1_score,
    average_precision_score,
    PrecisionRecallDisplay,
)

# Hard predictions (threshold at 0.5)
precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)

# Soft predictions (probability scores)
auprc = average_precision_score(y_true, y_scores)

# Full PR curve (from_predictions builds and draws the plot)
PrecisionRecallDisplay.from_predictions(y_true, y_scores)
```
For multi-class problems, sklearn supports `average='macro'`, `'micro'`, and `'weighted'` options — each with different assumptions about how to aggregate across classes. `macro` treats every class equally; `micro` pools all decisions into one global confusion matrix; `weighted` averages per-class scores weighted by support (number of true instances per class). In imbalanced multi-class settings, `macro` is typically the harder and more informative number.
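A hand-rolled sketch with made-up class counts shows why `macro` is harsher when a rare class is missed:

```python
# Per-class recall, then macro (unweighted) vs weighted averages, by hand.
y_true = [0] * 90 + [1] * 9 + [2] * 1
y_pred = [0] * 90 + [0] * 9 + [2] * 1   # class 1 is missed entirely

recalls, weights = [], []
for c in (0, 1, 2):
    support = sum(t == c for t in y_true)
    hits = sum(t == c and p == c for t, p in zip(y_true, y_pred))
    recalls.append(hits / support)
    weights.append(support / len(y_true))

macro = sum(recalls) / len(recalls)                      # every class counts equally
weighted = sum(r * w for r, w in zip(recalls, weights))  # big classes dominate
print(round(macro, 3), round(weighted, 3))  # 0.667 vs 0.91
```

Weighted averaging barely notices that an entire class went undetected; macro averaging drops by a full third.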
Key Takeaways
- Accuracy is misleading on imbalanced data. Always check your class distribution before choosing a metric.
- Precision answers "are my positives trustworthy?" Recall answers "am I finding everything?" Know which one matters more for your problem before training.
- The F-measure is a useful heuristic, not a ground truth. The choice of $\beta$ encodes a cost assumption. Make that assumption explicit.
- Plot the PR curve. A single threshold gives you a single (precision, recall) point. The curve shows you the full trade-off space and how your model compares to others across all operating points.
- Prefer AUPRC over AUROC when your positive class is rare and getting it right is the actual goal.
The best metric is the one that most directly measures what failure costs in your specific application. Precision and recall give you the building blocks — the choice of how to combine them should always be deliberate.