Tomek Links in Machine Learning
Tomek Links is one of the most elegant ideas in imbalanced classification: find pairs of samples from opposite classes that are each other's nearest neighbors, and remove the majority-class sample from each pair. The result is a cleaner decision boundary, fewer ambiguous samples, and a dataset that gives classifiers a much better chance at learning the true separation between classes.
The technique was introduced by Ivan Tomek in 1976 as a modification to Hart's Condensed Nearest Neighbor (CNN) rule. While it was originally framed as a data reduction method for nearest-neighbor classifiers, it has found its modern calling as a boundary cleaning technique for imbalanced datasets. Unlike random undersampling, which blindly discards majority-class samples, Tomek Links surgically removes only the samples that create confusion at the class boundary -- the ones most likely to cause misclassification.
What makes Tomek Links especially powerful is its role as a post-processing cleaner combined with oversampling. The famous SMOTETomek pipeline -- first generate synthetic minority samples with SMOTE, then clean the boundary with Tomek Links -- has become a standard recipe for handling class imbalance in domains ranging from credit card fraud detection at financial institutions like Razorpay and PayU, to medical diagnosis at hospitals across India, to churn prediction at telecom companies like Jio and Airtel.
This guide covers the mathematical definition, algorithmic details, implementation with imbalanced-learn, failure modes, production case studies, and the nuanced decision of when Tomek Links is the right tool versus alternatives like Edited Nearest Neighbors or NearMiss.
Concept Snapshot
- What It Is
- A data cleaning technique that identifies pairs of opposite-class samples that are each other's nearest neighbors (Tomek links) and removes the majority-class sample from each pair to clean the decision boundary and reduce class overlap.
- Category
- Data Generation / Resampling
- Complexity
- Intermediate
- Inputs / Outputs
- Input: an imbalanced dataset with features X and labels y. Output: a cleaned dataset with the same minority-class samples intact and majority-class samples that formed Tomek links removed.
- System Placement
- Sits in the data preprocessing stage, after feature engineering and before model training. Often used as a post-processing step after SMOTE or other oversampling techniques to clean up noisy synthetic samples near the boundary.
- Also Known As
- Tomek link removal, Tomek's modification of CNN, T-Link undersampling, boundary cleaning via Tomek links
- Typical Users
- Data Scientists, ML Engineers, Applied Researchers, Fraud Analysts, Biostatisticians
- Prerequisites
- Nearest neighbor algorithms (k-NN), Class imbalance concepts (majority/minority class), Distance metrics (Euclidean, Manhattan), Basic understanding of decision boundaries, SMOTE (for combined approaches)
- Key Terms
- Tomek linknearest neighbor pairboundary cleaningclass overlapmajority-class removalcondensed nearest neighborSMOTETomekprototype selectionimbalanced-learnundersampling
Why This Concept Exists
The Problem: Noisy Class Boundaries in Imbalanced Data
Imagine you are building a fraud detection model for a payment gateway like Razorpay. Out of 1 million transactions, only 2,000 are fraudulent -- a 0.2% positive rate. The challenge is not just that the minority class is rare; it is that the fraudulent transactions often look similar to legitimate ones. Some legitimate transactions sit right next to fraudulent ones in feature space, creating a fuzzy, overlapping boundary that classifiers struggle to learn.
Random undersampling addresses the class ratio by discarding majority-class samples, but it does so blindly -- it might remove informative samples far from the boundary while preserving confusing ones right at the edge. What you really want is a surgeon's scalpel, not a machete: remove only the majority-class samples that are actively confusing the classifier.
That is exactly what Tomek Links does. It identifies pairs of opposite-class samples that are each other's closest neighbors -- the most ambiguous, boundary-straddling pairs in the dataset -- and removes the majority-class member of each pair. The minority-class samples are left untouched (you cannot afford to lose any of them), and the boundary becomes cleaner.
Historical Context: From CNN to Tomek Links
The story begins with Peter Hart's Condensed Nearest Neighbor (CNN) rule in 1968. Hart proposed a method to reduce the size of the training set for a k-NN classifier while preserving classification accuracy. The idea was to keep only the samples necessary for correct classification -- essentially, the "important" samples near the decision boundary.
In 1972, Dennis Wilson introduced the Edited Nearest Neighbor (ENN) rule, which worked in the opposite direction: instead of keeping boundary samples, it removed samples that were misclassified by their neighbors, essentially cleaning noise from the dataset.
Ivan Tomek in 1976 proposed two modifications to Hart's CNN rule in his paper "Two Modifications of CNN" published in IEEE Transactions on Systems, Man, and Communications. The first modification introduced what we now call Tomek Links: if two samples from different classes are each other's nearest neighbors, they form a Tomek link. Tomek argued that at least one of these samples is either noise or sits on the boundary, and removing it (or both) would improve the CNN rule's efficiency.
Modern Renaissance: Tomek Links for Class Imbalance
Tomek Links languished in relative obscurity until the early 2000s, when the machine learning community became seriously concerned about class imbalance. The landmark paper by Batista, Prati, and Monard (2004) -- "A Study of the Behavior of Several Methods for Balancing Machine Learning Training Data" -- studied combining SMOTE with Tomek Links as a two-step resampling strategy. They showed that SMOTE+Tomek outperformed either technique alone on several benchmark datasets, establishing the SMOTETomek pipeline that remains popular today.
The inclusion of TomekLinks in the imbalanced-learn library (scikit-learn-contrib) cemented its place as a standard tool. Today, Tomek Links is one of the most commonly cited boundary cleaning methods in the imbalanced learning literature, used across fraud detection, medical diagnosis, network intrusion detection, and credit scoring.
Key Insight: Tomek Links is not primarily an undersampling technique -- it is a data cleaning technique. Its goal is not to balance the class distribution (it removes relatively few samples) but to sharpen the decision boundary by removing the most ambiguous majority-class samples.
Core Intuition & Mental Model
The Analogy: Sorting Out Disputed Territory
Imagine two neighboring countries that share a long, winding border. Most of the border is clearly defined -- there are mountains, rivers, and fences that make it obvious which country you are in. But at certain points, the border becomes fuzzy: towns that straddle the line, roads that weave back and forth, residents who could plausibly belong to either side. These disputed border towns are the source of most confusion and conflict.
Tomek Links identifies exactly these disputed border towns. Two samples form a Tomek link when they are from different classes and are each other's closest neighbor -- they are as close as two opposite-class samples can possibly be. In the disputed-territory analogy, they are the two houses on opposite sides of the border that are closer to each other than to any house in their own country.
The cleaning step removes the majority-class house from each disputed pair. Why the majority class? Because you have plenty of majority-class houses (you can afford to lose a few), but the minority-class houses are rare and every one matters. After cleaning, the border becomes sharper: the remaining majority-class samples are further from the minority-class region, making it easier for any classifier to draw a clean line between the two.
What Makes a Tomek Link Special?
Not every pair of opposite-class neighbors forms a Tomek link. The requirement is mutual nearest neighborhood: sample must be the nearest neighbor of sample , AND sample must be the nearest neighbor of sample . This mutual requirement is restrictive -- most cross-class neighbor pairs are one-directional (A's nearest neighbor is B, but B's nearest neighbor is some other sample from B's own class). Only the truly ambiguous, boundary-straddling pairs satisfy the mutual condition.
This is why Tomek Links is considered a conservative cleaning method: it identifies and removes only a small number of samples -- the most egregiously ambiguous ones. It does not dramatically change the class distribution. If your dataset has 100,000 majority samples and 1,000 minority samples, Tomek Links might remove only 200-500 majority samples. The class ratio barely changes, but the boundary quality improves significantly.
The Combined Power: SMOTE + Tomek Links
The most common use of Tomek Links is not standalone but as a post-processing step after SMOTE. Here is the intuition: SMOTE generates synthetic minority samples by interpolating between existing minority samples. But some of those synthetic samples may land in regions dominated by the majority class, creating new noise at the boundary. Tomek Links then sweeps through and removes the majority-class samples that are now uncomfortably close to these synthetic minority samples, cleaning up the boundary that SMOTE may have muddied.
Think of it as a two-phase construction project: SMOTE builds new houses (synthetic minority samples) in the minority neighborhood, and Tomek Links demolishes the majority-class houses that are too close to the new construction, creating a clear buffer zone along the border.
Technical Foundations
Definition of a Tomek Link
Given a dataset where and (binary classification with 0 as majority and 1 as minority), two samples and form a Tomek link if and only if:
- (they belong to different classes)
- (i.e., is the nearest neighbor of )
- (i.e., is the nearest neighbor of )
where is a distance metric, typically Euclidean:
More concisely: is a Tomek link if they are mutual nearest neighbors and belong to different classes.
Removal Strategy
Once all Tomek links are identified, the cleaning strategy determines which samples to remove:
-
Majority-only removal (default for imbalanced learning): Remove only the majority-class sample from each Tomek link pair. This preserves all minority-class samples while cleaning the boundary. If and form a Tomek link, remove .
-
Both-class removal (original Tomek proposal): Remove both samples from each Tomek link. This is more aggressive and used when data cleaning is the primary goal (not class balancing).
Algorithm
The Tomek Links algorithm has the following steps:
- For every sample , compute its nearest neighbor
- For each pair where :
- Check if (mutual nearest neighbor condition)
- If yes, mark this pair as a Tomek link
- Remove marked samples according to the chosen strategy
Computational Complexity
The dominant cost is computing nearest neighbors for all samples:
- Brute-force: time, space
- KD-tree: average time for low-dimensional data (), degrades to for high-dimensional data
- Ball-tree: average time, more robust than KD-tree in higher dimensions
After nearest neighbors are computed, identifying Tomek links is -- just a linear scan checking the mutual NN condition.
Properties
-
Conservative: Tomek Links removes relatively few samples. The number of Tomek links is bounded by where and are the majority and minority class sizes, but in practice the number is much smaller.
-
Deterministic: Given a fixed dataset and distance metric, the set of Tomek links is unique. There is no randomness in the algorithm (unlike random undersampling or SMOTE).
-
Boundary-focused: By construction, Tomek links exist only at the class boundary. Interior points (deep inside their class cluster) can never form Tomek links because their nearest neighbor will always be from their own class.
-
Distance-metric dependent: The set of identified Tomek links changes with the choice of distance metric. Euclidean distance is standard, but Manhattan or Mahalanobis distance may identify different boundary-ambiguous pairs.
Internal Architecture
The Tomek Links architecture is conceptually simple -- a nearest-neighbor computation followed by a pair-identification step and a removal step -- but the implementation details matter for performance at scale.

When combined with SMOTE in the SMOTETomek pipeline, the architecture becomes a two-stage process: SMOTE first generates synthetic minority samples and augments the dataset, then Tomek Links runs on the augmented dataset to identify and remove newly created Tomek links (typically between synthetic minority samples and nearby majority samples).
Key Components
Nearest Neighbor Index
Builds a spatial index (KD-tree, ball-tree, or brute-force distance matrix) over all samples in the dataset. This is the most computationally expensive component and determines the overall scalability of the algorithm. The imbalanced-learn implementation delegates to scikit-learn's NearestNeighbors class, which automatically selects the best algorithm based on dataset size and dimensionality.
Tomek Link Identifier
Iterates over all samples, checks the mutual nearest-neighbor condition for cross-class pairs, and marks samples that form Tomek links. This is a linear-time operation after the NN index is built. The identifier maintains a set of sample indices to be removed, handling edge cases where a single sample participates in multiple Tomek link relationships (rare but possible in multiclass settings).
Removal Engine
Removes the marked samples from the dataset according to the chosen strategy ('majority' to remove only majority-class samples, 'all' to remove both members of each Tomek link, or 'auto' which defaults to majority-only). Returns the cleaned feature matrix and label vector.
Pipeline Integration Layer
When used inside imblearn.pipeline.Pipeline or as part of SMOTETomek, this component ensures that Tomek Links runs on the correct version of the data (post-SMOTE augmented data, not the original data) and that the cleaning step respects the pipeline's fit_resample contract. It also handles the sampling_strategy parameter for multiclass settings.
Data Flow
Step 1 -- Index Construction: The full dataset (features and labels) is loaded. A spatial index (KD-tree for , ball-tree for , brute-force for or small ) is built over all feature vectors. This step is for tree-based indices.
Step 2 -- NN Computation: For each sample , query the index for its single nearest neighbor. Store the result as an array nn_indices of length . This step is for tree-based indices or for brute-force.
Step 3 -- Link Identification: Scan nn_indices for mutual cross-class pairs: if nn_indices[i] == j and nn_indices[j] == i and y[i] != y[j], then is a Tomek link. Collect the indices to be removed based on the removal strategy. This step is .
Step 4 -- Removal: Create a boolean mask over the dataset, setting False for indices to be removed. Apply the mask to produce the cleaned dataset X_clean, y_clean.
Step 5 (if combined): In SMOTETomek, Step 1-4 runs on the SMOTE-augmented dataset, so the Tomek link identification considers synthetic samples alongside original ones. Synthetic minority samples that ended up too close to majority samples are effectively defended by removing those majority samples.
A vertical flow from 'Imbalanced Dataset (X, y)' to 'Nearest Neighbor Index (KD-Tree / Ball-Tree / Brute)', then to 'Compute NN for all samples', then to 'Identify Tomek Link Pairs (Mutual NN + Different Class)', then to a decision node for 'Removal Strategy' that branches into 'Remove majority-class member' or 'Remove both members', both leading to 'Cleaned Dataset', which feeds into 'Downstream: Model Training or SMOTE + Tomek Pipeline'.
How to Implement
Implementation Approaches
Tomek Links implementation ranges from a one-line call with imbalanced-learn to a custom NumPy implementation for fine-grained control or edge cases:
Option A: imbalanced-learn TomekLinks -- the standard approach used by most practitioners. Handles multi-class, integrates with scikit-learn pipelines, and uses efficient NN algorithms internally. This is the right starting point for 95% of use cases.
Option B: imbalanced-learn SMOTETomek -- the combined pipeline that runs SMOTE followed by Tomek Links removal in a single fit_resample call. The most common production usage pattern.
Option C: Custom implementation -- useful when you need non-standard distance metrics, want to remove Tomek links iteratively (recomputing NNs after each removal round), or need to integrate with a non-scikit-learn pipeline (PyTorch, TensorFlow data loaders).
Cost Note: Tomek Links is primarily CPU-bound due to the NN computation. For a dataset with 100,000 samples and 50 features, the brute-force approach takes ~5 seconds on a modern laptop. KD-tree reduces this to ~1 second. For 1 million samples, KD-tree takes ~15-30 seconds, and brute-force becomes impractical (~500 seconds). On cloud instances (AWS
m5.xlargein Mumbai at ~INR 14/hr or 0.01) per run.
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.under_sampling import TomekLinks
import numpy as np
# Create an imbalanced dataset: 5% minority class
X, y = make_classification(
n_samples=10000,
n_features=20,
n_informative=10,
n_redundant=5,
n_clusters_per_class=2,
weights=[0.95, 0.05],
flip_y=0.02,
random_state=42,
)
print(f"Original distribution: {Counter(y)}")
# Output: Counter({0: 9500, 1: 500})
# Apply Tomek Links: remove majority-class samples forming Tomek links
tomek = TomekLinks(sampling_strategy='majority')
X_clean, y_clean = tomek.fit_resample(X, y)
print(f"After Tomek Links: {Counter(y_clean)}")
# Output: Counter({0: ~9350, 1: 500}) -- removed ~150 boundary majority samples
# Check how many samples were removed
removed = len(y) - len(y_clean)
print(f"Removed {removed} majority-class samples ({removed/Counter(y)[0]*100:.1f}%)")
# Verify: all minority samples are preserved
assert Counter(y_clean)[1] == Counter(y)[1], "Minority samples must be preserved!"The basic usage is straightforward: create a TomekLinks instance with sampling_strategy='majority' (default behavior), call fit_resample(X, y), and receive the cleaned dataset. Notice that Tomek Links removes a relatively small percentage of majority samples (typically 1-5%) -- it is a cleaning technique, not an aggressive balancing technique. All minority-class samples are always preserved.
from collections import Counter
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.ensemble import GradientBoostingClassifier
from imblearn.combine import SMOTETomek
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import TomekLinks
from imblearn.pipeline import Pipeline as ImbPipeline
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import f1_score, classification_report
import numpy as np
# Create imbalanced dataset
X, y = make_classification(
n_samples=20000,
n_features=30,
n_informative=15,
weights=[0.97, 0.03],
flip_y=0.01,
random_state=42,
)
print(f"Original: {Counter(y)}")
# Method 1: SMOTETomek convenience class
smt = SMOTETomek(
smote=SMOTE(sampling_strategy=0.5, k_neighbors=5, random_state=42),
tomek=TomekLinks(sampling_strategy='majority'),
random_state=42,
)
X_resampled, y_resampled = smt.fit_resample(X, y)
print(f"After SMOTETomek: {Counter(y_resampled)}")
# Method 2: Explicit pipeline (more control)
pipeline = ImbPipeline([
('scaler', StandardScaler()),
('smote', SMOTE(sampling_strategy=0.5, k_neighbors=5, random_state=42)),
('tomek', TomekLinks(sampling_strategy='majority')),
('model', GradientBoostingClassifier(n_estimators=200, random_state=42)),
])
# Cross-validate with resampling inside the fold (no data leakage)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(pipeline, X, y, cv=cv, scoring='f1', n_jobs=-1)
print(f"\n5-Fold F1 (SMOTETomek + GBM): {scores.mean():.4f} (+/- {scores.std():.4f})")
# Compare with SMOTE-only pipeline
pipeline_smote_only = ImbPipeline([
('scaler', StandardScaler()),
('smote', SMOTE(sampling_strategy=0.5, k_neighbors=5, random_state=42)),
('model', GradientBoostingClassifier(n_estimators=200, random_state=42)),
])
scores_smote = cross_val_score(pipeline_smote_only, X, y, cv=cv, scoring='f1', n_jobs=-1)
print(f"5-Fold F1 (SMOTE-only + GBM): {scores_smote.mean():.4f} (+/- {scores_smote.std():.4f})")This example shows the two main ways to use SMOTETomek: the convenience class SMOTETomek for quick resampling, and an explicit imblearn.pipeline.Pipeline for production use with cross-validation. The pipeline approach is critical because it ensures resampling happens inside each CV fold, preventing data leakage. Note the sampling_strategy=0.5 for SMOTE, meaning the minority class is oversampled to 50% of the majority class size -- not full balance, which often leads to overfitting. Tomek Links then cleans up any boundary noise introduced by SMOTE.
import numpy as np
from sklearn.neighbors import NearestNeighbors
from collections import Counter
from typing import Tuple
def find_tomek_links(
X: np.ndarray,
y: np.ndarray,
metric: str = 'euclidean',
n_jobs: int = -1,
) -> np.ndarray:
"""Find indices of samples that form Tomek links.
Returns array of indices to REMOVE (majority-class members of each Tomek link).
"""
# Step 1: Build nearest neighbor index
nn = NearestNeighbors(n_neighbors=2, metric=metric, n_jobs=n_jobs)
nn.fit(X)
# Step 2: Query for nearest neighbor of each sample
# n_neighbors=2 because the first neighbor is the sample itself
distances, indices = nn.kneighbors(X)
nn_indices = indices[:, 1] # Index of nearest neighbor (excluding self)
# Step 3: Identify Tomek links
tomek_link_indices = []
n_samples = len(y)
majority_class = Counter(y).most_common(1)[0][0]
for i in range(n_samples):
j = nn_indices[i]
# Check mutual nearest neighbor condition AND different classes
if nn_indices[j] == i and y[i] != y[j]:
# Remove the majority-class sample
if y[i] == majority_class:
tomek_link_indices.append(i)
else:
tomek_link_indices.append(j)
# Deduplicate (a majority sample might be identified from both sides)
return np.unique(tomek_link_indices)
def apply_tomek_links(
X: np.ndarray,
y: np.ndarray,
metric: str = 'euclidean',
remove_both: bool = False,
) -> Tuple[np.ndarray, np.ndarray]:
"""Apply Tomek Links cleaning to a dataset."""
nn = NearestNeighbors(n_neighbors=2, metric=metric, n_jobs=-1)
nn.fit(X)
_, indices = nn.kneighbors(X)
nn_indices = indices[:, 1]
majority_class = Counter(y).most_common(1)[0][0]
to_remove = set()
for i in range(len(y)):
j = nn_indices[i]
if nn_indices[j] == i and y[i] != y[j]:
if remove_both:
to_remove.add(i)
to_remove.add(j)
else:
# Remove only majority-class member
to_remove.add(i if y[i] == majority_class else j)
mask = np.ones(len(y), dtype=bool)
mask[list(to_remove)] = False
return X[mask], y[mask]
# Example usage
from sklearn.datasets import make_classification
X, y = make_classification(
n_samples=5000, weights=[0.95, 0.05], random_state=42
)
print(f"Before: {Counter(y)}")
X_clean, y_clean = apply_tomek_links(X, y)
print(f"After: {Counter(y_clean)}")
# With custom distance metric
X_manhattan, y_manhattan = apply_tomek_links(X, y, metric='manhattan')
print(f"Manhattan distance: {Counter(y_manhattan)}")This custom implementation exposes the internals of the algorithm. The key insight is in the nn_indices[j] == i check -- this is the mutual nearest neighbor condition that defines a Tomek link. The implementation uses scikit-learn's NearestNeighbors for efficient NN computation but handles the link identification and removal manually. This gives you full control over the distance metric, the removal strategy (remove_both parameter), and allows easy extension to iterative Tomek link removal (recomputing NNs after each round of cleaning).
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.neighbors import NearestNeighbors
from imblearn.under_sampling import TomekLinks
from collections import Counter
# Create a 2D dataset for visualization
np.random.seed(42)
n_majority = 500
n_minority = 50
# Majority class: two clusters
X_maj = np.vstack([
np.random.randn(n_majority // 2, 2) + [2, 2],
np.random.randn(n_majority // 2, 2) + [-2, -2],
])
y_maj = np.zeros(n_majority, dtype=int)
# Minority class: overlapping with majority boundary
X_min = np.random.randn(n_minority, 2) * 0.8 + [0, 0]
y_min = np.ones(n_minority, dtype=int)
X = np.vstack([X_maj, X_min])
y = np.concatenate([y_maj, y_min])
print(f"Original: {Counter(y)}")
# Find Tomek links manually for visualization
nn = NearestNeighbors(n_neighbors=2).fit(X)
_, indices = nn.kneighbors(X)
nn_idx = indices[:, 1]
tomek_pairs = []
for i in range(len(y)):
j = nn_idx[i]
if nn_idx[j] == i and y[i] != y[j]:
tomek_pairs.append((i, j))
# Remove duplicate pairs
seen = set()
unique_pairs = []
for i, j in tomek_pairs:
pair = (min(i, j), max(i, j))
if pair not in seen:
seen.add(pair)
unique_pairs.append(pair)
print(f"Found {len(unique_pairs)} Tomek link pairs")
# Visualize
fig, axes = plt.subplots(1, 2, figsize=(14, 6))
# Before cleaning
ax = axes[0]
ax.scatter(X[y == 0, 0], X[y == 0, 1], c='blue', alpha=0.3, s=20, label='Majority')
ax.scatter(X[y == 1, 0], X[y == 1, 1], c='red', alpha=0.7, s=40, label='Minority')
for i, j in unique_pairs:
ax.plot([X[i, 0], X[j, 0]], [X[i, 1], X[j, 1]], 'k-', alpha=0.5, linewidth=1)
ax.scatter(X[i, 0], X[i, 1], c='orange', s=80, zorder=5, edgecolors='black')
ax.scatter(X[j, 0], X[j, 1], c='orange', s=80, zorder=5, edgecolors='black')
ax.set_title(f'Before: {len(X)} samples, {len(unique_pairs)} Tomek links')
ax.legend()
# After cleaning
tomek = TomekLinks(sampling_strategy='majority')
X_clean, y_clean = tomek.fit_resample(X, y)
ax = axes[1]
ax.scatter(X_clean[y_clean == 0, 0], X_clean[y_clean == 0, 1],
c='blue', alpha=0.3, s=20, label='Majority')
ax.scatter(X_clean[y_clean == 1, 0], X_clean[y_clean == 1, 1],
c='red', alpha=0.7, s=40, label='Minority')
ax.set_title(f'After Tomek Links: {len(X_clean)} samples')
ax.legend()
plt.tight_layout()
plt.savefig('tomek_links_visualization.png', dpi=150)
plt.show()This visualization example creates a 2D dataset where you can see the Tomek links as lines connecting cross-class mutual nearest neighbors. The before plot shows these links as black lines with orange-highlighted endpoints -- these are the ambiguous boundary pairs. The after plot shows the cleaned dataset with majority-class members of Tomek links removed. This visualization is extremely useful for building intuition about what Tomek Links actually does and for presenting results to stakeholders who may not be familiar with the algorithm.
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import (
make_scorer, f1_score, precision_score, recall_score,
average_precision_score, roc_auc_score,
)
from imblearn.pipeline import Pipeline as ImbPipeline
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import TomekLinks
from imblearn.combine import SMOTETomek
import joblib
import json
from datetime import datetime
def build_fraud_detection_pipeline(
smote_ratio: float = 0.3,
smote_k: int = 5,
n_estimators: int = 300,
max_depth: int = 6,
learning_rate: float = 0.05,
) -> ImbPipeline:
"""Build a production fraud detection pipeline with SMOTETomek."""
return ImbPipeline([
('scaler', StandardScaler()),
('smote', SMOTE(
sampling_strategy=smote_ratio,
k_neighbors=smote_k,
random_state=42,
)),
('tomek', TomekLinks(sampling_strategy='majority')),
('model', GradientBoostingClassifier(
n_estimators=n_estimators,
max_depth=max_depth,
learning_rate=learning_rate,
subsample=0.8,
random_state=42,
)),
])
def evaluate_pipeline(pipeline, X, y, n_splits=5):
"""Evaluate with stratified CV and multiple metrics."""
cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=42)
scoring = {
'f1': make_scorer(f1_score),
'precision': make_scorer(precision_score, zero_division=0),
'recall': make_scorer(recall_score),
'roc_auc': 'roc_auc',
'avg_precision': make_scorer(average_precision_score, needs_proba=True),
}
results = cross_validate(
pipeline, X, y,
cv=cv,
scoring=scoring,
return_train_score=True,
n_jobs=-1,
)
report = {}
for metric in scoring:
train_key = f'train_{metric}'
test_key = f'test_{metric}'
report[metric] = {
'train_mean': float(np.mean(results[train_key])),
'test_mean': float(np.mean(results[test_key])),
'test_std': float(np.std(results[test_key])),
'test_per_fold': [float(s) for s in results[test_key]],
}
return report
# Example usage
from sklearn.datasets import make_classification
X, y = make_classification(
n_samples=50000,
n_features=30,
n_informative=20,
weights=[0.99, 0.01], # 1% fraud rate
random_state=42,
)
pipeline = build_fraud_detection_pipeline(smote_ratio=0.3)
results = evaluate_pipeline(pipeline, X, y)
print("=" * 60)
print("Fraud Detection Pipeline: SMOTETomek + GBM")
print("=" * 60)
for metric, values in results.items():
gap = values['train_mean'] - values['test_mean']
print(f"{metric:>15}: test={values['test_mean']:.4f} +/- {values['test_std']:.4f} "
f"(train={values['train_mean']:.4f}, gap={gap:.4f})")This production-grade example demonstrates how Tomek Links fits into a real fraud detection pipeline. Key design decisions: (1) smote_ratio=0.3 means the minority class is oversampled to 30% of the majority, not full balance -- this avoids overfitting to synthetic samples. (2) Tomek Links runs after SMOTE to clean the boundary. (3) The pipeline is evaluated inside stratified CV to prevent leakage. (4) Multiple metrics are tracked including average precision (PR-AUC), which is more informative than ROC-AUC for extreme imbalance. (5) The train-test gap is computed for each metric to monitor overfitting. This pattern is directly applicable to fraud detection at fintech companies like Razorpay, PayU, and PhonePe.
# Tomek Links configuration (YAML equivalent)
tomek_links:
sampling_strategy: 'majority' # 'majority' | 'all' | 'auto' | dict
n_jobs: -1 # Parallel NN computation
# SMOTETomek combined configuration
smote_tomek:
smote:
sampling_strategy: 0.3 # Oversample minority to 30% of majority
k_neighbors: 5 # k for SMOTE interpolation
random_state: 42
tomek:
sampling_strategy: 'majority' # Clean only majority-class links
# Pipeline configuration
pipeline:
preprocessing:
- type: StandardScaler
resampling:
- type: SMOTE
sampling_strategy: 0.3
k_neighbors: 5
- type: TomekLinks
sampling_strategy: 'majority'
model:
type: GradientBoostingClassifier
params:
n_estimators: 300
max_depth: 6
learning_rate: 0.05
subsample: 0.8
# Evaluation
evaluation:
cv_strategy: stratified_kfold
n_splits: 5
primary_metric: f1
additional_metrics:
- precision
- recall
- roc_auc
- average_precisionCommon Implementation Mistakes
- ●
Applying Tomek Links before train-test split: Running
TomekLinks.fit_resample()on the entire dataset before splitting causes data leakage -- the NN computation uses validation/test samples to identify which training samples to remove. Always apply Tomek Links inside a pipeline with cross-validation, usingimblearn.pipeline.Pipeline. - ●
Expecting Tomek Links to balance the dataset: Tomek Links is a cleaning technique, not a balancing technique. It removes only a small percentage (typically 1-5%) of majority samples. If you need substantial rebalancing, combine it with SMOTE or use random undersampling.
- ●
Using Tomek Links on high-dimensional sparse data: In very high dimensions (e.g., TF-IDF vectors with 50,000+ features), nearest neighbor computations become unreliable due to the curse of dimensionality. Distance metrics lose discriminative power, and the identified Tomek links may not represent true boundary pairs. Apply dimensionality reduction (PCA, truncated SVD) first.
- ●
Ignoring the distance metric choice: The default Euclidean distance assumes all features have comparable scales. If features are on different scales (e.g., age in years and income in lakhs), Tomek Links will be dominated by the high-magnitude feature. Always scale features before applying Tomek Links, or use a pipeline with
StandardScaler. - ●
Running Tomek Links iteratively without convergence check: Some practitioners run Tomek Links repeatedly until no more links are found. This can over-clean the dataset, removing informative boundary samples. One pass is usually sufficient; if you need more aggressive cleaning, consider Edited Nearest Neighbors instead.
- ●
Applying to multiclass problems without understanding the behavior: In multiclass settings, Tomek Links identifies links between any pair of different classes, which may not target the specific class boundary you care about. Use the
sampling_strategyparameter to restrict which classes are cleaned.
When Should You Use This?
Use When
Your dataset has class overlap at the decision boundary -- majority-class samples are interspersed with minority-class samples, creating ambiguity that hurts classifier performance
You are already using SMOTE or another oversampling method and want to clean up the boundary noise that oversampling introduces -- the classic SMOTETomek pipeline
You need a deterministic, conservative cleaning method that removes only the most ambiguous samples, preserving the overall data distribution and avoiding the randomness of random undersampling
Your minority class is very small (hundreds of samples) and you cannot afford to lose any minority samples -- Tomek Links by default only removes majority-class samples
You are working with tabular data of moderate dimensionality () where nearest-neighbor distances are meaningful and the curse of dimensionality is not a concern
You need a preprocessing step that is easy to explain to stakeholders, regulators, or auditors -- the concept of removing ambiguous boundary pairs is intuitive and defensible
Avoid When
Your dataset is severely imbalanced (e.g., 99.9% majority) and you need significant rebalancing -- Tomek Links removes too few samples to meaningfully change the class ratio; use random undersampling, NearMiss, or cluster centroids instead
Your data is very high-dimensional (thousands of features) -- nearest neighbor distances become unreliable in high dimensions, and the identified Tomek links may not correspond to true boundary ambiguity
You have a very large dataset (millions of samples) and the or nearest neighbor computation is too expensive -- consider random undersampling or feature-space approaches instead
The class boundary is inherently noisy and removing boundary samples would discard genuinely informative data -- some domains (medical diagnosis, anomaly detection) have fundamentally overlapping class distributions that should not be cleaned away
You are working with non-numeric data (categorical features, text, images in raw form) where Euclidean nearest-neighbor distances are not meaningful -- preprocess into a numeric embedding first or use domain-specific cleaning methods
Your evaluation metric is not sensitive to boundary quality -- if you are optimizing for recall at a fixed low threshold, Tomek Links' boundary cleaning may not translate into meaningful metric improvement
Key Tradeoffs
Cleaning Aggressiveness vs. Information Preservation
Tomek Links sits at the conservative end of the undersampling spectrum. Here is how it compares:
| Technique | Samples Removed | Aggressiveness | Boundary Focus | Deterministic |
|---|---|---|---|---|
| Tomek Links | 1-5% of majority | Very Low | Yes (only mutual NN pairs) | Yes |
| Edited NN (ENN) | 5-20% of majority | Low-Medium | Yes (k-NN misclassified) | Yes |
| NearMiss | Up to 50%+ of majority | High | Yes (distance-based selection) | Yes |
| Random Undersampling | Variable (to target ratio) | Variable | No (random) | No |
| Cluster Centroids | Replaces all majority with centroids | High | No (cluster-based) | Yes |
The tradeoff is clear: Tomek Links preserves the most information but provides the least rebalancing. If your problem primarily suffers from boundary noise rather than class ratio imbalance, Tomek Links (or SMOTETomek) is the right choice. If the class ratio itself is the bottleneck, you need a more aggressive approach.
Standalone vs. Combined Usage
Using Tomek Links standalone is rarely the best strategy. Its strength is as a post-processing cleaner in combination with an oversampling method:
- SMOTETomek: The gold standard. SMOTE adds synthetic minority samples, Tomek Links cleans the boundary. Best for moderate imbalance (1-10% minority).
- SMOTE + ENN (SMOTEENN): More aggressive cleaning than SMOTETomek. ENN removes all majority samples misclassified by k-NN (not just mutual NN pairs). Better when there is substantial class overlap.
For Indian ML teams working on cost-constrained projects, the compute overhead of Tomek Links is negligible compared to model training. A single Tomek Links pass on 100K samples costs under INR 0.10 (~$0.001) in cloud compute. The question is purely about data quality, not cost.
When Does the Extra Cleaning Actually Help?
Empirical studies (Batista et al., 2004) show that SMOTETomek outperforms SMOTE alone on datasets with moderate class overlap but shows minimal improvement on datasets with well-separated classes. If a simple decision tree can already separate the classes with >95% accuracy on the majority class, Tomek Links cleaning will not add much. Its value scales with the degree of boundary ambiguity in your specific dataset.
Alternatives & Comparisons
ENN removes samples where the majority of k nearest neighbors belong to a different class, making it more aggressive than Tomek Links (which requires mutual nearest neighbor status). ENN typically removes 5-20% of majority samples vs. Tomek Links' 1-5%. Choose ENN when you have significant class overlap and need stronger cleaning; choose Tomek Links when you want minimal, surgical removal of only the most ambiguous pairs.
Random undersampling discards majority-class samples uniformly at random until a target ratio is achieved. It is much faster (no NN computation), can achieve any target ratio, and is non-deterministic. However, it does not distinguish between informative and noisy samples. Choose random undersampling when you need aggressive rebalancing and speed; choose Tomek Links when boundary quality matters more than class ratio.
SMOTE generates synthetic minority samples by interpolating between existing minority samples. It is an oversampling technique (adds samples) while Tomek Links is an undersampling/cleaning technique (removes samples). They are complementary: SMOTE addresses class ratio imbalance, Tomek Links addresses boundary noise. The standard combination is SMOTETomek, where SMOTE runs first and Tomek Links cleans the result.
SMOTETomek is the combined pipeline that runs SMOTE followed by Tomek Links in a single step. It is not an alternative to Tomek Links but rather its most common usage pattern. Choose standalone Tomek Links when you already have a balanced dataset and just need boundary cleaning; choose SMOTETomek when you need both oversampling and cleaning.
NearMiss selects majority-class samples based on their distance to minority-class samples (keeping only the closest ones). It is far more aggressive than Tomek Links and can achieve any target ratio. However, it discards most of the majority class, potentially losing important information about the majority distribution. Choose NearMiss when extreme rebalancing is needed; choose Tomek Links for gentle boundary cleaning.
Cluster centroids replaces the majority class with K-means cluster centroids, dramatically reducing the majority class size while preserving its overall shape. Unlike Tomek Links, it creates new synthetic majority samples (the centroids) rather than selecting existing ones. Choose cluster centroids for aggressive rebalancing with shape preservation; choose Tomek Links for minimal, boundary-focused cleaning.
Pros, Cons & Tradeoffs
Advantages
Targeted boundary cleaning -- removes only the majority-class samples that are most ambiguous (mutual nearest neighbors of minority samples), preserving informative majority-class samples far from the boundary
Preserves all minority samples -- in the default
sampling_strategy='majority'mode, no minority-class samples are ever removed, which is critical when the minority class has very few samples (e.g., 50-200)Deterministic and reproducible -- given the same dataset and distance metric, Tomek Links always identifies the same set of links and produces the same cleaned dataset, unlike random undersampling or SMOTE which depend on random seeds
Excellent as a post-processing step for SMOTE -- the SMOTETomek combination is well-studied and consistently outperforms SMOTE alone on datasets with class overlap (Batista et al., 2004), making Tomek Links' cleaning the standard second step in hybrid resampling pipelines
Minimal hyperparameter tuning -- the only real choice is the removal strategy (
majorityvs.both), making Tomek Links essentially parameter-free and easy to integrate into automated pipelinesIntuitive and explainable -- the concept of removing ambiguous boundary pairs is easy to explain to non-technical stakeholders, regulators, and auditors, which is important in regulated domains like finance (RBI guidelines) and healthcare (ICMR)
Low computational overhead -- for datasets under 100K samples, Tomek Links runs in seconds, adding negligible cost to a training pipeline. On Indian cloud pricing (AWS Mumbai or E2E Networks), a single run costs well under INR 1
Disadvantages
Does not balance the dataset -- Tomek Links removes only a small percentage of majority samples (typically 1-5%), so the class ratio remains heavily skewed. It is not a replacement for oversampling or aggressive undersampling when class imbalance is the primary problem
Computationally expensive for large datasets -- the brute-force or tree-based nearest neighbor computation becomes a bottleneck for datasets with millions of samples. A 1M-sample dataset takes 15-60 seconds depending on dimensionality
Sensitive to distance metric and feature scaling -- if features are not scaled, high-magnitude features dominate the distance calculation and the identified Tomek links may not reflect true boundary ambiguity. Requires careful preprocessing
Unreliable in high dimensions -- in spaces with hundreds or thousands of features, nearest neighbor distances converge (curse of dimensionality), making the mutual NN condition less meaningful. The algorithm may identify spurious Tomek links
Does not address the root cause of overlap -- if classes genuinely overlap in feature space (common in medical diagnosis, fraud detection), removing boundary samples does not help -- the overlap is a fundamental property of the problem, not noise to be cleaned
Limited benefit for tree-based models -- gradient boosting and random forests are already robust to noisy boundary samples because they learn piecewise decision boundaries. Tomek Links cleaning may show minimal improvement for these models while being more impactful for linear models, SVMs, and neural networks
Failure Modes & Debugging
Curse of dimensionality renders Tomek links meaningless
Cause
In high-dimensional spaces (), the ratio of the distance to the nearest neighbor versus the distance to the farthest neighbor converges to 1. This means nearest neighbor relationships become nearly random, and the mutual NN condition that defines Tomek links no longer identifies truly ambiguous boundary pairs.
Symptoms
The number of Tomek links found is either much higher or much lower than expected. Removing identified links does not improve classifier performance. In extreme cases, performance may degrade because the removed samples were not actually boundary noise. The cleaned dataset shows no visual difference in boundary quality (when projected to 2D with t-SNE or PCA).
Mitigation
Apply dimensionality reduction before Tomek Links: use PCA to reduce to the top 20-50 principal components, or use truncated SVD for sparse data (e.g., TF-IDF vectors). Alternatively, use a distance metric better suited to high dimensions (cosine similarity instead of Euclidean). For very high-dimensional data, consider Edited Nearest Neighbors with a higher k value, which is more robust to dimensional effects.
Data leakage from applying Tomek Links before cross-validation
Cause
Running TomekLinks.fit_resample() on the full dataset before train-test splitting or cross-validation. The nearest neighbor computation uses information from validation/test samples to determine which training samples are Tomek links and should be removed.
Symptoms
CV scores are slightly inflated compared to held-out test performance. The inflation is typically small (0.5-2% for Tomek Links alone) but can be larger when combined with SMOTE in SMOTETomek. Production performance consistently underperforms offline evaluation. The effect is most noticeable on small datasets where each sample has more influence.
Mitigation
Always apply Tomek Links inside a imblearn.pipeline.Pipeline and use it within cross_val_score or cross_validate. Never call fit_resample on the full dataset before splitting. If using a custom training loop, ensure Tomek Links is applied only to the training fold of each CV iteration.
Over-cleaning in iterative Tomek Links application
Cause
Running Tomek Links repeatedly (recomputing nearest neighbors and removing links each iteration) until no more links exist. Each round of removal creates new nearest-neighbor relationships that may flag previously safe samples as Tomek links. After several iterations, the cleaned dataset may have removed a significant fraction of informative majority-class samples near the boundary.
Symptoms
Classifier precision drops (more false positives) even though recall improves. The majority-class distribution becomes hollow near the boundary -- there are no majority samples close to the minority class. The model becomes overconfident in the minority class region, failing to properly identify majority-class samples that legitimately exist near the boundary.
Mitigation
Run Tomek Links for exactly one pass (the standard approach). If you need more aggressive cleaning, switch to Edited Nearest Neighbors (ENN) or SMOTEENN, which are designed for stronger cleaning in a single pass. If iterative cleaning is genuinely needed, set a maximum iteration count (2-3) and monitor the number of samples removed per iteration -- stop when the count drops below a threshold.
Ineffective cleaning for genuinely overlapping distributions
Cause
The class distributions fundamentally overlap in feature space -- no amount of boundary cleaning will separate them because the overlap is not noise but a true property of the data generation process. This is common in medical diagnosis (healthy and sick patients can have identical lab values) and fraud detection (sophisticated fraudsters mimic legitimate behavior perfectly).
Symptoms
Tomek Links removes boundary samples, but classifier performance does not improve. The confusion matrix shows that misclassifications are spread across the entire feature space, not concentrated at the boundary. Adding more features (better feature engineering) has a much larger effect than any resampling technique.
Mitigation
Recognize that resampling is not the bottleneck -- feature engineering is. Invest in creating features that better separate the classes. Consider ensemble methods (gradient boosting, random forests) that can model complex, overlapping boundaries without needing clean separation. For truly overlapping distributions, threshold tuning and cost-sensitive learning are often more effective than data-level resampling.
Unscaled features distort Tomek link identification
Cause
Features with different scales (e.g., age in years [0-100] and annual income in INR [100,000-10,000,000]) cause Euclidean distances to be dominated by the high-magnitude feature. Tomek links are identified based on income similarity alone, ignoring all other features.
Symptoms
The identified Tomek links cluster along one feature dimension. Cleaning removes samples that are far apart in most feature dimensions but happen to be close in the high-magnitude feature. Cleaned dataset does not show improved boundary quality when inspected with PCA or t-SNE projections.
Mitigation
Always apply StandardScaler or MinMaxScaler before Tomek Links. In imblearn.pipeline.Pipeline, place the scaler as the first step before the TomekLinks step. Alternatively, use Manhattan distance (less sensitive to outliers) or Mahalanobis distance (accounts for feature correlations and scales) in the custom implementation.
Placement in an ML System
Where Tomek Links Sits in the ML Pipeline
Tomek Links occupies a specific niche in the preprocessing stage of the ML pipeline, between feature engineering and model training. Its position is critical: it must run after feature scaling (so distances are meaningful) and after train-test splitting (to prevent data leakage), but before model training.
The typical pipeline flow for imbalanced classification is:
- Data ingestion -> raw data
- Feature engineering -> engineered features
- Train-test split -> training set and held-out test set
- Within each CV fold: a. Feature scaling (StandardScaler / MinMaxScaler) b. Oversampling (SMOTE) on training fold only c. Boundary cleaning (Tomek Links) on augmented training fold d. Model training on cleaned, augmented training fold e. Evaluation on original (unmodified) validation fold
This ordering is enforced automatically when using imblearn.pipeline.Pipeline with cross_validate.
In production ML systems at Indian fintech companies (Razorpay, PayU, CRED), Tomek Links is typically a step in an Airflow or Kubeflow pipeline that runs during scheduled retraining. The resampled data is not persisted (it changes with each retraining cycle); instead, the pipeline configuration (SMOTE ratio, Tomek strategy) is versioned alongside the model in the model registry.
Key Integration Point: Tomek Links produces a different-sized dataset than its input (some samples are removed). Any pipeline component downstream must handle variable-length inputs. This is not an issue with scikit-learn or imbalanced-learn pipelines but can trip up custom training loops that assume fixed batch sizes.
Pipeline Stage
Data Preprocessing / Resampling
Upstream
- train-test-split
- scaling
- encoding
- feature-selection
Downstream
- model-training
- cross-validation
- hyperparameter-tuning
Scaling Bottlenecks
The primary bottleneck is the nearest neighbor computation, which scales as with tree-based indices (KD-tree, ball-tree) for low-to-moderate dimensions, and for high dimensions or brute-force.
Here are practical benchmarks on typical hardware:
| Dataset Size | Features | Algorithm | Time | Cloud Cost (AWS Mumbai) | Cloud Cost (E2E Networks) |
|---|---|---|---|---|---|
| 10,000 | 20 | KD-tree | ~0.3s | < INR 0.01 | < INR 0.01 |
| 100,000 | 50 | KD-tree | ~3s | < INR 0.05 | < INR 0.03 |
| 500,000 | 50 | KD-tree | ~20s | < INR 0.10 | < INR 0.06 |
| 1,000,000 | 50 | Ball-tree | ~60s | ~ INR 0.25 (~$0.003) | ~ INR 0.18 |
| 1,000,000 | 500 | Brute-force | ~600s | ~ INR 2.50 (~$0.03) | ~ INR 1.80 |
For most practical use cases (tabular data with 10K-500K samples and under 100 features), Tomek Links is negligible in cost compared to model training. The bottleneck is almost always the downstream model training (especially if using gradient boosting or neural networks), not the resampling step.
For very large datasets (10M+ samples), consider these alternatives:
- Sample-then-clean: Apply Tomek Links to a random subsample of the majority class, then use the identified link pattern to extrapolate cleaning to the full dataset
- Approximate nearest neighbors: Use libraries like FAISS or Annoy for approximate NN computation in time regardless of dimensionality
- Skip Tomek Links entirely: At 10M+ samples, the boundary is often well-enough defined that cleaning provides minimal benefit
Production Case Studies
A comprehensive study published in the Journal of Big Data evaluated the effect of feature extraction (PCA, CAE) and data sampling techniques (RUS, SMOTE, SMOTE-Tomek) on credit card fraud detection using the European cardholders dataset (284,807 transactions, 0.172% fraud rate). The researchers tested multiple classifiers (Random Forest, SVM, Neural Network) with different resampling strategies, finding that SMOTE-Tomek combined with Random Forest achieved an F1-score of 0.8677, significantly outperforming models trained on the original imbalanced data.
SMOTE-Tomek improved Random Forest F1-score from 0.74 (no resampling) to 0.87 on the European bank fraud dataset, a 17.6% relative improvement. The study confirmed that Tomek Links' boundary cleaning after SMOTE consistently reduced false positives compared to SMOTE alone.
SAS published a research study at SAS Global Forum 2018 testing the efficacy of SMOTE and SMOTE+Tomek Links on a credit card fraud transaction dataset. They trained four classifiers -- Random Forest, Neural Network, SVM, and Rule Induction -- on the original imbalanced data, SMOTE-resampled data, and SMOTE+Tomek-resampled data. The study systematically measured recall (fraud detection rate) and the false negative to false positive ratio across all combinations.
SMOTE+Tomek improved recall and lowered the FN/FP ratio for every single classifier tested, demonstrating that Tomek Links' boundary cleaning consistently improves fraud detection regardless of the model choice. The SVM classifier showed the most dramatic improvement, with recall increasing from 62% to 81% after SMOTE+Tomek.
A detailed case study on the IEEE-CIS Fraud Detection competition dataset (590K transactions provided by Vesta Corporation) demonstrated the SMOTETomek + XGBoost pipeline for production fraud detection. The author compared baseline XGBoost (no resampling), SMOTE+XGBoost, and SMOTETomek+XGBoost, with detailed ROC curves and confusion matrix analysis. The SMOTETomek approach improved out-of-the-box XGBoost fraud detection rate from 84% to 88%.
After combining SMOTETomek with hyperparameter-tuned XGBoost, the model achieved a 92% fraud detection rate with 98%+ correct classification rate on the held-out test set, demonstrating that Tomek Links cleaning of SMOTE-generated boundary noise contributes measurably to final model performance even with strong tree-based classifiers.
A study published in the Journal of Big Data applied SMOTE, SMOTE+Tomek, and SMOTE+ENN to a medical insurance fraud detection dataset with extreme class imbalance. The study is particularly relevant to Indian healthcare systems (Ayushman Bharat, NHPM) where insurance fraud detection operates on datasets with fraud rates below 1%. The researchers found that SMOTE+Tomek with Random Forest consistently outperformed single-technique approaches across precision, recall, and F1-score.
SMOTE+Tomek achieved the best precision-recall balance (F1=0.91) compared to SMOTE alone (F1=0.87) and no resampling (F1=0.72) on the healthcare fraud dataset. The Tomek Links cleaning step was especially impactful for reducing false positives, which is critical in healthcare where false fraud accusations can deny legitimate patient claims.
Tooling & Ecosystem
The standard implementation of Tomek Links in Python. Part of the imbalanced-learn library (scikit-learn-contrib). Provides TomekLinks class with fit_resample(X, y) API, support for multiclass, configurable sampling_strategy, and seamless integration with imblearn.pipeline.Pipeline. Handles KD-tree/ball-tree/brute-force NN algorithm selection automatically via scikit-learn's NearestNeighbors.
The combined SMOTE + Tomek Links pipeline in imbalanced-learn. Runs SMOTE first to generate synthetic minority samples, then Tomek Links to clean boundary noise. Configurable SMOTE and Tomek parameters via smote and tomek constructor arguments. The most commonly used wrapper for production Tomek Links usage.
The underlying nearest-neighbor engine used by imbalanced-learn. Provides KD-tree, ball-tree, and brute-force algorithms with configurable distance metrics (Euclidean, Manhattan, Minkowski, Chebyshev, custom). Essential for custom Tomek Links implementations that need non-default distance metrics or approximate NN for large-scale data.
R package providing TomekClassif() function for Tomek Links-based classification on imbalanced data. Supports both majority-only and both-class removal strategies. Part of the broader UBL package for utility-based learning that includes multiple resampling techniques.
Julia implementation of SMOTE-Tomek Links for high-performance imbalanced learning. Provides fraud detection examples using the combined pipeline. Julia's JIT compilation makes the NN computation significantly faster than Python for large datasets, making this a good choice for teams already using Julia's ML ecosystem.
Facebook AI Similarity Search library for approximate nearest neighbor computation. While not a Tomek Links implementation itself, FAISS can replace the NN computation in custom Tomek Links implementations for datasets with millions of samples. The approximate NN trades a small accuracy reduction for 10-100x speedup on large datasets.
Research & References
Tomek, I. (1976)IEEE Transactions on Systems, Man, and Communications, Vol. 6
The foundational paper that introduced Tomek Links. Proposed two modifications to Hart's Condensed Nearest Neighbor (CNN) rule: (1) Tomek links -- mutual nearest neighbor pairs from different classes that indicate boundary or noise samples, and (2) a modified CNN that uses Tomek links to improve the condensation process. This paper is the origin of all modern Tomek Link-based resampling methods.
Batista, G.E.A.P.A., Prati, R.C., Monard, M.C. (2004)ACM SIGKDD Explorations Newsletter, Vol. 6
The landmark study that established the SMOTETomek and SMOTEENN combined pipelines. Analyzed the behavior of oversampling (SMOTE) combined with undersampling (Tomek Links, ENN) on 13 benchmark datasets. Showed that SMOTE+Tomek consistently outperformed either technique alone, particularly for datasets with class overlap and small disjuncts. This paper is the primary citation for the SMOTETomek methodology.
Chawla, N.V., Bowyer, K.W., Hall, L.O., Kegelmeyer, W.P. (2002)Journal of Artificial Intelligence Research, Vol. 16
The paper that introduced SMOTE, the most commonly paired technique with Tomek Links. SMOTE generates synthetic minority samples by interpolating between existing minority samples and their k-nearest neighbors. Understanding SMOTE is essential for understanding why Tomek Links is needed as a post-processing step: SMOTE can introduce noise at the boundary, and Tomek Links cleans it up.
Fernandez, A., Garcia, S., Herrera, F., Chawla, N.V. (2018)Journal of Artificial Intelligence Research, Vol. 61
A comprehensive 15-year retrospective on SMOTE and its extensions, including SMOTETomek. Reviews the theoretical foundations, practical guidelines, and open challenges. Discusses when combined approaches (SMOTE+Tomek, SMOTE+ENN) outperform standalone methods and provides recommendations for practitioners working with imbalanced data across different domains.
Hart, P.E. (1968)IEEE Transactions on Information Theory, Vol. 14
The original CNN paper that Tomek's 1976 work modified. Hart proposed keeping only the training samples necessary for correct nearest-neighbor classification, discarding redundant interior points. Understanding CNN is essential for understanding Tomek's modifications: Tomek Links identifies the boundary pairs that CNN's condensation process should prioritize.
Wilson, D.L. (1972)IEEE Transactions on Systems, Man, and Cybernetics, Vol. 2
Introduced the Edited Nearest Neighbor (ENN) rule, the "sibling" technique to Tomek Links. Wilson's ENN removes samples misclassified by their k nearest neighbors. Tomek Links and ENN represent two complementary approaches to data cleaning: Tomek Links targets mutual NN pairs (the tightest boundary pairs), while ENN targets samples that disagree with their neighborhood (more aggressive cleaning).
Interview & Evaluation Perspective
Common Interview Questions
- ●
What is a Tomek link and how does it help with imbalanced classification?
- ●
Explain the difference between Tomek Links and Edited Nearest Neighbors (ENN) as data cleaning techniques.
- ●
Why is Tomek Links typically combined with SMOTE rather than used standalone? What does each component contribute?
- ●
You have a fraud detection dataset with 0.1% positive rate. Would you use Tomek Links? What else would you combine it with?
- ●
How does Tomek Links handle the decision boundary differently from random undersampling?
- ●
What happens if you apply Tomek Links before train-test splitting? How would you detect this mistake?
- ●
In what situations would Tomek Links NOT help improve classifier performance?
- ●
How would you scale Tomek Links to a dataset with 10 million samples?
Key Points to Mention
- ●
Tomek Links identifies mutual nearest neighbor pairs from different classes -- this mutual condition is key. Not just any cross-class neighbor pair, but pairs where each is the other's closest neighbor. This targets the most ambiguous boundary samples.
- ●
The primary value is boundary cleaning, not class balancing. Tomek Links removes only 1-5% of majority samples. Its power comes from which samples it removes, not how many.
- ●
The standard usage is SMOTETomek: SMOTE generates synthetic minority samples, then Tomek Links cleans the boundary noise that SMOTE introduces. This two-step pipeline outperforms either technique alone (Batista et al., 2004).
- ●
Data leakage prevention: Tomek Links must be applied inside the CV fold, not before splitting. The NN computation uses information from all samples, so applying it before splitting leaks validation information into the training preprocessing.
- ●
Limitations: Tomek Links is unreliable in high dimensions (curse of dimensionality makes NN relationships meaningless), requires scaled features, and does not help when class overlap is fundamental rather than noise-based.
- ●
Know the historical lineage: Hart (1968) -> Wilson (1972) -> Tomek (1976) -> Batista et al. (2004) combining with SMOTE. This shows depth of understanding.
Pitfalls to Avoid
- ●
Calling Tomek Links an 'undersampling technique' without qualifying that it is primarily a cleaning technique -- interviewers expect you to understand the distinction between reducing class imbalance and cleaning boundary noise
- ●
Not mentioning the mutual nearest neighbor condition -- simply saying 'removes closest cross-class pairs' misses the key insight that the relationship must be reciprocal
- ●
Claiming Tomek Links will balance your dataset -- it removes far too few samples for that. If the interviewer asks about handling a 99:1 imbalance ratio, Tomek Links alone is not the answer
- ●
Forgetting to mention feature scaling as a prerequisite -- unscaled features make Euclidean distance meaningless and the identified Tomek links spurious
- ●
Not discussing the curse of dimensionality limitation -- senior interviewers will probe whether you understand when nearest-neighbor-based methods fail
Senior-Level Expectation
A senior candidate should demonstrate understanding beyond the basic definition. They should articulate the mathematical definition (mutual nearest neighbor condition), explain the historical lineage (Hart's CNN -> Wilson's ENN -> Tomek's modifications -> Batista's SMOTETomek), and discuss when Tomek Links fails (high dimensions, fundamental class overlap, unscaled features). They should know the computational complexity ( with tree-based NN, brute-force) and be able to discuss scaling strategies (approximate NN with FAISS, subsampling). Senior engineers should compare Tomek Links quantitatively with ENN (1-5% removal vs. 5-20%), understand why tree-based models benefit less from boundary cleaning than linear models or SVMs, and know that the imblearn.pipeline.Pipeline (not sklearn.pipeline.Pipeline) is required for proper integration with resampling. Finally, they should be able to discuss cost-benefit analysis: for a fraud detection system at scale (millions of transactions), when is the Tomek Links computation justified versus simply tuning the classifier's decision threshold or using cost-sensitive learning?
Summary
Tomek Links is a boundary cleaning technique for imbalanced classification that identifies and removes majority-class samples forming mutual nearest-neighbor pairs with minority-class samples. Introduced by Ivan Tomek in 1976 as a modification to Hart's Condensed Nearest Neighbor rule, it has found its primary modern application as a post-processing step in the SMOTETomek pipeline -- where SMOTE first generates synthetic minority samples to address class imbalance, and Tomek Links then cleans the boundary noise that oversampling introduces. The combination, established as effective by Batista et al. (2004), consistently outperforms either technique alone on datasets with class overlap.
The technique is notable for its conservatism and precision: it removes only 1-5% of majority-class samples, targeting exclusively the most ambiguous boundary pairs while preserving all minority samples and the overall majority-class distribution. This makes it a cleaning technique rather than a balancing technique, and it should almost always be combined with oversampling when class ratio is a concern. The algorithm is deterministic, has minimal hyperparameters (just the removal strategy), and runs in time with tree-based nearest neighbor indices -- making it computationally negligible for most practical datasets.
For Indian ML teams working on fraud detection (Razorpay, PayU, CRED), medical diagnosis (AIIMS, Apollo), churn prediction (Jio, Airtel), or insurance claims analysis (Ayushman Bharat), Tomek Links is a low-cost, high-value addition to any imbalanced classification pipeline. The key implementation requirements are: (1) always apply inside a imblearn.pipeline.Pipeline with cross-validation to prevent data leakage, (2) scale features before applying Tomek Links so distances are meaningful, (3) combine with SMOTE for the standard SMOTETomek recipe, and (4) understand the limitations -- Tomek Links is unreliable in high dimensions, does not address fundamental class overlap, and provides less benefit for tree-based models that are inherently robust to boundary noise. The compute cost on Indian cloud providers is under INR 1 for most datasets, making the technique essentially free to include in any preprocessing pipeline.