What exactly is a Tomek link?

A **Tomek link** is a pair of samples $(x_i, x_j)$ from different classes that are each other's **nearest neighbors**. Formally, $x_i$ and $x_j$ form a Tomek link if: 1. $y_i \neq y_j$ (they belong to different classes) 2. There is no sample $x_k$ such that $d(x_i, x_k) < d(x_i, x_j)$ (no sample is closer to $x_i$ than $x_j$) 3. There is no sample $x_k$ such that $d(x_j, x_k) < d(x_j, x_i)$ (no sample is closer to $x_j$ than $x_i$) The key is the **mutual** condition: both samples must be each other's closest neighbor, not just one of them. This mutual requirement makes the condition restrictive and ensures that Tomek links identify only the most ambiguous boundary pairs. Intuitively, a Tomek link represents the tightest possible "handshake" between opposite classes. These pairs are either noise (one sample is mislabeled or anomalous) or they sit right on the decision boundary where the classes are hardest to separate.

How is Tomek Links different from random undersampling?

The difference is **which** samples get removed: **Random undersampling** discards majority-class samples uniformly at random. It can remove samples from anywhere in the feature space -- deep inside the majority class cluster, near the boundary, or anywhere in between. The result is a smaller majority class with the same overall shape but noisier representation. **Tomek Links** removes only the majority-class samples that are **mutual nearest neighbors** of minority-class samples -- samples right at the decision boundary. It leaves the vast majority of the majority class untouched. The practical implications are significant: | Aspect | Random Undersampling | Tomek Links | |--------|---------------------|-------------| | Samples removed | Any target count | Only 1-5% | | Selection basis | Random | Boundary proximity | | Balances classes | Yes (to any ratio) | No (minimal change) | | Preserves distribution | Approximately | Yes (except boundary) | | Deterministic | No | Yes | | Computation cost | O(1) | O(n log n) | Use random undersampling when you need aggressive rebalancing. Use Tomek Links when you need surgical boundary cleaning. In practice, SMOTE handles the rebalancing and Tomek Links handles the cleaning.

Should I use Tomek Links standalone or combined with SMOTE?

**Almost always combined with SMOTE** (the SMOTETomek pipeline). Here is why: Standalone Tomek Links removes only 1-5% of majority samples -- far too few to meaningfully address class imbalance. If your dataset has a 99:1 majority-to-minority ratio, Tomek Links might bring it to 97:1 or 96:1. The class imbalance problem remains. The standard pipeline is: 1. **SMOTE** generates synthetic minority samples, increasing the minority class to a target ratio (e.g., 30-50% of the majority) 2. **Tomek Links** cleans boundary noise introduced by SMOTE, removing majority samples that are now uncomfortably close to the synthetic minority samples Batista et al. (2004) showed that this combination outperforms either technique alone on 13 benchmark datasets. The reason is clear: SMOTE addresses the **quantity problem** (too few minority samples) while Tomek Links addresses the **quality problem** (noisy boundary). The one exception: if your dataset is already reasonably balanced (e.g., 60:40 or 70:30) but has noisy class overlap at the boundary, standalone Tomek Links can improve classifier performance without needing oversampling.

How many samples does Tomek Links typically remove?

Tomek Links removes a **surprisingly small** number of samples -- typically **1-5% of the majority class**. The exact number depends on several factors: **Class overlap**: More overlap at the boundary means more Tomek links. A dataset with well-separated classes may have very few Tomek links (0.1-0.5%), while a dataset with heavy class mixing at the boundary may have 5-10%. **Class ratio**: More extreme imbalance (e.g., 99:1) tends to produce fewer Tomek links in absolute terms because the minority class has fewer samples to form pairs with. **Dimensionality**: Higher dimensions can either increase or decrease Tomek links depending on the data structure. In very high dimensions, the curse of dimensionality makes all distances similar, which can paradoxically increase the number of identified links. **After SMOTE**: When used after SMOTE (in SMOTETomek), Tomek Links typically finds more links because SMOTE's synthetic samples create new boundary proximity. The cleanup may remove 2-8% of majority samples. For practical planning: if you have 100,000 majority samples, expect Tomek Links to remove 1,000-5,000. This is a cleaning operation, not a balancing operation. The class ratio will barely change.

Does Tomek Links work for multiclass problems?

Yes, `imbalanced-learn`'s `TomekLinks` supports multiclass problems. In the multiclass setting, Tomek links are identified between **any pair of different classes**, not just majority-minority pairs. The behavior depends on the `sampling_strategy` parameter: - **`'auto'`** (default): Resample all classes except the minority class. For a 3-class problem with 1000/500/100 samples, Tomek Links would clean samples from class 0 (1000) and class 1 (500) that form links with any other class, but leave class 2 (100) untouched. - **`'majority'`**: Only resample the single largest class. - **`dict`**: Specify exactly which classes to clean. For example, `{0: 900, 1: 450}` would clean classes 0 and 1 down to the specified counts (approximately, since Tomek Links cannot guarantee exact counts). A key nuance: in multiclass settings, Tomek links between two large classes (e.g., class 0 and class 1) may outnumber links involving the rare class. This means the cleaning may not focus on the boundary you care about most. If you specifically need to clean the boundary between class 0 and class 2, you may need a custom implementation that filters for only cross-pair links involving the target classes. For most multiclass imbalanced problems in practice, the `'auto'` strategy works well.

What distance metric should I use with Tomek Links?

The default and most common choice is **Euclidean distance**, which works well for continuous features that have been properly scaled. However, the distance metric significantly affects which Tomek links are identified: **Euclidean distance** (default): $d(x, y) = \sqrt{\sum_i (x_i - y_i)^2}$. Best for dense, continuous, scaled features. Sensitive to outliers. **Manhattan distance**: $d(x, y) = \sum_i |x_i - y_i|$. More robust to outliers than Euclidean. Good when features represent genuinely different quantities (not just scaled versions of each other). **Minkowski distance**: Generalizes both Euclidean ($p=2$) and Manhattan ($p=1$). Allows tuning the sensitivity via the $p$ parameter. **Cosine distance**: $d(x, y) = 1 - \frac{x \cdot y}{\|x\| \|y\|}$. Best for high-dimensional sparse data (TF-IDF vectors, embeddings). Ignores magnitude and focuses on direction. **Practical recommendation**: Start with Euclidean after `StandardScaler`. If your data has many outliers, try Manhattan. If your data is high-dimensional and sparse, try cosine. You can test different metrics by comparing the number of Tomek links found and the downstream classifier performance -- the metric that produces the best classifier F1-score is the right choice. In the `imbalanced-learn` implementation, the distance metric is not directly configurable in `TomekLinks`, but you can use a custom implementation (as shown in the code examples) with scikit-learn's `NearestNeighbors(metric='manhattan')` or any other supported metric.

How does Tomek Links interact with tree-based models (XGBoost, LightGBM, Random Forest)?

Tree-based models are **inherently robust** to noisy boundary samples because they learn axis-aligned, piecewise decision boundaries. A single noisy majority sample near the minority class can be isolated by a few tree splits without affecting the broader decision surface. This makes Tomek Links **less impactful** for tree-based models compared to distance-based models (SVM, k-NN, logistic regression, neural networks). Empirical evidence supports this: - For **XGBoost and LightGBM**, SMOTETomek typically improves F1-score by 1-5% over SMOTE alone. The improvement is real but modest because the trees already handle boundary noise reasonably well. - For **SVM and logistic regression**, SMOTETomek can improve F1-score by 3-10% over SMOTE alone because these models' decision boundaries are directly affected by noisy samples. - For **k-NN classifiers**, Tomek Links is most impactful because k-NN is extremely sensitive to local neighborhood composition -- removing noisy boundary samples directly improves classification of nearby samples. Does this mean you should skip Tomek Links when using XGBoost? Not necessarily. Even for tree-based models, Tomek Links: 1. Slightly reduces training set size (faster training) 2. Reduces the chance of overfitting to boundary noise 3. Can improve calibration (probability estimates) at the boundary The cost is negligible (seconds of compute), so including Tomek Links in a SMOTETomek pipeline is a reasonable default even for tree-based models.

What is the computational cost of Tomek Links, and how do I handle large datasets?

The computational bottleneck is the **nearest neighbor computation**: - **Brute-force**: $O(n^2 \cdot d)$ time and $O(n \cdot d)$ space. Computes all pairwise distances. Practical only for $n < 10,000$. - **KD-tree**: $O(n \log n \cdot d)$ average time for $d < 20$. Degrades toward $O(n^2)$ as dimensionality increases. The default in imbalanced-learn for low-dimensional data. - **Ball-tree**: Similar to KD-tree but more robust in medium dimensions ($20 < d < 100$). Slightly higher constant factors than KD-tree. Practical timings on a 4-core laptop: | n | d | Time | |---|---|------| | 10K | 20 | 0.3s | | 100K | 50 | 3-5s | | 1M | 50 | 30-60s | | 10M | 50 | 10-20 min | For datasets over 1 million samples, consider these strategies: 1. **Approximate nearest neighbors**: Use FAISS or Annoy to compute approximate NNs in $O(n \log n)$ time regardless of dimensionality. Accuracy is typically 95-99% of exact NN. 2. **Subsample-then-clean**: Apply Tomek Links to a random subsample of 500K majority samples plus all minority samples. The boundary cleaning effect generalizes well from the subsample. 3. **Batch processing**: Split the dataset geographically or temporally, apply Tomek Links to each batch, then concatenate. This does not perfectly identify all global Tomek links but catches most local ones. For Indian cloud pricing, even the 10M-sample case is affordable: ~20 minutes on an `m5.xlarge` (INR 14/hr in AWS Mumbai) costs about INR 5 (~$0.06).

Data Generation

Tomek Links in Machine Learning

Q: How is Tomek Links different from random undersampling?

The difference is **which** samples get removed: **Random undersampling** discards majority-class samples uniformly at random. It can remove samples from anywhere in the feature space -- deep inside the majority class cluster, near the boundary, or anywhere in between. The result is a smaller majority class with the same overall shape but noisier representation. **Tomek Links** removes only the majority-class samples that are **mutual nearest neighbors** of minority-class samples -- samples right at the decision boundary. It leaves the vast majority of the majority class untouched. The practical implications are significant: | Aspect | Random Undersampling | Tomek Links | |--------|---------------------|-------------| | Samples removed | Any target count | Only 1-5% | | Selection basis | Random | Boundary proximity | | Balances classes | Yes (to any ratio) | No (minimal change) | | Preserves distribution | Approximately | Yes (except boundary) | | Deterministic | No | Yes | | Computation cost | O(1) | O(n log n) | Use random undersampling when you need aggressive rebalancing. Use Tomek Links when you need surgical boundary cleaning. In practice, SMOTE handles the rebalancing and Tomek Links handles the cleaning.

Q: Should I use Tomek Links standalone or combined with SMOTE?

**Almost always combined with SMOTE** (the SMOTETomek pipeline). Here is why: Standalone Tomek Links removes only 1-5% of majority samples -- far too few to meaningfully address class imbalance. If your dataset has a 99:1 majority-to-minority ratio, Tomek Links might bring it to 97:1 or 96:1. The class imbalance problem remains. The standard pipeline is: 1. **SMOTE** generates synthetic minority samples, increasing the minority class to a target ratio (e.g., 30-50% of the majority) 2. **Tomek Links** cleans boundary noise introduced by SMOTE, removing majority samples that are now uncomfortably close to the synthetic minority samples Batista et al. (2004) showed that this combination outperforms either technique alone on 13 benchmark datasets. The reason is clear: SMOTE addresses the **quantity problem** (too few minority samples) while Tomek Links addresses the **quality problem** (noisy boundary). The one exception: if your dataset is already reasonably balanced (e.g., 60:40 or 70:30) but has noisy class overlap at the boundary, standalone Tomek Links can improve classifier performance without needing oversampling.

Q: How many samples does Tomek Links typically remove?

Tomek Links removes a **surprisingly small** number of samples -- typically **1-5% of the majority class**. The exact number depends on several factors: **Class overlap**: More overlap at the boundary means more Tomek links. A dataset with well-separated classes may have very few Tomek links (0.1-0.5%), while a dataset with heavy class mixing at the boundary may have 5-10%. **Class ratio**: More extreme imbalance (e.g., 99:1) tends to produce fewer Tomek links in absolute terms because the minority class has fewer samples to form pairs with. **Dimensionality**: Higher dimensions can either increase or decrease Tomek links depending on the data structure. In very high dimensions, the curse of dimensionality makes all distances similar, which can paradoxically increase the number of identified links. **After SMOTE**: When used after SMOTE (in SMOTETomek), Tomek Links typically finds more links because SMOTE's synthetic samples create new boundary proximity. The cleanup may remove 2-8% of majority samples. For practical planning: if you have 100,000 majority samples, expect Tomek Links to remove 1,000-5,000. This is a cleaning operation, not a balancing operation. The class ratio will barely change.

Q: Does Tomek Links work for multiclass problems?

Yes, `imbalanced-learn`'s `TomekLinks` supports multiclass problems. In the multiclass setting, Tomek links are identified between **any pair of different classes**, not just majority-minority pairs. The behavior depends on the `sampling_strategy` parameter: - **`'auto'`** (default): Resample all classes except the minority class. For a 3-class problem with 1000/500/100 samples, Tomek Links would clean samples from class 0 (1000) and class 1 (500) that form links with any other class, but leave class 2 (100) untouched. - **`'majority'`**: Only resample the single largest class. - **`dict`**: Specify exactly which classes to clean. For example, `{0: 900, 1: 450}` would clean classes 0 and 1 down to the specified counts (approximately, since Tomek Links cannot guarantee exact counts). A key nuance: in multiclass settings, Tomek links between two large classes (e.g., class 0 and class 1) may outnumber links involving the rare class. This means the cleaning may not focus on the boundary you care about most. If you specifically need to clean the boundary between class 0 and class 2, you may need a custom implementation that filters for only cross-pair links involving the target classes. For most multiclass imbalanced problems in practice, the `'auto'` strategy works well.

Q: What distance metric should I use with Tomek Links?

The default and most common choice is **Euclidean distance**, which works well for continuous features that have been properly scaled. However, the distance metric significantly affects which Tomek links are identified: **Euclidean distance** (default): $d(x, y) = \sqrt{\sum_i (x_i - y_i)^2}$. Best for dense, continuous, scaled features. Sensitive to outliers. **Manhattan distance**: $d(x, y) = \sum_i |x_i - y_i|$. More robust to outliers than Euclidean. Good when features represent genuinely different quantities (not just scaled versions of each other). **Minkowski distance**: Generalizes both Euclidean ($p=2$) and Manhattan ($p=1$). Allows tuning the sensitivity via the $p$ parameter. **Cosine distance**: $d(x, y) = 1 - \frac{x \cdot y}{\|x\| \|y\|}$. Best for high-dimensional sparse data (TF-IDF vectors, embeddings). Ignores magnitude and focuses on direction. **Practical recommendation**: Start with Euclidean after `StandardScaler`. If your data has many outliers, try Manhattan. If your data is high-dimensional and sparse, try cosine. You can test different metrics by comparing the number of Tomek links found and the downstream classifier performance -- the metric that produces the best classifier F1-score is the right choice. In the `imbalanced-learn` implementation, the distance metric is not directly configurable in `TomekLinks`, but you can use a custom implementation (as shown in the code examples) with scikit-learn's `NearestNeighbors(metric='manhattan')` or any other supported metric.

Q: How does Tomek Links interact with tree-based models (XGBoost, LightGBM, Random Forest)?

Tree-based models are **inherently robust** to noisy boundary samples because they learn axis-aligned, piecewise decision boundaries. A single noisy majority sample near the minority class can be isolated by a few tree splits without affecting the broader decision surface. This makes Tomek Links **less impactful** for tree-based models compared to distance-based models (SVM, k-NN, logistic regression, neural networks). Empirical evidence supports this: - For **XGBoost and LightGBM**, SMOTETomek typically improves F1-score by 1-5% over SMOTE alone. The improvement is real but modest because the trees already handle boundary noise reasonably well. - For **SVM and logistic regression**, SMOTETomek can improve F1-score by 3-10% over SMOTE alone because these models' decision boundaries are directly affected by noisy samples. - For **k-NN classifiers**, Tomek Links is most impactful because k-NN is extremely sensitive to local neighborhood composition -- removing noisy boundary samples directly improves classification of nearby samples. Does this mean you should skip Tomek Links when using XGBoost? Not necessarily. Even for tree-based models, Tomek Links: 1. Slightly reduces training set size (faster training) 2. Reduces the chance of overfitting to boundary noise 3. Can improve calibration (probability estimates) at the boundary The cost is negligible (seconds of compute), so including Tomek Links in a SMOTETomek pipeline is a reasonable default even for tree-based models.

Q: What is the computational cost of Tomek Links, and how do I handle large datasets?

The computational bottleneck is the **nearest neighbor computation**: - **Brute-force**: $O(n^2 \cdot d)$ time and $O(n \cdot d)$ space. Computes all pairwise distances. Practical only for $n < 10,000$. - **KD-tree**: $O(n \log n \cdot d)$ average time for $d < 20$. Degrades toward $O(n^2)$ as dimensionality increases. The default in imbalanced-learn for low-dimensional data. - **Ball-tree**: Similar to KD-tree but more robust in medium dimensions ($20 < d < 100$). Slightly higher constant factors than KD-tree. Practical timings on a 4-core laptop: | n | d | Time | |---|---|------| | 10K | 20 | 0.3s | | 100K | 50 | 3-5s | | 1M | 50 | 30-60s | | 10M | 50 | 10-20 min | For datasets over 1 million samples, consider these strategies: 1. **Approximate nearest neighbors**: Use FAISS or Annoy to compute approximate NNs in $O(n \log n)$ time regardless of dimensionality. Accuracy is typically 95-99% of exact NN. 2. **Subsample-then-clean**: Apply Tomek Links to a random subsample of 500K majority samples plus all minority samples. The boundary cleaning effect generalizes well from the subsample. 3. **Batch processing**: Split the dataset geographically or temporally, apply Tomek Links to each batch, then concatenate. This does not perfectly identify all global Tomek links but catches most local ones. For Indian cloud pricing, even the 10M-sample case is affordable: ~20 minutes on an `m5.xlarge` (INR 14/hr in AWS Mumbai) costs about INR 5 (~$0.06).

Tomek Links is one of the most elegant ideas in imbalanced classification: find pairs of samples from opposite classes that are each other's nearest neighbors, and remove the majority-class sample from each pair. The result is a cleaner decision boundary, fewer ambiguous samples, and a dataset that gives classifiers a much better chance at learning the true separation between classes.

The technique was introduced by Ivan Tomek in 1976 as a modification to Hart's Condensed Nearest Neighbor (CNN) rule. While it was originally framed as a data reduction method for nearest-neighbor classifiers, it has found its modern calling as a boundary cleaning technique for imbalanced datasets. Unlike random undersampling, which blindly discards majority-class samples, Tomek Links surgically removes only the samples that create confusion at the class boundary -- the ones most likely to cause misclassification.

What makes Tomek Links especially powerful is its role as a post-processing cleaner combined with oversampling. The famous SMOTETomek pipeline -- first generate synthetic minority samples with SMOTE, then clean the boundary with Tomek Links -- has become a standard recipe for handling class imbalance in domains ranging from credit card fraud detection at financial institutions like Razorpay and PayU, to medical diagnosis at hospitals across India, to churn prediction at telecom companies like Jio and Airtel.

This guide covers the mathematical definition, algorithmic details, implementation with imbalanced-learn, failure modes, production case studies, and the nuanced decision of when Tomek Links is the right tool versus alternatives like Edited Nearest Neighbors or NearMiss.

Concept Snapshot

What It Is: A data cleaning technique that identifies pairs of opposite-class samples that are each other's nearest neighbors (Tomek links) and removes the majority-class sample from each pair to clean the decision boundary and reduce class overlap.
Category: Data Generation / Resampling
Complexity: Intermediate
Inputs / Outputs: Input: an imbalanced dataset with features X and labels y. Output: a cleaned dataset with the same minority-class samples intact and majority-class samples that formed Tomek links removed.
System Placement: Sits in the data preprocessing stage, after feature engineering and before model training. Often used as a post-processing step after SMOTE or other oversampling techniques to clean up noisy synthetic samples near the boundary.
Also Known As: Tomek link removal, Tomek's modification of CNN, T-Link undersampling, boundary cleaning via Tomek links
Typical Users: Data Scientists, ML Engineers, Applied Researchers, Fraud Analysts, Biostatisticians
Prerequisites: Nearest neighbor algorithms (k-NN), Class imbalance concepts (majority/minority class), Distance metrics (Euclidean, Manhattan), Basic understanding of decision boundaries, SMOTE (for combined approaches)
Key Terms: Tomek linknearest neighbor pairboundary cleaningclass overlapmajority-class removalcondensed nearest neighborSMOTETomekprototype selectionimbalanced-learnundersampling

Why This Concept Exists

The Problem: Noisy Class Boundaries in Imbalanced Data

Imagine you are building a fraud detection model for a payment gateway like Razorpay. Out of 1 million transactions, only 2,000 are fraudulent -- a 0.2% positive rate. The challenge is not just that the minority class is rare; it is that the fraudulent transactions often look similar to legitimate ones. Some legitimate transactions sit right next to fraudulent ones in feature space, creating a fuzzy, overlapping boundary that classifiers struggle to learn.

Random undersampling addresses the class ratio by discarding majority-class samples, but it does so blindly -- it might remove informative samples far from the boundary while preserving confusing ones right at the edge. What you really want is a surgeon's scalpel, not a machete: remove only the majority-class samples that are actively confusing the classifier.

That is exactly what Tomek Links does. It identifies pairs of opposite-class samples that are each other's closest neighbors -- the most ambiguous, boundary-straddling pairs in the dataset -- and removes the majority-class member of each pair. The minority-class samples are left untouched (you cannot afford to lose any of them), and the boundary becomes cleaner.

Historical Context: From CNN to Tomek Links

The story begins with Peter Hart's Condensed Nearest Neighbor (CNN) rule in 1968. Hart proposed a method to reduce the size of the training set for a k-NN classifier while preserving classification accuracy. The idea was to keep only the samples necessary for correct classification -- essentially, the "important" samples near the decision boundary.

In 1972, Dennis Wilson introduced the Edited Nearest Neighbor (ENN) rule, which worked in the opposite direction: instead of keeping boundary samples, it removed samples that were misclassified by their neighbors, essentially cleaning noise from the dataset.

Ivan Tomek in 1976 proposed two modifications to Hart's CNN rule in his paper "Two Modifications of CNN" published in IEEE Transactions on Systems, Man, and Communications. The first modification introduced what we now call Tomek Links: if two samples from different classes are each other's nearest neighbors, they form a Tomek link. Tomek argued that at least one of these samples is either noise or sits on the boundary, and removing it (or both) would improve the CNN rule's efficiency.

Modern Renaissance: Tomek Links for Class Imbalance

Tomek Links languished in relative obscurity until the early 2000s, when the machine learning community became seriously concerned about class imbalance. The landmark paper by Batista, Prati, and Monard (2004) -- "A Study of the Behavior of Several Methods for Balancing Machine Learning Training Data" -- studied combining SMOTE with Tomek Links as a two-step resampling strategy. They showed that SMOTE+Tomek outperformed either technique alone on several benchmark datasets, establishing the SMOTETomek pipeline that remains popular today.

The inclusion of TomekLinks in the imbalanced-learn library (scikit-learn-contrib) cemented its place as a standard tool. Today, Tomek Links is one of the most commonly cited boundary cleaning methods in the imbalanced learning literature, used across fraud detection, medical diagnosis, network intrusion detection, and credit scoring.

Key Insight: Tomek Links is not primarily an undersampling technique -- it is a data cleaning technique. Its goal is not to balance the class distribution (it removes relatively few samples) but to sharpen the decision boundary by removing the most ambiguous majority-class samples.

Core Intuition & Mental Model

The Analogy: Sorting Out Disputed Territory

Imagine two neighboring countries that share a long, winding border. Most of the border is clearly defined -- there are mountains, rivers, and fences that make it obvious which country you are in. But at certain points, the border becomes fuzzy: towns that straddle the line, roads that weave back and forth, residents who could plausibly belong to either side. These disputed border towns are the source of most confusion and conflict.

Tomek Links identifies exactly these disputed border towns. Two samples form a Tomek link when they are from different classes and are each other's closest neighbor -- they are as close as two opposite-class samples can possibly be. In the disputed-territory analogy, they are the two houses on opposite sides of the border that are closer to each other than to any house in their own country.

The cleaning step removes the majority-class house from each disputed pair. Why the majority class? Because you have plenty of majority-class houses (you can afford to lose a few), but the minority-class houses are rare and every one matters. After cleaning, the border becomes sharper: the remaining majority-class samples are further from the minority-class region, making it easier for any classifier to draw a clean line between the two.

What Makes a Tomek Link Special?

Not every pair of opposite-class neighbors forms a Tomek link. The requirement is mutual nearest neighborhood: sample $a$ must be the nearest neighbor of sample $b$ , AND sample $b$ must be the nearest neighbor of sample $a$ . This mutual requirement is restrictive -- most cross-class neighbor pairs are one-directional (A's nearest neighbor is B, but B's nearest neighbor is some other sample from B's own class). Only the truly ambiguous, boundary-straddling pairs satisfy the mutual condition.

This is why Tomek Links is considered a conservative cleaning method: it identifies and removes only a small number of samples -- the most egregiously ambiguous ones. It does not dramatically change the class distribution. If your dataset has 100,000 majority samples and 1,000 minority samples, Tomek Links might remove only 200-500 majority samples. The class ratio barely changes, but the boundary quality improves significantly.

The Combined Power: SMOTE + Tomek Links

The most common use of Tomek Links is not standalone but as a post-processing step after SMOTE. Here is the intuition: SMOTE generates synthetic minority samples by interpolating between existing minority samples. But some of those synthetic samples may land in regions dominated by the majority class, creating new noise at the boundary. Tomek Links then sweeps through and removes the majority-class samples that are now uncomfortably close to these synthetic minority samples, cleaning up the boundary that SMOTE may have muddied.

Think of it as a two-phase construction project: SMOTE builds new houses (synthetic minority samples) in the minority neighborhood, and Tomek Links demolishes the majority-class houses that are too close to the new construction, creating a clear buffer zone along the border.

Technical Foundations

Definition of a Tomek Link

Given a dataset $D = \{(x_i, y_i)\}_{i=1}^{n}$ where $x_i \in \mathbb{R}^d$ and $y_i \in \{0, 1\}$ (binary classification with 0 as majority and 1 as minority), two samples $(x_i, y_i)$ and $(x_j, y_j)$ form a Tomek link if and only if:

$y_i \neq y_j$ (they belong to different classes)
$d(x_i, x_j) = \min_{k \neq i} d(x_i, x_k)$ (i.e., $x_j$ is the nearest neighbor of $x_i$ )
$d(x_j, x_i) = \min_{k \neq j} d(x_j, x_k)$ (i.e., $x_i$ is the nearest neighbor of $x_j$ )

where $d(\cdot, \cdot)$ is a distance metric, typically Euclidean:

$d(x_i, x_j) = \sqrt{\sum_{k=1}^{d} (x_{i,k} - x_{j,k})^2}$

More concisely: $(x_i, x_j)$ is a Tomek link if they are mutual nearest neighbors and belong to different classes.

Removal Strategy

Once all Tomek links are identified, the cleaning strategy determines which samples to remove:

Majority-only removal (default for imbalanced learning): Remove only the majority-class sample from each Tomek link pair. This preserves all minority-class samples while cleaning the boundary. If $(x_i, y_i=0)$ and $(x_j, y_j=1)$ form a Tomek link, remove $x_i$ .
Both-class removal (original Tomek proposal): Remove both samples from each Tomek link. This is more aggressive and used when data cleaning is the primary goal (not class balancing).

Algorithm

The Tomek Links algorithm has the following steps:

For every sample $x_i \in D$ , compute its nearest neighbor $\text{NN}(x_i) = \arg\min_{j \neq i} d(x_i, x_j)$
For each pair $(x_i, \text{NN}(x_i))$ $(x_{i}, NN (x_{i}))$ where $y_i \neq y_{\text{NN}(x_i)}$ $y_{i} \neq = y_{NN (x_{i})}$ :
- Check if $\text{NN}(\text{NN}(x_i)) = x_i$ (mutual nearest neighbor condition)
- If yes, mark this pair as a Tomek link
Remove marked samples according to the chosen strategy

Computational Complexity

The dominant cost is computing nearest neighbors for all $n$ samples:

Brute-force: $O(n^2 d)$ time, $O(nd)$ space
KD-tree: $O(n \log n \cdot d)$ average time for low-dimensional data ( $d < 20$ ), degrades to $O(n^2 d)$ for high-dimensional data
Ball-tree: $O(n \log n \cdot d)$ average time, more robust than KD-tree in higher dimensions

After nearest neighbors are computed, identifying Tomek links is $O(n)$ -- just a linear scan checking the mutual NN condition.

Properties

Conservative: Tomek Links removes relatively few samples. The number of Tomek links is bounded by $\min(|C_0|, |C_1|)$ where $|C_0|$ and $|C_1|$ are the majority and minority class sizes, but in practice the number is much smaller.
Deterministic: Given a fixed dataset and distance metric, the set of Tomek links is unique. There is no randomness in the algorithm (unlike random undersampling or SMOTE).
Boundary-focused: By construction, Tomek links exist only at the class boundary. Interior points (deep inside their class cluster) can never form Tomek links because their nearest neighbor will always be from their own class.
Distance-metric dependent: The set of identified Tomek links changes with the choice of distance metric. Euclidean distance is standard, but Manhattan or Mahalanobis distance may identify different boundary-ambiguous pairs.

Internal Architecture

The Tomek Links architecture is conceptually simple -- a nearest-neighbor computation followed by a pair-identification step and a removal step -- but the implementation details matter for performance at scale.

Tomek Links in ML Systems Architecture — A vertical flow from 'Imbalanced Dataset (X, y)' to 'Nearest Neighbor Index (KD-Tree / Ball-Tree ...

When combined with SMOTE in the SMOTETomek pipeline, the architecture becomes a two-stage process: SMOTE first generates synthetic minority samples and augments the dataset, then Tomek Links runs on the augmented dataset to identify and remove newly created Tomek links (typically between synthetic minority samples and nearby majority samples).

Key Components

Nearest Neighbor Index

Builds a spatial index (KD-tree, ball-tree, or brute-force distance matrix) over all samples in the dataset. This is the most computationally expensive component and determines the overall scalability of the algorithm. The imbalanced-learn implementation delegates to scikit-learn's NearestNeighbors class, which automatically selects the best algorithm based on dataset size and dimensionality.

Tomek Link Identifier

Iterates over all samples, checks the mutual nearest-neighbor condition for cross-class pairs, and marks samples that form Tomek links. This is a linear-time operation after the NN index is built. The identifier maintains a set of sample indices to be removed, handling edge cases where a single sample participates in multiple Tomek link relationships (rare but possible in multiclass settings).

Removal Engine

Removes the marked samples from the dataset according to the chosen strategy ('majority' to remove only majority-class samples, 'all' to remove both members of each Tomek link, or 'auto' which defaults to majority-only). Returns the cleaned feature matrix and label vector.

Pipeline Integration Layer

When used inside imblearn.pipeline.Pipeline or as part of SMOTETomek, this component ensures that Tomek Links runs on the correct version of the data (post-SMOTE augmented data, not the original data) and that the cleaning step respects the pipeline's fit_resample contract. It also handles the sampling_strategy parameter for multiclass settings.

Data Flow

Step 1 -- Index Construction: The full dataset (features and labels) is loaded. A spatial index (KD-tree for $d < 20$ , ball-tree for $20 < d < 100$ , brute-force for $d > 100$ or small $n$ ) is built over all feature vectors. This step is $O(n \log n)$ for tree-based indices.

Step 2 -- NN Computation: For each sample $x_i$ , query the index for its single nearest neighbor. Store the result as an array nn_indices of length $n$ . This step is $O(n \log n)$ for tree-based indices or $O(n^2)$ for brute-force.

Step 3 -- Link Identification: Scan nn_indices for mutual cross-class pairs: if nn_indices[i] == j and nn_indices[j] == i and y[i] != y[j], then $(i, j)$ is a Tomek link. Collect the indices to be removed based on the removal strategy. This step is $O(n)$ .

Step 4 -- Removal: Create a boolean mask over the dataset, setting False for indices to be removed. Apply the mask to produce the cleaned dataset X_clean, y_clean.

Step 5 (if combined): In SMOTETomek, Step 1-4 runs on the SMOTE-augmented dataset, so the Tomek link identification considers synthetic samples alongside original ones. Synthetic minority samples that ended up too close to majority samples are effectively defended by removing those majority samples.

A vertical flow from 'Imbalanced Dataset (X, y)' to 'Nearest Neighbor Index (KD-Tree / Ball-Tree / Brute)', then to 'Compute NN for all samples', then to 'Identify Tomek Link Pairs (Mutual NN + Different Class)', then to a decision node for 'Removal Strategy' that branches into 'Remove majority-class member' or 'Remove both members', both leading to 'Cleaned Dataset', which feeds into 'Downstream: Model Training or SMOTE + Tomek Pipeline'.

How to Implement

Implementation Approaches

Tomek Links implementation ranges from a one-line call with imbalanced-learn to a custom NumPy implementation for fine-grained control or edge cases:

Option A: imbalanced-learn TomekLinks -- the standard approach used by most practitioners. Handles multi-class, integrates with scikit-learn pipelines, and uses efficient NN algorithms internally. This is the right starting point for 95% of use cases.

Option B: imbalanced-learn SMOTETomek -- the combined pipeline that runs SMOTE followed by Tomek Links removal in a single fit_resample call. The most common production usage pattern.

Option C: Custom implementation -- useful when you need non-standard distance metrics, want to remove Tomek links iteratively (recomputing NNs after each removal round), or need to integrate with a non-scikit-learn pipeline (PyTorch, TensorFlow data loaders).

Cost Note: Tomek Links is primarily CPU-bound due to the NN computation. For a dataset with 100,000 samples and 50 features, the brute-force approach takes ~5 seconds on a modern laptop. KD-tree reduces this to ~1 second. For 1 million samples, KD-tree takes ~15-30 seconds, and brute-force becomes impractical (~500 seconds). On cloud instances (AWS m5.xlarge in Mumbai at ~INR 14/hr or $0.17/hr, or E2E Networks `C2.8xlarge` at ~INR 11/hr), even million-sample datasets are cheap to process -- under INR 1 (~$ 0.01) per run.

Basic Tomek Links with imbalanced-learn31 lines

from collections import Counter
from sklearn.datasets import make_classification
from imblearn.under_sampling import TomekLinks
import numpy as np

# Create an imbalanced dataset: 5% minority class
X, y = make_classification(
    n_samples=10000,
    n_features=20,
    n_informative=10,
    n_redundant=5,
    n_clusters_per_class=2,
    weights=[0.95, 0.05],
    flip_y=0.02,
    random_state=42,
)
print(f"Original distribution: {Counter(y)}")
# Output: Counter({0: 9500, 1: 500})

# Apply Tomek Links: remove majority-class samples forming Tomek links
tomek = TomekLinks(sampling_strategy='majority')
X_clean, y_clean = tomek.fit_resample(X, y)
print(f"After Tomek Links:    {Counter(y_clean)}")
# Output: Counter({0: ~9350, 1: 500})  -- removed ~150 boundary majority samples

# Check how many samples were removed
removed = len(y) - len(y_clean)
print(f"Removed {removed} majority-class samples ({removed/Counter(y)[0]*100:.1f}%)")

# Verify: all minority samples are preserved
assert Counter(y_clean)[1] == Counter(y)[1], "Minority samples must be preserved!"

The basic usage is straightforward: create a TomekLinks instance with sampling_strategy='majority' (default behavior), call fit_resample(X, y), and receive the cleaned dataset. Notice that Tomek Links removes a relatively small percentage of majority samples (typically 1-5%) -- it is a cleaning technique, not an aggressive balancing technique. All minority-class samples are always preserved.

SMOTETomek: Combined oversampling + boundary cleaning53 lines

from collections import Counter
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.ensemble import GradientBoostingClassifier
from imblearn.combine import SMOTETomek
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import TomekLinks
from imblearn.pipeline import Pipeline as ImbPipeline
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import f1_score, classification_report
import numpy as np

# Create imbalanced dataset
X, y = make_classification(
    n_samples=20000,
    n_features=30,
    n_informative=15,
    weights=[0.97, 0.03],
    flip_y=0.01,
    random_state=42,
)
print(f"Original: {Counter(y)}")

# Method 1: SMOTETomek convenience class
smt = SMOTETomek(
    smote=SMOTE(sampling_strategy=0.5, k_neighbors=5, random_state=42),
    tomek=TomekLinks(sampling_strategy='majority'),
    random_state=42,
)
X_resampled, y_resampled = smt.fit_resample(X, y)
print(f"After SMOTETomek: {Counter(y_resampled)}")

# Method 2: Explicit pipeline (more control)
pipeline = ImbPipeline([
    ('scaler', StandardScaler()),
    ('smote', SMOTE(sampling_strategy=0.5, k_neighbors=5, random_state=42)),
    ('tomek', TomekLinks(sampling_strategy='majority')),
    ('model', GradientBoostingClassifier(n_estimators=200, random_state=42)),
])

# Cross-validate with resampling inside the fold (no data leakage)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(pipeline, X, y, cv=cv, scoring='f1', n_jobs=-1)
print(f"\n5-Fold F1 (SMOTETomek + GBM): {scores.mean():.4f} (+/- {scores.std():.4f})")

# Compare with SMOTE-only pipeline
pipeline_smote_only = ImbPipeline([
    ('scaler', StandardScaler()),
    ('smote', SMOTE(sampling_strategy=0.5, k_neighbors=5, random_state=42)),
    ('model', GradientBoostingClassifier(n_estimators=200, random_state=42)),
])
scores_smote = cross_val_score(pipeline_smote_only, X, y, cv=cv, scoring='f1', n_jobs=-1)
print(f"5-Fold F1 (SMOTE-only + GBM): {scores_smote.mean():.4f} (+/- {scores_smote.std():.4f})")

This example shows the two main ways to use SMOTETomek: the convenience class SMOTETomek for quick resampling, and an explicit imblearn.pipeline.Pipeline for production use with cross-validation. The pipeline approach is critical because it ensures resampling happens inside each CV fold, preventing data leakage. Note the sampling_strategy=0.5 for SMOTE, meaning the minority class is oversampled to 50% of the majority class size -- not full balance, which often leads to overfitting. Tomek Links then cleans up any boundary noise introduced by SMOTE.

Custom Tomek Links implementation from scratch89 lines

import numpy as np
from sklearn.neighbors import NearestNeighbors
from collections import Counter
from typing import Tuple


def find_tomek_links(
    X: np.ndarray,
    y: np.ndarray,
    metric: str = 'euclidean',
    n_jobs: int = -1,
) -> np.ndarray:
    """Find indices of samples that form Tomek links.

    Returns array of indices to REMOVE (majority-class members of each Tomek link).
    """
    # Step 1: Build nearest neighbor index
    nn = NearestNeighbors(n_neighbors=2, metric=metric, n_jobs=n_jobs)
    nn.fit(X)

    # Step 2: Query for nearest neighbor of each sample
    # n_neighbors=2 because the first neighbor is the sample itself
    distances, indices = nn.kneighbors(X)
    nn_indices = indices[:, 1]  # Index of nearest neighbor (excluding self)

    # Step 3: Identify Tomek links
    tomek_link_indices = []
    n_samples = len(y)
    majority_class = Counter(y).most_common(1)[0][0]

    for i in range(n_samples):
        j = nn_indices[i]
        # Check mutual nearest neighbor condition AND different classes
        if nn_indices[j] == i and y[i] != y[j]:
            # Remove the majority-class sample
            if y[i] == majority_class:
                tomek_link_indices.append(i)
            else:
                tomek_link_indices.append(j)

    # Deduplicate (a majority sample might be identified from both sides)
    return np.unique(tomek_link_indices)


def apply_tomek_links(
    X: np.ndarray,
    y: np.ndarray,
    metric: str = 'euclidean',
    remove_both: bool = False,
) -> Tuple[np.ndarray, np.ndarray]:
    """Apply Tomek Links cleaning to a dataset."""
    nn = NearestNeighbors(n_neighbors=2, metric=metric, n_jobs=-1)
    nn.fit(X)
    _, indices = nn.kneighbors(X)
    nn_indices = indices[:, 1]

    majority_class = Counter(y).most_common(1)[0][0]
    to_remove = set()

    for i in range(len(y)):
        j = nn_indices[i]
        if nn_indices[j] == i and y[i] != y[j]:
            if remove_both:
                to_remove.add(i)
                to_remove.add(j)
            else:
                # Remove only majority-class member
                to_remove.add(i if y[i] == majority_class else j)

    mask = np.ones(len(y), dtype=bool)
    mask[list(to_remove)] = False

    return X[mask], y[mask]


# Example usage
from sklearn.datasets import make_classification

X, y = make_classification(
    n_samples=5000, weights=[0.95, 0.05], random_state=42
)
print(f"Before: {Counter(y)}")

X_clean, y_clean = apply_tomek_links(X, y)
print(f"After:  {Counter(y_clean)}")

# With custom distance metric
X_manhattan, y_manhattan = apply_tomek_links(X, y, metric='manhattan')
print(f"Manhattan distance: {Counter(y_manhattan)}")

This custom implementation exposes the internals of the algorithm. The key insight is in the nn_indices[j] == i check -- this is the mutual nearest neighbor condition that defines a Tomek link. The implementation uses scikit-learn's NearestNeighbors for efficient NN computation but handles the link identification and removal manually. This gives you full control over the distance metric, the removal strategy (remove_both parameter), and allows easy extension to iterative Tomek link removal (recomputing NNs after each round of cleaning).

Multiclass Tomek Links and visualization79 lines

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.neighbors import NearestNeighbors
from imblearn.under_sampling import TomekLinks
from collections import Counter

# Create a 2D dataset for visualization
np.random.seed(42)
n_majority = 500
n_minority = 50

# Majority class: two clusters
X_maj = np.vstack([
    np.random.randn(n_majority // 2, 2) + [2, 2],
    np.random.randn(n_majority // 2, 2) + [-2, -2],
])
y_maj = np.zeros(n_majority, dtype=int)

# Minority class: overlapping with majority boundary
X_min = np.random.randn(n_minority, 2) * 0.8 + [0, 0]
y_min = np.ones(n_minority, dtype=int)

X = np.vstack([X_maj, X_min])
y = np.concatenate([y_maj, y_min])

print(f"Original: {Counter(y)}")

# Find Tomek links manually for visualization
nn = NearestNeighbors(n_neighbors=2).fit(X)
_, indices = nn.kneighbors(X)
nn_idx = indices[:, 1]

tomek_pairs = []
for i in range(len(y)):
    j = nn_idx[i]
    if nn_idx[j] == i and y[i] != y[j]:
        tomek_pairs.append((i, j))

# Remove duplicate pairs
seen = set()
unique_pairs = []
for i, j in tomek_pairs:
    pair = (min(i, j), max(i, j))
    if pair not in seen:
        seen.add(pair)
        unique_pairs.append(pair)

print(f"Found {len(unique_pairs)} Tomek link pairs")

# Visualize
fig, axes = plt.subplots(1, 2, figsize=(14, 6))

# Before cleaning
ax = axes[0]
ax.scatter(X[y == 0, 0], X[y == 0, 1], c='blue', alpha=0.3, s=20, label='Majority')
ax.scatter(X[y == 1, 0], X[y == 1, 1], c='red', alpha=0.7, s=40, label='Minority')
for i, j in unique_pairs:
    ax.plot([X[i, 0], X[j, 0]], [X[i, 1], X[j, 1]], 'k-', alpha=0.5, linewidth=1)
    ax.scatter(X[i, 0], X[i, 1], c='orange', s=80, zorder=5, edgecolors='black')
    ax.scatter(X[j, 0], X[j, 1], c='orange', s=80, zorder=5, edgecolors='black')
ax.set_title(f'Before: {len(X)} samples, {len(unique_pairs)} Tomek links')
ax.legend()

# After cleaning
tomek = TomekLinks(sampling_strategy='majority')
X_clean, y_clean = tomek.fit_resample(X, y)

ax = axes[1]
ax.scatter(X_clean[y_clean == 0, 0], X_clean[y_clean == 0, 1],
           c='blue', alpha=0.3, s=20, label='Majority')
ax.scatter(X_clean[y_clean == 1, 0], X_clean[y_clean == 1, 1],
           c='red', alpha=0.7, s=40, label='Minority')
ax.set_title(f'After Tomek Links: {len(X_clean)} samples')
ax.legend()

plt.tight_layout()
plt.savefig('tomek_links_visualization.png', dpi=150)
plt.show()

This visualization example creates a 2D dataset where you can see the Tomek links as lines connecting cross-class mutual nearest neighbors. The before plot shows these links as black lines with orange-highlighted endpoints -- these are the ambiguous boundary pairs. The after plot shows the cleaned dataset with majority-class members of Tomek links removed. This visualization is extremely useful for building intuition about what Tomek Links actually does and for presenting results to stakeholders who may not be familiar with the algorithm.

Production pipeline: Fraud detection with SMOTETomek98 lines

import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import (
    make_scorer, f1_score, precision_score, recall_score,
    average_precision_score, roc_auc_score,
)
from imblearn.pipeline import Pipeline as ImbPipeline
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import TomekLinks
from imblearn.combine import SMOTETomek
import joblib
import json
from datetime import datetime


def build_fraud_detection_pipeline(
    smote_ratio: float = 0.3,
    smote_k: int = 5,
    n_estimators: int = 300,
    max_depth: int = 6,
    learning_rate: float = 0.05,
) -> ImbPipeline:
    """Build a production fraud detection pipeline with SMOTETomek."""
    return ImbPipeline([
        ('scaler', StandardScaler()),
        ('smote', SMOTE(
            sampling_strategy=smote_ratio,
            k_neighbors=smote_k,
            random_state=42,
        )),
        ('tomek', TomekLinks(sampling_strategy='majority')),
        ('model', GradientBoostingClassifier(
            n_estimators=n_estimators,
            max_depth=max_depth,
            learning_rate=learning_rate,
            subsample=0.8,
            random_state=42,
        )),
    ])


def evaluate_pipeline(pipeline, X, y, n_splits=5):
    """Evaluate with stratified CV and multiple metrics."""
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=42)

    scoring = {
        'f1': make_scorer(f1_score),
        'precision': make_scorer(precision_score, zero_division=0),
        'recall': make_scorer(recall_score),
        'roc_auc': 'roc_auc',
        'avg_precision': make_scorer(average_precision_score, needs_proba=True),
    }

    results = cross_validate(
        pipeline, X, y,
        cv=cv,
        scoring=scoring,
        return_train_score=True,
        n_jobs=-1,
    )

    report = {}
    for metric in scoring:
        train_key = f'train_{metric}'
        test_key = f'test_{metric}'
        report[metric] = {
            'train_mean': float(np.mean(results[train_key])),
            'test_mean': float(np.mean(results[test_key])),
            'test_std': float(np.std(results[test_key])),
            'test_per_fold': [float(s) for s in results[test_key]],
        }
    return report


# Example usage
from sklearn.datasets import make_classification

X, y = make_classification(
    n_samples=50000,
    n_features=30,
    n_informative=20,
    weights=[0.99, 0.01],  # 1% fraud rate
    random_state=42,
)

pipeline = build_fraud_detection_pipeline(smote_ratio=0.3)
results = evaluate_pipeline(pipeline, X, y)

print("=" * 60)
print("Fraud Detection Pipeline: SMOTETomek + GBM")
print("=" * 60)
for metric, values in results.items():
    gap = values['train_mean'] - values['test_mean']
    print(f"{metric:>15}: test={values['test_mean']:.4f} +/- {values['test_std']:.4f}  "
          f"(train={values['train_mean']:.4f}, gap={gap:.4f})")

This production-grade example demonstrates how Tomek Links fits into a real fraud detection pipeline. Key design decisions: (1) smote_ratio=0.3 means the minority class is oversampled to 30% of the majority, not full balance -- this avoids overfitting to synthetic samples. (2) Tomek Links runs after SMOTE to clean the boundary. (3) The pipeline is evaluated inside stratified CV to prevent leakage. (4) Multiple metrics are tracked including average precision (PR-AUC), which is more informative than ROC-AUC for extreme imbalance. (5) The train-test gap is computed for each metric to monitor overfitting. This pattern is directly applicable to fraud detection at fintech companies like Razorpay, PayU, and PhonePe.

Configuration Example42 lines

# Tomek Links configuration (YAML equivalent)
tomek_links:
  sampling_strategy: 'majority'   # 'majority' | 'all' | 'auto' | dict
  n_jobs: -1                       # Parallel NN computation

# SMOTETomek combined configuration
smote_tomek:
  smote:
    sampling_strategy: 0.3         # Oversample minority to 30% of majority
    k_neighbors: 5                 # k for SMOTE interpolation
    random_state: 42
  tomek:
    sampling_strategy: 'majority'  # Clean only majority-class links

# Pipeline configuration
pipeline:
  preprocessing:
    - type: StandardScaler
  resampling:
    - type: SMOTE
      sampling_strategy: 0.3
      k_neighbors: 5
    - type: TomekLinks
      sampling_strategy: 'majority'
  model:
    type: GradientBoostingClassifier
    params:
      n_estimators: 300
      max_depth: 6
      learning_rate: 0.05
      subsample: 0.8

# Evaluation
evaluation:
  cv_strategy: stratified_kfold
  n_splits: 5
  primary_metric: f1
  additional_metrics:
    - precision
    - recall
    - roc_auc
    - average_precision

Common Implementation Mistakes

●
Applying Tomek Links before train-test split: Running TomekLinks.fit_resample() on the entire dataset before splitting causes data leakage -- the NN computation uses validation/test samples to identify which training samples to remove. Always apply Tomek Links inside a pipeline with cross-validation, using imblearn.pipeline.Pipeline.
●
Expecting Tomek Links to balance the dataset: Tomek Links is a cleaning technique, not a balancing technique. It removes only a small percentage (typically 1-5%) of majority samples. If you need substantial rebalancing, combine it with SMOTE or use random undersampling.
●
Using Tomek Links on high-dimensional sparse data: In very high dimensions (e.g., TF-IDF vectors with 50,000+ features), nearest neighbor computations become unreliable due to the curse of dimensionality. Distance metrics lose discriminative power, and the identified Tomek links may not represent true boundary pairs. Apply dimensionality reduction (PCA, truncated SVD) first.
●
Ignoring the distance metric choice: The default Euclidean distance assumes all features have comparable scales. If features are on different scales (e.g., age in years and income in lakhs), Tomek Links will be dominated by the high-magnitude feature. Always scale features before applying Tomek Links, or use a pipeline with StandardScaler.
●
Running Tomek Links iteratively without convergence check: Some practitioners run Tomek Links repeatedly until no more links are found. This can over-clean the dataset, removing informative boundary samples. One pass is usually sufficient; if you need more aggressive cleaning, consider Edited Nearest Neighbors instead.
●
Applying to multiclass problems without understanding the behavior: In multiclass settings, Tomek Links identifies links between any pair of different classes, which may not target the specific class boundary you care about. Use the sampling_strategy parameter to restrict which classes are cleaned.

When Should You Use This?

Use When

Your dataset has class overlap at the decision boundary -- majority-class samples are interspersed with minority-class samples, creating ambiguity that hurts classifier performance
You are already using SMOTE or another oversampling method and want to clean up the boundary noise that oversampling introduces -- the classic SMOTETomek pipeline
You need a deterministic, conservative cleaning method that removes only the most ambiguous samples, preserving the overall data distribution and avoiding the randomness of random undersampling
Your minority class is very small (hundreds of samples) and you cannot afford to lose any minority samples -- Tomek Links by default only removes majority-class samples
You are working with tabular data of moderate dimensionality ( $d < 50$ ) where nearest-neighbor distances are meaningful and the curse of dimensionality is not a concern
You need a preprocessing step that is easy to explain to stakeholders, regulators, or auditors -- the concept of removing ambiguous boundary pairs is intuitive and defensible

Avoid When

Your dataset is severely imbalanced (e.g., 99.9% majority) and you need significant rebalancing -- Tomek Links removes too few samples to meaningfully change the class ratio; use random undersampling, NearMiss, or cluster centroids instead
Your data is very high-dimensional (thousands of features) -- nearest neighbor distances become unreliable in high dimensions, and the identified Tomek links may not correspond to true boundary ambiguity
You have a very large dataset (millions of samples) and the $O(n \log n)$ or $O(n^2)$ nearest neighbor computation is too expensive -- consider random undersampling or feature-space approaches instead
The class boundary is inherently noisy and removing boundary samples would discard genuinely informative data -- some domains (medical diagnosis, anomaly detection) have fundamentally overlapping class distributions that should not be cleaned away
You are working with non-numeric data (categorical features, text, images in raw form) where Euclidean nearest-neighbor distances are not meaningful -- preprocess into a numeric embedding first or use domain-specific cleaning methods
Your evaluation metric is not sensitive to boundary quality -- if you are optimizing for recall at a fixed low threshold, Tomek Links' boundary cleaning may not translate into meaningful metric improvement

Key Tradeoffs

Cleaning Aggressiveness vs. Information Preservation

Tomek Links sits at the conservative end of the undersampling spectrum. Here is how it compares:

Technique	Samples Removed	Aggressiveness	Boundary Focus	Deterministic
Tomek Links	1-5% of majority	Very Low	Yes (only mutual NN pairs)	Yes
Edited NN (ENN)	5-20% of majority	Low-Medium	Yes (k-NN misclassified)	Yes
NearMiss	Up to 50%+ of majority	High	Yes (distance-based selection)	Yes
Random Undersampling	Variable (to target ratio)	Variable	No (random)	No
Cluster Centroids	Replaces all majority with centroids	High	No (cluster-based)	Yes

The tradeoff is clear: Tomek Links preserves the most information but provides the least rebalancing. If your problem primarily suffers from boundary noise rather than class ratio imbalance, Tomek Links (or SMOTETomek) is the right choice. If the class ratio itself is the bottleneck, you need a more aggressive approach.

Standalone vs. Combined Usage

Using Tomek Links standalone is rarely the best strategy. Its strength is as a post-processing cleaner in combination with an oversampling method:

SMOTETomek: The gold standard. SMOTE adds synthetic minority samples, Tomek Links cleans the boundary. Best for moderate imbalance (1-10% minority).
SMOTE + ENN (SMOTEENN): More aggressive cleaning than SMOTETomek. ENN removes all majority samples misclassified by k-NN (not just mutual NN pairs). Better when there is substantial class overlap.

For Indian ML teams working on cost-constrained projects, the compute overhead of Tomek Links is negligible compared to model training. A single Tomek Links pass on 100K samples costs under INR 0.10 (~$0.001) in cloud compute. The question is purely about data quality, not cost.

When Does the Extra Cleaning Actually Help?

Empirical studies (Batista et al., 2004) show that SMOTETomek outperforms SMOTE alone on datasets with moderate class overlap but shows minimal improvement on datasets with well-separated classes. If a simple decision tree can already separate the classes with >95% accuracy on the majority class, Tomek Links cleaning will not add much. Its value scales with the degree of boundary ambiguity in your specific dataset.

Alternatives & Comparisons

Edited Nearest Neighbors (ENN)

ENN removes samples where the majority of k nearest neighbors belong to a different class, making it more aggressive than Tomek Links (which requires mutual nearest neighbor status). ENN typically removes 5-20% of majority samples vs. Tomek Links' 1-5%. Choose ENN when you have significant class overlap and need stronger cleaning; choose Tomek Links when you want minimal, surgical removal of only the most ambiguous pairs.

Random Undersampler

Random undersampling discards majority-class samples uniformly at random until a target ratio is achieved. It is much faster (no NN computation), can achieve any target ratio, and is non-deterministic. However, it does not distinguish between informative and noisy samples. Choose random undersampling when you need aggressive rebalancing and speed; choose Tomek Links when boundary quality matters more than class ratio.

SMOTE (Synthetic Minority Oversampling)

SMOTE generates synthetic minority samples by interpolating between existing minority samples. It is an oversampling technique (adds samples) while Tomek Links is an undersampling/cleaning technique (removes samples). They are complementary: SMOTE addresses class ratio imbalance, Tomek Links addresses boundary noise. The standard combination is SMOTETomek, where SMOTE runs first and Tomek Links cleans the result.

SMOTETomek (Combined Pipeline)

SMOTETomek is the combined pipeline that runs SMOTE followed by Tomek Links in a single step. It is not an alternative to Tomek Links but rather its most common usage pattern. Choose standalone Tomek Links when you already have a balanced dataset and just need boundary cleaning; choose SMOTETomek when you need both oversampling and cleaning.

NearMiss

NearMiss selects majority-class samples based on their distance to minority-class samples (keeping only the closest ones). It is far more aggressive than Tomek Links and can achieve any target ratio. However, it discards most of the majority class, potentially losing important information about the majority distribution. Choose NearMiss when extreme rebalancing is needed; choose Tomek Links for gentle boundary cleaning.

Cluster Centroids

Cluster centroids replaces the majority class with K-means cluster centroids, dramatically reducing the majority class size while preserving its overall shape. Unlike Tomek Links, it creates new synthetic majority samples (the centroids) rather than selecting existing ones. Choose cluster centroids for aggressive rebalancing with shape preservation; choose Tomek Links for minimal, boundary-focused cleaning.

Pros, Cons & Tradeoffs

Advantages

Targeted boundary cleaning -- removes only the majority-class samples that are most ambiguous (mutual nearest neighbors of minority samples), preserving informative majority-class samples far from the boundary
Preserves all minority samples -- in the default sampling_strategy='majority' mode, no minority-class samples are ever removed, which is critical when the minority class has very few samples (e.g., 50-200)
Deterministic and reproducible -- given the same dataset and distance metric, Tomek Links always identifies the same set of links and produces the same cleaned dataset, unlike random undersampling or SMOTE which depend on random seeds
Excellent as a post-processing step for SMOTE -- the SMOTETomek combination is well-studied and consistently outperforms SMOTE alone on datasets with class overlap (Batista et al., 2004), making Tomek Links' cleaning the standard second step in hybrid resampling pipelines
Minimal hyperparameter tuning -- the only real choice is the removal strategy (majority vs. both), making Tomek Links essentially parameter-free and easy to integrate into automated pipelines
Intuitive and explainable -- the concept of removing ambiguous boundary pairs is easy to explain to non-technical stakeholders, regulators, and auditors, which is important in regulated domains like finance (RBI guidelines) and healthcare (ICMR)
Low computational overhead -- for datasets under 100K samples, Tomek Links runs in seconds, adding negligible cost to a training pipeline. On Indian cloud pricing (AWS Mumbai or E2E Networks), a single run costs well under INR 1

Disadvantages

Does not balance the dataset -- Tomek Links removes only a small percentage of majority samples (typically 1-5%), so the class ratio remains heavily skewed. It is not a replacement for oversampling or aggressive undersampling when class imbalance is the primary problem
Computationally expensive for large datasets -- the $O(n^2)$ brute-force or $O(n \log n)$ tree-based nearest neighbor computation becomes a bottleneck for datasets with millions of samples. A 1M-sample dataset takes 15-60 seconds depending on dimensionality
Sensitive to distance metric and feature scaling -- if features are not scaled, high-magnitude features dominate the distance calculation and the identified Tomek links may not reflect true boundary ambiguity. Requires careful preprocessing
Unreliable in high dimensions -- in spaces with hundreds or thousands of features, nearest neighbor distances converge (curse of dimensionality), making the mutual NN condition less meaningful. The algorithm may identify spurious Tomek links
Does not address the root cause of overlap -- if classes genuinely overlap in feature space (common in medical diagnosis, fraud detection), removing boundary samples does not help -- the overlap is a fundamental property of the problem, not noise to be cleaned
Limited benefit for tree-based models -- gradient boosting and random forests are already robust to noisy boundary samples because they learn piecewise decision boundaries. Tomek Links cleaning may show minimal improvement for these models while being more impactful for linear models, SVMs, and neural networks

Failure Modes & Debugging

Curse of dimensionality renders Tomek links meaningless

Cause

In high-dimensional spaces ( $d > 100$ ), the ratio of the distance to the nearest neighbor versus the distance to the farthest neighbor converges to 1. This means nearest neighbor relationships become nearly random, and the mutual NN condition that defines Tomek links no longer identifies truly ambiguous boundary pairs.

Symptoms

The number of Tomek links found is either much higher or much lower than expected. Removing identified links does not improve classifier performance. In extreme cases, performance may degrade because the removed samples were not actually boundary noise. The cleaned dataset shows no visual difference in boundary quality (when projected to 2D with t-SNE or PCA).

Mitigation

Apply dimensionality reduction before Tomek Links: use PCA to reduce to the top 20-50 principal components, or use truncated SVD for sparse data (e.g., TF-IDF vectors). Alternatively, use a distance metric better suited to high dimensions (cosine similarity instead of Euclidean). For very high-dimensional data, consider Edited Nearest Neighbors with a higher k value, which is more robust to dimensional effects.

Data leakage from applying Tomek Links before cross-validation

Cause

Running TomekLinks.fit_resample() on the full dataset before train-test splitting or cross-validation. The nearest neighbor computation uses information from validation/test samples to determine which training samples are Tomek links and should be removed.

Symptoms

CV scores are slightly inflated compared to held-out test performance. The inflation is typically small (0.5-2% for Tomek Links alone) but can be larger when combined with SMOTE in SMOTETomek. Production performance consistently underperforms offline evaluation. The effect is most noticeable on small datasets where each sample has more influence.

Mitigation

Always apply Tomek Links inside a imblearn.pipeline.Pipeline and use it within cross_val_score or cross_validate. Never call fit_resample on the full dataset before splitting. If using a custom training loop, ensure Tomek Links is applied only to the training fold of each CV iteration.

Over-cleaning in iterative Tomek Links application

Cause

Running Tomek Links repeatedly (recomputing nearest neighbors and removing links each iteration) until no more links exist. Each round of removal creates new nearest-neighbor relationships that may flag previously safe samples as Tomek links. After several iterations, the cleaned dataset may have removed a significant fraction of informative majority-class samples near the boundary.

Symptoms

Classifier precision drops (more false positives) even though recall improves. The majority-class distribution becomes hollow near the boundary -- there are no majority samples close to the minority class. The model becomes overconfident in the minority class region, failing to properly identify majority-class samples that legitimately exist near the boundary.

Mitigation

Run Tomek Links for exactly one pass (the standard approach). If you need more aggressive cleaning, switch to Edited Nearest Neighbors (ENN) or SMOTEENN, which are designed for stronger cleaning in a single pass. If iterative cleaning is genuinely needed, set a maximum iteration count (2-3) and monitor the number of samples removed per iteration -- stop when the count drops below a threshold.

Ineffective cleaning for genuinely overlapping distributions

Cause

The class distributions fundamentally overlap in feature space -- no amount of boundary cleaning will separate them because the overlap is not noise but a true property of the data generation process. This is common in medical diagnosis (healthy and sick patients can have identical lab values) and fraud detection (sophisticated fraudsters mimic legitimate behavior perfectly).

Symptoms

Tomek Links removes boundary samples, but classifier performance does not improve. The confusion matrix shows that misclassifications are spread across the entire feature space, not concentrated at the boundary. Adding more features (better feature engineering) has a much larger effect than any resampling technique.

Mitigation

Recognize that resampling is not the bottleneck -- feature engineering is. Invest in creating features that better separate the classes. Consider ensemble methods (gradient boosting, random forests) that can model complex, overlapping boundaries without needing clean separation. For truly overlapping distributions, threshold tuning and cost-sensitive learning are often more effective than data-level resampling.

Unscaled features distort Tomek link identification

Cause

Features with different scales (e.g., age in years [0-100] and annual income in INR [100,000-10,000,000]) cause Euclidean distances to be dominated by the high-magnitude feature. Tomek links are identified based on income similarity alone, ignoring all other features.

Symptoms

The identified Tomek links cluster along one feature dimension. Cleaning removes samples that are far apart in most feature dimensions but happen to be close in the high-magnitude feature. Cleaned dataset does not show improved boundary quality when inspected with PCA or t-SNE projections.

Mitigation

Always apply StandardScaler or MinMaxScaler before Tomek Links. In imblearn.pipeline.Pipeline, place the scaler as the first step before the TomekLinks step. Alternatively, use Manhattan distance (less sensitive to outliers) or Mahalanobis distance (accounts for feature correlations and scales) in the custom implementation.

Placement in an ML System

Where Tomek Links Sits in the ML Pipeline

Tomek Links occupies a specific niche in the preprocessing stage of the ML pipeline, between feature engineering and model training. Its position is critical: it must run after feature scaling (so distances are meaningful) and after train-test splitting (to prevent data leakage), but before model training.

The typical pipeline flow for imbalanced classification is:

Data ingestion -> raw data
Feature engineering -> engineered features
Train-test split -> training set and held-out test set
Within each CV fold: a. Feature scaling (StandardScaler / MinMaxScaler) b. Oversampling (SMOTE) on training fold only c. Boundary cleaning (Tomek Links) on augmented training fold d. Model training on cleaned, augmented training fold e. Evaluation on original (unmodified) validation fold

This ordering is enforced automatically when using imblearn.pipeline.Pipeline with cross_validate.

In production ML systems at Indian fintech companies (Razorpay, PayU, CRED), Tomek Links is typically a step in an Airflow or Kubeflow pipeline that runs during scheduled retraining. The resampled data is not persisted (it changes with each retraining cycle); instead, the pipeline configuration (SMOTE ratio, Tomek strategy) is versioned alongside the model in the model registry.

Key Integration Point: Tomek Links produces a different-sized dataset than its input (some samples are removed). Any pipeline component downstream must handle variable-length inputs. This is not an issue with scikit-learn or imbalanced-learn pipelines but can trip up custom training loops that assume fixed batch sizes.

Pipeline Stage

Data Preprocessing / Resampling

Upstream

train-test-split
scaling
encoding
feature-selection

Downstream

model-training
cross-validation
hyperparameter-tuning

Scaling Bottlenecks

Where Tomek Links Gets Expensive

The primary bottleneck is the nearest neighbor computation, which scales as $O(n \log n)$ with tree-based indices (KD-tree, ball-tree) for low-to-moderate dimensions, and $O(n^2)$ for high dimensions or brute-force.

Here are practical benchmarks on typical hardware:

Dataset Size	Features	Algorithm	Time	Cloud Cost (AWS Mumbai)	Cloud Cost (E2E Networks)
10,000	20	KD-tree	~0.3s	< INR 0.01	< INR 0.01
100,000	50	KD-tree	~3s	< INR 0.05	< INR 0.03
500,000	50	KD-tree	~20s	< INR 0.10	< INR 0.06
1,000,000	50	Ball-tree	~60s	~ INR 0.25 (~$0.003)	~ INR 0.18
1,000,000	500	Brute-force	~600s	~ INR 2.50 (~$0.03)	~ INR 1.80

For most practical use cases (tabular data with 10K-500K samples and under 100 features), Tomek Links is negligible in cost compared to model training. The bottleneck is almost always the downstream model training (especially if using gradient boosting or neural networks), not the resampling step.

For very large datasets (10M+ samples), consider these alternatives:

Sample-then-clean: Apply Tomek Links to a random subsample of the majority class, then use the identified link pattern to extrapolate cleaning to the full dataset
Approximate nearest neighbors: Use libraries like FAISS or Annoy for approximate NN computation in $O(n \log n)$ time regardless of dimensionality
Skip Tomek Links entirely: At 10M+ samples, the boundary is often well-enough defined that cleaning provides minimal benefit

Production Case Studies

European Cardholders Dataset (ULB)Financial Services / Fraud Detection

A comprehensive study published in the Journal of Big Data evaluated the effect of feature extraction (PCA, CAE) and data sampling techniques (RUS, SMOTE, SMOTE-Tomek) on credit card fraud detection using the European cardholders dataset (284,807 transactions, 0.172% fraud rate). The researchers tested multiple classifiers (Random Forest, SVM, Neural Network) with different resampling strategies, finding that SMOTE-Tomek combined with Random Forest achieved an F1-score of 0.8677, significantly outperforming models trained on the original imbalanced data.

Outcome:

SMOTE-Tomek improved Random Forest F1-score from 0.74 (no resampling) to 0.87 on the European bank fraud dataset, a 17.6% relative improvement. The study confirmed that Tomek Links' boundary cleaning after SMOTE consistently reduced false positives compared to SMOTE alone.

SAS InstituteEnterprise Analytics / Financial Services

SAS published a research study at SAS Global Forum 2018 testing the efficacy of SMOTE and SMOTE+Tomek Links on a credit card fraud transaction dataset. They trained four classifiers -- Random Forest, Neural Network, SVM, and Rule Induction -- on the original imbalanced data, SMOTE-resampled data, and SMOTE+Tomek-resampled data. The study systematically measured recall (fraud detection rate) and the false negative to false positive ratio across all combinations.

Outcome:

SMOTE+Tomek improved recall and lowered the FN/FP ratio for every single classifier tested, demonstrating that Tomek Links' boundary cleaning consistently improves fraud detection regardless of the model choice. The SVM classifier showed the most dramatic improvement, with recall increasing from 62% to 81% after SMOTE+Tomek.

Kaggle Competition: IEEE-CIS Fraud DetectionE-commerce / Payment Fraud

A detailed case study on the IEEE-CIS Fraud Detection competition dataset (590K transactions provided by Vesta Corporation) demonstrated the SMOTETomek + XGBoost pipeline for production fraud detection. The author compared baseline XGBoost (no resampling), SMOTE+XGBoost, and SMOTETomek+XGBoost, with detailed ROC curves and confusion matrix analysis. The SMOTETomek approach improved out-of-the-box XGBoost fraud detection rate from 84% to 88%.

Outcome:

After combining SMOTETomek with hyperparameter-tuned XGBoost, the model achieved a 92% fraud detection rate with 98%+ correct classification rate on the held-out test set, demonstrating that Tomek Links cleaning of SMOTE-generated boundary noise contributes measurably to final model performance even with strong tree-based classifiers.

Medical Diagnosis (Springer Nature Study)Healthcare

A study published in the Journal of Big Data applied SMOTE, SMOTE+Tomek, and SMOTE+ENN to a medical insurance fraud detection dataset with extreme class imbalance. The study is particularly relevant to Indian healthcare systems (Ayushman Bharat, NHPM) where insurance fraud detection operates on datasets with fraud rates below 1%. The researchers found that SMOTE+Tomek with Random Forest consistently outperformed single-technique approaches across precision, recall, and F1-score.

Outcome:

SMOTE+Tomek achieved the best precision-recall balance (F1=0.91) compared to SMOTE alone (F1=0.87) and no resampling (F1=0.72) on the healthcare fraud dataset. The Tomek Links cleaning step was especially impactful for reducing false positives, which is critical in healthcare where false fraud accusations can deny legitimate patient claims.

Tooling & Ecosystem

imbalanced-learn (TomekLinks)

PythonOpen Source

The standard implementation of Tomek Links in Python. Part of the imbalanced-learn library (scikit-learn-contrib). Provides TomekLinks class with fit_resample(X, y) API, support for multiclass, configurable sampling_strategy, and seamless integration with imblearn.pipeline.Pipeline. Handles KD-tree/ball-tree/brute-force NN algorithm selection automatically via scikit-learn's NearestNeighbors.

imbalanced-learn (SMOTETomek)

PythonOpen Source

The combined SMOTE + Tomek Links pipeline in imbalanced-learn. Runs SMOTE first to generate synthetic minority samples, then Tomek Links to clean boundary noise. Configurable SMOTE and Tomek parameters via smote and tomek constructor arguments. The most commonly used wrapper for production Tomek Links usage.

scikit-learn (NearestNeighbors)

PythonOpen Source

The underlying nearest-neighbor engine used by imbalanced-learn. Provides KD-tree, ball-tree, and brute-force algorithms with configurable distance metrics (Euclidean, Manhattan, Minkowski, Chebyshev, custom). Essential for custom Tomek Links implementations that need non-default distance metrics or approximate NN for large-scale data.

UBL (Utility-Based Learning) for R

ROpen Source

R package providing TomekClassif() function for Tomek Links-based classification on imbalanced data. Supports both majority-only and both-class removal strategies. Part of the broader UBL package for utility-based learning that includes multiple resampling techniques.

Imbalance.jl

JuliaOpen Source

Julia implementation of SMOTE-Tomek Links for high-performance imbalanced learning. Provides fraud detection examples using the combined pipeline. Julia's JIT compilation makes the NN computation significantly faster than Python for large datasets, making this a good choice for teams already using Julia's ML ecosystem.

FAISS (Approximate NN for large-scale Tomek Links)

Python/C++Open Source

Facebook AI Similarity Search library for approximate nearest neighbor computation. While not a Tomek Links implementation itself, FAISS can replace the NN computation in custom Tomek Links implementations for datasets with millions of samples. The approximate NN trades a small accuracy reduction for 10-100x speedup on large datasets.

Research & References

Two Modifications of CNN

Tomek, I. (1976)IEEE Transactions on Systems, Man, and Communications, Vol. 6

The foundational paper that introduced Tomek Links. Proposed two modifications to Hart's Condensed Nearest Neighbor (CNN) rule: (1) Tomek links -- mutual nearest neighbor pairs from different classes that indicate boundary or noise samples, and (2) a modified CNN that uses Tomek links to improve the condensation process. This paper is the origin of all modern Tomek Link-based resampling methods.

A Study of the Behavior of Several Methods for Balancing Machine Learning Training Data

Batista, G.E.A.P.A., Prati, R.C., Monard, M.C. (2004)ACM SIGKDD Explorations Newsletter, Vol. 6

The landmark study that established the SMOTETomek and SMOTEENN combined pipelines. Analyzed the behavior of oversampling (SMOTE) combined with undersampling (Tomek Links, ENN) on 13 benchmark datasets. Showed that SMOTE+Tomek consistently outperformed either technique alone, particularly for datasets with class overlap and small disjuncts. This paper is the primary citation for the SMOTETomek methodology.

SMOTE: Synthetic Minority Over-sampling Technique

Chawla, N.V., Bowyer, K.W., Hall, L.O., Kegelmeyer, W.P. (2002)Journal of Artificial Intelligence Research, Vol. 16

The paper that introduced SMOTE, the most commonly paired technique with Tomek Links. SMOTE generates synthetic minority samples by interpolating between existing minority samples and their k-nearest neighbors. Understanding SMOTE is essential for understanding why Tomek Links is needed as a post-processing step: SMOTE can introduce noise at the boundary, and Tomek Links cleans it up.

SMOTE for Learning from Imbalanced Data: Progress and Challenges, Marking the 15-year Anniversary

Fernandez, A., Garcia, S., Herrera, F., Chawla, N.V. (2018)Journal of Artificial Intelligence Research, Vol. 61

A comprehensive 15-year retrospective on SMOTE and its extensions, including SMOTETomek. Reviews the theoretical foundations, practical guidelines, and open challenges. Discusses when combined approaches (SMOTE+Tomek, SMOTE+ENN) outperform standalone methods and provides recommendations for practitioners working with imbalanced data across different domains.

The Condensed Nearest Neighbor Rule

Hart, P.E. (1968)IEEE Transactions on Information Theory, Vol. 14

The original CNN paper that Tomek's 1976 work modified. Hart proposed keeping only the training samples necessary for correct nearest-neighbor classification, discarding redundant interior points. Understanding CNN is essential for understanding Tomek's modifications: Tomek Links identifies the boundary pairs that CNN's condensation process should prioritize.

Asymptotic Properties of Nearest Neighbor Rules Using Edited Data

Wilson, D.L. (1972)IEEE Transactions on Systems, Man, and Cybernetics, Vol. 2

Introduced the Edited Nearest Neighbor (ENN) rule, the "sibling" technique to Tomek Links. Wilson's ENN removes samples misclassified by their k nearest neighbors. Tomek Links and ENN represent two complementary approaches to data cleaning: Tomek Links targets mutual NN pairs (the tightest boundary pairs), while ENN targets samples that disagree with their neighborhood (more aggressive cleaning).

Interview & Evaluation Perspective

Common Interview Questions

●
What is a Tomek link and how does it help with imbalanced classification?
●
Explain the difference between Tomek Links and Edited Nearest Neighbors (ENN) as data cleaning techniques.
●
Why is Tomek Links typically combined with SMOTE rather than used standalone? What does each component contribute?
●
You have a fraud detection dataset with 0.1% positive rate. Would you use Tomek Links? What else would you combine it with?
●
How does Tomek Links handle the decision boundary differently from random undersampling?
●
What happens if you apply Tomek Links before train-test splitting? How would you detect this mistake?
●
In what situations would Tomek Links NOT help improve classifier performance?
●
How would you scale Tomek Links to a dataset with 10 million samples?

Key Points to Mention

●
Tomek Links identifies mutual nearest neighbor pairs from different classes -- this mutual condition is key. Not just any cross-class neighbor pair, but pairs where each is the other's closest neighbor. This targets the most ambiguous boundary samples.
●
The primary value is boundary cleaning, not class balancing. Tomek Links removes only 1-5% of majority samples. Its power comes from which samples it removes, not how many.
●
The standard usage is SMOTETomek: SMOTE generates synthetic minority samples, then Tomek Links cleans the boundary noise that SMOTE introduces. This two-step pipeline outperforms either technique alone (Batista et al., 2004).
●
Data leakage prevention: Tomek Links must be applied inside the CV fold, not before splitting. The NN computation uses information from all samples, so applying it before splitting leaks validation information into the training preprocessing.
●
Limitations: Tomek Links is unreliable in high dimensions (curse of dimensionality makes NN relationships meaningless), requires scaled features, and does not help when class overlap is fundamental rather than noise-based.
●
Know the historical lineage: Hart (1968) -> Wilson (1972) -> Tomek (1976) -> Batista et al. (2004) combining with SMOTE. This shows depth of understanding.

Pitfalls to Avoid

●
Calling Tomek Links an 'undersampling technique' without qualifying that it is primarily a cleaning technique -- interviewers expect you to understand the distinction between reducing class imbalance and cleaning boundary noise
●
Not mentioning the mutual nearest neighbor condition -- simply saying 'removes closest cross-class pairs' misses the key insight that the relationship must be reciprocal
●
Claiming Tomek Links will balance your dataset -- it removes far too few samples for that. If the interviewer asks about handling a 99:1 imbalance ratio, Tomek Links alone is not the answer
●
Forgetting to mention feature scaling as a prerequisite -- unscaled features make Euclidean distance meaningless and the identified Tomek links spurious
●
Not discussing the curse of dimensionality limitation -- senior interviewers will probe whether you understand when nearest-neighbor-based methods fail

Senior-Level Expectation

A senior candidate should demonstrate understanding beyond the basic definition. They should articulate the mathematical definition (mutual nearest neighbor condition), explain the historical lineage (Hart's CNN -> Wilson's ENN -> Tomek's modifications -> Batista's SMOTETomek), and discuss when Tomek Links fails (high dimensions, fundamental class overlap, unscaled features). They should know the computational complexity ( $O(n \log n)$ with tree-based NN, $O(n^2)$ brute-force) and be able to discuss scaling strategies (approximate NN with FAISS, subsampling). Senior engineers should compare Tomek Links quantitatively with ENN (1-5% removal vs. 5-20%), understand why tree-based models benefit less from boundary cleaning than linear models or SVMs, and know that the imblearn.pipeline.Pipeline (not sklearn.pipeline.Pipeline) is required for proper integration with resampling. Finally, they should be able to discuss cost-benefit analysis: for a fraud detection system at scale (millions of transactions), when is the $O(n \log n)$ Tomek Links computation justified versus simply tuning the classifier's decision threshold or using cost-sensitive learning?

Summary

Tomek Links is a boundary cleaning technique for imbalanced classification that identifies and removes majority-class samples forming mutual nearest-neighbor pairs with minority-class samples. Introduced by Ivan Tomek in 1976 as a modification to Hart's Condensed Nearest Neighbor rule, it has found its primary modern application as a post-processing step in the SMOTETomek pipeline -- where SMOTE first generates synthetic minority samples to address class imbalance, and Tomek Links then cleans the boundary noise that oversampling introduces. The combination, established as effective by Batista et al. (2004), consistently outperforms either technique alone on datasets with class overlap.

The technique is notable for its conservatism and precision: it removes only 1-5% of majority-class samples, targeting exclusively the most ambiguous boundary pairs while preserving all minority samples and the overall majority-class distribution. This makes it a cleaning technique rather than a balancing technique, and it should almost always be combined with oversampling when class ratio is a concern. The algorithm is deterministic, has minimal hyperparameters (just the removal strategy), and runs in $O(n \log n)$ time with tree-based nearest neighbor indices -- making it computationally negligible for most practical datasets.

For Indian ML teams working on fraud detection (Razorpay, PayU, CRED), medical diagnosis (AIIMS, Apollo), churn prediction (Jio, Airtel), or insurance claims analysis (Ayushman Bharat), Tomek Links is a low-cost, high-value addition to any imbalanced classification pipeline. The key implementation requirements are: (1) always apply inside a imblearn.pipeline.Pipeline with cross-validation to prevent data leakage, (2) scale features before applying Tomek Links so distances are meaningful, (3) combine with SMOTE for the standard SMOTETomek recipe, and (4) understand the limitations -- Tomek Links is unreliable in high dimensions, does not address fundamental class overlap, and provides less benefit for tree-based models that are inherently robust to boundary noise. The compute cost on Indian cloud providers is under INR 1 for most datasets, making the technique essentially free to include in any preprocessing pipeline.

Concept Snapshot

Why This Concept Exists

The Problem: Noisy Class Boundaries in Imbalanced Data

Historical Context: From CNN to Tomek Links

Modern Renaissance: Tomek Links for Class Imbalance

Core Intuition & Mental Model

The Analogy: Sorting Out Disputed Territory

What Makes a Tomek Link Special?

The Combined Power: SMOTE + Tomek Links

Technical Foundations

Definition of a Tomek Link

Removal Strategy

Algorithm

Computational Complexity

Properties

Internal Architecture

Key Components

Data Flow

How to Implement

Implementation Approaches

Common Implementation Mistakes

When Should You Use This?

Use When

Avoid When

Key Tradeoffs

Cleaning Aggressiveness vs. Information Preservation

Standalone vs. Combined Usage

When Does the Extra Cleaning Actually Help?

Alternatives & Comparisons

Pros, Cons & Tradeoffs

Advantages

Disadvantages

Failure Modes & Debugging

Curse of dimensionality renders Tomek links meaningless

Data leakage from applying Tomek Links before cross-validation

Over-cleaning in iterative Tomek Links application

Ineffective cleaning for genuinely overlapping distributions

Unscaled features distort Tomek link identification

Placement in an ML System

Where Tomek Links Sits in the ML Pipeline

Pipeline Stage

Upstream

Downstream

Scaling Bottlenecks

Production Case Studies

Tooling & Ecosystem

Research & References

Interview & Evaluation Perspective

Common Interview Questions

Key Points to Mention

Pitfalls to Avoid

Senior-Level Expectation

Summary

Related Blocks & Further Reading

Related ML Blocks

Further Reading