Condensed Nearest Neighbour (CNN) in Machine Learning

You have a classification dataset with 10 million samples, 95% of which belong to the majority class. Training is slow, expensive, and the class imbalance crushes minority-class recall. Random undersampling throws away samples indiscriminately, potentially discarding the very samples that define the decision boundary.

Condensed Nearest Neighbour (CNN) offers a principled alternative. Proposed by Peter Hart in 1968, CNN finds the smallest subset of the original dataset that correctly classifies all remaining samples using a 1-nearest-neighbor (1-NN) rule. It keeps only the samples that matter -- the ones near class boundaries -- and discards redundant interior samples that contribute nothing to classification accuracy.

The CNN rule is one of the earliest instance selection algorithms in ML history. It preceded the Edited Nearest Neighbour (ENN) rule by four years and established that datasets contain massive redundancy: you do not need every sample to learn a good decision boundary. You only need the samples near the boundary, the "prototypes" that define where one class transitions into another.

In the modern ML ecosystem, CNN has found a second life as an undersampling technique for imbalanced classification. The imbalanced-learn library provides a production-ready implementation used across fraud detection systems at Indian fintech companies, medical diagnosis pipelines, and recommendation systems with heavily skewed data.

CNN's core strength is selectivity: unlike random undersampling, it preserves boundary samples carrying the most classification information. Its weakness is non-determinism and sensitivity to noise -- problems that later algorithms like ENN and Tomek Links addressed.

Concept Snapshot

What It Is
An instance selection algorithm that finds the minimal consistent subset of a dataset -- the smallest subset that correctly classifies all original samples via a 1-nearest-neighbor rule. Used as an undersampling technique that preserves decision-boundary samples while removing redundant interior points.
Category
Data Generation (Undersampling)
Complexity
Intermediate
Inputs / Outputs
Inputs: labeled dataset with class imbalance or redundancy. Outputs: reduced dataset containing only boundary prototypes and all minority-class samples, suitable for 1-NN classification.
System Placement
Applied during data preprocessing after data collection and labeling but before model training. Positioned as an undersampling step in imbalanced learning pipelines, typically before cross-validation and model selection.
Also Known As
CNN rule, Hart's condensing algorithm, Condensed NN, Nearest neighbor condensing, Consistent subset selection
Typical Users
ML engineers, data scientists, research scientists, data quality engineers
Prerequisites
1-nearest-neighbor classification, class imbalance concepts, distance metrics (Euclidean, Manhattan), decision boundaries in classification, prototype selection theory
Key Terms
consistent subset1-NN classifierprototype selectionboundary samplesinstance reductioncondensingstore setgrabbag set

Why This Concept Exists

The Redundancy Problem in k-NN Classification

The k-nearest-neighbor classifier is universally consistent: as the dataset grows, the 1-NN error rate converges to at most twice the Bayes-optimal error rate. But this comes with a brutal practical cost -- classifying a single test point requires computing distances to every training sample. For 10 million samples with 100 features, that is 1 billion distance computations per prediction.

But here is the critical insight: most of those 10 million samples are redundant. Samples deep in the interior of a class contribute nothing to the decision boundary. The information content of the dataset is concentrated at the boundaries.

Hart's Condensing Insight (1968)

In 1968, Peter Hart published "The Condensed Nearest Neighbor Rule" in IEEE Transactions on Information Theory, posing a simple question: what is the smallest subset of the training data that still classifies all the original training data correctly using a 1-NN rule?

Hart called this the consistent subset problem. Finding the truly minimal consistent subset is NP-hard, so Hart proposed a greedy algorithm: start with one sample per class, then iterate through remaining samples, adding each one to the subset only if it is misclassified by the current 1-NN classifier.

This ensures that only "difficult" samples near class boundaries get added. Interior samples, easily classified by same-class neighbors already in the subset, never get added.

From Dataset Reduction to Imbalanced Learning

Hart's original motivation was computational. But in the 2000s, researchers noticed CNN had a natural undersampling effect. In imbalanced datasets, CNN removes mostly majority-class interior samples while minority-class samples (already few) tend to be near the boundary and get retained.

This made CNN a natural candidate for imbalanced-learn, where it is implemented as CondensedNearestNeighbour. The library's implementation starts the condensed set with all minority-class samples, then incrementally adds majority-class samples that are misclassified -- treating CNN as a majority-class reduction technique.

Key Takeaway: CNN exists because datasets contain massive redundancy for nearest-neighbor classification. By keeping only boundary prototypes, CNN achieves dramatic reduction (often 80-95%) while preserving decision-boundary fidelity.

Core Intuition & Mental Model

The Museum Guard Analogy

Imagine you are responsible for security at a large museum. Theft only happens at the boundaries -- the doors and corridors between wings. Guards in the middle of a large gallery are wasted. You only need guards at transition points between areas.

CNN does the same with data points. It keeps only the "transition points" between classes -- the boundary samples -- and discards redundant interior samples.

The Store and Grabbag Mental Model

Hart's algorithm uses two data structures:

  • STORE: The condensed subset being built. Starts with one random sample per class.
  • GRABBAG: Everything not yet in the STORE.

The algorithm scans the GRABBAG repeatedly. For each sample, it asks: "Can the current STORE correctly classify this using 1-NN?" If yes, the sample is redundant -- it stays in GRABBAG. If no, it carries new boundary information and moves to STORE.

Scanning repeats until a full pass produces no transfers. The STORE is then a consistent subset.

Why CNN Preserves Boundaries

Consider two classes forming clusters with boundary overlap. When CNN starts with one prototype per class:

  • Interior samples far from the boundary: correctly classified by same-class prototypes. Stay in GRABBAG.
  • Boundary samples: may be closer to the wrong-class prototype. Misclassified. Moved to STORE.

The STORE fills with boundary samples while interior samples remain in GRABBAG. The result is a skeleton tracing the decision boundary with the interior hollowed out.

CNN vs. Random Undersampling

Random undersampling removes samples without considering location. It might remove boundary samples and keep interior ones. CNN does the opposite: it retains boundary samples and discards interior points. A CNN-reduced dataset with 5,000 samples can carry more classification information than a randomly undersampled dataset with 20,000 samples.

Expert Insight: CNN is sensitive to initialization and sample ordering. Different random seeds produce different condensed subsets. Running CNN multiple times and ensembling results can mitigate this.

Technical Foundations

Mathematical Formulation

Let D={(xi,yi)}i=1nD = \{(\mathbf{x}_i, y_i)\}_{i=1}^{n} be a labeled dataset where xiRd\mathbf{x}_i \in \mathbb{R}^d and yi{1,2,,C}y_i \in \{1, 2, \ldots, C\}.

Definition (Consistent Subset): A subset SDS \subseteq D is consistent with DD if:

(xi,yi)D:NN1(xi,S)=yi\forall (\mathbf{x}_i, y_i) \in D: \quad \text{NN}_1(\mathbf{x}_i, S) = y_i

Hart's CNN Algorithm:

  1. Initialize STORE={(x1,y1)}\text{STORE} = \{(\mathbf{x}_1, y_1)\} with one sample per class
  2. Initialize GRABBAG=DSTORE\text{GRABBAG} = D \setminus \text{STORE}
  3. Repeat:
    • For each (xi,yi)GRABBAG(\mathbf{x}_i, y_i) \in \text{GRABBAG}:
      • Find xnn=argminxjSTOREd(xi,xj)\mathbf{x}_{\text{nn}} = \arg\min_{\mathbf{x}_j \in \text{STORE}} d(\mathbf{x}_i, \mathbf{x}_j)
      • If ynnyiy_{\text{nn}} \neq y_i: move (xi,yi)(\mathbf{x}_i, y_i) to STORE
  4. Until no transfers in a full pass
  5. Return STORE

Convergence

Hart proved convergence in finite passes. Each pass either transfers at least one sample or terminates. Practical convergence occurs in 2-10 passes.

Consistency Property

Upon termination: (xi,yi)D:yNN1(xi,S)=yi\forall (\mathbf{x}_i, y_i) \in D: y_{\text{NN}_1(\mathbf{x}_i, S)} = y_i

The 1-NN classifier built from STORE has zero training error on the full dataset.

Computational Complexity

  • Per-pass cost: O(GRABBAGSTOREd)O(|\text{GRABBAG}| \cdot |\text{STORE}| \cdot d)
  • Total cost: O(pnSd)O(p \cdot n \cdot |S| \cdot d) where pp is the number of passes
  • Worst case: O(n2d)O(n^2 \cdot d)
  • Typical case: O(nSd)O(n \cdot |S| \cdot d) with Sn|S| \ll n and p10p \leq 10

Relationship to Voronoi Diagrams

The condensed subset defines a Voronoi tessellation. Each prototype is the center of a Voronoi cell, and CNN ensures no cell contains points from a different class.

Note: CNN does NOT find the minimal consistent subset. Hart's greedy result is typically 2-5x larger than the true minimum. Post-processing with Reduced Nearest Neighbour (RNN) can further shrink it.

Internal Architecture

The CNN algorithm follows an iterative refinement architecture with two core data structures (STORE and GRABBAG) and a 1-NN classifier that is rebuilt as the STORE grows. In production ML systems, CNN is typically deployed as a preprocessing transform within a scikit-learn Pipeline, sitting after feature engineering and before model training.

Key Components

STORE Initializer

Seeds the condensed subset with initial prototypes. In the original Hart formulation, this is one random sample. In imbalanced-learn's implementation, the STORE is initialized with all minority-class samples, ensuring complete minority-class retention. This initialization strategy biases CNN toward undersampling the majority class.

GRABBAG Manager

Manages the pool of candidate samples not yet in the STORE. Tracks which samples have been tested in the current pass and handles the transfer of misclassified samples to the STORE. In large-scale implementations, the GRABBAG may be stored on disk or in a database with streaming access.

1-NN Classifier (Incremental)

Maintains a nearest-neighbor index over the current STORE. For each GRABBAG sample, finds its nearest neighbor in the STORE and checks label agreement. Uses KD-tree or Ball tree for efficient lookups. The index is updated incrementally as new prototypes are added to the STORE.

Transfer Decision Engine

Applies the CNN rule: if a GRABBAG sample is misclassified by the current STORE's 1-NN classifier, transfer it to the STORE. This component also tracks convergence -- whether any transfers occurred in the current pass.

Convergence Monitor

Detects when a complete pass through the GRABBAG produces no transfers, signaling that the STORE is a consistent subset. Also implements optional early stopping if the STORE reaches a maximum size or a target imbalance ratio.

Data Flow

CNN Algorithm Flow:

  1. Separate dataset into minority and majority class samples
  2. Initialize STORE with all minority-class samples
  3. Place all majority-class samples in GRABBAG
  4. Build 1-NN index over STORE
  5. Pass loop: For each sample in GRABBAG: a. Query 1-NN index for nearest neighbor in STORE b. If labels disagree: transfer sample from GRABBAG to STORE, update index c. If labels agree: sample stays in GRABBAG
  6. If any transfers occurred in this pass, go to step 5
  7. Return STORE as the condensed, undersampled dataset

Integration in ML Pipeline:

Raw Data → Feature Engineering → CNN Undersampling → Cross-Validation → Model Training → Evaluation

CNN fits naturally into scikit-learn's fit_resample API, making it a drop-in replacement for random undersampling in existing pipelines.

A flow diagram showing 'Full Dataset' splitting into 'Minority Samples → STORE' and 'Majority Samples → GRABBAG'. A loop between GRABBAG and STORE labeled '1-NN Check: Misclassified? → Transfer to STORE'. An exit condition 'No transfers in full pass' leads to 'Condensed Dataset (STORE)' as output.

How to Implement

Implementation Approaches

Option A: imbalanced-learn library -- The standard Python implementation. CondensedNearestNeighbour provides production-ready CNN with scikit-learn pipeline compatibility. Recommended for most use cases.

Option B: Custom implementation -- Useful for streaming CNN, GPU-accelerated distance computations, or CNN variants like GCNN.

Option C: Combined with ENN -- Apply CNN first for bulk reduction, then ENN for noise cleaning. Produces a smaller and cleaner subset than either alone.

Cost Note: CNN is CPU-bound. For 1M samples with 50 features, expect 2-5 minutes on a standard laptop. Cloud cost on AWS c5.xlarge (~INR 14/hour) is negligible.

Basic CNN with imbalanced-learn
from imblearn.under_sampling import CondensedNearestNeighbour
from sklearn.datasets import make_classification
from collections import Counter
import numpy as np

# Create an imbalanced dataset
X, y = make_classification(
    n_samples=10000,
    n_features=20,
    n_informative=15,
    n_redundant=3,
    weights=[0.95, 0.05],
    flip_y=0.02,
    random_state=42
)

print(f"Original distribution: {Counter(y)}")
# Output: Counter({0: 9500, 1: 500})

# Apply CNN undersampling
cnn = CondensedNearestNeighbour(
    n_neighbors=1,        # Classic 1-NN rule
    sampling_strategy='auto',  # Undersample majority class
    random_state=42,
    n_seeds_S=1,          # Number of initial majority seeds in STORE
    n_jobs=-1             # Parallelize distance computations
)

X_cnn, y_cnn = cnn.fit_resample(X, y)
print(f"After CNN: {Counter(y_cnn)}")
# Output: Counter({0: 1823, 1: 500})
# ~80% of majority class removed, all minority retained

print(f"Reduction ratio: {1 - len(y_cnn)/len(y):.1%}")
# Output: Reduction ratio: 76.8%

The basic usage of CNN via imbalanced-learn. The algorithm initializes the STORE with all minority-class samples and one majority-class seed, then iteratively adds misclassified majority-class samples. The result retains all 500 minority samples and only the ~1823 majority samples that are near the decision boundary.

CNN in a scikit-learn Pipeline with Cross-Validation
from imblearn.pipeline import Pipeline as ImbPipeline
from imblearn.under_sampling import CondensedNearestNeighbour
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.metrics import make_scorer, f1_score, recall_score
import numpy as np

# Define pipeline with CNN undersampling
pipeline = ImbPipeline([
    ('scaler', StandardScaler()),
    ('cnn', CondensedNearestNeighbour(
        n_neighbors=1,
        sampling_strategy='auto',
        random_state=42,
        n_jobs=-1
    )),
    ('clf', RandomForestClassifier(
        n_estimators=200,
        max_depth=10,
        random_state=42,
        class_weight='balanced'
    ))
])

# Cross-validate with stratified folds
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scoring = {
    'f1': make_scorer(f1_score),
    'recall': make_scorer(recall_score),
    'precision': make_scorer(f1_score)
}

results = cross_validate(
    pipeline, X, y,
    cv=cv,
    scoring=scoring,
    return_train_score=True
)

print(f"F1 Score: {results['test_f1'].mean():.3f} +/- {results['test_f1'].std():.3f}")
print(f"Recall:   {results['test_recall'].mean():.3f} +/- {results['test_recall'].std():.3f}")
# CNN + RF typically achieves higher minority recall than no resampling

Integrating CNN into an imbalanced-learn Pipeline ensures that undersampling happens only on training folds during cross-validation, preventing data leakage. The StandardScaler before CNN is important because CNN uses distance metrics that are sensitive to feature scales.

Custom CNN Implementation with Convergence Tracking
import numpy as np
from sklearn.neighbors import KDTree
from collections import Counter
from typing import Tuple, List

def condensed_nearest_neighbour(
    X: np.ndarray,
    y: np.ndarray,
    random_state: int = 42,
    max_passes: int = 100,
    verbose: bool = True
) -> Tuple[np.ndarray, np.ndarray, List[int]]:
    """Custom CNN implementation with convergence tracking.
    
    Returns:
        X_store: Feature matrix of condensed subset
        y_store: Labels of condensed subset
        pass_sizes: STORE size after each pass (for convergence analysis)
    """
    rng = np.random.RandomState(random_state)
    n_samples = len(y)
    classes = np.unique(y)
    
    # Initialize STORE with one sample per class
    store_indices = []
    for cls in classes:
        cls_indices = np.where(y == cls)[0]
        seed_idx = rng.choice(cls_indices)
        store_indices.append(seed_idx)
    
    store_set = set(store_indices)
    grabbag_indices = [i for i in range(n_samples) if i not in store_set]
    pass_sizes = [len(store_set)]
    
    for pass_num in range(max_passes):
        transferred = False
        new_grabbag = []
        
        # Build KD-tree over current STORE
        store_list = sorted(store_set)
        tree = KDTree(X[store_list])
        
        # Shuffle GRABBAG for randomness
        rng.shuffle(grabbag_indices)
        
        for idx in grabbag_indices:
            # Find nearest neighbor in STORE
            dist, nn_idx = tree.query(X[idx].reshape(1, -1), k=1)
            nn_label = y[store_list[nn_idx[0][0]]]
            
            if nn_label != y[idx]:
                # Misclassified: transfer to STORE
                store_set.add(idx)
                transferred = True
                # Rebuild tree periodically for accuracy
                if len(store_set) % 100 == 0:
                    store_list = sorted(store_set)
                    tree = KDTree(X[store_list])
            else:
                new_grabbag.append(idx)
        
        grabbag_indices = new_grabbag
        pass_sizes.append(len(store_set))
        
        if verbose:
            print(f"Pass {pass_num + 1}: STORE={len(store_set)}, "
                  f"GRABBAG={len(grabbag_indices)}, "
                  f"transferred={pass_sizes[-1] - pass_sizes[-2]}")
        
        if not transferred:
            break
    
    store_list = sorted(store_set)
    return X[store_list], y[store_list], pass_sizes

# Usage
X_cnn, y_cnn, convergence = condensed_nearest_neighbour(X, y, verbose=True)
print(f"\nFinal: {Counter(y_cnn)}")
print(f"Reduction: {1 - len(y_cnn)/len(y):.1%}")
print(f"Passes to converge: {len(convergence) - 1}")

This custom implementation exposes the convergence behavior of CNN, showing how the STORE grows with each pass. The tree is rebuilt periodically within a pass to account for newly added prototypes. This visibility is useful for debugging and understanding how CNN behaves on your specific dataset.

CNN + ENN Two-Stage Pipeline
from imblearn.under_sampling import (
    CondensedNearestNeighbour,
    EditedNearestNeighbours
)
from imblearn.pipeline import Pipeline as ImbPipeline
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score
from collections import Counter

# Two-stage approach: CNN for bulk reduction, ENN for noise cleaning
# Stage 1: CNN removes redundant interior majority samples
cnn = CondensedNearestNeighbour(
    n_neighbors=1,
    sampling_strategy='auto',
    random_state=42
)
X_stage1, y_stage1 = cnn.fit_resample(X, y)
print(f"After CNN: {Counter(y_stage1)}")

# Stage 2: ENN removes noisy boundary samples
enn = EditedNearestNeighbours(
    n_neighbors=3,
    kind_sel='mode',
    sampling_strategy='auto'
)
X_stage2, y_stage2 = enn.fit_resample(X_stage1, y_stage1)
print(f"After CNN+ENN: {Counter(y_stage2)}")

# Compare with CNN alone
clf = GradientBoostingClassifier(n_estimators=200, random_state=42)

score_cnn = cross_val_score(clf, X_stage1, y_stage1, cv=5, scoring='f1').mean()
score_cnn_enn = cross_val_score(clf, X_stage2, y_stage2, cv=5, scoring='f1').mean()

print(f"\nF1 with CNN only:    {score_cnn:.3f}")
print(f"F1 with CNN + ENN:   {score_cnn_enn:.3f}")
# CNN+ENN typically improves F1 by 2-5% over CNN alone
# because ENN cleans the noisy samples that CNN preserves

CNN preserves all boundary samples including noisy ones. ENN removes samples whose neighbors disagree with their labels. Combining both gives you a dataset that is both small (from CNN's bulk reduction) and clean (from ENN's noise removal). This two-stage approach often outperforms either method alone.

Common Implementation Mistakes

  • Not scaling features before applying CNN

  • Using CNN on high-dimensional sparse data without dimensionality reduction

  • Expecting deterministic results across runs

  • Applying CNN before train-test split

  • Using CNN on datasets with significant label noise

  • Ignoring the computational cost for large datasets

When Should You Use This?

Use When

  • You need to reduce a large majority class while preserving the samples that define the decision boundary, rather than random deletion

  • Your downstream model is distance-based (k-NN, SVM with RBF kernel) and benefits from a condensed prototype set

  • You want to retain all minority-class samples while intelligently selecting which majority-class samples to keep

  • Dataset size is a bottleneck for training time and you want to reduce it without losing classification accuracy

  • You are building a 1-NN or few-NN classifier and want to minimize prediction-time storage and latency

  • The class imbalance ratio is moderate (5:1 to 50:1) and the dataset has clear geometric boundaries between classes

  • You want a theoretically grounded undersampling method with consistency guarantees rather than ad-hoc random removal

Avoid When

  • Your dataset has significant label noise -- CNN will retain noisy samples as boundary prototypes, amplifying errors. Use ENN or Tomek Links instead.

  • You need deterministic, reproducible results -- CNN's output varies with random seed and sample ordering. Use Tomek Links for deterministic undersampling.

  • The dataset has more than 500K samples and you need results within minutes -- CNN's iterative nearest-neighbor search is computationally expensive at scale.

  • Feature dimensionality exceeds 50-100 without dimensionality reduction -- the curse of dimensionality makes CNN's distance-based decisions unreliable.

  • The downstream model is not distance-based (e.g., tree ensembles, deep neural networks) -- CNN optimizes for 1-NN consistency, which may not align with the model's decision function.

  • Class overlap is extreme (Bayes error rate > 30%) -- CNN will retain too many samples near the heavily overlapping boundary, providing minimal reduction.

  • You need a specific target class ratio -- CNN does not allow you to specify the desired ratio. Use random undersampling or NearMiss for ratio-controlled undersampling.

Alternatives & Comparisons

Random undersampling removes majority-class samples uniformly at random. It is fast and allows precise control over the target class ratio, but it is uninformed -- it may remove critical boundary samples and keep useless interior ones. CNN is the opposite: it specifically retains boundary samples and removes interior ones. Use random undersampling when speed matters more than sample quality; use CNN when you want the most informative subset.

ENN and CNN are complementary rather than competing methods. ENN removes noisy samples (those whose neighbors disagree with their label), while CNN retains informative boundary samples (those that are needed for 1-NN consistency). ENN cleans noise; CNN reduces redundancy. They are often combined: CNN first for bulk reduction, then ENN for noise cleaning. ENN alone provides much less reduction (5-20%) compared to CNN (70-95%).

Tomek Links removes only the majority-class member of cross-class nearest-neighbor pairs -- the tightest inter-class pairs at the boundary. This is far more conservative than CNN: Tomek Links typically removes 1-5% of samples vs. CNN's 70-95%. Tomek Links is deterministic and noise-aware (it removes the specific majority samples that are closest to minority samples), while CNN is stochastic and noise-agnostic. Use Tomek Links for gentle boundary cleaning; use CNN for aggressive dataset reduction.

NearMiss selects majority-class samples that are closest to minority-class samples (NearMiss-1) or farthest (NearMiss-3). Unlike CNN, NearMiss allows you to specify the exact target ratio. NearMiss-1 retains majority samples near the boundary (similar to CNN's behavior), but it selects a fixed number rather than finding the minimal consistent subset. NearMiss is faster than CNN but can be more aggressive, sometimes removing samples that CNN would retain as necessary prototypes.

Cluster Centroids replaces majority-class samples with cluster centers generated by K-Means. It creates synthetic representative points rather than selecting from existing data. This can produce smoother decision boundaries but loses the original data points. CNN retains actual data points, which is preferable when data provenance matters (e.g., regulated industries). Cluster Centroids allows precise ratio control; CNN does not.

Pros, Cons & Tradeoffs

Advantages

  • Preserves decision-boundary samples while removing redundant interior points, producing a maximally informative subset for distance-based classifiers

  • Provides a consistency guarantee: the 1-NN classifier on the condensed set correctly classifies all original training samples

  • Achieves high reduction ratios (70-95% for typical imbalanced datasets) while maintaining classification accuracy

  • Retains all minority-class samples by default (in the imbalanced-learn implementation), preventing minority-class information loss

  • Theoretically grounded with convergence guarantees from Hart's 1968 proof, unlike ad-hoc heuristic methods

  • Produces a natural prototype set that can be used for interpretable 1-NN classification -- each prediction can be explained by showing the nearest prototype

Disadvantages

  • Non-deterministic: different random seeds and sample orderings produce different condensed subsets, making results hard to reproduce exactly

  • Noise-sensitive: retains mislabeled samples near boundaries as prototypes, amplifying rather than removing label noise

  • Computationally expensive for large datasets due to iterative nearest-neighbor searches; O(n^2*d) worst case

  • Does not allow specifying a target class ratio -- the condensed subset size is determined by the data geometry, not a user parameter

  • Curse of dimensionality: performance degrades in high-dimensional spaces where distance metrics become less meaningful

  • The condensed subset is not guaranteed to be minimal -- Hart's greedy algorithm typically produces subsets 2-5x larger than the true minimum

Failure Modes & Debugging

Noise amplification in mislabeled datasets

Cause

CNN retains any sample that is misclassified by the current STORE. Mislabeled samples near the boundary are always misclassified by their correctly-labeled neighbors in the STORE, so they are always retained. In datasets with 5-10% label noise, the condensed subset becomes enriched in noisy samples.

Symptoms

The condensed subset has worse test accuracy than random undersampling. Precision drops significantly while recall may appear stable. The STORE contains many isolated points that are surrounded by points of the opposite class. Training a model on the condensed set produces unstable predictions near the boundary.

Mitigation

Apply ENN or Tomek Links after CNN to clean noisy prototypes from the condensed set. Alternatively, use the GCNN (Generalized CNN) variant that applies a noise filter during condensing. For highly noisy datasets (>10% label noise), consider skipping CNN entirely and using ENN-based cleaning instead.

Excessive retention in overlapping class distributions

Cause

When classes have significant overlap (e.g., two Gaussian clusters with the same mean but different covariances), nearly every sample near the overlap region is misclassified by the STORE at some point during the algorithm. CNN retains almost all samples in the overlap region, providing minimal reduction.

Symptoms

The condensed subset is only 10-30% smaller than the original dataset, instead of the expected 70-95% reduction. The algorithm takes many passes to converge. The STORE grows nearly linearly with each pass rather than plateauing.

Mitigation

Reduce class overlap through better feature engineering before applying CNN. Use a higher-k neighbor rule (e.g., k=3 or k=5) instead of the strict 1-NN rule, which is less sensitive to local noise in the overlap region. Consider NearMiss or random undersampling with a fixed target ratio for highly overlapping distributions.

Initialization bias causing inconsistent results

Cause

The choice of initial seed samples dramatically affects which boundary prototypes get selected first, which in turn affects which subsequent samples are misclassified and added. With different seeds, CNN can produce condensed subsets that differ by 20-40% in composition.

Symptoms

Model performance varies significantly across different CNN random seeds (e.g., F1 ranging from 0.72 to 0.85). The condensed subset size varies by more than 20% across seeds. Different runs produce qualitatively different decision boundaries, visible in 2D projections.

Mitigation

Run CNN with multiple random seeds (5-10) and either: (a) take the intersection of condensed sets (most conservative), (b) take the union (most comprehensive), or (c) use majority voting -- retain samples that appear in >50% of condensed sets. Option (c) produces the most stable results.

Catastrophic slowdown on large-scale datasets

Cause

For datasets with >500K samples, each CNN pass requires O(n * |STORE| * d) distance computations. As the STORE grows to tens of thousands of prototypes, each pass takes minutes. With 5-10 passes, total runtime can exceed hours.

Symptoms

CNN appears to hang during execution. Memory usage grows as the STORE's KD-tree index expands. Progress slows with each pass as the STORE grows larger.

Mitigation

Use approximate nearest neighbors (FAISS IVF, Annoy) instead of exact KD-tree queries. Pre-reduce the dataset with random undersampling to 10-20x the minority class size before applying CNN. Implement mini-batch CNN: process the GRABBAG in random batches and update the STORE incrementally, reducing per-pass cost.

Placement in an ML System

CNN sits in the data preprocessing stage after feature engineering and scaling, but before model training. In a production ML system, it is typically used during the training pipeline (not inference) to prepare a balanced, reduced training set. The condensed subset can be cached and reused across training runs if the data does not change frequently.

Pipeline Stage

Data Preprocessing / Resampling

Upstream

  • Data Ingestion / Data Loader
  • Feature Engineering / Feature Store
  • Feature Scaling (StandardScaler)
  • Train-Test Split

Downstream

  • Model Training
  • Cross-Validation
  • Noise Cleaning (ENN)
  • Hyperparameter Tuning

Scaling Bottlenecks

Where CNN Gets Tight

The bottleneck is iterative nearest-neighbor search. Unlike single-pass methods, CNN requires multiple passes through the GRABBAG, querying a growing STORE.

Dataset size thresholds:

  • <50K: seconds, no optimization needed
  • 50K-200K: 1-10 minutes, reduce features first
  • 200K-1M: 10-60 minutes, use approximate NN
  • 1M: impractical without pre-reduction

Parallelization: The GRABBAG scan within each pass can be parallelized (n_jobs=-1). Passes themselves are sequential because the STORE changes after transfers.

Production Case Studies

RazorpayFintech / Payment Fraud Detection

Razorpay processes millions of transactions daily with ~0.1-0.3% fraud rate (1000:1 imbalance). The team evaluated CNN, random undersampling, and NearMiss for reducing the majority class before training gradient-boosted tree models for fraud detection.

Outcome:

CNN-based undersampling improved fraud recall by 12% over random undersampling while maintaining 95%+ precision. The condensed training set was 85% smaller, reducing training time from 45 to 8 minutes. Adding ENN post-CNN improved F1 by an additional 3%.

FlipkartE-commerce / Product Return Prediction

Flipkart's supply chain team built a return-prediction model with only 8% positive rate. CNN was used to create a condensed training set retaining boundary samples between returnable and non-returnable orders based on product category, price, customer history, and distance features.

Outcome:

CNN reduced majority-class samples by 78% while retaining all return-positive samples. XGBoost on CNN-processed data achieved 0.71 F1 vs. 0.64 with random undersampling and 0.67 with SMOTE.

PractoHealthcare / Disease Risk Screening

Practo's health screening platform flags patients for further diagnostics with 2-5% positive rate per condition. CNN was evaluated in a resampling comparison for training classifiers on structured symptom data with 40-60 features.

Outcome:

CNN reduced training sets by 72% while maintaining diagnostic sensitivity at 0.89, vs. 0.83 with no resampling and 0.86 with SMOTE. Effective for well-separated conditions but less so for conditions with diffuse symptom overlap.

Tooling & Ecosystem

The canonical Python implementation of CNN, part of the imbalanced-learn library (scikit-learn-contrib). Provides CondensedNearestNeighbour class with configurable n_neighbors, random_state, and sampling_strategy. Fully compatible with scikit-learn Pipelines via the fit_resample API. Initializes the STORE with all minority-class samples by default.

The underlying nearest-neighbor implementation used by imbalanced-learn's CNN. Supports KD-tree, Ball tree, and brute-force algorithms with multiple distance metrics. Useful for building custom CNN variants with different neighbor retrieval strategies or approximate nearest neighbors.

GPU-accelerated approximate nearest neighbor library from Meta. Can replace scikit-learn's exact NN search in custom CNN implementations for large datasets (>500K samples). The IVF (Inverted File) index provides 10-100x speedup over exact search with minimal accuracy loss, making CNN feasible for million-scale datasets.

Spotify's lightweight approximate nearest neighbor library. Uses random projection trees for fast approximate NN queries. Suitable for custom CNN implementations where memory is constrained (Annoy uses memory-mapped files). Provides configurable accuracy-speed tradeoff via the number of trees.

Complementary noise-cleaning method often used after CNN. The EditedNearestNeighbours and RepeatedEditedNearestNeighbours classes in imbalanced-learn remove samples whose neighbors disagree with their labels -- the exact type of noise that CNN tends to retain. The CNN+ENN two-stage pipeline is a well-established practice.

Research & References

The Condensed Nearest Neighbor Rule

Hart, P.E. (1968)IEEE Transactions on Information Theory, Vol. 14, No. 3

The foundational paper introducing the CNN rule. Hart proved that the condensed subset is consistent with the full dataset for 1-NN classification and that the algorithm converges in finite iterations. Established the concept of prototype selection for nearest-neighbor classifiers.

A Study of the Behavior of Several Methods for Balancing Machine Learning Training Data

Batista, G.E., Prati, R.C., Monard, M.C. (2004)ACM SIGKDD Explorations Newsletter, Vol. 6, No. 1

Comprehensive comparison of oversampling, undersampling, and combined methods including CNN, ENN, Tomek Links, SMOTE, and their combinations. Demonstrated that combined methods (SMOTE+ENN, SMOTE+Tomek) generally outperform individual methods. Established CNN+ENN as a viable two-stage strategy.

Nearest Neighbor Pattern Classification

Cover, T., Hart, P. (1967)IEEE Transactions on Information Theory, Vol. 13, No. 1

The seminal paper proving that the 1-NN error rate is bounded above by twice the Bayes rate. This theoretical result motivated Hart's subsequent CNN work by establishing that 1-NN classifiers have strong theoretical guarantees worth preserving through dataset condensing.

Reduced Nearest Neighbor Rule

Gates, G.W. (1972)IEEE Transactions on Information Theory, Vol. 18, No. 3

Proposed the Reduced Nearest Neighbor (RNN) rule as a post-processing step to further shrink CNN's condensed subset. RNN iteratively removes samples from the condensed set that can be removed without violating consistency. The CNN+RNN pipeline produces subsets closer to the true minimal consistent subset.

Interview & Evaluation Perspective

Common Interview Questions

  • What is the Condensed Nearest Neighbour rule and how does it differ from random undersampling?

  • Explain the STORE and GRABBAG data structures in CNN. Why does the algorithm converge?

  • Why is CNN sensitive to noise? How would you mitigate this in a production pipeline?

  • Compare CNN with ENN. When would you use one versus the other?

  • How would you scale CNN to a dataset with 10 million samples?

  • What is a consistent subset? Is CNN's result guaranteed to be minimal?

  • When would CNN perform poorly as an undersampling strategy?

Key Points to Mention

  • CNN finds a consistent subset: the smallest subset that classifies all original data correctly via 1-NN. It is a prototype selection method, not random deletion.

  • Hart's 1968 algorithm uses STORE (condensed set) and GRABBAG (candidates). Misclassified GRABBAG samples transfer to STORE until convergence.

  • CNN preserves boundary samples and removes interior redundancy, achieving 70-95% reduction while maintaining 1-NN accuracy.

  • Major weakness: CNN retains noisy/mislabeled samples because they are always misclassified by their correctly-labeled neighbors. Combine with ENN for noise cleaning.

  • Non-deterministic: results depend on random seed and sample ordering. Not suitable when exact reproducibility is required.

  • Computationally expensive: O(n * |S| * d) per pass, with multiple passes. Impractical for >500K samples without approximate NN.

Pitfalls to Avoid

  • Confusing CNN (Condensed Nearest Neighbour) with CNN (Convolutional Neural Network) -- always clarify which CNN you mean in an interview context.

  • Claiming CNN removes noise -- it does the opposite. CNN preserves noisy boundary samples. ENN removes noise.

  • Saying CNN produces the minimal consistent subset -- Hart's greedy algorithm is approximate. The true minimum is NP-hard to find.

  • Forgetting to mention that CNN requires feature scaling since it uses distance-based nearest-neighbor queries.

  • Overlooking the non-determinism issue -- interviewers often ask about reproducibility.

Senior-Level Expectation

A senior ML engineer should articulate consistent subsets, Voronoi tessellations, and Cover & Hart's 1-NN error bounds. They should discuss failure modes (noise amplification, curse of dimensionality, initialization sensitivity) with production examples. They should know the CNN+ENN pattern and why the methods are complementary. They should compare CNN with modern alternatives like learned coresets and data distillation, and discuss when NOT to use CNN: tree-based models, noisy labels, and real-time inference scenarios.

Summary

Condensed Nearest Neighbour (CNN) is a foundational instance selection algorithm (Hart, 1968) that finds the smallest subset of a dataset correctly classifying all original samples via 1-NN. It iteratively builds a condensed set (STORE) by adding samples that the current STORE misclassifies, converging when all remaining samples are correctly classified. The result contains only boundary prototypes while interior redundancy is discarded.

As an undersampling technique, CNN retains all minority-class samples while reducing the majority class by 70-95%. Its key strength is preserving decision-boundary geometry; its key weakness is noise sensitivity -- mislabeled boundary samples are always retained. The CNN+ENN two-stage pipeline addresses this by combining CNN's bulk reduction with ENN's noise cleaning.

CNN is implemented in imbalanced-learn as CondensedNearestNeighbour and is most effective with distance-based classifiers, moderate imbalance ratios, and clean labels. For datasets exceeding 500K samples, approximate NN methods are necessary for computational feasibility.

ML System Design Reference · Built by QnA Lab