In machine learning, features or dimensions are the individual independent variables that act as inputs to your system. A model uses these features to make its predictions; for instance, in a dataset of houses, the dimensions could include the house's price, size, number of bedrooms, location, and so on. The number of features used to model a machine-learning algorithm depends on how well they capture the essence of the problem. With more dimensions, however, comes the curse of dimensionality: high-dimensional datasets pose several practical concerns for machine learning algorithms, such as increased computation time and storage requirements, and they are hard to visualize, which makes exploratory data analysis more difficult. Perhaps the biggest concern, though, is decreased accuracy in predictive models: statistical and machine learning models trained on high-dimensional datasets often generalize poorly.
So what do we do to solve this problem?
Here comes the technique of Dimensionality Reduction. Dimensionality reduction is a technique used to reduce the number of features in a dataset while retaining as much of the important information as possible. In other words, it is a process of transforming high-dimensional data into a lower-dimensional space that still preserves the essence of the original data. There are several techniques for dimensionality reduction, including principal component analysis (PCA), linear discriminant analysis (LDA), and t-Distributed Stochastic Neighbor Embedding (t-SNE).
Let's go deeper into these dimensionality reduction techniques: where we use them and the nitty-gritty behind each one.
PCA (Principal Component Analysis) is a linear transformation algorithm that seeks to project the original features of our data onto a smaller set of features (a subspace) while still retaining most of the information. To do this, the algorithm tries to find the directions (the principal components) that maximize the variance in the new subspace. In a nutshell, PCA finds a new set of axes to represent the data so that a few principal components contain most of the information. When going from a high-dimensional space (d) to a lower-dimensional space (d'), it preserves the directions that have high variance and therefore carry the most information.
Step 1: Standardize the Data
Standardization is needed before performing PCA because PCA is very sensitive to the variances of the features: if there are large differences between the scales (ranges) of the original variables, the variables with larger scales will dominate those with smaller scales. Standardization transforms each feature by subtracting its mean and dividing by its standard deviation.
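To make this concrete, here is a minimal NumPy sketch of standardization on a tiny, made-up feature matrix (the values are purely illustrative); scikit-learn's StandardScaler, used later in this article, does the same thing:

import numpy as np
# Toy feature matrix: five samples, two features on very different scales
X = np.array([[1.0, 100.0],
              [2.0, 150.0],
              [3.0, 120.0],
              [4.0, 180.0],
              [5.0, 160.0]])
# Standardize each feature: subtract its mean and divide by its standard deviation
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
print(X_std.mean(axis=0))  # roughly 0 for each feature
print(X_std.std(axis=0))   # exactly 1 for each feature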
Step 2: Compute Covariance Matrix
This step aims to understand how the variables of the input dataset vary from the mean with respect to each other; in other words, to see whether there is any relationship between them. Sometimes the original variables are highly correlated and therefore contain redundant information. To identify these correlations, we compute the covariance matrix.
The covariance matrix is a symmetric d x d matrix (where d is the number of features) whose (i, j) entry is the covariance between the i-th and j-th features. The sign of each entry tells us how the corresponding pair of variables is related: a positive covariance means the two variables increase or decrease together, while a negative covariance means one increases as the other decreases.
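As a small illustration (using the Wine data that the full example below also uses; the variable names are just for this sketch), the covariance matrix of the standardized features can be computed with NumPy:

import numpy as np
from sklearn.datasets import load_wine
X = load_wine().data
X_std = (X - X.mean(axis=0)) / X.std(axis=0)  # standardize first, as in Step 1
# Covariance matrix of the standardized features: 13 x 13 for the Wine data
cov_matrix = np.cov(X_std, rowvar=False)
print(cov_matrix.shape)  # (13, 13)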
Step 3: Compute the Eigenvectors and Eigenvalues
Here, we calculate the eigenvectors (principal components) and eigenvalues of the covariance matrix. The eigenvectors represent the directions of maximum variance in the data, and the corresponding eigenvalues represent the amount of variance captured along each direction. Ranking the eigenvectors by their eigenvalues gives the order of the principal components. In this way, PCA aims to preserve the Euclidean distances between points as much as possible when reducing dimensionality.
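Continuing the same sketch (the setup is repeated so the snippet runs on its own), the eigenvalues and eigenvectors can be obtained and ranked like this:

import numpy as np
from sklearn.datasets import load_wine
X = load_wine().data
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
cov_matrix = np.cov(X_std, rowvar=False)
# Eigen-decomposition of the symmetric covariance matrix
eigenvalues, eigenvectors = np.linalg.eigh(cov_matrix)
# Sort the eigenvectors by decreasing eigenvalue (largest variance first)
order = np.argsort(eigenvalues)[::-1]
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]
print(eigenvalues[:3])  # variance captured along the top three components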
Step 4: Choose Which Components to Keep
Here, we decide which components to keep and which to discard; components with low eigenvalues are typically not significant. Scree plots usually show the proportion of total variance explained by each component along with the cumulative proportion, and these metrics help determine the optimal number of components to retain. The point at which the curve of eigenvalues (or variance explained) forms an "elbow" generally indicates how many principal components to include.
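A rough sketch of such a plot for the Wine data (the plotting choices are illustrative) using scikit-learn's explained_variance_ratio_:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
X = StandardScaler().fit_transform(load_wine().data)
pca = PCA().fit(X)  # keep all components to inspect the variance profile
explained = pca.explained_variance_ratio_
components = range(1, len(explained) + 1)
plt.plot(components, explained, marker='o', label='per component')
plt.plot(components, np.cumsum(explained), marker='s', label='cumulative')
plt.xlabel('Principal component')
plt.ylabel('Proportion of variance explained')
plt.title('Scree plot for the Wine dataset')
plt.legend()
plt.show()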
Step 5: Project the Data onto the Principal Components
Finally, the data is transformed into the new coordinate system defined by the principal components: the feature vector built from the eigenvectors of the covariance matrix is used to project the data onto the new axes. This produces a new dataset that captures most of the information of the original one but with fewer dimensions.
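Putting the steps together, a minimal from-scratch projection onto the top two components might look like this (again on the Wine data, with illustrative variable names); the scikit-learn example below performs the same reduction with the PCA class:

import numpy as np
from sklearn.datasets import load_wine
X = load_wine().data
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
eigenvalues, eigenvectors = np.linalg.eigh(np.cov(X_std, rowvar=False))
order = np.argsort(eigenvalues)[::-1]
eigenvectors = eigenvectors[:, order]
# Feature vector: keep the top two principal components and project the data onto them
W = eigenvectors[:, :2]      # 13 x 2 projection matrix
X_projected = X_std @ W      # 178 x 2: same samples, far fewer dimensions
print(X_projected.shape)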
Here we are using the Wine dataset from the UCI Machine Learning Repository. This dataset contains chemical analysis results of wines grown in the same region in Italy but derived from three different cultivars. It has 13 features, making it suitable for demonstrating the effect of PCA on high-dimensional data.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.datasets import load_wine
# Load the Wine dataset
wine = load_wine()
X = wine.data
y = wine.target
# Convert to DataFrame for better visualization
df = pd.DataFrame(X, columns=wine.feature_names)
df['class'] = y
df['class'] = df['class'].map({0: wine.target_names[0], 1: wine.target_names[1], 2: wine.target_names[2]})
# Plotting the original dataset
plt.figure(figsize=(12, 6))
plt.subplot(1, 2, 1)
for class_label in wine.target_names:
    subset = df[df['class'] == class_label]
    plt.scatter(subset.iloc[:, 0], subset.iloc[:, 1], label=class_label)
plt.xlabel(wine.feature_names[0])
plt.ylabel(wine.feature_names[1])
plt.title('Original Wine Dataset')
plt.legend()
# Apply PCA (for simplicity the features are not standardized here; in practice you would scale them first, as discussed above)
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)
# Create a DataFrame for the PCA results
df_pca = pd.DataFrame(X_pca, columns=['Principal Component 1', 'Principal Component 2'])
df_pca['class'] = y
df_pca['class'] = df_pca['class'].map({0: wine.target_names[0], 1: wine.target_names[1], 2: wine.target_names[2]})
# Plotting the PCA-transformed dataset
plt.subplot(1, 2, 2)
for class_label in wine.target_names:
    subset = df_pca[df_pca['class'] == class_label]
    plt.scatter(subset['Principal Component 1'], subset['Principal Component 2'], label=class_label)
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.title('PCA of Wine Dataset')
plt.legend()
plt.tight_layout()
plt.show()
PCA helps in reducing the dimensionality of the dataset while preserving as much variance as possible, making it easier to visualize the data.
The relationship between PCA and neural networks is complex, but it's fascinating to discover that they are related. More specifically, under certain conditions PCA can be understood as a special case of an unsupervised neural network known as an autoencoder.
An autoencoder is a type of artificial neural network used for unsupervised learning of efficient codings. Autoencoders compress the learned representation of the data: they consist of an encoder, which maps the input data into a low-dimensional latent space that captures its most important features, and a decoder, which tries to reconstruct the original data from this compressed representation.
For instance, when the loss function used for training is the mean squared error (MSE) and the encoder and decoder are linear, the autoencoder exhibits behavior very similar to PCA: through training it learns to encode the data onto the same hyperplane (subspace) that PCA would find.
This link shows the basic principle at work in both methods: PCA reduces dimensionality by looking for the directions along which the data varies the most, and an autoencoder with MSE loss and a linear decoder similarly learns to squeeze the most useful information into its bottleneck so it can reconstruct the input.
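As a rough illustration of this connection, here is a minimal sketch of a linear autoencoder in PyTorch (assuming PyTorch is available; the layer sizes, learning rate, and number of epochs are arbitrary choices for illustration). With linear layers and an MSE loss, the two-dimensional code it learns spans approximately the same subspace as the top two principal components:

import torch
import torch.nn as nn
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler
X = StandardScaler().fit_transform(load_wine().data)
X = torch.tensor(X, dtype=torch.float32)
# Linear encoder and decoder: 13 features -> 2-dimensional code -> 13 features
encoder = nn.Linear(13, 2, bias=False)
decoder = nn.Linear(2, 13, bias=False)
optimizer = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-2)
loss_fn = nn.MSELoss()
for epoch in range(500):
    optimizer.zero_grad()
    reconstruction = decoder(encoder(X))   # compress to 2-D, then reconstruct
    loss = loss_fn(reconstruction, X)      # MSE reconstruction loss
    loss.backward()
    optimizer.step()
codes = encoder(X).detach().numpy()  # 2-D representation comparable to PCA scores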
Figure: the MNIST response variable projected onto a reduced feature space containing only two dimensions. PCA (left) forces a linear projection, whereas an autoencoder with non-linear activation functions allows a non-linear projection.
Linear Discriminant Analysis (LDA), also known as Normal Discriminant Analysis or Discriminant Function Analysis, is a dimensionality reduction technique primarily utilized in supervised classification problems. It facilitates modeling distinctions between groups, effectively separating two or more classes. LDA operates by projecting features from a higher-dimensional space into a lower-dimensional one.
LDA is like PCA in that it helps with dimensionality reduction; however, it focuses on maximizing the separability among the known categories by creating a new linear axis and projecting the data points onto that axis.
There are some constraints to bear in mind, as the model assumes the following: the features are normally distributed within each class, all classes share the same covariance matrix, and the features are not strongly correlated with one another (little multicollinearity). For these reasons, LDA may not perform well in high-dimensional feature spaces.
LDA focuses primarily on projecting the features from the higher-dimensional space down to a lower-dimensional one. You can achieve this in three steps: first, compute the within-class and between-class scatter matrices (how much the samples vary inside each class versus how far apart the class means are); second, compute the eigenvectors and eigenvalues of the matrix formed from them (the inverse of the within-class scatter times the between-class scatter); third, project the data onto the eigenvectors with the largest eigenvalues.
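Here is a rough NumPy sketch of these three steps on the Wine data (the variable names are purely illustrative):

import numpy as np
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler
wine = load_wine()
X = StandardScaler().fit_transform(wine.data)
y = wine.target
n_features = X.shape[1]
overall_mean = X.mean(axis=0)
# Step 1: within-class (S_w) and between-class (S_b) scatter matrices
S_w = np.zeros((n_features, n_features))
S_b = np.zeros((n_features, n_features))
for c in np.unique(y):
    X_c = X[y == c]
    mean_c = X_c.mean(axis=0)
    S_w += (X_c - mean_c).T @ (X_c - mean_c)
    diff = (mean_c - overall_mean).reshape(-1, 1)
    S_b += len(X_c) * diff @ diff.T
# Step 2: eigen-decomposition of S_w^-1 S_b
eigvals, eigvecs = np.linalg.eig(np.linalg.inv(S_w) @ S_b)
order = np.argsort(eigvals.real)[::-1]
# Step 3: project onto the top (n_classes - 1) = 2 discriminant axes
W = eigvecs[:, order[:2]].real
X_lda = X @ W
print(X_lda.shape)  # (178, 2)

In practice, scikit-learn's LinearDiscriminantAnalysis handles all of this, as in the example that follows.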
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler
# Load the Wine dataset
wine = load_wine()
X = wine.data
y = wine.target
# Standardize the features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Apply LDA
lda = LDA(n_components=2)
X_lda = lda.fit_transform(X_scaled, y)
# Plotting
plt.figure(figsize=(12, 5))
# Original data (first two features)
plt.subplot(1, 2, 1)
for i, class_label in enumerate(wine.target_names):
    mask = y == i
    plt.scatter(X_scaled[mask, 0], X_scaled[mask, 1], label=class_label)
plt.xlabel(wine.feature_names[0])
plt.ylabel(wine.feature_names[1])
plt.title('Original Wine Dataset\n(First Two Features)')
plt.legend()
# LDA
plt.subplot(1, 2, 2)
for i, class_label in enumerate(wine.target_names):
    mask = y == i
    plt.scatter(X_lda[mask, 0], X_lda[mask, 1], label=class_label)
plt.xlabel('First LDA Component')
plt.ylabel('Second LDA Component')
plt.title('LDA of Wine Dataset')
plt.legend()
plt.tight_layout()
plt.show()
# Print LDA explained variance ratio
print("LDA explained variance ratio:", lda.explained_variance_ratio_)
Output: LDA explained variance ratio: [0.68747889 0.31252111]
The LDA explained variance ratio indicates how much of the discriminatory information in the data is captured by each Linear Discriminant Analysis (LDA) component. It is similar in spirit to the explained variance ratio in Principal Component Analysis (PCA), but with an important difference: PCA components are ranked by how much of the total variance in the data they capture, whereas LDA components are ranked by how well they separate the classes (the between-class variance relative to the within-class variance).
t-Distributed Stochastic Neighbor Embedding (t-SNE) is a dimensionality reduction technique that is particularly well suited for visualizing high-dimensional datasets. t-SNE is a non-linear dimensionality reduction technique, which means it can capture structure in data that cannot be separated by a straight line (a linear projection).
The t-SNE algorithm computes a similarity measure between pairs of instances in both the high-dimensional space and the low-dimensional space, and then tries to make the two sets of similarities match. It does this in three steps.
In the higher-dimensional space: pairwise similarities are computed by centering a Gaussian distribution over each point and converting the distances to other points into conditional probabilities, so that nearby points receive high similarity and distant points low similarity.
In the lower-dimensional space: similarities between the embedded points are computed in the same spirit, but using a heavy-tailed Student's t-distribution (with one degree of freedom) instead of a Gaussian, which helps spread out dissimilar points and avoid crowding.
t-SNE then minimizes the sum of the Kullback-Leibler (KL) divergences between the two sets of similarities over all data points, typically by gradient descent.
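To make the two similarity measures concrete, here is a deliberately simplified NumPy sketch (a real implementation tunes a per-point Gaussian bandwidth from the perplexity and symmetrizes the conditional probabilities; this only shows the shape of the computation, with made-up data):

import numpy as np
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))   # five points in a 3-D "high-dimensional" space
Y = rng.normal(size=(5, 2))   # their (random, not yet optimized) 2-D embedding
sigma = 1.0                   # fixed bandwidth for simplicity

def squared_distances(A):
    # pairwise squared Euclidean distances between the rows of A
    return ((A[:, None, :] - A[None, :, :]) ** 2).sum(axis=-1)

# High-dimensional similarities: Gaussian kernel, normalized over all pairs
P = np.exp(-squared_distances(X) / (2 * sigma ** 2))
np.fill_diagonal(P, 0.0)
P /= P.sum()
# Low-dimensional similarities: heavy-tailed Student's t-distribution (one degree of freedom)
Q = 1.0 / (1.0 + squared_distances(Y))
np.fill_diagonal(Q, 0.0)
Q /= Q.sum()
# The objective t-SNE minimizes: the KL divergence between P and Q
mask = P > 0
kl_divergence = np.sum(P[mask] * np.log(P[mask] / Q[mask]))
print(kl_divergence)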
The optimization process allows the creation of clusters and sub-clusters of similar data points in the lower-dimensional space that are visualized to understand the structure and relationship in the higher-dimensional data.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler
# Load the Wine dataset
wine = load_wine()
X = wine.data
y = wine.target
# Standardize the features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Apply t-SNE
tsne = TSNE(n_components=2, random_state=42)
X_tsne = tsne.fit_transform(X_scaled)
# Plotting
plt.figure(figsize=(12, 5))
# Original data (first two features)
plt.subplot(1, 2, 1)
for i, class_label in enumerate(wine.target_names):
    mask = y == i
    plt.scatter(X_scaled[mask, 0], X_scaled[mask, 1], label=class_label)
plt.xlabel(wine.feature_names[0])
plt.ylabel(wine.feature_names[1])
plt.title('Original Wine Dataset\n(First Two Features)')
plt.legend()
# t-SNE
plt.subplot(1, 2, 2)
for i, class_label in enumerate(wine.target_names):
    mask = y == i
    plt.scatter(X_tsne[mask, 0], X_tsne[mask, 1], label=class_label)
plt.xlabel('First t-SNE Component')
plt.ylabel('Second t-SNE Component')
plt.title('t-SNE of Wine Dataset')
plt.legend()
plt.tight_layout()
plt.show()
Apart from visualizing complex multi-dimensional data, t-SNE has other applications, many of them in the medical field.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
from sklearn.manifold import TSNE
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score
from sklearn.datasets import load_wine
# Load the Wine dataset
wine = load_wine()
X = wine.data
y = wine.target
# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Function to train and evaluate a Random Forest classifier
def evaluate_model(X_train, X_test, y_train, y_test):
    model = RandomForestClassifier(random_state=42)
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    return f1_score(y_test, y_pred, average='weighted'), model
# Evaluate the model on the original dataset
original_f1, original_model = evaluate_model(X_train, X_test, y_train, y_test)
# Apply PCA
pca = PCA(n_components=2)
X_train_pca = pca.fit_transform(X_train)
X_test_pca = pca.transform(X_test)
pca_f1, pca_model = evaluate_model(X_train_pca, X_test_pca, y_train, y_test)
# Apply LDA
lda = LDA(n_components=2)
X_train_lda = lda.fit_transform(X_train, y_train)
X_test_lda = lda.transform(X_test)
lda_f1, lda_model = evaluate_model(X_train_lda, X_test_lda, y_train, y_test)
# Apply t-SNE
# Note: t-SNE has no transform() method for new data, so the train and test sets
# are embedded independently here, which limits how comparable the two embeddings are.
tsne = TSNE(n_components=2, random_state=42, perplexity=5, n_iter=300)
X_train_tsne = tsne.fit_transform(X_train[:100])  # using a subset to speed up t-SNE
X_test_tsne = tsne.fit_transform(X_test[:100])    # using a subset to speed up t-SNE
tsne_f1, tsne_model = evaluate_model(X_train_tsne, X_test_tsne, y_train[:100], y_test[:100])
# Print the F1 scores
print(f'F1 score on original dataset: {original_f1:.2f}')
print(f'F1 score on PCA-reduced dataset: {pca_f1:.2f}')
print(f'F1 score on LDA-reduced dataset: {lda_f1:.2f}')
print(f'F1 score on t-SNE-reduced dataset: {tsne_f1:.2f}')
Output:
F1 score on original dataset: 1.00
F1 score on PCA-reduced dataset: 0.74
F1 score on LDA-reduced dataset: 0.98
F1 score on t-SNE-reduced dataset: 0.74
On this dataset, LDA is the most effective technique when both dimensionality reduction and high classification performance are required at the same time. On the other hand, if easier visual representation is the goal, you may use either PCA or t-SNE, although they may not preserve enough class-discriminative information for classification.
This article is written by Gaurav Sharma, a member of 123 of AI, and edited by the 123 of AI team.
🚀 "Build ML Pipelines Like a Pro!" 🔥 From data collection to model deployment, this guide breaks down every step of creating machine learning pipelines with top resources
Explore top AI tools transforming industries—from smart assistants like Alexa to creative powerhouses like ChatGPT and Aiva. Unlock the future of work, creativity, and business today!
Master the art of model selection to supercharge your machine-learning projects! Discover top strategies to pick the perfect model for flawless predictions!