Feature Selection

October 24, 2024

Feature Selection - An Introduction

As someone learning data science with Python, understanding the significance of feature selection is paramount in constructing effective machine learning models. In practical data science scenarios, it's uncommon for every variable in a dataset to contribute meaningfully to model building. Including redundant variables can diminish a model's ability to generalize and potentially lower classifier accuracy. Moreover, each additional variable increases the overall complexity of the model.

As per the Principle of Parsimony, often referred to as Occam's Razor, the best explanation of a problem is the one that makes the fewest assumptions. This principle underscores the importance of feature selection in machine learning model development.

Learning objectives:

  • Understanding the significance of feature selection.
  • Familiarizing with various feature selection methods.
  • Putting feature selection approaches into practice and evaluating performance.

Table of Contents

  1. Introduction to Feature Selection in Machine Learning
  • Feature Selection Techniques in Supervised Learning
  • Feature Selection Techniques in Unsupervised Learning
  2. Types of Feature Selection Methods:
  • Filter
  • Wrapper
  • Embedded

How does Feature Selection in Machine Learning work?

Feature selection techniques in machine learning aim to identify the most effective set of features for constructing optimized models of observed phenomena.

These techniques generally fall into several categories:

Feature Selection Techniques in Supervised Learning

These methods are suitable for labeled datasets and aim to pinpoint the relevant features that improve the performance of supervised models for tasks such as classification and regression. Examples of such models include linear regression, decision trees, and SVMs.

Feature Selection Techniques in Unsupervised Learning

These methods are applicable to data without labels. Examples include K-Means Clustering, Principal Component Analysis, and Hierarchical Clustering.

In terms of how they select features, these techniques fall into categories such as filter, wrapper, embedded, and hybrid methods.

Next, we will delve into detailed explanations of these widely used feature selection methods in machine learning.

We'll be considering the Pima Indians Diabetes dataset. Please download the dataset from the provided link to follow along with the examples.

Code: https://colab.research.google.com/drive/1Wxm9RGuDtV_0kBWgEjax5TZTPbKa8Kvq

Types of Feature Selection Methods:

Filter Methods

Filter methods evaluate features based on their inherent properties, measured with univariate statistics, rather than on cross-validation performance. They are quicker and less computationally intensive than wrapper methods, which makes them especially cost-effective for high-dimensional data.

Some commonly used filter methods include:

Information Gain

Information gain measures the reduction in entropy when transforming a dataset. Entropy is a measure of randomness or uncertainty in the data; higher entropy means more disorder, and lower entropy means less disorder. In the context of feature selection, information gain evaluates how much knowing a feature helps in predicting the target variable. By calculating the information gain of each variable, we can determine which features reduce the uncertainty (entropy) the most and are therefore the most informative for our model.

import pandas as pd
from sklearn.feature_selection import mutual_info_classif
import matplotlib.pyplot as plt
%matplotlib inline

# Load the dataset
dataframe = pd.read_csv('/content/diabetes.csv')

# Separate the features (all columns except the last one) from the target variable (the last column)
X = dataframe.iloc[:, :-1]
Y = dataframe.iloc[:, -1]

# Compute the mutual information between each feature and the target
importances = mutual_info_classif(X, Y)

# Create a series with the feature importances
feat_importances = pd.Series(importances, index=X.columns)

# Plot the feature importances
feat_importances.plot(kind='barh', color='teal')
plt.xlabel('Mutual Information')
plt.ylabel('Features')
plt.title('Feature Importances')
plt.show()
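
The plot above ranks the features, but to actually select a subset we still need a cutoff. A small sketch on top of the code above (the choice of k below is arbitrary and not from the original post) is to keep the k features with the highest mutual information:

# Keep the k features with the highest mutual information (k=4 is an arbitrary choice)
top_k = 4
top_features = feat_importances.nlargest(top_k)
print('Top features by mutual information:')
print(top_features)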


Chi-square Test

The Chi-squared test is used for selecting categorical variables in a dataset. By calculating the Chi-square statistic between each feature and the target variable, we can choose the features with the highest Chi-square scores. To properly use the Chi-square test to examine the relationship between the features and the output variable, certain conditions must be met: the variables must be categorical and independently sampled, and each value should have an expected frequency greater than 5.

from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2

# Convert the features to integer values so they can be treated as categorical (chi2 requires non-negative values)
X_cat = X.astype(int)

# Three features with highest chi-squared statistics are selected
chi2_features = SelectKBest(chi2, k=3)
X_kbest_features = chi2_features.fit_transform(X_cat, Y)

# Reduced features
print('Original feature number:', X_cat.shape[1])
print('Reduced feature number:', X_kbest_features.shape[1])
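
To see which columns were actually retained (a small usage sketch, not part of the original snippet), we can map the selector's Boolean mask back to the column names:

# Map the selector's Boolean mask back to the original column names
selected_columns = X_cat.columns[chi2_features.get_support()]
print('Selected features:', list(selected_columns))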

Fisher's Score

Fisher's Score is a popular supervised feature selection method. It ranks variables by their Fisher score in descending order, and we can then select the top-ranked variables for our specific case.

!pip install skfeature-chappers

from skfeature.function.similarity_based import fisher_score
import matplotlib.pyplot as plt
%matplotlib inline

# Calculating scores
ranks = fisher_score.fisher_score(X.values, Y.values)

# Plotting the ranks
feat_importances = pd.Series(ranks, index=X.columns)
feat_importances.plot(kind='barh', color='teal')
plt.show()

Correlation Coefficient

The correlation coefficient quantifies the linear relationship between two variables and helps us predict one variable from another. Using correlation for feature selection rests on the assumption that good predictor variables correlate strongly with the target. Ideally, features should correlate with the target but not with each other.

When two variables are correlated, one can be predicted from the other. Thus, if two features are correlated, including both in the model doesn't provide additional information. In this context, we'll apply Pearson Correlation.

Code

import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

# Correlation matrix
cor = dataframe.corr()

# Plotting Heatmap
plt.figure(figsize = (10,6))
sns.heatmap(cor, annot=True)

To proceed, we establish a threshold, such as an absolute value of 0.5, for variable selection. If predictor variables are found to be correlated, we prioritize those with higher correlation coefficients with the target variable. It's also essential to assess multiple correlation coefficients to identify multicollinearity, where several variables may correlate with each other.
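
As a minimal sketch of this thresholding step (assuming the target is the last column of the dataframe, 'Outcome' in the Pima dataset, and reusing the correlation matrix computed above):

# Absolute correlation of each feature with the target (assumed to be the last column)
target_col = dataframe.columns[-1]
cor_target = cor[target_col].drop(target_col).abs()

# Keep only features whose absolute correlation with the target exceeds the threshold
threshold = 0.5
selected = cor_target[cor_target > threshold].index.tolist()
print('Features with |correlation| >', threshold, ':', selected)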

Variance Threshold

The variance threshold offers a straightforward method for feature selection. It eliminates features whose variance falls below a specified threshold. By default, it filters out features with zero variance—those that have the same value across all samples. While higher-variance features are typically presumed to hold more valuable information, this method does not consider relationships between features or between features and the target variable. This limitation is a notable drawback of filter methods.

from sklearn.feature_selection import VarianceThreshold

# Reset X to the original features by selecting the first 8 columns
X = X.iloc[:, 0:8]

# fit identifies the features with zero variance
v_threshold = VarianceThreshold(threshold=0)
v_threshold.fit(X)

v_threshold.get_support()

The get_support function returns a Boolean vector. A True value indicates that the variable does not have zero variance.
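
As a small usage sketch (not part of the original code), the Boolean mask can be used to keep only the retained columns:

# Keep only the columns whose variance is above the threshold
mask = v_threshold.get_support()
X_reduced = X.loc[:, mask]
print('Retained features:', list(X_reduced.columns))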

Mean Absolute Difference (MAD)

The Mean Absolute Difference (MAD) calculates the average absolute deviation of data points from their mean. Unlike variance, MAD uses absolute differences instead of squared differences. It provides a robust measure of dispersion that is less influenced by outliers. A higher MAD indicates greater variability in the data.

import numpy as np

# Calculate MAD: average absolute deviation of each feature from its mean
mean_abs_diff = np.sum(np.abs(X - np.mean(X, axis=0)), axis=0) / X.shape[0]

# Plot the bar chart of MAD per feature
plt.bar(np.arange(X.shape[1]), mean_abs_diff, color='teal')
plt.show()

Wrapper Methods

Wrapper methods search the space of possible feature subsets, assessing the quality of each subset by training and evaluating a classifier on it. The feature selection process is therefore tied to the specific machine learning algorithm we are trying to fit on a given dataset. Most wrapper methods follow a greedy search, evaluating candidate combinations of features against an evaluation criterion. Wrapper methods usually achieve better predictive accuracy than filter methods, at a higher computational cost.

Let's explore some of these techniques:

Forward Feature Selection

This is an iterative approach in which we start with an empty set of features and, at each iteration, add the feature that best improves the model. The procedure stops when adding a new variable no longer improves the model's performance.

# Forward Feature Selection
from sklearn.linear_model import LogisticRegression
from mlxtend.feature_selection import SequentialFeatureSelector

lr = LogisticRegression()
ffs = SequentialFeatureSelector(lr, k_features='best', forward=True, n_jobs=-1)
ffs.fit(X, Y)
features = list(ffs.k_feature_names_)
lr.fit(X[features], Y)
y_pred = lr.predict(X[features])


Backward Feature Elimination

This is also an iterative approach, but here we start with all features and remove the least significant feature at each iteration. The procedure stops when removing a feature no longer improves the model's performance.

# Backward Feature Selection
from sklearn.linear_model import LogisticRegression
from mlxtend.feature_selection import SequentialFeatureSelector
from sklearn.model_selection import train_test_split

# Splitting the data into training and testing sets
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=42)
lr = LogisticRegression(class_weight='balanced', solver='lbfgs', random_state=42, n_jobs=-1, max_iter=500)
lr.fit(X, Y)
bfs = SequentialFeatureSelector(lr, k_features='best', forward=False, n_jobs=-1)
bfs.fit(X, Y)
features = list(bfs.k_feature_names_)
lr.fit(X_train[features], Y_train)
y_pred = lr.predict(X_test[features])

Exhaustive Feature Selection

This technique is the brute-force approach to evaluating feature subsets. It creates every possible subset within the allowed size range, trains a model on each, and selects the subset whose model performance is best.

from sklearn.ensemble import RandomForestClassifier
from mlxtend.feature_selection import ExhaustiveFeatureSelector

# Create an ExhaustiveFeatureSelector object
efs = ExhaustiveFeatureSelector(RandomForestClassifier(), min_features=4, max_features=8, scoring='roc_auc', cv=2)

# Fit the ExhaustiveFeatureSelector object to the training data
efs.fit(X, Y)

# Print the selected feature indices numerically
print("Selected feature indices:", efs.best_idx_)

# Print the final prediction score
print("Best score (ROC AUC):", efs.best_score_)

Recursive Feature Elimination

This greedy optimization method selects features by recursively considering smaller and smaller sets of features. The estimator is trained on the initial set of features, and each feature's importance is obtained from the coef_ or feature_importances_ attribute. The least important features are then pruned from the current set, and the process is repeated until the required number of features remains.

# Recursive Feature Elimination
from sklearn.feature_selection import RFE
rfe = RFE(lr, n_features_to_select=7)
rfe.fit(X_train, Y_train)
y_pred = rfe.predict(X_test)
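
To inspect what RFE actually kept (a small sketch, not part of the original snippet), the fitted object exposes a Boolean support mask and a ranking for every feature:

# Features kept by RFE and the ranking assigned to every feature (1 = selected)
selected = X_train.columns[rfe.support_]
print('Selected features:', list(selected))
print('Feature ranking:', rfe.ranking_)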

Embedded Methods

These methods combine the advantages of wrapper and filter methods by accounting for interactions between features while remaining computationally efficient. Embedded methods perform selection during model training itself, extracting at each iteration the features that contribute the most to that round of training.

Let's go over some of these techniques here:

Lasso Regularization (L1)

This method adds a penalty on the model's parameters to avoid overfitting. With L1 (Lasso) regularization, the penalty is applied to the coefficients, shrinking some of them to exactly zero. The features with zero coefficients can then be removed from the dataset.

from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import SelectFromModel

# Set the regularization parameter C to 1
logistic = LogisticRegression(C=1, penalty="l1", solver='liblinear', random_state=7).fit(X, Y)

# Create a SelectFromModel object
model = SelectFromModel(logistic, prefit=True)

# Transform the original feature set
X_new = model.transform(X)

# Retrieve the indices of the features selected by the L1-penalized model
selected_features_idx = model.get_support(indices=True)

# Print the selected feature indices numerically
print("Selected feature indices:", selected_features_idx)

Random Forest Importance

Random forest is a bagging algorithm that aggregates a specified number of decision trees. In tree-based models, features are naturally ranked by how much they improve node purity, that is, by the decrease in impurity (Gini impurity) they produce across all trees. The nodes with the greatest decrease in impurity appear near the top of the trees, while the nodes with the smallest decrease appear near the bottom. Thus, by pruning below a specific node, we can extract a subset of the most important features.

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
import matplotlib.pyplot as plt

# Create the random forest model with your hyperparameters
model = RandomForestClassifier(n_estimators=348, random_state=7)

# Fit the model to start training
model.fit(X, Y)

# Get the importance of the resulting features
importances = model.feature_importances_

# Create a data frame for visualization
final_df = pd.DataFrame({"Features": X.columns, "Importances": importances})

# Set 'Features' as the index
final_df.set_index('Features', inplace=True)

# Sort by importance in descending order for better visualization
final_df = final_df.sort_values('Importances', ascending=False)

# Plot the feature importances in bars
final_df.plot.bar(color='teal')
plt.title("Feature Importances")
plt.ylabel("Importance")
plt.xlabel("Features")
plt.show()
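
If we want to go one step further and actually extract a feature subset from these importances, one option (a sketch, not part of the original code; the median threshold is an arbitrary choice) is to reuse scikit-learn's SelectFromModel with the already fitted forest:

from sklearn.feature_selection import SelectFromModel

# Keep only features whose importance is above the median importance of the fitted forest
selector = SelectFromModel(model, threshold='median', prefit=True)
selected_features = X.columns[selector.get_support()]
print('Features above median importance:', list(selected_features))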

Conclusion

Feature selection is a fundamental aspect of machine learning that significantly influences the performance and interpretability of your models. By carefully choosing relevant features, you not only enhance model accuracy but also reduce complexity, making your models more efficient and easier to understand.

In this blog, we explored a variety of feature selection methods, ranging from simple filter techniques to more sophisticated wrapper and embedded methods. Each approach has its unique advantages and use cases, whether you are dealing with supervised or unsupervised learning problems. Understanding when and how to apply these techniques is crucial for developing robust machine learning models.

As you continue to refine your skills in data science, remember the principle of Occam's Razor: simpler models with fewer assumptions are often more effective. By judiciously selecting features, you adhere to this principle, ensuring your models are not only powerful but also generalizable.

Incorporate these feature selection strategies into your workflow to tackle real-world datasets efficiently. Experiment with different techniques to find the optimal feature subset for your specific problem, and always validate your choices with proper evaluation metrics.

Happy coding and model building!

Author

This article was written by Karan Shah and edited by our writers' team.
