Model Selection

October 3, 2024

Model selection is an important step in the machine learning workflow: choosing the best model from a set of candidates to achieve the highest predictive performance on a particular dataset. The model selection process can strongly influence the effectiveness of the final model and, ultimately, the success of the machine learning project.

Why Model Selection Matters

A primary goal of machine learning is to develop a model that generalizes well to new, unseen data (usually called test data). Selecting an inappropriate model can lead to overfitting, where the model performs extremely well on the training data, for which target values are already known, but poorly on the test data, where it must predict targets from new inputs. An inappropriate selection process can just as easily lead to underfitting, where the model fails to capture the underlying patterns and is inaccurate even on the training data.

Steps in Model Selection

  • Define the Problem and Targets: Understand the problem domain and the goals of the project. Is it a classification or regression problem? What metrics will be used to evaluate the model's performance? Defining the target variable (dependent variable) and input features (independent variables) is the first step.
  • Data Preparation: Clean and preprocess the data. This includes handling missing values, encoding categorical variables, scaling features, and splitting the data into training, validation, and test sets. Proper preprocessing ensures the input features are suitable for modeling.
  • Choose Candidate Models: Select a set of potential models to evaluate. This can include linear models, decision trees, support vector machines, ensemble methods, and neural networks. Including a variety of candidates allows us to compare simple models against more complex ones.
  • Evaluate Models: Use techniques like cross-validation to evaluate the performance of each candidate model on the training data. Cross-validation techniques, such as k-fold cross-validation, help in estimating the generalization error by averaging the performance across different validation folds.
  • Hyperparameter Tuning: Optimize the hyperparameters of the models to improve their performance. This step is especially important for complex models, whose hyperparameters can significantly affect the outcome.
  • Model Comparison: Compare the models based on their performance metrics and select the best one. This involves looking at relevant measures like mean squared error, mean absolute error, and other error metrics.
  • Validate the Model: Test the selected model on the test set to ensure it generalizes well to new data. This final step helps in verifying the model’s performance on unseen data.

Model Selection for a Classification Problem

For example, imagine we are working on a classification problem: predicting whether a customer will buy a product based on their browsing history. The model selection steps would look like this:

  • Define the Problem and Targets: Our target here is to maximize the accuracy and F1-score of the predictions (evaluation metrics typically used for classification problems).
  • Data Preparation (a minimal preprocessing sketch follows this list):
    • Data Collection: Gather browsing history data for customers, including features like page views, time spent on pages, and previous purchases.
    • Data Cleaning: Handle missing values, remove duplicates, and ensure data consistency across the dataset.
    • Feature Engineering: Create new features from the existing data, such as the average time spent on a page, the number of pages viewed in a session, or the time of day when purchases most often occur (morning, afternoon, evening, or night).
    • Encoding Categorical Variables: Convert categorical variables to numerical format using techniques like one-hot encoding.
    • Scaling: Normalize the numerical features to ensure they are on the same scale.
    • Data Splitting: Split the data into training (70%), validation (15%), and test (15%) sets.
  • Choose Candidate Models:
    • Logistic Regression
      • A simple yet effective linear model for binary classification.
    • Decision Tree
      • A non-linear model that splits the data based on feature values.
    • Random Forest
      • An ensemble method that combines multiple decision trees to improve performance.
    • Support Vector Machine (SVM)
      • A model that finds the hyperplane that best separates the classes.
    • Neural Network
      • A complex model capable of capturing intricate patterns in the data.
  • Evaluate Models: Use 5-fold cross-validation to evaluate the aforementioned models, and calculate the average accuracy and F1-score for each of the five candidates.
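
Below is a minimal sketch of what the data-preparation step described above might look like with scikit-learn; the browsing-history DataFrame, its column names, and its values are hypothetical placeholders used only for illustration.

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.model_selection import train_test_split

# Hypothetical browsing-history data; columns and values are placeholders
df = pd.DataFrame({
    'page_views': [12, 3, 45, 7, 20, 9],
    'time_on_site': [300.0, 40.5, 900.2, 120.0, 410.7, 95.3],
    'device': ['mobile', 'desktop', 'mobile', 'tablet', 'desktop', 'mobile'],
    'purchased': [1, 0, 1, 0, 1, 0],
})
X = df.drop(columns='purchased')
y = df['purchased']

# One-hot encode the categorical feature and scale the numerical ones
# (in practice, fit the preprocessor on the training split only to avoid leakage)
preprocessor = ColumnTransformer([
    ('num', StandardScaler(), ['page_views', 'time_on_site']),
    ('cat', OneHotEncoder(handle_unknown='ignore'), ['device']),
])
X_processed = preprocessor.fit_transform(X)

# Split into 70% training, 15% validation, and 15% test
X_train, X_temp, y_train, y_temp = train_test_split(X_processed, y, test_size=0.30, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.50, random_state=42)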

Example code for evaluating models

from sklearn.model_selection import cross_val_score, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score, f1_score

# Sample data
X_train, X_test, y_train, y_test = ...    # Load and preprocess your data

# Define models
models = {
    'Logistic Regression': LogisticRegression(),
    'Decision Tree': DecisionTreeClassifier(),
    'Random Forest': RandomForestClassifier(),
    'SVM': SVC(),
    'Neural Network': MLPClassifier()
}

# Evaluate models
for name, model in models.items():
    scores = cross_val_score(model, X_train, y_train, cv=5, scoring='accuracy')
    print(f'{name} Accuracy: {scores.mean()}')

# Example of output:
# Logistic Regression Accuracy: 0.78
# Decision Tree Accuracy: 0.75
# Random Forest Accuracy: 0.80
# SVM Accuracy: 0.77
# Neural Network Accuracy: 0.79

  • Hyperparameter Tuning: Use grid search to find the best hyperparameters for the most promising candidates. In our hypothetical example the SVM is tuned below; the same procedure applies to the other models, such as the Random Forest.

# Hyperparameter tuning for the best model
param_grid = {
    'C': [0.1, 1, 10],
    'kernel': ['linear', 'rbf']
}
grid_search = GridSearchCV(SVC(), param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_train, y_train)
print(f'Best parameters: {grid_search.best_params_}')

# Output: Best parameters: {'C': 1, 'kernel': 'rbf'}

  • Model Comparison: Compare the models based on their cross-validation performance and select the best one (e.g., the Random Forest).
  • Validate the Model: Test the best model on the test set to confirm its performance.

# Validate the model
best_model = RandomForestClassifier(n_estimators=100, max_depth=None)
best_model.fit(X_train, y_train)
y_pred = best_model.predict(X_test)
print(f'Test Accuracy: {accuracy_score(y_test, y_pred)}')
print(f'Test F1-Score: {f1_score(y_test, y_pred)}')

# Output: Test Accuracy: 0.82
# Output: Test F1-Score: 0.81

Model Selection for a Regression Problem

For example, let's take a regression problem where the goal is to predict house prices based on features such as location, size, and number of rooms. The steps would include:

  • Define the Problem and Targets: Our target is to minimize the mean squared error (MSE), an evaluation metric typically used for regression problems.
  • Data Preparation:
    • Data Collection: Gather data on house prices and features such as location, size, number of rooms, age of the house, etc.
    • Data Cleaning: Handle missing values, remove duplicates, and ensure data consistency.
    • Feature Engineering: Create new features, such as the price per square foot, or categorical features such as whether the house is in a premium location.
    • Encoding Categorical Variables: Convert categorical variables like neighborhood to numerical format using techniques like one-hot encoding.
    • Scaling: Normalize the numerical features to ensure they are on the same scale.
    • Data Splitting: Split the data into training (70%), validation (15%), and test (15%) sets.
  • Choose Candidate Models:
    • Linear Regression: A simple linear model for predicting continuous values.
    • Decision Tree Regressor: A non-linear model that splits the data based on feature values.
    • Random Forest Regressor: An ensemble method that combines multiple decision trees to improve performance.
    • Support Vector Regressor (SVR): A model that finds the hyperplane that best fits the data.
    • Neural Network: A complex model capable of capturing intricate patterns in the data.
  • Evaluate Models: Use 5-fold cross-validation to evaluate the models. Calculate the average MSE for all 5 models.

# Example code for evaluating models
from sklearn.model_selection import cross_val_score, GridSearchCV
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR
from sklearn.neural_network import MLPRegressor
from sklearn.metrics import mean_squared_error

# Sample data
X_train, X_test, y_train, y_test = ... # Load and preprocess your data

# Define models
models = {
    'Linear Regression': LinearRegression(),
    'Decision Tree': DecisionTreeRegressor(),
    'Random Forest': RandomForestRegressor(),
    'SVR': SVR(),
    'Neural Network': MLPRegressor()
}

# Evaluate models
for name, model in models.items():
    scores = cross_val_score(model, X_train, y_train, cv=5,
                             scoring='neg_mean_squared_error')
    print(f'{name} MSE: {-scores.mean()}')

# Example of output:
# Linear Regression MSE: 120000.0
# Decision Tree MSE: 135000.0
# Random Forest MSE: 110000.0
# SVR MSE: 115000.0
# Neural Network MSE: 125000.0

  • Hyperparameter Tuning: Use grid search to find the best hyperparameters for the Random Forest model, as it showed promising performance.

# Hyperparameter tuning for the best model
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 10, 20]
}
grid_search = GridSearchCV(RandomForestRegressor(), param_grid, cv=5, scoring='neg_mean_squared_error')
grid_search.fit(X_train, y_train)
print(f'Best parameters: {grid_search.best_params_}')

# Output: Best parameters: {'max_depth': None, 'n_estimators': 100}

  • Model Comparison: Compare the models based on their cross-validation performance and select the best one (e.g., Random Forest).
  • Validate the Model: Test the best model on the test set to confirm its performance.

# Validate the model
best_model = RandomForestRegressor(n_estimators=100, max_depth=None)
best_model.fit(X_train, y_train)
y_pred = best_model.predict(X_test)
print(f'Test MSE: {mean_squared_error(y_test, y_pred)}')

# Output: Test MSE: 105000.0

Model Selection Techniques

Cross-Validation

Cross-validation is one of the most widely used techniques for model evaluation. It involves splitting the data into multiple subsets and training the model on some subsets while validating it on others. Common techniques include k-fold cross-validation, leave-one-out cross-validation, and nested cross-validation.

For example, in 5-fold cross-validation the dataset is first divided into 5 equal parts. The model is trained on 4 parts and tested on the remaining part (4 parts serve as training data and 1 part as validation data). This process is repeated 5 times, with each part used as the validation set once, and the performance metric is averaged over the 5 iterations.
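
To make the mechanics concrete, here is a minimal sketch of 5-fold cross-validation written out explicitly with scikit-learn's KFold (the cross_val_score calls used earlier perform the same bookkeeping internally); the data is synthetic and purely illustrative.

import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Synthetic data, purely for illustration
rng = np.random.RandomState(0)
X = rng.rand(100, 4)
y = (X[:, 0] + X[:, 1] > 1).astype(int)

kf = KFold(n_splits=5, shuffle=True, random_state=42)
scores = []
for train_idx, val_idx in kf.split(X):
    # Train on 4 folds, evaluate on the held-out fold
    model = LogisticRegression()
    model.fit(X[train_idx], y[train_idx])
    scores.append(accuracy_score(y[val_idx], model.predict(X[val_idx])))

# Average the metric over the 5 iterations
print(f'Mean CV accuracy: {np.mean(scores):.2f}')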

Feature Selection

Feature selection is the process of selecting the most relevant features for the model. It helps reduce the dimensionality of the data and can improve the model's performance. Techniques include forward selection, backward elimination, and recursive feature elimination. In Python, scikit-learn's feature_selection module offers utilities such as SelectKBest and RFE for these methods.
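
As a brief sketch of two of these scikit-learn utilities, SelectKBest and RFE, on synthetic data (the dataset and the choice of keeping three features are illustrative assumptions):

from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, RFE, f_classif
from sklearn.linear_model import LogisticRegression

# Synthetic data: 10 features, only 3 of which are informative
X, y = make_classification(n_samples=200, n_features=10, n_informative=3, random_state=42)

# Univariate selection: keep the 3 features with the highest ANOVA F-scores
X_kbest = SelectKBest(score_func=f_classif, k=3).fit_transform(X, y)

# Recursive feature elimination: repeatedly drop the weakest feature of a fitted model
rfe = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=3).fit(X, y)
print('RFE-selected feature indices:', [i for i, kept in enumerate(rfe.support_) if kept])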

Model Averaging

Model averaging involves combining multiple models to improve performance. This can be done by averaging their predictions or using more sophisticated techniques like stacking or boosting.

For example, in an ensemble model, the predictions from a decision tree, a logistic regression model, and a support vector machine (SVM) are averaged to make the final prediction. This reduces the chance of overfitting and improves the generalization of the resulting model.
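
A minimal sketch of this idea uses scikit-learn's VotingClassifier, which averages the predicted class probabilities of the three models mentioned above (soft voting); the synthetic dataset is purely illustrative.

from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=300, n_features=8, random_state=42)

# Soft voting averages the predicted class probabilities of the base models
ensemble = VotingClassifier(
    estimators=[
        ('tree', DecisionTreeClassifier(max_depth=4)),
        ('logreg', LogisticRegression(max_iter=1000)),
        ('svm', SVC(probability=True)),   # probability=True is required for soft voting
    ],
    voting='soft',
)
print(f'Ensemble CV accuracy: {cross_val_score(ensemble, X, y, cv=5).mean():.2f}')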

Penalty Terms

Penalty terms, such as L1 (Lasso) and L2 (Ridge) regularization, are used in linear models to prevent overfitting by adding a penalty for large coefficients.

For example, in a linear regression model predicting house prices, L1 regularization (Lasso) can be used to enforce sparsity, setting some coefficients exactly to zero, while L2 regularization (Ridge) shrinks all coefficients towards zero; both are standard techniques for reducing overfitting.
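
A short sketch contrasting the two penalties with scikit-learn's Lasso and Ridge on synthetic data; the alpha values are arbitrary illustrations, not tuned recommendations.

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

X, y = make_regression(n_samples=200, n_features=10, n_informative=3, noise=10.0, random_state=42)

# L1 (Lasso) drives some coefficients exactly to zero (sparsity)
lasso = Lasso(alpha=1.0).fit(X, y)
print('Lasso zero coefficients:', int(np.sum(lasso.coef_ == 0)))

# L2 (Ridge) shrinks all coefficients towards zero but rarely to exactly zero
ridge = Ridge(alpha=1.0).fit(X, y)
print('Ridge zero coefficients:', int(np.sum(ridge.coef_ == 0)))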

Nested Cross-Validation

Nested cross-validation is a powerful method for performing hyperparameter tuning and model selection at the same time. It involves two layers of cross-validation: an inner loop for hyperparameter tuning and an outer loop for model evaluation.

For example, in 5x5 nested cross-validation, the outer 5-fold loop evaluates model performance, while the inner 5-fold loop optimizes hyperparameters. This gives a less biased estimate of the model's generalization performance.
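
A minimal sketch of nested cross-validation with scikit-learn, where a GridSearchCV (the inner loop) is itself evaluated by cross_val_score (the outer loop); the dataset and parameter grid are illustrative assumptions.

from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=8, random_state=42)

# Inner loop: 5-fold grid search over the hyperparameters
inner = GridSearchCV(SVC(), {'C': [0.1, 1, 10], 'kernel': ['linear', 'rbf']}, cv=5)

# Outer loop: 5-fold evaluation of the entire tuning procedure
outer_scores = cross_val_score(inner, X, y, cv=5)
print(f'Nested CV accuracy: {outer_scores.mean():.2f}')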

Model Assumptions and Selection Criteria

Each model comes with its own assumptions and selection criteria. For instance, linear regression assumes a linear relationship between the input features and the target variable, while random forests make no such assumption. So, when predicting sales from advertising spend, linear regression might be chosen if the relationship is approximately linear, while a tree-based model would be a better fit if the relationship is more complex.

Model Uncertainty

Model uncertainty can be addressed by using probabilistic models that provide confidence intervals or probabilities for their predictions. Bayesian models and ensemble methods are examples of approaches that incorporate model uncertainty.

For example, Bayesian linear regression provides a distribution of possible outcomes rather than a single prediction, offering insight into the uncertainty of each prediction. Python libraries such as PyMC3 and Stan, as well as scikit-learn's BayesianRidge, are helpful for implementing this approach.
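
As a small sketch of this idea, scikit-learn's BayesianRidge can return a standard deviation alongside each prediction; the synthetic data below is purely illustrative.

import numpy as np
from sklearn.linear_model import BayesianRidge

# Synthetic one-feature regression data
rng = np.random.RandomState(0)
X = rng.rand(100, 1) * 10
y = 3.0 * X.ravel() + rng.normal(scale=2.0, size=100)

model = BayesianRidge().fit(X, y)

# return_std=True yields the predictive standard deviation (a measure of uncertainty)
y_mean, y_std = model.predict([[5.0]], return_std=True)
print(f'Prediction: {y_mean[0]:.2f} +/- {y_std[0]:.2f}')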

Model Evaluation and Assessment

Model evaluation and assessment involve using relevant measures like mean squared error, accuracy, F1-score, and others to compare the performance of different models. This step ensures that the selected model meets the project goals and generalizes well to new data (unseen data or test data).

For example, in a classification problem predicting customer churn, accuracy, precision, recall, and F1-score are calculated for different models in order to select the best-performing one. In Python, scikit-learn's metrics module provides functions such as accuracy_score, f1_score, and mean_squared_error.
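
A short sketch computing these metrics with scikit-learn on hypothetical true and predicted churn labels:

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Hypothetical churn labels (1 = churned) and a model's predictions
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

print(f'Accuracy:  {accuracy_score(y_true, y_pred):.2f}')
print(f'Precision: {precision_score(y_true, y_pred):.2f}')
print(f'Recall:    {recall_score(y_true, y_pred):.2f}')
print(f'F1-score:  {f1_score(y_true, y_pred):.2f}')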

Computational Models and Criteria

Computational models and criteria for model selection include various algorithms and techniques used to compare and select models. These may involve mathematical models, model-based inference, and model identification criteria.

For example, the Akaike Information Criterion (AIC) and Bayesian Information Criterion (BIC) compare models on the basis of their likelihood and complexity, selecting the model that best balances fit and simplicity. In Python, the statsmodels library exposes AIC and BIC for fitted models.
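
A minimal sketch with statsmodels, whose fitted OLS results expose AIC and BIC; the synthetic regression data is purely illustrative.

import numpy as np
import statsmodels.api as sm

# Synthetic data: the target depends on the first two of three features
rng = np.random.RandomState(0)
X = rng.rand(100, 3)
y = 2.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=100)

# Fit an ordinary least squares model and read off its information criteria
results = sm.OLS(y, sm.add_constant(X)).fit()
print(f'AIC: {results.aic:.1f}, BIC: {results.bic:.1f}')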

When to choose Classical ML Models vs. Deep Learning Models


Deciding between classical machine learning models and deep learning models depends primarily on the amount of data available; the decision also affects the model's complexity and the computational resources required to reach the expected performance.

Data Size: Deep learning models can leverage large amounts of data to learn complex patterns, so larger datasets typically favor deep learning.

Model Interpretability: Classical ML models are typically more interpretable, making them preferable when understanding the model's decisions is important for the problem.

Computational Resources: Deep learning models require more computational power and longer training times than classical ML models.

Task Complexity: Tasks that involve complex patterns and high-dimensional data, such as image and speech recognition, benefit more from deep learning models.

Classical ML Models

Classical ML models, like linear regression, support vector machines (SVM), and random forests, are usually more effective when the datasets are smaller. These models are generally easier to interpret and need less computational power compared to deep learning.

For example, on a dataset with 10,000 rows, predicting house prices from features like square footage, number of bedrooms, and location, a random forest or linear regression model might be sufficient and efficient. These models handle this dataset size well and provide interpretable results.

Deep Learning Models

Deep learning models, such as feed-forward neural networks (NN), convolutional neural networks (CNN), and recurrent neural networks (RNN), are a better option for large datasets. They can automatically learn complex patterns and features from the data, and they excel at tasks like image and speech recognition.

For example, on a dataset of 200,000 images where the goal is to classify the objects they contain, a CNN would be more appropriate. CNNs are well suited to image data because they learn hierarchical feature representations through convolutional layers.

Which Deep Learning Model to Select

The choice of deep learning models depends on the data modality, the specific task (classification, regression, segmentation, retrieval, etc.), and other considerations like model size and inference time.

Data Modality and Model Selection

Different types of data (modalities) require different deep learning architectures. The following list outlines common modalities and suitable models for each:


  • Images, Videos (CNN, ViT)
    • For example, CNNs use convolutional layers to extract features from images, which makes them efficient for image classification tasks. Vision Transformers (ViT) have also shown strong performance in image classification by using transformer architectures.
    • A typical use case is classifying images of handwritten digits (the MNIST dataset).
  • Representation Learning (Autoencoder)
    • For example, autoencoders are used for unsupervised learning of efficient codings of input data. They can be used for tasks like anomaly detection or data compression.
    • A typical use case is reducing the dimensionality of high-dimensional data for visualization.
  • Generative Models (VAE, GANs, Transformers)
    • For example, Variational Autoencoders (VAE) and Generative Adversarial Networks (GANs) generate new data samples similar to the training data. Transformers are used for generating text or translating languages.
    • A typical use case is generating realistic images or synthesizing new text.
  • Language (Text) (Transformer)
    • For example, Transformers such as BERT or GPT are used for natural language processing tasks like text classification, translation, and summarization.
    • A typical use case is sentiment analysis of customer reviews.
  • Speech (LSTM, Transformer)
    • For example, Long Short-Term Memory (LSTM) networks and Transformers handle sequential data, making them suitable for speech recognition and language modeling.
    • A typical use case is transcribing spoken language into text.
  • Graphical Data (Graph Neural Nets)
    • For example, Graph Neural Networks (GNN) work with data structured as graphs, capturing relationships and dependencies between nodes.
    • A typical use case is predicting molecular properties in chemistry.
  • 3D Data (GNN, PointNets, CNNs)
    • For example, models like PointNets process 3D point cloud data, while CNNs can be adapted to 3D data by using 3D convolutions.
    • A typical use case is 3D object recognition in autonomous driving.

References

Books 

"Pattern Recognition and Machine Learning" by Christopher M. Bishop : This link directs you to a FREE PDF of this book published on Microsoft website!

"Machine Learning: A Probabilistic Perspective" by Kevin P. Murphy : A detailed book focusing on the probabilistic approach to machine learning.

YouTube Channels

StatQuest with Josh Starmer: Simplified explanations of various statistical and machine learning concepts, including model selection.

Sentdex: Practical tutorials on machine learning with Python, including model selection.

Data School: A clear introduction to cross-validation and its importance in model selection.

Author

This article was written by Kartikey Vyas and edited by our writing team.
