Imagine this scenario: you’re tasked with predicting whether customers will default on their loans, and you’ve trained a high-performing but mysterious black-box model—say, an XGBoost classifier or a deep neural network. You hand the results to your boss, who inevitably asks, “Great, but why does it predict defaults?” That’s where explainability steps in.
In this article, we’ll explore a practical, code-centric tour of several popular explainability techniques, blending model-agnostic and model-specific approaches. Using a loan default dataset, we’ll train a black-box model and demonstrate how to extract both global and local explanations to make informed decisions.
We’ll use the Home Credit Default Risk dataset from Kaggle (specifically application_train.csv), with features like EXT_SOURCE_3, AMT_CREDIT, DAYS_EMPLOYED, and AMT_ANNUITY. Our goal is to predict which applicants are likely to default on their loans. To achieve this, we’ll follow two key steps: first train a black-box model, then apply interpretability techniques to explain its predictions.
Imagine you're presenting this model to the bank's compliance team. They want assurance that the model isn’t relying on questionable logic—such as ignoring crucial financial indicators or amplifying biases. Meanwhile, your boss needs confidence that key decision-making patterns are transparent and defensible. By applying interpretability techniques, we ensure that our predictions are not only accurate but also understandable and justifiable.
To keep things simple, let’s outline how we train our model:
# Step 1: Import Libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder
import xgboost as xgb
from sklearn.metrics import accuracy_score, roc_auc_score
# For deep neural network
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.callbacks import EarlyStopping
# Step 2: Load the Dataset
dataset_path = "application_train.csv"
df = pd.read_csv(dataset_path)
# Quick dataset preview (prints first 5 rows)
print(df.head())
# Step 3: Data Preprocessing
num_cols = df.select_dtypes(include=[np.number]).columns
df[num_cols] = df[num_cols].fillna(df[num_cols].median())
cat_cols = df.select_dtypes(include=[object]).columns
for col in cat_cols:
    df[col] = df[col].fillna(df[col].mode()[0])
# Encode categorical features using LabelEncoder (for simplicity)
label_encoders = {}
for col in cat_cols:
    le = LabelEncoder()
    df[col] = le.fit_transform(df[col])
    label_encoders[col] = le
# Define features and target:
# We drop 'SK_ID_CURR' (an identifier) and keep 'TARGET' as the target variable.
X = df.drop(columns=["TARGET", "SK_ID_CURR"])
y = df["TARGET"]
# Split the data into training and test sets (20% test size, stratified by target)
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42, stratify=y
)
# Standardize numerical features (important for neural networks)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
# Step 4: Classical ML Model - XGBoost
# Initialize and train the XGBoost model.
xgb_model = xgb.XGBClassifier(
objective="binary:logistic",
eval_metric="auc",
use_label_encoder=False,
n_estimators=100,
learning_rate=0.1,
max_depth=6,
random_state=42
)
xgb_model.fit(X_train, y_train)
# Make predictions on the test set
y_pred_xgb = xgb_model.predict(X_test)
y_pred_xgb_proba = xgb_model.predict_proba(X_test)[:, 1]
# Evaluate the XGBoost model
accuracy_xgb = accuracy_score(y_test, y_pred_xgb)
roc_auc_xgb = roc_auc_score(y_test, y_pred_xgb_proba)
print("\nXGBoost Model Performance:")
print("Accuracy: {:.4f}".format(accuracy_xgb))
print("ROC-AUC: {:.4f}".format(roc_auc_xgb))
# Step 5: Deep Neural Network Model
# Build a simple feedforward neural network.
dnn_model = Sequential()
dnn_model.add(Dense(128, activation='relu', input_shape=(X_train.shape[1],)))
dnn_model.add(Dropout(0.2))
dnn_model.add(Dense(64, activation='relu'))
dnn_model.add(Dropout(0.2))
dnn_model.add(Dense(1, activation='sigmoid')) # Sigmoid for binary classification
# Compile the model with binary cross-entropy loss and the Adam optimizer.
dnn_model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
# Use EarlyStopping to prevent overfitting.
early_stop = EarlyStopping(monitor='val_loss', patience=3, restore_best_weights=True)
# Train the neural network.
history = dnn_model.fit(
X_train, y_train,
epochs=50,
batch_size=256,
validation_split=0.2,
callbacks=[early_stop],
verbose=1
)
# Evaluate the DNN on the test set.
loss_dnn, accuracy_dnn = dnn_model.evaluate(X_test, y_test, verbose=0)
y_pred_dnn_proba = dnn_model.predict(X_test)
y_pred_dnn = (y_pred_dnn_proba > 0.5).astype(int)
roc_auc_dnn = roc_auc_score(y_test, y_pred_dnn_proba)
print("\nDeep Neural Network Model Performance:")
print("Accuracy: {:.4f}".format(accuracy_dnn))
print("ROC-AUC: {:.4f}".format(roc_auc_dnn))
One way to get a global view is by generating Partial Dependence Plots (PDP). These plots show how changing one feature while holding others constant influences the model's predicted outcome. This technique helps us understand the relationship between an individual feature and the target prediction.
For example, using the dataset that includes features like credit score, we can ask: “As the credit score increases from 600 to 800, does the model’s predicted probability of default decrease?” By plotting this relationship, we can visualize the overall trend in the model's behavior and identify any specific thresholds where the risk of default significantly increases or decreases.
import matplotlib.pyplot as plt
from sklearn.inspection import PartialDependenceDisplay
# Convert X_test back to a DataFrame using the original column names.
X_test_df = pd.DataFrame(X_test, columns=X.columns)
# Define the feature to plot (e.g., "AMT_CREDIT")
features_to_plot = ['AMT_CREDIT']
# Create the Partial Dependence Plot
fig, ax = plt.subplots(figsize=(8, 4))
PartialDependenceDisplay.from_estimator(xgb_model, X_test_df, features=features_to_plot, ax=ax)
plt.title("Partial Dependence Plot for AMT_CREDIT (XGBoost)")
plt.show()
Interpretation: Higher loan amounts increase the risk of default, but the effect stabilizes beyond a certain amount. This makes sense because loans above a specific threshold are typically given to customers with stronger financial backing, reducing further risk increases.
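To check whether individual applicants deviate from this average trend, scikit-learn can also draw individual conditional expectation (ICE) curves alongside the PDP. Here is a minimal sketch, assuming a scikit-learn version recent enough to support the kind and subsample arguments of PartialDependenceDisplay:
import matplotlib.pyplot as plt
from sklearn.inspection import PartialDependenceDisplay
# ICE curves show one line per sampled applicant; the thicker line is their average (the PDP).
fig, ax = plt.subplots(figsize=(8, 4))
PartialDependenceDisplay.from_estimator(
    xgb_model,
    X_test_df,
    features=['AMT_CREDIT'],
    kind="both",     # draw individual ICE curves plus their average
    subsample=100,   # plot only a random subset of applicants to keep it readable
    ax=ax
)
plt.title("ICE + PDP for AMT_CREDIT (XGBoost)")
plt.show()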
While PDPs are useful, they assume that features are independent, which is often not the case in real-world datasets. Accumulated Local Effects (ALE) solve this issue by computing the marginal effect of a feature while considering interactions with other features. ALE shows how predictions change when a feature varies, accounting for relationships between variables.
from alepython import ale_plot
import matplotlib.pyplot as plt
# Convert X_test to a DataFrame if it's not already
X_test_df = pd.DataFrame(X_test, columns=X.columns)
# Select the feature to analyze by name
feature_name = "AMT_CREDIT"
# Generate the ALE plot for the selected feature
ale_plot(
    xgb_model,     # the trained model
    X_test_df,     # feature matrix as a DataFrame with named columns
    feature_name,  # name of the feature to analyze
    bins=20        # number of bins for the ALE calculation
)
plt.title("Accumulated Local Effects (ALE) for AMT_CREDIT (XGBoost)")
plt.show()
ALE handles correlated features more gracefully, giving a plot that’s interpreted similarly to PDP but with less risk of false patterns.
Interpretation: ALE captures the true impact of AMT_CREDIT by ensuring correlated features do not distort the results. If the ALE curve is rising, it indicates that higher credit amounts contribute more toward default probability, while a plateau suggests no additional risk after a threshold. This is an important distinction from PDP, as ALE accounts for feature dependencies.
Sometimes, a bank manager says, “But what about Bill, who had a decent credit score and was still rejected?” or “How can I explain Mary’s acceptance to her?” That’s local territory: why a particular instance got its prediction.
LIME (Local Interpretable Model-agnostic Explanations) is a powerful technique specifically designed to provide clarity on individual predictions made by complex models, rather than focusing on the model's overall behavior. The process begins by slightly altering the values of various features related to a specific data point. This modification allows us to observe how these changes influence the model’s predictions.
By analyzing the relationship between the perturbed features and the resulting predictions, LIME fits a simplified, interpretable model, such as a linear regression or a small tree, around that data point. This surrogate approximates how the original model behaves locally, offering valuable insight into the factors driving that particular prediction.
import lime
import lime.lime_tabular
# Create a LIME explainer object based on training data (ensure you pass in the feature names and class names)
explainer = lime.lime_tabular.LimeTabularExplainer(
training_data=X_train,
feature_names=X.columns.tolist(),
class_names=['No Default', 'Default'],
mode='classification'
)
# Pick an instance from the test set to explain.
instance_idx = 10
instance = X_test[instance_idx]
# Generate explanation for the chosen instance.
exp = explainer.explain_instance(instance, xgb_model.predict_proba, num_features=10)
exp.show_in_notebook(show_table=True)
Feature Contributions (Bar Chart)
Example:
AMT_INCOME_TOTAL < 50,000 → +0.25 (increases default risk)
DAYS_EMPLOYED > 2,000 → -0.12 (decreases default risk)
EXT_SOURCE_3 < 0.4 → +0.30 (strongly increases default risk)
Interpretation: LIME highlights which features push a specific loan toward default or repayment. For instance, a low income and high loan amount may strongly push the prediction toward default, while a high credit score might reduce it. Since LIME builds a locally linear approximation, it works well for understanding decisions at a micro level, though results may vary if a different perturbation sample is used.
Caution: LIME can be unstable when the model’s local decision boundary is highly nonlinear; a different perturbation sample can yield a noticeably different explanation. Another method, SHAP, tends to be more stable but can require more computation.
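One quick way to gauge that instability is to rerun LIME on the same instance a few times and compare the resulting feature weights. A minimal sketch, reusing the explainer and instance defined above (and assuming the explainer was created without a fixed random_state, so its sampling follows NumPy's global seed):
import numpy as np
# Re-explain the same instance several times and collect the feature weights.
runs = []
for seed in [0, 1, 2]:
    np.random.seed(seed)  # LIME's perturbations draw from NumPy's global RNG here
    exp = explainer.explain_instance(instance, xgb_model.predict_proba, num_features=10)
    runs.append(dict(exp.as_list()))
# Features whose weight changes sign or magnitude across runs are the unstable ones.
for feature in runs[0]:
    weights = [run.get(feature, 0.0) for run in runs]
    print(feature, ["%+.3f" % w for w in weights])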
SHAP (SHapley Additive exPlanations) values quantify how much each feature contributes to a prediction, grounded in the principles of cooperative game theory. In contrast to LIME, which builds a local approximation around a specific instance, SHAP assigns feature importance in a mathematically consistent way, and it works both globally (across the entire dataset) and locally (for individual predictions), giving a fuller picture of the model’s behavior and the underlying data dynamics.
import shap
# Create a SHAP explainer for the XGBoost model
tree_explainer = shap.TreeExplainer(xgb_model)
shap_values = tree_explainer.shap_values(X_test)
# Summary plot: shows the global feature importance and impact.
shap.summary_plot(shap_values, X_test, feature_names=X.columns.tolist())
# For a single prediction (local explanation), use a force plot:
instance_shap = shap_values[0] # for the first test instance
shap.initjs()
shap.force_plot(tree_explainer.expected_value, instance_shap, X_test[0], feature_names=X.columns.tolist())
You’d see a force plot or bar chart showing which features push Bill’s default probability above or below the model’s baseline (expected value). SHAP is more mathematically grounded than LIME, but for large tree ensembles it can be slower to compute. The visuals, however, are quite polished, especially for a single instance.
In short: local explanations let you say, “Here’s why we predicted default for Bill.” This is essential for building trust and giving applicants actionable next steps.
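If an interactive force plot is awkward to share with, say, the compliance team, the same SHAP values can be turned into a plain ranked table for one applicant. A small sketch reusing shap_values and X_test_df from above:
import pandas as pd
# Rank the features by how strongly they pushed this applicant's prediction.
# Note: feature values are shown on the standardized scale used for training.
instance_idx = 0
local_contrib = pd.DataFrame({
    "Feature": X.columns,
    "Value": X_test_df.iloc[instance_idx].values,
    "SHAP value": shap_values[instance_idx],
})
local_contrib = local_contrib.reindex(
    local_contrib["SHAP value"].abs().sort_values(ascending=False).index
)
print(local_contrib.head(10))  # positive SHAP values push toward default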
For deep learning models, we need different techniques to explain predictions. These approaches leverage gradients to understand how changes in input affect the model’s decisions.
Integrated Gradients (IG) is an advanced technique designed to attribute a model's predictions to the individual input features that contribute to those predictions. The process involves systematically modifying the input data, starting from a baseline and gradually transitioning to the actual input, while tracking the changes in the model's output. By observing how the model's predictions shift in response to these alterations, IG provides insights into the significance of each feature in the decision-making process. This method is particularly beneficial for deep learning models, where conventional approaches to feature importance may fall short, enabling a clearer understanding of how inputs are translated into predictions.
import tensorflow as tf
def integrated_gradients(model, x, baseline=None, steps=50):
    """
    Computes Integrated Gradients for a single input sample `x` of shape (1, num_features).
    """
    if baseline is None:
        baseline = tf.zeros(shape=x.shape)  # a baseline of zeros
    # Interpolate between the baseline and the actual input.
    interpolated_inputs = tf.convert_to_tensor([
        baseline + (float(i) / steps) * (x - baseline)
        for i in range(steps + 1)
    ], dtype=tf.float32)
    # Reshape interpolated inputs to match the model input shape (batch_size, num_features)
    interpolated_inputs = tf.reshape(interpolated_inputs, [steps + 1, x.shape[1]])
    with tf.GradientTape() as tape:
        tape.watch(interpolated_inputs)
        predictions = model(interpolated_inputs)
    # Average the gradients along the interpolation path and scale by (x - baseline)
    grads = tape.gradient(predictions, interpolated_inputs)
    avg_grads = tf.reduce_mean(grads, axis=0)
    integrated_grad = (x - baseline) * avg_grads
    return integrated_grad.numpy()
# Ensure X_test is a NumPy array before conversion
import numpy as np
X_test_np = np.array(X_test) # Convert to NumPy array
instance_nn = tf.convert_to_tensor(X_test_np[0:1], dtype=tf.float32) # Ensure batch dimension
# Compute integrated gradients for this instance
attributions = integrated_gradients(dnn_model, instance_nn, baseline=tf.zeros_like(instance_nn), steps=50)
print("Integrated Gradients Attributions for instance 0:")
print(attributions)
import pandas as pd
# Convert the attributions array into a Pandas DataFrame for readability
feature_importance_df = pd.DataFrame({
"Feature": X.columns.tolist(), # Ensure this matches the original feature order
"Attribution": attributions[0] # Take the first row from the IG output
})
# Sort by absolute importance to see the most influential features
feature_importance_df["Absolute Attribution"] = feature_importance_df["Attribution"].abs()
feature_importance_df = feature_importance_df.sort_values(by="Absolute Attribution", ascending=False)
# Display the top 10 most important features
print(feature_importance_df.head(10))
Integrated Gradients is a method that helps us determine the input features that most significantly influence a model's predictions. By examining these inputs, we can gain valuable insights into how various factors—such as an individual's income level or the amount of a loan—affect the model's decision-making process. This understanding allows us to interpret the workings of the model more clearly and assess the relevance of each characteristic in shaping the outcome.
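A useful sanity check follows from the completeness property of Integrated Gradients: the attributions should approximately sum to the difference between the model output at the input and at the baseline, and a large gap usually means steps should be increased. A quick sketch using the objects defined above:
# Completeness check: sum of attributions ≈ f(x) - f(baseline)
baseline = tf.zeros_like(instance_nn)
pred_input = float(dnn_model(instance_nn).numpy()[0, 0])
pred_baseline = float(dnn_model(baseline).numpy()[0, 0])
print("Sum of attributions:", attributions.sum())
print("f(x) - f(baseline): ", pred_input - pred_baseline)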
Deep SHAP is an advanced extension of the SHAP framework specifically designed for deep neural networks. This powerful tool quantifies the impact of individual features on a model's predictions by calculating SHAP values. It does this by leveraging background data to create more accurate estimates, allowing for a clear understanding of how each feature influences the model's output. By providing insights into feature importance, Deep SHAP enhances interpretability in complex deep learning models.
background = X_train[np.random.choice(X_train.shape[0], 100, replace=False)]
deep_explainer = shap.DeepExplainer(dnn_model, background)
shap_values_nn = deep_explainer.shap_values(X_test[:10])
shap.summary_plot(shap_values_nn[0], X_test[:10])
Interpretation: Deep SHAP provides insight into how different features contribute to neural network predictions. It can be particularly useful for debugging and validating deep learning models.
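The same values can also be read locally for a single applicant. A minimal sketch pairing the first explained instance's SHAP values with the feature names (assuming, as in the summary plot above, that shap_values_nn is a list with one array for the single sigmoid output):
import numpy as np
import pandas as pd
# Local Deep SHAP explanation for the first of the ten explained instances.
sv = np.asarray(shap_values_nn[0]).reshape(len(X_test[:10]), -1)  # (instances, features)
dnn_local = pd.DataFrame({"Feature": X.columns, "SHAP value": sv[0]})
dnn_local = dnn_local.reindex(
    dnn_local["SHAP value"].abs().sort_values(ascending=False).index
)
print(dnn_local.head(10))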
Convolutional Neural Networks (CNNs) are widely used for analyzing images, such as signatures on loan applications. Let’s show a brief Grad-CAM snippet for Keras. Suppose we have a model model_cnn that outputs a probability for a single class (1 = forgery, 0 = not forgery):
import numpy as np
import tensorflow as tf
import cv2
# This function runs Grad-CAM on a Keras CNN model.
def grad_cam(model, img_array, last_conv_layer_name="conv2d"):
    # 1) Build a model that maps the input image to the activations of the last
    #    conv layer and to the final predictions.
    grad_model = tf.keras.models.Model(
        model.inputs,
        [model.get_layer(last_conv_layer_name).output, model.output]
    )
    with tf.GradientTape() as tape:
        conv_outputs, predictions = grad_model(img_array)
        class_idx = tf.argmax(predictions[0])
        loss = predictions[:, class_idx]
    # Gradient of the predicted class w.r.t. the conv layer outputs
    grads = tape.gradient(loss, conv_outputs)
    # Global average pooling on the gradients (one weight per channel)
    pooled_grads = tf.reduce_mean(grads, axis=(0, 1, 2))
    conv_outputs = conv_outputs[0]
    # Multiply each channel by its pooled gradient
    conv_outputs = conv_outputs * pooled_grads
    heatmap = tf.reduce_mean(conv_outputs, axis=-1)
    # Convert to numpy, keep positive contributions, and normalize to [0, 1]
    heatmap = heatmap.numpy()
    heatmap = np.maximum(heatmap, 0) / (np.max(heatmap) + 1e-10)
    return heatmap
Usage:
# Suppose some_image is preprocessed
heatmap = grad_cam(model_cnn, some_image, "last_conv_layer")
# Rescale heatmap to 0-255
heatmap = cv2.resize(heatmap, (some_image.shape[2], some_image.shape[1]))
heatmap = np.uint8(255 * heatmap)
# Apply a color map to the heatmap
colored_heatmap = cv2.applyColorMap(heatmap, cv2.COLORMAP_JET)
# Blend the heatmap with the original image (assumes the image is also in the 0-255 range)
alpha = 0.4
overlay_img = colored_heatmap * alpha + some_image[0]  # 0th image in the batch
Mock Explanation:
We see a color heatmap overlaid on the signature,
highlighting the lower-right region as crucial for detecting a possible forgery.
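To actually look at the overlay, a short matplotlib snippet like the one below works, assuming overlay_img ended up roughly in the 0-255 range:
import matplotlib.pyplot as plt
# Display the blended Grad-CAM overlay.
# Note: OpenCV color maps are BGR; if the colors look inverted, convert with
# cv2.cvtColor(colored_heatmap, cv2.COLOR_BGR2RGB) before blending.
plt.imshow(np.clip(overlay_img, 0, 255).astype("uint8"))
plt.axis("off")
plt.title("Grad-CAM overlay on the signature")
plt.show()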
When working with textual data, a Transformer model can generate an attention map that highlights the relationships between tokens. Let’s do a mini example with a Hugging Face BERT pipeline.
!pip install transformers bertviz
from transformers import AutoTokenizer, AutoModel
import torch
model_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model_bert = AutoModel.from_pretrained(model_name, output_attentions=True)
text = "I lost my job but I will repay soon"
inputs = tokenizer(text, return_tensors='pt')
outputs = model_bert(**inputs)
attentions = outputs.attentions # a tuple of attention maps, one per layer
print("Number of layers:", len(attentions))
print("Shape of one attention map:", attentions[0].shape)
Possible Output:
Number of layers: 12
Shape of one attention map: torch.Size([1, 12, seq_len, seq_len])
That is, a batch of one example and 12 attention heads per layer (with 12 layers in total), where seq_len is the number of tokens after tokenization, including [CLS] and [SEP].
You can visualize these with bertviz or custom code. For instance:
from bertviz import head_view
# head_view expects the tuple of attention maps and the corresponding tokens.
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
head_view(attentions, tokens)
Mock Explanation:
[HTML-based diagram shows each attention head focusing strongly on "lost" and "job" tokens.
Some heads link "repay" to "soon" etc.]
Caveat: Attention is not a perfect explanation, but it’s a helpful lens to see how tokens interact.
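As a small "custom code" example of that lens, the sketch below averages the last layer's attention over all heads and prints which tokens a chosen token attends to most strongly (here "job", which is a single wordpiece in bert-base-uncased):
# Map token ids back to readable wordpieces.
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
# The last layer's attention has shape (1, num_heads, seq_len, seq_len); average over heads.
avg_attn = attentions[-1][0].mean(dim=0)
# Attention paid by the token "job" to every token in the sentence.
query_idx = tokens.index("job")
weights = avg_attn[query_idx].detach().tolist()
for tok, w in sorted(zip(tokens, weights), key=lambda pair: -pair[1])[:5]:
    print(f"{tok:>8s}  {w:.3f}")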
If your model is remote (for example, a hosted Hugging Face pipeline endpoint), you might not have access to gradients or code. In that case you rely on model-agnostic methods like LIME or KernelSHAP, which only need a prediction function they can call with perturbed inputs:
Pseudo-code:
def huggingface_api_predict(texts):
    # Calls the remote HF model endpoint for a batch of texts and returns
    # an array of class probabilities (one row per input text, as LIME expects).
    pass

# Here `explainer` would be a text explainer, e.g. lime.lime_text.LimeTextExplainer
lime_exp = explainer.explain_instance(
    text_input,
    huggingface_api_predict,
    num_features=5
)
lime_exp.show_in_notebook()
Result: You’ll see which words or phrases contributed to the classification, even though you can’t see inside the model.
We’ve walked from tabular data (PDP, ALE, LIME, SHAP, Integrated Gradients) all the way to CNNs (Grad-CAM), Transformers (attention), and remote APIs (LIME with perturbation sampling). The main strategy is the same throughout: start with global explanations to validate the model’s overall behavior, then use local explanations to justify individual decisions, choosing model-agnostic or model-specific tools depending on how much access you have to the model.
With these code snippets and example outputs, you can adapt the techniques to your own dataset—whether it’s tabular, image, or text—and ensure you never again have to say, “I don’t know why the model did that.”
This article is written by Gaurav Sharma, a member of 123 of AI, and edited by the 123 of AI team.