Imagine this scenario: you’re tasked with predicting whether customers will default on their loans, and you’ve trained a high-performing but mysterious black-box model—say, an XGBoost classifier or a deep neural network. You hand the results to your boss, who inevitably asks, “Great, but why does it predict defaults?” That’s where explainability steps in.
In this article, we’ll explore a practical, code-centric tour of several popular explainability techniques, blending model-agnostic and model-specific approaches. Using a loan default dataset, we’ll train a black-box model and demonstrate how to extract both global and local explanations to make informed decisions.
We’ll use the Home Credit Default Risk dataset from Kaggle (specifically application_train.csv), with features like EXT_SOURCE_3, AMT_CREDIT, DAYS_EMPLOYED, and AMT_ANNUITY. Our goal is to predict which applicants are likely to default on their loans. To achieve this, we’ll follow two key steps: first train a black-box model, then apply interpretability techniques to explain its predictions.
Imagine you're presenting this model to the bank's compliance team. They want assurance that the model isn’t relying on questionable logic—such as ignoring crucial financial indicators or amplifying biases. Meanwhile, your boss needs confidence that key decision-making patterns are transparent and defensible. By applying interpretability techniques, we ensure that our predictions are not only accurate but also understandable and justifiable.
To keep things simple, let’s outline how we train our model:
# Step 1: Import Libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder
import xgboost as xgb
from sklearn.metrics import accuracy_score, roc_auc_score
# For deep neural network
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.callbacks import EarlyStopping
# Step 2: Load the Dataset
dataset_path = "application_train.csv"
df = pd.read_csv(dataset_path)
# Quick dataset preview (prints first 5 rows)
print(df.head())
# Step 3: Data Preprocessing
num_cols = df.select_dtypes(include=[np.number]).columns
df[num_cols] = df[num_cols].fillna(df[num_cols].median())
cat_cols = df.select_dtypes(include=[object]).columns
for col in cat_cols:
    df[col] = df[col].fillna(df[col].mode()[0])
# Encode categorical features using LabelEncoder (for simplicity)
label_encoders = {}
for col in cat_cols:
    le = LabelEncoder()
    df[col] = le.fit_transform(df[col])
    label_encoders[col] = le
# Define features and target:
# We drop 'SK_ID_CURR' (an identifier) and keep 'TARGET' as the target variable.
X = df.drop(columns=["TARGET", "SK_ID_CURR"])
y = df["TARGET"]
# Split the data into training and test sets (20% test size, stratified by target)
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42, stratify=y
)
# Standardize numerical features (important for neural networks)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
# Step 4: Classical ML Model - XGBoost
# Initialize and train the XGBoost model.
xgb_model = xgb.XGBClassifier(
objective="binary:logistic",
eval_metric="auc",
use_label_encoder=False,
n_estimators=100,
learning_rate=0.1,
max_depth=6,
random_state=42
)
xgb_model.fit(X_train, y_train)
# Make predictions on the test set
y_pred_xgb = xgb_model.predict(X_test)
y_pred_xgb_proba = xgb_model.predict_proba(X_test)[:, 1]
# Evaluate the XGBoost model
accuracy_xgb = accuracy_score(y_test, y_pred_xgb)
roc_auc_xgb = roc_auc_score(y_test, y_pred_xgb_proba)
print("\nXGBoost Model Performance:")
print("Accuracy: {:.4f}".format(accuracy_xgb))
print("ROC-AUC: {:.4f}".format(roc_auc_xgb))
# Step 5: Deep Neural Network Model
# Build a simple feedforward neural network.
dnn_model = Sequential()
dnn_model.add(Dense(128, activation='relu', input_shape=(X_train.shape[1],)))
dnn_model.add(Dropout(0.2))
dnn_model.add(Dense(64, activation='relu'))
dnn_model.add(Dropout(0.2))
dnn_model.add(Dense(1, activation='sigmoid')) # Sigmoid for binary classification
# Compile the model with binary cross-entropy loss and the Adam optimizer.
dnn_model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
# Use EarlyStopping to prevent overfitting.
early_stop = EarlyStopping(monitor='val_loss', patience=3, restore_best_weights=True)
# Train the neural network.
history = dnn_model.fit(
X_train, y_train,
epochs=50,
batch_size=256,
validation_split=0.2,
callbacks=[early_stop],
verbose=1
)
# Evaluate the DNN on the test set.
loss_dnn, accuracy_dnn = dnn_model.evaluate(X_test, y_test, verbose=0)
y_pred_dnn_proba = dnn_model.predict(X_test)
y_pred_dnn = (y_pred_dnn_proba > 0.5).astype(int)
roc_auc_dnn = roc_auc_score(y_test, y_pred_dnn_proba)
print("\nDeep Neural Network Model Performance:")
print("Accuracy: {:.4f}".format(accuracy_dnn))
print("ROC-AUC: {:.4f}".format(roc_auc_dnn))
One way to get a global view is by generating Partial Dependence Plots (PDP). These plots show how changing one feature while holding others constant influences the model's predicted outcome. This technique helps us understand the relationship between an individual feature and the target prediction.
For example, using the dataset that includes features like credit score, we can ask: “As the credit score increases from 600 to 800, does the model’s predicted probability of default decrease?” By plotting this relationship, we can visualize the overall trend in the model's behavior and identify any specific thresholds where the risk of default significantly increases or decreases.
import matplotlib.pyplot as plt
from sklearn.inspection import PartialDependenceDisplay
# Convert X_test back to a DataFrame using the original column names.
X_test_df = pd.DataFrame(X_test, columns=X.columns)
# Define the feature to plot (e.g., "AMT_CREDIT")
features_to_plot = ['AMT_CREDIT']
# Create the Partial Dependence Plot
fig, ax = plt.subplots(figsize=(8, 4))
PartialDependenceDisplay.from_estimator(xgb_model, X_test_df, features=features_to_plot, ax=ax)
plt.title("Partial Dependence Plot for AMT_CREDIT (XGBoost)")
plt.show()
Interpretation: Higher loan amounts increase the risk of default, but the effect stabilizes beyond a certain amount. This makes sense because loans above a specific threshold are typically given to customers with stronger financial backing, reducing further risk increases.
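To check whether individual applicants deviate from this average trend, scikit-learn can also draw individual conditional expectation (ICE) curves alongside the PDP. Here is a minimal sketch, assuming a scikit-learn version recent enough to support the kind and subsample arguments of PartialDependenceDisplay:
import matplotlib.pyplot as plt
from sklearn.inspection import PartialDependenceDisplay
# ICE curves show one line per sampled applicant; the thicker line is their average (the PDP).
fig, ax = plt.subplots(figsize=(8, 4))
PartialDependenceDisplay.from_estimator(
    xgb_model,
    X_test_df,
    features=['AMT_CREDIT'],
    kind="both",     # draw individual ICE curves plus their average
    subsample=100,   # plot only a random subset of applicants to keep it readable
    ax=ax
)
plt.title("ICE + PDP for AMT_CREDIT (XGBoost)")
plt.show()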
While PDPs are useful, they assume that features are independent, which is often not the case in real-world datasets. Accumulated Local Effects (ALE) solve this issue by computing the marginal effect of a feature while considering interactions with other features. ALE shows how predictions change when a feature varies, accounting for relationships between variables.
from alepython import ale_plot
import matplotlib.pyplot as plt
# Convert X_test to a DataFrame if it's not already
X_test_df = pd.DataFrame(X_test, columns=X.columns)
# Select the feature to analyze by name
feature_name = "AMT_CREDIT"
# Generate the ALE plot for the selected feature
ale_plot(
    xgb_model,     # the trained model
    X_test_df,     # feature matrix as a DataFrame with named columns
    feature_name,  # name of the feature to analyze
    bins=20        # number of bins for the ALE calculation
)
plt.title("Accumulated Local Effects (ALE) for AMT_CREDIT (XGBoost)")
plt.show()
ALE handles correlated features more gracefully, giving a plot that’s interpreted similarly to PDP but with less risk of false patterns.
Interpretation: ALE captures the true impact of AMT_CREDIT by ensuring correlated features do not distort the results. If the ALE curve is rising, it indicates that higher credit amounts contribute more toward default probability, while a plateau suggests no additional risk after a threshold. This is an important distinction from PDP, as ALE accounts for feature dependencies.
Sometimes, a bank manager says, “But what about Bill, who had a decent credit score and was still rejected?” or “How can I explain Mary’s acceptance to her?” That’s local territory: why a particular instance got its prediction.
LIME (Local Interpretable Model-agnostic Explanations) is a powerful technique specifically designed to provide clarity on individual predictions made by complex models, rather than focusing on the model's overall behavior. The process begins by slightly altering the values of various features related to a specific data point. This modification allows us to observe how these changes influence the model’s predictions.
By analyzing the relationship between the perturbed features and the resulting predictions, LIME fits a simplified, interpretable model, such as a linear regression or a small tree, around that data point. This surrogate approximates how the original model behaves locally, offering valuable insight into the factors driving that particular prediction.
import lime
import lime.lime_tabular
# Create a LIME explainer object based on training data (ensure you pass in the feature names and class names)
explainer = lime.lime_tabular.LimeTabularExplainer(
training_data=X_train,
feature_names=X.columns.tolist(),
class_names=['No Default', 'Default'],
mode='classification'
)
# Pick an instance from the test set to explain.
instance_idx = 10
instance = X_test[instance_idx]
# Generate explanation for the chosen instance.
exp = explainer.explain_instance(instance, xgb_model.predict_proba, num_features=10)
exp.show_in_notebook(show_table=True)
Feature Contributions (Bar Chart)
Example:
AMT_INCOME_TOTAL < 50,000 → +0.25 (increases default risk)
DAYS_EMPLOYED > 2,000 → -0.12 (decreases default risk)
EXT_SOURCE_3 < 0.4 → +0.30 (strongly increases default risk)
Interpretation: LIME highlights which features push a specific loan toward default or repayment. For instance, a low income and high loan amount may strongly push the prediction toward default, while a high credit score might reduce it. Since LIME builds a locally linear approximation, it works well for understanding decisions at a micro level, though results may vary if a different perturbation sample is used.
Caution: LIME can be unstable when the model’s local decision boundary is highly nonlinear; a different perturbation sample can yield a noticeably different explanation. Another method, SHAP, tends to be more stable but can require more computation.
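One quick way to gauge that instability is to rerun LIME on the same instance a few times and compare the resulting feature weights. A minimal sketch, reusing the explainer and instance defined above (and assuming the explainer was created without a fixed random_state, so its sampling follows NumPy's global seed):
import numpy as np
# Re-explain the same instance several times and collect the feature weights.
runs = []
for seed in [0, 1, 2]:
    np.random.seed(seed)  # LIME's perturbations draw from NumPy's global RNG here
    exp = explainer.explain_instance(instance, xgb_model.predict_proba, num_features=10)
    runs.append(dict(exp.as_list()))
# Features whose weight changes sign or magnitude across runs are the unstable ones.
for feature in runs[0]:
    weights = [run.get(feature, 0.0) for run in runs]
    print(feature, ["%+.3f" % w for w in weights])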
SHAP (SHapley Additive exPlanations) values quantify how much each feature contributes to a prediction, grounded in the principles of cooperative game theory. In contrast to LIME, which builds a local approximation around a specific instance, SHAP assigns feature importance in a mathematically consistent way, and it works both globally (across the entire dataset) and locally (for individual predictions), giving a fuller picture of the model’s behavior and the underlying data dynamics.
import shap
# Create a SHAP explainer for the XGBoost model
tree_explainer = shap.TreeExplainer(xgb_model)
shap_values = tree_explainer.shap_values(X_test)
# Summary plot: shows the global feature importance and impact.
shap.summary_plot(shap_values, X_test, feature_names=X.columns.tolist())
# For a single prediction (local explanation), use a force plot:
instance_shap = shap_values[0] # for the first test instance
shap.initjs()
shap.force_plot(tree_explainer.expected_value, instance_shap, X_test[0], feature_names=X.columns.tolist())
You’d see a force plot or bar chart showing which features push Bill’s default probability above or below the model’s baseline (expected value). SHAP is more mathematically grounded than LIME, but for large tree ensembles it can be slower to compute. The visuals, however, are quite polished, especially for a single instance.
In short: local explanations let you say, “Here’s why we predicted default for Bill.” This is essential for building trust and giving applicants actionable next steps.
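If an interactive force plot is awkward to share with, say, the compliance team, the same SHAP values can be turned into a plain ranked table for one applicant. A small sketch reusing shap_values and X_test_df from above:
import pandas as pd
# Rank the features by how strongly they pushed this applicant's prediction.
# Note: feature values are shown on the standardized scale used for training.
instance_idx = 0
local_contrib = pd.DataFrame({
    "Feature": X.columns,
    "Value": X_test_df.iloc[instance_idx].values,
    "SHAP value": shap_values[instance_idx],
})
local_contrib = local_contrib.reindex(
    local_contrib["SHAP value"].abs().sort_values(ascending=False).index
)
print(local_contrib.head(10))  # positive SHAP values push toward default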
For deep learning models, we need different techniques to explain predictions. These approaches leverage gradients to understand how changes in input affect the model’s decisions.
Integrated Gradients (IG) is an advanced technique designed to attribute a model's predictions to the individual input features that contribute to those predictions. The process involves systematically modifying the input data, starting from a baseline and gradually transitioning to the actual input, while tracking the changes in the model's output. By observing how the model's predictions shift in response to these alterations, IG provides insights into the significance of each feature in the decision-making process. This method is particularly beneficial for deep learning models, where conventional approaches to feature importance may fall short, enabling a clearer understanding of how inputs are translated into predictions.
import tensorflow as tf
def integrated_gradients(model, x, baseline=None, steps=50):
    """
    Computes Integrated Gradients for a single input sample `x` of shape (1, num_features).
    """
    if baseline is None:
        baseline = tf.zeros(shape=x.shape)  # a baseline of zeros
    # Interpolate between the baseline and the actual input.
    interpolated_inputs = tf.convert_to_tensor([
        baseline + (float(i) / steps) * (x - baseline)
        for i in range(steps + 1)
    ], dtype=tf.float32)
    # Reshape interpolated inputs to match the model input shape (batch_size, num_features)
    interpolated_inputs = tf.reshape(interpolated_inputs, [steps + 1, x.shape[1]])
    with tf.GradientTape() as tape:
        tape.watch(interpolated_inputs)
        predictions = model(interpolated_inputs)
    # Average the gradients along the interpolation path and scale by (x - baseline)
    grads = tape.gradient(predictions, interpolated_inputs)
    avg_grads = tf.reduce_mean(grads, axis=0)
    integrated_grad = (x - baseline) * avg_grads
    return integrated_grad.numpy()
# Ensure X_test is a NumPy array before conversion
import numpy as np
X_test_np = np.array(X_test) # Convert to NumPy array
instance_nn = tf.convert_to_tensor(X_test_np[0:1], dtype=tf.float32) # Ensure batch dimension
# Compute integrated gradients for this instance
attributions = integrated_gradients(dnn_model, instance_nn, baseline=tf.zeros_like(instance_nn), steps=50)
print("Integrated Gradients Attributions for instance 0:")
print(attributions)
import pandas as pd
# Convert the attributions array into a Pandas DataFrame for readability
feature_importance_df = pd.DataFrame({
"Feature": X.columns.tolist(), # Ensure this matches the original feature order
"Attribution": attributions[0] # Take the first row from the IG output
})
# Sort by absolute importance to see the most influential features
feature_importance_df["Absolute Attribution"] = feature_importance_df["Attribution"].abs()
feature_importance_df = feature_importance_df.sort_values(by="Absolute Attribution", ascending=False)
# Display the top 10 most important features
print(feature_importance_df.head(10))
Integrated Gradients is a method that helps us determine the input features that most significantly influence a model's predictions. By examining these inputs, we can gain valuable insights into how various factors—such as an individual's income level or the amount of a loan—affect the model's decision-making process. This understanding allows us to interpret the workings of the model more clearly and assess the relevance of each characteristic in shaping the outcome.
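A useful sanity check follows from the completeness property of Integrated Gradients: the attributions should approximately sum to the difference between the model output at the input and at the baseline, and a large gap usually means steps should be increased. A quick sketch using the objects defined above:
# Completeness check: sum of attributions ≈ f(x) - f(baseline)
baseline = tf.zeros_like(instance_nn)
pred_input = float(dnn_model(instance_nn).numpy()[0, 0])
pred_baseline = float(dnn_model(baseline).numpy()[0, 0])
print("Sum of attributions:", attributions.sum())
print("f(x) - f(baseline): ", pred_input - pred_baseline)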
Deep SHAP is an advanced extension of the SHAP framework specifically designed for deep neural networks. This powerful tool quantifies the impact of individual features on a model's predictions by calculating SHAP values. It does this by leveraging background data to create more accurate estimates, allowing for a clear understanding of how each feature influences the model's output. By providing insights into feature importance, Deep SHAP enhances interpretability in complex deep learning models.
background = X_train[np.random.choice(X_train.shape[0], 100, replace=False)]
deep_explainer = shap.DeepExplainer(dnn_model, background)
shap_values_nn = deep_explainer.shap_values(X_test[:10])
shap.summary_plot(shap_values_nn[0], X_test[:10])
Interpretation: Deep SHAP provides insight into how different features contribute to neural network predictions. It can be particularly useful for debugging and validating deep learning models.
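The same values can also be read locally for a single applicant. A minimal sketch pairing the first explained instance's SHAP values with the feature names (assuming, as in the summary plot above, that shap_values_nn is a list with one array for the single sigmoid output):
import numpy as np
import pandas as pd
# Local Deep SHAP explanation for the first of the ten explained instances.
sv = np.asarray(shap_values_nn[0]).reshape(len(X_test[:10]), -1)  # (instances, features)
dnn_local = pd.DataFrame({"Feature": X.columns, "SHAP value": sv[0]})
dnn_local = dnn_local.reindex(
    dnn_local["SHAP value"].abs().sort_values(ascending=False).index
)
print(dnn_local.head(10))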
Convolutional Neural Networks (CNNs) are widely used for analyzing images, such as signatures on loan applications. Let’s show a brief Grad-CAM snippet for Keras. Suppose we have a model model_cnn that outputs a probability for a single class (1 = forgery, 0 = not forgery):
import numpy as np
import tensorflow as tf
import cv2
# This function runs Grad-CAM on a Keras CNN model.
def grad_cam(model, img_array, last_conv_layer_name="conv2d"):
    # 1) Build a model that maps the input image to the activations of the last
    #    conv layer and to the final predictions.
    grad_model = tf.keras.models.Model(
        model.inputs,
        [model.get_layer(last_conv_layer_name).output, model.output]
    )
    with tf.GradientTape() as tape:
        conv_outputs, predictions = grad_model(img_array)
        class_idx = tf.argmax(predictions[0])
        loss = predictions[:, class_idx]
    # Gradient of the predicted class w.r.t. the conv layer outputs
    grads = tape.gradient(loss, conv_outputs)
    # Global average pooling on the gradients (one weight per channel)
    pooled_grads = tf.reduce_mean(grads, axis=(0, 1, 2))
    conv_outputs = conv_outputs[0]
    # Multiply each channel by its pooled gradient
    conv_outputs = conv_outputs * pooled_grads
    heatmap = tf.reduce_mean(conv_outputs, axis=-1)
    # Convert to numpy, keep positive contributions, and normalize to [0, 1]
    heatmap = heatmap.numpy()
    heatmap = np.maximum(heatmap, 0) / (np.max(heatmap) + 1e-10)
    return heatmap
Usage:
# Suppose some_image is preprocessed
heatmap = grad_cam(model_cnn, some_image, "last_conv_layer")
# Rescale heatmap to 0-255
heatmap = cv2.resize(heatmap, (some_image.shape[2], some_image.shape[1]))
heatmap = np.uint8(255 * heatmap)
# Apply a color map to the heatmap
colored_heatmap = cv2.applyColorMap(heatmap, cv2.COLORMAP_JET)
# Blend the heatmap with the original image (assumes the image is also in the 0-255 range)
alpha = 0.4
overlay_img = colored_heatmap * alpha + some_image[0]  # 0th image in the batch
Mock Explanation:
We see a color heatmap overlaid on the signature,
highlighting the lower-right region as crucial for detecting a possible forgery.
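To actually look at the overlay, a short matplotlib snippet like the one below works, assuming overlay_img ended up roughly in the 0-255 range:
import matplotlib.pyplot as plt
# Display the blended Grad-CAM overlay.
# Note: OpenCV color maps are BGR; if the colors look inverted, convert with
# cv2.cvtColor(colored_heatmap, cv2.COLOR_BGR2RGB) before blending.
plt.imshow(np.clip(overlay_img, 0, 255).astype("uint8"))
plt.axis("off")
plt.title("Grad-CAM overlay on the signature")
plt.show()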
When working with textual data, a Transformer model can generate an attention map that highlights the relationships between tokens. Let’s do a mini example with a Hugging Face BERT pipeline.
!pip install transformers bertviz
from transformers import AutoTokenizer, AutoModel
import torch
model_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model_bert = AutoModel.from_pretrained(model_name, output_attentions=True)
text = "I lost my job but I will repay soon"
inputs = tokenizer(text, return_tensors='pt')
outputs = model_bert(**inputs)
attentions = outputs.attentions # a tuple of attention maps, one per layer
print("Number of layers:", len(attentions))
print("Shape of one attention map:", attentions[0].shape)
Possible Output:
Number of layers: 12
Shape of one attention map: torch.Size([1, 12, seq_len, seq_len])
That is, a batch of one example and 12 attention heads per layer (with 12 layers in total), where seq_len is the number of tokens after tokenization, including [CLS] and [SEP].
You can visualize these with bertviz or custom code. For instance:
from bertviz import head_view
# head_view expects the tuple of attention maps and the corresponding tokens.
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
head_view(attentions, tokens)
Mock Explanation:
[HTML-based diagram shows each attention head focusing strongly on "lost" and "job" tokens.
Some heads link "repay" to "soon" etc.]
Caveat: Attention is not a perfect explanation, but it’s a helpful lens to see how tokens interact.
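As a small "custom code" example of that lens, the sketch below averages the last layer's attention over all heads and prints which tokens a chosen token attends to most strongly (here "job", which is a single wordpiece in bert-base-uncased):
# Map token ids back to readable wordpieces.
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
# The last layer's attention has shape (1, num_heads, seq_len, seq_len); average over heads.
avg_attn = attentions[-1][0].mean(dim=0)
# Attention paid by the token "job" to every token in the sentence.
query_idx = tokens.index("job")
weights = avg_attn[query_idx].detach().tolist()
for tok, w in sorted(zip(tokens, weights), key=lambda pair: -pair[1])[:5]:
    print(f"{tok:>8s}  {w:.3f}")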
If your model is remote (for example, a hosted Hugging Face pipeline endpoint), you might not have access to gradients or code. In that case you rely on model-agnostic methods like LIME or KernelSHAP, which only need a prediction function they can call with perturbed inputs:
Pseudo-code:
def huggingface_api_predict(texts):
    # Calls the remote HF model endpoint for a batch of texts and returns
    # an array of class probabilities (one row per input text, as LIME expects).
    pass

# Here `explainer` would be a text explainer, e.g. lime.lime_text.LimeTextExplainer
lime_exp = explainer.explain_instance(
    text_input,
    huggingface_api_predict,
    num_features=5
)
lime_exp.show_in_notebook()
Result: You’ll see which words or phrases contributed to the classification, even though you can’t see inside the model.
We’ve walked from tabular data (PDP, ALE, LIME, SHAP, Integrated Gradients) all the way to CNNs (Grad-CAM), Transformers (attention), and remote APIs (LIME with perturbation sampling). The main strategy is the same throughout: start with global explanations to validate the model’s overall behavior, then use local explanations to justify individual decisions, choosing model-agnostic or model-specific tools depending on how much access you have to the model.
With these code snippets and example outputs, you can adapt the techniques to your own dataset—whether it’s tabular, image, or text—and ensure you never again have to say, “I don’t know why the model did that.”
This article is written by Gaurav Sharma, a member of 123 of AI, and edited by the 123 of AI team.