Variable Importance

Data & models

This exercise shows an example of model-agnostic variable importance for a regression problem. The dataset we will be working with is Carseats from the ISLR library, which has already been used in some of the other exercises such as Ex_ML_Tree and Ex_ML_SVM. It is highly recommended that you try to implement parts of the exercise yourself before checking the answers.

  • Load the data from the ISLR package, then assign 90% of the data for training and the remainder for testing. Keep in mind that we make a training split before running the feature importance (instead of using the entire dataset) because we may want to re-train the model with fewer features, and we do not want to bias that decision by also including the test data.

  • Create three models: a linear regression, a regression tree and a support-vector machine.

Answer

library(rpart)
library(e1071)
library(dplyr)
library(ISLR)

# divide the data into training and testing sets
set.seed(2022)
carseats_index <- sample(x=1:nrow(Carseats), size=0.9*nrow(Carseats), replace=FALSE)
carseats_tr <- Carseats[carseats_index,]
carseats_te <- Carseats[-carseats_index,]

# define a linear regression
carseats_lm <- lm(Sales~., data=carseats_tr)

# define a regression tree (you can also use `adabag::autoprune()` here)
carseats_rt <- rpart(Sales~., data=carseats_tr)

# define a support-vector machine
carseats_svm <- svm(Sales ~ ., data=carseats_tr)

Python

# R: set up reticulate so the Python code below runs in the same session
library(reticulate)
use_condaenv("MLBA")
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.svm import SVR
from sklearn.metrics import mean_squared_error
from sklearn.datasets import fetch_openml

# Use the training and test sets created in R (no easy way to get the `Carseats` data in Python)
# Convert the categorical columns to one-hot encoded ones
carseats_train, carseats_test = pd.get_dummies(r.carseats_tr.copy()), pd.get_dummies(r.carseats_te.copy())

# Define a linear regression
carseats_lr = LinearRegression()
carseats_lr.fit(carseats_train.drop(columns=['Sales']), carseats_train['Sales'])
# Define a decision tree regressor
carseats_dtr = DecisionTreeRegressor()
carseats_dtr.fit(carseats_train.drop(columns=['Sales']), carseats_train['Sales'])
# Define a support-vector machine
carseats_svr = SVR()
carseats_svr.fit(carseats_train.drop(columns=['Sales']), carseats_train['Sales'])

Variable Importance using DALEX

Feature importance can be computed manually by permuting one column and computing the drop in the value of the loss function (e.g. RMSE for regression and accuracy for classification). This permutation can be repeated n times to calculate the mean drop in the loss value across runs. This form of variable importance is often referred to as “permutation feature importance”.
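
As a small sketch of what this looks like in code, here is a hand-rolled version for a single feature, using the linear model and training data created above (the function name is just illustrative):

# Manual permutation importance for one feature: permute the column,
# re-predict, and record the increase in RMSE relative to the unpermuted data
permutation_drop <- function(model, data, target, feature, n_rep = 10) {
  rmse <- function(obs, pred) sqrt(mean((obs - pred)^2))
  baseline <- rmse(data[[target]], predict(model, data))
  drops <- replicate(n_rep, {
    shuffled <- data
    shuffled[[feature]] <- sample(shuffled[[feature]])
    rmse(data[[target]], predict(model, shuffled)) - baseline
  })
  mean(drops)  # mean increase in RMSE over n_rep permutations
}

permutation_drop(carseats_lm, carseats_tr, "Sales", "Price")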

Fortunately, in R there are some nice packages that can help you with that. Three common libraries (among many) for model-agnostic feature importance in R are iml, DALEX and vip. If you would like to learn more about model-specific and model-agnostic feature importance and the various packages in R, you can refer to Chapter 16 of the book “Hands-on Machine Learning with R”. For the purpose of this part of the exercise, we will be using the DALEX library. Please note that the exercise has been largely inspired by the same Chapter 16 of that book. You can also check out other variations such as the iml and vip implementations.
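
For a quick taste of one of those alternatives, the vip package can be called directly on a fitted model (a minimal sketch; for an lm this shows model-specific importance rather than the permutation variant):

# model-specific importance plot for the linear model (vip must be installed)
vip::vip(carseats_lm)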

Creating an explain object

DALEX has an explain() function which creates an explainer object that allows you to do various kinds of explanatory analysis. Try reading about its required inputs by referring to its documentation (?DALEX::explain()) and to the book mentioned at the beginning of this exercise (“Hands-on Machine Learning with R”). Create one explainer per model for the training data and set the inputs that you need, which are model, data (a data frame of features) and y (a vector of observed outcomes). You can also give a caption to your model through the label argument.

library(dplyr)
library(DALEX)

x_train <- select(carseats_tr, -Sales)
y_train <- pull(carseats_tr, Sales)

explainer_lm <- DALEX::explain(model = carseats_lm,
                               data = x_train,
                               y = y_train,
                               label = "Linear Regression")

explainer_rt <- DALEX::explain(model = carseats_rt,
                               data = x_train,
                               y = y_train,
                               label = "Regression Tree")

explainer_svm <- DALEX::explain(model = carseats_svm,
                                data = x_train,
                                y = y_train,
                                label = "Support Vector Machine")

To calculate the feature importances using Python, we’ll be using the permutation_importance function from the sklearn library.

from sklearn.inspection import permutation_importance
from sklearn.preprocessing import StandardScaler

# Scale the features (relevant to the SVM): fit on the training features, apply to test
scaler = StandardScaler()
feature_cols = carseats_train.columns.drop('Sales')
carseats_train_scaled = carseats_train.copy()
carseats_test_scaled = carseats_test.copy()
carseats_train_scaled[feature_cols] = scaler.fit_transform(carseats_train[feature_cols])
carseats_test_scaled[feature_cols] = scaler.transform(carseats_test[feature_cols])

# Calculate feature importances for each model
importance_lr = permutation_importance(carseats_lr, carseats_train.drop(columns=['Sales']), carseats_train['Sales'], n_repeats=10, random_state=2022)
importance_dtr = permutation_importance(carseats_dtr, carseats_train.drop(columns=['Sales']), carseats_train['Sales'], n_repeats=10, random_state=2022)
importance_svr = permutation_importance(carseats_svr, carseats_train.drop(columns=['Sales']), carseats_train['Sales'], n_repeats=10, random_state=2022)
# Train the SVM model on scaled data
carseats_svr_scaled = SVR(kernel='linear')
carseats_svr_scaled.fit(carseats_train_scaled.drop(columns=['Sales']), carseats_train_scaled['Sales'])
importance_svr_scaled = permutation_importance(carseats_svr_scaled, carseats_train_scaled.drop(columns=['Sales']), carseats_train_scaled['Sales'], n_repeats=10, random_state=2022)

# Print the feature importances
importance_df = pd.DataFrame(data={
    'Feature': carseats_train.drop(columns=['Sales']).columns,
    'Linear Regression': importance_lr.importances_mean,
    'Decision Tree': importance_dtr.importances_mean,
    'Support Vector Machine (unscaled)': importance_svr.importances_mean,
    'Support Vector Machine (scaled)': importance_svr_scaled.importances_mean
})

print(importance_df)

You can observe the value of scaling for the SVM. The results seem to agree that Price is the most important feature.

Plotting the feature importance

Now that you have created the DALEX::explain objects, we will use another function called model_parts which takes care of the feature permutation. Try reading about the function DALEX::model_parts(). The main arguments that you need to provide to model_parts() are:

  • An explainer object (what you created above).

  • B which is the number of permutations (i.e. how many times you want to randomly shuffle each column).

  • The type of score you would like it to return (raw scores vs differences vs ratios); in this case we set it to ratio, and you can read more about the differences in the documentation.

  • The N argument, which determines how many observations are used for calculating the variable importance; setting it to N=NULL means that the entire training set is used.

  • There is also a loss_function, which by default is RMSE for regression (our case) and 1-AUC for classification, but there are a few more which you can find out about by referring to the documentation (e.g. through ?DALEX::loss_root_mean_square).

After assigning model_parts to a variable, try plotting each model to see the most important variables. What do you see? Are there important features that are in common?

calculate_importance <- function(your_model_explainer, n_permutations = 10) {
  imp <- model_parts(explainer = your_model_explainer,
                     B = n_permutations,
                     type = "ratio",
                     N = NULL)
  return(imp)
}

importance_lm  <- calculate_importance(explainer_lm)
importance_rt  <- calculate_importance(explainer_rt)
importance_svm <- calculate_importance(explainer_svm)

library(ggplot2)
plot(importance_lm, importance_rt, importance_svm) +
  ggtitle("Mean variable-importance ratio over 10 permutations", "")

Partial Dependence Plots (PDP)

Partial dependence plots (PDPs) show the marginal effect of one or two features on the predicted outcome of a machine learning model. They can be used to visualize and interpret the influence of selected features on the model’s predictions. We’ll continue using the same Carseats data. We first create PDPs for the Price and Advertising features using the SVM model carseats_svm created earlier:

library(caret)
library(tidyverse)
library(pdp)

pdp::partial(carseats_svm, pred.var = "Price", plot = TRUE)
pdp::partial(carseats_svm, pred.var = "Advertising", plot = TRUE)

What we see in the first plot is that, for this particular model, the higher the price, the lower the number of car seats sold (units), which makes sense. With regards to advertising, we see that the higher the local advertising budget for the company (see ?Carseats for more details), the higher the number of units sold; however, there is a plateau after which you cannot sell more units. This means that after a certain point, increasing the advertising budget may no longer bring any benefit in terms of units sold. It is important to note that in a PDP we study the decisions made by the model, not the data; however, we can always visualize our data to see whether it follows a similar trend:

# plot(Carseats$Advertising, Carseats$Sales, xlab = "Advertising Budget", ylab = "Sales Units")
ggplot(data = carseats_tr, aes(x = Advertising, y = Sales)) +
  geom_point() +
  labs(x = "Advertising Budget", y = "Sales Unit") +
  ggtitle("Relationship between Sales and Advertising Budget")

We can see that the trend is not very clear but may be headed in the same direction.

For categorical features, such as ShelveLoc, we can make a similar kind of PDP:

pdp::partial(carseats_svm, pred.var = "ShelveLoc", plot = TRUE, plot.engine = "ggplot")

From this plot, we can see that the better the quality of the shelving location, the higher the number of units sold (which again makes sense).

Limitations of PDPs

PDPs can have some limitations which you, as users, should be aware of. These limitations also partially apply to the other techniques presented during this lab:

  1. Assuming feature independence: PDPs assume that the features being plotted are independent of each other. This can lead to misleading results when there is a strong correlation or interaction between features, as the partial dependence function will not capture their combined effect accurately.

  2. Inaccurate representation of complex interactions: PDPs cannot represent high-order interactions or nonlinear relationships between features.

  3. Unreliable in the presence of outliers: PDPs can be sensitive to outliers and extreme values in the dataset, which may result in distorted representations of the feature’s impact on the model’s predictions. Proper preprocessing and outlier detection techniques should be employed to avoid this issue.

  4. Using the average values: PDPs use the mean expected value for the final prediction of the target feature. This is fine in most applications, but it may be inappropriate in some; individual conditional expectation (ICE) curves, sketched just below, show the per-observation curves behind that average.

  5. Computational burden: Generating PDPs can be computationally expensive, especially for high-dimensional datasets or complex models. It is important to weigh the benefits of this technique against the computational resources available.

Keep in mind these limitations when interpreting PDPs (and other techniques) and consider these limitations when drawing conclusions from them.
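
As a small, hedged illustration of point 4 above, the pdp package can also draw individual conditional expectation (ICE) curves, i.e. one curve per training observation, whose average is the PDP (assuming the carseats_svm model created earlier):

# ICE curves for Price: one curve per observation instead of only their average
pdp::partial(carseats_svm, pred.var = "Price", ice = TRUE, plot = TRUE)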

We can now create a two-way PDP to explore the interaction between Price and Advertising:

pdp::partial(carseats_svm, pred.var = c("Price", "Advertising"), plot = TRUE)

A combination of moderate advertising (around 2-18) and low prices (around 0-40) seems to produce the highest predicted sales according to the SVM model.

Note

Note that no Python implementation of PDP is shown here: sklearn.inspection.plot_partial_dependence, which was suitable for this task, has recently changed, and the new alternative does not always work well with categorical variables. However, if you are interested, feel free to look up existing alternatives.

LIME (Local Interpretable Model-agnostic Explanations)

LIME helps explain individual predictions of machine learning models by fitting a local, interpretable model around a specific data point. This allows for more transparency and understanding of the model’s behavior for individual instances.

SVM

The lime package does not support the SVM model from the e1071 package out of the box. You can see the list of supported models via ?model_type. There are two solutions to this:

  • Re-train your model with the caret library, which lime supports directly (building your models with caret can also be good practice).
# Load the lime library
library(lime)

# Create a caret model using a support vector machine
svm_caret_model <- caret::train(Sales ~ ., data = carseats_tr, method = "svmLinear2", trControl = trainControl(method = "none"))

# Select a few test instances to explain (Sales is the first column, so it is dropped)
test_instance <- carseats_te[1:4, -1]

# Create a lime explainer object for the SVM model
lime_svm_explainer <- lime::lime(carseats_tr[, -1], 
                                 svm_caret_model)

# Explain a prediction using lime
lime_svm_explanation <- explain(test_instance, lime_svm_explainer, n_features = 10)
plot_features(lime_svm_explanation)
  • Create custom predict_model and model_type methods for the SVM model.
# Custom predict_model function for SVM (for regression, lime asks for type = "raw";
# the "prob" branch is only used for classification models trained with probability = TRUE)
predict_model.svm <- function(x, newdata, type, ...) {
  if (type == "raw") {
    res <- predict(x, newdata = newdata, ...)
    return(data.frame(Response = res, stringsAsFactors = FALSE))
  } else if (type == "prob") {
    res <- predict(x, newdata = newdata, probability = TRUE, ...)
    prob <- attr(res, "probabilities")
    return(as.data.frame(prob, check.names = FALSE))
  }
}

# Custom model_type function for SVM
# e1071 stores the SVM type as an integer index: 0-2 are the classification
# variants, 3-4 are (eps- and nu-) regression
model_type.svm <- function(x, ...) {
  if (x$type < 3) {
    return("classification")
  } else {
    return("regression")
  }
}

# Create a LIME explainer for the SVM model
lime_svm_explainer <- lime(x_train, carseats_svm)

# Choose a specific instance from the test set to explain
test_instance <- carseats_te[1:4,-1]

# Generate explanations for the chosen instance
lime_svm_explanation <- explain(test_instance, lime_svm_explainer, n_features = 10)

# Visualize the explanation
plot_features(lime_svm_explanation)

You can see the prediction plot for 4 test observations, shown as several bar charts. On the y-axis, you see the features (and their intervals), while the x-axis shows the relative strength of each feature at a given value or interval. A positive value (blue) means the feature supports, i.e. increases, the predicted value, while a negative value (red) decreases it. Please note that the interpretation can be different for each observation (this explanation has been taken from this blog, which you can visit for further details).

We give the interpretation of the first test observation as an example. The first subplot shows that a price of less than 100 results in a higher purchased quantity than expected. Additionally, people between the ages of 40 and 55 were most likely to buy the seat, i.e. people who are neither too young nor too old. In a typical scenario we would generally expect younger people to buy car seats, but this is probably because of the high ages in our dataset (the first quartile of Age is around 40). If the price set by the competitor (CompPrice) is also low, it will negatively impact the sales units. Once again, please note that this is specific to the first observation (i.e., the first subplot).

The next element is the Explanation Fit. These values indicate how well LIME explains the model, similar to an R-squared in linear regression. Here the Explanation Fit only has values around 0.50-0.7 (50%-70%), which can be interpreted to mean that LIME can only explain a limited part of our model (in some cases, like the 3rd subplot, this value is extremely low). You may choose not to trust the LIME output since it only has a low Explanation Fit.
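
If you prefer to read these values rather than eyeballing the plot, the lime output is a data frame whose model_r2 column holds the explanation fit for each explained case (a small sketch using the explanation object created above):

# explanation fit (model_r2) per explained test observation
unique(lime_svm_explanation[, c("case", "model_r2")])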

We now provide a small demonstration of how this can be achieved in Python. Please note that the same logic for the interpretation (and explanation) of the R version applies here; therefore, the code is shorter and no further comment is provided on its output.

We use the lime package in Python (already installed in the lab setup). Then we can build our model in much the same way as in R:

from sklearn import svm
from sklearn.pipeline import make_pipeline
from lime import lime_tabular
import matplotlib.pyplot as plt

# carseats_train and carseats_test are the one-hot encoded DataFrames created earlier
X_train = carseats_train.drop(columns='Sales')
y_train = carseats_train['Sales']
X_test = carseats_test.drop(columns='Sales')

# Create a support vector machine model
svm_caret_model = svm.LinearSVR(random_state=2022)

# Train the model
svm_caret_model.fit(X_train, y_train)
# Predict on a test instance
test_instance = X_test.iloc[0:4]

# Create a lime explainer object for the SVM model
lime_svm_explainer = lime_tabular.LimeTabularExplainer(X_train.values,
                                                       feature_names=X_train.columns,
                                                       class_names=['Sales'],
                                                       mode='regression')

# Explain a prediction using lime
lime_svm_explanation = lime_svm_explainer.explain_instance(test_instance.values[0], svm_caret_model.predict, num_features=10)

# Plot the features
# plt.clf()
lime_svm_explanation.as_pyplot_figure()
plt.show()

Bonus: XGBoost

To give you an example for a classification problem, we can also train an XGBoost using the xgboost library:

library(xgboost)

# Load and prepare the data
carseats_df <- Carseats
carseats_df$High <- ifelse(carseats_df$Sales <= 8, "yes", "no")
carseats_df$High <- as.factor(carseats_df$High)
carseats_df$ShelveLoc <- as.factor(carseats_df$ShelveLoc)
carseats_df$Urban <- as.factor(carseats_df$Urban)
carseats_df$US <- as.factor(carseats_df$US)
 
# Prepare the data for xgboost (as shown in the boosting exercises)
xgb_data <- model.matrix(High ~ ., data = carseats_df)[,-1]
xgb_label <- as.numeric(carseats_df$High) - 1
xgb_dmatrix <- xgb.DMatrix(data = xgb_data, label = xgb_label)

# Train a gradient boosting model
set.seed(42)
carseats_xgb <- xgboost(data = xgb_dmatrix, nrounds = 100, objective = "binary:logistic", eval_metric = "logloss")

Identify instances with predicted probabilities close to 1, 0, and 0.5:

# LIME explanations for a gradient boosting model
xgb_preds <- predict(carseats_xgb, xgb_dmatrix)

which.max(xgb_preds)
which.min(xgb_preds)
which.min(abs(xgb_preds - 0.5))

Generate LIME explanations for the selected instances:

# before making the prediction, we need to also one-hot encode the categorical variables
to_explain <- data.frame(model.matrix(~.,data = carseats_df[c(120, 4, 60), -ncol(carseats_df)])[,-1])

# we can finally run LIME on our results
carseats_lime_xgb <- lime(data.frame(xgb_data), carseats_xgb, bin_continuous = TRUE, quantile_bins = FALSE)
carseats_expl_xgb <- lime::explain(to_explain, carseats_lime_xgb, n_labels = 1, n_features = 10)

Visualize the LIME explanations:

plot_features(carseats_expl_xgb, ncol = 2)

What can you observe from these subplots?

Your turn

After having done this for the training set, make a selection of the most important variables and run the model again. Would you go for this simpler model that has fewer features or the more complicated one? (Hint: you can compute some scores to see whether dropping features would justify the performance drop.)
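
A possible starting point is sketched below; the selected features are only illustrative (take them from your own importance results) and the comparison uses the test-set RMSE:

# illustrative sketch: re-fit the linear model on a handful of features that the
# importance plots suggested, then compare test RMSE against the full model
rmse <- function(obs, pred) sqrt(mean((obs - pred)^2))

carseats_lm_small <- lm(Sales ~ Price + ShelveLoc + CompPrice + Advertising + Age,
                        data = carseats_tr)

rmse(carseats_te$Sales, predict(carseats_lm, carseats_te))       # full model
rmse(carseats_te$Sales, predict(carseats_lm_small, carseats_te)) # reduced model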

As a final remark, the same kind of analysis can also be done for classification and different loss functions. As a good example, you can see a project done by some of the previous students of MScM_BA which also includes a section on feature importance.