library(rpart)
library(e1071)
library(dplyr)
library(ISLR)
# divide the data into training and testing sets
set.seed(2022)
carseats_index <- sample(x=1:nrow(Carseats), size=0.9*nrow(Carseats), replace=FALSE)
carseats_tr <- Carseats[carseats_index,]
carseats_te <- Carseats[-carseats_index,]

# define a linear regression
carseats_lm <- lm(Sales~., data=carseats_tr)

# define a regression tree (you can also use `adabag::autoprune()` here)
carseats_rt <- rpart(Sales~., data=carseats_tr)

# define a support-vector machine
carseats_svm <- svm(Sales ~ ., data=carseats_tr)
Variable Importance
Data & models
This exercise shows an example of model-agnostic variable importance for a regression problem. The dataset that we will be working with is Carseats from the ISLR library, which has already been used in earlier exercises such as Ex_ML_Tree and Ex_ML_SVM. It is highly recommended that you try to implement some parts of the exercise yourself before checking the answers.
Load the data from the ISLR package, then assign 90% of the data for training and the remainder for testing. Please keep in mind that we make a training split before running the feature importance (instead of using the entire dataset), because we “may” want to re-train the model with only a subset of the features and would rather not bias that decision by also including the testing data.
Create three models including a linear regression, a regression tree and a support-vector machine.
Answer
Python
library(reticulate)
use_condaenv("MLBA")
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.svm import SVR
from sklearn.metrics import mean_squared_error
from sklearn.datasets import fetch_openml
# Use the training and test sets created in R (no easy way to get the `Carseats` data in python)
# Convert the categorical columns to one-hot encoded ones
carseats_train, carseats_test = pd.get_dummies(r.carseats_tr.copy()), pd.get_dummies(r.carseats_te.copy())
# Define a linear regression
carseats_lr = LinearRegression()
carseats_lr.fit(carseats_train.drop(columns=['Sales']), carseats_train['Sales'])
# Define a decision tree regressor
carseats_dtr = DecisionTreeRegressor()
carseats_dtr.fit(carseats_train.drop(columns=['Sales']), carseats_train['Sales'])
# Define a support-vector machine
carseats_svr = SVR()
carseats_svr.fit(carseats_train.drop(columns=['Sales']), carseats_train['Sales'])
Variable Importance using DALEX
Feature importance can be computed manually by permuting one column and measuring the resulting change in the loss function (e.g. the increase in RMSE for regression or the drop in accuracy for classification). This permutation can be repeated n times to calculate a mean change in the loss value across runs. This form of variable importance is often referred to as “permutation feature importance”.
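To make the idea concrete, below is a minimal, manual sketch for a single feature and a single permutation, using the linear model and training set created above (the DALEX code later in this exercise automates this over all features and repetitions).
# manual permutation importance for one feature (illustrative sketch)
rmse <- function(obs, pred) sqrt(mean((obs - pred)^2))
baseline_rmse <- rmse(carseats_tr$Sales, predict(carseats_lm, carseats_tr))
permuted <- carseats_tr
permuted$Price <- sample(permuted$Price)  # shuffle one column, keep the rest intact
permuted_rmse <- rmse(carseats_tr$Sales, predict(carseats_lm, permuted))
# the larger the increase in RMSE, the more the model relies on `Price`
permuted_rmse - baseline_rmse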
Fortunately, in R there are some nice packages that can help you with that. Three common libraries (among many) for model-agnostic feature importance in R are iml, DALEX and vip. If you would like to learn about model-specific and model-agnostic feature importance and the various packages in R, you can refer to chapter 16 of the book “Hands-on Machine Learning with R”. For the purpose of this part of the exercise, we will be using the DALEX library. Please note that the exercise has been largely inspired by the same chapter 16 of the book. You can also check out other variations such as the iml and vip implementations.
Creating an explain object
DALEX has an explain() function which creates an explainer object and allows you to do various kinds of explanatory analysis. Try reading about the required inputs by referring to its documentation (?DALEX::explain()) and to the book mentioned at the beginning of this exercise (“Hands-on Machine Learning with R”). Create one explain object per model for the training data and set the inputs that you need, which are model, data (a data frame of features) and y (a vector of observed outcomes). You can also give a caption to your model through the label argument.
library(dplyr)
library(DALEX)
x_train <- select(carseats_tr, -Sales)
y_train <- pull(carseats_tr, Sales)

explainer_lm <- DALEX::explain(model = carseats_lm,
                               data = x_train,
                               y = y_train,
                               label = "Linear Regression")

explainer_rt <- DALEX::explain(model = carseats_rt,
                               data = x_train,
                               y = y_train,
                               label = "Regression Tree")

explainer_svm <- DALEX::explain(model = carseats_svm,
                                data = x_train,
                                y = y_train,
                                label = "Support Vector Machine")
To calculate the feature importances using Python, we’ll be using the permutation_importance function from the sklearn library.
from sklearn.inspection import permutation_importance
from sklearn.preprocessing import StandardScaler
# Scale the numeric features (relevant to SVM); the one-hot dummy columns are left as 0/1
scaler = StandardScaler()
num_cols = ['CompPrice', 'Income', 'Advertising', 'Population', 'Price', 'Age', 'Education']
carseats_train_scaled = carseats_train.copy()
carseats_test_scaled = carseats_test.copy()
train_scaled = scaler.fit_transform(carseats_train[num_cols])
test_scaled = scaler.transform(carseats_test[num_cols])
for i, col in enumerate(num_cols):
    carseats_train_scaled[col] = train_scaled[:, i]
    carseats_test_scaled[col] = test_scaled[:, i]
# Calculate feature importances for each model
importance_lr = permutation_importance(carseats_lr, carseats_train.drop(columns=['Sales']), carseats_train['Sales'], n_repeats=10, random_state=2022)
importance_dtr = permutation_importance(carseats_dtr, carseats_train.drop(columns=['Sales']), carseats_train['Sales'], n_repeats=10, random_state=2022)
importance_svr = permutation_importance(carseats_svr, carseats_train.drop(columns=['Sales']), carseats_train['Sales'], n_repeats=10, random_state=2022)

# Train the SVM model on scaled data
carseats_svr_scaled = SVR(kernel='linear')
carseats_svr_scaled.fit(carseats_train_scaled.drop(columns=['Sales']), carseats_train_scaled['Sales'])
importance_svr_scaled = permutation_importance(carseats_svr_scaled, carseats_train_scaled.drop(columns=['Sales']), carseats_train_scaled['Sales'], n_repeats=10, random_state=2022)
# Print the feature importances
importance_df = pd.DataFrame(data={
    'Feature': carseats_train.drop(columns=['Sales']).columns,
    'Linear Regression': importance_lr.importances_mean,
    'Decision Tree': importance_dtr.importances_mean,
    'Support Vector Machine (unscaled)': importance_svr.importances_mean,
    'Support Vector Machine (scaled)': importance_svr_scaled.importances_mean
})
print(importance_df)
You can observe the value of scaling for the SVM. The results seem to agree that Price is the most important feature.
Plotting the feature importance
Now that you have created the DALEX::explain objects, we will use another function called model_parts which takes care of the feature permutation. Try reading about the function DALEX::model_parts(). The main arguments that you need to provide to it are:
- An explainer object (what you created above).
- B, which is the number of permutations (i.e. how many times you want to randomly shuffle each column).
- The type of scores you would like it to return (raw score vs differences vs ratios), which in this case we set to ratio; you can read more about the differences in the documentation.
- The N argument, which determines how many samples you would like to use for calculating the variable importance; setting it to N=NULL means that the entire training set is used.
- There is also a loss_function, which by default is RMSE for regression (our case) and 1-AUC for classification, but there are a few more which you can find out about by referring to the documentation (e.g. through ?DALEX::loss_root_mean_square).
After assigning the output of model_parts to a variable, try plotting each model to see the most important variables. What do you see? Are there important features that the models have in common?
calculate_importance <- function(your_model_explainer, n_permutations = 10) {
  imp <- model_parts(explainer = your_model_explainer,
                     B = n_permutations,
                     type = "ratio",
                     N = NULL)
  return(imp)
}

importance_lm <- calculate_importance(explainer_lm)
importance_rt <- calculate_importance(explainer_rt)
importance_svm <- calculate_importance(explainer_svm)

library(ggplot2)
plot(importance_lm, importance_rt, importance_svm) +
ggtitle("Mean variable-importance ratio over 10 permutations", "")
Partial Dependence Plots (PDP)
Partial dependence plots (PDPs) show the marginal effect of one or two features on the predicted outcome of a machine learning model. They can be used to visualize and interpret the influence of selected features on the model’s predictions. We’ll continue using the same Carseats data. We first create PDPs for the Price and Advertising features using the SVM model carseats_svm created earlier:
library(caret)
library(tidyverse)
library(pdp)
pdp::partial(carseats_svm, pred.var = "Price", plot = TRUE)
pdp::partial(carseats_svm, pred.var = "Advertising", plot = TRUE)
What we see in the first plot is that, for this particular model, the higher the price, the lower the number of carseats sold (units), which makes sense. With regards to advertising, we see that the higher the local advertising budget of the company (see ?Carseats for more details), the higher the number of units sold; however, there is a plateau after which you cannot sell more units. This means that after a certain point, increasing the advertising budget may no longer bring any benefit in terms of units sold. It is important to note that in a PDP we study the decisions made by the model and not the data itself; however, we can always visualize our data to see whether they follow a similar trend:
# plot(Carseats$Advertising, Carseats$Sales, xlab = "Advertising Budget", ylab = "Sales Units")
ggplot(data = carseats_tr, aes(x = Advertising, y = Sales)) +
geom_point() +
labs(x = "Advertising Budget", y = "Sales Unit") +
ggtitle("Relationship between Sales and Advertising Budget")
We can see that the trend is not very clear but may be headed in the same direction.
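If you want to put a rough number on this, a quick (purely linear) check on the training data is:
# linear correlation between advertising budget and sales in the training split
cor(carseats_tr$Advertising, carseats_tr$Sales)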
For categorical features, such as ShelveLoc, we can make a similar kind of PDP:
pdp::partial(carseats_svm, pred.var = "ShelveLoc", plot = TRUE, plot.engine = "ggplot")
From this plot, we can see that the better the quality of the shelving location, the higher the number of units sold (which again makes sense).
PDPs have some limitations which you, as users, should be aware of. These limitations also partially apply to the other techniques presented during the lab:
Assuming feature independence: PDPs assume that the features being plotted are independent of each other. This can lead to misleading results when there is a strong correlation or interaction between features, as the partial dependence function will not capture their combined effect accurately.
Inaccurate representation of complex interactions: PDPs cannot represent high-order interactions or nonlinear relationships between features.
Unreliable in the presence of outliers: PDPs can be sensitive to outliers and extreme values in the dataset, which may result in distorted representations of the feature’s impact on the model’s predictions. Proper preprocessing and outlier detection techniques should be employed to avoid this issue.
Using the average values: PDPs use the mean expected value for the final prediction of the target feature. This is fine in most applications, but it may be inappropriate in some.
Computational burden: Generating PDPs can be computationally expensive, especially for high-dimensional datasets or complex models. It is important to weigh the benefits of this technique against the computational resources available.
Keep these limitations in mind when interpreting PDPs (and the other techniques) and when drawing conclusions from them.
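As an aside on the averaging point: the pdp package can also draw individual conditional expectation (ICE) curves, one curve per observation instead of only the average. Assuming your version of pdp supports the ice argument, a minimal sketch is:
# ICE curves for Price: one curve per training observation instead of the averaged PDP
pdp::partial(carseats_svm, pred.var = "Price", ice = TRUE, plot = TRUE)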
We can now create a two-way PDP to explore the interaction between Price and Advertising:
pdp::partial(carseats_svm, pred.var = c("Price", "Advertising"), plot = TRUE)
A combination of moderate advertising (around 2-18) and low prices (from around 0-40) seems to produce the highest sales units according to the SVM model.
You may note that no Python implementation of PDP has been shown, since sklearn.inspection.plot_partial_dependence, which was suitable for this task, has recently changed, and the new alternative does not always work well for categorical variables. However, if you are interested, feel free to look up existing alternatives.
LIME (Local Interpretable Model-agnostic Explanations)
LIME helps explain individual predictions of machine learning models by fitting a local, interpretable model around a specific data point. This allows for more transparency and understanding of the model’s behavior for individual instances.
SVM
The lime package does not support the SVM model from the e1071 package out of the box. You can see the list of supported models via ?model_type. There are a couple of solutions to this:
- Re-train your model with the caret library, which we can then work with directly (it may also be good practice to build your models with caret in general).
# Load the lime library
library(lime)
# Create a caret model using a support vector machine
svm_caret_model <- caret::train(Sales ~ ., data = carseats_tr, method = "svmLinear2", trControl = trainControl(method = "none"))

# Predict on a test instance
test_instance <- carseats_te[1:4, -1]

# Create a lime explainer object for the SVM model
lime_svm_explainer <- lime::lime(carseats_tr[, -1], svm_caret_model)

# Explain a prediction using lime
lime_svm_explanation <- explain(test_instance, lime_svm_explainer, n_features = 10)
plot_features(lime_svm_explanation)
- Create custom predict_model and model_type methods for the SVM model.
# Custom predict_model function for SVM
predict_model.svm <- function(x, newdata, type, ...) {
  if (type == "raw") {
    res <- predict(x, newdata = newdata, ...)
    return(data.frame(Response = res, stringsAsFactors = FALSE))
  } else if (type == "prob") {
    res <- predict(x, newdata = newdata, ...)
    prob <- kernlab::kernel(x, newdata, j = -1)
    return(as.data.frame(prob, check.names = FALSE))
  }
}
# Custom model_type function for SVM
model_type.svm <- function(x, ...) {
  if (x$type == "C-classification") {
    return("classification")
  } else {
    return("regression")
  }
}
# Create a LIME explainer for the SVM model
lime_svm_explainer <- lime(x_train, carseats_svm)

# Choose a specific instance from the test set to explain
test_instance <- carseats_te[1:4, -1]

# Generate explanations for the chosen instance
lime_svm_explanation <- explain(test_instance, lime_svm_explainer, n_features = 10)
# Visualize the explanation
plot_features(lime_svm_explanation)
You can see the prediction plot for 4 test observations, shown as several bar charts. On the y-axis you see the features (and their intervals), while the x-axis shows the relative strength of each feature at a given value or interval. A positive value (blue color) means that the feature supports or increases the value of the prediction, while a negative value (red color) has the opposite effect and decreases the prediction value. Please note that the interpretation for each observation can be different (this explanation has been taken from this blog, which you can visit for further details).
We give the interpretation of the first test observation as an example. The first subplot shows that a price of less than 100 results in a higher purchased quantity than expected. Additionally, people between the ages of 40 and 55 were most likely to buy the seat, i.e. people who are neither too young nor too old. In a typical scenario we would generally expect younger people to buy car seats, but this result is probably due to the high ages in our dataset (the first quartile of Age is around 40). If the price charged by the competitor (CompPrice) is also low, it will impact the sales units badly. Once again, please note that this is specific to the first observation (i.e., the first subplot).
The next element is the Explanation Fit. These values indicate how well LIME explains the model, similar to the R-squared of a linear regression. Here we see that the Explanation Fit only has values around 0.50-0.7 (50%-70%), which can be interpreted as LIME only being able to explain a small part of our model (in some cases, like the 3rd subplot, this value is extremely low). You may therefore choose not to trust the LIME output, since it only has a low Explanation Fit.
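If you prefer the exact fit values over reading them off the plot headers, they are stored in the explanation object. Assuming the lime output keeps its usual model_r2 column, a quick check is:
# explanation fit (local R-squared) for each explained case
unique(as.data.frame(lime_svm_explanation)[, c("case", "model_r2")])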
We’ll now provide a small demonstration of how this can be achieved in Python. Please note that the same logic for the interpretation (and explanation) of the R version applies here; therefore, the code is shorter and no further comment is provided on its output.
We use the lime package in Python (already installed in the lab setup). Then we can run our model in the same way as in R:
from sklearn import svm
from sklearn.pipeline import make_pipeline
from lime import lime_tabular
import matplotlib.pyplot as plt
# Assuming carseats_tr and carseats_te are already defined as pandas DataFrames
X_train = carseats_train.drop(columns='Sales')
y_train = carseats_train['Sales']
X_test = carseats_test.drop(columns='Sales')

# Create a support vector machine model
svm_caret_model = svm.LinearSVR(random_state=2022)

# Train the model
svm_caret_model.fit(X_train, y_train)
# Predict on a test instance
test_instance = X_test.iloc[0:4]

# Create a lime explainer object for the SVM model
lime_svm_explainer = lime_tabular.LimeTabularExplainer(X_train.values,
                                                       feature_names=X_train.columns,
                                                       class_names=['Sales'],
                                                       mode='regression')

# Explain a prediction using lime
lime_svm_explanation = lime_svm_explainer.explain_instance(test_instance.values[0], svm_caret_model.predict, num_features=10)
# Plot the features
# plt.clf()
lime_svm_explanation.as_pyplot_figure()
plt.show()
Bonus: XGBoost
To give you an example for a classification problem, we can also train an XGBoost model using the xgboost library:
library(xgboost)
# Load and prepare the data
carseats_df <- Carseats
carseats_df$High <- ifelse(carseats_df$Sales <= 8, "yes", "no")
carseats_df$High <- as.factor(carseats_df$High)
carseats_df$ShelveLoc <- as.factor(carseats_df$ShelveLoc)
carseats_df$Urban <- as.factor(carseats_df$Urban)
carseats_df$US <- as.factor(carseats_df$US)

# Prepare the data for xgboost (as shown in the boosting exercises)
xgb_data <- model.matrix(High ~ ., data = carseats_df)[,-1]
xgb_label <- as.numeric(carseats_df$High) - 1
xgb_dmatrix <- xgb.DMatrix(data = xgb_data, label = xgb_label)

# Train a gradient boosting model
set.seed(42)
carseats_xgb <- xgboost(data = xgb_dmatrix, nrounds = 100, objective = "binary:logistic", eval_metric = "logloss")
Identify instances with predicted probabilities close to 1, 0, and 0.5:
# LIME explanations for a gradient boosting model
xgb_preds <- predict(carseats_xgb, xgb_dmatrix)
which.max(xgb_preds)
which.min(xgb_preds)
which.min(abs(xgb_preds - 0.5))
Generate LIME explanations for the selected instances:
# before making the prediction, we need to also one-hot encode the categorical variables
to_explain <- data.frame(model.matrix(~., data = carseats_df[c(120, 4, 60), -ncol(carseats_df)])[,-1])

# we can finally run LIME on our results
carseats_lime_xgb <- lime(data.frame(xgb_data), carseats_xgb, bin_continuous = TRUE, quantile_bins = FALSE)
carseats_expl_xgb <- lime::explain(to_explain, carseats_lime_xgb, n_labels = 1, n_features = 10)
Visualize the LIME explanations:
plot_features(carseats_expl_xgb, ncol = 2)
What can you observe from these subplots?
Your turn
After having done this for the training set, make a selection of the most important variables and run the model again with only those. Would you go for this simpler model that has fewer features or for the more complicated one? (Hint: you can compute some scores on the test set to see whether dropping features is worth any resulting drop in performance.)
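As a starting point, here is a minimal sketch; the variables kept in the reduced model are only a hypothetical selection, so replace them with whatever your importance plots suggest.
# hypothetical reduced model that keeps a subset of the features
carseats_svm_small <- svm(Sales ~ Price + ShelveLoc + CompPrice + Advertising + Age,
                          data = carseats_tr)

# compare test-set RMSE of the full and the reduced model
rmse <- function(obs, pred) sqrt(mean((obs - pred)^2))
rmse(carseats_te$Sales, predict(carseats_svm, carseats_te))        # full model
rmse(carseats_te$Sales, predict(carseats_svm_small, carseats_te))  # reduced model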
As a final remark, the same kind of analysis can also be done for classification and with different loss functions. As a good example, you can see a project done by some of the previous students of MScM_BA which also includes a section on feature importance.
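For instance, permutation importance for the XGBoost classifier from the bonus part could be computed with a classification loss. This is only a sketch and assumes that your DALEX version accepts the xgboost model and matrix data directly and provides loss_one_minus_auc:
# explainer for the classification model, re-using xgb_data / xgb_label from the bonus part
explainer_xgb <- DALEX::explain(model = carseats_xgb,
                                data = xgb_data,
                                y = xgb_label,
                                label = "XGBoost")

importance_xgb <- model_parts(explainer_xgb,
                              B = 10,
                              type = "ratio",
                              loss_function = DALEX::loss_one_minus_auc)
plot(importance_xgb)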