library(ISLR)
library(dplyr)
<- Carseats %>% mutate(SaleHigh=ifelse(Sales > 7.5, "Yes", "No"))
MyCarseats <- MyCarseats %>% select(-Sales)
MyCarseats $SaleHigh <- as.factor(MyCarseats$SaleHigh)
MyCarseats
set.seed(123) # for reproducibility
<- sample(x=1:nrow(MyCarseats), size=0.8*nrow(MyCarseats), replace=FALSE)
index_tr <- MyCarseats[index_tr,]
df_tr <- MyCarseats[-index_tr,] df_te
Models: Support Vector Machine
Support Vector Machine (SVM)
In this exercise, we apply the Support Vector Machines (SVM) to the classification problem of the data set Carseats
from the package ISLR
(already used with CART).
SVM are often difficult to interpret. They are typically what we call “a black box” model, that is, a model which provides predictions without understanding what is behind (except if you have followed the course…) and how the features influences these predictions. Therefore, and because we do not want to get bored, we will also use the caret::train()
function to make our first real machine learning application. Note that this application can be done also with the models that have been seen previously.
Prepare the data
The lines below are just a repetition/reminder of the CART series to have the data ready. The only difference is the cast of SalesHigh
into factors rather than characters because the SVM functions require it.
To proceed we first have to build the data (below is the) Install the package ISLR
in order to access the data set Carseats
. Use ?Carseats
to read its description.
To apply a classification of the sales, we first create a categorical outcome SaleHigh
which equals “Yes” if Sales
> 7.5 and “No” otherwise. Then we create a data frame MyCarseats
containing SaleHigh
and all the features of Carseats
except Sales
. Finally, split MyCarseats
into a training and a test set (2/3 vs 1/3). Below we call them df_tr
and df_te
.
Linear SVM
The e1071::svm()
function of the e1071
package allows to fit SVM to the data with several possible kernels. Below, it is the linear kernel. We fit a linear kernel and check the predictions on the test set.
library(e1071)
set.seed(123)
<- svm(SaleHigh ~ ., data=df_tr, kernel="linear")
carseats_svm
carseats_svm<- predict(carseats_svm, newdata = df_te)
carseats_svm_pred
table(Pred=carseats_svm_pred, obs=df_te$SaleHigh)
To obtain a better insight about the prediction quality, we will use the accuracy measure. It is simply the proportion of correct predictions. This can be conveniently obtained (and much more) from the function caret::confusionMatrix()
of the library caret
. In the parameters, data
are the predictions, and reference
are the observations.
library(caret)
confusionMatrix(data=carseats_svm_pred, reference = df_te$SaleHigh )
We should only focus on the accuracy for now (the other measures will be studied later). In our run, it was ≈\(86\%\). That number is obtained using the default parameter, cost \(C=1\).
# The usual loading of our environment
library(reticulate)
use_condaenv("MLBA", required = TRUE)
Similar to the CART exercises, we use the df_train
and df_te
created in R to carry out our SVM training. Once again, we continue using the sklearn
library. First, we copy our dataset from the R variable to a python variable. Then we use LabelEncoder()
from sklearn.preprocessing
to convert categorical data into a numeric form, which many sklearn
machine learning algorithms require. The encoding assigns a unique numerical value to each categorical value, which can sometimes help the performance. In the case of caret::train()
, the function handles this transformation automatically. Also, we standardize the data in the case of python for faster computations using StandardScaler
, because of the same numerical stability mentioned for CART. After this, we divide the training and test sets into predictors (e.g., X_train
) and the outcome (e.g., y_train
) and initialize our linear kernel SVM to fit the model to the data.
We will use classification_report
and accuracy_score
from sklearn.metrics
to get more information on the performance. (you could also use confusion_matrix
from the same module.)
import pandas as pd
import numpy as np
from sklearn import svm
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.metrics import classification_report, accuracy_score, confusion_matrix
# We copy the training and testing datasets from R to Python.
= r.df_tr.copy()
carset_tr_py = r.df_te.copy()
carset_te_py
# We encode categorical variables as numeric, which is necessary for SVM.
= LabelEncoder()
le = ["ShelveLoc", "Urban", "US", "SaleHigh"]
cat_vars for var in cat_vars:
# Fit the LabelEncoder to the training set and transform the training and testing sets
# We need to use the same encoder for both sets to ensure consistency
= le.fit_transform(carset_tr_py[var])
carset_tr_py[var] = le.transform(carset_te_py[var])
carset_te_py[var]
# Split the data into training and testing sets
= carset_tr_py.drop(columns=["SaleHigh"]), carset_tr_py["SaleHigh"]
X_train, y_train = carset_te_py.drop(columns=["SaleHigh"]), carset_te_py["SaleHigh"]
X_test, y_test
# Standardize only the continuous variables
= ["CompPrice", "Income", "Advertising", "Population", "Price", "Age", "Education"]
cont_vars = StandardScaler()
scaler = scaler.fit_transform(X_train[cont_vars])
X_train[cont_vars] = scaler.transform(X_test[cont_vars])
X_test[cont_vars]
# To speed up the operation you can also transform the inputs
# X_train = X_train.to_numpy()
# y_train = y_train.to_numpy()
# Set random seed for reproducibility
123)
np.random.seed(
# Initialize a linear SVM model
= svm.SVC(kernel="linear")
carseats_svm_py
# Fit the SVM model to the training data
carseats_svm_py.fit(X_train, y_train)
SVC(kernel='linear')In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
SVC(kernel='linear')
# Print the model parameters
print(carseats_svm_py)
# Make predictions on the test set
= carseats_svm_py.predict(X_test)
carseats_svm_pred_py
# Print a confusion matrix
print(pd.crosstab(index=y_test, columns=carseats_svm_pred_py, rownames=['True'], colnames=['Predicted']))
# Alternatively, print a confusin matrix using `sklearn.metrics`
# print(confusion_matrix(y_test, carseats_svm_pred_py))
# Compute metrics
= classification_report(y_test, carseats_svm_pred_py, target_names=['No', 'Yes'])
report_linear_py = accuracy_score(y_test, carseats_svm_pred_py)
accuracy_linear_py
print(report_linear_py)
print("Overall accuracy:", accuracy_linear_py)
This performance is worse than our model in R and requires more tuning (due to differences in default values and implementations of the functions).
Radial basis SVM
We try now with a radial basis kernel (the default).
set.seed(123)
<- svm(SaleHigh ~ ., data=df_tr, kernel="radial")
carseats_rb
carseats_rb<- predict(carseats_rb, newdata = df_te)
carseats_rb_pred confusionMatrix(data=carseats_rb_pred, reference = df_te$SaleHigh )
The accuracy is now ≈\(81\%\). This shows how important that choice can be. In the same vein, we are now relying on the default parameters of the function. For the cost \(C\) it is \(1\), for the parameter gamma
of the kernel, it is 1/(data dimension) (see ?svm
). The train
function allows us to make a better selection.
This is same as the linear kernel, and we just need to change the kernel parameter value from linear
to rbf
. Here’s the radial version:
123) # once again, for reproducibility
np.random.seed(
# Use a radial kernel in the SVM classifier
= svm.SVC(kernel="rbf")
carseats_svm_rb_py carseats_svm_rb_py.fit(X_train, y_train)
SVC()In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
SVC()
print(carseats_svm_rb_py)
# Predict the values with the radial approach
= carseats_svm_rb_py.predict(X_test)
carseats_svm_pred_rb_py
# Compute metrics
= classification_report(y_test, carseats_svm_pred_rb_py, target_names=['No', 'Yes'])
report_radial_py = accuracy_score(y_test, carseats_svm_pred_rb_py)
accuracy_radial_py
# Print metrics
print(pd.crosstab(index=y_test, columns=carseats_svm_pred_rb_py, rownames=['True'], colnames=['Predicted']))
print(report_radial_py)
print("Overall accuracy:", accuracy_radial_py)
In both cases (R & python), with the default parameters, the linear kernel seems to do better than the radial one.
Tuning the hyperparameter
In R, caret
uses various libraries to run the svm models (check for yourself by searching for support vector machine
here). For instance, calling svmLinear
or svmRadial
uses the library kernlab
, and the kernlab::ksvm()
function.
The C
hyperparameter (from kernlab::ksvm()
) accounts for the cost argument and controls the trade-off between allowing misclassifications in the training set and finding a decision boundary that generalizes well to new data. A larger cost value leads to a smaller margin and a more complex model that may overfit the data. On the other hand, a smaller cost value leads to a larger margin and a simpler model that may underfit the data.
Similarly to EX_ML_NN
, to select the good hyperparameters, we build a search grid and fit the model with each possible value in the grid. Then, the best model is chosen among all the combinations of the hyperparameters.
As a reminder, the train
function from caret
. Has:
- a formula.
- a dataset.
- a method (i.e. the model which in this case is SVM with linear kernel).
- a training control procedure.
Linear SVM
<- trainControl(method = "cv", number=10)
trctrl set.seed(143)
<- train(SaleHigh ~., data = df_tr, method = "svmLinear",
svm_Linear trControl=trctrl)
svm_Linear
For now, the validation accuracy is very high (≈\(89\%\)). This is normal since this accuracy is computed on the training set.
We now supply a grid of values for the cost that we want to try and pass to the argugmenttuneGrid
. Be patient, it may take time.
<- expand.grid(C = c(0.01, 0.1, 1, 10, 100, 1000))
grid
gridset.seed(143)
<- train(SaleHigh ~., data = df_tr, method = "svmLinear",
svm_Linear_Grid trControl=trctrl,
tuneGrid = grid)
svm_Linear_Gridplot(svm_Linear_Grid)
$bestTune svm_Linear_Grid
We see that setting the cost to 1 provides the best model. The accuracy apparently reaches a plateau at this value. This is same as our cost parameter in section 1.2.
Radial Basis SVM
The sigma
hyperparameter (also from kernlab::ksvm()
) controls the width of the radial basis function kernel, which is used to transform the input data into a higher-dimensional feature space. A larger value of sigma corresponds to a narrower kernel and a more complex model, while a smaller value corresponds to a wider kernel and a simpler model.
We repeat the procedure for SVM with a radial basis kernel. Here, there are two parameters ( sigma
and C
) to tune. The grid choice is rather arbitrary (often the result of trials and errors), and very few general useful guidelines exist. The code below may take a few minutes to run.
<- expand.grid(sigma = c(0.01, 0.02, 0.05, 0.1),
grid_radial C = c(1, 10, 100, 500, 1000))
grid_radialset.seed(143)
<- train(SaleHigh ~., data = df_tr, method = "svmRadial",
svm_Radial_Grid trControl=trctrl,
tuneGrid = grid_radial)
svm_Radial_Gridplot(svm_Radial_Grid)
$bestTune svm_Radial_Grid
The optimal model from this search is with sigma = 0.01
and C=100
.
We can use GridSearchCV
from sklearn.model_selection
to achieve the same goal in python and set the argument for cross-validation (cv
to achieve the same results) and tune both types of kernels at once. sklearn.svm.SVC()
does contain the two arguments for the C
and sigma
but the relationship is slightly different and we’ll explain below. Also, please note in this approach, the linear
kernel by default ignores the sigma
values.
As mentioned earlier, in sklearn.svm.SVC()
, the equivalent parameter to cost is also C
, which is straightforward. Still, the equivalent parameter to sigma
is a bit trickier (called gamma
in sklearn.svm.SVC()
), which also controls the width of the radial basis function kernel. However, the relationship between gamma
and sigma
differs, and the two parameters cannot be directly compared. In particular, gamma
is defined as the inverse of the width of the kernel, i.e., gamma = 1/(2 * sigma**2)
.
For simplicity’s sake, we’ll directly use the gamma
with the same values as sigma
; however, you can always run 'gamma': [1/(2*sigma**2) for sigma in [0.01, 0.02, 0.05, 0.1]]
to get similar values (although you would want to round as this division results in non-terminating repeating decimal numbers).
The code will take a few minutes to run (longer than the R version).
from sklearn.model_selection import GridSearchCV
123) # for reproducibility
np.random.seed(
# Define the grid of hyperparameters to search over
= {'C': [1, 10, 100, 500, 1000],
param_grid 'gamma': [0.01, 0.02, 0.05, 0.1],
# 'gamma': [round(1/(2*sigma**2),2) for sigma in [0.01, 0.02, 0.05, 0.1]],
'kernel': ['linear', 'rbf']}
# Perform grid search with cross-validation
= GridSearchCV(svm.SVC(), param_grid, cv=10, scoring='accuracy', n_jobs = 1) # you can also set n_jobs = -1 to use all the cores and obtain the results faster (but unfortunately atm it only works on Mac/Linux and not Windows OS with `reticulate`)
grid_search # for more info, please see https://github.com/rstudio/reticulate/issues/1346
grid_search.fit(X_train, y_train)
GridSearchCV(cv=10, estimator=SVC(), n_jobs=1, param_grid={'C': [1, 10, 100, 500, 1000], 'gamma': [0.01, 0.02, 0.05, 0.1], 'kernel': ['linear', 'rbf']}, scoring='accuracy')In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
GridSearchCV(cv=10, estimator=SVC(), n_jobs=1, param_grid={'C': [1, 10, 100, 500, 1000], 'gamma': [0.01, 0.02, 0.05, 0.1], 'kernel': ['linear', 'rbf']}, scoring='accuracy')
SVC()
SVC()
print("Best hyperparameters: ", grid_search.best_params_)
print("Best score: ", grid_search.best_score_)
Note that the accuracy of 85% is on the training set. The best parameters are C = 500, gamma = 0.01, kernel = rbf. We will use a new plotting library for seeing this evolution called searborn
, which offers some great visualization tools.
Note that the accuracy of is on the training set. The best parameters are C = 500, gamma = 0.01, kernel = rbf. We will use a new plotting library for seeing this evolution called searborn
, which offers some great visualization tools.
We use the installed seaborn
package from our Setup which allows for grouping our hyperparameters and displaying them with a heatmap.
import seaborn as sns
import matplotlib.pyplot as plt
# Assuming `results` is your DataFrame from `grid_search.cv_results_`
= pd.DataFrame(grid_search.cv_results_)
results
# Before performing the pivot or groupby operation, drop the 'params' column (it's a dictionary)
= results.drop(columns=['params'])
results
# Now, you can safely group by 'param_C', 'param_gamma', and 'param_kernel' to calculate mean test scores
= results.pivot_table(index=['param_C', 'param_gamma'], columns='param_kernel', values='mean_test_score')
grouped_results
plt.clf()
# Create the heatmap
= sns.heatmap(grouped_results, annot=True, fmt='.3f', cmap='viridis', cbar=False)
ax # invert the y-axis to match your preference
ax.invert_yaxis()
'Kernel')
plt.xlabel('C-Gamma')
plt.ylabel( plt.show()
We can see that rbf
kernel benefits from changes in C
& gamma
, however, for the linear
, we’re always using the same gamma of 0.01
so for this kernel, the only changes are coming from C
parameter.
Best model
After finding the best hyperparameters, it is often good practice to re-train the model with the best hyperparameters on the entire training set before evaluating everything on the test set. We do not need to re-train the model with the entire dataset for the linear SVM, as the best cost matched those used in section 1.2. We re-train the final model with the entire training set using optimal hyperparameters for the radial basis kernel.
<- svm(SaleHigh ~ .,data = df_tr,
carseats_rb_tuned kernel = "radial", gamma = svm_Radial_Grid$bestTune$sigma,
cost = svm_Radial_Grid$bestTune$C)
<- predict(carseats_rb_tuned, newdata = df_te)
carseats_rb_tuned_pred confusionMatrix(data=carseats_rb_tuned_pred, reference = df_te$SaleHigh)
Overall, if we compare all the models, we see that the linear kernel SVM with cost of 1 looks like the best model. We already saw that it provides a \(86\%\) accuracy on the test set. This is what can be expected in the future from that model.
# re-train the model with best hyperparameters
= svm.SVC(**grid_search.best_params_)
svm_best_py svm_best_py.fit(X_train, y_train)
SVC(C=500, gamma=0.01)In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
SVC(C=500, gamma=0.01)
# predict on test dataset
= svm_best_py.predict(X_test)
carseats_svm_pred_py
# print confusion matrix and accuracy score
print(pd.crosstab(index=y_test, columns=carseats_svm_pred_py, rownames=['True'], colnames=['Predicted']))
print(classification_report(y_test, carseats_svm_pred_py, target_names=['No', 'Yes']))
print('Accuracy Score:', accuracy_score(y_test, carseats_svm_pred_py))
The results are similar to the R outcome for the linear models (except differences in the confusion matrix).
Your turn
Repeat the analysis on the German credit data (german.csv
). Since dataset is much larger than MyCarSeats
, the tuning procedure may be longer. For this reason, just limit to a linear SVM model for the tuning with limited range for the grid search.