library(randomForest)

wine <- read.csv(here::here("data/Wine.csv"))
wine$quality <- as.factor(wine$quality)

# define a function to get the splitting index (training and testing) of a given dataset
get_split_index <- function(dataset, train_proportion = 0.75) {
  set.seed(123)
  index <- sample(
    x = 1:2,
    size = nrow(dataset),
    replace = TRUE,
    prob = c(train_proportion, 1 - train_proportion)
  )
  return(index)
}

wine_index <- get_split_index(wine)
wine_tr <- wine[wine_index == 1, ]
wine_te <- wine[wine_index == 2, ]
Ensemble Methods
Random Forest
In this part of the lab, we will look at how the randomForest library (an alternative to ranger) can be applied for classification and regression tasks. At the very end, please feel free to apply these techniques to one of your favorite datasets seen in class (classification or regression).
R (using the randomForest library):
ntree: The number of trees in the forest (equivalent to n_estimators in python).
mtry: The number of features to consider when looking for the best split (similar to max_features in python).
max.depth: The maximum depth of each tree.
nodesize: The minimum number of samples required to split an internal node (equivalent to min_samples_split in python).
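Before moving to the Python equivalents, here is a minimal sketch of a randomForest call that sets the main arguments above explicitly, using the wine_tr split created at the top of this page; the values themselves are illustrative, not recommendations.
# illustrative only: a randomForest call with the main arguments set explicitly
library(randomForest)
rf_example <- randomForest(
  quality ~ .,        # predict quality from all remaining columns
  data = wine_tr,     # training split created earlier
  ntree = 500,        # number of trees (500 is also the default)
  mtry = 3,           # features tried at each split (default: sqrt(p) for classification)
  nodesize = 5,       # minimum node size (illustrative value)
  importance = TRUE   # keep the importance measures for later inspection
)
rf_example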
Python (using the sklearn library):
n_estimators: The number of trees in the forest.
max_features: The number of features to consider when looking for the best split.
max_depth: The maximum depth of each tree.
min_samples_split: The minimum number of samples required to split an internal node.
min_samples_leaf: The minimum number of samples required to be at a leaf node.
A few (among many) tips for finding the ideal hyperparameters for RF (a small tuning sketch follows this list):
ntree (R) / n_estimators (python): Use a large number of trees in the forest to improve model performance, but be aware of the increased computation time. The default value is usually a good starting point.
mtry (R) / max_features (python): Experiment with different values, usually starting with the default (square root of the number of features for classification or one-third of the number of features for regression). Increasing this value may improve model performance but can also increase computation time.
max.depth: Control the depth of each tree to manage overfitting. Deeper trees capture more complex patterns but can lead to overfitting. Experiment with different values, keeping in mind that a shallower tree can be more interpretable and less prone to overfitting.
nodesize (R) / min_samples_split (python): Increasing this value can help reduce overfitting, but setting it too high might lead to underfitting. Experiment with different values to find the optimal balance.
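As a small illustration of these tips (a sketch only, not a full tuning procedure), you could compare the out-of-bag error of the forest for a few candidate mtry values on the wine training split; mtry_grid and the number of trees are illustrative choices.
# illustrative sketch: compare the out-of-bag (OOB) error for a few mtry values
library(randomForest)
set.seed(123)
mtry_grid <- c(2, 3, 5, 8)   # candidate values to try (illustrative)
oob_error <- sapply(mtry_grid, function(m) {
  rf <- randomForest(quality ~ ., data = wine_tr, ntree = 500, mtry = m)
  rf$err.rate[rf$ntree, "OOB"]   # OOB error after the last tree
})
data.frame(mtry = mtry_grid, oob_error = oob_error)
A lower OOB error suggests a better value of mtry; the same idea extends to ntree or nodesize.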
Classification
Data preparation for Random Forest
Load the library randomForest in R. Then, load the wine data set. This dataset is about white wine quality (in fact, Portuguese vinho verde). The data contains 11 numerical features and 1 factor variable:
fixed.acidity
volatile.acidity
citric.acid
residual.sugar
chlorides
free.sulfur.dioxide
total.sulfur.dioxide
density
pH
sulphates
alcohol
quality: Good/Bad
All the numerical features have units. The data source can be found here. For simplicity, only an extract of 200 wines is used in this exercise. Note that in the original data set, the quality is a score (0 to 10) that was turned into a factor here for the exercise (Bad: 0 to 5, Good: 6 to 10). Also, note that in the data source, the objective is to predict the quality from the other features (supervised learning).
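The provided extract already contains the recoded factor, but for reference, the recoding described above could be done with a sketch like the following (quality_score is a made-up vector standing in for the original 0 to 10 score):
# illustrative only: turning a 0-10 quality score into a Bad/Good factor
quality_score <- c(4, 6, 5, 7, 8, 3)   # made-up scores standing in for the original column
quality_factor <- factor(ifelse(quality_score <= 5, "Bad", "Good"), levels = c("Bad", "Good"))
table(quality_factor)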
As mentioned, the outcome variable used for this dataset is the wine quality. We should first coerce this variable to a factor. Then, we make the training/test set random split with a 75/25 scheme.
# Load the course python environment as usual with an R code chunk.
library(reticulate)
use_condaenv("MLBA")
Similar to the previous labs, in python, we can use the usual sklearn library to do all our modelling. Please note that we will load the data again in python to make the demo easier. Additionally, we'll load all the necessary libraries for this lab in this code chunk.
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.metrics import confusion_matrix, accuracy_score, mean_squared_error
import matplotlib.pyplot as plt
import seaborn as sns
# we first move up one directory to achieve relative paths
wine = pd.read_csv('../data/Wine.csv')
wine['quality'] = wine['quality'].astype('category')

# Split wine dataset into train and test
train_wine, test_wine = train_test_split(wine, test_size=0.25, random_state=123)
Note: Here, we have written a function to make the split as we will also need to apply it to another dataset in the regression part.
Training and testing the model
Fit a random forest on the train set. The target is the quality variable that we want to predict. Specify ntree=1000 for the number of trees (by default, the function builds 500 trees). Remember to exclude quality from the predictors of the formula. Also, use the option importance=TRUE, as we will need it afterward. Then test the model by computing the accuracy on the test set. You may use confusionMatrix from caret.
wine_rf <- randomForest(quality~., data=wine_tr, ntree=1000, importance=TRUE)
wine.pred_rf <- predict(wine_rf, newdata=wine_te)
library(caret)
confusionMatrix(data=wine.pred_rf, reference = wine_te$quality)
# Fit a Random Forest classifier on the train set
wine_rf = RandomForestClassifier(n_estimators=1000, random_state=123)
wine_rf.fit(train_wine.drop('quality', axis=1), train_wine['quality'])

# Test the model and compute the accuracy
wine_pred_rf = wine_rf.predict(test_wine.drop('quality', axis=1))
print(confusion_matrix(test_wine['quality'], wine_pred_rf))
print("Accuracy:", accuracy_score(test_wine['quality'], wine_pred_rf))
This model performs worse than the R version, mostly because of the different defaults.
Variable importance
Extract the model-specific variable importance using the functions varImpPlot (plots) and importance (values) on the model. Observe that the mean decrease in accuracy of each variable is also computed for each specific class. In particular, what makes density special for predicting Good compared to another variable (like, for example, citric.acid)?
varImpPlot(wine_rf)
importance(wine_rf)
# Variable importance
wine_importances = pd.Series(wine_rf.feature_importances_, index=train_wine.drop('quality', axis=1).columns)
wine_importances.sort_values(ascending=False).plot(kind='bar')
density is important for predicting the Good class since the predictions for that class become much less accurate if we do not use it. citric.acid is overall less important than density, and especially so for the prediction of Good.
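To check this numerically (a minimal sketch, assuming the factor levels are named Bad and Good as in the provided extract), you can sort the class-specific mean decrease in accuracy returned by importance():
# class-specific mean decrease in accuracy: one column per class when importance = TRUE
imp <- importance(wine_rf)
sort(imp[, "Good"], decreasing = TRUE)                   # variables that matter most for Good
sort(imp[, "MeanDecreaseAccuracy"], decreasing = TRUE)   # overall ranking, for comparison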
Regression
In this part, we will be using the real_estate_data.csv once again. After reading the data, apply a random forest to predict Price using all the other variables except No, Month and Year. Compute the RMSE and inspect the prediction quality with a graph. Note that the importance is not specific to any class here.
library(dplyr)
library(magrittr)
real_estate_data <- read.csv(here::here("labs/data/real_estate_data.csv"))

# select the columns of interest
real_estate_data <-
  real_estate_data %>%
  select(-c(No, Month, Year))

# once again, divide the data into training and testing sets using the function created earlier
restate_index <- get_split_index(real_estate_data)
restate_tr <- real_estate_data[restate_index == 1, ]
restate_te <- real_estate_data[restate_index == 2, ]

# apply the RF model as a regression
restate_rf <- randomForest(Price~., data=restate_tr, ntree=1000, importance=TRUE)
restate.pred_rf <- predict(restate_rf, newdata=restate_te)

# compute rmse and plot the results as well as the VarImp
(rmse <- sqrt(mean((restate_te$Price - restate.pred_rf)^2)))
plot(restate_te$Price ~ restate.pred_rf)
abline(0, 1)
varImpPlot(restate_rf)
importance(restate_rf)
# Load real estate dataset
real_estate_data = pd.read_csv("../data/real_estate_data.csv")
real_estate_data = real_estate_data.drop(['No', 'Month', 'Year'], axis=1)

# Split real estate dataset into train and test
train_restate, test_restate = train_test_split(real_estate_data, test_size=0.25, random_state=123)

# Fit a Random Forest regressor on the train set
restate_rf = RandomForestRegressor(n_estimators=1000, random_state=123)
restate_rf.fit(train_restate.drop('Price', axis=1), train_restate['Price'])

# Test the model and compute the RMSE
restate_pred_rf = restate_rf.predict(test_restate.drop('Price', axis=1))
rmse = np.sqrt(mean_squared_error(test_restate['Price'], restate_pred_rf))
print("RMSE:", rmse)

# Plot the prediction quality
plt.scatter(test_restate['Price'], restate_pred_rf)
plt.xlabel('Actual Price')
plt.ylabel('Predicted Price')
plt.plot([min(test_restate['Price']), max(test_restate['Price'])], [min(test_restate['Price']), max(test_restate['Price'])], color='red')
plt.show()

# Variable importance
restate_importances = pd.Series(restate_rf.feature_importances_, index=train_restate.drop('Price', axis=1).columns)
restate_importances.sort_values(ascending=False).plot(kind='bar')
Compare this model with the one you came up with in Ex_ML_LinLogReg
. Which one would you go for?
Gradient Boosting Machines (GBM)
In this part of the lab, we will look at how the gbm library in R and the GradientBoostingClassifier and GradientBoostingRegressor in Python can be applied for classification and regression tasks. We will continue using the wine dataset for classification and real_estate_data for regression.
R (using the gbm library):
n.trees: The number of boosting stages to perform (equivalent to n_estimators in python).
interaction.depth: The maximum depth of each tree (equivalent to max_depth in python).
shrinkage: The learning rate.
n.minobsinnode: The minimum number of samples required to split an internal node (equivalent to min_samples_split in python).
bag.fraction: The fraction of samples to be used for fitting individual base learners (equivalent to subsample in python).
Python (using the sklearn library):
n_estimators: The number of boosting stages to perform. Similar to Random Forest, increasing the number of estimators can improve the model's performance but may also increase the computational complexity and training time.
learning_rate: The learning rate shrinks the contribution of each tree. A smaller learning rate requires more boosting stages to achieve the same performance as a larger learning rate, but it can also result in a more robust model.
max_depth: The maximum depth of each tree. Similar to Random Forest, a higher depth can capture more complex patterns in the data, but it may also lead to overfitting.
min_samples_split: The minimum number of samples required to split an internal node. Similar to Random Forest, a smaller value allows the model to capture finer details in the data, while a larger value can help prevent overfitting.
min_samples_leaf: The minimum number of samples required to be at a leaf node. Similar to Random Forest, a smaller value allows the model to capture finer details, while a larger value can help prevent overfitting.
subsample: The fraction of samples to be used for fitting individual base learners. A value smaller than 1.0 can lead to a reduction in variance and an increase in bias, resulting in a more robust model.
Similar to Random Forest, here are some tips for finding the best combination of hyperparameters (a small cross-validation sketch follows this list):
n.trees (R) / n_estimators (python): Start with a lower number of trees and increase it until no further improvement in performance is observed. Be aware of the increased computation time with a larger number of trees.
interaction.depth (R) / max_depth (python): Keep the depth of each tree relatively shallow (3-5 levels) to prevent overfitting. Deeper trees can capture more complex patterns but may lead to overfitting.
shrinkage (R) / learning_rate (python): Use a smaller learning rate for better model performance, but be prepared for slower convergence and increased computation time. Typically, values range between 0.01 and 0.1.
n.minobsinnode (R) / min_samples_split (python): Similar to Random Forest, experiment with different values to find the optimal balance between overfitting and underfitting.
bag.fraction (R) / subsample (python): Using a subsample of the data (e.g., 0.5-0.8) can help reduce overfitting and speed up the training process. Experiment with different values to find the best trade-off between performance and computation time.
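As one possible illustration of these tips (a sketch only, not the required workflow for this lab), you can let gbm's built-in cross-validation pick the number of trees for a given learning rate; the settings below are illustrative.
# illustrative sketch: use gbm's built-in CV to choose the number of trees
library(gbm)
set.seed(123)
wine_gbm_cv <- gbm(
  quality ~ ., data = wine_tr,
  distribution = "multinomial",
  n.trees = 2000,           # upper bound; CV picks the best iteration below this
  interaction.depth = 3,    # shallow trees, as suggested above (illustrative)
  shrinkage = 0.05,         # learning rate (illustrative)
  n.minobsinnode = 10,
  bag.fraction = 0.8,
  cv.folds = 5
)
best_iter <- gbm.perf(wine_gbm_cv, method = "cv")   # best number of trees according to CV
best_iter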
Classification
Training and testing the model
We now fit a GBM model on the wine training set, with the same target variable quality. We then train the model and evaluate it with a confusion matrix on the test set.
library(gbm)
set.seed(123)
wine_gbm <- gbm(quality~., data=wine_tr, distribution="multinomial", n.trees=1000, interaction.depth=4, shrinkage=0.01)
wine.pred_gbm <- predict(wine_gbm, newdata=wine_te, n.trees=1000, type="response")
wine.pred_gbm_class <- apply(wine.pred_gbm, 1, which.max)

levels(wine_te$quality) <- 1:length(levels(wine_te$quality))
confusionMatrix(factor(wine.pred_gbm_class), wine_te$quality)
from sklearn.ensemble import GradientBoostingClassifier
# Fit a Gradient Boosting classifier on the train set
wine_gbm = GradientBoostingClassifier(n_estimators=1000, learning_rate=0.01, max_depth=4, random_state=123)
wine_gbm.fit(train_wine.drop('quality', axis=1), train_wine['quality'])

# Test the model and compute the accuracy
wine_pred_gbm = wine_gbm.predict(test_wine.drop('quality', axis=1))
print(confusion_matrix(test_wine['quality'], wine_pred_gbm))
print("Accuracy:", accuracy_score(test_wine['quality'], wine_pred_gbm))
Regression
In this part, we will continue using the real_estate_data.csv. Fit a GBM model on the real estate training set to predict Price using all the other variables except No, Month, and Year. Then compute the metrics and plot the predictions.
set.seed(123)
restate_gbm <- gbm(Price~., data=restate_tr, distribution="gaussian", n.trees=1000, interaction.depth=4, shrinkage=0.01)
restate.pred_gbm <- predict(restate_gbm, newdata=restate_te, n.trees=1000)

# compute rmse and plot the results
(rmse_gbm <- sqrt(mean((restate_te$Price - restate.pred_gbm)^2)))
plot(restate_te$Price ~ restate.pred_gbm)
abline(0,1)
# run the line below if you have not cleared the previous plot yet
plt.clf()
from sklearn.ensemble import GradientBoostingRegressor
# Fit a Gradient Boosting regressor on the train set
restate_gbm = GradientBoostingRegressor(n_estimators=1000, learning_rate=0.01, max_depth=4, random_state=123)
restate_gbm.fit(train_restate.drop('Price', axis=1), train_restate['Price'])

# Test the model and compute the RMSE
restate_pred_gbm = restate_gbm.predict(test_restate.drop('Price', axis=1))
rmse_gbm = np.sqrt(mean_squared_error(test_restate['Price'], restate_pred_gbm))
print("RMSE:", rmse_gbm)

# Plot the prediction quality
plt.scatter(test_restate['Price'], restate_pred_gbm)
plt.xlabel('Actual Price')
plt.ylabel('Predicted Price')
plt.plot([min(test_restate['Price']), max(test_restate['Price'])], [min(test_restate['Price']), max(test_restate['Price'])], color='red')
plt.show()
Compare the GBM model with the Random Forest model you came up with earlier. Which one would you go for?
Bonus: XGBoost
What is XGBoost?
XGBoost (Extreme Gradient Boosting) is an optimized implementation of the gradient boosting algorithm. It is designed for high performance and efficient memory usage. XGBoost improves upon the base Gradient Boosting Machine (GBM) by incorporating regularization to prevent overfitting and implementing parallel processing techniques for faster training. The algorithm also offers built-in cross-validation and early stopping to save time and resources during model training.
Modelling with XGBoost
We'll use the xgboost library in both R and python. You can see some of the hyperparameters below:
eta: Controls the learning rate, which determines the step size at each iteration while updating the model weights. Smaller values make the model more robust to overfitting but require more iterations to converge. Typical values range from 0.01 to 0.3.
max_depth: Controls the maximum depth of each tree. Deeper trees can model more complex relationships but are more prone to overfitting. Experiment with different values, keeping in mind that a shallower tree can be more interpretable and less prone to overfitting.
min_child_weight: Controls the minimum sum of instance weights needed in a child node. Increasing this value helps to prevent overfitting by making the model more conservative.
You can read more about the package in its documentation.
# Install and load the package
# install.packages("xgboost")
library(xgboost)
# Prepare data for XGBoost
dtrain <- xgb.DMatrix(data = as.matrix(restate_tr[, -ncol(restate_tr)]), label = restate_tr$Price)
dtest <- xgb.DMatrix(data = as.matrix(restate_te[, -ncol(restate_te)]), label = restate_te$Price)

# Set hyperparameters
params <- list(
  objective = "reg:squarederror",
  eta = 0.1,
  max_depth = 5,
  min_child_weight = 1,
  subsample = 1,
  colsample_bytree = 1
)

# Train the model
xgb_model <- xgb.train(params, dtrain, nrounds = 1000)

# Test the model and compute the RMSE
restate_pred_xgb <- predict(xgb_model, dtest)
rmse_xgb <- sqrt(mean((restate_te$Price - restate_pred_xgb)^2))
print(paste("RMSE:", rmse_xgb))
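As mentioned in the introduction to XGBoost, the package also offers built-in cross-validation and early stopping. Here is a minimal sketch using xgb.cv on the dtrain matrix from above; the settings are illustrative, not part of the required exercise.
# illustrative sketch: built-in cross-validation with early stopping
cv_result <- xgb.cv(
  params = list(objective = "reg:squarederror", eta = 0.1, max_depth = 3),
  data = dtrain,
  nrounds = 1000,
  nfold = 5,
  early_stopping_rounds = 20,   # stop if the CV error has not improved for 20 rounds
  verbose = FALSE
)
cv_result$best_iteration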
We used the python installation of xgboost from our lab setup.
# Install and load the package
import xgboost as xgb
# Prepare data for XGBoost
dtrain = xgb.DMatrix(train_restate.drop("Price", axis=1), label=train_restate["Price"])
dtest = xgb.DMatrix(test_restate.drop("Price", axis=1), label=test_restate["Price"])

# Set hyperparameters
params = {
    "objective": "reg:squarederror",
    "eta": 0.1,
    "max_depth": 3,
    "min_child_weight": 1,
    "subsample": 1,
    "colsample_bytree": 1,
}

# Train the model
xgb_model = xgb.train(params, dtrain, num_boost_round=1000)

# Test the model and compute the RMSE
restate_pred_xgb = xgb_model.predict(dtest)
rmse_xgb = np.sqrt(mean_squared_error(test_restate["Price"], restate_pred_xgb))
print("RMSE:", rmse_xgb)
Although our GBM initially lagged behind the RF, we can see that XGBoost helps improve the result (this is the case for the python implementation). However, the random forest still outperforms all the other models.
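One way to make this comparison concrete (a minimal sketch reusing the RMSE objects computed above in R) is to collect the test RMSEs side by side:
# gather the test RMSEs computed above for a side-by-side comparison
rmse_comparison <- data.frame(
  model = c("Random Forest", "GBM", "XGBoost"),
  test_rmse = c(rmse, rmse_gbm, rmse_xgb)
)
rmse_comparison[order(rmse_comparison$test_rmse), ]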
Feel free to apply XGBoost to the dataset of your choice.