Principal Component Analysis
Data preparation
In this series of exercises, we illustrate PCA on the wine data already used for clustering. We first load the data.
wine <- read.csv(here::here("labs/data/Wine.csv"))
row.names(wine) <- paste("W", c(1:nrow(wine)), sep="")
wine[,-12] <- scale(wine[,-12])
library(reticulate)
use_condaenv("MLBA")
import pandas as pd
from sklearn.preprocessing import StandardScaler
wine = pd.read_csv('../../data/Wine.csv')
wine.index = ["W" + str(i) for i in range(1, len(wine)+1)]
scaler = StandardScaler()
wine.iloc[:, :-1] = scaler.fit_transform(wine.iloc[:, :-1])
Note that here the scaling of the variables is optional. The PCA can be applied on the correlation matrix, which is equivalent to using scaled features. We could alternatively use unscaled data; the results would of course be different and would depend on the scales themselves. That choice depends on the practical application.
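In FactoMineR (used below), that choice is controlled by the scale.unit argument of PCA. As a minimal sketch, assuming the raw data are re-read into a hypothetical wine_raw object (the wine object above is already standardized), the two options give different eigenvalues:

wine_raw <- read.csv(here::here("labs/data/Wine.csv"))  ## hypothetical: data without the scaling step above
pca_cor <- FactoMineR::PCA(wine_raw[,-12], scale.unit = TRUE, graph = FALSE)   ## PCA on the correlation matrix
pca_cov <- FactoMineR::PCA(wine_raw[,-12], scale.unit = FALSE, graph = FALSE)  ## PCA on the covariance matrix
head(pca_cor$eig)  ## eigenvalues / % of variance with scaling
head(pca_cov$eig)  ## different eigenvalues without scaling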
Principal Component Analysis
Before running the PCA, it is often helpful to represent the correlations between the features.
library(ggcorrplot)
ggcorrplot(cor(wine[,-12]))
import seaborn as sns
import matplotlib.pyplot as plt
plt.figure(figsize=(10, 8))
sns.heatmap(wine.iloc[:, :-1].corr(), annot=True, cmap="coolwarm", center=0)
plt.show()
Please note that all the interpretations are based on the R outputs. The Python outputs can be slightly different; however, these differences have been explained wherever possible.
We see that some features are (negatively or positively) correlated, like alcohol and density. This means that they bring, to some extent, the same information to the data. They could thus be combined into a single feature. This is in fact what PCA does.
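As a small illustration (a sketch, assuming the two columns are named alcohol and density, as in the correlation plot), a PCA on just these two standardized features concentrates most of their variance on a single component:

mini_pca <- prcomp(wine[, c("alcohol", "density")])  ## assumed column names; features already standardized above
summary(mini_pca)  ## the first component captures most of the shared variance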
We now compute the principal components. There exist several functions in R for this. We use PCA, available in the package FactoMineR. It is complemented by factoextra, which provides nice visualizations.
library(FactoMineR)
library(factoextra)
wine_pca <- PCA(wine[,-12], ncp = 11, graph = FALSE)
wine_pca
from sklearn.decomposition import PCA
pca = PCA(n_components=11)
wine_pca_result = pca.fit(wine.iloc[:, :-1])
wine_pca = pd.DataFrame(pca.transform(wine.iloc[:, :-1]))
wine_pca
The object wine_pca contains all the PCA results. We will analyze them now.
Note: we set ncp=11 because there can be 11 principal components (as many as features), and we want to keep the results for all of them. The results would not change if we set ncp=5, for example; it is just that only 5 PCs would be kept in the R object.
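To check this, here is a minimal sketch: rerunning the PCA with ncp = 5 gives the same eigenvalues; only the number of stored components differs.

wine_pca5 <- PCA(wine[,-12], ncp = 5, graph = FALSE)
all.equal(wine_pca$eig, wine_pca5$eig)  ## TRUE: identical eigenvalues
ncol(wine_pca$ind$coord)   ## 11 components kept
ncol(wine_pca5$ind$coord)  ## only 5 components kept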
Circle of correlations and interpretation
To produce a circle of correlations:
fviz_pca_var(wine_pca)
import numpy as np
import matplotlib.pyplot as plt
features = wine.columns[:-1]
loading_matrix = pd.DataFrame(pca.components_.T, columns=[f"PC{i+1}" for i in range(11)], index=features)
loading_matrix = loading_matrix.iloc[:, :2]

# Normalize the PCA loadings
loading_matrix_normalized = loading_matrix / np.sqrt(np.sum(loading_matrix**2, axis=0))

fig, ax = plt.subplots(figsize=(8, 8))
ax.scatter(loading_matrix_normalized["PC1"], loading_matrix_normalized["PC2"], color="r", marker="o", s=100)

for i, feature in enumerate(loading_matrix_normalized.index):
    ax.arrow(0, 0, loading_matrix_normalized.loc[feature, "PC1"], loading_matrix_normalized.loc[feature, "PC2"], color="r", alpha=0.8)
    ax.text(loading_matrix_normalized.loc[feature, "PC1"], loading_matrix_normalized.loc[feature, "PC2"], feature, color="black", fontsize=12)

ax.axhline(0, color="black", linewidth=1)
ax.axvline(0, color="black", linewidth=1)
ax.set_xlim(-1.1, 1.1)
ax.set_ylim(-1.1, 1.1)

circle = plt.Circle((0, 0), 1, color='gray', fill=False, linestyle="--", linewidth=1)
ax.add_artist(circle)

ax.set_xlabel("PC1")
ax.set_ylabel("PC2")
ax.set_title("Circle of Correlations")
ax.grid()
plt.show()
Please note for this plot:
- Values on the axes have been normalized to lie between -1 and 1. For the rest of the Python plots, we will not apply this normalization.
- To get the same results as R, the coordinates of PC2 should be flipped. The results appear mirrored because the sign of each principal component is arbitrary: the PCA algorithm computes eigenvectors, and an eigenvector multiplied by -1 is still an eigenvector. This means that the orientation of the principal components can vary depending on the implementation of PCA in different libraries. You can get the same results as R with the code below:
import numpy as np
import matplotlib.pyplot as plt
# Normalize the PCA loadings
loading_matrix_normalized = loading_matrix / np.sqrt(np.sum(loading_matrix**2, axis=0))

# Add this if you want the same results as R (flip the sign of PC2)
loading_matrix_normalized["PC2"] = -loading_matrix_normalized["PC2"]

fig, ax = plt.subplots(figsize=(8, 8))
ax.scatter(loading_matrix_normalized["PC1"], loading_matrix_normalized["PC2"], color="r", marker="o", s=100)

for i, feature in enumerate(loading_matrix_normalized.index):
    ax.arrow(0, 0, loading_matrix_normalized.loc[feature, "PC1"], loading_matrix_normalized.loc[feature, "PC2"], color="r", alpha=0.8)
    ax.text(loading_matrix_normalized.loc[feature, "PC1"], loading_matrix_normalized.loc[feature, "PC2"], feature, color="black", fontsize=12)

ax.axhline(0, color="black", linewidth=1)
ax.axvline(0, color="black", linewidth=1)
ax.set_xlim(-1.1, 1.1)
ax.set_ylim(-1.1, 1.1)

circle = plt.Circle((0, 0), 1, color='gray', fill=False, linestyle="--", linewidth=1)
ax.add_artist(circle)

ax.set_xlabel("PC1")
ax.set_ylabel("PC2")
ax.set_title("Circle of Correlations")
ax.grid()
plt.show()
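To see concretely why the sign is arbitrary, here is a small check in R (a sketch using the correlation matrix of the standardized features): if v is an eigenvector, then -v is one as well.

S <- cor(wine[,-12])
e <- eigen(S)
v <- e$vectors[, 2]  ## direction of the 2nd principal component
all.equal(as.vector(S %*% v), e$values[2] * v)        ## TRUE
all.equal(as.vector(S %*% (-v)), e$values[2] * (-v))  ## TRUE as well: -v is equally valid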
The circle of correlations is for the first two principal components. We see that (the underlying numbers can be extracted directly, as sketched after this list):
- PC1 explains \(31.4\%\) of the variance of the data, PC2 explains \(15.3\%\). In total, \(46.7\%\) of the variance of the data is explained by these two components.
- PC1 is positively correlated with density and negatively with alcohol, which confirms that these two features are negatively correlated. It is also positively correlated with residual.sugar.
- PC2 is positively correlated with pH, negatively with fixed.acidity and, a little less, with citric.acid.
- Features with short arrows (volatile.acidity, sulphates, etc.) are not well represented by these two components.
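These numbers can be read from the FactoMineR object: the correlations between the features and the components are stored in wine_pca$var$cor, and the explained variances in wine_pca$eig. A minimal sketch:

round(wine_pca$var$cor[, 1:2], 2)  ## correlations of the features with PC1 and PC2
round(wine_pca$eig[1:2, "percentage of variance"], 1)  ## % of variance of PC1 and PC2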
To interpret the dimensions even better, we can extract the contribution of each feature to a dimension. Below, for PC1.
fviz_contrib(wine_pca, choice = "var", axes = 1)
explained_variance = pca.explained_variance_ratio_
square_loading_matrix = loading_matrix**2
contributions = square_loading_matrix * 100 / explained_variance[:2]

contributions = contributions["PC1"].sort_values(ascending=False)

plt.figure(figsize=(10, 6))
sns.barplot(x=contributions.values, y=contributions.index, palette="viridis")
plt.xlabel("Contribution (%)")
plt.ylabel("Features")
plt.title("Feature Contributions for PC1")
plt.show()
We recover our conclusions from the circle (for PC1).
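The same information can be read directly from the PCA object (a short sketch): the contribution (in %) of each feature to each component is stored in wine_pca$var$contrib.

sort(wine_pca$var$contrib[, 1], decreasing = TRUE)  ## contributions (%) of the features to PC1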
Map of the individuals and biplot
We can represent the wines in the (PC1, PC2) map. To better interpret the map, we add the correlation circle to it: this is a biplot.
## fviz_pca_ind(wine_pca) ## only the individuals
fviz_pca_biplot(wine_pca) ## biplot
import numpy as np
import matplotlib.pyplot as plt
# Define the scores for the first two principal components
scores = wine_pca.iloc[:, :2]

# Define the loadings for the first two principal components
loadings = pca.components_.T[:, :2]

# Scale the loadings by the square root of the variance
loadings_scaled = loadings * np.sqrt(pca.explained_variance_[:2])

# Calculate the scaling factor for the arrows
arrow_max = 0.9 * np.max(np.max(np.abs(scores)))
scale_factor = arrow_max / np.max(np.abs(loadings_scaled))

# Create a scatter plot of the scores
plt.figure(figsize=(10, 8))
plt.scatter(scores.iloc[:, 0], scores.iloc[:, 1], s=50, alpha=0.8)

# Add arrows for each variable's loadings
for i, variable in enumerate(wine.columns[:-1]):
    plt.arrow(0, 0, loadings_scaled[i, 0]*scale_factor, loadings_scaled[i, 1]*scale_factor, color='r', alpha=0.8)
    plt.text(loadings_scaled[i, 0]*scale_factor, loadings_scaled[i, 1]*scale_factor, variable, color='black', fontsize=12)

# Add axis labels and title
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.title('Biplot of Wine Dataset')

# Add grid lines
plt.grid()

# Show the plot
plt.show()
It’s a bit difficult to see all the patterns, but for instance:
- Wine 168 has a large fixed.acidity and a low pH.
- Wine 195 has a large alcohol and a low density.
- etc.
When we say “large” or “low”, it is not in absolute value but relative to the data set, i.e., “larger than the average”; the average lies at the center of the graph (PC1=0, PC2=0).
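Such a reading can be checked numerically (a sketch, assuming the row label W168 and the column names fixed.acidity and pH): since the features were standardized, a positive value means above average.

wine_pca$ind$coord["W168", 1:2]  ## coordinates of wine W168 on PC1 and PC2
wine["W168", c("fixed.acidity", "pH")]  ## standardized values (assumed column names); > 0 means above average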
How many dimensions
For a graphical representation on a single graph, we need to keep only two PCs. But if we use PCA to reduce the dimension of our data set, or if we want to represent the data on several graphs, then we need to know how many components are needed to reach a certain level of variance. This can be assessed by looking at the eigenvalues (scree plot).
fviz_eig(wine_pca, addlabels = TRUE, ncp=11)
plt.figure(figsize=(10, 6))
plt.plot(range(1, 12), explained_variance * 100, 'o-')
plt.xticks(range(1, 12), [f"PC{i}" for i in range(1, 12)])
plt.xlabel("Principal Components")
plt.ylabel("Explained Variance (%)")
plt.title("Scree Plot")
plt.grid()
plt.show()
If we want to represent \(75\%\) of the data (i.e., of the variance of the data), we need 5 dimensions. This means that the three biplots below represent \(>75\%\) of the data (in fact \(84.6\%\)).
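These percentages can also be read from the eigenvalue table (a sketch using factoextra's get_eigenvalue):

eig <- factoextra::get_eigenvalue(wine_pca)
eig  ## eigenvalues, % of variance, cumulative % of variance
which(eig$cumulative.variance.percent >= 75)[1]  ## smallest number of components reaching 75%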
library(gridExtra)
p1 <- fviz_pca_biplot(wine_pca, axes = 1:2)
p2 <- fviz_pca_biplot(wine_pca, axes = 3:4)
p3 <- fviz_pca_biplot(wine_pca, axes = 5:6)
grid.arrange(p1, p2, p3, nrow = 2, ncol=2)
Using PCA to represent clustering results
We now combine clustering and PCA: we make clusters and represent them on the map of the individuals.
First, we make a clustering (below, \(k=4\) for the example). Then, we use the groups (as a factor) to color the individuals in the biplot.
wine_hc <- hclust(dist(wine[,-12], method = "manhattan"))
wine_clust <- cutree(wine_hc, k = 4)
fviz_pca_biplot(wine_pca,
                col.ind = factor(wine_clust))
from sklearn.cluster import AgglomerativeClustering
from matplotlib.cm import get_cmap
from matplotlib.patches import Patch
clustering = AgglomerativeClustering(n_clusters=4, metric='manhattan', linkage='average')
wine_clust = clustering.fit_predict(wine.iloc[:, :-1])

cmap = get_cmap("viridis", 4)
colors = cmap(np.array(wine_clust) / 3)

# Define legend labels based on cluster number
labels = ['Cluster {}'.format(i+1) for i in range(4)]

# Rescale the loading matrix
loading_matrix_rescaled = loading_matrix.div(loading_matrix.std())

plt.figure(figsize=(10, 10))
plt.scatter(wine_pca[0], wine_pca[1], c=colors, marker="o", s=100, alpha=0.7)

for i, feature in enumerate(loading_matrix_rescaled.index):
    plt.arrow(0, 0, loading_matrix_rescaled.loc[feature, "PC1"], loading_matrix_rescaled.loc[feature, "PC2"], color="r", alpha=0.8)
    plt.text(loading_matrix_rescaled.loc[feature, "PC1"], loading_matrix_rescaled.loc[feature, "PC2"], feature, color="black", fontsize=12)

plt.xlabel("PC1")
plt.ylabel("PC2")
plt.title("PCA Biplot with Clusters")
plt.grid()

# Create legend patches and group them in a Legend instance
legend_patches = [Patch(color=cmap(i/3), label=labels[i]) for i in range(4)]
plt.legend(handles=legend_patches, title='', loc='upper right', bbox_to_anchor=(1.0, 1.0))

plt.show()
We can see that the clusters are almost separated by the dimensions (a numeric check is sketched after the list):
- Cluster 1: larger PC1 and PC2. That is, large pH (low acidity) and large density, residual sugar (low alcohol), etc.
- Cluster 3: larger PC1 and smaller PC2. That is, larger alcohol (lower density and sugar) and large pH (low acidity).
- Cluster 2 is fuzzier. Larger acidity (lower pH) and larger alcohol.
- Cluster 4 is apparently linked to larger citric acidity.
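As a rough numeric check (a sketch), the average position of each cluster on PC1 and PC2 can be computed from the individual coordinates:

aggregate(as.data.frame(wine_pca$ind$coord[, 1:2]),
          by = list(cluster = wine_clust), FUN = mean)  ## mean PC1/PC2 coordinates per cluster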