T104: Handling Multicollinearity - Feature selection techniques in machine learning
Note: This is part of a series on Data Preprocessing in Machine Learning; you can check all the tutorials here: Embedded Method, Wrapper Method, Filter Method, Handling Multicollinearity.
In this tutorial we will learn how to handle multicollinear features; this can be performed as a feature selection step in your machine learning pipeline. When two or more independent variables are highly correlated with each other, we say that those features are multicollinear.
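Before working with the real dataset, here is a minimal illustration with made-up data: the two generated columns below move together, so their correlation coefficient is close to 1, i.e. they are multicollinear (the column names and numbers are purely illustrative).

import numpy as np
import pandas as pd

# two synthetic features that are (almost) linear functions of each other
rng = np.random.default_rng(0)
engine_size = rng.normal(loc=200, scale=50, size=100)
engine_power = 0.5 * engine_size + rng.normal(scale=5, size=100)

pd.DataFrame({'engine_size': engine_size, 'engine_power': engine_power}).corr()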
import pandas as pd
import numpy as np
# load the cleaned cars dataset and preview the first few rows
cars_df = pd.read_csv('dataset/cleaned_cars.csv')
cars_df.head()
| | mpg | cylinders | displacement | horsepower | weight | acceleration | origin | Age |
|---|---|---|---|---|---|---|---|---|
| 0 | 18.0 | 8 | 307.0 | 130 | 3504 | 12.0 | US | 49 |
| 1 | 15.0 | 8 | 350.0 | 165 | 3693 | 11.5 | US | 49 |
| 2 | 18.0 | 8 | 318.0 | 150 | 3436 | 11.0 | US | 49 |
| 3 | 17.0 | 8 | 302.0 | 140 | 3449 | 10.5 | US | 49 |
| 4 | 15.0 | 8 | 429.0 | 198 | 4341 | 10.0 | US | 49 |
cars_df.describe()
| | mpg | cylinders | displacement | horsepower | weight | acceleration | Age |
|---|---|---|---|---|---|---|---|
| count | 367.000000 | 367.000000 | 367.000000 | 367.000000 | 367.000000 | 367.000000 | 367.000000 |
| mean | 23.556403 | 5.438692 | 191.592643 | 103.618529 | 2955.242507 | 15.543324 | 42.953678 |
| std | 7.773266 | 1.694068 | 102.017066 | 37.381309 | 831.031730 | 2.728949 | 3.698402 |
| min | 9.000000 | 3.000000 | 70.000000 | 46.000000 | 1613.000000 | 8.000000 | 37.000000 |
| 25% | 17.550000 | 4.000000 | 105.000000 | 75.000000 | 2229.000000 | 13.800000 | 40.000000 |
| 50% | 23.000000 | 4.000000 | 146.000000 | 94.000000 | 2789.000000 | 15.500000 | 43.000000 |
| 75% | 29.000000 | 8.000000 | 260.000000 | 121.000000 | 3572.000000 | 17.050000 | 46.000000 |
| max | 46.600000 | 8.000000 | 455.000000 | 230.000000 | 5140.000000 | 24.800000 | 49.000000 |
As we can see, the ranges of these features are very different, which means they are on different scales, so let's standardize the features using sklearn's scale function.
from sklearn import preprocessing

# standardize each numeric feature to zero mean and unit variance
numeric_features = ['cylinders', 'displacement', 'horsepower', 'weight', 'acceleration', 'Age']
cars_df[numeric_features] = preprocessing.scale(cars_df[numeric_features].astype('float64'))
cars_df.describe()
| | mpg | cylinders | displacement | horsepower | weight | acceleration | Age |
|---|---|---|---|---|---|---|---|
| count | 367.000000 | 3.670000e+02 | 3.670000e+02 | 3.670000e+02 | 3.670000e+02 | 3.670000e+02 | 3.670000e+02 |
| mean | 23.556403 | -1.936084e-17 | -1.936084e-17 | 9.680419e-17 | -7.744335e-17 | 9.680419e-17 | 2.323300e-16 |
| std | 7.773266 | 1.001365e+00 | 1.001365e+00 | 1.001365e+00 | 1.001365e+00 | 1.001365e+00 | 1.001365e+00 |
| min | 9.000000 | -1.441514e+00 | -1.193512e+00 | -1.543477e+00 | -1.617357e+00 | -2.767960e+00 | -1.611995e+00 |
| 25% | 17.550000 | -8.504125e-01 | -8.499642e-01 | -7.666291e-01 | -8.750977e-01 | -6.396984e-01 | -7.997267e-01 |
| 50% | 23.000000 | -8.504125e-01 | -4.475220e-01 | -2.576598e-01 | -2.003166e-01 | -1.589748e-02 | 1.254184e-02 |
| 75% | 29.000000 | 1.513992e+00 | 6.714636e-01 | 4.656124e-01 | 7.431720e-01 | 5.528622e-01 | 8.248104e-01 |
| max | 46.600000 | 1.513992e+00 | 2.585518e+00 | 3.385489e+00 | 2.632559e+00 | 3.396661e+00 | 1.637079e+00 |
from sklearn.model_selection import train_test_split
Info: Our primary goal in this tutorial is to learn how to handle multicollinearity among features, hence we are not including the origin variable in our feature set, as it is a categorical feature.
X=cars_df.drop(['mpg','origin'],axis=1)
Y=cars_df['mpg']
x_train,x_test,y_train,y_test = train_test_split(X,Y,test_size=0.2)
from sklearn.linear_model import LinearRegression
# the features are already standardized, so a plain LinearRegression is sufficient
# (the normalize argument has been removed in recent scikit-learn releases)
linear_model = LinearRegression().fit(x_train, y_train)
print("Training score : ",linear_model.score(x_train,y_train))
Training score : 0.8003438238657309
y_pred = linear_model.predict(x_test)
from sklearn.metrics import r2_score
print("Testing_score :",r2_score(y_test,y_pred))
Testing_score : 0.8190012505093899
What is the Adjusted $R^2$ Score?
When we have multiple predictors/features, the Adjusted $R^2$ score is a better measure of how good our model is.
The Adjusted $R^2$ score is a corrected goodness-of-fit measure for linear models: it is computed from the ordinary $R^2$ score (r2_score) and adjusted for the number of predictors/features used in the regression analysis.
- The Adjusted $R^2$ score increases only when a new predictor/feature improves the model by more than would be expected purely by chance.
- When we don't have highly correlated features, the Adjusted $R^2$ score is very close to the plain $R^2$ score.
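For reference, the textbook form of the Adjusted $R^2$, with $n$ observations and $k$ predictors, is:

$$\text{Adjusted } R^2 = 1 - \frac{(1 - R^2)(n - 1)}{n - k - 1}$$

The helper below uses $n - k$ in the denominator, a close variant; for the test-set sizes used in this tutorial the difference is small.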
def adjusted_r2(r_square, labels, features):
    # n = number of observations, k = number of features used in the model
    n = len(labels)
    k = features.shape[1]
    adj_r_square = 1 - ((1 - r_square) * (n - 1)) / (n - k)
    return adj_r_square
print("Adjusted R2 score :",adjusted_r2(r2_score(y_test,y_pred),y_test,x_test))
Adjusted R2 score : 0.8056925189291979
# pairwise correlation between the independent features
feature_corr = X.corr()
feature_corr
| | cylinders | displacement | horsepower | weight | acceleration | Age |
|---|---|---|---|---|---|---|
| cylinders | 1.000000 | 0.951901 | 0.841093 | 0.895922 | -0.483725 | 0.330754 |
| displacement | 0.951901 | 1.000000 | 0.891518 | 0.930437 | -0.521733 | 0.362976 |
| horsepower | 0.841093 | 0.891518 | 1.000000 | 0.862606 | -0.673175 | 0.410110 |
| weight | 0.895922 | 0.930437 | 0.862606 | 1.000000 | -0.397605 | 0.302727 |
| acceleration | -0.483725 | -0.521733 | -0.673175 | -0.397605 | 1.000000 | -0.273762 |
| Age | 0.330754 | 0.362976 | 0.410110 | 0.302727 | -0.273762 | 1.000000 |
Now let's explore the correlation matrix. Several features are highly correlated with displacement: cylinders, horsepower, and weight all have correlation coefficients with it close to 0.9 or above, which indicates that these features are likely to be collinear.
Another way of saying this is that cylinders, horsepower, and weight give us much the same information as displacement, so we don't need all of them in our regression analysis.
Using this correlation matrix, let's say we want to see all the feature pairs with an absolute correlation coefficient greater than 0.8; we can do that with the code below.
abs(feature_corr) > 0.8
| | cylinders | displacement | horsepower | weight | acceleration | Age |
|---|---|---|---|---|---|---|
| cylinders | True | True | True | True | False | False |
| displacement | True | True | True | True | False | False |
| horsepower | True | True | True | True | False | False |
| weight | True | True | True | True | False | False |
| acceleration | False | False | False | False | True | False |
| Age | False | False | False | False | False | True |
# drop the features that are highly correlated with displacement
trimmed_features_df = X.drop(['cylinders', 'horsepower', 'weight'], axis=1)
trimmed_features_corr = trimmed_features_df.corr()
trimmed_features_corr
| | displacement | acceleration | Age |
|---|---|---|---|
| displacement | 1.000000 | -0.521733 | 0.362976 |
| acceleration | -0.521733 | 1.000000 | -0.273762 |
| Age | 0.362976 | -0.273762 | 1.000000 |
abs(trimmed_features_corr) > 0.8
| | displacement | acceleration | Age |
|---|---|---|---|
| displacement | True | False | False |
| acceleration | False | True | False |
| Age | False | False | True |
Now we can verify that the correlation among the remaining independent features has been reduced.
Variance Inflation Factor
Another way of selecting features that are not collinear is the Variance Inflation Factor (VIF), a measure that quantifies the severity of multicollinearity in an ordinary least squares regression analysis.
The VIF measures how much the variance of an independent variable's coefficient estimate is influenced, or inflated, by that variable's correlation with the other independent variables.
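Concretely, for feature $i$, $VIF_i = \frac{1}{1 - R_i^2}$, where $R_i^2$ is the $R^2$ obtained by regressing feature $i$ on all of the other features. The snippet below is a minimal sketch of that computation on our standardized feature DataFrame X; the helper name manual_vif is our own, and its values can differ slightly from statsmodels depending on how the intercept is handled.

from sklearn.linear_model import LinearRegression

def manual_vif(features):
    # regress each feature on all of the others and turn the resulting R^2 into a VIF
    vifs = {}
    for col in features.columns:
        others = features.drop(columns=[col])
        r_sq = LinearRegression().fit(others, features[col]).score(others, features[col])
        vifs[col] = 1.0 / (1.0 - r_sq)
    return vifs

manual_vif(X)

The statsmodels function used below performs the equivalent calculation for one column of the design matrix at a time.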
from statsmodels.stats.outliers_influence import variance_inflation_factor
vif = pd.DataFrame()
# compute the VIF of each feature against all of the other features
vif['VIF Factor'] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
vif['Features'] = X.columns
vif.round(2)
| | VIF Factor | Features |
|---|---|---|
| 0 | 10.82 | cylinders |
| 1 | 19.13 | displacement |
| 2 | 8.98 | horsepower |
| 3 | 10.36 | weight |
| 4 | 2.50 | acceleration |
| 5 | 1.24 | Age |
- VIF = 1: Not correlated
- VIF = 1 to 5: Moderately correlated
- VIF > 5: Highly correlated (this rule of thumb is applied in the quick check below)
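As a quick programmatic check of this rule of thumb, we can list the features in the VIF table above whose VIF exceeds 5 (a small illustrative one-liner, not part of the original analysis):

# features from the vif DataFrame above with a VIF greater than 5
vif[vif['VIF Factor'] > 5]['Features'].tolist()

Note that cylinders and horsepower also exceed the threshold here; dropping displacement and weight first, as we do next, lowers their VIF values as well (see the recomputed table further below).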
If we look at the VIF factors, we can see that displacement and weight are highly correlated features, so let's drop them from the feature set.
X = X.drop(['displacement','weight'], axis = 1)
Now we recalculate the VIF for the remaining features.
vif=pd.DataFrame()
vif['VIF Factor'] = [variance_inflation_factor(X.values,i) for i in range(X.shape[1])]
vif['Features'] = X.columns
vif.round(2)
| | VIF Factor | Features |
|---|---|---|
| 0 | 3.57 | cylinders |
| 1 | 5.26 | horsepower |
| 2 | 1.91 | acceleration |
| 3 | 1.20 | Age |
So the collinearity among the features has now been reduced using VIF.
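Here we removed the offending features by hand. This process could also be automated by repeatedly dropping the feature with the highest VIF until every VIF falls below a chosen cutoff; the sketch below assumes a helper name drop_high_vif and a threshold of 5, both of which are our own illustrative choices.

import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

def drop_high_vif(features, threshold=5.0):
    # iteratively remove the feature with the largest VIF until all VIFs are <= threshold
    features = features.copy()
    while features.shape[1] > 1:
        vifs = pd.Series(
            [variance_inflation_factor(features.values, i) for i in range(features.shape[1])],
            index=features.columns,
        )
        if vifs.max() <= threshold:
            break
        features = features.drop(columns=[vifs.idxmax()])
    return features

reduced_X = drop_high_vif(cars_df.drop(['mpg', 'origin'], axis=1))

With the reduced set of features, let's retrain the linear model and compare the scores.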
X=cars_df.drop(['mpg','origin','displacement','weight'],axis=1)
Y=cars_df['mpg']
x_train,x_test,y_train,y_test = train_test_split(X,Y,test_size=0.2)
from sklearn.linear_model import LinearRegression
linear_model = LinearRegression().fit(x_train, y_train)
print("Training score : ",linear_model.score(x_train,y_train))
Training score : 0.7537877265338784
y_pred = linear_model.predict(x_test)
from sklearn.metrics import r2_score
print("Testing_score :",r2_score(y_test,y_pred))
Testing_score : 0.7159725745358863
print("Adjusted R2 score :",adjusted_r2(r2_score(y_test,y_pred),y_test,x_test))
Adjusted R2 score : 0.7037999705874243
You can get the notebook used in this tutorial here, and the dataset used here.
Thanks for reading!