T104: Handling Multicollinearity - Feature selection techniques in machine learning

5 minute read

Note: This is part of a series on Data Preprocessing in Machine Learning; you can check all the tutorials here: Embedded Method, Wrapper Method, Filter Method, Handling Multicollinearity.

In this tutorial we will learn how to handle multicollinear features; this can be performed as a feature selection step in your machine learning pipeline. When two or more independent variables are highly correlated with each other, we say those features are multicollinear. In the cars dataset used below, for example, engine displacement and the number of cylinders rise and fall together.

import pandas as pd
import numpy as np

cars_df=pd.read_csv('dataset/cleaned_cars.csv')
cars_df.head()
mpg cylinders displacement horsepower weight acceleration origin Age
0 18.0 8 307.0 130 3504 12.0 US 49
1 15.0 8 350.0 165 3693 11.5 US 49
2 18.0 8 318.0 150 3436 11.0 US 49
3 17.0 8 302.0 140 3449 10.5 US 49
4 15.0 8 429.0 198 4341 10.0 US 49
cars_df.describe()
mpg cylinders displacement horsepower weight acceleration Age
count 367.000000 367.000000 367.000000 367.000000 367.000000 367.000000 367.000000
mean 23.556403 5.438692 191.592643 103.618529 2955.242507 15.543324 42.953678
std 7.773266 1.694068 102.017066 37.381309 831.031730 2.728949 3.698402
min 9.000000 3.000000 70.000000 46.000000 1613.000000 8.000000 37.000000
25% 17.550000 4.000000 105.000000 75.000000 2229.000000 13.800000 40.000000
50% 23.000000 4.000000 146.000000 94.000000 2789.000000 15.500000 43.000000
75% 29.000000 8.000000 260.000000 121.000000 3572.000000 17.050000 46.000000
max 46.600000 8.000000 455.000000 230.000000 5140.000000 24.800000 49.000000

As we can see, the ranges of these features are very different, which means they are on different scales, so let's standardize the features using sklearn's scale function.
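Under the hood, scale standardizes each column to zero mean and unit variance, i.e. it applies

$$z = \frac{x - \mu}{\sigma}$$

where $\mu$ and $\sigma$ are the column's mean and standard deviation.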

from sklearn import preprocessing

# Standardize each numeric feature (zero mean, unit variance)
numeric_cols = ['cylinders', 'displacement', 'horsepower', 'weight', 'acceleration', 'Age']
cars_df[numeric_cols] = preprocessing.scale(cars_df[numeric_cols].astype('float64'))

cars_df.describe()
mpg cylinders displacement horsepower weight acceleration Age
count 367.000000 3.670000e+02 3.670000e+02 3.670000e+02 3.670000e+02 3.670000e+02 3.670000e+02
mean 23.556403 -1.936084e-17 -1.936084e-17 9.680419e-17 -7.744335e-17 9.680419e-17 2.323300e-16
std 7.773266 1.001365e+00 1.001365e+00 1.001365e+00 1.001365e+00 1.001365e+00 1.001365e+00
min 9.000000 -1.441514e+00 -1.193512e+00 -1.543477e+00 -1.617357e+00 -2.767960e+00 -1.611995e+00
25% 17.550000 -8.504125e-01 -8.499642e-01 -7.666291e-01 -8.750977e-01 -6.396984e-01 -7.997267e-01
50% 23.000000 -8.504125e-01 -4.475220e-01 -2.576598e-01 -2.003166e-01 -1.589748e-02 1.254184e-02
75% 29.000000 1.513992e+00 6.714636e-01 4.656124e-01 7.431720e-01 5.528622e-01 8.248104e-01
max 46.600000 1.513992e+00 2.585518e+00 3.385489e+00 2.632559e+00 3.396661e+00 1.637079e+00
from sklearn.model_selection import train_test_split

Info: Our primary goal in this tutorial is to learn how to handle multicollinearity among features, hence we are not including the origin variable in our feature set, as it is a categorical feature.

X=cars_df.drop(['mpg','origin'],axis=1) 
Y=cars_df['mpg']
x_train,x_test,y_train,y_test = train_test_split(X,Y,test_size=0.2)
from sklearn.linear_model import LinearRegression

linear_model = LinearRegression().fit(x_train, y_train)  # features are already standardized, so no normalize argument is needed
print("Training score : ",linear_model.score(x_train,y_train))
Training score :  0.8003438238657309
y_pred = linear_model.predict(x_test)
from sklearn.metrics import r2_score

print("Testing_score :",r2_score(y_test,y_pred))
Testing_score : 0.8190012505093899

What is Adjusted $R^2$ Score?

When we have multiple predictors/features, a better measure of how good our model is, is the Adjusted $R^2$ score.

The Adjusted $R^2$ score is a corrected goodness-of-fit measure for linear models: it takes the ordinary $R^2$ (computed here with r2_score) and adjusts it for the number of predictors/features used in the regression analysis, as shown in the formula below.

  • The Adjusted $R^2$ score increases only when a newly added predictor/feature improves the model by more than what would be expected purely by chance.

  • When the features are not highly correlated, the Adjusted $R^2$ score stays very close to the plain $R^2$ score.
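With $n$ observations and $p$ predictors, the adjusted score is commonly written as

$$R^2_{\text{adj}} = 1 - \frac{(1 - R^2)\,(n - 1)}{n - p - 1}$$

Note that the helper below uses $n - p$ rather than $n - p - 1$ in the denominator, a slightly less conservative variant.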

def adjusted_r2(r_square, labels, features):
    # n = len(labels) observations, p = features.shape[1] predictors
    adj_r_square = 1 - ((1 - r_square) * (len(labels) - 1)) / (len(labels) - features.shape[1])
    return adj_r_square
print("Adjusted R2 score :",adjusted_r2(r2_score(y_test,y_pred),y_test,x_test))
Adjusted R2 score : 0.8056925189291979
feature_corr=X.corr()
feature_corr
cylinders displacement horsepower weight acceleration Age
cylinders 1.000000 0.951901 0.841093 0.895922 -0.483725 0.330754
displacement 0.951901 1.000000 0.891518 0.930437 -0.521733 0.362976
horsepower 0.841093 0.891518 1.000000 0.862606 -0.673175 0.410110
weight 0.895922 0.930437 0.862606 1.000000 -0.397605 0.302727
acceleration -0.483725 -0.521733 -0.673175 -0.397605 1.000000 -0.273762
Age 0.330754 0.362976 0.410110 0.302727 -0.273762 1.000000

Now let's explore the correlation matrix. We can see that several features are highly correlated with displacement: cylinders, horsepower, and weight are all three highly correlated with it. Correlation coefficients this high, at around 0.9, indicate that these features are likely to be collinear.

Another way of saying this is that cylinders, horsepower, and weight give us roughly the same information as displacement, so we don't need all of them in our regression analysis.

Using this correlation matrix, let's say we want to see all feature pairs with a correlation coefficient greater than 0.8; we can do that with the code below.

abs(feature_corr) > 0.8
cylinders displacement horsepower weight acceleration Age
cylinders True True True True False False
displacement True True True True False False
horsepower True True True True False False
weight True True True True False False
acceleration False False False False True False
Age False False False False False True
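The boolean matrix above still has to be scanned by eye. As a small sketch using only pandas and numpy (already imported above), the offending pairs can also be listed programmatically; the variable names here are ours, not from the original notebook:

# Keep only the entries above the diagonal so each pair is reported once
upper = feature_corr.abs().where(np.triu(np.ones(feature_corr.shape, dtype=bool), k=1))

# Stack into a Series indexed by (feature, feature) and keep pairs above 0.8
high_corr_pairs = upper.stack().loc[lambda s: s > 0.8].sort_values(ascending=False)
print(high_corr_pairs)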
trimmed_features_df = X.drop(['cylinders','horsepower','weight'],axis=1)
trimmed_features_corr=trimmed_features_df.corr()
trimmed_features_corr
displacement acceleration Age
displacement 1.000000 -0.521733 0.362976
acceleration -0.521733 1.000000 -0.273762
Age 0.362976 -0.273762 1.000000
abs(trimmed_features_corr) > 0.8
displacement acceleration Age
displacement True False False
acceleration False True False
Age False False True

Now we can verify that the correlation among the remaining independent features has been reduced.

Variance Inflation Factor

Another way of selecting features that are not collinear is the Variance Inflation Factor (VIF). It is a measure that quantifies the severity of multicollinearity in an ordinary least squares regression analysis.

The variance inflation factor measures how much the behavior (variance) of an independent variable is influenced, or inflated, by its interaction/correlation with the other independent variables.
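Formally, the VIF of the $i$-th feature is

$$\mathrm{VIF}_i = \frac{1}{1 - R_i^2}$$

where $R_i^2$ is the $R^2$ obtained by regressing the $i$-th feature on all the other features; the better that auxiliary regression fits, the larger the VIF.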

from statsmodels.stats.outliers_influence import variance_inflation_factor
vif=pd.DataFrame()
vif['VIF Factor'] = [variance_inflation_factor(X.values,i) for i in range(X.shape[1])]
vif['Features'] = X.columns
vif.round(2)
VIF Factor Features
0 10.82 cylinders
1 19.13 displacement
2 8.98 horsepower
3 10.36 weight
4 2.50 acceleration
5 1.24 Age
  • VIF = 1: Not correlated
  • VIF between 1 and 5: Moderately correlated
  • VIF > 5: Highly correlated

If we look at the VIF factors, we can see that displacement and weight have the highest values, so let's drop them from the feature set.

X = X.drop(['displacement','weight'], axis = 1)

Now we calculate the VIF again for the remaining features.

vif=pd.DataFrame()
vif['VIF Factor'] = [variance_inflation_factor(X.values,i) for i in range(X.shape[1])]
vif['Features'] = X.columns
vif.round(2)
VIF Factor Features
0 3.57 cylinders
1 5.26 horsepower
2 1.91 acceleration
3 1.20 Age

So now the collinearity among the features has been reduced using VIF. A sketch of how this selection step could be automated is given below.
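This is only a rough sketch, not part of the original notebook (the helper name drop_high_vif_features and its default threshold are ours): the idea is to repeatedly drop the feature with the highest VIF and recompute, because the VIFs change after every removal.

def drop_high_vif_features(features_df, threshold=5.0):
    # Iteratively drop the feature with the largest VIF until every VIF is <= threshold
    features = features_df.copy()
    while features.shape[1] > 1:
        vifs = pd.Series([variance_inflation_factor(features.values, i)
                          for i in range(features.shape[1])],
                         index=features.columns)
        if vifs.max() <= threshold:
            break
        features = features.drop(columns=[vifs.idxmax()])
    return features

# Example usage on the feature set used above:
# drop_high_vif_features(cars_df.drop(['mpg', 'origin'], axis=1), threshold=5.0)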

X=cars_df.drop(['mpg','origin','displacement','weight'],axis=1)
Y=cars_df['mpg']
x_train,x_test,y_train,y_test = train_test_split(X,Y,test_size=0.2)
from sklearn.linear_model import LinearRegression

linear_model = LinearRegression().fit(x_train, y_train)
print("Training score : ",linear_model.score(x_train,y_train))
Training score :  0.7537877265338784
y_pred = linear_model.predict(x_test)
from sklearn.metrics import r2_score

print("Testing_score :",r2_score(y_test,y_pred))
Testing_score : 0.7159725745358863
print("Adjusted R2 score :",adjusted_r2(r2_score(y_test,y_pred),y_test,x_test))
Adjusted R2 score : 0.7037999705874243

You can get the notebook used in this tutorial here and the dataset here.

Thanks for reading!
