T104: Handling Multicollinearity - Feature selection techniques in machine learning
Note: This is part of a series on Data Preprocessing in Machine Learning; you can check all the tutorials here: Embedded Method, Wrapper Method, Filter Method, Handling Multicollinearity.
In this tutorial we will learn how to handle multicollinear features; this can be performed as a feature selection step in your machine learning pipeline. When two or more independent variables are highly correlated with each other, we say that those features are multicollinear.
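Before working with the real dataset, here is a minimal illustration with made-up data: the two generated columns below move together, so their correlation coefficient is close to 1, i.e. they are multicollinear (the column names and numbers are purely illustrative).

import numpy as np
import pandas as pd

# two synthetic features that are (almost) linear functions of each other
rng = np.random.default_rng(0)
engine_size = rng.normal(loc=200, scale=50, size=100)
engine_power = 0.5 * engine_size + rng.normal(scale=5, size=100)

pd.DataFrame({'engine_size': engine_size, 'engine_power': engine_power}).corr()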
import pandas as pd
import numpy as np
# load the cleaned cars dataset and preview the first few rows
cars_df = pd.read_csv('dataset/cleaned_cars.csv')
cars_df.head()
| | mpg | cylinders | displacement | horsepower | weight | acceleration | origin | Age |
|---|---|---|---|---|---|---|---|---|
| 0 | 18.0 | 8 | 307.0 | 130 | 3504 | 12.0 | US | 49 |
| 1 | 15.0 | 8 | 350.0 | 165 | 3693 | 11.5 | US | 49 |
| 2 | 18.0 | 8 | 318.0 | 150 | 3436 | 11.0 | US | 49 |
| 3 | 17.0 | 8 | 302.0 | 140 | 3449 | 10.5 | US | 49 |
| 4 | 15.0 | 8 | 429.0 | 198 | 4341 | 10.0 | US | 49 |
cars_df.describe()
| | mpg | cylinders | displacement | horsepower | weight | acceleration | Age |
|---|---|---|---|---|---|---|---|
| count | 367.000000 | 367.000000 | 367.000000 | 367.000000 | 367.000000 | 367.000000 | 367.000000 |
| mean | 23.556403 | 5.438692 | 191.592643 | 103.618529 | 2955.242507 | 15.543324 | 42.953678 |
| std | 7.773266 | 1.694068 | 102.017066 | 37.381309 | 831.031730 | 2.728949 | 3.698402 |
| min | 9.000000 | 3.000000 | 70.000000 | 46.000000 | 1613.000000 | 8.000000 | 37.000000 |
| 25% | 17.550000 | 4.000000 | 105.000000 | 75.000000 | 2229.000000 | 13.800000 | 40.000000 |
| 50% | 23.000000 | 4.000000 | 146.000000 | 94.000000 | 2789.000000 | 15.500000 | 43.000000 |
| 75% | 29.000000 | 8.000000 | 260.000000 | 121.000000 | 3572.000000 | 17.050000 | 46.000000 |
| max | 46.600000 | 8.000000 | 455.000000 | 230.000000 | 5140.000000 | 24.800000 | 49.000000 |
As we can see, the ranges of these features are very different, which means they are on different scales, so let's standardize the features using sklearn's scale function.
from sklearn import preprocessing

# standardize each numeric feature to zero mean and unit variance
numeric_features = ['cylinders', 'displacement', 'horsepower', 'weight', 'acceleration', 'Age']
cars_df[numeric_features] = preprocessing.scale(cars_df[numeric_features].astype('float64'))
cars_df.describe()
| | mpg | cylinders | displacement | horsepower | weight | acceleration | Age |
|---|---|---|---|---|---|---|---|
| count | 367.000000 | 3.670000e+02 | 3.670000e+02 | 3.670000e+02 | 3.670000e+02 | 3.670000e+02 | 3.670000e+02 |
| mean | 23.556403 | -1.936084e-17 | -1.936084e-17 | 9.680419e-17 | -7.744335e-17 | 9.680419e-17 | 2.323300e-16 |
| std | 7.773266 | 1.001365e+00 | 1.001365e+00 | 1.001365e+00 | 1.001365e+00 | 1.001365e+00 | 1.001365e+00 |
| min | 9.000000 | -1.441514e+00 | -1.193512e+00 | -1.543477e+00 | -1.617357e+00 | -2.767960e+00 | -1.611995e+00 |
| 25% | 17.550000 | -8.504125e-01 | -8.499642e-01 | -7.666291e-01 | -8.750977e-01 | -6.396984e-01 | -7.997267e-01 |
| 50% | 23.000000 | -8.504125e-01 | -4.475220e-01 | -2.576598e-01 | -2.003166e-01 | -1.589748e-02 | 1.254184e-02 |
| 75% | 29.000000 | 1.513992e+00 | 6.714636e-01 | 4.656124e-01 | 7.431720e-01 | 5.528622e-01 | 8.248104e-01 |
| max | 46.600000 | 1.513992e+00 | 2.585518e+00 | 3.385489e+00 | 2.632559e+00 | 3.396661e+00 | 1.637079e+00 |
from sklearn.model_selection import train_test_split
Info: Our primary goal in this tutorial is to learn how to handle multicollinearity among features, hence we are not including the origin variable in our feature set, as it is a categorical feature.
X=cars_df.drop(['mpg','origin'],axis=1)
Y=cars_df['mpg']
x_train,x_test,y_train,y_test = train_test_split(X,Y,test_size=0.2)
from sklearn.linear_model import LinearRegression
# the features are already standardized, so a plain LinearRegression is sufficient
# (the normalize argument has been removed in recent scikit-learn releases)
linear_model = LinearRegression().fit(x_train, y_train)
print("Training score : ",linear_model.score(x_train,y_train))
Training score : 0.8003438238657309
y_pred = linear_model.predict(x_test)
from sklearn.metrics import r2_score
print("Testing_score :",r2_score(y_test,y_pred))
Testing_score : 0.8190012505093899
What is the Adjusted $R^2$ Score?
When we have multiple predictors/features, the Adjusted $R^2$ score is a better measure of how good our model is.
The Adjusted $R^2$ score is a corrected goodness-of-fit measure for linear models: it is computed from the ordinary $R^2$ score (r2_score) and adjusted for the number of predictors/features used in the regression analysis.
- The Adjusted $R^2$ score increases only when a new predictor/feature improves the model by more than would be expected purely by chance.
- When we don't have highly correlated features, the Adjusted $R^2$ score is very close to the plain $R^2$ score.
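For reference, the textbook form of the Adjusted $R^2$, with $n$ observations and $k$ predictors, is:

$$\text{Adjusted } R^2 = 1 - \frac{(1 - R^2)(n - 1)}{n - k - 1}$$

The helper below uses $n - k$ in the denominator, a close variant; for the test-set sizes used in this tutorial the difference is small.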
def adjusted_r2(r_square, labels, features):
    # n = number of observations, k = number of features used in the model
    n = len(labels)
    k = features.shape[1]
    adj_r_square = 1 - ((1 - r_square) * (n - 1)) / (n - k)
    return adj_r_square
print("Adjusted R2 score :",adjusted_r2(r2_score(y_test,y_pred),y_test,x_test))
Adjusted R2 score : 0.8056925189291979
# pairwise correlation between the independent features
feature_corr = X.corr()
feature_corr
| | cylinders | displacement | horsepower | weight | acceleration | Age |
|---|---|---|---|---|---|---|
| cylinders | 1.000000 | 0.951901 | 0.841093 | 0.895922 | -0.483725 | 0.330754 |
| displacement | 0.951901 | 1.000000 | 0.891518 | 0.930437 | -0.521733 | 0.362976 |
| horsepower | 0.841093 | 0.891518 | 1.000000 | 0.862606 | -0.673175 | 0.410110 |
| weight | 0.895922 | 0.930437 | 0.862606 | 1.000000 | -0.397605 | 0.302727 |
| acceleration | -0.483725 | -0.521733 | -0.673175 | -0.397605 | 1.000000 | -0.273762 |
| Age | 0.330754 | 0.362976 | 0.410110 | 0.302727 | -0.273762 | 1.000000 |
Now let's explore the correlation matrix. Several features are highly correlated with displacement: cylinders, horsepower, and weight all have correlation coefficients with it close to 0.9 or above, which indicates that these features are likely to be collinear.
Another way of saying this is that cylinders, horsepower, and weight give us much the same information as displacement, so we don't need all of them in our regression analysis.
Using this correlation matrix, let's say we want to see all the feature pairs with an absolute correlation coefficient greater than 0.8; we can do that with the code below.
abs(feature_corr) > 0.8
| | cylinders | displacement | horsepower | weight | acceleration | Age |
|---|---|---|---|---|---|---|
| cylinders | True | True | True | True | False | False |
| displacement | True | True | True | True | False | False |
| horsepower | True | True | True | True | False | False |
| weight | True | True | True | True | False | False |
| acceleration | False | False | False | False | True | False |
| Age | False | False | False | False | False | True |
# drop the features that are highly correlated with displacement
trimmed_features_df = X.drop(['cylinders', 'horsepower', 'weight'], axis=1)
trimmed_features_corr = trimmed_features_df.corr()
trimmed_features_corr
| | displacement | acceleration | Age |
|---|---|---|---|
| displacement | 1.000000 | -0.521733 | 0.362976 |
| acceleration | -0.521733 | 1.000000 | -0.273762 |
| Age | 0.362976 | -0.273762 | 1.000000 |
abs(trimmed_features_corr) > 0.8
| | displacement | acceleration | Age |
|---|---|---|---|
| displacement | True | False | False |
| acceleration | False | True | False |
| Age | False | False | True |
Now we can verify that the correlation among the remaining independent features has been reduced.
Variance Inflation Factor
Another way of selecting features that are not collinear is the Variance Inflation Factor (VIF), a measure that quantifies the severity of multicollinearity in an ordinary least squares regression analysis.
The VIF measures how much the variance of an independent variable's coefficient estimate is influenced, or inflated, by that variable's correlation with the other independent variables.
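Concretely, for feature $i$, $VIF_i = \frac{1}{1 - R_i^2}$, where $R_i^2$ is the $R^2$ obtained by regressing feature $i$ on all of the other features. The snippet below is a minimal sketch of that computation on our standardized feature DataFrame X; the helper name manual_vif is our own, and its values can differ slightly from statsmodels depending on how the intercept is handled.

from sklearn.linear_model import LinearRegression

def manual_vif(features):
    # regress each feature on all of the others and turn the resulting R^2 into a VIF
    vifs = {}
    for col in features.columns:
        others = features.drop(columns=[col])
        r_sq = LinearRegression().fit(others, features[col]).score(others, features[col])
        vifs[col] = 1.0 / (1.0 - r_sq)
    return vifs

manual_vif(X)

The statsmodels function used below performs the equivalent calculation for one column of the design matrix at a time.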
from statsmodels.stats.outliers_influence import variance_inflation_factor
vif = pd.DataFrame()
# compute the VIF of each feature against all of the other features
vif['VIF Factor'] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
vif['Features'] = X.columns
vif.round(2)
| | VIF Factor | Features |
|---|---|---|
| 0 | 10.82 | cylinders |
| 1 | 19.13 | displacement |
| 2 | 8.98 | horsepower |
| 3 | 10.36 | weight |
| 4 | 2.50 | acceleration |
| 5 | 1.24 | Age |
- VIF = 1: Not correlated
- VIF = 1 to 5: Moderately correlated
- VIF > 5: Highly correlated (this rule of thumb is applied in the quick check below)
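As a quick programmatic check of this rule of thumb, we can list the features in the VIF table above whose VIF exceeds 5 (a small illustrative one-liner, not part of the original analysis):

# features from the vif DataFrame above with a VIF greater than 5
vif[vif['VIF Factor'] > 5]['Features'].tolist()

Note that cylinders and horsepower also exceed the threshold here; dropping displacement and weight first, as we do next, lowers their VIF values as well (see the recomputed table further below).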
If we look at the VIF factors, we can see that displacement and weight are highly correlated features, so let's drop them from the feature set.
X = X.drop(['displacement','weight'], axis = 1)
Now we recalculate the VIF for the remaining features.
vif=pd.DataFrame()
vif['VIF Factor'] = [variance_inflation_factor(X.values,i) for i in range(X.shape[1])]
vif['Features'] = X.columns
vif.round(2)
| | VIF Factor | Features |
|---|---|---|
| 0 | 3.57 | cylinders |
| 1 | 5.26 | horsepower |
| 2 | 1.91 | acceleration |
| 3 | 1.20 | Age |
So the collinearity among the features has now been reduced using VIF.
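Here we removed the offending features by hand. This process could also be automated by repeatedly dropping the feature with the highest VIF until every VIF falls below a chosen cutoff; the sketch below assumes a helper name drop_high_vif and a threshold of 5, both of which are our own illustrative choices.

import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

def drop_high_vif(features, threshold=5.0):
    # iteratively remove the feature with the largest VIF until all VIFs are <= threshold
    features = features.copy()
    while features.shape[1] > 1:
        vifs = pd.Series(
            [variance_inflation_factor(features.values, i) for i in range(features.shape[1])],
            index=features.columns,
        )
        if vifs.max() <= threshold:
            break
        features = features.drop(columns=[vifs.idxmax()])
    return features

reduced_X = drop_high_vif(cars_df.drop(['mpg', 'origin'], axis=1))

With the reduced set of features, let's retrain the linear model and compare the scores.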
X=cars_df.drop(['mpg','origin','displacement','weight'],axis=1)
Y=cars_df['mpg']
x_train,x_test,y_train,y_test = train_test_split(X,Y,test_size=0.2)
from sklearn.linear_model import LinearRegression
linear_model = LinearRegression().fit(x_train, y_train)
print("Training score : ",linear_model.score(x_train,y_train))
Training score : 0.7537877265338784
y_pred = linear_model.predict(x_test)
from sklearn.metrics import r2_score
print("Testing_score :",r2_score(y_test,y_pred))
Testing_score : 0.7159725745358863
print("Adjusted R2 score :",adjusted_r2(r2_score(y_test,y_pred),y_test,x_test))
Adjusted R2 score : 0.7037999705874243
You can get the notebook used in this tutorial here, and the dataset used here.
Thanks for reading!