T101: Embedded Method - Feature Selection Techniques in Machine Learning
Note: This is part of a series on Data Preprocessing in Machine Learning. You can find all the tutorials here: Embedded Method, Wrapper Method, Filter Method, Handling Multicollinearity.
Are you drowning in features but don't know which ones to pick and which ones to ignore? How would you develop a predictive model then?
This is one of those questions every machine learning engineer comes across. Giving a well-grounded answer usually takes deep domain knowledge, but don't worry: in this tutorial I am going to help you automate the process. There are a few checklists we can follow to select the best features from our data. Let's begin!
What is Feature Selection?
Feature selection is the automated process of selecting important features out of all the features in our dataset.
Why do we need it?
Feature selection helps improve a model's accuracy and its computational efficiency.
Feature selection vs Dimensionality reduction?
Feature selection is not the same as dimensionality reduction. Both approaches reduce the number of features/attributes in a dataset, but a dimensionality reduction technique does so by creating new combinations of features, whereas feature selection techniques include and exclude features already present in the dataset without changing them (the short sketch after the list below makes this concrete).
- Dimensionality reduction techniques: Principal Component Analysis, Singular Value Decomposition.
- Feature selection techniques: Filter method, Wrapper method, Embedded method.
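To make the distinction concrete, here is a minimal sketch on a tiny made-up matrix (not the cars dataset we load below): PCA builds brand-new combined features, while a simple filter-style selector only keeps a subset of the original columns.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_regression

# Tiny made-up data: 5 samples, 3 original features, and a target.
X_toy=np.array([[1.0,2.0,0.1],[2.0,1.0,0.2],[3.0,4.0,0.1],[4.0,3.0,0.3],[5.0,6.0,0.2]])
y_toy=np.array([1.0,2.0,3.0,4.0,5.0])

# Dimensionality reduction: PCA creates 2 new features, each a combination of all 3 original columns.
X_pca=PCA(n_components=2).fit_transform(X_toy)

# Feature selection: SelectKBest keeps 2 of the original columns unchanged.
selector=SelectKBest(score_func=f_regression,k=2)
X_selected=selector.fit_transform(X_toy,y_toy)
print(selector.get_support())  # which original columns survived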
import pandas as pd
import numpy as np

# Load the cleaned cars dataset
automobile=pd.read_csv('dataset/cleaned_cars.csv')
automobile.head()
| | mpg | cylinders | displacement | horsepower | weight | acceleration | origin | Age |
|---|---|---|---|---|---|---|---|---|
| 0 | 18.0 | 8 | 307.0 | 130 | 3504 | 12.0 | US | 49 |
| 1 | 15.0 | 8 | 350.0 | 165 | 3693 | 11.5 | US | 49 |
| 2 | 18.0 | 8 | 318.0 | 150 | 3436 | 11.0 | US | 49 |
| 3 | 17.0 | 8 | 302.0 | 140 | 3449 | 10.5 | US | 49 |
| 4 | 15.0 | 8 | 429.0 | 198 | 4341 | 10.0 | US | 49 |
Here we are working with a cars dataset. Since "mpg" is the target we are predicting, we separate it out into Y, and because our primary goal in this tutorial is to learn feature selection techniques, we also drop the "origin" column, as it is not a continuous variable.
X=automobile.drop(['mpg','origin'],axis=1)
Y=automobile['mpg']
X,Y
( cylinders displacement horsepower weight acceleration Age
0 8 307.0 130 3504 12.0 49
1 8 350.0 165 3693 11.5 49
2 8 318.0 150 3436 11.0 49
3 8 302.0 140 3449 10.5 49
4 8 429.0 198 4341 10.0 49
.. ... ... ... ... ... ...
362 4 151.0 90 2950 17.3 37
363 4 140.0 86 2790 15.6 37
364 4 97.0 52 2130 24.6 37
365 4 135.0 84 2295 11.6 37
366 4 120.0 79 2625 18.6 37
[367 rows x 6 columns], 0 18.0
1 15.0
2 18.0
3 17.0
4 15.0
...
362 27.0
363 27.0
364 44.0
365 32.0
366 28.0
Name: mpg, Length: 367, dtype: float64)
Embedded Method
Embedded methods select the important features while the model is being trained; you could say that some training algorithms already perform feature selection as part of fitting the data.
In this example we will look at Lasso regression, Ridge regression, and decision trees.
Lasso Regression
Lasso regression is an L1-regularized regression, where there is a penalty for larger (more complicated) coefficients.
from sklearn.linear_model import Lasso
lasso=Lasso(alpha=.8)
lasso.fit(X,Y)
Lasso(alpha=0.8, copy_X=True, fit_intercept=True, max_iter=1000,
normalize=False, positive=False, precompute=False, random_state=None,
selection='cyclic', tol=0.0001, warm_start=False)
The alpha value determines the strength of the regularization of our model: the penalty term is the sum of the absolute values of the coefficients, multiplied by alpha.
Once we fit the lasso regression, the model exposes one coefficient per feature (the coef_ attribute). The regularization in lasso regression forces the coefficients of unimportant features close to 0.
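Before inspecting our alpha=0.8 model, here is a quick sketch of that effect (the alpha values below are arbitrary, chosen only for illustration): as alpha grows, the lasso typically drives more coefficients to exactly zero.
# Sketch: a larger alpha means a stronger penalty, which typically drives
# more coefficients to exactly 0. The alpha values are arbitrary.
for a in (0.01, 0.8, 5.0):
    model=Lasso(alpha=a).fit(X,Y)
    n_zero=(model.coef_==0).sum()
    print('alpha='+str(a)+': '+str(n_zero)+' of '+str(len(model.coef_))+' coefficients are exactly 0')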
predictors=X.columns
coef=pd.Series(lasso.coef_,predictors).sort_values()
print(coef)
Age -0.666233
horsepower -0.008517
weight -0.006472
displacement -0.000602
cylinders -0.000000
acceleration 0.000000
dtype: float64
So we can see that Age and weight are the most significant features among these predictors; we will refer to them as the lasso features.
lasso_features=['Age','weight']
lasso_feature_df=X[lasso_features]
lasso_feature_df
| | Age | weight |
|---|---|---|
| 0 | 49 | 3504 |
| 1 | 49 | 3693 |
| 2 | 49 | 3436 |
| 3 | 49 | 3449 |
| 4 | 49 | 4341 |
| ... | ... | ... |
| 362 | 37 | 2950 |
| 363 | 37 | 2790 |
| 364 | 37 | 2130 |
| 365 | 37 | 2295 |
| 366 | 37 | 2625 |
367 rows × 2 columns
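As a side note, instead of hardcoding the column names you could let scikit-learn pick the surviving features for you with SelectFromModel; a minimal sketch (with its default threshold it keeps every feature whose lasso coefficient was not shrunk to zero, which may give a different subset than the two columns chosen by hand above).
from sklearn.feature_selection import SelectFromModel

# Wrap the same lasso in a selector; features whose coefficients survive
# the threshold are kept, the rest are dropped.
selector=SelectFromModel(Lasso(alpha=.8)).fit(X,Y)
print(X.columns[selector.get_support()])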
Ridge Regression
Ridge regression is an L2 regularization technique.
from sklearn.linear_model import Ridge
ridge=Ridge(alpha=1.0)
ridge.fit(X,Y)
Ridge(alpha=1.0, copy_X=True, fit_intercept=True, max_iter=None,
normalize=False, random_state=None, solver='auto', tol=0.001)
The alpha value determines the strength of the regularization of our model: the penalty term is the sum of the squares of the coefficients, multiplied by alpha.
Once we fit the Ridge regression, the model again exposes one coefficient per feature. The regularization in Ridge regression forces the coefficients of unimportant features close to 0, and correlated features tend to get similar coefficients. Features with negative coefficients don't contribute that much to the model here.
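Before looking at our fitted model's coefficients, a quick contrast with lasso (a sketch, with an arbitrarily large alpha): ridge shrinks coefficients towards zero but typically does not make them exactly zero.
# Sketch: even a much stronger ridge penalty (alpha chosen arbitrarily)
# shrinks the coefficients without zeroing them out, unlike lasso.
strong_ridge=Ridge(alpha=100.0).fit(X,Y)
print((strong_ridge.coef_==0).sum(),'coefficients are exactly 0')
print(pd.Series(strong_ridge.coef_,X.columns).sort_values())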
predictors=X.columns
coef=pd.Series(ridge.coef_,predictors).sort_values()
print(coef)
Age -0.738056
cylinders -0.174340
weight -0.006706
horsepower -0.001467
displacement 0.003505
acceleration 0.072909
dtype: float64
Following that reasoning, we keep the two features with positive coefficients, displacement and acceleration, as the ridge features.
ridge_features=['displacement','acceleration']
ridge_feature_df=X[ridge_features]
ridge_feature_df
| | displacement | acceleration |
|---|---|---|
| 0 | 307.0 | 12.0 |
| 1 | 350.0 | 11.5 |
| 2 | 318.0 | 11.0 |
| 3 | 302.0 | 10.5 |
| 4 | 429.0 | 10.0 |
| ... | ... | ... |
| 362 | 151.0 | 17.3 |
| 363 | 140.0 | 15.6 |
| 364 | 97.0 | 24.6 |
| 365 | 135.0 | 11.6 |
| 366 | 120.0 | 18.6 |
367 rows × 2 columns
Decision Tree
During the construction of a decision tree, the more important features end up higher in the tree, closer to the root.
from sklearn.tree import DecisionTreeRegressor
decision_tree= DecisionTreeRegressor(max_depth=4)
decision_tree.fit(X,Y)
DecisionTreeRegressor(criterion='mse', max_depth=4, max_features=None,
max_leaf_nodes=None, min_impurity_decrease=0.0,
min_impurity_split=None, min_samples_leaf=1,
min_samples_split=2, min_weight_fraction_leaf=0.0,
presort=False, random_state=None, splitter='best')
predictors=X.columns
coef=pd.Series(decision_tree.feature_importances_,predictors).sort_values()
print(coef)
cylinders 0.000000
acceleration 0.003211
weight 0.048203
Age 0.103696
horsepower 0.188937
displacement 0.655953
dtype: float64
So we can see that the most significant features are displacement and horsepower.
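You can also inspect the fitted tree directly; a small sketch using export_text, which prints the splits so you can check which feature the tree uses first (with these importances, displacement is the likely candidate for the root).
from sklearn.tree import export_text

# Print the splits of the fitted tree; features appearing near the top of
# the printout are the ones the tree relies on first.
print(export_text(decision_tree,feature_names=list(X.columns)))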
decision_tree_features=['displacement','horsepower']
decision_tree_feature_df=X[decision_tree_features]
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score
def buildmodel(X,Y,test_frac,model_name=''):
    # Hold out a fraction of the data, fit a linear regression on the rest,
    # and report the R^2 score on the held-out split.
    x_train,x_test,y_train,y_test=train_test_split(X,Y,test_size=test_frac)
    model=LinearRegression().fit(x_train,y_train)
    y_pred=model.predict(x_test)
    print('Accuracy of the '+model_name+' model is '+str(r2_score(y_test,y_pred)))
Now let's check the accuracy (the R² score reported by buildmodel) of each of these feature subsets.
buildmodel(ridge_feature_df,Y,test_frac=.2,model_name='Ridge')
Accuracy of the Ridge model is 0.6524069499984987
buildmodel(lasso_feature_df,Y,test_frac=.2,model_name='Lasso')
Accuracy of the Lasso model is 0.823518232271492
buildmodel(decision_tree_feature_df,Y,test_frac=.2, model_name='Decision Tree')
Accuracy of the Decision Tree model is 0.5284130942943683
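Keep in mind that buildmodel draws a fresh random train/test split on every call, so these scores will vary from run to run. For a steadier comparison you could fix random_state in train_test_split or, as a sketch, average the R² score over several folds with cross-validation:
from sklearn.model_selection import cross_val_score

# Sketch: 5-fold cross-validated R^2 gives a more stable comparison than a
# single random train/test split.
for name,features in [('Lasso',lasso_feature_df),('Ridge',ridge_feature_df),('Decision Tree',decision_tree_feature_df)]:
    scores=cross_val_score(LinearRegression(),features,Y,cv=5,scoring='r2')
    print(name+' features: mean R2 = '+str(scores.mean()))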
You can get the notebook used in this tutorial here, and the dataset here.
Thanks for reading!!