T101: Embedded Method - Feature Selection Techniques in Machine Learning
Note: This is part of a series on Data Preprocessing in Machine Learning. You can find all the tutorials here: Embedded Method, Wrapper Method, Filter Method, Handling Multicollinearity.
Are you drowning in features but don't know which ones to pick and which ones to ignore? How would you develop a predictive model then?
This is one of those questions every machine learning engineer comes across. Giving a well-grounded answer usually takes deep domain knowledge, but don't worry: in this tutorial I am going to help you automate the process. There are a few checklists we can follow to select the best features from our data. Let's begin!
What is Feature Selection?
Feature selection is the automated process of selecting important features out of all the features in our dataset.
Why do we need it?
Feature selection helps improve a model's accuracy and its computational efficiency.
Feature selection vs Dimensionality reduction?
Feature selection is not the same as dimensionality reduction. Both approaches reduce the number of features/attributes in a dataset, but a dimensionality reduction technique does so by creating new combinations of features, whereas feature selection techniques include and exclude features already present in the dataset without changing them (the short sketch after the list below makes this concrete).
- Dimensionality reduction techniques: Principal Component Analysis, Singular Value Decomposition.
- Feature selection techniques: Filter method, Wrapper method, Embedded method.
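To make the distinction concrete, here is a minimal sketch on a tiny made-up matrix (not the cars dataset we load below): PCA builds brand-new combined features, while a simple filter-style selector only keeps a subset of the original columns.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_regression

# Tiny made-up data: 5 samples, 3 original features, and a target.
X_toy=np.array([[1.0,2.0,0.1],[2.0,1.0,0.2],[3.0,4.0,0.1],[4.0,3.0,0.3],[5.0,6.0,0.2]])
y_toy=np.array([1.0,2.0,3.0,4.0,5.0])

# Dimensionality reduction: PCA creates 2 new features, each a combination of all 3 original columns.
X_pca=PCA(n_components=2).fit_transform(X_toy)

# Feature selection: SelectKBest keeps 2 of the original columns unchanged.
selector=SelectKBest(score_func=f_regression,k=2)
X_selected=selector.fit_transform(X_toy,y_toy)
print(selector.get_support())  # which original columns survived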
import pandas as pd
import numpy as np

# Load the cleaned cars dataset
automobile=pd.read_csv('dataset/cleaned_cars.csv')
automobile.head()
| | mpg | cylinders | displacement | horsepower | weight | acceleration | origin | Age |
|---|---|---|---|---|---|---|---|---|
| 0 | 18.0 | 8 | 307.0 | 130 | 3504 | 12.0 | US | 49 |
| 1 | 15.0 | 8 | 350.0 | 165 | 3693 | 11.5 | US | 49 |
| 2 | 18.0 | 8 | 318.0 | 150 | 3436 | 11.0 | US | 49 |
| 3 | 17.0 | 8 | 302.0 | 140 | 3449 | 10.5 | US | 49 |
| 4 | 15.0 | 8 | 429.0 | 198 | 4341 | 10.0 | US | 49 |
Here we are working with a cars dataset. Since "mpg" is the target we are predicting, we separate it out into Y, and because our primary goal in this tutorial is to learn feature selection techniques, we also drop the "origin" column, as it is not a continuous variable.
X=automobile.drop(['mpg','origin'],axis=1)
Y=automobile['mpg']
X,Y
( cylinders displacement horsepower weight acceleration Age
0 8 307.0 130 3504 12.0 49
1 8 350.0 165 3693 11.5 49
2 8 318.0 150 3436 11.0 49
3 8 302.0 140 3449 10.5 49
4 8 429.0 198 4341 10.0 49
.. ... ... ... ... ... ...
362 4 151.0 90 2950 17.3 37
363 4 140.0 86 2790 15.6 37
364 4 97.0 52 2130 24.6 37
365 4 135.0 84 2295 11.6 37
366 4 120.0 79 2625 18.6 37
[367 rows x 6 columns], 0 18.0
1 15.0
2 18.0
3 17.0
4 15.0
...
362 27.0
363 27.0
364 44.0
365 32.0
366 28.0
Name: mpg, Length: 367, dtype: float64)
Embedded Method
Embedded methods select the important features while the model is being trained; you could say that some training algorithms already perform feature selection as part of fitting the data.
In this example we will look at Lasso regression, Ridge regression, and decision trees.
Lasso Regression
Lasso regression is an L1-regularized regression, where there is a penalty for larger (more complicated) coefficients.
from sklearn.linear_model import Lasso
lasso=Lasso(alpha=.8)
lasso.fit(X,Y)
Lasso(alpha=0.8, copy_X=True, fit_intercept=True, max_iter=1000,
normalize=False, positive=False, precompute=False, random_state=None,
selection='cyclic', tol=0.0001, warm_start=False)
The alpha value determines the strength of the regularization of our model: the penalty term is the sum of the absolute values of the coefficients, multiplied by alpha.
Once we fit the lasso regression, the model exposes one coefficient per feature (the coef_ attribute). The regularization in lasso regression forces the coefficients of unimportant features close to 0.
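Before inspecting our alpha=0.8 model, here is a quick sketch of that effect (the alpha values below are arbitrary, chosen only for illustration): as alpha grows, the lasso typically drives more coefficients to exactly zero.
# Sketch: a larger alpha means a stronger penalty, which typically drives
# more coefficients to exactly 0. The alpha values are arbitrary.
for a in (0.01, 0.8, 5.0):
    model=Lasso(alpha=a).fit(X,Y)
    n_zero=(model.coef_==0).sum()
    print('alpha='+str(a)+': '+str(n_zero)+' of '+str(len(model.coef_))+' coefficients are exactly 0')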
predictors=X.columns
coef=pd.Series(lasso.coef_,predictors).sort_values()
print(coef)
Age -0.666233
horsepower -0.008517
weight -0.006472
displacement -0.000602
cylinders -0.000000
acceleration 0.000000
dtype: float64
So we can see that Age and weight are the most significant features among these predictors; we will refer to them as the lasso features.
lasso_features=['Age','weight']
lasso_feature_df=X[lasso_features]
lasso_feature_df
| | Age | weight |
|---|---|---|
| 0 | 49 | 3504 |
| 1 | 49 | 3693 |
| 2 | 49 | 3436 |
| 3 | 49 | 3449 |
| 4 | 49 | 4341 |
| ... | ... | ... |
| 362 | 37 | 2950 |
| 363 | 37 | 2790 |
| 364 | 37 | 2130 |
| 365 | 37 | 2295 |
| 366 | 37 | 2625 |
367 rows × 2 columns
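As a side note, instead of hardcoding the column names you could let scikit-learn pick the surviving features for you with SelectFromModel; a minimal sketch (with its default threshold it keeps every feature whose lasso coefficient was not shrunk to zero, which may give a different subset than the two columns chosen by hand above).
from sklearn.feature_selection import SelectFromModel

# Wrap the same lasso in a selector; features whose coefficients survive
# the threshold are kept, the rest are dropped.
selector=SelectFromModel(Lasso(alpha=.8)).fit(X,Y)
print(X.columns[selector.get_support()])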
Ridge Regression
Ridge regression is an L2 regularization technique.
from sklearn.linear_model import Ridge
ridge=Ridge(alpha=1.0)
ridge.fit(X,Y)
Ridge(alpha=1.0, copy_X=True, fit_intercept=True, max_iter=None,
normalize=False, random_state=None, solver='auto', tol=0.001)
The alpha value determines the strength of the regularization of our model: the penalty term is the sum of the squares of the coefficients, multiplied by alpha.
Once we fit the Ridge regression, the model again exposes one coefficient per feature. The regularization in Ridge regression forces the coefficients of unimportant features close to 0, and correlated features tend to get similar coefficients. Features with negative coefficients don't contribute that much to the model here.
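Before looking at our fitted model's coefficients, a quick contrast with lasso (a sketch, with an arbitrarily large alpha): ridge shrinks coefficients towards zero but typically does not make them exactly zero.
# Sketch: even a much stronger ridge penalty (alpha chosen arbitrarily)
# shrinks the coefficients without zeroing them out, unlike lasso.
strong_ridge=Ridge(alpha=100.0).fit(X,Y)
print((strong_ridge.coef_==0).sum(),'coefficients are exactly 0')
print(pd.Series(strong_ridge.coef_,X.columns).sort_values())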
predictors=X.columns
coef=pd.Series(ridge.coef_,predictors).sort_values()
print(coef)
Age -0.738056
cylinders -0.174340
weight -0.006706
horsepower -0.001467
displacement 0.003505
acceleration 0.072909
dtype: float64
Following that reasoning, we keep the two features with positive coefficients, displacement and acceleration, as the ridge features.
ridge_features=['displacement','acceleration']
ridge_feature_df=X[ridge_features]
ridge_feature_df
| | displacement | acceleration |
|---|---|---|
| 0 | 307.0 | 12.0 |
| 1 | 350.0 | 11.5 |
| 2 | 318.0 | 11.0 |
| 3 | 302.0 | 10.5 |
| 4 | 429.0 | 10.0 |
| ... | ... | ... |
| 362 | 151.0 | 17.3 |
| 363 | 140.0 | 15.6 |
| 364 | 97.0 | 24.6 |
| 365 | 135.0 | 11.6 |
| 366 | 120.0 | 18.6 |
367 rows × 2 columns
Decision Tree
During the construction of a decision tree, the more important features end up higher in the tree, closer to the root.
from sklearn.tree import DecisionTreeRegressor
decision_tree= DecisionTreeRegressor(max_depth=4)
decision_tree.fit(X,Y)
DecisionTreeRegressor(criterion='mse', max_depth=4, max_features=None,
max_leaf_nodes=None, min_impurity_decrease=0.0,
min_impurity_split=None, min_samples_leaf=1,
min_samples_split=2, min_weight_fraction_leaf=0.0,
presort=False, random_state=None, splitter='best')
predictors=X.columns
coef=pd.Series(decision_tree.feature_importances_,predictors).sort_values()
print(coef)
cylinders 0.000000
acceleration 0.003211
weight 0.048203
Age 0.103696
horsepower 0.188937
displacement 0.655953
dtype: float64
So we can see that the most significant features are displacement and horsepower.
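You can also inspect the fitted tree directly; a small sketch using export_text, which prints the splits so you can check which feature the tree uses first (with these importances, displacement is the likely candidate for the root).
from sklearn.tree import export_text

# Print the splits of the fitted tree; features appearing near the top of
# the printout are the ones the tree relies on first.
print(export_text(decision_tree,feature_names=list(X.columns)))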
decision_tree_features=['displacement','horsepower']
decision_tree_feature_df=X[decision_tree_features]
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score
def buildmodel(X,Y,test_frac,model_name=''):
    # Hold out a fraction of the data, fit a linear regression on the rest,
    # and report the R^2 score on the held-out split.
    x_train,x_test,y_train,y_test=train_test_split(X,Y,test_size=test_frac)
    model=LinearRegression().fit(x_train,y_train)
    y_pred=model.predict(x_test)
    print('Accuracy of the '+model_name+' model is '+str(r2_score(y_test,y_pred)))
Now let's check the accuracy (the R² score reported by buildmodel) of each of these feature subsets.
buildmodel(ridge_feature_df,Y,test_frac=.2,model_name='Ridge')
Accuracy of the Ridge model is 0.6524069499984987
buildmodel(lasso_feature_df,Y,test_frac=.2,model_name='Lasso')
Accuracy of the Lasso model is 0.823518232271492
buildmodel(decision_tree_feature_df,Y,test_frac=.2, model_name='Decision Tree')
Accuracy of the Decision Tree model is 0.5284130942943683
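Keep in mind that buildmodel draws a fresh random train/test split on every call, so these scores will vary from run to run. For a steadier comparison you could fix random_state in train_test_split or, as a sketch, average the R² score over several folds with cross-validation:
from sklearn.model_selection import cross_val_score

# Sketch: 5-fold cross-validated R^2 gives a more stable comparison than a
# single random train/test split.
for name,features in [('Lasso',lasso_feature_df),('Ridge',ridge_feature_df),('Decision Tree',decision_tree_feature_df)]:
    scores=cross_val_score(LinearRegression(),features,Y,cv=5,scoring='r2')
    print(name+' features: mean R2 = '+str(scores.mean()))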
You can get the notebook used in this tutorial here, and the dataset here.
Thanks for reading!!