T102: Wrapper methods - Feature selection techniques in machine learning

Note: This is part of a series on Data Preprocessing in Machine Learning. You can check all tutorials here: Embedded Method, Wrapper Method, Filter Method, Handling Multicollinearity.

In this tutorial we will see how to select features using wrapper methods such as recursive feature elimination, forward selection, and backward selection. In each of these, you build models on subsets of features and choose the best subset based on the model's performance.

What is a wrapper method?

Wrapper methods select a set of features by preparing different combinations of features and then evaluating each combination against the others. A predictive model is trained on each combination and assigns it a score based on model accuracy, and the best-scoring combination is kept.
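
The idea of scoring feature combinations with a model can be sketched in plain Python. This is a minimal illustration, not the tutorial's code: `best_subset`, `toy_score` and the weights in `TOY_WEIGHTS` are hypothetical, and `toy_score` simply stands in for "train a model on these columns and return its accuracy".

```python
from itertools import combinations

def best_subset(features, score_fn, subset_size):
    """Score every feature combination of the given size and return the best one."""
    best_combo, best_score = None, float("-inf")
    for combo in combinations(features, subset_size):
        # In practice: train a model on these columns and measure its accuracy
        score = score_fn(combo)
        if score > best_score:
            best_combo, best_score = combo, score
    return best_combo, best_score

# Hypothetical stand-in for model accuracy: each feature adds a made-up amount
TOY_WEIGHTS = {"Glucose": 0.30, "BMI": 0.20, "Age": 0.10, "Insulin": 0.05}

def toy_score(combo):
    return sum(TOY_WEIGHTS[name] for name in combo)

combo, score = best_subset(list(TOY_WEIGHTS), toy_score, subset_size=2)
print(combo, round(score, 2))
```

Trying every combination quickly becomes expensive as the number of features grows, which is why the techniques in this tutorial search greedily instead.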

import pandas as pd
import numpy as np
   Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin   BMI  DiabetesPedigreeFunction  Age  Outcome
0            6      148             72             35        0  33.6                     0.627   50        1
1            1       85             66             29        0  26.6                     0.351   31        0
2            8      183             64              0        0  23.3                     0.672   32        1
3            1       89             66             23       94  28.1                     0.167   21        0
4            0      137             40             35      168  43.1                     2.288   33        1
(     Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin   BMI  \
 0              6      148             72             35        0  33.6   
 1              1       85             66             29        0  26.6   
 2              8      183             64              0        0  23.3   
 3              1       89             66             23       94  28.1   
 4              0      137             40             35      168  43.1   
 ..           ...      ...            ...            ...      ...   ...   
 763           10      101             76             48      180  32.9   
 764            2      122             70             27        0  36.8   
 765            5      121             72             23      112  26.2   
 766            1      126             60              0        0  30.1   
 767            1       93             70             31        0  30.4   
      DiabetesPedigreeFunction  Age  
 0                       0.627   50  
 1                       0.351   31  
 2                       0.672   32  
 3                       0.167   21  
 4                       2.288   33  
 ..                        ...  ...  
 763                     0.171   63  
 764                     0.340   27  
 765                     0.245   30  
 766                     0.349   47  
 767                     0.315   23  
 [768 rows x 8 columns], 0      1
 1      0
 2      1
 3      0
 4      1
 763    0
 764    0
 765    0
 766    1
 767    0
 Name: Outcome, Length: 768, dtype: int64)

Recursive Feature Elimination

Recursive Feature Elimination (RFE) selects features by recursively considering smaller and smaller subsets of features, pruning the least important feature at each step. Models are built iteratively: in each iteration the model is fitted on the remaining features, the weakest feature (judged by the model's coefficients or feature importances) is removed, and the process continues until the desired number of features is left. Each feature is then ranked according to its elimination order. Because RFE is a greedy search, removing one feature per step on a dataset with N features requires at most about N model fits, rather than evaluating all $2^N$ possible feature subsets.
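
The elimination loop and the way ranks follow from elimination order can be sketched as below. This is a simplified illustration rather than scikit-learn's implementation: `recursive_elimination`, `toy_importance` and the weights are hypothetical, and a real run would refit the model on the remaining features at every step.

```python
def recursive_elimination(features, importance_fn, n_to_keep):
    """Drop the least important feature until n_to_keep remain; return (kept, ranks)."""
    remaining = list(features)
    ranks = {}
    # The first feature dropped gets the worst (largest) rank
    next_rank = len(remaining) - n_to_keep + 1
    while len(remaining) > n_to_keep:
        # In practice: refit the model on `remaining` and read its importances
        importances = importance_fn(remaining)
        weakest = min(remaining, key=lambda name: importances[name])
        remaining.remove(weakest)
        ranks[weakest] = next_rank
        next_rank -= 1
    for name in remaining:
        ranks[name] = 1  # every surviving feature shares rank 1
    return remaining, ranks

# Hypothetical, fixed importances standing in for model coefficients
TOY_IMPORTANCE = {"Glucose": 0.9, "BMI": 0.7, "Age": 0.4, "Insulin": 0.1}

def toy_importance(remaining):
    return {name: TOY_IMPORTANCE[name] for name in remaining}

kept, ranks = recursive_elimination(list(TOY_IMPORTANCE), toy_importance, n_to_keep=2)
print(kept)
print(ranks)
```

Selected features all carry rank 1, and larger ranks mark features that were eliminated earlier.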

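from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
model = LogisticRegression(solver='liblinear')
# Keep 4 features (matching the output below); X and Y are the features and target shown above
rfe = RFE(estimator=model, n_features_to_select=4)
fit = rfe.fit(X, Y)
print('Number of selected features', fit.n_features_)
print('Selected Features', fit.support_)
print('Feature rankings', fit.ranking_)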
Number of selected features 4
Selected Features [ True  True False False False  True  True False]
Feature rankings [1 1 2 4 5 1 1 3]
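def feature_ranks(X, Rank, Support):
    # Build one row per feature (name, RFE rank, selected flag) and stack the rows
    rows = []
    for i in range(X.shape[1]):
        rows.append(pd.DataFrame({"Features": X.columns[i], "Rank": Rank[i], 'Selected': Support[i]}, index=[i]))
    return pd.concat(rows)
feature_rank_df = feature_ranks(X, fit.ranking_, fit.support_)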
                   Features  Rank  Selected
0               Pregnancies     1      True
1                   Glucose     1      True
2             BloodPressure     2     False
3             SkinThickness     4     False
4                   Insulin     5     False
5                       BMI     1      True
6  DiabetesPedigreeFunction     1      True
7                       Age     3     False

We can see that four features have rank 1; RFE identifies these as the most significant features.

recursive_feature_names=feature_rank_df.loc[feature_rank_df['Selected'] == True]
                   Features  Rank  Selected
0               Pregnancies     1      True
1                   Glucose     1      True
5                       BMI     1      True
6  DiabetesPedigreeFunction     1      True

   Pregnancies  Glucose   BMI  DiabetesPedigreeFunction
0            6      148  33.6                     0.627
1            1       85  26.6                     0.351
2            8      183  23.3                     0.672
3            1       89  28.1                     0.167
4            0      137  43.1                     2.288

Forward Selection

In this feature selection technique, we start with no features and add one feature at a time, each time choosing the feature that most improves the performance of the classifier, until we reach the specified number of features.
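
The greedy loop behind forward selection can be sketched in plain Python. This is an illustration only, not mlxtend's implementation: `forward_selection`, `toy_score` and its numbers are hypothetical, with `toy_score` standing in for the classifier's cross-validated performance.

```python
def forward_selection(features, score_fn, k):
    """Greedily add the feature that raises the score most until k are selected."""
    selected = []
    candidates = list(features)
    while len(selected) < k and candidates:
        # Try each remaining feature alongside those already selected
        best = max(candidates, key=lambda name: score_fn(selected + [name]))
        selected.append(best)
        candidates.remove(best)
    return selected

# Hypothetical score: made-up per-feature gains, with a penalty when two
# overlapping features (here Glucose and Insulin) are used together
TOY_WEIGHTS = {"Glucose": 0.30, "Insulin": 0.25, "BMI": 0.20, "Age": 0.10}

def toy_score(subset):
    score = sum(TOY_WEIGHTS[name] for name in subset)
    if "Glucose" in subset and "Insulin" in subset:
        score -= 0.20
    return score

print(forward_selection(list(TOY_WEIGHTS), toy_score, k=2))
```

In this toy setup Insulin looks strong on its own but is skipped in the second step, because it adds little once Glucose is already selected.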

from mlxtend.feature_selection import SequentialFeatureSelector
from sklearn.ensemble import RandomForestClassifier
['Glucose', 'BloodPressure', 'BMI', 'Age']
   Glucose  BloodPressure   BMI  Age
0      148             72  33.6   50
1       85             66  26.6   31
2      183             64  23.3   32
3       89             66  28.1   21
4      137             40  43.1   33

Backward Selection

In this feature selection technique, we start with all features and remove one feature at a time, each time dropping the feature whose removal hurts the performance of the classifier the least, until we reach the specified number of features.
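
The corresponding greedy loop for backward selection might look like the sketch below. Again, this is not mlxtend's implementation: `backward_selection`, `toy_score` and its numbers are hypothetical placeholders for a real classifier's performance.

```python
def backward_selection(features, score_fn, k):
    """Start from all features and drop the one whose removal hurts the score least."""
    selected = list(features)
    while len(selected) > k:
        # Score each subset obtained by leaving exactly one feature out,
        # and drop the feature whose absence leaves the highest score
        drop = max(selected, key=lambda name: score_fn([f for f in selected if f != name]))
        selected.remove(drop)
    return selected

# Hypothetical score: made-up per-feature gains, with a penalty when two
# overlapping features (here Glucose and Insulin) are used together
TOY_WEIGHTS = {"Glucose": 0.30, "Insulin": 0.25, "BMI": 0.20, "Age": 0.10}

def toy_score(subset):
    score = sum(TOY_WEIGHTS[name] for name in subset)
    if "Glucose" in subset and "Insulin" in subset:
        score -= 0.20
    return score

print(backward_selection(list(TOY_WEIGHTS), toy_score, k=2))
```

Forward and backward selection are both greedy, so on real data they can settle on different subsets, as the two results in this tutorial show.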

from mlxtend.feature_selection import SequentialFeatureSelector
from sklearn.ensemble import RandomForestClassifier
['Glucose', 'BMI', 'DiabetesPedigreeFunction', 'Age']
   Glucose   BMI  DiabetesPedigreeFunction  Age
0      148  33.6                     0.627   50
1       85  26.6                     0.351   31
2      183  23.3                     0.672   32
3       89  28.1                     0.167   21
4      137  43.1                     2.288   33
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
def buildmodel(X, Y, test_frac, model_name=''):
    # Hold out test_frac of the rows, fit logistic regression, and report test accuracy
    x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=test_frac)
    model = LogisticRegression(solver='liblinear').fit(x_train, y_train)
    y_pred = model.predict(x_test)
    print('Accuracy of the ' + model_name + ' based model is ', str(accuracy_score(y_test, y_pred)))
Accuracy of the RFE based model is  0.7727272727272727
buildmodel(forward_elimination_features_df,Y,test_frac=.2,model_name='Forward Elimination')
Accuracy of the Forward Elimination based model is  0.6883116883116883
buildmodel(backward_elimination_features_df,Y,test_frac=.2,model_name='Backward Elimination')
Accuracy of the Backward Elimination based model is  0.8441558441558441
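
To put these accuracies in context, each one is the fraction of held-out rows predicted correctly. The short sketch below converts the reported values back into row counts, assuming the test size of 0.2 x 768 rows is rounded up to 154; the variable names are illustrative only.

```python
import math

n_rows = 768                             # rows in the dataset shown earlier
test_frac = 0.2                          # value passed to buildmodel
n_test = math.ceil(test_frac * n_rows)   # assumption: the test size is rounded up

reported = {
    "RFE": 0.7727272727272727,
    "Forward Elimination": 0.6883116883116883,
    "Backward Elimination": 0.8441558441558441,
}

correct_counts = {}
for name, accuracy in reported.items():
    # Accuracy = correct predictions / test rows, so invert it to get the count
    correct_counts[name] = round(accuracy * n_test)
    print(f"{name}: {correct_counts[name]} of {n_test} test rows predicted correctly")
```

Since buildmodel draws a new random split on every call, part of the gap between the three scores may come from the split itself rather than from the selected features.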

You can get the notebook used in this tutorial here and the dataset used here.

