T102: Wrapper Method - Feature Selection Techniques in Machine Learning
Note: This is part of a series on Data Preprocessing in Machine Learning. You can check all the tutorials here: Embedded Method, Wrapper Method, Filter Method, Handling Multicollinearity.
In this tutorial we will see how to select features using wrapper methods such as recursive feature elimination, forward selection, and backward selection, where we build models on subsets of features and pick the best subset based on each model's performance.
What is a wrapper method?
Wrapper methods select a set of features by preparing different combinations of features and evaluating each combination against the others. A predictive model is wrapped around each combination and used to assign it a score based on model accuracy.
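To make the idea concrete, here is an illustrative sketch (not part of this tutorial's code) that scores every subset of a given size with cross-validation and keeps the best one; the logistic regression estimator, the subset size, and the accuracy scoring are assumptions made just for this example. Evaluating every possible combination quickly becomes expensive, which is why the wrapper methods covered below search the feature space more cleverly.
from itertools import combinations
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def best_subset(X, Y, subset_size=3):
    # Illustrative helper (hypothetical, not used later): exhaustively score feature subsets
    best_score, best_cols = 0.0, None
    for cols in combinations(X.columns, subset_size):
        # Wrap a predictive model around this combination and score it
        score = cross_val_score(LogisticRegression(solver='liblinear'),
                                X[list(cols)], Y, cv=4, scoring='accuracy').mean()
        if score > best_score:
            best_score, best_cols = score, cols
    return best_cols, best_score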
import pandas as pd
import numpy as np
diabetes=pd.read_csv('dataset/diabetes.csv')
diabetes.head()
| | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |
X=diabetes.drop('Outcome',axis=1)
Y=diabetes['Outcome']
X,Y
( Pregnancies Glucose BloodPressure SkinThickness Insulin BMI \
0 6 148 72 35 0 33.6
1 1 85 66 29 0 26.6
2 8 183 64 0 0 23.3
3 1 89 66 23 94 28.1
4 0 137 40 35 168 43.1
.. ... ... ... ... ... ...
763 10 101 76 48 180 32.9
764 2 122 70 27 0 36.8
765 5 121 72 23 112 26.2
766 1 126 60 0 0 30.1
767 1 93 70 31 0 30.4
DiabetesPedigreeFunction Age
0 0.627 50
1 0.351 31
2 0.672 32
3 0.167 21
4 2.288 33
.. ... ...
763 0.171 63
764 0.340 27
765 0.245 30
766 0.349 47
767 0.315 23
[768 rows x 8 columns], 0 1
1 0
2 1
3 0
4 1
..
763 0
764 0
765 0
766 1
767 0
Name: Outcome, Length: 768, dtype: int64)
Recursive Feature Elimination
Recursive Feature Elimination (RFE) selects features by recursively considering smaller and smaller subsets of features, pruning the least important feature at each step. A model is fitted iteratively: in each iteration the features are ranked by importance, the worst-performing feature is removed, and the process repeats until the desired number of features remains. Each feature is then assigned a rank based on its elimination order. Because the search is greedy, a dataset with N features requires fitting the model only about N times, far fewer than the $2^N$ subsets an exhaustive wrapper search would have to evaluate.
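Before using scikit-learn's implementation, here is a rough sketch of what this pruning loop looks like, assuming the absolute size of a logistic regression coefficient is used as the importance measure; the manual_rfe helper is hypothetical and only meant to illustrate the idea.
import numpy as np
from sklearn.linear_model import LogisticRegression

def manual_rfe(X, Y, n_features_to_select=4):
    # Start from all columns and greedily drop the least important one per round
    remaining = list(X.columns)
    while len(remaining) > n_features_to_select:
        model = LogisticRegression(solver='liblinear').fit(X[remaining], Y)
        importances = np.abs(model.coef_[0])  # importance = |coefficient|
        remaining.remove(remaining[int(np.argmin(importances))])
    return remaining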
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
model =LogisticRegression(solver='liblinear')
rfe=RFE(model,n_features_to_select=4)
fit=rfe.fit(X,Y)
print('Number of selected features',fit.n_features_)
print('Selected Features',fit.support_)
print('Feature rankings',fit.ranking_)
Number of selected features 4
Selected Features [ True True False False False True True False]
Feature rankings [1 1 2 4 5 1 1 3]
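Here support_ is a boolean mask over the columns, so X.columns[fit.support_] would give the selected column names directly; the small helper below lays out each feature's name, rank, and selection status in a single table.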
def feature_ranks(X, Rank, Support):
    # Collect each feature's name, RFE rank, and selection flag into one DataFrame
    feature_rank = pd.DataFrame()
    for i in range(X.shape[1]):
        new = pd.DataFrame({"Features": X.columns[i], "Rank": Rank[i], "Selected": Support[i]}, index=[i])
        feature_rank = pd.concat([feature_rank, new])
    return feature_rank
feature_rank_df=feature_ranks(X,fit.ranking_,fit.support_)
feature_rank_df
| | Features | Rank | Selected |
|---|---|---|---|
| 0 | Pregnancies | 1 | True |
| 1 | Glucose | 1 | True |
| 2 | BloodPressure | 2 | False |
| 3 | SkinThickness | 4 | False |
| 4 | Insulin | 5 | False |
| 5 | BMI | 1 | True |
| 6 | DiabetesPedigreeFunction | 1 | True |
| 7 | Age | 3 | False |
We can see there are four features with rank 1; RFE considers these to be the most significant features.
recursive_feature_names=feature_rank_df.loc[feature_rank_df['Selected'] == True]
recursive_feature_names
| | Features | Rank | Selected |
|---|---|---|---|
| 0 | Pregnancies | 1 | True |
| 1 | Glucose | 1 | True |
| 5 | BMI | 1 | True |
| 6 | DiabetesPedigreeFunction | 1 | True |
RFE_selected_features=X[recursive_feature_names['Features'].values]
RFE_selected_features.head()
| | Pregnancies | Glucose | BMI | DiabetesPedigreeFunction |
|---|---|---|---|---|
| 0 | 6 | 148 | 33.6 | 0.627 |
| 1 | 1 | 85 | 26.6 | 0.351 |
| 2 | 8 | 183 | 23.3 | 0.672 |
| 3 | 1 | 89 | 28.1 | 0.167 |
| 4 | 0 | 137 | 43.1 | 2.288 |
Forward Selection
In this feature selection technique we start with no features and add one feature at a time, based on the performance of the classifier, until we reach the specified number of features.
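Conceptually, the greedy loop looks roughly like the sketch below, using the same random-forest classifier and 4-fold accuracy scoring as this tutorial; the forward_select helper is hypothetical, and the actual selection here is done with mlxtend's SequentialFeatureSelector.
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def forward_select(X, Y, k_features=4):
    selected = []
    while len(selected) < k_features:
        # Score each remaining feature when added to the current subset
        scores = {col: cross_val_score(RandomForestClassifier(n_estimators=10),
                                       X[selected + [col]], Y,
                                       cv=4, scoring='accuracy').mean()
                  for col in X.columns.difference(selected)}
        # Keep the feature that improves accuracy the most
        selected.append(max(scores, key=scores.get))
    return selected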
from mlxtend.feature_selection import SequentialFeatureSelector
from sklearn.ensemble import RandomForestClassifier
feature_selector = SequentialFeatureSelector(RandomForestClassifier(n_estimators=10),
                                             k_features=4,
                                             forward=True,
                                             scoring='accuracy',
                                             cv=4)
features=feature_selector.fit(np.array(X),Y)
forward_elimination_feature_names=list(X.columns[list(features.k_feature_idx_)])
forward_elimination_feature_names
['Glucose', 'BloodPressure', 'BMI', 'Age']
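If you want to look deeper than the final pick, mlxtend's selector also exposes the cross-validated score of the chosen subset and the best subset found at each step (attribute names as documented by mlxtend):
features.k_score_   # CV accuracy of the selected 4-feature subset
features.subsets_   # best subset and score recorded at each step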
forward_elimination_features_df=X[forward_elimination_feature_names]
forward_elimination_features_df.head()
| | Glucose | BloodPressure | BMI | Age |
|---|---|---|---|---|
| 0 | 148 | 72 | 33.6 | 50 |
| 1 | 85 | 66 | 26.6 | 31 |
| 2 | 183 | 64 | 23.3 | 32 |
| 3 | 89 | 66 | 28.1 | 21 |
| 4 | 137 | 40 | 43.1 | 33 |
Backward Selection
In this feature selection technique we start with all the features and remove one feature at a time, based on the performance of the classifier, until we reach the specified number of features; it is the mirror image of forward selection, so only forward=False changes in the code.
from mlxtend.feature_selection import SequentialFeatureSelector
from sklearn.ensemble import RandomForestClassifier
feature_selector = SequentialFeatureSelector(RandomForestClassifier(n_estimators=10),
                                             k_features=4,
                                             forward=False,
                                             scoring='accuracy',
                                             cv=4)
features=feature_selector.fit(np.array(X),Y)
backward_elimination_feature_names=list(X.columns[list(features.k_feature_idx_)])
backward_elimination_feature_names
['Glucose', 'BMI', 'DiabetesPedigreeFunction', 'Age']
backward_elimination_features_df=X[backward_elimination_feature_names]
backward_elimination_features_df.head()
| | Glucose | BMI | DiabetesPedigreeFunction | Age |
|---|---|---|---|---|
| 0 | 148 | 33.6 | 0.627 | 50 |
| 1 | 85 | 26.6 | 0.351 | 31 |
| 2 | 183 | 23.3 | 0.672 | 32 |
| 3 | 89 | 28.1 | 0.167 | 21 |
| 4 | 137 | 43.1 | 2.288 | 33 |
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
def buildmodel(X, Y, test_frac, model_name=''):
    # Hold out a test fraction, fit a logistic regression on the rest, and report test accuracy
    x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=test_frac)
    model = LogisticRegression(solver='liblinear').fit(x_train, y_train)
    y_pred = model.predict(x_test)
    print('Accuracy of the ' + model_name + ' based model is ', str(accuracy_score(y_test, y_pred)))
buildmodel(RFE_selected_features,Y,test_frac=.2,model_name='RFE')
Accuracy of the RFE based model is 0.7727272727272727
buildmodel(forward_elimination_features_df,Y,test_frac=.2,model_name='Forward Elimination')
Accuracy of the Forward Elimination based model is 0.6883116883116883
buildmodel(backward_elimination_features_df,Y,test_frac=.2,model_name='Backward Elimination')
Accuracy of the Backward Elimination based model is 0.8441558441558441
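Note that buildmodel draws a fresh random train/test split on every call (no random_state is fixed), so the accuracies above will change from run to run and are not a strictly fair comparison of the three subsets. As a quick sketch (not part of the original notebook), scoring all three subsets on the same cross-validation folds gives a more stable comparison:
from sklearn.model_selection import cross_val_score

# Evaluate each selected feature subset with identical 5-fold CV splits
for name, subset in [('RFE', RFE_selected_features),
                     ('Forward Selection', forward_elimination_features_df),
                     ('Backward Selection', backward_elimination_features_df)]:
    scores = cross_val_score(LogisticRegression(solver='liblinear'), subset, Y, cv=5)
    print(name, round(scores.mean(), 3))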
You can get the notebook used in this tutorial here, and the dataset here.
Thanks for reading!!