T102: Wrapper Method - Feature Selection Techniques in Machine Learning
Note: This is part of a series on Data Preprocessing in Machine Learning. You can check all the tutorials here: Embedded Method, Wrapper Method, Filter Method, Handling Multicollinearity.
In this tutorial we will see how to select features using wrapper methods such as recursive feature elimination, forward selection, and backward selection, where we build models on subsets of features and pick the best subset based on each model's performance.
What is a wrapper method?
Wrapper methods select a set of features by preparing different combinations of features and evaluating each combination against the others. A predictive model is wrapped around each combination and used to assign it a score based on model accuracy.
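To make the idea concrete, here is an illustrative sketch (not part of this tutorial's code) that scores every subset of a given size with cross-validation and keeps the best one; the logistic regression estimator, the subset size, and the accuracy scoring are assumptions made just for this example. Evaluating every possible combination quickly becomes expensive, which is why the wrapper methods covered below search the feature space more cleverly.
from itertools import combinations
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def best_subset(X, Y, subset_size=3):
    # Illustrative helper (hypothetical, not used later): exhaustively score feature subsets
    best_score, best_cols = 0.0, None
    for cols in combinations(X.columns, subset_size):
        # Wrap a predictive model around this combination and score it
        score = cross_val_score(LogisticRegression(solver='liblinear'),
                                X[list(cols)], Y, cv=4, scoring='accuracy').mean()
        if score > best_score:
            best_score, best_cols = score, cols
    return best_cols, best_score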
import pandas as pd
import numpy as np
diabetes=pd.read_csv('dataset/diabetes.csv')
diabetes.head()
| | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |
X=diabetes.drop('Outcome',axis=1)
Y=diabetes['Outcome']
X,Y
( Pregnancies Glucose BloodPressure SkinThickness Insulin BMI \
0 6 148 72 35 0 33.6
1 1 85 66 29 0 26.6
2 8 183 64 0 0 23.3
3 1 89 66 23 94 28.1
4 0 137 40 35 168 43.1
.. ... ... ... ... ... ...
763 10 101 76 48 180 32.9
764 2 122 70 27 0 36.8
765 5 121 72 23 112 26.2
766 1 126 60 0 0 30.1
767 1 93 70 31 0 30.4
DiabetesPedigreeFunction Age
0 0.627 50
1 0.351 31
2 0.672 32
3 0.167 21
4 2.288 33
.. ... ...
763 0.171 63
764 0.340 27
765 0.245 30
766 0.349 47
767 0.315 23
[768 rows x 8 columns], 0 1
1 0
2 1
3 0
4 1
..
763 0
764 0
765 0
766 1
767 0
Name: Outcome, Length: 768, dtype: int64)
Recursive Feature Elimination
Recursive Feature Elimination (RFE) selects features by recursively considering smaller and smaller subsets of features, pruning the least important feature at each step. A model is fitted iteratively: in each iteration the features are ranked by importance, the worst-performing feature is removed, and the process repeats until the desired number of features remains. Each feature is then assigned a rank based on its elimination order. Because the search is greedy, a dataset with N features requires fitting the model only about N times, far fewer than the $2^N$ subsets an exhaustive wrapper search would have to evaluate.
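Before using scikit-learn's implementation, here is a rough sketch of what this pruning loop looks like, assuming the absolute size of a logistic regression coefficient is used as the importance measure; the manual_rfe helper is hypothetical and only meant to illustrate the idea.
import numpy as np
from sklearn.linear_model import LogisticRegression

def manual_rfe(X, Y, n_features_to_select=4):
    # Start from all columns and greedily drop the least important one per round
    remaining = list(X.columns)
    while len(remaining) > n_features_to_select:
        model = LogisticRegression(solver='liblinear').fit(X[remaining], Y)
        importances = np.abs(model.coef_[0])  # importance = |coefficient|
        remaining.remove(remaining[int(np.argmin(importances))])
    return remaining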
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
model =LogisticRegression(solver='liblinear')
rfe=RFE(model,n_features_to_select=4)
fit=rfe.fit(X,Y)
print('Number of selected features',fit.n_features_)
print('Selected Features',fit.support_)
print('Feature rankings',fit.ranking_)
Number of selected features 4
Selected Features [ True True False False False True True False]
Feature rankings [1 1 2 4 5 1 1 3]
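Here support_ is a boolean mask over the columns, so X.columns[fit.support_] would give the selected column names directly; the small helper below lays out each feature's name, rank, and selection status in a single table.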
def feature_ranks(X, Rank, Support):
    # Collect each feature's name, RFE rank, and selection flag into one DataFrame
    feature_rank = pd.DataFrame()
    for i in range(X.shape[1]):
        new = pd.DataFrame({"Features": X.columns[i], "Rank": Rank[i], "Selected": Support[i]}, index=[i])
        feature_rank = pd.concat([feature_rank, new])
    return feature_rank
feature_rank_df=feature_ranks(X,fit.ranking_,fit.support_)
feature_rank_df
| | Features | Rank | Selected |
|---|---|---|---|
| 0 | Pregnancies | 1 | True |
| 1 | Glucose | 1 | True |
| 2 | BloodPressure | 2 | False |
| 3 | SkinThickness | 4 | False |
| 4 | Insulin | 5 | False |
| 5 | BMI | 1 | True |
| 6 | DiabetesPedigreeFunction | 1 | True |
| 7 | Age | 3 | False |
We can see there are four features with rank 1; RFE considers these to be the most significant features.
recursive_feature_names=feature_rank_df.loc[feature_rank_df['Selected'] == True]
recursive_feature_names
| | Features | Rank | Selected |
|---|---|---|---|
| 0 | Pregnancies | 1 | True |
| 1 | Glucose | 1 | True |
| 5 | BMI | 1 | True |
| 6 | DiabetesPedigreeFunction | 1 | True |
RFE_selected_features=X[recursive_feature_names['Features'].values]
RFE_selected_features.head()
| | Pregnancies | Glucose | BMI | DiabetesPedigreeFunction |
|---|---|---|---|---|
| 0 | 6 | 148 | 33.6 | 0.627 |
| 1 | 1 | 85 | 26.6 | 0.351 |
| 2 | 8 | 183 | 23.3 | 0.672 |
| 3 | 1 | 89 | 28.1 | 0.167 |
| 4 | 0 | 137 | 43.1 | 2.288 |
Forward Selection
In this feature selection technique we start with no features and add one feature at a time, based on the performance of the classifier, until we reach the specified number of features.
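Conceptually, the greedy loop looks roughly like the sketch below, using the same random-forest classifier and 4-fold accuracy scoring as this tutorial; the forward_select helper is hypothetical, and the actual selection here is done with mlxtend's SequentialFeatureSelector.
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def forward_select(X, Y, k_features=4):
    selected = []
    while len(selected) < k_features:
        # Score each remaining feature when added to the current subset
        scores = {col: cross_val_score(RandomForestClassifier(n_estimators=10),
                                       X[selected + [col]], Y,
                                       cv=4, scoring='accuracy').mean()
                  for col in X.columns.difference(selected)}
        # Keep the feature that improves accuracy the most
        selected.append(max(scores, key=scores.get))
    return selected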
from mlxtend.feature_selection import SequentialFeatureSelector
from sklearn.ensemble import RandomForestClassifier
feature_selector = SequentialFeatureSelector(RandomForestClassifier(n_estimators=10),
                                             k_features=4,
                                             forward=True,
                                             scoring='accuracy',
                                             cv=4)
features=feature_selector.fit(np.array(X),Y)
forward_elimination_feature_names=list(X.columns[list(features.k_feature_idx_)])
forward_elimination_feature_names
['Glucose', 'BloodPressure', 'BMI', 'Age']
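If you want to look deeper than the final pick, mlxtend's selector also exposes the cross-validated score of the chosen subset and the best subset found at each step (attribute names as documented by mlxtend):
features.k_score_   # CV accuracy of the selected 4-feature subset
features.subsets_   # best subset and score recorded at each step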
forward_elimination_features_df=X[forward_elimination_feature_names]
forward_elimination_features_df.head()
| | Glucose | BloodPressure | BMI | Age |
|---|---|---|---|---|
| 0 | 148 | 72 | 33.6 | 50 |
| 1 | 85 | 66 | 26.6 | 31 |
| 2 | 183 | 64 | 23.3 | 32 |
| 3 | 89 | 66 | 28.1 | 21 |
| 4 | 137 | 40 | 43.1 | 33 |
Backward Selection
In this feature selection technique we start with all the features and remove one feature at a time, based on the performance of the classifier, until we reach the specified number of features; it is the mirror image of forward selection, so only forward=False changes in the code.
from mlxtend.feature_selection import SequentialFeatureSelector
from sklearn.ensemble import RandomForestClassifier
feature_selector = SequentialFeatureSelector(RandomForestClassifier(n_estimators=10),
                                             k_features=4,
                                             forward=False,
                                             scoring='accuracy',
                                             cv=4)
features=feature_selector.fit(np.array(X),Y)
backward_elimination_feature_names=list(X.columns[list(features.k_feature_idx_)])
backward_elimination_feature_names
['Glucose', 'BMI', 'DiabetesPedigreeFunction', 'Age']
backward_elimination_features_df=X[backward_elimination_feature_names]
backward_elimination_features_df.head()
| | Glucose | BMI | DiabetesPedigreeFunction | Age |
|---|---|---|---|---|
| 0 | 148 | 33.6 | 0.627 | 50 |
| 1 | 85 | 26.6 | 0.351 | 31 |
| 2 | 183 | 23.3 | 0.672 | 32 |
| 3 | 89 | 28.1 | 0.167 | 21 |
| 4 | 137 | 43.1 | 2.288 | 33 |
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
def buildmodel(X, Y, test_frac, model_name=''):
    # Hold out a test fraction, fit a logistic regression on the rest, and report test accuracy
    x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=test_frac)
    model = LogisticRegression(solver='liblinear').fit(x_train, y_train)
    y_pred = model.predict(x_test)
    print('Accuracy of the ' + model_name + ' based model is ', str(accuracy_score(y_test, y_pred)))
buildmodel(RFE_selected_features,Y,test_frac=.2,model_name='RFE')
Accuracy of the RFE based model is 0.7727272727272727
buildmodel(forward_elimination_features_df,Y,test_frac=.2,model_name='Forward Elimination')
Accuracy of the Forward Elimination based model is 0.6883116883116883
buildmodel(backward_elimination_features_df,Y,test_frac=.2,model_name='Backward Elimination')
Accuracy of the Backward Elimination based model is 0.8441558441558441
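Note that buildmodel draws a fresh random train/test split on every call (no random_state is fixed), so the accuracies above will change from run to run and are not a strictly fair comparison of the three subsets. As a quick sketch (not part of the original notebook), scoring all three subsets on the same cross-validation folds gives a more stable comparison:
from sklearn.model_selection import cross_val_score

# Evaluate each selected feature subset with identical 5-fold CV splits
for name, subset in [('RFE', RFE_selected_features),
                     ('Forward Selection', forward_elimination_features_df),
                     ('Backward Selection', backward_elimination_features_df)]:
    scores = cross_val_score(LogisticRegression(solver='liblinear'), subset, Y, cv=5)
    print(name, round(scores.mean(), 3))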
You can get the notebook used in this tutorial here, and the dataset here.
Thanks for reading!!