T103: Filter Method - Feature Selection Techniques in Machine Learning
Note: This is part of a series on Data Preprocessing in Machine Learning. You can check all the tutorials here: Embedded Method, Wrapper Method, Filter Method, Handling Multicollinearity.
In this tutorial we will see how to select features using the filter feature selection method.
Filter Methods
Filter methods apply a statistical measure to assign a score to each feature. We can then decide to keep or remove features based on those scores. These methods are often univariate: they consider each feature independently, or only with regard to the dependent variable.
In this tutorial we will cover the following approaches:
- Missing Value Ratio Threshold
- Variance Threshold
- $\chi^2$ (Chi-Square) Test
- ANOVA F-Test
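Before looking at each technique, here is a toy sketch of the general filter pattern: score every feature independently with a univariate measure, then keep the features whose score clears a threshold. The absolute Pearson correlation and the 0.5 cutoff used here are purely illustrative and are not part of the workflow below.
import pandas as pd
# Toy data: f1 tracks the target, f2 is constant
df = pd.DataFrame({'f1': [1, 2, 3, 4], 'f2': [5, 5, 5, 5], 'target': [0, 0, 1, 1]})
# Univariate score per feature: absolute correlation with the target
scores = df.drop('target', axis=1).apply(lambda col: abs(col.corr(df['target'])))
# Keep only the features whose score clears the (arbitrary) threshold
selected = scores[scores > 0.5].index.tolist()
print(scores)
print(selected)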
Missing Value Ratio Threshold
We will remove features whose fraction of missing values exceeds a threshold.
import pandas as pd
import numpy as np
diabetes = pd.read_csv('dataset/diabetes.csv')
diabetes.head()
| | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
---|---|---|---|---|---|---|---|---|---|
0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |
We know that the features below cannot be zero (e.g. a person's blood pressure cannot be 0), so we replace zeros with NaN in these features.
# Replace physically impossible zeros with NaN in the affected columns
cols_with_invalid_zeros = ['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI']
diabetes[cols_with_invalid_zeros] = diabetes[cols_with_invalid_zeros].replace(0, np.nan)
diabetes.isnull().sum()
Pregnancies 0
Glucose 5
BloodPressure 35
SkinThickness 227
Insulin 374
BMI 11
DiabetesPedigreeFunction 0
Age 0
Outcome 0
dtype: int64
Now let's see, for each feature, what percentage of its values is missing.
diabetes['Glucose'].isnull().sum()/len(diabetes)*100
0.6510416666666667
diabetes['BloodPressure'].isnull().sum()/len(diabetes)*100
4.557291666666666
diabetes['SkinThickness'].isnull().sum()/len(diabetes)*100
29.557291666666668
diabetes['Insulin'].isnull().sum()/len(diabetes)*100
48.69791666666667
diabetes['BMI'].isnull().sum()/len(diabetes)*100
1.4322916666666665
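Instead of calling this column by column, the same percentages can be computed for every feature in a single step:
# Percentage of missing values per column, computed in one step
missing_percentage = diabetes.isnull().mean() * 100
missing_percentage.sort_values(ascending=False)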
We can see that a large amount of data is missing in SkinThickness and Insulin.
diabetes_missing_value_threshold = diabetes.dropna(thresh=int(diabetes.shape[0] * .9), axis=1)
diabetes_missing_value_threshold
| | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
---|---|---|---|---|---|---|---|---|---|
0 | 6.0 | 148.0 | 72.0 | 35.0 | 218.937760 | 33.6 | 0.627 | 50.0 | 1 |
1 | 1.0 | 85.0 | 66.0 | 29.0 | 70.189298 | 26.6 | 0.351 | 31.0 | 0 |
2 | 8.0 | 183.0 | 64.0 | 29.0 | 269.968908 | 23.3 | 0.672 | 32.0 | 1 |
3 | 1.0 | 89.0 | 66.0 | 23.0 | 94.000000 | 28.1 | 0.167 | 21.0 | 0 |
4 | 0.0 | 137.0 | 40.0 | 35.0 | 168.000000 | 43.1 | 2.288 | 33.0 | 1 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
763 | 10.0 | 101.0 | 76.0 | 48.0 | 180.000000 | 32.9 | 0.171 | 63.0 | 0 |
764 | 2.0 | 122.0 | 70.0 | 27.0 | 158.815881 | 36.8 | 0.340 | 27.0 | 0 |
765 | 5.0 | 121.0 | 72.0 | 23.0 | 112.000000 | 26.2 | 0.245 | 30.0 | 0 |
766 | 1.0 | 126.0 | 60.0 | 29.0 | 173.820363 | 30.1 | 0.349 | 47.0 | 1 |
767 | 1.0 | 93.0 | 70.0 | 31.0 | 87.196731 | 30.4 | 0.315 | 23.0 | 0 |
768 rows × 9 columns
Here we keep only those features that have less than 10% missing data.
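For reference, dropna(thresh=..., axis=1) keeps a column only if it has at least thresh non-null values, so int(diabetes.shape[0] * .9) translates the 10% limit on missing data into a minimum count of non-null rows. A more explicit, roughly equivalent sketch of the same filter would be (diabetes_low_missing is just an illustrative name):
# Keep only columns whose share of missing values is at most 10%
max_missing_pct = 10
cols_to_keep = diabetes.columns[diabetes.isnull().mean() * 100 <= max_missing_pct]
diabetes_low_missing = diabetes[cols_to_keep]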
diabetes_missing_value_threshold_features = diabetes_missing_value_threshold.drop('Outcome',axis=1)
diabetes_missing_value_threshold_label= diabetes_missing_value_threshold['Outcome']
diabetes_missing_value_threshold_features,diabetes_missing_value_threshold_label
( Pregnancies Glucose BloodPressure SkinThickness Insulin BMI \
0 6.0 148.0 72.0 35.0 218.937760 33.6
1 1.0 85.0 66.0 29.0 70.189298 26.6
2 8.0 183.0 64.0 29.0 269.968908 23.3
3 1.0 89.0 66.0 23.0 94.000000 28.1
4 0.0 137.0 40.0 35.0 168.000000 43.1
.. ... ... ... ... ... ...
763 10.0 101.0 76.0 48.0 180.000000 32.9
764 2.0 122.0 70.0 27.0 158.815881 36.8
765 5.0 121.0 72.0 23.0 112.000000 26.2
766 1.0 126.0 60.0 29.0 173.820363 30.1
767 1.0 93.0 70.0 31.0 87.196731 30.4
DiabetesPedigreeFunction Age
0 0.627 50.0
1 0.351 31.0
2 0.672 32.0
3 0.167 21.0
4 2.288 33.0
.. ... ...
763 0.171 63.0
764 0.340 27.0
765 0.245 30.0
766 0.349 47.0
767 0.315 23.0
[768 rows x 8 columns], 0 1
1 0
2 1
3 0
4 1
..
763 0
764 0
765 0
766 1
767 0
Name: Outcome, Length: 768, dtype: int64)
Variance Threshold
If the variance is low or close to zero, then a feature is approximately constant and will not improve the performance of the model. In that case, it should be removed.
Variance will also be very low for a feature if only a handful of observations of that feature differ from a constant value.
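For intuition, a feature in which almost every observation takes the same value has a variance close to zero:
# A nearly constant toy feature: 99 identical values and a single different one
nearly_constant = pd.Series([1] * 99 + [2])
nearly_constant.var()   # 0.01 -- close to zero, so the feature carries almost no information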
diabetes = pd.read_csv('dataset/diabetes_cleaned.csv')
diabetes.head()
| | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
---|---|---|---|---|---|---|---|---|---|
0 | 6.0 | 148.0 | 72.0 | 35.0 | 218.937760 | 33.6 | 0.627 | 50.0 | 1 |
1 | 1.0 | 85.0 | 66.0 | 29.0 | 70.189298 | 26.6 | 0.351 | 31.0 | 0 |
2 | 8.0 | 183.0 | 64.0 | 29.0 | 269.968908 | 23.3 | 0.672 | 32.0 | 1 |
3 | 1.0 | 89.0 | 66.0 | 23.0 | 94.000000 | 28.1 | 0.167 | 21.0 | 0 |
4 | 0.0 | 137.0 | 40.0 | 35.0 | 168.000000 | 43.1 | 2.288 | 33.0 | 1 |
X=diabetes.drop('Outcome',axis=1)
Y=diabetes['Outcome']
X.var(axis=0)
Pregnancies 11.354056
Glucose 932.425376
BloodPressure 153.317842
SkinThickness 109.767160
Insulin 14107.703775
BMI 47.955463
DiabetesPedigreeFunction 0.109779
Age 138.303046
dtype: float64
We can see that the variance of DiabetesPedigreeFunction is low, which suggests it is (almost) constant and brings little information; this could be a justification to remove the DiabetesPedigreeFunction column. However, before drawing that conclusion we should scale the features, because they are on very different scales.
from sklearn.preprocessing import minmax_scale
X_scaled_df =pd.DataFrame(minmax_scale(X,feature_range=(0,10)),columns=X.columns)
We have used sklearn's min-max scaler here, rescaling every feature to the range 0 to 10.
X_scaled_df
| | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age |
---|---|---|---|---|---|---|---|---|
0 | 3.529412 | 6.709677 | 4.897959 | 3.043478 | 2.740295 | 3.149284 | 2.344150 | 4.833333 |
1 | 0.588235 | 2.645161 | 4.285714 | 2.391304 | 1.018185 | 1.717791 | 1.165670 | 1.666667 |
2 | 4.705882 | 8.967742 | 4.081633 | 2.391304 | 3.331099 | 1.042945 | 2.536294 | 1.833333 |
3 | 0.588235 | 2.903226 | 4.285714 | 1.739130 | 1.293850 | 2.024540 | 0.380017 | 0.000000 |
4 | 0.000000 | 6.000000 | 1.632653 | 3.043478 | 2.150572 | 5.092025 | 9.436379 | 2.000000 |
... | ... | ... | ... | ... | ... | ... | ... | ... |
763 | 5.882353 | 3.677419 | 5.306122 | 4.456522 | 2.289500 | 3.006135 | 0.397096 | 7.000000 |
764 | 1.176471 | 5.032258 | 4.693878 | 2.173913 | 2.044244 | 3.803681 | 1.118702 | 1.000000 |
765 | 2.941176 | 4.967742 | 4.897959 | 1.739130 | 1.502241 | 1.635992 | 0.713066 | 1.500000 |
766 | 0.588235 | 5.290323 | 3.673469 | 2.391304 | 2.217956 | 2.433538 | 1.157131 | 4.333333 |
767 | 0.588235 | 3.161290 | 4.693878 | 2.608696 | 1.215086 | 2.494888 | 1.011956 | 0.333333 |
768 rows × 8 columns
X_scaled_df.var()
Pregnancies 3.928739
Glucose 3.869637
BloodPressure 1.523548
SkinThickness 0.913109
Insulin 1.271218
BMI 2.041377
DiabetesPedigreeFunction 2.001447
Age 3.841751
dtype: float64
from sklearn.feature_selection import VarianceThreshold
select_features = VarianceThreshold(threshold=1.0)
X_variance_threshold_df=select_features.fit_transform(X_scaled_df)
X_variance_threshold_df
array([[3.52941176, 6.70967742, 4.89795918, ..., 3.14928425, 2.3441503 ,
4.83333333],
[0.58823529, 2.64516129, 4.28571429, ..., 1.71779141, 1.16567037,
1.66666667],
[4.70588235, 8.96774194, 4.08163265, ..., 1.04294479, 2.53629377,
1.83333333],
...,
[2.94117647, 4.96774194, 4.89795918, ..., 1.63599182, 0.71306576,
1.5 ],
[0.58823529, 5.29032258, 3.67346939, ..., 2.43353783, 1.15713066,
4.33333333],
[0.58823529, 3.16129032, 4.69387755, ..., 2.49488753, 1.01195559,
0.33333333]])
X_variance_threshold_df = pd.DataFrame(X_variance_threshold_df)
X_variance_threshold_df.head()
| | 0 | 1 | 2 | 3 | 4 | 5 | 6 |
---|---|---|---|---|---|---|---|
0 | 3.529412 | 6.709677 | 4.897959 | 2.740295 | 3.149284 | 2.344150 | 4.833333 |
1 | 0.588235 | 2.645161 | 4.285714 | 1.018185 | 1.717791 | 1.165670 | 1.666667 |
2 | 4.705882 | 8.967742 | 4.081633 | 3.331099 | 1.042945 | 2.536294 | 1.833333 |
3 | 0.588235 | 2.903226 | 4.285714 | 1.293850 | 2.024540 | 0.380017 | 0.000000 |
4 | 0.000000 | 6.000000 | 1.632653 | 2.150572 | 5.092025 | 9.436379 | 2.000000 |
def get_selected_features(raw_df, processed_df):
    # Match each column of the transformed DataFrame back to an original column by value equality
    selected_features = []
    for i in range(len(processed_df.columns)):
        for j in range(len(raw_df.columns)):
            if processed_df.iloc[:, i].equals(raw_df.iloc[:, j]):
                selected_features.append(raw_df.columns[j])
    return selected_features
selected_features= get_selected_features(X_scaled_df,X_variance_threshold_df)
selected_features
['Pregnancies',
'Glucose',
'BloodPressure',
'Insulin',
'BMI',
'DiabetesPedigreeFunction',
'Age']
We can see that the SkinThickness feature is not selected because its variance is below the threshold.
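As an alternative to the get_selected_features helper, scikit-learn selectors expose a get_support() method that returns a boolean mask over the input columns; this is usually the simpler way to recover the names of the retained features:
# Boolean mask of the columns kept by the VarianceThreshold selector
mask = select_features.get_support()
selected_features = X_scaled_df.columns[mask].tolist()
selected_features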
X_variance_threshold_df.columns=selected_features
selected_features_df = X_variance_threshold_df
selected_features_df
| | Pregnancies | Glucose | BloodPressure | Insulin | BMI | DiabetesPedigreeFunction | Age |
---|---|---|---|---|---|---|---|
0 | 3.529412 | 6.709677 | 4.897959 | 2.740295 | 3.149284 | 2.344150 | 4.833333 |
1 | 0.588235 | 2.645161 | 4.285714 | 1.018185 | 1.717791 | 1.165670 | 1.666667 |
2 | 4.705882 | 8.967742 | 4.081633 | 3.331099 | 1.042945 | 2.536294 | 1.833333 |
3 | 0.588235 | 2.903226 | 4.285714 | 1.293850 | 2.024540 | 0.380017 | 0.000000 |
4 | 0.000000 | 6.000000 | 1.632653 | 2.150572 | 5.092025 | 9.436379 | 2.000000 |
... | ... | ... | ... | ... | ... | ... | ... |
763 | 5.882353 | 3.677419 | 5.306122 | 2.289500 | 3.006135 | 0.397096 | 7.000000 |
764 | 1.176471 | 5.032258 | 4.693878 | 2.044244 | 3.803681 | 1.118702 | 1.000000 |
765 | 2.941176 | 4.967742 | 4.897959 | 1.502241 | 1.635992 | 0.713066 | 1.500000 |
766 | 0.588235 | 5.290323 | 3.673469 | 2.217956 | 2.433538 | 1.157131 | 4.333333 |
767 | 0.588235 | 3.161290 | 4.693878 | 1.215086 | 2.494888 | 1.011956 | 0.333333 |
768 rows × 7 columns
Before moving on, let's define a small helper that pairs each feature name with its score, so we can display the results of the next tests as a DataFrame.
def generate_feature_scores_df(X, Score):
    # Build a DataFrame pairing each feature name with its score
    feature_score = pd.DataFrame()
    for i in range(X.shape[1]):
        new = pd.DataFrame({"Features": X.columns[i], "Score": Score[i]}, index=[i])
        feature_score = pd.concat([feature_score, new])
    return feature_score
Chi-Square Test
Chi-square ($\chi^2$) is a measure of dependency between two variables. It gives us a goodness-of-fit measure because it quantifies how well the observed distribution of a feature fits the distribution we would expect if the feature and the target were independent.
Scikit-learn offers a feature selection estimator named SelectKBest, which selects the K best-scoring features according to a chosen statistical test.
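In its standard form the statistic compares observed counts $O_i$ with the counts $E_i$ expected under independence: $\chi^2 = \sum_i \frac{(O_i - E_i)^2}{E_i}$. A large value means the observed distribution deviates strongly from what independence would predict, i.e. the feature and the target are related.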
diabetes=pd.read_csv('dataset/diabetes.csv')
X=diabetes.drop('Outcome',axis=1)
Y=diabetes['Outcome']
from sklearn.feature_selection import chi2
from sklearn.feature_selection import SelectKBest
X=X.astype(np.float64)
chi2_test=SelectKBest(score_func=chi2,k=4)
chi2_model=chi2_test.fit(X,Y)
chi2_model.scores_
array([ 111.51969064, 1411.88704064, 17.60537322, 53.10803984,
2175.56527292, 127.66934333, 5.39268155, 181.30368904])
feature_score_df=generate_feature_scores_df(X,chi2_model.scores_)
feature_score_df
| | Features | Score |
---|---|---|
0 | Pregnancies | 111.519691 |
1 | Glucose | 1411.887041 |
2 | BloodPressure | 17.605373 |
3 | SkinThickness | 53.108040 |
4 | Insulin | 2175.565273 |
5 | BMI | 127.669343 |
6 | DiabetesPedigreeFunction | 5.392682 |
7 | Age | 181.303689 |
Here we can see each feature and its corresponding chi-square score.
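Besides the scores, the fitted selector also stores the corresponding p-values in chi2_model.pvalues_; they can be tabulated with the same helper (smaller p-values indicate a stronger dependence on the target):
# Chi-square p-values for each feature
chi2_pvalues_df = generate_feature_scores_df(X, chi2_model.pvalues_)
chi2_pvalues_df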
X_new=chi2_model.transform(X)
X_new=pd.DataFrame(X_new)
selected_features=get_selected_features(X,X_new)
selected_features
['Glucose', 'Insulin', 'BMI', 'Age']
chi2_best_features=X[selected_features]
chi2_best_features.head()
| | Glucose | Insulin | BMI | Age |
---|---|---|---|---|
0 | 148.0 | 0.0 | 33.6 | 50.0 |
1 | 85.0 | 0.0 | 26.6 | 31.0 |
2 | 183.0 | 0.0 | 23.3 | 32.0 |
3 | 89.0 | 94.0 | 28.1 | 21.0 |
4 | 137.0 | 168.0 | 43.1 | 33.0 |
ANOVA F-Test
The F-value score examines whether, when we group a numerical feature by the target vector, the means of the resulting groups are significantly different.
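Intuitively, the one-way ANOVA F-statistic is the ratio of the variability between the groups to the variability within the groups: $F = \frac{\text{between-group mean square}}{\text{within-group mean square}}$. A large F-value indicates that the group means differ more than chance alone would explain, so the feature is informative about the target.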
from sklearn.feature_selection import f_classif,SelectPercentile
Anova_test=SelectPercentile(f_classif,percentile=80)
Anova_model= Anova_test.fit(X,Y)
With percentile=80, we keep only the top 80% of features ranked by their F-score.
Anova_model.scores_
array([ 39.67022739, 213.16175218, 3.2569504 , 4.30438091,
13.28110753, 71.7720721 , 23.8713002 , 46.14061124])
feature_scores_df=generate_feature_scores_df(X,Anova_model.scores_)
feature_scores_df
| | Features | Score |
---|---|---|
0 | Pregnancies | 39.670227 |
1 | Glucose | 213.161752 |
2 | BloodPressure | 3.256950 |
3 | SkinThickness | 4.304381 |
4 | Insulin | 13.281108 |
5 | BMI | 71.772072 |
6 | DiabetesPedigreeFunction | 23.871300 |
7 | Age | 46.140611 |
X_new=Anova_model.transform(X)
X_new=pd.DataFrame(X_new)
selected_features=get_selected_features(X,X_new)
selected_features
['Pregnancies', 'Glucose', 'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age']
Anova_selected_feature_df=X[selected_features]
Anova_selected_feature_df.head()
| | Pregnancies | Glucose | Insulin | BMI | DiabetesPedigreeFunction | Age |
---|---|---|---|---|---|---|
0 | 6.0 | 148.0 | 0.0 | 33.6 | 0.627 | 50.0 |
1 | 1.0 | 85.0 | 0.0 | 26.6 | 0.351 | 31.0 |
2 | 8.0 | 183.0 | 0.0 | 23.3 | 0.672 | 32.0 |
3 | 1.0 | 89.0 | 94.0 | 28.1 | 0.167 | 21.0 |
4 | 0.0 | 137.0 | 168.0 | 43.1 | 2.288 | 33.0 |
Now let's compare how models built on each of these feature sets perform.
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
def buildmodel(X, Y, test_frac, model_name=''):
    # Split the data, fit a logistic regression model, and report test accuracy
    x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=test_frac)
    model = LogisticRegression(solver='liblinear').fit(x_train, y_train)
    y_pred = model.predict(x_test)
    print('Accuracy of the '+model_name+' based model is ', str(accuracy_score(y_test, y_pred)))
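Note that train_test_split draws a new random split on every call, so the accuracies reported below will vary from run to run. For a reproducible comparison one could fix the split; a minimal sketch (buildmodel_reproducible and seed are illustrative names, not part of the original notebook):
def buildmodel_reproducible(X, Y, test_frac, model_name='', seed=42):
    # Same as buildmodel, but with a fixed random_state so the split (and accuracy) is repeatable
    x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=test_frac, random_state=seed)
    model = LogisticRegression(solver='liblinear').fit(x_train, y_train)
    y_pred = model.predict(x_test)
    print('Accuracy of the '+model_name+' based model is ', str(accuracy_score(y_test, y_pred)))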
buildmodel(X=diabetes_missing_value_threshold_features,Y=diabetes_missing_value_threshold_label,test_frac=.2,model_name="Missing values threshold")
Accuracy of the Missing values threshold based model is 0.8116883116883117
buildmodel(X=selected_features_df,Y=Y,test_frac=.2,model_name="Variance threshold")
Accuracy of the Variance threshold based model is 0.8181818181818182
buildmodel(X=X,Y=Y,test_frac=.2,model_name="General Logistic Regression")
Accuracy of the General Logistic Regression based model is 0.7012987012987013
buildmodel(X=chi2_best_features,Y=Y,test_frac=.2,model_name="Chi2 based")
Accuracy of the Chi2 based based model is 0.7922077922077922
buildmodel(X=Anova_selected_feature_df,Y=Y,test_frac=.2,model_name="Anova F-test based")
Accuracy of the Anova F-test based based model is 0.8051948051948052
Comparing the results, the models built on the selected feature subsets achieved higher accuracy than the general logistic regression trained on all raw features, so these feature selection approaches clearly helped here.
You can get the notebook used in this tutorial here, and the dataset here.
Thanks for reading!