T103: Filter Method - Feature selection techniques in machine learning


Note: This is part of a series on Data Preprocessing in Machine Learning; you can check all the tutorials here: Embedded Method, Wrapper Method, Filter Method, Handling Multicollinearity.

In this tutorial we will see how to select features using the filter feature selection method.

Filter Methods

The filter method applies a statistical measure to assign a score to each feature. We can then decide to keep or remove features based on those scores. These methods are often univariate and consider each feature independently, or only with regard to the dependent variable.

In this tutorial we will cover the approaches below:

  1. Missing Value Ratio Threshold
  2. Variance Threshold
  3. Chi-Square ($\chi^2$) Test
  4. ANOVA F-Test

Missing Value Ratio Threshold

We will remove those features whose fraction of missing values exceeds a threshold.

import pandas as pd
import numpy as np
diabetes = pd.read_csv('dataset/diabetes.csv')
diabetes.head()
Pregnancies Glucose BloodPressure SkinThickness Insulin BMI DiabetesPedigreeFunction Age Outcome
0 6 148 72 35 0 33.6 0.627 50 1
1 1 85 66 29 0 26.6 0.351 31 0
2 8 183 64 0 0 23.3 0.672 32 1
3 1 89 66 23 94 28.1 0.167 21 0
4 0 137 40 35 168 43.1 2.288 33 1

We know that the features below cannot be zero (e.g. a person’s blood pressure cannot be 0), so we replace the zeros in these features with NaN.

# Zeros in these columns are physically impossible, so treat them as missing
cols_with_invalid_zeros = ['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI']
diabetes[cols_with_invalid_zeros] = diabetes[cols_with_invalid_zeros].replace(0, np.nan)
diabetes.isnull().sum()
Pregnancies                   0
Glucose                       5
BloodPressure                35
SkinThickness               227
Insulin                     374
BMI                          11
DiabetesPedigreeFunction      0
Age                           0
Outcome                       0
dtype: int64

Now let’s see, for each feature, what percentage of its values are missing.

diabetes['Glucose'].isnull().sum()/len(diabetes)*100
0.6510416666666667
diabetes['BloodPressure'].isnull().sum()/len(diabetes)*100
4.557291666666666
diabetes['SkinThickness'].isnull().sum()/len(diabetes)*100
29.557291666666668
diabetes['Insulin'].isnull().sum()/len(diabetes)*100
48.69791666666667
diabetes['BMI'].isnull().sum()/len(diabetes)*100
1.4322916666666665
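
As a quick aside (not part of the original notebook), pandas can compute the same percentages for all columns at once:

# Fraction of missing values per column, expressed as a percentage
diabetes.isnull().mean() * 100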

We can see that a large share of the data is missing in SkinThickness and Insulin.

# Keep only the columns that have at least 90% non-missing values
diabetes_missing_value_threshold = diabetes.dropna(thresh=int(diabetes.shape[0] * .9), axis=1)
diabetes_missing_value_threshold
Pregnancies Glucose BloodPressure SkinThickness Insulin BMI DiabetesPedigreeFunction Age Outcome
0 6.0 148.0 72.0 35.0 218.937760 33.6 0.627 50.0 1
1 1.0 85.0 66.0 29.0 70.189298 26.6 0.351 31.0 0
2 8.0 183.0 64.0 29.0 269.968908 23.3 0.672 32.0 1
3 1.0 89.0 66.0 23.0 94.000000 28.1 0.167 21.0 0
4 0.0 137.0 40.0 35.0 168.000000 43.1 2.288 33.0 1
... ... ... ... ... ... ... ... ... ...
763 10.0 101.0 76.0 48.0 180.000000 32.9 0.171 63.0 0
764 2.0 122.0 70.0 27.0 158.815881 36.8 0.340 27.0 0
765 5.0 121.0 72.0 23.0 112.000000 26.2 0.245 30.0 0
766 1.0 126.0 60.0 29.0 173.820363 30.1 0.349 47.0 1
767 1.0 93.0 70.0 31.0 87.196731 30.4 0.315 23.0 0

768 rows × 9 columns

Here we keep only those features that have less than 10% missing data: dropna with thresh=int(768 * 0.9) = 691 retains a column only if it has at least 691 non-missing values.
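
Equivalently (a small sketch, not the notebook’s code), the columns can be dropped based on their missing-value ratio directly:

# Keep only the columns whose share of missing values is at most 10%
missing_ratio = diabetes.isnull().mean()
diabetes_missing_value_threshold = diabetes[missing_ratio[missing_ratio <= 0.10].index]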

diabetes_missing_value_threshold_features = diabetes_missing_value_threshold.drop('Outcome',axis=1)
diabetes_missing_value_threshold_label= diabetes_missing_value_threshold['Outcome']
diabetes_missing_value_threshold_features,diabetes_missing_value_threshold_label
(     Pregnancies  Glucose  BloodPressure  SkinThickness     Insulin   BMI  \
 0            6.0    148.0           72.0           35.0  218.937760  33.6   
 1            1.0     85.0           66.0           29.0   70.189298  26.6   
 2            8.0    183.0           64.0           29.0  269.968908  23.3   
 3            1.0     89.0           66.0           23.0   94.000000  28.1   
 4            0.0    137.0           40.0           35.0  168.000000  43.1   
 ..           ...      ...            ...            ...         ...   ...   
 763         10.0    101.0           76.0           48.0  180.000000  32.9   
 764          2.0    122.0           70.0           27.0  158.815881  36.8   
 765          5.0    121.0           72.0           23.0  112.000000  26.2   
 766          1.0    126.0           60.0           29.0  173.820363  30.1   
 767          1.0     93.0           70.0           31.0   87.196731  30.4   
 
      DiabetesPedigreeFunction   Age  
 0                       0.627  50.0  
 1                       0.351  31.0  
 2                       0.672  32.0  
 3                       0.167  21.0  
 4                       2.288  33.0  
 ..                        ...   ...  
 763                     0.171  63.0  
 764                     0.340  27.0  
 765                     0.245  30.0  
 766                     0.349  47.0  
 767                     0.315  23.0  
 
 [768 rows x 8 columns], 0      1
 1      0
 2      1
 3      0
 4      1
       ..
 763    0
 764    0
 765    0
 766    1
 767    0
 Name: Outcome, Length: 768, dtype: int64)

Variance Threshold

If the variance is low or close to zero, then a feature is approximately constant and will not improve the performance of the model. In that case, it should be removed.

Variance will also be very low for a feature if only a handful of observations of that feature differ from a constant value.
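
As a toy illustration (hypothetical data, not from the diabetes dataset), a column in which only two out of one hundred observations differ from a constant value has a variance close to zero:

# 98 zeros and two ones -> sample variance of about 0.0198
almost_constant = pd.Series([0] * 98 + [1, 1])
almost_constant.var()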

diabetes = pd.read_csv('dataset/diabetes_cleaned.csv')
diabetes.head()
Pregnancies Glucose BloodPressure SkinThickness Insulin BMI DiabetesPedigreeFunction Age Outcome
0 6.0 148.0 72.0 35.0 218.937760 33.6 0.627 50.0 1
1 1.0 85.0 66.0 29.0 70.189298 26.6 0.351 31.0 0
2 8.0 183.0 64.0 29.0 269.968908 23.3 0.672 32.0 1
3 1.0 89.0 66.0 23.0 94.000000 28.1 0.167 21.0 0
4 0.0 137.0 40.0 35.0 168.000000 43.1 2.288 33.0 1
X=diabetes.drop('Outcome',axis=1)
Y=diabetes['Outcome']
X.var(axis=0)
Pregnancies                    11.354056
Glucose                       932.425376
BloodPressure                 153.317842
SkinThickness                 109.767160
Insulin                     14107.703775
BMI                            47.955463
DiabetesPedigreeFunction        0.109779
Age                           138.303046
dtype: float64

We can see that the variance of DiabetesPedigreeFunction is low, so it brings little information because it is (almost) constant. This could justify removing the DiabetesPedigreeFunction column, but before drawing that conclusion we should scale the features, because they are on very different scales.
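
The reason scaling matters is that variance is scale-dependent: for a constant $a$, $Var(aX) = a^2 Var(X)$, so a feature measured in large units (such as Insulin) will automatically show a much larger variance than one measured in small units (such as DiabetesPedigreeFunction).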

from sklearn.preprocessing import minmax_scale
X_scaled_df =pd.DataFrame(minmax_scale(X,feature_range=(0,10)),columns=X.columns)

We have used scikit-learn’s min-max scaler here, rescaling every feature to the same 0–10 range so that their variances become comparable.

X_scaled_df
Pregnancies Glucose BloodPressure SkinThickness Insulin BMI DiabetesPedigreeFunction Age
0 3.529412 6.709677 4.897959 3.043478 2.740295 3.149284 2.344150 4.833333
1 0.588235 2.645161 4.285714 2.391304 1.018185 1.717791 1.165670 1.666667
2 4.705882 8.967742 4.081633 2.391304 3.331099 1.042945 2.536294 1.833333
3 0.588235 2.903226 4.285714 1.739130 1.293850 2.024540 0.380017 0.000000
4 0.000000 6.000000 1.632653 3.043478 2.150572 5.092025 9.436379 2.000000
... ... ... ... ... ... ... ... ...
763 5.882353 3.677419 5.306122 4.456522 2.289500 3.006135 0.397096 7.000000
764 1.176471 5.032258 4.693878 2.173913 2.044244 3.803681 1.118702 1.000000
765 2.941176 4.967742 4.897959 1.739130 1.502241 1.635992 0.713066 1.500000
766 0.588235 5.290323 3.673469 2.391304 2.217956 2.433538 1.157131 4.333333
767 0.588235 3.161290 4.693878 2.608696 1.215086 2.494888 1.011956 0.333333

768 rows × 8 columns

X_scaled_df.var()
Pregnancies                 3.928739
Glucose                     3.869637
BloodPressure               1.523548
SkinThickness               0.913109
Insulin                     1.271218
BMI                         2.041377
DiabetesPedigreeFunction    2.001447
Age                         3.841751
dtype: float64
from sklearn.feature_selection import VarianceThreshold

select_features = VarianceThreshold(threshold=1.0)
X_variance_threshold_df=select_features.fit_transform(X_scaled_df)
X_variance_threshold_df
array([[3.52941176, 6.70967742, 4.89795918, ..., 3.14928425, 2.3441503 ,
        4.83333333],
       [0.58823529, 2.64516129, 4.28571429, ..., 1.71779141, 1.16567037,
        1.66666667],
       [4.70588235, 8.96774194, 4.08163265, ..., 1.04294479, 2.53629377,
        1.83333333],
       ...,
       [2.94117647, 4.96774194, 4.89795918, ..., 1.63599182, 0.71306576,
        1.5       ],
       [0.58823529, 5.29032258, 3.67346939, ..., 2.43353783, 1.15713066,
        4.33333333],
       [0.58823529, 3.16129032, 4.69387755, ..., 2.49488753, 1.01195559,
        0.33333333]])
# Wrap the NumPy array returned by fit_transform back into a DataFrame
X_variance_threshold_df = pd.DataFrame(X_variance_threshold_df)
X_variance_threshold_df.head()
0 1 2 3 4 5 6
0 3.529412 6.709677 4.897959 2.740295 3.149284 2.344150 4.833333
1 0.588235 2.645161 4.285714 1.018185 1.717791 1.165670 1.666667
2 4.705882 8.967742 4.081633 3.331099 1.042945 2.536294 1.833333
3 0.588235 2.903226 4.285714 1.293850 2.024540 0.380017 0.000000
4 0.000000 6.000000 1.632653 2.150572 5.092025 9.436379 2.000000
def get_selected_features(raw_df, processed_df):
    # Recover the original column names by matching each column of the
    # transformed DataFrame against the columns of the raw DataFrame
    selected_features = []
    for i in range(len(processed_df.columns)):
        for j in range(len(raw_df.columns)):
            if processed_df.iloc[:, i].equals(raw_df.iloc[:, j]):
                selected_features.append(raw_df.columns[j])
    return selected_features
selected_features= get_selected_features(X_scaled_df,X_variance_threshold_df)
selected_features
['Pregnancies',
 'Glucose',
 'BloodPressure',
 'Insulin',
 'BMI',
 'DiabetesPedigreeFunction',
 'Age']

We can see that the SkinThickness feature is not selected, as its variance is below the 1.0 threshold.
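
As an alternative to the column-matching helper above (a sketch using scikit-learn’s own API rather than the notebook’s approach), the fitted selector exposes a boolean mask over the input columns:

# Boolean mask aligned with X_scaled_df.columns; True for the columns that were kept
mask = select_features.get_support()
X_scaled_df.columns[mask].tolist()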

X_variance_threshold_df.columns=selected_features
selected_features_df = X_variance_threshold_df
selected_features_df
Pregnancies Glucose BloodPressure Insulin BMI DiabetesPedigreeFunction Age
0 3.529412 6.709677 4.897959 2.740295 3.149284 2.344150 4.833333
1 0.588235 2.645161 4.285714 1.018185 1.717791 1.165670 1.666667
2 4.705882 8.967742 4.081633 3.331099 1.042945 2.536294 1.833333
3 0.588235 2.903226 4.285714 1.293850 2.024540 0.380017 0.000000
4 0.000000 6.000000 1.632653 2.150572 5.092025 9.436379 2.000000
... ... ... ... ... ... ... ...
763 5.882353 3.677419 5.306122 2.289500 3.006135 0.397096 7.000000
764 1.176471 5.032258 4.693878 2.044244 3.803681 1.118702 1.000000
765 2.941176 4.967742 4.897959 1.502241 1.635992 0.713066 1.500000
766 0.588235 5.290323 3.673469 2.217956 2.433538 1.157131 4.333333
767 0.588235 3.161290 4.693878 1.215086 2.494888 1.011956 0.333333

768 rows × 7 columns

Before moving on to the statistical tests, let’s define a small helper that pairs each feature name with its score; we will reuse it for the Chi-Square and ANOVA tests below.

def generate_feature_scores_df(X, Score):
    # Build a two-column DataFrame mapping each feature name to its score
    feature_score = pd.DataFrame()
    for i in range(X.shape[1]):
        new = pd.DataFrame({"Features": X.columns[i], "Score": Score[i]}, index=[i])
        feature_score = pd.concat([feature_score, new])
    return feature_score
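
The same helper can also be written as a one-line constructor (an equivalent sketch, assuming the scores array is aligned with X.columns):

def generate_feature_scores_df(X, scores):
    # One row per feature: its name and its score
    return pd.DataFrame({"Features": X.columns, "Score": scores})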

Chi-Square Test

The Chi-square ($\chi^2$) statistic is a measure of dependency between two variables. It gives us a goodness-of-fit measure because it quantifies how well the observed distribution of a feature fits the distribution we would expect if the feature and the target were independent.
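
Concretely (the standard definition, where $O_i$ are the observed counts and $E_i$ the counts expected under independence), $\chi^2 = \sum_i \frac{(O_i - E_i)^2}{E_i}$; a large value means the feature’s distribution deviates strongly from what independence would predict, i.e. the feature carries information about the target. Note that scikit-learn’s chi2 scorer expects non-negative feature values, which holds for this dataset.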

Scikit-Learn offers a feature selection transformer named SelectKBest, which selects the K best features according to a given statistical test.

diabetes=pd.read_csv('dataset/diabetes.csv')
X=diabetes.drop('Outcome',axis=1)
Y=diabetes['Outcome']
from sklearn.feature_selection import chi2
from sklearn.feature_selection import SelectKBest
X=X.astype(np.float64)
chi2_test=SelectKBest(score_func=chi2,k=4)
chi2_model=chi2_test.fit(X,Y)
chi2_model.scores_
array([ 111.51969064, 1411.88704064,   17.60537322,   53.10803984,
       2175.56527292,  127.66934333,    5.39268155,  181.30368904])
feature_score_df=generate_feature_scores_df(X,chi2_model.scores_)
feature_score_df
Features Score
0 Pregnancies 111.519691
1 Glucose 1411.887041
2 BloodPressure 17.605373
3 SkinThickness 53.108040
4 Insulin 2175.565273
5 BMI 127.669343
6 DiabetesPedigreeFunction 5.392682
7 Age 181.303689

Here we can see each feature and its corresponding chi-square score; Insulin and Glucose have by far the highest scores.

X_new=chi2_model.transform(X)
X_new=pd.DataFrame(X_new)
selected_features=get_selected_features(X,X_new)
selected_features
['Glucose', 'Insulin', 'BMI', 'Age']
chi2_best_features=X[selected_features]
chi2_best_features.head()
Glucose Insulin BMI Age
0 148.0 0.0 33.6 50.0
1 85.0 0.0 26.6 31.0
2 183.0 0.0 23.3 32.0
3 89.0 94.0 28.1 21.0
4 137.0 168.0 43.1 33.0

ANOVA F-Test

The ANOVA F-value examines whether, when we group a numerical feature by the target vector, the means of the resulting groups are significantly different.
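
Conceptually (standard one-way ANOVA), the score for each feature is the ratio $F = \frac{\text{variance between the group means}}{\text{variance within the groups}}$; a large F-value indicates that the feature separates the classes well.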

from sklearn.feature_selection import f_classif,SelectPercentile
Anova_test=SelectPercentile(f_classif,percentile=80)
Anova_model= Anova_test.fit(X,Y)

So we will be selecting only the top 80% of the features based on their F-scores, which here amounts to 6 of the 8 features.

Anova_model.scores_
array([ 39.67022739, 213.16175218,   3.2569504 ,   4.30438091,
        13.28110753,  71.7720721 ,  23.8713002 ,  46.14061124])
feature_scores_df=generate_feature_scores_df(X,Anova_model.scores_)
feature_scores_df
Features Score
0 Pregnancies 39.670227
1 Glucose 213.161752
2 BloodPressure 3.256950
3 SkinThickness 4.304381
4 Insulin 13.281108
5 BMI 71.772072
6 DiabetesPedigreeFunction 23.871300
7 Age 46.140611
X_new=Anova_model.transform(X)
X_new=pd.DataFrame(X_new)
selected_features=get_selected_features(X,X_new)

selected_features
['Pregnancies', 'Glucose', 'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age']
Anova_selected_feature_df=X[selected_features]
Anova_selected_feature_df.head()
Pregnancies Glucose Insulin BMI DiabetesPedigreeFunction Age
0 6.0 148.0 0.0 33.6 0.627 50.0
1 1.0 85.0 0.0 26.6 0.351 31.0
2 8.0 183.0 0.0 23.3 0.672 32.0
3 1.0 89.0 94.0 28.1 0.167 21.0
4 0.0 137.0 168.0 43.1 2.288 33.0

Now let’s compare these approaches by training a logistic regression model on each selected feature set.

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
def buildmodel(X, Y, test_frac, model_name=''):
    # Split the data, fit a logistic regression model and report test accuracy
    x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=test_frac)
    model = LogisticRegression(solver='liblinear').fit(x_train, y_train)
    y_pred = model.predict(x_test)
    print('Accuracy of the '+model_name+' based model is ', str(accuracy_score(y_test, y_pred)))
buildmodel(X=diabetes_missing_value_threshold_features,Y=diabetes_missing_value_threshold_label,test_frac=.2,model_name="Missing values threshold")
Accuracy of the Missing values threshold based model is  0.8116883116883117
buildmodel(X=selected_features_df,Y=Y,test_frac=.2,model_name="Variance threshold")
Accuracy of the Variance threshold based model is  0.8181818181818182
buildmodel(X=X,Y=Y,test_frac=.2,model_name="General Logistic Regression")
Accuracy of the General Logistic Regression based model is  0.7012987012987013
buildmodel(X=chi2_best_features,Y=Y,test_frac=.2,model_name="Chi2 based")
Accuracy of the Chi2 based based model is  0.7922077922077922
buildmodel(X=Anova_selected_feature_df,Y=Y,test_frac=.2,model_name="Anova F-test based")
Accuracy of the Anova F-test based based model is  0.8051948051948052
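
Keep in mind that train_test_split draws a new random split on every call, so these accuracies will vary between runs. For a more controlled comparison you could fix the seed (the random_state value below is an arbitrary choice, not from the original notebook):

x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=42)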

We can see that, on these splits, accuracy improved for each of the feature selection approaches compared with the general logistic regression model trained on all raw features.

You can get the notebook used in this tutorial here and the dataset here.

Thanks for reading!
