T103: Filter Method - Feature selection techniques in machine learning


Note: This is part of a series on Data Preprocessing in Machine Learning; you can check all the tutorials here: Embedded Method, Wrapper Method, Filter Method, Handling Multicollinearity.

In this tutorial we will see how to select features using the filter feature selection method.

Filter Methods

The filter method applies a statistical measure to assign a score to each feature. We can then decide to keep or remove features based on those scores. These methods are often univariate and consider each feature independently, or only with regard to the dependent variable.

In this tutorial we will cover the approaches below:

  1. Missing Value Ratio Threshold
  2. Variance Threshold
  3. Chi-Square ($\chi^2$) Test
  4. ANOVA F-Test

Missing Value Ratio Threshold

We will remove those features whose fraction of missing values exceeds a threshold.

import pandas as pd
import numpy as np
diabetes = pd.read_csv('dataset/diabetes.csv')
diabetes.head()
Pregnancies Glucose BloodPressure SkinThickness Insulin BMI DiabetesPedigreeFunction Age Outcome
0 6 148 72 35 0 33.6 0.627 50 1
1 1 85 66 29 0 26.6 0.351 31 0
2 8 183 64 0 0 23.3 0.672 32 1
3 1 89 66 23 94 28.1 0.167 21 0
4 0 137 40 35 168 43.1 2.288 33 1

We know that the features below cannot be zero (e.g. a person’s blood pressure cannot be 0), so we replace the zeros in these features with NaN.

# Zeros in these columns are physically impossible, so treat them as missing
cols_with_invalid_zeros = ['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI']
diabetes[cols_with_invalid_zeros] = diabetes[cols_with_invalid_zeros].replace(0, np.nan)
diabetes.isnull().sum()
Pregnancies                   0
Glucose                       5
BloodPressure                35
SkinThickness               227
Insulin                     374
BMI                          11
DiabetesPedigreeFunction      0
Age                           0
Outcome                       0
dtype: int64

Now let’s see, for each feature, what percentage of its values are missing.

diabetes['Glucose'].isnull().sum()/len(diabetes)*100
0.6510416666666667
diabetes['BloodPressure'].isnull().sum()/len(diabetes)*100
4.557291666666666
diabetes['SkinThickness'].isnull().sum()/len(diabetes)*100
29.557291666666668
diabetes['Insulin'].isnull().sum()/len(diabetes)*100
48.69791666666667
diabetes['BMI'].isnull().sum()/len(diabetes)*100
1.4322916666666665
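
As a quick aside (not part of the original notebook), pandas can compute the same percentages for all columns at once:

# Fraction of missing values per column, expressed as a percentage
diabetes.isnull().mean() * 100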

We can see that a large share of the data is missing in SkinThickness and Insulin.

# Keep only the columns that have at least 90% non-missing values
diabetes_missing_value_threshold = diabetes.dropna(thresh=int(diabetes.shape[0] * .9), axis=1)
diabetes_missing_value_threshold
Pregnancies Glucose BloodPressure SkinThickness Insulin BMI DiabetesPedigreeFunction Age Outcome
0 6.0 148.0 72.0 35.0 218.937760 33.6 0.627 50.0 1
1 1.0 85.0 66.0 29.0 70.189298 26.6 0.351 31.0 0
2 8.0 183.0 64.0 29.0 269.968908 23.3 0.672 32.0 1
3 1.0 89.0 66.0 23.0 94.000000 28.1 0.167 21.0 0
4 0.0 137.0 40.0 35.0 168.000000 43.1 2.288 33.0 1
... ... ... ... ... ... ... ... ... ...
763 10.0 101.0 76.0 48.0 180.000000 32.9 0.171 63.0 0
764 2.0 122.0 70.0 27.0 158.815881 36.8 0.340 27.0 0
765 5.0 121.0 72.0 23.0 112.000000 26.2 0.245 30.0 0
766 1.0 126.0 60.0 29.0 173.820363 30.1 0.349 47.0 1
767 1.0 93.0 70.0 31.0 87.196731 30.4 0.315 23.0 0

768 rows × 9 columns

Here we keep only those features that have less than 10% missing data: dropna with thresh=int(768 * 0.9) = 691 retains a column only if it has at least 691 non-missing values.
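
Equivalently (a small sketch, not the notebook’s code), the columns can be dropped based on their missing-value ratio directly:

# Keep only the columns whose share of missing values is at most 10%
missing_ratio = diabetes.isnull().mean()
diabetes_missing_value_threshold = diabetes[missing_ratio[missing_ratio <= 0.10].index]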

diabetes_missing_value_threshold_features = diabetes_missing_value_threshold.drop('Outcome',axis=1)
diabetes_missing_value_threshold_label= diabetes_missing_value_threshold['Outcome']
diabetes_missing_value_threshold_features,diabetes_missing_value_threshold_label
(     Pregnancies  Glucose  BloodPressure  SkinThickness     Insulin   BMI  \
 0            6.0    148.0           72.0           35.0  218.937760  33.6   
 1            1.0     85.0           66.0           29.0   70.189298  26.6   
 2            8.0    183.0           64.0           29.0  269.968908  23.3   
 3            1.0     89.0           66.0           23.0   94.000000  28.1   
 4            0.0    137.0           40.0           35.0  168.000000  43.1   
 ..           ...      ...            ...            ...         ...   ...   
 763         10.0    101.0           76.0           48.0  180.000000  32.9   
 764          2.0    122.0           70.0           27.0  158.815881  36.8   
 765          5.0    121.0           72.0           23.0  112.000000  26.2   
 766          1.0    126.0           60.0           29.0  173.820363  30.1   
 767          1.0     93.0           70.0           31.0   87.196731  30.4   
 
      DiabetesPedigreeFunction   Age  
 0                       0.627  50.0  
 1                       0.351  31.0  
 2                       0.672  32.0  
 3                       0.167  21.0  
 4                       2.288  33.0  
 ..                        ...   ...  
 763                     0.171  63.0  
 764                     0.340  27.0  
 765                     0.245  30.0  
 766                     0.349  47.0  
 767                     0.315  23.0  
 
 [768 rows x 8 columns], 0      1
 1      0
 2      1
 3      0
 4      1
       ..
 763    0
 764    0
 765    0
 766    1
 767    0
 Name: Outcome, Length: 768, dtype: int64)

Variance Threshold

If the variance is low or close to zero, then a feature is approximately constant and will not improve the performance of the model. In that case, it should be removed.

Variance will also be very low for a feature if only a handful of observations of that feature differ from a constant value.
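
As a toy illustration (hypothetical data, not from the diabetes dataset), a column in which only two out of one hundred observations differ from a constant value has a variance close to zero:

# 98 zeros and two ones -> sample variance of about 0.0198
almost_constant = pd.Series([0] * 98 + [1, 1])
almost_constant.var()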

diabetes = pd.read_csv('dataset/diabetes_cleaned.csv')
diabetes.head()
Pregnancies Glucose BloodPressure SkinThickness Insulin BMI DiabetesPedigreeFunction Age Outcome
0 6.0 148.0 72.0 35.0 218.937760 33.6 0.627 50.0 1
1 1.0 85.0 66.0 29.0 70.189298 26.6 0.351 31.0 0
2 8.0 183.0 64.0 29.0 269.968908 23.3 0.672 32.0 1
3 1.0 89.0 66.0 23.0 94.000000 28.1 0.167 21.0 0
4 0.0 137.0 40.0 35.0 168.000000 43.1 2.288 33.0 1
X=diabetes.drop('Outcome',axis=1)
Y=diabetes['Outcome']
X.var(axis=0)
Pregnancies                    11.354056
Glucose                       932.425376
BloodPressure                 153.317842
SkinThickness                 109.767160
Insulin                     14107.703775
BMI                            47.955463
DiabetesPedigreeFunction        0.109779
Age                           138.303046
dtype: float64

We can see that the variance of DiabetesPedigreeFunction is low, so it brings little information because it is (almost) constant. This could justify removing the DiabetesPedigreeFunction column, but before drawing that conclusion we should scale the features, because they are on very different scales.
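
The reason scaling matters is that variance is scale-dependent: for a constant $a$, $Var(aX) = a^2 Var(X)$, so a feature measured in large units (such as Insulin) will automatically show a much larger variance than one measured in small units (such as DiabetesPedigreeFunction).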

from sklearn.preprocessing import minmax_scale
X_scaled_df =pd.DataFrame(minmax_scale(X,feature_range=(0,10)),columns=X.columns)

We have used scikit-learn’s min-max scaler here, rescaling every feature to the same 0–10 range so that their variances become comparable.

X_scaled_df
Pregnancies Glucose BloodPressure SkinThickness Insulin BMI DiabetesPedigreeFunction Age
0 3.529412 6.709677 4.897959 3.043478 2.740295 3.149284 2.344150 4.833333
1 0.588235 2.645161 4.285714 2.391304 1.018185 1.717791 1.165670 1.666667
2 4.705882 8.967742 4.081633 2.391304 3.331099 1.042945 2.536294 1.833333
3 0.588235 2.903226 4.285714 1.739130 1.293850 2.024540 0.380017 0.000000
4 0.000000 6.000000 1.632653 3.043478 2.150572 5.092025 9.436379 2.000000
... ... ... ... ... ... ... ... ...
763 5.882353 3.677419 5.306122 4.456522 2.289500 3.006135 0.397096 7.000000
764 1.176471 5.032258 4.693878 2.173913 2.044244 3.803681 1.118702 1.000000
765 2.941176 4.967742 4.897959 1.739130 1.502241 1.635992 0.713066 1.500000
766 0.588235 5.290323 3.673469 2.391304 2.217956 2.433538 1.157131 4.333333
767 0.588235 3.161290 4.693878 2.608696 1.215086 2.494888 1.011956 0.333333

768 rows × 8 columns

X_scaled_df.var()
Pregnancies                 3.928739
Glucose                     3.869637
BloodPressure               1.523548
SkinThickness               0.913109
Insulin                     1.271218
BMI                         2.041377
DiabetesPedigreeFunction    2.001447
Age                         3.841751
dtype: float64
from sklearn.feature_selection import VarianceThreshold

select_features = VarianceThreshold(threshold=1.0)
X_variance_threshold_df=select_features.fit_transform(X_scaled_df)
X_variance_threshold_df
array([[3.52941176, 6.70967742, 4.89795918, ..., 3.14928425, 2.3441503 ,
        4.83333333],
       [0.58823529, 2.64516129, 4.28571429, ..., 1.71779141, 1.16567037,
        1.66666667],
       [4.70588235, 8.96774194, 4.08163265, ..., 1.04294479, 2.53629377,
        1.83333333],
       ...,
       [2.94117647, 4.96774194, 4.89795918, ..., 1.63599182, 0.71306576,
        1.5       ],
       [0.58823529, 5.29032258, 3.67346939, ..., 2.43353783, 1.15713066,
        4.33333333],
       [0.58823529, 3.16129032, 4.69387755, ..., 2.49488753, 1.01195559,
        0.33333333]])
# Wrap the NumPy array returned by fit_transform back into a DataFrame
X_variance_threshold_df = pd.DataFrame(X_variance_threshold_df)
X_variance_threshold_df.head()
0 1 2 3 4 5 6
0 3.529412 6.709677 4.897959 2.740295 3.149284 2.344150 4.833333
1 0.588235 2.645161 4.285714 1.018185 1.717791 1.165670 1.666667
2 4.705882 8.967742 4.081633 3.331099 1.042945 2.536294 1.833333
3 0.588235 2.903226 4.285714 1.293850 2.024540 0.380017 0.000000
4 0.000000 6.000000 1.632653 2.150572 5.092025 9.436379 2.000000
def get_selected_features(raw_df, processed_df):
    # Recover the original column names by matching each column of the
    # transformed DataFrame against the columns of the raw DataFrame
    selected_features = []
    for i in range(len(processed_df.columns)):
        for j in range(len(raw_df.columns)):
            if processed_df.iloc[:, i].equals(raw_df.iloc[:, j]):
                selected_features.append(raw_df.columns[j])
    return selected_features
selected_features= get_selected_features(X_scaled_df,X_variance_threshold_df)
selected_features
['Pregnancies',
 'Glucose',
 'BloodPressure',
 'Insulin',
 'BMI',
 'DiabetesPedigreeFunction',
 'Age']

We can see that the SkinThickness feature is not selected, as its variance is below the 1.0 threshold.
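
As an alternative to the column-matching helper above (a sketch using scikit-learn’s own API rather than the notebook’s approach), the fitted selector exposes a boolean mask over the input columns:

# Boolean mask aligned with X_scaled_df.columns; True for the columns that were kept
mask = select_features.get_support()
X_scaled_df.columns[mask].tolist()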

X_variance_threshold_df.columns=selected_features
selected_features_df = X_variance_threshold_df
selected_features_df
Pregnancies Glucose BloodPressure Insulin BMI DiabetesPedigreeFunction Age
0 3.529412 6.709677 4.897959 2.740295 3.149284 2.344150 4.833333
1 0.588235 2.645161 4.285714 1.018185 1.717791 1.165670 1.666667
2 4.705882 8.967742 4.081633 3.331099 1.042945 2.536294 1.833333
3 0.588235 2.903226 4.285714 1.293850 2.024540 0.380017 0.000000
4 0.000000 6.000000 1.632653 2.150572 5.092025 9.436379 2.000000
... ... ... ... ... ... ... ...
763 5.882353 3.677419 5.306122 2.289500 3.006135 0.397096 7.000000
764 1.176471 5.032258 4.693878 2.044244 3.803681 1.118702 1.000000
765 2.941176 4.967742 4.897959 1.502241 1.635992 0.713066 1.500000
766 0.588235 5.290323 3.673469 2.217956 2.433538 1.157131 4.333333
767 0.588235 3.161290 4.693878 1.215086 2.494888 1.011956 0.333333

768 rows × 7 columns

Before moving on to the statistical tests, let’s define a small helper that pairs each feature name with its score; we will reuse it for the Chi-Square and ANOVA tests below.

def generate_feature_scores_df(X, Score):
    # Build a two-column DataFrame mapping each feature name to its score
    feature_score = pd.DataFrame()
    for i in range(X.shape[1]):
        new = pd.DataFrame({"Features": X.columns[i], "Score": Score[i]}, index=[i])
        feature_score = pd.concat([feature_score, new])
    return feature_score
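
The same helper can also be written as a one-line constructor (an equivalent sketch, assuming the scores array is aligned with X.columns):

def generate_feature_scores_df(X, scores):
    # One row per feature: its name and its score
    return pd.DataFrame({"Features": X.columns, "Score": scores})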

Chi-Square Test

The Chi-square ($\chi^2$) statistic is a measure of dependency between two variables. It gives us a goodness-of-fit measure because it quantifies how well the observed distribution of a feature fits the distribution we would expect if the feature and the target were independent.
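
Concretely (the standard definition, where $O_i$ are the observed counts and $E_i$ the counts expected under independence), $\chi^2 = \sum_i \frac{(O_i - E_i)^2}{E_i}$; a large value means the feature’s distribution deviates strongly from what independence would predict, i.e. the feature carries information about the target. Note that scikit-learn’s chi2 scorer expects non-negative feature values, which holds for this dataset.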

Scikit-Learn offers a feature selection transformer named SelectKBest, which selects the K best features according to a given statistical test.

diabetes=pd.read_csv('dataset/diabetes.csv')
X=diabetes.drop('Outcome',axis=1)
Y=diabetes['Outcome']
from sklearn.feature_selection import chi2
from sklearn.feature_selection import SelectKBest
X=X.astype(np.float64)
chi2_test=SelectKBest(score_func=chi2,k=4)
chi2_model=chi2_test.fit(X,Y)
chi2_model.scores_
array([ 111.51969064, 1411.88704064,   17.60537322,   53.10803984,
       2175.56527292,  127.66934333,    5.39268155,  181.30368904])
feature_score_df=generate_feature_scores_df(X,chi2_model.scores_)
feature_score_df
Features Score
0 Pregnancies 111.519691
1 Glucose 1411.887041
2 BloodPressure 17.605373
3 SkinThickness 53.108040
4 Insulin 2175.565273
5 BMI 127.669343
6 DiabetesPedigreeFunction 5.392682
7 Age 181.303689

Here we can see each feature and its corresponding chi-square score; Insulin and Glucose have by far the highest scores.

X_new=chi2_model.transform(X)
X_new=pd.DataFrame(X_new)
selected_features=get_selected_features(X,X_new)
selected_features
['Glucose', 'Insulin', 'BMI', 'Age']
chi2_best_features=X[selected_features]
chi2_best_features.head()
Glucose Insulin BMI Age
0 148.0 0.0 33.6 50.0
1 85.0 0.0 26.6 31.0
2 183.0 0.0 23.3 32.0
3 89.0 94.0 28.1 21.0
4 137.0 168.0 43.1 33.0

ANOVA F-Test

The ANOVA F-value examines whether, when we group a numerical feature by the target vector, the means of the resulting groups are significantly different.
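
Conceptually (standard one-way ANOVA), the score for each feature is the ratio $F = \frac{\text{variance between the group means}}{\text{variance within the groups}}$; a large F-value indicates that the feature separates the classes well.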

from sklearn.feature_selection import f_classif,SelectPercentile
Anova_test=SelectPercentile(f_classif,percentile=80)
Anova_model= Anova_test.fit(X,Y)

So we will be selecting only the top 80% of the features based on their F-scores, which here amounts to 6 of the 8 features.

Anova_model.scores_
array([ 39.67022739, 213.16175218,   3.2569504 ,   4.30438091,
        13.28110753,  71.7720721 ,  23.8713002 ,  46.14061124])
feature_scores_df=generate_feature_scores_df(X,Anova_model.scores_)
feature_scores_df
Features Score
0 Pregnancies 39.670227
1 Glucose 213.161752
2 BloodPressure 3.256950
3 SkinThickness 4.304381
4 Insulin 13.281108
5 BMI 71.772072
6 DiabetesPedigreeFunction 23.871300
7 Age 46.140611
X_new=Anova_model.transform(X)
X_new=pd.DataFrame(X_new)
selected_features=get_selected_features(X,X_new)

selected_features
['Pregnancies', 'Glucose', 'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age']
Anova_selected_feature_df=X[selected_features]
Anova_selected_feature_df.head()
Pregnancies Glucose Insulin BMI DiabetesPedigreeFunction Age
0 6.0 148.0 0.0 33.6 0.627 50.0
1 1.0 85.0 0.0 26.6 0.351 31.0
2 8.0 183.0 0.0 23.3 0.672 32.0
3 1.0 89.0 94.0 28.1 0.167 21.0
4 0.0 137.0 168.0 43.1 2.288 33.0

Now let’s compare these approaches by training a logistic regression model on each selected feature set.

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
def buildmodel(X, Y, test_frac, model_name=''):
    # Split the data, fit a logistic regression model and report test accuracy
    x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=test_frac)
    model = LogisticRegression(solver='liblinear').fit(x_train, y_train)
    y_pred = model.predict(x_test)
    print('Accuracy of the '+model_name+' based model is ', str(accuracy_score(y_test, y_pred)))
buildmodel(X=diabetes_missing_value_threshold_features,Y=diabetes_missing_value_threshold_label,test_frac=.2,model_name="Missing values threshold")
Accuracy of the Missing values threshold based model is  0.8116883116883117
buildmodel(X=selected_features_df,Y=Y,test_frac=.2,model_name="Variance threshold")
Accuracy of the Variance threshold based model is  0.8181818181818182
buildmodel(X=X,Y=Y,test_frac=.2,model_name="General Logistic Regression")
Accuracy of the General Logistic Regression based model is  0.7012987012987013
buildmodel(X=chi2_best_features,Y=Y,test_frac=.2,model_name="Chi2 based")
Accuracy of the Chi2 based based model is  0.7922077922077922
buildmodel(X=Anova_selected_feature_df,Y=Y,test_frac=.2,model_name="Anova F-test based")
Accuracy of the Anova F-test based based model is  0.8051948051948052
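
Keep in mind that train_test_split draws a new random split on every call, so these accuracies will vary between runs. For a more controlled comparison you could fix the seed (the random_state value below is an arbitrary choice, not from the original notebook):

x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=42)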

We can see that, on these splits, accuracy improved for each of the feature selection approaches compared with the general logistic regression model trained on all raw features.

You can get the notebook used in this tutorial here and the dataset here.

Thanks for reading!
