Mutual Information Score — Feature Selection using entropy from information theory
Hi everyone, how's life? Another day in paradise? Great.
Today we will look into a unique way of feature selection using mutual information.
Generally, there are two main feature selection techniques we consider for numerical input data and a numerical target variable.
They are:
Correlation Statistics
# multicollinearity check
corr_matrix = X.corr().abs()
# select the upper triangle of the correlation matrix
upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))
# find features with a correlation greater than 0.95
to_drop = [column for column in upper.columns if any(upper[column] > 0.95)]
# drop those features
X.drop(to_drop, axis=1, inplace=True)
Mutual Information Statistics
Others: model-centric approaches using Random Forest, Decision Trees, etc. There is a catch, though: we need to train the model before we can read its feature importances, so this becomes computationally expensive at a production level (see the sketch after this list). Remember that! It's a classic Data Science Architect question.
Wrapper Functions: various wrapper functions such as SelectKBest are available, offering more flexibility to suit certain scenarios.
Under Development: TVS, a method to discover feature importance using unsupervised learning.
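To make the model-centric approach concrete, here is a minimal sketch (not from the original article; it reuses the make_regression setup from the examples below): a Random Forest must be fully trained before its feature importances become available, which is exactly what makes this route expensive in production.
# hedged sketch: model-centric feature importance with a Random Forest
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
# toy dataset, same setup as the regression examples below
X, y = make_regression(n_samples=1000, n_features=100, n_informative=10, noise=0.1, random_state=1)
# the model has to be trained first - this is the computational cost
model = RandomForestRegressor(n_estimators=100, random_state=1)
model.fit(X, y)
# only after fitting can we read off the importances
importances = model.feature_importances_
print(importances[:10])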
The focus of this article will be the Mutual Information Feature Selection.
Mutual Information Feature Selection
Mutual information, from the field of information theory, is the application of information gain (typically used in the construction of decision trees) to feature selection.
Mutual information is calculated between two variables and measures the reduction in uncertainty for one variable given a known value of the other variable. It is straightforward when considering the distribution of two discrete (categorical or ordinal) variables, such as categorical input and categorical output data. Nevertheless, it can be adapted for use with numerical input and output data. In short, mutual information measures how much the entropy of a feature drops once the value of the target is known.
A simple way to express this concept is the following formula:
MI(feature; target) = Entropy(feature) - Entropy(feature|target)
The MI score falls in the range from 0 to ∞. A high MI value means a closer connection between the feature and the target, indicating that the feature is important for training the model. A low MI score, with 0 as the extreme, indicates a weak (or no) connection between the feature and the target.
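As a quick sanity check of this formula, here is a minimal sketch (a tiny made-up discrete example, not taken from the article) that computes MI once as the entropy difference and once with scikit-learn's mutual_info_score; the two values should match.
# hedged sketch: MI(feature; target) = Entropy(feature) - Entropy(feature|target)
import numpy as np
from sklearn.metrics import mutual_info_score
# tiny made-up discrete feature and target
feature = np.array([0, 0, 1, 1, 1, 0, 1, 0])
target = np.array([0, 0, 1, 1, 0, 0, 1, 1])
def entropy(labels):
    # Shannon entropy (in nats) of a discrete variable
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log(p))
def conditional_entropy(labels, condition):
    # H(labels | condition) = sum over c of p(c) * H(labels where condition == c)
    return sum((condition == c).mean() * entropy(labels[condition == c]) for c in np.unique(condition))
mi_manual = entropy(feature) - conditional_entropy(feature, target)
mi_sklearn = mutual_info_score(feature, target)
print(mi_manual, mi_sklearn)  # both values agree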
Mutual Information Feature Selection for Regression
# Feature Selection using Mutual Information with k='all'
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import mutual_info_regression
from matplotlib import pyplot
# load the dataset and split into train and test sets
X, y = make_regression(n_samples=1000, n_features=100, n_informative=10, noise=0.1, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)
# score all features with mutual information and plot the scores
fs = SelectKBest(score_func=mutual_info_regression, k='all')
fs.fit(X_train, y_train)
pyplot.bar([i for i in range(len(fs.scores_))], fs.scores_)
pyplot.show()
Modeling with all Features
# model using all input features
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
# load the dataset
X, y = make_regression(n_samples=1000, n_features=100, n_informative=10, noise=0.1, random_state=1)
# split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)
# fit the model
model = LinearRegression()
model.fit(X_train, y_train)
# evaluate the model on the test set
yhat = model.predict(X_test)
mae = mean_absolute_error(y_test, yhat)
Modeling with correlation-selected features, k = 88. Here we will be using the wrapper function SelectKBest with score_func=f_regression, which is similar to fitting a regression model and then reading back the importance of each feature.
# model using 88 features chosen with correlation
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import f_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
# feature selection
def select_features(X_train, y_train, X_test):
    # configure to select a subset of features
    fs = SelectKBest(score_func=f_regression, k=88)
    # learn relationship from training data
    fs.fit(X_train, y_train)
    # transform train input data
    X_train_fs = fs.transform(X_train)
    # transform test input data
    X_test_fs = fs.transform(X_test)
    return X_train_fs, X_test_fs, fs
# load the dataset
X, y = make_regression(n_samples=1000, n_features=100, n_informative=10, noise=0.1, random_state=1)
# split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)
# feature selection
X_train_fs, X_test_fs, fs = select_features(X_train, y_train, X_test)
# fit the model
model = LinearRegression()
model.fit(X_train_fs, y_train)
# evaluate the model
yhat = model.predict(X_test_fs)
# evaluate predictions
mae = mean_absolute_error(y_test, yhat)
# MAE: 0.08569191074140582
Modeling with Mutual Information features, k = 88. Here we again use the same SelectKBest wrapper function to pick out the important features, but this time with score_func=mutual_info_regression.
#model using 88 features chosen with mutual information
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import mutual_info_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
# feature selection
def select_features(X_train, y_train, X_test):
    # configure to select a subset of features
    fs = SelectKBest(score_func=mutual_info_regression, k=88)
    # learn relationship from training data
    fs.fit(X_train, y_train)
    # transform train input data
    X_train_fs = fs.transform(X_train)
    # transform test input data
    X_test_fs = fs.transform(X_test)
    return X_train_fs, X_test_fs, fs
# load the dataset
X, y = make_regression(n_samples=1000, n_features=100, n_informative=10, noise=0.1, random_state=1)
# split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)
# feature selection
X_train_fs, X_test_fs, fs = select_features(X_train, y_train, X_test)
# fit the model
model = LinearRegression()
model.fit(X_train_fs, y_train)
# evaluate the model
yhat = model.predict(X_test_fs)
# evaluate predictions
mae = mean_absolute_error(y_test, yhat)
# MAE: 0.08378300965184769
Comparing the two runs, the MAE with the mutual-information-selected features (0.0838) comes out slightly lower than with the correlation-selected features (0.0857), even though both keep 88 of the 100 inputs.
Now let's apply the same idea to a real dataset: next we will quickly go over a Mutual Information classification example using the breast cancer dataset.
# Mutual Information Classification
# load cancer data
from sklearn.datasets import load_breast_cancer as LBC
cancer = LBC()
X = cancer['data']
y = cancer['target']
#Compute MI scores
from sklearn.feature_selection import mutual_info_classif as MIC
mi_scores = MIC(X,y)
print(mi_scores)
#prepare dataset 1
from sklearn.model_selection import train_test_split as tts
X_train_1,X_test_1,y_train,y_test = tts(
X,y,random_state=0,stratify=y )
# prepare dataset 2, MI > 0.2
import numpy as np
mi_score_selected_index = np.where(mi_scores >0.2)[0]
X_2 = X[:,mi_score_selected_index]
X_train_2,X_test_2,y_train,y_test = tts(
X_2,y,random_state=0,stratify=y)
# prepare dataset 3, MI <0.2
mi_score_selected_index = np.where(mi_scores < 0.2)[0]
X_3 = X[:,mi_score_selected_index]
X_train_3,X_test_3,y_train,y_test = tts(
X_3,y,random_state=0,stratify=y)
# compare results with Decision Tree Classifier
from sklearn.tree import DecisionTreeClassifier as DTC
model_1 = DTC().fit(X_train_1,y_train)
model_2 = DTC().fit(X_train_2,y_train)
model_3 = DTC().fit(X_train_3,y_train)
score_1 = model_1.score(X_test_1,y_test)
score_2 = model_2.score(X_test_2,y_test)
score_3 = model_3.score(X_test_3,y_test)
# use Scikit-learn feature selector
from sklearn.feature_selection import SelectPercentile as SP
selector = SP(percentile=50) # select features with top 50% MI scores
selector.fit(X,y)
X_4 = selector.transform(X)
X_train_4,X_test_4,y_train,y_test = tts(
X_4,y,random_state=0,stratify=y)
model_4 = DTC().fit(X_train_4,y_train)
score_4 = model_4.score(X_test_4,y_test)
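For reference, a minimal way to print the accuracy scores reported below (using the variable names from the code above):
# print the test accuracy of each model for comparison
print("Score_1:", score_1)  # all features
print("Score_2:", score_2)  # features with MI > 0.2
print("Score_3:", score_3)  # features with MI < 0.2
print("Score_4:", score_4)  # top 50% of features by MI via SelectPercentile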
Score_1: 0.9370629370629371
Score_2: 0.916083916083916
Score_3: 0.8391608391608392
Score_4: 0.916083916083916
Done. That's it!
I hope you enjoyed it. I will keep trying to bring new content from across the data science realm, and I hope this piece proves useful at some point in your work. I believe machine learning is not about replacing us; it's about replacing the repetitive, iterative work that consumes so much time and effort, so that people can come to work to create innovations rather than be stuck with the same boring tasks.
Thanks again for your time. If you enjoyed this short article, there are tons of topics in advanced analytics, data science, and machine learning available in my Medium repo: https://medium.com/@bobrupakroy
Some of my alternative internet presences: Facebook, Instagram, Udemy, Blogger, Issuu, Slideshare, Scribd, and more.