Mutual Information Score — Feature Selection using entropy from information theory
Hi everyone, how's life? Another day in paradise? Great.
Today we will look into a unique way of feature selection using mutual information.
Generally, there are two main feature selection techniques we consider for numerical input data and a numerical target variable.
They are:
Correlation Statistics
# multicollinearity check
corr_matrix = X.corr().abs()
# select the upper triangle of the correlation matrix
upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))
# find features with a correlation greater than 0.95
to_drop = [column for column in upper.columns if any(upper[column] > 0.95)]
# drop those features
X.drop(to_drop, axis=1, inplace=True)
Mutual Information Statistics
Others: model-centric approaches using Random Forest, Decision Trees, etc. There is a catch, though: we need to train the model before we can read its feature importances, so this becomes computationally expensive at a production level (see the sketch after this list). Remember that! It's a classic Data Science Architect question.
Wrapper Functions: various wrapper functions such as SelectKBest are available, offering more flexibility to suit certain scenarios.
Under Development: TVS, a method to discover feature importance using unsupervised learning.
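To make the model-centric approach concrete, here is a minimal sketch (not from the original article; it reuses the make_regression setup from the examples below): a Random Forest must be fully trained before its feature importances become available, which is exactly what makes this route expensive in production.
# hedged sketch: model-centric feature importance with a Random Forest
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
# toy dataset, same setup as the regression examples below
X, y = make_regression(n_samples=1000, n_features=100, n_informative=10, noise=0.1, random_state=1)
# the model has to be trained first - this is the computational cost
model = RandomForestRegressor(n_estimators=100, random_state=1)
model.fit(X, y)
# only after fitting can we read off the importances
importances = model.feature_importances_
print(importances[:10])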
The focus of this article will be the Mutual Information Feature Selection.
Mutual Information Feature Selection
Mutual information, from the field of information theory, is the application of information gain (typically used in the construction of decision trees) to feature selection.
Mutual information is calculated between two variables and measures the reduction in uncertainty for one variable given a known value of the other variable. It is straightforward when considering the distribution of two discrete (categorical or ordinal) variables, such as categorical input and categorical output data. Nevertheless, it can be adapted for use with numerical input and output data. In short, mutual information measures how much the entropy of a feature drops once the value of the target is known.
A simple way to express this concept is the following formula:
MI(feature; target) = Entropy(feature) - Entropy(feature|target)
The MI score falls in the range from 0 to ∞. A high MI value means a closer connection between the feature and the target, indicating that the feature is important for training the model. A low MI score, with 0 as the extreme, indicates a weak (or no) connection between the feature and the target.
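As a quick sanity check of this formula, here is a minimal sketch (a tiny made-up discrete example, not taken from the article) that computes MI once as the entropy difference and once with scikit-learn's mutual_info_score; the two values should match.
# hedged sketch: MI(feature; target) = Entropy(feature) - Entropy(feature|target)
import numpy as np
from sklearn.metrics import mutual_info_score
# tiny made-up discrete feature and target
feature = np.array([0, 0, 1, 1, 1, 0, 1, 0])
target = np.array([0, 0, 1, 1, 0, 0, 1, 1])
def entropy(labels):
    # Shannon entropy (in nats) of a discrete variable
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log(p))
def conditional_entropy(labels, condition):
    # H(labels | condition) = sum over c of p(c) * H(labels where condition == c)
    return sum((condition == c).mean() * entropy(labels[condition == c]) for c in np.unique(condition))
mi_manual = entropy(feature) - conditional_entropy(feature, target)
mi_sklearn = mutual_info_score(feature, target)
print(mi_manual, mi_sklearn)  # both values agree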
Mutual Information Feature Selection for Regression
# Feature Selection using Mutual Information with k='all'
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import mutual_info_regression
from matplotlib import pyplot
# load the dataset and split into train and test sets
X, y = make_regression(n_samples=1000, n_features=100, n_informative=10, noise=0.1, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)
# score all features with mutual information and plot the scores
fs = SelectKBest(score_func=mutual_info_regression, k='all')
fs.fit(X_train, y_train)
pyplot.bar([i for i in range(len(fs.scores_))], fs.scores_)
pyplot.show()
Modeling with all Features
# model using all input features
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
# load the dataset
X, y = make_regression(n_samples=1000, n_features=100, n_informative=10, noise=0.1, random_state=1)
# split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)
# fit the model
model = LinearRegression()
model.fit(X_train, y_train)
# evaluate the model on the test set
yhat = model.predict(X_test)
mae = mean_absolute_error(y_test, yhat)
Modeling with correlation-selected features, k = 88. Here we will be using the wrapper function SelectKBest with score_func=f_regression, which is similar to fitting a regression model and then reading back the importance of each feature.
# model using 88 features chosen with correlation
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import f_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
# feature selection
def select_features(X_train, y_train, X_test):
    # configure to select a subset of features
    fs = SelectKBest(score_func=f_regression, k=88)
    # learn relationship from training data
    fs.fit(X_train, y_train)
    # transform train input data
    X_train_fs = fs.transform(X_train)
    # transform test input data
    X_test_fs = fs.transform(X_test)
    return X_train_fs, X_test_fs, fs
# load the dataset
X, y = make_regression(n_samples=1000, n_features=100, n_informative=10, noise=0.1, random_state=1)
# split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)
# feature selection
X_train_fs, X_test_fs, fs = select_features(X_train, y_train, X_test)
# fit the model
model = LinearRegression()
model.fit(X_train_fs, y_train)
# evaluate the model
yhat = model.predict(X_test_fs)
# evaluate predictions
mae = mean_absolute_error(y_test, yhat)
# MAE: 0.08569191074140582
Modeling with Mutual Information features, k = 88. Here we again use the same SelectKBest wrapper function to pick out the important features, but this time with score_func=mutual_info_regression.
#model using 88 features chosen with mutual information
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import mutual_info_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
# feature selection
def select_features(X_train, y_train, X_test):
    # configure to select a subset of features
    fs = SelectKBest(score_func=mutual_info_regression, k=88)
    # learn relationship from training data
    fs.fit(X_train, y_train)
    # transform train input data
    X_train_fs = fs.transform(X_train)
    # transform test input data
    X_test_fs = fs.transform(X_test)
    return X_train_fs, X_test_fs, fs
# load the dataset
X, y = make_regression(n_samples=1000, n_features=100, n_informative=10, noise=0.1, random_state=1)
# split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)
# feature selection
X_train_fs, X_test_fs, fs = select_features(X_train, y_train, X_test)
# fit the model
model = LinearRegression()
model.fit(X_train_fs, y_train)
# evaluate the model
yhat = model.predict(X_test_fs)
# evaluate predictions
mae = mean_absolute_error(y_test, yhat)
# MAE: 0.08378300965184769
Comparing the two runs, the MAE with the mutual-information-selected features (0.0838) comes out slightly lower than with the correlation-selected features (0.0857), even though both keep 88 of the 100 inputs.
Now let's apply the same idea to a real dataset: next we will quickly go over a Mutual Information classification example using the breast cancer dataset.
# Mutual Information Classification
# load cancer data
from sklearn.datasets import load_breast_cancer as LBC
cancer = LBC()
X = cancer['data']
y = cancer['target']
#Compute MI scores
from sklearn.feature_selection import mutual_info_classif as MIC
mi_scores = MIC(X,y)
print(mi_scores)
#prepare dataset 1
from sklearn.model_selection import train_test_split as tts
X_train_1,X_test_1,y_train,y_test = tts(
X,y,random_state=0,stratify=y )
# prepare dataset 2, MI > 0.2
import numpy as np
mi_score_selected_index = np.where(mi_scores >0.2)[0]
X_2 = X[:,mi_score_selected_index]
X_train_2,X_test_2,y_train,y_test = tts(
X_2,y,random_state=0,stratify=y)
# prepare dataset 3, MI <0.2
mi_score_selected_index = np.where(mi_scores < 0.2)[0]
X_3 = X[:,mi_score_selected_index]
X_train_3,X_test_3,y_train,y_test = tts(
X_3,y,random_state=0,stratify=y)
# compare results with Decision Tree Classifier
from sklearn.tree import DecisionTreeClassifier as DTC
model_1 = DTC().fit(X_train_1,y_train)
model_2 = DTC().fit(X_train_2,y_train)
model_3 = DTC().fit(X_train_3,y_train)
score_1 = model_1.score(X_test_1,y_test)
score_2 = model_2.score(X_test_2,y_test)
score_3 = model_3.score(X_test_3,y_test)
# use Scikit-learn feature selector
from sklearn.feature_selection import SelectPercentile as SP
selector = SP(percentile=50) # select features with top 50% MI scores
selector.fit(X,y)
X_4 = selector.transform(X)
X_train_4,X_test_4,y_train,y_test = tts(
X_4,y,random_state=0,stratify=y)
model_4 = DTC().fit(X_train_4,y_train)
score_4 = model_4.score(X_test_4,y_test)
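For reference, a minimal way to print the accuracy scores reported below (using the variable names from the code above):
# print the test accuracy of each model for comparison
print("Score_1:", score_1)  # all features
print("Score_2:", score_2)  # features with MI > 0.2
print("Score_3:", score_3)  # features with MI < 0.2
print("Score_4:", score_4)  # top 50% of features by MI via SelectPercentile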
Score_1: 0.9370629370629371
Score_2: 0.916083916083916
Score_3: 0.8391608391608392
Score_4: 0.916083916083916
Done. That's it!
I hope you enjoyed it. I will keep trying to bring new content from across the data science realm, and I hope this piece proves useful at some point in your work. I believe machine learning is not about replacing us; it's about replacing the repetitive, iterative work that consumes so much time and effort, so that people can come to work to create innovations rather than be stuck with the same boring tasks.
Thanks again for your time. If you enjoyed this short article, there are tons of topics in advanced analytics, data science, and machine learning available in my Medium repo: https://medium.com/@bobrupakroy
Some of my alternative internet presences: Facebook, Instagram, Udemy, Blogger, Issuu, Slideshare, Scribd, and more.