Handle Imbalanced Datasets with Ease, Featuring BalancedBaggingClassifier and BalancedRandomForestClassifier

 

Hello hi hey there, once again we meet with another topic: how to handle imbalanced datasets. In our previous discussion we walked through various resampling techniques like SMOTE, CNN, OSS, NCR, ENN, Tomek links, and many more. This time we will look at ready-made ensemble classifiers from the imbalanced-learn (imblearn) library, which follows the familiar scikit-learn API and handles imbalanced datasets with ease.

One of them is the BalancedBaggingClassifier. Just like a regular bagging classifier, it builds several estimators on different random subsets of the data. The difference is that a regular bagging classifier gives us no way to balance each subset, so when we train on an imbalanced dataset it will favor the majority classes.

Let’s run a comparison of both.

1. BaggingClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import balanced_accuracy_score
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# create a 3-class dataset with a heavy imbalance: 1% / 5% / 94%
X, y = make_classification(n_samples=10000, n_features=2, n_informative=2,
                           n_redundant=0, n_repeated=0, n_classes=3,
                           n_clusters_per_class=1, weights=[0.01, 0.05, 0.94],
                           class_sep=0.8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# a plain bagging ensemble of decision trees; nothing balances the subsets
# (note: scikit-learn >= 1.2 renames base_estimator to estimator)
bc = BaggingClassifier(base_estimator=DecisionTreeClassifier(), random_state=0)
bc.fit(X_train, y_train)
y_pred = bc.predict(X_test)
print("Balanced accuracy", balanced_accuracy_score(y_test, y_pred))
# Balanced accuracy 0.7739
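By the way, it’s worth pausing to confirm just how skewed this dataset actually is. A quick check with Python’s collections.Counter (not part of the original snippet):

from collections import Counter
print(Counter(y))
# roughly 9,400 samples of class 2, 500 of class 1 and 100 of class 0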

Well, remember that balanced accuracy of 0.7739; now let’s try the new BalancedBaggingClassifier().

from imblearn.ensemble import BalancedBaggingClassifier

# the same bagging idea, but each bootstrap subset is balanced by random
# under-sampling before its tree is fit
bbc = BalancedBaggingClassifier(base_estimator=DecisionTreeClassifier(),
                                sampling_strategy='auto', replacement=False,
                                random_state=0)
bbc.fit(X_train, y_train)
y_pred = bbc.predict(X_test)
print("Balanced accuracy", balanced_accuracy_score(y_test, y_pred))
# Balanced accuracy 0.8

See the difference? Balanced accuracy jumped from roughly 0.77 to 0.80 with a one-line swap.
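Under the hood, BalancedBaggingClassifier essentially random-under-samples each bootstrap subset before fitting the estimator. If you are curious, here is a rough hand-rolled equivalent (a sketch of the idea, not the exact internals) built from RandomUnderSampler and an imblearn pipeline:

from imblearn.pipeline import make_pipeline
from imblearn.under_sampling import RandomUnderSampler

# each bagging subset flows through the pipeline: the sampler balances it
# during fit, then the decision tree trains on the balanced subset
resampled_tree = make_pipeline(RandomUnderSampler(random_state=0),
                               DecisionTreeClassifier())
bc_manual = BaggingClassifier(base_estimator=resampled_tree, random_state=0)
bc_manual.fit(X_train, y_train)
print("Balanced accuracy",
      balanced_accuracy_score(y_test, bc_manual.predict(X_test)))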

2. BalancedRandomForestClassifier is another ensemble method, in which each tree of the forest is given a balanced bootstrap sample.

from imblearn.ensemble import BalancedRandomForestClassifier

# a random forest where every tree trains on a balanced bootstrap sample
brf = BalancedRandomForestClassifier(n_estimators=100, random_state=0)
brf.fit(X_train, y_train)
y_pred = brf.predict(X_test)
print("Balanced accuracy", balanced_accuracy_score(y_test, y_pred))
# Balanced accuracy 0.8
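Balanced accuracy is a single summary number; to see what happens per class (minority recall in particular), imbalanced-learn ships its own classification report:

from imblearn.metrics import classification_report_imbalanced
print(classification_report_imbalanced(y_test, y_pred))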

Well, that’s it: simple, quick, yet powerful.

Bonus: 3. SMOTE-NC

from imblearn.over_sampling import SMOTENC

# indices of the columns to treat as categorical; marking *every* column
# as categorical raises a ValueError, so we mark only column 0 here
# (purely for illustration, since our synthetic X_train is all continuous)
cat_indx = [0]
sm = SMOTENC(categorical_features=cat_indx, random_state=0)
X_train_res, y_train_res = sm.fit_resample(X_train, y_train)

The initials NC in the algorithm’s name stand for Nominal-Continuous. SMOTE-NC is not designed to work with categorical features only; it requires some numerical features as well, which is why cat_indx above must leave at least one column as continuous.
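To see it behave as intended, here is a minimal self-contained sketch on made-up mixed data (the dataset, with a 0/1-encoded categorical column and a continuous column, is invented purely for illustration):

import numpy as np
from collections import Counter
from imblearn.over_sampling import SMOTENC

rng = np.random.RandomState(0)
# column 0: a categorical feature encoded as 0/1; column 1: continuous
X_mixed = np.hstack([rng.randint(0, 2, size=(100, 1)), rng.randn(100, 1)])
y_mixed = np.array([0] * 90 + [1] * 10)  # 90/10 imbalance

sm = SMOTENC(categorical_features=[0], random_state=0)
X_res, y_res = sm.fit_resample(X_mixed, y_mixed)
print(Counter(y_res))           # both classes now have 90 samples
print(np.unique(X_res[:, 0]))   # the categorical column still holds only 0 and 1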

Here is the link to the imbalanced-learn documentation: https://imbalanced-learn.org/stable/ensemble.html

Likewise, I will try to bring more interesting topics, debugged with my intuition. Next we will move on to a powerful deep learning architecture for creating synthetic data to handle imbalanced datasets. See you there. Ciao.

If you wish to explore more new ways of doing data science, follow my other articles.

Some of my alternative internet presences: Facebook, Instagram, Udemy, Blogger, Issuu, and more.

Also available on Quora @ https://www.quora.com/profile/Rupak-Bob-Roy

Have a good day.

