Edited Nearest Neighbors ~ ENN: Another Way to Undersample Imbalanced Classes

Hi there, is everything cool? Great! As mentioned in my previous article on CNN (Condensed Nearest Neighbor) undersampling, today I will bring you another rule for undersampling the imbalanced classes in our dataset.

The Edited Nearest Neighbors rule for undersampling uses the K=3 nearest neighbors of each data point to find points that are misclassified by their neighbors; those points are removed before a K=1 classification rule is applied. This approach of resampling and classification was first proposed by Dennis Wilson in his 1972 paper titled “Asymptotic Properties of Nearest Neighbor Rules Using Edited Data.”

When used as an undersampling procedure, the rule can be applied to each example in the majority class, allowing those examples that are misclassified as belonging to the minority class to be removed and those correctly classified to remain.
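Before jumping into the library call, it may help to see the rule written out by hand. The snippet below is only a minimal sketch of the idea, assuming X and y are NumPy arrays; the function enn_clean, its arguments, and the majority-vote criterion are my own illustrative choices, not a library API.

import numpy as np
from sklearn.neighbors import NearestNeighbors

def enn_clean(X, y, majority_label=0, n_neighbors=3):
    # find the k nearest neighbours of every point (the first neighbour is the point itself)
    nn = NearestNeighbors(n_neighbors=n_neighbors + 1).fit(X)
    _, idx = nn.kneighbors(X)
    neighbor_labels = y[idx[:, 1:]]  # drop the self-match column
    # a point is "misclassified" when the majority vote of its neighbours disagrees with its label
    agreeing = (neighbor_labels == y[:, None]).sum(axis=1)
    misclassified = agreeing < (n_neighbors / 2.0)
    # only drop misclassified examples that belong to the majority class
    keep = ~(misclassified & (y == majority_label))
    return X[keep], y[keep]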

Now let’s see how we can apply ENN with imbalanced-learn:

#Edited Nearest Neighbor
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.under_sampling import EditedNearestNeighbours
from matplotlib import pyplot
from numpy import where
# define dataset
X, y = make_classification(n_samples=10000, n_features=2, n_redundant=0,
n_clusters_per_class=1, weights=[0.99], flip_y=0, random_state=1)
# summarize class distribution
counter = Counter(y)
print(counter)
# define the undersampling method
undersample = EditedNearestNeighbours(n_neighbors=3)
# transform the dataset
X, y = undersample.fit_resample(X, y)
# summarize the new class distribution
counter = Counter(y)
print(counter)
# scatter plot of examples by class label
for label, _ in counter.items():
    row_ix = where(y == label)[0]
    pyplot.scatter(X[row_ix, 0], X[row_ix, 1], label=str(label))
pyplot.legend()
pyplot.show()

Well, that's it! If we check the class distribution, we can see that ENN has trimmed the majority class from 9,900 down to 9,806 examples, while the minority class is left untouched. The dataset is cleaner around the class boundary, though still far from balanced:

Counter({0: 9900, 1: 100})
Counter({0: 9806, 1: 100})
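A side note on the numbers above: by default, EditedNearestNeighbours only cleans the classes other than the minority class, which is why the minority count of 100 stays the same. If you want the rule applied to every class, or prefer a majority-vote criterion instead of requiring all neighbours to agree, you can adjust the sampling_strategy and kind_sel parameters. A small sketch of that variation, on the same toy dataset:

from collections import Counter
from sklearn.datasets import make_classification
from imblearn.under_sampling import EditedNearestNeighbours

X, y = make_classification(n_samples=10000, n_features=2, n_redundant=0,
    n_clusters_per_class=1, weights=[0.99], flip_y=0, random_state=1)
# sampling_strategy='all' edits every class, not just the non-minority ones;
# kind_sel='mode' removes a sample only when the majority of its neighbours disagree with it
undersample = EditedNearestNeighbours(sampling_strategy='all', n_neighbors=3, kind_sel='mode')
X_res, y_res = undersample.fit_resample(X, y)
print(Counter(y_res))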

And just like CNN, ENN tends to give the best results when combined with an oversampling method such as SMOTE.

To know more about how SMOTE works, it's a whole chapter by itself and is well documented in my previous article. However, let me share the SMOTEENN code with you so that you can keep the tool handy.

# combined SMOTE and Edited Nearest Neighbors resampling for imbalanced classification
from numpy import mean
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from imblearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier
from imblearn.combine import SMOTEENN
# generate dataset
X, y = make_classification(n_samples=10000, n_features=2, n_redundant=0,
n_clusters_per_class=1, weights=[0.99], flip_y=0, random_state=1)
# define model
model = DecisionTreeClassifier()
# define resampling
resample = SMOTEENN()
# define pipeline
pipeline = Pipeline(steps=[('r', resample), ('m', model)])
# define evaluation procedure
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
# evaluate model
scores = cross_val_score(pipeline, X, y, scoring='roc_auc', cv=cv, n_jobs=-1)
# summarize performance
print('Mean ROC AUC: %.3f' % mean(scores))
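For a point of reference, you may also want to score the same decision tree without any resampling and compare the two numbers. A minimal sketch, reusing the model, X, y, cv, and mean objects defined above:

# baseline: the same decision tree evaluated without any resampling, for comparison
baseline = cross_val_score(model, X, y, scoring='roc_auc', cv=cv, n_jobs=-1)
print('Mean ROC AUC without resampling: %.3f' % mean(baseline))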

Further extensions of ENN were also introduced, namely (a quick usage sketch follows the list):

1. RepeatedEditedNearestNeighbours: http://glemaitre.github.io/imbalanced-learn/generated/imblearn.under_sampling.RepeatedEditedNearestNeighbours.html

2. AllKNN: https://imbalanced-learn.org/stable/references/generated/imblearn.under_sampling.AllKNN.html
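Both follow the same fit_resample interface as EditedNearestNeighbours, so they are essentially drop-in replacements in the first example above. A minimal usage sketch with near-default parameters, reusing the X and y generated earlier; treat it as an illustration rather than a tuned setup:

from collections import Counter
from imblearn.under_sampling import RepeatedEditedNearestNeighbours, AllKNN

# RENN re-applies ENN until no more samples are removed (or max_iter is reached)
renn = RepeatedEditedNearestNeighbours(n_neighbors=3, max_iter=100)
X_renn, y_renn = renn.fit_resample(X, y)

# AllKNN repeats ENN while increasing the neighbourhood size at each pass, up to n_neighbors
allknn = AllKNN(n_neighbors=3)
X_allknn, y_allknn = allknn.fit_resample(X, y)

print(Counter(y_renn), Counter(y_allknn))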

Here we are at the end of another interesting topic. I hope you enjoyed it.

Likewise, I will try to bring more interesting topics from across the field, debugged with my intuition. Next up, I found another interesting topic, ‘One-Sided Selection for undersampling’. See you there. Ciao.

If you wish to explore more new ways of doing data science, follow my other articles.

Some of my alternative internet presences: Facebook, Instagram, Udemy, Blogger, Issuu, and more.

Also available on Quora @ https://www.quora.com/profile/Rupak-Bob-Roy

Have a good day.

