OSS & NCR. Two more interesting undersampling techniques for imbalanced datasets

Hi there, here we are again as promised with an interesting topic from across the data science realm.

Today we will look into two techniques that combine Keep and Delete methods for undersampling imbalanced datasets.

One-Sided Selection (OSS) is an undersampling technique that combines Tomek Links and the Condensed Nearest Neighbor (CNN) rule.

Tomek Links identify ambiguous points on the class boundary, which are removed from the majority class, while the CNN rule removes redundant majority-class examples that lie far from the decision boundary.

The method was first proposed by Miroslav Kubat and Stan Matwin in their 1997 paper titled “Addressing The Curse Of Imbalanced Training Sets: One-sided Selection.”
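
Before running OSS itself, it can help to see its two building blocks separately. Here is a minimal sketch using imblearn's TomekLinks and CondensedNearestNeighbour classes; the smaller sample size is only there to keep the CNN step quick and is not part of the full example that follows.

#Tomek Links and CNN applied separately (illustrative sketch)
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.under_sampling import TomekLinks, CondensedNearestNeighbour
#smaller dataset so the CNN step runs quickly
X, y = make_classification(n_samples=1000, n_features=2, n_redundant=0, n_clusters_per_class=1, weights=[0.99], flip_y=0, random_state=1)
print('original:', Counter(y))
#remove majority examples that form Tomek Links (boundary noise)
X_tl, y_tl = TomekLinks().fit_resample(X, y)
print('after Tomek Links:', Counter(y_tl))
#remove redundant majority examples far from the decision boundary
X_cnn, y_cnn = CondensedNearestNeighbour(n_neighbors=1, random_state=1).fit_resample(X, y)
print('after CNN:', Counter(y_cnn))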

#One-Sided Selection
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.under_sampling import OneSidedSelection
from matplotlib import pyplot
from numpy import where
#define the imbalanced dataset (99% majority class, no label noise)
X, y = make_classification(n_samples=10000, n_features=2, n_redundant=0, n_clusters_per_class=1, weights=[0.99], flip_y=0, random_state=1)
#summarize the original class distribution
counter = Counter(y)
print(counter)
#define the undersampling method
undersample = OneSidedSelection(n_neighbors=1, n_seeds_S=200)
#transform the dataset
X, y = undersample.fit_resample(X, y)
#summarize the new class distribution
counter = Counter(y)
print(counter)
#scatter plot of examples by class label
for label, _ in counter.items():
    row_ix = where(y == label)[0]
    pyplot.scatter(X[row_ix, 0], X[row_ix, 1], label=str(label))
pyplot.legend()
pyplot.show()
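
In practice you would usually plug the undersampler into a modeling workflow rather than transform the whole dataset up front. Here is a minimal sketch, assuming a decision tree classifier and ROC AUC as the metric (both are just illustrative choices), that puts OSS into an imblearn Pipeline so the resampling is applied only to the training folds during cross-validation.

#evaluate a classifier with OSS inside an imblearn Pipeline (illustrative sketch)
from numpy import mean
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, RepeatedStratifiedKFold
from sklearn.tree import DecisionTreeClassifier
from imblearn.pipeline import Pipeline
from imblearn.under_sampling import OneSidedSelection
#same synthetic dataset as above
X, y = make_classification(n_samples=10000, n_features=2, n_redundant=0, n_clusters_per_class=1, weights=[0.99], flip_y=0, random_state=1)
#chain the undersampler and the model so resampling only touches the training folds
steps = [('under', OneSidedSelection(n_neighbors=1, n_seeds_S=200)), ('model', DecisionTreeClassifier())]
pipeline = Pipeline(steps=steps)
#repeated stratified k-fold keeps the class ratio in every fold
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
scores = cross_val_score(pipeline, X, y, scoring='roc_auc', cv=cv, n_jobs=-1)
print('Mean ROC AUC: %.3f' % mean(scores))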

that's it!

Now let’s check what NCR is.

NCR stands for Neighborhood Cleaning Rule, an undersampling technique that combines the CNN rule to remove redundant examples and the ENN (Edited Nearest Neighbours) rule to remove noisy or ambiguous examples.

The focus here is less on improving the balance of the class distribution and more on the quality (unambiguity) of the examples that are retained in the majority class.

The approach involves first selecting all examples from the minority class. Then all of the ambiguous examples in the majority class are identified using the ENN rule and removed. Finally, a one-step version of CNN is used where the remaining majority-class examples that are misclassified against the store are removed, but only if the number of examples in the majority class is larger than half the size of the minority class.
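
To see the ENN step that NCR relies on in isolation, here is a small sketch using imblearn's EditedNearestNeighbours class with illustrative parameters.

#Edited Nearest Neighbours (ENN) applied on its own (illustrative sketch)
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.under_sampling import EditedNearestNeighbours
X, y = make_classification(n_samples=10000, n_features=2, n_redundant=0, n_clusters_per_class=1, weights=[0.99], flip_y=0, random_state=1)
print('original:', Counter(y))
#remove majority examples whose neighbourhood disagrees with their label
X_enn, y_enn = EditedNearestNeighbours(n_neighbors=3).fit_resample(X, y)
print('after ENN:', Counter(y_enn))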

#Neighborhood Cleaning Rule
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.under_sampling import NeighbourhoodCleaningRule
from matplotlib import pyplot
from numpy import where
#define the imbalanced dataset (99% majority class, no label noise)
X, y = make_classification(n_samples=10000, n_features=2, n_redundant=0, n_clusters_per_class=1, weights=[0.99], flip_y=0, random_state=1)
#summarize the original class distribution
counter = Counter(y)
print(counter)
#define the undersampling method
undersample = NeighbourhoodCleaningRule(n_neighbors=3, threshold_cleaning=0.5)
#transform the dataset
X, y = undersample.fit_resample(X, y)
#summarize the new class distribution
counter = Counter(y)
print(counter)
#scatter plot of examples by class label
for label, _ in counter.items():
    row_ix = where(y == label)[0]
    pyplot.scatter(X[row_ix, 0], X[row_ix, 1], label=str(label))
pyplot.legend()
pyplot.show()
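
If you are curious how the two methods differ on the same data, a quick comparison of the resulting class counts makes the point. You will likely see OSS removing far more majority examples than NCR, since CNN discards redundant points far from the boundary while NCR only cleans the ambiguous ones.

#compare the class counts produced by OSS and NCR on the same data (illustrative sketch)
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.under_sampling import OneSidedSelection, NeighbourhoodCleaningRule
X, y = make_classification(n_samples=10000, n_features=2, n_redundant=0, n_clusters_per_class=1, weights=[0.99], flip_y=0, random_state=1)
X_oss, y_oss = OneSidedSelection(n_neighbors=1, n_seeds_S=200).fit_resample(X, y)
X_ncr, y_ncr = NeighbourhoodCleaningRule(n_neighbors=3, threshold_cleaning=0.5).fit_resample(X, y)
print('original:', Counter(y))
print('after OSS:', Counter(y_oss))
print('after NCR:', Counter(y_ncr))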

that's it.

I hope you enjoyed it. If you wish to learn more in detail, you can search for the OSS and NCR undersampling techniques or visit the Machine Learning Mastery site.

Likewise, I will try my best to bring you more new approaches to data science.

If you wish to explore more new ways of doing data science, follow my other articles.

Some of my alternative internet presences: Facebook, Instagram, Udemy, Blogger, Issuu, and more.

Also available on Quora @ https://www.quora.com/profile/Rupak-Bob-Roy

Have a good day.
