What is Supervised Linear Discriminant Analysis (LDA) ~ PCA

What is PCA first of all?

Principal Component Analysis, or PCA, is a statistical procedure that allows us to summarize a dataset by extracting only the important components that explain most of its variation.

Principal component analysis is today one of the most popular multivariate statistical techniques; PCA is the mother method of multivariate data analysis (MVDA).

It has been widely used in pattern recognition, signal processing, and statistical analysis to reduce dimensionality. In simple words, it helps us understand and extract only the important factors that explain the whole dataset, so we avoid processing unnecessary data.
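As a quick illustration, here is a minimal, self-contained sketch of that idea. It assumes scikit-learn's built-in wine data (a stand-in for the Wine.csv file used later): fitting PCA on the standardized features shows how few components are needed to explain most of the variance.

#A minimal PCA sketch using scikit-learn's built-in wine data
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
X, y = load_wine(return_X_y = True)
X_scaled = StandardScaler().fit_transform(X)  #PCA is sensitive to feature scale
pca = PCA().fit(X_scaled)
#Cumulative share of the total variance explained by the first k components
print(pca.explained_variance_ratio_.cumsum())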

Now that we have a basic idea of what PCA is, let's understand what Linear Discriminant Analysis (LDA) is and how it performs dimensionality reduction.

Both PCA and LDA are linear transformation techniques used for dimensionality reduction. However, PCA is unsupervised while LDA is supervised: the fact that the dependent variable (DV) is taken into account is exactly what makes LDA a supervised method.
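In scikit-learn this difference shows up directly in the API: PCA is fitted on X alone, while LDA also needs the labels y. A minimal sketch (again using the built-in wine data as a stand-in):

#PCA ignores the labels; LDA requires them
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
X, y = load_wine(return_X_y = True)
X_pca = PCA(n_components = 2).fit_transform(X)  #unsupervised: X only
X_lda = LinearDiscriminantAnalysis(n_components = 2).fit_transform(X, y)  #supervised: X and y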

Let’s understand this with the help of an example. We will use one of the famous publicly available datasets, the Wine dataset ('Wine.csv'), which contains the key chemical attributes of wines along with the customer segment each wine belongs to.

Wine Dataset (preview)

With LDA, we will try to find the few key components that describe the whole dataset, and to validate the result we will cross-check it with performance metrics.

Let’s get started!

First, we will import the required libraries, load the dataset, define our independent variables X and dependent variable y, and then split the dataset into training and test sets.

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
#Importing the dataset
dataset = pd.read_csv('Wine.csv')
X = dataset.iloc[:, 0:13].values
y = dataset.iloc[:, 13].values
#Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)

We will also scale the features onto a common scale (standardization: zero mean, unit variance) to reduce differences in the magnitude/spread of the data points without losing their original meaning.

#Feature Scaling
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

Time to apply LDA!

#LDA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
lda = LDA(n_components = 2)  #at most (number of classes - 1) = 2 components here
X_train = lda.fit_transform(X_train, y_train)  #unlike PCA, LDA also needs y_train
X_test = lda.transform(X_test)

It's almost the same as PCA; the only difference is that the line 'explained_variance = pca.explained_variance_ratio_' is not required, because we are no longer looking for the components that explain the most variance. Instead, we are now looking for the components that best separate the classes of our dependent variable.

PCA vs LDA
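That said, if you do want a variance-style diagnostic, scikit-learn's LDA exposes an analogous attribute; it describes how much of the between-class variance each discriminant captures. A quick check, assuming the lda object fitted above:

#Share of the between-class variance captured by each discriminant
print(lda.explained_variance_ratio_)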

Now it's time to fit our data to a model and check how much accuracy we can achieve.

#Fitting Logistic Regression to the Training set
from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression(random_state = 0)
classifier.fit(X_train, y_train)
#Predicting the Test set results
y_pred = classifier.predict(X_test)

Here, as you can see, we are using logistic regression. random_state is just a seed number that removes randomness, so we get the same result each time the algorithm runs. Then we predict on unseen data (X_test).

We will use our regular confusion matrix, and then the accuracy score metric.

#Confusion Matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
#Another evaluation metric
from sklearn import metrics
print('Accuracy Score:', metrics.accuracy_score(y_test, y_pred))

Well, indeed, the two LDA components perfectly separated the classes in our test dataset (unseen data), and our model accuracy is 100%.
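To see that separation for yourself, here is a small optional sketch (assuming the np, plt, X_test, and y_test objects from the pipeline above) that plots the test set in the two-component LDA space:

#Optional: visualize the two LDA components for the test set
plt.figure()
for label in np.unique(y_test):
    plt.scatter(X_test[y_test == label, 0], X_test[y_test == label, 1], label = label)
plt.xlabel('LD1')
plt.ylabel('LD2')
plt.legend(title = 'Customer Segment')
plt.show()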

Now think of having more than 10,000 columns and being able to extract just a few components that explain the whole dataset.

Just imagine how helpful Principal Component Analysis (PCA) will be.

Let’s put all of the code together.

#Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
#Importing the dataset
dataset = pd.read_csv('Wine.csv')
X = dataset.iloc[:, 0:13].values
y = dataset.iloc[:, 13].values
#Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)
#Feature Scaling
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
#Applying LDA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
lda = LDA(n_components = 2)
X_train = lda.fit_transform(X_train, y_train)
X_test = lda.transform(X_test)
#Fitting Logistic Regression to the Training set
from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression(random_state = 0)
classifier.fit(X_train, y_train)
#Predicting the Test set results
y_pred = classifier.predict(X_test)
#Confusion Matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
#Another evaluation Metrics 
from sklearn import metrics
print('Accuracy Score:', metrics.accuracy_score(y_test, y_pred))
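As an optional sanity check (a sketch, not part of the original walkthrough), you can rebuild the split and compare a PCA pipeline against the LDA pipeline on exactly the same data, reusing the X, y, and imports from the listing above:

#Optional sanity check: PCA vs LDA inside the same pipeline
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
Xtr, Xte, ytr, yte = train_test_split(X, y, test_size = 0.2, random_state = 0)
for name, reducer in [('PCA', PCA(n_components = 2)), ('LDA', LDA(n_components = 2))]:
    model = make_pipeline(StandardScaler(), reducer, LogisticRegression(random_state = 0))
    model.fit(Xtr, ytr)
    print(name, 'accuracy:', model.score(Xte, yte))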

LDA Version in R
#Importing the dataset
dataset = read.csv('Wine.csv')
#Splitting the dataset into the Training set and Test set
#install.packages('caTools')
library(caTools)
set.seed(123)
split = sample.split(dataset$Customer_Segment, SplitRatio = 0.8)
training_set = subset(dataset, split == TRUE)
test_set = subset(dataset, split == FALSE)
nrow(training_set)/nrow(dataset)
nrow(test_set)/nrow(dataset)
#Feature Scaling
training_set[-14] = scale(training_set[-14])
test_set[-14] = scale(test_set[-14])

We have loaded the dataset and split it into training and test sets. After that, we feature-scale/standardize the training and test sets; the -14 excludes the 14th column, which is our dependent variable.

Now it's time to apply LDA in R.

#Applying LDA
library(MASS)
lda = lda(formula = Customer_Segment ~ ., data = training_set)
training_set = as.data.frame(predict(lda, training_set))
head(training_set)
#Keep the two discriminant scores (x.LD1, x.LD2) and the class column
training_set = training_set[c(5, 6, 1)]
test_set = as.data.frame(predict(lda, test_set))
test_set = test_set[c(5, 6, 1)]

Customer_Segment ~ . follows the DV ~ IV convention: the tilde (~) separates the dependent variable from the independent variables, and the dot (.) means take all the remaining columns as independent variables. The data frame returned by predict() contains the predicted class (column 1), the posterior probabilities for the three classes (columns 2-4), and the two discriminant scores x.LD1 and x.LD2 (columns 5 and 6), so we extract columns 5, 6, and 1 from the LDA results.

LDA results using R

We do the same for the test set, as shown in the last two lines of the code block above.


Now it's time to fit a model to our data.

#Fitting SVM to the Training set
#install.packages('e1071')
library(e1071)
classifier = svm(formula = class ~ .,
                 data = training_set,
                 type = 'C-classification',
                 kernel = 'linear')
#Predicting the Test set results
y_pred = predict(classifier, newdata = test_set[-3])
#Confusion Matrix
cm = table(test_set[, 3], y_pred)
Confusion Matrix (cm)

Well, again we have high accuracy with just our 2 LDA components/columns.

Now, for those who wish to explore more ways of performing PCA with R programming, I have a whole new course covering various types of PCA: PCA with Big Data, PCA with Random Forest (further divided into classification and regression), PCA with Generalized Boosted Models (GBM), PCA with Generalized Linear Models (GLMNET), PCA with Ensembles, PCA with fscaret, and more.

Next we will see another advanced type of PCA, optimized for non-linear data: Kernel PCA.

Thanks for taking the time to read to the end. I tried my best to keep it short and simple, keeping in mind that you can use this code in your daily work.

I hope you enjoyed it.

Feel Free to ask because “Curiosity Leads To Perfection”

Some of my alternative internet presences are Facebook, Instagram, Udemy, Blogger, Issuu, and more.

Also available on Quora @ https://www.quora.com/profile/Rupak-Bob-Roy

Stay tuned for more updates! Have a good day….

~ Be Happy and Enjoy!

