Data Science Interview Q’s — IV

A walkthrough of the essentials of data science interviews.

Hi there, and thanks for the continued support for my previous articles. Today we continue from the previous article, “Data Science Interview Q’s — III”, with the essential questions interviewers commonly ask to probe root-level knowledge of data science, rather than fancy advanced questions.

Q1. Movie Recommendation systems are an example of:

1. Classification 2. Clustering

3. Reinforcement Learning 4. Regression

Options:

A. 2 only

B. 1 and 2

C. 1 and 3

D. 2 and 3

E. 1, 2 and 3

F. 1, 2, 3 and 4

Solution: (E) Generally, movie recommendation systems cluster the users in a finite number of similar groups based on their previous activities and profile. Then, at a fundamental level, people in the same cluster are made similar recommendations.

In some scenarios, this can also be approached as a classification problem: assigning the most appropriate movie class to a user or to a specific group of users. A movie recommendation system can also be viewed as a reinforcement learning problem, where it learns from its previous recommendations and improves future ones.

Q2. Sentiment Analysis is an example of:

1. Regression 2. Classification

3. Clustering 4. Reinforcement Learning

Options:

A. 1 Only

B. 1 and 2

C. 1 and 3

D. 1, 2 and 3

E. 1, 2 and 4

F. 1, 2, 3 and 4

Solution: (E) Sentiment analysis at the fundamental level is the task of classifying the sentiments represented in an image, text or speech into a set of defined sentiment classes like happy, sad, excited, positive, negative, etc. It can also be viewed as a regression problem for assigning a sentiment score of say 1 to 10 for a corresponding image, text or speech.

Another way of looking at sentiment analysis is to consider it using a reinforcement learning perspective where the algorithm constantly learns from the accuracy of past sentiment analysis performed to improve the future performance.

Q3. Can decision trees be used for performing clustering?

A. True B. False

Solution: (A) Decision trees can also be used to form clusters in the data, but clustering often generates natural clusters and is not dependent on any objective function.

Q4. Which of the following is the most appropriate strategy for data cleaning before performing clustering analysis, given a less than desirable number of data points?

1. Capping and flooring of variables

2. Removal of outliers

Options:

A. 1 only

B. 2 only

C. 1 and 2

D. None of the above

Solution: (A) Removal of outliers is not recommended if the data points are few in number. In this scenario, capping and flooring of variables is the most appropriate strategy.

Q5. What is the minimum no. of variables/ features required to perform clustering?

A. 0

B. 1

C. 2

D. 3

Solution: (B) At least a single variable is required to perform clustering analysis. Clustering analysis with a single variable can be visualized with the help of a histogram.

Q6. For two runs of K-Means clustering, is it expected to get the same clustering results?

A. Yes

B. No

Solution: (B) The K-Means clustering algorithm converges to a local minimum, which may also be the global minimum in some cases but not always. Therefore, it’s advised to run the K-Means algorithm multiple times before drawing inferences about the clusters.

However, note that it’s possible to get the same clustering results from K-Means by setting the same seed value for each run. That works simply by making the algorithm choose the same set of random numbers for each run.
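To make the seed point concrete, here is a minimal sketch of a 1-D K-Means written from scratch. The function name `kmeans`, the `seed` parameter and the toy data are my own assumptions for illustration, not any particular library’s API.

```python
import random


def kmeans(points, k, seed, iterations=20):
    """Tiny 1-D K-Means; `seed` fixes the random choice of starting centroids."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # random initial centroids
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Keep the old centroid if a cluster ends up empty.
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return sorted(centroids)


data = [1.0, 1.2, 0.8, 5.0, 5.3, 4.9, 9.0, 9.2, 8.8]  # hypothetical data
same_a = kmeans(data, k=3, seed=42)
same_b = kmeans(data, k=3, seed=42)
print(same_a == same_b)  # identical seeds give identical runs
```

With different seeds the starting centroids differ, so the runs may or may not end in the same clusters.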

Q7. Is it possible that the assignment of observations to clusters does not change between successive iterations in K-Means?

A. Yes B. No

C. Can’t say D. None of these

Solution: (A) When the K-Means algorithm has reached a local or global minimum, it will not alter the assignment of data points to clusters for two successive iterations.

Q8. Which of the following can act as possible termination conditions in K-Means?

1. For a fixed number of iterations.

2. Assignment of observations to clusters does not change between iterations, except for cases with a bad local minimum.

3. Centroids do not change between successive iterations.

4. Terminate when RSS falls below a threshold.

Options:

A. 1, 3 and 4

B. 1, 2 and 3

C. 1, 2 and 4

D. All of the above

Solution: (D) All four conditions can be used as possible termination conditions in K-Means clustering:

1. This condition limits the runtime of the clustering algorithm, but in some cases the quality of the clustering will be poor because of an insufficient number of iterations.

2. Except for cases with a bad local minimum, this produces a good clustering, but runtimes may be unacceptably long.

3. This also ensures that the algorithm has converged at the minima.

4. Terminate when RSS falls below a threshold. This criterion ensures that the clustering is of a desired quality after termination. Practically, it’s a good practice to combine it with a bound on the number of iterations to guarantee termination.
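The four stopping rules above can be sketched in a single loop. This is a minimal 1-D version under my own assumptions: the function name `kmeans_1d`, its parameters and the returned reason strings are hypothetical, chosen only to show where each condition would sit.

```python
def kmeans_1d(points, centroids, max_iter=100, rss_threshold=0.0):
    """1-D K-Means that also reports which stopping rule ended the run."""
    assignment = None
    for iteration in range(1, max_iter + 1):
        # Assignment step: index of the nearest centroid for every point.
        new_assignment = [
            min(range(len(centroids)), key=lambda i: (p - centroids[i]) ** 2)
            for p in points
        ]
        # Update step: mean of each cluster (keep the old centroid if empty).
        new_centroids = []
        for i, old in enumerate(centroids):
            members = [p for p, a in zip(points, new_assignment) if a == i]
            new_centroids.append(sum(members) / len(members) if members else old)
        rss = sum((p - new_centroids[a]) ** 2
                  for p, a in zip(points, new_assignment))

        if new_assignment == assignment:      # condition 2
            return new_centroids, iteration, "assignments unchanged"
        if new_centroids == centroids:        # condition 3
            return new_centroids, iteration, "centroids unchanged"
        if rss < rss_threshold:               # condition 4
            return new_centroids, iteration, "RSS below threshold"
        assignment, centroids = new_assignment, new_centroids
    return centroids, max_iter, "iteration limit reached"  # condition 1


points = [1.0, 2.0, 9.0, 10.0]  # hypothetical data
final, n_iter, reason = kmeans_1d(points, [1.0, 2.0])
print(final, n_iter, reason)
```

Passing `max_iter=1` or a generous `rss_threshold` to the same call makes one of the other rules fire first, which is why combining the RSS check with an iteration bound is the safer setup.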

Q9. Which of the following clustering algorithms suffers from the problem of convergence at local optima?

1. K-Means clustering algorithm

2. Agglomerative clustering algorithm

3. Expectation-Maximization clustering algorithm

4. Divisive clustering algorithm

Options:

A. 1 only

B. 2 and 3

C. 2 and 4

D. 1 and 3

E. 1, 2 and 4

F. All of the above

Solution: (D) Out of the options given, only the K-Means clustering algorithm and the EM clustering algorithm have the drawback of converging at local minima.

Q10. Which of the following algorithms is most sensitive to outliers?

A. K-means clustering algorithm

B. K-medians clustering algorithm

C. K-modes clustering algorithm

D. K-medoids clustering algorithm

Solution: (A) Out of all the options, K-Means clustering algorithm is most sensitive to outliers as it uses the mean of cluster data points to find the cluster center.
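A quick way to see why the mean is the weak spot: add one extreme value to a small cluster and compare how far the mean and the median move. The numbers below are hypothetical and only serve the illustration.

```python
from statistics import mean, median

cluster = [10, 11, 12, 13, 14]   # hypothetical cluster members
with_outlier = cluster + [100]   # the same cluster plus one outlier

mean_shift = mean(with_outlier) - mean(cluster)        # how far the mean moves
median_shift = median(with_outlier) - median(cluster)  # how far the median moves
print(mean_shift, median_shift)
```

The mean-based center is dragged towards the outlier, while the median barely moves, which is the intuition behind K-medians being less sensitive.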

Q12. How can clustering (unsupervised learning) be used to improve the accuracy of a linear regression model (supervised learning)?

1. Creating different models for different cluster groups.

2. Creating an input feature for cluster ids as an ordinal variable.

3. Creating an input feature for cluster centroids as a continuous variable.

4. Creating an input feature for cluster size as a continuous variable.

Options:

A. 1 only

B. 1 and 2

C. 1 and 4

D. 3 only

E. 2 and 4

F. All of the above

Solution: (F)

Creating an input feature for cluster ids as ordinal variable or creating an input feature for cluster centroids as a continuous variable might not convey any relevant information to the regression model for multidimensional data. But for clustering in a single dimension, all of the given methods are expected to convey meaningful information to the regression model. For example, to cluster people in two groups based on their hair length, storing clustering ID as ordinal variable and cluster centroids as continuous variables will convey meaningful information.

Q13. What could be the possible reason(s) for producing two different dendrograms using the agglomerative clustering algorithm for the same dataset?

A. Proximity function used

B. No. of data points used

C. No. of variables used

D. B and C only

E. All of the above

Solution: (E) A change in any of the proximity function, the no. of data points or the no. of variables will lead to different clustering results and hence different dendrograms.

Q15. What is the most appropriate no. of clusters for the data points represented by the following dendrogram:

A. 2

B. 4

C. 6

D. 8

Solution: (B) The no. of clusters that best depicts the different groups can be chosen by observing the dendrogram. The best choice is the no. of vertical lines in the dendrogram cut by a horizontal line that can traverse the maximum distance vertically without intersecting a cluster.

In this example, the best choice of the no. of clusters is 4, as the red horizontal line in the dendrogram covers the maximum vertical distance AB.
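The “longest vertical stretch” rule can be sketched in a few lines if we assume we have the heights at which the merges happen. The function name and the merge heights below are hypothetical; they are not read off the dendrogram in the question.

```python
def clusters_from_largest_gap(merge_heights):
    """Pick the number of clusters by cutting inside the widest gap
    between successive merge heights of a dendrogram."""
    heights = sorted(merge_heights)
    gaps = [heights[i + 1] - heights[i] for i in range(len(heights) - 1)]
    widest = max(range(len(gaps)), key=gaps.__getitem__)
    # n points produce n - 1 merges; after the first (widest + 1) merges
    # there are n - (widest + 1) = len(heights) - widest clusters left.
    return len(heights) - widest


# Hypothetical merge heights for 8 points (7 merges).
heights = [1.0, 1.5, 2.0, 2.5, 9.0, 10.0, 11.0]
best_k = clusters_from_largest_gap(heights)
print(best_k)
```

Here the widest gap lies between the heights 2.5 and 9.0, so a horizontal cut there crosses four vertical lines.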

Q16. In which of the following cases will K-Means clustering fail to give good results?

1. Data points with outliers

2. Data points with different densities

3. Data points with round shapes

4. Data points with non-convex shapes

Options:

A. 1 and 2

B. 2 and 3

C. 2 and 4

D. 1, 2 and 4

E. 1, 2, 3 and 4

Solution: (D) K-Means clustering algorithm fails to give good results when the data contains outliers, the density spread of data points across the data space is different and the data points follow non-convex shapes.

Q17. Which of the following metrics do we have for finding dissimilarity between two clusters in hierarchical clustering?

1. Single-link

2. Complete-link

3. Average-link

Options:

A. 1 and 2

B. 1 and 3

C. 2 and 3

D. 1, 2 and 3

Solution: (D) All of the three methods i.e. single link, complete link and average link can be used for finding dissimilarity between two clusters in hierarchical clustering.
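As a small sketch of how the three linkage criteria differ, the snippet below computes all of them from the pairwise Euclidean distances between two clusters. The helper name `linkage_distances` and the two toy clusters are assumptions made for illustration.

```python
from itertools import product
from math import dist


def linkage_distances(cluster_a, cluster_b):
    """Single, complete and average linkage between two clusters of points."""
    pairwise = [dist(p, q) for p, q in product(cluster_a, cluster_b)]
    return {
        "single": min(pairwise),                   # closest pair
        "complete": max(pairwise),                 # farthest pair
        "average": sum(pairwise) / len(pairwise),  # mean over all pairs
    }


a = [(0, 0), (0, 1)]  # hypothetical cluster A
b = [(3, 0), (4, 0)]  # hypothetical cluster B
result = linkage_distances(a, b)
print(result)
```

By construction the single-link value can never exceed the average-link value, which in turn can never exceed the complete-link value.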

Q18. Which of the following are true?

1. Clustering analysis is negatively affected by multicollinearity of features

2. Clustering analysis is negatively affected by heteroscedasticity

Options:

A. 1 only

B. 2 only

C. 1 and 2

D. None of them

Solution: (A) Clustering analysis is not negatively affected by heteroscedasticity, but the results are negatively impacted by multicollinearity of the features/variables used in clustering, as the correlated features/variables will carry more weight in the distance calculation than desired.

Q19. What should be the best choice of no. of clusters based on the following results:

A. 1

B. 2

C. 3

D. 4

Solution: (C) The silhouette coefficient is a measure of how similar an object is to its own cluster compared to other clusters. The number of clusters for which the silhouette coefficient is highest represents the best choice of the number of clusters.
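For readers who want to see the silhouette coefficient computed by hand, here is a minimal 1-D sketch. It assumes at least two clusters and uses the absolute difference as the distance; the function name and the data are hypothetical and not taken from the plot in the question.

```python
def silhouette_score(points, labels):
    """Mean silhouette coefficient for 1-D points, using |p - q| as distance."""
    scores = []
    for i, (p, own_label) in enumerate(zip(points, labels)):
        # a: mean distance to the other members of the point's own cluster
        own = [abs(p - q) for j, (q, lab) in enumerate(zip(points, labels))
               if lab == own_label and j != i]
        if not own:
            scores.append(0.0)  # common convention for a single-point cluster
            continue
        a = sum(own) / len(own)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(abs(p - q) for q, lab in zip(points, labels) if lab == other)
            / labels.count(other)
            for other in set(labels) if other != own_label
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)


points = [1, 2, 3, 10, 11, 12]  # hypothetical 1-D data
score_two = silhouette_score(points, [0, 0, 0, 1, 1, 1])
score_three = silhouette_score(points, [0, 0, 1, 2, 2, 2])
print(score_two > score_three)
```

On this toy data the two-cluster labelling scores higher than the three-cluster one, so two clusters would be preferred.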

Q20. Which of the following is/are valid iterative strategy for treating missing values before clustering analysis?

A. Imputation with mean

B. Nearest Neighbor assignment

C. Imputation with Expectation Maximization algorithm

D. All of the above

Solution: (C) All of the mentioned techniques are valid for treating missing values before clustering analysis, but only imputation with the EM algorithm is iterative in its functioning.

Q21. The K-Means algorithm has some limitations. One of them is that it makes hard assignments of points to clusters (a point either belongs completely to a cluster or does not belong to it at all).

Note: Soft assignment can be considered as the probability of being assigned to each cluster (say K = 3 and, for some point xn, p1 = 0.7, p2 = 0.2, p3 = 0.1).

Which of the following algorithm(s) allows soft assignments?

1. Gaussian mixture models

2. Fuzzy K-means

Options:

A. 1 only

B. 2 only

C. 1 and 2

D. None of these

Solution: (C) Both Gaussian mixture models and Fuzzy K-means allow soft assignments.
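To show what a soft assignment looks like in numbers, here is a sketch of the membership probabilities a 1-D Gaussian mixture would give a single point. The mixture parameters are made up for illustration, and the helper names are my own rather than any library’s API.

```python
from math import exp, pi, sqrt


def gaussian_pdf(x, mean, std):
    """Density of a 1-D normal distribution at x."""
    return exp(-0.5 * ((x - mean) / std) ** 2) / (std * sqrt(2 * pi))


def responsibilities(x, weights, means, stds):
    """Probability that x belongs to each mixture component (soft assignment)."""
    weighted = [w * gaussian_pdf(x, m, s)
                for w, m, s in zip(weights, means, stds)]
    total = sum(weighted)
    return [v / total for v in weighted]


# Hypothetical mixture with K = 3 equally weighted components.
probs = responsibilities(2.0, weights=[1 / 3] * 3,
                         means=[0.0, 3.0, 6.0], stds=[1.0, 1.0, 1.0])
hard_label = probs.index(max(probs))  # what a hard assignment would keep
print(probs, hard_label)
```

The probabilities sum to 1, and a hard assignment in the K-Means style would keep only the most likely component.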

Q22. Assume you want to cluster 7 observations into 3 clusters using the K-Means clustering algorithm. After the first iteration, clusters C1, C2 and C3 have the following observations:

C1: {(2,2), (4,4), (6,6)}

C2: {(0,4), (4,0)}

C3: {(5,5), (9,9)}

What will be the cluster centroids if you want to proceed to the second iteration?

A. C1: (4,4), C2: (2,2), C3: (7,7)

B. C1: (6,6), C2: (4,4), C3: (9,9)

C. C1: (2,2), C2: (0,0), C3: (5,5)

D. None of these

Solution: (A) Finding centroid for data points in cluster C1 = ((2+4+6)/3, (2+4+6)/3) = (4, 4)

Finding centroid for data points in cluster C2 = ((0+4)/2, (4+0)/2) = (2, 2)

Finding centroid for data points in cluster C3 = ((5+9)/2, (5+9)/2) = (7, 7)

Hence, C1: (4,4), C2: (2,2), C3: (7,7)
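The same arithmetic can be checked with a few lines of code. The helper name `centroid` is just a label I picked for this sketch; the observations are the ones given in the question.

```python
def centroid(points):
    """Coordinate-wise mean of a list of 2-D points."""
    xs, ys = zip(*points)
    return (sum(xs) / len(xs), sum(ys) / len(ys))


clusters = {
    "C1": [(2, 2), (4, 4), (6, 6)],
    "C2": [(0, 4), (4, 0)],
    "C3": [(5, 5), (9, 9)],
}
centroids = {name: centroid(pts) for name, pts in clusters.items()}
print(centroids)
```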

Q23. Assume you want to cluster 7 observations into 3 clusters using the K-Means clustering algorithm. After the first iteration, clusters C1, C2 and C3 have the following observations:

C1: {(2,2), (4,4), (6,6)}

C2: {(0,4), (4,0)}

C3: {(5,5), (9,9)}

What will be the Manhattan distance of observation (9, 9) from the centroid of cluster C1 in the second iteration?

A. 10

B. 5*sqrt(2)

C. 13*sqrt(2)

D. None of these

Solution: (A) Manhattan distance between centroid C1, i.e. (4, 4), and (9, 9) = (9 - 4) + (9 - 4) = 10
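As a small check, the Manhattan distance can be computed next to the Euclidean one, which is where the 5*sqrt(2) option comes from. The helper name `manhattan` is an assumption of this sketch.

```python
from math import dist


def manhattan(p, q):
    """Sum of absolute coordinate differences between two points."""
    return sum(abs(a - b) for a, b in zip(p, q))


c1 = (4, 4)            # centroid of C1 after the first iteration
observation = (9, 9)
manhattan_d = manhattan(observation, c1)
euclidean_d = dist(observation, c1)  # for comparison only
print(manhattan_d, euclidean_d)
```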

Q24. Two variables, V1 and V2, are used for clustering. Which of the following are true for K-Means clustering with k = 3?

1. If V1 and V2 have a correlation of 1, the cluster centroids will be in a straight line

2. If V1 and V2 have a correlation of 0, the cluster centroids will be in a straight line

Options:

A. 1 only

B. 2 only

C. 1 and 2

D. None of the above

Solution: (A) If the correlation between the variables V1 and V2 is 1, then all the data points will be in a straight line. Hence, all the three cluster centroids will form a straight line as well.

Q25. Feature scaling is an important step before applying the K-Means algorithm. What is the reason behind this?

A. In distance calculation it will give the same weights for all features

B. You always get the same clusters, whether or not you use feature scaling

C. In Manhattan distance it is an important step but in Euclidean it is not

D. None of these

Solution: (A) Feature scaling ensures that all the features get the same weight in the clustering analysis. Consider a scenario of clustering people based on their weight (in kg) with range 55–110 and height (in feet) with range 5.6 to 6.4. In this case, the clusters produced without scaling can be very misleading, as the range of weight is much higher than that of height. Therefore, it’s necessary to bring them to the same scale so that they have equal weightage on the clustering result.
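Here is a rough sketch of that effect using min-max scaling. The four people below are hypothetical values picked inside the ranges from the example, and `min_max_scale` is my own helper, not a library function.

```python
def min_max_scale(values):
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


weights = [55, 70, 90, 110]    # hypothetical, within 55 to 110
heights = [5.6, 6.4, 5.8, 6.0]  # hypothetical, within 5.6 to 6.4


def weight_share(w, h):
    """Share of the squared distance between persons 0 and 1 due to weight."""
    dw, dh = (w[0] - w[1]) ** 2, (h[0] - h[1]) ** 2
    return dw / (dw + dh)


share_raw = weight_share(weights, heights)
share_scaled = weight_share(min_max_scale(weights), min_max_scale(heights))
print(share_raw, share_scaled)
```

Before scaling, almost the entire distance between the two people comes from weight; after scaling, the large height difference finally shows up.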

Q26. Which of the following methods is used for finding the optimal number of clusters in the K-Means algorithm?

A. Elbow method

B. Manhattan method

C. Euclidean method

D. All of the above

E. None of these

Solution: (A) Out of the given options, only elbow method is used for finding the optimal number of clusters. The elbow method looks at the percentage of variance explained as a function of the number of clusters: One should choose a number of clusters so that adding another cluster doesn’t give much better modeling of the data.

Q27. What is true about K-Mean Clustering?

1. K-means is extremely sensitive to cluster center initializations

2. Bad initialization can lead to poor convergence speed

3. Bad initialization can lead to bad overall clustering

Options:

A. 1 and 3

B. 1 and 2

C. 2 and 3

D. 1, 2 and 3

Solution: (D)

All three of the given statements are true. K-Means is extremely sensitive to cluster center initialization. Also, bad initialization can lead to poor convergence speed as well as bad overall clustering.

Q28. Which of the following can be applied to get good results for K-means algorithm corresponding to global minima?

1. Try to run algorithm for different centroid initialization

2. Adjust number of iterations

3. Find out the optimal number of clusters

Options:

A. 2 and 3

B. 1 and 3

C. 1 and 2

D. All of above

Solution: (D) All of these are standard practices that are used in order to obtain good clustering results.

Q29. What should be the best choice for number of clusters based on the following results:

A. 5

B. 6

C. 14

D. Greater than 14

Solution: (B) Based on the above results, the best choice of number of clusters using elbow method is 6.

Q30. What should be the best choice for number of clusters based on the following results:

A. 2

B. 4

C. 6

D. 8

Solution: (C)

Generally, a higher average silhouette coefficient indicates better clustering quality. In this plot, the average silhouette coefficient is highest at k = 2, so on that measure alone the optimal number of clusters would be 2. However, the SSE of this clustering solution (k = 2) is too large. At k = 6, the SSE is much lower. In addition, the average silhouette coefficient at k = 6 is also very high, second only to k = 2. Thus, the best choice is k = 6.

Q31. Which of the following sequences is correct for a K-Means algorithm using Forgy method of initialization?

1. Specify the number of clusters

2. Assign cluster centroids randomly

3. Assign each data point to the nearest cluster centroid

4. Re-assign each point to nearest cluster centroids

5. Re-compute cluster centroids

Options:

A. 1, 2, 3, 5, 4

B. 1, 3, 2, 4, 5

C. 2, 1, 3, 4, 5

D. None of these

Solution: (A) The methods used for initialization in K means are Forgy and Random Partition. The Forgy method randomly chooses k observations from the data set and uses these as the initial means. The Random Partition method first randomly assigns a cluster to each observation and then proceeds to the update step, thus computing the initial mean to be the centroid of the cluster’s randomly assigned points.
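The difference between the two initialization schemes is easy to see in code. This is a minimal sketch; the function names, the empty-cluster fallback and the sample points are my own assumptions.

```python
import random


def forgy_init(points, k, rng):
    """Forgy: pick k observations at random and use them as initial means."""
    return rng.sample(points, k)


def random_partition_init(points, k, rng):
    """Random Partition: assign every observation to a random cluster,
    then use each cluster's centroid as its initial mean."""
    labels = [rng.randrange(k) for _ in points]
    centroids = []
    for cluster in range(k):
        members = [p for p, label in zip(points, labels) if label == cluster]
        if not members:  # assumption: fall back to one random observation
            members = [rng.choice(points)]
        centroids.append(tuple(sum(c) / len(members) for c in zip(*members)))
    return centroids


points = [(1, 1), (1, 2), (8, 8), (9, 8), (5, 1), (6, 2)]  # hypothetical data
rng = random.Random(0)
forgy = forgy_init(points, 2, rng)
partition = random_partition_init(points, 2, rng)
print(forgy, partition)
```

Forgy always starts from actual observations, while Random Partition starts from averages of randomly formed groups.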

Q32. If you are using Multinomial mixture models with the expectation-maximization algorithm for clustering a set of data points into two clusters, which of the assumptions are important:

A. All the data points follow two Gaussian distributions

B. All the data points follow n Gaussian distributions (n > 2)

C. All the data points follow two multinomial distributions

D. All the data points follow n multinomial distributions (n > 2)

Solution: (C) In the EM algorithm for clustering, it’s essential to choose the same no. of clusters to classify the data points into as the no. of different distributions they are expected to be generated from, and the distributions must also be of the same type.

Q33. Which of the following is/are not true about Centroid based K-Means clustering algorithm and Distribution based expectation-maximization clustering algorithm:

1. Both start with random initializations

2. Both are iterative algorithms

3. Both have strong assumptions that the data points must fulfill

4. Both are sensitive to outliers

5. Expectation maximization algorithm is a special case of K-Means

6. Both require prior knowledge of the no. of desired clusters

7. The results produced by both are non-reproducible.

Options:

A. 1 only

B. 5 only

C. 1 and 3

D. 6 and 7

E. 4, 6 and 7

F. None of the above

Solution: (B) All of the above statements are true except the 5th: instead, K-Means is a special case of the EM algorithm in which only the centroids of the cluster distributions are calculated at each iteration.

Q34. Which of the following is/are not true about DBSCAN clustering algorithm:

1. For data points to be in a cluster, they must be within a distance threshold of a core point

2. It has strong assumptions for the distribution of data points in dataspace

3. It has substantially high time complexity of order O(n^3)

4. It does not require prior knowledge of the no. of desired clusters

5. It is robust to outliers

Options:

A. 1 only

B. 2 only

C. 4 only

D. 2 and 3

E. 1 and 5

F. 1, 3 and 5

Solution: (D) DBSCAN can form clusters of any arbitrary shape and does not have strong assumptions about the distribution of data points in the dataspace. DBSCAN has a low time complexity, of order O(n log n) when a spatial index is used for the neighborhood queries.

Q35. Which of the following are the lower and upper bounds of the F-score?

A. [0,1]

B. (0,1)

C. [-1,1]

D. None of the above

Solution: (A) The lowest and highest possible values of the F-score are 0 and 1, with 1 representing that every data point is assigned to the correct cluster and 0 representing that the precision and/or recall of the clustering analysis is 0. In clustering analysis, a high F-score is desired.

Q36. Following are the results observed for clustering 6000 data points into 3 clusters: A, B and C:

What is the F1-Score with respect to cluster B?

A. 0.3

B. 0.4

C. 0.5

D. 0.6

Solution: (C) Here,

True Positive, TP = 1200

True Negative, TN = 600 + 1600 = 2200

False Positive, FP = 1000 + 200 = 1200

False Negative, FN = 400 + 400 = 800

Therefore,

Precision = TP / (TP + FP) = 0.5

Recall = TP / (TP + FN) = 0.6

Hence,

F1 = 2 * (Precision * Recall) / (Precision + Recall) ≈ 0.545 ≈ 0.5
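The calculation above can be reproduced in a few lines, using only the TP, FP and FN counts stated in the solution. The variable names are my own.

```python
tp, fp, fn = 1200, 1200, 800  # counts for cluster B, as stated above

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
print(round(precision, 2), round(recall, 2), round(f1, 3))
```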

I hope you find this questionnaire useful for your career. Credit goes to Analytics Vidhya, from which I was able to gather this set of interview questions for you.

Next, we will walk through a more advanced questionnaire in the next article, Part V, which will again surprise you!

Thanks again for your time. If you enjoyed this short article, there are tons of topics in advanced analytics, data science, and machine learning available in my Medium repo: https://medium.com/@bobrupakroy

Some of my alternative internet presences: Facebook, Instagram, Udemy, Blogger, Issuu, Slideshare, Scribd and more.

Also available on Quora @ https://www.quora.com/profile/Rupak-Bob-Roy

Let me know if you need anything. Talk Soon.
