# 60 Data Science Interview Questions & Answers – Crack Technical Interview Now!

Welcome to part 2 of DataFlair’s Data Science Interview Questions and Answers series. In the first part, we discussed some basic data science questions that could be asked in your next interview, especially if you are a fresher in Data Science. Today, I am sharing a list of the top 60 Data Science Interview Questions and Answers; this list consists mostly of technical questions. The following topics are covered –

- **Python** Data Science Interview Questions and Answers
- **ML & Statistics** Data Science Interview Questions and Answers
- **General** Data Science Interview Questions and Answers
- **Scenario-based** Data Science Interview Questions and Answers
- **Behavior-based** Data Science Interview Questions

After going through our previous Data Science Interview Questions, we hope you have cleared up all the basic concepts related to data science. If you haven’t done so, I highly recommend you go through them before continuing with this part.

A Data Science Interview is a test not only of your knowledge, but also of your ability to apply it at the right time.

## Python Interview Questions for Data Science

Every data science interview has many Python-related questions, so if you really want to crack your next data science interview, you need to master Python. An interviewer can ask two kinds of Python Data Science Interview Questions: the first asks you to solve a Python programming question in theory, without writing code, and the second asks you to solve it by writing code. This section of Data Science Interview Questions and Answers will prepare you for both situations. Have a look –

**Q.1 What is a lambda expression in Python?**

**Ans.** With the help of a lambda expression, you can create an anonymous function. Unlike conventional functions defined with def, the body of a lambda function is limited to a single expression, so it usually occupies a single line of code. The basic syntax of a lambda function is –

**lambda arguments: expression**

An example of lambda function in Python data science is –

**x = lambda a : a * 5**

**print(x(5))**

**We obtain the output of 25.**
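In practice, lambdas are mostly passed as short arguments to other functions. Here is a minimal sketch of that usage; the names `times_five_def`, `records`, `by_age` and `doubled` are made up for illustration and are not part of the original example.

```python
# A lambda is an ordinary function object; it just has no name of its own.
times_five = lambda a: a * 5

# The equivalent conventional function, for comparison.
def times_five_def(a):
    return a * 5

# Lambdas are handy as short, throwaway arguments to other functions.
# The records below are made-up illustration data.
records = [("alice", 31), ("bob", 25), ("carol", 28)]
by_age = sorted(records, key=lambda record: record[1])
doubled = list(map(lambda n: n * 2, [1, 2, 3]))

print(times_five(5))   # 25
print(by_age[0])       # ('bob', 25)
print(doubled)         # [2, 4, 6]
```

Both `times_five` and `times_five_def` behave the same way; the lambda form simply saves a few lines when the function is only needed once.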

**Q.2 How will you measure the Euclidean distance between the two arrays in numpy?**

**Ans.** In order to measure the Euclidean distance between the two arrays, we will first initialize our two arrays, then we will use the linalg.norm() function provided by the numpy library. Here, numpy is imported as np.

a = np.array([1,2,3,4,5])

b = np.array([6,7,8,9,10])

e_dist = np.linalg.norm(a-b)

print(e_dist)

We will obtain the output as –

**11.180339887498949**

**Q.3 How will you create an identity matrix using numpy?**

**Ans.** In order to create the identity matrix with numpy, we will use the identity() function, where numpy is imported as np.

np.identity(3)

We will obtain the output as –

**array([[1., 0., 0.],**

**[0., 1., 0.],**

**[0., 0., 1.]])**

**Q.4 You mentioned Python as one of the tools for solving data science problems. Can you tell me the various libraries of Python that are used in Data Science?**

**Ans.** Some of the important libraries of Python that are used in Data Science are –

- Numpy
- SciPy
- Pandas
- Matplotlib
- Keras
- TensorFlow
- Scikit-learn

**To crack your next Data Science Interview, you need to learn these top Python Libraries now.**

**Q.5 How do you create a 1-D array in numpy?**

**Ans.** You can create a 1-D array in numpy as follows:

x = np.array([1,2,3,4])

Where numpy is imported as np

**Q.6 What function of numpy will you use to find maximum value from each row in a 2D numpy array?**

**Ans.** In order to find the maximum value from each row in a 2D numpy array, we will use the amax() function as follows –

np.amax(input, axis=1)

Where numpy is imported as np and input is the input array.
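To see what the axis argument does, here is a small sketch on a made-up array. The array `data` and the variable names are assumptions for illustration; np.max is the same function as np.amax under another name.

```python
import numpy as np

# Made-up 2D array: 3 rows, 4 columns.
data = np.array([[3, 7, 1, 9],
                 [4, 2, 8, 6],
                 [5, 5, 0, 1]])

row_max = np.amax(data, axis=1)       # one maximum per row
col_max = np.amax(data, axis=0)       # one maximum per column
row_argmax = np.argmax(data, axis=1)  # column index of each row maximum

print(row_max)     # [9 8 5]
print(col_max)     # [5 7 8 9]
print(row_argmax)  # [3 2 0]
```

With axis=1 the function collapses the columns and returns one value per row; with axis=0 it returns one value per column.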

**Q.7 Given two lists [1,2,3,4,5] and [6,7,8], you have to merge them into a single list. How will you achieve this?**

**Ans.** In order to merge the two lists into a single list, we will concatenate them with the + operator as follows –

list1 = [1,2,3,4,5]

list2 = [6,7,8]

list1 + list2

We will obtain the output as – **[1, 2, 3, 4, 5, 6, 7, 8]**

**Q.9 How to add a border that is filled with 0s around an existing array?**

**Ans.** In order to add a border filled with 0s around an array, we first import numpy as np, then make an array Z and initialize it with ones.

Z = np.ones((5,5))

Then, we perform padding on it with the help of the pad() function.

Z = np.pad(Z, pad_width=1, mode='constant', constant_values=0)

print(Z)

The result is a 7×7 array whose outer border is 0 and whose inner 5×5 block is 1.

**Q.10 Consider a (5,6,7) shape array, what is the index (x,y,z) of the 50th element?**

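**Ans.** We use the unravel_index() function, which converts a flat index into a tuple of coordinates for the given shape. NumPy counts from zero, so the 50th element has the flat index 49.

print(np.unravel_index(49,(5,6,7)))

This gives the index (1, 1, 0).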

**Q.11 How will you multiply a 4×3 matrix by a 3×2 matrix?**

**Ans.** There are two ways to do this. The first method uses the dot() function and works in every version of Python –

Z = np.dot(np.ones((4,3)), np.ones((3,2)))

print(Z)

[[3. 3.]

[3. 3.]

[3. 3.]

[3. 3.]]

The second method uses the @ matrix multiplication operator, which is available in Python 3.5 and later –

Z = np.ones((4,3)) @ np.ones((3,2))

**Note: This is not enough; a lot of NumPy-related questions are asked in Data Science Interviews. Therefore, DataFlair has published the Python NumPy Tutorial – An A to Z guide that will surely help you.**

## Machine Learning and Statistics Data Science Interview Questions and Answers

The next important part of our data science interview questions and answers covers mathematics, ML and Statistics. Without knowledge of these three, you cannot become a data scientist. For most candidates, statistics proves to be a tough part, so preparing it well can help you score in your data science interview. My tip is to thoroughly learn all the formulas and definitions related to it.

**Q.12 Can you name the type of biases that occur in machine learning?**

**Ans.** There are four main types of biases that occur while building machine learning algorithms –

- Sample Bias
- Prejudice Bias
- Measurement Bias
- Algorithm Bias

**Q.13 How is skewness different from kurtosis?**

**Ans.** In statistics, skewness is a measure of asymmetry in the distribution of data. In a symmetric distribution such as the normal distribution, the left and right tails mirror each other around the center, and the skewness is 0. A distribution exhibits negative skewness if the left tail is longer than the right one, and positive skewness if the right tail is longer than the left one.

Kurtosis, in contrast, measures how heavy the tails of the distribution are; it is also often described as the pointedness of the peak. The kurtosis of a normal distribution is 3. If the kurtosis exceeds 3, we say that the distribution has heavy tails. And, if the kurtosis is less than 3, we say that the distribution has thin tails.
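Both measures can be computed as standardized moments. The sketch below is one possible way to do it with plain numpy, assuming the population (not sample-corrected) definitions; the helper names `skewness` and `kurtosis` and the simulated samples are made up for illustration.

```python
import numpy as np

def skewness(values):
    """Third standardized moment: 0 for a symmetric sample."""
    x = np.asarray(values, dtype=float)
    z = (x - x.mean()) / x.std()   # population standard deviation
    return float(np.mean(z ** 3))

def kurtosis(values):
    """Fourth standardized moment: about 3 for normally distributed data."""
    x = np.asarray(values, dtype=float)
    z = (x - x.mean()) / x.std()
    return float(np.mean(z ** 4))

rng = np.random.default_rng(0)             # fixed seed, illustration only
normal_sample = rng.normal(size=100_000)
right_skewed = rng.exponential(size=100_000)

# Normal data: skewness near 0 and kurtosis near 3.
print(skewness(normal_sample), kurtosis(normal_sample))
# Exponential data: long right tail, so positive skewness and kurtosis above 3.
print(skewness(right_skewed), kurtosis(right_skewed))
```

Note that some libraries report excess kurtosis (kurtosis minus 3) by default, so a normal distribution shows up as 0 there.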

**Q.14 In a univariate linear least squares regression, what is the relationship between the correlation coefficient and coefficient of determination?**

**Ans.** In a univariate linear least squares regression, the coefficient of determination is the square of the correlation coefficient. The coefficient of determination, R squared, tells us the proportion of the variability of the dependent variable that is explained by the independent one.
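This relationship is easy to check numerically. The following is a small sketch on made-up data: it computes the Pearson correlation coefficient and, separately, R squared from a least squares line fitted with np.polyfit.

```python
import numpy as np

# Made-up data with a roughly linear trend.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1, 11.7])

# Pearson correlation coefficient r.
r = np.corrcoef(x, y)[0, 1]

# Least squares line y_hat = slope * x + intercept.
slope, intercept = np.polyfit(x, y, deg=1)
y_hat = slope * x + intercept

# Coefficient of determination: 1 - SS_residual / SS_total.
ss_res = np.sum((y - y_hat) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r_squared = 1 - ss_res / ss_tot

print(r ** 2, r_squared)  # the two values agree up to rounding
```

The equality only holds for a least squares fit with a single independent variable and an intercept, which is the case the question asks about.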

**If there is any concept in Machine Learning that you have missed, DataFlair has come up with the complete Machine Learning Tutorial Library. Save the page and learn everything for free at any time.**

**Q.15 What is z-score?**

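**Ans.** Z-score, also known as the standard score, is the number of standard deviations that a data-point lies from the mean. It is calculated as Z = (X - μ)/σ, where μ is the population mean and σ is the standard deviation; a positive Z-score means the data-point is above the mean and a negative one means it is below. Z-scores are not bounded, but for normally distributed data about 99.7% of the values have a Z-score between -3 and +3.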
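As a quick illustration, here is a minimal sketch that computes the z-score of every value in a made-up sample using only the standard library. The helper name `z_scores` and the `heights` list are assumptions for the example.

```python
from statistics import mean, pstdev

def z_scores(values):
    """Return the z-score of every value, using the population std. dev."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / sigma for v in values]

# Made-up sample of heights in cm.
heights = [150, 160, 170, 180, 190]
scores = z_scores(heights)
print(scores)  # the middle value (the mean) gets a z-score of 0
```

Values below the mean come out negative, values above it positive, and the z-scores of a whole sample always sum to zero.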

**Q.16 What is the formula of Logistic Regression?**

**Ans.** The formula of Logistic Regression is:

$$P = \frac{e^{a + bX}}{1 + e^{a + bX}}$$

Where P represents the probability, X is the independent variable, e is the base of natural logarithms, and a and b are the parameters of the logistic regression model.
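A minimal sketch of the logistic function used in logistic regression, written in the algebraically equivalent form P = 1 / (1 + e^-(a + bX)). The function name and the parameter values a = -4 and b = 2 are hypothetical and chosen only for illustration.

```python
import math

def logistic_probability(x, a, b):
    """P = 1 / (1 + e^-(a + b*x)), the same as e^(a+bx) / (1 + e^(a+bx))."""
    return 1.0 / (1.0 + math.exp(-(a + b * x)))

# Hypothetical parameters, chosen only for illustration.
a, b = -4.0, 2.0
for x in (0, 2, 4):
    print(x, round(logistic_probability(x, a, b), 4))
```

Whatever the input, the output stays strictly between 0 and 1, and it equals 0.5 exactly where a + bX is 0 (here at X = 2).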

**Do you know – There is hardly any Data Science Interview in which a question on logistic regression is not asked. What are you waiting for? Start learning logistic regression with the best ever guide.**

**Q.17 For tuning hyperparameters of your machine learning model, what will be the ideal seed?**

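**Ans.** There is no fixed or ideal value for the seed. The seed only initializes the random number generator; it is not itself a hyperparameter to tune. Any value works, as long as the same seed is used across runs so that the hyperparameter tuning results are reproducible and comparable.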

**Q.18 Explain the difference between Eigenvalue and Eigenvectors.**

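**Ans.** For a square matrix A, an eigenvector is a non-zero vector v whose direction is unchanged by the linear transformation that A represents, so that Av = λv. The eigenvalue λ is the factor by which that eigenvector is stretched or shrunk, that is, the degree of the linear transformation along that direction. In data science, eigenvalues and eigenvectors are usually calculated for a correlation or covariance matrix.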
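The defining property, that multiplying a matrix A by an eigenvector v gives the same vector scaled by its eigenvalue, can be checked with np.linalg.eig. The matrix below is made up for illustration (think of a tiny covariance matrix).

```python
import numpy as np

# Made-up symmetric matrix (for example, a small covariance matrix).
A = np.array([[2.0, 1.0],
              [1.0, 2.0]])

eigenvalues, eigenvectors = np.linalg.eig(A)

# Column i of `eigenvectors` is the eigenvector for eigenvalues[i].
for i in range(len(eigenvalues)):
    v = eigenvectors[:, i]
    lam = eigenvalues[i]
    # A @ v points in the same direction as v, scaled by lam.
    print(lam, np.allclose(A @ v, lam * v))
```

For this matrix the eigenvalues are 3 and 1; the order in which they are returned is not guaranteed.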

**Q.19 Is it true that Pearson captures the monotonic behavior of the relation between the two variables whereas Spearman captures how linearly dependent the two variables are?**

**Ans.** No. It is actually the opposite. Pearson evaluates the linear relationship between the two variables whereas Spearman evaluates the monotonic behavior that the two variables share in a relationship.
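The difference shows up clearly on data that is monotonic but not linear. In the sketch below, the rank correlation is computed as the Pearson correlation of the ranks, which assumes there are no tied values; the helper names and the data are made up for illustration.

```python
import numpy as np

def pearson(x, y):
    """Pearson correlation coefficient of two 1-D arrays."""
    return float(np.corrcoef(x, y)[0, 1])

def spearman(x, y):
    """Pearson correlation of the ranks (assumes no tied values)."""
    rank_x = np.argsort(np.argsort(x))
    rank_y = np.argsort(np.argsort(y))
    return pearson(rank_x, rank_y)

x = np.arange(1, 11, dtype=float)
y = np.exp(x)   # strictly increasing, but far from a straight line

print(pearson(x, y))   # noticeably below 1: the relation is not linear
print(spearman(x, y))  # 1 up to rounding: the relation is perfectly monotonic
```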

**Q.20 How is standard deviation affected by the outliers?**

**Ans.** In the formula for standard deviation –

$$\sigma = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2}$$

each value x contributes the square of its distance from the mean μ. An outlier, that is, a value of x that is unusually high or low, lies far from the mean, so its squared deviation is large and it inflates the standard deviation. Therefore, we conclude that outliers will have an effect on the standard deviation.
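A small sketch of this effect, using the population standard deviation from the standard library on made-up measurements. The second list is identical to the first except for one extreme value.

```python
from statistics import pstdev

# Made-up measurements; the second list adds a single extreme value.
values = [10, 12, 11, 13, 12, 11]
with_outlier = values + [60]

print(pstdev(values))        # small spread
print(pstdev(with_outlier))  # much larger because of one outlier
```

A single outlier is enough to increase the standard deviation many times over, which is why it is worth checking for outliers before relying on it.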

**Q.21 How will you create a decision tree?**

**Ans.** A decision tree is built recursively. At each node, we look at the examples that reach it and apply the following rules:

- If both positive and negative examples are present, we select the best attribute for splitting them.
- If all the remaining examples are positive, answer yes. If all of them are negative, answer no.
- When there are no observed examples, we select a default based on the majority classification at the parent.
- If no attributes are remaining but both positive and negative examples are still present, the examples cannot be separated. This means that there are not sufficient features for classification or an error is present in the examples.
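The rules above can be sketched as a short recursive function. This is only an illustrative sketch, not a production algorithm: it picks the attribute that leaves the fewest misclassified examples, whereas real implementations usually use information gain or Gini impurity. All function names and the toy data are made up.

```python
from collections import Counter

def majority(examples, default=None):
    """Most common label among (features, label) pairs, or the default."""
    if not examples:
        return default
    return Counter(label for _, label in examples).most_common(1)[0][0]

def errors_after_split(examples, attribute):
    """Examples a majority vote in each branch would misclassify."""
    branches = {}
    for features, label in examples:
        branches.setdefault(features[attribute], []).append(label)
    return sum(len(labels) - Counter(labels).most_common(1)[0][1]
               for labels in branches.values())

def build_tree(examples, attributes, values, default=None):
    if not examples:                  # no observed examples: parent's majority
        return default
    labels = {label for _, label in examples}
    if len(labels) == 1:              # all positive or all negative
        return labels.pop()
    if not attributes:                # mixed labels, nothing left to split on
        return majority(examples)
    best = min(attributes, key=lambda a: errors_after_split(examples, a))
    remaining = [a for a in attributes if a != best]
    parent_majority = majority(examples)
    tree = {"attribute": best, "branches": {}}
    for value in values[best]:
        subset = [(f, l) for f, l in examples if f[best] == value]
        tree["branches"][value] = build_tree(subset, remaining, values,
                                             parent_majority)
    return tree

def predict(tree, features):
    """Follow the branches until a leaf (a plain label) is reached."""
    while isinstance(tree, dict):
        tree = tree["branches"][features[tree["attribute"]]]
    return tree

# Made-up toy data: should we play outside?
values = {"outlook": ["sunny", "rainy"], "windy": [True, False]}
examples = [
    ({"outlook": "sunny", "windy": False}, "yes"),
    ({"outlook": "sunny", "windy": True}, "yes"),
    ({"outlook": "rainy", "windy": False}, "yes"),
    ({"outlook": "rainy", "windy": True}, "no"),
]
tree = build_tree(examples, ["outlook", "windy"], values)
print(predict(tree, {"outlook": "rainy", "windy": True}))  # no
```

Each `if` at the top of `build_tree` corresponds to one of the rules in the list, and the final loop is the splitting step.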

**Master the concept of decision trees and answer all the Data Science Interview Questions related to it confidently.**

**Q.22 What is regularization? How is it useful?**

**Ans.** Regularization refers to techniques that add a penalty on model complexity while fitting a function on the training set, in order to avoid overfitting and reduce the error on unseen data.

While training the model, there is a high chance of the model learning noise or the data-points that do not represent any property of your true data. This can lead to overfitting. Therefore, in order to minimize this form of error, we use regularization in our machine learning models.

**Q.23 Given a linear equation: 2x + 8 = y for the following data-points:**

| X | Y |
|---|---|
| 5 | 10 |
| 6 | 21 |
| 7 | 26 |
| 8 | 19 |
| 9 | 30 |

**What will be the corresponding Mean Absolute Error?**

**Ans.** In order to calculate the Mean Absolute Error, we first calculate the predicted value of y for each x as per the given linear equation. Then we calculate the absolute error between the predicted value and the actual value of Y. In the end, we find the average of the errors, which is our Mean Absolute Error.

| X | Actual Y | Predicted y = 2x + 8 | Absolute Error |
|---|---|---|---|
| 5 | 10 | 18 | 8 |
| 6 | 21 | 20 | 1 |
| 7 | 26 | 22 | 4 |
| 8 | 19 | 24 | 5 |
| 9 | 30 | 26 | 4 |

**Mean Absolute Error = (8 + 1 + 4 + 5 + 4) / 5 = 4.4**
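The same calculation can be reproduced in a few lines of Python. The sketch below uses the X values and the actual Y values from the answer table; the variable names are our own.

```python
# X values and actual Y values from the answer table.
xs = [5, 6, 7, 8, 9]
actual = [10, 21, 26, 19, 30]

# Predictions from the given linear equation y = 2x + 8.
predicted = [2 * x + 8 for x in xs]

# Absolute error per row, then the average of those errors.
abs_errors = [abs(a - p) for a, p in zip(actual, predicted)]
mae = sum(abs_errors) / len(abs_errors)

print(predicted)   # [18, 20, 22, 24, 26]
print(abs_errors)  # [8, 1, 4, 5, 4]
print(mae)         # 4.4
```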

## General Data Science Interview Questions and Answers

**Q.24 How is conditional random field different from hidden markov models?**

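**Ans.** Conditional Random Fields (CRFs) are discriminative in nature whereas Hidden Markov Models (HMMs) are generative models. In other words, a CRF models the conditional probability of the labels given the observations, while an HMM models the joint probability of the observations and the labels.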

**Q.25 What does the cost parameter in SVM stand for?**

**Ans.** The Cost (C) parameter in SVM decides how well the data should fit the model. It is used for adjusting the hardness or softness of the large margin classification. With a low cost, misclassifications are penalized lightly and we get a smooth decision surface, whereas a higher cost penalizes misclassifications heavily so that more training points are classified correctly.

**Q.26 Why is gradient descent stochastic in nature?**

**Ans.** The term stochastic means random. In the case of stochastic gradient descent, a sample is selected at random for each iteration instead of the whole dataset being used in a single iteration, which makes each update a noisy estimate of the true gradient.

**Q.27 How will you subtract means of each row of matrix?**

**Ans.** In order to subtract the means of each row of a matrix, we will use the mean() function as follows –

X = np.random.rand(5, 10)

Y = X - X.mean(axis=1, keepdims=True)

**Q.28 What do you mean by the law of large numbers?**

**Ans.** According to the law of large numbers, as the number of trials grows, the average of the observed results gets closer to the expected value. For example, the frequencies of occurrence of events that possess the same likelihood even out after a significant number of trials.
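A simple simulation makes this visible. The sketch below flips a simulated fair coin and reports the fraction of heads for increasing numbers of flips; the function name and the fixed seed are assumptions made so that the run is repeatable.

```python
import random

def heads_frequency(n_flips, seed=42):
    """Fraction of heads in n_flips simulated fair coin flips."""
    rng = random.Random(seed)   # fixed seed so the run is repeatable
    heads = sum(rng.random() < 0.5 for _ in range(n_flips))
    return heads / n_flips

# The frequency drifts towards 0.5 as the number of flips grows.
for n in (10, 1_000, 100_000):
    print(n, heads_frequency(n))
```

With only 10 flips the frequency can be far from 0.5, while with 100,000 flips it is typically within a fraction of a percent of it.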

**Q.29 Explain L1 and L2 Regularization**

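**Ans.** Both L1 and L2 regularizations are used to avoid overfitting in the model. L1 regularization, or Lasso, penalizes the sum of the absolute values of the coefficients and can shrink some of them to exactly zero, which removes those features from our model. L2 regularization, or Ridge, penalizes the sum of the squared coefficients; it shrinks all of them towards zero but does not usually remove features. Therefore, L1 regularization is the better choice when many of the features are irrelevant or noisy.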
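The different effect of the two penalties on the coefficients can be sketched for the simplified case of orthonormal features, where both solutions have a closed form. This is an assumption made for illustration: with the loss 0.5·‖y − Xw‖² plus λ‖w‖₁, each least squares coefficient is soft-thresholded, and with the penalty (λ/2)‖w‖² it is divided by 1 + λ. The coefficient values and λ below are hypothetical.

```python
import numpy as np

def l1_shrink(w, lam):
    """Soft-thresholding: the Lasso solution when features are orthonormal."""
    return np.sign(w) * np.maximum(np.abs(w) - lam, 0.0)

def l2_shrink(w, lam):
    """Proportional shrinkage: the Ridge solution when features are orthonormal."""
    return w / (1.0 + lam)

# Hypothetical unregularized coefficients and penalty strength.
w = np.array([3.0, -0.4, 0.2, -2.5])
lam = 0.5

print(l1_shrink(w, lam))  # small coefficients become exactly 0
print(l2_shrink(w, lam))  # every coefficient shrinks, none becomes 0
```

For general, correlated features there is no such closed form for Lasso, but the qualitative behavior is the same: L1 produces exact zeros and L2 does not.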

**Q.30 What do the Alpha and Beta Hyperparameter stand for in the Latent Dirichlet Allocation Model for text classification?**

**Ans.** In the Latent Dirichlet Allocation Model for text classification, Alpha controls the density of topics within a document: a higher Alpha means each document is assumed to be made up of more topics. Beta controls the density of terms within a topic: a higher Beta means each topic is assumed to be made up of more terms.

**Q.31 Is it true that the LogLoss evaluation metric can possess negative values?**

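**Ans.** No. Log Loss evaluation metric cannot possess negative values. It is the average of -log(p), where p is the probability that the model assigned to the correct class; since p lies between 0 and 1, -log(p) is always 0 or greater.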
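Here is a minimal sketch of binary log loss in plain Python, with made-up labels and predictions. The clipping constant `eps` is an assumption added to avoid taking the logarithm of 0.

```python
import math

def log_loss(y_true, y_prob, eps=1e-15):
    """Binary log loss; y_prob is the predicted probability of class 1."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)   # avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

# Made-up labels and predictions.
y_true = [1, 0, 1, 1]
good = [0.9, 0.1, 0.8, 0.7]
bad = [0.2, 0.9, 0.3, 0.1]

print(log_loss(y_true, good))  # small, but never below 0
print(log_loss(y_true, bad))   # larger: confident wrong answers are punished
```

The best possible value is 0, reached only when every correct class is predicted with probability 1.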

**Q.32 What is the formula of Stochastic Gradient Descent?**

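**Ans.** The formula for Stochastic Gradient Descent is as follows:

$$\theta := \theta - \alpha \, \nabla_{\theta} J(\theta; x^{(i)}, y^{(i)})$$

Where θ represents the parameters of the model, α is the learning rate, and J is the cost calculated on a single randomly selected training example (x⁽ⁱ⁾, y⁽ⁱ⁾).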
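As an illustration of the update rule, the sketch below fits a straight line y = w·x + b with stochastic gradient descent, updating the parameters after every single example. The function name, learning rate, number of epochs and the noise-free data are all assumptions chosen for the example.

```python
import random

def sgd_linear_regression(xs, ys, learning_rate=0.05, epochs=500, seed=0):
    """Fit y = w*x + b by updating on one randomly ordered example at a time."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)              # random order: the "stochastic" part
        for i in indices:
            error = (w * xs[i] + b) - ys[i]
            # Gradient of the squared error 0.5 * error**2 for this one example.
            w -= learning_rate * error * xs[i]
            b -= learning_rate * error
    return w, b

# Made-up noise-free data generated from y = 2x + 8.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2 * x + 8 for x in xs]

w, b = sgd_linear_regression(xs, ys)
print(w, b)  # approaches 2 and 8
```

Each parameter moves a small step against the gradient computed from one example only, which is what distinguishes this from batch gradient descent.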

**Q.33 Explain TF/IDF Vectorization**

**Ans.** TF/IDF stands for Term Frequency/Inverse Document Frequency. It is used for information retrieval and text mining, as a weighting factor that expresses the importance of a word to a document. This importance increases in proportion to the number of times the word occurs in the document, but is offset by the frequency of the word in the corpus, so that words which appear in most documents receive a low weight.
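A minimal sketch of the idea in plain Python. It assumes the simplest textbook variant, tf = count / document length and idf = log(N / document frequency); libraries often use smoothed versions of these formulas. The function name and the corpus are made up.

```python
import math

def tf_idf(corpus):
    """Return one {word: weight} dict per document.

    Uses tf = count / document length and idf = log(N / document frequency).
    """
    n_docs = len(corpus)
    tokenized = [doc.lower().split() for doc in corpus]

    # In how many documents does each word occur?
    doc_freq = {}
    for words in tokenized:
        for word in set(words):
            doc_freq[word] = doc_freq.get(word, 0) + 1

    weights = []
    for words in tokenized:
        doc_weights = {}
        for word in set(words):
            tf = words.count(word) / len(words)
            idf = math.log(n_docs / doc_freq[word])
            doc_weights[word] = tf * idf
        weights.append(doc_weights)
    return weights

# Made-up corpus of three tiny documents.
corpus = ["the cat sat", "the dog sat", "the cat chased the dog"]
weights = tf_idf(corpus)
print(weights[2])
```

In the third document, "the" occurs twice but gets a weight of 0 because it appears in every document, while "chased" gets the highest weight because it appears nowhere else.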

**Q.34 What is Softmax Function? What is the formula of Softmax Normalization?**

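**Ans.** Softmax Function is used for normalizing the input into a probability distribution over the output classes. Following is the formula for the Softmax Normalization:

$$\sigma(z)_i = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}$$

Where z is the input vector of K scores, one per class, and σ(z)ᵢ is the probability assigned to class i.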
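A minimal sketch of softmax in plain Python: each score is exponentiated and divided by the sum of all exponentials. Subtracting the maximum score first is a common numerical-stability trick and does not change the result; the function name and the example scores are our own.

```python
import math

def softmax(scores):
    """Turn a list of real-valued scores into probabilities that sum to 1."""
    # Subtracting the maximum does not change the result but avoids overflow.
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
print(probs, sum(probs))
```

The largest score always receives the largest probability, and the outputs sum to 1, which is what makes them usable as class probabilities.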

## Scenario-based Data Science Interview Questions and Answers

**Q.35 Suppose that you have to train your neural networks over a dataset of 20 GB. You have a RAM of 3 GB. How will you resolve this problem of training large data?**

**Ans.** We will train our neural network with limited memory as follows:

- We first map the entire data to a numpy array without loading it into memory, for example with a memory-mapped array (np.memmap).
- Then we obtain only the data we need by passing the index to the numpy array.
- We then pass this data to our neural network and train it in small batches.
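The steps above can be sketched with np.memmap, which maps a file on disk to an array without reading it into RAM. This is only a small stand-in: the file name, shape and batch size are made up, the file is tiny rather than 20 GB, and the actual training call is left as a comment.

```python
import os
import tempfile
import numpy as np

# Stand-in for the large dataset: a small file of float32 rows written to disk.
# File name, shape and batch size are made up for illustration.
n_rows, n_features, batch_size = 1_000, 8, 128
path = os.path.join(tempfile.mkdtemp(), "features.dat")
writer = np.memmap(path, dtype=np.float32, mode="w+", shape=(n_rows, n_features))
writer[:] = np.arange(n_rows * n_features, dtype=np.float32).reshape(n_rows, n_features)
writer.flush()
del writer

# Map the file without reading it into RAM; slices are loaded on demand.
data = np.memmap(path, dtype=np.float32, mode="r", shape=(n_rows, n_features))

def iter_batches(array, size):
    """Yield consecutive row batches, copying only one batch into memory."""
    for start in range(0, array.shape[0], size):
        yield np.array(array[start:start + size])

rows_seen = 0
for batch in iter_batches(data, batch_size):
    # A real training loop would pass `batch` to the neural network here.
    rows_seen += batch.shape[0]

print(rows_seen)  # 1000
```

Only one batch is held in memory at a time, so the same loop works when the file is far larger than the available RAM.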

**You can’t afford to miss Neural Networks in your data science interview preparation. Learn them through DataFlair’s latest guide on Neural Networks for Data Science Interview.**

**Q.36 If a model trained on all the features in the dataset obtains an accuracy of 100%, but the accuracy score on the validation set is 75%, what should be looked out for?**

**Ans.** A training accuracy of 100% combined with a much lower validation accuracy suggests that the model has memorized the training data, so we should check our model for overfitting.

**Q.37 Suppose that you are training your machine learning model on the text data. The document matrix that is created consists of more than 200K documents. What techniques can you use to reduce the dimensions of the data?**

**Ans.** In order to reduce the dimensions of our data, we can use any one of the following three techniques:

- Latent Semantic Indexing
- Latent Dirichlet Allocation
- Keyword Normalization

**Q.38 In a survey conducted, the average height was 164cm with a standard deviation of 15cm. If Alex had a z-score of 1.30, what will be his height?**

**Ans.** Using the formula, X= μ+Zσ, we determine that X = 164 + 1.30*15 = 183.5. Therefore, the height of Alex is 183.50 cm.

**Q.39 Assume that you have 1000 input features and 1 target feature. You have to filter out 100 most significant features that justify the relationship between input features and the target output. What will you use?**

**Ans.** For solving this problem, we will use Principle Discriminant Analysis, as the number of features is much larger than the number of significant ones.

**Q.40 While reading the file ‘file.csv’, you get the following error:**

**Traceback (most recent call last):**

**File "<input>", line 1, in <module>**

**UnicodeEncodeError: 'ascii' codec can't encode character.**

**How will you correct this error?**

**Ans.** In order to correct this error, we will read the csv with the utf-8 encoding.

pd.read_csv('file.csv', encoding='utf-8')

## Behavior-based Data Science Interview Questions

By this, we mean the general questions that could be based on your past experience, behavior, the company, the role, your family background, etc. Remember that these types of questions carry good weightage in a data science interview, and they can also be asked indirectly. It is recommended to practice them two or three times before attempting the interview; it will surely boost your confidence.

**Q.1** Where do you see yourself in X years?

**Q.2** Why did you choose this role?

**Q.3** Which was the most challenging project you did? Explain it.

**Q.4** How will you manage a conflict situation with your colleague?

**Q.5** What would you prefer – working in a large team, small team, or individually?

**Q.6** If you encounter a tedious or boring task, how will you motivate yourself to complete it?

**Q.7** Tell me about one innovative solution that you developed in your previous job that you are proud of.

**Q.8** What can your hobbies tell me that your resume can’t?

**Q.9** Tell me about your top 5 predictions for the next 15 years.

**Q.10** Tell me your likes and dislikes about your previous position.

**Q.11** How will you identify a barrier that can affect your performance?

**Q.12** What are your motivations for working with our company?

**Q.13** Tell me about a challenging work situation and how you overcame it.

**Q.14** Tell me about a situation in which you were dealing with coworkers and your patience proved to be a strength.

**Q.15** Is there any case when you changed someone’s opinion?

**Q.16** What would you do if your senior/manager rejected all your ideas?

**Q.17** If you were assigned multiple tasks at the same time, how would you organize yourself to produce quality work under tight deadlines?

Answering the above data science interview questions alone won’t work. You also need to learn the skill of correctly framing your answers to data science interview questions. For that, you can check **DataFlair’s Data Science Interview Preparation Guide designed by experts.**

Be so good in an interview that they can’t ignore you!

## Summary

Superb, you have read all the data science interview questions and answers. The industry is booming and companies are demanding more data scientists, which can raise the level of interviews. Therefore, DataFlair is trying to prepare you for the advanced level. If you are facing difficulty with any answer, you can comment below and we will surely help you.

Have you faced any Data Science Interview yet? Do share your experience with us.

If there is any topic that you want to prepare for a data science interview, you can visit DataFlair’s Data Science tutorial Library.

*Keep learning, keep succeeding. All the very best!* 👍