30 Most Popular Data Science Interview Questions

1. Objective

This tutorial presents 30 popular Data Science interview questions and answers. These questions are frequently asked in data science interviews, and the answers are included to help you crack the interview for a data scientist job.

2. What are Popular Data Science Interview Questions

Q.1. What do you understand by the lattice package?

  • An implementation of Trellis graphics for R (Trellis was originally developed for S)
  • A powerful, high-level data visualization system
  • Provides common statistical graphics with conditioning
  • Emphasis on multivariate data
  • Sufficient for typical graphics needs
  • Flexible enough to handle most nonstandard requirements
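As a minimal sketch of what "graphics with conditioning" means, the example below draws one scatter-plot panel per cylinder count. It assumes the built-in mtcars data set, which is not part of the original answer; the axis labels are illustrative choices.

```r
library(lattice)

# Scatter plot of mpg against weight, conditioned on cylinder count:
# the "| factor(cyl)" part asks for one panel per value of cyl.
p <- xyplot(mpg ~ wt | factor(cyl), data = mtcars,
            layout = c(3, 1),
            xlab = "Weight (1000 lbs)", ylab = "Miles per gallon")

print(p)  # lattice plots are objects; print() draws them
```

Because the plot is an ordinary R object, it can be stored, updated, and printed later.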

Read more about Lattice packages in detail.

Q.2. What education is required for a data science role?

Generally, 88% of data scientists have a Master’s degree, and 46% have PhDs. Other skills data scientists need include:

  • In-depth knowledge of SAS and/or R. For data science, R is generally preferred.
  • Python coding: Python is the most common coding language used in data science, along with Java, Perl, and C/C++.
  • Hadoop platform: It is not needed in every case, but knowledge of the Hadoop platform is still preferred in the field. Experience with Hive or Pig is a huge plus.
  • SQL database/coding: Although NoSQL and Hadoop are a major focus for data scientists, preferred candidates can write and execute complex queries in SQL.

Read more about Skills to become Data Scientist

Q.3. What skills are required for data analytics roles?

  • Programming skills: Both R and Python are important for any data analyst.
  • Statistical skills and mathematics: Descriptive and inferential statistics are a must for data analysts.
  • Machine learning skills.
  • Data wrangling skills: The ability to map raw data and convert it into another format that is more convenient to consume.
  • Communication and data visualization skills are also important.

Q.4. What are the drawbacks of the linear model?

Some drawbacks of the linear model are:

  • It assumes a linear relationship between the predictors and the response, with well-behaved errors.
  • It is poorly suited to count outcomes or binary outcomes.
  • It cannot, by itself, solve over-fitting problems.

Q.5. What is root cause analysis?

Root cause analysis was initially developed to analyze industrial accidents, but it is now widely used in other areas. It is a problem-solving technique for isolating the root causes of faults or problems. A factor is called a root cause if its removal from the problem-fault sequence prevents the final undesirable event from recurring.

Read more about Linear Regression in detail.

Q.6. What are Recommender Systems in R?

Recommender systems are a subclass of information filtering systems that are meant to predict the preferences or ratings that a user would give to a product.

Q.7. What is Collaborative Filtering in Data science?

Collaborative filtering is the filtering process used by most recommender systems to find patterns and information by combining the perspectives of many users, numerous data sources, and several agents.
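A rough sketch of the idea behind user-based collaborative filtering: users whose past ratings look alike are assumed to share preferences. The rating matrix, the user names, and the choice of cosine similarity are all illustrative assumptions, not something the answer above prescribes.

```r
# Hypothetical ratings: rows are users, columns are three items.
ratings <- rbind(
  alice = c(5, 4, 1),
  bob   = c(4, 5, 1),
  carol = c(1, 1, 5)
)

# Cosine similarity between two rating vectors.
cosine <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))

sim_ab <- cosine(ratings["alice", ], ratings["bob", ])
sim_ac <- cosine(ratings["alice", ], ratings["carol", ])

# alice is far closer to bob than to carol, so bob's ratings would
# carry more weight when predicting items alice has not rated yet.
c(alice_bob = sim_ab, alice_carol = sim_ac)
```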

Q.8. What is Cluster Sampling in Data Science?

Cluster sampling is a technique used when the target population is spread across a wide area, making it difficult to study and making simple random sampling impractical. A cluster sample is a probability sample in which each sampling unit is a collection, or cluster, of elements.
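A minimal sketch of one-stage cluster sampling in base R. The population, the village labels, and the number of clusters drawn are made up for illustration.

```r
set.seed(42)

# Hypothetical population: 10 villages (clusters) with 20 households each.
population <- data.frame(
  village = rep(paste0("v", 1:10), each = 20),
  income  = rnorm(200, mean = 50, sd = 10)
)

# Randomly select whole clusters, then keep every element inside them.
chosen <- sample(unique(population$village), size = 3)
cluster_sample <- population[population$village %in% chosen, ]

table(cluster_sample$village)
```

The sampling unit here is the village, not the household: three villages are drawn at random and all of their households enter the sample.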

Learn more about Clustering in R

Q.9. What is bootstrapping in R?

Bootstrapping is a very useful tool in statistics. It is a non-parametric resampling method that comes in handy when the sampling distribution of a statistic is in doubt.

Bootstrapping generally follows the same basic steps:

  1. Resample a given data set, with replacement, a specified number of times.
  2. Calculate a specific statistic from each sample.
  3. Find the standard deviation of the distribution of that statistic.
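The three steps can be sketched directly in base R. The simulated data, the choice of the median as the statistic, and the 1000 resamples are assumptions made for illustration.

```r
set.seed(1)
x <- rnorm(50, mean = 10, sd = 2)   # illustrative data

n_boot <- 1000

# Steps 1 and 2: resample with replacement and compute the statistic
# (here the median) on each resample.
boot_medians <- replicate(n_boot, median(sample(x, replace = TRUE)))

# Step 3: the standard deviation of the bootstrap distribution
# estimates the standard error of the median.
se_median <- sd(boot_medians)
se_median
```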

Non-parametric Bootstrapping

The boot package provides extensive facilities for bootstrapping. You can bootstrap a single statistic (e.g., a median) or a vector (e.g., regression weights).

The main bootstrapping function is boot( ), which has the following format:

bootobject <- boot(data= , statistic= , R= , ...)
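As a hedged example of filling in that template, the sketch below bootstraps a median. The data and the helper name median_stat are illustrative; the one requirement from boot( ) is that the statistic function accepts the data and a vector of resampled indices.

```r
library(boot)

set.seed(1)
x <- rnorm(50, mean = 10, sd = 2)   # illustrative data

# boot() calls the statistic with the data and the indices of one resample.
median_stat <- function(data, indices) median(data[indices])

bootobject <- boot(data = x, statistic = median_stat, R = 1000)

bootobject                          # original estimate, bias, std. error
boot.ci(bootobject, type = "perc")  # percentile confidence interval
```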

Understand Bootstrapping in detail.

Q.10. What are bootstrap methods?

There are two methods of bootstrapping:

a. Bootstrapping residuals – First we bootstrap the residuals of the fitted model. Then we add them to the fitted values to create a set of new dependent variables, which form the bootstrapped sample.

b. Bootstrapping pairs – This involves sampling pairs of the dependent and independent variables together. Of the two methods, the second is found to be more robust.
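Both methods can be sketched for a simple linear regression. The simulated data, the 500 replicates, and the focus on the slope are illustrative assumptions.

```r
set.seed(2)
d <- data.frame(x = runif(40))
d$y <- 1 + 2 * d$x + rnorm(40, sd = 0.3)
fit <- lm(y ~ x, data = d)

# a. Residual bootstrap: keep x fixed, rebuild y from the fitted
#    values plus resampled residuals, and refit.
resid_slope <- replicate(500, {
  d_star <- d
  d_star$y <- fitted(fit) + sample(resid(fit), replace = TRUE)
  coef(lm(y ~ x, data = d_star))[["x"]]
})

# b. Pairs bootstrap: resample whole (x, y) rows and refit.
pairs_slope <- replicate(500, {
  rows <- sample(nrow(d), replace = TRUE)
  coef(lm(y ~ x, data = d[rows, ]))[["x"]]
})

# Bootstrap standard errors of the slope under each scheme.
c(residual = sd(resid_slope), pairs = sd(pairs_slope))
```

The pairs version makes no use of the fitted model when resampling, which is one reason it tends to hold up better when the model's error assumptions are doubtful.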

Q.11. When to use bootstrap?

The bootstrap is used to enable inference on the statistic of interest. It is especially important when the true distribution of this statistic is unknown.

For example:

In the case of a linear model, if the analyst does not want to spend time writing down the equations, bootstrapping can be a great approach. It gives standard errors and confidence intervals from the bootstrapped distribution.

Q.12. When is the bootstrap inconsistent, and how can it be remedied?

There are some situations in which the bootstrap procedure can fail:

a. For small sample sizes (fewer than 10), a bootstrapped sample is generally not reliable.

b. Distributions that have infinite second moments.

c. Estimation of extreme values.

d. Unstable autoregressive (AR) processes.

Q.13. What is Bootstrap Development?

Bootstrap, originally developed at Twitter, is one of the best existing front-end frameworks. It is an HTML, CSS, and JS framework used for developing responsive, mobile-first projects on the web.

One important thing to note: frameworks can save you a lot of time that you would otherwise spend coding, but they also restrict your creativity. So it is better to come up with design ideas that fit within their requirements.

Q.14. What are the advantages of Bootstrap development?

a. It has fewer cross-browser bugs.

b. It has responsive structures and styles.

c. It contains several JavaScript plugins that use jQuery.

d. It has good documentation and community support.

e. It has loads of free and professional templates, WordPress themes, and plugins.

f. It has a great grid system.

Q.15. What are the disadvantages of Bootstrap development?

a. If the design deviates from the customary design used in Bootstrap, many style overrides or rewritten files will be needed. This can lead to more time spent designing and coding the website.

b. We have to go the extra mile while creating a design. Without heavy customization, all the websites will look the same.

c. The styles are verbose, which can lead to a lot of output in the HTML.

d. The JavaScript is tied to jQuery, one of the most common libraries, and most of the plugins are often left unused.

e. Non-compliant HTML.

Q.16. What are the pros and cons of bootstrapping a business?

Pros of Bootstrapping Your Business

a. Instead of using your time to hunt down an investment, you can focus more on the business itself.

b. Without outside investors, you are able to control your company completely and without pressure.

c. Your business is all but guaranteed to become more customer-focused, since all money comes from customers instead of investors.

d. The owner gets to take home a bigger piece of the pie in the event of a company exit, instead of having to share it with investors.

Cons of Bootstrapping Your Business

a. Since bootstrapping founders use their own personal assets to get the business going, they are at greater risk of ending up in a lot of debt if the business fails.

b. It takes much longer to grow a company without an investment, which could mean that you will not be earning any money for quite a while.

c. It is not possible to bootstrap every business venture, especially if the business needs a large amount of capital to get started.

d. Your business may fail if product development and marketing do not become efficient.

e. Competitors with a better financial standing have a better chance of pushing you out of the market before you even get your business up and running.

Q.17. What is meant by ANOVA models in R?

Fitting ANOVA models in R is seldom sweet and almost always confusing. Analysis of variance (ANOVA) is a statistical technique for investigating data by comparing the means of subsets of the data.

For generalized linear models, the anova function performs an analysis of deviance: it computes an analysis of deviance table for one or more generalized linear model fits.

Keywords: models, regression

Usage:

# S3 method for glm

anova(object, ..., dispersion = NULL, test = NULL)

Arguments:

a. object, ...

Objects of class glm, typically the result of a call to glm, or a list of objects for the "glmlist" method.

b. dispersion

The dispersion parameter for the fitting family.
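A short sketch of an analysis of deviance table for two nested logistic regressions. The mtcars data and the choice of am and wt as variables are illustrative assumptions.

```r
# Two nested logistic regressions on the built-in mtcars data.
fit0 <- glm(am ~ 1,  family = binomial, data = mtcars)  # intercept only
fit1 <- glm(am ~ wt, family = binomial, data = mtcars)  # adds weight

# Analysis of deviance table comparing the two fits; test = "Chisq"
# adds a likelihood-ratio chi-squared test for the added term.
tab <- anova(fit0, fit1, test = "Chisq")
tab
```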

Learn more about Anova Models in R

Q.18. What is meant by classical ANOVA?

We start with a simple additive fixed-effects model. For this model, we use the built-in function aov:

aov(Y ~ A + B, data=d)

To cross these factors, or more generally to interact two variables, we use either of:

aov(Y ~ A * B, data=d)

aov(Y ~ A + B + A:B, data=d)

So far, so familiar. Now assume that B is nested within A:

aov(Y ~ A/B, data=d)

aov(Y ~ A + B %in% A, data=d)

aov(Y ~ A + A:B, data=d)

So nesting amounts to adding one main effect and one interaction.
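One way to check how R expands these formulas, without fitting anything, is to inspect their term labels. The helper name term_labels is introduced here only for illustration.

```r
# Expand a formula into its model terms without fitting a model.
term_labels <- function(f) attr(terms(f), "term.labels")

crossed <- term_labels(Y ~ A * B)     # "A" "B" "A:B"
nested  <- term_labels(Y ~ A / B)     # "A" "A:B"
manual  <- term_labels(Y ~ A + A:B)   # "A" "A:B"

# The nested formula and its hand-written expansion have the same terms.
identical(nested, manual)
```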

Q.19. What are the character functions in R?

a. Function – grep(pattern, x, ignore.case=FALSE, fixed=FALSE)

Description –

Search for a pattern in x.

If fixed=FALSE, the pattern is a regular expression;

If fixed=TRUE, the pattern is a text string;

Returns the matching indices.

grep("A", c("b","A","c"), fixed=TRUE) returns 2

b. Function – substr(x, start=n1, stop=n2)

Description –

Extract or replace substrings in a character vector.

x <- "abcdef"

substr(x, 2, 4) is "bcd"

substr(x, 2, 4) <- "22222" changes x to "a222ef"

c. Function – strsplit(x, split)

Description –

Split the elements of character vector x at split.

strsplit("abc", "") returns a 3-element vector "a","b","c"

d. Function – sub(pattern, replacement, x, ignore.case=FALSE, fixed=FALSE)

Description –

Find a pattern in x and replace it with the replacement text;

If fixed=FALSE, the pattern is a regular expression;

If fixed=TRUE, the pattern is a text string.

sub("\\s", ".", "Hello There") returns "Hello.There"

e. Function – toupper(x)

Description –

Convert x to uppercase.

f. Function – tolower(x)

Description –

Convert x to lowercase.

g. Function – paste(..., sep="")

Description –

Concatenate strings, using the sep string to separate them.

paste("x", 1:3, sep="") returns c("x1","x2","x3")

paste("x", 1:3, sep="M") returns c("xM1","xM2","xM3")

paste("Today is", date())

Q.20. What are the methods for character functions in R?

In R, strings are stored in a character vector. We can create strings with single or double quotes.

For example:

y = "I Love Dancing"

a. Convert object into character type

The as.character function converts its argument to a character type.

For example, here we store 20 as a character:

Y = as.character(20)

class(Y)

class(Y) returns "character", since 20 is stored as a character in the code above.

b. Check the character type

X = "I Love Dancing"

is.character(X)

Output: TRUE

Like the is.character function, there are other functions such as is.numeric, is.integer, and is.array for checking for a numeric vector, an integer, and an array.

c. Concatenate Strings

We use the paste function to join two strings. It is one of the most important string manipulation tasks, and every analyst performs it almost daily to structure data.

Paste Function Syntax

paste(objects, sep = " ", collapse = NULL)

The sep= keyword denotes a separator or delimiter. The default separator is a single space. The collapse= keyword, if given, is a string used to join the results into a single string.

For example:

x = "Ritika"

y = "Joshi"

paste(x, y)

Output: Ritika Joshi

paste(x, y, sep = ",")

Output: Ritika,Joshi

d. String Formatting

Suppose a value is stored as a fraction and you need to convert it to a percent. The sprintf function performs C-style string formatting.

Sprintf Function Syntax

sprintf(fmt, ...)

The keyword fmt denotes the string format. A format specification starts with the symbol % followed by numbers and letters.

x = 0.25

sprintf("%.2f%%", x*100)

Output: 25.00%

e. Extract or replace substrings

substr Syntax – substr(x, starting position, end position)

x = "abcdef"

substr(x, 1, 4)

Output: abcd

In the above example, we are telling R to extract the string from the 1st letter through the 4th letter.

Replace Substring – substr(x, starting position, end position) = Value

substr(x, 1, 3) = "111"

Output: 111def

In the above example, we are telling R to replace the first 3 letters with 111.

f. String Length

The nchar function computes the length of a character value.

x = "I love Dancing"

nchar(x)

Output: 14

It returns 14 because the string x contains 14 characters (including 2 spaces).

g. Extract a word from a string

Suppose you need to pull the first or last word from a character string.

Word Function Syntax (Library: stringr)

word(string, position of word to extract, separator)

For example:

x = "I love Dancing"

library(stringr)

word(x, 1, sep = " ")

Output: I

In the example above, '1' denotes the first word to be extracted from the string. sep=" " denotes a single space as the delimiter (it is the default delimiter in the word function).

Extract Last Word

x = "I love Dancing"

library(stringr)

word(x, -1, sep = " ")

Output: Dancing

In the example above, '-1' denotes the first word counting from the right of the string, that is, the last word. sep=" " denotes a single space as the delimiter (it is the default delimiter in the word function).

h. Convert Character to Uppercase / Lowercase / Propercase

We often need to change the case of a word, for example to convert it to uppercase or lowercase.

x = "I love Dancing"

tolower(x)

Output: "i love dancing"

The tolower() function converts letters in a string to lowercase.

toupper(x)

Output: "I LOVE DANCING"

The toupper() function converts letters in a string to uppercase.

library(stringr)

str_to_title(x)

Output: "I Love Dancing"

The str_to_title() function converts the first letter of each word in a string to uppercase and the remaining letters to lowercase.

i. Converting Multiple Spaces to a Single Space

Removing repeated spaces from a string while keeping a single space between words can be tedious. In R, it is possible to do it with the qdap package.

x = "ritika     joshi"



Output: ritika joshi
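The qdap call itself is not shown above. As an alternative sketch that needs no extra package, base R's gsub with a regular expression collapses runs of whitespace; the input string here is an illustrative reconstruction.

```r
x <- "ritika     joshi"   # several spaces between the two words

# Replace every run of one or more whitespace characters with one space.
single_spaced <- gsub("\\s+", " ", x)
single_spaced
```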

j. Repeat the character N times

We can use the base R function strrep to repeat a character N times.

strrep("x", 5)

Output: "xxxxx"

k. Find String in a Character Variable

The str_detect() function (from the stringr package) checks whether a substring exists in a string. It is similar to the 'contains' operator in SAS, and it returns TRUE/FALSE for each value.

x = c("Ritika Joshi", "Ritika Gupta", "Linkedin", "Google")
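The detection call for this vector is not shown in the text. As a sketch, assuming the substring of interest is "Ritika" (an assumption), base R's grepl gives the same kind of TRUE/FALSE result per element.

```r
x <- c("Ritika Joshi", "Ritika Gupta", "Linkedin", "Google")

# fixed = TRUE treats the pattern as a plain string, not a regex.
found <- grepl("Ritika", x, fixed = TRUE)
found

# With stringr installed, str_detect(x, "Ritika") returns the same
# logical vector for this input.
```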




l. Splitting a Character Vector

In text mining, it is often necessary to split a string in order to count the keywords used. We use strsplit() in base R to perform this operation.

x = c("I love Dancing")

strsplit(x, " ")

Output: "I" "love" "Dancing"

Q.21. What is meant by binomial distribution?

The binomial distribution applies to discrete data on a single variable, where the result is the number of “successful outcomes” in a fixed number of trials.
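A small worked example using base R's binomial functions. The numbers (10 trials, success probability 0.5, 3 successes) are chosen only for illustration.

```r
# Probability of exactly 3 successes in 10 independent trials,
# each with success probability 0.5.
p_exact <- dbinom(3, size = 10, prob = 0.5)

# The same value from the formula choose(n, k) * p^k * (1 - p)^(n - k).
p_formula <- choose(10, 3) * 0.5^3 * 0.5^7

# Probability of at most 3 successes.
p_at_most <- pbinom(3, size = 10, prob = 0.5)

c(exact = p_exact, formula = p_formula, at_most = p_at_most)
```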

Q.22. What do you understand by Poisson distribution?

  • It is often called the distribution of rare events. A Poisson process is one where discrete events occur in a continuous but finite interval of time or space.

The following conditions must apply:

  • For a small interval, the probability of the event occurring is proportional to the size of the interval.
  • The probability of more than one occurrence in the small interval is negligible.
  • Each occurrence must be independent of the others and must occur at random.
  • The events are often defects, accidents, or unusual natural happenings, such as an earthquake.
  • The parameter of the Poisson distribution is lambda ($\lambda$). It is the average, or mean, number of occurrences over a given interval.
  • The probability function is $P(X = x) = \frac{e^{-\lambda} \lambda^{x}}{x!}$ for x = 0, 1, 2, 3, ….
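A brief check of the Poisson probability function in base R, assuming a mean of lambda = 3 occurrences per interval (an illustrative value).

```r
lambda <- 3   # assumed mean number of occurrences per interval

# Probability of exactly 2 occurrences, from the built-in function ...
p_builtin <- dpois(2, lambda)

# ... and from the formula exp(-lambda) * lambda^x / x!
p_formula <- exp(-lambda) * lambda^2 / factorial(2)

# Probability of no occurrence at all is exp(-lambda).
p_zero <- dpois(0, lambda)

c(builtin = p_builtin, formula = p_formula, zero = p_zero)
```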

Understand Binomial and Poisson Distribution in R

Q.23. Explain the difference between the binomial and Poisson distributions?

Binomial Distribution

  1. Fixed number of trials (n), e.g., 10 pie throws.
  2. Only 2 outcomes are possible.
  3. The probability of success (p) is constant.
  4. Each trial is independent.
  5. It predicts the number of successes within a set number of trials.
  6. We use it to test for independence.

Poisson Distribution

  1. Infinite number of trials.
  2. An unlimited number of outcomes is possible.
  3. The mean of the distribution is the same for all intervals.
  4. The number of occurrences in any given interval is independent of the others.
  5. It predicts the number of occurrences per unit of time or space.
  6. We use it to test for independence.

Read more about Poisson and binomial distribution in detail.

Q.24. What do you understand by Ordinary Least Squares Linear Regression?

Ordinary least squares (OLS) regression is a statistical technique used for modeling and analyzing linear relationships between a response variable and one or more predictor variables. If the relationship between two variables appears to be linear, a straight line can be fit to the data to model the relationship.

The linear equation for a bivariate regression takes the following form:

y = mx + C

Where y = response (dependent) variable

m = gradient (slope)

x = predictor (independent) variable

C = intercept
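A minimal sketch of fitting such a line with base R's lm function, which estimates the slope and intercept by ordinary least squares. The five data points are made up for illustration.

```r
# Illustrative data that lie close to the line y = 2x + 1.
x <- 1:5
y <- c(3.1, 4.9, 7.2, 8.8, 11.0)

fit <- lm(y ~ x)   # ordinary least squares fit of y = m*x + C

intercept <- coef(fit)[["(Intercept)"]]   # C, about 1.09
slope     <- coef(fit)[["x"]]             # m, about 1.97

c(C = intercept, m = slope)
```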

Learn more about OLS in R

Q.25. Explain OLS in brief?

The ols function (from the rms package) fits a linear regression model by ordinary least squares.

Keywords: models, regression

Usage:

ols(formula, data, weights, subset, na.action=na.delete,

method="qr", model=FALSE,

x=FALSE, y=FALSE, se.fit=FALSE, linear.predictors=TRUE,

penalty=0, penalty.matrix, tol=1e-7, sigma,

var.penalty=c('simple','sandwich'), ...)


Arguments:

a. formula

An S formula object, e.g.

Y ~ rcs(x1,5)*lsp(x2,c(10,20))

b. data

The name of an S data frame containing all needed variables.

c. weights

Weights to be used in the fitting process.

d. subset

An expression that defines a subset of the observations to use in the fit. The default is to use all observations.

e. na.action

Specifies an S function to handle missing data.

f. method

Specifies a particular fitting method, or "model.frame".

g. model

The default is FALSE. Set to TRUE to return the model frame as element model of the fit object.

h. x

The default is FALSE. Set to TRUE to return the expanded design matrix as element x of the returned fit object. Set x=TRUE if you are going to use the residuals function.

i. y

The default is FALSE. Set to TRUE to return the vector of response values as element y of the fit.

j. se.fit

The default is FALSE. Set to TRUE to compute the estimated standard errors of the estimate of Xβ and store them in element se.fit of the fit.

k. linear.predictors

The default is TRUE. Set to FALSE to cause predicted values not to be stored.

l. penalty, penalty.matrix

See lrm.

m. tol

Tolerance for information matrix singularity.

n. sigma

If sigma is given, it is used as the actual root mean squared error parameter for the model. Otherwise, sigma is estimated from the data using the usual formulas.

o. var.penalty

Defines the type of variance-covariance matrix to be stored in the var component of the fit when penalization is used.

p. ...

Arguments to pass to lm.wfit or lm.fit.

Q.26. What is the chi-square test?

This is a very popular question that is asked frequently in data scientist interviews.

The chi-square test is a statistical method used to determine whether two categorical variables have a significant association between them. Both variables must come from the same population, and they should be categorical, such as Male/Female, Red/Green, Yes/No, etc.
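A small sketch using base R's chisq.test on a 2 x 2 table. The counts and the category labels are invented for illustration, and the continuity correction is switched off so the statistic matches the textbook formula.

```r
# Hypothetical counts: answers to a yes/no question, split by gender.
tab <- matrix(c(20, 30, 30, 20), nrow = 2,
              dimnames = list(gender = c("Male", "Female"),
                              answer = c("Yes", "No")))

# Pearson chi-squared test of independence, no continuity correction.
res <- chisq.test(tab, correct = FALSE)

res$statistic   # X-squared = 4 (every expected count is 25)
res$parameter   # 1 degree of freedom
res$p.value     # about 0.0455
```

With a p-value below 0.05, these made-up counts would lead us to reject independence at the 5% level.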

Read more about Chi-Square tests in detail.

Q.27. Explain how R programming is applied in the real world?

  • R is used as a primary programming tool in finance by many quantitative analysts.
  • Once you are used to R, it is good for nearly everything.
  • R is an open-source language. It is both a language and an environment for statistical computing and graphics.
  • R is a GNU project that is similar to the S language, and it can be considered a different implementation of S. There are some important differences, but much code written for S runs unaltered under R.
  • R provides a wide variety of statistical techniques, including clustering, classification, and classical statistical tests.
  • R applications span many fields, from the hard sciences to computational statistics and theoretical work.

For example: medicine, chemistry, marketing, finance, and much more.

  • R is used around the world to resolve difficult problems, and it is a fundamental tool for analytics-driven organizations.

For example: Google, Facebook, and LinkedIn.

Learn more about R Applications in Real world

Q.28. Name various sectors that are using R?

1. Social media

2. Public affairs

3. Services

4. Analytics

5. Finance

6. Media

7. Government

8. Software vendors (e.g., Revolution Analytics)

Read more about R programming in detail.

Q.29. Give a brief introduction to an array in R?

An array in R is a multi-dimensional data structure. Its data can be viewed as a stack of matrices, each with rows and columns, and we can use the matrix level, row index, and column index to access its elements.

Arrays in R are data objects that can store data in more than two dimensions. An array is created using the array() function, which takes vectors as input and uses the values in the dim parameter to set the dimensions.
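A short sketch of creating and indexing a three-dimensional array. The values 1 to 24 and the dimensions are arbitrary choices for illustration.

```r
# A 2 x 3 x 4 array: four 2 x 3 matrices stacked along the third dimension.
arr <- array(1:24, dim = c(2, 3, 4))

dim(arr)                  # 2 3 4

# Index order is [row, column, matrix level].
element <- arr[1, 2, 3]   # row 1, column 2 of the third matrix: 15

first_matrix <- arr[, , 1]  # the whole first 2 x 3 matrix
first_matrix
```

Values fill the array column by column within each matrix, which is why row 1, column 2 of the third matrix holds 15.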

Q.30. Why the Reshape R Package?

The data obtained from an experiment or study is generally in a different form from the one that analytic functions expect. Typically, the data from a study has one or more columns that identify a row, followed by a number of columns that represent the values measured. The columns that identify the row can be thought of as the composite key of a database table.
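To illustrate the idea of identifier columns versus measured columns, the sketch below converts a small wide table to long form. It uses base R's reshape() function rather than the reshape package, and the data frame and column names are invented.

```r
# Wide form: one identifier column followed by two measured columns.
wide <- data.frame(subject = c("a", "b"),
                   height  = c(170, 180),
                   weight  = c(65, 80))

# Long form: one row per (subject, measure) pair.
long <- reshape(wide, direction = "long",
                idvar   = "subject",               # column that identifies a row
                varying = c("height", "weight"),   # measured columns to stack
                v.names = "value",                 # name of the stacked values
                timevar = "measure",               # column naming the measure
                times   = c("height", "weight"))

long
```

In long form, the identifier column plus the measure column act as the composite key of each row.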

Learn R data reshaping in detail.


In this blog, we have studied popular Data Science interview questions. I hope these questions will help you resolve your queries and that this blog will act as a gateway to your data science job. If you have any query, feel free to ask in the comment box.
