Top 60 Data Science Interview Questions & Answers (Python, ML, Stats): The Real Ones They Actually Ask
I bombed a data science interview once because I could not explain the difference between bias and variance. Not because I did not know it. Because I had memorized the definition, and when the interviewer asked me to explain it like I was talking to a non-technical product manager, my brain froze. I recited something about model complexity and error decomposition. He nodded. The kind of nod that means "I have already moved on to the next candidate in my head."
That failure taught me something. Data science interviews are not a memory test. They are a thinking test disguised as a technical one. The interviewer does not just want the correct answer. They want to see how you arrive at it, how you explain it, and whether you actually understand it or just recognize the words.
So I put together this list. Not the theoretical questions from some textbook. The actual data science interview questions I have asked, been asked, and seen others ask in real interviews. Sixty of them. Split into Python, machine learning, statistics, SQL, and the behavioral stuff that everyone ignores until it costs them an offer.
A quick note. Do not memorize these. Use them to check your understanding. If you can explain a concept to someone without jargon, you know it. If you cannot, the definition in your head is just decoration.
The Python Questions They Always Ask
Python is the language of data science. Nobody will hire you if you cannot write basic code. These questions test whether you have actually written Python or just watched tutorials about it.
What is the difference between a list and a tuple?
Lists are mutable. You can change them after creation. Tuples are immutable. Once created, they stay the same. The practical implication is that lists are for collections that change, tuples are for fixed data. Tuples are also slightly cheaper to create, and as long as every element is hashable they can be used as dictionary keys. Lists cannot.
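If the interviewer asks you to show it, a few lines are enough. This is a minimal sketch. The variable names and values are made up for illustration.

```python
# Lists are mutable: they can be changed in place.
scores = [70, 85, 90]
scores.append(95)
scores[0] = 75

# Tuples are immutable: item assignment raises TypeError.
point = (40.7, -74.0)
try:
    point[0] = 0.0
    tuple_changed = True
except TypeError:
    tuple_changed = False

# A tuple of hashable items works as a dictionary key.
city_by_coords = {point: "New York"}

# A list is unhashable, so using it as a key raises TypeError.
try:
    bad_lookup = {[40.7, -74.0]: "New York"}
    list_as_key_ok = True
except TypeError:
    list_as_key_ok = False

print(scores, tuple_changed, list_as_key_ok)
```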
What are list comprehensions?
A compact way to create lists. Instead of writing a for loop, you write the expression inside square brackets. It is faster and more readable once you get used to it. But do not nest them three levels deep. That becomes unreadable and you will confuse yourself six months later.
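Here is the same logic written both ways, as a small sketch with made-up numbers. Being able to translate between the two forms is usually what the interviewer is checking.

```python
numbers = [1, 2, 3, 4, 5, 6]

# Loop version: collect the squares of the even numbers.
even_squares_loop = []
for n in numbers:
    if n % 2 == 0:
        even_squares_loop.append(n * n)

# Comprehension version: same result in one expression.
even_squares = [n * n for n in numbers if n % 2 == 0]

print(even_squares)
```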
How do you handle missing values in a Pandas DataFrame?
First, you figure out why they are missing. The reason matters. Then you either drop them with dropna if the missing data is minimal, or fill them with fillna using mean, median, mode, or a more sophisticated imputation method. The choice depends on the data and the problem. There is no universal correct answer. That is the point of the question.
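A rough sketch of the three moves, count, drop, fill, assuming pandas is installed. The DataFrame and its column names are invented for the example. Median for the numeric column and mode for the categorical one are just one reasonable choice, not the correct one.

```python
import pandas as pd

# Hypothetical data with one missing age and one missing city.
df = pd.DataFrame({
    "age": [25.0, None, 40.0, 35.0],
    "city": ["Pune", "Delhi", None, "Delhi"],
})

# Step 1: see how much is missing in each column.
missing_per_column = df.isna().sum()

# Option A: drop every row that has any missing value.
dropped = df.dropna()

# Option B: fill numeric gaps with the median, categorical gaps with the mode.
filled = df.copy()
filled["age"] = filled["age"].fillna(filled["age"].median())
filled["city"] = filled["city"].fillna(filled["city"].mode()[0])

print(missing_per_column)
print(filled)
```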
What is the difference between apply and map in Pandas?
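Apply works on both Series and DataFrames. On a DataFrame it passes a whole row or column to your function. Series.map works element by element on a single Series and also accepts a dictionary for lookups. Recent pandas versions also have an element-wise DataFrame.map, which replaced applymap. Apply is more flexible. Map is simpler for element-wise substitutions. I use map for simple replacements and apply for anything that needs a custom function.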
How do you merge two DataFrames?
The merge function. Like a JOIN in SQL. You specify the key column and the type of merge. Inner, outer, left, right. Understanding these joins is essential. Most real-world data requires merging multiple sources.
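The join types are easiest to see by counting rows. A minimal sketch, assuming pandas is installed. The two tables and the customer_id key are invented for illustration.

```python
import pandas as pd

# Hypothetical tables: customer 1 has two orders, customer 4 is unknown.
customers = pd.DataFrame({"customer_id": [1, 2, 3],
                          "name": ["Asha", "Ben", "Chen"]})
orders = pd.DataFrame({"customer_id": [1, 1, 4],
                       "amount": [50, 20, 70]})

# Inner: only keys present in both tables.
inner = customers.merge(orders, on="customer_id", how="inner")

# Left: every customer, with NaN where there is no order.
left = customers.merge(orders, on="customer_id", how="left")

# Outer: every key from either table.
outer = customers.merge(orders, on="customer_id", how="outer")

print(len(inner), len(left), len(outer))  # 2 4 5
```

Notice that the left merge has more rows than the customers table. A key that repeats on the right side multiplies rows, which is a common source of inflated totals.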
What is a lambda function?
An anonymous, one-line function. Useful for small operations you do not want to define a full function for. They are handy inside apply statements. They become unreadable if you try to do too much in one line. Keep them simple.
Explain train_test_split and why it is used.
It splits your data into training and testing sets. The model learns on the training data. You evaluate it on the testing data. The split does not prevent overfitting by itself. It lets you detect it, because the test score is an honest estimate of performance on unseen data. If you train and test on the same data, your evaluation is meaningless. The model has already seen the answers.
What libraries do you use most often in data science projects?
Pandas for data manipulation. NumPy for numerical operations. Matplotlib and Seaborn for visualization. Scikit-learn for machine learning. Statsmodels for statistical modeling. Name them. Then explain briefly when you use each one.
How do you read a CSV file with Pandas?
pd.read_csv. It sounds trivial. But interviewers ask this to see if you have ever actually opened a CSV file in Python. Mention parameters like encoding for non-UTF-8 files and on_bad_lines for malformed rows. That signals real experience.
What is the difference between a Series and a DataFrame?
A Series is one column. A DataFrame is multiple columns. A DataFrame is essentially a collection of Series. Simple distinction. But it matters for understanding how Pandas structures data.
How do you remove duplicates from a DataFrame?
drop_duplicates. You can specify a subset of columns to check for duplicates. You can keep the first or last occurrence. Real data is full of duplicates that are not exact duplicates. Cleaning them requires judgment.
What is iloc and loc?
iloc uses integer positions. loc uses labels. This confuses beginners constantly. iloc for row number. loc for row index name. Knowing the difference prevents silent bugs.
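The silent bug shows up when the index is made of integers that are not in order, for example after sorting or filtering. A small sketch, assuming pandas is installed, with an invented DataFrame:

```python
import pandas as pd

# Hypothetical frame whose integer labels do not match row positions.
df = pd.DataFrame({"score": [88, 92, 75]}, index=[2, 0, 1])

# iloc: position 0 is the first row, whatever its label is.
first_by_position = df.iloc[0]["score"]

# loc: label 0 is the row whose index value is 0 (the second row here).
row_labelled_zero = df.loc[0, "score"]

print(first_by_position, row_labelled_zero)  # 88 92
```

Both lines run without error and return different rows. That is exactly why mixing them up is dangerous.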
How do you iterate over a DataFrame?
You generally should not. Vectorized operations are faster. But if you must, iterrows or itertuples. itertuples is faster. Mention that loops should be a last resort in Pandas. That shows you understand performance.
What is a virtual environment and why use one?
An isolated Python environment for a project. It keeps dependencies separate. Without it, different projects with conflicting library versions break each other. This is one of those things that sounds boring but will save you from nightmare debugging sessions.
Write a function to check if a string is a palindrome.
They want to see if you can code something simple without googling. Return string equals string reversed. Keep it clean. Use slicing. Show you can write basic logic without overcomplicating it.
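One way to write it, as a sketch. Ignoring case, spaces, and punctuation is my assumption here. In an interview, ask whether they want that before adding it.

```python
def is_palindrome(text: str) -> bool:
    """Return True if text reads the same forwards and backwards.

    Assumption: case and non-alphanumeric characters are ignored.
    """
    cleaned = [ch.lower() for ch in text if ch.isalnum()]
    # Slicing with a step of -1 reverses the sequence.
    return cleaned == cleaned[::-1]


print(is_palindrome("racecar"))                         # True
print(is_palindrome("A man, a plan, a canal: Panama"))  # True
print(is_palindrome("python"))                          # False
```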
The Statistics Questions That Expose Shallow Understanding
Statistics is the foundation. A lot of people skip it because it is not as shiny as machine learning. Interviewers know this. They probe here deliberately.
What is p-value?
The probability of observing results at least as extreme as yours, assuming the null hypothesis is true. A low p-value means data like yours would be unusual if the null hypothesis were true. But p-value is not the probability that the null hypothesis is true or false, and it is not the probability that your result is due to chance. Those misinterpretations are everywhere. Knowing the difference signals real understanding.
Explain the difference between correlation and causation.
Correlation means two variables move together. Causation means one variable directly affects the other. Ice cream sales and drowning deaths are correlated. Both increase in summer. Ice cream does not cause drowning. Heat causes both. Confounding variables are the reason correlation is not causation.
What is the Central Limit Theorem?
The distribution of sample means approaches a normal distribution as sample size increases, whatever the shape of the original distribution, as long as the observations are independent and the distribution has finite variance. This is why many statistical methods work. It is the foundation of inferential statistics. Know it intuitively, not just mathematically.
What is the difference between Type I and Type II error?
Type I is a false positive. Rejecting a true null hypothesis. Type II is a false negative. Failing to reject a false null hypothesis. The tradeoff between them depends on the context. In medical testing, a false negative might be worse than a false positive. In spam detection, a false positive, good email marked as spam, is worse than a false negative.
What is standard deviation versus standard error?
Standard deviation measures variability in your data. Standard error measures variability in your sample mean estimate. It is the standard deviation divided by the square root of the sample size, so for any sample with more than one observation it is smaller. It decreases with sample size. Confusing them is a common beginner mistake.
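The relationship fits in three lines of standard-library Python. A minimal sketch with an invented sample:

```python
import math
import statistics

# Hypothetical sample of eight measurements.
sample = [12, 15, 11, 18, 14, 16, 13, 17]

# Sample standard deviation: spread of the individual values.
sd = statistics.stdev(sample)

# Standard error of the mean: spread of the mean estimate itself.
se = sd / math.sqrt(len(sample))

print(round(sd, 3), round(se, 3))  # 2.449 0.866
```

Collect four times as many observations with the same spread and the standard error halves. The standard deviation stays roughly where it is.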
What is a confidence interval?
A range that likely contains the true population parameter. A 95 percent confidence interval means that if you repeated the sampling process many times, 95 percent of the intervals would contain the true value. It is not the probability that the true value is in your specific interval. That distinction is subtle but important.
What is a normal distribution?
A symmetric, bell-shaped distribution defined by mean and standard deviation. Many natural phenomena approximate it. Many statistical tests assume it. But real data is often not normal. Knowing when the assumption holds is more important than knowing the formula.
How do you detect outliers?
Box plots. Z-scores. IQR method. But detection is easy. Deciding what to do with outliers is hard. Are they errors? Are they genuine extreme values? The context decides whether you remove, transform, or keep them.
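The IQR method can be sketched with the standard library alone. The data is invented, and the 1.5 multiplier is the usual box plot convention, not a law. Different quartile definitions give slightly different fences.

```python
import statistics

# Hypothetical values with one suspicious entry.
values = [10, 12, 11, 13, 12, 14, 11, 95]

# quantiles(n=4) returns the three quartile cut points.
q1, _median, q3 = statistics.quantiles(values, n=4)
iqr = q3 - q1

# Anything beyond 1.5 * IQR from the quartiles gets flagged.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
outliers = [v for v in values if v < lower or v > upper]

print(outliers)  # [95]
```

The code only flags the value. Whether 95 is a typo or a real extreme is still your call.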
What is skewness and kurtosis?
Skewness measures asymmetry. Kurtosis measures tailedness. A positive skew means a long right tail. Income data is positively skewed. High kurtosis means heavy tails. More extreme values than a normal distribution. These properties affect which statistical methods are appropriate.
Explain Bayes' Theorem in simple terms.
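It updates the probability of a hypothesis based on new evidence. You start with a prior belief, weight it by how likely the evidence is if the hypothesis is true, and get a posterior belief. In short, the posterior is proportional to the likelihood times the prior. It is the mathematical formalization of learning from experience. The formula looks intimidating. The concept is intuitive.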
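A worked example makes the update concrete. This sketch uses the classic medical test setup. The numbers (1 percent prevalence, 95 percent sensitivity, 5 percent false positive rate) are made up for illustration.

```python
def posterior(prior: float,
              p_evidence_given_h: float,
              p_evidence_given_not_h: float) -> float:
    """Bayes' theorem for a yes/no hypothesis H after seeing the evidence."""
    numerator = p_evidence_given_h * prior
    # Total probability of the evidence, whether H is true or not.
    evidence = numerator + p_evidence_given_not_h * (1 - prior)
    return numerator / evidence


# Hypothetical test: 1% of people have the condition, the test catches
# 95% of true cases and wrongly flags 5% of healthy people.
p_sick_given_positive = posterior(prior=0.01,
                                  p_evidence_given_h=0.95,
                                  p_evidence_given_not_h=0.05)

print(round(p_sick_given_positive, 3))  # 0.161
```

A positive result moves the probability from 1 percent to about 16 percent, not to 95 percent. That gap is the whole point of the theorem.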
The Machine Learning Questions That Actually Test Depth
ML questions separate the tutorial graduates from the real practitioners. The interviewer is checking whether you understand why models work, not just how to import them.
Explain bias-variance tradeoff.
Bias is error from oversimplification. Variance is error from oversensitivity to training data. High bias underfits. High variance overfits. The tradeoff is finding the sweet spot. This is the single most important concept in machine learning. If you truly understand it, you understand model selection.
What is overfitting and how do you prevent it?
The model learns noise instead of signal. It performs great on training data and terribly on new data. Prevent it with cross-validation, regularization, simpler models, more data, and early stopping. Overfitting is the most common mistake in applied machine learning.
What is cross-validation?
Splitting data into multiple folds. Training on some, validating on others. Rotating which fold is the validation set. K-fold is the standard. It gives a more reliable estimate of model performance than a single train-test split.
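To show you understand the rotation, you can sketch the index bookkeeping by hand. This is a simplified illustration with no shuffling or stratification. The function name k_fold_indices is my own. In real work you would normally reach for a library helper such as scikit-learn's KFold.

```python
def k_fold_indices(n_samples: int, k: int):
    """Yield (train_indices, validation_indices) for each of k folds."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size


folds = list(k_fold_indices(n_samples=10, k=3))
for train, validation in folds:
    print("train:", train, "validation:", validation)
```

Every sample lands in a validation fold exactly once, and never in the training set of that same fold.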
Difference between supervised and unsupervised learning?
Supervised has labeled data. You know the answer you are trying to predict. Regression and classification. Unsupervised has unlabeled data. You are finding patterns without predefined answers. Clustering and dimensionality reduction.
How does a decision tree work?
It splits data based on features to create homogeneous groups. Each split tries to maximize information gain or minimize impurity. It is intuitive and interpretable. But single trees overfit easily. That is why ensembles like random forests exist.
What is a random forest?
An ensemble of decision trees. Each tree is trained on a bootstrap sample of the rows, and each split considers only a random subset of the features. The final prediction is the average for regression or the majority vote for classification. It reduces overfitting. It handles non-linear relationships well. It is the workhorse algorithm for structured data.
Explain gradient boosting.
It builds trees sequentially. Each new tree corrects the errors of the previous ones. XGBoost, LightGBM, CatBoost are implementations. They are powerful and often win competitions. They are also prone to overfitting if not tuned properly.
What is the difference between bagging and boosting?
Bagging trains models in parallel. Each model is independent. Reducing variance is the goal. Boosting trains models sequentially. Each model learns from previous errors. Reducing bias is the goal. Random forest is bagging. XGBoost is boosting.
How do you evaluate a classification model?
Accuracy, precision, recall, F1-score. But accuracy is misleading for imbalanced datasets. If 95 percent of samples are class A, a model that always predicts A has 95 percent accuracy and is useless. Use precision and recall. Use the confusion matrix. Understand the tradeoffs.
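The 95 percent example above can be checked by hand. This sketch builds the always-predict-A model and computes the metrics from the confusion matrix counts. Treating class 1 as the rare positive class and returning 0.0 when a denominator is zero are my conventions for the example.

```python
# 95 samples of the majority class (0) and 5 of the minority class (1).
y_true = [0] * 95 + [1] * 5
# A useless model that always predicts the majority class.
y_pred = [0] * 100

pairs = list(zip(y_true, y_pred))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)
fp = sum(1 for t, p in pairs if t == 0 and p == 1)
fn = sum(1 for t, p in pairs if t == 1 and p == 0)
tn = sum(1 for t, p in pairs if t == 0 and p == 0)

accuracy = (tp + tn) / len(y_true)
# Convention for this sketch: report 0.0 when the denominator is zero.
precision = tp / (tp + fp) if (tp + fp) else 0.0
recall = tp / (tp + fn) if (tp + fn) else 0.0

print(accuracy, precision, recall)  # 0.95 0.0 0.0
```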
What is the ROC curve and AUC?
ROC plots true positive rate against false positive rate at different thresholds. AUC is the area under it. Higher AUC means better discrimination. AUC of 0.5 is random. AUC of 1.0 is perfect. It is a good overall metric for balanced classification.
How do you handle an imbalanced dataset?
Resampling. Oversample the minority class or undersample the majority. SMOTE is a popular oversampling technique. Use appropriate metrics like precision-recall instead of accuracy. Use algorithms that handle imbalance well. This problem is extremely common in real-world data.
What is regularization?
Adding a penalty to model complexity. L1, Lasso, can reduce coefficients to zero, performing feature selection. L2, Ridge, shrinks coefficients but keeps all features. Regularization prevents overfitting. It is a fundamental technique.
What is the difference between K-means and hierarchical clustering?
K-means partitions data into K clusters. You specify K. It is fast and scales well. Hierarchical clustering builds a tree of clusters. You do not need to specify K. It is slower and does not scale to huge datasets. Both have their place.
What is PCA?
Principal Component Analysis. Reduces dimensionality by finding directions of maximum variance. It transforms correlated features into uncorrelated components. Used for visualization, noise reduction, and speeding up other algorithms. The components are linear combinations of original features, which makes interpretation tricky.
Explain feature engineering.
Creating new features from existing ones to improve model performance. It is where domain expertise meets data science. It is often more impactful than algorithm choice. A good feature can boost performance more than switching from a basic model to an advanced one.
What is the difference between generative and discriminative models?
Discriminative models learn the boundary between classes. Generative models learn the distribution of each class. Naive Bayes is generative. Logistic regression is discriminative. Discriminative often performs better with enough data. Generative can work with less.
How do you select features for a model?
Filter methods. Statistical tests. Wrapper methods. Recursive feature elimination. Embedded methods. Lasso regularization. Domain knowledge. Correlation analysis. The best approach combines automated methods with human judgment.
The SQL Questions You Cannot Afford to Miss
Data lives in databases. SQL is how you talk to databases. Not knowing SQL as a data scientist is a serious gap. These questions test practical query skills.
Write a query to find the second highest salary from an employee table.
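Use a subquery with LIMIT and OFFSET on the distinct salaries, or use a window function like DENSE_RANK. Watch out for ties. If two people share the top salary, a plain OFFSET 1 without DISTINCT returns the top salary again. This is a classic. It tests whether you can think beyond basic SELECT statements.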
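Here are two ways to write it, run through Python's built-in sqlite3 module so you can try them without a database server. The table contents are invented, and LIMIT/OFFSET syntax varies between database engines.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employee (name TEXT, salary INTEGER)")
# Hypothetical rows; note the tie at the top salary.
conn.executemany(
    "INSERT INTO employee VALUES (?, ?)",
    [("Asha", 90000), ("Ben", 120000), ("Chen", 120000), ("Dev", 75000)],
)

# Approach 1: distinct salaries, sorted descending, skip the first one.
via_offset = conn.execute(
    """
    SELECT DISTINCT salary
    FROM employee
    ORDER BY salary DESC
    LIMIT 1 OFFSET 1
    """
).fetchone()[0]

# Approach 2: the highest salary that is below the overall maximum.
via_subquery = conn.execute(
    """
    SELECT MAX(salary)
    FROM employee
    WHERE salary < (SELECT MAX(salary) FROM employee)
    """
).fetchone()[0]

print(via_offset, via_subquery)  # 90000 90000
conn.close()
```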
What is the difference between WHERE and HAVING?
WHERE filters rows before aggregation. HAVING filters groups after aggregation. You cannot use aggregate functions in WHERE. You use HAVING for that. This distinction comes up constantly in real queries.
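Both clauses can appear in the same query, which makes the order of operations visible. A small sketch using sqlite3 with an invented sales table:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount INTEGER, status TEXT)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [
        ("north", 80, "paid"),
        ("north", 50, "paid"),
        ("north", 500, "refunded"),
        ("south", 60, "paid"),
        ("south", 30, "paid"),
        ("east", 200, "paid"),
    ],
)

rows = conn.execute(
    """
    SELECT region, SUM(amount) AS total
    FROM sales
    WHERE status = 'paid'        -- row filter, applied before grouping
    GROUP BY region
    HAVING SUM(amount) > 100     -- group filter, applied after aggregation
    ORDER BY region
    """
).fetchall()

print(rows)  # [('east', 200), ('north', 130)]
conn.close()
```

The refunded row is gone before the sums are computed, and south is dropped afterwards because its total of 90 fails the HAVING test.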
Explain different types of JOINs.
INNER JOIN returns matching rows from both tables. LEFT JOIN returns all rows from the left table and matching rows from the right. RIGHT JOIN is the reverse. FULL OUTER JOIN returns all rows from both. Knowing when to use each prevents data loss and incorrect results.
What is a subquery?
A query inside another query. Used in WHERE, FROM, or SELECT clauses. They can be powerful. They can also be slow if not written carefully. Sometimes a JOIN is better. Knowing the tradeoffs matters.
How do you optimize a slow SQL query?
Check the execution plan. Add indexes. Avoid SELECT *. Filter early with WHERE. Limit subqueries. Use appropriate JOIN types. Query optimization is a practical skill that separates people who have worked with real data from those who have only run tutorials.
What are window functions?
Functions that perform calculations across a set of rows related to the current row. ROW_NUMBER, RANK, LAG, LEAD. They are incredibly useful for running totals, rankings, and comparisons without losing row-level detail.
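A running total and a previous-row comparison, sketched with sqlite3. This assumes the SQLite library bundled with your Python is version 3.25 or newer, which is when window functions were added. The table is invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE daily_sales (day INTEGER, amount INTEGER)")
conn.executemany(
    "INSERT INTO daily_sales VALUES (?, ?)",
    [(1, 100), (2, 150), (3, 120)],
)

rows = conn.execute(
    """
    SELECT day,
           amount,
           SUM(amount) OVER (ORDER BY day) AS running_total,
           LAG(amount) OVER (ORDER BY day) AS previous_amount
    FROM daily_sales
    ORDER BY day
    """
).fetchall()

for row in rows:
    print(row)
# (1, 100, 100, None)
# (2, 150, 250, 100)
# (3, 120, 370, 150)
conn.close()
```

Every input row is still there in the output. A GROUP BY would have collapsed them.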
What is a primary key?
A column or set of columns that uniquely identifies each row. It cannot contain NULL values. Every table should have one. It is fundamental to database design.
What is a foreign key?
A column that creates a relationship between two tables. It references the primary key of another table. It enforces referential integrity. It prevents orphaned records.
How do you handle NULL values in SQL?
Use IS NULL or IS NOT NULL. Do not write = NULL. NULL is not a value, it is the absence of a value, so a comparison with it is never true and the filter silently matches nothing. Use COALESCE to replace NULLs with a default. NULL handling trips up even experienced people.
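The trap is easy to demonstrate. A minimal sqlite3 sketch with an invented users table:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, phone TEXT)")
conn.executemany(
    "INSERT INTO users VALUES (?, ?)",
    [("Asha", "555-0101"), ("Ben", None), ("Chen", None)],
)

# Comparing with = NULL never evaluates to true, so nothing matches.
equals_null = conn.execute(
    "SELECT COUNT(*) FROM users WHERE phone = NULL"
).fetchone()[0]

# IS NULL is the correct test.
is_null = conn.execute(
    "SELECT COUNT(*) FROM users WHERE phone IS NULL"
).fetchone()[0]

# COALESCE swaps NULL for a default value in the output.
with_default = conn.execute(
    "SELECT name, COALESCE(phone, 'unknown') FROM users ORDER BY name"
).fetchall()

print(equals_null, is_null)  # 0 2
print(with_default)
conn.close()
```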
Write a query to find duplicate records.
GROUP BY the columns that should be unique. Use HAVING COUNT(*) > 1. This is a practical data cleaning task that comes up constantly.
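As a sketch, again through sqlite3. The signups table and the choice of email plus city as the columns that should be unique are assumptions for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE signups (email TEXT, city TEXT)")
conn.executemany(
    "INSERT INTO signups VALUES (?, ?)",
    [
        ("a@example.com", "Pune"),
        ("b@example.com", "Delhi"),
        ("a@example.com", "Pune"),    # exact repeat of the first row
        ("c@example.com", "Pune"),
        ("b@example.com", "Mumbai"),  # same email, different city
    ],
)

duplicates = conn.execute(
    """
    SELECT email, city, COUNT(*) AS occurrences
    FROM signups
    GROUP BY email, city
    HAVING COUNT(*) > 1
    """
).fetchall()

print(duplicates)  # [('a@example.com', 'Pune', 2)]
conn.close()
```

Grouping by email alone would also flag b@example.com. Deciding which columns define a duplicate is the judgment part.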
The Behavioral and Scenario Questions
These feel soft. They are not. They test judgment, communication, and self-awareness. I have seen technically strong candidates lose offers here.
Describe a data science project you worked on from start to finish.
Use the framework. Problem, approach, challenges, results, lessons learned. Do not just describe what you did. Explain why you made the choices you made. The why matters more than the what.
How do you explain a complex model to a non-technical stakeholder?
Use analogies. Avoid jargon. Focus on the business impact. "The model looks at past customer behavior patterns to identify which current customers might leave. It does not predict the future perfectly, but it gives us a priority list of who to reach out to."
Tell me about a time your analysis was wrong. What happened?
Give a real failure. Not a humble brag. Explain what went wrong and what you learned. This tests honesty and the ability to learn from mistakes. Everyone makes errors. Not everyone can talk about them openly.
How do you prioritize multiple data requests from different teams?
Consider business impact and urgency. Communicate timelines clearly. Do not overpromise. If everything is high priority, nothing is. Ask stakeholders to clarify the cost of delay for each request.
How do you stay current with developments in data science?
Name specific sources. A newsletter you read. A conference you attended. A paper you found interesting. Then say something about applying what you learn. Passive consumption is not impressive. Active experimentation is.
What is the most challenging data problem you have solved?
Pick something real. Explain why it was hard. Not just technically hard. Maybe the data was messy. Maybe the stakeholder kept changing requirements. Maybe the deadline was impossible. Show how you navigated the difficulty.
How would you approach a project where the data quality is poor?
Start by quantifying the poorness. What exactly is wrong. Missing values, inconsistencies, outliers. Then prioritize fixes based on impact. Communicate limitations to stakeholders. A model built on bad data is worse than no model at all.
Why do you want to work in data science?
Be honest. Not "because it is the sexiest job of the 21st century." Say something real. The satisfaction of finding answers in data. The variety of problems. The constant learning. Something that sounds like a human being, not a LinkedIn headline.
A Quick Preparation Checklist
One. Pick your projects. Know them inside out. Be ready to talk about the problem, the approach, the hardest bug, and what you would do differently.
Two. Revise the fundamentals. Statistics basics. SQL. Python. Do not skip these for advanced ML topics. Most interviews spend more time on fundamentals.
Three. Practice thinking out loud. Have a friend ask you questions. Practice saying "I do not know, but here is how I would find out."
Four. Prepare three questions for the interviewer. About the team, the data stack, the kinds of problems they solve. Not about salary. Not yet.
Five. Sleep the night before. A clear mind is worth more than last-minute cramming.
The Honest Closing
Sixty questions is a lot. You will not be asked all of them. But if you understand the concepts behind them, you can handle whatever gets thrown at you. The interviewer is not looking for a perfect answer. They are looking for evidence that you can think, that you have done real work, and that you are someone they would not mind working with every day.
If you are still building these skills, structured preparation helps. SkillsYard's Data Science and AI program covers the practical side of all these topics. Live mentors who have worked in the industry. Projects that are real, not clean toy datasets. Mock interviews with feedback. A free demo class lets you see if the style clicks. No pressure. Just a session to watch and decide.