Top 60 Data Science Interview Questions & Answers (Python, ML, Stats) 2026

Prepare for your data science interview with 60 real questions and answers covering Python, machine learning, statistics, SQL, and behavioral rounds. Honest advice from someone who has been on both sides.

Ravi Vohra

03 Jun 2026

The Python Questions They Always Ask

Python is the language of data science. Nobody will hire you if you cannot write basic code. These questions test whether you have actually written Python or just watched tutorials about it.

What is the difference between a list and a tuple?

Lists are mutable. You can change them after creation. Tuples are immutable. Once created, they stay the same. The practical implication is that lists are for collections that change, tuples are for fixed data. Tuples are also slightly lighter on memory, and a tuple can be used as a dictionary key as long as every element inside it is hashable.
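A minimal sketch of the difference, using made-up values. The variable names are illustrative only.

```python
# A list can be changed in place; a tuple cannot.
scores = [88, 92, 79]
scores.append(95)              # fine: lists are mutable

point = (40.7, -74.0)
try:
    point[0] = 41.0            # tuples do not support item assignment
except TypeError as exc:
    error_message = str(exc)

# A tuple of hashable items can serve as a dictionary key.
city_by_coords = {point: "New York"}

# A list cannot, because lists are unhashable.
try:
    bad = {[40.7, -74.0]: "New York"}
    list_key_failed = False
except TypeError:
    list_key_failed = True
```

If an interviewer pushes further, the hashability detail is the part worth mentioning: a tuple that contains a list still cannot be a dictionary key.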

What are list comprehensions?

A compact way to create lists. Instead of writing a for loop, you write the expression inside square brackets. It is faster and more readable once you get used to it. But do not nest them three levels deep. That becomes unreadable and you will confuse yourself six months later.
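Here is the same task written both ways, as a small sketch with made-up numbers: collect the squares of the even values.

```python
numbers = [1, 2, 3, 4, 5, 6]

# Loop version
even_squares_loop = []
for n in numbers:
    if n % 2 == 0:
        even_squares_loop.append(n * n)

# Comprehension version: expression, then the loop, then the optional filter
even_squares = [n * n for n in numbers if n % 2 == 0]
```

Both produce the same list. The comprehension reads left to right as "n squared, for each n, if n is even."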

How do you handle missing values in a Pandas DataFrame?

First, you figure out why they are missing. The reason matters. Then you either drop them with dropna if the missing data is minimal, or fill them with fillna using mean, median, mode, or a more sophisticated imputation method. The choice depends on the data and the problem. There is no universal correct answer. That is the point of the question.
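A short sketch of the inspect, drop, and fill steps on a toy DataFrame. The column names and the choice of median and mode are assumptions for illustration, not a recommendation for every dataset.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age": [25, np.nan, 31, 40],
    "city": ["Pune", "Delhi", None, "Delhi"],
})

# Step 1: look before you touch anything
missing_per_column = df.isna().sum()

# Option A: drop rows that have any missing value
dropped = df.dropna()

# Option B: fill numeric columns with the median, categorical with the mode
filled = df.copy()
filled["age"] = filled["age"].fillna(filled["age"].median())
filled["city"] = filled["city"].fillna(filled["city"].mode()[0])
```

Dropping halves this tiny dataset, which is exactly why "drop only if the missing data is minimal" is the usual advice.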

What is the difference between apply and map in Pandas?

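Apply works on both Series and DataFrames. On a DataFrame it passes a whole column or row to your function, depending on the axis. Map is the element-wise tool. Series.map accepts a dictionary, another Series, or a function. For a long time map existed only on Series. Since pandas 2.1 there is also DataFrame.map, which replaces the older applymap. Apply is more flexible. Map is simpler for element-wise substitutions. I use map for simple replacements and apply for anything that needs a custom function.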
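A small sketch of where each one fits. The DataFrame and its column names are made up for the example.

```python
import pandas as pd

df = pd.DataFrame({
    "grade": ["a", "b", "a"],
    "math": [70, 85, 90],
    "art": [80, 60, 75],
})

# Series.map with a dict: simple element-wise substitution
df["grade_label"] = df["grade"].map({"a": "pass", "b": "review"})

# Series.apply with a function: element-wise custom logic
df["math_curved"] = df["math"].apply(lambda x: min(x + 5, 100))

# DataFrame.apply with axis=1: the function receives one row at a time
df["best_score"] = df[["math", "art"]].apply(lambda row: row.max(), axis=1)
```

In real code the last line would be better written as a vectorized `df[["math", "art"]].max(axis=1)`. The apply version is here only to show what the function receives.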

How do you merge two DataFrames?

The merge function. Like a JOIN in SQL. You specify the key column and the type of merge. Inner, outer, left, right. Understanding these joins is essential. Most real-world data requires merging multiple sources.
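A quick sketch of three join types on two toy tables. The table and column names are hypothetical.

```python
import pandas as pd

customers = pd.DataFrame({"customer_id": [1, 2, 3], "name": ["Asha", "Ben", "Chen"]})
orders = pd.DataFrame({"customer_id": [1, 1, 4], "amount": [250, 100, 75]})

# Inner: only customer_id values present in both tables
inner = customers.merge(orders, on="customer_id", how="inner")

# Left: every customer, with NaN where no order exists
left = customers.merge(orders, on="customer_id", how="left")

# Outer: everything from both sides; indicator=True adds a _merge column
outer = customers.merge(orders, on="customer_id", how="outer", indicator=True)
merge_sources = sorted(outer["_merge"].astype(str).unique())
```

The indicator column is a handy debugging habit. It tells you which rows matched and which came from only one side, so a surprising row count is easy to trace.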

What is a lambda function?

An anonymous, one-line function. Useful for small operations you do not want to define a full function for. They are handy inside apply statements. They become unreadable if you try to do too much in one line. Keep them simple.

Explain train_test_split and why it is used.

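It splits your data into training and testing sets. The model learns on the training data. You evaluate it on the testing data. The split does not prevent overfitting. It lets you detect it, because a model that memorized the training set will score noticeably worse on data it has never seen. If you train and test on the same data, your evaluation is meaningless. The model has already seen the answers.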
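A minimal sketch using scikit-learn's bundled iris dataset. The 80/20 split, the random seed, and the choice of logistic regression are assumptions for the example.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# stratify=y keeps the class proportions the same in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

train_accuracy = model.score(X_train, y_train)
test_accuracy = model.score(X_test, y_test)   # the number you report
```

A large gap between the two accuracy values is the overfitting signal. The test score is the honest one.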

What libraries do you use most often in data science projects?

Pandas for data manipulation. NumPy for numerical operations. Matplotlib and Seaborn for visualization. Scikit-learn for machine learning. Statsmodels for statistical modeling. Name them. Then explain briefly when you use each one.

How do you read a CSV file with Pandas?

pd.read_csv. It sounds trivial. But interviewers ask this to see if you have ever actually opened a CSV file in Python. Mention parameters like encoding and handling bad lines. That signals real experience.
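A sketch of reading a messy CSV. The data is an in-memory string so the example is self-contained, and it assumes pandas 1.3 or newer, where the on_bad_lines parameter is available. With a real file you would pass a path and usually an encoding as well.

```python
import io

import pandas as pd

# The second data row has an extra field, which makes it a "bad line"
raw = "id,name,score\n1,Asha,90\n2,Ben,85,extra\n3,Chen,\n"

df = pd.read_csv(
    io.StringIO(raw),
    on_bad_lines="skip",      # skip rows with too many fields instead of failing
    dtype={"id": "int64"},    # be explicit about types you rely on
)

# With a file on disk (hypothetical path):
# df = pd.read_csv("data/customers.csv", encoding="utf-8", on_bad_lines="skip")
```

Silently skipping rows is a tradeoff. In an interview, say that you would count how many rows were dropped before trusting the result.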

What is the difference between a Series and a DataFrame?

A Series is one column. A DataFrame is multiple columns. A DataFrame is essentially a collection of Series. Simple distinction. But it matters for understanding how Pandas structures data.

How do you remove duplicates from a DataFrame?

drop_duplicates. You can specify a subset of columns to check for duplicates. You can keep the first or last occurrence. Real data is full of duplicates that are not exact duplicates. Cleaning them requires judgment.
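A sketch of exact versus near duplicates. The email data and the normalization rule (strip whitespace, lowercase) are assumptions for the example.

```python
import pandas as pd

df = pd.DataFrame({
    "email": ["a@x.com", "A@x.com ", "b@x.com", "a@x.com"],
    "plan": ["free", "pro", "free", "free"],
})

# Exact duplicates only: the last row repeats the first one
exact = df.drop_duplicates()

# Near duplicates: normalize first, then deduplicate on a subset of columns
normalized = df.assign(email=df["email"].str.strip().str.lower())
by_email_last = normalized.drop_duplicates(subset="email", keep="last")
```

The judgment call is in the normalization step and in keep. Here keep="last" throws away the "pro" row, which may or may not be what the business wants.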

What is iloc and loc?

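iloc uses integer positions. loc uses labels. This confuses beginners constantly. iloc for row number. loc for row index name. The two also slice differently: an iloc slice excludes the end position, while a loc slice includes the end label. Knowing the difference prevents silent bugs.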
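A small sketch where the index labels are deliberately not 0, 1, 2, so position and label cannot be confused. The data is made up.

```python
import pandas as pd

df = pd.DataFrame({"sales": [100, 200, 300, 400]}, index=[10, 20, 30, 40])

by_position = df.iloc[0]["sales"]   # first row by position
by_label = df.loc[10]["sales"]      # row whose index label is 10

# There is no label 0, so loc[0] fails even though iloc[0] works
try:
    df.loc[0]
    label_zero_missing = False
except KeyError:
    label_zero_missing = True

# Slices differ: iloc excludes the end position, loc includes the end label
iloc_slice = df.iloc[0:2]     # positions 0 and 1
loc_slice = df.loc[10:30]     # labels 10, 20, and 30
```

The bug usually shows up after filtering or sorting, when the index no longer matches the row positions.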

How do you iterate over a DataFrame?

You generally should not. Vectorized operations are faster. But if you must, iterrows or itertuples. itertuples is faster. Mention that loops should be a last resort in Pandas. That shows you understand performance.
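A sketch of the same calculation done the vectorized way and with itertuples. The column names are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({"price": [10.0, 20.0, 30.0], "qty": [1, 2, 3]})

# Preferred: vectorized, one operation over whole columns
df["total"] = df["price"] * df["qty"]

# Last resort: itertuples yields lightweight namedtuples, one per row
loop_totals = [row.price * row.qty for row in df.itertuples(index=False)]
```

On three rows you will not see a difference. On a few million rows the vectorized line is typically much faster, because the loop runs in compiled code instead of Python.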

What is a virtual environment and why use one?

An isolated Python environment for a project. It keeps dependencies separate. Without it, different projects with conflicting library versions break each other. This is one of those things that sounds boring but will save you from nightmare debugging sessions.

Write a function to check if a string is a palindrome.

They want to see if you can code something simple without googling. Return string equals string reversed. Keep it clean. Use slicing. Show you can write basic logic without overcomplicating it.
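One way to write it, as a sketch. Ignoring case and punctuation is an assumption; ask the interviewer whether they want that or a strict character-by-character comparison.

```python
def is_palindrome(text: str) -> bool:
    """Return True if text reads the same forwards and backwards."""
    # Assumption: ignore case and anything that is not a letter or digit
    cleaned = "".join(ch.lower() for ch in text if ch.isalnum())
    return cleaned == cleaned[::-1]   # slicing with step -1 reverses the string
```

Asking the clarifying question before you code is worth as much as the code itself.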

The Statistics Questions That Expose Shallow Understanding

Statistics is the foundation. A lot of people skip it because it is not as shiny as machine learning. Interviewers know this. They probe here deliberately.

What is p-value?

The probability of observing results at least as extreme as yours, assuming the null hypothesis is true. A low p-value means data like yours would be unlikely if the null hypothesis were true, which counts as evidence against it. But p-value is not the probability that the null hypothesis is false, and it is not the probability that your result happened by chance. Those misinterpretations are everywhere. Knowing the difference signals real understanding.

Explain the difference between correlation and causation.

Correlation means two variables move together. Causation means one variable directly affects the other. Ice cream sales and drowning deaths are correlated. Both increase in summer. Ice cream does not cause drowning. Heat causes both. Confounding variables are the reason correlation is not causation.

What is the Central Limit Theorem?

The distribution of sample means approaches a normal distribution as sample size increases, regardless of the shape of the original distribution, as long as the observations are independent and the population has finite variance. This is why many statistical methods work. It is the foundation of inferential statistics. Know it intuitively, not just mathematically.
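A quick simulation makes the idea concrete. This is a sketch with assumed settings: an exponential population (heavily right-skewed, mean 1, standard deviation 1), samples of size 50, and a fixed random seed.

```python
import numpy as np

rng = np.random.default_rng(0)

sample_size = 50
n_samples = 10_000

# Each row is one sample of 50 draws from a skewed population
samples = rng.exponential(scale=1.0, size=(n_samples, sample_size))
sample_means = samples.mean(axis=1)

mean_of_means = sample_means.mean()
sd_of_means = sample_means.std(ddof=1)

# Theory: the means cluster around 1 with spread sigma / sqrt(n)
expected_sd = 1.0 / np.sqrt(sample_size)
```

Plot a histogram of sample_means and you get a bell shape, even though the raw draws look nothing like one.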

What is the difference between Type I and Type II error?

Type I is a false positive. Rejecting a true null hypothesis. Type II is a false negative. Failing to reject a false null hypothesis. The tradeoff between them depends on the context. In medical testing, a false negative might be worse than a false positive. In spam detection, a false positive, good email marked as spam, is worse than a false negative.

What is standard deviation versus standard error?

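Standard deviation measures variability in your data. Standard error measures variability in your sample mean estimate. It equals the standard deviation divided by the square root of the sample size, so it is smaller than the standard deviation for any sample with more than one observation, and it shrinks as the sample grows. Confusing them is a common beginner mistake.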
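A worked example with five made-up height measurements, using only the standard library.

```python
import math
import statistics

heights_cm = [160, 165, 170, 175, 180]

sd = statistics.stdev(heights_cm)           # sample standard deviation
se = sd / math.sqrt(len(heights_cm))        # standard error of the mean
```

Here the standard deviation is about 7.91 cm and the standard error is about 3.54 cm. Collect four times as many observations with the same spread and the standard error halves, while the standard deviation stays roughly where it is.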

What is a confidence interval?

A range that likely contains the true population parameter. A 95 percent confidence interval means that if you repeated the sampling process many times, 95 percent of the intervals would contain the true value. It is not the probability that the true value is in your specific interval. That distinction is subtle but important.

What is a normal distribution?

A symmetric, bell-shaped distribution defined by mean and standard deviation. Many natural phenomena approximate it. Many statistical tests assume it. But real data is often not normal. Knowing when the assumption holds is more important than knowing the formula.

How do you detect outliers?

Box plots. Z-scores. IQR method. But detection is easy. Deciding what to do with outliers is hard. Are they errors? Are they genuine extreme values? The context decides whether you remove, transform, or keep them.
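A sketch of the IQR and z-score methods on a small made-up sample. The 1.5 × IQR fence and the z-score cutoff of 3 are the usual conventions, not laws.

```python
import numpy as np

values = np.array([12, 14, 15, 15, 16, 17, 18, 19, 95])

# IQR method: flag anything beyond 1.5 * IQR from the quartiles
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
iqr_outliers = values[(values < lower) | (values > upper)]

# Z-score method: flag anything more than 3 standard deviations from the mean
z_scores = (values - values.mean()) / values.std()
z_outliers = values[np.abs(z_scores) > 3]
```

The IQR method flags 95. The z-score method flags nothing, because the extreme value inflates the very standard deviation it is measured against. In small samples that is a real limitation, and a good thing to mention.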

What is skewness and kurtosis?

Skewness measures asymmetry. Kurtosis measures tailedness. A positive skew means a long right tail. Income data is positively skewed. High kurtosis means heavy tails. More extreme values than a normal distribution. These properties affect which statistical methods are appropriate.

Explain Bayes' Theorem in simple terms.

It updates the probability of a hypothesis based on new evidence. You start with a prior probability, weigh it by how likely the new data would be if the hypothesis were true, and get a posterior probability. It is the mathematical formalization of learning from experience. The formula looks intimidating. The concept is intuitive.
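The classic worked example is a medical test. The numbers below are hypothetical: 1 percent of people have the condition, the test catches 95 percent of true cases, and it wrongly flags 5 percent of healthy people.

```python
def posterior(prior: float, sensitivity: float, false_positive_rate: float) -> float:
    """P(condition | positive test) via Bayes' theorem."""
    true_positives = sensitivity * prior
    false_positives = false_positive_rate * (1 - prior)
    return true_positives / (true_positives + false_positives)


p = posterior(prior=0.01, sensitivity=0.95, false_positive_rate=0.05)
```

The answer is about 0.16. Even after a positive result, the chance of actually having the condition is only around 16 percent, because the condition is rare and false positives from the healthy majority outnumber the true positives.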

The Machine Learning Questions That Actually Test Depth

ML questions separate the tutorial graduates from the real practitioners. The interviewer is checking whether you understand why models work, not just how to import them.

Explain bias-variance tradeoff.

Bias is error from oversimplification. Variance is error from oversensitivity to training data. High bias underfits. High variance overfits. The tradeoff is finding the sweet spot. This is the single most important concept in machine learning. If you truly understand it, you understand model selection.

What is overfitting and how do you prevent it?

The model learns noise instead of signal. It performs great on training data and terribly on new data. Prevent it with cross-validation, regularization, simpler models, more data, and early stopping. Overfitting is the most common mistake in applied machine learning.

What is cross-validation?

Splitting data into multiple folds. Training on some, validating on others. Rotating which fold is the validation set. K-fold is the standard. It gives a more reliable estimate of model performance than a single train-test split.
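A minimal 5-fold sketch with scikit-learn. The iris dataset, the decision tree, and the seeds are assumptions for the example.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Five folds; each fold takes one turn as the validation set
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=cv)

mean_score = scores.mean()
score_spread = scores.std()
```

Report both the mean and the spread. Five scores that vary a lot tell you the estimate from any single split would have been unreliable.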

Difference between supervised and unsupervised learning?

Supervised has labeled data. You know the answer you are trying to predict. Regression and classification. Unsupervised has unlabeled data. You are finding patterns without predefined answers. Clustering and dimensionality reduction.

How does a decision tree work?

It splits data based on features to create homogeneous groups. Each split tries to maximize information gain or minimize impurity. It is intuitive and interpretable. But single trees overfit easily. That is why ensembles like random forests exist.

What is a random forest?

An ensemble of decision trees. Each tree is trained on a bootstrap sample of the rows, and each split considers only a random subset of the features. The final prediction is the average for regression or the majority vote for classification. It reduces overfitting. It handles non-linear relationships well. It is the workhorse algorithm for structured data.

Explain gradient boosting.

It builds trees sequentially. Each new tree corrects the errors of the previous ones. XGBoost, LightGBM, CatBoost are implementations. They are powerful and often win competitions. They are also prone to overfitting if not tuned properly.

What is the difference between bagging and boosting?

Bagging trains models in parallel. Each model is independent. Reducing variance is the goal. Boosting trains models sequentially. Each model learns from previous errors. Reducing bias is the goal. Random forest is bagging. XGBoost is boosting.

How do you evaluate a classification model?

Accuracy, precision, recall, F1-score. But accuracy is misleading for imbalanced datasets. If 95 percent of samples are class A, a model that always predicts A has 95 percent accuracy and is useless. Use precision and recall. Use the confusion matrix. Understand the tradeoffs.
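The 95 percent example, worked through by hand. This is a sketch in plain Python so every count is visible.

```python
# 95 samples of class 0, 5 of class 1; the model predicts 0 for everyone
y_true = [0] * 95 + [1] * 5
y_pred = [0] * 100

pairs = list(zip(y_true, y_pred))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)
tn = sum(1 for t, p in pairs if t == 0 and p == 0)
fp = sum(1 for t, p in pairs if t == 0 and p == 1)
fn = sum(1 for t, p in pairs if t == 1 and p == 0)

accuracy = (tp + tn) / len(pairs)
recall = tp / (tp + fn) if (tp + fn) else 0.0
precision = tp / (tp + fp) if (tp + fp) else 0.0   # undefined here, reported as 0
```

Accuracy comes out at 0.95 and recall at 0. The model never finds a single positive case, which the accuracy number hides completely.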

What is the ROC curve and AUC?

ROC plots true positive rate against false positive rate at different thresholds. AUC is the area under it. Higher AUC means better discrimination. AUC of 0.5 is random. AUC of 1.0 is perfect. It is a good overall metric when the classes are reasonably balanced. With heavy imbalance, a precision-recall curve is usually more informative.

How do you handle an imbalanced dataset?

Resampling. Oversample the minority class or undersample the majority. SMOTE is a popular oversampling technique. Use appropriate metrics like precision-recall instead of accuracy. Use algorithms that handle imbalance well. This problem is extremely common in real-world data.

What is regularization?

Adding a penalty to model complexity. L1, Lasso, can reduce coefficients to zero, performing feature selection. L2, Ridge, shrinks coefficients but keeps all features. Regularization prevents overfitting. It is a fundamental technique.
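A sketch that shows the L1 versus L2 behavior on synthetic data. Everything here is assumed for illustration: ten features of which only the first two matter, and penalty strengths picked by hand.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)

# 10 features, but the target depends only on the first two
X = rng.normal(size=(200, 10))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=200)

lasso = Lasso(alpha=0.2).fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)

lasso_zeroed = int(np.sum(lasso.coef_ == 0))   # features dropped by L1
ridge_zeroed = int(np.sum(ridge.coef_ == 0))   # L2 shrinks but does not drop
```

Lasso sets most of the irrelevant coefficients to exactly zero. Ridge keeps all ten, just smaller. In practice you would choose alpha with cross-validation rather than by hand.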

What is the difference between K-means and hierarchical clustering?

K-means partitions data into K clusters. You specify K. It is fast and scales well. Hierarchical clustering builds a tree of clusters. You do not need to specify K up front. You choose the number of clusters afterwards by deciding where to cut the tree. It is slower and does not scale to huge datasets. Both have their place.

What is PCA?

Principal Component Analysis. Reduces dimensionality by finding directions of maximum variance. It transforms correlated features into uncorrelated components. Used for visualization, noise reduction, and speeding up other algorithms. The components are linear combinations of original features, which makes interpretation tricky.

Explain feature engineering.

Creating new features from existing ones to improve model performance. It is where domain expertise meets data science. It is often more impactful than algorithm choice. A good feature can boost performance more than switching from a basic model to an advanced one.

What is the difference between generative and discriminative models?

Discriminative models learn the boundary between classes. Generative models learn the distribution of each class. Naive Bayes is generative. Logistic regression is discriminative. Discriminative often performs better with enough data. Generative can work with less.

How do you select features for a model?

Filter methods, such as statistical tests and correlation analysis. Wrapper methods, such as recursive feature elimination. Embedded methods, such as Lasso regularization. And domain knowledge. The best approach combines automated methods with human judgment.

The SQL Questions You Cannot Afford to Miss

Data lives in databases. SQL is how you talk to databases. Not knowing SQL as a data scientist is a serious gap. These questions test practical query skills.

Write a query to find the second highest salary from an employee table.

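Use a subquery with LIMIT and OFFSET on the distinct salaries, or use a window function like DENSE_RANK. Say out loud how you handle ties, because two people sharing the top salary is the case that breaks naive answers. This is a classic. It tests whether you can think beyond basic SELECT statements.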
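A runnable sketch using Python's built-in sqlite3 module, so the query can be tested without a database server. The employee table, its columns, and the rows are made up, and LIMIT/OFFSET syntax varies between database engines.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employee (name TEXT, salary INTEGER)")
conn.executemany(
    "INSERT INTO employee VALUES (?, ?)",
    [("Asha", 90000), ("Ben", 120000), ("Chen", 120000), ("Dev", 75000)],
)

# DISTINCT matters: without it, the tie at 120000 would come back as "second"
second_highest = conn.execute(
    "SELECT DISTINCT salary FROM employee ORDER BY salary DESC LIMIT 1 OFFSET 1"
).fetchone()[0]

# Subquery form: the largest salary strictly below the maximum
via_subquery = conn.execute(
    "SELECT MAX(salary) FROM employee "
    "WHERE salary < (SELECT MAX(salary) FROM employee)"
).fetchone()[0]
```

Both queries return 90000. The subquery form is the most portable across databases.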

What is the difference between WHERE and HAVING?

WHERE filters rows before aggregation. HAVING filters groups after aggregation. You cannot use aggregate functions in WHERE. You use HAVING for that. This distinction comes up constantly in real queries.
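A sketch that uses both clauses in one query, run through sqlite3. The orders table and its contents are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount INTEGER, status TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [
        ("a", 80, "paid"),
        ("a", 50, "paid"),
        ("a", 500, "cancelled"),
        ("b", 60, "paid"),
        ("c", 200, "paid"),
    ],
)

rows = conn.execute(
    """
    SELECT customer, SUM(amount) AS total
    FROM orders
    WHERE status = 'paid'        -- row filter, applied before grouping
    GROUP BY customer
    HAVING SUM(amount) > 100     -- group filter, applied after aggregation
    ORDER BY customer
    """
).fetchall()
```

The cancelled 500 order is removed by WHERE before any sums are computed. Customer b is removed by HAVING because the paid total is only 60.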

Explain different types of JOINs.

INNER JOIN returns matching rows from both tables. LEFT JOIN returns all rows from the left table and matching rows from the right. RIGHT JOIN is the reverse. FULL OUTER JOIN returns all rows from both. Knowing when to use each prevents data loss and incorrect results.

What is a subquery?

A query inside another query. Used in WHERE, FROM, or SELECT clauses. They can be powerful. They can also be slow if not written carefully. Sometimes a JOIN is better. Knowing the tradeoffs matters.

How do you optimize a slow SQL query?

Check the execution plan. Add indexes. Avoid SELECT *. Filter early with WHERE. Limit subqueries. Use appropriate JOIN types. Query optimization is a practical skill that separates people who have worked with real data from those who have only run tutorials.

What are window functions?

Functions that perform calculations across a set of rows related to the current row. ROW_NUMBER, RANK, LAG, LEAD. They are incredibly useful for running totals, rankings, and comparisons without losing row-level detail.
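A sketch of a running total and a previous-row comparison, run through sqlite3. It assumes the bundled SQLite is version 3.25 or newer, which is when window functions were added. The sales table is made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (day INTEGER, amount INTEGER)")
conn.executemany("INSERT INTO sales VALUES (?, ?)", [(1, 100), (2, 150), (3, 120)])

rows = conn.execute(
    """
    SELECT
        day,
        amount,
        SUM(amount) OVER (ORDER BY day) AS running_total,
        LAG(amount) OVER (ORDER BY day) AS previous_amount
    FROM sales
    ORDER BY day
    """
).fetchall()
```

Every input row is still present in the output, which is the difference from GROUP BY. The first day has no previous row, so LAG returns NULL there.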

What is a primary key?

A column or set of columns that uniquely identifies each row. It cannot contain NULL values. Every table should have one. It is fundamental to database design.

What is a foreign key?

A column that creates a relationship between two tables. It references the primary key of another table. It enforces referential integrity. It prevents orphaned records.

How do you handle NULL values in SQL?

Use IS NULL or IS NOT NULL, not = NULL. NULL is not a value, it is the absence of a value, so any comparison with it evaluates to unknown and the row is filtered out. Use COALESCE to replace NULLs with a default. NULL handling trips up even experienced people.
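A sketch of the trap, run through sqlite3. The contacts table is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE contacts (name TEXT, phone TEXT)")
conn.executemany("INSERT INTO contacts VALUES (?, ?)", [("a", "111"), ("b", None)])

# Wrong: a comparison with NULL is never true, so this matches nothing
equals_null = conn.execute(
    "SELECT COUNT(*) FROM contacts WHERE phone = NULL"
).fetchone()[0]

# Right: IS NULL is the dedicated test
is_null = conn.execute(
    "SELECT COUNT(*) FROM contacts WHERE phone IS NULL"
).fetchone()[0]

# COALESCE returns the first non-NULL argument
with_default = conn.execute(
    "SELECT name, COALESCE(phone, 'unknown') FROM contacts ORDER BY name"
).fetchall()
```

The first query runs without any error and returns zero rows. That silence is what makes the bug hard to spot.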

Write a query to find duplicate records.

GROUP BY the columns that should be unique. Use HAVING COUNT(*) > 1. This is a practical data cleaning task that comes up constantly.
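A sketch of that pattern, run through sqlite3. The users table and the idea that email should be unique are assumptions for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, email TEXT)")
conn.executemany(
    "INSERT INTO users VALUES (?, ?)",
    [(1, "a@x.com"), (2, "b@x.com"), (3, "a@x.com")],
)

duplicates = conn.execute(
    """
    SELECT email, COUNT(*) AS occurrences
    FROM users
    GROUP BY email
    HAVING COUNT(*) > 1
    """
).fetchall()
```

This tells you which values are duplicated and how often. To see the full duplicate rows, join the result back to the table on the same columns.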

The Behavioral and Scenario Questions

These feel soft. They are not. They test judgment, communication, and self-awareness. I have seen technically strong candidates lose offers here.

Describe a data science project you worked on from start to finish.

Use the framework. Problem, approach, challenges, results, lessons learned. Do not just describe what you did. Explain why you made the choices you made. The why matters more than the what.

How do you explain a complex model to a non-technical stakeholder?

Use analogies. Avoid jargon. Focus on the business impact. "The model looks at past customer behavior patterns to identify which current customers might leave. It does not predict the future perfectly, but it gives us a priority list of who to reach out to."

Tell me about a time your analysis was wrong. What happened?

Give a real failure. Not a humble brag. Explain what went wrong and what you learned. This tests honesty and the ability to learn from mistakes. Everyone makes errors. Not everyone can talk about them openly.

How do you prioritize multiple data requests from different teams?

Consider business impact and urgency. Communicate timelines clearly. Do not overpromise. If everything is high priority, nothing is. Ask stakeholders to clarify the cost of delay for each request.

How do you stay current with developments in data science?

Name specific sources. A newsletter you read. A conference you attended. A paper you found interesting. Then say something about applying what you learn. Passive consumption is not impressive. Active experimentation is.

What is the most challenging data problem you have solved?

Pick something real. Explain why it was hard. Not just technically hard. Maybe the data was messy. Maybe the stakeholder kept changing requirements. Maybe the deadline was impossible. Show how you navigated the difficulty.

How would you approach a project where the data quality is poor?

Start by quantifying the poorness. What exactly is wrong. Missing values, inconsistencies, outliers. Then prioritize fixes based on impact. Communicate limitations to stakeholders. A model built on bad data is worse than no model at all.

Why do you want to work in data science?

Be honest. Not "because it is the sexiest job of the 21st century." Say something real. The satisfaction of finding answers in data. The variety of problems. The constant learning. Something that sounds like a human being, not a LinkedIn headline.

A Quick Preparation Checklist

One. Pick your projects. Know them inside out. Be ready to talk about the problem, the approach, the hardest bug, and what you would do differently.

Two. Revise the fundamentals. Statistics basics. SQL. Python. Do not skip these for advanced ML topics. Most interviews spend more time on fundamentals.

Three. Practice thinking out loud. Have a friend ask you questions. Practice saying "I do not know, but here is how I would find out."

Four. Prepare three questions for the interviewer. About the team, the data stack, the kinds of problems they solve. Not about salary. Not yet.

Five. Sleep the night before. A clear mind is worth more than last-minute cramming.

The Honest Closing

Sixty questions is a lot. You will not be asked all of them. But if you understand the concepts behind them, you can handle whatever gets thrown at you. The interviewer is not looking for a perfect answer. They are looking for evidence that you can think, that you have done real work, and that you are someone they would not mind working with every day.

If you are still building these skills, structured preparation helps. SkillsYard's Data Science and AI program covers the practical side of all these topics. Live mentors who have worked in the industry. Projects that are real, not clean toy datasets. Mock interviews with feedback. A free demo class lets you see if the style clicks. No pressure. Just a session to watch and decide.

Frequently Asked Questions

1. How many of these 60 questions should I expect in a typical data science interview?

2. Is Python mandatory, or can I use R for data science interviews?

3. How deep should my machine learning knowledge be for a fresher role?