Data scientist interviews in 2026 test five areas: statistics and probability (still the foundation), SQL proficiency, machine learning theory, product sense and business cases, and coding. The typical loop at FAANG is a phone screen with coding followed by an on-site of 4-5 rounds (SQL, stats, ML, product case, behavioral). New trend: LLM and GenAI questions appear increasingly often, especially at AI-focused companies.
Statistics and probability (the brain teasers)
- [Facebook] 'Fair coin vs unfair coin (both sides tails). You pick one randomly, flip 5 times, all tails. Probability you picked the unfair coin?' — ~97% via Bayes' Theorem. P(unfair|5T) = P(5T|unfair) * P(unfair) / P(5T) = 1 * 0.5 / (1*0.5 + (1/32)*0.5).
- [Google] 'A coin flipped 1000 times lands heads 550 times. Is it biased?' — Z-score = (550-500)/sqrt(1000*0.25) = 3.16. Two-sided p ≈ 0.0016, so reject the null hypothesis of a fair coin at the 0.05 and 0.01 levels.
- [Amazon] 'Disease affects 1 in 1000. Test is 98% accurate for infected, 1% false positive. Someone tests positive. Actual probability of disease?' — Only ~9% via Bayes. P(D|+) = 0.001*0.98 / (0.001*0.98 + 0.999*0.01). The low base rate dominates.
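- [Two Sigma] 'Expected number of coin flips to get two consecutive heads?' — 6 flips. Solve with a recurrence: let E be the expected flips from scratch and E_H the expected flips after one head. Then E = 1 + (1/2)E_H + (1/2)E and E_H = 1 + (1/2)E, which gives E = 6. This tests whether you can set up and solve Markov chains.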
- [Meta] 'Explain a confidence interval to a non-technical person.' — 'If we repeated this study 100 times, about 95 of those intervals would contain the true value.' NOT 'there is a 95% chance the true value is in this interval.' Getting this wrong is an instant credibility hit.
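The arithmetic behind the coin, disease, and consecutive-heads answers above can be checked in a few lines. This is a minimal sketch using only the Python standard library; the helper names `posterior` and `expected_flips_two_heads` are illustrative, not part of any library.

```python
from fractions import Fraction
from math import sqrt
from statistics import NormalDist


def posterior(prior, likelihood, likelihood_alt):
    """Bayes' theorem for a hypothesis versus its single alternative."""
    evidence = prior * likelihood + (1 - prior) * likelihood_alt
    return prior * likelihood / evidence


def expected_flips_two_heads(p):
    """Closed form of the recurrence E = 1 + p*E_H + (1-p)*E, E_H = 1 + (1-p)*E."""
    return (1 + p) / p**2


# Two-tailed coin vs fair coin after five tails in a row.
p_unfair = posterior(Fraction(1, 2), Fraction(1), Fraction(1, 32))  # 32/33

# Disease: 0.1% prevalence, 98% sensitivity, 1% false-positive rate.
p_disease = posterior(Fraction(1, 1000), Fraction(98, 100), Fraction(1, 100))  # 98/1097

# Normal-approximation z-test for 550 heads in 1000 flips of a fair coin.
n, heads, p0 = 1000, 550, 0.5
z = (heads - n * p0) / sqrt(n * p0 * (1 - p0))
p_two_sided = 2 * (1 - NormalDist().cdf(z))

print(float(p_unfair), float(p_disease), round(z, 2), p_two_sided)
print(expected_flips_two_heads(Fraction(1, 2)))  # 6
```

Using `Fraction` keeps the Bayes results exact, which makes it easy to state the answer as 32/33 or 98/1097 before rounding to a percentage.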
A/B testing (asked at every company)
- 'Describe A/B testing and common pitfalls.' — Peeking (stopping early when results look significant), novelty effects (users engage more just because it is new), network effects and spillover between groups, Simpson's paradox in subgroups, multiple comparisons without a correction such as Bonferroni, underpowered tests.
- 'Your A/B test shows a 2% lift in conversion with p=0.03. Should we launch?' — Statistically significant at the usual 0.05 level, yes. But check: is 2% practically significant given implementation cost? Is the sample size adequate? Were multiple variants tested (multiple comparison issue)? Is there segment-level Simpson's paradox?
- 'How do you determine sample size for an A/B test?' — Power analysis: specify minimum detectable effect, significance level (usually 0.05), power (usually 0.80), and baseline conversion rate. Tools: statsmodels in Python. Underpowered tests are the most common experimental design mistake.
- 'What is a switchback experiment and when do you use it?' — Time-based treatment/control alternation for marketplace experiments where user-level randomization causes interference. Used at Uber, Lyft, DoorDash for pricing experiments.
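The power-analysis inputs listed above (minimum detectable effect, significance level, power, baseline rate) map directly onto the textbook sample-size formula for comparing two proportions. The sketch below uses only the standard library; `sample_size_per_group` is a hypothetical helper, the 10% baseline and 2-point lift are made-up inputs, and a library such as statsmodels may use a slightly different approximation and return somewhat different numbers.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_group(baseline, mde_abs, alpha=0.05, power=0.80):
    """Approximate users needed per arm for a two-sided two-proportion z-test.

    baseline: control conversion rate (e.g. 0.10)
    mde_abs:  minimum detectable effect as an absolute difference (e.g. 0.02)
    """
    p1, p2 = baseline, baseline + mde_abs
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for alpha
    z_beta = NormalDist().inv_cdf(power)           # quantile for desired power
    variance = p1 * (1 - p1) + p2 * (1 - p2)       # sum of per-arm variances
    return ceil((z_alpha + z_beta) ** 2 * variance / mde_abs**2)


# Illustrative inputs: 10% baseline, detect a lift to 12% or to 11%.
n_per_group = sample_size_per_group(baseline=0.10, mde_abs=0.02)
n_small_effect = sample_size_per_group(baseline=0.10, mde_abs=0.01)
print(n_per_group, n_small_effect)
```

Halving the detectable effect roughly quadruples the required sample, which is why the minimum detectable effect is usually the first number to negotiate with stakeholders.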
Machine learning theory
- 'Explain random forests vs gradient boosting. When would you choose each?' — Random forests: parallel trees, less prone to overfitting, faster to train, good default. Gradient boosting: sequential trees, often higher accuracy, needs careful tuning, prone to overfitting. Choose RF for quick baseline; XGBoost/LightGBM when you need peak performance and have time to tune.
- 'What is the curse of dimensionality?' — As features increase, data becomes sparse in high-dimensional space, distances become meaningless, and models need exponentially more data. Solutions: PCA, feature selection, regularization, domain knowledge to reduce features.
- 'How do you detect and handle model drift in production?' — Monitor prediction distributions, feature distributions, and performance metrics over time. Statistical tests (KS test, PSI) for distribution shift. Retrain triggers when drift exceeds threshold. Shadow deployment for new models.
- 'Explain cross-validation. Why not just use a train/test split?' — K-fold cross-validation gives a more robust estimate of model performance by using all data for both training and validation. Single train/test split is vulnerable to lucky/unlucky splits. Use stratified K-fold for imbalanced classes.
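For the drift question above, the Population Stability Index (PSI) is simple enough to write out by hand. The following is a rough sketch under stated assumptions: equal-width bins taken from the baseline sample, a small floor `eps` to avoid dividing by zero, and synthetic Gaussian data standing in for a real feature. The 0.1 and 0.25 cut-offs in the comment are a common rule of thumb, not a formal standard.

```python
import math
import random


def psi(expected, actual, bins=10, eps=1e-6):
    """Population Stability Index between a baseline and a current sample."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 if baseline is constant

    def proportions(values):
        counts = [0] * bins
        for v in values:
            # Clamp so values outside the baseline range land in the end bins.
            idx = min(max(int((v - lo) / width), 0), bins - 1)
            counts[idx] += 1
        return [max(c / len(values), eps) for c in counts]

    e, a = proportions(expected), proportions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


rng = random.Random(0)
baseline = [rng.gauss(0, 1) for _ in range(5000)]  # training-time feature
stable = [rng.gauss(0, 1) for _ in range(5000)]    # same distribution
shifted = [rng.gauss(1, 1) for _ in range(5000)]   # mean has drifted

# Rule of thumb: < 0.1 stable, 0.1 to 0.25 worth watching, > 0.25 investigate.
print(psi(baseline, stable), psi(baseline, shifted))
```

In production the same function would typically run per feature on a schedule, with the retrain trigger wired to whichever threshold the team has agreed on.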
Product case studies
- [Google] 'How would you measure the success of Google Maps?' — North Star: number of successful navigations completed. Supporting: time-to-destination accuracy, search-to-navigation conversion, daily active users, report rate of incorrect information. Layer by use case: driving, walking, transit, local business search.
- [Meta] 'Instagram engagement is down 5% this quarter. Investigate.' — Segment by surface (Feed, Stories, Reels, Explore), user cohort (new vs returning, age, geography), content type, and device. Check for product changes, algorithm updates, competitive shifts (TikTok), and seasonality. Present hypothesis tree before diving into data.
- [Amazon] 'Should Amazon offer free returns for all products?' — Frame as cost-benefit: increased conversion rate and customer trust vs return shipping costs and fraud risk. Propose an experiment: randomize free returns by product category, measure conversion lift vs cost, calculate breakeven point.
- 'Design a churn prediction model for a SaaS product.' — Define churn (no login for 30 days? Subscription cancelled?). Features: usage frequency trend, feature adoption, support ticket volume, billing issues. Model: logistic regression for interpretability, then gradient boosting for performance. Evaluation: precision at top decile (actionable for intervention team).
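The "precision at top decile" metric from the churn question can be computed directly from model scores and observed outcomes. This is a minimal sketch; `precision_at_top_fraction` is an illustrative helper and the 20 scored accounts are fabricated toy data, not results from any real model.

```python
def precision_at_top_fraction(scores, labels, fraction=0.10):
    """Share of true churners among the highest-scored `fraction` of accounts."""
    k = max(1, int(len(scores) * fraction))
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    return sum(label for _, label in ranked[:k]) / k


# Toy data: 20 accounts with scores 0.00 .. 0.95; accounts 3, 12, 17, 19 churned.
scores = [i / 20 for i in range(20)]
labels = [1 if i in {3, 12, 17, 19} else 0 for i in range(20)]

top_decile_precision = precision_at_top_fraction(scores, labels)  # top 2 accounts
base_rate = sum(labels) / len(labels)
lift = top_decile_precision / base_rate
print(top_decile_precision, base_rate, lift)  # 0.5 0.2 2.5
```

Reporting the lift over the base churn rate alongside the precision tells the intervention team how much better the ranked list is than contacting accounts at random.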
SQL challenges
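- 'Find the median salary per department without using PERCENTILE_CONT.' — Use ROW_NUMBER() ordered by salary plus a per-department COUNT(*) as n, keep the rows numbered floor((n+1)/2) and ceil((n+1)/2), and average them. The two positions coincide when n is odd and select the two middle rows when n is even. Tests window functions and mathematical reasoning.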
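- 'Write a query to find users who were active 3+ consecutive days.' — Self-join or LAG/LEAD approach, or the gaps-and-islands trick: subtract ROW_NUMBER() (ordered by date within each user) from the activity date, so that consecutive days share the same value, then GROUP BY user_id and that value and keep groups with COUNT(*) >= 3. Tests date manipulation and creative problem-solving.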
- 'Calculate retention by monthly cohort.' — JOIN users (by signup month) with activity (by activity month). COUNT DISTINCT active users per cohort per month. Divide by original cohort size. This is the most common SQL analytics question.
- 'Find the top 3 products by revenue in each category using a single query.' — DENSE_RANK() OVER (PARTITION BY category ORDER BY revenue DESC) with CTE or subquery filter. Know when to use DENSE_RANK vs ROW_NUMBER.
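The median question is easy to try out locally. The sketch below runs one possible query against an in-memory SQLite database from Python; the `employees` table, its columns, and the salary figures are made up for illustration, and it assumes a SQLite build with window-function support (3.25 or later). It keeps the middle row or rows using integer division, so `(n + 1) / 2` and `(n + 2) / 2` play the role of floor((n+1)/2) and ceil((n+1)/2).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (department TEXT, salary REAL)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?)",
    [
        ("eng", 100), ("eng", 120), ("eng", 150), ("eng", 200),  # even count
        ("ops", 60), ("ops", 70), ("ops", 90),                   # odd count
    ],
)

MEDIAN_SQL = """
WITH ranked AS (
    SELECT department,
           salary,
           ROW_NUMBER() OVER (PARTITION BY department ORDER BY salary) AS rn,
           COUNT(*)     OVER (PARTITION BY department)                 AS n
    FROM employees
)
SELECT department, AVG(salary) AS median_salary
FROM ranked
WHERE rn IN ((n + 1) / 2, (n + 2) / 2)  -- integer division picks the middle row(s)
GROUP BY department
ORDER BY department
"""

medians = dict(conn.execute(MEDIAN_SQL).fetchall())
print(medians)  # {'eng': 135.0, 'ops': 70.0}
```

Other engines need small adjustments: where `/` on integers does not truncate, write the positions with explicit FLOOR and CEIL calls instead.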
Behavioral questions
- 'Tell me about a time your analysis led to a counterintuitive finding. How did you convince stakeholders?' — Show you validated the finding rigorously before presenting, anticipated objections, and framed it in terms the audience cared about.
- 'Describe a time you had to choose between model complexity and interpretability.' — Show you consider the audience and use case. Regulatory context may require interpretable models. A/B test arbitration may allow complex models. The choice is contextual, not absolute.
- 'How do you prioritize competing analysis requests from multiple teams?' — Framework: impact (revenue/user impact), urgency (deadline-driven), effort (quick wins vs deep dives), and strategic alignment. Communicate tradeoffs transparently.
