Career Guides | 15 min read | 2026-05-19 | Julian Caraulani

AI/ML Engineer Interview Questions — 35 Real Questions & Answers (2026)

ML system design, LLM/GenAI questions, coding challenges, and what Google, Meta, and OpenAI actually ask.

AI/ML engineer interviews in 2026 follow a 4-6 round structure: recruiter screen, coding (LeetCode medium-hard), ML system design, ML depth/breadth technical, and behavioral. Google runs 5 rounds, Meta runs 4 virtual onsite rounds (2 coding + 1 system design + 1 behavioral), and Amazon emphasizes Leadership Principles alongside LeetCode easy-medium. New for 2026: LLM and GenAI questions now appear in nearly every ML interview loop.

ML fundamentals (asked everywhere)

  • 'Explain the bias-variance tradeoff and how you would diagnose which one is the problem.' — High bias: underfitting (training and validation error both high). High variance: overfitting (low training error, high validation error). Fix bias with more complex models; fix variance with regularization or more data.
  • 'Walk me through gradient descent. What are the differences between batch, mini-batch, and stochastic?' — Batch uses all data per step (stable but slow), stochastic uses one sample (noisy but fast), mini-batch is the practical middle ground. Mention learning-rate scheduling and the Adam optimizer.
  • 'How do you handle class imbalance in a classification problem?' — Resampling (SMOTE, undersampling), class weights, threshold tuning, ensemble methods, or focal loss. The right approach depends on the severity and the cost matrix of false positives vs false negatives.
  • 'Explain L1 vs L2 regularization. When would you use each?' — L1 (Lasso) drives some coefficients exactly to zero, which acts as feature selection. L2 (Ridge) shrinks all coefficients toward zero, penalizing large weights most, but rarely zeroes any of them out. Use L1 when you suspect many irrelevant features; L2 when most features contribute.
  • 'What is the difference between precision and recall? When does each matter more?' — Precision = of predicted positives, how many are correct. Recall = of actual positives, how many did you find. Spam filter: optimize precision, because a false positive hides a legitimate email. Cancer screening: optimize recall, because a false negative misses a real case.
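
The gradient descent answer above can be sketched in a few lines. This is a minimal illustration, not a reference implementation: the function name `fit_line`, the toy data, and the learning rate are assumptions chosen for the example, and it uses plain Python rather than a numerical library. The only thing that changes between the three variants is `batch_size`.

```python
import random

def fit_line(xs, ys, batch_size, lr=0.02, epochs=2000, seed=0):
    """Fit y = w*x + b by minimizing mean squared error.

    batch_size == len(xs) -> batch gradient descent
    batch_size == 1       -> stochastic gradient descent
    anything in between   -> mini-batch gradient descent
    """
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)  # new sample order every epoch
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            grad_w = grad_b = 0.0
            for i in batch:
                error = (w * xs[i] + b) - ys[i]
                grad_w += 2 * error * xs[i]
                grad_b += 2 * error
            # Average the gradient over the batch, then take one step.
            w -= lr * grad_w / len(batch)
            b -= lr * grad_b / len(batch)
    return w, b

# Toy, noise-free data generated from y = 2x + 1 (an assumption for the demo).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2 * x + 1 for x in xs]

results = {}
for name, size in [("batch", len(xs)), ("mini-batch", 2), ("stochastic", 1)]:
    results[name] = fit_line(xs, ys, batch_size=size)
    w, b = results[name]
    print(f"{name:>10}: w={w:.3f}, b={b:.3f}")
```

On this noise-free data all three variants should approach w = 2 and b = 1. On real data the stochastic variant would keep fluctuating around the optimum, which is where learning-rate schedules come in.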

ML system design (FAANG favorite)

  • 'Design a product recommendation system for an e-commerce platform.' (Meta/Amazon) — Candidate retrieval with embeddings, ranking with a learned model, re-ranking for diversity/business rules. Discuss offline vs online evaluation, the A/B testing framework, and the cold-start problem.
  • 'Design a content moderation system at scale.' (Google/Meta) — Multi-modal pipeline: text classifier, image classifier, video frame sampling. Discuss precision-recall tradeoffs for different violation types, human-in-the-loop for edge cases, and appeal workflow.
  • 'Design a fraud detection system for a payments platform.' (Amazon/Stripe) — Real-time feature computation, ensemble model with rule-based overrides, feedback loop from manual reviews. Discuss latency constraints, feature stores, and concept drift monitoring.
  • 'Design a search autocomplete system.' (Google/Amazon) — Trie-based candidate generation, personalized ranking model, caching layer. Discuss how to handle trending queries, typo correction, and offensive content filtering.
  • 'Design a video recommendation feed like TikTok.' — Two-tower retrieval model, contextual bandit for exploration, session-based features. Discuss engagement vs quality tradeoffs, filter bubbles, and real-time inference.
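
For the autocomplete question, the trie-based candidate generation step might look like the sketch below. The class and method names (`TrieNode`, `Autocomplete`, `add`, `suggest`) and the query counts are hypothetical, and ranking here is by raw query frequency only. A production design would add the personalized ranking model, caching, and content filtering discussed above.

```python
class TrieNode:
    def __init__(self):
        self.children = {}  # next character -> TrieNode
        self.count = 0      # times a query ending at this node was seen


class Autocomplete:
    def __init__(self):
        self.root = TrieNode()

    def add(self, query, count=1):
        """Record that `query` was searched `count` times."""
        node = self.root
        for ch in query:
            node = node.children.setdefault(ch, TrieNode())
        node.count += count

    def suggest(self, prefix, k=3):
        """Return up to k completions of `prefix`, most frequent first."""
        node = self.root
        for ch in prefix:
            node = node.children.get(ch)
            if node is None:
                return []  # no stored query starts with this prefix
        # Depth-first walk of the subtree below the prefix.
        found = []
        stack = [(node, prefix)]
        while stack:
            current, text = stack.pop()
            if current.count > 0:
                found.append((current.count, text))
            for ch, child in current.children.items():
                stack.append((child, text + ch))
        # Highest count first; ties broken alphabetically.
        found.sort(key=lambda pair: (-pair[0], pair[1]))
        return [text for _, text in found[:k]]


ac = Autocomplete()
ac.add("machine learning", 50)
ac.add("machine translation", 20)
ac.add("matrix factorization", 30)
ac.add("mac", 5)

print(ac.suggest("mac", k=2))
print(ac.suggest("ma", k=3))
```

Walking the whole subtree on every keystroke is the simplest option. A common refinement worth mentioning in an interview is caching the top-k completions at each node so lookups cost only the length of the prefix.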

LLM and GenAI questions (new for 2026)

  • 'When would you use RAG vs fine-tuning? Design a pipeline for each.' — RAG for dynamic knowledge, fine-tuning for behavior/style. RAG pipeline: chunk documents, embed with a model, store in vector DB, retrieve top-k, inject into prompt. Fine-tuning: curate dataset, choose base model, train with LoRA/QLoRA, evaluate on held-out set.
  • 'How do you evaluate LLM output quality at scale?' — Automated metrics (RAGAS for RAG, ROUGE for summarization, BLEU for translation), LLM-as-judge with calibrated rubrics, human evaluation samples. Discuss the tradeoffs of each and when you need all three.
  • 'How do you defend against prompt injection?' — Input sanitization, system prompt isolation, output filtering, canary tokens, instruction hierarchy. Discuss the fundamental tension: LLMs are instruction-following by design.
  • 'Design an AI agent system with tool use.' — Router model decides which tools to call, tool execution sandbox, result validation, retry logic, memory/context management. Discuss error handling and cost budgets.
  • 'How do you optimize LLM inference cost and latency?' — Model distillation, quantization (INT8/INT4), KV cache optimization, batching strategies, model routing (small model for easy queries, large for hard). Discuss the cost-quality Pareto frontier.
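
The RAG pipeline steps named above (chunk, embed, store, retrieve top-k, inject into prompt) can be walked through end to end with a toy example. Everything here is an assumption made for illustration: `embed` is a bag-of-words stand-in for a real embedding model, the "vector DB" is a Python list, and the documents and prompt template are invented.

```python
import math
from collections import Counter

def chunk(text, size=12):
    """Split text into chunks of `size` words (no overlap, for brevity)."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(w.strip(".,?!").lower() for w in text.split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(question, index, k=2):
    """Return the k chunk texts most similar to the question."""
    q = embed(question)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]

def build_prompt(question, contexts):
    """Inject the retrieved chunks into a prompt for the LLM."""
    joined = "\n".join(f"- {c}" for c in contexts)
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{joined}\n\nQuestion: {question}"
    )

# Hypothetical knowledge base.
docs = [
    "Refunds are issued within 14 days of purchase.",
    "Shipping to Europe takes 5 to 7 business days.",
    "Support is available by email on weekdays.",
]
# "Vector store": a list of (chunk text, embedding) pairs.
index = [(c, embed(c)) for d in docs for c in chunk(d)]

question = "How long does shipping to Europe take?"
top = retrieve(question, index, k=1)
prompt = build_prompt(question, top)
print(prompt)
```

In a real system the embedding function, the store, and the chunking strategy (size, overlap) are the parts an interviewer is likely to probe, since each one affects retrieval quality independently of the LLM.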

Coding questions

  • 'Implement K-Means clustering from scratch.' (Google) — Initialize centroids, assign points to nearest centroid, recompute centroids, repeat until convergence. Discuss initialization strategies (K-Means++) and convergence criteria.
  • 'Implement logistic regression with gradient descent.' (Amazon/Meta) — Sigmoid function, binary cross-entropy loss, gradient computation, weight update loop. Discuss learning rate selection and feature scaling.
  • 'Compute AUC-ROC from scratch given predictions and labels.' (Uber) — Sort by predicted probability, sweep the threshold, compute TPR and FPR at each point, and integrate the area under the curve (trapezoidal rule). Discuss interpretation and when AUC is misleading, such as under heavy class imbalance.
  • 'Build rolling window features for time series.' (Amazon/Spotify) — Sliding window aggregations (mean, std, min, max) with proper handling of window edges. Discuss leakage prevention.
  • 'Implement a simple neural network with backpropagation in NumPy.' (Google/Apple) — Forward pass, loss computation, backward pass with chain rule, weight updates. Discuss vanishing gradients and activation function choices.
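
As one worked instance of the coding questions above, here is a sketch of AUC-ROC from scratch following the sort, sweep, and integrate outline. The function name `roc_auc` and the sample inputs are assumptions. Labels are expected to be 0 or 1, and tied scores are consumed together so that ties form a single diagonal segment of the curve.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via a threshold sweep and the trapezoidal rule."""
    pos = sum(labels)
    neg = len(labels) - pos
    if pos == 0 or neg == 0:
        raise ValueError("AUC needs at least one positive and one negative label")

    # Highest score first: lowering the threshold admits one score group at a time.
    pairs = sorted(zip(scores, labels), key=lambda p: p[0], reverse=True)

    tp = fp = 0
    prev_tpr = prev_fpr = 0.0
    area = 0.0
    i = 0
    while i < len(pairs):
        threshold = pairs[i][0]
        # Consume every example that shares this score before emitting a point.
        while i < len(pairs) and pairs[i][0] == threshold:
            if pairs[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        tpr, fpr = tp / pos, fp / neg
        # Trapezoid between the previous ROC point and the new one.
        area += (fpr - prev_fpr) * (tpr + prev_tpr) / 2
        prev_tpr, prev_fpr = tpr, fpr
    return area


print(roc_auc([1, 1, 0, 0], [0.9, 0.8, 0.3, 0.1]))     # perfectly separated
print(roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))    # one misordered pair
```

The sort dominates the cost, so the whole computation is O(n log n). A useful follow-up to mention is that the result equals the probability that a randomly chosen positive is scored above a randomly chosen negative.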

Behavioral questions for ML roles

  • 'Tell me about a model you deployed that failed in production.' — They want to hear about monitoring, debugging methodology, and what you changed. Not having a failure story is a red flag.
  • 'How do you handle the tradeoff between model accuracy and inference latency?' — Show you think about business constraints, not just model performance. Discuss distillation, quantization, and acceptable accuracy loss.
  • 'Describe a time you disagreed with a stakeholder about a data-driven decision.' — Show you can defend your analysis while remaining open to new information.
  • 'How do you stay current with ML research?' — Mention specific papers, conferences (NeurIPS, ICML), or newsletters. Generic answers like 'I read blogs' are insufficient.
  • 'What is your biggest technical regret?' — This tests self-awareness. Discuss what you would do differently and why.

Common mistakes

  • Production blindness — discussing models without mentioning deployment, monitoring, or maintenance.
  • Shallow tool-centric answers — saying 'I use XGBoost' without explaining WHY it fits the problem.
  • Ignoring data quality — jumping to model selection without discussing data cleaning, labeling, and validation.
  • Metric selection as an afterthought — choosing accuracy for an imbalanced dataset is an instant credibility hit.
  • Not asking clarifying questions in system design — the ambiguity is intentional. Ask about scale, latency requirements, and success metrics before designing.
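
The metric-selection mistake is easy to demonstrate with numbers. The sketch below uses an invented dataset with a 1% positive rate and a "model" that always predicts the negative class. Both are assumptions chosen to make the arithmetic obvious.

```python
# Hypothetical dataset: 10 positives among 1,000 examples (1% positive rate).
labels = [1] * 10 + [0] * 990

# A useless baseline that always predicts the negative class.
predictions = [0] * len(labels)

correct = sum(p == y for p, y in zip(predictions, labels))
accuracy = correct / len(labels)

true_positives = sum(p == 1 and y == 1 for p, y in zip(predictions, labels))
recall = true_positives / sum(labels)

print(f"accuracy={accuracy:.2f}, recall={recall:.2f}")
# prints: accuracy=0.99, recall=0.00
```

A 99% accurate model that finds none of the positives is the reason interviewers expect precision, recall, or a PR curve to come up as soon as class imbalance is mentioned.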