Cloud architect interviews in 2026 follow a 4-6 round loop over 2-4 weeks: recruiter screen, hiring manager screen, IaC technical screen (Terraform/Pulumi), platform deep-dive (AWS/Azure/GCP specifics), system design whiteboard, and behavioral. Security questions now appear in every round, not just a dedicated stage. Cost is treated as an architectural constraint from day one. Interviewers intentionally leave constraints ambiguous to test whether you ask clarifying questions before designing.
AWS architecture questions
- 'Design a VPC for a multi-tier web application.' — Public subnets in at least two Availability Zones for the ALB, private subnets for the app and DB tiers, NAT Gateways for outbound traffic, and gateway VPC endpoints for S3 and DynamoDB so that traffic bypasses the NAT path. Layer security groups (stateful, attached to network interfaces) with NACLs (stateless, subnet-level) for defense in depth.
- 'A developer needs temporary access to a production S3 bucket. How?' — IAM role with STS AssumeRole, scoped policy with time-bound session, CloudTrail logging. Never distribute access keys. Use IAM Identity Center for human access.
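- 'Your Lambda function has cold start issues impacting UX. Fix it.' — Provisioned Concurrency for predictable workloads, SnapStart where the runtime supports it, a smaller deployment package, and heavy initialization moved outside the handler so warm invocations reuse it. Scheduled keep-warm invocations are a fallback only: they keep a few execution environments warm and do not help with concurrency spikes. Container images suit larger runtimes but are not in themselves a cold-start fix.
- 'Design a disaster recovery strategy with RPO of 15 minutes and RTO of 1 hour.' — Multi-region active-passive (warm standby) with Aurora Global Database for cross-region replication, S3 Cross-Region Replication (with Replication Time Control where objects must meet the 15-minute RPO), Route 53 health checks for automated DNS failover, and infrastructure-as-code to scale up the standby region quickly. Rehearse failover regularly; an untested RTO is only an estimate.
- 'How do you optimize a $50K/month AWS bill?' — Use Cost Explorer to find the top spend drivers first, then: right-size instances (check CPU/memory utilization), buy Reserved Instances or Savings Plans for steady workloads, move batch processing to Spot, add S3 lifecycle policies, clean up unused EBS volumes and Elastic IPs, and enable AWS Cost Anomaly Detection to catch regressions.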
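The VPC answer above is easier to defend on a whiteboard with a concrete CIDR plan. The following is a minimal sketch using only Python's standard `ipaddress` module; the `10.0.0.0/16` range, the `/20` subnet size, the AZ names, and the `plan_subnets` helper are all illustrative assumptions, not values from any real account.

```python
import ipaddress


def plan_subnets(vpc_cidr, azs, tiers, new_prefix=20):
    """Carve one subnet per (tier, AZ) pair out of the VPC range.

    Subnets are handed out in order, so the result is deterministic
    and non-overlapping by construction.
    """
    vpc = ipaddress.ip_network(vpc_cidr)
    needed = len(azs) * len(tiers)
    available = 2 ** (new_prefix - vpc.prefixlen)
    if needed > available:
        raise ValueError(
            f"{vpc_cidr} only holds {available} /{new_prefix} subnets, need {needed}"
        )
    blocks = vpc.subnets(new_prefix=new_prefix)
    plan = {}
    for tier in tiers:
        for az in azs:
            plan[(tier, az)] = next(blocks)
    return plan


# Hypothetical inputs: a /16 VPC, two AZs, three tiers.
plan = plan_subnets(
    "10.0.0.0/16",
    azs=["us-east-1a", "us-east-1b"],
    tiers=["public", "app", "db"],
)
for (tier, az), cidr in plan.items():
    print(f"{tier:<6} {az}  {cidr}")
```

Only the `public` subnets would get a route to an internet gateway; the `app` and `db` subnets would route outbound traffic through NAT and the gateway endpoints. Leaving most of the `/16` unallocated is deliberate in this sketch, so that later tiers or AZs can be added without renumbering.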
Kubernetes and container questions
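- 'A pod is in CrashLoopBackOff. Walk me through debugging.' — kubectl describe pod (check events, last state, and exit code), kubectl logs --previous (read the output of the crashed container, not the restarted one), check resource limits (OOMKilled?), verify liveness, readiness, and startup probes, and confirm that the config and secrets the app reads at startup are present and correct. Image pull failures surface as ImagePullBackOff rather than CrashLoopBackOff, so rule those out from the pod status first.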
- 'Explain the difference between a Deployment, StatefulSet, and DaemonSet.' — Deployment: stateless replicas, rolling updates. StatefulSet: stable network identities, ordered deployment, persistent volumes (databases). DaemonSet: one pod per node (monitoring agents, log collectors).
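- 'How do you handle secrets in Kubernetes?' — External Secrets Operator syncing from AWS Secrets Manager or Vault, Sealed Secrets for GitOps, encryption at rest for Secret objects in etcd, and RBAC that restricts who can read them. Never store secrets in ConfigMaps or as plain-text environment values in manifests; a Secret is only base64-encoded by default, not encrypted.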
- 'Design a multi-cluster Kubernetes architecture.' — Federation or fleet management (GKE Enterprise, EKS Anywhere), service mesh for cross-cluster communication, GitOps with ArgoCD for consistent deployments, centralized observability.
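The CrashLoopBackOff walkthrough above can be sketched as a small triage function over the JSON that `kubectl get pod <name> -o json` returns. The field names (`containerStatuses`, `state.waiting.reason`, `lastState.terminated`) follow the Kubernetes pod status schema; the `triage_crashloop` helper, the hint wording, and the sample pod are hypothetical and only illustrate the order of checks.

```python
def triage_crashloop(pod):
    """Return (container_name, hint) pairs for containers in CrashLoopBackOff."""
    findings = []
    for cs in pod.get("status", {}).get("containerStatuses", []):
        waiting = cs.get("state", {}).get("waiting") or {}
        if waiting.get("reason") != "CrashLoopBackOff":
            continue  # healthy, or failing for a different reason

        # lastState describes the previous (crashed) container instance.
        last = cs.get("lastState", {}).get("terminated") or {}
        reason = last.get("reason")
        exit_code = last.get("exitCode")

        if reason == "OOMKilled":
            hint = "killed for exceeding its memory limit; raise the limit or fix the leak"
        elif exit_code == 0:
            hint = "process exits cleanly; check that command/args start a long-running process"
        else:
            hint = f"exited with code {exit_code}; read `kubectl logs --previous`"
        findings.append((cs["name"], hint))
    return findings


# Hypothetical status for a pod whose container keeps getting OOM-killed.
sample_pod = {
    "status": {
        "containerStatuses": [
            {
                "name": "api",
                "restartCount": 7,
                "state": {"waiting": {"reason": "CrashLoopBackOff"}},
                "lastState": {"terminated": {"reason": "OOMKilled", "exitCode": 137}},
            }
        ]
    }
}

for name, hint in triage_crashloop(sample_pod):
    print(f"{name}: {hint}")
```

In an interview the code matters less than the sequence it encodes: confirm the waiting reason, inspect the previous container's termination reason and exit code, and only then go to logs and probes.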
Terraform and IaC questions
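- 'How do you manage Terraform state in a team environment?' — Remote backend with locking (S3 with a DynamoDB lock table, or S3-native lockfile locking on recent Terraform versions), state isolation per environment (workspaces or separate state files), state encryption at rest, bucket versioning for recovery, and import of existing resources before modifying them.
- 'What is Terraform drift and how do you detect it?' — Drift occurs when real infrastructure diverges from what the state and configuration describe, usually through manual changes. Detect it with a scheduled terraform plan in CI/CD (terraform plan -refresh-only shows changes made outside Terraform), or with tools like Spacelift/env0 for continuous drift detection. Reconcile by updating the configuration to match, applying to revert the change, importing unmanaged resources, or forcing recreation with terraform apply -replace (the older terraform taint command is deprecated).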
- 'Explain Terraform modules and when NOT to use them.' — Modules encapsulate reusable infrastructure patterns. Do not use for simple, one-off resources (over-abstraction). Do not nest modules more than 2 levels deep. Keep modules focused on a single responsibility.
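- 'How do you handle sensitive values in Terraform?' — Mark variables and outputs as sensitive (this only redacts CLI output; the values are still written to state in plain text), store values in a secrets manager (not tfvars files), use SOPS for encrypted values that must live in Git, keep state in an encrypted, access-controlled remote backend, and never commit state files.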
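The drift-detection answer above usually ends up as a scheduled CI job. A minimal sketch, assuming the `terraform` binary is on the PATH and the working directory is already initialized: `terraform plan -detailed-exitcode` exits with 0 when there is nothing to change, 1 on error, and 2 when the plan contains changes. The `check_drift` and `classify_plan_exit` helpers and the injectable `run` parameter are illustrative names, not part of any Terraform tooling.

```python
import subprocess

# -input=false stops Terraform from prompting in CI; -no-color keeps logs clean.
PLAN_CMD = ["terraform", "plan", "-detailed-exitcode", "-input=false", "-no-color"]


def classify_plan_exit(exit_code):
    """Map the -detailed-exitcode result onto a drift status."""
    if exit_code == 0:
        return "in_sync"  # plan succeeded with an empty diff
    if exit_code == 2:
        return "drift"    # plan succeeded and found changes
    return "error"        # exit code 1, or anything unexpected


def check_drift(workdir, run=subprocess.run):
    """Run a plan in workdir and report whether a diff was found.

    `run` is injectable so the logic can be exercised without Terraform.
    """
    result = run(PLAN_CMD, cwd=workdir, capture_output=True, text=True)
    return classify_plan_exit(result.returncode)
```

A CI job could call `check_drift` on each environment directory and open a ticket or fail the pipeline on `"drift"`. Note that a non-empty plan also appears when the configuration has changed but has not been applied yet, so this sketch only signals drift reliably when it runs against the commit that was last applied.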
System design and whiteboard
- 'Design the infrastructure for a real-time analytics platform processing 1M events/second.' — Kinesis or Kafka for ingestion, Flink for stream processing, DynamoDB or Redis for real-time queries, S3 + Athena for historical analysis. Discuss partitioning strategy, exactly-once semantics, and cost at scale.
- 'Design a global CDN-backed web application with sub-100ms latency.' — CloudFront with edge caching, origin shield, Lambda@Edge for personalization, multi-region backends with Global Accelerator, Route 53 latency-based routing.
- 'A startup needs to go from 0 to production in 2 weeks. Design the infrastructure.' — Start with managed services: ECS Fargate or App Runner, RDS, S3, CloudFront. IaC from day one with Terraform. CI/CD with GitHub Actions. Monitoring with CloudWatch. Do not over-engineer: no Kubernetes, no multi-region, no custom service mesh.
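For the 1M events/second question above, interviewers often expect a back-of-the-envelope capacity estimate before any boxes are drawn. The sketch below sizes a provisioned Kinesis stream, assuming the commonly documented per-shard write limits of 1,000 records/second and 1 MB/second (taken here as 1,000,000 bytes; check current quotas before relying on them). The average event sizes and the 25% headroom are hypothetical inputs.

```python
import math


def kinesis_shards(
    events_per_sec,
    avg_event_bytes,
    headroom=0.25,
    shard_records_per_sec=1_000,    # assumed per-shard write limit (records)
    shard_bytes_per_sec=1_000_000,  # assumed per-shard write limit (bytes)
):
    """Estimate the shard count for a write workload.

    The stream must satisfy both the record-rate and the byte-rate limit,
    so the larger of the two requirements wins; headroom covers hot
    partitions and traffic bursts.
    """
    by_records = events_per_sec / shard_records_per_sec
    by_bytes = events_per_sec * avg_event_bytes / shard_bytes_per_sec
    return math.ceil(max(by_records, by_bytes) * (1 + headroom))


# Small events: the record-rate limit dominates.
print(kinesis_shards(1_000_000, avg_event_bytes=512))    # 1250
# Larger events: the byte-rate limit dominates.
print(kinesis_shards(1_000_000, avg_event_bytes=2048))   # 2560
```

Either result is well over a thousand shards, which is the point to raise in the interview: at this scale, partition-key distribution and per-shard cost become first-order design inputs, and a comparison with Kafka is worth a sentence.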
AI infrastructure questions (new for 2026)
- 'Design GPU infrastructure for serving an LLM with 100 requests/second.' — GPU instance selection (A100 vs H100) driven by model memory footprint, tensor parallelism when the model does not fit on one GPU, continuous batching with vLLM or TensorRT-LLM, auto-scaling based on queue depth, and cost optimization with Spot GPU instances for interruptible batch workloads only.
- 'How do you manage costs for AI/ML workloads in the cloud?' — Spot instances for training, reserved capacity for inference, right-sizing GPU instances, model quantization to use smaller GPUs, FinOps dashboards for per-model cost attribution.
- 'Design a RAG system architecture on AWS.' — S3 for document storage, OpenSearch or Pinecone for vector search, Lambda or ECS for the embedding pipeline, API Gateway for serving, SageMaker for model hosting, and CloudWatch for latency and error metrics plus custom metrics that track retrieval quality.
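The LLM-serving question above also benefits from a quick sizing estimate. A rough sketch, assuming decode throughput is the bottleneck: required tokens/second is the request rate times the average output length, divided across replicas that run below full utilization. The per-replica throughput must come from a load test of the actual model and GPU; the `2_500` tokens/second figure, the 300-token output length, and the 70% utilization target are placeholders, not benchmarks.

```python
import math


def replicas_needed(
    req_per_sec,
    avg_output_tokens,
    tokens_per_sec_per_replica,
    target_utilization=0.7,
):
    """Estimate serving replicas from aggregate token throughput.

    target_utilization leaves room for bursts and for latency to stay
    stable as batches fill up.
    """
    required_tokens_per_sec = req_per_sec * avg_output_tokens
    usable_per_replica = tokens_per_sec_per_replica * target_utilization
    return math.ceil(required_tokens_per_sec / usable_per_replica)


# Hypothetical numbers: 100 req/s, 300 output tokens each,
# 2,500 tokens/s measured per replica with batching enabled.
print(replicas_needed(100, 300, 2_500))  # 18
```

This ignores prompt (prefill) cost and latency targets, so it gives a lower bound rather than a final answer. It is still enough to frame the follow-up discussion on quantization, batching, and which share of the fleet should be reserved versus Spot.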
Behavioral questions
- 'Tell me about a time an architecture decision you made caused a production outage.' — Show you take ownership, describe root cause analysis, and explain what architectural guardrails you added to prevent recurrence.
- 'How do you handle a situation where a developer wants to bypass security controls for speed?' — Show you balance security and velocity. Propose alternatives that maintain security while reducing friction (automated compliance checks, pre-approved patterns).
- 'Describe how you evaluate build vs buy decisions.' — Framework: team capacity, maintenance burden, differentiation value, vendor lock-in risk, total cost of ownership over 3 years, security implications. Show you consider both technical and business factors.
