Spot Instances and Preemptible VMs: Saving 70% on Compute Without Breaking Production

How Spot Pricing Works

AWS sets spot prices per instance type and availability zone based on longer-term supply and demand for spare capacity, so prices drift gradually rather than spiking. Discounts compared to on-demand typically range from 60% to 90%, and the exact figure differs by instance type, zone, and time. The EC2 Spot Instances documentation covers the full model, including Spot Instance requests and Fleet configuration.

Spot instances can be terminated with two minutes’ notice when AWS needs the capacity back. The notice arrives via instance metadata; well-designed workloads catch it, drain in-flight work, and exit cleanly.
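As a rough sketch of catching the notice, the snippet below queries the instance metadata service (IMDSv2) for spot/instance-action and works out how much time is left. The function names are illustrative, not part of any AWS SDK, and fetch_notice only works when run on an EC2 instance.

```python
import json
import urllib.error
import urllib.request
from datetime import datetime, timezone
from typing import Optional

IMDS = "http://169.254.169.254/latest"


def seconds_until_action(body: str, now: datetime) -> float:
    """Parse the notice JSON and return the seconds left before the action."""
    notice = json.loads(body)
    # The notice carries a UTC timestamp such as 2024-01-01T12:02:00Z.
    deadline = datetime.strptime(notice["time"], "%Y-%m-%dT%H:%M:%SZ")
    deadline = deadline.replace(tzinfo=timezone.utc)
    return (deadline - now).total_seconds()


def fetch_notice(timeout: float = 1.0) -> Optional[str]:
    """Return the raw notice body, or None when no interruption is scheduled."""
    # IMDSv2: obtain a short-lived session token first.
    token_req = urllib.request.Request(
        f"{IMDS}/api/token",
        method="PUT",
        headers={"X-aws-ec2-metadata-token-ttl-seconds": "60"},
    )
    with urllib.request.urlopen(token_req, timeout=timeout) as resp:
        token = resp.read().decode()
    req = urllib.request.Request(
        f"{IMDS}/meta-data/spot/instance-action",
        headers={"X-aws-ec2-metadata-token": token},
    )
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.read().decode()
    except urllib.error.HTTPError as err:
        if err.code == 404:  # no notice has been issued yet
            return None
        raise
```

A worker would typically call fetch_notice every few seconds from a background thread and start draining as soon as it returns a body.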

Workloads That Fit

The best fits are stateless or restartable workloads: anything that can lose its current node and restart elsewhere. Examples include CI runners, batch processing, image rendering, ML training (with checkpointing), and web tiers behind a load balancer with sufficient redundancy.

Workloads that don’t fit: anything with persistent local state, anything that cannot drain within the two-minute notice, and anything where a brief outage is unacceptable.

Fleet Diversity Reduces Risk

The single biggest mistake with spot is asking for one specific instance type. When that type’s spot capacity tightens, your whole fleet gets evicted at once.

Capacity-optimized allocation in EC2 Auto Scaling and Spot Fleet picks from a pool of instance types based on which has the most available capacity right now. Diversifying across 4-8 similar instance types dramatically reduces the chance of mass eviction.
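The arithmetic behind diversification is simple enough to sketch. The toy model below assumes that when a pool tightens, every instance in that pool is reclaimed at once, which is a worst case rather than how AWS necessarily behaves. The fleet sizes are made up for illustration.

```python
from typing import Dict, Set


def fraction_evicted(allocation: Dict[str, int], reclaimed: Set[str]) -> float:
    """Share of the fleet lost when every instance in `reclaimed` pools is evicted."""
    total = sum(allocation.values())
    if total == 0:
        return 0.0
    lost = sum(count for pool, count in allocation.items() if pool in reclaimed)
    return lost / total


# Hypothetical fleets of 40 instances each.
single_type = {"m5.large": 40}
diversified = {"m5.large": 10, "m5a.large": 10, "m6i.large": 10, "m5n.large": 10}

print(fraction_evicted(single_type, {"m5.large"}))  # 1.0
print(fraction_evicted(diversified, {"m5.large"}))  # 0.25
```

With four pools, losing one costs a quarter of the fleet instead of all of it, and the remaining three quarters keep serving while replacements launch.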

Kubernetes on Spot

EKS managed node groups support spot directly. Karpenter (an open-source node autoscaler originally built by AWS) makes diversification nearly effortless: declare your workload’s requirements (CPU, memory, optionally architecture), and Karpenter chooses among the instance types that fit, weighing price and available capacity.

Pod disruption budgets limit how many replicas a node drain can evict at once, so simultaneous evictions do not take down a service. Topology spread constraints keep replicas spread across multiple nodes. With both in place, spot-based Kubernetes is reliable for most workloads.
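A minimal sketch of the two objects, written as Python dictionaries and printed as JSON (which kubectl accepts alongside YAML). The app label, the object name, and the choice of maxUnavailable are assumptions for illustration. The constraint list belongs under the pod template's spec.topologySpreadConstraints.

```python
import json

APP_LABELS = {"app": "web"}  # hypothetical label on the service's pods

# Allow at most one replica to be evicted at a time during node drains.
pod_disruption_budget = {
    "apiVersion": "policy/v1",
    "kind": "PodDisruptionBudget",
    "metadata": {"name": "web-pdb"},  # hypothetical name
    "spec": {
        "maxUnavailable": 1,
        "selector": {"matchLabels": APP_LABELS},
    },
}

# Prefer an even spread of replicas across nodes, without blocking scheduling.
topology_spread_constraints = [
    {
        "maxSkew": 1,
        "topologyKey": "kubernetes.io/hostname",
        "whenUnsatisfiable": "ScheduleAnyway",
        "labelSelector": {"matchLabels": APP_LABELS},
    }
]

print(json.dumps(pod_disruption_budget, indent=2))
print(json.dumps(topology_spread_constraints, indent=2))
```

Swapping the topologyKey for topology.kubernetes.io/zone spreads replicas across availability zones instead of nodes.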

Mixed Strategies

The pragmatic pattern: baseline capacity on on-demand or Savings Plans for predictable workloads, burst and batch capacity on spot. Auto Scaling Groups with mixed instance policies let you specify percentages of each.
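As a sketch, the helper below builds the MixedInstancesPolicy structure that the EC2 Auto Scaling API expects, for example as an argument to boto3's create_auto_scaling_group. The helper itself, the launch template name, and the instance types are assumptions for illustration. Nothing here calls AWS.

```python
from typing import Dict, List


def mixed_instances_policy(
    launch_template: str,
    instance_types: List[str],
    on_demand_base: int,
    on_demand_pct_above_base: int,
) -> Dict:
    """Build a MixedInstancesPolicy dict: on-demand baseline, spot for the rest."""
    if not 0 <= on_demand_pct_above_base <= 100:
        raise ValueError("percentage must be between 0 and 100")
    return {
        "LaunchTemplate": {
            "LaunchTemplateSpecification": {
                "LaunchTemplateName": launch_template,
                "Version": "$Latest",
            },
            # One override per instance type the group may launch.
            "Overrides": [{"InstanceType": t} for t in instance_types],
        },
        "InstancesDistribution": {
            "OnDemandBaseCapacity": on_demand_base,
            "OnDemandPercentageAboveBaseCapacity": on_demand_pct_above_base,
            "SpotAllocationStrategy": "capacity-optimized",
        },
    }


policy = mixed_instances_policy(
    "web-tier",  # hypothetical launch template name
    ["m5.large", "m5a.large", "m6i.large", "m5n.large"],
    on_demand_base=2,
    on_demand_pct_above_base=20,
)
print(policy["InstancesDistribution"])
```

Here the first two instances are always on-demand, and above that baseline roughly 20% of capacity is on-demand and 80% is spot.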

Don’t aim for 100% spot. Aim for 70-90% spot for batch-style work and stateless services, and accept that some workloads (databases, stateful services, latency-sensitive controllers) stay on-demand.
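To see what a given spot share is worth, a back-of-the-envelope calculation helps. The sketch assumes the non-spot remainder is billed at on-demand list price (so it ignores Savings Plans) and that one average discount applies to all spot capacity.

```python
def blended_savings(spot_fraction: float, spot_discount: float) -> float:
    """Overall savings versus running everything on-demand."""
    if not (0.0 <= spot_fraction <= 1.0 and 0.0 <= spot_discount <= 1.0):
        raise ValueError("both arguments must be between 0 and 1")
    # Only the spot share of the fleet earns the discount.
    return spot_fraction * spot_discount


# 80% of capacity on spot at an average 70% discount.
print(f"{blended_savings(0.8, 0.7):.0%}")  # 56%
```

Reaching a 70% saving on the whole bill would need both a high spot share and a discount near the top of the range, which is why the headline number applies to the spot-eligible portion rather than total spend.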

Spot Interruption Handling Patterns

Workload code that handles spot interruptions gracefully has two key behaviors: it listens for the termination notice (via instance metadata or Kubernetes node draining events) and it can resume cleanly after restart.

For HTTP services, this is usually graceful shutdown (drain connections, finish in-flight requests, exit). For batch jobs, checkpoint state regularly and pick up from the last checkpoint after restart. The work is one-time engineering; the savings are ongoing.
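For the batch case, a minimal checkpoint-and-resume loop might look like the following. The file layout, function names, and the choice to checkpoint after every item are assumptions. A real job would usually checkpoint to durable storage (not the local disk of the instance being reclaimed) and less often.

```python
import json
import os
import signal
import tempfile
from typing import Callable, Sequence

stop_requested = False


def _on_sigterm(signum, frame) -> None:
    global stop_requested
    stop_requested = True


def install_handler() -> None:
    """Call once from the main thread so SIGTERM requests a clean stop."""
    signal.signal(signal.SIGTERM, _on_sigterm)


def load_checkpoint(path: str) -> int:
    try:
        with open(path) as fh:
            return json.load(fh)["next_index"]
    except FileNotFoundError:
        return 0  # first run: start from the beginning


def save_checkpoint(path: str, next_index: int) -> None:
    # Write to a temp file and rename, so a kill mid-write cannot corrupt it.
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
    with os.fdopen(fd, "w") as fh:
        json.dump({"next_index": next_index}, fh)
    os.replace(tmp, path)


def run(
    items: Sequence,
    process: Callable,
    checkpoint_path: str,
    should_stop: Callable[[], bool] = lambda: stop_requested,
) -> int:
    """Process items from the last checkpoint; return the next unprocessed index."""
    start = load_checkpoint(checkpoint_path)
    for i in range(start, len(items)):
        if should_stop():
            return i  # exit cleanly; the checkpoint already points here
        process(items[i])
        save_checkpoint(checkpoint_path, i + 1)
    return len(items)
```

If the process is killed between process and save_checkpoint, that one item is repeated after restart, so the per-item work should be idempotent.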

Karpenter and Modern Spot Orchestration

Karpenter is the modern way to run spot on EKS. Unlike Cluster Autoscaler, which manages predefined node groups, Karpenter provisions nodes on demand based on actual pod requirements.

Karpenter’s spot handling includes capacity-aware selection across instance types, automatic node consolidation to reduce waste, and configurable interruption handling. For new EKS clusters, Karpenter is a sensible default; for existing clusters, migration is usually worthwhile.

Workload Classification

Before adopting spot widely, classify workloads by interruption tolerance. Tier 1: cannot tolerate interruption (databases, stateful coordinators) — never on spot. Tier 2: tolerant with brief disruption (web tiers, API servers) — mostly on spot, with on-demand fallback. Tier 3: fully tolerant (batch jobs, CI, ML training with checkpoints) — entirely on spot.

The classification drives architecture. Tier 1 workloads run on on-demand or reserved capacity. Tier 2 workloads use mixed instance policies with spot percentages. Tier 3 workloads use spot fleets with aggressive diversification. The AWS Spot Best Practices guide documents the allocation strategies and interruption handling patterns in detail.
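The tiers can be encoded as a small decision function so that the classification is applied consistently. The inputs and the 120-second drain threshold (taken from the two-minute notice) are assumptions for illustration. Real classification usually needs a conversation with the owning team as well.

```python
from enum import Enum

NOTICE_SECONDS = 120  # assumed threshold: the two-minute interruption notice


class Tier(Enum):
    NO_INTERRUPTION = 1   # databases, stateful coordinators
    BRIEF_DISRUPTION = 2  # web tiers, API servers
    FULLY_TOLERANT = 3    # batch jobs, CI, checkpointed ML training


PLACEMENT = {
    Tier.NO_INTERRUPTION: "on-demand or reserved capacity only",
    Tier.BRIEF_DISRUPTION: "mixed instances policy, mostly spot with on-demand fallback",
    Tier.FULLY_TOLERANT: "spot fleet, diversified across instance types",
}


def classify(persistent_local_state: bool, drain_seconds: int, resumable_batch: bool) -> Tier:
    """Map a workload's properties to an interruption-tolerance tier."""
    if persistent_local_state or drain_seconds > NOTICE_SECONDS:
        return Tier.NO_INTERRUPTION
    if resumable_batch:
        return Tier.FULLY_TOLERANT
    return Tier.BRIEF_DISRUPTION


tier = classify(persistent_local_state=False, drain_seconds=30, resumable_batch=False)
print(tier.name, "->", PLACEMENT[tier])
```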

Most organizations have more tier 2 and tier 3 capacity than they think. The exercise of classification often surfaces 70-80% of compute as spot-eligible.

Allocation Strategies

AWS no longer uses a bidding model: you pay the current spot price, which changes gradually, and setting a maximum price is optional. Allocation strategy still matters. The capacity-optimized strategy launches into the pools with the most spare capacity, which lowers the probability of interruption.

The lowest-price strategy is tempting but usually worse: it concentrates the fleet in the cheapest pools, which are often the most interruption-prone. Capacity-optimized typically delivers 30-50% lower interruption rates at a minimal price difference. AWS also offers price-capacity-optimized, which weighs both price and capacity.

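When you prefer certain instance types, for example for latency-sensitive workloads, capacity-optimized-prioritized lets you rank them. The strategy honors your priority order on a best-effort basis while still optimizing for capacity first, falling back to alternatives when the preferred pools are tight.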
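The difference between the strategies is easier to see in a toy model. The sketch below is not AWS's actual algorithm, which is not public. The spare_capacity field is a hypothetical signal, and the prices are made up.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class Pool:
    instance_type: str
    zone: str
    price_per_hour: float  # made-up price, for illustration only
    spare_capacity: int    # hypothetical signal; AWS does not publish it


def lowest_price(pools: Sequence[Pool]) -> Pool:
    return min(pools, key=lambda p: p.price_per_hour)


def capacity_optimized(pools: Sequence[Pool]) -> Pool:
    return max(pools, key=lambda p: p.spare_capacity)


def capacity_optimized_prioritized(
    pools: Sequence[Pool], priority: List[str], needed: int
) -> Pool:
    """Prefer earlier entries in `priority`, but only among pools with room."""
    roomy = [p for p in pools if p.spare_capacity >= needed]
    if not roomy:
        return capacity_optimized(pools)

    def rank(p: Pool) -> int:
        return priority.index(p.instance_type) if p.instance_type in priority else len(priority)

    # Ties on rank go to the pool with more spare capacity.
    return min(roomy, key=lambda p: (rank(p), -p.spare_capacity))


POOLS = [
    Pool("m5.large", "us-east-1a", 0.030, 5),
    Pool("m5a.large", "us-east-1a", 0.034, 60),
    Pool("m6i.large", "us-east-1b", 0.036, 200),
]

print(lowest_price(POOLS).instance_type)        # m5.large
print(capacity_optimized(POOLS).instance_type)  # m6i.large
print(capacity_optimized_prioritized(POOLS, ["m5.large", "m5a.large"], 20).instance_type)  # m5a.large
```

In this example the cheapest pool is also the tightest, so the lowest-price choice is the one most likely to be reclaimed, while the prioritized variant skips the preferred but tight m5.large pool and takes the next preference.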

Hybrid and Multi-Cloud Considerations

Few large organizations are purely single-cloud. Acquisitions, regulatory requirements, and specific service preferences all push toward multi-cloud reality. The challenge is operating consistently across the resulting environment.

Tools that help: Crossplane for multi-cloud infrastructure provisioning, Terraform for multi-provider IaC, Kubernetes as a consistent application platform across clouds. Each abstracts away some cloud-specific details at the cost of giving up some cloud-specific capabilities.

The pragmatic path is usually ‘primary cloud plus secondary’ — most workloads on one cloud with specific workloads or backup capacity on another. Pure multi-cloud parity is rarely worth the operational cost.

Tagging and Resource Governance

At any meaningful scale, cloud resource governance requires consistent tagging. Tags by team, environment, project, cost center, and compliance category enable cost attribution, security scoping, and operational filtering.

Enforcement is the hard part. IAM policies can deny resource creation without required tags. Cloud Custodian and similar policy engines can scan for non-compliant resources and remediate.
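The scanning half of enforcement can be as small as the sketch below. The required tag keys mirror the categories above but their exact names are assumptions, and the inventory is hand-written here. In practice it would come from a cloud API or a policy engine.

```python
from typing import Dict, List, Sequence

# Hypothetical tag keys; adjust to your own tagging standard.
REQUIRED_TAGS = ("team", "environment", "project", "cost-center", "compliance")


def missing_tags(tags: Dict[str, str], required: Sequence[str] = REQUIRED_TAGS) -> List[str]:
    """Return required keys that are absent or blank on a resource."""
    return [key for key in required if not tags.get(key, "").strip()]


def non_compliant(resources: Dict[str, Dict[str, str]]) -> Dict[str, List[str]]:
    """Map each non-compliant resource ID to the tags it is missing."""
    report = {}
    for resource_id, tags in resources.items():
        missing = missing_tags(tags)
        if missing:
            report[resource_id] = missing
    return report


inventory = {
    "i-0001": {
        "team": "payments", "environment": "prod", "project": "checkout",
        "cost-center": "cc-42", "compliance": "pci",
    },
    "i-0002": {"team": "payments", "environment": " "},  # quick experiment
}

print(non_compliant(inventory))
```

Running a report like this on a schedule gives an early signal of drift even before deny-by-default policies are in place.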

Without enforcement, tags drift. Engineers create resources for quick experiments and forget to tag them. Within a quarter, untagged resources can outnumber tagged ones. Build the enforcement early; retrofitting is painful.

Documentation and Knowledge Management

Cloud infrastructure changes constantly. Documentation that captures architecture decisions, runbooks for common operations, and explanations of non-obvious choices preserves institutional knowledge through team turnover.

Architecture Decision Records (ADRs) are a lightweight pattern: a short document per significant decision capturing context, options considered, decision, and consequences. ADRs accumulate into a chronicle of why the architecture looks the way it does.

Living documentation beats one-time writeups. Tie documentation to code where possible — README files in repos, comments in Terraform, generated diagrams from infrastructure tools. Documentation that lives near the code it documents stays current.

Compliance and Audit Considerations

Cloud workloads often fall under compliance frameworks: SOC 2, ISO 27001, HIPAA, PCI DSS, FedRAMP. Each has specific control requirements affecting how you architect, configure, and operate.

Common cross-framework requirements: encryption at rest and in transit, access logging and review, incident response procedures, change management, vulnerability management. Building these in from the start is dramatically cheaper than retrofitting under audit pressure.

Tools that help: AWS Config and Audit Manager, GCP Security Command Center, Azure Policy. Each provides continuous compliance monitoring against defined rules. The configuration is upfront work; ongoing compliance becomes monitoring rather than periodic discovery.

Looking Ahead

Cloud infrastructure continues to evolve rapidly. The shifts most relevant to platform teams today: continued moves toward serverless and managed services that reduce operational overhead, growing importance of cost optimization as cloud spend matures into a major budget line, and the increasing role of compliance and data sovereignty in architecture decisions.

Teams that invest in transferable skills — Linux fundamentals, networking, distributed systems, observability — adapt to specific cloud changes more easily than teams that invest narrowly in vendor-specific certifications. The vendor-specific knowledge matters, but it’s a layer on top of broader engineering capability.

The cost of building infrastructure has dropped dramatically in two decades; the cost of operating it well has not. The teams that thrive long-term combine cloud-native tooling with the operational discipline that makes any infrastructure reliable.

Practical takeaway: don’t chase every new cloud service. Identify the gaps in your current architecture, evaluate options carefully against your requirements, and move deliberately. The pace of cloud announcements far exceeds the pace at which most organizations should adopt new technologies.

Key Takeaways

The most important point throughout this guide: practical engineering decisions depend on specific context. Best-practice recommendations are starting points, not destinations. The right answer for your team depends on your scale, your existing tooling investment, your team’s experience, and the specific constraints you face.

Three principles worth carrying forward regardless of specific tool choices. First, measure what you change. Engineering improvements without measurement become folklore — claims without evidence. Track the metrics that show whether interventions are working.

Second, default to simpler architectures and tools. Complexity has cost. Each additional moving part is something to monitor, debug, upgrade, and eventually replace. Choose the simplest thing that meets your actual requirements, not the most sophisticated thing you could build.

Third, invest continuously in the boring foundations. Reliable CI, good observability, sensible access controls, and clear documentation pay back across every project. Skipping these for short-term feature velocity accumulates debt that eventually consumes the velocity it was supposed to enable.

The teams that operate well over the long term are usually not the teams with the most exotic tooling. They’re the teams with disciplined fundamentals, deliberate decision-making, and continuous incremental improvement.

Frequently Asked Questions

How often do spot instances actually get terminated?

It depends on instance type and region. Diversified fleets often see eviction rates below 5% per month. Single-type fleets can see much higher rates.

Can I run databases on spot?

Mostly no. Even with replication, the operational overhead usually exceeds the savings.

What’s the equivalent on GCP?

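GCP offers Spot VMs (newer) and Preemptible VMs (older). Discounts are similar, but the interruption mechanics differ: the preemption notice is 30 seconds rather than two minutes, and Preemptible VMs run for at most 24 hours.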

Should I use Karpenter or Cluster Autoscaler?

Karpenter for spot-heavy workloads — its bin-packing and diversification are dramatically better than Cluster Autoscaler.