AWS Spot Instances for Production Workloads: When, How, and Why It's Not Gambling (2026)

Spot isn't just for batch jobs. Here's how teams run production on Spot safely — workload selection, diversification, graceful interruption handling, and blending Spot with commitments.

Finoud Team · 11 min read

There’s a myth that Spot Instances are only for batch jobs and overnight number-crunching — that running production on capacity AWS can reclaim with two minutes’ notice is reckless. It isn’t. Plenty of teams run stateless production on Spot at scale, and they’re not gambling. They’re engineering for interruption the same way they already engineer for a node dying.

The headline number is real: Spot can be up to ~90% cheaper than On-Demand. The honest number is that a well-designed fleet blends Spot with On-Demand and commitments and lands at a realistic 50–70% blended saving on compute. That gap between “up to 90%” and “blended 50–70%” is where the engineering lives. This is how to capture it without paging yourself at 3 a.m.
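As a rough illustration of how a headline discount turns into a blended figure, the sketch below weights each layer's discount by its share of compute. The shares and discounts are illustrative assumptions, not AWS quotes, and `blended_saving` is a hypothetical helper name.

```python
def blended_saving(layers):
    """Return the fleet-wide saving versus running everything On-Demand.

    `layers` is a list of (share_of_compute, discount) pairs, where both
    values are fractions between 0 and 1 and the shares sum to 1.
    """
    total_share = sum(share for share, _ in layers)
    if abs(total_share - 1.0) > 1e-9:
        raise ValueError("layer shares must sum to 1")
    return sum(share * discount for share, discount in layers)


# Illustrative numbers only (assumptions, not AWS quotes).
fleet = [
    (0.70, 0.70),  # Spot: 70% of compute at a 70% discount
    (0.20, 0.30),  # commitment-covered On-Demand at a 30% discount
    (0.10, 0.00),  # uncovered On-Demand at list price
]
print(f"{blended_saving(fleet):.0%}")  # 55%
```

Even with most of the fleet on Spot, the On-Demand floor pulls the blended number well below the headline discount, which is why the blended range is the one worth planning around.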

The Spot myth vs the reality

A Spot Instance is the same EC2 instance you’d launch On-Demand (same hardware, same AMI, same performance), sold out of AWS’s spare capacity at a steep discount. The catch is the contract: when AWS needs that capacity back, it reclaims your instance with a two-minute interruption notice. That’s the entire risk model. Not slower instances, not throttled ones; just instances that can disappear on short notice.

The old mental model is out of date. Years ago, Spot ran on a live bidding market: you set a max price, prices swung wildly, and you could get outbid and evicted in seconds. AWS retired that model. Today Spot pricing is smoothed and far more stable, adjusting gradually based on long-term supply and demand per capacity pool. You no longer bid against other customers; you pay the current Spot price and optionally cap it. Interruptions are now driven by capacity, not by getting outbid — which makes them predictable enough to design around.

Reframe

If your service already survives an EC2 instance failing — because it’s behind a load balancer, runs multiple replicas, or pulls work from a queue — then it already survives a Spot interruption. Spot isn’t a new failure mode. It’s the failure mode you should already handle, just more frequently.

When Spot fits — and when it doesn’t

Workload selection is the decision that determines whether Spot is safe or scary. The test is simple: can this work tolerate a node vanishing with two minutes’ notice? If losing the instance means rescheduling work rather than losing data, Spot fits.

Strong fits for Spot

  • Stateless web and API tiers behind an ALB/NLB: drain a node, route around it, replace it. Users never notice.
  • Queue consumers and background workers: SQS, Kafka, or Celery-style workers whose in-flight messages are redelivered if a consumer dies mid-task.
  • Batch and CI/CD: build runners, test fleets, and AWS Batch jobs that retry cleanly. Interruptions cost minutes, not data.
  • Big-data and fault-tolerant processing: EMR task nodes, Spark executors, and ETL where the framework already assumes workers come and go.
  • Fault-tolerant Kubernetes workloads: stateless Deployments and horizontally-scaled services on Spot node groups, where the scheduler reschedules evicted pods automatically.

Keep these off Spot (or be very careful)

  • Primary stateful databases. Your writer node for Postgres, MySQL, or a self-managed datastore should not ride Spot. An interruption mid-write is a recovery event you don’t want to schedule for AWS’s convenience.
  • Long, non-checkpointed jobs. A 6-hour computation with no intermediate state is a coin flip: lose the node at hour 5 and you start over. Checkpoint it first, then Spot becomes viable.
  • License-locked single hosts. Software pinned to a specific host ID, dedicated-tenancy licensing, or single-instance appliances with no failover have nowhere to reschedule to.

Easy to miss

Stateful doesn’t automatically mean “no Spot.” A read replica that can be rebuilt from the primary, or a cache that warms from a source of truth, is fair game. The question is never “is there state?” — it’s “can this state be reconstructed faster than it costs to lose?”
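To make "checkpoint it first" concrete, here is a minimal sketch of a resumable job. It assumes a local JSON file as the checkpoint store (a stand-in for S3 or a database), and `run_with_checkpoints` and its `interrupt_after` parameter are hypothetical names used only to simulate an interruption.

```python
import json
import os
import tempfile


def run_with_checkpoints(items, checkpoint_path, interrupt_after=None):
    """Process `items` in order, recording progress after each one.

    Returns the number of items processed in this run. `interrupt_after`
    simulates a Spot interruption by stopping early after that many items.
    """
    start = 0
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            start = json.load(f)["next_index"]

    done = 0
    for index in range(start, len(items)):
        _ = items[index] ** 2  # stand-in for the real unit of work
        with open(checkpoint_path, "w") as f:
            json.dump({"next_index": index + 1}, f)
        done += 1
        if interrupt_after is not None and done >= interrupt_after:
            break
    return done


path = os.path.join(tempfile.mkdtemp(), "progress.json")
first = run_with_checkpoints(list(range(10)), path, interrupt_after=7)
second = run_with_checkpoints(list(range(10)), path)
print(first, second)  # 7 3
```

The second run picks up at item 8 instead of starting over, so an interruption costs only the unit of work that was in flight.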

How Spot capacity actually works

To run Spot safely you have to understand the unit AWS allocates from: the capacity pool. A pool is one specific instance type in one Availability Zone — for example m6i.large in us-east-1a. Each pool has its own depth and its own interruption risk at any given moment. When a pool runs short, AWS reclaims instances from that pool.

This is the single most important idea in the whole article: diversification across pools is your number-one lever against interruptions. If your entire fleet sits in one pool (c5.xlarge in one AZ), you’ve concentrated all your risk in one place: when that pool tightens, everything goes at once. Spread the same vCPU/memory target across a dozen pools and the allocation strategy can launch into whichever ones are deepest, while a shortage in any one pool touches only a slice of the fleet.

| Approach | Pools in play | Blast radius of a tight pool |
| --- | --- | --- |
| Single type, single AZ | 1 | Entire fleet can be reclaimed together |
| Single type, 3 AZs | 3 | Up to a third of the fleet at once |
| 6 types × 3 AZs | 18 | A small slice; capacity-optimized fills from deep pools |
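The pool arithmetic can be sketched in a few lines. The Availability Zone names beyond us-east-1a are assumptions for illustration, and the "share per pool" figure assumes the fleet is spread evenly, which a real allocation strategy will not guarantee.

```python
from itertools import product

instance_types = ["m6i.large", "m6a.large", "m5.large",
                  "m5a.large", "m5n.large", "m6in.large"]
zones = ["us-east-1a", "us-east-1b", "us-east-1c"]  # assumed AZ names

# A capacity pool is one instance type in one Availability Zone.
pools = list(product(instance_types, zones))


def worst_case_share(pool_count):
    """Share of an evenly spread fleet that sits in any single pool."""
    return 1 / pool_count


print(len(pools))                             # 18
print(f"{worst_case_share(len(pools)):.1%}")  # 5.6%
print(f"{worst_case_share(3):.1%}")           # 33.3%
```

Under the even-spread assumption, going from 3 pools to 18 cuts the slice exposed to any one tight pool from a third of the fleet to under 6%.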

AWS also gives you an early-warning signal called a rebalance recommendation. It fires when a pool’s interruption risk rises, often before the two-minute notice, so you can proactively launch a replacement and drain the at-risk instance on your own schedule rather than reacting to a hard eviction. Treat it as a softer, earlier version of the interruption notice.
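Both signals are also published as EventBridge events, so one handler can tell them apart by detail-type. The sketch below is a plain function rather than a full Lambda; the urgency labels it returns are hypothetical names, and the instance ID in the sample is made up.

```python
def classify_spot_event(event):
    """Map an EventBridge event to (urgency, instance_id), or None if unrelated."""
    if event.get("source") != "aws.ec2":
        return None
    detail_type = event.get("detail-type")
    instance_id = event.get("detail", {}).get("instance-id")
    if detail_type == "EC2 Instance Rebalance Recommendation":
        return ("replace-proactively", instance_id)
    if detail_type == "EC2 Spot Instance Interruption Warning":
        return ("drain-now", instance_id)
    return None


sample = {
    "source": "aws.ec2",
    "detail-type": "EC2 Instance Rebalance Recommendation",
    "detail": {"instance-id": "i-0123456789abcdef0"},  # made-up ID
}
print(classify_spot_event(sample))
```

Keeping the two outcomes separate lets the rebalance path launch a replacement first and drain at leisure, while the interruption path goes straight to draining.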

Handling interruptions gracefully

The two-minute notice is your whole window, and it’s plenty if you prepare. The signal arrives two ways: in the instance metadata service (poll /latest/meta-data/spot/instance-action) and as an EventBridge event you can route to automation. Don’t rely on one; wire both.

# Poll IMDSv2 for the interruption notice from inside the instance
TOKEN=$(curl -sX PUT "http://169.254.169.254/latest/api/token" \
  -H "X-aws-ec2-metadata-token-ttl-seconds: 21600")

curl -s -H "X-aws-ec2-metadata-token: $TOKEN" \
  http://169.254.169.254/latest/meta-data/spot/instance-action
# 404  -> healthy, keep serving
# 200  -> { "action": "terminate", "time": "2026-06-02T14:32:00Z" }
#         you now have ~120s: deregister, drain, checkpoint, exit
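If the polling agent is written in Python rather than shell, the response handling might look like the sketch below. It only interprets a status code and body that some HTTP client has already fetched; `seconds_until_action` is a hypothetical helper, and the timestamp format is assumed to match the sample shown above.

```python
import json
from datetime import datetime, timezone


def seconds_until_action(status, body, now):
    """Interpret a spot/instance-action response.

    Returns None when no interruption is scheduled (HTTP 404), otherwise
    the number of seconds left before the scheduled action.
    """
    if status == 404:
        return None
    if status != 200:
        raise RuntimeError(f"unexpected metadata status {status}")
    notice = json.loads(body)
    deadline = datetime.strptime(notice["time"], "%Y-%m-%dT%H:%M:%SZ")
    deadline = deadline.replace(tzinfo=timezone.utc)
    return (deadline - now).total_seconds()


now = datetime(2026, 6, 2, 14, 30, 0, tzinfo=timezone.utc)
body = '{"action": "terminate", "time": "2026-06-02T14:32:00Z"}'
print(seconds_until_action(200, body, now))  # 120.0
print(seconds_until_action(404, "", now))    # None
```

Returning the remaining seconds, rather than a boolean, lets the shutdown path decide how much draining and checkpointing it can still afford.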

When the notice fires, a graceful shutdown does four things, fast:

  1. Stop accepting new work. Deregister from the target group so the load balancer routes new requests elsewhere; pause the queue consumer so no new messages are pulled.
  2. Drain in-flight work. Let open connections and running tasks finish. The load balancer’s deregistration delay (connection draining on a Classic Load Balancer) buys you a clean handoff, as long as it is set shorter than the two-minute window.
  3. Checkpoint state. Flush progress for any resumable job to S3 or a database, or rely on the queue’s visibility-timeout redelivery, so the work resumes elsewhere instead of restarting.
  4. Exit cleanly. Terminate before AWS does, so your shutdown hooks run on your terms.
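The four steps above can be sketched as an ordered list of hooks run against a time budget. This is one possible shape, not a prescribed implementation: `graceful_shutdown` is a hypothetical helper, the lambdas stand in for real deregistration, drain, and checkpoint calls, and skipping late steps while always running the exit hook is a design assumption.

```python
import time


def graceful_shutdown(steps, budget_seconds=120.0, clock=time.monotonic):
    """Run (name, callable) steps in order within a time budget.

    Returns the names of the steps that ran. Steps that would start after
    the budget is spent are skipped, except the last step (the exit hook),
    which always runs.
    """
    started = clock()
    ran = []
    for position, (name, action) in enumerate(steps):
        is_last = position == len(steps) - 1
        if clock() - started >= budget_seconds and not is_last:
            continue
        action()
        ran.append(name)
    return ran


log = []
steps = [
    ("stop_accepting", lambda: log.append("deregistered from target group")),
    ("drain", lambda: log.append("in-flight requests finished")),
    ("checkpoint", lambda: log.append("progress flushed")),
    ("exit", lambda: log.append("process exiting")),
]
print(graceful_shutdown(steps))
# ['stop_accepting', 'drain', 'checkpoint', 'exit']
```

The budget check only runs between steps, so each hook still needs its own timeout; the sketch does not interrupt a step that overruns.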

On plain EC2, a mixed-instances Auto Scaling group fronted by an ELB handles registration and replacement for you: an interrupted Spot node deregisters and the ASG launches a replacement from the next-best pool. On Kubernetes, the pattern is mature. Karpenter, once it is pointed at an interruption queue, consumes Spot interruption notices natively, cordons and drains the doomed node, and provisions replacement capacity from healthy pools while the old node drains. The older Cluster Autoscaler works too, typically paired with the AWS Node Termination Handler to translate the notice (and, if you enable it, rebalance recommendations) into a cordon-and-drain. Either way, set realistic pod terminationGracePeriodSeconds and PodDisruptionBudgets so node drains respect your shutdown window.

Rule of thumb

If your shutdown path can’t reliably finish inside two minutes, that’s a signal to shrink the unit of work — smaller batches, more frequent checkpoints, shorter-lived tasks — not to abandon Spot.
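One way to apply this rule is to add up worst-case phase durations and compare them with the notice window. The phase names, durations, and 15-second safety margin below are illustrative assumptions; `fits_notice_window` is a hypothetical helper.

```python
NOTICE_SECONDS = 120  # Spot's two-minute interruption notice


def fits_notice_window(phase_seconds, safety_margin=15):
    """Return (fits, slack) for worst-case shutdown phase durations in seconds."""
    total = sum(phase_seconds.values())
    slack = NOTICE_SECONDS - safety_margin - total
    return slack >= 0, slack


phases = {"detect": 5, "deregister_and_drain": 60, "checkpoint": 30}
print(fits_notice_window(phases))                        # (True, 10)
print(fits_notice_window({**phases, "checkpoint": 90}))  # (False, -50)
```

A negative slack says how many seconds need to be cut from the unit of work, for example by checkpointing more often.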

A resilient Spot architecture

The production-grade pattern on EC2 is a mixed-instances Auto Scaling group: a small On-Demand base for the capacity you can never lose, with Spot layered on top for everything elastic. You express this directly in the ASG’s instances distribution.

# Mixed-instances ASG: On-Demand floor + diversified Spot on top
InstancesDistribution:
  OnDemandBaseCapacity: 2            # always-on floor, never interrupted
  OnDemandPercentageAboveBaseCapacity: 20   # 20% On-Demand above the base
  SpotAllocationStrategy: price-capacity-optimized
  # remaining 80% above the base runs on Spot

LaunchTemplate:
  Overrides:                         # diversify across many pools
    - InstanceType: m6i.large
    - InstanceType: m6a.large
    - InstanceType: m5.large
    - InstanceType: m5a.large
    - InstanceType: m5n.large
    - InstanceType: m6in.large
# spread across 3+ AZs via the ASG's subnets -> 6 types x 3 AZs = 18 pools
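To see what those two On-Demand settings mean for a given fleet size, the sketch below splits a desired capacity into On-Demand and Spot counts. Rounding the On-Demand share up is an assumption made here; the ASG's exact rounding may differ, so treat fractional cases as approximate. `capacity_split` is a hypothetical helper.

```python
import math


def capacity_split(desired, on_demand_base, on_demand_pct_above_base):
    """Return (on_demand, spot) instance counts for a desired capacity.

    Rounding the On-Demand share up is an assumption made here to err on
    the side of stability; check the ASG documentation for exact behavior.
    """
    base = min(desired, on_demand_base)
    above_base = desired - base
    on_demand_above = math.ceil(above_base * on_demand_pct_above_base / 100)
    on_demand = base + on_demand_above
    return on_demand, desired - on_demand


print(capacity_split(12, 2, 20))  # (4, 8)
print(capacity_split(2, 2, 20))   # (2, 0)
```

With the settings above, a 12-instance fleet runs 4 On-Demand and 8 Spot, and a fleet scaled in to the base runs no Spot at all.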

The four design rules that make this resilient:

  • Keep a small On-Demand floor. OnDemandBaseCapacity guarantees a minimum that survives even a total Spot drought. Size it to the absolute minimum your service needs to stay up, not to your average load.
  • Diversify families and sizes. Mix m6i, m6a, m5, m5n, and equivalents so a shortage in one family doesn’t starve the fleet. Pick types with comparable vCPU/memory so any of them can run your workload.
  • Spread across AZs. Three or more AZs multiplies your pool count and your availability at the same time.
  • Choose the right allocation strategy. capacity-optimized launches from the deepest pools to minimize interruptions; price-capacity-optimized is the modern default: it weighs both capacity depth and price, so you get most of the interruption resilience while still favoring cheaper pools. Avoid lowest-price for production: it crams you into the cheapest pools regardless of depth, and interruption rates climb.

Sequence matters

Don’t move an oversized fleet to Spot: you’ll just buy a deep discount on waste, and a bigger fleet means more instances exposed to interruption. Rightsize first, then Spot the correctly-sized result. See how to rightsize cloud resources without breaking applications before you flip fleets to Spot.

Blending Spot with commitments

The most common misconception is that Spot and Savings Plans compete. They don’t — they cover different layers of the same fleet. Savings Plans and Reserved Instances discount On-Demand usage; Spot is already discounted and billed separately, so a commitment never applies to Spot. That makes them complementary by construction.

Picture your compute as three stacked layers:

| Layer | What it is | How you pay for it |
| --- | --- | --- |
| Predictable baseline | The floor your fleet never drops below | Savings Plans / RIs (deepest discount on steady spend) |
| On-Demand buffer | Headroom for spikes + Spot fallback | On-Demand (partly covered by the commitment) |
| Elastic burst | Scale-out, batch, fault-tolerant work | Spot (up to ~90% off, outside any commitment) |

The practical implication for buying: size your commitments to the On-Demand baseline you keep after Spot, not your total fleet. If you commit to your whole footprint and then shift half of it to Spot, the commitment can’t draw down against Spot usage and you’ll under-utilize what you bought. Decide your Spot strategy first, then commit to the On-Demand floor that remains. The full decision framework for that floor (1-year vs 3-year, Compute SP vs EC2 Instance SP) lives in Savings Plans vs Reserved Instances; layer Spot on top of whatever baseline you land on there.
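The under-utilization risk is easy to see with a toy calculation. The sketch below is a simplification: it treats the commitment and the On-Demand usage as the same hourly unit and ignores how Savings Plans discount rates are actually applied. The dollar figures and `commitment_utilization` are hypothetical.

```python
def commitment_utilization(committed_hourly, on_demand_hourly):
    """Fraction of an hourly commitment that On-Demand usage can draw down."""
    if committed_hourly <= 0:
        raise ValueError("commitment must be positive")
    return min(on_demand_hourly, committed_hourly) / committed_hourly


footprint = 100.0  # hypothetical steady compute spend, $/hour
spot_share = 0.5   # half of the fleet moves to Spot

on_demand_after_spot = footprint * (1 - spot_share)

# Commit to the whole footprint, then move half to Spot:
print(commitment_utilization(footprint, on_demand_after_spot))             # 0.5
# Commit to the On-Demand floor that remains:
print(commitment_utilization(on_demand_after_spot, on_demand_after_spot))  # 1.0
```

In the first case half of the commitment is paid for and never used, which can cancel out much of what the Spot move saved.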

Monitoring Spot in production

Spot is a system you operate, not a setting you toggle. Four signals tell you whether it’s healthy and whether you’re actually capturing the savings:

  • Interruption rate. Track interruptions per pool over time. A creeping rate means a pool is tightening, so add more instance types or AZs before it becomes a reliability problem. The Spot placement score and historical interruption data help you pick new pools.
  • Savings realized. Compare what you paid on Spot to the equivalent On-Demand cost. This is the number that justifies the engineering: if your blended saving isn’t in the 50–70% range, something’s off (too little Spot, or pools chosen for price over capacity).
  • Capacity-pressure alerts. Alarm on a spike in rebalance recommendations or repeated fallback to On-Demand; both mean your pools are getting thin and diversification needs attention.
  • Fallback events. Count how often the ASG fills with On-Demand instead of Spot. Occasional fallback is the system working as designed; constant fallback quietly erases your savings and deserves investigation.
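Three of these signals reduce to simple ratios once the raw counts are collected. The sketch below assumes you already have launch and interruption records keyed by pool and cost totals from your billing data; the function names and sample numbers are hypothetical.

```python
from collections import Counter


def interruption_rate_by_pool(launches, interruptions):
    """Interruptions divided by launches for each (instance_type, az) pool."""
    launched = Counter(launches)
    interrupted = Counter(interruptions)
    return {pool: interrupted[pool] / count for pool, count in launched.items()}


def realized_saving(spot_cost, on_demand_equivalent):
    """Saving on Spot usage relative to its On-Demand equivalent cost."""
    return 1 - spot_cost / on_demand_equivalent


def fallback_ratio(spot_launches, on_demand_fallback_launches):
    """Share of scale-out launches that fell back to On-Demand."""
    total = spot_launches + on_demand_fallback_launches
    return on_demand_fallback_launches / total if total else 0.0


launches = [("m6i.large", "us-east-1a")] * 10 + [("m5.large", "us-east-1b")] * 10
interruptions = [("m5.large", "us-east-1b")] * 3
print(interruption_rate_by_pool(launches, interruptions))
print(round(realized_saving(320.0, 1000.0), 2))  # 0.68
print(fallback_ratio(45, 5))                     # 0.1
```

Tracking the per-pool rate over time, rather than a single fleet-wide number, is what shows which pool to diversify away from.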

How Finoud helps

Finoud tracks your Spot utilization and the savings you’re actually realizing against the On-Demand equivalent, surfaces interruption and capacity-pressure visibility so you can spot tightening pools early, and models the optimal blend of Spot, On-Demand, and commitments against your real usage — across AWS, GCP, and Azure in one view — so you commit to the baseline you actually keep and let the elastic layer ride Spot. Join the waitlist for early access.

Frequently asked questions

How much warning do I get before a Spot instance is reclaimed?
You get a two-minute interruption notice delivered through instance metadata (IMDS) and an EventBridge event. Rebalance recommendations can arrive even earlier, signaling that a pool is at elevated risk. Design your handlers to drain connections and checkpoint state within that two-minute window.

Can I run stateful workloads or databases on Spot?
Avoid Spot for primary stateful databases and any long transaction that can't be resumed; losing the host mid-write is not worth the discount. Spot shines for stateless services, queue workers, batch jobs, CI runners, and fault-tolerant data processing where a node can vanish and the work simply reschedules. If a replica can be rebuilt from a source of truth, it can usually live on Spot.

Do Savings Plans or Reserved Instances apply to Spot usage?
No. Spot is already discounted and billed separately, so commitments never draw it down. Savings Plans and Reserved Instances cover On-Demand usage only, so size your commitments to the On-Demand baseline you keep after Spot, not your total fleet.

How do I reduce Spot interruptions?
Diversify across many instance types and Availability Zones using capacity-optimized (or price-capacity-optimized) allocation, so AWS draws from the deepest, healthiest pools. Keep an On-Demand fallback in a mixed-instances Auto Scaling group so capacity gaps degrade gracefully instead of failing. The more pools you can run on, the fewer interruptions you'll feel.