GPU cost optimization on AWS: a working playbook

CloudDrove · 8 min read

GPU instances are usually the most expensive line on an AI infrastructure bill, and most of them run at 30 to 45% utilization. A single p5.48xlarge (8x H100) running around the clock can cost tens of thousands of dollars a month on-demand, depending on region and current pricing. With a few deliberate choices you can cut GPU spend 40 to 60% without slowing development or hurting inference performance.

TL;DR

GPU instances are the priciest items in most AI budgets, and they typically sit at 30 to 45% utilization.
Match instance types to the workload, mix Spot, Capacity Blocks, and On-Demand deliberately, and instrument utilization before you provision.
Done well, this cuts GPU cost 40 to 60% without slowing development or degrading inference performance.

Why GPU spend leaks

GPU procurement usually happens under deadline pressure and then goes unreviewed. The result is that 40 to 60% of GPU spend ends up on mismatched or idle capacity: the wrong instance for the workload, or instances nobody turned off.

1. Match the instance to the workload shape

Training tolerates interruption if you checkpoint every 15 to 30 minutes, so it suits Spot. Inference needs steady baseline capacity, so it wants On-Demand or Reserved. Fine-tuning does well on Spot plus aggressive auto-shutdown.
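The checkpointing requirement is easier to judge with something concrete. Below is a minimal sketch of a time-based checkpoint-and-resume loop, assuming a toy training state that fits in JSON. The names `save_checkpoint`, `load_checkpoint`, and `train` are hypothetical, and a real job would save model and optimizer state through its framework instead.

```python
import json
import os
import tempfile
import time

CHECKPOINT_INTERVAL_S = 15 * 60  # lower end of the 15 to 30 minute range


def save_checkpoint(path, state):
    """Write state atomically so an interruption mid-write cannot corrupt it."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    os.replace(tmp_path, path)  # atomic rename on the same filesystem


def load_checkpoint(path):
    """Return the saved state, or a fresh one if no checkpoint exists yet."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"step": 0, "loss_sum": 0.0}


def train(path, total_steps, interval_s=CHECKPOINT_INTERVAL_S,
          stop_after=None, clock=time.monotonic):
    """Toy loop: resumes from `path` and checkpoints every `interval_s` seconds.

    `stop_after` simulates a Spot interruption after that many steps in this run.
    """
    state = load_checkpoint(path)
    last_save = clock()
    steps_this_run = 0
    while state["step"] < total_steps:
        if stop_after is not None and steps_this_run >= stop_after:
            return state  # interrupted: work since the last checkpoint is lost
        state["step"] += 1
        state["loss_sum"] += 1.0 / state["step"]  # stand-in for a real training step
        steps_this_run += 1
        if clock() - last_save >= interval_s:
            save_checkpoint(path, state)
            last_save = clock()
    save_checkpoint(path, state)  # final checkpoint when the run completes
    return state
```

Calling `train` again with the same path after an interruption picks up from the last saved step. The most work an interruption can cost is one checkpoint interval, which is why the interval matters more than the interruption rate.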

2. Build a deliberate purchasing mix

Spot typically runs 50 to 70% below on-demand pricing, in exchange for a two-minute interruption notice. Capacity Blocks for ML reserve guaranteed GPU capacity for a fixed window booked in advance, from 1 to 14 days. On-Demand or Reserved covers steady inference. A rough split of 70% Capacity Blocks, 20% Spot, and 10% Reserved runs about 45% cheaper than all on-demand.

Figure: a 70 / 20 / 10 split across Capacity Blocks, Spot, and Reserved runs about 45% cheaper than buying all on-demand.
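To see how a mix turns into a blended saving, here is a small sketch of the arithmetic. The per-option discounts are illustrative assumptions, not quoted prices: Spot uses the midpoint of the 50 to 70% range, and the Capacity Block and Reserved figures are placeholders to replace with your own quotes.

```python
def blended_savings(mix, discounts):
    """Fraction saved versus all on-demand.

    mix: share of GPU-hours per purchasing option (shares must sum to 1).
    discounts: fractional discount versus on-demand for each option.
    """
    if abs(sum(mix.values()) - 1.0) > 1e-9:
        raise ValueError("mix shares must sum to 1")
    return sum(share * discounts[option] for option, share in mix.items())


mix = {"capacity_blocks": 0.70, "spot": 0.20, "reserved": 0.10}

# ASSUMED discounts for illustration only; substitute real pricing.
assumed_discounts = {"capacity_blocks": 0.40, "spot": 0.60, "reserved": 0.40}

savings = blended_savings(mix, assumed_discounts)
print(f"Blended savings vs on-demand: {savings:.0%}")
```

Under these assumed discounts the blend comes out at 44%, in the same range as the figure above. Because Capacity Blocks carry 70% of the hours, the result is most sensitive to the Capacity Block discount, so that is the number to pin down first.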

3. Instrument utilization before you provision

Export GPU metrics (the same utilization and memory counters nvidia-smi reports) through NVIDIA's DCGM exporter into Prometheus, segmented by team and workload. That alone surfaces forgotten instances, queued jobs, and oversized deployments, usually within days.
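As a rough sketch of what that looks like once the metrics land in Prometheus, the snippet below queries the standard HTTP API for two-week average utilization and lists GPUs under a threshold. `DCGM_FI_DEV_GPU_UTIL` is the utilization gauge dcgm-exporter publishes. The `team` label is an assumption that depends on your own relabeling, the other label names may differ in your setup, and the function names are hypothetical.

```python
import json
import urllib.parse
import urllib.request

# DCGM_FI_DEV_GPU_UTIL is a 0-100 gauge from dcgm-exporter.
# ASSUMPTION: a `team` label has been attached via relabeling.
QUERY = "avg by (team, Hostname, gpu) (avg_over_time(DCGM_FI_DEV_GPU_UTIL[14d]))"


def fetch_instant_query(base_url, query=QUERY):
    """Run an instant query against the Prometheus HTTP API and return parsed JSON."""
    url = f"{base_url}/api/v1/query?" + urllib.parse.urlencode({"query": query})
    with urllib.request.urlopen(url, timeout=30) as resp:
        return json.load(resp)


def underused_gpus(response, threshold_pct=30.0):
    """Return (labels, avg_util) pairs below the threshold, lowest first."""
    if response.get("status") != "success":
        raise RuntimeError(f"query failed: {response}")
    rows = []
    for series in response["data"]["result"]:
        _timestamp, value = series["value"]  # the API returns the value as a string
        util = float(value)
        if util < threshold_pct:
            rows.append((series["metric"], util))
    return sorted(rows, key=lambda row: row[1])
```

Something like `underused_gpus(fetch_instant_query("http://prometheus.internal:9090"))`, with a placeholder URL, would then give a starting list of GPUs to investigate, sorted from most idle upward.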

4. Right-size before you reserve

Get utilization visibility and fix workload matching first. Only then commit to Reserved Instances or Savings Plans, so you are reserving the right thing.
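One way to make "the right thing" concrete is a break-even check. The sketch below assumes a simple model: a commitment bills for every hour of the term at a discounted rate, while on-demand bills only for the hours you actually run. The function names and the 40% discount in the example are illustrative assumptions.

```python
def breakeven_utilization(discount):
    """Fraction of hours an instance must be needed for a commitment to beat on-demand.

    Committed cost per term:  (1 - discount) * rate * all_hours
    On-demand cost per term:  rate * used_hours
    The two are equal when used_hours / all_hours == 1 - discount.
    """
    if not 0.0 <= discount < 1.0:
        raise ValueError("discount must be in [0, 1)")
    return 1.0 - discount


def commitment_pays_off(needed_fraction, discount):
    """True if the measured share of needed hours clears the break-even point."""
    return needed_fraction > breakeven_utilization(discount)


# ASSUMED 40% commitment discount, for illustration only.
print(breakeven_utilization(0.40))      # break-even share of hours
print(commitment_pays_off(0.35, 0.40))  # capacity needed 35% of the time
print(commitment_pays_off(0.90, 0.40))  # steady inference baseline
```

Here `needed_fraction` means the share of hours the capacity is required at all, not the GPU utilization percentage while it runs. Under the assumed discount, capacity needed only about a third of the time is cheaper on-demand, while a steady inference baseline clears the bar easily.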

What this looks like in practice

One AI startup cut monthly GPU spend from $187K to $94K, roughly a 50% reduction, by moving training to a Capacity Blocks plus Spot architecture, shifting inference off p4d onto right-sized g5 instances behind autoscaling, and terminating idle instances. Performance improved too: inference p99 latency went down.

Trade-offs worth knowing

Keep a Reserved or On-Demand baseline for latency-sensitive inference. Don't put it all on Spot.
Test checkpoint-and-resume before betting a multi-day training run on Spot interruptions.
Capacity Blocks need forecasting discipline. If your roadmap shifts weekly, Spot's flexibility may serve you better.

What to do next

Establish a two-week utilization baseline with DCGM and Prometheus, segmented by team, then model your Spot, Capacity Block, and On-Demand mix against it. A Cloud Infrastructure Assessment can run that audit and hand back a prioritized plan.
