Modeling the Cost and Performance of A100 vs H100 with Cold Loads
Over the past few months, I’ve been building ExaSketch, a system simulation framework for modeling high-performance computing clusters and AI workloads. It’s my personal tool for exploring infrastructure tradeoffs and performance scenarios without needing to build anything or spend a dollar on hardware.
To demonstrate what ExaSketch can reveal, I simulated a surprisingly common inefficiency I've seen at AI startups: what happens if your inference infrastructure reloads the model from disk every single time a new request comes in? At first glance, that sounds like an obvious anti-pattern—but when you quantify it, the impact is more dramatic than most people expect.
I modeled a 1,000-request LLM inference workload that represents a worst-case deployment. Each request performs a full cold model load of a 100 GB model, followed by a moderate-length input sequence of around 850 tokens (representing a realistic prefill stage), then a memory-intensive decode stage with large key-value cache reads approximating a 30,000-token output, and finally a network stage that returns a response of roughly 1,000 tokens. This cold-load pattern is unlikely when serving a single model, but it becomes plausible when users can independently select from many models hosted on the same system. The goal wasn’t to create an optimized workload; it was to evaluate how well different systems tolerate unfavorable conditions.
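For concreteness, here is a minimal sketch of the simulated per-request pipeline as plain Python data. The structure and field names are hypothetical and illustrative only, not ExaSketch's actual configuration format; only the sizes come from the description above.

```python
# Hypothetical description of one simulated request; the structure and
# field names are illustrative, not ExaSketch's real config format.
REQUEST_STAGES = [
    {"stage": "cold_model_load", "bytes_read_from_disk": 100 * 1024**3},  # 100 GB model weights
    {"stage": "prefill",         "input_tokens": 850},                    # moderate-length prompt
    {"stage": "decode",          "output_tokens": 30_000},                # KV-cache-heavy generation
    {"stage": "network_return",  "response_tokens": 1_000},               # response sent back to client
]

NUM_REQUESTS = 1_000  # total requests in the simulated workload
```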
System Configurations
Modest A100 Cluster
4 nodes, each with:
2× A100 40GB GPUs
1× NVMe SSD
2× ConnectX-6 25Gb NICs
1× EPYC 9654 CPU
Total system cost: ~$140,000 (assumes $9,000/GPU)
Peak power: ~5.2 kW
Likely fits in a single standard 42U rack
Could potentially operate from 120V power in a lab environment
Overbuilt H100 Cluster
4 nodes, each with:
8× H100 80GB GPUs
1× high-performance NVMe SSD
2× ConnectX-7 200Gb NICs
1× EPYC 9654 CPU
Total system cost: ~$1,030,000 (assumes $30,000/GPU)
Peak power: ~25 kW
Requires multiple power feeds, likely 208V 3-phase power
Not suitable for a single standard rack; designed for data center deployment
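As a quick sanity check on where the money goes, here is the GPU share of each system's price, derived directly from the per-GPU prices and totals above. This is a back-of-the-envelope calculation, not ExaSketch output.

```python
# GPU share of total system cost, using the prices stated above.
a100_gpus, a100_price, a100_total = 4 * 2, 9_000, 140_000
h100_gpus, h100_price, h100_total = 4 * 8, 30_000, 1_030_000

for name, gpus, price, total in [
    ("A100 cluster", a100_gpus, a100_price, a100_total),
    ("H100 cluster", h100_gpus, h100_price, h100_total),
]:
    gpu_cost = gpus * price
    print(f"{name}: {gpus} GPUs -> ${gpu_cost:,} of ${total:,} "
          f"({gpu_cost / total:.0%} of system cost)")
# A100 cluster: 8 GPUs -> $72,000 of $140,000 (51% of system cost)
# H100 cluster: 32 GPUs -> $960,000 of $1,030,000 (93% of system cost)
```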
Tokens Per Second per Dollar (higher is better)
Despite the inefficiency of the workload, the results are telling. The A100 cluster completed the run in just under 477 seconds, while the H100 cluster finished in about 40 seconds. That’s an 11.9× speedup. In terms of raw throughput, the A100 cluster sustained about 1,340 tokens per second system-wide, or about 165 TPS per GPU. The H100 cluster, by contrast, delivered 16,190 TPS system-wide. That's more than an order of magnitude higher, and about 506 TPS per GPU. It’s important to note the GPU count difference between the systems: while the H100 system performs ~12× better overall, each GPU contributes roughly 3× more throughput than those in the A100 cluster.
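These ratios fall straight out of the reported totals. Here is a quick back-of-the-envelope check, using the rounded figures quoted above (so small deviations from the in-text numbers are expected) and the system prices from the configurations section:

```python
# Derived ratios from the reported runtimes, throughput, GPU counts, and prices.
a100 = {"runtime_s": 477, "tps": 1_340,  "gpus": 8,  "cost_usd": 140_000}
h100 = {"runtime_s": 40,  "tps": 16_190, "gpus": 32, "cost_usd": 1_030_000}

speedup = a100["runtime_s"] / h100["runtime_s"]        # ~11.9x
tps_per_gpu_a100 = a100["tps"] / a100["gpus"]          # ~167 TPS per GPU
tps_per_gpu_h100 = h100["tps"] / h100["gpus"]          # ~506 TPS per GPU
per_gpu_gain = tps_per_gpu_h100 / tps_per_gpu_a100     # ~3x
tps_per_dollar_a100 = a100["tps"] / a100["cost_usd"]   # ~0.0096 TPS per dollar
tps_per_dollar_h100 = h100["tps"] / h100["cost_usd"]   # ~0.0157 TPS per dollar

print(f"speedup: {speedup:.1f}x, per-GPU gain: {per_gpu_gain:.1f}x, "
      f"TPS-per-dollar ratio: {tps_per_dollar_h100 / tps_per_dollar_a100:.1f}x")
```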
Power and Energy Use
Energy per 1000 Generated Tokens (lower is better)
A100 system
Avg. power draw: ~2.4 kW
Total energy consumed: ~577 kJ
Energy cost: ~$0.035 per 1,000 requests
H100 system
Avg. power draw: ~14.3 kW
Total energy consumed: ~298 kJ
Energy cost: ~$0.018 per 1,000 requests
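The per-run costs above follow from a simple unit conversion. The electricity rate below is my assumption (roughly $0.22/kWh), chosen because it reproduces the reported figures; it is not something the simulation outputs.

```python
# Convert total simulated energy (kJ) to an electricity cost.
# The $/kWh rate is an assumption, not a value reported by the simulation.
ELECTRICITY_RATE_USD_PER_KWH = 0.22

def energy_cost_usd(energy_kj: float) -> float:
    kwh = energy_kj / 3_600          # 1 kWh = 3,600 kJ
    return kwh * ELECTRICITY_RATE_USD_PER_KWH

print(f"A100: ${energy_cost_usd(577):.3f}")   # ~$0.035 for the 1,000-request run
print(f"H100: ${energy_cost_usd(298):.3f}")   # ~$0.018 for the 1,000-request run
```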
At first glance, the takeaway might seem obvious: more and faster GPUs lead to faster inference. But the real insight is subtler. The H100 cluster is indeed faster, but it's also more than 7× more expensive, so its performance-per-dollar advantage is modest: roughly 1.6× in tokens per second per dollar, and about a 3× improvement in per-GPU throughput. In addition, in order to make use of the additional performance from the GPUs, we need to build the rest of the system (I/O, memory, and networking) in a way that can keep up. If the workload were slightly more disk-bound, or if storage performance hadn't scaled, the H100s would have been underutilized.
While energy costs for this short simulation are trivial (under $0.20 for the A100 system and under $0.10 for the H100 system), they compound quickly in production. At millions of requests per day, power costs and power delivery constraints matter.
Total Cost of Ownership over 3 Years (CapEx + Energy at Simulated Average Utilization)
This view captures total cost of ownership: purchase price plus three years of energy costs. We assume both systems run at their simulated average utilization (46% for the A100 cluster, 56% for the H100 cluster) and use those values to project power usage. This is equivalent to assuming they will run our simulated workload for the entire 36-month period. The H100 system's higher power draw isn't inherently bad, but it does underscore the need to consider both upfront and operational expenses.
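Here is one way to reproduce that projection. It assumes energy scales as peak power times average utilization (which matches the simulated average draws above) and reuses the assumed ~$0.22/kWh rate; ExaSketch's internal methodology may differ.

```python
# Three-year TCO = purchase price + projected energy cost at the simulated
# average utilization. The electricity rate and the peak-power-times-utilization
# model are assumptions layered on top of the reported numbers.
HOURS_3_YEARS = 24 * 365 * 3          # 26,280 hours
RATE_USD_PER_KWH = 0.22               # assumed electricity price

systems = {
    "A100 cluster": {"capex": 140_000,   "peak_kw": 5.2,  "utilization": 0.46},
    "H100 cluster": {"capex": 1_030_000, "peak_kw": 25.0, "utilization": 0.56},
}

for name, s in systems.items():
    avg_kw = s["peak_kw"] * s["utilization"]          # ~2.4 kW and ~14 kW
    energy_cost = avg_kw * HOURS_3_YEARS * RATE_USD_PER_KWH
    print(f"{name}: capex ${s['capex']:,} + energy ${energy_cost:,.0f} "
          f"= ${s['capex'] + energy_cost:,.0f} over 3 years")
```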
Beyond cost and performance, there’s also deployment complexity. The A100 system is simple to install and run: it fits in a single rack, may not need exotic cooling, and might even run on standard power. The H100 system requires serious infrastructure for power and cooling. It’s not something you casually drop into a lab.
If you're choosing between systems like these, the key question isn't just raw performance; it's how much you're paying per token over time. ExaSketch helps answer that before you commit real dollars.
This is what ExaSketch is built to explore. I use it to simulate real-world workloads so I can help teams make better design decisions. Not all bottlenecks are obvious, and not all hardware upgrades deliver proportional gains. But with simulation, you can find out—before you build.
If you're working on infrastructure and want to explore performance tradeoffs before making big decisions, feel free to reach out. I’d be happy to simulate your workload. I’m interested in calibrating ExaSketch against real-world workloads to refine it and make sure it’s as accurate as possible.