A100 vs H100 vs H200 vs L40: which GPU for your workload

Clodei team · April 18, 2026 · 4 min read

We get this question almost every week: "Which GPU should I rent?" NVIDIA's marketing pages all read the same way (faster, bigger, smarter), so this is the version that is actually useful when you're staring at a workload and trying to pick the cheapest GPU that will get the job done.

The four GPUs at a glance

GPU	VRAM	FP16 TFLOPS (dense)	Mem bandwidth	Sweet spot
L40	48 GB	~181 (~362 FP8)	864 GB/s	Inference, smaller training, rendering
A100 80GB	80 GB	~312 (no FP8)	2 TB/s	Mid-size training, long-context inference
H100 80GB	80 GB	~989 (~1,979 FP8)	3.35 TB/s	Large training, latency-sensitive serving
H200 141GB	141 GB	~989 (~1,979 FP8)	4.8 TB/s	Long-context serving, memory-bound jobs

These are nominal numbers. Real throughput depends on batching, kernel choice, and how memory-bound your workload actually is.
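To make "memory-bound" concrete, here is a rough sketch of the usual back-of-the-envelope bound for batch-1 decoding: every generated token has to read all model weights from VRAM once, so bandwidth divided by weight size caps tokens per second. The bandwidth figures are the nominal ones from the table; the helper name and the 2 bytes per parameter (FP16) default are our assumptions, and the bound ignores KV-cache reads and kernel overhead, so real numbers will be lower.

```python
# Rough ceiling on batch-1 decode speed: each token reads all weights once.
# Bandwidths are the nominal table figures, in GB/s.
BANDWIDTH_GBPS = {
    "L40": 864,
    "A100 80GB": 2000,
    "H100 80GB": 3350,
    "H200 141GB": 4800,
}


def decode_ceiling_tokens_per_s(params_billion: float,
                                bandwidth_gbps: float,
                                bytes_per_param: float = 2.0) -> float:
    """Upper bound on tokens/s for batch-1 decode (hypothetical helper)."""
    weight_gb = params_billion * bytes_per_param  # billions of params * bytes each
    return bandwidth_gbps / weight_gb


for gpu, bw in BANDWIDTH_GBPS.items():
    ceiling = decode_ceiling_tokens_per_s(7, bw)
    print(f"{gpu}: ~{ceiling:.0f} tok/s ceiling for a 7B FP16 model")
```

The ordering of the ceilings follows the bandwidth column, not the TFLOPS column, which is why compute headline numbers say little about small-batch serving.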

A100 80GB, still the workhorse

A100 launched in 2020. Nearly six years later it remains the most cost-effective GPU for a huge slice of real work, for three reasons.

It has 80 GB of VRAM, which fits 7B and 13B models comfortably with room for the KV cache. Its HBM2e at 2 TB/s is enough memory bandwidth for almost any inference scenario short of frontier scale. And its per-hour price is the lowest of the three HBM cards here; only the L40 rents for less.
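A quick way to check the "fits comfortably" claim is to estimate the weight footprint and see what is left over. This is a minimal sketch assuming FP16 weights at 2 bytes per parameter; the function name is ours, and it does not count activations, the KV cache, or framework overhead.

```python
def weight_gb(params_billion: float, bytes_per_param: float = 2.0) -> float:
    """Approximate weight footprint in GB (billions of params * bytes each)."""
    return params_billion * bytes_per_param


VRAM_GB = 80  # A100 80GB

for size in (7, 13, 34):
    used = weight_gb(size)
    left = VRAM_GB - used
    print(f"{size}B at FP16: ~{used:.0f} GB of weights, ~{left:.0f} GB left over")
```

At 34B the leftover shrinks to roughly 12 GB, so the upper end of the 7B–34B range is where quantization or a shorter context starts to matter.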

Pick A100 80GB when:

You're running 7B–34B inference and care about €/token more than tokens/second.
You're fine-tuning something Llama-class with LoRA or QLoRA.
The model fits in 80 GB with room to spare.

Skip A100 when:

You need FP8 (it doesn't have it natively).
You're hitting memory bandwidth limits (large-batch decode).
You're training from scratch at scale.

H100 80GB, the FP8 step change

H100 added two things that matter: native FP8 support, which gives roughly 2x the throughput of A100 on transformer kernels that can use it, and 3.35 TB/s of memory bandwidth, which makes large-batch decode scale much better.

The €/h price is around 2.5–3x the A100, so the math only works when you can actually extract the FP8 advantage and run the GPU near saturation. For batch-1 inference on a 7B model, the H100 is strictly worse value than the A100: batch-1 decode is limited by memory bandwidth rather than compute, so the FP8 throughput you are paying for sits idle.
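The break-even logic can be sketched in a few lines: cost per token scales with price divided by throughput, so the H100 only wins when its measured speedup over the A100 exceeds the price ratio. The function name and the example speedups below are illustrative assumptions, not benchmarks.

```python
def relative_cost_per_token(price_ratio: float, throughput_ratio: float) -> float:
    """Cost per token on the pricier GPU relative to the baseline.

    price_ratio:      pricier GPU's EUR/h divided by the baseline's EUR/h
    throughput_ratio: pricier GPU's tokens/s divided by the baseline's tokens/s
    A result below 1.0 means the pricier GPU is cheaper per token.
    """
    return price_ratio / throughput_ratio


# Hypothetical speedups you would measure on your own workload.
for price, speedup in [(2.5, 2.0), (3.0, 3.5)]:
    rel = relative_cost_per_token(price, speedup)
    verdict = "H100 wins" if rel < 1 else "A100 wins"
    print(f"price {price}x, speedup {speedup}x -> {rel:.2f}x cost per token ({verdict})")
```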

Pick H100 when:

You're serving moderate-to-high QPS where FP8 gets you more tokens per euro.
You're training a model where the budget closes only with H100-class throughput.
Latency matters and the workload can keep the GPU saturated.

Skip H100 when:

You can't run the model in FP8 (that rules out a lot of inference setups).
Your workload is memory-bandwidth bound. Use H200 instead.

H200 141GB, the long-context play

H200 is H100 with bigger, faster memory. Same compute, about 1.4x the bandwidth (4.8 vs 3.35 TB/s), about 1.75x the VRAM (141 vs 80 GB).

What that buys you in practice:

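70B-class models without parallelism gymnastics. Quantized to FP8 (about 70 GB of weights) they fit on a single GPU with room for the KV cache; at FP16 (about 140 GB) the weights alone fill the card.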
Long-context inference (32k+ tokens) where the KV cache dominates VRAM.
Bigger batches before you run out of memory bandwidth.
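To see why the KV cache dominates at long context, here is a sketch of the standard size formula: keys and values are stored for every layer, KV head, and token. The configuration below (80 layers, 8 KV heads, head dimension 128, FP16 cache) is an assumption chosen to resemble a 70B-class model with grouped-query attention; substitute your model's actual values.

```python
def kv_cache_bytes(seq_len: int, n_layers: int, n_kv_heads: int, head_dim: int,
                   batch_size: int = 1, bytes_per_elem: int = 2) -> int:
    """KV-cache size in bytes: 2 tensors (K and V) per layer, per KV head, per token."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch_size * bytes_per_elem


# Assumed 70B-class configuration with grouped-query attention (illustrative).
cfg = dict(n_layers=80, n_kv_heads=8, head_dim=128)

one_seq = kv_cache_bytes(32_768, **cfg)
batch_8 = kv_cache_bytes(32_768, batch_size=8, **cfg)
print(f"{one_seq / 2**30:.1f} GiB per 32k-token sequence")  # 10.0 GiB
print(f"{batch_8 / 2**30:.1f} GiB for a batch of 8")        # 80.0 GiB
```

Under these assumptions a batch of eight 32k-token sequences needs as much cache as an entire H100 has VRAM, before counting any weights.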

Pick H200 when:

You serve 70B models, or want to, without sharding.
You're memory-bound today and bandwidth is the bottleneck.
Long-context serving is on the roadmap.

Skip H200 when:

You're compute-bound. H100 is cheaper per FLOP.
The H100's 80 GB of VRAM is plenty.

L40, the underrated inference GPU

L40 is built on the Ada Lovelace generation. It's a workstation-derived datacenter card with 48 GB VRAM, strong FP8 support, and a lower €/h than the A100, which makes it cheaper per token for many inference scenarios.

L40 is the right answer for:

Small-batch serving of 7B–13B models.
Image generation (Stable Diffusion, Flux).
Rendering and graphics work that benefits from RT cores.

L40 is the wrong answer for:

Training anything bigger than a single-card LoRA fine-tune.
Workloads that need more than 48 GB VRAM.
Distributed training (the L40 has no NVLink, so multi-GPU traffic goes over PCIe, unlike A100/H100).

A simple decision tree

If your model fits in 24 GB, look at consumer cards before datacenter. Too small for this comparison.

If you're training from scratch or doing a full fine-tune, default to H100. Step up to H200 if memory-bound, down to A100 if budget-bound.

If you're serving inference: L40 for 7B–13B small batch, H100 for 7B–34B when you can use FP8, H200 for long-context or 70B-class. If you're not sure, A100 80GB is the safest default.

If you're doing graphics, rendering or simulation, L40 has RT cores. The others don't.
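The tree above can be written down as a small function. This is a sketch of one reading of it, not a definitive selector: the parameter names are ours, and where the text's rules overlap (for example 7B–13B models that could go to either L40 or H100) the order of the checks is our interpretation.

```python
def pick_gpu(task: str, *, fits_24gb: bool = False, params_billion: float = 0,
             small_batch: bool = False, can_use_fp8: bool = False,
             long_context: bool = False, memory_bound: bool = False,
             budget_bound: bool = False) -> str:
    """Return a GPU suggestion for task in {'training', 'inference', 'graphics'}."""
    if fits_24gb:
        return "consumer card"  # too small for this comparison
    if task == "training":  # from scratch or a full fine-tune
        if memory_bound:
            return "H200 141GB"
        if budget_bound:
            return "A100 80GB"
        return "H100 80GB"
    if task == "inference":
        if long_context or params_billion >= 70:
            return "H200 141GB"
        if params_billion <= 13 and small_batch:
            return "L40"
        if params_billion <= 34 and can_use_fp8:
            return "H100 80GB"
        return "A100 80GB"  # the safe default when unsure
    if task == "graphics":  # rendering or simulation that benefits from RT cores
        return "L40"
    raise ValueError(f"unknown task: {task!r}")


print(pick_gpu("inference", params_billion=13, small_batch=True))  # L40
print(pick_gpu("training", memory_bound=True))                     # H200 141GB
```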

Pricing reality

The €/h ratios on Clodei (and on most EU specialists) are roughly:

L40: 1x
A100 80GB: 1.5–2x
H100 80GB: 4–5x
H200 141GB: 6–8x

The "right" GPU is the cheapest one that gets the job done in the time you have. If you won't look at the results until morning either way, renting an H100 for a job that an A100 finishes overnight just costs more money for the same wait. Renting an A100 for a workload that hits memory-bandwidth cliffs is renting a smaller car when you needed a truck.
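As a sketch of "cheapest one that gets the job done in the time you have": total cost is the price ratio times the hours rented, filtered by your deadline. The price ratios below are midpoints of the ranges listed above; the runtimes are made-up placeholders you would replace with your own measurements.

```python
from typing import Dict, Optional

# Midpoints of the relative EUR/h ranges above (L40 = 1x).
PRICE_RATIO = {"L40": 1.0, "A100 80GB": 1.75, "H100 80GB": 4.5, "H200 141GB": 7.0}


def cheapest_within_deadline(runtime_hours: Dict[str, float],
                             deadline_hours: float) -> Optional[str]:
    """Return the GPU with the lowest total relative cost that meets the deadline."""
    costs = {
        gpu: PRICE_RATIO[gpu] * hours
        for gpu, hours in runtime_hours.items()
        if hours <= deadline_hours
    }
    return min(costs, key=costs.get) if costs else None


# Hypothetical runtimes for one job, in hours.
runtimes = {"A100 80GB": 10.0, "H100 80GB": 4.0}

print(cheapest_within_deadline(runtimes, deadline_hours=12))  # A100 80GB
print(cheapest_within_deadline(runtimes, deadline_hours=6))   # H100 80GB
```

With these placeholder numbers the overnight A100 run costs 17.5 units against 18 for the H100, so the faster card only pays off once the deadline rules the A100 out.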

Match the GPU to the bottleneck, not the brand to the badge.
