H100 vs A100: Performance, Specs, and Cloud Pricing (2026)

Carl Peterson | July 27, 2026 | 12 min read

The H100 is 2-4x faster than the A100 for LLM training and inference, but its higher hourly rate means it only delivers lower total job cost when the speedup exceeds the ratio of the two hourly rates: about 2x at $2.19/hr versus $1.09/hr, with clear savings from roughly 3x upward. For LoRA fine-tuning, small-model inference, and development work, the A100 at $1.09/hr is usually the cheaper option per job.

This guide covers the full spec comparison, real-world benchmarks, a per-job cost model, and a plain-language decision guide for the most common developer workloads.

Takeaways

The H100 is 2-4x faster for LLM training than the A100; the speedup grows with model size and reaches 4x with FP8 on 70B+ models.
The A100 does not support FP8 or the Transformer Engine; these are exclusive to Hopper and matter most for large-model inference throughput.
Both GPUs support MIG partitioning into up to seven isolated instances; the A100 is cheaper per MIG slice for small-model multi-tenant deployments.

Architecture: Ampere vs. Hopper

The A100 (2020) is built on NVIDIA's Ampere architecture. It introduced 3rd-gen Tensor Cores, MIG (Multi-Instance GPU) partitioning, and support for FP16 and BF16 mixed-precision formats. The A100 was the standard GPU for AI training from 2020 through 2023 and remains widely deployed across cloud infrastructure.

The H100 (2022) is built on NVIDIA's Hopper architecture. It adds 4th-gen Tensor Cores, a dedicated Transformer Engine, and native FP8 precision support (none of which exist on the A100). Hopper also upgrades inter-GPU communication to NVLink 4.0 at 900 GB/s, versus the A100's NVLink 3.0 at 600 GB/s. For transformer-based workloads, the step change between Ampere and Hopper is significant.

FP8 and the Transformer Engine

The H100's Transformer Engine with native FP8 precision is its key advantage over the A100. FP8 halves the memory footprint of model weights compared to FP16, letting the H100 fit larger batches, reduce memory-bound bottlenecks, and sustain higher compute throughput simultaneously. The A100 has no native FP8 support and tops out at FP16 and BF16, which limits throughput at the batch sizes required for large-model inference.

The practical impact scales with model size. For an 8B model, the H100's FP8 advantage is modest, roughly 2x throughput over the A100. For 32B or 70B models, that advantage grows to 3-4x, because FP8 most effectively relieves the memory bandwidth pressure that dominates large-model inference.
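The halving described above is simple arithmetic, and a short sketch makes the scale concrete. It counts weights only (no KV cache, activations, or optimizer state), and the function name and precision table are illustrative choices, not part of any library.

```python
# Bytes needed to store one parameter in each precision format.
BYTES_PER_PARAM = {"fp16": 2, "bf16": 2, "fp8": 1}


def weight_memory_gb(params_billion: float, precision: str) -> float:
    """Approximate weight-only footprint in GB (1 GB = 1e9 bytes)."""
    # params_billion * 1e9 params * bytes / 1e9 bytes-per-GB simplifies to this.
    return params_billion * BYTES_PER_PARAM[precision]


for size in (8, 32, 70):
    fp16 = weight_memory_gb(size, "fp16")
    fp8 = weight_memory_gb(size, "fp8")
    print(f"{size}B params: FP16 ~{fp16:.0f} GB, FP8 ~{fp8:.0f} GB")
```

On an 80 GB card, the difference between 140 GB and 70 GB of weights for a 70B model is the difference between not fitting at all and fitting with little room to spare.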

Specs Comparison

Specification	A100 PCIe 80GB	A100 SXM 80GB	H100 PCIe 80GB	H100 SXM 80GB
Architecture	Ampere	Ampere	Hopper	Hopper
CUDA Cores	6,912	6,912	14,592	16,896
VRAM	80 GB HBM2e	80 GB HBM2e	80 GB HBM2e	80 GB HBM3
Memory Bandwidth	1,935 GB/s	2,039 GB/s	2,000 GB/s	3,350 GB/s
FP16 Tensor TFLOPS (dense)	312	312	756	989
FP8 Support	No	No	Yes	Yes
Transformer Engine	No	No	Yes	Yes
NVLink Bandwidth	600 GB/s	600 GB/s	600 GB/s	900 GB/s
MIG Instances	Up to 7	Up to 7	Up to 7	Up to 7
TDP	300W	400W	350W	700W
Release Year	2020	2020	2022	2022
Sources: NVIDIA A100 Datasheet; NVIDIA H100 PCIe Product Brief; NVIDIA Hopper Architecture In-Depth. FP16 Tensor TFLOPS are dense figures; structured sparsity doubles each value.

The H100 SXM's 3,350 GB/s bandwidth is 64% higher than the A100 SXM's 2,039 GB/s. The SXM form factor is used in most cloud GPU deployments, making this the bandwidth gap that matters most for production workloads.
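As a rough check on the 64% figure, and on why FP8 widens the gap for memory-bound work, the sketch below combines the two bandwidth numbers from the table with the assumption that FP8 moves half the bytes of FP16. The combined figure is an idealized ceiling under that assumption, not a benchmark result, and the function names are hypothetical.

```python
A100_SXM_BW_GB_S = 2039  # memory bandwidth from the specs table
H100_SXM_BW_GB_S = 3350  # memory bandwidth from the specs table


def bandwidth_gain(new_bw: float, old_bw: float) -> float:
    """Ratio of new to old memory bandwidth."""
    return new_bw / old_bw


def memory_bound_gain_with_fp8(new_bw: float, old_bw: float) -> float:
    """Idealized ceiling for a purely memory-bound workload.

    Assumption: FP8 halves the bytes read per weight relative to FP16,
    so the effective data rate doubles on top of the raw bandwidth gain.
    """
    return 2 * bandwidth_gain(new_bw, old_bw)


raw = bandwidth_gain(H100_SXM_BW_GB_S, A100_SXM_BW_GB_S)
combined = memory_bound_gain_with_fp8(H100_SXM_BW_GB_S, A100_SXM_BW_GB_S)
print(f"Raw bandwidth gain: {raw:.2f}x ({(raw - 1) * 100:.0f}% higher)")
print(f"Memory-bound ceiling with FP8 vs FP16: {combined:.2f}x")
```

The ceiling lands in the same 3-4x range quoted for large models, which is consistent with bandwidth plus FP8 accounting for most of the gain on memory-bound workloads.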

H100 vs A100 Benchmark: Training and Inference Performance

The figures below represent typical ranges observed across independent testing for transformer-based models. Results vary by model architecture, precision format, and batch size.

Training throughput (Llama-class models, H100 speedup over A100 SXM baseline):

Model Size	A100 SXM	H100 SXM	H100 Speedup
7B (FP16/BF16)	Baseline	~2x	~2x
30B (BF16)	Baseline	~2.5x	~2.5x
70B (BF16)	Baseline	~3x	~3x
70B (FP8 on H100)	Baseline	~4x	~4x
Based on published MLPerf results and independent benchmarks. FP8 speedup applies to H100 only; A100 does not support FP8.

H100 inference performance: tokens/second (vLLM, single GPU)

Model	A100 SXM (FP16)	H100 SXM (FP8)	H100 Speedup
Llama-3 8B	~130 tok/s	~300 tok/s	~2.3x
Llama-3 70B	~35 tok/s	~140 tok/s	~4x
Qwen3 32B	~55 tok/s	~220 tok/s	~4x
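Batch size 16, single-GPU. H100 numbers use FP8 with the Transformer Engine. A100 numbers use FP16. Exact figures vary by model quantization and serving framework version. Llama-3 70B weights alone occupy about 140 GB in FP16, more than one 80 GB GPU holds, so the A100 70B figure cannot come from an unquantized FP16 single-GPU run; treat it as approximate.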

The inference gap widens with model size. For models under 13B, the H100's advantage is real but modest. For 30B and 70B models, the combination of higher memory bandwidth and FP8 makes the H100 the clear choice for production inference.
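To turn throughput into serving cost, one option is to convert each row into dollars per million tokens. The sketch below does that using the table's approximate tok/s figures and the $1.09 and $2.19 hourly rates quoted in this article. It assumes the tok/s values are total GPU throughput and that the GPU stays fully busy, both of which are simplifications.

```python
# Approximate throughput figures copied from the inference table above.
THROUGHPUT_TOK_S = {
    "Llama-3 8B": {"A100": 130, "H100": 300},
    "Llama-3 70B": {"A100": 35, "H100": 140},
    "Qwen3 32B": {"A100": 55, "H100": 220},
}

# Hourly rates quoted in this article (assumed to apply to the benchmarked GPUs).
HOURLY_RATE = {"A100": 1.09, "H100": 2.19}


def cost_per_million_tokens(hourly_rate: float, tok_per_s: float) -> float:
    """Dollars per one million generated tokens at full utilization."""
    tokens_per_hour = tok_per_s * 3600
    return hourly_rate / tokens_per_hour * 1_000_000


for model, by_gpu in THROUGHPUT_TOK_S.items():
    a100 = cost_per_million_tokens(HOURLY_RATE["A100"], by_gpu["A100"])
    h100 = cost_per_million_tokens(HOURLY_RATE["H100"], by_gpu["H100"])
    print(f"{model}: A100 ${a100:.2f}/M tok, H100 ${h100:.2f}/M tok")
```

Under these inputs the per-token gap is narrow for the 8B row and wide for the 32B and 70B rows, which matches the pattern described above.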

Which GPU Is Actually Cheaper? The Per-Job Cost Model

Hourly rate comparisons are misleading in isolation. The H100 costs more per hour but completes jobs faster, meaning fewer total GPU hours per run. The H100 only wins on total job cost when the speedup is large enough to offset the hourly premium.

BF16/FP16 QLoRA: A Closer Contest

For a Llama-3 70B QLoRA fine-tune using BF16 compute with 4-bit quantized base weights, H100 speedups over the A100 typically fall in the 1.5-2.5x range. QLoRA's quantization and dequantization steps add overhead that limits how much of Hopper's compute advantage is usable.

On an A100 SXM at $1.09/hr, a 36-hour run costs $39.24. The same run on an H100 SXM at $2.19/hr, assuming a 2x speedup, finishes in roughly 18 hours for $39.42. The H100 finishes twice as fast but costs nearly the same total. The choice comes down to whether faster turnaround matters, not which option is cheaper.
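The arithmetic in this comparison can be reproduced with a small helper. This is a minimal sketch: `job_cost` is a hypothetical name, and the 2x speedup is the same assumption the paragraph above makes.

```python
def job_cost(baseline_hours: float, hourly_rate: float, speedup: float = 1.0) -> float:
    """Total cost of a job that takes `baseline_hours` on the reference GPU.

    `speedup` is how many times faster this GPU runs the same job.
    """
    hours = baseline_hours / speedup
    return round(hours * hourly_rate, 2)


a100_cost = job_cost(baseline_hours=36, hourly_rate=1.09)
h100_cost = job_cost(baseline_hours=36, hourly_rate=2.19, speedup=2.0)

print(f"A100: ${a100_cost:.2f} over 36 hours")
print(f"H100: ${h100_cost:.2f} over {36 / 2.0:.0f} hours")
```

Swapping in your own baseline hours, rates, and measured speedup gives the per-job comparison for a specific run.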

See the full NVIDIA A100 specs guide for VRAM, MIG configuration, and fine-tuning use cases.

FP8: Where the H100 Pulls Ahead

The picture changes when the job uses the H100's FP8 training via the Transformer Engine, a feature not available on the A100. Speedups with FP8 climb to 3-4x over A100 BF16, because Hopper's Tensor Cores run FP8 math at roughly twice their BF16 rate and each FP8 value moves half the bytes.

Using the same 36-hour A100 baseline at $39.24 and assuming a 3x speedup, an H100 job using FP8 finishes in about 12 hours for $26.28, which is 33% cheaper despite the H100's roughly doubled hourly rate. In this scenario, the FP8 throughput gain is large enough to justify the price premium.
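More generally, the H100 is cheaper per job whenever its speedup exceeds the ratio of the two hourly rates. The sketch below computes that break-even point and the savings at a given speedup, using the $1.09 and $2.19 rates from this article; the function names are illustrative.

```python
A100_RATE = 1.09  # $/hr, as quoted in this article
H100_RATE = 2.19  # $/hr, as quoted in this article


def break_even_speedup(fast_rate: float, slow_rate: float) -> float:
    """Speedup at which both GPUs cost the same per job."""
    return fast_rate / slow_rate


def savings_fraction(speedup: float, fast_rate: float, slow_rate: float) -> float:
    """Fraction saved by using the faster GPU (negative means it costs more)."""
    return 1 - (fast_rate / slow_rate) / speedup


print(f"Break-even speedup: {break_even_speedup(H100_RATE, A100_RATE):.2f}x")
for s in (1.5, 2.0, 3.0, 4.0):
    pct = savings_fraction(s, H100_RATE, A100_RATE) * 100
    print(f"{s:.1f}x speedup: H100 job cost changes by {-pct:+.0f}% vs A100")
```

Because job length cancels out of the ratio, the result depends only on the two rates and the measured speedup, not on how long the run is.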

Cloud GPU Pricing Comparison

The table below shows on-demand hourly rates per GPU for A100 80GB and H100 80GB across major providers, as of July 2026.

Provider	A100 80GB / hr	H100 80GB / hr	Notes
Thunder Compute	$1.09	$2.19	Per-minute billing, storage included
Vast.ai	$0.77*	$2.00*	Marketplace; rates vary by host
Runpod	$1.39	$1.99
Hyperstack	$1.39	$1.90
Lambda Labs	$2.79	$3.29
AWS	$1.85	$3.93	Storage and egress billed separately
Google Cloud	~$3.67	~$11.06	A3 instances; storage billed separately
CoreWeave	$2.50	$2.70	GPU component only; CPU/RAM/storage separate
* Vast.ai is a marketplace; rates reflect lowest available listings as of 6/27/2026 and vary by host and region. Hyperscaler rates are per-GPU estimates and exclude storage and egress costs.

The spread between the cheapest and most expensive H100 options is nearly 6x ($1.90/hr on Hyperstack versus about $11.06/hr on Google Cloud). A 30-hour training job costs $66 on Thunder Compute and $332 on Google Cloud for the GPU alone. Provider choice is often a larger cost lever than the GPU choice itself.
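The provider comparison can be recomputed directly from the H100 column of the table. The sketch below uses those listed rates as-is (GPU only, no storage or egress) and a 30-hour job as the example length.

```python
# H100 80GB on-demand rates ($/hr) copied from the pricing table above.
H100_RATES = {
    "Thunder Compute": 2.19,
    "Vast.ai": 2.00,
    "Runpod": 1.99,
    "Hyperstack": 1.90,
    "Lambda Labs": 3.29,
    "AWS": 3.93,
    "Google Cloud": 11.06,
    "CoreWeave": 2.70,
}


def job_cost(hourly_rate: float, hours: float) -> float:
    """GPU-only cost of a job; storage and egress are not included."""
    return hourly_rate * hours


cheapest = min(H100_RATES, key=H100_RATES.get)
priciest = max(H100_RATES, key=H100_RATES.get)
spread = H100_RATES[priciest] / H100_RATES[cheapest]

for provider, rate in sorted(H100_RATES.items(), key=lambda kv: kv[1]):
    print(f"{provider:16s} ${job_cost(rate, 30):7.2f} for 30 hours")
print(f"Spread: {spread:.1f}x ({cheapest} to {priciest})")
```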

See Thunder Compute's full H100 cloud pricing guide, updated quarterly.

MIG GPU Partitioning: Multi-Tenant and Development Use Cases

Both the A100 and H100 support MIG partitioning, splitting a single physical GPU into up to seven isolated instances. Each instance gets its own dedicated VRAM, cache, and compute cores for hardware-level workload isolation.

On the A100 80GB, each of the seven MIG instances gets approximately 10 GB of dedicated HBM2e. This makes the A100 practical for multi-tenant serving, development teams sharing a single GPU, or inference deployments where a 10 GB slice covers the model.

The H100 supports the same seven-instance MIG configuration. Each H100 MIG instance also benefits from higher memory bandwidth and FP8 throughput, making it better suited for larger quantized models or low-latency concurrent inference pipelines.
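A quick way to reason about MIG economics is to divide the GPU's hourly rate across its seven slices and check whether a model's weights fit in one. The sketch below assumes you rent the whole GPU at the rates quoted in this article and partition it yourself; the 10 GB slice size comes from the text above, while the 80% headroom factor is an arbitrary illustrative allowance for KV cache and runtime overhead.

```python
MIG_SLICES = 7
SLICE_GB = 10.0  # approximate memory per slice on an 80 GB GPU


def cost_per_slice_hour(gpu_hourly_rate: float, slices: int = MIG_SLICES) -> float:
    """Hourly cost attributed to one MIG slice when the GPU is fully partitioned."""
    return gpu_hourly_rate / slices


def fits_in_slice(params_billion: float, bytes_per_param: float,
                  slice_gb: float = SLICE_GB, headroom: float = 0.8) -> bool:
    """Weight-only fit check; `headroom` is a hypothetical safety margin."""
    weight_gb = params_billion * bytes_per_param
    return weight_gb <= slice_gb * headroom


print(f"A100 slice: ${cost_per_slice_hour(1.09):.3f}/hr")
print(f"H100 slice: ${cost_per_slice_hour(2.19):.3f}/hr")
print("3B model in FP16 fits:", fits_in_slice(3, 2))
print("8B model in FP16 fits:", fits_in_slice(8, 2))
print("8B model in FP8 fits: ", fits_in_slice(8, 1))
```

The fit check illustrates the trade-off described above: the A100 slice is cheaper, while FP8 on the H100 lets a slice hold a model roughly twice as large.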

GPU Decision Matrix

GPU selection depends on model size, workload type, and whether per-hour or per-job cost is the priority.

Use Case	Recommended GPU	Why
LLM training (70B+ parameters)	H100	FP8 + bandwidth makes total job cost roughly 33-50% lower than A100 at a 3-4x speedup
LLM training (7B-30B parameters)	A100 or H100	H100 is faster; A100 is cheaper/hr. Run the per-job math for your run length.
LoRA / QLoRA fine-tuning (<30B)	A100	Cost-efficient; H100 speedup rarely justifies the premium at this scale
High-throughput inference (70B+)	H100	4x+ tok/s advantage at large model sizes; lower cost/token served
Production inference (<13B, latency-sensitive)	A100 or H100	Both capable; A100 is cheaper. H100 worth it if you need >200 tok/s single-GPU.
Batch inference / async jobs	A100	Latency tolerance makes cost efficiency the priority
Multi-tenant MIG serving	A100 or H100	Both support 7 MIG instances; A100 is cheaper for small-model deployments
Development and experimentation	A100	About 50% lower hourly rate ($1.09 vs $2.19); H100 speed advantage matters less during iteration
HPC / scientific simulation (FP64)	H100	H100 SXM delivers ~67 TFLOPS FP64 Tensor Core vs A100's 19.5 TFLOPS (roughly 3.4x more)

Renting H100 and A100 GPUs on Thunder Compute

Thunder Compute offers on-demand A100 80GB instances from $1.09/hr and H100 PCIe 80GB instances from $2.19/hr, billed by the minute with no long-term commitments. Both include persistent storage at no additional cost.

The VS Code and Cursor extensions connect your editor directly to a running instance, with no SSH configuration required. To switch GPU types, save an instance snapshot and relaunch it on different hardware.

Last Thoughts on H100 vs A100

For 30B+ training and high-throughput inference, the H100's FP8 advantage typically makes it cheaper per job despite the higher hourly rate. For development, LoRA fine-tuning, and smaller-model inference, the A100 is more cost-effective. Provider pricing matters as much as GPU generation: in the pricing table above, the same H100 ranges from $1.90/hr to about $11.06/hr depending on where you rent it.

FAQ

Is H100 Worth It Over A100 in 2026?

For 70B+ training and inference, yes. At a 3-4x speedup, the H100's throughput advantage cuts total job cost by roughly 33-50% despite the higher hourly rate. For LoRA fine-tuning and smaller model inference, the A100 is usually cheaper in total.

How Much Faster Is H100 Than A100 for LLM Training?

The H100 is roughly 2-3x faster for most LLM training workloads. When FP8 precision is used on the H100 with memory-bound models (30B+), the speedup reaches 4x. Smaller models see closer to 2x because they are more compute-bound.

Does the A100 Support FP8?

No. FP8 and the Transformer Engine are exclusive to the Hopper architecture. The A100 supports FP16 and BF16, which cover most fine-tuning and training workflows. FP8 matters most for high-throughput inference on large models.

Which GPU Is Better for LoRA and QLoRA Fine-Tuning?

The A100 is the better choice for models under 30B parameters. At $1.09/hr, it typically delivers lower total cost than the H100 for multi-hour runs. For fine-tunes running 20+ hours on 30B+ models, run the per-job cost calculation, because the H100 may be cheaper.

What Is the Cheapest Way to Rent an H100 in 2026?

Thunder Compute offers H100 PCIe 80GB instances from $2.19/hr with no minimum commitment, storage included, and per-minute billing. Marketplace options like Vast.ai can list lower prices, but rates vary by host and availability is inconsistent.

Is the A100 Still Relevant in 2026?

Yes. The A100 is the better pick for budget-sensitive development, LoRA fine-tuning, and inference workloads that do not require FP8. It has a mature software stack and runs every major framework without extra configuration.

What Is NVLink and Does It Matter for Single-GPU Workloads?

NVLink is NVIDIA's GPU-to-GPU interconnect and has no impact on single-GPU workloads. For multi-GPU distributed training, the H100's NVLink 4.0 (900 GB/s) is 50% faster than the A100's NVLink 3.0 (600 GB/s).

What Is the Difference Between H100 PCIe and H100 SXM?

The SXM variant uses HBM3 memory at 3,350 GB/s and draws 700W in a high-performance server module. The PCIe variant uses HBM2e at 2,000 GB/s and draws 350W in a standard server slot. SXM delivers higher throughput and is the form factor used in most cloud GPU deployments; PCIe is easier to deploy in standard servers.

When Does H100 Cost Less Than A100 Per Job?

When the H100's speedup exceeds the ratio of the two hourly rates. At $2.19/hr versus $1.09/hr, that break-even point is about 2x: at exactly 2x the two GPUs cost nearly the same per job, and at 3x the H100 is about 33% cheaper. For workloads with less than a 2x speedup, such as small-model fine-tuning or development work, the A100 is typically cheaper per job.

What Is the Memory Bandwidth Difference Between the H100 SXM and A100 SXM?

The H100 SXM delivers 3,350 GB/s via HBM3, 64% higher than the A100 SXM's 2,039 GB/s via HBM2e. This bandwidth gap is the primary driver of the H100's advantage on large-model inference and training, where GPUs spend more time moving data than executing math.

How Does MIG Partitioning Differ Between the H100 and A100?

Both GPUs support up to seven isolated MIG instances with dedicated VRAM, cache, and compute. The H100's MIG instances benefit from higher memory bandwidth and FP8 throughput, making them better suited for larger quantized models or low-latency concurrent inference. The A100 is cheaper per MIG slice for small-model multi-tenant deployments.