
Fine Tune Llama 3 or Llama 4 on a Single A100: Cost, Time, and QLoRA Code

Fine‑tune Llama 3 or the newly‑released Llama 4 Scout on a single Thunder Compute A100, with exact commands, runtimes, and cost maths.

Published: Apr 19, 2025 | Last updated: Apr 19, 2025

1) Prerequisites

What | Why
Thunder Compute account (includes $20 monthly credit) | Covers ~35 min on an A100 80 GB
VS Code + Thunder Compute extension | One‑click instance creation & remote workspace
Python 3.10 + Conda | Env bootstrap

Follow the official Quick Start guide to install the VS Code extension.

2) Create an A100 80 GB instance (≈ $0.78/hr)

  • In the Console: click New Instance → A100 80 GB.

  • In VS Code: click the + icon at the top of the Thunder Compute tab and pick A100 80 GB.

  • Make sure to set storage to 300 GB or more for Llama 4 Scout.
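For context, Scout is a 16‑expert MoE with roughly 109 B total parameters, so the bf16 checkpoint download alone is on the order of 2 bytes × 109 B ≈ 218 GB, which is why 300 GB of disk is the practical minimum.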

3) Connect from VS Code

Command Palette → Thunder Compute: Connect (or click the “⇄” icon beside the instance).
VS Code reloads; the integrated terminal is now running on the GPU box—no extra Remote‑SSH extension required.
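Once connected, you can sanity‑check that the terminal is really running on the GPU instance:

nvidia-smi

It should list a single A100 80 GB.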

4) Set up the Python environment
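Install the training stack first. The exact package list below is an assumption based on the imports used by the script in step 5 (transformers, datasets, peft, trl, plus bitsandbytes and accelerate for 4‑bit loading); pin versions to taste.

conda create -n qlora python=3.10 -y
conda activate qlora
pip install torch transformers datasets peft trl bitsandbytes accelerate huggingface_hub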

Authenticate with Hugging Face:

huggingface-cli login

For the Llama family of models, you will need to use this form to request access permissions. Approval time varies but generally takes <5 minutes.

5) Minimal QLoRA training script

Create train_llama_qLoRA.py:

import logging, os, time, torch
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    TrainingArguments,          # prefer explicit TrainingArguments for clarity
    logging as hf_logging,
    TrainerCallback, TrainerState, TrainerControl
)
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from trl import SFTTrainer

# ------------------------------------------------------------------
# 1 · LOGGING SET‑UP
# ------------------------------------------------------------------
os.makedirs("logs", exist_ok=True)
logging.basicConfig(
    format="%(asctime)s — %(levelname)s — %(name)s — %(message)s",
    handlers=[
        logging.StreamHandler(),                 # stdout
        logging.FileHandler("logs/train.log")    # file
    ],
    level=logging.INFO
)
log = logging.getLogger("qlora")

# give Hugging Face/transformers the same verbosity
hf_logging.set_verbosity_info()

# ------------------------------------------------------------------
# 2 · DATA + MODEL
# ------------------------------------------------------------------
model_name = "meta-llama/Llama-4-Scout-17B-16E-Instruct"   # or "meta-llama/Meta-Llama-3-8B"
log.info(f"Loading dataset and tokenizer for {model_name}")

dataset = load_dataset("Abirate/english_quotes", split="train[:2%]")
tok = AutoTokenizer.from_pretrained(model_name)
tok.pad_token = tok.eos_token

log.info("Loading model in 4‑bit … this can take a minute")
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    load_in_4bit=True,
    device_map="auto"
)

peft_cfg = LoraConfig(
    r=64, lora_alpha=16,
    target_modules=["q_proj","k_proj","v_proj","o_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)
model = get_peft_model(model, peft_cfg)
log.info("LoRA adapters added")

# ------------------------------------------------------------------
# 3 · OPTIONAL GPU‑MEM CALLBACK
# ------------------------------------------------------------------
class GPUStatsCallback(TrainerCallback):
    def on_log(self, args:TrainingArguments, state:TrainerState,
               control:TrainerControl, **kwargs):
        if torch.cuda.is_available():
            alloc = torch.cuda.memory_allocated() / 1024**3
            reserv = torch.cuda.memory_reserved() / 1024**3
            logging.info(f"GPU mem  alloc={alloc:.1f} GiB, reserved={reserv:.1f} GiB")

# ------------------------------------------------------------------
# 4 · TRAINER
# ------------------------------------------------------------------
training_args = TrainingArguments(
    output_dir="outputs",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,
    num_train_epochs=1,
    learning_rate=2e-4,
    fp16=True,
    logging_steps=10,
    logging_dir="logs",
    report_to="none"        # disable wandb / tensorboard auto‑uploads
)

trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    max_seq_length=2048,
    tokenizer=tok,
    args=training_args,
    callbacks=[GPUStatsCallback()]
)

log.info("Starting training")
trainer.train()
trainer.save_model("outputs")   # writes the LoRA adapter weights to outputs/

Run:

python train_llama_qLoRA.py

6) Expected runtime & VRAM

Model | Steps (≈ 1 epoch on 2 % of dataset) | Time on A100 80 GB | Peak VRAM
Llama 3‑8B (4‑bit) | ~1500 | ~2 h | 42 GB
Llama 4 Scout 17B (4‑bit) | ~1500 | ~2 h | ~79 GB

Need Maverick or larger? Spin up 2 – 4 × A100s from the same dialog and let torchrun --nproc_per_node {N} handle model parallelism. Scout with 4-bit quantization will still run happily on a single card.

7) Track spend & shut down

You can track spend and manage instances through the console.

8) Next steps

  • Swap in your own dataset (see the sketch after this list).

  • Increase num_train_epochs until loss plateaus.

  • If VRAM allows, switch to load_in_8bit=True for 8‑bit precision.
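A minimal sketch of the first bullet, assuming your data lives in a local my_data.jsonl with a "text" column (both the filename and the column name are placeholders):

from datasets import load_dataset

# Hypothetical local file; any format load_dataset supports works the same way
dataset = load_dataset("json", data_files="my_data.jsonl", split="train")

# If the text lives under a different column, rename it
# (or pass dataset_text_field to SFTTrainer, depending on your trl version)
# dataset = dataset.rename_column("quote", "text")

Pass this dataset to SFTTrainer exactly as in the script above.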

FAQ

Why QLoRA? It trains low‑rank adapters while keeping the base weights in 4‑bit, letting 8B–70B models fit on a single A100.
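As a rough sanity check on that claim: 70 B parameters stored in 4‑bit occupy about 70 B × 0.5 bytes ≈ 35 GB, so even the largest Llama 3 base weights fit on an 80 GB A100 with headroom left for LoRA adapters, activations, and optimizer state.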

What about Llama 4 Maverick? The 128‑expert variant needs ~300 GB of VRAM (4 × A100 80 GB in INT4).

Commands and script tested on a fresh Thunder Compute A100 80 GB instance (Ubuntu 22.04).

Carl Peterson
