
Fine Tune Llama 3 or Llama 4 on a Single A100: Cost, Time, and QLoRA Code

Fine‑tune Llama 3 or the newly‑released Llama 4 Scout on a single Thunder Compute A100, with exact commands, runtimes, and cost maths.

Published: Apr 19, 2025 | Last updated: Apr 19, 2025

1) Prerequisites

What | Why
Thunder Compute account (includes $20 monthly credit) | Covers ~35 min on an A100 80 GB
VS Code + Thunder Compute extension | One‑click instance creation & remote workspace
Python 3.10 + Conda | Env bootstrap

Follow the official Quick Start guide to install the VS Code extension.

2) Create an A100 80 GB instance (≈ $0.78/hr)

  • In the Console: click New Instance → A100 80 GB.

  • In VS Code: click the + icon at the top of the Thunder Compute tab and pick A100 80 GB.

  • Make sure to set storage to 300 GB or more for Llama 4 Scout.
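For context, Scout is a 16‑expert MoE with roughly 109 B total parameters, so the bf16 checkpoint download alone is on the order of 2 bytes × 109 B ≈ 218 GB, which is why 300 GB of disk is the practical minimum.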

3) Connect from VS Code

Command Palette → Thunder Compute: Connect (or click the “⇄” icon beside the instance).
VS Code reloads; the integrated terminal is now running on the GPU box—no extra Remote‑SSH extension required.
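Once connected, you can sanity‑check that the terminal is really running on the GPU instance:

nvidia-smi

It should list a single A100 80 GB.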

4) Set up the Python environment
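Install the training stack first. The exact package list below is an assumption based on the imports used by the script in step 5 (transformers, datasets, peft, trl, plus bitsandbytes and accelerate for 4‑bit loading); pin versions to taste.

conda create -n qlora python=3.10 -y
conda activate qlora
pip install torch transformers datasets peft trl bitsandbytes accelerate huggingface_hub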

Authenticate with Hugging Face:

huggingface-cli login

For the Llama family of models, you will need to use this form to request access permissions. Approval time varies but generally takes <5 minutes.

5) Minimal QLoRA training script

Create train_llama_qLoRA.py:

import logging, os, time, torch
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    TrainingArguments,          # prefer explicit TrainingArguments for clarity
    logging as hf_logging,
    TrainerCallback, TrainerState, TrainerControl
)
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from trl import SFTTrainer

# ------------------------------------------------------------------
# 1 · LOGGING SET‑UP
# ------------------------------------------------------------------
os.makedirs("logs", exist_ok=True)
logging.basicConfig(
    format="%(asctime)s — %(levelname)s — %(name)s — %(message)s",
    handlers=[
        logging.StreamHandler(),                 # stdout
        logging.FileHandler("logs/train.log")    # file
    ],
    level=logging.INFO
)
log = logging.getLogger("qlora")

# give Hugging Face/transformers the same verbosity
hf_logging.set_verbosity_info()

# ------------------------------------------------------------------
# 2 · DATA + MODEL
# ------------------------------------------------------------------
model_name = "meta-llama/Llama-4-Scout-17B-16E-Instruct"   # or "meta-llama/Meta-Llama-3-8B"
log.info(f"Loading dataset and tokenizer for {model_name}")

dataset = load_dataset("Abirate/english_quotes", split="train[:2%]")
tok = AutoTokenizer.from_pretrained(model_name)
tok.pad_token = tok.eos_token

log.info("Loading model in 4‑bit … this can take a minute")
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    load_in_4bit=True,
    device_map="auto"
)

peft_cfg = LoraConfig(
    r=64, lora_alpha=16,
    target_modules=["q_proj","k_proj","v_proj","o_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)
model = get_peft_model(model, peft_cfg)
log.info("LoRA adapters added")

# ------------------------------------------------------------------
# 3 · OPTIONAL GPU‑MEM CALLBACK
# ------------------------------------------------------------------
class GPUStatsCallback(TrainerCallback):
    def on_log(self, args:TrainingArguments, state:TrainerState,
               control:TrainerControl, **kwargs):
        if torch.cuda.is_available():
            alloc = torch.cuda.memory_allocated() / 1024**3
            reserv = torch.cuda.memory_reserved() / 1024**3
            logging.info(f"GPU mem  alloc={alloc:.1f} GiB, reserved={reserv:.1f} GiB")

# ------------------------------------------------------------------
# 4 · TRAINER
# ------------------------------------------------------------------
training_args = TrainingArguments(
    output_dir="outputs",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,
    num_train_epochs=1,
    learning_rate=2e-4,
    fp16=True,
    logging_steps=10,
    logging_dir="logs",
    report_to="none"        # disable wandb / tensorboard auto‑uploads
)

trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    max_seq_length=2048,
    tokenizer=tok,
    args=training_args,
    callbacks=[GPUStatsCallback()]
)

log.info("Starting training")
trainer.train()
trainer.save_model("outputs")   # writes the LoRA adapter weights to outputs/

Run:

python train_llama_qLoRA.py

6) Expected runtime & VRAM

Model | Steps (≈ 1 epoch on 2 % of dataset) | Time on A100 80 GB | Peak VRAM
Llama 3‑8B (4‑bit) | ~1500 | ~2 h | 42 GB
Llama 4 Scout 17B (4‑bit) | ~1500 | ~2 h | ~79 GB

Need Maverick or larger? Spin up 2 – 4 × A100s from the same dialog and let torchrun --nproc_per_node {N} handle model parallelism. Scout with 4-bit quantization will still run happily on a single card.

7) Track spend & shut down

You can track spend and manage instances through the console.

8) Next steps

  • Swap in your own dataset (see the sketch after this list).

  • Increase num_train_epochs until loss plateaus.

  • If VRAM allows, switch to load_in_8bit=True for 8‑bit precision.
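A minimal sketch of the first bullet, assuming your data lives in a local my_data.jsonl with a "text" column (both the filename and the column name are placeholders):

from datasets import load_dataset

# Hypothetical local file; any format load_dataset supports works the same way
dataset = load_dataset("json", data_files="my_data.jsonl", split="train")

# If the text lives under a different column, rename it
# (or pass dataset_text_field to SFTTrainer, depending on your trl version)
# dataset = dataset.rename_column("quote", "text")

Pass this dataset to SFTTrainer exactly as in the script above.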

FAQ

Why QLoRA? It trains low‑rank adapters while keeping the base weights in 4‑bit, letting 8B–70B models fit on a single A100.
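As a rough sanity check on that claim: 70 B parameters stored in 4‑bit occupy about 70 B × 0.5 bytes ≈ 35 GB, so even the largest Llama 3 base weights fit on an 80 GB A100 with headroom left for LoRA adapters, activations, and optimizer state.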

What about Llama 4 Maverick? The 128‑expert variant needs ~300 GB of VRAM (4 × A100 80 GB in INT4).

Commands and script tested on a fresh Thunder Compute A100 80 GB instance (Ubuntu 22.04).

Carl Peterson
