A pretrained LLM is a generalist that has read much of the internet, but out of the box it doesn't speak your specific domain, follow your tone, or perform your exact task. Fine-tuning takes the pretrained model and continues training it briefly on your task-specific data. The challenge: a 7B-parameter model has about 28GB of weights in FP32, so full fine-tuning costs serious money. Modern parameter-efficient methods (LoRA, prompt tuning) let you adapt huge models while training only a tiny fraction of the original parameter count, as little as 0.05% or less.
Full Fine-Tuning (the 2018-2020 default)
Procedure:
1. Load pretrained weights into a fresh trainer.
2. Add (or replace) a task-specific head (a linear layer for classification,
or simply keep the language-model head for generation).
3. Train ALL weights end-to-end on your task data.
Use a small learning rate (1e-5 to 5e-5) — much smaller than pretraining
because you don't want to destroy what's already learned.
4. Run for 1-5 epochs — overfitting is real with small datasets.
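The learning-rate and epoch guidance in the steps above can be captured in a small configuration sketch. The `FullFineTuneConfig` class and its fields are hypothetical names chosen for illustration, not any library's API. The defaults (2e-5, 3 epochs, batch size 16) are assumptions picked from inside the suggested ranges.

```python
import math
from dataclasses import dataclass


@dataclass
class FullFineTuneConfig:
    """Hypothetical settings container for a full fine-tuning run."""
    learning_rate: float = 2e-5   # assumed default, inside the 1e-5 to 5e-5 range
    num_epochs: int = 3           # assumed default, inside the 1 to 5 range
    batch_size: int = 16          # assumed; the text does not specify one

    def warnings(self):
        """Return messages for settings outside the ranges suggested in the text."""
        msgs = []
        if not 1e-5 <= self.learning_rate <= 5e-5:
            msgs.append("learning rate is outside the suggested 1e-5 to 5e-5 range")
        if not 1 <= self.num_epochs <= 5:
            msgs.append("epoch count is outside the suggested 1 to 5 range")
        return msgs

    def total_steps(self, num_examples):
        """Number of optimiser steps for the whole run."""
        return math.ceil(num_examples / self.batch_size) * self.num_epochs


config = FullFineTuneConfig()
print(config.warnings())            # empty list for the defaults
print(config.total_steps(10_000))   # 625 steps per epoch times 3 epochs

risky = FullFineTuneConfig(learning_rate=1e-3, num_epochs=20)
print(risky.warnings())             # both settings are flagged
```

The step count makes the "briefly" in fine-tuning concrete: under these assumed settings, 10,000 examples amount to fewer than 2,000 optimiser steps.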
Cost analysis (LLaMA-7B example):
• 7B params × 4 bytes (FP32) = 28GB just for weights
• + optimiser states (Adam) ≈ 56GB
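• + gradients ≈ 28GB (one FP32 value per weight), plus activations that grow with batch size and sequence length
→ Over 110GB in plain FP32, more than a single A100 80GB holds; expect 2-8 GPUs in parallel, or mixed precision and sharding to shrink the footprint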
→ Hours to days of training time
→ Each task = one full fine-tuned model copy = 28GB of weights to store and serve
Updates every weight · maximum capacity · prohibitive cost for many users
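The cost analysis can be reproduced with a few lines of arithmetic. This is a rough sketch: it assumes FP32 everywhere, Adam keeping two extra values per weight, and 1GB = 1e9 bytes. It leaves activations out because they depend on batch size and sequence length. The function name is hypothetical.

```python
def full_finetune_memory_gb(n_params, bytes_per_param=4):
    """Rough memory estimate (in GB) for full fine-tuning with Adam."""
    weights = n_params * bytes_per_param
    gradients = n_params * bytes_per_param         # one gradient per weight
    adam_states = 2 * n_params * bytes_per_param   # first and second moment estimates
    total = weights + gradients + adam_states
    return {
        "weights": weights / 1e9,
        "gradients": gradients / 1e9,
        "adam_states": adam_states / 1e9,
        "total_without_activations": total / 1e9,
    }


estimate = full_finetune_memory_gb(7e9)
for name, gb in estimate.items():
    print(f"{name}: {gb:.0f} GB")
```

For a 7B model this gives 28GB of weights, 28GB of gradients, and 56GB of Adam states, so the total before activations already passes 110GB.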
LoRA — Low-Rank Adaptation (2021)
Insight: when fine-tuning, the WEIGHT CHANGES (ΔW) tend to be low-rank.
Decompose them as a product of two small matrices:
ΔW = B · A where A: (r × d), B: (d × r), r ≪ d
For a typical attention weight matrix W (d × d) in a 7B model:
full ΔW: d × d = 4096 × 4096 = 16.7M params
LoRA ΔW: with r=16 → 4096×16 + 16×4096 = 131K params (128× smaller!)
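The parameter counts above are easy to check. A minimal sketch, with a hypothetical helper name, that counts the entries of a full update against the two low-rank factors:

```python
def lora_param_counts(d_in, d_out, r):
    """Return (full update size, LoRA update size) for one weight matrix."""
    full = d_out * d_in
    lora = r * d_in + d_out * r   # A is (r x d_in), B is (d_out x r)
    return full, lora


full, lora = lora_param_counts(4096, 4096, 16)
print(f"full: {full:,}  lora: {lora:,}  ratio: {full / lora:.0f}x")
```

Doubling the rank doubles the LoRA parameter count, so r is the main knob for trading adapter size against capacity.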
During fine-tuning:
• W (pretrained) stays FROZEN
• Only A and B are trained
• Effective forward pass: y = (W + B·A) · x
Results are often within 1-2% of full fine-tuning while training well under 1% of the parameters.
LoRA adapters are ~30MB instead of ~28GB, so they are easy to store, swap, and stack.
Train tiny rank-r matrices · keep base model frozen · near-full performance with ~1% of the trainable parameters
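The forward pass y = (W + B·A) · x can be sketched in plain Python on toy-sized matrices. A real implementation would use a tensor library; this version only shows the structure. Initialising B to zeros, so the adapter starts as a no-op, is a common convention and an assumption here, not something stated in the text. All helper names are hypothetical.

```python
import random


def matvec(M, x):
    """Multiply matrix M (list of rows) by vector x."""
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]


def matmul(X, Y):
    """Multiply matrix X by matrix Y."""
    cols = list(zip(*Y))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in X]


def lora_forward(W, A, B, x):
    """y = W x + B (A x): the frozen path plus the low-rank update."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    return [b + u for b, u in zip(base, update)]


def merge(W, A, B):
    """Fold the adapter into the base weights: W + B A."""
    BA = matmul(B, A)
    return [[w + d for w, d in zip(w_row, d_row)] for w_row, d_row in zip(W, BA)]


random.seed(0)
d, r = 6, 2                                                        # toy sizes
W = [[random.gauss(0, 1) for _ in range(d)] for _ in range(d)]     # frozen
A = [[random.gauss(0, 0.01) for _ in range(d)] for _ in range(r)]  # trainable, (r x d)
B = [[0.0] * r for _ in range(d)]                                  # trainable, (d x r), zeros
x = [random.gauss(0, 1) for _ in range(d)]

# With B = 0 the adapted model reproduces the pretrained output.
y_start = lora_forward(W, A, B, x)
y_base = matvec(W, x)

# Stand-in for a trained B: random values instead of real gradient updates.
B_trained = [[random.gauss(0, 0.1) for _ in range(r)] for _ in range(d)]
y_adapter = lora_forward(W, A, B_trained, x)
y_merged = matvec(merge(W, A, B_trained), x)
max_diff = max(abs(a - b) for a, b in zip(y_adapter, y_merged))
print(max_diff < 1e-9)
```

The last check illustrates why adapters are cheap to serve: B·A can be added into W once, after which inference runs on a single merged matrix.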
Prompt Tuning / Prefix Tuning
Even more aggressive: don't change ANY weights. Just learn a few extra 'soft prompt' vectors prepended to the input.
Normal input: [tok1, tok2, tok3, ...]
Prompt-tuned input: [P1, P2, P3, ..., Pk, tok1, tok2, ...]
where P1...Pk are LEARNED VECTORS (not real words),
trained via gradient descent on task data.
Typical k = 10-100 soft prompts × 4096 dims ≈ 41K-410K trainable params.
That's at most about 0.006% of a 7B model.
Works surprisingly well for many tasks at huge model scale (>10B params).
Weaker for smaller models, which need more parameter capacity to adapt.
No weight updates at all · a few soft tokens steer the entire frozen model
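A minimal sketch of how the prompt-tuned input is assembled, using toy sizes and plain Python lists in place of tensors. The names `embedding_table`, `soft_prompt`, and `build_input` are hypothetical, and the random values stand in for real pretrained embeddings and learned prompt vectors.

```python
import random

random.seed(0)
d_model, k = 8, 3   # toy sizes; the text uses 4096 dims and k = 10-100

# Frozen: the pretrained embedding lookup (random stand-in values here).
vocab = ["tok1", "tok2", "tok3"]
embedding_table = {tok: [random.gauss(0, 1) for _ in range(d_model)] for tok in vocab}

# Trainable: k soft-prompt vectors that correspond to no real token.
soft_prompt = [[random.gauss(0, 0.02) for _ in range(d_model)] for _ in range(k)]


def build_input(tokens):
    """Prepend the learned vectors to the embedded tokens."""
    token_vectors = [embedding_table[t] for t in tokens]
    return soft_prompt + token_vectors


model_input = build_input(["tok1", "tok2", "tok3"])
print(len(model_input), len(model_input[0]))   # k + 3 positions, d_model dims each


def soft_prompt_param_count(k, d_model):
    """Number of trainable values in the soft prompt."""
    return k * d_model


for k_real in (10, 100):
    n = soft_prompt_param_count(k_real, 4096)
    print(f"k={k_real}: {n:,} trainable params, {100 * n / 7e9:.4f}% of 7B")
```

During training, gradients would flow only into `soft_prompt`; the embedding table and every model weight stay fixed.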
Comparison Table
- Full FT: 100% of params trainable, ~28GB ΔW per task, highest quality, highest cost. Use when: small to medium model, lots of compute, task very different from pretraining.
- LoRA: roughly 0.1-1% trainable depending on rank and which layers are adapted, ~30MB per task at low rank, near-full quality, low cost. Default modern choice. Stack multiple LoRA adapters for multi-task models.
- QLoRA (LoRA + 4-bit base): same as LoRA but the base model is quantised to 4-bit weights. Fits a 65B model on a single 48GB GPU, or roughly a 33B model on a 24GB GPU. Small accuracy hit.
- Prompt tuning: 0.005-0.05% trainable, ~few MB, decent quality at 10B+ scale. Use when: many tasks to support, extreme storage constraints.
- Frozen + classifier head: well under 0.1% trainable, smallest setup, lowest quality but cheapest. Use when: simple classification on a representation model.
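The trainable fractions in the comparison depend heavily on configuration. The sketch below works them out for a nominal 7B model under stated assumptions: hidden size 4096, 32 layers, square 4096 × 4096 attention matrices, and FP32 storage. The two LoRA settings and the helper names are illustrative choices, not recommendations from the text.

```python
TOTAL_PARAMS = 7e9   # nominal model size from the text
D_MODEL = 4096       # hidden size used in the LoRA example
N_LAYERS = 32        # assumed layer count for a 7B LLaMA-style model


def lora_params(rank, matrices_per_layer):
    """Trainable LoRA params when adapting square d x d matrices in every layer."""
    per_matrix = 2 * D_MODEL * rank   # A is (r x d), B is (d x r)
    return per_matrix * matrices_per_layer * N_LAYERS


def report(name, n_params, bytes_per_param=4):
    """Format one comparison row: count, trainable share, FP32 storage."""
    pct = 100 * n_params / TOTAL_PARAMS
    size_mb = n_params * bytes_per_param / 1e6
    return f"{name}: {n_params:,.0f} params, {pct:.4f}% trainable, {size_mb:,.1f} MB"


rows = [
    ("Full FT", TOTAL_PARAMS),
    ("LoRA r=16, 2 matrices/layer", lora_params(16, 2)),
    ("LoRA r=64, 4 matrices/layer", lora_params(64, 4)),
    ("Prompt tuning, k=100", 100 * D_MODEL),
]
for name, n in rows:
    print(report(name, n))
```

Under these assumptions the small LoRA setting lands near 0.1% of the model and about 30MB, while the larger one approaches 1%, which is why quoted LoRA percentages vary so much between write-ups.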