Training a deep network from scratch on a small dataset is like starting school without any prior knowledge — you need thousands of examples just to learn basic concepts. Transfer learning inverts this: start with a model already trained on millions of images, then adapt it to your specific task. The network has already learned to detect edges, textures, shapes, and objects — you just teach it what those features mean for your problem.
Why Transfer Learning Works
Deep CNNs learn hierarchical features. Early layers learn universal low-level patterns — horizontal edges, vertical edges, colour blobs, and simple textures — that are useful for almost any image task. Middle layers combine these into shapes and object parts. Only the deepest layers are task-specific. Since the low-level features generalise, they can be reused without retraining.
Three Strategies
- Feature Extraction: freeze all backbone layers, replace only the final classification head, train the head on your data. Best when your dataset is small (< 1,000 samples) and your domain is similar to the pre-training domain (e.g., natural images → product photos).
- Fine-tuning: freeze early layers (edges/textures), unfreeze later layers + head, train both on your data with a small learning rate. Best when your dataset is moderate (1,000–100,000 samples) or your domain is somewhat different.
- Full Fine-tuning: unfreeze and train all layers. Best when you have a large dataset (> 100,000 samples) and sufficient compute. Use a very small learning rate for early layers to avoid destroying the pre-trained representations.
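In PyTorch, the three strategies differ only in which parameters have requires_grad set. The sketch below is a minimal illustration, assuming a tiny hypothetical stand-in for a pre-trained backbone (three convolutional blocks) so that it runs without downloading weights. The names backbone, head, set_trainable, apply_strategy and count_trainable are illustrative and not part of any library.

```python
import torch
from torch import nn

# Hypothetical stand-in for a pre-trained backbone: early, middle and late blocks.
backbone = nn.Sequential(
    nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU()),    # early: edges/textures
    nn.Sequential(nn.Conv2d(8, 16, 3, padding=1), nn.ReLU()),   # middle: shapes/parts
    nn.Sequential(nn.Conv2d(16, 32, 3, padding=1), nn.ReLU()),  # late: task-specific
)
# New classification head for a hypothetical 5-class task.
head = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, 5))
model = nn.Sequential(backbone, head)


def set_trainable(module: nn.Module, flag: bool) -> None:
    """Enable or disable gradient updates for every parameter in a module."""
    for p in module.parameters():
        p.requires_grad = flag


def apply_strategy(strategy: str) -> None:
    """Configure which parts of the model are trainable."""
    if strategy == "feature_extraction":
        set_trainable(backbone, False)       # freeze the whole backbone
    elif strategy == "fine_tuning":
        set_trainable(backbone, False)       # freeze everything first...
        set_trainable(backbone[-1], True)    # ...then unfreeze the last block
    elif strategy == "full_fine_tuning":
        set_trainable(backbone, True)        # train all layers
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    set_trainable(head, True)                # the head is always trained


def count_trainable(module: nn.Module) -> int:
    """Number of scalar parameters that will receive gradient updates."""
    return sum(p.numel() for p in module.parameters() if p.requires_grad)


for name in ("feature_extraction", "fine_tuning", "full_fine_tuning"):
    apply_strategy(name)
    print(name, count_trainable(model))
```

With a real backbone that contains BatchNorm layers, setting requires_grad to False does not stop the running statistics from updating in training mode, so frozen blocks are often also switched to eval mode.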
Rule of Thumb: Dataset Size × Domain Similarity
               │ Similar domain   │ Different domain
───────────────────────────────────────────────────────────────
Small dataset  │ Feature extract  │ Fine-tune top layers
Large dataset  │ Fine-tune all    │ Fine-tune or from scratch
───────────────────────────────────────────────────────────────
Small + similar → safest case: freeze the backbone and train only the head.
Small + different → unfreeze more layers, but keep the lr small so pre-trained features are not forgotten.
Large + similar → fine-tune everything, with a 10× smaller lr for early layers.
Large + different → consider training from scratch, or very aggressive fine-tuning.
The further your domain is from the pre-training domain, the more layers need to adapt.
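The rule of thumb can be written down as a simple lookup. This is only a sketch: the function name choose_strategy and its two boolean inputs are hypothetical, and deciding what counts as "small" or "similar" is still a judgment call for your own data.

```python
def choose_strategy(small_dataset: bool, similar_domain: bool) -> str:
    """Return the rule-of-thumb strategy for a dataset size / domain combination."""
    table = {
        (True, True): "feature extraction: freeze backbone, train only the head",
        (True, False): "fine-tune top layers with a small lr",
        (False, True): "fine-tune all layers, smaller lr for early layers",
        (False, False): "fine-tune aggressively or train from scratch",
    }
    return table[(small_dataset, similar_domain)]


# Example: a few hundred product photos, starting from an ImageNet model.
print(choose_strategy(small_dataset=True, similar_domain=True))
```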
Learning Rate for Fine-tuning
Typical fine-tuning learning rate schedule:
    Frozen layers:      lr = 0     (no update)
    Unfrozen backbone:  lr = 1e-5  (very small, to preserve features)
    New head:           lr = 1e-3  (normal, since it learns from scratch)
PyTorch: pass a separate param group, with its own lr, for each layer group:
    from torch.optim import Adam

    # backbone and head are the nn.Module objects for the two layer groups
    optimizer = Adam([
        {'params': backbone.parameters(), 'lr': 1e-5},  # pre-trained layers
        {'params': head.parameters(), 'lr': 1e-3},      # newly initialised head
    ])
Using the same lr for all layers is a common fine-tuning mistake: pre-trained representations can be overwritten in the first few batches.
Differential learning rates are essential: in this schedule the backbone lr is 100× smaller than the head lr.
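The schedule above also includes frozen layers, which are best left out of the optimizer entirely. The following is a minimal runnable sketch of that combination, assuming a small hypothetical backbone and head built from linear layers. The helper trainable is illustrative, not a PyTorch function.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Hypothetical stand-ins for a pre-trained backbone and a new head.
backbone = nn.Sequential(nn.Linear(16, 16), nn.ReLU(), nn.Linear(16, 16), nn.ReLU())
head = nn.Linear(16, 3)

# Freeze the earliest backbone layer (equivalent to lr = 0 for it).
for p in backbone[0].parameters():
    p.requires_grad = False


def trainable(module: nn.Module) -> list:
    """Collect only the parameters that should receive updates."""
    return [p for p in module.parameters() if p.requires_grad]


optimizer = torch.optim.Adam([
    {"params": trainable(backbone), "lr": 1e-5},  # unfrozen backbone: small steps
    {"params": trainable(head), "lr": 1e-3},      # new head: normal steps
])

# One training step on random data, to show which weights move.
x = torch.randn(4, 16)
y = torch.randint(0, 3, (4,))
frozen_before = backbone[0].weight.clone()
head_before = head.weight.clone()

loss = nn.functional.cross_entropy(head(backbone(x)), y)
optimizer.zero_grad()
loss.backward()
optimizer.step()

print("frozen layer changed:", not torch.equal(backbone[0].weight, frozen_before))
print("head changed:", not torch.equal(head.weight, head_before))
```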
Common Pre-trained Models
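- Image classification backbones: ResNet-50/101 (the ResNet family won ILSVRC 2015), EfficientNet-B0 to B7, ViT (Vision Transformer). All are available in torchvision.models; load pre-trained weights with the weights argument (e.g. weights=ResNet50_Weights.DEFAULT), which replaces the deprecated pretrained=True flag.
- Object detection: Faster R-CNN, YOLO, DETR. These typically start from ImageNet-pretrained backbones.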
- Medical imaging: Often fine-tuned from ImageNet despite domain difference — the low-level feature representations still transfer even to X-rays and histology slides.
- NLP transfer: BERT (Google, 2018) pre-trained on 3.3B words. GPT-2/3/4 pre-trained on web text. Fine-tune with 1,000–10,000 labelled examples for classification, NER, or generation tasks.
- Foundation models (2021+): CLIP, DINO, SAM. Pre-trained on very large datasets with contrastive (CLIP), self-supervised (DINO), or promptable segmentation (SAM) objectives. Remarkably general: they often perform well zero-shot or with a frozen backbone, without any fine-tuning.