A convolutional layer produces feature maps that are spatially large: with 'same' padding and stride 1, a single conv layer on a 224×224 image still produces a 224×224 output. Pooling layers reduce the spatial dimensions by summarising each region with a single value. Stack convolution, activation, and pooling into a pipeline, put a classifier on top, and you have the full CNN: a hierarchical feature extractor that goes from raw pixels to class labels.
Max Pooling
For a p×p pooling window with stride s:
Output[r, c] = max{ X[r·s + i, c·s + j] for i,j ∈ [0, p) }
Output size formula:
H_out = ⌊(H − p) / s⌋ + 1
Most common: 2×2 max pool, stride 2:
H_out = ⌊(H−2)/2⌋+1 = ⌊H/2⌋ → halves spatial dimensions (for odd H the last row and column are dropped)
A 28×28 feature map becomes 14×14
A 56×56 feature map becomes 28×28
No parameters: max pool has nothing to learn.
Max pooling keeps the strongest activation in each region: 'was this pattern present anywhere here?'
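The window formula above can be sketched in a few lines of plain Python. This is a minimal illustration on a single-channel map stored as nested lists, not how a deep learning framework implements pooling; the helper names `pool_output_size` and `max_pool2d` are made up for this example.

```python
def pool_output_size(size, window, stride):
    """Output length along one axis: floor((size - window) / stride) + 1."""
    return (size - window) // stride + 1


def max_pool2d(x, window=2, stride=2):
    """Max-pool a 2D feature map given as a list of equal-length rows."""
    h_out = pool_output_size(len(x), window, stride)
    w_out = pool_output_size(len(x[0]), window, stride)
    return [
        [
            # Strongest activation inside the window anchored at (r*stride, c*stride).
            max(
                x[r * stride + i][c * stride + j]
                for i in range(window)
                for j in range(window)
            )
            for c in range(w_out)
        ]
        for r in range(h_out)
    ]


feature_map = [
    [1, 3, 2, 0],
    [4, 6, 1, 1],
    [0, 2, 9, 5],
    [1, 1, 3, 7],
]
pooled = max_pool2d(feature_map)
print(pooled)                      # [[6, 2], [2, 9]]
print(pool_output_size(28, 2, 2))  # 14
```

The function has no weights at all, which is the "nothing to learn" point: the output depends only on the input values and the window geometry.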
Average Pooling
Output[r, c] = (1/p²) × Σᵢ Σⱼ X[r·s + i, c·s + j]
Average pooling takes the mean of the window instead of the max.
Global Average Pooling (GAP) — a special case used in modern networks:
Takes the average of the ENTIRE feature map (one value per channel)
A 7×7×512 feature map → 1×1×512 → flatten to 512-vector
Modern ResNets, EfficientNets, and MobileNets all use GAP
before the final classifier instead of large FC layers.
With a 7×7×512 map and 1,000 classes: flatten + FC needs 7×7×512×1000 ≈ 25M weights, while GAP + FC needs 512×1000 ≈ 0.5M, typically at comparable accuracy.
GAP is especially resistant to overfitting: it adds no extra parameters that could memorise training data.
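To make the saving concrete, here is a small sketch of global average pooling on an H×W×C map stored as nested lists, followed by a parameter count for the classifier that sits on top. The 1,000-class output and the name `global_average_pool` are assumptions chosen for illustration.

```python
def global_average_pool(feature_map):
    """Average each channel of an H x W x C map (nested lists) down to one value."""
    height, width = len(feature_map), len(feature_map[0])
    channels = len(feature_map[0][0])
    return [
        sum(feature_map[r][c][ch] for r in range(height) for c in range(width))
        / (height * width)
        for ch in range(channels)
    ]


# Toy 2x2 map with 3 channels.
toy = [[[0, 1, 2], [1, 2, 3]], [[2, 3, 4], [3, 4, 5]]]
print(global_average_pool(toy))  # [1.5, 2.5, 3.5]

# Classifier size on a 7x7x512 map, assuming a 1000-class output layer.
H, W, C, NUM_CLASSES = 7, 7, 512, 1000
flatten_fc_params = H * W * C * NUM_CLASSES + NUM_CLASSES  # flatten, then FC
gap_fc_params = C * NUM_CLASSES + NUM_CLASSES              # GAP, then FC
print(flatten_fc_params)  # 25089000
print(gap_fc_params)      # 513000
```

The pooling step itself contributes zero parameters; all of the difference comes from the FC layer seeing 512 inputs instead of 25,088.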
Full CNN Architecture: the Pipeline
Typical CNN pipeline stage:
[Conv k×k, C filters] → [BatchNorm] → [ReLU] → [MaxPool 2×2]
Full example (LeNet-5, 1998):
Input 32×32×1
→ Conv 5×5, 6 filters → 28×28×6 (no padding)
→ Tanh → 28×28×6
→ MaxPool 2×2 → 14×14×6 ← spatial ÷2
→ Conv 5×5, 16 filters → 10×10×16
→ Tanh → 10×10×16
→ MaxPool 2×2 → 5×5×16 ← spatial ÷2
→ Flatten → 400
→ FC 120 → FC 84 → Softmax 10
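Total parameters: ~60,000, a complete digit classifier in 60K numbers.
LeNet-5 introduced the Conv→Pool→Conv→Pool→FC pattern still used today. The original network used average-pooling subsampling layers; max pooling is shown here, as in most modern reimplementations.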
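The shapes and the parameter total in the listing can be checked with a short script. This is a simplified sketch: it assumes every filter in the second conv layer sees all 6 input channels and that pooling has no trainable parameters, so the count differs slightly from the 1998 network. The helper names are hypothetical.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Spatial output size of a conv or pooling window along one axis."""
    return (size + 2 * padding - kernel) // stride + 1


def conv_params(c_in, kernel, c_out):
    """Weights plus one bias per output channel."""
    return c_in * kernel * kernel * c_out + c_out


def fc_params(n_in, n_out):
    """Weights plus one bias per output unit."""
    return n_in * n_out + n_out


size, channels = 32, 1
total_params = 0
shapes = []
for kernel, filters in [(5, 6), (5, 16)]:
    total_params += conv_params(channels, kernel, filters)
    size = conv_out(size, kernel)        # 5x5 conv, no padding
    size = conv_out(size, 2, stride=2)   # 2x2 pool, stride 2
    channels = filters
    shapes.append((size, channels))

flat_len = size * size * channels
n_in = flat_len
for n_out in (120, 84, 10):
    total_params += fc_params(n_in, n_out)
    n_in = n_out

print(shapes)        # [(14, 6), (5, 16)]
print(flat_len)      # 400
print(total_params)  # 61706
```

Most of the total (48,120 of 61,706) sits in the first FC layer, which is the usual picture for networks that flatten before classifying.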
How Spatial Dimensions Change
- Each Conv layer (no padding, stride 1) shrinks the spatial size by (k−1): a 3×3 conv on 28×28 → 26×26.
- Each Conv layer (same padding) preserves spatial size — most modern CNNs use padding='same'.
- Each MaxPool 2×2 (stride=2) halves spatial dimensions — 224→112→56→28→14→7.
- Depth (channel count) grows as we go deeper: 3 → 64 → 128 → 256 → 512 is a typical pattern.
- The final spatial size × channel count gives the length of the flattened vector fed to the classifier.
- Smaller spatial + more channels = more abstract, richer representations for classification.
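The rules in this list can be traced numerically. The sketch below assumes a VGG-style stack in which each stage is one 3×3 'same'-padded conv followed by a 2×2 max pool; the channel schedule (with 512 repeated for the last stage) and the helper name `out_size` are illustrative assumptions, not a specific published architecture.

```python
def out_size(size, kernel, stride=1, padding=0):
    """Spatial output size of a conv or pooling window along one axis."""
    return (size + 2 * padding - kernel) // stride + 1


size, channels = 224, 3
trace = [(size, channels)]
for filters in [64, 128, 256, 512, 512]:       # assumed channel schedule
    size = out_size(size, kernel=3, padding=1)  # 3x3 conv, 'same' padding: size unchanged
    size = out_size(size, kernel=2, stride=2)   # 2x2 max pool, stride 2: size halved
    channels = filters
    trace.append((size, channels))

flat_len = size * size * channels
print(trace)
# [(224, 3), (112, 64), (56, 128), (28, 256), (14, 512), (7, 512)]
print(flat_len)         # 25088
print(out_size(28, 3))  # 26, the unpadded 3x3 case
```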
Parameter Count: Conv vs FC
Conv layer: C_in × k × k × C_out + C_out biases
Example: 64 → 3×3 → 128:
64 × 3 × 3 × 128 + 128 = 73,856 params
FC layer with same input (after 7×7×64 feature map):
7 × 7 × 64 × 128 + 128 = 401,536 params, about 5.4× more, and the gap widens as the feature map gets larger.
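A quick way to see where these numbers come from, and how they scale, is to compute both counts for several feature-map sizes. This is a sketch with hypothetical helper names; the list of spatial sizes is only an illustration.

```python
def conv_params(c_in, kernel, c_out):
    """Conv layer: one k x k x c_in kernel per output channel, plus biases."""
    return c_in * kernel * kernel * c_out + c_out


def fc_weights(height, width, c_in, n_out):
    """FC layer on the flattened map: one weight per (input value, output unit)."""
    return height * width * c_in * n_out


conv = conv_params(64, 3, 128)
print(conv)  # 73856

fc_by_size = {s: fc_weights(s, s, 64, 128) for s in (7, 14, 28, 224)}
for s, weights in fc_by_size.items():
    # The conv count never changes with s; the FC count grows with s * s.
    print(s, weights, round(weights / conv, 1))
```

The conv layer's count is independent of the spatial size of its input, while the FC layer's count grows with height × width.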
Why convolutions win:
Same 3×3 kernel scans the entire 224×224 image.
Weight sharing: a dense layer from the flattened 224×224×64 input to just 128 units would need 224 × 224 × 64 × 128 ≈ 411M weights, roughly 5,600× the 73,856 of the 3×3 conv.
Fewer parameters = less overfitting, smaller model, faster inference.
Conv replaces spatial FC connections: each kernel is one tiny shared FC over a 3×3 patch.
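The "tiny shared FC" view can be written out directly: flatten each 3×3 patch, take its dot product with the same 9 weights, and repeat at every position. The sketch below is a single-channel, no-padding, stride-1 illustration in plain Python (the operation deep learning libraries call convolution, with no kernel flip); the function name and the example kernel are made up for this example.

```python
def conv2d_valid(image, kernel):
    """Slide one k x k kernel over a 2D image (no padding, stride 1)."""
    k = len(kernel)
    h_out = len(image) - k + 1
    w_out = len(image[0]) - k + 1
    flat_kernel = [w for row in kernel for w in row]  # the shared weights
    out = []
    for r in range(h_out):
        out_row = []
        for c in range(w_out):
            # Flatten the patch and apply the same tiny dense unit to it.
            patch = [image[r + i][c + j] for i in range(k) for j in range(k)]
            out_row.append(sum(w * x for w, x in zip(flat_kernel, patch)))
        out.append(out_row)
    return out


image = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
    [13, 14, 15, 16],
]
kernel = [
    [1, 0, -1],
    [1, 0, -1],
    [1, 0, -1],
]
result = conv2d_valid(image, kernel)
print(result)  # [[-6, -6], [-6, -6]]

shared_weights = 9                 # one 3x3 kernel reused at all 4 positions
dense_weights = (4 * 4) * (2 * 2)  # every input pixel wired to every output
print(shared_weights, dense_weights)  # 9 64
```

Even on this 4×4 toy input the dense version needs about 7× the weights, and the ratio grows with image size because the kernel stays at 9 weights.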