
Learn CNN — Convolution


Last updated: 2026-05-13.

A convolutional layer slides a small filter (kernel) across the input, computing a dot product between the kernel and the patch beneath it at each position. The result is a feature map whose values are large wherever the image locally matches the pattern encoded by that kernel. Use several kernels per layer and stack several layers, and the network detects edges, then textures, then shapes, with each layer building more abstract features from the previous layer's output.

The Convolution Operation

For an input X (H×W) and kernel K (k×k), with stride s and indices i, j running from 0 to k−1:

  Output[r, c] = Σᵢ Σⱼ K[i,j] · X[r·s + i, c·s + j]

Output size formula (no padding):
  H_out = ⌊(H − k) / s⌋ + 1
  W_out = ⌊(W − k) / s⌋ + 1
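
The sum and the output-size formula can be sketched in a few lines of plain Python. This is a minimal illustration, not an efficient implementation: the function name `conv2d` and the sample values are assumptions made for the example, and the input is a single-channel list of lists. As in the formula, the kernel is not flipped, so strictly speaking this is cross-correlation, which is what deep learning libraries usually call convolution.

```python
def conv2d(x, kernel, stride=1):
    """Slide a square kernel over x with no padding and return the feature map."""
    h, w = len(x), len(x[0])
    k = len(kernel)  # assumes a square k x k kernel
    h_out = (h - k) // stride + 1
    w_out = (w - k) // stride + 1
    out = [[0] * w_out for _ in range(h_out)]
    for r in range(h_out):
        for c in range(w_out):
            total = 0
            # Dot product between the kernel and the patch it currently covers.
            for i in range(k):
                for j in range(k):
                    total += kernel[i][j] * x[r * stride + i][c * stride + j]
            out[r][c] = total
    return out


# Hypothetical sample data: a 3x3 input and a 2x2 kernel.
x = [[1, 2, 3],
     [4, 5, 6],
     [7, 8, 9]]
kernel = [[1, 2],
          [3, 4]]

print(conv2d(x, kernel))  # [[37, 47], [67, 77]]
```

The top-left output value is 1·1 + 2·2 + 3·4 + 4·5 = 37, which matches the formula with r = c = 0.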

Example: 7×7 input, 3×3 kernel, stride=1:
  H_out = ⌊(7−3)/1⌋+1 = 5 → 5×5 feature map

Example: 7×7 input, 3×3 kernel, stride=2:
  H_out = ⌊(7−3)/2⌋+1 = 3 → 3×3 feature map

Stride controls how far the kernel moves at each step. Stride=2 roughly halves each spatial dimension (7 → 3 in the example above).
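
The two examples can be checked with a small helper. This is a sketch only: `conv_output_size` is a name chosen for this example, and it applies the no-padding formula to one spatial dimension at a time.

```python
def conv_output_size(n, k, stride=1):
    """Output length along one dimension for input length n, kernel size k, no padding."""
    if k > n:
        raise ValueError("kernel is larger than the input")
    return (n - k) // stride + 1


print(conv_output_size(7, 3, stride=1))  # 5
print(conv_output_size(7, 3, stride=2))  # 3
print(conv_output_size(8, 3, stride=2))  # 3
```

The last call shows why the halving is only approximate: a 7-wide and an 8-wide input both produce a 3-wide output at stride=2, because the floor discards positions where the kernel would run past the border.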

What Kernels Learn to Detect

  • Early layers (raw pixels → edges): kernels learn simple patterns — horizontal edges, vertical edges, diagonal gradients, colour blobs.
  • Middle layers (edges → shapes): kernels combine edge activations into corners, circles, simple textures, local patterns.
  • Deep layers (shapes → objects): kernels build representations of eyes, wheels, fur — entire semantic concepts.
  • This hierarchical feature learning is not hand-coded — the kernels are learned from data by backpropagation.
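
To make the idea of an edge-detecting kernel concrete, the sketch below applies two hand-written 3×3 kernels to a tiny image that is dark on the left and bright on the right. The kernels and the image are assumptions chosen for illustration. In a trained CNN the weights are learned, and early-layer kernels often end up resembling filters like these.

```python
def correlate2d(x, kernel):
    """Stride-1, no-padding sliding dot product of a square kernel over x."""
    k = len(kernel)
    rows = len(x) - k + 1
    cols = len(x[0]) - k + 1
    return [
        [
            sum(kernel[i][j] * x[r + i][c + j] for i in range(k) for j in range(k))
            for c in range(cols)
        ]
        for r in range(rows)
    ]


# 4x6 image: left half dark (0), right half bright (1), so it has one vertical edge.
image = [[0, 0, 0, 1, 1, 1] for _ in range(4)]

# Hand-written kernels (illustrative, not learned).
vertical_edge = [[-1, 0, 1],
                 [-1, 0, 1],
                 [-1, 0, 1]]
horizontal_edge = [[-1, -1, -1],
                   [0, 0, 0],
                   [1, 1, 1]]

print(correlate2d(image, vertical_edge))    # [[0, 3, 3, 0], [0, 3, 3, 0]]
print(correlate2d(image, horizontal_edge))  # [[0, 0, 0, 0], [0, 0, 0, 0]]
```

The vertical-edge kernel responds only where its window straddles the dark-to-bright boundary, while the horizontal-edge kernel finds nothing because every row of the image is identical.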

Parameter Sharing — Why CNNs are Efficient

Fully connected layer (224×224 RGB image → 1000 neurons), weights only:
  Parameters = 224 × 224 × 3 × 1000 = 150,528,000

Convolutional layer (same image, 64 kernels of 3×3×3):
  Parameters = 3 × 3 × 3 × 64 = 1,728

The same kernel weights are used at every position → parameter sharing.
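Result: about 87,000× fewer weights (150,528,000 / 1,728 ≈ 87,111). The two layers are not interchangeable, since the convolutional layer outputs 64 feature maps rather than 1000 scalars, but the comparison shows the scale of the savings.
Fewer parameters generally means lower memory use, less risk of overfitting, and often faster training.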

Parameter sharing is the core advantage of convolution over fully connected layers for images.
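
The arithmetic can be reproduced with two small counting functions. This is a sketch under the same assumptions as the figures above: weights only, unless the optional `bias` flag is set. The function names are made up for this example.

```python
def dense_params(in_h, in_w, in_c, units, bias=False):
    """Weights in a fully connected layer that sees every input value."""
    return in_h * in_w * in_c * units + (units if bias else 0)


def conv_params(k, in_c, n_kernels, bias=False):
    """Weights in a conv layer: one k x k x in_c kernel per output channel."""
    return k * k * in_c * n_kernels + (n_kernels if bias else 0)


dense = dense_params(224, 224, 3, 1000)
conv = conv_params(3, 3, 64)

print(dense)          # 150528000
print(conv)           # 1728
print(dense // conv)  # 87111
```

Note that the convolutional count does not depend on the image size at all: the same 1,728 weights would be reused on a 32×32 or a 1024×1024 input.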

Padding

Without padding, the output shrinks at every layer (3×3 kernel, stride=1):
  7×7 → 5×5 → 3×3 → 1×1 (only 3 convolutions possible!)

With 'same' padding (padding = ⌊k/2⌋ for odd k, so 1 for k=3):
  Output size = Input size (for stride=1)
  7×7 → 7×7 → 7×7 → ...  (can go as deep as needed)

Zero padding adds zeros around the input border.
Many modern CNNs use padding='same' in most convolutional layers to preserve spatial dimensions.

Padding='same' keeps the spatial size constant at stride=1, which lets a network stack many convolutional layers.
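
With p zeros added on each side, the output-size formula generalises to ⌊(H + 2p − k) / s⌋ + 1. The sketch below shows zero padding on a small grid and checks both the shrinking and the size-preserving cases. The helper names `zero_pad` and `padded_output_size` are assumptions for this example.

```python
def zero_pad(x, p):
    """Return a copy of x with p rows and columns of zeros added on every side."""
    width = len(x[0]) + 2 * p
    padded = [[0] * width for _ in range(p)]
    for row in x:
        padded.append([0] * p + list(row) + [0] * p)
    padded.extend([0] * width for _ in range(p))
    return padded


def padded_output_size(n, k, stride=1, padding=0):
    """Output length along one dimension with `padding` zeros on each side."""
    return (n + 2 * padding - k) // stride + 1


print(zero_pad([[1, 2], [3, 4]], 1))
# [[0, 0, 0, 0], [0, 1, 2, 0], [0, 3, 4, 0], [0, 0, 0, 0]]

# No padding: a 7-wide input survives only three 3x3 convolutions.
sizes = [7]
while sizes[-1] >= 3:
    sizes.append(padded_output_size(sizes[-1], 3))
print(sizes)  # [7, 5, 3, 1]

# 'same' padding for k=3 is p = 3 // 2 = 1, so the size is unchanged.
print(padded_output_size(7, 3, stride=1, padding=3 // 2))  # 7
```

For k=3 and p=1 the formula gives (H + 2 − 3) + 1 = H, which is why the size is preserved for any H at stride=1.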

Practice questions

  1. A 9×9 input has a 3×3 kernel applied with stride=2 and no padding. What is the output size?
  2. Why is parameter sharing in convolution so important?
  3. What does a negative output value from a convolution mean?
  4. Why does 'same' padding preserve spatial dimensions?
