Learn CNN — Convolution - Free Visual AI and ML Lesson


A convolutional layer slides a small filter (kernel) across the input, computing a dot product between the kernel and the patch beneath it at each position. The result is a feature map showing where in the image the pattern encoded by that kernel appears. A layer applies many kernels in parallel, producing one feature map per kernel, and stacking layers lets each one build more abstract features (edges, then textures, then shapes) from the previous layer's output.

The Convolution Operation

For a single-channel input X (H×W) and kernel K (k×k), with stride s and no padding (i and j run from 0 to k−1):

  Output[r, c] = Σᵢ Σⱼ K[i,j] · X[r·s + i, c·s + j]
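
The formula can be written out directly as nested loops. The following is a minimal teaching sketch in plain Python, not an optimized implementation; the function name `conv2d` and the nested-list representation are assumptions made for illustration, and the output dimensions use the size formula given just below.

```python
def conv2d(x, kernel, stride=1):
    """Unpadded 2D convolution of a single-channel input, as in the formula above."""
    h, w = len(x), len(x[0])
    k = len(kernel)  # assumes a square k x k kernel
    h_out = (h - k) // stride + 1
    w_out = (w - k) // stride + 1
    out = [[0.0] * w_out for _ in range(h_out)]
    for r in range(h_out):
        for c in range(w_out):
            total = 0.0
            # Dot product between the kernel and the patch it currently covers.
            for i in range(k):
                for j in range(k):
                    total += kernel[i][j] * x[r * stride + i][c * stride + j]
            out[r][c] = total
    return out


# 4x4 input holding the values 0..15, and a 2x2 kernel of ones.
x = [[float(4 * r + c) for c in range(4)] for r in range(4)]
kernel = [[1.0, 1.0], [1.0, 1.0]]
y = conv2d(x, kernel, stride=2)
print(y)  # [[10.0, 18.0], [42.0, 50.0]]
```

With stride 2 the kernel visits four non-overlapping 2×2 patches, so each output value is simply the sum of one patch.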

Output size formula (no padding):
  H_out = ⌊(H − k) / s⌋ + 1
  W_out = ⌊(W − k) / s⌋ + 1

Example: 7×7 input, 3×3 kernel, stride=1:
  H_out = ⌊(7−3)/1⌋+1 = 5 → 5×5 feature map

Example: 7×7 input, 3×3 kernel, stride=2:
  H_out = ⌊(7−3)/2⌋+1 = 3 → 3×3 feature map

Stride controls how far the kernel moves at each step. Stride 2 roughly halves each spatial dimension (7 → 3 in the example above).
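
The size formula is easy to check in code. This small sketch (the helper name `conv_output_size` is an assumption for illustration) reproduces the two examples above and shows that stride 2 only approximately halves a dimension.

```python
def conv_output_size(n, k, stride=1):
    """Length of one output dimension for an unpadded convolution."""
    if k > n:
        raise ValueError("kernel is larger than the input")
    return (n - k) // stride + 1


size_s1 = conv_output_size(7, 3, stride=1)     # 7x7 input, stride 1
size_s2 = conv_output_size(7, 3, stride=2)     # 7x7 input, stride 2
size_224 = conv_output_size(224, 3, stride=2)  # a larger input, stride 2

print(size_s1, size_s2, size_224)  # 5 3 111
```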

What Kernels Learn to Detect

Early layers (raw pixels → edges): kernels learn simple patterns — horizontal edges, vertical edges, diagonal gradients, colour blobs.
Middle layers (edges → shapes): kernels combine edge activations into corners, circles, simple textures, local patterns.
Deep layers (shapes → objects): kernels build representations of eyes, wheels, fur — entire semantic concepts.
This hierarchical feature learning is not hand-coded: the kernel weights are learned from data by gradient descent, with the gradients computed by backpropagation.
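
To make "a kernel detects an edge" concrete, here is a small sketch using a hand-set vertical-edge kernel. This is purely illustrative: in a trained CNN the weights would be learned rather than written by hand, and the kernel values, the toy image, and the helper `conv2d` are all assumptions chosen for the example.

```python
def conv2d(x, kernel):
    """Unpadded, stride-1 convolution of a single-channel input."""
    k = len(kernel)
    h_out = len(x) - k + 1
    w_out = len(x[0]) - k + 1
    return [
        [
            sum(kernel[i][j] * x[r + i][c + j] for i in range(k) for j in range(k))
            for c in range(w_out)
        ]
        for r in range(h_out)
    ]


# Hand-set kernel: negative on the left, positive on the right.
vertical_edge = [
    [-1, 0, 1],
    [-1, 0, 1],
    [-1, 0, 1],
]

# 5x6 image: dark (0) on the left half, bright (1) on the right half.
image = [[0, 0, 0, 1, 1, 1] for _ in range(5)]
# 5x6 image with no edge at all.
flat = [[1] * 6 for _ in range(5)]

edge_map = conv2d(image, vertical_edge)
flat_map = conv2d(flat, vertical_edge)

print(edge_map[0])  # [0, 3, 3, 0]
print(flat_map[0])  # [0, 0, 0, 0]
```

The feature map is non-zero only where the kernel straddles the dark-to-bright boundary, and it is zero everywhere on the uniform image.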

Parameter Sharing — Why CNNs are Efficient

Fully connected layer (224×224×3 RGB image → 1000 neurons):
  Parameters = 224 × 224 × 3 × 1000 = 150,528,000

Convolutional layer (same image, 64 kernels of 3×3×3):
  Parameters = 3 × 3 × 3 × 64 = 1,728

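The same kernel weights are reused at every position; this is parameter sharing.
Result: about 87,000× fewer weights (150,528,000 / 1,728 ≈ 87,111, biases not counted). The two layers are not interchangeable (one emits 1000 global values, the other 64 feature maps), but the gap shows how few weights a convolution needs to cover the whole image.
Fewer parameters generally mean less memory, a lower risk of overfitting, and often faster training.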

Parameter sharing is the core advantage of convolution over fully-connected layers for images
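
The arithmetic above can be reproduced in a few lines. The helper names `dense_params` and `conv_params` are assumptions for illustration; the optional bias terms are not part of the counts quoted above and are included only to show how little they change the picture.

```python
def dense_params(in_features, out_features, bias=False):
    """Weights in a fully connected layer, optionally counting biases."""
    return in_features * out_features + (out_features if bias else 0)


def conv_params(k, in_channels, out_channels, bias=False):
    """Weights in a conv layer with out_channels kernels of size k x k x in_channels."""
    return k * k * in_channels * out_channels + (out_channels if bias else 0)


fc = dense_params(224 * 224 * 3, 1000)
conv = conv_params(3, 3, 64)

print(fc)           # 150528000
print(conv)         # 1728
print(fc // conv)   # 87111

# Counting biases adds 1000 and 64 parameters respectively.
fc_b = dense_params(224 * 224 * 3, 1000, bias=True)
conv_b = conv_params(3, 3, 64, bias=True)
```

Note that the convolutional count does not depend on the image size at all, which is the practical consequence of sharing weights across positions.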

Padding

Without padding, the output shrinks at every layer (a 3×3 kernel at stride 1 trims 2 pixels from each dimension).
  7×7 → 5×5 → 3×3 → 1×1 (only 3 convolutions possible!)

With same padding (padding = ⌊k/2⌋ = 1 for k=3):
  Output size = Input size (for stride=1)
  7×7 → 7×7 → 7×7 → ...  (can go as deep as needed)

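Padding adds a border of extra values, usually zeros, around the input.
Many modern CNNs use same padding (written padding='same' in frameworks such as Keras) in their stride-1 convolutions to preserve spatial dimensions.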

Padding='same' keeps spatial size constant at stride 1, so network depth is not limited by shrinking feature maps
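
The effect of padding on output size can be sketched as follows. The helpers `zero_pad` and `out_size` are hypothetical names for illustration; `out_size` uses the standard extension of the size formula to padding p on each side, ⌊(n + 2p − k) / s⌋ + 1.

```python
def zero_pad(x, p):
    """Return a copy of x with p rows and columns of zeros on every side."""
    width = len(x[0]) + 2 * p
    top = [[0.0] * width for _ in range(p)]
    middle = [[0.0] * p + list(row) + [0.0] * p for row in x]
    bottom = [[0.0] * width for _ in range(p)]
    return top + middle + bottom


def out_size(n, k, stride=1, p=0):
    """Output length along one dimension with padding p on each side."""
    return (n + 2 * p - k) // stride + 1


# Without padding, a 3x3 kernel shrinks a 7x7 input until nothing fits.
sizes = [7]
while sizes[-1] >= 3:
    sizes.append(out_size(sizes[-1], 3))
print(sizes)  # [7, 5, 3, 1]

# With p = k // 2 = 1, the size stays at 7 layer after layer.
same_sizes = [7]
for _ in range(3):
    same_sizes.append(out_size(same_sizes[-1], 3, p=1))
print(same_sizes)  # [7, 7, 7, 7]

# The padded input itself is 9x9, with zeros on the border.
x = [[1.0] * 7 for _ in range(7)]
padded = zero_pad(x, 1)
print(len(padded), len(padded[0]))  # 9 9
```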
