Learn Images as Data

A free visual AI and machine learning lesson with an interactive 3D visualization, plain-English theory, and quiz.

Before a model can do anything visual, remember one thing: a digital image is just a grid of numbers. The model has no notion of a 'pixel'; it sees only values, typically integers between 0 and 255, that say how much red, green, or blue is present at each position. Once you internalise this, every later concept (filters, edges, segmentation, detection) is just clever math on this grid.

The Image Tensor: H × W × C

A color image is a 3D tensor:

  shape = [Height, Width, Channels]

For a typical RGB photo:
  • Height: rows (e.g., 1080)
  • Width:  columns (e.g., 1920)
  • Channels: 3 (red, green, blue)

Each element image[y][x][c] is a number, traditionally:
  • uint8 — integer in [0, 255]
  • float — normalised to [0, 1] or [−1, +1] before feeding to a neural net
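As a minimal sketch of the two float conventions above, assuming NumPy is available (the sample values and variable names are illustrative, not from any particular dataset):

```python
import numpy as np

# Three sample uint8 intensities: darkest, mid-grey, brightest.
pixels = np.array([0, 128, 255], dtype=np.uint8)

# Map [0, 255] -> [0, 1]. Cast to float first so the division is not done on integers.
unit = pixels.astype(np.float32) / 255.0

# Map [0, 255] -> [-1, +1]: dividing by 127.5 gives [0, 2], then shift down by 1.
signed = pixels.astype(np.float32) / 127.5 - 1.0

print(float(unit.min()), float(unit.max()))      # 0.0 1.0
print(float(signed.min()), float(signed.max()))  # -1.0 1.0
```

Which range a model wants depends on how it was trained, so check the model's documentation rather than assuming one or the other.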

Special cases:
  • Grayscale: 1 channel — just brightness
  • RGBA: 4 channels — RGB + alpha (transparency)
  • Depth/IR/multi-spectral: arbitrary channel count (drone imagery can have 7-12)

image[y][x][c] → one number · a 1080 × 1920 RGB 'photo' is just over 6 million of these
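The layout can be checked directly. This sketch assumes NumPy and builds a blank, hypothetical 1080 × 1920 RGB image rather than loading a real photo:

```python
import numpy as np

# Blank image in [Height, Width, Channels] order: 1080 rows, 1920 columns, 3 channels.
image = np.zeros((1080, 1920, 3), dtype=np.uint8)

# Set the pixel at row y=10, column x=20 to pure red.
image[10, 20] = [255, 0, 0]

red_value = int(image[10, 20, 0])  # channel 0 is red
total_values = image.size          # 1080 * 1920 * 3

print(image.shape)   # (1080, 1920, 3)
print(red_value)     # 255
print(total_values)  # 6220800
```

Note that the row index (y) comes before the column index (x), which is the reverse of the usual (x, y) order in geometry.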

RGB Color: Three Numbers Per Pixel

Pure red: [255, 0, 0] — full red, no green, no blue.
Pure white: [255, 255, 255] — all channels max.
Pure black: [0, 0, 0] — all channels zero.
Yellow: [255, 255, 0] — red + green, no blue.
A soft pink: [240, 180, 200] — high R, medium G, medium-high B.
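To make these triples concrete, the sketch below (assuming NumPy; the `colors` dictionary and `swatch` name are illustrative) packs the five colours into a tiny 1 × 5 image:

```python
import numpy as np

# The colours listed above, as [R, G, B] triples.
colors = {
    "red":    [255, 0, 0],
    "white":  [255, 255, 255],
    "black":  [0, 0, 0],
    "yellow": [255, 255, 0],
    "pink":   [240, 180, 200],
}

# One row, five columns, three channels: a 1 x 5 RGB image.
swatch = np.array(list(colors.values()), dtype=np.uint8).reshape(1, 5, 3)

print(swatch.shape)            # (1, 5, 3)
print(swatch[0, 3].tolist())   # [255, 255, 0]  (yellow: red + green, no blue)
```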

Pre-processing for Neural Networks

Raw images need preparation before feeding to a model:

  1. Resize: rescale to a fixed dimension (e.g., 224×224 for ImageNet models).
     Done with bilinear/bicubic interpolation.

  2. Convert to float: divide by 255 to map [0, 255] → [0, 1].

  3. Normalise: standardise each channel using dataset statistics:
        x_normalised = (x − mean) / std
     ImageNet means:  [0.485, 0.456, 0.406]
     ImageNet stds:   [0.229, 0.224, 0.225]
     These statistics apply to values already scaled to [0, 1]. The result has roughly
     zero mean and unit variance per channel, which is what ImageNet-pretrained models expect.

  4. Permute: reorder axes if the framework needs it (e.g., HWC → CHW for PyTorch, which expects channels first).

  5. Batch: stack N images into a 4D tensor [N, C, H, W] for parallel GPU processing.

Most ImageNet-pretrained CNNs expect this pre-processing. If the scaling, normalisation, or axis order differs from what the model was trained with, accuracy can drop sharply.
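Steps 2 to 5 can be sketched in a few lines of NumPy. This is an illustration under stated assumptions, not a reference implementation: the `preprocess` helper is hypothetical, the inputs are random arrays standing in for photos, and step 1 (resizing) is skipped by generating images that are already 224×224, since interpolation would normally be done with an image library.

```python
import numpy as np

# ImageNet per-channel statistics (for values already scaled to [0, 1]).
IMAGENET_MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
IMAGENET_STD = np.array([0.229, 0.224, 0.225], dtype=np.float32)

def preprocess(image_hwc_uint8):
    """Hypothetical helper: uint8 HWC image -> normalised float32 CHW array."""
    x = image_hwc_uint8.astype(np.float32) / 255.0  # step 2: [0, 255] -> [0, 1]
    x = (x - IMAGENET_MEAN) / IMAGENET_STD          # step 3: broadcasts over the channel axis
    return np.transpose(x, (2, 0, 1))               # step 4: HWC -> CHW

# Stand-ins for four already-resized photos (random noise, not real images).
rng = np.random.default_rng(0)
images = [rng.integers(0, 256, size=(224, 224, 3), dtype=np.uint8) for _ in range(4)]

# Step 5: stack into a single [N, C, H, W] batch.
batch = np.stack([preprocess(img) for img in images])

print(batch.shape)  # (4, 3, 224, 224)
print(batch.dtype)  # float32
```

Normalising while the array is still in HWC order lets the three-element mean and std broadcast along the last axis without any reshaping.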

What Each Channel Captures

Red channel: bright on warm objects (skin, fire, sunset). Skin tones in particular tend to be brightest in red.
Green channel: bright on plants, grass, signs (human eyes are most sensitive to green).
Blue channel: bright on sky, water, shadows. Often the noisiest in low light.
Inspecting a single channel as grayscale reveals what cues live where — useful for debugging.
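Splitting an image into channels is a one-line slice. The sketch below assumes NumPy and uses a hypothetical 2 × 2 test image so the result is easy to verify by eye:

```python
import numpy as np

# Tiny 2 x 2 RGB test image with one known colour per pixel.
image = np.zeros((2, 2, 3), dtype=np.uint8)
image[0, 0] = [255, 0, 0]      # pure red
image[0, 1] = [255, 255, 0]    # yellow
image[1, 0] = [0, 0, 255]      # pure blue
image[1, 1] = [255, 255, 255]  # white

# Each slice is a 2D array that can be viewed as a grayscale image.
red, green, blue = image[..., 0], image[..., 1], image[..., 2]

print(red.tolist())    # [[255, 255], [0, 255]]
print(green.tolist())  # [[0, 255], [0, 255]]
print(blue.tolist())   # [[0, 0], [255, 255]]
```

Yellow shows up in both the red and green channels but not in blue, matching its [255, 255, 0] triple.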