Computer Vision - Beginner - 10 min

Learn Images as Data


Last updated: 2026-05-13.

Before a model can do anything visual, you need to remember one thing: a digital image is just a grid of numbers. The model has no notion of a 'pixel'; it sees only numbers, typically integers between 0 and 255, that say how much red, green, or blue is present at each position. Once you internalise this, every later concept (filters, edges, segmentation, detection) is just clever math on this grid.

The Image Tensor: H × W × C

A color image is a 3D tensor:

  shape = [Height, Width, Channels]

For a typical RGB photo:
  • Height: rows (e.g., 1080)
  • Width:  columns (e.g., 1920)
  • Channels: 3 (red, green, blue)

Each element image[y][x][c] is a number, traditionally:
  • uint8 — integer in [0, 255]
  • float — normalised to [0, 1] or [−1, +1] before feeding to a neural net

Special cases:
  • Grayscale: 1 channel — just brightness
  • RGBA: 4 channels — RGB + alpha (transparency)
  • Depth/IR/multi-spectral: arbitrary channel count (drone imagery can have 7-12)

image[y][x][c] → one number · the entire 'photo' is just millions of these (1080 × 1920 × 3 ≈ 6.2 million)
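
To make the indexing concrete, here is a minimal NumPy sketch. It assumes the image is held as a NumPy array in HWC order (NumPy writes image[y][x][c] as image[y, x, c]); the variable names are illustrative only.

```python
import numpy as np

# A blank (all-black) Full HD RGB image: 1080 rows, 1920 columns, 3 channels.
image = np.zeros((1080, 1920, 3), dtype=np.uint8)

# Set the pixel at row y=10, column x=20 to pure red.
image[10, 20] = [255, 0, 0]

red_value = image[10, 20, 0]   # channel 0 = red at that position
total_values = image.size      # H * W * C numbers in the whole image

# The two common float conventions mentioned above.
as_unit = image.astype(np.float32) / 255.0           # values in [0, 1]
as_signed = image.astype(np.float32) / 127.5 - 1.0   # values in [-1, +1]

print(image.shape)    # (1080, 1920, 3)
print(total_values)   # 6220800
```

A grayscale image would have shape (1080, 1920) or (1080, 1920, 1), and an RGBA image (1080, 1920, 4); the indexing works the same way.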

RGB Color: Three Numbers Per Pixel

  • Pure red: [255, 0, 0] — full red, no green, no blue.
  • Pure white: [255, 255, 255] — all channels max.
  • Pure black: [0, 0, 0] — all channels zero.
  • Yellow: [255, 255, 0] — red + green, no blue.
  • A soft pink: [240, 180, 200] — high R, medium G, medium-high B.
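
The colors listed above can be written directly as arrays. The sketch below, assuming NumPy and uint8 storage, packs them into a 1 × 5 image strip; the dictionary and its keys are illustrative names, not part of any library.

```python
import numpy as np

# The five example colors, each as an (R, G, B) triple.
colors = {
    "red":    (255, 0, 0),
    "white":  (255, 255, 255),
    "black":  (0, 0, 0),
    "yellow": (255, 255, 0),
    "pink":   (240, 180, 200),
}

# One row, five pixels, three channels: shape (1, 5, 3).
strip = np.array([list(colors.values())], dtype=np.uint8)

yellow = strip[0, 3]             # the fourth pixel in the strip
has_no_blue = yellow[2] == 0     # yellow is red + green with zero blue

print(strip.shape)   # (1, 5, 3)
```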

Pre-processing for Neural Networks

Raw images need preparation before feeding to a model:

  1. Resize: rescale to a fixed dimension (e.g., 224×224 for ImageNet models).
     Done with bilinear/bicubic interpolation.

  2. Convert to float: divide by 255 to map [0, 255] → [0, 1].

  3. Normalise: standardise each channel using dataset statistics:
        x_normalised = (x − mean) / std
     ImageNet means (R, G, B, on the [0, 1] scale):  [0.485, 0.456, 0.406]
     ImageNet stds  (R, G, B, on the [0, 1] scale):  [0.229, 0.224, 0.225]
     This puts each channel near mean 0 and variance 1, which is what a model trained with these statistics expects.

  4. Permute: reorder axes if needed (HWC ↔ CHW, depending on the framework; PyTorch uses CHW, TensorFlow defaults to HWC).

  5. Batch: stack N images into a 4D tensor [N, C, H, W] for parallel GPU processing.

Many pretrained CNNs expect exactly this pre-processing. Always match whatever the model was trained with; get it wrong and accuracy can drop sharply.
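
The five steps can be sketched in plain NumPy as follows. This is a minimal illustration under stated assumptions, not a reference pipeline: resize_nearest and preprocess are hypothetical helper names, and nearest-neighbour sampling stands in for the bilinear/bicubic interpolation a real pipeline would normally use.

```python
import numpy as np

# ImageNet channel statistics (R, G, B), defined on the [0, 1] scale.
IMAGENET_MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
IMAGENET_STD = np.array([0.229, 0.224, 0.225], dtype=np.float32)

def resize_nearest(image, out_h, out_w):
    """Nearest-neighbour resize: a simple stand-in for bilinear/bicubic."""
    in_h, in_w = image.shape[:2]
    rows = np.arange(out_h) * in_h // out_h   # source row for each output row
    cols = np.arange(out_w) * in_w // out_w   # source column for each output column
    return image[rows[:, None], cols]

def preprocess(image, size=224):
    x = resize_nearest(image, size, size)     # 1. resize -> (size, size, 3)
    x = x.astype(np.float32) / 255.0          # 2. uint8 [0, 255] -> float [0, 1]
    x = (x - IMAGENET_MEAN) / IMAGENET_STD    # 3. standardise each channel
    x = np.transpose(x, (2, 0, 1))            # 4. HWC -> CHW
    return x

# Four random stand-in "photos" of 480 x 640 pixels.
rng = np.random.default_rng(0)
images = [rng.integers(0, 256, size=(480, 640, 3), dtype=np.uint8) for _ in range(4)]

# 5. Stack into one 4D batch tensor [N, C, H, W].
batch = np.stack([preprocess(img) for img in images])
print(batch.shape, batch.dtype)   # (4, 3, 224, 224) float32
```

Step 3 relies on NumPy broadcasting: the length-3 mean and std arrays line up with the last (channel) axis, which is why normalisation happens before the permute to CHW.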

What Each Channel Captures

  • Red channel: bright on warm objects (skin, fire, sunset). Skin tones especially live mostly in red.
  • Green channel: bright on plants, grass, signs (human vision is most sensitive to green light).
  • Blue channel: bright on sky, water, shadows. Often the noisiest in low light.
  • Inspecting a single channel as grayscale reveals what cues live where — useful for debugging.
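
Pulling out one channel is a single slice. The sketch below assumes an HWC NumPy array and uses a tiny synthetic 2 × 2 image so the values are easy to check by eye; with a real photo, each 2D slice could be displayed as a grayscale image.

```python
import numpy as np

# A 2 x 2 RGB image: red, green on the top row; blue, yellow on the bottom row.
image = np.array(
    [[[255, 0, 0], [0, 255, 0]],
     [[0, 0, 255], [255, 255, 0]]],
    dtype=np.uint8,
)

# Each slice is a 2D array of shape (H, W): one brightness value per pixel.
red = image[:, :, 0]
green = image[:, :, 1]
blue = image[:, :, 2]

print(red.tolist())    # [[255, 0], [0, 255]]  -> bright where red or yellow
print(blue.tolist())   # [[0, 0], [255, 0]]    -> bright only at the blue pixel
```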

Practice questions

  1. What is the tensor shape of a 224×224 RGB color image?
  2. What color does a pixel with [R, G, B] = [255, 255, 0] represent?
  3. Why must images typically be normalised (e.g., (x − mean) / std) before feeding into a neural network?
  4. What's the difference between a grayscale image and a single channel of an RGB image?
