Learn Multimodal Models

A free visual AI and machine learning lesson with an interactive 3D visualization, plain-English theory, and quiz.

Multimodal models bridge senses. They take input from multiple modalities (text + image, text + audio, text + video) and produce outputs that depend on all of them at once. CLIP showed that images and captions can share an embedding space. GPT-4V can answer questions about a photograph. Gemini accepts text, images, audio, and video in a single prompt, and Claude 3 reads text and images together. The key idea: encode each modality into a common vector space where 'a cat' (text) and a photo of a cat can be compared and combined.

Architecture: dual encoders + shared space

CLIP-style training (the foundational technique):
  text_encoder(t)  → text vector  v_t  (e.g. 768-dim, L2-normalised)
  image_encoder(x) → image vector v_x  (same dimension, L2-normalised)

  Loss = contrastive
    For a batch of N (text, image) pairs:
      pull matched (t_i, x_i) close
      push unmatched (t_i, x_j) for i≠j apart
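  After training, the cosine similarity v_t · v_x is high for matched pairs and low for unmatched ones.
  The loss is a softmax over the N × N similarity matrix, so it enforces this ranking rather than exact values of 1 and 0.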
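
The pull/push objective above can be sketched as a symmetric cross-entropy over the batch's similarity matrix. This is a minimal NumPy illustration, not the original training code: the function names are made up for this example, the vectors are random stand-ins for encoder outputs, and the temperature is fixed at 0.07 as an assumption (CLIP learns it during training).

```python
import numpy as np

def l2_normalize(v, axis=-1, eps=1e-12):
    """Scale each row to unit length so dot products equal cosine similarity."""
    return v / np.maximum(np.linalg.norm(v, axis=axis, keepdims=True), eps)

def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def clip_contrastive_loss(text_vecs, image_vecs, temperature=0.07):
    """Symmetric cross-entropy over an N x N similarity matrix.

    Row i of text_vecs and row i of image_vecs are assumed to be a matched pair.
    """
    v_t = l2_normalize(text_vecs)
    v_x = l2_normalize(image_vecs)
    logits = (v_t @ v_x.T) / temperature   # logits[i, j] = cos(t_i, x_j) / temperature
    diag = np.arange(logits.shape[0])
    loss_t2i = -log_softmax(logits, axis=1)[diag, diag].mean()  # each text picks its image
    loss_i2t = -log_softmax(logits, axis=0)[diag, diag].mean()  # each image picks its text
    return 0.5 * (loss_t2i + loss_i2t)

# Stand-in data: random vectors instead of real encoder outputs.
rng = np.random.default_rng(0)
text = rng.normal(size=(4, 768))
aligned_images = text + 0.1 * rng.normal(size=(4, 768))  # close to their captions
random_images = rng.normal(size=(4, 768))                # unrelated to the captions

aligned_loss = clip_contrastive_loss(text, aligned_images)
random_loss = clip_contrastive_loss(text, random_images)
print(aligned_loss, random_loss)
```

The loss is lower when each image vector sits near its own caption vector than when the pairs are unrelated, which is the signal that drives the two encoders toward a shared space.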

After training, the same encoders work for:
  • Image search: text → encode → find nearest image vectors
  • Image classification: encode prompts such as 'a photo of a cat' and 'a photo of a dog', then pick the prompt closest to the image vector
  • Captioning: image → encode → feed the embeddings to a separate decoder LM

Two encoders · trained to align · shared embedding space enables cross-modal tasks
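
Image search and zero-shot classification from the list above both reduce to a nearest-neighbour lookup by cosine similarity in the shared space. The sketch below uses made-up 3-dimensional vectors in place of real encoder outputs, and the helper `nearest` is a hypothetical name introduced only for this example.

```python
import numpy as np

def l2_normalize(v):
    """Scale vectors to unit length along the last axis."""
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

def nearest(query_vec, candidate_vecs, top_k=1):
    """Return indices of the top_k candidates by cosine similarity, best first."""
    sims = l2_normalize(candidate_vecs) @ l2_normalize(query_vec)
    return np.argsort(-sims)[:top_k]

# Toy embeddings standing in for text_encoder / image_encoder outputs.
prompts = ["a photo of a cat", "a photo of a dog", "a photo of a car"]
prompt_vecs = np.array([[0.9, 0.1, 0.0],
                        [0.1, 0.9, 0.0],
                        [0.0, 0.1, 0.9]])
image_names = ["cat.jpg", "dog.jpg", "car.jpg"]
image_vecs = np.array([[0.8, 0.2, 0.1],
                       [0.2, 0.7, 0.1],
                       [0.1, 0.0, 0.9]])

# Zero-shot classification: one image, choose the closest prompt.
predicted_label = prompts[nearest(image_vecs[1], prompt_vecs)[0]]

# Image search: one text query, rank the image collection.
ranked_images = [image_names[i] for i in nearest(prompt_vecs[0], image_vecs, top_k=2)]

print(predicted_label)  # a photo of a dog
print(ranked_images)    # ['cat.jpg', 'dog.jpg']
```

The same function serves both tasks; only the direction of the query changes (image to text for classification, text to image for search).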

Modern multimodal LLMs (GPT-4V, Gemini, Claude 3)

The internals of these proprietary models are not fully published, but a common design in openly documented systems works as follows.
Image input: a pretrained image encoder (often a ViT) produces image tokens, which a learned projection maps into the LLM's text-token embedding space.
These image tokens are then prepended to or inserted into the text token stream like any other tokens.
The LLM's transformer sees a unified sequence of (text + image) tokens and reasons over them jointly via standard attention.
The output is usually text; some systems add an image decoder (for example, diffusion-based) or a speech model to produce images or audio.
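
The projection-and-concatenation step can be sketched as follows. This is an illustration under stated assumptions: every size, weight, and token id is made up, the projection is a single linear layer with random weights (real systems train it, and some use an MLP or another adapter instead), and no actual image encoder or LLM is involved.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes chosen only for illustration.
num_patches, vision_dim = 16, 1024   # image encoder output: one feature per patch
vocab_size, llm_dim = 100, 512       # LLM token embedding table

patch_features = rng.normal(size=(num_patches, vision_dim))    # stand-in for ViT output
token_embedding_table = rng.normal(size=(vocab_size, llm_dim))

# Projection from vision width to LLM width (random here, learned in practice).
W_proj = rng.normal(size=(vision_dim, llm_dim)) / np.sqrt(vision_dim)
b_proj = np.zeros(llm_dim)

def project_image_tokens(features):
    """Map vision features into the LLM's embedding width."""
    return features @ W_proj + b_proj

def embed_text(token_ids):
    """Look up text token embeddings by id."""
    return token_embedding_table[np.asarray(token_ids)]

prompt_ids = [5, 17, 42, 8]                       # made-up token ids for the question
image_tokens = project_image_tokens(patch_features)
text_tokens = embed_text(prompt_ids)

# Prepend the image tokens; the transformer would attend over the whole sequence.
sequence = np.concatenate([image_tokens, text_tokens], axis=0)
print(sequence.shape)  # (20, 512)
```

Once both kinds of token share the same width, the transformer needs no special handling for images: attention treats the 16 projected patches and the 4 text tokens as one sequence of 20 vectors.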

Capabilities unlocked

Visual question answering: 'how many people are in this photo?'
Document understanding: 'extract all dates from this scanned invoice'
Code from sketches: 'turn this whiteboard sketch into HTML'
Cross-modal generation: text → image (DALL-E, Stable Diffusion), text → audio (Suno, ElevenLabs), text → video (Sora, Veo).
Embodied assistance: 'tell me what you see and how to fix the broken faucet'.
