Multimodal models bridge senses. They take input from multiple modalities (text + image, text + audio, text + video) and produce outputs that depend on all of them at once. CLIP showed that images and captions can share an embedding space. GPT-4V can answer questions about a photograph. Gemini and Claude accept mixed inputs in a single prompt and reason over them jointly; which modalities are supported varies by model and version. The key idea: encode each modality into a common vector space where 'a cat' (text) and a photo of a cat can be compared and combined.
Architecture: dual encoders + shared space
CLIP-style training (the foundational technique):
text_encoder(t) → text vector v_t (e.g. 768-dim, normalised)
image_encoder(x) → image vector v_x (same dimension, normalised)
Loss = contrastive
For a batch of N (text, image) pairs:
pull matched (t_i, x_i) close
push unmatched (t_i, x_j) for i≠j apart
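Result: cosine similarity v_t · v_x is high for matches and low for non-matches. The loss only ranks each match above the other N−1 candidates in the batch; it does not force the values to exactly 1 and 0.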
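The pull/push objective above can be sketched as a symmetric cross-entropy over the batch's similarity matrix. This is a minimal, dependency-free illustration, not CLIP's actual training code: the function names are hypothetical, the input vectors stand in for encoder outputs, and the fixed temperature of 0.07 is an assumption (in practice the temperature is usually a learned parameter).

```python
import math

def normalize(v):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(z - m) for z in row))
        total += log_sum_exp - row[i]
    return total / len(logits)

def contrastive_loss(text_vecs, image_vecs, temperature=0.07):
    """Symmetric contrastive loss over N matched (text, image) pairs."""
    t = [normalize(v) for v in text_vecs]
    x = [normalize(v) for v in image_vecs]
    # logits[i][j]: cosine similarity of text i and image j, scaled by temperature
    logits = [[dot(ti, xj) / temperature for xj in x] for ti in t]
    logits_t = [list(col) for col in zip(*logits)]
    # Average the text-to-image and image-to-text directions.
    return 0.5 * (cross_entropy_diagonal(logits) + cross_entropy_diagonal(logits_t))

# Toy 2-dim vectors standing in for encoder outputs (hypothetical values).
aligned = contrastive_loss([[1, 0], [0, 1]], [[1, 0], [0, 1]])
swapped = contrastive_loss([[1, 0], [0, 1]], [[0, 1], [1, 0]])
print(f"aligned pairs: {aligned:.6f}, swapped pairs: {swapped:.6f}")
```

When each text vector lines up with its own image vector the loss is near zero; swapping the pairings makes it large, which is the gradient signal that drives alignment.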
After training, the same encoders work for:
• Image search: text → encode → find nearest image vectors
• Image classification: encode 'a photo of a cat'/'a photo of a dog'/... and match
• Captioning: image → encoder → fed as embeddings to a separate decoder LM
Two encoders · trained to align · shared embedding space enables cross-modal tasks
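Image search and zero-shot classification both reduce to a nearest-neighbour lookup in the shared space. The sketch below assumes the embeddings have already been computed; the 3-dim vectors, file names, and function names are hypothetical stand-ins for real encoder outputs.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def zero_shot_classify(image_vec, prompt_vecs):
    """Return the prompt whose embedding is most similar to the image embedding."""
    return max(prompt_vecs, key=lambda p: cosine(image_vec, prompt_vecs[p]))

def search_images(query_vec, image_index, k=2):
    """Return the ids of the k images nearest to a text query embedding."""
    ranked = sorted(
        image_index,
        key=lambda name: cosine(query_vec, image_index[name]),
        reverse=True,
    )
    return ranked[:k]

# Toy vectors standing in for text_encoder / image_encoder outputs (hypothetical).
prompt_vecs = {
    "a photo of a cat": [0.9, 0.1, 0.0],
    "a photo of a dog": [0.1, 0.9, 0.0],
}
image_index = {
    "cat_01.jpg": [0.8, 0.2, 0.1],
    "dog_01.jpg": [0.2, 0.8, 0.1],
    "car_01.jpg": [0.0, 0.1, 0.9],
}

label = zero_shot_classify(image_index["cat_01.jpg"], prompt_vecs)
hits = search_images(prompt_vecs["a photo of a dog"], image_index, k=2)
print(label)  # a photo of a cat
print(hits)   # ['dog_01.jpg', 'cat_01.jpg']
```

No classifier is trained here: adding a new class only requires encoding one more prompt, which is why this setup is called zero-shot.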
Modern multimodal LLMs (GPT-4V, Gemini, Claude 3)
- Image input: a pretrained image encoder (often a ViT) produces patch features, and a learned projection maps them into the LLM's text-token embedding space as image tokens.
- These image tokens are then prepended/inserted into the text token stream like any other tokens.
- The LLM's transformer sees a unified sequence of (text + image) tokens and reasons over them jointly via standard attention.
- Output is text, optionally paired with images (via a diffusion-based decoder) or audio (via a speech model).
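The projection and insertion steps in the list above can be illustrated with toy shapes. This is a sketch under stated assumptions, not any particular model's implementation: the projection is taken to be a single linear layer, all sizes and random values are made up, and the names `project` and `build_sequence` are hypothetical.

```python
import random

def project(patch_features, weight, bias):
    """Linear layer: map each patch feature (d_vision) to the LLM width (d_model)."""
    return [
        [sum(w * f for w, f in zip(row, feat)) + b for row, b in zip(weight, bias)]
        for feat in patch_features
    ]

def build_sequence(text_before, image_tokens, text_after):
    """Insert projected image tokens between two runs of text-token embeddings."""
    return text_before + image_tokens + text_after

random.seed(0)
D_VISION, D_MODEL, N_PATCHES = 4, 6, 3  # toy sizes, chosen only for illustration

# Stand-ins for the vision encoder's patch outputs and the projection weights.
patches = [[random.gauss(0, 1) for _ in range(D_VISION)] for _ in range(N_PATCHES)]
weight = [[random.gauss(0, 0.1) for _ in range(D_VISION)] for _ in range(D_MODEL)]
bias = [0.0] * D_MODEL

# Stand-ins for text-token embeddings before and after the image.
prompt_prefix = [[0.0] * D_MODEL for _ in range(2)]
prompt_suffix = [[0.0] * D_MODEL for _ in range(1)]

image_tokens = project(patches, weight, bias)
sequence = build_sequence(prompt_prefix, image_tokens, prompt_suffix)

# Every element now has width D_MODEL, so the transformer can attend over
# text and image positions without distinguishing between them.
print(len(sequence), len(sequence[0]))  # 6 6
```

After projection, image tokens are just more rows in the input sequence, so no change to the attention mechanism itself is needed.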
Capabilities unlocked
- Visual question answering: 'how many people are in this photo?'
- Document understanding: 'extract all dates from this scanned invoice'
- Code from sketches: 'turn this whiteboard sketch into HTML'
- Cross-modal generation: text → image (DALL-E, Stable Diffusion), text → audio (Suno, ElevenLabs), text → video (Sora, Veo).
- Embodied assistance: 'tell me what you see and how to fix the broken faucet'.