Bag of Words treats every word as an island — 'cat' and 'feline' are as unrelated as 'cat' and 'turbocharger'. Word embeddings change this. Each word becomes a dense vector (typically 300 dimensions) such that similar words end up close together in the vector space. Suddenly the model knows that king and queen are related, that Paris is to France as Tokyo is to Japan, and that 'happy' is closer to 'joyful' than to 'pencil'.
The Distributional Hypothesis
Word2Vec — the breakthrough (Mikolov, 2013)
Skip-gram objective: given a center word, predict its neighbours.
For each (center, context) pair in the training data:
P(context | center) = exp(v_context · v_center) / Σ_w exp(v_w · v_center), where the sum runs over every word w in the vocabulary
Maximise log-likelihood over all observed pairs.
Result: the matrix of v_center vectors IS the word embedding.
CBOW (Continuous Bag of Words) — the inverse:
Given context words, predict the center word.
Both are trained with hierarchical softmax or negative sampling, which avoid summing over the whole vocabulary at every training step.
The 2013 models were trained on roughly 6B words of Google News text, with embedding dim = 300.
The stunning result: embeddings exhibit ANALOGY structure for free.
Train to predict neighbours · embeddings emerge as a side-effect
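The skip-gram objective above can be sketched in plain Python. This is a minimal illustration, not the original implementation. The toy sentence, the window size, the 4-dimensional random vectors, and the helper names `skipgram_pairs` and `softmax_prob` are all assumptions made for the example. It also assumes two vector tables (`v_in` for center words, `v_out` for context words), which is how Word2Vec is usually described. No training step is shown, only how the pairs and the log-likelihood are formed.

```python
import math
import random

def skipgram_pairs(tokens, window=2):
    """Collect (center, context) pairs within `window` positions of each other."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax_prob(center, context, v_in, v_out):
    """P(context | center) with the full softmax over the vocabulary."""
    scores = {w: dot(v_out[w], v_in[center]) for w in v_out}
    m = max(scores.values())  # subtract the max for numerical stability
    exps = {w: math.exp(s - m) for w, s in scores.items()}
    return exps[context] / sum(exps.values())

# Toy corpus and randomly initialised vectors (hypothetical values, untrained).
tokens = "the cat sat on the mat".split()
vocab = sorted(set(tokens))
rng = random.Random(0)
dim = 4
v_in = {w: [rng.uniform(-0.5, 0.5) for _ in range(dim)] for w in vocab}
v_out = {w: [rng.uniform(-0.5, 0.5) for _ in range(dim)] for w in vocab}

pairs = skipgram_pairs(tokens, window=1)

# The quantity training would maximise: log-likelihood over all observed pairs.
log_likelihood = sum(math.log(softmax_prob(c, o, v_in, v_out)) for c, o in pairs)
print(len(pairs), "pairs, log-likelihood =", round(log_likelihood, 3))
```

The denominator in `softmax_prob` touches every word in the vocabulary, which is the cost that hierarchical softmax and negative sampling are designed to avoid.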
Vector Arithmetic — the Famous Trick
Once trained, embeddings exhibit linear structure that captures semantic relationships:
v(king) − v(man) + v(woman) ≈ v(queen)
v(paris) − v(france) + v(italy) ≈ v(rome)
v(walking) − v(walked) + v(swam) ≈ v(swimming)
Procedure:
1. Compute the target vector via subtraction + addition
2. Find the closest word in the vocab via cosine similarity, excluding the three input words (otherwise one of them is often the nearest neighbour)
3. Often the answer is the expected analogy
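The procedure above can be sketched with a handful of hand-built vectors. The three dimensions (royalty, gender, fruit) and the values in `toy_vectors` are invented for illustration and are not taken from any trained model. Real embeddings have hundreds of dimensions with no such clean interpretation. The function name `analogy` is likewise an assumption.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def analogy(a, b, c, vectors):
    """Return the word closest to v(a) - v(b) + v(c), excluding a, b and c."""
    # Step 1: compute the target vector via subtraction and addition.
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    # Step 2: rank the remaining vocabulary by cosine similarity to the target.
    candidates = [w for w in vectors if w not in {a, b, c}]
    return max(candidates, key=lambda w: cosine(target, vectors[w]))

# Hypothetical toy embeddings: (royalty, gender, fruit).
toy_vectors = {
    "king":  [1.0,  1.0, 0.0],
    "queen": [1.0, -1.0, 0.0],
    "man":   [0.0,  1.0, 0.0],
    "woman": [0.0, -1.0, 0.0],
    "apple": [0.0,  0.0, 1.0],
}

print(analogy("king", "man", "woman", toy_vectors))  # queen
```

With trained vectors the match is approximate rather than exact, which is why step 3 says "often" rather than "always".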
Why this works: the vector v(king) − v(man) captures the abstract direction of 'royalty without male specificity'. Adding v(woman) puts you back in 'royalty + female' territory — closest word: queen.
This was the first viral moment for deep learning in NLP: proof that learned representations contained structure.
Direction in vector space ≈ semantic relationship · the network learned this without any supervision
Beyond Word2Vec
- GloVe (Stanford, 2014): trains on global word co-occurrence counts directly. Similar quality, slightly different math, often used interchangeably with Word2Vec.
- FastText (Facebook, 2016): extends Word2Vec to model character n-grams as well — handles out-of-vocabulary words by composing them from sub-word vectors. Strong for morphologically rich languages.
- ELMo (2018): the model that popularised contextualised embeddings. The vector for 'bank' depends on whether it's a 'river bank' or 'investment bank'. Trained as a bidirectional LSTM language model.
- BERT/GPT embeddings (2018+): deep transformer-based contextual embeddings. The de facto standard. Each token has a different embedding depending on its context — no longer one fixed vector per word.
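To make the FastText entry in the list above concrete, here is a rough sketch of how a word can be broken into character n-grams and how a vector for an unseen word might be composed from them. It assumes the `<` and `>` boundary markers and the 3 to 6 n-gram range described in the FastText paper. The function names `char_ngrams` and `compose` and the tiny `ngram_table` are hypothetical. The real library also hashes n-grams into buckets and adds a vector for the whole word, both of which are left out here.

```python
def char_ngrams(word, n_min=3, n_max=6):
    """List the character n-grams of `word`, wrapped in boundary markers."""
    wrapped = f"<{word}>"
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(wrapped) - n + 1):
            grams.append(wrapped[i:i + n])
    return grams

def compose(word, ngram_vectors, dim):
    """Sum the vectors of the word's n-grams that appear in the table."""
    total = [0.0] * dim
    for gram in char_ngrams(word):
        if gram in ngram_vectors:
            total = [t + x for t, x in zip(total, ngram_vectors[gram])]
    return total

print(char_ngrams("where", 3, 3))  # ['<wh', 'whe', 'her', 'ere', 're>']

# Hypothetical table of learned n-gram vectors (made-up values, 2 dimensions).
ngram_table = {"<wh": [1.0, 0.0], "re>": [0.0, 2.0]}
print(compose("where", ngram_table, dim=2))  # [1.0, 2.0]

# An unseen word still shares sub-word pieces with known words.
shared = set(char_ngrams("unhappiness")) & set(char_ngrams("happiness"))
print(sorted(shared)[:5])
```

Because 'unhappiness' and 'happiness' share many n-grams, a word missing from the training vocabulary still receives a vector close to its morphological relatives.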