Day 1 gave you the mental model. Day 2 opens the hood — no equations, no linear algebra. Just the actual mechanics, explained the way they work.
Before the model sees a single word, it splits everything into sub-word chunks called tokens.
Not characters. Not words. Sub-words: pieces small enough to spell out any text, yet large enough to keep sequences short, all drawn from a finite vocabulary. GPT-4's tokenizer has ~100,000 of them.
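To make the splitting concrete, here is a minimal sketch of a greedy longest-match sub-word tokenizer. It is not the algorithm GPT-4 uses (production tokenizers are typically built with byte-pair encoding), and `TOY_VOCAB` and `tokenize` are made-up names with a made-up vocabulary. The point is only to show how a finite vocabulary plus a single-character fallback can cover any input.

```python
# Toy greedy longest-match sub-word tokenizer.
# TOY_VOCAB is invented for illustration; it is not any real model's vocabulary.
TOY_VOCAB = {"un", "believ", "able", "token", "ization", "s", " "}


def tokenize(text, vocab=TOY_VOCAB):
    """Split text into the longest vocabulary pieces, scanning left to right.

    Anything the vocabulary does not cover falls back to single characters,
    so the function never fails on unseen text.
    """
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate piece first, then shrink it.
        for j in range(len(text), i, -1):
            piece = text[i:j]
            if piece in vocab:
                tokens.append(piece)
                i = j
                break
        else:
            # No vocabulary entry matched: emit one raw character.
            tokens.append(text[i])
            i += 1
    return tokens


print(tokenize("unbelievable"))   # ['un', 'believ', 'able']
print(tokenize("tokenizations"))  # ['token', 'ization', 's']
print(tokenize("xyz"))            # ['x', 'y', 'z']
```

Joining the tokens back together always reproduces the input, which is the property that lets a fixed vocabulary handle arbitrary text.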
Every token becomes a point in high-dimensional space. Meaning = distance.
GPT-3's largest model used 12,288 dimensions (documented in the 2020 paper). Modern models use spaces of the same order of magnitude; Llama 3 70B, for example, uses 8,192. We can only visualise 2, but the principle holds: words that mean similar things cluster together. The model doesn't know language. It knows geometry.
Because meaning is geometry, you can do arithmetic on concepts. The famous one: king - man + woman ≈ queen.
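A minimal sketch of concept arithmetic on the king/man/woman/queen analogy, using hand-made 3-dimensional vectors. The numbers in `EMB` are invented so the example works out cleanly; real embeddings are learned, have thousands of dimensions, and only approximately satisfy such analogies. `cosine` and `analogy` are illustrative helper names.

```python
import math

# Hand-made 3-d "embeddings" (axes loosely: royalty, male, female).
# All values are invented for illustration; real embeddings are learned.
EMB = {
    "king":   (0.9, 0.9, 0.1),
    "queen":  (0.9, 0.1, 0.9),
    "man":    (0.1, 0.9, 0.1),
    "woman":  (0.1, 0.1, 0.9),
    "prince": (0.8, 0.9, 0.1),
    "apple":  (0.05, 0.1, 0.1),
}


def cosine(a, b):
    """Cosine similarity: 1.0 means same direction, 0.0 means orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def analogy(a, b, c, emb=EMB):
    """Return the word nearest to emb[a] - emb[b] + emb[c].

    The three input words are excluded from the candidates, which is the
    usual convention when evaluating analogies.
    """
    target = tuple(x - y + z for x, y, z in zip(emb[a], emb[b], emb[c]))
    candidates = {w: v for w, v in emb.items() if w not in (a, b, c)}
    return max(candidates, key=lambda w: cosine(target, candidates[w]))


print(analogy("king", "man", "woman"))  # queen
```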
Every word asks: which other words should I pay attention to?
This is self-attention. Each token simultaneously looks at all other tokens and decides what matters. In a sentence such as "The animal didn't cross the street because it was too tired," the word "it" attends almost entirely to "animal": pronoun disambiguation, solved by geometry.
Before transformers, RNNs processed tokens one-by-one, left to right. Attention replaced that with parallel, global lookup — and it's why transformers scaled.
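The "parallel, global lookup" can be sketched in a few lines. This is a simplified single-head version of scaled dot-product attention that assumes queries, keys and values are all the raw input vectors; real transformers first multiply the input by learned projection matrices, which are left out here. The function names are illustrative.

```python
import math


def softmax(xs):
    """Turn raw scores into positive weights that sum to 1."""
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(X):
    """Single-head scaled dot-product attention with Q = K = V = X.

    X is a list of token vectors. Learned W_Q, W_K, W_V projections are
    omitted (an assumption made to keep the sketch short).
    """
    d = len(X[0])
    outputs, weights = [], []
    for q in X:  # each row is independent, so real code does this in parallel
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in X]
        w = softmax(scores)
        weights.append(w)
        # Output = attention-weighted average of all token vectors.
        outputs.append([sum(w[j] * X[j][c] for j in range(len(X))) for c in range(d)])
    return outputs, weights


X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # three toy token vectors
outputs, weights = self_attention(X)
for row in weights:
    print([round(v, 3) for v in row])
```

Each row of `weights` says how much one token attends to every token, including itself. No row depends on another row, which is what makes the computation parallel where an RNN is sequential.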
The model gives you probabilities. You choose how random to be.
After the transformer computes logits (raw scores), a softmax converts them to probabilities. Temperature divides the logits before softmax — low = confident, high = chaotic. Then a sampling strategy picks the winner.
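The pipeline described above (logits, temperature, softmax, sampling) can be sketched as follows. Top-k is used as the sampling strategy purely as an example; the function names and the toy logits are assumptions, not any particular API.

```python
import math
import random


def softmax_with_temperature(logits, temperature=1.0):
    """Divide logits by the temperature, then apply softmax."""
    if temperature <= 0:
        raise ValueError("temperature must be > 0; treat 0 as 'always pick the max'")
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_top_k(logits, temperature=1.0, k=3, rng=random):
    """Sample a token index from the k most probable tokens."""
    probs = softmax_with_temperature(logits, temperature)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    # random.choices renormalises the weights of the surviving candidates.
    return rng.choices(top, weights=[probs[i] for i in top], k=1)[0]


logits = [2.0, 1.0, 0.1]  # toy scores for a 3-token vocabulary
print([round(p, 3) for p in softmax_with_temperature(logits, 0.5)])  # sharper
print([round(p, 3) for p in softmax_with_temperature(logits, 2.0)])  # flatter
print(sample_top_k(logits, temperature=1.0, k=2))
```

Low temperature concentrates probability on the top token; high temperature flattens the distribution so unlikely tokens get picked more often.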
These are the things no one tells you in the blog posts.
Self-attention treats input as a set, not a sequence. Word order only enters because you explicitly add positional encodings. Without them, 'the cat sat' = 'sat cat the.' The original paper used sinusoidal encodings; modern models use learned or rotary (RoPE) variants.
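As an illustration of the sinusoidal variant, here is a minimal sketch following the formula from the original transformer paper: even indices get sin(pos / 10000^(2i/d)), odd indices get the matching cosine. The function name and the toy embedding are assumptions, and learned or rotary encodings work differently.

```python
import math


def sinusoidal_encoding(position, d_model):
    """Sinusoidal positional encoding for one position.

    Even indices hold sin(position / 10000**(i / d_model)), odd indices the
    matching cosine, where i is the even index of the pair.
    """
    enc = []
    for i in range(0, d_model, 2):
        angle = position / (10000 ** (i / d_model))
        enc.append(math.sin(angle))
        if i + 1 < d_model:
            enc.append(math.cos(angle))
    return enc


# The encoding is added to the token embedding, so the same word at a
# different position becomes a different input vector.
embedding = [0.2, -0.1, 0.4, 0.0]  # made-up embedding for one token
for pos in (0, 1, 2):
    pe = sinusoidal_encoding(pos, len(embedding))
    print(pos, [round(e + p, 3) for e, p in zip(embedding, pe)])
```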
70 billion parameters × 2 bytes (fp16) = 140GB. That's your entire 'AI brain' — nothing more than a giant matrix multiplication engine operating on learned weights.
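The arithmetic generalises to other precisions. A small sketch, where `weight_memory_gb` is an illustrative helper and the result covers the weights only: activations and the KV cache used during inference need additional memory on top.

```python
# Bytes needed to store one parameter at common precisions.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1, "int4": 0.5}


def weight_memory_gb(num_params, dtype="fp16"):
    """Memory for the weights alone, in decimal GB (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


for dtype in ("fp32", "fp16", "int8", "int4"):
    print(dtype, weight_memory_gb(70e9, dtype), "GB")
# fp16 gives 140.0 GB, matching the figure above.
```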
Information flows through a residual stream: every layer adds a small update (delta) on top of what came before, not a full replacement. OpenAI has never disclosed GPT-4's exact depth, but open-weight models like Llama 3 70B have 80 layers. As a rough rule, early layers tend to capture syntax and later layers more abstract reasoning.
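The "add a delta, don't replace" idea fits in a few lines. This is a schematic sketch: the layers here are made-up toy functions, and real transformer blocks also include normalisation, attention and MLP sub-layers that are not modelled.

```python
def run_residual_stream(x, layers):
    """Pass vector x through the layers, each one ADDING its output to x."""
    for layer in layers:
        delta = layer(x)                            # the layer's small update
        x = [xi + di for xi, di in zip(x, delta)]   # residual connection
    return x


# Hypothetical toy layers: each returns a small update based on its input.
def nudge_up(x):
    return [0.1 * v for v in x]


def shift_first(x):
    return [0.5] + [0.0] * (len(x) - 1)


print(run_residual_stream([1.0, 2.0], [nudge_up, shift_first, nudge_up]))
```

Because each layer only adds to the stream, a layer that outputs zeros leaves the representation untouched. That is one reason very deep stacks remain trainable.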
Through superposition, models represent more features than dimensions by overlapping them. Features are stored as directions that are nearly, but not exactly, orthogonal, so a single neuron doesn't represent one concept. It takes part in many. This is why interpretability is hard.
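A toy sketch of the idea: three features packed into only two dimensions by spacing their directions 120 degrees apart. This is an illustration of the geometry, not a description of how any particular model stores features, and the helper names are invented.

```python
import math


def feature_directions(n_features):
    """n unit vectors spread evenly around the 2-d circle."""
    return [
        (math.cos(2 * math.pi * k / n_features), math.sin(2 * math.pi * k / n_features))
        for k in range(n_features)
    ]


def encode(active, directions):
    """Sum the directions of the active features into one 2-d vector."""
    return (
        sum(directions[i][0] for i in active),
        sum(directions[i][1] for i in active),
    )


def readout(vec, directions):
    """Dot each feature direction with vec, then clip negatives (ReLU)."""
    return [max(0.0, vec[0] * dx + vec[1] * dy) for dx, dy in directions]


DIRS = feature_directions(3)  # 3 features, 2 dimensions
print(readout(encode([0], DIRS), DIRS))      # only feature 0 reads as active
print(readout(encode([0, 1], DIRS), DIRS))   # two active features interfere
```

With one feature active the readout is clean, because the overlap with the other directions is negative and gets clipped. With two active at once, the readings blur. The trick relies on features rarely being active together, and that overlap is what makes individual neurons hard to interpret.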
Analysts estimate that GPT-4's training run cost somewhere between $63M and $100M+ in compute (SemiAnalysis, 2023). OpenAI has never confirmed a figure. Training means running gradient descent over trillions of tokens on thousands of datacenter GPUs for months (the same analysis describes A100s for GPT-4; H100s came later). Inference is much cheaper, at a few cents per call, but training is a one-time capital bet.
Everything the model 'knows' in a conversation fits in the context window. Past that, it forgets. There's no retrieval, no hard disk — just the current token sequence. Memory is a product feature, not a model feature.
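One simple way an application can keep a long conversation inside the window is to drop the oldest tokens. The sketch below shows that strategy only; `fit_to_context` is a hypothetical helper, and real products may instead summarise, retrieve, or otherwise choose what to keep.

```python
def fit_to_context(tokens, max_tokens):
    """Keep only the most recent max_tokens tokens; older ones are dropped."""
    if max_tokens <= 0:
        return []
    return tokens[-max_tokens:]


history = ["my", "name", "is", "Ada", ".", "what", "is", "my", "name", "?"]
print(fit_to_context(history, 5))
# ['what', 'is', 'my', 'name', '?']  ("Ada" has fallen out of the window)
```

Once "Ada" is outside the window, nothing in the model can recover it. Any "memory" has to be re-inserted into the token sequence by the surrounding application.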
Don't skip these. Day 3 builds directly on them.
Paste a paragraph into the OpenAI tokenizer. Count the tokens. Then paste the same paragraph in Hindi or Arabic. Compare.
platform.openai.com/tokenizer — takes 2 minutes, sticks forever.
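Call the Claude API (or any LLM API) with temperature=0, then with a high temperature, same prompt. Notice the difference. The Claude API accepts temperatures from 0 to 1, so use 1.0 there; OpenAI's API accepts up to 2, so temperature=1.5 works there.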
If you don't have API access, use Claude.ai and notice how responses change on re-runs.