Tokenization: How Models Read Text
Models don’t read sentences; they read tokens: subwords like "in", "gen", "er", "ation", "AI", "##s".
Tokenization affects:
- context window
- reasoning quality
- output stability
- cost of inference
Bad tokenization → broken prompts, hallucinations, cutoff words.
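To make this concrete, here is a minimal sketch using the tiktoken library (a GPT-style BPE tokenizer). The exact splits vary by tokenizer; the "##" prefix shown above is WordPiece notation used by BERT-style tokenizers, not BPE.

```python
import tiktoken

# GPT-style BPE tokenizer; other tokenizers (e.g. BERT's WordPiece, which
# marks continuation pieces like "##s") split the same text differently.
enc = tiktoken.get_encoding("cl100k_base")

text = "Tokenization affects the context window."
ids = enc.encode(text)

# The subword pieces the model actually sees:
print([enc.decode([i]) for i in ids])

# Token count, not character count, is what fills the context window and drives cost:
print(f"{len(text)} characters -> {len(ids)} tokens")
```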
How Generation Works (Mechanically)
Every time the model generates a token, it performs the same loop:
1. Read the input tokens
2. Compute attention weights
3. Output a probability distribution over the vocabulary
4. Sample one token
5. Append it and repeat
This loop happens hundreds to thousands of times per prompt.
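Here is a minimal sketch of that loop, assuming a hypothetical `model()` function that maps the current token sequence to next-token logits (random numbers here, not a real network):

```python
import numpy as np

VOCAB_SIZE = 50
EOS_ID = 0  # hypothetical end-of-sequence token id

def model(tokens):
    # Stand-in for steps 1-3: a real model would read the tokens, compute
    # attention, and return logits; here we just return random scores.
    rng = np.random.default_rng(seed=len(tokens))
    return rng.normal(size=VOCAB_SIZE)

def generate(prompt_tokens, max_new_tokens=20):
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        logits = model(tokens)
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()                                   # softmax -> probability distribution
        next_id = int(np.random.choice(VOCAB_SIZE, p=probs))   # step 4: sample one token
        tokens.append(next_id)                                 # step 5: append it and repeat
        if next_id == EOS_ID:
            break
    return tokens

print(generate([11, 42, 7]))
```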
Sampling Techniques
These control the trade-off between creativity and stability (a sketch of the first three follows the list):
- Temperature
  - Higher → more random
  - Lower → more deterministic
- Top-K
  - Keep only the k most probable tokens
- Top-P (nucleus sampling)
  - Keep tokens until the cumulative probability mass reaches P
- Beam Search
  - Explore multiple candidate sequences in parallel and keep the most probable ones
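A sketch of how temperature, top-k, and top-p reshape the distribution before a token is drawn (toy logits; beam search is a search strategy over whole sequences rather than a per-token filter, so it is omitted here):

```python
import numpy as np

def sample_token(logits, temperature=1.0, top_k=None, top_p=None):
    # Temperature rescales the logits: lower -> sharper (more deterministic),
    # higher -> flatter (more random).
    logits = np.asarray(logits, dtype=np.float64) / max(temperature, 1e-8)
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()

    # Top-K: zero out everything except the k most probable tokens.
    if top_k is not None:
        cutoff = np.sort(probs)[-min(top_k, len(probs))]
        probs = np.where(probs >= cutoff, probs, 0.0)

    # Top-P (nucleus): keep tokens, most probable first, until their
    # cumulative probability mass reaches P (always at least one token).
    if top_p is not None:
        order = np.argsort(probs)[::-1]
        csum = np.cumsum(probs[order])
        n_keep = int(np.searchsorted(csum, top_p) + 1)
        mask = np.zeros_like(probs)
        keep = order[:n_keep]
        mask[keep] = probs[keep]
        probs = mask

    probs /= probs.sum()                           # renormalize after filtering
    return int(np.random.choice(len(probs), p=probs))

# Toy 5-token vocabulary with made-up logits.
print(sample_token([2.0, 1.0, 0.5, 0.1, -1.0], temperature=0.7, top_k=3, top_p=0.9))
```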
If you don’t understand sampling, you won’t understand model behavior.
Why Models Hallucinate
Because the model must always predict a next token (see the toy demo after this list), even when:
- context is missing
- training data is sparse
- retrieval is bad
- the prompt is unclear
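A toy illustration of that mechanical point: even when the distribution is nearly flat (the model has essentially no evidence), sampling still emits one token, which then reads like a confident answer. The words and probabilities below are made up.

```python
import numpy as np

# Made-up, near-uniform distribution over candidate answers: the model
# "doesn't know", but the decoding loop cannot abstain.
vocab = ["Paris", "London", "Berlin", "Madrid"]
probs = np.array([0.26, 0.25, 0.25, 0.24])

print(np.random.choice(vocab, p=probs))  # prints a confident-looking answer anyway
```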