Why Transformers Dominated
Transformers won because they solve three problems older sequence models couldn’t:
- Parallel processing (RNNs processed tokens one at a time, so they were slow; see the sketch after this list)
- Long-context reasoning
- Scalability (more data = better performance)
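As a toy illustration of the parallelism point, here is a plain-numpy sketch contrasting an RNN-style loop, where each step must wait for the previous hidden state, with transformer-style mixing, where all positions are computed in one batched matrix multiply. The shapes and the tanh recurrence are illustrative, not taken from any real model:

```python
# Toy contrast between sequential RNN-style updates and the single batched
# matrix multiply a transformer layer can use (numpy only; shapes and the
# tanh recurrence are illustrative, not taken from any real model).
import numpy as np

T, d = 6, 4                       # sequence length, hidden size
x = np.random.randn(T, d)         # token embeddings

# RNN: each step depends on the previous hidden state, so the loop is serial.
W, U = np.random.randn(d, d), np.random.randn(d, d)
h = np.zeros(d)
for t in range(T):
    h = np.tanh(x[t] @ W + h @ U)

# Transformer-style mixing: every token attends to every other token at once,
# which maps onto parallel hardware (GPUs/TPUs) far better.
scores = x @ x.T / np.sqrt(d)
weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)
mixed = weights @ x               # all positions computed in one shot
```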
Attention Mechanism (QKV Explained Simply)
You don’t need math; you need intuition:
- Query (Q): What the current token wants
- Key (K): What each token offers
- Value (V): The information attached to each token
Attention answers one question: “How much should each token influence the next one?”
Multi-head attention = multiple “perspectives” learning different patterns.
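Here is a minimal numpy sketch of the idea. The projection matrices Wq, Wk, and Wv are random stand-ins for learned weights, and the shapes are arbitrary:

```python
# Minimal numpy sketch of scaled dot-product attention. Wq, Wk, Wv are random
# stand-ins for learned projection matrices; shapes are arbitrary.
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

T, d = 5, 8                                   # tokens, model width
x = np.random.randn(T, d)                     # token embeddings
Wq, Wk, Wv = (np.random.randn(d, d) for _ in range(3))

Q, K, V = x @ Wq, x @ Wk, x @ Wv              # what I want / what I offer / my content
scores = Q @ K.T / np.sqrt(d)                 # how well each query matches each key
weights = softmax(scores)                     # "how much should token j influence token i?"
out = weights @ V                             # weighted mix of the values

# Multi-head attention runs several smaller copies of this in parallel and
# concatenates the results, so each head can learn a different pattern.
```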
Decoder-Only vs Encoder-Decoder
- Decoder-only (GPT-style): best for text generation and reasoning.
- Encoder-decoder (T5-style): best for translation, summarization, and structured tasks.
Modern systems mostly use decoder-only models because they are:
- simpler
- cheaper
- more general-purpose
- easier to scale
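To make the two styles concrete, here is a small loading sketch using the Hugging Face transformers library (assumes transformers and PyTorch are installed; "gpt2" and "t5-small" are just small illustrative checkpoints, not recommendations):

```python
# Loading one model of each style with the Hugging Face `transformers` library
# (assumes `transformers` and PyTorch are installed; "gpt2" and "t5-small" are
# just small illustrative checkpoints).
from transformers import AutoModelForCausalLM, AutoModelForSeq2SeqLM, AutoTokenizer

# Decoder-only (GPT-style): one stack, trained to predict the next token.
gpt_tok = AutoTokenizer.from_pretrained("gpt2")
gpt = AutoModelForCausalLM.from_pretrained("gpt2")
ids = gpt_tok("Transformers won because", return_tensors="pt")
print(gpt_tok.decode(gpt.generate(**ids, max_new_tokens=20)[0]))

# Encoder-decoder (T5-style): an encoder reads the input, a decoder writes the output.
t5_tok = AutoTokenizer.from_pretrained("t5-small")
t5 = AutoModelForSeq2SeqLM.from_pretrained("t5-small")
ids = t5_tok("translate English to German: Attention is all you need.", return_tensors="pt")
print(t5_tok.decode(t5.generate(**ids, max_new_tokens=20)[0], skip_special_tokens=True))
```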
Context Windows
The context window defines how much the model can “remember” at once.
Larger context windows enable:
- document reasoning
- long conversations
- large RAG chunks
- multi-step workflows
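As a toy example of working within a fixed context window, the sketch below chunks a long document into overlapping pieces that each fit a made-up token limit. The whitespace split is a stand-in for a real tokenizer, and CONTEXT_LIMIT is an invented number, not any particular model's window:

```python
# Toy example of fitting a long document into a fixed context window by chunking.
# The whitespace split stands in for a real tokenizer, and CONTEXT_LIMIT is an
# invented number, not any particular model's window.
CONTEXT_LIMIT = 128   # "tokens" the model can attend to at once

def chunk_for_context(text, limit=CONTEXT_LIMIT, overlap=16):
    """Split text into overlapping chunks that each fit the window."""
    tokens = text.split()                       # crude tokenization for illustration
    step = limit - overlap
    return [" ".join(tokens[i:i + limit]) for i in range(0, len(tokens), step)]

chunks = chunk_for_context("word " * 1000)
print(len(chunks), "chunks; largest has", max(len(c.split()) for c in chunks), "tokens")
```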
KV Cache
Critical for speed.
Instead of recomputing attention over the whole sequence for every new token, models cache the key and value tensors from earlier steps and reuse them.
This reduces inference cost by up to 80%.
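Here is a toy sketch of the idea, again in plain numpy with illustrative shapes: without the cache, every step would recompute keys and values for all earlier tokens; with it, each step only projects the newest token and appends to the cache.

```python
# Toy sketch of KV caching during autoregressive decoding (numpy, illustrative
# shapes only). Without the cache, every step would recompute K and V for all
# earlier tokens; with it, each step only projects the newest token.
import numpy as np

d = 8
Wq, Wk, Wv = (np.random.randn(d, d) for _ in range(3))

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

k_cache, v_cache = [], []            # grows by one entry per generated token

def decode_step(x_new):
    """Attend the newest token against everything generated so far."""
    q = x_new @ Wq
    k_cache.append(x_new @ Wk)       # compute K/V for the new token only
    v_cache.append(x_new @ Wv)
    K, V = np.stack(k_cache), np.stack(v_cache)
    w = softmax(q @ K.T / np.sqrt(d))
    return w @ V

for _ in range(4):                   # pretend we generate four tokens
    out = decode_step(np.random.randn(d))
```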
Scaling Laws
More data + more parameters + more compute → predictable performance increases.
This is why companies train gigantic models — the scaling curve rewards them.
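For a concrete feel, here is a hedged sketch of a Chinchilla-style parametric scaling law, L(N, D) = E + A/N^alpha + B/D^beta, where N is parameter count and D is training tokens. The constants below are placeholders for illustration, not fitted values:

```python
# Hedged sketch of a Chinchilla-style parametric scaling law,
# L(N, D) = E + A / N**alpha + B / D**beta, where N is parameter count and
# D is training tokens. The constants below are placeholders, not fitted values.
def predicted_loss(N, D, E=1.7, A=400.0, B=400.0, alpha=0.34, beta=0.28):
    return E + A / N**alpha + B / D**beta

# Scaling model size and data together lowers the predicted loss smoothly and
# predictably, which is exactly why labs keep training bigger models.
for N, D in [(1e9, 2e10), (1e10, 2e11), (1e11, 2e12)]:
    print(f"N={N:.0e}, D={D:.0e} -> loss ~ {predicted_loss(N, D):.2f}")
```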