What happens from prompt → final output
- Tokenize input
- Build attention maps
- Generate probability distribution
- Sample next token
- Repeat until stop condition
- Detokenize to human-readable output
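The loop above can be sketched in a few lines. This is a toy illustration, not a real inference stack: `toy_model`, the five-word vocabulary, and the `<eos>` bias are all invented stand-ins for an actual language model's forward pass.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB = ["<eos>", "hello", "world", "foo", "bar"]  # toy vocabulary (assumption)

def toy_model(token_ids):
    """Stand-in for a real LM forward pass: returns fake logits over VOCAB."""
    logits = rng.normal(size=len(VOCAB))
    logits[0] += len(token_ids) * 0.5  # bias toward <eos> as the sequence grows
    return logits

def generate(prompt_ids, max_new_tokens=20):
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        logits = toy_model(ids)                         # forward pass
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()                            # softmax -> probability distribution
        next_id = int(rng.choice(len(VOCAB), p=probs))  # sample next token
        if next_id == 0:                                # stop condition: <eos>
            break
        ids.append(next_id)
    return " ".join(VOCAB[i] for i in ids)              # "detokenize"

print(generate([1, 2]))  # always starts with "hello world", then sampled tokens
```

A real model replaces `toy_model` with a transformer forward pass and the softmax/sampling step with temperature, top-p, or top-k sampling, but the control flow is the same.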
For image models:
- Start with random noise
- Iteratively denoise using UNet + cross-attention
- Decode latent vectors into pixel space
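The diffusion loop can be sketched the same way. Everything here is a schematic stand-in: `predict_noise` replaces a real UNet (which would also take a text embedding via cross-attention), the single-subtraction update replaces a proper noise schedule, and `decode` replaces a VAE decoder.

```python
import numpy as np

rng = np.random.default_rng(0)

def predict_noise(latent, t):
    """Stand-in for a UNet noise predictor (a real one conditions on a
    text embedding via cross-attention and on the timestep t)."""
    return latent * 0.1  # pretend the predicted noise is a fraction of the latent

def denoise(steps=50, latent_shape=(4, 8, 8)):
    latent = rng.normal(size=latent_shape)   # start from random noise
    for t in reversed(range(steps)):
        eps = predict_noise(latent, t)       # "UNet" forward pass
        latent = latent - eps                # one denoising update (schematic)
    return latent

def decode(latent):
    """Stand-in for the VAE decoder mapping latents into pixel space."""
    pixels = np.tanh(latent.mean(axis=0))    # collapse channels into one toy image
    return ((pixels + 1) / 2 * 255).astype(np.uint8)

image = decode(denoise())
print(image.shape)  # (8, 8)
```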
Latency & Cost Breakdown
You need to know what actually consumes compute:
- Model size (parameter count)
- Sequence length
- Batch size
- KV cache hit ratio
- GPU VRAM
This determines whether your product is fast or unusably slow.
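These factors can be turned into a back-of-envelope memory estimate. The config below is an assumed 7B-class model in fp16; the numbers are illustrative, not measurements of any specific model.

```python
# Assumed 7B-class configuration, fp16 weights and KV cache (all assumptions)
n_layers, n_heads, head_dim = 32, 32, 128
n_params = 7e9
bytes_per = 2  # fp16

def weights_gb():
    """Memory for the model weights alone."""
    return n_params * bytes_per / 1e9

def kv_cache_gb(batch, seq_len):
    """KV cache: 2 tensors (K and V) per layer, each [batch, heads, seq, head_dim]."""
    return 2 * n_layers * batch * n_heads * seq_len * head_dim * bytes_per / 1e9

print(f"weights:  {weights_gb():.1f} GB")                        # 14.0 GB
print(f"kv cache: {kv_cache_gb(batch=8, seq_len=4096):.1f} GB")  # 17.2 GB
```

Note that at batch 8 and 4K context the KV cache already exceeds the weights, which is why sequence length and batch size, not just parameter count, decide whether you fit in VRAM.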