Understanding LLM Inference Internals
Introduction
What makes LLM inference fast (or slow)? This post breaks down the full pipeline, from raw prompt text to the next sampled token.
The Inference Pipeline
1. Tokenization
How text becomes tokens: the tokenizer splits input text into subword units and maps each unit to an integer ID from a fixed vocabulary, typically using learned merge rules (BPE) or a unigram model (SentencePiece).
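To make the mapping concrete, here is a toy greedy longest-match tokenizer in C++. The vocabulary is invented purely for illustration; real tokenizers apply learned merge rules rather than greedy matching, but the output is the same kind of thing: a sequence of integer IDs.

```cpp
// Toy greedy longest-match tokenizer. The vocabulary below is made up
// purely to illustrate the text -> token-ID mapping.
#include <cstdio>
#include <map>
#include <string>
#include <vector>

std::vector<int> tokenize(const std::string& text,
                          const std::map<std::string, int>& vocab) {
    std::vector<int> ids;
    size_t pos = 0;
    while (pos < text.size()) {
        // Try the longest substring first, shrinking until a vocab hit.
        size_t len = text.size() - pos;
        for (; len > 0; --len) {
            auto it = vocab.find(text.substr(pos, len));
            if (it != vocab.end()) {
                ids.push_back(it->second);
                break;
            }
        }
        pos += (len > 0) ? len : 1;  // skip bytes with no vocab entry
    }
    return ids;
}

int main() {
    // Hypothetical subword vocabulary.
    std::map<std::string, int> vocab = {
        {"in", 0}, {"fer", 1}, {"ence", 2}, {"token", 3}, {" ", 4}};
    for (int id : tokenize("token inference", vocab)) printf("%d ", id);
    printf("\n");  // prints: 3 4 0 1 2
}
```

Everything downstream of this step operates only on the IDs; the original string is never seen again.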
2. Embedding Lookup
How tokens become vectors: each token ID selects one row of a learned embedding matrix of shape vocab_size x d_model, so the "lookup" is pure indexing, not computation.
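A minimal sketch of the lookup, with random numbers standing in for learned weights; the only work is computing a row offset into the table.

```cpp
// Embedding lookup is row indexing: token ID i selects row i of a
// vocab_size x d_model matrix. Values are random stand-ins for weights.
#include <cstdio>
#include <cstdlib>
#include <vector>

int main() {
    const int vocab_size = 8, d_model = 4;
    // Flattened [vocab_size x d_model] embedding table.
    std::vector<float> table(vocab_size * d_model);
    for (float& w : table) w = (float)rand() / RAND_MAX - 0.5f;

    int token_id = 3;  // output of the tokenizer
    const float* vec = &table[token_id * d_model];  // the row for this token
    for (int i = 0; i < d_model; ++i) printf("%.3f ", vec[i]);
    printf("\n");
}
```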
3. Transformer Layers
Attention, FFN, and residual connections: each layer adds the output of self-attention back onto its input, then does the same with a feed-forward network, so the layer computes x = x + Attn(x) followed by x = x + FFN(x), with normalization around each sublayer. During generation, each position's keys and values are kept in a KV cache so earlier positions are not recomputed.
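Below is a minimal single-head version of the per-layer computation at decode time: one query vector attends over T cached key/value rows, then passes through a small two-layer FFN, with a residual add after each sublayer. Layer norm, multiple heads, and the Q/K/V/output projections are omitted for brevity, and all values are random stand-ins.

```cpp
// One transformer layer for a single decode position, heavily simplified.
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <cstdlib>
#include <vector>

const int T = 3, D = 4;  // cached positions, model width

// Scaled dot-product attention of one query against T cached
// key/value rows (the "KV cache").
std::vector<float> attend(const std::vector<float>& q,
                          const std::vector<float>& K,
                          const std::vector<float>& V) {
    std::vector<float> scores(T), out(D, 0.0f);
    float maxs = -1e30f, denom = 0.0f;
    for (int t = 0; t < T; ++t) {
        float dot = 0.0f;
        for (int i = 0; i < D; ++i) dot += q[i] * K[t * D + i];
        scores[t] = dot / std::sqrt((float)D);  // scale by sqrt(d)
        maxs = std::max(maxs, scores[t]);
    }
    for (int t = 0; t < T; ++t) {  // softmax over positions
        scores[t] = std::exp(scores[t] - maxs);
        denom += scores[t];
    }
    for (int t = 0; t < T; ++t)
        for (int i = 0; i < D; ++i) out[i] += scores[t] / denom * V[t * D + i];
    return out;
}

int main() {
    auto rnd = [] { return (float)rand() / RAND_MAX - 0.5f; };
    std::vector<float> x(D), K(T * D), V(T * D);
    for (auto* v : {&x, &K, &V})
        for (float& f : *v) f = rnd();

    std::vector<float> a = attend(x, K, V);
    for (int i = 0; i < D; ++i) x[i] += a[i];  // first residual add

    // Two-layer FFN: expand to H, ReLU, project back. Random weights.
    const int H = 8;
    std::vector<float> W1(H * D), W2(D * H), h(H, 0.0f), y(D, 0.0f);
    for (float& f : W1) f = rnd();
    for (float& f : W2) f = rnd();
    for (int j = 0; j < H; ++j) {
        for (int i = 0; i < D; ++i) h[j] += W1[j * D + i] * x[i];
        h[j] = std::max(0.0f, h[j]);  // ReLU
    }
    for (int i = 0; i < D; ++i)
        for (int j = 0; j < H; ++j) y[i] += W2[i * H + j] * h[j];
    for (int i = 0; i < D; ++i) x[i] += y[i];  // second residual add

    for (float f : x) printf("%.3f ", f);
    printf("\n");
}
```

A full model stacks dozens of these layers; the stream x passes through all of them before the final projection back to vocabulary logits.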
4. Sampling
Temperature, top-k, and top-p: how the model's output logits are turned into a probability distribution and the next token is drawn from it.
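A self-contained sketch of the three filters applied in sequence: temperature rescales the logits, top-k keeps only the k most likely tokens, and top-p shrinks the candidate set further to the smallest prefix whose cumulative probability reaches p. The logits and hyperparameter values are illustrative.

```cpp
// Temperature / top-k / top-p sampling over a toy logit vector.
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <numeric>
#include <random>
#include <vector>

int sample(std::vector<float> logits, float temperature, int top_k,
           float top_p, std::mt19937& rng) {
    // 1. Temperature: <1 sharpens the distribution, >1 flattens it.
    for (float& l : logits) l /= temperature;

    // Sort token indices by logit, descending.
    std::vector<int> idx(logits.size());
    std::iota(idx.begin(), idx.end(), 0);
    std::sort(idx.begin(), idx.end(),
              [&](int a, int b) { return logits[a] > logits[b]; });

    // Softmax in sorted order (subtract the max for stability).
    std::vector<float> probs(idx.size());
    float denom = 0.0f;
    for (size_t i = 0; i < idx.size(); ++i) {
        probs[i] = std::exp(logits[idx[i]] - logits[idx[0]]);
        denom += probs[i];
    }
    for (float& p : probs) p /= denom;

    // 2. Top-k: keep only the k most likely tokens.
    size_t keep = std::min((size_t)top_k, probs.size());

    // 3. Top-p (nucleus): shrink further to the smallest prefix whose
    //    cumulative probability reaches top_p.
    float cum = 0.0f;
    for (size_t i = 0; i < keep; ++i) {
        cum += probs[i];
        if (cum >= top_p) { keep = i + 1; break; }
    }

    // Renormalize the survivors and draw one.
    float total = std::accumulate(probs.begin(), probs.begin() + keep, 0.0f);
    float r = std::uniform_real_distribution<float>(0.0f, total)(rng);
    for (size_t i = 0; i < keep; ++i) {
        r -= probs[i];
        if (r <= 0.0f) return idx[i];
    }
    return idx[keep - 1];
}

int main() {
    std::mt19937 rng(42);
    std::vector<float> logits = {2.0f, 1.0f, 0.5f, -1.0f, -3.0f};
    for (int n = 0; n < 5; ++n)
        printf("%d ", sample(logits, 0.8f, 3, 0.9f, rng));
    printf("\n");
}
```

Setting top_k to 1 recovers greedy decoding; raising the temperature toward 1 and widening top_p trades determinism for diversity.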
Code Walkthrough
The pieces above can be strung together into a decode loop. The sketch below shows that shape with a random stand-in for the model's forward pass; it follows the same tokenize, evaluate, sample structure as a minimal engine in the spirit of llama.cpp, but it is not that code and assumes no particular API.
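```cpp
// End-to-end shape of a decode loop, with a random stand-in for the
// model's forward pass. Real engines differ in API but follow the same
// structure: process the prompt once, then feed one token at a time.
#include <cstdio>
#include <random>
#include <vector>

const int VOCAB = 16;

// Stand-in for a full forward pass: embedding, transformer layers, and
// the output projection, collapsed here into random logits.
std::vector<float> forward(const std::vector<int>& /*context*/,
                           std::mt19937& rng) {
    std::uniform_real_distribution<float> d(-1.0f, 1.0f);
    std::vector<float> logits(VOCAB);
    for (float& l : logits) l = d(rng);
    return logits;
}

int main() {
    std::mt19937 rng(0);
    std::vector<int> tokens = {5, 9, 2};  // pretend-tokenized prompt

    for (int step = 0; step < 8; ++step) {
        // Forward pass over the context (a real engine would reuse its
        // KV cache here instead of recomputing from scratch).
        std::vector<float> logits = forward(tokens, rng);

        // Greedy decoding: take the highest-logit token. Swap in the
        // temperature/top-k/top-p sampler from the previous section
        // for non-deterministic output.
        int next = 0;
        for (int i = 1; i < VOCAB; ++i)
            if (logits[i] > logits[next]) next = i;

        tokens.push_back(next);
        printf("%d ", next);
    }
    printf("\n");
}
```

In a real engine, the forward call reuses the KV cache built while processing the prompt, so each decode step pays for one new position rather than the whole context.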
Conclusion
Key takeaways about inference performance: tokenization, embedding lookup, and sampling are cheap, and the transformer layers dominate the cost. Because each generated token must read the model's weights and the KV cache, decoding speed is typically bound by memory bandwidth rather than raw compute.
Have thoughts or questions? Reach out on X.