Week 2: Tiny vLLM

In Week 2 of the course, we will focus on building serving infrastructure for the Qwen2 model. Essentially, this means creating a minimal version of the vLLM project from scratch. By the end of the week, you’ll be able to serve the Qwen2 model efficiently on your Apple Silicon device using the infrastructure we’ve built together.

What We’ll Cover

  • Key-value cache implementation (see the sketch after this list)
  • C++/Metal kernels
    • Implementing a quantized matmul kernel (a Python reference sketch follows)
    • Implementing a flash attention kernel (a single-query sketch follows)
    • Note: This week, we won't focus on performance optimization. The kernels you build will likely be around 10x slower than the MLX implementations; optimizing them is left as an exercise.
  • Model serving infrastructure
    • Implementing chunked prefill (sketched below)
    • Implementing continuous batching (sketched below)
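
To make these topics concrete, minimal sketches of each follow; all of them are illustrative, not the course's reference solutions. First, the key-value cache: during decoding, each step computes keys and values only for the newly generated token and appends them to a per-layer cache, so earlier tokens are never recomputed. The class name, `update` signature, and (batch, n_kv_heads, seq_len, head_dim) layout here are assumptions, not the course's actual API.

```python
import mlx.core as mx

class KVCache:
    """Append-only K/V cache for one attention layer (illustrative sketch).

    Assumes keys/values of shape (batch, n_kv_heads, seq_len, head_dim);
    the course's actual layout and API may differ.
    """

    def __init__(self):
        self.keys = None
        self.values = None

    def update(self, keys: mx.array, values: mx.array):
        # Append this step's K/V along the sequence axis (axis=2) and return
        # the full cache, so attention can attend over all previous tokens
        # without recomputing their keys and values.
        if self.keys is None:
            self.keys, self.values = keys, values
        else:
            self.keys = mx.concatenate([self.keys, keys], axis=2)
            self.values = mx.concatenate([self.values, values], axis=2)
        return self.keys, self.values
```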
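
For the quantized matmul kernel, a slow Python reference makes a handy correctness oracle for the Metal version: dequantize each group with its affine scale and bias, then run a plain matmul. For clarity, `w_q` below stores one integer per element; MLX's real format packs several low-bit values into each machine word.

```python
import mlx.core as mx

def quantized_matmul_ref(x, w_q, scales, biases, group_size=64):
    # w_q: (out_dim, in_dim), one integer per element (unpacked, for clarity);
    # scales/biases: (out_dim, in_dim // group_size), one affine pair per group.
    out_dim, in_dim = w_q.shape
    w = w_q.reshape(out_dim, in_dim // group_size, group_size).astype(mx.float32)
    w = w * mx.expand_dims(scales, -1) + mx.expand_dims(biases, -1)  # per-group dequant
    return x @ w.reshape(out_dim, in_dim).T  # plain matmul on dequantized weights
```

Comparing the Metal kernel's output against a reference like this on random inputs catches most indexing mistakes.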
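
The flash attention kernel is built around the online-softmax recurrence: process K/V in blocks while carrying a running maximum, normalizer, and weighted sum, so the full seq_len x seq_len score matrix is never materialized. Here is that recurrence for a single query (the function name and block size are illustrative; the real kernel also tiles over queries and runs on the GPU):

```python
import mlx.core as mx

def flash_attention_one_query(q, keys, values, block_size=32):
    # q: (head_dim,); keys/values: (seq_len, head_dim).
    scale = q.shape[-1] ** -0.5
    m = mx.array(-float("inf"))       # running max of scores seen so far
    l = mx.array(0.0)                 # running softmax normalizer
    acc = mx.zeros(values.shape[-1])  # running weighted sum of values
    for start in range(0, keys.shape[0], block_size):
        k_blk = keys[start:start + block_size]
        v_blk = values[start:start + block_size]
        scores = (k_blk @ q) * scale                # (block,)
        m_new = mx.maximum(m, scores.max())
        correction = mx.exp(m - m_new)              # rescale old accumulators
        p = mx.exp(scores - m_new)
        l = l * correction + p.sum()
        acc = acc * correction + p @ v_blk
        m = m_new
    return acc / l
```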
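
On the serving side, chunked prefill splits a long prompt into fixed-size chunks so that one huge prefill can't monopolize the device, and decode steps of other requests can be interleaved between chunks. `model(chunk, cache)` below is a hypothetical interface for a forward pass that appends the chunk's K/V to the cache:

```python
def chunked_prefill(model, cache, prompt_tokens, chunk_size=128):
    # model(tokens, cache) is a hypothetical forward pass that consumes a list
    # of token ids, appends their K/V entries to `cache`, and returns logits.
    logits = None
    for start in range(0, len(prompt_tokens), chunk_size):
        chunk = prompt_tokens[start:start + chunk_size]
        logits = model(chunk, cache)  # other requests can decode between chunks
    return logits  # the last position's logits predict the first new token
```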
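
Continuous batching keeps the decode batch full at the request level: a finished request leaves the batch immediately and a waiting request takes its slot, rather than the whole batch draining before new work is admitted. A toy scheduling step might look like this, with `decode_one_token` as a hypothetical batched decode:

```python
from collections import deque
from dataclasses import dataclass, field

@dataclass
class Request:
    prompt: list[int]
    generated: list[int] = field(default_factory=list)
    done: bool = False

def schedule_step(running: list[Request], waiting: deque, max_batch: int) -> None:
    running[:] = [r for r in running if not r.done]  # evict finished requests now
    while waiting and len(running) < max_batch:      # backfill free slots from queue
        running.append(waiting.popleft())
    # decode_one_token(running)  # hypothetical: one decode step for the whole batch
```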

Additionally, the repo includes skeleton code for the Qwen3 model. If your device supports the bfloat16 data type (note: M1 chips do not), you’re encouraged to try implementing it and experiment with the Qwen3-series models as well.

Your feedback is greatly appreciated. You're welcome to join our Discord Community.
Found an issue? Create an issue / pull request on github.com/skyzh/tiny-llm.
tiny-llm-book © 2025 by Alex Chi Z is licensed under CC BY-NC-SA 4.0.