Week 2: Tiny vLLM
In Week 2 of the course, we will focus on building serving infrastructure for the Qwen3 model. Essentially, this means creating a minimal version of the vLLM project from scratch. By the end of the week, you’ll be able to serve the Qwen3 model efficiently on your Apple Silicon device using the infrastructure we’ve built together.
What We’ll Cover
- Key-value cache implementation
- C++/Metal kernels
  - Implementing a quantized matmul kernel
  - Implementing a flash attention kernel
  - Note: This week, we won’t focus on performance optimization. The kernels you build will likely be around 10x slower than MLX implementations. Optimizing them will be left as an exercise.
- Model serving infrastructure
  - Implementing chunked prefill
  - Implementing continuous batching
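To make the last two topics concrete, here is a minimal sketch of how chunked prefill and continuous batching can fit together in one scheduling loop. It is an illustration under stated assumptions, not the course's implementation: `Request`, `fake_decode`, `serve`, `batch_size`, and `prefill_chunk` are hypothetical names, and the model forward pass is replaced by a stand-in that returns a dummy token.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Request:
    """Hypothetical request record (not the course's actual class)."""
    name: str
    prompt: list[int]
    max_new_tokens: int
    prefilled: int = 0  # number of prompt tokens already processed
    generated: list[int] = field(default_factory=list)

    @property
    def prefill_done(self) -> bool:
        return self.prefilled >= len(self.prompt)

    @property
    def finished(self) -> bool:
        return len(self.generated) >= self.max_new_tokens


def fake_decode(req: Request) -> int:
    # Stand-in for a model forward pass: returns a dummy token id.
    return len(req.prompt) + len(req.generated)


def serve(requests, batch_size=2, prefill_chunk=4):
    waiting = deque(requests)
    slots = [None] * batch_size  # fixed number of concurrent requests
    finish_order = []
    steps = 0
    while waiting or any(s is not None for s in slots):
        # Continuous batching: refill free slots as soon as they open up,
        # instead of waiting for the whole batch to finish.
        for i in range(batch_size):
            if slots[i] is None and waiting:
                slots[i] = waiting.popleft()
        for i, req in enumerate(slots):
            if req is None:
                continue
            if not req.prefill_done:
                # Chunked prefill: process at most `prefill_chunk` prompt
                # tokens per step so long prompts do not stall decoding.
                req.prefilled = min(req.prefilled + prefill_chunk, len(req.prompt))
            else:
                req.generated.append(fake_decode(req))
                if req.finished:
                    finish_order.append(req.name)
                    slots[i] = None  # free the slot for the next request
        steps += 1
    return finish_order, steps


if __name__ == "__main__":
    reqs = [
        Request("a", prompt=list(range(10)), max_new_tokens=2),
        Request("b", prompt=list(range(3)), max_new_tokens=1),
        Request("c", prompt=list(range(2)), max_new_tokens=1),
    ]
    print(serve(reqs))
```

In this toy run, the short request `b` finishes while `a` is still prefilling, and `c` takes over the freed slot immediately. That behavior is the point of combining the two techniques.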
This week continues with Qwen3 as the main model. The serving code uses the official Qwen3 MLX 4-bit model files, preserves their bfloat16 tensors, and builds the KV cache, custom kernels, and batching path around that model family.
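As a rough picture of what the KV cache does, the sketch below stores the keys and values of every processed token so that each decode step only computes attention inputs for the new token. It is a simplified assumption-laden illustration: `TinyKVCache` and its methods are hypothetical names, and plain Python lists stand in for the MLX tensors that a real cache would concatenate along the sequence axis.

```python
class TinyKVCache:
    """Hypothetical per-layer cache; each entry is one token's key or value vector."""

    def __init__(self):
        self.keys = []
        self.values = []

    @property
    def offset(self) -> int:
        # Number of tokens cached so far, i.e. the position of the next token.
        return len(self.keys)

    def update(self, new_keys, new_values):
        # Append keys/values for the newly processed tokens and return the
        # full history, which is what attention needs for the current step.
        if len(new_keys) != len(new_values):
            raise ValueError("keys and values must cover the same tokens")
        self.keys.extend(new_keys)
        self.values.extend(new_values)
        return self.keys, self.values


if __name__ == "__main__":
    cache = TinyKVCache()
    # Prefill: three prompt tokens are processed at once.
    cache.update([[0.1], [0.2], [0.3]], [[1.0], [2.0], [3.0]])
    # Decode: one new token per step reuses the cached history.
    keys, values = cache.update([[0.4]], [[4.0]])
    print(cache.offset, len(keys), len(values))
```

The design choice to return the full history from `update` keeps the attention call site simple: it always receives every key and value up to the current position.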
Your feedback is greatly appreciated. You are welcome to join our Discord community.
Found an issue? Create an issue / pull request on github.com/skyzh/tiny-llm.
tiny-llm-book © 2025 by Alex Chi Z is licensed under CC BY-NC-SA 4.0.