Week 1: From Matmul to Text
In this week, we will start from the basic matrix operations and see how those these matrix manipulations can turn the Qwen2 model parameters into a model that generates text. We will implement the neural network layers used in the Qwen2 model using mlx's matrix APIs.
We will use the Qwen2-7B-Instruct model for this week. As we need to dequantize the model parameters, the 4GB model needs 20GB of memory in week 1. If you do not have enough memory, you can consider using the smaller 0.5B model (we do not have infra to test it so you need to figure out things on your own unfortunately).
What We will Cover
- Attention, Multi-Head Attention, and Grouped/Multi Query Attention
- Positional Embeddings and RoPE
- Put the attention layers together and implement the whole Transformer block
- Implement the MLP layer and the whole Transformer model
- Load the Qwen2 model parameters and generate text
What We will Not Cover
To make the journey as interesting as possible, we will skip a few things for now:
- How to quantize/dequantize a model -- that will be part of week 2. The Qwen2 model is quantized so we will need to dequantize them before we can use them in our layer implementations.
- Actually we still used some APIs other than matrix manipulations -- like softmax, exp, log, etc. But they are simple and not implementing them would not affect the learning experience.
- Tokenizer -- we will not implement the tokenizer from scratch. We will use the
mlx_lm
tokenizer to tokenize the input. - Loading the model weights -- I don't think it's an interesting thing to learn how to decode those tensor dump files, so
we will use the
mlx_lm
to load the model and steal the weights from the loaded model into our layer implementations.
Qwen2 Models
You can try the Qwen2 model with MLX/vLLM. You can read the blog post below to have some idea of what we will build within this course. At the end of this week, we will be able to chat with the model -- that is to say, use Qwen2 to generate text, as a casual language model.
The reference implementation of the Qwen2 model can be found in huggingface transformers, vLLM, and mlx-lm. You may utilize these resources to better understand the internals of the model and what we will implement in this week.
📚 Readings
- Qwen2.5: A Party of Foundation Models!
- Key Concepts of the Qwen2 Model
- Huggingface Transformers - Qwen2
- vLLM Qwen2
- mlx-lm Qwen2
Your feedback is greatly appreciated. Welcome to join our Discord Community.
Found an issue? Create an issue / pull request on github.com/skyzh/tiny-llm.
tiny-llm-book © 2025 by Alex Chi Z is licensed under CC BY-NC-SA 4.0.