Week 1: From Matmul to Text
This book is not complete and this chapter is not finalized yet. We might switch to Qwen3 in the final version of the course.
This week, we will start from basic matrix operations and see how these matrix manipulations can turn the Qwen2 model parameters into a model that generates text. We will implement the neural network layers used in the Qwen2 model using mlx's matrix APIs.
We will use the Qwen2-7B-Instruct model for this week. Because we need to dequantize the model parameters, the 4GB download requires about 20GB of memory in week 1. If you do not have enough memory, consider using the smaller 0.5B model.
The MLX version of the Qwen2-7B-Instruct model we downloaded in the setup is an int4 quantized version of the original bfloat16 model.
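If you want to sanity-check your setup before diving in, you can load the model with `mlx_lm` and generate a few tokens. A minimal sketch, assuming the `mlx-community/Qwen2-7B-Instruct-4bit` repo name (substitute whatever model path you downloaded in the setup):

```python
from mlx_lm import load, generate

# Hypothetical repo name -- use the model path from your setup.
model, tokenizer = load("mlx-community/Qwen2-7B-Instruct-4bit")
print(generate(model, tokenizer, prompt="hello", max_tokens=32))
```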
What We Will Cover
- Attention, Multi-Head Attention, and Grouped/Multi-Query Attention (a minimal attention sketch follows this list)
- Positional Embeddings and RoPE
- Put the attention layers together and implement the whole Transformer block
- Implement the MLP layer and the whole Transformer model
- Load the Qwen2 model parameters and generate text
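To make these topics concrete, here is a minimal sketch of scaled dot-product attention written with MLX's matrix APIs. The function name and the additive-mask convention are our own choices; week 1 will build this up properly, head by head.

```python
import mlx.core as mx

def scaled_dot_product_attention(q, k, v, mask=None):
    # q, k, v: (..., seq_len, head_dim)
    scale = q.shape[-1] ** -0.5
    scores = (q * scale) @ mx.swapaxes(k, -1, -2)  # (..., seq_len, seq_len)
    if mask is not None:
        scores = scores + mask  # additive mask, e.g. -inf above the diagonal
    return mx.softmax(scores, axis=-1) @ v
```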
What We Will Not Cover
To make the journey as interesting as possible, we will skip a few things for now:
- How to quantize/dequantize a model -- that will be part of week 2. The Qwen2 model is quantized, so we will need to dequantize its parameters before we can use them in our layer implementations.
- We will still use a few APIs beyond matrix manipulation -- like softmax, exp, and log. They are simple enough that not implementing them from scratch will not affect the learning experience.
- Tokenizer -- we will not implement the tokenizer from scratch. We will use the `mlx_lm` tokenizer to tokenize the input.
- Loading the model weights -- I don't think it's interesting to learn how to decode those tensor dump files, so we will use `mlx_lm` to load the model and steal the weights from the loaded model into our layer implementations.
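For reference, here is one way to pull the weights out of a loaded `mlx_lm` model. A minimal sketch, assuming the same hypothetical repo name as above:

```python
from mlx.utils import tree_flatten
from mlx_lm import load

# Hypothetical repo name -- use the model path from your setup.
model, _ = load("mlx-community/Qwen2-7B-Instruct-4bit")

# Flatten the nested parameter tree into {dotted.name: mx.array}.
weights = dict(tree_flatten(model.parameters()))
for name in list(weights)[:5]:
    print(name, weights[name].shape)
```

The dotted names (e.g. `model.layers.0.self_attn.q_proj.weight`) tell you which of our layer implementations each tensor belongs to.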
Basic Matrix APIs
Although MLX does not offer an introductory guide for beginners, its Python API is designed to be highly compatible with NumPy. To get started, you can refer to NumPy: The Absolute Basics for Beginners to learn essential matrix operations.
You can also refer to the MLX Operations API for more details.
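Because the APIs mirror NumPy, the basics look like this -- a minimal sketch with arbitrary array values:

```python
import mlx.core as mx

a = mx.array([[1.0, 2.0], [3.0, 4.0]])
b = mx.ones((2, 2))

c = a @ b                   # matrix multiplication, same as mx.matmul(a, b)
d = mx.softmax(a, axis=-1)  # row-wise softmax
e = a.T                     # transpose
print(c, d, e, sep="\n")
```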
Qwen2 Models
You can try the Qwen2 model with MLX or vLLM, and read the blog posts below to get an idea of what we will build in this course. By the end of this week, we will be able to chat with the model -- that is, use Qwen2 to generate text as a causal language model.
The reference implementation of the Qwen2 model can be found in Hugging Face Transformers, vLLM, and mlx-lm. You can use these resources to better understand the internals of the model and what we will implement this week.
📚 Readings
- Qwen2.5: A Party of Foundation Models!
- Key Concepts of the Qwen2 Model
- Huggingface Transformers - Qwen2
- vLLM Qwen2
- mlx-lm Qwen2
- Qwen2 Technical Report
- Qwen2.5 Technical Report
Your feedback is greatly appreciated. You are welcome to join our Discord Community.
Found an issue? Create an issue / pull request on github.com/skyzh/tiny-llm.
tiny-llm-book © 2025 by Alex Chi Z is licensed under CC BY-NC-SA 4.0.