Preface
This course is designed for systems engineers who want to understand how LLMs work.
As a systems engineer, I always wonder how things work internally and how to optimize them. I had a hard time figuring out how LLM serving actually works. Most of the open-source projects that serve LLMs are highly optimized with CUDA kernels and other low-level optimizations, and it is not easy to see the whole picture from a codebase of 100k lines of code. Therefore, I decided to implement an LLM serving project from scratch -- with only matrix manipulation APIs -- so that I can understand what it takes to load the LLM model parameters and do the math magic to generate text.
You can think of this course as an LLM version of the needle project from the CMU Deep Learning Systems course.
Prerequisites
You should have some experience with the basics of deep learning and have some idea of how PyTorch works. Some recommended resources are:
- CMU Intro to Machine Learning -- this course teaches you the basics of machine learning
- CMU Deep Learning Systems -- this course teaches you how to build PyTorch from scratch
Environment Setup
This course uses MLX, an array/machine-learning library for Apple Silicon. Nowadays it is much easier to get an Apple Silicon device than an NVIDIA GPU. In theory you could also do this course with PyTorch or NumPy, but we just don't have the test infra to support them. We test your implementation against PyTorch's CPU implementation and MLX's implementation to ensure correctness.
Course Structure
This course is divided into 3 weeks. We will serve the Qwen2-7B-Instruct model and optimize it throughout the course.
- Week 1: serve Qwen2 with purely matrix manipulation APIs. Just Python.
- Week 2: optimizations, implement C++/Metal custom kernels to make the model run faster.
- Week 3: more optimizations, batch the requests to serve the model with high throughput.
How to Use This Book
The thing you are reading right now is the tiny-llm book. It is designed more like a guidebook than a textbook
that explains everything from scratch. In this course, we provide the materials from the Internet that the author(s)
found useful when implementing the tiny-llm project. The Internet does a better job of explaining the concepts, and I
do not think it is necessary to repeat everything here. Think of this as a guide (a list of tasks) plus some hints!
We will also unify the terminology of the Internet materials so that it is easier to map them to the codebase.
For example, we use a unified set of dimension symbols for the tensors, so you do not need to figure out what H, L, or E stands for, or what the dimensions of the matrices passed into a function are.
About the Authors
This course is created by Chi and Connor.
Chi is a systems software engineer at Neon (now acquired by Databricks), focusing on storage systems. Fascinated by the vibe of large language models (LLMs), he created this course to explore how LLM inference works.
Connor is a software engineer at PingCAP, developing the TiKV distributed key-value database. Curious about the internals of LLMs, he joined this course to practice how to build a high-performance LLM serving system from scratch, and contributed to building the course for the community.
Community
You may join skyzh's Discord server and study with the tiny-llm community.
Get Started
Now, you can start to set up the environment following the instructions in Setting Up the Environment and begin your journey to build tiny-llm!
Your feedback is greatly appreciated. Welcome to join our Discord Community.
Found an issue? Create an issue / pull request on github.com/skyzh/tiny-llm.
tiny-llm-book Β© 2025 by Alex Chi Z is licensed under CC BY-NC-SA 4.0.
Setting Up the Environment
To follow along with this course, you will need a Macintosh device with Apple Silicon. We manage the codebase with pdm.
Install pdm
Please follow the official guide to install pdm.
Clone the Repository
git clone https://github.com/skyzh/tiny-llm
The repository is organized as follows:
src/tiny_llm -- your implementation
src/tiny_llm_week1_ref -- reference implementation of week 1
tests/ -- unit tests for your implementation
tests_ref_impl_week1/ -- unit tests for the reference implementation of week 1
book/ -- the book
We provide all reference implementations and you can refer to them if you get stuck in the course.
Install Dependencies
cd tiny-llm
pdm install -v # this will automatically create a virtual environment and install all dependencies
Check the Installation
pdm run check-installation
# The reference solution should pass all the *week 1* tests
pdm run test-refsol -- -- -k week_1
Run Unit Tests
Your code is in src/tiny_llm. You can run the unit tests with:
pdm run test
Download the Model Parameters
We will use the Qwen2-7B-Instruct model for this course. Loading the model parameters takes ~20GB of memory in week 1. If you do not have enough memory, you can consider using the smaller 0.5B model.
Follow the guide on this page to install the Hugging Face CLI.
The model parameters are hosted on Hugging Face. Once you have authenticated the CLI with your credentials, you can download them with:
huggingface-cli login
huggingface-cli download Qwen/Qwen2-7B-Instruct-MLX
Then, you can run:
pdm run main --solution ref --loader week1
It should load the model and print some text.
In week 2, we will write some kernels in C++/Metal, and we will need to set up additional tools for that. We will cover it later.
Week 1: From Matmul to Text
In this week, we will start from basic matrix operations and see how these matrix manipulations can turn the Qwen2 model parameters into a model that generates text. We will implement the neural network layers used in the Qwen2 model using MLX's matrix APIs.
We will use the Qwen2-7B-Instruct model for this week. As we need to dequantize the model parameters, the model, which is a 4GB download, needs about 20GB of memory in week 1. If you do not have enough memory, you can consider using the smaller 0.5B model.
The MLX version of the Qwen2-7B-Instruct model we downloaded in the setup is an int4 quantized version of the original bfloat16 model.
What We will Cover
- Attention, Multi-Head Attention, and Grouped/Multi Query Attention
- Positional Embeddings and RoPE
- Put the attention layers together and implement the whole Transformer block
- Implement the MLP layer and the whole Transformer model
- Load the Qwen2 model parameters and generate text
What We will Not Cover
To make the journey as interesting as possible, we will skip a few things for now:
- How to quantize/dequantize a model -- that will be part of week 2. The Qwen2 model is quantized, so we will need to dequantize the parameters before we can use them in our layer implementations.
- Actually, we still use some APIs other than matrix manipulations -- like softmax, exp, log, etc. But they are simple, and not implementing them does not affect the learning experience.
- Tokenizer -- we will not implement the tokenizer from scratch. We will use the mlx_lm tokenizer to tokenize the input.
- Loading the model weights -- I don't think it's interesting to learn how to decode those tensor dump files, so we will use mlx_lm to load the model and steal the weights from the loaded model into our layer implementations.
Basic Matrix APIs
Although MLX does not offer an introductory guide for beginners, its Python API is designed to be highly compatible with NumPy. To get started, you can refer to NumPy: The Absolute Basics for Beginners to learn essential matrix operations.
You can also refer to the MLX Operations API for more details.
Qwen2 Models
You can try the Qwen2 model with MLX/vLLM. You can read the blog post below to get an idea of what we will build within this course. At the end of this week, we will be able to chat with the model -- that is to say, use Qwen2 to generate text as a causal language model.
The reference implementations of the Qwen2 model can be found in Hugging Face Transformers, vLLM, and mlx-lm. You may utilize these resources to better understand the internals of the model and what we will implement this week.
Readings
- Qwen2.5: A Party of Foundation Models!
- Key Concepts of the Qwen2 Model
- Huggingface Transformers - Qwen2
- vLLM Qwen2
- mlx-lm Qwen2
- Qwen2 Technical Report
- Qwen2.5 Technical Report
Week 1 Day 1: Attention and Multi-Head Attention
In day 1, we will implement the basic attention layer and the multi-head attention layer. Attention layers take an input sequence and focus on different parts of the sequence when generating the output. Attention layers are the key building blocks of Transformer models.
Reading: Transformer Architecture
We use the Qwen2 model for text generation. The model is a decoder-only model. The input of the model is a sequence of token embeddings. The output of the model is the most likely next token ID.
Reading: LLM Inference, the Decode Phase
Back to the attention layer. The attention layer takes a query, a key, and a value. In a classic implementation, all of them have the same shape: N.. x L x D.

N.. is zero or more dimensions for batches. Within each batch, L is the sequence length and D is the dimension of the embedding for a given head in the sequence.

So, for example, if we have a sequence of 1024 tokens, where each token has a 512-dimensional embedding (head_dim), we will pass a tensor of the shape N.. x 1024 x 512 to the attention layer.
Task 1: Implement scaled_dot_product_attention_simple
In this task, we will implement the scaled dot product attention function. We assume the input tensors (Q, K, V) have the same dimensions. In the next few chapters, we will support more variants of attentions that might not have the same dimensions for all tensors.
src/tiny_llm/attention.py
Readings
- Annotated Transformer
- PyTorch Scaled Dot Product Attention API (assume enable_gqa=False, dim_k=dim_v=dim_q and H_k=H_v=H_q)
- MLX Scaled Dot Product Attention API (assume dim_k=dim_v=dim_q and H_k=H_v=H_q)
- Attention is All You Need
Implement scaled_dot_product_attention following the attention function below. The function takes key, value, and query of the same dimensions, and an optional mask matrix M.

attention(Q, K, V) = softmax(scale * Q K^T + M) V

Note that scale is the scale factor. The user might specify their own scale factor or use the default one, 1/sqrt(D).
L is seq_len, in PyTorch API it's S (source len)
D is head_dim
key: N.. x L x D
value: N.. x L x D
query: N.. x L x D
output: N.. x L x D
scale = 1/sqrt(D) if not specified
You may use the softmax provided by MLX for now and implement it yourself later in week 2.
Because we are always using the attention layer within the multi-head attention layer, the actual tensor shape when serving the model will be:
key: 1 x H x L x D
value: 1 x H x L x D
query: 1 x H x L x D
output: 1 x H x L x D
mask: 1 x H x L x L
…though the attention layer only cares about the last two dimensions. The test cases will cover arbitrary shapes for the batching dimensions.
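As a rough sketch of what this task boils down to in MLX (a minimal version with assumed parameter names; the actual signature in src/tiny_llm/attention.py may differ slightly):

import math
import mlx.core as mx

def scaled_dot_product_attention_simple(
    query: mx.array,
    key: mx.array,
    value: mx.array,
    scale: float | None = None,
    mask: mx.array | None = None,
) -> mx.array:
    # query/key/value: N.. x L x D
    factor = scale if scale is not None else 1.0 / math.sqrt(query.shape[-1])
    # Attention scores: N.. x L x L
    scores = mx.matmul(query, mx.swapaxes(key, -2, -1)) * factor
    if mask is not None:
        scores = scores + mask
    # Softmax over the key axis, then weight the values: N.. x L x D
    return mx.matmul(mx.softmax(scores, axis=-1), value)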
At the end of this task, you should be able to pass the following tests:
pdm run test --week 1 --day 1 -- -k task_1
Task 2: Implement SimpleMultiHeadAttention
In this task, we will implement the multi-head attention layer.
src/tiny_llm/attention.py
Readings
- Annotated Transformer
- PyTorch MultiheadAttention API (assume dim_k=dim_v=dim_q and H_k=H_v=H_q)
- MLX MultiHeadAttention API (assume dim_k=dim_v=dim_q and H_k=H_v=H_q)
- The Illustrated GPT-2 (Visualizing Transformer Language Models) helps you better understand what key, value, and query are.
Implement SimpleMultiHeadAttention. The layer takes a batch of vectors, maps it through the K, V, Q weight matrices, and uses the attention function we implemented in task 1 to compute the result. The output needs to be mapped using the O weight matrix.
You will also need to implement the linear function in basics.py first. For linear, it takes a tensor of the shape N.. x I, a weight matrix of the shape O x I, and a bias vector of the shape O. The output is of the shape N.. x O. I is the input dimension and O is the output dimension.
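A minimal sketch of linear under this shape convention (the weight is stored as O x I, so we multiply by its transpose); names and defaults are illustrative:

import mlx.core as mx

def linear(x: mx.array, w: mx.array, bias: mx.array | None = None) -> mx.array:
    # x: N.. x I, w: O x I, bias: O  ->  output: N.. x O
    y = mx.matmul(x, mx.swapaxes(w, -2, -1))
    if bias is not None:
        y = y + bias
    return y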
For the SimpleMultiHeadAttention layer, the input tensors query, key, value have the shape N x L x E, where E is the dimension of the embedding for a given token in the sequence. The K/Q/V weight matrices will map the tensor into key, value, and query separately, where the dimension E will be mapped into a dimension of size H x D, which means that the token embedding gets mapped into H heads, each with a dimension of D. You can directly reshape the tensor to split the H x D dimension into two dimensions of H and D to get H heads for the token.
Now, you have a tensor of the shape N.. x L x H x D for each of the key, value, and query. To apply the attention function, you first need to transpose them into shape N.. x H x L x D.
- This makes each attention head an independent batch, so that attention can be calculated separately for each head across the sequence L.
- If you kept H behind L, the attention calculation would mix the head and sequence dimensions, which is not what we want -- each head should focus only on the relationships between tokens in its own subspace.
The attention function produces an output for each head of each token. Then, you can transpose it back into N.. x L x H x D and reshape it so that all heads get merged back together with a shape of N.. x L x (H x D). Map it through the output weight matrix to get the final output.
E is hidden_size or embed_dim or dims or model_dim
H is num_heads
D is head_dim
L is seq_len, in PyTorch API it's S (source len)
w_q/w_k/w_v: E x (H x D)
output/input: N x L x E
w_o: (H x D) x E
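Putting it together, the reshape/transpose dance could look like the sketch below. It reuses the linear and scaled_dot_product_attention_simple sketches from above, and the constructor arguments are assumptions rather than the exact starter-code signature:

import mlx.core as mx

class SimpleMultiHeadAttention:
    def __init__(self, hidden_size: int, num_heads: int,
                 wq: mx.array, wk: mx.array, wv: mx.array, wo: mx.array):
        self.num_heads = num_heads
        self.head_dim = hidden_size // num_heads
        self.wq, self.wk, self.wv, self.wo = wq, wk, wv, wo

    def __call__(self, query: mx.array, key: mx.array, value: mx.array,
                 mask: mx.array | None = None) -> mx.array:
        N, L, _ = query.shape
        H, D = self.num_heads, self.head_dim

        def project(x: mx.array, w: mx.array) -> mx.array:
            # N x L x E -> N x L x H x D -> N x H x L x D
            return mx.transpose(linear(x, w).reshape(N, L, H, D), (0, 2, 1, 3))

        q, k, v = project(query, self.wq), project(key, self.wk), project(value, self.wv)
        x = scaled_dot_product_attention_simple(q, k, v, mask=mask)
        # Merge the heads back: N x H x L x D -> N x L x (H x D), then apply w_o.
        x = mx.transpose(x, (0, 2, 1, 3)).reshape(N, L, H * D)
        return linear(x, self.wo)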
At the end of the task, you should be able to pass the following tests:
pdm run test --week 1 --day 1 -- -k task_2
You can run all tests for the day with:
pdm run test --week 1 --day 1
Week 1 Day 2: Positional Encodings and RoPE
In day 2, we will implement the positional embedding used in the Qwen2 model: Rotary Positional Encoding. In a Transformer model, we need a way to embed the information about a token's position into the input of the attention layers. In Qwen2, the positional embedding is applied within the multi-head attention layer to the query and key vectors.
Readings
- You could have designed state of the art positional encoding
- RoFormer: Enhanced Transformer with Rotary Position Embedding
Task 1: Implement Rotary Positional Encoding "RoPE"
You will need to modify the following file:
src/tiny_llm/positional_encoding.py
In traditional RoPE (as described in the readings), the positional encoding is applied to each head of the query and key vectors.
You can pre-compute the frequencies when initializing the RoPE class.
If offset is not provided, the positional encoding will be applied to the entire sequence: the 0th frequency is applied to the 0th token, up to the (L-1)-th token. Otherwise, the positional encoding will be applied to the sequence according to the offset slice. If the offset slice is 5..10, then the sequence length provided to the layer would be 5, and the 0th token will get the 5th frequency.
You only need to consider offset being None or a single slice. The list[slice] case will be implemented when we start implementing the continuous batching feature. Assume all batches provided use the same offset.
x: (N, L, H, D)
cos/sin_freqs: (MAX_SEQ_LEN, D // 2)
In the traditional form of RoPE, each head on the dimension of D is viewed as consecutive complex pairs. That is to say, if D = 8, then x[0] and x[1] are a pair, x[2] and x[3] are another pair, and so on. A pair gets the same frequency from cos/sin_freqs.

Note that, practically, D can be even or odd. In the case of D being odd, the last dimension of x doesn't have a matching pair and is typically left untouched in most implementations. For simplicity, we just assume that D is always even.
output[0] = x[0] * cos_freqs[0] + x[1] * -sin_freqs[0]
output[1] = x[0] * sin_freqs[0] + x[1] * cos_freqs[0]
output[2] = x[2] * cos_freqs[1] + x[3] * -sin_freqs[1]
output[3] = x[2] * sin_freqs[1] + x[3] * cos_freqs[1]
...and so on
You can do this by reshaping x to (N, L, H, D // 2, 2) and then applying the above formula to each pair.
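A minimal sketch of the traditional form, assuming the frequencies are pre-computed in the constructor. The parameter names (dims, seq_len, base) and the base value of 10000 are illustrative defaults; the model config supplies its own values:

import mlx.core as mx

class RoPE:
    def __init__(self, dims: int, seq_len: int = 4096, base: float = 10000.0):
        # Pre-compute the cos/sin tables: (MAX_SEQ_LEN, D // 2).
        half = dims // 2
        exponents = mx.arange(half, dtype=mx.float32) / half
        freqs = mx.power(mx.array(base), -exponents)                # (D // 2,)
        angles = mx.outer(mx.arange(seq_len, dtype=mx.float32), freqs)
        self.cos_freqs, self.sin_freqs = mx.cos(angles), mx.sin(angles)

    def __call__(self, x: mx.array, offset: slice | None = None) -> mx.array:
        N, L, H, D = x.shape
        s = offset if offset is not None else slice(0, L)
        cos = self.cos_freqs[s].reshape(1, L, 1, D // 2)
        sin = self.sin_freqs[s].reshape(1, L, 1, D // 2)
        # View the head dimension as consecutive pairs: (..., D // 2, 2).
        x = x.reshape(N, L, H, D // 2, 2)
        x0, x1 = x[..., 0], x[..., 1]
        out = mx.stack([x0 * cos - x1 * sin, x0 * sin + x1 * cos], axis=-1)
        return out.reshape(N, L, H, D)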
Readings
- PyTorch RotaryPositionalEmbeddings API
- MLX implementation of RoPE before the custom Metal kernel implementation
You can test your implementation by running the following command:
pdm run test --week 1 --day 2 -- -k task_1
Task 2: Implement RoPE in the Non-Traditional Form
The Qwen2 model uses a non-traditional form of RoPE. In this form, the head embedding dimension is split into two halves, and each pair is formed by the i-th element of the first half and the i-th element of the second half (each pair getting the i-th frequency). Let's say x1 = x[..., :HALF_DIM] and x2 = x[..., HALF_DIM:].
output[0] = x1[0] * cos_freqs[0] + x2[0] * -sin_freqs[0]
output[HALF_DIM] = x1[0] * sin_freqs[0] + x2[0] * cos_freqs[0]
output[1] = x1[1] * cos_freqs[1] + x2[1] * -sin_freqs[1]
output[HALF_DIM + 1] = x1[1] * sin_freqs[1] + x2[1] * cos_freqs[1]
...and so on
You can do this by directly getting the first half / second half of the embedding dimension of x and applying the frequencies to each half separately.
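A minimal sketch of the half-split variant, assuming the same pre-computed cos/sin tables as in Task 1 (passed in explicitly here; in your RoPE class they would live on self):

import mlx.core as mx

def rope_split_halves(x: mx.array, cos_freqs: mx.array, sin_freqs: mx.array,
                      offset: slice | None = None) -> mx.array:
    # x: (N, L, H, D); cos/sin_freqs: (MAX_SEQ_LEN, D // 2) as in Task 1.
    N, L, H, D = x.shape
    s = offset if offset is not None else slice(0, L)
    cos = cos_freqs[s].reshape(1, L, 1, D // 2)
    sin = sin_freqs[s].reshape(1, L, 1, D // 2)
    # Pair the i-th element of the first half with the i-th of the second half.
    x1, x2 = x[..., : D // 2], x[..., D // 2 :]
    return mx.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)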
Readings
You can test your implementation by running the following command:
pdm run test --week 1 --day 2 -- -k task_2
At the end of the day, you should be able to pass all tests of this day:
pdm run test --week 1 --day 2
Week 1 Day 3: Grouped Query Attention (GQA)
In day 3, we will implement Grouped Query Attention (GQA). The Qwen2 models use GQA, an optimization technique for multi-head attention that reduces the computational and memory costs associated with the Key (K) and Value (V) projections. Instead of each Query (Q) head having its own K and V heads (as in Multi-Head Attention, MHA), multiple Q heads share the same K and V heads. Multi-Query Attention (MQA) is a special case of GQA where all Q heads share a single K/V head pair.
Readings
- GQA Paper (Training Generalized Multi-Query Transformer Models from Pre-Trained Checkpoints)
- Qwen layers implementation in mlx-lm
- PyTorch API (the case where enable_gqa=True)
- torchtune.modules.MultiHeadAttention
Task 1: Implement scaled_dot_product_attention_grouped
You will need to modify the following file:
src/tiny_llm/attention.py
In this task, we will implement the grouped scaled dot product attention function, which forms the core of GQA.
Implement scaled_dot_product_attention_grouped in src/tiny_llm/attention.py. This function is similar to the standard scaled dot product attention, but handles the case where the number of query heads is a multiple of the number of key/value heads.

The main procedure is the same as the standard scaled dot product attention. The difference is that the K and V heads are shared across multiple Q heads: instead of having H_q separate K and V heads, we have H K and V heads, and each K and V head is shared by n_repeats = H_q // H query heads.
The core idea is to reshape query, key, and value so that the K and V tensors can be effectively broadcast to match the query heads within their groups during the matmul operations.

- Think about how to isolate the H and n_repeats dimensions in the query tensor.
- Consider adding a dimension of size 1 for n_repeats in the key and value tensors to enable broadcasting.

Then perform the scaled dot product attention calculation (matmul, scale, optional mask, softmax, matmul). Broadcasting should handle the head repetition implicitly.

Note that leveraging broadcasting instead of repeating the K and V tensors is more efficient: broadcasting allows the same data to be used in multiple places without creating multiple copies, which saves memory and improves performance.

Finally, don't forget to reshape the result back to the expected output shape.
N.. is zero or more dimensions for batches
H_q is the number of query heads
H is the number of key/value heads (H_q must be divisible by H)
L is the query sequence length
S is the key/value sequence length
D is the head dimension
query: N.. x H_q x L x D
key: N.. x H x S x D
value: N.. x H x S x D
mask: N.. x H_q x L x S
output: N.. x H_q x L x D
Please note that besides the grouped heads, we also extend the implementation so that Q and K/V might not have the same sequence length (L vs. S).
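A rough sketch of the broadcasting approach described above (the string "causal" mask case is handled in Task 2 and omitted here; names are illustrative):

import math
import mlx.core as mx

def scaled_dot_product_attention_grouped(
    query: mx.array,              # N.. x H_q x L x D
    key: mx.array,                # N.. x H x S x D
    value: mx.array,              # N.. x H x S x D
    scale: float | None = None,
    mask: mx.array | None = None, # N.. x H_q x L x S
) -> mx.array:
    *batch, H_q, L, D = query.shape
    H, S = key.shape[-3], key.shape[-2]
    n_repeats = H_q // H
    factor = scale if scale is not None else 1.0 / math.sqrt(D)
    # Isolate the group dimension in the query; give key/value a broadcast
    # dimension of size 1 so each K/V head serves n_repeats query heads
    # without copying any data.
    q = query.reshape(*batch, H, n_repeats, L, D)
    k = key.reshape(*batch, H, 1, S, D)
    v = value.reshape(*batch, H, 1, S, D)
    scores = mx.matmul(q, mx.swapaxes(k, -2, -1)) * factor   # N.. x H x n_repeats x L x S
    if mask is not None:
        scores = scores + mask.reshape(*batch, H, n_repeats, L, S)
    out = mx.matmul(mx.softmax(scores, axis=-1), v)
    return out.reshape(*batch, H_q, L, D)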
You can test your implementation by running the following command:
pdm run test --week 1 --day 3 -- -k task_1
Task 2: Causal Masking
Readings
In this task, we will implement the causal masking for the grouped attention.
The causal masking is a technique that prevents the attention mechanism from attending to future tokens in the sequence.
When mask is set to causal, we will apply the causal mask.

The causal mask is a matrix of shape (L, S), where L is the query sequence length and S is the key/value sequence length. The mask is a lower triangular matrix (aligned to the bottom-right corner), where the elements on and below the diagonal are 0, and the elements above the diagonal are -inf. For example, if L = 3 and S = 5, the mask will be:
0 0 0 -inf -inf
0 0 0 0 -inf
0 0 0 0 0
Please implement the causal_mask function in src/tiny_llm/attention.py and then use it in the scaled_dot_product_attention_grouped function. Also note that the diagonal position of our causal mask is different from that of the PyTorch API.
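A minimal sketch of one way to build such a mask in MLX, assuming a (L, S, dtype) signature; the starter code's signature may differ:

import mlx.core as mx

def causal_mask(L: int, S: int, dtype=mx.float32) -> mx.array:
    # Lower-triangular mask aligned to the bottom-right corner: query i may
    # attend to keys 0..(i + S - L); everything after that gets -inf.
    allowed = mx.tril(mx.ones((L, S)), k=S - L).astype(mx.bool_)
    return mx.where(allowed, mx.zeros((L, S)), mx.full((L, S), float("-inf"))).astype(dtype)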
You can test your implementation by running the following command:
pdm run test --week 1 --day 3 -- -k task_2
Task 3: Qwen2 Grouped Query Attention
In this task, we will implement the Qwen2 Grouped Query Attention. You will need to modify the following file:
src/tiny_llm/qwen2_week1.py
Qwen2MultiHeadAttention implements the multi-head attention for Qwen2. You will need to implement the following pseudocode:
x: B, L, E
q = linear(x, wq, bq) -> B, L, H_q, D
k = linear(x, wk, bk) -> B, L, H, D
v = linear(x, wv, bv) -> B, L, H, D
q = rope(q, offset=slice(offset, offset + L))
k = rope(k, offset=slice(offset, offset + L))
(transpose as needed)
x = scaled_dot_product_attention_grouped(q, k, v, scale, mask) -> B, L, H_q, D ; Do this at float32 precision
(transpose as needed)
x = linear(x, wo) -> B, L, E
You can test your implementation by running the following command:
pdm run test --week 1 --day 3 -- -k task_3
At the end of the day, you should be able to pass all tests of this day:
pdm run test --week 1 --day 3
Week 1 Day 4: RMSNorm and the Multi-Layer Perceptron
In day 4, we will implement two crucial components of the Qwen2 Transformer architecture: RMSNorm and the MLP (Multi-Layer Perceptron) block, also known as the FeedForward Network. RMSNorm is a layer normalization technique that helps stabilize training with less computational overhead compared to traditional layer normalization. The MLP block is a feedforward network that processes the output of the attention layers, applying non-linear transformations to enhance the model's expressiveness.
Task 1: Implement RMSNorm
In this task, we will implement the RMSNorm layer.
src/tiny_llm/layer_norm.py
Readings
- Root Mean Square Layer Normalization
- Qwen2 layers implementation in mlx-lm (includes RMSNorm) -- see Qwen2RMSNorm.
RMSNorm is defined as:

y = x / sqrt(mean(x^2) + epsilon) * weight

Where:
- x is the input tensor.
- weight is a learnable scaling parameter.
- epsilon (eps) is a small constant added for numerical stability (e.g., 1e-5 or 1e-6).
- mean(x^2) is the mean of the squared values: the sum of squares divided by the number of elements.
The normalization is applied independently to each sample's feature vector, typically over the last dimension of input.
Note that the mean calculation should be performed with float32 accumulation to maintain precision before taking the square root, even if the input and weights are in a lower-precision format (e.g., float16 or bfloat16).
D is the embedding dimension.
x: N.. x D
weight: D
output: N.. x D
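A minimal sketch, assuming the layer stores the weight and eps in its constructor:

import mlx.core as mx

class RMSNorm:
    def __init__(self, dim: int, weight: mx.array, eps: float = 1e-6):
        self.weight = weight   # shape: D
        self.eps = eps

    def __call__(self, x: mx.array) -> mx.array:
        # Accumulate the mean of squares in float32, then cast back.
        orig_dtype = x.dtype
        x = x.astype(mx.float32)
        x = x * mx.rsqrt(mx.mean(x * x, axis=-1, keepdims=True) + self.eps)
        return (x * self.weight.astype(mx.float32)).astype(orig_dtype)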
You can test your implementation by running:
pdm run test --week 1 --day 4 -- -k task_1
Task 2: Implement the MLP Block
In this task, we will implement the MLP block named Qwen2MLP.
src/tiny_llm/qwen2_week1.py
The original Transformer model utilized a simple Feed-Forward Network (FFN) within each block. This FFN typically consisted of two linear transformations with a ReLU activation in between, applied position-wise.
Modern Transformer architectures, including Qwen2, often employ more advanced FFN variants for improved performance. Qwen2 uses a specific type of Gated Linear Unit (GLU) called SwiGLU.
Readings
- Attention is All You Need (Transformer Paper, Section 3.3 "Position-wise Feed-Forward Networks")
- GLU Paper (Language Modeling with Gated Convolutional Networks)
- SiLU (Swish) activation function
- SwiGLU Paper (GLU Variants Improve Transformer)
- PyTorch SiLU documentation
- Qwen2 layers implementation in mlx-lm (includes MLP)
Essentially, SwiGLU is a combination of GLU and the SiLU (Sigmoid Linear Unit) activation function:
- GLU is a gating mechanism that allows the model to learn which parts of the input to focus on. It typically involves an element-wise product of two linear projections of the input, one of which might be passed through an activation function. Compared to ReLU used in the original FFN, GLU can help the model learn more complex relationships in the data, deciding which features to keep and which to discard.
- SiLU (Sigmoid Linear Unit) is a smooth, non-monotonic activation function that has been shown to perform well in various deep learning tasks. Compared to the ReLU and sigmoid used in GLU, it is fully differentiable without the zero-gradient "dead zones", and retains a non-zero output even for negative inputs.
You need to implement the silu function in basics.py first. For silu, it takes a tensor of the shape N.. x I and returns a tensor of the same shape.

The silu function is defined as:

silu(x) = x * sigmoid(x) = x / (1 + e^(-x))
Then implement Qwen2MLP. The structure of Qwen2's MLP block is:
- A gate linear projection (w_gate).
- An up linear projection (w_up).
- A SiLU activation function applied to the output of the gate projection.
- An element-wise multiplication of the SiLU-activated gate output and the up-projection output. This forms the "gated" part.
- A final down linear projection (w_down).
This can be expressed as:

output = linear(silu(linear(x, w_gate)) * linear(x, w_up), w_down)

where * denotes element-wise multiplication. All linear projections in Qwen2's MLP are typically implemented without bias.
N.. is zero or more dimensions for batches
E is hidden_size (embedding dimension of the model)
I is intermediate_size (dimension of the hidden layer in MLP)
L is the sequence length
input: N.. x L x E
w_gate: I x E
w_up: I x E
w_down: E x I
output: N.. x L x E
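A minimal sketch of silu and the SwiGLU MLP, reusing the linear helper from day 1; constructor arguments are illustrative:

import mlx.core as mx

def silu(x: mx.array) -> mx.array:
    # silu(x) = x * sigmoid(x)
    return x * mx.sigmoid(x)

class Qwen2MLP:
    def __init__(self, dim: int, hidden_dim: int,
                 w_gate: mx.array, w_up: mx.array, w_down: mx.array):
        self.w_gate, self.w_up, self.w_down = w_gate, w_up, w_down

    def __call__(self, x: mx.array) -> mx.array:
        # SwiGLU: gate the up projection with the SiLU-activated gate
        # projection, then project back down. No biases in Qwen2's MLP.
        return linear(silu(linear(x, self.w_gate)) * linear(x, self.w_up), self.w_down)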
You can test your implementation by running:
pdm run test --week 1 --day 4 -- -k task_2
At the end of the day, you should be able to pass all tests of this day:
pdm run test --week 1 --day 4
Week 1 Day 5: The Qwen2 Model
In day 5, we will implement the Qwen2 model.
Before we start, please make sure you have downloaded the models:
huggingface-cli download Qwen/Qwen2-0.5B-Instruct-MLX
huggingface-cli download Qwen/Qwen2-7B-Instruct-MLX
Otherwise, some of the tests will be skipped.
Task 1: Implement Qwen2TransformerBlock
src/tiny_llm/qwen2_week1.py
Readings
Qwen2 uses the following transformer block structure:
input
/ |
| input_layernorm (RMSNorm)
| |
| Qwen2MultiHeadAttention
\ |
Add (residual)
/ |
| post_attention_layernorm (RMSNorm)
| |
| MLP
\ |
Add (residual)
|
output
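A structural sketch of the block above, assuming the sub-layers (RMSNorm, attention, MLP) are already constructed and accept the offset/mask arguments shown; the real constructor in the starter code builds these from the model weights:

class Qwen2TransformerBlock:
    def __init__(self, input_layernorm, self_attn, post_attention_layernorm, mlp):
        self.input_layernorm = input_layernorm
        self.self_attn = self_attn
        self.post_attention_layernorm = post_attention_layernorm
        self.mlp = mlp

    def __call__(self, x, offset: int, mask=None):
        # Pre-norm residual structure, mirroring the diagram above.
        h = x + self.self_attn(self.input_layernorm(x), offset=offset, mask=mask)
        return h + self.mlp(self.post_attention_layernorm(h))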
You should pass all tests for this task by running:
# Download the models if you haven't done so
huggingface-cli download Qwen/Qwen2-0.5B-Instruct-MLX
huggingface-cli download Qwen/Qwen2-7B-Instruct-MLX
# Run the tests
pdm run test --week 1 --day 5 -- -k task_1
Task 2: Implement Embedding
src/tiny_llm/embedding.py
Readings
The embedding layer maps one or more tokens (represented as integers) to one or more vectors of dimension embedding_dim.
In this task, you will implement the embedding layer.
Embedding::__call__
weight: vocab_size x embedding_dim
Input: N.. (tokens)
Output: N.. x embedding_dim (vectors)
This can be done with a simple array index lookup operation.
In the Qwen2 model, the embedding layer can also be used as a linear layer to map the embeddings back to the token space.
Embedding::as_linear
weight: vocab_size x embedding_dim
Input: N.. x embedding_dim
Output: N.. x vocab_size
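A minimal sketch of both entry points:

import mlx.core as mx

class Embedding:
    def __init__(self, vocab_size: int, embedding_dim: int, weight: mx.array):
        self.weight = weight   # vocab_size x embedding_dim

    def __call__(self, x: mx.array) -> mx.array:
        # Token IDs index rows of the weight table: N.. -> N.. x embedding_dim
        return self.weight[x]

    def as_linear(self, x: mx.array) -> mx.array:
        # Reuse the table as an output projection: N.. x embedding_dim -> N.. x vocab_size
        return mx.matmul(x, mx.swapaxes(self.weight, -2, -1))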
You should pass all tests for this task by running:
# Download the models if you haven't done so; we need the tokenizers
huggingface-cli download Qwen/Qwen2-0.5B-Instruct-MLX
huggingface-cli download Qwen/Qwen2-7B-Instruct-MLX
# Run the tests
pdm run test --week 1 --day 5 -- -k task_2
Task 3: Implement Qwen2ModelWeek1
Now that we have built all the components of the Qwen2 model, we can implement the Qwen2ModelWeek1 class.
src/tiny_llm/qwen2_week1.py
Readings
In this course, you will not implement the process of loading the model parameters from the tensor files. Instead, we will load the model using the mlx-lm library, and then we will place the loaded parameters into our model. Therefore, the Qwen2ModelWeek1 class will take an MLX model as the constructor argument.
The Qwen2 model has the following layers:
input
| (tokens: N..)
Embedding
| (N.. x hidden_size); note that hidden_size==embedding_dim
Qwen2TransformerBlock
| (N.. x hidden_size)
Qwen2TransformerBlock
| (N.. x hidden_size)
...
|
RMSNorm
| (N.. x hidden_size)
Embedding::as_linear OR Linear (lm_head)
| (N.. x vocab_size)
output
You can access the number of layers, hidden size, and other model parameters from mlx_model.args. Note that different sizes of the Qwen2 model use different strategies to map the embeddings back to the token space. The 0.5B model directly uses the Embedding::as_linear layer, while the 7B model has a separate lm_head linear layer. You can decide which strategy to use based on the mlx_model.args.tie_word_embeddings argument. If it is true, then you should use Embedding::as_linear. Otherwise, the lm_head linear layer will be available and you should load its parameters.
The input to the model is a sequence of tokens. The output is the logits (the unnormalized probability distribution) over the next token. On the next day, we will implement the process of generating the response from the model and decide the next token based on this probability distribution.
Also note that the MLX model we are using (Qwen2-7B/0.5B-Instruct) is a quantized model. Therefore, you also need to dequantize the weights before loading them into our tiny-llm model. You can use the provided quantize::dequantize_linear function to dequantize the weights.
You also need to make sure that you set mask=causal when the input sequence is longer than one token. We will explain why on the next day.
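A structural sketch of the forward pass. The attribute names (embedding, layers, norm, w_lm_head, tie_word_embeddings) are placeholders for whatever you build in your constructor, and the constructor itself (which dequantizes and copies the mlx-lm weights) is omitted:

import mlx.core as mx

class Qwen2ModelWeek1:
    def __init__(self, mlx_model):
        # Dequantize and copy the parameters out of the mlx-lm model into the
        # layers we built this week (omitted in this sketch).
        ...

    def __call__(self, inputs: mx.array, offset: int) -> mx.array:
        # inputs: N.. (token IDs) -> logits: N.. x vocab_size
        h = self.embedding(inputs)
        mask = "causal" if inputs.shape[-1] > 1 else None
        for layer in self.layers:
            h = layer(h, offset, mask=mask)
        h = self.norm(h)
        if self.tie_word_embeddings:
            return self.embedding.as_linear(h)
        return linear(h, self.w_lm_head)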
You should pass all tests for this task by running:
# Download the models if you haven't done so
huggingface-cli download Qwen/Qwen2-0.5B-Instruct-MLX
huggingface-cli download Qwen/Qwen2-7B-Instruct-MLX
# Run the tests
pdm run test --week 1 --day 5 -- -k task_3
At the end of the day, you should be able to pass all tests of this day:
pdm run test --week 1 --day 5
Week 1 Day 6: Generating the Response: Prefill and Decode
In day 6, we will implement the process of generating the response when using the LLM as a chatbot. The implementation is not a lot of code, but given that it uses a large portion of the code we implemented in the previous days, we want to allocate this day to debug the implementation and make sure everything is working as expected.
Task 1: Implement simple_generate
src/tiny_llm/generate.py
The simple_generate function takes a model, a tokenizer, and a prompt, and generates the response. The generation process is done in two parts: first prefill, and then decode.
The first thing to do is to implement the _step sub-function. It takes a list of tokens y and the offset of the first token provided to the model. The model returns the logits: the probability distribution of the next token for each position.
y: N.. x S, where in week 1 we don't implement batch, so N.. = 1
offset: int
output_logits: N.. x S x vocab_size
You only need the last token's logits to decide the next token. Therefore, you need to select the last token's logits from the output logits.
logits = output_logits[:, -1, :]
Then, you can optionally apply the log-sum-exp trick to normalize the logits and avoid numerical instability. As we only do argmax sampling, the log-sum-exp trick is not strictly necessary. Then, you need to sample the next token from the logits. You can use the mx.argmax function to pick the token with the highest probability over the last dimension (the vocab_size axis). The function returns the ID of the next token. This decoding strategy is called greedy decoding, as we always pick the token with the highest probability.
With the _step function implemented, you can now implement the full simple_generate function. The function will first prefill the model with the prompt. As the prompt is a string, you need to first convert it to a list of tokens using tokenizer.encode.
- The prefill step is done by calling the _step function with all the tokens in the prompt and offset=0. It gives back the first token of the response.
- The decode step is done by calling the _step function with all the previous tokens and the offset of the last token.

You will need to implement a while loop to keep generating the response until the model outputs the EOS token tokenizer.eos_token_id. In the loop, you will need to store all previous tokens in a list, and use the detokenizer tokenizer.detokenizer to print the response.
An example of the sequences provided to the _step function is as below:
tokenized_prompt: [1, 2, 3, 4, 5, 6]
prefill: _step(model, [1, 2, 3, 4, 5, 6], 0) # returns 7
decode: _step(model, [1, 2, 3, 4, 5, 6, 7], 7) # returns 8
decode: _step(model, [1, 2, 3, 4, 5, 6, 7, 8], 8) # returns 9
...
We will optimize the decode process with a key-value cache to speed up generation next week.
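A minimal sketch of the prefill/decode loop, following the offset convention of the example above. For brevity it decodes the whole token list at the end rather than streaming with tokenizer.detokenizer, and it ignores sampling options:

import mlx.core as mx

def simple_generate(model, tokenizer, prompt: str) -> str:
    def _step(tokens: list[int], offset: int) -> int:
        logits = model(mx.array([tokens]), offset)   # 1 x S x vocab_size
        logits = logits[:, -1, :]                    # only the last position matters
        return mx.argmax(logits, axis=-1)[0].item()  # greedy decoding

    tokens = tokenizer.encode(prompt)
    next_token = _step(tokens, 0)                    # prefill with the whole prompt
    while next_token != tokenizer.eos_token_id:
        tokens.append(next_token)
        next_token = _step(tokens, len(tokens))      # decode, offsets as in the example
    return tokenizer.decode(tokens)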
You can test your implementation by running the following command:
pdm run main --solution tiny_llm --loader week1 --model Qwen/Qwen2-0.5B-Instruct-MLX \
--prompt "Give me a short introduction to large language model"
pdm run main --solution tiny_llm --loader week1 --model Qwen/Qwen2-7B-Instruct-MLX \
--prompt "Give me a short introduction to large language model"
It should give you a reasonable response about what a large language model is. Replace --solution tiny_llm with --solution ref to use the reference solution.
Week 1 Day 7: Sampling and Preparing for Week 2
In day 7, we will implement various sampling strategies, and we will get you prepared for week 2.
Task 1: Sampling
We implemented the default greedy sampling strategy in the previous day. In this task, we will implement the temperature, top-k, and top-p (nucleus) sampling strategies.
src/tiny_llm/sampler.py
Temperature Sampling
The first sampling strategy is temperature sampling. When temp=0, we use the default greedy strategy. When it is larger than 0, we randomly select the next token based on the logprobs. The temperature parameter scales the distribution: when the value is larger, the distribution becomes more uniform, making lower-probability tokens more likely to be selected and therefore making the model more creative.
To implement temperature sampling, simply divide the logprobs by the temperature and use mx.random.categorical to randomly select the next token.
pdm run main --solution tiny_llm --loader week1 --model Qwen/Qwen2-0.5B-Instruct-MLX --sampler-temp 0.5
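A minimal sketch of such a sampler (function name and signature are illustrative):

import mlx.core as mx

def sample(logprobs: mx.array, temp: float) -> mx.array:
    # temp == 0 falls back to greedy decoding; otherwise flatten/sharpen the
    # distribution by scaling the logprobs with 1/temp and sample from it.
    if temp == 0:
        return mx.argmax(logprobs, axis=-1)
    return mx.random.categorical(logprobs * (1.0 / temp))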
Top-k Sampling
In top-k sampling, we only keep the k tokens with the highest probabilities before sampling. This is done before the final temperature scaling.
You can use mx.argpartition to partition the output so that you know the indices of the top-k elements, and then mask the logprobs outside the top-k with -mx.inf. After that, do temperature sampling.
pdm run main --solution tiny_llm --loader week1 --model Qwen/Qwen2-0.5B-Instruct-MLX --sampler-temp 0.5 --sampler-top-k 10
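A minimal sketch of the masking step. For brevity it derives the cut-off value with a full sort instead of the argpartition hinted at above, which is slightly less efficient but easier to read:

import mlx.core as mx

def apply_top_k(logprobs: mx.array, k: int) -> mx.array:
    # Keep the k largest logprobs and mask the rest with -inf; temperature
    # sampling then runs on the masked distribution.
    V = logprobs.shape[-1]
    # Value of the k-th largest entry along the vocab axis.
    threshold = mx.sort(logprobs, axis=-1)[..., V - k : V - k + 1]
    neg_inf = mx.full(logprobs.shape, float("-inf"), dtype=logprobs.dtype)
    return mx.where(logprobs < threshold, neg_inf, logprobs)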
Top-p (Nucleus) Sampling
In top-p (nucleus) sampling, we only keep the most probable tokens whose cumulative probability is within top-p before sampling. This is done before the final temperature scaling.
There are multiple ways of implementing it. One way is to first use mx.argsort to sort the logprobs (from highest probability to lowest), then take a cumsum over the sorted probabilities (the exponentiated logprobs) to get the cumulative probabilities. Then, mask the logprobs outside the top-p with -mx.inf. After that, do temperature sampling.
pdm run main --solution tiny_llm --loader week1 --model Qwen/Qwen2-0.5B-Instruct-MLX --sampler-temp 0.5 --sampler-top-p 0.9
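A minimal sketch of one possible masking step: it sorts, accumulates the probabilities, and keeps at least the single most likely token even if it alone exceeds top_p:

import mlx.core as mx

def apply_top_p(logprobs: mx.array, top_p: float) -> mx.array:
    # Keep the most probable tokens whose cumulative probability stays within
    # top_p (always at least one token), and mask everything else with -inf.
    sorted_logprobs = -mx.sort(-logprobs, axis=-1)                  # descending
    cum_probs = mx.cumsum(mx.exp(sorted_logprobs), axis=-1)
    n_keep = mx.maximum(mx.sum(cum_probs <= top_p, axis=-1, keepdims=True), 1)
    # The smallest logprob that is still inside the nucleus.
    threshold = mx.take_along_axis(sorted_logprobs, n_keep.astype(mx.int32) - 1, axis=-1)
    neg_inf = mx.full(logprobs.shape, float("-inf"), dtype=logprobs.dtype)
    return mx.where(logprobs < threshold, neg_inf, logprobs)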
Task 2: Prepare for Week 2
In week 2, we will optimize the serving infrastructure of the Qwen2 model. We will write some C++ code and Metal kernel to make some operations run faster. You will need Xcode and its command-line tools, which include the Metal compiler, to compile the C++ code and Metal kernels.
- Install Xcode: Install Xcode from the Mac App Store or from the Apple Developer website (this may require an Apple Developer account).
- Launch Xcode and Install Components: After installation, launch Xcode at least once. It may prompt you to install additional macOS components; please do so (this is usually the default option).
- Install Xcode Command Line Tools:
Open your Terminal and run:
xcode-select --install
- Set Default Xcode Path (if needed):
Ensure that your command-line tools are pointing to your newly installed Xcode. You can do this by running:
sudo xcode-select --switch /Applications/Xcode.app/Contents/Developer
(Adjust the path if your Xcode is installed in a different location.)
- Accept Xcode License:
You may also need to accept the Xcode license:
sudo xcodebuild -license accept
- Install CMake:
brew install cmake
(This instruction is graciously provided by Liu Jinyi.)
You can test your installation by compiling the code in src/extensions with an axpby function, as part of the official MLX extension tutorial:
pdm run build-ext
pdm run build-ext-test
It should print correct: True.
If you are not familiar with C++ or Metal programming, we also suggest doing some small exercises to get familiar with them. You can implement some element-wise operations like exp, sin, and cos, and replace the MLX ones in your model implementation.
That's all for week 1! We have implemented all the components to serve the Qwen2 model. Now we are ready to start week 2, where we will optimize the serving infrastructure and make it run blazing fast on your Apple Silicon device.
Glossary Index
- Scaled Dot Product Attention
- Multi Head Attention
- Linear
- Rotary Positional Encoding
- Grouped Query Attention
- Qwen2 Attention Module
- RMSNorm
- SiLU
- SwiGLU
- MLP
- Embedding
- Qwen2 Transformer Block
- Week 1 Qwen2 Model
- dequantize_linear