Preface

This course is designed for systems engineers who want to understand how LLMs work.

As a systems engineer, I always wonder how things work internally and how to optimize them, yet I had a hard time figuring out how LLMs work. Most of the open source projects that serve LLMs are highly optimized with CUDA kernels and other low-level optimizations, and it is not easy to understand the whole picture by looking at a codebase of 100k lines of code. Therefore, I decided to implement an LLM serving project from scratch, with only matrix manipulation APIs, so that I can understand what it takes to load the LLM model parameters and do the math magic to generate text.

You can think of this course as an LLM version of the needle project from CMU's Deep Learning Systems course.

Prerequisites

You should have some experience with the basics of deep learning and have some idea of how PyTorch works. Some recommended resources are:

CMU Intro to Machine Learning -- this course teaches you the basics of machine learning
CMU Deep Learning Systems -- this course teaches you how to build PyTorch from scratch

Environment Setup

This course uses MLX, an array/machine learning library for Apple Silicon. Nowadays it's much easier to get an Apple Silicon device than an NVIDIA GPU. In theory you can also do this course with PyTorch or NumPy, but we do not have the test infrastructure to support them. We test your implementation against PyTorch's CPU implementation and MLX's implementation to ensure correctness.

Course Structure

This course is divided into 3 weeks. We will serve the Qwen2-7B-Instruct model and optimize it throughout the course.

Week 1: serve Qwen2 with purely matrix manipulation APIs. Just Python.
Week 2: optimizations, implement C++/Metal custom kernels to make the model run faster.
Week 3: more optimizations, batch the requests to serve the model with high throughput.

How to Use This Book

The thing you are reading right now is the tiny-llm book. It is designed more as a guidebook than as a textbook that explains everything from scratch. In this course, we provide the materials on the Internet that the author(s) found useful when implementing the tiny-llm project. The Internet does a better job of explaining the concepts and I do not think it is necessary to repeat everything here. Think of this as a guide (a list of tasks) with some hints! We will also unify the language of the Internet materials so that it is easier to map them to the codebase. For example, we will use unified dimension symbols for the tensors, so you do not need to figure out what H, L, and E stand for or what the dimensions of the matrices passed into a function are.

About the Authors

This course is created by Chi and Connor.

Chi is a systems software engineer at Neon (now acquired by Databricks), focusing on storage systems. Fascinated by the vibe of large language models (LLMs), he created this course to explore how LLM inference works.

Connor is a software engineer at PingCAP, developing the TiKV distributed key-value database. Curious about the internals of LLMs, he joined this course to practice how to build a high-performance LLM serving system from scratch, and contributed to building the course for the community.

Community

You may join skyzh's Discord server and study with the tiny-llm community.

Get Started

Now, you can start to set up the environment following the instructions in Setting Up the Environment and begin your journey to build tiny-llm!

Your feedback is greatly appreciated. You are welcome to join our Discord community.
Found an issue? Create an issue / pull request on github.com/skyzh/tiny-llm.
tiny-llm-book © 2025 by Alex Chi Z is licensed under CC BY-NC-SA 4.0.

Setting Up the Environment

To follow along with this course, you will need a Macintosh device with Apple Silicon. We manage the codebase with pdm.

Install pdm

Please follow the official guide to install pdm.

Clone the Repository

git clone https://github.com/skyzh/tiny-llm

The repository is organized as follows:

src/tiny_llm -- your implementation
src/tiny_llm_week1_ref -- reference implementation of week 1
tests/ -- unit tests for your implementation
tests_ref_impl_week1/ -- unit tests for the reference implementation of week 1
book/ -- the book

We provide all reference implementations and you can refer to them if you get stuck in the course.

Install Dependencies

cd tiny-llm
pdm install -v # this will automatically create a virtual environment and install all dependencies

Check the Installation

pdm run check-installation
# The reference solution should pass all the *week 1* tests
pdm run test-refsol -- -- -k week_1

Run Unit Tests

Your code is in src/tiny_llm. You can run the unit tests with:

pdm run test

Download the Model Parameters

We will use the Qwen2-7B-Instruct model for this course. It takes ~20GB of memory in week 1 to load the model parameters. If you do not have enough memory, you can consider using the smaller 0.5B model.

Follow the guide on this page to install the Hugging Face CLI.

The model parameters are hosted on Hugging Face. Once you have authenticated the CLI with your credentials, you can download them with:

huggingface-cli login
huggingface-cli download Qwen/Qwen2-7B-Instruct-MLX

Then, you can run:

pdm run main --solution ref --loader week1

It should load the model and print some text.

In week 2, we will write some kernels in C++/Metal, and we will need to set up additional tools for that. We will cover it later.

Week 1: From Matmul to Text

This week, we will start from the basic matrix operations and see how these matrix manipulations can turn the Qwen2 model parameters into a model that generates text. We will implement the neural network layers used in the Qwen2 model using MLX's matrix APIs.

We will use the Qwen2-7B-Instruct model for this week. As we need to dequantize the model parameters, the model of 4GB download size needs 20GB of memory in week 1. If you do not have enough memory, you can consider using the smaller 0.5B model.

The MLX version of the Qwen2-7B-Instruct model we downloaded in the setup is an int4 quantized version of the original bfloat16 model.

What We will Cover

Attention, Multi-Head Attention, and Grouped/Multi Query Attention
Positional Embeddings and RoPE
Put the attention layers together and implement the whole Transformer block
Implement the MLP layer and the whole Transformer model
Load the Qwen2 model parameters and generate text

What We will Not Cover

To make the journey as interesting as possible, we will skip a few things for now:

How to quantize/dequantize a model -- that will be part of week 2. The Qwen2 model is quantized, so we will need to dequantize its parameters before we can use them in our layer implementations.
Implementing every operation from scratch -- we still use some APIs other than matrix manipulations, like softmax, exp, log, etc. They are simple, and not implementing them does not affect the learning experience.
Tokenizer -- we will not implement the tokenizer from scratch. We will use the mlx_lm tokenizer to tokenize the input.
Loading the model weights -- I don't think it's an interesting thing to learn how to decode those tensor dump files, so we will use the mlx_lm to load the model and steal the weights from the loaded model into our layer implementations.

Basic Matrix APIs

Although MLX does not offer an introductory guide for beginners, its Python API is designed to be highly compatible with NumPy. To get started, you can refer to NumPy: the absolute basics for beginners to learn essential matrix operations.

You can also refer to the MLX Operations API for more details.

Qwen2 Models

You can try the Qwen2 model with MLX/vLLM. You can read the blog post below to have some idea of what we will build within this course. At the end of this week, we will be able to chat with the model -- that is to say, use Qwen2 to generate text, as a causal language model.

The reference implementation of the Qwen2 model can be found in huggingface transformers, vLLM, and mlx-lm. You may utilize these resources to better understand the internals of the model and what we will implement in this week.

📚 Readings

Week 1 Day 1: Attention and Multi-Head Attention

In day 1, we will implement the basic attention layer and the multi-head attention layer. Attention layers take an input sequence and focus on different parts of the sequence when generating the output. Attention layers are the key building blocks of Transformer models.

📚 Reading: Transformer Architecture

We use the Qwen2 model for text generation. The model is a decoder-only model. The input of the model is a sequence of token embeddings. The output of the model is the most likely next token ID.

📚 Reading: LLM Inference, the Decode Phase

Back to the attention layer. The attention layer takes a query, a key, and a value. In a classic implementation, all of them are of the same shape: N.. x L x D.

N.. is zero or more batch dimensions. Within each batch, L is the sequence length and D is the dimension of the embedding for a given head in the sequence.

So, for example, if we have a sequence of 1024 tokens, where each of the token has a 512-dimensional embedding (head_dim), we will pass a tensor of the shape N.. x 1024 x 512 to the attention layer.

Task 1: Implement `scaled_dot_product_attention_simple`

In this task, we will implement the scaled dot product attention function. We assume the input tensors (Q, K, V) have the same dimensions. In the next few chapters, we will support more variants of attentions that might not have the same dimensions for all tensors.

src/tiny_llm/attention.py

📚 Readings

Annotated Transformer
PyTorch Scaled Dot Product Attention API (assume enable_gqa=False, assume dim_k=dim_v=dim_q and H_k=H_v=H_q)
MLX Scaled Dot Product Attention API (assume dim_k=dim_v=dim_q and H_k=H_v=H_q)
Attention is All You Need

Implement scaled_dot_product_attention_simple following the attention function below. The function takes key, value, and query of the same dimensions, and an optional mask matrix M.

$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{Q K^{T}}{\sqrt{d_{k}}} + M\right) V$

Note that $\frac{1}{\sqrt{d_{k}}}$ is the default scale factor. The user might specify their own scale factor or use the default one.

L is seq_len, in PyTorch API it's S (source len)
D is head_dim

key: N.. x L x D
value: N.. x L x D
query: N.. x L x D
output: N.. x L x D
scale = 1/sqrt(D) if not specified

You may use softmax provided by mlx and implement it later in week 2.

Because we are always using the attention layer within the multi-head attention layer, the actual tensor shape when serving the model will be:

key: 1 x H x L x D
value: 1 x H x L x D
query: 1 x H x L x D
output: 1 x H x L x D
mask: 1 x H x L x L

... though the attention layer only cares about the last two dimensions. The test cases cover arbitrary shapes of the batching dimensions.
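To make the shapes above concrete, here is a minimal sketch of the attention function. It assumes NumPy as a stand-in for MLX (the two APIs are largely compatible), and the names softmax and sdpa_sketch are ours for illustration; this is not the reference solution.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max before exponentiating for numerical stability.
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def sdpa_sketch(query, key, value, scale=None, mask=None):
    # query/key/value: N.. x L x D
    d = query.shape[-1]
    if scale is None:
        scale = 1.0 / np.sqrt(d)  # default scale factor
    # Swap the last two axes of key so that scores is N.. x L x L.
    scores = (query @ np.swapaxes(key, -1, -2)) * scale
    if mask is not None:
        scores = scores + mask  # additive mask, e.g. 0 / -inf
    return softmax(scores, axis=-1) @ value  # N.. x L x D

rng = np.random.default_rng(0)
q = rng.standard_normal((1, 4, 8, 16))  # 1 x H x L x D
k = rng.standard_normal((1, 4, 8, 16))
v = rng.standard_normal((1, 4, 8, 16))
out = sdpa_sketch(q, k, v)
print(out.shape)
```

Because only the last two axes take part in the matmul and softmax, any number of leading batch dimensions works without extra code.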

At the end of this task, you should be able to pass the following tests:

pdm run test --week 1 --day 1 -- -k task_1

Task 2: Implement `SimpleMultiHeadAttention`

In this task, we will implement the multi-head attention layer.

src/tiny_llm/attention.py

📚 Readings

Annotated Transformer
PyTorch MultiheadAttention API (assume dim_k=dim_v=dim_q and H_k=H_v=H_q)
MLX MultiHeadAttention API (assume dim_k=dim_v=dim_q and H_k=H_v=H_q)
The Illustrated GPT-2 (Visualizing Transformer Language Models) helps you better understand what key, value, and query are.

Implement SimpleMultiHeadAttention. The layer takes a batch of vectors, maps them through the K, V, Q weight matrices, and uses the attention function we implemented in task 1 to compute the result. The output needs to be mapped using the O weight matrix.

You will also need to implement the linear function in basics.py first. For linear, it takes a tensor of the shape N.. x I, a weight matrix of the shape O x I, and a bias vector of the shape O. The output is of the shape N.. x O. I is the input dimension and O is the output dimension.

For the SimpleMultiHeadAttention layer, the input tensors query, key, value have the shape N x L x E, where E is the dimension of the embedding for a given token in the sequence. The K/Q/V weight matrices will map the tensors into key, value, and query separately, where the dimension E will be mapped into a dimension of size H x D, which means that the token embedding gets mapped into H heads, each with a dimension of D. You can directly reshape the tensor to split the H x D dimension into two dimensions of H and D to get H heads for the token.

Now, you have a tensor of the shape N.. x L x H x D for each of the key, value, and query. To apply the attention function, you first need to transpose them into shape N.. x H x L x D.

This makes each attention head an independent batch, so that attention can be calculated separately for each head across the sequence L.
If you kept H behind L, attention calculation would mix head and sequence dimensions, which is not what we want — each head should focus only on the relationships between tokens in its own subspace.

The attention function produces an output for each head of each token. Then, you can transpose it back into N.. x L x H x D and reshape it so that all heads get merged back together with a shape of N.. x L x (H x D). Map it through the output weight matrix to get the final output.

E is hidden_size or embed_dim or dims or model_dim
H is num_heads
D is head_dim
L is seq_len, in PyTorch API it's S (source len)

w_q/w_k/w_v: (H x D) x E (O x I, as expected by linear)
output/input: N x L x E
w_o: E x (H x D)
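The head split and merge described above can be sketched as follows. This is a minimal illustration that assumes NumPy as a stand-in for MLX; the weight is stored in the O x I layout that linear expects, and the concrete sizes are arbitrary.

```python
import numpy as np

def linear(x, w, bias=None):
    # x: N.. x I, w: O x I, bias: O -> N.. x O
    y = x @ w.T
    return y if bias is None else y + bias

N, L, H, D = 2, 5, 4, 8
E = H * D
rng = np.random.default_rng(0)
x = rng.standard_normal((N, L, E))
w_q = rng.standard_normal((H * D, E))  # O x I layout

projected = linear(x, w_q)              # N x L x (H x D)
q = projected.reshape(N, L, H, D)       # split the last axis into heads
q = q.transpose(0, 2, 1, 3)             # N x H x L x D, heads act as a batch axis
# ... the attention function from task 1 would run here ...
merged = q.transpose(0, 2, 1, 3).reshape(N, L, H * D)  # back to N x L x (H x D)
print(q.shape, merged.shape)
```

The transpose and reshape pair is lossless: merging right after splitting gives back the projected tensor, which is a handy sanity check while debugging.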

At the end of the task, you should be able to pass the following tests:

pdm run test --week 1 --day 1 -- -k task_2

You can run all tests for the day with:

pdm run test --week 1 --day 1

Week 1 Day 2: Positional Encodings and RoPE

In day 2, we will implement the positional embedding used in the Qwen2 model: Rotary Positional Encoding. In a transformer model, we need a way to embed the information of the position of a token into the input of the attention layers. In Qwen2, positional embedding is applied within the multi-head attention layer on the query and key vectors.

📚 Readings

Task 1: Implement Rotary Positional Encoding "RoPE"

You will need to modify the following file:

src/tiny_llm/positional_encoding.py

In traditional RoPE (as described in the readings), the positional encoding is applied to each head of the query and key vectors. You can pre-compute the frequencies when initializing the RoPE class.

If offset is not provided, the positional encoding will be applied to the entire sequence: the 0th frequency is applied to the 0th token, and so on up to the (L-1)-th token. Otherwise, the positional encoding will be applied to the sequence according to the offset slice. If the offset slice is 5..10, then the sequence length provided to the layer would be 5, and the 0th token will be applied with the 5th frequency.

You only need to consider offset being None or a single slice. The list[slice] case will be implemented when we start implementing the continuous batching feature. Assume all batches provided use the same offset.

x: (N, L, H, D)
cos/sin_freqs: (MAX_SEQ_LEN, D // 2)

In the traditional form of RoPE, each head on the dimension of D is viewed as consecutive complex pairs. That is to say, if D = 8, then x[0] and x[1] are a pair, x[2] and x[3] are another pair, and so on. Both elements of a pair get the same frequency from cos/sin_freqs.

Note that, practically, D can be even or odd. In the case of D being odd, the last dimension of x doesn’t have a matching pair, and is typically left untouched in most implementations. For simplicity, we just assume that D is always even.

output[0] = x[0] * cos_freqs[0] + x[1] * -sin_freqs[0]
output[1] = x[0] * sin_freqs[0] + x[1] * cos_freqs[0]
output[2] = x[2] * cos_freqs[1] + x[3] * -sin_freqs[1]
output[3] = x[2] * sin_freqs[1] + x[3] * cos_freqs[1]
...and so on

You can do this by reshaping x to (N, L, H, D // 2, 2) and then applying the above formula to each pair.
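The pairing above can be sketched as follows. This is a minimal illustration that assumes NumPy as a stand-in for MLX; the helper names rope_freqs and rope_traditional and the base of 10000 are our assumptions (10000 is the common default, not something this chapter specifies).

```python
import numpy as np

def rope_freqs(max_seq_len, dims, base=10000.0):
    # One inverse frequency per pair: base^(-2i / dims), for i in [0, dims // 2).
    inv_freq = 1.0 / (base ** (np.arange(0, dims, 2) / dims))
    angles = np.outer(np.arange(max_seq_len), inv_freq)  # MAX_SEQ_LEN x D // 2
    return np.cos(angles), np.sin(angles)

def rope_traditional(x, cos_freqs, sin_freqs, offset=None):
    # x: N x L x H x D, D assumed even
    n, l, h, d = x.shape
    pos = slice(0, l) if offset is None else offset
    cos = cos_freqs[pos][None, :, None, :]  # 1 x L x 1 x D // 2
    sin = sin_freqs[pos][None, :, None, :]
    pairs = x.reshape(n, l, h, d // 2, 2)   # consecutive elements form a pair
    x_even, x_odd = pairs[..., 0], pairs[..., 1]
    out_even = x_even * cos - x_odd * sin
    out_odd = x_even * sin + x_odd * cos
    # Interleave the rotated pairs back into the original layout.
    return np.stack([out_even, out_odd], axis=-1).reshape(n, l, h, d)

cos_f, sin_f = rope_freqs(max_seq_len=16, dims=8)
x = np.random.default_rng(0).standard_normal((1, 10, 2, 8))
full = rope_traditional(x, cos_f, sin_f)
tail = rope_traditional(x[:, 5:10], cos_f, sin_f, offset=slice(5, 10))
print(full.shape, tail.shape)
```

Each pair is rotated by an angle, so the vector norm is preserved, and the token at position 0 is left unchanged. Passing offset=slice(5, 10) for the last five tokens gives the same result as slicing the output of the full sequence.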

📚 Readings

You can test your implementation by running the following command:

pdm run test --week 1 --day 2 -- -k task_1

Task 2: Implement `RoPE` in the non-traditional form

The Qwen2 model uses a non-traditional form of RoPE. In this form, the head embedding dimension is split into two halves, and the i-th element of the first half is paired with the i-th element of the second half; each such pair gets the i-th frequency. Let's say x1 = x[.., :HALF_DIM] and x2 = x[.., HALF_DIM:].

output[0] = x1[0] * cos_freqs[0] + x2[0] * -sin_freqs[0]
output[HALF_DIM] = x1[0] * sin_freqs[0] + x2[0] * cos_freqs[0]
output[1] = x1[1] * cos_freqs[1] + x2[1] * -sin_freqs[1]
output[HALF_DIM + 1] = x1[1] * sin_freqs[1] + x2[1] * cos_freqs[1]
...and so on

You can do this by directly getting the first half / second half of the embedding dimension of x and applying the frequencies to each half separately.
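A minimal sketch of the half-split form, assuming NumPy as a stand-in for MLX. The function name rope_half_split and the frequency table built with base 10000 are our assumptions for illustration.

```python
import numpy as np

def rope_half_split(x, cos_freqs, sin_freqs, offset=None):
    # x: N x L x H x D, D assumed even
    n, l, h, d = x.shape
    half = d // 2
    pos = slice(0, l) if offset is None else offset
    cos = cos_freqs[pos][None, :, None, :]  # 1 x L x 1 x D // 2
    sin = sin_freqs[pos][None, :, None, :]
    x1, x2 = x[..., :half], x[..., half:]   # element i of x1 pairs with element i of x2
    out1 = x1 * cos - x2 * sin
    out2 = x1 * sin + x2 * cos
    return np.concatenate([out1, out2], axis=-1)

# A small frequency table of shape MAX_SEQ_LEN x D // 2 (base 10000 is an assumption).
D, MAX_SEQ_LEN = 8, 16
inv_freq = 1.0 / (10000.0 ** (np.arange(0, D, 2) / D))
angles = np.outer(np.arange(MAX_SEQ_LEN), inv_freq)
cos_freqs, sin_freqs = np.cos(angles), np.sin(angles)

x = np.random.default_rng(0).standard_normal((1, 4, 2, D))
out = rope_half_split(x, cos_freqs, sin_freqs)
print(out.shape)
```

Compared with the traditional form, no reshape into pairs is needed: two slices and one concatenate are enough.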

📚 Readings

vLLM implementation of RoPE

You can test your implementation by running the following command:

pdm run test --week 1 --day 2 -- -k task_2

At the end of the day, you should be able to pass all tests of this day:

pdm run test --week 1 --day 2

Week 1 Day 3: Grouped Query Attention (GQA)

In day 3, we will implement Grouped Query Attention (GQA). The Qwen2 models use GQA which is an optimization technique for multi-head attention that reduces the computational and memory costs associated with the Key (K) and Value (V) projections. Instead of each Query (Q) head having its own K and V heads (like in Multi-Head Attention, MHA), multiple Q heads share the same K and V heads. Multi-Query Attention (MQA) is a special case of GQA where all Q heads share a single K/V head pair.

Readings

Task 1: Implement `scaled_dot_product_attention_grouped`

You will need to modify the following file:

src/tiny_llm/attention.py

In this task, we will implement the grouped scaled dot product attention function, which forms the core of GQA.

Implement scaled_dot_product_attention_grouped in src/tiny_llm/attention.py. This function is similar to the standard scaled dot product attention, but handles the case where the number of query heads is a multiple of the number of key/value heads.

The main process is the same as the standard scaled dot product attention. The difference is that the K and V heads are shared across multiple Q heads. This means that instead of having H_q separate K and V heads, we have H K and V heads, and each K and V head is shared by n_repeats = H_q // H Q heads.

The core idea is to reshape query, key, and value so that the K and V tensors can be effectively broadcasted to match the query heads within their groups during the matmul operations.

Think about how to isolate the H and n_repeats dimensions in the query tensor.
Consider adding a dimension of size 1 for n_repeats in the key and value tensors to enable broadcasting.

Then perform the scaled dot product attention calculation (matmul, scale, optional mask, softmax, matmul). Broadcasting should handle the head repetition implicitly.

Note that leveraging broadcasting instead of repeating the K and V tensors is more efficient. This is because broadcasting allows the same data to be used in multiple places without creating multiple copies of the data, which can save memory and improve performance.

Finally, don't forget to reshape the final result back to the expected output shape.

N.. is zero or more dimensions for batches
H_q is the number of query heads
H is the number of key/value heads (H_q must be divisible by H)
L is the query sequence length
S is the key/value sequence length
D is the head dimension

query: N.. x H_q x L x D
key: N.. x H x S x D
value: N.. x H x S x D
mask: N.. x H_q x L x S
output: N.. x H_q x L x D

Please note that besides the grouped heads, we also extend the implementation so that the query may have a different sequence length (L) from the key and value (S).
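The broadcasting idea can be sketched as follows. This is a minimal illustration that assumes NumPy as a stand-in for MLX and a mask that can be broadcast to N.. x H_q x L x S; the name sdpa_grouped_sketch is ours, and this is not the reference solution.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def sdpa_grouped_sketch(query, key, value, scale=None, mask=None):
    # query: N.. x H_q x L x D; key/value: N.. x H x S x D
    *batch, h_q, l, d = query.shape
    h, s = key.shape[-3], key.shape[-2]
    n_repeats = h_q // h
    if scale is None:
        scale = 1.0 / np.sqrt(d)
    q = query.reshape(*batch, h, n_repeats, l, d)  # isolate H and n_repeats
    k = key.reshape(*batch, h, 1, s, d)            # size-1 axis broadcasts over n_repeats
    v = value.reshape(*batch, h, 1, s, d)
    scores = (q @ np.swapaxes(k, -1, -2)) * scale  # N.. x H x n_repeats x L x S
    if mask is not None:
        full = np.broadcast_to(mask, (*batch, h_q, l, s))
        scores = scores + full.reshape(*batch, h, n_repeats, l, s)
    out = softmax(scores, axis=-1) @ v             # N.. x H x n_repeats x L x D
    return out.reshape(*batch, h_q, l, d)

rng = np.random.default_rng(0)
q = rng.standard_normal((1, 8, 3, 16))  # H_q = 8, L = 3
k = rng.standard_normal((1, 2, 5, 16))  # H = 2, S = 5
v = rng.standard_normal((1, 2, 5, 16))
out = sdpa_grouped_sketch(q, k, v)
print(out.shape)
```

With this layout, query head j uses key/value head j // n_repeats, which matches what explicitly repeating each K and V head n_repeats times would compute, without the extra copies.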

You can test your implementation by running the following command:

pdm run test --week 1 --day 3 -- -k task_1

Task 2: Causal Masking

Readings

Writing an LLM from scratch, part 9 -- causal attention

In this task, we will implement the causal masking for the grouped attention.

The causal masking is a technique that prevents the attention mechanism from attending to future tokens in the sequence. When mask is set to causal, we will apply the causal mask.

The causal mask is a matrix of shape (L, S), where L is the query sequence length and S is the key/value sequence length. It is lower triangular with the diagonal aligned to the bottom-right corner: the elements on and below that diagonal are 0, and the elements above it are -inf. For example, if L = 3 and S = 5, the mask will be:

0   0   0   -inf -inf
0   0   0   0    -inf
0   0   0   0    0

Please implement the causal_mask function in src/tiny_llm/attention.py and then use it in the scaled_dot_product_attention_grouped function. Also note that our causal mask diagonal position is different from the PyTorch API.
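One way to build such a mask is sketched below, assuming NumPy as a stand-in for MLX and S >= L. The name causal_mask_sketch is ours; the real causal_mask signature in the codebase may differ.

```python
import numpy as np

def causal_mask_sketch(l, s, dtype=np.float32):
    # Query row i sits at absolute position (s - l + i), so it may attend to
    # key columns j <= s - l + i. Everything to the right is masked out.
    rows = np.arange(l)[:, None]
    cols = np.arange(s)[None, :]
    return np.where(cols <= rows + (s - l), 0.0, -np.inf).astype(dtype)

mask = causal_mask_sketch(3, 5)
print(mask)
```

For L = 3 and S = 5 this reproduces the matrix shown above; for L = S it reduces to the usual square lower-triangular mask.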

You can test your implementation by running the following command:

pdm run test --week 1 --day 3 -- -k task_2

Task 3: Qwen2 Grouped Query Attention

In this task, we will implement the Qwen2 Grouped Query Attention. You will need to modify the following file:

src/tiny_llm/qwen2_week1.py

Qwen2MultiHeadAttention implements the multi-head attention for Qwen2. You will need to implement the following pseudo code:

x: B, L, E
q = linear(x, wq, bq) -> B, L, H_q, D
k = linear(x, wk, bk) -> B, L, H, D
v = linear(x, wv, bv) -> B, L, H, D
q = rope(q, offset=slice(offset, offset + L))
k = rope(k, offset=slice(offset, offset + L))
(transpose as needed)
x = scaled_dot_product_attention_grouped(q, k, v, scale, mask) -> B, L, H_q, D ; Do this at float32 precision
(transpose as needed)
x = linear(x, wo) -> B, L, E

You can test your implementation by running the following command:

pdm run test --week 1 --day 3 -- -k task_3

At the end of the day, you should be able to pass all tests of this day:

pdm run test --week 1 --day 3

Week 1 Day 4: RMSNorm and Multi-Layer Perceptron

In day 4, we will implement two crucial components of the Qwen2 Transformer architecture: RMSNorm and the MLP (Multi-Layer Perceptron) block, also known as the FeedForward Network. RMSNorm is a layer normalization technique that helps stabilize training with less computational overhead compared to traditional layer normalization. The MLP block is a feedforward network that processes the output of the attention layers, applying non-linear transformations to enhance the model's expressiveness.

Task 1: Implement `RMSNorm`

In this task, we will implement the RMSNorm layer.

src/tiny_llm/layer_norm.py

📚 Readings

Root Mean Square Layer Normalization
Qwen2 layers implementation in mlx-lm (includes RMSNorm) - See Qwen2RMSNorm.

RMSNorm is defined as:

$y = \frac{x}{\sqrt{\mathrm{mean}(x^{2}) + \epsilon}} \cdot weight$

Where:

x is the input tensor.
weight is a learnable scaling parameter.
epsilon (eps) is a small constant added for numerical stability (e.g., 1e-5 or 1e-6).
mean(x^2) is the sum of squares and then division by the number of elements.

The normalization is applied independently to each sample’s feature vector, typically over the last dimension of the input. Note that the mean should be computed with float32 accumulation to maintain precision before taking the square root, even if the input and weights are in a lower precision format (e.g., float16 or bfloat16).

D is the embedding dimension.

x: N.. x D
weight: D
output: N.. x D
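A minimal sketch of the computation, assuming NumPy as a stand-in for MLX. The name rms_norm_sketch, the default eps, and the point at which the result is cast back to the input dtype are our assumptions; check the reference implementations for the exact casting order.

```python
import numpy as np

def rms_norm_sketch(x, weight, eps=1e-5):
    # x: N.. x D, weight: D
    orig_dtype = x.dtype
    x32 = x.astype(np.float32)  # accumulate the mean in float32
    mean_sq = np.mean(np.square(x32), axis=-1, keepdims=True)
    y = x32 / np.sqrt(mean_sq + eps)
    return (y * weight.astype(np.float32)).astype(orig_dtype)

rng = np.random.default_rng(0)
x = rng.standard_normal((2, 4, 64)).astype(np.float32)
weight = np.ones(64, dtype=np.float32)
out = rms_norm_sketch(x, weight)
print(np.mean(out ** 2, axis=-1))
```

With a weight of all ones, every output vector has a mean square close to 1, which is an easy property to check in a test.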

You can test your implementation by running:

pdm run test --week 1 --day 4 -- -k task_1

Task 2: Implement the MLP Block

In this task, we will implement the MLP block named Qwen2MLP.

src/tiny_llm/qwen2_week1.py

The original Transformer model utilized a simple Feed-Forward Network (FFN) within each block. This FFN typically consisted of two linear transformations with a ReLU activation in between, applied position-wise.

Modern Transformer architectures, including Qwen2, often employ more advanced FFN variants for improved performance. Qwen2 uses a specific type of Gated Linear Unit (GLU) called SwiGLU.

📚 Readings

Essentially, SwiGLU is a combination of GLU and the SiLU (Sigmoid Linear Unit) activation function:

GLU is a gating mechanism that allows the model to learn which parts of the input to focus on. It typically involves an element-wise product of two linear projections of the input, one of which might be passed through an activation function. Compared to ReLU used in the original FFN, GLU can help the model learn more complex relationships in the data, deciding which features to keep and which to discard.
SiLU (Sigmoid Linear Unit) is a smooth, non-monotonic activation function that has been shown to perform well in various deep learning tasks. Compared to ReLU, and to the sigmoid used in the original GLU, it is smooth everywhere, has no zero-gradient “dead zones”, and retains a non-zero output even for negative inputs.

You need to implement the silu function in basics.py first. For silu, it takes a tensor of the shape N.. x I and returns a tensor of the same shape. The silu function is defined as: $SiLU (x) = x * sigmoid (x) = \frac{x}{1 + e ^{- x}}$

Then implement Qwen2MLP. The structure for Qwen2's MLP block is:

A gate linear projection ( $W_{gate}$ ).
An up linear projection ( $W_{up}$ ).
A SiLU activation function applied to the output of $W_{gate}$ .
An element-wise multiplication of the SiLU-activated $W_{gate}$ output and the $W_{up}$ output. This forms the "gated" part.
A final down linear projection ( $W_{down}$ ).

This can be expressed as:

$\mathrm{MLP}(x) = W_{down}(\mathrm{SiLU}(W_{gate}(x)) \odot W_{up}(x))$

Where $\odot$ denotes element-wise multiplication. All linear projections in Qwen2's MLP are typically implemented without bias.

N.. is zero or more dimensions for batches
E is hidden_size (embedding dimension of the model)
I is intermediate_size (dimension of the hidden layer in MLP)
L is the sequence length

input: N.. x L x E
w_gate: I x E
w_up: I x E
w_down: E x I
output: N.. x L x E
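The block can be sketched as follows, assuming NumPy as a stand-in for MLX and weights stored in the O x I layout used by linear. The name qwen2_mlp_sketch is ours for illustration.

```python
import numpy as np

def silu(x):
    # SiLU(x) = x * sigmoid(x) = x / (1 + exp(-x))
    return x / (1.0 + np.exp(-x))

def linear(x, w, bias=None):
    # x: N.. x I, w: O x I -> N.. x O
    y = x @ w.T
    return y if bias is None else y + bias

def qwen2_mlp_sketch(x, w_gate, w_up, w_down):
    gate = silu(linear(x, w_gate))    # N.. x L x I
    up = linear(x, w_up)              # N.. x L x I
    return linear(gate * up, w_down)  # N.. x L x E

E, I, L = 8, 32, 5
rng = np.random.default_rng(0)
x = rng.standard_normal((1, L, E))
w_gate = rng.standard_normal((I, E))
w_up = rng.standard_normal((I, E))
w_down = rng.standard_normal((E, I))
out = qwen2_mlp_sketch(x, w_gate, w_up, w_down)
print(out.shape)
```

The intermediate tensors have the larger intermediate_size I, and the down projection brings the result back to hidden_size E so that it can be added to the residual stream.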

You can test your implementation by running:

pdm run test --week 1 --day 4 -- -k task_2

At the end of the day, you should be able to pass all tests of this day:

pdm run test --week 1 --day 4

Week 1 Day 5: The Qwen2 Model

In day 5, we will implement the Qwen2 model.

Before we start, please make sure you have downloaded the models:

huggingface-cli download Qwen/Qwen2-0.5B-Instruct-MLX
huggingface-cli download Qwen/Qwen2-7B-Instruct-MLX

Otherwise, some of the tests will be skipped.

Task 1: Implement `Qwen2TransformerBlock`

src/tiny_llm/qwen2_week1.py

📚 Readings

Qwen2 uses the following transformer block structure:

  input
/ |
| input_layernorm (RMSNorm)
| |
| Qwen2MultiHeadAttention
\ |
  Add (residual)
/ |
| post_attention_layernorm (RMSNorm)
| |
| MLP
\ |
  Add (residual)
  |
output
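The diagram above is a pre-norm block with two residual connections. A minimal sketch of the data flow, with the four sub-layers passed in as plain callables (these parameter names mirror the diagram, but the function itself is hypothetical and not the Qwen2TransformerBlock interface):

```python
import numpy as np

def transformer_block_sketch(x, input_layernorm, attention,
                             post_attention_layernorm, mlp):
    # First residual branch: normalize, attend, add back to the input.
    h = x + attention(input_layernorm(x))
    # Second residual branch: normalize, run the MLP, add back again.
    return h + mlp(post_attention_layernorm(h))

# Stand-in sub-layers, only to show how the data flows.
identity = lambda t: t
double = lambda t: 2.0 * t
triple = lambda t: 3.0 * t

x = np.ones((1, 4, 8))
out = transformer_block_sketch(x, identity, double, identity, triple)
print(out[0, 0, 0])
```

Note that each normalization is applied only inside its branch; the residual path carries the unnormalized tensor forward.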

You should pass all tests for this task by running:

# Download the models if you haven't done so
huggingface-cli download Qwen/Qwen2-0.5B-Instruct-MLX
huggingface-cli download Qwen/Qwen2-7B-Instruct-MLX
# Run the tests
pdm run test --week 1 --day 5 -- -k task_1

Task 2: Implement `Embedding`

src/tiny_llm/embedding.py

📚 Readings

LLM Embeddings Explained: A Visual and Intuitive Guide

The embedding layer maps one or more tokens (each represented as an integer) to one or more vectors of dimension embedding_dim. In this task, you will implement the embedding layer.

Embedding::__call__
weight: vocab_size x embedding_dim
Input: N.. (tokens)
Output: N.. x embedding_dim (vectors)

This can be done with a simple array index lookup operation.

In the Qwen2 model, the embedding layer can also be used as a linear layer to map the embeddings back to the token space.

Embedding::as_linear
weight: vocab_size x embedding_dim
Input: N.. x embedding_dim
Output: N.. x vocab_size
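Both directions can be sketched in a few lines, assuming NumPy as a stand-in for MLX. The class name EmbeddingSketch and its constructor arguments are ours for illustration.

```python
import numpy as np

class EmbeddingSketch:
    def __init__(self, vocab_size, embedding_dim, weight):
        assert weight.shape == (vocab_size, embedding_dim)
        self.weight = weight

    def __call__(self, tokens):
        # tokens: N.. (integers) -> N.. x embedding_dim, one row lookup per token
        return self.weight[tokens]

    def as_linear(self, x):
        # x: N.. x embedding_dim -> N.. x vocab_size
        # The same matrix is used as an O x I linear weight without bias.
        return x @ self.weight.T

vocab_size, embedding_dim = 10, 4
weight = np.random.default_rng(0).standard_normal((vocab_size, embedding_dim))
embedding = EmbeddingSketch(vocab_size, embedding_dim, weight)

tokens = np.array([[1, 3, 3, 7]])
vectors = embedding(tokens)
logits = embedding.as_linear(vectors)
print(vectors.shape, logits.shape)
```

Integer array indexing keeps the leading shape of the token array, so the lookup works for any number of batch dimensions.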

You should pass all tests for this task by running:

# Download the models if you haven't done so; we need the tokenizers
huggingface-cli download Qwen/Qwen2-0.5B-Instruct-MLX
huggingface-cli download Qwen/Qwen2-7B-Instruct-MLX
# Run the tests
pdm run test --week 1 --day 5 -- -k task_2

Task 3: Implement `Qwen2ModelWeek1`

Now that we have built all the components of the Qwen2 model, we can implement the Qwen2ModelWeek1 class.

src/tiny_llm/qwen2_week1.py

📚 Readings

Qwen2.5-7B-Instruct model parameters

In this course, you will not implement the process of loading the model parameters from the tensor files. Instead, we will load the model using the mlx-lm library, and then we will place the loaded parameters into our model. Therefore, the Qwen2ModelWeek1 class will take an MLX model as the constructor argument.

The Qwen2 model has the following layers:

input
| (tokens: N..)
Embedding
| (N.. x hidden_size); note that hidden_size==embedding_dim
Qwen2TransformerBlock
| (N.. x hidden_size)
Qwen2TransformerBlock
| (N.. x hidden_size)
...
|
RMSNorm 
| (N.. x hidden_size)
Embedding::as_linear  OR  Linear (lm_head)
| (N.. x vocab_size)
output

You can access the number of layers, hidden size, and other model parameters from mlx_model.args. Note that Qwen2 models of different sizes use different strategies to map the embeddings back to the token space. The 0.5b model directly uses the Embedding::as_linear layer, while the 7b model has a separate lm_head linear layer. You can decide which strategy to use based on the mlx_model.args.tie_word_embeddings argument. If it is true, then you should use Embedding::as_linear. Otherwise, the lm_head linear layer will be available and you should load its parameters.

The input to the model is a sequence of tokens. The output is the logits of the next token: unnormalized scores that softmax turns into a probability distribution. In the next day, we will implement the process of generating the response from the model, and decide the next token based on this output.

Also note that the MLX model we are using (Qwen2-7B/0.5B-Instruct) is a quantized model. Therefore, you also need to dequantize the weights before loading them into our tiny-llm model. You can use the provided quantize::dequantize_linear function to dequantize the weights.

You also need to make sure that you set mask=causal when the input sequence is longer than 1. We will explain why in the next day.

You should pass all tests for this task by running:

# Download the models if you haven't done so
huggingface-cli download Qwen/Qwen2-0.5B-Instruct-MLX
huggingface-cli download Qwen/Qwen2-7B-Instruct-MLX
# Run the tests
pdm run test --week 1 --day 5 -- -k task_3

At the end of the day, you should be able to pass all tests of this day:

pdm run test --week 1 --day 5

Week 1 Day 6: Generating the Response: Prefill and Decode

In day 6, we will implement the process of generating the response when using the LLM as a chatbot. The implementation is not a lot of code, but given that it uses a large portion of the code we implemented in the previous days, we want to allocate this day to debug the implementation and make sure everything is working as expected.

Task 1: Implement `simple_generate`

src/tiny_llm/generate.py

The simple_generate function takes a model, a tokenizer, and a prompt, and generates the response. The generation process is done in two parts: first prefill, and then decode.

The first thing is to implement the _step sub-function. It takes a list of tokens y and the offset of the first token provided to the model. The model will return the logits: the unnormalized scores over the vocabulary for the next token at each position.

y: N.. x S, where in week 1 we don't implement batch, so N.. = 1
offset: int
output_logits: N.. x S x vocab_size

You only need the last token's logits to decide the next token. Therefore, you need to select the last token's logits from the output logits.

logits = output_logits[:, -1, :]

Then, you can optionally apply the log-sum-exp trick to normalize the logits to avoid numerical instability. As we only do argmax sampling, the log-sum-exp trick is not necessary. Then, you need to sample the next token from the logits. You can use the mx.argmax function to sample the token with the highest probability over the last dimension (the vocab_size axis). The function returns the next token number. This decoding strategy is called greedy decoding as we always pick the token with the highest probability.
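The selection step can be sketched as follows, assuming NumPy as a stand-in for MLX. The names logsumexp and greedy_pick are ours, and the normalization is included only to show that it does not change the argmax.

```python
import numpy as np

def logsumexp(x, axis=-1, keepdims=False):
    # Log-sum-exp trick: factor out the max so that exp never overflows.
    m = np.max(x, axis=axis, keepdims=True)
    out = m + np.log(np.sum(np.exp(x - m), axis=axis, keepdims=True))
    return out if keepdims else np.squeeze(out, axis=axis)

def greedy_pick(output_logits):
    # output_logits: N x S x vocab_size
    logits = output_logits[:, -1, :]  # keep only the last position
    logprobs = logits - logsumexp(logits, axis=-1, keepdims=True)
    return np.argmax(logprobs, axis=-1)  # N token IDs

output_logits = np.array([[[0.1, 0.2, 0.3],
                           [5.0, 1.0, -2.0]]])  # N = 1, S = 2, vocab_size = 3
next_token = greedy_pick(output_logits)
print(next_token)
```

Subtracting the log-sum-exp shifts every logit in a row by the same constant, so the ranking of tokens is unchanged; the normalized values only start to matter for the sampling strategies of day 7.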

📚 The Log-Sum-Exp Trick
📚 Decoding Strategies in Large Language Models
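
The selection step described above can be sketched without any framework. The snippet below is only an illustration on plain Python lists, not the course's solution: log_softmax and greedy_pick are hypothetical helper names, and the logits are made-up numbers. In MLX, the same steps would use mx.logsumexp and mx.argmax.

```python
import math

def log_softmax(logits):
    """Normalize raw logits into log-probabilities with the log-sum-exp trick."""
    m = max(logits)  # subtracting the max keeps exp() from overflowing
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]

def greedy_pick(scores):
    """Return the index of the largest score (greedy decoding)."""
    return max(range(len(scores)), key=lambda i: scores[i])

# Hypothetical model output for one sequence: S = 3 positions, vocab_size = 4.
output_logits = [
    [0.1, 0.2, 0.3, 0.4],
    [2.0, 1.0, 0.0, -1.0],
    [1000.0, 1002.0, 999.0, 1001.0],  # a naive exp() would overflow here
]

last_logits = output_logits[-1]      # only the last position decides the next token
logprobs = log_softmax(last_logits)  # optional for greedy decoding
next_token = greedy_pick(logprobs)
```

Because the normalization shifts every entry by the same amount, greedy_pick(last_logits) and greedy_pick(logprobs) select the same token.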

With the _step function implemented, you can now implement the full simple_generate function. The function will first prefill the model with the prompt. As the prompt is a string, you need to first convert it to a list of tokens by using the tokenizer tokenizer.encode.

The prefill step is done by calling the _step function with all the tokens in the prompt with offset=0. It gives back the first token in the response.
The decode step is done by calling the _step function with all the previous tokens and the offset of the last token.

You will need to implement a while loop that keeps generating the response until the model outputs the EOS token (tokenizer.eos_token_id). In the loop, you will need to store all previous tokens in a list and use the detokenizer (tokenizer.detokenizer) to print the response.

An example of the sequences provided to the _step function is as below:

tokenized_prompt: [1, 2, 3, 4, 5, 6]
prefill: _step(model, [1, 2, 3, 4, 5, 6], 0) # returns 7
decode: _step(model, [1, 2, 3, 4, 5, 6, 7], 7) # returns 8
decode: _step(model, [1, 2, 3, 4, 5, 6, 7, 8], 8) # returns 9
...
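
The control flow of this loop can be tried out with a toy stand-in for the model. In the sketch below, toy_step, toy_generate, EOS, and max_new_tokens are all hypothetical names invented for illustration. The toy step simply predicts "last token + 1" and ignores the offset argument, so the snippet only shows the loop structure, not how the real _step should be called.

```python
EOS = 0  # hypothetical end-of-sequence token id

def toy_step(tokens):
    """Stand-in for _step: predicts last token + 1, and EOS once that passes 9."""
    nxt = tokens[-1] + 1
    return EOS if nxt > 9 else nxt

def toy_generate(prompt_tokens, max_new_tokens=32):
    tokens = list(prompt_tokens)  # all tokens seen so far
    generated = []                # only the response tokens
    # The first call plays the role of the prefill (the whole prompt goes in at once).
    # Every later call is a decode step over prompt + generated tokens.
    while len(generated) < max_new_tokens:
        token = toy_step(tokens)
        if token == EOS:
            break
        tokens.append(token)
        generated.append(token)
    return generated

response = toy_generate([1, 2, 3, 4, 5, 6])
```

The max_new_tokens cap is an addition of this sketch. It is a cheap safeguard against a model that never emits EOS while you are debugging.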

We will optimize the decode process to use key-value cache to speed up the generation next week.

You can test your implementation by running the following command:

pdm run main --solution tiny_llm --loader week1 --model Qwen/Qwen2-0.5B-Instruct-MLX \
  --prompt "Give me a short introduction to large language model"
pdm run main --solution tiny_llm --loader week1 --model Qwen/Qwen2-7B-Instruct-MLX \
  --prompt "Give me a short introduction to large language model"

It should give you a reasonable response of "what is a large language model". Replace --solution tiny_llm with --solution ref to use the reference solution.

Week 1 Day 7: Sampling and Preparing for Week 2

In day 7, we will implement various sampling strategies and get you prepared for week 2.

Task 1: Sampling

We implemented the default greedy sampling strategy in the previous day. In this task, we will implement the temperature, top-k, and top-p (nucleus) sampling strategies.

src/tiny_llm/sampler.py

📚 mlx-lm sampler implementation

Temperature Sampling

The first sampling strategy is temperature sampling. When temp=0, we use the default greedy strategy. When it is larger than 0, we randomly select the next token based on the logprobs. The temperature parameter scales the distribution: the larger the value, the more uniform the distribution becomes, which makes lower-probability tokens more likely to be selected and therefore makes the model more creative.

To implement temperature sampling, simply divide the logprobs by the temperature and use mx.random.categorical to randomly select the next token.

pdm run main --solution tiny_llm --loader week1 --model Qwen/Qwen2-0.5B-Instruct-MLX --sampler-temp 0.5
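
To see what dividing by the temperature does to the distribution, here is a small framework-free sketch. It is an illustration on plain Python lists under assumed inputs, not the course's sampler: temperature_probs and sample are hypothetical names, and random.choices stands in for mx.random.categorical.

```python
import math
import random

def temperature_probs(logprobs, temp):
    """Softmax of logprobs / temp: the distribution that sampling draws from."""
    scaled = [lp / temp for lp in logprobs]
    m = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(s - m) for s in scaled]
    total = sum(weights)
    return [w / total for w in weights]

def sample(logprobs, temp, rng=random):
    """Greedy when temp == 0, otherwise a random draw from temperature_probs."""
    if temp == 0:
        return max(range(len(logprobs)), key=lambda i: logprobs[i])
    probs = temperature_probs(logprobs, temp)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

logprobs = [math.log(p) for p in (0.7, 0.2, 0.1)]
sharp = temperature_probs(logprobs, 0.5)  # more peaked than the original
flat = temperature_probs(logprobs, 2.0)   # closer to uniform
token = sample(logprobs, 0.5, rng=random.Random(0))
```

A temperature below 1 concentrates probability on the most likely token, while a temperature above 1 spreads it toward the less likely ones.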

Top-k Sampling

In top-k sampling, we only keep the k tokens with the highest probabilities before sampling. This is done before the final temperature scaling.

You can use mx.argpartition to partition the output so that you can know the indices of the top-k elements, and then, mask those logprobs outside the top-k with -mx.inf. After that, do temperature sampling.

pdm run main --solution tiny_llm --loader week1 --model Qwen/Qwen2-0.5B-Instruct-MLX --sampler-temp 0.5 --sampler-top-k 10
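
The masking step can be sketched on plain Python lists as follows. This is only an illustration: top_k_mask is a hypothetical name, and a full sort is used in place of mx.argpartition for simplicity (a partition avoids sorting the whole vocabulary).

```python
import math

def top_k_mask(logprobs, k):
    """Keep the k largest logprobs and replace the rest with -inf."""
    order = sorted(range(len(logprobs)), key=lambda i: logprobs[i], reverse=True)
    keep = set(order[:k])  # indices of the top-k tokens
    return [lp if i in keep else -math.inf for i, lp in enumerate(logprobs)]

logprobs = [math.log(p) for p in (0.1, 0.4, 0.2, 0.3)]
masked = top_k_mask(logprobs, k=2)
kept_tokens = [i for i, lp in enumerate(masked) if lp != -math.inf]
```

A masked entry has probability exp(-inf) = 0, so the later sampling step can never select it.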

Top-p (Nucleus) Sampling

In top-p (nucleus) sampling, we only keep the smallest set of highest-probability tokens whose cumulative probability reaches top-p before sampling. This is done before the final temperature scaling.

There are multiple ways of implementing it. One way is to first use mx.argsort to sort the logprobs (from highest probability to lowest), and then do a cumsum over the exponentiated sorted logprobs (the cumulative sum must be taken over probabilities, not log-probabilities) to get the cumulative probabilities. Then, mask those logprobs outside the top-p with -mx.inf. After that, do temperature sampling.

pdm run main --solution tiny_llm --loader week1 --model Qwen/Qwen2-0.5B-Instruct-MLX --sampler-temp 0.5 --sampler-top-p 0.9
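
One possible shape of the nucleus mask, again on plain Python lists, is sketched below. top_p_mask is a hypothetical name, and the boundary convention is an assumption of this sketch: the token that makes the cumulative probability reach top_p is kept. Implementations differ in how they treat that boundary token.

```python
import math

def top_p_mask(logprobs, top_p):
    """Keep the most likely tokens until their cumulative probability reaches top_p."""
    order = sorted(range(len(logprobs)), key=lambda i: logprobs[i], reverse=True)
    masked = [-math.inf] * len(logprobs)
    cumulative = 0.0
    for i in order:
        masked[i] = logprobs[i]               # this token is inside the nucleus
        cumulative += math.exp(logprobs[i])   # accumulate probabilities, not logprobs
        if cumulative >= top_p:
            break
    return masked

logprobs = [math.log(p) for p in (0.1, 0.4, 0.2, 0.3)]
masked = top_p_mask(logprobs, top_p=0.6)
kept_tokens = [i for i, lp in enumerate(masked) if lp != -math.inf]
```

With these made-up probabilities, the two most likely tokens already cover 0.7 of the mass, so a top_p of 0.6 masks the other two.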

Task 2: Prepare for Week 2

In week 2, we will optimize the serving infrastructure of the Qwen2 model. We will write some C++ code and Metal kernels to make some operations run faster. You will need Xcode and its command-line tools, which include the Metal compiler, to compile the C++ code and Metal kernels.

Install Xcode: Install Xcode from the Mac App Store or from the Apple Developer website (this may require an Apple Developer account).
Launch Xcode and Install Components: After installation, launch Xcode at least once. It may prompt you to install additional macOS components; please do so (this is usually the default option).
Install Xcode Command Line Tools: Open your Terminal and run:
```
xcode-select --install
```
Set Default Xcode Path (if needed): Ensure that your command-line tools are pointing to your newly installed Xcode. You can do this by running:
```
sudo xcode-select --switch /Applications/Xcode.app/Contents/Developer
```
(Adjust the path if your Xcode is installed in a different location).
Accept Xcode License: You may also need to accept the Xcode license:
```
sudo xcodebuild -license accept
```
Install CMake:
```
brew install cmake
```

(This instruction is graciously provided by Liu Jinyi.)

You can test your installation by compiling the code in src/extensions, which contains an axpby function that is part of the official MLX extension tutorial:

pdm run build-ext
pdm run build-ext-test

It should print correct: True.
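
For orientation, the axpby operation from the MLX extension tutorial scales x by alpha, scales y by beta, and adds the results element-wise. The exact check performed by the test command is not shown here. As an assumption, it compares the compiled kernel against a reference computation along the lines of the following plain-Python sketch, where axpby_reference is a hypothetical name:

```python
def axpby_reference(x, y, alpha, beta):
    """Element-wise alpha * x + beta * y on two equal-length lists."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    return [alpha * xi + beta * yi for xi, yi in zip(x, y)]

result = axpby_reference([1.0, 2.0, 3.0], [10.0, 20.0, 30.0], alpha=2.0, beta=0.5)
```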

If you are not familiar with C++ or Metal programming, we also suggest doing some small exercises to get familiar with them. You can implement some element-wise operations like exp, sin, cos and replace the MLX ones in your model implementation.

That's all for week 1! We have implemented all the components to serve the Qwen2 model. Now we are ready to start week 2, where we will optimize the serving infrastructure and make it run blazing fast on your Apple Silicon device.

Tiny LLM - LLM Serving in a Week