Preface

This course is designed for systems engineers who want to understand how LLMs work.

As a systems engineer, I always wonder how things work internally and how to optimize them, and I had a hard time figuring out how LLMs work. Most of the open source projects that serve LLMs are highly optimized with CUDA kernels and other low-level optimizations. It is not easy to understand the whole picture by looking at a codebase of 100k lines of code. Therefore, I decided to implement an LLM serving project from scratch -- with only matrix manipulation APIs -- so that I can understand what it takes to load those LLM model parameters and do the math magic to generate text.

You can think of this course as an LLM version of CMU Deep Learning Systems course's needle project.

Prerequisites

You should have some experience with the basics of deep learning and a general idea of how PyTorch works. Some recommended resources are:

Environment Setup

This course uses MLX, an array/machine learning library for Apple Silicon. Nowadays it is much easier to get an Apple Silicon device than an NVIDIA GPU. In theory, you could also do this course with PyTorch or NumPy, but we just don't have the test infrastructure to support them. We test your implementation against PyTorch's CPU implementation and MLX's implementation to ensure correctness.

Course Structure

This course is divided into 3 weeks. We will serve the Qwen2-7B-Instruct model and optimize it throughout the course.

  • Week 1: serve Qwen2 with pure matrix manipulation APIs, in plain Python.
  • Week 2: optimizations; implement custom C++/Metal kernels to make the model run faster.
  • Week 3: more optimizations; batch requests to serve the model with high throughput.

How to Use This Book

The thing you are reading right now is the tiny-llm book. It is designed more like a guidebook than a textbook that explains everything from scratch. In this course, we provide the materials that the author(s) found useful on the Internet when implementing the tiny-llm project. The Internet does a better job of explaining the concepts, and I do not think it is necessary to repeat everything here. Think of this as a guide (a list of tasks) with some hints! We also unify the language of the Internet materials so that it is easier to map them to the codebase. For example, we use unified dimension symbols for the tensors, so you do not need to figure out what H, L, and E stand for, or what dimensions of the matrices are passed into each function.

About the Authors

This course is created by Chi and Connor.

Chi is a systems software engineer at Neon (acquired by Databricks), focusing on storage systems. Fascinated by the vibe of large language models (LLMs), he created this course to explore how LLM inference works.

Community

You may join skyzh's Discord server and study with the tiny-llm community.


Get Started

Now, you can start to set up the environment following the instructions in Setting Up the Environment and begin your journey to build tiny-llm!


Setting Up the Environment

To follow along with this course, you will need a Mac with Apple Silicon. We manage the codebase with pdm.

Install pdm

Please follow the official guide to install pdm.

Clone the Repository

git clone https://github.com/skyzh/tiny-llm

The repository is organized as follows:

src/tiny_llm -- your implementation
src/tiny_llm_week1_ref -- reference implementation of week 1
tests/ -- unit tests for your implementation
tests_ref_impl_week1/ -- unit tests for the reference implementation of week 1
book/ -- the book

We provide all reference implementations and you can refer to them if you get stuck in the course.

Install Dependencies

cd tiny-llm
pdm install -v # this will automatically create a virtual environment and install all dependencies

Check the Installation

pdm run python check.py
# The reference solution should pass all the *week 1* tests
pdm run test-refsol -- -- -k week_1

Run Unit Tests

Your code is in src/tiny_llm. You can run the unit tests with:

pdm run test

Download the Model Parameters

We will use the Qwen2-7B-Instruct model for this course. It takes ~20GB of memory in week 1 to load the model parameters. If you do not have enough memory, you can consider using the smaller 0.5B model.

Follow the official guide to install the Hugging Face CLI (huggingface-cli).

The model parameters are hosted on Hugging Face. Once you have authenticated the CLI with your credentials, you can download them with:

huggingface-cli login
huggingface-cli download Qwen/Qwen2-7B-Instruct-MLX

Then, you can run:

pdm run main --solution ref --loader week1

It should load the model and print some text.

In week 2, we will write some kernels in C++/Metal, and we will need to set up additional tools for that. We will cover it later.


Week 1: From Matmul to Text

This book is not complete and this chapter is not finalized yet. We might switch to Qwen3 in the final version of the course.

This week, we will start from basic matrix operations and see how these matrix manipulations can turn the Qwen2 model parameters into a model that generates text. We will implement the neural network layers used in the Qwen2 model using MLX's matrix APIs.

We will use the Qwen2-7B-Instruct model this week. As we need to dequantize the model parameters, the ~4GB download takes about 20GB of memory in week 1. If you do not have enough memory, you can consider using the smaller 0.5B model.

The MLX version of the Qwen2-7B-Instruct model we downloaded in the setup is an int4 quantized version of the original bfloat16 model.

What We Will Cover

  • Attention, Multi-Head Attention, and Grouped/Multi Query Attention
  • Positional Embeddings and RoPE
  • Put the attention layers together and implement the whole Transformer block
  • Implement the MLP layer and the whole Transformer model
  • Load the Qwen2 model parameters and generate text

What We Will Not Cover

To make the journey as interesting as possible, we will skip a few things for now:

  • How to quantize/dequantize a model -- that will be part of week 2. The Qwen2 model is quantized, so we will need to dequantize the parameters before we can use them in our layer implementations.
  • In practice, we still use a few APIs other than matrix manipulations -- like softmax, exp, log, etc. They are simple enough that not implementing them does not affect the learning experience.
  • Tokenizer -- we will not implement the tokenizer from scratch. We will use the mlx_lm tokenizer to tokenize the input.
  • Loading the model weights -- decoding those tensor dump files is not a particularly interesting thing to learn, so we will use mlx_lm to load the model and steal the weights from the loaded model into our layer implementations.

Basic Matrix APIs

Although MLX does not offer an introductory guide for beginners, its Python API is designed to be highly compatible with NumPy. To get started, you can refer to NumPy: the absolute basics for beginners to learn essential matrix operations.

You can also refer to the MLX Operations API for more details.

Qwen2 Models

You can try the Qwen2 model with MLX/vLLM. You can read the blog post below to get some idea of what we will build in this course. At the end of this week, we will be able to chat with the model -- that is to say, use Qwen2 to generate text as a causal language model.

The reference implementation of the Qwen2 model can be found in Hugging Face Transformers, vLLM, and mlx-lm. You may use these resources to better understand the internals of the model and what we will implement this week.

📚 Readings


Week 1 Day 1: Attention and Multi-Head Attention

This book is not complete and this chapter is not finalized yet. We are still working on the reference solution, writing tests, and unifying the math notations in the book.

In day 1, we will implement the basic attention layer and the multi-head attention layer. Attention layers take an input sequence and focus on different parts of the sequence when generating the output. Attention layers are the key building blocks of Transformer models.

📚 Reading: Transformer Architecture

We use the Qwen2 model for text generation. The model is a decoder-only model. The input of the model is a sequence of token embeddings. The output of the model is the most likely next token ID.

📚 Reading: LLM Inference, the Decode Phase

Back to the attention layer. The attention layer takes a query, a key, and a value. In a classic implementation, all of them are of the same shape: N.. x L x D.

N.. is zero or more dimensions for batches. Within each batch, L is the sequence length and D is the dimension of the embedding for a given head in the sequence.

So, for example, if we have a sequence of 1024 tokens, where each token has a 512-dimensional embedding per head (head_dim), we will pass a tensor of the shape N.. x 1024 x 512 to the attention layer.

Task 1: Implement scaled_dot_product_attention_simple

In this task, we will implement the scaled dot product attention function. We assume the input tensors (Q, K, V) have the same dimensions. In the next few chapters, we will support more attention variants that might not have the same dimensions for all tensors.

src/tiny_llm/attention.py

📚 Readings

Implement scaled_dot_product_attention following the attention function below. The function takes key, value, and query of the same dimensions, and an optional mask matrix M.

$$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{Q K^\top}{\sqrt{D}} + M\right) V$$

Note that $\frac{1}{\sqrt{D}}$ is the scale factor. The user might specify their own scale factor or use the default one.

L is seq_len, in PyTorch API it's S (source len)
D is head_dim

key: N.. x L x D
value: N.. x L x D
query: N.. x L x D
output: N.. x L x D
scale = 1/sqrt(D) if not specified

You may use softmax provided by mlx and implement it later in week 2.
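
Below is a minimal sketch of what this function might look like with MLX, assuming the shapes above and mlx's softmax; the course's actual signature and defaults may differ.

import mlx.core as mx

def scaled_dot_product_attention_simple(query, key, value, scale=None, mask=None):
    # a minimal sketch, not the reference solution; assumes query/key/value are N.. x L x D
    D = query.shape[-1]
    factor = scale if scale is not None else 1.0 / (D ** 0.5)      # default scale = 1/sqrt(D)
    scores = mx.matmul(query, mx.swapaxes(key, -2, -1)) * factor   # N.. x L x L
    if mask is not None:
        scores = scores + mask                                     # additive mask (e.g., -inf entries)
    weights = mx.softmax(scores, axis=-1)                          # normalize over the key dimension
    return mx.matmul(weights, value)                               # N.. x L x D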

Because we are always using the attention layer within the multi-head attention layer, the actual tensor shape when serving the model will be:

key: 1 x H x L x D
value: 1 x H x L x D
query: 1 x H x L x D
output: 1 x H x L x D
mask: 1 x H x L x L

...though the attention layer only cares about the last two dimensions. The test cases will exercise arbitrary shapes for the batching dimensions.

At the end of this task, you should be able to pass the following tests:

pdm run test --week 1 --day 1 -- -k task_1

Task 2: Implement SimpleMultiHeadAttention

In this task, we will implement the multi-head attention layer.

src/tiny_llm/attention.py

📚 Readings

Implement SimpleMultiHeadAttention. The layer takes a batch of vectors, maps them through the K, V, Q weight matrices, and uses the attention function we implemented in Task 1 to compute the result. The output then needs to be mapped through the O weight matrix.

You will also need to implement the linear function in basics.py first. For linear, it takes a tensor of the shape N.. x I, a weight matrix of the shape O x I, and a bias vector of the shape O. The output is of the shape N.. x O. I is the input dimension and O is the output dimension.
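
As a rough sketch (assuming MLX and the O x I weight layout described above; the course's exact signature may differ), linear could look like:

import mlx.core as mx

def linear(x, w, bias=None):
    # x: N.. x I, w: O x I, bias: O -> output: N.. x O (a sketch, not the reference solution)
    y = mx.matmul(x, mx.swapaxes(w, -2, -1))  # y = x @ w^T
    if bias is not None:
        y = y + bias
    return y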

For the SimpleMultiHeadAttention layer, the input tensors query, key, value have the shape N x L x E, where E is the dimension of the embedding for a given token in the sequence. The K/Q/V weight matrices map the tensor into key, value, and query separately, where the dimension E is mapped into a dimension of size H x D, which means that the token embedding gets mapped into H heads, each with a dimension of D. You can directly reshape the tensor to split the H x D dimension into two dimensions of H and D to get H heads for the token.

Now, you have a tensor of the shape N.. x L x H x D for each of the key, value, and query. To apply the attention function, you first need to transpose them into shape N.. x H x L x D.

  • This makes each attention head an independent batch, so that attention can be calculated separately for each head across the sequence L.
  • If you kept H after L, the attention calculation would mix the head and sequence dimensions, which is not what we want -- each head should focus only on the relationships between tokens in its own subspace.

The attention function produces output for each of the head of the token. Then, you can transpose it back into N.. x L x H x D and reshape it so that all heads get merged back together with a shape of N.. x L x (H x D). Map it through the output weight matrix to get the final output.

E is hidden_size or embed_dim or dims or model_dim
H is num_heads
D is head_dim
L is seq_len, in PyTorch API it's S (source len)

w_q/w_k/w_v: E x (H x D)
output/input: N x L x E
w_o: (H x D) x E
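
Putting the pieces together, a sketch of the layer might look like the following. It assumes a single batch dimension N, reuses the linear and scaled_dot_product_attention_simple helpers from above, and assumes the weights are already in the O x I layout that linear expects (transpose them first if they are stored as E x (H x D)); the course's actual class interface may differ.

import mlx.core as mx

class SimpleMultiHeadAttention:
    # a sketch, not the reference solution
    def __init__(self, hidden_size, num_heads, wq, wk, wv, wo):
        self.H = num_heads
        self.D = hidden_size // num_heads
        self.wq, self.wk, self.wv, self.wo = wq, wk, wv, wo

    def __call__(self, query, key, value, mask=None):
        N, L, _ = query.shape
        # project, split E into H heads of size D, and move H in front of L
        q = linear(query, self.wq).reshape(N, L, self.H, self.D).transpose(0, 2, 1, 3)
        k = linear(key, self.wk).reshape(N, L, self.H, self.D).transpose(0, 2, 1, 3)
        v = linear(value, self.wv).reshape(N, L, self.H, self.D).transpose(0, 2, 1, 3)
        x = scaled_dot_product_attention_simple(q, k, v, mask=mask)  # N x H x L x D
        # move H back behind L and merge the heads
        x = x.transpose(0, 2, 1, 3).reshape(N, L, self.H * self.D)
        return linear(x, self.wo)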

At the end of the task, you should be able to pass the following tests:

pdm run test --week 1 --day 1 -- -k task_2

You can run all tests for the day with:

pdm run test --week 1 --day 1


Week 1 Day 2: Positional Encodings and RoPE

This book is not complete and this chapter is not finalized yet. We are still working on the reference solution, writing tests, and unifying the math notations in the book.

In day 2, we will implement the positional embedding used in the Qwen2 model: Rotary Positional Encoding (RoPE). In a Transformer model, we need a way to embed the position of a token into the input of the attention layers. In Qwen2, the positional embedding is applied within the multi-head attention layer to the query and key vectors.

📚 Readings

Task 1: Implement Rotary Positional Encoding (RoPE)

You will need to modify the following file:

src/tiny_llm/positional_encoding.py

In traditional RoPE (as described in the readings), the positional encoding is applied to each head of the query and key vectors. You can pre-compute the frequencies when initializing the RoPE class.

If offset is not provided, the positional encoding will be applied to the entire sequence: 0th frequency applied to the 0th token, up to the (L-1)-th token. Otherwise, the positional encoding will be applied to the sequence according to the offset slice. If the offset slice is 5..10, then the sequence length provided to the layer would be 5, and the 0th token will be applied with the 5th frequency.

You only need to consider offset being None or a single slice. The list[slice] case will be implemented when we start implementing the continuous batching feature. Assume all batches provided use the same offset.

x: (N, L, H, D)
cos/sin_freqs: (MAX_SEQ_LEN, D // 2)

In the traditional form of RoPE, each head on the dimension of D is viewed as consecutive complex pairs. That is to say, if D = 8, then x[0] and x[1] are a pair, x[2] and x[3] are another pair, and so on. A pair gets the same frequency from cos/sin_freqs.

Note that, practically, D can be even or odd. In the case of D being odd, the last dimension of x doesn't have a matching pair, and is typically left untouched in most implementations. For simplicity, we just assume that D is always even.

output[0] = x[0] * cos_freqs[0] + x[1] * sin_freqs[0]
output[1] = x[0] * -sin_freqs[0] + x[1] * cos_freqs[0]
output[2] = x[2] * cos_freqs[1] + x[3] * sin_freqs[1]
output[3] = x[2] * -sin_freqs[1] + x[3] * cos_freqs[1]
...and so on

You can do this by reshaping x to (N, L, H, D // 2, 2) and then applying the above formula to each pair.
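
Here is one possible sketch of this (the class name, constructor arguments, and the base-10000 frequency computation are assumptions; follow the course's actual interface in positional_encoding.py):

import mlx.core as mx

class RoPE:
    # a sketch of the traditional (interleaved-pair) form, assuming D is even
    def __init__(self, dims, max_seq_len, base=10000.0):
        half = dims // 2
        inv_freq = mx.power(base, -mx.arange(half) * 2.0 / dims)      # (D // 2,)
        t = mx.arange(max_seq_len).reshape(-1, 1) * inv_freq           # (MAX_SEQ_LEN, D // 2)
        self.cos_freqs, self.sin_freqs = mx.cos(t), mx.sin(t)

    def __call__(self, x, offset=None):
        N, L, H, D = x.shape
        start = offset.start if offset is not None else 0
        cos = self.cos_freqs[start:start + L].reshape(1, L, 1, D // 2)
        sin = self.sin_freqs[start:start + L].reshape(1, L, 1, D // 2)
        pairs = x.reshape(N, L, H, D // 2, 2)
        x0, x1 = pairs[..., 0], pairs[..., 1]                           # consecutive pairs
        # apply the pair formula above
        out = mx.stack([x0 * cos + x1 * sin, -x0 * sin + x1 * cos], axis=-1)
        return out.reshape(N, L, H, D)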

📚 Readings

You can test your implementation by running the following command:

pdm run test --week 1 --day 2 -- -k task_1

Task 2: Implement RoPE in the non-traditional form

The Qwen2 model uses a non-traditional form of RoPE. In this form, the head embedding dimension is split into two halves, and the two halves are applied with different frequencies. Let's say x1 = x[.., :HALF_DIM] and x2 = x[.., HALF_DIM:].

output[0] = x1[0] * cos_freqs[0] + x2[0] * sin_freqs[0]
output[HALF_DIM] = x1[0] * -sin_freqs[0] + x2[0] * cos_freqs[0]
output[1] = x1[1] * cos_freqs[1] + x2[1] * sin_freqs[1]
output[HALF_DIM + 1] = x1[1] * -sin_freqs[1] + x2[1] * cos_freqs[1]
...and so on

You can do this by directly getting the first half / second half of the embedding dimension of x and applying the frequencies to each half separately.
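
A sketch of this half-split form, following the formulas above (the function name and the integer offset argument are illustrative, not the course's exact API):

import mlx.core as mx

def rope_half_split(x, cos_freqs, sin_freqs, offset=0):
    # a sketch of the non-traditional form used by Qwen2; x is (N, L, H, D), D is even
    N, L, H, D = x.shape
    half = D // 2
    cos = cos_freqs[offset:offset + L].reshape(1, L, 1, half)
    sin = sin_freqs[offset:offset + L].reshape(1, L, 1, half)
    x1, x2 = x[..., :half], x[..., half:]
    out1 = x1 * cos + x2 * sin           # goes to output[..., :half]
    out2 = -x1 * sin + x2 * cos          # goes to output[..., half:]
    return mx.concatenate([out1, out2], axis=-1)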

📚 Readings

You can test your implementation by running the following command:

pdm run test --week 1 --day 2 -- -k task_2

At the end of the day, you should be able to pass all tests of this day:

pdm run test --week 1 --day 2


Week 1 Day 3: Grouped Query Attention (GQA)

This book is not complete and this chapter is not finalized yet. We are still working on the reference solution, writing tests, and unifying the math notations in the book.

In day 3, we will implement Grouped Query Attention (GQA). The Qwen2 models use GQA, which is an optimization technique for multi-head attention that reduces the computational and memory costs associated with the Key (K) and Value (V) projections. Instead of each Query (Q) head having its own K and V heads (as in Multi-Head Attention, MHA), multiple Q heads share the same K and V heads. Multi-Query Attention (MQA) is a special case of GQA where all Q heads share a single K/V head pair.

📚 Readings

Task 1: Implement scaled_dot_product_attention_grouped

You will need to modify the following file:

src/tiny_llm/attention.py

In this task, we will implement the grouped scaled dot product attention function, which forms the core of GQA.

Implement scaled_dot_product_attention_grouped in src/tiny_llm/attention.py. This function is similar to the standard scaled dot product attention, but handles the case where the number of query heads is a multiple of the number of key/value heads.

The overall procedure is the same as the standard scaled dot product attention. The difference is that the K and V heads are shared across multiple Q heads. This means that instead of having H_q separate K and V heads, we have H K and V heads, and each K and V head is shared by n_repeats = H_q // H Q heads.

The core idea is to reshape query, key, and value so that the K and V tensors can be effectively broadcast to match the query heads within their groups during the matmul operations.

  • Think about how to isolate the H and n_repeats dimensions in the query tensor.
  • Consider adding a dimension of size 1 for n_repeats in the key and value tensors to enable broadcasting.

Then perform the scaled dot product attention calculation (matmul, scale, optional mask, softmax, matmul). Broadcasting should handle the head repetition implicitly.

Note that leveraging broadcasting instead of repeating the K and V tensors is more efficient. Broadcasting allows the same data to be used in multiple places without creating multiple copies, which saves memory and improves performance.

Finally, don't forget to reshape the final result back to the expected output shape.

N.. is zero or more dimensions for batches
H_q is the number of query heads
H is the number of key/value heads (H_q must be divisible by H)
L is the query sequence length
S is the key/value sequence length
D is the head dimension

query: N.. x H_q x L x D
key: N.. x H x S x D
value: N.. x H x S x D
mask: N.. x H_q x L x S
output: N.. x H_q x L x D

Please note that besides the grouped heads, we also extend the implementation so that the query and the key/value tensors might not have the same sequence length (L vs. S).
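
A broadcasting-based sketch of this function (assuming a full additive mask of shape N.. x H_q x L x S and ignoring the mask="causal" case from Task 2; the course's actual signature may differ):

import mlx.core as mx

def scaled_dot_product_attention_grouped(query, key, value, scale=None, mask=None):
    # a sketch, not the reference solution
    *N, H_q, L, D = query.shape
    H, S = key.shape[-3], key.shape[-2]
    n_repeats = H_q // H
    factor = scale if scale is not None else 1.0 / (D ** 0.5)
    # isolate (H, n_repeats) in the query; add a size-1 n_repeats dim to key/value for broadcasting
    q = query.reshape(*N, H, n_repeats, L, D)
    k = key.reshape(*N, H, 1, S, D)
    v = value.reshape(*N, H, 1, S, D)
    scores = mx.matmul(q, mx.swapaxes(k, -2, -1)) * factor         # N.. x H x n_repeats x L x S
    if mask is not None:
        scores = scores + mask.reshape(*N, H, n_repeats, L, S)
    weights = mx.softmax(scores, axis=-1)
    out = mx.matmul(weights, v)                                     # N.. x H x n_repeats x L x D
    return out.reshape(*N, H_q, L, D)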

You can test your implementation by running the following command:

pdm run test --week 1 --day 3 -- -k task_1

Task 2: Causal Masking

📚 Readings

In this task, we will implement the causal masking for the grouped attention.

Causal masking is a technique that prevents the attention mechanism from attending to future tokens in the sequence. When mask is set to causal, we will apply the causal mask.

The causal mask is a matrix of shape (L, S), where L is the query sequence length and S is the key/value sequence length. The mask is lower triangular and aligned to the bottom-right corner: elements on and below the (shifted) diagonal are 0, and elements above it are -inf. For example, if L = 3 and S = 5, the mask will be:

0   0   0   -inf -inf
0   0   0   0    -inf
0   0   0   0    0

Please implement the causal_mask function in src/tiny_llm/attention.py and then use it in the scaled_dot_product_attention_grouped function. Also note that our causal mask diagonal position is different from the PyTorch API.
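
One possible sketch of causal_mask, producing the bottom-right-aligned pattern shown above (the dtype argument is an assumption):

import mlx.core as mx

def causal_mask(L, S, dtype=mx.float32):
    # a sketch: 0 where attention is allowed, -inf above the diagonal shifted by S - L,
    # so the last query row can attend to all S keys
    allowed = mx.tril(mx.ones((L, S)), k=S - L)
    return mx.where(allowed == 1,
                    mx.array(0.0, dtype=dtype),
                    mx.array(float("-inf"), dtype=dtype))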

You can test your implementation by running the following command:

pdm run test --week 1 --day 3 -- -k task_2

Task 3: Qwen2 Grouped Query Attention

In this task, we will implement the Qwen2 Grouped Query Attention. You will need to modify the following file:

src/tiny_llm/qwen2_week1.py

Qwen2MultiHeadAttention implements the multi-head attention for Qwen2. You will need to implement the following pseudo code:

x: B, L, E
q = linear(x, wq, bq) -> B, L, H_q, D
k = linear(x, wk, bk) -> B, L, H, D
v = linear(x, wv, bv) -> B, L, H, D
q = rope(q, offset=slice(offset, offset + L))
k = rope(k, offset=slice(offset, offset + L))
(transpose as needed)
x = scaled_dot_product_attention_grouped(q, k, v, scale, mask) -> B, L, H_q, D ; Do this at float32 precision
(transpose as needed)
x = linear(x, wo) -> B, L, E
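
A sketch of how the pseudo code above might translate into a class, reusing the helpers from the earlier tasks. The constructor arguments, the injected RoPE helper (which should use the non-traditional form for Qwen2), and the weight layouts are assumptions; follow the course's actual interface in qwen2_week1.py.

import mlx.core as mx

class Qwen2MultiHeadAttention:
    # a sketch, not the reference solution
    def __init__(self, hidden_size, num_heads, num_kv_heads,
                 wq, wk, wv, wo, bq, bk, bv, rope):
        self.H_q, self.H = num_heads, num_kv_heads
        self.D = hidden_size // num_heads
        self.scale = self.D ** -0.5
        self.wq, self.wk, self.wv, self.wo = wq, wk, wv, wo
        self.bq, self.bk, self.bv = bq, bk, bv
        self.rope = rope  # a Day 2 RoPE helper in the non-traditional form

    def __call__(self, x, offset=0, mask=None):
        B, L, _ = x.shape
        q = linear(x, self.wq, self.bq).reshape(B, L, self.H_q, self.D)
        k = linear(x, self.wk, self.bk).reshape(B, L, self.H, self.D)
        v = linear(x, self.wv, self.bv).reshape(B, L, self.H, self.D)
        q = self.rope(q, offset=slice(offset, offset + L))
        k = self.rope(k, offset=slice(offset, offset + L))
        q, k, v = (t.transpose(0, 2, 1, 3) for t in (q, k, v))        # B x heads x L x D
        o = scaled_dot_product_attention_grouped(
            q.astype(mx.float32), k.astype(mx.float32), v.astype(mx.float32),
            scale=self.scale, mask=mask,
        ).astype(x.dtype)                                              # do attention at float32
        o = o.transpose(0, 2, 1, 3).reshape(B, L, self.H_q * self.D)   # merge heads
        return linear(o, self.wo)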

You can test your implementation by running the following command:

pdm run test --week 1 --day 3 -- -k task_3

At the end of the day, you should be able to pass all tests of this day:

pdm run test --week 1 --day 3


Week 1 Day 4: RMSNorm and the Multi-Layer Perceptron

This book is not complete and this chapter is not finalized yet. We are still working on the reference solution, writing tests, and unifying the math notations in the book.

In day 4, we will implement two crucial components of the Qwen2 Transformer architecture: RMSNorm and the MLP (Multi-Layer Perceptron) block, also known as the FeedForward Network. RMSNorm is a layer normalization technique that helps stabilize training with less computational overhead compared to traditional layer normalization. The MLP block is a feedforward network that processes the output of the attention layers, applying non-linear transformations to enhance the model's expressiveness.

Task 1: Implement RMSNorm

In this task, we will implement the RMSNorm layer.

src/tiny_llm/layer_norm.py

📚 Readings

RMSNorm is defined as:

$$\text{RMSNorm}(x) = \frac{x}{\sqrt{\operatorname{mean}(x^2) + \epsilon}} \cdot \text{weight}$$

Where:

  • x is the input tensor.
  • weight is a learnable scaling parameter.
  • epsilon (eps) is a small constant added for numerical stability (e.g., 1e-5 or 1e-6).
  • mean(x^2) is the mean of the squared elements (the sum of squares divided by the number of elements).

The normalization is applied independently to each sample's feature vector, typically over the last dimension of the input. Note that the mean calculation should be performed with float32 accumulation to maintain precision before taking the square root, even if the input and weights are in a lower-precision format (e.g., float16 or bfloat16).

D is the embedding dimension.

x: N.. x D
weight: D
output: N.. x D
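
A sketch of the layer (the constructor and the default eps value are assumptions):

import mlx.core as mx

class RMSNorm:
    # a sketch, not the reference solution; weight is a 1-D array of size D
    def __init__(self, weight, eps=1e-6):
        self.weight = weight
        self.eps = eps

    def __call__(self, x):
        orig_dtype = x.dtype
        x32 = x.astype(mx.float32)                                 # accumulate in float32
        mean_sq = mx.mean(mx.square(x32), axis=-1, keepdims=True)  # mean(x^2) over the last dim
        y = x32 * mx.rsqrt(mean_sq + self.eps)                     # x / sqrt(mean(x^2) + eps)
        return (y * self.weight.astype(mx.float32)).astype(orig_dtype)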

You can test your implementation by running:

pdm run test --week 1 --day 4 -- -k task_1

Task 2: Implement the MLP Block

In this task, we will implement the MLP block named Qwen2MLP.

src/tiny_llm/qwen2_week1.py

The original Transformer model utilized a simple Feed-Forward Network (FFN) within each block. This FFN typically consisted of two linear transformations with a ReLU activation in between, applied position-wise.

Modern Transformer architectures, including Qwen2, often employ more advanced FFN variants for improved performance. Qwen2 uses a specific type of Gated Linear Unit (GLU) called SwiGLU.

📚 Readings

Essentially, SwiGLU is a combination of GLU and the SiLU (Sigmoid Linear Unit) activation function:

  • GLU is a gating mechanism that allows the model to learn which parts of the input to focus on. It typically involves an element-wise product of two linear projections of the input, one of which might be passed through an activation function. Compared to ReLU used in the original FFN, GLU can help the model learn more complex relationships in the data, deciding which features to keep and which to discard.
  • SiLU (Sigmoid Linear Unit) is a smooth, non-monotonic activation function that has been shown to perform well in various deep learning tasks. Compared to ReLU and the sigmoid used in GLU, it is fully differentiable without the zero-gradient "dead zones" and retains a non-zero output even for negative inputs.

You need to implement the silu function in basics.py first. For silu, it takes a tensor of the shape N.. x I and returns a tensor of the same shape. The silu function is defined as:

$$\text{SiLU}(x) = x \cdot \sigma(x) = \frac{x}{1 + e^{-x}}$$

Then implement Qwen2MLP. The structure of Qwen2's MLP block is:

  • A gate linear projection (w_gate).
  • An up linear projection (w_up).
  • A SiLU activation function applied to the output of the gate projection.
  • An element-wise multiplication of the SiLU-activated gate output and the up projection output. This forms the "gated" part.
  • A final down linear projection (w_down).

This can be expressed as:

$$\text{MLP}(x) = \left(\text{SiLU}(x W_{\text{gate}}^\top) \odot (x W_{\text{up}}^\top)\right) W_{\text{down}}^\top$$

Where $\odot$ denotes element-wise multiplication. All linear projections in Qwen2's MLP are typically implemented without bias.

N.. is zero or more dimensions for batches
E is hidden_size (embedding dimension of the model)
I is intermediate_size (dimension of the hidden layer in MLP)
L is the sequence length

input: N.. x L x E
w_gate: I x E
w_up: I x E
w_down: E x I
output: N.. x L x E
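
A sketch of silu and the MLP block, reusing the linear helper from Day 1 (the constructor is an assumption):

import mlx.core as mx

def silu(x):
    # SiLU(x) = x * sigmoid(x)
    return x * mx.sigmoid(x)

class Qwen2MLP:
    # a sketch, not the reference solution; weight shapes follow the spec above
    def __init__(self, w_gate, w_up, w_down):
        self.w_gate, self.w_up, self.w_down = w_gate, w_up, w_down

    def __call__(self, x):
        gate = silu(linear(x, self.w_gate))     # N.. x L x I
        up = linear(x, self.w_up)               # N.. x L x I
        return linear(gate * up, self.w_down)   # N.. x L x E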

You can test your implementation by running:

pdm run test --week 1 --day 4 -- -k task_2

At the end of the day, you should be able to pass all tests of this day:

pdm run test --week 1 --day 4


Glossary Index

Your feedback is greatly appreciated. You are welcome to join our Discord Community.
Found an issue? Create an issue / pull request on github.com/skyzh/tiny-llm.
tiny-llm-book © 2025 by Alex Chi Z is licensed under CC BY-NC-SA 4.0.