Preface

This course is designed for systems engineers who want to understand how LLMs work.

As a systems engineer, I always wonder how things work internally and how to optimize them. I had a hard time figuring out how LLM serving works. Most of the open source projects that serve LLMs are highly optimized with CUDA kernels and other low-level optimizations. It is not easy to understand the whole picture by looking at a codebase of 100k lines of code. Therefore, I decided to implement an LLM serving project from scratch -- with only matrix manipulation APIs, so that I can understand what it takes to load the LLM model parameters and do the math magic to generate text.

You can think of this course as an LLM version of CMU Deep Learning Systems course's needle project.

Prerequisites

You should have some experience with the basics of deep learning and some idea of how PyTorch works. Some recommended resources are:

Environment Setup

This course uses MLX, an array/machine learning library for Apple Silicon. Nowadays it's much easier to get an Apple Silicon device than NVIDIA GPUs. In theory you could also do this course with PyTorch or numpy, but we just don't have the test infra to support them. We test your implementation against PyTorch's CPU implementation and MLX's implementation to ensure correctness.
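
If you have not used MLX before, its array API looks a lot like NumPy or PyTorch. Below is a tiny standalone example (not part of the course code) of the kind of matrix manipulation APIs we will rely on:

import mlx.core as mx

a = mx.random.normal(shape=(4, 8))
b = mx.random.normal(shape=(8, 3))
c = mx.matmul(a, b)   # plain matrix multiplication
mx.eval(c)            # MLX evaluates lazily; force the computation here
print(c.shape)        # (4, 3)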

Course Structure

This course is divided into 3 weeks. We will serve the Qwen2-7B-Instruct model and optimize it throughout the course.

  • Week 1: serve Qwen2 with purely matrix manipulation APIs -- just Python.
  • Week 2: optimizations -- implement C++/Metal custom kernels to make the model run faster.
  • Week 3: more optimizations -- batch requests to serve the model with high throughput.

How to Use This Book

The thing you are reading right now is the tiny-llm book. It is designed as a guidebook rather than a textbook that explains everything from scratch. For each part of the course, we link to the materials on the Internet that the author(s) found useful when implementing the tiny-llm project. The Internet does a better job of explaining the concepts, and I do not think it is necessary to repeat everything here. Think of this as a guide (a list of tasks) plus some hints! We also unify the terminology of those materials so that it is easier to map them to the codebase. For example, we use a unified set of dimension symbols for tensors, so you do not need to figure out what H, L, and E stand for or what the dimensions of the matrices passed into each function are.

Community

You may join skyzh's Discord server and study with the tiny-llm community.


Get Started

Now, you can start to set up the environment following the instructions in Setting Up the Environment and begin your journey to build tiny-llm!


Setting Up the Environment

To follow along with this course, you will need a macOS device with Apple Silicon. We manage the codebase with the Poetry dependency manager.

Install Poetry

Please follow the official guide to install Poetry.

Clone the Repository

git clone https://github.com/skyzh/tiny-llm

The repository is organized as follows:

src/tiny_llm -- your implementation
src/tiny_llm_week1_ref -- reference implementation of week 1
tests/ -- unit tests for your implementation
tests_ref_impl_week1/ -- unit tests for the reference implementation of week 1
book/ -- the book

We provide all reference implementations and you can refer to them if you get stuck in the course.

Install Dependencies

cd tiny-llm
poetry install

Check the Installation

poetry run python check.py
# The reference solution should pass all the tests
poetry run pytest tests_ref_impl_week1

Run Unit Tests

Your code is in src/tiny_llm. You can run the unit tests with:

poetry run pytest tests

Download the Model Parameters

We will use the Qwen2-7B-Instruct model for this course. It takes ~20GB of memory in week 1 to load the model parameters. If you do not have enough memory, you can consider using the smaller 0.5B model. (We will make the course compatible with it in the future; meanwhile, you have to figure out things on your own if you use the 0.5B model. Likely, this only matters after week 1 day 6 when you start to load the model parameters.)

Follow the Hugging Face installation guide to install the huggingface-cli. You should install it in your user directory/globally instead of in the tiny-llm virtual environment created by poetry.

The model parameters are hosted on Hugging Face. Once you have authenticated the CLI with your credentials, you can download them with:

# do not do this in the virtual environment created by poetry; do `deactivate` first if you did `poetry shell`
huggingface-cli login
huggingface-cli download Qwen/Qwen2-7B-Instruct-MLX

Then, you can run:

poetry run python main_ref_impl_week1.py

It should load the model and print some text.


Week 1: From Matmul to Text

In this week, we will start from basic matrix operations and see how these matrix manipulations can turn the Qwen2 model parameters into a model that generates text. We will implement the neural network layers used in the Qwen2 model using MLX's matrix APIs.

We will use the Qwen2-7B-Instruct model for this week. As we need to dequantize the model parameters, the ~4GB quantized model needs about 20GB of memory in week 1: the roughly 7B parameters alone take ~15GB in float16, plus dequantization buffers and framework overhead. If you do not have enough memory, you can consider using the smaller 0.5B model (we do not have the infra to test it, so you will need to figure things out on your own, unfortunately).

What We will Cover

  • Attention, Multi-Head Attention, and Grouped/Multi Query Attention
  • Positional Embeddings and RoPE
  • Put the attention layers together and implement the whole Transformer block
  • Implement the MLP layer and the whole Transformer model
  • Load the Qwen2 model parameters and generate text

What We will Not Cover

To make the journey as interesting as possible, we will skip a few things for now:

  • How to quantize/dequantize a model -- that will be part of week 2. The Qwen2 model we download is quantized, so we will need to dequantize the parameters before we can use them in our layer implementations.
  • We still use a few APIs beyond matrix manipulations -- like softmax, exp, log, etc. They are simple, and not implementing them ourselves does not affect the learning experience.
  • Tokenizer -- we will not implement the tokenizer from scratch. We will use the mlx_lm tokenizer to tokenize the input.
  • Loading the model weights -- I don't think it's an interesting thing to learn how to decode those tensor dump files, so we will use mlx_lm to load the model and steal the weights from the loaded model into our layer implementations (a rough sketch follows this list).
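
For reference, here is a rough sketch of what that weight-stealing can look like with mlx_lm. The repo ID matches the model downloaded in the setup chapter; the loop just prints the parameter tree so you can see which weights to copy into your own layers (the exact parameter names are model-specific).

# Rough sketch: load the model with mlx_lm and inspect its parameters.
# Explore the printed tree to find the weights to copy into your own layers.
from mlx_lm import load
from mlx.utils import tree_flatten

model, tokenizer = load("Qwen/Qwen2-7B-Instruct-MLX")

for name, param in tree_flatten(model.parameters()):
    print(name, param.shape)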


Week 1 Day 1: Attention and Multi-Head Attention

In day 1, we will implement the basic attention layer and the multi-head attention layer. Attention layers take an input sequence and focus on different parts of the sequence when generating the output. Attention layers are the key building blocks of Transformer models.

📚 Reading: Transformer Architecture

We use the Qwen2 model for text generation. The model is a decoder-only model. The input of the model is a sequence of token embeddings. The output of the model is the most likely next token ID.

📚 Reading: LLM Inference, the Decode Phase
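
To make the decode phase concrete, here is a toy greedy-decoding loop. The toy_model below is a stand-in that returns random logits over a made-up vocabulary; in the real pipeline, the Qwen2 model you build this week produces the logits, and the loop keeps appending the most likely next token ID.

import mlx.core as mx

VOCAB_SIZE = 32  # made-up tiny vocabulary, just for illustration

def toy_model(token_ids: mx.array) -> mx.array:
    # Stand-in for the real model: returns logits of shape (seq_len, vocab_size).
    return mx.random.normal(shape=(token_ids.shape[0], VOCAB_SIZE))

tokens = mx.array([1, 5, 7])  # the tokenized prompt
for _ in range(8):
    logits = toy_model(tokens)           # forward pass over the whole sequence
    next_token = mx.argmax(logits[-1])   # greedy: pick the most likely next token ID
    tokens = mx.concatenate([tokens, next_token[None].astype(tokens.dtype)])
print(tokens)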

Back to the attention layer. The attention layer takes a query, a key, and a value. In a classic implementation, all of them are of the same shape: N.. x L x D.

N.. denotes zero or more batch dimensions. Within each batch, L is the sequence length and D is the dimension of the embedding for a given head in the sequence.

So, for example, if we have a sequence of 1024 tokens, where each token has a 512-dimensional embedding (head_dim), we will pass a tensor of the shape N.. x 1024 x 512 to the attention layer.

Task 1: Implement scaled_dot_product_attention

📚 Readings

Implement scaled_dot_product_attention. The function takes key, value, and query of the same dimensions.

L is seq_len, in PyTorch API it's S (source len)
D is head_dim

K: N.. x L x D
V: N.. x L x D
Q: N.. x L x D
output: N.. x L x D

You may use the softmax provided by MLX for now; you will implement it yourself later in week 2.
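
For reference, a minimal sketch of the math is below. The (query, key, value, scale, mask) parameter list is an assumption -- check the function stub in src/tiny_llm for the exact signature the tests expect.

import math
import mlx.core as mx

def sdpa_sketch(query: mx.array, key: mx.array, value: mx.array,
                scale: float | None = None, mask: mx.array | None = None) -> mx.array:
    # query/key/value: N.. x L x D; attention scores: N.. x L x L
    factor = scale if scale is not None else 1.0 / math.sqrt(query.shape[-1])
    scores = mx.matmul(query, key.swapaxes(-1, -2)) * factor
    if mask is not None:
        scores = scores + mask                # additive mask, e.g. -inf on masked positions
    weights = mx.softmax(scores, axis=-1)     # normalize over the key/sequence dimension
    return mx.matmul(weights, value)          # N.. x L x D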

Because we are always using the attention layer within the multi-head attention layer, the actual tensor shape when serving the model will be:

K: 1 x H x L x D
V: 1 x H x L x D
Q: 1 x H x L x D
output: 1 x H x L x D

...though the attention layer only cares about the last two dimensions. The test cases will cover arbitrary shapes of the batching dimensions.

At the end of this task, you should be able to pass the following tests:

poetry run pytest tests -k test_attention_simple
poetry run pytest tests -k test_attention_with_mask

Task 2: Implement MultiHeadAttention

📚 Readings

Implement MultiHeadAttention. The layer takes a batch of vectors x, maps it through the K, V, and Q weight matrices, and uses the scaled_dot_product_attention function we implemented in Task 1 to compute the result. The output needs to be mapped using the O weight matrix.

You will also need to implement the linear function first. linear takes a tensor of the shape N.. x I, a weight matrix of the shape O x I, and a bias vector of the shape O. The output is of the shape N.. x O. I is the input dimension and O is the output dimension.
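
A sketch of that contract (with the bias optional), just to pin down the shapes:

import mlx.core as mx

def linear_sketch(x: mx.array, w: mx.array, bias: mx.array | None = None) -> mx.array:
    # x: N.. x I, w: O x I, bias: O  ->  output: N.. x O
    y = mx.matmul(x, w.swapaxes(-1, -2))
    if bias is not None:
        y = y + bias
    return y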

For the MultiHeadAttention layer, the input tensor x has the shape N x L x E, where E is the dimension of the embedding for a given token in the sequence. The K/Q/V weight matrices map the tensor into key, value, and query separately, where the dimension E is mapped into a dimension of size H x D, which means that the token embedding gets mapped into H heads, each with a dimension of D. You can directly reshape the tensor to split the H x D dimension into two dimensions of H and D to get H heads for the token. Then, apply the attention function to each of the heads (this requires a transpose, using swapaxes in mlx). The attention function takes N.. x H x L x D as input so that it produces an output for each head of each token. Then, you can transpose it into N.. x L x H x D and reshape it so that all heads get merged back together with a shape of N.. x L x (H x D). Map it through the output weight matrix to get the final output. A rough code sketch follows the symbol list below.

E is hidden_size or embed_dim or dims or model_dim
H is num_heads
D is head_dim
L is seq_len, in PyTorch API it's S (source len)

W_q/k/v: E x (H x D)
output/x: N x L x E
W_o: (H x D) x E
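
Putting the description above into code, here is a rough sketch of the forward pass. Names are illustrative and the constructor is omitted -- the real class interface is defined by the stub in src/tiny_llm. The sketch applies the projections with linear's x @ w^T convention; since E equals H x D here the projection matrices are square, so double-check the orientation against the shape list above. Bias terms are left out for brevity.

import math
import mlx.core as mx

def mha_forward_sketch(x: mx.array, wq: mx.array, wk: mx.array, wv: mx.array,
                       wo: mx.array, num_heads: int) -> mx.array:
    # x: N x L x E
    N, L, E = x.shape
    D = E // num_heads
    # Project into query/key/value, then split the (H x D) dimension into heads.
    q = mx.matmul(x, wq.swapaxes(-1, -2)).reshape(N, L, num_heads, D).swapaxes(1, 2)  # N x H x L x D
    k = mx.matmul(x, wk.swapaxes(-1, -2)).reshape(N, L, num_heads, D).swapaxes(1, 2)
    v = mx.matmul(x, wv.swapaxes(-1, -2)).reshape(N, L, num_heads, D).swapaxes(1, 2)
    scores = mx.matmul(q, k.swapaxes(-1, -2)) / math.sqrt(D)   # N x H x L x L
    out = mx.matmul(mx.softmax(scores, axis=-1), v)            # N x H x L x D
    out = out.swapaxes(1, 2).reshape(N, L, E)                  # merge heads: N x L x (H x D)
    return mx.matmul(out, wo.swapaxes(-1, -2))                 # project back: N x L x E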

At the end of the day, you should be able to pass the following tests:

poetry run pytest tests -k test_multi_head_attention


Glossary Index

This index lists which day each piece of functionality is covered in.

Your feedback is greatly appreciated. Welcome to join our Discord Community.
Found an issue? Create an issue / pull request on github.com/skyzh/tiny-llm.
tiny-llm-book © 2025 by Alex Chi Z is licensed under CC BY-NC-SA 4.0.