Preface
This course is designed for systems engineers who want to understand how LLMs work.
As a systems engineer, I always wonder how things work internally and how to optimize them. I had a hard time figuring out how LLM serving works: most of the open-source projects that serve LLMs are heavily optimized with CUDA kernels and other low-level tricks, and it is not easy to get the whole picture from a codebase of 100k lines of code. Therefore, I decided to implement an LLM serving project from scratch -- with only matrix manipulation APIs -- so that I can understand what it takes to load the LLM model parameters and do the math magic to generate text.
You can think of this course as an LLM version of CMU Deep Learning Systems course's needle project.
Prerequisites
You should have some experience with the basics of deep learning and have some idea of how PyTorch works. Some recommended resources are:
- CMU Intro to Machine Learning -- this course teaches you the basics of machine learning
- CMU Deep Learning Systems -- this course teaches you how to build PyTorch from scratch
Environment Setup
This course uses MLX, an array/machine learning library for Apple Silicon. Nowadays it is much easier to get an Apple Silicon device than an NVIDIA GPU. In theory you could also do this course with PyTorch or NumPy, but we just don't have the test infrastructure to support them. We test your implementation against PyTorch's CPU implementation and MLX's implementation to ensure correctness.
Course Structure
This course is divided into 3 weeks. We will serve the Qwen2-7B-Instruct model and optimize it throughout the course.
- Week 1: serve Qwen2 with purely matrix manipulation APIs. Just Python.
- Week 2: optimizations, implement C++/Metal custom kernels to make the model run faster.
- Week 3: more optimizations, batch the requests to serve the model with high throughput.
How to Use This Book
The thing you are reading right now is the tiny-llm book. It is designed more like a guidebook than a textbook that explains everything from scratch. In this course, we provide the materials that we found useful on the Internet when the author(s) implemented the tiny-llm project. The Internet does a better job of explaining the concepts, and we do not think it is necessary to repeat everything here. Think of this as a guide (a list of tasks) plus some hints!

We will also unify the language of the Internet materials so that it is easier to map them to the codebase. For example, we use a unified set of dimension symbols for tensors, so you do not need to figure out what `H`, `L`, or `E` stands for, or which dimensions of the matrices are passed into each function.
About the Authors
This course is created by Chi and Connor.
Chi is a systems software engineer at Neon, acquired by Databricks, focusing on storage systems. Fascinated by the vibe of large language models (LLMs), he created this course to explore how LLM inference works.
Community
You may join skyzh's Discord server and study with the tiny-llm community.
Get Started
Now, you can start to set up the environment following the instructions in Setting Up the Environment and begin your journey to build tiny-llm!
Your feedback is greatly appreciated. Welcome to join our Discord Community.
Found an issue? Create an issue / pull request on github.com/skyzh/tiny-llm.
tiny-llm-book Β© 2025 by Alex Chi Z is licensed under CC BY-NC-SA 4.0.
Setting Up the Environment
To follow along with this course, you will need a Macintosh device with Apple Silicon. We manage the codebase with pdm.
Install pdm
Please follow the official guide to install pdm.
Clone the Repository
git clone https://github.com/skyzh/tiny-llm
The repository is organized as follows:

- `src/tiny_llm` -- your implementation
- `src/tiny_llm_week1_ref` -- reference implementation of week 1
- `tests/` -- unit tests for your implementation
- `tests_ref_impl_week1/` -- unit tests for the reference implementation of week 1
- `book/` -- the book
We provide all reference implementations and you can refer to them if you get stuck in the course.
Install Dependencies
cd tiny-llm
pdm install -v # this will automatically create a virtual environment and install all dependencies
Check the Installation
pdm run python check.py
# The reference solution should pass all the *week 1* tests
pdm run test-refsol -- -- -k week_1
Run Unit Tests
Your code is in `src/tiny_llm`. You can run the unit tests with:
pdm run test
Download the Model Parameters
We will use the Qwen2-7B-Instruct model for this course. It takes ~20GB of memory in week 1 to load the model parameters. If you do not have enough memory, you can consider using the smaller 0.5B model.
Follow the guide on this page to install the Hugging Face CLI.
The model parameters are hosted on Hugging Face. Once you have authenticated the CLI with your credentials, you can download them with:
huggingface-cli login
huggingface-cli download Qwen/Qwen2-7B-Instruct-MLX
Then, you can run:
pdm run main --solution ref --loader week1
It should load the model and print some text.
In week 2, we will write some kernels in C++/Metal, and we will need to set up additional tools for that. We will cover it later.
Week 1: From Matmul to Text
This book is not complete and this chapter is not finalized yet. We might switch to Qwen3 in the final version of the course.
In this week, we will start from basic matrix operations and see how these matrix manipulations can turn the Qwen2 model parameters into a model that generates text. We will implement the neural network layers used in the Qwen2 model using MLX's matrix APIs.
We will use the Qwen2-7B-Instruct model for this week. As we need to dequantize the model parameters, the ~4GB download takes about 20GB of memory to load in week 1. If you do not have enough memory, you can consider using the smaller 0.5B model.
The MLX version of the Qwen2-7B-Instruct model we downloaded in the setup is an int4 quantized version of the original bfloat16 model.
What We Will Cover
- Attention, Multi-Head Attention, and Grouped/Multi Query Attention
- Positional Embeddings and RoPE
- Put the attention layers together and implement the whole Transformer block
- Implement the MLP layer and the whole Transformer model
- Load the Qwen2 model parameters and generate text
What We Will Not Cover
To make the journey as interesting as possible, we will skip a few things for now:
- How to quantize/dequantize a model -- that will be part of week 2. The Qwen2 model is quantized so we will need to dequantize them before we can use them in our layer implementations.
- We still use a few APIs other than matrix manipulations -- like softmax, exp, and log. They are simple, and not implementing them yourself does not affect the learning experience.
- Tokenizer -- we will not implement the tokenizer from scratch. We will use the `mlx_lm` tokenizer to tokenize the input.
- Loading the model weights -- we don't think it is an interesting thing to learn how to decode those tensor dump files, so we will use `mlx_lm` to load the model and steal the weights from the loaded model into our layer implementations.
Basic Matrix APIs
Although MLX does not offer an introductory guide for beginners, its Python API is designed to be highly compatible with NumPy. To get started, you can refer to NumPy: The Absolute Basics for Beginners to learn essential matrix operations.
You can also refer to the MLX Operations API for more details.
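For instance, the following quick check (a minimal sketch; it assumes only the `mlx` package installed by `pdm install`) shows how closely the MLX API mirrors NumPy:

```python
import mlx.core as mx

# Array creation, reshaping, and matmul look just like NumPy.
a = mx.arange(6, dtype=mx.float32).reshape(2, 3)   # shape: 2 x 3
b = mx.ones((3, 2))                                 # shape: 3 x 2

c = a @ b                      # matrix multiplication, shape: 2 x 2
s = mx.softmax(c, axis=-1)     # row-wise softmax, each row sums to 1

print(c)                       # MLX is lazy; printing forces evaluation
print(s.sum(axis=-1))          # should print ones
```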
Qwen2 Models
You can try the Qwen2 model with MLX/vLLM. You can read the blog post below to have some idea of what we will build within this course. At the end of this week, we will be able to chat with the model -- that is to say, use Qwen2 to generate text, as a causal language model.
The reference implementation of the Qwen2 model can be found in huggingface transformers, vLLM, and mlx-lm. You may utilize these resources to better understand the internals of the model and what we will implement in this week.
📚 Readings
- Qwen2.5: A Party of Foundation Models!
- Key Concepts of the Qwen2 Model
- Huggingface Transformers - Qwen2
- vLLM Qwen2
- mlx-lm Qwen2
- Qwen2 Technical Report
- Qwen2.5 Technical Report
Week 1 Day 1: Attention and Multi-Head Attention
This book is not complete and this chapter is not finalized yet. We are still working on the reference solution, writing tests, and unifying the math notation in the book.
In day 1, we will implement the basic attention layer and the multi-head attention layer. Attention layers take an input sequence and focus on different parts of the sequence when generating the output. Attention layers are the key building blocks of Transformer models.
📚 Reading: Transformer Architecture
We use the Qwen2 model for text generation. The model is a decoder-only model. The input of the model is a sequence of token embeddings. The output of the model is the most likely next token ID.
📚 Reading: LLM Inference, the Decode Phase
Back to the attention layer. The attention layer takes a query, a key, and a value. In a classic implementation, all of them are of the same shape: `N.. x L x D`.

`N..` is zero or more dimensions for batches. Within each batch, `L` is the sequence length and `D` is the dimension of the embedding for a given head in the sequence.

So, for example, if we have a sequence of 1024 tokens, where each token has a 512-dimensional embedding per head (`head_dim`), we will pass a tensor of the shape `N.. x 1024 x 512` to the attention layer.
Task 1: Implement scaled_dot_product_attention_simple
In this task, we will implement the scaled dot product attention function. We assume the input tensors (Q, K, V) have the same dimensions. In the next few chapters, we will support more variants of attentions that might not have the same dimensions for all tensors.
src/tiny_llm/attention.py
📚 Readings
- Annotated Transformer
- PyTorch Scaled Dot Product Attention API (assume `enable_gqa=False`, and assume `dim_k=dim_v=dim_q` and `H_k=H_v=H_q`)
- MLX Scaled Dot Product Attention API (assume `dim_k=dim_v=dim_q` and `H_k=H_v=H_q`)
- Attention is All You Need
Implement `scaled_dot_product_attention`, following the attention function below. The function takes key, value, and query of the same dimensions, plus an optional mask matrix `M`:

$$
\text{Attention}(Q, K, V) = \text{softmax}\left(Q K^\top \times \text{scale} + M\right) V
$$

Note that $\text{scale} = \frac{1}{\sqrt{D}}$ by default; the user might specify their own scale factor or use the default one.
L is seq_len, in PyTorch API it's S (source len)
D is head_dim
key: N.. x L x D
value: N.. x L x D
query: N.. x L x D
output: N.. x L x D
scale = 1/sqrt(D) if not specified
You may use the `softmax` provided by MLX for now and implement it yourself in week 2.
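For orientation, here is a minimal sketch of what the function can look like in MLX (a sketch under the shape assumptions above, not the reference solution; the exact signature in `src/tiny_llm/attention.py` may differ):

```python
import math
import mlx.core as mx


def scaled_dot_product_attention_simple(
    query: mx.array,
    key: mx.array,
    value: mx.array,
    scale: float | None = None,
    mask: mx.array | None = None,
) -> mx.array:
    # query/key/value: N.. x L x D; mask (if given): broadcastable to N.. x L x L
    factor = scale if scale is not None else 1.0 / math.sqrt(query.shape[-1])
    scores = (query @ mx.swapaxes(key, -1, -2)) * factor   # N.. x L x L
    if mask is not None:
        scores = scores + mask
    weights = mx.softmax(scores, axis=-1)
    return weights @ value                                  # N.. x L x D
```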
Because we are always using the attention layer within the multi-head attention layer, the actual tensor shape when serving the model will be:
key: 1 x H x L x D
value: 1 x H x L x D
query: 1 x H x L x D
output: 1 x H x L x D
mask: 1 x H x L x L
...though the attention layer only cares about the last two dimensions. The test cases will exercise arbitrary shapes for the batching dimensions.
At the end of this task, you should be able to pass the following tests:
pdm run test --week 1 --day 1 -- -k task_1
Task 2: Implement SimpleMultiHeadAttention
In this task, we will implement the multi-head attention layer.
src/tiny_llm/attention.py
📚 Readings
- Annotated Transformer
- PyTorch MultiheadAttention API (assume `dim_k=dim_v=dim_q` and `H_k=H_v=H_q`)
- MLX MultiHeadAttention API (assume `dim_k=dim_v=dim_q` and `H_k=H_v=H_q`)
- The Illustrated GPT-2 (Visualizing Transformer Language Models) helps you better understand what key, value, and query are.
Implement `SimpleMultiHeadAttention`. The layer takes a batch of vectors, maps them through the K, V, and Q weight matrices, and uses the attention function we implemented in Task 1 to compute the result. The output needs to be mapped using the O weight matrix.

You will also need to implement the `linear` function in `basics.py` first. `linear` takes a tensor of shape `N.. x I`, a weight matrix of shape `O x I`, and a bias vector of shape `O`. The output is of shape `N.. x O`. `I` is the input dimension and `O` is the output dimension.
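As a reference point, a `linear` that follows the `O x I` weight convention above can be as small as this (a sketch; the actual signature in `basics.py` may differ):

```python
import mlx.core as mx


def linear(x: mx.array, w: mx.array, bias: mx.array | None = None) -> mx.array:
    # x: N.. x I, w: O x I, bias: O -> output: N.. x O
    y = x @ w.T
    if bias is not None:
        y = y + bias
    return y
```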
For the `SimpleMultiHeadAttention` layer, the input tensors `query`, `key`, and `value` have the shape `N x L x E`, where `E` is the dimension of the embedding for a given token in the sequence. The K/Q/V weight matrices map the tensor into key, value, and query separately, where the dimension `E` is mapped into a dimension of size `H x D`. That is, the token embedding gets mapped into `H` heads, each with a dimension of `D`. You can directly reshape the tensor to split the `H x D` dimension into two dimensions of `H` and `D` to get `H` heads for the token.

Now, you have a tensor of the shape `N.. x L x H x D` for each of the key, value, and query. To apply the attention function, you first need to transpose them into shape `N.. x H x L x D`.

- This makes each attention head an independent batch, so that attention can be calculated separately for each head across the sequence `L`.
- If you kept `H` behind `L`, the attention calculation would mix the head and sequence dimensions, which is not what we want -- each head should focus only on the relationships between tokens in its own subspace.

The attention function produces an output for each head of each token. Then, you can transpose it back into `N.. x L x H x D` and reshape it so that all heads get merged back together with a shape of `N.. x L x (H x D)`. Map it through the output weight matrix to get the final output.
E is hidden_size or embed_dim or dims or model_dim
H is num_heads
D is head_dim
L is seq_len, in PyTorch API it's S (source len)
w_q/w_k/w_v: E x (H x D)
output/input: N x L x E
w_o: (H x D) x E
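Putting the pieces together, the forward pass can look roughly like this (a sketch that reuses the `linear` and `scaled_dot_product_attention_simple` sketches above; the constructor arguments in `src/tiny_llm/attention.py` may differ):

```python
import mlx.core as mx


class SimpleMultiHeadAttention:
    def __init__(self, hidden_size: int, num_heads: int,
                 wq: mx.array, wk: mx.array, wv: mx.array, wo: mx.array):
        self.num_heads = num_heads
        self.head_dim = hidden_size // num_heads
        self.wq, self.wk, self.wv, self.wo = wq, wk, wv, wo

    def __call__(self, query: mx.array, key: mx.array, value: mx.array,
                 mask: mx.array | None = None) -> mx.array:
        N, L, _ = query.shape
        H, D = self.num_heads, self.head_dim
        # Project, then split E into H heads of size D: N x L x E -> N x H x L x D
        q = linear(query, self.wq).reshape(N, L, H, D).transpose(0, 2, 1, 3)
        k = linear(key, self.wk).reshape(N, L, H, D).transpose(0, 2, 1, 3)
        v = linear(value, self.wv).reshape(N, L, H, D).transpose(0, 2, 1, 3)
        # Attention per head, then merge heads back: N x H x L x D -> N x L x (H x D)
        out = scaled_dot_product_attention_simple(q, k, v, mask=mask)
        out = out.transpose(0, 2, 1, 3).reshape(N, L, H * D)
        return linear(out, self.wo)
```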
At the end of the task, you should be able to pass the following tests:
pdm run test --week 1 --day 1 -- -k task_2
You can run all tests for the day with:
pdm run test --week 1 --day 1
Week 1 Day 2: Positional Encodings and RoPE
This book is not complete and this chapter is not finalized yet. We are still working on the reference solution, writing tests, and unifying the math notation in the book.
In day 2, we will implement the positional embedding used in the Qwen2 model: Rotary Positional Encoding (RoPE). In a transformer model, we need a way to embed the information of the position of a token into the input of the attention layers. In Qwen2, the positional embedding is applied within the multi-head attention layer to the query and key vectors.
📚 Readings
- You could have designed state of the art positional encoding
- RoFormer: Enhanced Transformer with Rotary Position Embedding
Task 1: Implement Rotary Positional Encoding "RoPE"
You will need to modify the following file:
src/tiny_llm/positional_encoding.py
In traditional RoPE (as described in the readings), the positional encoding is applied to each head of the query and key vectors.
You can pre-compute the frequencies when initializing the `RoPE` class.

If `offset` is not provided, the positional encoding is applied to the entire sequence: the 0th frequency is applied to the 0th token, up to the (L-1)-th token. Otherwise, the positional encoding is applied to the sequence according to the offset slice. If the offset slice is 5..10, then the sequence length provided to the layer would be 5, and the 0th token will be applied with the 5th frequency.

You only need to consider `offset` being `None` or a single slice. The `list[slice]` case will be implemented when we start implementing the continuous batching feature. Assume all batches provided use the same offset.
x: (N, L, H, D)
cos/sin_freqs: (MAX_SEQ_LEN, D // 2)
In the traditional form of RoPE, each head on the dimension of `D` is viewed as consecutive complex pairs. That is to say, if D = 8, then x[0] and x[1] are a pair, x[2] and x[3] are another pair, and so on. A pair gets the same frequency from `cos/sin_freqs`.

Note that, practically, D can be even or odd. In the case of D being odd, the last dimension of `x` doesn't have a matching pair and is typically left untouched in most implementations. For simplicity, we just assume that D is always even.
output[0] = x[0] * cos_freqs[0] + x[1] * sin_freqs[0]
output[1] = x[0] * -sin_freqs[0] + x[1] * cos_freqs[0]
output[2] = x[2] * cos_freqs[1] + x[3] * sin_freqs[1]
output[3] = x[2] * -sin_freqs[1] + x[3] * cos_freqs[1]
...and so on
You can do this by reshaping `x` to (N, L, H, D // 2, 2) and then applying the above formula to each pair.
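Here is one way the traditional form can be sketched in MLX (a sketch following the pairing formula above; the default `base = 10000.0` and the exact class interface are assumptions, so check the tests for the expected signature):

```python
import mlx.core as mx


class RoPE:
    def __init__(self, dims: int, max_seq_len: int, base: float = 10000.0):
        # dims is the head dimension D (assumed even); one frequency per pair.
        half = dims // 2
        inv_freq = mx.power(base, -mx.arange(0, half, dtype=mx.float32) / half)
        t = mx.arange(max_seq_len, dtype=mx.float32)
        freqs = mx.outer(t, inv_freq)              # MAX_SEQ_LEN x (D // 2)
        self.cos_freqs = mx.cos(freqs)
        self.sin_freqs = mx.sin(freqs)

    def __call__(self, x: mx.array, offset: slice | None = None) -> mx.array:
        N, L, H, D = x.shape
        s = offset if offset is not None else slice(0, L)
        cos = self.cos_freqs[s].reshape(1, L, 1, D // 2)
        sin = self.sin_freqs[s].reshape(1, L, 1, D // 2)
        # View D as consecutive (x1, x2) pairs and rotate each pair.
        pairs = x.reshape(N, L, H, D // 2, 2)
        x1, x2 = pairs[:, :, :, :, 0], pairs[:, :, :, :, 1]
        out = mx.stack([x1 * cos + x2 * sin, -x1 * sin + x2 * cos], axis=-1)
        return out.reshape(N, L, H, D)
```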
📚 Readings
- PyTorch RotaryPositionalEmbeddings API
- MLX Implementation of RoPE before the custom metal kernel implementation
You can test your implementation by running the following command:
pdm run test --week 1 --day 2 -- -k task_1
Task 2: Implement RoPE in the non-traditional form
The Qwen2 model uses a non-traditional form of RoPE. In this form, the head embedding dimension is split into two halves, and the two halves are applied with different frequencies. Let's say `x1 = x[.., :HALF_DIM]` and `x2 = x[.., HALF_DIM:]`.
output[0] = x1[0] * cos_freqs[0] + x2[0] * sin_freqs[0]
output[HALF_DIM] = x1[0] * -sin_freqs[0] + x2[0] * cos_freqs[0]
output[1] = x1[1] * cos_freqs[1] + x2[1] * sin_freqs[1]
output[HALF_DIM + 1] = x1[1] * -sin_freqs[1] + x2[1] * cos_freqs[1]
...and so on
You can do this by directly getting the first half / second half of the embedding dimension of `x` and applying the frequencies to each half separately.
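In code, the non-traditional form only changes how the pairs are gathered and written back (a sketch that reuses the precomputed `cos`/`sin` slices from the class above):

```python
import mlx.core as mx


def rope_non_traditional(x: mx.array, cos: mx.array, sin: mx.array) -> mx.array:
    # x: N x L x H x D; cos/sin: broadcastable to N x L x H x (D // 2)
    half = x.shape[-1] // 2
    x1 = x[:, :, :, :half]    # first half of the head dimension
    x2 = x[:, :, :, half:]    # second half of the head dimension
    # Pair x1[i] with x2[i], following the formula above.
    out1 = x1 * cos + x2 * sin
    out2 = -x1 * sin + x2 * cos
    return mx.concatenate([out1, out2], axis=-1)
```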
📚 Readings
You can test your implementation by running the following command:
pdm run test --week 1 --day 2 -- -k task_2
At the end of the day, you should be able to pass all tests of this day:
pdm run test --week 1 --day 2
Week 1 Day 3: Grouped Query Attention (GQA)
This book is not complete and this chapter is not finalized yet. We are still working on the reference solution, writing tests, and unifying the math notation in the book.
In day 3, we will implement Grouped Query Attention (GQA). The Qwen2 models use GQA, an optimization technique for multi-head attention that reduces the computational and memory costs associated with the Key (K) and Value (V) projections. Instead of each Query (Q) head having its own K and V heads (as in Multi-Head Attention, MHA), multiple Q heads share the same K and V heads. Multi-Query Attention (MQA) is a special case of GQA where all Q heads share a single K/V head pair.
Readings
- GQA Paper (GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints)
- Qwen layers implementation in mlx-lm
- PyTorch API (the case where enable_gqa=True)
- torchtune.modules.MultiHeadAttention
Task 1: Implement scaled_dot_product_attention_grouped
You will need to modify the following file:
src/tiny_llm/attention.py
In this task, we will implement the grouped scaled dot product attention function, which forms the core of GQA.
Implement `scaled_dot_product_attention_grouped` in `src/tiny_llm/attention.py`. This function is similar to the standard scaled dot product attention, but handles the case where the number of query heads is a multiple of the number of key/value heads.

The overall flow is the same as the standard scaled dot product attention. The difference is that the K and V heads are shared across multiple Q heads. This means that instead of having `H_q` separate K and V heads, we have `H` K and V heads, and each K and V head is shared by `n_repeats = H_q // H` Q heads.
The core idea is to reshape `query`, `key`, and `value` so that the K and V tensors can be effectively broadcast to match the query heads within their groups during the matmul operations.
* Think about how to isolate the `H` and `n_repeats` dimensions in the `query` tensor.
* Consider adding a dimension of size 1 for `n_repeats` in the `key` and `value` tensors to enable broadcasting.
Then perform the scaled dot product attention calculation (matmul, scale, optional mask, softmax, matmul). Broadcasting should handle the head repetition implicitly.
Note that leveraging broadcasting instead of repeating the K and V tensors is more efficient. Broadcasting allows the same data to be used in multiple places without creating multiple copies of it, which saves memory and improves performance.
At last, don't forget to reshape the final result back to the expected output shape.
N.. is zero or more dimensions for batches
H_q is the number of query heads
H is the number of key/value heads (H_q must be divisible by H)
L is the query sequence length
S is the key/value sequence length
D is the head dimension
query: N.. x H_q x L x D
key: N.. x H x S x D
value: N.. x H x S x D
mask: N.. x H_q x L x S
output: N.. x H_q x L x D
Please note that besides the grouped heads, we also extend the implementation so that Q and K/V might not have the same sequence length.
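A minimal sketch of the broadcasting trick, assuming the shapes listed above (the `mask="causal"` string handling is added in Task 2):

```python
import math
import mlx.core as mx


def scaled_dot_product_attention_grouped(
    query: mx.array,
    key: mx.array,
    value: mx.array,
    scale: float | None = None,
    mask: mx.array | None = None,
) -> mx.array:
    # query: N.. x H_q x L x D; key/value: N.. x H x S x D
    *batch, H_q, L, D = query.shape
    H, S = key.shape[-3], key.shape[-2]
    n_repeats = H_q // H
    factor = scale if scale is not None else 1.0 / math.sqrt(D)

    # Isolate (H, n_repeats) in the query; give K/V a size-1 axis to broadcast over.
    q = query.reshape(*batch, H, n_repeats, L, D)
    k = key.reshape(*batch, H, 1, S, D)
    v = value.reshape(*batch, H, 1, S, D)

    scores = (q @ mx.swapaxes(k, -1, -2)) * factor      # N.. x H x n_repeats x L x S
    if mask is not None:
        scores = scores + mask.reshape(*batch, H, n_repeats, L, S)
    weights = mx.softmax(scores, axis=-1)
    out = weights @ v                                   # N.. x H x n_repeats x L x D
    return out.reshape(*batch, H_q, L, D)
```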
You can test your implementation by running the following command:
pdm run test --week 1 --day 3 -- -k task_1
Task 2: Causal Masking
Readings
In this task, we will implement causal masking for the grouped attention.

Causal masking is a technique that prevents the attention mechanism from attending to future tokens in the sequence. When `mask` is set to `causal`, we will apply the causal mask.
The causal mask is a matrix of shape `(L, S)`, where `L` is the query sequence length and `S` is the key/value sequence length. The mask is a lower triangular matrix: the elements on and below the (offset) diagonal are 0, and the elements above it are -inf. For example, if `L = 3` and `S = 5`, the mask will be:
0 0 0 -inf -inf
0 0 0 0 -inf
0 0 0 0 0
Please implement the `causal_mask` function in `src/tiny_llm/attention.py` and then use it in the `scaled_dot_product_attention_grouped` function. Also note that the diagonal position of our causal mask is different from the PyTorch API.
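One way to build such a mask (a sketch; it reproduces the `L = 3`, `S = 5` example above):

```python
import mlx.core as mx


def causal_mask(L: int, S: int, dtype=mx.float32) -> mx.array:
    # 1 on and below the offset diagonal: query i may attend to keys j <= i + (S - L).
    allowed = mx.tril(mx.ones((L, S)), k=S - L)
    # 0 where attention is allowed, -inf where it must be blocked.
    return mx.where(allowed == 1, 0.0, float("-inf")).astype(dtype)
```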
You can test your implementation by running the following command:
pdm run test --week 1 --day 3 -- -k task_2
Task 3: Qwen2 Grouped Query Attention
In this task, we will implement the Qwen2 Grouped Query Attention. You will need to modify the following file:
src/tiny_llm/qwen2_week1.py
`Qwen2MultiHeadAttention` implements the multi-head attention for Qwen2. You will need to implement the following pseudocode:
x: B, L, E
q = linear(x, wq, bq) -> B, L, H_q, D
k = linear(x, wk, bk) -> B, L, H, D
v = linear(x, wv, bv) -> B, L, H, D
q = rope(q, offset=slice(offset, offset + L))
k = rope(k, offset=slice(offset, offset + L))
(transpose as needed)
x = scaled_dot_product_attention_grouped(q, k, v, scale, mask) -> B, L, H_q, D ; Do this at float32 precision
(transpose as needed)
x = linear(x, wo) -> B, L, E
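Expressed with the building blocks from days 1-3, the forward pass could be sketched as follows (a sketch based on the pseudocode above; the attribute names, the constructor, and the exact `RoPE`/mask interfaces are assumptions):

```python
import mlx.core as mx


def qwen2_attention_forward(self, x: mx.array, offset: int,
                            mask: mx.array | str | None = None) -> mx.array:
    B, L, _ = x.shape
    H_q, H, D = self.num_heads, self.num_kv_heads, self.head_dim
    # Project and split heads: B x L x E -> B x L x H x D
    q = linear(x, self.wq, self.bq).reshape(B, L, H_q, D)
    k = linear(x, self.wk, self.bk).reshape(B, L, H, D)
    v = linear(x, self.wv, self.bv).reshape(B, L, H, D)
    # Apply RoPE with the position offset of the current decoding step.
    q = self.rope(q, offset=slice(offset, offset + L))
    k = self.rope(k, offset=slice(offset, offset + L))
    # B x L x H x D -> B x H x L x D for the grouped attention kernel.
    q, k, v = (t.transpose(0, 2, 1, 3) for t in (q, k, v))
    # Run the attention math in float32, then cast back to the input dtype.
    out = scaled_dot_product_attention_grouped(
        q.astype(mx.float32), k.astype(mx.float32), v.astype(mx.float32),
        scale=1.0 / (D ** 0.5), mask=mask,
    ).astype(x.dtype)
    out = out.transpose(0, 2, 1, 3).reshape(B, L, H_q * D)
    return linear(out, self.wo)
```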
You can test your implementation by running the following command:
pdm run test --week 1 --day 3 -- -k task_3
At the end of the day, you should be able to pass all tests of this day:
pdm run test --week 1 --day 3
Week 1 Day 4: RMSNorm and the Multi-Layer Perceptron (MLP)
This book is not complete and this chapter is not finalized yet. We are still working on the reference solution, writing tests, and unifying the math notation in the book.
In day 4, we will implement two crucial components of the Qwen2 Transformer architecture: RMSNorm and the MLP (Multi-Layer Perceptron) block, also known as the FeedForward Network. RMSNorm is a layer normalization technique that helps stabilize training with less computational overhead compared to traditional layer normalization. The MLP block is a feedforward network that processes the output of the attention layers, applying non-linear transformations to enhance the model's expressiveness.
Task 1: Implement RMSNorm
In this task, we will implement the RMSNorm
layer.
src/tiny_llm/layer_norm.py
📚 Readings
- Root Mean Square Layer Normalization
- Qwen2 layers implementation in mlx-lm (includes RMSNorm) -- see `Qwen2RMSNorm`.
RMSNorm is defined as:

$$
y = \frac{x}{\sqrt{\operatorname{mean}(x^2) + \epsilon}} \cdot \text{weight}
$$

Where:

- `x` is the input tensor.
- `weight` is a learnable scaling parameter.
- `epsilon` (eps) is a small constant added for numerical stability (e.g., 1e-5 or 1e-6).
- `mean(x^2)` is the sum of squares divided by the number of elements.
The normalization is applied independently to each sample's feature vector, typically over the last dimension of the input.
Note that the mean calculation should be performed with `float32` accumulation to maintain precision before taking the square root, even if the input and weights are in a lower-precision format (e.g., `float16` or `bfloat16`).
D is the embedding dimension.
x: N.. x D
weight: D
output: N.. x D
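A compact sketch of the layer (assuming the class in `src/tiny_llm/layer_norm.py` takes the weight and epsilon at construction time):

```python
import mlx.core as mx


class RMSNorm:
    def __init__(self, weight: mx.array, eps: float = 1e-6):
        self.weight = weight   # shape: D
        self.eps = eps

    def __call__(self, x: mx.array) -> mx.array:
        # Accumulate the mean of squares in float32 for numerical stability.
        x32 = x.astype(mx.float32)
        norm = mx.rsqrt(x32.square().mean(axis=-1, keepdims=True) + self.eps)
        return (x32 * norm).astype(x.dtype) * self.weight
```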
You can test your implementation by running:
pdm run test --week 1 --day 4 -- -k task_1
Task 2: Implement the MLP Block
In this task, we will implement the MLP block named `Qwen2MLP`.
src/tiny_llm/qwen2_week1.py
The original Transformer model utilized a simple Feed-Forward Network (FFN) within each block. This FFN typically consisted of two linear transformations with a ReLU activation in between, applied position-wise.
Modern Transformer architectures, including Qwen2, often employ more advanced FFN variants for improved performance. Qwen2 uses a specific type of Gated Linear Unit (GLU) called SwiGLU.
📚 Readings
- Attention is All You Need (Transformer Paper, Section 3.3 "Position-wise Feed-Forward Networks")
- GLU Paper (Language Modeling with Gated Convolutional Networks)
- SiLU (Swish) activation function
- SwiGLU Paper (GLU Variants Improve Transformer)
- PyTorch SiLU documentation
- Qwen2 layers implementation in mlx-lm (includes MLP)
Essentially, SwiGLU is a combination of GLU and the SiLU (Sigmoid Linear Unit) activation function:
- GLU is a gating mechanism that allows the model to learn which parts of the input to focus on. It typically involves an element-wise product of two linear projections of the input, one of which might be passed through an activation function. Compared to ReLU used in the original FFN, GLU can help the model learn more complex relationships in the data, deciding which features to keep and which to discard.
- SiLU (Sigmoid Linear Unit) is a smooth, non-monotonic activation function that has been shown to perform well in various deep learning tasks. Compared to the ReLU and sigmoid used in GLU, it is fully differentiable without the zero-gradient "dead zones", and it retains a non-zero output even for negative inputs.
You need to implement the `silu` function in `basics.py` first. For `silu`, it takes a tensor of the shape `N.. x I` and returns a tensor of the same shape.

The `silu` function is defined as:

$$
\operatorname{SiLU}(x) = x \cdot \sigma(x) = \frac{x}{1 + e^{-x}}
$$
Then implement `Qwen2MLP`. The structure of Qwen2's MLP block is:

- A gate linear projection ($W_{\text{gate}}$).
- An up linear projection ($W_{\text{up}}$).
- A SiLU activation function applied to the output of the gate projection.
- An element-wise multiplication of the SiLU-activated gate output and the up projection output. This forms the "gated" part.
- A final down linear projection ($W_{\text{down}}$).

This can be expressed as:

$$
\operatorname{MLP}(x) = W_{\text{down}}\left(\operatorname{SiLU}(W_{\text{gate}}\,x) \odot (W_{\text{up}}\,x)\right)
$$

where $\odot$ denotes element-wise multiplication. All linear projections in Qwen2's MLP are typically implemented without bias.
N.. is zero or more dimensions for batches
E is hidden_size (embedding dimension of the model)
I is intermediate_size (dimension of the hidden layer in MLP)
L is the sequence length
input: N.. x L x E
w_gate: I x E
w_up: I x E
w_down: E x I
output: N.. x L x E
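Here is a sketch of both pieces, reusing the `linear` from day 1 (a sketch; the constructor arguments are assumptions, so check the tests for the expected signature):

```python
import mlx.core as mx


def silu(x: mx.array) -> mx.array:
    # SiLU(x) = x * sigmoid(x), applied element-wise.
    return x * mx.sigmoid(x)


class Qwen2MLP:
    def __init__(self, w_gate: mx.array, w_up: mx.array, w_down: mx.array):
        self.w_gate = w_gate   # I x E
        self.w_up = w_up       # I x E
        self.w_down = w_down   # E x I

    def __call__(self, x: mx.array) -> mx.array:
        # down( SiLU(gate(x)) * up(x) ), with no bias on any projection.
        return linear(silu(linear(x, self.w_gate)) * linear(x, self.w_up), self.w_down)
```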
You can test your implementation by running:
pdm run test --week 1 --day 4 -- -k task_2
At the end of the day, you should be able to pass all tests of this day:
pdm run test --week 1 --day 4
Glossary Index
- Scaled Dot Product Attention
- Multi Head Attention
- Linear
- Rotary Positional Encoding
- Grouped Query Attention
- RMSNorm
- SiLU
- SwiGLU
- MLP