Week 1 Day 4: RMSNorm and Multi-Layer Perceptron

In day 4, we will implement two crucial components of the Qwen2 Transformer architecture: RMSNorm and the MLP (Multi-Layer Perceptron) block, also known as the FeedForward Network. RMSNorm is a layer normalization technique that helps stabilize training with less computational overhead compared to traditional layer normalization. The MLP block is a feedforward network that processes the output of the attention layers, applying non-linear transformations to enhance the model's expressiveness.

Task 1: Implement `RMSNorm`

In this task, we will implement the RMSNorm layer.

src/tiny_llm/layer_norm.py

📚 Readings

Root Mean Square Layer Normalization
Qwen2 layers implementation in mlx-lm (includes RMSNorm) - See RMSNorm.

RMSNorm is defined as:

$y = \frac{x}{\sqrt{\text{mean}(x^{2}) + \epsilon}} \cdot \text{weight}$

Where:

x is the input tensor.
weight is a learnable scaling parameter.
epsilon (eps) is a small constant added for numerical stability (e.g., 1e-5 or 1e-6).
mean(x^2) is the mean of the squared elements: the sum of squares divided by the number of elements along the normalized dimension.

The normalization is applied independently to each sample's feature vector, typically over the last dimension of the input. Note that the mean should be computed with float32 accumulation, before taking the square root, to maintain precision even if the input and weights are in a lower-precision format (e.g., float16 or bfloat16).

D is the embedding dimension. N.. is zero or more dimensions for batches.

x: N.. x D
weight: D
output: N.. x D
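
As a rough reference for the definition and shapes above, here is a minimal sketch that uses NumPy as a stand-in for MLX. The function name `rms_norm`, the default `eps`, and the point at which the result is cast back to the input dtype are assumptions made for illustration, not the course's reference solution.

```python
import numpy as np


def rms_norm(x: np.ndarray, weight: np.ndarray, eps: float = 1e-5) -> np.ndarray:
    """Illustrative RMSNorm over the last axis. x: N.. x D, weight: D."""
    orig_dtype = x.dtype
    # Upcast first so the mean of squares is accumulated in float32.
    x32 = x.astype(np.float32)
    mean_sq = np.mean(np.square(x32), axis=-1, keepdims=True)  # N.. x 1
    normed = x32 / np.sqrt(mean_sq + eps)
    # Scale by the learnable weight, then cast back (assumed cast position).
    return (normed * weight.astype(np.float32)).astype(orig_dtype)


# Two feature vectors of dimension D = 2, stored in float16.
x = np.array([[3.0, 4.0], [1.0, 1.0]], dtype=np.float16)
weight = np.ones(2, dtype=np.float16)
y = rms_norm(x, weight)
```

With a weight of all ones, each output row should have a mean of squares close to 1, which is a quick sanity check for your own implementation.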

You can test your implementation by running:

pdm run test --week 1 --day 4 -- -k task_1

Task 2: Implement the MLP Block

In this task, we will implement the MLP block named Qwen2MLP.

src/tiny_llm/qwen2_week1.py

The original Transformer model utilized a simple Feed-Forward Network (FFN) within each block. This FFN typically consisted of two linear transformations with a ReLU activation in between, applied position-wise.
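
For contrast with what follows, the original position-wise FFN can be sketched as below. This is only an illustration in NumPy: the name `relu_ffn` and the `out x in` weight layout are assumptions chosen to match the shape conventions used later on this page.

```python
import numpy as np


def relu_ffn(x, w1, b1, w2, b2):
    """Original Transformer FFN: max(0, x W1 + b1) W2 + b2, applied per position."""
    hidden = np.maximum(x @ w1.T + b1, 0.0)  # first linear layer, then ReLU
    return hidden @ w2.T + b2                # second linear layer


# One position with E = 2, hidden size I = 3.
x = np.array([[1.0, -1.0]])
w1 = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])  # I x E
b1 = np.zeros(3)
w2 = np.array([[1.0, 1.0, 1.0], [2.0, 2.0, 2.0]])    # E x I
b2 = np.zeros(2)
out = relu_ffn(x, w1, b1, w2, b2)
```

The pre-activation here is `[1, -1, 0]`; ReLU zeroes the negative entry, so only the first hidden unit contributes to the output.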

Modern Transformer architectures, including Qwen2, often employ more advanced FFN variants for improved performance. Qwen2 uses a specific type of Gated Linear Unit (GLU) called SwiGLU.

📚 Readings

Essentially, SwiGLU is a combination of GLU and the SiLU (Sigmoid Linear Unit) activation function:

GLU is a gating mechanism that allows the model to learn which parts of the input to focus on. It typically involves an element-wise product of two linear projections of the input, one of which might be passed through an activation function. Compared to ReLU used in the original FFN, GLU can help the model learn more complex relationships in the data, deciding which features to keep and which to discard.
SiLU (Sigmoid Linear Unit) is a smooth, non-monotonic activation function that has been shown to perform well in various deep learning tasks. Unlike ReLU, it is differentiable everywhere, has no zero-gradient "dead zone" for negative inputs, and retains a small non-zero output for negative inputs. In SwiGLU it takes the place of the sigmoid used in the original GLU.

You need to implement the silu function in basics.py first. It takes a tensor of shape N.. x I and returns a tensor of the same shape. The silu function is defined as: $\text{SiLU}(x) = x \cdot \text{sigmoid}(x) = \frac{x}{1 + e^{-x}}$
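
The definition translates almost directly into code. The sketch below uses NumPy rather than MLX, so treat it as an illustration of the formula and not as the expected contents of basics.py.

```python
import numpy as np


def silu(x: np.ndarray) -> np.ndarray:
    """SiLU(x) = x * sigmoid(x) = x / (1 + exp(-x)), applied element-wise."""
    return x / (1.0 + np.exp(-x))


x = np.array([-2.0, 0.0, 2.0])
y = silu(x)  # same shape as x
```

Note that the negative input maps to a small negative value rather than to zero, which is the behaviour that distinguishes SiLU from ReLU.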

Then implement Qwen2MLP. The structure for Qwen2's MLP block is:

A gate linear projection ( $W_{gate}$ ).
An up linear projection ( $W_{up}$ ).
A SiLU activation function applied to the output of $W_{gate}$ .
An element-wise multiplication of the SiLU-activated $W_{gate}$ output and the $W_{up}$ output. This forms the "gated" part.
A final down linear projection ( $W_{down}$ ).

This can be expressed as: $\text{MLP}(x) = W_{down}(\text{SiLU}(W_{gate}(x)) \odot W_{up}(x))$, where $\odot$ denotes element-wise multiplication and $W(x)$ denotes applying the corresponding linear projection to $x$. All linear projections in Qwen2's MLP are typically implemented without bias.

N.. is zero or more dimensions for batches
E is hidden_size (embedding dimension of the model)
I is intermediate_size (dimension of the hidden layer in MLP)
L is the sequence length

input: N.. x L x E
w_gate: I x E
w_up: I x E
w_down: E x I
output: N.. x L x E
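
Putting the structure and shapes above together, a minimal NumPy sketch of the block might look like this. The helpers `silu` and `linear` and the function `qwen2_mlp` are hypothetical stand-ins written for this example; the course's `Qwen2MLP` class and its linear helper may be organized differently.

```python
import numpy as np


def silu(x):
    """SiLU(x) = x / (1 + exp(-x)), element-wise."""
    return x / (1.0 + np.exp(-x))


def linear(x, w):
    """Bias-free linear layer: x is N.. x in, w is out x in, result is N.. x out."""
    return x @ w.T


def qwen2_mlp(x, w_gate, w_up, w_down):
    gate = silu(linear(x, w_gate))     # N.. x L x I
    up = linear(x, w_up)               # N.. x L x I
    return linear(gate * up, w_down)   # N.. x L x E


# Small random example: batch N = 2, L = 3, E = 4, I = 8.
rng = np.random.default_rng(0)
N, L, E, I = 2, 3, 4, 8
x = rng.standard_normal((N, L, E))
w_gate = rng.standard_normal((I, E))
w_up = rng.standard_normal((I, E))
w_down = rng.standard_normal((E, I))
out = qwen2_mlp(x, w_gate, w_up, w_down)
```

Because the weights are stored as `out x in`, each projection multiplies by the transposed weight, and the output has the same shape as the input.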

You can test your implementation by running:

pdm run test --week 1 --day 4 -- -k task_2

At the end of the day, you should be able to pass all tests of this day:

pdm run test --week 1 --day 4

Your feedback is greatly appreciated. You are welcome to join our Discord community.
Found an issue? Create an issue / pull request on github.com/skyzh/tiny-llm.
tiny-llm-book © 2025 by Alex Chi Z is licensed under CC BY-NC-SA 4.0.

