Week 1 Day 4: RMSNorm and Multi-Layer Perceptron

In day 4, we will implement two crucial components of the Qwen3 Transformer architecture: RMSNorm and the MLP (Multi-Layer Perceptron) block, also known as the FeedForward Network. RMSNorm is a layer normalization technique that helps stabilize training with less computational overhead compared to traditional layer normalization. The MLP block is a feedforward network that processes the output of the attention layers, applying non-linear transformations to enhance the model’s expressiveness.

Task 1: Implement RMSNorm

In this task, we will implement the RMSNorm layer.

src/tiny_llm/layer_norm.py

Day 3 used mx.fast.rms_norm directly so that the GQA chapter could stay focused on attention. This task implements the same normalization rule as a reusable layer. After this point, the transformer block, final model norm, and any Q/K normalization path can use your own RMSNorm implementation instead of treating normalization as a built-in API.

RMSNorm is defined as:

$$
y = \frac{x}{\sqrt{\text{mean}(x^2) + \epsilon}} \cdot \text{weight}
$$

Where:

  • x is the input tensor.
  • weight is a learnable scaling parameter.
  • epsilon (eps) is a small constant added for numerical stability (e.g., 1e-5 or 1e-6).
  • mean(x^2) is the sum of the squared elements divided by the number of elements, taken over the last dimension.

The normalization is applied independently to each sample’s feature vector, typically over the last dimension of the input. Note that the mean should be computed with float32 accumulation to maintain precision before taking the square root, even if the input and weights are in a lower-precision format (e.g., float16 or bfloat16). After computing the normalized value, cast it back to the original input dtype before applying weight. This matches the low-precision path used by MLX’s fast RMSNorm kernels: the normalization statistics are accumulated in float32, while the final scaling by weight happens in the model dtype.

D is the embedding dimension.

x: N.. x D
weight: D
output: N.. x D
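
The dtype handling described above can be sketched as follows. This is a minimal illustration written with NumPy rather than MLX, so it is an assumption-laden stand-in and not the reference solution; the function name `rms_norm` and the default `eps` value are chosen here for illustration only.

```python
import numpy as np


def rms_norm(x: np.ndarray, weight: np.ndarray, eps: float = 1e-5) -> np.ndarray:
    """Illustrative RMSNorm over the last dimension (NumPy stand-in for MLX)."""
    orig_dtype = x.dtype
    # Upcast so the mean of squares is accumulated in float32.
    x32 = x.astype(np.float32)
    mean_sq = np.mean(np.square(x32), axis=-1, keepdims=True)
    normed = x32 / np.sqrt(mean_sq + eps)
    # Cast back to the input dtype first, then scale by weight in that dtype.
    return normed.astype(orig_dtype) * weight


# x: N.. x D, weight: D, output: N.. x D
x = np.array([[3.0, 4.0], [1.0, 1.0]], dtype=np.float16)
weight = np.ones(2, dtype=np.float16)
y = rms_norm(x, weight)
print(y.shape, y.dtype)
```

In the actual task the same steps would be written with the MLX array API inside the RMSNorm class; only the order of operations (upcast, normalize, downcast, scale) is the point of this sketch.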

You can test your implementation by running:

pdm run test --week 1 --day 4 -- -k task_1

Task 2: Implement the MLP Block

In this task, we will implement the MLP block named Qwen3MLP.

src/tiny_llm/qwen3_week1.py

The original Transformer model utilized a simple Feed-Forward Network (FFN) within each block. This FFN typically consisted of two linear transformations with a ReLU activation in between, applied position-wise.

Modern Transformer architectures, including Qwen3, often employ more advanced FFN variants for improved performance. Qwen3 uses a specific type of Gated Linear Unit (GLU) called SwiGLU.

A plain FFN can be abstracted as:

h = activation(W_up(x))
out = W_down(h)

GLU keeps the same expand-then-project-back shape, but adds another projection that gates the intermediate features before W_down. This gives the MLP a learned, input-dependent way to control which intermediate channels matter, instead of only applying an activation to the same features produced by W_up.

SwiGLU is the GLU variant used by Qwen3:

u = W_up(x)
g = SiLU(W_gate(x))
out = W_down(g * u)

Essentially, SwiGLU is a combination of GLU and the SiLU (Sigmoid Linear Unit) activation function:

  • GLU is a gating mechanism that allows the model to learn which parts of the input to focus on. It typically involves an element-wise product of two linear projections of the input, one of which might be passed through an activation function. Compared to ReLU used in the original FFN, GLU can help the model learn more complex relationships in the data, deciding which features to keep and which to discard.
  • SiLU (Sigmoid Linear Unit) is a smooth, non-monotonic activation function that has been shown to perform well in various deep learning tasks. Compared to ReLU, and to the sigmoid used in the original GLU, it is differentiable everywhere, has no zero-gradient “dead zones”, and retains non-zero output even for negative inputs.

You need to implement the silu function in basics.py first. It takes a tensor of shape N.. x I and returns a tensor of the same shape. The silu function is defined as:

$$
\text{SiLU}(x) = x \cdot \text{sigmoid}(x) = \frac{x}{1 + e^{-x}}
$$

Compute the sigmoid part in a numerically stable way:

if x >= 0:
    sigmoid(x) = 1 / (1 + exp(-x))
else:
    sigmoid(x) = exp(x) / (1 + exp(x))

The negative branch is algebraically equivalent to the direct sigmoid formula, but it avoids exp(-x) becoming exp(large positive) when x is a large negative value. In vector code, this can be expressed with abs(x): compute the direct branch using |x|, then use 1 - y for negative inputs. That matches MLX’s low-precision GPU path more closely than the direct division form.
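
The abs(x) formulation can be sketched like this. It is a NumPy stand-in for the MLX version, written under the assumption that elementwise `exp`, `abs`, and `where` behave the same way in both libraries; the helper name `stable_sigmoid` is hypothetical and not part of the course code.

```python
import numpy as np


def stable_sigmoid(x: np.ndarray) -> np.ndarray:
    """Sigmoid that only ever calls exp on a non-positive argument."""
    # y = sigmoid(|x|); exp(-|x|) is at most 1, so it cannot overflow.
    y = 1.0 / (1.0 + np.exp(-np.abs(x)))
    # For negative inputs, sigmoid(x) = 1 - sigmoid(-x) = 1 - sigmoid(|x|).
    return np.where(x >= 0, y, 1.0 - y)


def silu(x: np.ndarray) -> np.ndarray:
    """SiLU(x) = x * sigmoid(x), same shape and dtype as the input."""
    return x * stable_sigmoid(x)


x = np.array([-1000.0, -1.0, 0.0, 1.0, 1000.0], dtype=np.float32)
out = silu(x)
print(out.shape, out.dtype)
```

One trade-off of this form is that 1 - y rounds to zero once y is indistinguishable from 1, so very negative inputs produce exactly zero rather than a tiny negative value. The result stays finite, which is the property the stable formulation is after.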

Then implement Qwen3MLP. The structure for Qwen3’s MLP block is:

  • A gate linear projection ($W_{gate}$).
  • An up linear projection ($W_{up}$).
  • A SiLU activation function applied to the output of $W_{gate}$.
  • An element-wise multiplication of the SiLU-activated $W_{gate}$ output and the $W_{up}$ output. This forms the “gated” part.
  • A final down linear projection ($W_{down}$).

This can be expressed as:

$$
\text{MLP}(x) = W_{down}(\text{SiLU}(W_{gate}(x)) \odot W_{up}(x))
$$

Where $\odot$ denotes element-wise multiplication. All linear projections in Qwen3’s MLP are typically implemented without bias.

N.. is zero or more dimensions for batches
E is hidden_size (embedding dimension of the model)
I is intermediate_size (dimension of the hidden layer in MLP)
L is the sequence length

input: N.. x L x E
w_gate: I x E
w_up: I x E
w_down: E x I
output: N.. x L x E
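
To make the shapes above concrete, here is a minimal NumPy sketch of a SwiGLU block. It is an illustration under assumptions, not the Qwen3MLP solution: the class name `SwiGLUMLP`, the `linear` helper, and the random weights are hypothetical, and the real task uses MLX arrays and the course's own linear function.

```python
import numpy as np


def silu(x: np.ndarray) -> np.ndarray:
    # Numerically stable SiLU: sigmoid is evaluated via exp(-|x|).
    y = 1.0 / (1.0 + np.exp(-np.abs(x)))
    return x * np.where(x >= 0, y, 1.0 - y)


def linear(x: np.ndarray, w: np.ndarray) -> np.ndarray:
    # w has shape (out_features, in_features); no bias term.
    return x @ w.T


class SwiGLUMLP:
    """Hypothetical stand-in for the structure of Qwen3MLP."""

    def __init__(self, w_gate: np.ndarray, w_up: np.ndarray, w_down: np.ndarray):
        self.w_gate = w_gate  # I x E
        self.w_up = w_up      # I x E
        self.w_down = w_down  # E x I

    def __call__(self, x: np.ndarray) -> np.ndarray:
        gate = silu(linear(x, self.w_gate))      # N.. x L x I
        up = linear(x, self.w_up)                # N.. x L x I
        return linear(gate * up, self.w_down)    # N.. x L x E


rng = np.random.default_rng(0)
E, I, L = 4, 8, 3
mlp = SwiGLUMLP(
    rng.standard_normal((I, E)).astype(np.float32),
    rng.standard_normal((I, E)).astype(np.float32),
    rng.standard_normal((E, I)).astype(np.float32),
)
x = rng.standard_normal((2, L, E)).astype(np.float32)  # batch of 2
out = mlp(x)
print(out.shape)
```

Because every operation acts only on the last dimension, each position in the sequence is transformed independently, and any number of leading batch dimensions passes through unchanged.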

You can test your implementation by running:

pdm run test --week 1 --day 4 -- -k task_2

At the end of the day, you should be able to pass all tests of this day:

pdm run test --week 1 --day 4

Your feedback is greatly appreciated. Welcome to join our Discord Community.
Found an issue? Create an issue / pull request on github.com/skyzh/tiny-llm.
tiny-llm-book © 2025 by Alex Chi Z is licensed under CC BY-NC-SA 4.0.