Week 1 Day 2: Positional Encodings and RoPE
In day 2, we will implement the positional embedding used in the Qwen2 model: Rotary Positional Encoding (RoPE). A transformer model needs a way to embed the position of each token into the input of the attention layers. In Qwen2, the positional embedding is applied within the multi-head attention layer, to the query and key vectors.
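To make the dataflow concrete, here is a minimal sketch of where RoPE sits inside multi-head attention. The function name and the `rope` callable are hypothetical; the actual module structure comes from the course skeleton:

```python
import mlx.core as mx

def attention_with_rope(q, k, v, rope):
    # q, k, v: (N, L, H, D) per-head projections. `rope` is assumed to be a
    # callable that rotates its input by each token's absolute position.
    q = rope(q)  # positions are baked into the queries...
    k = rope(k)  # ...and the keys; values are left untouched
    scale = q.shape[-1] ** -0.5
    # Move the head axis forward to (N, H, L, D) so matmul batches over heads.
    q, k, v = (t.transpose(0, 2, 1, 3) for t in (q, k, v))
    scores = (q * scale) @ k.transpose(0, 1, 3, 2)  # (N, H, L, L)
    weights = mx.softmax(scores, axis=-1)
    return (weights @ v).transpose(0, 2, 1, 3)      # back to (N, L, H, D)
```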
📚 Readings
- You could have designed state of the art positional encoding
- RoFormer: Enhanced Transformer with Rotary Position Embedding
Task 1: Implement Rotary Positional Encoding "RoPE"
You will need to modify the following file:
```
src/tiny_llm/positional_encoding.py
```
In traditional RoPE (as described in the readings), the positional encoding is applied to each head of the query and key vectors.
You can pre-compute the frequencies when initializing the `RoPE` class.
If `offset` is not provided, the positional encoding is applied to the entire sequence: the 0th frequency is applied to the 0th token, and so on up to the (L-1)-th token. Otherwise, the positional encoding is applied according to the `offset` slice: if the offset slice is 5..10, the sequence length provided to the layer will be 5, and the 0th token will use the 5th frequency. A sketch of the precomputation and offset handling follows.
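Here is a minimal sketch of that precomputation, assuming a constructor with `dims`, `seq_len`, and `base` parameters; the names and exact signature are assumptions, so match them to the actual skeleton in `src/tiny_llm/positional_encoding.py`:

```python
import mlx.core as mx

class RoPE:
    def __init__(self, dims: int, seq_len: int, base: int = 10000):
        # Pair i rotates at rate theta_i = base^(-2i / dims), i in [0, dims // 2).
        inv_freq = mx.power(base, -mx.arange(0, dims, 2, dtype=mx.float32) / dims)
        # One angle per (position, pair): outer product of positions and rates.
        positions = mx.arange(seq_len, dtype=mx.float32)
        angles = mx.expand_dims(positions, 1) * mx.expand_dims(inv_freq, 0)
        self.cos_freqs = mx.cos(angles)  # (seq_len, dims // 2)
        self.sin_freqs = mx.sin(angles)  # (seq_len, dims // 2)

    def __call__(self, x: mx.array, offset: slice | None = None) -> mx.array:
        N, L, H, D = x.shape
        # Use rows [0, L) of the tables without an offset, else the given slice.
        cos = self.cos_freqs[:L] if offset is None else self.cos_freqs[offset]
        sin = self.sin_freqs[:L] if offset is None else self.sin_freqs[offset]
        ...  # apply the rotation (Task 1 / Task 2 below)
```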
```
x: (N, L, H, D)
cos/sin_freqs: (MAX_SEQ_LEN, D // 2)
```
In the traditional form of RoPE, the D dimensions of each head are viewed as consecutive complex pairs. That is to say, if D = 8, then x[0] and x[1] are a pair, x[2] and x[3] are another pair, and so on. Each pair gets the same frequency from `cos/sin_freqs`.
```
output[0] = x[0] * cos_freqs[0] + x[1] * sin_freqs[0]
output[1] = x[0] * -sin_freqs[0] + x[1] * cos_freqs[0]
output[2] = x[2] * cos_freqs[1] + x[3] * sin_freqs[1]
output[3] = x[2] * -sin_freqs[1] + x[3] * cos_freqs[1]
...and so on
```
You can do this by reshaping `x` to (N, L, H, D // 2, 2) and then applying the above formula to each pair, as in the sketch below.
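Here is a minimal sketch of that reshape-based approach, written as a standalone function for illustration; `cos`/`sin` are assumed to be the precomputed tables already sliced to the L positions in use:

```python
import mlx.core as mx

def rope_traditional(x: mx.array, cos: mx.array, sin: mx.array) -> mx.array:
    # x: (N, L, H, D); cos/sin: (L, D // 2)
    N, L, H, D = x.shape
    pairs = x.reshape(N, L, H, D // 2, 2)
    x0, x1 = pairs[..., 0], pairs[..., 1]  # the two elements of each pair
    # Broadcast the per-(position, pair) tables over batch and heads.
    cos = cos.reshape(1, L, 1, D // 2)
    sin = sin.reshape(1, L, 1, D // 2)
    out0 = x0 * cos + x1 * sin   # output[0], output[2], ... above
    out1 = -x0 * sin + x1 * cos  # output[1], output[3], ... above
    # Re-interleave the pairs and restore the original shape.
    return mx.stack([out0, out1], axis=-1).reshape(N, L, H, D)
```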
📚 Readings
- PyTorch RotaryPositionalEmbeddings API
- MLX implementation of RoPE before the custom Metal kernel implementation
You can test your implementation by running the following command:
```bash
pdm run pytest tests -k week_1_day_2_task_1 -v
```
Task 2: Implement RoPE in the non-traditional form
The Qwen2 model uses a non-traditional form of RoPE. In this form, the head embedding dimension is split into two halves, and the two halves are applied with different frequencies. Let's say `x1 = x[..., :HALF_DIM]` and `x2 = x[..., HALF_DIM:]`.
```
output[0] = x1[0] * cos_freqs[0] + x2[0] * sin_freqs[0]
output[HALF_DIM] = x1[0] * -sin_freqs[0] + x2[0] * cos_freqs[0]
output[1] = x1[1] * cos_freqs[1] + x2[1] * sin_freqs[1]
output[HALF_DIM + 1] = x1[1] * -sin_freqs[1] + x2[1] * cos_freqs[1]
...and so on
```
You can do this by directly taking the first half and the second half of the embedding dimension of `x` and applying the frequencies to each half separately, as in the sketch below.
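A minimal sketch of the split-halves approach, under the same assumptions as in Task 1 (`cos`/`sin` pre-sliced to the L positions in use):

```python
import mlx.core as mx

def rope_non_traditional(x: mx.array, cos: mx.array, sin: mx.array) -> mx.array:
    # x: (N, L, H, D); cos/sin: (L, D // 2)
    N, L, H, D = x.shape
    half = D // 2
    x1, x2 = x[..., :half], x[..., half:]  # first / second half of each head
    cos = cos.reshape(1, L, 1, half)
    sin = sin.reshape(1, L, 1, half)
    out1 = x1 * cos + x2 * sin   # goes to output[..., :half]
    out2 = -x1 * sin + x2 * cos  # goes to output[..., half:]
    return mx.concatenate([out1, out2], axis=-1)
```

The only difference from the traditional form is how pairs are laid out: element i of the first half pairs with element i of the second half, instead of adjacent interleaved elements.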
You can test your implementation by running the following command:
```bash
pdm run pytest tests -k week_1_day_2_task_2 -v
```