Week 1 Day 5: The Qwen3 Model

On day 5, we will implement the Qwen3 model.

Before we start, please make sure you have downloaded the models:

hf download Qwen/Qwen3-0.6B-MLX-4bit
hf download Qwen/Qwen3-1.7B-MLX-4bit
hf download Qwen/Qwen3-4B-MLX-4bit

Otherwise, some of the tests will be skipped.

Task 1: Implement `Qwen3TransformerBlock`

src/tiny_llm/qwen3_week1.py

📚 Readings

Qwen3 uses the following transformer block structure:

  input
/ |
| input_layernorm (RMSNorm)
| |
| Qwen3MultiHeadAttention
\ |
  Add (residual)
/ |
| post_attention_layernorm (RMSNorm)
| |
| MLP
\ |
  Add (residual)
  |
output
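
The diagram above can be traced with a minimal sketch of the data flow. It uses plain Python lists instead of MLX arrays, and the layers passed in (`identity`, `double`) are hypothetical stand-ins so the result can be checked by hand; they are not the real RMSNorm, attention, or MLP layers.

```python
from typing import Callable, List

Vector = List[float]
Layer = Callable[[Vector], Vector]


def add(a: Vector, b: Vector) -> Vector:
    """Element-wise residual addition."""
    return [x + y for x, y in zip(a, b)]


def transformer_block(
    x: Vector,
    input_layernorm: Layer,
    self_attn: Layer,
    post_attention_layernorm: Layer,
    mlp: Layer,
) -> Vector:
    # First residual branch: normalize, attend, add back the untouched input.
    h = add(x, self_attn(input_layernorm(x)))
    # Second residual branch: normalize, run the MLP, add back h.
    return add(h, mlp(post_attention_layernorm(h)))


# Toy stand-ins (hypothetical): a no-op "norm" and a layer that doubles its input.
identity: Layer = lambda v: v
double: Layer = lambda v: [2.0 * t for t in v]

out = transformer_block([1.0, 2.0], identity, double, identity, double)
print(out)  # [9.0, 18.0]
```

Note that each normalization sits inside its residual branch: the value added back is always the un-normalized input to that branch.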

You should pass all tests for this task by running:

# Download the models if you haven't done so
hf download Qwen/Qwen3-0.6B-MLX-4bit
hf download Qwen/Qwen3-1.7B-MLX-4bit
hf download Qwen/Qwen3-4B-MLX-4bit
# Run the tests
pdm run test --week 1 --day 5 -- -k task_1

Task 2: Implement `Embedding`

src/tiny_llm/embedding.py

📚 Readings

LLM Embeddings Explained: A Visual and Intuitive Guide

The embedding layer maps one or more tokens (each represented as an integer ID) to one or more vectors of dimension embedding_dim. In this task, you will implement the embedding layer.

Embedding::__call__
weight: vocab_size x embedding_dim
Input: N.. (tokens)
Output: N.. x embedding_dim (vectors)

This can be done with a simple array index lookup operation.
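
As a rough illustration of the lookup, here is a pure-Python sketch that handles an arbitrarily nested token list (the `N..` shape). The `embed` helper and the toy weight values are assumptions for this example; with an array library the whole function typically collapses to a single indexing expression such as `weight[tokens]`.

```python
from typing import List


def embed(weight: List[List[float]], tokens):
    """Replace every token id in a nested list with its row of `weight`."""
    if isinstance(tokens, int):
        return weight[tokens]  # one token -> one vector of length embedding_dim
    return [embed(weight, t) for t in tokens]


# Toy table: vocab_size = 4, embedding_dim = 2.
weight = [[0.0, 0.1], [1.0, 1.1], [2.0, 2.1], [3.0, 3.1]]
tokens = [[3, 0], [1, 1]]        # shape 2 x 2
vectors = embed(weight, tokens)  # shape 2 x 2 x 2
print(vectors)
```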

In the Qwen3 model, the embedding layer can also be used as a linear layer to map the embeddings back to the token space.

Embedding::as_linear
weight: vocab_size x embedding_dim
Input: N.. x embedding_dim
Output: N.. x vocab_size
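
Given the shapes above, one way to read `as_linear` is as a matrix product with the transposed embedding table, `x @ weight.T`: each output entry is the dot product between an input vector and one token's embedding row. The sketch below spells that out with plain lists for a 2-D input; the helper name and the toy values are illustrative only.

```python
from typing import List

Matrix = List[List[float]]


def as_linear(weight: Matrix, x: Matrix) -> Matrix:
    """Compute x @ weight.T for x of shape (N, embedding_dim)."""
    return [
        [sum(xi * wi for xi, wi in zip(row, w_row)) for w_row in weight]
        for row in x
    ]


weight = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # vocab_size = 3, embedding_dim = 2
x = [[2.0, 3.0]]                               # N = 1
logits = as_linear(weight, x)
print(logits)  # [[2.0, 3.0, 5.0]]
```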

You should pass all tests for this task by running:

# Download the models if you haven't done so; we need the tokenizers
hf download Qwen/Qwen3-0.6B-MLX-4bit
hf download Qwen/Qwen3-1.7B-MLX-4bit
hf download Qwen/Qwen3-4B-MLX-4bit
# Run the tests
pdm run test --week 1 --day 5 -- -k task_2

Task 3: Implement `Qwen3ModelWeek1`

Now that we have built all the components of the Qwen3 model, we can implement the Qwen3ModelWeek1 class.

src/tiny_llm/qwen3_week1.py

📚 Readings

In this course, you will not implement the process of loading the model parameters from the tensor files. Instead, we will load the model using the mlx-lm library and then place the loaded parameters into our model. Therefore, the Qwen3ModelWeek1 class takes an MLX model as its constructor argument.

The Qwen3 model has the following layers:

input
| (tokens: N..)
Embedding
| (N.. x hidden_size); note that hidden_size==embedding_dim
Qwen3TransformerBlock
| (N.. x hidden_size)
Qwen3TransformerBlock
| (N.. x hidden_size)
...
|
RMSNorm 
| (N.. x hidden_size)
Embedding::as_linear  OR  Linear (lm_head)
| (N.. x vocab_size)
output

You can read the number of layers, hidden size, head dimension, and other model parameters from mlx_model.args, whose fields are defined in ModelArgs. You can reach the loaded weights from mlx_model.model; the layer names are easiest to inspect from the Qwen3 MLX model metadata on Hugging Face.

By this point, you have implemented RMSNorm yourself. If your day 3 attention path still calls mx.fast.rms_norm for q_norm and k_norm, you can now replace those calls with RMSNorm(head_dim, q_norm, eps=...) and RMSNorm(head_dim, k_norm, eps=...). They implement the same formula; the built-in call existed only to avoid teaching RMSNorm before the GQA chapter.
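
For reference, the formula both versions share can be sketched in a few lines of plain Python over a single vector. The function name and the default `eps` value here are assumptions for illustration; use the epsilon from the model arguments in your real implementation.

```python
import math
from typing import List


def rms_norm(x: List[float], weight: List[float], eps: float = 1e-6) -> List[float]:
    """Scale x to unit root-mean-square, then apply the learned per-channel weight."""
    mean_sq = sum(v * v for v in x) / len(x)
    scale = 1.0 / math.sqrt(mean_sq + eps)
    return [v * scale * w for v, w in zip(x, weight)]


y = rms_norm([3.0, 4.0], [1.0, 1.0], eps=0.0)
print(y)
```

With a weight of all ones and `eps=0`, the output has a mean square of 1, which is a quick sanity check for your own layer.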

Note that Qwen3 models of different sizes use different strategies to map the embeddings back to the token space. Some models directly use the Embedding::as_linear layer, while others have a separate lm_head linear layer. You can decide which strategy to use based on the mlx_model.args.tie_word_embeddings argument. If it is true, then you should use Embedding::as_linear. Otherwise, the lm_head linear layer will be available and you should load its parameters.
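
The branching logic might look like the sketch below. The function `project_to_vocab` and its parameters are hypothetical and operate on plain lists; in your model the two weights come from the loaded embedding and, when present, the dequantized lm_head.

```python
from typing import List, Optional

Matrix = List[List[float]]


def project_to_vocab(
    h: Matrix,
    embedding_weight: Matrix,
    lm_head_weight: Optional[Matrix],
    tie_word_embeddings: bool,
) -> Matrix:
    """Map hidden states (N x hidden_size) to logits (N x vocab_size)."""
    if tie_word_embeddings:
        weight = embedding_weight  # reuse the embedding table (as_linear)
    else:
        if lm_head_weight is None:
            raise ValueError("untied model needs a separate lm_head weight")
        weight = lm_head_weight
    # Both paths compute h @ weight.T; only the source of the weight differs.
    return [[sum(a * b for a, b in zip(row, w)) for w in weight] for row in h]


emb = [[1.0, 0.0], [0.0, 1.0]]
head = [[2.0, 0.0], [0.0, 2.0]]
tied = project_to_vocab([[1.0, 3.0]], emb, None, tie_word_embeddings=True)
untied = project_to_vocab([[1.0, 3.0]], emb, head, tie_word_embeddings=False)
print(tied, untied)  # [[1.0, 3.0]] [[2.0, 6.0]]
```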

The input to the model is a sequence of tokens. The output is the logits of the next token: unnormalized scores over the vocabulary, which a softmax turns into a probability distribution. On the next day, we will implement the process of generating the response from the model, and decide the next token based on that probability distribution.
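
To make the relationship between logits and probabilities concrete, here is a small sketch that applies a numerically stable softmax to one row of logits and then picks the most likely token. The logit values are made up, and the greedy pick is only one possible decoding strategy, shown here as an assumption rather than as the method used later in the course.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Turn unnormalized scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max so exp() cannot overflow
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.1]  # toy scores for a 3-token vocabulary
probs = softmax(logits)
next_token = max(range(len(probs)), key=probs.__getitem__)  # greedy choice
print(probs, next_token)
```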

Also note that the MLX model we are using is a quantized model. Therefore, you also need to dequantize the weights before loading them into our tiny-llm model. You can use the provided quantize::dequantize_linear function to dequantize the weights.
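
To give a feel for what dequantization computes, the sketch below assumes an affine, group-wise scheme in which every group of integer codes shares one scale and one bias, so that each weight is recovered as `scale * q + bias`. It also assumes the 4-bit codes have already been unpacked into plain integers. The helper `dequantize_row` is hypothetical and does not reproduce the signature or the packed storage layout handled by the provided dequantize_linear function.

```python
from typing import List


def dequantize_row(
    q_row: List[int],
    scales: List[float],
    biases: List[float],
    group_size: int,
) -> List[float]:
    """Recover approximate float weights from unpacked integer codes."""
    out: List[float] = []
    for g, (s, b) in enumerate(zip(scales, biases)):
        group = q_row[g * group_size : (g + 1) * group_size]
        out.extend(s * q + b for q in group)  # affine map per group
    return out


# Two groups of two 4-bit codes (0..15), each with its own scale and bias.
row = dequantize_row([0, 15, 5, 10], scales=[0.1, 0.2], biases=[-1.0, 0.0], group_size=2)
print(row)
```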

You also need to make sure that you set mask=causal when the input sequence is longer than 1. We will explain why on the next day.
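
As a preview of what a causal mask looks like, here is a sketch that builds it as an additive matrix: position i may attend to positions 0..i, and every later position receives negative infinity so that it vanishes after the softmax. The `causal_mask` helper is illustrative and uses plain lists rather than the mask handling in your attention implementation.

```python
from typing import List


def causal_mask(seq_len: int) -> List[List[float]]:
    """Additive mask: 0.0 where attention is allowed, -inf for future positions."""
    neg_inf = float("-inf")
    return [
        [0.0 if j <= i else neg_inf for j in range(seq_len)]
        for i in range(seq_len)
    ]


for row in causal_mask(3):
    print(row)
```

For a single token the mask is just `[[0.0]]`, which blocks nothing, so the mask only matters when the sequence is longer than 1.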

You should pass all tests for this task by running:

# Download the models if you haven't done so
hf download Qwen/Qwen3-0.6B-MLX-4bit
hf download Qwen/Qwen3-1.7B-MLX-4bit
hf download Qwen/Qwen3-4B-MLX-4bit
# Run the tests
pdm run test --week 1 --day 5 -- -k task_3

At the end of the day, you should be able to pass all tests for this day:

pdm run test --week 1 --day 5

Your feedback is greatly appreciated. Welcome to join our Discord Community.
Found an issue? Create an issue / pull request on github.com/skyzh/tiny-llm.
tiny-llm-book © 2025 by Alex Chi Z is licensed under CC BY-NC-SA 4.0.
