Week 1 Day 7: Sampling and Preparing for Week 2
On day 7, we will implement various sampling strategies and get you prepared for week 2.
Task 1: Sampling
On the previous day, we implemented the default greedy sampling strategy. In this task, we will implement the temperature, top-k, and top-p (nucleus) sampling strategies.
src/tiny_llm/sampler.py
Temperature Sampling
The first sampling strategy is temperature sampling. When temp=0, we use the default greedy strategy. When it is larger than 0, we randomly select the next token based on the logprobs. The temperature parameter scales the distribution: the larger the temperature, the more uniform the distribution becomes, making lower-probability tokens more likely to be selected and therefore making the model more creative.
To implement temperature sampling, simply divide the logprobs by the temperature and use mx.random.categorical to randomly select the next token.
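A minimal sketch of the idea, assuming logprobs is the model's log-probability vector for the next token (the function name and signature are illustrative, not necessarily the exact interface expected by src/tiny_llm/sampler.py):

    import mlx.core as mx

    def sample_temperature(logprobs: mx.array, temp: float) -> mx.array:
        # temp == 0 falls back to the default greedy strategy.
        if temp == 0:
            return mx.argmax(logprobs, axis=-1)
        # Dividing by the temperature scales the distribution: a larger temp
        # flattens it, so lower-probability tokens get sampled more often.
        return mx.random.categorical(logprobs / temp)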
pdm run main --solution tiny_llm --loader week1 --model Qwen/Qwen2-0.5B-Instruct-MLX --sampler-temp 0.5
Top-k Sampling
In top-k sampling, we keep only the k tokens with the highest probabilities and sample among them. This filtering is done before the final temperature scaling.
You can use mx.argpartition to partition the output so that you know the indices of the top-k elements, then mask the logprobs outside the top-k with -mx.inf. After that, do temperature sampling.
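A sketch of one way to do this, using the k-th largest logprob as a cutoff (names are illustrative; tokens tied with the cutoff are kept):

    import mlx.core as mx

    def sample_top_k(logprobs: mx.array, k: int, temp: float) -> mx.array:
        # argpartition on the negated logprobs places the indices of the
        # k largest entries in the first k positions of the last axis.
        idx = mx.argpartition(-logprobs, kth=k - 1, axis=-1)
        # The k-th largest logprob is the cutoff for the top-k set.
        cutoff = mx.take_along_axis(logprobs, idx[..., k - 1 : k], axis=-1)
        # Everything below the cutoff becomes -inf and can never be sampled.
        masked = mx.where(logprobs < cutoff, -mx.inf, logprobs)
        # Finish with temperature sampling (assumes temp > 0).
        return mx.random.categorical(masked / temp)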
pdm run main --solution tiny_llm --loader week1 --model Qwen/Qwen2-0.5B-Instruct-MLX --sampler-temp 0.5 --sampler-top-k 10
Top-p (Nucleus) Sampling
In top-p (nucleus) sampling, we keep only the smallest set of highest-probability tokens whose cumulative probability reaches p, and sample among them. As with top-k, this filtering is done before the final temperature scaling.
There are multiple ways of implementing it. One way is to first use mx.argsort to sort the logprobs (from highest probability to lowest), then do a cumsum over the corresponding probabilities (exponentiate the sorted logprobs first) to get the cumulative probabilities, and mask the logprobs outside the top-p with -mx.inf. After that, do temperature sampling.
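A sketch along those lines, assuming logprobs is already normalized (i.e., the output of a log-softmax) so that exponentiating gives probabilities that sum to 1:

    import mlx.core as mx

    def sample_top_p(logprobs: mx.array, p: float, temp: float) -> mx.array:
        # Sort token positions from highest to lowest logprob.
        order = mx.argsort(-logprobs, axis=-1)
        sorted_logprobs = mx.take_along_axis(logprobs, order, axis=-1)
        # Cumulative probability mass in sorted order.
        cum_probs = mx.cumsum(mx.exp(sorted_logprobs), axis=-1)
        # Keep a token if the mass *before* it is still below p; this always
        # keeps at least the highest-probability token.
        keep = (cum_probs - mx.exp(sorted_logprobs)) < p
        masked_sorted = mx.where(keep, sorted_logprobs, -mx.inf)
        # Undo the sort to restore the original token order.
        inverse = mx.argsort(order, axis=-1)
        masked = mx.take_along_axis(masked_sorted, inverse, axis=-1)
        # Finish with temperature sampling (assumes temp > 0).
        return mx.random.categorical(masked / temp)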
pdm run main --solution tiny_llm --loader week1 --model Qwen/Qwen2-0.5B-Instruct-MLX --sampler-temp 0.5 --sampler-top-p 0.9
Task 2: Prepare for Week 2
In week 2, we will optimize the serving infrastructure of the Qwen2 model. We will write some C++ code and Metal kernels to make some operations run faster. You will need Xcode and its command-line tools, which include the Metal compiler, to compile the C++ code and Metal kernels.
- Install Xcode: Install Xcode from the Mac App Store or from the Apple Developer website (this may require an Apple Developer account).
- Launch Xcode and Install Components: After installation, launch Xcode at least once. It may prompt you to install additional macOS components; please do so (this is usually the default option).
- Install Xcode Command Line Tools:
Open your Terminal and run:
xcode-select --install
- Set Default Xcode Path (if needed):
Ensure that your command-line tools are pointing to your newly installed Xcode. You can do this by running:
sudo xcode-select --switch /Applications/Xcode.app/Contents/Developer
(Adjust the path if your Xcode is installed in a different location.)
- Accept Xcode License:
You may also need to accept the Xcode license:
sudo xcodebuild -license accept
- Install CMake:
brew install cmake
(This instruction is graciously provided by Liu Jinyi.)
You can test your installation by compiling the code in src/extensions, which includes an axpby function from the official MLX extensions tutorial:
pdm run build-ext
pdm run build-ext-test
It should print correct: True.
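For reference, axpby computes alpha * x + beta * y element-wise over two arrays. Once the extension builds, calling it from Python looks roughly like this (the module name mlx_sample_extensions comes from the official tutorial and may differ in this repository's build):

    import mlx.core as mx
    from mlx_sample_extensions import axpby  # module name per the MLX tutorial; may differ here

    x = mx.ones((3, 4))
    y = mx.full((3, 4), 2.0)
    out = axpby(x, y, 4.0, 2.0)  # computes 4 * x + 2 * y element-wise
    print(out)  # expect every entry to be 4 * 1 + 2 * 2 = 8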
If you are not familiar with C++ or Metal programming, we also suggest doing some small exercises to get familiar with them. You can implement some element-wise operations like exp, sin, and cos, and replace the MLX ones in your model implementation.
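If you want to experiment with Metal before writing a full C++ extension, recent MLX versions also expose mx.fast.metal_kernel, which compiles a Metal source string directly from Python. Here is a sketch of an element-wise exp, closely following the example in the MLX documentation (treat the exact API as version-dependent):

    import mlx.core as mx

    def exp_elementwise(a: mx.array) -> mx.array:
        # Metal body: each thread handles one element of the input.
        source = """
            uint elem = thread_position_in_grid.x;
            T tmp = inp[elem];
            out[elem] = metal::exp(tmp);
        """
        kernel = mx.fast.metal_kernel(
            name="myexp",
            input_names=["inp"],
            output_names=["out"],
            source=source,
        )
        # Launch one thread per element.
        outputs = kernel(
            inputs=[a],
            template=[("T", mx.float32)],
            grid=(a.size, 1, 1),
            threadgroup=(256, 1, 1),
            output_shapes=[a.shape],
            output_dtypes=[a.dtype],
        )
        return outputs[0]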
That's all for week 1! We have implemented all the components needed to serve the Qwen2 model. Now we are ready to start week 2, where we will optimize the serving infrastructure and make it run blazingly fast on your Apple Silicon device.
Your feedback is greatly appreciated. You are welcome to join our Discord Community.
Found an issue? Create an issue / pull request on github.com/skyzh/tiny-llm.
tiny-llm-book © 2025 by Alex Chi Z is licensed under CC BY-NC-SA 4.0.