I keep reading about Chinese labs “attacking” frontier labs by performing “model distillation.” Then, very recently, Musk acknowledged that xAI used distillation on OpenAI’s models and said it was common practice.

How does one distill a model?

The idea is that you have

  • A teacher model: large, well-trained, expensive to run
  • A student model: smaller, faster, cheaper to deploy
  • A task: classification, regression, whatever the teacher excels at

Instead of training the smaller model directly on hard labels, you train the student to approximate the teacher’s output distribution as closely as possible, often on a subset of the training data.

There are different types of distillation:

  1. Sequence-level
  2. Token-level / logit KL
  3. On-policy / GKD
  4. Cross-tokenizer

Neural networks usually produce class probabilities by applying a “softmax” output layer that converts the logit, z_i, computed for each class into a probability, q_i, by comparing z_i with the other logits: q_i = exp(z_i / T) / Σ_j exp(z_j / T). Here T is the temperature; the higher the T, the softer the distribution over classes (Hinton et al., 2015).
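To make the temperature concrete, here is a minimal sketch of that softmax in plain Python (the function name is my own, not from any library):

```python
import math

def softmax_with_temperature(logits, T=1.0):
    # Divide each logit by T before exponentiating. T=1 is the standard
    # softmax; higher T gives a softer (more uniform) distribution.
    scaled = [z / T for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

sharp = softmax_with_temperature([3.0, 1.0, 0.2], T=1.0)
soft = softmax_with_temperature([3.0, 1.0, 0.2], T=5.0)
# The top class gets less probability mass at T=5 than at T=1:
# raising T flattens the distribution.
```

The softened distribution is what carries the teacher’s “dark knowledge”: at higher T, the relative probabilities of the wrong classes become visible instead of being crushed toward zero.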

The simplest way to distill knowledge is to run the teacher at a raised temperature T and train the student, at the same temperature, to match the resulting soft targets.
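A minimal sketch of that training objective, assuming a plain classification setting (function names are mine; this mirrors the KL-to-teacher loss from Hinton et al., 2015):

```python
import math

def softmax(logits, T=1.0):
    # Temperature-scaled softmax: q_i = exp(z_i / T) / sum_j exp(z_j / T).
    m = max(z / T for z in logits)
    exps = [math.exp(z / T - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kd_loss(student_logits, teacher_logits, T=2.0):
    # KL(teacher || student) between the softened distributions.
    # Hinton et al. scale by T^2 so soft-target gradients keep roughly
    # the same magnitude as hard-target gradients as T varies.
    p = softmax(teacher_logits, T)
    q = softmax(student_logits, T)
    return T * T * sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

# Identical logits give zero loss; diverging from the teacher is penalized.
loss_same = kd_loss([2.0, 0.5, -1.0], [2.0, 0.5, -1.0])  # ~0
loss_diff = kd_loss([0.1, 0.1, 0.1], [2.0, 0.5, -1.0])   # > 0
```

In practice this soft-target term is usually combined with a standard cross-entropy term on the true labels (computed at T=1), weighted against each other.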

[In Progress]

1. Sequence-level distillation

2. Token-level distillation

Reading

  • Hinton, Vinyals & Dean (2015). Distilling the Knowledge in a Neural Network. The original paper.