I keep reading about Chinese labs “attacking” frontier labs by performing “model distillation.” Then, very recently, Musk acknowledged that xAI used distillation on OpenAI’s models and said it was common practice.

How does one distill a model?

The idea is that you have

  • A teacher model: large, well-trained, expensive to run
  • A student model: smaller, faster, cheaper to deploy
  • A task: classification, regression, whatever the teacher excels at

Instead of training the smaller model directly on hard labels, you train the student to approximate the teacher’s output distribution as closely as possible, often on a subset of the training data.

There are different types of distillation:

  1. Sequence-level
  2. Token-level / logit KL
  3. On-policy / GKD
  4. Cross-tokenizer

Neural networks usually produce class probabilities by applying a “softmax” output layer that converts the logit, z_i, computed for each class into a probability, q_i, by comparing z_i with the other logits: q_i = exp(z_i / T) / Σ_j exp(z_j / T). Here T is the temperature; the higher the T, the softer the distribution over classes (Hinton et al., 2015).
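To make the temperature concrete, here is a minimal sketch of that softmax in plain Python (the function name is my own, not from any library):

```python
import math

def softmax_with_temperature(logits, T=1.0):
    # Divide each logit by T before exponentiating. T=1 is the standard
    # softmax; higher T gives a softer (more uniform) distribution.
    scaled = [z / T for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

sharp = softmax_with_temperature([3.0, 1.0, 0.2], T=1.0)
soft = softmax_with_temperature([3.0, 1.0, 0.2], T=5.0)
# The top class gets less probability mass at T=5 than at T=1:
# raising T flattens the distribution.
```

The softened distribution is what carries the teacher’s “dark knowledge”: at higher T, the relative probabilities of the wrong classes become visible instead of being crushed toward zero.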

The simplest way to distill knowledge is to run the teacher at a raised temperature T and train the student, at the same temperature, to match the resulting soft targets.
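A minimal sketch of that training objective, assuming a plain classification setting (function names are mine; this mirrors the KL-to-teacher loss from Hinton et al., 2015):

```python
import math

def softmax(logits, T=1.0):
    # Temperature-scaled softmax: q_i = exp(z_i / T) / sum_j exp(z_j / T).
    m = max(z / T for z in logits)
    exps = [math.exp(z / T - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kd_loss(student_logits, teacher_logits, T=2.0):
    # KL(teacher || student) between the softened distributions.
    # Hinton et al. scale by T^2 so soft-target gradients keep roughly
    # the same magnitude as hard-target gradients as T varies.
    p = softmax(teacher_logits, T)
    q = softmax(student_logits, T)
    return T * T * sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

# Identical logits give zero loss; diverging from the teacher is penalized.
loss_same = kd_loss([2.0, 0.5, -1.0], [2.0, 0.5, -1.0])  # ~0
loss_diff = kd_loss([0.1, 0.1, 0.1], [2.0, 0.5, -1.0])   # > 0
```

In practice this soft-target term is usually combined with a standard cross-entropy term on the true labels (computed at T=1), weighted against each other.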

[In Progress]

1. Sequence-level distillation

2. Token-level distillation

Reading

  • Hinton, Vinyals & Dean (2015). Distilling the Knowledge in a Neural Network. The original paper.