Soft distillation training support #2901

gagika · 2025-12-30T06:46:22Z

Description

This PR introduces train_distill.py, a new training script that implements a "Post-Pruning Recovery" distillation workflow by integrating the Tunix library with MaxText.

Context & Problem Solved:
This script recovers model quality after structural pruning (e.g., reducing attention heads) by enabling a distillation process where a smaller, trainable Student model mimics the output distribution of a larger, frozen Teacher model on general data.
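The description does not spell out the loss, so the following is only a minimal sketch of one common soft-distillation objective: the KL divergence between temperature-softened Teacher and Student distributions, scaled by the squared temperature. The function names and the default temperature are assumptions for illustration, not taken from train_distill.py or Tunix, and the code is framework-free so the arithmetic stays visible.

```python
import math


def log_softmax(logits, temperature):
    """Numerically stable log-softmax of logits divided by a temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    log_z = peak + math.log(sum(math.exp(x - peak) for x in scaled))
    return [x - log_z for x in scaled]


def soft_distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2.

    Hypothetical helper: the teacher logits are treated as fixed targets,
    mirroring a frozen Teacher model.
    """
    log_p_teacher = log_softmax(teacher_logits, temperature)
    log_p_student = log_softmax(student_logits, temperature)
    kl = sum(
        math.exp(lt) * (lt - ls)
        for lt, ls in zip(log_p_teacher, log_p_student)
    )
    return temperature**2 * kl


# The loss vanishes when the Student matches the Teacher and grows as they diverge.
matched = soft_distillation_loss([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
mismatched = soft_distillation_loss([3.0, 2.0, 1.0], [1.0, 2.0, 3.0])
```

In a real training step only the Student parameters would receive gradients from this loss.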

Key Implementation Details:

Dual Model Loading: Initializes two distinct MaxText models (Student and Teacher) in the same process using pyconfig with student_overrides and teacher_overrides to support different architectures.
Tunix Integration: Wraps MaxText models in TunixMaxTextAdapter to expose a standard interface compatible with the Tunix DistillationTrainer.
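As a rough illustration of the dual-loading idea, the sketch below derives two configurations from one shared base by applying per-model overrides. It does not use the pyconfig API; `apply_overrides`, the key names, and the values are all hypothetical stand-ins chosen for the example.

```python
from copy import deepcopy


def apply_overrides(base, overrides):
    """Return a copy of `base` with `overrides` applied (hypothetical helper)."""
    merged = deepcopy(base)
    merged.update(overrides)
    return merged


# Shared settings; keys and values are illustrative, not the real distillation.yml.
base_config = {
    "tokenizer_type": "huggingface",
    "num_query_heads": 32,
    "num_decoder_layers": 32,
}

# The Student is structurally pruned (fewer attention heads); the Teacher keeps the base shape.
student_overrides = {"num_query_heads": 16}
teacher_overrides = {}

student_config = apply_overrides(base_config, student_overrides)
teacher_config = apply_overrides(base_config, teacher_overrides)
```

Keeping the overrides separate means both models inherit tokenizer and data settings from one place while differing only where the architectures differ.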

Run distillation

python3 -m src.MaxText.distillation.train_distill src/MaxText/configs/distillation.yml \
  run_name=distillation_test \
  base_output_directory=${BASE_OUTPUT_DIRECTORY} \
  checkpoint_period=2000 \
  hf_access_token=$HF_TOKEN \
  log_period=100 \
  save_checkpoint_on_completion=True

Tests

I have tested this script end-to-end on a TPU VM.
I verified that the Student and Teacher models load with distinct configurations.

Checklist

Before submitting this PR, please make sure (put X in square brackets):

I have performed a self-review of my code. For an optional AI review, add the gemini-review label.
I have necessary comments in my code, particularly in hard-to-understand areas.
I have run end-to-end tests and provided workload links above if applicable.
I have made or will make corresponding changes to the doc if needed, including adding new documentation pages to the relevant Table of Contents (toctree directive) as explained in our documentation.

codecov · 2025-12-30T06:56:43Z

Codecov Report

❌ Patch coverage is 0% with 180 lines in your changes missing coverage. Please review.

Files with missing lines	Patch %	Lines
src/MaxText/distillation/train_distill.py	0.00%	180 Missing ⚠️

richjames0

lgtm

NicoGrande

LGTM!

NicoGrande · 2026-01-08T23:53:53Z

src/MaxText/configs/distillation.yml

+tokenizer_path: "meta-llama/Llama-3.1-8B"
+tokenizer_type: "huggingface"
+
+# dataset_path: "gs://max-datasets-rogue"


Can we remove these commented lines?

NicoGrande · 2026-01-09T00:09:28Z

src/MaxText/distillation/train_distill.py

+  # 1. Setup Mesh
+  devices = jax.devices()
+  devices_array = maxtext_utils.create_device_mesh(student_config, devices)
+  mesh = jax.sharding.Mesh(devices_array, student_config.mesh_axes)


Is my understanding correct that both models will be collocated on the same Mesh? Will this need to be extended to a disaggregated setup to work with differently sized student/teacher models, for example?

Yes, it's a collocated setup on the same mesh. We can still run different model sizes on the same mesh, possibly with different sharding.

A disaggregated setup is not yet planned or supported; we can extend to it in the future if needed.
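To make the collocated arrangement concrete, here is a small sketch in which two arrays standing in for Teacher and Student weights live on one shared mesh but use different partition specs. The axis names, shapes, and variable names are assumptions for illustration; the actual script builds its mesh with maxtext_utils.create_device_mesh as shown in the diff above.

```python
import jax
import jax.numpy as jnp
import numpy as np
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

# One mesh over all visible devices; the axis names here are illustrative.
devices = np.array(jax.devices()).reshape(-1, 1)
mesh = Mesh(devices, axis_names=("data", "model"))

# Keep the sharded dimension divisible by the size of the "data" axis.
rows = 2 * devices.shape[0]

# Same mesh, different layouts: the stand-in Teacher weight is split along
# "model", the stand-in Student weight along "data".
teacher_sharding = NamedSharding(mesh, P(None, "model"))
student_sharding = NamedSharding(mesh, P("data", None))

teacher_weight = jax.device_put(jnp.ones((rows, 16)), teacher_sharding)
student_weight = jax.device_put(jnp.ones((rows, 8)), student_sharding)
```

Both arrays share the same devices, so the forward passes of the two models can run in one process without cross-mesh transfers.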

gagika force-pushed the agagik-distill branch from 9f395a9 to 4d91809 Compare December 30, 2025 17:23

gagika changed the title ~~Add soft distillation training script and configuration.~~ Soft distillation training support Dec 30, 2025

gagika force-pushed the agagik-distill branch 4 times, most recently from b1c786c to de0349d Compare January 6, 2026 04:56

gagika force-pushed the agagik-distill branch from cc4cb22 to 4a186a4 Compare January 6, 2026 18:02

gagika marked this pull request as ready for review January 6, 2026 18:10

gagika requested review from A9isha, NicoGrande, NuojCheng, RissyRan, SurbhiJainUSC, aireenmei, bvandermoon, gobbleturk, hengtaoguo, jiangjy1982, khatwanimohit, richjames0, shralex, suexu1025 and vipannalla as code owners January 6, 2026 18:10

gagika assigned richjames0, SurbhiJainUSC and RissyRan Jan 6, 2026

gagika force-pushed the agagik-distill branch from 4a186a4 to 01f665d Compare January 6, 2026 23:03

gagika requested a review from sheng-li January 7, 2026 19:36

gagika assigned sheng-li Jan 7, 2026

richjames0 approved these changes Jan 8, 2026

NicoGrande approved these changes Jan 9, 2026

Add soft distillation training script and configuration.

f02adc1

gagika force-pushed the agagik-distill branch from 4460f62 to f02adc1 Compare January 9, 2026 02:35

6 participants