
Conversation

Collaborator

@kocchop kocchop commented Dec 31, 2025

Description

Enables sequence packing for context parallelism with the ring strategy, using TransformerEngine's DotProductAttention. Includes comprehensive GPU tests for ring attention with packing on sm90+.

  • Currently supports packing only for ring attention
  • Replaced local sequence reordering with TE's reorder_causal_load_balancing API (see the sketch after this list)
  • The load-balancing strategy is currently picked automatically based on the packing config
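For orientation, here is a minimal sketch of the shape of this change, assuming a dict-of-arrays batch and using a hand-rolled dual-chunk-swap as a stand-in for the TE call (names mirror the snippets discussed below, but this is an illustration, not the PR's exact code):

import functools

import jax
import jax.numpy as jnp

@functools.partial(jax.jit, static_argnames=("cp_size",))
def reorder_causal_load_balanced(batch, cp_size):
  """Toy stand-in for the TE-backed reorder: dual-chunk-swap on the
  'inputs' field only. The actual PR delegates the permutation to TE's
  reorder_causal_load_balancing."""
  # Split the sequence axis into 2 * cp_size chunks; rank r keeps chunk r
  # plus its mirror chunk from the back, balancing causal-attention work.
  chunks = jnp.split(batch["inputs"], 2 * cp_size, axis=-1)
  order = [c for r in range(cp_size)
           for c in (chunks[r], chunks[2 * cp_size - 1 - r])]
  return {**batch, "inputs": jnp.concatenate(order, axis=-1)}

def get_reorder_callable(cp_size):
  """Bind cp_size so the result can be mapped over a host data iterator,
  e.g. data_iterator = map(get_reorder_callable(cp_size), data_iterator)."""
  return functools.partial(reorder_causal_load_balanced, cp_size=cp_size)

batch = {"inputs": jnp.arange(8)[None, :]}  # [batch=1, seq=8]
print(get_reorder_callable(2)(batch)["inputs"])  # [[0 1 6 7 2 3 4 5]]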

Tests

Added a GPU integration test that runs on sm90+.

Checklist

Before submitting this PR, please make sure (put X in square brackets):

  • I have performed a self-review of my code. For an optional AI review, add the gemini-review label.
  • I have necessary comments in my code, particularly in hard-to-understand areas.
  • I have run end-to-end tests and provided workload links above if applicable.
  • I have made or will make corresponding changes to the doc if needed, including adding new documentation pages to the relevant Table of Contents (toctree directive) as explained in our documentation.


Collaborator

@richjames0 richjames0 left a comment

a couple of nits but lgtm

# Handle packing configurations
if self.config.packing and self.config.dataset_type != "synthetic":
  if using_context_parallelism and not using_load_balanced_ring_cp:
    raise AssertionError("Packing is only supported for load balanced ring attention with context parallelism.")
Collaborator

Nit: AssertionError feels weird here to me. Maybe an ArgumentError?

Collaborator Author

converted to ValueError
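So the merged check presumably now reads something like this (a sketch of the post-review version, reusing the message from the snippet above):

# Handle packing configurations
if self.config.packing and self.config.dataset_type != "synthetic":
  if using_context_parallelism and not using_load_balanced_ring_cp:
    raise ValueError("Packing is only supported for load balanced ring attention with context parallelism.")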



-def get_reorder_callable(cp_size, shard_mode):
+def get_reorder_callable(cp_size, shard_mode, reorder_strategy=0):  # 0=DualChunkSwap, 1=Striped
Collaborator

As I read this late at night, I imagine you're using an integer here so it's comprehensible to JAX, but could this be made into an enum without breaking things (at worst using .value)?

Collaborator Author

changed to enum @richjames0
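Per the follow-up commit, the enum lands in common_types.py; presumably something along these lines (member values are an assumption based on the base.yml strings below; the point is the jit interplay):

import enum
from functools import partial

import jax

class ReorderStrategy(enum.Enum):
  """Load-balancing reorder strategies for context parallelism."""
  AUTO = "auto"                        # resolved from the packing config
  DUAL_CHUNK_SWAP = "dual_chunk_swap"  # chunk + mirror-chunk per rank
  STRIPED = "striped"                  # round-robin token stripes

# Enum members are hashable, so they pass straight through as static jit
# arguments -- no .value round-trip needed, addressing the concern above.
@partial(jax.jit, static_argnums=(1, 2))
def reorder_causal_load_balanced(batch, cp_size, reorder_strategy):
  ...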



-def shard_reorder_causal_load_balanced(batch, cp_size, shard_mode):
+def shard_reorder_causal_load_balanced(batch, cp_size, shard_mode, reorder_strategy=0):
Collaborator

Can we make reorder_strategy configurable via base.yml?

Collaborator Author

done, ptal @gobbleturk

Collaborator

Comment: terminology/naming is hard; folks have been using the name "striped" to refer to DUAL_CHUNK_SWAP. I guess the string "striped" has to be passed to TransformerEngine for the other strategy (I would prefer the name "interleaved")...

In any case, I really appreciate your comment with examples of the two strategies, clearly showing what they mean in our codebase!

…llelism

- Add ReorderStrategy enum to common_types.py (AUTO, DUAL_CHUNK_SWAP, STRIPED)
- Add context_parallel_reorder_strategy config option
- Update pyconfig, types.py, and train_utils.py to use enum
- Map MaxText enum to TE ReorderStrategy in max_utils.py
"""Reorders the example batch sequences"""
@partial(jax.jit, static_argnums=(1, 2))
def reorder_causal_load_balanced(batch, cp_size, reorder_strategy):
"""Reorders the example batch sequences
Collaborator

Great comment explaining the two of them with examples, thank you!

Collaborator

@gobbleturk gobbleturk left a comment

Thanks for the tests and great comments illustrating the two reorder strategies!

Collaborator

@RissyRan RissyRan left a comment

LGTM, just minor comments.

### Determine if we want to use load balance for context parallelism
context_parallel_load_balance: True
context_parallel_strategy: "all_gather" # "all_gather" or "ring"
context_parallel_reorder_strategy: "auto" # "auto", "dual_chunk_swap", or "striped"
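As an aside, the string flag presumably gets validated and mapped onto the enum during config parsing; a hypothetical helper to that effect (the commit message says the real mapping lives in pyconfig/types.py, and this assumes the enum values match these strings):

from common_types import ReorderStrategy  # AUTO, DUAL_CHUNK_SWAP, STRIPED

def parse_reorder_strategy(raw: str) -> ReorderStrategy:
  """Map the base.yml string onto the enum, failing fast on typos.
  Hypothetical helper; assumes enum values are the lowercase strings."""
  try:
    return ReorderStrategy(raw.lower())
  except ValueError as e:
    valid = sorted(s.value for s in ReorderStrategy)
    raise ValueError(
        f"context_parallel_reorder_strategy must be one of {valid}, got {raw!r}"
    ) from e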
Collaborator

Thanks for adding this strategy! Could you add a short explanation here for each option, drawing from reorder_causal_load_balanced?

-data_iterator = map(maxtext_utils.get_reorder_callable(context_parallel_size, config.shard_mode), data_iterator)
+# Determine load balancing reorder strategy based on whether packing is enabled
+if config.context_parallel_reorder_strategy == ReorderStrategy.AUTO:
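On AUTO: the description says the strategy is "automatically picked based on the packing config", so the resolution presumably reduces to a branch like the following (my guess, not the PR's literal code; the rationale would be that dual-chunk-swap assumes one contiguous causal sequence per rank, while striping keeps packed segments evenly distributed):

from common_types import ReorderStrategy  # hypothetical import, as above

def resolve_reorder_strategy(configured: ReorderStrategy, packing: bool) -> ReorderStrategy:
  """Hypothetical AUTO resolution: striped for packed batches,
  dual-chunk-swap otherwise."""
  if configured != ReorderStrategy.AUTO:
    return configured
  return ReorderStrategy.STRIPED if packing else ReorderStrategy.DUAL_CHUNK_SWAP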
Collaborator

Wondering if AUTO usually gives the best result? And is there a compatibility issue if a user selects ReorderStrategy.STRIPED without packing?

Trying to understand whether providing just two strategies is good enough.
