Add ZenFlow code for Stage 3 #7516
Hi @tohtana @sfc-gh-truwase @Antlera, when you have some time, could you please take a look at this PR? Thanks!
Force-pushed: db2dfac to 133290e
@JoshWoo2003 - could you please resolve merge conflicts?
Force-pushed: d550814 to 47b10d8
Sorry for the very late reply! I’ve resolved the merge conflicts and updated the affinity setting as suggested.
Hi @JoshWoo2003, the affinity part looks good to me. Thanks for the change! Can you also fix the formatting? Thanks!
- Introduced a new file, zenflow/engine_stage3.py, to implement ZenFlow-specific Stage 3 logic.
- Modified zero/stage3.py to ensure compatibility with ZenFlow's execution flow.
- Updated zero/parameter_offload.py to support the integration of ZenFlow with ZeRO Stage 3.
Signed-off-by: Yusen Wu <xrn4ub@virginia.edu>
- Add ZenFlowSelectiveAdamW_stage3 to support ZeRO Stage 3
- Update unit tests for ZeRO Stage 3 with ZenFlow
Signed-off-by: Yusen Wu <xrn4ub@virginia.edu>
Signed-off-by: Yusen Wu <xrn4ub@virginia.edu>
- Add default value (`zenflow=False`) in DeepSpeedZeROOffload.__init__
- Prevents TypeError when instantiating optimizer without zenflow
Signed-off-by: Yusen Wu <xrn4ub@virginia.edu>
- Resolved merge conflicts with upstream changes
- Unified ZenFlow affinity behavior for Stage 3 with Stage 1 and Stage 2
Signed-off-by: Yusen Wu <xrn4ub@virginia.edu>
Co-authored-by: Ma, Guokai <guokai.ma@intel.com>
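A minimal sketch of why a `zenflow=False` default in `DeepSpeedZeROOffload.__init__` avoids the `TypeError` mentioned in the commits above. The two classes below are hypothetical stand-ins, not DeepSpeed's actual class or signature; the `module` parameter and the helper `builds_without_zenflow` are assumptions made purely for illustration.

```python
class OffloadWithoutDefault:
    """Hypothetical stand-in: `zenflow` is a required argument."""

    def __init__(self, module, zenflow):
        self.module = module
        self.zenflow = zenflow


class OffloadWithDefault:
    """Hypothetical stand-in: `zenflow` defaults to False."""

    def __init__(self, module, zenflow=False):
        self.module = module
        self.zenflow = zenflow


def builds_without_zenflow(cls):
    """Return True if `cls` can be built by a caller that omits `zenflow`."""
    try:
        cls(module=object())
    except TypeError:
        # Python raises TypeError for a missing required argument.
        return False
    return True
```

Callers written before the `zenflow` parameter existed keep working against the defaulted version, while the required-argument version rejects them at construction time.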
Force-pushed: 4f4e752 to 26cc5ec
Thanks for the review, @delock! The formatting issues were due to my branch being behind the base. I’ve rebased onto upstream/master, and the latest push should fix them. Please take another look when you have a chance, thanks! @loadams @sfc-gh-truwase @tohtana @Antlera
Signed-off-by: Yusen Wu <xrn4ub@virginia.edu>
Signed-off-by: Yusen Wu <xrn4ub@virginia.edu>
- Extracted common process setup logic into `zenflow_utils.py` for reuse across stages.
- Removed unused `process_pool` assignment.
- Added explanatory comments to clarify `adamw` call differences between offload and non-offload paths.
Signed-off-by: Yusen Wu <xrn4ub@virginia.edu>
Co-authored-by: Ma, Guokai <guokai.ma@intel.com>
Co-authored-by: Tingfeng Lan <erc8gx@virginia.edu>
@JoshWoo2003 thanks for addressing the PR feedback. Please take a look at the CI failure.
- Added ZenFlowSelectiveAdamW_stage3 coverage in unit tests (offload and non-offload paths).
- Fixed a logic bug introduced by the refactoring.
Signed-off-by: Yusen Wu <xrn4ub@virginia.edu>
@sfc-gh-truwase Thanks for the reminder! I’ve fixed the CI failure and pushed the update.
This PR completes the ZenFlow integration for DeepSpeed ZeRO Stage 3.
Highlights:
Note: Integration with ZeRO Stage 1 & 2 was introduced in #7391.