use unpadded compute threads in cparams #5834
Conversation
Description
| Category | Relevant files |
|---|---|
| Enhancement | |
PR Reviewer Guide
Here are some key observations to aid the review process:

- 🧪 No relevant tests
- ⚡ Recommended focus areas for review
Missing Error Handling

The `getStaticComputeThreadsInDim` function doesn't have the same error handling as `getThreadCountInDim`. Specifically, it lacks the `NVF_ERROR` check for `GpuLower::hasCurrent()` and doesn't handle the case where the compile-time CTA shape is not known. This could lead to runtime crashes or undefined behavior.
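For illustration, the guard pattern the reviewer is asking for could look like the sketch below. This is a hedged stand-alone sketch, not nvFuser's actual code: `CtaShape`, the `lowering_active` flag, and the exceptions are hypothetical stand-ins for `GpuLower::hasCurrent()`, the compile-time CTA shape query, and `NVF_ERROR`.

```cpp
#include <cstdint>
#include <optional>
#include <stdexcept>

// Hypothetical stand-in for the compile-time CTA shape known to the lowering
// pass; in nvFuser this would come from CompileParams / ParallelDimensionMap.
struct CtaShape {
  std::optional<int64_t> compute_bdimx;  // unpadded compute threads in x
};

inline int64_t getStaticComputeThreadsInDim(
    bool lowering_active,  // stand-in for GpuLower::hasCurrent()
    const CtaShape& cta_shape) {
  // Mirror getThreadCountInDim: fail loudly instead of reading stale state.
  if (!lowering_active) {
    throw std::logic_error("GpuLower is not active");
  }
  // Handle the case where the compile-time CTA shape is not known.
  if (!cta_shape.compute_bdimx.has_value()) {
    throw std::logic_error("compile-time CTA shape is not known");
  }
  return cta_shape.compute_bdimx.value();
}
```

With both checks in place, a caller gets a clear diagnostic instead of a crash when lowering state or the static CTA shape is missing.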
Test failures

- (Medium, 1) Scalar numerical mismatch in thunder GPT nvFuser CUDA test: `thunder.tests.test_networks.test_nanogpt_complete_autograd_nvfuser_cuda_thunder.dtypes.float32` on H100 ❌
- (Medium, 1) nvFuser PingPongCircularBuffering large numerical mismatch on H100: `PingPongCircularBuffering.StageSlicePositionComputeAt/stage_slice_position_4` ❌
!test
Greptile Summary

This PR separates the storage of padded and unpadded thread dimensions in compile parameters to fix register sharing calculations for warp-specialized kernels.

Confidence Score: 5/5
Sequence Diagram

```mermaid
sequenceDiagram
    participant Scheduler as normalization_inner_tma.cpp
    participant CompileParams as executor_params.h (CompileParams)
    participant ParallelDimMap as parallel_dimension_map.cpp
    participant RegisterSharing as Register Sharing Logic
    Note over Scheduler: Initialize thread dimensions
    Scheduler->>Scheduler: compute_bdimx = bdimx (before padding)
    Scheduler->>Scheduler: compute_bdimy = 1
    Scheduler->>Scheduler: compute_bdimz = 1
    Note over Scheduler: Apply warp specialization padding
    alt Warp Specialization on TIDy
        Scheduler->>Scheduler: bdimy += 1 (add padding)
    else Warp Specialization on TIDx
        Scheduler->>Scheduler: bdimx += kWarpSpecializationPaddedThreads
    end
    Note over Scheduler: Store both padded and unpadded values
    Scheduler->>CompileParams: cparams.bdimx = bdimx (padded)
    Scheduler->>CompileParams: cparams.bdimy = bdimy (padded)
    Scheduler->>CompileParams: cparams.bdimz = bdimz (padded)
    Scheduler->>CompileParams: cparams.compute_bdimx = compute_bdimx (unpadded)
    Scheduler->>CompileParams: cparams.compute_bdimy = compute_bdimy (unpadded)
    Scheduler->>CompileParams: cparams.compute_bdimz = compute_bdimz (unpadded)
    Note over Scheduler: Calculate register sharing for compute threads
    Scheduler->>RegisterSharing: getRegisterSharing(reg_per_thread, <br/>compute_bdimx * compute_bdimy * compute_bdimz, <br/>kWarpSpecializationPaddedThreads)
    Note over ParallelDimMap: Later during lowering
    ParallelDimMap->>CompileParams: Read compute_bdim* values
    ParallelDimMap->>ParallelDimMap: getStaticComputeThreadsInDim(pt)<br/>returns unpadded thread count
    ParallelDimMap->>ParallelDimMap: Use unpadded values for <br/>register sharing calculations
```
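The register-sharing step in the diagram is why the unpadded count matters: registers given up by the padded (non-compute) warp group are redistributed to the compute warps, so the per-thread budget must be divided over the unpadded compute threads only. Below is a hedged sketch of that arithmetic; the padded-thread count of 128, the 32-register floor for padded threads, and the rounding to a multiple of 8 (the granularity of Hopper's `setmaxnreg`) are assumptions here, and nvFuser's actual `getRegisterSharing` logic may differ.

```cpp
#include <cstdint>

// Assumed value of nvFuser's kWarpSpecializationPaddedThreads (one extra
// warp group of 128 threads for the async/load role).
constexpr int64_t kWarpSpecializationPaddedThreads = 128;

// Sketch: redistribute the padded threads' register budget to the compute
// threads. reg_per_thread is the baseline allocation per thread across the
// whole (padded) CTA; compute_threads is the unpadded compute thread count.
inline int64_t computeRegistersPerComputeThread(
    int64_t reg_per_thread, int64_t compute_threads) {
  const int64_t total_threads =
      compute_threads + kWarpSpecializationPaddedThreads;
  const int64_t total_regs = reg_per_thread * total_threads;
  // Padded threads keep only a small fixed allocation (32 regs assumed).
  const int64_t padded_regs = 32 * kWarpSpecializationPaddedThreads;
  const int64_t compute_regs = (total_regs - padded_regs) / compute_threads;
  // Round down to a multiple of 8, the setmaxnreg granularity on Hopper.
  return (compute_regs / 8) * 8;
}
```

For example, with 256 unpadded compute threads and a baseline of 128 registers per thread, the compute threads end up with 176 registers each; dividing by the padded total of 384 threads instead would understate that budget, which is the bug class this PR targets.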
By explicitly tracking the unpadded compute thread dimensions, we can derive the padded value, and further validate the padded value if one is provided in cparams.
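That derive-and-validate idea can be sketched as follows. The field names mirror the diagram above (`compute_bdimx` unpadded, `bdimx` padded), but the struct, the padding rule (TIDx warp specialization adding an assumed 128 threads), and the use of `0` as "not provided" are illustrative assumptions, not nvFuser's actual `CompileParams` semantics.

```cpp
#include <cstdint>
#include <stdexcept>

// Assumed padding added for the warp-specialized (async) warp group.
constexpr int64_t kWarpSpecializationPaddedThreads = 128;

// Minimal stand-in for the relevant CompileParams fields.
struct CompileParamsSketch {
  int64_t compute_bdimx = 0;  // unpadded compute threads (source of truth)
  int64_t bdimx = 0;          // padded total; 0 means "not provided"
};

// Derive the padded x-dimension from the unpadded one; if a padded value
// was also supplied in cparams, check that it is consistent.
inline int64_t derivePaddedBdimx(const CompileParamsSketch& cparams) {
  const int64_t derived =
      cparams.compute_bdimx + kWarpSpecializationPaddedThreads;
  if (cparams.bdimx != 0 && cparams.bdimx != derived) {
    throw std::invalid_argument("padded bdimx in cparams is inconsistent");
  }
  return derived;
}
```

Keeping the unpadded value as the source of truth means the padded value becomes redundant data that can be cross-checked rather than trusted, catching schedulers that forget to add (or add twice) the warp-specialization padding.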