[CK] Add new fwd conv fp16/bf16 instances optimized for unit group size. #3670
base: develop
Conversation
Pull request overview
This PR adds new FP16 and BF16 convolution instances optimized for unit group size (G=1). The new instances use the compute-optimized block GEMM pipeline and show significant performance improvements (up to a 14% TFLOPS increase).
Changes:
- Added two new BF16 instances optimized for G=1 with different block configurations (256x256x256 and 512x128x32)
- Added two new FP16 instances optimized for G=1 with matching block configurations
- Added trailing commas to the existing instance declarations so the new entries can be appended
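For context, the instance lists touched by this PR are plain template instantiations appended to CK's device-operation instance files. The sketch below is a hypothetical illustration only, not copied from the diff: it reuses the parameter layout of the FP16 instance quoted in this page, swaps the data types to BF16, and plugs in the 256x256x256 block configuration named above. The interpretation of 256x256x256 as BlockSize/MPerBlock/NPerBlock and every other tuning value are assumptions, not the merged values.

```cpp
// Hypothetical illustration -- all tuning values besides the quoted
// 256x256x256 block configuration are assumptions, not the values
// actually merged in this PR.
DeviceGroupedConvFwdMultipleABD_Xdl_CShuffle_V3<NDimSpatial, ALayout, BLayout,
    DsLayout, ELayout, BF16, BF16, F32, BF16, DsDataTypes, BF16,
    PassThrough, PassThrough, OutElementOp, ConvSpec, GemmMNKPadding,
    256, 256, 256, 64, 8, 8, 32, 32, 4, 4,
    S<8, 32, 1>, S<1, 0, 2>, S<1, 0, 2>, 2, 8, 8, 0,
    S<8, 32, 1>, S<1, 0, 2>, S<1, 0, 2>, 2, 8, 8, 0,
    1, 1, S<1, 32, 1, 8>, 8,
    BlockGemmPipelineScheduler::Intrawave, BlockGemmPipelineVersion::v3>,
```

The distinguishing choice relative to the existing general-purpose instances is the larger block tile paired with the compute-optimized Intrawave pipeline, which pays off when G=1 leaves a single large GEMM per launch.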
DeviceGroupedConvFwdMultipleABD_Xdl_CShuffle_V3<NDimSpatial, ALayout, BLayout, DsLayout, ELayout, F16, F16, F32, F16, DsDataTypes, F16, PassThrough, PassThrough, OutElementOp, ConvSpec, GemmMNKPadding, 256, 128, 128, 64, 8, 8, 32, 32, 2, 2, S<8, 32, 1>, S<1, 0, 2>, S<1, 0, 2>, 2, 8, 8, 0, S<8, 32, 1>, S<1, 0, 2>, S<1, 0, 2>, 2, 8, 8, 0, 1, 1, S<1, 32, 1, 8>, 8, BlockGemmPipelineScheduler::Intrawave, BlockGemmPipelineVersion::v4>
// clang-format on
DeviceGroupedConvFwdMultipleABD_Xdl_CShuffle_V3<NDimSpatial, ALayout, BLayout, DsLayout, ELayout, F16, F16, F32, F16, DsDataTypes, F16, PassThrough, PassThrough, OutElementOp, ConvSpec, GemmMNKPadding, 256, 128, 128, 64, 8, 8, 32, 32, 2, 2, S<8, 32, 1>, S<1, 0, 2>, S<1, 0, 2>, 2, 8, 8, 0, S<8, 32, 1>, S<1, 0, 2>, S<1, 0, 2>, 2, 8, 8, 0, 1, 1, S<1, 32, 1, 8>, 8, BlockGemmPipelineScheduler::Intrawave, BlockGemmPipelineVersion::v4>,
Copilot (AI) · Jan 28, 2026
Empty line contains trailing whitespace. Remove the trailing whitespace to maintain code cleanliness.
Proposed changes
Added new FP16/BF16 instances that are optimized for group size = 1. The new instances use the compute-optimized block GEMM pipeline.