GroupedBlockQuantizeOp PR2: Adding python API and updating llama4 benchmark #5777
base: main
Conversation
1. Refactor the existing block_layout op and block_quantization_kernel to re-use existing runtime functions; 2. add a runtime function for GroupedBlockQuantizeOp.
Review updated until commit 8a6d209

Relevant files
- Enhancement
- Tests
PR Reviewer Guide
Here are some key observations to aid the review process:
🧪 PR contains tests
🔒 No security concerns identified
⚡ Recommended focus areas for review

Test Coverage
test_grouped_block_quantize_op appears comprehensive but only tests a single configuration. Consider whether additional parameter combinations (different block sizes, tensor dimensions, or data types) should be tested to ensure robustness of the new grouped block quantization functionality.
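A minimal sketch of how that single configuration might be broadened with pytest parametrization. The test name echoes test_grouped_block_quantize_op from this PR, but the parameter values, shapes, and the comparison strategy described in the comments are illustrative assumptions, not the actual test.

```python
import pytest
import torch


@pytest.mark.parametrize("block_size", [16, 32])  # assumed candidate block sizes
@pytest.mark.parametrize("dtype", [torch.bfloat16, torch.float16])
@pytest.mark.parametrize("shape", [(128, 256), (2048, 7168)])
def test_grouped_block_quantize_op_parametrized(block_size, dtype, shape):
    x = torch.randn(shape, dtype=dtype, device="cuda")
    # The body would mirror the existing test_grouped_block_quantize_op:
    # build input_offsets / output_offsets for a few groups, call
    # ops.nv_grouped_block_quantize on x, and compare the quantized tensor and
    # swizzled block scales against the two-op reference path
    # (nv_block_quantize followed by preprocess_grouped_matmul_input_sf).
```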
What's in this PR (runtime function)
1. Refactor existing runtime functions for re-use by the new op;
2. add a runtime function for GroupedBlockQuantizeOp.

What's in this PR (Fusion IR node)
1. Add the Fusion IR node GroupedBlockQuantizationOp. The operation is a combination of BlockQuantizationOp and PreprocessGroupedMatmulInputSf, and it inherits all the validation / checks from those two operations. It is similar to BlockQuantizationOp, except that:
i. the block scaling factor output doesn't have the swizzle logic represented as allocation domain transformations;
ii. it takes additional inputs (input_offsets and output_offsets) to facilitate group indexing, similar to PreprocessGroupedMatmulInputSf (a sketch of how such offsets might be built follows below).
2. Add a cpp test case for GroupedBlockQuantizationOp.
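A small sketch, not from the PR, of how the two offset inputs might be derived from per-group row counts. The exact convention GroupedBlockQuantizationOp expects (exclusive prefix sums, int32 offsets, and rounding each group up to 128 rows to match the swizzled scale layout) is an assumption here and should be checked against the op's validation.

```python
import torch

# Per-group row counts of the concatenated activation tensor (illustrative values).
group_sizes = torch.tensor([3000, 1024, 2048], dtype=torch.int32)
zero = torch.zeros(1, dtype=torch.int32)

# input_offsets: where each group starts in the input rows (exclusive prefix sum).
input_offsets = torch.cumsum(torch.cat([zero, group_sizes]), dim=0, dtype=torch.int32)

# output_offsets: where each group's block scaling factors start in the swizzled
# output. Rounding each group up to 128 rows mirrors the Block128x4 layout in the
# groupedBlockQuantize call shown in the sequence diagram below, but the exact
# padding rule is an assumption.
padded_sizes = ((group_sizes + 127) // 128) * 128
output_offsets = torch.cumsum(torch.cat([zero, padded_sizes]), dim=0, dtype=torch.int32)

print(input_offsets.tolist())   # [0, 3000, 4024, 6072]
print(output_offsets.tolist())  # [0, 3072, 4096, 6144]
```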
!test
Greptile Summary

This PR adds Python API support for GroupedBlockQuantizationOp via ops.nv_grouped_block_quantize. The implementation is clean, follows existing patterns in the codebase, and includes proper testing.

Confidence Score: 5/5
Sequence Diagram

sequenceDiagram
participant User
participant PythonAPI as Python API (ops.nv_grouped_block_quantize)
participant CPP as C++ groupedBlockQuantize
participant Runtime as Runtime Function
participant Output as Quantized Output
User->>PythonAPI: Call nv_grouped_block_quantize(input, input_offsets, output_offsets, global_scale, block_size, dtype)
PythonAPI->>CPP: groupedBlockQuantize(input, input_offsets, output_offsets, Block128x4, global_scale, block_size, dtype)
CPP->>Runtime: Execute GroupedBlockQuantizationOp
Note over Runtime: Combines quantization + swizzle layout
Runtime-->>CPP: BlockQuantizationResults{quantized_tensor, block_scales}
CPP-->>PythonAPI: Return (quantized_tensor, block_scales)
PythonAPI-->>User: Python tuple (fp4_tensor, swizzled_scales)
Note over User,Output: Old flow (2 operations):
Note over User: 1. nv_block_quantize(input) → (fp4, scales)
Note over User: 2. preprocess_grouped_matmul_input_sf(scales, offsets) → swizzled_scales
Note over User,Output: New flow (1 operation):
Note over User: nv_grouped_block_quantize(input, offsets, blockscale_offsets) → (fp4, swizzled_scales)
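A hedged Python sketch of the old versus new flow summarized in the notes above. Only the op names and argument order are taken from the diagram; the fd.ops namespace, the tuple returns, and the omission of the global_scale, block_size, and dtype arguments are simplifying assumptions about the python frontend, not its confirmed signatures.

```python
def quantize_old(fd, inp, offsets, blockscale_offsets):
    # Old flow, two ops: quantize, then rewrite the block scaling factors into
    # the swizzled layout that grouped_mm expects.
    fp4, scales = fd.ops.nv_block_quantize(inp)
    swizzled_scales = fd.ops.preprocess_grouped_matmul_input_sf(scales, offsets)
    return fp4, swizzled_scales


def quantize_new(fd, inp, offsets, blockscale_offsets):
    # New flow, one op from this PR series: quantization and the swizzled scale
    # layout are produced by a single kernel.
    fp4, swizzled_scales = fd.ops.nv_grouped_block_quantize(
        inp, offsets, blockscale_offsets
    )
    return fp4, swizzled_scales
```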
Context
The series of PRs is trying to enable a single kernel for quantization and layout handling of the block scaling factor on grouped tensors.
The existing solution for nvfp4 quantization of an activation tensor for grouped_mm relies on two operations:
i. BlockQuantizationOp produces scaled_tv and block_scaling_factor.
ii. block_scaling_factor needs to be processed by PreprocessGroupedMatmulInputSf in order to satisfy the swizzle layout required by grouped_mm kernels.
The series of PRs merges the two operations into a single one.
Stacked PRs
#5775 GroupedBlockQuantizationOp PR0: Adding runtime function
#5776 GroupedBlockQuantizationOp PR1: Adding codegen support
#5777 GroupedBlockQuantizationOp PR2: Adding python API and updating llama4 benchmark
What's in this PR
1. ops.nv_grouped_block_quantize for GroupedBlockQuantizationOp;
2. updating the llama4 benchmark.