[ET-VK][qconv] Add flexible layout impl for quantized pointwise conv #17221
SS-JIA wants to merge 1 commit into gh/SS-JIA/410/base
Conversation
This commit adds a flexible memory layout implementation for quantized pointwise
(1x1) convolution in the ExecuTorch Vulkan backend. The key changes introduce a
new operator (etvk.q8ta_conv2d_pw) that can handle multiple int8 tensor memory
layouts, rather than being restricted to a single fixed layout.
Key Components Added
1. Two New GLSL Compute Shaders
- q8ta_conv2d_pw.glsl: The primary flexible-layout shader that uses
BufferMetadata UBOs and layout specialization constants to support multiple
memory layouts (kPackedInt8_4C1W, kPackedInt8_4W4C, kPackedInt8_4C). Uses
scalar array indexing for output writes to handle different stride patterns.
- q8ta_conv2d_pw_4w4c_ref.glsl: A reference implementation specifically for 4W4C
layout that uses simpler ivec4 indexing. Currently not enabled in production
(gated by if (false) in C++).
Both shaders use:
- 4×8 output tiling (TILE_M=4 widths × TILE_N=8 channels per thread)
- dotPacked4x8AccSatEXT for efficient int8 dot products (see the scalar sketch after this list)
- Texture2D for weight storage, buffers for input/output
- Per-channel weight quantization with symmetric int8 weights
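For readers unfamiliar with the packed integer dot product built-in, the following is a minimal scalar C++ sketch of the arithmetic dotPacked4x8AccSatEXT performs, and of how a pointwise convolution reduces to a chain of such calls. This mirrors only the math; it is not shader code, and the function names below are invented for illustration.

```cpp
#include <algorithm>
#include <cstdint>

// Scalar reference for the packed int8 dot product the shaders rely on. Each
// 32-bit word packs four int8 lanes; the four lane products are summed and
// added to `acc`, saturating the result to the int32 range.
int32_t dot_packed_4x8_acc_sat(uint32_t a, uint32_t b, int32_t acc) {
  int64_t sum = acc;
  for (int lane = 0; lane < 4; ++lane) {
    const int8_t av = static_cast<int8_t>((a >> (8 * lane)) & 0xFF);
    const int8_t bv = static_cast<int8_t>((b >> (8 * lane)) & 0xFF);
    sum += static_cast<int64_t>(av) * static_cast<int64_t>(bv);
  }
  return static_cast<int32_t>(std::clamp<int64_t>(sum, INT32_MIN, INT32_MAX));
}

// A pointwise (1x1) conv output value then reduces to a chain of such calls,
// consuming the input channels four at a time.
int32_t conv1x1_accumulate(
    const uint32_t* input_packed, const uint32_t* weight_packed, int ic_groups) {
  int32_t acc = 0;
  for (int g = 0; g < ic_groups; ++g) {
    acc = dot_packed_4x8_acc_sat(input_packed[g], weight_packed[g], acc);
  }
  return acc;
}
```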
2. C++ Operator Implementation (Q8taConv2dPW.cpp)
- prepack_quantized_conv2d_pw_weight(): Prepacks int8 weights into texture2D
format optimized for the shader's access pattern
- add_q8ta_conv2d_pw_node(): Dispatches the flexible-layout shader with buffer
metadata UBOs
- add_q8ta_conv2d_pw_4w4c_node(): Dispatches the 4W4C-specific reference shader
- q8ta_conv2d_pw(): High-level operator that handles argument parsing, weight
prepacking, and kernel selection (see the sketch after this list)
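As a rough sketch of that flow (the function names are taken from the list above, but all types and signatures here are stand-ins assumed for illustration, not the actual ExecuTorch Vulkan API):

```cpp
#include <cstdint>

// Stand-in types so the sketch is self-contained; the real ComputeGraph and
// ValueRef come from the ExecuTorch Vulkan backend and look different.
struct ComputeGraph {};
using ValueRef = int32_t;

// Declarations only: the names match the functions listed above, but these
// signatures are assumptions made for this illustration.
ValueRef prepack_quantized_conv2d_pw_weight(ComputeGraph& graph, ValueRef weight);
void add_q8ta_conv2d_pw_node(
    ComputeGraph& graph, ValueRef input, ValueRef packed_weight, ValueRef output);
void add_q8ta_conv2d_pw_4w4c_node(
    ComputeGraph& graph, ValueRef input, ValueRef packed_weight, ValueRef output);

// Rough shape of the high-level operator: prepack the weights, then select a
// kernel. The 4W4C reference path is present but compiled out via `if (false)`,
// matching the note above that it is not enabled in production.
void q8ta_conv2d_pw_sketch(
    ComputeGraph& graph, ValueRef input, ValueRef weight, ValueRef output) {
  const ValueRef packed_weight = prepack_quantized_conv2d_pw_weight(graph, weight);
  if (false) {
    add_q8ta_conv2d_pw_4w4c_node(graph, input, packed_weight, output);
  } else {
    add_q8ta_conv2d_pw_node(graph, input, packed_weight, output);
  }
}
```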
3. Test Infrastructure Updates
- TestQ8taConv2d.cpp: Added a test_q8ta_conv2d_pw() test operator that wraps
quantize → conv2d_pw → dequantize for end-to-end testing (this pipeline is sketched after this section)
- test_q8ta_conv2d_pw.cpp: Comprehensive test suite with:
- Multiple channel configurations (3→32, 32→64, 64→96, 7→13, 40→80 input→output channels, etc.)
- Performance test cases (480→160, 48→22, 128→128, 576→64 channels)
- Tests across 3 memory layouts: kPackedInt8_4C1W, kPackedInt8_4W4C,
kPackedInt8_4C
- Both texture and buffer storage types for floating-point tensors
- Reference implementation comparison for correctness validation
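For reference, the arithmetic of the quantize → conv2d_pw → dequantize pipeline that the test wraps can be written on the CPU roughly as follows. Shapes, helper names, and tolerances in the actual test files may differ; this sketch assumes per-tensor affine activation quantization and per-channel symmetric int8 weights, as described above.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

// CPU sketch of quantize -> pointwise conv -> dequantize. Input is quantized
// per-tensor (affine, int8); weights are per-output-channel symmetric int8.
// Layouts: input [IC, H, W], weights [OC, IC], output [OC, H, W].
std::vector<float> q8ta_conv2d_pw_reference(
    const std::vector<float>& input,
    const std::vector<int8_t>& weight_q,
    const std::vector<float>& weight_scale, // one scale per output channel
    const std::vector<float>& bias,         // one bias per output channel
    float in_scale,
    int32_t in_zero_point,
    int IC, int OC, int H, int W) {
  // Quantize the activation: q = round(x / scale) + zero_point, clamped to int8.
  std::vector<int8_t> input_q(input.size());
  for (size_t i = 0; i < input.size(); ++i) {
    const int32_t q =
        static_cast<int32_t>(std::lround(input[i] / in_scale)) + in_zero_point;
    input_q[i] = static_cast<int8_t>(std::clamp<int32_t>(q, -128, 127));
  }

  std::vector<float> output(static_cast<size_t>(OC) * H * W);
  for (int oc = 0; oc < OC; ++oc) {
    for (int hw = 0; hw < H * W; ++hw) {
      // 1x1 kernel: the reduction is a plain dot product over input channels
      // accumulated in int32.
      int32_t acc = 0;
      for (int ic = 0; ic < IC; ++ic) {
        const int32_t a =
            input_q[static_cast<size_t>(ic) * H * W + hw] - in_zero_point;
        const int32_t w = weight_q[static_cast<size_t>(oc) * IC + ic];
        acc += a * w;
      }
      // Dequantize: combined scale is in_scale * per-channel weight scale.
      output[static_cast<size_t>(oc) * H * W + hw] =
          static_cast<float>(acc) * in_scale * weight_scale[oc] + bias[oc];
    }
  }
  return output;
}
```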
Architecture
The shader handles layout flexibility via:
1. Layout specialization constants (outp_layout, inp_layout) passed from C++
2. BufferMetadata UBOs providing runtime strides for input/output tensors
3. compute_outp_buffer_idx() function that computes the correct buffer index
for a given layout (a simplified C++ analogue follows this list)
4. get_outer_packed_dim_block_size() from block_indexing.glslh to determine
stride patterns
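To make the stride-driven indexing concrete, here is a small C++ analogue of the idea: packed dimensions are first reduced to block coordinates and then combined with the runtime strides carried by the BufferMetadata UBO. The struct layouts, dimension order, and block extents below are assumptions for illustration only; the real definitions live in block_indexing.glslh and the shader's BufferMetadata.

```cpp
#include <array>
#include <cstdint>

// Runtime strides for the blocked tensor, analogous to what the BufferMetadata
// UBO carries. Dimension order and units here are assumptions.
struct BufferMetadata {
  std::array<uint32_t, 4> strides; // stride per (w, h, c, n) dimension, in packed words
};

// How many elements a packed int8 layout groups along each dimension, e.g.
// roughly {4, 1, 4, 1} for a 4W4C-style layout (illustrative values only).
struct LayoutBlock {
  std::array<uint32_t, 4> extent;
};

// Analogue of compute_outp_buffer_idx(): reduce the element position to block
// coordinates, then combine them with the runtime strides.
uint32_t compute_buffer_idx(
    const std::array<uint32_t, 4>& pos, // element position as (w, h, c, n)
    const LayoutBlock& block,
    const BufferMetadata& meta) {
  uint32_t idx = 0;
  for (int d = 0; d < 4; ++d) {
    idx += (pos[d] / block.extent[d]) * meta.strides[d];
  }
  return idx;
}
```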
Differential Revision: [D92307253](https://our.internmc.facebook.com/intern/diff/D92307253/)
[ghstack-poisoned]
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/17221
Note: Links to docs will display an error until the docs builds have been completed.
❌ 3 New Failures, 1 Pending, 1 Unrelated Failure as of commit 8724d37 with merge base 477867a.
BROKEN TRUNK - The following job failed but was also present on the merge base: 👉 Rebase onto the `viable/strict` branch to avoid these failures.
This comment was automatically generated by Dr. CI and updates every 15 minutes.