[ET-VK][qconv] Add layout-flexible impl of quantized depthwise conv2d #17108
Adds a new layout-agnostic quantized depthwise convolution operator
`etvk.q8ta_conv2d_dw` that uses BufferMetadata and layout specialization
constants to support arbitrary memory layouts (contiguous, channels-last,
4W4C block-packed, etc.).
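For readers unfamiliar with these layouts, here is a minimal C++ sketch of how a layout tag could select the element-addressing scheme for an `{N, C, H, W}` tensor. The enum, the function, and the intra-block ordering shown for the 4W4C layout are illustrative assumptions, not the actual ExecuTorch Vulkan API:

```cpp
#include <array>
#include <cstdint>

// Hypothetical layout tags mirroring the layouts named above; the real
// implementation passes the layout as a GLSL specialization constant.
enum class Layout : int32_t { Contiguous, ChannelsLast, Block4W4C };

// Illustrative: buffer offset of element (n, c, h, w) for a tensor of
// size {N, C, H, W} under each layout.
int64_t element_offset(Layout layout, std::array<int64_t, 4> sizes,
                       int64_t n, int64_t c, int64_t h, int64_t w) {
  const int64_t C = sizes[1], H = sizes[2], W = sizes[3];
  switch (layout) {
    case Layout::Contiguous:   // NCHW: width is fastest-moving
      return ((n * C + c) * H + h) * W + w;
    case Layout::ChannelsLast: // NHWC: channels are fastest-moving
      return ((n * H + h) * W + w) * C + c;
    case Layout::Block4W4C: {  // 4x4 blocks over width and channels;
      const int64_t W4 = (W + 3) / 4, C4 = (C + 3) / 4;
      const int64_t block = ((n * C4 + c / 4) * H + h) * W4 + w / 4;
      // Intra-block ordering is one plausible convention, assumed here.
      return block * 16 + (c % 4) * 4 + (w % 4);
    }
  }
  return -1; // unreachable
}
```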
Key changes:
1. New shader `q8ta_conv2d_dw.glsl` (see the reference sketch after this list):
- Uses BufferMetadata for input/output tensor addressing
- Layout-aware via `inp_layout`/`outp_layout` specialization constants
- Processes 4 adjacent width positions × 4 channels per thread (4Wx4C tile)
- Includes optimized paths for simple layouts (outer_block_size == 1)
2. New indexing utilities in `indexing.glslh` (sketched after this list):
- `texel_idx_to_tensor4d_idx()`: converts linear texel index to 4D tensor coords
- `tensor4d_idx_to_texel_idx()`: converts 4D tensor index to texel index
3. Code refactoring:
- Extract `Conv2DParams` struct and `create_conv2d_params()` to ConvolutionUtils.h
- Create Q8taConv2dDW.cpp with new operator implementation
- Add Q8taConv2d.h with public API declarations
- Move `prepack_quantized_conv2d_dw_weight()` to new implementation file
4. New workgroup size helpers (sketched after this list):
- `pick_q8ta_conv2d_dw_global_wg_size()`: computes {W4, H, C4} dispatch size
- `pick_q8ta_conv2d_dw_local_wg_size()`: adaptive local size based on tensor dims
5. Test updates:
- Rename test to `test_q8_conv2d_dw.cpp`
- Add `TestQ8taConv2d.cpp` with shared test utilities
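As referenced in item 1, the shader's arithmetic follows the standard affine-quantization recipe for depthwise conv2d. Below is a hedged scalar reference for what one output element computes, assuming int8 activations and weights, per-tensor quantization parameters, a symmetric (zero-point 0) weight scheme, and bias pre-quantized at `inp_scale * w_scale`; the names and the exact q8ta convention are assumptions:

```cpp
#include <cmath>
#include <cstdint>
#include <vector>

// Scalar reference for one output element of a quantized depthwise conv2d.
// Input is NHWC for a single image; weight is {C, Kh, Kw}, one filter per
// channel. Accumulation is int32, requantized to int8 at the end.
int8_t q8_dw_conv2d_ref_element(
    const std::vector<int8_t>& input, const std::vector<int8_t>& weight,
    int32_t bias, int C, int H, int W, int Kh, int Kw,
    int oh, int ow, int c, int stride, int pad,
    int32_t inp_zp, float inp_scale, float w_scale,
    int32_t out_zp, float out_scale) {
  int32_t acc = bias;
  for (int kh = 0; kh < Kh; ++kh) {
    for (int kw = 0; kw < Kw; ++kw) {
      const int ih = oh * stride - pad + kh;
      const int iw = ow * stride - pad + kw;
      // Padded taps read the input zero point, so they contribute 0
      // after the (x - inp_zp) offset below.
      const int32_t x = (ih >= 0 && ih < H && iw >= 0 && iw < W)
          ? input[(ih * W + iw) * C + c]
          : inp_zp;
      const int32_t w = weight[(c * Kh + kh) * Kw + kw];
      acc += (x - inp_zp) * w; // weight zero point assumed 0 (symmetric)
    }
  }
  // Requantize: rescale the int32 accumulator into the output's domain.
  const float real = acc * inp_scale * w_scale;
  int32_t q = static_cast<int32_t>(std::lround(real / out_scale)) + out_zp;
  q = q < -128 ? -128 : (q > 127 ? 127 : q);
  return static_cast<int8_t>(q);
}
```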
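As referenced in item 2, here is a sketch of the index conversions the new `indexing.glslh` helpers perform, written in C++ for readability. The packed-dimension convention (four elements along width per texel) and the signatures are assumptions; the real helpers derive strides from BufferMetadata:

```cpp
#include <array>
#include <cstdint>

struct Tensor4dIdx { int64_t n, c, h, w; };

// Convert a linear texel index into 4D tensor coordinates, assuming width
// is packed into texels of 4 and texels are laid out densely in
// (W4, H, C, N) order. sizes is {N, C, H, W}.
Tensor4dIdx texel_idx_to_tensor4d_idx(
    int64_t texel_idx, std::array<int64_t, 4> sizes) {
  const int64_t W4 = (sizes[3] + 3) / 4;
  Tensor4dIdx idx;
  idx.w = (texel_idx % W4) * 4; // base width coordinate of the texel
  texel_idx /= W4;
  idx.h = texel_idx % sizes[2];
  texel_idx /= sizes[2];
  idx.c = texel_idx % sizes[1];
  idx.n = texel_idx / sizes[1];
  return idx;
}

// Inverse mapping: 4D tensor coordinates back to the owning texel's index.
int64_t tensor4d_idx_to_texel_idx(
    Tensor4dIdx idx, std::array<int64_t, 4> sizes) {
  const int64_t W4 = (sizes[3] + 3) / 4;
  return ((idx.n * sizes[1] + idx.c) * sizes[2] + idx.h) * W4 + idx.w / 4;
}
```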
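And for item 4, the `{W4, H, C4}` dispatch size follows directly from the 4Wx4C tile: each invocation covers four width positions and four channels, so the grid shrinks by 4 in both of those dimensions. A sketch, with `uvec3` standing in for the runtime's vector type and the local-size heuristic purely illustrative:

```cpp
#include <cstdint>

struct uvec3 { uint32_t x, y, z; }; // stand-in for the runtime's vector type

// Each thread covers a 4-wide x 4-channel tile, so the global work group
// spans ceil(W / 4) x H x ceil(C / 4) invocations.
uvec3 pick_q8ta_conv2d_dw_global_wg_size_sketch(
    uint32_t W, uint32_t H, uint32_t C) {
  return uvec3{(W + 3) / 4, H, (C + 3) / 4};
}

// Illustrative heuristic only: keep ~64 threads per group, spending them on
// the z axis when there are many channel groups and on x/y otherwise.
uvec3 pick_q8ta_conv2d_dw_local_wg_size_sketch(uvec3 global_wg) {
  return global_wg.z > 1 ? uvec3{8, 4, 2} : uvec3{16, 4, 1};
}
```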
Differential Revision: [D92061368](https://our.internmc.facebook.com/intern/diff/D92061368/)
…wise conv2d"
Adds a new layout-agnostic quantized depthwise convolution operator
`etvk.q8ta_conv2d_dw` that uses BufferMetadata and layout specialization
constants to support arbitrary memory layouts (contiguous, channels-last,
4W4C block-packed, etc.).
Key changes:
1. New shader `q8ta_conv2d_dw.glsl`:
- Uses BufferMetadata for input/output tensor addressing
- Layout-aware via `inp_layout`/`outp_layout` specialization constants
- Processes 4 adjacent width positions × 4 channels per thread (4Wx4C tile)
- Includes optimized paths for simple layouts (outer_block_size == 1)
2. New indexing utilities in `indexing.glslh`:
- `texel_idx_to_tensor4d_idx()`: converts linear texel index to 4D tensor coords
- `tensor4d_idx_to_texel_idx()`: converts 4D tensor index to texel index
3. Code refactoring:
- Extract `Conv2DParams` struct and `create_conv2d_params()` to ConvolutionUtils.h
- Create Q8taConv2dDW.cpp with new operator implementation
- Add Q8taConv2d.h with public API declarations
- Move `prepack_quantized_conv2d_dw_weight()` to new implementation file
4. New workgroup size helpers:
- `pick_q8ta_conv2d_dw_global_wg_size()`: computes {W4, H, C4} dispatch size
- `pick_q8ta_conv2d_dw_local_wg_size()`: adaptive local size based on tensor dims
5. Test updates:
- Rename test to `test_q8_conv2d_dw.cpp`
- Add `TestQ8taConv2d.cpp` with shared test utilities
Differential Revision: [D92061368](https://our.internmc.facebook.com/intern/diff/D92061368/)
[ghstack-poisoned]
…wise conv2d"
Adds a new layout-agnostic quantized depthwise convolution operator
`etvk.q8ta_conv2d_dw` that uses BufferMetadata and layout specialization
constants to support arbitrary memory layouts (contiguous, channels-last,
4W4C block-packed, etc.).
Key changes:
1. New shader `q8ta_conv2d_dw.glsl`:
- Uses BufferMetadata for input/output tensor addressing
- Layout-aware via `inp_layout`/`outp_layout` specialization constants
- Processes 4 adjacent width positions × 4 channels per thread (4Wx4C tile)
- Includes optimized paths for simple layouts (outer_block_size == 1)
2. New indexing utilities in `indexing.glslh`:
- `texel_idx_to_tensor4d_idx()`: converts linear texel index to 4D tensor coords
- `tensor4d_idx_to_texel_idx()`: converts 4D tensor index to texel index
3. Code refactoring:
- Extract `Conv2DParams` struct and `create_conv2d_params()` to ConvolutionUtils.h
- Create Q8taConv2dDW.cpp with new operator implementation
- Add Q8taConv2d.h with public API declarations
- Move `prepack_quantized_conv2d_dw_weight()` to new implementation file
4. New workgroup size helpers:
- `pick_q8ta_conv2d_dw_global_wg_size()`: computes {W4, H, C4} dispatch size
- `pick_q8ta_conv2d_dw_local_wg_size()`: adaptive local size based on tensor dims
5. Test updates:
- Rename test to `test_q8_conv2d_dw.cpp`
- Add `TestQ8taConv2d.cpp` with shared test utilities
Differential Revision: [D92061368](https://our.internmc.facebook.com/intern/diff/D92061368/)
[ghstack-poisoned]