[ET][Cadence] patch transpose kernel for safe load & store by zonglinpeng · Pull Request #17177 · pytorch/executorch

zonglinpeng · 2026-02-03T22:58:31Z

Stack from ghstack (oldest at bottom):

(to be filled)

Original patch: cad-audio@1cb9ee7#diff-72cb6b6b59f6d8be4021885fa14490cd182776f883b6d2132678968b9e7cb267

Fixing Crash Issue in `xa_nn_transpose_32` for Unaligned Buffers

Overview

This document explains the code changes made to xa_nn_transpose_32.c to fix a memory alignment crash issue in the transpose operation for Cadence HiFi DSPs. The patch addresses two critical bugs that caused crashes when processing unaligned input/output buffers.

Files Modified

File	Description
`backends/cadence/hifi/operators/op_permute_copy.cpp`	Permute copy operator
`backends/cadence/hifi/operators/op_transpose_copy.cpp`	Transpose copy operator
`backends/cadence/hifi/third-party/nnlib/xa_nn_transpose_32.c`	Core transpose kernel

Change 1: Simplified Inner Loop (Lines 170-191)

The Problem

The original code had a flawed alignment check:

// OLD CODE - BUGGY
if((((unsigned)p_inp4 & 1) == 0) && (((unsigned)p_out & 1) == 0))
{
    // "Aligned" path using AE_L32X2_IP / AE_S32X2_IP
    ae_int32x2 *__restrict__ pae_i = (ae_int32x2 *)(p_inp4);
    ae_int32x2 *__restrict__ pae_o = (ae_int32x2 *)(p_out);
    ae_int32x2 d0;
    for(itr4 = 0; itr4 < (out_dim4 >> 1); itr4++)
    {
        AE_L32X2_IP(d0, pae_i, 2 * sizeof(WORD32));  // Aligned load
        AE_S32X2_IP(d0, pae_o, 2 * sizeof(WORD32));  // Aligned store
    }
    // ... handle remainder
}
else
{
    // Unaligned path using AE_LA32X2_IP / AE_SA32X2_IP
    // ...
}

Root Cause Analysis

The alignment check & 1 only verifies 2-byte alignment (checking if the least significant bit is 0). However:

Data Type	Size	Required Alignment
`WORD32`	4 bytes	4-byte alignment
`ae_int32x2`	8 bytes (64-bit SIMD)	8-byte alignment

The bug: The check should be:

(ptr & 0x7) == 0 for 8-byte alignment, OR
(ptr & 0x3) == 0 for 4-byte alignment (minimum)

The incorrect & 1 check meant that when pointers were 2-byte aligned but NOT 8-byte aligned, the code incorrectly took the "aligned" path.

Consequences of the Bug

When AE_L32X2_IP / AE_S32X2_IP (aligned intrinsics) are used on misaligned addresses:

Hardware Exception: Cadence HiFi DSPs may generate an alignment exception
Crash: The program terminates unexpectedly
Data Corruption: In some configurations, incorrect data may be loaded/stored silently

The Fix

// NEW CODE - SAFE (always uses unaligned intrinsics)
WORD32 *p_inp4 = p_inp3+(itr3*inp_stride[p_5D_permute_vec[3]]);
ae_int32x2 *__restrict__ pae_i = (ae_int32x2 *)(p_inp4);
ae_int32x2 *__restrict__ pae_o = (ae_int32x2 *)(p_out);
ae_valign a_inp = AE_LA64_PP(pae_i);      // Prime alignment register for input
ae_valign a_out = AE_ZALIGN64();           // Initialize alignment register for output
ae_int32x2 d0;
for(itr4 = 0; itr4 < (out_dim4 >> 1); itr4++)
{
    AE_LA32X2_IP(d0, a_inp, pae_i);        // Unaligned load (safe for any alignment)
    AE_SA32X2_IP(d0, a_out, pae_o);        // Unaligned store (safe for any alignment)
}
AE_SA64POS_FP(a_out, pae_o);               // Flush remaining data in alignment register
ae_int32 *__restrict__ puae_i = (ae_int32 *)(pae_i);
ae_int32 *__restrict__ puae_o = (ae_int32 *)(pae_o);
#pragma loop_count max=3
for(itr4 = 0; itr4 < (out_dim4 & 1); itr4++)
{
    puae_o[itr4] = puae_i[itr4];           // Handle odd element
}

Why This Fix Works

Always uses unaligned intrinsics (AE_LA32X2_IP / AE_SA32X2_IP)
- These intrinsics work correctly for both aligned and unaligned data
- They use internal alignment registers to handle any pointer alignment
Eliminates the faulty branch entirely
- No alignment check needed
- No risk of taking the wrong code path
Minimal performance impact
- On modern HiFi DSPs, unaligned intrinsics on aligned data have negligible overhead
- The hardware efficiently handles the alignment internally

Intrinsic Comparison

Intrinsic	Type	Alignment Requirement	Use Case
`AE_L32X2_IP`	Aligned load	Requires 8-byte aligned pointer	Only when alignment is guaranteed
`AE_LA32X2_IP`	Unaligned load	Works with any alignment	Safe for all cases
`AE_S32X2_IP`	Aligned store	Requires 8-byte aligned pointer	Only when alignment is guaranteed
`AE_SA32X2_IP`	Unaligned store	Works with any alignment	Safe for all cases
`AE_SA64POS_FP`	Flush	N/A	Must be called after unaligned stores

Change 2: Fixed Strided Load (Lines 216-222)

The Problem

// OLD CODE - PROBLEMATIC
AE_L32_XP(d0, (ae_int32 *)p_inp4, inp_stride[p_5D_permute_vec[4]] << 2);
AE_L32_XP(d1, (ae_int32 *)p_inp4, inp_stride[p_5D_permute_vec[4]] << 2);

Root Cause Analysis

The AE_L32_XP intrinsic:

Loads a 32-bit value from the pointer
Modifies the pointer by adding the byte offset (second parameter)
Has specific alignment and type requirements

Issues with the old code:

Byte offset calculation: inp_stride[...] << 2 converts stride to bytes, but the pointer increment behavior inside the macro can be problematic with unaligned addresses
Intrinsic alignment requirements: AE_L32_XP may have stricter alignment requirements than expected on some HiFi variants
Pointer type mismatch: The cast (ae_int32 *)p_inp4 combined with the post-increment behavior can cause undefined behavior

The Fix

// NEW CODE - EXPLICIT AND SAFE
d0 = AE_L32_X((ae_int32 *)p_inp4, 0);           // Load from p_inp4 + 0 bytes
p_inp4 += inp_stride[p_5D_permute_vec[4]];      // Manual pointer increment (elements)
d1 = AE_L32_X((ae_int32 *)p_inp4, 0);           // Load from updated p_inp4
p_inp4 += inp_stride[p_5D_permute_vec[4]];      // Manual pointer increment (elements)

Why This Fix Works

AE_L32_X(ptr, offset) - Loads from ptr + offset without modifying the pointer
- More predictable behavior
- No side effects on the pointer variable
Explicit pointer arithmetic - p_inp4 += stride increments by elements (not bytes)
- Correct for WORD32* pointer type
- The << 2 byte conversion is no longer needed
Clearer semantics
- Load and pointer update are separate operations
- Easier to debug and maintain
- No reliance on macro-specific behavior

Comparison

Aspect	Old (`AE_L32_XP`)	New (`AE_L32_X` + manual increment)
Pointer update	Inside intrinsic (byte offset)	Explicit (element offset)
Offset calculation	`stride << 2` (bytes)	`stride` (elements)
Clarity	Implicit side effect	Explicit, readable
Safety	Potential alignment issues	Guaranteed correct
Debuggability	Hard to trace	Easy to step through

Summary

Change	Root Cause	Fix Strategy	Impact
Inner loop simplification	Wrong alignment check (`& 1` instead of `& 0x7`)	Always use unaligned intrinsics	Prevents crashes on misaligned buffers
Strided load fix	`AE_L32_XP` intrinsic issues	Use `AE_L32_X` + manual pointer increment	Prevents potential undefined behavior

Key Takeaway

Both fixes follow the same principle: Instead of trying to detect alignment and use different code paths, use intrinsics that handle unaligned data correctly in all cases. This approach is:

✅ Safer - No risk of incorrect alignment detection
✅ Simpler - Less code, fewer branches
✅ Minimal performance impact - Modern HiFi DSPs handle unaligned access efficiently
✅ More maintainable - Easier to understand and debug

Additional Changes in Operator Files

`op_permute_copy.cpp`

// Changed: Use input tensor's scalar type, not output
const auto in_type = in.scalar_type();  // Was: out.scalar_type()

// Changed: Reduced max dimensions from 16 to 5
constexpr int kNnlibMaxDim = 5;  // Was: 16

// Changed: Separate loops for clarity
for (int i = 0; i < num_inp_dims; i++) {
    p_inp_shape[i] = in.size(i);
}
for (int i = 0; i < num_out_dims; i++) {
    p_out_shape[i] = out.size(i);
}
for (int i = 0; i < num_inp_dims; i++) {
    p_permute_vec[i] = dims[i];
}

`op_transpose_copy.cpp`

// Changed: Use input tensor's scalar type
const auto in_type = in.scalar_type();  // Was: out.scalar_type()

// Changed: Use consistent loop variable
for (int i = 0; i < num_inp_dims; i++) {  // Was: in.dim()
    p_inp_shape[i] = in.size(i);
    p_out_shape[i] = out.size(i);
}

These changes ensure consistency between the input type checking and the actual data being processed.

Testing Recommendations

Test with unaligned buffers: Create test cases where input/output buffers are not 8-byte aligned
Test various tensor shapes: Include odd dimensions that result in unaligned memory access patterns
Test on actual HiFi hardware: Alignment issues may not manifest in simulation
Stress test with random data: Verify correctness across many iterations

References

Cadence HiFi DSP Programmer's Reference Manual
Xtensa Intrinsics Guide for HiFi4/HiFi5
ExecutorTorch Cadence Backend Documentation

Differential Revision: D92083443

Original patch: cad-audio@1cb9ee7#diff-72cb6b6b59f6d8be4021885fa14490cd182776f883b6d2132678968b9e7cb267 # Fixing Crash Issue in `xa_nn_transpose_32` for Unaligned Buffers ## Overview This document explains the code changes made to `xa_nn_transpose_32.c` to fix a **memory alignment crash issue** in the transpose operation for Cadence HiFi DSPs. The patch addresses two critical bugs that caused crashes when processing unaligned input/output buffers. ## Files Modified | File | Description | |------|-------------| | `backends/cadence/hifi/operators/op_permute_copy.cpp` | Permute copy operator | | `backends/cadence/hifi/operators/op_transpose_copy.cpp` | Transpose copy operator | | `backends/cadence/hifi/third-party/nnlib/xa_nn_transpose_32.c` | Core transpose kernel | --- ## Change 1: Simplified Inner Loop (Lines 170-191) ### The Problem The original code had a **flawed alignment check**: ```c // OLD CODE - BUGGY if((((unsigned)p_inp4 & 1) == 0) && (((unsigned)p_out & 1) == 0)) { // "Aligned" path using AE_L32X2_IP / AE_S32X2_IP ae_int32x2 *__restrict__ pae_i = (ae_int32x2 *)(p_inp4); ae_int32x2 *__restrict__ pae_o = (ae_int32x2 *)(p_out); ae_int32x2 d0; for(itr4 = 0; itr4 < (out_dim4 >> 1); itr4++) { AE_L32X2_IP(d0, pae_i, 2 * sizeof(WORD32)); // Aligned load AE_S32X2_IP(d0, pae_o, 2 * sizeof(WORD32)); // Aligned store } // ... handle remainder } else { // Unaligned path using AE_LA32X2_IP / AE_SA32X2_IP // ... } ``` ### Root Cause Analysis The alignment check `& 1` only verifies **2-byte alignment** (checking if the least significant bit is 0). However: | Data Type | Size | Required Alignment | |-----------|------|-------------------| | `WORD32` | 4 bytes | 4-byte alignment | | `ae_int32x2` | 8 bytes (64-bit SIMD) | **8-byte alignment** | **The bug:** The check should be: - `(ptr & 0x7) == 0` for 8-byte alignment, OR - `(ptr & 0x3) == 0` for 4-byte alignment (minimum) The incorrect `& 1` check meant that when pointers were 2-byte aligned but **NOT** 8-byte aligned, the code incorrectly took the "aligned" path. ### Consequences of the Bug When `AE_L32X2_IP` / `AE_S32X2_IP` (aligned intrinsics) are used on misaligned addresses: 1. **Hardware Exception**: Cadence HiFi DSPs may generate an alignment exception 2. **Crash**: The program terminates unexpectedly 3. **Data Corruption**: In some configurations, incorrect data may be loaded/stored silently ### The Fix ```c // NEW CODE - SAFE (always uses unaligned intrinsics) WORD32 *p_inp4 = p_inp3+(itr3*inp_stride[p_5D_permute_vec[3]]); ae_int32x2 *__restrict__ pae_i = (ae_int32x2 *)(p_inp4); ae_int32x2 *__restrict__ pae_o = (ae_int32x2 *)(p_out); ae_valign a_inp = AE_LA64_PP(pae_i); // Prime alignment register for input ae_valign a_out = AE_ZALIGN64(); // Initialize alignment register for output ae_int32x2 d0; for(itr4 = 0; itr4 < (out_dim4 >> 1); itr4++) { AE_LA32X2_IP(d0, a_inp, pae_i); // Unaligned load (safe for any alignment) AE_SA32X2_IP(d0, a_out, pae_o); // Unaligned store (safe for any alignment) } AE_SA64POS_FP(a_out, pae_o); // Flush remaining data in alignment register ae_int32 *__restrict__ puae_i = (ae_int32 *)(pae_i); ae_int32 *__restrict__ puae_o = (ae_int32 *)(pae_o); #pragma loop_count max=3 for(itr4 = 0; itr4 < (out_dim4 & 1); itr4++) { puae_o[itr4] = puae_i[itr4]; // Handle odd element } ``` ### Why This Fix Works 1. **Always uses unaligned intrinsics** (`AE_LA32X2_IP` / `AE_SA32X2_IP`) - These intrinsics work correctly for **both** aligned and unaligned data - They use internal alignment registers to handle any pointer alignment 2. **Eliminates the faulty branch entirely** - No alignment check needed - No risk of taking the wrong code path 3. **Minimal performance impact** - On modern HiFi DSPs, unaligned intrinsics on aligned data have negligible overhead - The hardware efficiently handles the alignment internally ### Intrinsic Comparison | Intrinsic | Type | Alignment Requirement | Use Case | |-----------|------|----------------------|----------| | `AE_L32X2_IP` | Aligned load | **Requires 8-byte aligned pointer** | Only when alignment is guaranteed | | `AE_LA32X2_IP` | Unaligned load | Works with any alignment | Safe for all cases | | `AE_S32X2_IP` | Aligned store | **Requires 8-byte aligned pointer** | Only when alignment is guaranteed | | `AE_SA32X2_IP` | Unaligned store | Works with any alignment | Safe for all cases | | `AE_SA64POS_FP` | Flush | N/A | Must be called after unaligned stores | --- ## Change 2: Fixed Strided Load (Lines 216-222) ### The Problem ```c // OLD CODE - PROBLEMATIC AE_L32_XP(d0, (ae_int32 *)p_inp4, inp_stride[p_5D_permute_vec[4]] << 2); AE_L32_XP(d1, (ae_int32 *)p_inp4, inp_stride[p_5D_permute_vec[4]] << 2); ``` ### Root Cause Analysis The `AE_L32_XP` intrinsic: - Loads a 32-bit value from the pointer - **Modifies the pointer** by adding the byte offset (second parameter) - Has specific alignment and type requirements **Issues with the old code:** 1. **Byte offset calculation**: `inp_stride[...] << 2` converts stride to bytes, but the pointer increment behavior inside the macro can be problematic with unaligned addresses 2. **Intrinsic alignment requirements**: `AE_L32_XP` may have stricter alignment requirements than expected on some HiFi variants 3. **Pointer type mismatch**: The cast `(ae_int32 *)p_inp4` combined with the post-increment behavior can cause undefined behavior ### The Fix ```c // NEW CODE - EXPLICIT AND SAFE d0 = AE_L32_X((ae_int32 *)p_inp4, 0); // Load from p_inp4 + 0 bytes p_inp4 += inp_stride[p_5D_permute_vec[4]]; // Manual pointer increment (elements) d1 = AE_L32_X((ae_int32 *)p_inp4, 0); // Load from updated p_inp4 p_inp4 += inp_stride[p_5D_permute_vec[4]]; // Manual pointer increment (elements) ``` ### Why This Fix Works 1. **`AE_L32_X(ptr, offset)`** - Loads from `ptr + offset` **without modifying** the pointer - More predictable behavior - No side effects on the pointer variable 2. **Explicit pointer arithmetic** - `p_inp4 += stride` increments by **elements** (not bytes) - Correct for `WORD32*` pointer type - The `<< 2` byte conversion is no longer needed 3. **Clearer semantics** - Load and pointer update are separate operations - Easier to debug and maintain - No reliance on macro-specific behavior ### Comparison | Aspect | Old (`AE_L32_XP`) | New (`AE_L32_X` + manual increment) | |--------|-------------------|-------------------------------------| | Pointer update | Inside intrinsic (byte offset) | Explicit (element offset) | | Offset calculation | `stride << 2` (bytes) | `stride` (elements) | | Clarity | Implicit side effect | Explicit, readable | | Safety | Potential alignment issues | Guaranteed correct | | Debuggability | Hard to trace | Easy to step through | --- ## Summary | Change | Root Cause | Fix Strategy | Impact | |--------|------------|--------------|--------| | Inner loop simplification | Wrong alignment check (`& 1` instead of `& 0x7`) | Always use unaligned intrinsics | Prevents crashes on misaligned buffers | | Strided load fix | `AE_L32_XP` intrinsic issues | Use `AE_L32_X` + manual pointer increment | Prevents potential undefined behavior | ### Key Takeaway **Both fixes follow the same principle:** Instead of trying to detect alignment and use different code paths, use intrinsics that handle unaligned data correctly in all cases. This approach is: - ✅ **Safer** - No risk of incorrect alignment detection - ✅ **Simpler** - Less code, fewer branches - ✅ **Minimal performance impact** - Modern HiFi DSPs handle unaligned access efficiently - ✅ **More maintainable** - Easier to understand and debug --- ## Additional Changes in Operator Files ### `op_permute_copy.cpp` ```cpp // Changed: Use input tensor's scalar type, not output const auto in_type = in.scalar_type(); // Was: out.scalar_type() // Changed: Reduced max dimensions from 16 to 5 constexpr int kNnlibMaxDim = 5; // Was: 16 // Changed: Separate loops for clarity for (int i = 0; i < num_inp_dims; i++) { p_inp_shape[i] = in.size(i); } for (int i = 0; i < num_out_dims; i++) { p_out_shape[i] = out.size(i); } for (int i = 0; i < num_inp_dims; i++) { p_permute_vec[i] = dims[i]; } ``` ### `op_transpose_copy.cpp` ```cpp // Changed: Use input tensor's scalar type const auto in_type = in.scalar_type(); // Was: out.scalar_type() // Changed: Use consistent loop variable for (int i = 0; i < num_inp_dims; i++) { // Was: in.dim() p_inp_shape[i] = in.size(i); p_out_shape[i] = out.size(i); } ``` These changes ensure consistency between the input type checking and the actual data being processed. --- ## Testing Recommendations 1. **Test with unaligned buffers**: Create test cases where input/output buffers are not 8-byte aligned 2. **Test various tensor shapes**: Include odd dimensions that result in unaligned memory access patterns 3. **Test on actual HiFi hardware**: Alignment issues may not manifest in simulation 4. **Stress test with random data**: Verify correctness across many iterations --- ## References - Cadence HiFi DSP Programmer's Reference Manual - Xtensa Intrinsics Guide for HiFi4/HiFi5 - ExecutorTorch Cadence Backend Documentation Differential Revision: [D92083443](https://our.internmc.facebook.com/intern/diff/D92083443/) [ghstack-poisoned]

pytorch-bot · 2026-02-03T22:58:36Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/17177

📄 Preview Python docs built from this PR

Note: Links to docs will display an error until the docs builds have been completed.

❌ 3 New Failures

As of commit 178f98e with merge base 593775b ():

NEW FAILURES - The following jobs have failed:

Lint / android-java-format / linux-job (gh)
pull / android / build-llm-demo / linux-job (gh)
RuntimeError: Command docker exec -t d0cfd943a732dcd71c4e65b9c422069e0dfc349f9086323507a61d2207feff81 /exec failed with exit code 1
pull / unittest-arm-backend-with-no-deps (test_pytest_models_tosa) / linux-job (gh)
RuntimeError: Command docker exec -t b8a4b7be09325a3cf1dcf29c9d582477022c64a8d8269f4d3bcb4c2fa648cf4f /exec failed with exit code 1

This comment was automatically generated by Dr. CI and updates every 15 minutes.

github-actions · 2026-02-03T22:59:07Z

This PR needs a `release notes:` label

If your change should be included in the release notes (i.e. would users of this library care about this change?), please use a label starting with release notes:. This helps us keep track and include your important work in the next release notes.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

meta-cla bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Feb 3, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[ET][Cadence] patch transpose kernel for safe load & store#17177

[ET][Cadence] patch transpose kernel for safe load & store#17177
zonglinpeng wants to merge 1 commit intogh/zonglinpeng/17/basefrom
gh/zonglinpeng/17/head

zonglinpeng commented Feb 3, 2026

Uh oh!

pytorch-bot bot commented Feb 3, 2026 •

edited

Loading

Uh oh!

github-actions bot commented Feb 3, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

zonglinpeng commented Feb 3, 2026

Fixing Crash Issue in xa_nn_transpose_32 for Unaligned Buffers

Overview

Files Modified

Change 1: Simplified Inner Loop (Lines 170-191)

The Problem

Root Cause Analysis

Consequences of the Bug

The Fix

Why This Fix Works

Intrinsic Comparison

Change 2: Fixed Strided Load (Lines 216-222)

The Problem

Root Cause Analysis

The Fix

Why This Fix Works

Comparison

Summary

Key Takeaway

Additional Changes in Operator Files

op_permute_copy.cpp

op_transpose_copy.cpp

Testing Recommendations

References

Uh oh!

pytorch-bot bot commented Feb 3, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/17177

❌ 3 New Failures

Uh oh!

github-actions bot commented Feb 3, 2026

This PR needs a release notes: label

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Fixing Crash Issue in `xa_nn_transpose_32` for Unaligned Buffers

`op_permute_copy.cpp`

`op_transpose_copy.cpp`

pytorch-bot bot commented Feb 3, 2026 •

edited

Loading

This PR needs a `release notes:` label