[WIP] Remove redundant casts in LLVMIR by justinrosner · Pull Request #2202 · ROCm/rocMLIR

justinrosner · 2026-01-08T17:38:17Z

Motivation

When processing mixed-precision computations (e.g., attention kernels with f32 intermediate values stored as f16), the generated IR often contains redundant precision conversion patterns:

%wide = ...                           ; f32 computation result
%narrow = llvm.fptrunc %wide : f32 to f16
llvm.store %narrow, %narrow_buf       ; store truncated value
...
%loaded = llvm.load %narrow_buf       ; load truncated value  
%extended = llvm.fpext %loaded : f16 to f32  ; extend back to f32

This pattern causes unnecessary precision loss compared to just keeping the original wide value. This pass eliminates these redundant casts by redirecting loads to read from a parallel wide buffer when possible.

This implements: https://github.com/ROCm/rocMLIR-internal/issues/1932

Technical Details

This PR introduces the RemoveRedundantCasts pass that operates at the LLVMIR dialect level to optimize fptrunc -> store -> load -> fpext patterns.

General Algorithm:

1. Find all fptrunc -> store patterns in the function. For each pattern, record whether there's already a parallel store of the wide value to a separate buffer.
2. Find all load -> fpext patterns where the load is from a buffer that has fptrunc stores.
3. Verify safety for each load+fpext pattern:
   - All stores to the narrow buffer must be from tracked fptrunc patterns (i.e., no untracked stores that could write different values)
   - All tracked stores must dominate the load
   - The narrow buffer must be an alloca
4. For safe patterns, create a wide buffer and the corresponding stores if they don't exist. If a parallel store already exists, reuse it:
   - Create a wide alloca right after the narrow alloca
   - For each fptrunc store, insert a store of the wide value to the wide buffer (right after the narrow store, using the same indices)
5. Apply the transformation:
   - Redirect the load to read from the wide buffer instead
   - Replace uses of the fpext result with the wide load result
   - Delete the fpext (and the old load/GEP if unused)
6. Clean up unused narrow buffer operations:
   - If the narrow buffer has no remaining uses, erase the fptrunc stores
     - These can only be erased if they are not used by any other operations
   - Erase the narrow alloca if it has no remaining uses

Test Plan

Run the initial example in https://github.com/ROCm/rocMLIR-internal/issues/1932 and make sure that it is no longer generating the extra truncs/exts
Nightly CI

Test Result

Nightly CI

Submission Checklist

Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

dhernandez0 · 2026-01-09T09:21:22Z

mlir/test/Dialect/LLVMIR/remove-redundant-casts.mlir

+  %7 = llvm.fptrunc %1 : vector<4xf32> to vector<4xf16>
+  llvm.store %7, %6 : vector<4xf16>, !llvm.ptr<5>
+  %8 = llvm.getelementptr %2[12] : (!llvm.ptr<5>) -> !llvm.ptr<5>, f16
+  %9 = llvm.fptrunc %1 : vector<4xf32> to vector<4xf16>


Why are there so many repeated llvm.fptrunc %1 ops? I don't understand this.
Wouldn't it be easier to do this earlier, for example by handling arith.extf, etc.?

The fptruncs seem to be the result of loop unrolling; they all write to the same buffer. I was doing this earlier in the pipeline, and moving this pass to right after GridwiseGemmToBlockwise runs into quite a few more difficulties. The dominance analysis (used for safety) becomes tricky because it doesn't work well when the trunc/ext ops are in different regions, and having to rewrite the linalg generic makes things harder as well.

Both approaches have their pros and cons. We can discuss this more in the team meeting, or elsewhere offline.

dhernandez0 · 2026-01-09T09:24:20Z

mlir/lib/Dialect/Rock/Transforms/RemoveRedundantCasts.cpp

+//      - If the narrow buffer has no remaining uses, erase the fptrunc stores
+//        - These can only be erased if they are not used by any other
+//          operations
+//      - Erase the narrow alloca if it has no remaining uses


do we have tests for these two cases?

dhernandez0 · 2026-01-09T09:29:00Z

mlir/lib/Dialect/Rock/Transforms/RemoveRedundantCasts.cpp

+
+      // Look for existing parallel wide store
+      for (Operation *wideUser : wideValue.getUsers()) {
+        auto wideStore = dyn_cast<StoreOp>(wideUser);


should we check it's the source instead of destination?

dhernandez0 · 2026-01-09T09:29:27Z

mlir/lib/Dialect/Rock/Transforms/RemoveRedundantCasts.cpp

+      info.wideStore = nullptr;
+
+      // Look for existing parallel wide store
+      for (Operation *wideUser : wideValue.getUsers()) {


nit: this could be done outside of this loop and store the results in a SmallVector?

dhernandez0 · 2026-01-09T09:29:48Z

mlir/lib/Dialect/Rock/Transforms/RemoveRedundantCasts.cpp

+                                << wideStore << "\n");
+        info.wideBuffer = wideBuffer;
+        info.wideStore = wideStore;
+        break;


why do we store only the first one?

justinrosner added 11 commits January 5, 2026 21:41

Initial truncf finding

1fafdc0

Add in logic so that we are only finding truncfs with direct stores

625ba46

Minor comment and debug message fixes

f48f74a

Add detection for extf ops

6f9114a

Partial verification of store/load chains

a8a7cda

Initial attempt at LLVMIR level transformation

9026632

Add E2E test

9a0cbb2

More LIT tests

1e9671a

Add newline

800c5dc

Clang-format

3ee0bf8

Remove some extra lines

713d577

justinrosner mentioned this pull request Jan 8, 2026

Remove redundant cast chains #1944

Closed

3 tasks

justinrosner added 2 commits January 8, 2026 13:55

Merge branch 'develop' into 1932-remove-casts

6fcaaaa

Conservative checks

107b461

dhernandez0 reviewed Jan 9, 2026

justinrosner wants to merge 13 commits into develop from 1932-remove-casts
