Enable lowering from taskflow.task to neura.kernel & Mapping #247

ShangkunLi · 2026-01-24T12:03:49Z

Sorry for such a large PR.

Counter Classification

We classify counters into three types:

root: no parent, has child(ren)
relay: has parent, has child(ren)
leaf: has parent, no child (a task's only counter is also classified as leaf, as in the single-counter example below)

Every counter op must be mapped onto the tile array, but only the leaf counter has self-increment logic in its FU. The other two types only have a register that stores the counter value; that value is updated by the off-array affine controller.
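
The classification rule above can be sketched in a few lines. This is only an illustration of the rule, not the pass added in this PR: the `Counter` class and `classify_counters` function are hypothetical names, and treating a parentless, childless counter as a leaf is inferred from the single-counter IR example.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Counter:
    """Hypothetical stand-in for a taskflow.counter op."""
    counter_id: int
    parent: Optional[int] = None  # counter_id of the parent, if any


def classify_counters(counters: List[Counter]) -> Dict[int, str]:
    """Returns a map from counter_id to "root", "relay", or "leaf"."""
    # Every counter_id that appears as some counter's parent has a child.
    has_child = {c.parent for c in counters if c.parent is not None}
    kinds: Dict[int, str] = {}
    for c in counters:
        if c.counter_id not in has_child:
            # No children: this counter self-increments in the FU.
            # Assumption: a task's only counter (no parent) is also a leaf.
            kinds[c.counter_id] = "leaf"
        elif c.parent is None:
            kinds[c.counter_id] = "root"
        else:
            kinds[c.counter_id] = "relay"
    return kinds


chain = [Counter(0), Counter(1, parent=0), Counter(2, parent=1)]
print(classify_counters(chain))         # {0: 'root', 1: 'relay', 2: 'leaf'}
print(classify_counters([Counter(0)]))  # {0: 'leaf'}
```

Only the counters reported as "leaf" would need the self-increment FU; the others would be plain registers written by the affine controller.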

Task Classification

We classify tasks into two categories:

task with taskflow.counter:

module {
  func.func @_Z6kernelPiS_S_(%arg0: memref<?xi32>, %arg1: memref<?xi32>, %arg2: memref<?xi32>) -> i32 attributes {llvm.linkage = #llvm.linkage<external>} {
    %c0_i32 = arith.constant 0 : i32
    %value_outputs = "taskflow.task"(%arg0, %arg2, %c0_i32) <{operandSegmentSizes = array<i32: 2, 1>, resultSegmentSizes = array<i32: 0, 1>, task_name = "Task_0"}> ({
    ^bb0(%arg3: memref<?xi32>, %arg4: memref<?xi32>, %arg5: i32):
      %0 = taskflow.counter attributes {counter_id = 0 : i32, counter_type = "leaf", lower_bound = 0 : index, step = 1 : index, upper_bound = 32 : index} : index
      %1 = "taskflow.hyperblock"(%0, %arg5) <{operandSegmentSizes = array<i32: 1, 1>}> ({
      ^bb0(%arg6: index, %arg7: i32):
        %2 = memref.load %arg3[%arg6] : memref<?xi32>
        %3 = memref.load %arg4[%arg6] : memref<?xi32>
        %4 = arith.muli %2, %3 : i32
        %5 = arith.addi %arg7, %4 : i32
        taskflow.hyperblock.yield iter_args_next(%5 : i32) results(%5 : i32)
      }) : (index, i32) -> i32
      "taskflow.yield"(%1) <{operandSegmentSizes = array<i32: 0, 1>}> : (i32) -> ()
    }) : (memref<?xi32>, memref<?xi32>, i32) -> i32
    return %value_outputs : i32
  }
}

This kind of task is driven by its counters and is terminated by the root counter (or by the leaf counter when it is the only counter).

This kind of task can be further classified into two categories:
a. hyperblock with yield results: We introduce an extract_predicate op that extracts the predicate bit from the root counter, and apply grant_predicate to the return value
b. hyperblock without yield results: The hyperblock execution terminates when the root counter sends a signal to the controller

task without taskflow.counter:

module {
  func.func @_Z6kernelPiS_S_(%arg0: memref<?xi32>, %arg1: memref<?xi32>, %arg2: memref<?xi32>) -> i32 attributes {llvm.linkage = #llvm.linkage<external>} {
    %c0_i32 = arith.constant 0 : i32
    %value_outputs = "taskflow.task"(%arg0, %arg2, %c0_i32) <{operandSegmentSizes = array<i32: 2, 1>, resultSegmentSizes = array<i32: 0, 1>, task_name = "Task_0"}> ({
    ^bb0(%arg3: memref<?xi32>, %arg4: memref<?xi32>, %arg5: i32):
      %0 = neura.kernel inputs(%arg3, %arg4, %arg5 : memref<?xi32>, memref<?xi32>, i32) {
      ^bb0(%arg6: memref<?xi32>, %arg7: memref<?xi32>, %arg8: i32):
        %c0 = arith.constant 0 : index
        %1 = builtin.unrealized_conversion_cast %c0 : index to i64
        %c32 = arith.constant 32 : index
        %c1 = arith.constant 1 : index
        llvm.br ^bb1(%1, %arg8 : i64, i32)
      ^bb1(%2: i64, %3: i32):  // 2 preds: ^bb0, ^bb2
        %4 = builtin.unrealized_conversion_cast %2 : i64 to index
        %5 = arith.cmpi slt, %4, %c32 : index
        llvm.cond_br %5, ^bb2, ^bb3
      ^bb2:  // pred: ^bb1
        %6 = memref.load %arg6[%4] : memref<?xi32>
        %7 = memref.load %arg7[%4] : memref<?xi32>
        %8 = arith.muli %6, %7 : i32
        %9 = arith.addi %3, %8 : i32
        %10 = arith.addi %4, %c1 : index
        %11 = builtin.unrealized_conversion_cast %10 : index to i64
        llvm.br ^bb1(%11, %9 : i64, i32)
      ^bb3:  // pred: ^bb1
        neura.yield results(%3 : i32)
      } : i32
      "taskflow.yield"(%0) <{operandSegmentSizes = array<i32: 0, 1>}> : (i32) -> ()
    }) : (memref<?xi32>, memref<?xi32>, i32) -> i32
    return %value_outputs : i32
  }
}

This kind of task is self-driven, so we handle it with the existing approach, similar to how a func::FuncOp is handled.

Taskflow to Neura Conversion

We redefine neura.kernel with the IsolatedFromAbove trait.
We implement convert-taskflow-to-neura, which converts each taskflow.hyperblock into a neura.kernel.
If the source taskflow.task has taskflow.counter ops outside the hyperblock, we embed them into the entry block of the neura.kernel as neura.counter ops.

`taskflow.task` Mapping

Each taskflow.task is converted into a task that contains a single neura.kernel.
The neura.kernel is then mapped onto the tile array.

tancheng · 2026-01-24T21:29:24Z

Would a task be driven by multiple counters? What does the IR look like when root, relay, and leaf counters co-exist?

lib/NeuraDialect/Transforms/TransformCtrlToDataFlowPass.cpp

ShangkunLi · 2026-01-26T03:07:55Z

Hi @tancheng, we ensure that each task is canonicalized before it is mapped onto a CGRA. As described in the figure, each canonicalized task contains only one root counter, so a task cannot be driven by multiple counters. If we want to fuse two independent loops into one task, we can create a new root counter with {lower_bound = 0, upper_bound = 1, step = 1} and use it to drive each loop's original root counter. This lets us fuse the two DFGs and map the fused DFG spatio-temporally onto the tile array. Do you think this is okay?

An IR in which root, relay, and leaf counters co-exist looks like:

%memory_outputs_0 = "taskflow.task"(%arg1, %arg2, %arg6, %arg1, %arg2, %arg6) <{operandSegmentSizes = array<i32: 3, 3>, resultSegmentSizes = array<i32: 1, 0>, task_name = "Task_1"}> ({
    ^bb0(%arg10: memref<?x8x5xi32>, %arg11: memref<?x8x5xi32>, %arg12: memref<?xi32>, %arg13: memref<?x8x5xi32>, %arg14: memref<?x8x5xi32>, %arg15: memref<?xi32>):
      %1 = taskflow.counter attributes {counter_id = 0 : i32, counter_type = "root", lower_bound = 0 : index, step = 1 : index, upper_bound = 4 : index} : index
      %2 = taskflow.counter parent(%1 : index) attributes {counter_id = 1 : i32, counter_type = "relay", lower_bound = 0 : index, step = 1 : index, upper_bound = 8 : index} : index
      %3 = taskflow.counter parent(%2 : index) attributes {counter_id = 2 : i32, counter_type = "leaf", lower_bound = 0 : index, step = 1 : index, upper_bound = 5 : index} : index
      "taskflow.hyperblock"(%1, %2, %3) <{operandSegmentSizes = array<i32: 3, 0>}> ({
      ^bb0(%arg16: index, %arg17: index, %arg18: index):
        %4 = memref.load %arg13[%arg16, %arg17, %arg18] : memref<?x8x5xi32>
        %5 = memref.load %arg14[%arg16, %arg17, %arg18] : memref<?x8x5xi32>
        %6 = arith.addi %4, %5 : i32
        memref.store %6, %arg15[%arg18] : memref<?xi32>
        taskflow.hyperblock.yield
      }) : (index, index, index) -> ()
      "taskflow.yield"(%arg15) <{operandSegmentSizes = array<i32: 1, 0>}> : (memref<?xi32>) -> ()
    }) : (memref<?x8x5xi32>, memref<?x8x5xi32>, memref<?xi32>, memref<?x8x5xi32>, memref<?x8x5xi32>, memref<?xi32>) -> memref<?xi32>
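
For readers less familiar with the counter syntax, the IR above can be read as a three-level loop nest. The Python below is only a reference model of that reading, assuming the root -> relay -> leaf chain has ordinary nested-loop semantics; `run_task_1` is a hypothetical name, and the default bounds are the upper_bound values of the three counters.

```python
from typing import List

Tensor3D = List[List[List[int]]]


def run_task_1(a: Tensor3D, b: Tensor3D, out: List[int],
               n_root: int = 4, n_relay: int = 8, n_leaf: int = 5) -> List[int]:
    """Reference model of Task_1: one loop level per counter."""
    for i in range(n_root):            # root counter  (%1)
        for j in range(n_relay):       # relay counter (%2)
            for k in range(n_leaf):    # leaf counter  (%3)
                # Hyperblock body: two loads, one add, one store.
                # The store is indexed by the leaf counter only,
                # so later (i, j) iterations overwrite earlier ones.
                out[k] = a[i][j][k] + b[i][j][k]
    return out


a = [[[100 * i + 10 * j + k for k in range(5)] for j in range(8)] for i in range(4)]
b = [[[1 for _ in range(5)] for _ in range(8)] for _ in range(4)]
print(run_task_1(a, b, [0] * 5))
```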

tancheng · 2026-01-26T05:42:01Z

The relay and leaf look like offsets? Is the IR related to task C in your example? (Let's say if its upper_bound = 5)

tancheng · 2026-01-26T05:42:50Z

Do we generate taskflow.channel between affine/tasks if they are producer-consumer?

ShangkunLi · 2026-01-26T05:44:36Z

This part belongs to mapping tasks onto multiple CGRAs, which will be handled next.

ShangkunLi · 2026-01-26T06:07:59Z

The relay and leaf look like offsets? Is the IR related to task C in your example? (Let's say if its upper_bound = 5)

Yes, @tancheng, if the upper bound is 5, then the IR corresponds to task C (MCT 1) in the figure.

As for the hardware part, I plan to introduce two loop counter ops for the distributed counter unit:

OPT_LEAF_ROOT_COUNT: used when there is only one leaf counter, which then also acts as the root counter. When it reaches the upper bound, it sends CMD_COMPLETE to the controller.
OPT_LEAF_COUNT: used when there is a root -> (relay) -> leaf counter chain. The counter sends a signal to the outer affine controller so that the outer loop values can be updated.

As for the outer loops' values, I plan to make them constant ops that are updated by the outer affine controllers.

But the problem is how to stop the dataflow execution when we want to update the outer constants.
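
To make the difference between the two planned ops concrete, here is a small behavioural sketch. It is not hardware code: `select_counter_opcode` and `run_leaf` are hypothetical helpers, and `NOTIFY_AFFINE_CONTROLLER` is a placeholder name for the still-unnamed signal sent to the outer affine controller.

```python
from typing import List, Tuple


def select_counter_opcode(has_parent: bool) -> str:
    """Picks the counter op for a leaf counter.

    A leaf without a parent is the task's only counter and doubles as the
    root; a leaf with a parent sits at the end of a root -> (relay) -> leaf chain.
    """
    return "OPT_LEAF_COUNT" if has_parent else "OPT_LEAF_ROOT_COUNT"


def run_leaf(lower: int, upper: int, step: int,
             has_parent: bool) -> Tuple[List[int], str]:
    """Returns the values the leaf produces and the signal sent at the end."""
    opcode = select_counter_opcode(has_parent)
    values = list(range(lower, upper, step))  # self-increment in the FU
    if opcode == "OPT_LEAF_ROOT_COUNT":
        signal = "CMD_COMPLETE"  # the whole task is done
    else:
        # Placeholder name: asks the outer affine controller to update
        # the outer loop values and retrigger this leaf.
        signal = "NOTIFY_AFFINE_CONTROLLER"
    return values, signal


print(run_leaf(0, 32, 1, has_parent=False)[1])  # CMD_COMPLETE
print(run_leaf(0, 5, 1, has_parent=True)[1])    # NOTIFY_AFFINE_CONTROLLER
```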

tancheng · 2026-01-26T23:01:28Z

FU can send back cmd to controller/CPU: https://github.com/tancheng/VectorCGRA/blob/210756acc861a75ba5cb5742bcc7cc204adc9999/fu/single/RetRTL.py#L85

ShangkunLi · 2026-01-27T03:36:41Z

My concern is that when a leaf counter reaches its upper bound, it sends a cmd to the controller. But when should the controller start configuring the outer loop values and retrigger the leaf counter? We don't know when the valid predicates will have been fully consumed.

tancheng · 2026-01-27T03:52:06Z

Similar to return, can the controller wait for a few cycles (e.g., 10 cycles) before configuring the next iteration?

ShangkunLi · 2026-01-27T04:13:38Z

But this means that every time we start the innermost loop, we pay an extra 10 cycles. Moreover, 10 cycles may not be enough if we combine two CGRAs into one big CGRA, so this approach is not scalable. I will talk to gemini and try to find other solutions.
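
A rough estimate of that cost for the root/relay/leaf example above, assuming the fixed wait is paid once per start of the leaf counter and the leaf is started once per combination of outer loop values. `leaf_restart_overhead` is a hypothetical helper, and the 10-cycle figure is just the value floated in this thread.

```python
import math
from typing import Sequence, Tuple


def leaf_restart_overhead(outer_trip_counts: Sequence[int],
                          wait_cycles: int) -> Tuple[int, int]:
    """Returns (number of leaf starts, total extra wait cycles)."""
    # The leaf counter is started once per combination of outer loop values.
    starts = math.prod(outer_trip_counts)
    return starts, starts * wait_cycles


# Outer loops from the example IR: root (4 iterations) and relay (8 iterations).
starts, extra = leaf_restart_overhead([4, 8], wait_cycles=10)
leaf_iterations = starts * 5  # the leaf counter runs 5 iterations per start
print(starts, extra, leaf_iterations)  # 32 320 160
```

With these numbers the wait cycles (320) exceed the total number of leaf iterations (160), which is the overhead concern raised above.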

ShangkunLi added 15 commits January 22, 2026 13:31

add counter classification pass

7426218

change the definition of taskflow.hyperblock.yield

7d5da15

change the definition of neura.kernel

8390cb9

enable taskflow to neura conversion

de017d7

assign accelerator for neura.kernel

79c9480

enable neura.kernel lowering in conversion passes

66256b2

enable promote func/kernel arguments to constant

fc675a5

enable canonicalize-return for neura.kernel

fc9b751

enable leverage-predicated-values for neura.kernel

d67c9e9

enable kernel with counters dataflow lowering

2ddefaf

enable kernel without counters dataflow lowering

07f46eb

enable kenrel mapping

774745c

enable kernel mapping

b47a188

distinguish iter_arg_init in fold-constant pass

72259dd

add tests for e2e taskflow2neura test

c915995

ShangkunLi requested review from guosran and tancheng January 24, 2026 12:04

guosran reviewed Jan 24, 2026

lib/NeuraDialect/Transforms/TransformCtrlToDataFlowPass.cpp

ShangkunLi and others added 4 commits January 26, 2026 11:32

Merge branch 'main' into taskflow2neura

352bf5f

change the definition of taskflow.hyperblock.yield

4fa5855

[clean] remove redundant code

1966849

[clean] remove redudant files

0072e82

Merge branch 'main' into taskflow2neura

88c9454

sync with main

3e523f0
