@ShangkunLi
Collaborator

Sorry for such a large PR.

Counter Classification

We classify counters into three types:

  1. root: no parent, has child(ren)
  2. relay: has parent, has child(ren)
  3. leaf: has parent, no child (a task's only counter, with neither parent nor child, is also typed as a leaf)

Every counter op must be mapped onto the tile array, but only a leaf counter has self-increment logic in its FU. Root and relay counters only have a register that stores the counter value, and that value is updated by the off-array affine controller.
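The classification rule can be sketched in a few lines of Python. This is only an illustration of the parent/child rule described above, not code from this PR; the `Counter` class and the helper names are hypothetical.

```python
class Counter:
    """A counter node; `parent` links it into a counter chain (hypothetical model)."""

    def __init__(self, counter_id, parent=None):
        self.counter_id = counter_id
        self.parent = parent
        self.children = []
        if parent is not None:
            parent.children.append(self)


def classify(counter):
    """Return "root", "relay", or "leaf" from the parent/child links."""
    if counter.children:
        return "relay" if counter.parent is not None else "root"
    # No children: a leaf. A lone counter (no parent either) is also a leaf,
    # as in the single-counter task whose counter_type is "leaf".
    return "leaf"


def has_self_increment(counter):
    """Only leaf counters carry self-increment logic in the FU."""
    return classify(counter) == "leaf"


root = Counter(0)
relay = Counter(1, parent=root)
leaf = Counter(2, parent=relay)
print([classify(c) for c in (root, relay, leaf)])
```

Deriving the type from the links, rather than storing it, keeps the type consistent if a chain is rewritten; the IR in this PR stores it explicitly as counter_type.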

Task Classification

We classify tasks into two categories:

  1. task with taskflow.counter:
module {
  func.func @_Z6kernelPiS_S_(%arg0: memref<?xi32>, %arg1: memref<?xi32>, %arg2: memref<?xi32>) -> i32 attributes {llvm.linkage = #llvm.linkage<external>} {
    %c0_i32 = arith.constant 0 : i32
    %value_outputs = "taskflow.task"(%arg0, %arg2, %c0_i32) <{operandSegmentSizes = array<i32: 2, 1>, resultSegmentSizes = array<i32: 0, 1>, task_name = "Task_0"}> ({
    ^bb0(%arg3: memref<?xi32>, %arg4: memref<?xi32>, %arg5: i32):
      %0 = taskflow.counter attributes {counter_id = 0 : i32, counter_type = "leaf", lower_bound = 0 : index, step = 1 : index, upper_bound = 32 : index} : index
      %1 = "taskflow.hyperblock"(%0, %arg5) <{operandSegmentSizes = array<i32: 1, 1>}> ({
      ^bb0(%arg6: index, %arg7: i32):
        %2 = memref.load %arg3[%arg6] : memref<?xi32>
        %3 = memref.load %arg4[%arg6] : memref<?xi32>
        %4 = arith.muli %2, %3 : i32
        %5 = arith.addi %arg7, %4 : i32
        taskflow.hyperblock.yield iter_args_next(%5 : i32) results(%5 : i32)
      }) : (index, i32) -> i32
      "taskflow.yield"(%1) <{operandSegmentSizes = array<i32: 0, 1>}> : (i32) -> ()
    }) : (memref<?xi32>, memref<?xi32>, i32) -> i32
    return %value_outputs : i32
  }
}

This kind of task is driven by its counters and is terminated by the root counter (or by the leaf counter when it is the only counter).

It can be further divided into two cases:
a. hyperblock with yield results: we introduce an extract_predicate op that extracts the predicate bit from the root counter, and use grant_predicate to apply that predicate to the return value
b. hyperblock without yield results: the hyperblock execution terminates when the root counter sends a signal to the controller

  2. task without taskflow.counter:
module {
  func.func @_Z6kernelPiS_S_(%arg0: memref<?xi32>, %arg1: memref<?xi32>, %arg2: memref<?xi32>) -> i32 attributes {llvm.linkage = #llvm.linkage<external>} {
    %c0_i32 = arith.constant 0 : i32
    %value_outputs = "taskflow.task"(%arg0, %arg2, %c0_i32) <{operandSegmentSizes = array<i32: 2, 1>, resultSegmentSizes = array<i32: 0, 1>, task_name = "Task_0"}> ({
    ^bb0(%arg3: memref<?xi32>, %arg4: memref<?xi32>, %arg5: i32):
      %0 = neura.kernel inputs(%arg3, %arg4, %arg5 : memref<?xi32>, memref<?xi32>, i32) {
      ^bb0(%arg6: memref<?xi32>, %arg7: memref<?xi32>, %arg8: i32):
        %c0 = arith.constant 0 : index
        %1 = builtin.unrealized_conversion_cast %c0 : index to i64
        %c32 = arith.constant 32 : index
        %c1 = arith.constant 1 : index
        llvm.br ^bb1(%1, %arg8 : i64, i32)
      ^bb1(%2: i64, %3: i32):  // 2 preds: ^bb0, ^bb2
        %4 = builtin.unrealized_conversion_cast %2 : i64 to index
        %5 = arith.cmpi slt, %4, %c32 : index
        llvm.cond_br %5, ^bb2, ^bb3
      ^bb2:  // pred: ^bb1
        %6 = memref.load %arg6[%4] : memref<?xi32>
        %7 = memref.load %arg7[%4] : memref<?xi32>
        %8 = arith.muli %6, %7 : i32
        %9 = arith.addi %3, %8 : i32
        %10 = arith.addi %4, %c1 : index
        %11 = builtin.unrealized_conversion_cast %10 : index to i64
        llvm.br ^bb1(%11, %9 : i64, i32)
      ^bb3:  // pred: ^bb1
        neura.yield results(%3 : i32)
      } : i32
      "taskflow.yield"(%0) <{operandSegmentSizes = array<i32: 0, 1>}> : (i32) -> ()
    }) : (memref<?xi32>, memref<?xi32>, i32) -> i32
    return %value_outputs : i32
  }
}

This kind of task is self-driven, so we handle it with an existing method, similar to the one used for func::FuncOp.

Taskflow to Neura Conversion

  1. We redefine neura.kernel with the IsolatedFromAbove trait.
  2. We implement convert-taskflow-to-neura to convert each taskflow.hyperblock into a neura.kernel.
  3. If the source taskflow.task has taskflow.counter ops outside the hyperblock, we embed them into the entry block of the neura.kernel as neura.counter ops.

taskflow.task Mapping

  • Each taskflow.task is converted to a task that contains one neura.kernel
  • The neura.kernel is mapped onto the tile array

@tancheng
Contributor

Can a task be driven by multiple counters? What does the IR look like when root, relay, and leaf counters co-exist?

@ShangkunLi
Collaborator Author

[Image: Screenshot 2026-01-26 at 10 48 52]

Hi @tancheng, we ensure that each task is canonicalized before it is mapped onto a CGRA. As shown in the figure, each canonicalized task contains only one root counter, so a task cannot be driven by multiple counters. If we want to fuse two independent loops into one task, we can create a new root counter with {lower_bound = 0, upper_bound = 1, step = 1} that drives the original root counter of each loop. This lets us fuse the two DFGs and map the fused DFG spatio-temporally onto the tile array. Do you think this is okay?
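As a rough illustration of the fusion idea, the new root counter with bounds {0, 1, 1} iterates exactly once and both loops run under it. The Python below is a sequential sketch with a hypothetical helper name; on the tile array the two loops could overlap in time, which this model does not capture.

```python
def run_fused(loop_a, loop_b):
    """Drive two independent loops from one synthetic root counter.

    `loop_a` and `loop_b` are (lower_bound, upper_bound, step) tuples for
    each loop's original root counter. Returns the indices each loop visits.
    """
    visited = {"A": [], "B": []}
    for _ in range(0, 1, 1):  # synthetic root: exactly one iteration
        visited["A"].extend(range(*loop_a))
        visited["B"].extend(range(*loop_b))
    return visited


# Example bounds chosen only for illustration.
fused = run_fused((0, 4, 1), (0, 3, 1))
print(fused)
```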

The IR that has root, relay, and leaf co-existing looks like:

%memory_outputs_0 = "taskflow.task"(%arg1, %arg2, %arg6, %arg1, %arg2, %arg6) <{operandSegmentSizes = array<i32: 3, 3>, resultSegmentSizes = array<i32: 1, 0>, task_name = "Task_1"}> ({
    ^bb0(%arg10: memref<?x8x5xi32>, %arg11: memref<?x8x5xi32>, %arg12: memref<?xi32>, %arg13: memref<?x8x5xi32>, %arg14: memref<?x8x5xi32>, %arg15: memref<?xi32>):
      %1 = taskflow.counter attributes {counter_id = 0 : i32, counter_type = "root", lower_bound = 0 : index, step = 1 : index, upper_bound = 4 : index} : index
      %2 = taskflow.counter parent(%1 : index) attributes {counter_id = 1 : i32, counter_type = "relay", lower_bound = 0 : index, step = 1 : index, upper_bound = 8 : index} : index
      %3 = taskflow.counter parent(%2 : index) attributes {counter_id = 2 : i32, counter_type = "leaf", lower_bound = 0 : index, step = 1 : index, upper_bound = 5 : index} : index
      "taskflow.hyperblock"(%1, %2, %3) <{operandSegmentSizes = array<i32: 3, 0>}> ({
      ^bb0(%arg16: index, %arg17: index, %arg18: index):
        %4 = memref.load %arg13[%arg16, %arg17, %arg18] : memref<?x8x5xi32>
        %5 = memref.load %arg14[%arg16, %arg17, %arg18] : memref<?x8x5xi32>
        %6 = arith.addi %4, %5 : i32
        memref.store %6, %arg15[%arg18] : memref<?xi32>
        taskflow.hyperblock.yield
      }) : (index, index, index) -> ()
      "taskflow.yield"(%arg15) <{operandSegmentSizes = array<i32: 1, 0>}> : (memref<?xi32>) -> ()
    }) : (memref<?x8x5xi32>, memref<?x8x5xi32>, memref<?xi32>, memref<?x8x5xi32>, memref<?x8x5xi32>, memref<?xi32>) -> memref<?xi32>
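For readers less familiar with the counter chain, the following Python sketch enumerates the index tuples that the root, relay, and leaf counters in the IR above would feed to the hyperblock, assuming the chain behaves like an ordinary three-level loop nest. The helper name is hypothetical.

```python
from itertools import product


def counter_chain_indices(bounds):
    """Enumerate index tuples for a counter chain, outermost counter first.

    `bounds` is a list of (lower_bound, upper_bound, step); the last entry
    (the leaf) varies fastest, like the innermost loop of a nest.
    """
    return list(product(*(range(lo, hi, step) for lo, hi, step in bounds)))


# Bounds copied from the root, relay, and leaf counters in the IR above.
indices = counter_chain_indices([(0, 4, 1), (0, 8, 1), (0, 5, 1)])
total_iterations = len(indices)

# The store in the IR is indexed only by the leaf value (%arg18).
store_slots = sorted({leaf for _root, _relay, leaf in indices})
print(total_iterations, store_slots)
```

Under that assumption the hyperblock body runs 4 x 8 x 5 = 160 times, and the leaf counter is restarted once per (root, relay) pair.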

@tancheng
Contributor

Do the relay and leaf counters act like offsets? Is this IR related to task C in your example (say, if its upper_bound = 5)?

@tancheng
Contributor

Do we generate a taskflow.channel between affine loops/tasks if they are producer-consumer?

@ShangkunLi
Collaborator Author

ShangkunLi commented Jan 26, 2026

Generating taskflow.channel belongs to mapping tasks onto multiple CGRAs, which will be handled next.

@ShangkunLi
Collaborator Author

ShangkunLi commented Jan 26, 2026

Yes, @tancheng, if the upper bound is 5, the IR corresponds to task C (MCT 1) in the figure.

On the hardware side, I plan to introduce two loop counter ops for the distributed counter unit:

  • OPT_LEAF_ROOT_COUNT: used when there is only one leaf counter, which then also acts as the root counter. When it reaches its upper bound, it sends CMD_COMPLETE to the controller.
  • OPT_LEAF_COUNT: used when there is a root -> (relay) -> leaf counter chain. The counter signals the outer affine controller to perform the update.
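The difference between the two ops can be sketched as a small behavioral model. This is not RTL and not the planned implementation: the signal name `OUTER_UPDATE`, the destination labels, and the assumption that OPT_LEAF_COUNT signals when it reaches its upper bound are all hypothetical.

```python
CMD_COMPLETE = "CMD_COMPLETE"          # name taken from the comment above
SIGNAL_OUTER_UPDATE = "OUTER_UPDATE"   # hypothetical name for the update signal


def run_leaf_counter(op, lower_bound, upper_bound, step):
    """Count through one leaf pass; return (values, (destination, signal))."""
    values = list(range(lower_bound, upper_bound, step))
    if op == "OPT_LEAF_ROOT_COUNT":
        # Lone leaf that doubles as the root: the whole task is done.
        return values, ("controller", CMD_COMPLETE)
    if op == "OPT_LEAF_COUNT":
        # Leaf in a root -> (relay) -> leaf chain: ask for new outer values.
        return values, ("outer_affine_controller", SIGNAL_OUTER_UPDATE)
    raise ValueError(f"unknown counter op: {op}")


single = run_leaf_counter("OPT_LEAF_ROOT_COUNT", 0, 32, 1)
chained = run_leaf_counter("OPT_LEAF_COUNT", 0, 5, 1)
print(single[1], chained[1])
```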

As for the outer loops' values, I plan to make them constant ops that are updated by the outer affine controllers.

The open problem is how to stop the dataflow execution when we want to update these outer constants.

@tancheng
Contributor

An FU can send a cmd back to the controller/CPU: https://github.com/tancheng/VectorCGRA/blob/210756acc861a75ba5cb5742bcc7cc204adc9999/fu/single/RetRTL.py#L85

@ShangkunLi
Collaborator Author

ShangkunLi commented Jan 27, 2026

My concern is this: when a leaf counter reaches its upper bound, it sends a cmd to the controller. But when should the controller start configuring the outer loop values and retrigger the leaf counter? We don't know when the valid predicates will have been fully consumed.

@tancheng
Contributor

Similar to return, can the controller wait for a few cycles (e.g., 10 cycles) before configuring the next iteration?

@ShangkunLi
Collaborator Author

But this means we pay an extra 10 cycles every time we start the innermost loop. Moreover, 10 cycles may not be enough if we combine two CGRAs into one big CGRA, so this approach is not scalable. I will talk to Gemini and try to find other solutions.
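To put a rough number on that cost, here is a back-of-the-envelope sketch using the root/relay/leaf IR example from earlier in this thread (trip counts 4, 8, and 5) and the suggested 10-cycle wait. It assumes the wait is paid once per leaf launch; the helper names are hypothetical.

```python
from math import prod


def leaf_launches(outer_trip_counts):
    """Number of times the leaf counter is (re)started."""
    return prod(outer_trip_counts)


def wait_overhead_cycles(outer_trip_counts, wait_cycles):
    """Total cycles spent waiting, at `wait_cycles` per leaf launch."""
    return leaf_launches(outer_trip_counts) * wait_cycles


outer = [4, 8]          # root and relay trip counts from the IR example
leaf_trip_count = 5     # leaf trip count from the IR example
launches = leaf_launches(outer)
waiting = wait_overhead_cycles(outer, wait_cycles=10)
leaf_iterations = launches * leaf_trip_count
print(launches, waiting, leaf_iterations)
```

Under these assumptions the leaf is launched 32 times, so the fixed wait adds 320 cycles for only 160 leaf iterations; the shorter the innermost loop, the worse the ratio.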
