Enable lowering from taskflow.task to neura.kernel & Mapping #247
Would a task be driven by multiple counters? What do the IRs look like when there are
Hi @tancheng, we ensure that each task is canonicalized before it is mapped onto a CGRA. As described in the figure, each canonicalized task contains only one root counter, so a task cannot be driven by multiple counters. If we want to fuse two independent loops into one task, we can create a root counter with
The IR that has
%memory_outputs_0 = "taskflow.task"(%arg1, %arg2, %arg6, %arg1, %arg2, %arg6) <{operandSegmentSizes = array<i32: 3, 3>, resultSegmentSizes = array<i32: 1, 0>, task_name = "Task_1"}> ({
^bb0(%arg10: memref<?x8x5xi32>, %arg11: memref<?x8x5xi32>, %arg12: memref<?xi32>, %arg13: memref<?x8x5xi32>, %arg14: memref<?x8x5xi32>, %arg15: memref<?xi32>):
%1 = taskflow.counter attributes {counter_id = 0 : i32, counter_type = "root", lower_bound = 0 : index, step = 1 : index, upper_bound = 4 : index} : index
%2 = taskflow.counter parent(%1 : index) attributes {counter_id = 1 : i32, counter_type = "relay", lower_bound = 0 : index, step = 1 : index, upper_bound = 8 : index} : index
%3 = taskflow.counter parent(%2 : index) attributes {counter_id = 2 : i32, counter_type = "leaf", lower_bound = 0 : index, step = 1 : index, upper_bound = 5 : index} : index
"taskflow.hyperblock"(%1, %2, %3) <{operandSegmentSizes = array<i32: 3, 0>}> ({
^bb0(%arg16: index, %arg17: index, %arg18: index):
%4 = memref.load %arg13[%arg16, %arg17, %arg18] : memref<?x8x5xi32>
%5 = memref.load %arg14[%arg16, %arg17, %arg18] : memref<?x8x5xi32>
%6 = arith.addi %4, %5 : i32
memref.store %6, %arg15[%arg18] : memref<?xi32>
taskflow.hyperblock.yield
}) : (index, index, index) -> ()
"taskflow.yield"(%arg15) <{operandSegmentSizes = array<i32: 1, 0>}> : (memref<?xi32>) -> ()
}) : (memref<?x8x5xi32>, memref<?x8x5xi32>, memref<?xi32>, memref<?x8x5xi32>, memref<?x8x5xi32>, memref<?xi32>) -> memref<?xi32>
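As a rough illustration of what the counter chain in the IR above iterates over, here is a small Python sketch. It assumes that the parent/child chain (root, relay, leaf) behaves like a loop nest with the root outermost, and it models the memrefs as nested Python lists. The function name `run_task` and the test data are hypothetical and not part of the PR.

```python
from itertools import product

# Upper bounds copied from the three taskflow.counter ops in the IR above
# (lower_bound = 0 and step = 1 for all of them).
ROOT_UB, RELAY_UB, LEAF_UB = 4, 8, 5


def run_task(a, b, out):
    """Emulate the hyperblock body: out[k] = a[i][j][k] + b[i][j][k].

    Assumption: root -> relay -> leaf nests like three loops, root outermost.
    Returns the number of hyperblock invocations.
    """
    trips = 0
    for i, j, k in product(range(ROOT_UB), range(RELAY_UB), range(LEAF_UB)):
        # Mirrors the two memref.load ops, arith.addi, and memref.store.
        out[k] = a[i][j][k] + b[i][j][k]
        trips += 1
    return trips


# Hypothetical input data, only used to exercise the sketch.
a = [[[i + j + k for k in range(LEAF_UB)] for j in range(RELAY_UB)]
     for i in range(ROOT_UB)]
b = [[[1] * LEAF_UB for _ in range(RELAY_UB)] for _ in range(ROOT_UB)]
out = [0] * LEAF_UB
trips = run_task(a, b, out)
```

Because the store only indexes with the leaf counter, each `out[k]` is overwritten on every outer iteration and ends up holding the value from the last (root, relay) pair.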
Do the relay and leaf counters look like offsets? Is this IR related to task C in your example (say, if its upper_bound = 5)?
Do we generate taskflow.channel between affine/tasks if they are producer-consumer?
This part belongs to mapping tasks onto multi-CGRA, which will be handled next.
Yes, @tancheng, if the upper bound is 5, then the IR is related to task C (MCT 1) in the figure. As for the hardware part, I plan to introduce 2 loop counter ops for the distributed counter unit:
As for the outer loops' values, I plan to make them constant ops that are updated by the outer affine controllers. The problem is how to stop the dataflow execution when we want to update the outer constants.
The FU can send a cmd back to the controller/CPU: https://github.com/tancheng/VectorCGRA/blob/210756acc861a75ba5cb5742bcc7cc204adc9999/fu/single/RetRTL.py#L85
My concern is that when a leaf counter reaches its upper bound, it sends a cmd to the controller. But when should the controller start to configure the outer loop values and retrigger the leaf counter, given that we don't know when the valid predicates will have been fully consumed?
Similar to
But this means that every time we start the innermost loop execution, we need an extra 10 cycles. Moreover, 10 cycles may not be enough if we want to combine two CGRAs into one big CGRA, so this approach is not scalable. I will talk to Gemini and try to find other solutions.

Sorry for such a large PR.
Counter Classification
We classify counters into three types:
- `root`: no parent, has child(ren)
- `relay`: has parent, has child(ren)
- `leaf`: has parent, no child

We need to map each counter op onto the tile array, but only the `leaf` counter has self-increment logic in the FU. The other two types only have a register that stores the counter value; the values are updated through the off-array affine controller.

Task Classification
We classify tasks into two categories:
1. Tasks with `taskflow.counter`: This kind of task is driven by the counter; it is also terminated by the (`root`) counter (the `leaf` counter when there is only one counter). This kind of task can be further classified into two categories:
   a. hyperblock with yield results: We introduce an `extract_predicate` op to extract the predicate bit from the `root` counter and `grant_predicate` the return value
   b. hyperblock without yield results: The hyperblock execution terminates when the root counter sends a signal to the controller
2. Tasks without `taskflow.counter`: This kind of task is self-driven, so we utilize an existing method similar to `func::FuncOp` to handle this task.

Taskflow to Neura Conversion
- `neura.kernel` with the `IsolatedFromAbove` trait.
- `convert-taskflow-to-neura` to convert the `taskflow.hyperblock` into `neura.kernel`
- If a `taskflow.task` has `taskflow.counter`s outside the `hyperblock`, we embed them into the entry block of the `neura.kernel` as `neura.counter`
- `taskflow.task`

Mapping
- `taskflow.task` is converted to a `task` that contains one `neura.kernel`
- `neura.kernel` is mapped onto the tile array
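To make the counter classification described earlier more concrete, the following Python sketch derives the `root` / `relay` / `leaf` type of each counter from its parent link alone. The `Counter` dataclass and the `classify` function are hypothetical helpers for illustration; they are not the pass implemented in this PR. A lone counter is treated as `leaf`, following the note that the `leaf` counter terminates the task when there is only one counter.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Counter:
    """Hypothetical stand-in for a taskflow.counter op."""
    counter_id: int
    parent: Optional[int] = None  # counter_id of the parent, if any


def classify(counters: List[Counter]) -> Dict[int, str]:
    """Map each counter_id to "root", "relay", or "leaf"."""
    # Every id that appears as someone's parent has at least one child.
    ids_with_children = {c.parent for c in counters if c.parent is not None}
    kinds = {}
    for c in counters:
        has_parent = c.parent is not None
        has_child = c.counter_id in ids_with_children
        if has_parent and has_child:
            kinds[c.counter_id] = "relay"
        elif has_child:
            kinds[c.counter_id] = "root"
        else:
            # Has a parent but no child, or is the only counter in the task.
            kinds[c.counter_id] = "leaf"
    return kinds


# The three-counter chain from the Task_1 IR in the conversation.
chain = classify([Counter(0), Counter(1, parent=0), Counter(2, parent=1)])
single = classify([Counter(0)])
```

Under this scheme only the counters classified as `leaf` would need self-increment logic in the FU; the `root` and `relay` entries correspond to registers updated by the off-array affine controller.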