Description
The issue began when I wanted to use static_thread_pool in a production environment that is limited to C++17, so I ported a minimal version into my own codebase. I then found that tasks were occasionally lost. My first step was to check with TSAN. TSAN reported no data race on x86, but on an Apple M3 machine it reported a data race in the steal_front and bulk_put functions. After investigating, I found that the cause was inappropriate memory ordering. I changed the operations on thief_block_, tail_, and steal_tail_ to use acquire-release semantics, which resolved the TSAN errors. However, tasks were still lost on x86. After extensive investigation, I found a sequence of events that can cause a block's tasks never to be executed:
1. The thief calls advance_steal_index to move to the next block, 'a', and pauses after the is_stealable check but before updating thief_block_.
2. The owner finds that block 'a' has not been stolen and successfully takes it over.
3. The thief updates thief_block_ and enters block 'a'.
4. The thief sees steal_tail_ == block_size, assumes the block is full, and attempts to call advance_steal_index again.
5. The owner pushes a large number of tasks into block 'a', then moves forward to block 'a+2' or farther.
6. The thief finds that block 'a+1' is stealable and enters block 'a+1'.
At this point, the tasks in block 'a' are effectively lost: the thief has already moved past the block.
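The six steps above can be replayed deterministically in a small single-threaded model. This is only a sketch under stated assumptions, not the real queue: the names Block, Queue, takeover, fill, grant, and replay_interleaving are hypothetical, every block starts out stealable and empty, taking over a block is modeled as setting steal_tail to the block size, and the owner is assumed to hand a block back to thieves when it leaves it.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Simplified single-threaded model; all names and sizes are hypothetical.
constexpr std::size_t kBlockSize = 4;
constexpr std::size_t kNumBlocks = 4;

struct Block {
  std::size_t tail = 0;        // number of tasks the owner has pushed
  std::size_t steal_tail = 0;  // next slot a thief would claim;
                               // kBlockSize means "not stealable" in this model
};

struct Queue {
  std::array<Block, kNumBlocks> blocks{};
  std::size_t thief_block = 0;

  bool is_stealable(std::size_t i) const { return blocks[i].steal_tail != kBlockSize; }
  // Owner takes the block: thieves now see steal_tail == kBlockSize.
  void takeover(std::size_t i) { blocks[i].steal_tail = kBlockSize; }
  // Owner fills the block with tasks.
  void fill(std::size_t i) { blocks[i].tail = kBlockSize; }
  // Assumption: the owner makes the block stealable again when it leaves.
  void grant(std::size_t i) { blocks[i].steal_tail = 0; }
  // The check a thief applies inside the block, as described in step 4.
  bool steal_reports_done(std::size_t i) const { return blocks[i].steal_tail == kBlockSize; }
};

struct Outcome {
  std::size_t thief_block;    // block the thief ends up in
  std::size_t unstolen_in_a;  // tasks left behind in block 'a'
};

Outcome replay_interleaving() {
  Queue q;
  const std::size_t a = 1;

  // Step 1: the thief checks is_stealable(a), then pauses.
  const bool saw_stealable = q.is_stealable(a);
  assert(saw_stealable);
  // Step 2: the owner takes over block 'a'.
  q.takeover(a);
  // Step 3: the thief resumes with its stale check and enters block 'a'.
  q.thief_block = a;
  // Step 4: the thief sees steal_tail == kBlockSize and decides to advance.
  const bool done = q.steal_reports_done(a);
  assert(done);
  // Step 5: the owner fills 'a', leaves it, and moves on to 'a+2'.
  q.fill(a);
  q.grant(a);
  q.takeover(a + 1);
  q.fill(a + 1);
  q.grant(a + 1);
  q.takeover(a + 2);
  // Step 6: the thief finds 'a+1' stealable and enters it, skipping 'a'.
  assert(q.is_stealable(a + 1));
  q.thief_block = a + 1;

  const Block& skipped = q.blocks[a];
  return {q.thief_block, skipped.tail - skipped.steal_tail};
}
```

In this model, replay_interleaving() leaves the thief in block 'a+1' while block 'a' still holds a full block of tasks that no thief will visit.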
I believe the issue can be fixed by using steal_head_ instead of steal_tail_ in the steal function to determine the Done and Empty results.
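One possible reading of this proposal, for the Done result only, is sketched below. It assumes that steal_head_ counts steals that have actually completed, while steal_tail_ is also set to the block size when the owner takes the block over (which is what the thief observes in step 4). BlockCounters, done_by_steal_tail, done_by_steal_head, and demo are hypothetical names used for illustration, not the library's code.

```cpp
#include <cassert>
#include <cstddef>

constexpr std::size_t kBlockSize = 4;  // hypothetical block size

// Snapshot of the two thief-side counters of one block (hypothetical type).
struct BlockCounters {
  std::size_t steal_head;  // assumed: number of steals that have completed
  std::size_t steal_tail;  // next slot to claim; also kBlockSize after takeover
};

// Check as described in step 4: cannot tell "taken over" from "fully stolen".
bool done_by_steal_tail(const BlockCounters& b) { return b.steal_tail == kBlockSize; }

// Proposed check: Done only once every slot has actually been stolen.
bool done_by_steal_head(const BlockCounters& b) { return b.steal_head == kBlockSize; }

void demo() {
  const BlockCounters taken_over{0, kBlockSize};             // owner holds the block
  const BlockCounters fully_stolen{kBlockSize, kBlockSize};  // nothing left to steal

  assert(done_by_steal_tail(taken_over));    // thief would wrongly move on
  assert(!done_by_steal_head(taken_over));   // thief would not treat it as Done
  assert(done_by_steal_tail(fully_stolen));
  assert(done_by_steal_head(fully_stolen));
}
```

Under these assumptions the two checks agree on a fully stolen block and differ only on a block that the owner has taken over, which is the state that lets the thief skip block 'a' in the scenario above.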