
Conversation

@vincent-noel
Collaborator

I had some crashes that I'd been wanting to fix for a while, and finally found the issues:

  • pAttackTarget is not cleaned when the target is deleted, resulting in a use after free
  • neighbors are not always symmetrical, but when we delete a cell we assume they are, resulting in a use after free
  • spring_attachments are not always symmetrical, but when we delete a cell we assume they are, resulting in a use after free

Here are the fixes I'm proposing for now:

  • When deleting a cell, I check whether any other cell has it as an attack target, and clear that pointer if so
  • When deleting a cell, after removing it from its known neighbors, check every cell and remove it from any unknown (non-symmetrical) neighbor lists
  • When deleting a cell, after removing it from its known spring_attachments, check every cell and remove it from any unknown (non-symmetrical) spring_attachments

Clearly, this can be improved, but it fixes the present issues.
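For orientation, here is a minimal sketch of how the three cleanups could be chained in the deletion path. The wrapper name remove_all_dangling_references is hypothetical; the loop bodies mirror the hunks quoted in the review below (requires <algorithm>):

// Hypothetical wrapper (name invented for illustration): run all three
// cleanups before delete_cell releases this cell's memory.
void Cell::remove_all_dangling_references( void )
{
	// 1. clear any pAttackTarget pointer aimed at this cell
	remove_self_from_attackers();

	// 2. scrub this cell from every other cell's neighbors and
	//    spring_attachments, catching non-symmetrical entries
	for( int i = 0; i < all_cells->size(); i++ )
	{
		Cell* pC = (*all_cells)[i];
		if( pC == this )
		{ continue; }
		auto it = std::find( pC->state.neighbors.begin(), pC->state.neighbors.end(), this );
		if( it != pC->state.neighbors.end() )
		{ pC->state.neighbors.erase( it ); }
		it = std::find( pC->state.spring_attachments.begin(), pC->state.spring_attachments.end(), this );
		if( it != pC->state.spring_attachments.end() )
		{ pC->state.spring_attachments.erase( it ); }
	}
	return;
}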


vincent-noel and others added 5 commits December 2, 2025 14:45
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>



Copilot AI left a comment


Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.



Comment on lines +3477 to +3483
for (int j=0; j < all_cells->size(); j++)
{
	Cell* pC = (*all_cells)[j];
	if (( pC != this) && pC->phenotype.cell_interactions.pAttackTarget == this) {
		pC->phenotype.cell_interactions.pAttackTarget = NULL;
	}
}

Copilot AI Dec 3, 2025


The loop iterates over all_cells without synchronization, and the #pragma omp parallel for creates concurrent writes to pAttackTarget. This can cause race conditions if cells are being deleted or modified concurrently. Consider removing the parallel directive or adding proper synchronization.
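One conservative resolution along those lines is simply to drop the directive; a sketch (identical to the function quoted next, minus the pragma):

void Cell::remove_self_from_attackers( void )
{
	// serial on purpose: O(n), but only runs when a cell is deleted
	for (int j=0; j < all_cells->size(); j++)
	{
		Cell* pC = (*all_cells)[j];
		if (( pC != this) && pC->phenotype.cell_interactions.pAttackTarget == this) {
			pC->phenotype.cell_interactions.pAttackTarget = NULL;
		}
	}
}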

Comment on lines +3474 to +3484
void Cell::remove_self_from_attackers( void )
{
	#pragma omp parallel for
	for (int j=0; j < all_cells->size(); j++)
	{
		Cell* pC = (*all_cells)[j];
		if (( pC != this) && pC->phenotype.cell_interactions.pAttackTarget == this) {
			pC->phenotype.cell_interactions.pAttackTarget = NULL;
		}
	}
}

Copilot AI Dec 3, 2025


This O(n) loop iterates through all cells to clean up attack targets, which can be expensive for large simulations with many cells. Consider using a reverse index (e.g., a set/list of attackers stored on each potential target cell) to make this operation O(k) where k is the number of attacking cells, which is typically much smaller than n.
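A self-contained toy of the reverse-index suggestion (all names below are hypothetical, not PhysiCell's actual members):

#include <unordered_set>

// Toy types illustrating the O(k) cleanup; a real version would hang the
// attackers set off PhysiCell's Cell/state instead.
struct Cell
{
	Cell* pAttackTarget = nullptr;        // forward pointer, as in the PR
	std::unordered_set<Cell*> attackers;  // reverse index (the suggestion)
};

// Keep both directions consistent whenever an attacker retargets.
void set_attack_target( Cell& attacker, Cell* target )
{
	if( attacker.pAttackTarget != nullptr )
	{ attacker.pAttackTarget->attackers.erase( &attacker ); }
	attacker.pAttackTarget = target;
	if( target != nullptr )
	{ target->attackers.insert( &attacker ); }
}

// Deletion-time cleanup now touches only the k attackers, not all n cells.
void clear_attackers( Cell& dying )
{
	for( Cell* attacker : dying.attackers )
	{ attacker->pAttackTarget = nullptr; }
	dying.attackers.clear();
}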

Comment on lines +3409 to +3422
for ( int i = 0; i < all_cells->size(); i++ )
{
	Cell* pC = (*all_cells)[i];
	if (pC != this)
	{
		auto SearchResult = std::find(
			pC->state.neighbors.begin(), pC->state.neighbors.end(), this );
		if ( SearchResult != pC->state.neighbors.end() )
		{
			std::cout << "Cell " << pC->ID << " still has cell " << this->ID << " as a neighbor!" << std::endl;
			pC->state.neighbors.erase( SearchResult );
		}
	}
}

Copilot AI Dec 3, 2025


This loop iterates over all_cells without thread safety protections. If cell deletion can occur concurrently (e.g., from multiple threads), this could lead to iterator invalidation or accessing deallocated memory. Consider adding appropriate synchronization or documenting that delete_cell must be called from a single-threaded context.
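If the documented single-threaded-context route is taken, the contract can also be made executable with OpenMP's standard omp_in_parallel() query; a sketch (placement hypothetical):

#include <omp.h>
#include <cassert>

// Guard delete-time cleanup against being entered from a parallel region.
void assert_single_threaded_cleanup( void )
{
	assert( !omp_in_parallel() && "delete_cell cleanup must run outside parallel regions" );
}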

Comment on lines +3455 to +3469
for ( int i = 0; i < all_cells->size(); i++ )
{
	Cell* pC = (*all_cells)[i];
	if (pC != this)
	{
		auto SearchResult = std::find(
			pC->state.spring_attachments.begin(), pC->state.spring_attachments.end(), this );
		if ( SearchResult != pC->state.spring_attachments.end() )
		{
			std::cout << "Cell " << pC->ID << " still has cell " << this->ID << " as a spring attachment!" << std::endl;
			pC->state.spring_attachments.erase( SearchResult );
		}
	}
}

Copilot AI Dec 3, 2025


This loop iterates over all_cells without thread safety protections. If cell deletion can occur concurrently (e.g., from multiple threads), this could lead to iterator invalidation or accessing deallocated memory. Consider adding appropriate synchronization or documenting that delete_cell must be called from a single-threaded context.

@drbergman
Collaborator

I remember being quite opposed to this PR, but I may have misunderstood what it was doing. I had thought we were looping (or even double looping?) over all cells every mechanics iteration, or something like that. I see now that this instead happens at every deletion, which is far rarer. So, while this is still a band-aid (we should just have all these lists be accurate all the time), it is not nearly the performance hit I recall imagining. If we don't have the collective time to track down the source of this bug, this patch should be prioritized for merging.

@drbergman
Collaborator

I diagrammed out how dynamic_spring_attachments could cause this error to arise. Since that drawing is embarrassing, let me explain further here.

Let A -> B indicate that "A is attached to B", i.e., B is in A.state.spring_attachments. Assume that, as we parallel loop over all cells calling dynamic_spring_attachments, we start with A <-> B and that:

  1. T1 (Thread 1) updates A.
  2. T1 begins detaching A and B (calls detach_cells_as_spring).
  3. T1 removes B from A's list (calls A->detach_cell_as_spring(B)). STATE CHANGED: A <- B
  4. T2 (Thread 2) updates B.
  5. T2 begins attaching A and B (calls attach_cells_as_spring, which it can do without checking whether they are already attached).
  6. T2 attaches A to B (calls B->attach_cell_as_spring(A)). Here it notices that A is already in B.state.spring_attachments and does nothing.
  7. T2 attaches B to A (calls A->attach_cell_as_spring(B)). STATE CHANGED: A <-> B
  8. T1 finally removes A from B (calls B->detach_cell_as_spring(A)). STATE CHANGED: A -> B

What seems likely to make this vanishingly rare is that T1 has to stall between its two detach calls for long enough that T2 completes both of its attachments in that window.

I think this also reveals the solution to this part of the bug giving rise to this PR: wrap the joint calls to attach_cell_as_spring in a critical block, i.e.

void attach_cells_as_spring( Cell* pCell_1, Cell* pCell_2 )
{
	#pragma omp critical
	{
		pCell_1->attach_cell_as_spring(pCell_2);
		pCell_2->attach_cell_as_spring(pCell_1);
	}
	return;
}

This way, once Thread 1 decides to detach A from B, it must complete that detachment before any other thread can start a new attachment/detachment. This will ensure that the spring attachments will remain symmetric through this function. It leaves in place the (acceptable IMO) nondeterminism of whether cells A and B are attached after a given timestep (the order in which A and B are updated could result in them being attached or not at the end of this loop, purely based on which gets picked up first by a thread).
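Presumably the detach side needs the same treatment, so that T1's two detach calls in the scenario above become atomic as a pair; a sketch, assuming the paired wrapper detach_cells_as_spring from step 2:

void detach_cells_as_spring( Cell* pCell_1, Cell* pCell_2 )
{
	#pragma omp critical
	{
		pCell_1->detach_cell_as_spring(pCell_2);
		pCell_2->detach_cell_as_spring(pCell_1);
	}
	return;
}

Since unnamed OpenMP critical sections all share a single global lock, this also serializes detaches against attaches, which is exactly what the scenario requires.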

Haven't looked into neighbors to see if a similar dynamic is possible there.

Untitled 43.pdf

@drbergman
Collaborator

drbergman commented Jan 29, 2026

I've just witnessed this¹ happen in very large numbers, both on real data passed on to me from a sim and in just using the template project with 5000 cells at the start. In the latter case, I found four examples in the neighbor graph at time t=60.0 min. For example, cell 36 has cell 208 in its list of neighbors, but 208 does not have 36. These are their lines:

...
36: 1366,643,208,3173,1021
...
208: 3846,3173,3108,4701,1021
...

output00000001_cell_neighbor_graph.txt

Footnotes

  1. asymmetry in neighbors

@drbergman
Collaborator

drbergman commented Jan 30, 2026

max_cell_interactive_distance_in_voxel updates too rarely: only on volume updates, division, or transition. All of these happen at the phenotype scale, not the mechanics scale, which is when cells can enter new mechanics voxels. This can cause some voxels to be erroneously skipped over in the is_neighbor_voxel check inside add_potentials, because the max interaction distance in those voxels is still the initial value of 0.0.

This does little to explain why inconsistencies remain at t=60, since the issue should resolve itself once every mechanics voxel has hosted a cell and thereby set its max interaction distance. More work required there.
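One possible direction (a sketch only; the container member is the one named above, but the accessors and placement are assumptions to be checked against the actual API):

#include <vector>

// Sketch: refresh the per-voxel max interaction distance at mechanics scale,
// e.g., whenever a cell's mechanics voxel changes (placement hypothetical).
void refresh_voxel_interactive_distance( Cell* pCell )
{
	int voxel = pCell->get_current_mechanics_voxel_index(); // assumed accessor
	double reach = pCell->phenotype.geometry.radius
		* pCell->phenotype.mechanics.relative_maximum_adhesion_distance; // assumed reach formula
	std::vector<double>& maxes = pCell->get_container()->max_cell_interactive_distance_in_voxel;
	if( maxes[voxel] < reach )
	{ maxes[voxel] = reach; } // voxel no longer skipped by is_neighbor_voxel
}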
