@NikolayS
Owner

Adds comprehensive test suite to reproduce Theory 1 (shared memory queue saturation) for the recurring IPC:ParallelFinish hang issue.

Test files:

  • test_parallel_queue_saturation.sql: Main reproduction test with 250K dead tuples and flood_error_queue() function to saturate 16KB error queues
  • monitor_parallel_hang.sql: Monitoring script to observe wait events
  • test_parallel_hang_alternative.sql: 7 alternative test approaches
  • run_reproduction_test.sh: Automated setup and execution script
  • test_parallel_hang_README.md: Complete documentation

Theory being tested: Workers block indefinitely once their error queues fill up. This creates a circular dependency: the workers need the leader to drain the queue, but the leader drains only when the ParallelMessagePending flag is set, and setting that flag requires a worker to send a message successfully, which is impossible while the queue is full.
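
The dependency described above can be illustrated with a toy model. This is only a sketch under stated assumptions, not PostgreSQL's shared memory queue code: `simulate`, `capacity`, `total_messages`, and `leader_services_flag` are hypothetical names, the queue is counted in messages rather than bytes, and the switch that stops the leader from acting on the flag is an assumption added so the hang can be shown.

```python
from collections import deque


def simulate(capacity, total_messages, leader_services_flag, max_steps=10_000):
    """Return 'finished' if the worker sends everything, else 'hung'."""
    queue = deque()            # stands in for one fixed-size error queue
    message_pending = False    # stands in for ParallelMessagePending
    sent = 0

    for _ in range(max_steps):
        if sent == total_messages:
            return "finished"

        # Worker: a send succeeds only while the queue has room.
        if len(queue) < capacity:
            queue.append(sent)
            sent += 1
            message_pending = True   # set only after a successful send

        # Leader: drains only if the flag is set and it acts on the flag.
        if message_pending and leader_services_flag:
            queue.clear()
            message_pending = False

    return "hung"  # no progress within the step budget (toy stand-in for a hang)


if __name__ == "__main__":
    print(simulate(capacity=4, total_messages=100, leader_services_flag=True))
    print(simulate(capacity=4, total_messages=100, leader_services_flag=False))
```

In this model the worker stalls only when it has more messages than the queue can hold and the leader never drains, which is the condition the reproduction tests try to force.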

Note: Tests target PostgreSQL 16.3 specifically per production environment.

Analyzes two critical commits merged after PostgreSQL 16.3:
- 6f6521d (16.4): Don't enter parallel mode when holding interrupts
- 06424e9 (16.5): Improved fix for interrupt handling

These commits directly address the IPC:ParallelFinish hang issue by
preventing parallel worker launch when leader cannot process interrupts,
which eliminates the deadlock scenario where workers block on full error
queues while leader cannot drain them.

Recommendation: Upgrade to PostgreSQL 16.5+ to resolve production issue.
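
A quick way to check a server against this recommendation is to compare its `server_version_num` (the output of `SHOW server_version_num;`, for example `160003` for 16.3) with the 16.5 threshold named above. The sketch below is illustrative only: the helper names are hypothetical, and it deliberately covers only the 16.x branch because that is the only branch discussed here.

```python
def parse_server_version_num(version_num):
    """Split a PostgreSQL 10+ server_version_num into (major, minor)."""
    # For PostgreSQL 10 and later the value is major * 10000 + minor.
    return divmod(int(version_num), 10000)


def has_interrupt_fixes(version_num, fixed_minor=5):
    """True if a 16.x server is at or past the minor release recommended above."""
    major, minor = parse_server_version_num(version_num)
    if major != 16:
        # Assumption of this sketch: other branches are out of scope.
        raise ValueError("this check only covers the PostgreSQL 16 branch")
    return minor >= fixed_minor


if __name__ == "__main__":
    for value in ("160003", "160004", "160005"):
        print(value, has_interrupt_fixes(value))
```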

NikolayS pushed a commit that referenced this pull request Dec 23, 2025

truncate_useless_pathkeys() seems to have neglected to account for
PathKeys that might be useful for WindowClause evaluation.  Modify it so
that it properly accounts for that.

Making this work required adjusting two things:

1. Check sort_pathkeys instead of query_pathkeys.
2. Add an explicit check for window_pathkeys.

For #1, query_pathkeys gets set in standard_qp_callback() according to the
sort order requirements for the first operation to be applied after the
join planner is finished, so this changes depending on which upper
planner operations a particular query needs.  If the query has window
functions and no GROUP BY, then query_pathkeys gets set to
window_pathkeys.  Before this change, this meant PathKeys useful for the
ORDER BY were not accounted for in queries with window functions.

Because of #1, #2 is now required so that we explicitly check to ensure
we don't truncate away PathKeys useful for window functions.
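
The effect of the two changes can be sketched with a toy model. This is not the planner's code: pathkeys are reduced to plain column names, usefulness is modeled as the length of a shared leading prefix, and the other checks the real function performs (merge joins, grouping, and so on) are left out. The function names `toy_truncate_before` and `toy_truncate_after` are hypothetical.

```python
def common_prefix_len(required, available):
    """Count how many leading keys of `available` match `required`."""
    n = 0
    for want, have in zip(required, available):
        if want != have:
            break
        n += 1
    return n


def toy_truncate_before(path_keys, query_pathkeys):
    # Old behaviour in this model: only query_pathkeys is consulted.
    return path_keys[:common_prefix_len(query_pathkeys, path_keys)]


def toy_truncate_after(path_keys, sort_pathkeys, window_pathkeys):
    # New behaviour in this model: keep whatever helps ORDER BY or the window.
    keep = max(
        common_prefix_len(sort_pathkeys, path_keys),
        common_prefix_len(window_pathkeys, path_keys),
    )
    return path_keys[:keep]


if __name__ == "__main__":
    path_keys = ["a", "b", "c"]
    window_pathkeys = ["a"]        # e.g. OVER (ORDER BY a)
    sort_pathkeys = ["a", "b"]     # e.g. ORDER BY a, b
    # With window functions and no GROUP BY, query_pathkeys is window_pathkeys.
    print(toy_truncate_before(path_keys, window_pathkeys))
    print(toy_truncate_after(path_keys, sort_pathkeys, window_pathkeys))
```

In the first call the key `b`, which the ORDER BY could have used, is truncated away; in the second it is kept.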

Author: David Rowley <dgrowleyml@gmail.com>
Discussion: https://postgr.es/m/CAApHDvrj3HTKmXoLMbUjTO=_MNMxM=cnuCSyBKidAVibmYPnrg@mail.gmail.com