Implement service bus monitoring with heartbeat detection and automatic recovery #4601
…tions Co-authored-by: marrobi <17089773+marrobi@users.noreply.github.com>
|
@copilot can the supervisor be moved into |
Unit Test Results: 687 tests, 687 ✅, 8s ⏱️. Results for commit 221e57e. ♻️ This comment has been updated with latest results.
|
@copilot can you add something in the supervisor function to check that there is still a heartbeat. |
…ection Co-authored-by: marrobi <17089773+marrobi@users.noreply.github.com>
Added heartbeat monitoring to the supervisor function. Each worker process now updates a heartbeat file every iteration, and the supervisor checks these timestamps to detect stuck processes. Stale processes (no heartbeat update for 5 minutes) are automatically terminated and restarted. This ensures the system can recover from processes that are alive but not actively working. Changes include:
Commit: c7072b9 |
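For readers following the thread, the heartbeat-file idea described above can be sketched roughly as follows. This is a minimal illustration, not the code from the commit: the function names `write_heartbeat` and `is_stale` and the file layout (a single timestamp per file) are assumptions; only the 5 minute threshold comes from the comment.

```python
import os
import tempfile
import time
from typing import Optional

# 5 minutes, matching the threshold mentioned in the comment above.
STALE_AFTER_SECONDS = 300.0


def write_heartbeat(path: str) -> None:
    """Worker side (hypothetical helper): record the current time on every iteration."""
    with open(path, "w", encoding="utf-8") as f:
        f.write(str(time.time()))


def is_stale(path: str, now: Optional[float] = None,
             max_age: float = STALE_AFTER_SECONDS) -> bool:
    """Supervisor side (hypothetical helper): True if the heartbeat is missing,
    unreadable, or older than max_age seconds."""
    now = time.time() if now is None else now
    try:
        with open(path, encoding="utf-8") as f:
            last_beat = float(f.read().strip())
    except (OSError, ValueError):
        return True
    return (now - last_beat) > max_age


if __name__ == "__main__":
    heartbeat_path = os.path.join(tempfile.mkdtemp(), "worker_0.heartbeat")
    print(is_stale(heartbeat_path))   # no file written yet
    write_heartbeat(heartbeat_path)
    print(is_stale(heartbeat_path))   # freshly written
```

A supervisor built on this would terminate and restart any worker whose file reports stale; a process can be alive yet fail this check, which is the case a plain liveness check misses.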
|
@copilot this needs to be for deployment status updater, in the API, not the runner.
…s updater Co-authored-by: marrobi <17089773+marrobi@users.noreply.github.com>
Moved the heartbeat monitoring from the resource processor to the deployment status updater in the API as requested. The implementation now includes:
Commit: 202e726 |
|
@copilot fix the linting issues:
/github/workspace/api_app/service_bus/deployment_status_updater.py:57:1: W293 blank line contains whitespace
|
Co-authored-by: marrobi <17089773+marrobi@users.noreply.github.com>
Fixed all linting issues by removing trailing whitespace from blank lines in both deployment_status_updater.py and test_deployment_status_update.py. Also incremented the API version from 0.24.3 to 0.24.4 as requested. Commit: 381bd9c |
|
/test-extended 381bd9c |
|
🤖 pr-bot 🤖 🏃 Running extended tests: https://github.com/microsoft/AzureTRE/actions/runs/15887800326 (with refid (in response to this comment from @marrobi) |
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: marrobi <17089773+marrobi@users.noreply.github.com>
|
changelog is needed for this |
…d automatic recovery; update health check endpoint to include consumer statuses
task.cancel()
try:
    await task
except asyncio.CancelledError:

patch("service_bus.service_bus_consumer.time.monotonic", return_value=fixed_time):
    try:
        await consumer._receive_messages_loop()
    except asyncio.CancelledError:

with patch("service_bus.service_bus_consumer.asyncio.sleep", side_effect=mock_sleep):
    try:
        await consumer._receive_messages_loop()
    except asyncio.CancelledError:

patch("service_bus.service_bus_consumer.asyncio.create_task", side_effect=create_fail_task):
    try:
        await consumer.supervisor_with_heartbeat_check()
    except KeyboardInterrupt:

patch("service_bus.service_bus_consumer.asyncio.create_task", side_effect=create_mock_task):
    try:
        await consumer.supervisor_with_heartbeat_check()
    except KeyboardInterrupt:

patch("service_bus.service_bus_consumer.time.monotonic", side_effect=monotonic_values):
    try:
        await consumer._receive_messages_loop()
    except asyncio.CancelledError:

patch("service_bus.service_bus_consumer.asyncio.create_task", side_effect=create_mock_task):
    try:
        await consumer.supervisor_with_heartbeat_check()
    except KeyboardInterrupt:
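The review snippets above all rely on the same testing trick: patch something the endless loop calls so that it raises after a few iterations, then catch that exception in the test. A self-contained sketch of the pattern follows. `LoopingConsumer` and `run_loop_for` are hypothetical stand-ins, and the patch target here is plain `asyncio.sleep` rather than the `service_bus.service_bus_consumer.asyncio.sleep` path used in the PR's tests.

```python
import asyncio
from unittest.mock import patch


class LoopingConsumer:
    """Hypothetical stand-in for a consumer with an endless receive loop."""

    def __init__(self) -> None:
        self.iterations = 0

    async def _receive_messages_loop(self) -> None:
        while True:
            self.iterations += 1
            await asyncio.sleep(1)


async def run_loop_for(consumer: LoopingConsumer, iterations: int) -> int:
    """Run the endless loop, stopping it after the given number of sleep calls."""
    calls = 0

    async def mock_sleep(_seconds: float) -> None:
        # Replacement for asyncio.sleep: returns immediately, and raises
        # CancelledError once enough iterations have run.
        nonlocal calls
        calls += 1
        if calls >= iterations:
            raise asyncio.CancelledError()

    with patch("asyncio.sleep", side_effect=mock_sleep):
        try:
            await consumer._receive_messages_loop()
        except asyncio.CancelledError:
            pass  # expected: this is how the test ends the loop
    return consumer.iterations


if __name__ == "__main__":
    print(asyncio.run(run_loop_for(LoopingConsumer(), 3)))
```

Because the patched sleep never really waits, the test runs instantly, and the explicit `except` branch documents that the exception is the intended exit rather than a failure.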
|
@copilot address the PR comments |
The CHANGELOG.md has already been updated in commit 42bf9d0. The entry is in the BUG FIXES section under Unreleased: |
Co-authored-by: marrobi <17089773+marrobi@users.noreply.github.com>
|
/test-extended d8ce5cf |
|
🤖 pr-bot 🤖 🏃 Running extended tests: https://github.com/microsoft/AzureTRE/actions/runs/21835593211 (with refid (in response to this comment from @marrobi) |
|
🤖 pr-bot 🤖
(in response to this comment from @marrobi) |
|
/test-extended 221e57e |
|
🤖 pr-bot 🤖 🏃 Running extended tests: https://github.com/microsoft/AzureTRE/actions/runs/21900074237 (with refid (in response to this comment from @marrobi) |
|
/test-destroy-env |
|
Destroying PR test environment (RG: rg-trece7e1779)... (run: https://github.com/microsoft/AzureTRE/actions/runs/21901546593) |
|
Destroying branch test environment (RG: rg-tre7798f6d6)... (run: https://github.com/microsoft/AzureTRE/actions/runs/21901546593) |
|
Branch test environment destroy complete (RG: rg-tre7798f6d6) |
|
/test-extended |
|
🤖 pr-bot 🤖
(in response to this comment from @marrobi) |
|
/test-extended 221e57e |
|
🤖 pr-bot 🤖 🏃 Running extended tests: https://github.com/microsoft/AzureTRE/actions/runs/21901627819 (with refid (in response to this comment from @marrobi) |
Resolves #4464
What is being addressed
VM operations in the TRE UI were getting stuck in "awaiting_action" status indefinitely. The resource processor completed operations successfully (VMs started/stopped in Azure), but the API never received status updates, leaving operations stuck in the UI.
Root cause: Service bus consumers (DeploymentStatusUpdater and AirlockStatusUpdater) could fail silently or hang without detection or recovery mechanisms.
How is this addressed
Implemented a comprehensive monitoring and auto-recovery system with dual detection:
Key Features
ServiceBusConsumer Base Class: Provides heartbeat monitoring and supervisor functionality for all service bus consumers.
Dual Monitoring:
receive_messages() tasks fail or complete
Automatic Recovery:
Implementation Details
/health endpoint
Files Changed
CHANGELOG.md - Added entry in BUG FIXES section
api_app/_version.py - Incremented to 0.26.1
service_bus/service_bus_consumer.py - New base class with monitoring (follows PEP 8 import conventions)
service_bus/deployment_status_updater.py - Inherits from base class
service_bus/airlock_request_status_update.py - Same integration
main.py - Uses supervisor functions
services/logging.py - Enhanced logging
api/routes/health.py - Consumer status integration
services/health_checker.py - Health check support
Result
Before: Manual API restart required when consumers failed
After: Self-healing system with zero manual intervention needed and health endpoint monitoring
This eliminates indefinitely stuck operations and makes the TRE service bus system resilient to both transient failures and silent hangs.
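To make the dual detection concrete, here is a rough sketch of a supervisor that restarts a consumer when its task has ended or when its heartbeat has gone stale. It is an illustration under assumptions, not the PR's `ServiceBusConsumer`: the class name `HeartbeatConsumer`, the methods `beat`, `is_stale`, and `supervise`, the default intervals, and the `max_restarts` parameter (added only so the demo terminates) are all hypothetical. The use of `time.monotonic` mirrors what the tests in this PR patch.

```python
import asyncio
import time
from typing import Optional


class HeartbeatConsumer:
    """Hypothetical base class illustrating heartbeat plus task supervision."""

    def __init__(self, stale_after: float = 300.0, check_interval: float = 60.0) -> None:
        self.stale_after = stale_after        # seconds without a beat before restart
        self.check_interval = check_interval  # seconds between supervisor checks
        self.last_heartbeat = time.monotonic()
        self.restarts = 0

    def beat(self) -> None:
        """Called by the receive loop on every iteration to signal progress."""
        self.last_heartbeat = time.monotonic()

    def is_stale(self) -> bool:
        return time.monotonic() - self.last_heartbeat > self.stale_after

    async def receive_messages(self) -> None:
        """Subclasses do the real work here and call beat() as they go."""
        raise NotImplementedError

    async def supervise(self, max_restarts: Optional[int] = None) -> None:
        """Restart receive_messages() when its task ends or its heartbeat is stale."""
        self.beat()
        task = asyncio.create_task(self.receive_messages())
        try:
            while max_restarts is None or self.restarts < max_restarts:
                await asyncio.sleep(self.check_interval)
                if not task.done() and not self.is_stale():
                    continue  # task is running and has reported progress recently
                if not task.done():
                    task.cancel()  # alive but silent: treat as hung
                # Wait for the old task; its error or cancellation is collected here.
                await asyncio.gather(task, return_exceptions=True)
                self.restarts += 1
                self.beat()
                task = asyncio.create_task(self.receive_messages())
        finally:
            task.cancel()
            await asyncio.gather(task, return_exceptions=True)


class HungConsumer(HeartbeatConsumer):
    async def receive_messages(self) -> None:
        await asyncio.sleep(3600)  # alive, but never calls beat()


if __name__ == "__main__":
    consumer = HungConsumer(stale_after=0.05, check_interval=0.01)
    asyncio.run(consumer.supervise(max_restarts=2))
    print(consumer.restarts)
```

Checking `task.done()` covers consumers that crash or return, while the heartbeat check covers consumers that are still scheduled but stuck; either condition alone would miss one of the two failure modes described in the root cause.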
Documentation and versioning