Summary
The worker intermittently stops polling after completing a long-running synchronous activity. The gRPC connection appears to drop after the activity completes, while the worker process stays alive. The worker reconnects on its own 5-15 minutes later, and activity processing is delayed in the meantime.
Environment
- temporalio version: 1.18.1
- grpcio version: 1.75.1
- Python version: 3.11
- Temporal Server: Running in Docker (localhost:7233)
- OS: Ubuntu Linux
Worker Configuration
with concurrent.futures.ThreadPoolExecutor(max_workers=60) as executor:
    worker = Worker(
        client,
        task_queue="core-py",
        max_concurrent_activities=60,
        activities=[...],
        workflows=[],
        activity_executor=executor,
        graceful_shutdown_timeout=timedelta(seconds=300),
    )
    await worker.run()
Activity Characteristics
- Type: Synchronous (runs on ThreadPoolExecutor)
- Duration: 30 minutes to 3 hours
- Timeout: 36,000 seconds (10 hours)
- Heartbeat: Disabled (0s)
Observed Behavior
┌──────┬───────────────────────────────────────────────────┐
│ Step │ What Happens │
├──────┼───────────────────────────────────────────────────┤
│ 1 │ Worker polls, receives activity │
├──────┼───────────────────────────────────────────────────┤
│ 2 │ Activity executes (30 min - 3 hrs) │
├──────┼───────────────────────────────────────────────────┤
│ 3 │ Activity completes, CompleteActivity RPC succeeds │
├──────┼───────────────────────────────────────────────────┤
│ 4 │ Worker stops polling - no new poll requests │
├──────┼───────────────────────────────────────────────────┤
│ 5 │ Server logs "task queue closed" after 5 min │
├──────┼───────────────────────────────────────────────────┤
│ 6 │ Worker auto-reconnects after 5-15 min │
└──────┴───────────────────────────────────────────────────┘
Evidence
Queue closes exactly 5 min after activity completion:
┌────────────────────┬──────────────┐
│ Activity Completed │ Queue Closed │
├────────────────────┼──────────────┤
│ 08:02:39 │ 08:07:39 │
├────────────────────┼──────────────┤
│ 08:57:38 │ 09:02:38 │
├────────────────────┼──────────────┤
│ 09:46:46 │ 09:51:46 │
└────────────────────┴──────────────┘
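The 5-minute gap can be checked mechanically. The sketch below recomputes the interval for each pair of timestamps from the table; it assumes both timestamps in a row fall on the same day, and the names `OBSERVATIONS` and `gap` are illustrative only.

```python
from datetime import datetime, timedelta

# (activity completed, queue closed) pairs copied from the table above.
OBSERVATIONS = [
    ("08:02:39", "08:07:39"),
    ("08:57:38", "09:02:38"),
    ("09:46:46", "09:51:46"),
]


def gap(completed: str, closed: str) -> timedelta:
    """Time between activity completion and queue close (same-day assumption)."""
    fmt = "%H:%M:%S"
    return datetime.strptime(closed, fmt) - datetime.strptime(completed, fmt)


gaps = [gap(completed, closed) for completed, closed in OBSERVATIONS]

for (completed, closed), delta in zip(OBSERVATIONS, gaps):
    print(f"{completed} -> {closed}: {delta}")
```

All three rows give the same interval, which suggests a fixed server-side timeout starting at (or very near) the last poll rather than random network loss.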
During stuck period:
- Process alive, 45 threads active
- DB connections work
- No TCP connection to port 7233
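One way to reproduce the "no TCP connection" check on Linux without extra tools is to read `/proc/net/tcp` and `/proc/net/tcp6`. This is a minimal sketch, not the method used for the observations above: the function names are hypothetical, and the result covers every process in the network namespace, not just the worker, so other Temporal clients on the same host would also show up.

```python
from pathlib import Path

TEMPORAL_PORT = 7233    # frontend port from the Environment section
TCP_ESTABLISHED = "01"  # state code for ESTABLISHED in /proc/net/tcp


def established_to_port(proc_net_text: str, port: int = TEMPORAL_PORT) -> list[str]:
    """Return remote 'HEXADDR:HEXPORT' entries that are ESTABLISHED to `port`."""
    matches = []
    for line in proc_net_text.splitlines()[1:]:  # first line is the column header
        fields = line.split()
        if len(fields) < 4:
            continue
        remote, state = fields[2], fields[3]
        remote_port = int(remote.rsplit(":", 1)[1], 16)  # port is hex-encoded
        if remote_port == port and state == TCP_ESTABLISHED:
            matches.append(remote)
    return matches


def check_host(port: int = TEMPORAL_PORT) -> list[str]:
    """Scan the IPv4 and IPv6 tables of the current network namespace."""
    found = []
    for name in ("/proc/net/tcp", "/proc/net/tcp6"):
        path = Path(name)
        if path.exists():
            found.extend(established_to_port(path.read_text(), port))
    return found


if __name__ == "__main__":
    print(check_host())
```

Running this in a loop from the worker host during a stuck period, and again after recovery, would show when the client-side connection disappears and when it returns.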
Key Observations
1. Intermittent - not every completion triggers it
2. No duration correlation - 27 min to 168 min activities affected
3. Auto-recovery works - just slow (5-15 min)
4. Completion succeeds - only polling stops afterward
Suspected Cause
Possibly a race condition in the SDK's handling of synchronous activity completion: after the ThreadPoolExecutor returns the result and the CompleteActivity RPC succeeds, polling does not resume properly.