[Bug] Worker intermittently stops polling after synchronous activity completion #1295

@Chaitanya-Varun

Summary

The worker intermittently stops polling after completing long-running synchronous activities. The gRPC connection appears to drop after activity completion, but the worker process remains alive. Auto-reconnection eventually works (5-15 minutes later), but the gap delays activity processing.

Environment

  • temporalio version: 1.18.1
  • grpcio version: 1.75.1
  • Python version: 3.11
  • Temporal Server: Running in Docker (localhost:7233)
  • OS: Ubuntu Linux

Worker Configuration

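import asyncio
import concurrent.futures
from datetime import timedelta
from temporalio.client import Client
from temporalio.worker import Worker

async def main() -> None:
    # Assumed client setup; the address comes from the Environment section.
    client = await Client.connect("localhost:7233")
    with concurrent.futures.ThreadPoolExecutor(max_workers=60) as executor:
        worker = Worker(
            client,
            task_queue="core-py",
            max_concurrent_activities=60,
            activities=[...],
            workflows=[],
            activity_executor=executor,
            graceful_shutdown_timeout=timedelta(seconds=300),
        )
        await worker.run()

asyncio.run(main())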

Activity Characteristics

- Type: Synchronous (runs on ThreadPoolExecutor)
- Duration: 30 minutes to 3 hours
- Timeout: 36,000 seconds (10 hours)
- Heartbeat: Disabled (0s)

Observed Behavior

┌──────┬───────────────────────────────────────────────────┐
│ Step │ What Happens                                      │
├──────┼───────────────────────────────────────────────────┤
│ 1    │ Worker polls, receives activity                   │
├──────┼───────────────────────────────────────────────────┤
│ 2    │ Activity executes (30 min - 3 hrs)                │
├──────┼───────────────────────────────────────────────────┤
│ 3    │ Activity completes, CompleteActivity RPC succeeds │
├──────┼───────────────────────────────────────────────────┤
│ 4    │ Worker stops polling - no new poll requests       │
├──────┼───────────────────────────────────────────────────┤
│ 5    │ Server logs "task queue closed" after 5 min       │
├──────┼───────────────────────────────────────────────────┤
│ 6    │ Worker auto-reconnects after 5-15 min             │
└──────┴───────────────────────────────────────────────────┘

Evidence

Queue closes exactly 5 min after activity completion:
┌────────────────────┬──────────────┐
│ Activity Completed │ Queue Closed │
├────────────────────┼──────────────┤
│ 08:02:39           │ 08:07:39     │
├────────────────────┼──────────────┤
│ 08:57:38           │ 09:02:38     │
├────────────────────┼──────────────┤
│ 09:46:46           │ 09:51:46     │
└────────────────────┴──────────────┘
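The "exactly 5 min" claim can be checked directly from the timestamps in the table. The snippet below is a small sketch, not part of the original logs: the helper name gap is made up for illustration, and it assumes each pair of timestamps falls on the same day.

```python
from datetime import datetime, timedelta

# (activity completed, queue closed) pairs copied from the Evidence table.
OBSERVATIONS = [
    ("08:02:39", "08:07:39"),
    ("08:57:38", "09:02:38"),
    ("09:46:46", "09:51:46"),
]


def gap(completed: str, closed: str) -> timedelta:
    """Return closed - completed, assuming both times are on the same day."""
    fmt = "%H:%M:%S"
    return datetime.strptime(closed, fmt) - datetime.strptime(completed, fmt)


gaps = [gap(completed, closed) for completed, closed in OBSERVATIONS]
for (completed, closed), delta in zip(OBSERVATIONS, gaps):
    print(f"{completed} -> {closed}: {delta.total_seconds():.0f} s")
```

All three gaps come out to 300 seconds, which fits a fixed server-side idle timeout better than a variable network delay.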

During stuck period:
- Process alive, 45 threads active
- DB connections work
- No TCP connection to port 7233
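
The "No TCP connection to port 7233" observation can be re-checked on Linux without extra tooling. The following is a rough sketch, not something from the original report: it parses /proc/net/tcp-style text and counts sockets in the ESTABLISHED state whose remote port is 7233. The function name established_to_port is hypothetical, and the count is system-wide rather than per-process, so it is only meaningful if the worker is the only local client of the Temporal server.

```python
from pathlib import Path

TEMPORAL_PORT = 7233      # from the Environment section (localhost:7233)
TCP_ESTABLISHED = "01"    # state code for ESTABLISHED in /proc/net/tcp


def established_to_port(proc_net_tcp: str, port: int = TEMPORAL_PORT) -> int:
    """Count ESTABLISHED sockets whose remote port equals `port`.

    `proc_net_tcp` is the raw text of /proc/net/tcp or /proc/net/tcp6.
    """
    count = 0
    for line in proc_net_tcp.splitlines()[1:]:  # first line is the header
        fields = line.split()
        if len(fields) < 4:
            continue
        # fields[2] is "HEXADDR:HEXPORT" for the remote end, fields[3] the state.
        remote_port = int(fields[2].rsplit(":", 1)[1], 16)
        if remote_port == port and fields[3] == TCP_ESTABLISHED:
            count += 1
    return count


if __name__ == "__main__":
    total = 0
    for name in ("/proc/net/tcp", "/proc/net/tcp6"):
        path = Path(name)
        if path.exists():  # only present on Linux
            total += established_to_port(path.read_text())
    print(f"ESTABLISHED connections to port {TEMPORAL_PORT}: {total}")
```

Running this periodically next to the worker would show whether the connection disappears right at activity completion or only later, around the time the server closes the task queue.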

Key Observations

1. Intermittent - not every completion triggers it
2. No duration correlation - 27 min to 168 min activities affected
3. Auto-recovery works - just slow (5-15 min)
4. Completion succeeds - only polling stops afterward

Suspected Cause

Possibly a race condition in the SDK's handling of synchronous activity completion. After the ThreadPoolExecutor returns the result and the CompleteActivity RPC succeeds, polling does not resume properly.
