# Learnings

## Celery
A well-known, lightweight distributed task queue for running background tasks in the Python ecosystem.
- Task Queues: A queue that lets applications perform work, called tasks, asynchronously outside of a user request (see the sketch below).
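A minimal sketch of what defining and queuing a task looks like; the app name, broker URL, and `add()` task are illustrative assumptions, not taken from this project:

```python
from celery import Celery

# Hypothetical app and broker URL, for illustration only.
app = Celery("worker", broker="redis://localhost:6379/0")

@app.task
def add(x, y):
    # Runs on a worker process, outside the request that queued it.
    return x + y

# .delay() enqueues the task and returns an AsyncResult immediately,
# without blocking the caller.
result = add.delay(2, 3)
```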
Celery works with three workflow primitives, two of which are used in this application (see the sketch after this list):
- Chain: Execute tasks sequentially (e.g., Task A → Task B → Task C).
- Group: Execute tasks in parallel (e.g., Task A, Task B, Task C run simultaneously).
- Chord (excluded here): Execute a group of tasks in parallel, followed by a callback task that receives the group's results.
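A hedged sketch of the two primitives used here, again with a hypothetical `add(x, y)` task:

```python
from celery import Celery, chain, group

app = Celery("worker", broker="redis://localhost:6379/0")  # illustrative broker

@app.task
def add(x, y):
    return x + y

# Chain: add(1, 2) runs first; its result (3) is passed as the first
# argument to the next signature, so the second call is add(3, 10).
chain(add.s(1, 2), add.s(10)).apply_async()

# Group: all three tasks are queued at once and run in parallel on workers.
group(add.s(1, 1), add.s(2, 2), add.s(3, 3)).apply_async()
```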
- Distributed Computing: Celery allows tasks to be distributed across multiple workers, enabling horizontal scaling.
- Task Retries & Error Handling: Tasks can be configured to retry automatically on failure, using exponential backoff so that repeated retries do not overload the server (see the sketch after this list).
- Task Monitoring: Celery provides tools like Flower to monitor task execution, worker status, and task results in real time.
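A sketch of the retry options, assuming a hypothetical `flaky_call()` task; the exception type, backoff cap, and retry limit are illustrative choices:

```python
from celery import Celery
import requests

app = Celery("worker", broker="redis://localhost:6379/0")  # illustrative broker

@app.task(
    autoretry_for=(requests.RequestException,),  # retry on network errors
    retry_backoff=True,        # wait ~1s, 2s, 4s, 8s, ... between attempts
    retry_backoff_max=600,     # cap the delay at 10 minutes
    retry_kwargs={"max_retries": 5},
)
def flaky_call(url):
    return requests.get(url, timeout=10).status_code
```

Flower itself needs no code changes; it is started from the command line (e.g. `celery -A <app> flower`) and watches the broker and workers.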
## Celery Beat
An extension of Celery that periodically sends scheduled tasks to the task queue, using interval or cron-style schedules.
- Cron Jobs: Celery Beat lets you schedule tasks at specific intervals (e.g., every minute, hourly, daily); see the sketch below.
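A sketch of a beat schedule, assuming hypothetical task names and timings:

```python
from celery import Celery
from celery.schedules import crontab

app = Celery("worker", broker="redis://localhost:6379/0")  # illustrative broker

app.conf.beat_schedule = {
    "add-every-minute": {
        "task": "tasks.add",     # dotted path to a registered task
        "schedule": 60.0,        # plain interval: every 60 seconds
        "args": (2, 3),
    },
    "nightly-report": {
        "task": "tasks.send_report",
        "schedule": crontab(hour=2, minute=0),  # every day at 02:00
    },
}
```

The scheduler runs as its own process (`celery -A <app> beat`) and only enqueues tasks; workers execute them as usual.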
## Signals
Signals allow decoupled applications to receive notifications when certain actions occur elsewhere in the application.
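For example, Celery exposes lifecycle signals you can hook into; this handler and its logging are illustrative assumptions:

```python
from celery.signals import task_failure

@task_failure.connect
def on_task_failure(sender=None, task_id=None, exception=None, **kwargs):
    # Fires whenever any task raises, without the task itself knowing
    # anything about this handler.
    print(f"Task {sender.name} ({task_id}) failed: {exception!r}")
```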
## Redis
One of the possible message brokers usable with Celery.
- Message Brokers: Redis facilitates communication between the task producer (client) and task consumers (workers).
- Task Queuing: Tasks are stored in Redis queues until workers pick them up for execution.
- Result Backend: Redis can store task results, allowing clients to retrieve them asynchronously. A result backend must be configured in Celery to use chords (see the sketch after this list).
- Persistence and Scalability: Redis supports data persistence and can handle high-throughput task queues, making it ideal for distributed systems.
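A sketch of pointing Celery at Redis for both roles; the URLs (a local Redis, databases 0 and 1) are illustrative assumptions:

```python
from celery import Celery

app = Celery(
    "worker",
    broker="redis://localhost:6379/0",   # where queued tasks wait for workers
    backend="redis://localhost:6379/1",  # where finished task results are stored
)

@app.task
def add(x, y):
    return x + y

result = add.delay(2, 3)
print(result.get(timeout=10))  # the client fetches the stored result from Redis
```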
Initially, this project was meant to investigate inter-container communication in Docker, but I ended up focusing more on playing with distributed systems. The tools ChatGPT recommended turned out to be what I recall sitting under the hood of Airflow, which gave me a deeper understanding of how that tool works in general.
Thinking back to Airflow and its DAGs, the properties of each tool above largely explain why they are suitable building blocks for Airflow itself. Combining task abstractions with distributed tools that offer high throughput, scheduling, fault tolerance, and monitoring yields the robust orchestrator that many data engineers know and love.