Description
There have been a lot of issues and PRs referencing this, but no one has got it quite right yet. I would like to discuss the things we need to get Multiple Schedulers running for failover.
The `rq` set of libraries is amazingly simple and I would love to continue using them. I feel like this might be a deal breaker for a lot of folks in adopting RQ + RQ-Scheduler.
Use Case
The feature I'm most interested in is: Multiple Schedulers running at the same time, but only one scheduler will be active. If the active scheduler dies for whatever reason, an inactive scheduler will become active.
This is a very important feature for us, as we're hoping to run the scheduler on multiple servers for failover. (It also makes our deployment easier, as each server stays identical.)
Previous Attempts
#143 seems to be a PR for multiple schedulers, but it introduces a bug where more than one scheduler won't even start or register itself.
#170 tries to fix this issue by completely removing the birth/death registration, which may not be ideal, as we no longer keep track of all registered schedulers or of which one is active at any given moment.
In both of the above cases (at first glance, and pardon me if I'm wrong), the locking mechanism doesn't seem reliable and may allow multiple schedulers to acquire the lock.
Fix
I would like to propose a fix for these issues, and introduce it as a somewhat reliable feature.
A rough plan I have in mind:
- Register each scheduler with a unique key, thus keeping track of all registered schedulers
- Acquire a lock only if no one else has a lock
- Keep the lock until you die / crash / deregister
- Other schedulers (who have also registered themselves) will every so often check the lock for expiry.
- As soon as they find an expired lock, they will attempt to gain a lock. This must be done without race conditions.
- We don't want to release the lock easily as that will cause other schedulers to become active. That is not really desirable as our main goal is redundancy and failover.
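To make the plan above a bit more concrete, here is a minimal sketch of the lock handling. It is not taken from RQ-Scheduler or from either PR. The `Scheduler` and `InMemoryStore` classes, the `scheduler:lock` key name, and the three store methods are all hypothetical names made up for illustration. The in-memory store only stands in for Redis so the example runs without a server; on real Redis, the "acquire only if no one else has it" step would likely map to redis-py's `set(key, value, nx=True, ex=ttl)`, while the compare-and-extend and compare-and-delete steps would need to be atomic (for example a Lua script).

```python
import uuid


class InMemoryStore:
    """Hypothetical stand-in for Redis so the sketch runs without a server."""

    def __init__(self, clock):
        self._clock = clock
        self._data = {}  # key -> (value, expires_at)

    def _live_value(self, key):
        # Return the stored value, or None if the key is missing or expired.
        entry = self._data.get(key)
        if entry is None or entry[1] <= self._clock():
            self._data.pop(key, None)
            return None
        return entry[0]

    def set_if_absent(self, key, value, ttl):
        # Assumed Redis equivalent: client.set(key, value, nx=True, ex=ttl)
        if self._live_value(key) is not None:
            return False
        self._data[key] = (value, self._clock() + ttl)
        return True

    def extend_if_owner(self, key, value, ttl):
        # On real Redis this check-and-extend must be a single atomic step.
        if self._live_value(key) != value:
            return False
        self._data[key] = (value, self._clock() + ttl)
        return True

    def delete_if_owner(self, key, value):
        # Only the current lock holder may remove the lock.
        if self._live_value(key) != value:
            return False
        del self._data[key]
        return True


class Scheduler:
    LOCK_KEY = "scheduler:lock"  # hypothetical key name

    def __init__(self, store, ttl=30):
        self.store = store
        self.ttl = ttl
        self.id = uuid.uuid4().hex  # unique key identifying this scheduler
        self.active = False

    def tick(self):
        """Call periodically, more often than ttl. Returns True while active."""
        if self.active:
            # Keep the lock for as long as we are alive by renewing its expiry.
            self.active = self.store.extend_if_owner(self.LOCK_KEY, self.id, self.ttl)
        if not self.active:
            # Standby path: succeeds only if the previous lock has expired.
            self.active = self.store.set_if_absent(self.LOCK_KEY, self.id, self.ttl)
        return self.active

    def shutdown(self):
        # Clean deregistration: free the lock so a standby can take over at once.
        if self.active:
            self.store.delete_if_owner(self.LOCK_KEY, self.id)
        self.active = False
```

The idea is that each scheduler calls `tick()` in its main loop and only enqueues jobs when it returns `True`. A crashed scheduler simply stops renewing, and a standby picks up the lock once the TTL runs out.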
Please let me know if a PR like this would be appreciated (via a thumbs-up emoji), or share your thoughts on this, @selwin.