- Notifications
You must be signed in to change notification settings - Fork1
Ruby implementation of a distributed lock based on Google Cloud Storage
License
FooBarWidget/distributed-lock-google-cloud-storage-ruby
Folders and files
| Name | Name | Last commit message | Last commit date | |
|---|---|---|---|---|
Repository files navigation
This is a Ruby implementation of adistributed locking algorithm based on Google Cloud Storage. This means the algorithm uses Google Cloud Storage to coordinate concurrency and to store shared state.
A distributed lock is like a regular Mutex, but works across processes and machines. It's useful for making distributed (multi-machine) workloads concurrency-safe.
One concrete use case in CI/CD pipelines. For example,Fullstaq Ruby's CI/CD pipelinebuilds and publishes APT/YUM packages to shared storage. Because multiple CI runs can be active concurrently, I needed something to ensure that they don't corrupt each others' work by writing to the same shared storage concurrently.
There are many ways to manage concurrency. Work queues is a popular method. But in Fullstaq Ruby's case, that requires setting up additional infrastructure and/or complicating CI/CD code. I deemed using a distributed lock to be the simplest solution.
This distributed lock ishigh-latency. A locking operation's average speed is in the order of hundreds of milliseconds. In the worst case (when the algorithm detects that a client left without releasing the lock), it could take several tens of seconds to several minutes to obtain a lock, depending on the specific timing settings. Use this lock only if your workload can tolerate such a latency.
TLDR: it's pretty reliable.
We use Google Cloud Storage for storing shared locking state and for coordinating concurrency primitives. Thus, our reliability depends partially on Google Cloud Storage's own availability. Google Cloud Storage's availability reputation is pretty good.
The algorithm is designed to automatically recover from any error condition (besides Google Cloud Storage availability) that I could think of. Thus, this distributed lock islow-maintenance: things should just work. Just keep in mind that auto-recovery could take several minutes depending on the specific timing settings.
The most important error condition is that clients of the lock could leave unexpectedly (e.g. a crash), without explicitly releasing the lock. Many other implementations don't handle this situation well: they freeze up, requiring an administrator to manually release the lock. We recover automatically from this situation. See alsohow I designed this algorithm.
Other distributed locks based on Google Cloud Storage include:
In designing my own algorithm, I've carefully examined the above alternatives (and more). They all have various issues, including being prone to crashes, utilizing unbounded backoff, being prone to unintended lock releases due to networking issues, and more. My algorithm addresses all issues not addressed by the above alternatives. Please readthe algorithm design to learn more.
Redis is also a popular system for building distributed locks on top of (for exampleRedlock). Like our lock, Redis-based locks often support auto-recovery of stale locks via timeouts. But they are often unsafe because they don't sufficiently take into account for the fact that a lock can become stale due to long-running operations, arbitrary network and compute delays, etc. Martin Kleppmann haswritten an extensive critique about this issue.
Our lock addresses this issue through the use oflock refreshing and lock healthchecking.
Add to your Gemfile:
gem'distributed-lock-google-cloud-storage'
See also thefull API docs
Initialize a Lock instance. It must be backed by a Google Cloud Storage bucket and object. Then do your work within a#synchronize block.
Important: If your work is a long-running operation, then also be sure to call#check_health!periodically to check whether the lock is still healthy. This call throws an exception if it's not healthy. Learn more inLong-running operations, lock refreshing and lock health checking.
require'distributed-lock-google-cloud-storage'lock=DistributedLock::GoogleCloudStorage::Lock.new(bucket_name:'your bucket name',path:'locks/mywork')lock.synchronizedodo_some_work# IMPORTANT: _periodically_ call this!lock.check_health!do_more_workend
We useApplication Default Credentials by default. If you don't want that, then pass acloud_storage_options argument to the constructor, in which you set thecredentials option.
DistributedLock::GoogleCloudStorage::Lock(bucket_name:'your bucket name',path:'locks/mywork',cloud_storage_options:{credentials:'/path/to/keyfile.json'})
credentials is anything accepted byGoogle::Cloud::Storage.new'scredentials parameter, which currently means it's one of these:
- A String: a path to a keyfile.
- A Hash: the contents of a keyfile.
- A
Google::Auth::Credentialsobject.
A lock is considered taken only if there's a corresponding object in Cloud Storage. Releasing the lock means deleting the object. A client could sometimes fail to delete the object — for example because of a crash, a freeze or a network problem. We automatically recover from this situation by putting atime-to-live (TTL) value on the object. If the object is older than its TTL value, then we consider the lock to bestale, and we'll automatically clean it up next time a client tries to obtain the lock.
The TTL value is configurable via thettl parameter in the constructor. A lower TTL value allows faster recovery from stale locks, but has a higher risk of incorrectly detecting lock staleness. For example: maybe the original owner of the lock was only temporarily frozen because it lacked CPU time.
So the TTL should be generous, in the order of minutes. The default is 5 minutes.
If you perform an operation inside the lock thatmight take longer than the TTL, then we call that along-running operation. Performing such long-running operations is safe: you generally don't have to worry about the lock become stale during the operation. But you need to be aware of caveats.
We support long-running operations byrefreshing the lock's timestamp once in a while so that the lock does not become stale. This refreshing happens automatically in a background thread. The behavior of this refreshing operation is configurable through therefresh_interval andmax_refresh_fails parameters in the constructor.
Refreshingcould fail, for example because of a network delays, or because some other clientstill concluded that the lock is stale and took over ownership of the lock, or because something else unexpected happened to the object. If refreshing fails too many times consecutively, then we declare the lock asunhealthy.
Declaring unhealthiness is an asynchronous event, and does not directly affect your code's flow. We won't abort your code or force it to raise an exception. Instead, your code should periodically check whether the lock has been declared unhealthy. Once you detect it, you mustimmediately abort work, because another client could have taken over the lock's ownership by now.
Aborting work is easier said than done, and comes with its own caveats. You should therefore readthis discussion.
There are two ways to check whether the lock is still healthy:
- By calling
#check_health!, which might throw aDistributedLock::GoogleCloudStorage::LockUnhealthyError. - By calling
#healthy?, which returns a boolean.
Both methods are cheap, and internally only check for a flag. So it's fine to call these methods inside hot loops.
Instant recovery works as follows: if a client A crashes (and fails to release the lock) and restarts, and in the mean time the lock hasn't been taken by another client B, then client A should be able to instantly retake onwership of the lock.
Instant recovery is distinct from the normal TTL-based auto-recovery mechanism. Instant recovery doesn't have to wait for the TTL to expire, nor does it come with the risk of incorrectly detecting the lock as stale. However, the situations in way instant recovery can be applied, is more limited.
Instant recovery works through the use ofinstance identities. The instance identity is what the locking algorithm uses to uniquely identify clients.
The identity is unique on a per-thread basis, which makes the lock thread-safe. It also means that in order for instant recovery to work, the same thread that crashed (and failed to release the lock) has to restart its operation and attempt to obtain the lock again.
By default, we log info, warning and error messages to stderr. If you want logs to be handled differently, then set thelogger parameter in the constructor. For example, to only log warnings and errors:
logger=Logger.new($stderr)logger.level=Logger::WARNDistributedLock::GoogleCloudStorage::Lock(bucket_name:'your bucket name',path:'locks/mywork',logger:logger,)
Do you have a problem with the lock and do you want to know why? Then enable debug logging. For example:
logger=Logger.new($stderr)logger.level=Logger::DEBUGDistributedLock::GoogleCloudStorage::Lock(bucket_name:'your bucket name',path:'locks/mywork',logger:logger,)
The full API docs arehere.
Please read theContribution guide.
About
Ruby implementation of a distributed lock based on Google Cloud Storage
Topics
Resources
License
Contributing
Uh oh!
There was an error while loading.Please reload this page.
Stars
Watchers
Forks
Uh oh!
There was an error while loading.Please reload this page.