FooBarWidget/distributed-lock-google-cloud-storage-rubyPublic

NotificationsYou must be signed in to change notification settings
Fork1
Star18

Ruby implementation of a distributed lock based on Google Cloud Storage

License

MIT license

18 stars 1 fork Branches Tags Activity

Star

Notifications

You must be signed in to change notification settings

Branches Tags

Folders and files

Name		Name	Last commit message	Last commit date
Latest commit History 36 Commits
.github/workflows		.github/workflows
lib		lib
spec		spec
.ecrc		.ecrc
.editorconfig		.editorconfig
.gitignore		.gitignore
.rspec		.rspec
.rubocop.yml		.rubocop.yml
.yardopts		.yardopts
CONTRIBUTING.md		CONTRIBUTING.md
Gemfile		Gemfile
Gemfile.lock		Gemfile.lock
LICENSE.md		LICENSE.md
README.md		README.md
Rakefile		Rakefile
distributed-lock-google-cloud-storage.gemspec		distributed-lock-google-cloud-storage.gemspec

Repository files navigation

Distributed lock based on Google Cloud Storage

Introduction

This is a Ruby implementation of adistributed locking algorithm based on Google Cloud Storage. This means the algorithm uses Google Cloud Storage to coordinate concurrency and to store shared state.

What's this for?

A distributed lock is like a regular Mutex, but works across processes and machines. It's useful for making distributed (multi-machine) workloads concurrency-safe.

One concrete use case in CI/CD pipelines. For example,Fullstaq Ruby's CI/CD pipelinebuilds and publishes APT/YUM packages to shared storage. Because multiple CI runs can be active concurrently, I needed something to ensure that they don't corrupt each others' work by writing to the same shared storage concurrently.

There are many ways to manage concurrency. Work queues is a popular method. But in Fullstaq Ruby's case, that requires setting up additional infrastructure and/or complicating CI/CD code. I deemed using a distributed lock to be the simplest solution.

When to use

This distributed lock ishigh-latency. A locking operation's average speed is in the order of hundreds of milliseconds. In the worst case (when the algorithm detects that a client left without releasing the lock), it could take several tens of seconds to several minutes to obtain a lock, depending on the specific timing settings. Use this lock only if your workload can tolerate such a latency.

Reliability

TLDR: it's pretty reliable.

We use Google Cloud Storage for storing shared locking state and for coordinating concurrency primitives. Thus, our reliability depends partially on Google Cloud Storage's own availability. Google Cloud Storage's availability reputation is pretty good.

The algorithm is designed to automatically recover from any error condition (besides Google Cloud Storage availability) that I could think of. Thus, this distributed lock islow-maintenance: things should just work. Just keep in mind that auto-recovery could take several minutes depending on the specific timing settings.

The most important error condition is that clients of the lock could leave unexpectedly (e.g. a crash), without explicitly releasing the lock. Many other implementations don't handle this situation well: they freeze up, requiring an administrator to manually release the lock. We recover automatically from this situation. See alsohow I designed this algorithm.

Comparisons with alternatives

Compared to other Cloud Storage locks

Other distributed locks based on Google Cloud Storage include:

In designing my own algorithm, I've carefully examined the above alternatives (and more). They all have various issues, including being prone to crashes, utilizing unbounded backoff, being prone to unintended lock releases due to networking issues, and more. My algorithm addresses all issues not addressed by the above alternatives. Please readthe algorithm design to learn more.

Compared to Redis locks

Redis is also a popular system for building distributed locks on top of (for exampleRedlock). Like our lock, Redis-based locks often support auto-recovery of stale locks via timeouts. But they are often unsafe because they don't sufficiently take into account for the fact that a lock can become stale due to long-running operations, arbitrary network and compute delays, etc. Martin Kleppmann haswritten an extensive critique about this issue.

Our lock addresses this issue through the use oflock refreshing and lock healthchecking.

Installation

Add to your Gemfile:

gem'distributed-lock-google-cloud-storage'

Usage

See also thefull API docs

Initialize a Lock instance. It must be backed by a Google Cloud Storage bucket and object. Then do your work within a#synchronize block.

Important: If your work is a long-running operation, then also be sure to call#check_health!periodically to check whether the lock is still healthy. This call throws an exception if it's not healthy. Learn more inLong-running operations, lock refreshing and lock health checking.

require'distributed-lock-google-cloud-storage'lock=DistributedLock::GoogleCloudStorage::Lock.new(bucket_name:'your bucket name',path:'locks/mywork')lock.synchronizedodo_some_work# IMPORTANT: _periodically_ call this!lock.check_health!do_more_workend

Google Cloud authentication

We useApplication Default Credentials by default. If you don't want that, then pass acloud_storage_options argument to the constructor, in which you set thecredentials option.

DistributedLock::GoogleCloudStorage::Lock(bucket_name:'your bucket name',path:'locks/mywork',cloud_storage_options:{credentials:'/path/to/keyfile.json'})

credentials is anything accepted byGoogle::Cloud::Storage.new'scredentials parameter, which currently means it's one of these:

A String: a path to a keyfile.
A Hash: the contents of a keyfile.
AGoogle::Auth::Credentials object.

Auto-recovery from stale locks

A lock is considered taken only if there's a corresponding object in Cloud Storage. Releasing the lock means deleting the object. A client could sometimes fail to delete the object — for example because of a crash, a freeze or a network problem. We automatically recover from this situation by putting atime-to-live (TTL) value on the object. If the object is older than its TTL value, then we consider the lock to bestale, and we'll automatically clean it up next time a client tries to obtain the lock.

The TTL value is configurable via thettl parameter in the constructor. A lower TTL value allows faster recovery from stale locks, but has a higher risk of incorrectly detecting lock staleness. For example: maybe the original owner of the lock was only temporarily frozen because it lacked CPU time.

So the TTL should be generous, in the order of minutes. The default is 5 minutes.

Long-running operations, lock refreshing and lock health checking

If you perform an operation inside the lock thatmight take longer than the TTL, then we call that along-running operation. Performing such long-running operations is safe: you generally don't have to worry about the lock become stale during the operation. But you need to be aware of caveats.

We support long-running operations byrefreshing the lock's timestamp once in a while so that the lock does not become stale. This refreshing happens automatically in a background thread. The behavior of this refreshing operation is configurable through therefresh_interval andmax_refresh_fails parameters in the constructor.

Refreshingcould fail, for example because of a network delays, or because some other clientstill concluded that the lock is stale and took over ownership of the lock, or because something else unexpected happened to the object. If refreshing fails too many times consecutively, then we declare the lock asunhealthy.

Declaring unhealthiness is an asynchronous event, and does not directly affect your code's flow. We won't abort your code or force it to raise an exception. Instead, your code should periodically check whether the lock has been declared unhealthy. Once you detect it, you mustimmediately abort work, because another client could have taken over the lock's ownership by now.

Aborting work is easier said than done, and comes with its own caveats. You should therefore readthis discussion.

There are two ways to check whether the lock is still healthy:

By calling#check_health!, which might throw aDistributedLock::GoogleCloudStorage::LockUnhealthyError.
By calling#healthy?, which returns a boolean.

Both methods are cheap, and internally only check for a flag. So it's fine to call these methods inside hot loops.

Instant recovery from stale locks

Instant recovery works as follows: if a client A crashes (and fails to release the lock) and restarts, and in the mean time the lock hasn't been taken by another client B, then client A should be able to instantly retake onwership of the lock.

Instant recovery is distinct from the normal TTL-based auto-recovery mechanism. Instant recovery doesn't have to wait for the TTL to expire, nor does it come with the risk of incorrectly detecting the lock as stale. However, the situations in way instant recovery can be applied, is more limited.

How instant recovery works: instance identities

Instant recovery works through the use ofinstance identities. The instance identity is what the locking algorithm uses to uniquely identify clients.

The identity is unique on a per-thread basis, which makes the lock thread-safe. It also means that in order for instant recovery to work, the same thread that crashed (and failed to release the lock) has to restart its operation and attempt to obtain the lock again.

Logging

By default, we log info, warning and error messages to stderr. If you want logs to be handled differently, then set thelogger parameter in the constructor. For example, to only log warnings and errors:

logger=Logger.new($stderr)logger.level=Logger::WARNDistributedLock::GoogleCloudStorage::Lock(bucket_name:'your bucket name',path:'locks/mywork',logger:logger,)

Troubleshooting

Do you have a problem with the lock and do you want to know why? Then enable debug logging. For example:

logger=Logger.new($stderr)logger.level=Logger::DEBUGDistributedLock::GoogleCloudStorage::Lock(bucket_name:'your bucket name',path:'locks/mywork',logger:logger,)

API docs

The full API docs arehere.

Contributing

Please read theContribution guide.

About

Ruby implementation of a distributed lock based on Google Cloud Storage

Releases2

Version 1.1.0 Latest

Sep 11, 2021

+ 1 release

Languages

Ruby100.0%

Movatterモバイル変換

License

FooBarWidget/distributed-lock-google-cloud-storage-ruby

Folders and files

Latest commit

History

Repository files navigation

Distributed lock based on Google Cloud Storage

Introduction

What's this for?

When to use

Reliability

Comparisons with alternatives

Compared to other Cloud Storage locks

Compared to Redis locks

Installation

Usage

Google Cloud authentication

Auto-recovery from stale locks

Long-running operations, lock refreshing and lock health checking

Instant recovery from stale locks

How instant recovery works: instance identities

Logging

Troubleshooting

API docs

Contributing

About

Topics

Resources

License

Contributing

Uh oh!

Stars

Watchers

Forks

Releases2

Uh oh!

Languages