# goresilience

A library to improve the resilience of Go applications in an easy and flexible way.
Goresilience is a Go toolkit to increase the resilience of applications. At its core it is inspired by hystrix and similar libraries, but at the same time it is very different:
- Increase resilience of the programs.
- Easy to extend, test and with clean design.
- Go idiomatic.
- Use the decorator pattern (middleware), like Go's http.Handler does.
- Ability to create custom resilience flows, simple, advanced, specific... by combining different runners in chains.
- Safe defaults.
- Not coupled to any framework/library.
- Prometheus/Openmetrics metrics as first class citizen.
- Motivation
- Getting started
- Static Runners
- Adaptive Runners
- Other
- Architecture
- Extend using your own runners
## Motivation

You are wondering, why another circuit breaker library...?

Well, this is not a circuit breaker library. It is true that Go has some good circuit breaker libraries (like sony/gobreaker, afex/hystrix-go or rubyist/circuitbreaker), but there is a lack of a resilience toolkit that is easy to extend and customize and that establishes a design that can be built upon; that's why goresilience was born.
The aim of goresilience is to provide resilience runners that can be combined or used independently depending on the nature of the execution logic (complex, simple, performance critical, very reliable...).

Also, one of the key parts of goresilience is the ability to create new runners yourself and use them in combination with the bulkhead, the circuit breaker or any of the runners of this library or from others.
## Getting started

The usage of the library is simple. Everything is based on the `Runner` interface.

The runners can be used in two ways, in standalone mode (one runner):
```go
package main

import (
	"context"
	"log"
	"time"

	"github.com/slok/goresilience/timeout"
)

func main() {
	// Create our command.
	cmd := timeout.New(timeout.Config{
		Timeout: 100 * time.Millisecond,
	})

	for i := 0; i < 200; i++ {
		// Execute.
		result := ""
		err := cmd.Run(context.TODO(), func(_ context.Context) error {
			if time.Now().Nanosecond()%2 == 0 {
				time.Sleep(5 * time.Second)
			}
			result = "all ok"
			return nil
		})

		if err != nil {
			result = "not ok, but fallback"
		}

		log.Printf("the result is: %s", result)
	}
}
```
or combined in a chain of multiple runners using runner middlewares. In this example the execution will be retried, timed out and concurrency controlled using a runner chain:
```go
package main

import (
	"context"
	"errors"
	"fmt"

	"github.com/slok/goresilience"
	"github.com/slok/goresilience/bulkhead"
	"github.com/slok/goresilience/retry"
	"github.com/slok/goresilience/timeout"
)

func main() {
	// Create our execution chain.
	cmd := goresilience.RunnerChain(
		bulkhead.NewMiddleware(bulkhead.Config{}),
		retry.NewMiddleware(retry.Config{}),
		timeout.NewMiddleware(timeout.Config{}),
	)

	// Execute.
	calledCounter := 0
	result := ""
	err := cmd.Run(context.TODO(), func(_ context.Context) error {
		calledCounter++
		if calledCounter%2 == 0 {
			return errors.New("you didn't expect this error")
		}
		result = "all ok"
		return nil
	})

	if err != nil {
		result = "not ok, but fallback"
	}

	fmt.Printf("result: %s", result)
}
```
As you can see, you can create any combination of resilient execution flows by combining the different runners of the toolkit.
## Static Runners

Static runners are the ones that are based on a static configuration and don't change based on the environment (unlike the adaptive ones).
### Timeout

This runner is based on the timeout pattern: it will execute the `goresilience.Func`, but if the execution lasts longer than the configured timeout duration it will return a timeout error.

Check the example.
### Retry

This runner is based on the retry pattern: it will retry the execution of the `goresilience.Func` up to N times if it fails.

It will use an exponential backoff with some jitter (for more information check this).

Check the example.
### Bulkhead

This runner is based on the bulkhead pattern: it will control the concurrency of the `goresilience.Func` executions that use the same runner.

It can also time out if a `goresilience.Func` has been waiting too long on the execution queue.

Check the example.
### Circuit breaker

This runner is based on the circuit breaker pattern: it will store the results of the executed `goresilience.Func` in N buckets of T duration each, and change the state of the circuit based on those measured metrics.

Check the example.
### Chaos

This runner is based on failure injection of errors and latency. It will inject those failures on the required executions (based on a percentage, or on all of them).

Check the example.
## Adaptive Runners

### Concurrency limit

Concurrency limit is based on Netflix's concurrency-limits library. It tries to implement the same features, but for the goresilience library (and compatible with the other runners).

It limits the concurrency with less configuration, adapting to the environment it is running in at each moment: hardware, load...

This Runner will limit the concurrency (like bulkhead), but it will use different TCP congestion control algorithms to adapt the concurrency limit based on errors and latency.
The Runner is based on 4 components:

- Limiter: measures and calculates the limit of concurrency based on different selectable algorithms, for example AIMD.
- Executor: executes the `goresilience.Func` itself; it has different queuing implementations that will prioritize and drop executions depending on the implementation.
- Runner: the runner itself that will be used by the user; it is the glue between the `Limiter` and the `Executor`. It has a policy that treats the execution result as an error, a success or an ignore for the Limiter algorithm.
- Result policy: a function that can be configured on the concurrencylimit Runner. It receives the result of the executed function and returns a result for the limit algorithm. This policy is responsible for telling the limit algorithm whether the received error should count as a success, count as a failure or be ignored in the calculation of the concurrency limit. For example: only count the errors that were 502s and ignore all the others.
Check the AIMD example. Check the CoDel example.
#### Executors

- `FIFO`: the default executor; it executes the queued jobs in first-in-first-out order and also has a queue wait timeout.
- `LIFO`: executes the queued jobs in last-in-first-out order and also has a queue wait timeout.
- `AdaptiveLIFOCodel`: implementation of Facebook's CoDel + adaptive LIFO algorithm. This executor is used with the `Static` limiter.
#### Limiters

- `Static`: sets a constant limit that will not change.
- `AIMD`: based on the AIMD TCP congestion control algorithm. It increases the limit at a constant rate, and when congestion occurs (by timeout or result failure) it decreases the limit by a configured factor.
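The AIMD update rule itself is tiny and can be sketched in a few lines (an illustration of the algorithm, not the library's limiter; `aimdLimit` is a made-up helper): grow the limit additively while executions succeed, and cut it multiplicatively when congestion is detected.

```go
package main

import "fmt"

// aimdLimit applies one AIMD step: additive increase (+1) on success,
// multiplicative decrease by backoffRatio on failure, never below 1.
func aimdLimit(limit int, success bool, backoffRatio float64) int {
	if success {
		return limit + 1
	}
	next := int(float64(limit) * backoffRatio)
	if next < 1 {
		next = 1
	}
	return next
}

func main() {
	limit := 10
	for _, ok := range []bool{true, true, false, true} {
		limit = aimdLimit(limit, ok, 0.5)
	}
	fmt.Println(limit) // 10 -> 11 -> 12 -> 6 -> 7, prints 7
}
```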
#### Result policies

- `FailureOnExternalErrorPolicy`: treats as failure every error that does not come from the concurrencylimit package.
- `NoFailurePolicy`: never returns a failure, just ignores errors when they occur; this can be used to adapt only on RTT/latency.
- `FailureOnRejectedPolicy`: treats as failure every execution that has been rejected with an `errors.ErrRejectedExecution` error.
## Other

### Metrics

All the runners can be measured using a `metrics.Recorder`, but instead of passing it to every runner, the runners will try to get this recorder from the context. So you can wrap any runner using `metrics.NewMiddleware` and it will activate metrics support on the wrapped runners. This should be the first runner of the chain.
At this moment only Prometheus is supported.

In this example the runners are measured.
Measuring always has a performance hit (not too high); in most cases it is not a problem, but there is a benchmark to see what the numbers are:
```text
BenchmarkMeasuredRunner/Without_measurement_(Dummy).-4      300000     6580 ns/op    677 B/op    12 allocs/op
BenchmarkMeasuredRunner/With_prometheus_measurement.-4      200000    12901 ns/op    752 B/op    15 allocs/op
```
### Hystrix-like

Using the different runners, a hystrix-like flow can be obtained. You can see a simple example of how it can be done in this example.
### HTTP middleware

Creating HTTP middlewares with goresilience runners is simple and clean. You can see an example of how it can be done in this example. The example shows how you can protect the server by load shedding, using an adaptive concurrencylimit `goresilience.Runner`.
## Architecture

At its core, goresilience is based on a very simple idea: the `Runner` interface. A `Runner` is the unit of execution; it accepts a `context.Context` and a `goresilience.Func`, and returns an `error`.
The idea of the Runner is the same as Go's `http.Handler`: having an interface, you can create chains of runners, also known as middlewares (also called the decorator pattern).
The library comes with decorators called `Middleware` that return a function wrapping one runner with another. This gives us the ability to create resilient execution flows, wrapping any runner to customize it with the pieces we want, including custom ones not in this library.
This way we could create an execution flow like this example:

```text
Circuit breaker
└── Timeout
    └── Retry
```
## Extend using your own runners

To create your own runner, you need to have 2 things in mind:

- Implement the `goresilience.Runner` interface.
- Provide constructors to get a `goresilience.Middleware`, so your `Runner` can be chained with other `Runner`s.
In this example (full example here) we create a new resilience runner for chaos engineering that will fail at a constant rate set by the `Config.FailEveryTimes` setting.
Following the library convention, with `NewFailer` we get the standalone Runner (the one that is not chainable), and with `NewFailerMiddleware` we get a `Middleware` that can be used with `goresilience.RunnerChain` to chain with other Runners.
Note: we can use `nil` on `New` because `NewMiddleware` uses `goresilience.SanitizeRunner`, which will return a valid Runner as the last part of the chain in case of being `nil` (for more information about this check `goresilience.command`).
```go
// Config is the configuration of constFailer.
type Config struct {
	// FailEveryTimes will make the runner return an error every N executed times.
	FailEveryTimes int
}

// New is like NewFailerMiddleware but will not wrap any other runner; it is standalone.
func New(cfg Config) goresilience.Runner {
	return NewMiddleware(cfg)(nil)
}

// NewMiddleware returns a new middleware that will wrap runners and will fail
// every N times of executions.
func NewMiddleware(cfg Config) goresilience.Middleware {
	return func(next goresilience.Runner) goresilience.Runner {
		calledTimes := 0
		// Use the RunnerFunc helper so we don't need to create a new type.
		return goresilience.RunnerFunc(func(ctx context.Context, f goresilience.Func) error {
			// We should lock the counter writes, not done because this is an example.
			calledTimes++

			if calledTimes == cfg.FailEveryTimes {
				calledTimes = 0
				return fmt.Errorf("failed due to %d call", cfg.FailEveryTimes)
			}

			// Run using the chain.
			next = goresilience.SanitizeRunner(next)
			return next.Run(ctx, f)
		})
	}
}
```