last9/slo-computer


SLOs, error windows, and alerts are complicated. Here is an attempt to make them easy.


SLO Computer

Note

@last9 advocates using Service Level Objectives. One of the biggest challenges we run into is the lack of practical algorithms behind burn rate and alerting. This is our first attempt at it.

What is SLO Computer?

SLO Computer simplifies the complex world of Service Level Objectives (SLOs), error budgets, and alerting.

SLOs, error windows, burn rates, and budget spend are convoluted terms that can throw anyone off. Even the SRE workbook by Google can leave you with a lot of open questions. We continue to be amazed by how widely misunderstood this topic is (and how much easier it can make your life when used well).

This toolkit helps SREs and DevOps engineers:

Calculate appropriate alert thresholds based on service throughput and desired SLO targets
Determine if a service has enough traffic to benefit from SLO-based alerting
Generate alert policies for AWS burstable CPU instances

Installation and Setup

Prerequisites

Go 1.16 or later

Building from Source

    # Clone the repository
    git clone https://github.com/last9/slo-computer.git
    cd slo-computer

    # Build using Make
    make build

Quick Start

The project includes a Makefile with helpful commands:

    # Build the application
    make build

    # Run tests
    make test

    # Run an example service SLO calculation
    make example-service

    # Run an example CPU burst calculation
    make example-cpu

    # See all available commands
    make help

Usage

    usage: slo [<flags>] <command> [<args> ...]

    Last9 SLO toolkit

    Flags:
      --help     Show context-sensitive help (also try --help-long and --help-man).
      --version  Show application version.

    Commands:
      help [<command>...]
        Show help.

      suggest --throughput=THROUGHPUT --slo=SLO --duration=DURATION
        suggest alerts based on service throughput and SLO duration

      cpu-suggest --instance=INSTANCE --utilization=UTILIZATION
        suggest alerts based on CPU utilization and Instance type

Command Parameters

`suggest` Command

--throughput: Number of requests per minute your service handles
--slo: Your desired SLO percentage (e.g., 99.9)
--duration: SLO time period in hours (e.g., 720 for 30 days)

`cpu-suggest` Command

--instance: AWS instance type (e.g., t3.micro, t3a.xlarge)
--utilization: Average CPU utilization percentage (0-100)

The goal of these commands is to factor in some "bare minimum" input to:

Determine if this is a low traffic service where an SLO approach makes little sense
Compute the actual alert values and conditions to set alerts on

Examples

Service SLO Alerts

Q: What alerts should I set for my service to achieve 99.9% availability over 30 days?

./slo-computer suggest --throughput=4200 --slo=99.9 --duration=720

Output:

    Alert if error_rate > 0.002 for last [24h0m0s] and also last [2h0m0s]
    This alert will trigger once 6.67% of error budget is consumed,
    and leaves 360h0m0s before the SLO is defeated.

    Alert if error_rate > 0.010 for last [1h0m0s] and also last [5m0s]
    This alert will trigger once 1.39% of error budget is consumed,
    and leaves 72h0m0s before the SLO is defeated.

Q: What about a low-traffic service?

./slo-computer suggest --throughput=100 --slo=99.9 --duration=168

Output:

    slo-computer: error:
    If this service reported 10.000 errors for a duration of 5m0s
    SLO (for the entire duration) will be defeated wihin 1h40m47s

    Probably
    - Use ONLY spike alert model, and not SLOs (easiest)
    - Reduce the MTTR for this service (toughest)
    - SLO is too aggressive and can be lowered (business decision)
    - Combine multiple services into one single service (team wide)

CPU Burst Credit Alerts

Q: What alerts should I set for my AWS burstable instance?

./slo-computer cpu-suggest --instance=t3a.xlarge --utilization=15

Output:

    Alert if 100.00 % consumption sustains for 10m0s AND recent 5m0s.
    At this rate, burst credits will deplete after 10h0m0s

    Alert if 80.00 % consumption sustains for 3h45m0s AND recent 55m0s.
    At this rate, burst credits will deplete after 15h0m0s

Understanding the Results

For Service SLOs

The tool generates two types of alerts:

Slow burn alert: Detects gradual error rate increases that would eventually exhaust your error budget
Fast burn alert: Detects sudden spikes in error rates that require immediate attention
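Both alert types pair a long window with a short one, as in "for last [24h0m0s] and also last [2h0m0s]". The following is a minimal sketch of how such a two-window condition could be evaluated in a monitoring job. The `WindowedAlert` type and its methods are hypothetical names made up for this illustration and are not part of the slo package; the request counts are invented sample data.

```go
package main

import "fmt"

// WindowedAlert is a hypothetical type for this sketch: an error-rate
// threshold that must be exceeded over both a long and a short window.
type WindowedAlert struct {
	Threshold float64 // error rate, e.g. 0.002 means 0.2% of requests failed
}

// ErrorRate returns errors/total, treating a window with no traffic as 0.
func ErrorRate(errors, total float64) float64 {
	if total == 0 {
		return 0
	}
	return errors / total
}

// Fires reports whether the error rate exceeds the threshold in BOTH windows.
// The long window shows the problem is significant; the short window shows
// it is still happening, so the alert resets soon after recovery.
func (a WindowedAlert) Fires(longRate, shortRate float64) bool {
	return longRate > a.Threshold && shortRate > a.Threshold
}

func main() {
	slow := WindowedAlert{Threshold: 0.002}

	// At 4200 requests/minute: 6,048,000 requests in 24h, 504,000 in 2h.
	// Errors are elevated over 24h and still elevated in the last 2h.
	fmt.Println(slow.Fires(ErrorRate(15000, 6048000), ErrorRate(1500, 504000)))

	// Errors were elevated earlier, but the last 2h is healthy again.
	fmt.Println(slow.Fires(ErrorRate(15000, 6048000), ErrorRate(100, 504000)))
}
```

Requiring both windows is a common design choice for burn-rate alerts: the short window keeps a resolved incident from paging for the rest of the long window.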

Each alert includes:

The error rate threshold to monitor
The time windows to evaluate
How much of your error budget would be consumed when the alert triggers
How much time remains before your SLO is breached if the error rate continues

For CPU Burst Credits

The tool generates alerts that help you monitor when your AWS burstable instance might run out of CPU credits:

Alert thresholds for different CPU utilization levels
Time windows to monitor
Time until credit depletion at the current rate

Key Concepts

Service SLOs

Throughput: The number of requests your service handles per minute
SLO: Your Service Level Objective (e.g., 99.9% availability)
Duration: The time period for your SLO in hours (e.g., 720 for 30 days)
Error Budget: The amount of allowable errors within your SLO period (calculated as (100% - SLO%) * total requests)
Burn Rate: How quickly you're consuming your error budget relative to the expected rate
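The error budget formula above can be worked through directly from the three inputs of the suggest command. The helper names below are hypothetical and exist only for this sketch; they are not part of the slo package.

```go
package main

import "fmt"

// TotalRequests returns the number of requests served over the SLO period,
// given throughput in requests per minute and the period in hours.
func TotalRequests(perMinute, hours float64) float64 {
	return perMinute * 60 * hours
}

// ErrorBudget returns the number of failed requests the SLO tolerates
// over the period: (100% - SLO%) * total requests.
func ErrorBudget(perMinute, sloPercent, hours float64) float64 {
	return (100 - sloPercent) / 100 * TotalRequests(perMinute, hours)
}

func main() {
	// High-traffic example: 4200 requests/minute, 99.9% over 720 hours.
	fmt.Printf("total requests: %.0f\n", TotalRequests(4200, 720))
	fmt.Printf("error budget:   %.0f\n", ErrorBudget(4200, 99.9, 720))

	// Low-traffic example: 100 requests/minute, 99.9% over 168 hours.
	fmt.Printf("error budget:   %.0f\n", ErrorBudget(100, 99.9, 168))
}
```

The first service has a budget of about 181,440 failed requests for the month, while the low-traffic service has only about 1,008 for the week, which is why a brief burst of errors can consume a large share of a small budget.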

CPU Burst Credits

Instance: AWS burstable instance type (T2, T3, T4g families)
Utilization: Average CPU utilization percentage
Credit Rate: How quickly the instance earns CPU credits
Baseline Performance: The CPU performance level the instance can sustain indefinitely

Using as a Library

You can also use SLO Computer as a library in your Go projects:

    import (
    	"time"

    	"github.com/last9/slo-computer/slo"
    )

    // Create a new SLO
    s, err := slo.NewSLO(
    	time.Duration(720)*time.Hour, // SLO period of 30 days
    	4200,                         // 4200 requests per minute
    	99.9,                         // 99.9% availability target
    )

    // Calculate alerts
    alerts := slo.AlertCalculator(s)

    // For CPU burst calculations
    cc := slo.InstanceCapacity("t3.micro")
    b, err := slo.NewBurstCPU(cc, 75.0) // 75% utilization
    burstAlerts := slo.BurstCalculator(b)

Troubleshooting

Common Errors

Error: "strconv.ParseFloat: parsing "SLO": invalid syntax"
Make sure to replace "SLO" with an actual number (e.g., 99.9) in your command:

    # Incorrect
    ./slo-computer suggest --throughput=1000000 --slo=SLO --duration=90

    # Correct
    ./slo-computer suggest --throughput=1000000 --slo=99.9 --duration=90

Error about low traffic services
If you receive a message about your service being low-traffic, consider:

Using spike-based alerting instead of SLO-based alerting
Combining multiple services to increase the traffic volume
Lowering your SLO target to a more achievable level

Roadmap

We're actively working on improving SLO Computer. Check out our roadmap:

Open Issues - Planned improvements and bug fixes
Feature Enhancements - Upcoming features and user experience improvements

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

Fork the repository
Create your feature branch (git checkout -b feature/amazing-feature)
Commit your changes (git commit -m 'Add some amazing feature')
Push to the branch (git push origin feature/amazing-feature)
Open a Pull Request

About Last9

This project is sponsored and maintained by Last9. Last9 is a telemetry data platform.
