last9/slo-computer
Note
@last9 advocates using Service Level Objectives. One of the biggest challenges we run into is the lack of practical algorithms behind Burn Rate and alerting. This is our first attempt at it.
SLO Computer simplifies the complex world of Service Level Objectives (SLOs), error budgets, and alerting.
SLOs, error windows, burn rates, and budget spend are convoluted terms that can throw anyone off. Even the SRE workbook by Google can leave you with a lot of open questions. We continue to be amazed by how widely misunderstood this topic is (and how much easier it can make your life when used well).
This toolkit helps SREs and DevOps engineers:
- Calculate appropriate alert thresholds based on service throughput and desired SLO targets
- Determine if a service has enough traffic to benefit from SLO-based alerting
- Generate alert policies for AWS burstable CPU instances
- Go 1.16 or later
    # Clone the repository
    git clone https://github.com/last9/slo-computer.git
    cd slo-computer

    # Build using Make
    make build
The project includes a Makefile with helpful commands:
    # Build the application
    make build

    # Run tests
    make test

    # Run an example service SLO calculation
    make example-service

    # Run an example CPU burst calculation
    make example-cpu

    # See all available commands
    make help
    usage: slo [<flags>] <command> [<args> ...]

    Last9 SLO toolkit

    Flags:
      --help     Show context-sensitive help (also try --help-long and --help-man).
      --version  Show application version.

    Commands:
      help [<command>...]
        Show help.

      suggest --throughput=THROUGHPUT --slo=SLO --duration=DURATION
        suggest alerts based on service throughput and SLO duration

      cpu-suggest --instance=INSTANCE --utilization=UTILIZATION
        suggest alerts based on CPU utilization and Instance type
- --throughput: Number of requests per minute your service handles
- --slo: Your desired SLO percentage (e.g., 99.9)
- --duration: SLO time period in hours (e.g., 720 for 30 days)
- --instance: AWS instance type (e.g., t3.micro, t3a.xlarge)
- --utilization: Average CPU utilization percentage (0-100)
The goal of these commands is to factor in some "bare minimum" input to:
- Determine if this is a low traffic service where an SLO approach makes little sense
- Compute the actual alert values and conditions to set alerts on
Q: What alerts should I set for my service to achieve 99.9% availability over 30 days?
./slo-computer suggest --throughput=4200 --slo=99.9 --duration=720
Output:
    Alert if error_rate > 0.002 for last [24h0m0s] and also last [2h0m0s]
    This alert will trigger once 6.67% of error budget is consumed,
    and leaves 360h0m0s before the SLO is defeated.

    Alert if error_rate > 0.010 for last [1h0m0s] and also last [5m0s]
    This alert will trigger once 1.39% of error budget is consumed,
    and leaves 72h0m0s before the SLO is defeated.
Q: What about a low-traffic service?
./slo-computer suggest --throughput=100 --slo=99.9 --duration=168
Output:
    slo-computer: error:
    If this service reported 10.000 errors for a duration of 5m0s
    SLO (for the entire duration) will be defeated wihin 1h40m47s

    Probably
    - Use ONLY spike alert model, and not SLOs (easiest)
    - Reduce the MTTR for this service (toughest)
    - SLO is too aggressive and can be lowered (business decision)
    - Combine multiple services into one single service (team wide)
Q: What alerts should I set for my AWS burstable instance?
./slo-computer cpu-suggest --instance=t3a.xlarge --utilization=15
Output:
    Alert if 100.00 % consumption sustains for 10m0s AND recent 5m0s.
    At this rate, burst credits will deplete after 10h0m0s

    Alert if 80.00 % consumption sustains for 3h45m0s AND recent 55m0s.
    At this rate, burst credits will deplete after 15h0m0s
The tool generates two types of alerts:
- Slow burn alert: Detects gradual error rate increases that would eventually exhaust your error budget
- Fast burn alert: Detects sudden spikes in error rates that require immediate attention
Each alert includes:
- The error rate threshold to monitor
- The time windows to evaluate
- How much of your error budget would be consumed when the alert triggers
- How much time remains before your SLO is breached if the error rate continues
The tool generates alerts that help you monitor when your AWS burstable instance might run out of CPU credits:
- Alert thresholds for different CPU utilization levels
- Time windows to monitor
- Time until credit depletion at the current rate
- Throughput: The number of requests your service handles per minute
- SLO: Your Service Level Objective (e.g., 99.9% availability)
- Duration: The time period for your SLO in hours (e.g., 720 for 30 days)
- Error Budget: The number of allowable errors within your SLO period, calculated as (100% - SLO%) * total requests
- Burn Rate: How quickly you're consuming your error budget relative to the expected rate
- Instance: AWS burstable instance type (T2, T3, T4g families)
- Utilization: Average CPU utilization percentage
- Credit Rate: How quickly the instance earns CPU credits
- Baseline Performance: The CPU performance level the instance can sustain indefinitely
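The error budget and burn rate definitions above can be worked through in a few lines. This is a sketch under the stated formula, not the tool's code: `ErrorBudget` and `BurnRate` are hypothetical helper names, and burn rate is taken here to mean the observed error ratio divided by the ratio the SLO allows.

```go
package main

import "fmt"

// ErrorBudget returns the number of failed requests allowed over the
// SLO period: (100% - SLO%) * total requests.
// Hypothetical helper; not part of the slo-computer API.
func ErrorBudget(throughputPerMin, sloPercent, durationHours float64) float64 {
	totalRequests := throughputPerMin * 60 * durationHours
	return (100 - sloPercent) / 100 * totalRequests
}

// BurnRate expresses an observed error ratio as a multiple of the
// error ratio the SLO allows. A value of 1 spends the budget exactly
// over the SLO period; higher values spend it faster.
func BurnRate(observedErrorRatio, sloPercent float64) float64 {
	return observedErrorRatio / ((100 - sloPercent) / 100)
}

func main() {
	// The two services from the examples above.
	fmt.Printf("4200 rpm, 99.9%%, 720h: budget of %.0f errors\n", ErrorBudget(4200, 99.9, 720))
	fmt.Printf(" 100 rpm, 99.9%%, 168h: budget of %.0f errors\n", ErrorBudget(100, 99.9, 168))

	fmt.Printf("error ratio 0.002 at 99.9%%: burn rate %.1f\n", BurnRate(0.002, 99.9))
}
```

The second service has a budget of only about a thousand failed requests for the whole week, which illustrates why a short burst of errors can defeat the SLO of a low-traffic service.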
You can also use SLO Computer as a library in your Go projects:
    import (
        "time"

        "github.com/last9/slo-computer/slo"
    )

    // Create a new SLO
    s, err := slo.NewSLO(
        time.Duration(720)*time.Hour, // SLO period of 30 days
        4200,                         // 4200 requests per minute
        99.9,                         // 99.9% availability target
    )

    // Calculate alerts
    alerts := slo.AlertCalculator(s)

    // For CPU burst calculations
    cc := slo.InstanceCapacity("t3.micro")
    b, err := slo.NewBurstCPU(cc, 75.0) // 75% utilization
    burstAlerts := slo.BurstCalculator(b)
Error: "strconv.ParseFloat: parsing "SLO": invalid syntax"
Make sure to replace "SLO" with an actual number (e.g., 99.9) in your command:
    # Incorrect
    ./slo-computer suggest --throughput=1000000 --slo=SLO --duration=90

    # Correct
    ./slo-computer suggest --throughput=1000000 --slo=99.9 --duration=90
Error about low traffic services
If you receive a message about your service being low-traffic, consider:
- Using spike-based alerting instead of SLO-based alerting
- Combining multiple services to increase the traffic volume
- Lowering your SLO target to a more achievable level
We're actively working on improving SLO Computer. Check out our roadmap:
- Open Issues - Planned improvements and bug fixes
- Feature Enhancements - Upcoming features and user experience improvements
Contributions are welcome! Please feel free to submit a Pull Request.
- Fork the repository
- Create your feature branch (git checkout -b feature/amazing-feature)
- Commit your changes (git commit -m 'Add some amazing feature')
- Push to the branch (git push origin feature/amazing-feature)
- Open a Pull Request
This project is sponsored and maintained by Last9. Last9 is a telemetry data platform.