Concepts in service monitoring

Service monitoring and the SLO API help you manage your services like Google manages its own services. The core notions of service monitoring include the following:

  • Selecting metrics that act as service-level indicators (SLIs).
  • Using the SLIs to set service-level objectives (SLOs) for the SLI values.
  • Using the error budget implied by the SLO to mitigate risk in your service.

This page introduces these concepts and describes some of the things to consider when designing an SLO. The other pages in this section put these concepts into practice.

Terminology

Service monitoring has a set of core concepts, which are introduced here:

  • Service-level indicator (SLI): a measurement of performance.
  • Service-level objective (SLO): a statement of desired performance.
  • Error budget: starts at 1 − SLO and declines as the actual performance misses the SLO.

Service-level indicators

Cloud Monitoring collects metrics that measure the performance of the service infrastructure. Examples of performance metrics include the following:

  • Request count: for example, the number of HTTP requests per minute that result in 2xx or 5xx responses.
  • Response latencies: for example, the latency for HTTP 2xx responses.

The performance metrics are automatically identified for a set of known service types: Cloud Service Mesh, Istio on Google Kubernetes Engine, and App Engine. You can also define your own service type and select performance metrics for it.

The performance metrics are the basis of the SLIs for your service. An SLI describes the performance of some aspect of your service. For services on Cloud Service Mesh, Istio on Google Kubernetes Engine, and App Engine, useful SLIs are already known. For example, if your service has request-count or response-latencies metrics, standard SLIs can be derived from those metrics by creating ratios as follows:

  • An availability SLI is the ratio of the number of successful responses to the number of all responses.
  • A latency SLI is the ratio of the number of calls below a latency threshold to the number of all calls.
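
As a rough sketch of the arithmetic, these two ratio SLIs can be computed as follows. The counts here are hypothetical, not values from any real service:

```python
# Illustrative only: computing the two standard ratio SLIs from raw counts.
# All counts below are hypothetical.

responses_2xx = 9_950      # successful responses in the measurement window
responses_total = 10_000   # all responses in the measurement window

availability_sli = responses_2xx / responses_total
print(f"Availability SLI: {availability_sli:.4f}")  # 0.9950

calls_under_threshold = 9_800  # calls faster than the latency threshold
calls_total = 10_000           # all calls in the measurement window

latency_sli = calls_under_threshold / calls_total
print(f"Latency SLI: {latency_sli:.4f}")  # 0.9800
```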

You can also set up service-specific SLIs for some other measure of what “good performance” means. These SLIs generally fall into two categories:

  • Request-based SLIs, where good service is measured by counting atomic units of service, like the number of successful HTTP requests.
  • Windows-based SLIs, where good service is measured by counting the number of time periods, or windows, during which performance meets a goodness criterion, like response latency below a given threshold.

These SLIs are described in more detail in Compliance in request- and windows-based SLOs.

For examples that create SLIs for selected services, see Creating SLIs from metrics.

Service-level objectives

An SLO is a target value for an SLI, measured over a period of time. The service determines the available SLIs, and you specify SLOs based on the SLIs. The SLO defines what qualifies as good service. You can create up to 500 SLOs for each service in Cloud Monitoring.

An SLO is built on the following kinds of information:

  • An SLI, which measures the performance of the service.
  • A performance goal, which specifies the desired level of performance.
  • A time period, called the compliance period, for measuring how the SLI compares to the performance goal.

For example, you might have requirements like these:

  • Latency can exceed 300 ms in only 5 percent of the requests over a rolling 30-day period.
  • The system must have 99% availability measured over a calendar week.

Requirements like these can provide the basis for SLOs. See Designing and using SLOs for guidance on setting good SLOs.
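
To make the three ingredients concrete, here is a minimal sketch (plain Python, not the SLO API's actual schema) that models the first requirement above:

```python
from dataclasses import dataclass

# A hypothetical model of an SLO's three ingredients. This illustrates the
# concept only; it is not the Cloud Monitoring SLO API resource format.

@dataclass
class Slo:
    sli_description: str    # what is measured
    goal: float             # desired fraction of good events (0 to 1)
    compliance_period: str  # rolling or calendar period

# "Latency can exceed 300 ms in only 5 percent of the requests over a
# rolling 30-day period" means 95% of requests must be good.
latency_slo = Slo(
    sli_description="fraction of requests with latency <= 300 ms",
    goal=0.95,
    compliance_period="rolling 30 days",
)
print(latency_slo)
```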

Changes in SLO compliance can also indicate the onset of failures. Monitoring these changes might give you enough warning that you can fix a problem before it cascades. So alerting policies are typically used to monitor SLO compliance. For more information, see Alerting on your error budget.

A useful SLO targets less than 100%, because the SLO determines your error budget. SLOs are typically described as a “number of nines”: 99% (2 nines), 99.9% (3 nines), and so forth. The highest value you can set is 99.9%, but you can use any lower value that is appropriate for your service.

Error budgets

An SLO specifies the degree to which a service must perform during a compliance period. What's left over in the compliance period becomes the error budget. The error budget quantifies the degree to which a service can fail to perform during the compliance period and still meet the SLO.

Error budgets let you track how many bad individual events (like requests) are allowed to occur during the remainder of your compliance period before you violate the SLO. You can use the error budget to help you manage maintenance tasks like deployment of new versions. If the error budget is close to depleted, then taking risky actions like pushing new updates might result in your violating an SLO.

Your error budget for a compliance period is (1 − SLO goal) × (eligible events in compliance period). For example, if your SLO is for 85% of requests to be good in a 7-day rolling period, then your error budget allows 15% of these requests to be bad. If you received, say, 60,480 requests in the past week, your error budget is 15% of that total, or 9,072 requests that are permitted to be bad. If you served more errors than this, your service was out of SLO for the 7-day compliance period.
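
The arithmetic from this example, written out as a short sketch (the request count is the hypothetical one from the paragraph above):

```python
# Error budget = (1 - SLO goal) x (eligible events in the compliance period).
slo_goal = 0.85           # 85% of requests must be good
eligible_events = 60_480  # requests received in the 7-day rolling period

error_budget = (1 - slo_goal) * eligible_events
print(f"Error budget: {error_budget:.0f} bad requests allowed")  # 9072

bad_requests = 10_000     # hypothetical count of failed requests
if bad_requests > error_budget:
    print("Out of SLO for this compliance period")
```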

Designing and using SLOs

What makes a good SLO? What are things to consider in making the choices? This section provides an overview of some of the general concepts behind designing and using SLOs. This topic is covered in much more detail in Site Reliability Engineering: How Google Runs Production Systems, in the chapter on SLOs.

SLOs define the target performance you want from your service. In general, SLOs should be no higher than necessary or meaningful. If your users cannot tell the difference between 99% availability and 99.9% availability of your service, use the lower value as the SLO. The higher value is more expensive to meet, and it won't make a difference to your users. A service required to meet a 100% SLO goal has no error budget. Setting such an SLO is bad practice.

SLOs are typically more stringent than public or contractual commitments. You want an SLO to be tighter than a public commitment. This way, if something happens that causes a violation of the SLO, you can become aware of the problem and fix it before it causes a violation of a commitment or contract. Violating a commitment or contract may have reputational, financial, or legal implications. An SLO is part of an early-warning system to prevent that from happening.

Compliance periods

There are two types of compliance periods for SLOs:

  • Calendar-based periods (from date to date)
  • Rolling periods (from n days ago to now, where n ranges from 1 to 30 days)

Calendar-based compliance periods

Compliance periods can be set to calendar periods like a week or a month. The compliance period and error budget reset on well-known calendar boundaries. For the possible values, see CalendarPeriod.

With a calendar period, you get a performance score at the end of the period. Measured against the performance threshold, the performance score lets you know whether your service was compliant or not. When you use a calendar period, you only get a compliance rating once every compliance period, even though you see the performance throughout the period. But the end-of-period score gives you an easy-to-read value that matches easily against your customer billing periods (if you have external paying customers).

Like months on a calendar, monthly compliance periods vary in the number of daysthey cover.

Rolling window-based compliance periods

You can also measure compliance over a rolling period, so that you are always evaluating, for example, the last 30 days. With a rolling period, the oldest data in the previous calculation drops out of the current calculation and new data replaces it.

With a rolling window, you get more compliance measurements; that is, you get a measure of compliance for the last 30 days, rather than one per month. Services can transition between compliance and noncompliance as the SLO status changes daily, as old data points are dropped and new ones added.
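
A minimal sketch of the rolling mechanics, assuming per-day counts of good and total requests (all numbers are hypothetical):

```python
from collections import deque

# Keep the last 30 days of (good, total) request counts; each new day
# pushes out the oldest day, and compliance is recomputed over what remains.
WINDOW_DAYS = 30
daily_counts = deque(maxlen=WINDOW_DAYS)

def record_day(good: int, total: int) -> float:
    """Add one day of counts and return compliance over the rolling window."""
    daily_counts.append((good, total))
    good_sum = sum(g for g, _ in daily_counts)
    total_sum = sum(t for _, t in daily_counts)
    return good_sum / total_sum if total_sum else 1.0

# A bad day hurts compliance for 30 days, then rolls out of the window.
for day in range(40):
    compliance = record_day(good=990 if day != 5 else 500, total=1000)
print(f"Compliance over the current window: {compliance:.4f}")
```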

Compliance in request- and windows-based SLOs

Determining whether an SLO is in compliance depends on two factors:

  • How the compliance period is determined. This determination is discussed in Compliance periods.
  • The type of SLO. There are two types of SLOs:
    • Request-based SLOs
    • Windows-based SLOs

Compliance is the ratio of good events to total events, measured over the compliance period. The type of SLO determines what constitutes an “event”.

If your SLO is 99.9%, then you're meeting it if your compliance is at least 99.9%. The maximum value is 100%.

Request-based SLOs

A request-based SLO is based on an SLI that is defined as the ratio of the number of good requests to the total number of requests. A request-based SLO is met when that ratio meets or exceeds the goal for the compliance period.

For example, consider this request-based SLO: “Latency is below 100 ms for at least 95% of requests.” A good request is one with a response time less than 100 ms, so the measure of compliance is the fraction of requests with response times under 100 ms. The service is compliant if this fraction is at least 0.95.
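
A sketch of that check, using a hypothetical list of response times:

```python
# Request-based compliance: fraction of individual requests that were good.
# The latencies below are hypothetical, in milliseconds.
latencies_ms = [42, 87, 95, 103, 61, 78, 250, 99, 88, 70]

GOAL = 0.95
THRESHOLD_MS = 100

good = sum(1 for latency in latencies_ms if latency < THRESHOLD_MS)
compliance = good / len(latencies_ms)
print(f"Compliance: {compliance:.2f}, meets SLO: {compliance >= GOAL}")
```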

Request-based SLOs give you a sense of what percentage of work your service did properly over the entire compliance period, no matter how the load was distributed throughout the compliance period.

Windows-based SLOs

A windows-based SLO is based on an SLI defined as the ratio of the number of measurement intervals that meet some goodness criterion to the total number of intervals. A windows-based SLO is met when that ratio meets or exceeds the goal for the compliance period.

For example, consider this SLO: “The 95th percentile latency metric is less than 100 ms for at least 99% of 10-minute windows”. A good measurement period is a 10-minute span in which 95% of the requests have latency under 100 ms. The measure of compliance is the fraction of such good periods. The service is compliant if this fraction is at least 0.99.
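
A sketch of that windows-based calculation, assuming request latencies have already been grouped into 10-minute windows (all values are simulated, not real measurements):

```python
import random
import statistics

# Windows-based compliance: fraction of 10-minute windows whose 95th
# percentile latency is under 100 ms. Latencies here are simulated.
random.seed(0)
windows = [[random.gauss(mu=60, sigma=20) for _ in range(200)]
           for _ in range(144)]  # one day of 10-minute windows

def p95(latencies: list) -> float:
    """95th percentile via statistics.quantiles (n=20 gives 19 cut points)."""
    return statistics.quantiles(latencies, n=20)[-1]

good_windows = sum(1 for w in windows if p95(w) < 100)
compliance = good_windows / len(windows)
print(f"Good windows: {good_windows}/{len(windows)}, "
      f"compliant: {compliance >= 0.99}")
```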

Note: If the raw metric available to you is a latency percentile, then you must use a windows-based SLO.

For another example, suppose you configure your compliance period to be a rolling 30 days, the measurement interval to be a minute, and the SLO goal to be 99%. To meet this SLO, your service must have 42,768 “good” intervals out of 43,200 minutes (99% of the number of minutes in 30 days).

A windows-based SLO gives you an idea of what percentage of the time your customers found the service to be working well or poorly. This type of SLO can hide the effects of “bursty” behavior: A measurement interval that failed every one of its calls counts against the SLO as much as a measurement interval that had one error too many. Also, intervals with a low number of calls count against the SLO as much as a measurement interval with heavy activity.

Trajectory of error budgets

The error budget is the difference between 100% good service and your SLO, the desired level of good service. The difference between them is your wiggle room.

In general, an error budget starts as a maximum value and drops over time,triggering an SLO violation when the error budget drops below 0.

There are a couple of notable exceptions to this pattern:

  • If you have a request-based SLO measured over a calendar compliance period, and the service has increased activity over the compliance period, the remaining error budget can actually rise.

    How is that possible? The SLO system can't know in advance how much activity the service will have in each compliance period, so it extrapolates a likely value. This value is the ratio of calls up to the present time over the elapsed time since the beginning of the compliance period, multiplied by the length of the compliance period.

    As the activity rate goes up, the expected traffic for the period also goes up, and as a result, the error budget rises (see the sketch after this list).

  • If you are measuring an SLO over a rolling compliance period, you are effectively always at the end of a compliance period. Rather than starting from scratch, old data points are continuously dropped and new data points are continuously added.

    If a period of poor compliance rolls out of the compliance window, and if the new data replacing it is compliant, the error budget goes up. At any point in time, an error budget ≥ 0 indicates a compliant rolling SLO window, and an error budget < 0 indicates a non-compliant rolling SLO window.
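
A sketch of the extrapolation described in the first case, with hypothetical numbers:

```python
# How a rising request rate can raise the remaining error budget in a
# calendar compliance period. All values are hypothetical.
SLO_GOAL = 0.99
PERIOD_DAYS = 30

def remaining_budget(calls_so_far: int, bad_so_far: int,
                     elapsed_days: float) -> float:
    """Extrapolate total traffic for the period, then subtract bad calls."""
    expected_total = calls_so_far / elapsed_days * PERIOD_DAYS
    return (1 - SLO_GOAL) * expected_total - bad_so_far

# Day 10: 100k calls so far, 800 bad -> ~300k expected calls for the period.
print(remaining_budget(100_000, 800, 10))   # 2200.0
# Day 11 after a traffic spike: the extrapolated total jumps, so the
# remaining error budget rises even though bad calls did not decrease.
print(remaining_budget(150_000, 800, 11))   # ~3290.9
```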

Monitoring your error budget

You can create alerting policies to warn you that your error budget is being consumed at a faster than desired rate. See Alerting on your error budget for more information.

What's next

  • Microservices describes microservices and how to use the Google Cloud console to configure, view, and manage your microservices.
  • Alerting on your burn rate describes how to monitor your SLIs so that you are alerted to possible problems.
  • Working with the SLO API shows how to use the SLO API, a subset of the Cloud Monitoring API, to create services, SLOs, and related structures.
