Site reliability engineering

From Wikipedia, the free encyclopedia

Use of software engineering practices for IT

Site reliability engineering (SRE) is a discipline in the fields of software engineering and IT infrastructure support that monitors and improves the availability and performance of deployed software systems and large software services, which are expected to deliver reliable response times across events such as new software deployments, hardware failures, and cybersecurity attacks.^[1] There is typically a focus on automation and an infrastructure as code methodology. SRE uses elements of software engineering, IT infrastructure, web development, and operations^[2] to assist with reliability. It is similar to DevOps, as both aim to improve the reliability and availability of deployed software systems.

History

Site reliability engineering originated at Google with Benjamin Treynor Sloss,^[3]^[4] who founded an SRE team in 2003.^[5] The concept expanded within the software development industry, leading various companies to employ site reliability engineers.^[6] By March 2016, Google had more than 1,000 site reliability engineers on staff.^[7] Dedicated SRE teams are common at larger web development companies.^[8] In mid-sized and smaller companies, DevOps teams sometimes perform SRE as well.^[6] Organizations that have adopted the concept include Airbnb, Dropbox, IBM,^[9] LinkedIn,^[10] Netflix,^[7] and Wikimedia.^[11]

Definition

Site reliability engineers (SREs) are responsible for a combination of system availability, latency, performance, efficiency, change management, monitoring, emergency response, and capacity planning.^[12] SREs often have backgrounds in software engineering, systems engineering, or system administration.^[13] The focuses of SRE include automation, system design, and improvements to system resilience.^[13]

SRE is considered a specific implementation of DevOps,^[14] focusing specifically on building reliable systems, whereas DevOps covers a broader scope of operations.^[15]^[16]^[17] Despite the difference in focus, some companies have rebranded their operations teams as SRE teams.^[6]

Principles and practices

Common definitions of the principles include (but are not limited to):^[2]^[18]

Automation of repetitive tasks for cost-effectiveness.
Defining reliability goals to prevent endless effort.
Design of systems with a goal to reduce risks to availability, latency, and efficiency.
Observability, the ability to ask arbitrary questions about a system without having to know ahead of time what to ask.^[19]

Common definitions of the practices include (but are not limited to):

Toil management, the implementation of the first principle outlined above.
Defining and measuring reliability goals: service level indicators (SLIs), service level objectives (SLOs), and error budgets.
Non-Abstract Large Scale Systems Design (NALSD) with a focus on reliability.
Designing for and implementing observability.
Defining, testing, and running an incident management process.
Capacity planning.
Change and release management, including CI/CD.
Chaos engineering.
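
The relationship between SLIs, SLOs, and error budgets named in the list above can be illustrated with a little arithmetic. The following is a minimal sketch, not a definition taken from any of the cited sources: the function names, the request-based availability SLI, and the sample figures are all assumptions chosen for illustration.

```python
def availability_sli(good_events: int, total_events: int) -> float:
    """Fraction of events served successfully (a request-based SLI)."""
    if total_events <= 0:
        raise ValueError("total_events must be positive")
    return good_events / total_events


def error_budget(slo_target: float, total_events: int) -> float:
    """Number of events allowed to fail while still meeting the SLO."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    return (1.0 - slo_target) * total_events


def budget_remaining(slo_target: float, good_events: int, total_events: int) -> float:
    """Fraction of the error budget still unspent (negative if overspent)."""
    allowed = error_budget(slo_target, total_events)
    failed = total_events - good_events
    return 1.0 - failed / allowed


if __name__ == "__main__":
    # Hypothetical month: 1,000,000 requests, 400 failures, 99.9% SLO.
    total, good, slo = 1_000_000, 999_600, 0.999
    print(f"SLI: {availability_sli(good, total):.4%}")
    print(f"Allowed failures: {error_budget(slo, total):.0f}")
    print(f"Budget remaining: {budget_remaining(slo, good, total):.0%}")
```

With these hypothetical figures, a 99.9% target over 1,000,000 requests permits about 1,000 failures, so 400 failures consume roughly 40% of the budget and leave roughly 60% unspent. A team might treat a nearly exhausted budget as a signal to slow releases, although how the budget is acted on is an organizational choice.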

Deployment

SRE teams collaborate with other departments within organizations to guide the implementation of the principles described above. Common ways of structuring SRE teams include:^[20]

Kitchen Sink

Kitchen Sink refers to the expansive and often unbounded scope of services and workflows that SRE teams oversee. Unlike traditional roles with clearly defined boundaries, SREs are tasked with various responsibilities, including system performance optimization, incident management, and automation. This approach allows SREs to address multiple challenges, ensuring that systems run efficiently and evolve in response to changing demands and complexities.

Infrastructure

Infrastructure SRE teams focus on maintaining and improving the reliability of systems that support other teams' workflows. While they sometimes collaborate with platform engineering teams, their primary responsibility is ensuring uptime, performance, and efficiency. Platform teams, on the other hand, primarily develop the software and systems used across the organization. Reliability is a goal for both, but platform teams prioritize creating and maintaining the tools and services used by internal stakeholders, whereas Infrastructure SRE teams are tasked with ensuring those systems run smoothly and meet reliability standards.

Tools

SRE teams use a variety of tools to measure, maintain, and improve system reliability. These tools help with monitoring performance, identifying issues, and facilitating proactive maintenance. For instance, Nagios Core is commonly used for system monitoring and alerting, while Prometheus is frequently used for collecting and querying metrics in cloud-native environments.
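
The kind of check that monitoring and alerting tools perform can be sketched in a few lines. The example below is only an illustration of the idea and does not use or reproduce the API of Nagios Core or Prometheus: the function names, the nearest-rank percentile method, and the threshold value are assumptions.

```python
import math


def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile of a non-empty list of samples."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    # 1-based nearest rank; multiply before dividing to limit float error.
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


def should_alert(latencies_ms: list[float], threshold_ms: float, pct: float = 99.0) -> bool:
    """True when the chosen latency percentile exceeds the threshold."""
    return percentile(latencies_ms, pct) > threshold_ms


if __name__ == "__main__":
    # Hypothetical window of request latencies in milliseconds.
    window = [100.0] * 98 + [900.0] * 2
    print("p99 latency (ms):", percentile(window, 99))
    print("alert:", should_alert(window, threshold_ms=500.0))
```

Alerting on a high percentile rather than the mean is one common design choice, because a small share of very slow requests can be hidden by an average. Production tools typically evaluate such conditions over time windows rather than over a single list of samples.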

Product or Application

SRE teams dedicated to specific products or applications are common in large organizations.^[21] These teams are responsible for ensuring the reliability, scalability, and performance of key services. In larger companies, it's typical to have multiple SRE teams, each focusing on different products or applications, ensuring that each area receives specialized attention to meet performance and availability targets.

Embedded

In an embedded model, individual SREs or small SRE pairs are integrated within software engineering teams. These SREs collaborate with developers, applying core SRE principles—such as automation, monitoring, and incident response—directly to the software development lifecycle. This approach aims to enhance reliability, performance, and collaboration between SREs and developers.

Consulting

Consulting SRE teams specialize in advising organizations on the implementation of SRE principles and practices. Typically composed of seasoned SREs with a history across various implementations, these teams provide insights and guidance for specific organizational needs. When working directly with clients, these SREs are often referred to as 'Customer Reliability Engineers.'

In large organizations that have adopted SRE, a hybrid model is common.^[citation needed] This model combines several of the implementations above, such as multiple Product/Application SRE teams dedicated to addressing the specific reliability needs of different products. An Infrastructure SRE team may collaborate with a platform engineering group to achieve shared reliability goals for a unified platform that supports all products and applications.

Industry

Since 2014, the USENIX organization has hosted the annual SREcon conference, bringing together site reliability engineers from various industries. This conference is a platform for professionals to share knowledge, explore effective practices, and discuss trends in site reliability engineering.^[22]

References

[1] "What is SRE? - Site Reliability Engineering Explained - AWS". Amazon Web Services, Inc. Retrieved 2024-12-26.
[2] "Evaluating where your team lies on the SRE spectrum". Google Cloud Blog. Retrieved 2021-06-26.
[3] Hill, Patrick. "Love DevOps? Wait until you meet SRE". Atlassian. Retrieved June 17, 2021.
[4] "What is SRE?". Red Hat. Retrieved June 17, 2021.
[5] Treynor, Ben (2014). "Keys to SRE". USENIX SREcon14. Retrieved June 17, 2021.
[6] Gossett, Stephen (June 1, 2020). "What Is a Site Reliability Engineer? What Does an SRE Do?". Built In. Retrieved June 17, 2021.
[7] Fischer, Donald (March 2, 2016). "Are site reliability engineers the next data scientists?". TechCrunch. Archived from the original on August 12, 2019. Retrieved June 17, 2021.
[8] Beres, Cristi (October 9, 2024). "SRE & DevOps: Striking the Perfect IT Match". Synergo Group.
[9] "Site Reliability Engineering". IBM Cloud Education. IBM. November 12, 2020. Retrieved June 21, 2021.
[10] "Site Reliability Engineering (SRE)". engineering.linkedin.com. Retrieved March 12, 2024.
[11] "SRE - Wikitech". wikitech.wikimedia.org. Retrieved 2021-10-17.
[12] Treynor, Ben. "In Conversation" (Interview). Interviewed by Niall Murphy. Google Site Reliability Engineering.
[13] Jones, Chris; Underwood, Todd; Nukala, Shylaja (June 2015). "Hiring Site Reliability Engineers" (PDF). ;login:. Vol. 40, no. 3. pp. 35–39. Archived (PDF) from the original on August 24, 2017. Retrieved June 17, 2021.
[14] Harrison, Dave (9 Oct 2018). "Interview with Betsy Beyer, Stephen Thorne of Google". Retrieved 24 July 2024.
[15] Beyer, Betsy; Jones, Chris; Petoff, Jennifer; Murphy, Niall, eds. (2016). Site Reliability Engineering: How Google Runs Production Systems. Sebastopol, CA: O'Reilly Media. ISBN 978-1-4919-5118-7. OCLC 945577030.
[16] Vargo, Seth; Fong-Jones, Liz (March 1, 2018). What's the Difference Between DevOps and SRE? (class SRE implements DevOps) (Video). Google. Archived from the original on December 3, 2019. Retrieved March 8, 2018.
[17] "What is SRE? - SRE Explained - AWS". Amazon Web Services, Inc. Retrieved 2022-11-05.
[18] "The 7 SRE Principles [And How to Put Them Into Practice]". www.blameless.com. Retrieved 2021-06-26.
[19] "Learn about observability | Honeycomb". docs.honeycomb.io. Retrieved 2021-06-26.
[20] "SRE at Google: How to structure your SRE team". Google Cloud Blog. Retrieved 2021-06-26.
[21] "SRE at Google: How to structure your SRE team". Google Cloud Blog. Retrieved 2024-11-11.
[22] "Usenix SREcon". USENIX. 2021. Retrieved June 17, 2021.

External links

Awesome Site Reliability Engineering resources list
How they SRE resources list
SRE Weekly weekly newsletter devoted to SRE
SRE at Google landing page for learning more about SRE in Google
Komodor K8s Reliability learning centre with resources for SREs working with Kubernetes
SRE: What Do You Need To Know To Master This Role? resource list
