site-reliability-engineering
Site reliability engineering (SRE) is a set of principles and practices that incorporates aspects of software engineering and applies them to infrastructure and operations problems. The main goals are to create scalable and highly reliable software systems. Site reliability engineering is closely related to DevOps, a set of practices that combine software development and IT operations, and SRE has also been described as a specific implementation of DevOps.
Here are 101 public repositories matching this topic...
A curated list of Site Reliability and Production Engineering resources.
Updated Jun 10, 2024
A curated collection of publicly available resources on how technology and tech-savvy organizations around the world practice Site Reliability Engineering (SRE)
Updated Feb 22, 2025 - JavaScript
A Chaos Engineering Platform for Kubernetes.
Updated Mar 17, 2025 - Go
A curated list of Chaos Engineering resources.
Updated Dec 28, 2023
An easy-to-use and powerful chaos engineering experiment toolkit, open-sourced by Alibaba.
Updated Mar 6, 2025 - Go
Litmus helps SREs and developers practice chaos engineering in a cloud-native way. Chaos experiments are published at the ChaosHub (https://hub.litmuschaos.io). Community notes are at https://hackmd.io/a4Zu_sH4TZGeih-xCimi3Q
Updated Mar 17, 2025 - Go
Chaos testing, network emulation, and stress testing tool for containers
Updated Mar 17, 2025 - Go
A collection of postmortem templates
Updated Jul 12, 2023
A curated list of Site Reliability and Production Engineering Tools
Updated Mar 3, 2025
Web UI for Jaeger
Updated Mar 14, 2025 - JavaScript
Your 24/7 On-Call AI Agent - Solve Alerts Faster with Automatic Correlations, Investigations, and More
Updated Mar 17, 2025 - Python
Making on-call suck less for engineers
Updated Nov 3, 2024 - Python
This repository includes resources that are more than sufficient to prepare for a Google interview if you are applying for a software engineer or site reliability engineer position.
Updated Aug 18, 2022
What to Read to Learn More About DevOps
Updated Sep 14, 2022
Curated list of good SRE interview questions.
Updated Aug 16, 2022
Open-source AI copilot that lets you chat with your observability data and code 🧙‍♂️
Updated Nov 23, 2024 - TypeScript
A chaos engineering platform for supporting the complete fault drill lifecycle.
Updated May 27, 2024 - Go
A role-playing game for incident management training
Updated Feb 27, 2024 - HTML
Google Site Reliability Engineering book converted to audio
Updated Mar 22, 2017
OpenShift Guide. Learn about the Red Hat OpenShift Container Platform, Data Science, CodeReady Containers, Podman, Buildah, and Kubernetes.
Updated Jan 4, 2024 - Python
124 followers