Introduction

SRE is what happens when a software engineer is tasked with what used to be called operations.

  • 40–90% of total cost of a system incurred post-delivery
  • Reliability—primary concern
    • Probability the system will perform the required function without failure under stated conditions for a stated period of time
  • Hire software engineers to run products and create systems to accomplish work previously performed manually by sysadmins
    • At Google, SREs: 50–60% software engineers; 40–50% have most software engineering skills plus additional expertise, e.g. UNIX internals, networking
    • Create team that becomes bored performing manual sysadmin tasks, and has the skills to create software to replace manual work
  • Team focused on engineering
    • 50% cap on ops work—upper limit
    • Remainder spent on development to automate processes
    • Prevent team from becoming sysadmins
    • Goal: automated system that runs itself
  • Responsible for: availability, latency, performance and efficiency (a minimal availability sketch follows this list)
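
A minimal sketch of the availability measure behind these responsibilities, assuming simple time-based availability (uptime over total time); names and numbers are illustrative:

```python
# Minimal sketch: time-based availability (names/numbers illustrative).
def availability(uptime_s: float, downtime_s: float) -> float:
    """Fraction of the period the system performed its required function."""
    return uptime_s / (uptime_s + downtime_s)

# e.g. a 30-day month with ~26 s of downtime is roughly "five nines"
MONTH_S = 30 * 24 * 3600
print(f"{availability(MONTH_S - 26, 26):.5%}")  # -> 99.99900%
```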

Downsides of Sysadmin Approach

  • Previously: systems run by sysadmins
  • Discrete teams: development and ops
  • Easy to implement—precedent
  • Pitfalls:
    • Direct costs: manual intervention—team size must scale with the load on the system
    • Indirect costs: different backgrounds between teams—language, skill sets, assumptions
    • Split between teams—conflict
    • Example: dev teams want to release to prod as quickly as possible; ops teams want to keep the system up, so they resist change—most outages are caused by changes
      • Teams’ goals are opposed

Tenets

Focus on Engineering

  • 50% time cap on ops work
  • When reached—redirect work to dev team
    • Reassign bugs/tickets
    • Integrate devs into on-call rotas
  • Ends when ops load drops below 50%
  • Need to monitor ops load to enforce the cap (a toy tracking sketch follows this list)
  • Org needs to understand why mechanism exists
  • Max 2 events per 8–12hr on-call shift
    • Handle event, clean up, restore service, conduct post-mortem
    • Always conduct post-mortem—blameless culture
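
A toy sketch of tracking the 50% cap, assuming ops hours are logged per engineer; the data shape and numbers are illustrative assumptions:

```python
# Toy sketch: check a team's ops load against the 50% cap
# (data shape and numbers are illustrative assumptions).
OPS_CAP = 0.5

def over_cap(week: dict[str, tuple[float, float]]) -> bool:
    """week maps engineer -> (ops hours, total hours)."""
    ops = sum(o for o, _ in week.values())
    total = sum(t for _, t in week.values())
    return ops / total > OPS_CAP

week = {"alice": (25, 40), "bob": (22, 40)}  # 47/80 = ~59% ops
if over_cap(week):
    print("Over the cap: redirect bugs/tickets and pages to the dev team")
```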

Error Budgets

  • Reconcile goals of devs and ops
  • 100% is the wrong reliability target for almost everything
    • Users can’t tell the difference between 99.999% and 100% uptime—no benefit from the last 0.001%
  • Business question to determine correct reliability target
    • What will users be happy with?
    • Alternative services?
    • What happens to usage at different availability levels?

[Figure: cost vs. reliability]

  • error_budget = 1 - availability_target (a worked sketch follows this list)
    • e.g. 99.999% availability = 0.001% error budget
  • Spend the budget on e.g. launching new features
  • Outages no longer “bad”—just an expected part of the process of innovation
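
A worked sketch of the budget arithmetic; the availability target and quarter length are illustrative numbers:

```python
# Sketch: error budget and the downtime it permits (numbers illustrative).
def error_budget(availability_target: float) -> float:
    return 1.0 - availability_target

QUARTER_MIN = 91 * 24 * 60                 # ~one quarter, in minutes

budget = error_budget(0.99999)             # five nines -> 0.001%
print(f"budget = {budget:.3%}")            # budget = 0.001%
print(f"~{budget * QUARTER_MIN:.1f} min of downtime per quarter")  # ~1.3
```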

Monitoring

  • Primary means to track system health
  • Traditionally—alert on value/condition
    • Not effective—don’t rely on humans to decide on required action
    • Automate interpretation
    • Only alert humans if action required
  • Valid outputs:
    • Alerts—immediate action required
    • Tickets—(non-immediate) action required
    • Logging—recorded for diagnostic purposes; expectation is that logs aren’t read unless something prompts it (a toy classifier follows this list)
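
A toy classifier illustrating “automate interpretation”: software decides which output a signal becomes, and humans are paged only for alerts. The thresholds and signal shape are assumptions for illustration:

```python
# Toy sketch: route a monitoring signal to alert / ticket / log
# (thresholds and signal shape are illustrative assumptions).
from enum import Enum

class Output(Enum):
    ALERT = "page a human: immediate action required"
    TICKET = "file a ticket: action required, not urgent"
    LOG = "record only: read when diagnosing"

def classify(error_rate: float, budget_burn_rate: float) -> Output:
    if error_rate > 0.05 or budget_burn_rate > 0.10:
        return Output.ALERT
    if error_rate > 0.01:
        return Output.TICKET
    return Output.LOG

print(classify(error_rate=0.002, budget_burn_rate=0.001))  # Output.LOG
```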

MTTR

  • Mean time to repair
    • How quickly team can restore service
  • Use playbooks to record best practices
    • ~3x improvement in MTTR over “winging it” (see the sketch after this list)
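
A small sketch of the metric itself, with made-up incident durations:

```python
# Sketch: MTTR as the mean time to restore service across incidents
# (durations are made-up minutes, for illustration).
def mttr(repair_minutes: list[float]) -> float:
    return sum(repair_minutes) / len(repair_minutes)

with_playbook = [12, 8, 15]   # on-caller followed a playbook
winging_it = [40, 25, 41]     # ad-hoc debugging of the same issues

print(f"playbook:   {mttr(with_playbook):.1f} min")  # 11.7
print(f"winging it: {mttr(winging_it):.1f} min")     # 35.3 (~3x worse)
```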

Change Management

  • ~70% of outages due to changes in a live system
  • Best practices:
    • Progressive rollouts
    • Quickly and accurately detect problems
    • Rollback when problems occur
  • Remove humans from the loop (a rollout sketch follows this list)
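
A sketch of the three practices combined into one automated loop; deploy, healthy and rollback are hypothetical stand-in hooks, not a real release API:

```python
# Sketch: progressive rollout with automatic rollback, no human in the
# loop. deploy/healthy/rollback are hypothetical stand-in hooks.
STAGES = [0.01, 0.05, 0.25, 1.0]   # fraction of traffic per stage

def deploy(version: str, fraction: float) -> None:
    print(f"{version}: serving {fraction:.0%} of traffic")

def healthy(version: str) -> bool:
    return True                    # stand-in for error-rate/latency checks

def rollback(version: str) -> None:
    print(f"{version}: problem detected, rolling back")

def progressive_rollout(version: str) -> bool:
    for fraction in STAGES:
        deploy(version, fraction)
        if not healthy(version):   # detect problems quickly and accurately
            rollback(version)      # revert without waiting on a human
            return False
    return True

progressive_rollout("v2")
```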

Capacity Planning

  • Ensure sufficient capacity/redundancy
  • Incorporate organic and inorganic growth (a forecasting sketch follows this list):
    • Organic—natural product adoption
    • Inorganic—e.g. marketing campaigns, new features etc.
  • Regular load testing—correlate raw capacity (servers etc.) with service capacity
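
A forecasting sketch combining the two growth types; the growth rate and launch bump are illustrative numbers:

```python
# Sketch: forecast demand = compounded organic growth + inorganic bumps
# (all rates and numbers are illustrative assumptions).
def forecast_qps(current_qps: float, monthly_growth: float,
                 months: int, inorganic_qps: float = 0.0) -> float:
    organic = current_qps * (1 + monthly_growth) ** months
    return organic + inorganic_qps   # e.g. launch or marketing bump

# 10k QPS today, 5%/month adoption, +3k QPS expected from a feature launch
print(f"~{forecast_qps(10_000, 0.05, 6, 3_000):,.0f} QPS in 6 months")
```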

Provisioning

  • Add new capacity when required
    • Combines change management and capacity planning
    • Do quickly and only when necessary—spare capacity is expensive

Efficiency and Performance

  • Function of demand (load), capacity and software efficiency
  • SREs—predict demand, provision capacity, modify software
  • Provision to meet a target response speed (a provisioning sketch follows this list)
  • Monitor performance
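
A provisioning sketch tying together demand, per-replica capacity at the latency target, and headroom; every number is an illustrative assumption:

```python
# Sketch: replicas needed to serve peak demand at the target response
# speed (per-replica capacity and headroom are illustrative).
import math

def replicas_needed(peak_qps: float, qps_per_replica: float,
                    headroom: float = 0.3) -> int:
    """qps_per_replica: load one replica sustains while still meeting
    the latency target (from load testing); headroom covers failures
    and spikes."""
    return math.ceil(peak_qps * (1 + headroom) / qps_per_replica)

print(replicas_needed(peak_qps=16_000, qps_per_replica=450))  # -> 47
```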
