Introduction
SRE is what happens when a software engineer is tasked with what used to be called operations.
- 40–90% of a system’s total cost is incurred post-delivery
- Reliability—primary concern
- Probability the system will perform the required function without failure under stated conditions for a stated period of time
- Hire software engineers to run products and create systems to accomplish work previously performed manually by sysadmins
- At Google, SREs: 50–60% are software engineers; 40–50% have most software engineering skills plus additional expertise, e.g. UNIX internals, networking
- Create team that becomes bored performing manual sysadmin tasks, and has the skills to write software to replace manual work
- Team focused on engineering
- 50% cap on ops work—upper limit
- Remainder spent on development to automate processes
- Prevent team from becoming sysadmins
- Goal: automated system that runs itself
- Responsible for: availability, latency, performance and efficiency
Downsides of Sysadmin Approach
- Previously: systems run by sysadmins
- Discrete teams: development and ops
- Easy to implement—precedent
- Pitfalls:
- Direct costs: manual intervention—team must scale as system scales, proportional to load on system
- Indirect costs: different backgrounds between teams—language, skill sets, assumptions
- Split between teams—conflict
- Example: dev teams want to release to prod as quickly as possible; ops teams want to ensure system is up—most issues caused by changes
- Teams’ goals are opposed
Tenets
Focus on Engineering
- 50% time cap on ops work
- When reached—redirect work to dev team
- Reassign bugs/tickets
- Integrate devs into on-call rotas
- Ends when ops load drops below 50%
- Need to monitor ops load against the cap (see the sketch at the end of this section)
- Org needs to understand why mechanism exists
- Max 2 events per 8–12hr on-call shift
- Handle event, clean up, restore service, conduct post-mortem
- Always conduct post-mortem—blameless culture
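
A minimal sketch of how the 50% cap might be tracked; the time-tracking numbers and names are hypothetical, not any particular tooling:

```python
# Sketch: check a team's ops load against the 50% cap (numbers are assumed).

def ops_load_fraction(ops_hours: float, total_hours: float) -> float:
    """Fraction of team time spent on ops work (tickets, on-call, manual tasks)."""
    return ops_hours / total_hours

OPS_CAP = 0.5  # upper limit on ops work

team_hours = {"ops": 220.0, "engineering": 180.0}  # hypothetical tracking data
load = ops_load_fraction(team_hours["ops"], sum(team_hours.values()))

if load > OPS_CAP:
    # Over the cap: redirect overflow (bugs, tickets, on-call shifts) to the dev team
    print(f"Ops load {load:.0%} exceeds the {OPS_CAP:.0%} cap; redirect work to devs")
else:
    print(f"Ops load {load:.0%} is within the cap")
```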
Error Budgets
- Reconcile goals of devs and ops
- 100% reliability wrong for most things
- Users unable to notice the difference between 99.999% and 100% uptime—no benefit from the last 0.001%
- Business question to determine correct reliability target
- What will users be happy with?
- Alternative services?
- What happens to usage at different availability levels?
error_budget = 1 - availability_target
- e.g. 99.999% availability = 0.001% error budget (worked example below)
- Spend on e.g. launching new features
- Outages no longer “bad”—just an expected part of the process of innovation
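
A worked example of the error-budget formula above, using the 99.999% target from these notes (figures are illustrative):

```python
# Worked example: error budget and allowed downtime from an availability target.

def error_budget(availability_target: float) -> float:
    """error_budget = 1 - availability_target"""
    return 1.0 - availability_target

MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600

target = 0.99999              # 99.999% availability (example from the notes)
budget = error_budget(target)

print(f"Error budget: {budget:.3%}")                                  # 0.001%
print(f"Allowed downtime: {budget * MINUTES_PER_YEAR:.1f} min/year")  # ~5.3 minutes
```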
Monitoring
- Primary means to track system health
- Traditionally—alert on value/condition
- Not effective—shouldn’t rely on humans to interpret monitoring output and decide on the required action
- Automate interpretation
- Only alert humans if action required
- Valid outputs (routing sketch below):
- Alerts—immediate action required
- Tickets—(non-immediate) action required
- Logging—recorded for diagnostic purposes; expectation is that it is not read unless prompted
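
A minimal sketch of routing a monitoring signal to one of the three valid outputs; the thresholds and the error-budget-burn signal are assumptions, not a prescribed policy:

```python
# Sketch: automate interpretation so humans are only paged when immediate action is needed.
from enum import Enum

class Output(Enum):
    ALERT = "alert"    # page a human: immediate action required
    TICKET = "ticket"  # file for later: action required, but not immediately
    LOG = "log"        # record only: read when diagnosing a problem

def classify(error_rate: float, hours_until_budget_exhausted: float) -> Output:
    """Hypothetical policy: page only if the error budget will run out soon."""
    if error_rate > 0 and hours_until_budget_exhausted < 4:
        return Output.ALERT
    if error_rate > 0:
        return Output.TICKET
    return Output.LOG

print(classify(error_rate=0.02, hours_until_budget_exhausted=2))    # Output.ALERT
print(classify(error_rate=0.001, hours_until_budget_exhausted=72))  # Output.TICKET
```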
MTTR
- Mean time to repair
- How quickly team can restore service
- Use playbooks to record best practices
- 3x improvement in MTTR compared to “winging it”
Change Management
- 70% of outages due to changes in live system
- Best practices:
- Progressive rollouts (rollout sketch after this list)
- Quickly and accurately detect problems
- Rollback when problems occur
- Remove humans from loop
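
A sketch of the change-management loop described above; deploy, healthy and rollback are hypothetical stand-ins for real release tooling and monitoring:

```python
# Sketch: progressive rollout with automated problem detection and rollback,
# keeping humans out of the loop. All functions are hypothetical stand-ins.
import random

def deploy(percent: int) -> None:
    print(f"Rolling out to {percent}% of traffic")

def healthy() -> bool:
    # Stand-in for real monitoring (error rates, latency vs. targets)
    return random.random() > 0.05

def rollback() -> None:
    print("Problem detected: rolling back automatically")

def progressive_rollout(stages=(1, 10, 50, 100)) -> bool:
    for percent in stages:
        deploy(percent)
        if not healthy():
            rollback()
            return False
    print("Rollout complete")
    return True

progressive_rollout()
```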
Capacity Planning
- Ensure sufficient capacity/redundancy
- Incorporate organic and inorganic growth:
- Organic—natural product adoption
- Inorganic—e.g. marketing campaigns, new features etc.
- Regular load testing to correlate raw capacity with service capacity (forecast sketch below)
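
A back-of-the-envelope capacity forecast combining organic and inorganic growth; all figures (growth rate, campaign traffic, per-server throughput, headroom) are assumptions:

```python
# Sketch: forecast demand and size capacity with redundancy headroom (assumed numbers).
import math

current_qps = 10_000
organic_growth_per_quarter = 0.05  # natural product adoption
inorganic_qps = 2_500              # e.g. planned marketing campaign or feature launch
quarters = 4

forecast_qps = current_qps * (1 + organic_growth_per_quarter) ** quarters + inorganic_qps

per_server_qps = 500               # validated by regular load testing
redundancy = 1.25                  # headroom for failures and maintenance

servers_needed = math.ceil(forecast_qps * redundancy / per_server_qps)
print(f"Forecast {forecast_qps:,.0f} QPS -> provision {servers_needed} servers")
```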
Provisioning
- Add new capacity when required
Efficiency and Performance
- Function of demand (load), capacity and software efficiency
- SREs—predict demand, provision capacity, modify software
- Provision to meet target response speed (see the sketch after this list)
- Monitor performance
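
A sketch tying demand, capacity and software efficiency together; the throughput-at-target-latency and target-utilization values are assumptions:

```python
# Sketch: provision to a target utilization so response speed stays within target;
# a software-efficiency win reduces the capacity needed. Figures are assumptions.
import math

demand_qps = 15_000                       # predicted demand
qps_per_server_at_target_latency = 400    # measured while the latency target is still met
target_utilization = 0.6                  # headroom so latency doesn't degrade under load

servers = math.ceil(demand_qps / (qps_per_server_at_target_latency * target_utilization))
print(f"Provision {servers} servers")

# A 20% software-efficiency improvement raises per-server throughput, cutting capacity
improved = math.ceil(demand_qps / (qps_per_server_at_target_latency * 1.2 * target_utilization))
print(f"After a 20% efficiency win: {improved} servers")
```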