🧠 mattdunn.info

Search

Continuous Resilience

Last updated 29 Apr 2022

sre
devops

Introduction

Cost of downtime: $5,600/min—Gartner

continuous_resilience

Companies don’t like fear of outages
Results in process and rules (e.g. change management) to try and reduce outages
- E2E testing
- Advisory board
- Code freezes
- Planned deployments
- Strict SLAs
Not realistic!

”Everything fails all the time.” Werner Vogels

Embrace failure—continuous resilience

Resilience

Ability for system to bounce back from failure
Four essential capabilities:
- Anticipate
- Monitor
- Respond
- Learn

Anticipate

Code Reviews and Profile

Good code:
- Does what it should
- Consistent style
- Easy to understand
- Well documented
Improve with code reviews and profiling
- Human-based—not reliable
- Amazon CodeGuru—AI, automatically find bugs and issues

Resilient Application Patterns

API-based architecture
- Clients
  - Set timeouts
  - Retries with back-off, jitter and max retries
  - Limit queue sizes
- Back end
  - Rate limiting
  - Load shedding

Simple Designs and Constant Work

Less strenuous to pull vs push messages e.g. write messages to S3, then pull

Limit Impact of Failures with Cells

Impact of once cell doesn’t impact adjacent cells
- Control blast radius
- e.g. regions

Immutable Deployments Vs In-Place

Blue/green and canaries, split on:
- Internal/external
- Paying/non-paying
- Geographical
- Random
Zero downtime

Monitor

Steady State

Monitor key external metrics, not just e.g. CPU
- Customer experience
- e.g. orders per second
- Helps to confirm system is back to normal

Observability

Three pillars:
- Logs—record of change of state
- Metrics—point-in-time numeric data
- Traces—single user’s journey accross multiple applications and systems (e.g. microservices)
Correlate with AWS ServiceLens

Respond

Fast recovery more important than fewer failures
Process:
- Map dependencies—rank in terms in crticalness
- Create response plan

Automated Response

Event-driven
e.g. Business requirement: S3 buckets should never be public
- AWS EventBridge listens to event changing bucket to public
- Lambda function reverts the bucket back to private, and sends an alert notifiction

Learn

Correctoion of Errors (COE)

Post-mortem—what happened, not who
- Technical/process/organisation
- Missing documentation
- No blame
COE:
- What happened?
- What data do we have?
- Impact?
- Contributing factors?
- Learnings?

Chaos Engineering

Perform experiments
Observe what happens
Improve
- Uncover hidden issues
AWS Fault Injection Simulator

References

Towards continuous resilience

Graph View

Backlinks

AWS Summit London 2022

Copyright © Matt Dunn 2024

GitHub
LinkedIn
Quartz v4.2.3