Reliability

Reliability - the ability of the service to be healthy under normal conditions

Availability - a metric for the share of time the service is up and serving requests, so it reflects how many failures you have and how long they last. Downtime is the key measure.
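As a rough worked example, one common way to express availability is the fraction of a time window during which the service was up. The function name and the 30-day window below are assumptions chosen for illustration, not part of these notes.

```python
def availability(downtime_minutes: float, period_minutes: float) -> float:
    """Return the fraction of the period during which the service was up."""
    if period_minutes <= 0:
        raise ValueError("period_minutes must be positive")
    if not 0 <= downtime_minutes <= period_minutes:
        raise ValueError("downtime_minutes must be between 0 and period_minutes")
    return 1.0 - downtime_minutes / period_minutes


MINUTES_PER_30_DAYS = 30 * 24 * 60  # 43,200 minutes

# 43.2 minutes of downtime in a 30-day window corresponds to "three nines".
print(f"{availability(43.2, MINUTES_PER_30_DAYS):.3%}")
```

Read the other way, an availability target gives a downtime budget for the window, which is usually the number a team tracks.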

Reliability best practices

Fix known OE and design problems
- create a must-fix roadmap with the team
- fix root causes of health reboots
- use customer impact data to push back delivery of new features
Architect for redundancy
- design to scale out with more compute and storage nodes with minimal incremental work
- minimum of 3 compute nodes per service
- minimum of 3 availability zones (AZs)
- avoid designs with a single SQL data store
Limit queues
- example: capacity is 100 requests and demand is 110, so the excess requests go into a queue. Customers who wait 10s start hitting refresh, growing the queue up to 220.
- use small bounded queues
- once a queue backs up, switch from FIFO to LIFO and flush the queue of all requests
- discard requests with a cheap-to-process error
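The queue-limiting ideas above can be sketched as a small bounded queue. This is a minimal illustration under assumptions: the class name, the `backlog_threshold` parameter, and the choice to signal rejection with a boolean are all hypothetical, and a real service would map the rejection to whatever fast error it uses.

```python
from collections import deque
from typing import Deque, Optional


class BoundedRequestQueue:
    """Small bounded queue that sheds load instead of growing without limit."""

    def __init__(self, max_size: int, backlog_threshold: int) -> None:
        if not 0 < backlog_threshold <= max_size:
            raise ValueError("need 0 < backlog_threshold <= max_size")
        self._items: Deque[str] = deque()
        self._max_size = max_size
        self._backlog_threshold = backlog_threshold

    def offer(self, request: str) -> bool:
        """Enqueue a request; return False (a cheap rejection) when full."""
        if len(self._items) >= self._max_size:
            return False  # caller turns this into a fast, cheap error
        self._items.append(request)
        return True

    def take(self) -> Optional[str]:
        """Return the next request, or None if the queue is empty."""
        if not self._items:
            return None
        if len(self._items) >= self._backlog_threshold:
            return self._items.pop()  # backed up: serve newest first (LIFO)
        return self._items.popleft()  # normal load: serve oldest first (FIFO)

    def flush(self) -> int:
        """Drop every queued request and return how many were dropped."""
        dropped = len(self._items)
        self._items.clear()
        return dropped


queue = BoundedRequestQueue(max_size=3, backlog_threshold=2)
accepted = [queue.offer(r) for r in ["a", "b", "c", "d"]]  # "d" is rejected
```

Serving the newest request first while backed up favors callers who are still waiting over those who have probably already given up or hit refresh.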
Throttle abnormally high-volume and high-latency callers
- use top-contributor tools to identify them
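One way to throttle a high-volume caller is a per-caller token bucket. The sketch below is an assumption about how this could look, not a description of any particular tool: `CallerThrottle`, `rate`, `burst`, and the injectable `clock` are hypothetical names, and it covers only request volume, not latency.

```python
import time
from typing import Callable, Dict


class CallerThrottle:
    """Per-caller token bucket: `rate` requests/second, bursts up to `burst`."""

    def __init__(
        self,
        rate: float,
        burst: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._rate = rate
        self._burst = burst
        self._clock = clock  # injectable so the behavior can be tested
        self._tokens: Dict[str, float] = {}
        self._last_seen: Dict[str, float] = {}

    def allow(self, caller_id: str) -> bool:
        """Return True if this caller may proceed, False if throttled."""
        now = self._clock()
        last = self._last_seen.get(caller_id, now)
        tokens = self._tokens.get(caller_id, self._burst)
        # Refill tokens for the elapsed time, capped at the burst size.
        tokens = min(self._burst, tokens + (now - last) * self._rate)
        self._last_seen[caller_id] = now
        if tokens >= 1.0:
            self._tokens[caller_id] = tokens - 1.0
            return True
        self._tokens[caller_id] = tokens
        return False
```

Because each caller has its own bucket, one abnormal caller runs out of tokens without affecting the others.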
Periodic logging cleanup
- fix the root cause of exceptions and warning logs
- avoid extra logging in production
- automated log purges
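An automated purge can be as simple as a scheduled job that deletes log files past a retention age. The sketch below assumes logs are plain `*.log` files in one directory; the function name, the glob pattern, and the retention parameter are hypothetical.

```python
import time
from pathlib import Path
from typing import List, Optional


def purge_old_logs(
    log_dir: Path, max_age_days: float, now: Optional[float] = None
) -> List[Path]:
    """Delete *.log files older than max_age_days; return the removed paths."""
    current = time.time() if now is None else now
    cutoff = current - max_age_days * 86400  # seconds per day
    removed: List[Path] = []
    for path in sorted(log_dir.glob("*.log")):
        # Compare the file's last-modified time against the retention cutoff.
        if path.is_file() and path.stat().st_mtime < cutoff:
            path.unlink()
            removed.append(path)
    return removed
```

Running this from a scheduler keeps disk usage bounded without manual cleanup.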
Manage deployment risks
- deploy gradually, per region, starting with one box
- automate deployments, verification, and rollback - full CI/CD
- health dashboards for deployments, alerts and autorollback
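The gradual rollout with automatic rollback can be sketched as a loop over stages. This is a simplified illustration: `staged_rollout` and its `deploy`, `is_healthy`, and `rollback` callbacks are hypothetical stand-ins for whatever the real pipeline and health checks provide.

```python
from typing import Callable, List, Sequence


def staged_rollout(
    stages: Sequence[Sequence[str]],
    deploy: Callable[[str], None],
    is_healthy: Callable[[str], bool],
    rollback: Callable[[str], None],
) -> bool:
    """Deploy stage by stage; on a failed health check, roll back everything.

    Returns True if every stage deployed and stayed healthy, else False.
    """
    deployed: List[str] = []
    for stage in stages:
        for host in stage:
            deploy(host)
            deployed.append(host)
        # Verify the whole stage before widening the blast radius.
        if not all(is_healthy(host) for host in stage):
            for host in reversed(deployed):
                rollback(host)
            return False
    return True


# Start with one box, then one region, then the rest (host names are made up).
example_stages = [["box-1"], ["region-a-1", "region-a-2"], ["region-b-1"]]
```

Starting with a single box means a bad build is caught while it affects the smallest possible share of traffic.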