Reliability

Reliability - the ability of the service to stay healthy under normal conditions.

Availability - a metric that reflects how often the service fails. Downtime is the key measure.
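
As a rough illustration of how downtime maps to an availability number, here is a minimal sketch. It assumes the common definition of availability as the fraction of a period during which the service was up; the function names and the 30-day window are hypothetical and not taken from these notes.

```python
THIRTY_DAYS = 30 * 24 * 60 * 60  # measurement window in seconds (assumed)


def availability(total_seconds: float, downtime_seconds: float) -> float:
    """Fraction of the period during which the service was up."""
    if total_seconds <= 0:
        raise ValueError("total_seconds must be positive")
    if not 0 <= downtime_seconds <= total_seconds:
        raise ValueError("downtime must be between 0 and total_seconds")
    return (total_seconds - downtime_seconds) / total_seconds


def allowed_downtime(total_seconds: float, target: float) -> float:
    """Downtime budget in seconds that still meets the availability target."""
    if not 0 <= target <= 1:
        raise ValueError("target must be between 0 and 1")
    return total_seconds * (1.0 - target)


if __name__ == "__main__":
    # About 43 minutes of downtime in 30 days is roughly "three nines".
    print(f"{availability(THIRTY_DAYS, 2592):.4%}")
    print(f"{allowed_downtime(THIRTY_DAYS, 0.999) / 60:.1f} minutes")
```

Turning a target into a downtime budget this way makes it easier to compare incidents against the number the team has agreed to hit.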

[Figure: reliability.png]

Reliability best practices

  1. Fix known operational excellence (OE) and design problems
    • create a must-fix roadmap with the team
    • fix the root causes of health reboots
    • use customer impact data to justify pushing back delivery of new features
  2. Architect for redundancy
    • design to scale out by adding compute and storage nodes with minimal incremental work
    • run a minimum of 3 compute nodes per service
    • spread across a minimum of 3 Availability Zones (AZs)
    • avoid designs that depend on a single SQL data store
  3. Limit queues
    • example: you have capacity for 100 requests but need 110, so the extra requests go into a queue. customers waiting 10s will start hitting refresh, growing the queue up to 220.
    • use small, bounded queues
    • once a queue backs up, switch from FIFO to LIFO and flush the queue of all requests
    • discard requests with an error that is cheap to process
  4. Throttle abnormally high-volume and high-latency callers
    • use top-contributor tools to identify them
  5. Periodic logging cleanup
    • fix the root causes of exceptions and warning logs
    • avoid extra logging in production
    • automate log purges
  6. Manage deployment risks
    • deploy gradually, region by region, starting with one box
    • automate deployments, verifications and rollback - full CI/CD
    • set up health dashboards for deployments, with alerts and auto-rollback
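
To make the queue-limiting advice in item 3 more concrete, here is a minimal sketch of a small bounded queue that rejects cheaply when full and serves the newest request first once it is backed up. The class name, method names and thresholds are hypothetical choices for illustration, not something these notes prescribe.

```python
from collections import deque
from typing import Deque, Optional


class BoundedQueue:
    """Small bounded queue that sheds load instead of growing without limit."""

    def __init__(self, max_size: int, backlog_threshold: int) -> None:
        if not 0 < backlog_threshold <= max_size:
            raise ValueError("need 0 < backlog_threshold <= max_size")
        self.max_size = max_size
        self.backlog_threshold = backlog_threshold
        self._items: Deque[str] = deque()

    def offer(self, request: str) -> bool:
        """Enqueue a request; return False (a cheap rejection) when full."""
        if len(self._items) >= self.max_size:
            return False
        self._items.append(request)
        return True

    def take(self) -> Optional[str]:
        """Return the next request, or None if the queue is empty."""
        if not self._items:
            return None
        if len(self._items) >= self.backlog_threshold:
            # Backed up: serve the newest request first (LIFO), since the
            # oldest callers have most likely given up or hit refresh.
            return self._items.pop()
        # Normal operation: serve the oldest request first (FIFO).
        return self._items.popleft()

    def flush(self) -> int:
        """Drop everything that is queued and report how many were dropped."""
        dropped = len(self._items)
        self._items.clear()
        return dropped


if __name__ == "__main__":
    queue = BoundedQueue(max_size=3, backlog_threshold=3)
    print([queue.offer(r) for r in ["a", "b", "c", "d"]])  # "d" is rejected
    print(queue.take())   # backed up, so the newest request comes out
    print(queue.take())   # below the threshold again, so the oldest comes out
    print(queue.flush())  # number of requests dropped
```

A caller that gets `False` from `offer` can return a fast error straight away, which costs far less than letting the request wait in a queue that the customer has already abandoned.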