
Resiliency

Resiliency - the ability to recover quickly

  1. Prevent retry storm

    • aggregate retries at the process/host or service level
    • set a small, rate-based retry budget
    • once the budget is exceeded, fail quickly with an error
    • use adaptive retry strategies
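A minimal sketch of a process-level, rate-based retry budget, assuming a credit scheme in which every request earns one credit and every retry spends a fixed number of credits. The names `RetryBudget`, `RetryBudgetExceeded`, `call_with_budget`, and the parameters `requests_per_retry` and `max_banked_retries` are hypothetical, not taken from these notes. Backoff and jitter are left out to keep the example short.

```python
import threading


class RetryBudgetExceeded(Exception):
    """Raised when a retry is wanted but the shared budget is empty."""


class RetryBudget:
    """Shared budget: roughly one retry allowed per `requests_per_retry` requests."""

    def __init__(self, requests_per_retry: int = 10, max_banked_retries: int = 5) -> None:
        self._cost = requests_per_retry
        # Cap the savings so a long quiet period cannot fund a retry storm.
        self._cap = requests_per_retry * max_banked_retries
        self._credits = 0
        self._lock = threading.Lock()

    def record_request(self) -> None:
        # Every first attempt earns one credit, up to the cap.
        with self._lock:
            self._credits = min(self._cap, self._credits + 1)

    def try_acquire_retry(self) -> bool:
        # A retry is only allowed if enough credits have been earned.
        with self._lock:
            if self._credits >= self._cost:
                self._credits -= self._cost
                return True
            return False


def call_with_budget(budget: RetryBudget, operation, max_attempts: int = 3):
    """Run `operation`, retrying on ConnectionError only while the budget allows."""
    budget.record_request()
    attempt = 1
    while True:
        try:
            return operation()
        except ConnectionError as exc:
            if attempt >= max_attempts:
                raise
            if not budget.try_acquire_retry():
                # Budget exhausted: fail quickly instead of adding load.
                raise RetryBudgetExceeded("retry budget exhausted") from exc
            attempt += 1
```

Sharing one `RetryBudget` instance across all callers in the process is what aggregates retries at the process level. A per-call retry counter alone would still let every request retry at once.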
  2. Set tight timeouts for dependency calls

    • set an SLA for your service to establish a response time budget
    • dependency timeouts need to fit within the response time SLA
    • the deeper your service sits in the service graph, the lower its timeout values should be
    • timeouts must be much lower than 1s
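One way to make dependency timeouts fit inside the response time budget is to track a deadline per request and derive each timeout from what is left. The sketch below assumes a hypothetical `Deadline` helper. The class name, the `timeout_for` method, and the injectable `clock` parameter are illustrative and not part of these notes.

```python
import time


class Deadline:
    """Tracks how much of a request's response time budget is left."""

    def __init__(self, budget_s: float, clock=time.monotonic) -> None:
        self._clock = clock
        self._expires_at = clock() + budget_s

    def remaining(self) -> float:
        # Never negative: an expired deadline has zero time left.
        return max(0.0, self._expires_at - self._clock())

    def timeout_for(self, cap_s: float) -> float:
        """Timeout for the next dependency call: the per-call cap or the
        remaining budget, whichever is smaller."""
        remaining = self.remaining()
        if remaining <= 0.0:
            # Budget already spent: fail fast instead of calling at all.
            raise TimeoutError("response time budget exhausted")
        return min(cap_s, remaining)
```

A handler would create `Deadline(0.5)` on entry (assuming a 500 ms SLA) and pass `deadline.timeout_for(0.2)` as the timeout of each outgoing call. Forwarding the remaining budget to downstream services gives deeper services the smaller values described above.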
  3. “Levers” to limit functionality and load

    • when levers are activated, do LESS, not more
    • keep it simple, avoid bi-modal behaviour
    • use Gizmo to trigger levers
    • TEST your levers, at minimum before peaks
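As an illustration of a lever that only removes work, the sketch below uses a hypothetical in-memory `Levers` registry as a stand-in for whatever Gizmo exposes, since these notes do not describe Gizmo's interface. The lever name `disable_recommendations` and the `build_product_page` handler are also made up for the example.

```python
import threading


class Levers:
    """In-memory stand-in for the lever source (how Gizmo flips levers is not shown)."""

    def __init__(self) -> None:
        self._active = set()
        self._lock = threading.Lock()

    def activate(self, name: str) -> None:
        with self._lock:
            self._active.add(name)

    def deactivate(self, name: str) -> None:
        with self._lock:
            self._active.discard(name)

    def is_active(self, name: str) -> bool:
        with self._lock:
            return name in self._active


def build_product_page(product: dict, levers: Levers, fetch_recommendations) -> dict:
    """Build a page; the lever drops the optional recommendations call."""
    page = {"id": product["id"], "title": product["title"]}
    if not levers.is_active("disable_recommendations"):
        page["recommendations"] = fetch_recommendations(product["id"])
    # With the lever on, the handler skips work and adds no new code path.
    return page
```

Because the activated path is a strict subset of the normal path, it gets exercised on every request, which also makes the lever cheap to test before a peak.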
  4. Bulkheads

    • ex. different thread pools for different APIs
    • avoid large config files that are updated by different entities
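The thread pool example can be sketched with one `ThreadPoolExecutor` per API from the Python standard library. The API names `search` and `checkout`, the pool sizes, and the `submit` helper are assumptions for illustration.

```python
from concurrent.futures import Future, ThreadPoolExecutor

# One small pool per API: a stall in one API can only exhaust its own workers.
POOLS = {
    "search": ThreadPoolExecutor(max_workers=2, thread_name_prefix="search"),
    "checkout": ThreadPoolExecutor(max_workers=2, thread_name_prefix="checkout"),
}


def submit(api: str, fn, *args) -> Future:
    """Run `fn` on the pool that belongs to `api`."""
    return POOLS[api].submit(fn, *args)
```

If every `search` worker is stuck on a slow dependency, `checkout` work still runs on its own threads. Note that `ThreadPoolExecutor` queues excess work without limit, so a real bulkhead would probably also bound the queue or reject work when the pool is saturated.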
  5. Avoid bi-modal behaviour - when a service operates in a very different mode once an error occurs

    • ex. calling a DB directly, bypassing cache
    • do LESS on failure, not more and not something different
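A small sketch of the cache example, assuming a hypothetical cache client with `get` and `set` methods that raises `CacheUnavailable` when the cache is down. The function `get_profile` and the shape of the degraded response are illustrative only.

```python
class CacheUnavailable(Exception):
    """Raised by the (hypothetical) cache client when the cache is down."""


def get_profile(user_id: int, cache, load_from_db) -> dict:
    """Read-through lookup that does not switch to DB-only mode on cache failure."""
    try:
        cached = cache.get(user_id)
    except CacheUnavailable:
        # Do less: return a degraded answer rather than sending
        # the whole read load to the database.
        return {"id": user_id, "degraded": True}
    if cached is not None:
        return cached
    # Ordinary cache miss: the normal, everyday path may load and refill.
    profile = load_from_db(user_id)
    cache.set(user_id, profile)
    return profile
```

The database is sized for the miss rate of a healthy cache. Bypassing the cache during an outage would expose it to the full request rate, which is exactly the untested second mode this item warns about.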

    [Image: resiliency lever vs bulkheads.png.png]