Resiliency

Resiliency - the ability of a service to recover quickly from failure

Prevent retry storm
- aggregate retries at the process/host or service level
- set a small, rate-based retry budget
- once the budget is exceeded, fail quickly with an error
- use adaptive retry strategies
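A minimal sketch of a process-wide, rate-based retry budget follows. The names `RetryBudget`, `RetryBudgetExceeded` and `call_with_budget`, and the default values `ratio=0.1` and `max_tokens=10.0`, are assumptions made for illustration and are not taken from any particular library. Each request earns a fraction of a token and each retry spends a whole one, so retries stay at roughly `ratio` of total traffic.

```python
import threading


class RetryBudgetExceeded(Exception):
    """Raised when a retry is needed but the shared budget is empty."""


class RetryBudget:
    """Token-style budget shared by every caller in the process (hypothetical)."""

    def __init__(self, ratio=0.1, max_tokens=10.0):
        self._ratio = ratio            # tokens earned per request
        self._max_tokens = max_tokens  # cap, so idle periods cannot bank retries forever
        self._tokens = max_tokens
        self._lock = threading.Lock()

    def record_request(self):
        # Every first attempt deposits a fraction of a token.
        with self._lock:
            self._tokens = min(self._max_tokens, self._tokens + self._ratio)

    def try_acquire_retry(self):
        # Every retry costs one whole token; refuse when none is left.
        with self._lock:
            if self._tokens >= 1.0:
                self._tokens -= 1.0
                return True
            return False


def call_with_budget(budget, operation, max_attempts=3):
    """Run operation(), retrying only while the shared budget allows it."""
    budget.record_request()
    last_error = None
    for attempt in range(max_attempts):
        if attempt > 0 and not budget.try_acquire_retry():
            # Budget exceeded: fail fast instead of adding load.
            raise RetryBudgetExceeded("retry budget exhausted") from last_error
        try:
            return operation()
        except Exception as exc:
            last_error = exc
    raise last_error
```

Because the budget object is shared, a dependency outage drains it once for the whole process rather than once per caller, which is what keeps retries from multiplying into a storm. An adaptive strategy (for example backoff with jitter between attempts) could be layered on top; it is left out here to keep the sketch short.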
Set tight timeouts for dependency calls
- set an SLA for your service to establish a response time budget
- dependency timeouts need to fit within the response time SLA
- the deeper your service sits in the service graph, the lower its timeout values should be
- timeouts must be much lower than 1s
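One way to keep dependency timeouts inside the response time budget is to carry a deadline through the request and derive each call's timeout from what is left. The sketch below assumes a hypothetical `Deadline` helper; the class name, the `timeout_for` method and the numbers in the usage note are illustrative only.

```python
import time


class Deadline:
    """Tracks how much of the response time budget is left (hypothetical helper)."""

    def __init__(self, budget_seconds, clock=time.monotonic):
        self._clock = clock
        self._expires_at = clock() + budget_seconds

    def remaining(self):
        # Seconds left in the budget, never negative.
        return max(0.0, self._expires_at - self._clock())

    def timeout_for(self, cap_seconds):
        """Timeout for one dependency call: its own cap or the remaining budget, whichever is smaller."""
        remaining = self.remaining()
        if remaining <= 0.0:
            # No budget left: fail fast rather than start a call that cannot finish in time.
            raise TimeoutError("response time budget exhausted")
        return min(cap_seconds, remaining)
```

A handler with a 500 ms budget might create `Deadline(0.5)` on entry and pass `deadline.timeout_for(0.2)` as the timeout of each downstream call. Forwarding the remaining budget to the callee gives services deeper in the graph a smaller timeout automatically.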
“Levers” to limit functionality and load
- when a lever is activated, do LESS, not more
- keep it simple, avoid bi-modal behaviour
- use Gizmo to trigger levers
- TEST your levers, at minimum before traffic peaks
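A lever can be as small as a named switch that removes an optional piece of work. Gizmo's interface is not described in these notes, so the sketch below uses a hypothetical in-memory `Levers` class as a stand-in; the lever name `disable_recommendations` and the function `render_product_page` are likewise invented for illustration.

```python
class Levers:
    """In-memory stand-in for whatever system actually flips levers (hypothetical)."""

    def __init__(self):
        self._active = set()

    def activate(self, name):
        self._active.add(name)

    def deactivate(self, name):
        self._active.discard(name)

    def is_active(self, name):
        return name in self._active


def render_product_page(product_id, levers, load_details, load_recommendations):
    """Build a page; the lever only ever removes work, it never adds a new code path."""
    page = {"product": load_details(product_id)}
    if not levers.is_active("disable_recommendations"):
        page["recommendations"] = load_recommendations(product_id)
    return page
```

Since the lever-on path is a strict subset of the normal path, testing it before a peak amounts to activating the lever and checking that the page still renders without the optional section.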
Bulkheads
- ex. different thread pools for different APIs
- avoid large config files that are updated by different entities
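The thread pool example can be sketched with the standard library's `concurrent.futures.ThreadPoolExecutor`. The API names `search` and `checkout`, the pool sizes and the `submit` helper are assumptions chosen for illustration.

```python
from concurrent.futures import ThreadPoolExecutor

# One pool per API: a slow or stuck "search" dependency can exhaust
# only its own workers, never the ones serving "checkout".
POOLS = {
    "search": ThreadPoolExecutor(max_workers=4, thread_name_prefix="search"),
    "checkout": ThreadPoolExecutor(max_workers=8, thread_name_prefix="checkout"),
}


def submit(api_name, fn, *args):
    """Run fn on the pool reserved for api_name and return its Future."""
    return POOLS[api_name].submit(fn, *args)
```

Note that `ThreadPoolExecutor` queues work without a limit, so a fuller bulkhead would probably also cap the number of pending tasks per pool and reject the overflow.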
Avoid bi-modal behaviour - when a service switches to a very different mode of operation once an error occurs
- ex. calling a DB directly, bypassing cache
- on failure do LESS: not more, and not something different
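As a hedged illustration of the cache example: instead of falling through to the database when the cache fails, the read path below returns a reduced default response. The names `CacheUnavailable`, `get_profile` and `default_profile` are hypothetical, and the sketch assumes the cache is filled by a separate process, so the request path never reads the database at all.

```python
class CacheUnavailable(Exception):
    """Raised by the cache client when the cache cannot be reached (hypothetical)."""


def get_profile(user_id, cache_get, default_profile):
    """Read from the cache only; on a miss or an outage serve a degraded default."""
    try:
        cached = cache_get(user_id)
    except CacheUnavailable:
        cached = None
    if cached is None:
        # Same code path whether the cache missed or failed: do less, not different.
        return {**default_profile, "degraded": True}
    return cached
```

The point of the design is that a cache outage cannot suddenly redirect all traffic to the database, a load pattern that is never exercised in normal operation.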