Conf42: Cloud Native 2021


Improving reliability using health checks and dependency management

Andrew Robinson
Principal Solutions Architect @ AWS


We can improve the reliability of services by decoupling dependencies, using health checks, and understanding when to use fail-open and fail-closed behaviours. In this session we’ll talk about and demonstrate how to implement graceful degradation, monitor all the layers of your workload to help detect failures, route traffic only to healthy nodes, use fail-open and fail-closed as appropriate in response to faults, and reduce mean time to recovery.

We’ll draw on lessons learnt from the AWS Well-Architected Framework and from the Amazon Builders’ Library, showing some of the ways Amazon builds and operates its software.
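
To make the abstract's terms concrete, here is a minimal Python sketch of three of the behaviours it names: routing only to healthy nodes (failing open if every node reports unhealthy), failing closed on a permission check, and degrading gracefully when a soft dependency fails. It is an illustration under stated assumptions, not code from the session: `Node`, `routable_nodes`, `authorize`, and `recommendations` are hypothetical names, and the health status is assumed to come from some earlier health check.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Node:
    name: str
    healthy: bool  # result of the most recent health check (hypothetical field)


def routable_nodes(nodes: List[Node]) -> List[Node]:
    """Return the nodes that should receive traffic.

    Normally only healthy nodes are returned. If every node reports
    unhealthy, fail open and return all of them, on the assumption that a
    faulty health check or shared dependency is more likely than every
    node failing at once.
    """
    healthy = [n for n in nodes if n.healthy]
    return healthy if healthy else list(nodes)


def authorize(check_permission: Callable[[str], bool], user: str) -> bool:
    """Fail closed: if the permission check itself errors, deny access."""
    try:
        return check_permission(user)
    except Exception:
        return False


def recommendations(fetch: Callable[[str], List[str]], user: str) -> List[str]:
    """Graceful degradation: if a soft dependency fails, serve an empty
    list so the rest of the page can still render."""
    try:
        return fetch(user)
    except Exception:
        return []
```

Whether a given dependency should fail open or fail closed is a judgement call: the sketch assumes availability matters most for routing and for recommendations, while an incorrect "allow" would be the costlier mistake for authorization.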

