Gone are the days when even people with no technical background knew what “Server down” meant. We have become used to uninterrupted services, and even a small lag now raises eyebrows.

As more and more mission-critical workloads move into the Cloud, High Availability (HA) has become a crucial aspect of system design. HA refers to a guarantee that a system or component stays continuously operational, or available, in service. It is usually measured relative to 100% availability. You will have seen uptime guarantees such as 99.999% availability, which allows roughly 5 minutes of downtime per year.
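
The arithmetic behind such a figure is easy to check. Here is a minimal sketch in Python; the function name `downtime_minutes_per_year` is made up for illustration, and it assumes a 365-day year with downtime counted around the clock.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def downtime_minutes_per_year(availability_percent: float) -> float:
    """Return the yearly downtime budget, in minutes, for an availability percentage."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be between 0 and 100")
    return (1 - availability_percent / 100) * MINUTES_PER_YEAR


if __name__ == "__main__":
    # Each extra nine cuts the allowed downtime by a factor of ten.
    for pct in (99.0, 99.9, 99.99, 99.999):
        print(f"{pct}% -> {downtime_minutes_per_year(pct):.2f} min/year")
```

Five nines works out to a little over 5 minutes a year, and every additional nine divides that budget by ten.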

You will also find Cloud service providers quoting guarantees of 11, 12 and even 15 nines. This is good, but the remaining downtime can still be too much for a mission-critical service.

We normally build an HA system with redundant hardware and software fault tolerance, so that it needs minimal human intervention. What the system has to tolerate comes in two kinds: Faults and Failures.

A Failure is a state in which the system does not meet its specification. A Fault is the failure of a sub-system. It can cause other sub-systems to fault and, possibly, the overall system to fail.

Faults can be transient, intermittent or permanent. The following are effective ways of dealing with faults:

  1. Forecast faults: you need mathematical models that identify the presence of a fault and its consequences. These models are often built by injecting faults and studying the resulting faults and failures.
  2. Avoid and remove faults: the system needs to go through different verification techniques that result in a stronger system.
  3. Tolerate faults: also called graceful degradation or fail-safe, these are methods through which the system degrades gracefully or is brought to a stop in a safe state.
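
To make the third point concrete, here is a minimal sketch of graceful degradation in Python. The helper `with_fallback` and its `retries` parameter are hypothetical names, not part of any library: it retries a call a few times to ride out transient faults, then falls back to a degraded answer (for example, a cached value) when the fault looks permanent.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


def with_fallback(primary: Callable[[], T], fallback: Callable[[], T], retries: int = 2) -> T:
    """Try `primary` up to `retries + 1` times, then degrade to `fallback`."""
    for _ in range(retries + 1):
        try:
            return primary()
        except Exception:
            # Assumption: any exception counts as a fault worth retrying.
            continue
    # Every attempt failed: serve the degraded result instead of failing outright.
    return fallback()


if __name__ == "__main__":
    def always_down() -> str:
        raise RuntimeError("sub-system fault")

    print(with_fallback(always_down, lambda: "cached value"))
```

A real system would usually retry only on errors known to be transient and wait between attempts; both are left out here to keep the sketch short.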

At the core of fault tolerance is redundancy: hardware, software, time and information redundancy. This gives a system high fault tolerance and, thus, high availability.
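
Why redundancy pays off can be seen with a little arithmetic. The sketch below assumes identical replicas that fail independently of each other, which real deployments only approximate, and a system that stays up as long as at least one replica is up. The function name `parallel_availability` is made up for illustration.

```python
def parallel_availability(component_availability: float, replicas: int) -> float:
    """Availability of `replicas` independent copies where one healthy copy is enough."""
    if not 0.0 <= component_availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    if replicas < 1:
        raise ValueError("need at least one replica")
    # The system is down only when every replica is down at the same time.
    return 1 - (1 - component_availability) ** replicas


if __name__ == "__main__":
    for n in (1, 2, 3):
        print(f"{n} replica(s) at 99% each -> {parallel_availability(0.99, n):.6f}")
```

Under these assumptions, two replicas at 99% each give about 99.99% and three give about 99.9999%. Shared dependencies such as power, network or a bad deployment break the independence assumption, so real gains are smaller.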

In Azure, there are three design patterns that help design a system for maximum availability:

  1. Throttling – Control the resources consumed by an application, tenant or service
  2. Queue-based load leveling – Use a queue as a buffer between a service and its tasks to smooth out load spikes
  3. Health monitoring – Expose functional checks that external tools can access
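
The first two patterns can be sketched in a few lines of plain Python. This is not an Azure SDK or the way Azure implements them: `TokenBucket` and `level_load` are hypothetical names, the token bucket is just one common way to throttle, and the queue example is a tick-by-tick simulation rather than a real message queue.

```python
import time
from collections import deque
from typing import Callable, List


class TokenBucket:
    """Throttle: allow `rate` requests per second on average, with bursts up to `capacity`."""

    def __init__(self, rate: float, capacity: int,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.rate = rate
        self.capacity = capacity
        self.clock = clock  # injectable so the behaviour can be tested without sleeping
        self.tokens = float(capacity)
        self.last = clock()

    def allow(self) -> bool:
        """Return True if the caller may proceed, False if it should be throttled."""
        now = self.clock()
        # Refill in proportion to elapsed time, never beyond the bucket size.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


def level_load(arrivals: List[int], per_tick: int) -> List[int]:
    """Queue-based load leveling: buffer bursty arrivals, process at most `per_tick` per tick."""
    queue: deque = deque()
    processed = []
    for count in arrivals:
        queue.extend(range(count))  # new requests join the buffer
        done = 0
        while queue and done < per_tick:
            queue.popleft()  # the service drains the queue at its own pace
            done += 1
        processed.append(done)
    return processed


if __name__ == "__main__":
    bucket = TokenBucket(rate=1.0, capacity=2)
    print([bucket.allow() for _ in range(3)])  # the third call exceeds the burst size
    print(level_load([5, 0, 0, 1], per_tick=2))
```

In the simulation a burst of five requests is spread over three ticks, so the service never sees more than two at a time. The price is latency for the requests that wait in the queue.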

More on this soon…