What are 9's?

by Bryon D Beilman

% of Service Availability : Downtime/Year
----------------------------------------

  • 99% : 3.65 days
  • 99.9% : 8.8 hours
  • 99.99% : 53 min
  • 99.999% : 5 min

    High availability means keeping a system, service, or network available to its end users all the time.

    The 9's are a common way of referring to how much annual uptime a given system or service is expected to provide. It is not unusual in today's markets to see service level agreements between ISPs and customers in which the ISP must deliver a stated level of availability to the customer or pay a penalty. High availability can also be very important within a corporation. If a mail server goes down, the soft cost of the lost productivity can be significant.
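
    As a rough check on the figures above, the downtime allowed by a given availability can be computed directly. The sketch below is illustrative only: the function name annual_downtime_hours is hypothetical, and it assumes a 365-day year with availability measured against every hour of that year (no excluded maintenance windows).

```python
# Convert an availability percentage into the downtime it allows per year.
# Assumption: a 365-day year; leap years and maintenance windows are ignored.
HOURS_PER_YEAR = 365 * 24


def annual_downtime_hours(availability_percent: float) -> float:
    """Return the hours of downtime per year allowed at the given availability."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be between 0 and 100")
    return (1 - availability_percent / 100) * HOURS_PER_YEAR


for pct in (99.0, 99.9, 99.99, 99.999):
    hours = annual_downtime_hours(pct)
    print(f"{pct}% -> {hours:.2f} hours ({hours * 60:.1f} minutes) per year")
```

    Under those assumptions the four levels work out to about 87.6 hours (3.65 days), 8.76 hours, 52.6 minutes and 5.3 minutes per year, which the table rounds.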

    Creating a highly available system requires work in two main areas: removing single points of failure and having solid, well-understood procedures in place for recovering from a failure.

    Where are the single points of failure?

  • Storage is a good place to start making a highly available system. Components with moving parts like hard drives have comparatively high failure rates and the low cost of storage makes mirrored RAID storage a cost-effective way of protecting against downtime as well as loss of data.
  • Redundant storage controllers should also be considered, particularly with fibre-attached devices. The failure rate of GBICs is high, and multiply attached storage can provide higher throughput as well as redundancy.
  • Network connectivity should be examined to protect against failure. Internet sites should consider having links to multiple providers. Multiple NICs and switches can be added to the environment as well.
  • High availability clustering software can be added to many systems to provide either a cold standby system that can take the place of a failed system, or multiple systems that share the workload and can assume the workload of other systems that fail.
  • Replacement parts for components likely to fail, such as disks, should be kept onsite if possible to reduce the downtime of a failed system or the potential downtime of a system running in a degraded mode.
  • A Disaster Recovery (DR) procedure should be in place and well documented so that, in the event of an outage, recovery can be made quickly. Simulated disasters should be scheduled if at all possible to work out kinks in the procedure. Multiple people who are familiar with the DR procedures need to be available in case one is unavailable.
  • Documentation should exist and be readily available to all staff who manage a highly available system. When trying to achieve 'five 9's', or 99.999% (5 min), the minutes really do count!
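
    The value of removing single points of failure can be estimated with simple arithmetic. The sketch below assumes component failures are independent, which real systems only approximate; the function names and the availability figures are hypothetical and chosen purely for illustration.

```python
from math import prod


def series_availability(components):
    """All components are required (each is a single point of failure),
    so the individual availabilities multiply."""
    return prod(components)


def redundant_availability(availability, copies):
    """Any one of `copies` identical components is enough, so the group
    is down only when every copy is down at the same time.
    Assumes the copies fail independently of one another."""
    return 1 - (1 - availability) ** copies


# Hypothetical figures, not measurements.
chain = series_availability([0.999, 0.999, 0.999])  # e.g. server, switch, link
mirror = redundant_availability(0.99, 2)            # e.g. a mirrored pair

print(f"three required 99.9% components: {chain:.4%}")
print(f"two redundant 99% components:    {mirror:.4%}")
```

    With these made-up numbers, three 99.9% components that are all required give roughly 99.7% overall, while mirroring a 99% component gives 99.99%, provided the two copies do not fail together.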