Top 5 Tests for your IT Infrastructure

Written by Bryon Beilman | Mar 31, 2014 2:37:40 AM


How confident are you in the robustness of your IT architecture? Have you designed your services to withstand a failure? When you think about what could go wrong, how many levels of failure can your design withstand? One company that has taken this concept very seriously is Netflix, with a service called "Chaos Monkey" that is part of its Simian Army. You can read about what they do, but the basic concept is that they have written a service that takes their instances offline at random during the course of the day to make sure that the overall system is indeed resilient. Many companies perform change management events during non-business hours, but Netflix brings its service nodes down in the middle of the day, so that they know failover works and are not paged at 3 am.
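The core idea can be sketched in a few lines of Python. This is only a minimal illustration, not Netflix's implementation: the function name `pick_victim`, the weekday 9-to-5 window, and the instance names are all assumptions made for the example. It only chooses a node; actually taking it offline is left to whatever tooling you already use.

```python
import random
from datetime import datetime

# Hypothetical business-hours window; adjust to when your staff is on hand.
BUSINESS_START_HOUR = 9
BUSINESS_END_HOUR = 17


def in_business_hours(now: datetime) -> bool:
    """True on Monday-Friday between the start and end hours."""
    is_weekday = now.weekday() < 5  # Monday is 0, Sunday is 6
    return is_weekday and BUSINESS_START_HOUR <= now.hour < BUSINESS_END_HOUR


def pick_victim(instances, now, rng=random):
    """Return one instance chosen at random, or None outside business hours."""
    if not instances or not in_business_hours(now):
        return None
    return rng.choice(list(instances))


if __name__ == "__main__":
    nodes = ["web-1", "web-2", "web-3"]  # made-up instance names
    victim = pick_victim(nodes, datetime.now())
    if victim is None:
        print("Outside business hours, leaving everything alone")
    else:
        print(f"Would take {victim} offline now")
```

Restricting the selection to business hours mirrors the point above: failures are induced when people are at their desks to respond, not in the middle of the night.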

If you don't have the resources or scale of Netflix, you can at least test the following components of your architecture on a regular basis.

1) Power - You bought servers with dual power supplies, and perhaps you have them on separate UPS units that are on separate circuits. Technically, you should be able to unplug a power supply or even unplug a single UPS unit, or perhaps do both at the same time, and there should be no disruption. This will confirm that you have properly balanced your servers across the UPS units and have not oversubscribed your circuits. If you don't feel comfortable doing this, then you should review your configuration and make changes.

2) Disks - RAID and hot-swap disks are the only real choices for designing your disk infrastructure. Can you pull a disk and feel comfortable that your server will continue doing the job it is supposed to do? Do you have hot spares, or at least cold spares, that will automatically or at least quickly take the place of the failed disk? You should feel comfortable with this test.

3) Restores - You can back up all day long, but what about restores? You should test restoring data at various levels. Without going into a full disaster recovery discussion, can you restore a file, a disk or a whole system? There are many methods and software packages that can help you achieve this goal, but if you can't restore what you expect, then perhaps your backups are not doing their job.

4) Server Nodes - What happens if you pull all power or the network from a server? What happens to your application stack or the service you are providing? You may not feel comfortable pulling the power (both supplies) of a live server, as you might get disk corruption, but this would be a very significant test. There are certainly ways of simulating service node failures. When we finish implementing a highly available architecture, we do just that: we simulate complete failures and failback of those nodes. If you don't do it in a controlled environment, how will you handle it when it happens at random?

5) Services - Have you designed your services in a highly available fashion? Is there a redundant service, a load balancer, a slave, or at least DNS CNAMEs so that services can be moved to other servers in case of a failure? The items mentioned in 1-4 are important, but the reality is that you are designing your IT systems for service availability, and all of the components (disk, power and servers) are really there to provide the service.
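Before pulling real cables, a tabletop version of tests 4 and 5 can be run against an inventory of which nodes back which service. The Python sketch below is a hypothetical illustration: the service and node names are made up, and `single_points_of_failure` is not taken from any particular tool. It simulates the loss of each node in turn and reports the services that would be left with no backend at all.

```python
def single_points_of_failure(service_backends):
    """Map each node to the services that go down if that node alone fails.

    service_backends: dict of service name -> iterable of node names.
    Nodes whose loss takes nothing down are omitted from the result.
    """
    all_nodes = set()
    for nodes in service_backends.values():
        all_nodes.update(nodes)

    report = {}
    for failed in sorted(all_nodes):
        # A service is lost when the failed node was its only backend.
        down = sorted(
            service
            for service, nodes in service_backends.items()
            if set(nodes) == {failed}
        )
        if down:
            report[failed] = down
    return report


if __name__ == "__main__":
    # Made-up inventory for illustration only.
    inventory = {
        "web": ["node-a", "node-b"],
        "dns": ["node-a", "node-c"],
        "mail": ["node-b"],
    }
    for node, services in single_points_of_failure(inventory).items():
        print(f"Losing {node} takes down: {', '.join(services)}")
```

With this sample inventory the only finding is that losing node-b takes down mail, since mail has no second backend. An empty report does not prove that failover works; it only tells you which physical tests are worth scheduling first.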

Chaos Monkey is a good idea, so maybe you should build your own simian army.
