by Bryon D Beilman
"I Am Jack's Complete Lack of Surprise". This is a quote from Fight Club, the cult film (and book), which features many sayings that start with "I am Jack's ...". This particular quote gained traction for me during a recent network outage that we worked on for a customer. Why was I surprised, but then later not surprised? Because someone did not follow the basics of change management, and it caused quite a few issues.
We received alerts that certain sites were down. While debugging, we noticed that the internal monitors could not contact the outside world either, which pointed to a much bigger issue. This customer's revenue depends on serving content from their sites, so they have iuvo take care of their servers and inside networks, but use their data center provider to manage the firewall, IDS and external network connections with a dedicated team of security professionals.
As much as I would love to bash them, because this long event literally took an 18 hour period out of our team's schedule, I want to focus on what went wrong and let this be a lesson to others. The cause of this network issue boils down to the following sequence of events.
- A Firewall Team Engineer is "cleaning up unused NAT rules" following a weekend upgrade. He then leaves at 7pm. There is no notification and nothing documented about this change.
- Alerts go off and the Firewall Ops team is contacted, but nobody knows why it is broken. Although they are "skilled", all they can see is that there are issues, so they make some changes to fix it. This fixes some things, but it is still not correct.
- The firewall team thinks there is a firewall hardware issue and works to fail over to the secondary, but a series of events, shift changes and a lack of remote hands extends the outage time.
- Finally, the original engineer comes back online and has a theory about it being something he might have changed. He changes it back and 95% is back. The remaining 5% is fixed by us asking other questions, through which we discover that during the upgrade they had turned on something that should have had no negative effects, but it turns out it does.
I have given technical talks on Change Management and I will admit that it is difficult to keep people engaged, because it is not that interesting and much of it is common sense. But so many companies and teams do this so poorly. Strangely, the contract and SLA process for this managed security service requires the customer to submit changes to them as a change control event, which has to be reviewed and recorded before anything is changed. It appears that the reverse does not apply: the security team is not required to get the customer's approval for the changes it makes.
Here are three things that will hopefully help you avoid this type of issue.
- Communicate, Communicate, Communicate: Propose the change and the backout plan, get approval (or at least give someone the chance to see what is going to change), then communicate what you did and how.
- Document, Document, Document: Some configurations can become very complex. Keeping them well documented not only lets others properly manage them, it also helps others understand design constraints or why things were done, so that future changes can start with the right baseline.
- Contract: If you enter a contract with a vendor that has a change process for you to make requests, make sure that they are also required to get your approval for changes and that there are penalties for them not following the process.
In this case, if the shift engineers knew what had changed, when and why, they could have easily reverted the change with much less of an impact. If they had documented what was going to change during the upgrade (turning on a new feature), then they would have received feedback and it would have been caught before it was rolled into production.
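To make "what had changed, when and why" concrete, here is a minimal sketch of a change log that an on-call engineer could query during an outage. Everything in it is an assumption for illustration: the `ChangeRecord` fields, the helper names and the sample dates are hypothetical and do not describe the vendor's actual tooling.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional


@dataclass
class ChangeRecord:
    """One entry in a hypothetical change log (field names are illustrative)."""
    made_at: datetime                  # when the change was applied
    author: str                        # who made it
    summary: str                       # what changed
    reason: str                        # why it changed
    backout_plan: str                  # how to revert it
    approved_by: Optional[str] = None  # None means nobody signed off


def recent_changes(log: List[ChangeRecord], since: datetime) -> List[ChangeRecord]:
    """Return changes made at or after `since`, newest first."""
    hits = [c for c in log if c.made_at >= since]
    return sorted(hits, key=lambda c: c.made_at, reverse=True)


def unapproved(log: List[ChangeRecord]) -> List[ChangeRecord]:
    """Return changes that were applied without any recorded approval."""
    return [c for c in log if c.approved_by is None]


if __name__ == "__main__":
    # Sample entries with made-up dates and names, loosely modeled on the outage above.
    log = [
        ChangeRecord(datetime(2024, 1, 6, 9, 0), "upgrade-team",
                     "Firewall upgrade, new feature enabled",
                     "Scheduled upgrade", "Disable the feature", "customer"),
        ChangeRecord(datetime(2024, 1, 8, 18, 30), "engineer-a",
                     "Removed unused NAT rules",
                     "Post-upgrade cleanup", "Restore rules from saved config"),
    ]
    # First question during an outage: what changed recently, and how do we back it out?
    for change in recent_changes(log, since=datetime(2024, 1, 8)):
        print(change.made_at, change.author, change.summary, "->", change.backout_plan)
    # Second question: did anything go in without approval?
    for change in unapproved(log):
        print("No approval recorded:", change.summary)
```

Even a log this simple would answer the first question the shift engineers had, and the `backout_plan` field tells them how to revert without waiting for the original author.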
Change Management is a required weapon in your arsenal for running IT environments. Insist that your vendors do the same.