When I started writing this article, Facebook had been unavailable to the entire world for over four hours. Not intermittent, not “articles didn’t load”, just non-existent. As in, if you typed facebook.com, your browser would spin for a while and then say “Server not found” or similar. Facebook managed to wipe themselves off the internet. It’s kind of refreshing, like a mini social media vacation. It came back a while later, but five to six hours is a lifetime on the internet and can result in significant lost revenue.
The ultimate reason for this happening is complicated, but it boils down to this… have you heard the idiom, “Don’t put all your eggs in one basket”? This is especially true of the IT services that make your email work, your website visible, or your mobile app function. The more technical term is “Single Point Of Failure”, often abbreviated as SPOF. This means that somewhere along the chain of communication there is a single point which, if it were to fail, would take the entire service down with it. Say you run your web server on a traditional system, but there is only one system – that is a SPOF. If that system is down, your website is down. And if it fails for an abnormal reason, you can be scrambling to get it working again, and that can take hours – or even days. Perhaps you have multiple servers, but they are behind a single load balancer, which is behind a single firewall – the load balancer and the firewall are both SPOFs, and because there are two of them in the chain, the odds of your service going offline increase.
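To put rough numbers on that last point, here is a minimal sketch of the arithmetic. It assumes each component fails independently and is up 99.9% of the time; both the independence and the 99.9% figure are made up for illustration, and the function names are my own.

```python
from math import prod


def chain_availability(availabilities):
    """Availability of components in series: every one must be up at once."""
    return prod(availabilities)


def redundant_availability(availability, copies):
    """Availability of independent redundant copies: at least one must be up."""
    return 1 - (1 - availability) ** copies


# Hypothetical figure: each device is up 99.9% of the time.
LOAD_BALANCER = 0.999
FIREWALL = 0.999

# One load balancer behind one firewall: two SPOFs in a row.
single_chain = chain_availability([LOAD_BALANCER, FIREWALL])

# The same chain with two independent copies of each device.
redundant_chain = chain_availability([
    redundant_availability(LOAD_BALANCER, 2),
    redundant_availability(FIREWALL, 2),
])
```

Under those assumptions, the single chain comes out at about 99.8%, so stacking two SPOFs roughly doubles the expected downtime compared with one, while doubling up each device pushes the chain well past what a single device offers on its own.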
Hardware is typically what people consider when thinking about avoiding SPOFs, but services can be susceptible as well. This article is about one particular service on which all other internet services depend; if it is unavailable, then none of them are available. What is this superservice whose fragility can let your company live or die? The Domain Name System, or DNS.
Most of you probably don’t think about your DNS much, if at all. If you are a small company, you probably registered your domain name with someone like GoDaddy, and just followed some instructions on setting up your services and you were done. And I’m not trying to scare you into thinking about it all the time…but you should at least make sure your DNS service is with a reputable provider.
Just to illustrate how important DNS is, the internet doesn’t work on names, it works on numbers – everything is addressed by a 32-bit (IPv4, the usual case) or 128-bit (IPv6) address. Think of it like the phone number of the local pizza place. If you call it enough you might remember it, but you don’t need to, as you can look it up by name in a directory. We humans remember names better than long numbers. DNS is the directory that translates names into those numbers. If I want to go to www.facebook.com, my computer asks a local DNS server what the address of www.facebook.com is. If that server doesn’t already know, it looks up where the DNS servers for facebook.com are and asks them directly. If I send an email to an employee at facebook.com, my mail server finds the facebook.com mail server to deliver it to in the same way.
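If you want to see that name-to-number translation for yourself, here is a small sketch using Python’s standard socket module. The function name resolve is my own; it simply hands the name to whatever resolver your operating system is configured to use, which is the “local DNS server” step described above.

```python
import socket


def resolve(hostname):
    """Return the sorted, de-duplicated IP addresses the system resolver
    reports for hostname (IPv4 and/or IPv6, depending on your network)."""
    infos = socket.getaddrinfo(hostname, None, proto=socket.IPPROTO_TCP)
    # Each entry is (family, type, proto, canonname, sockaddr);
    # the address string is the first element of sockaddr.
    return sorted({info[4][0] for info in infos})


# Example usage (needs network access, so it is left commented out):
# print(resolve("www.facebook.com"))
```

If DNS for a domain is unreachable, as it was for Facebook, this call raises socket.gaierror instead of returning addresses. That is the programmatic version of “Server not found”.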
Because of the importance of DNS, DNS services are a popular target of Denial of Service (DoS) attacks – the attackers attempt to overwhelm the DNS service so that it is unable to serve legitimate requests. The better DNS providers already have ways to mitigate such attacks.
What Happened at Facebook?
Back to Facebook. Facebook hosts its own DNS servers. They had four of them, so the servers themselves were redundant. Suddenly, all four disappeared. A lot of chatter on Twitter assumed that they must be under a DoS attack – but that wasn’t it. It was something far simpler. The routing information describing how to reach the network those DNS servers were located on was removed, so every system on the internet had no idea how to get to them. This not only meant you couldn’t read your friend’s posts about their vacation, but you couldn’t email anyone at Facebook either.
If they had kept at least one DNS server available, they could potentially have pushed an update pointing to a server set up somewhere else that at least said “Down for maintenance” or something like that. And email could potentially still have worked, if it wasn’t also hosted on servers affected by the outage. Ideally, they could have avoided at least part of the outage by running at least one DNS server on some other service that couldn’t be affected by any sort of internal issue. While they could potentially update their DNS info to add a new DNS server, it can take many hours to a couple of days for that information to propagate throughout the Internet. Quite a few years ago Microsoft suffered an outage of microsoft.com when they only had two DNS servers, and they happened to be housed on the same internal network in the same datacenter – and the network went down. When they came back, they moved the DNS servers to different locations. I started calling this “The Microsoft Lesson”.
Proactive Steps with your DNS
So, what can you do to make your DNS more resilient? As I mentioned earlier, if you are using a reputable DNS provider, you are generally covered. There should be a minimum of two servers, and although more is better, you don’t need to go crazy – there are rarely more than four. If you aren’t sure what your DNS servers are, you can use a ‘whois’ lookup tool like https://www.namesilo.com/whois and enter your domain name. You should see a section with several “Name Server:” entries – those are your name servers.
As an extra step, you should look up the IP address of each name server and compare them. You can do that with this tool: https://mxtoolbox.com/DnsLookup.aspx or the “nslookup” command-line tool in most operating systems. You will likely get a 32-bit address returned in the “A.B.C.D” notation, like “192.168.75.15”. It’s the first three numbers we want to compare. The more different they are, the better. They don’t need to be wildly different, just different. Also, the left-most numbers are more significant here – a difference in the first number is better than a difference in only the third. This is because as you go from left to right, you are zooming in on the network that the server is located on. If the first three numbers are the same, then the servers are very likely located on the same network, which means that the network itself is a SPOF, and if all your DNS servers are there, you have a greater likelihood of DNS-related outages. This isn’t a perfect check – servers that differ only in the third number could still be sufficiently separated. Most DNS providers will have them on separate networks already. The better ones will be regionally distinct as well.
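If you have more than a couple of name servers, comparing the numbers by eye gets tedious. Here is a minimal sketch of the same check in Python. The function names are my own, and the addresses in the example are reserved documentation addresses, not real name servers. It is the same rough heuristic as above, nothing more.

```python
import ipaddress
from itertools import combinations


def shared_leading_octets(addr_a, addr_b):
    """Count how many of the four numbers match, reading left to right."""
    a = ipaddress.IPv4Address(addr_a).packed
    b = ipaddress.IPv4Address(addr_b).packed
    count = 0
    for x, y in zip(a, b):
        if x != y:
            break
        count += 1
    return count


def likely_same_network(addresses):
    """Return the pairs whose first three numbers are identical."""
    return [
        (a, b)
        for a, b in combinations(addresses, 2)
        if shared_leading_octets(a, b) >= 3
    ]


# Hypothetical name server addresses (reserved documentation ranges).
name_server_ips = ["192.0.2.10", "192.0.2.20", "198.51.100.5", "203.0.113.7"]
suspicious_pairs = likely_same_network(name_server_ips)
```

Any pair that shows up in suspicious_pairs is worth a closer look. An empty list doesn’t prove the servers are well separated, but it is a good sign.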
Some providers, such as Amazon Web Services’ Route 53 service and Azure’s DNS service, will place the servers in different Top-Level Domains (TLDs) – as in one server will be in .COM, one in .NET, etc. – which means they are even listed in separate top-level DNS databases.
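You can check this kind of spread from the “Name Server:” entries in your whois output. A quick sketch follows; the function name and the server names are made up for illustration, and it only looks at the last label of each name.

```python
def name_server_tlds(name_servers):
    """Return the set of top-level domains used by the given name server names."""
    return {
        ns.rstrip(".").rsplit(".", 1)[-1].lower()
        for ns in name_servers
    }


# Hypothetical name servers, one per TLD.
example_servers = [
    "ns-1.example.com",
    "ns-2.example.net",
    "ns-3.example.org",
]
tlds = name_server_tlds(example_servers)
```

If the resulting set has only one entry, all of your name servers hang off the same TLD. That isn’t necessarily a problem, but it is one more basket holding all of the eggs.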
If you are hosting your own DNS servers, keep all of the redundancy factors in mind. Don’t place all the DNS servers on a single network, behind a single firewall or router, etc. Even if those elements are themselves redundant via High Availability services, they can still fail to pass traffic for a variety of reasons such as configuration errors, upstream failures, etc. Nothing is foolproof. You should have at least two separate networks, and if possible, different providers. If you must, spin up a cheap VM with some other provider and make it one of your DNS servers.
Are you interested in preventing your own SPOF? We at iuvo Technologies will be happy to help. Contact us today.