Downtime is not a matter of IF, but a matter of when...

2018-12-05

Survival in the Cloud: Downtime is not a matter of IF, it is a matter of WHEN

Jonathan - Engineering

A great thinker once said: "The Internet is held together with bubble gum and duct tape." And that is so true.

One thing we like to educate our clients on is the subject of downtime. Downtime can be greatly reduced, but never eliminated entirely.

First, do not believe any claim of 100% uptime. It is not possible. Even one second of downtime makes that claim mathematically false, and outages of a few seconds or a few minutes, here and there, do happen.
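To put numbers on that, here is a minimal sketch of the arithmetic. The function names are ours, made up for illustration, and the year is assumed to be 365 days long.

```python
# Assumption: a 365-day year, ignoring leap years.
SECONDS_PER_YEAR = 365 * 24 * 60 * 60


def availability_percent(downtime_seconds, period_seconds=SECONDS_PER_YEAR):
    """Uptime percentage for a given amount of downtime in a period."""
    return 100.0 * (1.0 - downtime_seconds / period_seconds)


def downtime_budget_seconds(availability_pct, period_seconds=SECONDS_PER_YEAR):
    """Seconds of downtime allowed in a period by an uptime percentage."""
    return period_seconds * (1.0 - availability_pct / 100.0)


if __name__ == "__main__":
    # Even tiny outages pull the figure below 100%.
    for label, seconds in [("1 second", 1), ("2 minutes", 120), ("4 hours", 4 * 3600)]:
        pct = availability_percent(seconds)
        print(f"{label} of downtime in a year -> {pct:.6f}% uptime")

    # The reverse view: how much downtime a given promise still allows.
    for target in (99.9, 99.99, 99.999):
        minutes = downtime_budget_seconds(target) / 60
        print(f"{target}% uptime allows about {minutes:.1f} minutes of downtime per year")
```

The second loop is the more useful one when reading a service agreement: any figure below 100% translates into a concrete downtime allowance per year.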

Why do those happen? There are thousands of reasons. Here are a few examples:

  • Someone digging a hole in the street for utility work hits a fiber optic line and cuts it in half. Although data centers have multiple fiber lines running in different directions, that single cut stops all traffic on that line for at least a minute or two, until the traffic is rerouted down another path. The rerouting is not instantaneous. There, you just experienced a minute or two of downtime. There are, on average, 17,000 accidental fiber line cuts in the US every year.
  • Another type of fiber cut is the intentional one, such as the deliberate fiber cuts reported in California.
  • Another example: Hurricane Florence, which hit Virginia in 2018, cut off power to a small room that housed the fiber optic lines for Amazon's AWS Direct Connect. All data centers in the Ashburn, Virginia "data center valley" lost access to Amazon's compute nodes, and thousands of major websites went offline.
  • In 2009, a fire broke out in a building near a data center and cut off power to the data center. That in itself isn't a problem, as the data center had a power generator on-site. However, when the fire department crew started using the fire hydrants, water pressure dropped and the generator no longer received the water it needed to cool itself. The generator shut itself down, turning off the entire data center.
  • In 2012, Hurricane Sandy hit New York City, pushing ocean water into the streets. The water flooded telecom basements all over the city, completely destroying them and cutting off Internet access to major data centers. One data center in the area lost power completely, because its power plant was in the basement and ended up under water.
  • Let's go back further to a tragic moment in our history. During the September 11, 2001 attacks on New York's World Trade Center, hundreds of thousands of square feet of data center space were destroyed. Not only that, but many clients in these data centers were keeping their data backups at a data center across the street, which was also destroyed.
  • In February 2017, Amazon's S3 service in its US-EAST-1 (Northern Virginia) region went offline for several hours due to human error. Thousands of websites malfunctioned during that time.
  • Microsoft's Azure went offline for 7 hours in Europe due to human error, when technicians doing maintenance on the fire suppression systems accidentally activated them, which automatically initiated a power shutdown.
  • In April 2016, Google Compute Engine lost external connectivity in every region around the world at the exact same time, for 18 minutes, due to a human error that spread to all of Google's network routing equipment on a global scale.

And there are hundreds of other stories of downtime, every single year. And all of these data centers promised 100% uptime.

Anything that is man-made will fail. Whether it is your car's engine, that old TV you have, your phone, your server's memory module or hard disk, a data storage cluster, a network switch, or a data center power unit, it WILL fail sooner or later.

Whether it is tomorrow, or ten or thirty years from now, they all fail. Nothing electronic or mechanical lasts forever.
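A rough way to see why failure becomes a certainty at scale is to compute the chance that at least one part out of many fails. The sketch below assumes that parts fail independently and at a constant yearly rate, which real hardware does not strictly follow, and the 2% yearly failure rate is a made-up figure for illustration, not a measurement.

```python
def prob_at_least_one_failure(annual_failure_rate, components, years=1.0):
    """Chance that at least one of `components` identical parts fails
    within `years`, assuming independent failures at a constant rate."""
    # Probability that a single part survives the whole period.
    survive_one = (1.0 - annual_failure_rate) ** years
    # All parts must survive for there to be no failure at all.
    return 1.0 - survive_one ** components


if __name__ == "__main__":
    rate = 0.02  # hypothetical: 2% of parts fail per year
    for count in (1, 10, 100, 1000):
        p = prob_at_least_one_failure(rate, count)
        print(f"{count:>4} parts, 1 year: {p:.1%} chance of at least one failure")
    for years in (1, 10, 30):
        p = prob_at_least_one_failure(rate, 1, years)
        print(f"1 part, {years:>2} years: {p:.1%} chance it has failed")
```

Under these assumptions a single part looks reliable in any given year, but the odds shift quickly as the number of parts or the number of years grows.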

Companies like us have comprehensive replacement plans to swap out equipment once it reaches a certain age, or when we suspect something may be failing. And when that happens, there is downtime during the swap.

Not only that, but there is also the element of human error. Whether it is you, one of your system administrators, or one of ours, there is always the possibility that someone pushes the wrong button (remember when Amazon's S3 service went offline for several hours?).

If you are operating in the cloud, you need to be ready. For example, you can't run the same operating system for, say, ten years. Why? Because its developer will no longer support it, and you leave yourself open to security breaches. You can't run a server on the same hard disk forever: the hard disk will fail someday, as it has moving mechanical parts. Even SSDs fail...

Never think of downtime as a matter of IF, but as a matter of WHEN, and be prepared.