Perform controlled test reboots and failure drills to make sure your infrastructure recovers properly when something breaks.

2018-11-27

Survival in the cloud: controlled test reboots are necessary to avoid heart attacks; schedule them with our staff.

Jonathan - Engineering

A client recently asked us to add an 8TB hard disk to his dedicated server while the server was live. That was no problem, as this client's dedicated server supports hot-swapping disks (we can install and remove disks via a front-facing ejection mechanism).

We do, however, always recommend rebooting the server after a disk change.

The client, however, did not want to reboot his server for this, which we understand, nor did he wish to give us the root password so that we could do the necessary configuration. Instead, we walked the client through how to add the new disk himself.

The problem came months later: one of the client's admins had to reboot the server in the middle of an emergency, and the dedicated server did not boot back up. The admin did not have the root password, so we could not log in and check.

All that time, the client's admin thought the server had suffered a disk failure. He was nervous and scared.

Ultimately, we discovered that the issue was simply a typo in the /etc/fstab file, made when the client configured the new 8TB disk to mount automatically at boot.
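A typo like this can often be caught before the next reboot. The following is only a minimal sketch in Python, assuming the conventional fstab layout of four to six whitespace-separated fields per entry; the function name check_fstab and the sample entries are hypothetical and are not the client's actual configuration. It catches only a few obvious mistakes and does not replace a real test reboot. On systems with a recent util-linux, `findmnt --verify` performs a more thorough check.

```python
def check_fstab(text):
    """Return (line_number, message) pairs for fstab lines that look malformed."""
    problems = []
    for lineno, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # blank lines and comments are fine
        fields = line.split()
        if not 4 <= len(fields) <= 6:
            problems.append((lineno, f"expected 4 to 6 fields, found {len(fields)}"))
            continue
        mount_point = fields[1]
        # Mount points are absolute paths; swap entries use "none" or "swap".
        if not (mount_point.startswith("/") or mount_point in ("none", "swap")):
            problems.append((lineno, f"suspicious mount point {mount_point!r}"))
        # The optional fifth and sixth fields (dump, pass) must be numeric.
        for name, value in zip(("dump", "pass"), fields[4:]):
            if not value.isdigit():
                problems.append((lineno, f"{name} field {value!r} is not a number"))
    return problems


# Hypothetical sample: line 2 is missing the leading "/" on the mount point.
sample = """\
UUID=1111-2222 /      ext4 defaults 0 1
/dev/sdb1      data   ext4 defaults 0 2
"""
for lineno, message in check_fstab(sample):
    print(f"line {lineno}: {message}")
```

To check a real server, pass the contents of /etc/fstab to check_fstab instead of the sample string.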

After any major change, it is very important to do a controlled reboot. It is better to correct problems during a scheduled maintenance window than to go through extended downtime because proper reboot testing was not done.

Test reboots are also important for making sure that your application, and everything it depends on, is set to start automatically at boot.
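Before the test reboot, it can help to list which services are not set to start at boot. The sketch below assumes a Linux server that uses systemd, where `systemctl is-enabled` reports a unit's state; the unit names and the helper names are hypothetical. Anything not reported as "enabled" is listed for a human to review, since some states (such as "static") can still be legitimate.

```python
import subprocess


def systemctl_is_enabled(unit):
    """Return the state string printed by `systemctl is-enabled <unit>`."""
    result = subprocess.run(
        ["systemctl", "is-enabled", unit],
        capture_output=True, text=True, check=False,
    )
    return result.stdout.strip() or "unknown"


def units_to_review(units, query=systemctl_is_enabled):
    """Map each unit that is not plainly 'enabled' to the state reported for it."""
    review = {}
    for unit in units:
        state = query(unit)
        if state != "enabled":
            review[unit] = state
    return review


if __name__ == "__main__":
    # Hypothetical unit names; replace with the services your application needs.
    needed = ["nginx.service", "postgresql.service", "myapp.service"]
    for unit, state in units_to_review(needed).items():
        print(f"{unit}: {state}")
```

The query parameter exists so the check can be exercised without systemd, for example by passing a function that returns canned states.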

When you are ready to do a controlled reboot test, inform our staff of the date and time you intend to do this, and they will be ready in case something goes wrong.