When was the last time you tested your backup? Have you ever run a drill to see how long it would take you to recover? Recently, I was headed out on vacation when a client had a disaster: their SAN crashed. After calling the SAN vendor, we discovered the data was corrupted.
The next step was to start restoring data. It was a perfect test of the disaster plan. We were lucky that the SAN crashed on a Friday, which gave us most of the weekend to recover. In this scenario we had about 15 servers to restore, and we were able to restore all of them except one.
This event got me thinking about how many people could have recovered this quickly, so here are some tips on why it went so well.
Bare Metal Backups.
This feature alone saved a great deal of time. We were able to boot from a CD and restore the C:\ drive of every server, which spared us from reloading the OS and redoing all of the patching and configuration. That saved hours, if not days. One thing we could have done better is to make sure the ISO files had already been created for each server. Bare metal restore is an important feature that every backup system should have.
Disk Based Backup.
After the bare metal backups were restored, we then proceeded to restore the data. Since all of it was online and available, we did not have to worry about which tape it was on. We had the ability to restore both Microsoft Windows databases and files. We could also run up to 8 restore jobs at a time, which saved considerable time.
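To illustrate the idea of running several restores at once, here is a minimal Python sketch. It is not how any particular backup product works; restore_fn is a hypothetical callable standing in for whatever command or API your backup software exposes, and the limit of 8 simply mirrors the number mentioned above.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

# Assumption: 8 matches the limit described above; adjust for your product.
MAX_PARALLEL_RESTORES = 8


def run_restores(jobs, restore_fn, max_parallel=MAX_PARALLEL_RESTORES):
    """Run restore_fn(job) for every job, at most max_parallel at a time.

    restore_fn is a hypothetical wrapper around your backup software.
    Returns a dict mapping each job to "ok" or "failed: <reason>".
    """
    results = {}
    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        # Submit everything; the pool itself enforces the concurrency cap.
        futures = {pool.submit(restore_fn, job): job for job in jobs}
        for future in as_completed(futures):
            job = futures[future]
            try:
                future.result()
                results[job] = "ok"
            except Exception as exc:
                # Record the failure and keep going with the other restores.
                results[job] = f"failed: {exc}"
    return results
```

The point of collecting failures instead of stopping is that one bad restore should not hold up the other servers; you can come back to it once everything else is running.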
Since most of these servers were virtual, I did not need to be onsite. I could do the entire rebuild remotely. I did most of it from at least two states away over a VPN connection, with no one onsite. I was able to create the virtual machines and mount the ISO file to begin the bare metal restore.
Even with the SAN down, the VMware servers had enough disk space to hold all of the virtual machines.
Since the SAN is the weak link with VMware, we had designed the network to include some redundant systems. First and foremost, we made sure there were redundant physical servers for key systems. Microsoft Exchange, the Windows file servers and the domain controller services each had a redundant server, which allowed the client to maintain most of its business-critical functions. A few departments had applications that were unavailable, but that affected only a handful of users, and those applications did not warrant a redundant system.
Redundancy is also an important part of a good overall plan for disaster recovery and business uptime. If you can create redundant systems in a different physical location, you can save yourself stress and your company downtime.
The first lesson is to make sure you test your backup system and confirm that all of your systems can be recovered. We had one system that did not restore correctly, and we chose to rebuild it rather than restore it. We had to get the vendor on the phone to get it corrected.
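One simple way to make a test restore more than a formality is to compare what came back against a checksum manifest taken from the original. The sketch below is only an illustration under that assumption; the function names are made up for this example, and it covers plain files only, not databases or system state.

```python
import hashlib
from pathlib import Path


def sha256_of(path):
    """Return the SHA-256 hex digest of a file, read in 1 MB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Map each file's path (relative to root) to its checksum."""
    root = Path(root)
    return {
        path.relative_to(root).as_posix(): sha256_of(path)
        for path in root.rglob("*")
        if path.is_file()
    }


def verify_restore(manifest, restored_root):
    """Compare a saved manifest against a restored directory tree.

    Returns (missing, changed): files absent from the restore, and files
    whose contents differ from the manifest.
    """
    restored = build_manifest(restored_root)
    missing = sorted(set(manifest) - set(restored))
    changed = sorted(
        name for name in manifest
        if name in restored and manifest[name] != restored[name]
    )
    return missing, changed
```

You would build the manifest while the server is healthy, store it somewhere other than the SAN, and run verify_restore after each drill.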
The second lesson is to make sure you have an accurate list of server names and IP addresses. We had a few servers whose IP addresses were documented incorrectly, and it caused issues during the recovery process.
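A documented list is only useful if it is checked before the disaster. As a rough sketch, assuming your inventory can be expressed as a mapping of hostname to documented IP address, a few lines of Python can flag entries that no longer match what name resolution returns. The function name and the resolve parameter are illustrative, not part of any tool we used.

```python
import socket


def check_inventory(inventory, resolve=socket.gethostbyname):
    """Compare documented IPs against what name resolution returns.

    inventory: dict of hostname -> documented IP address.
    resolve:   callable taking a hostname and returning an IP string;
               defaults to a normal DNS lookup.
    Returns a list of (hostname, documented_ip, actual_ip) for every
    mismatch; actual_ip is None when the name did not resolve.
    """
    mismatches = []
    for host, documented_ip in sorted(inventory.items()):
        try:
            actual_ip = resolve(host)
        except OSError:
            # socket.gaierror is a subclass of OSError.
            actual_ip = None
        if actual_ip != documented_ip:
            mismatches.append((host, documented_ip, actual_ip))
    return mismatches
```

Run on a schedule, a check like this catches drift in the documentation while the servers are still up to ask. Keep in mind that DNS can be wrong too, so a mismatch tells you to investigate, not which side is correct.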
The third lesson has to do with Domain Controllers. Windows domain controllers have a feature that detects what is called USN rollback.