How to quickly recover from disaster

This post is not about Disaster Recovery in the sense of switching to a second datacenter. Nor is it about how to restore your VMDKs or data. Oh tell me Gabe, what is it about then?

Well, this is about the practical issues you run into during those stressful moments right after a disaster has happened, with some tips on how to take some of the stress away. As some of you might have seen on Twitter, last week wasn’t the best week I’ve experienced with ESX. But apart from a lot of stress and troubleshooting, it also taught me and my colleague Arnim van Lieshout some wise lessons. There is a summary at the bottom if you don’t want to read it all.

Last week we had a whole cluster fail on us. Six ESX hosts gave a PSOD, all at the same time. At this time we suspect some problems in the SAN fabric, but that is not relevant to this post. What is relevant is that 200 VMs in that cluster went down, including our Virtual Center. When we received the first calls from the monitoring desk telling us that ESX servers had gone down, we didn’t have Virtual Center to find out what was happening. Because we run Virtual Center as a VM and forgot to exclude it from DRS, we didn’t know which host VC was on. So we lost about 15 minutes searching for our VC VM. When we finally found it, the VC VM turned out to be hanging on a PXE boot. A quick reset of the VM solved this and VC was up and running 2 minutes later.

Once we had Virtual Center up and running and 3 managers breathing down our necks, we started acting. But we should have thought first. We both started powering on VMs in groups of 10. This (of course) made Virtual Center very busy. And because DRS was active on that cluster, a lot of VMs got migrated before they were powered on or during the first power-on phase; I’m not sure when DRS does this. But anyway, there were suddenly a lot of power-on actions and migrations going on. This of course caused Virtual Center to react very slowly to our subsequent commands. There is nothing more stressful in a stressful situation than a GUI that doesn’t respond, so you can’t continue starting up other VMs.

The next problem we had to face was figuring out which VMs should be left powered off. For some reason you always have VMs “lying” around that shouldn’t be started. The templates are easy, but we had to be careful selecting the other VMs, which made it much more time-consuming than needed. Besides the problem of which VMs to start, we also didn’t know which to start first. In a good disaster recovery plan, you should always have a list of all servers and their priority, but well, you know…..

For a quick implementation, we decided to create a subfolder at the top level of the “Virtual Machines & Templates” view named “PowerOff”. We moved all VMs that should stay powered off there. Now you can click on any other folder, select all VMs and power them on. This is not a permanent solution. For our final implementation we decided to create a custom field in VC in which the priority of a VM is entered. We would number the priorities the way DNS does for its records: instead of 1, 2, 3, etc. we would use 10, 20, 30, etc., and 99 could then mean: leave off. A custom field would of course also be an excellent fit for a PowerShell script, which would start the VMs based on their priority field. And then, once VC is very busy powering on all VMs, leave it alone. Get coffee or just watch VC do its work. But don’t try to do other things on VC at the same time.
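To make the priority idea concrete, here is a minimal sketch of just the ordering logic, written in Python rather than the PowerShell script mentioned above. It assumes the priority values have already been read from the custom field into (name, priority) pairs; how they are read from VC is not shown, and the VM names are made up for illustration.

```python
from collections import defaultdict

LEAVE_OFF = 99  # the "leave off" value proposed in the text


def plan_power_on(vms):
    """Group VM names by priority, lowest number first, skipping LEAVE_OFF.

    `vms` is an iterable of (name, priority) pairs. In practice the
    priority would come from the custom field in VC (not shown here).
    """
    waves = defaultdict(list)
    for name, priority in vms:
        if priority == LEAVE_OFF:
            continue  # this VM must stay powered off
        waves[priority].append(name)
    # One (priority, [names]) entry per priority level, in start order.
    return [(p, sorted(waves[p])) for p in sorted(waves)]


if __name__ == "__main__":
    # Hypothetical inventory, for illustration only.
    inventory = [
        ("dns01", 10), ("dc01", 10), ("sql01", 20),
        ("web01", 30), ("old-test", 99),
    ]
    for priority, names in plan_power_on(inventory):
        print(priority, ", ".join(names))
```

Each entry in the result is one group of VMs that can be started together, so the actual power-on script only has to walk the list from top to bottom.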

Then, once VC is not so busy anymore, re-enable DRS and check HA. Now start collecting diagnostic logs, because we want to know what happened. In VC, export diagnostics for your environment and prepare to upload them to VMware. Because you want to start investigating yourself and not just wait for support to call back, log on to the console of the ESX hosts and collect the /var/log/vmkernel logs. If you suffered a PSOD, check your /root folder for any vmkernel dump files and collect them. You can also extract a log from the dump with “vmkdump -l <dumpfile>”, which gives you info on the last few seconds before the PSOD. While we were doing this, it became clear to us that we had been missing a lot of warnings in the week before. We are now looking for monitoring tools that can monitor more than just uptime pings.

To summarize, these are the things we are going to change within the next few weeks:

  • Exclude the VC VM from DRS by setting it to “disable”. This is best practice according to VMware’s white papers, but we forgot to set it.
  • We agreed on one ESX host (ESX050) on which VC will be running ALWAYS. Only when this host is down for maintenance, VC will be moved to ESX051. After ESX050 is up again, VC will be VMotioned back to ESX050.
  • After such a massive failure, we want to power on all VMs as soon as possible and not have DRS migrate them. So we first disable DRS.
  • We want to boot as many VMs as possible, quickly and without a difficult selection process. Therefore we use separate folders for now.
  • In the future we will implement a custom Virtual Center field at VM level that holds a priority value (10, 20, 30, …, 99) and have the VMs started by a PowerShell script.
  • After VC has received the power-on command for all VMs, its task queue will be very long. Therefore we sit and wait for VC to work through all tasks in its queue.
  • Find a good tool that monitors not only uptime but also the health of the ESX hosts.
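The “send the commands, then leave VC alone” rule could also be applied per priority group instead of once for all VMs. The sketch below is a variation on the list above, not something described in this post as already built: it starts one group, waits until the task queue is empty, and only then starts the next. The `power_on` and `pending_tasks` callables are hypothetical placeholders for whatever your script uses to talk to VC.

```python
import time


def power_on_in_waves(waves, power_on, pending_tasks,
                      poll_seconds=30, sleep=time.sleep):
    """Start each wave of VMs, then wait for the task queue to drain.

    waves         -- list of lists of VM names, in start order
    power_on      -- hypothetical callable that submits a power-on for one VM
    pending_tasks -- hypothetical callable returning the number of queued tasks
    """
    started = []
    for names in waves:
        for name in names:
            power_on(name)
            started.append(name)
        # Leave VC alone until it has worked through this wave.
        while pending_tasks() > 0:
            sleep(poll_seconds)
    return started


if __name__ == "__main__":
    # Stand-ins so the sketch runs without any VC connection.
    fake_queue = [3, 1, 0, 2, 0]
    done = power_on_in_waves(
        [["dns01", "dc01"], ["sql01"]],
        power_on=lambda name: print("power on", name),
        pending_tasks=lambda: fake_queue.pop(0),
        poll_seconds=0,
    )
    print("started:", done)
```

Whether waiting between groups is worth the extra time depends on how quickly VC clears its queue. The post itself only goes as far as submitting the power-on commands and then waiting.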