How to quickly recover from disaster

This post is not about Disaster Recovery in the sense of switching to a second datacenter. Neither is it about how to restore your VMDKs or data. Oh, tell me Gabe, what is it about then?

Well, this is about the practical issues you run into during those stressful moments right after a disaster has happened, and some tips on how to take some of the stress away. As some of you might have seen on Twitter, last week wasn’t the best week I’ve experienced with ESX. Apart from a lot of stress and troubleshooting, it also taught my colleague Arnim van Lieshout and me some valuable lessons. There is a summary at the bottom if you don’t want to read it all.

Last week we had a whole cluster fail on us: six ESX hosts gave a PSOD at the same time. At this point we suspect a problem in the SAN fabric, but that is not relevant to this post. What is relevant is that this cluster had 200 VMs going down, including our Virtual Center. When the first calls came in from the monitoring desk telling us that ESX servers had gone down, we didn’t have Virtual Center to find out what was happening. Because we run Virtual Center as a VM and forgot to exclude it from DRS, we didn’t know which host VC was on. So we lost about 15 minutes searching for our VC VM. When we finally found it, it turned out the VC VM was hanging on a PXE boot; a quick reset of the VM solved this and VC was up and running two minutes later.
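
When VC is down and you have no idea which host it lives on, connecting to every host directly is the fastest way to find it. A minimal PowerCLI sketch of that search, with placeholder host names, credentials and VM name:

  # Minimal sketch: find the Virtual Center VM by querying each ESX host
  # directly, since VC itself is down. Host names and the VM name are
  # placeholders for your own environment.
  Add-PSSnapin VMware.VimAutomation.Core -ErrorAction SilentlyContinue

  $hosts  = "esx01", "esx02", "esx03", "esx04", "esx05", "esx06"
  $vcName = "VC01"
  $cred   = Get-Credential -Credential "root"   # root credentials for the hosts

  foreach ($esx in $hosts) {
      # Connect straight to the host, bypassing Virtual Center
      Connect-VIServer -Server $esx -Credential $cred -WarningAction SilentlyContinue | Out-Null
      $vm = Get-VM -Name $vcName -ErrorAction SilentlyContinue
      if ($vm) {
          Write-Host "Found $vcName on host $esx (state: $($vm.PowerState))"
      }
      Disconnect-VIServer -Confirm:$false
  }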

Once we had Virtual Center up and running, and three managers breathing down our necks, we started acting. But we should have thought first. We both started powering on VMs in groups of 10, which of course made Virtual Center very busy. And because DRS was still active on that cluster, a lot of VMs got migrated before they were powered on or during the first power-on phase (I’m not sure exactly when DRS kicks in). Either way, there were suddenly a lot of power-on actions and migrations going on, which made Virtual Center respond very slowly to our subsequent commands. There is nothing more stressful in an already stressful situation than a GUI that doesn’t respond while you still have other VMs to start.
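
In hindsight, a safer sequence would have been to take DRS out of the equation first and only switch it back on once the task queue had drained. A rough PowerCLI sketch of that, using "Cluster01" and "vc01" as placeholder names:

  # Sketch: temporarily disable DRS on the affected cluster so the mass
  # power-on doesn't also trigger a wave of migrations. Names are placeholders.
  Connect-VIServer -Server "vc01" | Out-Null

  Get-Cluster "Cluster01" | Set-Cluster -DrsEnabled:$false -Confirm:$false

  # ... power on the VMs and let VC work through its task queue ...

  # Re-enable DRS once things have calmed down again.
  Get-Cluster "Cluster01" | Set-Cluster -DrsEnabled:$true -Confirm:$false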

The next problem we had to face was figuring out which VMs should be left powered off. For some reason you always have VMs lying around that shouldn’t be started. The templates are easy, but we had to be careful selecting the other VMs, which made it much more time consuming than needed. Besides the problem of which VMs to start, we also didn’t know which to start first. In a good disaster recovery plan you should always have a list of all servers and their priority, but well, you know…

As a quick fix we created a sub folder at the top level of the “Virtual Machines & Templates” view named “PowerOff” and moved all VMs that should stay powered off into it. Now you can click on any other folder, select all VMs and power them on. This is not a permanent solution. For our final implementation we decided to create a custom field in VC that holds the priority of a VM. We would use priority values the way DNS does for its records: instead of 1, 2, 3, etc. we would use 10, 20, 30, etc., and 99 would mean “leave off”. A custom field also makes this an excellent candidate for a PowerShell script, which would then start VMs based on their priority field. And once VC is busy powering on all those VMs, leave it alone. Get coffee or just watch VC do its work, but don’t try to do other things in VC at the same time.
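
A minimal sketch of what such a priority-driven start-up script could look like, assuming a VC custom attribute named "Priority" has already been created and filled in per VM (the attribute name and server name are placeholders):

  # Sketch: power on VMs ordered by a "Priority" custom attribute (10, 20, 30, ...).
  # A value of 99 means "leave powered off". Attribute and server names are placeholders.
  Add-PSSnapin VMware.VimAutomation.Core -ErrorAction SilentlyContinue
  Connect-VIServer -Server "vc01" | Out-Null

  # Collect powered-off VMs that have a priority value below 99
  $toStart = foreach ($vm in Get-VM | Where-Object { $_.PowerState -eq "PoweredOff" }) {
      $prio = (Get-Annotation -Entity $vm -CustomAttribute "Priority").Value
      if ($prio -and [int]$prio -lt 99) {
          New-Object PSObject -Property @{ VM = $vm; Priority = [int]$prio }
      }
  }

  # Start them lowest priority value first; -RunAsync just queues the task in VC
  foreach ($entry in ($toStart | Sort-Object Priority)) {
      Write-Host "Starting $($entry.VM.Name) (priority $($entry.Priority))"
      Start-VM -VM $entry.VM -RunAsync | Out-Null
  }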

Then, once VC is not so busy anymore, re-enable DRS and check HA. Now start collecting diagnostic logs, because you want to know what happened. In VC, export the diagnostic data for your environment and prepare to upload it to VMware. Because you want to start investigating yourself rather than just wait for support to call back, log on to the console of the ESX hosts and collect the /var/log/vmkernel logs. If you suffered a PSOD, check your /root folder for any vmkernel dump files and collect them too. You can also extract a log from the dump with “vmkdump -l <dumpfile>”, which gives you information on the last few seconds before the PSOD. While we were doing this, it became clear that we had been missing a lot of warnings in the week before. We are now looking for monitoring tools that can monitor more than just uptime pings.
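
If you prefer to script the export as well, PowerCLI’s Get-Log cmdlet can generate such a diagnostic bundle from the command line. A rough sketch, assuming a PowerCLI version that supports -Bundle and using placeholder names:

  # Sketch: export a diagnostic bundle while connected to Virtual Center,
  # ready for upload to VMware support. Server name and path are placeholders.
  Connect-VIServer -Server "vc01" | Out-Null
  Get-Log -Bundle -DestinationPath "C:\DiagnosticBundles"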

To summarize, these are the things we are going to change within the next few weeks:

  • Exclude VC from DRS by setting its automation level to “disabled”. This is a best practice according to VMware’s white papers, but we forgot to set it.
  • We agreed on one ESX host (ESX050) on which VC will ALWAYS be running. Only when this host is down for maintenance will VC be moved to ESX051. Once ESX050 is up again, VC will be VMotioned back to ESX050.
  • After such a massive failure, we want to power on all VMs as soon as possible and not have DRS migrate them, so we first disable DRS.
  • We want to boot as many VMs as possible quickly, without a difficult selection process. For now we use separate folders for that.
  • In the future we will implement a custom Virtual Center field at the VM level which holds a priority value (10, 20, 30, …, 99) and have the VMs started by a PowerShell script.
  • After VC has received the power-on command for all VMs, its task queue will be very long, so we sit and wait for VC to work through all those tasks.
  • Find a good tool that monitors not only uptime but also the health of the ESX hosts.

18 thoughts on “How to quickly recover from disaster”

  1. Amazing scenario. I’m surprised you had the energy to write this up after what you had gone through. I would be interested in knowing what caused a PSOD on all boxes at the same time.

  2. Hey Gabe,

    Man, that was a crazy situation. I echo Jason’s comments as well. As far as a good monitoring tool goes, I’ve used vFoglight from Vizioncore with a number of my clients. It is very granular (almost too much so), but out of the box it does a great job of alerting you to all types of conditions on the ESX hosts – heap memory errors, overcommitment of memory, paging, etc.

  3. Hi Gabe,

    I think you forgot the first and most important action to take in such a situation:
    – Take a deep breath, relax and don’t forget your coffee.

    And to everyone else who thinks 6 PSODs is heavy stuff: that was Monday. On Wednesday we had another cluster of 7 go down, so Gabe’s total is 13 PSODs in a week. Like he mentioned in a tweet… who can beat that?

  4. Gabrie, an excellent post. A well-rounded explanation of your predicament and subsequent escape.

    I echo Jason’s comments regarding the PSOD cause.

  5. Hi Gabe,
    if you have a problem with VC, remember a couple of things:
    – Web Access on each of the 6 hosts will let you power things on.
    – The VI Client connected directly to each host works as well.

    One thing I’ve seen a couple of times is people rushing to power on the VC, but forgetting to power on the VM that holds the VC database. Trust me, nothing makes VC run slower :)

  6. @Forbes Guthrie: We did use the VI Client to make a direct connection to each host, but searching for VC that way still takes quite some time.

    That is why we now “glued” it to one host.

    Gabrie

  7. Gabe, great read. I will certainly be taking some tips from your situation and implementing them before I’m ever faced with a PSOD.

    Simon

  8. Root cause was that the storage connection to the secondary site got disconnected for a short time and the SAN responded to ESX with a reply ESX didn't understand, which threw an exception.

    As the VMware engineer explained, when ESX gets a response it doesn't know how to handle, it can't just do anything with it. The only proper action is to exit the building.

  9. Gabe – did you ever write that PowerShell script? Here's one I wrote a little while back which lets you run VC / SQL / DCs on any host under DRS, as it goes and finds the VMs for you and powers them on in the right sequence. It's one of my first scripts so be gentle! BTW – like your idea of a priority field to sequence the startup of other VMs … will look at extending my script based on that idea.

    # ====================================================================
    # ColdStart.ps1 – Restart key infrastructure components.
    # Sequence:
    #   1 – Find and start DCs.
    #   2 – Find and start SQL Server (if necessary).
    #   3 – Find and start VC.
    #
    # Rev 1.0 – Initial release
    # ====================================================================

    # Connect to each ESX host in turn and power on any of the named VMs found there.
    function start_VM {
        param ($VMs)
        foreach ($Esx in $EsxCreds.Keys) {
            Connect-VIServer -Server $Esx -User "root" -Password $EsxCreds.$Esx -wa SilentlyContinue | Out-Null
            $Found = Get-VM $VMs -ea SilentlyContinue -wa SilentlyContinue
            if ( $Found ) {
                foreach ($FoundVM in $Found) {
                    if ( $FoundVM.PowerState -ne "PoweredOn" ) {
                        Write-Host "  Found powered off VM '$FoundVM' on ESX host $Esx"
                        Write-Host "  POWERING ON..."
                        Start-VM $FoundVM
                    } else {
                        Write-Host "  Found already powered on VM '$FoundVM' on ESX host $Esx"
                    }
                }
            }
            Disconnect-VIServer * -Confirm:$false
        }
    }

    # Setup

    if ( ! (Get-PSSnapin VMware.VimAutomation.Core -ErrorAction SilentlyContinue) ) { Add-PSSnapin VMware.VimAutomation.Core | Out-Null }

    # Define the ESX hostname/IP & credential pairs here...

    $EsxCreds = @{
        "x.x.x.1" = 'rootpassword1';
        "x.x.x.2" = 'rootpassword2' }

    # Specify the names of the VMs that are DCs (leave blank if none).
    $DCs = ("DC1", "DC2")

    # Specify the name of the SQL Server VM that supports VC (leave blank if SQL is local to VC).
    $SQL = "SQL"

    # Specify the name of the VC VM.
    $VC = "VC"

    # Fire 'em up.

    Write-Host "ColdStart.ps1 STARTING"
    Write-Host ""

    if ( $DCs ) {
        Write-Host "STATUS: Looking for Domain Controller VMs ..."
        start_VM $DCs
    }

    if ( $SQL ) {
        Write-Host "STATUS: Looking for SQL Server VM (which supports VC) ..."
        start_VM $SQL
    }

    Write-Host "STATUS: Looking for vCenter Server VM ..."
    Write-Host ""

    start_VM $VC

    Write-Host ""
    Write-Host "ColdStart.ps1 COMPLETE"

  10. Sanjai,

    I am not sure if you will read this because your post is over a year old; however, I want to give it a try. In your case of the multiple PSODs, do you know what was on the PSOD itself? I recently experienced an issue with multiple PSODs at the same time on the same cluster. This cluster had been fully functional for over a year and we had never had any issues. Our cluster has both CLARiiON and HDS LUNs; therefore, our systems seem similar.

    Let me know if you have a screen capture of your PSOD.

    Please email internetforumuser@gmail.com
    Thanks

Comments are closed.