Hyper-V, not in my datacenter (part 3: Motions and storage)

In part 3 of this series, I want to shed some light on the differences between the migration types in ESX and Hyper-V, and on the differences in storage between the two. Although I think most readers know the differences, I will start with a brief summary.

  • Hyper-V QuickMigration: When you want to move a running Virtual Machine from one host to another with as little downtime as possible, you can use QuickMigration. On host-A the VM will be suspended to disk, and then host-B will unsuspend the VM and run it.
  • Hyper-V & ESX Cold Migration: Moving a VM that is not running to a different host and/or different storage.
  • ESX VMotion: Moving of a running VM to a different host, without downtime.
  • ESX Storage VMotion: Moving the disks of a VM to different storage without downtime (VM doesn’t move).

No maintenance during business hours

In my datacenter I want as little downtime as possible, and the different types of migration definitely help with that. When comparing Hyper-V’s QuickMigration to ESX’s VMotion, Microsoft often says you don’t need VMotion: QuickMigration is good enough for most businesses. Well, you might expect them to say this, but I think differently.

Often, when promoting QuickMigration, Microsoft states that moving VMs during business hours is not needed. Normally you patch servers outside business hours, and at that time it doesn’t matter whether you reboot just your Windows VMs or your Hyper-V hosts as well. To me this sounds like MS thinks that business today stops at 5pm and that after all the employees go home, the servers don’t actually do anything at night. I know MS knows better and that it’s just marketing talk (what about backups, eh?), but let’s investigate QuickMigration and its effects a little more.

Quick migrating a Virtual Machine from one host to another is downtime, period. With a VM configured with 512MB of RAM, you would only have about 6 seconds of downtime when using 1Gb NICs to transfer the memory (which is quite normal). With a 4GB VM, on the other hand, this downtime can grow to more than a minute. It must be noted, however, that the downtime of the VM is not equal to the downtime of the service that VM is providing. When performing a QuickMigration, all sessions with other systems are disconnected. After the VM is running again, all these sessions have to reconnect, and this is the weak point of QuickMigration.
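As a rough illustration, the back-of-the-envelope math behind those numbers could look like the sketch below. It is purely an estimate under stated assumptions (a single transfer of the VM’s memory over a ~1Gb/s path at ~80% efficiency, plus a couple of seconds of suspend/resume overhead), not a model of Hyper-V’s actual implementation.

```python
# Rough estimate of QuickMigration downtime.
# Assumptions (mine, not from Hyper-V documentation): the VM's memory is
# moved once over a ~1Gb/s path at ~80% effective throughput, plus a fixed
# few seconds for suspend, cluster failover and resume.

def quickmigration_downtime_seconds(vm_memory_gb: float,
                                    link_gbps: float = 1.0,
                                    efficiency: float = 0.8,
                                    fixed_overhead_s: float = 2.0) -> float:
    """Estimate downtime as memory transfer time plus a fixed overhead."""
    transfer_s = (vm_memory_gb * 8) / (link_gbps * efficiency)  # GB -> Gbit -> seconds
    return transfer_s + fixed_overhead_s

for mem_gb in (0.5, 4.0):
    print(f"{mem_gb:>3} GB VM -> ~{quickmigration_downtime_seconds(mem_gb):.0f} s of downtime")
# Prints roughly 7 s and 42 s: the same order of magnitude as the
# "about 6 seconds" and "more than a minute" figures quoted above.
```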

In real life I see that numerous business applications do not react well to losing the connection to other systems. In that case it is not enough to just have the VM running again: the admin will also have to check all the systems that depend on this VM, as well as the systems that the VM depends on, to confirm that they can see and talk to each other again. And in larger businesses it is normally not the admin who can check these connections; there is most probably an application admin who now has to stay after business hours just because you want to reboot your Hyper-V host.

Let’s do a little test. A good hypervisor should easily have 20 VMs running on it. Now take a look at your current environment, pick 20 servers / services / applications, find out what other services they depend on and find their application admins. Now tell them that their applications are running fine, but because of an urgent patch you have to reboot today / tonight. Next to impossible? What I want to point out is that on top of your normal Windows maintenance reboots, on top of normal network maintenance, on top of normal application maintenance, you are now introducing yet another maintenance window, and one with quite some implications, because you impact so many systems.

QuickMigration CPU check

While searching for more info on Hyper-V QuickMigration, I found a rather disturbing video by VMware. Now, it is fair to say that VMware is Hyper-V’s biggest competitor, but somehow I do believe the video is 100% accurate; should it not be, please let me know. The video shows that when a QuickMigration is performed between hosts with different CPU features, Hyper-V does not check the compatibility of these features. The implication is that applications could crash because they are using CPU features on host-A that are not available when running on host-B. I was actually surprised this can happen in Hyper-V. Because QuickMigration is vulnerable to this kind of fault, you have to keep track of which CPU families your hosts are running and check between which hosts you can safely quick migrate. A very tedious task, and prone to errors.

When using VMotion with VMware ESX, there is a CPU check before a VM is migrated to a new host; this prevents application failures by simply not allowing a move to an incompatible CPU. Furthermore, when using VMware EVC (Enhanced VMotion Compatibility), a CPU mask is placed over the hosts (see my article on EVC here). This feature brings all hosts in a cluster down to the lowest common set of CPU features that every host can provide, thereby allowing VMotion between all of them. This will prevent application failure in 99% of the cases. Why not 100%? Well, the application also has to behave well: officially, an application has to check for the existence of a CPU feature before using it. However, applications typically only do this check at startup, and when the available features change after that, things can go wrong.
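The difference between the two approaches is easy to sketch. The snippet below is purely illustrative (the host names and feature flags are made-up examples, not real inventory data or either vendor’s actual code): a VMotion-style pre-check refuses a move when the target host lacks a feature the source exposes, while an EVC-style baseline is simply the intersection of all hosts’ feature sets, presented to every VM.

```python
# Illustrative sketch only: hypothetical hosts and example CPUID flag names.

# Feature sets exposed by each host's CPU (example flags).
hosts = {
    "host-A": {"sse2", "sse3", "ssse3", "sse4.1"},
    "host-B": {"sse2", "sse3"},                      # older CPU, no SSSE3/SSE4.1
    "host-C": {"sse2", "sse3", "ssse3"},
}

def migration_allowed(source: str, target: str) -> bool:
    """VMotion-style pre-check: block the move if the target host is
    missing any feature the source host exposes to its VMs."""
    return hosts[source] <= hosts[target]

def evc_baseline(cluster: dict) -> set:
    """EVC-style mask: the lowest common set of features across all hosts.
    Every VM only ever sees this baseline, so any host can run it."""
    return set.intersection(*cluster.values())

print(migration_allowed("host-A", "host-B"))   # False -> move refused
print(migration_allowed("host-B", "host-A"))   # True
print(sorted(evc_baseline(hosts)))             # ['sse2', 'sse3']
```

Without either check (the Hyper-V situation described above), the move simply happens and the application only finds out about the missing feature when it crashes.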

Storage design for QuickMigration

When designing a Hyper-V solution that utilizes QuickMigration, one has to keep in mind that each VM requires a separate LUN assigned to it for HA and QuickMigration to work. This is because both techniques utilize Windows Clustering to switch the LUN from one host to another. This is quite a significant drawback compared to VMware’s VMFS file system.

In the VMware environment I designed, the plan was to have about 30 VMs on one LUN. Considering an average disk size of approximately 13GB, I calculated that we needed 400GB LUNs to hold the VMDKs, and added 100GB extra per LUN for VM memory swap and snapshot room, for a total of 500GB. For 725 VMs I need about 25 LUNs of 500GB each: a total of 12.5TB, of which 2.5TB is spare space.

Looking at the same numbers for a Hyper-V environment, I would calculate about 10GB of extra space per VM for memory swap and running snapshots. However, because Hyper-V requires a LUN per VM, this spare space cannot be pooled across VMs: I would need 725 x 10GB = 7.25TB of spare space. That is almost a 5TB difference!
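For what it is worth, the arithmetic behind these numbers fits in a few lines. The sketch below just reproduces the figures from this post (725 VMs, 25 VMFS LUNs of 500GB with 100GB headroom each, 10GB headroom per Hyper-V LUN); the variable names are mine.

```python
# Back-of-the-envelope storage comparison using the figures from this post.

vms             = 725
vmfs_luns       = 25     # ~30 VMs per shared VMFS LUN
vmfs_data_gb    = 400    # rounded up from 30 x ~13GB of VMDKs
vmfs_spare_gb   = 100    # swap/snapshot headroom shared by ~30 VMs
hyperv_spare_gb = 10     # swap/snapshot headroom per VM (one LUN per VM)

vmfs_total_tb   = vmfs_luns * (vmfs_data_gb + vmfs_spare_gb) / 1000.0   # 12.5 TB
vmfs_spare_tb   = vmfs_luns * vmfs_spare_gb / 1000.0                    # 2.5 TB
hyperv_spare_tb = vms * hyperv_spare_gb / 1000.0                        # 7.25 TB

print(f"VMware: {vmfs_luns} LUNs, ~{vmfs_total_tb:.1f}TB total, ~{vmfs_spare_tb:.1f}TB spare")
print(f"Hyper-V spare space alone: ~{hyperv_spare_tb:.2f}TB")
print(f"Extra spare space needed by Hyper-V: ~{hyperv_spare_tb - vmfs_spare_tb:.2f}TB")
```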

(Remark: Officially, you can have multiple VMs on one Hyper-V LUN and still use HA. But you will have to switch these VMs at the same time because the “owner” of the LUN will change. This will not work for QuickMigration.)

My conclusion on the subject of “hypervisor motion and storage” is that a feature like VMotion is a must-have in the datacenter. Without such a feature, maintenance of your hypervisor environment will increase downtime considerably. In this field Hyper-V is not suited for the datacenter and, if used, will increase the amount of storage needed compared to an ESX environment.

Thank you Tom Howarth for checking this post before publishing.

Series:

Hyper-V, not in my datacenter (part 1: Hardware)

Hyper-V, not in my datacenter (part 2: Guest OS and Memory overcommit)

Hyper-V, not in my datacenter (part 3: Motions and storage)