Just before I headed home, a colleague asked me to help with a VM he could no longer power on. It seemed a customer had tried to commit a snapshot and the job had timed out. The customer then tried some other things and suddenly the VM was down. When we tried to power it on, an error told us that disk1 could not be found.
Checking VM settings in VMX file and VMSD file
In the VM properties and in the vmx file, I found these references:
First harddisk: disk0 [DataStore4] FS05/FS05-000002.vmdk
Second harddisk: disk1 [DataStore4] FS05/FS05_1-000002.vmdk
Third harddisk: disk2 [DataStore4] FS05/FS05_2-000002.vmdk
These clearly point to snapshots. The strange thing, however, was that disk1 showed up as 32GB in size instead of 1TB! That scared me a little. In the vmsd file, which normally keeps track of the snapshots, I found the following references:
snapshot0.disk0.fileName = "FS05.vmdk"
snapshot0.disk0.node = "scsi0:0"
snapshot0.disk1.fileName = "FS05_1-000001.vmdk"
snapshot0.disk1.node = "scsi0:1"
snapshot0.disk2.fileName = "FS05_2-000001.vmdk"
snapshot0.disk2.node = "scsi0:2"
So there was the first contradiction: the vmx file points at the *-000002.vmdk files, while the vmsd file points at FS05.vmdk for disk0 and at the *-000001.vmdk files for disk1 and disk2.
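To make the comparison between the two files less error-prone, the vmsd entries can be parsed and set against what the vmx file references. The following is only a minimal sketch: the vmsd text is copied from the entries above, while the `VMX_DISKS` mapping is typed in by hand and assumes that disk0, disk1 and disk2 sit on scsi0:0, scsi0:1 and scsi0:2 (only scsi0:1 is confirmed by the log). The function names are hypothetical.

```python
import re

# vmsd entries as quoted in the text above.
VMSD_TEXT = '''
snapshot0.disk0.fileName = "FS05.vmdk"
snapshot0.disk0.node = "scsi0:0"
snapshot0.disk1.fileName = "FS05_1-000001.vmdk"
snapshot0.disk1.node = "scsi0:1"
snapshot0.disk2.fileName = "FS05_2-000001.vmdk"
snapshot0.disk2.node = "scsi0:2"
'''

# Disk files the vmx file points at, keyed by SCSI node.
# Assumption: the node assignment for disk0 and disk2 is inferred, not read from the vmx.
VMX_DISKS = {
    "scsi0:0": "FS05-000002.vmdk",
    "scsi0:1": "FS05_1-000002.vmdk",
    "scsi0:2": "FS05_2-000002.vmdk",
}

LINE_RE = re.compile(r'^\s*([\w.:]+)\s*=\s*"(.*)"\s*$')


def parse_key_values(text):
    """Turn lines of the form key = "value" into a dict."""
    result = {}
    for line in text.splitlines():
        match = LINE_RE.match(line)
        if match:
            result[match.group(1)] = match.group(2)
    return result


def snapshot_disks(vmsd):
    """Map each SCSI node to the file name the vmsd records for it."""
    disks = {}
    for key, value in vmsd.items():
        if key.endswith(".node"):
            prefix = key[: -len(".node")]
            disks[value] = vmsd.get(prefix + ".fileName")
    return disks


def differences(vmx_disks, vmsd_disks):
    """Return node -> (vmx file, vmsd file) for every node where the two differ."""
    return {
        node: (vmx_file, vmsd_disks.get(node))
        for node, vmx_file in vmx_disks.items()
        if vmx_file != vmsd_disks.get(node)
    }


if __name__ == "__main__":
    recorded = snapshot_disks(parse_key_values(VMSD_TEXT))
    for node, (vmx_file, vmsd_file) in sorted(differences(VMX_DISKS, recorded).items()):
        print(f"{node}: vmx -> {vmx_file}, vmsd -> {vmsd_file}")
```

Whether a difference is actually a problem depends on the snapshot history of the VM; the sketch only lists where the two files disagree.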
Finding an error report
The next step was to look for any mention of an error in the log files. I quickly discovered that the vmware.log file showed the VM looking for the *-000002.vmdk files and reported that FS05_1-000002.vmdk was not found:
Unable to find file FS05_1-000002.vmdk
Unable to find file FS05_1-000002.vmdk
DISK: OPEN scsi0:1 'FS05_1-000002.vmdk' persistent R[(null)]
Unable to find file FS05_1-000002.vmdk
DISKLIB-LINK : "FS05_1-000002.vmdk" : failed to open (The system cannot find the file specified).
DISKLIB-CHAIN : "FS05_1-000002.vmdk" : failed to open (The system cannot find the file specified).
DISKLIB-LIB : Failed to open 'FS05_1-000002.vmdk' with flags 0xa (The system cannot find the file specified).
Msg_Post: Error
[msg.disk.fileNotFound] VMware ESX Server cannot find the virtual disk "FS05_1-000002.vmdk". Please verify the path is valid and try again.
[msg.disk.noBackEnd] Cannot open the disk 'FS05_1-000002.vmdk' or one of the snapshot disks it depends on.
[msg.disk.configureDiskError] Reason: The system cannot find the file specified.
----------------------------------------
Module DiskEarly power on failed.
Contents of the VM directory
I then looked at the vmdk files in the VM directory, starting with the first disk, FS05:
FS05-flat.vmdk (32G)
FS05.vmdk (397 bytes)
FS05-000002-delta.vmdk (208M)
FS05-000002.vmdk (217 bytes)
The FS05-flat.vmdk file is the one holding all the data. The FS05.vmdk file is its descriptor, an ASCII file you can open and read. The same goes for FS05-000002.vmdk, which is the descriptor of FS05-000002-delta.vmdk, the file that contains the snapshot data. The strange thing is that there is no FS05-000001.vmdk or FS05-000001-delta.vmdk. A missing 000001 file is not entirely impossible, though, for example when two snapshots were made from the same starting point instead of each snapshot being based on the previous one.
The next set of files is for the second disk FS05_1:
FS05_1-flat.vmdk (1.0T)
FS05_1.vmdk (403 bytes)
FS05_1-000001-delta.vmdk (13G)
FS05_1-000001.vmdk (223 bytes)
FS05_1-000002-delta.vmdk (402M)
Here we have a bigger issue. The FS05_1-flat.vmdk file, containing 1TB of very important data (it always is when a customer calls), is "covered" by its descriptor FS05_1.vmdk. The FS05_1-000001-delta.vmdk file also has its descriptor, FS05_1-000001.vmdk, but FS05_1-000002-delta.vmdk has no descriptor file.
The FS05_2 disk luckily looked quite complete:
FS05_2-flat.vmdk (20G)
FS05_2.vmdk (399 bytes)
FS05_2-000001-delta.vmdk (32M)
FS05_2-000001.vmdk (221 bytes)
FS05_2-000002-delta.vmdk (16M)
FS05_2-000002.vmdk (228 bytes)
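Spotting a delta file without its descriptor by eye gets tedious in a busy directory. As a rough sketch, the check can be done on a plain list of file names; the list below is copied from the three listings above, and the function name is hypothetical. It relies only on the naming pattern seen here, where `X-delta.vmdk` is described by `X.vmdk`.

```python
# File names copied from the directory listings above.
FILES = [
    "FS05-flat.vmdk", "FS05.vmdk",
    "FS05-000002-delta.vmdk", "FS05-000002.vmdk",
    "FS05_1-flat.vmdk", "FS05_1.vmdk",
    "FS05_1-000001-delta.vmdk", "FS05_1-000001.vmdk",
    "FS05_1-000002-delta.vmdk",
    "FS05_2-flat.vmdk", "FS05_2.vmdk",
    "FS05_2-000001-delta.vmdk", "FS05_2-000001.vmdk",
    "FS05_2-000002-delta.vmdk", "FS05_2-000002.vmdk",
]

DELTA_SUFFIX = "-delta.vmdk"


def missing_descriptors(names):
    """Return the descriptor names that delta files expect but that are absent."""
    present = set(names)
    missing = []
    for name in names:
        if name.endswith(DELTA_SUFFIX):
            descriptor = name[: -len(DELTA_SUFFIX)] + ".vmdk"
            if descriptor not in present:
                missing.append(descriptor)
    return missing


if __name__ == "__main__":
    print(missing_descriptors(FILES))  # ['FS05_1-000002.vmdk']
```

On a real system the list would come from a directory listing of the VM folder rather than being typed in.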
About CIDs and parentCIDs
The descriptor vmdk file normally holds a CID and a parentCID value; these values link the snapshots and the flat vmdk together. Let's look at the CIDs in the FS05_2 vmdks and how they point to each other. In red you see the file name of the vmdk descriptor file.
What matters is the pointer to the parentCID and the parentFileNameHint. In file FS05_2-000001.vmdk you see how the parentCID from the 000002 file matches the CID from the 000001 file.
The last file to look at is FS05_2.vmdk. Since this is the vmdk file that describes the -flat.vmdk, it holds some more information on the disk geometry, but that is not important to us now. Again we only use the CID lines. You'll notice the parentCID value is ffffffff. In all vmdk files that describe a -flat.vmdk I found the parentCID value to be ffffffff, but I don't know if this is required. Again we see that the parentCID of the 000001 file matches the CID of FS05_2.vmdk.
This snapshot chain is OK: the FS05_2 files have been checked and the chain is confirmed to be correct. The next step is to reconstruct the chain for the FS05_1 vmdks.
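The manual check described here (each descriptor's parentCID must equal the CID of the file named in its parentFileNameHint) can be sketched in a few lines. This is an illustration only: the descriptor texts, file names and CID values below are made up, and `parse_descriptor` and `check_chain` are hypothetical helpers, not part of any VMware tool.

```python
import re

FIELD_RE = re.compile(r'^\s*(CID|parentCID|parentFileNameHint)\s*=\s*"?([^"]*)"?\s*$')

# Made-up descriptors for illustration; real ones contain more lines.
EXAMPLE = {
    "disk.vmdk": 'CID=aaaa1111\nparentCID=ffffffff\n',
    "disk-000001.vmdk": 'CID=bbbb2222\nparentCID=aaaa1111\nparentFileNameHint="disk.vmdk"\n',
    "disk-000002.vmdk": 'CID=cccc3333\nparentCID=bbbb2222\nparentFileNameHint="disk-000001.vmdk"\n',
}


def parse_descriptor(text):
    """Extract CID, parentCID and parentFileNameHint from descriptor text."""
    fields = {}
    for line in text.splitlines():
        match = FIELD_RE.match(line)
        if match:
            fields[match.group(1)] = match.group(2).strip()
    return fields


def check_chain(descriptors, top):
    """Walk from the newest descriptor down to the base and collect problems."""
    problems = []
    seen = set()
    name = top
    while name not in seen:  # the 'seen' set guards against a looping chain
        seen.add(name)
        if name not in descriptors:
            problems.append(f"{name}: descriptor file is missing")
            break
        child = parse_descriptor(descriptors[name])
        parent_name = child.get("parentFileNameHint")
        if not parent_name:
            break  # no parent hint: treated as the base disk
        if parent_name in descriptors:
            parent = parse_descriptor(descriptors[parent_name])
            if child.get("parentCID") != parent.get("CID"):
                problems.append(
                    f"{name}: parentCID {child.get('parentCID')} does not match "
                    f"CID {parent.get('CID')} of {parent_name}"
                )
        name = parent_name
    return problems


if __name__ == "__main__":
    print(check_chain(EXAMPLE, "disk-000002.vmdk"))  # [] when the chain is intact
```

An empty list means every link matched; a missing descriptor or a CID mismatch shows up as one entry per broken link.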
Reconstructing the vmdk
This is actually rather simple. Since we're missing FS05_1-000002.vmdk, we can just copy FS05_1-000001.vmdk and edit it to hold the right entries. Be aware that the FS05_1-000001.vmdk file contains other entries too, and we need them, so make a full copy of the descriptor. Below is the content of FS05_1-000001.vmdk:
# Disk DescriptorFile
version=1
CID=79cda434
parentCID=7cec9dae
createType="vmfsSparse"
parentFileNameHint="FS05_1.vmdk"
# Extent description
RW 2147483648 VMFSSPARSE "FS05_1-000001-delta.vmdk"
# The Disk Data Base
#DDB
Now copy this to FS05_1-000002.vmdk and edit the lines I marked:
# Disk DescriptorFile
version=1
CID=458dea24 <--- Just pick any number
parentCID=79cda434 <--- make sure it links to the parent
createType="vmfsSparse"
parentFileNameHint="FS05_1-000001.vmdk" <--- edit this (the parent's descriptor file, not its delta file)
# Extent description
RW 2147483648 VMFSSPARSE "FS05_1-000002-delta.vmdk" <--- edit this
# The Disk Data Base
#DDB
The arrows are only my annotations and should not end up in the file. The 458dea24 CID can be any number; you just make it up, since it has no relation to the contents of the VMDK. This is what cost me a lot of time the first time I played with this. I thought I had to retrieve this value from the VMDK somehow, but I learned from VMware Support that this CID can be any number.
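The copy-and-edit step can also be scripted, which avoids typos in the CID lines. The sketch below works on the FS05_1-000001.vmdk text shown earlier and produces the text for the missing descriptor. It is an assumption-laden illustration: `make_child_descriptor` is a hypothetical helper, the new CID is simply a random 8-digit hex value, and the parentFileNameHint is set to the parent's descriptor file, mirroring how the 000001 descriptor names FS05_1.vmdk rather than the -flat file.

```python
import re
import secrets

PARENT_NAME = "FS05_1-000001.vmdk"
PARENT_TEXT = '''# Disk DescriptorFile
version=1
CID=79cda434
parentCID=7cec9dae
createType="vmfsSparse"
parentFileNameHint="FS05_1.vmdk"
# Extent description
RW 2147483648 VMFSSPARSE "FS05_1-000001-delta.vmdk"
# The Disk Data Base
#DDB
'''


def make_child_descriptor(parent_text, parent_name, child_delta_name, new_cid=None):
    """Build descriptor text for a snapshot whose parent is parent_name."""
    match = re.search(r'^CID=([0-9a-fA-F]{8})\s*$', parent_text, re.MULTILINE)
    if match is None:
        raise ValueError("parent descriptor has no CID line")
    parent_cid = match.group(1)
    # Any value will do for the new CID; avoid the parent's CID and ffffffff.
    while new_cid is None or new_cid in (parent_cid, "ffffffff"):
        new_cid = secrets.token_hex(4)
    lines = []
    for line in parent_text.splitlines():
        if line.startswith("CID="):
            line = f"CID={new_cid}"
        elif line.startswith("parentCID="):
            line = f"parentCID={parent_cid}"
        elif line.startswith("parentFileNameHint="):
            line = f'parentFileNameHint="{parent_name}"'
        elif line.startswith("RW "):
            # Point the extent line at the snapshot's own delta file.
            line = re.sub(r'"[^"]*"', lambda m: f'"{child_delta_name}"', line, count=1)
        lines.append(line)
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(make_child_descriptor(PARENT_TEXT, PARENT_NAME,
                                "FS05_1-000002-delta.vmdk", new_cid="458dea24"))
```

Write the result to a new file and keep the original descriptors untouched, so the manual route remains available if the generated text turns out wrong.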
The last few steps
After the file has been edited, the chain of vmdks is correct again. The last step is to get this thing up and running. Remember that the vmsd file was also incorrect, but we can simply delete it (you did make a backup copy, didn't you?). Next, remove the VM from the vCenter inventory and then re-add it. As the last step, take a new snapshot and then commit all snapshots. After the snapshots have been committed, check the VM settings to see if they point to the correct parent files. If you're confident that this is correct, you're done. Now power on the VM.