There are times when a virtual machine might start rebooting or shut down completely because one or more of its snapshots has become corrupt.  Depending on the type of failure, recovery from such a situation is possible, sometimes with all data intact.  For example, when a backup solution takes a snapshot as part of its backup process, that snapshot might get corrupted immediately.  As it has only just been taken, there might not be much changed data in the snapshot at that point, so a complete recovery in this example is achievable.

Common symptoms for this kind of problem are:

  • Virtual machine starts displaying the message in the console:

“The redo log of .vmdk is corrupted.  Power off the virtual machine.  If the problem still persists, discard the redo log.”

  • Pressing OK in response to the message above causes the machine to display it again.
  • Powering off the virtual machine might not be possible, with the console displaying the message:

“The attempted operation cannot be performed in the current state”

I wrote this blog post (originally for my company blog) because, while various KB articles document parts of the process, I couldn’t find one that guides someone through the whole recovery.

Assumptions

I’ll be making the following assumptions for the purpose of this post:

  • The failure is occurring on virtual machine(s) with one or more snapshots, created either manually or via an automated mechanism e.g. a backup solution.
  • The virtual machine is displaying errors about inconsistent, corrupt or invalid snapshots.
  • The person going through the process is familiar with VMware operations and can deal with minor variations in the discussed scenario.
  • The force-shutdown process described is for ESXi 5.x hosts (the syntax for other versions will differ, but the process remains the same).

Virtual Machine Restore Process

Step 1: Save Virtual Machine Logs

The first action is to save the logs for this virtual machine, which can be found in the virtual machine folder on the datastore.  A new log file is created every time the machine restarts and the last seven files are maintained in the folder on a rolling basis.  Saving these ensures valuable diagnostic data is not lost to continuous reboots.  Due to the state the virtual machine is in, it might not be possible to save the active vmware.log, but the other log files should be copied directly from the datastore to a safe location.
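From the ESXi shell, the copy is a simple glob over the rotated log files.  A minimal sketch, using /tmp directories to stand in for the hypothetical datastore and backup paths:

```shell
# Hypothetical paths: on a real host SRC would be
# /vmfs/volumes/<Datastore>/<Machine Name> and DEST a safe location.
SRC="/tmp/demo-logs/MyVM"
DEST="/tmp/vm-log-backup"

mkdir -p "$SRC" "$DEST"
# Simulate the rotated log files found in a VM folder.
touch "$SRC/vmware.log" "$SRC/vmware-1.log" "$SRC/vmware-2.log"

# Copy every vmware log file to the safe location.
cp "$SRC"/vmware*.log "$DEST"/
ls "$DEST"
```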

Step 2: Shutdown Virtual Machine

This is to avoid any further damage to the current snapshots before a copy of the machine is made.  It’s possible for vCenter to lose control of the virtual machine in such situations and power operations might not work from the VI Client.  If that happens, refer to the “Force Virtual Machine Shutdown Process” section near the end of this post for techniques to force a shutdown of the machine.

Step 3: Make a copy of the Virtual Machine folder

Once the virtual machine is shut down, make a copy of the virtual machine folder to another location on the same or another datastore.  Name the folder something appropriate e.g. <Machine Name>-Backup.

Note: A clone is not what is required and it probably won’t work in such a situation.
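A plain recursive copy from the ESXi shell is enough here.  A minimal sketch, using a /tmp directory to stand in for the hypothetical datastore path:

```shell
# /tmp stands in for a hypothetical /vmfs/volumes/<Datastore> path.
VMDIR="/tmp/demo-datastore/MyVM"
mkdir -p "$VMDIR"
# Simulate the files in the VM folder.
touch "$VMDIR/MyVM.vmx" "$VMDIR/MyVM.vmdk" "$VMDIR/MyVM-000001.vmdk"

# Copy the whole folder as-is; a clone would try to open the broken
# disk chain, whereas cp just preserves the files byte-for-byte.
cp -R "$VMDIR" "${VMDIR}-Backup"
ls "${VMDIR}-Backup"
```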

Step 4: Attempt to fix the snapshots

First, check that the datastore has enough space remaining, as snapshots do become corrupted when space runs out.  As other snapshot operations might be running in the background, estimate generously; if there isn’t enough space, use Storage vMotion to migrate machines off that datastore until a safe amount is available.

Once there is enough space available, try taking another snapshot and if successful, try committing it.  This operation might fix the snapshot chain and consolidate all data into the disks.  If this process fails, then follow the remainder of the process to manually restore the machine from remaining snapshots.
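The same attempt can be made from the command line with vim-cmd.  A dry-run sketch that only prints the commands to be executed on the ESXi shell (the Vmid of 42 and the snapshot name are hypothetical; the force-shutdown section below shows how to find the real Vmid):

```shell
# Hypothetical Vmid.  snapshot.create takes a name, description and the
# includeMemory/quiesced flags; snapshot.removeall commits ("Delete All")
# the whole snapshot chain into the base disks.
VMID=42
echo "vim-cmd vmsvc/snapshot.create $VMID repair-attempt 'chain repair' 0 0"
echo "vim-cmd vmsvc/snapshot.removeall $VMID"
```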

Step 5: Confirmation of existing virtual disk configuration

Go into virtual machine settings and confirm the number and names of the existing virtual disks.  As there are snapshots present, the disk(s) will be pointing to the last-known snapshot(s).  Also, make note of the datastore the machine resides on.

Step 6: Command-Line access to ESXi server

Gain shell access to an ESXi server in the cluster that can see the datastore holding the virtual machine in question.  The ESXi server should also have access to the datastore where the repair will be carried out.  If using SSH, note that it is disabled by default, so you may have to start the service manually.

Note: Seek approval (if security policy requires it) before this is done.

Once SSH is enabled, use PuTTY (or a similar tool) to connect and log in using “root” credentials.

Step 7: Confirmation of snapshots present

Once logged in, change directory to:

 /vmfs/volumes/<Datastore Name>/<Machine Name>

and run:

ls -lrt *.vmdk

to display all virtual disk components.

Make note of which “flat” and “delta” disks are present.  While it can vary in certain situations, the virtual machine’s original disks will, by default, be named after the virtual machine.  If more than one virtual disk is present, the second will have “_1” appended to the base name, and so on.  If there are snapshots present, they will by default have “-000001” appended to each disk name for the first snapshot, “-000002” for the second, and so on.  Make note of all this information and confirm that it is in line with what you noted in Step 5.
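As an illustration, a two-disk machine named MyVM with two snapshots would typically show a file set along these lines (names hypothetical, recreated here in /tmp purely to show the default naming scheme):

```shell
# Hypothetical file set for a two-disk VM ("MyVM") with two snapshots.
D="/tmp/demo-naming"
mkdir -p "$D" && cd "$D"
touch MyVM.vmdk MyVM-flat.vmdk \
      MyVM_1.vmdk MyVM_1-flat.vmdk \
      MyVM-000001.vmdk MyVM-000001-delta.vmdk \
      MyVM_1-000001.vmdk MyVM_1-000001-delta.vmdk \
      MyVM-000002.vmdk MyVM-000002-delta.vmdk \
      MyVM_1-000002.vmdk MyVM_1-000002-delta.vmdk
ls *.vmdk
```

The “-flat” files are the base disks, the “-delta” files hold the snapshot changes, and the plain .vmdk files are the small text descriptors that vmkfstools operates on.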

Step 8: Repair of the virtual disks

Start with the highest set of snapshots and for each disk in that set, run the following command:

 vmkfstools -i <Source Disk> <Destination Disk>

where <Source Disk> is the source snapshot.  Please note: <Source Disk> is the base .vmdk name of the snapshot i.e. not the one with -flat, -delta or -ctk in the name.  <Destination Disk> is the new disk, where all disk changes need to be consolidated.  The new name should be similar to the source but not identical.  <Machine Name>-Recovered.vmdk is one example for the first disk.  Keep the same naming convention throughout for all disk names e.g. <Machine Name>-Recovered_1.vmdk, <Machine Name>-Recovered_2.vmdk and so on.

For example:

vmkfstools -i <Machine Name>-000003.vmdk <Machine Name>-Recovered.vmdk

for the first disk from the third snapshot set.

vmkfstools -i <Machine Name>_1-000003.vmdk <Machine Name>-Recovered_1.vmdk

for the second disk in the same set and so on.

Repeat the process for all disks in the snapshot set identified earlier in step 7.  If the process is successful, move on to step 9.  If there is failure on one or more disks in the set, the following error message may be displayed:

Failed to clone disk: Bad File descriptor (589833)

If that error occurs, skip that disk and keep running the process for the other disks, as they might still be useful.  However, an incomplete set is unlikely to be acceptable for production, so the next most recent snapshot set should be tried.  Follow the same process until all disks in a snapshot set are successfully consolidated into a new disk set.  If this is an investigation into the events leading up to the failure, additional sets might have to be consolidated in the same way.
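To keep the naming convention consistent across a multi-disk set, the destination names can be derived mechanically.  A sketch (disk and snapshot names hypothetical) that prints the vmkfstools commands rather than running them, so the mapping can be checked first:

```shell
# Hypothetical two-disk snapshot set "000003".  Maps
#   <Name>_N-000003.vmdk -> <Name>-Recovered_N.vmdk and
#   <Name>-000003.vmdk   -> <Name>-Recovered.vmdk
# and prints the vmkfstools -i command for each pair.
for SRC in MyVM-000003.vmdk MyVM_1-000003.vmdk; do
  DST=$(echo "$SRC" | sed 's/_\([0-9]*\)-000003/-Recovered_\1/; s/-000003/-Recovered/')
  echo "vmkfstools -i $SRC $DST"
done
```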

Step 9: Restoration of the virtual machine

Using the “Datastore Browser”, create a new folder called “<Machine Name>-Recovered”, either on the same datastore or another.  Move the newly-created “Recovered” vmdk file(s) to the new folder.  Also, copy <Machine Name>.vmx and <Machine Name>.nvram to the new folder and rename both files to become <Machine Name>-Recovered.vmx and <Machine Name>-Recovered.nvram, respectively.

Download <Machine Name>-Recovered.vmx to the local machine and edit it in WordPad (or another plain-text editor).  Replace all instances of <Machine Name>-00000x (where “x” is the last snapshot the machine’s disks are pointing to) with <Machine Name>-Recovered.  Repeat for the other disks if present (e.g. _1, _2) and save the file.  This makes the .vmx reference all of the newly-consolidated disks.  Rename the original vmx file in the datastore to <Machine Name>.vmx.bak and upload the edited <Machine Name>-Recovered.vmx back into the same location.  Once uploaded, go to the “Datastore Browser”, right-click the vmx file and follow the standard process of adding a virtual machine to inventory, naming it “<Machine Name>-Recovered”.
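If downloading the file is inconvenient, the same substitution can be done in place with sed.  A sketch against a minimal, hypothetical two-line .vmx fragment (the real file has many more lines, and the snapshot number will vary):

```shell
# Minimal hypothetical .vmx fragment pointing at snapshot set 000003.
cat > /tmp/MyVM-Recovered.vmx <<'EOF'
scsi0:0.fileName = "MyVM-000003.vmdk"
scsi0:1.fileName = "MyVM_1-000003.vmdk"
EOF

# Point both disks at the consolidated "-Recovered" set, using the same
# name mapping that was used when cloning the disks.
sed -i 's/_\([0-9]*\)-000003/-Recovered_\1/; s/-000003/-Recovered/' /tmp/MyVM-Recovered.vmx
cat /tmp/MyVM-Recovered.vmx
```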

Once in the inventory, edit the virtual machine settings and disconnect the network adapter.  This might require connecting it to a valid virtual machine network first; the main thing is that the adapter ends up disconnected.

Once done, take a snapshot of the virtual machine and power the machine up.  At this point, a “Virtual Machine Question” will come up.  Answer it by selecting the “I copied it” answer.  If the disk consolidation operation was successful for all disks, the machine will come up successfully.  The machine can now be inspected and put into service or investigated for a problem.

Once operation of the machine has been tested and the decision has been made to bring it into service, shut down the virtual machine, reconnect the virtual network adapter to the correct network and power it back up.  After boot is complete, log in to the machine to confirm service status, network connectivity, domain membership and other operations.  If everything is as expected, the restore process is complete and the snapshot can be deleted.

Force Virtual Machine Shutdown Process

First Technique: Using vim-cmd to identify and shutdown the VM

While connected to the ESXi shell and logged in as “root”, run the following command to get a list of all virtual machines running on the target host:

 vim-cmd vmsvc/getallvms

The command returns all virtual machines registered on the host.  Note the Vmid of the virtual machine in question, then get the current state of that virtual machine as seen by the host by running:

vim-cmd vmsvc/power.getstate <Vmid>

If the virtual machine is still running, try to shut it down gracefully using:

vim-cmd vmsvc/power.shutdown <Vmid>

If the graceful shut down fails, try the power.off option:

vim-cmd vmsvc/power.off <Vmid>

Second Technique: Using ps to identify and kill the VM

Warning: Only use the following process as a last resort.  Terminating the wrong process could render the host non-responsive.

While connected to the ESXi shell and logged in as “root”, list all processes for target virtual machine on the current host by running:

ps | grep vmx

That will return a number of lines.  Identify the entries containing vmx-vcpu-0:<Machine Name> and similar.  Make note of the number in the second column of numbers, which represents the parent process ID; it should be the same across most of the lines returned for that machine.  One line, belonging to “vmx”, will contain that number in both the first and second columns.  That is the process ID of the target virtual machine.
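The “same number in both columns” check can be illustrated on a mocked-up listing (the PIDs and exact column layout are hypothetical; real ESXi ps output varies by build):

```shell
# Mocked-up output resembling "ps | grep vmx" on an ESXi 5.x host.
cat > /tmp/ps-sample.txt <<'EOF'
123456 123456 vmx
123457 123456 vmx-vthread-4:MyVM
123459 123456 vmx-mks:MyVM
123460 123456 vmx-vcpu-0:MyVM
EOF

# The vmx world is the line whose first and second columns match;
# awk picks out its process ID.
PID=$(awk '$1 == $2 {print $1}' /tmp/ps-sample.txt)
echo "$PID"
```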

Once identified, terminate the process using the following command:

kill <ProcessID>

Wait a minute or so, as termination might take some time.  If the virtual machine still hasn’t powered off after that, run the following command:

kill -9 <ProcessID>

This method will not result in a graceful shutdown, but it should terminate the machine, allowing the recovery to take place.  If the machine still cannot be terminated, further investigation will be required on the host, and the only remaining option will be to vMotion the other virtual machines off and reboot the host in question.

Final Words

The beauty of virtualization is that one can test most service scenarios without actually impacting service, and this process is no exception.  For that reason, I would strongly recommend practising it in your lab environment so that you are well prepared in case disaster strikes.  Any virtual machine with some changes between snapshots should make a good test subject.

So, what are you waiting for?  Have a go and it would be great to hear your feedback on how the process went.

This article is a slightly modified version of my post originally posted here on the Xtravirt Blog.