A few days back, we were hit by a problem on one of our vCenter servers (vCenter 4.0 Update 1): the “VMware VirtualCenter Server” service started failing.  The message in the “System” event log was “The VMware VirtualCenter Server service terminated unexpectedly.  It has done this n time(s).  The following corrective action will be taken in 300000 milliseconds: Restart the service.”, where “n” was the number of restarts it had already done.  The service was trying to restart every five minutes and failing each time.  This was obviously worrying because losing that service means you can’t connect to the console to manage the environment.  I know that you can manage some of it by connecting directly to the ESX servers but there are lots of things you can’t do that way.

My colleagues did some initial investigation but couldn’t find anything obviously wrong with the server, i.e. in terms of disk space and other services.  So, the issue was passed on to me.  I double-checked their investigations and also quickly went through the list of things mentioned in this article here.  Now, I must say that if the server has been stable for a while and this service suddenly starts misbehaving in this way, chances are that most of the things mentioned in that article will come out clean.  It’s still a good idea to go through the checks, especially the ones for free disk space and the health of the database.  Also, check the ODBC connection by going into its settings (if you’re running an x64 server, you might find it by double-clicking C:\Windows\SysWOW64\odbcad32.exe) and clicking “Test Data Source…” on the last screen.  If it comes back with “TEST COMPLETED SUCCESSFULLY!” then your database and connectivity are good.  Don’t mess with it at this stage as it could make your situation worse.  For the sake of completeness, I should mention that vpxd.log was clean too.

Unfortunately, none of the checks mentioned in the article revealed any answers.  To be honest, I wouldn’t be writing this article if they had!  It was time to do some deep digging.  The following is what you’ll have to do, which in my case led to the resolution of the problem:

  1. Enable “Trivia Logging” and restart the service.  The process requires editing vpxd.cfg and is described here.  I would recommend having the log written to a different folder (as mentioned in the document) so that you can clean up easily once the problem is fixed.  Caution: It will generate lots of data, so disable it as soon as you’ve resolved the problem.  For that reason, also keep a copy of the original vpxd.cfg, as it makes for a quick restore.
  2. Restart the “VMware VirtualCenter Server” service.  The service will fail again but it’ll take longer as it will be logging a lot of data.
  3. Go to the folder where the log is being generated.  There will be some zipped files as the logs are compressed after they reach a certain size.  Open the most recent log file and start looking for issues.  Now, given that it’s a huge file and you probably haven’t seen what’s normal and what’s not, it might be very difficult to find the cause here, but it is a useful exercise if you can’t pinpoint the issue by normal means.  That said, I am only going to talk about the issue I hit, and given I’ve been through this process once, I’ll save you the trouble!  Towards the end of the file, I found entries which started small but grew to say:
    “[2000].backing.parent.[insert the word “parent” here 64 times!].parent, vm-xxx”, where xxx is your problem VM!  What’s happening here is that some automated process (the VMware Data Recovery appliance in my case) couldn’t get rid of a snapshot it took to back the VM up, and the problem continued until so many snapshots had accumulated that the service could no longer come up.  “67” seems to be the magic number here!
  4. Find vm-xxx.  There are various ways of doing it but one easy way is to search for “vmId” in the same log file.  You’ll find entries similar to “vmId [xxx={VM name}]” in the file, one of which will correspond to the VM you’re looking for.
  5. Locate the ESX server where the VM is running.  Make a console connection to that ESX server and change directory to /vmfs/volumes/[datastore]/[VM Directory].  Confirm the presence of the extra snapshots in here.  Please note: Chances are that the GUI won’t show you all these snapshots so this step is important.  The files that you’re looking for will be named *-0000??.vmdk, where ?? is the snapshot number.
  6. Once confirmed, shut down the problem VM from the ESX console and remove it from inventory.  Make sure that you don’t “Delete from Disk”!  Also, either shut down the appliance creating snapshots for that VM or disable the associated process so that you can fix the problem without introducing more issues.
  7. Note the name of the folder containing the VM on the store mounted by ESX.  Remember: It’s case-sensitive.  Rename the folder directly on the store to something else so that vCenter can’t find it.
  8. Chances are that this is the only machine with the problem, but look for any others in the same state and repeat the steps above for each of them.  If there are machines heading towards the problem but not quite there yet, fix those later.
  9. At this point, start the “VMware VirtualCenter Server” service.  This time, the service should stay stable.  If it does, disable trivia logging by restoring the original vpxd.cfg file and restarting “VMware VirtualCenter Server” service.
  10. Bring up the vCenter console.  The problem machine should now show as orphaned.  Remove it from Inventory.
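
If you’d rather not eyeball a huge trivia log for steps 3 and 4, a small script can do the hunting.  What follows is only a rough sketch in Python, not anything VMware ships.  It assumes the log lines look like the two fragments quoted above (“…backing.parent.parent…, vm-xxx” and “vmId [xxx={VM name}]”), and the function names and the threshold of 30 are my own inventions, so adjust them to whatever your log really contains.

```python
import re
from collections import defaultdict

# Assumed line shapes, based only on the fragments quoted in this article:
#   "... [2000].backing.parent.parent.parent, vm-123"
#   "... vmId [123={VM name}] ..."   (braces treated as optional)
PARENT_CHAIN = re.compile(r"backing((?:\.parent)+),\s*(vm-\d+)")
VM_ID = re.compile(r"vmId \[(\d+)=\{?([^}\]]+)\}?\]")


def deepest_parent_chains(lines):
    """Return {vm id: longest run of '.parent' seen for that VM}."""
    depth = defaultdict(int)
    for line in lines:
        for chain, vm in PARENT_CHAIN.findall(line):
            depth[vm] = max(depth[vm], chain.count(".parent"))
    return dict(depth)


def vm_names(lines):
    """Return {vm id: VM name} collected from the 'vmId [...]' entries."""
    names = {}
    for line in lines:
        for num, name in VM_ID.findall(line):
            names["vm-" + num] = name.strip()
    return names


def suspects(lines, threshold=30):
    """List (depth, vm id, name) for VMs at or above the threshold.

    The threshold is an arbitrary guess; the article only observed that
    the service gave up somewhere around 67 snapshots.
    """
    lines = list(lines)  # allow a file object or generator to be scanned twice
    names = vm_names(lines)
    chains = deepest_parent_chains(lines)
    return sorted(
        (depth, vm, names.get(vm, "?"))
        for vm, depth in chains.items()
        if depth >= threshold
    )
```

To use it, open the most recent log file from the trivia logging folder and pass the file object to suspects().  Anything it lists is worth checking on the datastore as in step 5, including machines that are heading towards the problem but are not quite there yet.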

This fixes the basic problem and your vCenter service should be back, restoring management functions.  Obviously, you will want to bring the problem machine back, as it was presumably in service before the problem happened.  You also want it restored to the state it was in when you shut it down, but without the accumulated snapshots.  Here is how you do it:

  1. If you have spare space, make another copy of the renamed VM folder.  This is for safekeeping and can be deleted when this is all fixed.
  2. Create an empty folder in the datastore named exactly as it was before.  You noted the name down in step 7 of the previous section; use the same case.
  3. Go to the renamed folder (from an ESX console session) and run the following command: vmkfstools -i [Source vmdk] [Destination vmdk], where [Source vmdk] is the highest numbered vmdk for the VM, e.g. [VM Name]-000067.vmdk (assuming this was the last snapshot in the directory), and [Destination vmdk] is the full path to the newly created folder plus the original vmdk name, e.g. /vmfs/volumes/[datastore]/[VM folder]/[VM name].vmdk.  You don’t need to specify disk format or adapter type as it takes the same configuration as the source.  In the end, you should have the cloned vmdk (along with a [VM name]-flat.vmdk) in the folder named as the original.
  4. Download the original vmx file from the renamed folder and open it in Wordpad.  This is to ensure you create the new VM (in the next step) using the cloned .vmdk with the same parameters.
  5. Create a new VM, using the same parameters as in the vmx file.  When on “Select a Disk” screen, choose “Use an existing virtual disk”.  Click “Next” and on “Select Existing Disk”, browse to the newly cloned disk and click “Next”.  Complete the remaining steps in virtual machine creation wizard as normal and click “Finish” in the end.
  6. Browse the datastore containing the new VM.  The wizard will have created a new folder [VM Folder]_1, where you’ll find three files named [VM name].vmx, [VM name].vmsd and [VM name].vmxf.  Copy these files to the folder where the newly cloned vmdk is.
  7. You’ll now have the VM registered in vCenter but this is not the one we need as it points to the folder with _1 in the name.  Remove this VM from inventory.
  8. Browse to the folder (using the Datastore browser in vCenter console) with the cloned disk. Right-click on the .vmx file (recently copied) and select “Add to Inventory…”.  Go through the questions to add this machine to vCenter.
  9. “Power On” the machine as normal.
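
With dozens of snapshot files in the folder, it is easy to pick the wrong source for step 3.  Here is a minimal sketch in Python that takes a directory listing, finds the highest numbered snapshot descriptor and prints the vmkfstools -i command to run.  The function names are hypothetical, it only builds the command text (it doesn’t run anything), and it assumes a single-disk VM whose snapshot descriptors follow the [VM name]-0000NN.vmdk pattern described above.

```python
import re
import shlex

# Snapshot descriptors as described in the article: <name>-0000NN.vmdk.
# The matching <name>-0000NN-delta.vmdk files hold the data and are not
# what you pass to vmkfstools, so the pattern deliberately skips them.
SNAPSHOT = re.compile(r"^(?P<base>.+)-(?P<num>\d{6})\.vmdk$")


def latest_snapshot(filenames):
    """Return (number, filename, base name) of the highest snapshot, or None."""
    best = None
    for name in filenames:
        m = SNAPSHOT.match(name)
        if m and (best is None or int(m["num"]) > best[0]):
            best = (int(m["num"]), name, m["base"])
    return best


def clone_command(filenames, dest_dir):
    """Build the 'vmkfstools -i source destination' line for step 3."""
    found = latest_snapshot(filenames)
    if found is None:
        raise ValueError("no snapshot descriptor found")
    _, source, base = found
    dest = f"{dest_dir.rstrip('/')}/{base}.vmdk"
    return " ".join(["vmkfstools", "-i", shlex.quote(source), shlex.quote(dest)])
```

Feed it the output of a directory listing of the renamed folder and the full path of the newly created folder, then compare the printed command against what you see on the datastore before running it from the ESX console.  If the VM has more than one virtual disk, each disk has its own chain of snapshot files, and you would need to call this once per disk with only that disk’s files.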

This procedure restores your VM without the snapshots, with the same folder name as before (which is best practice) and in the same state as when you shut it down.  Once you are happy with the operation of everything (probably after a few days), delete the extra [VM Folder]_1 folder, the “renamed” folder and any additional copy that you may have created for safekeeping.

Hope this helps!