VMware Capacity Planner troubleshooting tips

My last VMware Capacity Planner job with a customer was a little cumbersome. The customer had around 350 systems spread all over the world, connected by high-latency links. That latency is normally just within limits, but Capacity Planner seemed to have trouble with it, and it took quite some time to get everything running the way it should. I decided to write a blog post about it, since there is not much to be found on VMware Capacity Planner troubleshooting.

I couldn’t have done all the troubleshooting without the great tips from Kelly Culwell, whose notes from his VMware Capacity Planner training helped me a lot. Thanks Kelly.

 

Follow the wizard

When setting up your data collector, be sure to follow the steps suggested in the wizard, found on the HOME tab. Also, for larger environments, don’t start a new task before the previous one has finished. Doing so can slow down your initial collections, cause a lot of errors, and force you to redo the inventory.

An important step is of course to get as many systems into your collection as possible. The easiest way to discover systems is to run the “Discover Domains” task first, followed by the “Discover Systems” task. The default settings for the “Discover Systems” task can be found under Admin -> options -> jobs: open the “Manual – Discover Systems” job and modify the “Discover” task by pressing “Settings”. Here you can see what this job does, such as scanning Active Directory and DNS. You will also see that it doesn’t do an IP subnet scan. When working with a lot of hosts that are not part of any Active Directory domain, such an IP scan can be very handy. If you need an IP subnet scan, however, I prefer creating a separate task for it. Also keep in mind that, according to the KB article “Systems within a specified range of IP addresses are not discovered”, an IP scan stops when it hits a gap of 20 empty IP addresses.
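Because of that 20-address limit, it can help to split one large scan range into several smaller ones before creating the IP scan task. The snippet below is only a minimal sketch of that idea, not something Capacity Planner provides: it assumes you already have a list of known host addresses (for example from a DNS export or an inventory spreadsheet), and the function name split_scan_ranges is made up for this illustration.

```python
import ipaddress

# From the KB article mentioned above: a scan stops after 20 empty addresses.
GAP_LIMIT = 20


def split_scan_ranges(known_hosts, gap_limit=GAP_LIMIT):
    """Group known IPv4 addresses into (first, last) ranges.

    A new range is started whenever gap_limit or more unused addresses
    lie between two consecutive known hosts.
    """
    addrs = sorted(ipaddress.ip_address(h) for h in known_hosts)
    if not addrs:
        return []
    ranges = []
    start = prev = addrs[0]
    for addr in addrs[1:]:
        empty_between = int(addr) - int(prev) - 1
        if empty_between >= gap_limit:
            # Gap is too wide for a single scan: close the current range.
            ranges.append((str(start), str(prev)))
            start = addr
        prev = addr
    ranges.append((str(start), str(prev)))
    return ranges
```

Each returned pair can then be entered as its own range, so that no single scan has to cross a gap of 20 or more empty addresses.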

The third task on the HOME tab is one of the most important tests you have. After systems have been discovered, for example by reading Active Directory or performing a network scan, it is time to test them. The “Test Collection” task verifies that the systems you want to collect data from are reachable and that the collector is able to log on to retrieve more information.

 

Problem: Host unreachable

Log on to the data collector locally, ping the host by both NetBIOS name and FQDN, and check that the name resolves to an IP address. Also check the latency while pinging. In my experience, at a latency of 500 ms or higher the host can sometimes still be pinged while the data collector reports it as unreachable. I haven’t been able to solve this issue; the only thing I can advise is to move the data collector closer to the hosts.
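If you have to repeat this check for many hosts, a small script run on the data collector can do the resolving and timing for you. The following is a rough sketch under a few assumptions: it times a TCP connect to port 135 (the RPC endpoint mapper) instead of sending an ICMP ping, the 500 ms threshold is only my own observation from above and not a documented limit, and all function names are invented for this example.

```python
import socket
import time

# Threshold taken from the observation above, not a documented limit.
LATENCY_WARN_MS = 500


def resolve(name):
    """Return the IPv4 address for a NetBIOS name or FQDN, or None."""
    try:
        return socket.gethostbyname(name)
    except OSError:
        return None


def tcp_latency_ms(address, port=135, timeout=5.0):
    """Time a TCP connect in milliseconds; None if the connect fails."""
    start = time.perf_counter()
    try:
        with socket.create_connection((address, port), timeout=timeout):
            return (time.perf_counter() - start) * 1000.0
    except OSError:
        return None


def assess(latency_ms, warn_ms=LATENCY_WARN_MS):
    """Turn a measured latency into a short verdict."""
    if latency_ms is None:
        return "unreachable"
    if latency_ms >= warn_ms:
        return "reachable, but latency may be too high for the collector"
    return "ok"
```

Call resolve() once with the short name and once with the FQDN; if both return the same address, pass it to tcp_latency_ms() and feed the result to assess().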

 

Problem: Connection refused / Connect to Remote Registry server failed

-   Make sure the account the data collector uses to connect to this host is a member of the local Administrators group on that computer.

-   The account used to connect to the host must not have a blank password.

-   The Administrator must be defined in relevant security policies.  Open secpol.msc and check the users granted permissions for the following:

  • Profile Single Process
  • Profile System Performance
  • Access this computer from the Network
  • Log on as a service
  • Log on locally

-   Check the Perfmon registry access rights from the Data Manager by attempting to browse the Registry in the navigation pane. Expand the server that’s failing, expand the Registry folder, and expand HKEY_LOCAL_MACHINE. If you see an error here, check the access control list (ACL) for the registry and make sure the default rights have not been changed. The default rights are:

  • System: Full
  • Administrator: Full
  • Restricted: Read
  • Everyone: Read

    If you are unable to access all the keys, restart the Remote Registry service on that target server.

-   Check the file system access control lists in NTFS. Both Administrator and SYSTEM must have Full Control in the ACL for the files:

  • %SYSTEMROOT%\System32\Perfc009.dat
  • %SYSTEMROOT%\system32\Perfh009.dat

-   The following table lists the core services that are required to support data collection. Access errors are often the result of one of these services not having the correct start-up type or the user rights needed to support data collection.

Service                                                     Start-up type  Log on as
Remote Registry                                             Automatic      Local Service
Performance Logs and Alerts                                 Manual         Network Service
Remote Procedure Call (RPC)                                 Automatic      Network Service
Remote Procedure Call (RPC) Locator                         Automatic      Network Service
Windows Management Instrumentation (WMI)                    Automatic      Local Service
Windows Management Instrumentation (WMI) Driver Extensions  Automatic      Local Service

-   In addition to the core services, these helper services are required to support data collection as well:

Service                           Start-up type  Log on as
COM+ Event System                 Manual         Local System
COM+ System Application           Manual         Local System
WMI Performance Adapter           Manual         Local System
Net Logon                         Manual         Local System
Secondary Logon                   Automatic      Local System
Remote Access Connection Manager  Manual         Local System
Workstation                       Automatic      Local System
Server                            Automatic      Local System
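Checking these services by hand on dozens of hosts gets tedious, so here is a hedged sketch of how the check could be scripted from the data collector. It relies on the standard Windows "sc qc" command and parses its START_TYPE and SERVICE_START_NAME lines. The short service names in EXPECTED are my own assumption for three of the core services; the others differ between Windows versions, so look them up with "sc query" before adding them. The function names are invented for this example.

```python
import re
import subprocess

# Expected start-up types from the lists above. The short service names are
# assumptions; verify them on the target Windows version with "sc query".
EXPECTED = {
    "RemoteRegistry": "AUTO_START",
    "RpcSs": "AUTO_START",
    "Winmgmt": "AUTO_START",
}


def parse_sc_qc(output):
    """Extract start-up type and log-on account from 'sc qc' output."""
    start = re.search(r"START_TYPE\s*:\s*\d+\s+(\w+)", output)
    account = re.search(r"SERVICE_START_NAME\s*:\s*(.+)", output)
    return {
        "start_type": start.group(1) if start else None,
        "account": account.group(1).strip() if account else None,
    }


def query_service(host, service):
    """Run 'sc qc' against a remote host (Windows only) and parse it."""
    result = subprocess.run(
        ["sc", rf"\\{host}", "qc", service],
        capture_output=True, text=True, check=False,
    )
    return parse_sc_qc(result.stdout)


def mismatches(host, expected=EXPECTED):
    """Return {service: actual start type} for services that deviate."""
    found = {}
    for service, wanted in expected.items():
        actual = query_service(host, service)["start_type"]
        if actual != wanted:
            found[service] = actual
    return found
```

A value of None in the result usually means the query itself failed, which is a hint that the Remote Registry or RPC problems described above are in play.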

 

Problem: Connect to WMI failed

There are a number of troubleshooting tips for this error:

  • Make sure the account the data collector uses to connect to this host is a member of the local Administrators group on that computer.
  • The account used to connect to the host must not have a blank password.
  • Check that DCOM is enabled on both data collector and host. Check the registry key: “HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Ole” for the string “EnableDCOM” with the value “Y”.
  • Make sure that WMI is installed. On Windows 2000 and later WMI is installed by default. On Windows NT4, WMI must be installed manually. To check for the presence of WMI, click Start -> run and type wbemtest. If the Windows Management Instrumentation Tester application starts, then WMI is installed.
  • For Windows XP Professional target systems, make sure remote logins are not being coerced to the Guest account. This behaviour is known as ForceGuest and is enabled by default on computers that are not joined to a domain. Run the Local Security Policy editor and select the “Network access: Sharing and security model for local accounts” policy. If this entry is set to “Guest only”, right-click it and choose Properties. Select “Classic – local users authenticate as themselves” and restart the computer.
  • On Windows XP Service Pack 2 systems, configure the firewall to allow remote administration. To do this, open a command prompt and type: netsh firewall set service RemoteAdmin enable
  • Configure internal firewalls to allow WMI traffic. Even if the customer is not running dedicated firewall software, some antivirus solutions contain their own firewall functionality, and such software may cause WMI errors when it is not properly configured. (Extra info at: http://msdn.microsoft.com/en-us/library/aa389286(VS.85).aspx )
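The DCOM check from the list above can also be scripted. The sketch below uses Python's standard winreg module, which only exists on Windows, to read the EnableDCOM string locally or from a remote host. Reading it remotely depends on the Remote Registry service on the target, so a failure here may point back to the previous section. The function names are invented for this example.

```python
try:
    import winreg  # standard library, available on Windows only
except ImportError:
    winreg = None

OLE_KEY = r"SOFTWARE\Microsoft\Ole"


def dcom_enabled(value):
    """Interpret the EnableDCOM string: "Y" (any case) means enabled."""
    return isinstance(value, str) and value.strip().upper() == "Y"


def read_enable_dcom(host=None):
    """Read EnableDCOM locally (host=None) or from a remote host."""
    if winreg is None:
        raise RuntimeError("winreg is only available on Windows")
    # ConnectRegistry expects a name of the form \\computername, or None.
    target = rf"\\{host}" if host else None
    with winreg.ConnectRegistry(target, winreg.HKEY_LOCAL_MACHINE) as hklm:
        with winreg.OpenKey(hklm, OLE_KEY) as key:
            value, _value_type = winreg.QueryValueEx(key, "EnableDCOM")
            return value
```

Run dcom_enabled(read_enable_dcom()) on the data collector itself and dcom_enabled(read_enable_dcom("hostname")) for each failing target, since DCOM has to be enabled on both sides.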

 

Sizing the data collector

When collecting data, the collector can handle up to 500 systems. In the field, however, I’ve learned that 500 works when all the systems are local; when many systems sit behind firewalls, in different subnets, or on remote networks, this number can drop to maybe 300 systems per collector. The best option is to place a data collector at each site, but in my case this was not possible, since there weren’t enough systems available for VMware Capacity Planner to run on.

Do pay attention to how long the performance collection job runs. Normally a performance collection job runs every hour, but when it takes more than an hour, it is aborted when the next job is supposed to start. When this happens you have two options. Either set the job to run every two hours, so that it has more time to finish, or move the slow-responding systems to a second data collector and have only that one collect performance data every two hours. After the separation, that job may even be small enough to keep at one-hour intervals.
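The arithmetic behind both options is simple enough to sketch. The snippet below is illustrative only: it assumes you have timed the job yourself and, for the second option, that you have some per-system response times from your own measurements. Capacity Planner does not hand you these numbers in this form, the 60-second threshold in the usage note is arbitrary, and the function names are invented for this example.

```python
import math


def collection_interval_hours(job_minutes, default_hours=1):
    """Smallest whole-hour interval in which a job of job_minutes can finish."""
    needed = math.ceil(job_minutes / 60)
    return max(default_hours, needed)


def split_by_speed(durations, slow_threshold_s):
    """Split {system: seconds} into sorted (fast, slow) lists of names.

    The slow list is the candidate set for a second data collector.
    """
    fast = sorted(n for n, s in durations.items() if s < slow_threshold_s)
    slow = sorted(n for n, s in durations.items() if s >= slow_threshold_s)
    return fast, slow
```

For example, a job that takes 75 minutes gives collection_interval_hours(75) == 2, and split_by_speed(times, 60) returns the systems that respond within a minute separately from the ones that do not.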

Something to keep in mind is that certain operating systems or servers may collect more slowly than others. Older, less powerful systems may return data more slowly, as may very busy or heavily loaded servers. Linux and UNIX data collection may take slightly longer than Windows data collection because more data may be returned. Keep this in mind when sizing for an assessment with a lot of Linux and UNIX servers.

 

Logging and reporting

If you need to dive deeper into what is happening with the data collector, the first step is to check the reports. Under the Reports option in the top menu, you’ll see a number of very useful reports that show what data is and is not being collected. Another option is to increase the logging level. Go to Admin -> options and on the General tab you see the default value is “Detailed Progress” with a log file size of 5 MB, but you can set the logging even 3 levels higher if you want to.

11 thoughts on “VMware Capacity Planner troubleshooting tips”

  1. Nice post Gabe. I've had a nearly identical experience to the one you're describing with slow links and lots of servers spread out throughout the country. Setting the performance task interval to once every two hours usually does the trick and still gives you enough relevant data to analyze the environment.

    One more thing I'll say about Capacity Planner is that you shouldn't always take the inventory data that it gathers as fact. Oftentimes a spreadsheet maintained by the IT staff will be just as accurate, or more so. As an example, I recently completed an analysis in which the environment had non-standard (self-assembled) servers. On nearly all of those servers, the Data Collector reported 8 times as much RAM as they actually had. 8GB systems were reported as having 64GB, which totally threw off the consolidation estimates. As the outside consultant I didn't know that this data was incorrect, so this delayed things a bit.

    Great post. Hopefully this becomes a standard post for CP troubleshooting similar to Duncan's esxtop post for performance analysis.
