I had the pleasure of attending VMware’s VMworld conference this past week in San Francisco and sat in on several sessions. One of my favorites was called Troubleshooting Using vCenter Operations Manager, given by Kit Colbert and Praveen Kannan. They did a live lab in the session that turned out really well and showed how useful vCenter Operations Manager can be. I thought I’d take this opportunity to share some of what they showed us in the lab.

Before we get started, it might be useful to share some introductory details on vCenter Operations Manager (vCOPS). It can be installed as a plug-in in your current vSphere environment, and after some minimal configuration it will be up and running. vCOPS works best after it has been in the environment for at least a few weeks. Its usefulness lies not so much in finding outright errors (although it can do that) as in finding anomalies. It “learns” what is normal for your environment and can then point out what is out of the norm. The main dashboard shows three core scores, as seen in Figure A.

Figure A

VMware calls these core elements badges. There’s the health badge that shows immediate problems, the risk badge that shows future problems, and the efficiency badge that shows opportunities to optimize. Each badge has subcategories that contribute to its score.

In the live demo, they offered up three scenarios to show the value of the tool. The first showed how to find what’s causing a workload to perform slowly, using the steps below.

  1. In the search field found in the upper-right corner, type in the name of the VM that is slow.
  2. Under the Alerts pane there is an option to filter by workload. Click on the workload filter.
  3. Find the workload alert and then click on it.
  4. From here you can see the symptoms, such as heavy disk I/O.
  5. Now click on the Operations tab.
  6. Check out the Workload section and you can see that the datastore “skittle” (icon representing the datastore) is red.
  7. Click on the datastore skittle and click Details.
  8. Click on the Analysis tab, select Storage as the focus area, and then filter by VM.
  9. The color is based on latency, so if your problem is storage latency, you’ll see it here.
  10. You can deduce that you either need faster storage or more spindles because the current datastore can’t handle the VM workload.
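
The same reasoning can be applied to latency numbers exported from any monitoring tool. Below is a minimal sketch in Python, not something shown in the session. The VM names, the latency samples, and the 20 ms cutoff are all hypothetical and do not come from vCOPS; the point is only to show how ranking VMs by average latency singles out the workload the datastore can't keep up with.

```python
from statistics import mean

# Hypothetical latency samples (ms) per VM on one datastore; not real vCOPS output.
latency_ms = {
    "web01": [4.0, 5.0, 6.0],
    "db01": [35.0, 42.0, 40.0],
    "app01": [9.0, 11.0, 10.0],
}

THRESHOLD_MS = 20.0  # assumed cutoff for "red"; pick one that fits your storage


def slow_vms(samples, threshold_ms):
    """Return (vm, average latency) pairs above the threshold, worst first."""
    averages = {vm: mean(values) for vm, values in samples.items() if values}
    flagged = [(vm, avg) for vm, avg in averages.items() if avg > threshold_ms]
    return sorted(flagged, key=lambda pair: pair[1], reverse=True)


for vm, avg in slow_vms(latency_ms, THRESHOLD_MS):
    print(f"{vm}: {avg:.1f} ms average latency")
```

With this sample data only db01 is flagged, which mirrors the red skittle in the walkthrough.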

The next scenario dealt with capacity constraints. vCOPS can show you which of your VMs are undersized, meaning they don’t have enough memory, CPU, etc., configured. Here are the steps you can follow to find out if you have a sizing problem with your VMs:

  1. Search for the problematic VM by name.
  2. By looking at the dashboard you will be able to see the workload is very high and it’s hitting the memory pretty hard. Although you can see this right in the dashboard, you may want to see if this is a common issue with this machine.
  3. Click on the Planning tab, then click on the Stress badge.
  4. Here you’ll be able to see how much of the time memory has been undersized. So if your memory is showing that it’s been undersized for 80% of the time, it may be time to add more memory!
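
To make the "undersized 80% of the time" idea concrete, here is a rough sketch of the arithmetic in Python. It is an illustration only and not how vCOPS computes its Stress badge: the function name, the sample values, and the 4096 MB configured size are assumptions made up for the example.

```python
def undersized_fraction(demand_mb, configured_mb):
    """Fraction of samples where memory demand exceeded the configured size."""
    if not demand_mb:
        raise ValueError("need at least one sample")
    over = sum(1 for demand in demand_mb if demand > configured_mb)
    return over / len(demand_mb)


# Hypothetical memory-demand samples (MB) for a VM configured with 4096 MB.
samples = [4500, 4700, 3900, 5000, 4800, 4600, 4300, 4900, 4000, 4400]

fraction = undersized_fraction(samples, 4096)
print(f"Undersized {fraction:.0%} of the time")
```

Here eight of the ten samples exceed the configured size, so the VM counts as undersized 80% of the time.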

The last scenario was the most interesting to me. It demonstrated how you can find out which changes to a VM may have caused downtime. Being able to narrow down changes, and even reverse them in one click if you have vCenter Configuration Manager, is very useful. Here are the steps:

  1. Search for the problematic VM.
  2. Click on the red VM skittle under the Operations tab.
  3. Click on the host of the VM and you’ll be able to see the CPU is showing as red.
  4. Click on the CPU.
  5. Click on the Events tab.
  6. There is a time window that can be changed if necessary. You’ll want to look for the time where the graph changes from green to red. This is most likely when the change was made to your VM.
  7. Drill in to where the change is and look at the events list.
  8. In the events list, you’ll be able to see if something was installed, such as antivirus software. An installation, restart, or similar event will most likely be the reason the CPU kept spiking and started showing red in vCOPS. As mentioned above, if you combine this with vCenter Configuration Manager, you can actually find out which user made the change and roll it back in one click!
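
The idea behind steps 6 through 8 (find the moment the graph turns red, then look at the events just before it) can be sketched in a few lines of Python. This is a simplified illustration and not vCOPS's own logic: the 80% threshold, the 15-minute window, the timestamps, and the event names are all hypothetical.

```python
from datetime import datetime, timedelta

RED_THRESHOLD = 80.0  # assumed CPU % that counts as "red"

# Hypothetical CPU samples and change events; not real vCOPS data.
cpu_samples = [
    (datetime(2024, 1, 1, 9, 0), 35.0),
    (datetime(2024, 1, 1, 9, 5), 38.0),
    (datetime(2024, 1, 1, 9, 10), 92.0),
    (datetime(2024, 1, 1, 9, 15), 95.0),
]
change_events = [
    (datetime(2024, 1, 1, 8, 0), "VM powered on"),
    (datetime(2024, 1, 1, 9, 7), "Antivirus installed"),
]


def first_red(samples, threshold):
    """Return the timestamp of the first sample at or above the threshold."""
    for timestamp, value in samples:
        if value >= threshold:
            return timestamp
    return None


def events_before(events, moment, window):
    """Return events that happened within `window` before `moment`."""
    return [event for event in events if moment - window <= event[0] <= moment]


turned_red = first_red(cpu_samples, RED_THRESHOLD)
suspects = (
    events_before(change_events, turned_red, timedelta(minutes=15))
    if turned_red
    else []
)
for timestamp, description in suspects:
    print(f"{timestamp:%H:%M} {description}")
```

With this sample data the only suspect is the antivirus installation at 09:07, three minutes before the CPU first crosses the threshold.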