Recently I was working with a company on their VMware View 4.6 environment. They had been testing it for several months but had always experienced some issues. When they added more pools and a couple of hosts (ESXi 5.0 servers), the issues seemed to get worse. Of course, all the errors were intermittent, so it was hard to narrow down the exact cause, which turned out to be multiple causes. In this post, I’ll take you through some of the troubleshooting and how, in the end, I was able to correct the issues with some help from VMware support.
Mostly we were getting “Agent Unreachable” messages when we looked at the desktops in View Manager. Keep in mind that “Agent Unreachable” is not necessarily a bad thing, because at one point or another all desktops will show that message. It’s only bad if it never changes to the “Available” status, which is what was happening in this environment. Not all of the desktops were getting this error, but by the end of the day 40+ desktops out of a couple hundred would end up with a permanent status of Agent Unreachable. Immediately I thought there might be a problem with DNS, the network, or DHCP. Now, this environment has a different DHCP server on every subnet, and we were experiencing this issue on multiple subnets. Surely there wouldn’t be a problem with not one but two DHCP servers. Turns out, there was. They were both experiencing high CPU loads (pegged at 100%, actually) due to some other problematic services. I shut down these services, feeling pretty confident the View environment would then work perfectly. This was not the case.
At this point I went to one of the View desktops that was reporting the error. We logged in to it via the console through the VI client and saw that it had an APIPA address of 169.254.x.x, which meant it had never received a lease from DHCP. That explained why the agent was unreachable, but I didn’t see why it was getting that address. We looked at another desktop with the same issue and noticed that it was on the same host. This led us to look at that host, where we could see that all the VMs on it were experiencing this network issue. To narrow down the problem, we tried pinging between two VMs on the same VLAN, vSwitch and host. Traffic between two such VMs never leaves the host, so when the pings failed, that pointed to an issue with the ESXi host itself and not the physical network. VMware had me restart the host, and all of a sudden these VMs were able to get IP addresses again. I was able to re-provision several desktops and everything seemed to be working right. The next day, though, I got a call saying the problem was back. I went in to check whether the host was doing the same thing, and it was. I placed the host in Maintenance mode so no other desktops could be provisioned on that server, and everything seemed to work better. There were fewer problematic desktops, although there were still some. I then did a fresh install of ESXi on that server and took it out of Maintenance mode. Everything was working better, but we were still getting some errors…back to the drawing board.
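If you have an export of desktop names and the addresses they currently hold, a quick script can flag the ones that fell back to APIPA. This is only a minimal sketch: the desktop names and addresses below are made up for illustration, and how you collect that inventory is up to you. It uses Python's standard `ipaddress` module to test for the 169.254.0.0/16 link-local range.

```python
import ipaddress
from typing import Dict, List


def is_apipa(address: str) -> bool:
    """Return True if the address is in the IPv4 link-local (APIPA) range 169.254.0.0/16."""
    ip = ipaddress.ip_address(address)
    return ip.version == 4 and ip.is_link_local


def find_apipa_desktops(desktops: Dict[str, str]) -> List[str]:
    """Return the sorted names of desktops whose address is an APIPA address."""
    return sorted(name for name, addr in desktops.items() if is_apipa(addr))


# Hypothetical inventory: desktop name -> address seen on the console.
example_desktops = {
    "VDI-001": "10.20.30.41",
    "VDI-002": "169.254.17.88",
    "VDI-003": "169.254.201.5",
}

print(find_apipa_desktops(example_desktops))  # ['VDI-002', 'VDI-003']
```

Grouping the flagged desktops by host afterwards is what would surface a pattern like the one we found, where every affected VM sat on the same ESXi server.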
All the hosts seemed to be working, DHCP was no longer pegged at 100%, and DNS was fine…so it was time to look at the networking. I pulled up the running configuration of the switch stack and saw that LACP was configured on the ports connected to the ESXi hosts. Now, for me, link aggregation and VMware are kind of murky. I’ve been told by VMware support that LACP is not supported in versions 5.0 and before. However, after talking to a coworker and doing some reading, I saw that link aggregation is actually supported, but only in static mode, meaning a port channel with no LACP negotiation. On this switch stack, the ports were configured with LACP in dynamic mode (which looks like this in the running config of the interface: channel-group 1 mode active). You can either change this to a static port channel (channel-group 1 mode on) or stop using link aggregation altogether. If you use a static port channel, then you need to change your NIC teaming policy in VMware to Route Based on IP Hash. If you choose not to use link aggregation at all, then you need to set your NIC teaming policy to something else, such as the default, Route Based on the Originating Virtual Port ID. Once I changed this to the proper configuration, everything started working properly.
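To make the pairing between switch configuration and teaming policy easier to check across many ports, here is a rough sketch that reads one interface stanza from a running config and reports which teaming policy should go with it. It assumes Cisco IOS-style `channel-group <n> mode <mode>` lines like the one quoted above; the interface names in the example are hypothetical, and the advice simply encodes the conclusion of this post for ESXi 5.0, not an official support matrix.

```python
import re
from typing import Optional

# Modes that negotiate the port channel dynamically (assumed Cisco IOS keywords).
LACP_MODES = {"active", "passive"}
PAGP_MODES = {"desirable", "auto"}


def channel_group_mode(interface_config: str) -> Optional[str]:
    """Return the channel-group mode found in an interface stanza, or None if absent."""
    match = re.search(
        r"^\s*channel-group\s+\d+\s+mode\s+(\S+)", interface_config, re.MULTILINE
    )
    return match.group(1).lower() if match else None


def teaming_advice(interface_config: str) -> str:
    """Map the switch-side port channel setting to the teaming policy described in the post."""
    mode = channel_group_mode(interface_config)
    if mode is None:
        return "no port channel: use Route Based on the Originating Virtual Port ID"
    if mode == "on":
        return "static port channel: use Route Based on IP Hash"
    if mode in LACP_MODES or mode in PAGP_MODES:
        return f"negotiated port channel (mode {mode}): change to static or remove it"
    return f"unrecognized mode {mode!r}: check the switch documentation"


# Hypothetical stanza resembling what we found on the switch stack.
example_stanza = """interface GigabitEthernet1/0/1
 switchport mode trunk
 channel-group 1 mode active
"""

print(teaming_advice(example_stanza))
```

Running this against each uplink stanza would have flagged our ports as negotiated port channels, which is the mismatch that was causing the remaining errors.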
These troubleshooting steps may not fix all your issues, but hopefully they give you a place to start and some ideas about what might be wrong if you’re seeing similar problems. Also keep in mind that your desktops will only be as good as your base image. So make sure that you use a clean install for your base image, apply all updates (especially for Windows XP), and install VMware Tools, then the View Agent…in that order! I learned a few lessons during this troubleshooting, the biggest one being that just because you’re only seeing one issue doesn’t mean there’s only one cause.