Westminster College’s recently completed virtualization project is the second part of what began quite a while back as an ad hoc way to retire some critically aging servers. The servers were still hosting Web applications that we were in the process of phasing out; as such, we didn’t want to buy new servers and completely redeploy those services, so we put into place a couple of VMware ESX (3 & 3.5) servers and used PlateSpin’s physical-to-virtual (P2V) software to remove the potential of hardware failure from the equation.
I’ll start at the beginning
In early 2007, shortly after my arrival at Westminster College, it became apparent that my plan to phase out an existing portal application was going to take a whole lot longer than I had hoped. The supported services were intertwined in many different processes; in fact, three years later, we’re still running one of the applications in production, but it’s the last one. Supporting this portal application were a couple of really, really old servers that were well past their warranty expiration date. On top of that, completely redeploying the portal application was one of the last things I wanted to do since it was only tenuously held together, and the people who had implemented the solution were long gone and had left behind only basic documentation. I also wanted to reduce the number of servers we were running in our small data center; even the older servers were running at only a fraction of their capacity, yet each one still drew power and still needed to be replaced on some kind of cycle.
The desire to move to newer hardware without breaking the bank, reduce electrical consumption, and avoid redeploying all of our existing services led to the phase one virtualization rollout. Once we had that solution in place, we ran in that configuration for a while. Over time, we virtualized a number of newer servers as well, also using the P2V method. As new services were brought online, we generally deployed them on one of the virtual hosts.
The hosts were simple containers to house virtual machines and were not connected to a SAN; all of the storage was local. That said, these were Westminster’s first steps into VMware, and they accomplished the necessary goals at the time.
On to the next steps
Over the years, I’ve become a big believer in the “virtualize everything whenever possible” motto. The great success of the first phase led me to decide to expand virtualization to encompass everything that we could, but I wanted to do so in a much more robust way.
Our initial foray did not implement any availability methods, which was fine for the purpose, but as we moved into our “virtualize everything” mode, we needed SAN-backed ESX servers and a bit more robustness. To achieve our availability goals, we wanted to make sure that we didn’t have any single points of failure. To that end, everything is redundant, and we’ve deployed more servers than are necessary to support our current virtual workloads. We have room for growth, which we will need.
Again, we’re a small environment, so the architecture is pretty simple, but here’s what we have:
- An EMC AX4 SAN - iSCSI, dual controllers, 12 x 300 GB SAS + 12 x 750 GB SATA. Fully redundant.
- 3 x Dell M600 blade servers, each with 2 x 2.66 GHz Quad Core Intel Xeon processors, 32 GB RAM, and 6 NICs (chassis houses 6 x Dell M6220 switches - 1 for each NIC in each server):
  - 2 x NICs for front-end connectivity
  - 2 x NICs for connectivity to AX4 (iSCSI)
    - Each of these is connected to a separate Ethernet switch.
    - Each NIC connects to a different storage processor on the AX4.
    - Each storage connection resides on a different physical network card.
  - 1 x NIC for vMotion
  - 1 x NIC for Fault Tolerance
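To put the parts list in aggregate terms, here is a minimal Python sketch that simply totals the figures quoted above. The variable names are illustrative, and the storage number is raw capacity only; usable capacity depends on RAID levels and hot spares, which aren't covered here.

```python
# Totals derived only from the parts list above; names are illustrative.

# SAN disks: (count, size in GB, type)
disks = [(12, 300, "SAS"), (12, 750, "SATA")]
raw_gb = {kind: count * size for count, size, kind in disks}
total_raw_gb = sum(raw_gb.values())  # raw, before RAID and hot spares

# Per-host NIC roles; these should account for all 6 NICs in each blade
nic_roles = {"front-end": 2, "iSCSI": 2, "vMotion": 1, "Fault Tolerance": 1}
nics_per_host = sum(nic_roles.values())

# Cluster-wide compute: 3 hosts, 2 sockets each, 4 cores per socket
hosts = 3
total_cores = hosts * 2 * 4
total_ghz = total_cores * 2.66
total_ram_gb = hosts * 32

print(f"Raw SAN capacity: {total_raw_gb} GB {raw_gb}")
print(f"NICs per host: {nics_per_host}")
print(f"Cluster: {total_cores} cores, {total_ghz:.2f} GHz, {total_ram_gb} GB RAM")
```

That works out to 12,600 GB of raw disk, 24 cores, and 96 GB of installed RAM across the cluster.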
We’re running 28 virtual machines across these three hosts. Of the processing resources we have in this three-host cluster, we’re using, on average, about 10% of the computing power available to us (Figure A), so there is plenty of room for growth, and we have no worries about performance if one of the physical hosts fails. On the RAM side, we’re using just over 30% of the total RAM available in the cluster, but I think we can bring that down by paying more attention to how individual virtual machines are provisioned (Figure B).
Figure A: We’re using about 10% of our computing resources.
Figure B: We’re using a bit over 30% of the RAM resources of the cluster.
In Figures A and B, note that there are two periods during which we experienced a problem with vCenter that affected statistics gathering. Also, while each machine has 32 GB of RAM, one of our hosts has Dell’s RAM RAID capability turned on, which helps protect the host in the event of a RAM problem. As a result, that server reports only 24 GB of available RAM. Because we have host-level redundancy, we’ll disable this feature during a maintenance window to get the benefit of the full 32 GB of RAM.
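As a rough check on the "no worries if a host fails" claim, the sketch below estimates utilization on the two surviving hosts after losing the largest one. It is a back-of-the-envelope calculation, not a model of how VMware HA admission control works. It assumes the rounded 10% and 30% figures, and it assumes the RAM percentage is measured against the 88 GB the hosts currently report; the function and variable names are hypothetical.

```python
def utilization_after_host_loss(used, host_capacities):
    """Fraction of the remaining capacity in use if the largest host fails."""
    remaining = sum(host_capacities) - max(host_capacities)
    return used / remaining

# CPU: each host has 2 sockets x 4 cores x 2.66 GHz
ghz_per_host = 2 * 4 * 2.66
cpu_hosts = [ghz_per_host] * 3
cpu_used = 0.10 * sum(cpu_hosts)          # assumption: ~10% average load
cpu_after = utilization_after_host_loss(cpu_used, cpu_hosts)

# RAM today: one host reports 24 GB because of the RAM protection feature
ram_reported = [32, 32, 24]
ram_used = 0.30 * sum(ram_reported)       # assumption: ~30% of reported RAM
ram_after_now = utilization_after_host_loss(ram_used, ram_reported)

# RAM once the feature is disabled and all hosts report 32 GB
ram_full = [32, 32, 32]
ram_after_full = utilization_after_host_loss(ram_used, ram_full)

print(f"CPU after one host fails: {cpu_after:.0%}")
print(f"RAM after one host fails (today): {ram_after_now:.0%}")
print(f"RAM after one host fails (all 32 GB usable): {ram_after_full:.0%}")
```

Under those assumptions, the surviving hosts would run at about 15% CPU and a little under half of their RAM, which is consistent with having comfortable headroom for a single host failure.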
In Figure C, you’ll see a look at the full infrastructure. The 50 and 51 are simply internal identifiers.
Figure C: The whole ESX environment.
This summer, we’ll make some changes to our environment to increase overall availability, including:
- A migration from our single-server (physical) Exchange 2007 system to a multi-server (virtual) Exchange 2010 environment. The only service that will remain physical is unified messaging.
- We’re using SharePoint for a lot of stuff, including our public-facing Web site. Our existing SharePoint environment consists of two servers: a dedicated database server and the MOSS server running the other components. As we explore SharePoint 2010, we’ll more likely than not migrate away from the physical SharePoint infrastructure as well.
Even if I have to add additional ESX hosts to support newer initiatives (though I don’t think I will), the availability advantages are too great to ignore.
The virtualization project at Westminster exceeded all of my original goals. We’ve been able to very easily extend the life of aging applications, reduce power consumption, increase availability, and make a huge dent in the budget for equipment replacement in the data center.