
Active Directory virtualization best practices

Brad Bird offers this set of best practices for virtualizing the Microsoft Active Directory role with particular attention paid to time synchronization, fault tolerance, high availability, and FSMO role positioning.

In May 2009, I worked with Infinite Group Inc. and conducted a virtualization assessment for all state community colleges in Mississippi. In particular, I was asked to create a set of best practices as guidance to use for virtualizing Active Directory.

Since most Windows-based network services rely on Microsoft Active Directory, virtualizing this role requires careful consideration. In particular, the following elements must be carefully planned:

  • Time synchronization
  • Fault tolerance
  • High availability
  • FSMO role positioning

Time Synchronization

All Active Directory services depend on time in some way. For services such as authentication, event logging, and licensing, the relationship is obvious. For other services, such as updating, the relationship may be less so.

When a server is virtualized, most if not all of its components are virtual rather than physical, including the processor, and the operating system keeps time by counting processor ticks. Most physical clocks and time-keeping devices are already imprecise to some degree. It is very difficult to maintain precise and accurate time without some skew, which requires periodic adjustments to keep time accurate.

In the case of virtualization, virtual machines require a mechanism to adjust, or translate, the virtual processor ticks and synchronize them with some time source. This skew is more apparent in virtual machines than in their physical counterparts, so these adjustments occur more often.

Time synchronization is one reason it is not recommended to deploy Active Directory services in an entirely virtualized environment.
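
To make the effect concrete, the following Python sketch (an illustration only, not part of the original assessment) measures how far a machine's clock has drifted from an external time source using a minimal SNTP query. The server name and the 300-second threshold are choices made for the example; 300 seconds matches the default Kerberos clock-skew tolerance in Active Directory.

    import socket
    import struct
    import time

    NTP_SERVER = "pool.ntp.org"    # assumption: any reachable external time source
    NTP_UNIX_DELTA = 2208988800    # seconds between the NTP epoch (1900) and the Unix epoch (1970)

    def clock_offset(server=NTP_SERVER, timeout=5):
        """Return the approximate offset in seconds between this machine and the NTP server."""
        packet = b"\x1b" + 47 * b"\0"          # minimal SNTP request: LI=0, VN=3, Mode=3 (client)
        with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
            sock.settimeout(timeout)
            t_sent = time.time()
            sock.sendto(packet, (server, 123))
            reply, _ = sock.recvfrom(512)
            t_received = time.time()
        # The server's Transmit Timestamp (seconds field) starts at byte 40 of the reply
        server_time = struct.unpack("!I", reply[40:44])[0] - NTP_UNIX_DELTA
        return server_time - (t_sent + t_received) / 2

    if __name__ == "__main__":
        skew = clock_offset()
        print(f"Clock differs from the NTP source by roughly {skew:+.2f} seconds")
        if abs(skew) > 300:   # 300 s is the default Kerberos clock-skew tolerance
            print("WARNING: skew exceeds the Kerberos tolerance; authentication may start failing")

On a virtualized domain controller, running a check like this before and after a suspend/resume or a migration gives a quick read on whether the hypervisor's time correction and the guest's own synchronization mechanism are working against each other.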

Fault Tolerance

In any Active Directory deployment, more than one server running the Active Directory Domain Services role is recommended for fault tolerance. In fact, at least two domain controllers are recommended as a best practice for every domain in an Active Directory forest. The reason is to ensure that, at any given time, more than one server holds a copy of the Active Directory database.

Active Directory is designed so that every domain controller installed is as authoritative as its neighbors; this design is called multi-master. The term multi-master itself is normally used when referring to Active Directory replication, the process of copying changes in the Active Directory database from one domain controller to another.

In the case of virtualization, typically one domain controller in every domain should be configured as a physical server to ensure fault tolerance in the event of a failure.

High Availability

Virtualizing Active Directory does have the distinct advantage of indirectly enabling an Active Directory domain controller to be configured as highly available.

If only physical servers were used, there would be no practical way to make an Active Directory domain controller highly available. To achieve this functionality, the Active Directory database and log files would require careful placement within highly available file-share resources, which vastly increases the complexity of an environment.

Active Directory domain controllers installed in a virtual machine may be placed on a cluster, with the virtual machine itself being the highly available workload. This effectively allows Active Directory domain controllers to become highly available quite easily.

FSMO Role Positioning

At a basic level, Active Directory domain services make use of a multi-master model. There are, however, several Active Directory functions or roles that must be tied to a particular server and cannot be shared among all domain controllers. These are referred to as flexible single master operations (FSMO) roles.

In addition to the database and log files, Active Directory requires that these roles be in service and available for communication. If some of these roles are configured on a virtual server, it is recommended that the server not contain any critical workloads other than Active Directory domain services.

The reason for this is that if the virtual server were to fail and not be quickly recoverable, the FSMO roles it held would need to be seized by another Active Directory domain controller. This process is not clean, and if the failed server were ever recovered, it would still hold Active Directory metadata that is no longer valid. The recommended course of action to re-establish the server into service would be to reinstall the operating system, rejoin the domain, add the Active Directory Domain Services role, and allow it to replicate with its Active Directory domain controller partners. Only once this is done should any FSMO roles be transferred back to the server.
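
As a verification aid when transferring or seizing roles, a sketch like the one below can confirm which domain controller currently holds each FSMO role by reading the fSMORoleOwner attribute on the well-known objects that anchor the roles. This is an illustration, not part of the original guidance: the hostname, account, password, and domain DN are placeholders, and the ldap3 Python library is assumed to be available.

    from ldap3 import Server, Connection, ALL, BASE

    # Placeholders for illustration -- substitute a real DC, account, and domain DN.
    server = Server("dc01.example.com", get_info=ALL)
    conn = Connection(server, user="administrator@example.com", password="secret", auto_bind=True)

    domain_dn = "DC=example,DC=com"
    config_dn = "CN=Configuration," + domain_dn

    # Each FSMO role is recorded as the fSMORoleOwner attribute of a well-known object.
    role_objects = {
        "PDC emulator":          domain_dn,
        "RID master":            "CN=RID Manager$,CN=System," + domain_dn,
        "Infrastructure master": "CN=Infrastructure," + domain_dn,
        "Schema master":         "CN=Schema," + config_dn,
        "Domain naming master":  "CN=Partitions," + config_dn,
    }

    for role, dn in role_objects.items():
        conn.search(dn, "(objectClass=*)", search_scope=BASE, attributes=["fSMORoleOwner"])
        owner = conn.entries[0].fSMORoleOwner if conn.entries else "not found"
        print(f"{role}: {owner}")

The value returned for each role is the distinguished name of the owning domain controller's NTDS Settings object, so the server name appears within it. Running a check like this before decommissioning a failed virtual domain controller, and again after seizing or transferring the roles, confirms that nothing still points at the dead server.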

In the case of a virtual machine, the process of rebuilding or provisioning a new server can take only a few minutes, which is a significant improvement over the time needed to bring a physical server back into service.

The recommended course of action for the failed virtual server is to decommission it, since it is no longer useful to rejoin it to the domain, and the repercussions of deleting a virtual machine file are significantly less than those of maintaining an expensive physical server asset that is not in use.

About

Brad Bird is a lead technical consultant and MCT certified trainer based in Ottawa, ON. He works with large organizations, helping them architect, implement, configure, and customize System Center technologies, integrating them into their business pr...

12 comments
david.hunt

I'll start by saying that my virtualisation experience is with VMware (ESX / Server / Workstation) in a large multi-site environment. Next, I'll say something controversial... I don't know why anyone would use a Windows hypervisor. I confess to being a VMware fan, but I have seen the difference in administration, issues, and uptime between even VMware products (Workstation and Server) hosted on Windows compared with hosted on Linux. VMware ESX is just rock solid. We've run hosts for up to a year without a hiccup, the eventual need to reboot being for site power maintenance and VMware patching / product upgrades.

I keep seeing posts, not just on TechRepublic, that talk about and recommend NTP synchronisation for guests in VM environments. This is contrary to the best practice published by VMware. So why should you use VMware Tools to obtain guest time synchronisation from the VMware host? Rather than me parroting VMware, I refer the reader to VMware's excellent treatise (http://www.vmware.com/pdf/vmware_timekeeping.pdf). For those who don't want to read 24-odd pages, the really quick explanation is this: time in any operating system will deviate from actual time due to the inherent instability of the timing source. For this reason, methods are employed to enable synchronisation with an external timing source, such as a national time standard service. TCP/IP is not capable of *real time* communication, and of course we have the damnably slow speed of light to contend with ;-) Thus the synchronisation facilities NTP and W32Time have built-in algorithms that attempt to measure the network delay to the timing source and compensate. As the delay is not fixed, this results in the host time approximating, or chasing, the time source. As we cascade time synchronisation sources, the variation away from the original source increases (look up the definition of "stratum").

Now introduce the complexity of virtualisation, where there is also a changing variation between the VM host and VM guest due to variable time slicing. VMware has a mechanism to correct time in the guest for the errors caused by variable time slicing. This means that if we use the guest's native time synchronisation (NTP or W32Time), we have two algorithms that are unaware of the correction introduced by the other. Thus the tracking of actual time becomes less precise and the possible extremes of variation increase. None of this is going to kill anything in a simple environment, but should you have a more complex environment and be synchronising PCs from DCs that are swinging in real time, plus multiple time trees within the organisation running at different stratum levels, you can start to get time synchronisation errors appearing at random. One must be very careful to properly design a time tree for your organisation, with appropriately cascaded stratum levels between the different elements, to avoid such problems, which are very difficult to diagnose unless you understand the time tree that has been implemented.

Several of the principles should be:
1. Use VMware Tools to perform guest time synchronisation.
2. Don't run NTP or W32Time in guests.
3. Synchronise all VM hosts at the same stratum level (as close, in stratum level, as possible to a national standard source).
4. Synchronise PCs and workstations to the same stratum level as the VM hosts, or to the VM hosts themselves.
5. Don't use virtualised hosts to propagate time synchronisation.

I hope this provides some food for thought...
Happy New Year to all.

seamusobr

I agree with the other post; I do not understand why time synchronization should be an issue. The PDC emulator should be configured to take its time from an external source, not the host it is running on. I'm still not sure, though, why you would need a physical DC, as you can have multiple virtual DCs running on different physical boxes.

jdemersseman

I appreciate the time you put into this piece, but based on five years of production experience with VMware ESX, I'm going to disagree. Overall, your "best practices" seem to be written from the perspective of one for whom virtualization is a rather novel concept. Our firm installed ESX in the fall of 2004 and installed two virtual DCs replacing our physical DCs. We have had no issues with this arrangement. That being said, my comments are from the VMware perspective. I have extremely high confidence in their product; I can't vouch for other vendors.

Time synchronization: This is not a reason to maintain a physical DC. Physical servers are subject to degradation as well, so while this might be slightly exacerbated in a virtual system, the underlying problem remains. Rather, this is a reason to synchronize your DCs' time with an authoritative NTP server in the nation in which you're headquartered, like the US Naval Observatory in the US. A list of stratum 1 time servers can be found at http://support.ntp.org/bin/view/Servers/StratumOneTimeServers.

Fault tolerance: Why is maintaining a physical DC any more fault tolerant than keeping virtual DCs on separate hosts with no common points of failure (like the same SAN)? (Even that is not a requirement for smaller environments where a non-replicated SAN represents an understood single point of failure for company IT operations in general.) If I have two DCs on two separate hosts, I contend that is just as fault tolerant as having a separate physical DC, and it's clearly more cost effective.

High availability: Yes.

FSMO role positioning: I'm not clear what you're going for here. Losing a physical system which hosts the FSMO roles would be more painful from a recovery standpoint than losing a virtual system, because the latter is easier to rebuild if necessary. To my mind everything else is the same in either scenario.

In closing, in 2006 we had a contractor come in to head up our infrastructure team. He had never worked with virtualization before and was aghast that we had virtualized both our DCs. He promptly moved a physical workload to another system and rebuilt that as a physical DC with all the FSMO roles. Within a few months his contract expired and he took a full-time position with an integrated service provider and became a VMware Certified Trainer and their lead virtualization architect. We laugh about his reluctance now.

WayneAndersen

Clock issues can be a little more flaky on a virtual machine than on a physical machine. If you power down a physical machine, there is a battery-powered clock that maintains the time, so when you power it up it has close to the current time. It is possible for a virtual machine to be in a "suspended" state for, say, an hour, and when it is "unsuspended" the clock would be off by an hour until it synched up again. I have a virtual DC but also have a physical DC.

brad

You bring up good points and I do not disagree with them. Incidentally, this piece was written to be vendor agnostic; when I was doing the virtualization assessment, that was one of my main guidelines. When you speak of additional hosts, you are talking about another physical computer with more virtual machines on it.

Actually, I am well aware that Active Directory can exist in an entirely virtual environment, and evidently yours is and has been for some time, and I am happy that you have had no issues. From the point of view of virtualization, shops the size of yours (my impression, given the ease with which you refer to multiple hosts) often have a dedicated virtualization administrator. In addition to being a master of virtualization, this administrator arguably must be an expert in all technologies that he virtualizes. If an administrator "happens" to be an expert in Active Directory AND in virtualization, that is great, and if he has appropriately informed the company he works for of exactly how vital AD is to their organization and what they must do to make sure that these VMs never go down, even better. Now, how many companies do you think are in this situation? Based on the fact that I have been repeatedly hired to consult for companies to make sure that this never happens again "past tense", I'll just let you know you are in the few and the proud.

Virtualization is a fantastic thing; however, it adds yet another layer of complexity to an already complex technology like Active Directory. Bear in mind, the first thing I said is that I agree with most of what you've said and am thrilled your experience is good. However, I have been there when things are not so good. I would love to hear more about what you do with VMware, by the way; please ping me offline at brad@owsug.ca if you feel like it.

tmickey

I understand the virtualized DCs being on separate hosts with separate SANs, but many organizations do not have that very costly luxury. As a best practice I would say you "always" have one physical DC, so that if you lose your hosts, logon requests and AD will continue to function. In theory, if you cover all your bases without single points of failure, then you could virtualize everything with zero physical servers aside from the hosts. I would not call that a best practice, though. I use ESXi 4 and truly love it. It is great, and I am virtualizing DCs, but I will always have one physical DC, and it will have a physical tape drive. I think that is "best practice". Not necessarily required, needed, or necessary, but a best practice nonetheless.

sscribner

As a consultant to many clients, I have the opportunity to see a lot of different environments. What is so funny is how engineers who have not jumped on the virtualization train complain and push back when we come into a shop that is not using virtualization and virtualize - they change so quickly. Fear - that's what I think it is - fear of what they don't know and of change. I am always most satisfied when we convert the internal engineer and now they want MORE virtualized servers. As for the time sync issue, I agree with this post. Physical or virtual, there is no difference in the degradation of the CPU clock - it may be more on one than the other, but it is still the same issue. Use an external time source. If you're not, then you're not using best practices, physical or virtual.

david.hunt

I agree with almost all of the comments made by "jdemersseman", with the exception of the time synchronisation. Time synchronisation seems to be very difficult for people to get their mind around, so I'll post a separate comment on that as a new thread.

I can only imagine the author was talking about a small environment where virtualising the only two domain controllers puts them in the same fault domain (same VM host, same SAN, same computer room, etc.). In a multi-site domain, where you have on-site virtualisation at separate physical sites, there is less risk in virtualising all AD functions than in having them on physical hosts, as long as they are distributed so as to provide fault tolerance commensurate with the required service level. Rebuilding a domain controller in a VM environment is much quicker than on a physical machine.

I believe the original article should have indicated the nature of the environment to which the practices were targeted. As environments can differ so widely, from a simple single site with one VM host to multi-site domains with multiple VM hosts per site, this is a significant consideration.

windowsmt60

I think that the only symptom of clock issues would not be so much for the server itself as it would be an issue for domain clients authenticating to the DCs and for replication between DCs. This would only be apparent, however, with a large number of clients and a small number of DCs, as well as saturation of the VMware platform. The degradation would have to be so severe that, between NTP updates, the time would suffer a loss of clock ticks large enough to affect replication, and if the system were saturated to that point, I think you would have much greater problems in the form of other unanswered UDP requests, GC lookups, DNS, authentication issues, and so on. So I agree that it is highly unlikely that you would have a major issue with NTP.

What you WOULD want to make sure of is that the infrastructure master and GCs are on different hosts, and to divide high-traffic FSMO roles between physical hosts. I think each environment is different, and I can speak from experience that you do need to be careful where you put your services and applications in the VMware environment. It is quite different having 500 nodes in a 2 DC environment versus 15,000 nodes in an 8 DC environment or larger.

The bottom line is that any virtualization project must take into account the nature of the applications on the host (for instance, are they real-time, such as domain functions, VoIP recording, and network monitoring, or are they services that can be queued, such as SQL and Exchange?) and the saturation of memory, processor, and network on a given host to determine how to deploy effectively. There are published guidelines on determining the load supported by your domain controllers and the subsequent configuration. The point really should be not so much hard and fast rules but the caution that, while VMware and MSVS allow companies to gain economies of scale in a tight economy where dropping money to the bottom line often comes in the form of cost savings and hardware (and with it applicable AMCs) reduction, it is wise to scrutinize your current utilization and try to project your needs for your domain. After all, what good is a domain controller if it cannot respond to requests?

Virtualization is a powerful tool, but often admins, thinking it can solve every problem, overlook the real-time nature of some services and applications and learn some of the limitations through bitter experience. I personally have three virtualized domain controllers, two GCs for HA, and one with the infrastructure master role. This provides great resiliency and quick recoverability, with the caveat that the 3 DCs are on 3 different VM physical hosts, sharing various applications on the back end. I architect my systems very carefully and benchmark each VM before deciding on a deployment strategy. All in all, I would agree with the assessment that virtualization is no longer in its infancy and is able to handle more than in the past, but I still err on the side of caution when planning for capacity and load. Have a great week! windowsmt60@hotmail.com

windowsmt60

I saw this a couple of years ago, and it is well worth the look. It has some nice best practices and will give you some great ideas on troubleshooting. Thanks, Jogu... Remember that if you have a "28 minute drift" as indicated in this document, you should be asking why you are losing that many clock ticks since the last sync, and considering how you should allocate your hosts. Have a great day, all!

windowsmt60

To sscribner's post: the time sync issue would just be an indication that you are not using overall best practices. As for virtualization pushback by engineers who were schooled in the dedicated-hardware realm, I agree also. Once a shop begins adopting, even those who are hesitant, if not resistant, to the idea become more and more eager to adopt... perhaps even when they shouldn't. The key is education and understanding not only your applications, but the VMware environment.

I personally am a huge proponent of virtualization, but I also have the experience to know that applications that are not designed well need to be evaluated and migrated to new solutions, or optimized and recompiled, before they are simply dropped into a VM. The beauty of VM is that you can create a relatively low-cost test bed that mirrors your production environment, allowing you to test the applications and review their resource utilization before you impact all the servers on a VS. ESX manages resources very well, and all but the most intensive real-time applications will run VERY well in a well-designed VMware environment. I have less confidence, based on experience, with MSVS, but I think that the technology is developing at Microsoft and will likely improve over the next couple of versions. Ultimately, virtualization is a great tool and handles most applications very well. windowsmt60@hotmail.com
