How a VM role works in Windows Azure to accomplish resilient applications

John Joyner describes the VM role in Windows Azure and explains how you can get resilient applications in the cloud without relying on highly fault-tolerant hardware.

The strong points of the public cloud (on-demand self-service, broad network access, resource pooling, rapid elasticity, and measured service) can make powerful business sense when the time comes to decide where to build new infrastructure. Especially for rapid scale-out and short-term projects, there is no substitute for the "instant on/instant off" experience of the public cloud. With proper public cloud product selection, high capital expenditures for new applications (as well as surprisingly under-powered or expensively over-architected platforms) can become worries of the past.

If you want to run your own virtual machine (VM) inside a public cloud today, you have several public cloud providers to choose from. The best-known public clouds are Amazon Web Services, Microsoft Windows Azure, Google Apps, and offerings from Dell, Rackspace, SoftLayer, and others. Each cloud provider has its own technology for managing and interacting with hosted VMs, and some differ considerably from one another in their approach. In particular, Microsoft's Windows Azure VM role takes a different road from most others in this market.

Architecting for the Azure platform: Tolerate node failure

Microsoft did not design Azure with Infrastructure as a Service (IaaS) in mind. (Conventional VM hosting is an IaaS discipline.) Rather, Azure is a Platform as a Service (PaaS) offering: a global platform of SQL, IIS, and .NET services on which a developer can run code, completely abstracted from the physical hosts and virtual machines delivering the application. Azure literally empowers an organization's developers to bypass those irritating IT Pros down the hall and deploy Internet-based applications on a global scale without any on-premises IT assistance.

When "Azure-as-PaaS" deploys code for you, actual dedicated VMs in the Azure fabric are spun up to run each role. These are called "worker role" or "web role" instances depending on whether they need web server services or not. Azure automation copies your web pages and .NET code to the blank VM images. Every instance of the role is identical, and scaling is rapid and effective since the same base VM is reused over and over. Your code is automatically deployed to new role instances.

Windows Azure accomplishes application high availability by forcing the architecture of resilient applications. You must author your Azure application so that the loss of one worker or web role instance is tolerated with little or no interruption in delivery of the overall service; you deploy at least two instances of every role to provide resiliency. The VM in which an Azure role instance runs is a transient environment, and individual instances of roles are expendable in this model. (Your Azure worker and web role instances may talk to an Azure SQL instance or a "storage blob" on the back end, where the persistent data that powers the application does exist.)
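To make the idea of expendable role instances concrete, here is a minimal Python sketch. It is not Azure code: the class and function names (BackEndStore, RoleInstance, dispatch) are hypothetical and only model the pattern described above, in which instances hold no state of their own and any surviving instance can serve a request.

```python
# Hypothetical model of the pattern; none of these names are Azure APIs.

class BackEndStore:
    """Stands in for persistent back-end storage (a database or blob store)."""
    def __init__(self):
        self.orders = []


class RoleInstance:
    """An expendable role instance that keeps no state of its own."""
    def __init__(self, name, store):
        self.name = name
        self.store = store
        self.healthy = True

    def handle(self, order):
        if not self.healthy:
            raise ConnectionError(f"{self.name} is down")
        self.store.orders.append(order)  # persistent data lives in the back end
        return f"{order} handled by {self.name}"


def dispatch(instances, order):
    """Try each instance in turn; any healthy one can serve the request."""
    for instance in instances:
        try:
            return instance.handle(order)
        except ConnectionError:
            continue
    raise RuntimeError("no healthy role instance available")


store = BackEndStore()
instances = [RoleInstance("web_0", store), RoleInstance("web_1", store)]

first = dispatch(instances, "order-1")
instances[0].healthy = False             # simulate losing one instance
second = dispatch(instances, "order-2")  # the surviving instance takes over

print(first)
print(second)
print(store.orders)
```

Because neither instance holds the order data, losing web_0 costs nothing except capacity. That is the design choice the platform pushes you toward.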

This methodology of achieving application redundancy is uncomfortable for most IT Pros to hear about the first time. Architecting high availability of applications by depending on high availability of specific hardware components (like servers and Storage Area Networks, or SANs) is frankly the bread-and-butter of many network architects. An economic argument might emerge at a certain scale of operations: It costs less to re-architect applications for cloud environments that scale easily and tolerate hardware failure than it costs to provision an ever-growing physical plant of highly fault-tolerant hardware for existing applications.

What it means to run in a VM in a transient environment

Somewhere along the line IT Pros got a peek at the rich global Azure VM fabric and said, "I want a piece of that too." Microsoft's initial response to the community was that these VMs don't have persistent storage, so of what use could they be in infrastructure computing? Yet it turns out that some traditional infrastructure roles, with a little creative programming, can thrive in this environment. Demand for this new type of VM, which can play in a world of application resiliency, caused Microsoft to add the "VM role" to the Azure platform.

An Azure VM role instance is fundamentally an exposed Azure worker/web role instance that you can configure and run as a conventional VM. A very important thing to understand about Microsoft's implementation of the Azure VM role is that there is no persistent storage across VM role restarts. A VM role restart (not the same as a reboot of a VM) rolls a VM back to its base image. A VM role restart occurs either on command in the Azure console, or automatically when there is a hardware failure or other unstable condition on a particular Azure host. Whenever a VM role is restarted, a new, empty differencing disk is created on the Azure host and a brand new VM instance spins up.
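The difference between a reboot and a VM role restart can be illustrated with a toy model. The Python sketch below is an assumption-laden simplification, not how Azure implements disks: it represents the base image and the differencing disk as dictionaries, and the class VMRoleInstance and its methods are hypothetical names chosen for illustration.

```python
from types import MappingProxyType


class VMRoleInstance:
    """Toy model: a read-only base image overlaid by a differencing disk."""

    def __init__(self, base_image):
        # MappingProxyType gives a read-only view, like the read-only base image.
        self.base = MappingProxyType(dict(base_image))
        self.diff = {}  # every change made after boot lands here

    def write(self, path, data):
        self.diff[path] = data

    def read(self, path):
        # The differencing disk takes precedence over the base image.
        if path in self.diff:
            return self.diff[path]
        return self.base.get(path)

    def reboot(self):
        """An OS reboot inside the VM: the differencing disk is untouched."""

    def role_restart(self):
        """A VM role restart: roll back to the base image with an empty disk."""
        self.diff = {}


vm = VMRoleInstance({"app/version.txt": "1.0"})
vm.write("app/settings.ini", "debug=true")

vm.reboot()
after_reboot = vm.read("app/settings.ini")    # change is still there

vm.role_restart()
after_restart = vm.read("app/settings.ini")   # change is gone

print(after_reboot, after_restart, vm.read("app/version.txt"))
```

In this model, anything that must survive a role restart has to be baked into the base image or stored somewhere outside the instance.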

Figure A - How an image is deployed to Windows Azure and applied to create VM role instances.
Figure A is a diagram showing how two VM instances, each composed of a read-only base image and an ephemeral (differencing) disk, comprise a complete Azure VM role. At a high level, the Azure VM role works like this:

  1. You purchase an Azure subscription and provision one or more VM role instances. You create "service package" and "service configuration" files that define your VM role. Uploading these files creates a logical container in Azure to host your VM role.
  2. You upload a read-only server VHD image. This is called the base VHD. This VHD is transparently mounted to your VM role instances and they boot from this base image. The Azure interface makes this easy to do, regardless of how many instances you have and how many global data centers you want to run them in.
  3. Every change made to the server after boot is written to a second VHD, known as a differencing or ephemeral disk. The differencing disk for a given Azure VM role instance only exists on the physical host where the VM is running in the Azure cloud.
  4. Azure handles application high availability by always having at least one other instance of your VM role running on another Azure host.
  5. When there is an Azure host failure, the differencing disk associated with a particular VM instance is lost. The application stays up because the surviving VM role(s) are still running on other Azure host(s).
  6. A new instance of your VM role is generated on an Azure host. The new instance may have a random computer name, but will automatically get the correct networking and DNS information needed to access the VM.
  7. Once the automatic VM redeployment is complete, the VM can run scripted procedures to finalize its configuration automatically, or you can remotely connect to the VM using Remote Desktop Protocol (RDP) and complete its configuration in a conventional manner.
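Steps 4 through 7 above can be sketched as a small simulation. This is a rough illustration under stated assumptions, not Azure's fabric logic: the functions new_instance, configure, and fail_host, the host names, and the dictionary layout are all hypothetical. The sketch assumes a spare host is available and that a startup script finalizes each new instance.

```python
import uuid


def new_instance(host, startup_script):
    """Create a fresh instance from the base image on the given host."""
    instance = {
        "name": "vm-" + uuid.uuid4().hex[:8],  # random computer name
        "host": host,
        "diff_disk": {},                       # new, empty differencing disk
    }
    startup_script(instance)                   # scripted final configuration
    return instance


def configure(instance):
    """Hypothetical startup script that finalizes the instance's setup."""
    instance["diff_disk"]["configured"] = True


def fail_host(instances, failed_host, spare_hosts, startup_script):
    """Drop instances on the failed host and redeploy each one elsewhere."""
    survivors = [i for i in instances if i["host"] != failed_host]
    lost = len(instances) - len(survivors)
    for _ in range(lost):
        survivors.append(new_instance(spare_hosts.pop(0), startup_script))
    return survivors


instances = [
    new_instance("host_a", configure),
    new_instance("host_b", configure),
]
# A manual change made after boot lives only on host_a's differencing disk.
instances[0]["diff_disk"]["manual_tweak"] = "set over RDP"

after = fail_host(instances, "host_a", ["host_c"], configure)

print([i["host"] for i in after])
print(after[1]["diff_disk"])
```

The replacement instance comes up configured by the script, but the manual change made on the failed host is gone. That is why scripted configuration is the safer of the two options in step 7.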

About John Joyner

John Joyner, MCSE, CMSP, MVP Cloud and Datacenter Management, is senior architect at ClearPointe, a cloud provider of systems management services. He is co-author of the "System Center Operations Manager: Unleashed" book series from Sams Publishing, ...
