The Evolution of Zynga's zCloud: Interview with CTO of Infrastructure, Allan Leinwand

Rick Freedman's extensive interview with Allan Leinwand, the CTO of Infrastructure for Zynga, yields important insights for successfully implementing a hybrid cloud infrastructure on a large scale.

TechRepublic: IT professionals who are reading this interview are looking at their own world, at their own infrastructures, and they're asking themselves about the complexity of implementing a model like the one you describe. How would you rate the operational complexity of the scenario you describe here, and where are the traps?

Allan: We've got a lot of highly competent technical individuals at Zynga who make this achievable. This is not some suite of software that you can roll out; Zynga spent a lot of time engineering how zCloud is orchestrated and provisioned, and how those provisioning tools map onto and manage Amazon Web Services (AWS). Our over-arching goal has always been to make zCloud and AWS together a functional piece of infrastructure, a hybrid of public and private that we can manage as a single entity. We've spent a lot of time developing tools so that, for example, we can take a set of servers, allocate them to a particular service, provision them in zCloud, and then literally through a drop-down menu, launch them over onto AWS, and vice versa. We wanted to make it as flexible as possible, but we also wanted to isolate ourselves from getting caught up in the operational complexities.

Let me give you some examples of things we did to make this easier to swallow operationally. AWS uses a hypervisor that manages the services you provision through them. We didn't want to build something incompatible with them, so we developed a partnership with a tool provider called RightScale that allowed us to implement the same tools to manage zCloud, so literally, it's RightScale managing both environments. We wanted to make sure that we wouldn't have to recompile images as we moved them back and forth from private to public cloud, so we built something we call simple regions, tightly connected bits of infrastructure where we built our data centers so they are directly hardware-attached, in the single-millisecond range, from us to Amazon and to our partners on the social network side. Amazon has a concept of Availability Zones, where you can have physical buildings within a tightly defined geography. We implemented this idea of Availability Zones in zCloud, and we engineered zCloud Availability Zones and Amazon Availability Zones as one integrated piece of infrastructure, so we didn't have to think about an additional 100 milliseconds from Amazon to zCloud.

We wanted to make sure that the security mechanisms that Amazon was building, like matching blocks of IP addresses, or security measures regarding port mappings, would be implemented similarly in zCloud. We wanted to avoid a situation where we were doing layer 2 security in Amazon and layer 3 security in zCloud. We went through a lot of discussion, and frankly a lot of pain, to make sure that these integrations could be engineered into the player infrastructure. While it was a lot of work for us, moving zCloud from a proof of concept sitting in a lab to a working part of our infrastructure took about six months.
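
As a rough illustration of the point about matching IP blocks and port mappings, the sketch below defines one provider-neutral rule set and attaches the same rules to both environments, so neither side drifts into a different security model. Everything here is hypothetical: the `Rule` type, the `allowed` helper, the address blocks, and the environment names are assumptions made for the example, not Zynga's or Amazon's actual tooling.

```python
from dataclasses import dataclass
from ipaddress import ip_address, ip_network


@dataclass(frozen=True)
class Rule:
    """Hypothetical rule: allow traffic from a source block to one port."""
    cidr: str  # allowed source address block, e.g. "10.0.0.0/8"
    port: int  # allowed destination port


def allowed(rules, source_ip, port):
    """Return True if any rule permits this source address and port."""
    addr = ip_address(source_ip)
    return any(addr in ip_network(r.cidr) and port == r.port for r in rules)


# One rule set, defined once (example values only).
RULES = [Rule("10.0.0.0/8", 443), Rule("192.168.1.0/24", 22)]

# The same rules are attached to both environments, so a request is
# judged identically whether it lands on the private or public side.
ENVIRONMENTS = {"private": RULES, "public": RULES}

for name, rules in ENVIRONMENTS.items():
    print(name, allowed(rules, "10.1.2.3", 443), allowed(rules, "10.1.2.3", 22))
```

Defining the rules once and reusing them is the design choice that matters in this sketch: a change to the rule set applies to both sides at the same time.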

TechRepublic: How much of a challenge was it to find vendors sophisticated enough to bring value to this effort?

Allan: We ended up working with Cloud.com for the orchestration layer. We were lucky because AWS uses Xen, and XenServer was owned by Citrix at the time, and Citrix purchased Cloud.com, so all that worked out rather nicely for us. We had to search for the right hardware vendors that could handle the pace at which we were deploying things.

So, in the more recent period, we've made a strategic decision to, as we say, "own the base and not the spike." We made a number of significant capital investments focused on meeting the needs of our players, and scaling these geographically diverse, highly available regions. By the end of 2011, we'd flipped our usage model so that we had 80% of our daily active users on zCloud and 20% on Amazon. So the headline on that has been that we've gone from 80% public and 20% private to 20% public and 80% private. For us, more important than the headline is the ability to migrate back and forth, to "burst back" if we had to, so that if some celebrity decides to promote our games by getting kicked off an airplane, we can control that traffic. When these sorts of things happen, none of our infrastructure teams sweat. They understand that they have the tools and the flexibility to make it work in whatever circumstances.
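
The "own the base and not the spike" idea can be sketched as simple arithmetic: fill owned (private) capacity first and send only the overflow to the public cloud. This is a minimal sketch under stated assumptions; the `split_load` function and the user counts are invented for illustration and are not figures or code from Zynga.

```python
def split_load(demand: int, private_capacity: int) -> tuple[int, int]:
    """Split demand into (private, public) shares.

    Private capacity is filled first; anything above it bursts to the
    public cloud. Both arguments are hypothetical user counts.
    """
    if demand < 0 or private_capacity < 0:
        raise ValueError("demand and private_capacity must be non-negative")
    private = min(demand, private_capacity)
    public = demand - private
    return private, public


# Ordinary day: demand fits inside the owned base.
print(split_load(1_000_000, 1_200_000))

# Spike day: the base stays on private capacity, the overflow bursts out.
print(split_load(1_500_000, 1_200_000))
```

In this simplified model the public share is zero on an ordinary day. The interview describes keeping roughly 20% of users on Amazon even at baseline, so a closer model would reserve some public share permanently.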

We have the ability to spin up servers very quickly. We can implement thousands of physical servers in under 24 hours. From a pure data center perspective, think about racking and stacking servers, integrating them into the right switching architecture, plugging them into power, network and optics, in the middle of running games, all in less than a day. We've had to do a lot of innovation, discover a lot of new ways of harnessing the industry to orchestrate systems in ways that hadn't been done at this scale before. When we started building zCloud, I don't think anyone had a private cloud anywhere near our scale. That's what we were told by virtually every vendor we partnered with. Our vendors now tell us we're the world's largest hybrid cloud.

About

Rick Freedman is the author of three books on IT consulting, including "The IT Consultant." Rick is an independent consultant and trainer, working, through his company Consulting Strategies Inc., to help agile teams and organizations understand agile...

1 comment
dougrichards

Great to see how the mostly inflexible model of cloud building is being challenged by the innovation zCloud is developing.
