The Evolution of Zynga's zCloud: Interview with CTO of Infrastructure, Allan Leinwand

Rick Freedman's extensive interview with Allan Leinwand, the CTO of Infrastructure for Zynga, yields important insights for successfully implementing a hybrid cloud infrastructure on a large scale.

TechRepublic: What are the key risks? What keeps you up at night?

Allan: The key overriding risk was, if we're going to own the base, it better be solid. That really means focus on availability. We spent a good part of 2011 rethinking the whole idea of availability. Most data center people, when they talk about availability, they're talking about the availability of their servers. You're looking at the standard metrics of server uptime, redundancies, and that's what we were doing as well. And our numbers looked pretty good. We were doing everything in a highly-available, industry-standard way. But when we looked at availability in terms of player incidents, we'd see outages that were not within our control. Outages by power providers, outages by facility providers, and we'd discount those from our metrics. Then we woke up and realized that these things, though out of our control, did affect our availability, because any time a player anywhere on the planet can't get to a Zynga game, regardless of the reason, we should be thinking about how to prepare ourselves to better control that.

TechRepublic: Isn't there a risk that a momentary inconvenience, if it hits an entire geography or persists, becomes a "bet the business" moment?

Allan: Absolutely. We shifted our thinking from the conventional data center availability metrics to the bigger question of what, in any circumstance, could prevent any player on any platform or device from getting to your game? We need to make sure that they're social, they're accessible, and they're fun. All of these elements are dependent on the games being as available as possible. So we focused on finding these single points of failure across the infrastructure. How do we add availability to areas that we previously thought weren't possible? How do we add multiple connections to our social network partners? How do we make direct fiber connections to multiple pipes of the public clouds? How do we build redundancies across the entire stack, because we've seen spot failures of specific products from specific vendors? What can we do operationally in terms of multiple uploads to multiple geographies, to add availability even in areas where we may not have a lot of gameplay going on?

We began to focus a lot on automated provisioning. We realized that, if a specific part of the infrastructure becomes unavailable, we need to flex the infrastructure in another part of the public cloud so that players don't see a negative effect on gameplay. Operations that used to take days or weeks have been automated so that they literally take minutes. We knew we had to go beyond the classic approach of sequentially provisioning servers, building out racks, building software stacks, building firewalls, and go, as we say, "all the way in on availability."

We actually saw a large cable ISP lose connectivity, and we knew from our statistics that this ISP sent us a significant number of players. We knew that those players, at that moment, couldn't get to Zynga. We can't go retrench cable around the country, we can't solve it that way; but knowing that our availability took a hit made us rethink what availability means. The key point is that it really changes your mindset when you go from measuring yourself to measuring how the globe gets to you. It's a mindset that you have to instill in the organization that defines what global availability really means to players. It's no longer enough to look at your application and say, "I was 100% available, my servers were up, my application was up, my network was up." When you think about it from the players' perspective, it enforces an entirely different mindset.
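The gap between the two ways of measuring availability can be made concrete with a small calculation. The following Python sketch is illustrative only: the interview does not describe how Zynga computes its metrics, so the `Incident` fields, the weighting by share of affected players, the assumption that incidents do not overlap, and all sample numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Incident:
    cause: str            # illustrative label, e.g. "server", "isp", "power"
    minutes: float        # how long the incident lasted
    player_share: float   # fraction of players unable to reach the game (0..1)
    in_our_control: bool  # True if the operator owns the failing component


def server_centric_availability(incidents, period_minutes):
    """Classic view: count only downtime of components the operator runs."""
    down = sum(i.minutes for i in incidents if i.in_our_control)
    return 1.0 - down / period_minutes


def player_perceived_availability(incidents, period_minutes):
    """Player view: every incident counts, weighted by the share of players hit.

    Assumes incidents do not overlap in time (a simplification).
    """
    lost = sum(i.minutes * i.player_share for i in incidents)
    return 1.0 - lost / period_minutes


# Hypothetical 30-day month with one in-house and two external incidents.
PERIOD = 30 * 24 * 60
incidents = [
    Incident("server", minutes=10, player_share=1.0, in_our_control=True),
    Incident("isp", minutes=120, player_share=0.15, in_our_control=False),
    Incident("power", minutes=45, player_share=0.4, in_our_control=False),
]

print(f"server-centric:   {server_centric_availability(incidents, PERIOD):.5f}")
print(f"player-perceived: {player_perceived_availability(incidents, PERIOD):.5f}")
```

With these made-up numbers the server-centric figure looks better than the player-perceived one, because the ISP and power outages are excluded from the first and counted in the second. That is the shift in perspective described above.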

TechRepublic: Everyone is buzzing about the newly-announced Zynga strategy to let players access non-Zynga games through your platform. Help me understand how that initiative ties in with the infrastructure decisions we've been discussing.

Allan: Internally, zCloud lets us operate as a platform that serves our entire lineup of games and studios worldwide. As we roll out the Zynga Platform, third-party developers will soon be able to leverage the technology that we built for creating and scaling social games. The backend tech that we offer developers to scale their games in the future will certainly leverage the zCloud infrastructure.

TechRepublic: So do you foresee a future scenario in which Zynga owns 100% of this and the public cloud element goes away?

Allan: We love the public cloud. The public cloud is a key tool and a key component of this; it allows us to be flexible and scalable, and so I don't see us breaking from the hybrid model. I love Amazon Web Services, for example; I love having them as the shock absorber that I can use and flex. I love knowing that if I want to move a game or a geography there for a while, I can. I think about AWS as the four-door sedan of infrastructure, and I don't say that negatively. AWS just wasn't built to be a high-performance vehicle for our specific application, but as a general-purpose, highly available vehicle that can be flexed to suit our requirements. Working with AWS allowed us to do the cache analysis, the network analysis, the memory and server usage analysis to build the zCloud high-performance vehicle. We've now learned so much from the opportunity to do that, that for every three servers we use in the AWS cloud, we can replicate those services with one zCloud server.
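To see how the "shock absorber" idea and the three-to-one figure fit together, here is a minimal Python sketch of a capacity split. Only the 3:1 ratio comes from the interview. The function `plan_capacity`, the idea of measuring demand in AWS-server equivalents, and the sample numbers are assumptions made for illustration, not a description of how Zynga actually schedules workloads.

```python
import math

# From the interview: one zCloud server replaces roughly three AWS servers.
ZCLOUD_TO_AWS_RATIO = 3


def plan_capacity(demand_aws_equiv: int, zcloud_servers: int) -> dict:
    """Split demand between owned (private) and rented (public) capacity.

    demand_aws_equiv: workload expressed as the number of AWS servers it needs.
    zcloud_servers:   number of owned zCloud servers available.
    """
    owned_capacity = zcloud_servers * ZCLOUD_TO_AWS_RATIO
    served_privately = min(demand_aws_equiv, owned_capacity)
    burst_to_public = demand_aws_equiv - served_privately  # the "shock absorber"
    return {
        "zcloud_servers_used": math.ceil(served_privately / ZCLOUD_TO_AWS_RATIO),
        "aws_servers_rented": burst_to_public,
    }


# Hypothetical numbers: normal load fits in the private cloud, a spike overflows.
print(plan_capacity(demand_aws_equiv=300, zcloud_servers=250))
print(plan_capacity(demand_aws_equiv=900, zcloud_servers=250))
```

In the first call the owned base absorbs everything; in the second, the portion beyond the owned capacity is rented from the public cloud, which is the hybrid pattern the answer describes.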

The original goal of using the public cloud is to have a general-purpose infrastructure within which you can scale and flex, and that's something we will continue to do. When you reach the scale of understanding your workload, and you're lucky enough to be thinking about owning more versus renting more infrastructure, then you can spend the time to optimize the private cloud for your own requirements. Obviously, Zynga had to. We had reached the scale where we were making investments in owning the base, and when you own the house, you want it to look good. You want to tune it up in the way that best suits your needs. Knowing your app, and its server, storage, network and performance needs, you can tailor the private portion of your cloud to match. That doesn't mean that public cloud doesn't have a place going forward. I flatly reject that; public cloud has a great place. Private cloud has a great place. The ideal model for a lot of organizations is the hybrid cloud.

About

Rick Freedman is the author of three books on IT consulting, including "The IT Consultant." Rick is an independent consultant and trainer, working, through his company Consulting Strategies Inc., to help agile teams and organizations understand agile...

1 comment
dougrichards

Great to see how the mostly inflexible model of cloud building is being challenged by the innovation zCloud is developing.