Google may be able to make a monolithic system work at scale, but you’re not Google. And, to get to Google-like scale, you’re going to have to transition to a microservices architecture, or you’ll crash and burn.
That’s one lesson I learned from catching up with Eric Bowman, a master at scaling massive e-commerce platforms. I last spoke to him two years ago, after he had led the re-architecture of Gilt’s website so that it could handle its huge scaling challenges and daily traffic spikes.
Today, he runs engineering at Zalando, the German e-commerce giant that employs over 10,000 people and ships more than 1,500 fashion brands to customers in 15 European countries, generating $3.43 billion in revenue last year.
Bowman leads an engineering team of over 700 people, and I talked to him about the lessons he has learned in moving huge online businesses in production from a legacy architecture to a modern Reactive Platform. The answer, it turns out, comes down to thinking differently about architecture and opening up code.
Moving to microservices without shutting down
Even if you believe cloud luminaries like Adrian Cockcroft and their insistence on microservices architectures, it can be hard to get there. When Bowman joined Zalando, he needed to find a way to shift from its monolithic Java architecture to an agile, microservices-driven architecture while keeping the Zalando site up and running.
It didn’t happen overnight.
In Bowman’s words, “Re-architecting like this is an 18-to-24-month job.” At Zalando, the site was originally built on Magento using PHP, and at one time was the biggest Magento site in the world. After bumping into serious scaling walls, the team opted to spend a few months rewriting the entire stack in Java, making some unorthodox decisions along the way, like keeping business logic in stored procedures in PostgreSQL and using SOAP.
Both were controversial decisions at the time, but also incredibly smart…for a while. Eventually, these technologies led to systems that were hard to evolve, and neither is great as teams scale. At a certain point, Bowman told me, “It felt like we couldn’t update anything without updating everything. Some companies can make a monolithic system work at scale, but it’s expensive, and we’re not Google or Facebook. Yet.”
One way Bowman tackled this problem was by dropping SOAP in favor of REST as part of the move to a microservices-driven architecture, something Zalando has embraced in earnest. REST allows APIs to evolve without breaking their clients, which proves to be a massive source of leverage.
However, as Bowman stresses, “People get caught up in discussions over whether something is ‘truly RESTful’ and lose sight of why REST is so important: It was invented so the internet could be upgraded without breaking it, and any large enterprise can benefit from embracing these ideas in order to build resilient systems.”
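To make the idea of evolving an API without breaking it concrete, here is a minimal sketch of a client that reads only the fields it needs and ignores anything else. The `Order` type, the field names, and both payloads are hypothetical and are not taken from Zalando's APIs; the point is only that a server can add fields later without forcing this client to change.

```python
import json
from dataclasses import dataclass


@dataclass
class Order:
    # Hypothetical fields, chosen only for illustration.
    order_id: str
    total_cents: int
    currency: str = "EUR"  # fallback when an older server omits the field


def parse_order(payload: str) -> Order:
    """Read only the fields this client needs and ignore unknown ones."""
    data = json.loads(payload)
    return Order(
        order_id=data["order_id"],
        total_cents=data["total_cents"],
        currency=data.get("currency", "EUR"),
    )


# Response from a hypothetical first version of the service.
v1_response = '{"order_id": "A-1", "total_cents": 4999}'

# The service later adds fields; the unchanged client still parses the response.
v2_response = (
    '{"order_id": "A-1", "total_cents": 4999, '
    '"currency": "EUR", "loyalty_points": 12}'
)

assert parse_order(v1_response) == parse_order(v2_response)
```

The compatibility here depends on the server only adding optional fields. Renaming or removing `order_id` would still break the client, which is why additive changes are the usual rule for APIs that have to keep evolving.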
Critically, Zalando has adopted an “API First” approach based on OpenAPI, so that its APIs are decoupled from their implementations, all follow the same style, and benefit from peer review. The combination of API First and REST (done well) results in highly stable interactions between systems that are constantly evolving, Bowman stresses.
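As a rough illustration of what “API First” can look like in practice, the sketch below writes an OpenAPI-style contract before any handler exists and then checks a handler's response against it. The spec contents, the `get_order` handler, and the `conforms` helper are all hypothetical, and the check covers only required keys and primitive types. It is not a real OpenAPI validator and does not reflect Zalando's actual tooling.

```python
# Hypothetical contract, written and peer reviewed before any code exists.
SPEC = {
    "openapi": "3.0.3",
    "info": {"title": "Orders API (hypothetical)", "version": "1.0.0"},
    "paths": {
        "/orders/{order_id}": {
            "get": {
                "responses": {
                    "200": {
                        "description": "A single order",
                        "content": {
                            "application/json": {
                                "schema": {
                                    "type": "object",
                                    "required": ["order_id", "total_cents"],
                                    "properties": {
                                        "order_id": {"type": "string"},
                                        "total_cents": {"type": "integer"},
                                    },
                                }
                            }
                        },
                    }
                }
            }
        }
    },
}

# Maps the few schema types used above to Python types.
JSON_TYPES = {"string": str, "integer": int, "object": dict}


def response_schema(spec, path, method, status):
    """Look up the JSON response schema declared for one operation."""
    operation = spec["paths"][path][method]
    return operation["responses"][status]["content"]["application/json"]["schema"]


def conforms(body, schema):
    """Check required keys and primitive types only (a deliberately tiny subset)."""
    if not isinstance(body, dict):
        return False
    if any(key not in body for key in schema.get("required", [])):
        return False
    for key, prop in schema.get("properties", {}).items():
        if key in body and not isinstance(body[key], JSON_TYPES[prop["type"]]):
            return False
    return True


def get_order(order_id):
    # Hypothetical implementation; it can change freely as long as it honors SPEC.
    return {"order_id": order_id, "total_cents": 4999}


schema = response_schema(SPEC, "/orders/{order_id}", "get", "200")
assert conforms(get_order("A-1"), schema)
assert not conforms({"order_id": "A-1"}, schema)
```

Because the contract lives apart from `get_order`, the implementation can be rewritten in another language or split into several services while clients keep relying on the same reviewed document.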
“We also build in a SaaS style, which means that all the services we create are on the open internet and built to be used by other companies in addition to ourselves,” Bowman said.
Part of this “SaaS style” is a hefty reliance on the cloud, “in part because it has such a subtle, profound effect on how engineers build systems.” He continued:
“For companies still using data centers, there is this terrible anti-pattern that emerges when it takes weeks to get new hardware live to run new systems. This has a horrible effect on complexity over time. In a nutshell, it applies subtle design pressure to keep adding complexity to systems that are already deployed, and over months or years the result is systems that are much harder to evolve than they should be.”
Bowman calls this combination of organization and architecture “Radical Agility,” which he launched in March 2015. Among other things, he noted, “this openness has let us open up our tech stack to new technologies and enabled languages like Scala, Clojure, and Go to gain a foothold.”
Enabling this Radical Agility has been Lightbend, the company behind some of the key technologies Zalando has used to engineer this approach, leading to a wealth of open source activity around Scala, Akka, and Play, including projects like play-swagger, beard, and scarl.
The importance of being open
Another fundamental tenet of Zalando’s newfound agility is open source. I asked Bowman what role open source plays in Zalando engineering:
“When you’re dealing with money, and dealing in high availability, you have to be able to go down to the metal to diagnose problems,” Bowman said. “Having anything mission critical without source code is inconceivable to me.”
Though he grudgingly admitted that “AWS is the exception” to this rule, he continued, “As an industry that’s changing the world through what we build, at a scale previously inconceivable, we are shockingly bad at codifying software principles,” leading to “a huge amount of incidental and accidental variation.” In this messiness, “Lots of eyes and brains looking together at hard problems is tremendously valuable.”
That value lies not only in uncovering problems but also in accelerating innovation.
This open, agile approach leads Bowman to conclude that companies that want to scale must “be prepared to sacrifice your architecture every two orders of magnitude of growth,” because, “What works great today won’t work when the traffic is 100x.” That’s just the way it is, or rather how it can be, with a Radical Agility approach to scale.