Forget streaming. Forget the cloud. Etsy goes with the tried-and-true to gain confidence in the data infrastructure behind its business.
Etsy promises to offer one of the hottest IPOs of the year, as its online marketplace business booms. With nearly $2 billion in goods sold last year, and roughly $200 million in revenues, Etsy is a company that seems to be proving that authentic, small-scale manufactured goods can deliver an ever-growing business.
Given the scale at which Etsy's online business operates, one would think the company would invest in distributed NoSQL databases, real-time data processing, cloud infrastructure, and the other accoutrements of today's web giants.
But no. Etsy's data infrastructure is decidedly retro, and that's the way Etsy CTO Kellan Elliott-McCrea and team seem to prefer it. As Elliott-McCrea told me in an interview, Etsy's approach is to build confidence in known tools, master them, and then use them to solve big problems.
This is not the big data you're looking for
Five years ago, Etsy went through a replatforming exercise. As the company struggled to scale and to iterate on its product, the engineering team took a step back. With now-CEO Chad Dickerson and Elliott-McCrea arriving from Flickr, they decided to go with what they knew worked, rather than hope for the best with shiny new technology.
That meant MySQL. Lots of it.
With a PHP stack serving the front-end (replacing a "fully buzzword compliant" service-oriented architecture), Elliott-McCrea and team started to move out of Etsy's semi-monolithic Postgres back-end in favor of a Flickr-esque sharded MySQL data layer that allowed near-infinite horizontal scale.
It also gave Etsy the ability to develop in a more agile, DevOps-friendly way.
While the company did experiment with NoSQL databases, Elliott-McCrea tells me that Etsy's approach "isn't really about any particular technology." Instead, the company favors "a small number of well-known tools" geared toward "long-term operability of the software."
At the time, he continues, NoSQL databases were still very early in their lifecycle. This made them "exciting" in one sense, but Etsy wanted to focus on solving "exciting" business challenges, not merely using particular tools.
Going with NoSQL at the time would have forced the engineering team to spend too much time on learning and hardening the technology, rather than using it to solve "big, hard, interesting problems." For Etsy, the important thing is the end, not the means:
"The actual technical details of a distributed NoSQL store aren't exciting, per se. This wasn't the first time we've built a horizontally scaled infrastructure on MySQL. We had done this before at Flickr and elsewhere. Hence, we built our own distribution layer on top of MySQL, given that we knew it so well."
Against the big data grain
This same approach permeates Etsy's analytics stack. While real-time or streaming data technologies like Apache Spark are all the rage today, Etsy still lives largely in batch-oriented Hadoop.
In fact, according to Elliott-McCrea, as interesting as he finds Spark, the company doesn't currently use it. Etsy is starting to use Kafka to glean operational insights (especially from use of the company's mobile app), but it's a three-year journey, not something the company is trying to do overnight.
As for Hadoop, Etsy doesn't even use that fancy cloud-based Elastic MapReduce kind of Hadoop, but boring old run-in-your-own-datacenter Hadoop.
When I brought up Amazon Web Services (AWS) data science chief Matt Wood and his contention that big data demands an elastic infrastructure, Elliott-McCrea acknowledged the concern but ultimately dismissed it. The reason? Confidence.
Etsy makes a big deal about "gaining confidence." The software they use, and the way in which they use it, is all geared toward helping the team "gain confidence." While Etsy started with EMR, Elliott-McCrea notes that "a year in we felt that we had enough experience with the type of questions that we were going to ask our data that we could lay out our own Hadoop cluster in-house."
Not only did this result in a 10x increase in utilization, but the shift away from the cloud also generated "very real cost savings." By moving in-house, Etsy was better able to democratize its data, giving employees broad access to a wide array of datasets. Given the company's contention that access to data is a requirement for doing good work, this makes sense.
Today, 80% of Etsy employees access its data warehouse on a weekly basis, with greater insight into that data. As for Wood's argument that cloud elasticity encourages big data experimentation, Elliott-McCrea thinks "you can get better experimentation--assuming you have the experience running a data center--if you bring it in-house."
A question of confidence
What comes through in my conversation with Elliott-McCrea is how seriously the company takes confidence.
In addition to the other examples already noted, Elliott-McCrea said Etsy regularly migrates workloads out of Hadoop, a "generalized tool," to Vertica or some other technology once the company "gets to know" a particular workload well, like fraud analysis. Etsy doesn't care about using what's cool, but rather about using what's useful (and optimal) for a given problem.
As he told me, "It's about craftsmanship, about respecting the tools but also mastering the tools. You gain confidence in the tools (through metrics, data, lunch-and-learns, etc.) and over time, the tool becomes invisible." This, then, allows Etsy to focus on serving its community and the larger software industry.
This is one big reason the company is such an active user of and contributor to open-source software. As Elliott-McCrea declares, "I never feel very comfortable with software if I don't have the source." Access to the source lets the company gain confidence in the software it elects to use, both through hands-on experience and by contributing code to make open-source projects even better.
It's an earthy, pragmatic approach to engineering. One that should scale even as the company's business scales.