
How a mix of Cassandra and DC/OS makes massive scale simple

Scalability used to be a secret reserved for web companies, but now mainstream enterprises can scale like Google.


In the world of distributed computing, few datastores can boast the chops of Apache Cassandra. Born inside Facebook data centers eight years ago, Cassandra drives massive-scale data pipelines at companies like Apple and Netflix. It has now jumped the chasm to mainstream enterprises.

But mainstream enterprises still struggle to adopt Cassandra and other modern distributed systems like Apache Kafka and Apache Spark. A typical enterprise lacks the army of engineers and operators that Facebook and Netflix command.

Global 2000 enterprises are just beginning to graduate from the virtual machine "stack," where each application gets its own dedicated cluster, to a new, lighter-weight model of applications and services sharing resources across the entire data center or cloud. This is the uber trend driving containers, microservices, "cloud-native" tools, and the entire alphabet soup of buzzwords in enterprise infrastructure today.


To help solve these problems, DataStax, the provider of database software for cloud applications based on Cassandra, recently aligned with the newly launched DC/OS (Datacenter Operating System), the open source data center platform project led by Mesosphere with more than 60 other launch partners. DataStax and Mesosphere aim to make it easier to install and run Cassandra and other sophisticated systems, providing features such as one-click installation from the app-store-like DC/OS Universe.

I recently spoke with Martin Van Ryswyk, EVP Engineering at DataStax, about how they leverage DC/OS. He described how mastering this new paradigm for running and scaling distributed applications is being driven by big data.

TechRepublic: Why is Cassandra so popular?

Van Ryswyk: Cassandra is known for being an always-on, multi-data center capable database that really stands out in the combo of throughput and low latency performance that scales. Netflix uses us for personalization. Netflix swapped Oracle out for DataStax Enterprise because their previous Oracle infrastructure collapsed when Netflix's volume went up exponentially, and certain outages occurred. We're doing more than a trillion transactions per day with Netflix at single-digit latency.

Apple iTunes, eBay, Spotify—we've had a ton of usage with these major web-based businesses. But now, we're also seeing big banks, Fortune 500 enterprise IT, and more than 500 enterprise customers powering their data infrastructure with Cassandra and DataStax Enterprise.

TechRepublic: Why is multi-data center so important?

Van Ryswyk: Today's applications can tolerate no downtime. Our customers need their dataset to be replicated in a masterless cluster architecture, so that they can have New York, San Francisco, and London all serving up user queries with the same levels of performance. Plus, if your New York data center goes offline, your application won't. Netflix actually lost a whole data center when AWS lost a region, but not a single customer received an error message because of DataStax's multi-data center architecture.


We have a lot of customers that have two private data centers and they'll spin up in Amazon or Azure in another data center in that cluster just so they have a resilient backup plan in case their own facilities have an issue. A couple of years ago that sounded like future IT, but it's mainstreaming very fast and it's a requirement that Cassandra is really uniquely capable of solving for.

TechRepublic: What extra value does DC/OS give to DataStax Enterprise users?

Van Ryswyk: DataStax takes the view that we are agnostic on the underlying infrastructure. We allow customers to spin up nodes—whether virtual machines or containers—and then DataStax Enterprise gets an IP address and handles the installation. So, we really focus on the provisioning with the database in mind and how the database communicates node-to-node.

We're very good at that.

But with microservices architectures, and developers building data pipelines out of many application frameworks, a lot of the infrastructure is evolving beneath the layer where frameworks like DataStax Enterprise run. We really started to see this with the advent of the so-called "SMACK" stack, the very popular combination of Apache Spark, Apache Mesos, Akka, Cassandra, and Kafka. When you have these technologies installed, your dev team can create rich data pipelines and data-driven applications that are built for speed and resilience.

For enterprise users in more traditional industries—banking, retail, you name it—where things get really tricky for them is when they need to install each of these frameworks and figure out how to connect them, but also independently scale them. So DevOps is moving towards the DC/OS platform to make it simple to install, connect, and scale everything, and to deliver intelligence at the infrastructure level in delivering resources—including high availability of both compute and data—to these services.

TechRepublic: Where does DC/OS stand relative to Google Kubernetes, Docker Swarm, and other orchestration solutions that are targeting a similar opportunity?

Van Ryswyk: It's a fairly confusing landscape for enterprises. There are no clean camps. There is a lot of distinct competition among these technologies at the moment. We're actually supporting all of these frameworks in parallel.

But from our point of view, DC/OS is really exciting for the simple reason that it's the only platform that has a two-level scheduler that simplifies the installation and scalability of all of these different frameworks across a shared infrastructure. I mentioned Spark, Akka, Cassandra, and Kafka, and the popularity of the SMACK stack, but there are countless other emerging frameworks out there that each have their own unique snowflake operations concerns.

And DC/OS is the first platform with the specific charter to simplify how enterprises use these technologies, so they don't spend way too much time in the weeds just deploying and scaling them. For a company like DataStax that wants Cassandra to be ubiquitous as the enterprise data layer, being available on DC/OS means that the barriers to using our technology, and any technologies users might want to integrate with it, are lower than ever.
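
For readers unfamiliar with the two-level scheduling Van Ryswyk mentions, the idea can be sketched roughly as follows. This is a simplified illustration, not the Mesos or DC/OS API: the `Offer`, `Framework`, and `run_offer_cycle` names and all resource figures are hypothetical. The cluster level only advertises free resources per node, and each framework decides for itself how many of its tasks to place on an offer.

```python
from dataclasses import dataclass, field


@dataclass
class Offer:
    """A slice of one node's free resources, advertised by the cluster level."""
    node: str
    cpus: float
    mem_gb: float


@dataclass
class Framework:
    """Second level: a framework decides which offers to use for its own tasks."""
    name: str
    task_cpus: float
    task_mem_gb: float
    tasks_wanted: int
    placements: list = field(default_factory=list)

    def consider(self, offer: Offer) -> int:
        """Launch as many still-needed tasks as fit in the offer; return the count."""
        launched = 0
        cpus, mem = offer.cpus, offer.mem_gb
        while (len(self.placements) < self.tasks_wanted
               and cpus >= self.task_cpus and mem >= self.task_mem_gb):
            self.placements.append(offer.node)
            cpus -= self.task_cpus
            mem -= self.task_mem_gb
            launched += 1
        return launched


def run_offer_cycle(free: dict, frameworks: list) -> None:
    """First level: offer each node's remaining resources to frameworks in turn."""
    for node, (cpus, mem) in free.items():
        for fw in frameworks:
            launched = fw.consider(Offer(node, cpus, mem))
            # Deduct whatever the framework claimed before offering to the next one.
            cpus -= launched * fw.task_cpus
            mem -= launched * fw.task_mem_gb
        free[node] = (cpus, mem)


# Hypothetical two-node cluster shared by two frameworks.
cluster = {"node-1": (4.0, 16.0), "node-2": (4.0, 16.0)}
cassandra = Framework("cassandra", task_cpus=2.0, task_mem_gb=8.0, tasks_wanted=3)
spark = Framework("spark", task_cpus=1.0, task_mem_gb=2.0, tasks_wanted=4)
run_offer_cycle(cluster, [cassandra, spark])

print(cassandra.placements)
print(spark.placements)
```

In this toy run the Cassandra framework is offered resources first and claims most of them, and Spark fills what is left on the second node. A real scheduler would also deal with fairness between frameworks, declined offers, and node failures, none of which is modeled here.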


About Matt Asay

Matt Asay is a veteran technology columnist who has written for CNET, ReadWrite, and other tech media. Asay has also held a variety of executive roles with leading mobile and big data software companies.
