Streaming data is a big deal. Too bad that it's not easy.
Even as some keep insisting that batch-oriented MapReduce has a key part to play for years to come, the shift to streaming data is well underway. According to Gartner analyst Merv Adrian, reporting from the big data confab Strata, "Everywhere I look, it's evident the community's pivot to data in motion is underway."
One company pushing this streaming future forward is MemSQL, creator of an in-memory database and the primary committer to Streamliner, a new open-source tool meant to facilitate the deployment of real-time pipelines with Spark. To delve into this shift toward streaming data, I caught up with Ankur Goyal, MemSQL's vice president of engineering.
Primary drivers of real-time data
According to Goyal, there are two primary drivers of streaming or real-time data.
The first is simply the data deluge we're all swimming in. The explosion in data volume means that every minute, hour, and day, you're generating more data.
Legacy systems simply can't keep up. For example, Goyal notes, so much data can arrive in an hour that it takes more than an hour to load, and the system falls perpetually further behind. The classic definition of real time requires finishing each task fast enough that you're ready to work when the next one arrives.
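The arithmetic behind "perpetually further behind" can be sketched in a few lines. This is only an illustration: the function name `backlog_after` and the figures (120 GB arriving per hour, a loader that handles 100 GB per hour) are hypothetical and do not come from Goyal or MemSQL.

```python
def backlog_after(hours, arrival_gb_per_hour, load_gb_per_hour):
    """Return the unloaded backlog (in GB) at the end of each hour."""
    backlog = 0.0
    history = []
    for _ in range(hours):
        backlog += arrival_gb_per_hour             # data generated this hour
        backlog -= min(backlog, load_gb_per_hour)  # data the loader clears this hour
        history.append(backlog)
    return history


# Hypothetical figures: the loader is slower than the arrival rate.
falling_behind = backlog_after(5, arrival_gb_per_hour=120, load_gb_per_hour=100)
# Same loop with the rates swapped: the loader keeps up.
keeping_up = backlog_after(5, arrival_gb_per_hour=100, load_gb_per_hour=120)

print(falling_behind)  # backlog grows by 20 GB every hour
print(keeping_up)      # backlog stays at zero
```

Whenever the load rate is below the arrival rate, the backlog grows without bound; once the load rate meets or exceeds it, each hour's data is cleared before the next hour's arrives, which is the "ready when the next one arrives" condition.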
The second driver is that the industry has spent the past several years building the foundation for data processing as a competitive edge. It's now standard practice for enterprises to have data warehouses, run analytics, and derive value from data.
As a result, it's no longer enough to be data-driven on historical data. As Goyal points out, with real-time technology, you can ask harder questions, more often, and on the most valuable data. It's quickly becoming the new competitive edge in the data space.
Which brings us to Spark.
Distributed computation becomes a first-class citizen
Most code written around database and data warehouse infrastructure assumed that the system was running on a single node. ETL tools are a great example—most batch processing tools in the enterprise can use multiple cores but not multiple nodes.
Spark is a new programming paradigm that treats distributed computation as a first-class citizen. That matters because, for modern workloads, distributed processing is not merely convenient; it is a requirement.
For example, large companies want to stream more than 1 GB per second through a cluster on AWS. As Goyal insists, however, organizations can't do that with any single-node bottlenecks, so every piece of code in the pipeline must be distributed.
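A rough sketch of why one single-node stage is enough to sink the whole pipeline: end-to-end throughput is capped by the slowest stage. The per-node figure below (0.125 GB/s, which is what a 1 Gbit/s network link can carry) and the three-stage layout are assumptions chosen for illustration, not measurements from the article.

```python
import math


def min_nodes(stream_gb_per_s, per_node_gb_per_s):
    """Smallest node count whose combined capacity covers the stream."""
    return math.ceil(stream_gb_per_s / per_node_gb_per_s)


def pipeline_throughput(stage_capacities_gb_per_s):
    """A pipeline moves data no faster than its slowest stage."""
    return min(stage_capacities_gb_per_s)


PER_NODE = 0.125  # assumed per-node capacity in GB/s (a 1 Gbit/s link)

nodes_needed = min_nodes(1.0, PER_NODE)

# Hypothetical pipeline: ingest and load are spread over 8 nodes each,
# but the transform step still runs on one node.
stages = [8 * PER_NODE, 1 * PER_NODE, 8 * PER_NODE]
bottlenecked = pipeline_throughput(stages)

print(nodes_needed)   # nodes required to carry 1 GB/s under this assumption
print(bottlenecked)   # GB/s the pipeline actually sustains
```

Under these assumptions, distributing two of the three stages buys nothing: the pipeline still runs at the speed of the one undistributed stage.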
Spark, with its rich ecosystem of libraries and applications, is a strong fit for running in real time alongside a distributed database. But getting data from Spark into a database like MemSQL requires a little help, which is why MemSQL just released Streamliner: to make it easier to stream data from Spark into MemSQL. (Other databases, like Cassandra, have their own ways of connecting to Spark.)
Users want to treat streaming data with the same flexibility they get with data warehouses. Loading data into a datastore like MemSQL gives them the flexibility to query it on the fly and in production, against live and historical data.
The future of streaming data
This is what the near-term future of big data looks like.
When I asked Goyal what lessons he has learned from working with companies building real-time data pipelines, he cited two principles:
- The data pipeline must be distributed all the way through. This isn't just for the oft-cited "high availability" reason, but because there's so much data flowing that no single disk/CPU/network pipe is big enough to handle a modern data stream.
- Users want the ability to play with data ad-hoc and run flexible SQL queries. Real-time stream processing is all about giving enterprises the same flexibility they get with a data warehouse but on real-time data and with real-time queries.
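The second principle, running the same ad hoc SQL over data that is still arriving, can be illustrated with a toy example. This sketch uses Python's built-in SQLite module purely as a stand-in; it is not MemSQL, Spark, or Streamliner, and the `events` table, the `ingest` helper, and the sample rows are all hypothetical.

```python
import sqlite3

# In-memory SQLite database standing in for a real-time datastore.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (ts INTEGER, page TEXT, views INTEGER)")


def ingest(batch):
    """Append a batch of newly arrived rows and commit them."""
    with conn:
        conn.executemany("INSERT INTO events VALUES (?, ?, ?)", batch)


def views_by_page():
    """Ad hoc aggregate over everything loaded so far, old and new."""
    cursor = conn.execute(
        "SELECT page, SUM(views) FROM events GROUP BY page ORDER BY page"
    )
    return cursor.fetchall()


# "Historical" rows already in the store.
ingest([(1, "home", 10), (2, "pricing", 4), (3, "home", 7)])
before = views_by_page()

# A fresh batch arrives; the same query now covers live and historical data.
ingest([(4, "home", 5), (5, "docs", 2)])
after = views_by_page()

print(before)
print(after)
```

The point of the sketch is that the query does not change between the two calls; only the data underneath it does.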
As potent as the current state of streaming data is, Goyal suggests it will only get better. As he predicts, "Coming next are computations beyond what you can express in SQL, done on the fly in the database, and enabled by Spark." Nor is this something we'll wait 10 years to see. In his words, "These new patterns of computations are going to become part of the standard BI toolchain quickly."
Matt Asay is a veteran technology columnist who has written for CNET, ReadWrite, and other tech media. Asay has also held a variety of executive roles with leading mobile and big data software companies.