A new stream-based data processing project threatens to upend Spark's reign before it begins in earnest.
Big data infrastructure is an embarrassment of riches these days. While we once subsisted on batch-oriented Hadoop, today we have Spark, Storm, Kafka, and a blistering array of incredible tooling for any use case an enterprise could imagine. All of it free. All of it open source.
Among this big data elite is Apache Spark, which has the most active big data development community in the world. Given Spark's prominence, unseating it seems like a quixotic task.
Yet, that is exactly what Concord.io, a distributed stream processing framework built on top of Apache Mesos, hopes to do. Not only does Concord fill in blanks left by Spark (event-based or low-latency streaming) and Storm (difficult to scale), but it also puts a premium on developer efficiency, automating the provisioning and management of servers when scaling applications.
Will it be enough?
Physician, cannibalize thyself
No matter how beloved the open source project, nothing lasts very long in the Darwinian world of rapid, open source innovation. Yes, including Spark.
In fact, a year ago, writing about Spark's meteoric rise, I asked:
The real question, however, is whether Spark will live long enough to realize its promise. Given the frenetic pace of open-source innovation in big data, it's very likely that Spark will give way to an even better system before it finds widespread adoption.
I therefore wasn't too surprised when Shinji Kim, co-founder of Concord Systems and originator of Concord, reached out to me to discuss Spark's heir. A little surprised, perhaps, but unperturbed.
After all, even though Kim could credibly claim that Concord is technically superior to Spark, "best" doesn't always win. And to be clear, developers have a plethora of options for real-time data processing: Spark, Storm, DataTorrent RTS, or even a distributed, in-memory RDBMS like VoltDB. What makes Concord worth a developer's time to learn and apply?
Kim responded by insisting that "Developers looking to build real-time applications that require low-latency and high throughput will care about Concord."
She then called out the technical deficiencies in rival options: "Spark can be used for batch jobs but you can't do event-based or low-latency streaming work. Storm can be used but it has a lot of challenges at scale. In-memory databases could work well to store latest data but you still have to program your application interacting with it."
Fair enough, but still not enough. Developers, after all, have embraced these various technologies in spite of their shortcomings because they offer so much more than MapReduce has given them.
Concord Systems claims, "As an event-based stream processing framework written in C++, Concord runs 10x faster message throughput than open source alternatives like Apache Storm or Spark Streaming, with milliseconds of per-event latency." That's impressive, but Concord's real benefit may be how it improves the lives of developers.
Developers, after all, are the new kingmakers of the enterprise world, and technologies like AWS and MongoDB that make them more productive have tended to dominate. I asked Kim about this, and she explained:
Concord is designed for real-time services and applications rather than moving ETL jobs from batch to streaming. We have high availability support and dynamic operator management, which means you have no downtime when you're deploying, scaling, or updating any parts of your applications. This allows developers to debug and iterate fast, which is usually a hard thing to do in a distributed systems environment.
Fine, but what does a developer sacrifice if she chooses Concord over Storm or Spark? Where is it suboptimal compared to these other offerings? According to Kim, "Currently, Concord does 'at-most-once' processing and it will lose its local state/cache if there's a node or operator failure." However, she goes on, "We're working on integrating with Kafka to support 'at-least-once.'" So this deficiency may be transitory, not a permanent blemish.
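To make the trade-off Kim describes concrete, here is a minimal Python sketch of the two delivery guarantees. It is a toy model under stated assumptions, not Concord's or Kafka's API: the function names, the `crash_on` and `ack_lost_on` parameters, and the sample events are all hypothetical and exist only for illustration.

```python
def run_at_most_once(events, crash_on):
    """Hand each event to the operator a single time (hypothetical model).

    If the operator crashes on an event, that event is lost: nothing replays it.
    """
    processed = []
    for event in events:
        if event in crash_on:
            continue  # crash before processing: the event is dropped for good
        processed.append(event)
    return processed


def run_at_least_once(events, ack_lost_on):
    """Replay each event until the source sees an acknowledgement (hypothetical model).

    If the operator processes an event but the ack is lost, the source sends
    the event again, so the operator may see it more than once.
    """
    processed = []
    for event in events:
        attempt = 0
        acked = False
        while not acked:
            attempt += 1
            processed.append(event)  # the operator does its work
            # Assumption: the ack is lost only on the first attempt
            # for the events listed in ack_lost_on.
            acked = not (event in ack_lost_on and attempt == 1)
    return processed


if __name__ == "__main__":
    events = [10, 20, 30, 40]  # e.g. trade sizes; the true total is 100

    lossy = run_at_most_once(events, crash_on={20})
    dupes = run_at_least_once(events, ack_lost_on={20})

    print(sum(lossy), lossy[-1])  # 80 40  -> total undercounts, latest value intact
    print(sum(dupes), dupes[-1])  # 120 40 -> total overcounts unless deduplicated
```

In this toy run the running total is wrong under both guarantees (too low when an event is dropped, too high when one is replayed without deduplication), while the most recent value survives either way. That is roughly why a workload that only cares about the latest data can tolerate "at-most-once," whereas exact counting cannot.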
As it currently stands, Kim tells me, "The best use cases for Concord is when you prefer performance over having the perfect result." As such, "We'd recommend Concord to run a financial market data processing (continuous VaR model or P&L of a portfolio) or to run a real-time bidding model on programmatic ad exchanges because the latest data has the highest value, and the value decreases over time."
Given the above, it's perhaps not surprising that she advises against using Concord for counting bank transactions or financial reporting.
Even so, this leaves a huge opportunity for Concord and the developers who embrace it. Henry Saputra, member of the Apache Software Foundation and a contributor to Apache Flink, puts it this way: "The idea of event-based processing combined with a distributed router inside each executor to achieve high performance is truly unique in the market."
That unique approach may be enough to give Concord its 15 minutes of big data fame, and perhaps much more.