Apache Kafka is a natural complement to Apache Spark, but it's not the only one. Here's how to figure out what to use as your next-gen messaging bus.
As hotness goes, it's hard to beat Apache Spark. According to a new Syncsort survey, Spark has displaced Hadoop as the most visible and active big data project. Given that Spark makes managing high-velocity data much more straightforward (and, in some cases, possible at all), this isn't surprising.
What is surprising, however, is how fast Apache Kafka is closing in on Spark, its kissing cousin.
According to Redmonk analysis, Kafka "is increasingly in demand for usage in servicing workloads like IoT, among others." This, according to Redmonk analyst Fintan Ryan, has resulted in "a huge uptick in developer interest in, chatter around, and usage of, Kafka."
So, where does Kafka grow from here, and should you use it?
Up and to the right
Batch-oriented data infrastructure was fine in the early days of big data, but as the industry has grown comfortable with streaming data, tools like Hadoop have fallen out of favor. While there will likely always be a place for Hadoop to shine, as Spark takes over, a general-purpose message broker like Kafka starts to make a lot of sense.
As Ryan writes, "With new workloads in areas such as IoT, mobile and gaming generating massive, and ever increasing, streams of data, developers have been looking for a mechanism to easily consume the data in a consistent and coherent manner."
Kafka sits at the front end of streaming data, acting as a messaging system to capture and publish feeds, with Spark (or a similar engine) as the transformation tier that allows data to be "manipulated, enriched and analyzed before it is persisted for use by an application," as MemSQL CEO Eric Frenkiel wrote.
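The capture-then-transform pattern Frenkiel describes can be sketched in miniature. The snippet below is a toy illustration only: `publish` and `transform` are made-up stand-ins for the capture and transformation tiers, not Kafka or Spark APIs, and the in-memory list stands in for a Kafka topic.

```python
# Toy sketch of the Kafka -> Spark pattern: a capture tier appends raw
# events to an append-only log (the role Kafka plays), and a separate
# transformation tier enriches each event before it is persisted for use
# by an application (the role Spark plays). All names are illustrative.

log = []  # stand-in for a Kafka topic: an append-only event log

def publish(event):
    """Capture tier: append a raw event feed entry to the log."""
    log.append(event)

def transform(event):
    """Transformation tier: enrich the raw event before persistence."""
    return {**event, "heavy_user": event["clicks"] > 4}

# Capture raw clickstream events, then enrich and persist them downstream.
publish({"user": "alice", "clicks": 3})
publish({"user": "bob", "clicks": 5})

persisted = [transform(e) for e in log]
```

The point of the separation is that the capture tier never blocks on analysis: Kafka just keeps accepting the fire hose, and the transformation tier reads from the log at its own pace.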
This partnership with popular streaming systems like Spark has resulted in "consistent growth of active users on the Kafka users mailing list, which is just over 260% since July 2014," Ryan notes.
In fact, demand for Kafka is so high right now that it's outpacing even Spark, at least in terms of relative employer demand:
Even if we look instead at absolute job postings, Kafka is on a tear:
(Judging by Google search interest, Hadoop still has the lead, but jobs arguably provide a better measure of adoption.)
Kafka is clearly booming, but should you use it?
When to use Kafka
The answer to that question is, of course, "it depends." The Kafka core development team indicates a few key use cases (messaging, website activity tracking, log aggregation, operational metrics, stream processing), but even with these use cases, something like Apache Storm or RabbitMQ might make more sense.
When trying to determine whether to use Kafka or RabbitMQ, for example, Pivotal's Stuart Charlton summarizes the key reasons to use Kafka: "Use Kafka if you have a fire hose of events (100k+/sec) you need delivered in partitioned order 'at least once' with a mix of online and batch consumers, you want to be able to re-read messages, you can deal with current limitations around node-level HA (or can use trunk code), and/or you don't mind supporting incubator-level software yourself via forums/IRC."
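Charlton's criteria (partitioned order, "at least once" delivery, the ability to re-read messages) all follow from Kafka's design as a commit log: records are retained rather than consumed, and each consumer simply tracks its own offset into a partition, so rewinding is just re-reading from an earlier offset. The sketch below is an in-memory model of that idea, not Kafka's actual API; `PartitionedLog` and its methods are illustrative names.

```python
# Minimal in-memory model of a partitioned commit log with offset-based
# consumption. It illustrates two of Kafka's properties: records with the
# same key land in the same partition (preserving per-key order), and
# reads don't consume records, so consumers can re-read by rewinding.

class PartitionedLog:
    def __init__(self, num_partitions=2):
        self.partitions = [[] for _ in range(num_partitions)]

    def _partition(self, key):
        # Deterministic toy hash: same key always maps to same partition.
        return sum(map(ord, key)) % len(self.partitions)

    def append(self, key, value):
        p = self._partition(key)
        self.partitions[p].append(value)
        return p

    def read(self, key, offset):
        # Reading does not remove records; a consumer "re-reads" simply
        # by passing an earlier offset.
        return self.partitions[self._partition(key)][offset:]

log = PartitionedLog()
for i in range(3):
    log.append("sensor-1", f"reading-{i}")

full_replay = log.read("sensor-1", 0)  # all three readings, in order
reread = log.read("sensor-1", 1)       # rewound partway: the last two
```

Contrast this with a traditional broker like RabbitMQ, where a delivered-and-acknowledged message is gone; with a log, mixing online and batch consumers of the same data is natural.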
That was written back in 2012, and a lot has changed since then (Kafka's robustness, for example). Today, it makes an excellent alternative to traditional messaging brokers like IBM MQ or ActiveMQ, primarily because it's blisteringly fast and scales out exceptionally well. And if you're still wondering whether you should use it, try searching its highly active (and relatively friendly) mailing list.
Because, let's face it, you need to figure it out soon. As the world has gone mobile, it has become mandatory to make that data available and understood in real time. The need for a hyper-fast distributed, partitioned, replicated commit-log service will only grow, making it critical to figure out Kafka now.