Not all data is fit to be streamed. Not yet, anyway. But one big data exec argues that the universe just keeps expanding for streaming data.
When I wrote a recent column about how streaming data increasingly belongs to Apache Kafka, I heard from many readers. One of the biggest objections was that I had been too quick to throw out the baby (Apache Hadoop) with the bathwater.
One of my frequent correspondents is Justin Langseth, founder and CEO of big data visualization startup Zoomdata. He had some interesting insights to share in our exchange over the rise of Kafka and what enterprises need to consider when they decide to embrace streaming data.
You're going to need more batch-bell
TechRepublic: You agree with my premise that the world is moving to live streams of data, but you argue that it's too soon to talk about completely displacing Hadoop and big data lakes of information.
Langseth: All data is originally generated at a point on the "edge" and transmitted in a stream for onward processing and eventual storage. No data is generated "in batch." On the other end, people want to make business decisions based on the most recent data, as well as how it fits into historical context.
So...streams on the left, and streams on the right.
If both the generation and consumption ends require a stream-orientation, more and more people are wondering why anything in the middle needs to be anything other than stream-based as well. Originally, batch was developed because a scribe in a marketplace thousands of years ago ran out of room on their scroll and needed to start a fresh scroll. When computers came around, the only way to transport data was to FedEx it on tapes, and if you're going to do that, it's more efficient to do it with more than one record at a time. So batch was born.
Hadoop represents the last hurrah for batch-oriented processing. There are just too many reasons why it's much easier, faster, and more secure to handle the end-to-end data pipeline as streaming all the way. So if you have valuable business and data processes happening in a batch-oriented system, you clearly shouldn't just throw them away tomorrow. And if those are based on Hadoop, that's fine too. But if you're building something new today, there are strong arguments to avoid any batch steps and just stream end-to-end.
Kafka can't do everything...yet
TechRepublic: Apache Kafka, natively designed for this new world of streaming data, saw a 260% jump in developer popularity last year. What are the pros and cons of Kafka for your customer streaming use cases?
Langseth: Kafka is the de facto architecture to stream data. It has an active community, and it just works. I'm not sure why anyone would use something else unless it's a fully managed service like Kinesis on AWS, and you have everything else dependent on AWS services and are committed to staying there. Then, I guess you could think about it, but still Kafka could win there, too.
There are certainly some things you'll need that Kafka doesn't do, like in-flight data cleansing, joining, aggregating, and so on; but there are tools like Apache Spark Streaming that work together with Kafka for those.
The last thing you'll need that Kafka doesn't do is long-term storage of data, so you'll need a "final resting place" (if you will) for your data. And that should be something reliable, relatively performant, and as cheap as possible, like Amazon S3. That being said, there are people in the Kafka community who have visions of making Kafka do all those things I just mentioned that it doesn't do yet, so stay tuned to that.
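As a rough illustration of the in-flight cleansing and aggregating mentioned above, here is a minimal sketch in plain Python. It is not Kafka or Spark Streaming code; the event shape (timestamp, key, value), the function names, and the assumption that events arrive in timestamp order are all hypothetical choices made for the example.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, Optional, Tuple

# Hypothetical event shape: (timestamp_seconds, key, value).
RawEvent = Tuple[float, str, Optional[float]]
CleanEvent = Tuple[float, str, float]


def clean(events: Iterable[RawEvent]) -> Iterator[CleanEvent]:
    """In-flight cleansing: normalize keys and drop unusable records."""
    for ts, key, value in events:
        key = key.strip().lower()
        if not key or value is None:
            continue  # skip records with no key or no value
        yield ts, key, float(value)


def tumbling_sum(
    events: Iterable[CleanEvent], window_seconds: int
) -> Iterator[Tuple[int, Dict[str, float]]]:
    """Sum values per key in fixed, non-overlapping time windows.

    Assumes events arrive in timestamp order. A window's totals are
    emitted as soon as an event from a later window shows up.
    """
    current_window = None
    totals: Dict[str, float] = defaultdict(float)
    for ts, key, value in events:
        window = int(ts // window_seconds)
        if current_window is not None and window != current_window:
            yield current_window * window_seconds, dict(totals)
            totals.clear()
        current_window = window
        totals[key] += value
    if current_window is not None:
        # Flush the final, still-open window when the input ends.
        yield current_window * window_seconds, dict(totals)
```

Both functions are generators, so records flow through one at a time and nothing is collected into a batch first. Chaining them as `tumbling_sum(clean(source), 60)` would yield one dictionary of per-key totals for each 60-second window.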
Alternatives to Kafka
TechRepublic: What are the alternatives to Kafka for streaming? Are they any good and, if so, why is Kafka getting all the love/growth?
Langseth: Amazon Kinesis, for one, as well as any of the numerous open source projects whose names end with "MQ." Are they any good? Probably, or no one would use them for new projects. Why is Kafka getting the most attention? A combination of critical mass, a passionate community, and underlying architectural soundness. More technical details on Kafka vs. RabbitMQ can be found in this excellent Quora answer on the topic.
The world that Kafka built
TechRepublic: Fresh eyes are required for businesses looking at the potential of real-time streaming data, right? If so, what kinds of new use cases and possibilities do you see once you embrace streaming?
Langseth: One of the most important things that is often overlooked when designing a new streaming data system these days is the format of the payload data. You can design the world's best streaming architecture to transport your data, but if you're not sending a well-thought-through payload, you're missing the point. Ideally, you generate the data cleanly, in a format where fields are clearly defined (JSON or XML), and ideally you send aspects of the schema along with each data packet (example: Avro).
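To make the idea of a self-describing payload concrete, here is a minimal sketch in plain Python using only the standard `json` module. It is not the Avro format and does not use any Avro library; it only borrows the idea of shipping a schema with each record. The schema, field names, type names, and the `encode`/`decode` helpers are all hypothetical.

```python
import json
from typing import Any, Dict, Tuple

# Hypothetical schema for a sensor reading; names and types are illustrative.
READING_SCHEMA: Dict[str, Any] = {
    "name": "SensorReading",
    "fields": [
        {"name": "sensor_id", "type": "string"},
        {"name": "temperature_c", "type": "double"},
        {"name": "recorded_at", "type": "long"},
    ],
}

# Which Python types are accepted for each schema type name.
_PYTHON_TYPES = {"string": (str,), "double": (float, int), "long": (int,)}


def encode(record: Dict[str, Any], schema: Dict[str, Any]) -> str:
    """Validate a record against the schema, then wrap both in one message."""
    field_names = {field["name"] for field in schema["fields"]}
    unexpected = set(record) - field_names
    if unexpected:
        raise ValueError(f"unexpected fields: {sorted(unexpected)}")
    for field in schema["fields"]:
        name, type_name = field["name"], field["type"]
        if name not in record:
            raise ValueError(f"missing field: {name}")
        value = record[name]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, _PYTHON_TYPES[type_name]):
            raise ValueError(f"field {name!r} is not a valid {type_name}")
    return json.dumps({"schema": schema, "record": record}, sort_keys=True)


def decode(message: str) -> Tuple[Dict[str, Any], Dict[str, Any]]:
    """Return (schema, record) so a consumer can interpret the fields."""
    envelope = json.loads(message)
    return envelope["schema"], envelope["record"]
```

A consumer that has never seen this producer can still call `decode` and learn from the embedded schema which fields to expect and what type each one carries. The trade-off in this sketch is size: repeating the schema in every message is wasteful at high volume.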
Also, if you need the newly generated data to quickly be used for analytics or machine learning, look at things like Apache Arrow, which is designed to allow disparate processes to interact with the same in-memory data in real time.
In terms of business value, the simpler the architecture, the easier it will be to build, test, and maintain. And the more quickly and easily new data can be used in analytics and machine learning, the more competitive advantage can be gained in any data-driven business. And gaining competitive advantage is what leveraging streaming data is all about.