Apache Kafka, the open source distributed streaming platform, is making an increasingly vocal claim for stream data "world domination" (to coin Linus Torvald's whimsical initial modest goals with Linux). Last summer I wrote about Kafka and the company behind its enterprise rise, Confluent. Kafka adoption was accelerating as the central platform for managing streaming data in organizations, with production deployments of Kafka claiming six of the top 10 travel companies, seven of the top 10 global banks, eight of the top 10 insurance companies, and nine of the top 10 US telecom companies.
Today, it's used in production by more than a third of the Fortune 500.
But 2016 may be most noted for Kafka joining the "Four Commas Club," a nod to the popular HBO comedy series "Silicon Valley" where character Russ Hanneman is the flashy and obnoxious billionaire investor who is quick to point out to the tormented heroes of the show that it takes three commas to make the number 1,000,000,000. Last year Linkedin, Microsoft, and Netflix all passed the threshold of processing more than one TRILLION messages a day over Kafka. That's four commas: 1,000,000,000,000. That's scale.
I asked Confluent CTO and co-founder Neha Narkhede what was behind these numbers.
Hadoop is too slow
TechRepublic: Kafka is putting up very large numbers, even for the big data world. I think a lot of people saw the technology as a messaging queue that scaled, kind of a scale-out enterprise bus that moved data very fast. Something more seems to be happening here.
Narkhede: My co-founders and I originally created Kafka at Linkedin in 2010 when our own systems ran up against the limits of a monolithic, centralized database. We saw the need for a distributed architecture with microservices that we could scale quickly and robustly. The legacy systems couldn't help us anymore. On one hand, the traditional messaging queues were real-time but didn't scale and, on the other, the ETL systems couldn't handle data in real-time.
We looked deeply into the architecture of existing systems, why it didn't work and combined that with our experience in modern distributed systems to create Kafka. It was built to be real-time, could store data to feed batch systems from the same pipe, and could enable stream processing to make sense of the data in real-time in addition to moving it around. We had the vision of building the entire company's business logic as stream processors that express transformations on streams of data.
In order to do that, you need a highly efficient pipe to move data around, need connectors to existing systems, and need a stream processing layer. That is what we call a complete streaming platform. So though Kafka started off as a very scalable messaging system, it grew to complete our vision of being a distributed streaming platform.
Fitting the needs of modern business
TechRepublic: Are there many enterprises doing real-time stream processing at scale today? A lot of the Fortune 500 is still wondering how to monetize all the data they sucked into their Hadoop clusters in the first place.
Narkhede: It's true that most of the sophisticated backend data processing in enterprises is actually conducted by big batch processes that run on big daily data dumps (the Hadoop people rely on Kafka as the preferred data pipeline to their Hadoop clusters, by the way).
But managing and using the Hadoop ecosystem is very challenging, which has been the main reason companies still struggle with moving their relational DWH workloads over to Hadoop. There's a complicated world of integration, ETL, buses, and queuing that makes all this batch magic happen. It's complicated, expensive, and hard to do.
But as business becomes fully digital, the daily batch cycle of reaction to new data makes less and less sense. For most modern businesses, their core data is continuous, not batch. It's streaming information in real-time of sales, customer experiences, shipments, a system crash, and so on. It's natural that you would want to process and analyze this data continuously and in real-time.
And you want to express most of the business monitoring on Kafka in real-time that was previously run in batch on Hadoop and other batch systems. This is a change we observe at Confluent across various industries from financial services to retail, from gaming to healthcare, and so on. Companies want the ability to create new products, respond to customers, and make business decisions in real-time.
Making money while making friends
TechRepublic: As the creators of Kafka, how do you walk the line of building a profitable business at Confluent around software that is free and open source?
Narkhede: It is a smart way to get a new technology in the hands of developers across the world. It helps you build a pipeline for your product and reduce the cost of sales. Monetizing the technology is easier when the technology plays a critical role in serving your business goals.
This journey starts by realizing that the developer is the new decision maker. If the product experience is tailored to ensure that the developers are successful and the technology plays a critical role in your business, you have the foundational pieces of building a growing and profitable business around an open-source technology. That is the story of Kafka and Confluent's success so far. Kafka is used as the source-of-truth pipeline carrying critical data that businesses rely on for real-time decision-making. Its claim to fame has been a persistent focus on user experience and community development.
Since the early days, we built Kafka to solve a real problem at LinkedIn and deployed it internally before we released it to the community. That has been key for its success as a fast-growing, open source project. Since we started Confluent a little over two years ago, adoption of open source Kafka has accelerated dramatically. Enterprises like to have a company backing the software, especially when they go into production with mission-critical systems that rely on Kafka. They like that we have nine of the 13 most active Kafka committers working for us. As a company, we hire the best Kafka talent and invest in innovating on the open source core and adding more features that then make the software attractive to more companies and users.
We offer a fully open-source version of Confluent: Confluent Open Source that offers a set of tools and software that make it easier to be successful with Kafka in a company. And we also offer Confluent Enterprise that builds on the open-source version and offers various proprietary add-ons to successfully configure, manage and monitor Kafka in production. Our approach is working. We just announced a record year as a young company, with over 700% subscription growth year-over-year.
- How Apache Kafka takes streaming data mainstream (TechRepublic)
- Apache Kafka is booming, but should you use it? (TechRepublic)
- The CIO can't afford ignorance of big data tech (TechRepublic)
- How open source helps startups get a big data boost (TechRepublic)
- Some Hadoop vendors don't understand who their biggest competitor really is (TechRepublic)
Matt Asay is a veteran technology columnist who has written for CNET, ReadWrite, and other tech media. Asay has also held a variety of executive roles with leading mobile and big data software companies.