Given that all the best big data infrastructure is open source, why do enterprises still spend so heavily? According to new Wikibon research, the big data market will approach $40 billion this year and soar to $100 billion within the next 10 years.
And yet, as Gartner analyst Nick Heudecker captures in a customer complaint, "Why am I paying all these vendors for what's effectively open source software?"
In the case of Confluent, the company behind the Apache Kafka technology first developed by LinkedIn, the answer is all about packaging. This strategy—build and promote a popular open source project and then monetize management tooling around it—is now a well-trod path, but seems to be particularly fruitful for Confluent.
Making big bigger
Though "big data" used to be synonymous with Hadoop, it has come to comprise a host of software—nearly all of it open source—that includes things as varied as MongoDB, Apache Spark, and Apache Kafka. Despite the open source nature of much of this software, there's a lot of money to be made (Figure A).
The first step in monetizing this open source bounty, however, is popularity. No one will pay for support, much less for tooling that makes a project more productive to adopt, if that project has minuscule adoption.
This isn't a problem for Apache Kafka, however.
Apache Kafka is already in production in thousands of companies around the world, including more than one-third of the Fortune 500 and the majority of Silicon Valley's tech giants. The reason is simple: Apache Kafka lets companies stop treating data as something static that sits in data warehouses or so-called "data lakes," and instead build on top of real-time data streams that change continuously with their business.
Making old things new
If this sounds disruptive to the old guard of data infrastructure, it is. As Jay Kreps, CEO of Confluent, told me in an interview, the approach Confluent has taken with Kafka means that it can "act as a replacement for a lot of legacy software solutions in enterprise messaging systems, ESBs, complex event processing, data integration, and ETL—all the hard, sticky, expensive stuff that keeps data centers running and keeps companies in business." Nor is it just about modernizing legacy infrastructure: "This shift in architecture can power use cases in microservices, stream processing, and IoT that just weren't possible before."
So Apache Kafka upgrades old tech approaches and, in the process, enables hitherto impossible use cases. Not bad.
This is a big deal, and a rare one. As Kreps said to me:
Kafka and the whole category of stream processing represents a fundamental paradigm shift in how the digital part of a company is built, how data is used, and how applications are built. This is actually a pretty rare thing. Normally, the area of infrastructure software is much more staid. The basic concepts of databases and filesystems just don't change much. Even more recent developments in NoSQL stores, and cloud systems are mostly taking what we already had and making it more scalable.
Apache Kafka does this by becoming the "central nervous system for data," as Kreps styles it. In other words, everything that happens in a company—every customer interaction, every API request, every database change—can be represented as a real-time stream that anything else can tap into, process, or react to.
To understand why this is such a big deal, it's worth considering an analogy to an older communications technology: the telephone. Imagine if the telephone had required each household to build a custom phone line to every person it might want to call, rather than tapping into a central exchange that connects it to everyone.
It sounds ridiculous, but this is more or less the situation with how digital systems and applications are connected in companies. Apache Kafka provides a central streaming platform that acts like the telephone system's central exchange: data streams can be stored, processed, and sent on to any subscribers.
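To make the exchange analogy concrete, here is a minimal Python sketch. It is not Kafka's API: `ToyExchange`, `publish`, `poll`, and the `orders` stream are hypothetical names invented for illustration. It first counts how many connections a company would need with custom point-to-point lines versus a single shared exchange, then shows one stored stream being read independently by two subscribers.

```python
from collections import defaultdict


def point_to_point_links(n_systems: int) -> int:
    """Links needed if every system wires itself directly to every other one."""
    return n_systems * (n_systems - 1) // 2


def central_exchange_links(n_systems: int) -> int:
    """Links needed if every system connects once to a shared exchange."""
    return n_systems


class ToyExchange:
    """Hypothetical in-memory stand-in for a central streaming platform.

    Assumption for illustration only: each stream is an append-only list,
    and each subscriber keeps its own read position on each stream.
    """

    def __init__(self):
        self._streams = defaultdict(list)   # stream name -> stored events
        self._positions = defaultdict(int)  # (subscriber, stream) -> next index

    def publish(self, stream, event):
        """Store the event at the end of the named stream."""
        self._streams[stream].append(event)

    def poll(self, subscriber, stream):
        """Return the events this subscriber has not yet read on the stream."""
        start = self._positions[(subscriber, stream)]
        events = self._streams[stream][start:]
        self._positions[(subscriber, stream)] = start + len(events)
        return events


exchange = ToyExchange()
exchange.publish("orders", {"id": 1, "total": 30})
exchange.publish("orders", {"id": 2, "total": 45})

# Two unrelated subscribers tap the same stored stream.
billing_batch = exchange.poll("billing", "orders")
analytics_batch = exchange.poll("analytics", "orders")
# A second poll by the same subscriber yields nothing new.
billing_again = exchange.poll("billing", "orders")
```

With 10 systems, the direct-wiring approach needs 45 links while the exchange needs 10, and the gap widens quickly as systems are added. Because events are stored rather than handed off once, a new subscriber can be added later without the publisher knowing about it.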
If this sounds magical, well, it is. Or can be. Apache Kafka doesn't come with all the bells and whistles (and baggage) of a traditional messaging system, and it can be rough around the edges. This is where Confluent aims to improve things by packaging open source Apache Kafka along with proprietary add-on functionality that makes it easier to use and fills some of its product gaps. It's one reason Sequoia, Benchmark, and Index Ventures plowed another $50 million into the company to accelerate these efforts, as enterprises look for ways to manage exploding volumes of data.
Matt is currently head of the developer ecosystem at Adobe. The views expressed are his own, not those of his employer.
Matt Asay is a veteran technology columnist who has written for CNET, ReadWrite, and other tech media. Asay has also held a variety of executive roles with leading mobile and big data software companies.