How companies are using SQL to unlock Kafka streaming data

Commentary: As companies increasingly try to process streaming data, familiar SQL is taking center stage.

The more enterprises turn to machine learning (ML) and artificial intelligence (AI) to power their businesses, the more they depend on streaming data to keep up. According to a 2019 Lightbend survey of 804 IT professionals, 33% said they're using streaming data for ML/AI, a 5x jump over the 6% who said the same in 2017. As for the primary technology used to manage that stream processing, 48% are using Apache Kafka in production. This represents a huge opportunity for streaming data, generally, and Kafka, specifically, with one caveat: Developers must first learn how to work with Kafka.

I'm not referring to potential difficulties in setting up and managing Kafka, but rather to the inherent difficulty in capturing and processing real-time (or streaming) data. Or as Eventador co-founder Kenny Gorman said in an interview, "People are familiar with Kafka but don't know how to query it." Most data professionals grew up using SQL to query data at rest in a database, but are now having to learn new ways to query streaming data, where a query is essentially a SQL statement that never ends. Matching SQL to Kafka streams is a bit of a holy grail.

Would you like some SQL with your order of Kafka?

Confluent was early in solving this problem with KSQL, a tool that allows for continuous queries of streaming data (as opposed to the one-time queries typical of a database). More recently, Eventador has focused on what it calls "continuous SQL." While Eventador started as a fully managed Kafka service, the company has evolved to focus on enabling developers to more easily query Kafka streams.
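
To make the contrast concrete, here is a minimal Python sketch of issuing a continuous query against a KSQL/ksqlDB server's REST API. The server address, stream name, and column names are hypothetical stand-ins, not anything from Confluent's or Eventador's documentation; the point is that, unlike a database query, the response never finishes on its own.

```python
# A minimal sketch, assuming a ksqlDB server on its default port (8088)
# and a hypothetical "clickstream" stream. Unlike a one-time database
# query, this HTTP response stays open: rows keep arriving as events do.
import requests

KSQL_URL = "http://localhost:8088/query"

payload = {
    # EMIT CHANGES marks this as a push (continuous) query in ksqlDB
    "ksql": "SELECT user_id, action FROM clickstream EMIT CHANGES;",
    "streamsProperties": {"ksql.streams.auto.offset.reset": "earliest"},
}

# stream=True keeps the connection open; results arrive incrementally
with requests.post(KSQL_URL, json=payload, stream=True) as resp:
    resp.raise_for_status()
    for line in resp.iter_lines():
        if line:  # skip keep-alive blank lines
            print(line.decode("utf-8"))  # one result row per line
```

Closing the connection (or breaking out of the loop) is the only way this "query" ends, which is exactly the inversion the continuous model introduces.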

Or, rather, enabling more than just developers to work with streaming data. As Eventador calls out, with continuous SQL "instead of requiring specialized Java and Scala knowledge and the extensive timeline required for deployment, a broader group can inspect and reason about streaming data using SQL." It's all about making the potential inherent in real-time streaming data more accessible and more broadly used within the enterprise.

In the more traditional database world, it's straightforward to capture the latest state of incoming data because that data sits at rest in a database. Streaming data is different, however, and capturing that latest state is much harder. According to Gorman, developers today must write all the code to materialize state from a stream themselves, which, in turn, strips away much of the stream's power. Even so, they've been forced to do it, because without somehow storing the data in a form an application can use, they've been locked out of querying streams from Kafka (or from other tools like Apache Storm).
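
A rough sketch of that hand-rolled approach might look like the following. The topic, field names, and the kafka-python client are my illustrative assumptions, not Eventador's implementation: consume the topic and fold each event into a "latest state" table by hand.

```python
# A rough sketch of hand-rolled materialization: consume a Kafka topic
# and maintain the latest state per key yourself, so an app has
# something queryable. Topic and field names are hypothetical.
# Requires the kafka-python package.
import json
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "user-events",                      # hypothetical topic
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

latest_state = {}  # key -> most recent event: our ad hoc "database"

for message in consumer:
    event = message.value
    # Overwrite the prior state for this key. Every join, window, and
    # aggregation that SQL would express declaratively has to be
    # written out by hand in code like this.
    latest_state[event["user_id"]] = event
```

Multiply that boilerplate across every query an application needs, and the appeal of pushing a SQL statement at the stream instead becomes obvious.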

Continuous SQL promises to democratize access to streaming data across the enterprise.

Data without a base

Though SQL may be a natural complement to Kafka streams, as Gorman put it, making Kafka play nicely with SQL was hardly a simple task. SQL treats data very differently than a streaming platform like Kafka does, as noted in an Eventador blog post:

Continuous SQL should be familiar to anyone who has used SQL with an RDBMS, but it does have some important differences. In a relational database management system (RDBMS), SQL is interpreted and validated, an execution plan is created, a cursor is spawned, results are gathered into that cursor, and then iterated over for a point-in-time picture of the data. This picture is a result set; it has a start and an end….

In contrast, Continuous SQL queries continuously process results to a sink of some type. The SQL statement is interpreted and validated against a schema (the set of tuples). The statement is then executed – the results matching the criteria are continuously returned. Jobs defined in SQL look a lot like regular stream processing jobs – the difference being that they were created using SQL vs. something like Java, Scala, or Python. Data emitted via Continuous SQL are the continuous results – there is a beginning, but no end. A boundless stream of tuples.
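
The distinction is easy to demonstrate. Below is an illustrative Python sketch, using a toy in-memory table and a simulated stream rather than a real Kafka source, contrasting a bounded, point-in-time result set with a boundless iterator of tuples.

```python
# An illustrative sketch of the difference described above: a
# point-in-time RDBMS query yields a bounded result set, while a
# continuous query behaves like an unbounded iterator of tuples.
# The table and stream here are toy stand-ins, not a real Kafka source.
import sqlite3
import itertools
import random
import time

# --- RDBMS style: execute once, gather results into a cursor, done ---
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE orders (id INTEGER, amount REAL)")
db.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 9.99), (2, 4.50)])
rows = db.execute("SELECT id, amount FROM orders").fetchall()
print(rows)  # a bounded snapshot: it has a start and an end

# --- Continuous style: a boundless stream of tuples ---
def order_stream():
    """Simulates tuples arriving forever; a continuous query keeps
    emitting matching results for as long as the stream runs."""
    for order_id in itertools.count(3):
        yield (order_id, round(random.uniform(1, 100), 2))
        time.sleep(0.1)

for row in order_stream():  # there is a beginning, but no end
    print(row)
    break  # break only so this sketch terminates
```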

As Confluent and Eventador (and likely others) align familiar SQL with the still comparatively unfamiliar Kafka, customers are beginning to see the benefits. According to Chris Ferraro, CTO at Digital Assets Data, in an interview with Eventador, "Streaming data is core to our business. Eventador SQLStreamBuilder gave us the capability to ingest complicated feeds at massive scale and perform production-quality, continuous SQL jobs against them. This was...a complete game-changer for us."

A game-changer for Digital Assets Data, and presumably for an ever-growing group of organizations now able to apply familiar tools like SQL to newer technologies like Kafka.

Disclosure: I work for AWS, but nothing herein relates directly or indirectly to my work there.
