Lack of standards threatens to derail the big data innovation train

Innovation is flourishing in real-time big data technologies, but that could also create problems, says the Reactive movement's Jonas Bonér.

Big data means big competition.

As vendors at every layer of the "stack" vie for their place in the evolving big data architecture, one of the busiest battlegrounds so far may be the back-end data movement and logic between systems. For many, the "big" in big data is less about overall volume of data and much more about the need for speed in shuffling data around in real time.

In turn, this has driven a shift to Apache Spark from MapReduce as enterprises look beyond batch processing to streaming.

To get an update on the latest technology trends driving the streaming evolution, I spoke with distributed computing expert Jonas Bonér, co-founder and CTO at Typesafe, creator of the Akka project, and co-author of the Reactive Manifesto.

TechRepublic: What's new about the changes underway in the back-end systems that shuffle data around?

Bonér: The fundamental shift is that we've moved from "data at rest" to "data in motion." The data used to be offline, and now it's online.

The first wave of big data was still "data at rest." You stored massive amounts in HDFS or similar, and then you had offline batch processes crunching the data overnight, often with hours of latency.

In the second wave, the need to react in real time to "data in motion" became more and more important: capturing the live data, processing it, and feeding the result back into the running system, with response times of seconds and sometimes even less than a second.

This opened the door to hybrid architectures like the Lambda Architecture, which has two layers, a "speed layer" and a "batch layer," with the results of real-time processing in the speed layer later merged back into the batch layer. It addressed some of the immediate need to react quickly to (at least a subset of) the data.

But it also added needless complexity by forcing you to maintain two independent models and data processing pipelines for your data, and to merge their results in the end.
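A minimal Python sketch of that duplication, under stated assumptions: the event shape, the page-count query, and the function names `batch_view`, `speed_view`, and `merged_view` are all hypothetical, and the merge is done at query time for simplicity. It only illustrates that the same logic has to be written and maintained twice, once per layer, before the two results can be combined.

```python
from collections import Counter

def batch_view(events):
    """Batch layer (hypothetical): recompute page counts from the full historical log."""
    return Counter(event["page"] for event in events)

def speed_view(recent_events):
    """Speed layer (hypothetical): the same counting logic, maintained separately
    and updated incrementally for events the batch job has not seen yet."""
    view = Counter()
    for event in recent_events:
        view[event["page"]] += 1
    return view

def merged_view(batch, speed):
    """Merge step: combine both views to answer a query over all of the data."""
    return batch + speed

# Illustrative data only.
historical = [{"page": "/home"}, {"page": "/home"}, {"page": "/docs"}]
recent = [{"page": "/home"}, {"page": "/pricing"}]

result = merged_view(batch_view(historical), speed_view(recent))
print(dict(result))
```

Any change to the counting rule has to be made in both `batch_view` and `speed_view`, which is the maintenance cost described above.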

I believe that the third wave, which we have already started to see, is to fully embrace "data in motion" and, for most use cases and data sizes, to move away from the traditional batch-oriented architecture altogether toward a pure stream-processing architecture.

This shift requires new tools and techniques and often increased hardware costs, but the tools are here, and the result is systems that are more responsive and adaptive, more resilient and scalable, but with reduced complexity, making them easier to maintain and understand. Such benefits are, in my opinion, essential in today's competitive business landscape and, in most cases, worth the potential additional costs (and we all know that hardware is getting cheaper every year).

Another big change that we see today is that applications themselves must handle the logic for piping streaming data in and out, and that's the opportunity that we are targeting with our platform at Typesafe: for applications to be truly Reactive within streaming architectures.

This is critical because today, handling streaming within applications is fraught with contention, single points of failure, system overload, and unpredictable performance guarantees.

TechRepublic: What are the big benefits to be gained at the application layer from streaming architectures?

Bonér: Now, we actually have the tools available that allow us to react quickly to changes in user patterns and changes in how the application is being used.

This can be business-critical data, but it can also be streams of health checks, giving hints on how the application is coping with load. This is the goldmine, allowing you to directly feed all of that data back into the system and have a continuous feedback loop where the system can adapt to users and performance patterns.

It's not so much the amount of data—big is not that interesting—it's being able to react quickly and adapt your system to changes in user, business, and performance data.

TechRepublic: Tell us a little bit about Akka Streams and the advances that Typesafe has made there to simplify managing disparate data streams within an application.

Bonér: Akka Streams allows developers to define stream processing graphs as staged computations, so-called "blueprints"; these are objects that can be stored away, composed, and reused.

It decouples what should be done from how it's done by allowing the developer to run ("materialize") the "blueprint" or schedule it for later execution by one of the runners/execution engines, which can be single threaded, parallel, or distributed.

This gives an application a workflow for participating in a streaming architecture: capturing data from different stream endpoints, producing new outputs, and stringing these streams together as graphs, like LEGO blocks. It provides a rich set of predefined stream transformation stages that handle the splitting, transforming, and merging tasks that are common to streaming.
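The separation between describing a pipeline and running it can be sketched in a few lines of plain Python. This is not the Akka Streams API: the `Blueprint` class and its `map`, `filter`, `then`, and `run` methods are hypothetical names chosen for illustration, and the only "runner" shown is a single-threaded one.

```python
from dataclasses import dataclass
from typing import Callable, Iterator, Tuple

Stage = Callable[[Iterator], Iterator]

@dataclass(frozen=True)
class Blueprint:
    """An immutable description of a pipeline; nothing runs until `run` is called."""
    stages: Tuple[Stage, ...] = ()

    def map(self, fn):
        # Returns a new blueprint with a transformation stage appended.
        return Blueprint(self.stages + (lambda it: (fn(x) for x in it),))

    def filter(self, pred):
        # Returns a new blueprint with a filtering stage appended.
        return Blueprint(self.stages + (lambda it: (x for x in it if pred(x)),))

    def then(self, other):
        # Composes two blueprints into a longer one.
        return Blueprint(self.stages + other.stages)

    def run(self, source):
        # "Materialize" the blueprint against a concrete source (single-threaded).
        it = iter(source)
        for stage in self.stages:
            it = stage(it)
        return list(it)

# Two small, reusable blueprints composed into a third.
evens = Blueprint().filter(lambda x: x % 2 == 0)
squares = Blueprint().map(lambda x: x * x)
pipeline = evens.then(squares)

# The same blueprint can be run more than once, against different sources.
first = pipeline.run(range(6))        # [0, 4, 16]
second = pipeline.run([10, 11, 12])   # [100, 144]
```

Because a blueprint is just a value, it can be stored, passed around, and combined before anything executes; a different `run` implementation could in principle execute the same description in parallel.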

We're excited to have brought some simplicity and a comfortable model to the steps that are otherwise very complicated and unpredictable if you are writing an application for streaming.

TechRepublic: What's your take on the overall innovation happening within this streaming movement and the key players involved?

Bonér: It's a great time to be a systems architect, because the diversity of choice at every level is very strong. But it can also be pretty confusing for a consumer, given all of the options and the rate at which things are evolving.

There is a ton of innovation around distributed real-time processing across clusters—things like Gearpump (Intel), Samza (LinkedIn), S4 (Yahoo), Apache Storm (Twitter), and MillWheel (Google).

There has been an explosion of Reactive Streams-compatible libraries, like Akka Streams (Typesafe), RxJava (Netflix), Reactor (Pivotal), and Vert.x (Red Hat). Java 8 has a Stream library.

There are Event Sourcing/CQRS systems, like EventStore, Akka Persistence, and Eventuate.

There are systems like Spark Streaming for ingesting data into Apache Spark (through micro-batching), and there are streaming drivers for most NoSQL Databases, like Cassandra, Riak, MongoDB, Membase, and others.

You have event logging infrastructure tools like Kafka, LevelDB, and JavaChronicle.

There is streaming for SQL, like Typesafe's Slick.

You have the classic complex event stream processing (CEP) tools like Esper and Oracle CEP, and then there are a lot of interesting things happening in high-frequency trading (HFT) around ultra low latency, with products like Ultra Messaging and Aeron.

TechRepublic: Innovation is great, but that sounds like an alphabet soup of streaming projects. Is there a risk that streaming will run into some of the same issues that stalled SOA, where there are too many competing standards and proprietary approaches?

Bonér: That's an excellent correlation and cause for concern. SOA—superficially—was about interoperability, but then the implementations were typically bogged down by proprietary application servers, middleware, and centralized systems that created contention and brittleness.

With streaming systems, we see the importance of avoiding poor isolation, single points of failure, and contention—and that's where the Reactive Streams specification is so important.

Our hope is that everyone participating in this streaming ecosystem will converge on this standard, joining Pivotal, Netflix, Red Hat, Twitter, Typesafe, and Oracle. Then any streaming system would have the same basic guarantees of interoperability and would handle the most common "backpressure" challenges in streaming, so that a single bad link in a streaming architecture can't take the entire system down.
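The core idea behind backpressure in Reactive Streams is demand signalling: a subscriber tells the publisher how many items it is ready for, and the publisher sends no more than that. The Python sketch below is loosely modeled on the specification's `Publisher`, `Subscriber`, and `Subscription` roles and its `request(n)` call, but it is a simplified, synchronous illustration; `BoundedSubscriber`, its `drain` method, and the `capacity` parameter are hypothetical, and error handling and cancellation are omitted.

```python
class Subscription:
    """Link between one publisher and one subscriber; carries demand upstream."""

    def __init__(self, items, subscriber):
        self._items = iter(items)
        self._subscriber = subscriber

    def request(self, n):
        # Emit at most n items: the publisher never outruns signalled demand.
        for _ in range(n):
            try:
                item = next(self._items)
            except StopIteration:
                self._subscriber.on_complete()
                return
            self._subscriber.on_next(item)


class Publisher:
    def __init__(self, items):
        self._items = items

    def subscribe(self, subscriber):
        subscriber.on_subscribe(Subscription(self._items, subscriber))


class BoundedSubscriber:
    """Hypothetical slow consumer with a fixed-size buffer."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.buffer = []
        self.processed = []
        self.max_buffered = 0
        self.done = False
        self._subscription = None

    def on_subscribe(self, subscription):
        self._subscription = subscription
        subscription.request(self.capacity)  # initial demand = free buffer space

    def on_next(self, item):
        self.buffer.append(item)
        self.max_buffered = max(self.max_buffered, len(self.buffer))

    def on_complete(self):
        self.done = True

    def drain(self):
        # Process what is buffered, then signal demand for the freed space.
        self.processed.extend(self.buffer)
        self.buffer.clear()
        if not self.done:
            self._subscription.request(self.capacity)


subscriber = BoundedSubscriber(capacity=3)
Publisher(range(10)).subscribe(subscriber)
while subscriber.buffer or not subscriber.done:
    subscriber.drain()
```

However fast the source could produce, the subscriber's buffer never holds more than `capacity` items, because items only flow in response to `request` calls.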

There's a lot of competition between the vendors in the streaming landscape, but there's also some great collaboration on its hardest problems. Most of the products today are developed in the open, as true open-source projects, with passionate communities driving innovation through collaboration and real-world requirements, which I think will maximize the chances of success.

About Matt Asay

Matt Asay is a veteran technology columnist who has written for CNET, ReadWrite, and other tech media. Asay has also held a variety of executive roles with leading mobile and big data software companies.
