Big Data

Spark or Hadoop? Let gravity decide, says Basho CTO

Data follows data, says Basho's Dave McCrory, which is why big data will live in both the cloud and enterprise data centers.

As the big data community gores itself over real-time vs. batch, Basho CTO Dave McCrory (@mccrory) offers an easy way to settle the question:

Let gravity decide.

Or data gravity, to be more precise. In a world where "data attracts data," McCrory tells me in an interview, enterprises will be "inclined to store additional quantities of large data in the same place." That means more data in traditional data centers, but also more data in the cloud.

In short, big data, according to McCrory, isn't about cloud vs. data center, or real-time Spark vs. batch-oriented MapReduce. It's about all of the above.

It's similar to what I heard from Etsy CTO Kellan Elliott-McCrea and Hadoop creator (and Cloudera executive) Doug Cutting, but McCrory adds insight worth reading.

TechRepublic: Where do you stand on the batch vs. streaming spectrum? Is batch a historical necessity (but still history), or will it remain an essential part of big data for a long, long time?

McCrory: Batch will exist for a long, long time—primarily due to technological limitations. As data grows, it attracts more data (and applications that want to consume that data).

The problem with this, and with data warehouses (aka "Data Lakes"), is that storing all of this data together makes it more difficult to process.

Real-time (aka stream) processing allows the data to be processed in flight, solving data-in-motion needs/problems. These two problems are not mutually exclusive, and the best solution I've seen is to combine both solutions together, which is the approach that Lambda architectures take.

TechRepublic: Where do you think big data workloads are going to live in our increasingly cloud-oriented future?

McCrory: Big data workloads will live in large data centers where they are most advantaged. Why will they live in specific places? Because data attracts data.

If I already have a large quantity of data in a specific cloud, I'm going to be inclined to store additional quantities of large data in the same place. As I do this and add workloads that interact with this data, more data will be created.

This virtuous cycle is a part of the Data Gravity effect, a term I coined years ago.

Data Gravity is also influenced by outside factors, such as regulation and compliance. The biggest question today is: do you bring the data to the workload or the workload to the data? The answer, I believe, is both.

TechRepublic: You're CTO of Basho Technologies, the company behind NoSQL database Riak. How does Riak fit into this world?

McCrory: Riak is a Key/Value store that focuses on scalability, availability and correctness, and operational simplicity.

  • Scalability is key when addressing these problems, because as your data grows at a geometric rate, your storage needs to keep up with it.
  • Availability and correctness, because if your data isn't available and correct, it isn't worth anything.
  • And operationally simple, because complexity destroys productivity and efficiency.

Riak allows data to be distributed and replicated so that it doesn't have to live in a single location or data center. By using multi-cluster replication (MCR), data can be replicated and synchronized in real time across data centers. This gives people the flexibility to move their workload to the most favorable place, because their data is already there.

TechRepublic: Speaking of NoSQL, more generally, do you see workloads being pushed to Hadoop that better fit in Riak/NoSQL land? How should enterprises determine the optimal technology for their particular big data problems?

McCrory: I see Hadoop as something that is becoming the 21st Century data warehouse.

People doing serious work with MapReduce-style problems are flocking to Spark because it's incredibly easy to understand and model, and it's much easier to code against. The incredible performance gains that people are seeing from Spark aren't hurting either.

Spark provides the analytics solution that Hadoop was supposed to deliver on.

NoSQL plays into this, because it's designed to support real-time and application-centric workloads that were traditionally serviced by relational databases. The problem with relational databases is that they weren't designed to scale out.

This convergence of real-time, application-centric, and scale-out is what makes a solution like NoSQL—and specifically Riak—so attractive today.

About Matt Asay

Matt Asay is a veteran technology columnist who has written for CNET, ReadWrite, and other tech media. Asay has also held a variety of executive roles with leading mobile and big data software companies.
