Big data developers' hallelujah moment for distributed storage

Alluxio is blindingly fast, super simple, and Berkeley's newest big data baby. Here's how it's redefining the storage layer.


Some of the biggest breakthroughs in big data analytics were born at UC Berkeley's AMPLab. Apache Spark, the hugely popular in-memory computation engine, is backed by IBM and other major vendors. Apache Mesos, which powers Twitter and Apple's Siri, is the kernel of a data center or cloud operating system, pooling all available resources to run and manage workloads at scale.

We're talking in-memory computations over datasets at the hundreds-of-terabytes and even petabyte scale. That is serious scale.

What's missing, however, is the storage layer.

How to efficiently and reliably share data at memory speed across cluster computing frameworks is a huge challenge. As datasets continue to grow, storage and networking create serious bottlenecks for many distributed workloads. But it's not just a performance problem: storage system interfaces are complicated and tough for developers to reason about, and getting data to the applications and frameworks doing the computation is one of the hardest things about big data.


To address these challenges, AMPLab PhD candidate Haoyuan Li developed Alluxio (until recently known as Tachyon), a memory-centric, fault-tolerant virtual distributed storage system with a unified namespace. Conceived only three years ago, Alluxio has been adopted by many large companies and is the storage layer of AMPLab's Berkeley Data Analytics Stack (BDAS).

Distributed storage on steroids

Alluxio is, at its core, a virtual distributed storage system with a memory-centric architecture, and it works well for big data and other scale-out applications. It has already been adopted by many different businesses. For example, Baidu uses Alluxio in production to cut its end-to-end query latency by a factor of 30, and Barclays leverages Alluxio to reduce analytics job run times from hours to seconds.

That's fast. By some measures, it's 100 times faster than Spark SQL running solo.

Beyond the significant performance boost, Alluxio unifies computation frameworks with underlying storage systems through a virtual interface. It enables any framework or application to access and analyze data from any storage system.

This was a dramatic breakthrough.

Making fast...simple

Alluxio abstracts away all the underlying complexity of any persistent file or storage system. This is a hallelujah moment for developers writing distributed applications.


It means they can run any big data framework (Apache Spark, Apache MapReduce, Apache Flink, Impala, etc.) against any storage system or filesystem underneath (Alibaba OSS, Amazon S3, EMC, NetApp, OpenStack Swift, Red Hat GlusterFS, and more), on any storage media, from DRAM to SSD to HDD. All they have to know is one API.


However, this isn't simply manna for developers. It's also good news for operators.

After all, there's no rip-and-replace requirement for the expensive storage systems in the data center to realize the benefits of Alluxio. Operators can keep those NetApp and EMC boxes to protect their precious data. And if a new storage technology arrives, they can simply roll it into the data center. This future-proofs storage: with Alluxio, any application can access any data from anywhere, and store any data anywhere.
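As a rough sketch of what "rolling in" new storage looks like in practice: in Alluxio, pointing the system at an under store is a configuration step rather than a data migration. The hostnames and paths below are hypothetical, and property names have varied across Alluxio releases, so treat this as illustrative rather than a definitive setup:

```properties
# alluxio-site.properties (hypothetical example)

# Where the Alluxio master runs
alluxio.master.hostname=master-node

# The default under storage system backing Alluxio's namespace.
# Swapping HDFS for S3, GlusterFS, etc. means changing this address,
# not rewriting the applications that read through Alluxio.
alluxio.underfs.address=hdfs://namenode:9000/alluxio
```

Additional stores can also be attached into the same namespace at runtime (for example, mounting an S3 bucket under an Alluxio path with the `alluxio fs mount` shell command), so applications keep reading the same Alluxio paths regardless of where the bytes actually live.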

Developer darling

Alluxio is also rapidly gaining developer traction. At the three-year milestone, it was well ahead of where other hugely popular open source big data frameworks and datastores stood at the same age, at least in terms of developer activity.


If Alluxio takes off, it may become much more than the BDAS storage layer. It could become a standard storage layer for any data center.

Given its performance and comparative simplicity, that's not too quixotic a thought.
