If we're going to talk about truly BIG data, we need to look to China. With 1.3 billion people, a quickly expanding urban economy, and rising rates of internet and smartphone penetration, China generates a mind-boggling amount of data annually.
As just one example, Baidu, China's search giant, told me it crunches petabytes of data every day, returning analytic queries in under 30 seconds. It moves data at memory speed using Alluxio (formerly Tachyon), a virtual distributed storage system originally developed at UC Berkeley's AMPLab.
I recently caught up with Lei Xu, a senior engineer on the big data platform at Qunar, China's top e-commerce travel site, which carries more than 80% of online travel business in the Middle Kingdom. I was curious how their big data backend keeps up with the business's exploding performance and scale requirements.
TechRepublic: First, please describe the scale of your big data operations at Qunar.
Xu: Our streaming platform processes around six billion system log entries daily; that's more than 4.5 terabytes of data. Every day. And it's growing quickly. One example of a critical job running on this platform is our real-time user recommendation engine, where stability and low latency are everything. It's based on log analysis of users' click and search behavior. Faster analysis gives our users more accurate feedback.
TechRepublic: How did you architect your original backend for processing big data and what were the problems you ran into as more users hit your site?
Xu: Our real-time stream processing system uses Apache Mesos for cluster management. Under Mesos we run Apache Spark, Flink, Logstash, and Kibana. Logs come from multiple sources, and we consolidate them with Kafka. The main computing frameworks, Spark Streaming and Flink, subscribe to the data in Kafka, process it, and then persist the results to HDFS [Hadoop Distributed File System].
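The shape of that pipeline can be sketched in miniature. This is purely illustrative, not Qunar's code: a plain in-process queue stands in for a Kafka topic, a per-user click count stands in for the streaming job, and a local temp file stands in for HDFS.

```python
import json
import queue
import tempfile

# Stand-in for a Kafka topic: log producers enqueue raw JSON lines.
log_topic = queue.Queue()
for line in [
    '{"user": "u1", "action": "click", "item": "hotel_42"}',
    '{"user": "u2", "action": "search", "query": "beijing"}',
    '{"user": "u1", "action": "click", "item": "flight_7"}',
]:
    log_topic.put(line)

# Stand-in for the streaming job: consume events, aggregate clicks per user.
clicks_per_user = {}
while not log_topic.empty():
    event = json.loads(log_topic.get())
    if event["action"] == "click":
        clicks_per_user[event["user"]] = clicks_per_user.get(event["user"], 0) + 1

# Stand-in for persisting results to HDFS: write them to a local file.
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as out:
    json.dump(clicks_per_user, out)
    result_path = out.name

print(clicks_per_user)  # {'u1': 2}
```

In the real system the consume step runs continuously in micro-batches and the persist step writes checkpoints and results across the network, which is exactly where the bottlenecks Xu describes next come from.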
But this architecture had some serious bottlenecks.
First, the HDFS storage system was located in a remote cluster, adding major network latency; data exchange became a choke point. Because HDFS uses spinning disks, I/O operations, especially writes, have high latency too. Spark Streaming executors need to read from HDFS, and the repetitive cross-cluster reads really degraded performance.
Next, if a Spark Streaming executor runs out of memory and fails, it restarts on another node but cannot reuse the checkpoint information from the original node. That means some jobs can never finish.
Finally, when streaming tasks persist data with the MEMORY_ONLY storage level, you replicate data in JVM memory, causing garbage collection issues and even failures. If you get clever and use MEMORY_AND_DISK or DISK_ONLY instead, the whole processing pipeline is capped by the speed of disk I/O. We needed more speed.
TechRepublic: How did you remove these bottlenecks in your new architecture?
Xu: The original stream processing logic remains unchanged in the new architecture, but we deployed Alluxio between the computation frameworks and HDFS. Now only the final results need to be persisted to the remote HDFS cluster; the computing frameworks exchange intermediate data through Alluxio at memory speed.
The main difference before and after is that much of the data used and generated by Spark Streaming is now stored in Alluxio, eliminating most communication with the slow, remote HDFS cluster. Flink, Zeppelin, and other computing frameworks also access the data stored in Alluxio at memory speed.
There were several key features in Alluxio that helped us dramatically improve overall system performance and reduce latency:
First, Alluxio uses a tiered storage approach. Alluxio runs on every compute node and manages the local memory, SSD, and disk. Data required by stream processing is kept locally as much as possible, reducing network traffic. Alluxio applies an eviction policy so that the most frequently used data stays in memory.
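That eviction behavior (Alluxio has historically shipped an LRU-style evictor by default, though the policy is configurable) can be illustrated with a toy cache that drops the least recently used block when the memory tier fills up. `BlockCache` is a made-up name for illustration, not part of Alluxio's API:

```python
from collections import OrderedDict

class BlockCache:
    """Toy memory tier: evicts the least recently used block when full."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.blocks = OrderedDict()  # insertion/access order = recency order

    def get(self, block_id):
        if block_id in self.blocks:
            self.blocks.move_to_end(block_id)  # mark as recently used
            return self.blocks[block_id]
        return None  # miss: a real system would fall back to the SSD/disk tier

    def put(self, block_id, data):
        self.blocks[block_id] = data
        self.blocks.move_to_end(block_id)
        if len(self.blocks) > self.capacity:
            self.blocks.popitem(last=False)  # evict the coldest block

cache = BlockCache(capacity=2)
cache.put("b1", "logs-00")
cache.put("b2", "logs-01")
cache.get("b1")             # touching b1 makes b2 the coldest block
cache.put("b3", "logs-02")  # capacity exceeded: b2 is evicted
print(list(cache.blocks))   # ['b1', 'b3']
```

The same idea, applied per node across memory, SSD, and disk tiers, is what keeps the hottest stream data at memory speed.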
Second, the unified namespace greatly simplifies input and output logic. Alluxio manages the connection to HDFS and presents a global, unified namespace to all computation frameworks. Third, the simple API makes integration easy. It's truly easy to convert applications running on HDFS to Alluxio: just swap "alluxio" for "hdfs" in the URIs. That's it!
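That URI swap looks roughly like the following. The hostnames and ports are hypothetical placeholders (19998 is commonly cited as the Alluxio master's default port), not Qunar's configuration:

```python
# A Spark job reading directly from the remote HDFS cluster might use:
#   sc.textFile("hdfs://namenode:9000/qunar/logs/clicks")
# Pointing the same job at Alluxio is just a scheme/endpoint change:
#   sc.textFile("alluxio://master:19998/qunar/logs/clicks")

hdfs_path = "hdfs://namenode:9000/qunar/logs/clicks"

# The application logic and the path within the namespace stay identical;
# only the filesystem endpoint changes.
alluxio_path = hdfs_path.replace("hdfs://namenode:9000",
                                 "alluxio://master:19998")
print(alluxio_path)  # alluxio://master:19998/qunar/logs/clicks
```

Because Alluxio mounts HDFS under its unified namespace, the rest of the path is unchanged, which is why the migration can be a one-line edit per job.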
Finally, Alluxio's off-heap storage helps with Java garbage collection. Spark Streaming jobs store data in Alluxio instead of in the Spark executors' JVMs. This keeps the benefits of in-memory performance while greatly reducing Java garbage collection costs. It also prevents out-of-memory errors caused by redundant in-memory data blocks in Spark executors.
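One way to see the off-heap benefit: with executor-heap caching, every worker holds its own copy of the data where the garbage collector must track it; with an external store, the data lives once outside the process and workers hold only a lightweight reference. A toy Python analogue, with a temp file standing in for Alluxio's memory tier:

```python
import pickle
import tempfile

block = list(range(100_000))

# "On-heap" approach: each of 3 workers keeps a full copy in process
# memory -- three times the data for the garbage collector to manage.
worker_copies = [list(block) for _ in range(3)]

# "Off-heap" approach: the block is serialized once to an external store
# (a temp file here, playing the role of Alluxio's memory tier) and each
# worker keeps only a reference to it, not the data itself.
with tempfile.NamedTemporaryFile(delete=False) as store:
    pickle.dump(block, store)
    block_ref = store.name

worker_refs = [block_ref] * 3  # three tiny references, one copy of the data

# Any worker can rehydrate the block on demand:
with open(block_ref, "rb") as f:
    assert pickle.load(f) == block
```

The analogy is loose (Alluxio serves blocks over a filesystem API, not pickle files), but it captures why moving cached data out of the executor JVMs shrinks both GC pressure and the risk of out-of-memory failures.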
TechRepublic: What kind of payoff did you see by changing architecture?
Xu: Performance was the huge payoff. Our new system is much faster: we saw a 15X speedup on average, and a 300X speedup at peak service times. Throughput went from a fluctuating 20 to 300 events per second to a stable 7,800 events per second. Users on our travel site love the faster response times.
Matt Asay is a veteran technology columnist who has written for CNET, ReadWrite, and other tech media. Asay has also held a variety of executive roles with leading mobile and big data software companies.