Hadoop is cool, and Spark is fast, but sometimes you need optimized hardware to handle ever-larger data workloads. That's the premise behind Kinetica, an in-memory database that channels the power of massively distributed graphics processing units (GPUs) to promise 100-1,000x better real-time analytics performance.
Such a promise is somewhat dizzying, given the bevy of big data analytics options available today. But it's also a tad optimistic, given that GPUs are fantastic for workloads dependent on heavily parallelized matrix math, but not necessarily ideal for a wider range of big data applications.
Not yet, anyway.
The rise of GPUs in big data
Kinetica (formerly GPUdb) has been around for several years, winning awards as it displaces Oracle and other industry heavyweights in significant deployments. First, there was the database the US government used to track and kill terrorists. More recently, Kinetica was deployed by the US Postal Service to reduce fraud and streamline operations.
To what effect? Try delivery of more than 150 billion pieces of mail in 2015 while driving 70 million fewer miles, thereby saving seven million gallons of fuel. All this while pulling data from more than 213,000 scanning devices, with 15,000+ concurrent users at post offices and processing facilities throughout the US, and combining it with geospatial data to predict events in real time. This represents, by the way, a 200x performance improvement over the relational database that USPS had been using.
While seemingly diverse, such workloads strike the sweet spot of GPUs, as Todd Mostak, founder and CEO of MapD, wrote: "GPUs excel at tasks requiring large amounts of arithmetically intense calculations, such as visual simulations, hyper-fast database transactions, computer vision and machine learning tasks."
Figuring out where GPUs fit
The trick, then, is to figure out where to apply GPU-oriented databases, because they're not equally good for all big data applications.
As Nikita Shamgunov, CTO and co-founder of in-memory database company MemSQL, told me, "There is no question GPUs provide advantages for certain workloads, in particular things like deep learning. GPUs work very well for deep learning because the problem can be broken into many small operations with each small operation executed simultaneously across a large number of cores."
Adding to this, Jared Rosoff, senior director of engineering at VMware, informed me that "A single GPU is 1000s of cores optimized for matrix math ops. Deep learning is lots of very parallelizable matrix math." Not surprisingly, then, "deep learning, like computer graphics, depends on lots of parallelizable matrix math that fits perfectly" with GPUs.
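To make "parallelizable matrix math" concrete, here is a minimal Python sketch. It is illustrative only and is not how Kinetica or any GPU library works; the names `cell` and `matmul_parallel` are made up for this example, and a small thread pool stands in for a GPU's thousands of cores. The point it shows is that every element of a matrix product can be computed without waiting on any other element.

```python
from concurrent.futures import ThreadPoolExecutor


def cell(a, b, i, j):
    # One output element needs only row i of a and column j of b,
    # so it never depends on any other output element.
    return sum(a[i][k] * b[k][j] for k in range(len(b)))


def matmul_parallel(a, b, workers=4):
    rows, cols = len(a), len(b[0])
    coords = [(i, j) for i in range(rows) for j in range(cols)]
    # Hand every (i, j) cell to the pool as an independent task.
    # pool.map returns results in the same order as coords.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        flat = list(pool.map(lambda ij: cell(a, b, ij[0], ij[1]), coords))
    # Fold the flat list of results back into rows.
    return [flat[r * cols:(r + 1) * cols] for r in range(rows)]


a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]
result = matmul_parallel(a, b)
print(result)  # [[19, 22], [43, 50]]
```

Because no task reads another task's output, adding more workers is the only thing needed to do more of the work at once, which is the property Rosoff and Shamgunov are describing.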
Outside of deep learning and things like data visualization, however, the tried-and-true CPU-oriented database is often a better choice, Shamgunov continues: "For areas outside of deep learning, there are still open debates as to the overall cost/benefit of using GPUs compared to CPUs. Companies like Intel are very efficient at packaging CPU power at a low cost. And the industry infrastructure surrounding CPUs still dwarfs anything similar on the GPU front."
In other words, CPUs tend to be cheaper to harness, at little cost in productivity, and they enjoy far more industry support. Additionally, some aspects of big data simply lend themselves better to the CPU.
"For example, other areas of data processing queries are dominated by joins and shuffles, such as re-partitioning the data across the cluster on a different key," Shamgunov said. "These operations are extremely efficient on CPUs."
Rosoff also weighed in on this, saying that "most software can't take advantage of this degree of parallelism or operate with GPUs' limited instruction set," making GPUs a perfect solution for deep learning-type applications, but a poor fit for other workloads.
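For readers unfamiliar with the "shuffle" Shamgunov mentions, the Python sketch below shows what re-partitioning on a different key looks like. It is a toy, single-process illustration under stated assumptions: the `repartition` function, the `order_id` and `customer_id` fields, and the sample rows are all hypothetical, and it makes no claim about how MemSQL or any other database implements the operation.

```python
def repartition(partitions, key, num_partitions):
    # Scatter every row to the partition chosen by hashing the new key.
    # Where a row lands depends on its data, so this is mostly branching
    # and data movement rather than arithmetic.
    out = [[] for _ in range(num_partitions)]
    for part in partitions:
        for row in part:
            out[hash(row[key]) % num_partitions].append(row)
    return out


# Hypothetical rows, initially grouped by order_id.
by_order = [
    [{"order_id": 1, "customer_id": 10}, {"order_id": 2, "customer_id": 11}],
    [{"order_id": 3, "customer_id": 10}, {"order_id": 4, "customer_id": 12}],
]

# Regroup so that all rows for one customer share a partition,
# which is what a join on customer_id would need.
by_customer = repartition(by_order, "customer_id", 2)
for index, part in enumerate(by_customer):
    print(index, [row["order_id"] for row in part])
```

In a real cluster each output partition lives on a different machine, so the same step also means sending rows across the network, which is part of why this kind of work looks so different from dense matrix math.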
Over time, of course, we're likely to see enterprises combine the two approaches, using GPUs where they shine and CPUs everywhere else. It's also likely that databases will start incorporating more support for GPUs as they become more common.
Matt Asay is a veteran technology columnist who has written for CNET, ReadWrite, and other tech media. Asay has also held a variety of executive roles with leading mobile and big data software companies.