Companies want to process two kinds of Big Data: the kind that has built up in data repositories and is waiting to be probed for answers on trends, performance, and other strategic questions that organizations have never been able to answer with traditional data; and real-time “Big Data in motion,” which tells you exactly what is going on to the second and enables you to take on-the-fly actions.
What are the business cases for Big Data in motion?
- A hedge fund manager needs a means to update trading positions within seconds, and looks to the system for real-time information analysis;
- An online travel reservation system needs accurate, up-to-the-second travel availability information for every customer booking travel, regardless of how many customers are simultaneously using the system;
- An e-commerce clothing retailer needs instantaneous analytics feedback on its online selling activity and also on real-time customer responses to special promotions.
All are cases where Hadoop, which parallel-processes large sets of data across clusters of servers and is currently the processing software of choice for enterprise Big Data, struggles to perform.
The reason is simple. Hadoop’s strength is in batch processing relatively static data contained in the HDFS (Hadoop Distributed File System). HDFS files are stored on disk, which has slower data access speeds than memory. Data can be added to these HDFS files, but the data already in the files cannot be changed. Consequently, if you need to dynamically and rapidly change your data and not just add to it, HDFS files are not a viable solution.
Where HDFS excels is in a Big Data batch processing environment that uses parallel processing software like MapReduce (http://www.mapr.com/products/apache-hadoop) to process data that doesn’t dynamically change. But this isn’t going to help the hedge fund manager who has to act on new data now.
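To make the contrast concrete, here is a minimal Python sketch of the two approaches. It is only an illustration: `AppendOnlyLog` and `LivePositions` are hypothetical names invented for this example, not part of Hadoop or HDFS, and the trade data is made up. The first class mimics an append-only store whose answers come from rescanning everything in a batch pass; the second keeps mutable state that is updated as each event arrives.

```python
from collections import defaultdict


class AppendOnlyLog:
    """Toy stand-in for an append-only file: records can be added, never edited."""

    def __init__(self):
        self._records = []

    def append(self, symbol, quantity_delta):
        self._records.append((symbol, quantity_delta))

    def batch_positions(self):
        """Batch-style job: rescan every record to rebuild the current positions."""
        totals = defaultdict(int)
        for symbol, delta in self._records:
            totals[symbol] += delta
        return dict(totals)


class LivePositions:
    """Mutable in-memory state, updated in place as each trade arrives."""

    def __init__(self):
        self._totals = defaultdict(int)

    def on_trade(self, symbol, quantity_delta):
        self._totals[symbol] += quantity_delta

    def current(self, symbol):
        return self._totals[symbol]


# Hypothetical trades: (symbol, change in quantity held).
trades = [("ACME", 100), ("ACME", -40), ("INIT", 25)]

log = AppendOnlyLog()
live = LivePositions()
for symbol, delta in trades:
    log.append(symbol, delta)      # batch path: just store the record
    live.on_trade(symbol, delta)   # streaming path: answer is always current

print(log.batch_positions())   # requires a full rescan of the log
print(live.current("ACME"))    # a single lookup
```

Both paths arrive at the same positions. The difference is that the batch path does work proportional to the whole log every time it is asked, while the live path does a small amount of work per event.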
Fortunately, technology innovators are finding ways to harness Big Data in motion so enterprises can access real-time intelligence for decision-making.
The open source community created Storm, a distributed computation system that can process streams of Big Data in real time. IBM’s InfoSphere Streams enables users to develop and repurpose applications to rapidly process and analyze information that is incoming from thousands of real-time sources. And just last month, ScaleOut Software, which provides in-memory data grids (IMDGs), announced hServer, an IMDG that enables Hadoop analysis of grid-based data.
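As a rough picture of what processing a stream in real time involves, the following Python sketch keeps a running count of events over a sliding time window. It does not use the Storm or InfoSphere Streams APIs; `SlidingWindowCounter`, the promotion codes, and the timestamps are all assumptions made up for illustration, and the sketch assumes events arrive in non-decreasing timestamp order.

```python
from collections import Counter, deque


class SlidingWindowCounter:
    """Counts events per key over the most recent `window_seconds` seconds."""

    def __init__(self, window_seconds):
        self.window_seconds = window_seconds
        self._events = deque()    # (timestamp, key) pairs, oldest first
        self._counts = Counter()

    def add(self, timestamp, key):
        """Record one event, then drop anything that has aged out of the window."""
        self._events.append((timestamp, key))
        self._counts[key] += 1
        self._evict(timestamp)

    def _evict(self, now):
        cutoff = now - self.window_seconds
        while self._events and self._events[0][0] <= cutoff:
            _, old_key = self._events.popleft()
            self._counts[old_key] -= 1
            if self._counts[old_key] == 0:
                del self._counts[old_key]

    def counts(self):
        return dict(self._counts)


# Hypothetical order events: (seconds since start, promotion code used).
orders = [(0, "SPRING10"), (5, "SPRING10"), (20, "FREESHIP"), (65, "SPRING10")]

counter = SlidingWindowCounter(window_seconds=60)
for timestamp, promo in orders:
    counter.add(timestamp, promo)

# Only orders from the last 60 seconds are still counted.
print(counter.counts())
```

The result is available after every event rather than after a batch job finishes, which is the property the e-commerce retailer in the example above is after. A production stream processor also distributes this work across machines and handles failures, which this sketch does not attempt.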
In ScaleOut’s case, the in-memory data grid caches rapidly changing data so that it can be accessed at near in-memory speeds, which the company says resulted in an 11x reduction in average data access time in benchmark tests. IBM’s InfoSphere Streams, also designed for real-time Big Data analysis, delivers sub-millisecond latencies in real-time analysis of data that spans text, images, audio, voice, VoIP (voice over Internet Protocol), video, web traffic, email, GPS (global positioning system) data, financial transaction data, satellite data, and sensor data.
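The general idea of keeping fast-changing data in memory in front of a slower store can be sketched in a few lines of Python. This is not ScaleOut's product or API: `SlowStore` and `WriteThroughCache` are hypothetical classes, the artificial delay merely stands in for disk access, and a real data grid would also spread the data across many servers.

```python
import time


class SlowStore:
    """Stand-in for disk-backed storage; every access pays an artificial delay."""

    def __init__(self, data, delay_seconds=0.01):
        self._data = dict(data)
        self._delay = delay_seconds
        self.reads = 0

    def get(self, key):
        self.reads += 1
        time.sleep(self._delay)
        return self._data[key]

    def put(self, key, value):
        time.sleep(self._delay)
        self._data[key] = value


class WriteThroughCache:
    """Serves reads from memory and pushes every write through to the store."""

    def __init__(self, store):
        self._store = store
        self._memory = {}

    def get(self, key):
        if key not in self._memory:          # first read falls back to the store
            self._memory[key] = self._store.get(key)
        return self._memory[key]

    def put(self, key, value):
        self._memory[key] = value            # update the in-memory copy in place
        self._store.put(key, value)          # keep the slower store consistent


# Hypothetical seat availability for one route.
store = SlowStore({"LHR-JFK": 42})
cache = WriteThroughCache(store)

for _ in range(3):
    seats = cache.get("LHR-JFK")             # only the first call touches the store

cache.put("LHR-JFK", seats - 1)              # a booking changes the value
print(cache.get("LHR-JFK"), store.reads)
```

After the first read, lookups are served from memory, and changed values are visible immediately without another trip to the slower store.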
All of this is heartening news for companies wanting to enter the world of real-time Big Data analytics.
“Enterprises want help with applications that use data that is changing and churning rapidly, and if they can utilize Hadoop with technologies that facilitate real-time analytics, they want to do that,” said Bill Bain, ScaleOut’s CEO. “We’re excited to take an important step towards meeting that need.”