It appears that the technology industry and not-for-profit tech organizations are already answering one big data roadmap question that has been looming for CIOs. The road to big data is going to run through Hadoop. One by one, tech industry sectors and technologies are lining up with solutions that will continue to enrich Hadoop performance in the data center.
Not long ago, I talked about HBase, another Apache project, which addresses one of Hadoop's shortcomings: the lack of connective relationships between the pieces of data it processes. HBase organizes data into a set of tables, each defined with its own unique primary keys, with each table containing a series of columns filled with attributes for the table's primary key. This creates greater granularity in data searches, because sites now have more variables, in the form of unique keys and their attributes, to employ when they mine big data for the answers they are seeking.
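To make the table/key/column idea concrete, here is a minimal sketch in plain Python that models HBase's layout: each table maps a unique row key to a set of column attributes, and searches can filter on those attributes. The class and data are hypothetical illustrations, not the actual HBase client API.

```python
# Toy model of HBase's data layout: table -> unique row key -> columns.
# Hypothetical stand-in for illustration; real code would use an HBase client.

class SimpleHBaseTable:
    def __init__(self, name):
        self.name = name
        self.rows = {}  # unique row key -> {column: attribute value}

    def put(self, row_key, column, value):
        self.rows.setdefault(row_key, {})[column] = value

    def get(self, row_key, column=None):
        row = self.rows.get(row_key, {})
        return row if column is None else row.get(column)

    def scan(self, column, value):
        # The finer granularity: filter rows by a column attribute,
        # not just by the key itself.
        return [k for k, row in self.rows.items() if row.get(column) == value]

users = SimpleHBaseTable("users")
users.put("u1001", "info:city", "Seattle")
users.put("u1002", "info:city", "Boston")
print(users.get("u1001", "info:city"))    # Seattle
print(users.scan("info:city", "Boston"))  # ['u1002']
```

The `scan` method is where the extra search granularity shows up: a query can be expressed in terms of any attribute column, not only the primary key.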
Meanwhile, other tech companies are attacking the network bandwidth constraints that big data processed through Hadoop will soon be coming up against. The goal is to speed up Hadoop processing by improving network throughput, and also to improve how Hadoop uses disk during processing.
First, to the network.
Some IT departments begin by deploying Hadoop on a single server, but the ultimate goal with scalable Hadoop (and servers) is to cluster together several servers to handle the growing workloads of big data jobs that must be parallel-processed. Normally, these jobs are scheduled with an eye toward completing as many of them as quickly as possible; so, for instance, a long-running job will continue to run while several shorter jobs start, run, and complete before the long job ends.
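The scheduling behavior described above can be illustrated with a toy simulation: with a couple of parallel execution slots, short jobs are assigned to free slots and finish while one long job is still running. The job names and durations below are hypothetical, and this first-free-slot model is a simplification of a real Hadoop scheduler.

```python
# Toy scheduling illustration: jobs are assigned to whichever execution
# slot frees up first; short jobs complete while a long job keeps running.
import heapq

def completion_order(jobs, slots=2):
    """Return job names in the order they complete."""
    free = [0.0] * slots          # time at which each slot becomes free
    heapq.heapify(free)
    finished = []
    for name, duration in jobs:   # hypothetical FIFO submission order
        start = heapq.heappop(free)
        end = start + duration
        heapq.heappush(free, end)
        finished.append((end, name))
    return [name for _, name in sorted(finished)]

jobs = [("long-job", 100), ("short-1", 5), ("short-2", 5), ("short-3", 5)]
print(completion_order(jobs))
# ['short-1', 'short-2', 'short-3', 'long-job']
```

All three short jobs fully run and complete at simulated times 5, 10, and 15, long before the 100-unit job ends.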
If you are using a cluster of servers, data must flow from server to server depending upon the needs of each job. This pass-through of data and processing is done over the network, which tethers all servers together.
This also means that network bandwidth potentially becomes an issue, because in most corporate networks, bandwidth is going to be shared with traditional transactional applications and (in some cases) even with phone traffic such as VoIP (voice over IP).
The question for big data being processed by Hadoop, then, is this: even if you have industrial-strength servers and memory resources, do you have the network bandwidth and throughput needed to move this data through?
The good news is that more organizations are upgrading to 10 Gbps (gigabits per second) Ethernet backbones, which improves network bandwidth. On top of this, Internet2, a consortium of universities partnered with business and government, is beginning deployment of 100 Gbps Ethernet for the scientific and academic communities in the U.S., a feat that will undoubtedly be commercialized for enterprises over the coming years. Companies like Quantcast, an early Hadoop adopter that measures Internet traffic for clients, report that Hadoop struggles to process large big data jobs expeditiously because it can't efficiently cut through the intense data I/O (input/output) demands of big data. They sense salvation in faster-running networks, which are sure to appear in data centers over the next few years.
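Some back-of-the-envelope arithmetic shows why those link-rate upgrades matter so much for data-heavy Hadoop jobs. The 10 TB dataset size below is a hypothetical job, and the figures are line rates that ignore protocol overhead and shared-network contention:

```python
# How long does it take just to move a dataset across the network?
# Hypothetical 10 TB job; rates are raw line rates (no overhead).

def transfer_hours(terabytes, gbps):
    bits = terabytes * 8 * 10**12        # TB -> bits (decimal units)
    return bits / (gbps * 10**9) / 3600  # bits / (bits per second) -> hours

for rate in (1, 10, 100):
    print(f"{rate:>3} Gbps: {transfer_hours(10, rate):6.2f} hours")
# At 1 Gbps the transfer alone takes over 22 hours; at 10 Gbps, about
# 2.2 hours; at 100 Gbps, roughly 13 minutes.
```

In practice the effective numbers are worse, since that bandwidth is shared with transactional applications and other traffic, which is exactly the contention problem described above.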
Finally, there is Hadoop's use of memory and disk itself. One innovation in this area is the in-memory data grid offered by ScaleOut Software in its hServer product. The solution accelerates data access and execution on Hadoop by assisting applications whose data is rapidly changing or churning (like big data gathered from and analyzed for an e-commerce website). What hServer offers is a means to stage data on an in-memory data grid for rapid access, coupled with cache memory for rapidly changing data from Hadoop's HDFS (Hadoop Distributed File System). This accelerates Hadoop's performance.
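The core idea behind such a grid can be sketched as a read-through cache: hot data is staged in memory so that repeated accesses skip the slower file-system read. This is a minimal illustration of the general caching pattern, not ScaleOut's actual API; the `read_from_hdfs` function is a hypothetical stand-in that simulates a slow HDFS read.

```python
# Minimal read-through cache sketch: stage data in memory so repeated
# accesses avoid the slow backing store. Hypothetical stand-in for the
# in-memory-grid idea; not any vendor's real API.
import time

def read_from_hdfs(key):
    time.sleep(0.05)               # simulate a slow disk/HDFS read
    return f"value-for-{key}"

class InMemoryGrid:
    def __init__(self):
        self.cache = {}

    def get(self, key):
        if key not in self.cache:  # read-through on a cache miss
            self.cache[key] = read_from_hdfs(key)
        return self.cache[key]

grid = InMemoryGrid()
t0 = time.perf_counter(); grid.get("order-42"); cold = time.perf_counter() - t0
t0 = time.perf_counter(); grid.get("order-42"); warm = time.perf_counter() - t0
print(f"cold read: {cold*1000:.1f} ms, warm read: {warm*1000:.3f} ms")
```

The second (warm) read returns from memory in microseconds instead of paying the simulated disk latency; a real grid adds invalidation and refresh logic so that rapidly changing data stays current.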
From the CIO's chair, these are all ground-breaking developments, and they certainly make the strategic decision to pursue Hadoop for big data processing easier. Now, the biggest question might be: what will we see next for Hadoop performance enhancement?
Mary E. Shacklett is president of Transworld Data, a technology research and market development firm. Prior to founding the company, Mary was Senior Vice President of Marketing and Technology at TCCU, Inc., a financial services firm; Vice President of Product Research and Software Development for Summit Information Systems, a computer software company; and Vice President of Strategic Planning and Technology at FSI International, a multinational manufacturing company in the semiconductor industry. Mary is a keynote speaker and has more than 1,000 articles, research studies, and technology publications in print.