Sort out database responsibilities for big data before you deploy

There is an abundance of big data database technology solutions. The tough part is arriving at what works best in your enterprise's situation.

Hadoop, HBase, relational databases, SQL (Structured Query Language) and NoSQL ("not only SQL") have all been mentioned in database strategies for big data. But after a while, the choices multiply until enterprises find themselves overwhelmed.

The good news is that big data database technology solutions abound. The tough part is sorting through all of them to arrive at what works best in your enterprise's situation.

Database planners invariably discover that their big data environment must do two things well from a database perspective:

  1. Process all of the big data that is collected in a batch environment, and
  2. Provide rapid access to this data online to what may be hundreds or even thousands of users at a time.

In this sense, big data is no different from other data: some database technologies are better suited for mass (batch) processing, while others specialize in rapid online access.

Let's start with the massive parallel processing of the data.

Massive parallel processing

Hadoop is the software that has emerged to fill this niche. It is both inexpensive and scalable because it processes data in parallel on commodity x86 computing platforms. Hadoop breaks data into blocks, replicates those blocks across servers and processes them in parallel. The result is rapid batch processing of data, plus replication that virtually "failproofs" a big data system against the loss of individual servers.

Hadoop can handle the petabytes and even exabytes of big data that enterprises now find they must manage. It uses MapReduce as its batch processing engine, which data scientists can use to query data once it has been loaded into HDFS (Hadoop Distributed File System) files. Of course, these are batch queries.
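To make the map and reduce steps concrete, here is a minimal sketch in plain Python. It is not Hadoop code and does not use any Hadoop API; the function names (split_into_chunks, map_chunk, reduce_counts, word_count) are hypothetical, and a local thread pool stands in for a cluster of servers. It only illustrates the idea of splitting data into pieces, processing each piece independently and merging the partial results.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def split_into_chunks(records, n_chunks):
    """Split the input into roughly equal pieces, as a batch system would."""
    size = max(1, -(-len(records) // n_chunks))  # ceiling division
    return [records[i:i + size] for i in range(0, len(records), size)]


def map_chunk(chunk):
    """Map step: count words within one chunk, independently of the others."""
    counts = Counter()
    for line in chunk:
        counts.update(line.lower().split())
    return counts


def reduce_counts(left, right):
    """Reduce step: merge two partial results into one."""
    return left + right


def word_count(records, n_workers=4):
    """Run the map step on each chunk in a worker pool, then reduce."""
    chunks = split_into_chunks(records, n_workers)
    # Assumption: threads stand in for the separate servers of a real cluster.
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(map_chunk, chunks))
    return reduce(reduce_counts, partials, Counter())


if __name__ == "__main__":
    logs = ["error disk full", "warning disk slow", "error network down"]
    print(word_count(logs, n_workers=2))
```

Because each chunk is counted without reference to any other chunk, the map step can be spread over as many workers as there are chunks, which is the property that lets this style of processing scale out on commodity hardware.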

Also read: Hadoop success requires avoidance of past data mistakes

Rapid access

To offer big data queries that are online and closer to real time, enterprises need database approaches besides Hadoop, with its batch orientation. To fill the void, there are NoSQL products on the market such as Cassandra, HBase and MongoDB. These products can complement batch Hadoop processing by picking up where Hadoop leaves off: they can skim off important pieces of query-eligible data from Hadoop files and aggregate them in a highly searchable database that meets the performance and access needs of many concurrent online users.
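The division of labor described here can be sketched in a few lines of Python. This is only an illustration under stated assumptions: an in-memory dictionary stands in for the NoSQL store, and the record fields (customer_id, amount) and function names (build_serving_view, lookup) are hypothetical rather than taken from any product.

```python
from collections import defaultdict


def build_serving_view(batch_records):
    """Aggregate raw batch output into one small, query-ready row per key."""
    view = defaultdict(lambda: {"orders": 0, "revenue": 0.0})
    for rec in batch_records:
        row = view[rec["customer_id"]]
        row["orders"] += 1
        row["revenue"] += rec["amount"]
    return dict(view)


def lookup(view, customer_id):
    """Online read path: a single key lookup, with no scan of the raw data."""
    return view.get(customer_id)


if __name__ == "__main__":
    # Hypothetical batch output; in practice this would come from batch files.
    records = [
        {"customer_id": "c1", "amount": 10.0},
        {"customer_id": "c2", "amount": 3.0},
        {"customer_id": "c1", "amount": 5.5},
    ]
    view = build_serving_view(records)
    print(lookup(view, "c1"))
```

The expensive aggregation runs once, in batch; each online user then pays only for a keyed read, which is why this pattern can serve many concurrent queries.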

However, like Hadoop, NoSQL solutions have their own shortcomings. You might say they fail the "ACID" (atomicity, consistency, isolation, durability) test that enterprise database administrators (DBAs) look for in a database, and that DBAs take for granted whenever they work with traditional SQL databases.

What ACID means for your database is that every transaction either completes in full or does not execute at all, so no transaction is ever left half finished. Additionally, each transaction leaves the data in a valid state; transactions are kept isolated from each other until they are finished; and the database logs all committed operations to enable full recovery from a server failure.
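The "never left half finished" property is easy to see in a traditional SQL database. The following sketch uses SQLite through Python's standard sqlite3 module; the accounts table and the make_db, balances and transfer helpers are hypothetical, chosen only for illustration. A transfer credits one account and then debits another, and if the debit fails, the credit is rolled back with it.

```python
import sqlite3


def make_db():
    """Create an in-memory database with two hypothetical accounts."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE accounts ("
        "name TEXT PRIMARY KEY, "
        "balance INTEGER NOT NULL CHECK (balance >= 0))"
    )
    conn.executemany(
        "INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)]
    )
    conn.commit()
    return conn


def balances(conn):
    """Return the current balances as a {name: balance} dictionary."""
    return dict(conn.execute("SELECT name, balance FROM accounts"))


def transfer(conn, src, dst, amount):
    """Move `amount` from src to dst as one all-or-nothing transaction."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            # If this debit would make the balance negative, the CHECK
            # constraint raises an error and the credit above is undone.
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        return False


if __name__ == "__main__":
    conn = make_db()
    print(transfer(conn, "alice", "bob", 30), balances(conn))
    print(transfer(conn, "alice", "bob", 500), balances(conn))
```

After the second, oversized transfer fails, both balances are exactly what they were before it started. That is the guarantee DBAs take for granted, and the one they lose when a data store cannot group multiple updates into a single transaction.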

Today, many NoSQL products do not come with full ACID guarantees, so this is a major concern from the DBA's standpoint. It is also why NoSQL solution providers are hard at work to cure the problem.

So what do you do?

Hadoop is becoming well established as the de facto batch processing engine for big data, and virtually every technology solution provider has embraced it. This makes it logical for enterprises to incorporate Hadoop into their big data database strategies for batch processing.

The remaining questions are really about the online query side of big data, and what online queries need to produce.

If the enterprise requirements are for near-real-time big data queries, and if there will be many of these queries at the same time, NoSQL might be worth the sacrifice of some of the ACID guarantees DBAs would prefer. On the other hand, if near-real-time query demands from your users are likely to be infrequent, with a user base that is limited to just a few individuals, traditional data marts and warehouses supported with SQL queries might be appropriate. In some cases, enterprises could find themselves in both environments, choosing to deploy SQL in some circumstances and NoSQL in others.

Enterprise selections will vary and, as usual, there is no pat "right answer" for every situation. What matters in the end is making an informed choice.

By Mary Shacklett

Mary E. Shacklett is president of Transworld Data, a technology research and market development firm. Prior to founding the company, Mary was Senior Vice President of Marketing and Technology at TCCU, Inc., a financial services firm; Vice President o...