Hadoop, HBase, relational databases, SQL (structured query language) and
NoSQL (not only structured query language) have all been mentioned in
database strategies for big data. But after a while, there get to be so many
choices that enterprises find themselves overwhelmed.

The good news is that an abundance of big data database
technology solutions exists. The tough part is sorting through all of these
solutions to arrive at what works best in your enterprise’s situation.

Database planners invariably discover that their big data
platform must do two things well from a database perspective:

  1. Process all of the big data that is collected in a batch environment, and
  2. Provide rapid access to this data online to what may be hundreds or even
     thousands of users at a time.

In this sense, big data is no different from other data:
some database technologies are better suited for mass (batch) processing, while
others specialize in rapid online access.

Let’s start with the massively parallel processing of
the data.

Massively parallel processing

Hadoop is the software that has emerged to fill this niche. It’s both
inexpensive and scalable because it processes data in parallel on commodity
x86 computing platforms. Hadoop breaks data into blocks, replicates those
blocks across servers, and processes them in parallel. The result is rapid
batch processing of data, plus replication that makes a big data system
highly tolerant of server failures.
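
As a rough illustration of the block-and-replica idea, here is a minimal Python sketch. It is not Hadoop code: the function names and the round-robin placement are assumptions made for the example, and real HDFS placement is rack-aware and considerably more involved. The replication factor of 3 matches the HDFS default.

```python
def split_into_blocks(data: bytes, block_size: int) -> list[bytes]:
    """Cut the input into fixed-size blocks (the last one may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_replicas(num_blocks: int, nodes: list[str], replication: int = 3) -> dict[int, list[str]]:
    """Assign each block to `replication` distinct nodes.

    Hypothetical round-robin policy, used here only to keep the sketch short.
    """
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    return {
        block: [nodes[(block + r) % len(nodes)] for r in range(replication)]
        for block in range(num_blocks)
    }


def surviving_blocks(placement: dict[int, list[str]], failed_node: str) -> set[int]:
    """Blocks that still have at least one copy after a single node fails."""
    return {
        block
        for block, holders in placement.items()
        if any(node != failed_node for node in holders)
    }
```

With three copies of every block, losing any single server in this sketch leaves every block readable, which is the property the replication is there to provide.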

Hadoop can handle the petabytes and even exabytes of big
data that enterprises now find they must manage. It uses MapReduce
as a big data query engine that data scientists can exploit once data is
processed into HDFS (Hadoop Distributed File System) files. Of course, these
are batch queries.
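
To make the MapReduce model concrete, the following is a small single-machine sketch of its three stages (map, shuffle, reduce) applied to a word count. It uses only the Python standard library and a thread pool as a stand-in for a cluster; the function names are illustrative and are not part of any Hadoop API.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def map_phase(chunk):
    """Emit a (word, 1) pair for every word in one chunk of input lines."""
    return [(word.lower(), 1) for line in chunk for word in line.split()]


def shuffle(mapped_chunks):
    """Group emitted values by key, as happens between map and reduce."""
    groups = defaultdict(list)
    for pairs in mapped_chunks:
        for key, value in pairs:
            groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Collapse each key's list of values into a single total."""
    return {key: sum(values) for key, values in groups.items()}


def word_count(lines, chunk_size=2, workers=4):
    """Split the input into chunks, map them concurrently, then reduce."""
    chunks = [lines[i:i + chunk_size] for i in range(0, len(lines), chunk_size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        mapped = list(pool.map(map_phase, chunks))
    return reduce_phase(shuffle(mapped))
```

Calling `word_count(["big data big", "data lake", "big"])` returns the per-word totals only after every chunk has been mapped and grouped, which is why queries of this kind are batch jobs rather than interactive lookups.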




Rapid access

To get to the point of offering big data queries that are
online and closer to real time, enterprises need other database approaches
besides Hadoop with its batch orientation. To fill the void, there are NoSQL
products in the market such as Cassandra, HBase and MongoDB. These products
can complement batch Hadoop processing by picking up where Hadoop leaves off:
they can skim off important pieces of query-eligible data from Hadoop files
and aggregate them in a highly searchable and accessible database that can
meet the performance requirements and access needs of many concurrent online
users.
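
The "skim off and aggregate" step can be pictured with a short Python sketch. Everything in it is an assumption made for illustration: the record fields (`customer_id`, `amount`) are hypothetical, and a plain dictionary stands in for whichever NoSQL store would actually serve the lookups.

```python
from collections import defaultdict


def build_serving_view(batch_records):
    """Aggregate raw batch output into one small summary per key.

    `batch_records` is assumed to be an iterable of dicts with
    hypothetical `customer_id` and `amount` fields.
    """
    view = defaultdict(lambda: {"orders": 0, "revenue": 0.0})
    for record in batch_records:
        summary = view[record["customer_id"]]
        summary["orders"] += 1
        summary["revenue"] += record["amount"]
    return dict(view)


def lookup(view, customer_id):
    """Key-based read, the access pattern online users would generate."""
    return view.get(customer_id)
```

The design point is that the expensive scan over all records happens once, in batch, while each online request touches only a single precomputed entry.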

However, like Hadoop, NoSQL solutions have their own shortcomings.
You might say they fail the “ACID” (atomicity, consistency, isolation,
durability) test that enterprise database administrators (DBAs) look for in
a database, and that DBAs take for granted whenever they work with
traditional SQL databases.

What ACID means for your database is that any transaction either completes
in full or does not execute at all, so none is ever left half finished
(atomicity); every transaction leaves the data in a valid state
(consistency); transactions are kept isolated from each other until they are
finished (isolation); and the database records all committed operations so
it can recover fully from a server failure (durability).
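
Atomicity is the easiest of these properties to demonstrate. The sketch below uses Python's built-in `sqlite3` module and an in-memory database; the `accounts` table, the account names and the helper functions are all hypothetical and exist only for this example.

```python
import sqlite3


def make_db():
    """Create an in-memory database with two sample accounts."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
    conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 0)])
    conn.commit()
    return conn


def balances(conn):
    """Return all balances as a {name: balance} dict."""
    return dict(conn.execute("SELECT name, balance FROM accounts"))


def transfer(conn, src, dst, amount):
    """Move `amount` from src to dst as one all-or-nothing transaction."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?", (amount, src)
            )
            (balance,) = conn.execute(
                "SELECT balance FROM accounts WHERE name = ?", (src,)
            ).fetchone()
            if balance < 0:
                raise ValueError("insufficient funds")
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?", (amount, dst)
            )
        return True
    except ValueError:
        return False
```

When a transfer is rejected, the debit that had already been applied inside the transaction is rolled back, so the balances look exactly as they did before the attempt. That behavior is what DBAs mean when they say they take ACID for granted.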

Today, many NoSQL products do not come with full ACID guarantees, so this is
a major concern from the DBA’s standpoint. It is also why NoSQL solution providers
are hard at work to close the gap.

So what do you do?

Hadoop is becoming well established as the de facto batch
processing engine for big data, and virtually every technology solution
provider has embraced it. This makes it logical for enterprises to
incorporate Hadoop into their big data database strategies for batch
processing.

The remaining questions are on the online query side of
big data, and what online queries need to deliver.

If the enterprise requirements are for near-real-time big
data queries, and if there will be many of these queries at the same time,
NoSQL might be worth the sacrifice of some of the database integrity
guarantees DBAs would prefer. On the other hand, if near-real-time query
demands from your users are likely to be infrequent, with a user base that is
limited to just a few individuals, traditional data marts and warehouses
supported with SQL queries might be appropriate. In some cases, enterprises could
find themselves in both environments, choosing to deploy SQL in some circumstances
and NoSQL in others.

Enterprise selections will vary and, as usual, there is no
pat “right answer” for every situation. What matters in the end is making an informed choice.
