Cloud computing democratizes big data – any enterprise can now work with unstructured data at a huge scale.
At first glance, it isn’t obvious why the unstructured data methods of the new big data world are even necessary. Even if new methods bring new business value, why not stay on-premise? Why bother with cloud databases?
The big data label
Big data is one of those new, shiny labels, like SDN, DevOps and cloud computing, that is both hard to ignore and hard to understand. There is no single “big data” type – it is a collective label stuck on unstructured data, the technology stack it inhabits, and the new business processes that are growing up around it.
For instance, the discipline of big data analytics is about getting business value out of large data sets. Data scientists work with resources and processes to turn data into useful information. The classic RDBMS (Relational DataBase Management System) can handle a lot of data, and has been doing so for decades. Why can’t a data scientist stick with structured data in an RDBMS? Which is better – an RDBMS or a NoSQL (non-relational) database?
Structured or unstructured data?
The technical stack an enterprise chooses is dictated by the type of data it needs to store, and the type of data is dictated by business requirements.
The RDBMS is good for managing structured, highly relational data and will continue to be the software of choice for many requirements.
NoSQL technologies are a better fit for the growing amount of unstructured data produced by social media, sensor networks, and federated analytics, and for constantly changing data that needs to be replicated to other operating sites or mobile workers. Unstructured data sets can be terabytes or even petabytes in size.
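To make the difference between structured and unstructured data concrete, here is a minimal Python sketch. It uses the standard library's sqlite3 module as a stand-in for a full RDBMS and plain JSON strings as a stand-in for a NoSQL store. The table name, column names, and sample records are made up for illustration.

```python
import json
import sqlite3

# Structured: a relational table fixes its columns up front.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT NOT NULL, city TEXT)"
)
conn.execute(
    "INSERT INTO customers (id, name, city) VALUES (?, ?, ?)", (1, "Ada", "London")
)


def insert_with_unknown_column():
    """Try to store a field the schema does not define; return the error text."""
    try:
        conn.execute(
            "INSERT INTO customers (id, name, twitter_handle) VALUES (?, ?, ?)",
            (2, "Bob", "@bob"),
        )
    except sqlite3.OperationalError as exc:
        return str(exc)
    return None


# The relational engine rejects the row because the column does not exist.
schema_error = insert_with_unknown_column()

# Unstructured (schemaless): each record carries its own set of fields.
documents = [
    {"id": 1, "name": "Ada", "city": "London"},
    {"id": 2, "name": "Bob", "twitter_handle": "@bob", "tags": ["sensor", "beta"]},
]
stored = [json.dumps(doc) for doc in documents]  # stored as-is, no schema check
fields_per_document = [sorted(json.loads(s).keys()) for s in stored]

print(schema_error)
print(fields_per_document)
```

The relational table refuses a field it was not designed for until someone alters the schema. The schemaless records accept whatever arrives, which is convenient for fast-changing data but pushes the job of making sense of it onto the application.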
On-premise relational technology stack
The RDBMS is the type of storage software that has been dominant for decades. All data in an RDBMS is structured – clean, ordered and easy to understand. That makes it good for some kinds of work but bad for others. RDBMS products are also well known; a generation of DB administrators is experienced in RDBMS care and feeding.
One big problem with an RDBMS appears when it gets too busy. When the data starts filling up the disk, the queries thrash the CPU, and the result sets choke the RAM, more resources are required to keep the DBMS working. Traditionally there is only one way to scale, and that’s “up.” Scaling out is hard because a relational database service has only one front door. And the only way to scale up is to buy a bigger box.
Scaling up does not cure every RDBMS problem. Even the biggest computer, with its huge, budget-gobbling price tag, solves only the resource problem. The IT department still has to solve other problems like HA fail-over, disaster recovery and storing data where it’s needed.
If the infrastructure is on-premise, there are traditional problems to overcome. Managing an on-premise RDBMS is expensive and time consuming. An on-premise MySQL, Oracle or SQL Server database service is propped up by an overloaded IT department with a queue of work and inflexible hardware. If an enterprise rents Microsoft Azure SQL Database, Google Cloud SQL or Amazon RDS, these infrastructure headaches go away.
A big data cloud: New solutions, new headaches
In theory, managing cloud-based big data is cost-effective, scalable, and fast to build. Unfortunately, it’s not all good news.
DB administrators don’t have an easy ride. The NoSQL databases that have appeared in the last few years, with their key-value pairs, document stores, and missing schemas, don’t look like the relational databases they are slowly replacing. Also, the new rivers of data are difficult to capture, store, process, report on, and archive.
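The sketch below gives a rough idea of why these stores look unfamiliar to someone used to tables and SQL. It models a key-value store and a document store with plain Python structures. Nothing here is the API of a real NoSQL product; `kv_store`, `doc_store`, and `find` are hypothetical names used only for illustration.

```python
# Key-value store: the database understands only the key; the value is opaque.
kv_store = {}
kv_store["customer:1"] = '{"name": "Ada", "city": "London"}'
kv_store["customer:2"] = '{"name": "Bob", "city": "Paris"}'

# Document store: the database can look inside each document,
# and documents in one collection need not share the same fields.
doc_store = [
    {"_id": 1, "name": "Ada", "city": "London"},
    {"_id": 2, "name": "Bob", "city": "Paris", "devices": ["phone"]},
]


def find(collection, **criteria):
    """Return every document whose fields match all the given criteria."""
    return [
        doc
        for doc in collection
        if all(doc.get(field) == value for field, value in criteria.items())
    ]


# Key-value access: fast lookup, but only if the exact key is known.
by_key = kv_store["customer:2"]

# Document access: query by any field, with no table definition or join.
in_paris = find(doc_store, city="Paris")

print(by_key)
print(in_paris)
```

In both cases there is no table definition to consult and no join to write, which is why skills built on relational design do not carry over directly.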
It’s not so bad for system administrators. If they run a private cloud, the new unstructured data technology stack of hardware and software looks like the old structured data stack – IaaS at the bottom, a database service in the middle, and applications on top delivering the business value. If they manage public cloud services, they don’t have to touch the lower layers of the technology stack.
Sticking data in Windows Azure Tables, Amazon SimpleDB, or MongoDB is just the start of the data science required to make the most of big data. There is plenty of business partnering, re-skilling and other attitude adjustment to take care of.
Nick Hardiman builds and maintains the infrastructure required to run Internet services. Nick deals with the lower layers of the Internet - the machines, networks, operating systems, and applications. Nick's job stops there, and he hands over to the designers and developers who build the top layer that customers use.