For the past few years there has been a lot of buzz around the concept of Big Data, mainly tied in to the rise of cloud computing. Big Data is an all-encompassing term coined to represent the huge volumes of unstructured data that many businesses and most large Internet companies have to handle today. The keyword in this definition is unstructured. Traditional database structures are perfectly suited to handling large amounts of data as long as that data undergoes a structuring process. Big Data, however, has no real underlying structure. It can be tweets, Facebook status updates, web pages, text, or any other similar collection of data.
Traditional tools, such as conventional relational databases, are not very well suited to handling this kind of information. The pioneers in this area were Internet companies, who eschewed the relational database model and ended up creating their own frameworks to work with Big Data, consisting of specialized data structures, NoSQL databases, distributed processing systems, and other elements. Over time, these frameworks were consolidated, and many of them are freely available for use today. They rely heavily on parallel processing, and this is where virtualization, and cloud computing specifically, comes in to help.
In the beginning…
Google was perhaps the first company (and certainly the first web company) to make effective use of Big Data. By collecting and analyzing massive collections of web pages and of the relationships between them (the links), it was able to create the first truly universal search engine, capable of indexing and querying billions of pages without human intervention. Instead of relying on traditional technology - relational databases - Google’s engineers created a massively distributed system, and they kept this system low-cost by using off-the-shelf hardware. In developing this system, they shaped the current form of several large-scale computing techniques and were largely responsible for the widespread adoption of the MapReduce programming model, of which Hadoop is one particular implementation (interestingly enough, Yahoo, not Google, was one of the biggest contributors to the Hadoop project).
Hadoop, in turn, has been largely responsible for the Big Data boom. It is an open-source implementation of the MapReduce framework that enables anyone to perform large-scale distributed computing tasks on any cluster of computers, or even on a single computer. Before the rise of cloud IaaS providers, having a cluster meant a large investment in hardware, even if you were building it from commodity hardware. Today, however, anyone can quickly spin up a cluster with as many virtual nodes as needed at the provider of their choice, keep that cluster alive for as long as there are tasks to run, and spin down the nodes when they aren’t needed anymore. In fact, several of the largest IaaS providers have started offering pre-configured Hadoop clusters, so that users don’t even have to worry about configuration.
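To make the MapReduce idea more concrete, here is a minimal single-process sketch of its three steps (map, shuffle, reduce) applied to the classic word-count problem. It is plain Python, not Hadoop's API; the function names (map_phase, shuffle, reduce_phase, word_count) are made up for this illustration, and a real cluster would run the map and reduce steps on many nodes in parallel.

```python
from collections import defaultdict
from itertools import chain


def map_phase(document):
    """Map step: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Shuffle step: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce step: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    # Each document could be mapped on a different node; here we do it in a loop.
    mapped = chain.from_iterable(map_phase(doc) for doc in documents)
    grouped = shuffle(mapped)
    # Each key could likewise be reduced on a different node.
    return dict(reduce_phase(key, values) for key, values in grouped.items())


counts = word_count(["big data is big", "data is everywhere"])
print(counts)
```

Because each document is mapped independently and each word is reduced independently, both steps can be spread across as many machines as are available, which is what makes the model a good fit for clusters of cheap hardware.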
As massively parallel architectures become more and more commonplace - both in individual computers, with their multiple cores and processing units, and in distributed systems, which run across several servers - it becomes possible to perform much more complex tasks. Companies of all sizes have access to the computing power necessary to process a lot more data. This generates a positive feedback loop: as end users realize that they can process more data, they will demand more data collection, which in turn leads to a greater need for data processing, and so on.
Big Data, however, comes with challenges of its own. First, the algorithms necessary to process the data can be much more complex than what we are used to. This, in turn, means that it can be much harder for end users to customize the systems to fit their needs. And Big Data analysis can be a very difficult task in itself. One key problem with having a lot of information is that, as volumes grow, the signal-to-noise ratio worsens: if you could listen to all conversations going on in the world at any given time, how would you choose which ones were meaningful?
Another issue that is often overlooked by Big Data proponents is that correlation does not imply causation. In fact, the more information is processed, the greater the chances of finding spurious correlations that have no real meaning, or at least no causal relationship. If we looked at all possible variables, we would probably “discover” that most people who get sick have worn clothes at some point in their lives, which does not mean that their sickness is caused by wearing clothes.
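The effect is easy to reproduce. The sketch below generates columns of purely random, unrelated numbers and counts how many pairs of columns nonetheless look strongly correlated. Everything here is an assumption chosen for illustration: the helper names (pearson, count_strong_pairs), the sample size of 20, the 0.5 threshold, and the fixed seed.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def count_strong_pairs(num_variables, num_samples=20, threshold=0.5, seed=0):
    """Count pairs of independent random variables with |r| above threshold."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    columns = [
        [rng.gauss(0.0, 1.0) for _ in range(num_samples)]
        for _ in range(num_variables)
    ]
    strong = 0
    for i in range(num_variables):
        for j in range(i + 1, num_variables):
            if abs(pearson(columns[i], columns[j])) > threshold:
                strong += 1
    return strong


# None of these variables are related, yet the number of "strong"
# correlations grows quickly as more variables are compared.
for k in (5, 20, 100):
    print(k, "variables:", count_strong_pairs(k), "strong pairs")
```

The number of pairs grows roughly with the square of the number of variables, so even a small chance of an accidental match per pair turns into many apparent "findings" once enough variables are compared.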
Finally, the traditional cloud computing challenges must also be considered. If you are running a massively distributed process, how secure are your computing nodes? What would happen if one of these nodes were compromised? In a sense you would be better off, because only part of the data would be stolen. At the same time, there are that many more servers to worry about.
In spite of these challenges, Big Data has the potential to change the way many businesses work, and the cloud provides an excellent low-cost opportunity for anyone to try out the processing frameworks and algorithms. The first thing to consider is whether you actually have a Big Data problem: can your business benefit from using Big Data? Do your data volumes really require these distributed mechanisms? If they do, the cloud is your best bet as a processing platform.