Big Data

Where are all the cloud-based Hadoop services?

Ian Hardenburgh looks at Apache Hadoop's software framework as the best possibility for harnessing the amorphous data collected by social media marketing efforts.

More than ever, enterprises need to monitor the effectiveness of their social media marketing campaigns quickly, using newer analytical methods such as clickstream analysis and opinion mining. The problem is that, more often than not, the data associated with these efforts is largely unstructured, or at least not organized in a way that can be readily imported into a relational database. Furthermore, even if software providers or ISVs could make some type of connectivity layer available, who's to say that the data involved can be easily transformed to fit the SQL constructs that virtually every major database deployment (e.g., MySQL, SQL Server, Sybase) uses?

The Apache Hadoop software framework for distributed computing aims to take up this challenge, and has shown that it can be a faster, more scalable and more affordable way to process extremely large sets of amorphous information. Still, one has to wonder whether it will be able to keep pace with established alternatives, such as the business intelligence tools that are already widely deployed in the cloud. One also has to wonder whether Hadoop cloud services will ever become interoperable with other online services and applications, like Salesforce's Chatter or even Twitter, not to mention whether Hadoop will prove viable as a cloud-based database/PaaS offering.

Hadoop's main appeal comes from the idea mentioned above: fast writes of very large sets of loosely defined data (such as text coming from a website feed or online forum). It does this by breaking larger data sets into smaller pieces and spreading them across machines through a specialized fault-tolerant file system, the Hadoop Distributed File System (HDFS). In other words, Hadoop takes a NoSQL-style approach to batch processing: it stores data as fast as possible and leaves the work of imposing a cohesive schema until later, through what is known as dataflow programming. Hadoop also does this in an extremely cost-effective manner when compared with a more traditional RDBMS solution storing terabytes of data.
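To make the block-splitting idea concrete, here is a minimal sketch in plain Python. It is not HDFS code and uses no Hadoop API. The function names (split_into_blocks, place_replicas, blocks_lost), the node names and the tiny block size are all assumptions made for illustration, and the round-robin placement is a deliberate simplification of how a real cluster decides where replicas go.

```python
def split_into_blocks(data: bytes, block_size: int) -> list[bytes]:
    """Cut a byte string into consecutive fixed-size blocks (the last may be shorter)."""
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_replicas(num_blocks: int, nodes: list[str], replication: int) -> dict[int, list[str]]:
    """Assign every block to `replication` distinct nodes, round-robin.

    Hypothetical placement policy for illustration only; a real cluster
    also weighs rack layout and node load.
    """
    if not 1 <= replication <= len(nodes):
        raise ValueError("replication must be between 1 and the number of nodes")
    return {
        block_id: [nodes[(block_id + r) % len(nodes)] for r in range(replication)]
        for block_id in range(num_blocks)
    }


def blocks_lost(placement: dict[int, list[str]], failed_node: str) -> list[int]:
    """Return the ids of blocks that have no surviving copy if `failed_node` goes down."""
    return [
        block_id
        for block_id, holders in placement.items()
        if all(holder == failed_node for holder in holders)
    ]


if __name__ == "__main__":
    feed = b"forum post one | forum post two | forum post three"
    blocks = split_into_blocks(feed, block_size=16)  # toy size; real blocks are many megabytes
    placement = place_replicas(len(blocks), ["node-a", "node-b", "node-c", "node-d"], replication=3)
    for block_id, holders in placement.items():
        print(block_id, blocks[block_id], holders)
    print("blocks lost if node-a fails:", blocks_lost(placement, "node-a"))
```

Because every block lives on more than one machine, losing a single node leaves at least one copy of each block available, which is the sense in which the file system is fault-tolerant.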

Another of Hadoop's great appeals is the flexibility it offers in programming. Even though Hadoop itself is a Java framework, its data stores can be queried through a number of higher-level tools and languages, some of them similar to SQL. The most prevalent today are MapReduce (strictly speaking a programming model rather than a language), Pig and Hive. They can be used to emulate the look and feel of a column-oriented database, as well as to provide the functionality of a full-scale data warehouse for ad hoc querying. Certain languages, like Pig, can also be used to control parallel write processes for even faster batch processing.
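For readers who have not seen the MapReduce model, the following is a minimal single-machine sketch in plain Python of its three stages (map, shuffle, reduce), counting words in short pieces of text such as forum comments. It does not use Hadoop, Pig or Hive; the function names and sample records are assumptions made for illustration, and on a real cluster each stage would run in parallel across many machines.

```python
from collections import defaultdict
from typing import Iterable, Iterator


def map_phase(record: str) -> Iterator[tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in one input record."""
    for word in record.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[tuple[str, int]]) -> dict[str, list[int]]:
    """Shuffle: group all emitted values by key so each key is reduced once."""
    grouped: dict[str, list[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)


def reduce_phase(key: str, values: list[int]) -> tuple[str, int]:
    """Reduce: collapse the values collected for one key into a single count."""
    return key, sum(values)


def run_job(records: Iterable[str]) -> dict[str, int]:
    """Run the three stages in sequence on one machine."""
    pairs = (pair for record in records for pair in map_phase(record))
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    comments = ["great product", "Great service great price"]
    print(run_job(comments))
```

The appeal of the model is that the map and reduce functions know nothing about where the data lives, so the same logic can be applied to a handful of comments or to terabytes of them.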

So with all this in mind, what are the implications for Hadoop and the cloud? With enterprise-level social media marketing on the rise, one would think the two were made for each other. However, Hadoop's history with the enterprise cloud isn't a long one. Amazon brought Hadoop to EC2 back in 2009 with its Elastic MapReduce service, and Google released the first component of its Google App Engine MapReduce toolkit later, in 2010, but complex installation and maintenance scenarios have hampered widespread adoption. Nevertheless, this period of scarce use might be coming to an end, as IBM has just released its InfoSphere BigInsights application, and Microsoft has several Hadoop projects ready to come out of the oven. Microsoft even promises their release on Azure as soon as Christmas, and some signs point to its Hadoop store integrating with its business intelligence software, such as SQL Server Analysis Services and PowerPivot.

Hopefully these initiatives, along with Silicon Valley's growing recognition of Hadoop's potential as an instrument of social media (Facebook is running the world's largest Hadoop cluster), will not only foster continued development of managed deployments (e.g., Cloudera, MapR, Hortonworks) and the cloud services built around them, but will also lead to direct connections with SaaS/PaaS services and online social media applications, such as the aforementioned Chatter and Twitter, as this is where Hadoop is most needed.


Ian is a manager of business intelligence/analytics for a small-cap, NYSE-traded energy company. He also writes freelance about business and technology, and consults with SMBs on Internet marketing strategy.

