Hadoop success requires avoidance of past data mistakes

To reach its full potential, Hadoop implementations should avoid the data warehouse infrastructure mistakes of the past.

By Doug Bryan

Twenty-one years ago, a year before the first web browser appeared, Walmart's Teradata data warehouse exceeded a terabyte of data and kicked off a revolution in supply-chain analytics. Today Hadoop is doing the same for demand-chain analytics. The question is, will we just add more zeros to our storage capacity this time or will we learn from our data warehouse infrastructure mistakes?

These mistakes include:

  • data silos,
  • organizational silos, and
  • confusing velocity with response time

Data Silos

A data silo is a system that has lots of inputs but few outputs. The Wikipedia page for "data warehouse" shows an architecture diagram with operational systems on the left, data marts on the right, and a "data vault" in the middle, but the third definition of "vault" at Merriam-Webster.com is "a burial chamber." All too often, enterprise data warehouses have become data burial chambers, or perhaps, data hospice facilities: places where data goes to die.

To prevent this from happening to Hadoop systems we need more techniques to get data out of the central data store to people and other systems. A few data marts just aren't sufficient anymore for connecting with development partners, ad tech vendors, and the myriad of customer touch points available to retailers and brands.

Data export techniques should cover a variety of performance characteristics so that the best technique can be used for each use case. Such techniques include:

  • Good ol' batch FTP of flat files, XML files, and compact binary file formats such as Avro
  • Publish-subscribe messaging interfaces, a.k.a. enterprise message buses, such as Kafka
  • Real-time REST APIs built on high-speed databases such as HBase and Voldemort
  • OLAP and data visualization user interfaces, such as Pentaho, Tableau, and Simba for Excel, for business analysts who aren't data scientists

Let's consider the last two in more detail. First, "real-time" means different things to different people. Fifty milliseconds (1/20th of a second) is real-time for stock trading. Google found that an increase of 500 milliseconds (half a second) in page load time decreases traffic 20%, and Amazon found that even a 100 millisecond (1/10th of a second) increase in load time significantly decreases retail website revenue.

One-tenth of a second response time is a high bar for APIs to meet. To achieve it at the 95th percentile, retailers need multiple data centers per market so that shoppers always use a data center that is close by, thereby minimizing response times. In short, they need multiple front-end data centers for each Hadoop backend data center.
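
One way to check an API against such a bar is to compute the 95th percentile of measured response times and compare it with the budget. The sketch below uses the nearest-rank method. The function names, the budget constant, and the sample latencies are all hypothetical, made up for illustration rather than measured from any real system.

```python
import math

BUDGET_MS = 100  # one-tenth of a second, the bar discussed above


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]


def meets_budget(samples, pct=95, budget_ms=BUDGET_MS):
    """True if the pct-th percentile response time is within budget."""
    return percentile(samples, pct) <= budget_ms


# Hypothetical response times in milliseconds (illustrative only).
nearby_data_center = [38, 41, 45, 47, 52, 55, 58, 61, 64, 70,
                      72, 75, 78, 80, 83, 85, 88, 90, 95, 140]
# Assume a distant data center adds 60 ms of network delay to every call.
distant_data_center = [ms + 60 for ms in nearby_data_center]

for name, samples in (("nearby", nearby_data_center),
                      ("distant", distant_data_center)):
    p95 = percentile(samples, 95)
    print(f"{name}: p95 = {p95} ms, within budget: {meets_budget(samples)}")
```

With these made-up numbers the nearby data center passes at the 95th percentile despite one slow outlier, while the same service behind extra network delay does not, which is the argument for front-end data centers close to shoppers.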


Secondly, OLAP and data visualization are part of an exciting industry trend toward the "democratization of data" where the goal is to enable people to access required data themselves rather than routing queries through some central analytics department. Nike FuelBand, Fitbit, and 23andMe are examples of this trend in consumer products, and OLAP and data visualization are enabling technologies for business users. Democratization of data holds the promise of preventing another big data warehouse mistake from the past: organizational silos.

Organizational Silos

An organizational silo, like a data silo, has lots of inputs but few outputs: it's a people bottleneck. Too often, if a business analyst wanted data, they had to go to a central analytics team, wait in line, get the analytics team to understand their need, wait a few days for the results, realize that the results weren't what they thought they'd asked for, and repeat the process until one side gave up. Then, when business analysts complained and asked why on earth it could take so long, analytics just said, "There's a lot of math involved. You wouldn't understand." Over the past 20 years, that situation has created a kind of analytics aristocracy that's not very useful. If large companies can create such organizational silos with SQL, BI, and SAS, just imagine the kind of silos they'll be able to create with newer technologies such as Hadoop, MapReduce, and R. Data democratization is the cure for organizational silos.

Velocity vs. Response Time

The last data warehouse mistake we can avoid with Hadoop systems is confusing velocity with response time. Consider an analogy.

Suppose you're shipping a package from Los Angeles to San Francisco, but due to your shipper's infrastructure, it goes through Memphis. If it takes 12 hours from LA to Memphis (1,800 miles) and 12 hours from Memphis to San Francisco (2,000 miles), that's 3,800 miles in 24 hours, or 158 miles per hour. Pretty fast. However, if you cut out Memphis and go directly from LA to San Francisco (380 miles) in 12 hours, then that's just 32 miles per hour: pretty slow. Yet the slower route gets the package delivered 12 hours earlier.
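
The arithmetic in this analogy can be checked with a short Python sketch. The function name `average_speed_mph` and the route lists are illustrative only; the distances and times are the ones given above.

```python
def average_speed_mph(legs):
    """Return (total_miles, total_hours, mph) for a list of (miles, hours) legs."""
    total_miles = sum(miles for miles, _ in legs)
    total_hours = sum(hours for _, hours in legs)
    return total_miles, total_hours, total_miles / total_hours


# Figures from the analogy above.
via_memphis = [(1800, 12), (2000, 12)]  # LA -> Memphis -> San Francisco
direct = [(380, 12)]                    # LA -> San Francisco

for name, legs in (("via Memphis", via_memphis), ("direct", direct)):
    miles, hours, mph = average_speed_mph(legs)
    print(f"{name}: {miles} miles in {hours} hours = {mph:.0f} mph")

# via Memphis: 3800 miles in 24 hours = 158 mph
# direct: 380 miles in 12 hours = 32 mph
```

Velocity (miles per hour) favors the Memphis route, but response time (total hours until delivery) favors the direct one, and response time is the figure the customer experiences.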

The point is that velocity should be measured from the customer's point of view, not the infrastructure's, since infrastructure only exists to serve the customer.

The following diagram shows what used to be a typical data flow from a customer, through a data warehouse, and then back to the customer, where each of the eight steps was scheduled and run in batch. Even if each link is fast, the whole round trip is rather slow.


With cloud-based Hadoop systems we can simplify this and greatly reduce response time. Data is pushed directly from Hadoop to front-ends for use by real-time APIs, and to data marts for use by business analysts. Rather than updating customer attributes daily, weekly, or quarterly, this architecture enables real-time updates, click-by-click.


Bottom line

Hadoop holds immense promise for adding many more zeros to our storage and analytics capacity, and for transforming companies to be more data driven. However, to reach its full potential we should avoid the mistakes of the past. Otherwise, we're in for another twenty years of silos, aristocracies, and inadequate response times, or as aristocrats sometimes say, "different tree, same monkeys."

Doug Bryan is a Data Scientist at RichRelevance. Prior to joining RichRelevance he was the VP of Analytics at iCrossing/Core Audience, a digital ad agency and DMP owned by Hearst. Earlier roles include co-founding the paid search auto-bidder startup OptiMine, customer lifecycle management applications of predictive analytics at KXEN, product recommendations team lead at Amazon.com, manager at Accenture’s Center for Strategic Technology Research, and research staff and lecturer in computer science at Stanford University.