By Doug Bryan

Twenty-one years ago, a year before the first web browser
appeared, Walmart’s Teradata data warehouse exceeded a terabyte of data and kicked
off a revolution in supply-chain analytics. Today Hadoop is doing the same for demand-chain analytics.
The question is, will we just add more zeros to our storage capacity this time
or will we learn from our data warehouse infrastructure mistakes?

These mistakes include:

  • data silos,
  • organizational silos, and
  • confusing velocity with response time.

Data Silos

A data silo is a system that has lots of inputs but few
outputs. The Wikipedia page
for “data warehouse” shows an architecture diagram with operational
systems on the left, data marts on the right, and a “data vault” in
the middle, but the third definition of “vault” at Merriam-Webster.com
is “a burial chamber.” All too often, enterprise data warehouses have
become data burial chambers, or perhaps, data hospice facilities: places where
data goes to die.

To prevent this from happening to Hadoop systems, we need more
techniques for getting data out of the central data store to people and other systems.
A few data marts just aren’t sufficient anymore for connecting with development
partners, ad tech vendors, and the myriad customer touch points available to
retailers and brands.

Data export techniques should cover a variety of performance
characteristics so that the best technique can be used for each use case. Such
techniques include:

  • Good ol’ batch FTP of flat files, XML files, and compact binary file
    formats such as Avro
  • Publish-subscribe messaging interfaces, a.k.a. enterprise message buses,
    such as Kafka
  • Real-time REST APIs built on high-speed databases such as HBase and
    Voldemort
  • OLAP and data visualization user interfaces for business analysts who
    aren’t data scientists, such as Pentaho, Tableau, and Simba for Excel
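The publish-subscribe pattern behind systems like Kafka can be sketched in a few lines. The toy bus below is purely illustrative (the class and method names are hypothetical, not the Kafka API); it only shows the decoupling that matters: a producer publishes to a topic without knowing or caring which consumers are listening.

```python
from collections import defaultdict

class MessageBus:
    """Toy publish-subscribe bus illustrating the pattern Kafka implements at scale."""

    def __init__(self):
        self._subscribers = defaultdict(list)  # topic -> list of callbacks

    def subscribe(self, topic, callback):
        # Consumers register interest in a topic; producers never see them.
        self._subscribers[topic].append(callback)

    def publish(self, topic, message):
        # Producers fire and forget; every subscriber gets its own copy.
        for callback in self._subscribers[topic]:
            callback(message)

# Two independent consumers of the same clickstream topic: a real-time API
# feed and a data mart feed, each receiving every published event.
bus = MessageBus()
seen_by_api, seen_by_mart = [], []
bus.subscribe("clicks", seen_by_api.append)
bus.subscribe("clicks", seen_by_mart.append)
bus.publish("clicks", {"user": "u42", "sku": "sku-123"})
```

Adding a third consumer later requires no change to the producer, which is exactly why pub-sub interfaces help data escape the silo.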

Let’s consider the last two in more detail. First,
“real-time” means different things to different people. Fifty milliseconds
(1/20th of a second) is real-time for stock trading. Google found that an
increase of 500 milliseconds (half a second) in page load time decreases traffic
20%, and Amazon found that even a 100 millisecond (1/10th of a second) increase
in load time significantly decreases retail website revenue.

One-tenth of a second response time is a high bar for APIs
to meet. To achieve it at the 95th percentile, retailers need multiple data
centers per market so that shoppers always use a data center that is close by,
thereby minimizing response times. In short, they need multiple front-end data
centers for each Hadoop backend data center.
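A 95th-percentile target means sorting your measured response times and checking the value that 95% of requests fall at or below. A minimal nearest-rank sketch, using made-up sample latencies:

```python
def percentile(samples, p):
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, -(-len(ordered) * p // 100))  # ceil(n * p / 100), at least 1
    return ordered[int(rank) - 1]

# Hypothetical response times in milliseconds for one front-end data center.
latencies_ms = [42, 55, 61, 70, 73, 80, 88, 90, 95, 140]
p95 = percentile(latencies_ms, 95)  # 140 ms here: over the 100 ms bar
```

Note how one slow outlier blows the p95 budget even though the median looks healthy; that tail is what extra front-end data centers close to shoppers are meant to shave off.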

Second, OLAP and data visualization are part of an
exciting industry trend toward the “democratization of data” where
the goal is to enable people to access required data themselves rather than routing
queries through some central analytics department. Nike FuelBand, Fitbit, and 23andMe
are examples of this trend in consumer products, and OLAP and data
visualization are enabling technologies for business users. Democratization of
data holds the promise of preventing another big data warehouse mistake from
the past: organizational silos.

Organizational Silos

An organizational silo, like a data silo, has lots of inputs
but few outputs: it’s a people bottleneck. Too often, if a business analyst
wanted data they had to go to a central analytics team, wait in line, get the
analytics team to understand their need, wait a few days for the results,
realize that the results weren’t what they thought they’d asked for, and repeat
the process until one side gave up. Then, when business analysts complained and
asked why on earth it took so long, the analytics team just said, “There’s a
lot of math involved. You wouldn’t understand.” Over the past 20 years,
that situation has created a kind of analytics aristocracy that’s not very
useful. If large companies can create such organizational silos with SQL, BI
and SAS, just imagine the kind of silos they’ll be able to create with the new
technologies Hadoop, MapReduce,
and R. Data democratization is the cure for organizational silos.

Velocity vs. Response Time

The last data warehouse mistake we can avoid with Hadoop
systems is confusing velocity with response time. Consider an analogy.

Suppose you’re shipping a package from Los Angeles to San
Francisco, but due to your shipper’s infrastructure, it goes through Memphis. If
it takes 12 hours from LA to Memphis (1,800 miles) and 12 hours from Memphis to
San Francisco (2,000 miles), that’s 3,800 miles in 24 hours or 158 miles per
hour. Pretty fast. However, if you cut out Memphis and go directly from LA to
San Francisco (380 miles) in 12 hours, that’s just 32 miles per hour: pretty
slow. Yet the slower route gets the package delivered 12 hours earlier.
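The arithmetic in the analogy is easy to check:

```python
# Route through Memphis: high velocity, slow delivery.
via_memphis_miles = 1800 + 2000   # LA -> Memphis -> San Francisco
via_memphis_hours = 12 + 12
via_memphis_mph = via_memphis_miles / via_memphis_hours  # ~158 mph

# Direct route: low velocity, fast delivery.
direct_miles = 380                # LA -> San Francisco
direct_hours = 12
direct_mph = direct_miles / direct_hours                 # ~32 mph

# The "slow" route still beats the "fast" one by 12 hours of response time.
hours_saved = via_memphis_hours - direct_hours
```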

The point is that velocity should be measured from the
customer’s point of view, not the infrastructure’s, since infrastructure only
exists to serve the customer.

The following diagram shows what used to be a typical data
flow from a customer, through a data warehouse, and then back to the customer, where
each of the eight steps was scheduled and run in batch. Even if each link is
fast, the whole round trip is rather slow.

With cloud-based Hadoop systems we can simplify this and
greatly improve response time. Data is pushed directly from Hadoop to
front-ends for use by real-time APIs, and to data marts for use by business
analysts. Rather than updating customer attributes daily, weekly, or quarterly,
this architecture enables real-time updates, click-by-click.
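The difference between batch and click-by-click updates can be sketched in a toy form. Everything below is hypothetical, for illustration only: a customer profile whose attributes are updated the moment each click event arrives, so a real-time API sees the new state on the very next request instead of after the nightly batch.

```python
# Toy customer profile updated click-by-click rather than in a nightly batch.
# Field and event names are hypothetical, for illustration only.
profile = {"views": 0, "last_sku": None}

def on_click(event, profile):
    # Each click event updates attributes immediately; no batch window,
    # no eight-step round trip through the warehouse.
    profile["views"] += 1
    profile["last_sku"] = event["sku"]
    return profile

for event in [{"sku": "A1"}, {"sku": "B2"}]:
    on_click(event, profile)
```

In a batch architecture the same two clicks would sit in a landing zone until the next scheduled run; here the profile is already current after the second event.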

Bottom line

Hadoop holds immense promise for adding many more zeros to
our storage and analytics capacity, and transforming companies to be more data
driven. However to reach its full potential we should avoid the mistakes of the
past. Otherwise, we’re in for another twenty years of silos, aristocracies, and
inadequate response times, or as aristocrats sometimes say, “different
tree, same monkeys.”


Doug Bryan is a Data Scientist at RichRelevance. Prior to joining
RichRelevance he was the VP of Analytics at iCrossing/Core Audience, a digital
ad agency and DMP owned by Hearst. Earlier roles include co-founding the paid
search auto-bidder startup OptiMine, customer lifecycle management applications
of predictive analytics at KXEN, product recommendations team lead at
Amazon.com, manager at Accenture’s Center for Strategic Technology Research,
and research staff and lecturer in computer science at Stanford University.