Don't risk starting your big data exercise in the deep end

Resist the urge to tackle the big data three Vs -- volume, velocity, and variety -- at once. A more sensible approach reduces your risk of sinking when you want to be swimming.


Image: Wikimedia Commons/Svhartje

When my mother was a child growing up in Hawaii, her first swimming lesson was basically a short boat trip to the Pacific Ocean and a strong shove. If my grandparents had chosen to employ this technique in the Molokai Channel -- one of the most treacherous swimming channels in the world -- I probably wouldn't be here to tell this story.

What's the connection to big data?

I've often criticized companies for spending huge amounts of money on fancy big data hardware and data scientists only to splash around in the shallow end without any real idea of what to do with the data. That doesn't mean you should power through all three Vs at the same time from the start. If you're going to take on the big data challenge, it's best to paddle your way through volume, velocity, and variety one at a time.

1: Start big with large volumes

Volume is the big data V that most people get wrong, largely because they assume a large volume of data automatically calls for big data technology.

When people think of anything big, volume is intuitively the first thing that comes to mind. And since the conversation around big data is usually coupled with newfangled technologies like Hadoop, the listener associates these technologies with a large volume of data. Contrary to popular perception, if large volume is the only dimension of big data you're dealing with, the last thing you need is fancy big data technology.

For example, imagine you're running an advisory for busy (or lazy) quantitative analysts who are looking for statistical arbitrage opportunities. Your entire data set consists of timestamp, exchange, stock symbol, bid price, ask price, and trading volume, and you capture this data for every stock on every major exchange around the world, every second of the trading day. The analysis you run is not time-sensitive; there's a batch job that runs every weekend to select the best stocks to consider for the upcoming week.

Although this situation will generate huge amounts of data, you don't need or want a bunch of Hadoop clusters -- you're much better off sticking with an RDBMS like Oracle or MySQL. Not only is your data structure simple and well-defined, but that structure is exactly what the required analysis depends on. Even with several petabytes of data (which seems to be the informal threshold these days for big data), a standard Oracle database with a few data marts for analysis can handle this just fine. The analysis doesn't even require all of this data; a well-chosen sample will do. That's not to say you can't boost your confidence levels with an enormous data set, but why complicate your life when it isn't necessary?

For these reasons, I suggest starting your big data journey with large volumes of structured data if possible. You won't necessarily have a competitive advantage at this point, because you won't have done anything that's really a breakthrough, but you will have a comfortable and familiar place to start.
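The stock-quote scenario in this section can be sketched in a few lines. This is only an illustration under stated assumptions: it uses Python's built-in sqlite3 module as a stand-in for Oracle or MySQL, the table name `quotes`, the helper names, and the bid-ask spread metric are all hypothetical, and the rows are synthetic. It shows the two points made above: the six fields fit a plain relational table, and the weekend batch job can work from a random sample rather than every row.

```python
import random
import sqlite3


def build_quotes_db(rows):
    """Create an in-memory table holding the six fields named in the article."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        CREATE TABLE quotes (
            ts       INTEGER NOT NULL,  -- Unix timestamp, one row per second
            exchange TEXT    NOT NULL,
            symbol   TEXT    NOT NULL,
            bid      REAL    NOT NULL,
            ask      REAL    NOT NULL,
            volume   INTEGER NOT NULL
        )
        """
    )
    conn.executemany("INSERT INTO quotes VALUES (?, ?, ?, ?, ?, ?)", rows)
    conn.commit()
    return conn


def sampled_mean_spread(conn, sample_size, seed=0):
    """Estimate the mean bid-ask spread per symbol from a random sample of rows."""
    rowids = [r[0] for r in conn.execute("SELECT rowid FROM quotes ORDER BY rowid")]
    rng = random.Random(seed)  # fixed seed so the batch run is repeatable
    chosen = rng.sample(rowids, min(sample_size, len(rowids)))

    # Stage the sampled row ids in a temp table and join against it.
    conn.execute("CREATE TEMP TABLE IF NOT EXISTS sample_ids (id INTEGER PRIMARY KEY)")
    conn.execute("DELETE FROM sample_ids")
    conn.executemany("INSERT INTO sample_ids VALUES (?)", [(i,) for i in chosen])

    query = (
        "SELECT q.symbol, AVG(q.ask - q.bid) "
        "FROM quotes q JOIN sample_ids s ON q.rowid = s.id "
        "GROUP BY q.symbol ORDER BY q.symbol"
    )
    return dict(conn.execute(query).fetchall())


# Synthetic data: two made-up symbols, 1,000 seconds of quotes each.
rows = [
    (1_700_000_000 + i, "NYSE", symbol, 100.0, 100.0 + spread, 500)
    for i in range(1000)
    for symbol, spread in (("AAA", 0.25), ("BBB", 0.5))
]
conn = build_quotes_db(rows)
print(sampled_mean_spread(conn, sample_size=200))
```

Because the synthetic spreads are constant, the sampled estimate matches the full-table average here; with real quotes the sample size would control how close the estimate gets.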

2: Get bigger with high velocity

Now you have a lot of data, but can you process it in real time? This is where things get interesting. You must ask yourself (and answer honestly): Is there a competitive advantage in, or a requirement for, processing the data in real time?

This reminds me of discussions I've had with clients about the prospect of creating an operational data store (ODS). Traditionally, an ODS sits between your transactional system (e.g., ERP) and your enterprise data warehouse (EDW), serving as a point of transient aggregation and analysis.

An ODS is classified by how often it refreshes: a Class 1 ODS refreshes in near-real-time, and a Class 3 ODS refreshes once a day. When I was working with Sun Microsystems, we had a Class 2 ODS that refreshed three times a day. Many clients think they need a Class 1 ODS until they see the price tag, and then they realize a Class 3 would probably suit their needs better.
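The "which class do you really need?" question can be framed as a small lookup. The sketch below is hypothetical: the function and table names are invented for illustration, the Class 1 interval is an assumed stand-in for "near-real-time", and the Class 2 interval simply reuses the three-times-a-day figure from the example above rather than any formal definition. It picks the slowest-refreshing class that still keeps data within a tolerated staleness.

```python
# Refresh intervals in hours. The Class 3 figure (once a day) comes from the
# article; Class 2 reuses the article's three-times-a-day example; the Class 1
# value is an assumption standing in for "near-real-time".
ODS_REFRESH_HOURS = {
    1: 1 / 60,  # assumption: roughly once a minute
    2: 8.0,     # three times a day
    3: 24.0,    # once a day
}


def slowest_acceptable_ods_class(max_staleness_hours):
    """Return the slowest-refreshing class that meets the staleness limit, or None."""
    acceptable = [
        ods_class
        for ods_class, interval in ODS_REFRESH_HOURS.items()
        if interval <= max_staleness_hours
    ]
    return max(acceptable) if acceptable else None


print(slowest_acceptable_ods_class(24))   # data may be up to a day old
print(slowest_acceptable_ods_class(12))   # data may be up to half a day old
print(slowest_acceptable_ods_class(0.5))  # data may be at most 30 minutes old
```

Starting from the tolerated staleness, rather than from the fastest option available, mirrors the point that many clients who ask for Class 1 would be well served by Class 3.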

In the new age of big data, the architecture that supports the use of an ODS is questionable, yet the fundamental question that drives its design considerations remains: Do you really need to process this information in real time?

Although the cost consideration has waned somewhat, it has been replaced with this risk consideration: Do you need to have near-real-time processing so badly that you'll assume the risk of emerging technologies that relatively few people fully understand?

If the answer is yes, then move to a distributed file system like the Hadoop Distributed File System (HDFS). A massive amount of real-time input and output is an RDBMS's kryptonite, and it's also where something like HDFS will shine. If the answer is no, then stick to the old technology -- it's less risky, and the cost of ownership is lower. Even though the hardware may be cheaper with the HDFS route, you'll pay for it with expensive specialists and everything else that goes along with the big data vortex. For these reasons, I suggest that your second move with big data be to tackle the velocity problem.

3: Tackle the biggest challenge -- multiple formats

Variety, the third V of big data, is where the buck stops with traditional technology and analysis methods. Unstructured data will put a traditional RDBMS out of business -- RDBMSs weren't built for it.

Even before the big data craze, Oracle tried to accommodate unstructured data in its RDBMS with Binary Large Objects (BLOBs) and such, though it's never worked that well, and it's been a pain to deal with. And although creative analysis techniques like Natural Language Processing (NLP) have been around for about 60 years, up until this point, the mainstream of analytics for thousands of years has been focused on clean, structured data. Unstructured data is relatively new territory for data scientists.

Although the biggest challenges with big data lie with the variety of data, variety also presents the biggest opportunities. My definition of big data when used for competitive purposes is: Big data is the massive amount of rapidly moving and freely available data that potentially serves a valuable and unique need in the marketplace, but is extremely expensive and difficult to mine by traditional means. This last clause is very important; trying to use statistical techniques that are a thousand years old on stable technology that's been around for decades is not going to produce a breakthrough.

The ability to process large amounts of data very quickly is good; however, when you can do this across a wide range of unstructured formats (video, audio, free-flowing text), you're entering an area where few people can play. It's also the point where very bright data scientists and whiz-bang technology are absolutely required. For these reasons, I suggest that you save this piece of your big data puzzle for last.


When employing big data in your corporate strategy, it's important to attack it with purpose, but don't be quixotic about it. If at all possible, start with large volumes, move to velocity, and then take on the variety challenge. This sensible approach will prevent you from taking on too much initial risk. I get sink or swim, but at least give yourself a chance!