When my mother was a child growing up in Hawaii, her first swimming lesson was
basically a short boat trip to the Pacific Ocean and a strong shove. If my
grandparents had chosen to employ this technique in the Molokai Channel — one of the most treacherous swimming channels in the world — I
probably wouldn’t be here to tell this story.
What's the connection to big data?
I've often criticized companies for spending huge amounts of money on fancy big data
hardware and data scientists only to splash around in the shallow end without
any real idea of what to do with the data. But that doesn’t mean powering
through all three Vs — volume, velocity, and variety — at the same time from the
start. If you’re going to take the big data challenge, it’s best to paddle your
way through volume, velocity, and variety one at a time.
Start big with large volumes
Volume is the big data V most people get wrong because of confusion around the topic.
When people think of anything big,
volume is intuitively the first thing that comes to mind. And since the
conversation around big data is usually coupled with newfangled technologies
like Hadoop, the listener associates these
technologies with a large volume of data. Contrary to popular perception, if
large volume is the only dimension of big data you’re dealing with, the last
thing you need is fancy big data technology.
For example, imagine you’re running an advisory for busy (or lazy) quantitative
analysts who are looking for statistical arbitrage opportunities. Your entire
data set consists of timestamp, exchange, stock symbol, bid price, ask price,
and trading volume, and you capture this data for every stock on every major
exchange around the world, every second of the trading day. The analysis you
run is not time-sensitive; there’s a batch job that runs every weekend to
select the best stocks to consider for the upcoming week.
Although this situation will generate huge amounts of data, you don’t need or want a
bunch of Hadoop clusters; you’re much better off sticking with an RDBMS like Oracle or MySQL. Not only is your data structure simple
and well-defined, but that structure is also necessary to support the required analysis.
Even with several petabytes of data (which seems to be the informal threshold
these days for big data), a standard Oracle database with a few data marts for
analysis can handle this just fine. The analysis doesn’t require all this data — that’s where
sampling comes into play. Not to say that you can’t boost your confidence
levels with an enormous data set, but why complicate your life when it isn’t
necessary?
For these reasons, I suggest starting your big data journey with large volumes of
structured data if possible. You won’t necessarily have a competitive advantage
at this point, because you will not have done anything that’s really a breakthrough,
though it will give you a comfortable and familiar place to start.
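To make the volume-only case concrete, here is a minimal sketch in Python that uses the standard library's sqlite3 module as a stand-in for an RDBMS such as Oracle or MySQL. The table name `quotes`, the column names, the exchange and symbol values, and the synthetic rows are all assumptions made up for illustration; only the list of fields comes from the example above. It shows that the data set fits a single relational table, and that a random sample can give an estimate close to the figure computed from every row.

```python
import random
import sqlite3
import statistics


def build_quotes(conn, n_rows, seed=0):
    """Create a hypothetical quotes table and fill it with synthetic rows."""
    rng = random.Random(seed)
    conn.execute(
        """CREATE TABLE quotes (
               ts INTEGER, exchange TEXT, symbol TEXT,
               bid REAL, ask REAL, volume INTEGER)"""
    )
    rows = []
    for i in range(n_rows):
        bid = round(rng.uniform(10, 100), 2)
        spread = round(rng.uniform(0.01, 0.50), 2)
        rows.append((i, rng.choice(["NYSE", "LSE", "TSE"]),
                     rng.choice(["AAA", "BBB", "CCC"]),
                     bid, bid + spread, rng.randint(100, 10_000)))
    conn.executemany("INSERT INTO quotes VALUES (?, ?, ?, ?, ?, ?)", rows)


def mean_spread_by_symbol(conn):
    """Batch-style query: average bid/ask spread per symbol over all rows."""
    cur = conn.execute(
        "SELECT symbol, AVG(ask - bid) FROM quotes "
        "GROUP BY symbol ORDER BY symbol"
    )
    return dict(cur.fetchall())


def sampled_mean_spread(conn, sample_size, seed=1):
    """Estimate the overall mean spread from a simple random sample of rows."""
    # For a sketch we pull the spreads into memory and sample there.
    spreads = [row[0] for row in conn.execute("SELECT ask - bid FROM quotes")]
    sample = random.Random(seed).sample(spreads, sample_size)
    return statistics.mean(sample)


conn = sqlite3.connect(":memory:")
build_quotes(conn, 20_000)
full = conn.execute("SELECT AVG(ask - bid) FROM quotes").fetchone()[0]
estimate = sampled_mean_spread(conn, 2_000)
print(f"full mean spread: {full:.4f}, sample estimate: {estimate:.4f}")
```

In this sketch the sample is a tenth of the rows, yet the estimate should land close to the full-table average. A real system would sample inside the database rather than in application memory.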
Get bigger with high velocity
So you have a lot of data, but can you process it in real-time? This is where things
get interesting. You must ask yourself (and answer honestly): Is there a
competitive advantage or requirement to processing the data in real-time?
This reminds me of discussions I’ve had with clients about the prospect of creating
an operational data store (ODS). Traditionally, an ODS sits between your
transactional system (e.g., ERP) and your enterprise data warehouse (EDW),
serving as a point of transient aggregation and analysis.
An ODS is classified by how often it refreshes: a Class 1 ODS refreshes in near-real-time,
and a Class 3 ODS refreshes once a day. When I was working with Sun
Microsystems, we had a Class 2 ODS that refreshed three times a day. Many
clients think they need a Class 1 ODS until they see the price tag, and then
they realize a Class 3 would probably suit their needs better.
In the new age of big data, the architecture that supports the use of an ODS is
questionable, yet the fundamental question that drives its design
considerations remains: Do you really
need to process this information in real-time?
Although the cost consideration has waned somewhat, it has been replaced with this risk
consideration: Do you need to have near-real-time processing so badly that
you’ll assume the risk of emerging technologies that relatively few people
understand?
If the answer is yes, then move to a distributed file system like the Hadoop Distributed File System (HDFS). A massive amount of real-time input and output
is an RDBMS’s kryptonite, and it’s also where something like HDFS will shine. If
the answer is no, then stick to the old technology — it’s less risky, and the
cost of ownership is less. Even though the hardware may be cheaper with the
HDFS route, you’ll pay for it with expensive resources and everything else that
goes along with the big data vortex. For these reasons, I suggest your second
move with big data is to tackle the velocity problem.
Tackle the biggest challenge — multiple formats
Variety, the third V of big data, is where the buck stops with traditional technology
and analysis methods. Unstructured data will put a traditional RDBMS out of
business – RDBMSs weren’t built for it.
Even before the big data craze, Oracle tried to accommodate unstructured data in its RDBMS with Binary Large
Objects (BLOBs) and such, though it’s never worked that well, and it’s been a
pain to deal with. And although creative analysis techniques like Natural
Language Processing (NLP) have been around for about 60 years, up until this
point, the mainstream of analytics for thousands of years has been focused on
clean, structured data. Unstructured data is relatively new territory for data
scientists.
Although the biggest challenges with big data lie with the variety of data, they also
present the biggest opportunities. My definition of big data when used for
competitive purposes is: Big data is the massive amount of rapidly moving and
freely available data that potentially serves a valuable and unique need in the
marketplace, but is extremely
expensive and difficult to mine by traditional means. This last clause
is very important; trying to use statistical techniques that are a thousand
years old on stable technology that’s been around for decades is not going to
produce a breakthrough.
The ability to process large amounts of data very quickly is good; however, when
you can do this across a wide range of unstructured formats (video, audio,
free-flowing text), you’re entering an area where few people can play. It’s
also the point where very bright data scientists and whiz-bang technology are
absolutely required. For these reasons, I suggest that you save this piece of
your big data puzzle for last.
When employing big data in your corporate strategy, it’s important to attack this
with a purpose, but don’t be quixotic about it. If at all possible, start with
large volumes, move to velocity, and then take on the variety challenge. This sensible
approach will prevent you from taking on too much initial risk. I get sink or
swim, but at least give yourself a chance!