Before “Big Data” became a household name in most enterprises, I was visiting with a CEO of a major online movie download firm. “We live and die with the ability of our customers being able to get the movie titles that they want from us instantly,” he told me. “To do this, we make sure that the most popular titles for any given day are the easiest to access.”
Meeting this objective requires manipulating big data in the form of movie files so that ease of access and retrieval is aligned with what customers are asking for. In one sense, big data access and retrieval runs counter to traditional IT thinking, which through the years has prioritized storage of and access to transactions on the basis of time (first in, first out, or FIFO) and has set up data so that the freshest data records from the standpoint of time are always first.
In the case of big data like popular movie download requests, time remains an important access and retrieval parameter (since you want to know today’s most popular titles), but the data itself (in the form of the movie) also must be stored and prioritized for easy access. This need to analyze and prioritize storage by the content of the data, as well as by the timestamp of requests for it, is what complicates the data access and storage strategy for big data.
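To make the idea of combining time and content concrete, here is a minimal sketch of how "today's most popular titles" might be derived from a request log. The log format (a timestamp paired with a title), the function name top_titles_for_day, and the sample data are all assumptions made for illustration; they do not describe the movie firm's actual system.

```python
from collections import Counter
from datetime import date, datetime


def top_titles_for_day(requests, day, n=3):
    """Return the n most requested titles on the given day.

    requests: iterable of (timestamp, title) pairs (hypothetical log format).
    """
    # Time narrows the log to one day; content (the title) drives the ranking.
    counts = Counter(title for ts, title in requests if ts.date() == day)
    return [title for title, _ in counts.most_common(n)]


# Hypothetical request log: when the request arrived and which title was asked for.
sample_requests = [
    (datetime(2013, 5, 1, 9, 0), "Title A"),
    (datetime(2013, 5, 1, 9, 5), "Title B"),
    (datetime(2013, 5, 1, 9, 7), "Title A"),
    (datetime(2013, 4, 30, 22, 0), "Title C"),
    (datetime(2013, 4, 30, 22, 5), "Title C"),
    (datetime(2013, 4, 30, 23, 0), "Title C"),
]

todays_top = top_titles_for_day(sample_requests, date(2013, 5, 1), n=2)
```

Note that "Title C" has the most requests overall but none on the chosen day, so it drops out of the ranking. That is the sense in which time still matters even when storage is prioritized by content.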
Let’s take a look at storage:
In traditional transaction processing environments, sites have gotten by for years with storage techniques like “striping” data (http://searchstorage.techtarget.com/definition/disk-striping) across inexpensive hard disks. Striping spreads data across several disks so that reads and writes can proceed in parallel. Some sites pair it with a practice commonly called short stroking, in which data is written to only about 20 percent of each disk and the other 80 percent goes unused. The goal is to keep data access and retrieval “moving”: the app only needs to read through 20 percent of each hard disk to find a transaction.
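As a rough illustration of the placement behind striping, the sketch below maps a logical block number to a disk and an offset on that disk using simple round-robin placement. The function name locate_block and the four-disk layout are assumptions for the example, and real storage controllers stripe in larger units and handle many details omitted here.

```python
def locate_block(logical_block, num_disks):
    """Map a logical block to (disk index, block offset on that disk).

    Consecutive blocks go to consecutive disks, wrapping around, so a
    sequential read is spread across all of the disks.
    """
    if num_disks < 1:
        raise ValueError("num_disks must be at least 1")
    return logical_block % num_disks, logical_block // num_disks


# Where the first eight logical blocks land on a hypothetical four-disk set.
layout = [locate_block(block, num_disks=4) for block in range(8)]
```

Because neighboring blocks sit on different disks, several disks can work on one request at the same time, which is where the speedup comes from.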
But with big data, there is more to it than that.
For instance, you can’t just spread big data across multiple hard disks (each partially used) to make data retrieval faster. Instead, you have to take a look at the data itself, determine (through usage information and file requests) which data is requested most often, and then set up your storage so this data can be accessed rapidly. At the same time, you must look at your other big data files that are not requested often, and assign them to storage (like hard disks) that is more economical (and slower).
This is why the handling of big data requires a tiered storage approach (http://wikibon.org/wiki/v/Tiered_storage) and, for most data centers, the introduction of “rapid” (and expensive) storage solutions like cache memory and solid state disk.
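The placement decision described above can be sketched in a few lines. This is only an illustration under stated assumptions: the tier names "fast" and "slow", the function assign_tiers, the fast_slots capacity limit, and the file names and counts are all hypothetical, and commercial tiering software weighs far more than raw request counts.

```python
def assign_tiers(request_counts, fast_slots):
    """Put the most requested files on the fast tier and the rest on the slow tier.

    request_counts: dict of file name -> number of recent requests.
    fast_slots: how many files the fast (expensive) tier can hold.
    """
    # Rank by request count, highest first; break ties by name for stable output.
    ranked = sorted(request_counts, key=lambda name: (-request_counts[name], name))
    hot = set(ranked[:fast_slots])
    return {name: ("fast" if name in hot else "slow") for name in request_counts}


# Hypothetical usage data gathered from file requests.
counts = {
    "movie_a.mp4": 950,
    "movie_b.mp4": 12,
    "movie_c.mp4": 430,
    "movie_d.mp4": 3,
}

placement = assign_tiers(counts, fast_slots=2)
```

The fast_slots limit stands in for the budget constraint: fast storage is expensive, so only as many files as it can hold are promoted, and everything else stays on cheaper disk.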
In the case of the online movie download business, the IT staff determined that it was necessary to invest in cache memory and solid state disk so they could place their most popular big data movie files on this storage for rapid customer retrieval. IT also invested in tiered storage automation software that could analyze which movie titles were most often requested (and place these on cache or solid state disk) and which big data movie files were seldom requested (and could be placed on slower, cheaper hard disks). The exercise also required IT to reconsider its traditional attitudes about storage, which used to be a “commodity” item that IT just ran out and bought when it needed more, but which now had to be strategically thought out so it could be positioned (and invested in) to support big data.
Movie download services are an important case study for enterprises to consider as they move their data centers into the big data era, because few shops have treated storage and data management as areas of strategic concern. The need for customers, both inside and outside of the business, to get at big data with ease and without complication is going to change this. And while many industry vendors are coming to the rescue with turnkey storage and processing solutions for big data, IT must still develop its own internal best practices and expertise to deliver the best value to the business.