Data stores continue to be overwhelmed by big data, so why don’t data center managers get rid of excess big data that isn’t of use?

The main reason is fear of missing out on any possible uses of big data analytics. There is an ever-present thought that the VP of marketing might one day ask for a long-term trend analysis of product sales over the past 20 years. Companies have made use of data that old, and you never know where new governance and regulatory requirements might take you, so why not hold on to the data to be safe?

There is also the very real possibility that these vast stores of data will go unused for years and even decades, while companies continue to steadfastly store and maintain them. Gartner refers to this unused data as “dark” data, and defines it as “the information assets organizations collect, process and store during regular business activities, but generally fail to use for other purposes (for example, analytics, business relationships and direct monetizing). Similar to dark matter in physics, dark data often comprises most organizations’ universe of information assets. Thus, organizations often retain dark data for compliance purposes only. Storing and securing data typically incurs more expense (and sometimes greater risk) than value because often organizations don’t classify it or intend to use it.”

In an InformationWeek article, Nuix’s Julie Colgan stated, “Data is dark when we don’t know it exists, when we can’t find it, when we can’t interpret it, and when we can’t share or interface with it.” Colgan is the director of information governance solutions for Nuix, which helps companies manage growing volumes of unidentified, unstructured data that sits in their storage repositories. “Sometimes data goes dark because we’re simply too busy to deal with it, so we push it to the side and ignore it.”

So, how can you “lighten up” dark data and still ensure you retain necessary data? Here are three suggestions.

1: Filter data

If you are using machine- or internet-generated big data, you’re getting a lot of noise as well as useful information. Data filtration that can isolate the information you want and eliminate the rest is one way to purify data feeds before you end up with a lot of unidentifiable junk in your data repositories. Vendors and tools can help you with this data cleansing process, but they can’t help if you haven’t identified the present and most likely future pieces of data that your business will need.
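The filtering step above can be sketched in a few lines. This is a minimal illustration, not a production pipeline; the field names (`device_id`, `temp_c`) and the notion of what counts as “noise” are hypothetical stand-ins for whatever fields your business has identified as useful:

```python
# Minimal sketch: filter a machine-generated data feed before it reaches storage.
# Field names and the noise criteria are hypothetical examples.

def filter_feed(records, required_fields=("device_id", "temp_c")):
    """Keep only well-formed records that carry the fields the business needs."""
    clean = []
    for rec in records:
        # Drop records that are missing required fields or carry null values.
        if all(rec.get(f) is not None for f in required_fields):
            # Retain only the identified useful fields; discard the rest as noise.
            clean.append({f: rec[f] for f in required_fields})
    return clean

feed = [
    {"device_id": "a1", "temp_c": 21.5, "debug_blob": "..."},
    {"device_id": None, "temp_c": 19.0},   # noise: no device identified
    {"device_id": "a2"},                   # noise: no reading
]
print(filter_feed(feed))  # → [{'device_id': 'a1', 'temp_c': 21.5}]
```

The key point is that the filter encodes a business decision (which fields matter) rather than a purely technical one, which is why the vendors and tools mentioned above cannot do this step for you.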

2: Export data

If you are concerned about retaining information for decades for purposes of governance or long-term trends analysis, start exporting this data to a trusted cloud-based vendor for safekeeping. You can bring the data back into your data center for analysis when the time comes.
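As a rough sketch of that export-and-restore round trip, the fragment below compresses rarely used records for shipment to an archival store and brings them back intact. A local file path stands in for the cloud vendor’s object store; in practice you would write to the vendor’s API instead:

```python
import gzip
import json
import os
import tempfile

def archive_dataset(records, dest_path):
    """Compress rarely used records for shipment to an archival store.
    dest_path stands in here for a cloud object key (a hypothetical target)."""
    with gzip.open(dest_path, "wt", encoding="utf-8") as f:
        json.dump(records, f)
    return os.path.getsize(dest_path)  # bytes actually shipped off-site

def restore_dataset(src_path):
    """Bring archived records back into the data center when analysis time comes."""
    with gzip.open(src_path, "rt", encoding="utf-8") as f:
        return json.load(f)

old_sales = [{"year": 2004, "units": 1200}, {"year": 2005, "units": 1350}]
path = os.path.join(tempfile.gettempdir(), "sales_archive.json.gz")
archive_dataset(old_sales, path)
assert restore_dataset(path) == old_sales  # round trip preserves the data
```

Using a self-describing, compressed format for the archive matters more than the particular vendor: data you may not touch for decades must remain readable without the systems that originally produced it.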

3: Define data retention policies

Be as aggressive in defining data retention policies with business users for big data as you are for systems-of-record data. This is a hallmark of excellent data center management.
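A retention policy is ultimately a schedule agreed with business users, and it can be expressed as data. The sketch below uses hypothetical data classes and retention periods; the rule that unclassified data is held until a policy is defined is one reasonable convention, not a mandate:

```python
from datetime import date

# Hypothetical retention schedule, agreed with business users per data class.
RETENTION_YEARS = {
    "system_of_record": 7,
    "sensor_feed": 1,
    "web_clickstream": 2,
}

def is_expired(data_class, created, today=None):
    """True if a record has outlived its agreed retention period."""
    today = today or date.today()
    years = RETENTION_YEARS.get(data_class)
    if years is None:
        return False  # unclassified data: hold until a policy is defined
    age_years = (today - created).days / 365.25
    return age_years > years

# A two-year-old sensor feed exceeds its one-year retention period.
print(is_expired("sensor_feed", date(2012, 1, 1), today=date(2014, 1, 1)))  # → True
```

Making the schedule explicit like this lets a scheduled cleanup job enforce it uniformly, which is exactly the aggressiveness applied to systems of record that the suggestion above asks you to extend to big data.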


Will this solve all of your big data storage management and safekeeping problems? No, but it will go a long way towards getting a handle on the rivers of data that flow into your data center every day. It will also enable you to meet the demands of long-term forecasting and data governance that could come your way.