Mary Shacklett outlines six tips for classifying and cleaning up data.
Research firm IDC's Digital Universe Report in 2010 projected a 45x growth in data by 2020, with 80 percent of that data under enterprise management. Equally revealing was the subtitle of that same report: "Extracting Value from Chaos."
Extracting value from chaos is exactly the task corporate IT faces as big data continues its advance to the forefront of mission-critical enterprise computing. That leaves IT with the inglorious job of "housecleaning" this mass of semi-structured and unstructured data. The goals are to make the data readily accessible for answering mission-critical questions, to meet corporate data governance and regulatory standards, to give applications that are being asked to run faster agility rather than latency, and to reduce the amount of data stored so that corporate data centers and storage systems are not deluged.
How do companies "get there"?
#1-Go back to data retention 101
Data retention meetings between IT and end users have been a fact of life since the 1960s. Companies typically meet once per year to review how long data on customers, transactions, financial systems, service systems, etc., should be stored. The knee-jerk response from end users is that they don't want to get rid of anything, and in recent years, regulators in many different industries have also required companies to store data longer.
Because data retention meetings are long and laborious, and also because IT has to attend every one of them, some companies do not hold these meetings as often as they should, and others avoid them altogether. If your company is planning a major role for big data, make sure that data retention meetings are also part of the strategy. Unless you get consensus on which data to keep and for how long, you will never be able to survive the avalanche of data that is coming your way with big data.
#2-Classify your data
One outcome you will definitely want from the data retention discussion is agreement on which data is most important to the business, and on the security levels at which individuals within the company (and even outside of it) should be able to access and update this data. This data classification is important for two reasons: 1) it enables you, through automated tools, to "tag" your data so that the most important and/or most accessed data can be stored on the fastest storage technologies in your data center; and 2) it ensures that individual clearances for the data are properly defined and administered.
#3-Deduplicate your data
Big data repositories often turn up a single memo with graphics that was sent to half a dozen users throughout the company and then stored separately in each of their email accounts. Technologies like data deduplication go through this data and eliminate duplicate copies of the same document, thereby streamlining both data and data access.
#4-Tier your storage
Once data is classified by its importance or frequency of access, it can also be loaded into tiered storage that places the most important and/or most accessed data on cache memory or solid-state drives, and the less accessed data on slower hard drives. This economizes the storage footprint in data centers and also reduces energy costs.
#5-Build quality into your data
No matter how much deduplication, data retention or data classification you do, if the data is incomplete or inaccurate, it is going to compromise any application it is used in. There is no foolproof end-to-end automation that can do all of the data fixing, but there are automated data editing tools that can assist the process by running through your data and flagging potentially erroneous data based on business rules that you provide. Whatever still needs to be corrected can be corrected by hand or with the click of a mouse. Cleaning up data remains a laborious task, but at least there are now some tools that can help.
#6-Be fierce and relentless
Classifying and cleaning up data shouldn't be reserved for the data you already have under management. It is even more important to apply these rules and techniques to incoming data. That means there is less to "clean up" going forward, and that's a good thing!
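As a rough illustration of what applying these techniques to incoming data might look like, the Python sketch below checks each arriving record for exact duplicates and then runs it against user-supplied business rules before accepting it. Everything here is hypothetical: the `IngestGate` class, the rule names, and the `customer_id` and `amount` fields are assumptions made for the example, and the exact-match hashing shown is far simpler than what commercial deduplication products do.

```python
import hashlib
from dataclasses import dataclass, field
from typing import Callable

# A business rule is a (name, predicate) pair; the predicate
# returns True when the record passes the rule.
Rule = tuple[str, Callable[[dict], bool]]


@dataclass
class IngestGate:
    """Hypothetical gate that screens records before they are stored."""
    rules: list[Rule]
    seen_hashes: set[str] = field(default_factory=set)
    accepted: list[dict] = field(default_factory=list)
    flagged: list[tuple[dict, list[str]]] = field(default_factory=list)
    duplicates: int = 0

    @staticmethod
    def fingerprint(record: dict) -> str:
        # Sort by key so field order does not change the hash.
        # Assumes string keys and values with a stable repr().
        canonical = repr(sorted(record.items(), key=lambda kv: kv[0]))
        return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

    def ingest(self, record: dict) -> str:
        # Step 1: drop exact duplicates of records already seen.
        digest = self.fingerprint(record)
        if digest in self.seen_hashes:
            self.duplicates += 1
            return "duplicate"
        self.seen_hashes.add(digest)

        # Step 2: apply the business rules and set aside any record
        # that fails, so a person can review and correct it.
        failures = [name for name, passes in self.rules if not passes(record)]
        if failures:
            self.flagged.append((record, failures))
            return "flagged"

        self.accepted.append(record)
        return "accepted"


# Example rules (assumed for illustration only).
rules: list[Rule] = [
    ("customer_id present", lambda r: bool(r.get("customer_id"))),
    ("amount non-negative",
     lambda r: isinstance(r.get("amount"), (int, float)) and r["amount"] >= 0),
]

gate = IngestGate(rules)
incoming = [
    {"customer_id": "C-100", "amount": 25.0},
    {"amount": 25.0, "customer_id": "C-100"},  # same content, different order
    {"customer_id": "", "amount": -5},         # fails both rules
]
results = [gate.ingest(record) for record in incoming]
```

Flagged records are kept alongside the names of the rules they failed rather than being discarded, which leaves the final correction to a person, in line with the point that no tool can do all of the data fixing on its own.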