Even 30 years ago, IT workers knew about data wrangling — it was the tedious data mapping work from data repositories into user interfaces or from one application to another that the IT "grunts" and interns did. Like modern data wrangling, this often thankless task involved cleaning data, connecting tools and applications, and getting data into a usable format. No one wanted to do it, but if it wasn't done, nothing else worked.
Data wrangling — defined as "the process of manually converting or mapping data from one 'raw' form into another format that allows for more convenient consumption of the data with the help of semi-automated tools" — is an even larger chore in the big data world, which wants to combine structured and unstructured data from myriad sources and is anything but orderly when the data first arrives in raw form.
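To make that definition concrete, here is a minimal sketch of mapping "raw" records into a consistent format. The field names, sample rows, and date formats are all hypothetical and chosen only for illustration. Real wrangling jobs deal with far messier input and much larger volumes.

```python
from datetime import datetime

# Hypothetical raw records, as they might arrive from different sources.
RAW_ROWS = [
    {"store": " Store-001 ", "units": "1,200", "week": "2014-08-04"},
    {"store": "store-002", "units": "950", "week": "08/04/2014"},
    {"store": "STORE-003", "units": "n/a", "week": "2014-08-04"},
]

# Assumed date formats; a real feed may use others.
DATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y")


def parse_week(text):
    """Try each known date format; return an ISO date string or None."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None


def parse_units(text):
    """Strip thousands separators; return an int, or None if not numeric."""
    cleaned = text.replace(",", "").strip()
    return int(cleaned) if cleaned.isdigit() else None


def wrangle(row):
    """Map one raw record into a consistent, typed structure."""
    return {
        "store": row["store"].strip().upper(),
        "units": parse_units(row["units"]),
        "week": parse_week(row["week"]),
    }


CLEAN_ROWS = [wrangle(r) for r in RAW_ROWS]
```

Values that cannot be parsed come back as None instead of being guessed at, so questionable records stay visible for later review.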
A New York Times article from August 2014 described data wrangling as "handcrafted janitor work" where data scientists "spend from 50 percent to 80 percent of their time mired in this mundane labor of collecting and preparing unruly digital data before it can be explored for useful nuggets." Monica Rogati, vice president of data science at Jawbone, a digital device manufacturer, seconded the motion by adding, "It's something that is not appreciated by data civilians. At times, it feels like everything we do."
Converting big data into usable data remains a painstaking and time-consuming job.
Solutions that can help you wrangle data
This is the space that companies including Trifacta are focusing on as they target big data end users and provide products that can clean, structure, and enrich data for end-reporting mechanisms like Tableau or Excel spreadsheets.
"We run our software on top of Hadoop and prepare Hadoop data for end user consumption and use," said Will Davis, marketing director for product management at Trifacta. "We want to alleviate the pain points that users experience working with big data and to enable non-IT business users to consume it, whether it is centrally stored in IT or locally resident on their desktops."
How is this done?
"The tool presents a visual representation of the data," said Alon Bartur, Trifacta's principle product manager. "It makes certain assumptions concerning the structuring of this data, and the user sees these assumptions by indicators that assess what the likely quality level is of each piece of data. Users know immediately from the indicators whether the data that they are seeing is of high quality or whether it is questionable and might require additional investigation. The user interface is designed for point and click interactions and the system gives the users suggestions of how to organize data reports, as well as certain data transforms that the user can run and what the likely outcomes of these transforms are."
70% reduction in data prep time
Davis mentioned that PepsiCo wanted to improve its product planning and replenishment forecasts for business partners like grocery stores that also offered food products. It wanted to know on a more granular level how much product was being shipped to which stores in these large chains, and what weekly and quarterly consumption rates were for each store location. Analytics reports had to be developed by end users and then rolled up into a total sales forecast that would be used to inform production cycles.
Formerly, the company performed this exercise by painstakingly dumping data into a Microsoft Access database and then manually creating a series of Excel reports. Because there was so much manual work, errors were introduced into the process.
By moving to Hadoop-generated data, using a top-level data wrangling tool like Trifacta, and then feeding the data into Tableau reports that could be shared with executive management, the report team reduced its data prep time by 70% and its overall prep time by 90%. For the company, the end results were faster time to value for the information and faster reaction time for forecast adjustments.
A step in the right direction
Trifacta's solution is a step in the right direction, although it is still best suited for generic everyday use and may not be able to get into the fine points of data refinements that are needed for more precise analytics. Trifacta recognizes this and is continuing to advance its products.
Mary E. Shacklett is president of Transworld Data, a technology research and market development firm. Prior to founding the company, Mary was Senior Vice President of Marketing and Technology at TCCU, Inc., a financial services firm; Vice President of Product Research and Software Development for Summit Information Systems, a computer software company; and Vice President of Strategic Planning and Technology at FSI International, a multinational manufacturing company in the semiconductor industry. Mary is a keynote speaker and has more than 1,000 articles, research studies, and technology publications in print.