Messy data is slowing down machine learning projects and driving up costs

A survey finds that bad data leads to bad business decisions like miscalculating demand and targeting the wrong customer prospects.


The "garbage in, garbage out" warning about bad data is more relevant than ever as datasets grow ever more enormous and drive ever more business decisions. 

A new survey from Trifacta quantified the impact of bad data on machine learning (ML) projects: a third of respondents said that poor data quality causes ML projects to take longer, cost more and fail to hit the anticipated results.

About a third of respondents listed those problems as the biggest challenges to artificial intelligence and ML implementations. A majority of respondents said that companies are making business decisions based on the analysis of this messy data, which means AI projects could become a financial liability instead of an asset. Survey respondents said that using bad data could result in:

  • Miscalculating demand          59%
  • Targeting the wrong prospects  26%

Even C-suite leaders are worried, with 75% stating they are not confident in the quality of their data.

This survey reinforces other research that found data analysts are spending most of their time cleaning data instead of analyzing it. About a quarter of respondents said they were spending 20 hours or more to prepare data for an AI/ML initiative:

  • 1 to 4 hours      28%
  • 5 to 9 hours      25%
  • 10 to 19 hours    22%
  • 20 hours or more  24%

Cleaning the data is worth the time investment, according to the survey: 29% of respondents said their data was completely accurate and 51% said it was very accurate after cleaning. The report recommends deduplication, data validation, and analyzing relationships between fields as the best ways to improve data accuracy.
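
To make those three recommendations concrete, here is a minimal Python sketch of what each step might look like on a small set of order records. The records, field names (`order_id`, `email`, `quantity`, `ordered`, `shipped`) and the specific checks are assumptions invented for illustration; they do not come from the survey or from Trifacta's products.

```python
from datetime import date

# Hypothetical order records; every field name and value is illustrative.
records = [
    {"order_id": "A1", "email": "pat@example.com", "quantity": 2,
     "ordered": date(2019, 8, 1), "shipped": date(2019, 8, 3)},
    # Exact duplicate of the first record.
    {"order_id": "A1", "email": "pat@example.com", "quantity": 2,
     "ordered": date(2019, 8, 1), "shipped": date(2019, 8, 3)},
    # Fails single-field validation: the email has no "@".
    {"order_id": "A2", "email": "not-an-email", "quantity": 1,
     "ordered": date(2019, 8, 2), "shipped": date(2019, 8, 4)},
    # Fails the cross-field check: shipped before it was ordered.
    {"order_id": "A3", "email": "lee@example.com", "quantity": 5,
     "ordered": date(2019, 8, 9), "shipped": date(2019, 8, 7)},
    {"order_id": "A4", "email": "sam@example.com", "quantity": 3,
     "ordered": date(2019, 8, 5), "shipped": date(2019, 8, 6)},
]


def deduplicate(rows, key="order_id"):
    """Deduplication: keep the first row seen for each key value."""
    seen = set()
    unique = []
    for row in rows:
        if row[key] not in seen:
            seen.add(row[key])
            unique.append(row)
    return unique


def is_valid(row):
    """Data validation: a rough email check and a positive quantity."""
    return "@" in row["email"] and row["quantity"] > 0


def dates_consistent(row):
    """Relationship between fields: an order cannot ship before it is placed."""
    return row["shipped"] >= row["ordered"]


def clean(rows):
    """Apply the three steps in order and return the surviving rows."""
    return [r for r in deduplicate(rows) if is_valid(r) and dates_consistent(r)]


if __name__ == "__main__":
    print([r["order_id"] for r in clean(records)])
```

In practice the rules would be specific to each dataset, and rejected rows would usually be logged for review instead of silently dropped.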

Trifacta also recommends analyzing third-party data from customers, semi-structured data or data from relational databases to improve data quality. 


The survey found that companies are most interested in customer data, financial data, employee data, and sales transactions. Only 14% of survey respondents said they had access to all the data sources they needed. Bringing in external data presents a new set of challenges. Survey respondents said that common barriers to combining data sources are:

  • Combining data from different systems  28%
  • Merging from different sources         27%
  • Reformatting                           25%

The report said that legacy, compartmentalized data integration systems can't handle the speed, scale, and diversity of today's data. Companies will only see the benefits of AI and cloud computing to the extent that internal data is usable.

The results in this report are from an online survey that was conducted in August 2019 by Researchscape International and Trifacta. Eighteen percent of respondents were at the vice president or C-suite level, with 28% in director and executive roles and 30% working as analysts. Trifacta's products automate data preparation.
