Enterprises increasingly depend on data, but what if that data is incorrect? What if, for example, you’re a hotel chain that relies on algorithms to correctly calculate the price of hotel rooms, but the inflowing data is wrong? No matter how smart that algorithm, the hotel prices will be dumb. This is a true story for a European hotel chain, and the company helping it ensure data quality is Soda.
It turns out that data needn’t be too different from software. In software, developers use unit testing to ensure code quality; the analog for data is data testing. Similarly, in software, a massive industry has been built up around application monitoring (including observability). Now there’s data monitoring.
“It’s inevitable that data will break,” Tom Baeyens, co-founder and chief technology officer of Soda, said in an interview. “You cannot prevent mistakes. The only thing you can do is start chasing them and be the first to know, and that’s where data monitoring and testing comes in.”
It’s a new market, sitting at the nexus of IT and lines of business. And, given the importance of data, it’s destined to be a very big market.
Open sourcing data quality
Given Baeyens’s past, it’s not surprising that he’d bring an open source approach to the problem of data quality. I first knew Baeyens back when he was at open source pioneer JBoss, which was then acquired by Red Hat. Later Baeyens started his own open source business process management company (Activiti), which was acquired by open source content management company Alfresco. Soda, in short, is not his first (or third!) foray into open source.
Recently Baeyens took the next step in his open source journey, open sourcing Soda SQL, which offers configurable, open source SQL data testing capabilities:
The configuration options within Soda SQL enable data engineers to control the tests set to screen for bad data and the metrics that are used to evaluate the results. Soda SQL uses efficient SQL requests to extract data metrics and column profiles with full control over the queries provided through declarative YAML configuration files. The tests run by Soda SQL are performed across the data pipeline and trigger alerts when problematic or bad data is found. The results can be viewed directly and used to catch problems, quarantine bad data and send updates to the Soda Enterprise data monitoring. This enables individual testing by data engineers to be integrated with the enterprise-wide data testing strategy.
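To make the idea of SQL-based data testing concrete, here is a minimal sketch in Python. It is not Soda SQL's actual API or configuration format. The table name `room_rates`, the sample rows, the `TESTS` dictionary, and the `run_tests` helper are all hypothetical, and an in-memory SQLite database stands in for a real warehouse. The sketch computes several column metrics in one aggregate query and then checks each metric against a declared expectation, which is the general pattern the description above points to.

```python
import sqlite3

# Hypothetical declarative test spec: metric name -> (SQL aggregate, pass condition).
# A real tool would more likely read this from a YAML file; a dict keeps the
# sketch dependency-free.
TESTS = {
    "row_count": ("COUNT(*)", lambda v: v > 0),
    "missing_price": (
        "SUM(CASE WHEN price IS NULL THEN 1 ELSE 0 END)",
        lambda v: v == 0,
    ),
    "min_price": ("MIN(price)", lambda v: v is not None and v >= 0),
}


def run_tests(conn, table, tests):
    """Compute all metrics in a single query, then evaluate each test."""
    names = list(tests)
    select_list = ", ".join(f"{tests[n][0]} AS {n}" for n in names)
    # The table name comes from trusted configuration in this sketch.
    row = conn.execute(f"SELECT {select_list} FROM {table}").fetchone()
    metrics = dict(zip(names, row))
    failures = [n for n in names if not tests[n][1](metrics[n])]
    return metrics, failures


def demo():
    """Build a tiny made-up table with one missing and one negative price."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE room_rates (hotel TEXT, price REAL)")
    conn.executemany(
        "INSERT INTO room_rates VALUES (?, ?)",
        [("Ghent", 120.0), ("Antwerp", None), ("Bruges", -5.0)],
    )
    return run_tests(conn, "room_rates", TESTS)


if __name__ == "__main__":
    metrics, failures = demo()
    print(metrics)
    print("failed tests:", failures)
```

In a pipeline, a non-empty list of failures would be the point at which to raise an alert or quarantine the batch before it reaches a pricing algorithm or dashboard.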
But how does this work within the enterprise?
While Soda SQL is more geared toward data engineers, Soda also offers a hosted service geared toward the business user and, specifically, the chief data officer (CDO). Interest in data testing and monitoring might start with the CDO when they recognize the need to ensure quality data feeding executive dashboards, machine learning models, and more.
At the same time, data engineers, responsible for building data pipelines (transforming, extracting, and preparing data for usage), just need to do some minimal checks to ensure they’re not shipping faulty data. Or, you might have a data platform engineer who just wants hands-off monitoring after connecting to the data platform warehouse.
In this universe, data testing and data monitoring are two distinct things. In both cases, Baeyens said, “The large majority of people with which we speak have an uncomfortable feeling that they should be doing more with data validation, data testing, and monitoring, but they don’t know where to start, or it’s just kind of blurry for them.”
Soda is trying to democratize data monitoring, in particular, by making it easy for non-technical, business-oriented people to build the data monitors. Given Baeyens’s past with business process management (BPM), and how BPM allows non-technical people to architect business processes, it’s not surprising this would be a focal area for Soda.
Will it work? Time will tell, but one thing is clear: As data grows more important, ensuring its quality and integrity matters even more.
Disclosure: I work for AWS, but the views expressed herein are mine.