Given the importance of data for delivering machine learning and other data science-related workloads, data quality has never been more crucial for enterprises. Small wonder, then, that data quality is the top objective for data teams, according to multiple surveys.
Though companies may all nod in agreement at this statement, actually delivering data quality remains elusive for many. Open source data quality solutions can help, especially for companies that are looking for alternatives to the bigger data quality solutions.
- Why do companies need data quality solutions?
- Benefits of open source data quality solutions
- Top open source data quality tools
Why do companies need data quality solutions?
“It’s inevitable that data will break,” Tom Baeyens, co-founder and CTO of Soda, said in an interview. “You cannot prevent mistakes. The only thing you can do is start chasing them and be the first to know, and that’s where data monitoring and testing come in.”
Even if a company starts with pristine data, entropy sets in. From skewed inventory data to something as simple as misspelled customer names, poor data leads to poor business decisions and customer experiences. To Baeyens’ point, data quality, like bug-free software, is as much about process as anything else.
Data quality isn’t something you buy, but data quality solutions can help enterprises implement the right processes to improve data quality over time. As Talend described in a recent whitepaper, “data quality must be an always-on operation, a continuous and iterative process where you constantly control, validate, and enrich your data; smooth your data flows; and get better insights.”
Benefits of open source data quality solutions
Data quality, generally, can be measured across a number of different factors. These might include data completeness, accuracy, availability or accessibility to relevant users, timeliness, and consistency. Yet, despite increased attention to these aspects of data quality, many enterprises still rely on black-box, proprietary solutions that yield little insight into why the tooling recommends certain actions on a given dataset.
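To make a few of these factors concrete, here is a minimal Python sketch that scores a handful of records for completeness, timeliness and uniqueness (used here as a simple proxy for consistency). The records, field names, 30-day freshness window and function names are all assumptions made up for illustration; real tooling measures these dimensions in far more depth.

```python
from datetime import datetime, timedelta

# Hypothetical customer records; None marks a missing value.
records = [
    {"id": 1, "email": "ann@example.com", "updated": datetime(2024, 1, 10)},
    {"id": 2, "email": None, "updated": datetime(2024, 1, 9)},
    {"id": 3, "email": "cy@example.com", "updated": datetime(2023, 11, 1)},
    {"id": 3, "email": "cy@example.com", "updated": datetime(2023, 11, 1)},
]

def completeness(rows, field):
    """Share of rows where `field` is present (not None)."""
    return sum(r[field] is not None for r in rows) / len(rows)

def timeliness(rows, field, as_of, max_age):
    """Share of rows whose timestamp is no older than `max_age` at `as_of`."""
    return sum(as_of - r[field] <= max_age for r in rows) / len(rows)

def uniqueness(rows, field):
    """Share of distinct values in `field`, a simple consistency proxy."""
    return len({r[field] for r in rows}) / len(rows)

as_of = datetime(2024, 1, 15)  # assumed reporting date
report = {
    "email_completeness": completeness(records, "email"),
    "timeliness_30d": timeliness(records, "updated", as_of, timedelta(days=30)),
    "id_uniqueness": uniqueness(records, "id"),
}
print(report)
```

Because each score is a plain ratio computed by a few readable lines, anyone can see why a dataset was flagged, which is the kind of transparency black-box tools tend to lack.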
Open source isn’t a panacea for data or software quality but, as mentioned, open source data quality solutions can help to improve the processes associated with delivering quality. One of the clear trends in data science, generally, has been a shift toward open source data infrastructure, precisely because no one wants to bet blindly on algorithms that can be used but not understood.
So, which open source data quality solutions stand out?
Top open source data quality tools
Delta Lake
One of the most interesting data quality tools isn’t really a data quality tool at all. Rather, the Delta Lake open source storage framework, first created by Databricks and now a Linux Foundation project, lets companies give a data lake many of the benefits of a data warehouse, including making the data more easily queryable.
Delta Lake helps companies feel comfortable storing all of their data in a common, open source format, making it easier to use that data and apply data quality tools against it.
Talend Open Studio
Talend, mentioned above, offers the popular Talend Open Studio for users who want an open source data quality solution. Talend makes it easy to observe, cleanse and analyze text fields, along with several other related tasks. The solution has a polished, easy-to-follow UI, as well as a robust community that can step in to help answer user questions.
As detailed in an Indeed.com analysis, “One unique value proposition of Open Studio is its ability to match time-series data … Without adding any code, users can analyze the data ranging from simple data profiling to profiling based on different fields.”
Apache Griffin
Apache Griffin is another community-driven open source data quality solution. Griffin supports both batch and streaming modes and includes a unified process to measure data quality. Griffin first lets an enterprise define what data quality means for it across factors such as timeliness and completeness, then identify the most critical characteristics. With that definition in place, it becomes easier to measure how well data lives up to it. Companies as varied as Expedia, VMware and Huawei rely on Griffin.
Soda
One newer entrant to the open source data quality universe is Soda, co-founded by open source veteran Tom Baeyens. Soda helps data engineers control the tests used to screen for bad data and the metrics that are employed to evaluate results. Soda SQL uses efficient SQL queries to extract data metrics and column profiles, and declarative YAML configuration files give users full control over those queries.
Though Soda will often be used by data engineers, the platform is trying to democratize data monitoring, making it easy for non-technical, business-oriented people to build data monitors.
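The general idea of declaring metrics and tests as configuration, then letting a tool turn them into SQL, can be sketched in a few lines of Python using the standard library's sqlite3 module. This is not Soda's API or file format: the `config` dictionary, the `run_scan` function, the table and the test names are all hypothetical, and the tests are written as small Python functions only to keep the sketch short.

```python
import sqlite3

# Hypothetical declarative configuration: metric names mapped to SQL
# expressions, plus pass/fail rules. This mirrors the idea of a YAML scan
# file but is not Soda's actual file format.
config = {
    "table": "customers",
    "metrics": {
        "row_count": "COUNT(*)",
        "missing_email": "SUM(CASE WHEN email IS NULL THEN 1 ELSE 0 END)",
        "distinct_ids": "COUNT(DISTINCT id)",
    },
    "tests": {
        "has_rows": lambda m: m["row_count"] > 0,
        "no_missing_email": lambda m: m["missing_email"] == 0,
        "ids_unique": lambda m: m["distinct_ids"] == m["row_count"],
    },
}

def run_scan(conn, cfg):
    """Compute every configured metric in one SQL query, then apply the tests."""
    names = list(cfg["metrics"])
    select = ", ".join(f"{expr} AS {name}" for name, expr in cfg["metrics"].items())
    row = conn.execute(f"SELECT {select} FROM {cfg['table']}").fetchone()
    metrics = dict(zip(names, row))
    results = {name: bool(test(metrics)) for name, test in cfg["tests"].items()}
    return metrics, results

# Small in-memory table standing in for a real warehouse.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, email TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [(1, "ann@example.com"), (2, None), (3, "cy@example.com")],
)

metrics, results = run_scan(conn, config)
print(metrics)
print(results)
```

Computing all metrics in a single query keeps the scan cheap on the database side, and keeping the rules in configuration means a new check is a one-line change rather than new pipeline code.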
OpenRefine
OpenRefine is a community-driven tool that is primarily used to tame messy data. It originated with Google and can be used to explore, clean and transform data at significant scale.
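One technique behind this kind of cleanup is key-collision clustering, which OpenRefine offers through its fingerprint method: values that reduce to the same normalized key are probably variants of one entry. The Python below is a simplified sketch of that idea, not OpenRefine's own code; the `fingerprint` and `cluster` functions and the sample names are assumptions for illustration.

```python
import string
import unicodedata
from collections import defaultdict

def fingerprint(value):
    """Normalize a string so that superficial variants share one key."""
    # Strip accents by decomposing characters and dropping non-ASCII marks.
    ascii_text = unicodedata.normalize("NFKD", value).encode("ascii", "ignore").decode()
    # Trim, lowercase, and remove punctuation.
    cleaned = ascii_text.strip().lower().translate(str.maketrans("", "", string.punctuation))
    # Split into tokens, drop duplicates, sort, and rejoin.
    return " ".join(sorted(set(cleaned.split())))

def cluster(values):
    """Group values by fingerprint; keep groups with more than one distinct spelling."""
    groups = defaultdict(set)
    for v in values:
        groups[fingerprint(v)].add(v)
    return {key: sorted(members) for key, members in groups.items() if len(members) > 1}

# Hypothetical messy company names.
names = ["Acme Corp.", "acme corp", " Corp Acme ", "Café Münster", "Cafe Munster", "Globex"]
clusters = cluster(names)
print(clusters)
```

This catches differences in case, punctuation, accents, spacing and word order. It does not catch genuine misspellings, which need fuzzier matching such as edit-distance comparisons.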
Disclosure: I work for MongoDB, but the views expressed herein are mine.