Developing a QA strategy for unstructured data and analytics can be a trying and elusive process, but there are several things we've learned that can improve the accuracy of results.
In a traditional application development process, quality assurance occurs at the unit-test level, the integration-test level and, finally, in a staging area where a new application is trialed in an environment similar to the one it will run in once it reaches production. While it's not uncommon for less-than-perfect data to be used in the early stages of application testing, confidence in data accuracy for transactional systems is high. By the time an application gets to final staging tests, the data that it processes is seldom in question.
With analytics, which uses a different development process and a mix of structured and unstructured data, testing and quality assurance for data aren't as straightforward.
Here are the challenges:
1. Data quality
Unstructured data coming into an analytics system must be correctly parsed into digestible pieces of information to be of high quality. Before parsing occurs, the data must be prepped so that it is compatible with the data formats of the many different systems it must interact with. Data must also be pre-edited so that as much needless noise as possible (such as connection "handshakes" between appliances in Internet of Things data) is eliminated. With so many different sources of data, each with its own set of issues, data quality can be difficult to achieve.
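As a rough illustration of the pre-editing step, here is a minimal Python sketch that drops noise messages before any further parsing. It assumes, purely for the example, that each incoming message is a line of JSON with a "type" field; the field name and the set of noise types (NOISE_TYPES) are hypothetical and would differ from one device fleet to another.

```python
import json

# Hypothetical message types treated as connection noise (an assumption,
# not taken from any particular IoT protocol).
NOISE_TYPES = {"handshake", "keepalive", "ack"}


def pre_edit(raw_lines):
    """Return parsed records, skipping unparseable lines and noise messages."""
    kept = []
    for line in raw_lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            # Not valid JSON: skip it rather than pass bad data downstream.
            continue
        if not isinstance(record, dict):
            continue
        msg_type = record.get("type")
        if isinstance(msg_type, str) and msg_type in NOISE_TYPES:
            continue
        kept.append(record)
    return kept


raw = [
    '{"type": "handshake", "device": "a1"}',
    '{"type": "reading", "device": "a1", "temp_c": 21.5}',
    "not json",
]
cleaned = pre_edit(raw)
```

In practice, rejected lines would more likely be counted or logged than silently discarded, so that a sudden rise in rejects can itself serve as a data quality signal.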
2. Data drift
In analytics, data can begin to drift as new data sources are added and new queries alter the direction of the analytics. Data and analytics drift can be a healthy response to changing business conditions, but it can also pull companies away from the original business use case that the data and analytics were intended for.
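One simple way to notice this kind of drift is to compare where records come from now with where they came from when the use case was defined. The sketch below is one possible check, not a standard method from this article: it computes the total variation distance between two source mixes, where 0 means an identical mix and 1 means no overlap. The function name, the source labels and any alert threshold you attach to the result are assumptions for illustration.

```python
from collections import Counter


def source_mix_shift(baseline_sources, current_sources):
    """Total variation distance between two source distributions (0 to 1)."""
    base = Counter(baseline_sources)
    curr = Counter(current_sources)
    n_base = sum(base.values())
    n_curr = sum(curr.values())
    if n_base == 0 or n_curr == 0:
        raise ValueError("both windows need at least one record")
    sources = set(base) | set(curr)
    # A Counter returns 0 for sources it has never seen.
    return 0.5 * sum(
        abs(base[s] / n_base - curr[s] / n_curr) for s in sources
    )


# Hypothetical source labels: the baseline has no IoT feed, the current window does.
baseline = ["crm"] * 8 + ["web"] * 2
current = ["crm"] * 4 + ["web"] * 2 + ["iot"] * 4
shift = source_mix_shift(baseline, current)
```

A rising value does not say whether the drift is healthy. It only flags that the mix has changed enough for someone to check it against the original use case.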
3. Business use case drift
Use case drift is closely related to drift in data and analytics queries. There is nothing wrong with business use case drift if the original use case has been resolved or is no longer important. However, if the need to satisfy the original business use case remains, it is incumbent on IT and the business to maintain the integrity of the data needed for that use case, and to create a separate data repository and analytics for emerging use cases.
4. Eliminating the right data
In one case, a biomedical team studying a particular molecule wanted to accumulate every piece of data it could find about this molecule from a worldwide collection of experiments, papers and research. The amount of data that artificial intelligence and machine learning had to review to collect this molecule-specific data was enormous, so the team decided up front to bypass any data that was not directly related to the molecule. The risk was that they might miss some tangential data that could be important, but that risk was not large enough to keep them from slimming down their data so that only the highest quality, most relevant data was collected.
Data science and IT teams can use this approach as well. By narrowing the funnel of data that comes into an analytics data repository, data quality can be improved.
5. Deciding your data QA standards
How perfect does your data need to be in order to perform value-added analytics for your company? A common benchmark is that analytics results must come within 95% accuracy of what subject matter experts would have determined for any one query. If data quality lags, it won't be possible to meet that 95% accuracy threshold.
However, there are instances when an organization can begin to use data that is less than perfect and still derive value from it. One example is general trend analysis, such as gauging increases in traffic over a road system or increases in temperature over time for a fruit crop. The caveat is this: If you're using less-than-perfect data for general guidance, never treat the resulting analytics as mission-critical.
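The 95% threshold described above can be turned into a repeatable check. The following Python sketch makes one simplifying assumption that the article does not spell out: accuracy is measured as the share of test queries for which the analytics answer matches the answer a subject matter expert gave. The function names and the sample answers are hypothetical.

```python
def agreement_rate(analytics_answers, expert_answers):
    """Fraction of queries where the analytics result matches the expert's."""
    if len(analytics_answers) != len(expert_answers):
        raise ValueError("answer lists must be the same length")
    if not expert_answers:
        raise ValueError("need at least one query to evaluate")
    matches = sum(a == e for a, e in zip(analytics_answers, expert_answers))
    return matches / len(expert_answers)


def meets_standard(analytics_answers, expert_answers, threshold=0.95):
    """True if agreement with the experts reaches the chosen threshold."""
    return agreement_rate(analytics_answers, expert_answers) >= threshold


# Hypothetical evaluation set: 20 queries, analytics disagrees on one of them.
expert = ["up"] * 20
analytics = ["up"] * 19 + ["down"]
passed = meets_standard(analytics, expert)
```

For trend-style uses where less-than-perfect data is acceptable, the same check can be run with a lower threshold, as long as the results are kept out of mission-critical decisions.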