The accuracy of your analytics may depend on how well you ask questions of your data.
The field of computer-based natural language processing and analytics first emerged in the 1950s. Today, it is employed in the automated features of the mobile and computer applications that we use every day. And while natural language processing has dramatically improved through the years, it is still an evolving science.
Most of us have only to look as far as our word processors and mobile apps. Their built-in algorithms and learning processes help us innumerable times with spelling and vocabulary, but they can also interpret words incorrectly. (Example: I am writing this on a Mac and my language interpreter just interpreted "as far" in the first sentence of this paragraph as "Safari," which is the Mac browser.)
We can work around these natural language processing limitations in big data applications, but the stakes get higher when algorithms and queries are run against big data in a field such as pharmaceutical analytics and come up against human language ambiguities.
One case concerning an online healthcare website was documented in a 2014 New York Times article. The goal of the website was to give consumers information about drug side effects and interactions. The website used data in many formats, culled from many sources and then aggregated into a big data repository that would be probed by internally developed analytics algorithms. Unfortunately, since the same drug's side effects were described in different ways in different data sources (e.g. drowsiness, somnolence, sleepiness), complications from these language ambiguities arose that compromised the algorithm's effectiveness and its ultimate accuracy. Consequently, additional labor had to be continuously put into algorithm refinement.
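To make the ambiguity concrete, here is a minimal sketch in Python of one common workaround: mapping each variant term to a single canonical label before counting. This is not the method the website in question used. The synonym table, the function names, and the sample reports are all hypothetical, and only the three variant terms come from the example above.

```python
from collections import Counter

# Hypothetical synonym table: maps each variant term to one canonical label.
# The three variants come from the article; choosing "drowsiness" as the
# canonical label is an assumption made for illustration.
CANONICAL_TERMS = {
    "drowsiness": "drowsiness",
    "somnolence": "drowsiness",
    "sleepiness": "drowsiness",
}


def normalize_term(term: str) -> str:
    """Return the canonical label for a term, or the cleaned term if unknown."""
    cleaned = term.strip().lower()
    return CANONICAL_TERMS.get(cleaned, cleaned)


def count_side_effects(reports):
    """Count side-effect mentions after mapping variants to canonical labels."""
    return Counter(normalize_term(report) for report in reports)


# Made-up reports standing in for entries from different data sources.
reports = ["Drowsiness", "somnolence", " sleepiness ", "nausea"]
counts = count_side_effects(reports)
print(counts)
```

Without the normalization step, the three sleep-related terms would be counted as three separate side effects. The catch is that a table like this has to be maintained by hand as new wordings turn up, which is the kind of continuous refinement described above.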
In a technical analysis of problems like this, Xavier Amatriain, a computer science researcher and vice president of engineering at Quora, discussed in an article the importance of matching the sophistication of data algorithms and queries to the type of data they are evaluating, neither overshooting nor undershooting. For instance, we normally assume that the more data you run through an algorithm, the more accurate the analytics result you are going to get.
Amatriain said that isn't always the case. In some cases, the algorithms or basic queries that are asked of the data are too simplistic to benefit from any more data. In other cases, the questions and algorithms we use against big data are too complicated: they require so many different characteristics of each data element to be analyzed that they can't conclude anything.
This dilemma was again highly publicized in 2013, when Google Flu Trends (GFT) predicted twice as many visits to the doctor for flu-like illnesses as the Centers for Disease Control and Prevention (CDC), which bases its estimates on surveillance reports from laboratories across the United States. Google based its analytics and its algorithm not only on information that the CDC was using but also on reports from a diversity of sources on search words like "fever" or "cough." Needless to say, Google's predictions for the flu that year were overstated. The variance came down not only to the choice of data, but also to how the algorithmic formula developed to query the data was interpreting it.
What can companies learn from these experiences?
It is not enough to think only about your data and all of the sources that you can aggregate it from. In the end, the accuracy of your analytics may well depend on how well you ask the question.