
More data isn't always a good thing in text mining


In text mining it seems obvious that we should use all the data we can get our hands on when drawing conclusions. The temptation is always to use the broadest possible query to select the data set, because we don't want to miss anything that might be important. The problem with such an all-inclusive strategy is that it often adds noise that obscures the signal we're trying to detect.

So, for instance, if I'm doing a study for a chocolate candy manufacturer and simply enter the query, "chocolate," the vast majority of the data I collect for my study will have nothing whatever to do with chocolate candy. This will make it much harder to detect the relevant trends and themes in the data related to chocolate candy because they'll be obscured by unrelated issues, such as the color chocolate or chocolate ice cream. So the query "chocolate candy" might actually make more sense, even though it leaves out a lot of relevant data. As long as we have enough data, adding more that is mostly irrelevant could actually make our analysis less effective.

But how much data is enough? The answer may surprise you. It doesn't really take as much data as you might think to spot a potentially interesting trend or correlation. To see why, let's try a simple thought experiment. Say we're given a coin and we're told that it may or may not be "loaded," where a loaded coin is one that when flipped nearly always comes up heads, whereas a normal coin comes up heads only half the time. How many heads in a row do I need to see before I can say with 99% confidence that the coin is loaded? The answer is 7. A fair coin has a 50% chance (1/2) of coming up heads on the first flip, a 25% chance (1/4) of heads on the first two flips, and so on; its chance of seven heads in a row is about 0.8% (1/128), which is below the 1% threshold. So in this simple experiment I only needed seven data points to tell that something was probably amiss with the coin.
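The arithmetic in the thought experiment can be checked with a few lines of Python. This is only a sketch: the function name `flips_needed` and its parameters are illustrative, and it assumes every flip comes up heads and that we compare the fair-coin probability of that streak against 1 minus the desired confidence.

```python
def flips_needed(confidence: float, p_heads_fair: float = 0.5) -> int:
    """Smallest number of consecutive heads whose probability under a
    fair coin is at or below (1 - confidence)."""
    threshold = 1.0 - confidence
    n = 1
    # Keep flipping until a fair coin would rarely produce this streak.
    while p_heads_fair ** n > threshold:
        n += 1
    return n


if __name__ == "__main__":
    n = flips_needed(0.99)
    print(n, 0.5 ** n)  # 7 flips; 1/128 = 0.0078125
```

Raising the confidence to 99.9% only pushes the count to 10 flips, which illustrates how slowly the required sample grows for a single yes/no question.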

But if seven examples are enough to draw a conclusion from a simple experiment, why do we usually use thousands of examples to draw conclusions from text? There are actually a couple of reasons. Partly it's because we frequently don't get to design our experiment before the data is generated. So we basically have to take whatever data is given to us, and some of it is certain to be redundant or irrelevant for our purposes. The other issue is that we usually aren't simply trying to determine the answer to one yes/no question (e.g. "is the coin loaded or not") but rather are looking across thousands of potential features and correlations to find a handful that are potentially interesting. When you have to cover more bases, you naturally need more data to do it with.

So the better and more relevant the data, and the more focused the subject of the analysis, the less data you actually need to get an accurate picture. Typically when I get a fairly focused set of short documents (paragraphs) that are relevant to the subject under study, I can usually get a pretty good picture of between 25 and 50 themes using between 1,000 and 10,000 documents. Right around 500 documents usually turns out to be too small a set to be interesting (it might even be easier just to read the documents one by one than to try to analyze them using text mining techniques). Once I get above 100,000 documents, I'll usually either sample the data or divide it into smaller chunks using some other feature of interest.
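The rule of thumb for large collections can be sketched in Python as follows. This is a hypothetical helper, not the author's tooling: the name `prepare_corpus`, the dictionary-based document format, and the default sample size of 10,000 (the upper end of the working range mentioned above) are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def prepare_corpus(docs, max_docs=100_000, sample_size=10_000,
                   key=None, seed=0):
    """Return a dict mapping a label to a list of documents.

    - At or below max_docs: keep everything under the label "all".
    - Above max_docs with no key: draw one random sample ("sample").
    - Above max_docs with a key: split into chunks by that feature.
    """
    if len(docs) <= max_docs:
        return {"all": list(docs)}
    if key is None:
        rng = random.Random(seed)  # fixed seed for a repeatable sample
        size = min(sample_size, len(docs))
        return {"sample": rng.sample(list(docs), size)}
    chunks = defaultdict(list)
    for doc in docs:
        chunks[key(doc)].append(doc)
    return dict(chunks)


if __name__ == "__main__":
    # Toy data: 30 documents tagged with a hypothetical "year" feature.
    docs = [{"text": f"doc {i}", "year": 2005 + i % 3} for i in range(30)]
    by_year = prepare_corpus(docs, max_docs=20, key=lambda d: d["year"])
    print({year: len(chunk) for year, chunk in by_year.items()})
```

Splitting by a feature of interest (year, product, region) keeps each chunk focused, while random sampling is the fallback when no such feature is available.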

The moral of the story is, adding more data is not a panacea. Being thoughtful about what you want to study and why and then carefully selecting data that is relevant to those objectives will produce much better results in the end.

Scott Spangler is an IBM senior technical staff member who has been researching knowledge-based systems and data mining for the past 20 years. He is the co-author, along with Jeffrey Kreulen, of the book "Mining the Talk: Unlocking the Business Value in Unstructured Information", which shows readers how to leverage unstructured data to become more competitive, responsive and innovative.

The book is published by Pearson Education, under the IBM Press imprint, July, 2007, ISBN 0132339536, Copyright 2008 by International Business Machines Corporation. All rights reserved. For more information, please visit: www.ibmpressbooks.com or www.informit.com.

1 comment
brainwavelive

I agree with the blog post 'More data isn't always a good thing in text mining', especially the last part, which says it is important for us to know what we need the data for. Merely querying for particular information always throws up some irrelevant data that we often do not require. One reason is that the database is not intelligent enough to sense the reason for our query. It is important to have some tool, some technology, that can activate 'thoughts' in databases, but that is a difficult task. One of the problems is that we always predefine data and develop queries based on the data structure or 'schema'. Although an ontology lets us gather as much information as possible under a particular domain, without a proper storage format (one that does not impose on the data) it is not possible to have only relevant information at query time. An existing cell in a table cannot handle data that has evolved over time and space, so another cell in a new table needs to be created, and this goes on. This makes it difficult to apply 'thoughts' in the database layer to activate any intelligence that can sense the reason for a particular query. It is time to have a fundamental unit in place that can handle evolving data; only this can throw up accurate results at query time. Like the 'gene', the 'meme' is a fundamental unit that not only handles data that has evolved over time and space but also propagates itself throughout the entire enterprise system to connect with the relevant ones. Being free from any 'schema' that is functionally limiting, a 'Neural' database throws up accurate results when queried because it can sense the reason for the query. Refer to 'The Brainwave Platform' for more.
