More data isn't always a good thing in text mining

In text mining, it seems obvious that we should use all the data we can get our hands on when drawing conclusions. The temptation is always to use the broadest possible query to select the data set, because we don't want to miss anything that might be important. The problem with such an all-inclusive strategy is that it often adds noise that obscures the signal we're trying to detect.

For instance, if I'm doing a study for a chocolate candy manufacturer and simply enter the query "chocolate," the vast majority of the data I collect for my study will have nothing whatever to do with chocolate candy. This will make it much harder to detect the relevant trends and themes related to chocolate candy, because they'll be obscured by unrelated topics such as the color chocolate or chocolate ice cream. So the query "chocolate candy" might actually make more sense, even though it leaves out a lot of relevant data. As long as we have enough data, adding more that is mostly irrelevant could actually make our analysis less effective.

But how much data is enough? The answer may surprise you. It doesn't really take as much data as you might think to spot a potentially interesting trend or correlation. To see why, let's try a simple thought experiment. Say we're given a coin and we're told that it may or may not be "loaded," where a loaded coin is one that when flipped nearly always comes up heads, whereas a normal coin will only come up heads half the time. If the coin keeps coming up heads, how many flips will it take before I can say with 99% confidence that the coin is loaded? The answer is 7. A fair coin comes up heads on the first flip with probability 1/2 (50%), on the first two flips with probability 1/4 (25%), and so on, until seven heads in a row has a probability of only 1/128 (about 0.008, or less than 1%). So in this simple experiment I only needed seven data points to tell that something was probably amiss with the coin.
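The arithmetic behind the coin experiment can be checked with a short sketch. It assumes a simplified model in which a loaded coin always lands heads (the text says "nearly always"), and the function name `flips_needed` is made up for this illustration.

```python
from fractions import Fraction


def flips_needed(confidence=Fraction(99, 100), p_heads_fair=Fraction(1, 2)):
    """Return the smallest run of consecutive heads whose probability under
    a fair coin drops below 1 - confidence, along with that probability.

    Assumption: a loaded coin always shows heads, so any run of heads is
    fully consistent with a loaded coin.
    """
    threshold = 1 - confidence
    flips = 0
    prob = Fraction(1)  # probability that a fair coin produces this run
    while prob >= threshold:
        flips += 1
        prob *= p_heads_fair
    return flips, prob


flips, prob = flips_needed()
print(flips, prob, float(prob))  # 7 1/128 0.0078125
```

Six heads in a row would not be enough under this model, since 1/64 is still above the 1% threshold.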

But if seven examples are enough to draw a conclusion from a simple experiment, why do we usually use thousands of examples to draw conclusions from text? There are a couple of reasons. Partly it's because we frequently don't get to design our experiment before the data is generated. We basically have to take whatever data is given to us, and some of it is certain to be redundant or irrelevant for our purposes. The other issue is that we usually aren't simply trying to determine the answer to one yes/no question (e.g. "is the coin loaded or not") but rather are looking across thousands of potential features and correlations to find a handful that are potentially interesting. When you have to cover more bases, you naturally need more data to do it with.
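One rough way to see why covering more bases takes more data is to extend the coin analogy: treat each candidate feature as its own fair coin and count how many would show an all-heads streak purely by chance. This is only an illustrative sketch. The independence of the features, the threshold of 0.01 expected chance hits, and the function names are all assumptions, not something taken from a real text mining tool.

```python
def expected_chance_hits(num_features, flips):
    """Expected number of fair coins (out of num_features independent ones)
    that come up heads on every one of `flips` flips by chance alone."""
    return num_features * 0.5 ** flips


def flips_for_family(num_features, max_expected_hits=0.01):
    """Smallest number of flips per coin that keeps the expected number of
    chance all-heads streaks below max_expected_hits."""
    flips = 0
    while expected_chance_hits(num_features, flips) >= max_expected_hits:
        flips += 1
    return flips


# With 1,000 candidate features and only 7 observations each, several
# features would look "loaded" by luck alone.
print(expected_chance_hits(1000, 7))  # 7.8125
print(flips_for_family(1))            # 7
print(flips_for_family(1000))         # 17
```

Under these assumptions, going from one question to a thousand raises the evidence needed per question from 7 observations to 17, before even accounting for redundant or irrelevant documents.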

So the better and more relevant the data, and the more focused the subject of the analysis, the less data you actually need to get an accurate picture. When I get a fairly focused set of short documents (paragraphs) that are relevant to the subject under study, I can usually get a pretty good picture of between 25 and 50 themes using between 1,000 and 10,000 documents. A set of around 500 documents usually turns out to be too small to be interesting (it might even be easier just to read the documents one by one than to analyze them using text mining techniques). Once I get above 100,000 documents, I'll usually either sample the data or divide it into smaller chunks using some other feature of interest.
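The two options for very large collections, sampling or splitting on another feature, might look something like the sketch below. The 100,000 threshold comes from the text; the 10,000 sample size, the fixed seed, and the function names are assumptions chosen for illustration.

```python
import random
from collections import defaultdict

LARGE_SET_THRESHOLD = 100_000  # size above which the text suggests reducing
SAMPLE_SIZE = 10_000           # assumed target, the top of the working range


def sample_documents(docs, threshold=LARGE_SET_THRESHOLD,
                     sample_size=SAMPLE_SIZE, seed=0):
    """Return docs unchanged if the set is small enough, otherwise a
    uniform random sample without replacement."""
    if len(docs) <= threshold:
        return list(docs)
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    return rng.sample(docs, sample_size)


def split_by_feature(docs, key):
    """Group documents into smaller chunks by some feature of interest,
    where `key` is a function that extracts the feature from a document."""
    chunks = defaultdict(list)
    for doc in docs:
        chunks[key(doc)].append(doc)
    return dict(chunks)
```

For example, `split_by_feature(docs, key=lambda d: d["source"])` would group documents by a hypothetical `source` field, so each group can be analyzed separately.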

The moral of the story is that adding more data is not a panacea. Being thoughtful about what you want to study and why, and then carefully selecting data that is relevant to those objectives, will produce much better results in the end.

Scott Spangler is an IBM senior technical staff member who has been researching knowledge-based systems and data mining for the past 20 years. He is the co-author, along with Jeffrey Kreulen, of the book "Mining the Talk: Unlocking the Business Value in Unstructured Information", which shows readers how to leverage unstructured data to become more competitive, responsive and innovative.

The book is published by Pearson Education, under the IBM Press imprint, July, 2007, ISBN 0132339536, Copyright 2008 by International Business Machines Corporation. All rights reserved.