A Data-drive Feature Selection Method in Text Categorization
Text Categorization (TC) is the process of grouping texts into one or more predefined categories based on their content. It has become a key technique for handling and organizing text data. One of the most important issues in TC is Feature Selection (FS). Many FS methods have been put forward and widely used in TC field, such as Information Gain (IG), Document Frequency thresholding (DF) and Mutual Information. Empirical studies show that some of these (e.g. IG, DF) produce better categorization performance than others (e.g. MI). A basic research question is why these FS methods cause different performance. Many existing works seek to answer this question based on empirical studies.