Statistical machine learning algorithms have been successfully applied to many Natural Language Processing (NLP) problems. Compared to manually constructed systems, statistical NLP systems are often easier to develop and maintain, since only annotated training text is required. From the annotated data, the underlying statistical algorithm builds a model that can predict annotations for future data. However, the performance of a statistical system can also depend heavily on the characteristics of the training data. If such a system is applied to text whose characteristics differ from those of the training data, performance will degrade. This paper examines this issue empirically using the sentence boundary detection problem.