Parallel Sentences Mining From The Web

Parallel sentences can benefit many NLP applications, such as machine translation and cross-language information retrieval. In this paper, candidate bilingual web pages are retrieved by submitting sentence pairs to a search engine and are then validated by surface patterns. The authors propose an algorithm that extracts candidate bilingual resources and filters out useless bilingual web pages. The sentence pairs contained in the candidate bilingual web pages are verified by a maximum entropy classifier that combines length, word-overlap, alignment, and text-location features. The training sets and the mining seeds are acquired automatically. Experiments show satisfactory parallel resource mining performance.
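
The sentence-pair verification step can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a binary maximum entropy model (equivalent to logistic regression), covers only the length and word-overlap features (the alignment and text-location features are omitted), and uses a toy lexicon and hand-set weights that are purely hypothetical. In the paper the training data is acquired automatically and the weights would be learned from it.

```python
import math

def length_ratio(src_tokens, tgt_tokens):
    """Ratio of the shorter sentence length to the longer one, in [0, 1]."""
    if not src_tokens or not tgt_tokens:
        return 0.0
    shorter = min(len(src_tokens), len(tgt_tokens))
    longer = max(len(src_tokens), len(tgt_tokens))
    return shorter / longer

def word_overlap(src_tokens, tgt_tokens, lexicon):
    """Fraction of source tokens with at least one lexicon translation in the target."""
    if not src_tokens:
        return 0.0
    tgt_set = set(tgt_tokens)
    hits = sum(1 for word in src_tokens if lexicon.get(word, set()) & tgt_set)
    return hits / len(src_tokens)

def features(src, tgt, lexicon):
    """Feature vector: [bias, length ratio, word overlap]."""
    src_tokens = src.lower().split()
    tgt_tokens = tgt.lower().split()
    return [
        1.0,
        length_ratio(src_tokens, tgt_tokens),
        word_overlap(src_tokens, tgt_tokens, lexicon),
    ]

def parallel_probability(feats, weights):
    """Binary maximum entropy (logistic) probability that the pair is parallel."""
    z = sum(w * f for w, f in zip(weights, feats))
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical toy English-French lexicon, for illustration only.
LEXICON = {"the": {"le", "la"}, "cat": {"chat"}, "sleeps": {"dort"}}

# Hypothetical hand-set weights (bias, length ratio, word overlap), not trained.
WEIGHTS = [-4.0, 2.0, 5.0]

good = features("The cat sleeps", "Le chat dort", LEXICON)
bad = features("The cat sleeps", "Il pleut beaucoup ici ce soir", LEXICON)

p_good = parallel_probability(good, WEIGHTS)
p_bad = parallel_probability(bad, WEIGHTS)
```

With these assumed weights, a pair is accepted as parallel when its probability exceeds a threshold such as 0.5; the translated pair scores above that threshold and the unrelated pair scores below it.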

Provided by: Binary Information Press
Topic: Data Management
Date Added: Dec 2009
Format: PDF
