Parallel Sentences Mining From The Web
Parallel sentences benefit many NLP applications (e.g., machine translation and cross-language information retrieval). In this paper, candidate bilingual web pages are retrieved by submitting sentence pairs to a search engine and are then validated using surface patterns. The authors propose an algorithm that extracts candidate bilingual resources and filters out useless bilingual web pages. The sentence pairs contained in the candidate bilingual web pages are verified by a maximum entropy classifier that combines length, word-overlap, alignment, and text-location features. The training sets and the mining seeds are acquired automatically. Experiments show satisfactory parallel resource mining performance.
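
To make the verification step concrete, the following is a minimal sketch of how a candidate sentence pair might be scored, not the authors' implementation. It assumes a two-class maximum entropy model, which is equivalent to logistic regression, and covers only the length and word-overlap features; the alignment and text-location features are omitted. The toy lexicon, the weights, the bias, and all function names are hypothetical and chosen purely for illustration. In practice the weights would be learned from the automatically acquired training set.

```python
import math

# Hypothetical toy bilingual lexicon (English -> possible French translations).
LEXICON = {
    "the": {"le", "la", "les"},
    "house": {"maison"},
    "is": {"est"},
    "red": {"rouge"},
}

# Hypothetical hand-set parameters; a real system would learn these from data.
WEIGHTS = [2.0, 4.0]  # [length feature, word-overlap feature]
BIAS = -3.0


def length_ratio(src_tokens, tgt_tokens):
    """Length of the shorter sentence divided by the longer one, in [0, 1]."""
    if not src_tokens or not tgt_tokens:
        return 0.0
    a, b = len(src_tokens), len(tgt_tokens)
    return min(a, b) / max(a, b)


def word_overlap(src_tokens, tgt_tokens, lexicon):
    """Fraction of source tokens with a lexicon translation in the target."""
    if not src_tokens:
        return 0.0
    tgt = set(tgt_tokens)
    hits = sum(1 for w in src_tokens if lexicon.get(w, set()) & tgt)
    return hits / len(src_tokens)


def extract_features(src, tgt, lexicon=LEXICON):
    """Build the feature vector for one candidate sentence pair."""
    s = src.lower().split()
    t = tgt.lower().split()
    return [length_ratio(s, t), word_overlap(s, t, lexicon)]


def parallel_probability(features, weights=WEIGHTS, bias=BIAS):
    """Two-class maximum entropy (logistic) score that the pair is parallel."""
    z = bias + sum(w * f for w, f in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))


if __name__ == "__main__":
    good = extract_features("The house is red", "La maison est rouge")
    bad = extract_features("The house is red", "Bonjour")
    print(good, parallel_probability(good))
    print(bad, parallel_probability(bad))
```

A pair would be accepted as parallel when its probability exceeds a chosen threshold such as 0.5; that threshold is likewise an assumption of this sketch.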