Date Added: Mar 2013
Text analytics has become increasingly important with the rapid growth of text data. In particular, Information Extraction (IE), which extracts structured data from text, has received significant attention. Unfortunately, IE is often computationally intensive. To address this issue, MapReduce has been used for large-scale IE. Recently, efforts have emerged in both academia and industry to push IE inside DBMSs. This raises an interesting and important question: given that both MapReduce and parallel DBMSs target large-scale analytics, which platform is the better choice for large-scale IE? In this paper, the authors propose a benchmark to systematically study the performance of both platforms on large-scale IE tasks.