Efficient Updates for Web-Scale Indexes Over the Cloud
In this paper, the authors present a distributed system that enables fast and frequent updates to web-scale inverted indexes. The proposed update technique processes new or modified data incrementally and minimizes the changes required to the index. This significantly reduces update time and makes it independent of the size of the existing index. By using Hadoop MapReduce to parallelize the update operations and HBase to distribute the inverted index, they create a high-performance, fully distributed system for index creation and update. To the best of their knowledge, this is the first open-source system that creates, updates, and serves large-scale indexes in a distributed fashion.
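
The incremental update idea can be illustrated with a minimal single-machine sketch. This is not the authors' implementation. It assumes an in-memory index, whitespace tokenization, and hypothetical names (`InvertedIndex`, `update`, `search`), and it uses neither Hadoop MapReduce nor HBase. It shows only why an update can cost in proportion to the changed document rather than to the whole index: the old and new term sets of the document are compared, and only the postings of terms that were added or removed are touched.

```python
from collections import defaultdict


class InvertedIndex:
    """Toy in-memory inverted index (hypothetical, for illustration only)."""

    def __init__(self):
        # term -> set of ids of documents containing that term
        self.postings = defaultdict(set)
        # doc_id -> set of terms currently indexed for that document
        self.doc_terms = {}

    def update(self, doc_id, text):
        """Index a new or modified document, touching only changed terms."""
        new_terms = set(text.lower().split())
        old_terms = self.doc_terms.get(doc_id, set())

        added = new_terms - old_terms      # terms that now need a posting
        removed = old_terms - new_terms    # terms whose posting is stale

        for term in added:
            self.postings[term].add(doc_id)
        for term in removed:
            self.postings[term].discard(doc_id)
            if not self.postings[term]:
                # Drop empty posting lists so the index does not accumulate them.
                del self.postings[term]

        self.doc_terms[doc_id] = new_terms
        return added, removed

    def search(self, term):
        """Return the sorted ids of documents containing the term."""
        return sorted(self.postings.get(term.lower(), set()))
```

For example, after indexing "fast index updates" under id `d1` and then calling `update("d1", "fast incremental updates")`, only the postings for "index" and "incremental" change, and the posting for "fast" is left untouched. In a distributed setting, the per-document delta computation is the kind of work that could be parallelized across documents, but how the authors partition it is not described in this abstract.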