Efficient Algorithm for Removing Duplicate Documents

Provided by: International Journal of Soft Computing and Engineering (IJSCE)
Topic: Data Management
Format: PDF
The web holds a large amount of information in many forms: HTML documents, Word and PDF files, audio and video files, images, and so on. Researchers face major challenges in returning the documents that are relevant to a user's query. Identifying duplicate and near-duplicate web documents adds further overhead. This paper addresses these issues with a genetic algorithm together with a duplicate web document identification function, which improves the relevance of retrieved documents by removing duplicate records from the dataset.
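The abstract does not say how the identification function works. As a rough illustration of what removing duplicate and near-duplicate documents can look like, here is a minimal sketch that uses word shingling and Jaccard similarity. This is a swapped-in, commonly used technique and not the paper's method; the genetic algorithm component is not shown. The function names, the shingle size `k`, and the `threshold` value are all assumptions made for the example.

```python
import re


def shingles(text, k=3):
    """Return the set of k-word shingles of a text (lowercased, punctuation dropped)."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        # Short texts become a single shingle; empty texts give an empty set.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def remove_duplicates(docs, threshold=0.8):
    """Keep the first document of each group whose similarity reaches the threshold.

    Hypothetical helper: compares every document against all kept ones,
    so it is quadratic and only suitable for small collections.
    """
    kept, kept_shingles = [], []
    for doc in docs:
        s = shingles(doc)
        if any(jaccard(s, t) >= threshold for t in kept_shingles):
            continue  # duplicate or near duplicate of a kept document
        kept.append(doc)
        kept_shingles.append(s)
    return kept
```

For example, calling `remove_duplicates` on a list in which two pages differ only in capitalization and punctuation keeps the first of the two. Lowering `threshold` makes the filter more aggressive about near duplicates.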
