International Journal of Soft Computing and Engineering (IJSCE)
The web holds a large amount of information, including HTML documents, Word and PDF files, audio and video files, and images. Researchers face major challenges in providing users with the documents that are required by, and relevant to, their queries. Identifying duplicate and near-duplicate web documents adds further overhead. This paper addresses these issues with a genetic algorithm, and a duplicate web document identification function is used to improve the relevance of retrieved documents by removing duplicate records from the dataset.
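To make the idea of removing duplicate and near-duplicate records concrete, the following is a minimal sketch only. The abstract does not state how the identification function is defined, so this example assumes a common stand-in technique: word-level shingling with Jaccard similarity. The function names (`shingles`, `jaccard`, `remove_near_duplicates`), the shingle size `k`, and the `threshold` value are all hypothetical choices made for illustration, and the genetic algorithm component is not modelled here.

```python
import re


def shingles(text, k=3):
    """Return the set of k-word shingles of a document (case-insensitive)."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        # Short documents: treat the whole word sequence as one shingle.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def remove_near_duplicates(docs, threshold=0.8, k=3):
    """Keep each document only if it is not too similar to one already kept.

    `threshold` is an assumed cut-off, not a value taken from the paper.
    """
    kept = []
    kept_shingles = []
    for doc in docs:
        s = shingles(doc, k)
        if all(jaccard(s, t) < threshold for t in kept_shingles):
            kept.append(doc)
            kept_shingles.append(s)
    return kept


if __name__ == "__main__":
    corpus = [
        "the quick brown fox jumps over the lazy dog",
        "The quick brown fox jumps over the lazy dog!",
        "the quick brown fox jumps over the lazy dog today",
        "genetic algorithms evolve candidate solutions over generations",
    ]
    for doc in remove_near_duplicates(corpus):
        print(doc)
```

In this sketch the first document that appears is the one retained, and each later document is compared against every retained one, so the cost grows quadratically with the number of kept documents. That is acceptable for a small illustration but would likely need an indexing or hashing scheme for a web-scale dataset.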