International Journal of Computer Applications
Cleaning data in data warehouses, which often have a complex hierarchical structure, is an important task today. One way to do this is to detect duplicates in large databases, which makes data mining more efficient and effective. Recently proposed algorithms consider relations in a single table, so they can readily find duplicates by comparing records pairwise. Increasingly, however, data is stored in more complex, semi-structured or hierarchical forms, which raises the problem of how to detect duplicates in XML data.