A Structural, Content-Similarity Measure for Detecting Spam Documents on the Web

The Web provides its users with abundant information. Unfortunately, when a Web search is performed, both users and search engines must deal with an annoying problem: spam documents that are ranked among legitimate ones. These mixed results degrade the performance of search engines and frustrate users, who must filter out useless information themselves. To improve the quality of Web searches, the number of spam documents on the Web must be reduced, even if such documents cannot be eradicated entirely.