Measuring the Validity of Peer-to-Peer Data for Information Retrieval Applications
Peer-to-peer (p2p) networks are increasingly adopted as a valuable resource for Information Retrieval (IR) applications, including similarity estimation, content recommendation, and trend prediction. However, these networks are usually extremely large and noisy, which raises doubts about whether sufficiently accurate information can actually be extracted from them. This paper quantifies the measurement effort required to collect and optimize the information obtained from p2p networks for IR applications. The authors identify and measure inherent difficulties in collecting p2p data, namely partial crawling, user-generated noise, sparseness, and the popularity and localization of content and search queries. These aspects are quantified using music files shared in the Gnutella p2p network.