University of Koblenz-Landau
Knowledge about the distribution of data provides the basis for various tasks in the context of Linked Open Data (LOD), e.g. estimating the result set size of a query, statistical schema induction, or using information-theoretic metrics to detect patterns. In this paper, the author investigates the potential of obtaining estimates for such distributions from samples of linked data. To this end, the author considers three sampling methods applicable to public RDF data on the web, as well as smoothing techniques to overcome the problem of unseen events in the sample space of a distribution.
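To illustrate the problem of unseen events, the following minimal sketch estimates a distribution over RDF predicates from a small sample and applies additive (Laplace) smoothing. The abstract does not name the smoothing techniques that are evaluated, so additive smoothing is an assumption chosen for illustration, and the function name `additive_smoothing`, the parameter `alpha`, and the example predicates are hypothetical.

```python
from collections import Counter


def additive_smoothing(sample, vocabulary, alpha=1.0):
    """Estimate P(event) over `vocabulary` from `sample`.

    Illustrative assumption: additive (Laplace) smoothing, which adds a
    pseudo-count `alpha` to every event so that events missing from the
    sample still receive a non-zero probability.
    """
    counts = Counter(sample)
    # Only observations that belong to the vocabulary enter the estimate.
    observed = sum(counts[event] for event in vocabulary)
    total = observed + alpha * len(vocabulary)
    return {event: (counts[event] + alpha) / total for event in vocabulary}


# Hypothetical sample of predicates drawn from some linked data source.
sample = ["foaf:knows", "foaf:knows", "rdf:type"]
# Hypothetical full event space; "dc:creator" never occurs in the sample.
vocabulary = ["foaf:knows", "rdf:type", "dc:creator"]

estimate = additive_smoothing(sample, vocabulary)
for predicate, probability in estimate.items():
    print(f"{predicate}: {probability:.3f}")
```

Without smoothing, the unseen predicate would receive probability zero, which makes metrics that take logarithms or ratios of probabilities undefined. With a pseudo-count, every event in the assumed vocabulary keeps a small positive mass while the estimates still sum to one.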