Statistical Distortion: Consequences of Data Cleaning


Executive Summary

The authors introduce statistical distortion as an essential metric for measuring the effectiveness of data cleaning strategies. Using this metric, they propose a widely applicable and scalable experimental framework that evaluates data cleaning strategies along three dimensions: glitch improvement, statistical distortion, and cost-related criteria. Existing metrics focus on glitch improvement and cost but ignore the statistical impact of cleaning on the data. The authors illustrate the framework on real-world data with a comprehensive suite of experiments and analyses.
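To make the idea concrete, the following is a minimal sketch of how two cleaning strategies could be compared on glitch improvement and statistical distortion. It is an illustration under stated assumptions, not the authors' implementation: the summary does not say which distance is used, so the one-dimensional earth mover's distance here is an assumed choice, and the data, the valid range, and the helper names (`emd_1d`, `glitch_count`, `winsorize`, `impute_mean`) are all hypothetical.

```python
from statistics import mean

def emd_1d(a, b):
    """Earth mover's distance between two equal-sized 1-D samples.
    Assumed distortion measure; the summary does not name one."""
    if len(a) != len(b):
        raise ValueError("this sketch assumes equal-sized samples")
    return mean(abs(x - y) for x, y in zip(sorted(a), sorted(b)))

def glitch_count(values, lo, hi):
    """Count values outside the (hypothetical) valid range [lo, hi]."""
    return sum(1 for v in values if not lo <= v <= hi)

def winsorize(values, lo, hi):
    """Strategy A: clip out-of-range values to the nearest bound."""
    return [min(max(v, lo), hi) for v in values]

def impute_mean(values, lo, hi):
    """Strategy B: replace out-of-range values with the mean of valid ones."""
    m = mean(v for v in values if lo <= v <= hi)
    return [v if lo <= v <= hi else m for v in values]

# Hypothetical data: one out-of-range value, valid range assumed to be [0, 100].
raw = [10.0, 11.0, 12.0, 13.0, 500.0]
LO, HI = 0.0, 100.0

cleaned_a = winsorize(raw, LO, HI)
cleaned_b = impute_mean(raw, LO, HI)

# Both strategies remove every glitch...
glitches_a = glitch_count(cleaned_a, LO, HI)
glitches_b = glitch_count(cleaned_b, LO, HI)

# ...but they change the data distribution by different amounts.
distortion_a = emd_1d(raw, cleaned_a)
distortion_b = emd_1d(raw, cleaned_b)
```

In this toy case both strategies score the same on glitch improvement, yet mean imputation moves the distribution further from the original than winsorization does. That difference is the kind of statistical impact that a metric based only on glitch counts and cost would miss.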

  • Format: PDF
  • Size: 744.6 KB