Get Another Label? Improving Data Quality and Data Mining Using Multiple, Noisy Labelers
Source: New York University
The preprocessing stage carries various costs, including those of acquiring features, formulating data, cleaning data, and obtaining expert labels. The paper addresses the repeated acquisition of labels for data items when the labeling is imperfect. With low-cost labeling, preparing the unlabeled part of the data can become considerably more expensive than the labeling itself. Repeated labeling can improve label quality and model quality, but not always. When labels are noisy, repeated labeling can be preferable to single labeling even in the traditional setting where labels are not particularly cheap. When processing the unlabeled data is not free, even the simple strategy of labeling everything multiple times can give a considerable advantage. For certain label-quality/cost regimes, the benefit is substantial: repeated labeling can improve both the quality of the labeled data directly and the quality of the models learned from that data. In particular, selective repeated labeling, which takes into account both labeling uncertainty and model uncertainty, seems to be preferable. A separate, practically relevant setting where repeated labeling could provide benefits is one in which the label assignment to a case is inherently uncertain. It is possible to obtain certain labels from multiple sources relatively cheaply, and using these labels as training labels for supervised modeling holds potential. The analysis therefore has practical applications.
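As a rough illustration of why repeated labeling can improve label quality "but not always", the sketch below computes the chance that a majority vote over several noisy labels is correct. It rests on assumptions not stated in the summary: labels are binary, every labeler is independently correct with the same probability p, and the repeated labels are combined by simple majority vote. The function name majority_vote_quality is hypothetical.

```python
from math import comb


def majority_vote_quality(p: float, n: int) -> float:
    """Probability that a majority vote over n labels is correct.

    Assumes binary labels and n independent labelers, each correct
    with probability p (an illustrative model, not taken from the text).
    n must be odd so that ties cannot occur.
    """
    if n < 1 or n % 2 == 0:
        raise ValueError("n must be a positive odd integer")
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie in [0, 1]")
    # The vote is correct when more than half of the n labels are correct.
    return sum(
        comb(n, k) * p**k * (1 - p) ** (n - k)
        for k in range(n // 2 + 1, n + 1)
    )


if __name__ == "__main__":
    # Compare integrated label quality for several labeler accuracies.
    for p in (0.4, 0.5, 0.6, 0.8):
        row = [round(majority_vote_quality(p, n), 3) for n in (1, 3, 5, 11)]
        print(f"p={p}: {row}")
```

Under this model, extra labels raise quality only when p exceeds 0.5; at p = 0.5 they make no difference, and below 0.5 majority voting makes the integrated label worse, which is one way to read the "but not always" caveat.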