Semantically-Grounded Construction of Centroids for Datasets with Textual Attributes
Centroids are key components of many data analysis algorithms, such as clustering and micro-aggregation. A centroid is understood as the central value that minimizes the distance to all the objects in a dataset or cluster. Methods for centroid construction are mainly devoted to datasets with numerical and categorical attributes, and focus on the numerical and distributional properties of the data. Textual attributes, in contrast, consist of term lists referring to concepts with a specific semantic content (i.e., meaning), which cannot be evaluated by means of classical numerical operators.
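To make the definition of a centroid concrete, the following minimal sketch selects, from a set of candidates, the value that minimizes the summed distance to all objects. It is an illustration under stated assumptions, not the construction method of this work: the toy taxonomy `PARENT`, the edge-counting `path_distance`, and the helper `centroid` are all hypothetical names introduced here. The same selection rule is applied once to numbers with an absolute-difference distance and once to terms with a taxonomy-based distance, which shows why textual attributes need a semantic distance rather than a numerical operator.

```python
from typing import Callable, Iterable, Optional, Sequence, TypeVar

T = TypeVar("T")


def centroid(objects: Sequence[T], candidates: Iterable[T],
             distance: Callable[[T, T], float]) -> T:
    """Return the candidate with the smallest summed distance to all objects."""
    return min(candidates, key=lambda c: sum(distance(c, o) for o in objects))


# Hypothetical toy taxonomy (child -> parent); an assumption for illustration.
PARENT: dict[str, Optional[str]] = {
    "dog": "mammal",
    "cat": "mammal",
    "sparrow": "bird",
    "mammal": "animal",
    "bird": "animal",
    "animal": None,
}


def ancestors(term: str) -> list[str]:
    """Path from a term up to the taxonomy root, including the term itself."""
    path = []
    current: Optional[str] = term
    while current is not None:
        path.append(current)
        current = PARENT[current]
    return path


def path_distance(a: str, b: str) -> int:
    """Number of taxonomy edges on the shortest path between two terms."""
    steps_from_b = {t: i for i, t in enumerate(ancestors(b))}
    for steps_from_a, t in enumerate(ancestors(a)):
        if t in steps_from_b:  # first shared ancestor on the way up from a
            return steps_from_a + steps_from_b[t]
    raise ValueError("terms share no common ancestor")


# Numerical attribute: classical absolute difference works directly.
numeric_centroid = centroid([1.0, 2.0, 6.0], [1.0, 2.0, 6.0],
                            lambda x, y: abs(x - y))

# Textual attribute: candidates are all taxonomy concepts, compared semantically.
semantic_centroid = centroid(["dog", "cat", "sparrow"], list(PARENT),
                             path_distance)
```

In this sketch the candidates for the textual case are all concepts in the taxonomy, not only the observed terms, so a more general concept can be selected as the centroid when it lies closer, on aggregate, to the observed terms.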