Data Profiling Using Attribute Clustering
Finding trends in database data is hard when the data sets contain many attributes (columns). The difficulty increases when the data is held in text fields, which may include large summary or remarks fields. This paper discusses an approach that uses attribute-level clustering to discover trends, or profiles, in the data. It differs from traditional uses of clustering in that each attribute is clustered separately and the results are then combined to define profiles. For example, the Global Terrorism Database (GTD) data set used in a case study has 98 columns (attributes).
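The two-stage idea (cluster each attribute on its own, then combine the per-attribute results into profiles) can be illustrated with a minimal sketch. It is not the paper's implementation: the function names `cluster_attribute` and `build_profiles` are hypothetical, the "clustering" is a toy stand-in that only groups values that match after case and whitespace normalization, and the sample records are invented for illustration rather than taken from the GTD.

```python
from collections import Counter


def cluster_attribute(values):
    """Toy per-attribute clustering (an assumption, not the paper's method).

    Values that are equal after lowercasing and collapsing whitespace share
    a cluster. Returns one integer cluster label per input value.
    """
    label_of = {}
    labels = []
    for value in values:
        key = " ".join(str(value).lower().split())
        if key not in label_of:
            label_of[key] = len(label_of)  # next unused cluster id
        labels.append(label_of[key])
    return labels


def build_profiles(records, attributes):
    """Cluster every attribute separately, then combine labels into profiles."""
    # Stage 1: each attribute (column) is clustered independently.
    per_attribute = {
        name: cluster_attribute([record[name] for record in records])
        for name in attributes
    }
    # Stage 2: a record's profile is the tuple of its cluster labels,
    # one per attribute; counting the tuples shows which profiles recur.
    profiles = [
        tuple(per_attribute[name][i] for name in attributes)
        for i in range(len(records))
    ]
    return Counter(profiles)


# Invented sample rows (hypothetical, not real GTD data).
records = [
    {"weapon": "Explosives", "target": "Government"},
    {"weapon": "explosives ", "target": "government"},
    {"weapon": "Firearms", "target": "Business"},
]
profile_counts = build_profiles(records, ["weapon", "target"])
```

In this sketch the first two records fall into the same cluster for both attributes, so they share one profile, while the third record forms its own. A real text attribute such as a summary or remarks field would need a proper text clustering method in place of the normalization step.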