Compression, Clustering and Pattern Discovery in Very High Dimensional Discrete-Attribute Datasets
This paper presents an efficient framework for error-bounded compression of high-dimensional discrete-attribute datasets. Such datasets, which arise frequently in a wide variety of applications, pose some of the most significant challenges in data analysis. Sub-sampling and compression are two key technologies for analyzing these datasets. The proposed framework, PROXIMUS, provides a technique for reducing a large dataset to a much smaller set of representative patterns, on which traditional (expensive) analysis algorithms can be applied with minimal loss of accuracy.