Representative Selection for Big Data via Sparse Graph and Geodesic Grassmann Manifold Distance
This paper addresses the problem of identifying a very small subset of data points from a much larger dataset (i.e., Big Data), such that the selected points adequately represent and faithfully characterize the full dataset. This identification process is known as representative selection. The authors propose a novel representative selection framework that first generates an ℓ1-norm sparse graph for the given dataset. The dataset is then partitioned recursively into clusters by applying a spectral clustering algorithm to the generated sparse graph. Each cluster is treated as one point on a Grassmann manifold, and the geodesic distances between these points are measured. Finally, the distances are analyzed using a min-max algorithm to extract an optimal subset of clusters.
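
The last two steps can be illustrated with a minimal Python sketch. It assumes that each cluster is summarized by an orthonormal basis of a fixed-dimension subspace, that the geodesic distance is the 2-norm of the principal angles between two such subspaces, and that a greedy farthest-first traversal stands in for the min-max step, since the abstract does not spell out the exact algorithm. The function names (subspace_basis, grassmann_geodesic, farthest_first) and the dim parameter are hypothetical and are not taken from the paper.

```python
import numpy as np


def subspace_basis(points, dim):
    """Orthonormal basis (n_features x dim) for a cluster's dominant subspace.

    `points` has shape (n_points, n_features). The leading `dim` left singular
    vectors of its transpose span the best-fitting dim-dimensional subspace
    through the origin. Assumption: dim <= min(n_points, n_features).
    """
    u, _, _ = np.linalg.svd(points.T, full_matrices=False)
    return u[:, :dim]


def grassmann_geodesic(basis_a, basis_b):
    """Geodesic distance between two subspaces given orthonormal bases.

    The singular values of basis_a.T @ basis_b are the cosines of the
    principal angles; the distance is the 2-norm of those angles.
    """
    cosines = np.linalg.svd(basis_a.T @ basis_b, compute_uv=False)
    angles = np.arccos(np.clip(cosines, -1.0, 1.0))  # clip guards rounding
    return float(np.linalg.norm(angles))


def pairwise_distances(bases):
    """Symmetric matrix of geodesic distances between all cluster subspaces."""
    n = len(bases)
    dist = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            dist[i, j] = dist[j, i] = grassmann_geodesic(bases[i], bases[j])
    return dist


def farthest_first(dist, n_select, start=0):
    """Greedy min-max heuristic (an assumption, not the paper's algorithm).

    Repeatedly adds the cluster whose minimum distance to the clusters
    already selected is largest.
    """
    selected = [start]
    min_dist = dist[start].copy()
    while len(selected) < n_select:
        nxt = int(np.argmax(min_dist))
        selected.append(nxt)
        min_dist = np.minimum(min_dist, dist[nxt])
    return selected
```

Given a list of clusters (each an array of points), one would compute a basis per cluster with subspace_basis, build the distance matrix with pairwise_distances, and pass it to farthest_first to pick a spread-out subset of clusters. The greedy traversal is only a common approximation to a min-max criterion; the choice of starting cluster and of subspace dimension both affect the result.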