Clustering XML Data Streams by Structure Based on Sliding Windows and Exponential Histograms
To group online XML data streams by structure, this paper introduces an algorithm named CXDSS-SWEH, a dynamic clustering algorithm based on sliding windows and exponential histograms. First, the algorithm summarizes each XML document as a structure synopsis called the Temporal Cluster Feature for XML Structure (TCFXS). Second, it assigns the TCFXS to a cluster by measuring the similarity between the TCFXS and each existing cluster. Finally, the clusters in the sliding window are updated in real time according to the criteria of false-positive exponential histograms. The authors conducted a series of experiments on real and synthetic XML data streams to evaluate the algorithm's clustering quality, memory consumption, and running time.
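For readers unfamiliar with exponential histograms, the following minimal sketch shows the general idea of approximate counting over a sliding window. It is an illustration only: it implements the classic exponential histogram for counting events, not the false-positive variant or the TCFXS synopsis used by CXDSS-SWEH, whose details the abstract does not give. The class name `ExponentialHistogram` and the parameters `window` and `max_per_size` are hypothetical names chosen for this example.

```python
class ExponentialHistogram:
    """Approximate count of events seen in the last `window` time units.

    Illustrative sketch of the classic exponential histogram, not the
    paper's false-positive variant. All names here are assumptions.
    """

    def __init__(self, window, max_per_size=2):
        self.window = window
        self.max_per_size = max_per_size
        # Buckets ordered oldest first: [timestamp of newest event, size].
        self.buckets = []

    def _expire(self, now):
        # Drop buckets whose newest event has left the window.
        while self.buckets and self.buckets[0][0] <= now - self.window:
            self.buckets.pop(0)

    def add(self, now):
        """Record one event at time `now` (timestamps must not decrease)."""
        self._expire(now)
        self.buckets.append([now, 1])
        size = 1
        while True:
            same = [i for i, b in enumerate(self.buckets) if b[1] == size]
            if len(same) <= self.max_per_size:
                break
            # Merge the two oldest buckets of this size into one of double
            # size, keeping the newer of the two timestamps.
            oldest, second = same[0], same[1]
            self.buckets[second][1] = 2 * size
            del self.buckets[oldest]
            size *= 2

    def estimate(self, now):
        """Estimate the event count in the window ending at `now`."""
        self._expire(now)
        if not self.buckets:
            return 0
        total = sum(size for _, size in self.buckets)
        # The oldest bucket may straddle the window edge, so count half of it.
        return total - self.buckets[0][1] // 2
```

Because bucket sizes double as buckets age, the number of buckets grows only logarithmically with the number of events in the window, which is what makes this kind of summary attractive for stream clustering under memory constraints. A caller would invoke `add(t)` for each arriving item and `estimate(t)` whenever an approximate in-window count is needed.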