Data Management

Algorithm for Enumerating All Maximal Frequent Tree Patterns Among Words in Tree-Structured Documents and Its Application

Date Added: Dec 2009
Format: PDF

To extract structural features among the nodes of tree-structured documents in which characteristic words appear, the authors previously described (Uchida et al., PAKDD 2004) a text-mining algorithm for enumerating all frequent Consecutive Path Patterns (CPPs) on a list W of words. In this paper, they first extend the CPP to a tree pattern, called a Tree Association Pattern (TAP), over a set W of words. A TAP is an ordered rooted tree t such that the root of t has either no children or at least two children, all leaves of t are labeled with non-empty subsets of W, and all internal nodes, if any, are labeled with strings.
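The TAP definition above can be read as a small structural check. The following is a minimal sketch, not the authors' implementation: the `Node` class and the `is_tap` function are hypothetical names introduced here, leaf labels are assumed to be represented as `frozenset` objects, and internal labels as plain Python strings.

```python
from dataclasses import dataclass, field
from typing import FrozenSet, List, Union

# Hypothetical representation: a leaf carries a set of words,
# an internal node carries a string label.
Label = Union[str, FrozenSet[str]]


@dataclass
class Node:
    label: Label
    # Child order is significant, since a TAP is an ordered tree.
    children: List["Node"] = field(default_factory=list)


def is_tap(root: Node, words: FrozenSet[str]) -> bool:
    """Check the three conditions of the TAP definition over the word set `words`."""
    # Condition 1: the root has no child or at least two children.
    if len(root.children) == 1:
        return False
    stack = [root]
    while stack:
        node = stack.pop()
        if node.children:
            # Condition 3: every internal node is labeled with a string.
            if not isinstance(node.label, str):
                return False
            stack.extend(node.children)
        else:
            # Condition 2: every leaf is labeled with a non-empty subset of W.
            if not isinstance(node.label, frozenset):
                return False
            if not node.label or not node.label <= words:
                return False
    return True
```

Under this reading, a single leaf labeled with a non-empty subset of W is itself a TAP, because the root then has no children. Only the root is restricted in its number of children; other internal nodes may have a single child.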