Re-Structuring HTML Documents Structure Automatically Through Clustering
Source: JATIT
In this paper, the authors present a novel approach to automatically re-structuring HTML documents by extracting semantic structures from their header and body. The body of a web page is generally generated by software from a template, and its layout follows a physical schema. The approach extracts trees based on the hierarchical relations in HTML documents, using two algorithms: a header extraction algorithm, which extracts header trees from the head of an HTML document, and a partitioning algorithm, which automatically partitions the body of a web page into tree-like semantic structures. An application called the layout changer then changes the layout of one web page to that of another by aligning the extracted header trees and partition trees.
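To make the idea of separate head and body trees concrete, here is a minimal Python sketch. It is not the authors' header extraction or partitioning algorithm, which the abstract does not specify. It only assumes that a "tree" mirrors the nesting of HTML tags, and it uses the standard-library `html.parser` module. The names `Node`, `TreeBuilder`, `extract_trees`, and `outline` are hypothetical and introduced here for illustration.

```python
from html.parser import HTMLParser

# Void elements have no closing tag, so they are never pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class Node:
    """One element in the extracted tree (hypothetical helper type)."""

    def __init__(self, tag, attrs=None):
        self.tag = tag
        self.attrs = dict(attrs or [])
        self.text = ""
        self.children = []


class TreeBuilder(HTMLParser):
    """Builds a tree that mirrors the tag nesting of the document."""

    def __init__(self):
        super().__init__()
        self.root = Node("document")
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs)
        self.stack[-1].children.append(node)
        if tag not in VOID_TAGS:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Pop back to the nearest matching open tag; ignore stray end tags.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        self.stack[-1].text += data.strip()


def find(node, tag):
    """Depth-first search for the first node with the given tag."""
    if node.tag == tag:
        return node
    for child in node.children:
        hit = find(child, tag)
        if hit is not None:
            return hit
    return None


def extract_trees(html):
    """Return (head_tree, body_tree); either may be None if absent."""
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return find(builder.root, "head"), find(builder.root, "body")


def outline(node, depth=0):
    """Render a tree as a list of indented tag names."""
    lines = ["  " * depth + node.tag]
    for child in node.children:
        lines.extend(outline(child, depth + 1))
    return lines
```

Calling `extract_trees(page_source)` yields two trees that could then be compared or aligned across pages. A semantic partitioning of the body, as the paper describes, would need further grouping logic on top of this purely syntactic nesting.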
Size: 40.60
Date: Apr 2009



