
Re-Structuring HTML Documents Structure Automatically Through Clustering


Executive Summary

In this paper, the authors present a novel approach to automatically re-structuring HTML documents by extracting semantic structures from their header and body. The body of a web page is generally generated by software from a template, and its layout follows a physical schema. The approach extracts trees based on the hierarchical relations in HTML documents, using two algorithms: a header extraction algorithm, which extracts header trees from the head of an HTML document, and an algorithm that automatically partitions the body of a web page into tree-like semantic structures. An application called layout changer then changes the layout of one web page into that of another by aligning the extracted header trees and partition trees.
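
To make the idea of extracting trees from the head and body more concrete, here is a minimal sketch in Python. It is not the authors' header extraction or partitioning algorithm, which the summary does not specify; it only builds a plain tag hierarchy with the standard-library `html.parser` module and returns the `head` and `body` subtrees separately. The names `Node`, `TreeBuilder`, `extract_trees`, and `outline` are hypothetical and chosen for illustration.

```python
from dataclasses import dataclass, field
from html.parser import HTMLParser

# Elements without a closing tag; they are recorded but never left open.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


@dataclass
class Node:
    """One element in the extracted tree (illustrative structure, an assumption)."""
    tag: str
    text: str = ""
    children: list = field(default_factory=list)


class TreeBuilder(HTMLParser):
    """Builds a tag hierarchy from HTML using a stack of open elements."""

    def __init__(self):
        super().__init__()
        self.root = Node("document")
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = Node(tag)
        self.stack[-1].children.append(node)
        if tag not in VOID_TAGS:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Close the nearest open element with this tag; ignore stray end tags.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        text = data.strip()
        if text:
            current = self.stack[-1]
            current.text = f"{current.text} {text}".strip()


def find(node, tag):
    """Depth-first search for the first node with the given tag."""
    if node.tag == tag:
        return node
    for child in node.children:
        hit = find(child, tag)
        if hit is not None:
            return hit
    return None


def extract_trees(html):
    """Return (head_tree, body_tree); either may be None if the tag is absent."""
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return find(builder.root, "head"), find(builder.root, "body")


def outline(node, depth=0):
    """Render a tree as indented lines, one per element."""
    label = node.tag + (f": {node.text}" if node.text else "")
    lines = ["  " * depth + label]
    for child in node.children:
        lines.extend(outline(child, depth + 1))
    return lines
```

Calling `extract_trees(page_source)` yields two trees that could then be compared or aligned between pages. A real partitioning step would group body nodes into semantic blocks (for example by clustering, as the title suggests) rather than mirror the raw tag nesting as this sketch does.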

  • Format: PDF
  • Size: 40.6 KB