Feature Analysis for Semantic Clustering of Sequence Documents
Sequence data maintained in public databases are available in heterogeneous formats such as FASTA, XML, and ASN.1. The XML representation is itself heterogeneous, because different databases use different DTDs. The difference lies in how the sequence description is represented as XML tags. Protein and genomic data in XML format in GenBank use more than 3500 tags to represent the functional description. A sequence document extracted in any of the available formats carries a large amount of information related to the sequence. Each sequence record includes information such as its description, alternate names, gene-id, object-id, taxon, database references, sequence length, and so on. Analyzing the sequence description to understand the biological process becomes complex because of the large number of attributes.
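To make the notion of a tag-heavy sequence document concrete, the following is a minimal sketch that lists the tags and leaf attributes of one XML record. The record in `SAMPLE` is invented for illustration and only loosely follows a GBSeq-style layout; the helper names `tag_counts` and `leaf_attributes` are likewise assumptions, not part of any database API.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical record for illustration only; not real database content.
SAMPLE = """<GBSet>
  <GBSeq>
    <GBSeq_locus>EXAMPLE1</GBSeq_locus>
    <GBSeq_length>120</GBSeq_length>
    <GBSeq_definition>hypothetical protein</GBSeq_definition>
    <GBSeq_organism>Example organism</GBSeq_organism>
  </GBSeq>
</GBSet>"""


def tag_counts(xml_text):
    """Count how often each tag occurs in the document, root included."""
    root = ET.fromstring(xml_text)
    return Counter(element.tag for element in root.iter())


def leaf_attributes(xml_text):
    """Map each leaf tag (an element with no children) to its text value."""
    root = ET.fromstring(xml_text)
    return {
        element.tag: (element.text or "").strip()
        for element in root.iter()
        if len(element) == 0
    }


if __name__ == "__main__":
    print(len(tag_counts(SAMPLE)), "distinct tags")
    for tag, value in leaf_attributes(SAMPLE).items():
        print(tag, "=", value)
```

Applied to a full database export, a count of this kind gives a rough measure of how many attributes a feature analysis would have to consider. Note that `leaf_attributes` keeps only the last value when a leaf tag repeats, which is adequate for this single-record sketch but not for real multi-record files.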