Provided by: Machine Intelligence Research Labs (MIR Labs)
Topic: Big Data
In this paper the authors present a hierarchical clustering algorithm aimed at creating groups of stems with similar characteristics. The resulting groups (clusters) are expected to comprise stems belonging to the same inflectional paradigm (e.g. verbs in passive voice) in order to support the creation of a morphological lexicon. A new metric for calculating the distance between the data objects is proposed, that better suits the specific application by addressing problems that may occur due to the limited amount of information from the data.