Quarrying Dataspaces: Schemaless Profiling of Unfamiliar Information Sources
Source: Portland State University
Traditional data integration and analysis approaches tend to assume familiarity with the structure, semantics and capabilities of the available information sources before the applicable tools can be used effectively. In practice, this assumption often does not hold. This paper introduces dataspace profiling as the cardinal activity when beginning a project in an unfamiliar dataspace. Dataspace profiling is the analysis of the structures and properties exposed by an information source. It allows assessment of the utility and significance of the source as a whole, assessment of its compatibility with the services of a dataspace support platform, and determination and externalization of structure in preparation for specific data applications. The paper defines dataspace profiling and articulates requirements for dataspace profilers. It then describes the Quarry system, which offers a general browse-and-query interface that supports dataspace profiling activities, including path profiling, over a range of information sources with minimal setup cost and minimal a priori assumptions. The paper also reports that the techniques used in Quarry perform strongly in large-scale applications. Specifically, Quarry is used to profile a comprehensive standard for medication classification supplied under a generic schema, and the metadata for an environmental observation and forecasting system. In these contexts, Quarry is observed to offer advantages over existing tools, which suggests that it could serve as a dataspace profiling tool in the future.