A grammar-based approach for XML schema extraction and heterogeneous document integration
Prudhvi Janga and Karen C. Davis
International Journal of Data Mining, Modelling and Management, 2019, vol. 11, issue 3, 235-258
Abstract:
The availability of vast amounts of heterogeneous XML web data motivates finding efficient methods to search, integrate, query, and present this data. The structure of XML documents is useful for achieving these tasks; however, not every XML document on the web includes a schema. We discuss challenges and solutions in the area of generation and integration of XML schemas. We propose and implement a framework for efficient schema extraction and integration from heterogeneous XML document collections gathered from the web. Our approach introduces the schema extended context-free grammar (SECFG) to model XML schemas, including detection of attributes, data types, and element occurrences. Unlike other implementations, our approach supports the generation of XML schemas in any XML schema language, e.g., DTD or XSD. We compare our approach with other proposed approaches and conclude that we offer the same or better functionality more efficiently and with greater flexibility. The approach we propose is flexible enough to facilitate integration of, and translation to, tabular (relational) data.
Keywords: XML schema; schema integration; schema extraction; schema discovery
Date: 2019
Downloads: (external link)
http://www.inderscience.com/link.php?id=100385 (text/html)
Access to full text is restricted to subscribers.
Persistent link: https://EconPapers.repec.org/RePEc:ids:ijdmmm:v:11:y:2019:i:3:p:235-258
More articles in International Journal of Data Mining, Modelling and Management from Inderscience Enterprises Ltd