Dataset comparison workflows
Marko Robnik-Šikonja
International Journal of Data Science, 2018, vol. 3, issue 2, 126-145
Abstract:
To assess the similarity of two datasets from a data science perspective, univariate statistical comparisons are mostly insufficient. We present a methodology that estimates the similarity of datasets from the point of view of data mining tasks. For example, we provide information relevant to deciding whether a new or related dataset can be used with an existing supervised or unsupervised model. We propose several workflows that cover (a) statistical properties of generated data, (b) distance-based structural similarity, and (c) predictive similarity of two datasets. We evaluate the proposed workflows on random splits of several datasets and by comparing original datasets with datasets produced by a generator of semi-artificial data. The results show that the proposed workflows can reveal relevant similarity information about datasets needed in many data mining scenarios.
Keywords: data analytics; data mining; machine learning; data similarity; clustering; classification
Date: 2018
Downloads:
http://www.inderscience.com/link.php?id=92282 (text/html)
Access to full text is restricted to subscribers.
Persistent link: https://EconPapers.repec.org/RePEc:ids:ijdsci:v:3:y:2018:i:2:p:126-145
More articles in International Journal of Data Science from Inderscience Enterprises Ltd