Hands-on training about data clustering with orange data mining toolbox

Demšar, Janez; Zupan, Blaž

PLOS Computational Biology, 2024, vol. 20, issue 12, 1-9

Abstract: Data clustering is a core data science approach widely used and referenced in the scientific literature. Its algorithms are often intuitive and can lead to exciting, insightful results that are easy to interpret. For these reasons, data clustering techniques could be the first method encountered in data science training. This paper proposes a hands-on approach to data clustering training suitable for introductory courses. The educational approach features problem-based training that starts with the data and gradually introduces various data processing and analysis methods, illustrating them through visual representations of data and models. The proposed training is suitable for a general audience, does not require a background in statistics, mathematics, or computer science, and aims to engage the audience through practical examples, an exploratory approach to data analysis with visual analysis, experimentation, and a gentle learning curve. The manuscript details the pedagogical units of the training, motivates them through the sequence of methods introduced, and proposes data sets and data analysis workflows to be explored in the class.

Author summary: The highest satisfaction for any instructor comes from an engaged audience, a motivated class that pays attention, and student questions that open up new avenues for exploring the planned material. Any introduction to data science deserves such an audience, while the burden is on the instructor to prepare an exciting lesson that covers the planned material and keeps students engaged with just the right mix of theory and practice. We could think of no better topic to cover in this way than an introduction to machine learning and no better way to introduce this field than through data clustering. Doing so, of course, requires the necessary ingredients to assist instructors: use cases to explore, a visual analytics environment to use in the classroom, and a set of problems to intuitively introduce concepts ranging from data representation, similarity scoring, and clustering methods to evaluation and explanation of the resulting models. In the manuscript, we propose the ingredients of such training and offer them in a form ready to be explored by instructors in practical, hands-on courses.

Date: 2024

Downloads: (external link)
https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1012574 (text/html)
https://journals.plos.org/ploscompbiol/article/fil ... 12574&type=printable (application/pdf)

Persistent link: https://EconPapers.repec.org/RePEc:plo:pcbi00:1012574

DOI: 10.1371/journal.pcbi.1012574

More articles in PLOS Computational Biology from Public Library of Science