Using Machine Learning and Natural Language Processing for Unveiling Similarities between Microbial Data
Lucija Brezočnik,
Tanja Žlender,
Maja Rupnik and
Vili Podgorelec
Additional contact information
Lucija Brezočnik: Faculty of Electrical Engineering and Computer Science, University of Maribor, SI-2000 Maribor, Slovenia
Tanja Žlender: National Laboratory of Health, Environment and Food, Centre for Medical Microbiology, Department for Microbiological Research, SI-2000 Maribor, Slovenia
Maja Rupnik: National Laboratory of Health, Environment and Food, Centre for Medical Microbiology, Department for Microbiological Research, SI-2000 Maribor, Slovenia
Vili Podgorelec: Faculty of Electrical Engineering and Computer Science, University of Maribor, SI-2000 Maribor, Slovenia
Mathematics, 2024, vol. 12, issue 17, 1-20
Abstract:
Microbiota analysis can provide valuable insights in fields such as diet and nutrition, health and disease, and environmental science, for example by clarifying the role of microorganisms in different ecosystems. Such results can inform targeted therapies and personalized medicine, or help detect environmental contaminants. In our research, we examined the gut microbiota of 16 animal taxa, including humans, as well as the microbiota of cattle and pig manure, focusing on the V3-V4 hypervariable regions of the 16S rRNA gene. Analyzing these regions is common in microbiome studies but can be challenging because the resulting data are high-dimensional. We therefore applied machine learning techniques and demonstrated their applicability to processing microbial sequence data. Moreover, we showed that techniques commonly employed in natural language processing can be adapted for analyzing microbial text vectors. We obtained these vectors through frequency analyses and applied the proposed hierarchical clustering method to them. All steps of the study are combined into a proposed microbial sequence data processing pipeline. The results show that the method not only found similarities between samples but also sorted the groups' samples into semantically related clusters. We also compared our method with established algorithms, such as K-means and spectral clustering, using clustering evaluation metrics, and the proposed method outperformed them. The proposed pipeline can also be used for other types of microbiota, such as oral, gut, and skin, demonstrating its reusability and robustness.
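The general idea in the abstract, turning sequences into n-gram frequency vectors and clustering them hierarchically, can be illustrated with a minimal sketch. This is not the authors' pipeline: the n-gram length, the cosine distance, the average-linkage rule, and the toy sequences are all assumptions chosen for illustration, and the helper names (`ngram_counts`, `cosine_distance`, `agglomerative`) are hypothetical.

```python
from collections import Counter
from math import sqrt


def ngram_counts(seq, n=3):
    """Count overlapping character n-grams in one sequence (n=3 is an assumption)."""
    return Counter(seq[i:i + n] for i in range(len(seq) - n + 1))


def cosine_distance(a, b):
    """Return 1 - cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def agglomerative(vectors, n_clusters):
    """Average-linkage agglomerative clustering; returns lists of sample indices."""
    clusters = [[i] for i in range(len(vectors))]
    # Precompute all pairwise distances between individual samples.
    dist = {(i, j): cosine_distance(vectors[i], vectors[j])
            for i in range(len(vectors)) for j in range(i + 1, len(vectors))}

    def linkage(c1, c2):
        # Mean distance over all cross-cluster pairs.
        pairs = [(min(i, j), max(i, j)) for i in c1 for j in c2]
        return sum(dist[p] for p in pairs) / len(pairs)

    # Repeatedly merge the two closest clusters until n_clusters remain.
    while len(clusters) > n_clusters:
        a, b = min(
            ((x, y) for x in range(len(clusters))
             for y in range(x + 1, len(clusters))),
            key=lambda xy: linkage(clusters[xy[0]], clusters[xy[1]]),
        )
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return [sorted(c) for c in clusters]


# Toy sequences invented for illustration only (not real 16S rRNA reads).
samples = {
    "sample_a1": "ACGTACGTACGTACGTACGT",
    "sample_a2": "ACGTACGTACGAACGTACGT",
    "sample_b1": "GGGCCCGGGCCCGGGCCCGG",
    "sample_b2": "GGGCCCGGGCCTGGGCCCGG",
}
names = list(samples)
vectors = [ngram_counts(samples[name]) for name in names]
clusters = agglomerative(vectors, n_clusters=2)
labelled = [[names[i] for i in c] for c in clusters]
print(labelled)
```

With real data, a library implementation of hierarchical clustering would normally replace the hand-written loop, and the number of clusters would be chosen with clustering evaluation metrics rather than fixed in advance.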
Keywords: machine learning; NLP; hierarchical clustering; microbial data; microbiome; n-gram
JEL-codes: C
Date: 2024
Downloads:
https://www.mdpi.com/2227-7390/12/17/2717/pdf (application/pdf)
https://www.mdpi.com/2227-7390/12/17/2717/ (text/html)
Persistent link: https://EconPapers.repec.org/RePEc:gam:jmathe:v:12:y:2024:i:17:p:2717-:d:1468218
Mathematics is currently edited by Ms. Emma He