Clustering genomic words in human DNA using peaks and trends of distributions

Ana Helena Tavares, Jakob Raymaekers, Peter Rousseeuw, Paula Brito and Vera Afreixo
Additional contact information
Ana Helena Tavares: University of Aveiro
Jakob Raymaekers: KU Leuven
Paula Brito: University of Porto
Vera Afreixo: University of Aveiro

Advances in Data Analysis and Classification, 2020, vol. 14, issue 1, No 4, 57-76

Abstract: In this work we seek clusters of genomic words in human DNA by studying their inter-word lag distributions. Due to the particularly spiked nature of these histograms, a clustering procedure is proposed that first decomposes each distribution into a baseline and a peak distribution. An outlier-robust fitting method is used to estimate the baseline distribution (the ‘trend’), and a sparse vector of detrended data captures the peak structure. A simulation study demonstrates the effectiveness of the clustering procedure in grouping distributions with similar peak behavior and/or baseline features. The procedure is applied to investigate similarities between the distribution patterns of genomic words of lengths 3 and 5 in the human genome. These experiments demonstrate the potential of the new method for identifying words with similar distance patterns.
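
The decomposition described in the abstract can be illustrated with a minimal sketch. It is not the authors' procedure: the abstract does not say which outlier-robust fit is used, so a running median stands in for the baseline estimate, and a MAD-based cutoff stands in for the rule that makes the peak vector sparse. The function names (`lag_counts`, `running_median`, `decompose`) and the parameters `half_window` and `cutoff` are hypothetical.

```python
from collections import Counter
from statistics import median


def lag_counts(seq, word, max_lag):
    """Count distances between consecutive occurrences of `word` in `seq`.

    Returns a list whose entry k-1 is the number of times lag k was observed.
    """
    positions = []
    start = seq.find(word)
    while start != -1:
        positions.append(start)
        start = seq.find(word, start + 1)  # step by 1 so overlaps are found
    lags = Counter(b - a for a, b in zip(positions, positions[1:]))
    return [lags.get(k, 0) for k in range(1, max_lag + 1)]


def running_median(values, half_window):
    """Robust trend estimate (an assumed stand-in for the paper's fit).

    Windows are truncated at both ends, so edge values are less reliable.
    """
    n = len(values)
    return [
        median(values[max(0, i - half_window):min(n, i + half_window + 1)])
        for i in range(n)
    ]


def decompose(freqs, half_window=3, cutoff=3.0):
    """Split a lag distribution into a baseline and a sparse peak vector."""
    baseline = running_median(freqs, half_window)
    residuals = [f - b for f, b in zip(freqs, baseline)]
    # Robust scale of the residuals: median absolute deviation, rescaled.
    centre = median(residuals)
    scale = 1.4826 * median([abs(r - centre) for r in residuals])
    # Keep only residuals well above the baseline; everything else is zero.
    peaks = [r if r > cutoff * scale else 0 for r in residuals]
    return baseline, peaks


# Toy data: a linearly decaying trend with one spike at index 20.
freqs = [100 - 2 * k for k in range(40)]
freqs[20] += 50
baseline, peaks = decompose(freqs)
print(peaks.index(max(peaks)))
```

In a clustering step, the baseline and peak vectors of each word would then be compared across words; the choice of dissimilarity and clustering algorithm is left open here because the abstract does not specify it.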

Keywords: Classification; Pattern recognition; Robustness; Word distances; 62H30; 62P10
Date: 2020

Access to the full text of the articles in this series is restricted.

Ordering information: This journal article can be ordered from
http://www.springer. ... ds/journal/11634/PS2

DOI: 10.1007/s11634-019-00362-x

Advances in Data Analysis and Classification is currently edited by H.-H. Bock, W. Gaul, A. Okada, M. Vichi and C. Weihs

More articles in Advances in Data Analysis and Classification from Springer, German Classification Society - Gesellschaft für Klassifikation (GfKl), Japanese Classification Society (JCS), Classification and Data Analysis Group of the Italian Statistical Society (CLADAG), International Federation of Classification Societies (IFCS)
Bibliographic data for series maintained by Sonal Shukla.

Page updated 2020-09-12
Handle: RePEc:spr:advdac:v:14:y:2020:i:1:d:10.1007_s11634-019-00362-x