Similarity Measures for Clustering SNP Data
Katja Ickstadt and
Silvia Selinski
No 2005,27, Technical Reports from Technische Universität Dortmund, Sonderforschungsbereich 475: Komplexitätsreduktion in multivariaten Datenstrukturen
Abstract:
The issue of suitable similarity measures for a particular kind of genetic data, so-called SNP data, arises from the GENICA (Interdisciplinary Study Group on Gene Environment Interaction and Breast Cancer in Germany) case-control study of sporadic breast cancer. The GENICA study aims to investigate the influence and interaction of single nucleotide polymorphic (SNP) loci and exogenous risk factors. A single nucleotide polymorphism is a point mutation that is present in at least 1% of a population. SNPs are the most common form of human genetic variation. In particular, we consider 65 SNP loci and 2 insertions of longer sequences in genes involved in the metabolism of hormones, xenobiotics and drugs as well as in the repair of DNA and signal transduction. Assuming that these single nucleotide changes may lead, for instance, to altered enzymes or to a reduced or enhanced amount of the original enzymes (with each alteration alone having only minor effects), we aim to detect combinations of SNPs that increase the risk of sporadic breast cancer under certain environmental conditions. The search for patterns in the present data set may be performed by a variety of clustering and classification approaches. Here we consider the problem of suitable measures of proximity between two variables or subjects as an indispensable basis for a subsequent cluster analysis. Generally, clustering approaches are a useful tool to detect structures and to generate hypotheses about potential relationships in complex data situations. When searching for patterns in the data, there are two possible objectives: the identification of groups of similar objects or subjects, or the identification of groups of similar variables within the whole population or within subpopulations. Comparing the individual genetic profiles as well as the genetic information across subpopulations, we discuss possible choices of similarity measures, in particular similarity measures based on the counts of matches and mismatches.
New matching coefficients are introduced with a more flexible weighting scheme to account for the general problem of the comparison of SNP data: the large proportion of homozygous reference sequences relative to the homo- and heterozygous SNPs masks the agreements and differences of interest.
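The idea of a match-count similarity with flexible weights can be illustrated with a minimal sketch. It is not the authors' definition: the abstract does not give the formula of the Flexible Matching Coefficient, so the genotype coding (0 = homozygous reference, 1 = heterozygous, 2 = homozygous variant), the function names `simple_matching` and `weighted_matching`, and the weights used below are all assumptions made for illustration only.

```python
from typing import Mapping, Sequence


def simple_matching(a: Sequence[int], b: Sequence[int]) -> float:
    """Share of loci at which two genotype profiles agree."""
    if len(a) != len(b) or not a:
        raise ValueError("profiles must be non-empty and of equal length")
    matches = sum(x == y for x, y in zip(a, b))
    return matches / len(a)


def weighted_matching(
    a: Sequence[int],
    b: Sequence[int],
    match_weights: Mapping[int, float],
    mismatch_weight: float = 1.0,
) -> float:
    """Hypothetical weighted variant: each kind of match gets its own weight.

    Assumed coding: 0 = homozygous reference, 1 = heterozygous,
    2 = homozygous variant. Down-weighting 0/0 matches keeps the many
    reference genotypes from dominating the similarity.
    """
    if len(a) != len(b) or not a:
        raise ValueError("profiles must be non-empty and of equal length")
    weighted_matches = 0.0
    weighted_mismatches = 0.0
    for x, y in zip(a, b):
        if x == y:
            weighted_matches += match_weights[x]
        else:
            weighted_mismatches += mismatch_weight
    total = weighted_matches + weighted_mismatches
    return weighted_matches / total if total > 0 else 0.0


# Two illustrative profiles over 8 loci (made-up values, not GENICA data).
profile_a = [0, 0, 0, 0, 0, 0, 1, 2]
profile_b = [0, 0, 0, 0, 0, 0, 2, 2]

unweighted = simple_matching(profile_a, profile_b)
weighted = weighted_matching(profile_a, profile_b, {0: 0.1, 1: 1.0, 2: 1.0})
print(unweighted, weighted)
```

In this made-up example the unweighted coefficient is high mainly because both profiles carry the reference genotype at six loci; giving those matches a small weight lets the single mismatch at a variant locus count for more, which is the masking effect the abstract describes.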
Keywords: GENICA; single nucleotide polymorphism (SNP); sporadic breast cancer; similarity; Matching Coefficient; Flexible Matching Coefficient
Date: 2005
Downloads:
https://www.econstor.eu/bitstream/10419/22617/1/tr27-05.pdf (application/pdf)
Persistent link: https://EconPapers.repec.org/RePEc:zbw:sfb475:200527
Bibliographic data for series maintained by ZBW - Leibniz Information Centre for Economics.