Genomic feature selection by coverage design optimization
Stephen Reid,
Aaron M. Newman,
Maximilian Diehn,
Ash A. Alizadeh and
Robert Tibshirani
Journal of Applied Statistics, 2018, vol. 45, issue 14, 2658-2676
Abstract:
We introduce a novel data reduction technique whereby we select a subset of tiles to ‘cover’ maximally events of interest in large-scale biological datasets (e.g. genetic mutations), while minimizing the number of tiles. A tile is a genomic unit capturing one or more biological events, such as a sequence of base pairs that can be sequenced and observed simultaneously. The goal is to reduce significantly the number of tiles considered to those with areas of dense events in a cohort, thus saving on cost and enhancing interpretability. However, the reduction should not come at the cost of too much information, allowing for sensible statistical analysis after its application. We envisage application of our methods to a variety of high throughput data types, particularly those produced by next-generation sequencing (NGS) experiments. The procedure is cast as a convex optimization problem, which is presented, along with methods of its solution. The method is demonstrated on a large dataset of somatic mutations spanning 5000+ patients, each having one of 29 cancer types. Applied to these data, our method dramatically reduces the number of gene locations required for broad coverage of patients and their mutations, giving subject specialists a more easily interpretable snapshot of recurrent mutational profiles in these cancers. The locations identified coincide with previously identified cancer genes. Finally, despite considerable data reduction, we show that our covering designs preserve the cancer discrimination ability of multinomial logistic regression models trained on all of the locations ( $ >1M $ >1M).
Date: 2018
References: Add references at CitEc
Citations:
Downloads: (external link)
http://hdl.handle.net/10.1080/02664763.2018.1432577 (text/html)
Access to full text is restricted to subscribers.
Related works:
This item may be available elsewhere in EconPapers: Search for items with the same title.
Export reference: BibTeX
RIS (EndNote, ProCite, RefMan)
HTML/Text
Persistent link: https://EconPapers.repec.org/RePEc:taf:japsta:v:45:y:2018:i:14:p:2658-2676
Ordering information: This journal article can be ordered from
http://www.tandfonline.com/pricing/journal/CJAS20
DOI: 10.1080/02664763.2018.1432577
Access Statistics for this article
Journal of Applied Statistics is currently edited by Robert Aykroyd
More articles in Journal of Applied Statistics from Taylor & Francis Journals
Bibliographic data for series maintained by Chris Longhurst ().