Multiple Testing and Data Adaptive Regression: An Application to HIV-1 Sequence Data

Birkner, Merrill; Sinisi, Sandra; van der Laan, Mark

Additional contact information
Merrill Birkner: Division of Biostatistics, School of Public Health, University of California, Berkeley
Sandra Sinisi: Division of Biostatistics, School of Public Health, University of California, Berkeley
Mark van der Laan: Division of Biostatistics, School of Public Health, University of California, Berkeley

No 1161, U.C. Berkeley Division of Biostatistics Working Paper Series from Berkeley Electronic Press

Abstract: Analysis of viral strand sequence data and viral replication capacity could potentially lead to biological insights regarding the replication ability of HIV-1. Determining specific target codons on the viral strand will facilitate the manufacturing of target-specific antiretrovirals. Various algorithmic and analysis techniques can be applied to this problem. We propose using multiple testing to find codons that have significant univariate associations with the replication capacity of the virus. We also propose using a data-adaptive multiple regression algorithm to obtain multiple predictions of viral replication capacity based on an entire mutant/non-mutant sequence profile. The data set to which these techniques were applied consists of 317 patients, each with 282 sequenced protease and reverse transcriptase codons. Initially, the multiple testing procedure (Pollard and van der Laan, 2003) was applied to the individual-specific viral sequence data. A single-step multiple testing procedure was used to control the family-wise error rate (FWER) at the five percent level. Additional augmentation multiple testing procedures were applied to control the generalized family-wise error rate (gFWER) or the tail probability of the proportion of false positives (TPPFP). Finally, the loss-based, cross-validated Deletion/Substitution/Addition regression algorithm (Sinisi and van der Laan, 2004) was applied to the data set separately. This algorithm builds candidate estimators for the prediction of a univariate outcome by minimizing an empirical risk, and it uses cross-validation to select fine-tuning parameters such as the size of the regression model, the maximum allowed order of interaction of terms in the regression model, and the dimension of the vector of covariates. The algorithm is also used to measure the variable importance of the codons. Findings from these analyses are consistent with biological findings and could lead to further biological knowledge regarding HIV-1 viral data.
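
The abstract does not spell out the augmentation step, so the following is only a minimal sketch of the general idea, assuming that adjusted p-values from an FWER-controlling procedure are already available. An augmentation procedure keeps the initial FWER rejections and adds the next most significant hypotheses: a fixed number k for gFWER, or as many as keep the added fraction of the rejection set at or below q for TPPFP. The function names, the parameters `k` and `q`, and the use of adjusted p-values as input are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def fwer_rejections(adj_p, alpha=0.05):
    """Indices rejected by the initial FWER-controlling procedure."""
    return np.flatnonzero(np.asarray(adj_p) <= alpha)

def augment_gfwer(adj_p, alpha=0.05, k=1):
    """Initial FWER rejections plus the k next most significant hypotheses.

    Allowing k extra rejections targets P(more than k false positives) <= alpha.
    """
    adj_p = np.asarray(adj_p)
    order = np.argsort(adj_p, kind="stable")   # most significant first
    r0 = int(np.sum(adj_p <= alpha))           # size of the initial rejection set
    return np.sort(order[: min(r0 + k, adj_p.size)])

def augment_tppfp(adj_p, alpha=0.05, q=0.1):
    """Initial FWER rejections plus the largest number of extra hypotheses
    such that extra / (extra + r0) <= q.

    Targets P(proportion of false positives > q) <= alpha.
    """
    adj_p = np.asarray(adj_p)
    order = np.argsort(adj_p, kind="stable")
    r0 = int(np.sum(adj_p <= alpha))
    extra = 0
    # Grow the rejection set while the added fraction stays within q.
    while r0 + extra < adj_p.size and (extra + 1) / (extra + 1 + r0) <= q:
        extra += 1
    return np.sort(order[: r0 + extra])

# Hypothetical adjusted p-values for six codons (made up for illustration).
adj_p = np.array([0.01, 0.20, 0.03, 0.06, 0.50, 0.04])
print(fwer_rejections(adj_p))          # initial FWER rejections
print(augment_gfwer(adj_p, k=1))       # gFWER(1) augmentation
print(augment_tppfp(adj_p, q=0.3))     # TPPFP(0.3) augmentation
```

When the initial procedure rejects nothing, the TPPFP sketch adds nothing either, because a single added rejection would make the added fraction equal to one.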

Keywords: Bootstrap; codon; generalized family wise error rate; HIV-1; multiple testing; prediction; tail probability of the proportion of false positives; type I error; variable selection
Date: 2004-10-26
Note: oai:bepress.com:ucbbiostat-1161

Downloads:
http://www.bepress.com/cgi/viewcontent.cgi?article=1161&context=ucbbiostat (application/pdf)

Persistent link: https://EconPapers.repec.org/RePEc:bep:ucbbio:1161