Optimized phylogenetic clustering of HIV-1 sequence data for public health applications

Chato, Connor; Feng, Yi; Ruan, Yuhua; Xing, Hui; Herbeck, Joshua; Kalish, Marcia; Poon, Art F Y

PLOS Computational Biology, 2022, vol. 18, issue 11, 1-24

Abstract: Clusters of genetically similar infections suggest rapid transmission and may indicate priorities for public health action or reveal underlying epidemiological processes. However, clusters often require user-defined thresholds and are sensitive to non-epidemiological factors, such as non-random sampling. Consequently, the ideal threshold for public health applications varies substantially across settings. Here, we show a method which selects optimal thresholds for phylogenetic (subset tree) clustering based on population. We evaluated this method on HIV-1 pol datasets (n = 14,221 sequences) from four sites in the USA (Tennessee, Washington), Canada (Northern Alberta) and China (Beijing). Clusters were defined by tips descending from an ancestral node (with a minimum bootstrap support of 95%) through a series of branches, each with a length below a given threshold. Next, we used pplacer to graft new cases to the fixed tree by maximum likelihood. We evaluated the effect of varying branch-length thresholds on cluster growth as a count outcome by fitting two Poisson regression models: a null model that predicts growth from cluster size, and an alternative model that includes mean collection date as an additional covariate. The alternative model was favoured by AIC across most thresholds, with optimal (greatest difference in AIC) thresholds ranging 0.007–0.013 across sites. The range of optimal thresholds was more variable when re-sampling 80% of the data by location (IQR 0.008–0.016, n = 100 replicates). Our results use prospective phylogenetic cluster growth and suggest that there is more variation in effective thresholds for public health than those typically used in clustering studies.

Author summary: A genetic cluster of virus infections is a group of DNA or RNA sequences that are much more similar to each other than they are to other infections from the same population of hosts. These clusters can reveal where virus transmission has been occurring the most rapidly, which can provide useful information for a public health response. Genetic clusters are often built by reconstructing a phylogeny (a tree-based model of how the sequences are related by common ancestors) and locating distinct parts of the tree with short branches. However, there are no objective, general-purpose criteria for deciding which parts of a tree constitute clusters, and there are an unlimited number of ways to partition a tree into clusters. In this study, we develop a computational method to determine the best clustering criteria based on our ability to predict where the next infections will occur. We apply this method to anonymized HIV-1 sequence data sets from Canada, the United States, and China, to characterize the sensitivity of clustering criteria to different risk populations and sampling contexts. Our results indicate that the clustering criteria typically used for phylogenetic studies of HIV-1 are not optimal for public health applications.
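
The cluster definition in the abstract (tips descending from a well-supported ancestral node through branches that are each shorter than a threshold) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the `Node` class, the `find_clusters` function, the toy tree and its branch lengths are all hypothetical, and the rule that each tip is assigned to the most ancestral qualifying node is an assumption made here to keep clusters non-overlapping.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical tree node; a node with no children is a tip."""
    name: str
    length: float = 0.0    # length of the branch leading to this node
    support: float = 0.0   # bootstrap support as a fraction (internal nodes)
    children: list = field(default_factory=list)


def reachable_tips(node, threshold):
    """Tips reached from `node` through branches that are all below `threshold`."""
    tips = []
    for child in node.children:
        if child.length < threshold:
            if child.children:
                tips.extend(reachable_tips(child, threshold))
            else:
                tips.append(child.name)
    return tips


def find_clusters(root, threshold, min_support=0.95):
    """Return clusters (sorted lists of tip names) with at least two tips.

    Assumption: ancestors are visited before descendants, so a tip joins
    the cluster of the most ancestral node that qualifies.
    """
    clusters, assigned = [], set()
    stack = [root]
    while stack:
        node = stack.pop()
        if not node.children:
            continue  # tips cannot be ancestral nodes
        if node.support >= min_support:
            tips = [t for t in reachable_tips(node, threshold) if t not in assigned]
            if len(tips) >= 2:
                clusters.append(sorted(tips))
                assigned.update(tips)
        stack.extend(node.children)
    return sorted(clusters)


# Toy tree with made-up branch lengths and supports (illustration only).
TREE = Node("root", 0.0, 1.0, [
    Node("A", 0.05, 0.99, [
        Node("a1", 0.004),
        Node("a2", 0.006),
        Node("A2", 0.005, 0.50, [Node("a3", 0.004), Node("a4", 0.02)]),
    ]),
    Node("B", 0.05, 0.80, [Node("b1", 0.003), Node("b2", 0.002)]),
    Node("c1", 0.04),
])

if __name__ == "__main__":
    for threshold in (0.001, 0.01, 0.03):
        print(threshold, find_clusters(TREE, threshold))
```

Running the sketch over a range of thresholds shows how cluster membership changes with the branch-length cutoff; the study's threshold selection compares Poisson models of cluster growth at each cutoff, a step this sketch does not attempt.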

Date: 2022

Downloads: (external link)
https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1010745 (text/html)
https://journals.plos.org/ploscompbiol/article/fil ... 10745&type=printable (application/pdf)

Persistent link: https://EconPapers.repec.org/RePEc:plo:pcbi00:1010745

DOI: 10.1371/journal.pcbi.1010745
