Machine learning to predict the source of campylobacteriosis using whole genome data
Nicolas Arning,
Samuel K Sheppard,
Sion Bayliss,
David A Clifton and
Daniel J Wilson
PLOS Genetics, 2021, vol. 17, issue 10, 1-17
Abstract:
Campylobacteriosis is among the world’s most common foodborne illnesses, caused predominantly by the bacterium Campylobacter jejuni. Effective interventions require determination of the infection source which is challenging as transmission occurs via multiple sources such as contaminated meat, poultry, and drinking water. Strain variation has allowed source tracking based upon allelic variation in multi-locus sequence typing (MLST) genes allowing isolates from infected individuals to be attributed to specific animal or environmental reservoirs. However, the accuracy of probabilistic attribution models has been limited by the ability to differentiate isolates based upon just 7 MLST genes. Here, we broaden the input data spectrum to include core genome MLST (cgMLST) and whole genome sequences (WGS), and implement multiple machine learning algorithms, allowing more accurate source attribution. We increase attribution accuracy from 64% using the standard iSource population genetic approach to 71% for MLST, 85% for cgMLST and 78% for kmerized WGS data using the classifier we named aiSource. To gain insight beyond the source model prediction, we use Bayesian inference to analyse the relative affinity of C. jejuni strains to infect humans and identified potential differences, in source-human transmission ability among clonally related isolates in the most common disease causing lineage (ST-21 clonal complex). Providing generalizable computationally efficient methods, based upon machine learning and population genetics, we provide a scalable approach to global disease surveillance that can continuously incorporate novel samples for source attribution and identify fine-scale variation in transmission potential.Author summary: C. jejuni are the most common cause of food-borne bacterial gastroenteritis but the relative contribution of different sources is incompletely understood. We traced the origin of human C. jejuni infections using machine learning algorithms that compare the DNA sequences of bacteria sampled from infected people, contaminated chickens, cattle, sheep, wild birds, and the environment. This approach achieved improvement in accuracy of source attribution by 33% over existing methods that use only a subset of genes within the genome and provided evidence for the relative contribution of different infection sources. Sometimes even very similar bacteria showed differences, demonstrating the value of basing analyses on the entire genome when developing this algorithm that can be used for understanding the global epidemiology and other important bacterial infections.
Date: 2021
References: View complete reference list from CitEc
Citations:
Downloads: (external link)
https://journals.plos.org/plosgenetics/article?id=10.1371/journal.pgen.1009436 (text/html)
https://journals.plos.org/plosgenetics/article/fil ... 09436&type=printable (application/pdf)
Related works:
This item may be available elsewhere in EconPapers: Search for items with the same title.
Export reference: BibTeX
RIS (EndNote, ProCite, RefMan)
HTML/Text
Persistent link: https://EconPapers.repec.org/RePEc:plo:pgen00:1009436
DOI: 10.1371/journal.pgen.1009436
Access Statistics for this article
More articles in PLOS Genetics from Public Library of Science
Bibliographic data for series maintained by plosgenetics ().