Global Disease Monitoring and Forecasting with Wikipedia

Generous, Nicholas; Fairchild, Geoffrey; Deshpande, Alina; Del Valle, Sara Y; Priedhorsky, Reid

PLOS Computational Biology, 2014, vol. 10, issue 11, 1-16

Abstract: Infectious disease is a leading threat to public health, economic stability, and other key social structures. Efforts to mitigate these impacts depend on accurate and timely monitoring to measure the risk and progress of disease. Traditional, biologically-focused monitoring techniques are accurate but costly and slow; in response, new techniques based on social internet data, such as social media and search queries, are emerging. These efforts are promising, but important challenges in the areas of scientific peer review, breadth of diseases and countries, and forecasting hamper their operational usefulness. We examine a freely available, open data source for this use: access logs from the online encyclopedia Wikipedia. Using linear models, language as a proxy for location, and a systematic yet simple article selection procedure, we tested 14 location-disease combinations and demonstrate that these data feasibly support an approach that overcomes these challenges. Specifically, our proof-of-concept yields models with r² up to 0.92, forecasting value up to the 28 days tested, and several pairs of models similar enough to suggest that transferring models from one location to another without re-training is feasible. Based on these preliminary results, we close with a research agenda designed to overcome these challenges and produce a disease monitoring and forecasting system that is significantly more effective, robust, and globally comprehensive than the current state of the art.

Author Summary: Even in developed countries, infectious disease has significant impact; for example, flu seasons in the United States take between 3,000 and 49,000 lives. Disease surveillance, traditionally based on patient visits to health providers and laboratory tests, can reduce these impacts. Motivated by cost and timeliness, surveillance methods based on internet data have recently emerged, but are not yet reliable for several reasons, including weak scientific peer review, limited breadth of diseases and countries covered, and underdeveloped forecasting capabilities. We argue that these challenges can be overcome by using a freely available data source: aggregated access logs from the online encyclopedia Wikipedia. Using simple statistical techniques, our proof-of-concept experiments suggest that these data are effective for predicting the present, as well as forecasting up to the 28-day limit of our tests. Our results also suggest that these models can be used even in places with no official data upon which to build models. In short, this paper establishes the utility of Wikipedia as a broadly effective data source for disease information, and we outline a path to a reliable, scientifically sound, operational, and global disease surveillance system that overcomes key gaps in existing traditional and internet-based techniques.
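
Illustrative sketch (not from the paper): the abstract describes linear models that relate Wikipedia article access counts to official case counts and that keep forecasting value at leads of up to 28 days. The Python below is a minimal sketch of that general idea only, assuming a single article, a made-up synthetic series, and a plain one-variable least-squares fit; the names `fit_line`, `fit_with_lead`, `views`, and `cases` are hypothetical and do not reflect the authors' actual pipeline or article selection procedure.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b, r2)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = mean_y - b * mean_x
    ss_res = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return a, b, 1.0 - ss_res / ss_tot


def fit_with_lead(views, cases, lead):
    """Pair views at step t with cases at step t + lead, then fit."""
    if lead > 0:
        xs, ys = views[:-lead], cases[lead:]
    else:
        xs, ys = views, cases
    return fit_line(xs, ys)


# Synthetic, made-up data (an assumption for illustration only):
# case counts follow article views two time steps later.
views = [10, 12, 18, 30, 45, 40, 28, 15, 11, 10]
cases = [0, 0] + [3 * v + 5 for v in views[:-2]]

for lead in (0, 1, 2):
    a, b, r2 = fit_with_lead(views, cases, lead)
    print(f"lead={lead}: intercept={a:.2f} slope={b:.2f} r2={r2:.3f}")
```

Comparing the fit quality across leads is one simple way to see whether access counts carry information about later case counts; a real analysis would use several articles, real surveillance data, and held-out evaluation.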

Date: 2014
Citations: 13 (EconPapers)

Downloads: (external link)
https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1003892 (text/html)
https://journals.plos.org/ploscompbiol/article/fil ... 03892&type=printable (application/pdf)

Persistent link: https://EconPapers.repec.org/RePEc:plo:pcbi00:1003892

DOI: 10.1371/journal.pcbi.1003892

Publisher: Public Library of Science