Automated Classification of Online Sources for Infectious Disease Occurrences Using Machine-Learning-Based Natural Language Processing Approaches

Kim, Mira; Chae, Kyunghee; Lee, Seungwoo; Jang, Hong-Jun; Kim, Sukil

Additional contact information
Mira Kim: Department of Preventive Medicine, College of Medicine, The Catholic University of Korea, Seoul 06591, Korea
Kyunghee Chae: Department of Preventive Medicine, College of Medicine, The Catholic University of Korea, Seoul 06591, Korea
Seungwoo Lee: Department of Data and HPC Science, University of Science and Technology, Daejeon 34113, Korea
Hong-Jun Jang: Research Data Sharing Center, Korea Institute of Science and Technology Information, Daejeon 34141, Korea
Sukil Kim: Department of Preventive Medicine, College of Medicine, The Catholic University of Korea, Seoul 06591, Korea

IJERPH, 2020, vol. 17, issue 24, 1-13

Abstract: Collecting valid information from electronic sources to detect potential infectious disease outbreaks is time-consuming and labor-intensive. Automated identification of relevant information using machine learning is necessary to respond to a potential disease outbreak. A total of 2864 documents were collected from various websites and then manually categorized and labeled by two reviewers. Labels for the training and test data were assigned based on reviewer consensus. Two machine learning algorithms, ConvNet and bidirectional long short-term memory (BiLSTM), and two classification methods, DocClass and SenClass, were used to classify the documents. Precision, recall, F1, accuracy, and area under the curve were measured to evaluate the performance of each model. With DocClass, ConvNet yielded higher average, minimum, and maximum accuracies (87.6%, 85.2%, and 91.1%, respectively) than BiLSTM, while with SenClass, BiLSTM performed better than ConvNet, with average, minimum, and maximum accuracies of 92.8%, 92.6%, and 93.3%, respectively. BiLSTM with SenClass yielded an overall accuracy of 92.9% in classifying infectious disease occurrences. Machine learning performed comparably to a human expert given a particular text extraction system. This study suggests that analyzing information from websites using machine learning can achieve high accuracy when abundant articles/documents are available.
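
The evaluation measures named in the abstract (precision, recall, F1, accuracy, and area under the curve) can be illustrated with a minimal Python sketch. This is not the authors' code: the function names `binary_metrics` and `auc`, the 0.5 decision threshold, and the toy labels and scores are all assumptions made for illustration, and the sketch treats the task as binary (1 = relevant to an infectious disease occurrence, 0 = not relevant).

```python
from typing import Dict, Sequence


def binary_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Precision, recall, F1, and accuracy for 0/1 labels (1 = relevant)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}


def auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """Area under the ROC curve, computed as the probability that a randomly
    chosen positive example scores higher than a randomly chosen negative one
    (ties count as one half)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    # Hypothetical labels and classifier scores, not data from the study.
    y_true = [1, 1, 1, 0, 0, 0, 1, 0]
    scores = [0.9, 0.8, 0.4, 0.3, 0.6, 0.1, 0.7, 0.2]
    y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed 0.5 threshold
    print(binary_metrics(y_true, y_pred))
    print(auc(y_true, scores))
```

The pairwise AUC computation is quadratic in the number of examples, which is acceptable for a sketch; a rank-based formulation or a library routine would be the usual choice for larger test sets.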

Keywords: machine learning; infectious disease; public health surveillance; online document; classification
JEL-codes: I I1 I3 Q Q5
Date: 2020

Downloads:
https://www.mdpi.com/1660-4601/17/24/9467/pdf (application/pdf)
https://www.mdpi.com/1660-4601/17/24/9467/ (text/html)

Persistent link: https://EconPapers.repec.org/RePEc:gam:jijerp:v:17:y:2020:i:24:p:9467-:d:463852

IJERPH is currently edited by Ms. Jenna Liu

Bibliographic data for series maintained by MDPI Indexing Manager.