Design of a Spark Big Data Framework for PM2.5 Air Pollution Forecasting

Shih, Dong-Her; To, Thi Hien; Nguyen, Ly Sy Phu; Wu, Ting-Wei; You, Wen-Ting

Dong-Her Shih, Thi Hien To, Ly Sy Phu Nguyen, Ting-Wei Wu and Wen-Ting You
Additional contact information
Dong-Her Shih: Department of Information Management, National Yunlin University of Science & Technology, Douliu 64002, Taiwan
Thi Hien To: Faculty of Environment, University of Science, 227 Nguyen Van Cu Street, District 5, Ho Chi Minh City 700000, Vietnam
Ly Sy Phu Nguyen: Faculty of Environment, University of Science, 227 Nguyen Van Cu Street, District 5, Ho Chi Minh City 700000, Vietnam
Ting-Wei Wu: Department of Information Management, National Yunlin University of Science & Technology, Douliu 64002, Taiwan
Wen-Ting You: Department of Information Management, National Yunlin University of Science & Technology, Douliu 64002, Taiwan

IJERPH, 2021, vol. 18, issue 13, 1-22

Abstract: In recent years, rapid economic development has been accompanied by severe air pollution, with negative effects on health, the environment and medical costs. PM2.5 is one of the main components of air pollution, so knowing PM2.5 levels in advance is important for protecting health. Many air quality studies rely on the government’s official monitoring stations, which cannot be widely deployed because of their high cost. Furthermore, government stations update only once an hour, which makes it hard to capture short-term PM2.5 concentration peaks that arrive with little warning. Short-interval data from many stations, however, produce a huge volume of data that is complex to calculate, analyze and predict. Spark's distributed processing alleviates the high computational requirements of the original predictor, which makes it suitable for this problem. This study proposes an instant PM2.5 prediction architecture based on the Spark big data framework to handle the large volume of data from the LASS community. The proposed framework is divided into three modules. It collects real-time PM2.5 data and performs ensemble learning with three machine learning algorithms (Linear Regression, Random Forest, Gradient Boosting Decision Tree) to predict the PM2.5 concentration over the next 30 to 180 min, with an accompanying visualization. The experimental results show that the proposed Spark ensemble prediction model performs best for the 30-min-ahead prediction (R² up to 0.96) and that the ensemble model outperforms any single machine learning model. Taiwan has long suffered from relatively poor air quality. Air pollutant monitoring data from the LASS community can provide broader monitoring coverage, but the data are large and difficult to integrate or analyze. The proposed Spark big data framework can provide short-term PM2.5 forecasts and help decision-makers take appropriate action immediately.
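
The ensemble step and the R² metric mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not say how the three models' outputs are combined, so the equal-weight average used here is an assumption, as are the function names and the toy numbers, which are invented for illustration and are not data from the study. The sketch uses plain Python rather than Spark so that it stays self-contained.

```python
from statistics import mean


def ensemble_average(predictions_by_model):
    """Average per-sample predictions across models.

    Equal weighting is an assumption; the abstract does not state
    how the three regressors are combined.
    """
    return [mean(sample) for sample in zip(*predictions_by_model.values())]


def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    observed_mean = mean(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - observed_mean) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


# Hypothetical PM2.5 values in µg/m³ (illustrative only, not from the paper).
observed = [10.0, 20.0, 30.0, 40.0]
predictions_by_model = {
    "linear_regression": [12.0, 18.0, 33.0, 37.0],
    "random_forest": [9.0, 23.0, 27.0, 42.0],
    "gbdt": [12.0, 19.0, 30.0, 41.0],
}

ensemble = ensemble_average(predictions_by_model)

# Report R² for each single model and for the averaged ensemble.
for name, predicted in predictions_by_model.items():
    print(f"{name}: R2 = {r_squared(observed, predicted):.3f}")
print(f"ensemble: R2 = {r_squared(observed, ensemble):.3f}")
```

In this constructed example the individual models' errors partly cancel when averaged, which is the usual motivation for an ensemble. The toy numbers were chosen to show that effect and say nothing about the results reported in the paper.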

Keywords: air pollution; PM2.5 predictions; machine learning; Spark; ensemble model; big data
JEL-codes: I I1 I3 Q Q5
Date: 2021

Downloads: (external link)
https://www.mdpi.com/1660-4601/18/13/7087/pdf (application/pdf)
https://www.mdpi.com/1660-4601/18/13/7087/ (text/html)

Persistent link: https://EconPapers.repec.org/RePEc:gam:jijerp:v:18:y:2021:i:13:p:7087-:d:587380

IJERPH is currently edited by Ms. Jenna Liu

Bibliographic data for series maintained by MDPI Indexing Manager.