Handling Complex Missing Data Using Random Forest Approach for an Air Quality Monitoring Dataset: A Case Study of Kuwait Environmental Data (2012 to 2018)
Ahmad R. Alsaber,
Jiazhu Pan and
Adeeba Al-Hurban
Additional contact information
Ahmad R. Alsaber: Department of Mathematics and Statistics, University of Strathclyde, Glasgow G1 1XH, UK
Jiazhu Pan: Department of Mathematics and Statistics, University of Strathclyde, Glasgow G1 1XH, UK
Adeeba Al-Hurban: Department of Earth and Environmental Sciences, Faculty of Science, Kuwait University, P.O. Box 5969, Safat 13060, Kuwait
IJERPH, 2021, vol. 18, issue 3, 1-25
Abstract:
In environmental research, missing data are often a challenge for statistical modeling. This paper addressed some advanced techniques to deal with missing values in a data set measuring air quality using a multiple imputation (MI) approach. MCAR, MAR, and NMAR missing data techniques are applied to the data set. Five missing data levels are considered: 5%, 10%, 20%, 30%, and 40%. The imputation method used in this paper is an iterative imputation method, missForest, which is related to the random forest approach. Air quality data sets were gathered from five monitoring stations in Kuwait, aggregated to a daily basis. Logarithm transformation was carried out for all pollutant data, in order to normalize their distributions and to minimize skewness. We found high levels of missing values for NO2 (18.4%), CO (18.5%), PM10 (57.4%), SO2 (19.0%), and O3 (18.2%) data. Climatological data (i.e., air temperature, relative humidity, wind direction, and wind speed) were used as control variables for better estimation. The results show that the MAR technique had the lowest RMSE and MAE. We conclude that MI using the missForest approach has a high level of accuracy in estimating missing values. MissForest had the lowest imputation error (RMSE and MAE) among the other imputation methods and, thus, can be considered to be appropriate for analyzing air quality data.
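The evaluation design summarized in the abstract (remove known values under MCAR, MAR, and NMAR at 5% to 40%, impute them, then score with RMSE and MAE) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the data are synthetic rather than the Kuwait measurements, the rank-weighted sampling is only one possible way to generate MAR and NMAR patterns (the abstract does not say how the authors did it), and `mean_impute` is a deliberately simple stand-in for missForest, not the paper's imputer.

```python
import math
import random


def ranks(xs):
    """Rank of each element (1 = smallest)."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0] * len(xs)
    for pos, i in enumerate(order, start=1):
        r[i] = pos
    return r


def weighted_sample(weights, k, rng):
    """Draw k distinct indices, favouring larger weights
    (Efraimidis-Spirakis keys u ** (1 / w))."""
    keyed = sorted(
        ((rng.random() ** (1.0 / w), i) for i, w in enumerate(weights)),
        reverse=True,
    )
    return {i for _, i in keyed[:k]}


def missing_indices(target, covariate, rate, mechanism, rng):
    """Choose which entries of `target` to delete (hypothetical helper)."""
    n = len(target)
    k = round(rate * n)
    if mechanism == "MCAR":    # unrelated to any variable
        weights = [1.0] * n
    elif mechanism == "MAR":   # depends on an observed covariate
        weights = ranks(covariate)
    elif mechanism == "NMAR":  # depends on the value that goes missing
        weights = ranks(target)
    else:
        raise ValueError(mechanism)
    return weighted_sample(weights, k, rng)


def mean_impute(values, missing):
    """Stand-in imputer: fill deleted entries with the observed mean."""
    observed = [v for i, v in enumerate(values) if i not in missing]
    fill = sum(observed) / len(observed)
    return [fill if i in missing else v for i, v in enumerate(values)]


def rmse_mae(truth, imputed, missing):
    """Imputation error computed only over the deleted entries."""
    errs = [imputed[i] - truth[i] for i in missing]
    rmse = math.sqrt(sum(e * e for e in errs) / len(errs))
    mae = sum(abs(e) for e in errs) / len(errs)
    return rmse, mae


# Synthetic daily series: air temperature and a log-scale pollutant level.
rng = random.Random(0)
temp = [rng.uniform(15, 45) for _ in range(500)]
log_no2 = [3.0 + 0.02 * t + rng.gauss(0, 0.3) for t in temp]

for mechanism in ("MCAR", "MAR", "NMAR"):
    for rate in (0.05, 0.10, 0.20, 0.30, 0.40):
        miss = missing_indices(log_no2, temp, rate, mechanism, rng)
        rmse, mae = rmse_mae(log_no2, mean_impute(log_no2, miss), miss)
        print(f"{mechanism} {rate:.0%}: RMSE={rmse:.3f} MAE={mae:.3f}")
```

Because the true values are known before deletion, any imputer with the same signature as `mean_impute` can be swapped in and compared on the same masks. A random-forest-based iterative imputer would be the natural replacement to approximate the approach the abstract describes.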
Keywords: missing imputation; random forest; high dimensional data; missing data mechanism; air quality
JEL-codes: I I1 I3 Q Q5
Date: 2021
Downloads:
https://www.mdpi.com/1660-4601/18/3/1333/pdf (application/pdf)
https://www.mdpi.com/1660-4601/18/3/1333/ (text/html)
Persistent link: https://EconPapers.repec.org/RePEc:gam:jijerp:v:18:y:2021:i:3:p:1333-:d:491512
IJERPH is currently edited by Ms. Jenna Liu
Bibliographic data for series maintained by MDPI Indexing Manager.