
Artificial Intelligence Generated Synthetic Datasets as the Remedy for Data Scarcity in Water Quality Index Estimation

Min Yan Chia, Chai Hoon Koo, Yuk Feng Huang, Wei Chan and Jia Yin Pang
Additional contact information
Min Yan Chia: Universiti Tunku Abdul Rahman
Chai Hoon Koo: Universiti Tunku Abdul Rahman
Yuk Feng Huang: Universiti Tunku Abdul Rahman
Wei Chan: Universiti Tunku Abdul Rahman
Jia Yin Pang: Universiti Tunku Abdul Rahman

Water Resources Management: An International Journal, Published for the European Water Resources Association (EWRA), 2023, vol. 37, issue 15, No 19, 6183-6198

Abstract: The water quality index (WQI) is used in many countries and regions as a numeric representation of the condition of water resources. However, computing the WQI involves a host of water quality variables. Although machine learning models have proven to be a promising tool for estimating the WQI with fewer inputs, sufficient data must be collected so that the models can be trained well. This poses a great challenge in places that lack the data collection infrastructure to meet the needs of machine learning models, which makes data scarcity a major issue to be tackled. This study covered two major rivers that served as water intakes in Peninsular Malaysia (Selangor River and Skudai River). Four synthetic data generation methods, namely the conditional tabular generative adversarial network (CTGAN), the tabular variational autoencoder (TVAE), the Gaussian copula (GC) and the copula generative adversarial network (CopulaGAN), were used to synthesise datasets based on the real dataset. Using the pairwise correlation difference (PCD), the Kullback-Leibler divergence (KLD) and the Kolmogorov-Smirnov (KS) test, the best synthetic datasets were selected for the two rivers. CopulaGAN1 and CopulaGAN2 yielded the best small and large synthetic datasets, respectively, for the Selangor River, scoring the lowest PCD, KLD and KS statistics. For the Skudai River, TVAE1 and TVAE2 were chosen. The real and synthetic datasets were used to train a back-propagation neural network (BPNN) for WQI estimation. Across the various evaluation metrics, increasing the size of the training data with synthetic data had a positive impact on the performance of the BPNN. The BPNN trained with CopulaGAN2 (at Selangor River) and TVAE2 (at Skudai River) yielded more accurate estimations than those derived from the actual and smaller datasets.
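The three fidelity measures named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the commonly used definitions (the PCD as the Frobenius norm of the difference between the Pearson correlation matrices of the real and synthetic data, the KLD estimated from shared-bin histograms, and the KS statistic as the largest gap between two empirical distribution functions). The function names, the bin count and the smoothing constant `eps` are all hypothetical choices made for illustration.

```python
import math
from bisect import bisect_right


def ks_statistic(real, synth):
    """Two-sample KS statistic: largest gap between the two empirical CDFs."""
    r, s = sorted(real), sorted(synth)
    gap = 0.0
    for x in r + s:
        f_r = bisect_right(r, x) / len(r)  # share of real values <= x
        f_s = bisect_right(s, x) / len(s)  # share of synthetic values <= x
        gap = max(gap, abs(f_r - f_s))
    return gap


def kl_divergence(real, synth, bins=10, eps=1e-9):
    """Histogram estimate of KL(real || synth) over shared equal-width bins.

    `bins` and `eps` are illustrative assumptions, not values from the paper.
    """
    lo = min(min(real), min(synth))
    hi = max(max(real), max(synth))
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 if all values are equal

    def hist(values):
        counts = [0] * bins
        for v in values:
            counts[min(int((v - lo) / width), bins - 1)] += 1
        # eps keeps empty bins from producing log(0) or division by zero
        return [c / len(values) + eps for c in counts]

    p, q = hist(real), hist(synth)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


def pearson(x, y):
    """Pearson correlation of two equal-length lists (columns must not be constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def pairwise_correlation_difference(real_cols, synth_cols):
    """Frobenius norm of Corr(real) - Corr(synth); inputs map column name -> values."""
    names = list(real_cols)
    total = 0.0
    for i in names:
        for j in names:
            diff = pearson(real_cols[i], real_cols[j]) - pearson(synth_cols[i], synth_cols[j])
            total += diff ** 2
    return math.sqrt(total)
```

Under these definitions, lower values of all three measures indicate a synthetic dataset that is closer to the real one, which matches the selection rule described in the abstract. In practice, each candidate synthetic dataset would be scored column by column (KLD, KS) and across columns (PCD) against the real water quality records.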

Keywords: synthetic data; artificial intelligence; back-propagation neural network; water quality index
Date: 2023
Downloads: (external link)
http://link.springer.com/10.1007/s11269-023-03650-6 Abstract (text/html)
Access to the full text of the articles in this series is restricted.

Persistent link: https://EconPapers.repec.org/RePEc:spr:waterr:v:37:y:2023:i:15:d:10.1007_s11269-023-03650-6

Ordering information: This journal article can be ordered from
http://www.springer.com/economics/journal/11269

DOI: 10.1007/s11269-023-03650-6

Water Resources Management: An International Journal, Published for the European Water Resources Association (EWRA) is currently edited by G. Tsakiris

Bibliographic data for series maintained by Sonal Shukla and Springer Nature Abstracting and Indexing.

 
Page updated 2025-04-26
Handle: RePEc:spr:waterr:v:37:y:2023:i:15:d:10.1007_s11269-023-03650-6