National surface water quality classification evaluation based on SMOTE-GA-CatBoost method
XU Ling1, JING Xiang-nan2, YANG Ying3, LI Wei-hua3, LIU Yi-xin4, YAN Guo-bing5
1. School of Civil Engineering, City University of Hefei, Hefei 238076, China; 2. School of Economics and Management, City University of Hefei, Hefei 238076, China; 3. School of Environment and Energy Engineering, Anhui Jianzhu University, Hefei 230009, China; 4. School of Earth and Space Sciences, University of Science and Technology of China, Hefei 230026, China; 5. Architectural & Civil Engineering Design Institute Co. Ltd HangZhou China, Hefei 230051, China
Abstract To address the strong conflict in the water pollution feature space and the imbalance among water quality categories in surface water classification, the Synthetic Minority Oversampling Technique (SMOTE) was combined with a Genetic Algorithm (GA) and a CatBoost model. Seven surface water quality indexes served as evaluation factors, and the combined model was used to evaluate the water quality of the major rivers and the important lakes and reservoirs in China. The results were compared with those of four other improved ensemble algorithms. SMOTE preprocessing effectively alleviated the imbalance among sample categories and increased the accuracy of the CatBoost model on minority water quality classes. Parameter optimization by the GA improved the convergence speed and classification accuracy of the CatBoost model. The SMOTE-GA-CatBoost model outperformed the other four improved ensemble classification models. For river water quality classification, its accuracy, precision, recall and F1 reached 97.7%, 97.8%, 96.1% and 96.9%, respectively; for lake and reservoir water quality classification, the corresponding values were 96.7%, 96.2%, 95.4% and 95.8%. The proposed model can be used to classify and evaluate water quality in different water bodies.
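The SMOTE preprocessing step named in the abstract can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the function names `smote` and `balance`, the neighbour count `k`, and the two-feature toy samples are assumptions made only for illustration, and the GA tuning and CatBoost training stages are not shown. The sketch creates each synthetic minority sample by interpolating between a real minority sample and one of its nearest minority neighbours, until every class matches the size of the largest class.

```python
import random
from collections import Counter


def smote(samples, n_new, k=3, seed=0):
    """Return n_new synthetic points interpolated between minority samples.

    Each synthetic point lies on the segment joining a randomly chosen
    sample and one of its k nearest neighbours (squared Euclidean distance).
    Requires at least two samples in `samples`.
    """
    if len(samples) < 2:
        raise ValueError("SMOTE needs at least two minority samples")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(samples)
        # Rank the remaining minority samples by distance to the base point.
        others = [s for s in samples if s is not base]
        others.sort(key=lambda s: sum((a - b) ** 2 for a, b in zip(base, s)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic


def balance(X, y, k=3, seed=0):
    """Oversample every class up to the size of the largest class."""
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = list(X), list(y)
    for label, count in counts.items():
        if count < target:
            minority = [x for x, lab in zip(X, y) if lab == label]
            new_points = smote(minority, target - count, k=k, seed=seed)
            X_out.extend(new_points)
            y_out.extend([label] * len(new_points))
    return X_out, y_out


# Toy data (made up): two features per sample, six samples of a majority
# class "II" and three samples of a minority class "V".
X = [[8.1, 0.2], [7.9, 0.3], [8.4, 0.1], [7.6, 0.4], [8.0, 0.2], [8.2, 0.3],
     [2.1, 1.9], [2.5, 2.2], [1.8, 2.4]]
y = ["II"] * 6 + ["V"] * 3

X_bal, y_bal = balance(X, y)
print(Counter(y_bal))
```

Because every synthetic point is a convex combination of two real minority samples, the new points stay inside the range already covered by that class. In practice a maintained library implementation of SMOTE would normally be used instead of this sketch.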
Received: 04 December 2022