Air quality data restoration based on graph regularization multi-view functional matrix completion
GAO Hai-yan1,2, MA Wen-juan1
1. School of Statistics and Data Science, Lanzhou University of Finance and Economics, Lanzhou 730020, China; 2. Key Laboratory of Digital Economy and Social Computing Science, Lanzhou 730020, China
Abstract Sensor malfunctions and data transmission problems often leave collected air quality data sparse and incomplete. To repair and reconstruct the missing parts of air quality data, a Graph Regularized Multi-view Functional Matrix Completion (GRMFMC) method is proposed. First, the method introduces a graph regularization approach that accounts for the high-order neighborhood relationships within each pollutant’s sample set, reducing information loss. Second, it uses the Hilbert-Schmidt Independence Criterion (HSIC) to capture complementary information among the pollutants, thereby improving imputation accuracy. In addition, drawing on functional data analysis, GRMFMC treats temporal air quality data as continuous functions and exploits their inherent smoothness and correlation for high-precision interpolation. Simulated imputation experiments and empirical applications on real air quality datasets both show that GRMFMC achieves superior imputation performance. In the simulations, GRMFMC reduces the imputation error by 56% to 99% in RMSE and 46% to 98% in NRMSE; in the empirical applications, it reduces the error by 51% to 99% in RMSE and 40% to 98% in NRMSE. Furthermore, GRMFMC remains robust across different missing rates and pollutant categories, which supports its generalization capability and practical value in professional settings.
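The abstract names HSIC and the RMSE/NRMSE error measures without defining them. As a rough illustration only, the sketch below computes the biased empirical HSIC estimator, trace(KHLH)/(n-1)^2, and evaluates RMSE and NRMSE on held-out entries. The Gaussian kernel, the bandwidth `sigma`, the range-based NRMSE normalization, and all function names are assumptions made for this example. The abstract does not specify how GRMFMC uses HSIC in its objective, so this is not the authors' implementation.

```python
import numpy as np


def gaussian_kernel(x, sigma=1.0):
    """Gaussian kernel matrix for the rows of x (n_samples x n_features).

    The kernel family and bandwidth are assumptions for this sketch.
    """
    sq_norms = np.sum(x ** 2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (x @ x.T)
    sq_dists = np.maximum(sq_dists, 0.0)  # guard against tiny negative values
    return np.exp(-sq_dists / (2.0 * sigma ** 2))


def hsic(x, y, sigma=1.0):
    """Biased empirical HSIC between two views: trace(K H L H) / (n - 1)^2."""
    n = x.shape[0]
    k = gaussian_kernel(x, sigma)
    l = gaussian_kernel(y, sigma)
    h = np.eye(n) - np.ones((n, n)) / n  # centering matrix
    return float(np.trace(k @ h @ l @ h) / (n - 1) ** 2)


def rmse(truth, imputed, missing_mask):
    """Root mean squared error over the entries flagged as missing."""
    diff = (truth - imputed)[missing_mask]
    return float(np.sqrt(np.mean(diff ** 2)))


def nrmse(truth, imputed, missing_mask):
    """RMSE divided by the range of the held-out true values.

    Range normalization is an assumption; other conventions exist.
    """
    held_out = truth[missing_mask]
    return rmse(truth, imputed, missing_mask) / float(held_out.max() - held_out.min())


# Toy usage: one "pollutant view" that depends on another, and one that does not.
rng = np.random.default_rng(0)
view_a = rng.normal(size=(200, 1))
view_dependent = view_a + 0.1 * rng.normal(size=(200, 1))
view_independent = rng.normal(size=(200, 1))
print("HSIC (dependent views):  ", hsic(view_a, view_dependent))
print("HSIC (independent views):", hsic(view_a, view_independent))
```

A larger HSIC value indicates stronger statistical dependence between two views, which is why the criterion can be used to compare how much information different pollutants share.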
Received: 03 March 2024