Research on AI-based algorithms for atmospheric carbon dioxide data imputation and reconstruction

GU Tao-feng, YUE Hai-yan, XIONG Zi-li, LIANG Jian-ping, WU Yan-ling, LIU Li-xin

China Environmental Science ›› 2026, Vol. 46 ›› Issue (3) : 1288-1297.

Air Pollution Control

Abstract

To address missing data in atmospheric CO2 concentration observations, this study systematically evaluated 12 mainstream machine learning algorithms for data imputation, using CO2 measurements collected at heights of 30 m (L1 layer) and 80 m (L2 layer) at the Longfengshan Regional Atmospheric Background Station from January 2009 to June 2022. An optimized ensemble machine learning method was proposed, with the distance-weighted K-Nearest Neighbors (KNeighborsDist) algorithm at its core. The method reconstructs missing data from synchronous observations at the two co-located height layers. Validation of the reconstructed dataset showed excellent performance (R² ≥ 0.98), indicating close agreement between the reconstructed greenhouse gas data and in-situ measurements and confirming the effectiveness and accuracy of the model for greenhouse gas data reconstruction. The method filled the gaps in the original Longfengshan Station dataset, producing a continuous 5-minute-interval CO2 dataset with a completeness rate of 100%, while preserving the long-term trend and the seasonal, monthly, and diurnal variation characteristics of CO2. The proposed AI-based data reconstruction algorithm is applicable to other atmospheric background stations and can provide high-quality data support for regional climate change analysis, carbon emission accounting, and inversion research.
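The core idea in the abstract, filling a gap at one height from the synchronous record at the other height with distance-weighted K-Nearest Neighbors, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `knn_dist_predict` and `fill_layer`, the use of the other layer's concentration as the only feature, inverse-distance weights, and the toy values are all assumptions made for illustration.

```python
from typing import List, Optional, Sequence, Tuple


def knn_dist_predict(train: Sequence[Tuple[float, float]], x: float, k: int = 3) -> float:
    """Distance-weighted KNN regression on a single feature.

    `train` holds (feature, target) pairs; `x` is the query feature.
    Hypothetical helper, not taken from the paper.
    """
    # Pick the k training pairs whose feature is closest to the query.
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    # An exact feature match would give an infinite weight,
    # so in that case return the mean target of the exact matches.
    exact = [t for f, t in nearest if f == x]
    if exact:
        return sum(exact) / len(exact)
    # Inverse-distance weights: closer neighbours count more.
    weights = [1.0 / abs(f - x) for f, _ in nearest]
    return sum(w * t for w, (_, t) in zip(weights, nearest)) / sum(weights)


def fill_layer(
    target: Sequence[Optional[float]],
    other: Sequence[Optional[float]],
    k: int = 3,
) -> List[Optional[float]]:
    """Fill gaps (None) in `target` using synchronous values from `other`."""
    # Train only on time steps where both layers were observed.
    train = [(o, t) for t, o in zip(target, other) if t is not None and o is not None]
    filled: List[Optional[float]] = []
    for t, o in zip(target, other):
        if t is None and o is not None and train:
            filled.append(knn_dist_predict(train, o, k))
        else:
            # Keep observed values; leave the gap if the other layer is missing too.
            filled.append(t)
    return filled


# Toy synchronous records (ppm); values are made up for illustration.
l1_demo = [410.0, 412.0, None, 416.0, 418.0]   # 30 m layer with one gap
l2_demo = [409.0, 411.0, 413.0, 415.0, 417.0]  # 80 m layer, complete
filled = fill_layer(l1_demo, l2_demo, k=2)
```

In this toy case the gap in `l1_demo` is filled from the two L1 values whose L2 counterparts are closest to 413.0. A time step where both layers are missing stays empty in this sketch, so reaching the 100% completeness reported in the abstract would need additional predictors or models that are not shown here.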

Key words

carbon dioxide / machine learning algorithm / data filling / data set / Longfengshan

Cite this article

GU Tao-feng, YUE Hai-yan, XIONG Zi-li, LIANG Jian-ping, WU Yan-ling, LIU Li-xin. Research on AI-based algorithms for atmospheric carbon dioxide data imputation and reconstruction[J]. China Environmental Science. 2026, 46(3): 1288-1297
