Is replacing missing values of PM2.5 constituents with estimates using machine learning better for source apportionment than exclusion or median replacement?

Environmental Pollution (2024)

Abstract
East Asian countries have been conducting source apportionment of fine particulate matter (PM2.5) by applying positive matrix factorization (PMF) to hourly constituent concentrations. However, some of the constituent data from the supersites in South Korea were missing due to instrument maintenance and calibration. Conventional preprocessing of missing values, such as exclusion or median replacement, causes biases in the estimated source contributions by changing the PMF input. Machine learning (ML) can estimate the missing values by training on constituent data, meteorological data, and gaseous pollutants. Complete data from the Seoul Supersite in 2018 were taken, and a random 20% was set as missing. PMF was performed by replacing missing values with estimates. Percent errors of the source contributions were calculated relative to those estimated from complete data. Missing values were estimated using a random forest analysis. Estimation accuracy (r²) was as high as 0.874 for missing carbon species and as low as 0.631 when ionic species and trace elements were missing. For the seven highest contributing sources, replacing the missing values of carbon species with estimates minimized the percent errors to 2.0% on average. However, replacing the missing values of the other chemical species with estimates increased the percent errors to more than 9.7% on average. Percent errors were maximal at 37% on average when missing values of ionic species and trace elements were replaced with estimates. Missing values of species other than carbon species therefore need to be excluded. This approach reduced the percent errors to 7.4% on average, which was lower than those due to median replacement. Our results show that reducing the biases in source apportionment is possible by replacing the missing values of carbon species with estimates. To reduce the biases due to missing values of the other chemical species, the estimation accuracy of the ML needs to be improved.
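The evaluation design described in the abstract (mask a random 20% of complete data, fill the gaps, then score the result against the complete-data reference) can be sketched roughly as follows. This is only an illustrative outline under several assumptions: the data are synthetic, the helper names (`mask_random`, `median_replace`, `r_squared`, `percent_error`) are hypothetical, median replacement stands in for the fill step, and the mean concentration stands in for a PMF source contribution. Neither the random forest estimator nor PMF itself is reproduced here.

```python
import random
import statistics


def mask_random(n, fraction=0.2, seed=0):
    """Pick a random subset of indices to treat as missing (hypothetical helper)."""
    rng = random.Random(seed)
    return sorted(rng.sample(range(n), round(n * fraction)))


def median_replace(values, missing_idx):
    """Fill masked entries with the median of the remaining observed entries."""
    missing = set(missing_idx)
    observed = [v for i, v in enumerate(values) if i not in missing]
    med = statistics.median(observed)
    return [med if i in missing else v for i, v in enumerate(values)]


def r_squared(truth, estimate):
    """Coefficient of determination of the estimates against held-out truth."""
    mean_t = statistics.fmean(truth)
    ss_res = sum((t - e) ** 2 for t, e in zip(truth, estimate))
    ss_tot = sum((t - mean_t) ** 2 for t in truth)
    return 1.0 - ss_res / ss_tot


def percent_error(value, reference):
    """Absolute percent error of a value relative to the complete-data reference."""
    return 100.0 * abs(value - reference) / abs(reference)


# Synthetic hourly concentrations for one constituent (assumption, not real data).
data_rng = random.Random(42)
complete = [data_rng.lognormvariate(1.0, 0.5) for _ in range(1000)]

# Set a random 20% as missing, then fill the gaps with the median.
missing_idx = mask_random(len(complete), fraction=0.2, seed=1)
filled = median_replace(complete, missing_idx)

# Score the fill on the masked points only.
truth = [complete[i] for i in missing_idx]
estimate = [filled[i] for i in missing_idx]
fill_r2 = r_squared(truth, estimate)

# Stand-in for the source-contribution comparison: mean of filled vs. complete data.
mean_error = percent_error(statistics.fmean(filled), statistics.fmean(complete))
print(f"masked points: {len(missing_idx)}")
print(f"r2 of median fill: {fill_r2:.3f}")
print(f"percent error of mean: {mean_error:.2f}%")
```

Because a median fill predicts the same value for every masked point, its r² on the masked points cannot exceed zero. That is one way to see why an estimator that tracks hour-to-hour variation can do better than median replacement when its accuracy is high enough.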
Keywords
PM2.5 constituents,Missing value estimation,Machine learning,Random Forest,Source apportionment,Positive matrix factorization (PMF)