Improving the quantification of peak concentrations for air quality sensors via data weighting
Abstract. Traditional calibration models for low-cost air quality sensors tend to under-predict peak concentrations. We assessed the utility of adding data weights to low-cost sensor colocation data to improve the quantification of peak concentrations. Specifically, we explored the effects of data weighting on three different pollutant colocation datasets: total volatile organic compounds, carbon monoxide, and methane. Using two different weighting functions, a sigmoidal and a piecewise weighting regime, we explored the impacts of the base model choice (multilinear regression vs. random forest models), the sensitivity of the weighting functions to their input parameters, and the ability of data weighting to improve high-concentration pollution measurements. Compared to unweighted colocation data, data weighting yielded significant reductions in both error (root mean square error, RMSE) and bias (mean bias error, MBE) for pollutant peaks across all three datasets. For the top percentile of data, we observed an average 23 % reduction in RMSE and an average 35 % reduction in MBE when optimal weights were employed. Larger reductions occurred in the 95th–99th percentile of data, where MBE was reduced by an average of 70 % and RMSE by an average of 26 %. However, data weighting can also generate larger errors at baseline pollutant concentrations. Data weighting regimes were sensitive to input parameters, and input weighting functions may be tuned to better predict peak concentration data without significant reductions in the fidelity of baseline pollutant predictions.
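The weighting idea summarized in the abstract can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and is not taken from the study: the functional forms and parameter names of `sigmoid_weight` and `piecewise_weight`, the synthetic saturating sensor response, the peak threshold, and the use of a single-predictor weighted least-squares fit in place of the multilinear regression and random forest models named above.

```python
import math

# Hypothetical weighting functions; the forms and parameters are illustrative
# assumptions, not the definitions used in the study.
def sigmoid_weight(c, midpoint, steepness, max_weight):
    """Weight rising smoothly from about 1 to max_weight around `midpoint`."""
    return 1.0 + (max_weight - 1.0) / (1.0 + math.exp(-steepness * (c - midpoint)))

def piecewise_weight(c, threshold, peak_weight):
    """Weight of 1 below `threshold` and `peak_weight` at or above it."""
    return peak_weight if c >= threshold else 1.0

def weighted_linear_fit(x, y, w):
    """Closed-form weighted least squares for y ~ a + b * x."""
    sw = sum(w)
    mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
    my = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
    b = sxy / sxx
    return my - b * mx, b

def rmse(pairs):
    """Root mean square error over (prediction, reference) pairs."""
    return math.sqrt(sum((p - r) ** 2 for p, r in pairs) / len(pairs))

def mbe(pairs):
    """Mean bias error (prediction minus reference) over the pairs."""
    return sum(p - r for p, r in pairs) / len(pairs)

# Synthetic colocation data (assumed): many baseline points, two peak points,
# and a sensor signal that saturates at high concentration.
PEAK_THRESHOLD = 10.0
reference = [1.0, 2.0, 3.0, 4.0, 5.0] * 10 + [40.0, 40.0]
signal = [c / (1.0 + 0.02 * c) for c in reference]

def evaluate(weights):
    """Fit with the given weights and score peak and baseline points separately."""
    a, b = weighted_linear_fit(signal, reference, weights)
    pred = [a + b * x for x in signal]
    pairs = list(zip(pred, reference))
    peak = [pr for pr in pairs if pr[1] >= PEAK_THRESHOLD]
    base = [pr for pr in pairs if pr[1] < PEAK_THRESHOLD]
    return {"peak_rmse": rmse(peak), "peak_mbe": mbe(peak), "base_rmse": rmse(base)}

unweighted = evaluate([1.0] * len(reference))
piecewise = evaluate([piecewise_weight(c, PEAK_THRESHOLD, 10.0) for c in reference])
sigmoidal = evaluate([sigmoid_weight(c, 20.0, 0.5, 10.0) for c in reference])

for name, scores in [("unweighted", unweighted), ("piecewise", piecewise),
                     ("sigmoidal", sigmoidal)]:
    print(name, {k: round(v, 3) for k, v in scores.items()})
```

On this toy dataset the unweighted fit under-predicts the peak points, and up-weighting them pulls the fit toward the peaks at the cost of a larger baseline error, which mirrors the trade-off described above. With real colocation data, the same weights could be supplied to any model that accepts per-sample weights.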