Transferable Hourly Ozone Forecasting with Transformers
Abstract. We investigate the suitability of a transformer-based approach for air-quality forecasting, focusing on 4-day ahead hourly predictions of surface ozone (O3). The study employs Google’s Temporal Fusion Transformer (TFT) to integrate meteorological predictors, historical pollutant observations, and static station metadata, using an open source implementation with minimal domain-specific preprocessing. The analysis addresses two questions: (1) how efficiently a transformer model can be deployed for regional air quality forecasting, and (2) how well the learned representations transfer across geophysically distinct regions.
Model performance is evaluated against state-of-the-art regional chemical transport model Copernicus Atmosphere Monitoring Service (CAMS) ensemble forecast using observations from Germany. The TFT consistently achieves lower bias and higher forecast skill across all lead times. Suburban monitoring sites exhibit the highest skill relative to CAMS based on RMSE and SMAPE-based metrics. Urban stations show moderate skill against CAMS baseline, while rural stations have reduced skill in comparison but remain positive across the full 96 h forecast, with the strongest improvements observed at shorter lead times. Post–day-1 results indicate a clear separation of performance by station type; suggesting increasing performance stratification by station type beyond day 1, with larger relative gains at urban and suburban sites and smaller but consistently positive skill at rural locations.
Geographic transferability is assessed by adapting a model trained over Germany to South Korea by retraining region-specific metadata embeddings while preserving learned temporal representations. Forecast errors increase by only 5–10 %, indicating that the model captures meteorological drivers of O3 variability that generalize across contrasting anthropogenic and climatic regimes. Ablation experiments further demonstrate the robustness of the chosen experimental configuration for both forecasting performance and cross region transferability.
This paper presents the application of a popular deep learning architecture – Temporal Fusion Transformer (TFT) – to ozone forecasting. The study initially focuses on Germany where long ozone records are available for training, and where TFT predictions can be compared to state-of-the-art physics-based ozone forecasts from the CAMS regional ensemble. The study then explores the geographic transferability of the trained TFT model to another region with fewer observations, in this case South Korea. Overall, the authors report improved skills compared to CAMS and reasonably good geographical transferability.
In recent years, the TFT architecture has been used for a variety of time series forecasting applications, apparently with a reasonably good success rate. This justifies exploring its skill for air pollution forecasting, and although the application of this specific type of model is not new (e.g. Hickman et al., 2023), the authors still propose some refinements here (e.g. station-level anthropogenic metadata). Therefore, the innovation of the paper probably lies more in the geographical transferability, although this should come with a more extended discussion of the results.
Overall, the paper is clear and well written (although some specific parts could be improved, see minor comments), and falls within the scope of GMD. I suggest accepting the paper for publication once the major issues described below have been addressed, which I think would strengthen the study.
Major comments:
The first major comment concerns the set-up chosen for the AI-versus-CAMS comparison, which I think currently represents a significant limitation: the AI forecasting model uses the ERA5 reanalysis as known future input, which would evidently not be available in an operational context and should thus have been replaced by a meteorological forecast. At least this is what I understood, but it remains partly confusing because the authors mention several times the importance of “meteorological forecast” as known future inputs (L114 and L722), whereas the data description only mentions the use of the ERA5 meteorological reanalysis. If the TFT model relies on meteorological reanalysis rather than on meteorological forecasts, then the comparison against the CAMS operational air quality forecast, which relies only on meteorological forecasts, is unfair. In that case, we can expect the AI forecast model to be less (or not at all) affected by the accumulation of errors in the meteorology, which is a key driver of O3 variability, as the authors themselves mention. Given that this comparison against CAMS is quite central to the paper, it would be important to ensure a fairer comparison, using operational meteorological forecasts as known future covariates (e.g. IFS or equivalent), and possibly operational meteorological analyses for past known covariates. At the very least, the authors should state this strong limitation very clearly, but in my view the paper would be much stronger if the ERA5 reanalysis were replaced by a meteorological forecast.
As a side comment, given that the authors evaluate the uncertainties obtained with the TFT model, a more comprehensive AI-versus-CAMS comparison would usefully and informatively compare them to the uncertainties of the CAMS ensemble, as derived from the spread of the individual members. Finally, I don’t think it is completely fair to compare an observation-based forecast relying on local station-based information to a pure CTM-based forecast at 10-km (thus quite coarse) resolution. A more appropriate comparison would have required using a CAMS forecast bias-corrected with local observations. I think CAMS already provides a CAMS-MOS forecast, but maybe only at a limited number of stations and probably not in 2023. Here again, this limitation should be highlighted more clearly.
The second major comment concerns the results of the transfer learning. If I understand correctly (please correct me if I am wrong, and adjust the text accordingly to avoid confusion), the RMSE at urban stations in South Korea is around 11 ppb (Fig. 6) while it is around 2 ppb in Germany (Fig. 3). This is a strong difference, roughly a factor 4, and I therefore don’t understand why the abstract talks about “only 5-10% increase of the forecast errors” when passing from Germany to South Korea. It is crucial to clarify this point.
Minor comments:
All figures: Please revise all figures and systematically include the units of the variable or metric shown. The font size and resolution of the figures should be increased so that they are readable without zooming. Besides that, the quality of some figures could be improved in general, especially the multi-panel plots, which are not aligned.
L151: About the use of forward-filling, why not use a simple linear interpolation? This seems to me already much better than repeating the last value.
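To make the difference between the two gap-filling options concrete, here is a minimal sketch in plain Python. The function names and the sample series are hypothetical and are not taken from the manuscript; the authors' actual preprocessing may differ.

```python
import math


def forward_fill(values):
    """Replace each NaN with the last valid value seen before it."""
    filled, last = [], math.nan
    for v in values:
        if not math.isnan(v):
            last = v
        filled.append(last)
    return filled


def linear_interpolate(values):
    """Fill interior NaN gaps on a straight line between valid neighbours."""
    filled = list(values)
    valid = [i for i, v in enumerate(values) if not math.isnan(v)]
    for left, right in zip(valid, valid[1:]):
        step = (values[right] - values[left]) / (right - left)
        for i in range(left + 1, right):
            filled[i] = values[left] + step * (i - left)
    return filled


# Hypothetical hourly O3 series (ppb) with a three-hour gap (illustrative only)
obs = [30.0, math.nan, math.nan, math.nan, 50.0]
ffilled = forward_fill(obs)             # gap held at 30 ppb, then a jump to 50
interpolated = linear_interpolate(obs)  # gap ramps smoothly from 30 to 50
```

Forward-filling introduces an artificial step at the end of the gap, whereas interpolation spreads the change across the missing hours. Note that interpolation uses the value after the gap, so it is only suitable for historical inputs, not for values that would be unknown at forecast time.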
L170: The authors mention that they split their dataset into train-validation-test sets along the temporal dimension, but they mention only a train-test split along the station dimension. Does this mean that model tuning is performed only on the “temporal” validation set (April 2015 to December 2022), while still considering the same 386 stations used for training? Please clarify and, if so, please explain why independent stations were held out only for testing and not also for validation.
L179: Only the test set provides an unbiased estimate of the skill of the predictive model. Therefore, although summer is indeed the most relevant season for O3 episodes, it would still be useful to have an idea of the performance of the model throughout the year, considering that it has been trained with samples distributed over the whole year and not only in summer. Footnote 5 suggests that the AI model performs similarly to CAMS in spring and winter. Could you provide more quantitative results for the spring, winter and fall seasons (and ideally some plots in the Appendix or Supplement)? Even if ozone episodes occur mostly in summer, I think this is still a relevant aspect given that some ozone episodes can occur outside the summer season during early or late heat waves, which are more frequent under climate change. (In an operational context, if the AI model is better than CAMS only in summer, this raises the question of when exactly the AI model starts and stops being more skilful.)
L205: Sample grouping: I am not sure I understand the new feature introduced here; please clarify this paragraph. Do you mean that in practice this new feature takes values from 1 to S, with S the total number of stations? If so, it would correspond to a unique identifier for each station, so how would it differ from the station code, which already encodes such unique information for each station?
Fig. 1: I don’t think the red rectangle brings much here.
Fig. 2: Please correct some missing punctuation in the legend. Also, I don’t understand why the authors are describing stations shown in panels d-e-f as “seen a-priori during training or validation” given that they previously said summer 2023 was used only for testing. And why “a-priori one or the other”? Please clarify.
Fig. 3: I don’t understand why, for a given station type (for instance urban stations), the RMSE of TFT shown in panel c (around 2.5 ppbv) is much lower than the one shown below for the 50th percentile (around 5-7 ppbv). The RMSEs also differ for CAMS. Please explain. Also, it would be better to keep the same colours for TFT and CAMS across all figures and panels; here they are swapped.
Also, results of the ozone forecast across all stations (here and elsewhere) are shown in terms of RMSE averaged over all stations. This tells little about the stations where the forecast may be the least skilful. Could the authors provide some information regarding the distribution (and not only the mean) of the RMSEs across the different stations?
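As an illustration of the kind of per-station summary meant here, a minimal sketch follows. The station identifiers, the values and the function names are hypothetical; the sketch only assumes that forecasts and observations can be grouped by station.

```python
import math
import statistics


def rmse(forecast, observed):
    """Root-mean-square error of one station's forecast series."""
    squared = [(f - o) ** 2 for f, o in zip(forecast, observed)]
    return math.sqrt(sum(squared) / len(squared))


def rmse_distribution(per_station):
    """Summarise the spread of station-level RMSEs, not only their mean.

    per_station maps a station id to a (forecast, observed) pair of lists.
    """
    scores = {sid: rmse(f, o) for sid, (f, o) in per_station.items()}
    values = sorted(scores.values())
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "mean": statistics.fmean(values),
        "q1": q1,
        "median": median,
        "q3": q3,
        "max": values[-1],
        "worst_station": max(scores, key=scores.get),
    }


# Hypothetical O3 values in ppb for three made-up stations (illustrative only)
stations = {
    "A": ([41.0, 49.0], [40.0, 50.0]),
    "B": ([43.0, 47.0], [40.0, 50.0]),
    "C": ([50.0, 40.0], [40.0, 50.0]),
}
summary = rmse_distribution(stations)
```

In this toy case the mean RMSE is pulled up by a single poorly forecast station. A box plot or a table of quantiles over the real station-level RMSEs would show whether something similar happens in the manuscript's results.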
Sect. 3.3.4: In this section it is not clear which components remain frozen. To illustrate more easily which components of the initial TFT model trained over Germany are frozen and which ones are fine-tuned with South Korean data, it would be useful to replicate fig. B1 indicating clearly for instance with a specific colour the components that are retrained.
It would be useful to explain in more detail how transfer learning and gating layers work, so that non-experts in AI can still understand.
L280: The authors should also provide results for the CRPS metric, as it is the most widely used in atmospheric forecasting applications. This would facilitate comparisons against other studies.
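For reference, a minimal sketch of the standard sample-based CRPS estimator for a single observation, CRPS = E|X - y| - 0.5 E|X - X'|, is given below. It assumes the forecast is available as a set of members or samples. The function name and the numbers are hypothetical, and a quantile-based forecast such as the TFT output would need a quantile-based approximation instead.

```python
def crps_ensemble(members, observation):
    """Sample-based CRPS for one observation: E|X - y| - 0.5 * E|X - X'|."""
    m = len(members)
    mean_abs_error = sum(abs(x - observation) for x in members) / m
    mean_abs_spread = sum(abs(a - b) for a in members for b in members) / (m * m)
    return mean_abs_error - 0.5 * mean_abs_spread


# Hypothetical two-member forecast (ppb) bracketing the observation
crps_two = crps_ensemble([40.0, 50.0], 45.0)

# With a single member the CRPS reduces to the absolute error
crps_single = crps_ensemble([50.0], 45.0)
```

Because the CRPS reduces to the absolute error for a deterministic forecast, it allows probabilistic and deterministic forecasts to be compared on the same scale and in the same units (here ppb).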
L298-304: Revise these sentences, the formulation is unclear and the English quite poor. In particular, “performance peak” does not mean much to me in the context of this paragraph.
L310: Please provide the percentage improvement at rural stations.
L312-313: TFT still shows substantially better performance than CAMS here (I would say maybe roughly 15% improvement), this is not so much reflected by the tone of this sentence.
Footnote 7: Change for “anthropogenic forcing”.
L314: Missing reference to Fig. 4.
Fig. 5: Increase the resolution and font size of the figure (and resolution could probably be increased in several other figures).
L335: Fig. 8 should probably come before to be consistent with the order of appearance in the text.
Sect. 4.3: The authors highlight geographical transferability as one of the main contributions of the paper, but the analysis and discussion of the transfer learning results remain very short in the main text (only a few lines, from L337 to L344). I would suggest extending it somewhat, perhaps by moving part of the results discussed in the Appendix into the main document. One specific issue of this section is that there is no benchmark forecast against which to compare the TFT model fine-tuned with local data. The authors could possibly consider training the same TFT model directly on South Korea, in the same way as they trained it over Germany, and comparing the performance of both approaches, or use a simpler approach.
L342-343: I don’t understand the part on “an RMSE of 11 ppb […] when compared against CAMS deterministic global forecasts”, these global CAMS forecasts have not been introduced before and are not shown on the figures. Please reformulate in a clearer way.
L345: Why is this discussed here, in the section on the transferability of the model to South Korea? It does not seem related and should probably be placed in a dedicated section on variable importance. It is not clear whether this variable importance concerns the model trained over Germany or the one fine-tuned over South Korea. Please clarify.
L357: I really don’t see the interest of this comparison against CAMS stratospheric ozone. At least the authors should have considered the CAMS tropospheric ozone column, not the total column, or better still the surface concentration; this comparison against the total column does not make much sense to me.
L363: Where are AI and CAMS comparisons made on elevated ozone episodes? As far as I understand, results are mostly evaluated using RMSE which does not provide insights on these episodes. Some categorical metrics on ozone exceedances above regulatory threshold would be required here to support this statement.
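To illustrate the categorical metrics meant here, a minimal sketch of contingency-table scores for threshold exceedances follows. The threshold of 60 ppb and the sample values are purely illustrative assumptions, not a regulatory value or data from the manuscript, and the function name is hypothetical.

```python
def exceedance_scores(forecast, observed, threshold):
    """Probability of detection, false alarm ratio and critical success index."""
    hits = misses = false_alarms = 0
    for f, o in zip(forecast, observed):
        if o > threshold and f > threshold:
            hits += 1
        elif o > threshold:
            misses += 1
        elif f > threshold:
            false_alarms += 1
    nan = float("nan")
    pod = hits / (hits + misses) if hits + misses else nan
    far = false_alarms / (hits + false_alarms) if hits + false_alarms else nan
    total = hits + misses + false_alarms
    csi = hits / total if total else nan
    return {"pod": pod, "far": far, "csi": csi}


# Hypothetical O3 values in ppb; 60 ppb is an arbitrary illustrative threshold
observed = [70.0, 65.0, 50.0, 40.0, 80.0]
forecast = [72.0, 55.0, 62.0, 38.0, 90.0]
scores = exceedance_scores(forecast, observed, threshold=60.0)
```

Computing such scores for both TFT and CAMS, with the relevant regulatory threshold and averaging period, would directly support or refute the statement on elevated ozone episodes.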
L365: Which minor degradation are the authors referring to here? (see my previous comment on the RMSE increased by x4 in South Korea compared to Germany).
Table A.2: Why use a climatological mean emission, which is likely not the most accurate information about emissions around a given station in a given year?
D1: The authors should refer to this figure at the beginning of Appendix D and include the AI model performance before the ablation to facilitate the comparison. Also, it would be interesting to know how this ablation affects the probabilistic forecast, through the WIS for instance. Finally, I don’t understand why D2 is not merged with D1.1, as both treat the same ablation aspects; this should be reorganised, as it is a bit confusing right now, with transfer learning (D1.2) in the middle.
L711: Did the authors try longer windows? Does this improve the skill?
L717: This aligns with the limitation mentioned before, that the TFT model probably benefits significantly from relying on reanalysis meteorology instead of forecast meteorology. I don’t understand why the authors now say “These results highlight the importance of conditioning on forecast meteorology” if they are not using such a forecast but the ERA5 reanalysis; please reformulate to avoid confusion.
L775: “…product, a proxy…”.