the Creative Commons Attribution 4.0 License.
AIFS 1.1.0: An update to ECMWF's machine-learned weather forecast model AIFS
Abstract. We present an update to ECMWF's machine-learned weather forecasting model AIFS Single with several key improvements. The model now incorporates physical consistency constraints through bounding layers, an updated training schedule, and an expanded set of variables. The physical constraints substantially improve precipitation forecasts and the new variables show a high level of skill. Upper-air headline scores also show improvement over the previous AIFS version. The AIFS has been fully operational at ECMWF since the 25th of February 2025.
Status: final response (author comments only)
- CEC1: 'Comment on egusphere-2025-4716 - No compliance with the policy of the journal', Juan Antonio Añel, 07 Dec 2025
RC1: 'Comment on egusphere-2025-4716', Anonymous Referee #1, 05 Jan 2026
AIFS 1.1.0 - An update to ECMWF's machine-learned weather forecast model AIFS
Recommendation: Major Revisions Required
SUMMARY
This manuscript documents an impressive operational update to ECMWF's AIFS, which has already demonstrated remarkable skill in medium-range weather prediction. The revised version introduces three valuable enhancements: (1) activation function-based bounding layers to enforce physical constraints, (2) expansion of prognostic variables to include land surface and energy sector fields, and (3) optimization of the training schedule. The operational deployment at ECMWF represents a substantial achievement, demonstrating the maturity and production-readiness of machine learning weather prediction. AIFS 1.1.0 shows consistent improvements of 4-6% across variables and lead times, with particularly impressive gains in precipitation forecasting.
The manuscript effectively documents these advances and will make a strong contribution to the literature. However, several technical aspects require clarification. The paper makes mechanistic claims about model behavior without adequate theoretical support, attributes performance improvements to specific design choices without proper experimental controls, and presents critical design decisions without quantitative justification. These revisions involve analysis and clarification rather than expensive retraining experiments. With these technical clarifications, the manuscript will provide an excellent reference for operational machine learning weather prediction systems.
MAJOR ISSUES
1. Bounding Layer Mechanism: Insufficient Mechanistic Explanation
The manuscript's primary scientific contribution concerns the bounding layer strategy (Section 3.2, lines 221-227). The authors claim that applying ReLU activation functions to precipitation outputs "facilitates the learning of forecasting for sparse and intermittent variables" by enabling negative output space to serve as a proxy for no-rain classification. Figure 12 presents compelling evidence of structured spatial patterns in pre-activation space, showing strongly negative values over arid regions (Sahara Desert) with smooth gradients near precipitation events.
This observation represents the paper's most interesting result, yet receives inadequate mechanistic analysis. During backpropagation, the ReLU derivative is zero for all negative inputs (∂ReLU/∂x = 0 for x < 0), suggesting gradient information should not flow to regions predicting negative values. How, then, does the model develop sophisticated spatial structure visible in Figure 12? What is the mathematical relationship between the MSE loss gradient and the learned negative space structure? Does this structure emerge from autoregressive training, where negative predictions at intermediate steps influence subsequent forecasts through rollout? Can the information content in the negative space be quantified? How does this compare to alternative formulations such as LeakyReLU (mentioned line 391 but not evaluated)?
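The dead-gradient point can be made concrete with a minimal numerical sketch (all values are illustrative, not from the manuscript): for a squared-error loss behind a ReLU bounding layer, the gradient with respect to the pre-activation is exactly zero wherever the pre-activation is negative, even at points where rain was observed.

```python
import numpy as np

# Illustrative pre-activations z at four grid points and "observed" precipitation y.
z = np.array([-3.0, -0.1, 0.2, 1.5])   # pre-activation, before the ReLU bounding layer
y = np.array([ 0.0,  2.0, 0.0, 1.0])   # observed precipitation

p = np.maximum(z, 0.0)                  # ReLU-bounded forecast
# Per-point squared-error gradient wrt z: dL/dz = 2 * (p - y) * 1[z > 0]
grad = 2.0 * (p - y) * (z > 0)

# The gradient vanishes wherever z < 0 -- including the second point, where
# rain was observed (y = 2.0) but the pre-activation is negative, so a
# single-step MSE update cannot directly shape that negative space.
```

One candidate pathway, raised in the questions above, is autoregressive rollout: a clipped zero prediction at an intermediate step feeds into subsequent steps, so gradients can still reach the negative pre-activations indirectly.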
The authors should provide a mechanistic explanation for the bounding layer's behavior. This need not require expensive additional experiments; rather, a clear theoretical model of gradient flow during rollout training would suffice. At minimum, the structured negative space in Figure 12 deserves quantitative analysis rather than qualitative description. Understanding this mechanism is critical for generalizing the approach to other sparse variables.
2. Performance Attribution: Confounded Experimental Design
Lines 88-91 state that the revised training schedule (direct fine-tuning on operational analysis rather than sequential ERA5→ERA5→operational) "results in better forecast performance." This causal claim cannot be substantiated from the presented evidence. The revised model simultaneously modifies training schedule, adds new prognostic variables, implements bounding layers, and adjusts learning rate schedules. Section 4.1 attempts to isolate bounding layer effects through Figure 11, but this comparison still includes differences in training data extent (1979-2022 vs shorter periods) and other modifications.
The authors should either remove unsupported causal claims or clearly state that performance improvements result from combined system modifications. Revising "results in better forecast performance" to "is associated with improved forecast performance" would be appropriate. I recognize comprehensive ablation studies are computationally expensive, but making causal claims without supporting evidence is scientifically inappropriate. At minimum, explicitly acknowledging the confounded nature of these comparisons would improve scientific rigor.
3. Subjective Design Tradeoffs: Insufficient Quantification
Lines 254-256 and 400-410 acknowledge a "subjective compromise between forecast realism and forecast skill measured by RMSE," where more aggressive rollout fine-tuning could improve headline scores at the cost of spatial field characteristics. The authors base this decision on spectral analysis not presented in the manuscript. What specific spectral characteristics were prioritized? What magnitude of RMSE degradation was accepted to achieve desired spectral properties? Without quantification, readers cannot assess the appropriateness of this tradeoff or replicate the training procedure.
The spectral analysis underlying this design choice should be presented. Representative power spectra comparing aggressive fine-tuning versus the chosen configuration would clarify the tradeoff. Quantifying the approximate magnitude of this compromise requires no additional training runs—only analysis of existing model outputs. This documentation is essential for reproducibility.
4. Physical Conservation Properties: Missing Diagnostics
Lines 188-190 acknowledge that the bounding strategy does not enforce mass or energy conservation, dismissing this with "we did not consider other constraints such as energy or mass conservation." For a production forecasting system deployed operationally, this warrants more thorough treatment. Do 10-day integrations accumulate substantial mass or energy errors? How do these compare to the physics-based IFS? Does violation magnitude correlate with forecast error? These questions can be addressed through straightforward post-hoc analysis requiring no model retraining. Brief discussion of whether conservation violations matter for medium-range prediction timescales would strengthen the manuscript.
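As one example of the kind of post-hoc diagnostic this could involve, a global area-weighted mean of a conserved quantity (e.g. total column water) can be tracked across the rollout; the function and values below are illustrative, not taken from the paper.

```python
import numpy as np

def global_mean(field, lats):
    """Area-weighted global mean on a regular lat-lon grid (a sketch of the
    post-hoc conservation diagnostic suggested above; names are illustrative)."""
    w = np.cos(np.deg2rad(lats))[:, None] * np.ones_like(field)
    return float(np.sum(field * w) / np.sum(w))

# Drift in, e.g., total column water over a 10-day rollout would be the
# difference of this diagnostic between the first and last forecast step.
lats = np.linspace(-89.5, 89.5, 180)
step0 = np.full((180, 360), 25.0)    # kg m^-2, toy initial state
step40 = np.full((180, 360), 24.7)   # toy day-10 state
drift = global_mean(step40, lats) - global_mean(step0, lats)
```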
5. Missing Contextualization with Related Work
The manuscript does not adequately position this work within the broader context of physically-constrained machine learning weather models. Multiple operational systems now implement similar physical constraints. The CREDIT framework (Chen et al., 2025, Nature Communications Climate, https://www.nature.com/articles/s41612-025-01125-6; Chen et al., 2025, Geophysical Research Letters, https://agupubs.onlinelibrary.wiley.com/doi/10.1029/2025MS005138) implements comparable bounding strategies for physical constraints in operational settings. Harder et al. (2024) provides theoretical framework for hard-constraint approaches. Kent et al. (2025) and Bonev et al. (2025) present alternative constraint methodologies.
Particularly relevant is recent work by Sha et al. (2025, Geophysical Research Letters, https://agupubs.onlinelibrary.wiley.com/doi/10.1029/2025GL118478) addressing the identical problem of blurry precipitation forecasts and drizzle bias in AIWP models. While AIFS uses ReLU activation functions for bounding, Sha et al. demonstrate that terrain-following coordinates combined with global mass and energy conservation constraints provide comparable benefits for reducing drizzle bias and improving extreme precipitation forecasts. This represents an alternative technical approach to the same fundamental challenge, suggesting multiple pathways exist for addressing sparse variable prediction in ML weather models.
The manuscript would benefit from discussing how this implementation compares to related approaches in GraphCast, Pangu-Weather, FuXi, and CREDIT models. Do other systems exhibit similar light precipitation biases? How do different architectural choices and constraint methodologies affect this common challenge? Comparing the ReLU bounding approach to alternatives like terrain-following coordinates or other constraint formulations would strengthen the contribution by clarifying the relative advantages and positioning the work within the field.
MINOR ISSUES
1. Loss Weight Justification: Loss scaling factors in Table 1 (lines 126-133) are described as "chosen empirically," which is reasonable. However, line 129 states that vertical velocity and soil moisture are "deliberately down-weighted," implying specific motivation. Brief rationale would improve clarity.
2. Unsupported Generalizations: Lines 173-175 claim a "well-known characteristic of machine learning-based forecasts: a tendency to produce overly smooth spatial fields." Either provide specific citations (e.g., Ben Bouallègue et al., 2024; Bonavita, 2024) or remove "well-known."
3. Line 253 states tropical cyclone performance "is similar to that of the previous version" without supporting data.
4. Statistical Significance: Several PSS comparisons in Figure 3 between revised AIFS and IFS at 144-hour forecasts show marginal differences. Claims of "improvement" should be verified as statistically significant.
5. Case Study Limitations: Section 4.2 case studies effectively demonstrate model capability but represent illustrative examples rather than systematic validation. How many extreme events were evaluated? What is the false alarm rate?
RECOMMENDATION
I recommend major revisions. The manuscript documents an important operational system and makes valuable contributions to ML weather prediction. However, it makes scientific claims about mechanisms and causality without adequate supporting evidence. The revisions do not necessitate expensive retraining; rather, they require theoretical analysis, quantification of existing results, and proper contextualization within related work. With these technical clarifications, this manuscript will make a strong contribution documenting a significant advance in operational ML weather forecasting.
Citation: https://doi.org/10.5194/egusphere-2025-4716-RC1
RC2: 'Comment on egusphere-2025-4716', Anonymous Referee #2, 11 Feb 2026
# Overall
It’s nice to see ECMWF documenting their new model version. The main changes have been some new input and output variables, and an approach for enforcing positivity and boundedness of some of the diagnostic outputs like precipitation. While it is great to see operational models documented, I have concerns about the scientific rigor of their results, and beyond that am not sure there is sufficient novelty to justify publication in GMD. Overall, this feels like an incremental update, and I think the short abstract and narrow focus on a few skill metrics reflect this. With significant revisions and new experiments/analyses this *might* be suitable for publication, but it is borderline.
# General comments

1. I think the main benefit of ReLU training is that it reduces the small positive drizzle, which is nice and likely robust to my concerns below. It's unclear how this impacts the skill scores, though, or whether it could be achieved by other means, e.g. thresholding all precip < 0.1 mm/day to 0 in the old AIFS.
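For reference, the suggested post-hoc alternative is a one-line thresholding operation (the 0.1 mm/day cutoff is the illustrative value mentioned above; function name is made up):

```python
import numpy as np

def zero_drizzle(precip, thresh=0.1):
    """Post-hoc clipping: set accumulations below `thresh` (mm/day) to zero.
    A sketch of the baseline suggested in the comment, not the paper's method."""
    out = precip.copy()
    out[out < thresh] = 0.0
    return out

field = np.array([0.0, 0.03, 0.09, 0.1, 2.5])
cleaned = zero_drizzle(field)   # light drizzle removed, real rain untouched
```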
2. I’m surprised in 2026 to be reviewing a paper on AI weather forecasting without probabilistic skill assessments. The introduction acknowledges this but then goes on to employ known blurring techniques like multistep training with an MSE loss. Figure 15 really highlights the blurriness issue. This confounds all their skill scores and makes it impossible to conclude whether their proposed modifications are genuinely helpful or the model is just blurrier. The manuscript shows no results that would contradict this, like power spectra or precipitation PDFs. Many of their skill improvements could alternatively be achieved by ensemble averaging. I have some hope that their findings are robust, since the skill is also better at short lead times where the blurring impact is less important, but this still needs to be assessed better. One possible idea, short of a full-blown probabilistic assessment, is to compare probabilistic scores on lagged ensembles built from their existing deterministic hindcasts [1].
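A lagged-ensemble check of this kind could reuse existing deterministic runs; the sketch below builds a toy three-member lagged ensemble and scores it with the empirical CRPS (all values illustrative):

```python
import numpy as np

def crps_ensemble(members, obs):
    """Empirical CRPS for one scalar observation:
    CRPS = mean_i |x_i - y| - 0.5 * mean_{i,j} |x_i - x_j|."""
    members = np.asarray(members, dtype=float)
    term1 = np.mean(np.abs(members - obs))
    term2 = 0.5 * np.mean(np.abs(members[:, None] - members[None, :]))
    return term1 - term2

# Hypothetical lagged ensemble: deterministic forecasts valid at the same time,
# initialised 0 h, 12 h and 24 h earlier (values and names are illustrative).
lagged_members = [2.1, 1.8, 2.6]
obs = 2.0
crps = crps_ensemble(lagged_members, obs)
```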
3. I also have some concerns about the novelty of the proposed methods. Adding a few input/output variables (some of which other models already include) and enforcing precip > 0 using a ReLU are valuable model developments, but seem incremental. I would be surprised if others aren't already doing this, though I will admit to not finding a specific reference off the top of my head.
# Specific comments

Figure 2.
The use of scatter plots for plotting precipitation or any other map increases the PDF size greatly and, more importantly, adds visual noise that hinders assessment of fine-scale structures. I understand it is not entirely trivial to plot data on the octahedral grid, but surely there is some way to make quad-mesh or contour plots.
**L37–41**
> “Although such MSE‑trained forecast models have been shown to smooth forecast fields at longer lead times to avoid the double‑penalty of incorrectly positioned weather phenomena (Lam et al., 2023; Ben Bouallègue et al., 2024; Lang et al., 2024a; Bonavita, 2024)…”
It would be appropriate to also cite the Brenowitz _et al._ work on lagged ensembles, which explicitly links field blurring to multistep fine‑tuning.
---
**L113–116**
> We have increased the characterization of the land surface in the model by including new prognostic variables
Do these new variables materially influence forecast skill? A targeted sensitivity experiment would be informative.
---
**L218–221**
> “Clipping the precipitation output in inference is a possibility and a common practice… However, we show that the introduction of bounding in the output during training has benefits beyond simply avoiding slightly negative or unphysical values…”
Clipping is likely to introduce bias. Since MSE‑trained models predict the conditional mean, truncating negative values alone will generally induce a positive mean bias. It would be worth showing how the model climatology (averaged over initial times) differs spatially from IMERG.
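The bias mechanism is easy to demonstrate: for any forecast distribution with mass below zero, clipping at zero can only raise the mean. A toy Monte Carlo sketch (illustrative, with a Gaussian stand-in for the forecast distribution):

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy setup: an unbiased MSE-style forecast scattered around a zero truth.
truth = 0.0
forecasts = rng.normal(loc=truth, scale=1.0, size=100_000)

raw_bias = forecasts.mean() - truth                        # ~0: unbiased on average
clipped_bias = np.maximum(forecasts, 0.0).mean() - truth   # positive after clipping

# Clipping negatives to zero only adds mass above the old values,
# so the mean forecast can only increase: a systematic positive bias.
```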
L262
> The Frequency Bias Index (FBI) and Peirce Skill Score (PSS) are shown for the Northern Hemisphere for different thresholds.

I'm not familiar with these metrics. Can you add citations and definitions? Perhaps a histogram would be more familiar to the broader audience and would make the point that the model predicts too much light drizzle.
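For readers in the same position, both scores reduce to simple contingency-table ratios (standard definitions, e.g. from Wilks's verification textbook; the example data are illustrative):

```python
import numpy as np

def fbi_pss(forecast, observed, thresh):
    """Contingency-table scores for exceeding `thresh`:
      FBI = (hits + false alarms) / (hits + misses)        -- frequency bias
      PSS = hit rate - false alarm rate (Peirce / Hanssen-Kuipers)"""
    f = np.asarray(forecast) >= thresh
    o = np.asarray(observed) >= thresh
    hits = np.sum(f & o)
    misses = np.sum(~f & o)
    false_alarms = np.sum(f & ~o)
    corr_neg = np.sum(~f & ~o)
    fbi = (hits + false_alarms) / (hits + misses)
    pss = hits / (hits + misses) - false_alarms / (false_alarms + corr_neg)
    return fbi, pss

fcst = np.array([1.2, 0.0, 0.5, 3.0, 0.2])
obs = np.array([1.0, 0.0, 0.0, 2.5, 0.4])
fbi, pss = fbi_pss(fcst, obs, thresh=0.4)
```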
---
**L276–278**
> “The revised AIFS demonstrates approximately a one‑day gain in forecast skill… The forecast fields also exhibit noticeable improvements, as illustrated in Figure 2…”
These improvements may simply reflect increased spatial smoothing. As presented, the evidence is inconclusive.
---
**L311–314**
> “The results show that the improvement observed in total precipitation forecast skill in the revised AIFS version can mainly be attributed to constraining the output…”
The improvement appears largely constant with lead time rather than growing, suggesting that it may not feed back on error growth and could potentially be achieved through post‑processing. Figure 11 may be relevant in this context.
---
**L329–331**
> “Unlike the previous AIFS version (Figure 4), the convective precipitation forecast is now consistent with the predicted total precipitation accumulation…”
This figure does not demonstrate that training with this constraint is essential. It would be useful to assess the impact of applying similar consistency corrections as a post‑processing step to the _“AIFS revised – no bounding”_ configuration.
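The suggested post-processing baseline amounts to a simple clip of convective against total precipitation (a sketch with illustrative values, not the authors' method):

```python
import numpy as np

# Enforce convective <= total precipitation by clipping, as a post-processing
# step applied to the "no bounding" configuration's raw outputs.
total = np.array([0.0, 1.2, 5.0])
convective = np.array([0.3, 1.5, 2.0])   # inconsistent where it exceeds total

convective_fixed = np.minimum(convective, np.maximum(total, 0.0))
```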
# References
[1]: Brenowitz, N. D. _et al._ A practical probabilistic benchmark for AI weather models. _Geophys. Res. Lett._ **52**, (2025).
Citation: https://doi.org/10.5194/egusphere-2025-4716-RC2
RC3: 'Comment on egusphere-2025-4716', Anonymous Referee #3, 25 Feb 2026
This paper provides an update to the AIFS model of ECMWF, although as mentioned in my comment below, it is not clear if these updates will be propagated from AIFS-Single to AIFS-CRPS. The updates are interesting and potentially publication worthy, though not revolutionary. As discussed below, it is important for the authors to better identify the sources of the improvement in standard fields other than precipitation.
Major Comments
1) This should be very easy to address, but the paper needs to better discuss the relation between AIFS-Single and AIFS-ENS. Is the only difference between AIFS-Single and AIFS-CRPS the use of layer-norm noise and a CRPS-dominated training loss? In particular, are these additional variables and bounding-layer strategies now deployed in the AIFS-ENS? If these improvements are not clearly planned for incorporation into the AIFS-CRPS, why not?
2) How much of the improvement in the update is simply due to including 2021 and 2022 in the training data? I imagine not that much, but there are potentially lots of other differences in the training schedule that could be responsible for the improvement in ACC for Z500 and T850 (Fig. 6). How does RMSE compare for these fields? The authors should try to isolate the source of this improvement, particularly if it turns out to be a better treatment of the loss weighting in the stratosphere.
3) Further details about expanding the comparison with the other baselines
Figs. 8 & 9a,b: why the switch from ACC (in Figs. 6 & 7) to RMSE for these variables? For a more thorough analysis, please plot both RMSE and ACC for all of these cases.
Fig. 7 caption: The evaluation for “the whole of 2023” - what does that mean? Presumably, as in the caption of Fig. 5, twice-daily (00 and 12 UTC) forecasts for every day of the year? Maybe the authors can clearly establish this in the text and then note only any exceptions. (Sorry if I missed such a sentence.)
Fig. 9: Why the different time ranges: (a) ssd is MAM, (b) is the full year, (c) total cloud cover is JJA? Without further discussion this seems like it could be cherry-picking.
Minor comments
l. 137: Do the authors mean 64 A100 80 GB GPUs (or maybe 64 of the 40 GB A100s)?
l. 190-192: Consider referencing Subramaniam et al., 2025 (arXiv:2506.08285) who add a loss penalty to effectively obtain a model that respects hydrostatic balance
Fig. 5: Framed rectangles are difficult to see. I suggest using a thick underline for statistically significant results, though the authors may have a better idea. Perhaps even better, they could plot only those values that are significant at the 95% level.
l. 258-259 and/or Fig. 6 caption: please be more specific about “medium range” — which seems like 1-day skill improvement at about 1-week lead time
l. 275, Fig. 10: please do plot “the three main global regions” as suggested in the text and caption instead of results for just the northern and southern hemispheres.
l. 391: The Leaky ReLU would indeed allow calculations of gradients in the negative space, but it seems like it would open the door to negative precipitation amounts as well. There are a variety of possible ways to construct loss functions that handle “no rain” separately. A more thorough discussion of this would be useful.
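One hypothetical form such a loss could take (entirely illustrative; the names and structure are not from the manuscript) is a two-head objective: a classification term for rain occurrence plus a regression term applied only where rain was observed.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def occurrence_amount_loss(logit, amount_pred, y, thresh=0.0):
    """Hypothetical two-head loss treating "no rain" as a separate task:
    binary cross-entropy on rain occurrence plus squared error on amount,
    the latter applied only at points where rain was observed."""
    rain = (y > thresh).astype(float)
    p = sigmoid(logit)
    bce = -(rain * np.log(p + 1e-12) + (1.0 - rain) * np.log(1.0 - p + 1e-12))
    mse = rain * (amount_pred - y) ** 2
    return float(np.mean(bce + mse))

# Good predictions (rain detected where observed, amount matched) give a
# near-zero loss; bad predictions are penalised by both terms.
good = occurrence_amount_loss(np.array([10.0, -10.0]),
                              np.array([2.0, 0.0]),
                              np.array([2.0, 0.0]))
bad = occurrence_amount_loss(np.array([-10.0, 10.0]),
                             np.array([0.0, 5.0]),
                             np.array([2.0, 0.0]))
```

Unlike a plain LeakyReLU, this keeps gradients flowing for the occurrence decision without ever permitting negative amounts.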
Citation: https://doi.org/10.5194/egusphere-2025-4716-RC3
Dear authors,
Unfortunately, after checking your manuscript, it has come to our attention that it does not comply with our "Code and Data Policy".
https://www.geoscientific-model-development.net/policies/code_and_data_policy.html
You have archived some of the assets necessary to replicate your work on sites that are not acceptable long-term repositories for scientific publication, namely huggingface.co and ECMWF servers. In this regard, we need you to move these assets to one of the repositories that comply with our policy, and reply to this comment with the permanent identifier (e.g. DOI) and link for each repository. If, for the assets stored on ECMWF servers, the size is too large to store them elsewhere (for example, hundreds of GBs), then you can ask for an exception to the policy for them.
Please reply to this comment with the relevant information as soon as possible, as we cannot accept manuscripts in Discussions that do not comply with our policy. Also, I must note that if you do not fix this problem, we cannot continue with the peer-review process or accept your manuscript for publication in our journal.
Juan A. Añel
Geosci. Model Dev. Executive Editor