AIFS 1.1.0: An update to ECMWF's machine-learned weather forecast model AIFS
Abstract. We present an update to ECMWF's machine-learned weather forecasting model AIFS Single with several key improvements. The model now incorporates physical consistency constraints through bounding layers, an updated training schedule, and an expanded set of variables. The physical constraints substantially improve precipitation forecasts and the new variables show a high level of skill. Upper-air headline scores also show improvement over the previous AIFS version. The AIFS has been fully operational at ECMWF since the 25th of February 2025.
Status: open (extended)
RC1: 'Comment on egusphere-2025-4716', Anonymous Referee #1, 05 Jan 2026
AIFS 1.1.0 - An update to ECMWF's machine-learned weather forecast model AIFS
Recommendation: Major Revisions Required
SUMMARY
This manuscript documents an impressive operational update to ECMWF's AIFS, which has already demonstrated remarkable skill in medium-range weather prediction. The revised version introduces three valuable enhancements: (1) activation function-based bounding layers to enforce physical constraints, (2) expansion of prognostic variables to include land surface and energy sector fields, and (3) optimization of the training schedule. The operational deployment at ECMWF represents a substantial achievement, demonstrating the maturity and production-readiness of machine learning weather prediction. AIFS 1.1.0 shows consistent improvements of 4-6% across variables and lead times, with particularly impressive gains in precipitation forecasting.
The manuscript effectively documents these advances and will make a strong contribution to the literature. However, several technical aspects require clarification. The paper makes mechanistic claims about model behavior without adequate theoretical support, attributes performance improvements to specific design choices without proper experimental controls, and presents critical design decisions without quantitative justification. These revisions involve analysis and clarification rather than expensive retraining experiments. With these technical clarifications, the manuscript will provide an excellent reference for operational machine learning weather prediction systems.
MAJOR ISSUES
1. Bounding Layer Mechanism: Insufficient Mechanistic Explanation
The manuscript's primary scientific contribution concerns the bounding layer strategy (Section 3.2, lines 221-227). The authors claim that applying ReLU activation functions to precipitation outputs "facilitates the learning of forecasting for sparse and intermittent variables" by enabling negative output space to serve as a proxy for no-rain classification. Figure 12 presents compelling evidence of structured spatial patterns in pre-activation space, showing strongly negative values over arid regions (Sahara Desert) with smooth gradients near precipitation events.
This observation represents the paper's most interesting result, yet receives inadequate mechanistic analysis. During backpropagation, the ReLU derivative is zero for all negative inputs (∂ReLU/∂x = 0 for x < 0), suggesting gradient information should not flow to regions predicting negative values. How, then, does the model develop sophisticated spatial structure visible in Figure 12? What is the mathematical relationship between the MSE loss gradient and the learned negative space structure? Does this structure emerge from autoregressive training, where negative predictions at intermediate steps influence subsequent forecasts through rollout? Can the information content in the negative space be quantified? How does this compare to alternative formulations such as LeakyReLU (mentioned line 391 but not evaluated)?
The authors should provide a mechanistic explanation for the bounding layer's behavior. This need not require expensive additional experiments; rather, a clear theoretical model of gradient flow during rollout training would suffice. At minimum, the structured negative space in Figure 12 deserves quantitative analysis rather than qualitative description. Understanding this mechanism is critical for generalizing the approach to other sparse variables.
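To make the gradient question concrete: for a pointwise MSE loss through a ReLU bounding layer, the gradient with respect to the pre-activation x is proportional to (ReLU(x) − y) · 1[x > 0], which vanishes identically once x drops below zero. A minimal sketch (values invented for illustration; this is the reviewer's toy model, not the authors' training code):

```python
import torch

# Pre-activation "precipitation" values at three grid points:
# strongly negative (arid), slightly positive drizzle, clear rain.
x = torch.tensor([-2.0, 0.3, 1.5], requires_grad=True)
y = torch.tensor([0.0, 0.0, 1.0])  # observed precipitation targets

loss = torch.mean((torch.relu(x) - y) ** 2)
loss.backward()

print(x.grad)  # tensor([0.0000, 0.2000, 0.3333])
# The arid point (x = -2.0) receives exactly zero gradient: once a
# pre-activation is pushed below zero, direct MSE updates stop. Any
# further deepening of the negative space must therefore come from
# other pathways, e.g. weights shared across grid points, or rollout
# steps where the same parameters act on positive pre-activations.
```

A theoretical account along these lines would clarify which pathway sculpts the spatial structure visible in Figure 12.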
2. Performance Attribution: Confounded Experimental Design
Lines 88-91 state that the revised training schedule (direct fine-tuning on operational analysis rather than sequential ERA5→ERA5→operational) "results in better forecast performance." This causal claim cannot be substantiated from the presented evidence. The revised model simultaneously modifies training schedule, adds new prognostic variables, implements bounding layers, and adjusts learning rate schedules. Section 4.1 attempts to isolate bounding layer effects through Figure 11, but this comparison still includes differences in training data extent (1979-2022 vs shorter periods) and other modifications.
The authors should either remove unsupported causal claims or clearly state that performance improvements result from combined system modifications. Revising "results in better forecast performance" to "is associated with improved forecast performance" would be appropriate. I recognize comprehensive ablation studies are computationally expensive, but making causal claims without supporting evidence is scientifically inappropriate. At minimum, explicitly acknowledging the confounded nature of these comparisons would improve scientific rigor.
3. Subjective Design Tradeoffs: Insufficient Quantification
Lines 254-256 and 400-410 acknowledge a "subjective compromise between forecast realism and forecast skill measured by RMSE," where more aggressive rollout fine-tuning could improve headline scores at the cost of spatial field characteristics. The authors base this decision on spectral analysis not presented in the manuscript. What specific spectral characteristics were prioritized? What magnitude of RMSE degradation was accepted to achieve desired spectral properties? Without quantification, readers cannot assess the appropriateness of this tradeoff or replicate the training procedure.
The spectral analysis underlying this design choice should be presented. Representative power spectra comparing aggressive fine-tuning versus the chosen configuration would clarify the tradeoff. Quantifying the approximate magnitude of this compromise requires no additional training runs—only analysis of existing model outputs. This documentation is essential for reproducibility.
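As an illustration of the requested diagnostic, a minimal sketch of a zonal power spectrum comparison (assuming gridded (lat, lon) fields; the arrays in the usage comment are hypothetical):

```python
import numpy as np

def zonal_power_spectrum(field):
    """Mean zonal power spectrum of a (lat, lon) field.

    Sharper fields retain more power at high zonal wavenumbers;
    aggressive rollout fine-tuning typically damps this tail.
    """
    fft = np.fft.rfft(field, axis=1)  # FFT along longitude
    power = np.abs(fft) ** 2
    return power.mean(axis=0)         # average over latitudes

# Hypothetical usage: compare the chosen configuration against a more
# aggressively fine-tuned one at a fixed lead time.
# ratio = zonal_power_spectrum(field_aggressive) / zonal_power_spectrum(field_chosen)
# ratio < 1 at high wavenumbers quantifies the blurring accepted or avoided.
```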
4. Physical Conservation Properties: Missing Diagnostics
Lines 188-190 acknowledge that the bounding strategy does not enforce mass or energy conservation, dismissing this with "we did not consider other constraints such as energy or mass conservation." For a production forecasting system deployed operationally, this warrants more thorough treatment. Do 10-day integrations accumulate substantial mass or energy errors? How do these compare to the physics-based IFS? Does violation magnitude correlate with forecast error? These questions can be addressed through straightforward post-hoc analysis requiring no model retraining. Brief discussion of whether conservation violations matter for medium-range prediction timescales would strengthen the manuscript.
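To illustrate how inexpensive such diagnostics are, a minimal sketch of a global dry-mass proxy (area-weighted mean surface pressure) over a forecast rollout; the array names and shapes are assumptions, not the authors' data layout:

```python
import numpy as np

def global_mean_surface_pressure(ps, lat):
    """Area-weighted global mean surface pressure, a dry-mass proxy.

    ps  : (time, lat, lon) surface pressure in Pa
    lat : latitudes in degrees, shape (lat,)
    """
    w = np.cos(np.deg2rad(lat))[None, :, None]  # cosine-latitude area weights
    return (ps * w).sum(axis=(1, 2)) / (w.sum() * ps.shape[2])

# Hypothetical usage: drift of this series over a 10-day forecast
# measures accumulated mass-conservation error.
# series = global_mean_surface_pressure(ps_forecast, lats)
# print(series - series[0])  # Pa of drift at each lead time
```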
5. Missing Contextualization with Related Work
The manuscript does not adequately position this work within the broader context of physically constrained machine learning weather models. Multiple operational systems now implement similar physical constraints. The CREDIT framework (Chen et al., 2025, npj Climate and Atmospheric Science, https://www.nature.com/articles/s41612-025-01125-6; Chen et al., 2025, Journal of Advances in Modeling Earth Systems, https://agupubs.onlinelibrary.wiley.com/doi/10.1029/2025MS005138) implements comparable bounding strategies for physical constraints in operational settings. Harder et al. (2024) provide a theoretical framework for hard-constraint approaches. Kent et al. (2025) and Bonev et al. (2025) present alternative constraint methodologies.
Particularly relevant is recent work by Sha et al. (2025, Geophysical Research Letters, https://agupubs.onlinelibrary.wiley.com/doi/10.1029/2025GL118478) addressing the identical problem of blurry precipitation forecasts and drizzle bias in AIWP models. While AIFS uses ReLU activation functions for bounding, Sha et al. demonstrate that terrain-following coordinates combined with global mass and energy conservation constraints provide comparable benefits for reducing drizzle bias and improving extreme precipitation forecasts. This represents an alternative technical approach to the same fundamental challenge, suggesting multiple pathways exist for addressing sparse variable prediction in ML weather models.
The manuscript would benefit from discussing how this implementation compares to related approaches in GraphCast, Pangu-Weather, FuXi, and CREDIT models. Do other systems exhibit similar light precipitation biases? How do different architectural choices and constraint methodologies affect this common challenge? Comparing the ReLU bounding approach to alternatives like terrain-following coordinates or other constraint formulations would strengthen the contribution by clarifying the relative advantages and positioning the work within the field.
MINOR ISSUES
1. Loss Weight Justification: Loss scaling factors in Table 1 (lines 126-133) are described as "chosen empirically," which is reasonable. However, line 129 states that vertical velocity and soil moisture are "deliberately down-weighted," implying specific motivation. Brief rationale would improve clarity.
2. Unsupported Generalizations: Lines 173-175 claim a "well-known characteristic of machine learning-based forecasts: a tendency to produce overly smooth spatial fields." Either provide specific citations (e.g., Ben Bouallègue et al., 2024; Bonavita, 2024) or remove "well-known."
3. Line 253 states tropical cyclone performance "is similar to that of the previous version" without supporting data.
4. Statistical Significance: Several PSS comparisons in Figure 3 between the revised AIFS and the IFS at 144-hour forecasts show marginal differences. Claims of "improvement" should be verified as statistically significant; a minimal significance-testing sketch follows this list.
5. Case Study Limitations: The Section 4.2 case studies effectively demonstrate model capability but represent illustrative examples rather than systematic validation. How many extreme events were evaluated? What is the false alarm rate? A contingency-table sketch for these metrics also follows this list.
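For minor issue 4, a minimal sketch of the kind of paired bootstrap test that would support or refute significance claims (per-initialization scores for the two models are assumed inputs; for autocorrelated daily starts a block bootstrap would be more defensible):

```python
import numpy as np

def paired_bootstrap_ci(score_a, score_b, n_boot=10_000, seed=0):
    """95% CI for the mean score difference A - B over start dates.

    A confidence interval excluding zero supports calling the
    difference statistically significant.
    """
    rng = np.random.default_rng(seed)
    diff = np.asarray(score_a) - np.asarray(score_b)
    n = diff.size
    means = np.array([diff[rng.integers(0, n, n)].mean()
                      for _ in range(n_boot)])
    return np.percentile(means, [2.5, 97.5])
```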
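For minor issue 5, the requested false alarm statistics follow from a standard 2x2 contingency table; a minimal sketch, assuming binary threshold-exceedance fields for forecasts and observations:

```python
import numpy as np

def contingency_scores(forecast_event, observed_event):
    """Probability of detection and false alarm ratio from booleans."""
    hits = np.sum(forecast_event & observed_event)
    false_alarms = np.sum(forecast_event & ~observed_event)
    misses = np.sum(~forecast_event & observed_event)
    pod = hits / (hits + misses)                 # fraction of observed events caught
    far = false_alarms / (hits + false_alarms)   # fraction of forecast events that were false
    return pod, far
```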
RECOMMENDATION
I recommend major revisions. The manuscript documents an important operational system and makes valuable contributions to ML weather prediction. However, it makes scientific claims about mechanisms and causality without adequate supporting evidence. The requested revisions do not necessitate expensive retraining; rather, they require theoretical analysis, quantification of existing results, and proper contextualization within related work. With these technical clarifications, this manuscript will make a strong contribution documenting a significant advance in operational ML weather forecasting.
Citation: https://doi.org/10.5194/egusphere-2025-4716-RC1
CEC1: 'Comment on egusphere-2025-4716 - No compliance with the policy of the journal', Juan Antonio Añel, 07 Dec 2025
Dear authors,
Unfortunately, after checking your manuscript, it has come to our attention that it does not comply with our "Code and Data Policy".
https://www.geoscientific-model-development.net/policies/code_and_data_policy.html
You have archived some of the assets necessary to replicate your work on sites that are not acceptable long-term repositories for scientific publication, namely huggingface.co and ECMWF servers. In this regard, we need you to move these assets to one of the repositories that comply with our policy, and reply to this comment with the permanent handle (e.g. DOI) and link for each repository. If the assets stored on ECMWF servers are too large to store elsewhere (for example, hundreds of GBs), you can ask for an exception to the policy for them.
Please reply to this comment with the relevant information as soon as possible, as we cannot accept manuscripts in Discussions that do not comply with our policy. I must also note that if you do not fix this problem, we cannot continue with the peer-review process or accept your manuscript for publication in our journal.
Juan A. Añel
Geosci. Model Dev. Executive Editor