Increasing Resolution and Accuracy in Sub-Seasonal Forecasting through 3D U-Net: the Western US
Abstract. Sub-seasonal weather forecasting is a major challenge, particularly when high spatial resolution is needed to capture complex patterns and extreme events. Traditional Numerical Weather Prediction (NWP) models struggle with accurate forecasting at finer scales, especially for precipitation. In this study, we investigate the use of a 3D U-Net architecture for post-processing sub-seasonal forecasts to enhance both predictability and spatial resolution, focusing on the western U.S. Using the ECMWF ensemble forecasting system and high-resolution PRISM data, we tested different combinations of ensemble members and meteorological variables. Our results demonstrate that the 3D U-Net model significantly improves temperature predictability and consistently outperforms NWP models across multiple metrics. However, challenges remain in accurately forecasting extreme precipitation events, as the model tends to underestimate precipitation in coastal and mountainous regions. While ensemble members contribute to forecast accuracy, their impact is modest compared to the improvements achieved through downscaling. This study lays the groundwork for further development of neural network-based post-processing methods, showing their potential to enhance weather forecasts at sub-seasonal timescales.
Status: final response (author comments only)
RC1: 'Comment on egusphere-2025-308', Anonymous Referee #1, 25 Jun 2025
Summary
The paper proposes and tests a method for downscaling sub-seasonal weather forecasts to improve accuracy and spatial resolution. The approach uses a neural network architecture (3D U-Net) that has previously been used for similar tasks. The forecasts are from the ECMWF ensemble forecast system and the high-resolution data are from PRISM. The method is applied in the Western US. The effects of different input data (ensemble members versus mean, different variable sets) are examined. The authors find that the neural network improves temperature predictions relative to the original NWP forecasts. Results for precipitation are also presented but the quality is more mixed.
High-level feedback
The topic itself is interesting, and the results shown (particularly for temperature) seem promising. However, it is difficult to evaluate the method due to the omission of important details. Some of the missing methodological details can be deduced by reading the code, but they should be provided in the paper itself.
In addition to technical omissions, a main concern is the lack of clarity around the purpose and impact of the analysis. The main question, as I understand it, is: what is the effect of using (a) different sets of predictor variables and (b) different ensemble components/aggregations on prediction accuracy? The introduction states that Höhlein et al. (2024) examined (b) and reached approximately the same conclusion as this study. How is this study different? Question (a) seems relevant. However, there is very little discussion of the specific predictor variables (I think they are only mentioned in the SI) and how/why specific variables may contribute to better or worse predictions. Again, it is not clearly explained how this study differs from the cited Horat and Lerch (2024) or Weyn et al. (2021). Most of the framing of the results and conclusions boils down to “neural network downscaling improves prediction accuracy relative to NWP,” which, based on the introduction, seems to already be well-established.
The motivation for the ensemble-based predictors is also confusing. The purpose of an ensemble prediction system is to represent uncertainty, which is not discussed. Also the ensemble members are simulations that, by construction, do not start from the “optimal” estimate of the initial conditions. So it is not surprising that E01 performs worse (unless by “first” ensemble member you mean the control). Interpreting the relative performance of E50 versus E50M requires methodological details that are not provided. But again it is not surprising that the performance is similar given that E50 output is being reduced to a deterministic prediction. It seems like the value of downscaling based on an ensemble would be more in representing forecast uncertainty than improving deterministic downscaled predictions.
Specific feedback
- Either the E_pre formula or the subsequent description of it is incorrect. You say E_pre = 0 for perfect predictions, which would require the term inside the square brackets to equal 1. However, it is 2^2 = 4 when sigma_obs = sigma_pre and r_0 = r_i = 1. It would also be helpful, if possible, to provide some intuition for the terms in this statistic. E.g., why is the standard deviation term squared and the correlation terms raised to the fourth?
- Regarding skip connections, for a given level (or spatial resolution) in the U-Net, shouldn’t there be twice the number of channels in the first layer on the right side of the U as in the last layer on the left (due to the concatenation of feature maps)? This is what is shown in both the Horat and Ronneberger papers. Also, I couldn’t determine where the skip connections were implemented in the code, but maybe I just missed it. (A minimal sketch of what I mean follows this list.)
- Related to (2), there is no description of the convolution operations (e.g., kernel size). The pooling operation is also not described in the body of the text, only shown in Fig. 1 (but not defined). These are scientifically important details considering they control the spatial scales at which information can be extracted.
- The activation function(s) is also not stated.
- The loss function is mentioned but not explicitly stated. How are the relative weights of the correlation and MSE terms set? Also, point/cell-wise MSE and correlation are closely related, so what is the value of including both terms? (A sketch of the kind of combined loss I assume is meant follows this list.)
- The claim that pattern correlation and RMSE “quantify the model’s ability to capture the spatial patterns” is not justified. Grid cell-wise RMSE is invariant under spatial permutation. The formula or weighting scheme for pattern correlation is not given. From the code it looks like weights are assigned inversely to latitude? (A sketch of the conventional cos-latitude area weighting follows this list for comparison.)
- The accuracy assessment would benefit from disaggregation into bias versus “random” errors. Using unbiased RMSE is a way of doing this (see e.g., Entekhabi et al., 2010; the decomposition is written out after this list). In figures 4 and 5, it looks like the downscaled predictions have meaningful biases for certain combinations of variable sets and time steps. For example, panel g4 in Fig. 4 and panel g1 in Fig. 5. If it turns out that the predictions are not “on average” biased, the disaggregation might be less important. However, the presence/absence of bias overall should be mentioned given the visually apparent biases in the figures. This analysis would help identify how much of model performance is coming from downscaling versus simply bias correcting the ECMWF forecasts.
- The possibility of systematic errors (e.g., season-dependent bias) should be acknowledged/discussed. Even if the predictions are not “on average” biased, there may still be systematic errors. Given that the test period spans one full year, seasonal biases could cancel out such that the predictions appear unbiased in aggregate. The NN predictions may tend toward the overall mean of the training data, leading to seasonal biases. I.e., the NN could be introducing systematic biases not present in the NWP output. This may or may not be happening but seems important to consider and rule out.
- I don’t see the train/test procedure for E50 anywhere. There are several ways the training and testing could work, so it should be specified. This detail is essential given that this seems to be one of the main focuses of the paper.
- I don’t think the predictor variable sets are ever actually stated. The top 8 predictors for each target variable are given in the appendix, but this is scientifically important information that should be in the main text. It is also not clear from the text or SI which of the 8 are included in each subset (I see from the code that Vk comprises the k variables most highly correlated with the target, which makes sense but is not stated).
- I think it is important to acknowledge somewhere in the paper that PRISM data are the output of statistical interpolation and so have uncertainties and systematic errors, which will impact the output of the NN. The paper sometimes refers to PRISM data as “observations,” which is misleading.
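To make the skip-connection point above concrete, here is a minimal PyTorch sketch (hypothetical layer sizes, not the authors' code) of one decoder level: concatenating the encoder feature map doubles the channel count seen by the first convolution on the right side of the U.
```python
# Minimal sketch (hypothetical sizes): a 3D U-Net decoder level with a skip connection.
import torch
import torch.nn as nn

enc_feat = torch.randn(1, 64, 16, 32, 32)       # encoder output at this level (64 channels)
up_feat = torch.randn(1, 64, 16, 32, 32)        # upsampled decoder features (64 channels)

merged = torch.cat([up_feat, enc_feat], dim=1)  # skip connection -> 128 channels
decoder_conv = nn.Conv3d(in_channels=128, out_channels=64, kernel_size=3, padding=1)
out = decoder_conv(merged)                      # back to 64 channels
print(out.shape)                                # torch.Size([1, 64, 16, 32, 32])
```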
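For the loss-function comment above, this is a hedged sketch of the kind of combined MSE-plus-correlation loss I assume is meant; the relative weight `alpha` is my own placeholder, which is exactly the detail the paper should state.
```python
# Hedged sketch: combined MSE + (1 - correlation) loss; `alpha` is a hypothetical weight.
import torch

def combined_loss(pred, target, alpha=0.5):
    mse = torch.mean((pred - target) ** 2)
    p = pred - pred.mean()
    t = target - target.mean()
    corr = torch.sum(p * t) / (torch.sqrt(torch.sum(p ** 2) * torch.sum(t ** 2)) + 1e-8)
    return alpha * mse + (1.0 - alpha) * (1.0 - corr)
```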
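For the pattern-correlation comment above, the conventional area weighting is proportional to cos(latitude); a minimal numpy version is given below for comparison with whatever weighting the released code actually applies.
```python
# Minimal sketch: cos-latitude area-weighted pattern correlation (conventional weighting).
import numpy as np

def pattern_correlation(pred, obs, lat):
    """pred, obs: 2D fields (nlat, nlon); lat: 1D latitudes in degrees."""
    w = np.cos(np.deg2rad(lat))[:, None] * np.ones_like(pred)  # area weights
    w = w / w.sum()
    p = pred - np.sum(w * pred)                                 # remove area-weighted means
    o = obs - np.sum(w * obs)
    return np.sum(w * p * o) / np.sqrt(np.sum(w * p ** 2) * np.sum(w * o ** 2))
```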
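For the bias comment above, the decomposition I have in mind is the one used by Entekhabi et al. (2010), with prediction p and reference o (here, PRISM):
```latex
\mathrm{bias} = \overline{p - o}, \qquad
\mathrm{RMSE} = \sqrt{\overline{(p - o)^{2}}}, \qquad
\mathrm{ubRMSE} = \sqrt{\mathrm{RMSE}^{2} - \mathrm{bias}^{2}}.
```
Reporting bias and ubRMSE separately would show how much of the improvement over ECMWF comes from bias correction versus genuine downscaling skill.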
Minor comments
Line 16 (and elsewhere): when you refer to a paper in running text, you should still provide the year in parentheses
Line 34: Re “subset of variables,” a subset of what? I think you’re referring to “additional” or maybe “auxiliary” variables, but I don’t think subset is the right term. If anything it’s a (super)set that includes the target variables as a subset.
Line 36: Again I do not think “sub-variable” is the right term here. See above for suggested alternatives.
Lines 49-51: On line 49 it says “we select forecasts from CY40R1” and on line 51 it says “We utilize forecasts from CY40R1 to CY48R1.” This might be clearer to someone more familiar with the ECMWF forecast naming scheme, but I find this confusing. It would be helpful to give some additional explanation of what this means and maybe provide a link to the relevant data product(s).
Line 70: Remove “properly.” Also, I believe the preferred GMD style is “Fig.” rather than “Figure” in running text.
Line 73: I generally find it clearer to talk in terms of fine versus coarse spatial resolution.
Line 96: Regarding “conservative interpolation,” can you be more specific about the method?
Lines 97-98: Regarding “given the established relationship …”, what is the relationship?
Line 110: semicolon should be colon
First paragraph of 3.1: This description of the scope of analysis should come much earlier in the paper, not in the results section.
Line 131: I think you mean “metrics”
Line 141: “suggest elevation can enhance temperature post-processing accuracy” this is just lapse rate, right? I don’t think this should be framed as a finding of NN methods
References
Entekhabi, D., Reichle, R.H., Koster, R.D. and Crow, W.T., 2010. Performance metrics for soil moisture retrievals and application requirements. Journal of Hydrometeorology, 11(3), pp.832-840.
Citation: https://doi.org/10.5194/egusphere-2025-308-RC1
AC1: 'Reply on RC1', Jin-Ho Yoon, 17 Aug 2025
RC2: 'Comment on egusphere-2025-308', Anonymous Referee #2, 24 Jul 2025
Summary
This manuscript revolves around a deep learning-based method for postprocessing subseasonal weather forecasts in the Western United States. The authors use multiple 3D U-Net architectures. These models use ensemble forecasts from the ECMWF model as input. As the target variables, the high-resolution temperature and precipitation data from the PRISM dataset are used. The authors investigate how model performance varies when using only one member of the ensemble, the full ensemble, or the mean of the ensemble. The authors also explore the impact of the number of input variables from the ECMWF forecasts. The performance of the models is evaluated using RMSE, pattern correlation, and the E_pre index. The analysis focuses on both regional (full-domain) and county-level scales, covering urban and agricultural areas. As a final analysis, the authors also examine the performance of the models for extreme weather events.
Results indicate that the 3D U-Net models improve temperature forecasts compared to the raw ECMWF outputs. The models enhance spatial resolution for precipitation but tend to underestimate intensity, particularly in coastal and mountainous regions. Increasing the number of input variables or using all 50 ensemble members shows little to no gain in performance. While the model offers clear benefits at larger spatial scales, these advantages are smaller at finer resolutions and for extreme weather conditions, where performance becomes more comparable to the baseline.
Overall assessment and recommendations:
The proposed method for postprocessing the ECMWF weather forecast is well thought out and novel. The authors have invested significant effort into optimizing the models. This has been done by considering the number of sub-variables used from the forecast as input features for the U-net models and the number of ensemble members to include. It's also good that the analysis includes evaluations on specific regions (counties) with varying climates. Here, the U-Net models' results are a bit lackluster, but this can be further investigated in a follow-up paper. The additional investigation into extreme events is valuable, as it pushes the model to its limits and offers insight into its performance for less frequent but very important weather conditions. The manuscript is well written and follows a logical structure. There are some minor typographical errors, most of which are listed below.
Overall, the obtained results and the paper itself are definitely promising, but they could be further improved with some additional explanation. The main shortcoming is the lack of a more thorough justification for certain choices made throughout the analysis. I recommend expanding the main text and supporting information to justify and contextualize these decisions better.
General comments
For me, the authors don't explain the reasoning behind the choice of model clearly enough. Using a U-Net makes sense for this problem, but the 3D aspect isn't explained or justified well enough. Adding another paragraph of explanation in Section 2.2 could already go a long way here. Right now, it's not clear why a 3D U-Net was used instead of a more standard 2D version with lead time added as a channel, for example. The 3D architecture might be the better option, but the manuscript should explain how and why that decision was made more clearly.
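To make the comparison concrete, here is a hedged PyTorch sketch (hypothetical tensor shapes, not the authors' configuration) of how the two choices differ at the very first convolution: a 3D kernel slides over the lead-time axis, whereas folding lead times into channels makes a 2D kernel mix all lead times at once.
```python
# Hedged sketch (hypothetical shapes): Conv3d over (lead time, lat, lon)
# versus Conv2d with lead time folded into the channel dimension.
import torch
import torch.nn as nn

n_vars, n_lead, n_lat, n_lon = 8, 6, 64, 96       # hypothetical sizes
x = torch.randn(1, n_vars, n_lead, n_lat, n_lon)  # (batch, vars, lead, lat, lon)

conv3d = nn.Conv3d(n_vars, 32, kernel_size=3, padding=1)
y3d = conv3d(x)                                   # (1, 32, 6, 64, 96): lead time kept as an axis

x2d = x.reshape(1, n_vars * n_lead, n_lat, n_lon) # lead times folded into channels
conv2d = nn.Conv2d(n_vars * n_lead, 32, kernel_size=3, padding=1)
y2d = conv2d(x2d)                                 # (1, 32, 64, 96): lead-time structure mixed immediately
```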
Additionally, in the main section or the supporting information, the authors should include more details on the chosen model configurations, parameters, and hyperparameters. Details such as the number of epochs, batch size, kernel size, and methods to prevent overfitting aren't documented. Making all these elements as transparent as possible is crucial for a machine learning paper.
The authors also select 16 sub-variables from the ECMWF forecasts and split them into two groups, one for each target variable. But it's not clear why these specific 16 were chosen in the first place. Also, Table S1 is important for understanding this part of the method and would be better placed in the main manuscript instead of the supporting information.
I commend the authors for choosing this train/validation/test split well. The three datasets are kept independent, which is good and as it should be. That said, there isn't enough clear explanation of how each dataset was used. Different model configurations are being compared directly on the test set, which isn't ideal, since model selection and tuning shouldn't be done on the test set. The test set should be reserved for evaluating the final, fully trained, "best" model on truly unseen data. As it stands, no single “final” model is chosen in the paper. If this particular architecture is intended for operational use, defining one final model configuration (e.g., V1 with E50M) with fixed inputs would make sense. This final model should also be explicitly mentioned in the conclusion, or even the abstract, along with its performance metrics compared to the baseline. That would help demonstrate more clearly that the U-Net model offers meaningful improvements and is worth using.
On a related note, it would be interesting to evaluate model performance across different climate zones in a more detailed way than just at the county level. Grouping results by LCZs or land cover types could add useful insights. I'm also missing some basic plots that would help the reader understand the study area better, such as an orography map, a land cover map, or a map showing the selected counties. Including these would improve clarity in the manuscript.
For precipitation, it's important to understand the local climate of the area being studied. Is it generally dry, or does it experience frequent rainfall? A basic analysis of the weather types, or even some box plots showing the precipitation distribution over the training set, could help contextualize things. Additionally, removing some dry days from the dataset might improve the precipitation model by helping to balance the data, giving more weight to the days when precipitation occurs. It is unclear whether the authors considered this.
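One simple way to do this, sketched below with hypothetical arrays and a hypothetical threshold (not a recommendation of specific values), would be to drop samples whose domain-mean target precipitation is negligible so that wet days carry more weight during training.
```python
# Hedged sketch (hypothetical data and threshold): filter near-dry days from the training set.
import numpy as np

X = np.random.rand(1000, 8, 6, 64, 96)              # hypothetical predictors (samples, vars, lead, lat, lon)
y = np.random.gamma(0.2, 2.0, size=(1000, 64, 96))  # hypothetical target precipitation fields (mm)

wet = y.mean(axis=(1, 2)) > 0.1                     # keep samples with domain-mean precip above 0.1 mm
X_train, y_train = X[wet], y[wet]
print(f"kept {wet.sum()} of {len(wet)} samples")
```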
There's a mismatch between the overall performance of the precipitation and temperature U-Net models when comparing the full spatio-temporal test set to the specific county-level results. In many of the plots in Fig. 7 & 8 (especially for precipitation), the U-Net models do not perform significantly better than the baseline, particularly when looking at the RMSE scores. This discrepancy deserves a more thorough explanation, if not a dedicated analysis, to better understand where the model performs well and where it falls short. It might help to break down the results by land cover categories or orographic features over the entire spatial domain. Additionally, to back up your claims, you could run a statistical analysis to test whether the differences between models and baselines are significant, using p-values rather than just examining the plots.
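For example (a hedged sketch with hypothetical per-day error arrays), a paired non-parametric test such as the Wilcoxon signed-rank test on daily RMSE differences over the test period would give a p-value for whether the U-Net's advantage is systematic; temporal autocorrelation of daily errors would still need to be considered when interpreting it.
```python
# Hedged sketch (hypothetical arrays): paired Wilcoxon signed-rank test on
# per-day RMSEs of a U-Net configuration versus the baseline over the same days.
import numpy as np
from scipy.stats import wilcoxon

rmse_unet = 1.0 + 2.0 * np.random.rand(365)       # hypothetical daily RMSEs, U-Net
rmse_baseline = 1.2 + 2.0 * np.random.rand(365)   # hypothetical daily RMSEs, ECMWF baseline

stat, p_value = wilcoxon(rmse_unet, rmse_baseline)
print(f"Wilcoxon signed-rank p-value: {p_value:.4f}")
```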
As a final remark, I have some uncertainties about the scope of the paper. The choice to focus only on the West Coast of the USA isn't fully explained. Since PRISM covers the entire continental U.S., wouldn't it be interesting to train the model on different patches across the country? That way, you could capture various terrains and weather types, making the model more generalizable. If the goal is to stick to the West Coast only, then this point doesn't apply. But if this is meant as a trial before scaling up, testing the model on unseen regions (maybe even using a different validation set) would be useful to see how well it performs in entirely new terrain.
To summarize, the paper presents a solid application of 3D U-Net for sub-seasonal postprocessing, with some novel and practical insights, particularly in the experiments involving ensemble configurations and input variables. That said, the overall contribution would be strengthened by more thorough justification of architectural choices, more detailed methodological explanations, and a deeper critical analysis of the model's performance limitations.
Line-by-line and additional smaller comments:
0) Abstract
- Consider mentioning a value for one of the error metrics in the abstract, such as RMSE, to give an idea of how good or bad your models are.
- line 5: Using the ECMWF ensemble forecasting system (input) and high-resolution PRISM data (target), we tested different combinations of ensemble members and meteorological variables ...
1) Introduction
- line 23: The U-Net architecture has been widely utilized for weather ...
- line 32: based on a U-net model ...
- line 33: The author could clarify what “only use target variables” means. Give a one-sentence explanation that you use the same variable as input (e.g., ECMWF precipitation) as your target (PRISM precipitation).
- line 43: Our 3D U-net architecture...
- line 43: Explain what the 3D part of the 3D U-net model means.
2) Data and method
Line 44: Data and Methodology or Data and Methods instead of Data and Method.
Line 49: Give a rough approximation of the grid resolution in km x km for your area of interest.
Line 50: Mention the temporal resolution of the ECMWF dataset.
Line 51: Also mention which variables are present within this data set, optionally put that in supporting info.
Line 55, what variables are present within the PRISM data set? Put it in the support info as well.
Line 61: Add a figure where you mark these counties on the map. It is unclear where these areas are situated on the map.
Line 67: The U-net architecture ...
Line 79: Avoid using “;”
Line 80: “merely exploiting the mean” is an awkward phrasing, rephrase it.
Line 89: The Adam optimizer...
Line 89: Add the missing information on how you trained your model in the support information (batch size, epochs, methods to combat overfitting ...)
Figure 1: This figure is a bit too simplistic. Consider making a more detailed figure for the model architecture. See the Ronneberger paper for reference.
Line 91: Use abbreviations for all of the variables for consistency.
Line 97: Why a 0.25° x 0.25° grid? Do you do the same for the PRISM data or only the ECMWF data?
Line 100: Would recursive feature elimination, or a different feature-importance method, not be a more robust way to select your features than just correlation? Does correlation-based selection not potentially eliminate features that have a non-linear interaction with the target variables? (A sketch contrasting the two approaches follows at the end of this subsection.)
Line 102: Add one extra line clarifying why this transformation (Aich et al., 2024) is needed.
Line 107: Why not train on the residual (PRISM – ECMWF) rather than just temperature as your target?
Line 110: Mention the formulas for RMSE and pattern correlation in the supporting information.
Line 110: Avoid using “;”
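Sketch for the Line 100 comment above: a hedged scikit-learn example on synthetic data, contrasting plain correlation ranking with recursive feature elimination (RFE). A linear model serves here as a cheap proxy estimator only, since RFE cannot wrap the U-Net directly.
```python
# Hedged sketch (synthetic data): correlation ranking versus recursive feature elimination.
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 16))                          # 16 hypothetical candidate predictors
y = 2 * X[:, 0] + X[:, 3] - 0.5 * X[:, 7] + rng.standard_normal(500)

# (a) rank by absolute Pearson correlation with the target
corr = np.array([np.corrcoef(X[:, i], y)[0, 1] for i in range(X.shape[1])])
top8_corr = np.argsort(-np.abs(corr))[:8]

# (b) recursive feature elimination with the proxy estimator
rfe = RFE(LinearRegression(), n_features_to_select=8).fit(X, y)
top8_rfe = np.where(rfe.support_)[0]

print("top-8 by correlation:", sorted(top8_corr))
print("top-8 by RFE:        ", sorted(top8_rfe))
```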
3) Results & discussion
Figure 2: Why is the RMSE & pattern correlation of the U-net model so much better, but the E_pre is more on par with the baseline for precipitation, and why is it not the case for temperature? If you were to choose your best model, based on which metric would you do that?
Line 132: Avoid using “;”
Line 135: Also, what sub-variables were used to obtain these figures? Mention these here or in the support information.
Line 138: Put the mentioned figure in the support information.
Line 146: Your model surpasses the baseline slightly, but is it significant? Here, it would be interesting to show error bars or some statistics (p-values) to truly show that it is worth keeping V8, because including more input variables means both a longer training time and higher complexity of the model. Also, make sure to mention what you did with the ensemble members here. Were they averaged out in E50M?
Figure 4 & Figure 5: There are a lot of plots, and not all of them are equally informative. Is it possible to make them smaller or use fewer graphs? For example, with the temperature plots, the first four rows are too similar to each other to show differences. Perhaps those can be put in the supporting info, and you only keep the difference plots. For additional clarity, you could convert the temperatures here to degrees Celsius rather than Kelvin. Potentially add an error metric to the difference plots, such as an average error or MAE. This will make it clearer how far the predictions are from the PRISM values.
Line 169: I suggest adding a land cover map or similar to help illustrate this better.
Figure 7: Most models seem to be on par with the baseline for three counties (except Salt Lake City and Seattle). This seems a bit contradictory to the figures above. Also, plot the black line on top to compare your models with it, or make it more visible. Is it necessary to have all 12 models in these figures? The graph style itself is very nice, but it is a bit crowded. What about the E_pre metric for these locations?
Line 195: this is a potential indication that LCZ, land cover, or some urban-rural-sea mask could be used as an input parameter.
Figure 8: Perhaps mention how many extreme weather events you had in your dataset. Some box plots of the yearly temperatures and precipitation may be interesting in the supporting information.
4) Conclusion + Support information
Here, I miss a mention of what combination of inputs gave you the best models for temperature and precipitation. Also, provide some scores to illustrate your point (potentially mention how much it improves on the baseline). Include an example or a clearer description of your input for the model. Add the additionally mentioned plots and information on the training processes of your U-net models.
Citation: https://doi.org/10.5194/egusphere-2025-308-RC2
AC2: 'Reply on RC2', Jin-Ho Yoon, 17 Aug 2025