Comprehensive Inter-comparison of Generative AI Models for Super-Resolution Precipitation Downscaling Across Hydroclimatic Regimes

Singh, Shivam; Papalexiou, Simon Michael; Abdelmoaty, Hebatallah M.; Hartvigsen, Tom; Mamalakis, Antonios

doi:10.5194/egusphere-2026-861

Preprints

https://doi.org/10.5194/egusphere-2026-861

Preprints

27 Feb 2026

| 27 Feb 2026

Comprehensive Inter-comparison of Generative AI Models for Super-Resolution Precipitation Downscaling Across Hydroclimatic Regimes

Shivam Singh, Simon Michael Papalexiou, Hebatallah M. Abdelmoaty, Tom Hartvigsen, and Antonios Mamalakis

Abstract. High-resolution precipitation information is essential for hydrologic modeling, flood forecasting, and climate-risk assessment, yet global weather and climate models operate at spatial resolutions too coarse to resolve storm structure, intermittency, and extremes. Deep-learning-based statistical downscaling provides a computationally efficient alternative to dynamical downscaling, but deterministic convolutional neural networks often yield overly smooth predictions and underestimate fine-scale variability and extreme events. Generative deep-learning models, including generative adversarial networks and diffusion models, offer a promising alternative by enabling stochastic downscaling and explicit representation of uncertainty. This study presents a systematic, hydrologically oriented comparison of three representative deep-learning frameworks for precipitation super-resolution: a convolutional U-NET, a conditional Wasserstein GAN (WGAN), and a conditional denoising diffusion probabilistic model (DDPM). Using a perfect-model experimental design based on ERA5-Land precipitation over distinct hydroclimatic regions of the United States, we evaluate performance under 8-times (8×) and 16-times (16×) downscaling tasks within a unified training and evaluation framework. Models are evaluated using diagnostics that examine precipitation distributions, wet–dry occurrence, extremes, spatial structure, storm morphology, mass consistency, ensemble variability, and computational cost. All three models preserve aggregate rainfall mass despite the absence of explicit physical constraints. Differences arise primarily at fine spatial scales and in the representation of extremes, spatial dependence, and uncertainty. U-NET provides stable and computationally efficient predictions but smooths small-scale variability. WGAN improves fine-scale structure and heavy-tail behavior at the expense of increased noise. The DDPM yields physically coherent ensemble members and an explicit representation of uncertainty, at a substantially higher computational cost.

Received: 13 Feb 2026 – Discussion started: 27 Feb 2026

Publisher's note: Copernicus Publications remains neutral with regard to jurisdictional claims made in the text, published maps, institutional affiliations, or any other geographical representation in this paper. While Copernicus Publications makes every effort to include appropriate place names, the final responsibility lies with the authors. Views expressed in the text are those of the authors and do not necessarily reflect the views of the publisher.

Download & links

Preprint (PDF, 9570 KB)

Supplement (9814 KB)

Download & links

Shivam Singh, Simon Michael Papalexiou, Hebatallah M. Abdelmoaty, Tom Hartvigsen, and Antonios Mamalakis

Status: final response (author comments only)

RC1:
'Comment on egusphere-2026-861', Anonymous Referee #1, 09 Mar 2026

Review for the GMD

This manuscript presents a timely and meaningful intercomparison of three “most used” gen AI models for precipitation super-resolution downscaling. The study addresses an important problem at the interface of atmospheric science and machine learning, and the effort to compare multiple model classes within a unified framework is valuable. At the same time, several aspects of the manuscript require substantial clarification and strengthening before the conclusions can be fully supported. Addressing these issues would improve the scientific rigor and impact of the paper.

Major Comments :
1.The authors state that the 10-member ensemble is generated from 10 independently trained models initialized with different random seeds. This procedure primarily reflects epistemic uncertainty associated with parameter estimation and training variability. However, the central theoretical advantage of conditional generative models is that for a given low-resolution input, they can generate a distribution of plausible high-resolution outputs through stochastic sampling. At present, the manuscript uses one prediction from each independently trained model and interprets the resulting spread as ensemble uncertainty, which is not equivalent to sampling the conditional output distribution of a single trained generative model. The authors must separate these two uncertainty sources explicitly. In addition to the current analysis, they should report results from repeated stochastic sampling using a single trained model, preferably the best-performing checkpoint, and compare that spread with the spread arising from different training seeds. This distinction is essential for a correct interpretation of the ensemble results.

2.The manuscript refers in several places to ERA5-Land as “observation”. This terminology is not correct. ERA5-Land is a reanalysis-based product, not a direct observational dataset. Since the study does not use in situ station obs, radar, satellite retrievals, or soundings as reference truth, the manuscript should consistently refer to ERA5-Land as a reanalysis or reanalysis-based target, not as observation. You could read this paper to obtain the detailed reason.
https://doi.org/10.1175/BAMS-D-14-00226.1

3.The use of min-max normalization may help stabilize training, but it raises an important concern for precipitation, especially for extreme events. Min-max scaling bounds the normalized target by the range seen in the training data, which may hinder robust extrapolation to unprecedented values. This issue is especially relevant for climate-related downscaling and extreme precipitation, where out-of-sample events may exceed the historical training maximum. This issue may also be relevant to the behavior shown in Figure S11, where DDPM with T=100 approaches the upper bound and cannot grow freely. The authors should discuss this limitation explicitly and test at least one alternative normalization strategy such as quantile normalization or z-score normalization over wet pixels only, reporting whether the normalization choice materially changes the extreme-value results.

4.Precipitation is not a typical continuous variable like temperature, pressure, or geopotential height. It is sparse, intermittent, highly skewed, and often better represented by zero-inflated or Tweedie-like distributions. For this reason, the loss-function choice deserves much more discussion than it currently receives. The authors should discuss why their selected losses are appropriate for precipitation specifically, and whether distribution-aware losses could improve tail behavior and wet-day occurrence. Recent studies suggest that distributional losses can be beneficial for precipitation prediction and downscaling. At minimum, this should be discussed more clearly. Ideally, the authors would include a sensitivity test or ablation experiment.

https://arxiv.org/html/2509.08369
https://agupubs.onlinelibrary.wiley.com/doi/full/10.1029/2024GL111828

5.Figure 5 appears to compare model predictions against an upsampled low-resolution field rather than the native high-resolution ERA5-Land target, given the clustering of identical reference values on the x-axis. If so, the comparison is not appropriate and the figure needs to be redone using the actual high-resolution target field. If this interpretation is incorrect, the authors should clarify exactly how the reference field in Figure 5 was constructed.

6.The spatial lag analysis in Fig. 6 is not the most informative way to evaluate scale-dependent structure for precipitation super-resolution. A spatial power spectrum would be more standard and more physically interpretable. The authors should add a spectral analysis to the main paper. The current spatial lag figure could be moved to the supplement.

7.All three models condition only on coarse-resolution precipitation. Precipitation is not a self-contained variable. For instance, topography is a key control on high-resolution precipitation structure, especially in regions where orographic effects are important. The manuscript does not sufficiently discuss the implications of omitting terrain height or other static geographic information as conditioning variables. The smoothness seen in the deterministic baseline may partly reflect the lack of physically informative conditioning, rather than only the architecture itself. This point is also relevant for the generative models. One of the strengths of conditional DDPM is the flexibility with which conditioning information can be incorporated, including modulation-based conditioning (like FiLM used here and AdaGN etc.). The authors should discuss more directly whether including terrain or other physically meaningful covariates could materially change the conclusions.

8.A plain UNet trained with a pointwise loss is well known to produce overly smooth outputs in super resolution tasks, so the observed contrast between UNet and generative models may partly reflect the baseline choice rather than an inherent limitation of deterministic approaches. The paper should justify this baseline more carefully or include at least one stronger deterministic baseline such as a sub-pixel convolution or PixelShuffle-based architecture.
https://arxiv.org/pdf/1609.05158

9.SSIM is a perceptual image metric designed for natural-image comparison based on luminance and contrast, and its physical meaning for sparse, intermittent precipitation fields is limited. SSIM should not be emphasized as a primary result and may be moved to the supplement. The Q-Q diagnostics currently in Figure S7 are more physically meaningful for a heavy-tailed intermittent variable and should be promoted to the main text.

10.Because the low-resolution inputs are constructed using a block-averaging operator, the mass conservation result in Section 5.2.2 is partly guaranteed by the experimental design and is less informative than the manuscript implies. This section should be shortened and some of the discussion moved to the supplement.

11.The manuscript compares a U-Net against generative models, but it does not include a simple interpolation baseline. That omission weakens the benchmark. At minimum, the authors should include one standard interpolation baseline so that readers can assess whether the deterministic neural model actually adds value beyond trivial reconstruction.

12.Precipitation has strong temporal autocorrelation because it is tied to evolving synoptic and mesoscale systems. Wet-spell duration, dry-spell duration, and multi-day persistence are among the most hydrologically relevant properties of any downscaled product. A model may match daily spatial structure while still failing to reproduce realistic persistence across time. The manuscript does not evaluate this sufficiently. The authors should either include temporal diagnostics that directly assess persistence behavior or clearly state that the current evaluation is not enough to establish hydrological usefulness.

Some minor issues :
1.Figures 3 and S3 should state the timestamp or time period shown. The figure captions should clearly indicate which date or sample is being plotted.

2.At line 126, the manuscript refers to one region as the “Pacific Northwest”. Based on the domain shown, this terminology appears inaccurate, since Utah and western Nevada are not usually considered part of the Pacific Northwest. It would be better to use “Northwest” unless the domain is redefined.

3.The manuscript states that the models are trained on the Central Plains and Northwest, while validation uses the Central Plains plus a subset of the Northeast, and the remaining Northeast samples are used for independent testing. This is understandable, but the exact fractions or sample counts should be stated explicitly in the main text.

4.The manuscript sets daily precipitation below 1 mm per day to zero and excludes days with fewer than 1 percent wet pixels. These choices may be reasonable, but the authors should report how many samples are removed by region and season and briefly discuss the potential impact on light-rain statistics and wet-day occurrence.

5.The manuscript states that the models are trained on the Central Plains and Northwest, while validation uses the Central Plains plus a subset of the Northeast, and the remaining Northeast samples are used for independent testing. This is understandable, but the exact fractions or sample counts should be stated explicitly in the main text.

6.The U-Net training description omits weight decay and the Adam beta values; since the WGAN uses non-default beta1=0.0 and beta2=0.9, any deviation from defaults for other models must also be explicitly reported. The DDPM section does not specify the optimizer, initial learning rate, or weight decay.

Citation: https://doi.org/10.5194/egusphere-2026-861-RC1
- AC3: 'Reply on RC1', Shivam Singh, 24 Apr 2026
  
  We are grateful to the reviewer for thoroughly reviewing our work and for providing thoughtful and constructive feedback. The comments were highly insightful and have been very helpful in identifying areas where the manuscript can be further improved.
  We are currently working on the revised version of the manuscript and will submit it in due course. We have attached a response sheet outlining additional analyses, revised results, and clarifications addressing the reviewer’s comments.
  
  Citation: https://doi.org/10.5194/egusphere-2026-861-AC3
CEC1:
'Comment on egusphere-2026-861 - No compliance with the policy of the journal', Juan Antonio Añel, 26 Mar 2026

Dear authors,
Unfortunately, after checking your manuscript, it has come to our attention that it does not comply with our "Code and Data Policy".

https://www.geoscientific-model-development.net/policies/code_and_data_policy.html
Your manuscript does not contain a "Code Availability" section providing all the code used for your work. I am sorry to have to be so outspoken, but we can not accept this, it is forbidden by our policy, and your manuscript should have never been accepted for Discussions given such lack of compliance with the policy of the journal. Our policy clearly states that all the code and data necessary to replicate a manuscript must be published openly and freely to anyone before submission.
Additionally, you do not provide the training data used for your work (you simply cite a paper and a web for ERA5 which is not trusted repository for long-term archival and we can not accept) nor the output files resulting from it.
The GMD review and publication process depends on reviewers and community commentators being able to access, during the discussion phase, the code and data on which a manuscript depends, and on ensuring the provenance of replicability of the published papers for years after their publication. Therefore, you have to reply to this comment as soon as possible with the information for the repositories (link and a permanent identifier for it (e.g. DOI); also, please, check our policy for the characteristics of the accepted repositories) containing all the code and data that you use to produce and replicate your manuscript. We cannot have manuscripts under discussion that do not comply with our policy.
Please, reply to this comment with the new 'Code and Data Availability’ section, which must also be modified in any new version of your manuscript to cite the new repository locations, and corresponding references added to the bibliography.
I must note that if you do not fix this problem, we cannot continue with the peer-review process or accept your manuscript for publication in GMD.
Juan A. Añel

Geosci. Model Dev. Executive Editor

Citation: https://doi.org/10.5194/egusphere-2026-861-CEC1
- AC1:
  'Reply on CEC1', Shivam Singh, 26 Mar 2026
  We sincerely thank the Editor for pointing out this important issue. We apologize for this oversight and fully acknowledge the importance of ensuring reproducibility and long-term accessibility of the code, data, and outputs associated with our manuscript.
  We are currently preparing a complete public archive of the materials required to reproduce the study, including:
  the code used for model training, evaluation, and figure generation,
  
  the processed data and/or reproducible preprocessing workflow used in the study, and
  
  the relevant output files used in the analysis and manuscript figures.
  
  These materials are being organized in a suitable public repository with long-term archival support and a permanent identifier (DOI), in accordance with the GMD Code and Data Policy. We will update this discussion as soon as the repository deposition is finalized and will also revise the manuscript accordingly to include a complete Code and Data Availability section and the corresponding references.
  
  Citation: https://doi.org/10.5194/egusphere-2026-861-AC1
- AC2:
  'Reply on CEC1', Shivam Singh, 02 Apr 2026
  Dear Editor,
  Thank you for your comment and for highlighting the importance of openly sharing code and data to ensure reproducibility and compliance with the journal’s policy. We have now made the materials required for reproducibility publicly available and have updated the manuscript accordingly. Specifically:
  the full code repository used for data downloading, preprocessing, model training, inference, and evaluation is publicly available on GitHub and archived on Zenodo;
  
  the processed dataset splits and selected trained model weights used for the main experiments and figure generation have been archived separately on Zenodo;
  
  the revised manuscript will include updated Code availability and Data availability sections reflecting these resources.
  
  The updated statements are provided below for your reference:
  Code availability
  
  All scripts used for data downloading, preprocessing, model training, inference, and evaluation are openly available in the public GitHub repository: https://github.com/shivamsinghhada/precipitation-downscaling. The exact version of the code used in this study is permanently archived on Zenodo at: https://doi.org/10.5281/zenodo.19297906.
  Data availability
  
  The raw ERA5-Land precipitation data used in this study are publicly available from the Copernicus Climate Change Service (C3S) Climate Data Store (CDS) (Muñoz Sabater, 2019). ERA5-Land provides global land-surface variables at approximately 9 km spatial resolution and hourly temporal resolution and can be accessed at: https://cds.climate.copernicus.eu/datasets/reanalysis-era5-land?tab=download.
  The processed dataset splits and selected trained model weights used for the main experiments and figure generation are archived separately on Zenodo at: https://doi.org/10.5281/zenodo.19324377. These archived materials include the processed 8× and 16× downscaling datasets, selected trained U-Net, WGAN, and DDPM model weights, and the associated inference scripts required to reproduce the main analyses presented in this manuscript.
  Raw ERA5-Land data are not redistributed here because they are already publicly available from ECMWF / CDS. Instead, the full preprocessing workflow required to reproduce the derived datasets is provided in the archived code repository.
  Sincerely,
  
  Shivam Singh
  
  (on behalf of all co-authors)
  
  Citation: https://doi.org/10.5194/egusphere-2026-861-AC2
RC2:
'Comment on egusphere-2026-861', Anonymous Referee #2, 15 Apr 2026
1. General Comments
The study compares three deep learning frameworks (U-NET, WGAN, and DDPM) for precipitation super-resolution across different hydroclimatic regimes. While the comparison is timely, there are significant critical concerns regarding the experimental design and the technical execution that need to be addressed before the paper can be considered for publication.
2. Major Concerns
Training Stability and Overfitting (Critical Concern) A fundamental issue exists regarding the training convergence and generalization of the generative models, particularly the DDPM. In Section 3 (Lines 398–400), the authors describe the dataset splitting into training and testing sets, but no validation set is mentioned. However, validation curves are provided in the Supplemental Material (Figure S2), at least for the UNET and the DDPM models. They are missing for the WGAN model. Upon inspection of Figure S2, despite the compressed scale of the y-axis, the DDPM model appears to exhibit clear signs of overfitting after approximately epoch 10. The divergence between training and validation loss suggests that the model is no longer learning generalizable features of the precipitation fields but is instead memorizing the training samples. Since the remainder of the paper relies on the results derived from these trained weights, the validity of the inter-comparison and the subsequent conclusions regarding DDPM’s performance are in question. Also, the training curves of the WGAN raise some concerns, and no validation curve is shown. The authors must:
Clarify the lack of a validation set description in the main text and for WGAN in FigureS2

Provide a detailed analysis of the loss curves with a more appropriate y-axis scale. Is the overfitting actually happening?

Address how they ensured the "optimal" stopping point for training to prevent reporting results from an overfit model.

In lines 430-434, the authors acknowledge the potential for overfitting but treat it in a too simplistic manner.

Directly related to the previous point: Sample Size vs. Model Complexity (Critical)
The technical rigor of the study is challenged by the potential imbalance between the available data and the model's capacity. The authors utilize ~33,000 daily fields (with only ~19,200 samples for training) to train high-capacity architectures. Modern WGAN and DDPM implementations (including features mentioned like sinusoidal time embeddings and FiLM layers) often contain millions of trainable parameters.

The authors must provide a table explicitly summarizing the total number of trainable parameters for the U-NET, WGAN, and DDPM.

Given the relatively small training set size and the high complexity of the models, the risk of overfitting is severe. The authors must discuss the generalization potential of these models in this context.

Conclusion on the first two points:
This evidence of potential overfitting is a major flaw that propagates through all results and discussions in the manuscript. Before any further analysis can be considered, the authors must demonstrate that the models (especially the DDPM) are not over-parameterized for the provided dataset and that the reported performance metrics are not the result of a model that has failed to generalize.
Additional concerns
Perfect-Model Framework and Practical Usability (Lines 154–157): The authors describe their approach as a "perfect-model super-resolution framework" where both inputs and targets originate from the same dataset (ERA5-Land). While this allows for controlled evaluation, it ignores the critical real-world challenge of "predictor mismatch" or bias between different datasets (e.g., GCM output vs. regional simulations). This design choice significantly hinders the practical usability of the proposed models, as they are not tested against actual biased low-resolution information found in climate model outputs. This limitation must be explicitly stated in the Abstract, and discussed in the Introduction, and Conclusion. The authors should clarify how these models would perform when the "trace" of mesoscale information is truly absent or biased in the low-resolution input.

Data Normalization: The manuscript mentions a log(1+x) transformation and min-max normalization specifically for the DDPM (Line 258). It is unclear if any normalization or transformation was applied to the precipitation data for the U-NET and WGAN models. Given the heavy-tailed nature of precipitation, this is a critical detail. If no normalization was used for the other models, the authors must justify this choice or clarify the preprocessing steps for all architectures.

Stochastic vs. Epistemic Uncertainty: The authors use 10 independently trained models to form an ensemble (Line 270), which they state reflects "epistemic uncertainty." However, the primary strength of generative models like WGAN and DDPM is their ability to produce multiple stochastic realizations from a single trained model. The authors should evaluate the generative potential/stochasticity of a single trained model rather than relying solely on an ensemble of differently trained models, as the latter is computationally expensive and less practical for end-users.

Static Variables: Were any static variables (e.g., elevation, land cover, distance to coast) included as auxiliary inputs to the models? Orographic forcing is mentioned as a relevant driver (especially for one of the domains/regimes line 127), but looks like the models were trained without providing them with this information. Having access to the underlying topography can assist in the downscaling process: why did the authors choose not to include any static covariants in the training?

3. Specific Comments and Technical Corrections
Hydrologic Relevance: The authors mention "hydrologic modeling" as a motivation many times in the text. Could the authors clarify the specific "hydrologic" metrics or assessments used beyond standard meteorological diagnostics?

Target Data Selection (Lines 115–122): The authors use ERA5-Land as the target data. Please provide a brief justification for using a reanalysis product as "Ground Truth" (GT) rather than observational-based data (e.g., high-resolution radar products).

DDPM Hyperparameters (Section 2.2.3): training details for the DDPM are missing. Specifically, please provide the optimizer used (e.g., Adam, AdamW) and the specific learning rate or learning rate schedule employed during training.

Figure 3 Clarification: What do the columns in Figure 3 represent? Please label them clearly in the figure or the caption. The low-resolution input in the last column does not appear to match the high-resolution Ground Truth (GT). Please verify if the block-averaging or interpolation used to create the low-resolution inputs is consistent across the visualization.
Citation: https://doi.org/10.5194/egusphere-2026-861-RC2
- AC4: 'Reply on RC2', Shivam Singh, 24 Apr 2026
  
  We are grateful to the reviewer for carefully evaluating our work and for providing thoughtful and constructive feedback. The comments were insightful and have been very valuable in helping us improve several aspects of the manuscript.
  We are currently preparing the revised version of the manuscript and will submit it in due course. We have attached a response sheet outlining additional analyses, revised results, and clarifications in response to the reviewer’s comments
  
  Citation: https://doi.org/10.5194/egusphere-2026-861-AC4

Shivam Singh, Simon Michael Papalexiou, Hebatallah M. Abdelmoaty, Tom Hartvigsen, and Antonios Mamalakis

Supplement

https://doi.org/10.5194/egusphere-2026-861-supplement

Shivam Singh, Simon Michael Papalexiou, Hebatallah M. Abdelmoaty, Tom Hartvigsen, and Antonios Mamalakis

Viewed

Total article views: 2,041 (including HTML, PDF, and XML)

HTML	PDF	XML	Total	Supplement	BibTeX	EndNote
1,218	730	93	2,041	2,226	68	112

HTML: 1,218
PDF: 730
XML: 93
Total: 2,041
Supplement: 2,226
BibTeX: 68
EndNote: 112

Views and downloads (calculated since 27 Feb 2026)

Month	HTML	PDF	XML	Total
Feb 2026	116	40	12	168
Mar 2026	857	523	68	1,448
Apr 2026	213	136	13	362
May 2026	32	31	0	63

Cumulative views and downloads (calculated since 27 Feb 2026)

Month	HTML	PDF	XML	Total
Feb 2026	116	40	12	168
Mar 2026	857	523	68	1,448
Apr 2026	213	136	13	362
May 2026	32	31	0	63

Viewed (geographical distribution)

Total article views: 2,043 (including HTML, PDF, and XML) Thereof 2,043 with geography defined and 0 with unknown origin.

Country	#	Views	%

Latest update: 19 May 2026

Short summary

High-resolution precipitation is essential for hydrologic and climate-risk applications, but climate models are too coarse to resolve storm-scale structure and extremes. We compare a deterministic U-NET and two generative models (WGAN and diffusion) for 8× and 16× precipitation downscaling using ERA5-Land. All models conserve rainfall mass, but differ at fine scales: U-NET is stable yet smooths extremes, while generative models better capture variability and heavy tails with added uncertainty.


Total:	0
HTML:	0
PDF:	0
XML:	0