Comprehensive Inter-comparison of Generative AI Models for Super-Resolution Precipitation Downscaling Across Hydroclimatic Regimes
Abstract. High-resolution precipitation information is essential for hydrologic modeling, flood forecasting, and climate-risk assessment, yet global weather and climate models operate at spatial resolutions too coarse to resolve storm structure, intermittency, and extremes. Deep-learning-based statistical downscaling provides a computationally efficient alternative to dynamical downscaling, but deterministic convolutional neural networks often yield overly smooth predictions and underestimate fine-scale variability and extreme events. Generative deep-learning models, including generative adversarial networks and diffusion models, offer a promising alternative by enabling stochastic downscaling and explicit representation of uncertainty. This study presents a systematic, hydrologically oriented comparison of three representative deep-learning frameworks for precipitation super-resolution: a convolutional U-Net, a conditional Wasserstein GAN (WGAN), and a conditional denoising diffusion probabilistic model (DDPM). Using a perfect-model experimental design based on ERA5-Land precipitation over distinct hydroclimatic regions of the United States, we evaluate performance under 8× and 16× downscaling tasks within a unified training and evaluation framework. Models are evaluated using diagnostics that examine precipitation distributions, wet–dry occurrence, extremes, spatial structure, storm morphology, mass consistency, ensemble variability, and computational cost. All three models preserve aggregate rainfall mass despite the absence of explicit physical constraints. Differences arise primarily at fine spatial scales and in the representation of extremes, spatial dependence, and uncertainty. The U-Net provides stable and computationally efficient predictions but smooths small-scale variability. The WGAN improves fine-scale structure and heavy-tail behavior at the expense of increased noise.
The DDPM yields physically coherent ensemble members and an explicit representation of uncertainty, at a substantially higher computational cost.
Review for GMD (Geoscientific Model Development)
This manuscript presents a timely and meaningful intercomparison of three widely used generative-AI model classes for precipitation super-resolution downscaling. The study addresses an important problem at the interface of atmospheric science and machine learning, and the effort to compare multiple model classes within a unified framework is valuable. At the same time, several aspects of the manuscript require substantial clarification and strengthening before the conclusions can be fully supported. Addressing these issues would improve the scientific rigor and impact of the paper.
Major Comments:
1. The authors state that the 10-member ensemble is generated from 10 independently trained models initialized with different random seeds. This procedure primarily reflects epistemic uncertainty associated with parameter estimation and training variability. However, the central theoretical advantage of conditional generative models is that, for a given low-resolution input, they can generate a distribution of plausible high-resolution outputs through stochastic sampling. At present, the manuscript uses one prediction from each independently trained model and interprets the resulting spread as ensemble uncertainty, which is not equivalent to sampling the conditional output distribution of a single trained generative model. The authors must separate these two uncertainty sources explicitly. In addition to the current analysis, they should report results from repeated stochastic sampling using a single trained model, preferably the best-performing checkpoint, and compare that spread with the spread arising from different training seeds. This distinction is essential for a correct interpretation of the ensemble results.
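The two spreads could be compared directly. The toy sketch below is purely illustrative (all names are hypothetical; `sample_hr` stands in for one stochastic forward pass of a trained conditional generator) and only demonstrates the distinction between seed-ensemble spread and conditional-sampling spread:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_hr(seed_bias, rng):
    # Hypothetical stand-in for one stochastic draw from a trained
    # conditional generator: a seed-dependent bias plus sampling noise.
    return seed_bias + rng.normal(0.0, 1.0)

# Current design: one draw from each of 10 independently trained seeds.
# The spread mixes training variability (seed_bias) with sampling noise.
seed_biases = rng.normal(0.0, 0.5, size=10)
seed_ensemble = np.array([sample_hr(b, rng) for b in seed_biases])

# Requested analysis: 10 stochastic draws from a single checkpoint.
# The spread reflects only the conditional output distribution.
sampling_ensemble = np.array([sample_hr(seed_biases[0], rng) for _ in range(10)])
```

Reporting both spreads side by side would make clear how much of the published ensemble variance is attributable to each source.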
2. The manuscript refers in several places to ERA5-Land as "observation". This terminology is incorrect: ERA5-Land is a reanalysis-based product, not a direct observational dataset. Since the study does not use in situ station observations, radar, satellite retrievals, or soundings as the reference truth, the manuscript should consistently refer to ERA5-Land as a reanalysis or reanalysis-based target, not as observation. The following paper discusses this distinction in detail:
https://doi.org/10.1175/BAMS-D-14-00226.1
3. The use of min-max normalization may help stabilize training, but it raises an important concern for precipitation, especially for extreme events. Min-max scaling bounds the normalized target by the range seen in the training data, which may hinder robust extrapolation to unprecedented values. This issue is especially relevant for climate-related downscaling and extreme precipitation, where out-of-sample events may exceed the historical training maximum. It may also be relevant to the behavior shown in Figure S11, where the DDPM with T = 100 approaches the upper bound and cannot grow freely. The authors should discuss this limitation explicitly and test at least one alternative normalization strategy, such as quantile normalization or z-score normalization over wet pixels only, reporting whether the normalization choice materially changes the extreme-value results.
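As one concrete alternative, a z-score over log-transformed wet pixels might look like the following sketch (the 1 mm threshold and the dry-pixel sentinel are illustrative choices, not values from the manuscript). Unlike min-max scaling, values above the training maximum remain representable: they simply map to large positive z-scores.

```python
import numpy as np

def wet_zscore_normalize(x, wet_thresh=1.0, dry_fill=-3.0):
    # Standardize log1p-precipitation using wet-pixel statistics only;
    # dry pixels receive a sentinel value below the wet range.
    wet = x >= wet_thresh
    logx = np.log1p(x[wet])
    mu, sd = logx.mean(), logx.std()
    z = np.full(x.shape, dry_fill, dtype=float)
    z[wet] = (logx - mu) / sd
    return z, (mu, sd)

def wet_zscore_invert(z, stats, dry_fill=-3.0):
    # Map normalized values back to precipitation; sentinel pixels -> 0.
    mu, sd = stats
    x = np.zeros_like(z)
    wet = z > dry_fill
    x[wet] = np.expm1(z[wet] * sd + mu)
    return x
```

A sensitivity test swapping this (or a quantile transform) for min-max scaling would show directly whether the Figure S11 saturation is a normalization artifact.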
4. Precipitation is not a typical continuous variable like temperature, pressure, or geopotential height. It is sparse, intermittent, highly skewed, and often better represented by zero-inflated or Tweedie-like distributions. For this reason, the choice of loss function deserves much more discussion than it currently receives. The authors should explain why their selected losses are appropriate for precipitation specifically, and whether distribution-aware losses could improve tail behavior and wet-day occurrence. Recent studies suggest that distributional losses can be beneficial for precipitation prediction and downscaling. At minimum, this should be discussed more clearly; ideally, the authors would include a sensitivity test or ablation experiment.
https://arxiv.org/html/2509.08369
https://agupubs.onlinelibrary.wiley.com/doi/full/10.1029/2024GL111828
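As a concrete example of a distribution-aware objective, the Tweedie deviance with power 1 < p < 2 corresponds to a distribution that places point mass at exactly zero while remaining right-skewed, which matches the character of daily precipitation. The sketch below is illustrative only, not a prescription for the authors' training setup:

```python
import numpy as np

def tweedie_deviance(y, mu, p=1.5):
    # Mean Tweedie deviance for 1 < p < 2. The target y may be exactly
    # zero; the prediction mu must be strictly positive. Vanishes when
    # mu == y, grows with distributional mismatch.
    term1 = np.power(y, 2.0 - p) / ((1.0 - p) * (2.0 - p))
    term2 = y * np.power(mu, 1.0 - p) / (1.0 - p)
    term3 = np.power(mu, 2.0 - p) / (2.0 - p)
    return float(np.mean(2.0 * (term1 - term2 + term3)))
```

An ablation replacing (or augmenting) the pointwise loss with such a deviance would directly test the reviewers' concern about tail behavior and wet-day occurrence.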
5. Figure 5 appears to compare model predictions against an upsampled low-resolution field rather than the native high-resolution ERA5-Land target, given the clustering of identical reference values on the x-axis. If so, the comparison is not appropriate and the figure needs to be redone using the actual high-resolution target field. If this interpretation is incorrect, the authors should clarify exactly how the reference field in Figure 5 was constructed.
6. The spatial lag analysis in Fig. 6 is not the most informative way to evaluate scale-dependent structure for precipitation super-resolution. A spatial power spectrum would be more standard and more physically interpretable. The authors should add a spectral analysis to the main paper; the current spatial lag figure could be moved to the supplement.
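Such a radially averaged spectrum is inexpensive to compute; a generic sketch, not tied to the manuscript's grids:

```python
import numpy as np

def radial_power_spectrum(field):
    # Radially averaged power spectral density of a 2-D field.
    ny, nx = field.shape
    psd2d = np.abs(np.fft.fftshift(np.fft.fft2(field))) ** 2
    y, x = np.indices((ny, nx))
    r = np.hypot(x - nx // 2, y - ny // 2).astype(int)
    power_sum = np.bincount(r.ravel(), weights=psd2d.ravel())
    counts = np.maximum(np.bincount(r.ravel()), 1)  # guard empty bins
    return power_sum / counts  # mean power per integer wavenumber bin
```

Overlaying the spectra of predictions and target would show at which wavenumbers variance is lost (deterministic smoothing) or spuriously added (GAN noise), which the lag analysis cannot cleanly separate.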
7. All three models condition only on coarse-resolution precipitation. Precipitation is not a self-contained variable: topography, for instance, is a key control on high-resolution precipitation structure, especially in regions where orographic effects are important. The manuscript does not sufficiently discuss the implications of omitting terrain height or other static geographic information as conditioning variables. The smoothness seen in the deterministic baseline may partly reflect the lack of physically informative conditioning, rather than only the architecture itself. This point is also relevant for the generative models: one strength of conditional DDPMs is the flexibility with which conditioning information can be incorporated, including modulation-based conditioning (e.g., FiLM, as used here, or AdaGN). The authors should discuss more directly whether including terrain or other physically meaningful covariates could materially change the conclusions.
8. A plain U-Net trained with a pointwise loss is well known to produce overly smooth outputs in super-resolution tasks, so the observed contrast between the U-Net and the generative models may partly reflect the baseline choice rather than an inherent limitation of deterministic approaches. The paper should justify this baseline more carefully or include at least one stronger deterministic baseline, such as a sub-pixel convolution (PixelShuffle) architecture:
https://arxiv.org/pdf/1609.05158
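For reference, the depth-to-space rearrangement at the core of such a head is a simple tensor operation; a numpy sketch equivalent to `torch.nn.PixelShuffle` applied to a single (C·r², H, W) array:

```python
import numpy as np

def pixel_shuffle(x, r):
    # Depth-to-space: rearrange (C*r*r, H, W) -> (C, H*r, W*r), as in
    # sub-pixel convolution super-resolution heads.
    c2, h, w = x.shape
    c = c2 // (r * r)
    x = x.reshape(c, r, r, h, w)          # (c, i, j, h, w)
    x = x.transpose(0, 3, 1, 4, 2)        # (c, h, i, w, j)
    return x.reshape(c, h * r, w * r)
```

A learned convolution producing C·r² channels followed by this rearrangement is a standard, stronger deterministic baseline for the upsampling stage.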
9. SSIM is a perceptual image metric designed for natural-image comparison based on luminance and contrast, and its physical meaning for sparse, intermittent precipitation fields is limited. SSIM should not be emphasized as a primary result and could be moved to the supplement. The Q–Q diagnostics currently in Figure S7 are more physically meaningful for a heavy-tailed, intermittent variable and should be promoted to the main text.
10. Because the low-resolution inputs are constructed with a block-averaging operator, the mass-conservation result in Section 5.2.2 is partly guaranteed by the experimental design and is less informative than the manuscript implies. This section should be shortened, with some of the discussion moved to the supplement.
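The point is easy to verify: when the coarse input is the block mean of the target, the domain-mean mass of input and target agree exactly by construction, so a model that roughly reproduces its conditioning field inherits much of the mass budget. A minimal sketch of this degradation operator:

```python
import numpy as np

def block_average(hr, factor):
    # Coarsen a high-resolution field by averaging non-overlapping
    # factor x factor blocks (the perfect-model degradation operator).
    ny, nx = hr.shape
    return hr.reshape(ny // factor, factor, nx // factor, factor).mean(axis=(1, 3))

# Domain-mean mass is preserved exactly by construction:
# block_average(hr, f).mean() == hr.mean()
```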
11. The manuscript compares a U-Net against generative models but does not include a simple interpolation baseline. That omission weakens the benchmark. At minimum, the authors should include one standard interpolation baseline (e.g., bilinear or bicubic) so that readers can assess whether the deterministic neural model actually adds value beyond trivial reconstruction.
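Such a baseline requires only a few lines with standard tools; a sketch using bicubic spline upsampling via `scipy.ndimage.zoom` (the factor, boundary handling, and clipping are illustrative choices):

```python
import numpy as np
from scipy.ndimage import zoom

def interpolation_baseline(lr, factor, order=3):
    # Bicubic (order-3 spline) upsampling of a coarse precipitation
    # field: a trivial reconstruction any learned model should beat.
    hr = zoom(lr, factor, order=order, mode="nearest", grid_mode=True)
    return np.clip(hr, 0.0, None)  # cubic splines can overshoot below zero
```

Scoring this baseline with the same diagnostics as the neural models would quantify the added value of the U-Net beyond smooth interpolation.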
12. Precipitation has strong temporal autocorrelation because it is tied to evolving synoptic and mesoscale systems. Wet-spell duration, dry-spell duration, and multi-day persistence are among the most hydrologically relevant properties of any downscaled product. A model may match daily spatial structure while still failing to reproduce realistic persistence across time, and the manuscript does not evaluate this sufficiently. The authors should either include temporal diagnostics that directly assess persistence behavior or clearly state that the current evaluation is insufficient to establish hydrological usefulness.
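Spell diagnostics of this kind are straightforward to compute; a sketch for wet-spell lengths at a single pixel (the 1 mm wet-day threshold mirrors the manuscript's definition; dry spells follow by negating the mask):

```python
import numpy as np

def spell_lengths(wet):
    # Lengths of consecutive True runs in a 1-D boolean time series.
    padded = np.concatenate(([0], wet.astype(int), [0]))
    d = np.diff(padded)
    starts = np.flatnonzero(d == 1)   # dry -> wet transitions
    ends = np.flatnonzero(d == -1)    # wet -> dry transitions
    return ends - starts

# Example: wet-day mask from a daily series at one pixel
precip = np.array([0.2, 3.0, 5.1, 0.0, 0.4, 12.0, 2.2, 2.5, 0.0])
wet_spells = spell_lengths(precip >= 1.0)  # -> array([2, 3])
```

Comparing the distributions of spell lengths in downscaled versus target series would directly test the persistence behavior that the current spatial-only evaluation misses.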
Minor Comments:
1. Figures 3 and S3 should state the timestamp or time period shown. The figure captions should clearly indicate which date or sample is being plotted.
2. At line 126, the manuscript refers to one region as the "Pacific Northwest". Based on the domain shown, this terminology appears inaccurate, since Utah and western Nevada are not usually considered part of the Pacific Northwest. It would be better to use "Northwest" unless the domain is redefined.
3. The manuscript states that the models are trained on the Central Plains and Northwest, while validation uses the Central Plains plus a subset of the Northeast, and the remaining Northeast samples are used for independent testing. This is understandable, but the exact fractions or sample counts should be stated explicitly in the main text.
4. The manuscript sets daily precipitation below 1 mm per day to zero and excludes days with fewer than 1 percent wet pixels. These choices may be reasonable, but the authors should report how many samples are removed by region and season and briefly discuss the potential impact on light-rain statistics and wet-day occurrence.
5. The U-Net training description omits weight decay and the Adam beta values; since the WGAN uses the non-default beta1 = 0.0 and beta2 = 0.9, any deviation from defaults for the other models must also be reported explicitly. The DDPM section does not specify the optimizer, initial learning rate, or weight decay.