the Creative Commons Attribution 4.0 License.
A Framework for the Assessment of Rainfall Disaggregation Methods in Representing Extreme Precipitation
Abstract. High-resolution precipitation data are essential to analyze extreme rainfall, critical for hydrological modeling, infrastructure design and climate change assessments. As high-resolution rainfall data are limited, disaggregation methods offer an alternative for generating such data. Although many studies have evaluated these methods, there is no framework for their selection based on their performance in reproducing extreme attributes. This paper presents a framework for evaluating daily-to-hourly rainfall disaggregation methods, measuring performance in representing extreme precipitation behavior. The framework assesses this performance using Intensity-Duration-Frequency (IDF) curves and extreme rainfall indices (ERIs). IDF curve disaggregation performance evaluation uses accuracy and precision metrics (i.e., how close and consistent disaggregated values are to observed data, respectively), while ERIs are assessed by comparing the variability and bias of disaggregated annual series to observed data using a modified Kling-Gupta efficiency. The framework was applied to five sites with diverse climates, using three disaggregation methods: (1) a stochastic pulse-type method (SOC), (2) a non-parametric k-nearest neighbor (k-NN), and (3) a method based on Huff curves (HUFF). Results show that k-NN tends to outperform other methods in replicating IDF curves, modeling extreme rainfall percentiles and capturing the occurrence and magnitude of intense precipitation events, as well as most critical dry situations. SOC performs well in precision but shows lower accuracy, while HUFF is best at modeling 5-hour maximum rainfall. Nonetheless, these performances are not consistent across all locations, with the best-performing method varying per site, highlighting the importance of context-specific evaluations enabled by the framework.
Status: closed
RC1: 'Comment on egusphere-2025-2710', Anonymous Referee #1, 22 Jul 2025
AC1: 'Reply on RC1', Claudio Sandoval, 05 Sep 2025
Acknowledgment: We thank the reviewer for their time and suggestions, which will allow us to improve our work.
This was the first time I was involved as a reviewer for this manuscript. The manuscript introduces a framework for a systematic comparison of rainfall extreme values from generated time series. The manuscript is well-written. Unfortunately, I have to recommend the rejection of the manuscript. Please find my detailed comments below.
Major comments
- By the chosen title the authors want to introduce a new framework for assessing rainfall extreme values from generated time series. The authors could not convince me that such a framework is required; current studies on rainfall generation validate extreme value behavior in a sufficient manner. Indeed, the establishment of a systematic procedure could be nice, but for that the authors would have had to i) compare the framework with conventional validation strategies (classical presentation of IDF curves), and ii) share the code for the evaluation with the community, which is unfortunately not done (only “…upon reasonable request.”). My second issue is the comparison of rainfall generators: although not part of the title and beyond the scope of the manuscript, the authors judge the studied rainfall generators without hydro-meteorologic validation, and rank them (Table 4, Table 6). The manuscript should either focus on the introduction of the framework (comparisons with established methods, providing the code), or on the comparison of the rainfall generators. Combining both topics in one manuscript does not provide the space that each topic would require for a proper study.
ANS: Most studies that compare rainfall-disaggregation methods do so in order to propose a new approach. However, a hydrological modeler who needs to choose among already available methods has no recommendations on how to approach this task. While researchers developing new methods can readily compare alternatives, for hydrological modelers outside research that task might be overwhelming. It is to address this need that we propose a framework to help hydrological modelers select among existing rainfall-disaggregation methods. We appreciate the comment and will present this gap more directly in the new version of the manuscript.
We would like to clarify that our framework does not replace traditional validation approaches, but rather systematizes and extends them. The direct comparison between observed IDF curves and those derived from disaggregated series (commonly used in the design of new disaggregation methods) is already embedded as the foundation of our approach. What the framework adds is the ability to synthesize this comparison into unified indicators of accuracy and precision (AEI, PEI, DT), which condense the information contained in IDF curves into single-valued metrics. In this way, the spirit of traditional validation is preserved, while its outcome becomes easier to interpret, comparable across methods, and directly representable in a common graphical space. Moreover, the framework quantifies the uncertainty with which the IDF curve is reproduced by multiple disaggregations, something that is not necessarily captured by traditional approaches. In the revised version of the manuscript, we will make this added value more explicit, and we will consider including, as supplementary material, a graphical illustration of the construction of IDF curves to more transparently show how they are integrated within the framework.
Regarding code availability, we agree that reproducibility and transparency are fundamental principles that must prevail in a study of this nature. Therefore, in the revised manuscript we will provide public access to the full implementation of the proposed framework through an open repository. This repository will also include the selected disaggregation methods as examples, not because the comparison of generators is the main objective of our work, but as a means to facilitate the implementation and application of the framework by other researchers. Beyond this, we encourage users to apply the framework with other disaggregation methods, for which open-source codes may already be available in the literature, or even with newly developed implementations.
We also acknowledge the reviewer’s concern that the current manuscript may appear to combine two different goals (i.e., the introduction of the framework and the comparison of rainfall disaggregation methods). Our main aim is indeed to present the framework. The application to three methods was included only to demonstrate its usefulness in highlighting strengths and limitations across metrics and sites. In the revised manuscript, we will clarify this intention throughout the text, tone down language that could be interpreted as ranking rainfall generators and explicitly state that hydrometeorological validation is beyond the scope of this study. We will also emphasize that future applications of the framework should include such validation in order to connect methodological performance with practical hydrological relevance. Through these revisions, we believe the manuscript will more clearly focus on the introduction and demonstration of the framework, while keeping the comparison of generators strictly as an illustrative application.
Specific comments
- L44-63 The authors classify the evaluation possibilities into three types. Actually, there are four types; the most practical type is missing: the subsequent application. Due to the high non-linearity of rainfall-runoff transformation processes, one cannot conclude from a single rainfall characteristic on the performance when the generated rainfall time series are used as input for the subsequent application. Sometimes the application indicates critical shortcomings of the generated time series and leads to an iterative optimization of the rainfall generation process. This type, including references, should be added.
ANS: We agree that a fourth type of evaluation can be recognized, namely the “subsequent application of disaggregated rainfall in hydrological models”. This approach would be particularly relevant in contexts where short-duration extremes strongly influence hydrological response, such as small urban catchments. Nevertheless, we emphasize that our study does not aim to optimize or iteratively adjust disaggregation methods based on such applications. The methods are applied as defined in their theoretical formulations (with only minor modifications explicitly described in the manuscript). Our focus is on presenting a systematic statistical framework, which can serve as a foundation for future hydrological applications.
Although the “subsequent application of disaggregated rainfall in hydrological models” is outside the scope of the paper, we do agree that it should be mentioned. Hence, we will add a couple of lines and references in the final part of this paragraph. With these additions, the four evaluation types you mention will be clearly presented in the introduction. Finally, we will clearly state in the scope and objectives of the paper that this fourth type is outside our objectives, so that the reader can properly frame the paper.
- Fig. 1a: The upper line shows arrows, but I think these are just headers of the boxes below (areas within the coloured dashed lines). If so, I suggest changing the layout; in its current version it can be misleading to have arrows twice, for the headers and for the datasets/methods below.
ANS: The upper arrows in Fig. 1a are indeed intended as headers summarizing the processes represented in the areas within the colored dashed lines. We acknowledge that the current layout may be misleading, and in the revised version we will adjust the figure design to make this distinction clearer. Thank you for highlighting this point.
- Fig. 1b: Why is the framework limited to 1 h and not to 1 min or 5 min, the typical lower bounds for rainfall generation?
ANS: In principle, the proposed framework can be applied at any temporal resolution, including 1- or 5-minute data. However, in practice we chose 1 h as the minimum resolution because it represents the most widely available and comparable temporal scale across global datasets. For example, the INTENSE project dataset (Lewis et al., 2019) used in this work collected rainfall records at resolutions ranging from 1 min to 6 h, but standardized the dataset at 1 h because this resolution was by far the most consistently available and reliable across regions. This issue is also relevant to our case study in Chile, where long-term records at sub-hourly resolution are extremely scarce, and even hourly data are limited compared to daily observations. We will clarify in the revised manuscript that the framework can be easily adapted to different sub-daily time resolutions. One hour was chosen as an example because it is the most commonly available resolution in observed rainfall records. The framework can be applied to other resolutions by simply adjusting the temporal step used to construct IDF curves and to calculate extreme rainfall indices, without altering its overall structure.
- Fig. 1c: What is ‘Index value’ on the y-axis? Should it be ‘Rainfall amount (mm)’?
ANS: In Fig. 1c, “Index value” refers to the annual value of the specific Extreme Rainfall Index (ERI) being evaluated, as listed in Table 1, rather than to a rainfall amount in millimeters. We will revise the caption and axis label to make this more explicit in the revised manuscript.
- L110-111 Reference curve: Why is crossing that curve such a relevant issue that the method should be discarded? Please provide references or examples here.
ANS: The reference IDF curve is constructed solely from the available daily data, following a deterministic approach. The reason for using it as a benchmark was inspired by a practical engineering perspective: it represents what a decision-maker in hydrological design might do in the absence of long hourly records. The statement about discarding a method reflects this pragmatic logic: if the reference IDF curve (constructed here from a simple empirical frequency analysis assuming a “reasonable” triangular distribution of daily rainfall over its 24 h) reproduces the observed IDF curve better than the disaggregated curves (median), then it is better to just use the benchmark. We will better clarify this reasoning in the revised manuscript.
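To make the benchmark construction concrete, the reasoning above can be sketched in code. This is a minimal illustration under stated assumptions only: a symmetric triangular hyetograph over 24 h and Weibull plotting positions for the empirical frequency analysis. The function name and details are ours, not taken from the manuscript.

```python
import numpy as np

def triangular_benchmark_idf(daily_annual_max, durations=(1, 2, 3, 4, 6, 12, 24)):
    """Benchmark IDF intensities (mm/h) from daily annual maxima, assuming
    each day's rain follows a symmetric triangular hyetograph over 24 h.
    A d-hour window centred on the peak captures a depth of
    P * (1 - ((24 - d) / 24)**2): the full triangle minus two tail triangles."""
    p = np.sort(np.asarray(daily_annual_max, dtype=float))[::-1]  # descending
    n = p.size
    return_periods = (n + 1) / np.arange(1, n + 1)  # Weibull plotting positions
    idf = {}
    for d in durations:
        depth = p * (1.0 - ((24.0 - d) / 24.0) ** 2)  # mm captured in d hours
        idf[d] = {"T_years": return_periods, "intensity_mm_h": depth / d}
    return idf

# Example with synthetic daily annual maxima (mm)
rng = np.random.default_rng(42)
maxima = rng.gamma(shape=4.0, scale=15.0, size=10)
curves = triangular_benchmark_idf(maxima)
```

Note that for d = 24 h the captured depth equals the full daily total, so the benchmark coincides with the observed daily return levels, while shorter durations inherit intensities from the assumed triangular shape.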
- L114 I don’t understand this sentence. First, three values for d are provided, which yield three other values for d?
ANS: In the proposed temporal distribution, each intensity of the benchmark IDF curve is assigned to the midpoint of its duration. For example, for the 4–5 h interval the corresponding duration is taken as d = 4.5 h. We will revise the sentence in the manuscript to state this more clearly.
- Eqs. 2, 3: I’m wondering about the selection of the studied durations. The index runs from d = 1, …, 24, so also very uncommon durations such as d = 19 h are analysed? Although a lower weight is chosen (1/d), I question the acceptance of this method within the hydrologic community. I suggest keeping the equations more open and providing the durations used in this study in the paragraph below. For me, typical durations would have been d = {1 h, 2 h, 3 h, 4 h, 6 h, 12 h, 24 h}.
ANS: The reason for initially using all durations from 1 to 24 h was to avoid introducing subjectivity by selecting a pool of durations and to ensure that the evaluation was fully systematic. We agree with the reviewer, however, that this procedure introduces redundancies (e.g., the 23 h duration yields values very similar to 24 h) and includes durations that are uncommon in hydrological practice. In the revised manuscript we will restrict the evaluation to the most typical design durations (1, 2, 3, 4, 6, 12, and 24 h). This choice not only reflects standard hydrological practice, but also avoids redundancy and gives more relative weight to the shorter durations, which are particularly relevant for the assessment of extremes.
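As an illustration of the 1/d weighting discussed above, the following sketch computes a duration-weighted error over the restricted pool of typical design durations. The function name and the exact aggregation are illustrative assumptions; the manuscript's Eqs. 2-3 may combine the terms differently.

```python
import numpy as np

DESIGN_DURATIONS = np.array([1, 2, 3, 4, 6, 12, 24], dtype=float)  # hours

def weighted_idf_error(i_obs, i_sim, durations=DESIGN_DURATIONS):
    """Duration-weighted mean absolute relative error between observed and
    simulated IDF intensities at a fixed return period. The 1/d weights give
    short durations, the most relevant for extremes, the largest influence."""
    w = 1.0 / durations
    w /= w.sum()
    i_obs = np.asarray(i_obs, dtype=float)
    i_sim = np.asarray(i_sim, dtype=float)
    rel_err = np.abs(i_sim - i_obs) / i_obs
    return float(np.sum(w * rel_err))

# Illustrative observed intensities (mm/h) for the seven design durations
obs = np.array([50.0, 35.0, 28.0, 24.0, 18.0, 11.0, 7.0])
```

With this weighting, a 10 % error at d = 1 h penalizes a method far more than the same relative error at d = 24 h, which is the intended emphasis on short-duration extremes.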
- Eq. 2: I’m wondering how this criterion is affected by different numbers of realisations. For a comparison of two rainfall generators A and B, ten realisations may be required to represent the statistical uncertainty for A, while 100 realisations are required for B. Can this be a problem?
ANS: Equation 2 refers to the Accuracy Efficiency Index (AEI), which measures how closely the median IDF curve derived from disaggregated series reproduces the observed IDF curve. The value of AEI does depend on the number of realizations (N) if N is small, since the sample may not adequately represent the underlying distribution. However, once a sufficiently large N is used, the estimate of the median IDF curve becomes stable, and increasing N further does not change the result in any meaningful way. The same reasoning applies to Eq. 3 (the Precision Efficiency Index, PEI), which characterizes the spread of the IDF curves derived from each realization. In our study we used 300 realizations for each method, a sample size large enough to ensure stable estimates and to provide results that would be essentially the same as with 1000 realizations. We will clarify these aspects in the revised manuscript.
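A simple convergence check like the following sketch can support the claim that the ensemble median has stabilized at N = 300; it is not taken from the manuscript, just one plausible way to verify stability against the number of realizations.

```python
import numpy as np

def median_stability(realizations, block_sizes=(30, 100, 300)):
    """Track how the ensemble median stabilises as the number of
    realizations N grows. `realizations` is an (N, k) array, e.g. one
    row of IDF intensities per disaggregation run. Returns, for each
    block size, the maximum relative deviation of the partial median
    from the median of the full ensemble."""
    realizations = np.asarray(realizations, dtype=float)
    full_median = np.median(realizations, axis=0)
    drifts = []
    for n in block_sizes:
        partial = np.median(realizations[:n], axis=0)
        drifts.append(float(np.max(np.abs(partial - full_median) / full_median)))
    return drifts  # should shrink toward 0 as n approaches len(realizations)

# Synthetic ensemble: 300 runs, 7 durations
rng = np.random.default_rng(0)
ens = rng.gamma(5.0, 4.0, size=(300, 7))
drifts = median_stability(ens)
```

If the drift between, say, N = 100 and the full ensemble is already negligible, adding more realizations would not meaningfully change the median IDF curve.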
- Chapter 5: The introduction of something ‘new’ demands comparisons with something ‘old or established’. Without previous experience, Fig. 3 is hard to interpret for the reader: how does a PEI = 0.4 differ from PEI = 0.2? Or the same PEI for different AEI? The authors could provide classical IDF curves to show the benefit of the newly proposed framework: what is visible with the new framework that could not be seen/quantified before?
ANS: We agree that without a direct link to classical IDF curves, the interpretation of Fig. 3 may not be straightforward for readers. As noted in our response to your Major Comment, we will include illustrative examples of classical IDF curves in the revised manuscript. This will make the connection between the traditional validation and the proposed framework explicit, and will clarify how the framework adds value by disentangling accuracy from precision and by providing a quantitative assessment of differences that are otherwise only qualitatively visible.
- L378-392 The manuscript aims at introducing a new framework, not at comparing rainfall generators. For a fair comparison, more hydro-meteorologic in-depth analyses of the results would be required. This comment is also valid for Sect. 5.2.
ANS: Indeed, the main contribution of the manuscript is the introduction of the framework, not a systematic comparison of rainfall generators or a full hydrological application. We fully agree that a more in-depth hydrological test (for instance, running a rainfall–runoff model such as HEC-HMS) would be required for a fair comparison of methods. However, such an analysis goes beyond the scope of the present manuscript, which is focused on providing a methodological basis for evaluating disaggregation approaches. We appreciate this suggestion, as it reveals that the objectives and boundaries of the study were not sufficiently emphasized in the original version. In the revised manuscript we will therefore make this scope more explicit, clarifying that while the framework could indeed be extended to hydrological impact analyses, our aim here is to establish and illustrate the framework itself, leaving its application to specific hydrological models as a future step.
Citation: https://doi.org/10.5194/egusphere-2025-2710-AC1
RC2: 'Comment on egusphere-2025-2710', Anonymous Referee #2, 22 Jul 2025
This study proposes an evaluation framework for the assessment of temporal disaggregation methods of daily precipitation time series, focusing on key statistical properties (IDF relationships, KGE). The paper is very well written and the motivation is clearly explained. However, the proposed framework does not represent a significant novelty compared to the evaluations made in past studies. Furthermore, this framework is illustrated using disaggregation approaches that are not representative of the current approaches in my opinion. This is a strong limitation of this work given that many conclusions or interpretations are based on this set of experiments (see comments below).
I understand that this is not the main objective of this study, but a more comprehensive evaluation with state-of-the-art approaches would have given more value to this study. The Stochastic pulse-type method applied in this study was developed in 2001 and more recent versions are expected to perform a lot better (see e.g. https://hess.copernicus.org/articles/28/391/2024/). The approach using Huff curves has been proposed in 1967 and is clearly outdated in my opinion. The k-nearest neighbors method is still applied and could represent a benchmark. Random cascade models are cited in the introduction and still widely used but are not tested.
General comments:
l.1-3: The beginning of the abstract should clearly indicate that this study focuses on temporal disaggregation of daily precipitation data. There are a lot of other approaches for downscaling precipitation data, especially in space (e.g. dynamical downscaling using regional climate models) and these first sentences are more relevant to these approaches in my opinion.
l.16: High-resolution precipitation data often refer to gridded data with a high spatial resolution.
l.74-76: There are more recent approaches that could have been considered. For example, I was surprised that no random cascade model was considered (see, e.g. https://doi.org/10.5194/hess-27-3643-2023).
l.171: As acknowledged by the authors, other ERIs could be more relevant in other studies. In particular, the number of consecutive hours that defines an intense precipitation event or the quantile used to define a large precipitation intensity are really specific to each case study. In my opinion, these choices illustrate the fundamental difficulty in proposing a universal evaluation framework that would be adapted to all possible applications of a disaggregation method.
l.413-415: I am not sure I understand; is it claimed that the timing of observed events does not follow a Poisson process? In my experience this is not true; a Poisson process has been shown to be reasonable in many applications of cluster-based models.
l.418: A common criticism made of the kNN approach is actually that the maximum disaggregated values are limited to the largest observed values, which is not a satisfying feature in extrapolation. It is sometimes combined with random draws from a distribution for this reason.
Citation: https://doi.org/10.5194/egusphere-2025-2710-RC2
AC2: 'Reply on RC2', Claudio Sandoval, 05 Sep 2025
Acknowledgment: We sincerely thank you for such constructive comments and suggestions, which will help us to improve the clarity, scientific quality, and overall contribution of this work to the community.
Major comments
- This study proposes an evaluation framework for the assessment of temporal disaggregation methods of daily precipitation time series, focusing on key statistical properties (IDF relationships, KGE). The paper is very well written and the motivation is clearly explained. However, the proposed framework does not represent a significant novelty compared to the evaluations made in past studies. Furthermore, this framework is illustrated using disaggregation approaches that are not representative of the current approaches in my opinion. This is a strong limitation of this work given that many conclusions or interpretations are based on this set of experiments (see comments below). I understand that this is not the main objective of this study, but a more comprehensive evaluation with state-of-the-art approaches would have given more value to this study. The Stochastic pulse-type method applied in this study was developed in 2001 and more recent versions are expected to perform a lot better (see e.g. https://hess.copernicus.org/articles/28/391/2024/). The approach using Huff curves has been proposed in 1967 and is clearly outdated in my opinion. The k-nearest neighbors method is still applied and could represent a benchmark. Random cascade models are cited in the introduction and still widely used but are not tested.
ANS: Regarding the choice of methods used to illustrate the framework, our selection was deliberate in order to cover conceptually distinct approaches that are relatively simple to implement, apply, and compare. We note that the reference to Huff (1967) was misleadingly phrased: the year refers to the publication where the Huff curves were first presented, but the disaggregation method applied in our study is an adaptation we developed based on these curves, rather than a direct application of the original 1967 procedure. We would also like to emphasize that the selection of methods is not the main focus of the paper. The methods are included only as illustrative applications, and in the revised manuscript we will present this more clearly to ensure that the focus remains on the framework itself rather than on the specific methods tested.
We agree that more recent approaches, such as the aforementioned random cascade models, represent important lines of development. Our framework is intentionally method-agnostic and could be readily applied to those state-of-the-art approaches. To align the study more closely with the current state of the art, in the revised manuscript we will extend the set of disaggregation methods considered to include more recent approaches. At the same time, in the present work we deliberately selected methods that are well established and conceptually different, in order to highlight how the framework can provide insights across a broad range of disaggregation philosophies.
General comments
- l.1-3: The beginning of the abstract should clearly indicate that this study focuses on temporal disaggregation of daily precipitation data. There are a lot of other approaches for downscaling precipitation data, especially in space (e.g. dynamical downscaling using regional climate models), and these first sentences are more relevant to those approaches in my opinion.
ANS: We agree that the first sentences of the abstract could be misinterpreted, given that they do not clearly state that the focus is on temporal disaggregation. In the revised manuscript we will rephrase the beginning of the abstract to clearly state that the study focuses on developing a framework to evaluate temporal disaggregation methods, from daily to sub-daily timescales.
- l.16: High-resolution precipitation data often refer to gridded data with a high spatial resolution.
ANS: We agree that the term “high-resolution precipitation data” is often used to refer to gridded products with high spatial resolution, rather than temporal resolution. As noted in our response to the previous comment (comment 2.1), we will rephrase the beginning of the abstract to avoid this ambiguity and to clearly indicate that our study focuses on temporal disaggregation of daily precipitation into sub-daily scales.
- l.74-76: There are more recent approaches that could have been considered. For example, I was surprised that no random cascade model was considered (see, e.g. https://doi.org/10.5194/hess-27-3643-2023).
ANS: As noted in our response to the major comment above, we acknowledge the importance of more recent approaches such as random cascade models. In the revised manuscript we will extend the set of disaggregation methods considered to include more recent approaches.
- l.171: As acknowledged by the authors, other ERIs could be more relevant in other studies. In particular, the number of consecutive hours that defines an intense precipitation event or the quantile used to define a large precipitation intensity are really specific to each case study. In my opinion, these choices illustrate the fundamental difficulty in proposing a universal evaluation framework that would be adapted to all possible applications of a disaggregation method.
ANS: We agree that the selection of ERIs cannot be completely arbitrary, otherwise the framework would lose coherence and comparability. Our intention is not to suggest that “anything goes,” but rather to emphasize that while some flexibility exists, the ERIs should always be chosen following clear criteria. In our case, the selected ERIs have the advantage of being widely used in hydroclimatic studies, of covering complementary aspects of extremes (frequency, intensity, and duration), and of being directly computable from both observed and disaggregated series. These properties make them suitable for systematic application and comparison across methods. We acknowledge, however, that other studies may have specific objectives that require alternative indices (e.g., tailored to local design needs or hydrological impact assessments). In such cases, the framework remains adaptable, provided that the selection of ERIs is guided by the same principles of relevance, representativeness, and comparability. In the revised manuscript, we will make these criteria more explicit to clarify the balance between flexibility and consistency within the framework.
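For concreteness, the modified Kling-Gupta efficiency of Kling et al. (2012), which scores bias via the mean ratio and variability via the ratio of coefficients of variation, can be sketched as below. Whether the paper's modification retains the correlation term is not stated in this exchange, so the sketch makes it optional; this is an illustration, not the manuscript's exact formulation.

```python
import numpy as np

def kge_prime(sim, obs, use_correlation=True):
    """Modified Kling-Gupta efficiency (Kling et al., 2012):
    KGE' = 1 - sqrt((r - 1)^2 + (beta - 1)^2 + (gamma - 1)^2), where
    beta  = mean(sim) / mean(obs)   (bias ratio)
    gamma = CV(sim) / CV(obs)       (variability ratio).
    Set use_correlation=False to score only bias and variability, e.g.
    when comparing annual ERI series whose year-to-year pairing with
    the observations is not meaningful."""
    sim = np.asarray(sim, dtype=float)
    obs = np.asarray(obs, dtype=float)
    beta = sim.mean() / obs.mean()
    gamma = (sim.std(ddof=1) / sim.mean()) / (obs.std(ddof=1) / obs.mean())
    terms = (beta - 1.0) ** 2 + (gamma - 1.0) ** 2
    if use_correlation:
        r = np.corrcoef(sim, obs)[0, 1]
        terms += (r - 1.0) ** 2
    return 1.0 - float(np.sqrt(terms))

# Illustrative annual ERI series (e.g. annual maximum 1 h rainfall, mm)
obs_eri = np.array([12.0, 30.0, 21.0, 44.0, 25.0, 18.0])
```

A perfect reproduction of the observed series scores 1; a series with the right variability but a 50 % mean bias is penalized through the beta term alone.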
- l.413-415: I am not sure I understand; is it claimed that the timing of observed events does not follow a Poisson process? In my experience this is not true; a Poisson process has been shown to be reasonable in many applications of cluster-based models.
ANS: We thank the reviewer for this important clarification. Our intention was not to claim that a Poisson process is inadequate for representing rainfall event timing. On the contrary, we fully acknowledge that Poisson arrivals have been widely and successfully used in point-process models. What we intended to highlight is that the Socolofsky method (SOC) does not follow the logic of Poisson event arrivals: instead of generating events over continuous time, it allocates pulses within discrete intervals. As a result, SOC may incur limitations in reproducing the temporal intermittency of rainfall, particularly the dry steps between pulses. In the revised manuscript we will rephrase this passage to avoid any misinterpretation and to make clear that the limitation refers to the SOC implementation, not to Poisson-based approaches in general.
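The distinction drawn here, continuous-time Poisson event arrivals versus allocation of pulses to discrete intervals, can be illustrated with a toy sketch. Both functions are caricatures for illustration only, not the actual SOC algorithm or a cluster-based model.

```python
import numpy as np

rng = np.random.default_rng(7)

def poisson_arrivals(rate_per_h, t_max_h):
    """Continuous-time Poisson process: event times are generated by
    accumulating exponential inter-arrival times until the horizon."""
    times, t = [], 0.0
    while True:
        t += rng.exponential(1.0 / rate_per_h)
        if t >= t_max_h:
            return np.array(times)
        times.append(t)

def discrete_allocation(n_pulses, n_bins=24):
    """Discrete alternative (in the spirit of the reply's description of
    SOC, not its actual procedure): pulses are assigned to hourly bins
    rather than arriving in continuous time, so sub-hourly timing and
    inter-pulse dry spells are only resolved at the bin scale."""
    return np.sort(rng.integers(0, n_bins, size=n_pulses))
```

The first function yields fractional event times anywhere in the horizon; the second can only place pulses at integer hour indices, which is one way the temporal intermittency structure can differ.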
- l.418: A common criticism made of the kNN approach is actually that the maximum disaggregated values are limited to the largest observed values, which is not a satisfying feature in extrapolation. It is sometimes combined with random draws from a distribution for this reason.
ANS: We agree that a common limitation of the k-NN approach is that the maximum disaggregated values are bounded by the largest values observed in the historical record, which restricts its ability to extrapolate extremes. Indeed, extensions of k-NN have been proposed where random draws from parametric distributions are combined with the analog-based resampling in order to overcome this drawback and give the method more flexibility.
From the perspective of our framework, this limitation is not ignored: by explicitly evaluating extreme rainfall indices and IDF curves, the framework makes it possible to detect whether a method can or cannot reproduce extremes beyond the observed range. If extrapolation is important for a specific application (e.g., in design contexts requiring estimates for large return periods), the framework provides a transparent way to highlight the strengths and weaknesses of k-NN relative to other approaches. In this sense, rather than solving the limitation itself, the framework clarifies its implications for practice and helps inform the selection of the most appropriate method for the problem at hand.
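The bounded-extremes issue and the hybrid fix can be illustrated with a toy resampler. Both functions below are hypothetical illustrations written for this discussion, not the k-NN implementation evaluated in the manuscript.

```python
import numpy as np

rng = np.random.default_rng(1)
# Pool of historical 24 h hourly patterns (illustrative synthetic data)
observed_fragments = rng.gamma(2.0, 5.0, size=(500, 24))

def analog_disaggregate(daily_total):
    """Pure analog resampling: pick an observed fragment and rescale it
    to the daily total. Hourly shapes are strictly limited to rescaled
    versions of observed patterns, which constrains extrapolation."""
    frag = observed_fragments[rng.integers(len(observed_fragments))]
    return daily_total * frag / frag.sum()

def hybrid_disaggregate(daily_total, noise_sd=0.1):
    """Hybrid variant (one common fix mentioned above): perturb the
    analog with multiplicative lognormal noise before rescaling, so
    hourly fractions can fall outside the observed envelope."""
    frag = observed_fragments[rng.integers(len(observed_fragments))]
    frag = frag * rng.lognormal(0.0, noise_sd, size=frag.size)
    return daily_total * frag / frag.sum()
```

In both variants the daily mass is conserved by construction; the difference lies in whether the within-day shape can depart from the observed analogs, which is exactly what IDF-based evaluation of short durations would expose.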
Citation: https://doi.org/10.5194/egusphere-2025-2710-AC2
RC3: 'Comment on egusphere-2025-2710', Anonymous Referee #3, 24 Jul 2025
This manuscript aims to present a framework for the assessment of rainfall disaggregation models (DM) with a focus on the generation of relevant extreme precipitation. For illustration, the framework is applied to 3 different DMs on 5 locations worldwide where hourly observations of rainfall are available.
The aim of the paper is very relevant. Such a framework would be indeed very valuable to assess individual DMs and to compare them. The paper has however a number of significant limitations and the work needs major revisions. I really encourage the authors to submit a modified version.
Added value of this framework when compared to already existing ones.
A number of multiscale evaluations have already been proposed to assess weather generators (WGEN) and/or to compare DMs. What is the added value of this new framework compared, for instance, to Bennett et al. (2012) or Maloku et al. (2023), where multiple temporal scales and precipitation extremes are considered? Or compared to Cappelli et al. (2025), where IDF curves are the aim of the disaggregation?
Evaluation for ungauged stations.
The framework is applied to different stations where rainfall observations are available. The performance of each DM is assessed by comparison to a reference built from observations available at the considered site. What is proposed to assess DMs for disaggregating a daily time series at a site where no sub-daily data are available? This point would be worth clarifying/commenting on. The target of the paper also has to be clarified with respect to this issue. This is obviously an important operational issue (see Kim et al. 2016, Maloku et al. 2025 and references herein). The evaluation framework proposed here obviously does not need to include this issue. However, the importance of also evaluating the potential for regionalization of the model would be worth mentioning in a discussion. One could, for instance, look for a model whose parameters are robust and do not vary much in space (cf. Maloku et al. 2023). The possibility (need?) of applying a kind of regional assessment with hourly stations available (or not) in the neighborhood of the target station would also be worth mentioning.
Seasonal performance of DMs
The ability of DMs to produce relevant simulations is typically found to depend on the season. This point would be worth discussing in the manuscript.
Criteria used to assess the DM ability to reproduce IDF curves.
In the manuscript, DMs are first assessed on their ability to reproduce IDF curves derived in a previous step from observations. A relevant simulation of IDFs is obviously important for DMs. For this assessment, the authors propose two criteria (AEI and PEI in Eqs. 2 and 3), further combined into one single distance criterion (Dt, Eq. 4). Each criterion is normalized such that its value for a perfect DM is one and such that its value is negative if its performance is lower than that of a reference model. This normalization facilitates the assessment. I see two major limitations here:
Criterion AEI: the reference model is a straight line in the IDF graph (see Fig. 1b). This is a poor reference model, especially in the way it is built. It is far from the observed IDF for all durations (even for the day) and all return periods, and it may thus not really be able to highlight bad models. More relevant references could (should) be considered, as suggested here for each return period T:
Reference A: The shape of the reference is not linear but follows a classical analytical formulation used for IDF curves (in the form of a Talbot or a Montana model). To define the coefficients of the model, one could assume that 1) the reference ends at the point which is known in the considered IDF curve, i.e., the observed return level (estimated from the observations) for the daily resolution, and 2) that (for instance) the maximum intensity for a 1-hour duration is 2 times that for the daily duration (the factor 2 being arbitrarily chosen).
Reference B: A simple deterministic DM is used to produce a time series from which the IDF curve is determined. Such a model could be the linear model of Ormsbee (1989), also considered as a reference by Hingray and Ben Haha (2005) in their comparison work.
Reference C : A simple constant disaggregation model is considered (constant intensity assumption for each daily time step).
Reference D : a single climatological sub-daily pattern, determined each month from the observations (e.g. following the Huff et al. methodology considered in the paper), is used for the disaggregation of all days of this month.
In my opinion, one such reference would make sense (I guess more than the proposed one).
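The suggested Reference A can be made concrete with a short sketch. It assumes a two-parameter Montana form i(d) = a·d^(-b) and uses exactly the two anchor conditions proposed above (the observed daily return level, and a 1-hour intensity twice the daily one, the factor being arbitrary); the function names and the example value are illustrative only.

```python
import math

def montana_reference(i24, ratio_1h=2.0):
    """Build a Montana-type reference IDF curve i(d) = a * d**(-b),
    anchored so that i(24 h) equals the observed daily return level
    (in mm/h) and i(1 h) = ratio_1h * i(24 h) (arbitrary factor)."""
    b = math.log(ratio_1h) / math.log(24.0)  # from i(1)/i(24) = 24**b
    a = ratio_1h * i24                       # i(1 h) = a * 1**(-b) = a
    return lambda d: a * d ** (-b)

# Example: a daily return level of 2 mm/h for some return period T.
iref = montana_reference(2.0)
# iref(24.0) recovers the daily anchor; iref(1.0) is twice that value.
```

One such curve would be built per return period T, each anchored at the corresponding observed daily return level.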
Criterion PEI: the reference model is the ensemble of IDF curves obtained from a resampling of observed data. The authors mention that the value 1 is the best possible value and that negative values are not desirable (lines 144-145). This is not what I understand from Equation 3. For me, a value of 1 indicates no uncertainty in the disaggregation process, and I do not see why a DM should give no uncertainty. The disaggregation cannot be deterministic, so different disaggregation runs will (cannot but) lead to different IDF curves, I agree. The confidence interval between the estimates obtained from different runs should then not be zero. It should instead be "relevant", or at least reasonable, when compared to the CI that could be obtained from a simple resampling of observations.
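The resampling reference mentioned here can be sketched as follows. The bootstrap details (number of resamples, confidence level, and the use of the sample mean as a placeholder return-level statistic) are illustrative choices, not the paper's actual settings:

```python
import random
import statistics

def bootstrap_ci(amp_series, n_boot=1000, level=0.90, seed=1):
    """Confidence interval for a return-level estimate from an observed
    annual-maximum-precipitation (AMP) series, by resampling years with
    replacement. The 'estimate' here is simply the mean of the resampled
    AMP values (a placeholder statistic for illustration)."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_boot):
        sample = [rng.choice(amp_series) for _ in amp_series]
        estimates.append(statistics.mean(sample))
    estimates.sort()
    lo = estimates[int((1 - level) / 2 * n_boot)]
    hi = estimates[int((1 + level) / 2 * n_boot)]
    return lo, hi

amp = [12.0, 15.5, 9.8, 20.1, 14.3, 11.2, 17.6, 13.9]  # hypothetical AMP values
lo, hi = bootstrap_ci(amp)
```

The width hi - lo is the kind of observational CI against which the spread of an ensemble of disaggregation runs could be judged.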
Distance criterion. If my reasoning is right, the distance criterion, which is then used throughout the paper to assess the performance of DMs, is not really relevant, because for me the best possible value for PEI is not 1 but 0, and the optimal point is not the point (1,1) mentioned at line 157.
As a consequence, I also fear that some conclusions of the results section are wrong (all those relative to PEIt and to DT). For me, SOC does not lead in four out of five locations (ln 386) in Table 3.
Extreme Rainfall Indices assessment
I have several concerns here also.
Equation 5. This equation has little to do with the KGE equation, as the authors disregard the correlation component. To avoid confusion and misinterpretation, I strongly suggest not referring to the KGE and using another name for this coefficient. Next, this criterion combines an evaluation of the mean and of the dispersion. I would suggest providing both evaluations separately. This may provide more understanding of the differences between DMs.
The data considered to produce the box plots are not clear to me. Can you clarify whether you calculate n different mKGE values for the n runs of the disaggregation process?
Equation 6. The authors want to compare the mKGE obtained for one given DM with the mKGE obtained for a reference. A new reference is introduced here. I would strongly suggest using the same reference as the one introduced to assess the IDF precision efficiency. The different reference models suggested above would be possible here.
Disaggregation Models considered for this work
Many DMs have been presented in the past (see, for example, Koutsoyiannis, 2003, Pui et al. 2012 and the references therein). I have the feeling that (some of) the DMs considered in the present work are rather old approaches (e.g., the SOC DM). Could you give more recent references where those models have been applied and evaluated? This would allow justifying the choice of those approaches and showing that they give reasonable performance in different contexts.
Because of their simplicity and parsimony, analytical microcanonical Multiplicative Random Cascades (MRC) have been widely used in the past for many applications in hydrology (e.g., Paschalis et al., 2014; Maloku et al. 2023). I strongly suggest including one such MRC model (e.g., Maloku et al. 2023, which merges different MRC approaches).
Other comments
Introduction: Can you clarify why it is interesting to consider IDF curves for an evaluation of DMs?
Dataset: Rainfall time series used for the application have very different lengths. (Is there an interest in having much longer time series for 2 stations?) Is there some dependency of DM performance on record length? The length likely impacts the precision of parameter estimation and thus the importance of stochastic variability in the evaluation (e.g., in the "reference" generated by resampling observations in criterion PEI). In my opinion, a more homogeneous dataset would be preferable and would prevent comments on this non-homogeneity. Otherwise, a discussion of this length issue would be welcome.
Lines 44-64: I am not convinced by the proposed classification. What is the difference between the 2nd and 3rd approaches? A summary table would be worthwhile to present all these assessment approaches and identify the criteria, (multi)scales, and precipitation characteristics considered for the evaluation. Note that, for me, an important evaluation of WGENs (and DMs) is (or should be) also impact-oriented, e.g., assessing the ability to reproduce discharge floods after simulation with a hydrological model (a number of works have proposed such evaluations). This may be especially relevant here, since IDFs are not of interest per se but because they are often produced and used for hydrological design purposes.
Ln 98. IDF curve construction. For a given return period, how are return levels estimated from observations? Do you use the Gumbel distribution to fit the data?
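If a Gumbel fit is indeed used, the return-level estimation this question refers to could look like the following sketch (a method-of-moments fit; the paper's actual estimator is unknown, and the AMP values are hypothetical):

```python
import math
import statistics

def gumbel_return_level(amp_series, T):
    """Return level for return period T (years) from an annual-maximum
    series, using a Gumbel distribution fitted by the method of moments."""
    mean = statistics.mean(amp_series)
    std = statistics.stdev(amp_series)
    beta = std * math.sqrt(6) / math.pi  # scale parameter
    mu = mean - 0.5772 * beta            # location (Euler-Mascheroni constant)
    # Gumbel quantile: x_T = mu - beta * ln(-ln(1 - 1/T))
    return mu - beta * math.log(-math.log(1.0 - 1.0 / T))

amp = [31.0, 45.2, 28.4, 52.7, 39.9, 33.5, 48.1, 36.2]  # hypothetical AMP (mm)
x10 = gumbel_return_level(amp, 10.0)  # 10-year return level
```

Repeating this for the AMP series of each duration d gives the points of one IDF curve per return period.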
Ln 104-105. The way the uncertainty bounds are produced from resampling observations is not clear. What are the multiple samples of d-hour AMP? How many samples? For a given T? For all T? How many values in each sample?
Equation 1. Clarify why this equation? Which assumptions/choices lie behind it? I understand that you assume i(d=0) = AMP24/12, which amounts to i(d=0) = 2*(AMP24/24) = 2*i(d=24). Is that right? Why?
The assumption i(d=24) = 0 is not convincing. You know exactly that i(d=24) = AMP24/24.
Ln 121: PEIT criterion: according to what I understand from Eq. 3, I do not understand why this index estimates the efficiency of the "precision". What is the precision? Or do you mean this is an index of "precision efficiency"; if yes, what is it?
Ln 135. I agree with this statement, but not with lines 144-146. Note that a deterministic model (e.g., that of Ormsbee, 1989) would have PEIT = 0, but such a model would obviously be under-dispersive (the dispersion between runs is actually zero).
Extreme rainfall indices
Ln 165. To me, the probability of dry steps is a more common criterion to assess intermittency.
Ln 168. The 5-hour duration is not really critical for a large range of basins. The critical duration depends on the time of concentration of the basin, which varies a lot from one catchment to another. Different durations could thus be considered (as in the IDF curves).
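The multi-duration extension suggested here is cheap to compute: for each candidate duration d, the annual maximum depth is the largest rolling d-hour sum. A sketch, assuming a plain list of hourly depths for one year (the duration set and storm values are illustrative):

```python
def annual_max_depths(hourly, durations=(1, 2, 5, 6, 12, 24)):
    """Annual maximum precipitation depth for each duration (in hours),
    computed as the maximum rolling sum over the hourly series."""
    out = {}
    for d in durations:
        window = sum(hourly[:d])
        best = window
        for i in range(d, len(hourly)):
            window += hourly[i] - hourly[i - d]  # slide the d-hour window
            best = max(best, window)
        out[d] = best
    return out

hourly = [0.0] * 100
hourly[40:45] = [2.0, 8.0, 5.0, 1.0, 0.5]  # one hypothetical storm
amp = annual_max_depths(hourly)
# amp[1] is the wettest hour, amp[5] the wettest 5-hour window, etc.
```

Applying this per year and per duration yields the AMP series needed for both the ERIs and the IDF analysis.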
Ln 168-170. How is P95 defined? Is it based on daily precipitation? Hourly? Does it vary from one year to another?
Ln 170. Which "last 2 indices"? If you refer to indices 3 and 4 of the previous sentence, they do not, in my opinion, assess the duration and magnitude of extreme precipitation pulses. How do you define "pulses" (over which duration(s) are they defined)? The first index looks like a frequency (but I do not understand why it should differ from 5% of the time if P95 is defined from hourly data). I do not understand the meteorological/hydrological interest of the second index. Can you clarify?
Section 3.2. I would suggest summarizing the description of this well-known method more concisely. The different equations are not necessarily needed, for instance.
Ln 252-253. The k-NN model considered here does not depend on the local rainfall pattern around the daily amount to disaggregate. This may be too strong a simplification. A number of works have shown that the sub-daily structure of precipitation is highly dependent on this local rainfall pattern (cf. Ormsbee, 1989; Güntner et al., 2001; Maloku et al., 2023). Testing the initial Sharma et al. 2006 method would perhaps be worthwhile.
Ln 270-280 and the equations that follow: I do not understand the purpose of this text and of those equations. Can you clarify?
Ln 310. If I understand well, the disaggregated rainfall of one day can cross from that day into the next one. Is it possible to have two rainfall amounts at the same time, from day i and day i+1? If yes, do you sum them?
Ln 330-338. Can you clarify? I do not understand what you do here and what it is for. I understand that you produce 300 disaggregated time series. I had understood from Fig. 1 that each time series is used in turn to produce one set of IDF curves. Here, I now understand that you merge the data from the 300 time series to do your frequency analysis.
Ln 338. What are "the main disaggregation runs"? How are they produced/defined?
Ln 339. Do you proceed differently for the third DM? If yes, I do not understand why.
Ln 340. I do not understand: what are the calibration results? Do you make any distinction between simulation results obtained for calibration and simulation results obtained from applying the disaggregation process to the daily data? Can you clarify?
Ln 345. In this observation configuration, how are the 1000 AMP series generated?
Figure 3. Can you complete the caption? What do the ellipses correspond to?
Ln 403 and Table 5: what does the 50th percentile mKGE of each ERI time series mean? This suggests that you compute multiple mKGE values for each generated time series, which is confusing. Or do you compute one mKGE for each generated time series and then produce the boxplots of those n values (300 values)?
References
Bennett, B., Thyer, M., Leonard, M., Lambert, M., and Bates, B.: A comprehensive and systematic evaluation framework for a parsimonious daily rainfall field model, J. Hydrol., 556, 1123–1138, https://doi.org/10.1016/j.jhydrol.2016.12.043, 2018.
Cappelli, F., Volpi, E., (...), and Grimaldi, S.: Sub-daily rainfall simulation using multifractal canonical disaggregation: a parsimonious calibration strategy based on intensity-duration-frequency curves, Stoch. Environ. Res. Risk Assess. (SERRA), 2025.
Kim, H.-H. Kwon, S.-O. Lee, and S. Kim. Regionalization of the Modified Bartlett–Lewis rectangular pulse stochastic rainfall model across the Korean Peninsula. J. Hydro-Environ. Res., 11:123–137, 2016. ISSN 1570-6443. doi: 10.1016/j.jher.2014.10.004.
Güntner, A., Olsson, J., Calver, A., Gannon, B., 2001. Cascade-based disaggregation of continuous rainfall time series: the influence of climate. Hydrol. Earth Syst. Sci. 5 (2), 145–164.
Koutsoyiannis, D.: Rainfall disaggregation methods: Theory and applications, in: Proceedings, Workshop on Statistical and Mathematical Methods for Hydrological Analysis; Università di Roma “La Sapienza”, May 2003, Rome, Italy, 1–23, https://doi.org/10.13140/RG.2.1.2840.8564, 2003.
Maloku, K., Evin, G., Hingray, B., 2025. Generation of sub-daily precipitation time series anywhere in Switzerland by mapping the parameters of GWEX-MRC, an at-site weather generator. Journal of Hydrol. Regional Studies. https://doi.org/10.1016/j.ejrh.2025.102454
Maloku, K., Hingray, B., Evin, G. 2023. Accounting for precipitation temporal asymmetry in a Multiplicative Random Cascades Disaggregation Model. Hydrol. Earth Syst. Sci. https://doi.org/10.5194/hess-27-3643-2023
Ormsbee, L.E., 1989. Rainfall disaggregation model for continuous hydrologic modelling. J. Hydraul. Eng., ASCE 115 (4), 507– 525.
Paschalis, A., Molnar, P., Fatichi, S., and Burlando, P.: On temporal stochastic modeling of precipitation, nesting models across scales, Adv. Water Resour., 63, 152–166, https://doi.org/10.1016/j.advwatres.2013.11.006, 2014.
Pui, A., Sharma, A., Mehrotra, R., Sivakumar, B., and Jeremiah, E.: A comparison of alternatives for daily to sub-daily rainfall disaggregation, J. Hydrol., 470-471, 138–157, https://doi.org/10.1016/j.jhydrol.2012.08.041, 2012.
Citation: https://doi.org/10.5194/egusphere-2025-2710-RC3
AC3: 'Reply on RC3', Claudio Sandoval, 05 Sep 2025
This manuscript aims to present a framework for the assessment of rainfall disaggregation models (DM) with a focus on the generation of relevant extreme precipitation. For illustration, the framework is applied to 3 different DMs at 5 locations worldwide where hourly observations of rainfall are available.
The aim of the paper is very relevant. Such a framework would be indeed very valuable to assess individual DMs and to compare them. The paper has however a number of significant limitations and the work needs major revisions. I really encourage the authors to submit a modified version.
ANS: We thank you for acknowledging the relevance of the paper and for the constructive comments provided. We recognize that several aspects of the manuscript required clarification and improvement. In the revised version, we have carefully addressed each of your points, making substantial changes to improve the clarity, focus, and presentation of the framework.
Added value of this framework when compared to already existing ones
- A number of multiscale evaluations have already been proposed to assess weather generators (WGEN) and/or to compare DMs. What is the added value of this new framework, for instance compared to Bennett et al. 2018 or Maloku et al. 2023, where multiple temporal scales and precipitation extremes are considered? Or compared to Cappelli et al. 2025, where IDF curves are the aim of the disaggregation?
ANS: As you note, several previous studies have already proposed multiscale evaluations and frameworks for rainfall models. Bennett et al. (2018), for example, developed a comprehensive assessment at daily, monthly, and annual scales, focusing on general rainfall characteristics such as occurrences, amounts, and dry/wet spell distributions, and also assessing daily annual maximum precipitation. Maloku et al. (2023) evaluated multiplicative random cascade models across multiple temporal scales, incorporating both general statistics and extremes, including precipitation amounts associated with 5- and 20-year return levels. More recently, Cappelli et al. (2025) directed their evaluation specifically toward the reproduction of IDF curves by comparing observed and simulated annual maximum precipitation.
Although we recognize that extreme rainfall is considered in these works in a very detailed and comprehensive way, our framework is explicitly designed to systematize these diverse aspects into two complementary components: (i) the representation of annual maximum precipitation across durations, expressed as IDF curves, and (ii) the representation of extreme rainfall indices (ERIs), which are conceptually related to several of the metrics adopted in previous studies. For example, the rainfall occurrence used in Bennett et al. can be related to our R95%, which measures the number of hours exceeding the 95th percentile; their rainfall amounts correspond to our P>R95%, i.e., the precipitation volume in those extreme hours; and their distributions of dry spells can be compared to our TDD, which aggregates the total annual duration of dry periods from hourly data. Similarly, Maloku et al. evaluated the proportion of wet steps and the mean length of wet spells, which are conceptually analogous to our R95% and TDD, respectively (although we emphasize dry periods instead of wet). They also analyzed rainfall amounts at 40-min resolution for specific return periods, which are analogous to our generalized representation of extremes through IDF curves across multiple durations and return periods.
What differentiates our contribution is that these characteristics are synthesized into unified indicators: AEI, PEI, and DT for IDF curves, and ERIs evaluated through mKGE applied directly to time series. This synthesis reduces complex information into single-valued measures that can be jointly represented in a common graphical space (e.g., AEI/PEI–DT plots for IDF curves, or boxplots of mKGE across ERIs), thereby enabling a straightforward and easily appreciable comparison among disaggregation methods. Importantly, our analysis is carried out at the hourly scale, i.e., a finer temporal resolution than Bennett et al. (2018), which is especially relevant for hydrological applications in fast-responding catchments. While our framework does not yet address spatial disaggregation or exhaustively cover all possible methodologies and metrics, its unifying and systematizing character provides a clear added value for evaluating rainfall disaggregation methods.
Evaluation for ungauged stations
- The framework is applied to different stations where rainfall observations are available. The performance of each DM is assessed by comparison to a reference built from observations available at the considered site. What is proposed to assess DMs for disaggregating a daily time series at a site where no sub-daily data are available? This point is worth clarifying and commenting on, and the target of the paper should be clarified with respect to this issue as well. This is obviously an important operational issue (see Kim et al. 2016, Maloku et al. 2025 and references therein). The evaluation framework proposed here obviously does not need to include this issue. However, the importance of also evaluating the potential for regionalization of the model would be worth mentioning in a discussion. One could, for instance, look for a model with parameters that are robust and do not vary much in space (cf. Maloku et al. 2023). The possibility (need?) of applying a kind of regional assessment with hourly stations available (or not) in the neighborhood of the target station would also be worth mentioning.
ANS: We agree that the framework, as presented here, requires sub-daily observations at the site of interest to evaluate the performance of disaggregation methods, since these observations are used to construct the observed IDF curve against which both the reference curve and the disaggregated IDF curves are compared. This limits its direct applicability in ungauged stations where only daily data is available. While addressing this challenge is beyond the scope of the present work, we acknowledge that it is a fundamental problem for operational use.
In the revised manuscript we will add a discussion of possible avenues for extending the framework in such cases, including regionalization strategies, transfer of parameters from nearby hourly stations, and approaches that aim to develop spatially robust models. An additional possibility would be to adopt a more empirical approach, directly using the available data without parameter transfer, which could still provide valuable insights, for example, in locations where daily data is available but subdaily data is missing. We believe that such alternatives may also be helpful for extending the framework in future work towards not only temporal, but also spatio-temporal disaggregation, by incorporating strategies for spatial correlation. Our aim will be to clarify that the current study is focused on evaluating disaggregation methods where sub-daily observations are available, but that future applications of the framework should also consider regional and empirical approaches to extend its usefulness to ungauged sites.
Seasonal performance of DMs
- The ability of DMs to produce relevant simulations is typically found to depend on the season. This point would be worth discussing in the manuscript.
ANS: We agree that the performance of disaggregation methods is often season-dependent, as rainfall regimes and storm characteristics vary throughout the year. Our framework does not include an explicit seasonal component by default, since the primary aim of this work was to demonstrate its general structure and applicability at the annual scale. However, the framework can readily be applied in a seasonal manner by calculating IDF-based metrics and ERIs separately for each season or month, thereby allowing direct evaluation of seasonal performance. In the revised manuscript we will clarify this point more explicitly, emphasizing that while seasonality was indirectly addressed through the way the illustrative methods were calibrated, the framework itself can be straightforwardly adapted to explicitly assess seasonal differences in future applications.
Criteria used to assess the DM ability to reproduce IDF curves
- In the manuscript, DMs are first assessed on their ability to reproduce IDF curves derived in a previous step from observations. A relevant simulation of IDFs is obviously important for DMs. For this assessment, the authors propose two criteria (AEI and PEI in Eqs. 2 and 3), further combined into one single distance criterion (Dt, Eq. 4). Each criterion is normalized such that its value for a perfect DM is one and such that its value is negative if its performance is lower than that of a reference model. This normalization facilitates the assessment. I see two major limitations here:
Criterion AEI: the reference model is a straight line in the IDF graph (see Fig. 1b). This is a poor reference model, especially in the way it is built. It is far from the observed IDF for all durations (even for the day) and all return periods, and it may thus not really be able to highlight bad models. More relevant references could (should) be considered, as suggested here for each return period T:
Reference A: The shape of the reference is not linear but follows a classical analytical formulation used for IDF curves (in the form of a Talbot or a Montana model). To define the coefficients of the model, one could assume that 1) the reference ends at the point which is known in the considered IDF curve, i.e., the observed return level (estimated from the observations) for the daily resolution, and 2) that (for instance) the maximum intensity for a 1-hour duration is 2 times that for the daily duration (the factor 2 being arbitrarily chosen).
Reference B: A simple deterministic DM is used to produce a time series from which the IDF curve is determined. Such a model could be the linear model of Ormsbee (1989), also considered as a reference by Hingray and Ben Haha (2005) in their comparison work.
Reference C: A simple constant disaggregation model is considered (constant intensity assumption for each daily time step).
Reference D: a single climatological sub-daily pattern, determined each month from the observations (e.g. following the Huff et al. methodology considered in the paper), is used for the disaggregation of all days of this month.
In my opinion, one such reference would make sense (I guess more than the proposed one).
ANS: Thank you for the comment. We agree that the reference IDF curve used in the AEI computation is deliberately simple and does not necessarily represent a realistic model of rainfall extremes. Our intention was not to propose this curve as a hydrologically meaningful model, but rather as a very basic benchmark (i.e., a “minimal solution” that reflects what a decision-maker with only daily data could construct in practice). The purpose was to provide a simple lower bound against which the added value of disaggregation can be assessed.
We also recognize the concern about arbitrariness. In our view, a benchmark should meet a few general criteria: (i) it should not be a formal disaggregation method itself, but rather a simplified baseline; (ii) it should reproduce the expected decreasing behavior of IDF relationships; and (iii) it should avoid arbitrary parameter choices or the need for calibration. At an earlier stage we explored alternatives, such as exponential shapes, but this required parameter calibration and was therefore less consistent with the above criteria. In the absence of a better option, we proposed the linear form as the simplest possible benchmark.
We do appreciate the reviewer’s point, and it is a concern we also discussed in detail. If a more suitable benchmark can be suggested—one that avoids arbitrary parameters while still remaining simple and not a proper disaggregation method in itself—we would be glad to adopt and implement it. In any case, we recognize that these general criteria for constructing the benchmark were not clearly conveyed in the manuscript, and we will make sure to present them explicitly in the revised version.
- Criterion PEI: the reference model is the ensemble of IDF curves obtained from a resampling of observed data. The authors mention that the value 1 is the best possible value and that negative values are not desirable (lines 144-145). This is not what I understand from Equation 3. For me, a value of 1 indicates no uncertainty in the disaggregation process, and I do not see why a DM should give no uncertainty. The disaggregation cannot be deterministic, so different disaggregation runs will (cannot but) lead to different IDF curves, I agree. The confidence interval between the estimates obtained from different runs should then not be zero. It should instead be "relevant", or at least reasonable, when compared to the CI that could be obtained from a simple resampling of observations.
ANS: We thank you for raising this point, which is indeed central to the interpretation of the PEI. In our initial formulation, we defined the optimal value as 1, reflecting the idea of “maximum reproducibility” with no dispersion across realizations. However, we agree with you that such a deterministic outcome is neither realistic nor desirable, since stochastic disaggregation methods should naturally reproduce a certain level of variability. In light of this, we acknowledge that the conceptual optimum should indeed be PEI = 0, corresponding to a dispersion consistent with the observational benchmark.
We also recognize, however, that the question of whether a disaggregation method that produces slightly lower variability than the observed benchmark should always be considered problematic is less straightforward. One might argue that under-dispersion could still be acceptable in certain contexts, depending on the application, whereas over-dispersion systematically inflates extremes and is clearly undesirable. We appreciate that your suggestion helps to clarify this conceptual distinction, and we will adopt the formulation with PEI = 0 as the reference point in the revised manuscript, while also making explicit that the interpretation of under-dispersion versus over-dispersion deserves further consideration.
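The intended comparison, an ensemble spread judged against the spread obtained from resampling observations, can be sketched as follows. Since Eq. (3) is not reproduced in this discussion, the interquantile-width ratio below is only an illustrative stand-in for PEI, not the paper's formula:

```python
def dispersion_score(sim_levels, ref_levels, q=0.05):
    """Compare the spread of return-level estimates across disaggregation
    runs (sim_levels) with the spread obtained by resampling observations
    (ref_levels). Returns 1 - width_sim / width_ref: 0 when the two
    spreads match, positive for an under-dispersive ensemble, negative
    for an over-dispersive one (illustrative stand-in for PEI only)."""
    def width(vals):
        s = sorted(vals)
        lo = s[int(q * len(s))]
        hi = s[int((1 - q) * len(s)) - 1]
        return hi - lo
    return 1.0 - width(sim_levels) / width(ref_levels)

# Identical (deterministic) runs give zero width, hence a score of 1,
# i.e., the maximally under-dispersive, Ormsbee-type case.
```

Under this stand-in, a score of 0 (matched spreads) is the conceptual optimum, consistent with the revised interpretation adopted in the answer.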
- Distance criterion: If my reasoning is right, the distance criterion, which is then used throughout the paper to assess the performance of DMs, is not really relevant, because for me the best possible value for PEI is not 1 but 0, and the optimal point is not the point (1,1) mentioned at line 157. As a consequence, I also fear that some conclusions of the results section are wrong (all those relative to PEIt and to DT). For me, SOC does not lead in four out of five locations (ln 386) in Table 3.
ANS: We appreciate your comment; it has led us to revise our assumptions, and we think the optimal PEI should be 0 instead of 1. Consequently, the definition of the distance criterion and the interpretation of the AEI–PEI space will be revised accordingly. We acknowledge that this affects some of the conclusions in Sect. 5, including statements regarding the relative performance of SOC in Table 3, and we will carefully revisit and reformulate these parts of the manuscript to ensure that the conclusions are consistent with the revised interpretation of PEI and the distance criteria.
Extreme Rainfall Indices assessment
I have several concerns here also.
- Equation 5. This equation has little to do with the KGE equation, as the authors disregard the correlation component. To avoid confusion and misinterpretation, I strongly suggest not referring to the KGE and using another name for this coefficient. Next, this criterion combines an evaluation of the mean and of the dispersion. I would suggest providing both evaluations separately. This may provide more understanding of the differences between DMs.
ANS: We agree that referring to Equation 5 as a (modified) KGE can be confusing since the correlation component is not included. Although we used the prefix “mKGE” to signal that this was not the original KGE, we acknowledge that the label may not be fully appropriate in this context. In the revised manuscript we will rename the coefficient as “Moments Bias Coefficient” and clarify its definition.
- The data considered to produce the box plots are not clear to me. Can you clarify whether you calculate n different mKGE values for the n runs of the disaggregation process?
ANS: Exactly. For each disaggregated series (i.e., each run) and for each ERI, we calculate the annual time series of that metric and then obtain a single mKGE value by comparing the disaggregated series against the observed one. This is repeated for all n runs, which yields the distribution of n mKGE values from which the boxplots (one per ERI) are constructed. We will explain this more clearly in the caption of Figure 1.
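The per-run procedure described in this answer can be sketched as below. The two-component score (mean-bias and dispersion-bias terms, with the correlation term of the KGE deliberately dropped) is an assumption consistent with how Eq. (5) is described in the discussion, not a verbatim copy of it; all data values are hypothetical:

```python
import statistics

def moments_bias_coeff(sim, obs):
    """Two-component score: 1 - sqrt((beta-1)^2 + (alpha-1)^2), where
    beta is the ratio of means and alpha the ratio of standard deviations
    (no correlation term, unlike the original KGE). Equals 1 for a
    perfect match of both moments."""
    beta = statistics.mean(sim) / statistics.mean(obs)
    alpha = statistics.stdev(sim) / statistics.stdev(obs)
    return 1.0 - ((beta - 1.0) ** 2 + (alpha - 1.0) ** 2) ** 0.5

def scores_per_run(runs, obs):
    """One score per disaggregation run: 'runs' is a list of n annual ERI
    series (one per run), 'obs' the observed annual ERI series. The n
    returned values are what each boxplot summarizes."""
    return [moments_bias_coeff(run, obs) for run in runs]

obs = [10.0, 12.5, 9.0, 14.0, 11.5]      # observed annual ERI series
runs = [[9.5, 13.0, 8.5, 14.5, 11.0],    # two hypothetical runs
        [11.0, 12.0, 10.0, 13.5, 12.5]]
box_values = scores_per_run(runs, obs)
```

With n = 300 runs, box_values would hold the 300 scores behind one boxplot, repeated for each ERI.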
- Equation 6. The authors want to compare the mKGE obtained for one given DM with the mKGE obtained for a reference. A new reference is introduced here. I would strongly suggest using the same reference as the one introduced to assess the IDF precision efficiency. The different reference models suggested above would be possible here.
ANS: We agree that in principle it would be desirable to have benchmarks that are more comparable or logically equivalent across the different components of the framework. However, IDF curves and ERIs capture fundamentally different aspects of rainfall behavior, which is why we used different references.
Disaggregation Models considered for this work
- Many DMs have been presented in the past (see, for example, Koutsoyiannis, 2003, Pui et al. 2012 and the references therein). I have the feeling that (some of) the DMs considered in the present work are rather old approaches (e.g., the SOC DM). Could you give more recent references where those models have been applied and evaluated? This would allow justifying the choice of those approaches and showing that they give reasonable performance in different contexts.
Because of their simplicity and parsimony, analytical microcanonical Multiplicative Random Cascades (MRC) have been widely used in the past for many applications in hydrology (e.g. Paschalis et al., 2014; Maloku et al. 2023). I strongly suggest to include one such MRC model (e.g. Maloku et al. 2023 which merges different MRC approaches).
ANS: We recognize that the point is valid: some of the methods applied in this study were originally proposed several decades ago. For your information, as previously clarified in response to another reviewer, in the case of Huff this was partly due to misleading phrasing, since what we applied was a DM based on Huff curves rather than the original 1967 procedure directly. We will further clarify this in the manuscript. Regarding the illustration of the framework with more recent methods, we particularly welcome the suggestion of incorporating a multiplicative random cascade model, which we consider an ideal candidate to add as a disaggregation method in the revised analysis.
Other comments
- Introduction: Can you explain why it is interesting to consider IDF curves for an evaluation of DMs?
ANS: The motivation for including IDF curves is that they are a central tool in hydrology and engineering practice, widely used for hydraulic design, flood risk assessment, and urban drainage planning. Evaluating whether disaggregation methods can reproduce observed IDF relationships therefore provides a direct and operationally relevant measure of their performance. In the revised manuscript we will make this rationale more explicit in the Introduction.
- Dataset: Rainfall time series used for the application have very different lengths. (Is there an interest in having much longer time series for 2 stations?) Is there some dependency of the DM performance on the length? The length likely impacts the precision of parameter estimation and thus the importance of stochastic variability in the evaluation (e.g. in the “reference” generated by resampling from observations in criterion PEI). In my opinion, a more homogeneous dataset would be worthwhile and would prevent comments on this non-homogeneity. A discussion on this length issue would otherwise be welcome.
ANS: The stations used in this study were drawn from different regions of the world, primarily from the INTENSE project dataset, which provides high-quality sub-daily rainfall series with rigorous quality control. In addition, the Quinta Normal station in Chile was included to bring the framework into the Chilean context as well. The resulting heterogeneity in record lengths therefore reflects the diversity of the available data sources rather than a deliberate design choice. We agree with the reviewer that differences in record length can indeed have implications: they may affect the degree of stationarity assumed when estimating parameters over the entire record, and they reduce consistency between stations, making cross-comparisons less straightforward. In the revised manuscript we will explicitly acknowledge this limitation and discuss how record length may influence the evaluation of disaggregation methods.
- Lines 44-64: I am not convinced by the classification proposed. What is the difference between the 2nd and third approach? A summary table would be worthwhile to present all these assessment approaches and identify the criterion, (multi)scales, and precipitation characteristics considered for the evaluation. Note that for me, an important evaluation of WGENs (and DMs) is (should be) also impact oriented – cf. assessing the ability to reproduce discharge floods after simulation with some hydrological model (a number of works have proposed such evaluations). This may be especially relevant here, for instance, as IDFs are not of interest per se but because they are often produced and used for hydrological design purposes.
ANS: The distinction we intended is that the second type of study refers to systematic comparisons of existing disaggregation methods within a given evaluation scheme, but usually limited to the scope of that specific study. The third type, in contrast, seeks to go beyond such one-off comparisons by proposing an integrative framework that can be applied more generally. In other words, the third type corresponds to a systematization of the evaluation process itself, which is precisely the aim of the framework we propose here. In the revised manuscript we will clarify this difference more explicitly. We also agree that a summary table would be valuable. We will add a table that synthesizes the different evaluation approaches, identifying for each the criterion used, the temporal scale(s), and the precipitation characteristics that are assessed.
Regarding the impact-oriented evaluations, we acknowledge that these are highly relevant, since disaggregated rainfall is often used in practice as input to hydrological models for flood estimation and design purposes. However, such analyses are outside the scope of this paper, which is focused on introducing the framework. Nevertheless, in the revised manuscript we will emphasize in the discussion that the framework is flexible and could be readily extended to impact-oriented applications in future studies.
- Ln 98. IDF curve construction. For a given return period, how are return levels estimated from observations. Do you use the gumbel distribution to fit data?
ANS: The construction of the observed IDF curve in this study follows an empirical approach. For each duration, we extract the annual maximum precipitation (AMP) for every year in the record, and then perform a frequency analysis using the Weibull plotting position based on ranking. This allows us to obtain quantiles associated with all the return periods considered for each duration. By connecting the ordered pairs across durations for a given return period, we then derive the corresponding observed IDF curve, resulting in one curve per return period. We note that the estimation of the uncertainty around these curves does require a statistical fit. For this purpose, we apply the Gumbel distribution to characterize the variability of the return levels, as explained later in these responses. This will be explained in more detail in Section 2.2.1.
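For illustration, the empirical frequency analysis described above can be sketched as follows. This is a minimal sketch with our own function and variable names, not the code used in the study; we assume simple linear interpolation between empirical return periods:

```python
import numpy as np

def empirical_return_levels(amp, return_periods):
    """Empirical frequency analysis of annual-maximum precipitation (AMP)
    for one duration, using the Weibull plotting position P = m / (n + 1)."""
    amp_sorted = np.sort(np.asarray(amp, dtype=float))[::-1]  # descending
    n = len(amp_sorted)
    ranks = np.arange(1, n + 1)
    exceedance = ranks / (n + 1.0)   # Weibull plotting position
    emp_T = 1.0 / exceedance         # empirical return period of each ranked AMP
    # interpolate the AMP quantiles at the requested return periods
    return np.interp(return_periods, emp_T[::-1], amp_sorted[::-1])

# one IDF point per (duration, return period); repeat per duration d
amp_24h = [42.0, 55.3, 38.1, 61.2, 47.8, 50.0, 44.5, 58.9, 40.2, 53.1]
levels = empirical_return_levels(amp_24h, return_periods=[2, 5, 10])
```

Connecting the resulting (duration, level) pairs for a fixed return period then yields one observed IDF curve per return period.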
- Ln104-105. The way the uncertainty bounds are produced from resampling observations is not clear. What are the multiple samples of d-hour AMP? How many samples? For a given T? For all T? How many values in each sample?
ANS: We generate 1000 samples for each duration (1, 2, …, 24 h), as detailed in Sect. 4.2. For each duration, the samples are drawn from the Gumbel distribution fitted to the observed annual maximum precipitation values of that duration. Each sample contains as many values as the number of years in the historical record. Then, for each return period, the corresponding quantile is computed in each of the 1000 samples. This procedure yields the distribution of intensities for every duration and return period, from which the uncertainty bounds around the observed IDF curves are derived. Here we will add a reference to Section 4.2 (in parentheses) for greater clarity.
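The resampling procedure above can be sketched as follows. This is a hedged illustration under our own naming and a method-of-moments Gumbel fit (a simple stand-in for whichever estimator the study actually uses); the in-sample quantile is taken empirically:

```python
import numpy as np

EULER_GAMMA = 0.5772156649

def fit_gumbel_moments(x):
    """Method-of-moments Gumbel fit (illustrative; the manuscript's
    fitting method may differ)."""
    x = np.asarray(x, dtype=float)
    scale = np.sqrt(6.0) * x.std(ddof=1) / np.pi
    loc = x.mean() - EULER_GAMMA * scale
    return loc, scale

def gumbel_uncertainty_bounds(amp, T, n_samples=1000, alpha=0.05, seed=1):
    """Draw n_samples synthetic AMP records (each as long as the observed
    record) from the fitted Gumbel, compute the T-year quantile in each,
    and return the (alpha/2, 1 - alpha/2) bounds of those quantiles."""
    rng = np.random.default_rng(seed)
    loc, scale = fit_gumbel_moments(amp)
    n_years = len(amp)
    q = 1.0 - 1.0 / T  # non-exceedance probability for return period T
    u = rng.uniform(size=(n_samples, n_years))
    samples = loc - scale * np.log(-np.log(u))  # inverse-CDF Gumbel sampling
    levels = np.quantile(samples, q, axis=1)    # T-year quantile per sample
    return np.quantile(levels, [alpha / 2, 1.0 - alpha / 2])

amp_obs = [30.1, 42.5, 28.0, 55.2, 33.3, 47.1, 39.8, 36.0, 44.9, 31.7,
           50.3, 29.5, 41.0, 38.2, 35.6]
lo, hi = gumbel_uncertainty_bounds(amp_obs, T=10)
```

Repeating this for every duration (1, 2, …, 24 h) gives the uncertainty bounds around the observed IDF curves.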
- Equation 1. Clarify why this equation? which assumptions / choices behind? I understand that you assume i(d=0)=AMP24/12 which turns to i(d=0)= 2*(AMP24/24) = 2 * i(d=24). Is it right? why?
ANS: The temporal distribution assumed for the daily AMP was designed to represent a simple decreasing pattern, where the highest intensities occur at the beginning of the day and gradually decrease towards the end. This criterion is admittedly arbitrary, but it was adopted in order to provide a unified, replicable, and straightforward way of constructing the reference IDF. Under this scheme, when extracting the most intense hours for the frequency analysis, they are always consecutive starting from the first hour, which naturally leads to the linear decreasing structure of the reference curve. We acknowledge that other plausible assumptions could have been made, such as the alternate block method (with the maximum pulse centered and decreasing alternately on both sides) or even random alternations between both temporal distributions, which would result in different reference IDFs.
Regarding the specific point raised, it is correct that the formulation leads to i(d=0) = AMP24/12, while i(d=24) = 0, i.e., rainfall ends at the end of the day. This is purely a consequence of the mathematical construction, since the hourly intensities are evaluated at the centroid of each pulse (0.5, 1.5, …, 23.5 h), which ensures mass balance.
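The construction can be checked numerically. Assuming (from the properties stated above: i(0) = AMP24/12, i(24) = 0, linear decrease, evaluation at pulse centroids) an intensity function i(t) = (AMP24/12)·(1 − t/24), which may differ in notation from Equation 1 of the manuscript:

```python
import numpy as np

def reference_hourly_pattern(amp24):
    """Linearly decreasing reference distribution of a daily maximum:
    i(t) = (AMP24/12) * (1 - t/24), evaluated at the centroid of each
    hourly pulse (0.5, 1.5, ..., 23.5 h). This reproduces i(0) = AMP24/12
    and i(24) = 0 while conserving the daily total (mass balance)."""
    centroids = np.arange(24) + 0.5
    return (amp24 / 12.0) * (1.0 - centroids / 24.0)

hours = reference_hourly_pattern(48.0)
# the 24 hourly depths decrease monotonically and sum back to 48 mm
```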
- The assumption i(d=24) = 0 is not convincing. You know exactly that i(d=24) = AMP24/24
ANS: We would like to clarify that the assumption i(d=24) = 0 is correct under the way the temporal distribution was constructed in our study. We believe that the reviewer’s interpretation may stem from considering a uniform distribution of rainfall, in which case each hour would indeed receive AMP24/24, including the last one. We will clarify this point more explicitly in the revised manuscript to avoid confusion.
- ln121: PEIT criterion: according to what I understand from eq 3, I do not understand why this index estimates the efficiency of the “precision”. What is the precision? or do you mean this is an index of “precision efficiency”, if yes, what is it?
ANS: In our framework, “precision” is understood as the consistency with which the disaggregated IDF estimates approach the observed (real) values across multiple realizations. A method is precise if repeated realizations converge closely around the observed benchmark, and less precise if they diverge widely. The PEI criterion was therefore designed as a “precision efficiency” index, in the sense that it evaluates how efficiently a method reproduces the level of dispersion that is consistent with the observational benchmark derived from resampling. In the revised manuscript we will make this definition explicit and clarify that PEI refers specifically to consistency around the observed values, rather than to accuracy, which is part of the AEI logic.
- I agree with this statement, but not with lines 144-146. Note that a deterministic model (e.g. that of Ormsbee, 1989) would have PEIT = 0. But such a model would be obviously under-dispersive. (dispersion is actually zero between runs)
ANS: We agree that a deterministic model, such as that of Ormsbee (1989), would always yield PEI = 0, since it produces no dispersion between runs. In this sense such a model is indeed under-dispersive and does not reflect the natural variability of precipitation. Our intention in lines 144–146 was not to suggest that this outcome is ideal, but rather to illustrate how the index behaves in limiting cases. In light of this comment and the related discussion raised by Reviewer 2, we will revise the logic of the PEI so that the optimal value is defined as a dispersion equal to that of the observational benchmark, neither greater nor smaller. This definition ensures that the framework evaluates whether a disaggregation method reproduces the natural level of uncertainty, avoiding both over- and under-dispersion. We will clarify this conceptual point in the revised manuscript to prevent misinterpretation and to emphasize that deterministic models, while assessable within the framework, are not desirable when the objective is to represent rainfall extremes consistently with their natural variability.
Extreme rainfall indices
- Ln 165. To me, the probability of dry steps is a more common criterion to assess intermittency
ANS: We agree that the probability of dry steps is a more common way of assessing intermittency. In essence, this is equivalent to our formulation, since it corresponds to dividing the total dry duration (TDD) by the total number of hours in a given year. In the revised manuscript we will make this equivalence explicit and adopt the more standard terminology to avoid ambiguity.
- Ln 168. The 5-hour duration is not really critical for a large range of basins. The critical duration depends on the time of concentration of the basin, and this time of concentration varies a lot from one catchment to another. Different durations could thus be considered (as in the IDF curves).
ANS: We agree that the critical duration should depend on the time of concentration of the basin, which varies widely from one catchment to another. In this study, the 5-hour duration was adopted as a compromise between the very fast response of small basins (around 1 hour or less) and that of larger basins with concentration times on the order of a full day or more. In the revised manuscript we will clarify this rationale.
- Ln 168-170. How is defined P95? is it based on daily precipitation? hourly? Does it vary from one year to the other?
ANS: P95 is defined based on the observed hourly precipitation, considering only wet hours (p ≥ 0.1 mm). The P95 value is calculated once over the full observation record. We will improve the explanation of P95 in the manuscript.
- Ln 170. Which “last 2 indices”? If you refer to indices 3 and 4 of the previous sentence, in my opinion they do not assess the duration and magnitudes of extreme precipitation pulses. How do you define “pulses” (on which duration(s) are they defined)? The first index looks like a frequency (but I do not understand why this index should be different from 5% of the time if P95 is defined from hourly data). I do not understand well the meteorological / hydrological interest of the second index. Can you clarify?
ANS: The two indices referred to are indeed R95% and P>R95% (indices 3 and 4 in the text). Our intention was not to describe “pulses” per se, but rather to distinguish between two complementary aspects:
- R95% (duration): the number of hours in which precipitation exceeds the P95 threshold, i.e., the temporal extent of extreme precipitation.
- P>R95% (magnitude): the total precipitation amount accumulated during those hours, i.e., the contribution of extremes to annual rainfall.
We also note that R95% is not a trivial 5% of the time, as might be inferred. The P95 threshold is defined once for the whole record, and its exceedances can vary widely from year to year depending on how extremes are distributed. This index therefore captures the interannual variability in the occurrence of extreme precipitation.
If by “second index” the reviewer refers to P>R95%, we note that this is important because it characterizes the annual evolution of precipitation extremes above the P95 threshold. From an aggregated perspective, it quantifies how much rainfall is actually concentrated in events considered “extreme,” which is a key metric in both climatological and hydrological applications. In the revised manuscript we will clarify the terminology to avoid confusion and make explicit the definitions and relevance of these indices.
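Putting the definitions above together, the two indices can be sketched as follows (an illustrative implementation with our own names, assuming the wet-hour threshold of 0.1 mm and a single P95 computed over the full record, as stated in these responses):

```python
import numpy as np

WET = 0.1  # wet-hour threshold (mm)

def extreme_rainfall_indices(hourly, years):
    """R95% (hours above P95 per year) and P>R95% (precipitation
    accumulated in those hours per year). P95 is computed once, over
    the wet hours (p >= 0.1 mm) of the whole record."""
    hourly = np.asarray(hourly, dtype=float)
    years = np.asarray(years)
    p95 = np.percentile(hourly[hourly >= WET], 95)
    r95, p_gt_r95 = {}, {}
    for y in np.unique(years):
        mask = (years == y) & (hourly > p95)
        r95[y] = int(mask.sum())                 # duration of extremes (h)
        p_gt_r95[y] = float(hourly[mask].sum())  # magnitude of extremes (mm)
    return p95, r95, p_gt_r95
```

Because P95 is fixed for the whole record while the exceedances are counted year by year, R95% is generally not a constant 5% of the time.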
- Section 3.2. I would suggest to summarize more the description of the method which is well known. The different equations are not necessarily needed for instance.
ANS: We agree that the k-NN approach is well known and does not require as much detail as currently presented. In the revised manuscript we will streamline this section by shortening the description and removing equations that are not essential, while retaining the key information needed to understand its implementation within our framework.
- Ln 252-253. The knn model considered here does not depend on the local rainfall pattern around the daily amount to disaggregate. This may be a too important simplification. A number of works have shown that the subdaily structure of precipitation is highly dependent on this local rainfall pattern (cf. Ormsbee, 1989, Gunter et al. 2001, Maloku et al. 2023). Testing the initial Sharma et al. 2006 method would perhaps be worthwhile.
ANS: We fully agree with the theoretical foundation that the sub-daily structure of rainfall is influenced by the local rainfall pattern surrounding the target day, and that the original formulation of Sharma et al. (2006) accounts for this dependence by including antecedent and subsequent days when selecting analogues. In our case, however, we chose not to incorporate this element because it reduces flexibility in regimes with strongly seasonal precipitation. For example, in sites such as Quinta Normal (Chile), many summer months are almost completely dry, which means that including surrounding days would often leave no valid analogues to select from, even across long records. By simplifying the selection to focus on the daily amount itself, we ensured that the method could still be applied consistently across both wet and dry seasons.
We note, however, that seasonal rainfall patterns are indirectly captured in our implementation through the moving time window used to search for analogues, which restricts the pool of potential candidates to the same period of the year. Finally, we emphasize that there are different ways of implementing k-NN for disaggregation, each with its own strengths and weaknesses. The point of this paper is not to promote one specific implementation, but to demonstrate how the proposed framework can systematically evaluate such methods and highlight their relative advantages and limitations.
- Ln270-280 and equations next: I do not understand what is the purpose of this text and of those equations. Can you clarify?
ANS: Following the procedure of Alam and Elshorbagy (2015), the optimal half-window size for the k-NN method was calibrated using an error-based criterion. The purpose of this calibration is to select a window size that minimizes the difference between the observed IDF curves (derived from AMPs of different durations and return periods) and those obtained from the disaggregated series. We will clarify this point more explicitly in the revised manuscript. Hence, for each candidate window size:
- The AMPs of each duration d are normalized with respect to the maximum AMP of that duration over the entire record, so that their values range between 0 and 1 (for both observed and disaggregated series).
- For each duration d, the RMSE is calculated between the normalized observed and disaggregated AMPs (with N values in each series, where N is the number of years).
- A weighted average of the 24 RMSE values is then obtained, using as weights the inverse of the duration (to give more importance to shorter durations).
- This results in a single weighted RMSE (RMSEwei) for that candidate window size. The optimal window is the one that minimizes this RMSEwei.
We also acknowledge that in the current manuscript (Equations 12 and 13), the notation may lead to confusion: the terms max(AMP_{i,d}) and min(AMP_{i,d}) should not include the subscript i, because otherwise the normalization would incorrectly be applied relative to each value itself. What is actually used is the maximum (or minimum) AMP_d across all years. We will correct this notation in the revised version to avoid any misunderstanding. In addition, we will improve the wording of this section to make the procedure more reader-friendly, acknowledging that in its current form it may not be straightforward to follow.
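The calibration criterion enumerated in the steps above can be sketched as follows. Names are ours, and we apply min–max normalization per duration using the extremes over all years, consistent with the notation correction just described; whether observed and disaggregated AMPs are each normalized by their own extremes is our assumption here:

```python
import numpy as np

def weighted_rmse(amp_obs, amp_sim):
    """amp_obs, amp_sim: arrays of shape (n_years, 24) with the annual
    maxima for durations d = 1..24 h. Returns the duration-weighted RMSE
    used to rank candidate half-window sizes (smaller is better)."""
    amp_obs = np.asarray(amp_obs, dtype=float)
    amp_sim = np.asarray(amp_sim, dtype=float)
    durations = np.arange(1, 25)
    weights = 1.0 / durations  # more importance to shorter durations

    def norm(a):
        # normalize each duration's AMPs by its extremes over all years
        lo, hi = a.min(axis=0), a.max(axis=0)
        return (a - lo) / (hi - lo)

    rmse_d = np.sqrt(((norm(amp_obs) - norm(amp_sim)) ** 2).mean(axis=0))
    return float((weights * rmse_d).sum() / weights.sum())

# the optimal half-window is then the argmin over candidate sizes, e.g.
# best = min(candidates, key=lambda w: weighted_rmse(amp_obs, simulate(w)))
```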
- Ln 310. If I understand well, the disaggregated rainfall of one day can cross that day into the next one. Is it possible to have two rainfall amounts at the same time, from day i and day i+1? If yes, do you sum them?
ANS: No, the disaggregated rainfall of one day cannot cross into the next day. This is precisely why the starting time of the disaggregation is restricted to a uniform random variable U(0, 24–d), which ensures that the full disaggregated event of duration d remains entirely within the corresponding day. We will clarify this in Section 3.3.
- Ln 330-338. Can you clarify? I do not understand what you do here and what it is for. I understand that you produce 300 disaggregated time series. I had understood from Fig. 1 that each time series is used in turn to produce one set of IDF curves. Here, I now understand that you merge the data from the 300 time series to do your frequency analysis.
ANS: We thank the reviewer for this question and for correctly understanding the procedure. Indeed, 300 disaggregated time series were generated, and each one was used independently to produce an IDF curve. The ensemble of 300 curves is then used to estimate the distribution of intensities for each duration and return period, from which we derive quantiles and their associated uncertainty.
The purpose of the text in lines 330–338 was to justify why using a large number of disaggregations provides a robust estimate of any given quantile and its uncertainty. In the limit, if n → ∞, the estimate of a quantile and its confidence bounds would converge to the “exact” values. With 300 realizations, we consider the sample size to be more than sufficient to provide a reliable estimate of both the quantiles and their uncertainty. We will clarify this explanation in the revised manuscript to avoid any possible misunderstanding.
- Ln 338. What are “the main disaggregation runs”? how are they produced? defined?
ANS: By “main disaggregation runs” we refer to the series that were ultimately used to calculate the evaluation metrics related to both the IDF curves and the ERIs (i.e. validation series). This distinction was made to avoid confusion with the additional series generated in preliminary steps, such as those used to calibrate SOC and k-NN or to construct the Huff curves. Those calibration and validation steps also involved 300 realizations, but the “main runs” are specifically the ones employed in the final evaluation of the framework. We will clarify this in the manuscript.
- Ln 339. Do you proceed differently for the third DM? If yes, I do not understand why.
ANS: Yes, the procedure is different for the third disaggregation method because this approach does not require the calibration of a parameter or set of parameters. Instead, it relies on estimating the Huff curves for each site, which are then used as the basis for the disaggregation. This is why no additional calibration runs were needed in this case. This will be explained in detail in the manuscript.
- Ln 340. I do not understand: what are the calibration results? do you make any distinction between simulation results obtained for calibration and simulation results obtained from the application of the disaggregation process to the daily data? can you clarify?
ANS: By “calibration results” we meant the simulations used solely to calibrate the parameters of SOC and k-NN (e.g., the minimum event threshold ε or the half-window size) and to obtain the Huff curves. These calibration runs were only a preliminary step to fix the parameters and were not included in the evaluation. The evaluation itself was based on a separate set of 300 disaggregated series generated after calibration, using the fixed parameters. These are the simulations that were then used to construct the IDF curves and to calculate the ERI-based metrics. In the revised manuscript we will clarify this distinction to avoid any misunderstanding.
- In this observation configuration, how are the 1000 AMP series generated?
ANS: This point was clarified earlier in the responses (question n°15): the 1000 AMP series are generated by resampling from the fitted Gumbel distribution for each duration, as described in Sect. 4.2. Each sample contains as many values as years in the observed record, and the ensemble of 1000 samples is then used to estimate the distribution of return levels and their uncertainty. We will include a parenthetical reference to Section 2.2.1 to make this clearer.
- Can you complete the caption? What do the ellipses correspond to?
ANS: The ellipses in Fig. 3 correspond to the different values taken by the Euclidean distance D_T, as shown as an example in Fig. 1c. They are plotted in order to clearly illustrate, for each ordered pair (AEI_T, PEI_T), the associated D_T value. In the revised manuscript we will clarify the caption accordingly.
- Ln 403 and Table 5: what does the 50th percentile mKGE of each ERI time series mean? This suggests that you compute multiple mKGE values for each generated time series, which is confusing. Or do you compute one mKGE for each generated time series and then produce the boxplots of those n values (300 values)?
ANS: To clarify: for each disaggregated time series (one of the 300 realizations) and for each ERI, we compute a single mKGE value. This yields 300 mKGE values per ERI. The boxplots then summarize the distribution of these 300 values. In Table 5, the “50th percentile” refers to the median of this distribution of 300 mKGE values, not to percentiles within each individual time series. We will rephrase this part of the text to make it clearer.
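The per-run computation described above can be sketched as follows. We assume a KGE-like score without the correlation term (bias and variability components only), in the spirit of Equation 5; the exact form and estimators are given in the manuscript, and the names below are ours:

```python
import numpy as np

def moments_bias_coefficient(sim, obs):
    """KGE-like score without the correlation component, comparing the
    annual ERI series of one disaggregated run to the observed one.
    Perfect agreement of mean and variability gives 1."""
    sim = np.asarray(sim, dtype=float)
    obs = np.asarray(obs, dtype=float)
    beta = sim.mean() / obs.mean()             # bias ratio
    alpha = sim.std(ddof=1) / obs.std(ddof=1)  # variability ratio
    return 1.0 - np.sqrt((beta - 1.0) ** 2 + (alpha - 1.0) ** 2)

# one value per run: 300 runs -> 300 scores -> one boxplot per ERI, e.g.
# scores = [moments_bias_coefficient(run, obs_series) for run in runs]
```

The “50th percentile” in Table 5 is then simply the median of those 300 scores.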
Citation: https://doi.org/10.5194/egusphere-2025-2710-AC3
- AC3: 'Reply on RC3', Claudio Sandoval, 05 Sep 2025
Status: closed
- RC1: 'Comment on egusphere-2025-2710', Anonymous Referee #1, 22 Jul 2025
This was the first time I was involved as a reviewer for this manuscript. The manuscript introduces a framework for a systematic comparison of rainfall extreme values from generated time series. The manuscript is well-written. Unfortunately, I have to recommend the rejection of the manuscript. Please find my detailed comments below.
Major comments:
By the chosen title the authors want to introduce a new framework for assessing rainfall extreme values from generated time series. The authors could not convince me that such a framework is required; current studies on rainfall generation validate extreme value behaviour in a sufficient manner. Indeed, the establishment of a systematic procedure could be nice, but to do so the authors would have had to i) compare the framework with conventional validation strategies (classical presentation of IDF curves), and ii) share the code for the evaluation with the community, which is unfortunately not done (only “…upon reasonable request.”). My second issue is the comparison of rainfall generators: Although not part of the title and beyond the scope of the manuscript, the authors judge the studied rainfall generators without hydro-meteorologic validation, and rank them (Table 4, Table 6).
The manuscript should either focus on the introduction of the framework (comparisons with established methods, providing the code), or on the comparison of the rainfall generators. One manuscript does not provide enough space for the proper study that each topic would require.
Specific comments:
L44-63 The authors classify the evaluation possibilities in three types. Actually, there are four types; the most practical type is missing: the subsequent application. Due to the high non-linearity of rainfall-runoff transformation processes, one cannot conclude from a single rainfall characteristic on the performance when the generated rainfall time series are used as input for the subsequent application. Sometimes the application indicates critical shortcomings of the generated time series and leads to an iterative optimization of the rainfall generation process. This type, including references, should be added.
Fig.1 a: The upper line contains arrows, but I think these are just headers of the boxes below (areas within the coloured dashed lines). If so, I suggest changing the layout; in its current version it can be misleading to have arrows twice, for the headers and the datasets/methods below.
Fig.1 b: Why is the framework limited to 1 h and not to 1 min or 5 min, the typical lower bounds for rainfall generation?
Fig.1 c: What is ‘Index value’ on the y-axis? Should it be ‘Rainfall amount (mm)’?
L110-111 Reference curve: Why is crossing that curve such a relevant issue that the method should be discarded? Please provide references or examples here.
L114 I don’t understand this sentence. First, three values for d are provided, which yield three other values for d?
Eq. 2, 3: I’m wondering about the selection of the studied durations. The index runs from d=1,…,24, so also very uncommon durations such as d=19h are analysed? Although a lower weight is chosen (1/d), I question the acceptance of this method within the hydrologic community. I suggest keeping the equations more open and providing the durations used in this study in the paragraph below. For me, typical durations would have been d={1h, 2h, 3h, 4h, 6h, 12h, 24h}.
Eq.2 I’m wondering how this criterion is affected by different numbers of realisations. For comparisons of two rainfall generators A and B, ten realisations may be required to represent the statistical uncertainty for A, while 100 realisations are required for B. Can this be a problem?
Chapter 5: The introduction of something ‘new’ demands comparisons with something ‘old or established’. Without previous experience, Fig. 3 is hard to interpret for the reader: How does PEI=0.4 differ from PEI=0.2? Or the same PEI for different AEI? The authors could provide classical IDF curves to show the benefit of the newly proposed framework: What is visible with the new framework that could not be seen/quantified before?
L378-392 The manuscript aims at introducing a new framework, not at comparing rainfall generators. For a fair comparison, more hydro-meteorologic in-depth analyses of the results would be required. This comment is also valid for Sect. 5.2.
Citation: https://doi.org/10.5194/egusphere-2025-2710-RC1
- AC1: 'Reply on RC1', Claudio Sandoval, 05 Sep 2025
Acknowledgment: We thank the reviewer for their time and suggestions, which will allow us to improve our work.
This was the first time I was involved as a reviewer for this manuscript. The manuscript introduces a framework for a systematic comparison of rainfall extreme values from generated time series. The manuscript is well-written. Unfortunately, I have to recommend the rejection of the manuscript. Please find my detailed comments below.
Major comments
- By the chosen title the authors want to introduce a new framework for assessing rainfall extreme values from generated time series. The authors could not convince me that such a framework is required; current studies on rainfall generation validate extreme value behavior in a sufficient manner. Indeed, the establishment of a systematic procedure could be nice, but to do so the authors would have had to i) compare the framework with conventional validation strategies (classical presentation of IDF curves), and ii) share the code for the evaluation with the community, which is unfortunately not done (only “…upon reasonable request.”). My second issue is the comparison of rainfall generators: Although not part of the title and beyond the scope of the manuscript, the authors judge the studied rainfall generators without hydro-meteorologic validation, and rank them (Table 4, Table 6). The manuscript should either focus on the introduction of the framework (comparisons with established methods, providing the code), or on the comparison of the rainfall generators. One manuscript does not provide enough space for the proper study that each topic would require.
ANS: Most of the studies that compare rainfall-disaggregation methods do so because they want to propose a new approach. Nevertheless, if a hydrological modeler wants to choose among the different methods already available, there are no recommendations on how to approach this. Also, for researchers developing new methods it is easier to compare methods, but for hydrological modelers outside research that task might be overwhelming. Given this need, we propose a framework to help hydrological modelers choose among already existing rainfall-disaggregation methods. We appreciate the comment and will present the gap in a more direct way in the new version of the manuscript.
We would like to clarify that our framework does not replace traditional validation approaches, but rather systematizes and extends them. The direct comparison between observed IDF curves and those derived from disaggregated series (commonly used in the design of new disaggregation methods) is already embedded as the foundation of our approach. What the framework adds is the ability to synthesize this comparison into unified indicators of accuracy and precision (AEI, PEI, DT), which condense the information contained in IDF curves into single-valued metrics. In this way, the spirit of traditional validation is preserved, while its outcome becomes easier to interpret, comparable across methods, and directly representable in a common graphical space. Moreover, the framework quantifies the uncertainty with which the IDF curve is reproduced by multiple disaggregations, something that is not necessarily captured by traditional approaches. In the revised version of the manuscript, we will make this added value more explicit, and we will consider including, as supplementary material, a graphical illustration of the construction of IDF curves to more transparently show how they are integrated within the framework.
Regarding code availability, we agree that reproducibility and transparency are fundamental virtues that must prevail in a study of this nature. Therefore, in the revised manuscript we will provide public access to the full implementation of the proposed framework through an open repository. This repository will also include selected disaggregation methods as examples, not because the comparison of generators is the main objective of our work, but as a means to facilitate the implementation and application of the framework by other researchers. Despite this, we encourage users to apply the framework with other disaggregation methods, for which open-source codes may already be available in the literature, or even with new implementations developed from the beginning.
We also acknowledge the reviewer’s concern that the current manuscript may appear to combine two different goals (i.e., the introduction of the framework and the comparison of rainfall disaggregation methods). Our main aim is indeed to present the framework. The application to three methods was included only to demonstrate its usefulness in highlighting strengths and limitations across metrics and sites. In the revised manuscript, we will clarify this intention throughout the text, tone down language that could be interpreted as ranking rainfall generators and explicitly state that hydrometeorological validation is beyond the scope of this study. We will also emphasize that future applications of the framework should include such validation in order to connect methodological performance with practical hydrological relevance. Through these revisions, we believe the manuscript will more clearly focus on the introduction and demonstration of the framework, while keeping the comparison of generators strictly as an illustrative application.
Specific comments
- L44-63 The authors classify the evaluation possibilities into three types. Actually, there are four types; the most practical type is missing: the subsequent application. Due to the high non-linearity of rainfall-runoff transformation processes, one cannot infer from a single rainfall characteristic how the generated rainfall time series will perform when used as input for the subsequent application. Sometimes the application reveals critical shortcomings of the generated time series and leads to an iterative optimization of the rainfall generation process. This type, including references, should be added.
ANS: We agree that a fourth type of evaluation can be recognized, namely the “subsequent application of disaggregated rainfall in hydrological models”. This approach would be particularly relevant in contexts where short-duration extremes strongly influence hydrological response, such as small urban catchments. Nevertheless, we emphasize that our study does not aim to optimize or iteratively adjust disaggregation methods based on such applications. The methods are applied as defined in their theoretical formulations (with only minor modifications explicitly described in the manuscript). Our focus is on presenting a systematic statistical framework, which can serve as a foundation for future hydrological applications.
Although the “subsequent application of disaggregated rainfall in hydrological models” is outside the scope of the paper, we do agree that it should be mentioned. Hence, we will add a couple of lines and references in the final part of this paragraph, so that the four evaluation types you mentioned are clearly presented in the introduction. Finally, we will clearly state in the scope and objectives of the paper that this fourth type is outside our objectives, so that the reader can properly frame the paper.
- 1 a: The upper line consists of arrows, but I think these are just headers for the boxes below (the areas within the coloured dashed lines). If so, I suggest changing the layout; in its current version it can be misleading to have arrows twice, both for the headers and for the datasets/methods below.
ANS: The upper arrows in Fig. 1a are indeed intended as headers summarizing the processes represented in the areas within the colored dashed lines. We acknowledge that the current layout may be misleading, and in the revised version we will adjust the figure design to make this distinction clearer. Thank you for highlighting this point.
- 1 b: Why is the framework limited to 1 h and not to 1 min or 5 min, the typical lower bounds for rainfall generation?
ANS: In principle, the proposed framework can be applied at any temporal resolution, including 1- or 5-minute data. However, in practice we chose 1 h as the minimum resolution because it represents the most widely available and comparable temporal scale across global datasets. For example, the INTENSE project dataset (Lewis et al., 2019) used in this work collected rainfall records at resolutions ranging from 1 min to 6 h, but standardized the dataset at 1 h because this resolution was by far the most consistently available and reliable across regions. This issue is also relevant to our case study in Chile, where long-term records at sub-hourly resolution are extremely scarce, and even hourly data are limited compared to daily observations. We will clarify in the revised manuscript that the framework can be easily adapted to different sub-daily time resolutions. One hour was chosen as an example because it is the most commonly available resolution in observed rainfall records. The framework can be applied to other resolutions by simply adjusting the temporal step used to construct IDF curves and to calculate extreme rainfall indices, without altering its overall structure.
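To illustrate that the framework's structure is resolution-agnostic, the sketch below (illustrative Python, not code from the manuscript) builds the annual maximum depths that feed the IDF curves for any duration pool expressed in time steps; switching from hourly to, say, 5-minute data only changes how the durations and the year length are counted in steps:

```python
import numpy as np

def annual_max_depths(rain, steps_per_year, durations):
    """Annual maximum accumulated depth for each duration (in time steps).

    Changing the series resolution only changes how `durations` and
    `steps_per_year` are expressed; the routine itself is unchanged.
    """
    n_years = len(rain) // steps_per_year
    rain = np.asarray(rain[: n_years * steps_per_year], dtype=float)
    out = {}
    for d in durations:
        # moving sum over windows of d time steps
        sums = np.convolve(rain, np.ones(d), mode="valid")
        # maximum per year (windows crossing year boundaries ignored for simplicity)
        maxima = []
        for y in range(n_years):
            lo, hi = y * steps_per_year, (y + 1) * steps_per_year - d + 1
            maxima.append(sums[lo:hi].max())
        out[d] = np.array(maxima)
    return out

# toy example: 2 "years" of 100 steps each, with one rainfall burst per year
rng = np.random.default_rng(0)
series = rng.exponential(0.1, 200)
series[10:13] += [5.0, 8.0, 3.0]   # burst in year 1
series[150:152] += [4.0, 6.0]      # burst in year 2
am = annual_max_depths(series, steps_per_year=100, durations=[1, 3])
```

The annual maxima per duration are then what a frequency analysis would turn into intensity-duration-frequency points, regardless of the native resolution of the record.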
- 1 c: What is ‘Index value’ on the y-axis? Should it be ‘Rainfall amount (mm)’?
ANS: In Fig. 1c, “Index value” refers to the annual value of the specific Extreme Rainfall Index (ERI) being evaluated, as listed in Table 1, rather than to a rainfall amount in millimeters. We will revise the caption and axis label to make this more explicit in the revised manuscript.
- L110-111 Reference curve: Why is crossing that curve such a relevant issue that the method should be discarded? Please provide references or examples here.
ANS: The reference IDF curve is constructed solely from the available daily data, following a deterministic approach. The reason for using it as a benchmark was inspired by a practical engineering perspective: it represents what a decision-maker in hydrological design might do in the absence of long hourly records. The statement about discarding a method reflects this pragmatic logic: if the reference IDF curve (constructed here from a simple empirical frequency analysis assuming a “reasonable” triangular distribution of daily rainfall over its 24 h) reproduces the observed IDF curve better than the disaggregated curves (median), then it is better to just use the benchmark. We will better clarify this reasoning in the revised manuscript.
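As a purely illustrative sketch of that benchmark logic (the symmetric, mid-day-peaked shape below is our assumption for illustration; the manuscript only specifies a "reasonable" triangular distribution of the daily total over its 24 h):

```python
import numpy as np

def triangular_disaggregate(daily_total, n_steps=24):
    """Spread a daily total over n_steps with a symmetric triangular shape.

    The peak position and the symmetric shape are illustrative
    assumptions; only the triangular form itself is stated in the text.
    """
    x = np.arange(1, n_steps + 1)
    # weights rise linearly to mid-day and fall linearly afterwards
    w = np.minimum(x, n_steps + 1 - x).astype(float)
    w /= w.sum()
    return daily_total * w

hourly = triangular_disaggregate(48.0)  # distribute a 48 mm day over 24 h
```

The resulting hourly series conserves the daily total exactly, which is the property that lets the benchmark IDF curve be built from daily data alone.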
- L114 I don’t understand this sentence. First, three values for d are provided, which then yield three other values for d?
ANS: In the proposed temporal distribution, each intensity of the benchmark IDF curve is assigned to the midpoint of its duration. For example, for the 4–5 h interval the corresponding duration is taken as d = 4.5 h. We will revise the sentence in the manuscript to state this more clearly.
- 2, 3: I’m wondering about the selection of the studied durations. The index runs from d = 1, …, 24, so very uncommon durations such as d = 19 h are also analysed? Although a lower weight is chosen (1/d), I question the acceptance of this method within the hydrologic community. I suggest keeping the equations more open and providing the durations used in this study in the paragraph below. For me, typical durations would have been d = {1h, 2h, 3h, 4h, 6h, 12h, 24h}.
ANS: The reason for initially using all durations from 1 to 24 h was to avoid introducing subjectivity by selecting a pool of durations and to ensure that the evaluation was fully systematic. We agree with the reviewer, however, that this procedure introduces redundancies (e.g., the 23 h duration yields values very similar to 24 h) and includes durations that are uncommon in hydrological practice. In the revised manuscript we will restrict the evaluation to the most typical design durations (1, 2, 3, 4, 6, 12, and 24 h). This choice not only reflects standard hydrological practice, but also avoids redundancy and gives more relative weight to the shorter durations, which are particularly relevant for the assessment of extremes.
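A hedged sketch of how the restricted pool and the 1/d weighting could enter an error aggregation (the squared relative error below is illustrative only; the actual normalizations are those of the AEI and PEI in Eqs. 2 and 3 of the manuscript):

```python
import numpy as np

def weighted_idf_error(i_obs, i_sim, durations):
    """Aggregate the IDF mismatch over a pool of durations, weighting
    each duration by 1/d so that short durations dominate.

    The squared relative error is an illustrative stand-in for the
    paper's own AEI/PEI normalizations.
    """
    d = np.asarray(durations, dtype=float)
    w = (1.0 / d) / (1.0 / d).sum()          # normalized 1/d weights
    rel_err = (np.asarray(i_sim) - np.asarray(i_obs)) / np.asarray(i_obs)
    return float(np.sum(w * rel_err**2))

durations = [1, 2, 3, 4, 6, 12, 24]          # typical design durations (h)
obs = np.array([30.0, 20.0, 16.0, 13.0, 10.0, 6.0, 4.0])  # toy intensities, mm/h
perfect = weighted_idf_error(obs, obs, durations)
biased = weighted_idf_error(obs, obs * 1.1, durations)     # uniform +10% bias
```

With normalized weights, a uniform relative bias contributes the same penalty at every duration, while duration-specific errors at 1-3 h are penalized most, which is the behavior the restricted pool is meant to emphasize.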
- 2 I’m wondering how this criterion is affected by different numbers of realisations. For comparisons of two rainfall generators A and B, ten realisations may be required to represent the statistical uncertainty for A, while 100 realisations are required for B. Can this be a problem?
ANS: Equation 2 refers to the Accuracy Efficiency Index (AEI), which measures how closely the median IDF curve derived from disaggregated series reproduces the observed IDF curve. The value of AEI does depend on the number of realizations (N) if N is small, since the sample may not adequately represent the underlying distribution. However, once a sufficiently large N is used, the estimate of the median IDF curve becomes stable, and increasing N further does not change the result in any meaningful way. The same reasoning applies to Eq. 3 (the Precision Efficiency Index, PEI), which characterizes the spread of the IDF curves derived from each realization. In our study we used 300 realizations for each method, a sample size large enough to ensure stable estimates and to provide results that would be essentially the same as with 1000 realizations. We will clarify these aspects in the revised manuscript.
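The stabilization argument can be illustrated with a toy experiment (the lognormal sample below is a stand-in assumption, not actual disaggregation output): the spread of the estimated median shrinks as the number of realizations grows, which is why results for N = 300 are essentially insensitive to further increases.

```python
import numpy as np

rng = np.random.default_rng(42)

def median_estimate(n_realizations):
    """Median of a simulated per-realization intensity sample.

    A lognormal stand-in for the intensity produced by one
    disaggregation run; the real quantity would come from the IDF
    curve of each disaggregated series.
    """
    sample = rng.lognormal(mean=2.0, sigma=0.5, size=n_realizations)
    return np.median(sample)

def median_spread(n_realizations, n_repeats=200):
    """Standard deviation of the median across repeated experiments."""
    return np.std([median_estimate(n_realizations) for _ in range(n_repeats)])

spread_small = median_spread(10)    # unstable: few realizations
spread_large = median_spread(300)   # stable: many realizations
```

The same convergence argument applies to the spread statistic behind the PEI: once the ensemble is large enough, both the median curve and its dispersion stop changing in any meaningful way.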
- Chapter 5: The introduction of something ‘new’ demands comparisons with something ‘old or established’. Without previous experience, Fig. 3 is hard to interpret for the reader: How does a PEI = 0.4 differ from PEI = 0.2? Or the same PEI for different AEI? The authors could provide classical IDF curves to show the benefit of the newly proposed framework: What is visible with the new framework that could not be seen/quantified before?
ANS: We agree that without a direct link to classical IDF curves, the interpretation of Fig. 3 may not be straightforward for readers. As noted in our response to your Major Comment, we will include illustrative examples of classical IDF curves in the revised manuscript. This will make the connection between the traditional validation and the proposed framework explicit, and will clarify how the framework adds value by disentangling accuracy from precision and by providing a quantitative assessment of differences that are otherwise only qualitatively visible.
- L378-392 The manuscript aims at introducing a new framework, not at comparing rainfall generators. For a fair comparison, more hydro-meteorologic in-depth analyses of the results would be required. This comment is also valid for Sect. 5.2.
ANS: Indeed, the main contribution of the manuscript is the introduction of the framework, not a systematic comparison of rainfall generators or a full hydrological application. We fully agree that a more in-depth hydrological test (for instance, running a rainfall–runoff model such as HEC-HMS) would be required for a fair comparison of methods. However, such an analysis goes beyond the scope of the present manuscript, which is focused on providing a methodological basis for evaluating disaggregation approaches. We appreciate this suggestion, as it reveals that the objectives and boundaries of the study were not sufficiently emphasized in the original version. In the revised manuscript we will therefore make this scope more explicit, clarifying that while the framework could indeed be extended to hydrological impact analyses, our aim here is to establish and illustrate the framework itself, leaving its application to specific hydrological models as a future step.
Citation: https://doi.org/10.5194/egusphere-2025-2710-AC1
AC1: 'Reply on RC1', Claudio Sandoval, 05 Sep 2025
RC2: 'Comment on egusphere-2025-2710', Anonymous Referee #2, 22 Jul 2025
This study proposes an evaluation framework for the assessment of temporal disaggregation methods of daily precipitation time series, focusing on key statistical properties (IDF relationships, KGE). The paper is very well written and the motivation is clearly explained. However, the proposed framework does not represent a significant novelty compared to the evaluations made in past studies. Furthermore, this framework is illustrated using disaggregation approaches that are not representative of the current approaches in my opinion. This is a strong limitation of this work given that many conclusions or interpretations are based on this set of experiments (see comments below).
I understand that this is not the main objective of this study, but a more comprehensive evaluation with state-of-the-art approaches would have given more value to this study. The Stochastic pulse-type method applied in this study was developed in 2001 and more recent versions are expected to perform a lot better (see e.g. https://hess.copernicus.org/articles/28/391/2024/). The approach using Huff curves has been proposed in 1967 and is clearly outdated in my opinion. The k-nearest neighbors method is still applied and could represent a benchmark. Random cascade models are cited in the introduction and still widely used but are not tested.
General comments:
l.1-3: The beginning of the abstract should clearly indicate that this study focuses on temporal disaggregation of daily precipitation data. There are a lot of other approaches for downscaling precipitation data, especially in space (e.g. dynamical downscaling using regional climate models) and these first sentences are more relevant to these approaches in my opinion.
l.16: High-resolution precipitation data often refer to gridded data with a high spatial resolution.
l.74-76: There are more recent approaches that could have been considered. For example, I was surprised that no random cascade model was considered (see, e.g. https://doi.org/10.5194/hess-27-3643-2023).
l.171: As acknowledged by the authors, other ERIs could be more relevant in other studies. In particular, the number of consecutive hours that defines an intense precipitation event or the quantile used to define a large precipitation intensity are really specific to each case study. In my opinion, these choices illustrate the fundamental difficulty in proposing a universal evaluation framework that would be adapted to all possible applications of a disaggregation method.
l.413-415: I am not sure I understand: is it claimed that the timing of observed events does not follow a Poisson process? In my experience this is not true; a Poisson process has been shown to be reasonable in many applications of cluster-based models.
l.418: A common criticism made of the kNN approach is actually that the maximum disaggregated values are limited to the largest observed values, which is not a satisfying feature in extrapolation. It is sometimes combined with random draws from a distribution for this reason.
Citation: https://doi.org/10.5194/egusphere-2025-2710-RC2
AC2: 'Reply on RC2', Claudio Sandoval, 05 Sep 2025
Acknowledgment: We sincerely thank you for such constructive comments and suggestions, which will help us to improve the clarity, scientific quality, and overall contribution of this work to the community.
Major comments
- This study proposes an evaluation framework for the assessment of temporal disaggregation methods of daily precipitation time series, focusing on key statistical properties (IDF relationships, KGE). The paper is very well written and the motivation is clearly explained. However, the proposed framework does not represent a significant novelty compared to the evaluations made in past studies. Furthermore, this framework is illustrated using disaggregation approaches that are not representative of the current approaches in my opinion. This is a strong limitation of this work given that many conclusions or interpretations are based on this set of experiments (see comments below). I understand that this is not the main objective of this study, but a more comprehensive evaluation with state-of-the-art approaches would have given more value to this study. The Stochastic pulse-type method applied in this study was developed in 2001 and more recent versions are expected to perform a lot better (see e.g. https://hess.copernicus.org/articles/28/391/2024/). The approach using Huff curves has been proposed in 1967 and is clearly outdated in my opinion. The k-nearest neighbors method is still applied and could represent a benchmark. Random cascade models are cited in the introduction and still widely used but are not tested.
ANS: Regarding the choice of methods used to illustrate the framework, our selection was deliberate in order to cover conceptually distinct approaches that are relatively simple to implement, apply, and compare. We note that the reference to Huff (1967) was misleadingly phrased: the year refers to the publication where the Huff curves were first presented, but the disaggregation method applied in our study is an adaptation we developed based on these curves, rather than a direct application of the original 1967 procedure. We would also like to emphasize that the selection of methods is not the main focus of the paper. The methods are included only as illustrative applications, and in the revised manuscript we will present this more clearly to ensure that the focus remains on the framework itself rather than on the specific methods tested.
We agree that more recent approaches, such as the aforementioned random cascade models, represent important lines of development. Our framework is intentionally method-agnostic and could be readily applied to those state-of-the-art approaches. To make the study more aligned with the current state of the art, in the revised manuscript we will extend the set of disaggregation methods considered to include more recent approaches. At the same time, in the present work we deliberately selected methods that are well established and conceptually different, in order to highlight how the framework can provide insights across a broad range of disaggregation philosophies.
General comments
- 1-3: The beginning of the abstract should clearly indicate that this study focuses on temporal disaggregation of daily precipitation data. There are a lot of other approaches for downscaling precipitation data, especially in space (e.g. dynamical downscaling using regional climate models) and these first sentences are more relevant to these approaches in my opinion.
ANS: We agree that the first sentences of the abstract could be misinterpreted, since they do not clearly state that the focus is on temporal disaggregation. In the revised manuscript we will rephrase the beginning of the abstract to clearly state that the study focuses on developing a framework to evaluate temporal disaggregation methods, from daily to sub-daily timescales.
- 16: High-resolution precipitation data often refer to gridded data with a high spatial resolution.
ANS: We agree that the term “high-resolution precipitation data” is often used to refer to gridded products with high spatial resolution, rather than temporal resolution. As noted in our response to the previous comment (comment 2.1), we will rephrase the beginning of the abstract to avoid this ambiguity and to clearly indicate that our study focuses on temporal disaggregation of daily precipitation into sub-daily scales.
- 74-76: There are more recent approaches that could have been considered. For example, I was surprised that no random cascade model was considered (see, e.g. https://doi.org/10.5194/hess-27-3643-2023).
ANS: As noted in our response to the major comment above, we acknowledge the importance of more recent approaches such as random cascade models. In the revised manuscript we will extend the set of disaggregation methods considered to include more recent approaches.
- 171: As acknowledged by the authors, other ERIs could be more relevant in other studies. In particular, the number of consecutive hours that defines an intense precipitation event or the quantile used to define a large precipitation intensity are really specific to each case study. In my opinion, these choices illustrate the fundamental difficulty in proposing a universal evaluation framework that would be adapted to all possible applications of a disaggregation method.
ANS: We agree that the selection of ERIs cannot be completely arbitrary, otherwise the framework would lose coherence and comparability. Our intention is not to suggest that “anything goes,” but rather to emphasize that while some flexibility exists, the ERIs should always be chosen following clear criteria. In our case, the selected ERIs have the advantage of being widely used in hydroclimatic studies, of covering complementary aspects of extremes (frequency, intensity, and duration), and of being directly computable from both observed and disaggregated series. These properties make them suitable for systematic application and comparison across methods. We acknowledge, however, that other studies may have specific objectives that require alternative indices (e.g., tailored to local design needs or hydrological impact assessments). In such cases, the framework remains adaptable, provided that the selection of ERIs is guided by the same principles of relevance, representativeness, and comparability. In the revised manuscript, we will make these criteria more explicit to clarify the balance between flexibility and consistency within the framework.
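To illustrate the "directly computable" property, two common stand-in indices (the annual maximum hourly rainfall, and the annual count of hours above a high wet-hour percentile) can be sketched as follows; the actual pool of ERIs is the one listed in Table 1 of the manuscript, and the thresholds below are illustrative assumptions:

```python
import numpy as np

def annual_eris(hourly, hours_per_year=8760, wet_threshold=0.1, pct=95):
    """Two illustrative extreme rainfall indices per year: the annual
    maximum hourly rainfall, and the count of hours exceeding the
    all-years wet-hour percentile `pct`.

    Thresholds and index choices are stand-ins for the pool in Table 1.
    """
    hourly = np.asarray(hourly, dtype=float)
    n_years = len(hourly) // hours_per_year
    hourly = hourly[: n_years * hours_per_year]
    wet = hourly[hourly > wet_threshold]
    q = np.percentile(wet, pct) if wet.size else np.inf
    rx1, n_ext = [], []
    for y in range(n_years):
        yr = hourly[y * hours_per_year:(y + 1) * hours_per_year]
        rx1.append(yr.max())
        n_ext.append(int(np.sum(yr > q)))
    return np.array(rx1), np.array(n_ext)

# synthetic 2-year hourly series: ~5% wet hours with exponential depths
rng = np.random.default_rng(1)
series = np.where(rng.random(2 * 8760) < 0.05,
                  rng.exponential(2.0, 2 * 8760), 0.0)
rx1hr, n_extreme = annual_eris(series)
```

Because both indices are computed identically from observed and disaggregated series, they can be compared year by year, which is what the mKGE-based assessment in the framework relies on.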
- 413-415: I am not sure to understand, is it claimed that the timing of observed events does not follow a Poisson process? In my experience this is not true, a Poisson process has been shown to be reasonable in many applications of cluster-based models.
ANS: We thank the reviewer for this important clarification. Our intention was not to claim that a Poisson process is inadequate for representing rainfall event timing. On the contrary, we fully acknowledge that Poisson arrivals have been widely and successfully used in point-process models. What we intended to highlight is that the Socolofsky method (SOC) does not follow the logic of Poisson event arrivals: instead of generating events over continuous time, it allocates pulses within discrete intervals. As a result, SOC may incur limitations in reproducing the temporal intermittency of rainfall, particularly the dry steps between pulses. In the revised manuscript we will rephrase this passage to avoid any misinterpretation and to make clear that the limitation refers to the SOC implementation, not to Poisson-based approaches in general.
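The distinction can be sketched schematically (both routines below are caricatures for illustration, not the actual SOC or cluster-model implementations): a Poisson process places events in continuous time through exponential inter-arrival times, whereas an interval-based allocation drops pulses into discrete slots.

```python
import numpy as np

rng = np.random.default_rng(7)

def poisson_arrival_times(rate, horizon):
    """Continuous-time Poisson process: exponential inter-arrival times.
    `rate` in events per hour, `horizon` in hours."""
    times, t = [], 0.0
    while True:
        t += rng.exponential(1.0 / rate)
        if t >= horizon:
            return np.array(times)
        times.append(t)

def discrete_interval_allocation(n_pulses, n_intervals):
    """Schematic interval-based allocation: pulses are dropped into
    discrete slots rather than placed in continuous time (loosely the
    SOC style discussed above; the method's details differ)."""
    return np.sort(rng.integers(0, n_intervals, size=n_pulses))

cont = poisson_arrival_times(rate=0.5, horizon=24)   # event times in [0, 24) h
disc = discrete_interval_allocation(n_pulses=6, n_intervals=24)
```

In the continuous case, the gaps between events (and hence the dry steps) arise naturally from the inter-arrival distribution; in the discrete case they are a by-product of the slot allocation, which is the source of the intermittency limitation noted for SOC.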
- 418: A common criticism made of the kNN approach is actually that the maximum disaggregated values are limited to the largest observed values, which is not a satisfying feature in extrapolation. It is sometimes combined with random draws from a distribution for this reason.
ANS: We agree that a common limitation of the k-NN approach is that the maximum disaggregated values are bounded by the largest values observed in the historical record, which restricts its ability to extrapolate extremes. Indeed, extensions of k-NN have been proposed where random draws from parametric distributions are combined with the analog-based resampling in order to overcome this drawback and give the method more flexibility.
From the perspective of our framework, this limitation is not ignored: by explicitly evaluating extreme rainfall indices and IDF curves, the framework makes it possible to detect whether a method can or cannot reproduce extremes beyond the observed range. If extrapolation is important for a specific application (e.g., in design contexts requiring estimates for large return periods), the framework provides a transparent way to highlight the strengths and weaknesses of k-NN relative to other approaches. In this sense, rather than solving the limitation itself, the framework clarifies its implications for practice and helps inform the selection of the most appropriate method for the problem at hand.
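The boundedness of plain analog resampling, and the perturbation extension mentioned above, can be sketched as follows (the Dirichlet sub-daily patterns and the lognormal noise are illustrative assumptions, not a specific published variant):

```python
import numpy as np

rng = np.random.default_rng(3)
# 50 observed sub-daily patterns: hourly fractions of the daily total
observed_fractions = rng.dirichlet(np.ones(24), size=50)

def knn_disaggregate(daily_total, patterns):
    """Plain analog resampling: pick one observed sub-daily pattern.
    Hourly values are bounded by daily_total times the largest
    observed fraction, so extremes cannot exceed the historical record."""
    pattern = patterns[rng.integers(len(patterns))]
    return daily_total * pattern

def knn_with_perturbation(daily_total, patterns, noise_scale=0.2):
    """Hypothetical extension: multiplicative noise lets hourly values
    exceed the largest observed fraction (renormalized to conserve
    the daily total)."""
    pattern = patterns[rng.integers(len(patterns))]
    noisy = pattern * rng.lognormal(0.0, noise_scale, size=pattern.size)
    noisy /= noisy.sum()
    return daily_total * noisy

bound = 40.0 * observed_fractions.max()
plain_max = max(knn_disaggregate(40.0, observed_fractions).max()
                for _ in range(1000))
perturbed = knn_with_perturbation(40.0, observed_fractions)
```

However many plain resampling runs are drawn, their hourly maxima stay below the bound implied by the observed patterns, which is exactly the kind of behavior the framework's IDF and ERI comparisons would make visible.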
Citation: https://doi.org/10.5194/egusphere-2025-2710-AC2
RC3: 'Comment on egusphere-2025-2710', Anonymous Referee #3, 24 Jul 2025
This manuscript aims to present a framework for the assessment of rainfall disaggregation models (DM) with a focus on the generation of relevant extreme precipitation. For illustration, the framework is applied to 3 different DMs on 5 locations worldwide where hourly observations of rainfall are available.
The aim of the paper is very relevant. Such a framework would indeed be very valuable to assess individual DMs and to compare them. The paper has, however, a number of significant limitations, and the work needs major revisions. I really encourage the authors to submit a modified version.
Added value of this framework when compared to already existing ones.
A number of multiscale evaluations have already been proposed to assess weather generators (WGEN) and/or to compare DMs. What is the added value of this new framework compared, for instance, to Bennett et al. (2012) or Maloku et al. (2023), where multiple temporal scales and precipitation extremes are considered? Or compared to Cappelli et al. (2025), where IDF curves are the aim of the disaggregation?
Evaluation for ungauged stations.
The framework is applied to different stations where rainfall observations are available. The performance of each DM is assessed by comparison to a reference built from observations available at the considered site. What is proposed to assess DMs for disaggregating a daily time series at a site where no subdaily data are available? This point would be worth clarifying and commenting on, and the target of the paper has to be clarified with respect to this issue as well. This is obviously an important operational issue (see Kim et al. 2016, Maloku et al. 2025 and references therein). The evaluation framework proposed here obviously does not need to include this issue. However, the importance of also evaluating the potential for regionalization of the model would be worth mentioning in a discussion. One could, for instance, look for a model with parameters that are robust and do not vary much in space (cf. Maloku et al. 2023). The possibility (need?) to apply a kind of regional assessment with hourly stations available (or not) in the neighborhood of the target station would also be worth mentioning.
Seasonal performance of DMs
The ability of DMs to produce relevant simulations is typically found to depend on the season. This point would be worth discussing in the manuscript.
Criteria used to assess the DM ability to reproduce IDF curves.
In the manuscript, DMs are first assessed on their ability to reproduce IDF curves derived in a previous step from observations. A relevant simulation of IDFs is obviously important for DMs. For this assessment, the authors propose two criteria (AEI and PEI in Eqs. 2 and 3), further combined into one single distance criterion (DT, Eq. 4). Each criterion is normalized such that its value for a perfect DM is one and its value is negative if its performance is lower than that of a reference model. This normalization facilitates the assessment. I see two major limitations here:
Criterion AEI: the reference model is a straight line in the IDF graph (see Fig. 1b). This is a poor reference model, especially in the way it is built. This reference is far from the observed IDF for all durations (even for the day) and all return periods, and it may thus not really be able to highlight bad models. More relevant references could (should) be considered, as suggested here, for each return period T:
Reference A: The shape of the reference is not linear but follows a classical analytical formulation used for IDF curves (in the form of a Talbot or a Montana model). To define the coefficients of the model, one could assume 1) that the reference ends at the point which is known in the considered IDF curve, that is, the observed return level (estimated from the observations) for the daily resolution, and 2) that (for instance) the maximum intensity for a 1-hour duration is twice that for the daily duration (the factor 2 being arbitrarily chosen).
Reference B: A simple deterministic DM is used to produce a time series from which the IDF curve is determined. Such a model could be the linear model of Ormsbee (1989), also considered as a reference by Hingray and Ben Haha (2005) in their comparison work.
Reference C : A simple constant disaggregation model is considered (constant intensity assumption for each daily time step).
Reference D : a single climatological sub-daily pattern, determined each month from the observations (e.g. following the Huff et al. methodology considered in the paper), is used for the disaggregation of all days of this month.
In my opinion, one such reference would make some sense (I guess more than the proposed one).
Criterion PEI: the reference model is the ensemble of IDF curves obtained from a resampling of observed data. The authors mention that the value 1 is the best possible value and that negative values are not desirable (lines 144-145). This is not what I understand from Equation 3. For me, a value of 1 indeed indicates no uncertainty in the disaggregation process, and I do not see why a DM should give no uncertainty. The disaggregation cannot be deterministic, so different disaggregation runs will (cannot but) lead to different IDF curves, I agree. The confidence interval between the estimates obtained from different runs should then not be zero. It should rather be “relevant”, or at least reasonable when compared to the CI that could be obtained from a simple resampling of the observations.
Distance criterion: If my reasoning is right, the distance criterion, which throughout the paper is used to assess the performance of DMs, is then not really relevant, because for me the best possible value for PEI is not 1 but 0, and the optimal point is not the point (1,1) mentioned at line 157.
As a consequence, I also fear that some conclusions of the results section are wrong (all those relative to PEIt and to DT). For me, SOC does not lead in four out of five locations (l. 386) in Table 3.
Extreme Rainfall Indices assessment
I have several concerns here also.
Equation 5. This equation does not have much to do with the KGE equation, as the authors disregard the correlation component. To avoid confusion and misinterpretation I strongly suggest not referring to the KGE and using another name for this coefficient. Next, this criterion combines an evaluation of the mean and of the dispersion. I would suggest providing both evaluations separately. This may provide more understanding of the differences between DMs.
The data considered to produce the box plots are not clear to me. Can you clarify whether you calculate n different mKGE values for the n runs of the disaggregation process?
Equation 6. The authors want to compare the mKGE obtained for one given DM with the mKGE obtained for a reference. A new reference is introduced here. I would strongly suggest using the same reference as the one introduced to assess the IDF precision efficiency. The different reference models suggested above would be possible here.
Disaggregation Models considered for this work
Many DMs have been presented in the past (see, for example, Koutsoyiannis, 2003, Pui et al. 2012 and the references therein). I have the feeling that (some of the) DMs considered in the present work refer to rather old approaches (e.g. the SOC DM). Could you give more recent references where those models have been applied and evaluated? This would allow justifying the choice of these approaches and showing that they give reasonable performances in different contexts.
Because of their simplicity and parsimony, analytical microcanonical Multiplicative Random Cascades (MRC) have been widely used in the past for many applications in hydrology (e.g. Paschalis et al., 2014; Maloku et al. 2023). I strongly suggest including one such MRC model (e.g. Maloku et al. 2023, which merges different MRC approaches).
Other comments
Introduction: Can you clarify why it is interesting to consider IDF curves for an evaluation of DMs?
Dataset: The rainfall time series used for the application have very different lengths. (Is there an interest in having much longer time series for 2 stations?) Is there some dependency of the DM performance on the length? The length likely impacts the precision of parameter estimation and thus the importance of stochastic variability in the evaluation (e.g. in the "reference" generated by resampling from observations in the criterion PEI). In my opinion, a more homogeneous dataset would be worthwhile and would prevent comments on this non-homogeneity. Otherwise, a discussion of this length issue would be welcome.
Lines 44-64: I am not convinced by the classification proposed. What is the difference between the second and third approaches? A summary table would be worthwhile to present all these assessment approaches and identify the criteria, (multi)scales, and precipitation characteristics considered for the evaluation. Note that for me, an important evaluation of WGENs (and DMs) is (should be) also impact oriented – cf. assessing the ability to reproduce discharge floods after simulation with some hydrological model (a number of works have proposed such evaluations). This may be especially relevant here, for instance, as IDFs are not of interest per se but because they are often produced and used for hydrological design purposes.
Ln 98. IDF curve construction. For a given return period, how are return levels estimated from observations? Do you use the Gumbel distribution to fit the data?
Ln 104-105. The way the uncertainty bounds are produced from resampling observations is not clear. What are the multiple samples of d-hour AMP? How many samples? For a given T? For all T? How many values in each sample?
Equation 1. Clarify why this equation? Which assumptions / choices are behind it? I understand that you assume i(d=0) = AMP24/12, which turns to i(d=0) = 2*(AMP24/24) = 2*i(d=24). Is it right? Why?
The assumption i(d=24) = 0 is not convincing. You know exactly that i(d=24) = AMP24/24
Ln 121: PEIT criterion: according to what I understand from Eq. 3, I do not understand why this index estimates the efficiency of the "precision". What is the precision? Or do you mean this is an index of "precision efficiency"? If yes, what is it?
Ln 135. I agree with this statement, but not with lines 144-146. Note that a deterministic model (e.g. that of Ormsbee, 1989) would have PEIT = 0. But such a model would obviously be under-dispersive (dispersion is actually zero between runs).
Extreme rainfall indices
Ln 165. To me, the probability of dry steps is a more common criterion to assess intermittency
Ln 168. The 5-hour duration is not really critical for a large range of basins. The critical duration depends on the time of concentration of the basin, and this time of concentration varies a lot from one catchment to another. Different durations could thus be considered (as in the IDF curves).
Ln 168-170. How is P95 defined? Is it based on daily precipitation? Hourly? Does it vary from one year to another?
Ln 170. Which "last 2 indices"? If you refer to indices 3 and 4 of the previous sentence, they do not, in my opinion, assess the duration and magnitudes of extreme precipitation pulses. How do you define "pulses" (on which duration(s) are they defined)? The first index looks like a frequency (but I do not understand why this index should differ from 5% of the time if P95 is defined from hourly data). I do not understand well the meteorological / hydrological interest of the second index. Can you clarify?
Section 3.2. I would suggest summarizing the description of the method more briefly, as it is well known. The different equations are not necessarily needed, for instance.
Ln 252-253. The k-NN model considered here does not depend on the local rainfall pattern around the daily amount to disaggregate. This may be too important a simplification. A number of works have shown that the sub-daily structure of precipitation is highly dependent on this local rainfall pattern (cf. Ormsbee, 1989; Guntner et al., 2001; Maloku et al., 2023). Testing the initial Sharma et al. 2006 method would perhaps be worthwhile.
Ln 270-280 and the equations that follow: I do not understand the purpose of this text and of those equations. Can you clarify?
Ln 310. If I understand well, the disaggregated rainfall of one day can cross from that day into the next one. Is it possible to have two rainfall amounts at the same time, from day i and day i+1? If yes, do you sum them?
Ln 330-338. Can you clarify? I do not understand what you do here and what it is for. I understand that you produce 300 disaggregated time series. I had understood from Fig. 1 that each time series is used in turn to produce one set of IDF curves. Here, I now understand that you merge the data from the 300 time series to do your frequency analysis.
Ln 338. What are "the main disaggregation runs"? How are they produced? Defined?
Ln 339. Do you proceed differently for the third DM? If yes, I do not understand why.
Ln 340. I do not understand: what are the calibration results? Do you make any distinction between simulation results obtained for calibration and simulation results obtained from the application of the disaggregation process to the daily data? Can you clarify?
Ln 345. In this observation configuration, how are the 1000 AMP series generated?
Figure 3. Can you complete the caption? What do the ellipses correspond to?
Ln 403 and in Table 5: what does "the 50th percentile mKGE" of each ERI time series mean? This suggests that you compute multiple mKGE values for each generated time series, which is confusing. Or do you compute one mKGE for each generated time series and then produce the boxplots of those n values (300 values)?
References
Bennett, B., Thyer, M., Leonard, M., Lambert, M., and Bates, B.: A comprehensive and systematic evaluation framework for a parsimonious daily rainfall field model, J. Hydrol., 556, 1123–1138, https://doi.org/10.1016/j.jhydrol.2016.12.043, 2018.
Cappelli, F., Volpi, E., (...), and Grimaldi, S.: Sub-daily rainfall simulation using multifractal canonical disaggregation: a parsimonious calibration strategy based on intensity-duration-frequency curves, Stoch. Environ. Res. Risk Assess. (SERRA), 2025.
Kim, H.-H. Kwon, S.-O. Lee, and S. Kim: Regionalization of the Modified Bartlett–Lewis rectangular pulse stochastic rainfall model across the Korean Peninsula, J. Hydro-Environ. Res., 11, 123–137, https://doi.org/10.1016/j.jher.2014.10.004, 2016.
Guntner, A., Olsson, J., Calver, A., and Gannon, B.: Cascade-based disaggregation of continuous rainfall time series: the influence of climate, Hydrol. Earth Syst. Sci., 5, 145–164, 2001.
Koutsoyiannis, D.: Rainfall disaggregation methods: Theory and applications, in: Proceedings, Workshop on Statistical and Mathematical Methods for Hydrological Analysis; Università di Roma “La Sapienza”, May 2003, Rome, Italy, 1–23, https://doi.org/10.13140/RG.2.1.2840.8564, 2003.
Maloku, K., Evin, G., and Hingray, B.: Generation of sub-daily precipitation time series anywhere in Switzerland by mapping the parameters of GWEX-MRC, an at-site weather generator, J. Hydrol. Reg. Stud., https://doi.org/10.1016/j.ejrh.2025.102454, 2025.
Maloku, K., Hingray, B., and Evin, G.: Accounting for precipitation temporal asymmetry in a Multiplicative Random Cascades Disaggregation Model, Hydrol. Earth Syst. Sci., https://doi.org/10.5194/hess-27-3643-2023, 2023.
Ormsbee, L. E.: Rainfall disaggregation model for continuous hydrologic modelling, J. Hydraul. Eng., 115, 507–525, 1989.
Paschalis, A., Molnar, P., Fatichi, S., and Burlando, P.: On temporal stochastic modeling of precipitation, nesting models across scales, Adv. Water Resour., 63, 152–166, https://doi.org/10.1016/j.advwatres.2013.11.006, 2014.
Pui, A., Sharma, A., Mehrotra, R., Sivakumar, B., and Jeremiah, E.: A comparison of alternatives for daily to sub-daily rainfall disaggregation, J. Hydrol., 470-471, 138–157, https://doi.org/10.1016/j.jhydrol.2012.08.041, 2012.
Citation: https://doi.org/10.5194/egusphere-2025-2710-RC3
AC3: 'Reply on RC3', Claudio Sandoval, 05 Sep 2025
This manuscript aims to present a framework for the assessment of rainfall disaggregation models (DM) with a focus on the generation of relevant extreme precipitation. For illustration, the framework is applied to 3 different DMs at 5 locations worldwide where hourly observations of rainfall are available.
The aim of the paper is very relevant. Such a framework would be indeed very valuable to assess individual DMs and to compare them. The paper has however a number of significant limitations and the work needs major revisions. I really encourage the authors to submit a modified version.
ANS: We thank you for acknowledging the relevance of the paper and for the constructive comments provided. We recognize that several aspects of the manuscript required clarification and improvement. In the revised version, we have carefully addressed each of your points, making substantial changes to improve the clarity, focus, and presentation of the framework.
Added value of this framework when compared to already existing ones
- A number of multiscale evaluations have already been proposed to assess weather generators (WGENs) and / or to compare DMs. What is the added value of this new framework, for instance compared to Bennett et al. 2018 or Maloku et al. 2023, where multiple temporal scales and precipitation extremes are considered? Or compared to Cappelli et al. 2025, where IDF curves are the aim of the disaggregation?
ANS: As you note, several previous studies have already proposed multiscale evaluations and frameworks for rainfall models. Bennett et al. (2018), for example, developed a comprehensive assessment at daily, monthly, and annual scales, focusing on general rainfall characteristics such as occurrences, amounts, and dry/wet spell distributions, and also assessing daily annual maximum precipitation. Maloku et al. (2023) evaluated multiplicative random cascade models across multiple temporal scales, incorporating both general statistics and extremes, including precipitation amounts associated with 5- and 20-year return levels. More recently, Cappelli et al. (2025) directed their evaluation specifically toward the reproduction of IDF curves by comparing observed and simulated annual maximum precipitation.
Although we recognize that extreme rainfall is considered in these works in a very detailed and comprehensive way, our framework is explicitly designed to systematize these diverse aspects into two complementary components: (i) the representation of annual maximum precipitation across durations, expressed as IDF curves, and (ii) the representation of extreme rainfall indices (ERIs), which are conceptually related to several of the metrics adopted in previous studies. For example, the rainfall occurrence used in Bennett et al. can be related to our R95%, which measures the number of hours exceeding the 95th percentile; their rainfall amounts correspond to our P>R95%, i.e., the precipitation volume in those extreme hours; and their distributions of dry spells can be compared to our TDD, which aggregates the total annual duration of dry periods from hourly data. Similarly, Maloku et al. evaluated the proportion of wet steps and the mean length of wet spells, which are conceptually analogous to our R95% and TDD, respectively (although we emphasize dry periods instead of wet). They also analyzed rainfall amounts at 40-min resolution for specific return periods, which are analogous to our generalized representation of extremes through IDF curves across multiple durations and return periods.
What differentiates our contribution is that these characteristics are synthesized into unified indicators: AEI, PEI, and DT for IDF curves, and ERIs evaluated through mKGE applied directly to time series. This synthesis reduces complex information into single-valued measures that can be jointly represented in a common graphical space (e.g., AEI/PEI–DT plots for IDF curves, or boxplots of mKGE across ERIs), thereby enabling a straightforward and easily appreciable comparison among disaggregation methods. Importantly, our analysis is carried out at the hourly scale, i.e., a finer temporal resolution than Bennett et al. (2018), which is especially relevant for hydrological applications in fast-responding catchments. While our framework does not yet address spatial disaggregation or exhaustively cover all possible methodologies and metrics, its unifying and systematizing character provides a clear added value for evaluating rainfall disaggregation methods.
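For illustration, the three ERIs mentioned above can be computed from an hourly series along these lines. This is a schematic sketch, not the study's actual code: the 0.1 mm wet-hour threshold and the convention of taking the 95th percentile over wet hours only are assumptions of this sketch.

```python
import numpy as np

def extreme_rainfall_indices(hourly_mm, p95=None):
    """Sketch of three ERIs from one year of hourly rainfall:
      R95    - number of hours exceeding the 95th percentile,
      PgtR95 - precipitation volume accumulated in those hours,
      TDD    - total annual duration of dry hours.
    Wet-hour threshold (0.1 mm) and percentile convention are assumed."""
    x = np.asarray(hourly_mm, dtype=float)
    wet = x[x >= 0.1]
    if p95 is None:
        p95 = np.percentile(wet, 95) if wet.size else 0.0
    exceed = x > p95
    r95 = int(exceed.sum())             # hours above the 95th percentile
    p_gt_r95 = float(x[exceed].sum())   # volume in those extreme hours
    tdd = int((x < 0.1).sum())          # total dry duration (hours)
    return r95, p_gt_r95, tdd
```

Applied year by year, this yields the annual ERI series that the framework compares between disaggregated and observed data.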
Evaluation for ungauged stations
- The framework is applied to different stations where rainfall observations are available. The performance of each DM is assessed by comparison to a reference which has been built from observations available at the considered site. What is proposed to assess DMs for disaggregating a daily time series available at a given site where no sub-daily data are available? This point would be worth clarifying / commenting on. The target of the paper has to be clarified with respect to this issue also. This is obviously an important operational issue (see Kim et al. 2016, Maloku et al. 2025 and references therein). The evaluation framework proposed here obviously does not need to include this issue. However, the importance of also evaluating the potential for regionalization of the model would be worth mentioning in a discussion. One could look for a model with parameters that are robust and do not vary a lot in space, for instance (cf. Maloku et al. 2023). The possibility (need?) to apply a kind of regional assessment with hourly stations available (or not) in the neighborhood of the target station would also be worth mentioning.
ANS: We agree that the framework, as presented here, requires sub-daily observations at the site of interest to evaluate the performance of disaggregation methods, since these observations are used to construct the observed IDF curve against which both the reference curve and the disaggregated IDF curves are compared. This limits its direct applicability in ungauged stations where only daily data is available. While addressing this challenge is beyond the scope of the present work, we acknowledge that it is a fundamental problem for operational use.
In the revised manuscript we will add a discussion of possible avenues for extending the framework in such cases, including regionalization strategies, transfer of parameters from nearby hourly stations, and approaches that aim to develop spatially robust models. An additional possibility would be to adopt a more empirical approach, directly using the available data without parameter transfer, which could still provide valuable insights, for example, in locations where daily data is available but subdaily data is missing. We believe that such alternatives may also be helpful for extending the framework in future work towards not only temporal, but also spatio-temporal disaggregation, by incorporating strategies for spatial correlation. Our aim will be to clarify that the current study is focused on evaluating disaggregation methods where sub-daily observations are available, but that future applications of the framework should also consider regional and empirical approaches to extend its usefulness to ungauged sites.
Seasonal performance of DMs
- The ability of DMs to produce relevant simulations is typically found to depend on season. This point would be worth discussing in the manuscript.
ANS: We agree that the performance of disaggregation methods is often season-dependent, as rainfall regimes and storm characteristics vary throughout the year. Our framework does not include an explicit seasonal component by default, since the primary aim of this work was to demonstrate its general structure and applicability at the annual scale. However, the framework can readily be applied in a seasonal manner by calculating IDF-based metrics and ERIs separately for each season or month, thereby allowing direct evaluation of seasonal performance. In the revised manuscript we will clarify this point more explicitly, emphasizing that while seasonality was indirectly addressed through the way the illustrative methods were calibrated, the framework itself can be straightforwardly adapted to explicitly assess seasonal differences in future applications.
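As a sketch of such a seasonal application (the standard DJF/MAM/JJA/SON grouping is an assumption chosen for illustration; any metric of the framework could stand in for the one shown):

```python
import numpy as np

# Illustrative seasonal grouping; other groupings (e.g. monthly) work
# the same way.
SEASONS = {"DJF": (12, 1, 2), "MAM": (3, 4, 5),
           "JJA": (6, 7, 8), "SON": (9, 10, 11)}

def seasonal_metric(months, values, metric=np.mean):
    """months: month (1-12) of each hourly step; values: hourly rainfall.
    Returns {season: metric over the values falling in that season}."""
    months = np.asarray(months)
    values = np.asarray(values, dtype=float)
    return {s: float(metric(values[np.isin(months, m)]))
            for s, m in SEASONS.items()}
```

The same masking applies equally to the IDF-based metrics: extracting seasonal maxima instead of annual maxima before the frequency analysis.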
Criteria used to assess the DM ability to reproduce IDF curves
- In the manuscript, DMs are first assessed on their ability to reproduce IDF curves derived in a previous step from observations. A relevant simulation of IDFs is obviously important for DMs. For this assessment, the authors propose two criteria (AEI and PEI in Eq. 2 and 3), further combined into one single distance criterion (Dt, Eq. 4). Each criterion is normalized such that its value for a perfect DM is one and its value is negative if its performance is lower than that of a reference model. This normalization facilitates the assessment. I see two major limitations here:
Criterion AEI: the reference model is a straight line in the IDF graph (see Fig. 1b). This model is a poor reference model, especially in the way it is built. This reference is far from the observed IDF for all durations (even for the day) and all return periods, and it may thus not really be able to highlight bad models. More relevant references could (should) be considered, as suggested here. For each return period T:
Reference A: The shape of the reference is not linear but follows a classical analytical formulation used for IDF curves (in the form of a Talbot or a Montana model). To define the coefficients of the model, one could assume that 1) the reference ends at the point which is known in the considered IDF curve, that is, the observed return level (estimated from the observations) for the daily resolution, and 2) that (for instance) the maximum intensity for a 1-hour duration is 2 times that for the daily duration (2 being arbitrarily chosen).
Reference B: A simple deterministic DM is used to produce a time series from which the IDF curve is determined. Such a model could be the linear model of Ormsbee (1989), also considered as a reference by Hingray and Ben Haha (2005) in their comparison work.
Reference C: A simple constant disaggregation model is considered (constant intensity assumption for each daily time step).
Reference D: a single climatological sub-daily pattern, determined each month from the observations (e.g. following the Huff et al. methodology considered in the paper), is used for the disaggregation of all days of this month.
In my opinion, one such reference would make some sense (I guess more than the proposed one).
ANS: Thank you for the comment. We agree that the reference IDF curve used in the AEI computation is deliberately simple and does not necessarily represent a realistic model of rainfall extremes. Our intention was not to propose this curve as a hydrologically meaningful model, but rather as a very basic benchmark (i.e., a “minimal solution” that reflects what a decision-maker with only daily data could construct in practice). The purpose was to provide a simple lower bound against which the added value of disaggregation can be assessed.
We also recognize the concern about arbitrariness. In our view, a benchmark should meet a few general criteria: (i) it should not be a formal disaggregation method itself, but rather a simplified baseline; (ii) it should reproduce the expected decreasing behavior of IDF relationships; and (iii) it should avoid arbitrary parameter choices or the need for calibration. At an earlier stage we explored alternatives, such as exponential shapes, but these required parameter calibration and were therefore less consistent with the above criteria. In the absence of a better option, we proposed the linear form as the simplest possible benchmark.
We do appreciate the reviewer’s point, and it is a concern we also discussed in detail. If a more suitable benchmark can be suggested—one that avoids arbitrary parameters while still remaining simple and not a proper disaggregation method in itself—we would be glad to adopt and implement it. In any case, we recognize that these general criteria for constructing the benchmark were not clearly conveyed in the manuscript, and we will make sure to present them explicitly in the revised version.
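For concreteness, the simplest of the reviewer's candidate baselines (Reference C, the constant-intensity assumption) can be stated in a few lines. This is purely illustrative of the proposed benchmark, not an endorsement of any particular choice:

```python
import numpy as np

def constant_disaggregation(daily_mm):
    """Sketch of the reviewer's Reference C benchmark: spread each daily
    total uniformly over its 24 hours. Deriving an IDF curve from the
    resulting hourly series would give a simple baseline against which
    the DMs could be compared."""
    return np.repeat(np.asarray(daily_mm, dtype=float) / 24.0, 24)
```

By construction this baseline satisfies criteria (i) and (iii) above, and its IDF curve is flat in intensity for durations up to 24 h, which is the property that makes it a deliberately weak reference.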
- Criterion PEI: the reference model is the ensemble of IDF curves obtained from a resampling of observed data. The authors mention that the value 1 is the best possible value and that negative values are not desirable (lines 144-145). This is not what I understand from Equation 3. For me, a value of 1 indeed indicates no uncertainty in the disaggregation process. I do not see why a DM should give no uncertainty. For me, the disaggregation cannot be deterministic. So, different disaggregation runs will (cannot but) lead to different IDF curves, I agree. Then the confidence interval between the estimates obtained from different runs should not be zero. It should however be "relevant", or at least reasonable when compared to the CI that could be obtained from a simple resampling of observations.
ANS: We thank you for raising this point, which is indeed central to the interpretation of the PEI. In our initial formulation, we defined the optimal value as 1, reflecting the idea of “maximum reproducibility” with no dispersion across realizations. However, we agree with you that such a deterministic outcome is neither realistic nor desirable, since stochastic disaggregation methods should naturally reproduce a certain level of variability. In light of this, we acknowledge that the conceptual optimum should indeed be PEI = 0, corresponding to a dispersion consistent with the observational benchmark.
We also recognize, however, that the question of whether a disaggregation method that produces slightly lower variability than the observed benchmark should always be considered problematic is less straightforward. One might argue that under-dispersion could still be acceptable in certain contexts, depending on the application, whereas over-dispersion systematically inflates extremes and is clearly undesirable. We appreciate that your suggestion helps to clarify this conceptual distinction, and we will adopt the formulation with PEI = 0 as the reference point in the revised manuscript, while also making explicit that the interpretation of under-dispersion versus over-dispersion deserves further consideration.
- Distance criterion: If my reasoning is right, the distance criterion, which is then used throughout the paper to assess the performance of DMs, is not really relevant. Because for me, the best possible value for PEI is not 1 but 0, and the optimal point is not the point (1,1) mentioned at line 157. As a consequence, I fear that some conclusions of the result section are wrong (all those relative to PEIT and to DT). For me, SOC does not lead in four out of five locations (ln 386) in Table 3.
ANS: We appreciate your comment; it has led us to revise our assumptions, and we think the optimal PEI should be 0 instead of 1. Consequently, the definition of the distance criterion and the interpretation of the AEI–PEI space will be revised accordingly. We acknowledge that this affects some of the conclusions in Sect. 5, including statements regarding the relative performance of SOC in Table 3, and we will carefully revisit and reformulate these parts of the manuscript to ensure that the conclusions are consistent with the revised interpretation of PEI and the distance criteria.
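Under this revision, a minimal sketch of the distance criterion could read as follows, assuming the optimum moves to (AEI, PEI) = (1, 0) and keeping a Euclidean form analogous to Eq. 4; the exact revised expression remains to be defined in the manuscript:

```python
import numpy as np

def distance_criterion(aei, pei):
    """Sketch of a revised distance in the AEI-PEI space: distance to the
    optimum (AEI, PEI) = (1, 0), i.e. perfect accuracy and a dispersion
    matching the observational benchmark. Smaller is better."""
    return float(np.hypot(aei - 1.0, pei - 0.0))
```

With this convention, a deterministic DM (PEI far from 0 on the under-dispersive side) is no longer rewarded, which addresses the reviewer's objection.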
Extreme Rainfall Indices assessment
I have several concerns here also.
- Equation 5. This equation has little to do with the KGE equation, as the authors disregard the correlation component. To avoid confusion and misinterpretation, I strongly suggest not referring to the KGE and using another name for this coefficient. Next, this criterion combines an evaluation of the mean and of the dispersion. I would suggest providing both evaluations separately. This may provide more understanding of differences between DMs.
ANS: We agree that referring to Equation 5 as a (modified) KGE can be confusing since the correlation component is not included. Although we used the prefix “mKGE” to signal that this was not the original KGE, we acknowledge that the label may not be fully appropriate in this context. In the revised manuscript we will rename the coefficient as “Moments Bias Coefficient” and clarify its definition.
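A plausible form of the renamed coefficient, assuming Eq. 5 keeps the KGE mean and dispersion ratios while dropping the correlation term (the exact published form may differ from this reading):

```python
import numpy as np

def moments_bias_coefficient(sim, obs):
    """Assumed sketch of the 'Moments Bias Coefficient':
        MBC = 1 - sqrt((mu_s/mu_o - 1)^2 + (sigma_s/sigma_o - 1)^2)
    i.e. the KGE with the correlation component removed. MBC = 1 means
    the disaggregated series matches the observed mean and dispersion."""
    sim = np.asarray(sim, dtype=float)
    obs = np.asarray(obs, dtype=float)
    beta = sim.mean() / obs.mean()                  # bias in the mean
    alpha = sim.std(ddof=1) / obs.std(ddof=1)       # bias in the dispersion
    return 1.0 - np.hypot(beta - 1.0, alpha - 1.0)
```

Reporting beta and alpha alongside the combined value would also answer the reviewer's request for separate mean and dispersion evaluations.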
- The data considered to produce the box plots are not clear to me. Can you clarify whether you calculate n different mKGE values for the n runs of the disaggregation process?
ANS: Exactly. For each disaggregated series (i.e., each run) and for each ERI, we calculate the annual time series of that metric and then obtain a single mKGE value by comparing the disaggregated series against the observed one. This is repeated for all n runs, which yields the distribution of n mKGE values from which the boxplots (one per ERI) are constructed. We will explain this with more clarity in the caption of Figure 1.
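Schematically, the procedure for one ERI looks like the following (the data here are hypothetical stand-ins, and the efficiency uses the mean/dispersion-only form described in these responses, which may differ in detail from the manuscript's Eq. 5):

```python
import numpy as np

rng = np.random.default_rng(0)

def mkge(sim, obs):
    # Mean/dispersion-only efficiency (assumed form of Eq. 5)
    b = sim.mean() / obs.mean()
    a = sim.std(ddof=1) / obs.std(ddof=1)
    return 1.0 - np.hypot(b - 1.0, a - 1.0)

# Hypothetical data standing in for one ERI: one observed annual series
# and n = 300 disaggregation runs of the same annual ERI.
n_years, n_runs = 30, 300
obs = rng.gamma(shape=5.0, scale=10.0, size=n_years)
runs = obs * rng.lognormal(0.0, 0.1, size=(n_runs, n_years))

# One mKGE per run: the 300 values behind one ERI boxplot
mkge_values = np.array([mkge(run, obs) for run in runs])
```

Each ERI thus contributes one boxplot of 300 values, which is what the figures summarize.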
- Equation 6. The authors want to compare the mKGE obtained for one given DM with the mKGE obtained for a reference. A new reference is introduced here. I would strongly suggest using the same reference as the one introduced to assess the IDF precision efficiency. The different reference models suggested above would be possible here.
ANS: We agree that in principle it would be desirable to have benchmarks that are more comparable or logically equivalent across the different components of the framework. However, IDF curves and ERIs capture fundamentally different aspects of rainfall behavior, which is why we used different references.
Disaggregation Models considered for this work
- Many DMs have been presented in the past (see, for example, Koutsoyiannis, 2003, Pui et al. 2012 and the references within). I have the feeling that (some) DMs considered in the present work refer to rather old approaches (e.g. the SOC DM). Could you give more recent references where those models have been applied and evaluated? This would allow justifying the choice of those approaches and showing that they give reasonable performance in different contexts.
Because of their simplicity and parsimony, analytical microcanonical Multiplicative Random Cascades (MRC) have been widely used in the past for many applications in hydrology (e.g. Paschalis et al., 2014; Maloku et al. 2023). I strongly suggest including one such MRC model (e.g. Maloku et al. 2023, which merges different MRC approaches).
ANS: We recognize that the point is valid: some of the methods applied in this study were originally proposed several decades ago. For your information, as previously clarified in response to another reviewer, in the case of Huff this was partly due to misleading phrasing, since what we applied was a DM based on Huff curves rather than directly using the original 1967 procedure. We will further clarify this in the manuscript. Regarding the illustration of the framework with more recent methods, we welcome the suggestion of incorporating a multiplicative random cascade model, which we consider an ideal candidate to add as a disaggregation method in the revised analysis.
Other comments
- Introduction: Can you clarify why it is interesting to consider IDF curves for an evaluation of DMs?
ANS: The motivation for including IDF curves is that they are a central tool in hydrology and engineering practice, widely used for hydraulic design, flood risk assessment, and urban drainage planning. Evaluating whether disaggregation methods can reproduce observed IDF relationships therefore provides a direct and operationally relevant measure of their performance. In the revised manuscript we will make this rationale more explicit in the Introduction.
- Dataset: The rainfall time series used for the application have very different lengths. (Is there an interest in having much longer time series for 2 stations?) Is there some dependency of the DM performance on the length? The length likely impacts the precision of parameter estimation and thus the importance of stochastic variability in the evaluation (e.g. in the "reference" generated by resampling from observations in the criterion PEI). In my opinion, a more homogeneous dataset would be worthwhile and would prevent comments on this non-homogeneity. Otherwise, a discussion of this length issue would be welcome.
ANS: The stations used in this study were drawn from different regions of the world, primarily from the INTENSE project dataset, which provides high-quality sub-daily rainfall series with rigorous quality control. In addition, the Quinta Normal station in Chile was included to bring the framework into the Chilean context as well. The resulting heterogeneity in record lengths therefore reflects the diversity of the available data sources rather than a deliberate design choice. We agree with the reviewer that differences in record length can indeed have implications: they may affect the degree of stationarity assumed when estimating parameters over the entire record, and they reduce consistency between stations, making cross-comparisons less straightforward. In the revised manuscript we will explicitly acknowledge this limitation and discuss how record length may influence the evaluation of disaggregation methods.
- Lines 44-64: I am not convinced by the classification proposed. What is the difference between the second and third approaches? A summary table would be worthwhile to present all these assessment approaches and identify the criteria, (multi)scales, and precipitation characteristics considered for the evaluation. Note that for me, an important evaluation of WGENs (and DMs) is (should be) also impact oriented – cf. assessing the ability to reproduce discharge floods after simulation with some hydrological model (a number of works have proposed such evaluations). This may be especially relevant here, for instance, as IDFs are not of interest per se but because they are often produced and used for hydrological design purposes.
ANS: The distinction we intended is that the second type of study refers to systematic comparisons of existing disaggregation methods within a given evaluation scheme, but usually limited to the scope of that specific study. The third type, in contrast, seeks to go beyond such one-off comparisons by proposing an integrative framework that can be applied more generally. In other words, the third type corresponds to a systematization of the evaluation process itself, which is precisely the aim of the framework we propose here. In the revised manuscript we will clarify this difference more explicitly. We also agree that a summary table would be valuable. We will add a table that synthesizes the different evaluation approaches, identifying for each the criterion used, the temporal scale(s), and the precipitation characteristics that are assessed.
Regarding the impact-oriented evaluations, we acknowledge that these are highly relevant, since disaggregated rainfall is often used in practice as input to hydrological models for flood estimation and design purposes. However, such analyses are outside the scope of this paper, which is focused on introducing the framework. Nevertheless, in the revised manuscript we will emphasize in the discussion that the framework is flexible and could be readily extended to impact-oriented applications in future studies.
- Ln 98. IDF curve construction. For a given return period, how are return levels estimated from observations? Do you use the Gumbel distribution to fit the data?
ANS: The construction of the observed IDF curve in this study follows an empirical approach. For each duration, we extract the annual maximum precipitation (AMP) for every year in the record, and then perform a frequency analysis using the Weibull plotting position based on ranking. This allows us to obtain quantiles associated with all the return periods considered for each duration. By connecting the ordered pairs across durations for a given return period, we then derive the corresponding observed IDF curve, resulting in one curve per return period. We note that the estimation of the uncertainty around these curves does require a statistical fit. For this purpose, we apply the Gumbel distribution to characterize the variability of the return levels, as explained later in these responses. This will be explained with more detail in Section 2.2.1.
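For concreteness, the empirical construction described in this answer can be sketched as follows (a minimal illustration, not the paper's code: function and variable names are hypothetical, depths are assumed in mm, and intensities in mm/h):

```python
import numpy as np

def empirical_idf(amp, durations):
    """Empirical IDF quantiles via the Weibull plotting position.

    amp: dict mapping duration d (h) -> array of annual maximum
         precipitation depths (mm), one value per year.
    Returns dict: d -> list of (return_period, intensity mm/h),
    ordered from the largest observed maximum downwards.
    """
    curves = {}
    for d in durations:
        x = np.sort(np.asarray(amp[d], dtype=float))[::-1]  # descending depths
        n = len(x)
        ranks = np.arange(1, n + 1)          # m = 1 for the largest value
        T = (n + 1) / ranks                  # Weibull return periods T = (n+1)/m
        intensity = x / d                    # depth (mm) -> intensity (mm/h)
        curves[d] = list(zip(T, intensity))
    return curves
```

Connecting the (duration, intensity) pairs that share a return period across durations then yields one empirical IDF curve per return period.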
- Ln 104–105. The way the uncertainty bounds are produced from resampling observations is not clear. What are the multiple samples of d-hour AMP? How many samples? For a given T? For all T? How many values in each sample?
ANS: We generate 1000 samples for each duration (1, 2, …, 24 h), as detailed in Sect. 4.2. For each duration, the samples are drawn from the Gumbel distribution fitted to the observed annual maximum precipitation values of that duration. Each sample contains as many values as the number of years in the historical record. Then, for each return period, the corresponding quantile is computed in each of the 1000 samples. This procedure yields the distribution of intensities for every duration and return period, from which the uncertainty bounds around the observed IDF curves are derived. Here we will add a reference to Section 4.2 (in parentheses) for greater clarity.
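A minimal sketch of this resampling scheme, assuming a method-of-moments Gumbel fit and a refit of each synthetic sample before evaluating the T-year quantile (one plausible reading of the procedure; the paper's exact fitting and quantile-extraction steps may differ):

```python
import numpy as np

rng = np.random.default_rng(42)

def gumbel_bounds(amp_obs, T, n_samples=1000, alpha=0.05):
    """Uncertainty bounds on the T-year return level for one duration.

    amp_obs: observed annual maxima (mm) for that duration.
    Each synthetic sample has the same length as the record;
    returns (lower, upper) percentile bounds of the return level.
    """
    amp_obs = np.asarray(amp_obs, dtype=float)
    n = len(amp_obs)
    # Method-of-moments Gumbel fit to the observations
    beta = np.std(amp_obs, ddof=1) * np.sqrt(6) / np.pi   # scale
    mu = np.mean(amp_obs) - 0.5772 * beta                 # location
    samples = rng.gumbel(mu, beta, size=(n_samples, n))
    p = 1.0 - 1.0 / T                                     # non-exceedance prob.
    levels = []
    for s in samples:                                     # refit each sample
        b = np.std(s, ddof=1) * np.sqrt(6) / np.pi
        m = np.mean(s) - 0.5772 * b
        levels.append(m - b * np.log(-np.log(p)))         # Gumbel quantile
    return np.quantile(levels, [alpha / 2, 1 - alpha / 2])
```

Repeating this for every duration gives the percentile bands around the observed IDF curves.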
- Equation 1. Clarify why this equation? Which assumptions/choices lie behind it? I understand that you assume i(d=0) = AMP24/12, which amounts to i(d=0) = 2·(AMP24/24) = 2·i(d=24). Is that right? Why?
ANS: The temporal distribution assumed for the daily AMP was designed to represent a simple decreasing pattern, where the highest intensities occur at the beginning of the day and gradually decrease towards the end. This criterion is admittedly arbitrary, but it was adopted in order to provide a unified, replicable, and straightforward way of constructing the reference IDF. Under this scheme, when extracting the most intense hours for the frequency analysis, they are always consecutive starting from the first hour, which naturally leads to the linear decreasing structure of the reference curve. We acknowledge that other plausible assumptions could have been made, such as the alternate block method (with the maximum pulse centered and decreasing alternately on both sides) or even random alternations between both temporal distributions, which would result in different reference IDFs.
Regarding the specific point raised, it is correct that the formulation leads to i(d=0) = AMP24/12, while i(d=24) = 0, i.e., rainfall ends at the end of the day. This is purely a consequence of the mathematical construction, since the hourly intensities are evaluated at the centroid of each pulse (0.5, 1.5, …, 23.5 h), which ensures mass balance.
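The mass balance claimed here can be verified with a short sketch (hypothetical function name; `amp24` is the daily annual maximum depth in mm):

```python
def linear_decreasing_pulses(amp24):
    """Hourly depths from the linearly decreasing within-day pattern.

    The intensity i(t) = (amp24 / 12) * (1 - t / 24) is evaluated at
    the centroid of each hourly pulse (0.5, 1.5, ..., 23.5 h), so that
    i(0) = amp24 / 12, i(24) = 0, and the 24 hourly depths sum to amp24.
    """
    centroids = [k + 0.5 for k in range(24)]
    return [(amp24 / 12.0) * (1.0 - t / 24.0) for t in centroids]
```

Summing the returned depths recovers the daily total exactly, which is the mass-balance property the centroid evaluation is meant to guarantee.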
- The assumption i(d=24) = 0 is not convincing. You know exactly that i(d=24) = AMP24/24
ANS: We would like to clarify that the assumption i(d=24) = 0 is correct under the way the temporal distribution was constructed in our study. We believe that the reviewer’s interpretation may stem from considering a uniform distribution of rainfall, in which case each hour would indeed receive AMP24/24, including the last one. We will clarify this point more explicitly in the revised manuscript to avoid confusion.
- Ln 121: PEIT criterion: from what I understand of Eq. 3, I do not see why this index estimates the efficiency of the “precision”. What is the precision? Or do you mean this is an index of “precision efficiency”? If yes, what is it?
ANS: In our framework, “precision” is understood as the consistency with which the disaggregated IDF estimates approach the observed (real) values across multiple realizations. A method is precise if repeated realizations converge closely around the observed benchmark, and less precise if they diverge widely. The PEI criterion was therefore designed as a “precision efficiency” index, in the sense that it evaluates how efficiently a method reproduces the level of dispersion that is consistent with the observational benchmark derived from resampling. In the revised manuscript we will make this definition explicit and clarify that PEI refers specifically to consistency around the observed values, rather than to accuracy, which is part of the AEI logic.
- I agree with this statement, but not with lines 144–146. Note that a deterministic model (e.g. that of Ormsbee, 1989) would have PEIT = 0. But such a model would obviously be under-dispersive (the dispersion between runs is actually zero).
ANS: We agree that a deterministic model, such as that of Ormsbee (1989), would always yield PEI = 0, since it produces no dispersion between runs. In this sense such a model is indeed under-dispersive and does not reflect the natural variability of precipitation. Our intention in lines 144–146 was not to suggest that this outcome is ideal, but rather to illustrate how the index behaves in limiting cases. In light of this comment and the related discussion raised by Reviewer 2, we will revise the logic of the PEI so that the optimal value is defined as a dispersion equal to that of the observational benchmark, neither greater nor smaller. This definition ensures that the framework evaluates whether a disaggregation method reproduces the natural level of uncertainty, avoiding both over- and under-dispersion. We will clarify this conceptual point in the revised manuscript to prevent misinterpretation and to emphasize that deterministic models, while assessable within the framework, are not desirable when the objective is to represent rainfall extremes consistently with their natural variability.
Extreme rainfall indices
- Ln 165. To me, the probability of dry steps is a more common criterion to assess intermittency
ANS: We agree that the probability of dry steps is a more common way of assessing intermittency. In essence, this is equivalent to our formulation, since it corresponds to dividing the total dry duration (TDD) by the total number of hours in a given year. In the revised manuscript we will make this equivalence explicit and adopt the more standard terminology to avoid ambiguity.
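The equivalence stated in this answer can be expressed in a one-line (hypothetical) helper, assuming the paper's wet-hour threshold of 0.1 mm:

```python
def dry_step_probability(hourly, wet_threshold=0.1):
    """Probability of dry hourly steps for one year of data.

    Equivalent to the paper's total dry duration (TDD) divided by
    the total number of hours in that year.
    """
    dry = sum(1 for p in hourly if p < wet_threshold)
    return dry / len(hourly)
```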
- Ln 168. The 5-hour duration is not really critical for a large range of basins. The critical duration depends on the time of concentration of the basin, and this time of concentration varies a lot from one catchment to the other. Different durations could thus be considered (as in the IDF curves).
ANS: We agree that the critical duration should depend on the time of concentration of the basin, which varies widely from one catchment to another. In this study, the 5-hour duration was adopted as an intermediate choice between the very fast response of small basins (around 1 hour or less) and that of larger basins with concentration times on the order of a full day or more. In the revised manuscript we will clarify this rationale.
- Ln 168–170. How is P95 defined? Is it based on daily precipitation? Hourly? Does it vary from one year to the other?
ANS: P95 is defined based on the observed hourly precipitation, considering only wet hours (p ≥ 0.1 mm). The P95 value is calculated once over the full observation record. We will improve the explanation of P95 in the manuscript.
- Ln 170. Which “last 2 indices”? If you refer to indices 3 and 4 of the previous sentence, in my opinion they do not assess the duration and magnitudes of extreme precipitation pulses. How do you define “pulses” (over which duration(s) are they defined)? The first index looks like a frequency (but I do not understand why this index should differ from 5% of the time if P95 is defined from hourly data). I do not understand well the meteorological/hydrological interest of the second index. Can you clarify?
ANS: The two indices referred to are indeed R95% and P>R95% (indices 3 and 4 in the text). Our intention was not to describe “pulses” per se, but rather to distinguish between two complementary aspects:
- R95% (duration): the number of hours in which precipitation exceeds the P95 threshold, i.e., the temporal extent of extreme precipitation.
- P>R95% (magnitude): the total precipitation amount accumulated during those hours, i.e., the contribution of extremes to annual rainfall.
We also note that R95% is not a trivial 5% of the time, as might be inferred. The P95 threshold is defined once for the whole record, and its exceedances can vary widely from year to year depending on how extremes are distributed. This index therefore captures the interannual variability in the occurrence of extreme precipitation.
If by “second index” the reviewer refers to P>R95%, we note that this is important because it characterizes the annual evolution of precipitation extremes above the P95 threshold. From an aggregated perspective, it quantifies how much rainfall is actually concentrated in events considered “extreme,” which is a key metric in both climatological and hydrological applications. In the revised manuscript we will clarify the terminology to avoid confusion and make explicit the definitions and relevance of these indices.
- Section 3.2. I would suggest summarizing the description of the method more briefly, as it is well known. The different equations are not necessarily needed, for instance.
ANS: We agree that the k-NN approach is well known and does not require as much detail as currently presented. In the revised manuscript we will streamline this section by shortening the description and removing equations that are not essential, while retaining the key information needed to understand its implementation within our framework.
- Ln 252–253. The k-NN model considered here does not depend on the local rainfall pattern around the daily amount to disaggregate. This may be too strong a simplification. A number of works have shown that the sub-daily structure of precipitation is highly dependent on this local rainfall pattern (cf. Ormsbee, 1989; Gunter et al., 2001; Maloku et al., 2023). Testing the initial Sharma et al. (2006) method would perhaps be worthwhile.
ANS: We fully agree with the theoretical foundation that the sub-daily structure of rainfall is influenced by the local rainfall pattern surrounding the target day, and that the original formulation of Sharma et al. (2006) accounts for this dependence by including antecedent and subsequent days when selecting analogues. In our case, however, we chose not to incorporate this element because it reduces flexibility in regimes with strongly seasonal precipitation. For example, in sites such as Quinta Normal (Chile), many summer months are almost completely dry, which means that including surrounding days would often leave no valid analogues to select from, even across long records. By simplifying the selection to focus on the daily amount itself, we ensured that the method could still be applied consistently across both wet and dry seasons.
We note, however, that seasonal rainfall patterns are indirectly captured in our implementation through the moving time window used to search for analogues, which restricts the pool of potential candidates to the same period of the year. Finally, we emphasize that there are different ways of implementing k-NN for disaggregation, each with its own strengths and weaknesses. The point of this paper is not to promote one specific implementation, but to demonstrate how the proposed framework can systematically evaluate such methods and highlight their relative advantages and limitations.
- Ln 270–280 and the equations that follow: I do not understand the purpose of this text and of those equations. Can you clarify?
ANS: Following the procedure of Alam and Elshorbagy (2015), the optimal half-window size for the k-NN method was calibrated using an error-based criterion. The purpose of this calibration is to select a window size that minimizes the difference between the observed IDF curves (derived from AMPs of different durations and return periods) and those obtained from the disaggregated series. We will clarify this point more explicitly in the revised manuscript. Hence, for each candidate window size:
- The AMPs of each duration d are normalized with respect to the maximum AMP of that duration over the entire record, so that their values range between 0 and 1 (for both observed and disaggregated series).
- For each duration d, the RMSE is calculated between the normalized observed and disaggregated AMPs (with N values in each series, where N is the number of years).
- A weighted average of the 24 RMSE values is then obtained, using as weights the inverse of the duration (to give more importance to shorter durations).
- This results in a single weighted RMSE (RMSEwei) for that candidate window size. The optimal window is the one that minimizes this RMSEwei.
We also acknowledge that in the current manuscript (Equations 12 and 13), the notation may lead to confusion: the terms max(AMP_{i,d}) and min(AMP_{i,d}) should not include the subscript i, because otherwise the normalization would incorrectly be applied relative to each value itself. What is actually used is the maximum (or minimum) AMP_d across all years. We will correct this notation in the revised version to avoid any misunderstanding. In addition, we will improve the wording of this section to make the procedure more reader-friendly, acknowledging that in its current form it may not be straightforward to follow.
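The calibration criterion described in the bullets above can be sketched as follows. This is a hedged illustration, not the paper's code: the min-max normalization follows the corrected reading of Eqs. 12–13 (max/min of AMP_d taken over all years), and all names are hypothetical:

```python
import numpy as np

def weighted_rmse(amp_obs, amp_dis, durations):
    """Duration-weighted RMSE between normalized observed and
    disaggregated annual maxima; weights 1/d emphasize short durations.

    amp_obs, amp_dis: dicts duration -> array of AMPs (one per year).
    The candidate half-window size minimizing this value is retained.
    """
    num = den = 0.0
    for d in durations:
        o = np.asarray(amp_obs[d], dtype=float)
        s = np.asarray(amp_dis[d], dtype=float)
        o = (o - o.min()) / (o.max() - o.min())   # normalize over the record
        s = (s - s.min()) / (s.max() - s.min())
        rmse = np.sqrt(np.mean((o - s) ** 2))
        w = 1.0 / d                               # shorter durations weigh more
        num += w * rmse
        den += w
    return num / den
```

Evaluating this quantity for each candidate window size and taking the argmin reproduces the selection step described above.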
- Ln 310. If I understand well, the disaggregated rainfall of one day can cross that day into the next one. Is it possible to have two rainfall amounts at the same time from day i and day i+1? If yes, do you sum them?
ANS: No, the disaggregated rainfall of one day cannot cross into the next day. This is precisely why the starting time of the disaggregation is restricted to a uniform random variable U(0, 24–d), which ensures that the full disaggregated event of duration d remains entirely within the corresponding day. We will clarify this in Section 3.3.
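The constraint described in this answer amounts to a one-line sampling rule (hypothetical helper):

```python
import random

def event_window(d, rng=random):
    """Random placement of a d-hour disaggregated event within one day.

    Drawing the start time from U(0, 24 - d) guarantees
    start + d <= 24, so the event never spills into the next day.
    """
    start = rng.uniform(0.0, 24.0 - d)
    return start, start + d
```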
- Ln 330–338. Can you clarify? I do not understand what you do here and what it is for. I understand that you produce 300 disaggregated time series. I had understood from Fig. 1 that each time series is used in turn to produce one set of IDF curves. Here, I now understand that you merge the data from the 300 time series for your frequency analysis.
ANS: We thank the reviewer for this question and for correctly understanding the procedure. Indeed, 300 disaggregated time series were generated, and each one was used independently to produce an IDF curve. The ensemble of 300 curves is then used to estimate the distribution of intensities for each duration and return period, from which we derive quantiles and their associated uncertainty.
The purpose of the text in lines 330–338 was to justify why using a large number of disaggregations provides a robust estimate of any given quantile and its uncertainty. In the limit, if n → ∞, the estimate of a quantile and its confidence bounds would converge to the “exact” values. With 300 realizations, we consider the sample size to be more than sufficient to provide a reliable estimate of both the quantiles and their uncertainty. We will clarify this explanation in the revised manuscript to avoid any possible misunderstanding.
- Ln 338. What are “the main disaggregation runs”? How are they produced? Defined?
ANS: By “main disaggregation runs” we refer to the series that were ultimately used to calculate the evaluation metrics related to both the IDF curves and the ERIs (i.e. validation series). This distinction was made to avoid confusion with the additional series generated in preliminary steps, such as those used to calibrate SOC and k-NN or to construct the Huff curves. Those calibration and validation steps also involved 300 realizations, but the “main runs” are specifically the ones employed in the final evaluation of the framework. We will clarify this in the manuscript.
- Ln 339. Do you do differently for the third DM??? if yes, I do not understand why
ANS: Yes, the procedure is different for the third disaggregation method because this approach does not require calibrating a parameter or a set of parameters. Instead, it relies on estimating the Huff curves for each site, which are then used as the basis for the disaggregation. This is why no additional calibration runs were needed in this case. This will be explained in detail in the manuscript.
- Ln 340. I do not understand: what are the calibration results? Do you make any distinction between simulation results obtained for calibration and simulation results obtained from applying the disaggregation process to the daily data? Can you clarify?
ANS: By “calibration results” we meant the simulations used solely to calibrate the parameters of SOC and k-NN (e.g., the minimum event threshold ε or the half-window size) and to obtain the Huff curves. These calibration runs were only a preliminary step to fix the parameters and were not included in the evaluation. The evaluation itself was based on a separate set of 300 disaggregated series generated after calibration, using the fixed parameters. These are the simulations that were then used to construct the IDF curves and to calculate the ERI-based metrics. In the revised manuscript we will clarify this distinction to avoid any misunderstanding.
- In this observation configuration, how are the 1000 AMP series generated?
ANS: This point was clarified earlier in the responses (question n°15): the 1000 AMP series are generated by resampling from the fitted Gumbel distribution for each duration, as described in Sect. 4.2. Each sample contains as many values as years in the observed record, and the ensemble of 1000 samples is then used to estimate the distribution of return levels and their uncertainty. We will include a parenthetical reference to Section 2.2.1 to make this clearer.
- Can you complete the caption? What do the ellipses correspond to?
ANS: The ellipses in Fig. 3 correspond to the different values taken by the Euclidean distance D_T, as shown as an example in Fig. 1c. They are plotted in order to clearly illustrate, for each ordered pair (AEI_T, PEI_T), the associated D_T value. In the revised manuscript we will clarify the caption accordingly.
- Ln 403 and Table 5: what does the 50th percentile mKGE of each ERI time series mean? This suggests that you compute multiple mKGE values for each generated time series, which is confusing. Or do you compute one mKGE for each generated time series and then produce boxplots of those n values (300 values)?
ANS: To clarify: for each disaggregated time series (one of the 300 realizations) and for each ERI, we compute a single mKGE value. This yields 300 mKGE values per ERI. The boxplots then summarize the distribution of these 300 values. In Table 5, the “50th percentile” refers to the median of this distribution of 300 mKGE values, not to percentiles within each individual time series. We will rephrase this part of the text to make it clearer.
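The computation described in this answer can be sketched as follows, assuming the Kling et al. (2012) form of the modified KGE, which uses the ratio of coefficients of variation as the variability term (the paper's exact mKGE definition may differ):

```python
import numpy as np

def mkge(sim, obs):
    """Modified Kling-Gupta efficiency between one disaggregated
    ERI annual series and the observed one (Kling et al., 2012 form)."""
    sim, obs = np.asarray(sim, dtype=float), np.asarray(obs, dtype=float)
    r = np.corrcoef(sim, obs)[0, 1]                       # correlation
    beta = np.mean(sim) / np.mean(obs)                    # bias ratio
    gamma = (np.std(sim) / np.mean(sim)) / (np.std(obs) / np.mean(obs))
    return 1.0 - np.sqrt((r - 1) ** 2 + (beta - 1) ** 2 + (gamma - 1) ** 2)

def mkge_distribution(realizations, obs):
    """One mKGE per realization; the boxplot (and the 50th percentile
    reported in Table 5) summarizes these 300 values per ERI."""
    return [mkge(sim, obs) for sim in realizations]
```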
Citation: https://doi.org/10.5194/egusphere-2025-2710-AC3
AC3: 'Reply on RC3', Claudio Sandoval, 05 Sep 2025
This was the first time I was involved as a reviewer for this manuscript. The manuscript introduces a framework for a systematic comparison of rainfall extreme values from generated time series. The manuscript is well-written. Unfortunately, I have to recommend the rejection of the manuscript. Please find my detailed comments below.
Major comments:
By the chosen title the authors want to introduce a new framework for assessing rainfall extreme values from generated time series. The authors could not convince me that such a framework is required; current studies on rainfall generation already validate extreme-value behaviour in a sufficient manner. Indeed, the establishment of a systematic procedure could be nice, but for that the authors would have had to i) compare the framework with conventional validation strategies (classical presentation of IDF curves), and ii) share the code for the evaluation with the community, which is unfortunately not done (only “…upon reasonable request.”). My second issue is the comparison of rainfall generators: although not part of the title and beyond the scope of the manuscript, the authors judge the studied rainfall generators without hydro-meteorological validation, and rank them (Table 4, Table 6).
The manuscript should either focus on the introduction of the framework (comparisons with established methods, providing the code) or on the comparison of the rainfall generators. Combining both topics in one manuscript does not leave enough space for the proper study each would require.
Specific comments:
L44-63 The authors classify the evaluation possibilities into three types. Actually, there are four types; the most practical one is missing: the subsequent application. Due to the high non-linearity of rainfall–runoff transformation processes, one cannot infer from a single rainfall characteristic the performance obtained when the generated rainfall time series are used as input for the subsequent application. Sometimes the application reveals critical shortcomings of the generated time series and leads to an iterative optimization of the rainfall generation process. This type, including references, should be added.
Fig. 1a: The upper line contains arrows, but I think these are just headers of the boxes below (areas within the coloured dashed lines). If so, I suggest changing the layout; in its current version it can be misleading to have arrows twice, for the headers and for the datasets/methods below.
Fig. 1b: Why is the framework limited to 1 h and not to 1 min or 5 min, the typical lower bounds for rainfall generation?
Fig. 1c: What is ‘Index value’ on the y-axis? Should it be ‘Rainfall amount (mm)’?
L110-111 Reference curve: Why is crossing that curve such a relevant issue that the method should be discarded? Please provide references or examples here.
L114 I don’t understand this sentence. First, three values for d are provided, which yield three other values for d?
Eq. 2, 3: I’m wondering about the selection of the studied durations. The index runs from d = 1, …, 24, so also very uncommon durations such as d = 19 h are analysed? Although a lower weight is chosen (1/d), I question the acceptance of this method within the hydrologic community. I suggest keeping the equations more open and providing the durations used in this study in the paragraph below. For me, typical durations would have been d = {1 h, 2 h, 3 h, 4 h, 6 h, 12 h, 24 h}.
Eq. 2 I’m wondering how this criterion is affected by different numbers of realisations. For comparisons of two rainfall generators A and B, ten realisations may be required to represent the statistical uncertainty for A, while 100 realisations are required for B. Can this be a problem?
Chapter 5: The introduction of something ‘new’ demands comparisons with something ‘old or established’. Without previous experience, Fig. 3 is hard to interpret for the reader: How does a PEI = 0.4 differ from PEI = 0.2? Or the same PEI for different AEI? The authors could provide classical IDF curves to show the benefit of the newly proposed framework: What is visible with the new framework that could not be seen/quantified before?
L378-392 The manuscript aims at introducing a new framework, not at comparing rainfall generators. For a fair comparison, more hydro-meteorologic in-depth analyses of the results would be required. This comment is also valid for Sect. 5.2.