Enhanced Climate Reproducibility Testing with False Discovery Rate Correction
Abstract. Simulating the Earth's climate is an important and complex problem, and climate models are correspondingly complex, comprising tens to hundreds of thousands of lines of code. To appropriately utilize the latest computational and software infrastructure advancements in Earth system models running on modern hybrid computing architectures, and to improve their performance, precision, accuracy, or all three, it is important to ensure that model simulations are repeatable and robust. This introduces the need for establishing statistical, or non-bit-for-bit, reproducibility, since bit-for-bit reproducibility may not always be achievable. Here, we propose a short-simulation ensemble-based test for an atmosphere model to evaluate the null hypothesis that modified model results are statistically equivalent to those of the original model. We implement this test in the US Department of Energy's Energy Exascale Earth System Model (E3SM). The test evaluates a standard set of output variables across the two simulation ensembles and uses a false discovery rate correction to account for multiple testing. The false positive rates of the test are examined using re-sampling techniques on large simulation ensembles and are found to be lower than those of the bootstrapping-based testing approach currently implemented in E3SM. We also evaluate the statistical power of the test using perturbed simulation ensemble suites, each with a progressively larger magnitude of change to a tuning parameter. The new test is generally found to exhibit more statistical power than the current approach, detecting smaller changes in parameter values with higher confidence.
Status: final response (author comments only)
RC1: 'Comment on egusphere-2025-2311', Anonymous Referee #1, 26 Jun 2025
I think this work is an interesting and worthwhile contribution to the field. It addresses an important need: how to balance false positives and sensitivity in hypothesis-based ensemble testing for climate models. The use of the Benjamini-Hochberg (BH) FDR correction approach seems very worthy of exploration in this context. With that said, I would love to see a bit more detail in the following areas:
- I think it would be worth adding further explanation of the correction in section 2.2, as it is the crux of understanding the work. I had to read through a few times and revisit this section to understand what is going on. I might also suggest modifying equation 2.2, as it currently reads as the maximum of a set of booleans. Would argmax be a better fit? It would be helpful, I think, to give a plot of p(i) values and the calculated PFDR value for the control ensemble at least (a minimal sketch of the standard procedure is included at the end of this comment).
- Does the estimate of p(i)'s and PFDR stay stable with different ensemble sizes? In other words, how does one know that the 30-member ensemble is sufficient? A similar question would apply to the estimates of the false positive rate and sensitivity. As a model changes, is there a point where a different ensemble size is necessary, and how would one know?
- I think the information in Figure 1 is very interesting. It could be worth clarifying further. I read that description as meaning that for each modification, a base ensemble and a test ensemble are created, as opposed to testing fastmath against the control?
- In table 2 I would include the PFDR values for each ensemble.
- I found the use of the 95th percentile in Figure 2, and its description in the paragraph at line 215, to be confusing. For instance, with a 10% parameter change of GW Orog with MVK, does this mean it is possible that 94% of bootstrapped tests don't return a failure? Also, I believe the horizontal lines might be mislabeled in the caption and legend. Why not just plot the mean number of failed variables with bars for the std? Perhaps there is a good reason for the percentile that I am missing, or perhaps there is some mismatch between the paragraph at 215 and what the plot is actually showing.
- I find it interesting, and it might be worth discussing in regard to Figure 5, that there doesn't seem to be any correlation between the false positives for the MVK and BH-FDR approaches. On the one test day that fails BH-FDR the MVK approach is nowhere near failing.
I might check out https://gmd.copernicus.org/articles/18/2349/2025/gmd-18-2349-2025.html , a recently published follow-up work to Baker 2015, and Milroy 2018 that deals with similar issues for the ECT testing approach.
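For reference, a minimal sketch of the standard Benjamini-Hochberg step-up procedure is given below, assuming per-variable p-values from the two-sample tests and an FDR level q*; the function name and the synthetic p-values are illustrative only, and whether this matches the manuscript's Eq. 2.2 is for the authors to confirm.

```python
import numpy as np

def benjamini_hochberg(p_values, q_star=0.05):
    """Standard BH step-up: reject the k smallest p-values, where k is the
    largest i such that the i-th smallest p-value satisfies p_(i) <= (i/m) q*."""
    p = np.asarray(p_values)
    m = p.size
    order = np.argsort(p)                            # indices sorting p ascending
    critical = (np.arange(1, m + 1) / m) * q_star    # BH critical line (i/m) * q*
    below = p[order] <= critical
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])             # last sorted index under the line
        reject[order[: k + 1]] = True                # reject hypotheses 1..k
    return reject

# Illustrative use with 117 synthetic per-variable p-values
p_vals = np.random.default_rng(0).uniform(size=117)
print(benjamini_hochberg(p_vals).sum(), "local null hypotheses rejected")
```

Plotting the sorted p-values against the critical line (i/m) q* would give the kind of diagnostic figure suggested above.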
Citation: https://doi.org/10.5194/egusphere-2025-2311-RC1 - AC1: 'Reply on RC1', Michael Kelleher, 22 Aug 2025
RC2: 'Comment on egusphere-2025-2311', Anonymous Referee #2, 02 Jul 2025
This manuscript proposes a statistical approach to assess the reproducibility of climate model simulations. The main idea is to test whether the distribution of output variables from two ensembles is “statistically equivalent” or if significant differences exist between the ensembles. The approach employs a multi-testing Kolmogorov-Smirnov (MVK) test with a Benjamini-Hochberg false discovery rate (FDR) correction to account for error inflation due to multiple testing. Overall, the approach is reasonable but limited to very specific scenarios. Here are some specific comments and questions about this manuscript:
- While the adoption of the MVK test is motivated by previous studies cited in this manuscript, there are alternatives to the MVK test that could also be considered before applying an FDR correction. For instance, the Anderson-Darling or Cramér-von Mises tests have advantages over the Kolmogorov-Smirnov test, such as being more sensitive to repeated deviations from the empirical distribution functions, detecting deviations in the tails of the distributions, and effectively handling skewed distributions (a small illustrative sketch is included at the end of this comment).
- As I understand it, the MVK test is performed on two independent ensembles with N=30 members, comparing the distribution of annual global means for 120 output fields (local null hypothesis). What is the rationale for choosing N=30 specifically, rather than other values of N? The reported results may vary with different values of N. In addition to the caveat of using an "ultra-low resolution model," the reported statistical results are conditional on a very specific value of N. It would be important to explore lower and higher N values to provide evidence of the effect or perhaps non-effect of the FDR correction under different sample size scenarios.
- Furthermore, why are "annual global means" the only statistic of interest rather than regional means or perhaps extreme value statistics, such as the maximum or minimum values over the same time period? Testing for equality of distributions of global means represents just one focused statistical aspect of a set of climate ensembles with hundreds of output fields. I believe that many other tests beyond those proposed in this paper could be considered and would be of interest, potentially leading to different statistical assessments of the ensembles and a wide range of results. Even under a coarse spatial resolution, it is to be expected that tests of a local mean or some other local statistic may lead to a different assessment than those resulting from a global statistic. This raises the question of what it means to study and test "climate reproducibility": that variations of the model simulations provide similar global means?
- The false positive rates and power analysis presented in the manuscript consider a one-at-a-time variation of the tuning parameters. Should simultaneous variations be considered to study interactions between the “effgw_oro” and “clubb-c1” parameters and understand how these interactions impact the rates and power? I realize that considering all possible interactions may create a large computational burden. However, selecting a few potential interactions of interest could provide further evidence in the application of an FDR correction to assess the reproducibility of climate simulations.
Overall, I find the main ideas behind this paper interesting, but I feel the paper needs to be better framed and is lacking some form of sensitivity study on the choices that have been made. While I tend to agree with the conclusion that the "BH-FDR approach maintains or improves the statistical power", this has been demonstrated for a very specific type of test (Kolmogorov-Smirnov), a specific ensemble size, and a specific statistic (annual global mean).
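As an illustration of how the alternative two-sample tests mentioned in the first comment above could be slotted into the same per-variable testing step, the sketch below applies SciPy's Kolmogorov-Smirnov, Cramér-von Mises, and Anderson-Darling k-sample tests to two synthetic 30-member samples; the data and variable names are placeholders and this is not taken from the authors' code.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
baseline = rng.normal(loc=0.0, scale=1.0, size=30)   # stand-in for 30 annual global means
modified = rng.normal(loc=0.2, scale=1.0, size=30)   # slightly shifted test ensemble

ks = stats.ks_2samp(baseline, modified)
cvm = stats.cramervonmises_2samp(baseline, modified)
ad = stats.anderson_ksamp([baseline, modified])
# The attribute holding the approximate p-value differs across SciPy versions
p_ad = getattr(ad, "pvalue", getattr(ad, "significance_level", float("nan")))

print(f"Kolmogorov-Smirnov : stat={ks.statistic:.3f}, p={ks.pvalue:.3f}")
print(f"Cramer-von Mises   : stat={cvm.statistic:.3f}, p={cvm.pvalue:.3f}")
print(f"Anderson-Darling   : stat={ad.statistic:.3f}, p~{p_ad:.3f}")
```

Any of the resulting per-variable p-value sets could then be passed to the same FDR correction step.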
Citation: https://doi.org/10.5194/egusphere-2025-2311-RC2 - AC1: 'Reply on RC1', Michael Kelleher, 22 Aug 2025
RC3: 'Comment on egusphere-2025-2311', Anonymous Referee #3, 02 Jul 2025
Review of "Enhanced Climate Reproducibility Testing with False Discovery Rate Correction"
This manuscript presents an application of the Benjamini-Hochberg false discovery rate (BH-FDR) correction to climate model reproducibility testing, specifically for the E3SM atmosphere model. While the approach successfully reduces false positive rates in operational testing (from 7.5% to 1.9%), the study's scope and methodological rigor raise concerns about its broader applicability and scientific contribution.
Major Comments:
- Limited Generalizability: The study's findings are highly specific to:
- A single model (E3SM v2.1) at ultra-low resolution (~7.5° atmosphere)
- Annual global means of standard output variables
- A particular computational environment (Argonne's Chrysalis machine with Intel compiler)
- Incomplete Methodological Comparisons: The manuscript lacks crucial comparisons with alternative approaches:
- Permutation testing: While the authors mention that MVK "was found to give similar results as compared to permutation testing" (citing Mahajan et al., 2019b), they do not actually compare BH-FDR against permutation-based methods for their specific application.
- Alternative multiple testing corrections: No comparison with other FDR methods (e.g., Benjamini-Yekutieli for dependent tests) or other approaches like the False Discovery Exceedance method (an illustrative BH vs. BY comparison is sketched at the end of these major comments).
- Statistical Assumptions and Dependencies: The paper inadequately addresses:
- The correlation structure among the 117 tested variables, which violates the independence assumption of standard BH-FDR
- Whether the "theoretical" critical value of 1 for rejecting the global null hypothesis holds beyond their specific ensemble configuration
- The choice of q* = α = 0.05 for FDR control without justification
- Limited Scope of Reproducibility Testing: Testing only annual global means represents a minimal assessment of climate reproducibility. Climate models are evaluated on their ability to simulate regional patterns, variability, extremes, and trends. A model could pass this test while having significant regional biases or incorrect variability patterns.
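To make the dependence-robust alternative raised above concrete, the Benjamini-Yekutieli correction is available alongside BH in statsmodels, so a comparison would be a small change to the testing step; the snippet below is only a sketch with synthetic p-values, not the authors' configuration.

```python
import numpy as np
from statsmodels.stats.multitest import multipletests

p_vals = np.random.default_rng(2).uniform(size=117)   # synthetic per-variable p-values

# Benjamini-Hochberg: controls FDR under independence or positive dependence (PRDS)
reject_bh = multipletests(p_vals, alpha=0.05, method="fdr_bh")[0]

# Benjamini-Yekutieli: valid under arbitrary dependence, at some cost in power
reject_by = multipletests(p_vals, alpha=0.05, method="fdr_by")[0]

print("BH rejections:", reject_bh.sum(), "| BY rejections:", reject_by.sum())
```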
Minor Comments:
- The power analysis covers only two parameters. Testing additional parameters with different sensitivities would strengthen the conclusions.
- The operational validation period (53 days) is quite short for drawing robust conclusions about false positive rates.
- The manuscript reads more as a technical note about an incremental improvement to E3SM's testing infrastructure rather than a methodological advance in climate model evaluation.
The manuscript makes a useful contribution to E3SM's quality control processes, and the reduction in false positive rates is valuable operationally. However, the limited scope and incomplete methodological comparisons prevent me from recommending publication in its current form as a general methodological contribution to climate model evaluation.
Citation: https://doi.org/10.5194/egusphere-2025-2311-RC3 - AC1: 'Reply on RC1', Michael Kelleher, 22 Aug 2025
RC4: 'Comment on egusphere-2025-2311', Anonymous Referee #4, 02 Jul 2025
The paper seeks to modify the multi-testing Kolmogorov-Smirnov (MVK) test, which is used to test whether two model configurations have statistically divergent climates, by using a false discovery rate (FDR) correction for assessing p-values instead of a bootstrapped distribution. The hope is that this will decrease the false positive rate of MVK without increasing the false negative rate, and that it will eliminate the need for a large ensemble to generate the bootstrapped distribution. The paper also recalculates the critical value threshold for MVK in E3SMv2.1.
Major Comments
I believe the paper effectively demonstrated that the false negative rate is lower while the false positive rate is nearly unchanged for BH-FDR compared to MVK in this specific use case.
My biggest issue with the manuscript is that it seems to largely be a method development paper for ensuring statistical stability of a numerical model, with an Earth system model used as the exemplar. Even though this is for a special issue focused on ensemble design, I think the article needs more of an explicit relation to Earth System Dynamics. So, while I believe the paper has strong merits for publication, I am not sure if this is the correct journal. Perhaps a more statistics-oriented journal would be appropriate.
It's not clear to me what Section 4.4 is adding to the paper. They aren't showing any meaningful difference in the climate produced by two alternative compilers, but as they rightfully point out this is likely influenced by the fact that they are running ultra-low resolution simulations. So, they can't really comment on the ultimate effect of these compilers on the mean state of the climate outside of this specific configuration, which feels like a weak result.
Minor Comments
ln 29: Typo "...testing both the..."
ln 78: Need more detail on what qualifies as a "'climate changing' code modification...". I'm assuming this is based on some archived baseline ensemble but it's not very clear with the current language.
ln 81-83: This sentence is difficult to read, consider breaking into two separate sentences. For example:
"Then, for each field, the null hypothesis (H1, also referred to as the local null hypothesis here) is evaluated. This local null hypothesis posits that the sample distribution function of the annual global mean of that field estimated from the baseline ensemble (estimated from N data points) is statistically similar to that of the new ensemble."
ln 98: I know you clarified that "Type I" errors were false positives in the intro, but I think it's worth mentioning that again here since it becomes important in subsequent sections. Also, can you point to some source of the Type I errors in this method? A relevant citation would suffice.
ln 102: Again, I believe these statements about Type I error rates need citations.
ln 112: Typo, add dash between "BH FDR" to be consistent with the rest of the paper.
ln 123-134: It's not clear that this text is describing Eq 1. The p_FDR in the text seems to be (i/m)q*, while the p_FDR in the equation seems to be either zero or one based on whether the null hypothesis is rejected.
ln 149: I think omega'^2_bar needs more explanation. Some more technical explanation of the dissipation rate of zonal mean vertical eddy kinetic energy and what role that plays in the statistical stability of the model. Similarly, a brief explanation of gravity wave drag intensity is also warranted.
ln 150: "The simulation ensembles..." this sentence is redundant with information from Section 2.1 so it can be deleted
ln 197: It's not clear how the 95th percentile is meant to be interpreted relative to the mean in Table 2, and it's not discussed at all. One or two sentences explaining how this is representative of the distribution of false positive rates would be appreciated.
ln 206: Again, some explanation of the significance of the 95th percentile is needed. This is discussed briefly in ln 214, but I think it should be moved up to when Figure 2 is introduced.
ln 215: Typo "BD-FDR" --> "BH-FDR"
ln 233: While BH-FDR does show greater statistical power over MVK in Figure 3, the difference appears relatively marginal and it feels like an overstatement to suggest this should lead to increased confidence when considering false detection rates.
ln 296: Typo: "improvises" --> "improves"
Citation: https://doi.org/10.5194/egusphere-2025-2311-RC4 - AC1: 'Reply on RC1', Michael Kelleher, 22 Aug 2025
RC5: 'Comment on egusphere-2025-2311', Anonymous Referee #5, 02 Jul 2025
The authors present a detailed and thorough evaluation of different statistical methods used to evaluate climate reproducibility in the DOE Energy Exascale Earth System Model (E3SM). Their focus is on the preexisting multi-testing Kolmogorov-Smirnov (MVK) test and a proposed Benjamini-Hochberg False Discovery Rate correction to MVK (BH-FDR). The new BH-FDR method is computationally more efficient and can reduce false positives while improving upon the statistical power of MVK. The MVK and BH-FDR methods are clearly described, and their application and limitations are laid out in detail by the authors. That being said, the computational efficiency was not discussed in detail and could benefit from a short comment. Analysis clearly shows how the correction improves the current MVK method and how sensitive both methods are to different parameter adjustments (with differing model sensitivities) and compiler optimization.
Overall, I found the presentation to be convincing and clearly presented, but there were a few issues that should be addressed. The most pressing from my perspective revolved around comments that the 120-member ensemble used in the analyses may not have been enough to capture the internal variability for the MVK method (see comment below). Other than this my concerns were minor and included some grammatical comments and suggestions to improve clarity. As such, I would recommend this paper is published after addressing the minor concerns laid out below.
Major comments
- There were repeated comments regarding the potential limitations of a 120-member ensemble in this work suggesting that this may not be enough simulations to fully represent the internal variability of MVK. While I found the rest of the paper to be a compelling defense of the update, this comment stuck with me and introduced a level of uncertainty into the conclusions that I was uncomfortable with. I would like to see some justification or supplementary analysis that suggests that this ensemble is sufficient for these results or at least can give some estimated range of uncertainty.
- Can you include a statement in your paper or in the results section that describes the improvement in computational time with BH-FDR versus MVK? This would then defend the twofold benefits of the BH-FDR that were described in the introduction of your paper.
Comments by line:
Line 195: Can you test if the 120 ensembles do in fact capture the true variability? This sounds like 120 may or may not be enough which makes me wonder about the results.
Line 276: A similar comment appears in section 4.2. Is there a way to test the internal variability sampling and quantify its use? Could the authors add 30 additional ensemble members and test the dependence on increased ensemble size? I worry about the results given the uncertainty in the number of ensemble members. While the authors then go on to state that increasing ensemble size is a future task, I do think that verifying that the model ensembles are large enough to draw appropriate conclusions is an important aspect of this paper.
Line 303: Something that seems to be missing from the paper is description of the improvement in computational time for the BH-FDR method. Can you mention here or in your results how much faster it is than MVK?
Minor and grammatical comments
- There were a handful of run-on sentences with multiple asides that could benefit from being broken into two or more sentences or broken up with parentheses/dashes rather than commas. Some have been noted in the comments below but please check for others in the manuscript.
- There are some statements about previous research or findings that are important to the conclusions but are missing references. It would be nice to see some of these referenced to something researchers can find if possible (see in comments below).
Comments by line:
Line 2: For clarity consider rephrasing to "It is important to ensure that Earth system models running on modern hybrid computing architectures are repeatable and robust. This facilitates the utilization of the latest computational and software infrastructure advancements in these models while allowing for improvements in their performance, precision, and accuracy. "
Line 7: To avoid confusion with the current version 3 of E3SM, please add, "...this test in version 2 of the US Department of Energy's Energy Exascale Earth System Model (E3SM)"
Line 22: Please remove 'indeed'
Line 23: Consider changing to, "...changed (assuming that is not the intended effect)".
Line 27: Please mention that this is version 2 of E3SM and reference Golaz et al., 2022 (10.1029/2022MS003156) here.
Line 29: remove 'the both'
Line 30: consider replacing ‘which’ with 'both of which'
Line 38: Is this 1 time step and 2 time step? Unless this is common notation, please replace 1s and 2s with 1-step and 2-step respectively.
Line 51: Consider changing to: "...approach that improves upon the MVK and provides a two-fold benefit over it."
Line 52: Consider changing to, "...(Type I error rates; e.g., where two simulations are erroneously labeled as statistically different) without affecting the false negative ..."
Line 54: consider changing to, "...rate of about 7.5% since its induction into the test suite despite the prescribed significance..."
Line 58: A little hard to follow. Please make new sentence such as, "The latter of these approaches is the methodology used by MVK."
Line 66: Unclear what you are saying with the end of the sentence. It seems to contradict the immediately preceding statement. Please make this a new sentence and elaborate on what you mean.
Line 67: replace semicolon with colon
Line 68: Add comma after 'rates)'
Line 71: Remove 'And,' or combine with previous sentence into single sentence.
Line 79: consider replacing with 'resolution (~7.5 deg atmosphere, 240 km ocean) for 14 months'
Line 81: consider moving 'is evaluated' from the end of the sentence to this section: '...the null hypothesis (...) is evaluated...'
Line 90: Assuming this is E3SM version 1. Please mention this here and reference version 1 documentation (Golaz et al., 2019 (https://doi.org/10.1029/2018MS001603))
Line 95: Consider replacing with, '...strongly. That is, those...'
Line 96: consider changing to '...significance. But...'
Line 105: Remove comma after ‘chance’
Line 110: please change to, 'the latter of which is used here'.
Line 113: has been
Line 125: consider using semicolons instead of commas to break up each definition.
Line 136: change to, '...these characteristics for each approach...'
Line 136: remove comma after MVK and add 'while also'
Line 164: change to, '...arithmetic (Intel Corporation, 2023).'
Line 166: Not clear what optimization flag you are referring to. Please change sentence to, "In the case of E3SM, {opt. flag} is already used in the compilation of several source files..."
Line 171: Please include a reference for this statement.
Line 172: Unclear. Maybe break into two sentences or restructure
Line 177: While this method intuitively seems robust, is there justification for the 1000 times sample? I may have missed it, but was there testing to determine that this stochastic sampling leads to a spread in the null distribution of t that is representative of the 120 member ensemble set?
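One way to probe this question would be to track how the upper quantile of the resampled null distribution of the test statistic changes with the number of resamples (and, analogously, with the size of the pool being resampled); the sketch below uses synthetic data and illustrative names (null_t_statistics, pool) and is not the authors' tooling.

```python
import numpy as np
from scipy import stats
from statsmodels.stats.multitest import multipletests

def null_t_statistics(pool, n_resamples=1000, n_members=30, alpha=0.05, seed=0):
    """Repeatedly split one ensemble into two disjoint subsamples and count
    BH-rejected variables; under the null both halves share one distribution."""
    rng = np.random.default_rng(seed)
    n_runs, n_vars = pool.shape
    t_vals = np.empty(n_resamples, dtype=int)
    for r in range(n_resamples):
        idx = rng.permutation(n_runs)
        a, b = pool[idx[:n_members]], pool[idx[n_members:2 * n_members]]
        p = np.array([stats.ks_2samp(a[:, v], b[:, v]).pvalue for v in range(n_vars)])
        t_vals[r] = multipletests(p, alpha=alpha, method="fdr_bh")[0].sum()
    return t_vals

# Synthetic stand-in for a 120-run x 117-variable ensemble of annual global means
pool = np.random.default_rng(3).normal(size=(120, 117))
t_null = null_t_statistics(pool, n_resamples=200)      # reduced count just for speed
print("95th percentile of the null t statistic:", np.percentile(t_null, 95))
```

Repeating this for, say, 250, 500, and 1000 resamples and checking that the quoted percentile stabilises would help address the representativeness question.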
Line 179: Please put sentence explaining the significance of the t statistic from two samples from the same ensemble further up in the paragraph closer to the description of how the analysis was conducted. It wasn't clear to me upon first reading why you were calculating the test statistic within the same ensemble set.
Line 190: remove comma after 'correction'
Line 213: On the topic of detectability and mainly a question of personal interest: based on the relative sensitivities, can you generalize the detectability of a parameter change based on its sensitivity? In other words, if E3SM is 2x more sensitive to parameter X than clubb_c1, can you say what percent change in parameter X will put you at the critical value threshold? If this is easy enough to do, this would be of interest to the subset of the modeling community working on parametric uncertainty.
Line 220: change to, '...(1-P, incorrectly accepting a false null hypothesis), ...'
Line 227: Is this because the parameter value is correlated with the rejected iterations? Not totally clear to me, please clarify.
Line 236: Please give example of how the climate is not different in a meaningful way or elaborate.
Line 237: Do you have estimates of sampling error and could these be applied to this comparison?
Line 241: Remove comma after 'here'.
Line 244: I think this is a good idea and recommend looking at Nugent et al., 2025 (10.22541/essoar.174907165.57104591/v1) and Eidhammer et al., 2024 (10.5194/egusphere-2023-2165) for other significant parameters, especially mass and number autoconversion parameters in the P3 cloud microphysics.
Line 251: change to '...22 out of the 1000...'
Line 262: Consider changing to, "...ultra-low resolution model used here."
Line 273: Break into new sentence, such as, '...occurred. In this case both tests...'
Line 284: I think these examples are interesting and meaningful for a larger perspective of the use of these tools, but do they have a reference? Is this documented in the E3SM overview, a presentation, or personal correspondence?
Line 289: Reference?
Line 293: In the above examples it seems both MVK and BH-FDR give the same results. Can you add a statement justifying why BH-FDR may be a better choice due to its higher sensitivity? Is there an example where detecting non-bit-for-bit changes would require the higher sensitivity of BH-FDR? It can be speculative, but it would be nice for a future perspective with this new method.
Line 296: change ‘improvises’ to ‘improves’
Line 309: change ‘effect’ to ‘affect’
Line 310: Is this unpublished work slated to be published? Can any part of this work be plotted and put in the supplementary? It would be nice to see some part of this comparison to support this claim as it seems to be a powerful basis for the hypothesis that MVK is resolution independent.
Citation: https://doi.org/10.5194/egusphere-2025-2311-RC5 - AC1: 'Reply on RC1', Michael Kelleher, 22 Aug 2025
RC6: 'Comment on egusphere-2025-2311', Anonymous Referee #6, 06 Jul 2025
General comments:
The current preprint implements the Benjamini–Hochberg False-Discovery-Rate (BH-FDR) as an update to the ensemble-based reproducibility test of the atmosphere component of E3SM, updating in particular the multi-testing Kolmogorov-Smirnov (MVK) test.
The two goals of this update are to:
(1) reduce the computational cost of empirically determining a threshold for the 'global' hypothesis test H_1, and
(2) reduce the false positives and false negatives (i.e. increase the power).

From my perspective the authors' main contribution is the implementation, in particular the tool, of this updated reproducibility test and the evaluation of the second goal, (2) above. It is clear that the BH-FDR helps with (1). The code has been published on GitHub and the data on Zenodo, and the experiments have been described in such a way that reproducibility seems doable, although due to a difference in HPC architecture exact numerical results could differ.
The current results seem fitting to the special issue of “Theoretical and computational aspects of ensemble design, implementation, and interpretation in climate science”, and the current results contribute in particular to the implementational aspect.
The title is clear and fitting; in the abstract I would add some of the comments mentioned below in the technical corrections.
The evaluation is solely empirical/implementational using a particular case study for evaluation. This case study is restricted to (i) ultra-low resolution, (ii) on 13 families of ensembles, where the families are generated by varying two parameters (11 families) and compiler options (2 families). The evaluation of goal (2) through this case study is, in my perspective, convincing although in a limited setting, but further confirmed in an operational setting.
The theoretical basis comes from the paper on the BH False-Discovery-Rate. While the idea of this method is clearly explained, the comprehension of this method, although not strictly needed as the ‘statsmodels’ package is used, sometimes feels lacking. This results in errors and vague statements in some parts of the preprint. A major flaw is that the assumptions for this method, for example independence of the random variables, are not checked.
The structure of the preprint is clear and good. The writing sometimes feels hurried, as there are some inconsistencies and typing errors, and sometimes some phrases which seem to lack argument.
The manuscript is, in my opinion, not sufficiently related to similar work. In particular, in my opinion, they should make a comparison with https://gmd.copernicus.org/articles/15/3183/2022/ and possibly with the new work in https://gmd.copernicus.org/articles/18/2349/2025/gmd-18-2349-2025.html
The former work could, in particular, help to comment on some choices made (e.g. only Kolmogorov-Smirnov and the choice of ultra-low resolution) even though the tests are different. It also seems that FDR is used in one of the authors' previous papers (https://dl.acm.org/doi/pdf/10.1145/3468267.3470572), and it should be made clear how the current results differ from those in the previous paper.
Therefore, due to the flaws in the scientific argumentation I call for a revision.
Nevertheless, I would like to note that the authors are working on a tool which seems at the forefront of statistical reproducibility testing in Earth system modelling and would like to commend them for this.

Specific comments:
In my opinion the major remarks are that:
- the assumptions of the Benjamini–Hochberg False-Discovery-Rate should be explicitly stated and checked (a small diagnostic sketch is included after these specific comments);
- the work should be put in a more general context, including https://gmd.copernicus.org/articles/15/3183/2022/ and the previous work https://dl.acm.org/doi/pdf/10.1145/3468267.3470572 The former work could, in particular, help to comment on some choices made (e.g. only Kolmogorov-Smirnov and the choice of ultra-low resolution) even though the tests are different.
- It seems that FDR is also used in one of the author’s previous papers: https://dl.acm.org/doi/pdf/10.1145/3468267.3470572 and it should be made clear how the current results differ from the one in the previous paper.
- The case study is quite restricted. Possibly more cases could be considered. I also agree with the other reviewers that sometimes more explanation could be given on the choice of methods/parameters.
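On the first point, a quick diagnostic of the (in)dependence assumption would be to look at the correlation structure of the per-variable annual global means across the ensemble; the snippet below is a generic sketch on synthetic data, not drawn from the authors' code.

```python
import numpy as np

# Synthetic stand-in: 30 ensemble members x 117 annual global means
means = np.random.default_rng(4).normal(size=(30, 117))

corr = np.corrcoef(means, rowvar=False)                 # variable-by-variable correlations
off_diag = corr[~np.eye(corr.shape[0], dtype=bool)]
print(f"median |r| = {np.median(np.abs(off_diag)):.2f}, "
      f"max |r| = {np.max(np.abs(off_diag)):.2f}")
# Widespread strong correlations would argue for verifying PRDS-type positive
# dependence or switching to the Benjamini-Yekutieli variant.
```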
Technical corrections:
Errors:
- The equation for p_FDR seems incorrect and should likely be changed to the one in https://dl.acm.org/doi/pdf/10.1145/3468267.3470572
- Problems with reproducibility/consistency between the Intel and GNU compilers on the one hand and the PGI compiler on the other hand have also been found in the work on pycect. This observation significantly reduces the relevance of the remark in line 161.
- In the abstract, you speak of “short-simulation”, while later in the introduction (line 42) it is not clear what “short” is relative to.
- In section 4.2, line 194: I am not sure that I agree with the remark that the mean is expected to be 0.05, since the critical threshold was estimated from the same population. It is not clear to me, and I think it is not true, that taking the median of the 95% quantiles of the 13 families of ensembles would imply that we may expect the mean to be 0.05. Possibly I am wrong in this.
- Similarly, in section 4.2, it is not clear to me, and I think it is not true, that “this may be different if an ensemble size of 120 does not capture the true internal variability”. Possibly I am wrong in this.
- Again in Section 4.2, I think that before using the lemma from the paper to explain the reduction of the mean false positive rate, the assumptions should be checked and a more formal argument could be given
Suggestions:
- In the abstract, I would clearly add the two goals (1: reduce computational effort and 2: reduce false negatives and false positives)
- In the abstract, you can add more clearly that you have confirmed the results of the second goal in an operational setting
- Possibly, besides similar work on CIME, the work from Ludwig et al on statistical reproducibility at DKRZ, or the recent work of Kai Keller et al at BSC (https://egusphere.copernicus.org/preprints/2025/egusphere-2025-1367/egusphere-2025-1367.pdf) could be added. I have no personal investment in these papers.
- In line 214, I would change “detectable” to “easily detectable”
- In Figure 5, you could add the square and bullet markers to the legend, just as in the legend of figures 2 and 3
Inconsistencies/spelling:
- “Tens to hundreds of thousands of lines of code” and later “millions of lines of code”
- Typewriter font for the parameters “effgw_oro” and “clubb_c1” applied inconsistently
- re-sampling and resampling
- “the both the” in line 29
- “Across model” in line 258 should be “across the model”
- "BH-FDR" not always consistent
Citation: https://doi.org/10.5194/egusphere-2025-2311-RC6 - AC1: 'Reply on RC1', Michael Kelleher, 22 Aug 2025
RC7: 'Comment on egusphere-2025-2311', Anonymous Referee #7, 07 Jul 2025
This is a review for submission "Enhanced Climate Reproducibility Testing with False Discovery Rate Correction" by Kelleher and Mahajan. The work should be of great interest to the community, since evaluating Earth system models can be very computationally expensive. The manuscript is well-presented and organized; I have only minor comments:
There is a lot of jargon in the paper in the introductory sections, which makes it harder to read for a non-statistician. Some of the definitions (like Type I errors) were not clear from the text (even though the text refers to Sec. 4 for Type I error definition, there is no further clarification in Sec. 4). I found some clarifications in Mahajan2017 and Mahajan2019, but I think it is ok to repeat definitions of basic concepts in this paper.
Equation 1 -- the maximum function has a boolean input instead of a number. Is there a typo? Also, why does the p-level there depend on an index ("i") of a field to be tested? The authors list m=117 above, but in general m can be arbitrarily large.
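For comparison, the usual step-up form of the BH rule, which Eq. 1 presumably intends to express, can be written as below; whether this matches the authors' intent is for them to confirm.

```latex
% Standard Benjamini-Hochberg step-up rule at FDR level q^* for m ordered
% p-values p_{(1)} \le p_{(2)} \le \dots \le p_{(m)}:
\[
  k \;=\; \max\bigl\{\, i \in \{1,\dots,m\} : p_{(i)} \le \tfrac{i}{m}\, q^{*} \bigr\},
\]
% reject the local null hypotheses corresponding to p_{(1)}, \dots, p_{(k)};
% if no such i exists, no hypothesis is rejected.
```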
It would be very interesting to see (if not in this publication, then maybe in future ones) which simulations in the new framework (and possibly the old one) are deemed climate-changing by looking at their plots (standard latlon mean fields, zonal means, etc.), and evaluate them by an eyeball norm. Would a human expert also consider these simulations "climate changing" or not? Something similar is presented in Mahajan2017, but not exactly. There is an issue that the current work was done with an ultralow res E3SM (ne4 for the atmosphere), and this configuration of E3SM might not have established climatologies.
Citation: https://doi.org/10.5194/egusphere-2025-2311-RC7 - AC1: 'Reply on RC1', Michael Kelleher, 22 Aug 2025