This work is distributed under the Creative Commons Attribution 4.0 License.
Statistical summaries for streamed data from climate simulations: One-pass algorithms (v0.6.2)
Abstract. Projections from global climate models (GCMs) are a fundamental information source for climate adaptation policies and socio-economic decisions. As such, these models are being progressively run at finer spatio-temporal resolutions to resolve smaller scale dynamics and consequently reduce uncertainty associated with parameterizations. Yet even with increased capacity from High Performance Computing (HPC) the consequent size of the data output (which can be on the order of Terabytes to Petabytes), means that native resolution data cannot feasibly be stored for long time periods. Lower resolution archives containing a reduced set of variables are often all that is kept, limiting data consumers from harnessing the full potential of these models. To overcome this growing challenge, the climate modelling community is investigating data streaming; a novel way of processing GCM output without having to store a limited set of variables on disk. In this paper we present a detailed analysis of the use of one-pass algorithms from the 'one-pass' package, for streamed climate data. These intelligent data reduction techniques allow for the computation of statistics on-the-fly, enabling climate workflows to temporally aggregate the data output from GCMs into meaningful statistics for the end-user without having to store the full time series. We present these algorithms for four different statistics: mean, standard deviation, percentiles and histograms. Each statistic is presented in the context of a use case, showing the statistic applied to a relevant variable. For statistics that can be represented by a single floating point value (i.e., mean, standard deviation, variance), the accuracy is at the order of the numerical precision of the machine and the memory savings scale linearly with the period of time covered by the statistic. For the statistics that require a distribution (percentiles and histograms), we present an algorithm that reduces the full time series to a set of key clusters that represent the distribution. Using this algorithm we find that the accuracy provided is well within the acceptable bounds for the climate variables examined while still providing memory savings that bypass the unfeasible storage requirements of high-resolution data.
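For readers unfamiliar with one-pass updating, the sketch below illustrates the kind of single-pass update used for the single-value statistics (here Welford's algorithm for a running mean and variance). It is an illustrative NumPy implementation written for this page, with made-up array shapes and synthetic data, not code taken from the one_pass package itself.

```python
import numpy as np

class RunningMeanVar:
    """One-pass (streaming) mean and variance via Welford's update.

    Only the running count, the mean and the sum of squared deviations
    are kept in memory, so the full time series never needs to be stored.
    """

    def __init__(self, shape):
        self.n = 0
        self.mean = np.zeros(shape)
        self.m2 = np.zeros(shape)  # sum of squared deviations from the mean

    def update(self, field):
        """Fold one new time step (e.g. a 2-D lat-lon field) into the statistic."""
        self.n += 1
        delta = field - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (field - self.mean)

    @property
    def variance(self):
        return self.m2 / self.n  # population variance

    @property
    def std(self):
        return np.sqrt(self.variance)


# Stream 100 synthetic "time steps" of a 180 x 360 temperature-like field.
rng = np.random.default_rng(0)
stat = RunningMeanVar((180, 360))
for _ in range(100):
    stat.update(rng.normal(loc=280.0, scale=5.0, size=(180, 360)))
print(stat.mean.mean(), stat.std.mean())
```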
Status: open (until 03 Apr 2025)
RC1: 'Comment on egusphere-2025-28', Anonymous Referee #1, 17 Feb 2025
In this manuscript, the authors present an alternative way to process “on-the-fly” (online) model output data, with the motivation being reduced data/memory usage and potential ease for users. Ultimately, this is an interesting manuscript, albeit weird. I think it can be made worthy of publication upon a revision. In general, the authors seem to be factual, careful, and nuanced. I think most of the work is relatively high quality. There are some problems though (the first one potentially fatal for the manuscript).
First: My reading of the manuscript is that it is making use of an algorithm package (whose version is in the title no less), but upon examination of the code availability section, I don’t think the authors actually used the package at all in their analysis. Instead, they “simulated” using the package. I find this quite odd. I wrote a comment about this at the end. I don’t know if this is intentional or not, but this is pretty deceptive. I invite the authors to explain, and I will keep an open mind.
Second: The authors didn’t explain how this type of algorithm/package would be used with an actual climate model running. Maybe it would be used in a futuristic cloud-compute setup? (They do cite some cloud-computing works.) I think this manuscript would greatly benefit from a detailed example (ideally a workflow) of how this would be implemented in an end-to-end fashion.
Third: I think this manuscript would benefit from a comparison to the “online diagnostics” route. In the “online diagnostics” route, climate scientists (and developers) write functionalities to output specific items of interest, without needing too much data. Climate models already do online mean, min, max, etc. in their output streams; nothing here is novel. That could easily be extended to all sorts of statistics. In fact, it could be extended in a composable fashion to even more complex and variable algorithms. For example, consider the following: “I want to calculate the globally averaged temperature at cloud top but with a threshold of 280 K”. An algorithm can be written to identify the cloud top, find the temperature there, apply the threshold, and then do a horizontal reduction in a composable way. This algorithm can then be run after each time step in the model, storing the variable to be output in memory or writing it out immediately. I think the authors could improve this manuscript if they compare and contrast their one-pass streamed approach to that of models natively doing more online calculations themselves (potentially on idling CPUs, if the models run on GPUs) instead of an external module/package.
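A minimal sketch of that kind of composable online diagnostic is given below, under assumed field names, array layouts (level index 0 at the model top) and one possible reading of the 280 K threshold; none of these choices come from the manuscript or the one_pass package.

```python
import numpy as np

def cloud_top_temperature(t, cloud_fraction, threshold=280.0, cf_min=0.1):
    """Hypothetical online diagnostic: temperature at cloud top, thresholded.

    t, cloud_fraction : arrays of shape (nlev, nlat, nlon), level 0 = model top.
    Returns an (nlat, nlon) field that is NaN where there is no cloud or the
    cloud-top temperature exceeds the threshold.
    """
    cloudy = cloud_fraction > cf_min
    top_idx = np.argmax(cloudy, axis=0)          # highest cloudy level per column
    has_cloud = cloudy.any(axis=0)
    t_top = np.take_along_axis(t, top_idx[None, :, :], axis=0)[0]
    return np.where(has_cloud & (t_top <= threshold), t_top, np.nan)

def global_mean(field, lat):
    """Horizontal reduction: cos-latitude weighted mean, ignoring NaNs."""
    weights = np.cos(np.deg2rad(lat))[:, None] * np.ones_like(field)
    weights = np.where(np.isfinite(field), weights, 0.0)
    return np.nansum(field * weights) / weights.sum()

# Run after every model time step; only the scalar result needs to be kept:
#     diag_series.append(global_mean(cloud_top_temperature(T, CF), lat))
```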
Other comments I wrote while (re)reading the manuscript:
L2: nitpick: “as such” is kinda weird here in that it made me think running at high-res was a consequence of the preceding sentence, but it actually is more of an antecedent of the following sentence.
L5: nitpick: probably misplaced comma (I’d move it to after HPC instead)
L7: the phrase “data streaming” is introduced like a known quantity that is specific to climate science, but in fact, it is neither known nor usually associated with climate science. Maybe jargon? If I were you, I would consider removing “data stream;” (the phrase with the semicolon) and let the sentence stand without the disruption
L9: one-pass may necessitate explanation here (so I would rephrase)
L9: intelligent doesn’t seem like the right word (maybe efficient?); these techniques aren’t learning things, or are they?
L17: I think “well within the acceptable bounds” is an understatement. In my opinion and understanding, these algorithms basically recover accuracy to within an insurmountable precision limit, so basically as good as one would get anyway. I would rephrase to finish with a stronger point
L35–42: I think this may surprise an average climate scientist involved in model development and evaluation (such as myself). We always had the ability to access data before simulations are done and one could obviously write stuff in a continuous manner. I think what you want to highlight is that you’re algorithmically calculating some interesting statistics that capture specific interests. That is not really novel, not unique, but it is interesting and worthy of publication. In other words, I don’t see the “novel” method and “unprecedented” reduction of “meaningful” output. Can you please (carefully) elaborate?
L70: the last sentence here gives the wrong impression of what you’re trying to say. Your manuscript is supposed to showcase those algorithms, so “requesting” readers to go follow some other package documentation may not be the best way to politely say, “We don’t discuss the code implementation of specific algorithmic details related to each statistic, and we instead focus on showing their utility to climate analysis” …
L194: I’d elaborate on the different results part
L219: I think you can say that plainly at the outset without taking the reader on an unnecessary voyage. I am not sure if the added info (analysis) is informative otherwise. Feel free to disagree
L268: so, the NumPy-calculated one is the “reference” right? (Note, I would refer to NumPy like they refer to it in their papers/docs, uppercase N and P)
Section 5.3: If I were you, I would simply delete this section. Your concluding paragraph (around L385) basically shows that most of the preceding analysis/“results” should be taken with a giant grain of salt. I would simply avoid the distraction, maybe add something about how the analysis in 5.2 could be done on precipitation with a meaningful cutoff and/or a different type of underlying assumption (distribution type)
L431: But this doesn’t make sense to me. As a scientist, what I care about is the stability of the value, not an arbitrarily defined metric of convergence. In Figure 6, sigma_n goes from roughly 0.35 at 200 samples to 0.15 at 600 samples, but it is within those “convergence” lines.
L432: Ok, but how does that impact your assertions earlier about all the savings and not having to wait for model simulations to run a lot of steps? I guess you’re saying, one kinda has to wait a lot of steps to get “accurate” stuff out of these one-pass algorithms? Like 200 steps or so?
Section 6: Like Section 5.3, I think the convergence analysis is likely misleading, potentially counterproductive, and perhaps better left out. Your goal in this manuscript is to showcase something useful for users of this method. My assumption is that if people choose to use this type of one-pass algorithm, they would do so on relatively high-frequency data and they would understand the risk of under-sampling. Maybe I am just not getting what you’re trying to do here? Could you motivate it better? Is it anything other than something like “don’t take the mean of the first 5 time steps as the mean of the next 5000 time steps”??
L508: Thanks for the links. A few questions/suggestions: 1) might be good to list the GitHub links here as well? 2) is there a hosted version of the package docs somewhere? 3) how was the one-pass package used in making these figures? I didn’t see any “import one_pass” or “from one_pass…” … looks like you reimplemented/“simulated” everything from scratch? If so, this means the package in the title of this manuscript wasn’t used at all and the whole manuscript is misleading. I also don’t quite understand why the numerics are returning round-off errors, where I would’ve expected exact answers (e.g., for the means)
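On the round-off point, a small self-contained demonstration (not taken from the paper's notebooks) of why a sequentially updated one-pass mean generally agrees with NumPy's np.mean only to machine precision rather than exactly: the two accumulate the sum in a different order (NumPy uses pairwise summation), so the rounding differs.

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.normal(loc=300.0, scale=10.0, size=100_000).astype(np.float32)

# One-pass running mean, updated one value at a time, as a streaming
# algorithm would do (the full series is never used for the statistic).
mean = np.float32(0.0)
for n, value in enumerate(x, start=1):
    mean += (value - mean) / np.float32(n)

# Reference: mean of the stored array (pairwise summation inside NumPy).
ref = x.mean()

# The two agree only to float32 round-off, not exactly.
print(float(mean), float(ref), abs(float(mean) - float(ref)))
```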
Citation: https://doi.org/10.5194/egusphere-2025-28-RC1
AC1: 'Comment on egusphere-2025-28: Preliminary reply to RC1', Katherine Grayson, 27 Feb 2025
Firstly, the authors would like to thank the reviewer for taking the time to review the manuscript thoroughly. This is not a full reply to all the reviewer comments (we will properly address all comments after receipt of the other reviews); it is just a comment on the first point made in RC1. The reviewer highlights a very valid point that the analysis provided in the notebooks does not actually use the one_pass package. We would like to briefly clarify this issue. The reason for this is that when the paper was initially submitted, the package we developed (https://doi.org/10.5281/zenodo.14591828) was not allowed to be published open source due to contractual obligations of the project. While this was unfortunate, we felt this different method of data handling was still highly relevant for the community and we wanted the paper to reflect how these algorithms could be used, including their pros and cons, rather than their specific implementation in our software. As such, we wrote the notebooks trying to be as transparent as possible, exposing the internal mathematical functions of the package to re-create the same results.
Ultimately, however, it was decided that the paper could not be submitted without the underlying package being fully open source, and we waited some months to be allowed to publish the package openly. When this was decided we resubmitted the article (as it is now); however, the notebooks used to make the Figures have yet to be updated. We would like to assure the reviewer that all Figures can be re-created using the package (indeed, much more simply, with one line of code) and we will re-do all the notebooks explicitly using our package. There was no intent to be deceptive, in fact quite the opposite; however, we can see very clearly how this has resulted in confusion.
Citation: https://doi.org/10.5194/egusphere-2025-28-AC1
RC2: 'Comment on egusphere-2025-28', Anonymous Referee #2, 06 Mar 2025
This timely manuscript describes a powerful set of "one-pass" methods to compute statistics on-line during a model integration without first needing to write model output to disk. This is especially useful for very large kilometer-scale weather and climate simulations that produce enormous volumes of data, as this method works without sacrificing spatial resolution or temporal frequency. The demonstrations of efficacy and memory savings are powerful arguments for the value of this approach. However, I didn't quite understand the t-digest method used for computing histograms and percentiles, and some of the analyses were difficult to understand. Please see my more comprehensive review, attached.
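For readers in the same position regarding the t-digest, a deliberately simplified from-scratch sketch of the underlying clustering idea is given below: the stream is reduced to a bounded list of weighted centroids (mean, count), and percentiles are read off their cumulative weights. This toy version merges whichever adjacent pair of centroids has the smallest combined count; the actual t-digest instead bounds centroid weights with a scale function so that the distribution tails stay well resolved. The code is an illustration only, not the method implemented in the paper or the one_pass package.

```python
import bisect
import numpy as np

class SimpleDigest:
    """Toy streaming percentile estimator built from weighted centroids.

    Keeps at most `max_centroids` (mean, count) pairs, sorted by mean.
    Each new value starts as a singleton centroid; when the list grows too
    long, the adjacent pair with the smallest combined count is merged.
    """

    def __init__(self, max_centroids=100):
        self.max_centroids = max_centroids
        self.means = []    # centroid means, kept sorted
        self.counts = []   # matching centroid counts

    def update(self, value):
        i = bisect.bisect_left(self.means, value)
        self.means.insert(i, float(value))
        self.counts.insert(i, 1)
        if len(self.means) > self.max_centroids:
            self._merge_smallest_pair()

    def _merge_smallest_pair(self):
        pair_counts = [self.counts[i] + self.counts[i + 1]
                       for i in range(len(self.counts) - 1)]
        i = int(np.argmin(pair_counts))
        c = self.counts[i] + self.counts[i + 1]
        m = (self.means[i] * self.counts[i]
             + self.means[i + 1] * self.counts[i + 1]) / c
        self.means[i:i + 2] = [m]
        self.counts[i:i + 2] = [c]

    def percentile(self, q):
        """Approximate q-th percentile (0-100): mean of the centroid whose
        cumulative count first reaches the target rank (the real t-digest
        interpolates between neighbouring centroids instead)."""
        target = q / 100.0 * sum(self.counts)
        cumulative = 0
        for mean, count in zip(self.means, self.counts):
            cumulative += count
            if cumulative >= target:
                return mean
        return self.means[-1]


# Stream 100,000 skewed values through 100 centroids and compare with NumPy.
rng = np.random.default_rng(1)
data = rng.gamma(shape=2.0, scale=3.0, size=100_000)
digest = SimpleDigest(max_centroids=100)
for value in data:
    digest.update(value)
print(digest.percentile(90), np.percentile(data, 90))
```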
Data sets
nextGEMS cycle3 datasets: statistical summaries for streamed data from climate simulations, nextGEMS, K. Grayson, https://doi.org/10.5281/zenodo.12533197
Model code and software
DestinE-Climate-DT/one_pass: v0.6.2, K. Grayson, https://doi.org/10.5281/zenodo.14591828
Interactive computing environment
kat-grayson/one_pass_algorithms_paper, K. Grayson, https://doi.org/10.5281/zenodo.12533064
Viewed
| HTML | PDF | XML | Total | BibTeX | EndNote |
|---|---|---|---|---|---|
| 124 | 26 | 9 | 159 | 8 | 5 |