Statistical summaries for streamed data from climate simulations: One-pass algorithms (v0.6.2)
Abstract. Projections from global climate models (GCMs) are a fundamental information source for climate adaptation policies and socio-economic decisions. As such, these models are being progressively run at finer spatio-temporal resolutions to resolve smaller scale dynamics and consequently reduce uncertainty associated with parameterizations. Yet even with increased capacity from High Performance Computing (HPC) the consequent size of the data output (which can be on the order of Terabytes to Petabytes), means that native resolution data cannot feasibly be stored for long time periods. Lower resolution archives containing a reduced set of variables are often all that is kept, limiting data consumers from harnessing the full potential of these models. To overcome this growing challenge, the climate modelling community is investigating data streaming; a novel way of processing GCM output without having to store a limited set of variables on disk. In this paper we present a detailed analysis of the use of one-pass algorithms from the 'one-pass' package, for streamed climate data. These intelligent data reduction techniques allow for the computation of statistics on-the-fly, enabling climate workflows to temporally aggregate the data output from GCMs into meaningful statistics for the end-user without having to store the full time series. We present these algorithms for four different statistics: mean, standard deviation, percentiles and histograms. Each statistic is presented in the context of a use case, showing the statistic applied to a relevant variable. For statistics that can be represented by a single floating point value (i.e., mean, standard deviation, variance), the accuracy is at the order of the numerical precision of the machine and the memory savings scale linearly with the period of time covered by the statistic. For the statistics that require a distribution (percentiles and histograms), we present an algorithm that reduces the full time series to a set of key clusters that represent the distribution. Using this algorithm we find that the accuracy provided is well within the acceptable bounds for the climate variables examined while still providing memory savings that bypass the unfeasible storage requirements of high-resolution data.
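For readers new to the idea, the sketch below illustrates a single-pass (streaming) update of a mean and variance in the style of Welford's algorithm. It is a generic illustration of the technique described in the abstract, not the one_pass implementation; the toy field shape and sample count are arbitrary.

```python
import numpy as np

def one_pass_update(count, mean, m2, new_field):
    """Welford-style streaming update of mean and variance.

    count, mean and m2 (the running sum of squared deviations) are the
    only quantities kept in memory; the full time series is never stored.
    """
    count += 1
    delta = new_field - mean
    mean = mean + delta / count
    m2 = m2 + delta * (new_field - mean)
    return count, mean, m2

# Stream 720 hourly "time steps" of a toy 2-D temperature field.
rng = np.random.default_rng(0)
fields = rng.normal(280.0, 5.0, size=(720, 10, 10))
count, mean, m2 = 0, np.zeros((10, 10)), np.zeros((10, 10))
for field in fields:
    count, mean, m2 = one_pass_update(count, mean, m2, field)
variance = m2 / (count - 1)

# The streamed summaries agree with the two-pass NumPy reference
# to within machine precision.
assert np.allclose(mean, fields.mean(axis=0))
assert np.allclose(variance, fields.var(axis=0, ddof=1))
```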
Status: final response (author comments only)
RC1: 'Comment on egusphere-2025-28', Anonymous Referee #1, 17 Feb 2025
In this manuscript, the authors present an alternative way to process “on-the-fly” (online) model output data, with the motivation being reduced data/memory usage and potential ease for users. Ultimately, this is an interesting manuscript, albeit weird. I think it can be made worthy of publication upon a revision. In general, the authors seem to be factual, careful, and nuanced. I think most of the work is relatively high quality. There are some problems though (the first one potentially fatal for the manuscript).
First: My reading of the manuscript is that it is making use of an algorithm package (whose version is in the title no less), but upon examination of the code availability section, I don’t think the authors actually used the package at all in their analysis. Instead, they “simulated” using the package. I find this quite odd. I wrote a comment about this at the end. I don’t know if this is intentional or not, but this is pretty deceptive. I invite the authors to explain, and I will keep an open mind.
Second: The authors didn’t explain how this type of algorithm/package would be used with an actual climate model running. Maybe it would be used in a futuristic cloud-compute setup? (They do cite some cloud-computing works.) I think this manuscript would greatly benefit from a detailed example (ideally a workflow) of how this would be implemented in an end-to-end fashion.
Third: I think this manuscript would benefit from a comparison to the “online diagnostics” route. In the “online diagnostics” route, climate scientists (and developers) write functionalities to output specific items of interest, without needing too much data. Climate models already do online mean, min, max, etc. in their output streams; nothing here is novel. That could easily be extended to all sorts of statistics. In fact, it could be extended in a composable fashion to even more complex and variable algorithms. For example, consider the following: “I want to calculate the globally averaged temperature at cloud-top but with a threshold of 280 K”. An algorithm can be written to identify the cloud top, find the temperature there, apply the threshold, and then do a horizontal reduction, in a composable way. This algorithm can then be run after each time step in the model, storing the variable to be output in memory or writing it out immediately. I think the authors could improve this manuscript if they compare and contrast their one-pass streamed approach to that of models natively doing more online calculations themselves (potentially on idling CPUs, if the models run on GPUs) instead of an external module/package.
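To make the composability point concrete, a minimal sketch of such an online diagnostic is given below. The field names, the 5 % cloud-fraction criterion, and the reading of the 280 K threshold (keep columns colder than the threshold) are illustrative assumptions, not anything prescribed by the manuscript or by a particular model.

```python
import numpy as np

def cloud_top_temperature(temp, cloud_frac, area_weights, threshold=280.0):
    """Illustrative online diagnostic, composed from simple steps.

    temp, cloud_frac: (nlev, nlat, nlon) arrays with level 0 at the model
    top; area_weights: (nlat, nlon) grid-cell area weights.
    """
    # Step 1: identify the cloud-top level in each column
    # (first level from the top where cloud fraction exceeds 5 %).
    cloudy = cloud_frac > 0.05
    has_cloud = cloudy.any(axis=0)
    top_level = cloudy.argmax(axis=0)

    # Step 2: temperature at that level.
    jj, kk = np.meshgrid(np.arange(temp.shape[1]), np.arange(temp.shape[2]),
                         indexing="ij")
    t_top = temp[top_level, jj, kk]

    # Step 3: apply the threshold, restricted to cloudy columns.
    mask = has_cloud & (t_top < threshold)
    if not mask.any():
        return float("nan")

    # Step 4: horizontal (area-weighted) reduction to a single scalar,
    # which is all that needs to be stored or written out each time step.
    return float(np.average(t_top[mask], weights=area_weights[mask]))
```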
Other comments I wrote while (re)reading the manuscript:
L2: nitpick: “as such” is kinda weird here in that it made me think running at high-res was a consequence of the preceding sentence, but it actually is more of an antecedent of the following sentence.
L5: nitpick: probably misplaced comma (I’d move it to after HPC instead)
L7: the phrase “data streaming” is introduced like a known quantity that is specific to climate science, but in fact it is neither known nor usually associated with climate science. Maybe jargon? If I were you, I would consider removing “data stream;” (the phrase with the semicolon) and let the sentence stand without the disruption.
L9: one-pass may necessitate explanation here (so I would rephrase)
L9: intelligent doesn’t seem like the right word (maybe efficient?); these techniques aren’t learning things, or are they?
L17: I think “well within the acceptable bounds” is an understatement. In my opinion and understanding, these algorithms basically recover accuracy to within an insurmountable precision limit, so basically as good as one would get anyway. I would rephrase to finish on a stronger point.
L35–42: I think this may surprise an average climate scientist involved in model development and evaluation (such as myself). We always had the ability to access data before simulations are done and one could obviously write stuff in a continuous manner. I think what you want to highlight is that you’re algorithmically calculating some interesting statistics that capture specific interests. That is not really novel, not unique, but it is interesting and worthy of publication. In other words, I don’t see the “novel” method and “unprecedented” reduction of “meaningful” output. Can you please (carefully) elaborate?
L70: the last sentence here gives the wrong impression of what you’re trying to say. Your manuscript is supposed to showcase those algorithms, so “requesting” readers to go follow some other package documentation may not be the best way to politely say, “We don’t discuss the code implementation of specific algorithmic details related to each statistic, and we instead focus on showing their utility to climate analysis” …
L194: I’d elaborate on the different results part
L219: I think you can say that plainly at the outset without taking the reader on an unnecessary voyage. I am not sure if the added info (analysis) is informative otherwise. Feel free to disagree
L268: so, the NumPy-calculated one is the “reference” right? (Note, I would refer to NumPy like they refer to it in their papers/docs, uppercase N and P)
Section 5.3: If I were you, I would simply delete this section. Your concluding paragraph (around L385) basically shows that most of the preceding analysis/“results” should be taken with a giant grain of salt. I would simply avoid the distraction, and maybe add something about how the analysis in 5.2 could be done on precipitation with a meaningful cutoff and/or a different type of underlying assumption (distribution type).
L431: But this doesn’t make sense to me. As a scientist, what I care about is the stability of the value, not an arbitrarily defined metric of convergence. In Figure 6, sigma_n goes from roughly 0.35 at 200 samples to 0.15 at 600 samples, but it is within those “convergence” lines.
L432: Ok, but how does that impact your assertions earlier about all the savings and not having to wait for model simulations to run a lot of steps? I guess you’re saying, one kinda has to wait a lot of steps to get “accurate” stuff out of these one-pass algorithms? Like 200 steps or so?
Section 6: Like Section 5.3, I think the convergence analysis is likely misleading, potentially counterproductive, and perhaps better left out. Your goal in this manuscript is to showcase something useful for users of this method. My assumption is that if people choose to use this type of one-pass algorithm, they would do so on relatively high-frequency data and they would understand the risk of under-sampling. Maybe I am just not getting what you’re trying to do here? Could you motivate it better? Is it anything other than something like “don’t take the mean of the first 5 time steps as the mean of the next 5000 time steps”??
L508: Thanks for the links. A few questions/suggestions: 1) might be good to list the GitHub links here as well? 2) is there a hosted version of the package docs somewhere? 3) how was the one-pass package used in making these figures? I didn’t see any “import one_pass” or “from one_pass…” … looks like you reimplemented/“simulated” everything from scratch? If so, this means the package in the title of this manuscript wasn’t used at all and the whole manuscript is misleading. I also don’t quite understand why the numerics are returning round-off errors where I would’ve expected exact answers (e.g., for the means).
Citation: https://doi.org/10.5194/egusphere-2025-28-RC1
AC1: 'Comment on egusphere-2025-28: Preliminary reply to RC1', Katherine Grayson, 27 Feb 2025
Firstly, the authors would like to thank the reviewer for taking the time to review the manuscript thoroughly. This is not a full reply to all the reviewer comments (we will properly address all comments after receipt of all the other reviews); it is just a comment on the first point made in RC1. The reviewer highlights a very valid point that the analysis provided in the notebooks does not actually use the one_pass package. We would like to briefly clarify this issue. The reason for this is that, when the paper was initially submitted, the actual package we developed (https://doi.org/10.5281/zenodo.14591828) was not allowed to be published open source due to contractual obligations of the project. While this was unfortunate, we felt this different method of data handling was still highly relevant for the community, and we wanted the paper to reflect how these algorithms could be used, including their pros and cons, rather than their specific implementation in our software. As such, we wrote the notebooks trying to be as transparent as possible, exposing the internal mathematical functions of the package to re-create the same results.
Ultimately, however, it was decided that the paper could not be submitted without the underlying package being fully open source, and we waited for some months to be allowed to publish the package openly. When this was decided we resubmitted the article (as it is now); however, the notebooks used to make the figures have yet to be updated. We would like to assure the reviewer that all figures can be re-created using the package (indeed, much more simply, with one line of code) and we will re-do all the notebooks explicitly using our package. There was no intent to be deceptive, in fact quite the opposite, but we can see very clearly how this has resulted in confusion.
Citation: https://doi.org/10.5194/egusphere-2025-28-AC1
AC2: 'Further reply on AC1', Katherine Grayson, 21 Mar 2025
We now take this opportunity to address in more detail the other comments from RC1.
Again, thank you; we really appreciate all the feedback. Some comments on your review are given below:
For the second concern, we agree that the manuscript would benefit from a description of the overall setup. Originally, as explained in the previous response, this was not included because the package was not allowed to be open source, so we kept the scope of the paper to the more theoretical implementation. As these restrictions have been removed, we are working on a new section addressing your comments, including details of the workflow manager as well as a graphical representation.
For the third concern, while it is true that models often provide mechanisms to compute variable statistics during the simulation runs, the key point here is that a model no longer needs to be tailored to the specific needs of a data consumer downstream. The increasing resolution of climate models introduces much complexity when trying to tailor the model outputs to the requirements of the data consumer. While it is common for models to produce monthly means of certain variables, getting different modelling groups to align on specific statistics for downstream users (e.g. the 99th percentile of wind speed) would be challenging, to say the least. The way we envision the Climate Digital Twin (the project this package has been designed for) is that data consumers may enter or leave the simulation data stream at any time, according to their needs, and do not enforce their needs onto the design of the model. Instead, they are provided with a mechanism to select their fields of interest, as well as the aggregation method and frequency for those fields. The necessary operations are then performed at a later stage downstream by an independent process, making the whole ecosystem more scalable. We plan to explain this in the new section we will include to make this clearer, also addressing the minor comment on L35–42.
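As a rough sketch of what such a consumer request might look like, the snippet below uses invented key names purely for illustration; it is not the actual one_pass configuration schema.

```python
# Hypothetical data-consumer request (key names are illustrative only,
# not the one_pass configuration schema): the consumer picks the field,
# the statistic, and the aggregation window; an independent downstream
# process applies the request to the simulation data stream.
consumer_request = {
    "variable": "10m_wind_speed",
    "statistic": "percentile",        # e.g. mean, std, percentile, histogram
    "percentiles": [0.50, 0.99],
    "aggregation_window": "monthly",  # period the statistic summarises
}
```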
We will address all minor comments; however, we respond to a few of them here as well.
L219: We believe the description and explanation of how the t-digest method works is valuable here. RC2 has requested more details on how the method works, and the purpose of this section was to add details that cannot be found in other literature and are relevant for the package implementation. In line with comments from RC2 we will develop this section.
Section 5.3: We don't believe removing this section is a good idea. The aim of the paper is to show how well (or not) these algorithms work for different climate variables. Precipitation is a commonly used climate variable, and detailing the shortcomings of this method for this type of extreme-value distribution is, we believe, informative and worthwhile.
Section 6: Thanks for your comments regarding this section; it has also raised doubts with RC2. We were unsure whether this section should be included and recognise that it requires better motivation. We included it because, in later releases of the one-pass package, bias adjustment in streaming mode is available, and this analysis was required to understand how long we needed to sample the stream in order to achieve a representative climatology. We plan on revising this section and will most likely keep it in the re-submission; however, if upon re-submission both reviewers are not convinced, we will remove it.
Feel free to respond to or give feedback on any of the points. We will resubmit the manuscript after the open discussion is closed.
Citation: https://doi.org/10.5194/egusphere-2025-28-AC2
RC2: 'Comment on egusphere-2025-28', Anonymous Referee #2, 06 Mar 2025
This timely manuscript describes a powerful set of "one-pass" methods to compute statistics on-line during a model integration without first needing to write model output to disk. This is especially useful for very large kilometer-scale weather and climate simulations that produce enormous volumes of data, as this method works without sacrificing spatial resolution or temporal frequency. The demonstrations of efficacy and memory savings are powerful arguments for the value of this approach. However, I didn't quite understand the t-digest method used for computing histograms and percentiles, and some of the analyses were difficult to understand. Please see my more comprehensive review attached.
AC3: 'Reply on RC2', Katherine Grayson, 21 Mar 2025
As with RC1, we would firstly like to thank the reviewer for taking the time to thoroughly review the manuscript. The review has been highly beneficial, and we will address all comments properly in the resubmitted version.
Some comments in response:
Scale function
The scale function defines a monotonically increasing function that sets the 'edge values' for each cluster based on the corresponding percentile. Due to its hyperbolic shape, clusters at the tails are smaller and will contain less data. The reason we have not gone into more detail about how the method works is that it is detailed in Dunning, T. and Ertl, O.: Computing Extremely Accurate Quantiles Using t-Digests, arXiv [preprint], arXiv:1902.04023, 2019. Figure 1 of their paper shows a visual representation of the scale function and how it translates quantiles to their corresponding k value. When increasing delta (the compression), the shape of the scale function remains the same, but the corresponding limits on the y-axis (k) increase, allowing more clusters to represent the same distribution. I have attached a .gif that shows how the graph looks as the compression (delta) is increased. As the visual of the scale function is given in Dunning's paper, we have elected not to show it here, but if the reviewer thinks it would greatly aid comprehension of the method we will look to include it. A useful visual is shown in Figure 3 here https://www.gresearch.com/news/approximate-percentiles-with-t-digests/ and we could look to emulate something similar.
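For reference, the snippet below evaluates the standard k_1 scale function from Dunning and Ertl (2019), k(q) = δ/(2π) · arcsin(2q − 1), and shows how it caps cluster sizes near the tails; whether the t-digest library used by one_pass applies exactly this variant is an assumption made here for illustration.

```python
import numpy as np

def scale_k1(q, delta):
    """Standard t-digest scale function k_1(q) = delta/(2*pi) * arcsin(2q - 1).

    Monotonically increasing in q; a cluster spanning quantiles
    [q_left, q_right] is only allowed if k(q_right) - k(q_left) <= 1,
    so where the curve is steep (the tails) clusters must stay small.
    """
    return delta / (2.0 * np.pi) * np.arcsin(2.0 * q - 1.0)

delta = 100  # compression: larger delta -> more clusters, higher accuracy
for q in (0.01, 0.50, 0.99):
    dq = 1e-4  # numerical slope of k(q) around the quantile of interest
    slope = (scale_k1(q + dq, delta) - scale_k1(q - dq, delta)) / (2 * dq)
    # The maximum weight of a cluster centred near q is roughly 1/slope
    # (as a fraction of all data seen so far).
    print(f"q = {q:.2f}: max cluster weight ~ {1.0 / slope:.4f} of the data")
```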
Q: Should the same scale function be used for each variable?
Great point! Ideally, no. The scale function used works best for normally distributed data, as the largest clusters (the clusters containing the most data points) will be around P50. This is part of the reason why we see greater errors in precipitation, as it is very much not a normally distributed variable. However, modifying the scale function for different climate-variable distributions is a whole research topic in itself (indeed, see https://arxiv.org/pdf/2005.09599) and is unfortunately outside the scope of this paper. We are using an external package that uses the standard scale function and have suggested that they support other scale functions; however, this is not possible at the moment. We will highlight this more clearly in the resubmitted version along with some suggestions.
Q: How the t-digest method goes from its clusters to creating percentiles and histograms is unclear.
A percentile value is given by the mean value of the cluster that it corresponds to. For a normally distributed variable, P50 will be a large cluster (containing many similar values); P50 is therefore the mean of all the values in that cluster (a simplified sketch of this mapping is given below). Reviewing the manuscript, we see this has not been properly stated and we will include it. The NumPy estimates are made via the 'normal' method, where all the data is seen at once, ordered, and sorted into percentiles. We will also clearly state this.
Thanks for the comments regarding Figs 4 and 5. We will enlarge and split these, along with including better labelling. For (b) and (e), yes, exactly: each dot represents a percentile, with the NumPy estimate given along the x-axis and the t-digest estimate along the y-axis. If the estimates are exactly the same, the data will lie along the x = y line. We included the shaded areas to highlight that the percentiles we are examining are within the turbine operating limits. We will try to make this clearer.
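The sketch below illustrates the cluster-to-percentile mapping described above in its simplest form: it returns the mean of the cluster whose cumulative weight spans the requested percentile. Real t-digest implementations typically interpolate between neighbouring centroid means, so this is illustrative only.

```python
import numpy as np

def percentile_from_clusters(means, weights, p):
    """Return the mean of the cluster whose cumulative weight spans
    percentile p (0-100). Simplified: no interpolation between clusters."""
    order = np.argsort(means)
    means = np.asarray(means, dtype=float)[order]
    weights = np.asarray(weights, dtype=float)[order]
    target = p / 100.0 * weights.sum()
    idx = int(np.searchsorted(np.cumsum(weights), target))
    return means[min(idx, len(means) - 1)]

# Toy digest: three clusters summarising 1000 values.
cluster_means = [2.1, 5.0, 9.3]
cluster_weights = [150, 700, 150]
print(percentile_from_clusters(cluster_means, cluster_weights, 50))  # -> 5.0
print(percentile_from_clusters(cluster_means, cluster_weights, 99))  # -> 9.3
```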
We agree with the reviewer's final comment. We were also unsure whether this section belongs in the manuscript; the reason for including it is that, as part of the one-pass package (see later versions), on-the-fly bias adjustment is also included. In order to include this feature we had to know at what point we could start using the climatology constructed from the streamed data. This was our method, and we thought it was interesting to detail in the context of one-pass bias adjustment. We will revise this section, more clearly stating the motivation and why it is relevant in the context of this paper.
All minor comments will be addressed.
Please feel free to respond with any comments / suggestions on the above. We will re-submit the revised manuscript after the open discussion has closed.
Data sets
nextGEMS cycle3 datasets: statistical summaries for streamed data from climate simulations nextGEMS, K. Grayson https://doi.org/10.5281/zenodo.12533197
Model code and software
DestinE-Climate-DT/one_pass: v0.6.2 K. Grayson https://doi.org/10.5281/zenodo.14591828
Interactive computing environment
kat-grayson/one_pass_algorithms_paper K. Grayson https://doi.org/10.5281/zenodo.12533064