Statistical summaries for streamed data from climate simulations: One-pass algorithms (v0.6.2)
Abstract. Projections from global climate models (GCMs) are a fundamental information source for climate adaptation policies and socio-economic decisions. As such, these models are being progressively run at finer spatio-temporal resolutions to resolve smaller scale dynamics and consequently reduce uncertainty associated with parameterizations. Yet even with increased capacity from High Performance Computing (HPC) the consequent size of the data output (which can be on the order of Terabytes to Petabytes), means that native resolution data cannot feasibly be stored for long time periods. Lower resolution archives containing a reduced set of variables are often all that is kept, limiting data consumers from harnessing the full potential of these models. To overcome this growing challenge, the climate modelling community is investigating data streaming; a novel way of processing GCM output without having to store a limited set of variables on disk. In this paper we present a detailed analysis of the use of one-pass algorithms from the 'one-pass' package, for streamed climate data. These intelligent data reduction techniques allow for the computation of statistics on-the-fly, enabling climate workflows to temporally aggregate the data output from GCMs into meaningful statistics for the end-user without having to store the full time series. We present these algorithms for four different statistics: mean, standard deviation, percentiles and histograms. Each statistic is presented in the context of a use case, showing the statistic applied to a relevant variable. For statistics that can be represented by a single floating point value (i.e., mean, standard deviation, variance), the accuracy is at the order of the numerical precision of the machine and the memory savings scale linearly with the period of time covered by the statistic. For the statistics that require a distribution (percentiles and histograms), we present an algorithm that reduces the full time series to a set of key clusters that represent the distribution. Using this algorithm we find that the accuracy provided is well within the acceptable bounds for the climate variables examined while still providing memory savings that bypass the unfeasible storage requirements of high-resolution data.
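For readers new to the idea, the sketch below illustrates a single-pass (streaming) update of a mean and variance in the style of Welford's algorithm. It is a generic illustration of the technique described in the abstract, not the one_pass implementation; the toy field shape and sample count are arbitrary.

```python
import numpy as np

def one_pass_update(count, mean, m2, new_field):
    """Welford-style streaming update of mean and variance.

    count, mean and m2 (the running sum of squared deviations) are the
    only quantities kept in memory; the full time series is never stored.
    """
    count += 1
    delta = new_field - mean
    mean = mean + delta / count
    m2 = m2 + delta * (new_field - mean)
    return count, mean, m2

# Stream 720 hourly "time steps" of a toy 2-D temperature field.
rng = np.random.default_rng(0)
fields = rng.normal(280.0, 5.0, size=(720, 10, 10))
count, mean, m2 = 0, np.zeros((10, 10)), np.zeros((10, 10))
for field in fields:
    count, mean, m2 = one_pass_update(count, mean, m2, field)
variance = m2 / (count - 1)

# The streamed summaries agree with the two-pass NumPy reference
# to within machine precision.
assert np.allclose(mean, fields.mean(axis=0))
assert np.allclose(variance, fields.var(axis=0, ddof=1))
```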
Status: final response (author comments only)
RC1: 'Comment on egusphere-2025-28', Anonymous Referee #1, 17 Feb 2025
In this manuscript, the authors present an alternative way to process “on-the-fly” (online) model output data, with the motivation being reduced data/memory usage and potential ease for users. Ultimately, this is an interesting manuscript, albeit weird. I think it can be made worthy of publication upon a revision. In general, the authors seem to be factual, careful, and nuanced. I think most of the work is relatively high quality. There are some problems though (the first one potentially fatal for the manuscript).
First: My reading of the manuscript is that it is making use of an algorithm package (whose version is in the title no less), but upon examination of the code availability section, I don’t think the authors actually used the package at all in their analysis. Instead, they “simulated” using the package. I find this quite odd. I wrote a comment about this at the end. I don’t know if this is intentional or not, but this is pretty deceptive. I invite the authors to explain, and I will keep an open mind.
Second: The authors didn’t explain how this type of algorithm/package would be used with an actual climate model running. Maybe it would be used in a futuristic cloud-compute setup? (They do cite some cloud-computing works.) I think this manuscript would greatly benefit from a detailed example (ideally a workflow) of how this would be implemented in an end-to-end fashion.
Third: I think this manuscript would benefit from a comparison to the “online diagnostics” route. In the “online diagnostics” route, climate scientists (and developers) write functionalities to output specific items of interest, without needing too much data. Climate models already do online mean, min, max, etc. in their output streams; nothing here is novel. That could easily be extended to all sorts of statistics. In fact, it could be extended in a composable fashion to even more complex and variable algorithms. For example, consider the following: “I want to calculate the globally averaged temperature at cloud-top but with a threshold of 280 K”. An algorithm can be written to identify the cloud top, find the temperature there, apply the threshold, and then do a horizontal reduction, in a composable way. This algorithm can then be run after each time step in the model, storing the variable to be output in memory or writing it out immediately. I think the authors could improve this manuscript if they compare and contrast their one-pass streamed approach to that of models natively doing more online calculations themselves (potentially on idling CPUs, if the models run on GPUs) instead of an external module/package.
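To make the composability point concrete, a minimal sketch of such an online diagnostic is given below. The field names, the 5 % cloud-fraction criterion, and the reading of the 280 K threshold (keep columns colder than the threshold) are illustrative assumptions, not anything prescribed by the manuscript or by a particular model.

```python
import numpy as np

def cloud_top_temperature(temp, cloud_frac, area_weights, threshold=280.0):
    """Illustrative online diagnostic, composed from simple steps.

    temp, cloud_frac: (nlev, nlat, nlon) arrays with level 0 at the model
    top; area_weights: (nlat, nlon) grid-cell area weights.
    """
    # Step 1: identify the cloud-top level in each column
    # (first level from the top where cloud fraction exceeds 5 %).
    cloudy = cloud_frac > 0.05
    has_cloud = cloudy.any(axis=0)
    top_level = cloudy.argmax(axis=0)

    # Step 2: temperature at that level.
    jj, kk = np.meshgrid(np.arange(temp.shape[1]), np.arange(temp.shape[2]),
                         indexing="ij")
    t_top = temp[top_level, jj, kk]

    # Step 3: apply the threshold, restricted to cloudy columns.
    mask = has_cloud & (t_top < threshold)
    if not mask.any():
        return float("nan")

    # Step 4: horizontal (area-weighted) reduction to a single scalar,
    # which is all that needs to be stored or written out each time step.
    return float(np.average(t_top[mask], weights=area_weights[mask]))
```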
Other comments I wrote while (re)reading the manuscript:
L2: nitpick: “as such” is kinda weird here in that it made me think running at high-res was a consequence of the preceding sentence, but it actually is more of an antecedent of the following sentence.
L5: nitpick: probably misplaced comma (I’d move it to after HPC instead)
L7: the phrase “data streaming” is introduced like a known quantity that is specific to climate science, but in fact it is neither known nor usually associated with climate science. Maybe jargon? If I were you, I would consider removing “data stream;” (the phrase with the semicolon) and let the sentence stand without the disruption.
L9: one-pass may necessitate explanation here (so I would rephrase)
L9: intelligent doesn’t seem like the right word (maybe efficient?); these techniques aren’t learning things, or are they?
L17: I think “well within the acceptable bounds” is an understatement. In my opinion and understanding, these algorithms basically recover accuracy to within an insurmountable precision limit, so basically as good as one would get anyway. I would rephrase to finish on a stronger point.
L35–42: I think this may surprise an average climate scientist involved in model development and evaluation (such as myself). We always had the ability to access data before simulations are done and one could obviously write stuff in a continuous manner. I think what you want to highlight is that you’re algorithmically calculating some interesting statistics that capture specific interests. That is not really novel, not unique, but it is interesting and worthy of publication. In other words, I don’t see the “novel” method and “unprecedented” reduction of “meaningful” output. Can you please (carefully) elaborate?
L70: the last sentence here gives the wrong impression of what you’re trying to say. Your manuscript is supposed to showcase those algorithms, so “requesting” readers to go follow some other package documentation may not be the best way to politely say, “We don’t discuss the code implementation of specific algorithmic details related to each statistic, and we instead focus on showing their utility to climate analysis” …
L194: I’d elaborate on the different results part
L219: I think you can say that plainly at the outset without taking the reader on an unnecessary voyage. I am not sure if the added info (analysis) is informative otherwise. Feel free to disagree
L268: so, the NumPy-calculated one is the “reference” right? (Note, I would refer to NumPy like they refer to it in their papers/docs, uppercase N and P)
Section 5.3: If I were you, I would simply delete this section. Your concluding paragraph (around L385) basically shows that most of the preceding analysis/“results” should be taken with a giant grain of salt. I would simply avoid the distraction, and maybe add something about how the analysis in 5.2 could be done on precipitation with a meaningful cutoff and/or a different type of underlying assumption (distribution type).
L431: But this doesn’t make sense to me. As a scientist, what I care about is the stability of the value, not an arbitrarily defined metric of convergence. In Figure 6, sigma_n goes from roughly 0.35 at 200 samples to 0.15 at 600 samples, but it is within those “convergence” lines.
L432: Ok, but how does that impact your assertions earlier about all the savings and not having to wait for model simulations to run a lot of steps? I guess you’re saying, one kinda has to wait a lot of steps to get “accurate” stuff out of these one-pass algorithms? Like 200 steps or so?
Section 6: Like Section 5.3, I think the convergence analysis is likely misleading, potentially counterproductive, and perhaps better left out. Your goal in this manuscript is to showcase something useful for users of this method. My assumption is that if people choose to use this type of one-pass algorithm, they would do so on relatively high-frequency data and they would understand the risk of under-sampling. Maybe I am just not getting what you’re trying to do here? Could you motivate it better? Is it anything other than something like “don’t take the mean of the first 5 time steps as the mean of the next 5000 time steps”??
L508: Thanks for the links. A few questions/suggestions: 1) might be good to list the GitHub links here as well? 2) is there a hosted version of the package docs somewhere? 3) how was the one-pass package used in making these figures? I didn’t see any “import one_pass” or “from one_pass…” … looks like you reimplemented/“simulated” everything from scratch? If so, this means the package in the title of this manuscript wasn’t used at all and the whole manuscript is misleading. I also don’t quite understand why the numerics are returning round-off errors where I would’ve expected exact answers (e.g., for the means).
Citation: https://doi.org/10.5194/egusphere-2025-28-RC1
AC1: 'Comment on egusphere-2025-28: Preliminary reply to RC1', Katherine Grayson, 27 Feb 2025
Firstly, the authors would like to thank the reviewer for taking the time to review the manuscript thoroughly. This is not a full reply to all the reviewer comments (we will properly address all comments after receipt of all the other reviews); it is just a comment on the first point made in RC1. The reviewer highlights a very valid point that the analysis provided in the notebooks does not actually use the one_pass package. We would like to briefly clarify this issue. The reason for this is that, when the paper was initially submitted, the actual package we developed (https://doi.org/10.5281/zenodo.14591828) was not allowed to be published open source due to contractual obligations of the project. While this was unfortunate, we felt this different method of data handling was still highly relevant for the community, and we wanted the paper to reflect how these algorithms could be used, including their pros and cons, rather than their specific implementation in our software. As such, we wrote the notebooks trying to be as transparent as possible, exposing the internal mathematical functions of the package to re-create the same results.
Ultimately, however, it was decided that the paper could not be submitted without the underlying package being fully open source, and we waited for some months to be allowed to publish the package openly. When this was decided we resubmitted the article (as it is now); however, the notebooks used to make the figures have yet to be updated. We would like to assure the reviewer that all figures can be re-created using the package (indeed, much more simply, with one line of code) and we will re-do all the notebooks explicitly using our package. There was no intent to be deceptive, in fact quite the opposite, but we can see very clearly how this has resulted in confusion.
Citation: https://doi.org/10.5194/egusphere-2025-28-AC1
AC2: 'Further reply on AC1', Katherine Grayson, 21 Mar 2025
We now take this opportunity to address in more detail the other comments from RC1.
Again, thank you; we really appreciate all the feedback. Some comments on your review are given below:
For the second concern, we agree that the manuscript would benefit from a description of the overall setup. Originally, as explained in the previous response, this was not included because the package was not allowed to be open source, so we kept the scope of the paper to the more theoretical implementation. As these restrictions have been removed, we are working on a new section addressing your comments, including details of the workflow manager as well as a graphical representation.
For the third concern, while it is true that models often provide mechanisms to compute variable statistics during the simulation runs, the key point here is that a model no longer needs to be tailored to the specific needs of a data consumer downstream. The increasing resolution of climate models introduces much complexity when trying to tailor the model outputs to the requirements of the data consumer. While it is common for models to produce monthly means of certain variables, getting different modelling groups to align on specific statistics for downstream users (e.g. the 99th percentile of wind speed) would be challenging, to say the least. The way we envision the Climate Digital Twin (the project this package has been designed for) is that data consumers may enter or leave the simulation data stream at any time, according to their needs, and do not enforce their needs onto the design of the model. Instead, they are provided with a mechanism to select their fields of interest, as well as the aggregation method and frequency for those fields. The necessary operations are then performed at a later stage downstream by an independent process, making the whole ecosystem more scalable. We plan to explain this in the new section we will include to make this clearer, also addressing the minor comment on L35–42.
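As a rough sketch of what such a consumer request might look like, the snippet below uses invented key names purely for illustration; it is not the actual one_pass configuration schema.

```python
# Hypothetical data-consumer request (key names are illustrative only,
# not the one_pass configuration schema): the consumer picks the field,
# the statistic, and the aggregation window; an independent downstream
# process applies the request to the simulation data stream.
consumer_request = {
    "variable": "10m_wind_speed",
    "statistic": "percentile",        # e.g. mean, std, percentile, histogram
    "percentiles": [0.50, 0.99],
    "aggregation_window": "monthly",  # period the statistic summarises
}
```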
We will address all minor comments; however, we respond to a few of them here as well.
L219: We believe the description and explanation of how the t-digest method works is valuable here. RC2 has requested more details on how the method works, and the purpose of this section was to add details that cannot be found in other literature and are relevant for the package implementation. In line with comments from RC2 we will develop this section.
Section 5.3: We don't believe removing this section is a good idea. The aim of the paper is to show how well (or not) these algorithms work for different climate variables. Precipitation is a commonly used climate variable, and detailing the shortcomings of this method for this type of extreme-value distribution is, we believe, informative and worthwhile.
Section 6: Thanks for your comments regarding this section; it has also raised doubts with RC2. We were unsure whether this section should be included and recognise that it requires better motivation. We included it because, in later releases of the one-pass package, bias adjustment in streaming mode is available, and this analysis was required to understand how long we needed to sample the stream in order to achieve a representative climatology. We plan on revising this section and will most likely keep it in the re-submission; however, if upon re-submission both reviewers are not convinced, we will remove it.
Feel free to respond to or give feedback on any of the points. We will resubmit the manuscript after the open discussion is closed.
Citation: https://doi.org/10.5194/egusphere-2025-28-AC2
RC2: 'Comment on egusphere-2025-28', Anonymous Referee #2, 06 Mar 2025
This timely manuscript describes a powerful set of "one-pass" methods to compute statistics on-line during a model integration without first needing to write model output to disk. This is especially useful for very large kilometer-scale weather and climate simulations that produce enormous volumes of data, as this method works without sacrificing spatial resolution or temporal frequency. The demonstrations of efficacy and memory savings are powerful arguments for the value of this approach. However, I didn't quite understand the t-digest method used for computing histograms and percentiles, and some of the analyses were difficult to understand. Please see my more comprehensive review attached.
AC3: 'Reply on RC2', Katherine Grayson, 21 Mar 2025
As with RC1, we would firstly like to thank the reviewer for taking the time to thoroughly review the manuscript. The review has been highly beneficial, and we will address all comments properly in the resubmitted version.
Some comments in response:
Scale function
The scale function defines a monotonically increasing function that sets the 'edge values' for each cluster based on the corresponding percentile. Due to its hyperbolic shape, clusters at the tails are smaller and will contain less data. The reason we have not gone into more detail about how the method works is that it is detailed in Dunning, T. and Ertl, O.: Computing Extremely Accurate Quantiles Using t-Digests, arXiv [preprint], arXiv:1902.04023, 2019. Figure 1 of their paper shows a visual representation of the scale function and how it translates quantiles to their corresponding k value. When increasing delta (the compression), the shape of the scale function remains the same, but the corresponding limits on the y-axis (k) increase, allowing more clusters to represent the same distribution. I have attached a .gif that shows how the graph looks as the compression (delta) is increased. As the visual of the scale function is given in Dunning's paper, we have elected not to show it here, but if the reviewer thinks it would greatly aid comprehension of the method we will look to include it. A useful visual is shown in Figure 3 here https://www.gresearch.com/news/approximate-percentiles-with-t-digests/ and we could look to emulate something similar.
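For reference, the snippet below evaluates the standard k_1 scale function from Dunning and Ertl (2019), k(q) = δ/(2π) · arcsin(2q − 1), and shows how it caps cluster sizes near the tails; whether the t-digest library used by one_pass applies exactly this variant is an assumption made here for illustration.

```python
import numpy as np

def scale_k1(q, delta):
    """Standard t-digest scale function k_1(q) = delta/(2*pi) * arcsin(2q - 1).

    Monotonically increasing in q; a cluster spanning quantiles
    [q_left, q_right] is only allowed if k(q_right) - k(q_left) <= 1,
    so where the curve is steep (the tails) clusters must stay small.
    """
    return delta / (2.0 * np.pi) * np.arcsin(2.0 * q - 1.0)

delta = 100  # compression: larger delta -> more clusters, higher accuracy
for q in (0.01, 0.50, 0.99):
    dq = 1e-4  # numerical slope of k(q) around the quantile of interest
    slope = (scale_k1(q + dq, delta) - scale_k1(q - dq, delta)) / (2 * dq)
    # The maximum weight of a cluster centred near q is roughly 1/slope
    # (as a fraction of all data seen so far).
    print(f"q = {q:.2f}: max cluster weight ~ {1.0 / slope:.4f} of the data")
```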
Q: Should the same scale function be used for each variable?
Great point! Ideally, no. The scale function used works best for normally distributed data, as the largest clusters (the clusters containing the most data points) will be around P50. This is part of the reason why we see greater errors in precipitation, as it is very much not a normally distributed variable. However, modifying the scale function for different climate-variable distributions is a whole research topic in itself (indeed, see https://arxiv.org/pdf/2005.09599) and is unfortunately outside the scope of this paper. We are using an external package that uses the standard scale function and have suggested that they support other scale functions; however, this is not possible at the moment. We will highlight this more clearly in the resubmitted version along with some suggestions.
Q: How the t-digest method goes from its clusters to creating percentiles and histograms is unclear.
A percentile value is given by the mean value of the cluster that it corresponds to. For a normally distributed variable, P50 will be a large cluster (containing many similar values); P50 is therefore the mean of all the values in that cluster (a simplified sketch of this mapping is given below). Reviewing the manuscript, we see this has not been properly stated and we will include it. The NumPy estimates are made via the 'normal' method, where all the data is seen at once, ordered, and sorted into percentiles. We will also clearly state this.
Thanks for the comments regarding Figs 4 and 5. We will enlarge and split these, along with including better labelling. For (b) and (e), yes, exactly: each dot represents a percentile, with the NumPy estimate given along the x-axis and the t-digest estimate along the y-axis. If the estimates are exactly the same, the data will lie along the x = y line. We included the shaded areas to highlight that the percentiles we are examining are within the turbine operating limits. We will try to make this clearer.
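The sketch below illustrates the cluster-to-percentile mapping described above in its simplest form: it returns the mean of the cluster whose cumulative weight spans the requested percentile. Real t-digest implementations typically interpolate between neighbouring centroid means, so this is illustrative only.

```python
import numpy as np

def percentile_from_clusters(means, weights, p):
    """Return the mean of the cluster whose cumulative weight spans
    percentile p (0-100). Simplified: no interpolation between clusters."""
    order = np.argsort(means)
    means = np.asarray(means, dtype=float)[order]
    weights = np.asarray(weights, dtype=float)[order]
    target = p / 100.0 * weights.sum()
    idx = int(np.searchsorted(np.cumsum(weights), target))
    return means[min(idx, len(means) - 1)]

# Toy digest: three clusters summarising 1000 values.
cluster_means = [2.1, 5.0, 9.3]
cluster_weights = [150, 700, 150]
print(percentile_from_clusters(cluster_means, cluster_weights, 50))  # -> 5.0
print(percentile_from_clusters(cluster_means, cluster_weights, 99))  # -> 9.3
```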
We agree with the reviewer's final comment. We were also unsure whether this section belongs in the manuscript; the reason for including it is that, as part of the one-pass package (see later versions), on-the-fly bias adjustment is also included. In order to include this feature we had to know at what point we could start using the climatology constructed from the streamed data. This was our method, and we thought it was interesting to detail in the context of one-pass bias adjustment. We will revise this section, more clearly stating the motivation and why it is relevant in the context of this paper.
All minor comments will be addressed.
Please feel free to respond with any comments / suggestions on the above. We will re-submit the revised manuscript after the open discussion has closed.
Data sets
nextGEMS cycle3 datasets: statistical summaries for streamed data from climate simulations nextGEMS, K. Grayson https://doi.org/10.5281/zenodo.12533197
Model code and software
DestinE-Climate-DT/one_pass: v0.6.2 K. Grayson https://doi.org/10.5281/zenodo.14591828
Interactive computing environment
kat-grayson/one_pass_algorithms_paper K. Grayson https://doi.org/10.5281/zenodo.12533064