the Creative Commons Attribution 4.0 License.
Reduced floating-point precision in regional climate simulations: An ensemble-based statistical verification
Abstract.
The use of single precision in floating-point representation has become increasingly common in operational weather prediction. Meanwhile, climate simulations are still typically run in double precision. The reasons for this are likely manifold and range from concerns about compliance with conservation laws to the unknown effect of single precision on slow processes, or simply the less frequent opportunity and higher computational costs of validation.
Using an ensemble-based statistical methodology, Zeman and Schär (2022) could detect differences between double- and single-precision simulations from the regional weather and climate model COSMO. However, these differences are minimal and often only detectable during the first few hours or days of the simulation. To evaluate whether these differences are relevant for regional climate simulations, we have conducted 10-year-long ensemble simulations over the EURO-CORDEX domain in single and double precision with 100 ensemble members.
By applying the statistical testing at a grid-cell level for 47 output variables every 12 or 24 hours, we only detected a marginally increased rejection rate for the single-precision climate simulations compared to the double-precision reference. This increase in the rejection rate is much smaller than that arising from minor variations of the horizontal diffusion coefficient in the model. Therefore, we deem it negligible.
To our knowledge, this study represents the most comprehensive analysis so far on the effects of reduced precision in a climate simulation for a realistic setting, namely with a fully fledged regional climate model in a configuration that has already been used for climate change impact and adaptation studies. The ensemble-based verification of model output at a grid-cell level and high temporal resolution is very sensitive and suitable for verifying climate models. Furthermore, the verification methodology is model-agnostic, meaning it can be applied to any model. Our findings encourage exploiting the reduction of computational costs (∼30 % for COSMO) obtained from reduced precision for regional climate simulations.

Notice on discussion status
The requested preprint has a corresponding peerreviewed final revised paper. You are encouraged to refer to the final revised version.

Interactive discussion
Status: closed

RC1: 'Comment on egusphere-2023-2263', Anonymous Referee #1, 11 Dec 2023
General comments
The manuscript is well written, concise, and represents an important contribution to the efforts of using reduced float precision in climate modelling.
The approach of ensemble-based verification is interesting and thorough. The success of single precision is encouraging.
Additionally, I found the introduction to be a useful review of relevant literature.
The manuscript is in a good state for publication subject to some minor questions below.
Specific comments
 This is beyond the main scope of the paper, but it would be interesting to include some discussion on implications for 16-bit, stochastic rounding etc., especially since this is mentioned in the introduction.
 Similarly, are there any plans for extensions to the work in 16-bit? I appreciate the hardware is not readily available; the papers you cite in the introduction use a half-precision emulator…
 Some discussion/justification on the choice of KS over other metrics such as Wasserstein would be useful, especially since KS causes problems w.r.t. the steep distribution functions of the bounded variables.
 The choice of the 95th percentile for the rejection of the global null hypothesis is reasonable, but I wonder how robust the results are to different percentile choices? The shape of the empirical distribution is surely relevant here, which is somewhat lost by considering only the mean.
 Last sentence of Appendix C: some more details on the “more comprehensive experiment” would be useful for reproducibility, even if the results are not shared.
 Figure 2: why are some variables labelled in grey, some in black? If it is the same reason as given in the caption of Figure 3 I would move the explanation earlier.
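To make the KS-versus-Wasserstein point above concrete, here is a minimal Python sketch. It is not taken from the manuscript or from the review: the function names and the cloud-fraction-like sample values are hypothetical, and the Wasserstein shortcut assumes two samples of equal size. It shows how a negligible shift in a bounded variable that piles up at its lower bound produces a large KS statistic but a tiny Wasserstein-1 distance.

```python
from bisect import bisect_right


def ks_statistic(a, b):
    """Two-sample KS statistic: largest vertical gap between the empirical CDFs."""
    sa, sb = sorted(a), sorted(b)
    gaps = (
        abs(bisect_right(sa, x) / len(sa) - bisect_right(sb, x) / len(sb))
        for x in set(sa) | set(sb)
    )
    return max(gaps)


def wasserstein_1d(a, b):
    """Wasserstein-1 distance for two equally sized 1-D samples."""
    if len(a) != len(b):
        raise ValueError("this shortcut needs samples of equal size")
    # For equal sizes the distance is the mean gap between sorted values.
    return sum(abs(x - y) for x, y in zip(sorted(a), sorted(b))) / len(a)


# Hypothetical cloud-fraction-like values, bounded to [0, 1] and piled up at 0.
control = [0.0] * 8 + [0.5, 1.0]
# Same sample, except that the zeros are shifted by a negligible 1e-6.
test = [1e-6] * 8 + [0.5, 1.0]

ks = ks_statistic(control, test)
w1 = wasserstein_1d(control, test)
print(f"KS statistic: {ks:.2f}, Wasserstein-1 distance: {w1:.1e}")
```

The KS statistic only measures the vertical gap between the distribution functions, so it reacts strongly wherever they are steep, whereas the Wasserstein distance also weighs how far the values have actually moved.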
Citation: https://doi.org/10.5194/egusphere-2023-2263-RC1
RC2: 'Comment on egusphere-2023-2263', Milan Klöwer, 19 Apr 2024
Summary
The authors present a statistical verification technique to test whether two datasets are drawn from the same statistical distribution. They apply this technique to an evaluation of single-precision arithmetic used in COSMO (most model components) for 10-year climate simulations. The methodology is nicely illustrated (appreciation for Fig 1) and generally the paper is written for others to reproduce it, making it a useful manuscript for future projects. The method described is a statistical verification between (1) a baseline model, (2) some change to it and (3) an “anti-control” ensemble used to quantify the expected deviation from the baseline due to uncertainties in parameterizations and tuning parameters. The authors therefore conclude that there is little reason not to use single precision in COSMO for climate simulations as well, except for one purely diagnostic variable, which the method identifies as suffering significantly from the use of single precision.
I generally recommend this paper to be published with minor corrections, see below for a list of points I created while reading the manuscript. Most of them are (strong) recommendations that you can object to though if you have a good reason for it (please explain). However, I also want to raise three “major” points that don’t require major work, but that should be discussed in a paragraph or two as I feel this is currently missing from the text.
I enjoyed reading the manuscript, many thanks. It is generally well written and concise, illustrating a method and discussing some results in three figures without being overly complicated.
Major points
1. Conditional distributions
Your method generally uses *unconditional* probability distributions. That means two adjacent grid points could covary in double precision but be independent with single precision even though both grid points follow an unchanged unconditional probability distribution. In an information-theoretic sense, the mutual information between grid points could change even though the (unconditional) entropy is unchanged. As I see it, this could be a significant impact of precision (or any other change of the model) but your method would not detect it. A real-world example might be precipitation, which could occur in large patches (high mutual information) in the control but in smaller patches (lower mutual information) in the test ensemble. Clearly, analysing conditional probability distributions would explode the dimensionality of the problem, which, at 10 TB, is already relatively large. Could you elaborate on this aspect in 2.1?
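The concern above can be illustrated with a small Python sketch. It is an illustration only, not the authors' method: the ensemble values are hypothetical, and the Pearson correlation is used here as a simple stand-in for the mutual information mentioned in the comment. Both ensembles have exactly the same per-grid-point (marginal) samples, so any test applied grid point by grid point sees no difference, yet the dependence between the two neighbouring grid points differs strongly.

```python
def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical values at two adjacent grid points, one entry per ensemble member.
control_a = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
control_b = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]  # neighbour co-varies with point a

test_a = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
test_b = [3.0, 0.0, 5.0, 1.0, 4.0, 2.0]  # same values, reshuffled across members

# The marginal samples are identical, so a per-grid-point test cannot tell them apart.
same_marginals = sorted(control_a) == sorted(test_a) and sorted(control_b) == sorted(test_b)

r_control = pearson(control_a, control_b)
r_test = pearson(test_a, test_b)
print(same_marginals, round(r_control, 3), round(r_test, 3))
```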
2. Rejection is binary
After reading the manuscript it is unclear to me what the effect is of the binary rejection, instead of having some error metric that would penalise larger deviations more. For example, if cloud cover has a rejection rate of n% because cloud cover is too high, then, as far as I understand it, it doesn’t really matter whether in these situations cloud cover is 100% or 200%. But the latter is obviously not what you would tolerate a single precision simulation to output. I see the argument that you shouldn’t penalise large deviations because maybe they don’t matter more if they don’t have kickoff effects causing other variables or grid points or timesteps to be rejected more frequently.
3. Comparison to other methodologies
Reading Appendix A I’m left with the feeling that both methodologies have their outliers (for different reasons) and that an even better method would be to take the minimum rejection rate between both. Because then (if I see this right from Fig. A1) only HPBL would stick out as an anomaly, which you also identify in the results. I see you have your arguments for your method over the Benjamini-Hochberg method, but the discussion in Appendix C also shows that neither method is actually robust to geophysical data distributions. I generally think the manuscript would gain a lot of strength if you incorporated the ideas from Appendix C directly into the main text, so as not to leave the reader with a figure where you identify most of the outlier variables as artefacts of the methodology. I mention in the minor comments that converting the data to ranks might help, or maybe you can think of another way to deal with not-so-normal distributions. What you suggest, to just round all data to 4 decimal places, is I think one possibility (although I’d round in binary, not in decimal) because your probability distributions should be well resolved by the numerical precision of your data. So whether you have 23 mantissa bits of precision (in the data, not the compute) or 20 shouldn’t have an impact on the rejection rate. I can see a method where you accept a rejection rate as robust when rounding the data to n-1, n-2 or n-3 mantissa bits does not have an impact. But note that different variables have a different bitwise real information content. E.g. temperature or CO2 have much more information in the significand/mantissa than a variable that varies over orders of magnitude in the atmosphere, say, specific humidity (see Klöwer et al. 2021, https://doi.org/10.1038/s43588-021-00156-2). I’d love to see a version of the manuscript that does not require Appendix C to explain some artefacts.
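Rounding "in binary" to a reduced number of mantissa bits, as suggested above, could look like the following Python sketch. It is a hedged illustration under stated assumptions: the function name `round_mantissa`, the `keepbits` parameter and the example temperature are hypothetical, the input is assumed finite, and ties are rounded away from zero for simplicity rather than to even. A robustness check in the spirit of the comment would then recompute the rejection rate on data rounded to 22, 21 and 20 bits and compare.

```python
import struct


def round_mantissa(x: float, keepbits: int) -> float:
    """Round a finite value to float32, keeping `keepbits` explicit mantissa bits."""
    if not 1 <= keepbits <= 23:
        raise ValueError("keepbits must be between 1 and 23")
    drop = 23 - keepbits  # number of trailing mantissa bits to discard
    # Reinterpret the float32 representation of x as an unsigned 32-bit integer.
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    if drop:
        half = 1 << (drop - 1)
        keep_mask = 0xFFFFFFFF ^ ((1 << drop) - 1)
        # Add half of the last kept bit, then truncate: round to nearest,
        # with ties going away from zero (a simplification of ties-to-even).
        bits = (bits + half) & keep_mask
    return struct.unpack("<f", struct.pack("<I", bits))[0]


temperature = 296.4568  # hypothetical value in kelvin
rounded = {k: round_mantissa(temperature, k) for k in (23, 22, 21, 20)}
for keepbits, value in rounded.items():
    print(keepbits, value)
```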
Minor points
Abstract
L11: “rejection rate” would benefit from a bit more explanation, rejected based on what? You of course elaborate on it in the text, but maybe just name the verification you use given this is the abstract? Maybe just “rejection rate, highlighting little statistical difference between the …”
L12: “negligible as masked by model uncertainty” maybe? To explain the meaning of your anti-control?
Intro
L24: Maybe add memory and/or data requirements?
L25: Or length of integration? Number of variables? Physical accuracy (e.g. more accurate parameterizations are often more costly), in the narrative of more accuracy with less precision?
L26: For most applications only float64 -> float32 is straightforward. 16-bit arithmetic often requires adjusted algorithms, and getting performance out is also not necessarily straightforward. You say this around L64 but maybe adjust the usage of “straightforward” here, or only refer to float64 -> float32.
L27: Remove “typically” and just state the important points of the IEEE 754 standard here.
L28: Note that this is the normal range, subnormals are smaller, please add to be precise.
L31: Note that around the number you mention the following float64 are representable
 296.45678912345664
 296.4567891234567
 296.45678912345676
Rounded to float32 the representable floats (round to nearest in the middle) are
 296.45676f0
 296.4568f0
 296.45682f0
While your point holds please choose a float64,float32 pair that’s actually representable not “something like”.
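The neighbouring representable values listed above can be checked with a short standard-library Python sketch. The helper names are hypothetical, `math.nextafter` needs Python 3.9 or newer, and the float32 helper assumes a positive finite input. Note that Python prints float32 values widened to double precision, so more digits appear than in the shortest display used in the list above.

```python
import math
import struct


def float64_neighbours(x: float):
    """Return (previous, x, next) representable float64 values."""
    return (math.nextafter(x, -math.inf), x, math.nextafter(x, math.inf))


def float32_neighbours(x: float):
    """Return (previous, nearest, next) float32 values around a positive finite x."""
    # Round x to float32 and reinterpret its bits as an unsigned integer;
    # for positive finite floats, adjacent integers are adjacent floats.
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return tuple(
        struct.unpack("<f", struct.pack("<I", b))[0] for b in (bits - 1, bits, bits + 1)
    )


value = 296.4567891234567
n64 = float64_neighbours(value)
n32 = float32_neighbours(value)
print("float64:", n64)
print("float32:", n32)
print("spacing float64:", n64[2] - n64[1], "float32:", n32[2] - n32[1])
```

Between 256 and 512 the spacing of representable values is 2^-44 in float64 and 2^-15 in float32, which is why only about seven significant decimal digits survive the conversion.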
L33: you mention discretization, model and boundary condition error, there’s also initial condition error, maybe add?
L35: Maybe use “arithmetic intensity” to distinguish this concept from the use of “operational(ly)” in terms of operational, i.e. regularly scheduled numerical weather predictions?
L44: you switch between single precision and SP, I don’t see the need to abbreviate but in any case be consistent?
L45: I would expect computing architecture or compiler settings to play a role too? Have you tested that too or is that less relevant given where and how COSMO is run? I’m not saying you should test that performance, but maybe just outline to the reader what could impact performance improvements.
L97 and L98: the significanD or significanT bits.
L99: when first mentioning stochastic rounding, I’d provide a reference like https://doi.org/10.1098/rsos.211631
L128: to
L150: I like these sentences summarising the implications of your rejection procedure. But it might be helpful to the reader to discuss, say, two cases, one where single precision causes a tiny bias globally, would this be rejected? As I see, not if that bias is masked by the variance of fCR. And two, a case where single precision changes the climate in a small country but has no impacts in other regions of the world. Could these be added for clarification?
Figure 1: This is great!
L162: Call it multiplicative noise? There’s an analogy here to the stochastically perturbed parameterization tendencies (SPPT) where the perturbations also take this form but you apply it to the actual variables, I assume R has some autocorrelation in space and time? Maybe irrelevant for your study though.
L168: Maybe add a small table for the 7 ensembles?
L173: Can you not reduce diffusion because of numerical instability? If yes, maybe state why you’re changing the coefficients only in one direction. Also while I see a change in diffusion as a reasonable control to test against, you could have also changed a physical parameterization (e.g. make convection stronger/weaker). Maybe elaborate more on your decision why you created the control as you did?
Figure 2: Maybe state the precision on each panel? It’s double everywhere where it’s not single I guess, just for clarity.
L187: Could you elaborate where rejections in the ID test come from? As I see it you only perturb the initial conditions so rejections are solely due to internal variability which however is small because of the identical boundary conditions leaving little room for the weather over Europe to evolve onto an independent trajectory? So this could be a storm away from the boundaries that is strong in some ensemble members but not in the other ensemble?
L193: Differences … in the initial condition perturbations?
L197: Given that some variables suffer from single precision as you outline, do you think this can have an impact on others being on average those 25% off? E.g. if cloud cover has a systematic bias with single precision this could introduce a bias on surface temperature (and consequently other variables) that’s not large but enough to systematically cause those 25% higher rejection rate? Maybe a discussion of cross-variable impacts could be added?
L205: Use a rank-based test instead?
L207: This sounds like also output precision (assuming you always output single?) is of relevance here. I think it makes sense to round all the data to something slightly less than single precision anyway. That way any clustering of data to identical values is at least the same across all simulations regardless of the precision used for computations.
L211: I find it difficult to think of every sensitivity to precision as a bug. There are algorithms that are stable only at high precision without them being coded up incorrectly. E.g. stagnation in large sums of small numbers due to insufficient precision can be overcome with a compensated summation but that comes at additional computational cost. In other situations you might be able to solve precision issues by computing the sum in reverse (possibly an easy fix that could be considered a “bug”). Maybe write “due to rounding errors in algorithms whether easy to fix (e.g. a bug) or not.”
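The stagnation effect and the compensated-summation remedy mentioned in this comment can be sketched in Python as follows. This is an illustration under assumptions, not code from COSMO or the manuscript: float32 arithmetic is emulated by rounding every intermediate result through `struct`, the function names are hypothetical, and Kahan summation is used as one common form of compensated summation.

```python
import struct


def f32(x: float) -> float:
    """Round a Python float (float64) to the nearest float32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def naive_sum32(start: float, term: float, n: int) -> float:
    """Add `term` to `start` n times, rounding to float32 after every addition."""
    total = f32(start)
    for _ in range(n):
        total = f32(total + term)
    return total


def kahan_sum32(start: float, term: float, n: int) -> float:
    """Same sum with Kahan compensation, every operation rounded to float32."""
    total = f32(start)
    comp = 0.0  # running estimate of the rounding error lost so far
    for _ in range(n):
        y = f32(term - comp)
        t = f32(total + y)
        comp = f32(f32(t - total) - y)
        total = t
    return total


term = 2.0 ** -25  # smaller than half the float32 spacing at 1.0 (2**-24)
n = 1024
exact = 1.0 + n * term  # exactly representable in float64
naive = naive_sum32(1.0, term, n)
kahan = kahan_sum32(1.0, term, n)
print("exact:", exact, "naive:", naive, "kahan:", kahan)
```

The naive sum never leaves 1.0 because each increment is rounded away, while the compensated version recovers the lost increments at the price of three extra floating-point operations per term, which is the additional computational cost the comment refers to.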
L212: Could you mark those variables in Fig 3 somehow? For anyone repeating your analysis I find this an important concept to highlight that rounding errors from some variables cannot propagate to others which certainly helps in finding in which calculation precision is lost.
L215: noting -> reiterating given you already said this?
Fig 3: You write “Height of the boundary layer” but abbreviate it as HPBL, add the “planetary” or call it HBL for consistency? Also decision rate vs rejection rate?
L222: after -> during?
L223: Could you not present a version of Fig 2 and 3 where these technical artefacts are somehow circumvented / the methodology adjusted? Most not-so-careful readers would probably look at those figures and conclude “single precision is bad for cloud or soil modelling so we shouldn’t do this”.
L240ff: Just want to appreciate this list of recommendations that you give to readers, very helpful I believe!
L295: Temporal resolution usually comes with more constraints on compute and data storage. But would you recommend using time averages or time snapshots if both were available?
Fig C1: I find the grey background to highlight rejection a bit of an overkill, and it doesn’t make the purple lines particularly readable, given all are rejected, just write this in the caption and make the background white again?
Citation: https://doi.org/10.5194/egusphere-2023-2263-RC2
AC1: 'Comment on egusphere-2023-2263', Hugo Banderier, 24 May 2024
Christian Zeman
David Leutwyler
Stefan Rüdisühli
Christoph Schär