Survival analysis for droplet-freezing data: Kaplan–Meier confidence intervals and log-rank tests

Whale, Thomas F.; Barr, Sarah L.; Surawy-Stepney, Trystan

doi:10.5194/egusphere-2026-680

Preprints

09 Mar 2026

Status: this preprint is open for discussion and under review for Atmospheric Measurement Techniques (AMT).

Abstract. Droplet-freezing assays underpin immersion-mode ice-nucleation research, yet approaches to uncertainty quantification for fraction-frozen curves and derived active-site densities (n_s(T)) are inconsistent. Further, there is currently no rigorous method for testing whether the difference between two fraction-frozen curves is statistically significant. To address these issues, we recast droplet-freezing measurements as survival data and apply analysis techniques typically used in medical statistics. Using the Kaplan–Meier estimator, we derive nonparametric confidence intervals for droplet fraction frozen and n_s(T) without binning or model assumptions, matching Monte Carlo and studentized-bootstrap intervals on a literature volcanic ash ice-nucleation dataset. Confidence intervals calculated for simulated datasets show that precision improves with sample size and with steeper fraction-frozen curves. Adapting the log-rank test, we introduce a method for comparing fraction-frozen curves and demonstrate its application to literature and simulated droplet-freezing datasets. We recommend reporting Kaplan–Meier confidence intervals on droplet-freezing datasets and employing the log-rank test when comparing droplet fraction-frozen curves.
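
The interval construction summarised in the abstract can be illustrated with a minimal sketch. It is not the authors' script: it assumes plain Greenwood standard errors with a normal approximation (the preprint may use a transformed interval), treats droplets that never froze as right-censored at the end of the run, and uses hypothetical names (`km_fraction_frozen`, `active_site_density`, `area_per_droplet`).

```python
import numpy as np


def km_fraction_frozen(freeze_temps, n_unfrozen=0, z=1.96):
    """Kaplan-Meier fraction frozen with pointwise Greenwood intervals.

    freeze_temps: temperatures at which individual droplets froze.
    n_unfrozen:   droplets still liquid when the run ended (right-censored;
                  assumed here to outlast every recorded freezing event).
    z:            normal quantile, 1.96 for an approximate 95 % interval.
    """
    temps = np.asarray(freeze_temps, dtype=float)
    n_total = temps.size + n_unfrozen

    # Cooling runs warm to cold, so "time" increases as temperature falls.
    event_temps, n_events = np.unique(temps, return_counts=True)
    event_temps, n_events = event_temps[::-1], n_events[::-1]

    # Droplets still liquid just before each freezing temperature.
    at_risk = n_total - np.concatenate(([0], np.cumsum(n_events)[:-1]))

    # Kaplan-Meier product-limit estimate of the unfrozen (surviving) fraction.
    surviving = np.cumprod(1.0 - n_events / at_risk)

    # Greenwood's formula; the term is undefined once no droplet remains,
    # where the surviving fraction (and hence the standard error) is zero.
    denom = at_risk * (at_risk - n_events)
    terms = np.divide(n_events, denom, out=np.zeros_like(surviving), where=denom > 0)
    std_err = surviving * np.sqrt(np.cumsum(terms))

    frozen = 1.0 - surviving
    lower = np.clip(frozen - z * std_err, 0.0, 1.0)
    upper = np.clip(frozen + z * std_err, 0.0, 1.0)
    return event_temps, frozen, lower, upper


def active_site_density(frozen, area_per_droplet):
    """n_s(T) = -ln(1 - f(T)) / A, with A the nucleator surface area per droplet.

    The mapping is monotone in f, so interval endpoints on f carry over.
    """
    with np.errstate(divide="ignore"):
        return -np.log1p(-np.asarray(frozen, dtype=float)) / area_per_droplet


if __name__ == "__main__":
    # Illustrative data only: four droplets froze, one stayed liquid.
    temps, f, lo, hi = km_fraction_frozen([-10, -12, -12, -15], n_unfrozen=1)
    for row in zip(temps, f, lo, hi):
        print("T = %.1f  f = %.2f  CI = [%.2f, %.2f]" % row)
```

With censoring only at the end of the run, the estimate coincides with the ordinary empirical fraction frozen and the Greenwood error reduces to the binomial standard error, which is a convenient sanity check.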

Received: 04 Feb 2026 – Discussion started: 09 Mar 2026

Publisher's note: Copernicus Publications remains neutral with regard to jurisdictional claims made in the text, published maps, institutional affiliations, or any other geographical representation in this paper. While Copernicus Publications makes every effort to include appropriate place names, the final responsibility lies with the authors. Views expressed in the text are those of the authors and do not necessarily reflect the views of the publisher.

Status: open (until 14 May 2026)

RC1: 'Comment on egusphere-2026-680', Anonymous Referee #1, 10 Apr 2026
In “Survival analysis for droplet-freezing data: Kaplan-Meier confidence intervals and log-rank tests,” Whale et al. present an approach to quantifying uncertainty in droplet-freezing data and testing for differences between frozen fraction curves, based on the non-parametric Kaplan-Meier estimator of the survival function. By applying this method, they derive confidence intervals for frozen fraction curves and cumulative ice nucleation activity spectra. They also adapt the log-rank test to droplet-freezing data to test the hypothesis that two frozen fraction curves are identical. They then demonstrate both methods on selected literature and simulated datasets. The method of calculating confidence intervals they present is rigorous, easy to use, and addresses a persistent lack of standardization in ice nucleation statistics. However, I have much more significant concerns about the use of the log-rank test for ice nucleation data. Specifically, the hypothesis tested by the log-rank test is of limited use in the context of ice nucleation, where many samples may contain multiple populations of ice-active species that dominate at different temperatures in the frozen fraction spectrum. The requirement of proportional hazards is also a concern, as many frozen fraction curves in the literature do not meet this assumption, further limiting the usefulness of the test. Given that it can also only be applied to frozen fraction curves (and so generally cannot be used to compare across instruments or across samples at different concentrations), I am not convinced of the broad applicability that the authors imply when recommending this statistical test. Combined with the fact that the method of calculating confidence intervals has been published previously (Kinney et al., 2024), I am not convinced that this manuscript in its current form constitutes a sufficiently novel and significant contribution to the field.
I encourage the authors to consider whether refinements mentioned in the manuscript that address the shortcomings of the log-rank test or other approaches to statistical testing in the context of survival analysis might be more appropriate and useful to a wider range of applications. Specific comments and suggestions are below.
Major comments:
The method presented for calculating confidence intervals is robust, but I have two points that could expand its usefulness:
a. The calculation of confidence intervals with the Kaplan-Meier estimator uses the asymptotic normal distribution, which requires that the central limit theorem apply. It is not clear from the manuscript how many droplets are required for this assumption to be accurate. In Figure 2, the authors consider droplet freezing experiments with different numbers of droplets and conclude that the number of droplets used is ‘essentially a matter of taste’ (excluding the simulation with 5 droplets). However, in using this statistical approach, the sample size matters to the accuracy of the confidence intervals – I suggest the authors at least discuss this in more detail and give guidelines on a population size required to meet the central limit theorem requirement, or to achieve a defined statistical power.

b. Is it possible to apply this estimator to differential ice nucleation activity spectra, in addition to the cumulative spectra presented here? This would extend its usefulness.

I am skeptical that the log-rank test as presented here is generally suitable for comparing measurements of ice nucleation activity. The authors should consider adding more robust approaches that do not rely on the proportional hazards test that they mention on line 271. Additionally, the manuscript needs more discussion of the limitations of this approach, including strong statements of where it is not appropriate to use (i.e. when the proportional hazards assumption is not met), and of its benefits over simple inference by eye or confidence intervals on difference spectra. Specifically:
a. The foremost issue to me is that simply testing whether two frozen fraction curves are different has limited utility and could be misleading in some use cases, or if not interpreted carefully. Consider the case of Figure 4c and d. The authors state that the log-rank test produces a correct result in distinguishing the two curves in Figure 4d, while in Figure 4c the test produces a Type II error. This interpretation is correct under the hypothesis of the log-rank test. However, it is immediately clear from both panels that, because the curves cross, each of the two simulated samples has the higher ice nucleation activity in a different part of the spectrum. In panel d, it would be far more physically meaningful to say that the curve in red has higher ice nucleation activity at high temperatures and then meets the curve in grey, which has a higher ice nucleation activity at low temperatures, where its slope is large. These conclusions can be drawn from the confidence bands themselves (since they do not overlap in large sections of both curves), and could also be drawn from ice nucleation activity spectra, where other approaches to rigorously comparing two datasets already exist, as discussed in point d below. In contrast, applying the log-rank test tells the researcher merely that there is a difference between the samples, a far less useful result. The authors should further justify and discuss in what situations this specific test should be used and is more useful than simply comparing confidence intervals. They should also caution the reader about the limitations of the hypothesis being tested (i.e. it is not a directional test, and it does not tell the researcher anything about how the two curves differ).

b. The second issue, which the authors acknowledge, is the requirement of proportional hazards in the log-rank test, i.e. that the ratio of freezing probabilities is constant across the experiment. Notably, the curves need not cross for this assumption, as written on line 168, to be violated. The authors argue that violation of the proportional-hazards assumption (or specifically, crossing of frozen fraction curves) is unlikely in the literature. I disagree. Even in the specific use case of aging of laboratory samples, for which the authors advocate using the log-rank test, many counterexamples exist in the literature. For example, in the dataset of Fahy et al. (2022) used in Figure 1 of this manuscript, the unaged ash and water-aged ash begin freezing at approximately the same point before diverging at lower temperatures. Other examples include Whale et al. (2018) Figure 3a or 4c, Jahl et al. (2021) Figure 1, some of the coal fly ash samples in Losey et al. (2018), or any instance of comparing a frozen fraction curve to a background freezing curve when the frozen fraction curve meets the background. Further discussion is needed in the manuscript to clarify how violating the proportional-hazards assumption influences the results of the log-rank test, and ideally the authors should implement a version of the log-rank test that is robust to non-proportional hazards.

c. The data that the authors use to demonstrate these approaches are generally well-behaved for this test, except for the data in Figure 4c and 4d, which break the assumptions of the test in very controlled, specific ways. I would encourage the authors to apply the log-rank test to other, less well-behaved simulated datasets, such as those generated from multiple separate Gaussians, to demonstrate how the test behaves on datasets with multiple separate populations of ice nucleating sites. Additional experimental datasets that are less well-behaved might also be useful to further demonstrate where the log-rank test is and is not appropriate to use. In particular, a demonstration of the limits of the log-rank test (i.e. how different two curves have to be to be distinguished given a certain population size) would be helpful information for future researchers trying to interpret their results when using this test.

d. It is not clear to me what the benefit is of testing the differences between two frozen fraction curves specifically, as opposed to the volume-, surface-area-, or mass-normalized cumulative ice nucleation activity spectra K(T), n_s(T), or n_m(T). A stronger argument should be made for the need for this specific test, either in the introduction or when the log-rank test is being demonstrated. Even if sample information is not available, K(T) is readily calculated from just the droplet volume and the frozen fraction curve, while n_m or n_s normalizes by mass or surface area, allowing for more physical comparisons between samples and comparisons between laboratories. Comparing to background spectra is also possible using K(T). If ice nucleation activity spectra are used, other rigorous approaches to comparing the differences between ice nucleation spectra already exist. If the confidence intervals calculated using the Kaplan-Meier estimator (or any other appropriate confidence intervals) do not overlap, then the curves are statistically significantly different at that point (although this ad-hoc test loses some statistical power; see Cumming, 2009 for a discussion of this). For a more rigorous test that can distinguish between curves at a specific significance level, Fahy et al. (2022) demonstrated that confidence bands on differences between ice nucleation spectra can be calculated using a Monte Carlo approach, whether parameterized as presented in this paper or empirical, which is equivalent to a quantitative statistical test between ice nucleation active site curves across the entire temperature range over which they overlap. These approaches could also be applied to frozen fraction curves. In fact, directly comparing confidence bands conservatively using the 'inference by eye' approach on the simulated frozen fraction curves of Figure 4 gives the correct result for all four panels, clearly outperforming the log-rank test in this case despite the loss of statistical power. This may not always be true, but it demonstrates the limitations of the log-rank test. Therefore, a stronger argument for using the log-rank test on frozen fraction curves over these other approaches should be presented in the text.
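
For readers unfamiliar with the test discussed in the comments above, the following is a minimal sketch of a standard two-sample log-rank test applied to freezing temperatures. It is not the authors' adaptation: the function name `logrank_freezing` and its arguments are hypothetical, falling temperature is used as the time axis, and droplets that never froze are assumed to be censored at the end of a run that extends colder than every recorded freezing event in either sample.

```python
import math

import numpy as np


def logrank_freezing(temps_a, temps_b, unfrozen_a=0, unfrozen_b=0):
    """Two-sample log-rank test; returns (chi-squared statistic, p-value)."""
    a = np.asarray(temps_a, dtype=float)
    b = np.asarray(temps_b, dtype=float)
    n_a = a.size + unfrozen_a
    n_b = b.size + unfrozen_b

    observed_a = expected_a = variance = 0.0
    # Walk through the distinct freezing temperatures from warm to cold.
    for t in np.unique(np.concatenate([a, b]))[::-1]:
        # Droplets in each sample still liquid just before reaching t.
        risk_a = n_a - np.sum(a > t)
        risk_b = n_b - np.sum(b > t)
        d_a = np.sum(a == t)
        d = d_a + np.sum(b == t)
        n = risk_a + risk_b

        observed_a += d_a
        # Freezing events expected in sample A if both samples share one hazard.
        expected_a += d * risk_a / n
        if n > 1:
            # Hypergeometric variance of the number of events in sample A.
            variance += d * (risk_a / n) * (risk_b / n) * (n - d) / (n - 1)

    if variance == 0.0:
        raise ValueError("log-rank statistic is undefined for these data")

    chi2 = (observed_a - expected_a) ** 2 / variance
    # Upper tail of a chi-squared distribution with one degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


if __name__ == "__main__":
    # Illustrative data only: two clearly separated freezing ranges.
    warm = np.linspace(-10.0, -14.0, 20)
    cold = np.linspace(-20.0, -24.0, 20)
    print(logrank_freezing(warm, cold))
```

The statistic sums signed differences between observed and expected freezing events over the whole run, so differences of opposite sign at warm and cold temperatures can cancel. This is why crossing curves can yield a non-significant result, and why a significant result says nothing about where or in which direction the curves differ.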

Minor comments:
Line 187: These Monte-Carlo confidence intervals were presented earlier than Whale et al. (2022), for example in the supplemental information of Jahl et al. (2021). I recommend removing this line, as the earliest use of this method is not important to this study.

In the figures of this paper, the Kaplan-Meier confidence intervals are presented as continuous bands, apparently by interpolating between the calculated confidence intervals. This should be stated somewhere, as interpolated discrete confidence intervals are not precisely equivalent to continuous confidence bands (and are less accurate than true confidence bands; see Sachs et al., 2022).

The provided script and example data work on my machine, and the methods used are appropriate. I appreciate the authors providing an easy-to-use script for this approach as a supplement to the manuscript.

References
Cumming, G.: Inference by eye: Reading the overlap of independent confidence intervals, Stat. Med., 28, 205–220, https://doi.org/10.1002/sim.3471, 2009.
Fahy, W. D., Shalizi, C. R., and Sullivan, R. C.: A universally applicable method of calculating confidence bands for ice nucleation spectra derived from droplet freezing experiments, Atmospheric Meas. Tech., 15, 6819–6836, https://doi.org/10.5194/amt-15-6819-2022, 2022.
Jahl, L. G., Brubaker, T. A., Polen, M. J., Jahn, L. G., Cain, K. P., Bowers, B. B., Fahy, W. D., Graves, S., and Sullivan, R. C.: Atmospheric aging enhances the ice nucleation ability of biomass-burning aerosol, Sci. Adv., 7, eabd3440, https://doi.org/10.1126/sciadv.abd3440, 2021.
Kinney, N. L. H., Hepburn, C. A., Gibson, M. I., Ballesteros, D., and Whale, T. F.: High interspecific variability in ice nucleation activity suggests pollen ice nucleators are incidental, Biogeosciences, 21, 3201–3214, https://doi.org/10.5194/bg-21-3201-2024, 2024.
Losey, D. J., Sihvonen, S. K., Veghte, D. P., Chong, E., and Freedman, M. A.: Acidic processing of fly ash: chemical characterization, morphology, and immersion freezing, Environ. Sci. Process. Impacts, 20, 1581–1592, https://doi.org/10.1039/c8em00319j, 2018.
Sachs, M. C., Brand, A., and Gabriel, E. E.: Confidence bands in survival analysis, Br. J. Cancer, 127, 1636–1641, https://doi.org/10.1038/s41416-022-01920-5, 2022.
Whale, T. F., Holden, M. A., Wilson, T. W., O’Sullivan, D., and Murray, B. J.: The enhancement and suppression of immersion mode heterogeneous ice-nucleation by solutes, Chem. Sci., 9, 4142–4151, https://doi.org/10.1039/C7SC05421A, 2018.

Citation: https://doi.org/10.5194/egusphere-2026-680-RC1
RC2: 'Comment on egusphere-2026-680', Anonymous Referee #2, 14 Apr 2026

The comment was uploaded in the form of a supplement: https://egusphere.copernicus.org/preprints/2026/egusphere-2026-680/egusphere-2026-680-RC2-supplement.pdf

Citation: https://doi.org/10.5194/egusphere-2026-680-RC2

Interactive computing environment

KM-nucleation-stats Tom Whale https://github.com/TFWhale/KM-nucleation-stats

Viewed

Total article views: 235 (including HTML, PDF, and XML)

HTML	PDF	XML	Total	BibTeX	EndNote
145	74	16	235	11	19

Views and downloads (calculated since 09 Mar 2026)

Month	HTML	PDF	XML	Total
Mar 2026	120	61	16	197
Apr 2026	25	13	0	38

Latest update: 25 Apr 2026

Short summary

Droplet-freezing experiments are used to study ice formation in the atmosphere, but standard methods to show uncertainty or test whether two results differ are lacking. We borrow from medical ‘time-to-event’ statistics to add easily calculated confidence intervals to fraction-frozen curves and derived quantities without binning or assumptions about underlying physics, and adapt a test to judge whether curves differ beyond random variation. This will make comparison of studies easier.

