Towards the Bayesian calibration of a glacier surface energy balance model for unmonitored glaciers
Abstract. Building on Bayesian calibration techniques and leveraging high-resolution climate simulations, we test the current capabilities of calibrating a surface energy balance model for unmonitored glaciers using only globally available satellite observations at Hintereisferner. We developed a multi-objective Bayesian framework for the Coupled Snowpack and Ice surface energy and mass balance model in Python (COSIPY), calibrated with satellite-derived geodetic mass balance, transient snowline altitudes, and mean glacier albedo. The framework is evaluated with in-situ observations and weather station measurements. A Latin Hypercube Sampling ensemble was used to investigate the model's parameter behaviour, establish priors and enable a Markov Chain Monte Carlo calibration helped by a novel and computationally efficient COSIPY emulator. The multi-objective calibration successfully constrains parameter distributions, addressing the equifinality between accumulation and albedo parameters. However, tighter parameter constraints, combined with imperfect climate simulations as forcing, create a model solution that requires a compromise between the three observational targets. Model evaluation shows that the calibrated ensemble reproduces glacier-mean albedos and inter-annual mass-balance variability well, but exhibits a negative bias in mean annual mass balance and a delayed snowline rise. This apparent contradiction arises from both forcing and model limitations. Incoming longwave radiation is overestimated throughout the year, while a warm and humid bias in the meteorological input during the later melt season enhances the positive turbulent fluxes. Early season melt is delayed by underestimated incoming shortwave radiation, and by an overly slow albedo decay caused by too high albedo aging and firn albedo parameters. Such biases in the meteorological forcing data remain a major obstacle to applying surface energy balance models to unmonitored glaciers. 
The calibration framework presented in this study provides diagnostic tools that help identify shortcomings and compensation effects within the modelling chain, paving the way for correcting forcing biases within a Bayesian framework as more observations become available. The open-source tools developed here have the potential to lower the barrier to studying the atmospheric drivers of glacier change at unmonitored sites with explicit treatment of uncertainty.
In this study, the authors present the first Bayesian calibration framework for surface energy balance (SEB) modeling for unmonitored glaciers. The framework uses COSIPY calibrated against satellite-derived geodetic mass balance, transient snowline altitudes, and mean glacier albedo derived from remote sensing. The authors employ a Latin Hypercube Sampling ensemble together with Markov Chain Monte Carlo (MCMC) simulations, supported by a novel surrogate model designed to emulate COSIPY outputs. Their analysis indicates that parameter equifinality primarily arises between the precipitation factor and albedo decay-related parameters. Model evaluation shows that the calibrated ensemble reproduces glacier-mean albedo and interannual mass-balance variability well, but exhibits a negative bias in mean annual mass balance and a delayed snowline rise.
Overall, this manuscript represents a very valuable contribution to the glacier modeling community. Bayesian calibration approaches for SEB modeling are long overdue, and I am pleased to see that the authors address this important and complex problem. The study represents an important first step toward more robust calibration of SEB models with uncertainty quantification and is highly relevant for advancing regional and global physics-based glacier modeling. The manuscript is very well written, the figures are clear and well presented, and the analysis is thorough. The authors also apply a wide range of carefully considered methods, analyses, and modeling tools. While Bayesian calibration itself is not my area of expertise, I am able to comment on the physics-based glacier modeling and machine learning components of the manuscript. There are some shortcomings that should be addressed before publication, but once these issues are resolved, I would support publication. Detailed comments and suggestions are provided below.
Major comments:
1) Geodetic mass balance calibration data
The main concern I have relates to how the geodetic mass balance data is used for calibration. In L 204–205, the authors state that the model is calibrated against the “annual average specific mass change rate” for the period 2000–2010 derived from the dataset of Hugonnet et al. (2021). The reported uncertainty is −1.0425 ± 0.261 m w.e. a⁻¹ (L 204). While the dataset of Hugonnet et al. (2021) indeed provides estimates for shorter subperiods than the full 2000–2019 interval, the signal-to-noise ratio decreases substantially as the aggregation period shortens. In practice, this means that for most glaciers, periods shorter than ~10 years are basically just noise. Most current global or regional glacier modeling studies use the Hugonnet dataset only as a single 2000–2019 average value (i.e., one data point per glacier), precisely because shorter time intervals tend to have low signal-to-noise ratios.
Are the authors using the 1-year subperiods for calibration? If so, I think it should be carefully checked whether these values can actually be used, since their uncertainties might be too large to provide meaningful information. One way to verify this would be to compare the geodetic mass balance rates with the uncertainties reported by Hugonnet et al. (2021) (which are also known to be potentially optimistic) for different aggregation periods (e.g., 1-year, 2-year, 5-year, and 10-year) and calculate the corresponding signal-to-noise ratios. For example, Table 2 suggests that the observed mass balance values (though from WGMS and Klug et al., 2018 and not from Hugonnet et al.) exceed the uncertainty range (−1.0425 ± 0.261 m w.e. a⁻¹) in four (?) out of eight years. What implications would the use of longer aggregation periods (e.g., 5-year or 10-year averages from Hugonnet et al., 2021) have for the Bayesian calibration framework? I am aware that the primary focus of the manuscript is on the framework for Bayesian calibration. However, it is still very important that the calibration data are used with a clear understanding of their limitations and uncertainties.
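One way to make this check concrete is a back-of-the-envelope signal-to-noise calculation. The sketch below uses hypothetical annual rates and uncertainties (the actual values would come from the Hugonnet et al., 2021, time series); it also assumes independent errors between years, which is optimistic since those errors are known to be correlated:

```python
import numpy as np

# Hypothetical annual mass-change rates (m w.e. a^-1) and 1-sigma errors;
# the real values would come from the Hugonnet et al. (2021) time series.
rates = np.array([-0.8, -1.2, -0.9, -1.5, -0.7, -1.1, -1.3, -0.9, -1.0, -1.2])
sigmas = np.array([0.9, 1.0, 0.8, 1.1, 0.9, 1.0, 0.9, 0.8, 1.0, 0.9])

def snr(rates, sigmas, window):
    """Mean-rate SNR over an aggregation window, assuming
    independent errors (optimistic; errors are correlated in practice)."""
    r = rates[:window].mean()
    # Uncertainty of the mean rate under error independence
    s = np.sqrt((sigmas[:window] ** 2).sum()) / window
    return abs(r) / s

for w in (1, 2, 5, 10):
    print(f"{w:2d}-year window: SNR = {snr(rates, sigmas, w):.2f}")
```

With any realistic error magnitudes, the SNR grows with the aggregation window, which is exactly why longer periods are preferred for calibration.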
2) Climatic input data
The authors use COSMO as the climate forcing, produced as part of the CORDEX initiative with a grid spacing of 2.2 km. They state that these simulations are based on ERA-Interim (~80 km, L 156), which has been superseded by ERA5 (~31 km) for nearly a decade. ERA5 includes several updates compared to ERA-Interim that are relevant for glacier studies. Has the COSMO dataset been specifically evaluated over glacierized terrain? I am asking because regional climate model performance can be quite sensitive to the choice of physical parameterizations, and models optimized for other land surface types do not necessarily perform well in cryospheric environments.
I am also wondering why this dataset was chosen instead of, for example, ERA5-Land (~9 km) or ERA5 itself. Both have been shown to provide reliable forcing for SEB modeling studies, and higher spatial resolution does not necessarily translate into better forcing data. For example, Draeger et al. (2024) found that dynamical downscaling did not improve SEB model forcing for a sample of glaciers in Canada compared to ERA5, and that ERA5 even performed better than the higher-resolution ERA5-Land in their case. I also assume that the study period is limited to 1999–2010 (L155) because of the temporal availability of the COSMO dataset.
I am not suggesting that the authors redo the analysis with a different climate forcing dataset, but I do think it would be important to more clearly justify the choice of COSMO in this context, particularly since the manuscript puts a strong emphasis on uncertainties related to the forcing data in the discussion and conclusions.
3) Katabatic winds
In L170 the authors mention that a logarithmic wind profile is used. However, logarithmic profiles are known to poorly represent wind speeds under katabatic conditions and can lead to substantial underestimation of near-surface wind speeds over glaciers (L 598). For example, Shannon et al. (2019) addressed this issue by applying a calibration-dependent multiplicative factor (ranging from 1 to 4) to the 10 m wind speed from reanalysis data.
I wonder whether wind speeds during katabatic conditions might be systematically underestimated in the present setup, which would lead to underestimated turbulent fluxes. However, in L515 the authors mention an overestimation of turbulent fluxes in summer. Could the authors comment on how these two findings can be reconciled?
4) Machine learning surrogate model
Using a machine learning surrogate model to avoid the computationally expensive COSIPY simulations is an interesting and valuable contribution. However, since this component is a key part of the modeling chain, I think the manuscript would benefit from providing more details.
For example, I am wondering how a test set was constructed (not mentioned in L339). It would also be helpful to include some technical details about the machine learning models (this could be placed in the SI), such as the number of epochs, batch size, loss function, optimizer, and the number of neurons in each layer. It would also help the reader if the input features and model outputs were stated more clearly. It is partly implied in L337, but there is already a lot of information in that section, which makes it somewhat difficult to follow. I might have missed it, but it is also unclear to me why the forcing data (except precipitation) are kept constant (L344).
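To illustrate the level of detail I have in mind regarding the test-set construction, a generic train/test protocol could be reported along these lines (sklearn-based sketch with synthetic data; the dimensions, architecture, and sample sizes are invented and are not the authors' actual emulator):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
# Hypothetical design: 8 COSIPY parameters -> one scalar target
# (e.g. annual mass balance); here a noisy linear toy function
X = rng.uniform(0.0, 1.0, (500, 8))
y = X @ rng.normal(0.0, 1.0, 8) + rng.normal(0.0, 0.05, 500)

# Hold out a test set that is independent of the samples used for training
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Report architecture and training choices (layers, optimizer, epochs, ...)
model = MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=2000, random_state=0)
model.fit(X_train, y_train)
print(f"test R^2 = {model.score(X_test, y_test):.2f}")
```

Stating the equivalent choices for the actual emulator (inputs, outputs, split strategy, loss, optimizer, epochs, batch size, layer sizes) in the SI would make the surrogate fully reproducible.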
5) Clarity and readability
The analysis is quite complex and involves several methodological steps, which at times makes it difficult for the reader to follow the technical details. For example, the PCA analysis is only briefly introduced in L314, and for me there are not enough details to clearly understand how it was implemented. It is also not entirely clear to me how the effect of the observed data on individual parameter values is isolated from the K-means clusters (L319).
I recognize that the manuscript is already quite long and that some additional details are provided in the SI. Nevertheless, I would encourage the authors to include a few brief, intuitive explanations of the more complex parts of the analysis in the main text. This would make the methodology much more accessible, particularly for readers in the glaciology community who may not be familiar with all the statistical techniques used.
Minor comments:
L39: To set the stage, I would briefly introduce SEB models here. It would help to mention that they are more physically based than simpler approaches but require substantially more input variables. It may also be useful to clarify the role of calibration parameters in SEB models, which differs from TI models. Unlike TI models, where the melt factor is an empirical parameter inherent to the model itself, many calibration parameters in SEB models are used to account for limitations in the forcing data (e.g., downscaling or unresolved processes).
L75: It would be helpful to provide a bit more detail on the Bayesian calibration frameworks introduced by Rounce et al. (2020) and Sjursen et al. (2023, 2025), as these are the only existing applications of Bayesian calibration in glacier modeling so far.
L71: Typo
L83: I would clarify here how (and with which data) the simulations are evaluated. This is mentioned at the end of the paragraph below, but it might fit better directly after the three main points.
Figure 1: I believe the Copernicus WorldDEM-30 (https://doi.org/10.5270/ESA-c5d3d65) should also be cited in the references.
L156: I am a bit confused about the grid spacings mentioned. ERA-Interim has a resolution of ~79 km, but in L156 it is referred to as “12 km”. Could the authors please clarify this?
L169: Could you provide a few brief details on the analysis of Jennings et al. (2018) to clarify why it is used as a justification here?
L175: Could you please provide details on the “large spacing between vertically stored levels in the COSMO simulation at standard pressure levels”?
L176–179: I am not sure I fully follow this part. Does this mean that, in the end, all 3x3 grid cells were used to calculate lapse rates, regardless of land surface type?
L195: Please provide a short description of HORAYZON. Also, please cite the original publication, as required when using the code (Steger et al., 2022).
L 229: I might have missed it, but it is unclear to me how you distinguish between alpha_ice, alpha_firn, and alpha_snow based on the Landsat-based pixel data. How does this vary temporally and spatially?
L 254-258: Am I understanding correctly that “COSIPY model output” here does not only refer to mass balance, but also to other model outputs, such as the albedo simulated by the COSIPY albedo model (Eq. 5)? If so, the wording might be confusing.
L304: It would be helpful to already name LHS1 and LHS2 here (as defined in Table 1).
Figure 7: In the caption, it might improve readability to explicitly mention “ENS MEDIAN” and “95% CI”.
L 472: Typo
L510: Could you expand on what is meant by the “misrepresentation of surface albedo”? What do you think is the main issue here, for example, the simple decay-type albedo parameterization, limitations in the Landsat data (e.g., cloud cover leading to irregular observations or the 16-day revisit time), or something else?
L 542: Brackets appear twice
L545–L547: It might be useful to briefly mention here that glacier albedo has been observed to decrease with climate change (e.g., Williamson et al., 2025). This would also highlight that capturing interannual variability in the albedo simulations is important, not only intra-annual variability.
Section 4.3: I find this future work section very helpful and thorough.
L629: The mismatches between observed and simulated mass balance gradients are attributed here to linear lapse rates, but other factors could also contribute, so the wording may be somewhat misleading.
L640: I am wondering whether you applied any temporally-guided averaging of the Landsat observations (e.g., monthly or multi-month averages). This might help here.
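As a purely illustrative sketch of what I mean by temporally-guided averaging (hypothetical dates and glacier-mean albedo values, not the actual Landsat retrievals), the aggregation could be as simple as:

```python
import pandas as pd

# Hypothetical irregular Landsat glacier-mean albedo retrievals
# (16-day revisit, gaps from cloud cover)
obs = pd.Series(
    [0.72, 0.55, 0.38, 0.35, 0.41],
    index=pd.to_datetime(
        ["2005-05-10", "2005-06-11", "2005-07-13", "2005-08-14", "2005-09-15"]
    ),
)

# Monthly means smooth the irregular sampling before comparison
# with the simulated glacier-mean albedo
monthly = obs.resample("MS").mean()
print(monthly)
```

Comparing monthly (or multi-month) means rather than individual scenes would reduce the sensitivity to single cloud-affected acquisitions.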
L649: This is a very interesting observation: the “impact of parameter uncertainty is small compared to that of the forcing data.” I wonder whether there is a simple test or short analysis that could help quantify this effect.
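One such test could contrast the ensemble spread obtained by varying only the parameters with the spread obtained by perturbing only the forcing. The toy sketch below (an invented linear stand-in model with made-up uncertainty magnitudes, not COSIPY) illustrates the idea:

```python
import numpy as np

rng = np.random.default_rng(0)

def toy_mb(temp_anom, precip_factor, melt_param):
    """Toy stand-in for a mass-balance model (not COSIPY)."""
    return 1.5 * precip_factor - melt_param * (2.0 + temp_anom)

# (a) parameter uncertainty only: sample parameters, fix forcing
mb_param = toy_mb(
    0.0,
    rng.normal(1.0, 0.1, 1000),   # precipitation factor
    rng.normal(1.0, 0.05, 1000),  # melt-related parameter
)

# (b) forcing uncertainty only: perturb forcing, fix parameters
mb_forcing = toy_mb(rng.normal(0.0, 0.5, 1000), 1.0, 1.0)

print(f"spread from parameters: {mb_param.std():.3f}")
print(f"spread from forcing:    {mb_forcing.std():.3f}")
```

Applied with the authors' posterior parameter samples and plausible forcing perturbations, the ratio of the two spreads would quantify the claim directly.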
References:
Draeger, C., Radić, V., White, R. H., and Tessema, M. A.: Evaluation of reanalysis data and dynamical downscaling for surface energy balance modeling at mountain glaciers in western Canada, The Cryosphere, 18, 17–42, https://doi.org/10.5194/tc-18-17-2024, 2024.
Steger, C. R., Steger, B., and Schär, C.: HORAYZON v1.2: an efficient and flexible ray-tracing algorithm to compute horizon and sky view factor, Geosci. Model Dev., 15, 6817–6840, https://doi.org/10.5194/gmd-15-6817-2022, 2022.
Shannon, S., Smith, R., Wiltshire, A., Payne, T., Huss, M., Betts, R., Caesar, J., Koutroulis, A., Jones, D., and Harrison, S.: Global glacier volume projections under high-end climate change scenarios, The Cryosphere, 13, 325–350, https://doi.org/10.5194/tc-13-325-2019, 2019.
Williamson, S. N., Marshall, S. J., and Menounos, B.: Temperature mediated albedo decline portends acceleration of North American glacier mass loss, Commun. Earth Environ., 6, 555, https://doi.org/10.1038/s43247-025-02503-x, 2025.