Land Surface Model Underperformance Tied to Specific Meteorological Conditions
Abstract. The exchange of carbon, water, and energy fluxes between the land and the atmosphere plays a vital role in shaping our understanding of global change and how climate change will affect extreme events. Yet our understanding of the theory of this surface-atmosphere exchange, represented via land surface models, continues to be limited, highlighted by marked biases in model-data benchmarking exercises. Here, we leveraged the PLUMBER2 dataset of observations and model simulations of terrestrial fluxes from 153 international eddy-covariance sites to identify the meteorological conditions under which land surface models are performing worse than a priori expectations. By defining performance relative to three sophisticated out-of-sample empirical models, we generated a lower bound of performance in turbulent flux prediction that can be achieved with the input information available to the land surface models during testing at flux tower sites. We found that land surface model (LSM) performance relative to empirical models is worse at boundary conditions – that is, LSMs underperform in timesteps where the meteorological conditions consist of coinciding relative extreme values. Conversely, LSMs perform much better under "typical" conditions within the centre of the meteorological variable distributions. Constraining analysis to exclude the boundary conditions results in the LSMs outperforming strong empirical benchmarks. Encouragingly, we show that refinement of the performance of land surface models in these boundary conditions, consisting of only 12 % to 31 % of time steps, would see large improvements (22 % to 114 %) in an aggregated performance metric. Precise targeting of model development towards these meteorological boundary conditions offers a fruitful avenue to focus model development, ensuring future improvements have the greatest impact.
Status: open (until 09 Nov 2025)
CC1: 'Comment on egusphere-2025-4149', Sean Walsh, 07 Sep 2025
A very interesting and relevant study of LSM model performance. However, I am concerned about the unconventional use of the term "boundary conditions". In computational science this term normally refers to spatial or temporal constraints - in other words, hard boundaries defined by the problem itself (e.g. the vertical component of wind must be zero at the ground surface). In this article, the term has been used to refer to climatic conditions at the extreme ends of a probability distribution, which is conceptually quite different from a physical constraint condition. I recommend that you consider changing "boundary conditions" to "extreme conditions" or "extreme weather conditions". Kind regards, Dr Sean Walsh, University of Melbourne.
AC1: 'Reply on CC1', Jon Cranko Page, 16 Sep 2025
Dr. Walsh raises a pertinent point regarding our use of the phrase “boundary conditions” in the manuscript, and we agree that it may cause confusion. However, we believe it is important to retain terminology that highlights that underperformance occurs not only under “extreme weather conditions” as they are commonly understood, but also at the extremes or “edges” of the joint variable distributions as illustrated in the figures. As such, in future versions of the manuscript we will replace all references to “boundary conditions” with the term “edge conditions”.
We hope this addresses Dr. Walsh’s concerns, as well as those of other readers, regarding terminology.
Many thanks for the valuable feedback.
Dr. Jon Cranko Page
on behalf of all authors
Citation: https://doi.org/10.5194/egusphere-2025-4149-AC1
RC1: 'Comment on egusphere-2025-4149', Anonymous Referee #1, 02 Oct 2025
Review on “Land Surface Model Underperformance Tied to Specific Meteorological Conditions”
General comments
This study used the framework established in the PLUMBER2 model benchmarking project. The authors used the results of 11 land surface models (LSMs) and studied their performance metrics under different meteorological conditions. The observational data included sensible and latent heat fluxes as well as net ecosystem exchange from eddy covariance towers. A metric was developed based on the performance of the LSMs in relation to empirical flux models (EFMs). The best EFMs were used to estimate the ‘expected’ performance of the LSMs. If the LSM had a larger error than the best EFMs, it would ‘lose’, and these ‘lost’ values were used to calculate an LSM Loss Ratio (LLR) that was then used in the analysis to show in which meteorological conditions the LLR was highest. Additionally, the effect of some filters (based on LLR level, wind speed, physical consistency or time of day) on the results was tested.
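To make this construction concrete for readers of the discussion, a minimal sketch of how such a loss ratio could be computed over a two-dimensional meteorological grid is given below (illustrative only, not the authors' implementation; the variable names lsm_err, efm_errs, tair and vpd are hypothetical):

```python
# Illustrative sketch of an LSM Loss Ratio per meteorological cell.
import numpy as np
import pandas as pd

def lsm_loss_ratio(lsm_err, efm_errs, tair, vpd, n_bins=20):
    """Fraction of timesteps in each (Tair, VPD) cell where the LSM's error
    exceeds the error of every benchmark EFM (an 'LSM Loss')."""
    lsm_err = np.asarray(lsm_err)           # per-timestep absolute error of the LSM
    efm_errs = np.asarray(efm_errs)         # shape (n_timesteps, n_efms)
    loss = lsm_err > efm_errs.max(axis=1)   # a loss only if worse than *all* EFMs

    cells = pd.DataFrame({
        "loss": loss,
        "tair_bin": pd.cut(tair, bins=n_bins),
        "vpd_bin": pd.cut(vpd, bins=n_bins),
    })
    # LLR = mean of the boolean loss flag within each 2-D meteorological cell
    return cells.groupby(["tair_bin", "vpd_bin"], observed=True)["loss"].mean()
```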
The study found that the ‘edge’ values of the meteorological variable ranges were indeed those where the LSMs were not performing so well. This was most obvious for the sensible heat flux and, to a lesser extent, the latent heat flux. For the net ecosystem exchange this effect did not seem as pronounced. The aim of the study was to pinpoint where model improvement efforts should be focused for the best outcomes.
This paper addresses a highly relevant scientific question within the scope of this journal and will be of great interest to model developers. The metric developed is new and is used in a way that enables novel findings. The authors are able to state in which direction fruitful model development should be headed. The paper is clearly written, well structured, and the scientific method seems sound. The authors are building on earlier work, and this work makes a substantial new contribution with a clear message. The title is descriptive of the paper. Referencing was appropriate.
This paper is an important contribution to the field and is of good quality. However, I have some remarks that I hope the authors will address before the manuscript is accepted for publication.
Major comments
1. Overall, it is understandable that this is a follow-up study and that much of the related information has been published earlier, but to make this manuscript ‘independent’, I have some suggestions for small clarifications below.
2. This is a very minor thing, but something I started to wonder about. You applied a filter condition on the wind speed. To my understanding, it is common to use the friction velocity to determine which observations to use and which to gap-fill when processing flux data. I am not sure, but friction velocity was perhaps not taken into account in the processing of these data. Would you expect using a friction velocity threshold to take care of the turbulence requirement, and would your results using the wind speed threshold say something about how important this kind of threshold is?
3. Not a big thing, but a bit of a mind-twister, is the use of the term “boundary conditions” when referring to meteorological values at the edges of their ranges. The state of the boundary layer affects whether the micrometeorological observations meet the turbulence requirements, and terminology that mixes variable space and meteorology can get confusing. At some points in the manuscript these are called “edge effects”. I’d recommend using one term consistently throughout the manuscript and perhaps avoiding the word “boundary”.
4. Do you have any thoughts on why the longwave radiation vs. air temperature relationship would show worse performance at the edge values for the sensible heat flux?
5. Also, based on these results, could you conclude anything about the energy partitioning of the models?
Minor comments
6. Abstract: If I interpret your results correctly, the room for improvement would be largest for the sensible heat flux and then for the latent heat flux, whereas for NEE I don’t see that many dark-red points, at least in the averaged plot. Would this kind of information be relevant to convey already in the abstract?
7. L1-2: Will exchanges between the land and the atmosphere really tell us how climate change affects extreme events? This was a somewhat surprising claim for the first sentence of the abstract, and it is not returned to later in the paper. (I know there are feedback mechanisms, but from what I’ve understood there are more factors involved.)
8. L5: You could maybe state here which fluxes are meant (perhaps not all readers will know exactly what is meant).
9. L6: I understand that “a priori expectation” is an established concept in such benchmarking studies, but for readers not familiar with this field, one might consider a more intuitive expression for the abstract. (Someone unfamiliar might wonder what an “a posteriori expectation” would be.)
10. L9: Abbreviation for LSM could be given when first mentioned.
11. L88: poor LSM “performance”?
12. L95: What do the “good conditions” mean here?
13. L110: For Qh and Qle, did you use energy-balance-corrected values or the directly measured values?
14. L115: I think it would be good to mention which LAI product this was and at what time resolution it was available. When introducing the models, it could be mentioned which of them used LAI as an input variable.
15. L120-125: It’s not easy to digest a list of these models with references. Would it be possible to have a table with the references in another column and potentially an explanation for different CABLE, JULES and ORCHIDEE versions?
16. L166: What is “raw LSM performance”?
17. L166-167: At first it was not clear to me that “better than a single one” means better than only one EFM. It could be rephrased, e.g. as “the LSM was outperformed by only one of the best EFMs”. However, in panel 3 of Fig. 1 this is expressed the other way around, so that it is defined as a loss only if the error of the LSM is larger than the errors of all the best EFMs. Would it help the reader if the phrasing were consistent between the text and the figure?
18. L202: It is a bit confusing to use the acronym PDF for “Density Overlap Percentage”.
19. L210: Are these really the best EFMs? Looking at the equations, it says benchmark_EFMs.
20. L220: Why did you choose this LRF?
21. L221-222: “was strongly preferenced as only timesteps in input cells where less than half of timesteps were a LSM Loss were analysed.” - This is a bit difficult to read. Is there a small mistake?
23. L214: Filters: Refer to Fig. 1 here again?
24. Fig. 2 caption: Worth mentioning that these are the best EFMs?
25. Fig. 2-4 color bar label: “LSM Losses”. Should this be LLR, or at least LSM Loss Ratio?
26. L235: Why did you start this section with Qh?
27. L251: Usually CHTESSEL has been called “CHTESSEL_1”.
28. L251: Is this sentence about the VPD-Tair domain, or just referring to all the domains?
29. L252: Here you start talking about the amount of high LLRs in the edge regions of the fingerprints. However, just comparing Figs. 2 and 3, it would seem that the amount of high LLRs is lower for the latent heat flux. Would it make sense to also comment on that point here?
30. L270: Add units to 0.
31. L270: Would the performance of MATSIRO improve again at warmer temperatures, as there is a lighter streak in the plot?
32. L278: There are perhaps now two orange colors in the color bar, a lighter and a darker orange. At least looking at the ORCHIDEE model performance, the darker orange barely appears. Would that be worth mentioning?
33. Fig. 4: Why don’t you leave out the four models without data? Now quite a lot of grey is shown in the figure, which is quite busy already.
34. L292: I thought all of the LSMs here had 30 min time-steps. Is there a specific reason you mention the timestep here or have I missed something?
35. L402: I agree with you that the number of parameters in LSMs is increasing over time, but these increases might often be due to added model complexity in areas other than the energy balance. To get Qh and Qle right, perhaps the information given for soil structure is not sufficient, or the models could be suffering from biases in soil moisture.
36. L433: The authors seem concerned here about models being able to perform even when unphysical meteorological forcing is used. Could it be that some of the models have checks on the input data to ensure that things are consistent inside the model? Would the authors consider it relevant to flag these kinds of unphysical instances in the dataset so that they could be left out? Or would they leave this for each user of the dataset to take into consideration?
Typos etc
37. CO2 is not written with a subscript.
38. L222: “an” LSM Loss
Citation: https://doi.org/10.5194/egusphere-2025-4149-RC1
RC2: 'Comment on egusphere-2025-4149', Anonymous Referee #2, 08 Oct 2025
This study leveraged the PLUMBER2 dataset to compare different land surface models and out-of-sample empirical models. They identified the underperformance of land surface models in several boundary conditions and further separated the related time steps to examine their impacts on the evaluation. The manuscript is generally well written and easy to follow. I agree with the authors that rigorous benchmarks between land surface and empirical models are important to improve the land model structure. I have these comments below and hope they can help further improve this study.
I understand that the comparison of relative performance, as defined using the LSM Loss Ratio, is informative. However, I am thinking about cases where all types of models perform unsatisfactorily. Looking at absolute metrics is important in these cases, because the “winning” of the LSMs is not that meaningful if their absolute performance is still poor. For example, some dark red regions in Figure 4 could be cases where both the LSMs and the EFMs perform quite badly but the LSMs are simply less bad. I hope the authors can add some analysis on this question as well as discussion of how we might further improve LSMs in these cases.
I feel the definition of LSMs winning - outperforming any one of the EFMs - favors the LSMs too much and could mislead the analysis in some cases. It would be helpful to report the variation of performance within the EFM group. Many studies show that an LSTM can largely outperform simple regression methods; LSMs might easily beat a regression model but still fall well behind the LSTM. This can make us think the LSMs “win” while they still have large room for structural improvement. Can the authors provide more discussion of this issue?
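A minimal sketch of the kind of per-cell diagnostic these two comments point towards is given below (illustrative only; the column names and error definitions are assumptions, not taken from the paper). Reporting the relative LLR alongside absolute errors and the spread across the EFM group would keep visible the cells where the LSM “wins” but every model is poor, or where it beats a simple regression yet trails an LSTM:

```python
# Illustrative sketch: pair the relative LLR with absolute errors per cell.
import pandas as pd

def cell_diagnostics(df):
    """df has one row per timestep with (hypothetical) columns:
    'cell'      -- meteorological cell label, e.g. a (Tair, VPD) bin pair
    'lsm_err'   -- absolute error of the LSM
    'efm_best'  -- smallest absolute error among the benchmark EFMs
    'efm_worst' -- largest absolute error among the benchmark EFMs
    """
    df = df.assign(loss=df["lsm_err"] > df["efm_worst"])  # worse than every EFM
    grouped = df.groupby("cell")
    return pd.DataFrame({
        "llr": grouped["loss"].mean(),               # relative view: how often the LSM loses
        "lsm_mae": grouped["lsm_err"].mean(),         # absolute view: is the LSM itself accurate here?
        "efm_best_mae": grouped["efm_best"].mean(),   # best the benchmark group achieves
        # spread within the EFM group, e.g. simple regression vs. LSTM benchmarks
        "efm_spread": grouped["efm_worst"].mean() - grouped["efm_best"].mean(),
    })
```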
Line 152: Why are the EFMs only trained at a single site, instead of using K-fold cross-validation to obtain out-of-sample predictions? Using only one site and extrapolating to all sites can naturally degrade performance, given the limited training sample.
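For reference, a grouped cross-validation along the lines suggested here could look like the sketch below (illustrative only; the 'site' column, the feature list and the random-forest regressor are assumptions, not the empirical models used in the paper). Each fold holds out entire sites, so every prediction is out-of-sample for its site:

```python
# Illustrative sketch of site-grouped K-fold cross-validation.
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GroupKFold

def crossval_predict_by_site(df, features, target, n_splits=5):
    """Out-of-sample predictions for every timestep, with whole flux-tower
    sites held out in each fold."""
    preds = pd.Series(index=df.index, dtype=float)
    splitter = GroupKFold(n_splits=n_splits)
    for train_idx, test_idx in splitter.split(df[features], df[target], groups=df["site"]):
        model = RandomForestRegressor(n_estimators=200, n_jobs=-1)
        model.fit(df.iloc[train_idx][features], df.iloc[train_idx][target])
        preds.iloc[test_idx] = model.predict(df.iloc[test_idx][features])
    return preds
```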
Is there a specific reason that air temperature is chosen as the fixed variable when conducting the paired analysis for different cells? Please clarify this.
Line 173: Abbreviations like “SWdown”, “Qair” and “Qle” should be clearly defined and explained when first introduced, even though readers might guess what they mean. I was a little confused by some abbreviations that suddenly show up when reading through the paper.
Line 276: This pronounced temperature effect is interesting. Could you expand on the potential reasons or hypotheses behind it?
I feel Figure 5 does not need the lines that connect each group of dots, because there is no dynamic relationship between Qh, Qle and NEE. The lines can be distracting when reading this figure.
Line 370: This statement raises an important issue regarding how we evaluate the models. Isn’t this directly dependent on which metrics we use for performance evaluation? For example, metrics for overall performance and for extreme conditions capture different aspects of the simulations and can behave very differently. The authors might need to discuss how the choice of evaluation metrics can affect some conclusions of the paper.
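As a concrete illustration of how the metric choice shapes the picture (a sketch only, not an analysis from the paper; the inputs are hypothetical), the same error statistic can be computed over all timesteps and over only the extreme tail of a driver, and the two can rank models quite differently:

```python
# Illustrative sketch: overall error vs. error restricted to extreme conditions.
import numpy as np

def overall_and_extreme_mae(obs, sim, driver, quantile=0.95):
    """Mean absolute error over all timesteps and over timesteps where a
    meteorological driver exceeds a high quantile (e.g. the hottest 5 %)."""
    obs, sim, driver = map(np.asarray, (obs, sim, driver))
    overall = np.mean(np.abs(sim - obs))
    extreme_mask = driver >= np.quantile(driver, quantile)
    extreme = np.mean(np.abs(sim[extreme_mask] - obs[extreme_mask]))
    return overall, extreme
```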
I suggest the authors clearly summarize, in the conclusion section, under which boundary conditions they identify the land surface models as performing consistently worse than the EFMs. These are the key messages that readers would like to take from the conclusion, while the current section is relatively general.
Citation: https://doi.org/10.5194/egusphere-2025-4149-RC2
Model code and software
Analysis Code for "LSM Underperformance Tied to Specific Meteorological Conditions" Jon Cranko Page https://github.com/JDCP93/LSMUnderperformance
Viewed
- HTML: 1,847
- PDF: 35
- XML: 15
- Total: 1,897
- Supplement: 20
- BibTeX: 23
- EndNote: 12