This work is distributed under the Creative Commons Attribution 4.0 License.
OpenBench: a land models evaluation system
Abstract. Land surface models (LSMs) have evolved significantly in complexity and resolution in recent years, requiring comprehensive evaluation systems to assess their performance. This paper introduces the Open Source Land Surface Model Benchmarking System (OpenBench), an open-source, cross-platform benchmarking system designed to evaluate state-of-the-art LSMs. OpenBench addresses significant limitations of current evaluation frameworks by integrating processes that encompass human activities, accommodating arbitrary spatiotemporal resolutions, and offering comprehensive visualization capabilities. The system uses a variety of metrics and normalized scoring indices, enabling evaluation of different aspects of model performance. Key features include automated management of multiple reference datasets, advanced data processing capabilities, and support for both station-based and gridded evaluations. Through case studies on river discharge, urban heat flux, and agricultural modeling, we illustrate OpenBench's ability to identify the strengths and limitations of models across different spatiotemporal scales and processes. The system's modular architecture enables seamless integration of new models, variables, and evaluation metrics, ensuring adaptability to emerging research needs. OpenBench provides the research community with a standardized, extensible framework for model assessment and improvement. Its comprehensive evaluation capabilities and efficient computational architecture make it a valuable tool for both model development and operational applications in various fields.
Status: closed
RC1: 'Comment on egusphere-2025-1380', Anonymous Referee #1, 16 May 2025
This paper presents a new software system, called OpenBench, to evaluate land surface models. OpenBench evaluates land surface models following a rigorous scientific method based on a wide range of statistical metrics and evaluation scores to allow for a quick and objective evaluation of various aspects of the models' results. OpenBench showcases its capabilities by presenting a range of analyses accompanied by a varied array of representations. Although one may deplore the general scattering of effort in the community in developing such tools, the paper is generally well written and successfully explains the advantages of the software. Using Python and well-supported packages to write the software is a solid choice, ensuring potential widespread adoption and continuous support of dependencies. The paper clearly highlights how OpenBench differs from existing tools with support for a range of data types and model output formats, new variables linked to human activities and the possibility of user extension for other datasets, models or variables. Although OpenBench uses common evaluation metrics and scores, the set of metrics and scores chosen is pertinent and allows for an evaluation of a wide range of aspects of land surface model results. In addition, the paper explains how OpenBench differs in its handling and visualisation of the metrics and scores.
However, a few points of the paper need to be clarified. Firstly, the choice of a Fortran namelist format for the configuration file of a Python software is unusual. Fortran namelists are not the most flexible format for configuration files and are not well supported in Python. Common, popular choices like YAML, JSON or others have a much stronger support in Python and offer greater flexibility. It would be good to explain better why the Fortran namelist format was chosen for OpenBench.
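To illustrate this point, the sketch below contrasts the two configuration routes in Python. It is a minimal example, assuming the third-party f90nml package for namelist parsing and PyYAML for YAML; the file names and keys ("general", "basedir") are purely illustrative and are not OpenBench's actual configuration schema.

```python
# Minimal sketch of the two configuration routes; keys and file names are hypothetical.
import f90nml        # third-party Fortran-namelist parser for Python
import yaml          # PyYAML

# Fortran namelist: needs an extra dependency and returns nested dict-like groups.
nml = f90nml.read("main.nml")            # e.g.  &general  basedir = '/data'  /
basedir = nml["general"]["basedir"]

# YAML equivalent: parsed with a single call from a very widely supported library.
with open("main.yaml") as f:
    cfg = yaml.safe_load(f)              # e.g.  general: {basedir: /data}
basedir = cfg["general"]["basedir"]
print(basedir)
```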
A few points require clarification in the description of the metrics and scores. In Table 2, the bias metrics are described as "the smaller is better, ideal value is 0". However, a lot of the metrics have an infinite range [-∞, ∞], in which case the smallest value for the metric isn't 0 but -∞. It would be more accurate to say "the closer to 0 is better".
The text explaining the various metrics used references metrics that do not appear in Table 2 and need to be clarified:
- line 176: “For categorical data, the Kappa coefficient”, (bolding from me)
- line 178: “and Percent Change in maximum and minimum values help identify” (bolding from me)
The variable naming in the calculation of nRMSEScore should be reviewed. The name "CRESM" is strange; shouldn't it be "CRMSE"? Additionally, the error is once called ε_rmse and once ε_cresm.
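For reference, a consistent notation could follow the ILAMB-style formulation the manuscript appears to use; the exact normalization below is an assumption for illustration, not a quotation from the paper:

$$\mathrm{CRMSE}(x) = \sqrt{\frac{1}{t_f - t_0}\int_{t_0}^{t_f}\Bigl[\bigl(v_{\mathrm{mod}}(t,x)-\bar{v}_{\mathrm{mod}}(x)\bigr)-\bigl(v_{\mathrm{ref}}(t,x)-\bar{v}_{\mathrm{ref}}(x)\bigr)\Bigr]^{2}\,dt}$$

$$\varepsilon_{\mathrm{crmse}}(x)=\frac{\mathrm{CRMSE}(x)}{\sigma_{\mathrm{ref}}(x)},\qquad s_{\mathrm{rmse}}(x)=e^{-\varepsilon_{\mathrm{crmse}}(x)}$$

with nRMSEScore taken as the (area-weighted) spatial mean of s_rmse(x). Using CRMSE and ε_crmse consistently throughout would remove the CRESM/ε_rmse ambiguity.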
The nPhaseScore score explanation needs to be reviewed. Several issues with it are likely linked and can be addressed together. I do not understand what “climatological mean cycles” (line 214) are, which cycles are referred to here? I also do not understand what is referred to with “of evaluation time resolution”. Finally, there are two mathematical symbols in the equations that are not explained, λ and φ.
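For context, one common (ILAMB-style) definition of a phase score compares the timing of the maximum of the climatological mean annual cycle, i.e. the multi-year average for each calendar day or month. The sketch below is an assumption about what the manuscript intends, not a quotation:

$$\theta(x)=t_{\max}\bigl(c_{\mathrm{mod}}(t,x)\bigr)-t_{\max}\bigl(c_{\mathrm{ref}}(t,x)\bigr),\qquad s_{\mathrm{phase}}(x)=\tfrac{1}{2}\Bigl(1+\cos\frac{2\pi\,\theta(x)}{365}\Bigr)$$

where c(t, x) denotes the climatological mean annual cycle. If λ and φ in the manuscript's equations denote longitude and latitude in an area-weighted spatial mean, this should be stated explicitly.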
Lastly, there is very little explanation of the nSpatialScore score. Why is that?
In the section showcasing the tool with some use cases, I disagree with the conclusion of the urban heat evaluation that “these findings highlight the importance of refined urban parameterization schemes in land surface models” (line 377). It isn’t clear why the results shown indicate this. The results indicate the CoLM2024 model performs well except for a specific zone, but do not show if models with different parameterizations do better or worse. Although I strongly agree that refined urban parametrizations can perform better, I disagree that the results shown in this paper allow us to draw a conclusion on the importance of urban parametrization.
In the multiple models comparison, at line 456, it says CoLM2024 and TE are the best models for canopy transpiration and total runoff, whereas Figure 6 shows CLM5 and CoLM2024 are the best for the total runoff. The text in this section talks of "superior performance". I would argue that we can't qualify a score of 0.54 for the runoff as superior. It seems "highest" might be a better choice of qualifier in this case.
In the multiple models comparison section, I also question the choice of the vertical axis range in the parallel coordinates plot for the scores (figures 6b and 7b). I think these plots would be more informative if OpenBench used the same range from 0 to 1 for all the plots. In this way, the plots would visually highlight not only the relative position of the various models, but also the overall quality of all the models (how far from 1 all the models lie) and the relative performance of the models between each other (the spread of the lines would visually highlight if the models performed similarly or very differently). It would make it harder to identify small differences between models, which is, in my view, an advantage as small differences indicate similar performances. It is logical to keep the setting of the range for the vertical axis unchanged in the parallel coordinates plot for the metrics since, contrary to scores, a lot of metrics have an infinite range.
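As a minimal illustration of the fixed-axis suggestion (the model names and scores below are invented, not taken from the paper):

```python
# Sketch of a parallel-coordinates score plot with a common 0-1 vertical range.
import numpy as np
import matplotlib.pyplot as plt

variables = ["GPP", "ET", "Total runoff", "Sensible heat"]
scores = {                                  # hypothetical overall scores in [0, 1]
    "Model A": [0.72, 0.65, 0.54, 0.70],
    "Model B": [0.68, 0.64, 0.52, 0.66],
}

x = np.arange(len(variables))
fig, ax = plt.subplots()
for name, vals in scores.items():
    ax.plot(x, vals, marker="o", label=name)

ax.set_xticks(x)
ax.set_xticklabels(variables)
ax.set_ylim(0, 1)      # a common 0-1 range shows absolute skill as well as model spread
ax.set_ylabel("Score")
ax.legend()
fig.savefig("parallel_scores.png")
```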
In the section comparing a model to multiple reference datasets, I find Figure 9 confusing. It presents a heatmap of various metrics for a model compared to several datasets. The same colormap is used for all metrics, with darker hues for higher metric values. Unfortunately, the metrics do not all show a better agreement at the higher values. Users then need to know the details of each of the metrics to interpret the table instead of being visually guided by the figure. This representation of the metrics would work better if OpenBench used different colormaps for different types of metrics: closer to 0 metrics with a darker hue at 0, metrics with the smallest values being the best with a darker hue for the smallest values, etc. I realise it is harder to put together, but it would greatly improve the representation.
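A rough sketch of the per-metric-colormap idea is given below; the metric values, reference names and colormap choices are illustrative assumptions only:

```python
# Heatmap where each metric column gets a colormap matching its "best" direction.
import numpy as np
import matplotlib.pyplot as plt

references = ["Ref 1", "Ref 2", "Ref 3"]          # hypothetical reference datasets
metrics = ["bias", "RMSE", "KGE"]                 # hypothetical subset of metrics
values = np.array([[ 0.10, 1.2, 0.80],
                   [-0.30, 1.0, 0.70],
                   [ 0.05, 1.5, 0.60]])           # rows: references, columns: metrics

# One colormap per metric family: a diverging map for "closer to 0 is better"
# (ideally with a norm centred on 0, omitted here for brevity), a reversed
# sequential map for "smaller is better", a plain sequential map for "larger is better".
cmaps = {"bias": "RdBu_r", "RMSE": "viridis_r", "KGE": "viridis"}

fig, ax = plt.subplots()
for j, m in enumerate(metrics):
    col = values[:, j:j + 1]                      # one column, shape (n_refs, 1)
    ax.pcolormesh([j, j + 1], np.arange(len(references) + 1), col, cmap=cmaps[m])
    for i, v in enumerate(col[:, 0]):
        ax.text(j + 0.5, i + 0.5, f"{v:.2f}", ha="center", va="center")

ax.set_xticks(np.arange(len(metrics)) + 0.5)
ax.set_xticklabels(metrics)
ax.set_yticks(np.arange(len(references)) + 0.5)
ax.set_yticklabels(references)
ax.invert_yaxis()                                 # first reference dataset on top
fig.savefig("metric_heatmap.png")
```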
Finally, the paper refers several times to the efficiency of the tool and points out the parallelisation using Dask. However, there is nothing in the paper to substantiate this. It would be good if some information could be given about the resources used and the time needed to produce the analyses that are showcased in the paper, for example.
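For example, a short timing of a Dask-parallelised climatology, of the kind that could substantiate the efficiency claims, might look like the sketch below; the file name, variable name, chunk sizes and cluster size are assumptions, not OpenBench's actual setup:

```python
# Timing a chunked computation on a local Dask cluster and saving a task-level report.
import time
import xarray as xr
from dask.distributed import Client, performance_report

client = Client(n_workers=4, threads_per_worker=2)   # illustrative local cluster

# Hypothetical model output file and chunking; adjust to the dataset at hand.
ds = xr.open_dataset("gpp_model.nc", chunks={"time": 120, "lat": 180, "lon": 360})

t0 = time.perf_counter()
with performance_report(filename="dask-report.html"):   # per-task profile to report
    clim = ds["gpp"].groupby("time.month").mean("time").compute()
elapsed = time.perf_counter() - t0
print(f"Monthly climatology computed in {elapsed:.1f} s on 4 workers x 2 threads")
```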
Technical corrections:
Bold text indicates parts of the cited text that I modified to show needed corrections.
Lines 29 and 31: "various changes in the Earth system", "key components of Earth system models". "Earth", when referring to the planet, takes an uppercase initial.
Line 164: “For example, bias metrics”, no uppercase to “bias”.
Line 188: “For a given variable 𝒗(𝒕, 𝒙), where 𝒕 represents time and 𝒙 represents spatial coordinates, we first calculate”. The first sentence here is not a sentence; replace the full stop after “coordinates” with a comma.
Line 194: “Where t0 and tf are the first and final timesteps, respectively.” Replace singular with plural.
Line 200: “Similarly to nBiasScore, we first calculate the centralized RMSE:”. “Similar” changed to “Similarly”, “We” changed to “we”, “RSME” changed to “RMSE”, and remove bolding of nBiasScore.
Line 239: “In contrast, OpenBench offers”. Replace “offering” with “offers”.
Line 277: The sentence finishing with “making it possible to evaluate.” is incomplete. It should be combined with the next sentence.
Figure 3 legend: Replace with “An example of a scores heatmap for GPP classified by IGBP land cover.”
Line 293: Considering OpenBench does not provide any datasets, the part saying “while OpenBench integrates a comprehensive collection of datasets,” would be more accurate as such: “while OpenBench integrates with a comprehensive collection of datasets,”
Citation: https://doi.org/10.5194/egusphere-2025-1380-RC1
AC1: 'Reply on RC1', Zhongwang Wei, 09 Jun 2025
We thank Reviewer #1 for their thoughtful and constructive feedback. This Response to the Reviewer file provides complete documentation of the changes that have been made in response to each individual comment. The reviewer's comments are shown in plain text. The authors' responses are shown in purple. Quotations from the revised manuscript are shown in blue.
RC2: 'Comment on egusphere-2025-1380', Anonymous Referee #2, 27 May 2025
This paper describes a new cross-platform software system for the evaluation and comparison of land surface models using a broad suite of metrics, statistics and comparison methods.
The authors clearly demonstrate OpenBench's capabilities with various examples. The figures are comprehensive and clear. The manuscript is written very clearly, with few grammatical errors, and therefore I have few comments in this regard.
Regarding the software itself, I appreciate the authors' efforts to provide an easily accessible and runnable code base along with sample data for testing. However, I note that if users follow the "usage" instructions from the GitHub repository README, there is no file provided for "nml/main.nml", so the program fails. I was able to run the more complex example with sample data using the file "main-Debug.nml", but I recommend the authors update the codebase to provide a highly simplified "main.nml" for initial user testing, and clearer instructions on how to adapt the codebase for custom model/dataset analysis.
An internet connection is required for some plotting functions (e.g. to download the Cartopy coastline), while some HPC environments may not have internet connectivity. Without internet connectivity, the program fails. A programmed fallback that skips downloading coastlines, etc., would improve functionality.
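A minimal sketch of such a fallback is shown below, assuming either a pre-downloaded Natural Earth directory or a plain try/except around the coastline feature; the environment variable name is hypothetical:

```python
import os
import cartopy
import cartopy.crs as ccrs
import cartopy.feature as cfeature
import matplotlib.pyplot as plt

# If a pre-downloaded Natural Earth directory is available, point Cartopy at it so no
# download is attempted on an offline node (the environment variable name is hypothetical).
offline_dir = os.environ.get("OPENBENCH_CARTOPY_DIR")
if offline_dir:
    cartopy.config["pre_existing_data_dir"] = offline_dir

fig = plt.figure()
ax = fig.add_subplot(1, 1, 1, projection=ccrs.PlateCarree())
coast = ax.add_feature(cfeature.COASTLINE)
try:
    fig.canvas.draw()   # drawing triggers the Natural Earth download if data are missing
except Exception:
    coast.remove()      # fall back to a plain map frame when offline and uncached
    print("Coastline data unavailable offline; plotting without coastlines.")
fig.savefig("global_map.png")
```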
Regarding the manuscript, the authors may wish to comment in the paper on the name "OpenBench", and reduce references to this being a "benchmarking system", as readers may have a different interpretation of "benchmarking". To my understanding, the broad meaning of benchmarking is comparison with a well-defined standard, or an a priori performance expectation (e.g. see the introduction and explanatory figures in your reference Best et al., 2015). This software undertakes evaluation and comparison without explicitly benchmarking (using the definitions in Best et al., 2015). However, I recognise that others in the community use "benchmarking" differently (e.g. in ILAMB). This could be commented on in the paper.
Some of the models, datasets or studies mentioned are not properly referenced. For example: CLASS, CABLE, PLUMBER2. Please include the relevant references.
Also ensure all acronyms are defined. For example, I cannot find a definition for uRMSD used in Figure 10. Overall, figure captions could be improved by reducing or explaining acronyms.
Please ensure the software in Table 2 is properly named. For example, ESMVal should be ESMValTool, and PALS has changed its name to modelevaluation.org.
Overall, I see great potential in this work, and congratulate the authors for this contribution. I look forward to integrating OpenBench into my evaluation workflow.
Citation: https://doi.org/10.5194/egusphere-2025-1380-RC2
AC2: 'Reply on RC2', Zhongwang Wei, 09 Jun 2025
We thank Reviewer #2 for their thoughtful and constructive feedback. This Response to the Reviewer file provides complete documentation of the changes that have been made in response to each individual comment. The reviewer's comments are shown in plain text. The authors' responses are shown in purple. Quotations from the revised manuscript are shown in blue.
Citation: https://doi.org/10.5194/egusphere-2025-1380-AC2