This work is distributed under the Creative Commons Attribution 4.0 License.
Best practices in software development for robust and reproducible geoscientific models based on insights from the Global Carbon Project models
Abstract. Computational models play an increasingly vital role in scientific research by numerically simulating processes that cannot be solved analytically. Such models are fundamental in the geosciences and offer critical insights into the impacts of global change on the Earth system, today and in the future. Beyond their value as research tools, models are also software products and should therefore adhere to established software engineering standards. However, scientists are rarely trained as software developers, which can lead to deficiencies in software quality such as unreadable, inefficient, or erroneous code. The complexity of these models, coupled with their integration into broader workflows, also often makes reproducing results, evaluating processes, and building upon them highly challenging.
In this paper, we review the current practices within the development processes of the state-of-the-art land surface models used by the Global Carbon Project. By combining the experience of modelers from the respective research groups with the expertise of professional software engineers, we bridge the gap between software development and scientific modeling to outline key principles and tools for improving software quality in research. We explore four main areas: 1) model testing and validation, 2) scientific, technical, and user documentation, 3) version control, continuous integration, and code review, and 4) the portability and reproducibility of workflows.
Our review of current models reveals that while modeling communities are incorporating many of the suggested practices, significant room for improvement remains in areas such as automated testing, documentation, and reproducible workflows. For instance, there is limited adoption of automated documentation and testing, and the provision of reproducible workflow pipelines remains an exception. This highlights the need to identify and promote essential software engineering practices within the scientific community. Nonetheless, we also discuss numerous examples of existing practices that can serve as guidelines for other models and could even help streamline processes across the entire community.
We conclude with an open-source example implementation of these principles built around the LPJ-GUESS model, showcasing portable and reproducible data flows, a continuous integration setup, and web-based visualizations. This example may serve as a practical resource for model developers, users, and all scientists engaged in scientific programming.
Competing interests: Co-author Sam Rabin is on the editorial board of GMD.
Publisher's note: Copernicus Publications remains neutral with regard to jurisdictional claims made in the text, published maps, institutional affiliations, or any other geographical representation in this paper. While Copernicus Publications makes every effort to include appropriate place names, the final responsibility lies with the authors. Views expressed in the text are those of the authors and do not necessarily reflect the views of the publisher.
Status: final response (author comments only)
- RC1: 'Comment on egusphere-2025-1733', Anonymous Referee #1, 23 Aug 2025
- RC2: 'Reply on RC1', Anonymous Referee #1, 23 Aug 2025
  Publisher's note: the content of this comment was removed on 9 September 2025 since the comment was posted by mistake.
  Citation: https://doi.org/10.5194/egusphere-2025-1733-RC2
- RC3: 'Reply on RC1', Anonymous Referee #1, 23 Aug 2025
  Publisher's note: the content of this comment was removed on 9 September 2025 since the comment was posted by mistake.
  Citation: https://doi.org/10.5194/egusphere-2025-1733-RC3
- RC4: 'Comment on egusphere-2025-1733', Anonymous Referee #2, 23 Dec 2025
In its own words, this article aims to combine the experience of land surface modelers and the expertise of professional software engineers in order to define the key principles and tools for improving software quality in the field of research. The authors conclude that it is possible to improve land surface models in areas such as automated testing, documentation, and reproducible workflows, but that inspiration can already be found in individual models, particularly the LPJ-GUESS model, which the authors consider the most successful in this field. This article is well written (with the help of ChatGPT, as explained in the acknowledgements – an honest and fair statement) and can certainly serve as a useful reference, but it can still be greatly improved. I actually tend to challenge the claim that "All [32] authors contributed in writing and editing the manuscript", or at least that many of them did so more than ChatGPT. This statement may seem provocative, but it stems from a few surprises:
- Why would land surface modelers claim that their models are "the Global Carbon Project models" (title)? Have they not read Table 4 of Friedlingstein et al. (2023) or other papers in the series that many of them coauthored? This table lists many more types of models in GCP's Global Carbon Budget (GCB).
- Why would land surface modelers explain that all models in geoscience, including their own, represent "processes that cannot be solved analytically" (l. 2)?
- Why would land surface modelers suggest (l. 2) that their models are only used to study global change?
- Why would land surface modelers from the GCB ignore the fact that some of them contribute to GCP's Global Methane Budget (GMB) and to GCP's Global Nitrous Oxide Budget, and that other biogeochemical models contribute to the GMB as well (l. 8-9, 46-51)?
- Why would the modelers who, arguably, wrote Section A1.7 have left five lines in German in their section if they had properly read the paper?
- Why would they have left Table 1, describing their models, with so little useful information? A rough number of code lines, rough code age, rough number of developers (cf. l. 84), number of dedicated scientific programmers (cf. l. 421-422), rough number of systems to which the code has been ported, main configuration (site scale, regional or global scale, coupling with an Atmospheric General Circulation Model, usage as a component of an Earth System Model), and whether the models are used in an operational framework or not would also be of interest to the reader in the context of the paper (cf. l. 423-424).
- Why would the proud representatives of the 20 land surface models choose to elect one of them as the ideal model (l. 20, l. 68-69, or Section 7 pompously called “A showcase”) without much justification? For example, Section A1.7, with its comprehensive suite of tests run daily, strikes me as impressive (or is the paragraph fake?): it is necessary to discuss the reasons why this model workflow is inferior to the one that was chosen.
- In general, why would they have described their experience in software engineering in such a superficial way, even suggesting that none of them are professional software engineers (l. 10)? Exploring the references and URLs given in Table 1, I see that some of these models have a history of integrating developments from a heterogeneous ecosystem of contributors over several decades (cf. l. 84-85), that several models have been developed under the coordination of weather centers, and that some of them have been components of Earth system models in the Coupled Model Intercomparison Project (CMIP): some of their developers must be particularly good at software development, or the models would not have survived the diversity of their contributors and computing environments and would not have had such challenging applications as CMIP. Actually, the text suggests (l. 4 and 31) that only scientists develop these models. I do believe that some professional software engineers should be credited as well and that some of the scientists involved, formally trained or not (l. 31-32), are also remarkable software developers (see also l. 421-422). The fact that they generalize the dull description of their experience in software engineering to "all sorts of scientific modeling and software" (l. 62-63) may be seen as arrogant. Developers of numerical weather or ocean prediction systems, for instance, would be mere amateurs? Come on! They deserve better comments, and so do the LSMs.
- Why would land surface modelers who are familiar with MIPs explain that MIPs allow them to “understand the range of uncertainty” (l. 227 – what does it mean?), while also forgetting to mention that a main outcome of MIPs is debugging the weakest models? A deeper investigation of MIP usefulness is certainly not out of the scope of this paper (l. 231).
I really urge the 32 coauthors to deepen their analysis for the benefit of the readers. In reality, the LSM authors could have presented their software work in a much more positive light and highlighted the merit and strength of their efforts, whereas the article reads like a simple request for funding. Note that in their work, LSM developers also face the bugs of commercial software, like compilers: nobody’s safe.
Additional detailed comments:
- It would be worth explaining that the entry ticket for LSMs in the GCB is rather cheap, as explained in Section S.4.2 of Friedlingstein et al. (2023): “We apply three criteria for minimum DGVMs realism by including only those DGVMs with (1) steady state after spin up, (2) global net land flux (SLAND – ELUC) that is an atmosphere-to-land carbon flux over the 1990s ranging between -0.3 and 2.3 GtC yr-1, within 90% confidence of constraints by global atmospheric and oceanic observations (Keeling and Manning, 2014; Wanninkhof et al., 2013), and (3) global ELUC that is a carbon source to the atmosphere over the 1990s, as already mentioned in Supplement S.2.2. All DGVMs meet these three criteria.” As a consequence, the quality of the selected LSMs is likely heterogeneous and their engineering support as well.
- Surprisingly, the issue of competition with the private sector for the recruitment of skilled computer engineers is not addressed at all.
- The challenge of rewriting hundreds of thousands of lines of code and the dilemma of scientists losing their understanding of the rewritten versions must be addressed.
- Line 2: why is the discussion restricted to the impact of global change?
- Line 49: what are the critical insights offered by the GCB for policy-making?
- Line 50: where is the land flux to the ocean computed by LSMs used in the GCB?
- Line 33: the statement about the lack of recognition for software development in academia is unfair, in particular if one relates it to l. 255-256 about dedicated high-profile scientific journals for software development. There are also many calls for proposals, in some countries or groups of countries, for funding software development or simply for offering professional software support.
- Line 86 and 631-632: the lack of funding for positions dedicated to scientific programming is also a choice made by scientists who set priorities within their host organizations.
- Line 91: around me, the time spent on software development or on software support calls by scientists is a choice. It may not be sufficient, but authors should not blame the system in the first place, but rather themselves.
- Line 124-125: to detect where something is wrong, reference outputs need to be available, which is not always the case. Think about the adjoint code of an LSM routine, for instance (unless one builds heavy adjoint testing machinery around each routine; a minimal sketch of such a check is given after this list).
- Line 219: netCDF has already been used above without any expansion of the acronym.
- Line 224: MIP has not been defined.
- Line 455: missing punctuation mark.
- Appendix A: the authors should try to better exploit/integrate this appendix within the main text.
- 6: this subsection seems to be out of scope. It should be either removed or rewritten in order to fit the paper.
- In Table 1, the citation Vuichard et al. has been duplicated, and Hoffman et al. (l. 213) has no year associated with it.
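As a minimal sketch of the "adjoint testing machinery" mentioned in the comment on l. 124-125 above, the snippet below implements the standard dot-product (adjoint consistency) test, which checks ⟨Mx, y⟩ = ⟨x, Mᵀy⟩ without needing stored reference outputs. The routines `tangent_linear` and `adjoint` are hypothetical stand-ins for a linearized LSM routine and its hand-coded adjoint; they are not taken from any of the reviewed models.

```python
import numpy as np

def _matrix(k=0.3):
    """Toy linear operator M standing in for the tangent-linear code of an LSM routine."""
    return np.array([[1.0 - k, k, 0.0],
                     [0.0, 1.0 - k, k],
                     [k, 0.0, 1.0 - k]])

def tangent_linear(dx):
    """Hypothetical tangent-linear routine: applies M to a perturbation dx."""
    return _matrix() @ dx

def adjoint(dy):
    """Hypothetical hand-coded adjoint: applies M^T to a sensitivity dy."""
    return _matrix().T @ dy

def dot_product_test(n=3, seed=0, tol=1e-12):
    """Check the adjoint identity <M x, y> == <x, M^T y> for random vectors."""
    rng = np.random.default_rng(seed)
    x, y = rng.standard_normal(n), rng.standard_normal(n)
    lhs = np.dot(tangent_linear(x), y)
    rhs = np.dot(x, adjoint(y))
    assert abs(lhs - rhs) <= tol * max(1.0, abs(lhs)), (lhs, rhs)
    return lhs, rhs

if __name__ == "__main__":
    print("dot-product test passed:", dot_product_test())
```

In a real setup, the same test would be applied routine by routine to the actual tangent-linear and adjoint code, which is precisely the "heavy machinery" the comment alludes to.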
Citation: https://doi.org/10.5194/egusphere-2025-1733-RC4
Model code and software
Model workflow showcase Konstantin Gregor https://doi.org/10.5281/zenodo.15191116
Viewed
| HTML | PDF | XML | Total | BibTeX | EndNote |
|---|---|---|---|---|---|
| 2,309 | 274 | 33 | 2,616 | 42 | 63 |
Publisher's note: this comment was edited on 9 September 2025. The following text is not identical to the original comment, but the adjustments were minor and have no effect on the scientific meaning.
First of all, I want to congratulate the authors on this great piece of work. The manuscript provides really valuable and timely insights, especially now, when the scientific community is paying more attention to software quality and to making sure geoscientific models can be reliably reproduced. Combining the hands-on experience of modelers within the Global Carbon Project with the perspective of software engineers creates a solid and much-needed approach. Their thorough review of current practices and their suggestions for best practices (e.g., covering testing, documentation, continuous integration, portability, and reproducibility) are a big step forward for the carbon modeling community and, more generally, for climate modeling as a whole.
The effort they put into gathering real-world experiences, pinpointing issues, and sharing concrete examples, like the case of LPJ-GUESS, really shows strong teamwork and a real commitment to making scientific software more robust and sustainable. That said, I have some questions and thoughts that popped up while reading this article, and I think they might help deepen the discussion. For example, the article emphasizes that automated testing setups and strategies like CI/CD aren't widely used, even though they're really important. It also points out that many models run on high-performance computing (HPC) systems, where simulations can be expensive and take weeks to complete. Given this, what ideas do you have for extending testing and continuous integration methods to HPC environments and intercomparison projects like CMIP6/CMIP7, where resources are tight and running models costs a lot?
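One partial answer, sketched below, is that even without HPC resources a CI pipeline can run a reduced-configuration smoke test whose summary diagnostics are compared against stored reference values. The wrapper `run_reduced_simulation()` is a hypothetical stub (here it just returns fixed numbers) standing in for a short, coarse-resolution model run; the diagnostic names and reference values are invented for illustration and do not come from any GCP model.

```python
import math

# Hypothetical reference diagnostics from an earlier trusted run of the reduced setup.
REFERENCE = {"gpp_PgC_yr": 120.4, "nee_PgC_yr": -2.1, "total_carbon_PgC": 2450.0}

def run_reduced_simulation():
    """Stub standing in for a short, coarse-resolution model run driven by a small test forcing."""
    return {"gpp_PgC_yr": 120.4, "nee_PgC_yr": -2.1, "total_carbon_PgC": 2450.0}

def test_reduced_run_matches_reference(rel_tol=1e-6):
    """Regression check: each diagnostic must stay within a relative tolerance of the reference."""
    result = run_reduced_simulation()
    for name, ref in REFERENCE.items():
        assert math.isclose(result[name], ref, rel_tol=rel_tol), (name, result[name], ref)

if __name__ == "__main__":
    test_reduced_run_matches_reference()
    print("reduced-configuration regression test passed")
```

Such a test can run in minutes on a standard CI runner and would catch unintended changes in behavior long before a full HPC production run is launched.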
The article also mentions that reproducibility is often challenged by the diversity of environments: different compilers, HPC systems, and local setups. Even following FAIR data principles and sharing code in open repositories does not always ensure that workflows can be repeated easily. Have you thought about ways to tackle these reproducibility barriers across different institutions? Do you see moving towards more standardized containers and environment managers (like Conda, Docker, Singularity) as a practical step to narrowing this gap?
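A lightweight complement to containers, sketched below on a best-effort basis, is to record provenance metadata alongside every simulation: the code version, the platform, and checksums of the input files. This is a generic illustration rather than part of any published workflow; the output file name `provenance.json` and the empty input list are placeholders.

```python
import hashlib
import json
import platform
import subprocess
import sys
from datetime import datetime, timezone

def sha256sum(path):
    """Checksum of an input file so the exact forcing/configuration can be verified later."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def collect_provenance(input_files):
    """Gather code version, platform details, and input checksums into one dictionary."""
    try:
        commit = subprocess.check_output(["git", "rev-parse", "HEAD"], text=True).strip()
    except (OSError, subprocess.CalledProcessError):
        commit = "unknown (not a git checkout)"
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "git_commit": commit,
        "python": sys.version,
        "platform": platform.platform(),
        "inputs": {p: sha256sum(p) for p in input_files},
    }

if __name__ == "__main__":
    # Hypothetical call; in practice, list the actual forcing and configuration files.
    with open("provenance.json", "w") as f:
        json.dump(collect_provenance(input_files=[]), f, indent=2)
```

Shipping such a sidecar file with every set of model outputs makes it much easier to rebuild the same environment, whether or not a container image is also provided.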
Also, the coexistence of various programming languages such as Fortran, C, and C++ makes it necessary to keep code readable, portable, and maintainable. With the ongoing shift towards newer paradigms, such as migrating from Fortran to C++/Python and the increasing use of GPUs and hybrid architectures, what practices would you recommend to keep code sustainable over the next 10–20 years?
I also believe making public code repositories accessible is essential, as platforms like Zenodo (mentioned in the article for sharing model versions) are really effective for preserving and sharing software. Do you think these repositories should become a standard part of our community’s toolkit?
Improving the maintainability and readability of the Fortran code in these models is a very hard challenge. Since 13 out of 20 GCP land surface models (LSMs) are written in Fortran, have you considered integrating a specific analyser tool into the development and CI pipelines? Do you think using FortranAnalyser could become a standard for the community, especially since it not only helps find potential errors but also encourages standard coding styles, modularity, and best practices in long-term projects like the GCP LSMs? This open-source tool stands out because it can analyse any version of Fortran, unlike others such as F-Lint or fortran-src. This flexibility is especially useful since different Fortran standards coexist in scientific work, and it can even be customized to add specific metrics useful to the development team. If yes, do you see any technical or cultural challenges in adopting it widely?
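Independently of whether FortranAnalyser itself is adopted, even a small custom check wired into CI can enforce basic conventions. The sketch below is a generic illustration (it is not FortranAnalyser and not an existing project script, and the default `src` directory is an assumption): it flags Fortran free-form source files that lack `implicit none` or contain overlong lines, and exits non-zero so a CI job would fail.

```python
import re
import sys
from pathlib import Path

MAX_LINE_LENGTH = 132  # free-form Fortran limit; a project may choose a stricter value

def check_file(path):
    """Return a list of human-readable findings for one Fortran source file."""
    findings = []
    text = path.read_text(errors="replace")
    if not re.search(r"^\s*implicit\s+none\b", text, flags=re.IGNORECASE | re.MULTILINE):
        findings.append(f"{path}: no 'implicit none' found")
    for lineno, line in enumerate(text.splitlines(), start=1):
        if len(line) > MAX_LINE_LENGTH:
            findings.append(f"{path}:{lineno}: line longer than {MAX_LINE_LENGTH} characters")
    return findings

def main(source_dir="src"):
    """Scan all *.f90/*.F90 files under source_dir and report findings."""
    findings = []
    for path in sorted(Path(source_dir).rglob("*.[fF]90")):
        findings.extend(check_file(path))
    for finding in findings:
        print(finding)
    return 1 if findings else 0  # non-zero exit makes the CI pipeline fail

if __name__ == "__main__":
    sys.exit(main(sys.argv[1] if len(sys.argv) > 1 else "src"))
```

A dedicated tool such as FortranAnalyser obviously goes much further (metrics, style rules, support for older Fortran standards), but a minimal gate like this already lowers the barrier to putting static checks into every merge request.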