This work is distributed under the Creative Commons Attribution 4.0 License.
Biogeochemistry-Informed Neural Network (BINN) for Improving Accuracy of Model Prediction and Scientific Understanding of Soil Organic Carbon
Abstract. Big data and the rapid development of artificial intelligence (AI) provide unprecedented opportunities to enhance our understanding of the global carbon cycle and other biogeochemical processes. However, retrieving mechanistic knowledge from big data remains a challenge. Here, we develop a Biogeochemistry-Informed Neural Network (BINN) that seamlessly integrates a vectorized process-based soil carbon cycle model (Community Land Model version 5, CLM5) into a neural network (NN) structure to examine the mechanisms governing soil organic carbon (SOC) storage from big data. BINN demonstrates high accuracy in retrieving biogeochemical parameter values from synthetic data in a parameter recovery experiment. We use BINN to predict six major processes regulating the soil carbon cycle (or components in process-based models) from 25,925 observed SOC profiles across the conterminous US and compare them with the same processes previously retrieved by a Bayesian inference-based PROcess-guided deep learning and DAta-driven modeling (PRODA) approach. The high agreement between the spatial patterns of the processes retrieved by the two approaches, with an average correlation coefficient of 0.81, confirms BINN's ability to retrieve mechanistic knowledge from big data. Additionally, the integration of neural networks and process-based models in BINN improves computational efficiency by more than 50 times over PRODA. We conclude that BINN is a transformative tool that harnesses the power of both AI and process-based modeling, facilitating new scientific discoveries while improving the interpretability and accuracy of Earth system models.
Status: open (until 19 Sep 2025)
CEC1: 'Comment on egusphere-2025-3282 - No compliance with the policy of the journal', Juan Antonio Añel, 28 Jul 2025
Dear authors,
Unfortunately, after checking your manuscript, it has come to our attention that it does not comply with our "Code and Data Policy".
https://www.geoscientific-model-development.net/policies/code_and_data_policy.html
In your "Code and Data Availability" statement you do not include suitable repositories (with their links and permanent identifiers, e.g., DOIs) for all the code and data necessary to replicate your work. This includes the neural network, the CLM5 code, and all the data used for training, as well as the resulting output data.
We cannot accept this. Your manuscript should never have been accepted for Discussions given such non-compliance with the policy of the journal. Our policy clearly states that all the code and data necessary to replicate a manuscript must be published openly and freely to anyone before submission.
Therefore, you have to reply to this comment in a prompt manner with the information for the repositories containing all the models, code and data that you use to produce and replicate your manuscript. Also, any future version of your manuscript must include the modified section with the new information.
I must note that if you do not fix this problem, we cannot continue with the peer-review process or accept your manuscript for publication in our journal.
Juan A. Añel
Geosci. Model Dev. Executive Editor
Citation: https://doi.org/10.5194/egusphere-2025-3282-CEC1
AC1: 'Reply on CEC1', Haodi Xu, 28 Jul 2025
Dear Dr Añel,
Thank you for alerting us to this problem. We did not realise that the hyperlink in the PDF version of our pre‑print was inactive. All material needed to reproduce the study is already openly available at the following GitHub repository: https://github.com/Hardyxu8067/BINN/tree/main
We also archived the GitHub repository on Zenodo with a citable DOI: https://doi.org/10.5281/zenodo.16541441
We apologise for the oversight and appreciate your guidance.
Best regards,
Haodi Xu
Citation: https://doi.org/10.5194/egusphere-2025-3282-AC1
CEC2: 'Reply on AC1', Juan Antonio Añel, 29 Jul 2025
Dear authors,
Unfortunately, your reply does not resolve the irregular situation that I pointed out before. First, the text in your Code and Data Availability section does not conform to the policy of the journal, which clearly states that you must cite the repositories. Hyperlinks alone are not acceptable.
Second, your Zenodo repository does not contain the code of the CLM model, which you must also share.
Additionally, now that it is possible to check the assets you have shared, it is clear that your implementation depends on a number of external libraries that you do not share. Therefore, to ensure the replicability of your work, the README file should indicate the version numbers of the libraries that you used.
Also, I have seen that in the license file you have indicated a nickname in the field for the copyright owner. Nicknames are not acceptable, as they are not formally linked to a natural or legal person. Therefore, you must replace it with the name of the person or entity retaining the copyright of the published assets.
Juan A. Añel
Geosci. Model Dev. Executive Editor
Citation: https://doi.org/10.5194/egusphere-2025-3282-CEC2
AC2: 'Reply on CEC2', Haodi Xu, 29 Jul 2025
Dear Dr Añel,
Thank you for the clear guidance. We now use a formal citation (no hyperlinks), as required, in our Code and Data Availability section:
All materials needed to reproduce the study, including data and scripts to run BINN with CLM5 and to produce the plots for all the simulations presented in this paper, are archived at: Hardyxu8067 and joshuafan (2025) "Hardyxu8067/BINN: v0.0.2". Zenodo. doi: 10.5281/zenodo.16557753.
Regarding the CLM5 code: our study uses a stand-alone PyTorch re-implementation of the CLM5 soil organic carbon submodel, which is included in the Zenodo archive (BINN_clean/src_binns/fun_matrix_clm5_vectorized.py).
To ensure full replicability, exact versions of all external libraries are pinned in 'BINN_clean/src_binns/requirements.txt', and the 'README.md' provides step‑by‑step instructions to install all required libraries and reproduce all results. We have also corrected the license files to list the authors’ full legal names.
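For readers unfamiliar with the matrix approach, a minimal sketch of a first-order SOC pool model in matrix form follows; it uses numpy rather than PyTorch, and the names, dimensions, and rate values are purely illustrative, not taken from our repository.

```python
import numpy as np

def steady_state_soc(A, K, I):
    """Steady-state pool sizes of a first-order SOC model in matrix form.

    The model is dC/dt = A @ K @ C + I, where A holds transfer fractions
    between pools, K the (diagonal) decomposition rates, and I the carbon
    inputs. Setting dC/dt = 0 gives C = -(A K)^{-1} I.
    """
    return -np.linalg.solve(A @ K, I)

# Two-pool illustration: pool 1 decomposes at 0.5/yr, 30% of its losses
# transfer to pool 2, which decomposes at 0.05/yr; all input (2 units/yr)
# enters pool 1.
A = np.array([[-1.0, 0.0],
              [0.3, -1.0]])
K = np.diag([0.5, 0.05])
I = np.array([2.0, 0.0])
C = steady_state_soc(A, K, I)  # array([4., 12.])
```

Because the steady-state solve is expressed with differentiable linear-algebra operations, gradients of the loss with respect to the parameters can flow through it during training.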
Please let us know if further improvements are needed to meet GMD's policy.
Best regards,
Haodi Xu
Citation: https://doi.org/10.5194/egusphere-2025-3282-AC2
CEC3: 'Reply on AC2', Juan Antonio Añel, 29 Jul 2025
Dear authors,
Many thanks for addressing the outstanding issues and for your explanations regarding the CLM model. We can consider now the current version of your manuscript in compliance with the Code and Data policy of the journal.
Juan A. Añel
Geosci. Model Dev. Executive Editor
Citation: https://doi.org/10.5194/egusphere-2025-3282-CEC3
RC1: 'Comment on egusphere-2025-3282', Anonymous Referee #1, 12 Aug 2025
In this article, the authors integrated a well-known process-based model, CLM5, for simulating soil organic carbon states into a neural network framework that they call BINN. They use a neural network to recover CLM5 parameters and compare this approach to a Bayesian parameter-estimation framework called PRODA. The strength of BINN is a significant speed-up of parameter calibration relative to PRODA, supported by a preliminary experiment on synthetic data to showcase the approach. While this is an interesting and specific application of embedding neural networks in process-based models, and the article builds on some interesting previous works, I recommend changes and extensions to the analysis, as well as restructuring of the article, before considering it for publication.
In general, I hesitate to consider this a new model or method; rather, I see it as an application of an already established method to an existing model. Further, to my knowledge, articles in GMD require a model version in the title, which in this case should include at least CLM5, the model being calibrated.
The key idea of this article is that we should use BINN because it speeds up parameter estimation at a similar level of accuracy to the second approach that estimates spatially varying parameters, PRODA. However, the PRODA framework is not properly introduced in the methods section, and the evaluation of the two frameworks in the results section is incomplete, so I suggest revising this part. As readers, we do not know how well PRODA performs. In section 4.2, only the correlations between PRODA and BINN predictions are reported, not the correlations with the observations, which remain qualitatively described. Please add the corresponding evaluation metrics here as well (is this what we see in Figure 6? If so, please refer to this figure).
Accordingly, when comparing the two approaches against each other (Figure 5), it would be good to see both evaluated against a common ground truth, e.g., as is displayed in Figure 4 for BINN alone. A good experimental design would be to do this within the synthetic experiment first, where the ground truth of the parameters is known. The results could then compare 1) the correlation of the parameters recovered by each approach with the synthetic parameters and simulated SOC, and 2) after parameterisation, the correlation of each approach's predicted SOC (or other soil states) with the observed SOC. In the discussion, finally, PRODA should be contrasted with the parameter-learning approach on a methodological level.
Another larger issue I see with this article is that equifinality is not addressed critically, while the synthetic results in Figure 4 indicate that it is already present when fitting just four parameters (see Section 3; why else are the correlations not stronger here?). For the application later on, the authors fit all 21 parameters of the model, and this can only be expected to worsen their unidentifiability. If this work aims for methodological demonstration, it could be better to stay with a smaller set of parameters but discuss their estimation in more detail, or, if going for many parameters, to run the simulation experiment on many as well. Generally, the strength of a traditional Bayesian calibration approach is that we obtain uncertainty via the posterior distributions, information which, unfortunately, neither PRODA nor BINN currently provides but which could greatly support the interpretation of these findings. This is a limitation of PRODA, which is introduced as a Bayesian approach but does not leverage its capabilities. I think the authors should openly discuss the limitations of PRODA both in the methods section on PRODA and in the discussion.
As a last major point, I have doubts about the naming, which may be misleading. While the terminology is certainly used broadly, PINNs have more recently been re-defined more specifically as incorporating the learned ODE in a combined loss function as teacher forcing (https://doi.org/10.1007/s44379-025-00015-1). The approach introduced in this study falls into the general realm of physics-informed machine learning but differs significantly in method from this definition of PINNs; hence, I believe the article would profit from re-specifying the method more precisely in the title and abstract as, e.g., an end-to-end local parameter learning/recovery approach (see e.g. https://doi.org/10.1038/s41467-021-26107-z). Further, this specific approach to model calibration should be reviewed in the introduction, and the physics-informed ML approach distinguished from other process-informed ML approaches. This will help structure the article and the methods section (for an overview of process-informed ML approaches, see e.g. https://doi.org/10.1111/ele.70012).
Please find below a selection of minor points.
Structure: Section 5: These algorithmic details should come earlier, in a section on fitting the network to the CLM5 parameters (e.g., 2.1). PRODA should not be introduced here but in its own subsection of Section 2, after the NN and the process model. Also, Section 2 should cover general methods. Why is there a separate section on observational performance and computational efficiency? I suggest summarising the results in a results section of their own, moving the data-preparation methods section, and distributing the contents of Section 4.2 between methods and results.
Section 1 (Introduction): 1) The beginning of the introduction lacks biogeochemical examples of parameter learning beyond soil, while mentioning a wide range of fields where hybrid models are applied. Preferably, mention other biogeochemical applications. 2) More importantly, hybrid approaches are introduced without any differentiation of how physical constraints are integrated with machine learning. See the general comment on BINN; this would greatly help the reader locate the introduced approach at the outset (for an overview, e.g., for carbon flux with difference equations, see https://doi.org/10.1111/ele.70012).
3) Further, if the goal is enabling interpretability of biogeochemical dynamics, as stated at the end of the introduction, I would like to know why and how BINN can lead to improvements here over, e.g., traditional Bayesian approaches.
4) In contrast to hybrid and mechanistic approaches, there are also established statistical models that estimate spatially varying parameters while maintaining direct interpretability of their coefficients, such as SVCMs or geographically weighted regression (see, for example, 10.1186/s12862-024-02260-z). This should at least be mentioned.
Section 2: General introduction: state more precisely whether this is an end-to-end or a two-step procedure. From Figure 1 and the description in 2.2 I expect a fully integrated hybrid model, but here it sounds as if it were two fragmented steps; please clarify.
Equation (2): Please check for mistakes; y_i is not defined and could be z_i.
Equation (5): How do you choose tau, also in the hyperparameter search? And why p_j - 0.5, i.e., could you elaborate on why you chose 0.5? From the description I would expect a parameter-specific value for each p_j, i.e., the centre of the prior distribution, unless you scaled them. If so, please mention it.
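For concreteness, one plausible reading of such a prior term (my reconstruction, not the manuscript's equation), assuming the parameters $p_j$ have been scaled to $[0, 1]$ so that $0.5$ is the centre of each prior range, is

```latex
\mathcal{L}_{\text{prior}} \;=\; \tau \sum_{j} \left( p_j - 0.5 \right)^{2}
```

in which case $0.5$ is parameter-independent by construction; if the $p_j$ are not scaled, I would instead expect a parameter-specific centre $\bar{p}_j$ in place of $0.5$.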
Section 2.4: Please give the training details from Section 5 here.
Section 3: As the goal of the paper is clearly stated in the title and abstract, this section could simply be called "Simulation experiment".
What type of sensitivity analysis did you use for CLM5?
Given the equifinality and wide distributions in Figure 3, and as this experiment was repeated in cross-validation, would it be possible to also report the standard deviation of the correlations?
Figure 3: See above. What sensitivity index was used, and how was this done? It looks to me like feature importances.
Section 7: General: briefly introduce BINN and the Bayesian approach at the beginning.
"High correlations between BINN-retrieved and prescribed biogeochemical parameter values in a controlled parameter recovery experiment demonstrate BINN's ability to recover causal relationships between covariates and SOC dynamics. Faithful retrieval of biogeochemical parameters from data substantially reduces uncertainty in SOC model predictions."
These two sentences come across as rather isolated. Please link back to your findings: where do we see this?
Section 7.3: While I agree that this approach may provide a new tool to model unresolved processes, it is not very clear how it can help towards better mechanistic understanding with, e.g., the traceability analysis mentioned. Could you explain this better? Also, there is a lot of repetition in this paragraph.
A positional encoder was used in the NN to inform the network about location. This design decision may blur the biogeochemical interpretation of the parameter estimates, if that is the goal. A post-hoc check of sensitivity to location could be useful here, and, if the model proves sensitive, the analysis should be rerun without the positional encoder.
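One simple form such a post-hoc check could take is a permutation probe over the location inputs; the function below is an illustrative sketch (hypothetical names, not code from the manuscript), where a large mean prediction shift would indicate strong reliance on location rather than on biogeochemical covariates.

```python
import numpy as np

def location_sensitivity(predict, X, loc_cols, seed=0):
    """Permute the location columns of X and measure how much predictions move.

    `predict` is any fitted model's prediction function; `loc_cols` indexes
    the positional-encoding inputs. Returns the mean absolute change in
    predictions under the permutation.
    """
    rng = np.random.default_rng(seed)
    X_perm = X.copy()
    for c in loc_cols:
        X_perm[:, c] = rng.permutation(X_perm[:, c])
    return float(np.mean(np.abs(predict(X) - predict(X_perm))))
```

A model that ignores the location columns scores exactly zero under this probe, which makes the check easy to calibrate.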
RC2: 'Comment on egusphere-2025-3282', Anonymous Referee #2, 15 Sep 2025
Summary of Abstract:
This article presents a Biogeochemistry-Informed Neural Network, abbreviated BINN. BINN is an NN model that outputs 21 parameters (carbon transfer fractions, turnover times, environmental modifiers, carbon input allocation, and vertical transport parameters) of the Community Land Model version 5 (CLM5) soil carbon module. The NN model runs a differentiable matrix form of CLM5 inside the model and claims (i) high parameter recovery on synthetic data and (ii) strong agreement with the components of PRODA (a former modeling framework developed by the same group) across the conterminous US (CONUS), with a mean correlation coefficient of 0.81. The advantage this framework offers over PRODA is its reduced computational cost while integrating CLM5. Finally, the authors claim BINN is a transformative tool that harnesses the power of both AI and process-based modeling, facilitating new scientific discoveries while improving the interpretability and accuracy of Earth System Models (ESMs).
Overall Impression:
The main contribution of this work is the integration of CLM5 into the BINN framework. The paper walks the reader through BINN's design, validation, benchmarking, and performance. Figure 1 lays out the architecture: neural-network parameters bounded by sigmoids flow into a differentiable CLM5 SOC module, trained end-to-end against Smooth-L1 losses with soft priors. Figures 2-4 show synthetic tests: CLM5-generated data are used to recover the most sensitive parameters, with moderate success (r≈0.7) and acceptable SOC skill, supported by a sensitivity analysis but limited by equifinality and assumptions. Figures 5-6 compare BINN outputs to PRODA and observations, demonstrating high spatial correlation and NSE≈0.66, though the validation may be optimistic. Figure 7 uses traceability analysis to illustrate how different biomes balance inputs versus residence time, offering mechanistic interpretation, and Figure 8 highlights computational efficiency, with BINN running over 50× faster than PRODA by virtue of vectorization and gradient-based learning.
The language of the article reads clearly most of the time, except for buzzwords that exaggerate and overstretch the results/claims, e.g., "transformative", "harness the power of AI". The work feels methodologically rigorous and advances an engineering problem with good computational efficiency. The manuscript presents a technically solid and well-executed methodological advance, and the figures clearly illustrate the architecture, parameter recovery, benchmarking, and efficiency of the BINN framework. However, the work in its current form suffers from a gap between claims and evidence: while the method is convincingly demonstrated in synthetic tests and with observational SOC profiles, the validation strategy (random cross-validation, reliance on shared forcing, limited exploration of equifinality, absence of independent benchmarks or uncertainty quantification) does not fully support the breadth of the conclusions drawn. Strengthening the validation would likely require substantial additional work, which may not be feasible in a short revision cycle. Therefore, I recommend one of two paths forward: (i) reframe the manuscript more modestly, toning down broad claims of ecological insight and general applicability and focusing instead on the clear computational and methodological contributions; or (ii) extend the analysis with additional validation and uncertainty assessment to bring the evidence base up to the level of the claims. Either path would improve the agreement between the strong methodological innovation and the scientific narrative presented.
Major Comments:
There are a few points that need serious investigation and discussion:
- Sensitivity to design choices: The sensitivity to certain study design choices that may have affected the entire article is not investigated. (1) The choice of a per-parameter sigmoid activation function stabilizes training and acts as a regularization mechanism [1], but may cause the output parameters to be biased rather than interpretable, especially in the presence of noisy and/or sparse data [2]. (2) While I understand that the priors are chosen based on the literature, for certain parameters they do not seem to match other literature values. Most importantly: tau4s3 (turnover time of passive SOC) is set to 20-400 years, which seems short; tau4s1 (turnover time of fast SOC) has a minimum of 0.8 hr; and the range for w_folding (the influence of soil water on SOC decomposition) seems too wide, as allowing 0.0001 may nullify water limitation while 5 is a large amplification.
- Please, include references for each choice of the priors’ range.
- Please investigate whether the choice of activation function and priors has biased the results.
- Please show more evidence for why you are convinced the model results indicate interpretability rather than bias.
- Please also explain why other model choices (such as the loss function) were made.
- Computational trade-offs: The main difference between PRODA and this model is that this model considers data from all sites across space at the same time. A reduction in computational time is indeed expected, but the memory cost is expected to increase, along with an increased chance of data leakage in space. The model is claimed to offer a computational advantage over PRODA, but it is not made clear how much computational time is saved or how much the memory load increases. Please consider acknowledging the trade-offs made in the BINN modeling framework to save computational costs compared to PRODA. Some are summarized as follows:
- To claim the potential to extend to other regions, i.e., the spatial generalizability of the framework, you would need to use leave-one-biome-out validation to test spatial generalization (and to check whether the same NSE and correlation coefficients are achieved).
- Uncertainty quantification is a strong point of PRODA, which also enhances model robustness and interpretability.
- Limited parameter testing: Test cases and benchmarking are limited to only a few of the 21 parameters. Only 4 of the 21 CLM5 parameters were actually recovered and validated. The sensitivity analysis (Fig. 3) justifies focusing on these, but it ignores interactions and leaves 17 parameters untested. If the framework is claimed to be generalizable and "interpretable" but only 4 parameters were realistically tested, then the claims exceed the demonstrated evidence.
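To make the bias concern in the first major comment concrete, per-parameter sigmoid bounding is commonly implemented as in the sketch below (illustrative code and values, not taken from the manuscript); the key point is that extreme raw network outputs saturate near the prior bounds, which is where the bias risk under noisy or sparse data arises.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def bounded_param(raw, lo, hi):
    """Map an unconstrained network output to the prior range [lo, hi]."""
    return lo + (hi - lo) * sigmoid(raw)

# A raw output of 0 lands exactly at the centre of the prior range, e.g.
# for a passive-SOC turnover time bounded to 20-400 years:
tau_passive = bounded_param(0.0, 20.0, 400.0)  # 210.0 years
# Large raw outputs saturate against the upper bound:
tau_saturated = bounded_param(50.0, 20.0, 400.0)  # ~400 years
```

If the prior range is set too narrowly, this parameterisation cannot express values outside it at all, so the recovered parameters inherit any misspecification of the bounds.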
Line-by-line Comments:
The manuscript does not include line numbers, so unfortunately I cannot provide line-by-line comments. Please address this in the resubmission.
Figure-by-figure comments: For figure panel labels, either use parentheses (as in Fig. 1) or not (as in Fig. 5), and be consistent about bold formatting.
Figure 1: panels (a) and (b) need titles within the figure for better readability. If the colors carry information, please be specific about it. "Priors" and "Sigmoid activation" could be more explicitly separated.
Figure 2 seems redundant.
Figure 3: the caption needs to explain better what kind of sensitivity test was carried out. The parameter labels need to be more intuitive. You could also consider coloring the parameters by your five broad categories (environmental modifier, CUE, substrate decomposability, ...). Include appropriate legends as needed.
Figure 4: please use a clear and sharp caption, as these are your main contributions, e.g., "Fit of BINN to SOC data across CONUS with depth".
Figure 5: please make the color scales consistent across all panels and ensure the legends are readable. Please explain why in panels (c) and (f) BINN consistently overestimates PRODA, and why the correlation collapses to 1 in r.
Figure 6: the bias and error distributions are hard to interpret geographically, and the captions do not explain the ecological meaning of the bias hotspots. This figure needs to be restructured in line with how you handle the major revisions.
Figure 7 decomposes SOC into carbon input versus residence time across biomes without any independent or synthesis data to validate the biome-level trade-offs. A needed step before presenting this figure is testing BINN on analytical cases; reference [3] can be used to create a test case for the validity of residence times derived from BINN before its application to CONUS.
Figure 8: please restructure it, taking into account the points raised in the major comments.
References:
[1] Raissi, M., Perdikaris, P., and Karniadakis, G. E.: Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations, Journal of Computational Physics, 378, 686-707, 2019.
[2] Wang, S., Teng, Y., and Perdikaris, P.: Understanding and mitigating gradient flow pathologies in physics-informed neural networks, SIAM Journal on Scientific Computing, 43(5), A3055-A3081, 2021.
[3] Sierra, C. A., et al.: Carbon sequestration in the subsoil and the time required to stabilize carbon for climate change mitigation, Global Change Biology, 30(1), e17153, 2024.
Citation: https://doi.org/10.5194/egusphere-2025-3282-RC2