This work is distributed under the Creative Commons Attribution 4.0 License.
mLDNDCv1.0: A Machine Learning-based Surrogate of LandscapeDNDC for Optimising Cropping Systems in Denmark
Abstract. Optimising Danish arable management is critical for reducing greenhouse-gas (GHG) emissions and nitrogen (N) losses while maintaining or even improving crop productivity and soil health. Process-based models such as LandscapeDNDC can simulate the effects of management on agroecosystem functioning, but their computational demand limits large-scale optimisation. Here we present mLDNDCv1.0, a tree-based machine-learning surrogate of LandscapeDNDC that allows rapid exploration of large decision spaces without sacrificing mechanistic fidelity. We generated a synthetic training set of >45 million LandscapeDNDC simulations from a full factorial of soils, climate (2011–2020), and management options for winter wheat, and benchmarked gradient-boosted tree algorithms (LightGBM, XGBoost, CatBoost) on predictive performance. XGBoost delivered the most accurate and stable predictions for the core indicators in this study: soil N2O emissions (R2 = 0.81), NO3− leaching (R2 = 0.84), yield (R2 = 0.93), and soil-organic-carbon stock changes (R2 = 0.86). The model maintained high accuracy when confronted with real management and environmental settings that reflected true operating conditions. Coupling mLDNDC with the multi-objective evolutionary algorithm NSGA-II allowed us to optimise millions of management combinations across all winter wheat fields in Denmark. Pareto-optimal solutions reduced N2O emissions by 27.5 ± 4.5 % and NO3− leaching by 27 ± 3.0 %. These solutions also increased grain yield by 8.5 ± 1.5 % and soil-organic-carbon stocks by 1.2 ± 0.1 %, improved nitrogen-use efficiency (NUE) by 10 ± 2 %, and turned the system into a net GHG sink (2200 ± 400 Mg CO2-eq ha−1 yr−1). These gains were achieved without increasing total fertiliser input: they arose from re-allocating mineral and organic fertiliser N input, adjusting incorporation depth, and optimising residue, catch-crop, and irrigation practices.
mLDNDC thus provides a scalable, transparent framework for country-wide optimisation and real-time decision support in climate-smart agriculture.
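The workflow rests on two components: a fast surrogate mapping management options to outcome indicators, and a multi-objective search over those outcomes. As an illustrative aside only (not the authors' implementation), the selection criterion at the heart of NSGA-II, non-dominated sorting, can be sketched as follows; all variable names and numbers here are hypothetical:

```python
import numpy as np

def pareto_front(objectives: np.ndarray) -> np.ndarray:
    """Return a boolean mask marking the non-dominated rows.

    `objectives` is an (n_solutions, n_objectives) array where every
    objective is to be *minimised* (e.g. N2O emissions, NO3 leaching,
    negative yield). A row is Pareto-optimal if no other row is <= in
    all objectives and strictly < in at least one.
    """
    n = objectives.shape[0]
    is_efficient = np.ones(n, dtype=bool)
    for i in range(n):
        # A row is disqualified if any other row dominates it.
        dominated_by = np.all(objectives <= objectives[i], axis=1) & \
                       np.any(objectives < objectives[i], axis=1)
        if dominated_by.any():
            is_efficient[i] = False
    return is_efficient

# Hypothetical (emissions, leaching, -yield) triples for four options:
candidates = np.array([
    [1.0, 2.0, -8.0],   # dominated by the next row
    [0.9, 1.8, -8.5],   # Pareto-optimal
    [1.5, 1.0, -7.0],   # Pareto-optimal (lowest leaching)
    [2.0, 2.5, -6.0],   # dominated
])
mask = pareto_front(candidates)  # → [False, True, True, False]
```

NSGA-II additionally maintains population diversity via crowding-distance sorting; the sketch above only shows the dominance test that defines the Pareto set reported in the abstract.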
Status: final response (author comments only)
CEC1: 'Comment on egusphere-2026-294 - No compliance with the policy of the journal', Juan Antonio Añel, 25 Mar 2026
AC1: 'Reply on CEC1', Jaber Rahimi, 26 Mar 2026
Dear Dr. Añel,
Thank you for your message and for highlighting this issue.
We have now released an updated version of the dataset on Zenodo to support the reproducibility of our study. The repository includes the harmonized field-level dataset for winter wheat in Denmark used for training the machine-learning surrogate model (mLDNDC), together with the associated feature engineering outputs at the field level.
https://doi.org/10.5281/zenodo.18573225
This dataset contains the management information and derived variables necessary to run and reproduce the surrogate modeling framework and optimization presented in our study. While some components of the original SmartField dataset are subject to data protection constraints (e.g., field coordinates), we have ensured that all essential inputs required to train and use the ML model are included in the repository.
We believe that this fulfills the requirements of the Code and Data Policy and allows reviewers and readers to reproduce the key results of the manuscript.
Could you please confirm whether this is sufficient for the manuscript to proceed in the discussion and review process?
Thank you very much for your guidance.
Best regards,
Dr. Jaber Rahimi, on behalf of the authors
Citation: https://doi.org/10.5194/egusphere-2026-294-AC1
CEC2: 'Reply on AC1', Juan Antonio Añel, 28 Mar 2026
Dear authors,
Many thanks for the quick reply. I can confirm that now the current version of your manuscript is in compliance with the policy of the journal.
Juan A. Añel
Geosci. Model Dev. Executive Editor
Citation: https://doi.org/10.5194/egusphere-2026-294-CEC2
RC1: 'Comment on egusphere-2026-294', Anonymous Referee #1, 12 Apr 2026
Overall, I think the manuscript presents an interesting and well-executed workflow. That said, several aspects need to be clarified or toned down before the results can be fully supported.
First, I think the manuscript consistently overstates the level of validation and generalization achieved. The surrogate is trained and evaluated primarily on synthetic data generated by the same process-based model. The "actual space" evaluation still relies on model-generated outputs, and independent validation is effectively limited to crop yield. In that sense, what is demonstrated here is a good emulation of LandscapeDNDC within the sampled domain, rather than generalization to real-world conditions. This is particularly relevant for N2O, NO3 leaching, and SOC, where no independent validation is provided. I would suggest toning down statements around "real-world conditions", "transferability", and decision-support applicability, and making it clearer that the results are conditional on the model and the constructed dataset.
Second, I found the description of the validation workflow somewhat unclear. The manuscript mentions an 80/20 train-test split, but also refers to five-fold cross-validation. It is not clear whether cross-validation was restricted to the training subset or applied to the full dataset, which would compromise the independence of the test set. This should be clarified. More importantly, even if implemented correctly, the current validation design is not very demanding. A random split within a synthetic factorial dataset mainly tests interpolation within a highly structured space. It does not provide a strong assessment of generalization across management regimes or environmental conditions. I would therefore be more cautious in interpreting the reported performance as evidence of transferability.
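For reference, the uncontentious version of the workflow the reviewer asks about keeps the test set untouched: the 80/20 split is made first, and the five-fold cross-validation runs only inside the 80 % training portion. A minimal scikit-learn sketch on synthetic data (not the authors' code; a generic gradient-boosted regressor stands in for XGBoost, and all names are hypothetical):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor  # stand-in for XGBoost
from sklearn.model_selection import KFold, cross_val_score, train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 6))                       # hypothetical soil/management features
y = 2.0 * X[:, 0] + X[:, 1] ** 2 + rng.normal(scale=0.1, size=500)

# Step 1: hold out 20 % as a truly untouched test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Step 2: five-fold CV *inside the training portion only*, for model selection.
model = GradientBoostingRegressor(random_state=0)
cv_scores = cross_val_score(model, X_train, y_train,
                            cv=KFold(n_splits=5, shuffle=True, random_state=0),
                            scoring="r2")

# Step 3: refit on the full training portion; evaluate once on the held-out test set.
final_r2 = model.fit(X_train, y_train).score(X_test, y_test)
```

If the five folds were instead drawn from the full dataset, the test rows would leak into model selection, which is exactly the independence concern raised above.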
Third, the whole framework stands on the synthetic dataset and the way the decision space is defined, but this part remains somewhat under-specified. In particular, it is not entirely clear how unrealistic or inconsistent combinations were identified and removed, or how sensitive the results are to the chosen parameter ranges and constraints. Given how strongly both the surrogate and the optimization depend on these choices, I think this deserves a more explicit discussion.
Specific points follow:
l24, this is too strong given what is actually done. A tree-based surrogate trained on model outputs does not preserve mechanistic fidelity in any strict sense. At best, it approximates the response surface of the parent model within the sampled domain.
l30, this is somewhat misleading. The so-called "real" conditions still rely on model-generated outputs, not independent observations (except partially for yield). This is not a true real-world validation, and the sentence overstates the level of external validation, I would say.
l32, this sounds more general than it is. It would be more precise to state that the optimization is conditional on the model and the defined decision space.
l36, 2200 ± 400 Mg CO2-eq ha-1 yr-1 is highly questionable. The magnitude appears unrealistically large for cropland systems.
l39, the study demonstrates a model-based optimization workflow, not a validated decision-support system. "Real-time decision support" in particular is not demonstrated.
l133, it is not clear whether the design leads to unrealistic combinations, especially when multiple categorical and continuous variables are combined. This needs more justification.
l148, the optimization results will be entirely conditioned on these boundaries, yet it is not clear how they were defined, how strict they are, or how sensitive results are to these choices.
l214, This is a key step but not described in sufficient detail. How many combinations were removed? How sensitive are results to these rules? This directly affects the training distribution.
l245, despite this, the manuscript assumes that prior validation is sufficient for all variables and conditions considered here, which may need to be stated explicitly.
l305, I would say that a simple random split may not be sufficient to test generalization across management regimes.
l351, but validation against yield alone is not sufficient to claim general transferability, especially for the other target variables.
l404-406, I think it effectively removes all trade-offs and selects only "win–win" solutions. As a result, the reported improvements are no longer representative of the Pareto space but of a heavily filtered subset. The authors should clarify this point.
l483, this applies only to yield, but other variables are not independently validated. The sentence generalizes beyond what is actually shown.
l485, not fully convinced; agreement with yield alone does not demonstrate that the model captures underlying processes, especially for nitrogen and carbon dynamics.
l548, that's true, yet the interpretation in the section occasionally goes beyond this (consistency check that the surrogate has not learned spurious relationships). Given that the surrogate is trained on model-generated outputs, the SHAP analysis primarily reflects the behavior of the parent model instead of independent evidence of processes. This distinction could be made more explicit in the section.
Citation: https://doi.org/10.5194/egusphere-2026-294-RC1
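A more demanding protocol than random splitting, in the spirit of the point at l305, is grouped cross-validation: entire climate years (or soil classes) are held out, so the surrogate is scored on conditions it never saw during training. A minimal scikit-learn sketch on synthetic data (group labels and the year effect are hypothetical, for illustration only):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GroupKFold, cross_val_score

rng = np.random.default_rng(1)
n = 600
X = rng.normal(size=(n, 5))                   # hypothetical features (no year column)
year = rng.integers(2011, 2021, size=n)       # ten simulated climate years
# Outcome with a year-specific effect the model cannot see in X:
y = 1.5 * X[:, 0] + 0.2 * (year - 2015) + rng.normal(scale=0.1, size=n)

# Each fold leaves out whole years, so the score reflects extrapolation
# to unseen weather rather than interpolation within a structured factorial.
scores = cross_val_score(GradientBoostingRegressor(random_state=0),
                         X, y, groups=year,
                         cv=GroupKFold(n_splits=5), scoring="r2")
```

Comparing such grouped scores against the random-split scores would quantify how much of the reported performance is interpolation within the sampled domain.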
RC2: 'Comment on egusphere-2026-294', Anonymous Referee #2, 19 Apr 2026
This study makes a good contribution by demonstrating a hybrid approach that integrates a process-based model with machine learning for optimizing agroecosystem management practices. The previous reviewer has already provided an excellent and thorough review. Please see below some points that would help the authors improve the manuscript:
First, the mechanistic-fidelity claim. The paper repeatedly claims to preserve mechanistic fidelity, but the surrogate only preserves the input–output mapping of the process-based model, not its mechanistic processes, and it cannot report intermediate state variables. This distinction should be stated clearly.
Second, path dependency and legacy effects in agricultural systems. This approach ignores the trajectory-dependent nature of agroecosystem state variables. Soil organic carbon accrual, microbial community composition, and other soil-health variables are slow variables whose current states constrain future management responses and have feedback effects. The discussion should acknowledge that a two-year rotation categorical variable is an insufficient proxy for these cumulative legacy dynamics.
Third, temporal specificity and the real-time decision-support claim. The surrogate predicts annual totals from static or seasonally aggregated features, so claiming suitability for real-time decision support overstates the tool's capability. Under operational farming conditions, real-time decision making is much more difficult because of within-season interaction effects among uncertain weather, crop phenology, and soil moisture dynamics, among others. It would be better to present the tool as a scenario-comparison and strategic-planning aid.
Overall, the manuscript reads well. As noted by the first reviewer, the authors need to address the circular validation architecture used in the study as a limitation. It would also be better to be more concise in the SHAP discussion sections, as the post-hoc interpretability of a statistical surrogate does not substitute for mechanistic diagnosis.
Citation: https://doi.org/10.5194/egusphere-2026-294-RC2
Viewed
| HTML | PDF | XML | Total | Supplement | BibTeX | EndNote |
|---|---|---|---|---|---|---|
| 238 | 132 | 19 | 389 | 44 | 18 | 28 |
Dear authors,
Unfortunately, after checking your manuscript, it has come to our attention that it does not comply with our "Code and Data Policy".
https://www.geoscientific-model-development.net/policies/code_and_data_policy.html
Checking the Code and Data Availability section, and the repository that you provide for the data, we have not found the data from the harmonized field-level data from the SmartField project that you have used. If we have missed it, please let us know replying to this comment, and omit the remainder of this comment.
This issue should have been noticed before, and because of it, your manuscript should not have been accepted for Discussions or peer review in the journal. Therefore, the current situation is irregular.
The GMD review and publication process depends on reviewers and community commentators being able to access, during the discussion phase, the code and data on which a manuscript depends, and on ensuring the provenance and replicability of the published papers for years after their publication. Please, therefore, publish your data in one of the appropriate repositories and reply to this comment with the relevant information (link and a permanent identifier for it (e.g. DOI)) as soon as possible. We cannot have manuscripts under discussion that do not comply with our policy.
The 'Code and Data Availability’ section must also be modified to cite the new repository locations, and corresponding references added to the bibliography.
I must note that if you do not fix this problem, we cannot continue with the peer-review process or accept your manuscript for publication in GMD.
Juan A. Añel
Geosci. Model Dev. Executive Editor