This work is distributed under the Creative Commons Attribution 4.0 License.
Modelling N₂O Emissions in Fertilized Grasslands with Machine Learning: A Multi-Site LSTM Approach
Abstract. Accurate quantification of greenhouse gas (GHG) emissions from agricultural soils based on key environmental and management factors is crucial for developing effective mitigation strategies. This study applies a machine learning approach, using data from multiple montane grasslands in central Europe, to model the spatial and temporal dynamics of N2O emissions from meteorological, soil, and management data. The primary aim is to advance predictive modelling of N2O emissions from grassland soils for spatial upscaling and scenario analysis. Specifically, we assess whether a generic model can accurately estimate N2O emissions from an independent site excluded from training.
We collected data from five fertilized grasslands across southern Germany, northern Switzerland, and northern Austria (122 soil-site-treatment-year combinations) and trained a long short-term memory (LSTM) network to model the influence of drivers within a five-day window on N2O emissions. The dataset includes daily N2O emission measurements along with key emission drivers such as soil moisture and temperature at 10 cm depth, daily precipitation, and the occurrence of fertilization events (binary flag), as well as soil characteristics such as pH and bulk density.
The trained LSTM model showed strong predictive performance (RMSE of 18 µg m⁻² h⁻¹ and relative RMSE of 270 %) when evaluated on a test set that included both data from an independent soil and withheld years from training procedures. The model accurately captured N2O dynamics, including the magnitude and timing of emission peaks driven by slurry application and environmental factors. Compared with established process-based biogeochemical models, the LSTM model yielded similar RMSE and bias values for most site-years. These results demonstrate that LSTM-based models can reliably predict N2O emissions at independent sites with similar environmental and soil characteristics and represent a promising alternative to process-based models for predicting soil N2O emissions.
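As context for the metrics quoted in the abstract, a minimal sketch of how RMSE and relative RMSE are commonly defined (the manuscript's exact definitions may differ); on this definition, a relative RMSE of 270 % means the typical error is roughly 2.7 times the mean observed flux, a point several referees return to below.

```python
import numpy as np

def rmse(obs, pred):
    """Root mean square error."""
    obs, pred = np.asarray(obs, dtype=float), np.asarray(pred, dtype=float)
    return np.sqrt(np.mean((pred - obs) ** 2))

def relative_rmse(obs, pred):
    """RMSE normalized by the mean observed flux, reported in percent."""
    return 100.0 * rmse(obs, pred) / np.mean(obs)
```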
Status: final response (author comments only)
- RC1: 'Comment on egusphere-2025-5155', Anonymous Referee #1, 18 Dec 2025
- RC2: 'Comment on egusphere-2025-5155', Anonymous Referee #2, 05 Jan 2026
This study evaluates the application of LSTM networks to model spatial and temporal N2O emission dynamics across montane grasslands in Central Europe. By training on multi-site data including meteorological and soil drivers, the authors assess the model's ability to generalize to independent sites, concluding that the LSTM approach offers predictive performance comparable to established process-based biogeochemical models. My overall feeling about this manuscript is "negative", but I still feel it could make some contributions to the body of literature. As such, I recommend a major revision.
Major comments:
[1] Inadequate Contextualization and Benchmarking. The authors have not acknowledged the state of the art in GHG modelling, such as various advanced pure DL approaches, or those combining process-based models with DL, typically known as Physics-Informed Neural Networks (PINN) and Knowledge-Guided Machine Learning (KGML). By omitting these latest advancements, the introduction falsely frames the standard LSTM as a novel frontier. Furthermore, the benchmarks in Table 5 appear to be low-performance baselines. To avoid the appearance of cherry-picking, the authors need to validate their model against advanced contemporary literature rather than outdated or weak standards.

[2] Over-Ambitious Scope and Insufficient Performance. The goal of developing a "generic model" using only 122 site-year combinations is over-ambitious and unsupported by the results. The reported relative RMSE of 270 % suggests the model error is nearly three times the average observed emission, contradicting the claim of "strong predictive performance". This poor performance likely stems from insufficient training strategies or architectural limitations, along with data noise. Realistically, the authors should reframe this as a regional case study and objectively analyze why the model failed to generalize better, rather than over-claiming its predictive utility. In that case, I would still treat this as a valuable study that documents "what works well or not, by how much" amid the surge of AI applications.
[3] Lack of Technical Reproducibility. The methodology consumes excessive space describing rudimentary ML concepts ("ML 101") while omitting essential technical specifications. Critical details regarding feature engineering and model architecture—specifically how the attention mechanism was implemented within the LSTM—are missing. The authors need to replace generic descriptions with the precise technical details required for reproducibility.
Specific comments:
L15, "withheld years from training procedures": why? likely a data leakage issue.
L35: need to update the 2013 reference because there are already several recent ML/DL advances for N2O simulation. I know Klaus Butterbach-Bahl is aware of those, but I am not sure whether he has a recent review paper reflecting the state of the art.
L43: the introduction should review the existing ML-based literature and disclose its strengths and weaknesses, rather than only discussing the opportunities
L65: please add information about when the data was collected in Table 1
L100: the data split for Chamau is problematic. 2015 falls between training years, which makes the "test set" not a true out-of-sample test
L117: sorry, you cannot claim a generic model here. Six sites occupy only a very small fraction of the high-dimensional feature space that jointly determines N2O production.
L123-125: It is hard for readers to follow this narrative. Consider visualizing the split with a decision tree, or with a separate table apart from Table 2, which mingles soil information with the data split.
L128: personally I feel it is meaningless to compare the LSTM for gap-filling, which may be workable but is not the best choice; instead, you could use a "Bi-LSTM", which captures the contextual information of a whole time series
L163: Section 2.4.1 needs to be significantly expanded to allow a full evaluation, for example the dimensions of the LSTM and dense layers, what the output of the first dense layers was, and how attention was implemented in the pipeline.

L185: in Eq. (4), with the "loss" term also appearing in the exponent, this weighted loss can be quite unstable
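To illustrate the instability concern, a hypothetical loss of the general shape the referee describes, with the error inside an exponential weight; Eq. (4) in the manuscript may differ in detail.

```python
import numpy as np

def exp_weighted_loss(y_true, y_pred):
    # Hypothetical form only: squared error multiplied by an exponential
    # weight that itself depends on the error magnitude.
    err = y_pred - y_true
    return np.mean(np.exp(np.abs(err)) * err ** 2)

# A single large residual dominates the batch loss and its gradient:
print(exp_weighted_loss(np.zeros(1000), np.full(1000, 0.1)))  # ~0.011
print(exp_weighted_loss(np.zeros(1), np.array([8.0])))        # ~1.9e5
```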
L200, Table 3: what is the point of testing "0.01" and "0.05" for dropout?
L218: If this is true, you don't need to report architecture B, which is not a normal structure that one would like to test
L233: in this section, RMSE alone is insufficient to describe the model performance; at least add R²
L255: Figure 3 suggests the model has low performance, but given the limited information provided, it is hard to say whether this is because of weak training strategies, missing key features, or the model simply not working for this particular dataset
L274: the time series plots for Chamau and Fendt are impressive, but the authors need to convince readers that they were not cherry-picked and that the Chamau result is not due to data leakage
L295: this study used a relatively large dataset, which also means more noise. As a result, the current LSTM structure is insufficient to fit a "generic model" that works across multiple sites, soils, and managements. This problem is much more challenging than the authors describe or expected, so I will not blame the low model performance, but claiming that this study has solved the problem, or even reached an acceptable milestone, is misleading.
L346: section 4.4 is redundant in this manuscript
Citation: https://doi.org/10.5194/egusphere-2025-5155-RC2
- RC3: 'Comment on egusphere-2025-5155', Anonymous Referee #3, 19 Jan 2026
# General comments
This is a very interesting study that tackles the difficult task of simulating N2O emissions using LSTM models across European grasslands. I think that the dataset compilation is a valuable contribution to the community. To help this manuscript reach its full potential, I have identified a few key areas where the approach should be strengthened regarding data aggregation, validation strategy and data transformation. In addition, the Discussion would benefit from a stronger connection between the findings of this study and the existing literature. I believe this extra effort is necessary to prove the robustness of the presented approach and will help this paper to make a significant contribution to this field. For this reason, I recommend major revisions.
# Major comments
## Aggregation to daily values
L70-97: The manuscript states that sub-daily flux measurements (from both automatic chambers and eddy covariance) were "aggregated into daily values". However, the specific method of aggregation and the quality control criteria for this aggregation are not defined here. In addition, I am missing info about which data were aggregated (gap-filled or non-gap-filled). Especially for the eddy covariance (EC) sites Chamau and Neustift this would be crucial information for reproducibility, since half-hourly EC data are frequently rejected during quality control which often leaves days with very few valid data points. Then, calculating a daily value from only a few sparse measurements (which may be biased towards specific times of day) can introduce significant errors. It is therefore currently unclear if the daily training targets represent true daily averages, or if they are biased artifacts of incomplete data.
I therefore suggest that the authors please specify the aggregation method and give more detailed information about the time series that were aggregated (gap-filled or non-gap-filled). In case non-gap-filled time series were used, the minimum data availability threshold required to accept a day as valid should be given. If days with low coverage were included, please discuss the potential bias this introduces to the model. If gap-filled time series were used for the aggregation, the gap-filling models used should be given.
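One possible implementation of such a minimum-coverage rule, assuming half-hourly, non-gap-filled fluxes with a datetime index (the authors' actual procedure is not specified in the manuscript):

```python
import pandas as pd

def aggregate_daily(flux_30min: pd.Series, min_coverage: float = 0.5) -> pd.Series:
    """Average half-hourly fluxes to daily values, keeping only days where at
    least `min_coverage` of the 48 possible half-hours passed quality control."""
    daily_mean = flux_30min.resample("D").mean()
    valid_count = flux_30min.resample("D").count()
    return daily_mean.where(valid_count >= min_coverage * 48)
```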
## Input variables
L103-114 and Table 4: The model looking back at drivers of the past 5 days is likely too short to capture the causal drivers of emission peaks, because the soil is potentially primed by an event long before the emissions actually happen (legacy effect). For example, if fertilization happened 6 days ago, the model cannot see it and assumes e.g. "rain and no fertilizer" in its time window. Therefore, a longer sequence would be more appropriate (e.g. 30 days or longer). One architecture was tested using 10 days, but this might also be too short. Maybe the authors have already tested longer windows (10+) and can extend Table 4 with this additional information. It is still possible that the shortest window works best, but it is important to see this in comparison to other tests.
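To make the requested comparison concrete, a minimal sketch of how input sequences could be rebuilt for longer look-back windows; array names are illustrative, not the manuscript's.

```python
import numpy as np

def make_sequences(X_daily: np.ndarray, y_daily: np.ndarray, n_steps: int):
    """Build LSTM inputs of shape (n_samples, n_steps, n_features) so the
    model sees the drivers of the preceding `n_steps` days for each flux."""
    xs, ys = [], []
    for t in range(n_steps - 1, len(y_daily)):
        xs.append(X_daily[t - n_steps + 1 : t + 1])  # drivers of the last n_steps days
        ys.append(y_daily[t])                        # flux on day t
    return np.stack(xs), np.array(ys)

# The comparison asked for here would repeat training for e.g.
# n_steps in (5, 10, 30, 60) and report each window in Table 4.
```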
## Dataset splitting
L120-130, Data leakage: The manuscript describes a data splitting strategy where parallel treatments (e.g., intensive vs. extensive management) from the same site and year are separated into training and testing sets. However, this approach introduces significant data leakage because the parallel treatments experience identical daily meteorological drivers. Basically, the model is exposed to the (specific) weather sequences of the test period during training. I would assume that performance metrics on the test set likely reflect the ability of the model to memorize these weather patterns rather than the true ability of the model to generalize. A test for generalization would require the test data to be independent in time (unseen years) or space (unseen sites).
I therefore suggest that the authors restructure their validation strategy. This could be done by withholding entire years across all treatments or by withholding entire sites (as was done with Rottenbuch) from the training set.

L125/L130, Train/test splitting: In Section 2.3, the reported train/test split (13000 training vs. 15000 testing) is highly unconventional for machine learning and risks underfitting. It is unusual for a test set to be larger than the training set; I have not seen this in other studies yet. Unless there is a very specific requirement for these unusual data sizes that I am not aware of (and a proper justification), this needs to be addressed in an updated version of the manuscript.
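A grouped split of the kind suggested here can be expressed with scikit-learn's GroupShuffleSplit, so that all sequences sharing a site-year key (including parallel treatments) fall on the same side of the split; the arrays below are illustrative placeholders, not the manuscript's variables.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# Placeholder arrays: one entry per LSTM input sequence.
site_ids = np.array(["Fendt", "Fendt", "Chamau", "Graswang"])
years    = np.array([2015, 2015, 2016, 2017])
groups   = np.array([f"{s}_{y}" for s, y in zip(site_ids, years)])

# All sequences from the same site-year (including parallel treatments)
# end up together, which removes the shared-weather leakage.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(splitter.split(np.zeros((len(groups), 1)), groups=groups))
```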
## Log-transformation of N2O data and loss function
L140-157, L178-189: The text describes that daily N2O fluxes were log-transformed, meaning that the model outputs predictions on a log scale which are then back-transformed. This transformation seems to be somewhat in conflict with L180, which emphasizes the "importance of catching high peaks". To me it seems counterintuitive to first compress extreme values/peaks (log transformation) and then apply a custom exponential loss function that basically attempts to re-weight the peaks. By compressing the target variable, the transformation incentivizes the model to focus on fitting the "background" noise of low-magnitude emissions rather than learning the rare, high-magnitude events that drive the N2O budget. Therefore, the ability of the model to predict the timing and magnitude of emission peaks is limited. Regarding the loss function, as pointed out, the penalty grows exponentially as the error increases. If used on a dataset with rare high peaks, this makes the model highly unstable during training.
I am sure that there were good reasons why the authors chose this approach, and I therefore suggest that the authors add their reasoning to the text. If the authors want to reconsider their approach and aim to optimize the predictive power for peak events, I recommend training the model on the linear N2O data in combination with a loss function that is robust to outliers (to improve stability).
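If the authors follow this suggestion, one outlier-robust option on the linear flux scale is the Huber loss; a minimal Keras sketch with a stand-in model (the architecture here is not the authors'):

```python
import tensorflow as tf
from tensorflow.keras import layers

# Minimal stand-in model; the point is the loss choice, not the architecture.
model = tf.keras.Sequential([layers.Input(shape=(5, 12)), layers.LSTM(32), layers.Dense(1)])
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
    # Quadratic for small residuals, linear for large ones, so a badly missed
    # emission peak cannot blow up the gradients the way an exponential penalty can.
    loss=tf.keras.losses.Huber(delta=1.0),  # delta would need tuning to the flux scale
)
```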
## LSTM
L163-175: In its current state, Section 2.4.1 lacks the technical specifications that would enable others to reproduce the model. The parts of the text that mention "a set of LSTM layers" and "several dense layers" are missing details, e.g. the specific layer dimensions (units per layer) used in the final model. Similarly, the "attention mechanism" should be explained, at least briefly, even if more details are given in the provided reference. Also, attention can be applied in different places in the pipeline and there are multiple ways to implement it. These details, among others, are currently missing but required for reproducibility. I also suggest adding some of these details to Fig. 2.
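Purely as an illustration of the level of detail being requested, a sketch of one plausible LSTM-plus-attention pipeline in Keras; the layer sizes and the placement of the attention block are assumptions, not the authors' actual configuration.

```python
import tensorflow as tf
from tensorflow.keras import layers

# Illustrative architecture only; all dimensions are assumed.
n_steps, n_features = 5, 12
inputs = layers.Input(shape=(n_steps, n_features))
x = layers.LSTM(64, return_sequences=True)(inputs)
x = layers.Attention()([x, x])          # self-attention over the LSTM outputs
x = layers.GlobalAveragePooling1D()(x)
x = layers.Dense(32, activation="relu")(x)
x = layers.Dropout(0.1)(x)
outputs = layers.Dense(1)(x)            # daily N2O flux (on the transformed scale)
model = tf.keras.Model(inputs, outputs)
```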
## Hyperparameter tuning
L190-200: I appreciate showing the ranges used for hyperparameter tuning, but this section also needs to show the optimized values that the authors decided to use in the final model. In addition, in this section I have another concern regarding validation. From the description it is not clear which data were used for validation, so I assume the test set was used instead of a validation set; Section 2.3 only mentions a test set. A typical train/validation/test split would be e.g. 60/20/20. To my knowledge, optimization algorithms like Hyperband require a validation metric to compare the different configurations. If the test set was used for optimization, this would be a "training on the test set" error (it would mean the model has seen the "answers" to the tests it is taking), a form of data leakage. If the training set was used for optimization, the hyperparameters are likely overfitted to the training data. And finally, the "manual adjustments" are in my view difficult to justify after the optimization procedure, because it means that the model was optimized but the optimal parameters were then not applied. Generally, manual tuning based on observing model output introduces researcher bias and implies that the model was tweaked until it *looked* right on the evaluation data. Regarding the ranges for tuning: it is important that the authors justify the specific ranges chosen and describe e.g. why the search space for the neuron count was so constrained (it is very narrow and specific).
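For concreteness, a minimal keras-tuner sketch of Hyperband driven by an explicit validation split that is kept separate from the final test set; the hypermodel, search range, and arrays are placeholders, not taken from the manuscript.

```python
import numpy as np
import keras_tuner as kt
import tensorflow as tf
from tensorflow.keras import layers

def build_model(hp):
    # Illustrative hypermodel and search range only.
    units = hp.Int("lstm_units", min_value=16, max_value=128, step=16)
    model = tf.keras.Sequential([layers.Input(shape=(5, 12)), layers.LSTM(units), layers.Dense(1)])
    model.compile(optimizer="adam", loss="mse")
    return model

# Placeholder arrays; the validation set must be distinct from the final test set.
x_train, y_train = np.zeros((800, 5, 12)), np.zeros(800)
x_val,   y_val   = np.zeros((200, 5, 12)), np.zeros(200)

tuner = kt.Hyperband(build_model, objective="val_loss", max_epochs=50,
                     directory="tuning", project_name="n2o_lstm")
tuner.search(x_train, y_train, validation_data=(x_val, y_val))
best_hps = tuner.get_best_hyperparameters(num_trials=1)[0]  # values to report in the paper
```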
## Feature importance
L202-209, Fig. 6, Fig. 7: PFI is used to assess the importance of drivers at individual time steps. However, this is methodologically flawed for highly autocorrelated time series such as soil moisture, since the value at time t is nearly identical to t-1. So by shuffling only the value at time t, the model can recover the information from the unshuffled values at t-1 (or t-2, ...) stored in the memory of the LSTM. Consequently, the importance score of the respective feature is reduced, and continuous drivers might be undervalued in comparison to e.g. fertilization. Therefore, I am not sure Fig. 7 can work, because the lags are not independent. For Fig. 6, it is not immediately clear to me why the lagged variables are missing. I assume that maybe all time steps of a single variable were shuffled together, and then an overall importance is given (a sort of "grouped" PFI based on grouped shuffling)?
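One way to implement the "grouped" PFI described above, sketched for a 3D LSTM input; the function and names are illustrative, not the authors' implementation.

```python
import numpy as np

def grouped_permutation_importance(model, X, y, feature_idx, metric, n_repeats=10, rng=None):
    """Permute ALL time steps of one driver jointly (grouped PFI) so the LSTM
    cannot recover the signal from neighbouring, autocorrelated lags.
    X has shape (n_samples, n_steps, n_features)."""
    rng = rng or np.random.default_rng(0)
    baseline = metric(y, model.predict(X))
    scores = []
    for _ in range(n_repeats):
        Xp = X.copy()
        perm = rng.permutation(len(X))
        Xp[:, :, feature_idx] = X[perm][:, :, feature_idx]  # permute whole sequences across samples
        scores.append(metric(y, model.predict(Xp)) - baseline)
    return float(np.mean(scores))
```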
## Model performance
L230 Table 4: The relative RMSE of 270 % is very high, and one of the reasons could be that the log transformation distorts the optimization. The log transformation minimizes the (logarithmic) error, but after the results are back-transformed to the linear scale the bias on low-flux days becomes disproportionately large (relative to the mean). Also, I assume the errors on peak events increase dramatically. The authors should compare this performance against, e.g., predicting the mean of the training set for every day, and then check whether the LSTM can outperform this naive baseline. In light of the high RMSE, it is important that the authors discuss why the error is so high, especially whether it is driven by a few very high outlier values or whether the model is systematically biased. A scatter plot of predicted vs. observed fluxes on the linear scale (not log-log) can help to diagnose this. Currently, I cannot see how the model is "capable to transfer results across sites".
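The naive-baseline check suggested here is simple to add; a minimal sketch (names are placeholders):

```python
import numpy as np

def naive_baseline_check(y_train, y_test, y_pred_lstm):
    """Compare the LSTM against always predicting the training-set mean flux."""
    rmse = lambda a, b: np.sqrt(np.mean((np.asarray(a) - np.asarray(b)) ** 2))
    baseline_pred = np.full_like(np.asarray(y_test, dtype=float), np.mean(y_train))
    baseline_rmse = rmse(baseline_pred, y_test)
    lstm_rmse = rmse(y_pred_lstm, y_test)
    skill = 1.0 - lstm_rmse / baseline_rmse   # > 0 means the LSTM beats the naive baseline
    return baseline_rmse, lstm_rmse, skill
```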
# Specific comments
L10/L61: The site Neustift is described as "North Austria". Please correct this to "Western Austria".
L25: Change "greenhouse gas" to "anthropogenic greenhouse gas"
L30: The sentence "The Pearson correlation..." presents specific results in the introduction ("between -0.11 to 0.16 for our dataset"), which should not be done. I suggest the authors rephrase this sentence and remove these specific calculation outcomes, they have their place in the Results section.
L33: Change "long periods" to "long time periods"
L56: I would not go so far as to mention "general applicability". One independent grassland is not enough for this claim.
L62: "no grazing": Some of these sites have grazing. If the authors are referring to specific time periods used in their study, please add this info.
L65: I suggest to give the FLUXNET identifiers for all sites, which in case of Chamau is "CH-Cha" (instead of "CH-CHA", note the lowercase letters) and in case of Neustift "AT-Neu" (instead of "AT-NEU").
L65 table caption, and other places throughout the manuscript: "Hörtnagel et al., 2014": the correct name is "Hörtnagl"
L65 table header: I think there is a typo in the header and "(kg h-1 y-1)" should be "(kg ha-1 y-1)". I would interpret "h-1" as "per hour".
L65 Table 1: Please give full units for "Annual N2O", i.e., is this "kg N ha-1 yr-1"? The header indicates that it is "kg N2O ha-1 yr-1", because the header mentions specifically N2O, but from the magnitude of the values I would assume N2O-N.
L65/L100 What is "Dataset size"? I am also not sure how this dataset size matches up with the dataset sizes given in Table 2, e.g., Graswang has dataset size 8708 in Table 1, but numbers in Table 2 add up to 1670. Please check and add missing info.
L71: For the three German sites, please add "respectively" after giving the altitudes.
L71: "asl": I suggest to change to "a.s.l.", because this journal generally uses periods for text abbreviations.
L81/82: The given references point to subpanels a and c in Fig. A1, but this figure does not show any subpanel letters.
L85: "is part of the FLUXNET initiative": I suggest to change to "is part of Swiss FluxNet".
L88: "by separating two parcels at north and south of the site" change to "by establishing separate parcels in the northern and southern parts of the site", if I understand this correctly.
L89: "in south" should be "in the south"
L100 Table 2: Great to see (and have available) the different soil properties for the different locations. Would it make sense to include the soil properties as static drivers in the model?
L116: "among German grassland" should be "across German grasslands"
L117: "capable to transfer" should be "capable of transferring"
L125: "13000 data sequences": to me it is not immediately clear what "data sequence" means. I assume one data sequence is what is shown in Eq. 1, i.e., a single input block that is fed into the LSTM (correct?). I suggest to add more info here since the term "data sequence" is mentioned for the first time.
L130: "sequence" should be "sequences"
L161: "should be considered" is prescriptive language and OK in other places. In the Methods section I suggest to focus on reporting what "was" done to clearly describe the specific actions taken in this study, not what "should" be done.
L213: Please add this reference for Matplotlib: https://doi.org/10.1109/MCSE.2007.55
L235: Given the low flux values (in combination with sometimes emission peaks), I do not think that the *relative* RMSE is a good indicator
L235: From the figures I take that flux values are expressed using N2O-N, but in the text there are several places where this is not clear. I suggest to add a sentence in the Methods that fluxes are always expressed using N2O-N. In this regard, the header in Table 1 is slightly misleading because it mentions "Annual N2O" with incomplete units, but the numbers seem to give cumulative N2O-N.
L236: Please explain "total bias"
L260: Is there a reason why the time lagged variables are shown in Fig. 7, but not in Fig. 6? Maybe I have missed this somewhere
L275 Fig. 4: Typo "occurance" --> "occurrence"
L290-296: This could be shortened, most was already mentioned before.
L297-306: This currently reads more like a literature review than a discussion. The limitations of previous studies are listed, but the direct connection to the findings in this study is missing. I suggest the authors revise this section to synthesize these citations with their results. For example, how their study addresses (or contrasts with) methodological gaps they identified in prior work (e.g., lack of independent test sites).
L308-326: Although this is part of the Discussion, there are no references to existing literature and many of the sentences belong to the Results section. Please revise this section and contextualize findings by citing previous work that supports or contradicts results from this study. For example, is the observed moisture dependency consistent with established theory?
L330: "Soil temperature, ..." This sentence seems to be missing words, please fix.
L331: I am critical of the claim that maximum air temperature can substitute for soil temperature with "minimal impact". Air and soil temperatures often decouple due to factors like snow cover (insulation) and vegetation shading. And of course there are different soil depths and different thermal lags. In case a sensitivity analysis was performed to demonstrate that the performance drop is indeed negligible, it could be mentioned solely in the context of this study rather than implying it is a viable strategy for sites globally.
L334-337: Context to literature is missing
L338-345: Here, some interesting points are raised from the literature (pH, BD), but they are not discussed in relation to findings in this study. Instead, the final two sentences jump back to soil moisture but without giving references.
L346-369: I suggest limiting this section to sites that were also used in this study. A comparison with sites from other locations is difficult, especially for N2O, which is characterized by high spatial variability.
Citation: https://doi.org/10.5194/egusphere-2025-5155-RC3
Data sets
Eddy covariance ecosystem fluxes, meteorological data and detailed management information for the intensively managed grassland site Chamau in Switzerland, collected between 2005 and 2024 L. Hörtnagl et al. https://doi.org/10.3929/ethz-b-000747025
The TERENO Pre-Alpine Observatory: Integrating Meteorological, Hydrological, and Biogeochemical Measurements and Modeling R. Kiese et al. https://doi.org/10.2136/vzj2018.03.0060
General comments:
This manuscript takes on the important task of improving N2O emission predictions from grasslands using machine learning, crucially through the use of a large multi-site dataset. This is a significant development from earlier work on this topic. However, several aspects of the configuration of the LSTMs and the preprocessing of the data require clarification or improvement in order to provide a fair and transparent evaluation of the model's performance. Specifically, I have concerns regarding potential data leakage in the train/test splitting strategy and ambiguity regarding the use of past target-variable values as inputs.
These details may seem esoteric, but without them, it is impossible to discern whether the RMSE achieved here is impressive or disappointing. That being said, I feel that these changes can be implemented—or at least clarified—relatively easily, and are necessary to validate the study's conclusions. I also strongly encourage the authors to make their model code and data available via a public repository. This would resolve the interpretability issues raised in this review and ensure the reproducibility of the study.
I believe that with these methodological issues resolved, this study could make a valuable contribution to the field of environmental modeling, with broad potential impact.
Specific comments:
L105-114: Equation 1 suggests that you have manually included four additional time-lagged features of the temporal drivers, one for each of the prior four days. If I have interpreted this correctly, it is unclear to me why you have manually created these time-lagged features given that you are using an LSTM model. The LSTM model already contains memory cells, forget gates, an attention layer, and other architectural features that remember past events and learn which are important. As an LSTM model already has access to previous time steps, forcing the model to use manually created ones may interfere with the model's attempt to learn and forget appropriately. Please clarify whether the 't-1' etc. indicate manually included time-lagged features, and if so, justify their use. It would be helpful to explicitly state the shape of the 3D tensor used in the LSTM (N_Samples, Time_Steps, N_Features).
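For reference, the tensor convention the referee asks to be stated explicitly; the 13000 sequences and 5-day window are taken from the manuscript, while the feature count is assumed.

```python
import numpy as np

# Keras LSTM input convention: (n_samples, time_steps, n_features).
# With drivers already stacked along the time axis, extra hand-made
# "_t-1"/"_t-2" columns duplicate information the recurrence can reach itself.
n_samples, time_steps, n_features = 13000, 5, 12   # 13000 and 5 from the manuscript; 12 is assumed
X = np.zeros((n_samples, time_steps, n_features))  # drivers for days t-4 ... t
y = np.zeros(n_samples)                            # N2O flux on day t
```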
L120-130: After a careful read, I have interpreted the authors' data splitting as follows. There are two subsets, one training and one testing subset. All of the Rottenbuch-in-Fendt data was siloed to the testing dataset, while all Neustift-in-Neustift data was used for training. All data from a single year of the Chamau site was withheld for the testing dataset, while all previous years were included in training. The rest of the data appears to be split at the treatment level, with all three replications and experiment years of a given treatment sent either to training or to testing. This is somewhat different from the description given in the abstract, as the test set included not just a holdout site and withheld years as described, but also parallel years of separate treatments: "The trained LSTM model showed strong predictive performance (RMSE of 18 µg m⁻² h⁻¹, and Relative RMSE of 270 %) when evaluated on a test set that included both data from an independent soil and withheld years from training procedures." Importantly, withheld years and withheld sites make up only 320 and 259 data points of the total 14591 measurements in the testing set (~2 % each).
I firstly want to commend the authors for clearly taking data handling seriously and devising a more rigorous method of splitting than a simple randomized split. I also appreciate that authors have provided a loose description of criteria and thought that went into dividing datasets into testing and training and their desire to represent a variety of soil, environmental, and management properties in both groups. That being said, I do still have concerns about the potential for data leakage given the methods described. Most crucially, I believe it generally best practice when working with sequential data to avoid having data from the same locations and time periods represented in both testing and training pools, even if there was a difference in management technique (e.g. extensive/intensive). As most experiments use one-factor-at-a-time, separate treatments typically experience mostly the same conditions at the same time, including weather and other factors which may affect biogeochemical systems and which may not be captured by model inputs (Zhu et al., 2023). Thus there is often a similarity in flux patterns over time, where factor differences may lead to a difference in peak height etc, but the overall shape of flux over time is similar. Thus when split across testing and training datasets, information about flux patterns and their drivers may leak from a parallel treatment, giving the model an unfair advantage during predictions.
In my view, how the authors have handled data from the Chamau site and Rottenbuch-in-Fendt are both strong examples of data splitting. From the Chamau site, the model had the opportunity to learn from previous years’ data during training but was then evaluated on a separate year, with no opportunity to ‘cheat’ and gain information about events and effects that occurred during that year from other treatments. The Rottenbuch-in-Fendt is the strongest example of a true holdout testing the model’s generalizability, as the authors point out. Arguments about the model’s generalizability thus would be stronger based upon the model’s performance on these two subsets. Aggregate RMSE and other metrics could be reported independently for withheld sites and withheld years, without the overwhelming influence of additional data included in the test set where parallel treatments are represented in training data. Authors may also wish to consider adding additional sites/years to these pools, given the small quantity of data handled in these manners.
L140-142, L179-185: The use of the log-transform and the selected loss function is an interesting combination of model techniques. I think it would be useful to provide a greater discussion on the effect of this combination on model behavior, given a non-statistical audience and a highly non-traditional loss term. The log transformation compresses the data distribution, forcing the model to minimize prediction errors on a relative, rather than absolute, scale. Then, the loss function penalizes prediction errors *exponentially,* placing a massive (approaching infinite) influence on any large errors. Thus the cost of a single large error will easily outweigh a thousand small errors. The result is a highly risk-averse model (avoiding large errors at all cost) that prioritizes consistency (always predicting within the same order of magnitude) over precision and accuracy on the bulk of the data, with poor predictions on outliers. It also likely forces the model to pay close attention to hot moments. It is an interesting and potentially effective strategy depending on the goals, but I feel the biogeochemist audience would benefit from more explicit discussion of these effects.
L164-175: LSTM models can be configured in three ways: (1) using ground-truth target values from previous time-steps as inputs (teacher-forcing or closed-loop), (2) using the model’s own past predictions as inputs (autoregressive or open-loop), or (3) relying solely on environmental drivers without including any version of the past target variable as an input feature in the 3D tensor. I could not find any mention of whether some version of previous flux time-steps were included in the 3D tensor. This is a crucial detail, as flux is often highly autocorrelated, yet in practical scenarios, actual past flux values will not be known. Using teacher-forcing is thus appropriate during training to accelerate convergence, but inappropriate during model evaluation on testing datasets, as this would provide highly useful but practically impossible information. A discussion of these distinctions and how precisely the LSTM model was set up in both training and testing phases is immensely needed in order to assess the model’s true predictive power.
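A small sketch to make the three configurations the referee distinguishes concrete; shapes and names are illustrative only, not the manuscript's actual setup.

```python
import numpy as np

# Illustrative shapes only.
n, t, f = 1000, 5, 12
drivers = np.zeros((n, t, f))        # environmental/management inputs
obs_flux_lags = np.zeros((n, t))     # observed N2O flux on the preceding days

# (1) Teacher forcing / closed loop: observed past fluxes enter as an extra feature.
X_teacher = np.concatenate([drivers, obs_flux_lags[:, :, None]], axis=2)
# (2) Autoregressive / open loop: at inference, the model's own past predictions
#     would replace obs_flux_lags in the tensor above.
# (3) Exogenous-only: drivers alone, with no version of the past target.
X_exogenous = drivers
```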
L222-224: I interpret this as you have evaluated two different hyperparameters for the n_steps/timesteps parameter of the input_shape argument of the LSTM layer, those being 5 and 10. This sets a hard limit on the number of days your model can "look backward" at previous events. I am curious why such short timeframes were considered, especially given that a fertilization input may have an impact on soil nitrogen content and thus denitrification rates for many weeks, particularly if the fertilization happens to occur during a dry period. I note that you explain in the discussion that emissions nearly always peak shortly after fertilization, thus reducing the benefit of longer sequence inputs. However, I also notice from Figure 4 that moderate peaks often occur long after fertilization, much longer than the 10-day maximum evaluated. I am curious whether longer input sequences may improve predictions on such moderate peaks, for example if soil nitrogen is still moderately elevated a month after fertilization and a significant precipitation event occurs.
This also compounds my confusion regarding the aforementioned time-lagged features. The input_shape argument of an LSTM layer already defines the ceiling of how many previous time-steps the model is capable of considering, and attention and forget gates are used to determine which data among this pool to actually focus on or forget.
L235-236: I see that RMSE for the holdout set are given individually for timeseries in Fig 4, but I would appreciate an aggregate performance evaluation for the Rottenbuch-in-Fendt holdout listed here for comparison’s sake.
L269-271: Clarification is again needed here as to whether time-lagged features were directly represented as model features. If the importance of previous time steps of features was determined by using standard PFI, which then varied the values of e.g. the feature "soil_temperature_t-1" and ultimately assigned it an importance value, then this method would be invalid. In the case of using explicit time-lagged features, previous values of features are represented in the model in two places: the engineered feature, as well as the internal hidden state of the LSTM. Even if the time-lagged feature were altered by PFI, the model would still have access to the true value in its hidden state, and thus the model could simply ignore the shuffling by PFI.
If this is a misinterpretation and time-lagged features are not explicitly represented, then I would like to see more detail on how PFI was adapted to assess the importance of previous time-steps given the LSTM architecture.
L349-353: Given that no process-based model was tested against the LSTM model, I question the need for this whole section comparing with process-based models, which is not within the scope of this study. I am skeptical that any meaningful comparison of predictive performance metrics can be made across different datasets, especially given the extreme temporal and spatial variability of N2O, which translates into major variability in data distributions. Some datasets are simply easier to predict due to a lesser influence of hot moments, and thus comparisons of models evaluated on different datasets should not be made without at least acknowledging their tenuousness.
L361-365: Again, considering that data exists for process-based model predictions at the same site (and year?) as one of the modeled sites in this study (Chamau), I would prefer that authors limit their comparisons of LSTM vs process-based simulations where these comparisons can be made fairly.
Technical corrections:
L24. Spell out full word ‘approximately’
L30. ‘2’ subscript
L237: “Having relative RMSE higher than 1 is a known effect of the imbalanced dataset.” Citation needed.
Other minor comments:
Table 1 and 2: Dataset size - clarify if this is the number of daily flux observations
L81: Better to say “flux” and not “flux rate”
Fig 1: It would be helpful to include the flux distribution for the test dataset too
Table 5 and associated discussion: The table does not need to be in the main manuscript and can be presented as supplementary data. L297-303 reads as if the authors are criticizing the previous studies and can be shortened. These are some of the initial studies on this topic and should be acknowledged accordingly, along with the limitations that the current study addresses.
Fig 6: While feature importance was assessed separately for each site in Fig. 6, the discussion of any variation in variable importance across sites is weak and contributes little to the understanding of differential N2O dynamics. Also, please explain why time-lagged features were not included in the Fig. 6 analysis.