Modelling N₂O Emissions in Fertilized Grasslands with Machine Learning: A Multi-Site LSTM Approach
Abstract. Accurate quantification of GHG emissions from agricultural soils based on key environmental and management factors is crucial for developing effective mitigation strategies. This study applies a machine learning approach using data from multiple montane grasslands in central Europe to model the spatial and temporal dynamics of N₂O emissions, using meteorological, soil, and management data. The primary aim is to advance predictive modelling of N₂O emissions from grassland soils for spatial upscaling and scenario analysis. Specifically, we assess whether a generic model can accurately estimate N₂O emissions from an independent site excluded from training.
We collected data from five fertilized grasslands across southern Germany, northern Switzerland, and northern Austria (122 soil-site-treatment-year combinations), and trained a long short-term memory (LSTM) algorithm to model the influence of drivers within 5-day input sequences on N₂O emissions. The dataset includes daily N₂O emission measurements along with key emission drivers such as soil moisture and temperature at 10 cm soil depth, daily precipitation, and the occurrence of fertilization events (binary flag), as well as soil characteristics such as pH and bulk density.
The trained LSTM model showed strong predictive performance (RMSE of 18 μg m⁻² h⁻¹ and relative RMSE of 270 %) when evaluated on a test set that included both data from an independent soil and years withheld from the training procedure. The model accurately captured N₂O dynamics, including the magnitude and timing of emission peaks driven by slurry application and environmental factors. Compared to established process-based biogeochemical models, the LSTM model yielded similar RMSE and bias values for most site-years. These results demonstrate that LSTM-based models can reliably predict N₂O emissions at independent sites with similar environmental and soil characteristics and represent a promising alternative to process-based models for predicting soil N₂O emissions.
General comments:
This manuscript takes on the important task of improving N₂O emission predictions from grasslands using machine learning, crucially through the use of a large multi-site dataset. This is a significant development over earlier work on this topic. However, several aspects of the LSTM configuration and the data preprocessing require clarification or improvement in order to provide a fair and transparent evaluation of the model's performance. Specifically, I have concerns regarding potential data leakage in the train/test splitting strategy, and ambiguity regarding the use of past target-variable values as inputs.
These details may seem esoteric, but without them it is impossible to discern whether the RMSE achieved here is impressive or disappointing. That being said, I feel that these changes can be implemented, or at least clarified, relatively easily, and they are necessary to validate the study's conclusions. I also strongly encourage the authors to make their model code and data available via a public repository. This would resolve the interpretability issues raised in this review and ensure the reproducibility of the study.
I believe that with these methodological issues resolved, this study could make a valuable contribution to the field of environmental modeling, with broad potential impact.
Specific comments:
L105-114: Equation 1 suggests that you have manually included four additional time-lagged copies of each temporal driver, one for each of the prior four days. If I have interpreted this correctly, it is unclear to me why you have manually created these time-lagged features given that you are using an LSTM model. The LSTM architecture already contains memory cells and gating mechanisms (and, in some configurations, attention layers) that remember past events and learn which are important. As an LSTM already has access to previous time-steps, forcing the model to use manually created ones may interfere with the model's ability to learn and forget appropriately. Please clarify whether the 't-1' etc. indicate manually included time-lagged features, and if so, justify their use. It would be helpful to explicitly state the shape of the 3D tensor used in the LSTM (N_Samples, Time_Steps, N_Features); a sketch of what I mean follows below.
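To make this concrete, below is a minimal sketch (Python/NumPy; all variable names and driver columns are hypothetical placeholders, not the authors' data) of how I would expect the 3D input tensor to be constructed. Note that the values for days t-1 through t-4 already appear along the time axis of each window, which is why appending explicit lagged columns along the feature axis would duplicate information the LSTM can already see.

```python
import numpy as np

def make_windows(drivers: np.ndarray, targets: np.ndarray, time_steps: int = 5):
    """Stack sliding windows of daily drivers into the (samples, time_steps,
    features) tensor an LSTM consumes; each window predicts the flux on its
    final day."""
    X, y = [], []
    for t in range(time_steps - 1, len(drivers)):
        X.append(drivers[t - time_steps + 1 : t + 1, :])  # days t-4 ... t
        y.append(targets[t])
    return np.asarray(X), np.asarray(y)

# Hypothetical one site-year of daily data; columns might be
# [soil_moisture, soil_temp, precip, fert_flag, pH, bulk_density].
drivers = np.random.rand(365, 6)
targets = np.random.rand(365)  # daily N2O flux
X, y = make_windows(drivers, targets, time_steps=5)
print(X.shape)  # (361, 5, 6): t-1 ... t-4 values already sit along axis 1
```

If the manuscript instead flattens these lags into extra feature columns, the gating machinery is handed redundant copies of inputs it already holds along the time axis.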
L120-130: After a careful read, I have interpreted the authors' data splitting as follows. There are two subsets, one for training and one for testing. All of the Rottenbuch-in-Fendt data was siloed in the testing dataset, while all Neustift-in-Neustift data was used for training. All data from a single year of the Chamau site was withheld for the testing dataset, while all previous years were included in training. The rest of the data appears to be split at the treatment level, with all three replications and experiment years of a given treatment sent either to training or to testing. This is somewhat different from the description given in the abstract, as the test set included not just a holdout site and withheld years as described, but also parallel years of separate treatments: “The trained LSTM model showed strong predictive performance (RMSE of 18 μg m⁻² h⁻¹, and Relative RMSE of 270%) when evaluated on a test set that included both data from an independent soil and withheld years from training procedures.” Importantly, withheld years and withheld sites make up only 320 and 259 data points of the 14591 total measurements in the testing set (~2% each).
I first want to commend the authors for clearly taking data handling seriously and devising a more rigorous splitting method than a simple randomized split. I also appreciate that the authors have provided a loose description of the criteria and reasoning behind dividing the datasets into testing and training, and their desire to represent a variety of soil, environmental, and management properties in both groups. That being said, I still have concerns about the potential for data leakage given the methods described. Most crucially, I believe it is generally best practice when working with sequential data to avoid having data from the same locations and time periods represented in both testing and training pools, even if there was a difference in management technique (e.g. extensive/intensive). As most experiments vary one factor at a time, separate treatments typically experience largely the same conditions at the same time, including weather and other factors which may affect biogeochemical systems and which may not be captured by model inputs (Zhu et al., 2023). There is therefore often a similarity in flux patterns over time: factor differences may lead to a difference in peak height, but the overall shape of the flux time series is similar. When such treatments are split across testing and training datasets, information about flux patterns and their drivers may leak from a parallel treatment, giving the model an unfair advantage during prediction.
In my view, the authors' handling of data from the Chamau site and from Rottenbuch-in-Fendt provides two strong examples of data splitting. From the Chamau site, the model had the opportunity to learn from previous years' data during training but was then evaluated on a separate year, with no opportunity to 'cheat' and gain information about events and effects that occurred during that year from other treatments. The Rottenbuch-in-Fendt holdout is the strongest example of a true test of the model's generalizability, as the authors point out. Arguments about the model's generalizability would therefore be stronger if based on the model's performance on these two subsets. Aggregate RMSE and other metrics could be reported independently for withheld sites and withheld years, without the overwhelming influence of the additional test-set data whose parallel treatments are represented in the training data. The authors may also wish to consider adding additional sites/years to these pools, given the small quantity of data handled in these ways. A sketch of one leakage-safe splitting scheme follows below.
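As one concrete option, a group-wise split keyed on site-year would guarantee that parallel treatments never straddle the train/test boundary. A minimal sketch using scikit-learn follows; the data frame and its contents are hypothetical placeholders, not the authors' data.

```python
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

# Hypothetical observation table: one row per daily flux measurement.
df = pd.DataFrame({
    "site":      ["Fendt", "Fendt", "Chamau", "Chamau", "Neustift", "Neustift"],
    "year":      [2015, 2015, 2016, 2016, 2015, 2016],
    "treatment": ["intensive", "extensive", "intensive", "extensive",
                  "intensive", "intensive"],
    "flux":      [12.1, 3.4, 25.0, 8.7, 5.2, 40.3],
})

# Group on site-year: parallel treatments that experienced the same weather at
# the same time are forced onto the same side of the split, closing the leak.
groups = df["site"] + "_" + df["year"].astype(str)
splitter = GroupShuffleSplit(n_splits=1, test_size=0.3, random_state=0)
train_idx, test_idx = next(splitter.split(df, groups=groups))
train, test = df.iloc[train_idx], df.iloc[test_idx]
print(test)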
L140-142, L179-185: The use of a log-transform together with the selected loss function is an interesting combination of modelling techniques. It would be useful to provide a greater discussion of the effect of this combination on model behavior, given a non-statistical audience and a highly non-traditional loss term. The log transformation compresses the data distribution, forcing the model to minimize prediction errors on a relative, rather than absolute, scale. The loss function then penalizes prediction errors *exponentially,* placing a massive (approaching infinite) weight on any large error, so that the cost of a single large error easily outweighs a thousand small ones. The result is a highly risk-averse model that avoids large errors at all costs and prioritizes consistency (always predicting within the correct order of magnitude) over precision and accuracy on the bulk of the data, rather than fitting the bulk well at the cost of poor predictions on outliers. It also likely forces the model to pay close attention to hot moments. It is an interesting and potentially effective strategy depending on the goals, but I feel the biogeochemist audience would benefit from a more explicit discussion of these effects; a small numerical illustration follows below.
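To illustrate the point numerically (this is an illustration only; I am not reproducing the manuscript's exact loss, just a generic exponential penalty L(e) = exp(|e|) - 1 on log-scale residuals):

```python
import numpy as np

def exp_loss(err):
    # Generic exponential penalty on residuals (hypothetical stand-in for the
    # manuscript's loss): grows without bound as |err| increases.
    return np.exp(np.abs(err)) - 1.0

small_errors = np.full(1000, 0.1)  # a thousand small residuals (log units)
one_large = np.array([8.0])        # a single large residual

print(exp_loss(small_errors).sum())  # ~105: total cost of 1000 small misses
print(exp_loss(one_large).sum())     # ~2980: the single large miss dominates
```

Under such a penalty, gradient descent will sacrifice fit on the bulk of small fluxes to shave down any single large residual, which is exactly the risk-averse behavior described above.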
L164-175: LSTM models can be configured in three ways: (1) using ground-truth target values from previous time-steps as inputs (teacher forcing, or closed-loop), (2) using the model's own past predictions as inputs (autoregressive, or open-loop), or (3) relying solely on environmental drivers without including any version of the past target variable as an input feature in the 3D tensor. I could not find any mention of whether some version of previous flux time-steps was included in the 3D tensor. This is a crucial detail, as flux is often highly autocorrelated, yet in practical scenarios actual past flux values will not be known. Teacher forcing is thus appropriate during training to accelerate convergence, but inappropriate during model evaluation on testing datasets, as it would provide highly useful but practically unobtainable information. A discussion of these distinctions, and of precisely how the LSTM was set up in both the training and testing phases, is essential in order to assess the model's true predictive power.
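For clarity, configuration (2) would look something like the sketch below at evaluation time (Python, assuming a Keras-style model whose final input channel is lagged flux; all names are hypothetical). Configuration (1) would fill `lagged_flux` with measured values instead, and configuration (3) would omit the channel entirely.

```python
import numpy as np

def rollout_autoregressive(model, drivers, time_steps=5):
    """Open-loop (autoregressive) evaluation: the lagged-flux channel is
    filled with the model's own earlier predictions, never with measured
    fluxes, mimicking what is actually knowable in a forecasting scenario."""
    n = len(drivers)
    preds = np.zeros(n)
    for t in range(time_steps, n):
        window_drivers = drivers[t - time_steps + 1 : t + 1, :]  # days t-4 ... t
        lagged_flux = preds[t - time_steps : t, None]            # days t-5 ... t-1
        x = np.concatenate([window_drivers, lagged_flux], axis=1)[None, ...]
        preds[t] = float(model.predict(x, verbose=0)[0, 0])
    return preds
```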
L222-224: I interpret this as meaning you evaluated two different values for the n_steps/timesteps parameter of the input_shape argument of the LSTM layer, namely 5 and 10. This sets a hard limit on the number of days your model can “look backward” at previous events. I am curious why such short timeframes were considered, especially given that a fertilization input may affect soil nitrogen content, and thus denitrification rates, for many weeks, particularly if the fertilization happens to occur during a dry period. I note that you explain in the discussion that emissions nearly always peak shortly after fertilization, reducing the benefit of longer input sequences. However, I also notice in Figure 4 that moderate peaks often occur long after fertilization, much longer than the 10-day maximum evaluated. I am curious whether longer input sequences might improve predictions of such moderate peaks, for example when soil nitrogen is still moderately elevated a month after fertilization and a significant precipitation event occurs.
This also compounds my confusion regarding the aforementioned time-lagged features. The input_shape argument of an LSTM layer already defines the ceiling on how many previous time-steps the model can consider, and the gating mechanisms (and attention, where used) determine which data within that window to focus on or forget; see the short scan sketched below.
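Reusing the hypothetical make_windows() helper, drivers, and targets from the earlier sketch, scanning longer windows is straightforward; only input_shape changes:

```python
# Hypothetical hyperparameter scan: windows of 20-40 days would let the model
# see a fertilization event followed weeks later by a precipitation trigger.
for time_steps in (5, 10, 20, 40):
    X, y = make_windows(drivers, targets, time_steps=time_steps)
    print(time_steps, X.shape)  # LSTM input_shape=(time_steps, X.shape[2])
```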
L235-236: I see that RMSE values for the holdout set are given individually for each time series in Fig 4, but I would appreciate an aggregate performance evaluation for the Rottenbuch-in-Fendt holdout listed here for comparison's sake.
L269-271: Clarification is again needed here as to whether time-lagged features were directly represented as model features. If the importance of previous time-steps was determined by applying standard permutation feature importance (PFI), which would vary the values of e.g. the feature “soil_temperature_t-1” and ultimately assign it an importance value, then this method would be invalid. When explicit time-lagged features are used, previous values of a feature are represented in the model in two places: the engineered feature and the internal hidden state of the LSTM. Even if the time-lagged feature were altered by PFI, the model would still have access to the true value in its hidden state, and could thus simply ignore the shuffling.
If this is a misinterpretation and time-lagged features are not explicitly represented, then I would like to see more detail on how PFI was adapted to assess the importance of previous time-steps given the LSTM architecture; a sketch of one valid adaptation follows below.
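For reference, one valid adaptation (a minimal sketch, assuming features exist only in the 3D tensor and a Keras-style model; the helper and its names are hypothetical) permutes a feature's entire channel, all lags at once, so that no copy of the true values survives anywhere in the input:

```python
import numpy as np

def sequence_pfi(model, X, y, metric, n_repeats=5, seed=0):
    """Permutation importance for a (samples, time_steps, features) tensor:
    for each feature, shuffle whole windows of that channel across samples,
    so the LSTM cannot recover the true values from any other lag."""
    rng = np.random.default_rng(seed)
    base = metric(y, model.predict(X, verbose=0).ravel())
    importances = np.zeros(X.shape[2])
    for f in range(X.shape[2]):
        scores = []
        for _ in range(n_repeats):
            Xp = X.copy()
            perm = rng.permutation(X.shape[0])
            Xp[:, :, f] = X[perm, :, f]  # permute the full channel, every lag
            scores.append(metric(y, model.predict(Xp, verbose=0).ravel()))
        importances[f] = np.mean(scores) - base  # error increase = importance
    return importances
```

Per-lag importance could then be obtained by permuting Xp[:, t, f] for a single time index t instead of the whole channel, which is valid precisely because no duplicate copy of that lag exists elsewhere in the input.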
L349-353: Given that no process-based model was tested against the LSTM model, I question the need for this whole section comparing against process-based models, which is outside the scope of this study. I am skeptical that any meaningful comparison of predictive performance metrics can be made across different datasets, especially given the extreme temporal and spatial variability of N₂O, which translates into major variability in data distributions. Some datasets are simply easier to predict due to a lesser influence of hot moments, and comparisons of models evaluated on different datasets should therefore not be made without at least acknowledging their tenuousness.
L361-365: Again, considering that process-based model predictions exist for the same site (and year?) as one of the sites modeled in this study (Chamau), I would prefer that the authors limit their comparisons of LSTM versus process-based simulations to cases where these comparisons can be made fairly.
Technical corrections:
L24: Spell out the full word 'approximately'.
L30: The '2' should be subscript (N₂O).
L237: “Having relative RMSE higher than 1 is a known effect of the imbalanced dataset.” Citation needed.
Other minor comments:
Tables 1 and 2: Dataset size: clarify whether this refers to the number of daily flux observations.
L81: Better to say “flux” and not “flux rate”
Fig 1: It would be helpful to include the flux distribution for the test dataset too.
Table 5 and associated discussion: The table does not need to be in the main manuscript and could be presented as supplementary data. L297-303 sounds as though the authors are criticizing the previous studies and could be shortened. These are some of the initial studies on this topic and should be acknowledged accordingly, along with the limitations that the current study addresses.
Fig 6: While feature importance was assessed separately for each site in Fig 6, the discussion of variation in variable importance across sites is weak and contributes little to the understanding of differential N₂O dynamics. Also, please explain why time-lagged features were not included in the Fig 6 analysis.