This work is distributed under the Creative Commons Attribution 4.0 License.
Modelling N₂O Emissions in Fertilized Grasslands with Machine Learning: A Multi-Site LSTM Approach
Abstract. Accurate quantification of greenhouse gas (GHG) emissions from agricultural soils based on key environmental and management factors is crucial for developing effective mitigation strategies. This study applies a machine learning approach, using data from multiple montane grasslands in central Europe, to model the spatial and temporal dynamics of N2O emissions from meteorological, soil, and management data. The primary aim is to advance predictive modelling of N2O emissions from grassland soils for spatial upscaling and scenario analysis. Specifically, we assess whether a generic model can accurately estimate N2O emissions from an independent site excluded from training.
We collected data from five fertilized grasslands across southern Germany, northern Switzerland, and northern Austria (122 soil-site-treatment-year combinations) and trained a long short-term memory (LSTM) network to model the influence of drivers within a five-day window on N2O emissions. The dataset includes daily N2O emission measurements along with key emission drivers such as soil moisture and temperature at 10 cm depth, daily precipitation, and the occurrence of fertilization events (binary flag), as well as soil characteristics such as pH and bulk density.
The trained LSTM model showed strong predictive performance (RMSE of 18 µg m⁻² h⁻¹ and relative RMSE of 270 %) when evaluated on a test set that included both data from an independent soil and withheld years from training procedures. The model accurately captured N2O dynamics, including the magnitude and timing of emission peaks driven by slurry application and environmental factors. Compared with established process-based biogeochemical models, the LSTM model yielded similar RMSE and bias values for most site-years. These results demonstrate that LSTM-based models can reliably predict N2O emissions at independent sites with similar environmental and soil characteristics and represent a promising alternative to process-based models for predicting soil N2O emissions.
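As context for the metrics quoted in the abstract, a minimal sketch of how RMSE and relative RMSE are commonly defined (the manuscript's exact definitions may differ); on this definition, a relative RMSE of 270 % means the typical error is roughly 2.7 times the mean observed flux, a point several referees return to below.

```python
import numpy as np

def rmse(obs, pred):
    """Root mean square error."""
    obs, pred = np.asarray(obs, dtype=float), np.asarray(pred, dtype=float)
    return np.sqrt(np.mean((pred - obs) ** 2))

def relative_rmse(obs, pred):
    """RMSE normalized by the mean observed flux, reported in percent."""
    return 100.0 * rmse(obs, pred) / np.mean(obs)
```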
Status: final response (author comments only)
- RC1: 'Comment on egusphere-2025-5155', Anonymous Referee #1, 18 Dec 2025
- RC2: 'Comment on egusphere-2025-5155', Anonymous Referee #2, 05 Jan 2026
This study evaluates the application of LSTM networks to model spatial and temporal N2O emission dynamics across montane grasslands in Central Europe. By training on multi-site data including meteorological and soil drivers, the authors assess the model's ability to generalize to independent sites, concluding that the LSTM approach offers predictive performance comparable to established process-based biogeochemical models. My overall feeling about this manuscript is "negative", but I still feel it could make some contributions to the body of literature. As such, I recommend a major revision.
Major comments:
[1] Inadequate Contextualization and Benchmarking. The authors have not acknowledged the state of the art in GHG modelling, such as various advanced pure DL approaches, or those combining process-based models with DL, typically known as Physics-Informed Neural Networks (PINN) and Knowledge-Guided Machine Learning (KGML). By omitting these latest advancements, the introduction falsely frames the standard LSTM as a novel frontier. Furthermore, the benchmarks in Table 5 appear to be low-performance baselines. To avoid the appearance of cherry-picking, the authors need to validate their model against advanced contemporary literature rather than outdated or weak standards.

[2] Over-Ambitious Scope and Insufficient Performance. The goal of developing a "generic model" using only 122 site-year combinations is over-ambitious and unsupported by the results. The reported relative RMSE of 270 % suggests the model error is nearly three times the average observed emission, contradicting the claim of "strong predictive performance". This poor performance likely stems from insufficient training strategies or architectural limitations, along with data noise. Realistically, the authors should reframe this as a regional case study and objectively analyze why the model failed to generalize better, rather than over-claiming its predictive utility. In that case, I would still treat this as a valuable study that documents "what works well or not, by how much" amid the surge of AI applications.
[3] Lack of Technical Reproducibility. The methodology consumes excessive space describing rudimentary ML concepts ("ML 101") while omitting essential technical specifications. Critical details regarding feature engineering and model architecture—specifically how the attention mechanism was implemented within the LSTM—are missing. The authors need to replace generic descriptions with the precise technical details required for reproducibility.
Specific comments:
L15, "withheld years from training procedures": why? likely a data leakage issue.
L35: need to update the 2013 reference because there are already several recent ML/DL advances for N2O simulation. I know Klaus Butterbach-Bahl is aware of those, but I am not sure whether he has a recent review paper reflecting the state of the art.
L43: the introduction should review the existing ML-based literature and disclose its strengths and weaknesses, rather than only discussing the opportunities
L65: please add information about when the data was collected in Table 1
L100: the data split for Chamau is problematic. 2015 falls between training years, which makes the "test set" not a true out-of-sample test
L117: sorry, you cannot claim a generic model here. Six sites occupy only a very small fraction of the high-dimensional feature space that jointly determines N2O production.
L123-125: It is hard for readers to follow this narrative. Consider visualizing the split with a decision tree, or with a separate table apart from Table 2, which mingles soil information with the data split.
L128: personally I feel it is meaningless to compare the LSTM for gap-filling, which may be workable but is not the best choice; instead, you could use a "Bi-LSTM", which captures the contextual information of a whole time series
L163: Section 2.4.1 needs to be significantly expanded to allow a full evaluation, for example the dimensions of the LSTM and dense layers, what the output of the first dense layers was, and how attention was implemented in the pipeline.

L185: in Eq. (4), with the "loss" term also appearing in the exponent, this weighted loss can be quite unstable
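To illustrate the instability concern, a hypothetical loss of the general shape the referee describes, with the error inside an exponential weight; Eq. (4) in the manuscript may differ in detail.

```python
import numpy as np

def exp_weighted_loss(y_true, y_pred):
    # Hypothetical form only: squared error multiplied by an exponential
    # weight that itself depends on the error magnitude.
    err = y_pred - y_true
    return np.mean(np.exp(np.abs(err)) * err ** 2)

# A single large residual dominates the batch loss and its gradient:
print(exp_weighted_loss(np.zeros(1000), np.full(1000, 0.1)))  # ~0.011
print(exp_weighted_loss(np.zeros(1), np.array([8.0])))        # ~1.9e5
```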
L200, Table 3: what is the point of testing "0.01" and "0.05" for dropout?
L218: If this is true, you don't need to report architecture B, which is not a normal structure that one would like to test
L233: in this section, RMSE alone is insufficient to describe the model performance; at least add R²
L255: Figure 3 suggests the model has low performance, but given the limited information provided, it is hard to say whether this is because of weak training strategies, missing key features, or the model simply not working for this particular dataset
L274: the time series plots for Chamau and Fendt are impressive, but the authors need to convince readers that they were not cherry-picked and that the Chamau result is not due to data leakage
L295: this study used a relatively large dataset, which also means more noise. As a result, the current LSTM structure is insufficient to fit a "generic model" that works across multiple sites, soils, and managements. This problem is much more challenging than the authors describe or expected, so I will not blame the low model performance, but claiming that this study has solved the problem, or even reached an acceptable milestone, is misleading.
L346: section 4.4 is redundant in this manuscript
Citation: https://doi.org/10.5194/egusphere-2025-5155-RC2
- RC3: 'Comment on egusphere-2025-5155', Anonymous Referee #3, 19 Jan 2026
# General comments
This is a very interesting study that tackles the difficult task of simulating N2O emissions using LSTM models across European grasslands. I think that the dataset compilation is a valuable contribution to the community. To help this manuscript reach its full potential, I have identified a few key areas where the approach should be strengthened regarding data aggregation, validation strategy and data transformation. In addition, the Discussion would benefit from a stronger connection between the findings of this study and the existing literature. I believe this extra effort is necessary to prove the robustness of the presented approach and will help this paper to make a significant contribution to this field. For this reason, I recommend major revisions.
# Major comments
## Aggregation to daily values
L70-97: The manuscript states that sub-daily flux measurements (from both automatic chambers and eddy covariance) were "aggregated into daily values". However, the specific method of aggregation and the quality control criteria for this aggregation are not defined here. In addition, I am missing info about which data were aggregated (gap-filled or non-gap-filled). Especially for the eddy covariance (EC) sites Chamau and Neustift this would be crucial information for reproducibility, since half-hourly EC data are frequently rejected during quality control which often leaves days with very few valid data points. Then, calculating a daily value from only a few sparse measurements (which may be biased towards specific times of day) can introduce significant errors. It is therefore currently unclear if the daily training targets represent true daily averages, or if they are biased artifacts of incomplete data.
I therefore suggest that the authors please specify the aggregation method and give more detailed information about the time series that were aggregated (gap-filled or non-gap-filled). In case non-gap-filled time series were used, the minimum data availability threshold required to accept a day as valid should be given. If days with low coverage were included, please discuss the potential bias this introduces to the model. If gap-filled time series were used for the aggregation, the gap-filling models used should be given.
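One possible implementation of such a minimum-coverage rule, assuming half-hourly, non-gap-filled fluxes with a datetime index (the authors' actual procedure is not specified in the manuscript):

```python
import pandas as pd

def aggregate_daily(flux_30min: pd.Series, min_coverage: float = 0.5) -> pd.Series:
    """Average half-hourly fluxes to daily values, keeping only days where at
    least `min_coverage` of the 48 possible half-hours passed quality control."""
    daily_mean = flux_30min.resample("D").mean()
    valid_count = flux_30min.resample("D").count()
    return daily_mean.where(valid_count >= min_coverage * 48)
```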
## Input variables
L103-114 and Table 4: The model looking back at drivers of the past 5 days is likely too short to capture the causal drivers of emission peaks, because the soil is potentially primed by an event long before the emissions actually happen (legacy effect). For example, if fertilization happened 6 days ago, the model cannot see it and assumes e.g. "rain and no fertilizer" in its time window. Therefore, a longer sequence would be more appropriate (e.g. 30 days or longer). One architecture was tested using 10 days, but this might also be too short. Maybe the authors have already tested longer windows (10+) and can extend Table 4 with this additional information. It is still possible that the shortest window works best, but it is important to see this in comparison to other tests.
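To make the requested comparison concrete, a minimal sketch of how input sequences could be rebuilt for longer look-back windows; array names are illustrative, not the manuscript's.

```python
import numpy as np

def make_sequences(X_daily: np.ndarray, y_daily: np.ndarray, n_steps: int):
    """Build LSTM inputs of shape (n_samples, n_steps, n_features) so the
    model sees the drivers of the preceding `n_steps` days for each flux."""
    xs, ys = [], []
    for t in range(n_steps - 1, len(y_daily)):
        xs.append(X_daily[t - n_steps + 1 : t + 1])  # drivers of the last n_steps days
        ys.append(y_daily[t])                        # flux on day t
    return np.stack(xs), np.array(ys)

# The comparison asked for here would repeat training for e.g.
# n_steps in (5, 10, 30, 60) and report each window in Table 4.
```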
## Dataset splitting
L120-130, Data leakage: The manuscript describes a data splitting strategy where parallel treatments (e.g., intensive vs. extensive management) from the same site and year are separated into training and testing sets. However, this approach introduces significant data leakage because the parallel treatments experience identical daily meteorological drivers. Basically, the model is exposed to the (specific) weather sequences of the test period during training. I would assume that performance metrics on the test set likely reflect the ability of the model to memorize these weather patterns rather than the true ability of the model to generalize. A test for generalization would require the test data to be independent in time (unseen years) or space (unseen sites).
I therefore suggest that the authors restructure their validation strategy. This could be done by withholding entire years across all treatments or by withholding entire sites (as was done with Rottenbuch) from the training set.

L125/L130, Train/test splitting: In Section 2.3, the reported train/test split (13000 training vs. 15000 testing) is highly unconventional for machine learning and risks underfitting. It is unusual for a test set to be larger than the training set; I have not seen this in other studies yet. Unless there is a very specific requirement for these unusual data sizes that I am not aware of (and a proper justification), this needs to be addressed in an updated version of the manuscript.
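A grouped split of the kind suggested here can be expressed with scikit-learn's GroupShuffleSplit, so that all sequences sharing a site-year key (including parallel treatments) fall on the same side of the split; the arrays below are illustrative placeholders, not the manuscript's variables.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# Placeholder arrays: one entry per LSTM input sequence.
site_ids = np.array(["Fendt", "Fendt", "Chamau", "Graswang"])
years    = np.array([2015, 2015, 2016, 2017])
groups   = np.array([f"{s}_{y}" for s, y in zip(site_ids, years)])

# All sequences from the same site-year (including parallel treatments)
# end up together, which removes the shared-weather leakage.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(splitter.split(np.zeros((len(groups), 1)), groups=groups))
```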
## Log-transformation of N2O data and loss function
L140-157, L178-189: The text describes that daily N2O fluxes were log-transformed, meaning that the model outputs predictions on a log scale which are then back-transformed. This transformation seems to be somewhat in conflict with L180, which emphasizes the "importance of catching high peaks". To me it seems counterintuitive to first compress extreme values/peaks (log transformation) and then apply a custom exponential loss function that basically attempts to re-weight the peaks. By compressing the target variable, the transformation incentivizes the model to focus on fitting the "background" noise of low-magnitude emissions rather than learning the rare, high-magnitude events that drive the N2O budget. Therefore, the ability of the model to predict the timing and magnitude of emission peaks is limited. Regarding the loss function, as pointed out, the penalty grows exponentially as the error increases. If used on a dataset with rare high peaks, this makes the model highly unstable during training.
I am sure that there were good reasons why the authors chose this approach, and I therefore suggest that the authors add their reasoning to the text. If the authors want to reconsider their approach and aim to optimize the predictive power for peak events, I recommend training the model on the linear N2O data in combination with a loss function that is robust to outliers (to improve stability).
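If the authors follow this suggestion, one outlier-robust option on the linear flux scale is the Huber loss; a minimal Keras sketch with a stand-in model (the architecture here is not the authors'):

```python
import tensorflow as tf
from tensorflow.keras import layers

# Minimal stand-in model; the point is the loss choice, not the architecture.
model = tf.keras.Sequential([layers.Input(shape=(5, 12)), layers.LSTM(32), layers.Dense(1)])
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
    # Quadratic for small residuals, linear for large ones, so a badly missed
    # emission peak cannot blow up the gradients the way an exponential penalty can.
    loss=tf.keras.losses.Huber(delta=1.0),  # delta would need tuning to the flux scale
)
```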
## LSTM
L163-175: In its current state, Section 2.4.1 lacks the technical specifications that would enable others to reproduce the model. The parts of the text that mention "a set of LSTM layers" and "several dense layers" are missing details, e.g. the specific layer dimensions (units per layer) used in the final model. Similarly, the "attention mechanism" should be explained, at least briefly, even if more details are given in the provided reference. Also, attention can be applied in different places in the pipeline and there are multiple ways to implement it. These details, among others, are currently missing but required for reproducibility. I also suggest adding some of these details to Fig. 2.
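Purely as an illustration of the level of detail being requested, a sketch of one plausible LSTM-plus-attention pipeline in Keras; the layer sizes and the placement of the attention block are assumptions, not the authors' actual configuration.

```python
import tensorflow as tf
from tensorflow.keras import layers

# Illustrative architecture only; all dimensions are assumed.
n_steps, n_features = 5, 12
inputs = layers.Input(shape=(n_steps, n_features))
x = layers.LSTM(64, return_sequences=True)(inputs)
x = layers.Attention()([x, x])          # self-attention over the LSTM outputs
x = layers.GlobalAveragePooling1D()(x)
x = layers.Dense(32, activation="relu")(x)
x = layers.Dropout(0.1)(x)
outputs = layers.Dense(1)(x)            # daily N2O flux (on the transformed scale)
model = tf.keras.Model(inputs, outputs)
```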
## Hyperparameter tuning
L190-200: I appreciate showing the ranges used for hyperparameter tuning, but this section also needs to show the optimized values that the authors decided to use in the final model. In addition, in this section I have another concern regarding validation. From the description it is not clear which data were used for validation, so I assume the test set was used instead of a validation set; Section 2.3 only mentions a test set. A typical train/validation/test split would be e.g. 60/20/20. To my knowledge, optimization algorithms like Hyperband require a validation metric to compare the different configurations. If the test set was used for optimization, this would be a "training on the test set" error (it would mean the model has seen the "answers" to the tests it is taking), a form of data leakage. If the training set was used for optimization, the hyperparameters are likely overfitted to the training data. And finally, the "manual adjustments" are in my view difficult to justify after the optimization procedure, because it means that the model was optimized but the optimal parameters were then not applied. Generally, manual tuning based on observing model output introduces researcher bias and implies that the model was tweaked until it *looked* right on the evaluation data. Regarding the ranges for tuning: it is important that the authors justify the specific ranges chosen and describe e.g. why the search space for the neuron count was so constrained (it is very narrow and specific).
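For concreteness, a minimal keras-tuner sketch of Hyperband driven by an explicit validation split that is kept separate from the final test set; the hypermodel, search range, and arrays are placeholders, not taken from the manuscript.

```python
import numpy as np
import keras_tuner as kt
import tensorflow as tf
from tensorflow.keras import layers

def build_model(hp):
    # Illustrative hypermodel and search range only.
    units = hp.Int("lstm_units", min_value=16, max_value=128, step=16)
    model = tf.keras.Sequential([layers.Input(shape=(5, 12)), layers.LSTM(units), layers.Dense(1)])
    model.compile(optimizer="adam", loss="mse")
    return model

# Placeholder arrays; the validation set must be distinct from the final test set.
x_train, y_train = np.zeros((800, 5, 12)), np.zeros(800)
x_val,   y_val   = np.zeros((200, 5, 12)), np.zeros(200)

tuner = kt.Hyperband(build_model, objective="val_loss", max_epochs=50,
                     directory="tuning", project_name="n2o_lstm")
tuner.search(x_train, y_train, validation_data=(x_val, y_val))
best_hps = tuner.get_best_hyperparameters(num_trials=1)[0]  # values to report in the paper
```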
## Feature importance
L202-209, Fig. 6, Fig. 7: PFI is used to assess the importance of drivers at individual time steps. However, this is methodologically flawed for highly autocorrelated time series such as soil moisture, since the value at time t is nearly identical to t-1. So by shuffling only the value at time t, the model can recover the information from the unshuffled values at t-1 (or t-2, ...) stored in the memory of the LSTM. Consequently, the importance score of the respective feature is reduced, and continuous drivers might be undervalued in comparison to e.g. fertilization. Therefore, I am not sure Fig. 7 can work, because the lags are not independent. For Fig. 6, it is not immediately clear to me why the lagged variables are missing. I assume that maybe all time steps of a single variable were shuffled together, and then an overall importance is given (a sort of "grouped" PFI based on grouped shuffling)?
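One way to implement the "grouped" PFI described above, sketched for a 3D LSTM input; the function and names are illustrative, not the authors' implementation.

```python
import numpy as np

def grouped_permutation_importance(model, X, y, feature_idx, metric, n_repeats=10, rng=None):
    """Permute ALL time steps of one driver jointly (grouped PFI) so the LSTM
    cannot recover the signal from neighbouring, autocorrelated lags.
    X has shape (n_samples, n_steps, n_features)."""
    rng = rng or np.random.default_rng(0)
    baseline = metric(y, model.predict(X))
    scores = []
    for _ in range(n_repeats):
        Xp = X.copy()
        perm = rng.permutation(len(X))
        Xp[:, :, feature_idx] = X[perm][:, :, feature_idx]  # permute whole sequences across samples
        scores.append(metric(y, model.predict(Xp)) - baseline)
    return float(np.mean(scores))
```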
## Model performance
L230 Table 4: The relative RMSE of 270 % is very high, and one of the reasons could be that the log transformation distorts the optimization. The log transformation minimizes the (logarithmic) error, but after the results are back-transformed to the linear scale the bias on low-flux days becomes disproportionately large (relative to the mean). Also, I assume the errors on peak events increase dramatically. The authors should compare this performance against, e.g., predicting the mean of the training set for every day, and then check whether the LSTM can outperform this naive baseline. In light of the high RMSE, it is important that the authors discuss why the error is so high, especially whether it is driven by a few very high outlier values or whether the model is systematically biased. A scatter plot of predicted vs. observed fluxes on the linear scale (not log-log) can help to diagnose this. Currently, I cannot see how the model is "capable to transfer results across sites".
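The naive-baseline check suggested here is simple to add; a minimal sketch (names are placeholders):

```python
import numpy as np

def naive_baseline_check(y_train, y_test, y_pred_lstm):
    """Compare the LSTM against always predicting the training-set mean flux."""
    rmse = lambda a, b: np.sqrt(np.mean((np.asarray(a) - np.asarray(b)) ** 2))
    baseline_pred = np.full_like(np.asarray(y_test, dtype=float), np.mean(y_train))
    baseline_rmse = rmse(baseline_pred, y_test)
    lstm_rmse = rmse(y_pred_lstm, y_test)
    skill = 1.0 - lstm_rmse / baseline_rmse   # > 0 means the LSTM beats the naive baseline
    return baseline_rmse, lstm_rmse, skill
```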
# Specific comments
L10/L61: The site Neustift is described as "North Austria". Please correct this to "Western Austria".
L25: Change "greenhouse gas" to "anthropogenic greenhouse gas"
L30: The sentence "The Pearson correlation..." presents specific results in the introduction ("between -0.11 to 0.16 for our dataset"), which should not be done. I suggest the authors rephrase this sentence and remove these specific calculation outcomes, they have their place in the Results section.
L33: Change "long periods" to "long time periods"
L56: I would not go so far as to mention "general applicability". One independent grassland is not enough for this claim.
L62: "no grazing": Some of these sites have grazing. If the authors are referring to specific time periods used in their study, please add this info.
L65: I suggest to give the FLUXNET identifiers for all sites, which in case of Chamau is "CH-Cha" (instead of "CH-CHA", note the lowercase letters) and in case of Neustift "AT-Neu" (instead of "AT-NEU").
L65 table caption, and other places throughout the manuscript: "Hörtnagel et al., 2014": the correct name is "Hörtnagl"
L65 table header: I think there is a typo in the header and "(kg h-1 y-1)" should be "(kg ha-1 y-1)". I would interpret "h-1" as "per hour".
L65 Table 1: Please give full units for "Annual N2O", i.e., is this "kg N ha-1 yr-1"? The header indicates that it is "kg N2O ha-1 yr-1", because the header mentions specifically N2O, but from the magnitude of the values I would assume N2O-N.
L65/L100 What is "Dataset size"? I am also not sure how this dataset size matches up with the dataset sizes given in Table 2, e.g., Graswang has dataset size 8708 in Table 1, but numbers in Table 2 add up to 1670. Please check and add missing info.
L71: For the three German sites, please add "respectively" after giving the altitudes.
L71: "asl": I suggest to change to "a.s.l.", because this journal generally uses periods for text abbreviations.
L81/82: The given references point to subpanels a and c in Fig. A1, but this figure does not show any subpanel letters.
L85: "is part of the FLUXNET initiative": I suggest to change to "is part of Swiss FluxNet".
L88: "by separating two parcels at north and south of the site" change to "by establishing separate parcels in the northern and southern parts of the site", if I understand this correctly.
L89: "in south" should be "in the south"
L100 Table 2: Great to see (and have available) the different soil properties for the different locations. Would it make sense to include the soil properties as static drivers in the model?
L116: "among German grassland" should be "across German grasslands"
L117: "capable to transfer" should be "capable of transferring"
L125: "13000 data sequences": to me it is not immediately clear what "data sequence" means. I assume one data sequence is what is shown in Eq. 1, i.e., a single input block that is fed into the LSTM (correct?). I suggest to add more info here since the term "data sequence" is mentioned for the first time.
L130: "sequence" should be "sequences"
L161: "should be considered" is prescriptive language and OK in other places. In the Methods section I suggest to focus on reporting what "was" done to clearly describe the specific actions taken in this study, not what "should" be done.
L213: Please add this reference for Matplotlib: https://doi.org/10.1109/MCSE.2007.55
L235: Given the low flux values (in combination with sometimes emission peaks), I do not think that the *relative* RMSE is a good indicator
L235: From the figures I take that flux values are expressed using N2O-N, but in the text there are several places where this is not clear. I suggest to add a sentence in the Methods that fluxes are always expressed using N2O-N. In this regard, the header in Table 1 is slightly misleading because it mentions "Annual N2O" with incomplete units, but the numbers seem to give cumulative N2O-N.
L236: Please explain "total bias"
L260: Is there a reason why the time lagged variables are shown in Fig. 7, but not in Fig. 6? Maybe I have missed this somewhere
L275 Fig. 4: Typo "occurance" --> "occurrence"
L290-296: This could be shortened, most was already mentioned before.
L297-306: This currently reads more like a literature review than a discussion. The limitations of previous studies are listed, but the direct connection to the findings in this study is missing. I suggest the authors revise this section to synthesize these citations with their results. For example, how their study addresses (or contrasts with) methodological gaps they identified in prior work (e.g., lack of independent test sites).
L308-326: Although this is part of the Discussion, there are no references to existing literature and many of the sentences belong to the Results section. Please revise this section and contextualize findings by citing previous work that supports or contradicts results from this study. For example, is the observed moisture dependency consistent with established theory?
L330: "Soil temperature, ..." This sentence seems to be missing words, please fix.
L331: I am critical of the claim that maximum air temperature can substitute for soil temperature with "minimal impact". Air and soil temperatures often decouple due to factors like snow cover (insulation) and vegetation shading. And of course there are different soil depths and different thermal lags. In case a sensitivity analysis was performed to demonstrate that the performance drop is indeed negligible, it could be mentioned solely in the context of this study rather than implying it is a viable strategy for sites globally.
L334-337: Context to literature is missing
L338-345: Here, some interesting points are raised from the literature (pH, BD), but they are not discussed in relation to findings in this study. Instead, the final two sentences jump back to soil moisture but without giving references.
L346-369: I suggest limiting this section to sites that were also used in this study. A comparison with sites from other locations is difficult, especially for N2O, which is characterized by high spatial variability.
Citation: https://doi.org/10.5194/egusphere-2025-5155-RC3
Data sets
Eddy covariance ecosystem fluxes, meteorological data and detailed management information for the intensively managed grassland site Chamau in Switzerland, collected between 2005 and 2024 L. Hörtnagl et al. https://doi.org/10.3929/ethz-b-000747025
The TERENO Pre-Alpine Observatory: Integrating Meteorological, Hydrological, and Biogeochemical Measurements and Modeling R. Kiese et al. https://doi.org/10.2136/vzj2018.03.0060
General comments:
This manuscript takes on the important task of improving N2O emission predictions from grasslands using machine learning, crucially through the use of a large multi-site dataset. This is a significant development from earlier work on this topic. However, several aspects of the configuration of the LSTMs and the preprocessing of the data require clarification or improvement in order to provide a fair and transparent evaluation of the model's performance. Specifically, I have concerns regarding potential data leakage in the train/test splitting strategy and ambiguity regarding the use of past target-variable values as inputs.
These details may seem esoteric, but without them, it is impossible to discern whether the RMSE achieved here is impressive or disappointing. That being said, I feel that these changes can be implemented—or at least clarified—relatively easily, and are necessary to validate the study's conclusions. I also strongly encourage the authors to make their model code and data available via a public repository. This would resolve the interpretability issues raised in this review and ensure the reproducibility of the study.
I believe that with these methodological issues resolved, this study could make a valuable contribution to the field of environmental modeling, with broad potential impact.
Specific comments:
L105-114: Equation 1 suggests that you have manually included four additional time-lagged features of the temporal drivers, one for each of the prior four days. If I have interpreted this correctly, it is unclear to me why you have manually created these time-lagged features given that you are using an LSTM model. The LSTM model already contains memory cells, forget gates, an attention layer, and other architectural features that remember past events and learn which are important. As an LSTM model already has access to previous time steps, forcing the model to use manually created ones may interfere with the model's attempt to learn and forget appropriately. Please clarify whether the 't-1' etc. indicate manually included time-lagged features, and if so, justify their use. It would be helpful to explicitly state the shape of the 3D tensor used in the LSTM (N_Samples, Time_Steps, N_Features).
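For reference, the tensor convention the referee asks to be stated explicitly; the 13000 sequences and 5-day window are taken from the manuscript, while the feature count is assumed.

```python
import numpy as np

# Keras LSTM input convention: (n_samples, time_steps, n_features).
# With drivers already stacked along the time axis, extra hand-made
# "_t-1"/"_t-2" columns duplicate information the recurrence can reach itself.
n_samples, time_steps, n_features = 13000, 5, 12   # 13000 and 5 from the manuscript; 12 is assumed
X = np.zeros((n_samples, time_steps, n_features))  # drivers for days t-4 ... t
y = np.zeros(n_samples)                            # N2O flux on day t
```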
L120-130: After a careful read, I have interpreted the authors' data splitting as follows. There are two subsets, one training and one testing subset. All of the Rottenbuch-in-Fendt data was siloed to the testing dataset, while all Neustift-in-Neustift data was used for training. All data from a single year of the Chamau site was withheld for the testing dataset, while all previous years were included in training. The rest of the data appears to be split at the treatment level, with all three replications and experiment years of a given treatment sent either to training or to testing. This is somewhat different from the description given in the abstract, as the test set included not just a holdout site and withheld years as described, but also parallel years of separate treatments: "The trained LSTM model showed strong predictive performance (RMSE of 18 µg m⁻² h⁻¹, and Relative RMSE of 270 %) when evaluated on a test set that included both data from an independent soil and withheld years from training procedures." Importantly, withheld years and withheld sites make up only 320 and 259 data points of the total 14591 measurements in the testing set (~2 % each).
I firstly want to commend the authors for clearly taking data handling seriously and devising a more rigorous method of splitting than a simple randomized split. I also appreciate that authors have provided a loose description of criteria and thought that went into dividing datasets into testing and training and their desire to represent a variety of soil, environmental, and management properties in both groups. That being said, I do still have concerns about the potential for data leakage given the methods described. Most crucially, I believe it generally best practice when working with sequential data to avoid having data from the same locations and time periods represented in both testing and training pools, even if there was a difference in management technique (e.g. extensive/intensive). As most experiments use one-factor-at-a-time, separate treatments typically experience mostly the same conditions at the same time, including weather and other factors which may affect biogeochemical systems and which may not be captured by model inputs (Zhu et al., 2023). Thus there is often a similarity in flux patterns over time, where factor differences may lead to a difference in peak height etc, but the overall shape of flux over time is similar. Thus when split across testing and training datasets, information about flux patterns and their drivers may leak from a parallel treatment, giving the model an unfair advantage during predictions.
In my view, how the authors have handled data from the Chamau site and Rottenbuch-in-Fendt are both strong examples of data splitting. From the Chamau site, the model had the opportunity to learn from previous years’ data during training but was then evaluated on a separate year, with no opportunity to ‘cheat’ and gain information about events and effects that occurred during that year from other treatments. The Rottenbuch-in-Fendt is the strongest example of a true holdout testing the model’s generalizability, as the authors point out. Arguments about the model’s generalizability thus would be stronger based upon the model’s performance on these two subsets. Aggregate RMSE and other metrics could be reported independently for withheld sites and withheld years, without the overwhelming influence of additional data included in the test set where parallel treatments are represented in training data. Authors may also wish to consider adding additional sites/years to these pools, given the small quantity of data handled in these manners.
L140-142, L179-185: The use of the log-transform and the selected loss function is an interesting combination of model techniques. I think it would be useful to provide a greater discussion on the effect of this combination on model behavior, given a non-statistical audience and a highly non-traditional loss term. The log transformation compresses the data distribution, forcing the model to minimize prediction errors on a relative, rather than absolute, scale. Then, the loss function penalizes prediction errors *exponentially,* placing a massive (approaching infinite) influence on any large errors. Thus the cost of a single large error will easily outweigh a thousand small errors. The result is a highly risk-averse model (avoiding large errors at all cost) that prioritizes consistency (always predicting within the same order of magnitude) over precision and accuracy on the bulk of the data, with poor predictions on outliers. It also likely forces the model to pay close attention to hot moments. It is an interesting and potentially effective strategy depending on the goals, but I feel the biogeochemist audience would benefit from more explicit discussion of these effects.
L164-175: LSTM models can be configured in three ways: (1) using ground-truth target values from previous time-steps as inputs (teacher-forcing or closed-loop), (2) using the model’s own past predictions as inputs (autoregressive or open-loop), or (3) relying solely on environmental drivers without including any version of the past target variable as an input feature in the 3D tensor. I could not find any mention of whether some version of previous flux time-steps were included in the 3D tensor. This is a crucial detail, as flux is often highly autocorrelated, yet in practical scenarios, actual past flux values will not be known. Using teacher-forcing is thus appropriate during training to accelerate convergence, but inappropriate during model evaluation on testing datasets, as this would provide highly useful but practically impossible information. A discussion of these distinctions and how precisely the LSTM model was set up in both training and testing phases is immensely needed in order to assess the model’s true predictive power.
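A small sketch to make the three configurations the referee distinguishes concrete; shapes and names are illustrative only, not the manuscript's actual setup.

```python
import numpy as np

# Illustrative shapes only.
n, t, f = 1000, 5, 12
drivers = np.zeros((n, t, f))        # environmental/management inputs
obs_flux_lags = np.zeros((n, t))     # observed N2O flux on the preceding days

# (1) Teacher forcing / closed loop: observed past fluxes enter as an extra feature.
X_teacher = np.concatenate([drivers, obs_flux_lags[:, :, None]], axis=2)
# (2) Autoregressive / open loop: at inference, the model's own past predictions
#     would replace obs_flux_lags in the tensor above.
# (3) Exogenous-only: drivers alone, with no version of the past target.
X_exogenous = drivers
```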
L222-224: I interpret this as you have evaluated two different hyperparameters for the n_steps/timesteps parameter of the input_shape argument of the LSTM layer, those being 5 and 10. This sets a hard limit on the number of days your model can "look backward" at previous events. I am curious why such short timeframes were considered, especially given that a fertilization input may have an impact on soil nitrogen content and thus denitrification rates for many weeks, particularly if the fertilization happens to occur during a dry period. I note that you explain in the discussion that emissions nearly always peak shortly after fertilization, thus reducing the benefit of longer sequence inputs. However, I also notice from Figure 4 that moderate peaks often occur long after fertilization, much longer than the 10-day maximum evaluated. I am curious whether longer input sequences may improve predictions on such moderate peaks, for example if soil nitrogen is still moderately elevated a month after fertilization and a significant precipitation event occurs.
This also compounds my confusion regarding the aforementioned time-lagged features. The input_shape argument of an LSTM layer already defines the ceiling of how many previous time-steps the model is capable of considering, and attention and forget gates are used to determine which data among this pool to actually focus on or forget.
L235-236: I see that RMSE for the holdout set are given individually for timeseries in Fig 4, but I would appreciate an aggregate performance evaluation for the Rottenbuch-in-Fendt holdout listed here for comparison’s sake.
L269-271: Clarification is again needed here as to whether time-lagged features were directly represented as model features. If the importance of previous time steps of features was determined by using standard PFI, which then varied the values of e.g. the feature "soil_temperature_t-1" and ultimately assigned it an importance value, then this method would be invalid. In the case of using explicit time-lagged features, previous values of features are represented in the model in two places: the engineered feature, as well as the internal hidden state of the LSTM. Even if the time-lagged feature were altered by PFI, the model would still have access to the true value in its hidden state, and thus the model could simply ignore the shuffling by PFI.
If this is a misinterpretation and time-lagged features are not explicitly represented, then I would like to see more detail on how PFI was adapted to assess the importance of previous time-steps given the LSTM architecture.
L349-353: Given that no process-based model was tested against the LSTM model, I question the need for this whole section comparing with process-based models, which is not within the scope of this study. I am skeptical that any meaningful comparison of predictive performance metrics can be made across different datasets, especially given the extreme temporal and spatial variability of N2O, which translates into major variability in data distributions. Some datasets are simply easier to predict due to a lesser influence of hot moments, and thus comparisons of models evaluated on different datasets should not be made without at least acknowledging their tenuousness.
L361-365: Again, considering that data exists for process-based model predictions at the same site (and year?) as one of the modeled sites in this study (Chamau), I would prefer that authors limit their comparisons of LSTM vs process-based simulations where these comparisons can be made fairly.
Technical corrections:
L24. Spell out full word ‘approximately’
L30. ‘2’ subscript
L237: “Having relative RMSE higher than 1 is a known effect of the imbalanced dataset.” Citation needed.
Other minor comments:
Table 1 and 2: Dataset size - clarify if this is the number of daily flux observations
L81: Better to say “flux” and not “flux rate”
Fig 1: It would be helpful to include the flux distribution for the test dataset too
Table 5 and associated discussion: The table does not need to be in the main manuscript and can be presented as supplementary data. L297-303 reads as if the authors are criticizing the previous studies and can be shortened. These are some of the initial studies on this topic and should be acknowledged accordingly, along with the limitations that the current study addresses.
Fig 6: While feature importance was assessed separately for each site in Fig. 6, the discussion of any variation in variable importance across sites is weak and contributes little to the understanding of differential N2O dynamics. Also, please explain why time-lagged features were not included in the Fig. 6 analysis.