Abstract. Agriculture is the largest anthropogenic source of nitrous oxide (N2O), primarily due to nitrogen (N) fertilization. Understanding how the influence of key drivers and the relative contribution of source processes change throughout the cropping season is crucial for developing effective strategies to mitigate N2O emissions. In this study, we combined high-resolution eddy covariance flux measurements and stable isotope analyses over one winter wheat cropping season and the subsequent summer cover crop season. Two phases, crop establishment and early spring, were identified as critical periods for N2O emissions, characterized by a mismatch between N supply and plant demand, resulting in surplus soil mineral N and elevated N2O fluxes under favorable environmental conditions. Gross primary productivity (GPP), used as a proxy for crop N uptake, suppressed N2O emissions, especially under high soil moisture, highlighting the importance of active vegetation in mitigating emissions. Source partitioning, based on stable isotopes, revealed denitrification as the dominant process of N2O production, driven by poor soil drainage and high soil moisture. Over the nine-month winter wheat season, the Tier 1 N2O emission factor was 1.8 %, with cumulative emissions of 5.5 kg N2O-N ha−1, offsetting 70 % of the net CO2 uptake. Our findings emphasize the need to better synchronize N supply with crop demand and to adopt agronomic practices that promote rapid crop establishment to mitigate N2O emissions in cropping systems.
Received: 04 Jan 2026 – Discussion started: 20 Jan 2026
Publisher's note: Copernicus Publications remains neutral with regard to jurisdictional claims made in the text, published maps, institutional affiliations, or any other geographical representation in this paper. While Copernicus Publications makes every effort to include appropriate place names, the final responsibility lies with the authors. Views expressed in the text are those of the authors and do not necessarily reflect the views of the publisher.
This manuscript is a thorough and insightful exploration of the drivers of soil N2O emissions. Building on a well-researched field, the paper employs strong and novel methods, combining spatio-temporally integrated flux measurements with frequent measurements of crop growth dynamics to estimate soil nitrogen dynamics, plant nitrogen uptake, and N2O emissions over time. The experiment appears well conducted and to have yielded a wealth of data, to which the authors apply investigative statistical methods that build a strong picture of the drivers of N2O and how these drivers shift over time, which I found particularly insightful. The manuscript is also clearly written, if overly descriptive and verbose at times.
However, I have key concerns regarding the methodological rigor and the resulting interpretations in three key areas. First, there is significant opacity regarding the data handling process during machine learning, including a high risk of temporal data leakage during model validation. Second, the methods used to classify 'background' vs. 'hot-moment' emissions are tenuous and lack a mechanistic basis. Third, while the authors identify Gross Primary Productivity (GPP) as a preeminent negative driver of emissions in their discussion of results, the exclusion of soil mineral N concentrations from driver analysis creates a risk of omitted variable bias, in which GPP may act as a proxy for N exhaustion. To justify their major claim that GPP is a suppressor of N2O emissions and that N synchronicity is a path toward N2O mitigation, the authors should make efforts to decouple GPP from simple substrate limitation in their statistical investigation of drivers.
With a tighter text and these improvements to statistical methods, I am confident this manuscript will represent a strong and novel contribution to understanding pathways to agricultural GHG mitigation.
Major Comments:
L55-56: Is this sentence referring to nitrogen cycling or crop growth?
L80-84: This is true, strictly speaking. However, it is also possible to infer the dominant N-transforming process from the rate of N2O production, as hot moments have been shown to be overwhelmingly the result of denitrification. Relatedly, recent work used ML to differentially model nitrification-dominant and denitrification-dominant emissions (Lussich et al., 2026): https://doi.org/10.1002/jeq2.70126
L84-90: I’ll use this passage to illustrate a broader observation that characterized much of this manuscript. The authors frequently provide detailed explanations, such as the principles and mechanics behind stable isotope analysis. While thorough, these explanations are often characterized by a poor economy of words, include details which are not directly pertinent to the narrative at hand, and add up to a manuscript of considerable length (I count nearly 13,000 words). This work would significantly benefit from greater concision, illustrating only the most pertinent details and taking advantage of the expected level of familiarity of a biogeoscientist audience.
Table 1: slurry was applied the day after mineral N fertilizer was applied. Is this typical practice in the region? This creates ideal C and N chemistry for producing N2O. For a study focused on understanding N2O mitigation opportunities, this is not an efficient fertility management decision. Could the slurry instead be applied in fall, before wheat planting?
L162-163: Why were two separate outlier detection methods used to filter erroneous fluxes?
L173-184: Regarding the RF gap-filling: was there any analysis of error propagation, or any assessment or justification of the accuracy of RF as a gap-filling method? Accurately gap-filling N2O fluxes is still a topic of much research and has met with mixed success. One of the papers cited here reports gap-filling R2 for N2O between 0.6 and 0.76, using only 15% of 'missing values' as the test set, compared with ~47% of the time period missing in this study. Other work has reported lower R2 values, from 0.42 (Taki et al., 2019, 10.1139/cjss-2018-0041) to 0.66 (Goodrich et al., 2021, 10.1016/j.agrformet.2020.108280), also using just 15% of the data as 'missing values' to be filled. The other paper the authors cite here performed no analysis of the accuracy of the RF gap-filling method used. The success of ML gap-filling has also been shown to depend on gap length (Taki et al., 2019), yet there is no discussion of typical gap length or of its impact beyond reporting the aggregate percentage of missing values. While I acknowledge that flux gap-filling is a major unsolved challenge and that work must go on in the meantime, I nevertheless feel it is important to acknowledge the limitations of gap-filling high-resolution N2O flux data and the effect these limitations might have on this study.
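One way to address this would be to report gap-filling skill as a function of gap length, by masking artificial gaps in the observed record and refilling them. A minimal sketch on synthetic data (the drivers, gap lengths, and 15% masking fraction are illustrative, not the authors' actual setup):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)
n = 2000  # stand-in for a half-hourly flux record
t = np.arange(n)
# Synthetic drivers and a flux that depends on them plus noise
wfps = 0.5 + 0.2 * np.sin(2 * np.pi * t / 500) + rng.normal(0, 0.05, n)
tsoil = 10 + 8 * np.sin(2 * np.pi * t / 1000) + rng.normal(0, 1, n)
flux = 2 * wfps + 0.1 * tsoil + rng.normal(0, 0.3, n)
X = np.column_stack([wfps, tsoil])

def gap_fill_skill(gap_len, frac_missing=0.15):
    """Mask ~frac_missing of the series as contiguous gaps of gap_len
    points, fill them with an RF trained on the rest, and return R2."""
    n_gaps = max(1, int(frac_missing * n / gap_len))
    mask = np.zeros(n, dtype=bool)
    for s in rng.choice(n - gap_len, size=n_gaps, replace=False):
        mask[s:s + gap_len] = True
    rf = RandomForestRegressor(n_estimators=100, random_state=0)
    rf.fit(X[~mask], flux[~mask])
    return r2_score(flux[mask], rf.predict(X[mask]))

# Skill for short, daily-scale, and week-scale gaps
skill = {L: gap_fill_skill(L) for L in (6, 48, 336)}
for L, r2 in skill.items():
    print(f"gap length {L}: R2 = {r2:.2f}")
```

Reporting a curve like this for the real record would make the uncertainty introduced by the ~47% gap fraction explicit.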
L183-185: It appears the authors estimated the background flux level simply by excluding flux data from the 30 days after fertilization, under the assumption that the fertilization effect lasts only 30 days. First, I do not think this assumption is correct, and the authors' own data do not support it. Fertilization can have long-lived impacts on N2O emissions, particularly during dry periods, the offseason, etc. Moreover, not all hot moments are fertilizer-driven, as groups such as Claudia Wagner-Riddle's have shown. Nor are all fertilizer-driven N2O emissions caused directly by fertilizer: the excess N added to the soil system by fertilization may be temporarily captured in plant or microbial biomass and mineralized months or even years later, driving emissions that would not occur in a natural, unfertilized system but which nevertheless occur distantly from any fertilization event. The authors' own data show this: the large peaks from mid-June to August, as large as the post-fertilization peaks, occur well beyond the 30-day post-fertilization period and are unlikely to happen in a truly unfertilized control treatment. Moreover, the SHAP analysis (Fig. 4) shows that the fertilization effect was a dominant factor for almost two months after fertilization.
I have serious reservations about this method of distinguishing between HM and BG emissions by simply excluding fluxes within 30 days of fertilization. It potentially inflated the background emissions and perhaps underestimated the emission factors. The authors might consider alternative methods of contextually distinguishing background emissions from hot-moment emissions based on outlier detection; please see Ackett et al., 2025, doi.org/10.1029/2025JG008953.
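As one possible flavor of such an approach, a contextual outlier criterion can separate hot moments from background without any fixed post-fertilization window. A rough sketch on synthetic daily fluxes (the 31-day window and 5-MAD threshold are illustrative choices, not a recommendation of specific values):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 365
flux = rng.lognormal(mean=-1.0, sigma=0.2, size=n)  # low, skewed background
flux[100:110] += rng.lognormal(1.5, 0.3, size=10)   # one ten-day hot moment
s = pd.Series(flux)

# Rolling median and MAD give a local, robust estimate of background level
med = s.rolling(31, center=True, min_periods=15).median()
mad = (s - med).abs().rolling(31, center=True, min_periods=15).median()
hot = s > med + 5 * 1.4826 * mad  # 1.4826 scales MAD to ~1 sigma

background_mean = s[~hot].mean()
print(f"{int(hot.sum())} hot-moment days; "
      f"background mean = {background_mean:.2f}")
```

Because the threshold adapts to the local flux level, fertilizer-distant hot moments (such as the mid-June to August peaks) are flagged rather than absorbed into the background.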
L199-201: Spatially aggregated flux measurements are captured across the field using EC, yet a single point measurement is used for soil moisture content and temperature? Soil moisture is also highly spatially heterogeneous. This seems a potentially noteworthy limitation.
L303-305: Again, the persistence of elevated N availability is highly variable and can be much longer than 30 days.
L318-325: A more complete description of the data splitting process is needed in order to verify its validity. In this custom time-block method, how large were the time chunks? Was the model trained on data further in the future than the test data? When working with a single time series, the most defensible method of cross-validation is one similar to that employed by the TimeSeriesSplit function in scikit-learn. Under this method, the time series is split into n+1 chunks, and in each fold the model is trained on all prior data within the time series, including an initial run-up chunk (i.e., an expanding window). This ensures that the model is never trained on future data, which would constitute a form of data leakage.
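A minimal sketch of the expanding-window scheme described above, using scikit-learn's TimeSeriesSplit (the series length and number of splits are arbitrary):

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

t = np.arange(20)  # stand-in for an ordered flux time series
tscv = TimeSeriesSplit(n_splits=4)  # 5 chunks: initial run-up + 4 test folds
for fold, (train_idx, test_idx) in enumerate(tscv.split(t)):
    # Training data always precede the test block -> no temporal leakage
    assert train_idx.max() < test_idx.min()
    print(f"fold {fold}: train t <= {train_idx.max()}, "
          f"test t = {test_idx.min()}..{test_idx.max()}")
```

Each successive fold trains on a longer prefix of the record, so no fold ever sees data from the future of its test block.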
L323-325: RMSE and R2 both give disproportionately large influence to hot moments by heavily weighting large residuals. Consider using an evaluation metric that weights residuals of all sizes evenly, such as MAE, to give a more balanced evaluation of model performance across the distribution of values.
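A toy illustration of the point: a single hot-moment-sized residual inflates RMSE roughly tenfold here, while MAE moves far less (the numbers are arbitrary):

```python
import numpy as np

# 99 small background residuals and one large hot-moment residual
resid = np.array([0.1] * 99 + [10.0])
rmse = np.sqrt(np.mean(resid ** 2))
mae = np.mean(np.abs(resid))
print(f"RMSE = {rmse:.3f}, MAE = {mae:.3f}")  # RMSE ~1.005, MAE ~0.199
```

Without the outlier both metrics would equal 0.1, so RMSE is dominated by the one large residual while MAE still reflects typical background performance.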
L328-331: I have searched through the document and could not find a description of what the authors here call the “final model,” nor what the “test set” was. There seem to be a lot of mixing of terminologies regarding data handling and model evaluation, leading to a very opaque picture of the actual methods used. Starting at the beginning of this section:
“Following variable selection, model hyperparameters were optimized using 10-fold cross-validation. To account for temporal autocorrelation and avoid overfitting, while also providing representative coverage of the measurement period, we employed a custom time-block strategy. This approach involved an 80/20 split between training and validation…”
This seems plainly contradictory. If 10-fold cross-validation were used, then in each fold 90% of the data would be used for training and 10% for validation, yet the authors claim an 80/20 split. My best guess is that the 10-fold cross-validation was a separate process used exclusively for hyperparameter tuning, for which no details were provided, followed by an abrupt shift to a new data handling process involving an 80/20 split of some kind. Yet the descriptions of each process are incomplete and poorly differentiated in their purposes, leaving a jumbled passage.
“with the validation set comprising randomly selected, non-overlapping time blocks that together represented 20% of the available data.”
The authors here claim that time blocks were randomly selected, but a timeseries should not be randomly split. The authors seem to acknowledge this idea with their reference to temporal autocorrelation, but their description of their methods here does not give confidence that they have properly dealt with this challenge.
“This splitting strategy was consistently applied throughout the modeling workflow, including during cross-validation and final model evaluation.”
This does not make sense to me. There have been no clear definitions of what constitutes the “cross-validation” and the “final model evaluation.”
“Cross-validation results showed a R2 of 0.60 and a RMSE of 1.1 nmol N2O m-2 s-1 on the validation set, while the training set showed a R2 of 0.98 and a RMSE of 0.29 nmol N2O m-2 s-1. The final model, trained with early stopping (10 rounds) to prevent overfitting, achieved a R2 of 0.70 and a RMSE of 1.14 nmol N2O m-2 s-1 on the test set (Fig. A2).”
This passage invites more confusion. Beyond the lack of clarity regarding CV vs. final evaluation, the authors describe model performance on “the test set,” “the validation set,” etc. Was there a singular test set, as the language here suggests, or are the authors instead referring to average model performance on holdout data across the 5 or 10 folds?
L333-349: In contrast to the preceding passage, this description of SHAP values is exhaustive, and may be excessive.
L360-384: This passage is a lengthy textual description of figure 2, which I’m not sure is necessary given the length of the manuscript.
L452-453: The average background flux estimate of 0.44 nmol m⁻² s⁻¹ is equivalent to ~11 g N2O-N ha⁻¹ d⁻¹, which is substantially (3-4 times) larger than globally estimated background emissions of ~2.5 to 3 g N2O-N ha⁻¹ d⁻¹. This could be due to the reasons I explained above.
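For reference, the arithmetic behind this comparison (assuming the reported flux is in nmol N2O m⁻² s⁻¹ and using 28 g N per mol N2O, since each N2O molecule carries two N atoms):

```python
flux_nmol = 0.44          # nmol N2O m-2 s-1 (reported background mean)
g_n_per_mol_n2o = 28.0    # two N atoms per N2O molecule
sec_per_day = 86_400
m2_per_ha = 10_000

g_n_ha_day = flux_nmol * 1e-9 * g_n_per_mol_n2o * sec_per_day * m2_per_ha
print(f"{g_n_ha_day:.1f} g N2O-N ha-1 d-1")  # -> 10.6, i.e. roughly 11
```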
L462-466: “In contrast, the cover crop acted as a net CO2 source, emitting 108 g CO2-C m-2”
The cover crop phase was measured for only 2 months and was influenced by cultivation after wheat harvest. Is this the typical cover crop growth duration? If not, this is misleading information that provides only a snapshot of the rotation. The active growth phase of a cover crop is often known to sequester C (an NEE sink) with negligible N2O emissions.
As the authors correctly point out in their discussion, it is not the cover crop per se that is acting as an emitter of CO2, but rather this is the response to harvest of the winter wheat and the bare soil. It is tricky, but I wonder if there is a better way to convey that the cover crop is actually reducing emissions, not magnifying them, even though these emissions are taking place during the cover crop’s establishment.
L480-483: In L452, the authors mentioned a background flux of 0.44 nmol m⁻² s⁻¹. This is confusing, as it seems different subsets of data are being used in different places to estimate the background flux.
L486-509: SHAP values are best interpreted by relating the direction/magnitude of impact alongside the magnitude of the factor value. For example, the authors write: “N2O emissions were mainly suppressed by WFPS.” The fact that soil moisture had a large negative impact on flux predictions, as described here, leaves a lot of room for interpretation. Were these negative influences related to high or low soil moisture?
L510-515, Figure 5: The binning strategy used here to quantify the relationships between WFPS, GPP, and flux is statistically fragile. The bins create artificial step functions in what is likely a linear relationship and reduce the number of data points available for statistical analysis. The approach can demonstrate that an interaction exists, but it does not quantify that relationship in terms of strength or behavior. A better approach might be to fit a multiple linear regression with an interaction term. Partial R2 could also be used to more precisely determine the relative explanatory power of each factor and of the interaction.
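A minimal sketch of the suggested alternative on synthetic data (the variable ranges, coefficients, and noise level are illustrative; partial R2 is computed as the variance explained by the interaction beyond the main effects):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

rng = np.random.default_rng(2)
n = 500
wfps = rng.uniform(0.3, 0.9, n)   # water-filled pore space
gpp = rng.uniform(0.0, 12.0, n)   # gross primary productivity
# Flux rises with WFPS; GPP suppresses it more strongly when soils are wet
flux = 3 * wfps - 0.1 * gpp - 0.3 * wfps * gpp + rng.normal(0, 0.3, n)

X_full = np.column_stack([wfps, gpp, wfps * gpp])  # with interaction term
X_red = np.column_stack([wfps, gpp])               # main effects only

r2_full = r2_score(flux, LinearRegression().fit(X_full, flux).predict(X_full))
r2_red = r2_score(flux, LinearRegression().fit(X_red, flux).predict(X_red))
# Partial R2 of the interaction term
partial_r2 = (r2_full - r2_red) / (1 - r2_red)
print(f"full R2 = {r2_full:.2f}, partial R2 (interaction) = {partial_r2:.2f}")
```

Unlike binning, this uses every data point, yields an interaction coefficient with a sign and magnitude, and quantifies how much explanatory power the interaction adds.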
L559-560: True, but could this also be due to an increasing contribution of nitrification to the net N2O emissions? How can the authors be so sure about the secondary vertical axis of Fig. 7B?
L565-566: Relatedly, could not the increased role of nitrifier denitrification also indicate a decreased rate of denitrifier denitrification overall?
L646-650: The N2O offset effect on net CO2 is based on NEE and not on the actual long-term C sink, i.e., soil C stabilization. Not all of the NEE sink from a growing season translates into soil C gain that remains stabilized beyond the growing season. Studies using long-term experiments showed much higher N2O offset effects in fertilized systems in other regions (see Dhaliwal et al., 2025; doi.org/10.1002/jeq2.70046). Therefore, the short-term nature of the study may have underestimated the N2O offset effect.
L656-661: Mineral N sampling also showed modestly high N availability during this period, likely related to low plant uptake and to mineralization from residues and microbes.
Section 4.4, L705-708: I will take this line as an opportunity for a discussion on the idea of synchronicity between N supply and demand as a key feature in determining N2O emissions, and as a key idea in this manuscript. The authors place much significance on this idea throughout their introduction, use of methods, and the discussion here. In my view, the evidence for nitrogen synchronicity as the preeminent determinant of N2O emissions, rather than simply N availability, is a bit scattered. For example, Figure 4 excellently illustrates flux and GPP, yet mineral N availability is missing as the third key component needed to make this case. From Fig. 4 there certainly does seem to be an inverse relationship between flux and GPP, but I wonder to what extent GPP is also negatively correlated with N availability. Given the inverse relationship between GPP and mineral N (the nitrogen is moving directly from the soil and into the plant biomass), more evidence is needed to prove that GPP is independently negatively affecting N2O emissions, beyond soil N.
This leads me to my concern about the authors’ interpretation of GPP’s role in the SHAP analysis. Because soil mineral N concentrations were not included as input features in the Random Forest model, the model likely suffers from omitted variable bias. While 'days since fertilization' is included as a predictor, it is a very rough and linear proxy that fails to capture the dynamic, non-linear depletion of the soil N pool. In the absence of a direct N-substrate variable, the machine learning algorithm quite possibly utilizes GPP as a better mathematical proxy for the diminishing N pool. Consequently, the high importance attributed to GPP in the SHAP analysis may simply reflect N exhaustion rather than a mechanistic suppression of N2O production by plant uptake (synchronicity). To truly substantiate the claim that synchronicity is key to emissions, the authors would ideally show that for a given level of soil mineral N, higher GPP results in lower emissions. Without including measured soil N data in the modeling workflow, the current results show correlation with the crop’s growth phase, but do not sufficiently decouple plant demand from simple substrate limitation to make strong claims such as this one.
Eddy covariance ecosystem fluxes, meteorological data and detailed management information for the cropland site Oensingen in Switzerland, collected between 2021 and 2023. Fabio Turco et al., https://doi.org/10.3929/ethz-c-000782868
Notebooks used to build the PI dataset of the cropland ecosystem station CH-OE2 (Oensingen) for the period 2021-23. The notebooks were used to process and gap-fill the following fluxes: NEE (net ecosystem exchange of carbon dioxide, 2021-2023), LE (latent heat flux, 2021-2023), H (sensible heat flux, 2021-2023), FN2O (nitrous oxide flux, Sep 2022 - October 2023), FCH4 (methane flux, Sep 2022 - October 2023). Fabio Turco and Lukas Hörtnagl, https://zenodo.org/records/17975468
Nitrous oxide emissions over winter wheat peaked during crop establishment and in early spring when wet soil and high nitrogen levels met low plant uptake. Our study found that strong plant growth suppressed these emissions by taking up excess nitrogen. Stable isotope analysis indicated denitrification as the dominant emission source. Our findings indicate that synchronizing fertilizer inputs with crop demand, promoting rapid early crop growth, and improving soil drainage can mitigate emissions.
Nitrous oxide emissions over winter wheat peaked during crop establishment and in early spring...
General Comments:
This manuscript is a thorough and insightful exploration into the drivers of soil N2O emissions. Building on a well-researched field, the paper employs strong and novel methods to demonstrate this idea through use of spatio-temporally integrated flux measurements and frequent measurement of crop growth dynamics to estimate soil nitrogen dynamics, plant nitrogen uptake, and N2O emissions over time. The experiment appears well conducted and to have yielded a wealth of data, to which the authors add investigative statistical methods which contribute a strong picture of the drivers of N2O and how these drivers shift over time, which I found particularly insightful. The manuscript is also clearly written, if overly descriptive and verbose at times.
However, I have key concerns regarding the methodological rigor and the resulting interpretations in three key areas. First, there is significant opacity regarding the data handling process during machine learning, including a high risk of temporal data leakage during model validation. Second, the methods used to classify 'background' vs. 'hot-moment' emissions are tenuous and lack a mechanistic basis. Third, while the authors identify Gross Primary Productivity (GPP) as a preeminent negative driver of emissions in their discussion of results, the exclusion of soil mineral N concentrations from driver analysis creates a risk of omitted variable bias, in which GPP may act as a proxy for N exhaustion. To justify their major claim that GPP is a suppressor of N2O emissions and that N synchronicity is a path toward N2O mitigation, the authors should make efforts to decouple GPP from simple substrate limitation in their statistical investigation of drivers.
With a tighter text and these improvements to statistical methods, I am confident this manuscript will represent a strong and novel contribution to understanding pathways to agricultural GHG mitigation.
Major Comments:
L55-56: Is this sentence referring to nitrogen cycling or crop growth?
L80-84: This is true, strictly speaking. However, it is also possible to infer the dominant N-transforming process based on the rate of N2O production, as hot moments have been shown to be overwhelmingly the result of denitrification. Relatedly, a recent work used ML to differentially model nitrification-dominant and denitrification-dominant emissions (Lussich et al., 2026) https://doi.org/10.1002/jeq2.70126
L84-90: I’ll use this passage to illustrate a broader observation that characterized much of this manuscript. The authors frequently provide detailed explanations, such as the principles and mechanics behind stable isotope analysis. While thorough, these explanations are often characterized by a poor economy of words, include details which are not directly pertinent to the narrative at hand, and add up to a manuscript of considerable length (I count nearly 13,000 words). This work would significantly benefit from greater concision, illustrating only the most pertinent details and taking advantage of the expected level of familiarity of a biogeoscientist audience.
Table 1: slurry was applied the day after mineral N fertilizer was applied. Is this typical practice in the region? This creates perfect C and N chemistry to produce N2O. A study focused on understanding N2O mitigation opportunities, this is not an efficient fertility management decision. Can slurry be applied in fall, before wheat planting?
L162-163: Why were two separate outlier detection methods used to filter erroneous fluxes?
L173-184: Regarding the RF gap-filling: was there any analysis of error propagation? Or an assessment or justification of the accuracy of RF as a gap-filling method? Accurate gap filling N2O emissions is still a topic of much research, and has resulted in mixed success. One of the papers cited here reports R2 of gap-filling for N2O between 0.6 and 0.76, using only 15% of ‘missing values’ as the test set, in comparison to ~47% of the time period missing in this study. Other work has reported lower R2 values from 0.42 (Taki et al., 2019, 10.1139/cjss-2018-0041) to 0.66 (Goodrich et al., 2021, 10.1016/j.agrformet.2020.108280), also using just 15% of the data as ‘missing values’ to be filled. The other paper the authors cited here performed no analysis on the accuracy of the RF gap-filling method used. The success of ML gap-filling has also been shown to be related to the length of gaps (Taki et al., 2019), yet there is no discussion of typical gap length or the impact of lengths beyond reporting the aggregate percentage of missing values. While I acknowledge that flux gap-filling is a major challenge which has yet to be solved, and that work must go on in the meantime, nevertheless I feel it is important to acknowledge the limitations of gap-filling high resolution N2O flux data and the effect that these limitations might have on this study.
L183-185: It seems here like the authors estimated the background flux level by just excluding 30-day post-fertilization flux data with assumption that fertilization effect lasts only for 30 days. First of all, I don’t think this assumption is correct and your own data did not support the assumption. Fertilization can have long-lived impacts on N2O emissions, particularly during dry periods, offseason, etc. As well, not all hot moments are fertilizer-driven, as those like Claudia Wagner-Riddle’s group have shown. Moreover, not all fertilizer-driven N2O emissions are caused directly by fertilizer at all: the excess N added to the soil system by fertilization may be temporarily captured by plant or microbial biomass and mineralized months or even years later, thus driving emissions that would not occur in a natural, unfertilized system but which nevertheless occur distantly from any fertilization event. The authors’ data shows this: the large peaks, as big as post-fertilization, from mid-June to August is well beyond 30-day post-fertilization period and unlikely to happen in a truly unfertilized control treatment. Moreover, the SHAP analysis (Fig 4) also shows that fertilization effect was a dominant factor for almost two-months after fertilization.
I have serious reservation about this method of distinguishing between HM and BG emissions by simply excluding fluxes within 30 days of fertilization. This potentially inflated the background emissions and perhaps underestimated emission factors. Authors might consider alternative methods of contextually distinguishing background emissions from hot moment emissions based on outlier detection. Please see Ackett et al., 2025; doi.org/10.1029/2025JG008953.
L199-201: Spatially aggregated flux measurements are captured across a field using EC, yet a single point measurement is used for soil moisture content and temperature? Soil moisture is also highly spatially heterogenous. This seems a potentially noteworthy limitation.
L303-305: Again, the persistence of elevated N availability is highly variable and can be much longer than 30 days.
L318-325: A more complete description is needed about the data splitting process is needed in order to verify its validity. In this custom time-block method, how large were the time chunks? Was the model trained on data further in the future than the test data? When working with a single time series, the most correct method of cross validation is to use a method similar to that employed by the TimeSeriesSplit function in scikit-learn. By this method, the timeseries is split into n+1 chunks, and the model is trained on all prior data within the timeseries, including an initial runup chunk (i.e. an expanding window). This ensures that the model is never trained on future data, which would constitute a form of data leakage.
L323-325: RMSE and R2 both give disproportionately large influence to hot moments by way of heavily weighing large residuals. Consider using an evaluation metric that evenly weights residuals of all sizes, like MAE, to give a more balanced evaluation of model performance across the distribution of values.
L328-331: I have searched through the document and could not find a description of what the authors here call the “final model,” nor what the “test set” was. There seem to be a lot of mixing of terminologies regarding data handling and model evaluation, leading to a very opaque picture of the actual methods used. Starting at the beginning of this section:
“Following variable selection, model hyperparameters were optimized using 10-fold cross-validation. To account for temporal autocorrelation and avoid overfitting, while also providing representative coverage of the measurement period, we employed a custom time-block strategy. This approach involved an 80/20 split between training and validation…”
This seems plainly contradictory. If a 10-fold cross validation were used, then across each fold 90% of the data would be used for training and 10% for validation, yet the authors claim an 80/20 split. My best guess might be that the 10-fold cross validation might be a separate process exclusively for hyperparameter tuning, for which no details were provided, and then an abrupt shift to a new data handling process involving an 80/20 split of some kind takes place? Yet descriptions of each process are incomplete and poorly differentiated in their purposes, leading to a jumbled passage.
“with the validation set comprising randomly selected, non-overlapping time blocks that together represented 20% of the available data.”
The authors here claim that time blocks were randomly selected, but a timeseries should not be randomly split. The authors seem to acknowledge this idea with their reference to temporal autocorrelation, but their description of their methods here does not give confidence that they have properly dealt with this challenge.
“This splitting strategy was consistently applied throughout the modeling workflow, including during cross-validation and final model evaluation.”
This does not make sense to me. There have been no clear definitions of what constitutes the “cross-validation” and the “final model evaluation.”
“Cross-validation results showed a R2 of 0.60 and a RMSE of 1.1 nmol N2O m-2 s-1 on the validation set, while the training set showed a R 2 of 0.98 and a RMSE of 0.29 nmol N2O m -2 s-1. The final model, trained with early stopping (10 rounds) to prevent overfitting, achieved a R2 of 0.70 and a RMSE of 1.14 nmol N2O m-2 s-1 on the test set (Fig. A2).”
This passage invites more confusion. Beyond the unclarity regarding CV vs final evaluation, the authors describe model performance on “the test set,” “the validation set,” etc. Was there a singular test set, as the language here suggests, or are the authors instead referring to the average model performance on holdout data across the 5 or 10 folds?
L333-349: In contrast to the preceding passage, this description of SHAP values is exhaustive, and may be excessive.
L360-384: This passage is a lengthy textual description of figure 2, which I’m not sure is necessary given the length of the manuscript.
L452-453: The average background flux estimate of 0.44 nmol m⁻² s⁻¹ is equivalent to ~11 g N2O-N/ha/d, which is substantially (3-4 times) larger that globally estimated background emissions of ~2.5 to 3 g N2O-N/ha/d. This could be due to the reasons I explained above.
L462-466: “In contrast, the cover crop acted as a net CO 2 source, emitting 108 g CO2-C m-2”
The cover crop phase was only measured for 2 months, which was influenced by cultivation after wheat harvest. Is this the typical cover crop growth duration? If not, then this is misleading information that provides a snapshot of the rotation. Cover crop active growth phase is often known to sequester C (NEE sink) and negligible N2O emissions.
As the authors correctly point out in their discussion, it is not the cover crop per se that is acting as an emitter of CO2, but rather this is the response to harvest of the winter wheat and the bare soil. It is tricky, but I wonder if there is a better way to convey that the cover crop is actually reducing emissions, not magnifying them, even though these emissions are taking place during the cover crop’s establishment.
L480-483: In L452, the authors mentioned a background flux of 0.44 nmol m⁻² s⁻¹. This is confusing, as it seems that different subsets of the data are used in different places to estimate the background flux.
L486-509: SHAP values are best interpreted by relating the direction and magnitude of a feature's impact to the magnitude of the feature value itself. For example, the authors write: “N2O emissions were mainly suppressed by WFPS.” The statement that soil moisture had a large negative impact on flux predictions, as described here, leaves a lot of room for interpretation: were these negative influences associated with high or low soil moisture?
L510-515, Figure 5: The binning strategy used here to quantify the relationships between WFPS, GPP, and flux is statistically fragile. The bins impose artificial step functions on what is likely a linear relationship, and they reduce the number of data points available for statistical analysis. The approach can demonstrate that an interaction exists, but it does not quantify the strength or behavior of that interaction. A better approach might be a multiple linear regression with an interaction term. Partial R2 could also be used to determine more precisely the relative explanatory power of each factor and of the interaction.
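To make the suggestion concrete, here is a minimal sketch of such a model fitted by ordinary least squares, with a partial R² for the interaction term. The data below are synthetic and for illustration only; the variable names (wfps, gpp, flux) are placeholders, not the authors' measurements:

```python
# Alternative to binning: OLS regression of flux on WFPS, GPP, and their
# interaction, plus partial R^2 for the interaction term.
import numpy as np

rng = np.random.default_rng(0)
n = 300
wfps = rng.uniform(0.3, 0.9, n)   # water-filled pore space (fraction)
gpp = rng.uniform(0, 15, n)       # gross primary productivity proxy
# Simulated flux: GPP suppresses emissions more strongly at high WFPS.
flux = 5 * wfps - 0.2 * gpp - 0.4 * wfps * gpp + rng.normal(0, 0.3, n)

def r2(X, y):
    """Coefficient of determination of an OLS fit of y on X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return 1 - resid @ resid / ((y - y.mean()) @ (y - y.mean()))

ones = np.ones(n)
X_full = np.column_stack([ones, wfps, gpp, wfps * gpp])
X_red = np.column_stack([ones, wfps, gpp])   # main effects only

r2_full, r2_red = r2(X_full, flux), r2(X_red, flux)
# Partial R^2: variance explained by the interaction beyond the main effects.
partial_r2 = (r2_full - r2_red) / (1 - r2_red)
print(round(r2_full, 3), round(partial_r2, 3))
```

Unlike the bins, this quantifies both the sign and the strength of the WFPS × GPP interaction in a single coefficient, and the partial R² directly answers how much the interaction adds beyond the main effects.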
L559-560: True, but could this also be due to an increasing contribution of nitrification to the net N2O emissions? How confident can the authors be in the secondary vertical axis of Fig. 7B?
L565-566: Relatedly, could the increased role of nitrifier denitrification not also indicate a decreased overall rate of denitrifier denitrification?
L646-650: The N2O offset of net CO2 uptake is based on NEE, not on the actual long-term C sink, i.e., stabilized soil C. Not all of the NEE sink from a growing season translates into soil C gain that remains stabilized beyond the growing season. Studies using long-term experiments have shown much higher N2O offsets in fertilized systems in other regions (see Dhaliwal et al., 2025; doi.org/10.1002/jeq2.70046). The short-term nature of this study may therefore have underestimated the N2O offset.
L656-661: Mineral N sampling also showed moderately high N availability during this period, likely related to low plant uptake and to mineralization of residue and microbial biomass.
Section 4.4, L705-708: I will take this line as an opportunity to discuss the idea of synchronicity between N supply and demand as a key determinant of N2O emissions, and as a key idea in this manuscript. The authors place much significance on this idea throughout their introduction, their choice of methods, and the discussion here. In my view, the evidence for nitrogen synchronicity as the preeminent determinant of N2O emissions, rather than simply N availability, is a bit scattered. For example, Figure 4 excellently illustrates flux and GPP, yet mineral N availability is missing as the third key component needed to make this case. From Fig. 4 there certainly does seem to be an inverse relationship between flux and GPP, but I wonder to what extent GPP is also negatively correlated with N availability. Given the inverse relationship between GPP and mineral N (the nitrogen moves directly from the soil into the plant biomass), more evidence is needed to show that GPP negatively affects N2O emissions independently of soil N.
This leads me to my concern about the authors’ interpretation of GPP’s role in the SHAP analysis. Because soil mineral N concentrations were not included as input features in the Random Forest model, the model likely suffers from omitted variable bias. While 'days since fertilization' is included as a predictor, it is a rough, linear proxy that fails to capture the dynamic, non-linear depletion of the soil N pool. In the absence of a direct N-substrate variable, the machine learning algorithm may well use GPP as a better mathematical proxy for the diminishing N pool. Consequently, the high importance attributed to GPP in the SHAP analysis may simply reflect N exhaustion rather than a mechanistic suppression of N2O production by plant uptake (synchronicity). To truly substantiate the claim that synchronicity is key to emissions, the authors would ideally show that, for a given level of soil mineral N, higher GPP results in lower emissions. Without including measured soil N data in the modeling workflow, the current results show correlation with the crop’s growth phase, but do not sufficiently decouple plant demand from simple substrate limitation to support claims as strong as this one.
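The omitted-variable concern can be demonstrated with a toy example. In the sketch below, synthetic N2O flux is driven only by a declining soil mineral N pool, and a hypothetical GPP series merely tracks the same crop phenology; no mechanistic suppression by GPP is simulated. Yet a model that omits soil N assigns GPP a spuriously negative effect:

```python
# Toy illustration of omitted-variable bias: flux depends only on soil N,
# GPP is correlated with the season, and omitting soil N from the model
# makes GPP look like a suppressor of flux.
import numpy as np

rng = np.random.default_rng(1)
t = np.linspace(0, 1, 200)                          # normalized season time
soil_n = np.exp(-3 * t) + rng.normal(0, 0.02, 200)  # N pool depleted by uptake
gpp = 4 * t * (1 - t) + rng.normal(0, 0.05, 200)    # hump-shaped crop growth
flux = 2.0 * soil_n + rng.normal(0, 0.05, 200)      # N2O driven by N only

def ols(X, y):
    """OLS coefficients (intercept first) of y on the predictors in X."""
    design = np.column_stack([np.ones(len(y)), *X])
    return np.linalg.lstsq(design, y, rcond=None)[0]

beta_without_n = ols([gpp], flux)        # GPP-only model
beta_with_n = ols([gpp, soil_n], flux)   # model including soil N

print(beta_without_n[1])  # negative: GPP appears to "suppress" flux
print(beta_with_n[1])     # near zero once soil N is controlled for
```

This is of course only a caricature of the authors' workflow, but it shows why the negative SHAP contribution of GPP cannot, on its own, distinguish plant-uptake suppression from simple substrate exhaustion; including measured mineral N as a feature is the direct remedy.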