This work is distributed under the Creative Commons Attribution 4.0 License.
On the added value of sequential deep learning for upscaling evapotranspiration
Abstract. Estimating ecosystem-atmosphere fluxes such as evapotranspiration (ET) in a robust manner and at global scale remains a challenge. Machine learning (ML) based methods have shown promising results for such upscaling, providing a complementary methodology that is independent from process-based and semi-empirical approaches. However, the systematic evaluation of the skill and robustness of different ML approaches is an active field of research that requires further investigation. In particular, deep learning approaches in the time domain have not been explored systematically for this task.
In this study, we compared instantaneous (i.e., non-sequential) models, namely extreme gradient boosting (XGBoost) and a fully-connected neural network (FCN), with sequential models, namely a long short-term memory (LSTM) model and a temporal convolutional network (TCN), for the modeling and upscaling of ET. We assessed different types of covariates (meteorological, remote sensing, and plant functional types) and their impact on model performance at the site level in a cross-validation setup. For the upscaling from site level to global coverage, we fed the best-performing combination of covariates (meteorological and remote sensing observations) to the models using globally available gridded data. To evaluate and compare the robustness of the modeling approaches, we generated a cross-validation-based ensemble of upscaled ET, compared the ensemble mean and variance among models, and contrasted them with independent global ET data.
We found that the sequential models performed better than the instantaneous models (FCN and XGBoost) in cross-validation, although this advantage diminished with the inclusion of remote-sensing-based predictors. The generated patterns of global ET variability were highly consistent across all ML models overall. However, the temporal models yielded 6–9 % lower globally integrated ET than their non-temporal counterparts and than estimates from independent land surface models, likely due to their greater vulnerability to changes in the predictor distributions from the site-level training data to the global prediction data. In terms of global integrals, the neural network ensembles showed a sizable spread arising from the training data subsets, which exceeded the differences among the neural network variants. XGBoost showed a smaller ensemble spread than the neural networks, in particular when conditions were poorly represented in the training data.
Our findings highlight non-linear model responses to biases in the training data and underscore the need for improved upscaling methodologies, which could be achieved by increasing the amount and quality of training data or by extracting more targeted features that represent spatial variability. Approaches such as knowledge-guided ML, which encourages physically consistent results while harnessing the efficiency of ML, or transfer learning should be investigated. Deep learning for flux upscaling holds great promise, but remedies for its vulnerability to changes in the training data distribution, especially for sequential models, still require attention from the community.
Status: final response (author comments only)
RC1: 'Comment on egusphere-2024-2896', Simon Besnard, 07 Nov 2024
Overall impression
Kraft et al. comprehensively review the current state of evapotranspiration upscaling methods. The paper is well-written and includes excellent figures and discussions, making it a valuable contribution to Biogeosciences' scope. The experimental setup is thoughtfully constructed, though I have some feedback regarding the comparative design of the model configurations. Given the quality of the study, it could proceed with either minor or major revisions depending on whether the authors choose to incorporate additional model experiments as suggested.
Specific comments
L15: Consider adding quantitative information when discussing model performance differences between the sequential and non-sequential models to make the comparative strengths more transparent for readers.
Lag variables in non-sequential models: Have you considered explicitly adding lagged variables to the non-sequential models? This would offer a more balanced comparison, as it would simulate past dynamics without the complexity of models like LSTM. For example, incorporating lagged climate variables into XGBoost could provide insights into whether sequential models are uniquely beneficial in capturing temporal patterns.
Model selection – self-attention models: The paper mentions TCN and self-attention as alternatives to LSTM, yet only TCN was tested. Could you elaborate on why the self-attention models were excluded from the experiments?
L127-129: Precipitation appears to be absent from the meteorological variables. Was there a reason for this omission? Precipitation likely impacts ET indirectly through soil moisture, which could be a significant predictor in capturing temporal dynamics.
L129: For clarity, it would help readers if you briefly explained the significance of the time derivative of potential shortwave irradiation and what processes it represents within the context of ET.
L175: The statement "The remote sensing and PFT covariates were repeated in time to obtain uniform inputs" could be clarified. Does this mean daily remote sensing data were kept constant at a sub-daily scale? If so, it’s worth discussing if the model accounts for the sub-daily variations, as metrics like LST and NDWI are not entirely invariant within a day.
L204-205: Were the same hyperparameters used for each fold in cross-validation? This clarification would help assess whether the variation within model ensembles arises from differences in training data subsets or distinct hyperparameter settings.
Fig 4 – PFT impact on TCN and LSTM models: In Figure 4, adding PFT as a predictor seems to penalize TCN performance, yet it enhances LSTM’s accuracy in capturing interannual variability. Could you discuss this divergence?
Fig 4 – Model sensitivity to hyperparameters: Are the displayed results limited to the best models, or could the performance of the other 19 models (with variation bars) be included to show the sensitivity to hyperparameter choices?
Fig 4—mean-site results: Presenting model performance metrics related to spatial variability, such as the mean site performance, could be informative.
L254-255: While PFT doesn’t enhance site-level predictions, could it mitigate extrapolation errors during upscaling? Including this consideration in the discussion of the scaling-up section may add valuable insight. Or would the spread for the sequential model change with or without PFT?
L286-288: Can you test the hypothesis about observation biases with synthetic data or a process-based model simulating extreme events and disturbances? This might strengthen the argument about model vulnerability to changes in predictor distributions.
L410-411: Jung et al. (2020) introduced an extrapolation index that might be useful here. Plotting the model spread against this index could demonstrate that model uncertainty correlates with areas requiring more extrapolation, supporting your discussion points.
References:
Jung, Martin, et al. "Scaling carbon fluxes from eddy covariance sites to globe: synthesis and evaluation of the FLUXCOM approach." Biogeosciences 17.5 (2020): 1343-1365.
Citation: https://doi.org/10.5194/egusphere-2024-2896-RC1
AC1: 'Reply on RC1', Basil Kraft, 11 Feb 2025
Dear Reviewer,
Thank you for your thoughtful and constructive feedback on our manuscript. We are pleased that you found the study well-written, with valuable figures and discussions, and we appreciate your recognition of its contribution to the field.
We have carefully considered your suggestion to perform additional analyses, such as using synthetic data to explore observational biases, evaluating model uncertainty at the site level, testing an attention-based model, incorporating lagged features, and adding precipitation as an input feature. While these are all interesting avenues for further research, our study primarily focuses on assessing the efficacy of sequential models for flux upscaling and emphasizing the need for comprehensive experimental setups to identify "best practices" in this domain.
We believe the current analysis addresses our initial research questions effectively and highlights key challenges in ecosystem-atmosphere flux upscaling. Regarding your suggestion about model uncertainty, we agree that the site-level model comparison in Figure 4 would benefit from its inclusion. We have since re-run cross-validation for the top six models from hyperparameter tuning, and will update the figure to reflect model uncertainty (this addition is also presented below).
As for incorporating precipitation as an additional input, we encountered significant challenges. The site-level precipitation data have large gaps, and our down-scaling approach for other meteorological variables is not applicable to precipitation. Using ERA precipitation is also not ideal due to the stochastic nature of precipitation. We believe that the downscaling of precipitation needs further investigation before drawing conclusions, as relying on ERA’s nearest neighbor could lead to misleading results. This method might suggest that precipitation is insignificant when, in fact, the data may not accurately reflect local site conditions. Nonetheless, we would be happy to include this analysis if requested by the Editor.
Below, we provide a more detailed point-by-point response to fully address your suggestions and clarify how they align with or extend the scope of our study.
We sincerely appreciate your understanding and thoughtful approach. Should the manuscript proceed, we will include a comprehensive response to all comments and highlight any changes made in response to your valuable feedback.
Once again, thank you for your insightful input.
Best regards,
Basil Kraft and co-authors
Response to specific comments
L15: Consider adding quantitative information when discussing model performance differences between the sequential and non-sequential models to make the comparative strengths more transparent for readers.
We agree with this suggestion and will include quantitative information on model performance in the abstract to better highlight the comparative strengths of the models.
Lag variables in non-sequential models: Have you considered explicitly adding lagged variables to the non-sequential models? This would offer a more balanced comparison, as it would simulate past dynamics without the complexity of models like LSTM. For example, incorporating lagged climate variables into XGBoost could provide insights into whether sequential models are uniquely beneficial in capturing temporal patterns.
We intentionally did not include lagged variables, as not having to deal with the manual feature crafting is one of the main reasons for using sequential models. Instead, we used proxy variables for system state, such as vegetation indices. While these cannot directly replace lagged meteorological variables, they offer a simpler experimental setup. Designing lagged features for non-sequential models, as highlighted in studies like Tramontana et al. (2016; https://doi.org/10.5194/bg-13-4291-2016), is complex and presents its own challenges.
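For illustration only, the sketch below shows what such manual feature crafting for a non-sequential model could look like; the variable names and lag choices are hypothetical and this is not part of our experiments.

```python
import pandas as pd

def add_lagged_features(df, columns, lags):
    """Append lagged copies of selected columns to a site-level time series.

    df      : DataFrame indexed by time for one site
    columns : meteorological variables to lag, e.g. ["tair", "vpd", "rad"] (placeholders)
    lags    : lag offsets in time steps, e.g. [1, 7, 30] (placeholders)
    """
    out = df.copy()
    for col in columns:
        for lag in lags:
            out[f"{col}_lag{lag}"] = df[col].shift(lag)
    # Rows at the start of the record have no lagged values; drop them.
    return out.dropna()

# Hypothetical usage with a tabular model such as XGBoost:
# features = add_lagged_features(site_df, ["tair", "vpd", "rad"], [1, 7, 30])
# model = xgboost.XGBRegressor().fit(features.drop(columns="ET"), features["ET"])
```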
Model selection – self-attention models: The paper mentions TCN and self-attention as alternatives to LSTM, yet only TCN was tested. Could you elaborate on why the self-attention models were excluded from the experiments?
Based on our experience, attention-based models typically require more training data than what is available in our case, leading us to expect a lower chance of successful training. There are numerous sequential deep learning models (e.g., gated recurrent units (GRU), attention-based models, convolution-based architectures). We chose LSTM and TCN as representative examples. We will clarify this in the revised manuscript.
L127-129: Precipitation appears to be absent from the meteorological variables. Was there a reason for this omission? Precipitation likely impacts ET indirectly through soil moisture, which could be a significant predictor in capturing temporal dynamics.
We agree that precipitation has a significant lagged effect on ET, and its omission is a point that needs further discussion. We will provide additional details in the revised manuscript. While we believe that adding precipitation may improve model performance, particularly for temporal models, downscaling precipitation to the site level remains a significant challenge. ERA observations are suboptimal due to the stochastic nature of precipitation, and this could affect model performance. Given the uncertainties, we are cautious about misinterpreting results. However, we are open to running an additional model configuration with precipitation, should the Editor request it.
L129: For clarity, it would help readers if you briefly explained the significance of the time derivative of potential shortwave irradiation and what processes it represents within the context of ET.
We agree that this information would be helpful. The time derivative is used to help non-temporal models distinguish between morning and afternoon conditions. We will clarify this aspect in the revised version of the manuscript.
L175: The statement "The remote sensing and PFT covariates were repeated in time to obtain uniform inputs" could be clarified. Does this mean daily remote sensing data were kept constant at a sub-daily scale? If so, it’s worth discussing if the model accounts for the sub-daily variations, as metrics like LST and NDWI are not entirely invariant within a day.
Yes, remote sensing variables were kept constant at a sub-daily scale because these variables do not provide diurnal observations. Diurnal variations are driven primarily by hourly-varying meteorological variables, though they interact with satellite-based features that change only on a daily to weekly basis. We will add this clarification in the manuscript.
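As a minimal sketch of what this repetition amounts to (synthetic data and illustrative variable names, not our actual preprocessing code):

```python
import numpy as np
import pandas as pd

# Illustrative synthetic data: hourly meteorology and daily remote sensing for one site.
hours = pd.date_range("2020-06-01", periods=48, freq="h")
met_hourly = pd.DataFrame({"tair": np.random.rand(48)}, index=hours)

days = pd.date_range("2020-06-01", periods=2, freq="D")
rs_daily = pd.DataFrame({"ndvi": [0.61, 0.63]}, index=days)

# Repeat each daily value across the hours of its day by reindexing to the
# hourly time axis and forward-filling, then merge with the meteorology.
rs_hourly = rs_daily.reindex(met_hourly.index, method="ffill")
inputs = pd.concat([met_hourly, rs_hourly], axis=1)
```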
L204-205: Were the same hyperparameters used for each fold in cross-validation? This clarification would help assess whether the variation within model ensembles arises from differences in training data subsets or distinct hyperparameter settings.
Yes, hyperparameters were tuned once for each model and setup, and they remained constant throughout cross-validation. Thus, the variability within the model ensemble is attributed to differences in the training data and model optimization, rather than variations in hyperparameter settings. We will clarify this point in the revised manuscript.
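As an illustration of this setup, here is a generic sketch with a stand-in regressor and synthetic data (not our actual pipeline): hyperparameters are fixed once, and only the training subset changes across folds.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(800, 5))                      # synthetic predictors
y = 2 * X[:, 0] + rng.normal(scale=0.1, size=800)  # synthetic target
fold_id = rng.integers(0, 8, size=800)             # stand-in for site-based folds

# Hyperparameters tuned once, then held constant for every fold.
best_hparams = {"n_estimators": 200, "max_depth": 3}

fold_models = []
for fold in range(8):
    train = fold_id != fold
    model = GradientBoostingRegressor(**best_hparams).fit(X[train], y[train])
    fold_models.append(model)
# Differences among fold_models now stem from the training subsets (and model
# optimization), not from hyperparameter settings.
```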
Fig 4 – PFT impact on TCN and LSTM models: In Figure 4, adding PFT as a predictor seems to penalize TCN performance, yet it enhances LSTM’s accuracy in capturing interannual variability. Could you discuss this divergence?
Considering the scale of the IAV differences (the LSTM improves by around 0.05 NSE and the TCN decreases by around 0.01 NSE), this could just be noise from model initialization. Similarly, other setups and scales show small variability that we cannot explain, for example the TCN having the same performance for MET and MET+RS on the raw time scale while the LSTM shows an improvement. We argue that these minor differences should not be over-interpreted, and we will expand on this in the manuscript. We have updated Figure 4, as you suggested in the next comment, and the small differences are indeed due to noise.
Fig 4 – Model sensitivity to hyperparameters: Are the displayed results limited to the best models, or could the performance of the other 19 models (with variation bars) be included to show the sensitivity to hyperparameter choices?
Figure 4 displays the test-set performance of the best-performing hyperparameter set per model. We acknowledge that the aspect of model uncertainty needs to be addressed and we appreciate your suggestion. We ran the cross-validation for the six best models and updated the figure. The updated figure (see below) shows the minimum to maximum performance across the six model runs, and the line represents the best model based on the MSE loss. Note that we do not use all models from hyperparameter tuning, as some models did not converge. From this updated figure, we learn several things, which we will include in the results and discussion sections: 1) as expected, the small differences in interannual variability (linking to the previous question) are due to noise; 2) the XGBoost model is extremely robust; 3) the sequential neural networks robustly outperform the non-sequential models on the anomaly scale.
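For clarity on how such an uncertainty band can be derived, here is a generic sketch using the Nash-Sutcliffe efficiency (NSE) with synthetic stand-ins for the six cross-validated runs; it is not our plotting code.

```python
import numpy as np

def nse(obs, pred):
    # Nash-Sutcliffe efficiency: 1 is a perfect fit, 0 matches the observation mean.
    return 1.0 - np.sum((obs - pred) ** 2) / np.sum((obs - obs.mean()) ** 2)

rng = np.random.default_rng(0)
obs = rng.normal(size=1000)
# Six stand-in model runs with increasing error levels.
runs = [obs + rng.normal(scale=s, size=1000) for s in (0.30, 0.35, 0.40, 0.45, 0.50, 0.55)]

scores = np.array([nse(obs, p) for p in runs])
band = (scores.min(), scores.max())  # shown as the min-to-max range across runs
```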
Fig 4—mean-site results: Presenting model performance metrics related to spatial variability, such as the mean site performance, could be informative.
We have added spatial scale to Figure 4 (see the response to the previous comment) and will include a discussion of these new results in the manuscript.
L254-255: While PFT doesn’t enhance site-level predictions, could it mitigate extrapolation errors during upscaling? Including this consideration in the discussion of the scaling-up section may add valuable insight. Or would the spread for the sequential model change with or without PFT?
Our results at the site level suggest that PFTs provide little to no additional value. While there may be minimal improvements on some scales, the potential negative effects, such as unforeseen impacts on upscaling due to the large leverage of PFTs and limited training sites, prevail. We advocate for exploring continuous features (such as plant traits or soil properties) in future studies. This point is discussed in lines 354-363 and 439-444 in the manuscript.
L286-288: Can you test the hypothesis about observation biases with synthetic data or a process-based model simulating extreme events and disturbances? This might strengthen the argument about model vulnerability to changes in predictor distributions.
We agree with the reviewer that testing this hypothesis with synthetic data or a process-based model could be valuable. However, this is not a primary focus of the study and would require substantial experimentation. Therefore, we will revise our "hypothesis" and present it as a potential explanation rather than a tested claim.
L410-411: Jung et al. (2020) introduced an extrapolation index that might be useful here. Plotting the model spread against this index could demonstrate that model uncertainty correlates with areas requiring more extrapolation, supporting your discussion points.
Unfortunately, applying the method introduced by Jung et al. (2020) is computationally too demanding for our setting (hourly, 0.05-degree) due to the need to identify nearest neighbors in the training data for each predicted data point. Adapting this method for large datasets would require significant development and would constitute a separate study. Therefore, we are unable to implement it within the scope of this work.
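For reference, the core operation behind such an extrapolation measure can be sketched as a nearest-neighbor query in (standardized) predictor space; the example below uses synthetic data and is illustrative only. The cost grows with the number of prediction points, which is what becomes prohibitive for an hourly, 0.05-degree global grid.

```python
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
X_train = rng.normal(size=(5000, 10))    # stand-in for site-level predictor samples
X_pred = rng.normal(size=(100000, 10))   # small stand-in for gridded prediction points

tree = cKDTree(X_train)
nn_dist, _ = tree.query(X_pred, k=1)     # larger distance = more extrapolation required
```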
Citation: https://doi.org/10.5194/egusphere-2024-2896-AC1
RC2: 'Comment on egusphere-2024-2896', Wenli Zhao, 26 Dec 2024
Basil et al. systematically evaluate both non-sequential and sequential machine learning methods for upscaling evapotranspiration (ET) fluxes. They conduct extensive experiments comparing these approaches in predicting ET, providing robust evidence to support their findings. The authors have clearly invested significant effort into their study, and their conclusions are convincing, with the potential to benefit the scientific community and inspire further research. I enjoyed reading the manuscript and recommend acceptance with minor revisions.
Line 127–129: Could the authors further clarify how meteorological variable combinations are defined? For example, I noticed that precipitation, which often strongly influences ET, is not included as an input feature in their experiments. Could the authors elaborate this?
Line 129: How did you calculate the time-derivative of potential shortwave irradiation? Please provide more details.
Line 136: You mentioned each remote sensing product was interpolated to a daily resolution. Did you use nearest-neighbor interpolation, or another method? Please clarify.
Line 184: The phrase “clustering of coordinates” needs further clarification. It would be helpful to include a table (list of sitenames) summarizing the final results of the 8-fold site clustering in the supplementals.
Line 191–193: Do you mean that observations from only two random years for each site were selected and used for training, validation, and testing?
Line 294–295: If there are discrepancies between EC site observations and the reanalysis dataset, would it be more effective to build a functional relationship between the reanalysis dataset (e.g., extracted ERA5 values at the site level) and ET observations directly, rather than using site-level functional relationships for upscaling, similar to the methodology in Nathaniel (2023)? The authors could consider adding a discussion on this point.
Line 376–378: The sentence, “In summary, the sequential models did, when trained on the same subsets of sites, not behave similar in terms of global annual ET. Therefore, it seems unreasonable to assume that the lower global ET estimated by the sequential neural networks is due to a better (and hence more consistent) representation of the processes,” appears ambiguous and redundant. Please revise for clarity and conciseness.
Minor typos:
Line 283: “PFTs.”
Line 316: “patterns”
Citation: https://doi.org/10.5194/egusphere-2024-2896-RC2
AC2: 'Reply on RC2', Basil Kraft, 11 Feb 2025
Dear Reviewer,
Thank you for your thoughtful review and constructive feedback! It’s great to hear that you found our manuscript interesting and see its potential value to the scientific community. Your specific comments and suggestions are very helpful, and we appreciate your time and effort in providing them.
Below, we address your questions and concerns. If the manuscript is accepted and proceeds to the review phase, we will provide more detailed responses, including specific changes made to address your comments.
Thanks again for your valuable feedback,
Basil Kraft and co-authors
Response to specific comments
Line 127–129: Could the authors further clarify how meteorological variable combinations are defined? For example, I noticed that precipitation, which often strongly influences ET, is not included as an input feature in their experiments. Could the authors elaborate this?
We agree that precipitation has a significant effect on ET, and its omission is a point that needs further discussion. We will provide additional details in the revised manuscript. While we believe that adding precipitation may improve model performance, particularly for temporal models, downscaling precipitation to the site level remains a significant challenge. ERA observations are suboptimal due to the stochastic nature of precipitation, and this could affect model performance. Given the uncertainties, we are cautious about misinterpreting results. However, we are open to running an additional model configuration with precipitation, should the Editor request it.
Line 129: How did you calculate the time-derivative of potential shortwave irradiation? Please provide more details.
The time-derivative is calculated as the difference between potential shortwave irradiation values for two consecutive hours, based on the time and location. We will include this information in the revised manuscript.
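A minimal, self-contained sketch of this computation is given below; it uses standard solar-geometry approximations for the potential (top-of-atmosphere) shortwave irradiation, and our actual implementation may differ in details.

```python
import numpy as np
import pandas as pd

def potential_sw(times_utc, lat_deg, lon_deg, solar_constant=1361.0):
    # Top-of-atmosphere shortwave irradiation (W m-2) from solar geometry,
    # using simple textbook approximations for declination and hour angle.
    doy = times_utc.dayofyear.values
    decl = np.deg2rad(23.45) * np.sin(2 * np.pi * (284 + doy) / 365.0)
    solar_time = times_utc.hour.values + times_utc.minute.values / 60.0 + lon_deg / 15.0
    hour_angle = np.deg2rad(15.0 * (solar_time - 12.0))
    lat = np.deg2rad(lat_deg)
    cos_zenith = (np.sin(lat) * np.sin(decl)
                  + np.cos(lat) * np.cos(decl) * np.cos(hour_angle))
    return solar_constant * np.clip(cos_zenith, 0.0, None)

times = pd.date_range("2020-06-01", periods=48, freq="h", tz="UTC")
sw_pot = potential_sw(times, lat_deg=50.0, lon_deg=10.0)

# Time derivative as the difference between consecutive hourly values:
# positive while irradiation increases (morning), negative in the afternoon.
d_sw_pot = np.diff(sw_pot, prepend=sw_pot[0])
```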
Line 136: You mentioned each remote sensing product was interpolated to a daily resolution. Did you use nearest-neighbor interpolation, or another method? Please clarify.
This statement is misleading, and we will rephrase it for clarity. Although the MCD43A4 product for the reflectances uses observations from a period of 16 days to characterize and invert the bidirectional reflectance distribution function of a given pixel for the day at the center of the period, this operation is done over a temporally moving window at daily timesteps, resulting in output data with daily frequency. Therefore, no interpolation was required.
Line 184: The phrase “clustering of coordinates” needs further clarification. It would be helpful to include a table (list of sitenames) summarizing the final results of the 8-fold site clustering in the supplementals.
The clustering was performed randomly, but sites within a 0.05-degree threshold were grouped into the same cluster. We will provide a more detailed description of this approach in the revised manuscript and include a table summarizing the 8-fold site clustering results.
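A minimal sketch of this grouping step is shown below, with hypothetical coordinates; the exact distance criterion and random fold assignment in our code may differ in details.

```python
import numpy as np

def cluster_sites(coords, threshold=0.05):
    # Group sites whose latitude and longitude both lie within `threshold`
    # degrees of another site into the same cluster (simple union-find).
    n = len(coords)
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if np.max(np.abs(np.asarray(coords[i]) - np.asarray(coords[j]))) <= threshold:
                parent[find(i)] = find(j)
    return [find(i) for i in range(n)]

# Hypothetical usage: whole clusters (not individual sites) are assigned to folds.
coords = [(50.01, 10.00), (50.02, 10.01), (45.30, 2.10)]  # placeholder site coordinates
labels = cluster_sites(coords)
rng = np.random.default_rng(42)
fold_of_cluster = {c: int(rng.integers(0, 8)) for c in set(labels)}
site_folds = [fold_of_cluster[c] for c in labels]
```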
Line 191–193: Do you mean that observations from only two random years for each site were selected and used for training, validation, and testing?
We used the full available sequence of observations for training. However, for each training iteration (i.e., each minibatch), a consecutive two-year period was randomly selected from the available data. We will clarify this in the revised manuscript.
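A minimal sketch of this sampling scheme follows, using a synthetic daily time index as a placeholder; our actual implementation operates within the training loop and may differ in details.

```python
import numpy as np
import pandas as pd

def sample_two_year_window(time_index, rng, years=2):
    # Pick a random consecutive `years`-long window from a site's time index.
    window = pd.Timedelta(days=365 * years)
    fits = np.flatnonzero(time_index <= time_index[-1] - window)  # valid start positions
    start = time_index[rng.choice(fits)]
    return time_index[(time_index >= start) & (time_index < start + window)]

rng = np.random.default_rng(0)
full_index = pd.date_range("2005-01-01", "2014-12-31", freq="D")  # placeholder site record
window_index = sample_two_year_window(full_index, rng)  # drawn anew for each minibatch
```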
Line 294–295: If there are discrepancies between EC site observations and the reanalysis dataset, would it be more effective to build a functional relationship between the reanalysis dataset (e.g., extracted ERA5 values at the site level) and ET observations directly, rather than using site-level functional relationships for upscaling? similar to the methodology in Nathaniel (2023). The authors could consider adding a discussion on this point.
Eddy-covariance measurements of land-atmosphere fluxes are unique and rich in information. The processes involved are highly dependent on local conditions, such as land cover (which can be partially captured by remote sensing features) and weather. By relying on ERA5 features, we would lose access to some of this richness, introducing potential biases and missing high-frequency information from the observations. However, using the same data source for both model training and upscaling could reduce biases, which may offer some advantage. While this is an interesting point, it is beyond the scope of this study. As you suggest, we will elaborate on our choice to use local meteorological observations for model training in the revised manuscript.
Line 376–378: The sentence, “In summary, the sequential models did, when trained on the same subsets of sites, not behave similar in terms of global annual ET. Therefore, it seems unreasonable to assume that the lower global ET estimated by the sequential neural networks is due to a better (and hence more consistent) representation of the processes,” appears ambiguous and redundant. Please revise for clarity and conciseness.
Thank you for your feedback. We will revise this sentence for clarity and conciseness. Our intention is to convey that the temporal models exhibit a small systematic bias, likely because they are more sensitive to shifts in the data source. We will ensure this is more clearly stated in the revised version.
Citation: https://doi.org/10.5194/egusphere-2024-2896-AC2