This work is distributed under the Creative Commons Attribution 4.0 License.
Potential of machine learning techniques compared to the MIKE-SHE model for drain flow predictions in tile-drained agricultural areas of Denmark
Abstract. Temporal drain flow dynamics and an understanding of their underlying controlling factors are important for water resource management in tile-drained agricultural areas. The use of physics-based water flow models to understand tile-drained systems is common, but these models are complex, have large parameter sets, and require high computational effort. The primary goal of this study was to examine whether simpler, more efficient machine learning (ML) models can provide acceptable solutions.
The specific aim of our study was to assess the potential of ML tools for predicting drain flow time series in multiple catchments subject to a range of climatic and landscape conditions. The investigation is based on a unique dataset containing time series of daily drain flow at multiple field-scale drain sites in Denmark. The data include climate variables (precipitation, potential evapotranspiration, temperature); geological properties (clay fraction, first sand layer thickness, first clay layer thickness); and topographical indices (curvature, topographical wetness index, topographical position index, elevation). Both static and dynamic variables are used in the prediction of drain flows. The ML algorithms extreme gradient boosting (XGBoost) and convolutional neural networks (CNN) were examined, and the results were compared with a physics-based distributed model (MIKE-SHE).
The results show that XGBoost performs similarly to the physics-based MIKE-SHE models, and both outperform the CNN. Both ML models required significantly less effort to build, train, and run than MIKE-SHE. In addition, the ML models support efficient feature importance analysis, which showed that climatic variables were important for both the CNN and XGBoost models. The results support the use of ML models for hydrologic applications when sufficient training data are available. Further, the insights offered by the feature importance analysis may guide further data collection and the development of physics-based models where existing data are insufficient to support ML approaches.
Withdrawal notice
This preprint has been withdrawn.
Interactive discussion
Status: closed
RC1: 'Comment on egusphere-2023-1872', Anonymous Referee #1, 10 Oct 2023
Summary: This study uses two machine learning techniques and one physics-based model to simulate drain flows in tile-drained agricultural areas. The study indicates that one ML method outperforms the other and suggests that the better-performing ML method performs similarly to the physics-based model. It then goes on to recommend further development of ML methods given their ease of construction and positive performance relative to physics-based models.
Recommendation: While ML techniques undoubtedly hold promise in the prediction of complicated systems, at present, this study does not rise to the level of instilling confidence in their use. The authors choose to compare their ML methods to a physics-based model that does not itself perform particularly well, particularly at the tails. By choosing a rather low target of performance, and somewhat meeting it, the authors argue for the promise of ML given its ease of development and use. Given the single low performance threshold, I suggest the authors (a) compare their ML output to other, better-performing physics-based models and (b) given the ease of development and use of ML techniques, spend some time crafting an approach that performs more admirably, i.e., one that could have more utility in an operational sense.
Suggestions and criticisms:
One overarching criticism is that the piece takes an editorial tone, arguing for the self-evident value of ML approaches given their ease of use. This point is repeated throughout, and it is largely based on ease of use and development rather than on performance. The authors' argument would be more convincing and require less editorializing if their ML results spoke for themselves. Additionally, there is some drift in the stated opinion regarding the performance of the ML techniques: it is sometimes quite confident in their promise, and at other times more subdued. I think the authors would be better served if they restricted their value judgements to the discussion section and remained consistent in their assessment.
There is some mixed messaging regarding the goal of this study. Lines 46-47 state: "The question addressed in this study was whether ML models, which are also generally considered to be correlative, are more efficient than either distributed or lumped physics-based models." Whereas lines 67-68 state something else as the objective. Comparisons to lumped physics-based models are not conducted, nor are efficiencies assessed.
From the start, it can be challenging to understand the system that is being simulated. I think a cartoon schematic would help.
What is clustered? Where are the clusters? How were the clusters determined? How do the physical processes interact with the clusters?
There is very little written about MIKE-SHE, but from what is provided it appears to be a poorly calibrated model that does not provide particularly useful predictions. The authors list a number of other tile-focused physics-based models. Perhaps there is a better standard the authors could compare their ML techniques to? I'm afraid MIKE-SHE sets a low bar which falsely instills nominal confidence in the ML approaches.
Indeed, it would be helpful to be able to understand MIKE-SHE without consulting a different publication. As written, the methods related to MIKE-SHE in this paper do not allow it to stand on its own. How was MIKE-SHE calibrated? Was the calibration optimized?
The authors indicate on Lines 254-261 that the comparison they have devised is unfair. Since they are devising the comparison, why not make it fair?
Line 123: “This set was reduced to 19 by developing a covariance matrix and removing any feature that had a Pearson R value above 0.85 with any other feature.” If 2 variables are highly correlated, how do you decide which to keep?
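Since the referee asks how ties are broken, here is a minimal Python sketch (an illustration, not the authors' code) of one common convention for this pruning step: when a pair of features exceeds the threshold, drop the one with the higher mean absolute correlation to all other features.

```python
# Correlation-based feature pruning with an explicit tie-breaking rule
# (assumed, not documented in the manuscript): drop the feature that is,
# on average, more redundant with the rest of the set.
import pandas as pd

def prune_correlated_features(X: pd.DataFrame, threshold: float = 0.85) -> pd.DataFrame:
    corr = X.corr().abs()          # absolute pairwise Pearson R
    mean_corr = corr.mean()        # mean absolute correlation per feature
    to_drop = set()
    cols = list(X.columns)
    for i, a in enumerate(cols):
        for b in cols[i + 1:]:
            if a in to_drop or b in to_drop:
                continue
            if corr.loc[a, b] > threshold:
                to_drop.add(a if mean_corr[a] > mean_corr[b] else b)
    return X.drop(columns=sorted(to_drop))
```

Whatever rule the authors actually used, stating it explicitly would make the feature selection reproducible.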
L108: Which variables are at 10 km resolution and which at 20 km? Without this information, one could not replicate the analysis.
L153: “multiple” means what exactly?
The feature importance analysis is important for understanding which factors contribute the most, and it can be used to guide data collection and eventually reduce the data requirements of ML models. I'd be curious to see a sensitivity analysis built on the feature importance analysis that shows model performance when including different subsets of the most important predictors, along the lines of the sketch below.
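A minimal sketch of such a sensitivity analysis, using synthetic stand-in data (the names, feature count, and dimensions are illustrative, not the authors'): rank features by importance from a full model, then retrain on the top-k features and track test performance.

```python
# Feature-subset sensitivity analysis: performance vs. number of
# top-ranked features retained. Synthetic data stands in for the
# drain flow dataset.
import numpy as np
import pandas as pd
from xgboost import XGBRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(500, 19)), columns=[f"f{i}" for i in range(19)])
y = 2 * X["f0"] + X["f1"] - 0.5 * X["f2"] + rng.normal(scale=0.5, size=500)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
base = XGBRegressor(n_estimators=200).fit(X_train, y_train)
ranked = np.argsort(base.feature_importances_)[::-1]  # most important first

for k in (3, 5, 10, X.shape[1]):
    cols = X.columns[ranked[:k]]
    m = XGBRegressor(n_estimators=200).fit(X_train[cols], y_train)
    print(f"top {k:>2} features: R2 = {r2_score(y_test, m.predict(X_test[cols])):.3f}")
```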
Figure 10 does not provide sufficient information for interpretation.
Citation: https://doi.org/10.5194/egusphere-2023-1872-RC1
CC1: 'Review, Muhammad Ikhsan & Prof. Jos van Dam, Wageningen University & Research', Muhammad Ikhsan, 31 Oct 2023
This review was prepared as part of graduate program coursework at Wageningen University and was produced under the supervision of Prof. Jos van Dam. The review has been posted because of its good quality and its likely usefulness to the authors and editor. The journal did not solicit this review.
The study discusses how data-driven approaches compare to physics-based modeling in predicting daily drain flow in agricultural areas in Denmark. The aim of the research was to develop machine learning models with drain flow prediction capabilities comparable or superior to those of physics-based models, while requiring fewer parameters and less computational effort. The authors also emphasize the transferability of the machine learning methods to other catchment regions. The results show that none of the models performed strongly in predicting the observed drain flows. Climatic variables were found to be the most important drivers, ahead of the geological and topographical properties. The authors propose more extensive variables and physics-guided machine learning for future studies, because the models were not yet transferable.
The aim and objective of this study are relatively novel and reflect the current state of technology, creating an exchange between data-driven and physics-based models. Data-driven models could provide insights into what is missing or left out in physics-based models. This study also matches the scope of Hydrology and Earth System Sciences (HESS), as it aims to improve hydrological modeling, specifically subsurface drainage prediction using data-driven modeling. In my opinion, there is still much to learn about the potential of machine learning and data-driven models in this rapidly changing global environment. The research is interesting since it also presents the possibility of using machine learning to simplify hydrological models, which would improve the monitoring and prediction of hydrological fluxes in the future.
The data and methodology given in the paper are well presented. As a reader, I could quickly grasp what the authors were doing. The authors also employed several evaluation metrics for the machine learning models, namely R², KGE (Kling-Gupta Efficiency), and PBIAS (Percent Bias); together, these metrics give a full picture of how each model performs. In addition, the training and test cycles for the machine learning models were run with cross-validation to minimize overfitting, and the spatial validation proposed by Bjerre et al. (2022) was employed in the data-driven modeling.
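For reference, minimal Python implementations of the two less common metrics, using their standard definitions (sign conventions for PBIAS vary between sources; the one used here is noted in the comment):

```python
import numpy as np

def kge(obs: np.ndarray, sim: np.ndarray) -> float:
    # Kling-Gupta Efficiency: 1 - sqrt((r-1)^2 + (alpha-1)^2 + (beta-1)^2)
    r = np.corrcoef(obs, sim)[0, 1]       # linear correlation
    alpha = np.std(sim) / np.std(obs)     # variability ratio
    beta = np.mean(sim) / np.mean(obs)    # bias ratio
    return 1.0 - np.sqrt((r - 1) ** 2 + (alpha - 1) ** 2 + (beta - 1) ** 2)

def pbias(obs: np.ndarray, sim: np.ndarray) -> float:
    # Percent bias; with this convention, positive values mean overestimation.
    return 100.0 * np.sum(sim - obs) / np.sum(obs)

obs = np.array([1.0, 2.0, 3.0, 4.0])
sim = np.array([1.1, 1.9, 3.2, 3.8])
print(kge(obs, sim), pbias(obs, sim))
```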
However, the major issue with this research is that the results do not answer the research questions sufficiently. This is due to the low performance of both architectures used by the authors: the Convolutional Neural Network (CNN) and gradient boosting (XGBoost) models give correlation coefficients ranging between 0.26 and 0.53. The physics-based MIKE-SHE model performs about the same, with a correlation coefficient of 0.34. This makes the transferability of the machine learning models still questionable, yet the authors claim in the conclusions that the machine learning models could replace traditional physics-based models. I think the authors should provide more evidence to support this claim.
Furthermore, I recommend that the editor ensure the authors reexamine a few points in the manuscript, such as the variability in the explanatory variables, the peak discharge replication process, and the underlying rationale for selecting the data-driven models. The central concern behind these points is uncertainty quantification: the discharge replications performed by the authors will alter the spatiotemporal characteristics of the data, producing questionable results that are not in line with the scope of HESS. Moreover, to simplify physics-based models, one must consider the uncertainties of data-driven models and guarantee reproducible results in different catchment areas by reducing the number of ad hoc procedures. Uncertainties in data-driven modeling stem from both the data and the model (Lang et al., 2023). I elaborate on these points in the following paragraphs. If the authors give them additional thought, it will increase the usefulness of this study.
Major issues:
The first issue with the paper is that downscaling approaches are mentioned in line 96, but their role in the rest of the manuscript is unclear. As a reader, I want to know where downscaling was needed and to which data it was applied. The downscaling approaches need to be elaborated, because climate studies that rely on a single downscaling approach should be interpreted with caution (Chen et al., 2011). The authors should also state whether any resampling techniques were used. Inaccuracies in downscaling and resampling methods propagate to the output and cause inaccuracies in the hydrological models (Wu et al., 2008).
The paper cited by the authors, Motarjemi et al. (2021), also suggests including the variability of all predictor variables to obtain high-quality prediction models. By documenting the downscaling and resampling methods used for the data in Table 2 (line 104), the authors could therefore elaborate on the uncertainties in the data that propagate to the model output. Lang et al. (2023) attribute uncertainties in data-driven models to two sources: the data (aleatoric) and the model (epistemic). Downscaling and resampling belong to the data uncertainties; model uncertainties, meanwhile, relate to differences in the initial values of the random number generator or in the hyperparameters. There is ongoing research on automatic hyperparameter optimization, for example the study by Zela et al. (2018). If the authors used manual hyperparameter optimization, the choices may be suboptimal, so the sensitivity to hyperparameters should also be quantified.
I suggest the authors describe the data preparation methods and quantify how sensitive the model output is to changes in the data. This could be done by elaborating on the downscaling and resampling approaches in Section 2.2. Beyond that, both data and model uncertainties could be quantified by training several models with different initializations and data preparation methods, as sketched below. If the authors address this point, the study will be more reproducible, since differences in data resolution and uncertainties in the models would be accounted for.
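A minimal sketch of this ensemble idea, with synthetic placeholders for the data: repeat training with different random seeds and report the spread of the predictions as a cheap estimate of the model (epistemic) uncertainty.

```python
import numpy as np
from xgboost import XGBRegressor

# Synthetic placeholders for the training and test data.
rng = np.random.default_rng(0)
X_train, X_test = rng.normal(size=(400, 19)), rng.normal(size=(100, 19))
y_train = X_train[:, 0] + rng.normal(scale=0.3, size=400)

preds = []
for seed in range(10):
    # subsample < 1 makes training stochastic, so each seed yields a
    # different model; the spread of their predictions approximates
    # the epistemic uncertainty.
    m = XGBRegressor(n_estimators=200, subsample=0.8, random_state=seed)
    m.fit(X_train, y_train)
    preds.append(m.predict(X_test))

preds = np.asarray(preds)
print("mean epistemic spread:", preds.std(axis=0).mean())
```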
The second issue I note is the replication of peak discharge values performed by the authors, as stated in line 153. I do not think this was done in a systematic way, given that the aleatoric uncertainties were not considered. It also changes the overall statistics of the time series (illustrated below) and makes the transferability of the methods to other catchment areas questionable, even though the authors emphasize transferability to ungauged basins in the aim of the study (lines 67-68).
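A small synthetic illustration of this point (the threshold and replication factor are arbitrary choices, not the authors'): duplicating samples above a high quantile shifts both the mean and the standard deviation of the target series.

```python
import numpy as np

rng = np.random.default_rng(0)
flow = rng.gamma(shape=2.0, scale=1.0, size=1000)      # synthetic drain flow
peaks = flow[flow > np.quantile(flow, 0.95)]
augmented = np.concatenate([flow, np.tile(peaks, 4)])  # replicate peaks 4x

print(f"original:  mean={flow.mean():.2f}, std={flow.std():.2f}")
print(f"augmented: mean={augmented.mean():.2f}, std={augmented.std():.2f}")
```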
I propose several studies to capture these peak behaviours. One, by Senent-Aparicio et al. (2019), coupled machine-learning models with the SWAT model. The authors do mention in line 380 that physics-guided machine learning could be beneficial; an ensemble of MIKE-SHE and machine learning models would be more interesting to consider than replicating peak discharge values whose sensitivity has not been thoroughly examined. I also think recurrent neural networks (RNNs) should be considered, as they are among the most fundamental sequence modeling architectures in machine learning. Tao et al. (2019) suggest that introducing LSTM (Long Short-Term Memory) cells, a type of RNN, drastically improves the simulation of peak discharge. Jiang et al. (2022) use explainable machine learning to investigate flooding mechanisms in Europe; they use RNN models with LSTM cells to predict river peak flows because such models capture temporal dependencies between hydrological variables effectively (Jiang et al., 2022) and have become a backbone of data-driven hydrological modeling (Lees et al., 2022).
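For concreteness, a minimal PyTorch sketch of such an LSTM (the feature count, sequence length, and hidden size are illustrative assumptions): a window of daily forcings, e.g. precipitation, potential evapotranspiration, and temperature, is mapped to drain flow on the final day.

```python
import torch
import torch.nn as nn

class DrainFlowLSTM(nn.Module):
    def __init__(self, n_features: int = 3, hidden: int = 64):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, x):                 # x: (batch, seq_len, n_features)
        out, _ = self.lstm(x)             # out: (batch, seq_len, hidden)
        return self.head(out[:, -1, :]).squeeze(-1)  # flow on the final day

model = DrainFlowLSTM()
x = torch.randn(8, 365, 3)                # one year of daily forcings per sample
print(model(x).shape)                     # torch.Size([8])
```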
Therefore, I suggest the authors support the replication of discharge values with more literature, if any systematic procedure for it exists. Otherwise, this issue is tightly coupled with the uncertainty propagation raised in the first issue, and the sensitivity of the output to the peak discharge replication method should be included among the aleatoric uncertainties. Another alternative for future studies is to include other models such as RNNs or hybrid models (CNN-RNN), given their demonstrated ability to simulate peak discharge in other studies; this also links to the third major issue below. If an organized method for discharge replication can be identified, the authors could present it in a flowchart; if not, the point folds into the first issue, with discharge replication treated as an aleatoric uncertainty. Addressing this issue will make the study more transferable by removing non-systematic procedures.
The third issue relates to the main aim and objective of this study: to provide simpler machine learning alternatives that require less computational effort and to transfer machine learning methods to ungauged basins (lines 17-18, 67-68). However, the results do not provide strong evidence to answer these research questions, because the machine learning models did not outperform the physics-based model, with R² around 0.3 and 0.52 (line 233, Figure 5).
On the other hand, the authors restricted the choice of machine learning methods by stating that Convolutional Neural Networks (CNNs) perform better than Recurrent Neural Networks (RNNs) (lines 64-66), citing Bai et al. (2018). That study provided empirical results on several sequence modeling tasks, but not on hydrological systems, so we cannot conclude whether the statement also holds for drain flow predictions. Other studies show that RNNs can simulate peak values better (Tao et al., 2019; Jiang et al., 2022). Ghimire et al. (2021) also conducted a study predicting streamflow using several machine learning algorithms and concluded that an integrated CNN-RNN method is more powerful than standalone conventional AI and ensemble models (decision trees, gradient boosting, extreme gradient boosting). Therefore, the fundamental reasons for choosing CNN and XGBoost are unclear, given their low performance and given that other studies have shown that other architectures could be chosen for better performance.
To address this issue, the authors could support their reasons for choosing CNN and XGBoost with additional convincing literature and arguments. As a recommendation for future studies, the authors could mention studies that use integrated architectures such as CNN-RNN (Ghimire et al., 2021). Additionally, more literature is needed to support the suggestion of more extensive explanatory variables. For example, Cho et al. (2019) used satellite indices to predict subsurface flow in agricultural areas in the United States using machine learning, achieving 76.9-87% accuracy. Addressing this point would yield clear recommendations for improving the prediction capability of data-driven models. In the end, we could conclude whether simplified data suffice to predict tile-drained agricultural flow, or whether variables as extensive as those required by physics-based models are needed.
Minor issues:
- The conclusion (lines 393-395) goes beyond the evidence provided. I suggest the authors change the conclusion to this sentence: "Our study does not provide enough evidence of how machine learning could be transferable for predicting daily drain flow in agricultural areas. Therefore, for future studies, we recommend exploring more extensive variables and data-driven models that could give better results."
- The references seem to be mainly linked to the authors themselves. For a more balanced view, the authors could consider the study by Cho et al. (2019), which used Google Earth Engine to predict agricultural drainage areas in the United States with an accuracy of about 76.9-87% using machine learning. This study could inform future recommendations on which variables (for example, satellite-derived indices) to include in the data-driven models for better results.
- The explanation of how static features, for example drain catchment areas, are used is unclear (Table 2, line 104). As a reader, knowing how the authors supply this number would be valuable. The authors could add a schematic diagram showing the dimensions of the inputs (both static and dynamic features) and the output of the data-driven model.
- Some data-driven models are impervious to multicollinearity (Kim et al., 2019). The authors should give more references on the specific criteria for removing multicollinearity (lines 123-124) and explain why it is needed for the models they used.
- The calibration of the physics-based MIKE-SHE model is not well defined in the study. For clarity, the authors could list which parameters were used as input, or provide a summary of the important parameters of the physics-based model.
- The authors should describe how the clusters were made (lines 80-81). This would increase the reproducibility of the study in other areas.
- P16, lines 257-259: The authors should report the effect of using drain flow data from all sites. For example, is it comparable to using a single data-driven model for all sites? Encapsulating all data in one model is also interesting to consider (Jiang et al., 2022), given the emphasis on transferability.
- P3, lines 64-66: This statement is misleading. The cited study, Bai et al. (2018), does not specifically consider tile-drained field flow prediction, but the authors give the impression that it applied CNNs to tile-drained field flow predictions. The authors should revise the statement to provide a more balanced view.
List of minor issues:
- P2, line 46: Lumped physics-based models should be removed, as the authors only used the distributed one.
- P3, line 81: I think it should be 120 ha instead of 100 ha.
- P6, line 108: It should be 10 km/20 km; similar typos appear at P6, lines 112 and 144, and P7, line 116.
- P6, line 119: I think the horizon D is missing.
- P8, Figure 3: "randomly division" should be "random division", and "R2" should be "R²".
- P8, line 167: “bellow” should be “below”.
- P9, Equation 2: I think it is better to use a bar instead of _mean.
- P10, line 188: Table A1 mentions verification data for each model; I do not understand why only the MIKE-SHE calibration and verification periods are shown.
- P10, Table 3: There are some typos such as: ‘tunning’ should be ‘tuning’, ‘hyper parameters’ should be ‘hyperparameters’, and space between some values should be added.
- P12, Table 4: The same problems with Table 3, such as: ‘tunning’ should be ‘tuning’ and space in between values in some cells should be added.
- P18, Figure 9: The authors refer to the standard deviation of TWI within a 300 m buffer as the fifth most important feature (lines 284-285), but I do not see the corresponding variable in Figure 9.
- P18, line 290: “don’t” should be “do not”.
- P19, lines 304-305: The authors conclude that the topographical features do not show a clear correlation, referring to Figure 10. I think the authors should include metrics in Figure 10 that establish whether the correlation between the explanatory variables and the target variable is meaningful.
- P20, Figure 11: The authors should state the reason for including this figure; I see no relation to the overall objective and aim of this study.
- P21, line 340: “100 ha” should be “120 ha”.
- P22, line 368: "R2" should be "R²".
References:
Bai, S., Kolter, J. Z., & Koltun, V. (2018). An empirical evaluation of generic convolutional and recurrent networks for sequence modeling. arXiv. https://arxiv.org/abs/1803.01271
Bjerre, E., Fienen, M. N., Schneider, R., Koch, J., & Højberg, A. L. (2022). Assessing spatial transferability of a random forest metamodel for predicting drainage fraction. Journal of Hydrology, 612, 128177. https://doi.org/10.1016/j.jhydrol.2022.128177
Chen, J., Brissette, F., & Leconte, R. (2011). Uncertainty of downscaling method in quantifying the impact of climate change on hydrology. Journal of Hydrology, 401(3–4), 190–202. https://doi.org/10.1016/j.jhydrol.2011.02.020
Cho, E., Jacobs, J. M., Jia, X., & Kraatz, S. (2019). Identifying Subsurface Drainage using Satellite Big Data and Machine Learning via Google Earth Engine. Water Resources Research, 55(10), 8028–8045. https://doi.org/10.1029/2019wr024892
Ghimire, S., Yaseen, Z. M., Farooque, A. A., et al. (2021). Streamflow prediction using an integrated methodology based on convolutional neural networks and long short-term memory networks. Scientific Reports, 11, 17497. https://doi.org/10.1038/s41598-021-96751-4
Jiang, S., Bevacqua, E., & Zscheischler, J. (2022). River flooding mechanisms and their changes in Europe revealed by explainable machine learning. Hydrology and Earth System Sciences, 26(24), 6339–6359. https://doi.org/10.5194/hess-26-6339-2022
Kim, D. W., Lee, S., Kwon, S., Nam, W., Cha, I., & Kim, H. J. (2019). Deep learning-based survival prediction of oral cancer patients. Scientific Reports, 9(1). https://doi.org/10.1038/s41598-019-43372-7
Lang, N., Jetz, W., Schindler, K., et al. (2023). A high-resolution canopy height model of the Earth. Nature Ecology & Evolution. https://doi.org/10.1038/s41559-023-02206-6
Lees, T., Reece, S., Kratzert, F., Klotz, D., Gauch, M., De Bruijn, J., Kumar Sahu, R., Greve, P., Slater, L., & Dadson, S. J. (2022). Hydrological concept formation inside long short-term memory (LSTM) networks. Hydrology and Earth System Sciences, 26, 3079–3101. https://doi.org/10.5194/hess-26-3079-2022
Motarjemi, S. K., Møller, A., Plauborg, F., & Iversen, B. V. (2021). Predicting national-scale tile drainage discharge in Denmark using machine learning algorithms. Journal of Hydrology: Regional Studies, 36, 100839. https://doi.org/10.1016/j.ejrh.2021.100839
Senent‐Aparicio, J., Jimeno‐Sáez, P., Bueno-Crespo, A., Pérez‐Sánchez, J., & Pulido-Velázquez, D. (2019). Coupling machine-learning techniques with SWAT model for instantaneous peak flow prediction. Biosystems Engineering, 177, 67–77. https://doi.org/10.1016/j.biosystemseng.2018.04.022
Tao, Y., Sun, F., Gentine, P., Liu, W., Wang, H., Yin, J., Du, M. H., & Liu, C. (2019). Evaluation and machine learning improvement of global hydrological model-based flood simulations. Environmental Research Letters, 14(11), 114027. https://doi.org/10.1088/1748-9326/ab4d5e
Wu, S., Li, J., & Huang, G. (2008). A study on DEM-derived primary topographic attributes for hydrologic applications: Sensitivity to elevation data resolution. Applied Geography, 28(3), 210–223. https://doi.org/10.1016/j.apgeog.2008.02.006
Zela, A., Klein, A., Falkner, S., & Hutter, F. (2018). Towards automated deep learning: Efficient joint neural architecture and hyperparameter search. arXiv. https://doi.org/10.48550/arXiv.1807.06906
Citation: https://doi.org/10.5194/egusphere-2023-1872-CC1