the Creative Commons Attribution 4.0 License.
Sensitivity of hydrological machine learning prediction accuracy to information quantity and quality
Abstract. Machine learning (ML) is now commonly employed as a tool for hydrological prediction due to recent advances in computing resources and increases in data volume. The prediction accuracy of ML (or data-driven) modeling is known to be improved through training with additional data; however, the improvement mechanism needs to be better understood and documented. This study explores the connection between the amount of information contained in the data used to train an ML model and the model’s prediction accuracy. The amount of information was quantified using Shannon’s information theory, including marginal and transfer entropy. Three ML models were trained to predict the flow discharge, sediment, total nitrogen, and total phosphorus loads of four watersheds. The amount of information contained in the training data was increased by sequentially adding weather data and the simulation outputs of uncalibrated and/or calibrated mechanistic (or theory-driven) models. The reliability of training data was considered a surrogate of information quality, and accuracy statistics were used to measure the quality (or reliability) of the uncalibrated and calibrated theory-driven modeling outputs to be provided as training data for ML modeling. The results demonstrated that the prediction accuracy of hydrological ML modeling depends on the quality and quantity of information contained in the training data. The use of all types of training data provided the best hydrological ML prediction accuracy. ML models trained only with weather data and calibrated theory-driven modeling outputs could most efficiently improve accuracy in terms of information use. This study thus illustrates how a theory-driven approach can help improve the accuracy of data-driven modeling by providing quality information about a system of interest.
Status: final response (author comments only)
RC1: 'Comment on egusphere-2025-2036', Anonymous Referee #1, 30 Jun 2025
AC1: 'Reply on RC1', Minhyuk Jeung, 04 Dec 2025
RC1.1: The manuscript investigates how the input information quantity and quality, quantified by marginal and transfer entropy, influence the machine-learning-based (ML) hydrological prediction performance. The results demonstrate that increased information quantity does not necessarily enhance model performance whereas improved information quality can more efficiently boost predictive accuracy. However, some points might need to be improved or clarified before publication.
Response to RC1.1: We thank the reviewer for recognizing this study’s central contributions and for highlighting areas needing clarification. In response to the comments regarding structure, terminology, and interpretation in the discussion, we have made several key revisions to improve clarity and emphasize the study’s contributions.
First, as suggested, we split the original Discussion section into three clearer parts: (1) how information quantity affects ML performance, (2) how information quality affects performance, and (3) implications for combining ML with process-based models. We also moved detailed model results (e.g., RMSE and NSE) to the Results section to clearly separate evidence from interpretation.
Second, we replaced the word “resilience” with “robustness” when describing the ANN model’s ability to maintain high performance even when low-quality inputs (i.e., WD+UC) were utilized.
Third, we clarified our use of the term “efficiency” to avoid confusion. In this study, “information-use efficiency” refers to a model’s ability to achieve equal or better predictive performance using fewer, but more informative, input variables, not to reductions in training time or computational cost. We revised the text to reflect this meaning consistently.
Together, these revisions directly address the reviewer’s concerns and improve the manuscript’s clarity, structure, and contribution to the literature on entropy-informed ML model development in hydrology.
RC1.2: In the Discussion, the authors combine the discussion of ML modeling accuracy and the influence of data quantity and quality into one long section. It might be better to move the results on ML performance into the Results part and divide the remaining discussion into several subsections, e.g., the influence of data quantity on ML performance, the influence of data quality on ML performance, and implications for future integration of ML with mechanistic models. This breakdown might better clarify the contribution of this work.
Response to RC1.2: Thank you for this constructive suggestion. We agree that the original Discussion section combined several key themes, including ML model performance, the effects of data quantity and quality, and broader implications, into a single continuous narrative, which may have reduced clarity and obscured the study's main contributions. In response, we have made the following revisions:
- Relocated quantitative results (e.g., RMSE and NSE comparisons across data scenarios) from the Discussion to the Results section to more clearly separate evidence from interpretation.
- Reorganized the Discussion section into three focused sub-sections:
- 1. Influence of information quantity on ML performance
- 2. Influence of information quality on ML performance
- 3. Implications for integrating ML with mechanistic models
These changes better align our discussion structure with the study's main objectives and improve the clarity of our contributions. Specifically:
Section 4.1 examines how increases in input quantity influenced model performance, including observations of diminishing returns, and connects these patterns to previous literature.
Section 4.2 focuses on the impact of information quality, as quantified by transfer entropy, and highlights cases where fewer but higher-quality inputs led to improved performance, particularly in the ANN model.
Section 4.3 explores the broader implications of our findings, including how entropy metrics can guide input selection and data assimilation in mechanistic models, inform efficient training strategies, and support the development of hybrid modeling approaches.
RC1.3: Line 526. The authors mention that the ANN model exhibits its resilience to more efficiently utilize quality information. What does the term “resilience” mean here?
Response to RC1.3: We appreciate the reviewer’s request for clarification. In the original sentence, we wrote: “The ANN model exhibits its resilience to more efficiently utilize quality information …” Our intention was to convey that the ANN model retained a high level of predictive skill even after low-entropy (low-quality) inputs were added, whereas the other models showed a sharper decline in performance. In other words, the ANN model was robust to reductions in the quantity of information as long as the remaining inputs were of high informational quality. To avoid ambiguity, we have replaced the word “resilience” with “robustness,” a term more commonly used in the modeling literature to describe stability of performance under adverse or reduced-information conditions. In addition, we have added a definition in the text. The revised sentence now reads (Section 4.2, the first paragraph): “The ANN demonstrated robustness by effectively exploiting additional information, whereas the RF and SVM models exhibited performance deterioration.” Furthermore, we have provided a short explanatory clause that explicitly links robustness to the model’s ability to exploit high-entropy (high-quality) inputs: “…indicating that the ANN can exploit the remaining high-entropy variables more effectively than the other algorithms.”
RC1.4: Lines 555-557. The authors mention that high-quality training data can improve the efficiency of ML models. The term “efficiency” might be ambiguous here since it can refer either to information-use efficiency or to reduced training/computation time of ML models, especially given later comments on potential advantages of streamlined model training. Similar unclarified terms appear in other parts, e.g., Lines 539-540, and might cause readers to misunderstand. Please check the related terms and keep them consistent throughout the discussion.
Response to RC1.4: Thank you for your comment regarding the ambiguous use of the term “efficiency.” We agree that in the original text, the use of “efficiency” could be misinterpreted as either referring to information-use efficiency or to computational/training efficiency. Our intention was to emphasize that high-quality training data enhanced the information-use efficiency of machine learning models, meaning the models achieved equal or better predictive accuracy using fewer, yet more informative, input variables. To clarify this, we revised the sentence in Lines 555–557 to state: “These results suggest that higher-quality training data improved the information-use efficiency of ML models, enabling them to maintain or improve prediction accuracy while using a reduced number of inputs.” We also reviewed and revised other instances throughout the manuscript, such as on Lines 539–540, to ensure that the term “efficiency” consistently refers to information-use efficiency unless otherwise specified.
Citation: https://doi.org/10.5194/egusphere-2025-2036-AC1
RC2: 'Comment on egusphere-2025-2036', Anonymous Referee #2, 05 Nov 2025
This is a useful, well-motivated study on how information quantity (marginal entropy) and quality (transfer entropy) affect hydrological ML performance. The core message—that more data does not guarantee better predictions while higher-quality information is more impactful—is clear and relevant. I recommend moderate revision focused on clarity of structure, terminology, and methods transparency.
Detailed comments are listed as follows.
1. Restructure: move quantitative ML results to Results; keep Discussion for interpretation, organized into quantity effects, quality effects, and implications for ML–process model integration.
2. Clarify terms: replace/define “resilience” precisely; reserve “efficiency” for information-use efficiency (IUE) and use “computational efficiency” for runtime/training remarks.
3. Methods transparency: briefly specify how marginal/transfer entropy are estimated (estimator, lags/embedding/discretization) and note comparability of “bits” across variables.
4. Data splits & leakage: clearly diagram time windows (SWAT calibration vs. ML train/test) and state how leakage is avoided.
5. Uncertainty & presentation: add compact uncertainty cues (e.g., CIs/whiskers or paired tests) to key figures; simplify dense plots and fix minor typos/formatting.
Citation: https://doi.org/10.5194/egusphere-2025-2036-RC2
AC2: 'Reply on RC2', Minhyuk Jeung, 04 Dec 2025
RC2.1: This is a useful, well-motivated study on how information quantity (marginal entropy) and quality (transfer entropy) affect hydrological ML performance. The core message—that more data does not guarantee better predictions while higher-quality information is more impactful—is clear and relevant. I recommend moderate revision focused on clarity of structure, terminology, and methods transparency. Detailed comments are listed as follows.
Response to RC2.1: We appreciate the reviewer’s detailed and constructive suggestions for clarifying the manuscript structure and improving methodological transparency. In response to them, we have made several key revisions. First, as suggested, we split the original Discussion section into three parts: (1) how information quantity affects ML performance, (2) how information quality affects ML performance, and (3) implications for combining ML with process-based models. We also moved detailed model accuracy statistics (e.g., RMSE and NSE) included in the Discussion section to the Results section to better separate evidence from interpretation. Second, we replaced the word “resilience” with “robustness” when describing the ANN model’s ability. Third, we clarified our use of the term “efficiency” to avoid confusion. In this study, “information-use efficiency” refers to a model’s ability to achieve equal or better predictive performance using fewer, but more informative, input variables, not to reductions in training time or computational cost. Fourth, we added whisker-box plots to present the inter-dataset, inter-model, and inter-watershed variability of the key metrics (KGE and IUE). These plots allow us to examine how performance varies across different ML models and data combinations and to discuss which configurations yield more consistent results.
RC2.2: Restructure: move quantitative ML results to Results; keep Discussion for interpretation, organized into quantity effects, quality effects, and implications for ML–process model integration.
Response to RC2.2: Thank you for this constructive suggestion. We agree that the original Discussion section combined several key themes, including ML model performance, the effects of data quantity and quality, and broader implications, into a single continuous narrative, which may have reduced clarity and obscured the study's main contributions. In response, we have made the following revisions:
- Relocated quantitative results (e.g., RMSE and NSE comparisons across data scenarios) from the Discussion to the Results section to more clearly separate evidence from interpretation.
- Reorganized the Discussion section into three focused sub-sections:
- 1. Influence of information quantity on ML performance
- 2. Influence of information quality on ML performance
- 3. Implications for integrating ML with mechanistic models
These changes better align our discussion structure with the study's main objectives and improve the clarity of our contributions. Specifically:
Section 4.1 examines how increases in input quantity influenced model performance, including observations of diminishing returns, and connects these patterns to previous literature.
Section 4.2 focuses on the impact of information quality, as quantified by transfer entropy, and highlights cases where fewer but higher-quality inputs led to improved performance, particularly in the ANN model.
Section 4.3 explores the broader implications of our findings, including how entropy metrics can guide input selection and data assimilation in mechanistic models, inform efficient training strategies, and support the development of hybrid modeling approaches.
RC2.3: Clarify terms: replace/define “resilience” precisely; reserve “efficiency” for information-use efficiency (IUE) and use “computational efficiency” for runtime/training remarks.
Response to RC2.3: We appreciate the reviewer’s request for clarification. In the original sentence, we wrote: “The ANN model exhibits its resilience to more efficiently utilize quality information …” Our intention was to convey that the ANN model retained a high level of predictive skill even after low-entropy (low-quality) inputs were added, whereas the other models showed a sharper decline in performance. In other words, the ANN model was robust to reductions in the quantity of information as long as the remaining inputs were of high informational quality. To avoid ambiguity, we have replaced the word “resilience” with “robustness,” a term more commonly used in the modeling literature to describe stability of performance under adverse or reduced-information conditions. In addition, we have added a quantitative definition in the text. The revised sentence now reads (Section 4.2 Influence of information quality on ML performance, the first paragraph): “The ANN demonstrated robustness by effectively exploiting additional information, whereas the RF and SVM models exhibited performance deterioration.” Furthermore, we have provided a short explanatory clause that explicitly links robustness to the model’s ability to exploit high-entropy (high-quality) inputs: “…indicating that the ANN can exploit the remaining high-entropy variables more effectively than the other algorithms.”
In addition, the use of “efficiency” could be misinterpreted as either referring to information-use efficiency or to computational/training efficiency. Our intention was to emphasize that high-quality training data enhanced the information-use efficiency of machine learning models, meaning the models achieved equal or better predictive accuracy using fewer, yet more informative, input variables. To clarify this, we revised the corresponding sentence to state: “These results suggest that higher-quality training data improved the information-use efficiency of ML models, enabling them to maintain or improve prediction accuracy while using a reduced number of inputs.” We also reviewed and revised other instances throughout the manuscript to ensure that the term “efficiency” consistently refers to information-use efficiency unless otherwise specified.
RC2.4: Methods transparency: briefly specify how marginal/transfer entropy are estimated (estimator, lags/embedding/discretization) and note comparability of “bits” across variables.
Response to RC2.4: We agree that a more detailed description of how marginal and transfer entropy were computed would improve reader understanding. Transfer entropy (TE) was calculated using the method implemented in the RTransferEntropy package (Behrendt et al., 2019), applying Shannon TE with a quantile-based discretization scheme. This choice enhances robustness to outliers and better captures information transfer associated with relatively high and low values (Nie, 2021; Zhang and Zhao, 2022). The lag parameter was set to zero because the ML models used in this study are standard regression models without explicit temporal memory (e.g., no LSTM); accordingly, we quantified synchronous information transfer between inputs and outputs (same-time relationships; lag = 0), which aligns with our primary objective. Marginal entropy was computed with log base 2, and all marginal and transfer entropy magnitudes are reported in bits; where units were missing, they have been added to figures and tables.
References:
Behrendt, S., Dimpfl, T., Peter, F.J., Zimmermann, D.J., 2019. RTransferEntropy — Quantifying information flow between different time series using effective transfer entropy. SoftwareX 10, 100265.
Nie, C.-X., 2021. Dynamics of the price–volume information flow based on surrogate time series. Chaos: An Interdisciplinary Journal of Nonlinear Science 31.
Zhang, N., Zhao, X., 2022. Quantile transfer entropy: Measuring the heterogeneous information transfer of nonlinear time series. Communications in Nonlinear Science and Numerical Simulation 111, 106505.
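To make the estimation scheme discussed in this exchange concrete, the following Python sketch computes quantile-binned Shannon marginal entropy and a plug-in Shannon transfer-entropy estimate, both in bits. It is an illustrative analogue, not the authors' actual RTransferEntropy workflow: it uses three quantile bins and one-step histories, whereas the reply describes a lag-0 (synchronous) configuration, and the bin count here is an assumption.

```python
import numpy as np
from collections import Counter

def quantile_bins(x, n_bins=3):
    """Discretize a series into quantile-based bins (here terciles by default)."""
    edges = np.quantile(x, np.linspace(0, 1, n_bins + 1)[1:-1])
    return np.digitize(x, edges)

def marginal_entropy(x, n_bins=3):
    """Shannon marginal entropy, in bits, of the quantile-binned series."""
    _, counts = np.unique(quantile_bins(x, n_bins), return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

def transfer_entropy(x, y, n_bins=3):
    """Plug-in Shannon transfer entropy TE(X -> Y), in bits, with one-step
    histories: TE = sum over states of p(y+, y, x) * log2[p(y+|y,x) / p(y+|y)]."""
    xs, ys = quantile_bins(x, n_bins), quantile_bins(y, n_bins)
    yf, yp, xp = ys[1:], ys[:-1], xs[:-1]   # future y, past y, past x
    n = len(yf)
    c_fpx = Counter(zip(yf, yp, xp))        # joint counts of (y+, y, x)
    c_px = Counter(zip(yp, xp))
    c_fp = Counter(zip(yf, yp))
    c_p = Counter(yp)
    te = 0.0
    for (f, p, xv), cnt in c_fpx.items():
        # p(y+|y,x) / p(y+|y) written with raw counts (sample sizes cancel)
        te += (cnt / n) * np.log2(cnt * c_p[p] / (c_px[(p, xv)] * c_fp[(f, p)]))
    return float(te)
```

With three equiprobable bins, the marginal entropy of any long continuous series approaches log2(3) ≈ 1.58 bits, which illustrates why bit values are comparable across variables only when the same discretization is applied to each.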
RC2.5: Data splits & leakage: clearly diagram time windows (SWAT calibration vs. ML train/test) and state how leakage is avoided.
Response to RC2.5: Thank you for the constructive suggestion. The primary objective of this study is to evaluate how the quantity and quality of input data influence the predictive accuracy of ML models by using both intentionally uncalibrated and calibrated mechanistic modeling (SWAT) outputs as inputs.
To further ensure that data leakage was avoided, we carefully aligned the training and testing periods of the ML models with the calibration and validation periods of the SWAT model, respectively. For example, the ML models were evaluated over the same period (January 1, 2016 to December 31, 2017) used for SWAT model validation. In addition, the ML models were trained exclusively on SWAT-simulated nutrient loads from the calibration period, while observed discharge and concentration data were used only for SWAT calibration and validation and never as ML inputs. These measures ensured that no observed data used for SWAT calibration were involved in ML model training or testing, thereby maintaining strict independence between the datasets.
To enhance clarity, these methodological safeguards and the rationale behind our data-separation strategy have been explicitly described in the revised manuscript (Section 2.6 – Study Watersheds and Training Data Acquisition and Section 4 – Discussions). In addition, we revised the diagram (Figure 1 in supplementary file) to clearly illustrate the data-splitting scheme and the training workflow.
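The window alignment described in this reply can be sketched as a simple date-based split. Only the test window (January 1, 2016 to December 31, 2017) comes from the reply; the training-window start date below is a hypothetical placeholder, since the calibration span is not stated here.

```python
from datetime import date

# Hypothetical calibration span used as the ML training window (start date
# assumed for illustration); the test window matches the SWAT validation
# period stated in the reply.
TRAIN_WINDOW = (date(2012, 1, 1), date(2015, 12, 31))  # assumed
TEST_WINDOW = (date(2016, 1, 1), date(2017, 12, 31))   # from the reply

def split_by_window(records):
    """Route (date, value) records into disjoint train/test sets by date
    window; records outside both windows are dropped, so no sample can
    appear in both sets."""
    train, test = [], []
    for d, value in records:
        if TRAIN_WINDOW[0] <= d <= TRAIN_WINDOW[1]:
            train.append((d, value))
        elif TEST_WINDOW[0] <= d <= TEST_WINDOW[1]:
            test.append((d, value))
    return train, test
```

Because membership is decided purely by non-overlapping date windows, the split is leak-free by construction, mirroring the safeguard the authors describe.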
RC2.6: Uncertainty & presentation: add compact uncertainty cues (e.g., CIs/whiskers or paired tests) to key figures; simplify dense plots and fix minor typos/formatting.
Response to RC2.6: We appreciate the reviewer’s suggestion to clarify the presentation of model performance uncertainty. In response to it, we have added new figures (Figure S5 in supplementary file) and revised the key figure (Figure 8 in supplementary file) to include compact uncertainty indicators as suggested. Specifically, we added box-whisker plots to present the inter-dataset, inter-model, and inter-watershed variability of the key metrics (IUE-ME and IUE-TE). In addition, we have carefully revised typographical errors and formatting throughout the manuscript.