the Creative Commons Attribution 4.0 License.
Interrogating process deficiencies in large-scale hydrologic models with interpretable machine learning
Abstract. Large-scale hydrologic models are increasingly being developed for operational use in the forecasting and planning of water resources. However, the predictive strength of such models depends on how well they resolve various functions of catchment hydrology, which are influenced by gradients in climate, topography, soils, and land use. Most assessments of these hydrologic models have been limited to traditional statistical approaches. The rise of machine learning techniques can provide novel insights into process deficiencies in large-scale hydrologic models. In this study, we train a random forest model to predict the Kling-Gupta Efficiency (KGE) of National Water Model (NWM) and National Hydrologic Model (NHM) predictions for 4,383 streamgages across the conterminous United States. Thereafter, we explain the local and global controls that 48 catchment attributes exert on KGE prediction using interpretable Shapley values. Overall, we find that soil water content is the most impactful feature controlling successful model performance, suggesting that soil water storage is difficult for hydrologic models to resolve, particularly for arid locations. We identify non-linear thresholds beyond which predictive performance decreases for NWM and NHM. For example, soil water content less than 210 mm, precipitation less than 900 mm/yr, road density greater than 5 km/km², and lake area percent greater than 10 % contributed to lower KGE values. These results suggest that improvements in how these influential processes are represented could result in the largest increases in predictive performance of NWM and NHM. This study demonstrates the utility of interrogating process-based models using data-driven techniques, which has broad applicability and potential for improving the next generation of large-scale hydrologic models.
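The two quantities the abstract relies on, KGE and Shapley values, can be illustrated with a minimal sketch. It assumes the original 2009 form of KGE (correlation, variability ratio, bias ratio); the abstract does not state which KGE variant the study uses. The Shapley routine is a brute-force average over all feature orderings, not the tree-based estimator a random forest workflow would normally use. The function `toy_kge_model`, its attribute names, and its penalty sizes are hypothetical; only the 210 mm and 900 mm/yr thresholds are taken from the abstract.

```python
import math
from itertools import permutations


def kge(sim, obs):
    """Kling-Gupta Efficiency, assuming the 2009 formulation (1.0 is a perfect score)."""
    n = len(obs)
    mu_s, mu_o = sum(sim) / n, sum(obs) / n
    sd_s = math.sqrt(sum((s - mu_s) ** 2 for s in sim) / n)
    sd_o = math.sqrt(sum((o - mu_o) ** 2 for o in obs) / n)
    cov = sum((s - mu_s) * (o - mu_o) for s, o in zip(sim, obs)) / n
    r = cov / (sd_s * sd_o)   # linear correlation between simulated and observed flow
    alpha = sd_s / sd_o       # variability ratio
    beta = mu_s / mu_o        # bias ratio
    return 1.0 - math.sqrt((r - 1) ** 2 + (alpha - 1) ** 2 + (beta - 1) ** 2)


def shapley_values(predict, x, baseline):
    """Exact Shapley values: average each feature's marginal contribution
    over every ordering in which features switch from baseline to x."""
    names = list(x)
    totals = {name: 0.0 for name in names}
    orderings = list(permutations(names))
    for order in orderings:
        current = dict(baseline)
        previous = predict(current)
        for name in order:
            current[name] = x[name]
            updated = predict(current)
            totals[name] += updated - previous
            previous = updated
    return {name: total / len(orderings) for name, total in totals.items()}


def toy_kge_model(attrs):
    """Hypothetical stand-in for the trained random forest (penalties are invented)."""
    score = 0.6
    if attrs["soil_water_mm"] < 210:   # threshold quoted in the abstract
        score -= 0.3
    if attrs["precip_mm_yr"] < 900:    # threshold quoted in the abstract
        score -= 0.2
    return score


# Hypothetical arid catchment compared against a hypothetical humid reference.
arid = {"soil_water_mm": 150, "precip_mm_yr": 600}
reference = {"soil_water_mm": 300, "precip_mm_yr": 1200}
phi = shapley_values(toy_kge_model, arid, reference)
```

In this toy case each attribute's Shapley value equals its penalty, and the values sum to the difference between the prediction for the arid catchment and the prediction for the reference. A negative Shapley value therefore means the attribute pushes the predicted KGE down, that is, toward worse streamflow performance.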
Notice on discussion status: The requested preprint has a corresponding peer-reviewed final revised paper. You are encouraged to refer to the final revised version.
Interactive discussion
Status: closed
- RC1: 'Comment on egusphere-2024-3235', Jonathan Frame, 13 Jan 2025

This paper presents an analysis of sensitivity for random forest inputs to predict large-domain hydrologic model performance. The attribution method is the Shapley value, which the authors note is model agnostic, and is uncommon in hydrology, as far as I know. The paper is quite simple in its approach, but leads to comprehensive results of model performance attribution. Even though the results aren't particularly surprising, the reproducibility, the completeness and the solid methodology are admirable. The text, the figures, and particularly the figure captions are very high quality, which allows the reader to gain insight into the drivers of hydrologic processes around the CONUS quite easily. The model-agnostic nature of this attribution approach means that it can be used by many other models in many other regions, which the authors acknowledge in the discussion.

The supplement shows results for the NHM, while the main text shows results for the NWM. The results are strikingly similar. My guess is that the results of any deep learning models will also be similar. As tempting as it may be to ask the authors to do additional analysis, such as including analysis of the poor-performing basins or comparing the results of the different models, I believe this would be an unnecessary burden, given the high quality of the manuscript and the potential for future research using this method. I recommend this paper be published as is, and I look forward to the eventual global results (including the poor-performing basins) that will surely be the next step.

Regards, Jonathan M. Frame

Citation: https://doi.org/10.5194/egusphere-2024-3235-RC1
                                        
- AC1: 'Reply on RC1', Admin Husic, 25 Feb 2025

We thank Jonathan Frame for the thorough reading and assessment of the manuscript. We look forward to seeing the potential ways the hydrologic community may use this tool to assess model sensitivity.

Citation: https://doi.org/10.5194/egusphere-2024-3235-AC1
 
- RC2: 'Comment on egusphere-2024-3235', Anonymous Referee #2, 16 Jan 2025

This manuscript uses the Shapley approach, an explainable AI methodology, to broadly define categories that contribute to model bias, with the focus being on streamflow output from two continental-scale, process-based hydrological models. The methodology uses a random forest model to predict KGE for streamflow in each model and it is trained on several ecoregion characteristics. The Shapley values indicate feature importance for all the ecoregion characteristics and their impact on streamflow KGE. The random forest model is moderately sufficient in predicting KGE. This study, rather than assessing model performance or specific ways to improve model performance, is a proof of concept for using explainable AI to identify sources of bias in process-based hydrological models in a post-hoc manner.

Overall, this manuscript is well written and organized. There were few grammatical errors, and the sections of the manuscript are logically structured. This methodology is an interesting and scientifically significant way to assess sources of model bias and parameter sensitivity for continental-scale models, which are typically far too computationally expensive for traditional sensitivity analyses. I suggest the background and methodology can be expanded in some sections for increased clarity, particularly across scientific fields. Additionally, the authors should clarify specifically the purpose and motivation. The main points of revision are as follows:

-  Purpose and Motivation:
- Clearly define the study's purpose: This is not a model comparison or a process representation assessment. This reads more as a proof of concept for a new methodology to assess bias and sensitivity in process-based, continental-scale hydrologic models.
- Strengthen the introduction and motivation by emphasizing the challenges of running traditional sensitivity analyses on computationally expensive, large-scale models. This should serve as a primary justification for developing and applying this methodology.
 
-  Improved Model Description:
- Include a more detailed description of model configurations and processes of interest (even in supplementary materials, if necessary).
- Clearly identify the configurations used and ensure consistency across the text (e.g., clarify references to model attributes in results and Line 390).
- Focus on processes directly relevant to the discussion (e.g., those affecting streamflow).
 
-  Clarify Objectives and Takeaways:
- One suggestion is to change the recurring message: the Shapley results are a tool for identifying model deficiencies and offering insights into bias and process representation improvements, not for directly improving NWM or NHM processes. This study is a proof of concept for using explainable AI to identify hydrologic model bias in a post-hoc manner. The authors can emphasize that the study demonstrates the utility of explainable AI in detecting model deficiencies, particularly for computationally expensive, large-scale hydrological models. The authors can also compare the methodology to traditional sensitivity analyses, highlighting its innovation and feasibility given the computational constraints of large-scale models.
 
Specific Comments:
- Line 30: “Grand challenge of hydrology…” Why/what makes this especially difficult at large scales? What are the specific challenges for large scale models that are addressed with this methodology?
- It is not initially clear why the analysis includes both the NWM and NHM. Justification for this should be added. Does using two models illustrate that this methodology is useful beyond a single-model use case? Or something else?
- Generally, the use of ecoregions needs to be expanded upon in the introduction and methods. It should be a bigger part of the central thesis, since the analysis and discussion consist primarily of the catchment attributes of these ecoregions. Additionally, it is not entirely clear in the methodology how streamflow gages are related to the ecoregions and catchment attributes. What is the catchment size / product (e.g., NHDPlus)?
- Line 20 and 71: “model performance” – should indicate that streamflow performance is the only variable being assessed.
- Line 26: Why are large-scale hydrologic models important? Should add a brief justification of the rationale for using these vs., e.g., regional or catchment-scale models.
- Line 29: Besides parameterizing and calibrating, what about physics and physical process representation in these models?
- Line 44: Clarify what is meant by “sites”.
- Line 54: Can also cite Ma et al., 2023 (Groundwater) https://doi.org/10.1111/gwat.13362
- Lines 54-55: This is too broad a statement. Needs further explanation or an example given.
- Line 60: Have numerous explainable AI methods been developed for use in hydrology or just generally? More generally, expand on what explainable AI is and how it is defined. Authors discussed that XAI can be leveraged and cite methods, but there is not a clear explanation of what XAI is.
- Line 90: Clarify if these basins are NHDPlus or something else.
- Line 92-93: This seems more like a result and might fit better in the results section.
- Line 94: Separate the metrics section from the model section, as these are not explicitly related. Also, provide justification for only using KGE.
- Line 105: It is not clear what the “aggregated” attributes are. Are these the 7 groupings listed in the next sentence?
- Line 110-111: Need to clarify where soil water content is being “represented.” In the models or elsewhere?
- Line 112: The “in-the-bag” and “out-of-bag” language is jargon and can be explained in plain language. The language could be revised to be more consistent with the “training-testing” language in the results. Further explanation can be provided on the RF methodology, particularly because this paper is outlining a new methodology that others will likely want to employ and will need more specifics to do that. E.g., what was the training-testing split as a percentage? Was this a randomized selection?
- Line 125: Unless someone is familiar with Shapley values and methodology, some of the results are initially difficult to interpret. Consider adding some general explanation of what feature importance, Shapley value, and directionality mean within the context of the results and figures. Also, it would be helpful to consider explaining further in the methods / changing the language about how KGE behaves (perhaps instead of increase/decrease use improve/worsen) so that the qualitative relationship between Shapley values and KGE is clear.
- Figure 5: Explain the difference between the Shapley value and the Predictor Value; this will help in interpreting the swarm plot.
- Line 128: Expand on this point and give reasoning for why this is beneficial within this methodology. Explain “distribution of gain.”
- Line 146: Consider moving the ecoregion names listed here to the figure caption and replace with content focused on why this methodology was chosen, the implications of this method on the study, general ecoregion methodology explanation. (See Comment 6).
- Line 154: A plot of actual to modeled KGE would be helpful (maybe in the Supplementary Materials). Did authors evaluate this with any other metrics?
- Line 249-242: This is a great explanation. Perhaps include something similar in the introduction as this is an important point which makes this study unique. Remove the second “that is.”
- Line 272: after “certain thresholds are crossed” reference specifically the scatter plots in Fig. 4.
- Section 4.2 Title: Rethink the section heading. The section broadly discusses model performance related to the ecoregion features, not model formulations or actual process representations.
- Line 332: The NWM routes streamflow through the NHDPlus vector network, not across a 1 km² [grid]. The reasoning for headwater performance should be given more thought.
- Line 362: “under the hood” model assessments – this is jargon and can be clarified with plain language.
- Figure 3: Specify on the x-axis this is predicted.
- Figure 4, Figure S2, Figure S3: Need colorbar explanations.
Citation: https://doi.org/10.5194/egusphere-2024-3235-RC2

- AC2: 'Reply on RC2', Admin Husic, 25 Feb 2025
 
Viewed
| HTML | PDF | XML | Total | Supplement | BibTeX | EndNote |
|---|---|---|---|---|---|---|
| 669 | 112 | 22 | 803 | 34 | 19 | 32 |
John Hammond
Adam N. Price
Joshua K. Roundy