Technical Note: High Nash-Sutcliffe Efficiencies conceal poor simulations of interannual variance in tropical, alpine, and polar catchments
Abstract. Streamflow time series can be decomposed into interannual, seasonal, and irregular components, with regionally varying contributions of each component. Seasonal variance dominates in many tropical, alpine, and polar regions, while irregular variance dominates in most other regions. Interannual variability in streamflow is known to strongly influence human and ecological systems and is likely to increase under the influence of climate change, though we find that historical interannual variance is usually only a small fraction of the total variance. We show that hydrologic models often simulate one component well while failing to simulate the others, a fact that is hidden by popular performance metrics such as the Nash-Sutcliffe Efficiency (NSE) and the Kling-Gupta Efficiency (KGE), which aggregate performance into a single number. We analyse 18 regional and global hydrologic models and find that in highly seasonal catchments, where the NSE and KGE are consistently the highest, the models are almost always worse at simulating interannual variability. The NSE of the interannual component is lower in highly seasonal catchments, and simulated year-to-year changes in ecologically relevant hydrologic signatures are less accurate. This is concerning because it indicates that these hydrologic models may struggle to predict long-term responses to climate change, especially in tropical, alpine, and polar regions, which host some of the regimes most vulnerable to climate change.
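For readers less familiar with the metrics named in the abstract, the two aggregate scores can be sketched in a few lines (a minimal NumPy illustration based on the standard NSE and KGE definitions; the function names are mine, not the authors'):

```python
import numpy as np

def nse(obs, sim):
    """Nash-Sutcliffe Efficiency: 1 minus the ratio of squared errors to
    observed variance; 1 is a perfect fit, 0 matches the observed mean."""
    obs, sim = np.asarray(obs, float), np.asarray(sim, float)
    return 1.0 - np.sum((sim - obs) ** 2) / np.sum((obs - obs.mean()) ** 2)

def kge(obs, sim):
    """Kling-Gupta Efficiency: combines correlation (r), a variability
    ratio (alpha), and a bias ratio (beta) into a single score."""
    obs, sim = np.asarray(obs, float), np.asarray(sim, float)
    r = np.corrcoef(obs, sim)[0, 1]
    alpha = sim.std() / obs.std()
    beta = sim.mean() / obs.mean()
    return 1.0 - np.sqrt((r - 1) ** 2 + (alpha - 1) ** 2 + (beta - 1) ** 2)
```

Because each score collapses the whole hydrograph into one number, errors in a low-variance component (here, the interannual one) can be masked by a well-simulated seasonal cycle.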
The authors touch on a very important and (luckily) increasingly recognized topic: should we blindly trust our traditional performance metrics for hydrological modeling? Besides other very interesting insights, they discuss a sad (although necessary) truth: high NSEs (or even KGEs) do not necessarily mean that the simulations are adequate. In some respects, this underscores our need, as modelers, to improve our optimization metrics. The paper is definitely a fit for HESS and should be published, but, as is to be expected, some concerns should be clarified/corrected/improved beforehand, alongside many suggestions.
1. I believe that the methodology used for the time-series decomposition needs to be explained better (with more detail); if needed, the authors could make use of an Appendix/Supporting Information. This is a crucial part of the paper and must be easy for readers to follow.
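To make this request concrete, the kind of additive decomposition I assume the authors apply could be sketched roughly as follows (my own reading, not the paper's code; the grouping choices are assumptions the authors should confirm in the revised methods):

```python
import numpy as np
import pandas as pd

def decompose(q):
    """Sketch of an additive decomposition of daily streamflow q
    (a pd.Series with a DatetimeIndex) into interannual, seasonal,
    and irregular components. This is one plausible reading of the
    method and may differ from the authors' implementation."""
    mean = q.mean()
    # Interannual: deviation of each year's mean from the long-term mean
    interannual = q.groupby(q.index.year).transform("mean") - mean
    # Seasonal: long-term mean day-of-year cycle, as anomaly from the mean
    seasonal = q.groupby(q.index.dayofyear).transform("mean") - mean
    # Irregular: whatever the other two components do not explain
    irregular = q - mean - interannual - seasonal
    return interannual, seasonal, irregular
```

By construction the three components plus the long-term mean reconstruct the original series, which is the property a reader would want the methods section to state explicitly.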
2. Related to this, I feel that the authors could better justify the choice of decomposition. Was it motivated by previous work? Are there further references? This needs to be made clear in the text.
3. The authors call the seasonal component the long-term seasonality of the basins. Our rivers are undergoing change, and the seasonality of many of them is changing as a consequence; I think this could be addressed a bit better in the text. I understand the choice (L85-89), and I also believe that much of the change is captured in the irregular component, but the text would benefit from some clarification of these choices.
4. Simulations: If I understood correctly, the authors used simulated data from several models (and in one case ran the simulations themselves). Did the authors check the different calibration/evaluation/test periods for all the models, or for an overlapping period? Did the authors use only what was classified as the test period? My main concern is that, during the model comparison, the authors might be using streamflow simulated for the test period for some models and for the calibration period for others, or even mixing single-basin and regional simulations. I see no problem in using different settings, but this needs to be reported extensively and discussed in the results. For example, I have the feeling that for the PREVAH-CH simulations the authors might have used the full simulation (including calibration) and not only the evaluation period (I might be wrong). My suggestion is to review these aspects and incorporate such information into the manuscript.
5. Regarding Figure 3 (and also L275 onwards): is it my impression, or were the models that performed better for highly seasonal catchments the ones with the lowest performances overall? I think you should discuss this better, perhaps by showing the median performances, or a box plot in an appendix; something to clarify whether these models being better in seasonal catchments simply reflects their overall poor performance. Also touching on point 4: how were these simulations obtained by the original authors? Did they report them as the evaluation phase, or are they actually for the calibration period? This would be worth clarifying for the readers.
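The suggested appendix figure could look something like the following (a minimal matplotlib sketch; the model names and NSE values are invented placeholders, not results from the paper):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, safe for scripts
import matplotlib.pyplot as plt

# Placeholder per-catchment NSE values for a few hypothetical models,
# purely to illustrate the layout of the suggested box plot.
rng = np.random.default_rng(seed=1)
models = ["Model A", "Model B", "Model C"]
scores = [rng.uniform(0.2, 0.9, size=100) for _ in models]

fig, ax = plt.subplots(figsize=(5, 3))
ax.boxplot(scores, showmeans=True)  # boxes show medians; markers show means
ax.set_xticks(range(1, len(models) + 1), models)
ax.set_ylabel("NSE per catchment")
ax.set_title("Overall performance distribution by model")
fig.tight_layout()
fig.savefig("nse_boxplot.png", dpi=150)
```

A figure along these lines would let readers see at a glance whether the models that excel in seasonal catchments sit at the bottom of the overall performance distribution.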
6. L328-332: This may need to be rephrased after the review of points 4 and 5.