Preprints
https://doi.org/10.5194/egusphere-2025-769
18 Jun 2025

Technical note: Does Multiple Basin Training Strategy Guarantee Superior Machine Learning Performance for Streamflow Predictions in Gaged Basins?

Vinh Ngoc Tran, Tam V. Nguyen, Jongho Kim, and Valeriy Y. Ivanov

Abstract. In recent years, machine learning (ML) has gained prominence in hydrological science, offering convenience and ease of use without requiring extensive hydrological expertise or the complexity associated with process-based models. There is debate regarding optimal training approaches, with some researchers advocating multi-basin training while questioning the validity of single-basin approaches. This study examines the relationship between training dataset size (number of basins) and model performance. Through comparative analysis, we found that increasing the number of basins used for ML training does not guarantee improved performance of the trained ML model. Specifically, the state-of-the-art global ML model (G model), trained by Google with nearly 6,000 global basins, underperforms regional ML models trained with hundreds of basins in the contiguous US and Great Britain when predicting streamflow in both gauged and ungauged basins. Furthermore, we compared the G model with our single-basin (S) ML models, trained individually for 609 global locations, and found that the G model does not consistently outperform the S models: the S models outperform the G model in 46 % of case studies. Therefore, the training approach should not be a criterion for judging model validity; instead, the focus should be on the trained model's performance.
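
The 46 % figure in the abstract is a per-basin win rate: the share of locations where the single-basin model scores higher than the global model. The sketch below shows one way such a share could be computed. It assumes a skill score where higher is better; the abstract does not name the metric, and the function name, basin identifiers, and score values are all hypothetical, invented for illustration and not taken from the study.

```python
from typing import Mapping


def win_fraction(single: Mapping[str, float], global_: Mapping[str, float]) -> float:
    """Fraction of shared basins where the single-basin score beats the global score.

    Assumes a higher score means better performance (hypothetical metric).
    """
    # Compare only basins that both models were evaluated on.
    shared = sorted(set(single) & set(global_))
    if not shared:
        raise ValueError("no basins in common")
    wins = sum(1 for basin in shared if single[basin] > global_[basin])
    return wins / len(shared)


# Made-up scores for four hypothetical basins (not data from the study).
s_scores = {"basin_a": 0.71, "basin_b": 0.42, "basin_c": 0.80, "basin_d": 0.55}
g_scores = {"basin_a": 0.65, "basin_b": 0.60, "basin_c": 0.78, "basin_d": 0.70}

share = win_fraction(s_scores, g_scores)
print(f"S model scores higher in {share:.0%} of basins")
```

A strict comparison counts ties as non-wins; a real analysis might also want to report the size of the score differences, since a win rate alone says nothing about how much better or worse each model is.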

Publisher's note: Copernicus Publications remains neutral with regard to jurisdictional claims made in the text, published maps, institutional affiliations, or any other geographical representation in this paper. While Copernicus Publications makes every effort to include appropriate place names, the final responsibility lies with the authors. Views expressed in the text are those of the authors and do not necessarily reflect the views of the publisher.

Status: final response (author comments only)

Comment types: AC – author | RC – referee | CC – community | EC – editor | CEC – chief editor
  • RC1: 'Comment on egusphere-2025-769', Frederik Kratzert, 10 Jul 2025
    • AC1: 'Reply on RC1', Vinh Ngoc Tran, 17 Aug 2025
  • RC2: 'Comment on egusphere-2025-769', Anonymous Referee #2, 08 Aug 2025
    • AC2: 'Reply on RC2', Vinh Ngoc Tran, 17 Aug 2025

Viewed

Total article views: 1,456 (including HTML, PDF, and XML)
  • HTML: 1,385
  • PDF: 57
  • XML: 14
  • Total: 1,456
  • BibTeX: 27
  • EndNote: 40
Views and downloads (calculated since 18 Jun 2025)

Viewed (geographical distribution)

Total article views: 1,358 (including HTML, PDF, and XML), of which 1,358 have a defined geography and 0 have unknown origin.
Latest update: 12 Sep 2025
Short summary
Our research questions whether machine learning models for predicting streamflow need to be trained on data from multiple basins at once. We compared three approaches: a global model trained on thousands of basins, regional models using hundreds of basins, and individual single-basin models. We found that regional and single-basin models often performed better than the global model. This suggests we should judge models by their actual performance rather than their training approach.