https://doi.org/10.5194/egusphere-2025-769
18 Jun 2025
Status: this preprint is open for discussion and under review for Hydrology and Earth System Sciences (HESS).

Technical note: Does Multiple Basin Training Strategy Guarantee Superior Machine Learning Performance for Streamflow Predictions in Gaged Basins?

Vinh Ngoc Tran, Tam V. Nguyen, Jongho Kim, and Valeriy Y. Ivanov

Abstract. In recent years, machine learning (ML) has gained prominence in hydrological science, offering convenience and ease of use without requiring extensive hydrological expertise or the complexity associated with process-based models. The optimal training approach remains debated: some researchers advocate multi-basin training and question the validity of single-basin approaches. This study examines the relationship between training dataset size (number of basins) and model performance. Through comparative analysis, we found that increasing the number of basins used for training does not guarantee better performance of the trained ML model. Specifically, the state-of-the-art global ML model (G model), trained by Google on nearly 6,000 basins worldwide, underperforms regional ML models trained on hundreds of basins in the contiguous US and Great Britain when predicting streamflow in both gauged and ungauged basins. Furthermore, we compared the G model with our single-basin (S) ML models, each trained individually for one of 609 global locations, and found that the G model does not consistently outperform them: S models outperform the G model in 46 % of the case studies. Therefore, the training approach should not be a criterion for judging model validity; instead, the focus should be on the trained model's performance.
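
A per-basin comparison of the kind summarized in the abstract (the share of basins where one model beats another) can be sketched as follows. This is a minimal illustration, not the authors' code: the abstract does not name the skill metric, so the Nash-Sutcliffe efficiency (NSE) is assumed here, and the function names, basin identifiers, and all numbers are hypothetical toy values.

```python
import numpy as np

def nse(obs, sim):
    """Nash-Sutcliffe efficiency (assumed metric): 1 - SSE / sum of squared deviations of obs from its mean."""
    obs = np.asarray(obs, dtype=float)
    sim = np.asarray(sim, dtype=float)
    return 1.0 - np.sum((obs - sim) ** 2) / np.sum((obs - obs.mean()) ** 2)

def win_fraction(scores_a, scores_b):
    """Fraction of basins where model A scores strictly higher than model B."""
    a = np.asarray(scores_a, dtype=float)
    b = np.asarray(scores_b, dtype=float)
    return float(np.mean(a > b))

# Toy, made-up streamflow series for two hypothetical basins.
observed = {
    "basin_1": [1.0, 2.0, 3.0, 4.0],
    "basin_2": [2.0, 4.0, 6.0, 8.0],
}
# Hypothetical predictions from a single-basin (S) model and a global (G) model.
sim_single = {
    "basin_1": [1.1, 1.9, 3.2, 3.8],
    "basin_2": [3.0, 5.0, 7.0, 9.0],
}
sim_global = {
    "basin_1": [1.5, 2.5, 2.5, 3.5],
    "basin_2": [2.0, 4.0, 6.0, 9.0],
}

basins = sorted(observed)
nse_single = [nse(observed[b], sim_single[b]) for b in basins]
nse_global = [nse(observed[b], sim_global[b]) for b in basins]

# Share of basins where the S model beats the G model under the assumed metric.
share_single_wins = win_fraction(nse_single, nse_global)
print(f"S outperforms G in {100 * share_single_wins:.0f} % of basins")
```

A figure such as the 46 % reported in the abstract would correspond to `share_single_wins` evaluated over all 609 locations; a different metric or a non-strict comparison would change the count.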

Publisher's note: Copernicus Publications remains neutral with regard to jurisdictional claims made in the text, published maps, institutional affiliations, or any other geographical representation in this preprint. The responsibility to include appropriate place names lies with the authors.

Status: open (until 16 Aug 2025)


Latest update: 16 Jul 2025
Short summary
Our research questions whether machine learning models for predicting streamflow need to be trained on data from multiple basins at once. We compared three approaches: a global model trained on thousands of basins, regional models using hundreds of basins, and individual single-basin models. We found that regional and single-basin models often performed better than the global model. This suggests we should judge models by their actual performance rather than their training approach.