https://doi.org/10.5194/egusphere-2026-1637
12 May 2026
Status: this preprint is open for discussion and under review for Hydrology and Earth System Sciences (HESS).

Streamflow prediction in data-scarce regions with semi-supervised deep learning

Tianlong Jia, Guoding Chen, Yao Li, Xinyu Chang, and Uwe Ehret

Abstract. Deep learning methods have demonstrated strong performance in streamflow prediction. However, they typically require large amounts of "labeled" data for supervised learning (SL), that is, meteorological forcing data paired with corresponding streamflow observations. The scarcity of streamflow observations limits the application of SL models across hydrologically diverse regions worldwide. To address this issue, we propose a two-stage semi-supervised learning (SSL) approach for streamflow prediction in data-scarce regions, based on the Contrastive Predictive Coding (CPC) method. CPC is a self-supervised learning method that learns data representations from "unlabeled" data (i.e., meteorological forcing time series without streamflow observations). In the first stage, CPC was used to pre-train an encoder and a Long Short-Term Memory (LSTM) network with a projection head, using a large number of meteorological sequences. In the second stage, we attached a linear layer to the pre-trained encoder and LSTM and fine-tuned the entire model architecture for streamflow prediction using labeled data. We developed and evaluated this approach for streamflow prediction with both regional models and single-basin models, using the CAMELS-DE dataset. We assessed the in-domain generalization performance of the regional models on the 1,265 basins in Germany that were used to pre-train and fine-tune the models. Moreover, we examined their zero-shot out-of-domain generalization performance on an additional 317 basins from CAMELS-DE that were not involved in model training. We benchmarked our approach against a baseline SL-trained model. Our results show that the SSL regional models outperform the SL baseline in both in-domain and zero-shot out-of-domain generalization performance under data-scarce conditions, when less than 10% of one-year labeled sequences are available.
SSL models yield significant improvements in median Nash-Sutcliffe Efficiency (NSE) of 0.137 (in-domain) and 0.139 (out-of-domain) with 0.5% of one-year labeled data. Additionally, SSL enhances the models' ability to predict low flows and floods under data-scarce conditions, reducing the median percent bias of the bottom 30% low-flow range (FLV) by 21.047 and the median Mean Absolute Percentage Error of peaks (MAPEpeak) by 13.933 (out-of-domain) with 1% of one-year labeled data. This improved performance stems from the informative feature representations learned from meteorological forcing inputs through CPC pre-training, which enhance prediction ability across diverse basins under data-scarce conditions. Moreover, the advantages of SSL are more pronounced for single-basin models when one year of labeled data is available. These results indicate a promising direction for leveraging SSL to develop hydrological counterparts of the foundation models that have recently revolutionized artificial intelligence research. Such hydrological foundation models are pre-trained on large-scale meteorological forcing datasets using self-supervised learning methods (e.g., CPC) and can be adapted to multiple hydrological tasks, such as the simulation of water temperature and soil moisture, by fine-tuning.
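For readers less familiar with the headline metric, the following is a minimal sketch of the standard Nash-Sutcliffe Efficiency definition, NSE = 1 - sum((obs - sim)^2) / sum((obs - mean(obs))^2). The function name and the toy series are illustrative assumptions; this is not the authors' evaluation code, and the abstract does not state how the per-basin values were aggregated beyond reporting medians.

```python
from statistics import fmean


def nash_sutcliffe_efficiency(observed, simulated):
    """Standard NSE: 1 minus squared error relative to the variance of observations."""
    if len(observed) != len(simulated):
        raise ValueError("observed and simulated must have the same length")
    if not observed:
        raise ValueError("series must not be empty")
    mean_obs = fmean(observed)
    # Sum of squared differences between observed and simulated flow.
    squared_error = sum((o - s) ** 2 for o, s in zip(observed, simulated))
    # Sum of squared deviations of the observations from their own mean.
    spread = sum((o - mean_obs) ** 2 for o in observed)
    if spread == 0:
        raise ValueError("NSE is undefined for a constant observed series")
    return 1.0 - squared_error / spread


# Toy streamflow values (hypothetical, for illustration only).
observed = [1.0, 2.0, 3.0, 4.0]
nse_perfect = nash_sutcliffe_efficiency(observed, observed)
nse_mean_only = nash_sutcliffe_efficiency(observed, [2.5, 2.5, 2.5, 2.5])
nse_one_miss = nash_sutcliffe_efficiency(observed, [1.0, 2.0, 3.0, 5.0])
```

An NSE of 1 indicates a perfect match, and an NSE of 0 means the simulation is no better than always predicting the mean of the observations, which gives a sense of scale for the reported median gains of 0.137 and 0.139.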

Publisher's note: Copernicus Publications remains neutral with regard to jurisdictional claims made in the text, published maps, institutional affiliations, or any other geographical representation in this paper. While Copernicus Publications makes every effort to include appropriate place names, the final responsibility lies with the authors. Views expressed in the text are those of the authors and do not necessarily reflect the views of the publisher.

Status: open (until 23 Jun 2026)


Data sets

CAMELS-DE: hydro-meteorological time series and attributes for 1582 catchments in Germany. Ralf Loritz et al., https://doi.org/10.5281/zenodo.16755906


Latest update: 13 May 2026
Short summary
Supervised learning-based streamflow prediction models typically rely on large volumes of meteorological forcing data paired with streamflow observations, but the scarcity of streamflow observations limits their applicability. To address this, we propose a novel method to improve prediction performance in data-scarce regions: it first pre-trains on abundant meteorological forcing data and then fine-tunes on the limited meteorological forcing data that are paired with streamflow observations.
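The abstract names Contrastive Predictive Coding as the pre-training method. As a rough illustration of the kind of objective CPC optimizes, the sketch below computes an InfoNCE-style loss: a vector predicted from the context is scored against several candidate embeddings, and the loss is the cross-entropy of picking the true future embedding among them. The function name, the plain dot-product score, and the toy vectors are assumptions made for illustration; the preprint's encoder, LSTM, projection head, and negative-sampling scheme are not reproduced here.

```python
import math


def info_nce_loss(prediction, candidates, positive_index):
    """InfoNCE-style loss for one prediction against a set of candidates.

    prediction: vector predicted from the context for a future time step
    candidates: embeddings, one true future (the positive) plus negatives
    positive_index: position of the true future embedding in candidates
    """
    # Dot-product similarity between the prediction and every candidate.
    scores = [sum(p * c for p, c in zip(prediction, cand)) for cand in candidates]
    # Numerically stable log-sum-exp over all candidate scores.
    max_score = max(scores)
    log_sum_exp = max_score + math.log(sum(math.exp(s - max_score) for s in scores))
    # Negative log-probability of the positive under a softmax over the scores.
    return log_sum_exp - scores[positive_index]


# Toy 2-D embeddings (hypothetical): the first candidate matches the prediction.
prediction = [1.0, 0.0]
candidates = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
loss_when_aligned = info_nce_loss(prediction, candidates, positive_index=0)
loss_when_opposed = info_nce_loss(prediction, candidates, positive_index=2)
loss_uninformative = info_nce_loss([0.0, 0.0], candidates, positive_index=1)
```

The loss is small when the prediction aligns with the true future embedding and large when it aligns with a negative. Minimizing it over many unlabeled meteorological sequences pushes the model toward representations that are predictive of upcoming forcing, which can then be reused in the fine-tuning stage.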