<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD Journal Publishing DTD v3.0 20080202//EN" "https://jats.nlm.nih.gov/nlm-dtd/publishing/3.0/journalpublishing3.dtd">
<article xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" article-type="research-article" specific-use="SMUR" dtd-version="3.0" xml:lang="en">
<front>
<journal-meta>
<journal-id journal-id-type="publisher">EGUsphere</journal-id>
<journal-title-group>
<journal-title>EGUsphere</journal-title>
<abbrev-journal-title abbrev-type="publisher">EGUsphere</abbrev-journal-title>
<abbrev-journal-title abbrev-type="nlm-ta">EGUsphere</abbrev-journal-title>
</journal-title-group>
<issn pub-type="epub"></issn>
<publisher><publisher-name>Copernicus Publications</publisher-name>
<publisher-loc>Göttingen, Germany</publisher-loc>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="doi">10.5194/egusphere-2026-1637</article-id>
<title-group>
<article-title>Streamflow prediction in data-scarce regions with semi-supervised deep learning</article-title>
</title-group>
<contrib-group><contrib contrib-type="author" xlink:type="simple"><name name-style="western"><surname>Jia</surname>
<given-names>Tianlong</given-names>
<ext-link>https://orcid.org/0000-0001-5142-1321</ext-link>
</name>
<xref ref-type="aff" rid="aff1">
<sup>1</sup>
</xref>
</contrib>
<contrib contrib-type="author" xlink:type="simple"><name name-style="western"><surname>Chen</surname>
<given-names>Guoding</given-names>
<ext-link>https://orcid.org/0000-0002-7298-506X</ext-link>
</name>
<xref ref-type="aff" rid="aff2">
<sup>2</sup>
</xref>
</contrib>
<contrib contrib-type="author" xlink:type="simple"><name name-style="western"><surname>Li</surname>
<given-names>Yao</given-names>
<ext-link>https://orcid.org/0000-0002-5406-4494</ext-link>
</name>
<xref ref-type="aff" rid="aff3">
<sup>3</sup>
</xref>
</contrib>
<contrib contrib-type="author" xlink:type="simple"><name name-style="western"><surname>Chang</surname>
<given-names>Xinyu</given-names>
</name>
<xref ref-type="aff" rid="aff4">
<sup>4</sup>
</xref>
<xref ref-type="aff" rid="aff5">
<sup>5</sup>
</xref>
<xref ref-type="aff" rid="aff1">
<sup>1</sup>
</xref>
</contrib>
<contrib contrib-type="author" xlink:type="simple"><name name-style="western"><surname>Ehret</surname>
<given-names>Uwe</given-names>
<ext-link>https://orcid.org/0000-0003-3454-8755</ext-link>
</name>
<xref ref-type="aff" rid="aff1">
<sup>1</sup>
</xref>
</contrib>
</contrib-group><aff id="aff1">
<label>1</label>
<addr-line>Karlsruhe Institute of Technology (KIT), Institute of Water and Environment, Karlsruhe, Germany</addr-line>
</aff>
<aff id="aff2">
<label>2</label>
<addr-line>Zhejiang Institute of Hydraulics and Estuary (Zhejiang Institute of Marine Planning and Design), Hangzhou 310020, China</addr-line>
</aff>
<aff id="aff3">
<label>3</label>
<addr-line>LEESU, ENPC, Institut Polytechnique de Paris, Univ Paris Est Creteil, 77455 Marne-la-Vallée, France</addr-line>
</aff>
<aff id="aff4">
<label>4</label>
<addr-line>Huazhong University of Science and Technology, School of Civil and Hydraulic Engineering, Wuhan 430070, China</addr-line>
</aff>
<aff id="aff5">
<label>5</label>
<addr-line>Huazhong University of Science and Technology, Hubei Key Laboratory of Digital River Basin Science and Technology, Wuhan 430070, China</addr-line>
</aff>
<pub-date pub-type="epub">
<day>12</day>
<month>05</month>
<year>2026</year>
</pub-date>
<volume>2026</volume>
<fpage>1</fpage>
<lpage>37</lpage>
<permissions>
<copyright-statement>Copyright: &#x000a9; 2026 Tianlong Jia et al.</copyright-statement>
<copyright-year>2026</copyright-year>
<license license-type="open-access">
<license-p>This work is licensed under the Creative Commons Attribution 4.0 International License. To view a copy of this license, visit <ext-link ext-link-type="uri" xlink:href="https://creativecommons.org/licenses/by/4.0/">https://creativecommons.org/licenses/by/4.0/</ext-link></license-p>
</license>
</permissions>
<self-uri xlink:href="https://egusphere.copernicus.org/preprints/2026/egusphere-2026-1637/">This article is available from https://egusphere.copernicus.org/preprints/2026/egusphere-2026-1637/</self-uri>
<self-uri xlink:href="https://egusphere.copernicus.org/preprints/2026/egusphere-2026-1637/egusphere-2026-1637.pdf">The full text article is available as a PDF file from https://egusphere.copernicus.org/preprints/2026/egusphere-2026-1637/egusphere-2026-1637.pdf</self-uri>
<abstract>
<p>Deep learning methods have demonstrated strong performance in streamflow prediction. However, they typically require large amounts of &quot;labeled&quot; data for supervised learning (SL), i.e., meteorological forcing data paired with corresponding streamflow observations. The scarcity of streamflow observations limits the application of SL models across hydrologically diverse regions worldwide. To address this issue, we propose a two-stage semi-supervised learning (SSL) approach for streamflow prediction in data-scarce regions, based on the Contrastive Predictive Coding (CPC) method. CPC is a self-supervised learning method that learns data representations from &quot;unlabeled&quot; data (i.e., meteorological forcing time series without streamflow observations). In the first stage, CPC was used to pre-train an encoder and a Long Short-Term Memory (LSTM) network with a projection head on a large number of meteorological sequences. In the second stage, we attached a linear layer to the pre-trained encoder and LSTM, and fine-tuned the entire model for streamflow prediction using labeled data. We developed and evaluated this approach for both regional and single-basin models using the CAMELS-DE dataset. We assessed the in-domain generalization performance of the regional models on the 1,265 basins in Germany used to pre-train and fine-tune the models. Moreover, we examined their zero-shot out-of-domain generalization performance on an additional 317 basins from CAMELS-DE that were not involved in model training. We benchmarked our approach against a baseline SL-trained model. Our results show that the SSL regional models outperform the SL baseline in both in-domain and zero-shot out-of-domain generalization under data-scarce conditions, i.e., when less than 10% of one-year labeled sequences are available. SSL models yield significant improvements in median Nash-Sutcliffe Efficiency (NSE) of 0.137 (in-domain) and 0.139 (out-of-domain) with 0.5% of one-year labeled data. Additionally, SSL enhances the models' ability to predict low flows and floods under data-scarce conditions, reducing the median percent bias of the bottom 30% low-flow range (FLV) by 21.047 and the median Mean Absolute Percentage Error of peaks (MAPE<sub>peak</sub>) by 13.933 (out-of-domain) with 1% of one-year labeled data. This improved performance stems from the informative feature representations learned from the meteorological forcing inputs through CPC pre-training, which enhance prediction ability across diverse basins under data-scarce conditions. Moreover, the advantages of SSL are more pronounced for single-basin models when one year of labeled data is available. These results indicate a promising direction for leveraging SSL to develop hydrological foundation models, analogous to the foundation models that have recently revolutionized artificial intelligence research. Such models would be pre-trained on large-scale meteorological forcing datasets using self-supervised learning methods (e.g., CPC), and could then be adapted to multiple hydrological tasks by fine-tuning, e.g., simulation of water temperature and soil moisture.</p>
</abstract>
<counts><page-count count="37"/></counts>
</article-meta>
</front>
<body/>
<back>
</back>
</article>