Streamflow prediction in data-scarce regions with semi-supervised deep learning

Jia, Tianlong; Chen, Guoding; Li, Yao; Chang, Xinyu; Ehret, Uwe

doi:10.5194/egusphere-2026-1637

Preprints

https://doi.org/10.5194/egusphere-2026-1637

12 May 2026

Status: this preprint is open for discussion and under review for Hydrology and Earth System Sciences (HESS).

Abstract. Deep learning methods have demonstrated strong performance in streamflow prediction. However, they typically require large amounts of "labeled" data for supervised learning (SL), that is, meteorological forcing data paired with corresponding streamflow observations. The scarcity of streamflow observations limits the application of SL models across hydrologically diverse regions worldwide. To address this issue, we propose a two-stage semi-supervised learning (SSL) approach for streamflow prediction in data-scarce regions, based on the Contrastive Predictive Coding (CPC) method. CPC is a self-supervised learning method that learns data representations from "unlabeled" data (i.e., meteorological forcing time series without streamflow observations). In the first stage, CPC was used to pre-train an encoder and a Long Short-Term Memory (LSTM) network with a projection head, using a large number of meteorological sequences. In the second stage, we attached a linear layer to the pre-trained encoder and LSTM and fine-tuned the entire model for streamflow prediction using labeled data. We developed and evaluated this approach for streamflow prediction in both regional models and single-basin models, using the CAMELS-DE dataset. We assessed the in-domain generalization performance of the regional models on the 1,265 basins in Germany that were used to pre-train and fine-tune the models. Moreover, we examined their zero-shot out-of-domain generalization performance on an additional 317 basins from CAMELS-DE that were not involved in model training. We benchmarked our approach against a baseline SL-trained model. Our results show that the SSL regional models outperform the SL baseline in both in-domain and zero-shot out-of-domain generalization performance under data-scarce conditions, when less than 10% of one-year labeled sequences are available. SSL models yield significant improvements in median Nash-Sutcliffe Efficiency (NSE) of 0.137 (in-domain) and 0.139 (out-of-domain) with 0.5% of one-year labeled data. Additionally, SSL enhances the model's ability to predict low flows and floods under data-scarce conditions, reducing the median percent bias of the bottom 30% low flow range (FLV) by 21.047 and the median Mean Absolute Percentage Error of peaks (MAPE_peak) by 13.933 (out-of-domain) with 1% of one-year labeled data. This improved performance stems from the informative feature representations learned from meteorological forcing inputs through CPC pre-training, which enhance prediction ability across diverse basins under data-scarce conditions. Moreover, the advantages of SSL are more pronounced for single-basin models when one-year labeled data is available. These results indicate a promising direction for leveraging SSL to develop hydrological foundation models, building on the foundation models that have recently revolutionized artificial intelligence research. Hydrological foundation models involve pre-training on large-scale meteorological forcing datasets using self-supervised learning methods (e.g., CPC), and can be adapted to multiple hydrological tasks by fine-tuning, e.g., simulation of water temperature and soil moisture.
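
The contrastive idea behind CPC pre-training can be illustrated with a minimal sketch. The abstract does not state the loss, so the code below assumes the generic InfoNCE form with a plain dot-product score (the original CPC formulation uses a learned bilinear score instead). The function name `info_nce` and the toy vectors are hypothetical and are not taken from the manuscript.

```python
import math

def info_nce(contexts, futures):
    """Mean InfoNCE loss over a batch (illustrative sketch).

    contexts[i] is a context vector (e.g. an LSTM state) and futures[i] is the
    encoding of the step that truly follows it. All other futures[j] act as
    negatives. Scores are plain dot products, which is an assumption here.
    """
    losses = []
    for i, c in enumerate(contexts):
        scores = [sum(a * b for a, b in zip(c, z)) for z in futures]
        # Numerically stable log-sum-exp over all candidate futures.
        m = max(scores)
        log_norm = m + math.log(sum(math.exp(s - m) for s in scores))
        # Negative log-probability assigned to the true future.
        losses.append(log_norm - scores[i])
    return sum(losses) / len(losses)

# Uninformative contexts cannot single out the true future,
# so the loss equals log(batch size).
uninformative = info_nce([[0.0, 0.0]] * 3,
                         [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])

# Contexts aligned with their own future give a much lower loss.
aligned = info_nce([[5.0, 0.0], [0.0, 5.0]],
                   [[1.0, 0.0], [0.0, 1.0]])
```

Minimizing such a loss requires no streamflow observations, which is why the pre-training stage can draw on meteorological sequences alone.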

Received: 25 Mar 2026 – Discussion started: 12 May 2026

Publisher's note: Copernicus Publications remains neutral with regard to jurisdictional claims made in the text, published maps, institutional affiliations, or any other geographical representation in this paper. While Copernicus Publications makes every effort to include appropriate place names, the final responsibility lies with the authors. Views expressed in the text are those of the authors and do not necessarily reflect the views of the publisher.

Status: open (until 02 Aug 2026)

CC1: 'Comment on egusphere-2026-1637', Nima Zafarmomen, 14 May 2026
The paper “Streamflow prediction in data-scarce regions with semi-supervised deep learning” presents a two-stage semi-supervised learning framework for streamflow prediction in regions where observed discharge data are limited. The authors use Contrastive Predictive Coding (CPC) to pre-train an encoder and LSTM network using unlabeled meteorological forcing data, and then fine-tune the model with a limited amount of labeled streamflow data.
Overall, this is a timely and valuable contribution. The paper is methodologically sound, clearly motivated, and relevant to hydrological prediction in poorly gauged or ungauged basins. The results support the usefulness of semi-supervised learning for data-scarce streamflow prediction. Therefore, I believe the manuscript is worth publishing after minor revisions.
Minor comments:
The authors should more clearly explain why CPC was selected over other self-supervised learning methods, and briefly discuss whether alternative SSL methods may perform differently.

The study uses CAMELS-DE, which is a high-quality dataset from Germany. The authors should discuss more explicitly how well the proposed method may transfer to regions with poorer data quality or stronger hydroclimatic variability.

The results show that SSL performs best when labeled data are very limited, while supervised learning performs better when more labeled data are available. The authors should emphasize this practical threshold more clearly in the conclusions.

The manuscript reports improvements for flood and low-flow prediction, but the discussion of extreme-event performance could be slightly expanded, especially regarding cases where both SSL and SL underestimate flood peaks.

The authors mention limited computational resources and no detailed hyperparameter tuning for CPC pre-training. This limitation should be stated more clearly in the main discussion, as model performance may depend on pre-training duration and parameter choices.

I strongly recommend that the authors cite “Analysis of historical global warming impacts on climatological trends for the partially gauged Hirmand River Basin based on multiple data products and bias correction methods” because it is relevant to streamflow and hydrological prediction in partially gauged/data-scarce basins.

Citation: https://doi.org/10.5194/egusphere-2026-1637-CC1
RC1: 'Comment on egusphere-2026-1637', Anonymous Referee #1, 20 May 2026
To the Authors,
From my understanding of the paper, you introduced a semi-supervised learning framework consisting of two steps: an encoder and LSTM are first pre-trained through Contrastive Predictive Coding (CPC), using meteorological forcings and catchment attributes. The model is then fine-tuned with varying amounts of labeled streamflow data to simulate different levels of data scarcity.
You compared your approach with a “classical” LSTM architecture, trained with previously defined varying levels of data scarcity. You also experimented with these two methods on single basins.
This is a solid contribution in the domain of streamflow prediction in poorly gauged basins, and you also highlight some of the limitations of your approach, which is always appreciated.
However, I see some issues with the current state of your manuscript:
For the pre-training of your CPC method, you randomly divided your sequences into training and validation sets. However, since your sequences can overlap by up to 11 months, I fear that you could be overfitting without noticing.

Your fine-tuning training sets also rely on random splitting. While this may not be much of an issue for low levels of data scarcity (e.g., >=60% available), it could strongly affect the training and fine-tuning of your models for lower percentages of available data. In turn, this could distort the results and possibly your claims (which I hope not).

Your training/fine-tuning is performed on a small number of epochs, and your performance curves do not seem to show that convergence is achieved.

You should be careful with the usage of “flood” in the manuscript or make it clearer that the events you look at indeed led to floods.

I recommend accepting with minor revisions, provided that the following comments are addressed.
Major comments:
L. 175-182: if I understand correctly, the validation set could contain sequences that overlap by up to 11 months with sequences contained in the train set (since you sample randomly). How did you ensure that you do not overfit during CPC? You should either make it clearer that you addressed this issue already or design a different sampling strategy to ensure that you’re not overfitting.

L. 187-189 and L. 271: My understanding is that the different seeds are only used for the training/fine-tuning phase. This is not an issue for low levels of scarcity (e.g., >=60% available data), but it could change the results considerably for high levels of scarcity.
If the seeds were also used in the data splitting strategy, you should make this clearer when introducing your different training sets.

If not, you need to provide some measure of uncertainty for your performance, as it could change greatly depending on the data present in the split. I would assume that, computation-wise, it shouldn’t be an issue to run this training/fine-tuning multiple times for higher levels of scarcity.

Figure 5.: Related to my first comment
It is unclear from Figure 5. whether any of the methods reached convergence. It seems to me that in most cases, none of the networks has reached convergence (high variability in some cases, score increasing in others).

It also seems that the SSL method does not stabilize for some of the data splits.

You should make sure that both networks achieve convergence to have a fair comparison. You should also include the curves for the three highest levels of scarcity (possibly in the appendix) to make sure that you are doing things correctly. A suggestion from my side would be to use a higher number of base epochs (e.g., 50) and to implement an early stopping strategy.

All relevant figures: most figures showing model performance show only one curve (for seed=110). This choice can raise questions about reproducibility and stability. To convince me, you should either include a "stability" curve (e.g., mean +- std across seeds) or move all per-seed figures to the appendix (or plot all remaining curves on one figure in the appendix).

General comment: you use streamflow as your indicator for “flood” events. Are the events you investigate related to real floods? I’m not aware of the general conditions and values of streamflow in Germany, but I think you should make it more explicit. You should also make it explicit whether these catchments are prone to high-streamflow events (e.g., the Rhine Falls in Switzerland would fall into this category) or whether the events you investigate are “extremes” for these catchments. This relates to my minor comment about Fig 6. and B3.
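
The overlap concern raised for L. 175-182 can be illustrated with a small sketch. It assumes 12-month windows that start one month apart, so that neighbouring windows share up to 11 months. The window length, the 10-year toy record, and the helper names `overlaps`, `leaking`, and `blocked_split` are illustrative assumptions, not the authors' sampling code.

```python
import random

WINDOW = 12  # months per sequence (assumed for this sketch)

def overlaps(start_a, start_b, window=WINDOW):
    """True if two windows of equal length share at least one month."""
    return abs(start_a - start_b) < window

def leaking(train, val):
    """Validation windows that share months with any training window."""
    return [v for v in val if any(overlaps(v, t) for t in train)]

def blocked_split(starts, val_fraction=0.2, window=WINDOW):
    """Hold out the latest windows and drop training windows reaching into them."""
    starts = sorted(starts)
    n_val = max(1, int(len(starts) * val_fraction))
    val = starts[-n_val:]
    train = [s for s in starts[:-n_val] if s + window <= val[0]]
    return train, val

starts = list(range(120))  # one window starting every month for 10 years

# Random split: shuffle the window starts and hold out 20% of them.
rng = random.Random(0)
shuffled = starts[:]
rng.shuffle(shuffled)
rand_val, rand_train = shuffled[:24], shuffled[24:]

# Blocked split: hold out the last 20% and leave a gap before it.
blk_train, blk_val = blocked_split(starts)

n_leak_random = len(leaking(rand_train, rand_val))
n_leak_blocked = len(leaking(blk_train, blk_val))
```

In this toy setting the blocked split has no validation window that shares months with a training window, at the cost of discarding the training windows that straddle the boundary. With random assignment at least one validation window always overlaps a training window.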

Minor comments:
Table 1. and Table 2.:
It was hard for me to follow how the numbers of sequences in the test and validation were obtained.

It is also unclear to me why the time period for the third row shows only 1999. From the number of sequences, my understanding is that training/fine-tuning occurred on the full 1970-1999 period.

I think the captions of the tables could be improved to provide more information.

L. 189:
How did you choose your levels of data scarcity? Are they representative of any real-world situation? You could make this more explicit.

Did you perform any check in the spatio-temporal coverage that these sets of data represent? This relates also to my second major comment.

L. 271-272: my understanding is that you show the median of the performance (i.e., calculate all performances then take the median score). If this is what you mean, I’m fine with the phrasing. If you mean instead the performance of the median (i.e., train across seeds, calculate the ensemble median predictions and evaluate), you should adapt your phrasing to make it clearer.

Figure 4. and Figure 7.: I understand you show the median performance. Did you observe substantial differences between your different “seed runs” or are the methods stable across seeds? You could add error bars for the performance range. This also relates to my comments on figures showing performance for only one seed.

L. 302: given the curves shown in Figure 5., I would be careful with the claim about overfitting. It is unclear to me if your SSL does indeed improve on that regard since the validation score shows more unstable behavior. This relates to some extent to my third major comment. Depending on the “new” results, you may have to adapt (tone down) the claim or totally remove it.

L. 303: From what you wrote in L. 210-211, I would assume the pre-training is the computationally expensive step of the method (although this scales well since you then fine-tune the same base architecture). You could maybe add more details (e.g., table in the appendix) about the resources (mostly training time, since hardware is the same) to strengthen your claim.

Figure 6. and Figure B3. (and L. 316-317):
You could convert mm/day to m3/s (or any flow units) to strengthen the claim about “highest streamflow events” (L. 316-317).

I did not understand which dataset the two basins DE212760 and DE710280 belong to.
If they are in test_out, you could add a sentence about it.

If they are in test_in, I would be interested in knowing how different from the training distribution these events are. This could explain the (almost) perfect modeling with the 100% training.

Figure 6.: The SL performance increases as the amount of training data increases, which is consistent with previous studies. However, your SSL shows a small drop from 20% to 40% and a bigger drop from 60% to 80% before increasing again. Did you investigate this behavior with larger case studies? Could this be linked to my comments on convergence of the training?

Figure B3.: we observe that the performance of the classical approach does not increase much when increasing the training data. Could this again relate to the convergence? It could also relate to the comment above about the dataset to which this basin belongs.

Table 3., 4. and 5.: add units for the metrics that are not unitless. I may have missed other places where this is needed.

Section 4.3: I think the comparison is unfair to the LSTM.
We know from previous research that one should not use LSTMs in such context (single basin, 1 year of data) as they don’t perform well. You could remedy this in multiple ways:
Use a process-based model calibrated on your basins as the baseline

Use more data (e.g., 3-5 years). Although this is still very limited, it could make for a fairer comparison

Drop the section and move it in the discussion of limitations (your method seems to also suffer from the lack of data, see Figure 8.).

Table 5.: I would be interested in knowing the number of peaks you evaluate on.

L. 390-391.: You could tone down your claims as Figure 9. does not support them. While your method shows “better” time series, it does not capture the trend nor the flow peak.
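
The distinction drawn in the comment on L. 271-272 between the median of the performance and the performance of the median can be made concrete with a toy example. The observations and the three per-seed predictions below are invented for illustration, and `nse` is a plain implementation of the standard Nash-Sutcliffe Efficiency, not the authors' evaluation code.

```python
from statistics import median

def nse(obs, sim):
    """Nash-Sutcliffe Efficiency: 1 - SSE / variance of obs around its mean."""
    mean_obs = sum(obs) / len(obs)
    num = sum((o - s) ** 2 for o, s in zip(obs, sim))
    den = sum((o - mean_obs) ** 2 for o in obs)
    return 1.0 - num / den

obs = [1.0, 2.0, 3.0, 4.0]

# Hypothetical predictions from three seeds for the same basin.
runs = [
    [1.0, 2.0, 3.0, 2.0],
    [3.0, 2.0, 3.0, 4.0],
    [1.0, 2.0, 1.0, 4.0],
]

# (a) Median of the performance: score each run, then take the median score.
median_of_scores = median(nse(obs, r) for r in runs)

# (b) Performance of the median: build the per-time-step median prediction
# across seeds, then score that single ensemble series.
ensemble = [median(step) for step in zip(*runs)]
score_of_median = nse(obs, ensemble)
```

Here each run misses a different time step, so every individual NSE is about 0.2 while the ensemble median reproduces the observations exactly and scores 1.0. The two quantities can therefore differ substantially, which is why the phrasing matters.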

Notes:
L. 167: you describe CAMELS-DE as providing data starting in 1951, but you only use data from 1965 onwards. Is it because the data is not complete for these years or is there another reason (I’m not an expert of this dataset)?

L. 240: you could drop either RMSE or MSE since they represent the exact same information but scaled. You could maybe replace one with the overall bias of your model.

Section 4.3:
Figure 8.: while you improve over the classical approach, the results for the first 5 basins show NSE values around 0.1, which remain quite low.
This highlights that SSL struggles too in this context (this can also be seen for the instability in performance depending on the catchment).

This relates to my minor comment about the need of this section, which could be discussed less extensively in the discussion.

L. 422-428: From my understanding, classical transfer learning approaches can transfer a relation between inputs (meteorological forcing, geographical data, catchment attributes) and outputs (e.g., observations of streamflow). On the other hand, SSL transfers only an encoding of meteorological “behavior” and geographical/catchment attributes. You are right that a transfer learning model should first be trained on a variety of catchments representative of the catchment you want to transfer to. However, when searching for relevant papers, I came across a paper from Alzhanov et al. [1], where they seem to address some of the concerns you mention about transfer learning. This could be an addition to balance your claims. Note that I only skimmed this reference.

L. 525-526: You used the base values as those provided in the code of NeuralHydrology. Have you tried different approaches, e.g., by reducing the number of timesteps between peaks and increasing the prominence? With this, you could study how your method performs in case of multiple peaks in a short timeframe. I know you would have to go into the NeuralHydrology code, so I leave this as a note.

Figure B3.: some of your labels are partially hidden (does not affect understanding)
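
The note on L. 240 can be checked with a short sketch: RMSE is the square root of MSE, so both rank models identically, whereas a mean bias carries different information. The three candidate prediction series are invented for illustration, and `mean_bias` is only one possible definition of an overall bias.

```python
import math

def mse(obs, sim):
    return sum((o - s) ** 2 for o, s in zip(obs, sim)) / len(obs)

def rmse(obs, sim):
    # Monotone transform of MSE, so it cannot change a ranking.
    return math.sqrt(mse(obs, sim))

def mean_bias(obs, sim):
    """Mean of (sim - obs); positive values indicate overestimation."""
    return sum(s - o for o, s in zip(obs, sim)) / len(obs)

obs = [2.0, 4.0, 6.0]
candidates = {
    "a": [3.0, 5.0, 7.0],  # always 1 too high
    "b": [1.0, 5.0, 5.0],  # errors of size 1 with mixed sign
    "c": [2.0, 4.0, 9.0],  # one large error
}

rank_mse = sorted(candidates, key=lambda k: mse(obs, candidates[k]))
rank_rmse = sorted(candidates, key=lambda k: rmse(obs, candidates[k]))

bias_a = mean_bias(obs, candidates["a"])
bias_b = mean_bias(obs, candidates["b"])
```

Models "a" and "b" have the same MSE and RMSE, yet "a" overestimates systematically while "b" slightly underestimates on average, which only the bias reveals.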

References:
[1] Alzhanov, A., Nugumanova, A., & Moreido, V. (2025). Transfer learning using the global Caravan dataset for developing a local river streamflow prediction model. Environmental Modelling & Software, 194, 106691. https://doi.org/10.1016/j.envsoft.2025.106691

Citation: https://doi.org/10.5194/egusphere-2026-1637-RC1
RC2: 'Comment on egusphere-2026-1637', Anonymous Referee #2, 06 Jul 2026

The manuscript, “Streamflow prediction in data-scarce regions with semi-supervised deep learning,” proposes a two-stage semi-supervised learning method for streamflow prediction in data-scarce regions. This two-stage framework first involves pre-training an encoder and an LSTM network on large volumes of “unlabeled” data (i.e., meteorological forcings and basin attributes) using Contrastive Predictive Coding (CPC). Then, it involves fine-tuning the pre-trained network on a limited amount of “labeled” data (i.e., meteorological forcings and streamflow measurements) using conventional supervised learning methods. This work evaluates the proposed framework for both regional and single-basin models under in-domain and out-of-domain prediction settings, and compares its performance with that of a conventional supervised LSTM model. The comprehensive performance analysis, based on a range of evaluation metrics (including those targeting low-flow and high-flow conditions), provides valuable insights into the strengths of the proposed approach.

Overall, this study provides a meaningful and solid contribution to the development of streamflow prediction methods for data-scarce basins, where sufficient streamflow observations for supervised learning models are often unavailable or of limited quality. I therefore recommend the manuscript for publication, subject to minor revisions.
Major comments:

1. The fine-tuning datasets (e.g., Train80%) were generated by randomly sampling from the full dataset. Could the authors explain why this strategy was chosen instead of using temporally continuous subsets (e.g., the first 30, 20, 10, or 5 years of observations)? Such temporal subsets may better represent real-world data-scarce scenarios, where only a limited historical record is available (e.g., in some basins, only the latest 10 years of data are available). Evaluating the proposed framework under these conditions could provide additional insights for practitioners developing streamflow prediction models in data-scarce regions.

2. The median NSE values for the in-domain and out-of-domain experiments appear to be very similar, regardless of the amount of labeled data used (Figures 4 and 7). Could the authors provide an explanation for this observation? This result is somewhat surprising, as out-of-domain prediction is generally expected to be more challenging than in-domain prediction due to differences in data feature distributions between the source and target datasets.

3. The authors train models using data across all flow regimes (including both low-flow and high-flow conditions) and evaluate performance using FLV and FHV metrics to assess low-flow and high-flow behavior, respectively. Have the authors considered training SSL or baseline models exclusively on high-flow data and evaluating their performance specifically under high-flow conditions? Such an evaluation could provide additional insight into the models’ ability to learn regime-specific dynamics and their robustness under extreme flow conditions, although it may be outside the scope of the present study.

4. In Line 270, the authors state that the results reported in Sections 4.1 and 4.2 are the median values obtained from 10 runs with different random seeds. However, the hydrographs presented in the manuscript (e.g., Figure 6) appear to show predictions from only a single run. I recommend presenting the ensemble prediction uncertainty across all 10 runs, for example by showing the minimum, median (or mean), and maximum predicted streamflow values, or by including confidence bands around the median prediction. This would provide readers with a clearer assessment of the robustness and stability of the proposed method with respect to random initialization.

5. Please use the term flood with caution throughout the manuscript, or clarify that the high-flow events analyzed in this study correspond to actual flood events. If the evaluated events do not necessarily result in flooding, terms such as high-flow events or peak flows may be more appropriate.
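
The two subsetting strategies contrasted in comment 1 can be sketched as follows. This is a simplification that selects whole years rather than the sequences used in the manuscript. The 1970-1999 range follows the period mentioned in RC1, and the helper names `random_subset` and `latest_block` are hypothetical.

```python
import random

def random_subset(years, fraction, seed=0):
    """Pick years at random, mimicking a randomly sampled fine-tuning set."""
    k = max(1, round(len(years) * fraction))
    return sorted(random.Random(seed).sample(years, k))

def latest_block(years, fraction):
    """Keep only the most recent contiguous block of years."""
    k = max(1, round(len(years) * fraction))
    return sorted(years)[-k:]

years = list(range(1970, 2000))  # assumed 30-year record, 1970-1999

rand = random_subset(years, 0.2)   # 6 years scattered over the record
block = latest_block(years, 0.2)   # the 6 most recent years, 1994-1999
```

Both subsets contain the same amount of data, but the random one can sample many different hydroclimatic conditions across three decades, whereas the contiguous block only sees the conditions of its own period. Comparing models trained on each would show how much of the reported skill depends on that temporal coverage.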

Citation: https://doi.org/10.5194/egusphere-2026-1637-RC2

Data sets

CAMELS-DE: hydro-meteorological time series and attributes for 1582 catchments in Germany. Ralf Loritz et al. https://doi.org/10.5281/zenodo.16755906

Viewed

Total article views: 468 (including HTML, PDF, and XML)

HTML	PDF	XML	Total	BibTeX	EndNote
351	99	18	468	20	25

Views and downloads (calculated since 12 May 2026)

Month	HTML	PDF	XML	Total
May 2026	302	85	15	402
Jun 2026	35	12	3	50
Jul 2026	14	2	0	16

Latest update: 13 Jul 2026

Short summary

Supervised learning-based streamflow prediction models typically rely on large volumes of meteorological forcing data paired with streamflow observations, but the scarcity of streamflow observations limits their applicability. To address this, we propose a novel method to improve prediction performance in data-scarce regions: it first pre-trains on abundant meteorological forcing data and then fine-tunes using limited meteorological forcing data paired with streamflow observations.

