the Creative Commons Attribution 4.0 License.
Streamflow prediction in data-scarce regions with semi-supervised deep learning
Abstract. Deep learning methods have demonstrated strong performance in streamflow prediction. However, they typically require large amounts of "labeled" data for supervised learning (SL), i.e., meteorological forcing data paired with corresponding streamflow observations. The scarcity of streamflow observations limits the application of SL models across hydrologically diverse regions worldwide. To address this issue, we propose a two-stage semi-supervised learning (SSL) approach for streamflow prediction in data-scarce regions, based on the Contrastive Predictive Coding (CPC) method. CPC is a self-supervised learning method that learns data representations from "unlabeled" data (i.e., meteorological forcing time series without streamflow observations). In the first stage, CPC was used to pre-train an encoder and a Long Short-Term Memory (LSTM) network with a projection head, using a large number of meteorological sequences. In the second stage, we attached a linear layer to the pre-trained encoder and LSTM, and fine-tuned the entire model architecture for streamflow prediction using labeled data. We developed and evaluated this approach for streamflow prediction in both regional models and single-basin models, using the CAMELS-DE dataset. We assessed the in-domain generalization performance of the regional models on the 1,265 basins in Germany used to pre-train and fine-tune the models. Moreover, we examined their zero-shot out-of-domain generalization performance on an additional 317 basins from CAMELS-DE that were not involved in model training. We benchmarked our approach against a baseline SL-trained model. Our results show that the SSL regional models outperform the SL baseline in both in-domain and zero-shot out-of-domain generalization performance under data-scarce conditions, when less than 10% of one-year labeled sequences are available.
SSL models yield significant improvements in median Nash-Sutcliffe Efficiency (NSE) of 0.137 (in-domain) and 0.139 (out-of-domain), with 0.5% of one-year labeled data. Additionally, SSL enhances the model's ability to predict low flows and floods under data-scarce conditions, reducing the median percent bias of the bottom 30% low flow range (FLV) by 21.047 and the median Mean Absolute Percentage Error of peaks (MAPEpeak) by 13.933 (out-of-domain), with 1% of one-year labeled data. This improved performance stems from the informative feature representations learned from meteorological forcing inputs through CPC pre-training, which enhance prediction ability across diverse basins under data-scarce conditions. Moreover, the advantages of SSL are more obvious for single-basin models when one-year labeled data is available. These results indicate a promising direction for leveraging SSL to develop hydrological foundation models, in the spirit of the foundation models that have recently revolutionized artificial intelligence research. Hydrological foundation models involve pre-training on large-scale meteorological forcing datasets using self-supervised learning methods (e.g., CPC), and can be adapted to multiple hydrological tasks by model fine-tuning, e.g., simulation of water temperature and soil moisture.
Status: open (until 06 Jul 2026)
- CC1: 'Comment on egusphere-2026-1637', Nima Zafarmomen, 14 May 2026
- RC1: 'Comment on egusphere-2026-1637', Anonymous Referee #1, 20 May 2026
To the Authors,
From my understanding of the paper, you introduced a semi-supervised learning framework consisting of two steps: an encoder and LSTM are first pre-trained through Contrastive Predictive Coding (CPC), using meteorological forcings and catchment attributes. The model is then fine-tuned with varying amounts of labeled streamflow data to simulate different levels of data scarcity.
You compared your approach with a “classical” LSTM architecture, trained with previously defined varying levels of data scarcity. You also experimented with these two methods on single basins.
This is a solid contribution in the domain of streamflow prediction in poorly gauged basins, and you also highlight some of the limitations of your approach, which is always appreciated.
However, I see some issues with the current state of your manuscript:
- For the pre-training of your CPC method, you randomly divided your sequences into train and validation sets. However, since your sequences can overlap by up to 11 months, I fear that you could be overfitting without noticing.
- Your fine-tuning training sets also rely on random splitting. While this may not be much of an issue at low levels of data scarcity (e.g., >=60% available), it could strongly affect the training and fine-tuning of your models at lower percentages of available data. In turn, this could distort the results and possibly your claims (which I hope is not the case).
- Your training/fine-tuning is performed for a small number of epochs, and your performance curves do not seem to show that convergence is achieved.
- You should be careful with the usage of “flood” in the manuscript or make it clearer that the events you look at indeed led to floods.
I recommend accepting with minor revisions, provided that the following comments are addressed.
Major comments:
- L. 175-182: if I understand correctly, the validation set could contain sequences that overlap by up to 11 months with sequences contained in the train set (since you sample randomly). How did you ensure that you do not overfit during CPC? You should either make it clearer that you addressed this issue already or design a different sampling strategy to ensure that you’re not overfitting.
- L. 187-189 and L. 271: My understanding is that the different seeds are only used for the training/fine-tuning phase. This is not an issue for low levels of scarcity (e.g., >=60% available data), but it could substantially change the results you get for high levels of scarcity.
- If the seeds were also used in the data splitting strategy, you should make this clearer when introducing your different training sets.
- If not, you need to provide some measure of uncertainty for your performance, as it could change greatly depending on the data present in the split. I would assume that, computation-wise, it shouldn’t be an issue to run this training/fine-tuning multiple times for higher levels of scarcity.
- Figure 5.: Related to my first comment
- It is unclear from Figure 5. whether any of the methods reached convergence. It seems to me that in most cases, none of the networks has reached convergence (high variability in some cases, score increasing in others).
- It also seems that the SSL method does not stabilize for some of the data splits.
- You should make sure that both networks achieve convergence to have a fair comparison. You should also include the curves for the three highest levels of scarcity (possibly in the appendix) to make sure that you are doing things correctly. A suggestion from my side would be to use a higher number of base epochs (e.g., 50) and to implement an early stopping strategy.
- All relevant figures: most figures showing model performance show only one curve (for seed=110). This choice can raise questions about reproducibility and stability. To convince me, you should either include a "stability" curve (e.g., mean +- std across seeds) or move all per-seed figures to the appendix (or plot all remaining curves on one figure in the appendix).
- General comment: you use streamflow as your indicator for “flood” events. Are the events you investigate related to real floods? I’m not familiar with the general conditions and values of streamflow in Germany, but I think you should make this more explicit. You should also state explicitly whether these catchments are prone to high-streamflow events (e.g., the Rhine Falls in Switzerland would fall into this category) or whether the events you investigate are “extremes” for these catchments. This relates to my minor comment about Fig 6. and B3.
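The overlap concern in the first major comment can be illustrated with a minimal sketch. It assumes one-year windows sampled with a 30-day stride over a 10-year daily record (these values, and all function names, are hypothetical and not taken from the manuscript) and compares a random split with a temporally blocked split.

```python
import random

def make_windows(n_days, length=365, stride=30):
    """Sliding windows as (start, end) day indices; end is exclusive."""
    return [(s, s + length) for s in range(0, n_days - length + 1, stride)]

def overlaps(a, b):
    """True if two half-open intervals share at least one day."""
    return a[0] < b[1] and b[0] < a[1]

def random_split(windows, val_fraction=0.2, seed=0):
    """Shuffle all windows and cut off a validation share (assumed setup)."""
    rng = random.Random(seed)
    shuffled = windows[:]
    rng.shuffle(shuffled)
    n_val = int(len(shuffled) * val_fraction)
    return shuffled[n_val:], shuffled[:n_val]

def blocked_split(windows, split_day):
    """Train windows end by split_day, validation windows start at or after it.
    Windows that straddle split_day are dropped."""
    train = [w for w in windows if w[1] <= split_day]
    val = [w for w in windows if w[0] >= split_day]
    return train, val

def count_leaky(train, val):
    """Number of validation windows sharing days with any training window."""
    return sum(any(overlaps(v, t) for t in train) for v in val)

windows = make_windows(n_days=10 * 365)
rand_train, rand_val = random_split(windows)
blk_train, blk_val = blocked_split(windows, split_day=8 * 365)

leaky_random = count_leaky(rand_train, rand_val)
leaky_blocked = count_leaky(blk_train, blk_val)
print(leaky_random, "of", len(rand_val), "random validation windows overlap training data")
print(leaky_blocked, "of", len(blk_val), "blocked validation windows overlap training data")
```

The blocked split discards the windows that straddle the split date, which costs some data but keeps the validation score independent of days already seen in training.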
Minor comments:
- Table 1. and Table 2.:
- It was hard for me to follow how the numbers of sequences in the test and validation were obtained.
- It is also unclear to me why the time period for the third row shows only 1999. From the number of sequences, my understanding is that training/fine-tuning occurred on the full 1970-1999 period.
- I think the captions of the tables could be improved to provide more information.
- L. 189:
- How did you choose your levels of data scarcity? Are they representative of any real-world situation? You could make this more explicit.
- Did you perform any check of the spatio-temporal coverage that these sets of data represent? This also relates to my second major comment.
- L. 271-272: my understanding is that you show the median of the performance (i.e., calculate all performances then take the median score). If this is what you mean, I’m fine with the phrasing. If you mean instead the performance of the median (i.e., train across seeds, calculate the ensemble median predictions and evaluate), you should adapt your phrasing to make it clearer.
- Figure 4. and Figure 7.: I understand you show the median performance. Did you observe substantial differences between your different “seed runs” or are the methods stable across seeds? You could add error bars for the performance range. This also relates to my comments on figures showing performance for only one seed.
- L. 302: given the curves shown in Figure 5., I would be careful with the claim about overfitting. It is unclear to me if your SSL does indeed improve on that regard since the validation score shows more unstable behavior. This relates to some extent to my third major comment. Depending on the “new” results, you may have to adapt (tone down) the claim or totally remove it.
- L. 303: From what you wrote in L. 210-211, I would assume the pre-training is the computationally expensive step of the method (although this scales well since you then fine-tune the same base architecture). You could maybe add more details (e.g., table in the appendix) about the resources (mostly training time, since hardware is the same) to strengthen your claim.
- Figure 6. and Figure B3. (and L. 316-317):
- You could convert mm/day to m3/s (or any flow units) to strengthen the claim about “highest streamflow events” (L. 316-317).
- I could not tell which dataset the two basins DE212760 and DE710280 belong to.
- If they are in test_out, you could add a sentence about it
- If they are in test_in, I would be interested in knowing how different from the training distribution these events are. This could explain the (almost) perfect modeling with the 100% training.
- Figure 6.: The SL performance increases as the amount of training data increases, which is consistent with previous studies. However, your SSL shows a small drop from 20% to 40% and a bigger drop from 60% to 80% before increasing again. Did you investigate this behavior with larger case studies? Could this be linked to my comments on convergence of the training?
- Figure B3.: we observe that the performance of the classical approach does not increase much when increasing the training data. Could this again relate to convergence? It could also relate to the comment above about the dataset to which this basin belongs.
- Table 3., 4. and 5.: add units for the metrics that are not unitless. I may have missed other places where this is needed.
- Section 4.3: I think the comparison is unfair to the LSTM.
- We know from previous research that one should not use LSTMs in such a context (single basin, 1 year of data) as they don’t perform well. You could remedy this in multiple ways:
- Use a process-based model calibrated on your basins as the baseline
- Use more data (e.g., 3-5 years). Although this is still very limited, it could make for a fairer comparison
- Drop the section and move it to the discussion of limitations (your method also seems to suffer from the lack of data, see Figure 8.).
- Table 5.: I would be interested in knowing the number of peaks you evaluate on.
- L. 390-391.: You could tone down your claims, as Figure 9. does not support them. While your method shows “better” time series, it captures neither the trend nor the flow peak.
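The distinction raised in the comment on L. 271-272 (median of the performance versus performance of the median) can be made concrete with a small sketch. The observed and simulated values are made up for illustration, and only seed 110 is mentioned in the review; the other seed labels are hypothetical.

```python
from statistics import median

def nse(obs, sim):
    """Nash-Sutcliffe Efficiency: 1 - SSE / variance term of the observations."""
    mean_obs = sum(obs) / len(obs)
    num = sum((o - s) ** 2 for o, s in zip(obs, sim))
    den = sum((o - mean_obs) ** 2 for o in obs)
    return 1.0 - num / den

# Illustrative observations and per-seed simulations (not from the manuscript).
obs = [1.0, 3.0, 2.0, 5.0, 4.0]
runs = {
    110: [1.2, 2.5, 2.4, 4.1, 4.3],
    111: [0.7, 3.6, 1.5, 5.8, 3.2],
    112: [1.5, 2.9, 2.9, 4.6, 4.9],
}

# Option 1: score each seed separately, then take the median score.
median_of_scores = median(nse(obs, sim) for sim in runs.values())

# Option 2: build a median ensemble per time step, then score it once.
ensemble = [median(values) for values in zip(*runs.values())]
score_of_median = nse(obs, ensemble)

print(round(median_of_scores, 3), round(score_of_median, 3))
```

In this toy case the ensemble median scores noticeably higher than the median of the per-seed scores, which is why the phrasing should state which of the two is reported.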
Notes:
- L. 167: you describe CAMELS-DE as providing data starting in 1951, but you only use data from 1965 onwards. Is this because the data are incomplete for the earlier years, or is there another reason (I’m not an expert on this dataset)?
- L. 240: you could drop either RMSE or MSE since they represent the exact same information but scaled. You could maybe replace one with the overall bias of your model.
- Section 4.3:
- Figure 8.: while you improve over the classical approach, the results for the first 5 basins show NSE values around 0.1, which remain quite low.
- This highlights that SSL struggles too in this context (this can also be seen in the instability of performance depending on the catchment).
- This relates to my minor comment about the need for this section, which could be covered more briefly in the discussion.
- L. 422-428: From my understanding, classical transfer learning approaches can transfer a relation between inputs (meteorological forcing, geographical data, catchment attributes) and outputs (e.g., observations of streamflow). On the other hand, SSL transfers only an encoding of meteorological “behavior” and geographical/catchment attributes. You are right that a transfer learning model should first be trained on a variety of catchments representative of the catchment you want to transfer to. However, when searching for relevant papers, I came across a paper from Alzhanov et al. [1], where they seem to address some of the concerns you mention about transfer learning. This could be an addition to balance your claims. Note that I only skimmed this reference.
- L. 525-526: You used the default values provided in the NeuralHydrology code. Have you tried different settings, e.g., reducing the number of timesteps between peaks and increasing the prominence? With this, you could study how your method performs in the case of multiple peaks in a short timeframe. I know you would have to go into the NeuralHydrology code, so I leave this as a note.
- Figure B3.: some of your labels are partially hidden (does not affect understanding)
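The note on L. 240 (RMSE and MSE carry the same information, while a bias term would add something new) can be checked with a short sketch. The two "models" and their values are invented for illustration, and the helper names are hypothetical.

```python
import math

def mse(obs, sim):
    """Mean squared error."""
    return sum((s - o) ** 2 for o, s in zip(obs, sim)) / len(obs)

def rmse(obs, sim):
    """Root mean squared error: a monotone transform of the MSE."""
    return math.sqrt(mse(obs, sim))

def mean_bias(obs, sim):
    """Mean signed error (simulated minus observed)."""
    return sum(s - o for o, s in zip(obs, sim)) / len(obs)

obs = [2.0, 4.0, 3.0, 5.0]
models = {
    "A": [3.0, 3.0, 4.0, 4.0],  # errors alternate +1 / -1, so they cancel on average
    "B": [2.8, 4.8, 3.8, 5.8],  # constant overestimation of 0.8
}

# Because sqrt is monotone, both metrics always rank the models identically.
rank_by_mse = sorted(models, key=lambda m: mse(obs, models[m]))
rank_by_rmse = sorted(models, key=lambda m: rmse(obs, models[m]))

# The bias separates the two error patterns, which MSE and RMSE cannot do.
biases = {m: mean_bias(obs, models[m]) for m in models}
print(rank_by_mse, rank_by_rmse, biases)
```

Here model B ranks better on both MSE and RMSE, yet it is the only one with a systematic offset, which is the kind of information an overall bias metric would expose.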
References:
[1] Alzhanov, A., Nugumanova, A., & Moreido, V. (2025). Transfer learning using the global Caravan dataset for developing a local river streamflow prediction model. Environmental Modelling & Software, 194, 106691. https://doi.org/10.1016/j.envsoft.2025.106691
Citation: https://doi.org/10.5194/egusphere-2026-1637-RC1
Data sets
CAMELS-DE: hydro-meteorological time series and attributes for 1582 catchments in Germany Ralf Loritz et al. https://doi.org/10.5281/zenodo.16755906
Viewed
| HTML | PDF | XML | Total | BibTeX | EndNote |
|---|---|---|---|---|---|
| 302 | 85 | 15 | 402 | 16 | 20 |
The paper “Streamflow prediction in data-scarce regions with semi-supervised deep learning” presents a two-stage semi-supervised learning framework for streamflow prediction in regions where observed discharge data are limited. The authors use Contrastive Predictive Coding (CPC) to pre-train an encoder and LSTM network using unlabeled meteorological forcing data, and then fine-tune the model with a limited amount of labeled streamflow data.
Overall, this is a timely and valuable contribution. The paper is methodologically sound, clearly motivated, and relevant to hydrological prediction in poorly gauged or ungauged basins. The results support the usefulness of semi-supervised learning for data-scarce streamflow prediction. Therefore, I believe the manuscript is worth publishing after minor revisions.
Minor comments: