<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD Journal Publishing DTD v3.0 20080202//EN" "https://jats.nlm.nih.gov/nlm-dtd/publishing/3.0/journalpublishing3.dtd">
<article xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" article-type="research-article" specific-use="SMUR" dtd-version="3.0" xml:lang="en">
<front>
<journal-meta>
<journal-id journal-id-type="publisher">EGUsphere</journal-id>
<journal-title-group>
<journal-title>EGUsphere</journal-title>
<abbrev-journal-title abbrev-type="publisher">EGUsphere</abbrev-journal-title>
<abbrev-journal-title abbrev-type="nlm-ta">EGUsphere</abbrev-journal-title>
</journal-title-group>
<issn pub-type="epub"></issn>
<publisher><publisher-name>Copernicus Publications</publisher-name>
<publisher-loc>Göttingen, Germany</publisher-loc>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="doi">10.5194/egusphere-2026-1965</article-id>
<title-group>
<article-title>Better data or better architecture? Improving deep-learning-based prediction in ungauged basins</article-title>
</title-group>
<contrib-group><contrib contrib-type="author" xlink:type="simple"><name name-style="western"><surname>Heudorfer</surname>
<given-names>Benedikt</given-names>
<ext-link>https://orcid.org/0000-0001-7801-9375</ext-link>
</name>
<xref ref-type="aff" rid="aff1">
<sup>1</sup>
</xref>
</contrib>
<contrib contrib-type="author" xlink:type="simple"><name name-style="western"><surname>Gupta</surname>
<given-names>Hoshin</given-names>
</name>
<xref ref-type="aff" rid="aff2">
<sup>2</sup>
</xref>
</contrib>
<contrib contrib-type="author" xlink:type="simple"><name name-style="western"><surname>Dolich</surname>
<given-names>Alexander</given-names>
<ext-link>https://orcid.org/0000-0003-4160-6765</ext-link>
</name>
<xref ref-type="aff" rid="aff3">
<sup>3</sup>
</xref>
</contrib>
<contrib contrib-type="author" xlink:type="simple"><name name-style="western"><surname>Loritz</surname>
<given-names>Ralf</given-names>
<ext-link>https://orcid.org/0000-0002-0540-6478</ext-link>
</name>
<xref ref-type="aff" rid="aff3">
<sup>3</sup>
</xref>
</contrib>
</contrib-group><aff id="aff1">
<label>1</label>
<addr-line>Karlsruhe Institute of Technology (KIT), Institute of Meteorology and Climate Research – Atmospheric Trace Gases and Remote Sensing, Karlsruhe, Germany</addr-line>
</aff>
<aff id="aff2">
<label>2</label>
<addr-line>Department of Hydrology and Atmospheric Sciences, The University of Arizona, Tucson, AZ, USA</addr-line>
</aff>
<aff id="aff3">
<label>3</label>
<addr-line>Karlsruhe Institute of Technology (KIT), Institute for Water and Environment, Karlsruhe, Germany</addr-line>
</aff>
<pub-date pub-type="epub">
<day>20</day>
<month>04</month>
<year>2026</year>
</pub-date>
<volume>2026</volume>
<fpage>1</fpage>
<lpage>30</lpage>
<permissions>
<copyright-statement>Copyright: &#x000a9; 2026 Benedikt Heudorfer et al.</copyright-statement>
<copyright-year>2026</copyright-year>
<license license-type="open-access">
<license-p>This work is licensed under the Creative Commons Attribution 4.0 International License. To view a copy of this licence, visit <ext-link ext-link-type="uri"  xlink:href="https://creativecommons.org/licenses/by/4.0/">https://creativecommons.org/licenses/by/4.0/</ext-link></license-p>
</license>
</permissions>
<self-uri xlink:href="https://egusphere.copernicus.org/preprints/2026/egusphere-2026-1965/">This article is available from https://egusphere.copernicus.org/preprints/2026/egusphere-2026-1965/</self-uri>
<self-uri xlink:href="https://egusphere.copernicus.org/preprints/2026/egusphere-2026-1965/egusphere-2026-1965.pdf">The full text article is available as a PDF file from https://egusphere.copernicus.org/preprints/2026/egusphere-2026-1965/egusphere-2026-1965.pdf</self-uri>
<abstract>
<p>Large-sample hydrology has recently been driven by two key developments: first, the introduction of hydrological benchmark datasets such as CAMELS-US and CARAVAN, and second, the emergence of deep‑learning modelling frameworks, particularly LSTM‑based regional models, which have demonstrated performance on par with, and in some cases exceeding, that of process-based models for streamflow prediction in gauged and ungauged settings. Building on these developments, we investigate whether (i) further enhanced LSTM architectures, (ii) new sets of static features, or (iii) a combination of both can significantly improve Predictions in Ungauged Basins (PUB). We evaluate a state-of-the-art regional LSTM model (base LSTM) against embedded (EMB-LSTM) and cross‑attention enhanced (CA-LSTM) variants, in combination with a suite of newly applied static features, namely MODIS surface reflectance bands, ALPHAEARTH embeddings, DEM-, meteorology- and catchment coordinate-derived auxiliary aggregates, and conventional CAMELS attributes. We tested these model-and-data combinations in pseudo‑ungauged 5‑fold cross‑validation across the 531 CAMELS‑US catchments. Model performance was quantified by the Nash‑Sutcliffe Efficiency (NSE), while latent‑space complexity was assessed via the Shannon effective rank (erank). Results show that the quality of static features is more important than architectural improvements. ALPHAEARTH embeddings attained the highest median NSE, but only in combination with auxiliary static feature data (ALPHAEARTH<sub>plus</sub>). Architectural refinements yielded only modest improvements. Among them, the relatively simple EMB-LSTM, which allowed the LSTM layer to better ingest ALPHAEARTH<sub>plus</sub> static features, outperformed the other architectures. 
With this combination, we achieved a median NSE of 0.726, significantly improving on the state-of-the-art PUB performance (NSE = 0.69) for the CAMELS-US dataset. Auxiliary analysis indicates that further improvement is possible when MODIS bands are added to the model as additional dynamic features. In conclusion, our study indicates that, broadly speaking, (a) better data is more important than better architecture, (b) better architecture is necessary only to accommodate better data, (c) the single-layer LSTM currently remains the most suitable core model, and (d) the Shannon effective rank of the latent space is a useful diagnostic for linking improved PUB performance to improved quality of the latent hydrological representation inside the model. Overall, this highlights the need for improved measurement‑derived descriptor datasets, especially for soil and geology.</p>
</abstract>
<counts><page-count count="30"/></counts>
</article-meta>
</front>
<body/>
<back>
</back>
</article>