<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD Journal Publishing DTD v3.0 20080202//EN" "https://jats.nlm.nih.gov/nlm-dtd/publishing/3.0/journalpublishing3.dtd">
<article xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" article-type="research-article" specific-use="SMUR" dtd-version="3.0" xml:lang="en">
<front>
<journal-meta>
<journal-id journal-id-type="publisher">EGUsphere</journal-id>
<journal-title-group>
<journal-title>EGUsphere</journal-title>
<abbrev-journal-title abbrev-type="publisher">EGUsphere</abbrev-journal-title>
<abbrev-journal-title abbrev-type="nlm-ta">EGUsphere</abbrev-journal-title>
</journal-title-group>
<issn pub-type="epub"></issn>
<publisher><publisher-name>Copernicus Publications</publisher-name>
<publisher-loc>Göttingen, Germany</publisher-loc>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="doi">10.5194/egusphere-2026-2237</article-id>
<title-group>
<article-title>Can We Trust LLMs for Complex Earth System Model Analysis? Silent Failure and Evidence from Module-Grounded Benchmarking</article-title>
</title-group>
<contrib-group><contrib contrib-type="author" xlink:type="simple"><contrib-id contrib-id-type="orcid">https://orcid.org/0000-0003-1582-4005</contrib-id>
<name name-style="western"><surname>Zhou</surname>
<given-names>Tian</given-names>
</name>
<xref ref-type="aff" rid="aff1">
<sup>1</sup>
</xref>
</contrib>
<contrib contrib-type="author" xlink:type="simple"><name name-style="western"><surname>Qian</surname>
<given-names>Yun</given-names>
</name>
<xref ref-type="aff" rid="aff1">
<sup>1</sup>
</xref>
</contrib>
<contrib contrib-type="author" xlink:type="simple"><contrib-id contrib-id-type="orcid">https://orcid.org/0000-0002-3221-9467</contrib-id>
<name name-style="western"><surname>Leung</surname>
<given-names>L. Ruby</given-names>
</name>
<xref ref-type="aff" rid="aff1">
<sup>1</sup>
</xref>
</contrib>
</contrib-group><aff id="aff1">
<label>1</label>
<addr-line>Pacific Northwest National Laboratory, Richland, WA, USA</addr-line>
</aff>
<pub-date pub-type="epub">
<day>27</day>
<month>04</month>
<year>2026</year>
</pub-date>
<volume>2026</volume>
<fpage>1</fpage>
<lpage>26</lpage>
<permissions>
<copyright-statement>Copyright: &#x000a9; 2026 Tian Zhou et al.</copyright-statement>
<copyright-year>2026</copyright-year>
<license license-type="open-access">
<license-p>This work is licensed under the Creative Commons Attribution 4.0 International License. To view a copy of this license, visit <ext-link ext-link-type="uri" xlink:href="https://creativecommons.org/licenses/by/4.0/">https://creativecommons.org/licenses/by/4.0/</ext-link></license-p>
</license>
</permissions>
<self-uri xlink:href="https://egusphere.copernicus.org/preprints/2026/egusphere-2026-2237/">This article is available from https://egusphere.copernicus.org/preprints/2026/egusphere-2026-2237/</self-uri>
<self-uri xlink:href="https://egusphere.copernicus.org/preprints/2026/egusphere-2026-2237/egusphere-2026-2237.pdf">The full text article is available as a PDF file from https://egusphere.copernicus.org/preprints/2026/egusphere-2026-2237/egusphere-2026-2237.pdf</self-uri>
<abstract>
<p>Large language models (LLMs) are becoming increasingly capable of complex scientific scripting, but this growing robustness creates a paradox: the more trustworthy their outputs appear, the more easily scientifically incorrect results can pass unnoticed. In Earth system model (ESM) analysis, such silent failures are more dangerous than visible crashes because they produce plausible figures and statistics that may be accepted without detailed inspection. We address this risk with ESFlow, a module-grounded agentic AI framework that constrains the LLM to compose workflows from validated analysis tools rather than generate arbitrary code. The LLM reads an auto-generated, self-describing catalog and outputs a workflow in YAML (a human-readable data-serialization format), which is then executed by a deterministic engine. We demonstrate this framework with a validated tool library for Energy Exascale Earth System Model (E3SM) land surface hydrology diagnostics in a benchmark spanning seven analysis tasks and six contemporary LLMs. Across both single-attempt runs and runs augmented with automatic self-debugging, the module-grounded approach attains an overall success rate above 80 %, maintains a low and stable silent-failure rate, and reaches 100 % success for the three high-capability models, whereas unconstrained Python code generation succeeds in only about 5 % of runs and sees its silent-failure rate rise from roughly 16 % to about 40 % under self-debugging. These results suggest that increasing LLM capability does not remove the reliability problem in scientific scripting; it makes silent failures more consequential by making incorrect outputs more convincing. The answer to the trust question posed in the title is therefore conditional: unconstrained code generation is not trustworthy for complex ESM analysis, whereas module-grounded workflow composition can be highly reliable for frontier models and remains substantially more robust under iterative self-debugging.
By shifting the LLM&apos;s role from code generation to the composition of trusted tools, this framework provides a safer, more scalable architecture for AI-assisted scientific discovery that is aligned with FAIR (findable, accessible, interoperable, and reusable) principles.</p>
</abstract>
<counts><page-count count="26"/></counts>
<funding-group>
<award-group id="gs1">
<funding-source>U.S. Department of Energy</funding-source>
<award-id>89233218CNA000001</award-id>
</award-group>
</funding-group>
</article-meta>
</front>
<body/>
<back>
</back>
</article>