Can We Trust LLMs for Complex Earth System Model Analysis? Silent Failure and Evidence from Module-Grounded Benchmarking
Abstract. Large language models (LLMs) are becoming increasingly capable of complex scientific scripting, but this growing capability creates a paradox: the more trustworthy their outputs appear, the more easily scientifically incorrect results can pass unnoticed. In Earth system model (ESM) analysis, such silent failures are more dangerous than visible crashes because they produce plausible figures and statistics that may be accepted without detailed inspection. We address this risk with ESFlow, a module-grounded agentic AI framework that constrains the LLM to compose workflows from validated analysis tools rather than generate arbitrary code. The LLM reads an auto-generated, self-describing tool catalog and outputs a workflow in YAML (a human-readable data-serialization format), which is then executed by a deterministic engine. We demonstrate the framework with a validated tool library for Energy Exascale Earth System Model (E3SM) land surface hydrology diagnostics, in a benchmark spanning seven analysis tasks and six contemporary LLMs. Across both single-attempt runs and runs augmented with automatic self-debugging, the module-grounded approach attains an overall success rate above 80 %, maintains a low and stable silent-failure rate, and reaches 100 % success for the three high-capability models. Unconstrained Python code generation, by contrast, succeeds in only about 5 % of runs, and its silent-failure rate rises from roughly 16 % to about 40 % under self-debugging. These results suggest that increasing LLM capability does not remove the reliability problem in scientific scripting; it makes silent failures more consequential by making incorrect outputs more convincing. The answer to the trust question posed in the title is therefore conditional: unconstrained code generation is not trustworthy for complex ESM analysis, whereas module-grounded workflow composition can be highly reliable for frontier models and remains substantially more robust under iterative self-debugging. By shifting the LLM's role from code generation to the composition of trusted tools, this framework provides a safer, more scalable architecture for AI-assisted scientific discovery that is aligned with FAIR (findable, accessible, interoperable, and reusable) principles.
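To make the module-grounded composition idea concrete, the sketch below shows what an LLM-produced workflow of the kind described above might look like. It is a minimal illustration only: the YAML schema, field names (task, steps, tool, args), and tool identifiers are hypothetical placeholders invented for this example, not the actual ESFlow format or catalog, which the abstract does not specify.

```yaml
# Hypothetical module-grounded workflow (illustrative only; the real
# ESFlow schema and tool names are not given in the abstract).
task: "Seasonal runoff climatology from an E3SM land simulation"
steps:
  - tool: load_elm_output          # assumed catalog entry: read model output
    args:
      case: "e3sm_land_hist"       # placeholder case identifier
      variables: [QRUNOFF]
  - tool: regional_subset          # assumed catalog entry: clip to a named region
    args:
      region: "amazon_basin"
  - tool: seasonal_climatology     # assumed catalog entry: DJF/MAM/JJA/SON means
    args:
      variable: QRUNOFF
  - tool: plot_spatial_map         # assumed catalog entry: render diagnostic figure
    args:
      output: "runoff_seasonal_climatology.png"
```

The point of the pattern, under these assumptions, is that the LLM only selects and parameterizes entries from the validated catalog; the deterministic engine, not the model, performs the computation, which is what limits the opportunity for silent failures.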