<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD Journal Publishing DTD v3.0 20080202//EN" "https://jats.nlm.nih.gov/nlm-dtd/publishing/3.0/journalpublishing3.dtd">
<article xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" article-type="research-article" specific-use="SMUR" dtd-version="3.0" xml:lang="en">
<front>
<journal-meta>
<journal-id journal-id-type="publisher">EGUsphere</journal-id>
<journal-title-group>
<journal-title>EGUsphere</journal-title>
<abbrev-journal-title abbrev-type="publisher">EGUsphere</abbrev-journal-title>
<abbrev-journal-title abbrev-type="nlm-ta">EGUsphere</abbrev-journal-title>
</journal-title-group>
<issn pub-type="epub"></issn>
<publisher><publisher-name>Copernicus Publications</publisher-name>
<publisher-loc>Göttingen, Germany</publisher-loc>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="doi">10.5194/egusphere-2026-539</article-id>
<title-group>
<article-title>Task aggregation as a strategy to optimize Earth System Model workflows in HPC: assessing real scenarios with EC-Earth</article-title>
</title-group>
<contrib-group><contrib contrib-type="author" xlink:type="simple"><name name-style="western"><surname>Goitia</surname>
<given-names>Pablo</given-names>
<ext-link>https://orcid.org/0009-0004-2011-6462</ext-link>
</name>
<xref ref-type="aff" rid="aff1">
<sup>1</sup>
</xref>
<xref ref-type="aff" rid="aff2">
<sup>2</sup>
</xref>
</contrib>
<contrib contrib-type="author" xlink:type="simple"><name name-style="western"><surname>Marciani</surname>
<given-names>Manuel G.</given-names>
<ext-link>https://orcid.org/0000-0002-9852-3322</ext-link>
</name>
<xref ref-type="aff" rid="aff1">
<sup>1</sup>
</xref>
</contrib>
<contrib contrib-type="author" xlink:type="simple"><name name-style="western"><surname>Castrillo</surname>
<given-names>Miguel</given-names>
<ext-link>https://orcid.org/0000-0003-1826-623X</ext-link>
</name>
<xref ref-type="aff" rid="aff1">
<sup>1</sup>
</xref>
</contrib>
<contrib contrib-type="author" xlink:type="simple"><name name-style="western"><surname>Acosta</surname>
<given-names>Mario C.</given-names>
<ext-link>https://orcid.org/0000-0001-7054-8168</ext-link>
</name>
<xref ref-type="aff" rid="aff1">
<sup>1</sup>
</xref>
</contrib>
</contrib-group><aff id="aff1">
<label>1</label>
<addr-line>Barcelona Supercomputing Center (BSC), Barcelona, Spain</addr-line>
</aff>
<aff id="aff2">
<label>2</label>
<addr-line>University of Cantabria (UC), Santander, Spain</addr-line>
</aff>
<pub-date pub-type="epub">
<day>21</day>
<month>04</month>
<year>2026</year>
</pub-date>
<volume>2026</volume>
<fpage>1</fpage>
<lpage>21</lpage>
<permissions>
<copyright-statement>Copyright: &#x000a9; 2026 Pablo Goitia et al.</copyright-statement>
<copyright-year>2026</copyright-year>
<license license-type="open-access">
<license-p>This work is licensed under the Creative Commons Attribution 4.0 International License. To view a copy of this licence, visit <ext-link ext-link-type="uri"  xlink:href="https://creativecommons.org/licenses/by/4.0/">https://creativecommons.org/licenses/by/4.0/</ext-link></license-p>
</license>
</permissions>
<self-uri xlink:href="https://egusphere.copernicus.org/preprints/2026/egusphere-2026-539/">This article is available from https://egusphere.copernicus.org/preprints/2026/egusphere-2026-539/</self-uri>
<self-uri xlink:href="https://egusphere.copernicus.org/preprints/2026/egusphere-2026-539/egusphere-2026-539.pdf">The full text article is available as a PDF file from https://egusphere.copernicus.org/preprints/2026/egusphere-2026-539/egusphere-2026-539.pdf</self-uri>
<abstract>
<p>Earth System Models (ESMs) are commonly executed as complex workflows consisting of numerous interdependent tasks &#x2013; the atomic units of computation within the workflow &#x2013; which comprise steps such as model and data deployment, simulation, data transfer, and post-processing. Workflows facilitate the execution of long high-resolution configurations by splitting the runtime of the simulation into separate tasks, in order to ensure frequent checkpointing and to comply with the restrictions of the large High-Performance Computing (HPC) machines where they are executed. These machines are frequently congested and therefore implement scheduling policies to share resources among users, which complicates the execution of long ensemble simulation workflows due to accumulated queue time, because each successive simulation task can only be submitted once the preceding one finishes. These queue times are the durations for which the jobs &#x2013; the compute units sent to be executed remotely &#x2013; wait for the HPC platform to allocate the resources required for their execution.</p>
<p>To alleviate this issue, we propose achieving shorter times-to-response &#x2013; the durations from the first submission to the completion of the final task &#x2013; by applying task aggregation to reduce the number of resource requests and, consequently, the accumulated queue time. Task aggregation is a strategy that consists of grouping multiple tasks and submitting them as a single job, respecting their dependencies and without altering their underlying logic.</p>
<p>In this paper, we perform the first controlled assessment of the effects of task aggregation by conducting concurrent pairs of climate simulations on production machines, with the sole difference that one member of each pair uses aggregation. These simulations were executed on three European supercomputers: MeluXina, MareNostrum 4, and MareNostrum 5. We measure the evolution of the fair share, a scheduling factor that normally plays a major role in the priority of the jobs. Besides absolute time, we quantify the impact of aggregation by obtaining the differences between the Simulated Years Per Day (SYPD) and the Actual Simulated Years Per Day (ASYPD), two consolidated performance metrics for climate models within the community.</p>
<p>We prove the benefits of task aggregation using EC-Earth3, a widely used European community climate model that shares its main features with many other ESMs and has a representative workload. The experimental findings of our research indicate that, across the three evaluated supercomputing platforms, applying task aggregation decreases total queue times by a factor of 11.17 to 12.33 compared to a workflow that does not, representing an improvement in the ASYPD of up to 23.04&#x2009;% in the case of the platform with the highest congestion.</p>
<p>Our results therefore show that task aggregation is beneficial for long climate simulations. Moreover, we have credible reasons to believe that any vertical (also called chained) workflow should benefit from it. We explain that this reduction in the time-to-solution comes from the decrease in the number of submitted jobs and in the congestion of the machine. By aggregating tasks, we queued far fewer jobs and, despite their longer duration, observed that they spend less time in the queue than if the tasks had been submitted individually. Our results also show that a user&apos;s jobs can be held in the queue regardless of their own utilization, because the fair share is influenced by the other members of the HPC account, a direct consequence of the fair-share policy of the machine.</p>
</abstract>
<counts><page-count count="21"/></counts>
</article-meta>
</front>
<body/>
<back>
</back>
</article>