Preprints
https://doi.org/10.5194/egusphere-2026-539
21 Apr 2026
Status: this preprint is open for discussion and under review for Geoscientific Model Development (GMD).

Task aggregation as a strategy to optimize Earth System Model workflows in HPC: assessing real scenarios with EC-Earth

Pablo Goitia, Manuel G. Marciani, Miguel Castrillo, and Mario C. Acosta

Abstract. Earth System Models (ESMs) are commonly executed as complex workflows consisting of numerous interdependent tasks (the atomic units of computation within the workflow), which comprise steps such as model and data deployment, simulation, data transfer, and post-processing. Workflows facilitate the execution of long high-resolution configurations by splitting the runtime of the simulation into different tasks, in order to ensure frequent checkpointing and to comply with the restrictions of the large High-Performance Computing (HPC) machines where they are executed. These machines are frequently congested and therefore implement scheduling policies to share the resources among users. This complicates the execution of long ensemble simulation workflows because queue time accumulates: each successive simulation task can only be submitted once the preceding one finishes. Queue time is the duration for which a job (the compute unit sent to be executed remotely) waits for the HPC platform to allocate the resources required for its execution.

To alleviate this issue, we propose shortening the time-to-response, defined as the duration from the first submission to the completion of the final task, by applying task aggregation, which reduces the number of successive resource requests and, consequently, the queue times. Task aggregation is a strategy that consists of grouping multiple tasks and submitting them as a single job, respecting their dependencies and without altering their underlying logic.
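The aggregation strategy described above can be illustrated with a minimal sketch. It is not the authors' implementation: the `Task` class, the `aggregate` and `time_to_response` helpers, and all numbers are hypothetical, and the model assumes a constant queue wait per job, which real schedulers do not guarantee.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Task:
    name: str
    runtime_h: float  # wall-clock run time in hours


def aggregate(chain: List[Task], group_size: int) -> List[List[Task]]:
    """Pack consecutive tasks of a chained workflow into jobs.

    Order is preserved, so each task still runs after its predecessor.
    """
    if group_size < 1:
        raise ValueError("group_size must be at least 1")
    return [chain[i:i + group_size] for i in range(0, len(chain), group_size)]


def time_to_response(jobs: List[List[Task]], queue_h_per_job: float) -> float:
    """Hours from first submission to completion of the last task.

    Assumption: every job waits the same time in the queue, and a job is
    submitted only once the previous job has finished.
    """
    return sum(queue_h_per_job + sum(t.runtime_h for t in job) for job in jobs)


# Hypothetical chain: 12 simulation chunks of 2 h each, 1.5 h wait per job.
chain = [Task(f"sim_{i:02d}", 2.0) for i in range(12)]
baseline = time_to_response(aggregate(chain, 1), queue_h_per_job=1.5)
wrapped = time_to_response(aggregate(chain, 4), queue_h_per_job=1.5)
print(f"one task per job: {baseline:.1f} h, four tasks per job: {wrapped:.1f} h")
```

In this toy model the compute time is identical in both cases; the only saving comes from queuing 3 jobs instead of 12. Whether longer aggregated jobs wait more or less per submission on a real machine is an empirical question, which is what the experiments address.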

In this paper, we perform the first controlled assessment of the effects of task aggregation by running concurrent pairs of climate simulations on production machines, with the sole difference that one simulation of each pair uses aggregation. These simulations were executed on three European supercomputers: MeluXina, MareNostrum 4, and MareNostrum 5. We measured the evolution of the fair share, a scheduling factor that normally plays a major role in the priority of the jobs. Besides absolute time, we quantify the impact of aggregation through the difference between the Simulated Years Per Day (SYPD) and the Actual Simulated Years Per Day (ASYPD), two well-established performance metrics for climate models within the community.
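As a rough illustration of the two metrics, the sketch below assumes the common community reading: SYPD counts only the wall-clock time the model spends running, while ASYPD also counts the time lost outside execution. Here that lost time is reduced to queue time only, which is a simplification. The function names and the example numbers are hypothetical and are not taken from the paper's data.

```python
def sypd(simulated_years: float, run_hours: float) -> float:
    """Simulated years per day of pure run time."""
    if simulated_years <= 0 or run_hours <= 0:
        raise ValueError("simulated_years and run_hours must be positive")
    return simulated_years / (run_hours / 24.0)


def asypd(simulated_years: float, run_hours: float, queue_hours: float) -> float:
    """Simulated years per day including queue time (simplified ASYPD)."""
    if queue_hours < 0:
        raise ValueError("queue_hours must be non-negative")
    return sypd(simulated_years, run_hours + queue_hours)


def queue_overhead_pct(sypd_value: float, asypd_value: float) -> float:
    """Throughput lost to queuing, as a percentage of SYPD."""
    return 100.0 * (sypd_value - asypd_value) / sypd_value


# Hypothetical run: 10 simulated years, 240 h running, 60 h queued.
s = sypd(10.0, 240.0)
a = asypd(10.0, 240.0, 60.0)
print(f"SYPD={s:.2f}  ASYPD={a:.2f}  overhead={queue_overhead_pct(s, a):.1f} %")
```

Because aggregation leaves the run time unchanged, SYPD should stay roughly constant between the two workflows of a pair, and any gain shows up as ASYPD moving closer to SYPD.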

We demonstrate the benefits of task aggregation using EC-Earth3, a widely used European community climate model that shares its main features with many other ESMs and has a representative workload. Our experimental findings indicate that, across the three evaluated supercomputing platforms, applying task aggregation reduces total queue times by a factor of 11.17 to 12.33 compared to a workflow without aggregation, representing an improvement in the ASYPD of up to 23.04 % on the platform with the highest congestion.

These results show that task aggregation is beneficial for long climate simulations. Moreover, we have credible reasons to believe that any vertical (also called chained) workflow should benefit from it. We explain that this reduction in the time-to-response stems from the decrease in the number of submitted jobs and from the congestion of the machine. By aggregating tasks, we had many times fewer jobs queued and, despite their longer length, we observed that they spent less time in the queue than if the tasks had been submitted individually. Our results also show that a user's jobs can be held in the queue irrespective of the user's own utilization, because the fair share is influenced by the other members of the HPC account, which is a direct consequence of the fair share policy of the machine.

Publisher's note: Copernicus Publications remains neutral with regard to jurisdictional claims made in the text, published maps, institutional affiliations, or any other geographical representation in this paper. While Copernicus Publications makes every effort to include appropriate place names, the final responsibility lies with the authors. Views expressed in the text are those of the authors and do not necessarily reflect the views of the publisher.

Status: open (until 16 Jun 2026)


Data sets

Machine utilization and statistics of EC-Earth3 workflows running on MareNostrum 4, MareNostrum 5, and MeluXina Pablo Goitia https://doi.org/10.5281/zenodo.15292084

Model code and software

EC-Earth3 workflow with wrappers in MareNostrum 4 Pablo Goitia et al. https://doi.org/10.48546/workflowhub.workflow.2067.1

EC-Earth3 workflow without wrappers in MareNostrum 4 Pablo Goitia et al. https://doi.org/10.48546/workflowhub.workflow.2066.1

EC-Earth3 workflow with wrappers in MareNostrum 5 Pablo Goitia et al. https://doi.org/10.48546/workflowhub.workflow.2065.1

EC-Earth3 workflow without wrappers in MareNostrum 5 Pablo Goitia et al. https://doi.org/10.48546/workflowhub.workflow.2064.1

EC-Earth3 workflow with wrappers in MeluXina Pablo Goitia et al. https://doi.org/10.48546/workflowhub.workflow.2063.1

EC-Earth3 workflow without wrappers in MeluXina Pablo Goitia et al. https://doi.org/10.48546/workflowhub.workflow.2062.1

Header scripts to gather statistics and usage records of EC-Earth3 workflows from MareNostrum 4, MareNostrum 5, and MeluXina Pablo Goitia and Manuel G. Marciani https://doi.org/10.5281/zenodo.15673462

Latest update: 21 Apr 2026
Short summary
Earth System Model workflows commonly run on highly congested high-performance computing platforms, meaning that each individual workflow task potentially faces lengthy waiting times in the scheduler queues. In this work, we evaluate the task aggregation approach in EC-Earth3 workflows to reduce the queue times and, consequently, the total execution time. The results show an increase of up to 23.04 % in the actual simulated years per day, with queue times reduced by a factor of up to 12.33.