Preprints
https://doi.org/10.5194/egusphere-2025-1104
20 May 2025
Status: this preprint is open for discussion and under review for Geoscientific Model Development (GMD).

Evaluating the Impact of Task Aggregation in Workflows with Shared Resource Environments: use case for the MONARCH application

Manuel G. Marciani, Miguel Castrillo, Gladys Utrera, Mario C. Acosta, Bruno P. Kinoshita, and Francisco Doblas-Reyes

Abstract. High Performance Computing (HPC) is commonly employed to run high-impact Earth System Model (ESM) simulations, such as those for climate change. However, running workflows of ESM simulations on cutting-edge platforms can take a long time due to congestion of the system and the lack of coordination between current HPC schedulers and workflow manager systems (WfMS). The Earth Sciences community has estimated the time in queue to be between 10 % and 20 % of the runtime in climate prediction experiments, the most time-consuming exercise. To address this issue, the developers of Autosubmit, a WfMS tailored for climate and air quality sciences, have developed wrappers that join multiple consecutive workflow tasks into a single submission. However, although wrappers are widely used in production for community models such as EC-Earth3, MONARCH, and Destination Earth simulations, to our knowledge their benefits and potential drawbacks have never been rigorously evaluated. In addition, with portability in mind, the developers proposed wrapping based on the user's entitlement to the machine; in the widely utilized Slurm scheduler, this factor is called fair share. The objective of this paper is to quantify the impact of wrapping on queue time and to understand its relationship with the fair share and the job's CPU and runtime requests. To do this, we used a Slurm simulator to reproduce the behavior of the scheduler and, to recreate representative usage of an HPC platform, we generated synthetic static workloads from data of the LUMI supercomputer and a dynamic workload from a past flagship HPC platform. We then introduced jobs modeled after the MONARCH air quality application into these workloads and tracked their queue time. We found that, by simply joining tasks, the total runtime of the simulation is reduced by up to 7 %, and we have indications that this value is larger in reality.
In absolute terms, this saving corresponds to at least eight fewer days wasted in queue time for half of the simulations from the IS-ENES3 consortium of CMIP6 simulations. We also identified a strong inverse correlation of -0.87 between the queue time and the fair share factor.
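To illustrate the idea of task aggregation, the sketch below contrasts submitting two consecutive workflow tasks as separate dependent Slurm jobs with wrapping them into a single submission. This is a minimal, hypothetical batch script, not Autosubmit's actual wrapper mechanism; the task names (`sim.sh`, `post.sh`) and resource requests are assumptions for illustration.

```shell
#!/bin/bash
# Hypothetical wrapped submission: one job requests resources once and
# runs two consecutive workflow tasks back to back, so the workflow pays
# only one queue wait instead of two.
#SBATCH --ntasks=256          # CPU request shared by both tasks
#SBATCH --time=01:30:00       # sum of the two tasks' time requests
srun ./sim.sh                 # e.g., a simulation step
srun ./post.sh                # e.g., its post-processing step

# Unwrapped equivalent (two submissions, hence two queue waits):
#   jid=$(sbatch --parsable sim_job.sh)
#   sbatch --dependency=afterok:"$jid" post_job.sh
```

The trade-off is that the wrapped job asks for a larger contiguous allocation (here, the combined walltime), which the scheduler may find harder to place; quantifying that balance against the saved queue waits is the subject of this paper.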

Publisher's note: Copernicus Publications remains neutral with regard to jurisdictional claims made in the text, published maps, institutional affiliations, or any other geographical representation in this preprint. The responsibility to include appropriate place names lies with the authors.

Status: open (until 15 Jul 2025)

Comment types: AC – author | RC – referee | CC – community | EC – editor | CEC – chief editor
  • RC1: 'Comment on egusphere-2025-1104', Anonymous Referee #1, 21 May 2025
    • AC1: 'Reply on RC1', Manuel Giménez de Castro Marciani, 22 May 2025
      • RC2: 'Sorry for uploading the wrong review comments. The original comments were for another manuscript, but below are for this manuscript.', Anonymous Referee #1, 23 May 2025
        • AC3: 'Reply on RC2', Manuel Giménez de Castro Marciani, 11 Jun 2025
  • RC3: 'Comment on egusphere-2025-1104', Anonymous Referee #1, 26 May 2025
    • AC2: 'Reply on RC3', Manuel Giménez de Castro Marciani, 11 Jun 2025

Data sets

Wrapper Impact Workloads and BSC Slurm Simulator Output of Dynamic Traces from CEA Curie Manuel G. Marciani https://doi.org/10.5281/zenodo.10623439

Full Results from Simulations for Static and Dynamic Workloads Using BSC Slurm Simulator Manuel G. Marciani https://doi.org/10.5281/zenodo.10818813

Wrapper Impact Workloads and BSC Slurm Simulator Output of Static Traces based on Data from LUMI Supercomputer Manuel G. Marciani https://doi.org/10.5281/zenodo.10624403

Interactive computing environment

Static Workload Results Analysis Scripts Manuel G. Marciani https://doi.org/10.5281/zenodo.12801377

Scripts and Files to Add Workflow to Curie Manuel G. Marciani https://doi.org/10.5281/zenodo.12801281

Docker Image of the Computational Earth Sciences Slurm Simulator Manuel G. Marciani https://doi.org/10.5281/zenodo.12801138


Viewed

Total article views: 171 (including HTML, PDF, and XML)
  • HTML: 132
  • PDF: 29
  • XML: 10
  • Total: 171
  • BibTeX: 6
  • EndNote: 5
Views and downloads (calculated since 20 May 2025)

Viewed (geographical distribution)

Total article views: 166 (including HTML, PDF, and XML) Thereof 166 with geography defined and 0 with unknown origin.
Latest update: 13 Jun 2025
Short summary
Earth System Model simulations are executed as workflows on congested HPC resources. These workflows can comprise thousands of tasks which, if submitted naively, add overhead due to queueing for resources. In this paper we explore a technique that aggregates tasks into a single submission, and we relate it to a key factor used by the software in charge of scheduling. We find that this simple technique can reduce the time spent in queue by up to 7 %.