Preprints
https://doi.org/10.5194/egusphere-2025-1104
20 May 2025

Evaluating the Impact of Task Aggregation in Workflows with Shared Resource Environments: use case for the MONARCH application

Manuel G. Marciani, Miguel Castrillo, Gladys Utrera, Mario C. Acosta, Bruno P. Kinoshita, and Francisco Doblas-Reyes

Abstract. High Performance Computing (HPC) is commonly employed to run high-impact Earth System Model (ESM) simulations, such as those for climate change. However, running workflows of ESM simulations on cutting-edge platforms can take a long time because of system congestion and the lack of coordination between current HPC schedulers and workflow manager systems (WfMS). The Earth Sciences community has estimated the time in queue to be between 10 % and 20 % of the runtime in climate prediction experiments, the most time-consuming exercise. To address this issue, the developers of Autosubmit, a WfMS tailored for climate and air quality sciences, have developed wrappers that join multiple subsequent workflow tasks into a single submission. Although wrappers are widely used in production for community models such as EC-Earth3, MONARCH, and Destination Earth simulations, to our knowledge their benefits and potential drawbacks have never been rigorously evaluated. In addition, with portability in mind, the developers proposed to wrap tasks depending on the user's entitlement to the machine. In the widely used Slurm scheduler, this factor is called fair share. The objective of this paper is to quantify the impact of wrapping on queue time and to understand its relationship with the fair share and with the job's CPU and runtime request. To do this, we used a Slurm simulator to reproduce the behavior of the scheduler. To recreate a representative usage of an HPC platform, we generated synthetic static workloads from data of the LUMI supercomputer and a dynamic workload from a past flagship HPC platform. As an example, we introduced jobs modeled after the MONARCH air quality application into these workloads and tracked their queue time. We found that, by simply joining tasks, the total runtime of the simulation is reduced by up to 7 %, and we have indications that this value is larger in reality. 
In absolute terms, this saving translates to at least eight fewer days wasted in queue for half of the CMIP6 simulations from the IS-ENES3 consortium. We also identified a strong inverse correlation of -0.87 between the queue time and the fair share factor.
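
The arithmetic behind task aggregation can be illustrated with a toy model. The sketch below is not the authors' simulator or Autosubmit code. It assumes, for illustration only, that every separately submitted task waits in queue once before running, and that a wrapped job waits once for the whole group. All function names and numbers are hypothetical.

```python
def total_time_unwrapped(run_times, queue_waits):
    """Total wall time when each task is submitted on its own.

    Assumption: tasks run one after another, and each one waits
    in queue (queue_waits[i]) before it starts.
    """
    if len(run_times) != len(queue_waits):
        raise ValueError("run_times and queue_waits must have the same length")
    return sum(run_times) + sum(queue_waits)


def total_time_wrapped(run_times, wrapper_queue_wait):
    """Total wall time when all tasks are joined into one submission.

    Assumption: the wrapped job waits in queue once, then runs every
    task back to back. The single wait may be longer than a per-task
    wait because the job requests more runtime.
    """
    return wrapper_queue_wait + sum(run_times)


def relative_saving(unwrapped, wrapped):
    """Fraction of the unwrapped total time saved by wrapping."""
    return (unwrapped - wrapped) / unwrapped


# Hypothetical workload: 10 tasks of 60 minutes, each waiting 6 minutes
# when submitted alone, versus one 20-minute wait for the wrapped job.
run_times = [60] * 10
unwrapped = total_time_unwrapped(run_times, [6] * 10)
wrapped = total_time_wrapped(run_times, 20)
saving = relative_saving(unwrapped, wrapped)
print(f"unwrapped={unwrapped} min, wrapped={wrapped} min, saving={saving:.1%}")
```

In this toy model wrapping only pays off when the single wait of the wrapped job is shorter than the sum of the individual waits. How that wait depends on the fair share and on the job's CPU and runtime request is the relationship the paper sets out to quantify.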

Publisher's note: Copernicus Publications remains neutral with regard to jurisdictional claims made in the text, published maps, institutional affiliations, or any other geographical representation in this paper. While Copernicus Publications makes every effort to include appropriate place names, the final responsibility lies with the authors. Views expressed in the text are those of the authors and do not necessarily reflect the views of the publisher.

Status: final response (author comments only)

Comment types: AC – author | RC – referee | CC – community | EC – editor | CEC – chief editor
  • RC1: 'Comment on egusphere-2025-1104', Anonymous Referee #1, 21 May 2025
    • AC1: 'Reply on RC1', Manuel Giménez de Castro Marciani, 22 May 2025
      • RC2: 'Sorry for uploading the wrong review comments. The original comments were for another manuscript, but below are for this manuscript.', Anonymous Referee #1, 23 May 2025
        • AC3: 'Reply on RC2', Manuel Giménez de Castro Marciani, 11 Jun 2025
  • RC3: 'Comment on egusphere-2025-1104', Anonymous Referee #1, 26 May 2025
    • AC2: 'Reply on RC3', Manuel Giménez de Castro Marciani, 11 Jun 2025
  • RC4: 'Comment on egusphere-2025-1104', Anonymous Referee #2, 17 Jun 2025
    • AC4: 'Reply on RC4', Manuel Giménez de Castro Marciani, 06 Aug 2025
  • RC5: 'Comment on egusphere-2025-1104', Anonymous Referee #3, 19 Jun 2025
    • AC5: 'Reply on RC5', Manuel Giménez de Castro Marciani, 06 Aug 2025

Data sets

Wrapper Impact Workloads and BSC Slurm Simulator Output of Dynamic Traces from CEA Curie Manuel G. Marciani https://doi.org/10.5281/zenodo.10623439

Full Results from Simulations for Static and Dynamic Workloads Using BSC Slurm Simulator Manuel G. Marciani https://doi.org/10.5281/zenodo.10818813

Wrapper Impact Workloads and BSC Slurm Simulator Output of Static Traces based on Data from LUMI Supercomputer Manuel G. Marciani https://doi.org/10.5281/zenodo.10624403

Interactive computing environment

Static Workload Results Analysis Scripts Manuel G. Marciani https://doi.org/10.5281/zenodo.12801377

Scripts and Files to Add Workflow to Curie Manuel G. Marciani https://doi.org/10.5281/zenodo.12801281

Docker Image of the Computational Earth Sciences Slurm Simulator Manuel G. Marciani https://doi.org/10.5281/zenodo.12801138

Viewed

Total article views: 648 (including HTML, PDF, and XML)
  • HTML: 544
  • PDF: 70
  • XML: 34
  • Total: 648
  • BibTeX: 13
  • EndNote: 23

Viewed (geographical distribution)

Total article views: 642 (including HTML, PDF, and XML) Thereof 642 with geography defined and 0 with unknown origin.
Latest update: 30 Aug 2025
Short summary
Earth System Model simulations are executed as workflows on congested HPC resources. These workflows can consist of thousands of tasks, which, if naively submitted one by one, may add overhead from queueing for resources. In this paper we explored a technique that aggregates tasks into a single submission, and we related it to a key factor used by the software in charge of the scheduling. We found that this simple technique can reduce the time spent in queue by up to 7 %.