the Creative Commons Attribution 4.0 License.
Evaluating the Impact of Task Aggregation in Workflows with Shared Resource Environments: use case for the MONARCH application
Abstract. High Performance Computing (HPC) is commonly employed to run high-impact Earth System Model (ESM) simulations, such as those for climate change. However, running workflows of ESM simulations on cutting-edge platforms can take a long time because of system congestion and the lack of coordination between current HPC schedulers and workflow manager systems (WfMS). The Earth Sciences community has estimated the time in queue to be between 10 % and 20 % of the runtime in climate prediction experiments, the most time-consuming exercise. To address this issue, the developers of Autosubmit, a WfMS tailored for climate and air quality sciences, have developed wrappers to join multiple subsequent workflow tasks into a single submission. However, although wrappers are widely used in production for community models such as EC-Earth3, MONARCH, and Destination Earth simulations, to our knowledge, the benefits and potential drawbacks have never been rigorously evaluated. In addition, with portability in mind, the developers proposed to wrap depending on the entitlement of the user to the machine. In the widely utilized Slurm scheduler, this factor is called fair share. The objective of this paper is to quantify the impact of wrapping on queue time and understand its relationship with the fair share and the job's CPU and runtime request. To do this, we used a Slurm simulator to reproduce the behavior of the scheduler and, to recreate a representative usage of an HPC platform, we generated synthetic static workloads from data of the LUMI supercomputer and a dynamic workload from a past flagship HPC platform. As an example, we introduced jobs modeled after the MONARCH air quality application into these workloads and tracked their queue time. We found that, by simply joining tasks, the total runtime of the simulation is reduced by up to 7 %, and we have indications that this value is larger in reality. 
In absolute terms, this saving translates into at least eight fewer days wasted in queue for half of the simulations from the IS-ENES3 consortium of CMIP6 simulations. We also identified a high inverse correlation, of -0.87, between the queue time and the fair share factor.
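The two quantities highlighted in the abstract, the relative saving from wrapping and the correlation between queue time and fair share, can be illustrated with a minimal Python sketch. It is not the authors' method and does not use the Slurm simulator or Autosubmit: all function names and numbers are hypothetical, and the queue wait of the wrapped job is simply an assumed input, since a larger aggregated request may itself wait longer than any single task.

```python
import math


def unwrapped_total_time(runtimes, queue_waits):
    """Total time of a chain of dependent tasks submitted one by one.

    Each task waits in the queue before it runs, so every wait is paid.
    """
    return sum(runtimes) + sum(queue_waits)


def wrapped_total_time(runtimes, wrapped_queue_wait):
    """Total time when all tasks are joined into a single submission.

    Only one queue wait is paid; its value is an assumption of this sketch.
    """
    return sum(runtimes) + wrapped_queue_wait


def relative_saving(unwrapped, wrapped):
    """Fraction of the unwrapped total time saved by wrapping."""
    return (unwrapped - wrapped) / unwrapped


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical workflow: ten one-hour tasks, each waiting 500 s in the queue.
runtimes = [3600.0] * 10
queue_waits = [500.0] * 10

unwrapped_total = unwrapped_total_time(runtimes, queue_waits)
# Assumed single wait of 900 s for the aggregated job (illustrative only).
wrapped_total = wrapped_total_time(runtimes, 900.0)
saving = relative_saving(unwrapped_total, wrapped_total)

# Hypothetical (fair share factor, queue wait in seconds) pairs.
fair_share = [0.1, 0.3, 0.5, 0.7, 0.9]
observed_wait = [950.0, 640.0, 520.0, 310.0, 90.0]
r = pearson(fair_share, observed_wait)

if __name__ == "__main__":
    print(f"relative saving from wrapping: {saving:.1%}")
    print(f"correlation(fair share, queue wait): {r:.2f}")
```

A negative value of `r` in such data would indicate that jobs from users with a higher fair share factor tend to wait less, which is the direction of the relationship reported in the abstract.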
Status: open (until 15 Jul 2025)
-
RC1: 'Comment on egusphere-2025-1104', Anonymous Referee #1, 21 May 2025
Publisher’s note: the content of this comment was removed on 26 May 2025 since the comment was posted by mistake.
Citation: https://doi.org/10.5194/egusphere-2025-1104-RC1 -
AC1: 'Reply on RC1', Manuel Giménez de Castro Marciani, 22 May 2025
Dear reviewer,
Thank you for your prompt answer. We are glad that you consider that our work "achieves significant improvements in processing efficiency," although you do not find this work fitting for the journal.
First, we find it crucial to clarify that this work does not involve “technical implementation of running ArcGIS toolboxes,” nor does it use “containerization,” nor is it oriented toward “parallel computing optimization.” You can verify in the document that these topics are not addressed in the paper. Instead, it tackles the queue time overhead, a transversal issue across all models that run on shared HPC platforms. We use the MONARCH chemical weather prediction system [1] as an example of a high-impact application that delivers operational forecasts for the region of North Africa, the Middle East, and Europe [2] and is part of the Copernicus quality assessment ensemble [3].
Regardless of whether this work directly contributes to the advancement of geoscientific models, we believe that the contribution of this paper has a strong impact across various models and studies that have been presented in this journal. The overhead caused by queue time is a cross-cutting issue that is highly relevant to the journal’s readership and to the broader community working with diverse scientific applications. Similarly, other transversal topics addressed in the journal have been positively considered. In particular, this paper validates a novel method in the Earth Sciences domain by studying task aggregation and its interplay with the HPC scheduler on highly utilized machines, providing estimated savings of up to 7% of the total runtime of the simulation. To put this in context, a paper published in this journal reported an overhead of 20% [4] on some platforms due to the time jobs spent in the queue. The models considered there were run for the CMIP6 exercise and include, but are not limited to, IFS [5], NEMO [6], ICON [7], and FESOM [8]. We could also highlight the current European flagship Destination Earth workflow [9], which is executed on three shared HPC platforms and thus faces these large queue-time overheads and relies on task aggregation to mitigate them.
Moreover, even though aggregation is already in use, there is no work such as ours in the literature on how aggregating tasks impacts queue time. In this work, we therefore make an effort to understand how Slurm's factors and policies affect the time that a job remains in the queue, which is of interest to everyone who uses HPC on a daily basis.
And in our field, we have seen how HPC workflows for geoscientific models have gained renewed attention in fields such as climate change research, where decision-making relies on simulations that can take weeks to run on supercomputers. As a result, not only improving the throughput of the models but also optimizing the entire workflow within the digital continuum has become increasingly important.
For these reasons, along with the references we have added, both from GMD and beyond, we believe there is a solid foundation to support the positive impact of our work on the community, and its relevance to the journal. We sincerely hope this will be taken into consideration during the review process.
Sincerely,
The authors.
[1] https://doi.org/10.5194/gmd-14-6403-2021
[2] https://dust.aemet.es/
[3] https://regional-evaluation.atmosphere.copernicus.eu/pages/evaluation/?project=cams2-83&model=MONARCH#
[4] https://doi.org/10.5194/gmd-17-3081-2024
[5] https://doi.org/10.5194/gmd-11-3681-2018
[6] https://doi.org/10.5194/gmd-15-1567-2022
[7] https://doi.org/10.1002/qj.2378
[8] https://doi.org/10.1007/s00382-014-2290-6
[9] https://doi.org/10.1016/j.cliser.2023.100394
Citation: https://doi.org/10.5194/egusphere-2025-1104-AC1 -
RC2: 'Sorry for uploading the wrong review comments. The original comments were for another manuscript, but below are for this manuscript.', Anonymous Referee #1, 23 May 2025
This manuscript investigates the impact of task aggregation (i.e., wrapping multiple workflow tasks into a single submission) on job queue time in high-performance computing (HPC) environments. Using the MONARCH air quality model as a case study, the authors employ a Slurm simulator and synthetic workloads based on LUMI and historical HPC data to assess how task wrapping influences queue time and its correlation with factors such as fair share, CPU request, and runtime. The study finds that task aggregation can reduce total runtime by up to 7% and shows a strong negative correlation (−0.87) between queue time and fair share. Despite the relevance of the topic to HPC workflow optimization, this manuscript suffers from several major deficiencies that preclude publication in its current form. First, the introduction is poorly structured and lacks a coherent logical flow, making it difficult to understand the motivation and novelty of the study. The authors are strongly advised to thoroughly revise the introduction, clearly stating the research gap, objectives, and context within existing literature. Second, the overall structure of the manuscript is not conducive to clarity. It is recommended that the manuscript be reorganized into five standard sections: Introduction, Data and Methods, Results and Analysis, Discussion, and Conclusion. Currently, the paper lacks a meaningful discussion section, which is essential for interpreting results, evaluating strengths and weaknesses, and situating the work in a broader scientific context. Third, the content is overly simplistic, with limited methodological depth and superficial analysis, which significantly reduces the academic value of the paper. The authors focus primarily on a technical implementation without formulating or addressing a well-defined scientific problem. 
Furthermore, the figures and quantitative results, while potentially useful in a production context, do not provide sufficient insight or generalizability for a scientific audience. Lastly, the lack of rigorous validation or real-world deployment results further weakens the credibility of the conclusions. In summary, the manuscript lacks scientific depth, a clear problem formulation, and a meaningful discussion of results. Substantial revisions are needed to improve the structure, expand the analytical depth, and provide a more comprehensive evaluation of the methodology and its implications. Based on these significant shortcomings, I recommend rejection of this manuscript.
Citation: https://doi.org/10.5194/egusphere-2025-1104-RC2 -
AC3: 'Reply on RC2', Manuel Giménez de Castro Marciani, 11 Jun 2025
We answer this reviewer's comments in the RC3 thread.
Citation: https://doi.org/10.5194/egusphere-2025-1104-AC3
-
RC3: 'Comment on egusphere-2025-1104', Anonymous Referee #1, 26 May 2025
This manuscript investigates the impact of task aggregation (i.e., wrapping multiple workflow tasks into a single submission) on job queue time in high-performance computing (HPC) environments. Using the MONARCH air quality model as a case study, the authors employ a Slurm simulator and synthetic workloads based on LUMI and historical HPC data to assess how task wrapping influences queue time and its correlation with factors such as fair share, CPU request, and runtime. The study finds that task aggregation can reduce total runtime by up to 7% and shows a strong negative correlation (−0.87) between queue time and fair share. Despite the relevance of the topic to HPC workflow optimization, this manuscript suffers from several major deficiencies that preclude publication in its current form. First, the introduction is poorly structured and lacks a coherent logical flow, making it difficult to understand the motivation and novelty of the study. The authors are strongly advised to thoroughly revise the introduction, clearly stating the research gap, objectives, and context within existing literature. Second, the overall structure of the manuscript is not conducive to clarity. It is recommended that the manuscript be reorganized into five standard sections: Introduction, Data and Methods, Results and Analysis, Discussion, and Conclusion. Currently, the paper lacks a meaningful discussion section, which is essential for interpreting results, evaluating strengths and weaknesses, and situating the work in a broader scientific context. Third, the content is overly simplistic, with limited methodological depth and superficial analysis, which significantly reduces the academic value of the paper. The authors focus primarily on a technical implementation without formulating or addressing a well-defined scientific problem. 
Furthermore, the figures and quantitative results, while potentially useful in a production context, do not provide sufficient insight or generalizability for a scientific audience. Lastly, the lack of rigorous validation or real-world deployment results further weakens the credibility of the conclusions. In summary, the manuscript lacks scientific depth, a clear problem formulation, and a meaningful discussion of results. Substantial revisions are needed to improve the structure, expand the analytical depth, and provide a more comprehensive evaluation of the methodology and its implications. Based on these significant shortcomings, I recommend rejection of this manuscript.
Citation: https://doi.org/10.5194/egusphere-2025-1104-RC3 -
AC2: 'Reply on RC3', Manuel Giménez de Castro Marciani, 11 Jun 2025
We would like to thank the reviewer for their prompt review. We address all of the comments below.
We did not understand the reviewer's recommendation for the manuscript to “be reorganized into five standard sections: Introduction, Data and Methods, Results and Analysis, Discussion, and Conclusion.” We adopted a structure identical to the one recommended, with the only addition being the background section, which explains the fundamental relationship between scheduler factors and time in queue.
With regard to the “poorly structured” introduction not “clearly stating the research gap, objectives, and context within existing literature,” we believe that all of the reviewer’s concerns are addressed in the introduction. The research gap is stated on line 30, where we mention that “there has been a growing awareness of considering the entire execution of the workflow, taking into account not only the runtime of the most demanding part of it, but also the time spent queuing for resources and post-processing, with possible failures.” Our objective is in lines 74-75 and also in the second to last paragraph of the introduction, where we say that “Our results help to advance the understanding, from the user side, on how to optimize the submission in order to reduce the total queue time of their workflows.” Regarding the context within the literature, we state in lines 52–57 of the introduction that aggregation was identified elsewhere in the weather and climate community and that, as far as we know, there is no other work that tries to validate its usage.
As for the lack of a meaningful discussion section “evaluating strengths and weaknesses, and situating the work in a broader scientific context,” we believe we do address these points. For example, lines 268-269 draw attention to the relationship between a low fair share factor and aggregation improvement. We also reflect on the possible shortcomings of our methodology in lines 271-272, explaining that we rely on data from an old system that was not always as congested as current flagship systems. We also discuss the negative role of the backfill algorithm in lines 274-275. As for the broader scientific context, we remark again that, as far as we know, this work is novel in its analysis within our context.
With regard to the reviewer’s comment about the content being “overly simplistic, with limited methodological depth and superficial analysis,” we would like to point out again that aggregation is used across various fields, including climate and weather, materials science (AiiDA with HyperQueue [1]), and bioinformatics (Snakemake with grouping [2]). This work is therefore novel in its evaluation of this technique for addressing the queue issue, which has never been tackled head-on in the literature. For that reason, we did it in the most direct and straightforward way.
As for our figures and quantitative results not providing “sufficient insight or generalizability for a scientific audience,” we believe we were sufficiently general to cover modern HPC centers, given the available data, using the two distinct experiments. As stated in lines 301–303, “To have both modern job requests and realistic behavior on the usage of the machines, we performed two experiment types.”
Finally, with regard to “the lack of rigorous validation or real-world deployment,” we agree that real-world deployments would enrich our argument, but executing them would require running multiple expensive concurrent simulations to test aggregation. Additionally, as Acosta et al. [3] have shown, the time in queue depends heavily on the specific platform. Therefore, we would also need to span this experimentation across sites. In conclusion, although we understand the request, we believe that real-world deployments are neither trivial nor cheap to run.
We value all reviews and comments, as we always strive to ensure that our science is as rigorous and accurate as possible. Therefore, we would now prefer to wait for the remaining reviews before deciding how to proceed. Thank you.
[1] https://aiida-hyperqueue.readthedocs.io/en/latest/
[2] https://snakemake.readthedocs.io/en/stable/executing/grouping.html
[3] https://gmd.copernicus.org/articles/17/3081/2024/
Citation: https://doi.org/10.5194/egusphere-2025-1104-AC2
Data sets
Wrapper Impact Workloads and BSC Slurm Simulator Output of Dynamic Traces from CEA Curie Manuel G. Marciani https://doi.org/10.5281/zenodo.10623439
Full Results from Simulations for Static and Dynamic Workloads Using BSC Slurm Simulator Manuel G. Marciani https://doi.org/10.5281/zenodo.10818813
Wrapper Impact Workloads and BSC Slurm Simulator Output of Static Traces based on Data from LUMI Supercomputer Manuel G. Marciani https://doi.org/10.5281/zenodo.10624403
Interactive computing environment
Static Workload Results Analysis Scripts Manuel G. Marciani https://doi.org/10.5281/zenodo.12801377
Scripts and Files to Add Workflow to Curie Manuel G. Marciani https://doi.org/10.5281/zenodo.12801281
Docker Image of the Computational Earth Sciences Slurm Simulator Manuel G. Marciani https://doi.org/10.5281/zenodo.12801138
Viewed
HTML | PDF | XML | Total | BibTeX | EndNote |
---|---|---|---|---|---|
132 | 29 | 10 | 171 | 6 | 5 |