Task aggregation as a strategy to optimize Earth System Model workflows in HPC: assessing real scenarios with EC-Earth
Abstract. Earth System Models (ESMs) are commonly executed as complex workflows consisting of numerous interdependent tasks – the atomic unit of computation within the workflow – which comprise steps such as model and data deployment, simulation, data transfer, and post-processing. Workflows facilitate the execution of long high-resolution configurations by splitting the runtime of the simulation into different tasks, in order to ensure frequent checkpointing and to comply with the restrictions of the large High-Performance Computing (HPC) machines where they are executed. These machines are frequently congested and, therefore, implement scheduling policies to share the resources among the users, complicating the execution of long ensemble simulation workflows due to the accumulated queue time, because each successive simulation task can only be submitted once the preceding one finishes. These queue times are the duration for which the jobs – the compute units sent to be executed remotely – wait for the HPC platform to allocate the required resources for their execution.
To alleviate this issue, we propose achieving shorter times-to-response, which are the durations from the first submission to the completion of the final task, by applying task aggregation to reduce subsequent requests for resources and, consequently, reducing queue times. Task aggregation is a strategy that consists of grouping multiple tasks and submitting them as a single job, respecting their dependencies, and without altering their underlying logic.
In this paper, we performed the first controlled assessment on the effects of task aggregation by conducting concurrent pairs of climate simulations in production machines, with the sole difference that one uses aggregation. These simulations were executed on three European supercomputers: MeluXina, MareNostrum 4, and MareNostrum 5. We measured the evolution of the fair share, a scheduling factor that normally plays a major role in the priority of the jobs. Besides absolute time, we compute the impact of aggregation by obtaining the differences between the Simulated Years Per Day (SYPD) and the Actual Simulated Years Per Day (ASYPD), two consolidated performance metrics for climate models within the community.
We prove the benefits of the task aggregation using EC-Earth3, a widely used European community climate model that shares main features with many other ESMs and has a representative workload. The experimental findings of our research indicate that across the three evaluated supercomputing platforms, applying task aggregation decreases total queue times by 11.17 to 12.33 times compared to a workflow that does not, representing an improvement in the ASYPD of up to 23,04 % in the case of the platform with the highest congestion.
Therefore, results have shown that task aggregation proves to be beneficial for long climate simulations. Moreover, we have credible reasons to believe that any vertical (also called chained) workflow should benefit from using it. We explain that this reduction in the time-to-solution comes from the decrease in the number of submitted jobs and the congestion of the machine. By aggregating tasks, we had many times fewer jobs queued and, despite their longer length, we observed that they spend less time in queue than if they had been submitted individually. Our results also show that the user's jobs are being held in queue in spite of their utilization due to the fair share being influenced by the other members of the HPC account, which is a direct consequence of the fair share policy of the machine.
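As a reading aid for the two metrics named in the abstract, the following minimal Python sketch assumes the usual community definitions: SYPD divides the simulated years by the pure execution time in days, while ASYPD divides them by the execution time plus the time spent waiting in the queue. Other workflow overheads that ASYPD may also cover are ignored here. The function names and the example numbers are hypothetical and are not taken from the manuscript.

```python
SECONDS_PER_DAY = 86_400.0


def sypd(simulated_years: float, run_seconds: float) -> float:
    """Simulated Years Per Day: throughput while the model is actually running."""
    return simulated_years / (run_seconds / SECONDS_PER_DAY)


def asypd(simulated_years: float, run_seconds: float, queue_seconds: float) -> float:
    """Actual SYPD: throughput over run time plus queue time (assumed definition)."""
    return simulated_years / ((run_seconds + queue_seconds) / SECONDS_PER_DAY)


# Illustrative numbers only: 10 simulated years, 2 days of execution,
# half a day accumulated in the queue.
example_sypd = sypd(10, 2 * SECONDS_PER_DAY)                           # 5.0
example_asypd = asypd(10, 2 * SECONDS_PER_DAY, 0.5 * SECONDS_PER_DAY)  # 4.0
```

Under these assumptions, aggregation leaves SYPD essentially unchanged and acts only on the queue term, so any benefit shows up as ASYPD moving closer to SYPD.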
General comments
"Task aggregation as a strategy to optimize Earth System Model workflows in HPC: assessing real scenarios with EC-Earth" by P. Goitia et al. describes a specific feature of the Autosubmit workflow software that wraps multiple tasks into a single (Slurm) job. The authors compare performance measurements, with a focus on time to solution, between submitting tasks as single jobs and aggregating tasks into a larger job. They point out the benefits of task aggregation and show a gain in time to solution of up to 20%. As the authors note, the technique as such is not new but has not been used in the context of climate modelling with its considerably complex workflows. The general idea is well described and discussed. The use cases (here EC-Earth run on three different HPC systems) are relevant for the climate modelling community. The paper is not general in the sense that it does not provide a guideline for someone who would like to take Autosubmit and apply it to a different workflow, such as a different numerical model run on yet another HPC system. The manuscript is concise and well written. Nevertheless, I stumbled over some repetitions, which could be removed to make room for new information.
I accept the authors’ choice of keeping the focus on task scheduling as such. Having said that, I am missing a more in-depth discussion of the interrelation between queue waiting times, the shortcomings of Slurm parameter adjustments, and how the task aggregation approach bypasses them. While I understand the underlying problem and the proposed solution, I wonder whether it could also be approached by simply making a single task run for a longer time, e.g. using chunks of 10 years rather than 1 month. Thus I am left with some doubt that task aggregation is the only solution for any Earth system model to reduce the time to solution.
Specific comments
Abstract
A matter of taste, but to me the abstract does not read as a short summary highlighting the most important findings but already provides very detailed information (e.g. congested machines). Task aggregation and queue time are mentioned before they are defined.
Introduction
The introduction already provides details of the methods. I propose to shift these to your section 3 (Methods).
L 68ff I propose to move the introduction to Autosubmit after the introduction of other work (as state of the art) and thus address the mentioned problems rather than jumping back and forth between your approach and other work.
L 95ff The political impact of simulations with EC-Earth3 is not really relevant in this context.
L 99 I have had a hard time understanding why your solution is transversal (see my remark for your sec. 3)
L 102/103 Isn’t this sentence better placed in the next and final paragraph of the introduction? You assume that the reader is already familiar with “fair share”. Thus I wonder if its mentioning here is too early.
L 104 tells me what I have just read and can be deleted.
L 111ff This is just a plain repetition of what was said in lines 105 and 106.
Background
L 155 I do not get why the backfill algorithm plays a negative role. It is just designed to fill gaps if possible to guarantee an efficient use of the whole system.
L 187 is of an account: please rephrase.
L 215ff Is the detailed description of EC-Earth3 really necessary in this context here?
L 226, 227 Repetition from the introduction.
L 236 we tried: Did you succeed?
Methods
Why does a chunk size have to be 12 months? Why can’t it be 120? This would allow one to run 5 single jobs without the need for wrapping. Is there some limitation set by EC-Earth3 or any of its internal components that stands in the way? Are there any other advantages to using chunk sizes of 12 and thus a wrapper size > 1 to reach the maximum allowed queue time?
How and where in the manuscript do you make use of the Slurm parameters Fair Share, Level Fair Share, and Raw Usage that you collected? Earlier on, you explained Level Fair Share. How is this related to the Fair Share and Raw Usage mentioned in the figures? I am afraid that the reader gets a bit lost here. A bit more guidance would be nice.
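The arithmetic behind this question can be sketched in a few lines of Python. The helper below is hypothetical and is not part of Autosubmit. The 600-month total is inferred from the 5 single jobs of 120 months mentioned above, and the wrapper size of 10 is an assumption chosen purely for illustration.

```python
def jobs_submitted(total_months: int, chunk_months: int, wrapper_size: int = 1) -> int:
    """Number of scheduler submissions needed for a chained simulation.

    total_months: simulated length of the whole experiment
    chunk_months: simulated length covered by one task (chunk)
    wrapper_size: number of consecutive chunks aggregated into one job
    """
    chunks = (total_months + chunk_months - 1) // chunk_months  # ceiling division
    return (chunks + wrapper_size - 1) // wrapper_size          # ceiling division


TOTAL_MONTHS = 600  # assumed: 5 jobs x 120 months

unwrapped_12 = jobs_submitted(TOTAL_MONTHS, 12)                  # 50 submissions
wrapped_12 = jobs_submitted(TOTAL_MONTHS, 12, wrapper_size=10)   # 5 submissions
unwrapped_120 = jobs_submitted(TOTAL_MONTHS, 120)                # 5 submissions
```

In this simplified count, 12-month chunks wrapped ten at a time and unwrapped 120-month chunks lead to the same number of queue requests. Any remaining difference would therefore have to come from factors the count does not capture, such as checkpoint frequency or model constraints, which is what the question asks the authors to spell out.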
Results and discussion
I interpret the differences in SYPD for the wrapped vs unwrapped cases as system fluctuations. The percentage and speedup with a precision of up to two decimals (even though true for these single cases) suggest an exactness which is questionable. I assume that these numbers are rather volatile and depend on the overall load of the system at a given time.
Figures 4, 5 and 6: The figure title is a repetition of the figure caption.
Conclusions
You described task aggregation of rather homogeneous tasks. If other tasks such as post-processing were added, would you expect similar benefits? Would there still be a benefit if all users on a congested system opted for task aggregation? How much could be compensated for by adjusting the Slurm parameters? Wouldn’t it make more sense to use reservations to run those long experiments?