the Creative Commons Attribution 4.0 License.
Optimizing output operations in high-resolution climate models through dynamic scheduling
Abstract. This study presents a new approach to improve the efficiency of data output in high-resolution climate models. The method begins by forwarding data to processes that have lighter workloads or finish their tasks earlier, allowing these processes to serve as temporary storage. Following this, the processes create multiple smaller communication groups to reorganize the data and then use an I/O aggregation approach to enable efficient parallel writing. A dedicated control process dynamically manages these phases based on the status of each process. To further refine the I/O strategies, we collect performance data from the target machine to build a simulated environment. A reinforcement learning agent is deployed in this environment to identify and test better parameter configurations. Experiments conducted on two models, GOMO1.0 and LICOM3, show that this method increases output efficiency by factors of 1.54 and 13.1, respectively, compared to the commonly used PnetCDF and MPI-IO. These results suggest that this approach can significantly reduce the overhead associated with data output, providing a promising solution for enhancing the performance of climate models.
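The forwarding step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the use of estimated finish times as the load signal, and the greedy balancing rule are all assumptions made for illustration. The sketch picks the earliest-finishing ranks as temporary buffers and maps every other rank to the buffer currently holding the least data.

```python
import heapq


def pick_buffer_processes(finish_times, num_buffers):
    """Return the ranks of the `num_buffers` processes that finish earliest.

    `finish_times[r]` is a hypothetical estimate of when rank r completes
    its compute step; ties are broken by rank number.
    """
    ranked = sorted(range(len(finish_times)), key=lambda r: (finish_times[r], r))
    return ranked[:num_buffers]


def assign_forwarding(finish_times, data_sizes, num_buffers):
    """Map every non-buffer rank to a buffer rank, balancing forwarded bytes.

    Greedy heuristic (an assumption, not the paper's scheduler): handle the
    largest blocks first and send each to the buffer holding the least data.
    """
    buffers = pick_buffer_processes(finish_times, num_buffers)
    buffer_set = set(buffers)
    # Min-heap of (bytes_held, buffer_rank); each buffer starts with its own data.
    heap = [(data_sizes[b], b) for b in buffers]
    heapq.heapify(heap)
    senders = sorted(
        (r for r in range(len(finish_times)) if r not in buffer_set),
        key=lambda r: (-data_sizes[r], r),
    )
    plan = {}
    for rank in senders:
        held, buf = heapq.heappop(heap)
        plan[rank] = buf
        heapq.heappush(heap, (held + data_sizes[rank], buf))
    return plan


if __name__ == "__main__":
    # Four ranks with made-up finish times (seconds) and equal block sizes.
    print(assign_forwarding([5.0, 1.0, 4.0, 2.0], [10, 10, 10, 10], 2))
```

In a real model the plan would be recomputed at each output step by the control process, since the set of lightly loaded ranks changes over time.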
Competing interests: The author Xiaomeng Huang is a member of the editorial board of the journal GMD.
Publisher's note: Copernicus Publications remains neutral with regard to jurisdictional claims made in the text, published maps, institutional affiliations, or any other geographical representation in this preprint. The responsibility to include appropriate place names lies with the authors.
Status: final response (author comments only)
RC1: 'Comment on egusphere-2024-3533', Anonymous Referee #1, 27 Feb 2025
This paper introduces an updated version of a parallel I/O library for climate models. The authors have implemented new methods for data forwarding, data arrangement, and data writing to improve the efficiency of parallel I/O. Moreover, a reinforcement learning scheme is provided to optimize the timing of the data arrangement processes. This is a nice piece of work that helps optimize output operations in climate models. Although not much science on model results is provided, the architecture of the parallel I/O operations is interesting. I have a few questions about the three processes that are suggested to enhance I/O efficiency. I believe addressing these questions in a revision will better clarify the state-of-the-art architecture behind this I/O operation.
Lines 205-210 and Figure 7: The manuscript does not explain how one can make sure that the same data blocks are sent to a processor during data forwarding. Also, is the data rearrangement done within a processor or as a multi-processor task? Could you discuss how efficient your rearrangement task is for big datasets?
Algorithm 1 on page 10: I suggest changing this into pseudocode. I also suggest that the authors discuss how efficient this algorithm is for metadata, particularly the send-receive communications.
Pages 11 and 12: How are trade-offs 1 and 2 optimized? If they are not efficient, the I/O operation will not be improved. Also, the training process of the reinforcement learning agent is not discussed. A fair comment here is that if you spend too much time training your ML algorithm, that time should be considered when evaluating your updated I/O operation against the previous I/O approaches.
Figure 1: There is empty space on the right side. I suggest moving panel (b) to the top left side. This will save some space. The same comment applies to Figure 2.
Citation: https://doi.org/10.5194/egusphere-2024-3533-RC1
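The referee's point about training cost can be made concrete with a short sketch. It is an illustration only, not part of the manuscript or the review: the function names and all numbers are hypothetical, and it assumes the training time is a one-off cost amortized over repeated output steps.

```python
import math


def effective_speedup(baseline_io_s, optimized_io_s, training_s, num_runs):
    """Speedup over `num_runs` output steps once training time is charged.

    All arguments are hypothetical timings in seconds; `training_s` is a
    one-off cost paid before the optimized strategy can be used.
    """
    if num_runs <= 0:
        raise ValueError("num_runs must be positive")
    baseline_total = baseline_io_s * num_runs
    optimized_total = training_s + optimized_io_s * num_runs
    return baseline_total / optimized_total


def break_even_runs(baseline_io_s, optimized_io_s, training_s):
    """Smallest number of output steps at which training has paid for itself.

    Returns None when the optimized strategy is not faster per step.
    """
    saving = baseline_io_s - optimized_io_s
    if saving <= 0:
        return None
    return math.ceil(training_s / saving)


if __name__ == "__main__":
    # Made-up numbers: 100 s baseline, 50 s optimized, 500 s of training.
    print(effective_speedup(100.0, 50.0, 500.0, 10))
    print(break_even_runs(100.0, 50.0, 500.0))
```

Under these assumptions a per-step speedup only translates into a net gain once the number of output steps exceeds the break-even count, which is why the training duration matters for a fair comparison.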
AC1: 'Reply on RC1', Dong Wang, 13 Mar 2025
Dear editor and reviewer,
First of all, we would like to express our sincere appreciation for your valuable feedback. Your comments are highly insightful and enable us to substantially improve the quality of our manuscript. The uploaded PDF file includes our point-by-point responses to all the comments and our plans to revise the manuscript.
For more details, please refer to the supplement.
RC2: 'Comment on egusphere-2024-3533', Anonymous Referee #2, 16 Mar 2025
This paper addresses the critical bottleneck of I/O efficiency in high-resolution climate models by introducing CFIO2.0, a dynamic parallel I/O framework that optimizes data output operations through load-aware scheduling and reinforcement learning (RL). Key innovations include: (1) leveraging process load imbalances to overlap I/O with computation via dynamic data forwarding, (2) replacing dedicated I/O processes with a single control process to enhance scalability, (3) an RL-driven strategy that autonomously optimizes parameters (e.g., timing, process allocation) using simulated performance data, and (4) empirical validation on models (GOMO1.0 and LICOM3). I think this study provides a compelling advancement in climate model I/O optimization. This work has significant potential for impact in high-performance computing communities. Please find my comments below.
Major comments:
- Please clarify the reinforcement learning methodology. Provide detailed information on the RL agent's architecture (e.g., algorithm type, reward function specifics, training duration, and hyperparameters). Specify how performance data (network speed, PnetCDF rates) is collected and preprocessed for the virtual environment. It would be helpful to add pseudocode or a flowchart for the RL training process.
- The evaluation system can be expanded by including metrics beyond speedup, such as CPU/memory utilization during I/O phases, network overhead, or buffer management efficiency.
- Some discussions about trade-offs between resource consumption and performance gains can be added in the revised version.
- Please explicitly address limitations of CFIO2.0, such as dependency on pre-collected data for RL, scalability across heterogeneous file systems (e.g., Lustre vs. ParaStor), or adaptability to non-climate modeling workloads.
- Ensure that all parameters (e.g., stride values in PIO experiments) are explicitly defined in tables or appendices.
Minor suggestions:
Page 1, Lines 14-15: It is unclear whether the increases in output efficiency are across the two models or the two I/O strategies.
Page 3, Line 70: "Figure 1: Concurrent I/O and Computation." → Align with figure numbering (e.g., "Figure 1: (a) Alternating... (b) Parallel...").
Page 18, Line 385: "LICOM case study" → Add a colon for consistency ("LICOM Case Study:").
Figure 3 Caption (Page 4, Lines 100–110): Move the lengthy step-by-step description from the caption to the main text or a supplementary note to improve readability.
Section 5.3 (Page 19, Lines 410–415): Label subfigures in Figure 13 as "Fig. 13a" and "Fig. 13b" instead of "(a)" and "(b)" to avoid confusion.
References (Page 23–24): Correct "Corbetty et al. (1996)" to "Corbett et al. (1996)" in the text (Page 2, Line 45).
Ensure all citations (e.g., "Kang et al. (2019)") have corresponding entries in the References section.
Code Availability (Page 22, Lines 525–530): Verify that all Zenodo links are functional and datasets are publicly accessible.
Citation: https://doi.org/10.5194/egusphere-2024-3533-RC2
AC2: 'Reply on RC2', Dong Wang, 19 Mar 2025
Dear editor and reviewer,
First of all, we would like to express our sincere appreciation for your valuable feedback. Your review is not only highly insightful but also extremely meticulous. You have provided us with many important suggestions and have also pointed out numerous formatting errors in detail. Your comments will help us to substantially improve the quality of our manuscript.
The uploaded PDF file includes our point-by-point responses to all the comments, as well as our plans to revise the manuscript. For more details, please refer to the supplement. Thank you once again for your time and effort.
Viewed
HTML | PDF | XML | Total | BibTeX | EndNote |
---|---|---|---|---|---|
105 | 37 | 9 | 151 | 5 | 3 |
Viewed (geographical distribution)
Country | # | Views | % |
---|---|---|---|
United States of America | 1 | 72 | 45 |
undefined | 2 | 14 | 8 |
China | 3 | 10 | 6 |
Germany | 4 | 8 | 5 |
France | 5 | 8 | 5 |