This work is distributed under the Creative Commons Attribution 4.0 License.
A Python interface to the Fortran-based Parallel Data Assimilation Framework: pyPDAF v1.0.0
Abstract. Data assimilation (DA) is an essential component of numerical weather and climate prediction. Efficient implementation of DA benefits both operational prediction and research. Currently, a variety of DA software programs are available. One of the notable DA libraries is the Parallel Data Assimilation Framework (PDAF), designed for ensemble data assimilation. The framework is widely used with complex high-dimensional climate models and is applied to research on atmosphere, ocean, sea ice and marine ecosystem modelling, as well as to operational ocean forecasting. Meanwhile, there is an increasing need for flexible and efficient DA implementations in Python, driven by the growing number of intermediate-complexity models and machine-learning-based models written in Python. To accommodate such needs, we introduce pyPDAF, a Python interface to PDAF. The Python interface allows for flexible DA system development while retaining the efficient implementation of the core DA algorithms in the Fortran-based PDAF. The ideal use-case of pyPDAF is a DA system where the model integration is independent of the DA program, which reads the model forecast ensemble, produces a model analysis and updates the model's restart files, or a DA system where the model can be run from Python. With implementations in both PDAF and pyPDAF, this study demonstrates the use of pyPDAF and PDAF for coupled data assimilation (CDA) in a coupled atmosphere-ocean model, the Modular Arbitrary-Order Ocean-Atmosphere Model (MAOOAM). Using both weakly and strongly coupled DA, we demonstrate that pyPDAF allows for the utilisation of Python-based user-supplied functions in the Fortran-based DA framework. We also show that the Python-based user-supplied routines can be a main reason for the slow-down of a DA system based on pyPDAF. Our CDA experiments confirm the benefit of strongly coupled data assimilation over weakly coupled data assimilation. We also demonstrate that CDA improves not only the instantaneous analysis but also the long-term trend of the coupled dynamical system.
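To make this offline use-case concrete, the minimal sketch below outlines one analysis cycle that reads a forecast ensemble from restart files, performs the analysis and writes the updated ensemble back. The file names, the NumPy file format and the `assimilate` placeholder are hypothetical illustrations chosen for this sketch; they are not part of the pyPDAF API.

```python
import numpy as np

def read_ensemble(n_ens, path_pattern="restart_{:03d}.npy"):
    """Read the forecast ensemble from model restart files (hypothetical naming/format)."""
    return np.stack([np.load(path_pattern.format(i)) for i in range(n_ens)], axis=1)

def assimilate(ens_f, obs, obs_err):
    """Placeholder for the analysis update.

    In a pyPDAF-based system this step would be delegated to a PDAF filter
    (e.g. an ETKF) through the Python interface; here it simply returns the
    forecast unchanged so that the workflow structure is the focus.
    """
    return ens_f

def write_ensemble(ens_a, path_pattern="restart_{:03d}.npy"):
    """Overwrite the restart files with the analysis ensemble."""
    for i in range(ens_a.shape[1]):
        np.save(path_pattern.format(i), ens_a[:, i])

if __name__ == "__main__":
    n_ens = 16
    forecast = read_ensemble(n_ens)            # state_dim x n_ens
    obs = np.load("observations.npy")          # hypothetical observation file
    analysis = assimilate(forecast, obs, obs_err=1.0)
    write_ensemble(analysis)
    # The model is then restarted from the updated files for the next forecast cycle.
```

In such a setup, reading and writing the ensemble remain user-supplied code, while the analysis step is handed over to the Fortran-based filter routines.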
Status: final response (author comments only)
RC1: 'Comment on egusphere-2024-1078', Anonymous Referee #1, 12 Jun 2024
I quickly reviewed this paper and found that critical issues exist as follows:
- Incorrect terminology, even in the abstract (e.g., "model analysis")
- No explanation of mathematical symbols, even in Eq. (1) (e.g., p_i)
- Larger forecast/analysis RMSEs than the prescribed observation errors in Fig. 4 (i.e., filter divergence occurs)

Therefore, there is a high possibility that this paper does not satisfy the standards of international journals.
Citation: https://doi.org/10.5194/egusphere-2024-1078-RC1
RC2: 'Comment on egusphere-2024-1078', Anonymous Referee #2, 03 Jul 2024
Summary:
The authors introduced a new Python interface for the PDAF software. The idea of implementing a Python wrapper for a sophisticated Fortran library is much welcomed, since coding in Python is much easier and it will allow more rapid development, especially for new models written in Python. However, I found that the authors missed the opportunity to convince readers in this paper that pyPDAF will provide them with tools to quickly implement DA in their Python models. Instead, a lot of focus is put on details such as second-order exact sampling (in SEIK) and comparing weakly and strongly coupled DA. As a user, I only see some hints of what I need to do to use pyPDAF to build a DA system, but I don't know exactly how after reading the paper. With the benchmarking results, the authors seem to prefer the Fortran-based PDAF anyway. The test case dimension is very small compared to typical applications, so there is a question of whether the Python-Fortran type conversion overhead is significant if pyPDAF is applied to large-dimensional real models. Overall, I feel that the authors didn't promote the new software with convincing enough results.

Major issues:
1. While Figs 1 and 2 provide an overview of the software architectures, readers will not understand exactly what is needed on their side to build a typical DA system fully based on Python. Can you provide an example with a typical cycling DA experiment setup (initial ensemble generation, ensemble forecasts, computing observation priors, collecting/distributing the state, the assimilation algorithm, etc.), and highlight the functions where users can either code their own version or use something readily available from the PDAF core? The OMI is also quite confusing: what functions does it provide? I feel that you have omitted these details because they are available either in the PDAF documentation or in previous papers, but listing the details here would convince readers more effectively that it is "easy" to build a system using pyPDAF.
2. There are excessive details on the second-order exact sampling, which I found not relevant to the topic of this paper. You can use SEIK to demonstrate that the software works, but there are many other options available in PDAF. You should definitely save the page limit for something more important. The same applies to the comparison between weakly and strongly coupled DA. The main themes of the paper should be: 1. providing the details of how pyPDAF is designed; 2. what its advantage is over other Python DA prototypes; 3. how to use it to build a Python DA system; and then 4. some results from a test case. Currently, numbers 2 and 3 are largely missing.
3. The efficiency benchmark in Figure 8 is performed for a really small model (129x129x3) and ensemble size (16), and the number of observations is also really small (17x17?). What the figure conveys is that coding in Python introduces additional overhead and the resulting software is much slower, so using the Fortran code is better? I would argue otherwise: for the test case chosen here, if it only takes ~0.02 seconds per cycle, I would much prefer coding in Python since it will save me days of work compared to coding in Fortran. It doesn't matter whether it's 0.02 seconds or 0.005 seconds; both are trivial compared to the "human overhead".
You miss the opportunity here to demonstrate that pyPDAF is a good alternative to PDAF: it should be close to PDAF's efficiency despite the Python-Fortran conversion overhead, while much more favorably reducing the human overhead in coding. We all know how the ETKF scales with problem size: it is roughly O(Ne^3 + Ne^2 Nobs) per state variable. Arguably, if the problem size is much larger and the computation spent on the DA problem increases, the overhead of the Python-Fortran object conversion becomes trivial in proportion to the whole cost. Isn't that a better story to tell?
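For context, the scaling quoted above follows from the standard ETKF analysis step. The sketch below is a generic textbook formulation (assuming a diagonal observation-error covariance with variance r), not the PDAF implementation, and is included only to illustrate where the O(Ne^3) and O(Ne^2 Nobs) terms arise.

```python
import numpy as np

def etkf_analysis(Xf, y, H, r):
    """Generic ETKF analysis step (textbook form, not PDAF's implementation).

    Xf : (n, Ne) forecast ensemble, y : (p,) observations,
    H  : (p, n) observation operator, r : scalar observation-error variance.
    """
    n, Ne = Xf.shape
    xbar = Xf.mean(axis=1, keepdims=True)
    Xp = Xf - xbar                                  # forecast perturbations
    Y = H @ Xp                                      # obs-space perturbations
    d = y - (H @ xbar).ravel()                      # innovation

    # Ne x Ne analysis matrix: forming Y^T Y costs O(Ne^2 p),
    # the eigendecomposition costs O(Ne^3)
    A = (Ne - 1) * np.eye(Ne) + (Y.T @ Y) / r
    lam, U = np.linalg.eigh(A)

    wbar = U @ ((U.T @ (Y.T @ d / r)) / lam)        # weights for the mean update
    W = U @ np.diag(np.sqrt((Ne - 1) / lam)) @ U.T  # transform for the perturbations

    Xa = xbar + Xp @ (wbar[:, None] + W)            # analysis ensemble, O(n Ne^2)
    return Xa

# Small synthetic example: 1000 state variables, 16 members, 289 (17x17) observations
rng = np.random.default_rng(0)
Xf = rng.standard_normal((1000, 16))
H = np.eye(289, 1000)
y = rng.standard_normal(289)
Xa = etkf_analysis(Xf, y, H, r=1.0)
```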
Minor issues:
Line 51, "The Python tool only has a Python interface to a few PDAF routines": do you refer to EAT? And what is the difference between this "Python interface" and pyPDAF?
Line 54, "...can facilitate code development thanks to ...": maybe you mean "easier code development", because it is still possible to write Fortran code, only a bit more time consuming.
Line 56, "...written onto a disk", do you mean the model restart files?
Line 65, "can improve dynamically balanced analysis": this needs a bit of rephrasing; how about "can improve dynamical balance in the analysis"?
Line 77, "This allows for an embarrassingly parallel... does not increase with the ensemble size": this statement is a bit too general. The embarrassingly parallel variant is the LETKF, where each state variable can be analyzed separately. Some other EnKF variants are more tricky to parallelize. Also, in the ETKF the computational cost does increase with ensemble size, O(Ne^3); I guess you mean at the end of the statement "does not increase with the state dimension".
Line 81, this sentence is a repeat of the first sentence.
Line 86, "In practice computational resources limit the feasible ensemble size", are you referring to the fact that model forecasts cost a lot so that one cannot run huge ensemble forecasts?
Lines 105, 108, you've defined PDAF earlier, so why repeat the full name over and over?
Line 110, smoother, 3DVar and other non-linear filters: these are not introduced earlier; either mention them in the introduction or add some references here.
Lines 113-120, why is the detail on second-order exact sampling provided here? PDAF provides not only SEIK but also many other filter types; the SEIK-specific details should be moved later, maybe to the experiment design.
Line 131, the OMI is introduced here in only one sentence; can you provide more details? How does the user utilize the functions provided by OMI? Do they import them through the pyPDAF interface, or do they have to write their own Cython code to call the Fortran versions?
Line 134, "replaced by model restart files": this is not trivial to implement; how exactly can this be done? Since this paper only compares two online DA systems using PDAF and pyPDAF, I'm not sure implementing the offline DA is even relevant here.
Figure 2: There are several levels of "user supplied" functions, in both Python and the C interface. This is confusing. As a pyPDAF user, do I need to code in Python and compile the package, or do I also have to write Cython code? Or are these just two alternative options?
Line 138, "Due to" -> "Thanks to"
Line 141, "...also allows for an efficient code development and modifications...": this sentence needs some work. I get the first half, that pyPDAF allows you to build an online DA system for a Python model. But how does the second half relate to the first half?
Line 143, "...before performing an optimised implementation for high-dimensional Fortran-based models": there are high-dimensional Python models with well-optimised numerics, so I don't see why using pyPDAF for a high-dimensional system is not already possible, rather than only for a prototypical system.
Line 149, "pyPDAF fully supports the parallel features": can you provide more details on how the MPI features are utilized in the Python interface? In Fig. 2, is every function run with MPI? Is the Python code run with mpi4py?
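For context, one common way to organize such parallelism in Python is to split MPI_COMM_WORLD with mpi4py into per-member model communicators plus a filter communicator, similar in spirit to PDAF's COMM_model/COMM_filter layout. The sketch below shows one simple variant of such a splitting; it is an assumption-laden illustration, not pyPDAF's actual initialization code.

```python
# Minimal mpi4py sketch of a communicator splitting for online ensemble DA:
# each ensemble member runs on its own model communicator, and one process per
# member joins a filter communicator for the analysis. Illustration only.
from mpi4py import MPI

n_ens = 4                                  # number of ensemble tasks (assumed)
world = MPI.COMM_WORLD
rank, size = world.Get_rank(), world.Get_size()
assert size % n_ens == 0, "choose a process count divisible by n_ens"

procs_per_model = size // n_ens
task_id = rank // procs_per_model          # which ensemble member this rank serves

# Model communicator: processes integrating the same ensemble member
comm_model = world.Split(color=task_id, key=rank)

# Filter communicator: one "root" process per model task takes part in the analysis
in_filter = comm_model.Get_rank() == 0
comm_filter = world.Split(color=0 if in_filter else MPI.UNDEFINED, key=rank)

if in_filter:
    print(f"rank {rank}: member {task_id}, filter rank {comm_filter.Get_rank()}")
```

Under such a layout, the Python user-supplied routines would run on the model communicators, while the analysis step is carried out on the filter communicator.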
Figure 4: the error time series does not seem to reach a steady state; it keeps increasing, and there is a sign of exponential growth towards the end. Is the filter actually stable in time?
Line 265, "16 members...ETKF without spatial localization": given the results in Fig. 4, maybe you want to add some localization to stabilize the filter.
Line 278, this is not true for the final 100 years; the errors are larger than the spread.
Figure 8: you didn't provide information on all the subroutines; what does "init dim obs" mean?
Citation: https://doi.org/10.5194/egusphere-2024-1078-RC2
AC1: 'Comment on egusphere-2024-1078', Yumeng Chen, 30 Sep 2024
Model code and software
yumengch/PDAF-MAOOAM: Submitted version Yumeng Chen https://doi.org/10.5281/zenodo.11367123
yumengch/pyPDAF: pyPDAF v1.0.0 Yumeng Chen https://doi.org/10.5281/zenodo.10950129
PDAF V2.1 Lars Nerger https://doi.org/10.5281/zenodo.7861829
Viewed
HTML | PDF | XML | Total | BibTeX | EndNote
---|---|---|---|---|---
320 | 99 | 91 | 510 | 24 | 23