DCU-accelerated 3DVAR data assimilation with automatic differentiation for WRF-Chem
Abstract. This study developed a PyTorch-based three-dimensional variational (3DVAR) data assimilation system (Py3DVAR) for the Weather Research and Forecasting model coupled to Chemistry (WRF-Chem), which integrates automatic differentiation (AD) to replace traditional manual gradient derivation and adopts Deep Computing Unit (DCU) acceleration for high computational efficiency. Py3DVAR enables the simultaneous assimilation of gaseous pollutants (SO₂, NO₂, CO, O₃) and particulate matter (PM₂.₅, PM₁₀) and supports flexible deployment on both Central Processing Unit (CPU) and DCU computing platforms. To evaluate its performance and efficiency, idealized and real-case assimilation experiments (27 km and 9 km grid resolutions) were conducted, compared against a traditional CPU-parallelized Fortran-based 3DVAR system (Fortran-3DVAR). Idealized results show Py3DVAR effectively propagates observation information, generating increment fields consistent with Fortran-3DVAR. In real-case experiments, Py3DVAR substantially improves the model initial field quality: at 27 km resolution, correlation coefficients (CORR) for SO₂, NO₂, CO, O₃, PM₂.₅, and PM₁₀ increased by 0.77, 0.51, 0.71, 0.98, 0.60, and 0.69, respectively; corresponding improvements at 9 km resolution are 0.78, 0.98, 0.66, 0.96, 0.63, and 0.78. The root mean square error (RMSE) and mean absolute error (MAE) are also significantly reduced, with analysis field accuracy comparable to Fortran-3DVAR. In terms of computational efficiency, Py3DVAR shows remarkable advantages: on the same CPU platform, the total iteration time at 27 km resolution is only 7.1 s, approximately 8.8 times faster than Fortran-3DVAR (62.5 s); on the DCU platform, the speedup reaches 32.7 times at 27 km and 40.3 times at 9 km. 
A 24-hour forecast test shows that the improved initial fields have sustained positive effects on short-term forecasts: the improvements persist for over 24 hours for SO₂, CO, PM₂.₅, and PM₁₀, and for over 6 hours for NO₂ and O₃. This study confirms that Py3DVAR achieves order-of-magnitude gains in computational efficiency while maintaining accuracy equivalent to traditional assimilation algorithms, providing a flexible new technical pathway for operational atmospheric chemical data assimilation and future intelligent assimilation systems.
The authors present a numerical study that evaluates the computational performance of a 3D-Var algorithm implemented using the PyTorch library and compares it to a conventional Fortran-based implementation. They conclude that the PyTorch version offers substantial computational advantages, especially when using accelerators instead of traditional CPUs.
The manuscript is reasonably easy to read and its goals and findings are for the most part clearly formulated. Conceptually, the study offers little progress beyond the existing data assimilation literature, since the basic techniques that are used were developed decades ago, and they have been applied in air quality forecasts in many studies. However, 3D-Var methods are used operationally in many applications and techniques to improve their efficiency could be of practical interest to modelers. Libraries with built-in automatic differentiation could also open up new possibilities for estimating error covariance parameters.
The study does suffer from some significant limitations which are not discussed in the manuscript and which may hinder its transferability to other assimilation setups. These can be summarized as follows.
First, the numerical experiments are limited in scope. The first, idealized experiment could serve as a verification that the two implementations indeed produce numerically equivalent results, but the setting is probably too unrealistic to give useful insights into the computational performance. I don’t see it deserving as much weight in the results as it is currently given. For example, having only two assimilated data points is likely to make the optimizer converge much faster than it would in a realistic situation. The second experiment is more realistically set up, but still only involves a single analysis step. I don’t think that it would be prohibitively difficult to run multiple assimilations to check that the findings continue to hold as the chemical and physical atmospheric conditions change.
Only assimilation of in-situ data is considered. Although in-situ data are very commonly used in practice, this is a limitation from the computational point of view. Especially with the narrow correlation distances used here, using only point observations results in a very well-behaved minimization problem. Also the process of developing the adjoint observation operator becomes very simple in this case, which reduces the advantage of the built-in automatic differentiation in the numerical library. The assimilation experiments should have used a split between assimilation and test data, since evaluating the analysis against the assimilated data is largely uninteresting.
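A minimal sketch of the kind of station split meant here. The station identifiers and the 20 % withholding fraction are hypothetical choices for illustration and are not taken from the manuscript:

```python
import random

def split_stations(station_ids, test_fraction=0.2, seed=0):
    """Randomly withhold a fraction of stations for independent verification."""
    ids = sorted(station_ids)          # sort first so the split is reproducible
    rng = random.Random(seed)
    rng.shuffle(ids)
    n_test = max(1, round(len(ids) * test_fraction))
    return ids[n_test:], ids[:n_test]  # (assimilated, withheld)

# Hypothetical station identifiers, for illustration only.
stations = [f"S{i:03d}" for i in range(50)]
assimilated, withheld = split_stations(stations)
```

The analysis statistics (CORR, RMSE, MAE) would then be computed only at the withheld stations. A spatially stratified split may be preferable where the network is strongly clustered.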
The performance analysis lacks depth. The computations are mainly compared in terms of wall clock time. While I am not surprised that Py3DVAR running on DCUs outperforms the CPU versions, I am somewhat surprised by the big performance gap between the CPU-running PyTorch and Fortran versions. The PyTorch library is no doubt very well optimized, but it is also possible that the Fortran version is suboptimal in some way. While it’s true that manual code optimization can be tedious compared to simply using a package like PyTorch, it would be useful to analyze where the bottleneck is (for example, is it the single-core performance, parallelization of the linear algebra, or something else).
Finally, some aspects of the experiments, particularly the Fortran implementation of 3D-Var, are insufficiently described and referenced. As far as I understand, WRF-Chem includes a data assimilation component, but it’s unclear to me whether it is being used here. Again, this makes it difficult for the reader to understand whether the performance gains demonstrated here would transfer to other assimilation systems.
In summary, in my view, the manuscript makes a valid point on the potential of PyTorch and similar libraries in implementing variational (or other) practical data assimilation algorithms. However, the experimental results are rather minimal, and I strongly recommend expanding the experiments to cover variable conditions, and to give more details on how the methods perform in a real-world assimilation setup. The manuscript could also use improvements in the scope of the introduction and in the use of citations. Some conclusions should be revised to better match the extent of the study.
Specific comments
Introduction: Much of the literature and topics reviewed here are tangential to the work that is being reported. For example, a paragraph (L104-L132) is devoted to discussing the use of remote sensing data in chemical data assimilation, which is not very relevant for the current study that only uses in-situ observations. Also the discussion of various AI and hybrid assimilation approaches (L147-L196) appears unnecessarily extensive, since the current study is based on a classical variational method. A seemingly relevant reference that is not mentioned is Cheng et al. (2025), who report implementing multiple assimilation algorithms including 3D-Var using PyTorch. Some useful connections might also be found in the Gaussian process literature, as done by Key et al. (2025), given the similarities between 3D-Var and Gaussian process regression.
L37: “Errors in their initial conditions can accumulate during model integration”. This happens in some systems. But does it really happen in air quality models? In the forecast experiments in this study the initial conditions seemed to have a rather transient (6-24 h) effect on the forecasts, and similar results have been shown for many other models as well.
L77: Please define what 3DVAR stands for and add appropriate references for the method.
L82: Please give references for the operational systems. Many operational systems have also switched from 3D-Var to more advanced methods.
L120: what is a “collaborative assimilation scheme”?
L120: Ye et al. 2020 is not listed (but 2021 and 2022 are)
L129: Year missing from Wang et al citation
L142: Please define “DCU” and briefly explain whether there are significant differences from the GPU hardware that some readers might be more familiar with.
L212: "principle inheritance" and "implementation innovation". Are these established concepts? If not, please consider whether they are really helping the reader.
L228: Please provide some overall references for the WRF-Chem.
L239-243: Please provide appropriate references for the chemical mechanisms and aerosol models. I don’t think they are those that are given here. Which of the gas phase mechanisms did you use?
L244: which model version?
L286: I guess there’s a paper that could be cited for MEIC?
L362: Does the observation operator include horizontal and/or vertical interpolation?
L372: Two issues: first, the idea (“method of Li et al.”) of preconditioning the iteration with the square root of B is quite fundamental and has been described already in early variational DA literature (see e.g. the discussion in El Akkraoui et al., 2013). Second, the transformation alone would probably not solve the memory issue, since the factorization of B would in the general case be very large. It is arguably the separation of dimensions explained in section 2.6 that makes the problem more easily solvable.
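To illustrate the second point, here is a small NumPy sketch in two dimensions, assuming a Gaussian correlation kernel and arbitrary grid sizes and length scales (the manuscript's actual kernel, dimensions and notation may differ). It applies the square root of a separable B to a control vector using only the per-dimension factors and checks the result against the explicitly formed Kronecker product:

```python
import numpy as np

def gaussian_corr_sqrt(n, length_scale):
    """Symmetric square root of a 1-D Gaussian correlation matrix (assumed kernel)."""
    idx = np.arange(n, dtype=float)
    dist = idx[:, None] - idx[None, :]
    corr = np.exp(-0.5 * (dist / length_scale) ** 2)
    w, v = np.linalg.eigh(corr)
    w = np.clip(w, 0.0, None)   # guard against tiny negative eigenvalues
    return (v * np.sqrt(w)) @ v.T

def apply_sqrt_b(v, ux, uy):
    """Apply (ux kron uy) to v without forming the Kronecker product."""
    nx, ny = ux.shape[0], uy.shape[0]
    return (ux @ v.reshape(nx, ny) @ uy.T).ravel()

nx, ny = 12, 9                    # arbitrary illustrative grid sizes
ux = gaussian_corr_sqrt(nx, 2.0)
uy = gaussian_corr_sqrt(ny, 1.5)
v = np.random.default_rng(0).standard_normal(nx * ny)

fast = apply_sqrt_b(v, ux, uy)    # stores nx**2 + ny**2 matrix entries
dense = np.kron(ux, uy) @ v       # stores (nx * ny)**2 matrix entries
```

The factored form needs nx² + ny² stored entries instead of (nx·ny)², and it is this reduction, rather than the change of variable by itself, that keeps the memory footprint manageable.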
L387: Please give a reference for the L-BFGS method.
L390: It’s surprising that you would manually implement the L-BFGS algorithm, since there are many well-established implementations available for Fortran.
L390: Is the Fortran version an existing framework (reference?) or did you write it from scratch? How is it parallelized? What numerical linear algebra library (if any) does it link with?
L402: Automatic differentiation tools have for a long time been available for Fortran and other languages, even if they are not always easy to use.
L408: I actually don’t find Eq. 4 especially complex as far as adjoint operators go. Please elaborate why it struggles to exploit parallel computing.
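For reference, a sketch of what a hand-written gradient looks like, assuming that Eq. 4 has the standard preconditioned form ∇J(v) = v + Uᵀ Hᵀ R⁻¹ (H U v − d) with U standing in for B^(1/2), point observations and a diagonal R. All sizes and values below are arbitrary, and the manuscript's notation may differ. The gradient is checked against central finite differences:

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 40, 6                             # state size and observation count (arbitrary)
U = rng.standard_normal((n, n)) * 0.3    # stand-in for B^(1/2)
H = np.zeros((m, n))
H[np.arange(m), rng.choice(n, m, replace=False)] = 1.0  # point observations
r_inv = np.full(m, 1.0 / 0.5**2)         # diagonal of R^-1 (obs error std 0.5)
d = rng.standard_normal(m)               # innovation y - H x_b

def cost(v):
    resid = H @ (U @ v) - d
    return 0.5 * v @ v + 0.5 * resid @ (r_inv * resid)

def grad(v):
    # Hand-written adjoint: two transposed matrix-vector products.
    resid = H @ (U @ v) - d
    return v + U.T @ (H.T @ (r_inv * resid))

v0 = rng.standard_normal(n)
eps = 1e-6
fd = np.array([(cost(v0 + eps * e) - cost(v0 - eps * e)) / (2 * eps)
               for e in np.eye(n)])
max_err = np.max(np.abs(fd - grad(v0)))  # small: the cost is quadratic
```

In this form the gradient consists of dense matrix-vector products, which standard linear algebra libraries already parallelize.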
L412: This should probably be section 2.6.
L414: You might consider also referring to earlier work where the tensor product formulation has been introduced.
Figure 2: It is very difficult to see the differences between the different lines here.
L462: Were the forecasts for estimating the background error standard deviations run using the same setup as the assimilation experiments? Which resolution was used?
Figure 4: I’m quite puzzled with what is shown here. Is this a fit, or the original, empirical correlation function? How many pairs of data points are there for each distance to evaluate the correlation coefficient? It seems that the SO2 fields must have very different spatial features from the others to explain this level of difference, or, put differently, the correlation functions for all other pollutants appear to be exceptionally regular. This is surprising, since also pollutants like NO2 or PM10 are emitted by point sources and they do typically show strong spatial heterogeneity. What is the x axis unit?
L494: That’s what the kernel prescribes, but is it a physical mechanism?
L537: Can you verify that the use of single precision floating point numbers is not in itself introducing numerical errors to the analysis?
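Such a check could be as simple as repeating an analysis in both precisions and comparing the increments. The sketch below does this for a small one-dimensional toy problem solved in closed form, assuming a Gaussian B, point observations and arbitrary sizes (none of which are taken from the manuscript). An iterative minimization on the full grid may behave differently, so the comparison should ideally be made with Py3DVAR itself:

```python
import numpy as np

def analysis_increment(dtype, n=200, m=20, length_scale=5.0, obs_std=0.5, seed=0):
    """Closed-form increment B H^T (H B H^T + R)^-1 d, computed in the given dtype."""
    rng = np.random.default_rng(seed)
    idx = np.arange(n, dtype=np.float64)
    b = np.exp(-0.5 * ((idx[:, None] - idx[None, :]) / length_scale) ** 2)
    obs_idx = rng.choice(n, m, replace=False)
    d = rng.standard_normal(m)               # innovation y - H x_b
    b, d = b.astype(dtype), d.astype(dtype)  # cast inputs to the tested precision
    bht = b[:, obs_idx]                      # B H^T for point observations
    hbht = bht[obs_idx, :]                   # H B H^T
    r = (obs_std ** 2) * np.eye(m, dtype=dtype)
    return bht @ np.linalg.solve(hbht + r, d)

dx64 = analysis_increment(np.float64)
dx32 = analysis_increment(np.float32)
rel_diff = np.linalg.norm(dx32.astype(np.float64) - dx64) / np.linalg.norm(dx64)
```

Reporting a relative difference of this kind for the real-case analyses would show whether single precision affects the results at a level comparable to the observation errors.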
L542: I cannot find any information about Intel Xeon 7495 processors. Please check.
Section 4.1: This section should be shortened. Most of the findings here are very unsurprising and basically confirm that the code is working. I don’t think that the reader needs to be explained that the increments “exhibit an approximately circular distribution”, and so on.
L737: Is it possible to examine how sensitive the results are to the choice of parameters of the optimization algorithm (which might be more or less arbitrary)? Here it would be useful to perform a longer experiment to find out if the differences could be caused by essentially random differences in the iterations.
L751: Why not show the single iteration time for the Fortran version?
L793: Can you elaborate on how the automatic parallel optimization works on the current problem? It would be very useful to see the speedup of both CPU codes as a function of the number of cores.
L827: This is a very common finding in chemical data assimilation and it might be worth citing some earlier studies with similar results.
L867: “significantly optimizes” sounds ambiguous in a statistical context.
L870-881: The listing of statistical parameters is very tedious and should be summarized.
L884: “due to its use of vectorized […] computations”. With the material that has been presented (and no source code for the Fortran implementation), it’s actually not possible to identify the causes for the performance difference. For all we know, it could be due to a sub-optimally ordered for loop, inefficient use of parallelization directives, or a poorly performing linear algebra library.
L900: Again, there is no evidence of the role of automatic differentiation in the speedup. I suspect it to be rather small, since the gradient (Eq. 4) is easy to implement in any computational framework.
L904: The closing statement is an unnecessary hyperbole. Besides using an ML framework for doing data assimilation, you have not shown any results on integration of deep learning models or “intelligent data assimilation”, whatever it might mean.
References
Cheng, S., Min, J., Liu, C. and Arcucci, R., 2025. TorchDA: A Python package for performing data assimilation with deep learning forward and transformation functions. Computer Physics Communications, 306, p.109359.
El Akkraoui, A., Trémolet, Y. and Todling, R., 2013. Preconditioning of variational data assimilation and the use of a bi‐conjugate gradient method. Quarterly Journal of the Royal Meteorological Society, 139(672), pp.731-741.
Key, O., Takao, S., Giles, D. and Deisenroth, M.P., 2025. Scalable data assimilation with message passing. Environmental Data Science, 4, p.e1.