Enhancing weather radar data by removing non-meteorological echoes, using neural networks trained on synthetic weather data

Bölz, Richard; Kirstein, Tom; Fuchs, Lukas; Josipović, Lukas; Böhm, Annette; Blahak, Ulrich; Schmidt, Volker

doi:10.5194/egusphere-2026-992

Preprints

18 Mar 2026

Abstract. Meteorological weather radars are essential for atmospheric research, weather forecasting and aviation safety, but they often detect non-meteorological echoes from scatterers such as insects, birds, and ground clutter. These non-meteorological echoes can then lead to misinterpretations in quantitative precipitation estimation and hydrometeor classification, which cause difficulties for atmospheric research and weather forecasting. This paper introduces a novel AI-based approach to identify such non-meteorological echoes in polarimetric radar data using a convolutional neural network. More specifically, we utilize a so-called U-net, which relies on large amounts of labeled radar data for training. To address the challenge of acquiring labeled radar data consisting of meteorological and non-meteorological echoes, we generate synthetic training samples by combining preprocessed winter data (meteorological echoes) with cluttered summer data (non-meteorological echoes) provided by Deutscher Wetterdienst (DWD). After training on synthetic data, evaluation of the U-net approach on operationally measured radar data shows that it outperforms the state-of-the-art DWD classification algorithm overall. This is particularly evident in the preservation of precipitation signals at the boundaries of larger weather events.

Received: 20 Feb 2026 – Discussion started: 18 Mar 2026

Publisher's note: Copernicus Publications remains neutral with regard to jurisdictional claims made in the text, published maps, institutional affiliations, or any other geographical representation in this paper. While Copernicus Publications makes every effort to include appropriate place names, the final responsibility lies with the authors. Views expressed in the text are those of the authors and do not necessarily reflect the views of the publisher.

Status: final response (author comments only)

RC1: 'Comment on egusphere-2026-992', Jenna Ritvanen, 04 May 2026

This manuscript presents an interesting idea on how to address the lack of expert-labeled data in classification of weather radar echoes. Using this approach, the authors train a U-net model to classify meteorological and non-meteorological echoes. The issue being addressed is critical to producers and users of radar data, and the authors achieve promising results with their model. However, the small amount of data used to train and validate the model raises questions that should be addressed in the manuscript.

# Specific comments
1) One major concern is the small amount of data used in the model training. It appears that the dataset contains data from 3 hours, plus some test data from a time period that is not mentioned. Even if the dataset contains measurements from all 17 radars in the DWD network, the small temporal windows covered by the dataset raise serious concerns about the validity of the resulting model. Preferably, the authors should increase the dataset size. If using more data for the model training is not possible, at least the following issues should be addressed with sensitivity tests or discussion in the text:
    1. How representative are the selected time periods of the conditions they aim to represent? E.g., for cluttered summer measurements it is worth noting that the appearance of insect echoes can vary depending on temperature, diurnal cycle, and annual cycle. Similarly, how representative are the selected winter sweeps?
    2. How representative are the scaled winter images of summer precipitation?
    3. Does the selection of the sweeps require some considerations? If one were to attempt to repeat the study using different measurements, how should one address the dataset selection?
    4. When are the "experimentally measured mixed radar images" measured? How were the images selected?
    5. As far as I can tell, Figure 5d shows differential attenuation that is unlikely to be present in any winter measurement and thus would not appear in your training material. How does the model perform for this case or other similar artefacts in summer measurements that are not present in winter?
2) Creating synthetic images:
    1. The process for selecting the scaling factor seems rather arbitrary. When comparing the images, did you compare mean values, max values, etc.?
    2. How representative do you expect the scaling determined in the described way to be over longer time periods? Is it impacted by calibration differences, etc.?
    3. Ideally, the scaling factor should be related to some physical factor or explanation, e.g. differences between Z-R and Z-S relations, rather than subjective selection.
    4. Assignment of UDR when creating the synthetic images: did you test other approaches to assign the value, e.g. a weighted sum of the original images? Now, as far as I can follow, a radar gate in the synthetic image has contributions of both the winter and summer images in DBZH and ZDR but not in UDR? Do the resulting UDR values in the synthetic images follow a similar distribution (on their own and in joint distributions with DBZH/ZDR) as experimentally measured values?
3) There are also some concerns related to the model training and test dataset construction:
    1. There is no validation dataset. Typically, ML model training should include a training dataset used to train the model, a validation dataset for hyperparameter selection and monitoring training convergence (i.e., selecting the best model outcome among all possibilities), and a test dataset for independent validation of the selected model. Since there is no validation dataset, how is the training convergence monitored?
    2. Given that the training and test datasets are temporally overlapping, how was information leakage between the datasets reduced? The test dataset being from a different radar site does not automatically remove information leakage, as the precipitation areas move and the same precipitation could easily be present in the measurements of multiple radars; we can also expect that precipitation within one hour is correlated within different areas inside Germany. Additionally, the measurement range is said to be 150 km; do the images from multiple radars overlap?
    3. It is unclear how the two winter and summer measurements used to create a synthetic image are selected. Randomly, matching time from start of the measurement period, something else? How does this selection impact the dataset and model skill? Do you only combine images from a single radar site?
4) Description of model training is incomplete:
    1. How is the model training convergence monitored and how do you decide if the training has converged?
    2. Please list all relevant hyperparameters used in the training (e.g. learning rate), and refer to any relevant libraries used in the model implementation and training.
5) Issues related to data visualization:
    1. The colormap of ZDR measurements should be limited to show only the interval of interest, which the authors state to be around 0 dB to 20 dB (not starting from -20 dB).
    2. It would be better to show the excluded radar gates with some color that is visible, to aid the reader in interpreting the figures.
    3. I'm not sure if there is a need to repeat the range ring labels in every image; the images would be less cluttered if those were removed, especially in the smaller images.
6) I would appreciate more specificity on the description of radar measurements in the introduction. There is also some repetition in the descriptions in the introduction and Section 2.1 that could be reduced. Specific comments:
    1. Lines 51-54: If talking about radar moments, it would be better to name them, e.g. "compute so-called radar moments, such as radar reflectivity representing the strength of the signal" etc.
    2. Lines 65-66: I would interpret this to mean radar systems with waveguide switches; how about radar systems that transmit H and V polarizations simultaneously?
    3. Lines 219-221: this seems repetitive.
7) The description of the state-of-the-art method in section 3.2 is confusing. It would be better to order the description so that steps are described in the order that they are performed. For example, the paragraph starting on L609 should be after the eligible pixels are first mentioned, and the paragraph starting on L626 should follow them.
8) Eq. 1: I assume $\theta$ denotes the azimuthal angle? This should be mentioned in the text.

Citation: https://doi.org/10.5194/egusphere-2026-992-RC1
RC2: 'Comment on egusphere-2026-992', Matteo Guidicelli, 23 Jul 2026
General comments
The manuscript presents an interesting approach for identifying non-meteorological echoes in polarimetric weather-radar data. The central idea of training a U-net on synthetically generated mixed radar scenes is original and addresses the important practical difficulty of obtaining sufficiently large pixel-wise labelled datasets. The manuscript is generally clearly written, and the comparison with the operational DWD method indicates that the proposed approach may better preserve weak precipitation and the boundaries of meteorological echoes.
My main concern is the limited size, diversity, and independence of the datasets used for training and testing, as well as the apparent absence of a separate validation dataset. The meteorological winter dataset and the cluttered summer dataset each originate from a single one-hour period. To my understanding, although one radar is withheld from the synthetic training set, data from the remaining radars are collected on the same dates and may contain the same large-scale meteorological or non-meteorological structures. Therefore, the current train–test split mainly evaluates transfer to an unseen radar rather than generalization to independent events and atmospheric conditions. The evaluation on measured data is additionally based on only five manually labelled sweeps, whose exact dates and independence from the data used to develop the synthetic samples should be reported clearly. All figure captions showing radar data should also report the date and timestamp of the displayed event.
I strongly encourage the authors to evaluate the already trained model on at least one fully independent date containing both meteorological and non-meteorological echoes, as this would substantially strengthen the assessment of generalization without requiring changes to the proposed methodology. Such an evaluation would provide an important test of transfer to unseen conditions, although a single additional date would still not be sufficient to demonstrate broad robustness across seasons and event types. I would therefore recommend using an independent and diverse test set that is representative of the range of meteorological and non-meteorological conditions typically encountered in operational data. If additional independently labelled data cannot be obtained, the limitations of the current test design should be discussed explicitly, and the claims regarding generalizability, robustness, and absence of overfitting should be moderated accordingly. Although this limitation is briefly mentioned in lines 738–740, it is a central issue and should receive greater emphasis.
Given the limited training sample and the complexity of the U-net, I would also encourage the authors to assess whether model performance has saturated, for example through a learning-curve analysis using progressively larger training subsets and a fixed independent and diverse test set.
The manuscript would also benefit from a clearer explanation of how the final network architecture and training hyperparameters were selected. It is currently unclear whether a separate validation set or cross-validation was used for hyperparameter tuning.
I would also welcome some analysis of how the U-net relies on the individual radar moments, as this would improve the model interpretability (see specific comment for Sect. 4).
Overall, I find the study promising, but I recommend revisions mainly related to the amount and diversity of the data used for training and evaluation.
In the following, I report more specific comments and technical suggestions.
Specific comments
Lines 35–37: The scanned volume is not exactly conical. Please rephrase to indicate that the radar scans “approximately” conical volumes.

Lines 46–64: This paragraph provides a rather extensive textbook-level description of the radar measurement principle, which may not be necessary for the intended readership. I suggest shortening it considerably and retaining only the aspects directly relevant to the problem addressed in the study.

Lines 65–74: This paragraph provides an oversimplified and partly inaccurate description of dual-polarization radar measurements. Dual-polarization radars do not necessarily alternate horizontal and vertical polarization from pulse to pulse, as many operational systems transmit both components simultaneously. In addition, the preferential horizontal orientation applies primarily to oblate raindrops and should not be generalized to all precipitation particles. The description of the derived polarimetric variables as resulting from “differences and their correlations in time” should also be clarified.

Lines 72 and 96–107: The discussion in lines 96–107 is more closely connected to the actual classification problem than the preceding general description of radar operation. I suggest briefly introducing, possibly already around line 72, why DBZH, ZDR, and UDR provide complementary information for separating meteorological and non-meteorological echoes, while leaving their full definitions to Sect. 2.1.

Lines 116–118: The statement that conventional classifiers can incorporate information only in close proximity to the classified bin is not generally true, since spatial and temporal aggregate features may also be included. Please qualify this statement. In my opinion, the clear advantage of the U-net is that feature engineering is not necessary.

Lines 127–129: Please briefly clarify how the deep-learning approach in Atanbori et al. (2025) dealt with noisy or uncertain labels and why this improved the classification. In particular, what characteristics of the proposed deep learning model made it suitable for learning from noisy training data?

Lines 142–157: The term “infeasible” may be too categorical; wording such as “extremely laborious” may be more appropriate. The sentence at line 156 describing scaling, rotation, and orientation inversion is too methodological for the Introduction and could be removed.

Sect. 2.1 title: The title could be more precise, since the section also describes the radar moments used in the study and not only radar-data acquisition.

Lines 214–215: Please use the standard term “copolar cross-correlation coefficient”.

Fig. 1: A short description of the main differences among the winter, mixed-summer, and cluttered-summer examples would be helpful when introducing Fig. 1 for the first time.

Lines 226–239: The temporal coverage of the dataset is quite limited. It would be useful to acknowledge this when discussing the representativeness of the training samples and the generalization of the proposed method. Please discuss more explicitly how representative the selected winter sweeps are of the diversity of meteorological situations. Since the meteorological component of the synthetic training data originates from a single winter period, the model is likely not exposed to sufficiently diverse precipitation structures and polarimetric signatures (e.g. summer convection). If possible, please provide examples addressing performance under meteorological conditions that are poorly represented in the training data.

Sect. 2.3: Please clarify how the final network configuration and architecture were selected. Were alternative architectures or different U-net configurations tested, and was the chosen setup based on a preliminary sensitivity analysis or evaluated against validation data?

Lines 387–392: Please clarify whether the choice of λw ∈ [5, 8] was supported by a quantitative comparison in addition to visual inspection of the representative cases.

Fig. 4: Since −40 dBZ is used as the placeholder value, the color scale should extend down to −40 dBZ. Alternatively, use an “under” indicator to make clear that all values below −20 dBZ share the same color; otherwise, placeholder values and clipped low-reflectivity values cannot be distinguished. The same should be applied to Figs. 7 and 8.

Lines 409–417: When labelling the ground truth data, please briefly clarify that pixels containing both meteorological and non-meteorological contributions are intentionally labelled as meteorological because the meteorological signal is assumed to dominate.

Lines 440-442 - test design: For a more robust and independent assessment, the model should ideally be evaluated on data from a completely independent date, with no radar data from that date or event used during training or model selection. The current split by radar station primarily assesses generalization to an unseen radar rather than to unseen meteorological and clutter conditions. If this cannot be done, the resulting limitation for generalization should be discussed explicitly. Also, please specify which radar has been used for network evaluation.
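
The date-based test design requested here can be illustrated with a minimal sketch. It assumes each sweep is described by a record with hypothetical `station` and `date` fields; these names and the placeholder dates are illustrative assumptions and are not taken from the manuscript. The point is only that sweeps are assigned to training, validation, and test sets by acquisition date, so no radar contributes data from a test date to training.

```python
def split_by_date(sweeps, test_dates, val_dates=()):
    """Assign sweeps to train/val/test by acquisition date only.

    `sweeps` is a list of dicts with a hypothetical "date" key.
    The station is deliberately ignored, so every radar's data
    from a held-out date stays out of the training set.
    """
    test_dates, val_dates = set(test_dates), set(val_dates)
    if test_dates & val_dates:
        raise ValueError("a date cannot be in both validation and test")

    train, val, test = [], [], []
    for sweep in sweeps:
        if sweep["date"] in test_dates:
            test.append(sweep)
        elif sweep["date"] in val_dates:
            val.append(sweep)
        else:
            train.append(sweep)
    return train, val, test


# Placeholder records; the dates are labels for illustration, not real cases.
example_sweeps = [
    {"station": "A", "date": "day-1"},
    {"station": "B", "date": "day-1"},
    {"station": "A", "date": "day-2"},
    {"station": "B", "date": "day-3"},
]
train, val, test = split_by_date(
    example_sweeps, test_dates=["day-3"], val_dates=["day-2"]
)
```

A split by station alone would place the two `day-1` records in different sets whenever stations A and B are separated, which is the leakage path described in the comment.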

Sect. 2.5 - model and hyperparameter selection: Please clarify how the final architecture was selected and how the training hyperparameters were tuned. Was a separate validation set or a cross-validation procedure employed? In particular, please justify choices such as the sampling probabilities of 0.75, 0.125, the batch size, the number of training iterations, and other fixed values. Care should be taken not to use the final test datasets for hyperparameter selection.

Sect. 2.5 - learning curve: Given the limited amount of training data and the high complexity of the U-net architecture, it would be informative to include a learning-curve analysis in which the model is trained using progressively larger subsets of the training dataset and evaluated on the same independent test set. If performance is still improving as the training-set size increases, this would indicate that the model is data-limited and that additional training data could further improve its skill. Such an analysis would, however, require a test set that is sufficiently representative of the range of meteorological and non-meteorological conditions encountered in operational radar data.
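
A learning-curve analysis of this kind could be organised roughly as follows. This is a sketch under stated assumptions: `train_fn` and `score_fn` are hypothetical callables standing in for the authors' training routine and evaluation metric, neither of which is specified here. The training subsets are nested prefixes of one shuffled ordering, and the test set is held fixed across all subset sizes.

```python
import random


def learning_curve(train_fn, score_fn, train_set, test_set,
                   fractions=(0.25, 0.5, 1.0), seed=0):
    """Train on nested subsets of `train_set`; score each on a fixed test set.

    train_fn(subset) -> model and score_fn(model, test_set) -> float
    are placeholders supplied by the caller.
    Returns a list of (subset_size, score) pairs.
    """
    rng = random.Random(seed)
    order = list(train_set)
    rng.shuffle(order)  # one fixed ordering, so smaller subsets nest in larger ones

    curve = []
    for frac in fractions:
        if not 0.0 < frac <= 1.0:
            raise ValueError("fractions must lie in (0, 1]")
        n = max(1, round(frac * len(order)))
        model = train_fn(order[:n])
        curve.append((n, score_fn(model, test_set)))
    return curve
```

If the score is still rising at the largest subset size, the model is probably data-limited; a plateau suggests that additional samples of the same kind add little.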

Loss function: Please check the binary cross-entropy formula. Should the second term be (1−x) log(1−y)?
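
For reference, the standard per-pixel binary cross-entropy has the form below. The reading of the symbols is an assumption based on the reviewer's question: $x$ is taken to be the ground-truth label and $y$ the predicted probability. The manuscript's own notation may differ.

```latex
\mathcal{L}(x, y) = -\bigl[\, x \log y + (1 - x) \log (1 - y) \,\bigr],
\qquad x \in \{0, 1\}, \quad y \in (0, 1)
```

With this form, a pixel labelled $x = 0$ contributes only through the second term, so dropping the factor $(1 - x)$ would penalise confident predictions on positive pixels as well.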

Sections 3.1 and 3.2: These sections describe the evaluation metrics and the reference DWD classification method rather than presenting results. I strongly recommend moving both subsections to Materials and Methods and starting the Results section with the network evaluation.

Lines 637-644 - dates of labelled sweeps: Please report the exact dates and times of the manually labelled sweeps and clarify whether they are independent in time and date from the data used for training and for developing the synthetic-data procedure. This description should also be included in the Method section rather than the results.

Sect. 3.3: The evaluation on measured data is based on only five manually labelled sweeps. Please provide more information on the selected cases. In Fig. 9, please report the variability of performance across individual sweeps in addition to the overall scores. This would help assess how robust the reported improvement over the DWD method is.
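
Reporting per-sweep variability could look like the sketch below. It uses the intersection over union (IoU) of flattened binary masks purely as a stand-in metric; whether this matches the scores shown in Fig. 9 is an assumption, and the function names are hypothetical.

```python
from statistics import mean, stdev


def iou(truth, pred):
    """Intersection over union of two flat binary masks of equal length."""
    inter = sum(1 for t, p in zip(truth, pred) if t and p)
    union = sum(1 for t, p in zip(truth, pred) if t or p)
    return inter / union if union else 1.0  # two empty masks agree perfectly


def per_sweep_summary(truth_masks, pred_masks):
    """Score every sweep separately and summarise the spread."""
    scores = [iou(t, p) for t, p in zip(truth_masks, pred_masks)]
    return {
        "scores": scores,
        "mean": mean(scores),
        "std": stdev(scores) if len(scores) > 1 else 0.0,
        "min": min(scores),
        "max": max(scores),
    }
```

With only five labelled sweeps, listing the individual scores alongside the mean is more informative than a pooled value, since a single sweep can dominate the pooled pixel counts.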

Line 660: Unless statistical significance was formally assessed, please replace “significantly worse” with wording such as “substantially worse”.

Sect. 4: The manuscript currently provides limited insight into how the U-net uses the individual radar moments and essentially attributes the better performance than the DWD method to the larger spatial neighborhood used by the U-net. Some information and discussion about the feature importance would be interesting. For example, a permutation-based feature-importance analysis on an independent test set, obtained by perturbing each input channel separately and measuring the corresponding performance decrease, would improve the model interpretability and help quantify the model’s reliance on DBZH, ZDR, and UDR, without requiring retraining.
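
The permutation-based analysis suggested here can be sketched without retraining. The assumptions are that each test sample is a sequence of input channels (for instance DBZH, ZDR, UDR), that `predict_fn` maps one sample to a prediction, and that `score_fn(labels, predictions)` returns a scalar where higher is better. All of these names are hypothetical placeholders, not the authors' interface. One channel at a time is swapped between samples, and the resulting drop in score is recorded.

```python
import random


def channel_permutation_importance(predict_fn, score_fn, samples, labels,
                                   n_repeats=5, seed=0):
    """Mean score drop when one input channel is shuffled across samples.

    samples: list of samples, each a sequence of channels.
    Returns (baseline_score, [mean_drop_for_channel_0, ...]).
    """
    rng = random.Random(seed)
    n_channels = len(samples[0])
    baseline = score_fn(labels, [predict_fn(s) for s in samples])

    importances = []
    for c in range(n_channels):
        drops = []
        for _ in range(n_repeats):
            donors = list(range(len(samples)))
            rng.shuffle(donors)
            # Replace channel c of each sample with channel c of a random donor.
            perturbed = [
                tuple(samples[d][k] if k == c else s[k]
                      for k in range(n_channels))
                for s, d in zip(samples, donors)
            ]
            preds = [predict_fn(s) for s in perturbed]
            drops.append(baseline - score_fn(labels, preds))
        importances.append(sum(drops) / n_repeats)
    return baseline, importances
```

Shuffling whole channels between sweeps keeps the spatial structure within each channel intact while breaking its association with the other two moments, so a large drop indicates that the network depends on that moment in combination with its spatial context.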

Lines 719–727: The explanation linking the reduced performance near the radar to incorrectly labelled non-meteorological echoes in the winter data is plausible but appears speculative. Please clarify whether this was verified quantitatively or present it explicitly as a hypothesis.

Lines 728–740: The claims regarding generalizability, robustness across different times of day, and the absence of overfitting should be moderated. Although one radar is excluded from training, data from the other radars originate from the same dates and may observe the same large-scale meteorological or non-meteorological structures. The test set is therefore not fully independent in terms of events or atmospheric conditions. Together with the additional evaluation on only five measured sweeps, the current results demonstrate transfer to an unseen radar more than generalization to unseen events. A large, fully independent test dataset would make the conclusions much more robust if the reported performance were confirmed.

Lines 770–776: The suggestions to include consecutive time steps and multiple elevation angles are repeated twice. Please merge these sentences to avoid redundancy.

Technical corrections
All figure captions showing radar data should report the date and timestamp of the displayed event.

Please revise references to figures and sections according to the journal style, using “Fig.” and “Sect.” in running text.

Line 74: add the missing full stop “.” at the end of the sentence.

Line 119: Please check whether “annual structure” should read “annular structure”.

Line 144: consider removing the hyphen in “multiple-radar moments”.

Lines 160–165: The final sentence could be slightly smoother, for example: “Section 4 discusses the results, and Section 5 presents the conclusions.”

Line 164: define the abbreviation of Deutscher Wetterdienst (DWD) at first occurrence.

Lines 167–168: Please rephrase the reference to Sect. 2.1, for example: “… introduce the polarimetric radar moments in more detail in Sect. 2.1.”

Line 237: 1 is already introduced at line 224. Please remove one of the two references to avoid repetition.

Line 271: consider writing “… connected components consisting of several pixels (Fig. 1).”

Line 380: insert a space after the full stop in “images.To”.

Line 752: write “… motion (Fig. 10).”

Line 761: Please use the abbreviation DWD after it has been introduced in the manuscript.
Citation: https://doi.org/10.5194/egusphere-2026-992-RC2

Short summary

We introduce an AI-based approach to identify unwanted signals in weather radar images using a neural network. To address the challenge of acquiring sufficient amounts of labeled radar images, the network is trained on synthetically generated radar images. By evaluating the segmentation performance of the trained network on experimentally measured radar images with expert-labeled ground truth, we demonstrate that it outperforms a state-of-the-art method currently used at Deutscher Wetterdienst.

