Hourly surface nitrogen dioxide retrieval from GEMS tropospheric vertical column densities: Benefit of using time-contiguous input features for machine learning models

Gödeke, Janek; Richter, Andreas; Lange, Kezia; Maaß, Peter; Hong, Hyunkee; Lee, Hanlim; Park, Junsung

doi:10.5194/egusphere-2024-3145

04 Nov 2024

Abstract. Launched in 2020, the Korean Geostationary Environmental Monitoring Spectrometer (GEMS) is the first geostationary satellite mission for observing trace gas concentrations in the Earth’s atmosphere. Observations are made over Asia. Geostationary orbits allow for hourly measurements, which leads to a much higher temporal resolution compared to daily measurements taken from low Earth orbits, such as by the TROPOspheric Monitoring Instrument (TROPOMI) or Ozone Monitoring Instrument (OMI). This work estimates the hourly concentration of surface NO₂ from GEMS tropospheric NO₂ vertical column densities (tropospheric NO₂ VCDs) and additional meteorological features, which serve as inputs for Random Forests and linear regression models. With several measurements per day, not only the current observations but also those from previous hours can be used as inputs for the machine learning models. We demonstrate that using these time-contiguous inputs leads to reliable improvements regarding all considered performance measures, such as Pearson correlation or Mean Square Error. For Random Forests, the average performance gains are between 4.5 % and 7.5 %, depending on the performance measure. For linear regression models, average performance gains are between 7 % and 15 %. For performance evaluation, spatial cross validation with surface in-situ measurements is used to measure how well the trained models perform at locations where they have not received any training data. In other words, we inspect the models’ ability to generalize to unseen locations. Additionally, we investigate the influence of tropospheric NO₂ VCDs on the performance. The region of our study is Korea.
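The idea of time-contiguous inputs described in the abstract can be sketched as follows. This is only a minimal illustration, not the authors' implementation: the function name `build_time_contiguous_features`, the array layout, and the use of NaN for missing observations are assumptions made for this example, and the input is assumed to cover a single uninterrupted run of hours (for instance, one day at one location).

```python
import numpy as np

def build_time_contiguous_features(x, k):
    """Stack the current and the k-1 previous hourly feature vectors.

    x : array of shape (T, F) with hourly features at one location;
        NaN marks a missing observation (assumed convention).
    k : number of contiguous hours used as input (k = 1 means current hour only).

    Returns (rows, hours): rows has shape (n, k * F), hours holds the index t
    of the predicted hour for each row. Windows with any gap are dropped.
    """
    x = np.asarray(x, dtype=float)
    n_hours, n_features = x.shape
    rows, hours = [], []
    for t in range(k - 1, n_hours):
        window = x[t - k + 1 : t + 1]      # hours t-k+1, ..., t
        if np.isnan(window).any():         # skip windows with a missing hour
            continue
        rows.append(window.ravel())
        hours.append(t)
    return np.array(rows).reshape(-1, k * n_features), np.array(hours, dtype=int)
```

Each returned row could then be passed to a regression model in place of the single-hour feature vector; note that the first k-1 hours never produce a row.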

Received: 09 Oct 2024 – Discussion started: 04 Nov 2024

Competing interests: At least one of the (co-)authors is a member of the editorial board of Atmospheric Measurement Techniques.

Publisher's note: Copernicus Publications remains neutral with regard to jurisdictional claims made in the text, published maps, institutional affiliations, or any other geographical representation in this paper. While Copernicus Publications makes every effort to include appropriate place names, the final responsibility lies with the authors. Views expressed in the text are those of the authors and do not necessarily reflect the views of the publisher.

Notice on discussion status
The requested preprint has a corresponding peer-reviewed final revised paper. You are encouraged to refer to the final revised version.

Journal article(s) based on this preprint

11 Aug 2025

Hourly surface nitrogen dioxide retrieval from GEMS tropospheric vertical column densities: benefit of using time-contiguous input features for machine learning models

Janek Gödeke, Andreas Richter, Kezia Lange, Peter Maaß, Hyunkee Hong, Hanlim Lee, and Junsung Park

Atmos. Meas. Tech., 18, 3747–3779, https://doi.org/10.5194/amt-18-3747-2025, 2025

Interactive discussion

Status: closed

RC1:
'Comment on egusphere-2024-3145', Anonymous Referee #1, 25 Nov 2024

General Comments:
This paper uses machine learning models and a network of surface NO2 monitors to derive surface NO2 over South Korea with time-resolved NO2 satellite columns from the GEMS instrument. This is the first study to examine the use of time-contiguous inputs (i.e., ones from previous hours) to derive surface NO2 using machine learning. The authors show performance gains of 4.5-15%, depending on the performance measure and the model. Geostationary measurements of trace gases only became available with the launch of GEMS in 2020, and these kinds of studies are very interesting for looking at the benefits of using these new data sources.
I thought the paper was well-written and clear. My expertise is in remote sensing but not machine learning, yet I found the description of the method easy to follow and even learned a few things. The figures and results clearly indicate the use of earlier data improves the performance of these machine learning models overall. I would have liked to see a small discussion about how these improvements might change over the course of a day. The GEMS observations are probably much less accurate in the morning and late evening (high angles, less sensitivity to the surface). The morning is furthermore limited in earlier time-contiguous observations but the evening is not. How does the performance of the final result change as a function of time? Right now all the results at all times and locations are getting lumped in together.
Two rather basic models are used (linear regression and Random Forest), which the authors chose to more easily isolate the performance changes. It’s not clear how the performance with time-contiguous data would change in other model setups. Do you expect those to have the same gains in performance?
Overall, I thought it was a nice paper and recommend it be published after the authors address a few minor comments.

Specific Comments:
Line 44: Change to “the measurement of lower tropospheric gases is not accurate”
Line 44: “This is why most studies estimated daily” doesn’t follow from your previous statement. The estimate they give is still at a specific time, not a daily average which is implied here. Clarify this sentence.
Line 105: I’m confused… where did j come from? The above equation uses t-k+1 (no t-j mentioned.)
Line 148: It would be useful for context to summarize the accuracy of the NO2 product you use, both for the troposphere and the stratosphere. And how does this change over a day?
Line 153: The TM5 model may leave residual structure in the results… maybe mention resolution here.
Line 175: What kind of sensors are used? What is the accuracy of the sensors?
Line 183: “We assume” – this seems like something that should be clear in a user guide or the information could come from the data producers upon request. Is this a fact or are you really making an assumption? Without more information it could also be assumed that 1:00 UTC is describing the monthly average from 00:30-1:30 UTC. I generally find the time stamp discussion confusing. Wouldn’t it make sense to label this example as 2021/01/23/02 since two datasets at least are occurring around 2:00 UTC?
Line 209: Maybe I don’t know enough about how these models work, but I don’t understand how these negative values can be excluded, or why they have to be. Can you give some more justification? If the model is trained on a dataset that is biased at low column values of NO2, how does this affect results? If you don’t care about the bias but can’t handle negatives, why not add a background amount to make all the negatives positive to maximize use of all data? If you want to use the column values later to estimate surface NO2 in a given location but have negative values and haven’t considered them in the model, how can these be used?
Table 1: I think it would be useful to re-define N and give its unit here in caption.
Line 314: I’m not really clear about why latitude should get included at all as a feature in the first place. It’s good to see later that its inclusion doesn’t matter much, as the tropospheric VCD should have very little dependence on latitude in a physical way. Presumably the correlation in Table B1 is moderately high because in Korea the NO2 sources are dominated by a few cities including Seoul in the North, but the latitude is not the cause of enhanced tropospheric NO2. It could be important for other gases and larger domains, but not trop NO2 in a tiny area like South Korea.
Figures 2 and 3: Not a big deal but I’m not sure why left column has to be included… seems redundant with middle column which provides a more complete result.
Line 653: Here and earlier, I’m not clear why you would want to use this model outside of Korea with no VCD input (also, the focus of the paper seems to be GEMS – i.e., satellite observations). Can you elaborate under what circumstances this would be useful? I would expect it to be pretty inaccurate without the VCD, especially in regions with no monitors, and not as useful as a physical model output from something like CAMS or GEOS-CF.

Technical Comments:
There are a few minor English issues that hopefully will mostly be resolved in copy editing. I have listed a few below (non-exhaustive).
Line 16/17: Phrasing is awkward and doesn’t make sense – remove “on the ones hand” and “on the other hand”. These usually are used to describe opposites, not just different topics.
Line 20: “In short” is awkward – remove.
Line 25: Change “derived” to “calculated”
Line 25: This sentence is not clear. Need to rephrase to something like “In their study, surface NO2 was estimated by applying an assumed NO2 vertical distribution derived by a chemical transport model to tropospheric NO2 vertical column densities (tropospheric NO2 VCDs) to determine surface concentrations, using VCD measurements from the Ozone Monitoring Instrument (OMI, Levelt et al. (2006))”.
Line 31: Change “or” to “and”
Line 33: Needs a verb, i.e., “in determining surface NO2”
Line 42: “Pass over”
Line 52: Confusing phrasing “over around 20 countries”. Just say “20 countries”
Line 69: Sentence does not have a verb.
Line 140: “up to ten observations over a given location according to the season”
Line 153: Usually it’s an “air mass” not “airmass”
Line 283: Change “In all what follows” to “In all that follows”
Figures 2 and 3: In these and other figures, the line width, symbol size and sometimes font size are very small and hard to read on my screen. There are not many points, so there is a lot of room to improve the figures by making lines and symbols larger in future plots.

Citation: https://doi.org/10.5194/egusphere-2024-3145-RC1
- AC1: 'Reply on RC1', Janek Gödeke, 27 Feb 2025
  
  The comment was uploaded in the form of a supplement: https://egusphere.copernicus.org/preprints/2024/egusphere-2024-3145/egusphere-2024-3145-AC1-supplement.pdf
  
  Citation: https://doi.org/10.5194/egusphere-2024-3145-AC1
RC2:
'Comment on egusphere-2024-3145', Anonymous Referee #2, 03 Dec 2024
The paper explores the use of hourly observations from the Korean Geostationary Environmental Monitoring Spectrometer (GEMS), the first geostationary satellite for monitoring trace gases over Asia, to estimate surface NO2 concentrations. The authors leverage GEMS's high temporal resolution, which allows for hourly measurements, and combine these with meteorological data as inputs for Random Forests and linear regression models. A key innovation is the use of time-contiguous data, incorporating both current and prior hours' measurements, which enhances model performance.

The study evaluates the models using spatial cross-validation with in-situ surface measurements, assessing their ability to generalize to unseen locations. Results indicate that including previous observations improves performance, with Random Forest models achieving a 4.5% to 7.5% gain and linear regression models a 7% to 15% gain across various metrics. The research focuses on Korea and highlights the critical role of GEMS tropospheric NO2 vertical column densities (VCDs) in driving model accuracy.
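Spatial cross-validation, as summarized above, holds out whole stations rather than individual samples, so that every test sample comes from a location the model never saw during training. The following is a hedged sketch of such a split; the helper name `spatial_folds` and the random assignment of stations to folds are assumptions for illustration and may differ from the splitting scheme used in the manuscript.

```python
import random

def spatial_folds(station_ids, n_folds, seed=0):
    """Build train/test index lists in which each station sits in exactly one test fold.

    station_ids : one station identifier per sample.
    Returns a list of (train_indices, test_indices) tuples, one per fold.
    """
    stations = sorted(set(station_ids))
    rng = random.Random(seed)
    rng.shuffle(stations)
    # Assign whole stations (not single samples) to folds.
    fold_of = {s: i % n_folds for i, s in enumerate(stations)}
    folds = []
    for f in range(n_folds):
        test_idx = [i for i, s in enumerate(station_ids) if fold_of[s] == f]
        train_idx = [i for i, s in enumerate(station_ids) if fold_of[s] != f]
        folds.append((train_idx, test_idx))
    return folds
```

Because the split is made at station level, no station contributes samples to both the training and the test side of a fold.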

This work demonstrates the potential of geostationary satellite data for surface air quality assessment and underscores the advantages of incorporating temporal data in machine learning models.

The manuscript addresses an important topic and demonstrates innovative use of geostationary satellite data for estimating surface NO2 concentrations. However, substantial revisions are needed to clarify data processing, justify model choices, address potential biases, and strengthen sensitivity analyses. Providing additional validation and comparison with advanced methods could significantly enhance the robustness and impact of this work.

General Comments
Data Processing

The data processing methodology is unclear and lacks sufficient detail. The authors should enhance this section by:

Including a flowchart in the Data section to visually illustrate the entire data processing workflow.

Providing a table detailing each data source, including the spatial and temporal resolution, and any preprocessing steps applied to the input datasets.

Satellite and Ground Station Pairing

GEMS data has a relatively coarse spatial resolution (~8x8 km) compared to in-situ ground measurements. The manuscript does not clearly explain the methodology for pairing satellite pixels with ground stations. Specifically:

The statement “we associated the location of an in situ station with the VCD pixel or meteorological pixel whose center is nearest to the station’s location” needs clarification. Does this refer to the center of the satellite pixel?

If multiple ground stations fall within the same satellite pixel, how are these handled? Are they averaged, or is one selected?

GEMS pixel locations vary slightly with each scan due to orbital and observation geometry. Did the authors regrid the satellite data before co-location to ensure consistency?
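To make the pairing question concrete, a nearest-center co-location could look like the sketch below. It is an assumption-laden illustration, not the authors' procedure: the function name `nearest_pixel_index`, the use of great-circle (haversine) distance, and the flat list of pixel centers are all choices made for this example.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0

def nearest_pixel_index(station_lat, station_lon, pixel_lats, pixel_lons):
    """Return the index of the pixel whose center is closest to the station.

    All coordinates are in degrees; distance is the haversine great-circle distance.
    """
    lat1, lon1 = np.radians(station_lat), np.radians(station_lon)
    lat2 = np.radians(np.asarray(pixel_lats, dtype=float))
    lon2 = np.radians(np.asarray(pixel_lons, dtype=float))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    dist_km = 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))
    return int(np.argmin(dist_km))
```

With this kind of rule, two nearby stations can map to the same pixel, which is exactly the case the question above asks about; how such cases are handled (averaging, selecting one, or keeping both) is a separate decision that the sketch does not make.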

Characteristics of Ground Stations

What type of instruments are used at the ground stations? For example, are they chemiluminescent analyzers?

Ground stations are often categorized as urban, background, or roadside. Did the authors use all station types, or restrict their analysis to specific types? The representativeness of the training data depends on this choice.

Temporal Input and Data Loss

The use of prior hourly data as inputs raises some concerns:

The model will not produce predictions for the first few hours of each day, creating data gaps.

Cloud cover and other issues affecting satellite measurements in prior hours can propagate errors into the current hour’s input, resulting in significant data losses during training and prediction. This cascading loss reduces the dataset from over 1.3 million data points for a 1-hour input window to approximately 350,000 for a 5-hour window. The authors should provide a clear justification for accepting this trade-off between increased data gaps and potential gains in model accuracy.

Additionally, the authors should evaluate and discuss how these data gaps impact not only the training and validation phases but also the model's predictions and its applicability to real-world scenarios. This includes addressing potential limitations in the model’s ability to generalize when encountering similar conditions in operational or extended applications.
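The two data-loss mechanisms raised here (no prediction for the first hours of a day, and a single invalid hour invalidating every window that contains it) can be illustrated with a small counting sketch. The function name `count_usable_hours` and the boolean days-by-hours validity mask are hypothetical; the numbers produced are for a toy mask only and say nothing about the actual GEMS sample counts.

```python
import numpy as np

def count_usable_hours(valid, k):
    """Count hours for which the current and the k-1 preceding hours are all valid.

    valid : boolean array of shape (n_days, n_hours_per_day); True means the
            observation passed all filters. Windows never cross day boundaries.
    """
    valid = np.asarray(valid, dtype=bool)
    n_days, n_hours = valid.shape
    count = 0
    for h in range(k - 1, n_hours):            # first k-1 hours of a day are lost
        window_ok = valid[:, h - k + 1 : h + 1].all(axis=1)
        count += int(window_ok.sum())
    return count

# Toy mask: day 1 has one cloudy hour, day 2 is fully clear.
toy_valid = [[1, 1, 1, 0, 1, 1],
             [1, 1, 1, 1, 1, 1]]
for k in (1, 2, 3):
    print(k, count_usable_hours(toy_valid, k))
```

In the toy mask, one missing hour removes up to k windows on that day, which is the cascading effect described in the comment.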

Justification for Input Variables and Preprocessing

The paper lacks justification for the selection of input variables. A sensitivity analysis or variance inflation factor (VIF) analysis should be conducted to ensure the chosen variables are non-redundant and significant.

The input variables differ in units and magnitudes, which could cause instability in model performance. Did the authors scale, normalize, or log-transform these variables before training? This critical preprocessing step is missing from the discussion.
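For reference, the variance inflation factor mentioned above can be computed directly from the feature correlation matrix, since the VIF of feature j equals the j-th diagonal entry of the inverse correlation matrix. The sketch below uses synthetic data and a hypothetical helper name `variance_inflation_factors`; it does not reproduce any analysis from the manuscript.

```python
import numpy as np

def variance_inflation_factors(X):
    """Return one VIF per column of X (samples in rows, features in columns).

    Uses VIF_j = [R^{-1}]_{jj}, where R is the feature correlation matrix.
    Fails for exactly collinear columns, because R is then singular.
    """
    X = np.asarray(X, dtype=float)
    corr = np.corrcoef(X, rowvar=False)
    return np.diag(np.linalg.inv(corr))

# Synthetic example: c is almost a copy of a, b is independent of both.
rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = rng.normal(size=500)
c = a + 0.05 * rng.normal(size=500)
print(variance_inflation_factors(np.column_stack([a, b, c])))
```

Large values for the first and third column flag the redundancy between them, while the independent column stays close to 1.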

Choice of Models

The authors used Random Forest and linear regression but did not justify these choices.

More advanced machine learning methods, such as Artificial Neural Networks (ANN), Recurrent Neural Networks (RNN), or Convolutional Neural Networks (CNN), have been shown to better handle non-linear relationships and spatio-temporal dependencies in atmospheric data. The authors should explain why these advanced methods were not used or compare their results to them.

Handling of Negative Values

The authors ignored negative GEMS VCD values, which will bias the average toward positive values. Justification is needed for this choice.

Similarly, were there negative values in the in-situ measurements? If so, how were these handled? This needs to be explicitly discussed.
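The bias mechanism behind this comment can be shown with a synthetic experiment: when noisy retrievals scatter around a small true value, discarding the negative ones shifts the average upward. The sketch below assumes Gaussian retrieval noise and arbitrary units, and the function name `mean_with_and_without_negatives` is made up for this illustration; it is not based on GEMS data.

```python
import numpy as np

def mean_with_and_without_negatives(true_value, noise_sigma, n, seed=0):
    """Simulate n noisy retrievals and compare the mean of all values
    with the mean of the positive values only.

    Assumes additive Gaussian noise (an illustrative assumption).
    Returns (mean_all, mean_positive_only).
    """
    rng = np.random.default_rng(seed)
    retrieved = true_value + noise_sigma * rng.normal(size=n)
    return float(retrieved.mean()), float(retrieved[retrieved > 0].mean())

mean_all, mean_pos = mean_with_and_without_negatives(0.5, 1.0, 100_000)
print(f"all values: {mean_all:.3f}, positives only: {mean_pos:.3f}")
```

The mean over all values stays close to the true value, whereas the positives-only mean is clearly higher, which is the effect the comment asks the authors to justify or quantify.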

QA Value Threshold and Bias

The authors only used data with QA values equal to 1. This choice filters out cloudy conditions but potentially introduces a clear-sky bias since cloudy conditions can be associated with higher aerosol or NO2 levels. The authors should address this limitation and quantify its impact on results.

Inclusion of Latitude

Including latitude as an input variable needs further justification, as the latitudinal variation over South Korea is minimal. The authors should explain the rationale behind this decision.

Section-Specific Comments
Section 5.2:
The atmospheric lifetime of NO2 varies with season and time of day, and this variability likely influences model sensitivity. The authors should:

Conduct and present seasonal and diurnal sensitivity analyses to account for these variations.

Address potential biases from the limited temporal scope of training data (January 2021 to November 2022). For instance, why was data from December underrepresented, and why were only 23 months used instead of two full years?

Discuss whether differences in valid data points across seasons (e.g., more data in summer due to fewer clouds) lead to seasonal biases in model training.

Section 5.3:
The prediction maps show that the model has been applied beyond South Korea, including regions over the ocean, Japan, and North Korea. The authors should:

Validate the model's performance in these regions by comparing predictions to in-situ measurements from other countries, such as Japan. This would demonstrate the model's transferability across different geographies.

The prediction maps also exhibit noticeable grid structures, likely originating from the meteorological ERA5 dataset. Did the authors interpolate the ERA5 data to reduce these artifacts? If not, why?

Clarify how gaps in GEMS data (e.g., due to cloud cover) were handled during prediction. The maps show no missing areas (Figures 9 and 10), suggesting the model was applied to cloudy data despite such data being excluded during training. Discuss the implications of using potentially contaminated data and its impact on model accuracy.
Citation: https://doi.org/10.5194/egusphere-2024-3145-RC2
- AC2: 'Reply on RC2', Janek Gödeke, 27 Feb 2025
  
  The comment was uploaded in the form of a supplement: https://egusphere.copernicus.org/preprints/2024/egusphere-2024-3145/egusphere-2024-3145-AC2-supplement.pdf
  
  Citation: https://doi.org/10.5194/egusphere-2024-3145-AC2

Interactive discussion

Status: closed

RC1:
'Comment on egusphere-2024-3145', Anonymous Referee #1, 25 Nov 2024

General Comments:
This paper uses machine learning models and a network of surface NO2 monitors to derive surface NO2 over South Korea with time-resolved NO2 satellite columns from the GEMS instrument. This is the first study to examine the use of time-contiguous inputs (i.e., ones from previous hours) to derive surface NO2 using machine learning. The authors show performance gains of 4.5-15%, depending on the performance measure and the model. Geostationary measurements of trace gases only became available with the launch of GEMS in 2020, and these kind of studies are very interesting for looking at the benefits of using these new data sources.
I thought the paper was well-written and clear. My expertise is in remote sensing but not machine learning, yet I found the description of the method easy to follow and even learned a few things. The figures and results clearly indicate the use of earlier data improves the performance of these machine learning models overall. I would have liked to see a small discussion about how these improvements might change over the course of a day. The GEMS observations are probably much less accurate in the morning and late evening (high angles, less sensitivity to the surface). The morning is furthermore limited in earlier time-contiguous observations but the evening is not. How does the performance of the final result change as a function of time? Right now all the results at all times and locations are getting lumped in together.
Two rather basic models are used (linear regression and Random Forest), which the authors chose to more easily isolate the performance changes. It’s not clear how the performance with time-contiguous data would change in other model setups. Do you expect those to have the same gains in performance?
Overall, I thought it was a nice paper and recommend it be published after the authors address a few minor comments.

Specific Comments:
Line 44: Change to “the measurement of lower tropospheric gases is not accurate”
Line 44: “This is why most studies estimated daily” doesn’t follow from your previous statement. The estimate they give is still at a specific time, not a daily average which is implied here. Clarify this sentence.
Line 105: I’m confused… where did j come from? The above equation uses t-k+1 (no t-j mentioned.)
Line 148: Would be useful for context to summarize accuracy of the NO2 product you use, both for troposphere and stratosphere. And how does this change over a day?
Line 153: The TM5 model may leave residual structure in the results… maybe mention resolution here.
Line 175: What kind of sensors are used? What is accuracy of the sensors?
Line 183: “We assume” – this seems like something that should be clear in a user guide or the information could come from the data producers upon request. Is this a fact or are you really making an assumption? Without more information it could also be assumed that 1:00UTC is describing the monthly average from 00:30-1:30 UTC. I generally find the time stamp discussion confusing. Wouldn’t it make sense to label this example as 2021/01/23/02 since two datasets at least are occurring around 2:00 UTC?
Line 209: Maybe I don’t know enough about how these models work, but I don’t understand how these negative values can be excluded, or why they have to be. Can you give some more justification? If the model is trained on a dataset that is biased at low column values of NO2, how does this affect results? If you don’t care about the bias but can’t handle negatives, why not add a background amount to make all the negatives positive to maximize use of all data? If you want to use the column values later to estimate surface NO2 in a given location but have negative values and haven’t considered them in the model, how can these be used?
Table 1: I think it would be useful to re-define N and give its unit here in caption.
Line 314: I’m not really clear about why latitude should get included at all as a feature in the first place. It’s good to see later that its inclusion doesn’t matter much, as the tropospheric VCD should have very little dependence on latitude in a physical way. Presumably the correlation in Table B1 is moderately high because in Korea the NO2 sources are dominated by a few cities including Seoul in the North, but the latitude is not the cause of enhanced tropospheric NO2. It could be important for other gases and larger domains, but not trop NO2 in a tiny area like South Korea.
Figures 2 and 3: Not a big deal but I’m not sure why left column has to be included… seems redundant with middle column which provides a more complete result.
Line 653: Here and earlier, I’m not clear why you would want to use this model outside of Korea with no VCD input (also, the focus of the paper seems to be GEMS – i.e., satellite observations). Can you elaborate under what circumstances this would be useful? I would expect it to be pretty inaccurate without the VCD, especially in regions with no monitors, and not as useful as a physical model output from something like CAMS or GEOS-CF.

Technical Comments:
There are a few minor English issues that hopefully will mostly be resolved in copy editing. I have listed a few below (non-exhaustive).
Line 16/17: Phrasing is awkward and doesn’t make sense – remove “on the ones hand” and “on the other hand”. These usually are used to describes opposites, not just different topics.
Line 20: “In short” is awkward – remove.
Line 25: Change “derived” to “calculated”
Line 25: This sentence is not clear. Need to rephrase to something like “In their study, surface NO2 was estimated by applying an assumed NO2 vertical distribution derived by a chemical transport model to tropospheric NO2 vertical column densities (tropospheric NO2 VCDs) to determine surface concentrations, using VCDs measurements from the Ozone Monitoring Instrument (OMI, Levelt et al. (2006))”.
Line 31: Change “or” to “and”
Line 33: Needs a verb, i.e., “in determining surface NO2”
Line 42: “Pass over”
Line 52: Confusing phrasing “over around 20 countries”. Just say “20 countries”
Line 69: Sentence does not have a verb.
Line 140: “up to ten observations over a given location according to the season”
Line 153: Usually it’s an “air mass” not “airmass”
Line 283: Change “In all what follows” to “In all that follows”
Figure 2 and 3: In these and other figures, the linewidth, symbol size and sometimes font size are very small and hard to read on my screen. There are not many points, so there is a lot of room to improve the figures by making lines and symbols larger in future plots.

Citation: https://doi.org/10.5194/egusphere-2024-3145-RC1
- AC1: 'Reply on RC1', Janek Gödeke, 27 Feb 2025
  
  The comment was uploaded in the form of a supplement: https://egusphere.copernicus.org/preprints/2024/egusphere-2024-3145/egusphere-2024-3145-AC1-supplement.pdf
  
  Citation: https://doi.org/10.5194/egusphere-2024-3145-AC1
RC2:
'Comment on egusphere-2024-3145', Anonymous Referee #2, 03 Dec 2024
The paper explores the use of hourly observations from the Korean Geostationary Environmental Monitoring Spectrometer (GEMS), the first geostationary satellite for monitoring trace gases over Asia, to estimate surface NO2 concentrations. The authors leverage GEMS's high temporal resolution, which allows for hourly measurements, and combine these with meteorological data as inputs for Random Forests and linear regression models. A key innovation is the use of time-contiguous data, incorporating both current and prior hours' measurements, which enhances model performance.

The study evaluates the models using spatial cross-validation with in-situ surface measurements, assessing their ability to generalize to unseen locations. Results indicate that including previous observations improves performance, with Random Forest models achieving a 4.5% to 7.5% gain and linear regression models a 7% to 15% gain across various metrics. The research focuses on Korea and highlights the critical role of GEMS tropospheric NO2 vertical column densities (VCDs) in driving model accuracy.

This work demonstrates the potential of geostationary satellite data for surface air quality assessment and underscores the advantages of incorporating temporal data in machine learning models.

The manuscript addresses an important topic and demonstrates innovative use of geostationary satellite data for estimating surface NO2 concentrations. However, substantial revisions are needed to clarify data processing, justify model choices, address potential biases, and strengthen sensitivity analyses. Providing additional validation and comparison with advanced methods could significantly enhance the robustness and impact of this work.

General Comments
Data Processing

The data processing methodology is unclear and lacks sufficient detail. The authors should enhance this section by:

Including a flowchart in the Data section to visually illustrate the entire data processing workflow.

Providing a table detailing each data source, including the spatial and temporal resolution, and any preprocessing steps applied to the input datasets.

Satellite and Ground Station Pairing

GEMS data has a relatively coarse spatial resolution (~8x8 km) compared to in-situ ground measurements. The manuscript does not clearly explain the methodology for pairing satellite pixels with ground stations. Specifically:

The statement “we associated the location of an in situ station with the VCD pixel or meteorological pixel whose center is nearest to the station’s location” needs clarification. Does this refer to the center of the satellite pixel?

If multiple ground stations fall within the same satellite pixel, how are these handled? Are they averaged, or is one selected?

GEMS pixel locations vary slightly with each scan due to orbital and observation geometry. Did the authors regrid the satellite data before co-location to ensure consistency?

Characteristics of Ground Stations

What type of instruments are used at the ground stations? For example, are they chemiluminescent analyzers?

Ground stations are often categorized as urban, background, or roadside. Did the authors use all station types, or restrict their analysis to specific types? The representativeness of the training data depends on this choice.

Temporal Input and Data Loss

The use of prior hourly data as inputs raises some concerns:

The model will not produce predictions for the first few hours of each day, creating data gaps.

Cloud cover and other issues affecting satellite measurements in prior hours can propagate errors into the current hour’s input, resulting in significant data losses during training and prediction. This cascading loss reduces the dataset from over 1.3 million data points for a 1-hour input window to approximately 350,000 for a 5-hour window. The authors should provide a clear justification for accepting this trade-off between increased data gaps and potential gains in model accuracy.

Additionally, the authors should evaluate and discuss how these data gaps impact not only the training and validation phases but also the model's predictions and its applicability to real-world scenarios. This includes addressing potential limitations in the model’s ability to generalize when encountering similar conditions in operational or extended applications.

Justification for Input Variables and Preprocessing

The paper lacks justification for the selection of input variables. A sensitivity analysis or variance inflation factor (VIF) analysis should be conducted to ensure the chosen variables are non-redundant and significant.

The input variables differ in units and magnitudes, which could cause instability in model performance. Did the authors scale, normalize, or log-transform these variables before training? This critical preprocessing step is missing from the discussion.

Choice of Models

The authors used Random Forest and linear regression but did not justify these choices.

More advanced machine learning methods, such as Artificial Neural Networks (ANN), Recurrent Neural Networks (RNN), or Convolutional Neural Networks (CNN), have been shown to better handle non-linear relationships and spatio-temporal dependencies in atmospheric data. The authors should explain why these advanced methods were not used or compare their results to them.

Handling of Negative Values

The authors ignored negative GEMS VCD values, which will bias the average toward positive values. Justification is needed for this choice.

Similarly, were there negative values in the in-situ measurements? If so, how were these handled? This needs to be explicitly discussed.

QA Value Threshold and Bias

The authors only used data with QA values equal to 1. This choice filters out cloudy conditions but potentially introduces a clear-sky bias since cloudy conditions can be associated with higher aerosol or NO2 levels. The authors should address this limitation and quantify its impact on results.

Inclusion of Latitude

Including latitude as an input variable needs further justification, as the latitudinal variation over South Korea is minimal. The authors should explain the rationale behind this decision.

Section-Specific Comments
Section 5.2:
The atmospheric lifetime of NO2 varies with season and time of day, and this variability likely influences model sensitivity. The authors should:

Conduct and present seasonal and diurnal sensitivity analyses to account for these variations.

Address potential biases from the limited temporal scope of training data (January 2021 to November 2022). For instance, why was data from December underrepresented, and why were only 23 months used instead of two full years?

Discuss whether differences in valid data points across seasons (e.g., more data in summer due to fewer clouds) lead to seasonal biases in model training.

Section 5.3:
The prediction maps show that the model has been applied beyond South Korea, including regions over the ocean, Japan, and North Korea. The authors should:

Validate the model's performance in these regions by comparing predictions to in-situ measurements from other countries, such as Japan. This would demonstrate the model's transferability across different geographies.

The prediction maps also exhibit noticeable grid structures, likely originating from the meteorological ERA5 dataset. Did the authors interpolate the ERA5 data to reduce these artifacts? If not, why?

Clarify how gaps in GEMS data (e.g., due to cloud cover) were handled during prediction. The maps show no missing areas (Figures 9 and 10), suggesting the model was applied to cloudy data despite such data being excluded during training. Discuss the implications of using potentially contaminated data and its impact on model accuracy.
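Regarding the grid structures mentioned above, one simple way to smooth a coarse meteorological field onto finer satellite pixels is bilinear interpolation. The sketch below assumes a regular latitude-longitude grid with spacing `step` (ERA5 is commonly distributed on a 0.25° grid); the function name, arguments, and example values are hypothetical and do not reflect the authors' processing chain.

```python
import math

def bilinear(grid, lat0, lon0, step, lat, lon):
    """Bilinearly interpolate grid[i][j], defined at (lat0 + i*step, lon0 + j*step)."""
    y = (lat - lat0) / step
    x = (lon - lon0) / step
    # Clamp the lower-left cell index so that points on the upper edges stay inside the grid.
    i = min(max(math.floor(y), 0), len(grid) - 2)
    j = min(max(math.floor(x), 0), len(grid[0]) - 2)
    fy, fx = y - i, x - j
    row_lo = grid[i][j] * (1 - fx) + grid[i][j + 1] * fx
    row_hi = grid[i + 1][j] * (1 - fx) + grid[i + 1][j + 1] * fx
    return row_lo * (1 - fy) + row_hi * fy

# Hypothetical 2 x 2 patch of a coarse field on a 0.25 degree grid.
coarse = [[0.0, 1.0],
          [2.0, 3.0]]
print(bilinear(coarse, 0.0, 0.0, 0.25, 0.125, 0.125))
```

Nearest-neighbour assignment of coarse values to fine pixels produces blocky edges at cell boundaries, whereas bilinear weights vary continuously across them.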
Citation: https://doi.org/10.5194/egusphere-2024-3145-RC2
- AC2: 'Reply on RC2', Janek Gödeke, 27 Feb 2025
  
  The comment was uploaded in the form of a supplement: https://egusphere.copernicus.org/preprints/2024/egusphere-2024-3145/egusphere-2024-3145-AC2-supplement.pdf
  
  Citation: https://doi.org/10.5194/egusphere-2024-3145-AC2

Peer review completion

AR – Author's response | RR – Referee report | ED – Editor decision | EF – Editorial file upload

AR by Janek Gödeke on behalf of the Authors (27 Feb 2025) Author's response Author's tracked changes Manuscript

ED: Referee Nomination & Report Request started (05 Mar 2025) by Diego Loyola

RR by Anonymous Referee #1 (17 Mar 2025)

Suggestions for revision or reasons for rejection

The authors have addressed most of my comments in a satisfactory way, and I think the paper is much stronger.

I am not convinced by the reply about the exclusion of negative values. I note that the other reviewer also had a similar concern in their original review.
In their response, the authors note that “Negative VCDs, so negative concentrations, have no physical meaning. This is why we excluded them from both the training and the test data to increase the quality of the dataset.”
It is not true that negative values have no physical meaning in this kind of satellite data, which are in effect differential measurements. There are a number of reasons that negative values may occur. The first is fitting a spectrum with random noise. Consider a measurement of a completely “clean” atmosphere with no tropospheric NO2 and a column of zero molecules/cm². If this is measured by a detector with any noise, we would expect the observations to be distributed about zero (i.e., half will be positive and half will be negative). Excluding all negative values in an analysis with these data will bias results high, whereas in truth each small negative value is more or less as significant as each equally small positive value.
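The clean-atmosphere thought experiment above can be reproduced with a short simulation. This is a minimal sketch under stated assumptions: the true column is exactly zero, the retrieval noise is Gaussian, and the noise level `NOISE_SIGMA` is a hypothetical value in arbitrary units.

```python
import random
from statistics import mean

random.seed(42)
TRUE_COLUMN = 0.0   # clean atmosphere: no tropospheric NO2
NOISE_SIGMA = 1.0   # hypothetical retrieval noise, arbitrary units

retrieved = [random.gauss(TRUE_COLUMN, NOISE_SIGMA) for _ in range(100_000)]
kept = [v for v in retrieved if v >= 0]  # discard negative retrievals

mean_all = mean(retrieved)
mean_kept = mean(kept)
print(f"mean of all retrievals:     {mean_all:+.3f}")
print(f"mean after dropping v < 0:  {mean_kept:+.3f}")
print(f"fraction of data discarded: {1 - len(kept) / len(retrieved):.1%}")
```

For zero-mean Gaussian noise, the mean of the retained non-negative values tends to σ·√(2/π) ≈ 0.8σ instead of zero, and about half of the observations are lost.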

In addition, systematic uncertainties in spectral fitting inputs, like cross sections, could cause negative biases in the data in clean regions. Furthermore, the fact that a tropospheric column is being derived from a total column measurement using an estimated stratospheric column is another source of a potential negative bias in a tropospheric column. Even if the model cannot deal with negative values (which I guess is the case), I think there needs to be more of a discussion of how excluding these values could affect the results, rather than brushing over this point. Beyond the effect on model performance, I think it needs to be explained what will happen to the negative VCDs when the final model is applied to a map of VCDs to derive concentrations. Can these be used, or are you at risk of losing data over clean regions that are actually useful (again, potentially biasing results)?

I am also still confused by the use of latitude as a predictor variable in such a small region as Korea (even if the justification is that it has been used in a previous paper over Switzerland). I am still not convinced it is a useful variable on which to focus, but I suppose at this point it would require a lot of work to redo the model and paper. The authors responded to me and the other reviewer (who also had this concern), but I am not sure they have changed the paper in any way to address this comment. It might be helpful to add a line or two to the paper to clarify the choice of this predictor.


RR by Anonymous Referee #2 (25 Mar 2025)

Suggestions for revision or reasons for rejection

The authors addressed most of my concerns; however, further clarification and explanation are still needed.

Your response mentions the use of "models" and a "rule of thumb" while utilizing an ensemble of models trained on different time-contiguous features. However, critical questions regarding the model's practical application and potential biases remain. You suggest using a Random Forest trained with k = j' + 1 when time-contiguous features are available, and switching to a model trained without time-contiguity when they are not. Did the authors analyse the bias or any potential differences when switching among these models trained with different time-contiguous features?

In your response, you state, 'Negative VCDs, so negative concentrations, have no physical meaning. This is why we excluded them from both the training and the test data to increase the quality of the dataset.' While it is true that negative concentrations are not physically realistic in the absolute sense, negative measured values can and do occur in real-world datasets due to measurement noise, particularly when the true values are close to zero. These negative values are not necessarily indicative of data quality issues but rather a reflection of the inherent uncertainty in the measurements. Furthermore, your claim that excluding these negative values does not introduce bias is incorrect. By systematically removing negative values, you are artificially shifting the mean of the dataset towards positive values. This will inevitably introduce a positive bias in any model trained on this altered dataset, regardless of whether the test data also lacks negative values. While your model might perform well on your artificially positive-shifted test data, it will not accurately reflect real-world scenarios where extremely low values are sometimes measured as negative. Skipping that part will definitely result in a positive bias, especially for regions with low values. Also, I cannot see why the authors have to apply this additional filtering and cause additional missing data in their prediction. Therefore, the argument that the model will only be used on data without negative values, as you imply, is not well supported.

Thank you for pointing out that the black mask indicates missing data in your figures. However, I would suggest using a different color for missing values, as your coastlines and the lowest value of your colorbar are also black.


ED: Publish subject to minor revisions (review by editor) (31 Mar 2025) by Diego Loyola

AR by Janek Gödeke on behalf of the Authors (09 Apr 2025) Author's response Author's tracked changes Manuscript

ED: Publish as is (22 May 2025) by Diego Loyola

AR by Janek Gödeke on behalf of the Authors (23 May 2025)

Journal article(s) based on this preprint

11 Aug 2025

Hourly surface nitrogen dioxide retrieval from GEMS tropospheric vertical column densities: benefit of using time-contiguous input features for machine learning models

Janek Gödeke, Andreas Richter, Kezia Lange, Peter Maaß, Hyunkee Hong, Hanlim Lee, and Junsung Park

Atmos. Meas. Tech., 18, 3747–3779, https://doi.org/10.5194/amt-18-3747-2025, 2025


Viewed

Total article views: 3,052 (including HTML, PDF, and XML)

HTML	PDF	XML	Total	BibTeX	EndNote
1,719	886	447	3,052	100	154


Views and downloads (calculated since 04 Nov 2024)

Month	HTML	PDF	XML	Total
Nov 2024	246	80	14	340
Dec 2024	58	30	2	90
Jan 2025	66	24	2	92
Feb 2025	48	26	36	110
Mar 2025	38	24	104	166
Apr 2025	46	14	94	154
May 2025	48	6	98	152
Jun 2025	66	32	50	148
Jul 2025	56	8	2	66
Aug 2025	204	18	0	222
Sep 2025	402	22	4	428
Oct 2025	38	24	0	62
Nov 2025	32	36	8	76
Dec 2025	80	46	4	130
Jan 2026	56	98	12	166
Feb 2026	78	32	4	114
Mar 2026	88	206	6	300
Apr 2026	31	79	4	114
May 2026	25	59	1	85
Jun 2026	10	11	1	22
Jul 2026	3	11	1	15

Viewed (geographical distribution)

Total article views: 3,044 (including HTML, PDF, and XML) Thereof 3,044 with geography defined and 0 with unknown origin.


Latest update: 21 Jul 2026


The requested preprint has a corresponding peer-reviewed final revised paper. You are encouraged to refer to the final revised version.


Short summary

The Korean Geostationary Environmental Monitoring Spectrometer (GEMS) monitors trace gases over Asia, e.g., NO₂. GEMS provides hourly data, improving the time-resolution compared to the daily overpasses by other satellites. For the prediction of hourly surface NO₂ over Korea from GEMS observations and meteorological data, this study shows that machine learning models benefit from this higher time-resolution. This is achieved by using observations from previous hours as additional inputs.

