Deep learning tool: Reconstruction of long missing climate data based on multilayer perceptron (MLP)

Yan, Zhang; Tianxin, Xu; Chenjia, Zhang; Daokun, Ma

doi:https://doi.org/10.5194/egusphere-2024-439

Preprints

07 Mar 2024

Abstract. Long-term climate monitoring is essential for understanding the patterns and trends of climate change and for safeguarding food security. However, some weather stations have gaps in their records that span decades. In this study, 62 years of historical monitoring data from 105 weather stations in Xinjiang were used for missing-sequence prediction to validate the proposed data reconstruction tool. First, the study area was divided into three parts according to climatic characteristics and geographical location. A deep learning tool based on the multilayer perceptron (MLP) was then established to reconstruct meteorological data, taking three time scales (short term, cycle, and long term) and one spatial dimension as input, in order to fill long gaps in the sequences. By designing an end-to-end model that autonomously detects the locations of missing data and makes rolling predictions, we obtained complete meteorological monitoring data for Xinjiang from 1961 to 2022. The seven reconstructed parameters are maximum temperature (Max_T), minimum temperature (Min_T), mean temperature (Ave_T), average water vapor pressure (Ave_WVP), relative humidity (Ave_RH), average wind speed (10 m Ave_WS), and sunshine duration (Sun_H). The quality of the reconstructed data was evaluated by calculating the correlation coefficient with the monitored sequences of the nearest station. The results show that the proposed model reached satisfactory average correlation coefficients for Max_T, Min_T, Ave_T, and Ave_WVP of 0.969, 0.961, 0.971, and 0.942, respectively. The average correlation coefficients for Sun_H and Ave_RH are 0.720 and 0.789, respectively. Although extreme values are difficult to predict, the model still captures the period and trend. The reconstruction of 10 m Ave_WS is poor, with an average similarity of 0.488. Finally, we published the trained parameter files and prediction code as a microservice on the Agricultural Smart Brain platform, which provides the first deep learning tool for rapid and reliable reconstruction of meteorological monitoring data.
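
The abstract states that the model autonomously detects the locations of missing data before making rolling predictions. A minimal sketch of the detection step follows, assuming missing values are stored as NaN in a one-dimensional series; the function name `find_missing_runs` and the example values are illustrative assumptions, not taken from the manuscript or its released code.

```python
import numpy as np

def find_missing_runs(series):
    """Return (start, stop) index pairs for each contiguous run of NaN values.

    `stop` is exclusive, so series[start:stop] is entirely NaN.
    """
    missing = np.isnan(np.asarray(series, dtype=float))
    # Pad with False on both sides so runs touching either end are detected.
    padded = np.concatenate(([False], missing, [False]))
    change = np.diff(padded.astype(int))
    starts = np.flatnonzero(change == 1)   # False -> True transitions
    stops = np.flatnonzero(change == -1)   # True -> False transitions
    return list(zip(starts.tolist(), stops.tolist()))

# Hypothetical series with two gaps, one of them at the very end.
example = [1.0, np.nan, np.nan, 4.0, 5.0, np.nan]
runs = find_missing_runs(example)
print(runs)
```

Each returned pair marks one gap that a rolling predictor would then have to fill step by step; how the authors' model actually does this is not described on this page.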

Received: 14 Feb 2024 – Discussion started: 07 Mar 2024

Publisher's note: Copernicus Publications remains neutral with regard to jurisdictional claims made in the text, published maps, institutional affiliations, or any other geographical representation in this paper. While Copernicus Publications makes every effort to include appropriate place names, the final responsibility lies with the authors. Views expressed in the text are those of the authors and do not necessarily reflect the views of the publisher.

Interactive discussion

Status: closed

RC1: 'Comment on egusphere-2024-439', Anonymous Referee #1, 20 Mar 2024
This study uses multilayer perceptrons (MLPs) to reconstruct missing climate data series with a deep learning approach, providing a useful tool when data sources are limited. The manuscript is well written and develops a practical and helpful methodology. My main concerns are:
The manuscript requires a clear Materials and Methods section: which data are used to train the model, which data are used for validation, and how training and validation are carried out.

It requires a discussion of limitations and implications. The reliability of the MLP in rapidly reconstructing meteorological data is mentioned, but the applicability of the model to different environments or datasets is not discussed.

The conclusion needs to be more general and to address how the results (the tool) can be used to help science and production.

It is mentioned that the model was designed with four MLP modules and two fully connected predictive heads, but it is not explained why this specific model structure was chosen or how the performance of the model is ensured.

Discuss why the correlation coefficients for sunshine and wind speed are low.

I suggest restructuring the manuscript to follow the logic of introduction, materials and methods, results, discussion, and conclusions.

Detailed comments:
L10, 16: add a space between the two words. Also check the whole text.
L14-16: delete the description of the evaluation; just give the results.
L16: results showed that…
In the abstract, it is better to state how much data the paper used and what percentage of the data was missing. An evaluation parameter such as the root mean square error (RMSE) might help readers understand the accuracy of the reconstruction.
Keywords: use terms different from those in the title.
L26-36: the statement on the importance of climate change is too long and not closely relevant to this study; a general statement is enough.
L86: only 143 missing data? Give the total amount of data and the percentage missing. This would show the power of the method.
L85-95: this part mostly describes how the work was done, but it needs to state the clear objectives of the paper. The description of how it was done could be moved to the M&M section.
L96-123: This section gives information on the study area and the data sources, but it is too vague; for example, the table is not self-explanatory. The section could belong to 2. M&M, 2.1 Data source. In fact, where the study was done is less important than what the study data are. I suggest developing a table (combining Figure 1, Table 1 and Figure 2) that includes site name, latitude, observed years, total data, and missing-data percentage. Also clarify which data were used for training the model and which were used to validate it.
L123- : this section describes the model and belongs in M&M.
Citation: https://doi.org/10.5194/egusphere-2024-439-RC1
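
Referee #1 asks for the percentage of missing data and suggests RMSE as an additional accuracy measure. The sketch below shows one way both quantities could be computed, assuming missing values are encoded as NaN; the function names and the toy arrays are illustrative assumptions and do not reproduce any figure from the manuscript.

```python
import numpy as np

def missing_percentage(series):
    """Percentage of entries in `series` that are NaN."""
    values = np.asarray(series, dtype=float)
    return float(100.0 * np.isnan(values).mean())

def rmse(observed, reconstructed):
    """Root mean square error over positions where both series are present."""
    obs = np.asarray(observed, dtype=float)
    rec = np.asarray(reconstructed, dtype=float)
    valid = ~np.isnan(obs) & ~np.isnan(rec)
    return float(np.sqrt(np.mean((obs[valid] - rec[valid]) ** 2)))

# Toy data: one of four observations is missing.
observed = [1.0, 2.0, np.nan, 4.0]
reconstructed = [1.0, 2.0, 3.0, 6.0]

pct = missing_percentage(observed)
err = rmse(observed, reconstructed)
print(f"missing: {pct:.1f} %, RMSE: {err:.4f}")
```

Unlike the correlation coefficient, RMSE is expressed in the units of the variable, so it also exposes systematic offsets that a high correlation can hide.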
RC2: 'Comment on egusphere-2024-439', Anonymous Referee #2, 01 Jun 2024

The manuscript describes a deep learning tool applied to the reconstruction of time series of meteorological variables using the correlation coefficients of reconstructed and nearest station time series as an indicator of the model performance.
General comments
The English language is very poor throughout the manuscript and requires extensive revision to make the content clear and the presentation smooth. For example, Section 4.3 is confusing and needs extensive rewriting as the poor English contributes greatly to the poor readability of the text. Although it is possible to try to decipher the text to find out the original intended meaning the presentation is not at the level of an international journal.
The main topic of the journal is about atmospheric measurements but no information and comments on the measurements themselves are provided by the authors. Instrument type, calibration and instrument changes over time are not even mentioned (but expected for such a long measurement period). This contrasts with the fact that in the presented time series of meteorological variables (see e.g. Figure 9) there are indications of non-homogeneity (across about 2008 for the anemometer, where no-wind occurrences become less frequent, possibly because of a replacement of the cup and vane with the ultrasonic measurement principle?). In such cases is the model proposed by the authors still applicable? The authors should address this question in detail.
It is not until page 19 that the temporal resolution of the time series examined is mentioned. This creates some confusion about the actual extent of the periods of missing data since in Figure 2 the missing lines have a maximum extension of about 40 years, while in the text (e.g. in lines 86-87 and 201) 143 occurrences are called either missing data or sub-tasks, but the sub-tasks are not defined before that line and their temporal extension is not defined.
Some figures are very difficult to read, e.g. Figures 8 and 9 where 15 to 30 overlapping series are included in the same graph. A more synthetic and clearer presentation should be used. For example, in line 263 the authors state that in Figure 8 “reconstructed sequences (…) are indistinguishable from the real sequences” but this is very difficult to see from the graph. The only information that can be gleaned from these figures is however the abrupt change in the behaviour of some variables at certain weather stations, which is neither mentioned nor discussed in the text. In particular, the impact of such inhomogeneities on the proposed reconstruction method should be extensively addressed and discussed.
On page 17, the authors appear to address the limitations of their approach, but the paragraph is confused, and the consequences are not clearly stated. Limitations deserve a much more extensive and detailed discussion, which would be better presented in a dedicated section.
The conclusions are just a summary of the work, but this is the scope of the abstract rather than of the conclusions. A discussion of the main limitations and strengths of the proposed reconstruction model is missing, as is any reference to other possible approaches (e.g. the do-nothing option). What application would benefit from reconstructing time series based on temporal correlation alone? What is the improvement over using the shorter time series as is or a simple stochastic generation of the missing data?
Throughout the manuscript, and in the conclusions, the quality of the reconstructed data is only assessed by comparing the correlation coefficients of the reconstructed and nearest time series. Is this sufficient to justify the use of deep learning tools against simpler and more robust methods, e.g. stochastic generation? Would the comparison of more meaningful indicators, e.g. higher order statistics, show the same level of “credibility” and “consistency”? Is the observed frequency, the percentage of null values, etc. correctly reproduced? Does the method easily reproduce variables with a high intrinsic correlation over time and poorly reproduce variables having larger fluctuations (see Figure 10)?
It is not clear from the manuscript whether the correlation coefficient used to evaluate the quality of the reconstruction is calculated over the whole series or on a reduced portion. In the first case, the limited extension of the sequences of missing data makes the comparison unbalanced, since the missing data would only slightly affect the correlation of the whole series.
"Conflict of interest: None"
Indeed, some conflict of interest seems to exist since the proposed model is said to be offered by a company within an online platform (“Agricultural Smart Brain”) whose access (the link reported does not work) is subject to “purchasing access authority” (see section 4.5). The authors should declare their role in the company or if they get any consultancy fee or royalties from the use of the platform.
Specific comments
In the abstract, please report the temporal resolution of the data used.
The first part of the Introduction is largely unjustified, irrelevant to the content of the manuscript, and should be removed. The IPCC summary for policymakers (2012) referenced in line 30-31 is not a scientific document and contains the interpretation of scientific results by non-scientists. Therefore, its use in a scientific context is discouraged. Please refer to the scientific documents of IPCC instead.
Line 39: “climate forecasting” should be either “weather forecasting” or “climate projection”.
Line 57: Replace the dot after the word “flow” with a comma.
Lines 60-61: “next 24 hours of climate” makes no sense as the climate is the average weather over a multi-year period (30 years according to the World Meteorological Organization).
Lines 87 and 111: Weather stations with missing data are said to be 43 in line 87 and 44 in line 111.
In table 1, the variable in column two is indicated as “weather station number of subtasks” but this definition does not appear anywhere in the text. Are these the missing data sequences?
Equation in line 134 and 138, could be simplified by stating that only positive values of the expression in parentheses in the first equation are considered.
Line 139: The cost function LOSS is not defined, please explain what this function is and include its definition.
Line 158: the input sequence length is set to 8, with no indication or comments on the reason behind this choice, and especially the sensitivity of the model results to this value. The same is needed for the resample intervals (line 160) of the Periodic MLP and Trend MLP, set to 90 and 365 days, respectively.
Line 178: the sentence makes little sense, please rephrase.
Lines 186-187 and Figure 5: The order of the variables listed or plotted is not consistent with the text or the caption.
Line 190: the acronym “GWL” is used twice in this line but was never defined and appears nowhere else.
Line 201: the terminology “21 scenarios” is introduced here but its relationship with the sub-tasks is never defined and not clear at all.
Line 213-214: The two sentences make no sense, and the terminology “143 groups” is now introduced, again with no clear relationship with the terms “scenarios” and “sub-tasks” used above.
Line 224: the terminology “1000 epochs” is introduced here, with no clear meaning.
Line 283: it is not clear how the value of 45 (in parentheses) is obtained.
Line 315: the 143 sub-tasks are here indicated as “tasks” and 44 stations are mentioned. Please check and correct.
Lines 325-326: Not relevant for the conclusions section, please delete.
Technical corrections
In the citations throughout the text and in the reference list, some papers are reported using the first name instead of the family name of one or more authors.
Blanks are missing on many occasions throughout the text.
Numerous typos and grammatical errors are present throughout the text. Extensive revision is required.

Citation: https://doi.org/10.5194/egusphere-2024-439-RC2
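
Referee #2's point about computing the correlation coefficient over the whole series rather than over the reconstructed portion can be illustrated with synthetic data. The sketch below is an assumption-laden toy example: the seasonal signal, the 100-step gap, and the deliberately wrong (sign-flipped) reconstruction are all invented for illustration and say nothing about the manuscript's actual results.

```python
import numpy as np

def pearson_r(a, b):
    """Pearson correlation coefficient between two equal-length 1-D arrays."""
    return float(np.corrcoef(a, b)[0, 1])

# Synthetic reference series: three full cycles of a purely seasonal signal.
t = np.arange(3 * 365)
reference = np.sin(2 * np.pi * t / 365)

# Hypothetical reconstruction: identical to the reference outside the gap,
# and deliberately wrong (sign-flipped) inside a 100-step gap.
gap = slice(400, 500)
reconstructed = reference.copy()
reconstructed[gap] = -reference[gap]

r_whole = pearson_r(reference, reconstructed)
r_gap = pearson_r(reference[gap], reconstructed[gap])
print(f"whole series r = {r_whole:.3f}, gap only r = {r_gap:.3f}")
```

In this toy case the whole-series coefficient stays clearly positive even though the reconstructed segment is perfectly anti-correlated with the reference, which is why the evaluation window needs to be stated explicitly.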

Viewed

Total article views: 741 (including HTML, PDF, and XML)

HTML	PDF	XML	Total	BibTeX	EndNote
530	180	31	741	34	65

Views and downloads (calculated since 07 Mar 2024)

Month	HTML	PDF	XML	Total
Mar 2024	89	23	7	119
Apr 2024	25	3	4	32
May 2024	28	16	2	46
Jun 2024	53	9	6	68
Jul 2024	15	11	2	28
Aug 2024	12	5	2	19
Sep 2024	23	6	0	29
Oct 2024	9	8	0	17
Nov 2024	16	8	0	24
Dec 2024	7	8	0	15
Jan 2025	9	4	1	14
Feb 2025	14	6	0	20
Mar 2025	18	14	1	33
Apr 2025	17	15	0	32
May 2025	16	6	0	22
Jun 2025	20	9	1	30
Jul 2025	11	8	1	20
Aug 2025	37	8	1	46
Sep 2025	104	8	3	115
Oct 2025	7	5	0	12

Viewed (geographical distribution)

Total article views: 757 (including HTML, PDF, and XML) Thereof 757 with geography defined and 0 with unknown origin.

Latest update: 09 Oct 2025

Short summary

A deep learning tool based on multilayer perceptron (MLP) is established for the meteorological data reconstruction at three time scales. Seven parameter reconstruction methods were used to validate the proposed data reconstruction tool. Finally, the trained parameter files and prediction code were released as microservices on the Agricultural Smart Brain platform, which provides the first deep learning tool for rapid and reliable reconstruction of meteorological monitoring data.

