Multi-objective calibration and evaluation of the ORCHIDEE land surface model over France at high resolution
Abstract. We present here a strategy to obtain a realistic hydrological simulation over France with the ORCHIDEE land surface model. The model is forced by the Safran atmospheric reanalysis at 8-km resolution and hourly time steps from 1959 to 2020, and by a high-resolution DEM (around 1.3 km in France). Each Safran grid cell is decomposed into a graph of hydrological transfer units (HTUs) based on the higher-resolution DEM to better describe lateral water movements. In particular, it is possible to accurately locate 3507 of the 4081 stations collected from the national hydrometric network HydroPortail (filtered to drain an upstream area larger than 64 km²). A simple trial-and-error calibration is conducted by modifying selected parameters of ORCHIDEE to reduce the biases of the simulated water budget with respect to two evapotranspiration products (the GLEAM and FLUXCOM datasets) and the HydroPortail observations of river discharge. The simulation that is eventually preferred is extensively assessed with classic goodness-of-fit indicators complemented by trend analysis at 1785 stations (filtered to have records for at least 8 entire years) across France. For example, the median bias of evapotranspiration is −0.5 % against GLEAM (−4.3 % against FLUXCOM), the median bias of river discharge is 6.3 %, and the median KGE of square-rooted river discharge is 0.59. The spatial contrasts and temporal trends of river discharge across France are well represented, with an accuracy of 76.4 % for the trend signal and 62.7 % for the trend significance. Despite inadequate performance in some specific regions (the Alps and the Seine sedimentary basin), this study offers a thorough historical overview of water resources and a robust configuration for climate change impact analysis at the nationwide scale of France.
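For readers unfamiliar with the indicators quoted above, here is a minimal sketch of how a percent bias and a KGE of square-rooted discharge can be computed; it is illustrative only, not the authors' code, and the synthetic series and variable names are assumptions.

```python
import numpy as np

def percent_bias(sim, obs):
    """Relative bias of simulated vs. observed totals, in percent."""
    return 100.0 * (np.sum(sim) - np.sum(obs)) / np.sum(obs)

def kge(sim, obs):
    """Kling-Gupta efficiency (Gupta et al., 2009); 1 is a perfect fit."""
    r = np.corrcoef(sim, obs)[0, 1]        # linear correlation
    alpha = np.std(sim) / np.std(obs)      # variability ratio
    beta = np.mean(sim) / np.mean(obs)     # bias ratio
    return 1.0 - np.sqrt((r - 1.0)**2 + (alpha - 1.0)**2 + (beta - 1.0)**2)

# Synthetic daily discharge (m3/s) standing in for one gauging station.
rng = np.random.default_rng(0)
q_obs = rng.lognormal(mean=1.0, sigma=0.8, size=3650)
q_sim = q_obs * rng.lognormal(mean=0.05, sigma=0.3, size=3650)

print(f"bias = {percent_bias(q_sim, q_obs):.1f} %")
# Square-rooting the series before the KGE damps the weight of flood peaks.
print(f"KGE(sqrt Q) = {kge(np.sqrt(q_sim), np.sqrt(q_obs)):.2f}")
```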
Notice on discussion status
The requested preprint has a corresponding peer-reviewed final revised paper. You are encouraged to refer to the final revised version.
Interactive discussion
Status: closed
RC1: 'Comment on egusphere-2024-445', Anonymous Referee #1, 01 May 2024
The authors present a multi-objective calibration and evaluation of high-resolution hydrologic simulations over France produced by the ORCHIDEE land surface model (LSM). They conduct a comprehensive evaluation of the model performance, considering both classic goodness-of-fit indicators (including KGE and bias) and trends in streamflow and ET. The comparison is promising. Overall, the manuscript is well written, with high-quality figures.
However, I still have the following concerns.
In the abstract, the authors claim that they present a strategy to obtain a realistic hydrological simulation over France, but in fact, the ORCHIDEE land surface model does not consider human impacts, leading to poor performance in some regions in France. Therefore, it would be more precise to state “a reliable hydrological simulation”.
Through the comparison of streamflow and ET, the model always performs better against GLEAM than against FLUXCOM. Is that because GLEAM is still model-based data, while FLUXCOM is generated from observational datasets? If the ORCHIDEE land surface model uses physics equations similar to those of GLEAM, we may expect the results from ORCHIDEE and GLEAM to be in good agreement.
In the Introduction section (Lines 77-87, Page 3), the authors introduce the first distributed LSM at the nationwide scale of France, SIM. SIM has shown very good performance in generating hydrologic simulations. Why did the authors decide to use another LSM, ORCHIDEE, for France? What are the limitations of SIM?
In Lines 264-265, Page 11, the authors state that “The timelag criterion of the simulated Q is also greatly improved from a range of -11 to 27 days to a range of -3 to 5 days.” In my opinion, a zero time lag is the best, right?
Specific comments:
Line 102, Page 4: “(revision 7738)” should be deleted.
Line 219, Page 8: “STD” first appears. What is “STD”?
Table 2: Please explain the meanings of the labels in the caption, such as “PPV”.
Citation: https://doi.org/10.5194/egusphere-2024-445-RC1
- AC1: 'Reply on RC1', Peng Huang, 12 Jul 2024
RC2: 'Comment on egusphere-2024-445', Anonymous Referee #2, 26 May 2024
The article presents the calibration and the evaluation of the ORCHIDEE land surface model over France.
The article has several major limitations. Several important choices in the methodology applied by the authors are not well explained or justified. The model version with modified parameter sets provides less biased results than the standard version, but it is difficult to evaluate whether the results should be considered satisfactory over the test domain due to the lack of an external benchmark. Besides, some explanations of model failure remain unverified hypotheses.
Major comments
Title: As detailed in comments below, I think the article does not explain how the “multi-objective calibration” of the model was done (or at least I did not understand that). This is a strong limitation of the article.
L10. Getting an almost perfect match on actual evapotranspiration, given the uncertainty in the observational product used, may be considered overcalibration. This almost perfect bias value actually hides a large spatial variability over the study domain.
L10-15. I found these sentences too optimistic about the model results. It seems that many modelling problems remain before the actual hydrological dynamics are well captured. I did not see a convincing demonstration that these results would provide a “thorough historical overview of water resources” nor a “robust configuration for climate change impact analysis”. The study does not analyse how model performance evolves over the study period, and the model robustness is not evaluated by dedicated tests (robustness when extrapolating in space or time).
L65-67: Though erroneous data may prevent obtaining good calibration results, the model itself is generally the main problem in getting good results.
L120: Why is the soil 2 m deep everywhere? Is that not a strong approximation, given that it is not the case in reality?
L126: Why 22 layers? Are they all of the same depth? Is this level of complexity justified by model performance (and consistent with the approximation mentioned in the previous comment)?
L147-156: Though linear stores are very commonly used in hydrological modelling, they have limited efficiency in simulating some flow ranges, typically low flows. Why are only linear stores used in the model?
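For context, a linear store releases an outflow proportional to its storage; in standard notation (not necessarily ORCHIDEE's exact formulation):

$$\frac{dS}{dt} = I(t) - Q(t), \qquad Q(t) = \frac{S(t)}{\tau},$$

so that, without inflow, the store drains exponentially, $S(t) = S_0\,e^{-t/\tau}$. A single time constant $\tau$ per store is one reason why such stores struggle with low-flow recessions that deviate from a simple exponential decay.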
Section 2.3: It was not fully clear to me at which time steps the evaluation criteria were calculated. At the monthly time step for ET and at the daily time step for Q? The authors could also explain which expected model qualities are assessed by the selected criteria. In particular, one could expect some criteria focusing on high and low flows. Bias and the correlation coefficient are two of the three components of the KGE criterion. Would the third component (the ratio of standard deviations, or of coefficients of variation) also be useful to consider?
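For reference, the KGE (Gupta et al., 2009) combines exactly these three components:

$$\mathrm{KGE} = 1 - \sqrt{(r-1)^2 + (\alpha-1)^2 + (\beta-1)^2},$$

where $r$ is the linear correlation between simulated and observed series, $\alpha = \sigma_{\mathrm{sim}}/\sigma_{\mathrm{obs}}$ is the variability ratio (replaced by the ratio of coefficients of variation in the KGE' variant), and $\beta = \mu_{\mathrm{sim}}/\mu_{\mathrm{obs}}$ is the bias ratio.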
Section 2.3: It is unclear whether the observed time series were visually checked before use. In large datasets, there are often many remaining observational errors, which may strongly influence model evaluation.
L205: One part of the evaluation is on trends. However, 8 years (for the shortest series) is too short to evaluate trends. This evaluation should be restricted to stations with long time series (at least 30 years).
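As an illustration of the kind of test typically behind a "trend signal" and its "significance" (whether the paper uses exactly this procedure is an assumption), a minimal Mann-Kendall test on a synthetic annual discharge series could look like this:

```python
import numpy as np
from scipy.stats import norm

def mann_kendall(x, alpha=0.05):
    """Mann-Kendall trend test (no tie correction): returns (sign, significant)."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    # S statistic: sum of signs of all pairwise forward differences.
    s = sum(np.sign(x[j] - x[i]) for i in range(n - 1) for j in range(i + 1, n))
    var_s = n * (n - 1) * (2 * n + 5) / 18.0
    z = 0.0 if s == 0 else (s - np.sign(s)) / np.sqrt(var_s)
    p = 2.0 * (1.0 - norm.cdf(abs(z)))
    return int(np.sign(s)), bool(p < alpha)

# Example: 30 years of annual mean discharge with a weak drying trend (synthetic).
rng = np.random.default_rng(1)
years = np.arange(30)
q_annual = 20.0 - 0.05 * years + rng.normal(0.0, 1.0, size=30)
print(mann_kendall(q_annual))
```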
Section 2.4: This section is not detailed enough to fully understand what was done by the authors. Table 1 seems to suggest that 5 parameters were selected and that only six combinations of these parameters were tested. Is that what was done? If yes, I do not think this can be considered an actual calibration. Calibration is generally understood as a search of an optimum in the multi-dimensional parameter space. Here one cannot say that testing six parameter sets is an actual search. If the authors came to these values after a search in the parameter space, the way this search was done should be explained. Besides, it is unclear why these specific parameters were selected for testing (the model probably has many other parameters) and if the modifications apply uniformly over the entire testing zone. It is also unclear how the criteria calculated at each station were aggregated to get an overall performance at the level of the catchment set (e.g. the KGE criterion may generate highly negative values which may bias the calculation of the mean performance), which weight was given to ET and Q respectively during calibration (i.e. how the authors cope with the multi-objective aspect of calibration), and how the criteria were actually used in the calibration process.
Section 2.4: Another problem in the experimental design is that the authors only report performance criteria in calibration. The authors do not test how the model would behave if only half of the available period had been taken for calibration and the other half for validation (as classically done in a split sample test scheme) or if the catchment set had been split in two parts, one for calibration and the other for spatial validation (proxy-basin test). This is essential to evaluate the robustness of the proposed modelling options.
Table 1: I did not understand how the authors selected the PFT classes whose values were modified. Why were only six of the 15 PFT classes modified? Why does only the bias with respect to FLUXCOM appear in the table? Does it mean that only the bias against this ET product was considered during calibration?
Section 3.1: I did not understand where the “true” values of catchment area come from.
L264-265: Time lags of -3 to 5 days remain very large for French catchments. How can such errors arise? Is there a problem with the calculation of this criterion for catchments with a slow response?
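For context, one common way to quantify such a time lag is the shift (in days) that maximizes the correlation between the simulated and observed daily hydrographs; whether this matches the paper's timelag criterion is an assumption, and the sketch below is purely illustrative.

```python
import numpy as np

def best_time_lag(sim, obs, max_lag=30):
    """Lag (days) maximizing correlation between daily sim and obs discharge.

    Positive lag means the simulated hydrograph is late relative to the
    observations. Hypothetical illustration, not the paper's exact criterion.
    """
    def corr_at(lag):
        if lag > 0:
            return np.corrcoef(sim[lag:], obs[:-lag])[0, 1]
        if lag < 0:
            return np.corrcoef(sim[:lag], obs[-lag:])[0, 1]
        return np.corrcoef(sim, obs)[0, 1]
    return max(range(-max_lag, max_lag + 1), key=corr_at)

# Example: a simulated series whose peaks arrive 3 days late (synthetic data).
rng = np.random.default_rng(2)
q_obs = rng.lognormal(1.0, 0.8, size=1000)
q_sim = np.roll(q_obs, 3)
print(best_time_lag(q_sim, q_obs))  # -> 3
```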
Section 3.3: Though results shown here seem to bring some improvement over the standard model version, one strong limit of the model evaluation shown here is that it is very difficult to say if the results are satisfactory or not. Some errors seem still very large after improvement (e.g. time lag in some cases). The use of an external benchmark (e.g. a simpler model) would be very useful to discuss this point.
Section 3.3: Maybe the title of this section should be changed to “Spatial evaluation…”, since it presents this part of the evaluation. Sections 3.2 and 3.3 are basically based on the same results obtained in calibration. The titles should not suggest that one part is the calibration and the other an independent evaluation, to avoid confusion.
Section 3.3.1: Figures 4 a and b suggest that actual evapotranspiration is very difficult to know. The maps show huge differences in some regions (and I think the sentence in L297-298 is wrong). In these conditions, I do not understand how these products can be used simultaneously to constrain the calibration, or at least how the choice can be made between the two maps to select the best model parameters.
Section 3.3.1: Maybe I am wrong, but some regions with a large bias seem to correspond to zones with a lower density of stations (Fig. 2). Is this something observed by the authors? If yes, this may suggest that there are problems in transposing the parameter sets in space.
L307-309: Does the larger average bias with FLUXCOM come from the fact that it was not directly considered in the calibration? I do not know whether the more consistent spatial bias with FLUXCOM is good news. Please comment on this.
Section 3.3.2: I was surprised that the human influences appear to be one of the main reasons mentioned for model failure. Though they probably contribute to the limited performance sometimes, I doubt that the level of influences on these basins can explain the gaps between observed and simulated time series. This is not realistic. Besides there are many stations in the catchment sample that are influenced. Why were they kept as calibration target if the objective is to simulate natural behaviour? The calibration process and catchment selection should be better explained and potentially revised.
L319-320: A time lag of 2 to 4, or even 6, days is huge for these basins. In practice, how can the model be used with such time lags?
L361-369: This probably would be better placed in the method section.
L372-383: I do not understand why human influences are no longer a problem here for evaluating trends, though they were one of the major reasons for model failure a few paragraphs above. To me, this is not really consistent.
Section 4: As explained in the previous comments above, I think that the discussion should better acknowledge the limitations of the modelling framework proposed here. Though it was improved, the model is still limited in some cases (as any model is). Besides, if the authors intend to do an actual model calibration, they should run corresponding validation tests to evaluate model robustness in space and time. Otherwise, the results probably paint an over-optimistic picture of the model's predictive power.
L384: I do not know what “high-resolution” means here. There are models implemented at the km² scale.
Appendices: There are a lot of appendices. I am unsure they should be kept as appendices. They may be better placed in supplementary material.
Appendix I: Appendix I is not an actual demonstration that the modelling problems come from the artificial influences. Some other characteristics which may differ between the two sub-samples may also explain the performance differences.
Minor comments
Introduction: a few subtitles may be useful to highlight the main aspects of the introduction
L102: Please clarify what is “revision 7738”.
Fig. 1: I wonder whether this figure is actually useful (at least in the main text)
Fig. 3: The caption should indicate which distribution percentiles are shown on the box-plots.
Citation: https://doi.org/10.5194/egusphere-2024-445-RC2
- AC2: 'Reply on RC2', Peng Huang, 12 Jul 2024
Authors: Peng Huang, Agnès Ducharne, Lucia Rinchiuso, Jan Polcher, Laure Baratgin, Vladislav Bastrikov, and Eric Sauquet