This work is distributed under the Creative Commons Attribution 4.0 License.
A data-driven framework for assessing climatic impact-drivers in the context of food security
Abstract. Understanding how physical climate-related hazards affect food production requires transforming climate data into information that is relevant for regional risk assessment. Data-driven methods can bridge this gap; however, further development is needed to create interpretable models, particularly for regions with limited data availability. The main objective of this article was to evaluate the impact of climate risks on food security. We adopted the climatic impact-driver (CID) approach proposed by Working Group I (WGI) in the Sixth Assessment Report of the Intergovernmental Panel on Climate Change (IPCC). In this work, we used the CID framework to select the indices most relevant to crop yield losses and to identify the index thresholds beyond which the probability of impact increases. We then examined the impact of two CID types (heat and cold, and wet and dry), represented by indices of climate extremes, on different crop yield datasets, focusing on maize and soybeans in the central agro-producing municipalities of Brazil. We used a random forest model in a bootstrapping experiment to select the most relevant climate indices, and then applied Shapley Additive Explanations (SHAP) to an XGBoost model to identify the index thresholds at which impacts occur. We found that mean precipitation is a highly relevant CID and that there is a window in which crops are particularly vulnerable to precipitation deficit. For soybeans, in many regions of Brazil, precipitation below 80 mm/month in December, January, and February, which marks the end of the growing season in those regions, represents an increasing risk of crop yield losses. For maize, a similar pattern appears for precipitation below 100 mm/month in April and May. Indices of extremes are relevant for representing crop yield variability; nevertheless, including climate means remains highly relevant and is recommended when studying the impact of climate risk on agriculture. Our findings contribute to a growing body of knowledge critical for informed decision-making, policy development, and adaptive strategies in response to climate change and its impact on agriculture.
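For readers who want to see the shape of such a workflow, a minimal Python sketch of a random-forest bootstrap selection followed by SHAP on an XGBoost model is given below. It is not the authors' code; the file name, column names, and hyperparameters are hypothetical placeholders.

```python
# Minimal sketch (not the authors' code) of the workflow described in the
# abstract: bootstrap a random forest to rank climate indices, then explain
# an XGBoost model with TreeSHAP. File and column names are hypothetical.
import numpy as np
import pandas as pd
import shap
import xgboost as xgb
from sklearn.ensemble import RandomForestRegressor

df = pd.read_csv("climate_indices_and_yield.csv")       # hypothetical input table
X = df.drop(columns=["yield_anomaly", "municipality_code", "harvest_year"])
y = df["yield_anomaly"]                                  # detrended yield anomaly

# Bootstrap the forest and accumulate impurity-based importances.
rng = np.random.default_rng(42)
importances = []
for _ in range(100):
    idx = rng.integers(0, len(X), len(X))                # resample rows with replacement
    rf = RandomForestRegressor(n_estimators=200, n_jobs=-1, random_state=0)
    rf.fit(X.iloc[idx], y.iloc[idx])
    importances.append(rf.feature_importances_)
mean_imp = pd.Series(np.mean(importances, axis=0), index=X.columns)
top_features = mean_imp.nlargest(10).index.tolist()      # keep the ten strongest indices

# Fit XGBoost on the selected indices and inspect thresholds with TreeSHAP.
model = xgb.XGBRegressor(n_estimators=300, max_depth=4, learning_rate=0.05)
model.fit(X[top_features], y)
shap_values = shap.TreeExplainer(model).shap_values(X[top_features])
shap.summary_plot(shap_values, X[top_features])                       # direction of effects
shap.dependence_plot(top_features[0], shap_values, X[top_features])   # threshold-like behaviour
```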
Status: final response (author comments only)
RC1: 'Comment on egusphere-2023-3002', Anonymous Referee #1, 27 Feb 2024
General comments:
The study uses interpretable machine learning to identify the most important climate impact drivers for predicting maize and soybean yield variability in Brazilian states. Overall, the manuscript is quite well-written and has clear descriptions of the datasets used which are very helpful for the reader. They discuss in some detail the advantages and disadvantages of the use of different yield datasets in the region, which is crucial for the interpretation of the results of this type of analysis, and make the effort to show a comparison of the datasets and where they agree and disagree. They also use two specific examples of droughts in Brazil as case studies to examine the interpretations, which is interesting and helps to verify their approach.
The topic is very important, and novel methods such as this have a clear use case in identifying the most relevant periods at which different CIDs impact yields. However, I have some concerns about the methodology. The description of the methodology is not sufficiently thorough; the authors may already have addressed these concerns, but this should be clarified.
Random forests are often used for this type of study and are a good choice when working with tabular data such as this. However, care must be taken when using any machine learning method not to allow the model to overfit to dependencies or correlation between features. The training and testing method used was not explained clearly, except in Figure 1, which only states that 20% of the data was used to test the models, but not how that 20% was selected. Given that models were trained on a state level, multiple municipalities within each state would have highly correlated climate and yields. Were the datapoints split in time and/or space to account for this, or sampled randomly? If they were sampled randomly, this can lead to misleading estimations of model performance and the interpretations are less likely to represent the physical mechanisms that are intended to be studied. Particularly relevant - if soil is used as a predictive feature, which does not vary in time in the dataset used (I believe), the model can easily spatially overfit.
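To illustrate the kind of split the referee suggests, a minimal sketch using scikit-learn's GroupShuffleSplit is shown below; it holds out whole municipalities (or, analogously, whole years). The objects `X`, `y`, and `df` and the column names are the hypothetical ones from the sketch under the abstract.

```python
# Illustrative spatial hold-out: every municipality ends up entirely in either
# the training or the test split, so spatially correlated neighbours cannot
# leak across the boundary. Column names are hypothetical.
from sklearn.model_selection import GroupShuffleSplit

gss = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, test_idx = next(gss.split(X, y, groups=df["municipality_code"]))
X_train, X_test = X.iloc[train_idx], X.iloc[test_idx]
y_train, y_test = y.iloc[train_idx], y.iloc[test_idx]

# A temporal hold-out works the same way, grouping by harvest year instead:
# next(gss.split(X, y, groups=df["harvest_year"]))
```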
Overall, I find the manuscript to be quite well-written, and the thorough analysis of the different datasets used and how they impact the results is interesting and excellent scientific practice. However, I think that some small changes to the methodology (most importantly, selecting a test set considering the spatiotemporal autocorrelation and estimating SHAP values using this test set, ideally using a different feature selection method such as SFS) and a better explanation of the steps involved in generating the results discussed could very much improve the paper. As the paper aims to present a framework to enhance the interpretability of ML methods for crop yield loss prediction, it is important that the framework is robust and can deal with common issues for this type of problem, such as overfitting to spatiotemporal data.
Finally, given that the title of the paper and stated goal is to present a framework that can be used by other researchers, the code used should be published and made openly available, but this is not currently stated in the manuscript.
Specific comments:
At what stage was RFE used to select features, and how was this conducted? How many features were selected? I also question the use of RFE in cases where models can overfit (e.g. when spatiotemporal data is used), as the features that the model finds most important are then less likely to be physically meaningful. Using, for example, sequential feature selection with a spatial or temporal cross-validation splitting method would be more likely to return relevant drivers, and I would recommend that the authors try this if possible.
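A minimal sketch of that alternative, forward sequential feature selection scored with grouped (spatial) cross-validation, continuing the hypothetical `X`, `y`, and `df` from the sketches above:

```python
# Forward sequential feature selection with grouped cross-validation folds,
# so that candidate features are scored on municipalities not seen in training.
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.model_selection import GroupKFold

splits = list(GroupKFold(n_splits=5).split(X, y, groups=df["municipality_code"]))
sfs = SequentialFeatureSelector(
    RandomForestRegressor(n_estimators=200, n_jobs=-1, random_state=0),
    n_features_to_select=10,
    direction="forward",
    scoring="r2",
    cv=splits,                      # precomputed grouped folds
)
sfs.fit(X, y)
selected = X.columns[sfs.get_support()].tolist()
```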
In Figure 4, it would be helpful to have descriptions of what features were included in the different scenarios - in particular, I could not understand what ‘Complex’ meant.
In Figures 5 and 6, is this after RFE has been used to select only 10 features? I was confused by the fact that, for maize, only February features are shown, but later in the text it states that April and May precipitation was important for some regions.
I would strongly advise against removing correlated variables before doing the feature selection. You can expect that the highly correlated variables will not both be selected, and it is another opportunity for data leakage to enter.
I think it is very useful to compare the importances between the different states and datasets, as this can help to find robust insights and identify potential problems with the datasets used. It would be useful to see uncertainty quantification here as well, as given that similar model performance can come from many combinations of features (as shown in Figure 4), one would expect that there is significant uncertainty in the interpretations as well. I would also consider using an additional feature importance metric (permutation feature importance on held-out test set?) for comparison, but this might be out of scope.
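As an illustration of the permutation-importance suggestion, a short sketch on the held-out split from the earlier sketch, with a forest refit on the training portion only (all names hypothetical):

```python
# Permutation importance evaluated on the held-out test split, as a
# complement to the forest's internal impurity-based importance.
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

rf = RandomForestRegressor(n_estimators=200, n_jobs=-1, random_state=0)
rf.fit(X_train, y_train)
result = permutation_importance(rf, X_test, y_test, n_repeats=30,
                                random_state=0, n_jobs=-1)
perm_imp = pd.Series(result.importances_mean, index=X_test.columns)
print(perm_imp.sort_values(ascending=False).head(10))
```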
I also find it unusual to fit random forest models and then to use a more complex model (XGBoost) to explain them via SHAP. Normally, SHAP is used directly on the trained model to be interpreted, and if a second model was used it would normally be a simpler model. Why not use XGBoost for the initial part of the analysis instead of adding this complexity of using a second model to explain the first?
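For comparison, TreeSHAP can be applied directly to the fitted random forest, without an intermediate XGBoost surrogate; a minimal sketch using the hypothetical `rf` and `X_test` from above:

```python
# SHAP values computed directly on the trained random forest.
import shap

explainer = shap.TreeExplainer(rf)
shap_values = explainer.shap_values(X_test)
shap.summary_plot(shap_values, X_test)
```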
Partial dependence plots do not need SHAP values; they can be calculated by simply varying individual features and estimating the output. It might be interesting to compare these against the plots obtained from SHAP (but again, maybe out of scope). It would at least be useful to discuss/justify in the text why the partial dependence plots obtained from SHAP are more useful (which is very plausible).
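For reference, a classical partial dependence plot computed directly from the model, with no SHAP values involved (the feature names are hypothetical):

```python
# Classical partial dependence: vary one feature over a grid and average the
# model's predictions; no SHAP values are required.
import matplotlib.pyplot as plt
from sklearn.inspection import PartialDependenceDisplay

PartialDependenceDisplay.from_estimator(rf, X_test, features=["prec_dec", "prec_jan"])
plt.show()
```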
SHAP values are also sensitive to the data used to calculate them, and I would again recommend to use test sets for this that are split with consideration to the spatial and temporal correlations.
Interpreting the results of this type of study can be difficult, as in general, any feature used for training is one that could be a causal driver. This means that it is hard to figure out if the results are meaningful or if the model has learned some spurious correlations. The fact that only February features are shown as important for maize suggests, to me, that something strange is going on, as the authors state that this is peak planting date and in some regions, planting is not finished until the beginning of April. It seems more likely that heat, for example, would be more important during the reproductive period. Using the different test sets as I mentioned before might help with this, as well as using permutation feature importance instead of the internal RF variable importance measure.
Why remove heteroskedasticity? Could this be justified more in the text? As we expect more climate variability with climate change and therefore more yield variability, it isn’t obvious that this should be corrected for.
Lines 171-172 describe a second analysis using Gaussian copulas, but I could not find this further described or any results from this in the rest of the manuscript?
Technical corrections:
I could not understand the paragraph on interpretability (lines 53 to 56).
Please state briefly that the crop yields were detrended in the main text (the further explanation in the Supplementary is very helpful, but there is no mention of the fact that the yields are detrended in the main manuscript which is very important to interpret the results).
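For context, a minimal sketch of a per-municipality linear detrend of yields (this is only illustrative; the authors' actual procedure is described in the Supplement, and the column names are hypothetical):

```python
# Remove a linear trend from each municipality's yield series; models are then
# trained on the resulting anomaly rather than the raw yield.
import numpy as np
import pandas as pd

yields = pd.read_csv("yields.csv")   # hypothetical: municipality_code, harvest_year, yield

def detrend(g: pd.DataFrame) -> pd.DataFrame:
    t = g["harvest_year"].to_numpy(dtype=float)
    slope, intercept = np.polyfit(t, g["yield"].to_numpy(), 1)
    g = g.copy()
    g["yield_anomaly"] = g["yield"] - (slope * t + intercept)
    return g

yields = yields.groupby("municipality_code", group_keys=False).apply(detrend)
```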
Some references on selecting test sets appropriately when using ML with spatiotemporal data:
Meyer, H., Reudenbach, C., Wöllauer, S. & Nauss, T. Importance of spatial predictor variable selection in machine learning applications – Moving from data reproduction to spatial prediction. Ecological Modelling 411, 108815 (2019).
Sweet, L., Müller, C., Anand, M. & Zscheischler, J. Cross-Validation Strategy Impacts the Performance and Interpretation of Machine Learning Models. Artificial Intelligence for the Earth Systems 2, (2023).
Roberts, D. R. et al. Cross-validation strategies for data with temporal, spatial, hierarchical, or phylogenetic structure. Ecography 40, 913–929 (2017).
Citation: https://doi.org/10.5194/egusphere-2023-3002-RC1
- AC2: 'Reply on RC1', Marcos Roberto Benso, 30 May 2024
RC2: 'Comment on egusphere-2023-3002', Anonymous Referee #2, 15 Apr 2024
Briefly about myself, to help interpret my review: I am an agricultural economics postdoctoral fellow working in the interdisciplinary field of agricultural trade and food security, applying econometrics and mostly interpretable machine learning. I have no in-depth background in climate change.
Summary:
The paper uses the Climatic Impact-Driver (CID) approach to evaluate the impact of climate risks on food security, focusing on maize and soybeans in Brazil. The authors use data-driven methods and machine learning models to identify the most relevant climate indices and the thresholds at which the impact probability increases. They found that mean precipitation is a key CID, with specific thresholds indicating increased risk of crop yield losses. The study emphasizes the relevance of both extreme and mean climate indices in assessing climate risk to agriculture, contributing to decision-making and policy development in response to climate change.
Introduction
I find the introduction comprehensive and insightful, providing a clear overview of the challenges associated with predicting crop yield variability in response to climate extremes. The emphasis on the importance of considering multiple weather variables and employing models that incorporate sector-specific vulnerability and exposure adds depth to the discussion, highlighting the complexity of agricultural risk assessment. The introduction's exploration of machine learning algorithms, particularly decision tree algorithms like random forest models, offers innovative possibilities for improving predictive accuracy despite data availability constraints.
Furthermore, I like the idea of using model interpretability techniques in the modeling framework to address the limitations of existing approaches. The integration of the CID framework promises a solid foundation for contextualizing climate in decision-making, aligning with the need for localized solutions in agricultural systems. Overall, the introduction effectively sets the stage for a research endeavor that holds significant potential for informing critical decisions and strategies aimed at enhancing food production resilience in the face of climate variability.
Methodology
This methodology section presents a comprehensive approach to investigate the impacts of climate extremes on soybean and maize crop yields in Brazil, which is of paramount importance for agricultural research and policy-making. The modeling framework outlined, with its emphasis on data filtering, variable selection, and threshold determination, offers a systematic way to analyze the complex relationships between climatic variables and crop yields. By integrating different interpretable machine learning techniques, the study ensures both predictive accuracy and interpretability, crucial for gaining the trust of those who will later use the proposed modelling framework.
The delineation of the study area and selection criteria for municipalities provide a clear understanding of the geographical scope and rationale behind the dataset selection. Moreover, the detailed description of data collection and processing, including the handling of outliers and missing values, enhances the reliability and reproducibility of the study's findings. Additionally, the inclusion of soil data enriches the analysis by considering the influence of soil properties on crop productivity.
However, while the methodology appears robust and well-structured, some sections could benefit from further clarification. As a non-expert in climate change, I could benefit from an explanation of the application of climate indices and their relevance to crop yield analysis. Providing more insight into the selection process of specific indices and their interpretation within the context of agricultural impacts would enhance understanding for a wider audience. Overall, this methodology sets a solid foundation for investigating the impacts of climate extremes on food production, contributing valuable insights to the field of agricultural economics.
Results and Discussion
The chapter is clear and summarizes the article very well.
Citation: https://doi.org/10.5194/egusphere-2023-3002-RC2
- AC1: 'Reply on RC2', Marcos Roberto Benso, 27 May 2024