This work is distributed under the Creative Commons Attribution 4.0 License.
Assessing the potential of complex artificial neural networks for modelling small-scale soil erosion by water
Abstract. Accurately modelling soil erosion by water is essential for developing effective mitigation strategies and preventing on- and off-site damages in agricultural areas. So far, complex artificial neural networks have rarely been applied in small-scale soil erosion modelling, and their potential still remains unclear. This study compares the performance of different neural network architectures for modelling soil erosion by water at a small spatial scale in agricultural cropland. The analysis is based on erosion rate data at a 5 m × 5 m resolution, derived from a 20-year monitoring programme, and covers 458 hectares of cropland across six investigation areas in northern Germany. Nineteen predictor variables related to topography, climate, management and soil properties were selected as inputs to assess their interrelationships with observed erosion patterns and to predict continuous soil erosion rates. A single-layer neural network (SNN), a deep neural network (DNN), and a convolutional neural network (CNN) were applied and evaluated against a random forest (RF) model used as a benchmark. All machine learning models successfully captured spatial patterns of soil erosion, with the CNN consistently outperforming the others across all evaluation metrics. The CNN achieved the lowest root mean squared error (RMSE: 1.05) and mean absolute error (MAE: 0.41), outperforming the RF (RMSE: 1.31, MAE: 0.58) and the SNN (RMSE: 1.48, MAE: 0.63), while the DNN performed similarly to the CNN with a slightly higher RMSE (1.10) and MAE (0.45). The CNN also notably outperformed the other three approaches when evaluated on its capability to accurately predict soil erosion within given classes, achieving a weighted mean F1 score of 0.7. A permutation importance analysis identified the digital elevation model as the most influential predictor variable across all models, contributing between 15 % and 18.3 %, while the USLE C and R factors were also of considerable importance. Overall, these findings highlight the potential of complex neural networks for predicting spatially explicit rates of soil erosion by water.
Status: final response (author comments only)
- RC1: 'Comment on egusphere-2025-3583', Anonymous Referee #1, 29 Aug 2025
- RC2: 'Comment on egusphere-2025-3583', Anonymous Referee #2, 17 Sep 2025
The paper is well-written technically and linguistically and has a pleasant structure with a coherent logic. The soil erosion context is a bit out of my expertise but is understandable. The use of different ML models, ranging from standard approaches such as Random Forest to various deep learning architectures, is relevant to the study. However, the choice of models is not particularly novel. More importantly, the modelling suffers from data leakage issues, and I remain skeptical about the validity of the evaluation strategy. At a minimum, the authors should address the leakage problems and provide a clearer justification and description of their validation procedure. To strengthen the contribution and increase the novelty, the study could also benefit from incorporating more recent ML concepts or approaches. Based on these concerns, I recommend a major revision.
1. Introduction:
The introduction is well written and effectively prepares the reader for the paper. However, the authors largely restrict their literature review to soil erosion modelling. While this is understandable to a certain degree, the claimed novelty of the paper lies in applying “new” methods such as CNNs and multi-layer neural networks. These models, however, are not particularly novel in this context, as CNNs have been applied to soil prediction tasks at least since 2019 (e.g., Padarian et al., 2019). The study would offer stronger novelty by considering more recently proposed methods from the broader ML literature (for instance, the high-quality TabArena benchmark by Erickson et al., 2025, which compares state-of-the-art tabular learners). Several of these modern methods have already been successfully tested in soil science, and established approaches such as CatBoost have been available for even longer. I understand that it is not feasible to cover every recent method, but the current comparison does feel somewhat outdated for a paper that aims to emphasize machine learning aspects.
Padarian, J., Minasny, B., & McBratney, A. B. (2019). Using deep learning for digital soil mapping. Soil, 5(1), 79-89.
Erickson, N., Purucker, L., Tschalzev, A., Holzmüller, D., Desai, P. M., Salinas, D., & Hutter, F. (2025). Tabarena: A living benchmark for machine learning on tabular data. arXiv preprint arXiv:2506.16791.
Minor comment
33: The use of the term AI does not seem appropriate in this context and comes across more as a buzzword. Since the paper exclusively discusses machine learning methods (e.g., L. 67), I suggest using machine learning consistently instead of AI.
2. Methodology:
I have several concerns about the hyperparameters and the validation used in this study. The other comments are of a minor nature:
Hyperparameters:
It remains unclear how the authors tuned their models. From the description (L. 178–179), it appears that hyperparameters were adjusted directly on the validation folds of the 5-fold CV. This approach introduces data leakage, as the same data are effectively used both for model selection and for performance estimation, which reduces the penalty for overfitting. Proper hyperparameter optimisation requires a nested cross-validation scheme, where the data are split into three parts: a training set for fitting the model, an inner validation set for selecting hyperparameters, and an outer test set (or fold) for obtaining a performance estimate.
I looked into the provided code but could not find any script related to hyperparameter tuning. Instead, in the models script I found only fixed parameter settings. This is problematic, as optimal hyperparameters should be determined separately for each training fold within the cross-validation. Without such a procedure, the reported results may not reflect the best achievable model performance and risk being biased by arbitrary parameter choices.
Lastly, the search space for the hyperparameters was not given. This is extremely important for a fair model comparison: if a poorly tuned RF is compared to a well-tuned NN, the comparison is not fair. There are many studies on how this can induce bias in benchmarking (e.g., Nießl et al., 2022).
Nießl, C., Herrmann, M., Wiedemann, C., Casalicchio, G., & Boulesteix, A. L. (2022). Over‐optimism in benchmark studies and the multiplicity of design and analysis options when interpreting their results. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 12(2), e1441.
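As a minimal illustration of the nested scheme and the explicitly declared search space described above (a sketch only, assuming scikit-learn; the search space, predictor dimensions and data are placeholders, not the authors' settings):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score

# Dummy data standing in for the 19 predictors and the erosion rates.
X, y = np.random.rand(200, 19), np.random.rand(200)

inner_cv = KFold(n_splits=5, shuffle=True, random_state=0)  # hyperparameter selection
outer_cv = KFold(n_splits=5, shuffle=True, random_state=1)  # performance estimation

# Explicit, reported search space; tuning happens only on the inner folds.
search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid={"n_estimators": [200, 500], "max_features": [0.3, 0.6, 1.0]},
    scoring="neg_root_mean_squared_error",
    cv=inner_cv,
)

# Each outer fold refits the entire search on its training part only,
# so the outer score is not contaminated by the tuning step.
outer_scores = cross_val_score(search, X, y, cv=outer_cv,
                               scoring="neg_root_mean_squared_error")
print(f"nested-CV RMSE: {-outer_scores.mean():.3f}")
```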
Validation:
Do I understand correctly that this figure shows the “ground-truth” soil erosion dataset, and that these data are available in raster format, i.e., the true (or approximate true) erosion values are known across the entire study area? If so, I find this somewhat questionable, since such complete “ground truth” presumably relies on interpolation or modelling itself, and may therefore not represent true independent measurements. More importantly, it is unclear why additional modelling is applied, given that each cross-validation repetition already uses 80% of the study area for training. In digital soil mapping, modelling is typically motivated by sparse point observations, where the objective is to generate high-resolution maps from limited data. In contrast, this study seems to assume ground-truth values for every raster cell, a setup that almost inevitably leads to overly optimistic performance estimates with poor generalization value. Would a strategy such as “leave-one-validation-site-out” not provide a more realistic evaluation of model performance? I may be missing a domain-specific aspect of soil erosion mapping, but from a classical digital soil mapping perspective this design appears problematic.
For example, Gholami et al. (2021), which is also cited in this paper, used point data and specified dedicated validation points. I am missing something like this in this study. To me, such a design makes much more sense, but I do not see it in Fig. 1.
Gholami, V., Sahour, H., & Amri, M. A. H. (2021). Soil erosion modeling using erosion pins and artificial neural networks. Catena, 196, 104902.
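A “leave-one-validation-site-out” evaluation as suggested above could, for example, be set up with scikit-learn's LeaveOneGroupOut (an illustrative sketch only; the area labels and data are hypothetical placeholders):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import LeaveOneGroupOut, cross_val_score

# Placeholders: each raster cell carries a label for its investigation area.
X = np.random.rand(600, 19)
y = np.random.rand(600)
area = np.repeat(np.arange(6), 100)  # hypothetical assignment to six areas

logo = LeaveOneGroupOut()            # each split holds out one whole area
scores = cross_val_score(RandomForestRegressor(random_state=0),
                         X, y, groups=area, cv=logo,
                         scoring="neg_mean_absolute_error")
print(-scores)                       # one MAE per held-out area
```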
Small comments:
104: I may be wrong, but the overall study areas cover only a few hundred ha, whereas the grid of the original R-factor was 1 km × 1 km. Even if resampled (how?), is this not too coarse for the study area context? A reference describing this procedure would be useful.
123: It would be more precise to write “a random subset of the features [or variables]”. Using a subset of the data (i.e., of the training data) is also possible as a hyperparameter, but it is not by definition a classical parameter of Random Forest.
2.3.4: This is not stated explicitly in the section, only implied. Did the authors use a “2D” CNN, and if so, with what Y × Y raster-cell input window?
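For illustration, a minimal sketch of what such a patch-based 2D CNN regressor could look like (PyTorch assumed; the 5 × 5 patch size and 19 input channels are illustrative guesses, not the authors' documented configuration):

```python
import torch
import torch.nn as nn

class PatchCNN(nn.Module):
    """Toy 2D CNN mapping a Y x Y patch of predictor rasters to one erosion rate."""
    def __init__(self, n_channels=19, patch=5):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(n_channels, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
        )
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * patch * patch, 1),  # single continuous output
        )

    def forward(self, x):                      # x: (batch, channels, patch, patch)
        return self.head(self.features(x))

model = PatchCNN()
dummy = torch.rand(8, 19, 5, 5)                # batch of hypothetical 5 x 5 patches
print(model(dummy).shape)                      # torch.Size([8, 1])
```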
3. Discussion & Results:
I do not have many comments on these sections, as they are well written. However, given my concerns regarding the validity of the results, I feel that any interpretation at this stage would be premature until these issues are addressed.
Small comments:
Figure 4: Why do the ECDF curves of the models appear so smooth? I would expect them, similar to the mapped erosion rate, to be step functions. This suggests that the ECDFs may have been constructed differently for the models and for the mapped erosion rate. Could the authors please clarify how these curves were generated?
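For reference, this is how a step ECDF is usually constructed directly from the raw values (a sketch with numpy/matplotlib and dummy data; smooth curves would only arise from a fitted distribution or interpolation):

```python
import numpy as np
import matplotlib.pyplot as plt

rates = np.random.lognormal(mean=-1.0, sigma=1.5, size=500)  # dummy erosion rates
x = np.sort(rates)
f = np.arange(1, x.size + 1) / x.size                        # empirical CDF values

plt.step(x, f, where="post")   # true step function: one jump per observation
plt.xlabel("erosion rate")
plt.ylabel("ECDF")
plt.show()
```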
Figure 5: The unit is missing. It is not simply [%], but rather the increase in MSE in %. While this may be clear from the context, the figure should explicitly state the correct unit.
Citation: https://doi.org/10.5194/egusphere-2025-3583-RC2
The manuscript explores whether machine learning approaches could improve our ability to predict soil erosion. This field is only just emerging, and I appreciate the authors' efforts. The manuscript reports an interesting piece of work that needs some improvement, but in general it is clearly worth publishing.
However, the conclusions are based on unreplicated results and thus speculative. Setting up a replicated experiment would be relatively easy and fast (see my last paragraph). Furthermore, the authors justify their work with clearly wrong statements, and I wonder whether there is no better justification for their work. The enthusiasm about something relatively new cannot replace logically sound arguments:
Please be accurate in your arguments. They are not random quantities.
Details
L 77: Which models?
Chapter Data collection: In general, this chapter does not give enough details about the sources of data, the measurement methods, their range, their resolution, and their quality. The lack of reference to the sources also makes it impossible for the reader to get an idea about these relevant aspects.
L 95: What is the accuracy of the data? Were there repeated surveys by independent surveyors to estimate the accuracy? How did you know there had been an erosion event, given that high-intensity rain cells have a spatial extent of only about 1 km² (see Lochbihler et al. 2017, Geophysical Research Letters)?
L 97: What is sheet-to-linear erosion? Isn't this rill erosion, which is already in the first group?
L 101: Nineteen variables are pretty limited. I would not criticise this, but in L 38, you criticised a limited number of variables. Your arguments do not match. (BTW: The (R)USLE uses more than 19 variables to calculate the final six factors; hence, your data set is more limited).
L 110: Better call it the Pearson correlation coefficient because Pearson and even the regression have several coefficients. In the following, r is mostly in italics. Please be consistent.
Table 1: “DEM” is definitely the wrong name for this variable, because a DEM is the entirety of all elevation data. Do you mean altitude?
More details about the resolution and the quality of your DEM have to be given (see the general remark regarding the data chapter) because many of your following variables depend strongly on these two parameters.
How was slope length defined, in the sense of the USLE or in a geomorphological sense? Was it defined for the field or for the raster cell? I guess you did not use slope length, which would be one value for the entire slope, but you may have used the upslope length of each raster cell. I do not like guessing what you did (a similar question could be raised for almost all variables).
Flow accumulation is described as the total accumulated runoff. This would require runoff modelling because runoff will depend on soil, crops, heterogeneity of rain and other variables. I guess you mean the upslope drainage area. More explanation required!
Wetness index: What is a 'modified catchment area calculation'?
Machining direction: This will differ on different field parts because of the headland and complex topography. How was it defined? It may also vary over time.
Regarding the R and LS factors, see below. How was the C factor determined? Did you consider individual rain events and the corresponding field states, or did you use some more generalised C factor? Which degree of generalisation did you use? And on which data is the K factor based?
The table must be complemented with statistical metrics like mean, SD, min, and max, which give an idea of the range the data covers. This is essential for the interpretation of Fig. 5.
L 178: Conventional cross-validation is inappropriate in your case because your raster cells are highly autocorrelated. Hence, the left-out data are not an independent data set. I suggest using a seven-fold cross-validation by leaving out one of your study areas at a time.
L 185: I cannot see the five pairs in Table 1. Which pairs do you mean?
L 187: The correlation between R and altitude is strange. I am not aware of any meteorological process that would influence rain within your altitudinal and spatial range. I guess the correlation is an artefact of an inappropriate resampling procedure. Unfortunately, resampling was not described.
Fig. 4: The x-axis appears to have a log scale. Then, zero would not be possible, although it is shown (likely it is 0.001) and although zero values occur in the data set. I recommend using a square-root scale, which allows for a true zero and does not compress the data in the relevant range of 0.1 to 50 t because of the inflation of the irrelevant range between 0.001 and 0.1.
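A square-root axis of this kind can be produced, for example, with matplotlib's function scale (a sketch only; matplotlib >= 3.1 assumed, dummy data):

```python
import numpy as np
import matplotlib.pyplot as plt

rates = np.random.lognormal(mean=-1.0, sigma=1.5, size=500)  # dummy data incl. small values
x = np.sort(rates)
f = np.arange(1, x.size + 1) / x.size

fig, ax = plt.subplots()
ax.step(x, f, where="post")
ax.set_xscale("function", functions=(np.sqrt, np.square))    # forward, inverse transform
ax.set_xlim(0, 50)                                           # a true zero is representable
plt.show()
```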
This also leads to the question: Were there no negative values in your data set (colluviation)? Including negative values would be a clear advantage compared to the USLE. In any case, the reason for the lack of negative values has to be explained.
L 224: The high importance of altitude shows that the results of your approach lack transferability to other areas. I can easily imagine a similar erosion situation (similar topography, similar soils, similar land use, similar rain), but a few hundred metres higher (or even a few thousand metres higher if we think of a high valley in the Andes). The large importance of altitude would then cause very strange predictions. Matching the training and the application situation is an indispensable prerequisite for an approach that does not restrict the input data to meaningful and universally valid variables (especially if you request unlimited variables). It is worth discussing this constraint, which is especially important for the black box of neural networks. Whether the network uses the variables in a way that is meaningful in view of the erosion process is unknown and irrelevant for the result; it is, however, highly relevant for the transferability. While it is relatively easy to find out whether, for instance, the K factor equation is applicable in a specific case (e.g., peatland erosion), it is difficult to find out in which cases a neural network result will fail when transferred to a different situation.
Fig. 5: The low importance of LS is strange, particularly because of the higher importance of flow accumulation and slope. Essentially, LS is the product of flow accumulation and slope gradient and thus must be of higher importance. Could LS be wrongly calculated by assuming straight slopes, although you have converging and diverging slopes? Furthermore, did you use the field's LS factor or the pixel's LS factor, which is entirely different information? Your M&M section requires clearly more information. Otherwise, the results cannot be understood.
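To illustrate the point that LS is essentially a function of flow accumulation and slope, one commonly used grid-cell formulation (after Moore and Burch) is sketched below; this is only an illustration, not necessarily the formulation used by the authors, and the 5 m cell size is simply taken from the stated raster resolution:

```python
import numpy as np

def ls_factor(flow_acc_cells, slope_deg, cell_size=5.0):
    """Per-pixel LS from the number of upslope contributing cells and the slope in degrees."""
    spec_catchment = flow_acc_cells * cell_size               # specific catchment area A_s [m]
    beta = np.deg2rad(slope_deg)
    return (spec_catchment / 22.13) ** 0.4 * (np.sin(beta) / 0.0896) ** 1.3

# Example: 40 contributing cells on an 8-degree slope.
print(ls_factor(flow_acc_cells=40.0, slope_deg=8.0))
```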
CNN was the best method in your case. Does this have any relevance? Will CNN always or at least often be the best? We don't know because this is an unreplicated experiment. Usually, we regard unreplicated results as meaningless. I wonder whether you could improve the validity of your analysis. For instance, you could run your analysis separately for each of your seven study areas. Is CNN the best in all seven cases? Is the ranking of variables similar in all seven cases (which would allow us to say something about transferability at least within your region)? You could run your analysis ten times with a subset of 10 randomly selected variables from your data set. Is CNN the best method in all cases? Presently, we do not know, and hence your conclusion that CNN outperforms other methods remains speculative.