This work is distributed under the Creative Commons Attribution 4.0 License.
Object-based ensemble estimation of snow depth and snow water equivalent over multiple months in Sodankylä, Finland
Abstract. Snowpack characteristics such as snow depth and snow water equivalent (SWE) are widely studied in regions prone to heavy snowfall and long winters. These features are measured in the field via manual or automated observations and over larger spatial scales with stand-alone remote sensing methods. However, individually these methods may struggle to accurately assess snow depth and SWE at local spatial scales of several square kilometers. One way to leverage the benefits of each individual dataset is to link field-based observations with high-resolution remote sensing imagery and then employ machine learning techniques to estimate snow depth and SWE across a broader geographic region. Here, we combined repeat field measurements of snow depth and SWE from six instances between December 2022 and April 2023 in Sodankylä, Finland, with Light Detection and Ranging (LiDAR) and WorldView-2 (WV-2) data to estimate snow depth, SWE, and snow density over a 10 km² local-scale study area. This was achieved with an object-based machine learning ensemble approach: the more numerous snow depth field data were upscaled first, and the estimated local-scale snow depth was then used to aid in estimating SWE over the study area. Snow density was then calculated from the snow depth and SWE estimates. Snow depth peaked in March, SWE shortly after in early April, and snow density at the end of April. The ensemble-based approach showed encouraging success in upscaling both snow depth and SWE. Associations were also identified with carbon- and mineral-based forest surface soils, as well as with dry and wet peatbogs.
Status: final response (author comments only)
RC1: 'Comment on egusphere-2024-3936', Anonymous Referee #1, 26 Mar 2025
Object-based ensemble estimation of snow depth and snow water equivalent over multiple months in Sodankylä, Finland
egusphere-2024-3936
March 2025
General Comments:
Brodylo et al.’s manuscript is well-written, structured clearly, and supported by strong graphical presentation, providing a straightforward exploration into snow depth and snow water equivalent (SWE) estimation using an ensemble machine learning approach. The integration of LiDAR, remote sensing imagery, and in-situ observations is logical and aligns well with the type of studies frequently published in this journal. However, I have several significant concerns regarding the novelty of the approach, methodological clarity, and the limited sample size—particularly for SWE estimation—that need to be thoroughly addressed before the paper can be considered for publication. I have outlined these major concerns, along with specific suggestions for improvement, in detail below.
Major Comments:
1. Currently, the paper's primary novel contributions are unclear to me. While the presented approach effectively integrates established practices (ensemble machine learning methods, LiDAR-based snow depth estimation), the methodological novelty seems incremental and primarily focused on application in the specific context of Sodankylä, Finland. Intuitively, an ensemble approach should outperform individual techniques; however, given the limited sample size—especially with SWE data (only around a dozen observations)—it becomes challenging to conclusively demonstrate superiority over simpler, more traditional methods such as multiple linear regression. Indeed, as highlighted in Table 3, some machine learning models significantly underperform in certain months, likely due to this limited dataset. Thus, at present, the main takeaways and broader scientific significance are somewhat ambiguous. I encourage the authors to clearly articulate the core contributions of their approach, considering the constraints posed by dataset size. If a stronger case for novelty can be made, particularly in comparison to simpler or previously established methods, this would greatly strengthen the manuscript, as I am currently unsure of the main takeaways.

2. Further clarity is needed regarding the training and validation processes for the machine learning models. The authors briefly mention using a "k-fold" validation but do not clearly specify how the data were partitioned into training, validation, and test sets at each step. Important details are missing, such as whether splits were random or sequential—random splits could inadvertently introduce spatial autocorrelation issues. Additionally, specifics on the machine learning implementations are essential. For instance, how deep were the random forest trees allowed to grow? What structure was adopted for training the multi-layer perceptron—including the number of hidden layers, neurons per layer, activation functions, epochs, and optimization methods? Providing visualizations of training and validation curves for the MLP models would also help clarify the model training and generalization processes. These details are crucial for reproducibility and for fully understanding the robustness of the results (a short illustrative sketch of the level of detail I have in mind follows these major comments).
3. Given the inherently spatial nature of snow depth and SWE, I'm curious if the authors considered employing machine learning methods specifically designed to leverage spatial dependencies in data. The current choice of models—MLR, RF, and MLP—generally treats each data point independently, potentially losing valuable spatial context unless explicitly provided as an input feature. Models that explicitly capture spatial information (e.g., convolutional neural networks like U-Nets, or vision transformer approaches) could better represent the spatial variability across diverse land types. Exploring spatially-aware methods, despite your current dataset limitations, could significantly increase the novelty and impact of your study.
4. Finally, I also feel that this paper would really benefit from a more comprehensive comparison to existing approaches in the literature. Although your method is LiDAR-derived, related studies by Bair et al. (2018), King et al. (2020), Liljestrand et al. (2024), Shao et al. (2022), and Vafakhah et al. (2022) (amongst others) have utilized similar ML methodologies (RF and neural-network-based architectures) to predict regional variations of SWE. A clearer positioning of your work in relation to these papers would not only help justify the novelty of your method but also allow readers to better appreciate your contributions relative to the current state-of-the-art approaches. Such contextualization could also probably help address some of the concerns I raise in Comment 1 regarding methodological novelty.
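To make concrete the level of methodological detail Comment 2 asks for, here is a minimal, purely illustrative sketch (scikit-learn; the synthetic data and every hyperparameter value are placeholders I chose, not the authors' actual settings):

```python
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.ensemble import RandomForestRegressor
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Placeholder data standing in for object-level predictors and field snow depth
rng = np.random.default_rng(0)
X = rng.normal(size=(88, 6))                      # 88 snow depth samples, 6 features
y = 0.6 + 0.2 * X[:, 0] + rng.normal(scale=0.1, size=88)

# Explicit split strategy: shuffled 10-fold CV with a fixed seed
cv = KFold(n_splits=10, shuffle=True, random_state=42)

# Hyperparameters stated explicitly so the setup is reproducible
rf = RandomForestRegressor(n_estimators=500, max_depth=10,
                           min_samples_leaf=2, random_state=42)
mlp = make_pipeline(StandardScaler(),
                    MLPRegressor(hidden_layer_sizes=(32, 16), activation="relu",
                                 solver="adam", max_iter=2000, random_state=42))

for name, model in [("RF", rf), ("MLP", mlp)]:
    scores = cross_val_score(model, X, y, cv=cv, scoring="r2")
    print(f"{name}: mean CV R2 = {scores.mean():.2f} (+/- {scores.std():.2f})")
```

Reporting the setup at roughly this level of specificity (split strategy and seeds, tree depth, network architecture, epochs/iterations, optimizer) would let readers reproduce the results and judge for themselves the risk posed by random splits and spatial autocorrelation.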
Minor Comments:
- Lines 89: With all the different datasets being used here, I wonder if a summary table listing their names, variables, resolution, and source would help better situate readers?
- Lines 162-163: It wasn’t totally clear to me what this RF classification scheme was referring to here? Why is this step necessary?
- Section 3.1: I also don’t fully understand this image segmentation step and how it is “utilized as the spatial unit for image assessment”. Why does this need to be done for this project, and how are the resulting segments used in the models afterwards?
- Lines 189-192: I think this section is important, and I would add a little more detail describing each of these models and how they've been used in other studies, as they really underpin your main results. For instance, I'd mention bootstrapping and aggregation in the RF, and I would rework your description of the ANN, as the linkage to the human nervous system is somewhat spurious and does not clearly describe how it actually works (i.e., a feedforward directed acyclic graph of artificial neurons with nonlinear activation functions).
- Lines 203-204: Do you know why the SVM performance was so poor? I'm wondering if the sample was simply too small for this approach? This goes back to my earlier major point that the same issue with the limited SWE data is also likely impacting the other models. However, it does feel a bit odd to me to simply not include a model in some cases due to poor performance when using an ensemble approach.
- Eqs. 1/2/3: This is personal preference, but these are all very common metrics that don't need to be explicitly defined in this work.
- Lines 258-260: From a physical perspective, what do you think is causing this large swing in performance for the ANN over these months? Is there something about the onset of snow in December that makes this an especially challenging task for the NN?
- Table 1: For this table and the others after it, I am wondering if this would be more interpretable as a bar graph? Comparing so many numbers in a table like this can be a bit challenging.
- Table 2: Similar to my previous table comment
- Figure 5: The red->green color scheme for snow depth can be challenging for color-blind individuals to view, and I would recommend moving to something more accessible (a brief colormap sketch follows these minor comments).
- Lines 318-319: Was the SVM left out because it had bad performance everywhere for SWE? As you state, the RF was also inconsistent for SWE prediction, but was still included in this part of the analysis
- Lines 344-362: I appreciate the detail the authors put into comparing SWE over various land cover types, however this section (and other similar paragraphs) are a bit challenging to parse in their current form. Currently, you list many statistics in a row, and it isn’t fully clear to me what I am to take from all of these stats? I wonder if you could restructure these paragraphs to highlight the most important findings and relate those to what the predictive accuracy means for each land cover type?
- Lines 428-429: When referring to the EA here, it sounds as if it is its own technique, but really it is just a combination of the MLR/RF/MLP. The enhanced performance of the EA arises because the individual models have high variability and biases that largely cancel out, resulting in a more stable prediction. So is this section speaking primarily to the high variability of the individual models?
- Line 430: I would reword this sentence “EA consistently produced the best or second best metrics, and generally produced the best metrics”
- Lines 471-475: Could you have included reanalysis estimates from say ERA5 to provide temperature, humidity and pressure data to your models? While coarse, this would perhaps give you some additional information about the surrounding environmental context at the time of observation?
- Lines 501-502: I would strongly recommend including some code for reproducing at least a subset of these results, perhaps in an interactive notebook uploaded to Google Colab with some test data? Then others could more easily test and build on what you have provided here
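Regarding the Figure 5 color scheme comment above, a minimal matplotlib sketch of the kind of perceptually uniform, colorblind-friendly alternative I mean (the gridded snow depth values here are synthetic placeholders, not the authors' estimates):

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic snow depth grid (m) standing in for the mapped estimates in Fig. 5
depth = np.random.default_rng(0).uniform(0.3, 1.1, size=(50, 50))

fig, ax = plt.subplots()
im = ax.imshow(depth, cmap="viridis")   # or "cividis"; both are colorblind-safe
fig.colorbar(im, ax=ax, label="Snow depth (m)")
ax.set_title("Estimated snow depth")
plt.show()
```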
References
Bair, E. H., Abreu Calfa, A., Rittger, K., & Dozier, J. (2018). Using machine learning for real-time estimates of snow water equivalent in the watersheds of Afghanistan. The Cryosphere, 12(5), 1579–1594. https://doi.org/10.5194/tc-12-1579-2018
King, F., Erler, A. R., Frey, S. K., & Fletcher, C. G. (2020). Application of machine learning techniques for regional bias correction of snow water equivalent estimates in Ontario, Canada. Hydrology and Earth System Sciences, 24(10), 4887–4902. https://doi.org/10.5194/hess-24-4887-2020
Liljestrand, D., Johnson, R., Skiles, S. M., Burian, S., & Christensen, J. (2024). Quantifying regional variability of machine-learning-based snow water equivalent estimates across the Western United States. Environmental Modelling & Software, 177, 106053. https://doi.org/10.1016/j.envsoft.2024.106053
Shao, D., Li, H., Wang, J., Hao, X., Che, T., & Ji, W. (2022). Reconstruction of a daily gridded snow water equivalent product for the land region above 45° N based on a ridge regression machine learning approach. Earth System Science Data, 14(2), 795–809. https://doi.org/10.5194/essd-14-795-2022
Vafakhah, M., Nasiri Khiavi, A., Janizadeh, S., & Ganjkhanlo, H. (2022). Evaluating different machine learning algorithms for snow water equivalent prediction. Earth Science Informatics, 15(4), 2431–2445. https://doi.org/10.1007/s12145-022-00846-z
Citation: https://doi.org/10.5194/egusphere-2024-3936-RC1
RC2: 'Comment on egusphere-2024-3936', Anonymous Referee #2, 24 Apr 2025
The paper “Object-based ensemble estimation of snow depth and snow water equivalent over multiple months in Sodankylä, Finland,” authored by Brodylo et al., investigates the use of four machine learning techniques and their ensemble for snow depth estimation. The estimated snow depths were then used to estimate SWE. Finally, the ratio of the modeled SWE to snow depths was taken to estimate snow density. In my estimation, the paper is well written. However, I have major comments regarding the methodological clarity.
- In section 3.2, the authors mentioned using Artificial Neural Networks (ANNs), among other models. However, they did not mention the exact architecture of the ANN (e.g., feed-forward, convolutional, transformers, etc.) used. Without this information, it is difficult to evaluate the appropriateness of the ANN architecture used in the study.
- In section 3.2, the details of the hyperparameters of the ML models (SVM, RF, and ANN) used were not mentioned. For example, for ANN, in addition to the architecture type, it would be beneficial to add the number of layers and neurons per layer, the activation function used, regularization (if any), the number of epochs, and other important hyperparameters used. For SVM, the kernel used, gamma, tolerance, and other important hyperparameters should be specified. For RF, the number of trees, the maximum depth, the minimum number of samples required to be at a leaf node, the minimum number of samples required to split an internal node, and other important hyperparameters should be specified. These details are essential for reproducibility.
- Also, in section 3.2, the authors mentioned using 10-fold cross-validation. However, important details are missing.
- Was the 10-fold CV done on the entire dataset or just the training set?
- No details about the train/test split ratio and strategy (random, stratified, etc.) were mentioned.
- During the CV, how were hyperparameter configurations selected? Was it a grid search or Bayesian? A table of the hyperparameters tuned and their optimal values can be placed in the appendix.
- In section 3.3, the authors used Pearson’s correlation as a measure of prediction accuracy. However, a perfect correlation does not necessarily mean that the model is good or that the predicted values are close to the true values. For example, cor(y, y) = cor(y, 20y) = cor(y, 300y) = cor(y, 10000y) = 1. That is to say, a model could be doing significantly worse and still have a perfect correlation. I encourage the authors to use the coefficient of determination instead. Please do not square the correlation coefficient; you can use r2_score in sklearn (https://scikit-learn.org/stable/modules/generated/sklearn.metrics.r2_score.html) or see this link for the formula (https://scikit-learn.org/stable/modules/model_evaluation.html#r2-score). A minimal numerical sketch illustrating this point follows these comments.
- This study uses 13 points of SWE and 88 points of depths to train the ML models. This is an extremely limited sample size for training any machine learning model, especially when trying to predict across 37,917 image objects with varying characteristics. This raises a serious concern about overfitting. With such a small training set, for example, for the SWE estimation problem, there's a high risk that the model would simply memorize the patterns in those 13 objects rather than learning generalizable relationships. Therefore, the authors should comment on how to validate the SWE across the upscaled 10 km². How did the authors ensure that the model wasn't overfitting for the SWE estimates? These points should be added to the discussion.
- Line 204: The model weights should use another metric since correlation is not reliable based on comment 4. Also, I think adding the weighting formula would be helpful to readers.
- Line 203: SVM was dropped due to poor performance. Could you please quantify "poor" in this scenario?
- Figure 3: One might think field snow depth and field SWE are inputs. The authors should clarify in the caption that these are the outcome variables, not inputs, or represent the output data with a different color.
- Tables 1-4: Were these metrics obtained from the entire dataset or just the testing set?
- The authors should comment on the transferability of the ML models in this study. Can we grab this model and apply it elsewhere? The authors could dedicate a paragraph to model transferability in the discussion.
- Line 167: A period is missing between "scale" and "In OBIA".
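To make the correlation comment above concrete, a minimal numerical sketch (SciPy/scikit-learn; the numbers are synthetic and only illustrate the metric behavior):

```python
import numpy as np
from scipy.stats import pearsonr
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)
y_true = rng.uniform(50, 150, size=13)    # e.g. 13 SWE observations (mm)
y_pred = 20 * y_true                      # grossly mis-scaled predictions

r, _ = pearsonr(y_true, y_pred)
print(f"Pearson r = {r:.2f}")             # 1.00, despite the poor fit
print(f"R2 score  = {r2_score(y_true, y_pred):.1f}")  # strongly negative
```

The coefficient of determination penalizes the scale and offset errors that the correlation ignores, which is why I suggest it both for reporting accuracy and for the model weights mentioned at Line 204.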
Citation: https://doi.org/10.5194/egusphere-2024-3936-RC2