Automated glacial lake extraction using an Object-Based Image Analysis approach in Google Earth Engine

Morgan, Tomos; McNabb, Robert; Dunlop, Paul

doi:10.5194/egusphere-2026-2033

Preprints

https://doi.org/10.5194/egusphere-2026-2033

08 May 2026

Abstract. The combination of glacial retreat and climate change is increasing the number and size of glacial lakes globally. Many of these glacial lakes are in dangerous glaciated environments, and satellite remote sensing provides a way to improve monitoring efforts, though automated methods are needed to accurately and rapidly detect changes in these lakes. We undertake a total of 40 classification experiments to investigate the impact of classifier parameters, input features and training data on classification accuracy. We run 18 additional experiments to identify the optimal combination of Simple Non-Iterative Clustering segmentation parameters (connectivity and neighborhoodSize), assess the impact of input features, determine the required number of training and testing images and compare water extraction indices for the OBIA classification. Our results show that the best-performing combination of parameters was 100–250 training points per class, and values of four and 128 for connectivity and neighborhoodSize, respectively. The inclusion of input features such as hillshade, slope, the NDVI and MNDWI in our OBIA classifier improves the overall delineation of glacial lakes and other land classes in our study, particularly in shadow bodies, which are commonly misclassified as water bodies. Finally, we demonstrate that it is possible to accurately classify a time series of images using a single training image, with superior results compared to training with multiple images. We hypothesise that this is due to the complexities of radiometric sensitivity, heterogeneous values for bands and indices and temporal changes in land cover throughout the study. Our OBIA approach is a more efficient and accurate way of mapping glacial lakes using Landsat 4-9 satellite imagery than traditional pixel-based approaches, with an overall accuracy of 94.6 % and, for water, a producer’s accuracy and user’s accuracy of 95.3 % and 95.5 %, respectively.
This suggests that this method has the potential to map glacial lakes accurately and rapidly over larger regions.

Received: 10 Apr 2026 – Discussion started: 08 May 2026

Publisher's note: Copernicus Publications remains neutral with regard to jurisdictional claims made in the text, published maps, institutional affiliations, or any other geographical representation in this paper. While Copernicus Publications makes every effort to include appropriate place names, the final responsibility lies with the authors. Views expressed in the text are those of the authors and do not necessarily reflect the views of the publisher.

Status: final response (author comments only)

RC1: 'Comment on egusphere-2026-2033', Anonymous Referee #1, 05 Jun 2026

Morgan and colleagues present recommendations for an optimal workflow for generating an OBIA/random forest classifier for proglacial lakes proximal to Tasman Glacier, New Zealand. The discussion touches upon various aspects of the workflow, including optimising the parameter choice in the segmentation and random forest steps, resource allocation on GEE, and the value of training the RF classifier on single vs. multiple images.
In a research space where machine learning classification methods are dominated by “AI” (deep learning, CNN, vision transformers, etc.), there is a clear value in continuing to highlight more lightweight ML approaches such as the OBIA/RF workflow discussed here. These workflows require less compute resource, fewer training data, and are accessible to a wider range of users through tools such as GEE. Aspects of this paper such as training data recommendations and GEE compute monitoring will aid users looking to apply this method in their own glacier monitoring workflows. Having said that, many of the useful aspects of this paper are lost among a very long results section that is not translatable to wider applications, and technical language that is disconnected from standard ways of presenting ML results. I think that this paper likely requires significant work before publication.
MAJOR COMMENTS
Whilst reading the paper, I struggled to extract the key messages at times because the language used to describe the methods (and at times, the approaches themselves) differs quite a lot from that standardised across other studies producing and accessing ML workflows.
Most significantly, this arises in what the paper refers to as “classification and segmentation experiments” (Section 2.3, 2.4, 3.1, 3.2), which is what the wider literature would probably refer to as (hyper)parameter optimisation using a grid search; and “using input features” (Section 2.4, 3.3), which is probably more widely recognised as feature selection. By not using standardised language to explain these approaches, these sections are overly long and descriptive as they have to re-invent (or at least, re-explain) the wheel. In a typical RF-focussed paper, a grid search problem might be referred to in only a few sentences, explaining the scope of the search and the optimal results, with additional information (e.g. sections 3.1, 3.2, fig. 2, fig. 3, table 6, table 7, table 8) probably entirely relegated to supplementary material. The depth at which this information is discussed in the main text also implies that the optimal parameters selected are broadly transferable across other geographic contexts, whereas they are likely specific to the precise location chosen here.
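To illustrate the grid-search framing, an exhaustive sweep over segmentation and RF parameters together can be written in a few lines. The Python sketch below is a minimal, hypothetical example: the candidate values, the `joint_grid_search` helper and the `toy_score` placeholder are assumptions made for illustration and are not taken from the manuscript. In a real workflow, `score_fn` would wrap the segmentation, training and validation steps and return a validation metric.

```python
from itertools import product


def joint_grid_search(grid, score_fn):
    """Score every combination in `grid` and return (best_params, best_score)."""
    names = list(grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Hypothetical candidate values (illustration only, not the manuscript's search space).
grid = {
    "connectivity": [4, 8],
    "neighborhoodSize": [64, 128, 256],
    "n_trees": [50, 100, 200],
    "points_per_class": [100, 250, 500],
}

# Arbitrary optimum used only so that the placeholder objective has a unique maximum.
_TOY_OPTIMUM = {
    "connectivity": 8,
    "neighborhoodSize": 64,
    "n_trees": 200,
    "points_per_class": 100,
}


def toy_score(params):
    """Placeholder objective standing in for a validation metric."""
    return -sum(abs(params[key] - _TOY_OPTIMUM[key]) for key in _TOY_OPTIMUM)


n_combinations = len(list(product(*grid.values())))
best_params, best_score = joint_grid_search(grid, toy_score)
print(n_combinations, "combinations evaluated")
print("best:", best_params, "score:", best_score)
```

Because every combination is scored, interactions between segmentation and classifier settings are captured, at the cost of a search space that grows multiplicatively with each added parameter.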
Also considering other contemporary ML studies, it is strange that more modern methods of accuracy assessment aren’t chosen - e.g. IoU, F1 score, recall/precision. Having said that, for later tables, it is probably best to pick the one statistic best suited to your task at hand and stick with it. This would definitely improve readability and reduce the number of tables in the main body - for instance, Tables 9 and 10 could be combined if only one appropriate statistic was selected to compare, with the full data relegated to the supplement.
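For reference, producer's accuracy corresponds to recall and user's accuracy to precision, so F1 and IoU follow directly from the values already reported. The Python sketch below is illustrative only: the function names are hypothetical, and plugging in the abstract's water-class figures (95.3 % and 95.5 %) assumes both were computed from the same confusion matrix.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


def iou_from_precision_recall(precision, recall):
    """IoU = TP / (TP + FP + FN), rewritten in terms of precision and recall."""
    return precision * recall / (precision + recall - precision * recall)


def precision_recall_from_counts(tp, fp, fn):
    """Precision (user's accuracy) and recall (producer's accuracy) from counts."""
    return tp / (tp + fp), tp / (tp + fn)


# Water-class values quoted in the abstract (assumed to share one confusion matrix).
users_accuracy = 0.955      # precision
producers_accuracy = 0.953  # recall

water_f1 = f1_score(users_accuracy, producers_accuracy)
water_iou = iou_from_precision_recall(users_accuracy, producers_accuracy)
print(f"F1  = {water_f1:.3f}")   # approximately 0.954
print(f"IoU = {water_iou:.3f}")  # approximately 0.912

# Consistency check with made-up counts: both routes to IoU must agree.
p, r = precision_recall_from_counts(tp=90, fp=5, fn=10)
iou_direct = 90 / (90 + 5 + 10)
iou_derived = iou_from_precision_recall(p, r)
```

IoU is always lower than F1 for the same classifier, which is worth bearing in mind when comparing against studies that report only one of the two.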
The lack of alignment with traditional ML language/standardised approaches also makes it hard to assess what is happening with the training and validation data. Some mention of random sampling of points for training is made (L189), but how was the training/validation split approached in the parameter optimisation stage (e.g. k-fold cross validation; RF OOB error) and in the final training (e.g. 80:20 train:test split)? Certainly, it is unusual that the study does not have entirely independent test images that are not used for training, especially considering that the abstract claims that the method can “accurately classify a time series of images” (L20).
On this topic, the really interesting part of the paper - to me - is the potential that an RF model produced from a single scene’s training dataset can be used to classify an entire time series of images (L20, Section 3.4, 3.5). This provides a clear advantage over deep learning approaches for small-scale observational studies such as these, as producing large training datasets for e.g. CNN/vision transformer problems is time-intensive and demanding. However, I have some issues with the actual results here. The first relates to the validation issue raised above: testing of the multi-image-trained classifier is performed *on the images the classifier was trained on*, which is not a suitable test for the claim made. This error is repeated in Figure 5, where (to the best of my understanding) the single-image classifier is shown classifying the image it was trained on. In this context, I am unsurprised the single-image classifier performed better. To properly validate the claim that a single image performs well in classifying other images, an entirely separate test dataset should be assessed (perhaps including data from another region to show how this can/can’t be transferred geographically). Additionally, it is surprising, given the ease with which this can be done in GEE, that the classifier isn’t subsequently applied to a full time stack of imagery, with the area of example lake(s) presented through time. This would be an exciting and practical demonstration of the method that is sorely lacking in the current paper - and, in combination with sharing a way of applying the trained classifier (see comment to L603 below), would elevate this paper towards having real applicability and impact.
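One way to keep test imagery independent is to split at the scene level rather than the point level, so that no scene contributes to both training and testing. The Python sketch below shows two such schemes under stated assumptions: the scene identifiers are placeholders, the helper names are hypothetical, and neither scheme is claimed to be the procedure used in the manuscript.

```python
def leave_one_scene_out(scene_ids):
    """Yield (train_ids, test_id): train on all scenes but one, test on the held-out scene."""
    for i, test_id in enumerate(scene_ids):
        yield scene_ids[:i] + scene_ids[i + 1:], test_id


def single_scene_splits(scene_ids):
    """Yield (train_id, test_ids): train on one scene, test on every other scene."""
    for i, train_id in enumerate(scene_ids):
        yield train_id, scene_ids[:i] + scene_ids[i + 1:]


# Placeholder identifiers standing in for five Landsat scenes.
scenes = ["scene_1", "scene_2", "scene_3", "scene_4", "scene_5"]

multi_image_folds = list(leave_one_scene_out(scenes))
single_image_folds = list(single_scene_splits(scenes))

for train_ids, test_id in multi_image_folds:
    # The held-out scene must never appear in the training set.
    assert test_id not in train_ids

for train_id, test_ids in single_image_folds:
    # The training scene must never appear among the test scenes.
    assert train_id not in test_ids

print(len(multi_image_folds), "multi-image folds;", len(single_image_folds), "single-image folds")
```

Reporting the mean and spread of a metric across such folds would let the single-image and multi-image classifiers be compared on scenes that neither has seen.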
LINE COMMENTS
L11-12 This sentence (and the following) come a bit out of nowhere - the classification method (OBIA) is not mentioned by name until line 15, so it is unclear what method the parameters/input features/training data relate to.
L36 - PBIA is mentioned as though it is distinct from approaches such as NDWI (L35-27), when in fact most applications of indices are applied in a PBIA fashion (as stated L41-43). Indeed, this section in general probably needs a better differentiation between comparative methods, split into e.g. (i) OBIA vs PBIA; (ii) index/threshold-based vs ML-based (e.g. RF); (iii) local vs cloud-based (GEE, AWS, etc.) methods. These three classes are not mutually exclusive, and each has positives and negatives.
L93 - most efficient method? Given reference to changing parameters, perhaps this is meant to mean the optimal parameters for the classification task?
Paragraph beginning L105 - Arguably this paragraph could be removed or highly condensed - some unnecessary information (geologic history?) and climate information is largely irrelevant to the rest of the study - the second paragraph of this section, summarising regional lake studies, is more relevant.
Fig 1. - what is the data origin of the lake outlines?
L123-136 - For an EO specialist journal, could probably lose this preamble and start at the sentence “We use Landsat 4-9 Collection Tier 1…”.
L142 - Is it necessary within the methods to explain that these bands were renamed within the workflow? Perhaps just say that equivalent bands were used where band numbers differed between 4-7/8-9, and refer only to the band ‘name’ hereon (e.g. Green, SWIR1, etc.).
L148 - The NZ DEM is 8 m, but was presumably resampled to match the resolution of the Landsat imagery for feeding into segmentation/RF processes. In this case, could it be equally easy to recommend e.g. COPDEM-30, and have this be globally valid?
L162 - out of the four parameters, three are referred to in prose and one by the actual parameter name within the GEE javascript API (neighborhoodSize). I would recommend consistency here.
L185-9 - “undertook a total of 48 (40+8) experiments” - ‘experiments’ should probably be better referred to as a gridded parameter sweep aiming to find the optimal number of training points and trees, followed by a second parameter sweep using optimised parameters from the first round which aimed to find the best segmentation parameters.
L197 - I am slightly unclear as to why the RF parameter sweep was performed before the segmentation parameter sweep. As the segmentation parameters will alter the input to the RF training, changing the segmentation may alter the optimal RF values (as an addendum, what were the segmentation parameters fixed as when running the RF sweep?). The solution here would be to perform a parameter sweep on all three components simultaneously (either as an exhaustive grid search or through a random or Bayesian search if optimisation is needed).
L198 - With only 8 experiments, how were connectivity/neighborhoodSize determined? Was this e.g. a 4x2 grid search?
L207 - is there a better term than ‘hillshade’ to describe this data, given the well-established meaning within the literature?
L269 – is it perhaps fair to say that the RF model is robust to the number of trees (probably unsurprising, given the nature of RF)?
Figure 5 - Assuming this is the 2023 LC09 image (it is not made clear in the caption) it is probably not surprising that the single-image-classifier performs better - it has been trained solely on this image, but the multi-image classifier has been trained on a larger set! A different image should be selected for both qualitative and quantitative assessment - indeed, the authors should be more careful throughout that training data is not being used in their model validation steps.
Table 5 - Perhaps I missed this in the discussion, but it’s probably worth including the context of GEE recently moving to limit ‘free’ usage for academic users. After all, a couple of minutes difference in runtime is largely inconsequential if one is producing a classifier that will be used thousands of times - but a doubling in EECU-seconds may be highly significant if one is training many models on a ‘free’ plan.
L419 - why do higher sample sizes lead to a reduction in classifier accuracy? Is this simply overfitting or something else?
Fig 7 and associated discussion - Is a solution to train separate RF models for the older and newer Landsats?
L603 - Nice presentation of figure and data code within the GitHub repository. However, I’m less clear on how I can actually access and use the actual model trained here, although perhaps I am just missing something. If the classifier is indeed broadly applicable as stated, it would be useful to have an additional GEE code (and perhaps explanatory tutorial included in the Markdown documentation) showing how users can apply the trained model to a new region.

Citation: https://doi.org/10.5194/egusphere-2026-2033-RC1
- AC1: 'Reply on RC1', Tomos Morgan, 15 Jul 2026
  
  Dear Referee 1,
  Thank you very much for your time and the constructive feedback on our manuscript.
  Please refer to the attached PDF file for our complete responses, including detailed explanations and specific line-number references to the revised text.
  Kind regards,
  Tomos
  
  Citation: https://doi.org/10.5194/egusphere-2026-2033-AC1
RC2: 'Comment on egusphere-2026-2033', Anonymous Referee #2, 16 Jun 2026

Morgan and colleagues present a workflow for extracting glacial lake delineations using an automated OBIA/ML classification method. The paper provides a thorough overview of classification and segmentation parameter optimisation, feature selection, computational requirements, and a comparison of using 1 vs. 5 Landsat images for training the classifier. This workflow is a welcome contribution in a field increasingly dominated by deep learning approaches. By demonstrating that a more computationally efficient and accessible method can still achieve competitive accuracy, the authors make a convincing case that there remains an important place for simpler classification methods.

One of the main findings, that a single training image can achieve accurate classification of a time series, is intriguing and very exciting, but it does raise a methodological concern that is worth addressing. As described in Section 2, the same image was used to collect both training and testing datasets. The resulting accuracies were then compared against those from a classifier trained and tested on five images (if I understood correctly). In this case, I would assume the higher accuracy of the single-image method may simply reflect reduced dataset variety rather than superior classification performance.

I wish the authors would acknowledge this issue and discuss its potential effect on the results.

Another unexplored opportunity is applying the classifier to a longer or more frequent time series, or to a new region. With the imagery archive available in GEE, extending the study in this way could be relatively straightforward and would show how well the classifier generalises beyond the study area.
These were my main concerns when reading the paper, and they warrant a revision. Further specific and technical comments are listed below.
Line comments:
L36-37 Missing the word "Index" in Normalized Difference Snow …

L80-84 I would add the reference where the study is first mentioned.

L115 Add space in 23km for consistency.

L138 Add space in timeseries.

L139-140 Using the same image for both training and testing could risk data leakage and the two datasets being too similar.

L184 Classification and segmentation experiments sounds vague to me. Maybe consider switching to "hyperparameter tuning" or "parameter optimisation", as they would fit your description in the paragraph.

L199-201 I found this difficult to understand. Maybe simplify to something like: In this study, OBIA was used for both multi-class classification of all feature types and binary classification of water versus non-water features.

L202 I would consider switching the title to the clearer "feature selection".

L208-210 Longer explanation of commonly known NDVI not necessary, especially when other indices didn't have one.

L229 I don't think it is necessary to explain QGIS for this journal's audience.

L352-354 If I understood correctly, all LS images cover the same area, which makes your results highly region-specific. How can you be confident that the OBIA classifier is applicable across the Southern Alps if it has not been tested in other areas? I think this is a missed opportunity to evaluate the generalisability of your classifier.

L359-366 Since you used the same image for both training and testing, the higher accuracy in the single-image method is expected. For the 5-image classifier, was the data collected from all five images for both training and testing? If so, the slightly greater variety introduced might explain the difference.

L432 10m -> 10 m for consistency.

L468 The word 'Classifier' is capitalized, whereas all other instances use lowercase.

L518-519 Did you apply the single-image classifier to the whole collection, and if so, what were the results? Also, does 'whole collection' refer to the full time series for your specific study area, or a wider selection of areas?

L602 I had some confusion in trying to locate the supporting scripts and main scripts. Maybe adding more explanation on how to access these would be beneficial for other users.

Citation: https://doi.org/10.5194/egusphere-2026-2033-RC2
- AC2: 'Reply on RC2', Tomos Morgan, 15 Jul 2026
  
  Dear Referee 2,
  Thank you very much for your time and the constructive feedback on our manuscript.
  Please refer to the attached PDF file for our complete responses, including detailed explanations and specific line-number references to the revised text.
  Kind regards,
  Tomos
  
  Citation: https://doi.org/10.5194/egusphere-2026-2033-AC2

Viewed

Total article views: 382 (including HTML, PDF, and XML)

HTML	PDF	XML	Total	BibTeX	EndNote
215	153	14	382	12	11

Views and downloads (calculated since 08 May 2026)

Month	HTML	PDF	XML	Total
May 2026	147	73	7	227
Jun 2026	25	11	1	37
Jul 2026	43	69	6	118

Viewed (geographical distribution)

Total article views: 372 (including HTML, PDF, and XML) Thereof 372 with geography defined and 0 with unknown origin.

Latest update: 28 Jul 2026

Short summary

As glaciers retreat, lakes form and grow quickly, posing flood risks to downstream populations. Using satellite imagery, we test ways in which we can improve the accuracy of tracking changes in the areal extent of glacial lakes. By combining training and testing parameters with terrain and vegetation information, we achieved an overall accuracy of 94.6 %, showing that our method provides a rapid way of delineating glacial lakes to support hazard planning and mitigation.

