Automated glacial lake extraction using an Object-Based Image Analysis approach in Google Earth Engine
Abstract. The combination of glacial retreat and climate change is increasing the number and size of glacial lakes globally. Many of these glacial lakes are in dangerous glaciated environments, and satellite remote sensing provides a way to improve monitoring efforts, though automated methods are needed to accurately and rapidly detect changes in these lakes. We undertake a total of 40 classification experiments to investigate the impact of classifier parameters, input features and training data on classification accuracy. We run 18 additional experiments to identify the optimal combination of Simple Non-Iterative Clustering segmentation parameters (connectivity and neighborhoodSize), assess the impact of input features, determine the required number of training and testing images and compare water extraction indices for the OBIA classification. Our results show that the best-performing combination of parameters was 100–250 training points per class, and values of four and 128 for connectivity and neighborhoodSize, respectively. The inclusion of input features such as hillshade, slope, the NDVI and MNDWI in our OBIA classifier improves the overall delineation of glacial lakes and other land classes in our study, particularly in shadow bodies, which are commonly misclassified as water bodies. Finally, we demonstrate that it is possible to accurately classify a time series of images using a single training image, with superior results compared to training with multiple images. We hypothesise that this is due to the complexities of radiometric sensitivity, heterogenous values for bands and indices and temporal changes in land cover throughout the study. Our OBIA approach is a more efficient and accurate way in mapping glacial lakes using Landsat 4-9 satellite imagery over traditional pixel-based approaches, with an overall accuracy of 94.6 %, with a producer’s accuracy and user’s accuracy of 95.3 % and 95.5 % respectively, for water. 
This suggests that this method has the potential to map glacial lakes accurately and rapidly over larger regions.
Morgan and colleagues present recommendations for an optimal workflow for generating an OBIA/random forest classifier for proglacial lakes proximal to Tasman Glacier, New Zealand. The discussion covers several aspects of this workflow, including optimising the parameter choice in the segmentation and random forest steps, resource allocation on GEE, and the value of training the RF classifier on single vs. multiple images.
In a research space where machine learning classification methods are dominated by “AI” (deep learning, CNN, vision transformers, etc.), there is a clear value in continuing to highlight more lightweight ML approaches such as the OBIA/RF workflow discussed here. These workflows require less compute resource, fewer training data, and are accessible to a wider range of users through tools such as GEE. Aspects of this paper such as training data recommendations and GEE compute monitoring will aid users looking to apply this method in their own glacier monitoring workflows. Having said that, many of the useful aspects of this paper are lost among a very long results section that is not translatable to wider applications, and technical language that is disconnected from standard ways of presenting ML results. I think that this paper likely requires significant work before publication.
MAJOR COMMENTS
Whilst reading the paper, I struggled to extract the key messages at times because the language used to describe the methods (and at times, the approaches themselves) differs quite a lot from that standardised across other studies producing and accessing ML workflows.
Most significantly, this arises in what the paper refers to as “classification and segmentation experiments” (Section 2.3, 2.4, 3.1, 3.2), which is what the wider literature would probably refer to as (hyper)parameter optimisation using a grid search; and “using input features” (Section 2.4, 3.3), which is probably more widely recognised as feature selection. By not using standardised language to explain these approaches, these sections are overly long and descriptive as they have to re-invent (or at least, re-explain) the wheel. In a typical RF-focussed paper, a grid search problem might be referred to in only a few sentences, explaining the scope of the search and the optimal results, with additional information (e.g. sections 3.1, 3.2, fig. 2, fig. 3, table 6, table 7, table 8) probably entirely relegated to supplementary material. The depth at which this information is discussed in the main text also implies that the optimal parameters selected are broadly transferable across other geographic contexts, whereas they are likely specific to the precise location chosen here.
Also considering other contemporary ML studies, it is strange that more modern methods of accuracy assessment aren’t chosen - e.g. IoU, F1 score, recall/precision. Having said that, for later tables, it is probably best to pick the one statistic best suited to the task at hand and stick with it. This would definitely improve readability and reduce the number of tables in the main body - for instance, Table 9 and 10 could be combined if only one appropriate statistic was selected to compare, with the full data relegated to the supplement.
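To illustrate how closely these statistics relate to the ones already reported, the sketch below derives precision, recall, F1 and IoU for a single class (e.g. water) from confusion-matrix counts. User’s accuracy is equivalent to precision and producer’s accuracy to recall, so the paper’s existing numbers can be re-expressed directly. The function name and the counts are hypothetical and are not taken from the manuscript; the function assumes all denominators are non-zero.

```python
def binary_metrics(tp, fp, fn, tn):
    """Accuracy metrics for one class from confusion-matrix counts.

    tp/fp/fn/tn are true positives, false positives, false negatives and
    true negatives for the class of interest (assumed non-degenerate).
    """
    precision = tp / (tp + fp)   # equivalent to user's accuracy
    recall = tp / (tp + fn)      # equivalent to producer's accuracy
    f1 = 2 * precision * recall / (precision + recall)
    iou = tp / (tp + fp + fn)    # intersection over union (Jaccard index)
    overall_accuracy = (tp + tn) / (tp + fp + fn + tn)
    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "iou": iou,
        "overall_accuracy": overall_accuracy,
    }


# Hypothetical counts for illustration only (not the paper's data).
m = binary_metrics(tp=90, fp=10, fn=5, tn=895)
for name, value in m.items():
    print(f"{name}: {value:.3f}")
```

Because F1 and IoU are monotonically related (F1 = 2·IoU / (1 + IoU)), reporting one of the two in the main text would be sufficient.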
The lack of alignment with traditional ML language/standardised approaches also makes it hard to assess what is happening with the training and validation data. Some mention of random sampling of points for training is made (L189), but how was the training/validation split approached in the parameter optimisation stage (e.g. k-fold cross validation; RF OOB error) and in the final training (e.g. 80:20 train:test split)? Certainly, it is unusual that the study does not have entirely independent test images that are not used for training, especially considering that the abstract claims that the method can “accurately classify a time series of images” (L20).
On this topic, the really interesting part of the paper - to me - is the potential that an RF model produced from a single scene’s training dataset can be used to classify an entire time series of images (L20, Section 3.4, 3.5). This provides a clear advantage over deep learning approaches for small-scale observational studies such as these, as producing large training datasets for e.g. CNN/vision transformer problems is time-intensive and demanding. However, I have some issues with the actual results here. The first relates to the validation issue raised above: testing of the multi-image-trained classifier is performed *on the images the classifier was trained on*, which is not a suitable test for the claim made. This error is repeated in Figure 5, where (to the best of my understanding) the single-image classifier is shown classifying the image it was trained on. In this context, I am unsurprised the single-image-classifier performed better. To properly validate the claim that a single image performs well in classifying other images, an entirely separate test dataset should be assessed (perhaps including data from another region to show how this can/can’t be transferred geographically). Additionally, it is surprising, given the ease with which this can be done in GEE, that the classifier isn’t subsequently applied to a full time stack of imagery, with example lake area(s) presented through time. This would be an exciting and practical demonstration of the method that is sorely lacking in the current paper - and, in combination with sharing a way of applying the trained classifier (see comment to L603 below) would elevate this paper towards having real applicability and impact.
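As a concrete version of the validation design suggested above, the sketch below splits samples so that every test sample comes from an image that contributed nothing to training (a leave-one-image-out scheme). It is a minimal illustration, not the authors’ procedure: the function name and the image identifiers are hypothetical, and in practice each fold would be followed by training and scoring the RF classifier.

```python
from collections import defaultdict


def leave_one_image_out(sample_image_ids):
    """Yield (held_out_image, train_indices, test_indices) per fold.

    sample_image_ids[i] is the ID of the image that sample i was drawn from.
    Each fold holds out every sample from one image, so no image contributes
    to both the training and the test set.
    """
    by_image = defaultdict(list)
    for idx, image_id in enumerate(sample_image_ids):
        by_image[image_id].append(idx)

    for held_out in sorted(by_image):
        test_idx = by_image[held_out]
        train_idx = sorted(
            i for image_id, idxs in by_image.items()
            if image_id != held_out
            for i in idxs
        )
        yield held_out, train_idx, test_idx


# Hypothetical image IDs for six training samples (illustration only).
ids = ["img_A", "img_A", "img_B", "img_C", "img_B", "img_C"]
folds = list(leave_one_image_out(ids))
for held_out, train_idx, test_idx in folds:
    print(held_out, "train:", train_idx, "test:", test_idx)
```

The same grouping logic would apply to the single-image case: train on one image, then report accuracy only on the images that were held out.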
LINE COMMENTS
L11-12 This sentence (and the following) come a bit out of nowhere - the classification method (OBIA) is not mentioned by name until line 15, so it is unclear what method the parameters/input features/training data relate to.
L36 - PBIA is mentioned as though it is distinct from approaches such as NDWI (L35-27), when in fact most applications of indices are applied in a PBIA fashion (as stated L41-43). Indeed, this section in general probably needs a better differentiation between comparative methods, split into e.g. (i) OBIA vs PBIA; (ii) index/threshold-based vs ML-based (e.g. RF); (iii) local vs cloud-based (GEE, AWS, etc.) methods. These three classes are not mutually exclusive, and each has positives and negatives.
L93 - most efficient method? Given reference to changing parameters, perhaps this is meant to mean the optimal parameters for the classification task?
Paragraph beginning L105 - Arguably this paragraph could be removed or highly condensed - some unnecessary information (geologic history?) and climate information is largely irrelevant to the rest of the study - the second paragraph of this section, summarising regional lake studies, is more relevant.
Fig 1. - what is the data origin of the lake outlines?
L123-136 - For an EO specialist journal, could probably lose this preamble and start at the sentence “We use Landsat 4-9 Collection Tier 1…”.
L142 - Is it necessarily within the methods to explain that these bands were renamed within the workflow? Perhaps just say that equivalent bands were used where band numbers differed between 4-7/8-9, and refer only to the band ‘name’ hereon (e.g. Green, SWIR1, etc.).
L148 - The NZ DEM is 8 m, but was presumably resampled to match the resolution of the Landsat imagery for feeding into segmentation/RF processes. In this case, could it be equally easy to recommend e.g. COPDEM-30, and have this be globally valid?
L162 - out of the four parameters, three are referred to in prose and one by the actual parameter name within the GEE javascript API (neighborhoodSize). I would recommend consistency here.
L185-9 - “undertook a total of 48 (40+8) experiments” - ‘experiments’ should probably be better referred to as a gridded parameter sweep aiming to find the optimal number of training points and trees, followed by a second parameter sweep using optimised parameters from the first round which aimed to find the best segmentation parameters.
L197 - I am slightly unclear as to why the RF parameter sweep was performed before the segmentation parameter sweep. As the segmentation parameters will alter the input to the RF training, changing the segmentation may alter the optimal RF values (as an addendum, what were the segmentation parameters fixed as when running the RF sweep?). The solution here would be to perform a parameter sweep on all three components simultaneously (either as an exhaustive grid search or through a random or Bayesian search if optimisation is needed).
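A joint sweep of this kind could be sketched as follows. This is a minimal exhaustive grid search over segmentation and RF parameters together, under stated assumptions: the parameter names and candidate values are illustrative only, and `toy_score` is an arbitrary placeholder standing in for whatever cross-validated accuracy the authors would compute for each combination.

```python
from itertools import product


def grid_search(param_grid, score_fn):
    """Exhaustively score every combination in param_grid; return the best."""
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Illustrative candidate values (assumptions, not the paper's actual grid).
param_grid = {
    "connectivity": [4, 8],
    "neighborhood_size": [32, 64, 128, 256],
    "n_trees": [50, 100, 200],
    "points_per_class": [100, 250, 500],
}


def toy_score(params):
    """Arbitrary placeholder for a cross-validated accuracy estimate.

    A real implementation would segment, train and validate a classifier
    with these parameters; this function only makes the example runnable.
    """
    return -(
        abs(params["neighborhood_size"] - 64)
        + abs(params["n_trees"] - 200)
        + abs(params["points_per_class"] - 100)
        + abs(params["connectivity"] - 8)
    )


best_params, best_score = grid_search(param_grid, toy_score)
n_combinations = len(list(product(*param_grid.values())))
print(n_combinations, best_params, best_score)
```

Even this small grid produces 72 combinations, which is where a random or Bayesian search becomes attractive if compute on GEE is limiting.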
L198 - With only 8 experiments, how were connectivity/neighborhoodSize determined? Was this e.g. a 4x2 grid search?
L207 - is there a better term than ‘hillshade’ to describe this data, given the well-established meaning within the literature?
L269 – is it perhaps fair to say that the RF model is robust to the number of trees (probably unsurprising, given the nature of RF)?
Figure 5 - Assuming this is the 2023 LC09 image (it is not made clear in the caption) it is probably not surprising that the single-image-classifier performs better - it has been trained solely on this image, but the multi-image classifier has been trained on a larger set! A different image should be selected for both qualitative and quantitative assessment - indeed, the authors should be more careful throughout that training data is not being used in their model validation steps.
Table 5 - Perhaps I missed this in the discussion, but it’s probably worth including the context of GEE recently moving to limit ‘free’ usage for academic users. After all, a couple of minutes’ difference in runtime is largely inconsequential if one is producing a classifier that will be used thousands of times - but a doubling of EECU-seconds may be highly significant if one is training many models on a ‘free’ plan.
L419 - why do higher sample sizes lead to a reduction in classifier accuracy? Is this simply overfitting or something else?
Fig 7 and associated discussion - Is a solution to train separate RF models for the older and newer Landsats?
L603 - Nice presentation of figure and data code within the GitHub repository. However, I’m less clear on how I can actually access and use the model trained here, although perhaps I am just missing something. If the classifier is indeed broadly applicable as stated, it would be useful to have additional GEE code (and perhaps an explanatory tutorial included in the Markdown documentation) showing how users can apply the trained model to a new region.