the Creative Commons Attribution 4.0 License.
Global Attention of Transformer Empowers Montane Periglacial Lake Identification
Abstract. Montane periglacial lakes, as sensitive indicators of cryospheric change, are undergoing rapid expansion under global warming. Investigating their evolving distribution is essential for understanding climate impacts and assessing associated geohazards. The complex topography and heterogeneous landscapes in high-mountain regions pose significant challenges for conventional methods, leading to the underdetection of small lakes, elevated false positive rates, and limited ability to discriminate between lake formation types. This study introduces a Vision Transformer (ViT)-based framework for montane periglacial lake identification, employing a two-step process of lake boundary segmentation and type classification. By leveraging ViT’s global attention mechanism, the framework captures long-range spatial and spectral relationships, enhancing contextual understanding of lakes and their surroundings. Compared to CNN-based models, the ViT-based approach achieved a mean intersection over union (MIoU) of 91.01 % for segmentation and an F1-score of 89.75 % for classification. It significantly improved detection of small lakes (as small as 0.0001 km²), reduced artifacts from shadows, snow, ice, and river fragments, and provided more accurate lake type classification. Applied to the Southeastern Tibetan Plateau Gorge Region, a region with high glacial lake density and outburst flood risks, the framework identified 3,266 lakes (1,708 glacial and 1,558 non-glacial), surpassing existing inventories in completeness and accuracy.
Status: open (until 31 Dec 2025)
- RC1: 'Comment on egusphere-2025-3628', Jonas Köhler, 05 Dec 2025
- RC2: 'Comment on egusphere-2025-3628', Jan-Christoph Otto, 07 Dec 2025
The authors present a new method for automatic mapping and classification of high-mountain/glacial lakes applied to the Third Pole region. The manuscript is generally well composed and the method presented represents a valuable addition to the approaches applied so far. The contribution is of high relevance since knowledge on lake distribution, especially in climate-change affected mountains, is essential for hazard management and mitigation.
While the method and results are well presented and the discussion and conclusion are largely convincing, the manuscript suffers severely from a poor application of the terminology and definition of glacial and non-glacial lakes. This has little effect on the lake detection itself but huge implications for the lake classification and for the results and comparison in general. With respect to the potential relevance of the produced dataset for hazard management, this issue needs to be resolved. Otherwise, despite its technological performance, the dataset will be of little use.
Terminology: The authors need to reconsider the definition of glacial and non-glacial lakes. The manuscript applies a variety of terms, starting with “periglacial lakes” in the title and introduction (and not afterwards), then “glacier lakes”, “montane lakes” and “non-glacier lakes”. The authors state that they follow the classification by Yao et al. (2018), but a detailed definition of the terminology is absolutely required. This will influence the results and interpretation. For additional clarification I suggest fundamental review papers on the terminology, for example by Carrivick and Tweed (2013) [DOI: 10.1016/j.quascirev.2013.07.028].
To illustrate this, consider chapter 3.3: In the STPG region most of the non-glacial lakes identified and depicted in Fig. 4 are indeed glacial lakes, according to most classification schemes, because they have been formed by glacial erosion. Many are found in cirques that have been sculpted by glaciers (e.g. in the area around 29°,11.441’ N/95°33,340’E). The only difference is that they are located in catchments without current glaciers, thus they have been formed by glacier action in the past. Your terminology should therefore include not only a geomorphological and topographic definition, but also a temporal one (see for example Buckel et al. (2018)). Non-glacial, from my perspective, would be restricted to lakes formed by landslide/debris flow dams or of volcanic origin. Lakes formed purely by excessive precipitation are, in my view, very rare in mountainous regions.
My suggestion would be either to add a temporal aspect to your definition (Holocene, historic glacial lake) or to focus only on ice-contact or near-glacier lakes (which would involve a distance-based definition).
This terminological uncertainty should be resolved and then considered in the discussion of the distance-based method. Your comment may of course be valid for some applications, especially natural hazards assessment (e.g. GLOF), but some of the argumentation is lost when the terminology is better defined and applied. In this respect, the authors need to consider that the distance-based method is justified here, assuring that there is a glacier upslope of the lake.
Furthermore, the title is confusing. Apart from the use of the term “periglacial lake”, I also don’t know what “global attention” signifies in this context. Please consider a more appropriate title.
Some minor comments:
L36 – replace the term “montane” with “alpine/high-alpine”; montane refers to a biogeographic altitudinal zone, usually at intermediate altitudes (throughout the manuscript!).
L39ff – You should provide a better definition of non-glacial lakes. The reference to “thermodynamic processes” is not enough from a geomorphological perspective, since this is too broad a term from physics. The term periglacial lakes is not commonly used, since the formation is not linked to periglacial processes (involving ground ice and freeze-thaw). Using periglacial lakes with respect to the location of the lake should be avoided due to the misleading connotation of the term periglacial here.
L58ff – same issue as above…
L251ff – You compare the result to other approaches (CNN, UNet, DeepLabv3+), but you don’t mention that you applied these methods as well. How was this comparison done? Did you use existing data from other studies? This needs to be mentioned in the methods section (e.g. 2.5) and referenced in Table 1.
L269 – Table 2 (and same for table 3): The tables hold the category “all”. What does this mean? Are these the mapped lakes? I suggest renaming this class for better clarity.
L282 – Ch 3.2 – Similar to the comment above: You compare your classification results with two other CNN approaches (EfficientNet, ResNet). How was this done? Again, this is not mentioned in the methods.
L291 – Add explanation for TP, FP, TN, FN in the table caption.
L329 – Exchange “The proposed framework” with a more precise description excluding the CNN/alternative methods. Like: ViT-based methods…
L395ff – Chap 4.2 – please add the a, b to the Zhang references throughout the chapter to better differentiate between the publications.
To conclude, I think this manuscript requires a revision with respect to the application of the right terms for the objects in focus. In addition, more attention should be paid to the introduction of the comparative database to improve clarity. These revisions require moderate effort, will not affect the geometry of the lakes dataset but surely will change the classification and the discussion. This will improve the quality of the study and ensure comparability and a wider application of the dataset in the intended way.
Citation: https://doi.org/10.5194/egusphere-2025-3628-RC2
RC1: 'Comment on egusphere-2025-3628', Jonas Köhler, 05 Dec 2025
The proposed manuscript “Global Attention of Transformer Empowers Montane Periglacial Lake Identification” seeks to advance the remote-sensing-based detection and classification of montane periglacial lakes. Accurate inventories of these lakes are of particular relevance as these lakes are indicators of climate change, important sources of fresh water, and pose geohazard risks through GLOFs. The authors identify three main challenges in the remote-sensing-based detection of periglacial lakes: the difficult detection of very small lakes, spectral confusion due to topographical shadows and similar land surface classes, and the discrimination between glacial and non-glacial lakes. To address these challenges, the authors propose a two-stage classification approach in which Vision Transformer (ViT) models replace more established models in both stages.
First, lakes in a Himalayan study region are detected from a Sentinel-2 mosaic using image segmentation. For this the authors propose the ViT model Mask2Former. Second, the identified lake shapes are analysed in their original environmental context to semantically classify them as either glacial or non-glacial. For this task, the authors propose the Swin Transformer v2 model. The models are trained in one region and applied and tested in a second to avoid overfitting and ensure transferability. The model results are compared to those of different established convolutional neural network (CNN) architectures, and the proposed framework appears to yield better results throughout. The final mapping product for the validation region is furthermore compared to two different lake mapping products. The new mapping approach detects a significantly larger number of lakes than the comparison datasets, which the authors attribute to the ability of their framework to detect particularly small lakes.
General comments
The manuscript has a clear approach, is generally well structured and concise. The methodology of comparing a newly developed framework to existing ones is suitable. The discussion does well in explaining the performance of ViT compared to CNN based on the different model architectures. The presented results are relevant insofar as they seem to be a significant improvement over established lake mapping methods (e.g. the U-Net) in montane areas. Even though the study is driven by a methodological rather than a geoscientific research question, I feel that with some rework it can be a valuable contribution to the cryosphere research community as it demonstrates a way to generate comprehensive inventories of periglacial lakes.
However, I think the paper needs some major revisions before publication. A major point is that the authors need to elaborate more on their methodology to facilitate transparency of their experiments and reproducibility. More details and explanation would help the geoscientific community to better understand the selected model architectures and configurations. More specific feedback on these issues can be found in the Specific Comments below. Furthermore, I strongly encourage the authors to share the lake labels used for training and testing in an open repository, not only for transparency but also to bolster the credibility of their results. Otherwise, it will be impossible to verify these. The same is true for the programming code of the applied models should they have been modified from their original source. Finally, I find the part of the discussion that addresses the confusion of glacial and non-glacial lakes in close proximity to glaciated areas to be insufficient. This could be improved by a similar analysis as presented in the results section. Again, more specific feedback on that can be found in the Specific comments.
Specific comments
Title: I feel the word “empower” is too strong, as the proposed method rather advances/improves the already working detection of periglacial lakes.
L52-53: What is the reasoning for this exact lake size threshold? Is it sensor resolution? Is it the low relevance of lakes of such a small size? Or from a different perspective: Why is it important to also include these small lakes and develop a method that is able to detect them? I think it is worthwhile to address this, as the proposed method later shows its strengths at exactly this lake size.
L68: What do you mean by “adaptive feature selection” in a Machine Learning context?
L101: What is “Hydroformer”? Is it a CNN? How does its architecture compare to the other introduced methods? Briefly elaborate.
L101-104: To be consistent with the sources you cited before: Could you briefly add in which spatial context (location, scale) the two studies cited here were conducted?
L107: I think what’s missing here is an overview about which specific shortcomings of the ViT studies cited before the proposed approach in this paper is supposed to address. Is it just the lack of application of ViTs for the detection of periglacial lakes? I can see, that ViTs have been applied before to detect lakes (and other surface features) in different contexts, but what in the cited studies makes the authors claim that ViTs are particularly suitable for this type of setting (montane periglacial)? I very much agree that it is worthwhile to investigate the suitability of ViTs for the proposed task, but the introduction chapter could be improved by providing some stronger arguments why particularly ViTs are promising.
L110: The claim that the study “elucidates the underlying physical mechanisms” (of what?) is too bold. This is not at all addressed in the study.
Figure 1: The figure indicates an accuracy assessment on the test data of the deep learning dataset. However, there is no arrow connecting back from the accuracy assessment to the two models. Were these models tuned and optimized or just used “out-of-the-box”? This should be also addressed in the text.
Figure 2: The third panel of the map (the overview) would be much more insightful if it provided a shaded relief of the topography. This way, readers not familiar with the region would be better able to understand the setting of the two study regions within the larger topographical context. Consider also zooming in a little bit (not too far) on the Himalayas and surrounding mountain ranges themselves. Too much space in this panel is wasted on regions which are not important to this study (Siberia, Australia, Indonesia etc.).
L166: You only use imagery from a single season. As a training dataset should be diverse to reflect a wide range of environmental conditions you should provide a good explanation why you focus on this limited time frame.
L166: During compositing, how do you account for intra-annual variability of the environment and particularly lake areas? You say you favor snow-free conditions with maximum lake extent (which is totally reasonable) but how do you control that this is reflected in the composite?
L178: What is the point of upsampling 10 m/30 m resolution input data to 5 m? Without any additional very high-resolution data there is no information gain. Why not just stick with 10 m? In fact, because the input imagery into the ViTs is split into tiles with a fixed number of pixels (256x256), might you not be losing a lot of spatial context at the higher resolution?
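The arithmetic behind this question can be sketched as follows. This is only an illustration, assuming square tiles of 256x256 pixels and the two pixel sizes mentioned in the comment; the function name `tile_footprint_km` is hypothetical and not taken from the manuscript.

```python
def tile_footprint_km(tile_px: int, pixel_size_m: float) -> float:
    """Return the ground side length (in km) covered by one square tile."""
    return tile_px * pixel_size_m / 1000.0


# Tile size as quoted in the comment; pixel sizes are the native 10 m
# resolution and the upsampled 5 m resolution.
side_10m = tile_footprint_km(256, 10.0)  # 2.56 km per tile side
side_5m = tile_footprint_km(256, 5.0)    # 1.28 km per tile side

# Share of ground area one tile covers at 5 m relative to 10 m.
area_ratio = (side_5m * side_5m) / (side_10m * side_10m)

print(f"10 m pixels: {side_10m:.2f} km per side")
print(f" 5 m pixels: {side_5m:.2f} km per side")
print(f"area covered per tile at 5 m vs 10 m: {area_ratio:.2f}")
```

Under these assumptions, a tile at 5 m covers one quarter of the ground area of a tile at 10 m, which is the loss of spatial context the comment refers to.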
L180ff: Training labels: Generating training labels is always a crucial process in ML/DL approaches. If two different experts were responsible for creating these labels, could you elaborate on any measures taken to ensure consistency between the labels? Also, I feel it would be a huge benefit to the community to make the training and validation labels available to the open public.
L184: How were the data standardized? Which method did you use?
L195ff: As this part is very technical ML/DL language, I would recommend some reworking to cater to the geoscientific community of this journal. Specifically, I’d like to see some elaboration on how the different components/features (e.g. multi-feature extraction, self/cross attention) of the two architectures are beneficial to the tasks of segmentation and classification of periglacial lakes in a montane setting. For example, which of the challenges described in the introduction section are addressed by choosing these model architectures and configurations?
Methods-Section: The methods section misses an entire sub-section on the additional models used for model comparison, i.e. U-Net, DeepLab V3+, ResNet, and EfficientNet. Although this section does not need to be as detailed as the (revised) section 2.4, some basic information is indeed required, such as reasoning for the choice of the comparative models, proper citation of the sources of the models, configuration of the input data for these models, and essential model hyperparameters. The reader must be able to reproduce the experiments the authors performed.
L216ff: What is the reasoning behind choosing these specific hyperparameter settings? Is there a loss curve showing that training for 100 epochs is sufficient?
Section 3.1: It is very good that the authors analyse and compare the performance of the different models for lake polygons, lake size, and elevation range using the MIoU. However, this could be complemented by an analysis of lake area, i.e. the ability of the different models to map the lake area as “completely as possible”. The analysis shown in Fig 6a already goes in this direction, where you can see that although, for example, DeepLab detects a lake as an entity, it fails to completely map the lake boundary as determined by the ground-truth data. I recommend an MIoU analysis based on the total number of lake pixels detected by the different approaches.
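A pixel-based evaluation of this kind could look like the minimal sketch below. It assumes binary masks (1 = lake, 0 = background) and that MIoU denotes the mean of the lake-class and background-class IoU; the manuscript's exact definition may differ. The function names and the toy masks are hypothetical.

```python
def pixel_iou(pred, truth):
    """IoU of the positive class for two equally shaped binary masks."""
    inter = 0
    union = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            inter += 1 if (p and t) else 0
            union += 1 if (p or t) else 0
    return inter / union if union else float("nan")


def invert(mask):
    """Swap lake and background labels in a binary mask."""
    return [[1 - value for value in row] for row in mask]


def mean_iou(pred, truth):
    """Mean of lake-class IoU and background-class IoU (assumed MIoU)."""
    lake = pixel_iou(pred, truth)
    background = pixel_iou(invert(pred), invert(truth))
    return (lake + background) / 2


# Toy example: the model finds the lake but misses one boundary pixel.
truth_mask = [[0, 1, 1, 0],
              [0, 1, 1, 0],
              [0, 0, 0, 0]]
pred_mask = [[0, 1, 0, 0],
             [0, 1, 1, 0],
             [0, 0, 0, 0]]

lake_iou = pixel_iou(pred_mask, truth_mask)
miou = mean_iou(pred_mask, truth_mask)
print(f"lake IoU = {lake_iou:.3f}, MIoU = {miou:.3f}")
```

In this toy case the lake is counted as detected at the object level, yet the pixel-based lake IoU is only 3/4, which is the kind of incomplete boundary mapping the comment describes.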
L270ff: To me, it was not immediately clear, why the authors chose to evaluate the performance of the models across elevation gradients. In the discussion, it turns out, that the authors associate different elevations with different environmental conditions (particularly vegetation cover and prevalence of snow). I agree, that the elevation gradient is a good proxy to model changing environmental conditions. However, I’d like a short (half-) sentence about that also in the results around L270 to avoid confusion.
Tables 4, 5 and 6: Please add the F1 score as an additional column.
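For reference, the F1 score follows directly from the TP, FP and FN counts already reported, as in this minimal sketch. The counts used here are invented for illustration and are not values from the manuscript; the function name is hypothetical.

```python
def f1_from_counts(tp: int, fp: int, fn: int) -> float:
    """F1 score as the harmonic mean of precision and recall."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Invented example counts (NOT taken from Tables 4-6).
example_f1 = f1_from_counts(tp=80, fp=10, fn=20)
print(f"F1 = {example_f1:.4f}")
```

Note that TN does not enter the F1 score, which is why it complements an overall accuracy column rather than duplicating it.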
Figure 6: While I think that the examples demonstrated here show very well the strengths of the proposed approach, for the reader it is difficult to generalize these strengths from only two samples. Consider showing 2-3 other examples for (a) and (b), respectively, as an Annex/Supplementary material to the paper to bolster your claim.
Section 4.2: Several things need to be addressed in this discussion:
L419ff: As I understand it, the analysis provided here is supposed to demonstrate, how much more accurate the proposed lake classification approach is in comparison to drawing a 10 km buffer around a glaciated area and marking all lakes inside as “glacial” and all lakes outside as “non-glacial”. I see several issues with this approach:
However, I agree that the confusion of glacial and non-glacial lakes particularly in close proximity of glaciers needs to be addressed and evaluated! Figure 4 shows a plausible pattern of lake classifications across the region, but how robust is the proposed method specifically in regions where both types of lakes co-occur? I can imagine an accuracy assessment similar to that in Table 4 based on a subset of lakes in very close proximity to glaciers (e.g. a 1 km buffer around all glaciated areas as determined by the RGI). This would be something for the results section. The discussion then needs to pick up on these results and, if possible, compare the performance of the proposed method (regarding lake type classification) with the comparison datasets by Zhang (2024a,b).
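One way such a near-glacier accuracy assessment could be set up is sketched below. It assumes that each lake already carries a precomputed distance to the nearest glacier outline, a reference label and a predicted label; the `Lake` record, its field names and the sample lakes are hypothetical and only serve to show the subsetting and counting logic.

```python
from dataclasses import dataclass


@dataclass
class Lake:
    dist_to_glacier_km: float  # assumed: distance to nearest glacier outline
    true_type: str             # reference label: "glacial" or "non-glacial"
    pred_type: str             # label assigned by the classifier


def subset_confusion(lakes, max_dist_km=1.0, positive="glacial"):
    """Count TP, FP, TN, FN only for lakes within the glacier buffer."""
    counts = {"TP": 0, "FP": 0, "TN": 0, "FN": 0}
    for lake in lakes:
        if lake.dist_to_glacier_km > max_dist_km:
            continue  # outside the buffer, not part of this assessment
        predicted_pos = lake.pred_type == positive
        actual_pos = lake.true_type == positive
        if predicted_pos and actual_pos:
            counts["TP"] += 1
        elif predicted_pos and not actual_pos:
            counts["FP"] += 1
        elif not predicted_pos and not actual_pos:
            counts["TN"] += 1
        else:
            counts["FN"] += 1
    return counts


# Invented sample lakes for illustration only.
sample_lakes = [
    Lake(0.2, "glacial", "glacial"),
    Lake(0.8, "non-glacial", "glacial"),
    Lake(0.5, "non-glacial", "non-glacial"),
    Lake(0.9, "glacial", "non-glacial"),
    Lake(4.0, "glacial", "glacial"),  # excluded by the 1 km buffer
]

near_glacier_counts = subset_confusion(sample_lakes, max_dist_km=1.0)
print(near_glacier_counts)
```

The resulting counts can then be reported in the same form as Table 4, restricted to the zone where both lake types co-occur.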
Discussion section in general: Are there significant differences in computational effort between the compared DL models? If so, do the authors think the increase of accuracy is worth the additional effort?
Technical corrections
L15-16: Add commas to sentence to enhance readability
L17-18: Suggestion: “challenges for conventional identification methods”
L26: “provided a more accurate lake type classification”
L80: There is one space too many here
L86: Remove full stop before citation
L97: add “the” or “a” before ViT architecture
L164-165: The links provided refer to the data portals but not the datasets. Please provide links to the respective data catalogue entries of the platforms. If these links are too long, consider a scientific citation of the original data.
L215: Please provide direct links to the models and datasets instead of just to the platform.
L364: I wouldn’t call the use of various spectral bands and calculated indices thereof “multisource remote sensing data”, when all of the products come from a single system (Sentinel-2).
Fig 6: Format figure caption consistently.
Fig 7: Please add scales to all of the images, and some kind of indication, where the area is located (e.g. map inset or geographic coordinates).
L394f and Table 7: Inconsistency in the citation of the Zhang (2024) papers. Please either use (2024 a/b) OR G. Zhang/T. Zhang (whichever fits best to the journal’s preferred citation style).
L459: lower case o in “Overestimation”