This work is distributed under the Creative Commons Attribution 4.0 License.
AutoTerm: A "big data" repository of Greenland glacier termini delineated using deep learning
Abstract. Ice sheet marine margins, via their outlet glaciers, are susceptible to climate change and are expected to respond through retreat, steepening, and acceleration, although with significant spatial heterogeneity. However, research on ice-ocean interactions has continued to rely on decentralized, manual mapping of features at the ice-ocean interface, impeding progress in understanding the response of glaciers and ice sheets to climate change. The proliferation of remote sensing images lays the foundation for a better understanding of ice-ocean interactions and also necessitates the automation of terminus delineation. While deep learning (DL) techniques have already been applied to automate terminus delineation, none involve sufficient quality control and automation to enable DL applications to "big data" problems in glaciology. Here, we build on established methods to create a fully automated pipeline for terminus delineation that makes several advances over prior studies. First, we leverage existing manually picked terminus traces (16,440) as training data to significantly improve the generalization of the DL algorithm. Second, we employ a rigorous automated screening module to enhance the quality of the data product. Third, we perform thoroughly automated uncertainty quantification on the resulting data. Finally, we automate several steps in the pipeline, allowing data to be regularly delivered to public databases with increased frequency. The level of automation of our method ensures the sustainability of terminus data production. Altogether, these improvements produce the most complete, high-quality record of terminus data that exists for the Greenland Ice Sheet (GrIS). Our pipeline has successfully picked 278,239 termini for 295 glaciers in Greenland from Landsat-5, -7, and -8 and Sentinel-1 and -2 images spanning 1984 to 2021, with an average uncertainty of ~37 m. The high sampling frequency and controlled quality of our terminus data will enable better quantification of ice sheet change and model-based parameterizations of ice-ocean interactions.
Interactive discussion
Status: closed
RC1: 'Comment on egusphere-2022-1095', Anonymous Referee #1, 30 Nov 2022
General Comments
This paper presents an automated pipeline in Google Earth Engine for glacier terminus tracing, together with the derived dataset and updated ice/ocean masks. Such a pipeline is highly needed and of great significance to the community. This extent of automation has not been reached in related works. We thank the authors for their valuable contribution!
While this paper employs a sound deep learning architecture in combination with a promising screening module, I have several major concerns, including the technical correctness of the evaluation protocol and, thus, the validity of the proposed study, as the generalizability of the deep learning network still needs to be proven. Furthermore, comparisons to other studies need to be conducted in a technically correct way, and the reproducibility of the study needs to be ensured by making the assembled training dataset publicly available. Lastly, the structure of the manuscript should be improved upon.
Major Concern 1: Evaluation Protocol
The pipeline has not been properly tested, and hence, we cannot yet rely on its output. In my understanding, the authors seem to confuse uncertainty estimation with error assessment. In line 245, they call the calculation of the difference between prediction and ground truth "uncertainty quantification". The authors then claim that comparing to manually picked traces "requires significant manual effort" because it would have to be redone, as "network accuracy likely varies over time as glaciers experience different conditions". Instead, the authors use two different uncertainty quantifications that do not rely on ground truth data. Calculating uncertainties is definitely useful, and the two ways of calculating the effect of different sources of uncertainty (model inherent and input inherent) look very promising. However, calculating the uncertainty is no substitute for an error assessment. The authors themselves state in line 395: "if both duplicated traces are deviated from reality but are close to each other, the uncertainty would not represent the reality." It is, therefore, indispensable to calculate the deviation of the network's predictions from manually delineated ground-truth traces on a test set that is independent of the train set. First, we need to know how well the network performs now, before applying it to new, unseen data; afterward, we can assess whether the network's performance degrades when new sensors are used or other conditions change (called domain shift in machine learning).
Additionally, an experiment should be conducted to determine whether and by how much the error between prediction and ground truth on the test set is reduced when the screening module is applied versus not applied. In this way, the effectiveness of the screening module can be demonstrated. The same holds for the upsampling of small images (it is not sufficient to visualize the results of one sample, as shown in Fig. 13).
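To make the requested error assessment concrete, a minimal sketch (my own illustration, not the authors' code; `pred` and `ref` are hypothetical (N, 2) arrays of trace vertices in map coordinates) could look like this:

```python
from scipy.spatial import cKDTree

def mean_distance_error(pred, ref):
    """Symmetric mean distance (in map units) between two polylines,
    each sampled as an (N, 2) array of vertices."""
    d_pred = cKDTree(ref).query(pred)[0]  # each predicted vertex -> nearest reference vertex
    d_ref = cKDTree(pred).query(ref)[0]   # each reference vertex -> nearest predicted vertex
    return 0.5 * (d_pred.mean() + d_ref.mean())
```

Averaging this quantity over an independent, manually delineated test set would give the kind of mean distance error that related works report.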
Major Concern 2: Generalizability
The pipeline has to be tested on out-of-sample data (i.e., glaciers not present in the training dataset) and data outside of Greenland to show generalizability to the global scope.
- Line 451: "Owing to the transferability of deep learning, the entire pipeline has the potential to be applied to many other outlet glaciers around the world"
- Line 135: "converting the TermPicks terminus data into a training dataset suitable for deep learning highly generalizes the network"
These claims have to be proven on such a test set. As most manually annotated traces available from related work are part of TermPicks and, hence, have been used for training, another test set has to be used. For testing on SAR imagery, the dataset provided by Gourmelon et al. could, for example, be used, as it is not incorporated in TermPicks (except Jakobshavn, which probably has overlaps with TermPicks). However, test data for optical imagery might have to be created manually (e.g., from Antarctica or the Russian Arctic). At least, I am unaware of a dataset based on optical imagery that is not incorporated in TermPicks.
Major Concern 3: Comparability
It is not possible to compare the calculated uncertainties of this manuscript to the errors calculated in related works, as done in, e.g., line 304 or line 379. Two totally different metrics are compared here, and studies have been conducted on different datasets. For a valid comparison, the exact same network/pipeline needs to be tested on different datasets, or different networks/pipelines have to be trained, optimized, and tested on the exact same data (a so-called benchmark dataset). Altering both the dataset and the network/pipeline introduces too many changes, and a changed performance could result from either the different dataset (for example, the test set might be easier, and therefore, the performance of an otherwise worse performing network would be better on this test set) or the different network/pipeline.
In conclusion, the claimed improvements 1 and 2 (line 377: "1 increasing the generalization level of the deep learning network to enable more and better quality terminus predictions; 2 deploying size normalization to improve the accuracy of terminus delineation for small glaciers") are not proven.
One way to show the superior terminus prediction performance on SAR imagery could be the use of the benchmark dataset recently proposed by Gourmelon et al. (2022) (i.e., retraining the pipeline on the train set and evaluating it on the test set using the stated metrics). To the best of our knowledge, there is no equivalent benchmark dataset for optical imagery.
Major Concern 4: Reproducibility
Please make your complete assembled training data (including the satellite imagery) publicly available, as only in this way can the reproducibility of the results be guaranteed. Moreover, please also provide the manually created reference polygons for each glacier.
Major Concern 5: Structure of the manuscript
The structure of the manuscript needs improvement. There is a mix-up between the training and inference of the pipeline, and some information is given twice at different positions in the manuscript. It is hard to tell when the authors write about the newly derived dataset as opposed to the dataset derived from TermPicks for training the network; e.g., in line 295 ("We find an average success rate of 64%"), it is unclear on which dataset the success rate was calculated. I would suggest splitting the manuscript into two main parts as follows, but there could also be another, better split:
- Training Pipeline: manually delineated dataset creation (TermPicks + additional manual annotations), neural network (architecture), network training (train-validation-test split, learning rate, number of trained epochs, etc.), screening module, error calculation, uncertainty estimation
- Inference Pipeline: new data acquisition + pre-processing, uncertainty estimation on this newly derived dataset, ice/ocean mask updates
Moreover, the paragraph spanning lines 188 to 195 should be moved to the limitations.
In section 4.1, the authors first introduce the 'success rate', which should, however, be introduced in the methods section.
The explanation of the two uncertainty measures given in lines 319 to 323 should be moved further toward the beginning of the manuscript.
Major Comments:
- It is unclear to me whether the name "AutoTerm" refers to the automated pipeline, the derived dataset, or both.
- The title of the manuscript does not mention the automated pipeline, which is, in my humble opinion, the most significant contribution. Hence, I'd argue for a more suitable title, e.g., AutoTerm: an automated Google Earth Engine pipeline for glacier terminus extraction and a "big data" repository of Greenland glacier termini.
- It needs to be clarified whether the region of interest that has to be defined for each new glacier has to be a polygon like in figure 2 or whether it can simply be a bounding box.
- Line 163 onwards: "This allows glaciers with various natural sizes to have a similar image size in computer vision, which largely decreases the complexity of delineating glacier terminus." This statement (the second part of it) needs more explanation or a reference.
- The normalization of image sizes is not clear to me. Small images are upsampled, but large images are not downsampled. Hence, do they still have different sizes? I would not call this normalization, then. Moreover, the authors extract patches afterward, so the input size is always equal anyway. Additionally, showing only one figure with an improvement for one trace is not sufficient evidence that this upsampling generally improves the delineation performance. Please show the improvement in numbers over a complete, independent test set (refer to major concerns 1 and 2).
- Section 3.3:
- "encoder-decoder structure [...] can obtain sharp object boundaries": Actually, an encoder-decoder structure without skip connections would most probably not recover any details and, therefore, no sharp object boundaries. In Chen et al. (2018), a more sophisticated method is used to obtain the sharp boundaries: "A fully connected CRF [conditional random field] is then applied to refine the segmentation result and better capture the object boundaries."
- "atrous convolution [...] senses multi-scale contextual information": It is not the atrous convolutions alone that make recognition of multi-scale contextual information possible, but their combination in ASPP (see the sketch after this list). "Atrous convolution allows us to explicitly control the resolution at which feature responses are computed within Deep Convolutional Neural Networks. It also allows us to effectively enlarge the field of view of filters to incorporate larger context without increasing the number of parameters or the amount of computation. Second, we propose atrous spatial pyramid pooling (ASPP) to robustly segment objects at multiple scales." (Chen et al., 2018)
- "multi-scale contextual information [...] [is] helpful for our task since [...] we integrate remote sensing datasets with different spatial resolutions": Multi-scale refers to how many pixels a neuron is able to see (the effective receptive field), not how many square meters one pixel covers. Hence, multi-scale contextual information helps when the calving front covers many versus only a few pixels. Thus, it helps only indirectly with the different spatial resolutions of the dataset.
- "This network has been proven to have large learning capability, spatial transferability [...]": These are quite big claims based on a train set of two glaciers and a test set of one glacier, all of which are located in Greenland (Zhang et al., 2021).
- "The network is trained with a learning rate of 0.005 [...] as recommended by (Zhang et al., 2021)": The optimal learning rate for training is highly dependent on the dataset as well as on the batch size (not just the model). Hence, the learning rate has to be treated as a hyperparameter, which has to be optimized on a validation set (not the test set). A sub-optimal learning rate can lead to significantly longer training times until convergence, or to no convergence at all.
- "we choose the largest batch size (16)": This should be "largest possible batch size (16) on an A100 GPU with 40/80 GB GPU memory". Please specify whether your A100s have 40 or 80 GB of GPU memory.
- What exactly is meant by "maximize our computational power" in line 204?
- "The network training takes about a week": This is quite long and might be due to a sub-optimal learning rate. Please specify not only the training time but also the number of training iterations over the complete augmented dataset. Also, specify your train-test split and your evaluation metrics (refer to major concern 1). Moreover, did you use an early stopping criterion? You might have overfitted during this long training time.
- Line 210, "do not have any quality control": At least Cheng et al. have manual control. So, maybe rephrase it to "do not have any automated quality control".
- Line 215 onwards: Please mention that the screening builds on top of existing works here (Zhang et al. 2021 – terminus curvature screening, Baumhoer et al. 2019 – time series outliers, Gourmelon et al. 2022 – removal of too short termini predictions), but goes one step further, i.e., doesn’t use any manual intervention or prior knowledge of the data.
- Line 217: "Terminus length is determined by the sum of the piece-wise length along an individual terminus trace". Please explain in more detail. This, at least for me, is hard to understand.
- Line 218: "Terminus curvature is computed between two adjacent points for each point along the terminus and then an average is taken for each terminus trace." This is also not completely clear to me. I think an equation would help; one plausible formulation of both quantities is sketched after this list.
- Line 224, "percentile of the data range": Do you refer to the data range of the generated training data? Is this computed per glacier? Per satellite? The validity of these thresholds needs to be checked on an independent test set (refer to major concerns 1 and 2).
- Line 227: "For outliers in terminus length, we remove both the lower and upper thresholds (Eqns. 1 and 2) because we do not anticipate large changes in terminus length in either direction (bigger or smaller)." As far as I understood, these thresholds were calculated on data for Greenland (my reading of Eqns. 1 and 2 as standard inter-quartile fences is sketched after this list). Hence, the optimal thresholds for, e.g., Antarctica might deviate completely from the ones calculated for Greenland. This might hinder the global applicability of the pipeline (refer to major concerns 1 and 2). This should be added to the limitations.
- Line 235, "We then repeat this screening procedure ten times to maintain the quality of the terminus product": Which screening procedure exactly is meant here? All three, or just the one with large areas? And does the outcome change when the screening procedure is repeated several times? If yes, please explain why.
- Line 245, "Traditional uncertainty quantification for glacier terminus position is conducted by calculating the difference between manually picked termini and automatically-picked termini.": This is not uncertainty quantification but an error assessment (see major concern 1).
- Line 262, "instead of quantifying the uncertainties of terminus traces, [Hartmann et al.] use the multiple inferences of MC dropout as extra information to retrain the network.": This is not quite correct. Hartmann et al. use the model uncertainty on one specific input as additional information for a second network with dropout. This second network then again outputs several predictions from which uncertainties could be calculated - but instead, to make it more robust, the predictions are averaged to eliminate this uncertainty.
- Line 267, "To strike a balance between computational cost and the reliability of the MC dropout, we randomly chose ten images from all the sensors and make three inferences for each of them": This is not quite clear to me. Are ten images of each sensor taken? "in total each glacier will have two measures of uncertainty" - So, also ten images of each glacier? (A generic sketch of MC-dropout inference is given after this list.)
- The results do, at some points, not validate the conclusions. No correlation was calculated (or it was not stated in the manuscript), and even a correlation would not necessarily imply causality. Please rephrase the following conclusions as hypotheses:
- Line 307: "glaciers with less training data will have larger uncertainties and lower success rates"
- Line 309: "since they have the highest spatial resolution"
- Line 310 onwards: "The reasons for the Landsat-5 uncertainty are twofold [...]"
- Line 314: "The higher uncertainty of Sentinel-1 images is due to its low image quality, coarse resolution, and the lower volume of training data derived from this sensor."
- Line 325, "the uncertainty from duplicate traces is more representative of Landsat-7 and Sentinel-2 than other datasets": Is it not only representative of these two datasets, as it was only calculated for these?
- Line 308, "Among the five datasets used, Landsat-8 and Sentinel-2 have the lowest average uncertainties": Please give the exact numbers here. A table showing the different values for the different data subsets would be good.
- Line 355, "The metadata contains the date in YYYY-MM-DD, Glacier ID, source image satellite, and the uncertainty of each trace by averaging the two types of uncertainties provided": I thought the uncertainties were not available for every single trace, as they were only calculated for some of them due to computational limitations? Please clarify.
- Line 423, "additional training data will be required to improve the data quality": or an improved network/pipeline.
- Line 434, "The pipeline can alert us of its failure based on the success rate within the screening module.": Given your limitation that the screening module might not provide valid results for glaciers with few training examples, this alert might not trigger.
- Please revise the color scheme of your figures, as red and green should not appear in the same plot (https://www.the-cryosphere.net/submission.html#figurestables).
- Figure 9: Is the number on the bottom left the average? Moreover, it would be good to state between which sensors the duplicates were calculated in the description of the figure.
- Figure S5: Visually, this does not appear to be a linear relationship. Have you done a correlation test?
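For reference on the ASPP point above (Section 3.3, second bullet), a minimal sketch of an ASPP block in the spirit of Chen et al. (2018) follows (illustrative only, not the network used in the manuscript):

```python
import torch
import torch.nn as nn

class ASPP(nn.Module):
    """Atrous spatial pyramid pooling: parallel dilated convolutions with
    different rates see different effective receptive fields; concatenating
    and projecting the branches captures multi-scale context."""
    def __init__(self, in_ch, out_ch, rates=(1, 6, 12, 18)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=r, dilation=r)
            for r in rates
        )
        self.project = nn.Conv2d(out_ch * len(rates), out_ch, kernel_size=1)

    def forward(self, x):
        return self.project(torch.cat([b(x) for b in self.branches], dim=1))
```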
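On the terminus length and curvature bullets (lines 217-218), one plausible formulation for a trace with vertices p_1, ..., p_n (an assumption on my part, not necessarily the authors' definitions) would be:

```latex
L = \sum_{i=1}^{n-1} \lVert p_{i+1} - p_i \rVert_2, \qquad
\bar{\kappa} = \frac{1}{n-2} \sum_{i=2}^{n-1}
  \arccos\!\frac{(p_i - p_{i-1}) \cdot (p_{i+1} - p_i)}
                {\lVert p_i - p_{i-1} \rVert_2 \, \lVert p_{i+1} - p_i \rVert_2}
```

i.e., the summed segment lengths and the mean turning angle at the interior vertices; an explicit statement of this kind in the methods would resolve the ambiguity.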
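Similarly, for the threshold bullets (lines 220-227), my reading is that Eqns. 1 and 2 are the standard inter-quartile fences (again an assumption the authors should confirm):

```latex
T_L = Q_1 - 1.5\,(Q_3 - Q_1), \qquad T_U = Q_3 + 1.5\,(Q_3 - Q_1)
```

where Q_1 and Q_3 are the first and third quartiles of the screened metric.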
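And on the MC-dropout bullet (line 267), a generic sketch of the inference procedure (assuming a PyTorch-style model; not the authors' code):

```python
import torch

def mc_dropout_inference(model, image, n_passes=3):
    """Keep dropout active at test time and run several stochastic forward
    passes; the spread across passes approximates model-inherent uncertainty."""
    model.eval()
    for m in model.modules():
        if isinstance(m, torch.nn.Dropout):  # re-enable only the dropout layers
            m.train()
    with torch.no_grad():
        preds = torch.stack([model(image) for _ in range(n_passes)])
    return preds.mean(dim=0), preds.std(dim=0)  # prediction and per-pixel spread
```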
Specific Comments:
- Line 56 onwards: Heidler et al. 2022 (Deep Active Contour Models for Delineating Glacier Calving Fronts), Loebel et al. 2022 (Extracting glacier calving fronts by deep learning: the benefit of multi-spectral, topographic and textural input features), Gourmelon et al. 2022 (Calving fronts and where to find them: a benchmark dataset and methodology for automatic glacier calving front extraction from synthetic aperture radar imagery), and Davari et al. 2022 (Pixelwise Distance Regression for Glacier Calving Front Detection and Segmentation) are missing.
- Line 188: "Although TermPicks covers a range of conditions and brings great diversity to the training set, additional training data would improve the accuracy of the network in difficult situations." Please rephrase more cautiously (e.g., "... would presumably improve ..."), as you have no hard evidence that further training data would really improve the accuracy in this situation.
- Line 205: GPU -> GPUs
- Line 220, "With these three metrics, we calculate the lower (T_L) and upper thresholds (T_U) for each based on the inter-quartile range": The sentence structure is hard to follow. So, you compute the thresholds for each individual criterion?
- Line 417, "120 GB of GPU memory": I guess you mean 120 GB of RAM? There are only 40 GB and 80 GB A100 versions as far as I know, and 4 (= number of GPUs) times 40 GB is already 160 GB.
- Line 444: Remove the word "fully", as you still have some manual steps, like defining the region of interest.
- Table 1 includes abbreviations that were not introduced.
- Figure S2: Please name the conditions in the figure's description as well, referencing (a) to (e).
Citation: https://doi.org/10.5194/egusphere-2022-1095-RC1
AC1: 'Reply on RC1', Enze Zhang, 01 Feb 2023
RC2: 'Comment on egusphere-2022-1095', Anonymous Referee #2, 04 Dec 2022
General Comments:
Presented in this manuscript is an automated data processing pipeline for extracting glacier termini positions, and the associated dataset, which spans 295 Greenlandic glaciers over the period 1984-2021. The dataset consists of 278,239 glacier termini and includes ice/ocean masks for the years 2018-2020. The pipeline consists of a Google Earth Engine-based downloader combined with a deep neural network that extracts termini locations from the subsetted and preprocessed satellite imagery. The literature review covers most of the existing work in the field. The deep learning methodology also incorporates the greatest diversity of sensors (Landsat 5-8, Sentinel-1 & -2) and sensor types (both optical and SAR), which is a novel development. The methodology is quality controlled by assessing its performance on two uncertainty quantification metrics.
In summary, the study represents a significant contribution to the cryosphere and scientific community by providing a new glacial termini dataset for Greenland and an automated deep learning-based pipeline for glacial feature extraction. However, certain comments regarding the dataset and the manuscript need to be addressed before acceptance, at the editor's discretion, as detailed below.
Major Comments:
- A primary concern is the lack of certain validation metrics that are commonly used in works such as this. Previous studies use the same established validation metrics (average area/distance between predicted and observed termini, or mean distance error) to ensure ease of comparison. This measure is used in existing works such as Mohajerani et al. (2019), Baumhoer et al. (2019), Cheng et al. (2021), Heidler et al. (2021), Gourmelon et al. (2022), Loebel et al. (2022), and specifically Zhang et al. (2019, 2021). The average uncertainty of 37 m, which is calculated using the average distance between duplicate picks from Landsat-8 and Sentinel-2, is somewhat misleading given this context, and the lack of such a mean distance error calculation with respect to the ground truth should be addressed. Use of existing validation sets (Cheng et al. (2021), TermPicks/Goliber et al. (2022), and specifically Gourmelon et al. (2022)) would be advisable, as this would allow a fair comparison of this method with existing studies on established measures.
- A related concern is the bias inherent in the chosen validation metrics. One validation metric (the average distance between duplicate picks from Landsat-8 and Sentinel-2) is biased towards lower/better values, since it is only calculated on higher-resolution images and does not measure the method's performance with respect to the manually delineated observations that function as the ground truth. Furthermore, this uncertainty quantification cannot be calculated across the entire dataset, so its use as a metric to gauge the quality of the dataset is questionable.
- The data itself has a few issues that require reevaluation of the automated screening module. Within the provided dataset, there are fronts that are closed loops, make large spatio-temporal jumps, or are otherwise erroneous. Additionally, a non-negligible number of glaciers have termini that are cut off by the boundaries of the ROI, which should be expanded and/or otherwise addressed.
- While the primary contributions of this study are the data processing pipeline and dataset, there is value in providing some analysis of the results, such as commenting on the general/regional area change trends (as shown for individual glaciers in the supplement, and to a degree in Figure 6), volume loss (when integrated with velocity datasets, though this may be out of scope), or correlations with temperatures/other measurements.
- The integration of figures in the manuscript could be better handled. Specifically, few figures are referenced within the manuscript (6, 8, 9, and 10 being the exceptions).
- It would be in the best interests of the community for the TermPicks derived training data to be released for ease of use for future projects.
- The training and pre-/postprocessing of the network can be elaborated upon. The learning rate/regularization factors are less important/useful than information such as the optimizer used, the number of epochs trained, the total number of images trained on, the loss function used, the vectorization algorithm, and the data augmentations used (if no data augmentations were used, why not, and if so, what were they).
Specific Comments:
P2 L58: I would recommend adding Gourmelon et al. (2022) and Loebel et al. (2022) to this list.
P3 L70-71, P7 L210: There are automated verification steps in Cheng et al. (2021), which include filtering out unconfident predictions from the DL classifier.
P8 L225: Could a detail/edge-preserving speckle filter be applied? Or other types of Sentinel-1 processing steps to reduce speckle noise? (One example is sketched after this list.)
P11 L341: Is there a limitation (such as spatial coverage gaps) restricting ice mask generation to 2018-2020, or could they be made for other years?
P21 Figure 1: The flowchart is not straightforward to follow. Perhaps consider separating the training/inference flowcharts or organizing it in a more linear fashion.
P26 Figure 6: The color of the uncertainty bars and your results are the same (both are black). This makes the figure hard to interpret. Additionally, consider using colorblind friendly color schemes.
P31 Figure 11: Are the uncertainty bars for all of GID164’s picks the same size?
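On the speckle-filter question (P8 L225), a basic Lee filter is one edge-preserving option; the sketch below is illustrative only, not a tested recommendation for this pipeline:

```python
import numpy as np
from scipy.ndimage import uniform_filter

def lee_filter(img, size=7):
    """Classic Lee speckle filter: strong smoothing in homogeneous regions,
    little smoothing where the local variance (e.g., at edges) is high."""
    local_mean = uniform_filter(img, size)
    local_sq_mean = uniform_filter(img * img, size)
    local_var = local_sq_mean - local_mean**2
    noise_var = np.mean(local_var)                # crude global noise estimate
    weight = local_var / (local_var + noise_var)  # ~0 in flat areas, ~1 at edges
    return local_mean + weight * (img - local_mean)
```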
Citation: https://doi.org/10.5194/egusphere-2022-1095-RC2
AC2: 'Reply on RC2', Enze Zhang, 01 Feb 2023