Clustering has a meaning: optimization of angular similarity to detect 3D geometric anomalies in geological terrains

Michalak, Michał; Teper, Lesław; Wellmann, Florian; Żaba, Jerzy; Gaidzik, Krzysztof; Kostur, Marcin; Maystrenko, Yuriy; Leonowicz, Paulina

doi:https://doi.org/10.5194/egusphere-2022-633

Preprints

22 Jul 2022

Abstract. The geological potential of sparse subsurface data is not being fully exploited because the available workflows are not specifically designed to detect and interpret 3D geometric anomalies hidden in the data. We develop a new unsupervised machine learning framework to cluster and analyze the spatial distribution of orientations sampled throughout a geological interface. Our method employs Delaunay triangulation and clustering with the squared Euclidean distance to cluster local unit orientation (attitude) vectors, which results in minimizing the within-cluster cosine distance. We performed the clustering on two representations of the triangles: normal and dip vectors. The classes resulting from clustering were attached to the geometric centre of each triangle (irregular version). We also developed a regular version of spatial clustering, which makes it possible to determine whether points from a grid structure are affected by anomalies. To illustrate the usefulness of combining cosine distance as a dissimilarity metric with the two cartographic versions, we analyzed subsurface data documenting two horizons: (1) the bottom Jurassic surface from the Central European Basin System (CEBS) and (2) an interface between Middle Jurassic units within the Kraków-Silesian Homocline (KSH), which is a part of the CEBS. The empirical results suggest that clustering normal vectors may result in nearly collinear cluster centers and boundaries between clusters of similar trend, thus pointing to the axis of a potential megafold. Clustering dip vectors, on the other hand, resulted in nearly co-circular cluster centers, thus pointing to a potential megacone. We also show that the linear arrangements of the anomalies, their topological relationships and internal structure can provide insights regarding the internal structure of the singularity, e.g. whether it may be due to drilling a nonvertical fault plane or due to a wider deformation zone composed of many smaller faults.
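
The abstract's statement that clustering unit vectors with the squared Euclidean distance minimizes the within-cluster cosine distance rests on the identity ‖u − v‖² = 2(1 − cos θ) for unit vectors u and v. The following is a minimal Python sketch of that identity, not the authors' implementation. The helper names (`unit_normal`, `dip_vector`), the two sample triangles, and the axis convention (x east, y north, z up) are all assumptions made for illustration, and degenerate (zero-area) triangles are not handled.

```python
import math

def unit_normal(p, q, r):
    """Upward-pointing unit normal of the triangle (p, q, r)."""
    ux, uy, uz = (q[i] - p[i] for i in range(3))
    vx, vy, vz = (r[i] - p[i] for i in range(3))
    n = (uy * vz - uz * vy, uz * vx - ux * vz, ux * vy - uy * vx)  # cross product
    length = math.sqrt(sum(c * c for c in n))
    n = tuple(c / length for c in n)
    return n if n[2] >= 0 else tuple(-c for c in n)

def dip_vector(n):
    """Unit vector of steepest descent in the plane with upward unit normal n."""
    nx, ny, nz = n
    h = math.hypot(nx, ny)  # sine of the dip angle
    if h == 0.0:
        raise ValueError("a horizontal plane has no dip direction")
    return (nx * nz / h, ny * nz / h, -h)

def sq_euclidean(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

def cosine_distance(u, v):
    # u and v are assumed to be unit vectors, so the dot product is cos(theta)
    return 1.0 - sum(a * b for a, b in zip(u, v))

# Hypothetical triangles: one dipping gently east, one dipping gently north
t1 = ((0.0, 0.0, 0.0), (1.0, 0.0, -0.1), (0.0, 1.0, 0.0))
t2 = ((0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, -0.2))
n1, n2 = unit_normal(*t1), unit_normal(*t2)
d1, d2 = dip_vector(n1), dip_vector(n2)

# For unit vectors the two dissimilarities differ only by a factor of 2
print(sq_euclidean(n1, n2), 2 * cosine_distance(n1, n2))
print(sq_euclidean(d1, d2), 2 * cosine_distance(d1, d2))
```

Because the two quantities are proportional, any partition that minimizes the total squared Euclidean distance of unit vectors also minimizes the corresponding total cosine distance between those vectors.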

Received: 12 Jul 2022 – Discussion started: 22 Jul 2022

Download & links

Preprint (PDF, 5726 KB)

Notice on discussion status
The requested preprint has a corresponding peer-reviewed final revised paper. You are encouraged to refer to the final revised version.

Journal article(s) based on this preprint

09 Nov 2022

Clustering has a meaning: optimization of angular similarity to detect 3D geometric anomalies in geological terrains

Michał P. Michalak, Lesław Teper, Florian Wellmann, Jerzy Żaba, Krzysztof Gaidzik, Marcin Kostur, Yuriy P. Maystrenko, and Paulina Leonowicz

Solid Earth, 13, 1697–1720, https://doi.org/10.5194/se-13-1697-2022, 2022

Interactive discussion

Status: closed

RC1:
'Comment on egusphere-2022-633', Guillaume Duclaux, 24 Aug 2022

Review of "Clustering has a meaning: optimization of angular similarity to detect 3D geometric anomalies in geological terrains" by Michał Michalak, Lesław Teper, Florian Wellmann, Jerzy Żaba, Krzysztof Gaidzik, Marcin Kostur, Yuriy Maystrenko, and Paulina Leonowicz.

This manuscript introduces a new workflow for dealing with geological-surface mapping using sparse subsurface data. In particular, this work develops and investigates two new features for geological mapping using unsupervised machine learning: 1) the role of structural data representations (as normal and dip vectors) on clustering results, and 2) the characterisation of Voronoi diagrams to explain the meaning of the boundaries between obtained clusters. The potential of these two methods is illustrated through applications to a couple of examples, focusing at the very large scale on clustering regular data for the bottom Jurassic surface of the Central European Basin System, and at a smaller scale on clustering irregular data for a Middle Jurassic interface within the Kraków-Silesian Homocline in South-Central Poland.

Now, I am definitely not an expert in either machine learning or clustering methods... so I've reviewed this manuscript from the perspective of a structural geologist to whom such methods could be very useful for interpreting subsurface geology and structures.

Overall, the manuscript is well written and organised, and seems well suited for the EGUsphere readership. The application of the unsupervised clustering method is presented, tested and analysed for different k-means and different vector representations in numerous figures (that still require some editing and clarifications). Limitations are appropriately discussed, which keeps the contribution very honest. Such a new machine-learning approach will potentially provide opportunities for geologists to (re)interpret subsurface structures in regions with either available geological surfaces or dense borehole coverage. Based on my review, as a structural geologist, I would recommend accepting this manuscript after moderate revisions of the figures and minor revisions of the text.

I present below a few key points for which I have some questions/concerns followed by a list of minor comments.

1) Choice of the optimum number of clusters: I have some trouble understanding the k-means choices based on the elbow method the authors have employed to determine the optimal number of clusters in their case studies... First, I wonder whether Figure 7 is flawed? Why are the y-axis values so variable between the normal and dip representations for a single dataset? Shouldn't the numbers be similar between a and b (CEBS), and c and d (KSH)? Each structural datum provides both dip and normal values; for CEBS there should be 236380 data points. I might be missing something related to the y-axis of each figure: what is "tot_withinss"? Why can Figure 7B have either 2 or 4 as the optimum number of clusters (Lines 285-286)?

The number of clusters will be very important for determining the data clustering pattern based on the cluster centers analysis, so I reckon this section should be strengthened.

2) Stereographic representations: This applies to Figures 2c, 8, 9, 10, 11, 12, 13 and 14.

- In Figure 2c it isn't clear which hemisphere is displayed for the data, and we don't know where North is located in Figures 2a and 2b.

- Figs. 8 to 14: Lower- and upper-hemisphere half globes are shown next to the stereonets in all these figures. For the lower hemisphere the whole stereonet is shown, but not for the upper hemisphere. What is the grid spacing on these stereonets? I may have missed it, but it isn't clear to me what projection is being used (I suppose an azimuthal polar stereographic projection, is that correct?); structural geologists would classically use equal-area or equal-angle stereonets. Please clarify this in the caption of Figure 8, where the stereonets first appear.

3) Direct screenshots from ParaView are hard to read. I'm thinking of Figures 1, 2 and mostly 4 and 6. The scales are not always meaningful, or are hard to read. For example, in Figure 1 the color scale legend is "scalars"; I guess "elevation" would be more adequate. The bounding-box scale units in Figure 2 are not readable either and are totally missing in Figures 4 and 6. I believe redrafting Figure 4 would massively improve its readability. The moiré pattern visible in Figure b and c is just terrible. Units should also be added to the scale bars, and finally the scale bars' min and max values should be written in the same notation as the rest of the values (-7.2e+03 and 1.6e+03 should be -7200 and 1600; 1.6e-04 should be 0; 6.1e+01 should be 61; 2.2e-03 and 3.6e+02 should be 0 and 360). The same goes for Figure 6.

Minor comments:

+ Equation 2, page 4: why is each line of the development of Eq. (2) given a different number? This is the same equation and as such should only be referred to as Eq. (2).

+ Line 144: please remove the second "a" in this line: "[...] whether a specific 2D point p [...]"

+ Line 234: "en échelon" is missing its accent.

+ Line 240: Genus and species for Strenocera subfurcatum should be written in italic font.

+ Line 272: Please revise the reference for the Anon borehole database citation. I understand it is not published, though.

+ Line 350-351: Theorem 1 --> Do you mean Eq 1?

+ Line 373-374: How would you differentiate between a graben structure and an "antithetic shear with hanging walls dipping against the main fault"?

+ Line 440: missing s in "this result suggests"

+ Figures and captions in general: the figures use lower case letter (a, b, c...) while in the captions and the text upper case letters are used (A, B, C...). Please harmonise between the figures and the text/captions.

Nice, 24/08/2022

Guillaume Duclaux

Citation: https://doi.org/10.5194/egusphere-2022-633-RC1
- AC1: 'Reply on RC1', Michal Michalak, 16 Sep 2022
  
  See the attached pdf.
  
  Citation: https://doi.org/10.5194/egusphere-2022-633-AC1
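
On the referee's point 1, `tot_withinss` presumably denotes the total within-cluster sum of squares that the elbow method plots against the number of clusters (R's `kmeans` reports this quantity as `tot.withinss`). The Python sketch below is an illustration under stated assumptions, not a reconstruction of the authors' Figure 7: it uses hypothetical synthetic attitudes (a 5° dip with dip directions every 30°), an assumed axis convention (x east, y north, z up), and a single cluster (k = 1). It only shows that the same set of planes can yield sums of squares of very different magnitude in the two representations, because gently dipping normals crowd around the vertical while the corresponding dip vectors spread around the horizontal circle.

```python
import math

def normal_from_attitude(dip_deg, dip_dir_deg):
    """Upward unit normal from dip angle and dip direction (azimuth from north)."""
    d, a = math.radians(dip_deg), math.radians(dip_dir_deg)
    return (math.sin(d) * math.sin(a), math.sin(d) * math.cos(a), math.cos(d))

def dip_from_attitude(dip_deg, dip_dir_deg):
    """Downward unit dip vector for the same plane."""
    d, a = math.radians(dip_deg), math.radians(dip_dir_deg)
    return (math.cos(d) * math.sin(a), math.cos(d) * math.cos(a), -math.sin(d))

def within_ss(vectors):
    """Sum of squared Euclidean distances to the centroid (one cluster, k = 1)."""
    m = len(vectors)
    centroid = [sum(v[i] for v in vectors) / m for i in range(3)]
    return sum(sum((v[i] - centroid[i]) ** 2 for i in range(3)) for v in vectors)

# Hypothetical data: twelve planes dipping 5 degrees in evenly spaced directions
attitudes = [(5.0, float(az)) for az in range(0, 360, 30)]
ss_normal = within_ss([normal_from_attitude(*a) for a in attitudes])
ss_dip = within_ss([dip_from_attitude(*a) for a in attitudes])
print(ss_normal, ss_dip)
```

In this synthetic case the sum of squares for the dip vectors is more than a hundred times larger than for the normals, so differing y-axis ranges between the two representations are not, by themselves, evidence of an error.
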
RC2:
'Comment on egusphere-2022-633', Thomas Blenkinsop, 30 Aug 2022

Clustering has a meaning: optimization of angular similarity to detect 3D geometric anomalies in geological terrains

This is a difficult paper to read, because it contains a lot of jargon about geometry, and because of vague general statements, some of which are unnecessary (e.g. “dip angle is not capable of showing the dip direction of faults and vice-versa” and “Geology is considered to be a subjective science (Curtis, 2012)”). A further problem for understanding the paper is that some of the methods section is couched in the technical language of the CGAL library. This is unhelpful to the general reader, and needs to be explained in simple terms.

One of the main conclusions, that applying clustering methods to normal vectors and dip direction vectors from the same data set results in different interpretations of the structure (Fig. 15), seems unlikely to be correct. There is no material difference between the geometrical significance and information contained in a normal vector compared to a dip direction vector. If there is a difference in the outcome of the clustering methods, that must be an artefact of the way the methods have been applied to each data set.

Another main conclusion is that optimisation methods must be applied to investigate clustering. This is relatively trivial: any clustering algorithm requires a similarity index, and the one used here (cosine distance) is a standard metric for assessing orientation differences. Further to the previous point, this metric should not result in significant differences between normal and dip direction vectors, because the cosine distance between two normal vectors must be the same as the cosine distance between the two dip vectors of the same surfaces.

There is some discussion about anomalous results:

“The above effect could be explained by several competitive hypotheses. For example, the fault plane could have been drilled, thus broadening the zone of triangles genetically related to the fault (Michalak et al., 2021). Assuming the tectonic origin of the related structures, it can be hypothesized that fault drags on the hanging wall contribute to subsidiary elevation differences that must be consumed by nearby triangles. It could also be argued that an unusual lowering of the contact surface is due to a deformation zone composed of many smaller faults. Another hypothesis could be that the related feature is not a fault but rather a sedimentary slope, which would explain the gradual lowering of the contact surface.”

Such hypotheses are useful, but would be better illustrated with specific examples and some reasoning about which is the preferred hypothesis.

The determination of the optimum number of clusters is explained in Figure 7, but the results section shows results for 2, 3 and 4 clusters. This is unnecessary: only the optimum results should be shown.

The figures could be substantially improved. The use of such a dark background does not help (e.g. Fig. 6c). In most cases the grid is the most dominant and least important aspect of the maps, obscuring the detail of the clustering. The stereoplots are not explained in the figure captions.

Citation: https://doi.org/10.5194/egusphere-2022-633-RC2
- AC2: 'Reply on RC2', Michal Michalak, 16 Sep 2022
  
  See the attached pdf.
  
  Citation: https://doi.org/10.5194/egusphere-2022-633-AC2
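
Whether the cosine distance between two normal vectors equals the cosine distance between the corresponding dip vectors can be checked on a small example. The Python sketch below is an illustration under stated assumptions, not part of the manuscript or the review: the helper `normal_and_dip` is hypothetical, the axis convention is x east, y north, z up, and both representations are treated as directed unit vectors (upward normals, downward dip vectors). It compares two planes that both dip 10°, one to the east and one to the west.

```python
import math

def normal_and_dip(dip_deg, dip_dir_deg):
    """Upward unit normal and downward unit dip vector of a plane."""
    d, a = math.radians(dip_deg), math.radians(dip_dir_deg)
    normal = (math.sin(d) * math.sin(a), math.sin(d) * math.cos(a), math.cos(d))
    dip = (math.cos(d) * math.sin(a), math.cos(d) * math.cos(a), -math.sin(d))
    return normal, dip

def cosine_distance(u, v):
    # u and v are unit vectors, so the dot product equals cos(theta)
    return 1.0 - sum(a * b for a, b in zip(u, v))

n_east, d_east = normal_and_dip(10.0, 90.0)   # dips 10 degrees to the east
n_west, d_west = normal_and_dip(10.0, 270.0)  # dips 10 degrees to the west

cd_normals = cosine_distance(n_east, n_west)  # normals are 20 degrees apart
cd_dips = cosine_distance(d_east, d_west)     # dip vectors are 160 degrees apart
print(round(cd_normals, 4), round(cd_dips, 4))  # 0.0603 1.9397
```

Under these conventions the two normals are nearly parallel while the two dip vectors are nearly opposite, so the two representations carry the same information about each plane but need not produce the same pairwise distances. The sketch addresses only this geometric point and says nothing about the interpretation of the manuscript's figures.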

Peer review completion

AR: Author's response | RR: Referee report | ED: Editor decision | EF: Editorial file upload

AR by Michal Michalak on behalf of the Authors (04 Oct 2022): Author's response, Author's tracked changes, Manuscript

ED: Publish subject to technical corrections (10 Oct 2022) by David Healy

ED: Publish subject to technical corrections (10 Oct 2022) by Federico Rossetti (Executive editor)

AR by Michal Michalak on behalf of the Authors (12 Oct 2022) Manuscript

Model code and software

GeoAnomalia Michał Michalak https://github.com/michalmichalak997/Triangulation_2/blob/master/README.md

Viewed

Total article views: 520 (including HTML, PDF, and XML)

HTML	PDF	XML	Total	BibTeX	EndNote
401	103	16	520	5	4

Views and downloads (calculated since 22 Jul 2022)

Month	HTML	PDF	XML	Total
Jul 2022	92	31	6	129
Aug 2022	114	29	5	148
Sep 2022	99	22	4	125
Oct 2022	81	16	1	98
Nov 2022	15	5	0	20
Dec 2022	0
Jan 2023	0
Feb 2023	0
Mar 2023	0
Apr 2023	0
May 2023	0
Jun 2023	0
Jul 2023	0
Aug 2023	0
Sep 2023	0
Oct 2023	0
Nov 2023	0
Dec 2023	0
Jan 2024	0

Viewed (geographical distribution)

Total article views: 472 (including HTML, PDF, and XML), of which 472 have geography defined and 0 have unknown origin.

Latest update: 10 Jan 2024

Short summary

When characterizing geological/geophysical surfaces, various geometric attributes are calculated, such as dip angle (1D) or dip direction (2D). However, the boundaries between specific values may be subjective and lack optimization significance when they result from default color palettes. This study proposes minimizing the cosine distance among within-cluster observations to detect 3D anomalies. Our results suggest that the method holds promise for the identification of megafolds or megacones.