Clustering has a meaning: optimization of angular similarity to detect 3D geometric anomalies in geological terrains
 ^{1}Institute of Earth Sciences, Faculty of Natural Sciences, University of Silesia in Katowice, Będzińska 60, 41-205 Sosnowiec, Poland
 ^{2}Faculty of Geology, Geophysics and Environmental Protection, AGH University of Science and Technology, Mickiewicza 30, 30-059 Cracow, Poland
 ^{3}Computational Geoscience and Reservoir Engineering, RWTH Aachen, Wüllnerstr. 2, 52056 Aachen, Germany
 ^{4}Faculty of Science and Technology, University of Silesia in Katowice, 75. Pułku Piechoty, 41-500 Chorzów, Poland
 ^{5}The Geological Survey of Norway (NGU), Leiv Eirikssons vei 39, 7040 Trondheim, Norway
 ^{6}Faculty of Geology, University of Warsaw, Żwirki i Wigury 93, PL-02-089 Warszawa, Poland
Abstract. The geological potential of sparse subsurface data is not being fully exploited because the available workflows are not specifically designed to detect and interpret 3D geometric anomalies hidden in the data. We develop a new unsupervised machine learning framework to cluster and analyze the spatial distribution of orientations sampled throughout a geological interface. Our method employs Delaunay triangulation and clustering with the squared Euclidean distance to cluster local unit orientations (attitudes), which results in minimizing the within-cluster cosine distance. We performed the clustering on two representations of the triangles: normal and dip vectors. The classes resulting from clustering were attached to the geometric centre of each triangle (irregular version). We also developed a regular version of spatial clustering, which makes it possible to assess whether points from a grid structure are affected by anomalies. To illustrate the usefulness of combining cosine distance as a dissimilarity metric with the two cartographic versions, we analyzed subsurface data documenting two horizons: 1) the bottom Jurassic surface from the Central European Basin System (CEBS) and 2) an interface between Middle Jurassic units within the Kraków-Silesian Homocline (KSH), which is a part of the CEBS. The empirical results suggest that clustering normal vectors may result in near-collinear cluster centers and boundaries between clusters of similar trend, thus pointing to the axis of a potential megafold. Clustering dip vectors, on the other hand, resulted in near-co-circular cluster centers, thus pointing to a potential megacone. We also show that the linear arrangements of the anomalies, their topological relationships and internal structure can provide insights regarding the internal structure of the singularity, e.g. whether it may be due to drilling a non-vertical fault plane or due to a wider deformation zone composed of many smaller faults.
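The link between the squared Euclidean distance and the cosine distance mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the example triangle, and the convention that the dip vector is the unit vector of steepest descent within the triangle's plane are assumptions made here for illustration. For two unit vectors u and v, ||u - v||^2 = 2(1 - u·v), so the squared Euclidean distance is exactly twice the cosine distance.

```python
import numpy as np

def unit_normal(p0, p1, p2):
    """Upward-pointing unit normal of the triangle (p0, p1, p2)."""
    n = np.cross(p1 - p0, p2 - p0)
    n = n / np.linalg.norm(n)
    return n if n[2] >= 0 else -n

def dip_vector(n):
    """Unit vector of steepest descent in the plane with upward unit normal n.

    Assumed convention (not taken from the paper); undefined for a
    horizontal plane, where every direction in the plane is level.
    """
    down = np.array([0.0, 0.0, -1.0])
    d = down - np.dot(down, n) * n  # project "down" onto the plane
    return d / np.linalg.norm(d)

def cosine_distance(u, v):
    """1 - cos(angle) between two unit vectors."""
    return 1.0 - np.dot(u, v)

def sq_euclidean(u, v):
    """Squared Euclidean distance between two vectors."""
    return float(np.sum((u - v) ** 2))

# Hypothetical triangle on the plane z = -y (dips at 45 degrees towards +y).
p0 = np.array([0.0, 0.0, 0.0])
p1 = np.array([1.0, 0.0, 0.0])
p2 = np.array([0.0, 1.0, -1.0])
n = unit_normal(p0, p1, p2)
d = dip_vector(n)

# Two arbitrary unit vectors standing in for two triangle orientations.
u = n
v = unit_normal(p0, p1, np.array([0.0, 1.0, 0.2]))
print(sq_euclidean(u, v), 2 * cosine_distance(u, v))
```

The two printed values agree up to floating-point error, which is why a clustering routine that minimizes squared Euclidean distances between unit vectors also minimizes their cosine distances.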

Notice on discussion status
The requested preprint has a corresponding peer-reviewed final revised paper. You are encouraged to refer to the final revised version.

Journal article(s) based on this preprint
Michał Michalak et al.
Interactive discussion
Status: closed

RC1: 'Comment on egusphere-2022-633', Guillaume Duclaux, 24 Aug 2022
Review of "Clustering has a meaning: optimization of angular similarity to detect 3D geometric anomalies in geological terrains" by Michał Michalak, Lesław Teper, Florian Wellmann, Jerzy Żaba, Krzysztof Gaidzik, Marcin Kostur, Yuriy Maystrenko, and Paulina Leonowicz.
This manuscript introduces a new workflow for geological-surface mapping using sparse subsurface data. In particular, this work develops and investigates two new features for geological mapping using unsupervised machine learning: 1) the role of structural data representations (as normal and dip vectors) on clustering results, and 2) the characterisation of Voronoi diagrams to explain the meaning of the boundaries between obtained clusters. The potential of these two methods is illustrated through applications to a couple of examples, focusing at the very large scale on clustering regular data for the bottom Jurassic surface of the Central European Basin System, and at a smaller scale on clustering irregular data for a Middle Jurassic interface within the Krakow-Silesian Homocline in South-Central Poland.
Now, I am definitely not an expert in either machine learning or clustering methods... so I've reviewed this manuscript from the perspective of a structural geologist to whom such methods could be very useful for interpreting subsurface geology and structures.
Overall, the manuscript is well written and organised, and seems well suited for the EGUsphere readership. The application of the unsupervised clustering method is presented, tested and analysed for different k-means and different vector representations in numerous figures (that still require some editing and clarifications). Limitations are appropriately discussed, which keeps the contribution very honest. Such a new machine-learning approach will potentially provide opportunities for geologists to (re)interpret subsurface structures in regions with either available geological surfaces or dense borehole coverage. Based on my review, as a structural geologist, I would recommend accepting this manuscript after moderate revisions of the figures and minor revisions of the text.
I present below a few key points for which I have some questions/concerns followed by a list of minor comments.
1) Choice of the optimum number of clusters: I have some trouble understanding the k-means choices based on the elbow method the authors have employed to determine the optimal number of clusters in their case studies... First, I wonder whether Figure 7 is flawed? Why are the y-axis values so variable between the normal and dip representations for a single dataset? Shouldn't the numbers be similar between a and b (CEBS), and c and d (KSH)? Each structural datum provides both dip and normal values; for CEBS there should be 236380 data points. I might be missing something related to the y-axis of each figure: what is "tot_withinss"? Why can Figure 7B have either 2 or 4 as the optimum number of clusters (Lines 285-286)?
The number of clusters will be very important for determining the data clustering pattern based on the cluster centers analysis, so I reckon this section should be strengthened.
2) Stereographic representations: This applies to Figures 2c, 8, 9, 10, 11, 12, 13 and 14.
- On Figure 2c it isn't clear which hemisphere is displayed for the data, and we don't know where North is located on Figures 2a and 2b.
- Figs 8 to 14: There are lower and upper hemisphere half-globes shown next to the stereonets in all these figures. For the lower hemisphere the whole stereonet is shown, but not for the upper hemisphere... What is the grid spacing on these stereonets? I may have missed it, but it isn't clear to me what projection is being used (I suppose an azimuthal polar stereographic projection, is that correct?); structural geologists would classically use equal-area or equal-angle stereonets. Please clarify this in the caption of Figure 8, where the stereonets first appear.
3) Direct screenshots from Paraview are hard to read. I'm thinking of Figures 1, 2 and mostly 4 and 6. The scales are not always meaningful, or are hard to read. For example, in Figure 1 the color scale legend is "scalars"... I guess "elevation" would be more adequate. The bounding-box scale units in Figure 2 are not readable either and are totally missing in Figures 4 and 6. I believe redrafting Figure 4 would massively improve its readability. The moiré pattern visible in Figure b and c is just terrible. Units should also be added to the scale bars, and finally the scale bars' min and max values should be written in the same format as the rest of the values (7.2e+03 and 1.6e+03 should be 7200 and 1600; 1.6e-04 should be 0; 6.1e+01 should be 61; 2.2e-03 and 3.6e+02 should be 0 and 360). The same goes for Figure 6.
Minor comments:
+ Equation 2, page 4: why is each line of the development of Eq. (2) given a different number? This is the same equation and as such should be referred to only as Eq. (2).
+ Line 144: please remove the second "a" in this line: "[...] whether a specific 2D point p [...]"
+ Line 234: "en échelon" is missing its accent.
+ Line 240: Genus and species for Strenocera subfurcatum should be written in italic font.
+ Line 272: Please revise the reference for the Anon borehole database citation. I understand it is not published, though.
+ Lines 350-351: Theorem 1 -> Do you mean Eq. 1?
+ Lines 373-374: How would you differentiate between a graben structure and an "antithetic shear with hanging walls dipping against the main fault"?
+ Line 440: missing s in "this result suggests"
+ Figures and captions in general: the figures use lower-case letters (a, b, c...) while the captions and the text use upper-case letters (A, B, C...). Please harmonise between the figures and the text/captions.
Nice, 24/08/2022
Guillaume Duclaux
AC1: 'Reply on RC1', Michal Michalak, 16 Sep 2022

RC2: 'Comment on egusphere-2022-633', Thomas Blenkinsop, 30 Aug 2022
Clustering has a meaning: optimization of angular similarity to detect 3D geometric anomalies in geological terrains
This is a difficult paper to read, because it contains a lot of jargon about geometry, and because of vague general statements, some of which are unnecessary (e.g. “dip angle is not capable of showing the dip direction of faults and vice versa” and “Geology is considered to be a subjective science (Curtis, 2012)”). A further problem for understanding the paper is that some of the methods section is couched in the technical language of the CGAL library. This is unhelpful to the general reader, and needs to be explained in simple terms.
One of the main conclusions, that applying clustering methods to normal vectors and dip direction vectors from the same data set results in different interpretations of the structure (Fig. 15), seems unlikely to be correct. There is no material difference between the geometrical significance and information contained in a normal vector compared to a dip direction vector. If there is a difference in the outcome of the clustering methods, that must be an artefact of the way the methods have been applied to each data set.
Another main conclusion is that optimisation methods must be applied to investigate clustering. This is relatively trivial: any clustering algorithm requires a similarity index, and the one used here (cosine distance) is a standard metric for assessing orientation differences. Further to the previous point, this metric should not result in significant differences between normal and dip direction vectors, because the cosine distance between two normal vectors must be the same as the cosine distance between the two dip vectors of the same surface.
There is some discussion about anomalous results:
“The above effect could be explained by several competitive hypotheses. For example, the fault plane could have been drilled, thus broadening the zone of triangles genetically related to the fault (Michalak et al., 2021). Assuming the tectonic origin of the related structures, it can be hypothesized that fault drags on the hanging wall contribute to subsidiary elevation differences that must be consumed by nearby triangles. It could also be argued that an unusual lowering of the contact surface is due to a deformation zone composed of many smaller faults. Another hypothesis could be that the related feature is not a fault but rather a sedimentary slope, which would explain the gradual lowering of the contact surface.”
Such hypotheses are useful, but would be better illustrated with specific examples and some reasoning about which is the preferred hypothesis.
The determination of the optimum number of clusters is explained in Figure 7, but the results section shows results for 2, 3 and 4 clusters. This is unnecessary: only the optimum results should be shown.
The figures could be substantially improved. The use of such a dark background does not help (e.g. Fig. 6c). In most cases the grid is the most dominant and least important aspect of the maps, obscuring the detail of the clustering. The stereoplots are not explained in the figure captions.
 AC2: 'Reply on RC2', Michal Michalak, 16 Sep 2022
Model code and software
GeoAnomalia Michał Michalak https://github.com/michalmichalak997/Triangulation_2/blob/master/README.md