the Creative Commons Attribution 4.0 License.
the Creative Commons Attribution 4.0 License.
GStatSim V1.0: a Python package for geostatistical interpolation and simulation
Emma Johanne MacKie
Michael Field
Lijing Wang
Nathan Schoedl
Matthew Hibbs
Allan Zhang
Abstract. The interpolation of geospatial phenomena is a common problem in Earth sciences applications that can be addressed with geostatistics, where spatial correlations are used to constrain interpolations. In certain applications, it can be particularly useful to perform geostatistical simulation, which is used to generate multiple non-unique realizations that reproduce the variability of measurements and are constrained by observations. Despite the broad utility of this approach, there are few open-access geostatistical simulation software. To address this accessibility issue, we present GStatSim, a Python package for performing geostatistical interpolation and simulation. GStatSim is distinct from previous geostatistics tools in that it emphasizes accessibility for non-experts, geostatistical simulation, and applicability to remote sensing data sets. It includes tools for performing non-stationary simulations and interpolations with secondary constraints. This package is accompanied by a Jupyter Book with user tutorials and background information on different interpolation methods. These resources are intended to significantly lower the technological barrier to using geostatistics and encourage the use of geostatistics in a wider range of applications. We demonstrate the different functionalities of this tool for the interpolation of subglacial topography measurements in Greenland.
Emma Johanne MacKie et al.
Status: final response (author comments only)
-
RC1: 'Comment on egusphere-2022-1224', Mirko Mälicke, 07 Jan 2023
The authors present a well-written, concise and easy to understand paper about a new Python package called GStatSim. The package is designed to fill the gap of missing geostatistical simulations tools in the scientific Python landscape, as identified by the authors.
The presented package meets modern programming and publishing standards as common in Python and I personally like the approach of focusing educational use-cases. The provided tutorials are outstanding.
With some minor comments, concerning the paper and the code itself, I am looking forward to see this nice work published.
Best,
Mirko Mälicke
In fact I really only have one main point:
- P.7 135. My personal main comment to the work is the limitation to exponential variogram models. I think the authors should make clearer, why other models are not supported. Comparing to other geostatistical software, this design decision seems to make integrations with other packages difficult, as most of them share a substantial range of available models.
In case there are algortihmic, technical or scientific limitations that I am not aware of right now, it would be most helpful to discuss these limitations in the discussion transparently and beyond the simple outlook, that more models could be implemented, as currently stated in the discussion.
Minor comments:
- P.3 L.62-67. I think this paragraph could make clearer, why the approach of introducing a new package from scratch was chosen, over the extension of existing packages, as ie. GSTools already has one field simulation function.
- P.3 L.68-74. GStatSim implements ordinary and simple kriging, which also exists is various other Python packages. What makes the two new implementations different from existing ones?
- P.3 L.80-82. GStatSim was designed with geophysical point data in mind. Does this affect the application on ‘classic’ point based data? Are there differences in the algorithms, ie. for kriging?
- P. 4. L.94 To me this reads like GStatSim is written for Py3.8 only, but I am confident it will run on newer versions as well. This could be checked systematically (see my last comment) and adjusted accordingly.
- P.5-6 L.113-118: For me it is not completely clear, if the described procedures are part of GStatSim and its functions, or if these transformations are part of the presented workflow and specific to the presented dataset.
- P.8 L.162ff, the ordinary kriging algorithm is described to be robust to spatial trends. In case spatial trends are expected, why not using universal kriging in these cases?
- P. 9 section 4.3.1: As I was completely unfamiliar with NN octant search, further references to the method, and especially a description of the ‘8 main octants of the coordinate plane’ would be highly appreciated. In addition: Does this ‘alignment’ have advantages for use-cases, other than the geopyhsical data described?
- P.11 L.201-205: Similar to my comment to P.5-6, I also think that here it would be again helpful to state more clearly if the described back-transformation is part of the presented kriging algorithms, or a workflow step specific to the presented data.
- P.11 section 4.4: In my opinion, this whole section could be substantially upgraded by the figure from the documentation demos: 4. Sequential Gaussian simulation, which illustrates the algorithm: https://raw.githubusercontent.com/GatorGlaciology/GStatSim/f03710188fb200985c86d65832dd7f8e68c1d662/images/SGS_cartoon-01.jpg
- P.11 L.214 Is ‘path’ referring to an actual path here? Is the sequence at which nodes are simulated limited by the constrain that the nodes have to be neighboring, which is my understanding of a path? If yes, why?
- P.12-13 L. 219-224: I think it would be beneficial to give the user a short insight on why there are two different sgs algorithms and when you would choose which one in general terms.
- P.15 L. 255-261: Are the authors advising to apply the rbf_trend function to any dataset, before applying GStatSim functions or are there also downsides of the approach the user needs to be aware of? In addition t would be helpful to provide references to more detailed descriptions of the RBF approach, as some readers (including myself) might not be familiar with it.
- P.16 -17 section 4.7: Are there general rules, how each grid cell can be attributed to a cluster? Is this based on expert knowledge? How should I decide on the number of clusters necessary?
- P. 17 section 4.7.1: I guess this relates to my previous comment: How do I know that I need three clusters here? The presented split of the dataset in Figure 10a) seems quite arbitrary to me.
- P. 17 section 4.72.: This sections adresses the issues I raised in my last comment, but still, the decision of cluster size is for me as abritrary as the decision on the number of clusters. Ultimately, by setting (spatial) cluster sizes I will also imply a narrow range of how many clusters can be found, right? Maybe I miss something.
I think it could be beneficial to add a short description of why the specified parameters were chosen in the style of an example. I am sure that I will get it then. - P. 19 L.322-329: How does the described procedure relate to kriging with external drift? Can that be used as well, to incorporate secondary variables into an interpolation or simulation?
Finally, there are a few questions/comments concerning the code itself. Non of them are part of the review in a strict sense:
- Why are some methods implemented as class instance methods, while no of the classes has an implemented constructor function (__init__)?
If no attributes/settings/data are shared across the methods, why use class instance methods? I am asking more out of curiosity, as I can’t see any advantages beside a grouping of related functions (which can be achieved with sub-modules as well). - The docstrings seem to be of non-stand format (I think). More as a suggestion: Why not use ie. Numpy styled docstrings and create a sphinx-based documentation?
- https://numpydoc.readthedocs.io/en/latest/format.html
- https://www.sphinx-doc.org/en/master/usage/quickstart.html
- https://docs.readthedocs.io/en/stable/intro/getting-started-with-sphinx.html
- The package seems to lack unittests. Although these bring some substantial extra-work with them, I personally always made the experience that this is well invested effort (Many, many developers simply ignore code without tests). I know there are different approaches out there, but from my point of view reliable code is as important as FAIR. In python there are frameworks like pytest (https://docs.pytest.org/en/7.2.x/) that make it particularly easy to add a few tests that ie. run the full analysis given the nice data already included. With CI tools like GH actions, Travis CI or circle CI, the tests will have zero overhead once implemented and can run on all current Python versions and operation systems in parallel. Please take this comment more as my personal opinion and suggestion, to dramatically increase the range of this really nice software.
Citation: https://doi.org/10.5194/egusphere-2022-1224-RC1 - P.7 135. My personal main comment to the work is the limitation to exponential variogram models. I think the authors should make clearer, why other models are not supported. Comparing to other geostatistical software, this design decision seems to make integrations with other packages difficult, as most of them share a substantial range of available models.
-
RC2: 'Comment on egusphere-2022-1224', Joseph MacGregor, 10 Feb 2023
Review of “GStatSim V1.0: a Python package for geostatistical interpolation and simulation” by E. MacKie et al.
Joseph A. MacGregor
10 February 2023This manuscript introduces a Python package for geostatistical interpolation and simulation and provides several use cases for Greenland radar-sounding data. The authors present a natural progression introduced of geostatistical complexity and each example is carefully justified, along with the overall motivation for this package, which I expect could be widely used. This MS is exceptionally well-written and is among a very small number that I’ve reviewed whose key messages were fully understood after a single reading. I perused the Jupyter notebooks and found them similarly informative. I have only minor comments below.
While the MS (including the Introduction) are very clear and well-motivated, the Introduction is a bit long-winded (10 paragraphs) and could be more concise.
The MS weaves back and forth between meters and kilometers for length units, mostly using meters in figures and kilometers in text, but not exclusively. I suggest using kilometers only for figures and text…fewer zeroes for the reader to parse.
Throughout the MS, I suggest a discrete color map instead of continuous for the gridded bed topo examples. Perhaps a 50 m interval. Little information would be lost, but I suspect it would be easier to discern the patterns referred to in the text and difference between various methods, particularly examples like Figure 6.
Figure 1b: Label full Greenland plot and blue coastline. Would be nice to have an ice-sheet outline also.
107: ArcticDEM is now out of date compared to GrIMP (https://nsidc.org/data/nsidc-0645/versions/1), but that is a minor issue and does not warrant recalculation.
109: It’s a bit odd, given the FAIR stance of GStatSim, to use a MATLAB package (AMT) for projection of geographic coordinates, which could be done with GDAL instead.
113: From this sentence, I don’t fully understand what operation is done. Looking at the Jupyter Notebook, it seems like the data are binned (averaged within those bins?) to 1 km. A bit more exposition needed here.
Figure 3a: Label nugget/sill/range values mentioned on 142.
132: How is it assessed that there is no significant anisotropy? Seems like they disagree at lag > ~35 km, but that is above sill.
Figure 4b,d: Use different color maps for what are fundamentally different quantities being shown compared to 4a/c. I suggest a red/blue colormap for 4d centered on zero.
388: Permit me to not-so-humbly suggest that MacGregor et al. (2015; doi: 10.1002/2014JF003215) is a more relevant reference here, as that study performed ordinary kriging to interpolate englacial layers that could be improved using GStatSim and is closer to the examples discussed in the MS.
Citation: https://doi.org/10.5194/egusphere-2022-1224-RC2
Emma Johanne MacKie et al.
Emma Johanne MacKie et al.
Viewed
HTML | XML | Total | BibTeX | EndNote | |
---|---|---|---|---|---|
331 | 280 | 11 | 622 | 6 | 5 |
- HTML: 331
- PDF: 280
- XML: 11
- Total: 622
- BibTeX: 6
- EndNote: 5
Viewed (geographical distribution)
Country | # | Views | % |
---|
Total: | 0 |
HTML: | 0 |
PDF: | 0 |
XML: | 0 |
- 1