the Creative Commons Attribution 4.0 License.
OpenMindat v1.0.0 R package: A machine interface to Mindat open data to facilitate data-intensive geoscience discoveries
Abstract. Powered by data-driven knowledge discovery technologies such as machine learning and deep learning, increasingly exciting patterns are discovered in complex earth science big data. One of the world's most enormous treasure troves of mineral databases, Mindat ("mindat.org"), contains vast amounts of knowledge that are yet to be mined. Through a project called OpenMindat, an application programming interface (API) to enable open data query and access from Mindat has been set up. This paper presents an open-source R package (OpenMindat v1.0.0) to bridge the data highway, connecting users' overwhelming data needs, facilitating data-intensive query and access, unlocking novel insights, and enabling groundbreaking geoscience discoveries. The package is designed to be user-friendly and extensible. It exploits the capabilities of the Mindat API, including the data subjects of geomaterials (e.g., rocks, minerals, synonyms, variety, mixture, and commodity), localities, and the IMA (International Mineralogical Association)-approved mineral list. In addition to providing functions for querying those data subjects, the package supports exporting data to various formats such as CSV, JSON-LD, and TTL. In applications, these functions only require minor coding and provide invaluable convenience for users with limited R environment experience. The package is open on GitHub under the MIT license and with detailed tutorial documentation. The field of mineralogy and many other geoscience disciplines are facing the opportunities enabled by open data. Various research topics such as mineral network analysis, mineral association rule mining, mineral ecology, mineral evolution, and critical minerals have already benefited from Mindat's open data. We hope this R package will accelerate the process of those data-intensive studies and lead to more scientific discoveries.
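As a rough illustration of what exporting the same records to more than one of the formats named in the abstract involves, the sketch below serializes two mineral records to CSV and to a minimal JSON-LD document. It is not the OpenMindat package's code and does not call the Mindat API: it is written in Python with the standard library only, and the field names, the helper functions `to_csv` and `to_jsonld`, and the `https://example.org/vocab#` vocabulary URL are all assumptions made for this example.

```python
import csv
import io
import json

# Illustrative records only. The field names are assumptions for this sketch,
# not the Mindat schema.
records = [
    {"name": "Quartz", "formula": "SiO2", "crystal_system": "Trigonal"},
    {"name": "Halite", "formula": "NaCl", "crystal_system": "Isometric"},
]


def to_csv(rows):
    """Serialize a list of flat dicts to CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=list(rows[0].keys()), lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def to_jsonld(rows, vocab="https://example.org/vocab#"):
    """Wrap the same rows in a minimal JSON-LD document.

    The vocabulary URL is a placeholder (hypothetical), not a real ontology.
    """
    doc = {"@context": {"@vocab": vocab}, "@graph": rows}
    return json.dumps(doc, indent=2)


csv_text = to_csv(records)
jsonld_text = to_jsonld(records)
print(csv_text)
print(jsonld_text)
```

The CSV form suits spreadsheets and data frames, while the JSON-LD form attaches each field to a vocabulary term so the records can be linked with other datasets.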
Status: final response (author comments only)
CC1: 'Comment on egusphere-2024-1141', Wenqiang Tang, 14 Nov 2024
The OpenMindat team develops data query and access tools based on the Mindat API server, which is groundbreaking and has practical, solid value. The technical architecture described is clear; the modular design enhances the readability and maintainability of the code, and its caching mechanism also improves efficiency. The team is expected to maintain the software package and keep it updated to provide richer combined retrieval functions for Mindat mineral data.
Citation: https://doi.org/10.5194/egusphere-2024-1141-CC1
CC2: 'Comment on egusphere-2024-1141', Sensen Wu, 14 Nov 2024
This manuscript introduces an implemented open-source R package (OpenMindat v1.0.0), which was well-designed to build a data highway that meets the growing data demands of users and facilitates data-intensive queries and access. It can help reveal new insights and promote the emergence of more data-driven knowledge discoveries in the field of earth sciences. It discusses the limitations of the current Mindat.org and its cyberinfrastructure ecosystem. Although this work uses existing technology (R package) to fill the gap in expanding its essential network ecosystem, it provides users with a systematic understanding of Mindat.org's primary data, which is undoubtedly an indispensable foundational work.
Citation: https://doi.org/10.5194/egusphere-2024-1141-CC2
RC1: 'Comment on egusphere-2024-1141', Anonymous Referee #1, 25 Nov 2024
General comments
The manuscript describes an R package as the machine interface to the open data of Mindat.org. So, it is not a model itself but describes tooling around models to access data. Overall, I think that there should be more papers like this that describe the details behind software packages, in journals that scientists that are not developers can find, read, and understand. This paper clearly states that it is written from the software developer's perspective. I wasn't sure if this was in scope for Geoscientific Model Development, as sometimes these explanations are left to the grey literature user manual, or a software journal like JOSS, but I found other examples of similar papers in GMD, so I believe it is in scope.
Coming from a geoscientist, and not software developer background, some of the concepts and terminology were difficult for me to understand. However, I do think that the paper is useful to be published, with some efforts made to increase the understandability that might be more obvious to software developers (where to find the controlled vocabularies, brief descriptions at the top of .ipynb). Also, I am not an R user so I hope there are other reviewers that are able to execute the examples, my review is based on the paper and looking at the available code.
The paper itself references several examples showing uses and potential uses. The research group and its collaborators have built up a robust technical ecosystem for this type of data exploration and scientific research, with leaders in the field. Having these newer methods used by more people who access mineral data would be a strong positive.
The methods are aligned with the best practices that I know about. This is a minor comment but I found the description of the benefit a bit exaggerated in language and benefit. This might be because at my organization we are held back from effusive language and told to just state the facts. However, this is up to the editor, it is a style comment. (Examples: overwhelming, groundbreaking, viral, invaluable.) For the review prompt question, "Is the language fluent and precise?" these would be examples of imprecise language.
Regarding reproducibility, the repository appears to have everything needed to reproduce, with quite decent documentation, but I hope there is another reviewer that can confirm that.
The tables, in the format presented in the preprint, are not in the best format for readability. For this type of content I would hope that they are fixed-width in the final version.
All other components of the paper (abstract, background, references, supplementary material) appear to be adequate and useful to me.
Seems like this type of capability could be generalized to many scientific database APIs, so it is important that the paper is understandable to non-mineralogists.
I like how there are a variety of output formats, and that it looks like a lot of user-research was done up to this point. This gives me more confidence that this product is usable and relevant to the community.
Technical corrections
- The language is a little bit more opinionated/flowery than allowed in my institution ("groundbreaking", etc.); I leave it to the editors to decide. Simpler, more direct language would be best for the international audience.
- overwhelming, viral,
- data highway
- Line 51 - cite reference for FAIR principles
- Line 63 - is this first use of IMA? Please expand the acronym, I see it is expanded later, but expand acronyms at first use.
- Line 71 - what are meteoritics and petroleum categories and how are they related?
- Are all of the Mindat fields, vocabularies, etc. defined elsewhere? If so, make sure to reference it clearly for new users.
- Figure 1 is informative but would be even more informative if it followed a pathway starting from the user request. (rearrange it so that there is a path for the reader to follow)
- Line 110 - define what "classes" and "functions" mean in this context, as users are probably a very broad range, some who may not be familiar.
- Table 1: something about the spacing is a little confusing of how the different columns line up.
- Table 1: The caption says "some of the functions" and the text says 100 functions. Where does one go to view ALL of the functions? Is there a repository/Github link that is comprehensive?
- Lines 120- 140, split up into paragraphs or bullets for clarity.
- Line 178: use of "besides" is awkward- another word would be more appropriate. Maybe "Also"?
- Table 2: "Head 3 records" does that mean the first three records? Use more straightforward language for this venue with these readers. Everywhere in the paper where it says "head 3" can it be replaced by "first 3"?
- Table 2: is there a typo in the first row? There are special characters and unicode decimal codes.
- There are R examples and there is an ipynb example - could it be more clear which or both? I think that the ipynb example is R, but can that be clarified in a short header section of the .ipynb?
- Line 360 is clear that it is from software developer perspective, that is useful.
- It would have been useful if more fixed-width font were used in the PDF. Does this happen at the journal editorial stage?
- I had some trouble repeating the resolution of the .ipynbs in the tutorials repo, multiple times the notebook did not resolve, but other times it did.
- For example: https://github.com/quexiang/OpenMindat/blob/main/notebook/Retrieve_Geomaterials_by_physical_prop_1.ipynb
- The first time I get the "unable to render code block"
- upon reloading a second (or third) time, it worked. Is this a known issue or repeated by anyone else? It happened on multiple machines, in the Chrome browser, for me.
- For the examples in the notebooks on Github, can you place a header at the top of the .ipynb that describes that these are examples for OpenMindat in R? Then a sentence or two of what the notebook does, what the output says, and where to get more information if you wanted to be able to understand the output better. (As someone who is not very familiar with OpenMindat, I would want to know where to access information about interpreting the returned results.) Especially since there are examples for Python too, I think it would be useful for each .ipynb to have a short description at the top.
Citation: https://doi.org/10.5194/egusphere-2024-1141-RC1
RC2: 'Comment on egusphere-2024-1141', Dominik Hezel, 21 Dec 2024
The paper presents an R package to query the mindat.org API. Such tools are important and required for easy, quick and simple access to existing databases. It is difficult to make such tools visible in domain-specific journals, as these are not classical research papers, maybe reports, and there might be a categorical gap for such relevant papers. This might be the reason for a couple of shortcomings of the paper, as I outline below. As I will argue, the authors need to make a decision whether they want to see their work published in e.g., JOSS or here.
I find this an important, relevant, and valuable contribution, nonetheless, it needs significant reworking before it can be published.
General comments
The authors are unnecessarily overselling the capabilities of ML and big data by frequently using superlatives associated with largely unsupported claims. Here are 5 examples just from the beginning:
›… increasingly exciting patterns are discovered in complex earth science big data.‹
›… contains vast amounts of knowledge that are yet to be mined …‹
›… overwhelming data needs …‹
›… this can unlock novel insights, enabling groundbreaking discoveries …‹
›… Mindat plays an increasingly significant role in scientific value and societal impacts …‹ (how so?)
While I enjoy and support the authors' enthusiasm, after a number of years in this field as well, it seems obvious that ML, big data, etc. are important new techniques, but it is not the case that these new techniques have ushered us into a new era of nonstop scientific breakthroughs. I believe tools such as the ones the authors created are nonetheless of great importance and a big service to the community, which, and here I would agree, is slow in picking up such tools that make a geoscientist's life much easier. If I were to speculate, I would say that the authors fear their contribution might be seen as insufficiently important for a publication if they do not include some bold statements about why it is important. This, I think, reflects a reluctance in the community to embrace new tools such as the ones in this manuscript and to appreciate the work that goes into the development of such tools, and their high value and service to the community. I strongly support the publication of such tools, either in specific journals such as JOSS, or, if it is something highly specific to a particular community, in domain-specific journals such as this. It might be an opportunity to add a dedicated journal category, such as software/software tool. This would make it much easier for authors and readers to publish and find such valuable ms.
Hence, I would like to strongly encourage the authors to tone down the possible, but as yet unproven, greatness of ML and big data, and focus on the indisputable importance and helpfulness of obtaining more information more quickly, and on the underrated importance of making data FAIR. I believe this will make your ms sound more scientific, and less rambling and almost philosophical.
I find the current ms not well organised, and although I like the idea of section 4, it is currently not very helpful. But I want to start with an even more general comment. I think the authors need to decide what audience to address: a more technical audience for which they want to provide technical information? If so, this ms should be published in JOSS. If the authors want to address geoscientists who might not yet be (too) familiar with coding, then it can be published here, but the technical terms need to be moved to one single method section, be on a very broad level, and the actual technicalities moved to the online documentation. The authors then need to make it clear in the ms that some kind of access key is required, for which detailed instructions are provided in the documentation (not many will know what a ›token‹ is). Terms such as API, httr, jsonlite, etc. all need to be moved into the documentation. The ms then needs fewer technical details (unless the authors choose JOSS), and should present what the tool does on a much more practical level. In case the authors want to publish their tool here, I suggest the following:
Section 1 needs some statements why using such a package is advantageous compared to using the mindat webpage. And here I mean concrete, practical statements, not the claims mentioned above. This will be helpful for researchers unfamiliar with such possibilities.
Fig. 1 is a good idea, but needs to be reduced to a much simpler version that can be understood in an instant. The more complex figure can be part of a documentation.
Section 3 is a mishmash of technicalities that need to be moved into section 2, while the capabilities can easily be merged with a new section 4.
Section 4 should be completely rewritten. Examples are great, but should be presented so that even non-programmers understand them and might be motivated to use programming tools in the future. First, there are way too many examples. As a programmer, you will get the idea instantly and start using the tool; as a non-programmer you get lost. I suggest focussing on what the tool does on an abstract level, which is, as I understand it: getting information from mindat using predefined functions, i.e. using a filter to only get specific data. Here a single line of such a function (cf. line 256), i.e., a filter, could be used, with an explanation of what the various elements of the function do. A programmer will not read the explanation, but will understand what is done from this one line, and a non-programmer will fully understand what is done. Then I suggest preparing an example, e.g., a Jupyter Notebook, with this function and the result of this function. Provide a screenshot so anyone gets a real idea of what is happening. Then reduce Table 1 to fewer functions, e.g. 5-10, and explain all of these in plain words and what they do. I would remove all other tables of section 4. Then explain the additional functionalities, such as combining functions/filters, again with one line of example code and a screenshot of this line and its result from the example Jupyter Notebook. Do the same with the ›wild cards‹ capabilities. This way you have a very clean, straightforward, much briefer and clearer section 4. Also, a Jupyter Notebook with examples provides a great start for a user to actually try your tool. I would find it fully sufficient if the explanation of how to get a mindat token were at the beginning of such a notebook. Using tools such as this is much more instructive than these rather convoluted tables. If someone is really interested, I bet she or he would rather try it than read through all these pages. The few screenshots should sufficiently convey the idea and potential.
Just add a brief description how to get started with the Jupyter Notebook. That might in fact be crucial.
Some ideas for more specific examples: at the moment, there are only these tables, which are rather convoluted. Why not provide some simple charts, e.g., pie charts that show statistical results such as the relative numbers of mineral crystallographic types or the relative numbers of minerals from each country, or similar bar charts? You already have a map, which is also a much more interesting result than the tables. All this would make the examples much more lively and interesting.
I couldn’t find a demo on ›https://cran.r-project.org/web/packages/OpenMindat/‹ as promised in line 439, at least not quickly and easily. I did find code examples on https://quexiang.github.io/OpenMindat/, but only after trying several links, and I could not readily run them. There are quite a few links; maybe you could collect and describe them in one single table, or provide just one link to one page from which you link out to the others. At the moment it is left to the reader to find his or her way through the various links and demos. Maybe give some advice to Python coders like me. In general, this page is great, and could contain the documentation as suggested above.
Finally, I really would be interested in speed and limitations. This is only mentioned peripherally in line 295, but it would be critical information: how long does it take to query properties of all the minerals, can I easily display the images of 500 minerals, what cannot be done, etc. The authors claim in the title that ›data-intensive geoscience discoveries‹ are possible with their package, so I would be interested to what extent data-intensive queries are possible, e.g., whether I can do 10, 100, or 1000 such queries a day, maybe even in a class with 20 students.
Section 5 in fact resembles what I am suggesting above. I wasn’t aware of this while suggesting the above, which might happen to another reader as well; hence a restructuring might be sensible. If an example from a co-author is used (Ma et al. 2024, I guess, not 2023 as in line 364), this might be made clearer. Also, are these valid code lines? I am familiar with Python, not R. It looks like these are mixtures of results and commands, but I might be mistaken. Should I be right, I would suggest using only code lines; I find this combination confusing. Otherwise section 5 is more of a summary than a discussion with another example.
Section 6 concludes with some additional, exhaustive announcements what such tools might be able to do, which is no less than ›… bring many advantages that revolutionize how we study and understand the Earth …‹. Again, I am all for it, but – see above.
Detailed comments
- Title: I am not sure I understand what is meant by ›machine interface‹. I would likely have expected simply: ›The OpenMindat R package to facilitate data-intensive mindat.org queries‹
- 41f and 69ff: I suggest to make a table on what is stored in mindat.
- 61ff: I suggest to delete this table of contents. It is not needed and likely therefore uncommon.
- 91: I think this IMA description is not needed.
- 93: IMA, not IAM.
- 103: The abstract says JSON-LD, here it says TXT.
- 123: I am not sure what is meant here in (2).
- 133: cf comment to line 103.
- 149: what are httr and jsonlite, how are these installed?
- 210ff: I don’t understand what the ›special symbols‹ are.
- 225: What are ›head 3‹ records?
- 253: ›combine conditions‹, I guess ›property‹ would be more appropriate than ›conditions‹
- 263: This is not a helpful link at all. It just points to the API description which coders likely already know, and others will not understand.
- 289: Is it not possible to include it now?
- 325: yet another list of output-formats
- 352: What kind of examples? I could not find examples that I can play around with.
- 353: ›are shared‹
- 357: Provide just one webpage and link out from there, all these links are confusing. I suggest using only: https://quexiang.github.io/OpenMindat/index.html
- 409: When you say envision, I would expect what could be done in the future, not the works of more or less one first author from what has been done so far.
- 431: And another combination of output formats. Maybe this does not need to be mentioned that many times anyway.
Citation: https://doi.org/10.5194/egusphere-2024-1141-RC2