This work is distributed under the Creative Commons Attribution 4.0 License.
A database-driven research data framework for integrating and processing high-dimensional geoscientific data
Abstract. This paper introduces a modular research data framework designed for geoscientific research across disciplinary boundaries. It is specifically designed to support small research projects that need to adhere to strict data management requirements from funding bodies but often lack the financial and human resources to do so. The framework supports the transformation of raw research data into scientific knowledge. It addresses critical challenges such as the rapid increase in the volume, variety and complexity of geoscientific datasets, data heterogeneity, spatial complexity, and the need to comply with the FAIR (Findable, Accessible, Interoperable, and Reusable) principles. The approach optimises the research data management process by enhancing scalability and enabling interdisciplinary integration. It is adaptable to evolving research requirements and supports various data types and methodological approaches, such as machine learning and deep learning, that place high demands on the data used and their formats. A case study in Western Romania presents the framework's application in an interdisciplinary geoarchaeological research project by processing and storing heterogeneous datasets, demonstrating its potential to support geoscientific research data management by reducing data management effort, improving replicability, findability and reproducibility, and streamlining the integration of high-dimensional data.
Status: final response (author comments only)
RC1: 'Comment on egusphere-2025-4832', Anonymous Referee #1, 08 Oct 2025
AC1: 'Reply on RC1', Dennis Handy, 02 Dec 2025
GENERAL COMMENTS
The paper describes an IT system set up for general use with an example implementation based on soil and sediment information from an archaeological project in Romania. The system consists of an OLTP and OLAP component, tied together with pipelines and each having a (relational) data schema although the OLAP schema is de-normalised.
The work is interesting and not atypical of many IT systems constructed in many scientific academic departments. It is well designed for the purpose. It may provide some guidance for others developing such systems.
The problem is the claim to generality of application, and to interoperability. On both claims, the difficulty is that the IT system is locally designed, clearly heavily influenced by an individual project and takes no account of wider developments in generality and interoperation aspects of IT systems (across a range of scientific disciplines). Thus, there is no motivation for this work to be seen as part of global information (see detailed comments).
There is no reference – for example – to the metadata schemas of the European Open Science Cloud which aims for the generality and interoperability this paper claims (but does not demonstrate). It does not mention more recent work on Scientific Knowledge Graphs. It does not mention the leading large geoscience IT systems in Europe (which provide generality and interoperability) nor those in North America, East Asia and Australasia. The data model and schemas are not compared with those of the aforementioned systems.
The pipelines (workflows) appear not to use CWL (Common Workflow Language) which is a basis for generality and interoperability. Similarly, in many scientific disciplines the use of RO-Crates (Research Object Crates) is widely encouraged for generality and interoperability (including provenance and reproducibility).
The difficult problem of semantic consistency and formalisation is mentioned in lines 375-380. However, this is consigned to future work and there is no roadmap or plan of how such semantic consistency – for generality and interoperation – is to be achieved except a mention of discipline-related thesauri and a nod to ontologies. The key research question is how to bridge across heterogeneous thesauri – it requires semantic relationships and is likely to include also probabilistic measures.
Response: We would like to thank you for your careful review of our manuscript and for the detailed, insightful feedback you provided. Your comments are very helpful in enabling us to critically reflect on and refine the positioning and conceptual contribution of our work. Firstly, we would like to thank you for your precise summary of our framework's architecture. You correctly identified that it implements a robust separation between the OLTP and OLAP modules, with data pipelines connecting them.
Your assessment that our approach is 'not atypical of many IT systems constructed in scientific academic departments' has made it clear to us that we have not sufficiently emphasised the strategic and conceptual novelty of our work in its current form. Our primary concern is not to develop another self-contained software package. Instead, we present a modular research data framework as an architectural approach that we hope will inspire cultural and organisational change in academic data management. Unlike many systems that focus on static storage or final applications, our approach is holistic and supports the entire research data lifecycle.
We acknowledge the validity of your criticism of the claim to universal validity and interoperability. It has made us realise that we need to distinguish more clearly between our bottom-up approach, aimed explicitly at small, resource-limited projects, and global, top-down infrastructure. In our view, criticising the system as 'locally designed' confuses the specific implementation with the general concept. To address these issues and improve the clarity of the manuscript, we will implement the following major revisions based on your feedback:
1. We will clarify during the discussion that the claimed universal applicability refers to the adaptability of the architecture (e.g., the separation of OLTP and OLAP), rather than to universal software. We will emphasise that our focus is on establishing framework conditions and fostering expertise in data engineering rather than providing a universal solution. The Toboliu case study demonstrates the practical application of this concept, prioritising local efficiency gains. Our initial definition of interoperability focuses on solving systemic problems at the project level (e.g., data provenance and integrity), which are prerequisites for subsequent global interoperability.
2. We will explicitly differentiate between global standards and the chosen scope of our approach:
- EOSC & SKGs: We will argue that, unlike large repositories or knowledge graphs, our framework has a different objective. Rather than aiming for final archiving, we facilitate the generation of FAIR-compliant data during active research by integrating standardised storage solutions and data processing pipelines. We lay the technical groundwork (internal standardisation) that will enable future connection to external standards such as DCAT and EOSC schemas.
- CWL & RO-Crates: Your comment has made it clear to us that the manuscript could be significantly improved by clarifying the definition of a data pipeline and how it differs from a scientific workflow. We will therefore clarify this distinction in the revised manuscript as follows: a) A data pipeline (implemented with Dagster in our framework) refers to the technical, automated process of data orchestration, involving the extraction, transformation, and loading of data to ensure its quality and availability for analysis. b) A scientific workflow, for which CWL is an excellent tool, describes the conceptual steps involved in scientific analysis at a higher level. Therefore, our data pipeline is not a replacement for a CWL workflow, but rather a foundational prerequisite. It provides clean, analysis-ready data on which a CWL-defined workflow can operate. Similarly, RO-Crates are a standard for packaging research outputs, whereas our framework focuses on managing data during the active research phase.
3. We agree that achieving semantic consistency is one of the most significant challenges. However, suggesting that we are simply postponing this issue without a plan ignores the fact that our manuscript engages with current research in this field to address this challenge. We use recent publications to emphasise the extent of the problem and highlight that resolving these semantic differences requires community-driven efforts that extend beyond the scope of a single technical framework. Therefore, we have decided not to address this semantic issue within our framework. Instead, we have established the technical prerequisite for doing so. By creating an internally consistent, provenance-based data environment, we are preparing the data for the semantic mapping and bridging that will be developed through collaborative efforts. Our role is to lay the foundation, not solve the overarching problem. The revised version of the manuscript will explain this more clearly, emphasising that our work is a prerequisite for addressing the major challenge of semantic interoperability, rather than an alternative to it.
4. To reinforce the conceptual separation from implementation and increase transparency, we will publish the code of the OLTP module on Zenodo. We will also encourage the users of our framework to publish their specific pipelines consistently.
We hope these revisions address the criticisms you raised and clarify the contribution of our manuscript, which provides a pragmatic architectural blueprint for data management in research. Thank you again for your stimulating and helpful feedback.
SPECIFIC COMMENTS
Line 19 - "Wilkinson et al. introduced the FAIR principles" The FAIR principles were produced by FORCE11 \url{https://force11.org/info/the-fair-\data-principles/} although Wilkinson et al elaborated their interpretation and even more formally by the RDA Working Group on FAIR Data Maturity \url{https://zenodo.org/records/3909563#.YGRNnq8za70
Response: Thank you for this important clarification regarding the origins of the FAIR principles within the community. You are right that the principles emerged from a broad community process. We decided to cite Wilkinson et al. (2016) in line with standard academic practice, as this was the formal, peer-reviewed publication that introduced and defined the principles for the scientific record. As noted on the FORCE11 website that you provided, this article formally published the principles. To acknowledge this critical nuance, we will revise the sentence in the manuscript to better reflect the community origin. The revised sentence will read: "The FAIR (findability, accessibility, interoperability, and reusability) data principles (Wilkinson et al., 2016), which emerged from the FORCE11 community,..."
Line 38 - "For this reason, spatial data must have explicit locational information in its metadata" For FAIR all datasets (and for that matter software services and workflows) need to have metadata. This is not discussed in the paper, nor is there any suggestion of adopting/improving widely used metadata standards such as DCAT \url{https://dcat.org/} and particularly extensions (APs: Application Profiles) that allow for domain specialisation while retaining interoperability and generality through the main entities of DCAT. Use of such standards improves greatly interoperability.
Response: This passage does not deny that detailed metadata is required for a dataset to comply with the FAIR principles. Instead, we focus on the specific challenges associated with spatially explicit data, which we consider to be a unique feature of geoscientific data. To clarify this, we will rephrase the sentence as "Next to conventional metadata describing the dataset, spatial data must also have explicit locational information." This makes clear that the paper's aim is not to provide a comprehensive discussion of metadata's significance, but to address the inherent structural challenges of geoscientific data.
While we agree with the importance of metadata, the demand for further discussion on metadata standards misses the scope and objective of the presented system. As mentioned previously, we are not introducing a system for sharing static scientific data. However, DCAT refers to "an RDF vocabulary for representing data catalogues" (https://www.w3.org/TR/vocab-dcat-3/#dcat-scope). Furthermore, DCAT is built around distinct datasets which "represent a collection of data published or curated by a single agent or identifiable community". Demanding its adoption, or even improvement, within the scope of our work misconstrues not only the nature of relational databases, but also the objectives and properties of our approach as a whole. If anything, DCAT is applied after our approach.
Line 46 - "a comprehensive, interdisciplinary approach has been missing" See EPOS https://www.epos-eu.org/ and for workflows its extension through DT-GEO https://dtgeo.eu/ and/or EGDI https://www.europe-geology.eu/
Response: Thank you for highlighting these significant European initiatives. You are right that large-scale infrastructures such as EPOS and EGDI offer comprehensive data integration platforms at a macro level. However, our work focuses on a different — and, we would argue, complementary — part of the research data lifecycle. These platforms operate at a macro-infrastructural level, focusing primarily on aggregating, harmonising, and publishing data from various sources. In contrast, our framework is a 'bottom-up' solution operating at the micro-level of an individual research project, particularly those with limited resources. Our primary goal is not data publication, but rather the management of data throughout the active research process — from field collection to final analysis.
We believe that effective management of data at its point of origin is a crucial prerequisite for it to be FAIR enough to be ingested into a system like EPOS. Our framework enables smaller projects to achieve this. DT-GEO's focus on digital twins for geophysical modelling is conceptually and technically distinct from our goal of providing a foundational data management architecture.
To clarify our position, we will add a paragraph to the introduction that explicitly distinguishes our project from large-scale publication infrastructures. This will better define the specific gap that our work aims to fill and demonstrate that our approach facilitates the preparation of data for inclusion in larger, domain-specific databases.
Line 46 - "a comprehensive, interdisciplinary approach has been missing" The paper does not provide a solution that is comprehensive; it is limited to a particular area of geoscience (sediment and soil samples)
Response: Thank you for raising this point. You are right that our case study focuses on soil and sediment samples. This gives us a valuable opportunity to clarify the intended meaning of 'comprehensive' in our manuscript. The comprehensiveness we claim is architectural and procedural, rather than disciplinary. Our framework is comprehensive in covering the entire research data lifecycle — from initial field data collection through processing and analysis to preparation for reuse.
While the case study is specific, it demonstrates how this general architectural blueprint can be implemented in a real-world research context. Although the specific data models and workflows would change, the core components (e.g., the OLTP/OLAP separation and the data pipelines) are designed to be adaptable to other geoscientific domains.
We recognise that our original wording was ambiguous. We will revise the manuscript to explicitly state that the claimed comprehensiveness refers to coverage of the data lifecycle rather than universal disciplinary applicability.
Fig. 2 - The OLTP/OLAP architecture is not suitable for real-time or event-driven workflows. Re-think the architecture to allow for real-time data ingestion and inline analysis to detect events if it is to be comprehensive.
Fig. 2 - The figure does not include cardinality and optionality symbols.
Fig. 2 - The entity sample seems to imply a physical sample of sediment or soil. Consider a sample could also be e.g., a digital seismogram or a chemical analysis of a volcanic gas if it is to be comprehensive.
Fig. 2 - There is no clear indication of what is data and what is metadata, and the schema does not match well-known schemas used in geoscience or schemas used generally in research.
Response: Thank you for your detailed and constructive feedback on Figure 2. Your comments provide us with a valuable opportunity to clarify the figure's purpose and the core design principles of our framework. The central point that addresses several of your comments is that Figure 2 is intended as a conceptual illustration rather than a complete, prescriptive implementation schema. Its primary purpose is to contrast the structure of a normalised OLTP schema (Fig. 2a) with that of a denormalised OLAP star schema (Fig. 2b), thereby explaining the architectural trade-offs. It is not intended to propose a new universal standard for geoscientific data. To clarify, we will revise the figure caption to explicitly state that it is a conceptual model. With this in mind, we would like to address your specific points.
- Data vs. Metadata & Schema Standards: You are right that the figure does not depict a specific schema, such as IGSN, or label entity types explicitly as data or metadata. This is because our framework adheres to a fundamental principle of relational database design, whereby the distinction is logical rather than physical. Both data (e.g. a measurement value) and metadata (e.g. its unit) are stored as values in columns. The context is defined by the relationships between tables. This approach avoids the rigid separation of data and metadata into separate systems, a problem we explicitly address. We will clarify the logical handling of metadata in the revised figure caption.
- Definition of Sample: Thank you for highlighting this ambiguity. In our model, a sample is deliberately defined as a physical specimen (e.g., sediment or rock). Other items, such as chemical analyses or seismograms, are modelled as analyses or measurements associated with that sample. This conceptual separation is fundamental to the framework's fidelity. To avoid ambiguity for a broader audience, we will add a clarifying sentence to the text.
- Cardinality and Optionality: We agree. Adding this notation will make the conceptual model more informative. We will revise Figure 2 to include the correct symbols in the updated manuscript.
- Real-time vs. Batch Processing: You are right to point out that our architecture is not a real-time stream processing engine. This is a deliberate design choice tailored to the dominant workflows in our target geoscientific domain, where data is typically processed in batches. However, we would like to clarify that our architecture fully supports event-driven workflows. The decoupled orchestration layer (Dagster) can trigger pipelines in response to events, such as the arrival of a new data file; a minimal sketch of such a trigger follows this response. As the paper's core message is an architectural blueprint for batch-oriented research projects, we feel that a detailed discussion of real-time architectures would be beyond its scope.
We are confident that these clarifications and revisions will address your concerns and significantly improve the clarity of our manuscript. To further clarify how our model builds on that provided by Nordsiek and Halisch (2024) and extends it in terms of relational strictness, we will add an explanatory note to Chapter 3.2 and include a detailed description of the data model in the appendix.
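For illustration, the following is a minimal sketch of how such an event-driven trigger could look with Dagster's sensor API; the drop-folder path, op, and job names are placeholders and not part of the published framework.

```python
import os

from dagster import Definitions, RunRequest, SkipReason, job, op, sensor

INCOMING_DIR = "/data/incoming"  # hypothetical drop folder for newly arriving field or lab files


@op(config_schema={"path": str})
def ingest_file(context):
    # Placeholder for the actual extract/load logic of the framework's pipelines.
    context.log.info(f"Ingesting {context.op_config['path']}")


@job
def ingest_job():
    ingest_file()


@sensor(job=ingest_job)
def new_file_sensor():
    # One RunRequest per file; Dagster deduplicates on run_key, so each file triggers exactly one run.
    files = sorted(os.listdir(INCOMING_DIR)) if os.path.isdir(INCOMING_DIR) else []
    if not files:
        yield SkipReason("No new files found")
        return
    for name in files:
        yield RunRequest(
            run_key=name,
            run_config={"ops": {"ingest_file": {"config": {"path": os.path.join(INCOMING_DIR, name)}}}},
        )


defs = Definitions(jobs=[ingest_job], sensors=[new_file_sensor])
```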
Line 161 - relationships into smaller tables that are linked by foreign keys More modern relational database work has relationships themselves as entities (tables) linking e.g., sample and device thus a tuple in each of the base tables is related by the relationship table and can be read as e.g., sample X - was collected by - device Y (ideally with added temporal and spatial information in the relationship table)
Response: Thank you for this precise suggestion. You are right that the most robust way to model many-to-many relationships is to use a dedicated relationship table. This is a well-established principle of relational database theory (e.g., Codd, 1970) and our OLTP module strictly adheres to it. The reason this level of detail is not immediately apparent in Figure 2 is that the diagram is a conceptual simplification designed to illustrate the high-level differences between OLTP and OLAP structures. As mentioned in our previous response, we will include a detailed description of the data model in the appendix to address your point fully and make our implementation explicit. This appendix will clearly demonstrate how we use relationship tables to model key many-to-many associations in our framework.
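To make this explicit ahead of the appendix, here is a minimal, simplified sketch of such a relationship table using SQLAlchemy's declarative mapping; the entity and column names are illustrative only and do not reproduce the actual data model. It also illustrates the logical handling of data and metadata described in the previous response: the measured value and its unit are ordinary columns of the same table.

```python
from datetime import datetime

from sqlalchemy import Column, DateTime, Float, ForeignKey, Integer, String, create_engine
from sqlalchemy.orm import declarative_base, relationship

Base = declarative_base()


class Sample(Base):
    """A physical specimen, e.g. a sediment or soil sample."""
    __tablename__ = "sample"
    id = Column(Integer, primary_key=True)
    label = Column(String, nullable=False)


class Device(Base):
    """An instrument used to collect or measure samples."""
    __tablename__ = "device"
    id = Column(Integer, primary_key=True)
    name = Column(String, nullable=False)


class SampleCollection(Base):
    """Relationship table: sample X - was collected by - device Y, with when and where."""
    __tablename__ = "sample_collection"
    sample_id = Column(Integer, ForeignKey("sample.id"), primary_key=True)
    device_id = Column(Integer, ForeignKey("device.id"), primary_key=True)
    collected_at = Column(DateTime, default=datetime.utcnow)
    latitude = Column(Float)
    longitude = Column(Float)
    sample = relationship(Sample)
    device = relationship(Device)


class Analysis(Base):
    """A measurement on a sample; the value (data) and its unit (metadata) are plain columns."""
    __tablename__ = "analysis"
    id = Column(Integer, primary_key=True)
    sample_id = Column(Integer, ForeignKey("sample.id"), nullable=False)
    analysis_type = Column(String, nullable=False)  # e.g. "grain_size" or "geochemical_fingerprint"
    parameter = Column(String, nullable=False)      # e.g. "D50" or "Zr/Ti"
    value = Column(Float, nullable=False)
    unit = Column(String)                           # e.g. "µm"; dimensionless ratios stay NULL
    sample = relationship(Sample)


# Create the schema in a local SQLite file purely for demonstration.
engine = create_engine("sqlite:///demo.db")
Base.metadata.create_all(engine)
```

In such a layout, a geochemical fingerprint would simply be a set of analysis rows with analysis_type = 'geochemical_fingerprint' attached to the same sample.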
Line 191 - In this section there is no discussion of using CWL (Common Workflow Language) https://www.commonwl.org/ which would allow for interoperability / reusability, nor the use of 'research object crates' https://www.researchobject.org/ro-crate/ to enable interoperability / reusability: reasons for rejecting these solutions should be explained
Response: Thank you for raising this important point. As detailed in our primary response to your general critique, we will address this issue by revising Chapter 3.4 (Data Pipelines) to clarify the distinction between data pipelines and scientific workflows, such as CWL. A new paragraph in the Discussion section will emphasise that our framework provides a foundation for, rather than replaces, these essential standards.
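To make the distinction tangible, here is a minimal sketch of a data pipeline in this sense, expressed as three Dagster software-defined assets covering extract, transform, and load; the file path, cleaning rules, connection string, and table name are placeholders rather than the pipelines actually used in the Toboliu case study.

```python
import pandas as pd
from dagster import Definitions, asset
from sqlalchemy import create_engine


@asset
def raw_grain_size() -> pd.DataFrame:
    # Extract: read a raw laboratory export (placeholder path).
    return pd.read_csv("data/raw/grain_size_export.csv")


@asset
def clean_grain_size(raw_grain_size: pd.DataFrame) -> pd.DataFrame:
    # Transform: drop incomplete records and enforce a numeric type for the D50 column.
    df = raw_grain_size.dropna(subset=["sample_id", "d50_um"]).copy()
    df["d50_um"] = df["d50_um"].astype(float)
    return df


@asset
def grain_size_table(clean_grain_size: pd.DataFrame) -> None:
    # Load: write the analysis-ready data into the project database.
    engine = create_engine("postgresql+psycopg2://user:password@localhost/research")
    clean_grain_size.to_sql("grain_size", engine, if_exists="append", index=False)


defs = Definitions(assets=[raw_grain_size, clean_grain_size, grain_size_table])
```

A CWL-defined scientific workflow would then operate on the resulting analysis-ready table (or an export of it), not on the raw laboratory files.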
Line 310 - "geochemical fingerprints" No mention of geochemical measurements in the schema (Fig. 2)
Response: Thank you for your comment. You are correct that geochemical measurements are not included in Figure 2. This is because, as detailed in our previous responses, the figure is a high-level conceptual illustration. In our actual data model, however, a geochemical fingerprint is handled exactly as one would expect in a robust relational system: it is modelled as a specific type of analysis linked to a physical specimen via a relationship. This flexible approach is a core strength of our design. It is an excellent example of the implementation-level detail omitted from the conceptual figure for clarity, which will be made explicit in the detailed data model to be added to the appendix.
Line 369 - "integrating metadata and measurements into a unified data model, ensuring that all information is stored consistently and is easily accessible." Even for small projects it is possible that datasets are stored on different servers in discipline-based laboratories (e.g., geochemistry on one server, grain size data on another) and so the model of metadata in a database on one server (possibly replicated) pointing to files on many different servers may be more generally applicable.
Response: The situation you describe, of datasets being stored on different servers in discipline-based laboratories, is precisely the core problem of data fragmentation and siloing that our framework is designed to solve.
Your suggestion of a metadata model pointing to files on different servers physically separates the data by design, severely limiting data integration and complex cross-dataset analysis. It is not possible to run a single query across geochemical data on one server and grain size data on another.
Our framework takes a different, more powerful approach. Rather than merely pointing to distributed files, our data pipelines are designed to ingest data from various sources (e.g., geochemistry and grain-size servers) and integrate it into the unified OLTP/OLAP model.
The purpose of our architecture is to move beyond a simple catalogue and create a single, consistent, queryable environment where geochemical and grain size data reside together. This enables advanced integrated analyses that are impossible in the federated model you describe.
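As an illustration of what this deep integration permits, the following sketch runs a single query that joins geochemical and grain-size results for the same samples; the table and column names follow the simplified schema sketched earlier in this discussion and are hypothetical, not the actual warehouse schema.

```python
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("postgresql+psycopg2://user:password@localhost/research")

# One statement answers a cross-disciplinary question: how does the Zr/Ti ratio
# relate to the median grain size for every sample that has both analyses?
query = """
SELECT s.label,
       geo.value AS zr_ti_ratio,
       gs.value  AS d50_um
FROM sample AS s
JOIN analysis AS geo ON geo.sample_id = s.id
                    AND geo.analysis_type = 'geochemical_fingerprint'
                    AND geo.parameter = 'Zr/Ti'
JOIN analysis AS gs  ON gs.sample_id = s.id
                    AND gs.analysis_type = 'grain_size'
                    AND gs.parameter = 'D50'
"""
df = pd.read_sql(query, engine)
print(df.head())
```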
We acknowledge that our manuscript did not make this fundamental design choice and its rationale explicit enough. The fact that our solution can be mistaken for the problem it solves is a clear sign that we must improve our explanation. Therefore, we will add a dedicated paragraph to the discussion. There, we will explicitly contrast our 'unified data warehouse' approach with the 'data catalogue' model, explaining the trade-offs and justifying why our deep integration is essential for achieving the analytical goals of modern, data-driven geoscientific research.
Line 461 - "expanding interdisciplinary collaboration" (Some) other disciplines have well established metadata schemas, workflows (pipelines): there is no explanation in the paper to support the assertion i.e. how would the system interact with extant systems in geoscience, soil science, climate science, environmental science, biodiversity… and even archaeology? The very nature of the system is to be small, project-based, using locally-designed solutions, limited by the institutional IT infrastructure.
Response: Thank you for this important question regarding the final sentence of our manuscript. It highlights a misunderstanding that we will address in two points.
1. Our framework is a scalable blueprint, not a small, local solution. The comment asserts that our system is small, project-based and uses locally designed solutions. This is incorrect. It stems from confusing an instance of the framework, such as the Toboliu case study, with the framework's standardised, project-independent design. Our concluding sentence is the paper's central thesis. It is based on the framework's standardised design, which can be deployed for any number of projects. The architecture's entire purpose, especially that of the OLAP module for large-scale, cross-project analysis, is the opposite of a small, local solution since it is used to manage and integrate datasets across multiple projects.
2. Our final statement is not a vague assertion. It is based on the most critical capability of our framework: the deep integration of heterogeneous data into a unified model. While the manuscript does not detail specific application programming interfaces (APIs) for external systems, it does explain the foundational mechanisms that facilitate interdisciplinary collaboration.
- Overcoming Data Silos: The main issue in interdisciplinary work is fragmented data. Our framework addresses this issue by consolidating data from various sources, such as geochemistry and sedimentology, into a single, unified system. This internal integration is the essential first step for any meaningful external collaboration. You cannot share what you have not first integrated.
- Enabling a "Single Source of Truth": Our framework transforms diverse data into a consistent structure, creating a single, queryable source of truth. This unified dataset provides the necessary foundation for building interoperable data products or APIs for other systems, such as those in soil and climate sciences.
- Flexibility through Pipelines: The transform step in our ETL pipelines is an architectural component designed explicitly to handle data heterogeneity. Although we have only demonstrated this with internal data, the same mechanism could be used to implement transformations from external schemas, thereby making the framework extensible by design; a minimal sketch of such a schema-mapping transform follows this response.
We concede that the manuscript does not make this strategic position explicit enough. The fact that our scalable blueprint could be mistaken for a small, local solution demonstrates the need for further elaboration. We will therefore add a dedicated section to the discussion titled 'Positioning the Framework in the Research Data Ecosystem'. In this section, we will clarify that our primary contribution is the creation of a deeply integrated, analysis-ready data asset. This asset is the foundational prerequisite for subsequent steps, such as interdisciplinary data sharing and interaction with larger, external systems.
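To indicate how such an external transformation could look, here is a minimal sketch of a mapping step that renames columns and harmonises units from a hypothetical external laboratory export into the internal layout; the column names and the unit conversion are invented for illustration.

```python
import pandas as pd

# Hypothetical mapping from an external laboratory's column names to the internal schema.
COLUMN_MAP = {
    "SampleID": "sample_id",
    "MedianGrainSize_mm": "d50_um",
}


def transform_external_grain_size(raw: pd.DataFrame) -> pd.DataFrame:
    # Rename the external columns and convert millimetres to micrometres.
    df = raw.rename(columns=COLUMN_MAP)[list(COLUMN_MAP.values())].copy()
    df["d50_um"] = df["d50_um"] * 1000.0
    return df


# Such a function would slot into the transform step of an existing pipeline,
# so external data ends up in the same load step as internally collected data.
example = pd.DataFrame({"SampleID": ["TB-001"], "MedianGrainSize_mm": [0.12]})
print(transform_external_grain_size(example))
```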
Line 462 - "Code availability." This paragraph indicates that the code is not open source. Hardly FAIR (which applies to software and workflows as well as data).
Response: We agree that the goal of making software and workflows as open and accessible as possible is crucial. Your comment has made it clear that our current Code availability statement is inadequate and misleading. You are correct in stating that the code is not currently available under a full open-source licence. This is a consequence of the framework's two-part architectural design. The implementation consists of two distinct components:
- The generic, reusable OLTP module that serves as the core blueprint of our framework.
- The project-specific data pipelines and analyses, which are naturally tailored to our institutional IT infrastructure and the Toboliu project's data sources and are therefore not generally reusable.
Your comment prompted us to reconsider how best to support the FAIR principles, which involves clearly separating these two components. The OLTP module is the core intellectual contribution and the most reusable part. To maximise the reusability and FAIRness of our work and encourage its adoption and further development by the community, we will publish the Python code for the OLTP module under a permissive open-source licence on Zenodo. This will assign the code a DOI, making it findable, accessible, and reusable for any research group wishing to adopt our approach.
We will revise the 'Code availability' section of the manuscript accordingly and provide a DOI for the repository.
Citation: https://doi.org/10.5194/egusphere-2025-4832-AC1
RC2: 'Comment on egusphere-2025-4832', C. Kristina Rossavik, 09 Nov 2025
General Comments
A report that presents data organization structure for geoscientific databases with geospatial properties and data pipelines.
Specific Comments
May want to use a more updated citation related to (Zickel et al., 2025).
Technical Corrections
Lines 60 and 64: Capitalize “address”.
Line 76: Suggestion to add “the” before “geosciences”, or make geosciences singular (“geoscience”).
Line 104: Remove e in ‘graine’.
Line 215: Add a space after grain in “grainsize”.
Line 275: Typo in “ the interfac ereturns key” - Add e to the end of “interfac”, remove e in ‘ereturns’.
Citation: https://doi.org/10.5194/egusphere-2025-4832-RC2
AC2: 'Reply on RC2', Dennis Handy, 02 Dec 2025
We would like to thank you for your accurate summary of our work, as well as for your suggestions regarding technical corrections. Regarding your specific comment on the Zickel et al. (2025) citation, you are right to draw attention to it. This citation refers to a paper that is about to be submitted to a preprint server. We will update the reference to reflect this in the revised manuscript.
Citation: https://doi.org/10.5194/egusphere-2025-4832-AC2
Viewed
| HTML | PDF | XML | Total | BibTeX | EndNote |
|---|---|---|---|---|---|
| 179 | 78 | 21 | 278 | 17 | 17 |
Review Comments
https://egusphere.copernicus.org/preprints/2025/egusphere-2025-4832/egusphere-2025-4832.pdf
Technical comments

| Line | Text | Suggestion |
|---|---|---|
| 104 | graine | grain |
| 105 | by | at |
| 163 | datbase | database |
| 275 | interfac ereturns | interface returns |