A database-driven research data framework for integrating and processing high-dimensional geoscientific data
Abstract. This paper introduces a modular research data framework designed for geoscientific research across disciplinary boundaries. It is specifically designed to support small research projects that need to adhere to strict data management requirements from funding bodies but often lack the financial and human resources to do so. The framework supports the transformation of raw research data into scientific knowledge. It addresses critical challenges such as the rapid increase in the volume, variety and complexity of geoscientific datasets, data heterogeneity, spatial complexity, and the need to comply with the FAIR (Findable, Accessible, Interoperable, and Reusable) principles. The approach optimises the research management process by enhancing scalability and enabling interdisciplinary integration. It is adaptable to evolving research requirements and supports various data types and methodological approaches, such as machine learning and deep learning, which place high demands on the data used and their formats. A case study in Western Romania presents the data framework's application in an interdisciplinary geoarchaeological research project by processing and storing heterogeneous datasets, demonstrating its potential to support geoscientific research data management by reducing data management effort, improving replicability, findability and reproducibility, and streamlining the integration of high-dimensional data.
Review Comments
https://egusphere.copernicus.org/preprints/2025/egusphere-2025-4832/egusphere-2025-4832.pdf
General Comments
The paper describes an IT system set up for general use, with an example implementation based on soil and sediment information from an archaeological project in Romania. The system consists of an OLTP and an OLAP component, tied together with pipelines; each has a (relational) data schema, although the OLAP schema is de-normalised.
The work is interesting and typical of IT systems constructed in many academic science departments. It is well designed for its purpose and may provide some guidance for others developing such systems.
The problem is the claim to generality of application and to interoperability. On both claims, the difficulty is that the IT system is locally designed, clearly heavily influenced by an individual project, and takes no account of wider developments in the generality and interoperability of IT systems across a range of scientific disciplines. Thus, there is no motivation for this work to be seen as part of a global information ecosystem (see detailed comments).
There is no reference – for example – to the metadata schemas of the European Open Science Cloud, which aims for the generality and interoperability this paper claims (but does not demonstrate). It does not mention more recent work on Scientific Knowledge Graphs. It does not mention the leading large geoscience IT systems in Europe (which provide generality and interoperability), nor those in North America, East Asia and Australasia. The data model and schemas are not compared with those of the aforementioned systems.
The pipelines (workflows) appear not to use CWL (Common Workflow Language), which is a basis for generality and interoperability. Similarly, in many scientific disciplines the use of RO-Crates (Research Object Crates) is widely encouraged for generality and interoperability (including provenance and reproducibility).
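By way of illustration only, a minimal sketch of a CWL tool description, generated from Python (CWL accepts JSON as well as YAML syntax; the wrapped command and file names here are invented, not taken from the paper):

    import json

    # Minimal CWL CommandLineTool description. The wrapped command
    # "grainsize-stats" and its input/output names are hypothetical.
    tool = {
        "cwlVersion": "v1.2",
        "class": "CommandLineTool",
        "baseCommand": "grainsize-stats",
        "inputs": {
            "measurements": {"type": "File", "inputBinding": {"position": 1}},
        },
        "outputs": {
            "summary": {"type": "File", "outputBinding": {"glob": "summary.csv"}},
        },
    }

    # CWL runners such as cwltool can execute this description directly.
    with open("grainsize_stats.cwl", "w") as f:
        json.dump(tool, f, indent=2)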
The difficult problem of semantic consistency and formalisation is mentioned in lines 375-380. However, this is consigned to future work, and there is no roadmap or plan for how such semantic consistency – for generality and interoperation – is to be achieved, except a mention of discipline-related thesauri and a nod to ontologies. The key research question is how to bridge across heterogeneous thesauri; this requires semantic relationships and is likely to also include probabilistic measures.
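As a sketch of the kind of bridging meant here, SKOS mapping properties can relate concepts across thesauri, with a project-specific property carrying a probabilistic weight (the vocabulary URIs, concept names and the confidence property are all invented; the rdflib package is assumed):

    from rdflib import Graph, Literal, Namespace
    from rdflib.namespace import SKOS, XSD

    # Two hypothetical discipline thesauri and a project namespace.
    GEO = Namespace("https://example.org/geoscience-thesaurus/")
    ARCH = Namespace("https://example.org/archaeology-thesaurus/")
    EX = Namespace("https://example.org/mapping/")

    g = Graph()
    g.bind("skos", SKOS)

    # Semantic relationship between concepts from different thesauri.
    g.add((GEO.colluvium, SKOS.closeMatch, ARCH.slopeDeposit))

    # SKOS itself carries no probabilistic measure; a project-specific
    # mapping record with an ex:confidence property is one way to attach one.
    g.add((EX.m1, EX.source, GEO.colluvium))
    g.add((EX.m1, EX.target, ARCH.slopeDeposit))
    g.add((EX.m1, EX.confidence, Literal(0.8, datatype=XSD.double)))

    print(g.serialize(format="turtle"))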
Specific Comments
Each comment below gives the manuscript line (or figure), the quoted text, and a suggestion.
Line 19: "Wilkinson et al. introduced the FAIR principles"
Suggestion: The FAIR principles were produced by FORCE11 (https://force11.org/info/the-fair-data-principles/), although Wilkinson et al. elaborated their interpretation; they were formalised even further by the RDA Working Group on FAIR Data Maturity (https://zenodo.org/records/3909563#.YGRNnq8za70).
Line 38: "For this reason, spatial data must have explicit locational information in its metadata"
Suggestion: For FAIR, all datasets (and, for that matter, software services and workflows) need to have metadata. This is not discussed in the paper, nor is there any suggestion of adopting/improving widely used metadata standards such as DCAT (https://dcat.org/), particularly its extensions (APs: Application Profiles), which allow for domain specialisation while retaining interoperability and generality through the main entities of DCAT. Use of such standards greatly improves interoperability.
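For illustration, a minimal DCAT dataset record in JSON-LD, with explicit spatial metadata and a distribution (all identifiers, titles and URLs below are invented):

    import json

    # Minimal DCAT dataset description in JSON-LD; every identifier,
    # title and URL here is invented for illustration.
    record = {
        "@context": {
            "dcat": "http://www.w3.org/ns/dcat#",
            "dct": "http://purl.org/dc/terms/",
        },
        "@id": "https://example.org/dataset/soil-profiles-banat",
        "@type": "dcat:Dataset",
        "dct:title": "Soil profiles, Western Romania",
        # Explicit locational information, as the quoted sentence demands.
        "dct:spatial": "https://example.org/place/western-romania",
        "dcat:keyword": ["geoarchaeology", "soil", "sediment"],
        "dcat:distribution": {
            "@type": "dcat:Distribution",
            "dcat:downloadURL": "https://example.org/files/soil_profiles.csv",
            "dcat:mediaType": "text/csv",
        },
    }

    with open("dataset.jsonld", "w") as f:
        json.dump(record, f, indent=2)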
Line 46: "a comprehensive, interdisciplinary approach has been missing"
Suggestion: See EPOS (https://www.epos-eu.org/) and, for workflows, its extension through DT-GEO (https://dtgeo.eu/) and/or EGDI (https://www.europe-geology.eu/).
Line 46: "a comprehensive, interdisciplinary approach has been missing"
Suggestion: The paper does not provide a solution that is comprehensive; it is limited to a particular area of geoscience (sediment and soil samples).
Fig. 2: The figure does not include cardinality and optionality symbols.
Fig. 2: The OLTP/OLAP architecture is not suitable for real-time or event-driven workflows.
Suggestion: Re-think the architecture to allow for real-time data ingestion and inline analysis to detect events, if it is to be comprehensive.
Fig. 2: The entity "sample" seems to imply a physical sample of sediment or soil.
Suggestion: Consider that a sample could also be, e.g., a digital seismogram or a chemical analysis of a volcanic gas, if the framework is to be comprehensive.
Fig. 2: There is no clear indication of what is data and what is metadata, and the schema does not match well-known schemas used in geoscience or schemas used generally in research.
Line 161: "relationships into smaller tables that are linked by foreign keys"
Suggestion: More modern relational database work treats relationships themselves as entities (tables) linking, e.g., sample and device. A tuple in each of the base tables is then related by the relationship table and can be read as, e.g., sample X – was collected by – device Y (ideally with added temporal and spatial information in the relationship table).
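A minimal sketch of such a relationship table, using Python's standard-library sqlite3 module (all table and column names are invented for illustration):

    import sqlite3

    con = sqlite3.connect(":memory:")
    cur = con.cursor()

    # Base entities plus a relationship table that is itself an entity,
    # carrying temporal and spatial attributes of the relationship.
    cur.executescript("""
    CREATE TABLE sample (id INTEGER PRIMARY KEY, label TEXT);
    CREATE TABLE device (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE sample_collected_by (
        sample_id INTEGER REFERENCES sample(id),
        device_id INTEGER REFERENCES device(id),
        collected_at TEXT,   -- temporal information about the relationship
        lat REAL, lon REAL   -- spatial information about the relationship
    );
    """)

    cur.execute("INSERT INTO sample VALUES (1, 'X')")
    cur.execute("INSERT INTO device VALUES (1, 'Y')")
    cur.execute("INSERT INTO sample_collected_by VALUES (1, 1, '2024-05-03', 45.75, 21.23)")

    # Reads as: sample X - was collected by - device Y (when and where).
    for row in cur.execute("""
        SELECT s.label, d.name, r.collected_at, r.lat, r.lon
        FROM sample_collected_by r
        JOIN sample s ON s.id = r.sample_id
        JOIN device d ON d.id = r.device_id
    """):
        print(row)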
Line 191: In this section there is no discussion of using CWL (Common Workflow Language, https://www.commonwl.org/), which would allow for interoperability/reusability, nor of using RO-Crates (Research Object Crates, https://www.researchobject.org/ro-crate/) to enable interoperability/reusability. Reasons for rejecting these solutions should be explained.
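For comparison, the core of an RO-Crate is a single ro-crate-metadata.json file; a minimal sketch following the RO-Crate 1.1 structure (the data file and names below are invented):

    import json

    # Minimal ro-crate-metadata.json following the RO-Crate 1.1 layout;
    # the packaged file and its description are invented for illustration.
    crate = {
        "@context": "https://w3id.org/ro/crate/1.1/context",
        "@graph": [
            {
                "@id": "ro-crate-metadata.json",
                "@type": "CreativeWork",
                "conformsTo": {"@id": "https://w3id.org/ro/crate/1.1"},
                "about": {"@id": "./"},
            },
            {
                "@id": "./",
                "@type": "Dataset",
                "name": "Soil and sediment samples, Western Romania",
                "hasPart": [{"@id": "soil_profiles.csv"}],
            },
            {
                "@id": "soil_profiles.csv",
                "@type": "File",
                "name": "Soil profile measurements",
            },
        ],
    }

    with open("ro-crate-metadata.json", "w") as f:
        json.dump(crate, f, indent=2)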
Line 310: "geochemical fingerprints"
Suggestion: There is no mention of geochemical measurements in the schema (Fig. 2).
Line 369: "integrating metadata and measurements into a unified data model, ensuring that all information is stored consistently and is easily accessible."
Suggestion: Even for small projects, it is possible that datasets are stored on different servers in discipline-based laboratories (e.g., geochemistry on one server, grain size data on another), and so a model of metadata in a database on one server (possibly replicated), pointing to files on many different servers, may be more generally applicable.
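A minimal sketch of that model: a central metadata catalogue whose records point to files held on different discipline servers (all hostnames, paths and titles are invented):

    import sqlite3

    # Central metadata catalogue (one server); the data files themselves
    # stay on the remote discipline servers referenced by file_uri.
    con = sqlite3.connect(":memory:")
    con.execute("""
    CREATE TABLE dataset_metadata (
        id INTEGER PRIMARY KEY,
        title TEXT,
        discipline TEXT,
        file_uri TEXT
    )
    """)
    con.executemany(
        "INSERT INTO dataset_metadata (title, discipline, file_uri) VALUES (?, ?, ?)",
        [
            ("XRF measurements", "geochemistry",
             "https://geochem-lab.example.org/data/xrf_2024.csv"),
            ("Grain size distributions", "sedimentology",
             "https://sedlab.example.org/data/grainsize_2024.csv"),
        ],
    )
    for row in con.execute("SELECT title, file_uri FROM dataset_metadata"):
        print(row)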
Line 461: "expanding interdisciplinary collaboration"
Suggestion: (Some) other disciplines have well-established metadata schemas and workflows (pipelines); there is no explanation in the paper to support the assertion, i.e. how would the system interact with extant systems in geoscience, soil science, climate science, environmental science, biodiversity… and even archaeology? The very nature of the system is to be small, project-based, using locally designed solutions, limited by the institutional IT infrastructure.
Line 462: "Code availability."
Suggestion: This paragraph indicates that the code is not open source. That is hardly FAIR (which applies to software and workflows as well as data).
Technical Comments
Line 104: "graine" should read "grain".
Line 105: "by" should read "at".
Line 163: "datbase" should read "database".
Line 275: "interfac ereturns" should read "interface returns".