A database-driven research data framework for integrating and processing high-dimensional geoscientific data

Handy, Dennis; Van der Meij, W. Marijn; Zickel, Mirijam; Reimann, Tony

doi:10.5194/egusphere-2025-4832

Preprints

https://doi.org/10.5194/egusphere-2025-4832

Preprints

07 Oct 2025

| 07 Oct 2025

A database-driven research data framework for integrating and processing high-dimensional geoscientific data

Dennis Handy, W. Marijn Van der Meij, Mirijam Zickel, and Tony Reimann

Abstract. This paper introduces a modular research data framework designed for geoscientific research across disciplinary boundaries. It is specifically designed to support small research projects, that need to adhere to strict data management requirements from funding bodies, but often lack the financial and human resources to do so. The framework supports the transformation of raw research data into scientific knowledge. It addresses critical challenges, such as the rapid increase in the volume, variety and complexity of geoscientific datasets, data heterogeneity, spatial complexity, and the need to comply to the FAIR (Findable, Accessible, Interoperable, and Reusable) principles. The approach optimises the research management process by enhancing scalability and enabling interdisciplinary integration. It is adaptable to evolving research requirements and it supports various data types and methodological approaches, such as machine learning and deep learning, that have high requirements on the used data and their formats. A case study in Western Romania presents the data framework's application in an interdisciplinary geoarchaeological research project by processing and storing heterogeneous datasets, demonstrating its potential to support geoscientific research data management by reducing data management efforts, improving replicability, findability and reproducibility and streamlining the integration of high-dimensional data.

Received: 30 Sep 2025 – Discussion started: 07 Oct 2025

Publisher's note: Copernicus Publications remains neutral with regard to jurisdictional claims made in the text, published maps, institutional affiliations, or any other geographical representation in this paper. While Copernicus Publications makes every effort to include appropriate place names, the final responsibility lies with the authors. Views expressed in the text are those of the authors and do not necessarily reflect the views of the publisher.

Download & links

Dennis Handy, W. Marijn Van der Meij, Mirijam Zickel, and Tony Reimann

Status: final response (author comments only)

RC1: 'Comment on egusphere-2025-4832', Anonymous Referee #1, 08 Oct 2025

Review Comments

https://egusphere.copernicus.org/preprints/2025/egusphere-2025-4832/egusphere-2025-4832.pdf

General Comments

The paper describes an IT system set up for general use with an example implementation based on soil and sediment information from an archaeological project in Romania. The system consists of an OLTP and OLAP component, tied together with pipelines and each having a (relational) data schema although the OLAP schema is de-normalised.

The work is interesting and not atypical of many IT systems constructed in many scientific academic departments. It is well designed for the purpose. It may provide some guidance for others developing such systems.

The problem is the claim to generality of application, and to interoperability. On both claims, the difficulty is that the IT system is locally designed, clearly heavily influenced by an individual project and takes no account of wider developments in generality and interoperation aspects of IT systems (across a range of scientific disciplines). Thus, there is no motivation for this work to be seen as part of global information (see detailed comments).

There is no reference – for example – to the metadata schemas of the European Open Science Cloud which aims for the generality and interoperability this paper claims (but does not demonstrate). It does not mention more recent work on Scientific Knowledge Graphs. It does not mention the leading large geoscience IT systems in Europe (which provide generality and interoperability) nor those in North America, East Asia and Australasia. The data model and schemas are not compared with those of the aforementioned systems.

The pipelines (workflows) appear not to use CWL (Common Workflow Language) which is a basis for generality and interoperability. Similarly, in many scientific disciplines the use of RO-Crates (Research Object Crates) is widely encouraged for generality and interoperability (including provenance and reproducibility).

The difficult problem of semantic consistency and formalisation is mentioned in lines 375-380. However, this is consigned to future work and there is no roadmap or plan of how such semantic consistency – for generality and interoperation – is to be achieved except a mention of discipline-related thesauri and a nod to ontologies. The key research question is how to bridge across heterogeneous thesauri – it requires semantic relationships and is likely to include also probabilistic measures.

Specific Comments

line	text	suggestion

19	Wilkinson et al. introduced the FAIR principles	The FAIR principles were produced by FORCE11 https://force11.org/info/the-fair-data-principles/ although Wilkinson et al elaborated their interpretation and even more formally by the RDA Working Group on FAIR Data Maturity https://zenodo.org/records/3909563#.YGRNnq8za70
38	For this reason, spatial data must have explicit locational information in its metadata	For FAIR all datasets (and for that matter software services and workflows) need to have metadata. This is not discussed in the paper, nor is there any suggestion of adopting/improving widely used metadata standards such as DCAT https://dcat.org/ and particularly extensions (APs: Application Profiles) that allow for domain specialisation while retaining interoperability and generality through the main entities of DCAT. Use of such standards improves greatly interoperability.
46	a comprehensive, interdisciplinary approach has been missing	See EPOS https://www.epos-eu.org/ and for workflows its extension through DT-GEO https://dtgeo.eu/ and/or EGDI https://www.europe-geology.eu/
46	a comprehensive, interdisciplinary approach has been missing	The paper does not provide a solution that is comprehensive; it is limited to a particular area of geoscience (sediment and soil samples)
Fig2		The figure does not include cardinality and optionality symbols
Fig2	The OLTP/OLAP architecture is not suitable for real-time or event-driven workflows	Re-think the architecture to allow for real-time data ingestion and inline analysis to detect events if it is to be comprehensive
Fig2	The entity sample seems to imply a physical sample of sediment or soil	Consider a sample could also be e.g., a digital seismogram or a chemical analysis of a volcanic gas if it is to be comprehensive
Fig 2		There is no clear indication of what is data and what is metadata, and the schema does not match well-known schemas used in geoscience or schemas used generally in research
161	relationships into smaller tables that are linked by foreign keys	More modern relational database work has relationships themselves as entities (tables) linking e.g., sample and device thus a tuple in each of the base tables is related by the relationship table and can be read as e.g., sample X - was collected by - device Y (ideally with added temporal and spatial information in the relationship table)
191		In this section there is no discussion of using CWL (Common Workflow Language)https://www.commonwl.org/ which would allow for interoperability / reusability, nor the use of ‘research object crates’ https://www.researchobject.org/ro-crate/ to enable interoperability / reusability: reasons for rejecting these solutions should be explained
310	geochemical fingerprints	No mention of geochemical measurements in the schema (Fig 2)
369	integrating metadata and measurements into a unified data model, ensuring that all information is stored consistently and is easily accessible.	Even for small projects it is possible that datasets are stored on different servers in discipline-based laboratories (e.g., geochemistry on one server, grain size data on another) and so the model of metadata in a database on one server (possibly replicated) pointing to files on many different servers may be more generally applicable
461	expanding interdisciplinary collaboration	(Some) other disciplines have well established metadata schemas, workflows (pipelines): there is no explanation in the paper to support the assertion i.e. how would the system interact with extant systems in geoscience, soil science, climate science, environmental science, biodiversity… and even archaeology? The very nature of the system is to be small, project-based, using locally-designed solutions, limited by the institutional IT infrastructure.
462	Code availability.	This paragraph indicates that the code is not open source. Hardly FAIR (which applies to software and workflows as well as data)

Technical comments

104	graine	grain
105	by	at
163	datbase	database
275	interfac ereturns	interface returns

Citation: https://doi.org/10.5194/egusphere-2025-4832-RC1

RC2: 'Comment on egusphere-2025-4832', C. Kristina Rossavik, 09 Nov 2025

General Comments
A report that presents data organization structure for geoscientific databases with geospatial properties and data pipelines.
Specific Comments
May want to use a more updated citation related to (Zickel et al., 2025).
Technical Corrections
Lines 60 and 64: Capitalize “address”.
Line 76: Suggestion to add “the” before “geosciences”, or make geosciences singular (“geoscience”).
Line 104: Remove e in ‘graine’.
Line215: Add a space after grain in “grainsize”.
Line 275: Typo in “ the interfac ereturns key” - Add e to the end of “interfac”, remove e in ‘ereturns’.

Citation: https://doi.org/10.5194/egusphere-2025-4832-RC2

Dennis Handy, W. Marijn Van der Meij, Mirijam Zickel, and Tony Reimann

Viewed

Total article views: 182 (including HTML, PDF, and XML)

HTML	PDF	XML	Total	BibTeX	EndNote
128	42	12	182	14	14

HTML: 128
PDF: 42
XML: 12
Total: 182
BibTeX: 14
EndNote: 14

Views and downloads (calculated since 07 Oct 2025)

Month	HTML	PDF	XML	Total
Oct 2025	109	28	10	147
Nov 2025	19	14	2	35

Cumulative views and downloads (calculated since 07 Oct 2025)

Month	HTML	PDF	XML	Total
Oct 2025	109	28	10	147
Nov 2025	19	14	2	35

Viewed (geographical distribution)

Total article views: 173 (including HTML, PDF, and XML) Thereof 173 with geography defined and 0 with unknown origin.

Country	#	Views	%

Latest update: 20 Nov 2025

Short summary

Geoscientific projects often struggle to manage complex data effectively, resulting in valuable information being lost due to poor findability and accessibility. To address this, we present a comprehensive research data framework for storing and processing data throughout a project, from fieldwork to data analysis. This ensures that datasets are clearly defined, reproducible and adhere to the FAIR principles (findability, accessibility, interoperability and reusability).


Total:	0
HTML:	0
PDF:	0
XML:	0