the Creative Commons Attribution 4.0 License.
Modular approach to near-time data management for multi-city atmospheric environmental observation campaigns
Abstract. Urban observation networks are becoming denser, more diverse, and more mobile, while being required to provide results in near-time. The Synergy Grant urbisphere, funded by the European Research Council (ERC), runs multiple simultaneous field campaigns in cities of different sizes. These campaigns collect data to improve weather and climate models and services, including assessing the impact of cities on the atmosphere (e.g., heat, moisture, pollutant and aerosol emissions) and people's exposure to extremes (e.g., heat waves, heavy precipitation, air pollution episodes). Here, we present a solution to the challenge of handling diverse data streams from multiple sources, scales (e.g., indoors, regional-scale atmospheric boundary layer) and cities.
For model development and evaluation in heterogeneous urban environments, we need meshed networks of in situ observations with ground-based and airborne (remote-)sensing platforms. In this contribution we describe challenges, approaches and solutions for data management, data infrastructure, and data governance to handle the variety of data streams from primarily novel modular observation networks deployed in multiple cities, in combination with existing data collected by partners, ranging in scale from indoor sensor deployments to regional-scale boundary layer observations.
A metadata system documents: (1) sensors/instruments, (2) location and configuration of deployed components, and (3) maintenance and events. This metadata system provides the backbone for converting instrument records to calibrated, location-aware, convention-aligned and quality-assured data products, according to FAIR (Findable, Accessible, Interoperable and Reusable) principles. The data management infrastructure provides services (via, e.g., APIs, Apps, ICEs) for data inspection and subsequent calculations by campaign participants. Some near real-time distributions are made to international networks (e.g., AERONET, Phenocam) or local agencies (e.g., GovDATA) with appropriate attribution. The data documentation conventions, which ensure structured data sets, are used here to improve the delivery of integrated urban services, such as to research and operational agencies, across many cities.
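To make the three metadata categories concrete, the following is a minimal Python sketch of how sensor, deployment, and event records could be combined to turn a raw instrument reading into a calibrated, location-aware, flagged record. It is an illustration only and not the urbisphere implementation: all class names, field names, the linear calibration (`gain`, `offset`), and the `qc_flag` values are assumptions made for this example.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional, Sequence


@dataclass(frozen=True)
class Sensor:
    """(1) Instrument metadata. The linear calibration is a hypothetical choice."""
    sensor_id: str
    variable: str
    unit: str
    gain: float = 1.0
    offset: float = 0.0


@dataclass(frozen=True)
class Deployment:
    """(2) Where and when a sensor was deployed. end=None means still deployed."""
    sensor_id: str
    city: str
    lat: float
    lon: float
    height_m: float
    start: datetime
    end: Optional[datetime] = None


@dataclass(frozen=True)
class Event:
    """(3) Maintenance or other event affecting a sensor during [start, end]."""
    sensor_id: str
    start: datetime
    end: datetime
    note: str


def to_product(raw_value: float, time: datetime, sensor: Sensor,
               deployments: Sequence[Deployment],
               events: Sequence[Event]) -> Optional[dict]:
    """Combine a raw reading with metadata; return None if no deployment matches."""
    dep = next(
        (d for d in deployments
         if d.sensor_id == sensor.sensor_id
         and d.start <= time and (d.end is None or time < d.end)),
        None,
    )
    if dep is None:
        return None  # no location is known for this time, so no product is made
    in_event = any(
        e.sensor_id == sensor.sensor_id and e.start <= time <= e.end
        for e in events
    )
    return {
        "time": time.isoformat(),
        "variable": sensor.variable,
        "value": sensor.gain * raw_value + sensor.offset,  # calibrated value
        "unit": sensor.unit,
        "city": dep.city,
        "lat": dep.lat,
        "lon": dep.lon,
        "height_m": dep.height_m,
        "qc_flag": "maintenance" if in_event else "ok",
    }


if __name__ == "__main__":
    utc = timezone.utc
    sensor = Sensor("T001", "air_temperature", "degC", gain=0.5, offset=-10.0)
    deployments = [Deployment("T001", "Freiburg", 47.99, 7.85, 3.0,
                              start=datetime(2024, 1, 1, tzinfo=utc))]
    events = [Event("T001", datetime(2024, 3, 1, 9, tzinfo=utc),
                    datetime(2024, 3, 1, 10, tzinfo=utc), "radiation shield cleaned")]
    print(to_product(70.0, datetime(2024, 3, 1, 9, 30, tzinfo=utc),
                     sensor, deployments, events))
```

Keeping the three record types separate means a sensor can be moved between cities or serviced without touching the raw records; only the metadata changes, and products can be regenerated from it.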
Status: final response (author comments only)
RC1: 'Comment on egusphere-2024-1469', Anonymous Referee #1, 07 Jul 2024
This manuscript addresses the very important subject of data management for large environmental datasets. The items discussed in this manuscript are often overlooked, which can lead to confusion during publication and future data analysis. The authors have obviously put a lot of thought and experience into the design of the data management platform.
There are three overarching items that should be addressed before publication. The first is that acronyms need to be defined upon first use and then their use should be consistent throughout the manuscript. The second is that some of the sentences are rather long and difficult to follow. Technical editing for clarity is recommended. The third is that some of the figures seem redundant or overly complex and simplification would greatly improve the readability of the manuscript. Some specific comments are provided in the attached pdf.
RC2: 'Comment on egusphere-2024-1469', Scotty Strachan, 09 Jul 2024
General comments:
This paper describes at high- to mid-levels a framework for data-centric management of extremely heterogeneous observational sensor networks based in urban environments. The authors describe the motivating scientific drivers as requiring both fixed/long-term as well as mobile/short-duration campaigns. This set of requirements is complex and ambitious, representing the cutting-edge challenges of applying the Internet of Things to observational science. This paper makes important contributions to approaching these challenges with a systematic and modular architecture, with attention to the scientific necessities of data integrity, traceable quality control, and data interoperability. The paper narrative is well-organized and well-written, with no real issues in terms of language, communication, or structure. The figures and tables are generally clear and communicative of key concepts. Some of the related low-level technical implementations and solutions remain vague, as well as the scale/implementation of the information technology infrastructure and related engineering roles. The paper is certainly valuable and makes a community contribution as-is (with excellent descriptions of sensor metadata and data workflow frameworks); however, the authors should consider a section (or a separate paper) detailing the back-end infrastructure solution, the associated engineering team, and any implemented development-operations steady-state workflows that support long-term resilience and consistency. Major observational networks in the publicly-funded science sphere have always struggled with prioritizing the back-end engineering to avoid "technical debt" on the decadal scale, and multi-generational operation of observational science now depends on the development-operations architecture and associated engineering management.
Specific comments:
Note: many of the following comments are made with respect to the potential of this manuscript to contribute to improving the observational community's concept of a model and modern large-scale heterogeneous framework, which the authors state is their objective.
Line 25: It may be an issue of nuance or language use, but "data management" falls short of representing the foundational technology platform/architecture of end-to-end data acquisition-transport-staging-processing-production-dissemination-archival. Modern technology-driven business applications recognize clear differences between "data science" (science requirements, analysis, products), "data management" (logical requirements, policies, workflow design, quality control), and "data engineering" (software & hardware architecture, software development-operations, performance management, end-to-end user/security/network design/implementation). Longevity, repeatability, and resiliency of a complex heterogeneous digital system require teams of expertise focused on each of these separate roles, and (all but the largest) scientific operations struggle with recognizing and provisioning them. This key figure/graphical abstract over-simplifies the technical foundations of modern and future observational science at scale.
Line 97: An observational network of this scale requires significant hands-on quality review and control in order to ensure production of scientifically-valid data. An estimate or description (somewhere in the manuscript) of the number of expert/hours/day that are needed to perform continuous data review to maintain consistent quality would be valuable.
Lines 110-147: The authors are to be commended on their attention to detail and variety in the metadata design. It would also be good for the community to know more detail about how these properties get populated (for example, use-case workflows for different categories of sensors, both in the setup/deployment phase as well as the maintenance phase). Useful details would be the method of metadata access/modification, the expected time requirements of the operators, etc. The entire manuscript describes the intended design, but reveals very little in terms of day-to-day function.
Lines 110-147; 199-209: Without knowing much about how the various databases are interacted with by operations and applications, I'll mention that there are other architectures and models for the overall data/metadata management that are emerging in the community. For example, a cloud-hosted modular multi-network sensor data system is in operation and development for several U.S. academic monitoring networks (https://dendra.science/; https://essopenarchive.org/doi/full/10.1002/essoar.10501341.1), that utilizes metadata and QC levels as "annotations" that are applied on data extraction from the RAW/L0 DB. This is a very different approach from the classic "level-version-copy" method generally used by the community, but has potential for better long-term data efficiency/scalability, semi-automated QC interface/entry improvements, and scientific stability.
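The contrast drawn in this comment can be illustrated with a short Python sketch of the "annotations applied on extraction" idea: raw (L0) points are never modified or copied into new level-versions, and corrections or QC flags are stored separately and applied only when data are read. This is a generic illustration under stated assumptions, not Dendra's actual data model or API; the `RawPoint` and `Annotation` classes, their fields, and the `extract` function are hypothetical names chosen for this example.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, Optional, Sequence


@dataclass(frozen=True)
class RawPoint:
    """An immutable raw (L0) record; t is epoch seconds."""
    t: int
    value: float


@dataclass(frozen=True)
class Annotation:
    """A QC note or correction valid on the half-open interval [start, end)."""
    start: int
    end: int
    flag: Optional[str] = None   # label attached to matching points
    offset: float = 0.0          # additive correction (hypothetical action)
    exclude: bool = False        # drop matching points from the output


def extract(raw: Iterable[RawPoint],
            annotations: Sequence[Annotation]) -> Iterator[dict]:
    """Yield quality-controlled points by applying annotations at read time."""
    for point in raw:
        value = point.value
        flags = []
        drop = False
        for ann in annotations:
            if ann.start <= point.t < ann.end:
                value += ann.offset
                if ann.flag is not None:
                    flags.append(ann.flag)
                drop = drop or ann.exclude
        if not drop:
            yield {"t": point.t, "value": value, "flags": flags}


if __name__ == "__main__":
    raw = [RawPoint(0, 1.0), RawPoint(10, 2.0), RawPoint(20, 3.0)]
    annotations = [
        Annotation(5, 15, flag="drift_corrected", offset=0.5),
        Annotation(20, 30, flag="sensor_offline", exclude=True),
    ]
    for row in extract(raw, annotations):
        print(row)
```

Because the raw store is left untouched, revising a QC decision in this sketch means editing or adding an annotation rather than regenerating and re-archiving a new copy of the data set.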
Lines 219-229: As previously mentioned, the back-end infrastructure solution is of key interest to the long-term observatory community. Information technology architecture, engineering design, features, and required human/machine resources for steady-state operations would be of significant value.
Lines 230-252: This sounds complicated to manage and ensure integrity, and it appears that there are many case-based exceptions and manual permissions-related processes to be handled. This will prove difficult to align with emerging cybersecurity requirements, and likely complicates the process of tracking data provenance. In general, friction in any process for researchers to interact with data and technology solutions means less participation and longer time-to-science, so perhaps the architectural team will keep this whole access management process high on its priority list to automate and reduce complexity.
Lines 278-287: This somewhat legacy approach to transport protocols (compared to modern large-scale business data systems in industry) is understandable but also difficult to scale and manage. Are there any considerations of using a logical network architecture based on store-and-forward publish-subscribe data aggregation protocols?
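For readers unfamiliar with the pattern raised in this question, the following in-memory Python sketch shows the basic behaviour of store-and-forward publish-subscribe: a field node buffers its messages while its link is down and forwards them in order once it reconnects, and consumers subscribe to topics rather than polling each node. It is a toy illustration only; the `Broker` and `StoreAndForwardClient` classes are hypothetical and do not correspond to any specific protocol or to the transport used in the manuscript.

```python
from collections import defaultdict, deque
from typing import Callable, Deque, Dict, List, Tuple

Handler = Callable[[str, dict], None]


class Broker:
    """Minimal topic-based broker: delivers each message to all subscribers."""

    def __init__(self) -> None:
        self._subscribers: Dict[str, List[Handler]] = defaultdict(list)

    def subscribe(self, topic: str, handler: Handler) -> None:
        self._subscribers[topic].append(handler)

    def publish(self, topic: str, message: dict) -> None:
        for handler in self._subscribers[topic]:
            handler(topic, message)


class StoreAndForwardClient:
    """Field-node client that buffers messages locally while offline."""

    def __init__(self, broker: Broker, max_buffer: int = 1000) -> None:
        self._broker = broker
        # Bounded buffer: when full, the oldest message is discarded.
        self._buffer: Deque[Tuple[str, dict]] = deque(maxlen=max_buffer)
        self.online = False

    def publish(self, topic: str, message: dict) -> None:
        self._buffer.append((topic, message))  # store first
        if self.online:
            self.flush()                       # then forward if possible

    def flush(self) -> int:
        """Forward all buffered messages in order; return how many were sent."""
        sent = 0
        while self._buffer:
            topic, message = self._buffer.popleft()
            self._broker.publish(topic, message)
            sent += 1
        return sent

    def reconnect(self) -> int:
        self.online = True
        return self.flush()


if __name__ == "__main__":
    broker = Broker()
    broker.subscribe("city/station1/temperature",
                     lambda topic, msg: print(topic, msg))
    node = StoreAndForwardClient(broker)
    node.publish("city/station1/temperature", {"t": 0, "value": 21.4})
    node.publish("city/station1/temperature", {"t": 60, "value": 21.6})
    node.reconnect()  # buffered messages are delivered now, in order
```

A production system would add persistence of the buffer across restarts, acknowledgements, and authentication, none of which are modelled here.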
Lines 350-358: I don't see separate roles for the "data management" from the "infrastructure engineering" (see my first comment for Line 25). Industry standards for highly-reliable and long-term digital infrastructure systems place greater emphasis, cost, and scale on the engineering teams as opposed to the scientific and management/policy teams.
Lines 380-386: As mentioned, this project has great potential to impact the observational science community with updated practices, architectures, demonstrated workflows, use cases, and lessons learned. Looking forward to seeing how it develops!
Line 458: I found this entire paper to be valuable, but especially the detailed information in Appendix A. The use of absolute and relative coordinate systems and the concepts of documenting sensor deployment environments for urban zones should prove to be very useful for the observational community as these kinds of standards evolve.
Citation: https://doi.org/10.5194/egusphere-2024-1469-RC2