Scalable Earth Observation Data Cubes for Advanced Analytics of Dynamic Earth Surface Processes: An Open-Source Package for Customized Processing of Sentinel-2 Data on HPCs and Beyond
Abstract. Earth Observation archives now encompass petabytes of multispectral imagery, yet transforming these heterogeneous collections into analysis-ready data (ARD) cubes remains a critical bottleneck. We present an open-source Python package that unifies cloud masking, co-registration, and super-resolution into a seamless Xarray-based workflow, tailored specifically to close practical gaps in ARD cube generation. Leveraging scalable high-performance computing (HPC) infrastructure, our framework delivers rapid, reproducible cube construction and incremental updates, enabling users to build or extend large time-series data cubes without reprocessing historical scenes. Besides HPCs, our package is also suitable for local processing of Sentinel-2 data. Our approach integrates (1) s2cloudless, a probabilistic cloud-masking algorithm offering user-defined thresholds to overcome the rigid limitations of the Sentinel-2 Scene Classification Layer (SCL) and STAC metadata; (2) AROSICS, a sliding-window co-registration routine that ensures sub-pixel alignment over complex, dynamic landscapes to produce smoother temporal metrics and more consistent change detection; and (3) SEN2SR, a deep-learning super-resolution model that refines all bands to 2.5 m, revealing fine geomorphic and ecological features previously obscured at native resolutions. Together, these components address three recurring ARD cube gaps in existing Xarray-based toolkits: adaptive cloud filtering, robust time-series alignment, and integrated spatial enhancement within a single, reproducible pipeline. To maximize accessibility and reuse, the package is accompanied by well-documented, interactive Python notebooks that guide users through configuration and end-to-end cube generation. Validated on the German Aerospace Center’s terrabyte HPDA clusters, the pipeline runs equally well on a local workstation and can be accessed at https://github.com/BaturalpArisoy/stac2cube.
This paper presents an open-source Python package that implements a series of Sentinel-2 processing steps, covering cloud detection, co-registration, and super-resolution. End-to-end pipelines of this kind are useful contributions to the broader scientific community and help democratize access to advanced methodologies.
Unfortunately, I cannot currently recommend this manuscript for publication because of several major concerns:
Major Concerns:
Minor Concerns: