A new sub-chunking strategy for fast netCDF-4 access in local, remote and cloud infrastructures, chunkindex V1.1.0
Abstract. NetCDF (Network Common Data Form) is a self-describing, portable and platform-independent format for array-oriented scientific data that has become a community standard for sharing measurements and analysis results in oceanography, meteorology, and the space domain.
The volume of scientific data is growing rapidly, which poses challenges for the efficient storage and sharing of these data. The object storage paradigm that emerged with cloud infrastructures can help address data storage and parallel access issues.
The ample network bandwidth available within cloud infrastructures makes it possible to exploit large amounts of data. Processing data where they are located is preferable, as it can result in substantial resource savings. However, some use cases still require downloading data from the cloud, and results have to be fetched once processing tasks have been executed there.
However, network bandwidth and quality can vary significantly depending on the available resources: connections range from fiber-optic and copper links to satellite links with poor reception in degraded conditions, on board ships, among other scenarios. It is therefore crucial that formats and software libraries be designed to optimize data access by limiting transfers to what is strictly necessary.
By design, the netCDF data format offers such capabilities. A netCDF-4 file is composed of a pool of chunks: each chunk is a small, independent unit of data that may be compressed and is read or written in a single I/O operation. In this article, we reuse the notion of sub-chunk introduced in the kerchunk library (Sterzinger et al., 2021): a sub-chunk is a sub-part of a chunk.
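As an illustration of this chunked layout (not taken from the datasets assessed in this report), the following sketch creates a netCDF-4 variable stored as deflate-compressed chunks with the netcdf4-python library; the file name, dimension sizes and chunk layout are arbitrary.

```python
from netCDF4 import Dataset
import numpy as np

# Minimal sketch: create a netCDF-4 variable stored as compressed chunks.
# File name, dimensions and chunk layout are illustrative only.
with Dataset("example.nc", "w", format="NETCDF4") as ds:
    ds.createDimension("time", 365)
    ds.createDimension("lat", 720)
    ds.createDimension("lon", 1440)
    sst = ds.createVariable(
        "sst", "f4", ("time", "lat", "lon"),
        chunksizes=(1, 720, 1440),  # one chunk per time step
        zlib=True, complevel=4,     # each chunk is deflate-compressed
    )
    sst[0, :, :] = np.zeros((720, 1440), dtype="f4")
```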
Sub-chunking strategies help reduce the amount of data transferred by splitting netCDF chunks into even smaller units of data. Kerchunk limits this capability to uncompressed chunks. Our approach goes further: the chunkindex library indexes the content of compressed chunks, which enables the retrieval of sub-chunks without reading and decompressing the entire chunk. This approach targets access patterns such as time series in netCDF-4 datasets formatted with large chunks, and it has the advantage of not requiring the entire file to be reformatted.
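To illustrate the sub-chunk idea on the uncompressed case handled by kerchunk, the sketch below fetches a single row of an uncompressed chunk by reading only its byte range; the URL, chunk offset, chunk shape and data type are placeholders (in practice the chunk offset comes from the HDF5 chunk index or a kerchunk reference file), and this does not show the compressed-chunk indexing performed by chunkindex.

```python
import fsspec
import numpy as np

# Illustrative sketch of a sub-chunk read on an UNCOMPRESSED chunk:
# a sub-chunk is a contiguous byte range inside the chunk, so it can be
# fetched directly without transferring the whole chunk.
URL = "s3://my-bucket/example.nc"   # placeholder location
CHUNK_OFFSET = 4096                 # byte offset of the chunk in the file (placeholder)
CHUNK_SHAPE = (720, 1440)           # chunk layout: (lat, lon)
DTYPE = np.dtype("<f4")

row = 360                           # sub-chunk: one latitude row of the chunk
row_nbytes = CHUNK_SHAPE[1] * DTYPE.itemsize
start = CHUNK_OFFSET + row * row_nbytes

with fsspec.open(URL, "rb", anon=True) as f:  # anon=True assumes a public bucket
    f.seek(start)
    sub_chunk = np.frombuffer(f.read(row_nbytes), dtype=DTYPE)
```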
This report provides a performance assessment of netCDF-4 data access for various use cases and conditions: local POSIX and S3 storage, as well as a simulated degraded network connection. The results may provide guidance on the most suitable and most efficient library for reading netCDF data in different situations.
Another outcome of this study is an assessment of the impact of the libraries used to access the data: while an extensive literature compares the performance of different file formats (open, read and write), the impact of the standard libraries themselves remains poorly studied. This study evaluates the performance of four Python libraries (netcdf4-python, xarray, h5py, and a custom chunk-indexing library) for reading parts of the datasets through fsspec or s3fs. To complete the study, a comparison with the cloud-oriented formats Zarr and ncZarr is conducted. Results show that the h5py library provides high efficiency for accessing netCDF-4 formatted files and performance close to that of Zarr on an S3 server. These results suggest that it may not be necessary to convert netCDF-4 datasets to Zarr when moving to the cloud, which would avoid reformatting petabytes of data and costly adaptations of scientific software.
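As an example of the access pattern evaluated here, the following sketch reads a time series from a netCDF-4 file on S3 with h5py through fsspec; the bucket, file path, variable name and grid indices are placeholders.

```python
import fsspec
import h5py

# Illustrative sketch: read a time series from a netCDF-4 file stored on S3.
# Bucket, path and variable name are placeholders; anon=True assumes a
# publicly readable bucket.
with fsspec.open("s3://my-bucket/example.nc", "rb", anon=True) as f:
    with h5py.File(f, "r") as ds:
        # h5py reads only the chunks intersecting the selection; fsspec
        # fetches them from the object store with HTTP range requests.
        series = ds["sst"][:, 360, 720]
```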