A guide to optimised spatiotemporal data co-location by mutual information maximisation

Martin, Andrew Steven; Guy, Heather; Gallagher, Michael Ray; Neely III, Ryan Reynolds

doi:10.5194/egusphere-2025-6079

Preprints

https://doi.org/10.5194/egusphere-2025-6079

17 Dec 2025

Abstract. The matching of data described on different coordinate systems between multiple data sources – spatiotemporal co-location – is a necessary and crucial step in geospatial data synthesis and validation. The particular choice of co-location scheme, and the choice of parameters applied to it, decide what subsets of the original datasets are included in downstream analyses, affecting the quantitative outputs of comparison studies and multi-retrieval synthesised datasets. Previously, no generalised framework for deciding how best to co-locate data has existed. We outline a domain- and data-agnostic framework that generalises the process of selecting an optimised co-location parametrisation for a given co-location scheme, by maximising the mutual information encoded between the data included in the subsequent analyses. We demonstrate the framework by applying it to a comparison of vertical cloud fraction profiles retrieved from the polar-orbiting ICESat-2 satellite's ATL09 data product, and surface-based observations at four Cloudnet observatories. We evaluate per-site optimised co-location parametrisations and find that using the optimised co-location parametrisations quantitatively improves the comparison between the datasets over naive choices of co-location parameters. This work has implications across almost all remote sensing data products – especially for satellite validations – and will facilitate deep learning methodologies by producing paired datasets with the maximal information about the structure between datasets available to be learned.
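As a rough illustration of the idea, the following Python sketch scores a handful of candidate co-location radii on synthetic data and keeps the one with the highest estimated mutual information. Everything in it is an assumption made for illustration and is not the authors' implementation: the AR(1) field standing in for along-track variability, the noise levels, the candidate radii, and the simple rank-binned estimator `binned_mi`. Averaging more satellite samples first suppresses noise and then starts mixing in decorrelated samples, so the score is expected to peak at an intermediate radius.

```python
import numpy as np

def binned_mi(x, y, bins=8):
    """Plug-in mutual information (nats) on equal-occupancy (rank) bins."""
    def to_bins(v):
        ranks = np.argsort(np.argsort(v))        # 0 .. n-1
        return (ranks * bins) // len(v)          # 0 .. bins-1
    bx, by = to_bins(x), to_bins(y)
    joint = np.zeros((bins, bins))
    np.add.at(joint, (bx, by), 1.0)              # 2-D histogram of bin pairs
    joint /= joint.sum()
    px = joint.sum(axis=1, keepdims=True)        # marginal of x, shape (bins, 1)
    py = joint.sum(axis=0, keepdims=True)        # marginal of y, shape (1, bins)
    nz = joint > 0
    return float(np.sum(joint[nz] * np.log(joint[nz] / (px @ py)[nz])))

rng = np.random.default_rng(0)
n_pass, n_track, a = 2000, 64, 0.9               # hypothetical toy settings

# Synthetic along-track field: correlation with the site decays as a**distance.
field = np.empty((n_pass, n_track))
field[:, 0] = rng.standard_normal(n_pass)
for d in range(1, n_track):
    field[:, d] = a * field[:, d - 1] + np.sqrt(1 - a**2) * rng.standard_normal(n_pass)

ground = field[:, 0] + 0.1 * rng.standard_normal(n_pass)    # site observation
satellite = field + 2.0 * rng.standard_normal(field.shape)  # noisy track samples

# Score each candidate radius (in track samples) by the MI between the
# radius-averaged satellite value and the ground value, then take the argmax.
candidates = [1, 2, 4, 8, 16, 32, 64]
scores = {R: binned_mi(satellite[:, :R].mean(axis=1), ground) for R in candidates}
best_R = max(scores, key=scores.get)
print(best_R, scores)
```

The same loop extends to a two-parameter grid (for example radius and time window) by scoring every pair of candidates; a real application would also need a significance test on the estimated scores, which this sketch omits.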

Received: 05 Dec 2025 – Discussion started: 17 Dec 2025

Publisher's note: Copernicus Publications remains neutral with regard to jurisdictional claims made in the text, published maps, institutional affiliations, or any other geographical representation in this paper. While Copernicus Publications makes every effort to include appropriate place names, the final responsibility lies with the authors. Views expressed in the text are those of the authors and do not necessarily reflect the views of the publisher.

Notice on discussion status
The requested preprint has a corresponding peer-reviewed final revised paper. You are encouraged to refer to the final revised version.

Journal article(s) based on this preprint

27 May 2026

A guide to optimised spatiotemporal data co-location by mutual information maximisation

Andrew Steven Martin, Heather Guy, Michael Ray Gallagher, and Ryan Reynolds Neely III

Atmos. Meas. Tech., 19, 3511–3537, https://doi.org/10.5194/amt-19-3511-2026, 2026

Interactive discussion

Status: closed

RC1:
'Comment on egusphere-2025-6079', Anonymous Referee #1, 23 Jan 2026
The paper outlines an objective method to identify the parameters of a collocation scheme, illustrated for the comparison of ICESat-2 to Cloudnet profiles of cloud mask by the selection of a radial separation in the satellite track and temporal window for the ground-based field. The algorithm optimises the mutual information content provided by paired collocation observations, arguing that this value increases as the volume of data considered increases until such time as uncorrelated observations begin to contaminate the set.
I cannot more strongly recommend this paper for publication. It was an absolute delight to read and an astonishingly good document for a junior researcher. I have some minor comments and corrections that may assist in the uptake of this method by the atmospheric science community (who are generally unfamiliar with formal mathematics or statistics), but mostly wish to thank the authors for providing me with a rewarding read. I look forward to applying the technique when I next need to run a validation study.
Minor comments:
The text in figures would, ideally, be similar in size to the text within the captions. While those reading a PDF can zoom in, those of us foolish enough to print off papers to get some time away from a screen cannot read any of the labels in the figures of this paper. Is it possible to regenerate the diagrams with text scaled to their printed size?

Though it isn’t necessary, the paper may be improved by a qualitative description of some of the mathematical operations in S2.3 as most researchers in atmospheric science do not have much training in the formalism of mathematics. For example, “nearest neighbour distances between samples” would likely be interpreted to refer to the physical distance between observations rather than the distance in state space. The references given are accurate but not necessarily useful for the audience of this journal; a textbook or lecture course may be useful (as was done for the description of the copula).
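To make the state-space reading concrete, here is a minimal sketch of one widely used nearest-neighbour estimator of mutual information (the Kraskov–Stögbauer–Grassberger form). It is an assumption that this resembles the estimator in S2.3; the function name `ksg_mi`, the choice k = 4, and the synthetic Gaussian test data are illustrative only. The neighbour search runs on the paired values (x_i, y_i), so two observations taken far apart on the Earth can still be nearest neighbours if their values are similar.

```python
import numpy as np

def ksg_mi(x, y, k=4):
    """KSG (algorithm 1) mutual information estimate in nats, O(n^2) memory."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(x)
    dx = np.abs(x[:, None] - x[None, :])         # pairwise distances in x values
    dy = np.abs(y[:, None] - y[None, :])         # pairwise distances in y values
    dz = np.maximum(dx, dy)                      # max-norm distance in (x, y) state space
    np.fill_diagonal(dz, np.inf)                 # a sample is not its own neighbour
    eps = np.sort(dz, axis=1)[:, k - 1]          # distance to the k-th nearest neighbour
    # Count samples strictly closer than eps in each marginal (minus the sample itself).
    nx = np.sum(dx < eps[:, None], axis=1) - 1
    ny = np.sum(dy < eps[:, None], axis=1) - 1
    # Digamma at positive integers: psi(m) = -gamma + H_{m-1}.
    harmonic = np.concatenate(([0.0], np.cumsum(1.0 / np.arange(1, n + 1))))
    psi = lambda m: -np.euler_gamma + harmonic[m - 1]
    return float(psi(k) + psi(n) - np.mean(psi(nx + 1) + psi(ny + 1)))

rng = np.random.default_rng(0)
x = rng.standard_normal(1000)
y_dep = 0.8 * x + 0.6 * rng.standard_normal(1000)   # correlation 0.8 with x
y_ind = rng.standard_normal(1000)                   # independent of x

mi_dep = ksg_mi(x, y_dep)
mi_ind = ksg_mi(x, y_ind)
exact = -0.5 * np.log(1 - 0.8**2)                   # Gaussian closed form, about 0.51 nats
print(mi_dep, mi_ind, exact)
```

For jointly Gaussian data the estimate can be compared against the closed form $-\tfrac{1}{2}\ln(1-\rho^2)$, which is why that case is used as the check here.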

L68: It’s not clear to me why optimising the correlation coefficient biases the results high. This may be a matter of definition, as I’m reading this line to say “we are preferentially selecting results where one is consistently larger than the other” as that is how ‘bias’ tends to be used in atmospheric science. You may have meant “we are preferentially selecting results that are large” (in magnitude) or “we are preferentially selecting results that resemble each other”. My instinct is that the sample will be biased but with unpredictable sign. There is every chance this is a standard result in statistics of which I am unaware.

L161 and L163: The values given in the text differ slightly from those given in Fig. 2. Please check which is correct.

L270: I agree that, in the 2-D presentation of Fig. 4, Cloudnet data is a point. However, in 3-D space, the satellite swath is a (mostly vertical) surface and the site is a vertical line. I’m not sure the document would be improved by making this point, but the pedant in me could not leave this distinction unmentioned.

L401: It is not obvious to me that the number of profiles scales with $\cosh(R)$. I would have expected this to come from the rule of cosines giving the length of a bisector of a circle to be $R\sqrt{2(1-\cos\gamma)}$; hence linear $R$ dependence. Regardless, if you are correct, isn’t $\lim_{R \to \infty} \cosh(R) = \infty$ rather than $R$?
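For reference, the two standard results this comment leans on can be written out as follows. The symbols $c$ (chord length) and $\gamma$ (angle subtended at the centre, with $0 \le \gamma \le 2\pi$) are introduced here to match the reviewer's expression; nothing below is taken from the manuscript's own derivation.

```latex
% Law of cosines for the isosceles triangle formed by two radii R and the chord c
c^2 = R^2 + R^2 - 2R^2\cos\gamma = 2R^2\,(1-\cos\gamma)
\quad\Longrightarrow\quad
c = R\sqrt{2(1-\cos\gamma)} = 2R\sin\!\left(\tfrac{\gamma}{2}\right),
% so c grows linearly with R when gamma is held fixed.

% The hyperbolic cosine is unbounded
\cosh R = \tfrac{1}{2}\left(e^{R} + e^{-R}\right) \ge \tfrac{1}{2}\,e^{R}
\;\longrightarrow\; \infty \quad \text{as } R \to \infty .
```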

Fig. 5 and L422: The discontinuity in significant solutions deserves some more comment to guide readers in how to communicate their confidence in the selection of optimal parameters. A continuous region of significance can sensibly be communicated through traditional uncertainty notation (even if not strictly appropriate), but this second solution is much harder to communicate. Would the authors expect this degeneracy to resolve as more data is added to the problem (i.e. wait for a longer dataset) or is this an unavoidable aspect of the problem? What would you do if the difference in mutual information between the two solutions was negligible?

L425: I understand this section to argue that $I(p)$ decreases as you move away from $\hat{p}$, but I’m unsure that ‘unimodal’ is the correct word for that. Unimodal means the surface is single-valued at each point – which is true by construction for $I$. Collapsing the 2-D surface into a function of distance from $\hat{p}$ clearly isn't unimodal because the surfaces clearly aren’t isotropic about that point. ‘Largely monotonic’ strikes me as more appropriate but several options are available.

L778: If the authors think there is any general utility to this simulation, it may be worth mentioning within the main body that $N > 10^4$ achieves consistency in these estimators. Some guide on the number of observations necessary to apply this method would be useful to most readers.

App. C: This is probably just a difference between the language of a mathematician and a physicist, but I would call these three points approximations rather than assumptions that are broken.

Sec. 3.6.2: There is no way you could have known this, but section 4 of https://doi.org/10.1007/s10712-025-09898-4 published after your submission explores some of the issues with collocation you discuss at the end of this section. Further, in the aerosol community there is precedent in the publications of Nick Schutgens for the conclusion that a one-size-fits-all collocation criteria is sub-optimal, such as Fig. 1 of https://doi.org/10.5194/acp-20-12431-2020 or the conclusions of https://doi.org/10.5194/acp-17-9761-2017. Admittedly, Nick aims to minimise the bias, which is exactly what you argue against doing, but you might find the comparison could be useful.

Technical corrections:
L16: validation and constraining

L62: expectation to the comparison

Tab. 1: There should be a space before N_{profiles}.

L598: these different quantities

L710: There should be a space after the comma.

Eq. C10: Shouldn’t it be $\lambda_{21}$ here? If I’m correct, it might be clearer for C11 to show $\cos\lambda_{21}$ and switch to $\cos\lambda_{12}$ for C12.

Maybe C14 should come before C13 as the more natural equation to follow from C5?

L888: this bearing into an across-track

I randomly tried a few of the DOI links in the references; Crameri 2023 and Palm 2021a were dead links for me.
Citation: https://doi.org/10.5194/egusphere-2025-6079-RC1
- AC1: 'Reply on RC1', Andrew Martin, 27 Mar 2026
  
  We thank the reviewer for their comments. Our response is given in the supplementary PDF file.
  
  Citation: https://doi.org/10.5194/egusphere-2025-6079-AC1
RC2:
'Comment on egusphere-2025-6079', Anonymous Referee #2, 25 Jan 2026

The comment was uploaded in the form of a supplement: https://egusphere.copernicus.org/preprints/2025/egusphere-2025-6079/egusphere-2025-6079-RC2-supplement.pdf

Citation: https://doi.org/10.5194/egusphere-2025-6079-RC2
- AC2: 'Reply on RC2', Andrew Martin, 27 Mar 2026
  
  We thank the reviewer for their comments. Our response is given in the supplementary PDF file.
  
  Citation: https://doi.org/10.5194/egusphere-2025-6079-AC2

Peer review completion

AR – Author's response | RR – Referee report | ED – Editor decision | EF – Editorial file upload

AR by Andrew Martin on behalf of the Authors (13 Apr 2026) Author's response Author's tracked changes Manuscript

ED: Publish subject to minor revisions (review by editor) (07 May 2026) by Luca Lelli

Dear authors,

thanks for uploading this revised version of the manuscript. Apologies for the delay in processing it but EGU takes its toll in terms of time and capacity.

I have enjoyed the review process and I am ready to accept the work for final publication upon consideration of two suggestions I have after my own read and understanding.

1) To reviewer#1 "What would you do if the difference in mutual information between the two solutions was negligible?" you correctly answer that it is an independent choice of the researcher what scenario to choose. And this aspect must not be overlooked. I’d suggest to be as clear as possible and add a concise sentence explaining this, thereby leaving the door open to one's independent reasoning.

2) I think the reviewer is right when he points to the possibility that unimodality is not proven. In the spirit of mathematical rigor, the observation that the function I decreases from p0 in all directions identifies p0 as a local maximum, but is not sufficient to establish unimodality, as per its formal definition.

To call it unimodal one must additionally guarantee, either by structural assumptions (function's concavity or by topology of the sample space – i.e. the domain is compact and concave) or by global analysis, that p0 is the unique global maximum and that no other local extrema (saddle points or maxima) exist in the sample space. In the absence of such guarantees, the function may still be multimodal despite the local decrease condition being satisfied.

An immediate counterexample would be

$I(p) = \sin(\|p\|^2)$, where the sample space is $\mathbb{R}^2$

The function decreases in all radial directions from a local maximum, yet it is multimodal with several singular maximum values.
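The counterexample can be checked numerically with a short sketch. The function name `info_surface`, the sampling range, and the step count are illustrative choices, not part of the editor's comment. Because the function depends only on $\|p\|$, sampling along a single ray from the origin is enough, and each peak in that radial profile corresponds to a ring of equal-height maxima in the plane.

```python
import math

def info_surface(px, py):
    """Editor's counterexample: I(p) = sin(||p||^2) on R^2."""
    return math.sin(px * px + py * py)

# Sample I along one ray from the origin; any direction gives the same profile.
n_steps = 5000
radii = [5.0 * i / n_steps for i in range(n_steps + 1)]
profile = [info_surface(r, 0.0) for r in radii]

# Interior grid points that are higher than both neighbours are local maxima.
peaks = [
    radii[i]
    for i in range(1, n_steps)
    if profile[i - 1] < profile[i] > profile[i + 1]
]
peak_values = [info_surface(r, 0.0) for r in peaks]

print(len(peaks), peaks, peak_values)
```

The peaks sit where $\|p\|^2 = \pi/2 + 2\pi n$, so more than one maximum of the same height appears as soon as the sampled range extends past the first ring.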

So, please remove the claim of unimodality.

Other than this, I am happy to have reviewed this insightful work.

Best regards
Luca Lelli

AR by Andrew Martin on behalf of the Authors (12 May 2026) Author's response Author's tracked changes Manuscript

ED: Publish as is (13 May 2026) by Luca Lelli

AR by Andrew Martin on behalf of the Authors (18 May 2026) Manuscript

Data sets

Mutual information maximisation for spatiotemporal co-location: ICESat-2 ATL09 and Cloudnet categorize Andrew Martin https://doi.org/10.5281/zenodo.17817304

Interactive computing environment

DAndrewA/a-guide-to-optimised-spatiotemporal-data-co-location-by-mutual-information-maximisation: v1.0.1 Andrew Martin https://doi.org/10.5281/zenodo.17830442

Viewed

Total article views: 2,740 (including HTML, PDF, and XML)

HTML	PDF	XML	Total	BibTeX	EndNote
1,516	1,033	191	2,740	133	95

Views and downloads (calculated since 17 Dec 2025)

Month	HTML	PDF	XML	Total
Dec 2025	365	250	55	670
Jan 2026	518	253	45	816
Feb 2026	196	181	27	404
Mar 2026	322	248	51	621
Apr 2026	67	47	6	120
May 2026	34	45	2	81
Jun 2026	9	4	1	14
Jul 2026	5	5	4	14

Viewed (geographical distribution)

Total article views: 2,718 (including HTML, PDF, and XML) Thereof 2,718 with geography defined and 0 with unknown origin.

Latest update: 22 Jul 2026

Short summary

Matching geospatial data between datasets recorded on different coordinate systems requires choosing parameters that impact the subset of data in downstream analyses. We developed a framework to optimise the choice of parameters by maximising the mutual information between the data being compared. The optimised parameters vary spatially, and using the optimised parameters results in better comparisons between data than using fixed choices of parameters.

