The Stippled Gridpoints are Statistically Significant: (Mis)uses of False Discovery Rate Correction for Geospatial Data
Abstract. Peer-reviewed articles in the geosciences routinely assess statistical significance in spatially distributed data. Significance is often assessed independently at each grid point, whereas formal adjustment for multiple testing is applied less consistently. Although several approaches to account for multiple testing exist, applying them to geoscience data is not always straightforward, because these data often exhibit spatially coherent signals.
In this work, we revisit multiple-testing correction in the context of spatially structured datasets. We first highlight how neglecting multiple-testing correction can substantially inflate the number of false positives. We further show that the global false discovery rate (FDR) approach, proposed in the literature for application in the geosciences, can yield counterintuitive and potentially misleading results when applied to spatially coherent signals. To illustrate the latter point, we provide an example based on near-surface air temperature composites following sudden stratospheric warmings. We show that when anomalies are spatially coherent, restricting the spatial domain can increase the FDR-adjusted significance threshold. Consequently, the same underlying field can appear more statistically significant solely because of domain selection, even though the data are unchanged. We explain this behavior in terms of the rank-based structure of the FDR procedure and discuss its implications for spatial inference and uncertainty quantification in the geosciences.
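The domain dependence described above can be illustrated with a minimal sketch. It assumes that the global FDR approach corresponds to the Benjamini-Hochberg step-up procedure, in which the sorted p-values p_(k) are compared with alpha * k / m, where m is the number of tests in the chosen domain. The function name `bh_threshold` and the p-values are hypothetical and purely synthetic; they are not the temperature composites analysed in this work.

```python
import numpy as np


def bh_threshold(p_values, alpha=0.05):
    """Return the Benjamini-Hochberg p-value threshold (0.0 if nothing is rejected).

    Hypothetical helper for illustration: it finds the largest sorted
    p-value p_(k) that satisfies p_(k) <= alpha * k / m.
    """
    p = np.sort(np.asarray(p_values, dtype=float).ravel())
    m = p.size
    critical = alpha * np.arange(1, m + 1) / m  # rank-dependent critical values
    below = np.nonzero(p <= critical)[0]
    if below.size == 0:
        return 0.0
    return float(p[below[-1]])


# Synthetic p-values (assumption): a coherent "signal" region of 20 grid points
# with small p-values, embedded in 180 "noise" grid points.
signal = np.linspace(0.0001, 0.04, 20)
noise = np.linspace(0.05, 1.0, 180)
full = np.concatenate([signal, noise])

# Same signal p-values, two different domains.
thr_full = bh_threshold(full)          # all 200 grid points
thr_restricted = bh_threshold(signal)  # domain restricted to the signal region

n_full = int(np.sum(full <= thr_full))
n_restricted = int(np.sum(signal <= thr_restricted))

print(f"full domain:       threshold={thr_full:.4f}, rejected={n_full}")
print(f"restricted domain: threshold={thr_restricted:.4f}, rejected={n_restricted}")
```

In this synthetic setting the p-values in the signal region are identical in both calls, yet the threshold is larger for the restricted domain because the critical values alpha * k / m scale with the number of tests m.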
Building on these insights, we outline practical recommendations for transparent and robust significance assessment in geoscientific applications. These include clearly documenting multiple-testing corrections when adjusted pointwise significance is shown, interpreting adjusted thresholds cautiously, and considering spatially aware alternatives such as regional or cluster-based inference when appropriate.
Overall, our results highlight both the need to account for multiple testing and the potential pitfalls of applying and interpreting the FDR correction naïvely. We hope that our work contributes to more robust statistical testing in the geosciences.