The Stippled Gridpoints are Statistically Significant: (Mis)uses of False Discovery Rate Correction for Geospatial Data
Abstract. Peer-reviewed articles in the geosciences routinely assess statistical significance in spatially distributed data. Statistical significance is often assessed independently at each grid point, while formal adjustment for multiple testing is applied less consistently. Although several approaches to account for multiple testing exist, their application to geoscience data is not always straightforward, as these data often exhibit spatially coherent signals.
In this work, we revisit multiple-testing correction in the context of spatially structured datasets. We first highlight how neglecting multiple-testing correction can substantially inflate the number of false positives. We further show that the global false discovery rate (FDR) approach, proposed in the literature for application in the geosciences, can yield counterintuitive and potentially misleading results when applied to spatially coherent signals. To illustrate the latter point, we provide an example based on near-surface air temperature composites following sudden stratospheric warmings. We show that when anomalies are spatially coherent, restricting the spatial domain can increase the FDR-adjusted significance threshold. Consequently, the same underlying field can appear more statistically significant solely due to domain selection, despite unchanged data. We explain this behavior in terms of the rank-based structure of the FDR procedure and discuss its implications for spatial inference and uncertainty quantification in the geosciences.
Building on these insights, we outline practical recommendations for transparent and robust significance assessment in geoscientific applications. These include clearly documenting multiple-testing corrections when adjusted pointwise significance is shown, interpreting adjusted thresholds cautiously, and considering spatially aware alternatives such as regional or cluster-based inference when appropriate.
Overall, our results highlight both the need to account for multiple testing and potential issues with a naïve application and interpretation of the FDR correction. We hope that our work may contribute to more robust statistical testing in the geosciences.
This paper points out potential problems associated with use of the FDR procedure in the context of spatially correlated hypothesis tests. The authors show that the rule of thumb recommended in the 2016 Wilks paper, which was derived from a particular synthetic data setting with relatively few locally significant gridpoints, apparently behaves badly in the extreme restricted-domain example highlighted in the present paper, and so is evidently not optimal in general.
The paper is fairly short and would be stronger if it included a range of synthetic-data simulations aimed at quantifying an optimal choice for the ratio of the FDR level to the local test level, perhaps as a function of the proportion of local tests that are nominally significant (for example, as suggested on line 196), or possibly in terms of the relationship between the domain size and the spatial autocorrelation length scale. Figure 2 seems to indicate that setting the FDR level closer to 0.05 might yield more consistent results in Figure 1d. At minimum, it would be interesting to see the counterpart of Figure 1d with the FDR level set equal to the local test level (i.e., 0.05).
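To make the restricted-domain behavior concrete, the following is a minimal sketch of the kind of toy calculation that could accompany such simulations. It is not the manuscript's analysis: the p-values are invented for illustration, the domain sizes (1000 and 100 tests) are arbitrary, and `bh_threshold` is a hypothetical helper implementing the standard Benjamini-Hochberg step-up rule. It compares the adjusted p-value threshold in a full domain and in a box around a coherent anomaly, at FDR levels of 0.10 and 0.05.

```python
import numpy as np

def bh_threshold(pvalues, alpha_fdr):
    """Benjamini-Hochberg p-value threshold; 0.0 means no test is rejected."""
    p = np.sort(np.asarray(pvalues, dtype=float).ravel())
    n = p.size
    # Step-up rule: find the largest sorted p_(i) with p_(i) <= (i / n) * alpha_fdr.
    passed = p <= alpha_fdr * np.arange(1, n + 1) / n
    return float(p[passed].max()) if passed.any() else 0.0

# Invented p-values (assumption, not data from the manuscript):
# 50 gridpoints inside a coherent anomaly, 950 gridpoints with no signal.
p_signal = np.geomspace(1e-6, 0.03, 50)
p_null = np.linspace(0.06, 1.0, 950)
p_full = np.concatenate([p_signal, p_null])  # full domain, 1000 tests
p_box = p_full[:100]                         # restricted box: anomaly + 50 null points

results = {}
for alpha_fdr in (0.10, 0.05):
    thr_full = bh_threshold(p_full, alpha_fdr)
    thr_box = bh_threshold(p_box, alpha_fdr)
    results[alpha_fdr] = {
        "thr_full": thr_full,
        "thr_box": thr_box,
        # Anomaly gridpoints flagged as significant in each domain.
        "n_full": int((p_signal <= thr_full).sum()),
        "n_box": int((p_signal <= thr_box).sum()),
    }
    print(alpha_fdr, results[alpha_fdr])
```

In this toy setup the threshold in the box exceeds the full-domain threshold at both levels, although the p-values at the anomaly gridpoints are identical: the same sorted p-values are compared with i/100 rather than i/1000 times the FDR level. A systematic version would replace the invented p-values with tests on simulated, spatially correlated fields.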
A few more specific comments:
para beginning line 65. Not quite accurate: FDR is actual, not expected number of incorrectly rejected nulls. The Benjamini-Hochberg procedure limits (“controls”) the proportion of such rejections, in expectation. So alpha-FDR = 0.1 implies that, on average, no more than 10% of rejected nulls are false positives, and indeed there may be fewer than this. (The characterization on line 91 is correct).
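For reference, the distinction can be written out in standard Benjamini-Hochberg terms. The symbols below are ours, not the manuscript's: N is the number of local tests, N_0 the number of true nulls, R the number of rejected nulls, V the number of those rejections that are false, and p_(1) ≤ ... ≤ p_(N) the sorted local p-values. The bound in the last line is the classical result for independent tests (it also holds under certain forms of positive dependence).

```latex
% Realized proportion of false rejections in one analysis
Q = \frac{V}{\max(R, 1)}

% Benjamini-Hochberg step-up rule: reject the nulls with the k smallest p-values
k = \max\left\{ i : p_{(i)} \le \frac{i}{N}\,\alpha_{\mathrm{FDR}} \right\}

% What the procedure controls: the expectation of Q, not Q itself
\mathbb{E}[Q] \le \frac{N_0}{N}\,\alpha_{\mathrm{FDR}} \le \alpha_{\mathrm{FDR}}
```

With alpha-FDR = 0.1, the expectation of Q is at most 0.1, and it is smaller still when some of the local nulls are false (N_0 < N); the realized Q in any single map may lie above or below that value.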
line 110. The test setup here appears to assume implicitly that SSW events are uniformly distributed in the November-March data window, which is substantially longer than 60 days. Is this the case in the observations? Also, is there a physical justification for use of the 60-day period, external to the test data? What is the effect of temperature nonstationarity during November-March?
line 125. The choice of the small Northern European gridbox appears to have been made a posteriori, after calculation of the initial hemispheric analysis. The several papers cited, apparently to justify this choice, presumably were based on the same or substantially overlapping historical data. The possible impact on the second, spatially restricted, analysis should be discussed more fully.