Triple collocation validates CONUS-wide evapotranspiration inferred from atmospheric conditions
Abstract. Large-scale estimation of evapotranspiration (ET) remains challenging because no direct remote sensing estimates of ET exist and because most data-driven estimation approaches require assumptions about the impact of moisture conditions and biogeography on ET. The surface flux equilibrium (SFE) approach offers an alternative, deriving ET directly from atmospheric temperature and humidity under the assumption that conditions in the atmospheric boundary layer reflect ET’s land boundary condition. We present a 4 km resolution, continental United States-wide, daily ET dataset spanning from 1979 to 2024 using the SFE method. The Bowen ratio is first calculated using the SFE method solely based on temperature and specific humidity estimates from gridMET and then converted to ET using net radiation and ground heat fluxes from ERA5-Land. We evaluate its performance using extended triple collocation to estimate the standard deviation of the random error and the correlation coefficient of SFE ET compared to true ET, as well as those of three widely used alternative ET datasets: GLEAM, FluxCom, and ERA5-Land. Despite its extreme simplicity, SFE ET achieves performance comparable to or exceeding the other datasets across large portions of CONUS, particularly in the Western U.S., while requiring no information about land surface, vegetation, or soil properties and no assumptions about ET’s response to environmental and climate drivers. Our results support the use of SFE as a scalable, observation-driven method for estimating ET.
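For concreteness, the conversion step described above can be sketched as follows. This is a minimal illustration assuming the standard surface energy-balance partition LE = (Rn − G)/(1 + β), with β the SFE-derived Bowen ratio; the function name is hypothetical, and reading the "10% ground heat flux" assumption discussed in the reviews as G = 0.1·Rn is my own interpretation, not the authors' implementation.

```python
LAMBDA_V = 2.45e6  # latent heat of vaporization (J kg-1), approximate value near 20 degC

def bowen_to_et(bowen_ratio, rn, g=None):
    """Convert a Bowen ratio to ET (mm/day) via the surface energy balance.

    bowen_ratio : beta = H / LE (dimensionless), e.g. from the SFE relation
    rn          : daily mean net radiation (W m-2)
    g           : ground heat flux (W m-2); defaults to 0.1 * Rn, my reading of
                  the "10% ground heat flux" assumption discussed in the reviews
    """
    if g is None:
        g = 0.1 * rn
    le = (rn - g) / (1.0 + bowen_ratio)  # from Rn - G = H + LE and beta = H / LE
    return le / LAMBDA_V * 86400.0       # kg m-2 s-1 -> mm/day (1 kg m-2 = 1 mm of water)

# Example: beta = 0.5 and Rn = 150 W m-2 give roughly 3 mm/day
print(bowen_to_et(0.5, 150.0))
```

The point of the sketch is only that, once β is known, land-surface information enters solely through the available energy Rn − G.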
Status: final response (author comments only)
- RC1: 'Comment on egusphere-2025-4225', Alexander Gruber, 21 Oct 2025
- AC1: 'Reply on RC1', Erica McCormick, 30 Jan 2026
- RC2: 'Comment on egusphere-2025-4225', Anonymous Referee #2, 02 Jan 2026
Summary and significance
This manuscript fits well within the scope of Hydrology and Earth System Sciences. The authors introduce a new daily, 4 km evapotranspiration (ET) dataset over the continental US (CONUS) using the surface flux equilibrium (SFE) approach and compare it against other ET products (GLEAM, FluxCom, ERA5-Land). The authors present a careful statistical evaluation via triple collocation, yielding random-error and correlation-to-truth metrics. This manuscript is well written and conceptually clear. I particularly appreciate the great effort and care the authors have put into explaining how SFE compares to other ET estimation approaches and into laying out its assumptions, strengths, and limitations.
The demonstration that SFE has comparable performance to more complex approaches in many regions, particularly in the western US, is useful, as it adds confidence in SFE as a practical alternative for estimating ET.
I recommend publication with minor revisions for clarity.
Suggestions
Figure 1: Clarify what panels b-g are showing by giving them a title and explicitly labeling the x-axis. The current x-axis was confusing, and I suggest writing out the month and year (e.g., Dec 2000).
Methods:
- L141-144: I understand that the input data for SFE have been shown to be robust at the eddy-covariance tower level (addressed in the introduction, Thakur et al., 2025). This may extend to the gridMET and ERA5-Land data used in this analysis, but can the authors tie that logic directly into Section 2.1? Can the authors address the biases of gridMET and ERA5-Land and how these may affect SFE ET?
- L145: Can the authors justify the 10% ground heat flux (G) assumption with a citation, or provide a sensitivity analysis showing how varying G affects σ_ε and R_T? The former is more reasonable to accomplish, but I would still want to know how the authors expect σ_ε and R_T to change if G is varied (e.g., 5-20%); see the sketch after this list.
- L145: Can the authors explain how including days with negative net radiation (Rn) can affect daily ET estimation and triple collocation statistics and justify their exclusion?
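To make the requested sensitivity check concrete, here is a minimal sketch (using the hypothetical bowen_to_et helper from the sketch after the abstract) of how ET responds when the assumed ground-heat-flux fraction is varied over the 5-20% range; it is an illustration of the idea under the fixed-fraction assumption, not an analysis of the actual data.

```python
# Hypothetical sensitivity check for the ground-heat-flux fraction f, where G = f * Rn.
# If G is always a fixed fraction of Rn, then ET = (1 - f) * Rn / ((1 + beta) * lambda_v),
# so changing f rescales ET multiplicatively; in TC such a rescaling is absorbed by beta,
# meaning sigma_eps scales by the same factor while R_T should be unchanged.
beta = 0.5   # example SFE Bowen ratio
rn = 150.0   # example daily net radiation (W m-2)
baseline = bowen_to_et(beta, rn, g=0.10 * rn)
for frac in (0.05, 0.10, 0.15, 0.20):
    et = bowen_to_et(beta, rn, g=frac * rn)
    print(f"G = {frac:.0%} of Rn: ET = {et:.2f} mm/day "
          f"({et / baseline - 1:+.1%} vs. the 10% baseline)")
```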
Discussion:
- I suggest adding a brief discussion about expected SFE performance outside CONUS and considerations for global implementation. How do the authors expect SFE to perform in regions with weaker land-atmosphere coupling (e.g., Southeast Asia)?
- L572: The authors note that SFE bias in arid conditions needs further investigation. Can the authors add specific recommendations for future investigation and/or indicate what additional measurements would be needed (i.e., advocate for specific measurements)?
Citation: https://doi.org/10.5194/egusphere-2025-4225-RC2
- AC2: 'Reply on RC2', Erica McCormick, 30 Jan 2026
RC1: 'Comment on egusphere-2025-4225', Alexander Gruber, 21 Oct 2025
This manuscript presents a rigorous analysis of the uncertainties in evapotranspiration estimates from "classical" methods and an alternative method, using triple collocation analysis. It is very well written, clear, sound, and relevant, and it fits well within the scope of HESS. I recommend this manuscript be published after minor revisions, which mainly concern clarifications of the methodology and justifications for certain assumptions.
My two main concerns are:
Sec. 2.2: How justified is the linear error model for ET? Given the non-linear nature of Eq. 1, I am a bit worried that it might not be. Then again, I don't know much about error structures in ET data, so this is more of a personal gut feeling. I can imagine other people having similar concerns though, so perhaps you could add some words on that, or a reference to previous work that has looked into it?
Discussion: Your discussion revolves around the different patterns you see in \sigma_eps and R_T. If I understand your methodology correctly, you compare *unscaled* \sigma_eps estimates (Eq. 6). How meaningful is such a comparison? In Fig. 2 you show clearly that the different data sets have a different mean and variability, so we do expect variations in the \beta terms. I'm not an ET guy, but I assume that most data set applications would try to get rid of any systematic error and therefore scale the random errors accordingly. So I would argue that it only makes sense to compare scaled random error variances, i.e., \sigma_eps values that relate to the same signal variability. After all, it is the signal-to-noise ratio that determines how well the data set information can be separated from the underlying noise, and this is directly reflected (in a normalized way) by R_T.
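For reference, under the affine error model x_i = \alpha_i + \beta_i T + \varepsilon_i that TC assumes (errors uncorrelated with T and with each other), the scaled random error, the signal-to-noise ratio, and R_T are related as follows; this is the standard relation, not something specific to this manuscript:

```latex
\mathrm{SNR}_i = \frac{\beta_i^2\,\sigma_T^2}{\sigma_{\varepsilon_i}^2},
\qquad
\tilde{\sigma}_{\varepsilon_i} = \frac{\sigma_{\varepsilon_i}}{|\beta_i|},
\qquad
R_{T,i}^2 = \frac{\mathrm{SNR}_i}{1+\mathrm{SNR}_i}
          = \frac{\sigma_T^2}{\sigma_T^2 + \tilde{\sigma}_{\varepsilon_i}^2}.
```

In other words, comparing R_T across data sets already amounts to comparing error variances scaled to the common signal variance \sigma_T^2.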
Other comments:
It is stated repeatedly that one advantage of the SFE method is that it doesn't make assumptions about root-zone soil moisture or vegetation status. I understand that the SFE method doesn't require one to do that directly, but aren't such assumptions necessary for the computation of air temperature and humidity that are used as input for the SFE method?
The term "error" and variations thereof are used a bit loosely. There is currently a push to harmonize the terminology concerning "errors" across communities; I recommend having a look at Merchent et al. (2017) and consider adopting their proposed terminology (in particular the useage of "error" vs. "uncertainty").
Sec. 2.3: I'm missing an explanation of what you do with the redundant TCA estimates from the different triplets. In the supplement, you show the results of the individual triplets, which is fine, but in the main text it is not clear what you show. I assume it is the average of the estimates from all triplets? Did you average both \sigma_eps and R_T? If so, it is generally advised NOT to average correlation coefficients, but this advice comes from averaging Pearson correlations; I'm not sure if it holds here too. One could actually throw all four data sets into a least-squares estimator to get adjusted estimates of the signal and error variances (see Gruber et al., 2016) and then derive the R_T estimates from these, which may be a bit more robust, but I'm only speculating here. Anyway, I think it would be good to at least elaborate on what you did/show.
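To make the question concrete, here is a minimal sketch of the workflow I assume is being used: the extended-TC covariance estimators applied to every triplet containing a given data set, with the per-triplet estimates then averaged. This is my reconstruction for illustration, not the authors' code.

```python
import numpy as np
from itertools import combinations

def etc_triplet(x, y, z):
    """Extended triple collocation for target x, using auxiliary series y and z.

    Returns (sigma_eps, R_T) for x under the affine error model
    x = alpha + beta * T + eps, via the covariance-based ETC estimators.
    """
    c = np.cov(np.vstack([x, y, z]))           # 3x3 sample covariance matrix
    signal_var = c[0, 1] * c[0, 2] / c[1, 2]   # beta_x^2 * sigma_T^2 (betas of y, z cancel)
    err_var = c[0, 0] - signal_var             # sigma_eps_x^2
    r_t = np.sqrt(max(signal_var, 0.0) / c[0, 0])
    return np.sqrt(max(err_var, 0.0)), r_t

def etc_all_triplets(datasets):
    """Average the per-triplet ETC estimates for each data set in a dict of 1-D arrays."""
    names = list(datasets)
    results = {}
    for target in names:
        others = [n for n in names if n != target]
        estimates = [etc_triplet(datasets[target], datasets[a], datasets[b])
                     for a, b in combinations(others, 2)]   # 3 triplets for 4 data sets
        sig, r = zip(*estimates)
        results[target] = (float(np.mean(sig)), float(np.mean(r)))
    return results
```

With four data sets this yields three triplet estimates per target, which I assume is what the supplement maps show individually.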
Sec. 3.2: The section titles for 3.1 and 3.3 state what is shown, whereas the section title for 3.2 is a spelled-out conclusion. In the discussion, the titles change again to questions. I suggest just choosing one title naming style and staying consistent.
L81-82: "This suggests that..." The use of the embedded relative clause with a dangling preposition felt a bit awkward to read, I suggest rephrasing this sentence.
L104: As far as I know, triple collocation is similar to, but not the same as, the "three-cornered hat" approach (see e.g., Sjoberg et al., 2021). I recommend just removing this parenthetical clause.
L139: C_p here is upper case, but in Eq. 1 it is lower case. Also, perhaps change all equation symbols in the text to equation mode (italic) to be consistent with the equations?
L145: Why 10%? Can you justify that number, and might it be useful to mention the implications of this assumption?
L161: I find the explanation "By treating the product of \sigma_T as a single unknown variable ..." a bit misleading. It is not the fact that they are treated as a single variable that lets you solve for the error variance; it is the fact that the betas for two data sets cancel out in the covariance ratios, which then lets you get rid of the \sigma_T term by subtracting the resulting estimate from the variance of the data set.
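Spelled out (the standard TC derivation, for reference): with x_i = \alpha_i + \beta_i T + \varepsilon_i and errors uncorrelated with T and with each other, Cov(x_i, x_j) = \beta_i \beta_j \sigma_T^2 for i \neq j, so

```latex
\frac{\operatorname{Cov}(x_1,x_2)\,\operatorname{Cov}(x_1,x_3)}{\operatorname{Cov}(x_2,x_3)}
 = \frac{(\beta_1\beta_2\sigma_T^2)(\beta_1\beta_3\sigma_T^2)}{\beta_2\beta_3\sigma_T^2}
 = \beta_1^2\sigma_T^2,
\qquad
\sigma_{\varepsilon_1}^2 = \operatorname{Var}(x_1) - \beta_1^2\sigma_T^2 .
```

That is, it is the cancellation of \beta_2 \beta_3 \sigma_T^2 in the ratio that isolates \beta_1^2 \sigma_T^2, which is then subtracted from the data set variance.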
L199: "increasing the robustness of TC assumptions" sounds a bit odd. I guess you mean that convergence of error estimates incrases our confidence that the assumptions are valid?
L278--: You compare the ET estimates qualitatively and mention some numbers in the text, but I think it could be useful to also show a summary table with all the relevant metrics (e.g., correlations and biases between all data set combinations).
L421: The acronym MAP hasn't been introduced.
L472-482: Doing a weighted averaging comes from least-squares theory and serves the purpose of reducing random errors only. I guess what is meant by "this approach has the disadvantage of obscuring the individual problems" is that if data sets have different systematic errors, especially if they are non-stationary, then you create some uncontrolled blend of biased estimates, and any improvement is only a matter of luck, because weights derived from random error variances do not account for these biases, which are instead assumed to be zero.
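For reference, the least-squares merging referred to here uses inverse-error-variance weights, which minimize the variance of the merged random error only under the assumptions of unbiased data sets and mutually uncorrelated errors:

```latex
\hat{x} = \sum_i w_i\, x_i,
\qquad
w_i = \frac{1/\sigma_{\varepsilon_i}^2}{\sum_j 1/\sigma_{\varepsilon_j}^2},
\qquad
\operatorname{Var}(\hat{\varepsilon}) = \Big(\sum_i 1/\sigma_{\varepsilon_i}^2\Big)^{-1},
```

so any systematic differences between the data sets are simply blended with the same weights and are not reduced in any controlled way.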
L482-484: Isn't this statement trivial and already implied by the paragraph's introductory statement: "It is possible to average ET estimates weighted by each dataset's performance"?
L521: Why is this contrary to expectation? You do state that this might have to do with the lower ET amounts in these regions, so considering my argument in the beginning concerning scaling in TCA, I would argue that this is simply a result of showing unscaled \sigma_eps estimates. When looking at signal-to-noise ratios instead, this gradient vanishes, right?
L589: "complex" instead of "complicated"?
Eq. (4)-(7): The introduction of Q_ii seems a bit unnecessary to me. Since you define Q_ii just as equivalent to \sigma^2_ii, you could use the latter instead of Q directly in Eqs. 6 and 7, which I don't think would make it any more difficult to read. This might be just a personal preference though.
Figure 1: The x-axis date labelling confused me when I first looked at it. The figure caption only states "Mean annual SFE from 1979 to 2024"... Perhaps also spell out the date range shown in the example time series: "Points show time series for [...] from Dec. 2000 to Dec. 2002"?
Figure 7/8: The order of the Figure panels is inconsistent.
Supplement: I always find it hard to visually compare patterns like these. You draw the conclusion that differences are small when using different triplets, and that the assumptions can therefore be considered valid. But when exactly are differences "small enough" to draw this conclusion? There isn't an awful lot of contrast in the figures, and there indeed seem to be regions with somewhat greater differences. Perhaps it would be worth plotting the actual *differences* between the TCA results for the triplet combinations, or complementing the maps with boxplots of the differences?
References:
- Merchant et al. (2017): https://doi.org/10.5194/essd-9-511-2017
- Sjoberg et al. (2021): https://doi.org/10.1175/JTECH-D-19-0217.1
- Gruber et al. (2016): https://doi.org/10.1002/2015JD024027