the Creative Commons Attribution 4.0 License.
the Creative Commons Attribution 4.0 License.
Towards a Harmonized Operational Earthquake Forecasting Model for Europe
Abstract. We develop a harmonized earthquake forecasting model for Europe based on the Epidemictype Aftershock Sequence (ETAS) model to describe the spatiotemporal evolution of aftershock sequences. We propose a method modification that integrates information from the European Seismic Hazard Model (ESHM20) about the spatial variation of background seismicity during ETAS parameter inversion based on the expectationmaximization (EM) algorithm. Other modifications to the basic ETAS model are explored, namely fixing the productivity term to a higher value to balance the more productive triggering by highmagnitude events with their much rarer occurrence, and replacing the bvalue estimate with one relying on the bpositive method to observe the possible effect of shortterm incompleteness on model parameters. Retrospective and pseudoprospective tests demonstrate that ETASbased models outperform the timeindependent benchmark model as well as an ETAS model calibrated on global data. The backgroundinformed ETAS variant achieves the highest score in the pseudoprospective experiment, but the performance difference to the secondbest model is not significant. Our findings highlight promising areas for future exploration, such as avoiding the simplification of using a single bvalue for the entire region or reevaluating the completeness of the used seismic catalogs.
 Preprint
(7748 KB)  Metadata XML
 BibTeX
 EndNote
Status: final response (author comments only)

RC1: 'Comment on egusphere20233153', Anonymous Referee #1, 31 Jan 2024
This paper describes the first attempt to build an operational earthquake forecasting system for Europe.
My overall opinion on the paper is positive, but I think that the authors should address some points to make the paper more convincing and reproducible.Below I list my main comments that should be addressed in a revised version.
 The description of the quality of the homogeneous earthquake catalog is missing. The authors refer to the paper Danciu et al (2021), but it would be good to show something about the homogeneity in terms of magnitude. Each agency uses different magnitudes and it could be cumbersome and really challenging to homogenize them. But the homogenization of the magnitudes is essential for the goal of this paper.
So, I suggest adding some more quantitative information about the catalogs, including the kind of magnitude adopted and how different magnitudes have been homogenized. The authors use a binned magnitude of deltaM=0.2. It is not clear to me if this is an average that accounts for uncertainty in very old and very new earthquakes. In this case, the use of a mean value could not be appropriate.
In other words, what is the rationale of this choice (deltaM=0.2)? Moreover, the use of deltaM=0.2 has important consequences in terms of the bvalue. Some papers show that the use of binned magnitudes introduce a bias in the bvalue calculation. This could be important in simulating data for forecast and testing. Reading the paper, I cannot understand if the same binning has been maintained also for the newer earthquake catalog used for prospective testing. If not, the bvalue calculated using deltaM=0.2 could not be appropriate for simulating data that will be binned with a different deltaM.
This point has to be analyzed in detail in a revised version. The authors use very often the term "aftershock". I know that this is contained also in the name of the model (ETAS), but this could be very misleading, in particular for people working on seismic hazard that have a different definition of aftershocks (e.g., aftershocks can never be larger than the mainshock). I suggest replacing the term "aftershock" with the term "triggered earthquake" that is more appropriate for the ETAS model, which assumes that earthquakes can be divided only as background and triggered, not as foremainaftershocks.
 In the introduction the authors write "There is not a unique agreedupon best way to provide OEF ...". It is not clear to me if the authors are talking about communication or about scientific output. For instance, Jordan et al (2011) made the case in which OEF should be provided continuously, whereas some agencies provide this information only in some circumstances. Are the authors referring to that? Or to the challenging way in which probabilities can be communicated? In any case I would suggest being clearer on this point.
 As regards the "model fit", I am wondering why the authors do not provide the usual residual plot that shows visually if the model explains well the data. The authors use different plots, that could be ok and add more information, but they look like less informative than the residual plot (at least, this is my first impression)
 In the section "Discussion of the model fit", many results sound trivial and can be easily explained by the wellknown correlation among parameters. I would suggest making clearer what are the results that are new and cannot be explained by what we already know.
 The caption of Figure 3 should contain an explanation of the colors used in the cells of the grid.
 One of the most interesting result is that the version of the ETAS model with alpha=beta is less performing producing "explosive" earthquake sequences (branching ratios larger than 1). I am wondering if the authors are using some maximum magnitudes (or corner magnitudes as well) in their simulation. As far as I know this is an outcome of ESHM20, and it could reduce drastically the problem of "explosive" sequences. A recent paper by Mancini and Marzocchi (Seismol. Res. Lett. 2024) uses alpha=beta without having problems with "explosive" sequences. Maybe a few explanations on why the authors get explosive earthquake sequences could be worthwhile.
Citation: https://doi.org/10.5194/egusphere20233153RC1  AC1: 'Reply on RC1', Marta Han, 30 Apr 2024

RC2: 'Comment on egusphere20233153', Maximilian Werner, 26 Apr 2024
First, my apologies for the delay of this review!
This paper presents preliminary candidates for operational earthquake forecasting in Europe, based on variants of the wellestablished Epidemic Type Aftershock Sequence (ETAS) model. The authors draw on available harmonized datasets and a background model from the European Seismic Hazard model ESHM2020 to generate the panEuropean models, making good use of prior painstaking work to collate the myriad of different national catalogs and methods. The study demonstrates convincingly that the ETAS models forecast some of the observed spacetime clustering and may therefore be useful for realtime deployment. The detailed model evaluations include insample tests, which reveal some issues in some of the models as well as some issues with the applied tests, and they include a more severe outofsample pseudoprospective test, and some sober and honest analysis of the relative performance of the models. They conclude that indeed 12 of the proposed ETAS models perform best and could serve as initial candidates (to be improved in the future).
Please see my (many) comments below. It’s a well written and structured paper – and I have only a couple of major comments, everything else is largely around presentation and suggestions for improvements. I have one request to assess a forecast/simulation choice in more detail, and another to pursue some more analysis in the evaluation.
What is the influence of the chosen minimum probability level in histograms where no simulations have filled bins? How many number bins of the full spacetime (and magnitude?) bins are set to this level? How frequently are they “observed”? How frequently do these empty number bins sit next to filled bins, suggesting perhaps that interpolation would be a better approach?
The authors rightly identify some undesirable features in the retrospective hypothesis tests applied, which is a useful contribution. But that means they should pursue other/additional methods, e.g. comparing the cumulative number trajectories of observed and synthetic catalogs (see specific comment below), comparing (at least visually) the spatial distribution of the observed and synthetic catalogs, etc. Similarly, Figure 5 of the pseudoprospective information gains don’t seem very insightful – this is a first step to identify what’s gone wrong/right, but would really benefit from some analysis, e.g. are the aftershocks not well captured in space, time or both? Why are the ETAS models sometimes performing worse than the ESHM2020 model? Even qualitative visual analysis would be useful.
Otherwise, please consider the below comments as hopefully constructive suggestions.
Abstract:
L2 – is the purpose an aftershock forecast model? It sounds like it based on the first sentence. Please clarify.
The abstract would benefit from a sentence on the data the model is fit to, and any issues with “the” European catalog. Ie – what’s harmonized?
It would help to clarify that these are currently 2D seismicity forecasts, not extended ruptures with depth (which might be fruitful areas for the future).
A clearer statement of the different models explored and their rankings would be of interest.
L10 How do the findings highlight these promising areas (last sentence)? Can you express it concisely or are these speculations or a wish list?
Introduction
L60 – missing a word after m_c – maybe “more”?
The introduction could probably be shortened – e.g. the background to m_c and the GR law etc seems more relevant for a thesis than a paper – unless the authors are specifically trying to reach a broader audience.
L75: As I understand it, the Mizrahi et al 2021 model accounts for termporal – not spatial – variations in mc. Is the model being extended? If so, mention it in the abstract and here.
L81: ESHM20 contains many models, some idea of how that’s incorporated into ETAS in both the abstract as well as more discussion here would be appropriate.
L112: These statistics of mc presumably come from Danciu et al 2021? Please clarify.
Figure 1: the red dots look black because of the marker edge colour. It’s impossible to differentiate red from green dots in this figure. Try a bigger size or different/no edge colour.
L137: what do you mean here by “such differences”? Do you mean a difference in completeness magnitude, or a difference in the spatial locations? The second is indeed expected, the former perhaps less so. Please clarify.
L140: Good to clarify. Does the binning require the discrete version of the bvalue estimation procedure? Which one was used in Fig 1d? What are the uncertainties? Why show b=0.99? Perhaps this will be discussed below, then guide the reader.
L142: I would appreciate a figure of the longterm rates in these two models.
L146153: this sounds like Methods, rather than data. I recommend more discussion of the two longterm models (ie a figure of the two models and a bit more background on both models). Then move the discussion of how you map these into the background rate of ETAS into a subsection in Methods, including some discussion of how this longterm seismicity might or might not be compatible with the background (independent) rate in an ETAS model would be appropriate (or else just extend this section).
L187: again, if the ESHM20 considers longterm rate as the declustered rate, there is some consistency (even that’s debatable), but if the ESHM20 rate is the longterm, i.e. average rate, then setting this to the background rate may lead to an overestimation of the total rate (instead the average model rate should equal the ESHM20 rate, see e.g. Field et al., 2017). Or are you solely considering the relative spatial information, rather than the absolute rate? I think clarification of this point earlier in the paper would help. (You clarify this in L242, and I agree this is a good approach, but I’d mention it earlier.)
L193: ETAS_USGS: A few more details would help – isn’t this an aftershockonly model? Are these truly onesize fits all for the entire globe or are they regionalized?
L203: burn period > burnin period
L213: which events’ locations exactly – the new/simulated background events, or do you not include all background events in the burnin period?
L258: It would help to a bit more precise here, and to consider the role of the maximum magnitude or a tapered GR law. “Exploding” aftershock behavior may occur with some finite probability when the branching ratio r is >1, and sequences are finite with probability 1 when r=1. The calculation of r involves the GR law, and thus depends on your choice of pure GR vs tapered/cut GR, and thus in the latter case on Mcorner/Mmax. First, please include a short discussion/justification on your choice earlier when you introduce the magnitude distribution (including the potential for spatial variations). Second, in the context of constraining parameters for subscritical branching, these modifications matter and thus the parameter constraints change. The branching ratio r may well be <1 if the GR is tapered/cut while it may not with a pure GR law. Sornette and Werner (2005) derived r equations for a truncated GR. Finally, clarify what ‘e’ is (another variable or Euler’s number?).
L264: “standard” in the USGS software?
L286: Does this mean that the explicitly spatiallyvariable models forecast spatially uniform background rates in each spatial ESHM zone? I don’t believe so, but please clarify – if I understand correctly, the ETAS_bg models use the same procedure to simulate background events as the ETAS_0 model, but both the probability of being a background rate has changed because of different parameters & the additional constraint that the rates in different zones are relatively constrained. Is that correct?
L292: Ah, so it is a truncated GR law – does this influence the branching ratio constraint alpha = e? See earlier comment.
L295: “true” > observed? Here and below.
L320: Why only focus on retrospective consistency testing? 7 years of daily outofsample testing seems like a great start, even if not 25 years. I recommend doing these or similar tests, or some of them – the information gain assess a different aspect of the models, but tell us less about how close to the data the models are (and while the nonPoisson cellwise LL scores are a great improvement, they still neglect known correlations between spatial cells).
L338: What is the influence of this implementation and the particular choice of a minimum waterlevel on the overall scores?: how many zerobins are there in a 100k simulation set per day? How does the waterlevel compare with the background rate (and its Poisson process implied probabilities)? How “rough” is the zerobin distribution, ie could an interpolation procedure approximate the ETAS forecasts better than this waterlevel?
L419: It would be illustrative to give the range of values in days or years after which the Omori law is significantly tapered in this formulation (ie give the range of tau in days). (I see the range in L436, but give units in years).
Parameters: Can you connect your discussion of the ETAS parameter estimation with the literature more directly (e.g. do you see similar patterns to Seif et al, 2017, and others…)?
Do you have LL scores for your various ETAS model fits? Can you compare the insample fit using LL and AIC? It’s a useful indicator of fit, though not necessarily about predictive skill.
Figure 3: Please clarify these are simulations over the entire training period. Also, L279 states 100k catalogs were simulated, but here it states only 10k. I would clarify the timeframe of the retrospective testing also in the relevant Methods section around L279. These are not nextday forecasts, as in the pseudoprospective testing section.
Figure 3a: I would recommend looking at the cumulative counts in each of the catalogs and compare with the observed count. You could look at the cumulative 95% model range and compare with the observed number. But you also get important insights into how well the model is qualitatively and quantitatively reproducing the features you are hoping it will, namely aftershock clustering and background rate.
Fig3: As part of these formal tests, I would also look at visual fits, e.g. Fig 1d seems to show a comparison of observed data and the magnitude forecasts from the different bvalue models. I know the formal tests do this comparison quantitatively, but I would discuss the visual fit (which quite clearly favours the higher bvalue).
Supercritical branching when alpha is fixed: these branching ratios are indeed very high, but supercritical has been found before, e.g. Seif et al. 2017 and some of the other references you used have found similar issues. There’s clearly model misspecification that’s driving this bias, so it would be interesting to discuss possibilities: changing background rates, anisotropic spatial aftershock triggering, spatial variation in incompleteness?
L476: “suspiciously well” in spatial terms means that the models are too smooth, ie that the events occur too close to likelihood peaks without the scatter expected if the model were the data generator. That should be spelt out.
L479: refer to Fig 1d in this discussion of magnitude distribution fit to data.
L481: is it true that “ETAS_bg^(b+) uses the same spatial distribution for placing the background events” as ETAS_0^(b+)? As commented above, I understood two differences: the relative rates between zones are constrained, and the absolute rate is different, and the parameters are different, so the probability of being a background event is also affected. Could you clarify (here and above where commented)?
Irrespectively, and again, I would visually compare the forecast and observed magnitude distributions and use this to support your case that the difference in magnitude forecast performance of the two models is indeed a surprising result and indicates an issue with the Mtest (and the Stest, although I’m not sure I understand your detailed argument here, see above).
L482: “known issue”: as you know, Francesco Serafini and others identified a correlation between the Mtest and the Ntest, which is indeed problematic and being fixed. You might cite Francesco Serafini (personal communication, 2024), or perhaps refer to the manuscript in prep that might be submitted by the time this is published.
Given the (reasonable) caution you advise in interpreting these M/S/PL results, you might use visual checks, including of the spatial distributions, appropriately averaged (or not) over the many simulations, to show the extent to which models reproduce the features you expect them to.
Figure 4a would benefit from a panel underneath with the magnitudetime plot of the seismicity, which will help identify clustering and quieter periods and visually link them to the likelihood gains (or losses).
Figure 4b/c: these tables are hard to digest, despite the nice colors. Could you please plot (instead or in addition) a figure instead showing mean information gain over the ESHM20 model with 95% range (which I believe you get from the paired ttest?) to indicate significance?
L506: “achieve a significance level below 0.05” is I believe not the right interpretation of these tests – you might just state that their pvalues are below the critical value.
L515: “could be expected” – yes, of course, but the purpose is to develop a model that does indeed use this knowledge and puts this into practice. It might be a basic statement for some of the community, but not for others and potential users. I would emphasise the success in finding one or several models that do show substantial improvement in predictive skill, and that this skill shows up during periods of clustering (presumably, although it would be good to illustrate/visualize this explicitly, as suggested eg via the modified figure 4a that links cumulative LL scores trajectories to seismicity magnitudetime plots to show where the clustering occurs). And I would be careful with absolute statements like “poor performance” – this is relative to the other ETAS models as measured by your metric, which is (a) an approximation of the correlation of seismicity between spatial bins and (b) requires many model simulations, which may not be sufficient to fill all “bins” with simulations (see waterlevel comment). So I’d be a bit more careful and emphasise this first milestone – a European ETAS model that does what it says on the tin: forecast aftershocks and capture some of the time dependence of seismicity, clearly much better than a timeindependent model. And yes, there are some unexpected results here, which would be good to understand in more depth in order to improve on this first attempt.
With respect to the spatial LL comparison in Fig 5, I suspect there are patterns, and some more analysis would be required to understand these patterns. It’s surprising to see such strongly negative gains between an ETAS model and the ESHM20 model, for instance. Are these individual events? Clusters gone awry? And what’s going on at the midoceanic ridge? This section is a bit thin, so at least some more discussion would help. What explains the major differences in some cells between the two ETAS versions? Or at least what happens there?
Best wishes,
Max Werner
Citation: https://doi.org/10.5194/egusphere20233153RC2  AC2: 'Reply on RC2', Marta Han, 30 Apr 2024
Viewed
HTML  XML  Total  BibTeX  EndNote  

246  107  20  373  12  9 
 HTML: 246
 PDF: 107
 XML: 20
 Total: 373
 BibTeX: 12
 EndNote: 9
Viewed (geographical distribution)
Country  #  Views  % 

Total:  0 
HTML:  0 
PDF:  0 
XML:  0 
 1