This work is distributed under the Creative Commons Attribution 4.0 License.
Towards a Harmonized Operational Earthquake Forecasting Model for Europe
Abstract. We develop a harmonized earthquake forecasting model for Europe based on the Epidemic-type Aftershock Sequence (ETAS) model to describe the spatio-temporal evolution of aftershock sequences. We propose a modification of the method that integrates information from the European Seismic Hazard Model (ESHM20) about the spatial variation of background seismicity into the ETAS parameter inversion, which is based on the expectation-maximization (EM) algorithm. Other modifications to the basic ETAS model are explored, namely fixing the productivity term to a higher value to balance the more productive triggering by high-magnitude events against their much rarer occurrence, and replacing the b-value estimate with one relying on the b-positive method to assess the possible effect of short-term incompleteness on model parameters. Retrospective and pseudo-prospective tests demonstrate that ETAS-based models outperform the time-independent benchmark model as well as an ETAS model calibrated on global data. The background-informed ETAS variant achieves the highest score in the pseudo-prospective experiment, but the performance difference from the second-best model is not significant. Our findings highlight promising areas for future exploration, such as avoiding the simplification of using a single b-value for the entire region or reevaluating the completeness of the seismic catalogs used.
Status: final response (author comments only)
RC1: 'Comment on egusphere-2023-3153', Anonymous Referee #1, 31 Jan 2024
This paper describes the first attempt to build an operational earthquake forecasting system for Europe.
My overall opinion on the paper is positive, but I think that the authors should address some points to make the paper more convincing and reproducible. Below I list my main comments that should be addressed in a revised version.
- The description of the quality of the homogeneous earthquake catalog is missing. The authors refer to the paper Danciu et al (2021), but it would be good to show something about the homogeneity in terms of magnitude. Each agency uses different magnitudes and it could be cumbersome and really challenging to homogenize them. But the homogenization of the magnitudes is essential for the goal of this paper.
So, I suggest adding some more quantitative information about the catalogs, including the kind of magnitude adopted and how different magnitudes have been homogenized.
- The authors use a binned magnitude of deltaM=0.2. It is not clear to me if this is an average that accounts for uncertainty in very old and very new earthquakes. In this case, the use of a mean value may not be appropriate.
In other words, what is the rationale of this choice (deltaM=0.2)? Moreover, the use of deltaM=0.2 has important consequences in terms of the b-value. Some papers show that the use of binned magnitudes introduces a bias in the b-value calculation (see the sketch after this list). This could be important in simulating data for forecasting and testing. Reading the paper, I cannot understand if the same binning has been maintained for the newer earthquake catalog used for prospective testing. If not, the b-value calculated using deltaM=0.2 may not be appropriate for simulating data that will be binned with a different deltaM.
This point has to be analyzed in detail in a revised version.
- The authors very often use the term "aftershock". I know that this is contained also in the name of the model (ETAS), but this could be very misleading, in particular for people working on seismic hazard who have a different definition of aftershocks (e.g., aftershocks can never be larger than the mainshock). I suggest replacing the term "aftershock" with the term "triggered earthquake", which is more appropriate for the ETAS model, which assumes that earthquakes can be divided only into background and triggered events, not into fore-, main-, and aftershocks.
- In the introduction the authors write "There is not a unique agreed-upon best way to provide OEF ...". It is not clear to me if the authors are talking about communication or about scientific output. For instance, Jordan et al (2011) made the case in which OEF should be provided continuously, whereas some agencies provide this information only in some circumstances. Are the authors referring to that? Or to the challenging way in which probabilities can be communicated? In any case I would suggest being clearer on this point.
- As regards the "model fit", I am wondering why the authors do not provide the usual residual plot that shows visually if the model explains the data well. The authors use different plots, which may be fine and add more information, but they seem less informative than the residual plot (at least, this is my first impression).
- In the section "Discussion of the model fit", many results sound trivial and can be easily explained by the well-known correlation among parameters. I would suggest making clearer which results are new and cannot be explained by what we already know.
- The caption of Figure 3 should contain an explanation of the colors used in the cells of the grid.
- One of the most interesting results is that the version of the ETAS model with alpha=beta performs worse, producing "explosive" earthquake sequences (branching ratios larger than 1). I am wondering if the authors are using some maximum magnitude (or corner magnitude) in their simulations. As far as I know this is an outcome of ESHM20, and it could reduce drastically the problem of "explosive" sequences. A recent paper by Mancini and Marzocchi (Seismol. Res. Lett., 2024) uses alpha=beta without having problems with "explosive" sequences. Maybe a few explanations of why the authors get explosive earthquake sequences would be worthwhile.
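A minimal sketch of the binning bias raised above (not the authors' code; all numbers illustrative): it draws Gutenberg-Richter magnitudes with b = 1, rounds them onto a deltaM = 0.2 grid, and compares the continuous Aki (1965) maximum-likelihood estimator, applied naively to the binned data, with the discrete estimator of Tinti and Mulargia (1987):

```python
import numpy as np

rng = np.random.default_rng(42)
b_true, m_c, delta_m, n = 1.0, 3.0, 0.2, 100_000
beta = b_true * np.log(10)

# Continuous GR magnitudes from half a bin below m_c, rounded onto the
# delta_m grid so that the lowest bin is centred on m_c.
m_low = m_c - delta_m / 2
m_cont = m_low + rng.exponential(1.0 / beta, size=n)
m_binned = m_c + delta_m * np.floor((m_cont - m_low) / delta_m)

mean_excess = m_binned.mean() - m_c

# Continuous Aki (1965) estimator applied, incorrectly, to the binned data.
b_aki = np.log10(np.e) / mean_excess

# Discrete maximum-likelihood estimator for binned magnitudes
# (Tinti & Mulargia, 1987) recovers b without bias.
b_tm = np.log(1.0 + delta_m / mean_excess) / (delta_m * np.log(10))

print(f"true b = {b_true:.2f}, naive Aki = {b_aki:.3f}, binned ML = {b_tm:.3f}")
```

In this setup the naive estimate overshoots b = 1 by roughly a quarter, so the choice of estimator matters as much as the choice of bin width.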
Citation: https://doi.org/10.5194/egusphere-2023-3153-RC1
- AC1: 'Reply on RC1', Marta Han, 30 Apr 2024
RC2: 'Comment on egusphere-2023-3153', Maximilian Werner, 26 Apr 2024
First, my apologies for the delay of this review!
This paper presents preliminary candidates for operational earthquake forecasting in Europe, based on variants of the well-established Epidemic Type Aftershock Sequence (ETAS) model. The authors draw on available harmonized datasets and a background model from the European Seismic Hazard model ESHM2020 to generate the pan-European models, making good use of prior painstaking work to collate the myriad of different national catalogs and methods. The study demonstrates convincingly that the ETAS models forecast some of the observed space-time clustering and may therefore be useful for real-time deployment. The detailed model evaluations include in-sample tests, which reveal some issues in some of the models as well as some issues with the applied tests, and they include a more severe out-of-sample pseudo-prospective test, and some sober and honest analysis of the relative performance of the models. They conclude that indeed 1-2 of the proposed ETAS models perform best and could serve as initial candidates (to be improved in the future).
Please see my (many) comments below. It’s a well written and structured paper – and I have only a couple of major comments, everything else is largely around presentation and suggestions for improvements. I have one request to assess a forecast/simulation choice in more detail, and another to pursue some more analysis in the evaluation.
What is the influence of the chosen minimum probability level in histograms where no simulations have filled bins? How many of the full set of space-time (and magnitude?) number bins are set to this level? How frequently are they "observed"? How frequently do these empty number bins sit next to filled bins, suggesting perhaps that interpolation would be a better approach?
The authors rightly identify some undesirable features in the retrospective hypothesis tests applied, which is a useful contribution. But that means they should pursue other/additional methods, e.g. comparing the cumulative number trajectories of observed and synthetic catalogs (see specific comment below), comparing (at least visually) the spatial distribution of the observed and synthetic catalogs, etc. Similarly, Figure 5, showing the pseudo-prospective information gains, doesn't seem very insightful – this is a first step to identify what's gone wrong/right, but it would really benefit from some analysis, e.g. are the aftershocks not well captured in space, time, or both? Why are the ETAS models sometimes performing worse than the ESHM2020 model? Even qualitative visual analysis would be useful.
Otherwise, please consider the below comments as hopefully constructive suggestions.
Abstract:
L2 – is the purpose an aftershock forecast model? It sounds like it is, based on the first sentence. Please clarify.
The abstract would benefit from a sentence on the data the model is fit to, and any issues with "the" European catalog. I.e. – what's harmonized?
It would help to clarify that these are currently 2D seismicity forecasts, not extended ruptures with depth (which might be fruitful areas for the future).
A clearer statement of the different models explored and their rankings would be of interest.
L10 How do the findings highlight these promising areas (last sentence)? Can you express it concisely or are these speculations or a wish list?
Introduction
L60 – missing a word after m_c – maybe “more”?
The introduction could probably be shortened – e.g. the background to m_c and the GR law etc seems more relevant for a thesis than a paper – unless the authors are specifically trying to reach a broader audience.
L75: As I understand it, the Mizrahi et al 2021 model accounts for temporal – not spatial – variations in mc. Is the model being extended? If so, mention it in the abstract and here.
L81: ESHM20 contains many models, some idea of how that’s incorporated into ETAS in both the abstract as well as more discussion here would be appropriate.
L112: These statistics of mc presumably come from Danciu et al 2021? Please clarify.
Figure 1: the red dots look black because of the marker edge colour. It’s impossible to differentiate red from green dots in this figure. Try a bigger size or different/no edge colour.
L137: what do you mean here by “such differences”? Do you mean a difference in completeness magnitude, or a difference in the spatial locations? The second is indeed expected, the former perhaps less so. Please clarify.
L140: Good to clarify. Does the binning require the discrete version of the b-value estimation procedure? Which one was used in Fig 1d? What are the uncertainties? Why show b=0.99? Perhaps this will be discussed below, then guide the reader.
L142: I would appreciate a figure of the long-term rates in these two models.
L146-153: this sounds like Methods, rather than data. I recommend more discussion of the two long-term models (i.e. a figure of the two models and a bit more background on both). Then move the discussion of how you map these into the background rate of ETAS into a subsection in Methods; some discussion of how this long-term seismicity might or might not be compatible with the background (independent) rate in an ETAS model would also be appropriate (or else just extend this section).
L187: again, if the ESHM20 considers long-term rate as the declustered rate, there is some consistency (even that’s debatable), but if the ESHM20 rate is the long-term, i.e. average rate, then setting this to the background rate may lead to an overestimation of the total rate (instead the average model rate should equal the ESHM20 rate, see e.g. Field et al., 2017). Or are you solely considering the relative spatial information, rather than the absolute rate? I think clarification of this point earlier in the paper would help. (You clarify this in L242, and I agree this is a good approach, but I’d mention it earlier.)
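To make this point concrete, a minimal sketch with hypothetical numbers (not taken from the paper): in a stationary ETAS model the long-term total rate is the background rate inflated by the aftershock multiplier 1/(1 - n), so equating the ESHM20 long-term rate with the background rate mu overestimates the total rate unless mu is rescaled:

```python
# Hypothetical numbers, for illustration only.
branching_ratio = 0.8        # assumed subcritical branching ratio n
eshm20_rate = 100.0          # assumed long-term (total) rate, events / year

# Using the ESHM20 rate directly as the ETAS background rate mu:
total_naive = eshm20_rate / (1 - branching_ratio)         # 500 events / year

# Rescaling so that the *total* ETAS rate matches the ESHM20 rate
# (cf. Field et al., 2017):
mu_consistent = (1 - branching_ratio) * eshm20_rate       # 20 events / year
total_consistent = mu_consistent / (1 - branching_ratio)  # 100 events / year
print(total_naive, total_consistent)
```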
L193: ETAS_USGS: A few more details would help – isn’t this an aftershock-only model? Are these truly one-size fits all for the entire globe or are they regionalized?
L203: burn period -> burn-in period
L213: which events’ locations exactly – the new/simulated background events, or do you not include all background events in the burn-in period?
L258: It would help to be a bit more precise here, and to consider the role of the maximum magnitude or a tapered GR law. "Exploding" aftershock behavior may occur with some finite probability when the branching ratio r is >1, and sequences are finite with probability 1 when r=1. The calculation of r involves the GR law, and thus depends on your choice of pure GR vs tapered/cut GR, and thus in the latter case on Mcorner/Mmax. First, please include a short discussion/justification of your choice earlier when you introduce the magnitude distribution (including the potential for spatial variations). Second, in the context of constraining parameters for subcritical branching, these modifications matter and thus the parameter constraints change. The branching ratio r may well be <1 if the GR is tapered/cut while it may not be with a pure GR law. Sornette and Werner (2005) derived r equations for a truncated GR. Finally, clarify what 'e' is (another variable or Euler's number?).
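As a sketch of how the truncation enters (standard ETAS algebra; the parameter values are hypothetical): the branching ratio is the expectation of the productivity k(m) = K exp(alpha (m - m0)) under the magnitude density, which stays finite for any alpha once the GR law is truncated at Mmax:

```python
import numpy as np

def branching_ratio(K, alpha, beta, dM=None):
    """Expected number of direct aftershocks per event, E[K * exp(alpha * X)],
    with X exponential(beta), optionally truncated at dM = Mmax - m0."""
    if dM is None:                        # pure GR: finite only for alpha < beta
        return K * beta / (beta - alpha) if alpha < beta else np.inf
    if np.isclose(alpha, beta):           # truncated GR, alpha = beta limit
        return K * beta * dM / (1.0 - np.exp(-beta * dM))
    return (K * beta / (beta - alpha)
            * (1.0 - np.exp(-(beta - alpha) * dM))
            / (1.0 - np.exp(-beta * dM)))

beta = np.log(10)                         # b = 1
K = 0.05                                  # hypothetical productivity constant
print(branching_ratio(K, beta, beta))             # alpha = beta, pure GR -> inf
print(branching_ratio(K, beta, beta, dM=5.5))     # truncated GR -> ~0.63, subcritical
```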
L264: “standard” in the USGS software?
L286: Does this mean that the explicitly spatially-variable models forecast spatially uniform background rates in each spatial ESHM zone? I don't believe so, but please clarify – if I understand correctly, the ETAS_bg models use the same procedure to simulate background events as the ETAS_0 model, but the probability of being a background event has changed because of the different parameters and the additional constraint that the rates in different zones are relatively constrained. Is that correct?
L292: Ah, so it is a truncated GR law – does this influence the branching ratio constraint alpha = e? See earlier comment.
L295: “true” -> observed? Here and below.
L320: Why only focus on retrospective consistency testing? 7 years of daily out-of-sample testing seems like a great start, even if not 25 years. I recommend doing these or similar tests, or some of them – the information gains assess a different aspect of the models, but tell us less about how close to the data the models are (and while the non-Poisson cell-wise LL scores are a great improvement, they still neglect known correlations between spatial cells).
L338: What is the influence of this implementation and the particular choice of a minimum waterlevel on the overall scores? How many zero-bins are there in a 100k simulation set per day? How does the waterlevel compare with the background rate (and its Poisson-process-implied probabilities)? How "rough" is the zero-bin distribution, i.e. could an interpolation procedure approximate the ETAS forecasts better than this water-level?
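A minimal illustrative sketch of these questions (a synthetic stand-in for the ETAS simulations; all rates are hypothetical): it counts how many bins receive zero simulated events and compares a 1/n_sims water level with the Poisson probabilities implied by the (here, known) per-bin rates:

```python
import numpy as np

rng = np.random.default_rng(0)
n_sims, n_bins = 10_000, 1_000
rates = rng.lognormal(mean=-12.0, sigma=2.0, size=n_bins)  # assumed per-bin daily rates

counts = rng.poisson(rates, size=(n_sims, n_bins))  # stand-in for ETAS simulations
never_hit = (counts == 0).all(axis=0)
print(f"bins with zero simulated events: {never_hit.mean():.1%}")

water_level = 1.0 / n_sims                  # one conventional choice of floor
p_implied = 1.0 - np.exp(-rates)            # Poisson-implied P(N >= 1) per bin
print(f"floor = {water_level:.1e}, median implied probability "
      f"in the empty bins = {np.median(p_implied[never_hit]):.1e}")
```

If the implied probabilities in the empty bins scatter over orders of magnitude, a single global floor distorts the scores, and interpolating rates from neighbouring bins may approximate the forecast better.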
L419: It would be illustrative to give the range of values in days or years after which the Omori law is significantly tapered in this formulation (ie give the range of tau in days). (I see the range in L436, but give units in years).
Parameters: Can you connect your discussion of the ETAS parameter estimation with the literature more directly (e.g. do you see similar patterns to Seif et al, 2017, and others…)?
Do you have LL scores for your various ETAS model fits? Can you compare the in-sample fit using LL and AIC? It’s a useful indicator of fit, though not necessarily about predictive skill.
Figure 3: Please clarify these are simulations over the entire training period. Also, L279 states 100k catalogs were simulated, but here it states only 10k. I would clarify the timeframe of the retrospective testing also in the relevant Methods section around L279. These are not next-day forecasts, as in the pseudo-prospective testing section.
Figure 3a: I would recommend looking at the cumulative counts in each of the catalogs and compare with the observed count. You could look at the cumulative 95% model range and compare with the observed number. But you also get important insights into how well the model is qualitatively and quantitatively reproducing the features you are hoping it will, namely aftershock clustering and background rate.
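A minimal sketch of this check (placeholder data; `observed_times` and `simulated_catalogs` stand in for the real and synthetic catalogs), comparing the observed cumulative count with the 2.5-97.5 % envelope of the simulations on a common time grid:

```python
import numpy as np
import matplotlib.pyplot as plt

def cumulative_counts(event_times, grid):
    """Number of events at or before each grid time."""
    return np.searchsorted(np.sort(event_times), grid, side="right")

rng = np.random.default_rng(1)
observed_times = rng.uniform(0.0, 365.0, 300)                 # placeholder catalog
simulated_catalogs = [rng.uniform(0.0, 365.0, rng.poisson(300))
                      for _ in range(1_000)]                  # placeholder simulations

t_grid = np.linspace(0.0, 365.0, 400)
obs_cum = cumulative_counts(observed_times, t_grid)
sim_cum = np.array([cumulative_counts(s, t_grid) for s in simulated_catalogs])
lo, hi = np.percentile(sim_cum, [2.5, 97.5], axis=0)

plt.fill_between(t_grid, lo, hi, alpha=0.3, label="95% of simulations")
plt.plot(t_grid, obs_cum, color="k", label="observed")
plt.xlabel("time [days]")
plt.ylabel("cumulative number of events")
plt.legend()
plt.show()
```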
Fig3: As part of these formal tests, I would also look at visual fits, e.g. Fig 1d seems to show a comparison of observed data and the magnitude forecasts from the different b-value models. I know the formal tests do this comparison quantitatively, but I would discuss the visual fit (which quite clearly favours the higher b-value).
Supercritical branching when alpha is fixed: these branching ratios are indeed very high, but supercritical has been found before, e.g. Seif et al. 2017 and some of the other references you used have found similar issues. There’s clearly model misspecification that’s driving this bias, so it would be interesting to discuss possibilities: changing background rates, anisotropic spatial aftershock triggering, spatial variation in incompleteness?
L476: “suspiciously well” in spatial terms means that the models are too smooth, ie that the events occur too close to likelihood peaks without the scatter expected if the model were the data generator. That should be spelt out.
L479: refer to Fig 1d in this discussion of magnitude distribution fit to data.
L481: is it true that “ETAS_bg^(b+) uses the same spatial distribution for placing the background events” as ETAS_0^(b+)? As commented above, I understood two differences: the relative rates between zones are constrained, and the absolute rate is different, and the parameters are different, so the probability of being a background event is also affected. Could you clarify (here and above where commented)?
Irrespective of this, and again, I would visually compare the forecast and observed magnitude distributions and use this to support your case that the difference in magnitude forecast performance of the two models is indeed a surprising result and indicates an issue with the M-test (and the S-test, although I'm not sure I understand your detailed argument here, see above).
L482: “known issue”: as you know, Francesco Serafini and others identified a correlation between the M-test and the N-test, which is indeed problematic and being fixed. You might cite Francesco Serafini (personal communication, 2024), or perhaps refer to the manuscript in prep that might be submitted by the time this is published.
Given the (reasonable) caution you advise in interpreting these M/S/PL results, you might use visual checks, including of the spatial distributions, appropriately averaged (or not) over the many simulations, to show the extent to which models reproduce the features you expect them to.
Figure 4a would benefit from a panel underneath with the magnitude-time plot of the seismicity, which will help identify clustering and quieter periods and visually link them to the likelihood gains (or losses).
Figure 4b/c: these tables are hard to digest, despite the nice colors. Could you please plot (instead or in addition) a figure showing the mean information gain over the ESHM20 model with a 95% range (which I believe you get from the paired t-test?) to indicate significance?
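A minimal sketch of the suggested figure (model names and log-likelihood differences are placeholders): per-event information gains of each model over the ESHM20 reference, plotted as the mean with a 95 % confidence interval from the paired t statistic (as in the Rhoades et al., 2011, T-test):

```python
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

rng = np.random.default_rng(2)
models = ["ETAS_0", "ETAS_bg", "ETAS_bg_b+"]   # placeholder model names
# Placeholder per-event log-likelihood differences, model minus ESHM20:
ig = {m: rng.normal(0.4 + 0.1 * i, 1.0, 500) for i, m in enumerate(models)}

means, half_widths = [], []
for m in models:
    x = ig[m]
    sem = x.std(ddof=1) / np.sqrt(x.size)          # standard error of the mean
    means.append(x.mean())
    half_widths.append(stats.t.ppf(0.975, df=x.size - 1) * sem)

plt.errorbar(models, means, yerr=half_widths, fmt="o", capsize=4)
plt.axhline(0.0, color="grey", lw=0.8)             # zero gain over ESHM20
plt.ylabel("mean information gain per event (95% CI)")
plt.show()
```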
L506: “achieve a significance level below 0.05” is I believe not the right interpretation of these tests – you might just state that their p-values are below the critical value.
L515: “could be expected” – yes, of course, but the purpose is to develop a model that does indeed use this knowledge and puts this into practice. It might be a basic statement for some of the community, but not for others and potential users. I would emphasise the success in finding one or several models that do show substantial improvement in predictive skill, and that this skill shows up during periods of clustering (presumably, although it would be good to illustrate/visualize this explicitly, as suggested eg via the modified figure 4a that links cumulative LL scores trajectories to seismicity magnitude-time plots to show where the clustering occurs). And I would be careful with absolute statements like “poor performance” – this is relative to the other ETAS models as measured by your metric, which is (a) an approximation of the correlation of seismicity between spatial bins and (b) requires many model simulations, which may not be sufficient to fill all “bins” with simulations (see waterlevel comment). So I’d be a bit more careful and emphasise this first milestone – a European ETAS model that does what it says on the tin: forecast aftershocks and capture some of the time dependence of seismicity, clearly much better than a time-independent model. And yes, there are some unexpected results here, which would be good to understand in more depth in order to improve on this first attempt.
With respect to the spatial LL comparison in Fig 5, I suspect there are patterns, and some more analysis would be required to understand these patterns. It’s surprising to see such strongly negative gains between an ETAS model and the ESHM20 model, for instance. Are these individual events? Clusters gone awry? And what’s going on at the mid-oceanic ridge? This section is a bit thin, so at least some more discussion would help. What explains the major differences in some cells between the two ETAS versions? Or at least what happens there?
Best wishes,
Max Werner
Citation: https://doi.org/10.5194/egusphere-2023-3153-RC2
- AC2: 'Reply on RC2', Marta Han, 30 Apr 2024