This work is distributed under the Creative Commons Attribution 4.0 License.
Towards improving the spatial testability of aftershock forecast models
Muhammad Asim Khawaja
Behnam Maleki Asayesh
Sebastian Hainzl
Danijel Schorlemmer
Abstract. Aftershock forecast models are usually provided on a uniform spatial grid, and the receiver operating characteristic (ROC) curve is often employed for evaluation, drawing a binary comparison of earthquake occurrence or non-occurrence for each grid cell. However, synthetic tests show flaws in using the ROC for aftershock forecast ranking. We suggest a twofold improvement in the testing strategy. First, we propose to replace the ROC with the Matthews correlation coefficient (MCC) and the F1 curve. Second, we suggest using a multi-resolution test grid adapted to the earthquake density. We conduct a synthetic experiment in which we analyze aftershock distributions stemming from a Coulomb failure (ΔCFS) model, including stress activation and shadow regions. Using these aftershock distributions, we test the true ΔCFS model as well as a simple distance-based forecast (R), which only predicts activation. The standard test cannot clearly distinguish between the two forecasts, particularly in the case of some outliers. However, using MCC-F1 instead of ROC curves together with a simple radial multi-resolution grid improves the test capabilities significantly. Our findings suggest that, to conduct meaningful tests, at least 8 % and 5 % of cells should contain observed earthquakes to differentiate between a near-perfect forecast model and an informationless forecast using ROC and MCC-F1, respectively. While we cannot change the observed data, we can adjust the spatial grid using a data-driven approach to reduce the disparity between the number of earthquakes and the total number of cells. Using the recently introduced Quadtree approach to generate multi-resolution grids, we test real aftershock forecast models for the Chi-Chi and Landers aftershocks following the suggested guideline. Despite the improved tests, we find that the simple R model still outperforms the ΔCFS model in both cases, indicating that the latter should not be applied without further model adjustments.

Notice on discussion status
The requested preprint has a corresponding peerreviewed final revised paper. You are encouraged to refer to the final revised version.

Interactive discussion
Status: closed

CC1: 'Comment on egusphere-2023-309', Linxuan Li, 17 Apr 2023
This systematic research is important for this broad field and is fundamental work for aftershock prediction. I very much enjoyed reading the paper. I have some questions that I would like to discuss with the author:
1) I wonder how you generate multiple aftershock sequences. If you use the “delta CFS clock advance model”, the distribution of aftershocks is determined (lambda is given), so do you change the parameters of the mainshock? Also, regarding Section 3.4, I want to know the procedure of “simulating catalogs with different earthquake numbers but a fixed number of cells containing earthquakes”.
2) Figure 1: Perhaps I miss some points, but I think you should get different ROCs for the R-model when you add SEs. In other words, you should have both red and blue dotted lines. Plus, I don’t understand why you use an inverse cumulative distribution function for the R-model.
3) Line 231: What are (delta_CFS)^(1%) and R^(99%)? What are the criteria for selecting 1% and 99%?
4) Lines 265–266: Why do you choose the aftershocks based on such criteria? There are several methods that may help distinguish aftershocks, like ETAS. At least, I think the author can change the parameters (radius and time) to verify the stability of their conclusions.
Citation: https://doi.org/10.5194/egusphere-2023-309-CC1
AC1: 'Reply on CC1', Muhammad Asim Khawaja, 23 May 2023
Thank you for your appreciation and for reading our work.
Ans1:
Generating aftershocks (or earthquakes) based on the probability map of a model does not involve any changes in the mainshock parameters. Instead, it is about simulating events based on the probability distribution.
For simulating catalogs with different earthquake numbers but a fixed number of cells containing earthquakes, there are two ways to achieve this programmatically. First, we can keep simulating full catalogs until we get one with the desired number of earthquakes and active cells; this can be computationally expensive. Second, we can simulate one earthquake at a time, continuing until we reach the desired number of active cells. In this work, we use the latter approach.
Please find all the codes to reproduce results here:
https://github.com/khawajasim/aftershock_forecast_testing.
The functions to simulate catalogs can be found in the module “utils”, along with the required documentation.
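The second, event-by-event approach can be sketched as follows. This is a minimal illustration under our own assumptions; the function and variable names are ours, and the authors' actual implementation lives in the "utils" module of their repository:

```python
import numpy as np

def simulate_catalog(rates, n_active_target, rng=None):
    """Draw events one at a time from a gridded rate forecast until the
    desired number of distinct active cells is reached.

    rates: per-cell forecast rates (need not be normalized).
    n_active_target: number of cells that should contain >= 1 event
        (assumed not to exceed the number of cells with nonzero rate).
    Returns the list of cell indices of all simulated events.
    """
    rng = np.random.default_rng(rng)
    p = np.asarray(rates, dtype=float)
    p = p / p.sum()                      # rates -> probability map
    events, active = [], set()
    while len(active) < n_active_target:
        cell = rng.choice(len(p), p=p)   # sample one event per iteration
        events.append(int(cell))
        active.add(int(cell))
    return events
```

Because events are added one at a time, the loop stops exactly when the target number of active cells is reached, so the total earthquake number varies from catalog to catalog while the active-cell count stays fixed.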
Ans2: We decided to show just one ROC for the R-model because it has a positive forecast everywhere and its ROCs remain similar. Since our goal is to observe the behavior of the ROC when an earthquake occurs in a negative region of ΔCFS, introducing additional ROCs for the R-model would only clutter the figures with more lines without adding meaningful information.
We use an inverse cumulative distribution for one model (which happens to be the R-model in this case) and a cumulative distribution for the other model because, in this way, we can easily distinguish visually whether the two curves are clearly separated or not. We could equally use a cumulative distribution for the R-model and an inverse cumulative distribution for ΔCFS.
Ans3:
It refers to the difference between the performance of the ΔCFS and the R model. ΔCFS(1%) and R-model(99%) mean that we allow 1% flexibility or overlap according to our criteria. Alternatively, this could be set to 2.5% (i.e. ΔCFS(2.5%) and R-model(97.5%)), or 5% or 10%, etc. However, we have used 1% because we want to allow only a small overlap between the performance of the ΔCFS and the R model.
Ans4: We selected earthquakes that occurred within one year after the mainshock and at a horizontal distance of less than 100 km from the mainshock fault, because the seismicity in this spatial and temporal vicinity of the mainshock is usually dominated by aftershocks, and previous studies have used similar selection rules (Sharma et al. 2020, Asayesh et al. 2022, cited in the manuscript). However, some of the selected earthquakes may be background events unrelated to the mainshocks. We refrain from labeling the test data with a sophisticated approach such as the ETAS model; a simpler assumption is that the forecast models already account for possible stationary background activity. Here we have ignored background activity, which may have a small effect on our result, but it won’t change the main result because the background activity is only a small fraction of the total activity. We have now clarified in the revised manuscript that we have ignored possible background activity in our test of the two real aftershock sequences.
Note that these are just two case studies for aftershock forecast evaluation (for ChiChi and Landers), that we have discussed here. There are several options for other case studies or data selections. However, the final conclusion of this study does not depend on this evaluation.
Sharma, S., Hainzl, S., Zöller, G., and Holschneider, M.: Is Coulomb stress the best choice for aftershock forecasting?, Journal of Geophysical Research: Solid Earth, 125, e2020JB019553, 2020.
Asayesh, B. M., Zafarani, H., Hainzl, S., and Sharma, S.: Effects of large aftershocks on spatial aftershock forecasts during the 2017–2019 western Iran sequence, Geophysical Journal International, 232, 147–161, 2022.
Citation: https://doi.org/10.5194/egusphere-2023-309-AC1


CC2: 'Comment on egusphere-2023-309', Behnam Malekiasayesh, 20 Apr 2023
Given the importance of forecasting aftershocks following big earthquakes, and of the testability of these forecasting methods, this research is important. I have enjoyed the paper and have some questions/suggestions that I would like to discuss with the author:
1. The MCC-F1 curve provides the opportunity to obtain the best threshold points. I am wondering if the minimum distance of all the points of the MCC-F1 curve from the point of perfect performance (1, 1) is equal to the distance between the best threshold point and the point of perfect performance?
2. It seems that there is a little difference between the MCC-F1 metric calculated in this study and that of the Cao et al. (2020) study. I am interested in the advantages of your suggested MCC-F1 metric.
Citation: https://doi.org/10.5194/egusphere-2023-309-CC2
AC2: 'Reply on CC2', Muhammad Asim Khawaja, 23 May 2023
Hello Behnam, thank you for providing some comments from a reader’s perspective.
Ans1:
The best possible value of both MCC and F1 is 1, so the best forecast performance in terms of MCC and F1 is the one closest to 1. Hence, the (MCC, F1) point nearest to (1, 1) provides the best performance.
Ans2:
For the MCC-F1 curve, the procedure of drawing the curve between MCC and F1 is the same as proposed by Cao et al. (2020). However, we differ slightly from Cao et al. (2020) in how a single performance metric is computed from the MCC-F1 curve. Cao et al. (2020) propose to compute the average distance of the MCC-F1 curve from the point of ideal performance, i.e. (1, 1). MCC already incorporates all the entries of a confusion matrix, i.e. true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN), and, unlike the quantities underlying the ROC curve, it does not increase or decrease monotonically with varying decision thresholds. Thus, with the average distance, we can never achieve the perfect performance of one, even for an ideal forecast. The highest performance, on the other hand, reveals the maximum potential of the forecast, so we base the testing metric on the minimum distance from the point of ideal performance (1, 1).
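The minimum-distance selection described above can be sketched as follows. This is our own minimal illustration, not the authors' code; the rescaling of MCC from [-1, 1] to the unit interval follows the MCC-F1 curve construction of Cao et al. (2020), and all function and variable names are ours:

```python
import numpy as np

def mcc_f1_best_point(scores, labels, n_thresholds=100):
    """Sweep decision thresholds over a gridded forecast, compute the
    (unit-normalized MCC, F1) point at each threshold, and return the
    point (and threshold) closest to ideal performance (1, 1).

    scores: per-cell forecast values; labels: 1 if the cell contains an
    observed earthquake, else 0.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    best = (np.inf, None)
    for t in np.linspace(scores.min(), scores.max(), n_thresholds):
        pred = scores >= t               # binarize the forecast at threshold t
        tp = np.sum(pred & (labels == 1))
        fp = np.sum(pred & (labels == 0))
        fn = np.sum(~pred & (labels == 1))
        tn = np.sum(~pred & (labels == 0))
        denom = np.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
        mcc = (tp * tn - fp * fn) / denom if denom else 0.0
        f1 = 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 0.0
        mcc01 = (mcc + 1) / 2            # rescale MCC from [-1, 1] to [0, 1]
        d = np.hypot(1 - mcc01, 1 - f1)  # distance from ideal point (1, 1)
        if d < best[0]:
            best = (d, (mcc01, f1, t))
    return best[1]
```

For a forecast that perfectly separates active from inactive cells, the minimum distance is zero and the returned point is (1, 1), which is exactly the "maximum potential" interpretation used in the reply.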
Citation: https://doi.org/10.5194/egusphere-2023-309-AC2


RC1: 'Comment on egusphere-2023-309', Jose Bayona, 26 Apr 2023
General comments:
I have read the egusphere-2023-309 manuscript titled "Towards improving the spatial testability of aftershock forecast models" by Asim M. Khawaja, Behnam Maleki Asayesh, Sebastian Hainzl, and Danijel Schorlemmer. In this manuscript, the authors propose to use Matthews correlation coefficient and F1 curves (MCC-F1), as well as radial multi-resolution spatial grids, to improve the testability of aftershock forecasting models. They build on the work of Parsons (2020), who first discussed the utility of the Receiver Operating Characteristic (ROC) curve for evaluating Coulomb failure (∆CFS) aftershock models. Based on aftershock simulations provided by a relatively simple ∆CFS model, the authors find that replacing ROC curves with MCC-F1 curves, and using radial spatial grids instead of rectangular grids, makes it possible to differentiate a near-perfect ∆CFS model (from which the synthetic data are generated) from a non-informative random reference model. In addition, they find that such discrimination is possible if the aftershocks occur in at least a few percent of the total analysed spatial cells. Finally, they test their method against real aftershock data, finding that the reference model outperforms the ∆CFS model in forecasting the spatial distribution of the aftershocks of the 1999 Mw 7.7 Chi-Chi and 1992 Mw 7.3 Landers earthquakes.
I think the manuscript is well written, informative, and certainly could be of use to the earthquake forecasting community in improving our current understanding of the clustering nature of earthquakes. However, I also consider that it could benefit from a few improvements, which I describe below.
Specific comments:
Line 29: “physics-based”
Lines 34–35: I agree that, ideally, only models that have been prospectively evaluated should be considered for decision-making. However, in practice, peer-reviewed models can also be considered for this purpose.
Line 41: Provide a reference for the ROC.
Line 86: Provide magnitudes, years of occurrence and references for the Chi-Chi and Landers earthquakes.
Line 128: The first statement is quite vague in my opinion. Either elaborate more or delete.
Line 166: Provide a reference for the Ellsworth B magnitude-area relationship.
Line 177: What is x in this context?
Line 182: The authors can also mention dynamic triggering (e.g., Hardebeck and Harris, 2022)
Lines 195–196: The authors mention that, based on their results, the ROC is unable to perform meaningful testing, but is it not also the case that the models they use are notoriously limited in their ability to forecast aftershock locations? I would be curious to see a similar analysis using more complex physics-based models or statistical models like ETAS.
Line 326: The authors could add "and computationally less expensive".
Technical corrections:
Figs. 1a and 1b. The legends are not very informative. Are the Coulomb stress changes in MPa? Have they been normalised?
Figs. 2b and 3b. There is a typo in the y-label of these figures.
Table 1. The authors probably mean “Multi”
References:
Hardebeck, J. L. and Harris, R. A., 2022. Earthquakes in the shadows: Why aftershocks occur at surprising locations. The Seismic Record, 2(3), pp. 207–216.
Citation: https://doi.org/10.5194/egusphere-2023-309-RC1
AC3: 'Reply on RC1', Muhammad Asim Khawaja, 23 May 2023
Thank you for taking the time to review the manuscript and appreciating the effort.
Line 29 (Response): Corrected.
Lines 34–35 (Response): We have changed the wording and tone of the sentence.
… “It is desirable to use those forecast models for societal decision making that have proven their applicability through testing”…
Line 41 (Response): Done.
Line 86 (Response): Noted.
Line 128 (Response): Modified the sentence to clarify the meaning.
... "The reliability of the testing models for any type of dataset is primarily associated" ...
Line 166 (Response): We have now provided the reference in the text.
Line 177 (Response): ‘x’ refers to ∆CFS in this case. We have modified and replaced x with “Model”.
Line 182 (Response): Thank you for the suggestion. We have added the reference.
Lines 195–196 (Response): These lines do not discuss forecast testing based on a real earthquake catalog; instead, they refer to the synthetic experiment. In this experiment, we use the simple ∆CFS model as a seismicity generator, and the same model is used for the forecast evaluation. Thus, in this context, the ∆CFS is the perfect forecast model because it is also the catalog generator, so it should be able to perform better than any competing uninformative model.
We think that we failed to convey this in the first version of the manuscript, so we have modified the text slightly to make it clearer. Please see line 201 of the revised annotated manuscript.
Line 326 (Response): In this manuscript, we have not discussed the issue of computational resources in the context of aftershock forecast testing, so it would be inappropriate to add this to the conclusion.
Figs. 1a and 1b. (Response): Figures 1a and 1b show the forecast rates (λ), and their calculation is provided in the manuscript. We have further elaborated the text and added this information to the legend and caption of the figure.
Figs. 2b and 3b. (Response): Thank you for pointing this out. We have corrected them.
Table 1. (Response): We have corrected it.
Citation: https://doi.org/10.5194/egusphere-2023-309-AC3


RC2: 'Comment on egusphere-2023-309', Anonymous Referee #2, 04 May 2023
In this manuscript, the authors have discussed the testing of earthquake/aftershock forecasting models. With a wide range of techniques and models being developed for earthquake forecasting, a meaningful evaluation of the testing metrics is an important issue, and it becomes important to analyze those metrics in detail. In this paper, the authors discuss binary testing metrics, i.e. the ROC curve. I share some concerns:
1. The manuscript discussing the MCC-F1 is still not published but is available as a preprint. Why do you still consider it important to propose it for earthquake forecast testing?
2. Can you justify whether changing the grid of the earthquake forecast model changes the forecast itself or just affects the test's outcomes? How will you prove that?
3. Did you also consider any binary classification evaluation metric other than the MCC-F1 and ROC curve, e.g. the precision-recall curve?
Citation: https://doi.org/10.5194/egusphere-2023-309-RC2
AC4: 'Reply on RC2', Muhammad Asim Khawaja, 23 May 2023
Ans1:
Although the manuscript proposing the MCC-F1 curve has not yet been published in a journal, we find it useful. Moreover, MCC and F1 themselves are already published and widely discussed performance measures in the literature. We find the MCC-F1 curve useful because, like the ROC, it provides performance across different decision thresholds.
Ans2:
To address this question, we would like to refer to a recent study (Khawaja et al. 2023), which showed that aggregating a forecast on a multi-resolution grid does not change the consistency of a forecast model, as long as the model is actually consistent with the observations.
Khawaja, Asim M., et al. "Statistical power of spatial earthquake forecast tests." Geophysical Journal International 233.3 (2023): 2053–2066.
Ans3:
Yes, we considered performance measures that work with varying thresholds, for example the precision-recall curve. Precision involves TP and FP, while recall involves TP and FN; together, they take into consideration TP, FP, and FN, leaving out the effect of TN. Therefore, we preferred a performance metric that takes into account all four entries of the confusion matrix.
However, the novelty of this study is to provide a guideline on the minimum amount of data required to evaluate earthquake forecast models in terms of binary classification.
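A small numerical example (our own illustration, not taken from the manuscript) makes the point about TN concrete: with TP, FP, and FN held fixed, precision, recall, and F1 are blind to the number of true negatives, while MCC is not.

```python
import math

def prf1_mcc(tp, fp, fn, tn):
    """Return (precision, recall, F1, MCC) from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return precision, recall, f1, mcc

# Same TP/FP/FN, very different TN: precision, recall and F1 are identical
# in both cases, while MCC reflects the change in true negatives.
a = prf1_mcc(tp=50, fp=10, fn=10, tn=10)
b = prf1_mcc(tp=50, fp=10, fn=10, tn=1000)
```

Here the first three entries of `a` and `b` coincide, but the MCC values differ, which is why a metric built on all four confusion-matrix entries was preferred.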
Citation: https://doi.org/10.5194/egusphere-2023-309-AC4
