An automated approach for developing geohazard inventories using news: Integrating NLP, machine learning, and mapping

Avcıoğlu, Aydoğan; Demir, Ogün; Görüm, Tolga

doi:https://doi.org/10.5194/egusphere-2025-7

Preprints

21 Jan 2025

Abstract. Spatiotemporal inventories of natural hazards are essential for building resilient societies, yet restricted access to global inventories hinders the development of mitigation strategies. We therefore developed an approach that uses online newspapers to create natural hazard inventories by combining web scraping, natural language processing (NLP), clustering, and geolocation of textual data. We apply the approach to online newspapers published in Türkiye from 1997 to 2023. In the first stage, we retrieved 15,569 news articles related to wildfires, floods, landslides, and sinkholes using our tr-news-scraper tool. We then applied NLP preprocessing to refine the raw texts, which were subsequently clustered into four natural hazard groups, resulting in 3,928 news articles. In the final stage, we developed a method that geolocates the news using the OpenStreetMap (OSM) Nominatim tool, yielding a total of 13,940 natural hazard incidents, since a single news article can report multiple incidents across various locations. As a result, we mapped 9,609 floods, 1,834 wildfires, 1,843 landslides, and 654 sinkhole formation incidents from online newspaper sources, whose spatiotemporal distribution is consistent with the existing literature. Our approach thus illustrates the potential of online newspaper articles for developing natural hazard inventories, taking text data from web sources to maps by leveraging web scraping, NLP, and mapping techniques.

Received: 02 Jan 2025 – Discussion started: 21 Jan 2025

Publisher's note: Copernicus Publications remains neutral with regard to jurisdictional claims made in the text, published maps, institutional affiliations, or any other geographical representation in this preprint. The responsibility to include appropriate place names lies with the authors.

Download & links

Preprint (PDF, 2070 KB)

Notice on discussion status
The requested preprint has a corresponding peer-reviewed final revised paper. You are encouraged to refer to the final revised version.
Supplement (284 KB)

Journal article(s) based on this preprint

21 Jul 2025

An automated approach for developing geohazard inventories using news: integrating natural language processing (NLP), machine learning, and mapping

Aydoğan Avcıoğlu, Ogün Demir, and Tolga Görüm

Nat. Hazards Earth Syst. Sci., 25, 2421–2435, https://doi.org/10.5194/nhess-25-2421-2025, 2025

Interactive discussion

Status: closed

RC1:
'Comment on egusphere-2025-7', Anonymous Referee #1, 13 Feb 2025

I have read the manuscript entitled: An automated approach for developing geohazard inventories using news: Integrating NLP, machine learning, and mapping. Overall I think it is an interesting work whose objective is to develop an integrated method that recovers, classifies and geolocalizes multiple natural disasters. The basis of its study is the information published in some newspaper articles about natural hazards from Internet sources. The outline of the work is very well described in Figure 1. However, the manuscript is not clear in the following aspects: In section 2.2 the authors should clarify: The process of filtering the data or information obtained from newspaper sources. Equation 1 represents the probability of finding a term in the text, but equation 2 does not have a clear interpretation. It is not explained why the logarithm is used in Equation 2. What is the statistical interpretation of the product TF * IDF? The meaning of lines 170 and 171 is not clear. Figure 2 requires a clearer and more extensive explanation. Finally, I believe that the authors omit an analysis of the uncertainty in their results. Under these considerations, I believe that the work should be extensively revised.

Citation: https://doi.org/10.5194/egusphere-2025-7-RC1
- AC1:
  'Reply on RC1', Aydogan Avcioglu, 03 Mar 2025
  We appreciate your insightful and helpful remarks. After carefully reviewing all feedback and suggestions, we made the necessary revisions to the manuscript. An overview of the main modifications is provided below:
  The clarification and interpretation of the related equation and figure 2 have been added to the manuscript.
  
  A thorough description of uncertainty is provided below, and a new part describing newly conducted uncertainty analysis on location validations has been included in section 3.1.
  
  We reply point by point below, giving a thorough answer to each comment.
  We have made an effort to address any concerns expressed while preserving the manuscript's clarity and scientific integrity. We would be pleased to respond as soon as possible to any further remarks or requests for clarification.
  Regards,
  Reviewer:
  I have read the manuscript entitled: An automated approach for developing geohazard inventories using news: Integrating NLP, machine learning, and mapping. Overall I think it is an interesting work whose objective is to develop an integrated method that recovers, classifies and geolocalizes multiple natural disasters. The basis of its study is the information published in some newspaper articles about natural hazards from Internet sources. The outline of the work is very well described in Figure 1.
  Response: We thank the reviewer for valuable feedback. We have divided the reviewer’s comments into relevant parts to better answer the raised points.
  
  Reviewer:
  However, the manuscript is not clear in the following aspects: In section 2.2 the authors should clarify: The process of filtering the data or information obtained from newspaper sources. Equation 1 represents the probability of finding a term in the text, but Equation 2 does not have a clear interpretation. It is not explained why the logarithm is used in Equation 2. The meaning of lines 170 and 171 is not clear.
  Response: In our approach, we apply logarithmic scaling to account for the frequency distribution of words within the dataset. For example, ubiquitous terms such as "and" or "or" are found in nearly all of 1,000 news articles, whereas a more specific term like "wildfire" may appear in only 10 articles. Without logarithmic scaling, the inverse document frequency ratio would be 1 for the prevalent words (1,000/1,000) and 100 for the less frequent word (1,000/10). Since similarity algorithms weight terms by these values, the lack of scaling would result in an excessive focus on infrequent words. The logarithmic transformation compresses this range: very common words, which lack utility for classification, receive a weight close to zero, while meaningful distinctions among pertinent terms are preserved. This helps the model concentrate on informative terms instead of giving common ones excessive weight. We clarified the interpretation of Equation 2 in the manuscript along the lines of this response.
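The numbers in this reply can be reproduced with a short calculation. The sketch below is only an illustration: it assumes the common form idf = ln(N / df) with a natural logarithm, whereas the manuscript's Equation 2 is not reproduced in this discussion and library implementations (for example scikit-learn's TfidfVectorizer) use a smoothed variant. The function names `idf_raw` and `idf_log` are hypothetical.

```python
import math


def idf_raw(n_docs: int, doc_freq: int) -> float:
    """Unscaled inverse document frequency ratio N / df."""
    return n_docs / doc_freq


def idf_log(n_docs: int, doc_freq: int) -> float:
    """Log-scaled IDF, assuming the plain form ln(N / df)."""
    return math.log(n_docs / doc_freq)


# Figures from the reply: 1,000 articles; "and" appears in all of them,
# "wildfire" in only 10.
n_docs = 1000
doc_freq = {"and": 1000, "wildfire": 10}

for term, df in doc_freq.items():
    print(term, idf_raw(n_docs, df), round(idf_log(n_docs, df), 3))
# and 1.0 0.0
# wildfire 100.0 4.605
```

Under this assumption the ubiquitous term drops to a weight of zero, while the weight of the rare term shrinks from 100 to about 4.6, so it still stands out without dominating a similarity computation.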
  We have also clarified lines 170 and 171, where we explain the n-grams within the TfidfVectorizer, by adding the explanation: "By doing so, we benefit from more contextual meaning, for example, by maintaining word sequences, enabling models to differentiate phrases such as "sinkhole occurred" from the individual terms "sinkhole" and "occurred," which may possess distinct meanings when analyzed apart from one another."
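To make the n-gram idea concrete, the following minimal sketch extracts unigrams and bigrams from a tokenized sentence in plain Python. It is a stand-in for what the `ngram_range` parameter of TfidfVectorizer does, not the authors' code; the range (1, 2), the example sentence, and the helper name `ngrams` are assumptions made for illustration.

```python
def ngrams(tokens, n_min=1, n_max=2):
    """Return all contiguous word n-grams with n_min <= n <= n_max."""
    out = []
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            out.append(" ".join(tokens[i:i + n]))
    return out


# Hypothetical example sentence, already lowercased and tokenized.
tokens = "a sinkhole occurred near the village".split()
features = ngrams(tokens, 1, 2)

# The bigram is kept as its own feature next to the two unigrams.
print("sinkhole occurred" in features)  # True
print("sinkhole" in features and "occurred" in features)  # True
```

With bigrams included, the phrase "sinkhole occurred" receives its own TF-IDF weight, separate from the weights of "sinkhole" and "occurred" taken individually.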
  
  Reviewer:
  Figure 2 requires a clearer and more extensive explanation. Finally, I believe that the authors omit an analysis of the uncertainty in their results. Under these considerations, I believe that the work should be extensively revised.
  Response: We provided a fuller explanation in the Figure 2 caption: "The word clouds illustrate the most frequently seen words in the filtered news for the different geohazards. The size of each word denotes its relative frequency or significance within the dataset; larger words, such as “waters, of the sinkhole, due to landslide, of the fire” signify principal themes, whereas smaller words offer supplementary context pertaining to details of geohazards (for example “meters depth sinkhole”) within the news. The color variations serve solely for visual differentiation without indicating any categorical distinctions. Additionally, the words were positioned arbitrarily, and their positions do not indicate any geographic relation with geohazards."
  We thank the reviewer for raising the issue of uncertainty, which gives us the opportunity to clarify the limitations of our study. As also indicated by reviewer #2, we aimed to compare the spatiotemporal performance of our inventories with the existing literature in Türkiye. However, accessible spatiotemporal geohazard inventories are limited or do not exist in Türkiye, which hinders the evaluation of our inventory performance. As in section 3.1, we compared our inventories with related but limited case studies, for example, landslides taking place in a particular region and time. We therefore decided to perform an additional analysis, a ground truth evaluation step in which we manually verify our approach. To this end, we added a sub-section “Uncertainty assessment and limitations” to the Results and Discussion section. This method has also been followed by related studies extracting locations from text-based data (Madruga de Brito et al., 2025; Stein et al., 2024). Here, we randomly sampled 500 geohazard incidents as a ground truth evaluation step to assess mapping performance. The random sample comprised 284, 97, 76, and 43 incidents of flood, landslide, wildfire, and sinkhole, respectively. We manually checked these incidents by cross-checking the mapped location of each geohazard against the news context from which the location information was extracted. Our criterion was that each geohazard incident be mapped to the center of the smallest administrative unit available in the news context. Overall, the uncertainty assessment indicated good mapping performance, with 82.4 % of the geohazards mapped accurately.
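The figures reported in this reply can be cross-checked with a few lines of arithmetic. In the sketch below, the sample counts and the 82.4 % figure come from the reply and the inventory totals from the abstract; the number of correctly mapped incidents is derived from the percentage and is not stated by the authors. Variable names are hypothetical.

```python
# Validation sample reported in the reply (incidents per geohazard type).
sample_counts = {"flood": 284, "landslide": 97, "wildfire": 76, "sinkhole": 43}

# Full inventory reported in the abstract.
inventory_counts = {"flood": 9609, "landslide": 1843, "wildfire": 1834, "sinkhole": 654}

sample_size = sum(sample_counts.values())
inventory_size = sum(inventory_counts.values())

# Share of each hazard type in the sample versus the full inventory.
for hazard in sample_counts:
    sample_share = 100 * sample_counts[hazard] / sample_size
    inventory_share = 100 * inventory_counts[hazard] / inventory_size
    print(f"{hazard}: sample {sample_share:.1f} %, inventory {inventory_share:.1f} %")

# Derived, not reported: incidents implied by the 82.4 % accuracy figure.
reported_accuracy = 0.824
correct = round(reported_accuracy * sample_size)
print(sample_size, inventory_size, correct)  # 500 13940 412
```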
  
  References:
  Madruga de Brito, M., Sodoge, J., Kreibich, H., & Kuhlicke, C. (2025). Comprehensive Assessment of Flood Socioeconomic Impacts Through Text‐Mining. Water Resources Research, 61(1). https://doi.org/10.1029/2024WR037813
  Stein, L., Mukkavilli, S. K., Pfitzmann, B. M., Staar, P. W. J., Ozturk, U., Berrospi, C., Brunschwiler, T., & Wagener, T. (2024). Wealth Over Woe: Global Biases in Hydro‐Hazard Research. Earth’s Future, 12(10). https://doi.org/10.1029/2024EF004590
  
  Citation: https://doi.org/10.5194/egusphere-2025-7-AC1
  - RC3: 'Reply on AC1', Anonymous Referee #1, 04 Mar 2025
    
    I appreciate the reply from the authors and I consider the answers sufficient for the manuscript to be published.
    
    Citation: https://doi.org/10.5194/egusphere-2025-7-RC3
    
    AC3: 'Reply on RC3', Aydogan Avcioglu, 05 Mar 2025
    
    Dear Reviewer,
    
    We appreciate your positive remarks and your trust that our answers and manuscript are acceptable for publication. We are grateful for the time and effort you invested in reading our manuscript and offering insightful comments.
    Kind regards.
    
    Citation: https://doi.org/10.5194/egusphere-2025-7-AC3
RC2:
'Comment on egusphere-2025-7', Anonymous Referee #2, 17 Feb 2025
The manuscript addresses a relevant topic and proposes an interesting workflow for constructing geohazards inventories using online newspapers and natural language processing. It is well organized.
I have a few comments:
To further strengthen the manuscript, I recommend including more detailed explanations of the methods used—especially how potential biases from the model might influence the final dataset. A key concern is the geolocation strategy: if the exact street name is available, it is mapped to that street, otherwise it defaults to the city, and/or region (using the city/region’s center, presumably). This same principle seems to apply to city- and street-level data as well. While this approach may work if we want to see the geohazards’ distribution in a broader geographic area (e.g., all of Turkey), the uncertainty likely increases when examining finer geographic units. It would be helpful to clarify how this method might affect the accuracy and reliability of analyses at smaller scales.

Although the study compares its overall results with established literature, a more granular validation (e.g., comparing known hazard events in specific cities or regions) would be needed. Consider adding such a validation step to illustrate both the strengths and limitations of the approach at different scales. A few diverse cases would be sufficient.

The first sentence in the Introduction states, “Natural disasters are vital.” Could you please amend it?

Up to the authors: Consider replacing “natural hazards” with the term “geohazards”.
Citation: https://doi.org/10.5194/egusphere-2025-7-RC2
- AC2:
  'Reply on RC2', Aydogan Avcioglu, 03 Mar 2025
  We appreciate your insightful and helpful remarks. After carefully reviewing all feedback and suggestions, we made the necessary revisions to the manuscript. An overview of the main modifications is provided below:
  The assessment of potential accuracy changes regarding finer geographic units has been evaluated in detail.
  
  A thorough description of uncertainty is provided below, and a new part describing newly conducted uncertainty analysis on location validations has been included in section 3.1.
  
  Minor modifications have been made.
  
  We reply point by point below, giving a thorough answer to each comment.
  We have made an effort to address any concerns expressed while preserving the manuscript's clarity and scientific integrity. We would be pleased to respond as soon as possible to any further remarks or requests for clarification.
  Regards,
  Reviewer:
  The manuscript addresses a relevant topic and proposes an interesting workflow for constructing geohazards inventories using online newspapers and natural language processing. It is well organized.
  Response: We thank the reviewer for her/his insightful comments and we’re happy to answer and clarify the points raised by the reviewer.
  
  Reviewer:
  I have a few comments:
  To further strengthen the manuscript, I recommend including more detailed explanations of the methods used—especially how potential biases from the model might influence the final dataset. A key concern is the geolocation strategy: if the exact street name is available, it is mapped to that street, otherwise it defaults to the city, and/or region (using the city/region’s center, presumably). This same principle seems to apply to city- and street-level data as well. While this approach may work if we want to see the geohazards’ distribution in a broader geographic area (e.g., all of Turkey), the uncertainty likely increases when examining finer geographic units. It would be helpful to clarify how this method might affect the accuracy and reliability of analyses at smaller scales.
  
  Response: We thank the reviewer for raising these points, which we consider relevant to address both in this reply and in the manuscript. As indicated, geolocation was one of the most challenging parts of this study, since we relied on text-based information from online newspapers. Firstly, journalistic writing styles are inhomogeneous, so we cannot always access the same “administrative level” information (city, county, and village) or finer details such as street, neighborhood, or road. Our essential target in this study was therefore to map the geohazards within these administrative levels by geolocating incidents to the center of the corresponding places as they exist in OpenStreetMap. We follow this procedure because we are not necessarily aiming to map geohazards (particularly landslides and sinkholes) onto geomorphologically meaningful terrain; rather, our target is to locate each geohazard in time and place it within the smallest available administrative level by taking the center of the city, village, etc. This procedure yields spatially more accurate inventories for floods than for the other geohazards, since almost every newsworthy flood occurs within the urbanized area of these administrative units. Secondly, finer-resolution information such as streets and roads is problematic because of the inhomogeneous content of the news noted above. For example, most of the news does not include street or road information for landslides and floods. Wildfire incidents mostly occur in forested areas outside urbanized areas (although they may also occur within small forest patches in urbanized areas), so we optimized their locations by assigning wildfire incidents to the closest forested areas using land use and land cover maps.
  Here, the most relevant geohazard for finer geographic units is flooding, since floods mostly occur in urbanized areas. However, since we are usually unable to extract specific street or road information, we prefer to geolocate our inventories to the center of the administrative units. On the one hand, even at street or road level it is not possible to extract which part, or which kilometer, of such a line-based location was affected. On the other hand, since floods have an areal extent (for example, inundation areas), further studies might achieve better resolution by integrating a remote-sensing-based approach to identify the exact location of these events within the urbanized area. Furthermore, a more accurate location representation for landslides and sinkholes also requires geomorphological interpretation, integrating high-resolution satellite images to delineate their polygonal areas (particularly for landslides). We have clarified these issues by adding a section named “Uncertainty assessment and limitations” to the Results and Discussion.
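The rule described in this response, placing each incident at the center of the smallest administrative unit mentioned in the news, can be sketched as follows. This is not the authors' implementation: the study queries OSM Nominatim, whereas this sketch replaces the network lookup with a hypothetical in-memory gazetteer. All place names and coordinates are made-up placeholders, and `ADMIN_RANK`, `GAZETTEER`, and `geolocate` are illustrative names.

```python
from typing import Optional

# Assumed ordering of administrative levels: lower rank means a finer unit.
ADMIN_RANK = {"village": 0, "county": 1, "city": 2}

# Hypothetical gazetteer standing in for a geocoder lookup:
# place name -> (administrative level, latitude, longitude) of its center.
GAZETTEER = {
    "ExampleCity": ("city", 40.0, 30.0),
    "ExampleCounty": ("county", 40.2, 30.3),
    "ExampleVillage": ("village", 40.25, 30.35),
}


def geolocate(mentioned_places, gazetteer) -> Optional[tuple]:
    """Pick the finest administrative unit among the places named in a news
    text and return (place, level, (lat, lon)), or None if none is known."""
    candidates = [
        (ADMIN_RANK[gazetteer[place][0]], place)
        for place in mentioned_places
        if place in gazetteer
    ]
    if not candidates:
        return None
    _, place = min(candidates)  # smallest rank = smallest administrative unit
    level, lat, lon = gazetteer[place]
    return place, level, (lat, lon)


# A text naming a city and one of its villages is mapped to the village center.
print(geolocate(["ExampleCity", "ExampleVillage"], GAZETTEER))
# A text naming only the city falls back to the city center.
print(geolocate(["ExampleCity"], GAZETTEER))
```

The sketch also makes the limitation discussed in the response visible: when only a city is named, the incident is placed at the city center, so positional uncertainty grows with the size of the unit.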
  Reviewer:
  Although the study compares its overall results with established literature, a more granular validation (e.g., comparing known hazard events in specific cities or regions) would be needed. Consider adding such a validation step to illustrate both the strengths and limitations of the approach at different scales. A few diverse cases would be sufficient.
  Response: We are grateful that the reviewer agreed with our discussion points, in which we compared our findings with previous research. To address the points on uncertainty assessment, which were also raised by reviewer #1, we added a sub-section “Uncertainty assessment and limitations” to the Results and Discussion section. We would first like to note that validating the obtained data is challenging in Türkiye, due to the lack of complete and open-access inventories. Section 3.1 therefore mainly compares our inventories with the existing literature. However, most of these geohazard reports are case studies that investigate geohazards from a geological, geomorphological, or meteorological point of view rather than inventory assessments. Hence, to strengthen the reliability of our study, we have added a ground truth evaluation step, a manual verification approach also followed by related studies (Madruga de Brito et al., 2025; Stein et al., 2024). Here, we randomly sampled 500 geohazard incidents as a ground truth evaluation step to assess mapping performance. The random sample comprised 284, 97, 76, and 43 incidents of flood, landslide, wildfire, and sinkhole, respectively. We manually checked these incidents by cross-checking the mapped location of each geohazard against the news context from which the location information was extracted. Our criterion was that each geohazard incident be mapped to the center of the smallest administrative unit available in the news context. Overall, the uncertainty assessment indicated good mapping performance, with 82.4 % of the geohazards mapped accurately.
  
  Reviewer:
  The first sentence in the Introduction states, “Natural disasters are vital.” Could you please amend it?
  Response: Yes, we replaced the first sentence with “Geohazards are direct threats to human life, ecosystems, and societies worldwide socio-economically, demanding ongoing innovation and development in the mapping, analysis, and monitoring of these events.”
  
  Reviewer: Up to the authors: Consider replacing “natural hazards” with the term “geohazards”.
  Response: Thanks for the suggestion. We replaced the term “natural hazards” with “geohazards” throughout to avoid confusion arising from mixed terminology.
  
  References:
  Madruga de Brito, M., Sodoge, J., Kreibich, H., & Kuhlicke, C. (2025). Comprehensive Assessment of Flood Socioeconomic Impacts Through Text‐Mining. Water Resources Research, 61(1). https://doi.org/10.1029/2024WR037813
  Stein, L., Mukkavilli, S. K., Pfitzmann, B. M., Staar, P. W. J., Ozturk, U., Berrospi, C., Brunschwiler, T., & Wagener, T. (2024). Wealth Over Woe: Global Biases in Hydro‐Hazard Research. Earth’s Future, 12(10). https://doi.org/10.1029/2024EF004590
  
  Citation: https://doi.org/10.5194/egusphere-2025-7-AC2

Interactive discussion

Status: closed

RC1:
'Comment on egusphere-2025-7', Anonymous Referee #1, 13 Feb 2025

I have read the manuscript entitled: An automated approach for developing geohazard inventories using news: Integrating NLP, machine learning, and mapping. Overall I think it is an interesting work whose objective is to develop an integrated method that recovers, classifies and geolocalizes multiple natural disasters. The basis of its study is the information published in some newspaper articles about natural hazards from Internet sources. The outline of the work is very well described in Figure 1. However, the manuscript is not clear in the following aspects: In section 2.2 the authors should clarify: The process of filtering the data or information obtained from newspaper sources. Equation 1 represents the probability of finding a term in the text, but equation 2 does not have a clear interpretation. It is not explained why the logarithm is used in Equation 2. What is the statistical interpretation of the product TF * IDF? The meaning of lines 170 and 171 is not clear. Figure 2 requires a clearer and more extensive explanation. Finally, I believe that the authors omit an analysis of the uncertainty in their results. Under these considerations, I believe that the work should be extensively revised.

Citation: https://doi.org/10.5194/egusphere-2025-7-RC1
- AC1:
  'Reply on RC1', Aydogan Avcioglu, 03 Mar 2025
  We appreciate your insightful and helpful remarks. After carefully reviewing every feedback and suggestion, we made the necessary revisions to the manuscript. An overview of the main modifications performed is provided below:
  The clarification and interpretation of the related equation and figure 2 have been added to the manuscript.
  
  A thorough description of uncertainty is provided below, and a new part describing newly conducted uncertainty analysis on location validations has been included in section 3.1.
  
  We reply sentence to sentence below (with bold style) showing our thorough answers to each comment. Please see our answers given in bold style.
  We have made an effort to address any concerns expressed while preserving the manuscript's clarity and scientific integrity. We would be pleased to respond as soon as possible to any further remarks or requests for clarification.
  Regards,
  Reviewer:
  I have read the manuscript entitled: An automated approach for developing geohazard inventories using news: Integrating NLP, machine learning, and mapping. Overall I think it is an interesting work whose objective is to develop an integrated method that recovers, classifies and geolocalizes multiple natural disasters. The basis of its study is the information published in some newspaper articles about natural hazards from Internet sources. The outline of the work is very well described in Figure 1.
  Response: We thank the reviewer for valuable feedback. We have divided the reviewer’s comments into relevant parts to better answer the raised points.
  
  Reviewer:
  However, the manuscript is not clear in the following aspects: In section 2.2 the authors should clarify: The process of filtering the data or information obtained from newspaper sources. Equation 1 represents the probability of finding a term in the text, but Equation 2 does not have a clear interpretation. It is not explained why the logarithm is used in Equation 2.c The meaning of lines 170 and 171 is not clear.
  Response: In our approach, we implement logarithmic scaling to address the frequency distribution of words within the dataset, as elucidated in the foundational equations. For example, ubiquitous terms such as "and" or "or" are found in nearly all 1,000 news articles, whereas a more specific term like "wildfire" may only appear in 10 articles. Absent logarithmic scaling, the term frequency ratio would be 1 for prevalent words (1,000/1,000) and 100 for less frequent words (1,000/10). Since similarity algorithms continue to prioritize these values, the lack of scaling would result in an excessive focus on infrequent words. By employing a logarithmic transformation, we mitigate the influence of excessively common words that lack utility for classification, while preserving significant distinctions among pertinent terms. By making this change, the model is guaranteed to concentrate on informative terms instead of giving common ones an excessive amount of weight. We clarified the interpretation of Equation 2 in the manuscript by considering our response here.
  We have also clarified lines 170 and 171, where we explain the n-grams within the TfidfVectorizer by adding the explanation: "By doing so, we benefit from more contextual meaning, for example, by maintaining word sequences, enabling models to differentiate phrases such as "sinkhole occurred" from the individual terms "sinkhole" and "occurred," which may possess distinct meanings when analyzed in apart from one another."
  
  Reviewer:
  Figure 2 requires a clearer and more extensive explanation. Finally, I believe that the authors omit an analysis of the uncertainty in their results. Under these considerations, I believe that the work should be extensively revised.
  Response: We made a better explanation in the Figure 2 caption, "The world clouds illustrate the most frequently seen words in the filtered for different geohazard news. The sizes of each word denote its relative frequency or significance within the dataset; larger words, such as “waters, of the sinkhole, due to landslide, of the fire” signify principal themes, whereas smaller words offer supplementary context pertaining to details of geohazards (for example “meters depth sinkhole”) within the news. The color variations serve solely for visual differentiation without indicating any categorical distinctions. Additionally, the spatial location of the words was arbitrarily positioned and does not indicate geographic relation with geohazards."
  We thank the reviewer for raising relevant points regarding the uncertainty allowing us a better opportunity for clarification of the limitations of our study. As indicated also by reviewer #2, we aimed to compare the spatiotemporal performance of our inventories with existing literature in Türkiye. However, accessible spatiotemporal geohazard inventories are limited or do not exist in Türkiye hindering our capabilities in evaluating our inventory performance. As we made in section 3.1, we compared our inventories with related but limited case studies, for example, landslides taking place in a particular region and time. Therefore, we have decided to make an additional analysis, the ground truth evaluation step which we targeted to manually verify our approach. For this aim, we opened a sub-section “Uncertainty assessment and limitations” in the Results and Discussion section. This method was followed by also related studies, extracting location from text-based data (Madruga de Brito et al., 2025; Stein et al., 2024). Here, we used random sampling as a ground truth evaluation step with 500 geohazard incidents to assess mapping performance. The random sampling resulted in 284, 97, 76, and 43 incidents of flood, landslide, wildfire, and sinkhole, respectively. We have manually checked these incidents and evaluated them by cross-checking the location of mapped geohazards and news context where we extracted location information. Our criteria were to achieve mapping the geohazard incidents to the center of the smallest administrative units which is available in the context of news. The uncertainty assessment for mapping performance overall resulted in good performance which is 82.4 % of geohazards accurately were mapped.
  
  References:
  Madruga de Brito, M., Sodoge, J., Kreibich, H., & Kuhlicke, C. (2025). Comprehensive Assessment of Flood Socioeconomic Impacts Through Text‐Mining. Water Resources Research, 61(1). https://doi.org/10.1029/2024WR037813
  Stein, L., Mukkavilli, S. K., Pfitzmann, B. M., Staar, P. W. J., Ozturk, U., Berrospi, C., Brunschwiler, T., & Wagener, T. (2024). Wealth Over Woe: Global Biases in Hydro‐Hazard Research. Earth’s Future, 12(10). https://doi.org/10.1029/2024EF004590
  
  Citation: https://doi.org/10.5194/egusphere-2025-7-AC1
  - RC3: 'Reply on AC1', Anonymous Referee #1, 04 Mar 2025
    
    I appreciate the replay from the authors and I consider the answers are enough to be published the manuscript.
    
    Citation: https://doi.org/10.5194/egusphere-2025-7-RC3
    
    AC3: 'Reply on RC3', Aydogan Avcioglu, 05 Mar 2025
    
    Dear Reviewer,
    
    We appreciate your positive remarks and your trust that our answers and manuscript are acceptable for publication. We thank the time and effort you invested in reading our manuscript and offering insightful comments.
    Kind regards.
    
    Citation: https://doi.org/10.5194/egusphere-2025-7-AC3
RC2:
'Comment on egusphere-2025-7', Anonymous Referee #2, 17 Feb 2025
The manuscript addresses a relevant topic and proposes an interesting workflow for constructing geohazards inventories using online newspapers and natural language processing. It is well organized.
I have few comments:
To further strengthen the manuscript, I recommend including more detailed explanations of the methods used—especially how potential biases from the model might influence the final dataset. A key concern is the geolocation strategy: if the exact street name is available, it is mapped to that street, otherwise it defaults to the city, and/or region (using the city/region’s center, presumably). This same principle seems to apply to city- and street-level data as well. While this approach may work if we want to see the geohazards’ distribution in a broader geographic area (e.g., all of Turkey), the uncertainty likely increases when examining finer geographic units. It would be helpful to clarify how this method might affect the accuracy and reliability of analyses at smaller scales.

Although the study compares its overall results with established literature, a more granular validation (e.g., comparing known hazard events in specific cities or regions) would be needed. Consider adding such a validation step to illustrate both the strengths and limitations of the approach at different scales. A few diverse cases would be sufficient.

The first sentence in the Introduction states, “Natural disasters are vital”. Could you please amend it?

Up to the authors: Consider replacing the term natural hazards with geohazards.
Citation: https://doi.org/10.5194/egusphere-2025-7-RC2
- AC2: 'Reply on RC2', Aydogan Avcioglu, 03 Mar 2025
  We appreciate your insightful and helpful remarks. After carefully reviewing all feedback and suggestions, we made the necessary revisions to the manuscript. An overview of the main modifications is provided below:
  Potential changes in accuracy at finer geographic units have been assessed in detail.
  
  A thorough description of uncertainty is provided below, and a new part describing newly conducted uncertainty analysis on location validations has been included in section 3.1.
  
  Minor modifications have been made.
  
  We reply point by point below, with a thorough answer to each comment.
  We have made an effort to address any concerns expressed while preserving the manuscript's clarity and scientific integrity. We would be pleased to respond as soon as possible to any further remarks or requests for clarification.
  Regards,
  Reviewer:
  The manuscript addresses a relevant topic and proposes an interesting workflow for constructing geohazards inventories using online newspapers and natural language processing. It is well organized.
  Response: We thank the reviewer for her/his insightful comments and we’re happy to answer and clarify the points raised by the reviewer.
  
  Reviewer:
  I have a few comments:
  To further strengthen the manuscript, I recommend including more detailed explanations of the methods used—especially how potential biases from the model might influence the final dataset. A key concern is the geolocation strategy: if the exact street name is available, it is mapped to that street, otherwise it defaults to the city, and/or region (using the city/region’s center, presumably). This same principle seems to apply to city- and street-level data as well. While this approach may work if we want to see the geohazards’ distribution in a broader geographic area (e.g., all of Turkey), the uncertainty likely increases when examining finer geographic units. It would be helpful to clarify how this method might affect the accuracy and reliability of analyses at smaller scales.
  
  Response: We thank the reviewer for raising these points, which we address both in this reply and in the manuscript. As indicated, geolocation was one of the most challenging parts of this study because we relied on text-based information in online newspapers. First, journalistic writing is inhomogeneous: news articles do not consistently report the same administrative level (city, county, or village), nor finer details such as street, neighborhood, or road. Our main aim was therefore to map each geohazard, with its date, to the smallest administrative level available by geolocating the incident to the center of that place as it exists in Open Street Map. We follow this procedure because we are not necessarily aiming to map geohazards (particularly landslides and sinkholes) onto geomorphologically meaningful terrain. This procedure nevertheless yields spatially more accurate inventories for floods than for the other geohazards, since almost every newsworthy flood occurs within the urbanized area of these administrative units. Second, finer-resolution information such as streets and roads is problematic because of the same inhomogeneity; for example, most news articles on landslides and floods include no street or road information. Wildfires mostly occur in forested areas outside urbanized areas (although they can also occur in small forest patches within them), so we refined their locations by assigning each wildfire incident to the closest forested area using land use and land cover maps.
  Floods are the geohazard most relevant to finer geographic units, since they mostly occur in urbanized areas. However, because specific street or road information can rarely be extracted, we geolocate our inventories to the center of the administrative units. On the one hand, even when a street or road is named, the text does not indicate which section or kilometer of that linear feature was affected. On the other hand, floods have an areal extent (for example, inundation areas), so further studies might achieve better resolution by integrating remote sensing-based approaches to identify the exact location of these events within the urbanized area. Likewise, a more accurate location representation for landslides and sinkholes requires geomorphological interpretation, integrating high-resolution satellite images to delineate their polygonal areas (particularly for landslides). We have clarified these issues by adding a sub-section named “Uncertainty assessment and limitations” to the Results and Discussion section.
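  The fallback described in this response (use the smallest administrative unit that can be resolved, and take its center) can be sketched in Python as follows. This is a minimal illustration and not the authors' implementation: the `GAZETTEER` dictionary is a hypothetical stand-in for an Open Street Map Nominatim lookup, and the function name, the assumed level hierarchy, and the approximate coordinates are chosen only for the example.

```python
from typing import Optional

# Hypothetical stand-in for a Nominatim lookup: place name -> (lat, lon) of its center.
# Coordinates are approximate and for illustration only.
GAZETTEER = {
    "Rize": (41.02, 40.52),
    "Güneysu, Rize": (40.98, 40.61),
}

# Administrative levels ordered from finest to coarsest (an assumed hierarchy).
LEVELS = ("village", "county", "city")


def geolocate(mentions: dict) -> Optional[tuple]:
    """Return (level, lat, lon) for the finest administrative unit that resolves."""
    for level in LEVELS:
        name = mentions.get(level)
        if name and name in GAZETTEER:
            lat, lon = GAZETTEER[name]
            return level, lat, lon
    return None  # no place mention could be resolved


# A news item naming a village that is missing from the gazetteer
# falls back to the center of its county.
incident = {
    "village": "Unknown village, Güneysu, Rize",
    "county": "Güneysu, Rize",
    "city": "Rize",
}
result = geolocate(incident)
```

  Recording the resolved level alongside the coordinates is a design choice of this sketch: it would let a user filter out coarse, city-level points before analysing finer geographic units.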
  Reviewer:
  Although the study compares its overall results with established literature, a more granular validation (e.g., comparing known hazard events in specific cities or regions) would be needed. Consider adding such a validation step to illustrate both the strengths and limitations of the approach at different scales. A few diverse cases would be sufficient.
  Response: We are grateful that the reviewer agreed with our discussion points, in which we compared our findings with previous research. To address the uncertainty assessment, which was also raised by Reviewer #1, we added a sub-section, “Uncertainty assessment and limitations”, to the Results and Discussion section. We first note that validating the obtained data is challenging in Türkiye because complete, open-access inventories are lacking. Section 3.1 therefore mainly compares our inventories with the existing literature. However, most of these geohazard reports are case studies that investigate geohazards from a geological, geomorphological, or meteorological point of view rather than inventory assessments. Hence, to strengthen the reliability of our study, we added a ground truth evaluation step, a manual verification approach also followed by related studies (Madruga de Brito et al., 2025; Stein et al., 2024). We drew a random sample of 500 geohazard incidents to assess mapping performance; it comprised 284 flood, 97 landslide, 76 wildfire, and 43 sinkhole incidents. We manually checked these incidents by cross-checking the mapped location of each geohazard against the news text from which the location information was extracted. Our criterion was that each incident be mapped to the center of the smallest administrative unit mentioned in the news. Overall, mapping performance was good: 82.4 % of the geohazards were mapped accurately.
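  The sample composition and overall accuracy reported in this response can be checked with a few lines of Python. The per-hazard sample sizes and the 82.4 % figure come from the response; the count of 412 correctly mapped incidents is not stated there and is derived here as 82.4 % of 500. The `draw_sample` helper and its seed are hypothetical, included only to show how a simple random sample of the 13,940 mapped incidents could be drawn.

```python
import random

# Per-hazard counts in the 500-incident random sample (from the response).
sample_counts = {"flood": 284, "landslide": 97, "wildfire": 76, "sinkhole": 43}
sample_size = sum(sample_counts.values())

# Derived, not reported: 82.4 % of 500 incidents corresponds to 412 correct locations.
correct = 412
accuracy_pct = 100 * correct / sample_size


def draw_sample(incident_ids, k, seed=0):
    """Simple random sample without replacement (hypothetical helper)."""
    return random.Random(seed).sample(list(incident_ids), k)


# Draw 500 of the 13,940 mapped incidents for manual checking.
to_check = draw_sample(range(13940), sample_size)
print(sample_size, round(accuracy_pct, 1))
```

  Because the sample is drawn at random rather than stratified, the hazard shares in the sample follow the shares in the full inventory, which is why floods dominate it.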
  
  Reviewer:
  The first sentence in the Introduction states, “Natural disasters are vital”. Could you please amend it?
  Response: Yes, we replaced the first sentence with “Geohazards are direct threats to human life, ecosystems, and societies worldwide socio-economically, demanding ongoing innovation and development in the mapping, analysis, and monitoring of these events.”
  
  Reviewer: Up to the authors: Consider replacing the term natural hazards with geohazards.
  Response: Thank you for the suggestion. We replaced the term natural hazards with geohazards to avoid the confusion that mixed terminology might cause.
  
  References:
  Madruga de Brito, M., Sodoge, J., Kreibich, H., & Kuhlicke, C. (2025). Comprehensive Assessment of Flood Socioeconomic Impacts Through Text‐Mining. Water Resources Research, 61(1). https://doi.org/10.1029/2024WR037813
  Stein, L., Mukkavilli, S. K., Pfitzmann, B. M., Staar, P. W. J., Ozturk, U., Berrospi, C., Brunschwiler, T., & Wagener, T. (2024). Wealth Over Woe: Global Biases in Hydro‐Hazard Research. Earth’s Future, 12(10). https://doi.org/10.1029/2024EF004590
  
  Citation: https://doi.org/10.5194/egusphere-2025-7-AC2

Peer review completion

AR: Author's response | RR: Referee report | ED: Editor decision | EF: Editorial file upload

ED: Reconsider after major revisions (further review by editor and referees) (21 Mar 2025) by Vassiliki Kotroni

AR by Aydogan Avcioglu on behalf of the Authors (24 Mar 2025) Author's response Author's tracked changes Manuscript

ED: Referee Nomination & Report Request started (24 Mar 2025) by Vassiliki Kotroni

RR by Anonymous Referee #2 (01 Apr 2025)

RR by Anonymous Referee #3 (20 Apr 2025)

Suggestions for revision or reasons for rejection

In this study, the authors developed a tool for creating inventories, which are of great importance in earth sciences and disaster studies, using a case from Turkey. They also rapidly geolocated the resulting inventories and made them ready for use. This topic is very important for disaster mitigation and for modelling with techniques such as machine learning, especially for countries that lack local data and are poorly represented in global data. I have read the article several times, and I find it well structured and well written. However, there are a few minor points. I therefore recommend that the article be accepted after the following minor points have been dealt with.

Major Comments:
1) Here I would recommend that you give more emphasis to the generalization of the results of the study for use worldwide, especially in economically underdeveloped countries.

Minor Comments:
1) In Figure 1, “Raw News” should be replaced with “Unrefined News” to be consistent with Table 2.
2) It would be better if you consider changing “Natural Hazard Inventory” to “Geohazard Inventory” since you use the “geohazard” in the manuscript.
3) The reason why NMF has been chosen might be added to the modeling section.
4) Open Street Map should be emphasized in the Geolocator section since you use the Nominatim tool.
5) It’s up to the authors, but consider replacing “online gazettes” with “newspaper”, since you mostly use “newspaper”, to keep your manuscript consistent.
6) To demonstrate how the coherence score is regarded as an uncertainty indication, consider including a supporting sentence.
7) Consider replacing “research” with “literature” in this sentence: “On the one hand, to enhance the reliability of our study, we incorporated a ground truth evaluation step, a manual verification method utilized in related research (Madruga de Brito et al., 2025; Stein et al., 2024)”.
8) An explanation might be added to Figure 3 caption to clarify why the years vary in the X-axes of the plots.
9) In the sentence “‘Yangın’ (fire) and ‘orman’ (forest) are the two most commonly (3.28% and 2.59%, respectively) used terms about wildfires”, the details within the parentheses should be moved to the end of the sentence.
10) Can you better explain with a supportive sentence how you distinguished the urban fires?

Referee Report: PDF

ED: Publish subject to minor revisions (review by editor) (22 Apr 2025) by Vassiliki Kotroni

AR by Aydogan Avcioglu on behalf of the Authors (23 Apr 2025) Author's response Author's tracked changes Manuscript

ED: Publish as is (24 Apr 2025) by Vassiliki Kotroni

AR by Aydogan Avcioglu on behalf of the Authors (24 Apr 2025) Manuscript

Journal article(s) based on this preprint

21 Jul 2025

An automated approach for developing geohazard inventories using news: integrating natural language processing (NLP), machine learning, and mapping

Aydoğan Avcıoğlu, Ogün Demir, and Tolga Görüm

Nat. Hazards Earth Syst. Sci., 25, 2421–2435, https://doi.org/10.5194/nhess-25-2421-2025, 2025

Supplement

https://doi.org/10.5194/egusphere-2025-7-supplement

Model code and software

tr-news-scraper: Scrape Turkish news articles Ogün Demir and Aydoğan Avcıoğlu https://github.com/demirogun/tr-news-scraper

Viewed

Total article views: 519 (including HTML, PDF, and XML)

HTML	PDF	XML	Total	Supplement	BibTeX	EndNote
279	221	19	519	48	13	28

Views and downloads (calculated since 21 Jan 2025)

Month	HTML	PDF	XML	Total
Jan 2025	98	63	4	165
Feb 2025	45	33	2	80
Mar 2025	46	25	4	75
Apr 2025	23	20	2	45
May 2025	23	35	1	59
Jun 2025	24	19	6	49
Jul 2025	20	26	0	46

Viewed (geographical distribution)

Total article views: 517 (including HTML, PDF, and XML) Thereof 517 with geography defined and 0 with unknown origin.

Latest update: 24 Jul 2025

The requested preprint has a corresponding peer-reviewed final revised paper. You are encouraged to refer to the final revised version.

Short summary

Here we demonstrate an approach for developing inventories of geolocated geohazard incidents from internet sources. We created a tool that automatically retrieves news, processes it using NLP and machine learning, and maps it using Open Street Map. Consequently, we present spatiotemporal geohazard inventories totalling 13940 incidents between 1997 and 2023 in Türkiye. Our alternative, easy-to-implement inventory development method aids geohazard management and resilience.

