the Creative Commons Attribution 4.0 License.
Saudi Rainfall (SaRa): Hourly 0.1° Gridded Rainfall (1979–Present) for Saudi Arabia via Machine Learning Fusion of Satellite and Model Data
Abstract. We introduce Saudi Rainfall (SaRa), a gridded historical and near real-time precipitation (P) product specifically designed for the Arabian Peninsula, one of the most arid, water-stressed, and data-sparse regions on Earth. The product has an hourly 0.1° resolution spanning from 1979 to the present and is continuously updated with a latency of less than two hours. The algorithm underpinning the product involves 18 machine learning model stacks trained for different combinations of satellite and (re)analysis P products along with several static predictors. As a training target, hourly and daily P observations from gauges in Saudi Arabia (n=113) and globally (n=14,256) are used. To evaluate the performance of SaRa, we carried out the most comprehensive evaluation of gridded P products in the region to date, using observations from independent gauges (excluded from training) in Saudi Arabia as a reference (n=119). Among the 20 evaluated P products, our new product, SaRa, consistently ranked first across all evaluation metrics, including the Kling-Gupta Efficiency (KGE), correlation, bias, peak bias, wet days bias, and critical success index. Notably, SaRa achieved a median KGE (a summary statistic combining correlation, bias, and variability) of 0.36, while widely used non-gauge-based products such as CHIRP, ERA5, GSMaP V8, and IMERG-L V07 achieved values of -0.07, 0.21, -0.13, and -0.39, respectively. SaRa also outperformed four gauge-based products (CHIRPS V2, CPC Unified, IMERG-F V07, and MSWEP V2.8), which had median KGE values of 0.17, -0.03, 0.29, and 0.20, respectively. Our new P product, available at www.gloh2o.org/sara, addresses a crucial need in the Arabian Peninsula, providing a robust and reliable dataset to support hydrological modeling, water resource assessments, flood management, and climate research.
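For readers unfamiliar with the KGE, the sketch below shows one way it can be computed for a single gauge. It assumes the original Gupta et al. (2009) formulation, KGE = 1 - sqrt((r - 1)^2 + (alpha - 1)^2 + (beta - 1)^2); the abstract does not state which variant the paper uses, and the function name `kge` is purely illustrative.

```python
import numpy as np


def kge(sim, obs):
    """Kling-Gupta Efficiency, assuming the Gupta et al. (2009) form.

    sim, obs: equal-length 1-D sequences of estimated and observed P.
    Returns 1.0 for a perfect match; lower values are worse.
    """
    sim = np.asarray(sim, dtype=float)
    obs = np.asarray(obs, dtype=float)

    r = np.corrcoef(sim, obs)[0, 1]   # correlation component
    alpha = sim.std() / obs.std()     # variability ratio
    beta = sim.mean() / obs.mean()    # bias ratio

    return 1.0 - np.sqrt((r - 1.0) ** 2 + (alpha - 1.0) ** 2 + (beta - 1.0) ** 2)
```

Under this formulation an estimate that is perfectly correlated with the observations but twice as large has alpha = beta = 2, so it scores 1 - sqrt(2), which is roughly -0.41. Negative median values such as those quoted above are therefore not unusual.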
Status: closed
RC1: 'Comment on egusphere-2025-254', Anonymous Referee #1, 05 Mar 2025
This study presents a machine-learning-based approach to estimating gridded rainfall data for Saudi Arabia, an arid region with significant data limitations. The proposed dataset, SaRa, is compared against multiple existing precipitation datasets. The approach uses a combination of random forest and XGBoost models. While not very novel, the results suggest superior performance, thus adding value and contributing to data availability in the region. While the paper is well-structured with a sound methodology, fundamental concerns arise regarding the model accuracy away from training sites, generalizability, and reliability of the identified trends.
Major:
- I appreciate the authors filtering for potentially double precipitation gauges within 2 km, but the paper needs more clarity on how the split training/testing sample was performed. Was it random? Stratified? Distance-based?
- When applying ML to geospatial datasets, a critical issue is the use of testing sites near training sites, which often artificially boosts validation statistics. That’s because precipitation data is spatially correlated. To enhance transparency and trust in ML approaches, the accuracy of the ML models should also be evaluated based on their distance from training sites. Please plot the KGE testing accuracy of each testing point vs. its distance from the nearest training site (km). This will show how far the proposed ML approach can be trusted in distant/ungauged areas. This plot would be informative for the main individual ML models and the ensemble stack.
- The ensemble approach, while interesting, results in a black-box system: there is little discussion of the physical interpretability of the model structures or the predictive power of the inputs. scikit-learn random forests and XGBoost have out-of-the-box tools that can be easily deployed to evaluate model interpretability further. This could improve model understanding and expand the generalizability of the proposed approaches.
- The study does not sufficiently address uncertainty in trend estimations. There are no confidence intervals, no discussion of interannual variability, and no attempt to separate natural variability from long-term trends. Given the known issues with historical precipitation datasets, particularly in arid regions, one must question how much of the trend results from dataset evolution rather than actual climate change.
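The distance-based check requested in the second major comment could be sketched as follows. This is a minimal illustration and not the authors' code: the function names, the input layout (latitude/longitude pairs in degrees), and the choice of bin edges are all hypothetical, and great-circle (haversine) distance on a spherical Earth is an assumption.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; spherical approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between points given in degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))


def nearest_training_distance_km(test_latlon, train_latlon):
    """For each test gauge, distance (km) to the closest training gauge.

    Both inputs are sequences of (lat, lon) pairs in degrees.
    """
    test = np.asarray(test_latlon, dtype=float)
    train = np.asarray(train_latlon, dtype=float)
    # Broadcast to an (n_test, n_train) matrix of pairwise distances.
    d = haversine_km(test[:, None, 0], test[:, None, 1],
                     train[None, :, 0], train[None, :, 1])
    return d.min(axis=1)


def median_kge_by_distance(dist_km, kge_values, bin_edges_km):
    """Median per-gauge KGE within each distance bin [edge_i, edge_i+1)."""
    dist_km = np.asarray(dist_km, dtype=float)
    kge_values = np.asarray(kge_values, dtype=float)
    idx = np.digitize(dist_km, bin_edges_km)  # bin i means edges[i-1] <= d < edges[i]
    result = {}
    for i in range(1, len(bin_edges_km)):
        mask = idx == i
        if mask.any():  # skip empty bins rather than report NaN
            result[(bin_edges_km[i - 1], bin_edges_km[i])] = float(
                np.median(kge_values[mask]))
    return result
```

Plotting the per-gauge KGE against the output of `nearest_training_distance_km` (for example as a scatter plot) gives the figure the referee asks for. The binned medians give a compact summary of whether skill degrades with distance from the training network.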
Moderate:
- The paper would benefit from a quantitative analysis and discussion of how temporal resolution mismatches in the gauge data impact validation results.
Minor:
- L143: clarify what gross errors are.
Citation: https://doi.org/10.5194/egusphere-2025-254-RC1
- AC1: 'Reply on RC1', Xuetong Wang, 05 Jun 2025
RC2: 'Comment on egusphere-2025-254', Anonymous Referee #2, 30 May 2025
Review for "Saudi Rainfall (SaRa): Hourly 0.1° Gridded Rainfall (1979–Present) for Saudi Arabia via Machine Learning Fusion of Satellite and Model Data" by Wang et al., submitted to EGUsphere (MS No.: egusphere-2025-254).
General comments:
The authors introduce Saudi Rainfall (SaRa), a gridded precipitation product for the Arabian Peninsula developed using Machine Learning (ML) techniques. They clearly present the motivation behind the development of such a dataset, describe the procedures used to generate the SaRa product, and evaluate its performance. By leveraging a large amount of available gauge-based and gridded datasets, the authors produce a new dataset that shows improved performance compared to existing products—particularly in areas with sparse station observations and in the dominantly arid regions of the Arabian Peninsula.
This work makes a valuable contribution to the data community and enhances scientific understanding of precipitation patterns in data-scarce, arid environments. The overall quality of the manuscript is good, with well-cited references and generally clear writing. However, there is still room for further improvement. In particular, I would like to raise two main concerns:
- Limitations of Machine Learning: What are the potential limitations, challenges and sources of error introduced by using Machine Learning techniques in generating this dataset? A discussion on uncertainties and biases associated with ML itself would strengthen the paper.
- Broader Impact and Global Appeal: What is the relevance of this work beyond the Arabian Peninsula? Discussing the broader applicability of the methodology and insights would enhance the global significance of the study.
In addition, I suggest the authors consider the following points:
- Include a study area map: Add a map of the Arabian Peninsula showing the region’s topography and its location in a global context. This would help orient readers unfamiliar with the area.
- Describe ML Challenges: Provide a more detailed discussion of the challenges and limitations in implementing ML for P data generation.
- Discuss Practical Applications: Expand the discussion to highlight potential applications of the dataset, such as its use in flash flood risk mitigation, water resource management, or climate-related decision-making in arid regions.
Citation: https://doi.org/10.5194/egusphere-2025-254-RC2
- AC2: 'Reply on RC2', Xuetong Wang, 05 Jun 2025
Viewed
- HTML: 647
- PDF: 187
- XML: 21
- Total: 855
- BibTeX: 31
- EndNote: 49