Desert Model Intercomparison Project benchmark framework version 1.0 for assessing land-surface dynamics and surface memory in monthly dust aerosol optical depth over North Africa
Abstract. Deserts are the main global source of atmospheric mineral dust, yet large uncertainties remain in simulating dust variability across space and time. Part of this uncertainty may reflect the limited representation of dynamic land-surface states and antecedent surface conditions in current dust-model formulations. Here, we develop an interpretable machine-learning assessment framework, the Desert Model Intercomparison Project (DesertMIP) climate–surface–memory benchmark framework version 1.0, to quantify the added value of land-surface dynamics and surface memory for monthly dust aerosol optical depth (AOD) over North Africa during 2003–2020. Three hierarchical configurations were tested: Climate-only, Climate+Surface, and Climate+Surface+Memory. Among the main explanatory models, Random Forest gave the best overall performance. Test-set skill increased from R² = 0.686 in the Climate-only case to 0.713 after adding surface variables and to 0.736 in the full Climate+Surface+Memory case, while the root-mean-square error (RMSE) declined from 0.058 to 0.056 and then to 0.053. The best model also gave a mean absolute error (MAE) of 0.041 and a Bias of 0.004. Residuals, defined here as observed minus predicted, were centered close to zero, with a mean of −0.004 and a standard deviation of 0.053, although residual spread increased at higher AOD values. Lag analysis showed a persistent memory-sensitive signal: the strongest negative associations occurred near a three-month lag for soil moisture (r ≈ −0.75) and vegetation density (leaf area index, LAI; r ≈ −0.82), whereas the highest mean cross-validation skill among the tested memory windows occurred at six months. Despite the gains in overall skill, severe dust outbreaks remained difficult to capture. For events above the 95th percentile, the full model gave a probability of detection (POD) of 0.423, a critical success index (CSI) of 0.269, a false alarm ratio (FAR) of 0.574, an extreme RMSE of 0.170, and an extreme Bias of −0.060. These results support a reproducible pilot benchmark structure as a step toward DesertMIP.
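The extreme-event scores reported in the abstract (POD, CSI, FAR, extreme RMSE, and extreme Bias) can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function name `extreme_event_scores` is hypothetical, and the sketch assumes that the 95th-percentile threshold is taken from the observed AOD values and applied to both observations and predictions, and that bias is computed as predicted minus observed, which matches the sign relation between the reported Bias and residual mean.

```python
import numpy as np


def extreme_event_scores(obs, pred, q=95.0):
    """Categorical and error scores for events above the q-th percentile.

    Assumptions (not stated in the source): the threshold is the q-th
    percentile of the observations, and bias is predicted minus observed.
    """
    obs = np.asarray(obs, dtype=float)
    pred = np.asarray(pred, dtype=float)

    threshold = np.percentile(obs, q)
    obs_event = obs > threshold    # observed extreme events
    pred_event = pred > threshold  # predicted extreme events

    # Contingency-table counts
    hits = int(np.sum(obs_event & pred_event))
    misses = int(np.sum(obs_event & ~pred_event))
    false_alarms = int(np.sum(~obs_event & pred_event))

    def ratio(numerator, denominator):
        # Return NaN instead of dividing by zero when a category is empty
        return numerator / denominator if denominator > 0 else float("nan")

    # Errors restricted to months with an observed extreme event
    err = pred[obs_event] - obs[obs_event]

    return {
        "threshold": float(threshold),
        "POD": ratio(hits, hits + misses),                # probability of detection
        "FAR": ratio(false_alarms, hits + false_alarms),  # false alarm ratio
        "CSI": ratio(hits, hits + misses + false_alarms), # critical success index
        "extreme_RMSE": float(np.sqrt(np.mean(err ** 2))) if err.size else float("nan"),
        "extreme_Bias": float(np.mean(err)) if err.size else float("nan"),
    }
```

Under these definitions the three categorical scores are linked by 1/CSI = 1/POD + 1/(1 − FAR) − 1 whenever there is at least one hit, and the reported values (POD = 0.423, FAR = 0.574, CSI = 0.269) satisfy this relation to rounding precision. A negative extreme Bias under this convention indicates that the model underpredicts AOD during observed extreme events.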