A machine learning approach to driver attribution of dissolved organic matter dynamics in two contrasting freshwater systems
Abstract. Predicting water quality variables in lakes is critical for effective ecosystem management under climatic and human pressures. Dissolved organic matter (DOM) serves as an energy source for aquatic ecosystems and plays a key role in their biogeochemical cycles. However, predicting DOM is challenging because of complex interactions among multiple potential drivers in the aquatic environment and the surrounding terrestrial landscape. This study establishes an open and scalable workflow to identify potential drivers and predict fluorescent DOM (fDOM) in the surface layer of lakes using supervised machine learning models: random forest, extreme gradient boosting, light gradient boosting, CatBoost, k-nearest neighbors, support vector regression, and a linear model. The workflow was validated in two contrasting systems: a natural lake in Ireland with a relatively undisturbed catchment, and a reservoir in Spain with a more human-influenced catchment. A total of 24 potential drivers were obtained from global reanalysis data and from process-based lake and river modelling. Partial dependence and SHapley Additive exPlanations (SHAP) analyses were conducted for the most influential drivers identified, with soil moisture, soil temperature, and Julian day being common to both study sites. The CatBoost model gave the best predictions (hold-out testing period, Irish site: Kling-Gupta efficiency (KGE) > 0.69, r² > 0.51; Spanish site: KGE > 0.66, r² > 0.54). When only drivers from globally accessible climate and soil reanalysis data were used, prediction capacity was maintained at both sites, indicating potential for scalability. Our findings highlight the complex interplay of environmental drivers and processes that govern DOM dynamics in lakes, and contribute to the modelling of carbon cycling in aquatic ecosystems.
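For readers unfamiliar with the skill scores quoted above, the following is a minimal sketch of how KGE and r² can be computed from paired observed and predicted fDOM values. It assumes the original Kling-Gupta formulation (correlation, variability ratio, and bias ratio combined as a Euclidean distance from the ideal point) and takes r² to be the squared Pearson correlation; the abstract does not state which variants were used, so both choices are assumptions. The function names `pearson_r`, `kge`, and `r_squared`, and the sample values, are hypothetical and for illustration only.

```python
import math
from statistics import fmean, pstdev


def pearson_r(obs, sim):
    """Pearson correlation between observed and simulated series."""
    mean_obs, mean_sim = fmean(obs), fmean(sim)
    cov = sum((o - mean_obs) * (s - mean_sim) for o, s in zip(obs, sim))
    var_obs = sum((o - mean_obs) ** 2 for o in obs)
    var_sim = sum((s - mean_sim) ** 2 for s in sim)
    return cov / math.sqrt(var_obs * var_sim)


def kge(obs, sim):
    """Kling-Gupta efficiency (assumed original formulation).

    r     : linear correlation between obs and sim
    alpha : ratio of standard deviations (sim / obs)
    beta  : ratio of means (sim / obs)
    A perfect prediction gives KGE = 1.
    """
    r = pearson_r(obs, sim)
    alpha = pstdev(sim) / pstdev(obs)
    beta = fmean(sim) / fmean(obs)
    return 1.0 - math.sqrt((r - 1) ** 2 + (alpha - 1) ** 2 + (beta - 1) ** 2)


def r_squared(obs, sim):
    """Squared Pearson correlation (one common meaning of r²; an assumption here)."""
    return pearson_r(obs, sim) ** 2


if __name__ == "__main__":
    # Made-up illustrative values, not data from the study.
    observed = [10.2, 11.5, 12.1, 13.8, 12.9, 11.0]
    predicted = [10.0, 11.9, 12.4, 13.1, 12.5, 11.6]
    print(f"KGE = {kge(observed, predicted):.3f}")
    print(f"r2  = {r_squared(observed, predicted):.3f}")
```

Because KGE penalises errors in correlation, variability, and mean bias separately, a model can reach a fairly high KGE with a more modest r², which is consistent with the pattern of scores reported above.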