A machine learning approach to driver attribution of dissolved organic matter dynamics in two contrasting freshwater systems
Abstract. Predicting water quality variables in lakes is critical for effective ecosystem management under climatic and human pressures. Dissolved organic matter (DOM) serves as an energy source for aquatic ecosystems and plays a key role in their biogeochemical cycles. However, predicting DOM is challenging due to complex interactions between multiple potential drivers in the aquatic environment and its surrounding terrestrial landscape. This study establishes an open and scalable workflow to identify potential drivers and predict fluorescent DOM (fDOM) in the surface layer of lakes by exploring the use of supervised machine learning models, including random forest, extreme gradient boosting, light gradient boosting, CatBoost, k-nearest neighbors, support vector regression and a linear model. It was validated in two contrasting systems: one natural lake in Ireland with a relatively undisturbed catchment, and one reservoir in Spain with a more human-influenced catchment. A total of 24 potential drivers were obtained from global reanalysis data, and lake and river process-based modelling. Partial dependence and SHapley Additive exPlanations (SHAP) analyses were conducted for the most influential drivers identified, with soil moisture, soil temperature, and Julian day being common to both study sites. The best prediction was found when using the CatBoost model (during hold-out testing period, Irish site: KGE > 0.69, r² > 0.51; Spanish site: KGE > 0.66, r² > 0.54). Interestingly, when only using drivers from globally accessible climate and soil reanalysis data, the prediction capacity was maintained at both sites, showcasing potential for scalability. Our findings highlight the complex interplay of environmental drivers and processes that govern DOM dynamics in lakes, and contribute to the modelling of carbon cycling in aquatic ecosystems.
Disclaimer: I have strong expertise in machine learning but my primary background is in oceanography, so I am less familiar with the specific impacts and conclusions relevant to lake ecosystems. Consequently, my review emphasizes the technical and methodological aspects of the manuscript and provides fewer comments on the ecological interpretation of the results.
The study by Mercado-Bettín et al. investigates how dissolved organic matter dynamics in two contrasting freshwater lakes can be modeled using a set of machine‑learning algorithms. By assembling a comprehensive set of physicochemical and meteorological predictors, the authors train and compare several regression techniques (including linear models, random forests, support vector regression and three gradient‑boosting frameworks) to predict fluorescent dissolved organic matter (fDOM) concentrations. Their analysis explores which environmental drivers most strongly influence fDOM variability and evaluates model performance across the two lake systems, aiming to demonstrate that data‑driven approaches can capture the temporal patterns of organic matter turnover in inland waters. Furthermore, they show that reducing the set of predictors only marginally decreases the prediction performance of the ML models.
General comments
The manuscript is clearly written and follows a logical progression, which makes the authors’ objectives easy to grasp. Nonetheless, several critical steps are lacking to support the central claim.
First, the manuscript does not show whether the target variable was inspected for distributional anomalies such as skewness, outliers, or zero‑inflation before model fitting. A simple histogram or density plot of fDOM (and a note on any transformation applied) would let readers judge whether the data were appropriately conditioned.
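To make this concrete, here is a minimal sketch of the kind of pre-modelling diagnostics I have in mind, using synthetic log-normal data in place of the (non-public) fDOM series:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the fDOM series; the log-normal shape mimics
# the right skew that is common in concentration data (illustrative only).
rng = np.random.default_rng(42)
fdom = pd.Series(rng.lognormal(mean=2.0, sigma=0.8, size=1000), name="fdom")

# Basic distributional diagnostics before any model fitting
iqr = fdom.quantile(0.75) - fdom.quantile(0.25)
print("skewness (raw):", round(fdom.skew(), 2))
print("zeros:", int((fdom == 0).sum()))
print("extreme outliers:", int((fdom > fdom.quantile(0.75) + 3 * iqr).sum()))

# A log transform often symmetrises right-skewed, strictly positive data
print("skewness (log):", round(np.log(fdom).skew(), 2))
```

Reporting these few numbers (plus a histogram) would suffice to show the data were appropriately conditioned.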
Second, the choice of six machine‑learning models, including three gradient‑boosting implementations, is left unexplained. Because these three models are largely interchangeable, the authors should either justify retaining each (for example, to compare computational efficiency or regularisation strategies) or reduce the set to a smaller, well‑motivated collection, explicitly outlining the strengths and weaknesses of each algorithm and stating whether a broad model comparison is a declared aim of the study.
Third, hyper‑parameter tuning is mentioned but not described. The manuscript should specify which parameters were tuned for each model, the search space explored, the optimisation strategy (grid, random, Bayesian, etc.) and the validation split used. Detailing this process is essential for assessing model robustness and guarding against over‑fitting. An early subsection that summarises key performance metrics of the different models would also ground the subsequent discussion of variable importance in a known predictive skill.
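For reference, the level of detail I would expect can be conveyed in a few lines; a sketch with synthetic data, an illustrative search space, and chronological validation folds:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV, TimeSeriesSplit

# Synthetic driver matrix and target, standing in for the real data
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
y = X @ rng.normal(size=5) + rng.normal(scale=0.5, size=500)

# Reporting the search space, optimisation strategy, and validation split
# makes the tuning reproducible; TimeSeriesSplit keeps folds chronological.
search = RandomizedSearchCV(
    RandomForestRegressor(random_state=0),
    param_distributions={
        "n_estimators": [100, 300, 500],
        "max_depth": [None, 5, 10],
        "min_samples_leaf": [1, 5, 10],
    },
    n_iter=5,
    cv=TimeSeriesSplit(n_splits=3),
    scoring="neg_root_mean_squared_error",
    random_state=0,
)
search.fit(X, y)
print(search.best_params_)
```

A table with one such specification per model would fully document the tuning.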
All figures suffer from overly small text; increasing the font size for axes, legends, and annotations is essential for readability. In Figures 3 and 4 the legends contain interpretative statements, which blurs the line between description and analysis: legends should merely describe the visual content, leaving interpretation to the main text.
Finally, the code is shared in a reasonably structured way, but two key improvements would greatly enhance its usability. First, assigning a clear execution order, either by numbering the scripts or by providing a master driver script, would allow anyone to run the workflow sequentially without guesswork. Second, replacing absolute paths such as “~/Documents/intoDBP/driver_attribution_fdom/” with relative paths or a configurable settings file would make the repository portable across different machines and operating systems. Implementing these changes would substantially increase the rigor and reproducibility of the study.
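On the second point, even a minimal configuration file would solve the portability issue; a stdlib-only sketch (file and key names here are illustrative, not taken from the repository):

```python
import json
from pathlib import Path

# Write a small settings file once (illustrative names, not the repo's)...
Path("config.json").write_text(json.dumps({"data_dir": "data", "output_dir": "results"}))

# ...then every script resolves its paths from it, relative to the repo
# root, instead of hard-coding an absolute "~/Documents/..." path.
cfg = json.loads(Path("config.json").read_text())
data_dir = Path(cfg["data_dir"])
data_dir.mkdir(exist_ok=True)
print(data_dir.resolve())
```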
Specific comments
Figure 1: the caption should define DOC; otherwise the figure nicely illustrates the workflow.
L21-23: please indicate the direction of variation for each process (increase or decrease).
L59: insert the word “respectively”.
L93-96: you mention a 2-minute measurement resolution but also give a number of points that roughly matches the number of days between the dates. Clarify whether the data were averaged daily before analysis.
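If the data were indeed averaged, stating so alongside the aggregation step would remove the ambiguity; e.g. a sketch with a synthetic 2-minute series:

```python
import numpy as np
import pandas as pd

# Synthetic 2-minute sensor series spanning four days (illustrative values)
idx = pd.date_range("2021-05-01", "2021-05-04 23:58", freq="2min")
raw = pd.Series(np.random.default_rng(1).normal(10, 2, len(idx)), index=idx)

# Daily averaging would explain roughly one point per day in the dataset
daily = raw.resample("D").mean()
print(len(raw), "->", len(daily))  # 2880 -> 4
```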
Figure 2: define the acronyms “GWLF” and “GLM” (they are only explained later, at lines 114‑116). Also define “NSE”; you introduce NSE as a model‑evaluation metric here but it does not appear elsewhere in the paper.
L126-127: see also Molnar (2025) for interpretable machine learning.
L128-135: the description of how the data were split into training and test sets, and how the models were evaluated, is unclear. Align this description with the "Prediction Workflow" paragraph and distinguish clearly between validation (hyperparameter optimisation) and testing (final performance).
L136: the term “statistical” is vague (all these models belong to supervised machine learning), and the choice of models needs justification. Explain why these particular algorithms were selected, why e.g. a neural network was not considered, and what the relative strengths and weaknesses of each method are. If three gradient‑boosting frameworks (XGBoost, LightGBM, CatBoost) are used, state the reason for testing all three rather than picking one.
L140-141: consider using permutation importance as an alternative that works across all models.
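A sketch of what I mean, using SVR (which has no built-in importance attribute) on synthetic data:

```python
import numpy as np
from sklearn.inspection import permutation_importance
from sklearn.svm import SVR

# Synthetic data where only the first feature carries most of the signal
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = 3 * X[:, 0] + 0.1 * X[:, 1] + rng.normal(scale=0.1, size=300)

# Permutation importance is model-agnostic: it works for any fitted
# estimator, including those with no native importance measure.
model = SVR().fit(X, y)
result = permutation_importance(model, X, y, n_repeats=10, random_state=0)
print(result.importances_mean)
```

Using one importance measure across all seven models would make Figure 3 directly comparable.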
L165: I would like to see a short “Model performance” subsection before the analysis of important drivers, so readers first see how well the models actually performed.
Figure 3: I’m confused by the statement about the “four ML models that directly provide feature importance: Random Forest (RF), eXtreme Gradient Boosting (XGBoost), Light Gradient Boosting (LGB), CatBoost (CTB)”. These models do not provide feature importance directly; it has to be computed (which you did, via node purity and gain contribution). By contrast, a linear model does provide feature importance directly, through its coefficients (see Molnar (2025)).
Table 1: RMSE should be reported with its units, since it shares the unit of the predicted variable. The fact that XGBoost yields very poor performance (R² = 11%) compared to the others (including a linear model with R² = 45%) suggests that its training could be improved substantially. Which hyperparameters did you choose for XGBoost? I dug into the code and found that only a few hundred trees (100-300) were used, which may be insufficient (although this also depends on the learning rate). Typically, boosting procedures require thousands of trees (see Hastie et al., 2009; Chapter 10). Moreover, the reported R² of 99% on the training data in Table A1 strongly suggests overfitting.
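To illustrate the trees/learning-rate interplay (using scikit-learn's GradientBoostingRegressor as a stand-in on synthetic data, not the authors' XGBoost configuration):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

# Synthetic nonlinear signal; with a small learning rate, a few hundred
# trees underfit, while a few thousand recover most of the signal.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = np.sin(3 * X[:, 0]) + X[:, 1] ** 2 + rng.normal(scale=0.1, size=1000)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

scores = {}
for n in (100, 2000):
    model = GradientBoostingRegressor(n_estimators=n, learning_rate=0.01,
                                      random_state=0).fit(X_tr, y_tr)
    scores[n] = model.score(X_te, y_te)  # held-out R²
    print(n, "trees: test R2 =", round(scores[n], 2))
```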
Figure 4: mixing a partial‑dependence plot (PDP) for the Random Forest with SHAP values for CatBoost makes the interpretation confusing. Choose one model and interpret it thoroughly. Do not embed interpretation inside the legend. For the SHAP plot, indicate how the variables are ordered (e.g. by mean absolute SHAP value). Also, the cosine‑scaled Julian day axis should be accompanied by a note translating the cosine values back to calendar seasons, because “cos(julian day) = ‑0.5” is not intuitive.
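On the last point, the back-translation is a one-liner; assuming the encoding is cos(2πd/365) (the manuscript does not state it explicitly), each cosine value maps to two calendar days:

```python
import numpy as np

# Assuming the scaling cos(2*pi*d/365); each value c has two solutions per year.
c = -0.5
d1 = 365 * np.arccos(c) / (2 * np.pi)  # first crossing, ~day 122 (early May)
d2 = 365 - d1                          # second crossing, ~day 243 (end of August)
print(round(d1), round(d2))  # 122 243
```

Annotating the axis with a few such calendar equivalents would make the seasonal pattern immediately readable.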
L198 - 208 & Figure 4: the text mentions an experiment using “the most influential drivers” and another using “a reduced subset of reanalysis‑based and easily accessible drivers.” In the figure the purple line is labelled “Testing with all drivers,” which appears to correspond to the reduced set described in the text. Either correct the label or clarify the distinction between the two experimental setups.
L249: discuss the implications of predicting fDOM instead of total DOC. Is the former a simpler target, and does that affect the ecological relevance of the results?
L273: good point, thank you for highlighting the limitation of dataset shift.
Figure A5: define all acronyms
References
Hastie, T., Tibshirani, R., and Friedman, J.: The elements of statistical learning: data mining, inference, and prediction, Springer Science & Business Media, 2009.
Molnar, C.: Interpretable Machine Learning: A Guide for Making Black Box Models Explainable, 3rd ed., christophm.github.io/interpretable-ml-book/, 2025.