Real-time Monitoring of Petroleum Hydrocarbons in Groundwater using Hybrid Machine Learning Architectures
Abstract. Monitoring petroleum hydrocarbon (PHC) plumes in groundwater is essential for managing oil contamination but is often hindered by high costs. We evaluated machine learning (ML) frameworks that estimate concentrations of benzene, ethylbenzene, and xylenes (BEX), using affordable, in situ water quality parameters (iWQPs) as inputs: pH, dissolved oxygen, electrical conductivity, and oxidation-reduction potential. Due to a scarcity of field data, we trained and tested models on high-resolution virtual data generated by a reactive transport model. We compared a long short-term memory (LSTM) network against classical algorithms (multiple linear regression, random forest, support vector regression, XGBoost) and an LSTM-XGBoost hybrid. Model performance depended on the underlying geochemical relationship between iWQPs and BEX. Accurate predictions (R² ≥ 0.80, MAPE < 2.3 %) were achieved when iWQPs were strongly correlated with BEX degradation (e.g., as a primary electron donor); the LSTM model yielded predictions within a 5 % error margin for 70 % of the test cases. Performance declined sharply (R² < 0) during periods where iWQPs were correlated with non-volatile dissolved organic carbon, another component of dissolved PHC. Incorporating hydraulic head data improved accuracy by informing the model of groundwater flow dynamics. While the LSTM model struggled to extrapolate beyond its training data (e.g., during extreme flow events), it reliably detected the direction of concentration trends, providing a valuable trigger for adaptive monitoring. We also demonstrated how a hybrid Kalman filter could successfully capture concentration trends after source removal through recursive updating. Our proposed ML framework provides BEX level estimation for improved groundwater monitoring.
General comments
This is good and robust research on contaminant transport in groundwater. However, the authors need to provide more detail before publication. Please see the specific comments below.
Specific comments
Lines 26-27. “Groundwater contamination by petroleum hydrocarbons (PHCs) remains an environmental challenge, particularly in areas affected by historical spills or leaks”. This general statement is not supported by references. Please cite general literature on the topic, for example:
- Agbotui, P. Y., Firouzbehi, F., Medici, G. 2025. Review of effective porosity in sandstone aquifers: insights for representation of contaminant transport. Sustainability, 17(14), 6469.
- Li, G., Huang, W., Lerner, D. N., Zhang, X. 2000. Enrichment of degrading microbes and bioremediation of petrochemical contaminants in polluted soil. Water Research, 34(15), 3845-3853.
Line 88. The aim of the research is clear, but the 3 to 4 specific objectives are not stated. Please list them using numerals (e.g., i, ii, and iii).
Line 90. If you use MODFLOW-2005, you need to provide much more detail on the boundary conditions (e.g., their type, location, and assigned values).
Line 93. Please also provide more detail on the boundary conditions for MT3DMS.
Line 220. Why only MAE? Please also report the Mean Error (ME, which shows bias) and the Root Mean Squared Error (RMSE, which penalizes large errors).
Lines 490-510. The conclusions contain 5 bullet points. The number of specific objectives (see the comment on Line 88) should therefore match.
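To illustrate the comment on Line 220, here is a minimal sketch of how the Mean Error, MAE, and Root Mean Squared Error could be computed side by side. The function name `error_metrics`, the sign convention (predicted minus observed), and the concentration values are assumptions made for illustration only; they are not taken from the manuscript.

```python
import math


def error_metrics(observed, predicted):
    """Return ME, MAE and RMSE for paired observed/predicted values.

    Assumed sign convention: residual = predicted - observed,
    so a positive ME indicates over-prediction on average.
    """
    if len(observed) != len(predicted) or len(observed) == 0:
        raise ValueError("inputs must be non-empty and of equal length")

    residuals = [p - o for o, p in zip(observed, predicted)]
    n = len(residuals)

    me = sum(residuals) / n                              # signed bias
    mae = sum(abs(r) for r in residuals) / n             # average error magnitude
    rmse = math.sqrt(sum(r * r for r in residuals) / n)  # weights large errors more

    return {"ME": me, "MAE": mae, "RMSE": rmse}


# Hypothetical concentrations (arbitrary units), for illustration only.
observed = [10.0, 20.0, 30.0, 40.0]
predicted = [12.0, 18.0, 33.0, 37.0]

metrics = error_metrics(observed, predicted)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

In this hypothetical case the over- and under-predictions cancel, so the ME is zero while the MAE and RMSE are not. This is why reporting the three metrics together is more informative than reporting MAE alone.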
Figures and tables
Please consider adding the flow field output of MODFLOW.
Please also add horizontal slices for the contaminant transport. At present, only a vertical slice is shown.
Figure 1. This is an important figure; please make it larger.
Figure 1. Please add a vertical scale in meters above sea level.
Figure 3. Unclear. Please increase the image resolution.
Figure 5. Please make the legend larger. It can be spread over 3 lines.