A Bayesian model for quantifying errors in citizen science data: Application to rainfall observations from Nepal
Abstract. High-quality citizen science data can be instrumental in advancing science toward new discoveries and a deeper understanding of under-observed phenomena. However, the error structure of citizen scientist (CS) data must be well-defined. Within a citizen science program, the errors in submitted observations vary, and their occurrence may depend on CS-specific characteristics. This study develops a graphical Bayesian inference model of error types in CS data. The model assumes that: (1) each CS observation is subject to a specific error type, each with its own bias and noise; and (2) an observation's error type depends on the error community of the CS, which in turn relates to characteristics of the CS submitting the observation. Given a set of CS observations and corresponding ground-truth values, the model can be calibrated for a specific application, yielding (i) the number of error types and error communities, (ii) the bias and noise of each error type, (iii) the error-type distribution of each error community, and (iv) the error community to which each CS belongs. The model, applied to Nepal CS rainfall observations, identifies five error types and sorts CSs into four model-inferred communities. In the case study, 73 % of CSs submitted data with errors in fewer than 5 % of their observations. The remaining CSs submitted data with unit, meniscus, unknown, and outlier errors. A CS’s assigned community, coupled with model-inferred error probabilities, can identify observations that require verification. With such a system, the onus of validating CS data is partially transferred from human effort to machine-learned algorithms.
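To make the generative assumptions in the abstract concrete, the sketch below simulates the model structure as described there: error types carry a bias and noise, error communities are categorical distributions over error types, and each CS belongs to a single community. All names, numeric values, and the additive and unit-error forms are illustrative assumptions for this sketch, not the authors' calibrated parameters or code.

```python
# Minimal generative sketch of the hierarchical error model described in the
# abstract. Parameter values, error-type names, and the additive/unit-error
# forms are placeholder assumptions, not results from the paper.
import numpy as np

rng = np.random.default_rng(0)

# Assumed error types: each has a bias (mm) and a noise scale (mm).
error_types = {
    "negligible": {"bias": 0.0,  "noise": 0.2},
    "unit":       {"bias": 0.0,  "noise": 0.2},   # handled multiplicatively below
    "meniscus":   {"bias": -1.0, "noise": 0.5},
    "outlier":    {"bias": 0.0,  "noise": 25.0},
}
type_names = list(error_types)  # order matches the probability lists below

# Assumed error communities: each is a distribution over the error types.
communities = {
    "careful":    [0.96, 0.01, 0.02, 0.01],
    "unit_prone": [0.70, 0.25, 0.03, 0.02],
    "careless":   [0.60, 0.05, 0.15, 0.20],
}
community_names = list(communities)

def simulate_observation(true_rainfall_mm, community, rng):
    """Draw one CS observation given the true rainfall and the CS's community."""
    etype = rng.choice(type_names, p=communities[community])
    params = error_types[etype]
    if etype == "unit":
        # Illustrative unit error: value reported in cm instead of mm.
        obs = true_rainfall_mm / 10.0 + rng.normal(0.0, params["noise"])
    else:
        obs = true_rainfall_mm + params["bias"] + rng.normal(0.0, params["noise"])
    return max(obs, 0.0), etype

# Each citizen scientist is assigned to exactly one (latent) community.
cs_community = rng.choice(community_names, size=50)

# Simulate a season of paired (ground truth, CS observation) records.
records = []
for cs_id, comm in enumerate(cs_community):
    for day in range(90):
        truth = rng.gamma(shape=0.6, scale=12.0)  # synthetic daily rainfall, mm
        obs, etype = simulate_observation(truth, comm, rng)
        records.append((cs_id, day, truth, obs, etype))

# Calibration (not shown) would invert this generative process with Bayesian
# inference, recovering the error-type parameters, the community mixtures, and
# each CS's community from the (truth, observation) pairs.
```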
Jessica A. Eisma et al.
Status: open (until 10 Jul 2023)
- RC1: 'Comment on egusphere-2023-658', Jonathan Paul, 22 May 2023
- I very much enjoyed reading this paper, which will be (subject to minor revision) a valuable addition to the relatively new field of error bracketing in citizen science datasets. Specifically, Eisma et al. investigate the ability of one machine learning technique to analyse errors in quantitative hydrological data, which goes well beyond more simplistic analyses typical of qualitative data in e.g. biological science. The identification of individuals / communities that are especially “error prone” is particularly useful.
- I felt as though the Introduction could be restructured slightly to focus immediately on citizen science (in hydrology), i.e. "it is popular and increasingly widespread because xxx, but there are many issues that impede its roll-out everywhere, including xxx (e.g. lack of trust, incentivisation, lack of continuing engagement, and demonstrated errors / imprecision relative to more traditional monitoring methods)". It seemed like a bit of a jump to couch the first sentence in terms of climate change.
- There is another slight jump in the narrative around line 49, where the text moves from citizen science background & contrasting techniques for (qualitative) error removal, to machine learning and GLMs. The paper is largely focused on elaboration of the models, but I suggest linking the two halves of the Introduction better at this point in the text. Perhaps it would be useful to include a few new lines / paragraph on error detection using machine learning in a broad sense (i.e. not restricted to citizen science datasets).
- The paragraph after your research questions could probably be excised as it reads like a summary / Conclusions of the study. Alternatively, you could list the basic structure of the article here (e.g. “In Section 2 we describe the design of our probabilistic model …”)
- You could address (in the Introduction) why rainfall data were chosen for the investigation (i.e. why not streamflow, or soil moisture, or temperature …? And why might the water cycle be a good place for citizen science datasets to be interrogated?) – I suspect this has more to do with data availability and access rather than anything more technical (e.g. representative error distributions), but I think it should be addressed. Coming into Section 2.1, rainfall data are mentioned for the first time since line 38, which feels like an afterthought.
- Related to previous comment: the beginning of Section 3 is focused on data/background and should come earlier, before the model development of Section 2 (and possibly in the Introduction – notably the passages on the study area). Do you need Section 3.2? I suggest simply referring readers to Davids et al. (2019) at the end of Section 3.1.
- The use of “communities” (e.g. line 96) might be slightly confusing to readers more attuned to hearing it in terms of community-led programs. You could insert a caveat / clarification here that “communities” will be used in a statistical sense.
- Paragraph starting on Line 110 – the allocation of citizen scientists to a static single community is a huge simplification (necessary for the modelling), but this should probably be spelt out more explicitly beforehand e.g. in the Abstract.
- Paragraph starting on Line 362 – this is really exciting and perhaps one of the most important outcomes of the research. As such, I think you should place it more in the foreground (perhaps earlier in the Discussion, as well as a sentence in the Abstract / Conclusions?). Tailoring error messages and ways of improving observations to separate communities would be a significant step-change in enhancing the quality of citizen science data, and therefore their uptake.
- Line 18 - "measures that sometimes save hundreds, if not thousands of lives" - could you add a ref here? Seems a bit vague
- Line 20 on institutional capacity - this is a good point but could use a citation
- Could you fix the CS acronym? Defined as CS in the Abstract but CSs in the main text
- Line 40 – on time and effort spent on QC varying widely – repeats the beginning of that paragraph (Line 33)
- Line 63 – “Error modelling has only been employed … in a limited manner” – could you include a citation here?
- Beginning of Section 3 – to my mind the real value of citizen science rainfall observations is to capture rainfall extremes that are missed by satellite estimates. You could mention that Nepal has a lot of these extremes, as well as dramatic spatial variations, due to the interaction of the Monsoon with topography.
- Line 195: “rain gauges are notoriously inconsistent” – could you elaborate – in what way?
- Line 216: “100 mm diameter clear plastic bottle” rather than “clear plastic bottle with a 100 mm diameter”
- Line 233: might be worth explicitly defining “meniscus errors”
- Line 337: I did not fully understand this part of the Discussion on slope outliers – they seem fairly insignificant in the statistical sense to me?