A Bayesian model for quantifying errors in citizen science data: Application to rainfall observations from Nepal
Abstract. High-quality citizen science data can be instrumental in advancing science toward new discoveries and a deeper understanding of under-observed phenomena. However, the error structure of citizen scientist (CS) data must be well-defined. Within a citizen science program, the errors in submitted observations vary, and their occurrence may depend on CS-specific characteristics. This study develops a graphical Bayesian inference model of error types in CS data. The model assumes that: (1) each CS observation is subject to a specific error type, each with its own bias and noise; and (2) an observation's error type depends on the error community of the CS, which in turn relates to characteristics of the CS submitting the observation. Given a set of CS observations and corresponding ground-truth values, the model can be calibrated for a specific application, yielding (i) the number of error types and error communities, (ii) the bias and noise of each error type, (iii) the error-type distribution of each error community, and (iv) the error community to which each CS belongs. The model, applied to Nepal CS rainfall observations, identifies five error types and sorts CSs into four model-inferred communities. In the case study, 73 % of CSs submitted data with errors in fewer than 5 % of their observations. The remaining CSs submitted data with unit, meniscus, unknown, and outlier errors. A CS’s assigned community, coupled with model-inferred error probabilities, can identify observations that require verification. With such a system, the onus of validating CS data is partially transferred from human effort to machine-learned algorithms.
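To make the generative assumptions in the abstract concrete, the sketch below simulates the model structure as described there: error types carry a bias and noise, error communities are categorical distributions over error types, and each CS belongs to a single community. All names, numeric values, and the additive and unit-error forms are illustrative assumptions for this sketch, not the authors' calibrated parameters or code.

```python
# Minimal generative sketch of the hierarchical error model described in the
# abstract. Parameter values, error-type names, and the additive/unit-error
# forms are placeholder assumptions, not results from the paper.
import numpy as np

rng = np.random.default_rng(0)

# Assumed error types: each has a bias (mm) and a noise scale (mm).
error_types = {
    "negligible": {"bias": 0.0,  "noise": 0.2},
    "unit":       {"bias": 0.0,  "noise": 0.2},   # handled multiplicatively below
    "meniscus":   {"bias": -1.0, "noise": 0.5},
    "outlier":    {"bias": 0.0,  "noise": 25.0},
}
type_names = list(error_types)  # order matches the probability lists below

# Assumed error communities: each is a distribution over the error types.
communities = {
    "careful":    [0.96, 0.01, 0.02, 0.01],
    "unit_prone": [0.70, 0.25, 0.03, 0.02],
    "careless":   [0.60, 0.05, 0.15, 0.20],
}
community_names = list(communities)

def simulate_observation(true_rainfall_mm, community, rng):
    """Draw one CS observation given the true rainfall and the CS's community."""
    etype = rng.choice(type_names, p=communities[community])
    params = error_types[etype]
    if etype == "unit":
        # Illustrative unit error: value reported in cm instead of mm.
        obs = true_rainfall_mm / 10.0 + rng.normal(0.0, params["noise"])
    else:
        obs = true_rainfall_mm + params["bias"] + rng.normal(0.0, params["noise"])
    return max(obs, 0.0), etype

# Each citizen scientist is assigned to exactly one (latent) community.
cs_community = rng.choice(community_names, size=50)

# Simulate a season of paired (ground truth, CS observation) records.
records = []
for cs_id, comm in enumerate(cs_community):
    for day in range(90):
        truth = rng.gamma(shape=0.6, scale=12.0)  # synthetic daily rainfall, mm
        obs, etype = simulate_observation(truth, comm, rng)
        records.append((cs_id, day, truth, obs, etype))

# Calibration (not shown) would invert this generative process with Bayesian
# inference, recovering the error-type parameters, the community mixtures, and
# each CS's community from the (truth, observation) pairs.
```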
Jessica A. Eisma et al.
Status: open (until 10 Jul 2023)
- RC1: 'Comment on egusphere-2023-658', Jonathan Paul, 22 May 2023
- I very much enjoyed reading this paper, which will be (subject to minor revision) a valuable addition to the relatively new field of error bracketing in citizen science datasets. Specifically, Eisma et al. investigate the ability of one machine learning technique to analyse errors in quantitative hydrological data, which goes well beyond more simplistic analyses typical of qualitative data in e.g. biological science. The identification of individuals / communities that are especially “error prone” is particularly useful.
- I felt as though the Introduction could be restructured slightly to focus immediately on citizen science (in hydrology), i.e. "it is popular and increasingly widespread because xxx, but there are many issues that impede its roll-out everywhere, including xxx (e.g. lack of trust, incentivisation, lack of continuing engagement, and demonstrated errors / imprecision relative to more traditional monitoring methods)". It seemed like a bit of a jump to couch the first sentence in terms of climate change.
- There is another slight jump in the narrative around line 49, where the text moves from citizen science background & contrasting techniques for (qualitative) error removal, to machine learning and GLMs. The paper is largely focused on elaboration of the models, but I suggest linking the two halves of the Introduction better at this point in the text. Perhaps it would be useful to include a few new lines / paragraph on error detection using machine learning in a broad sense (i.e. not restricted to citizen science datasets).
- The paragraph after your research questions could probably be excised as it reads like a summary / Conclusions of the study. Alternatively, you could list the basic structure of the article here (e.g. “In Section 2 we describe the design of our probabilistic model …”)
- You could address (in the Introduction) why rainfall data were chosen for the investigation (i.e. why not streamflow, or soil moisture, or temperature …? And why might the water cycle be a good place for citizen science datasets to be interrogated?) – I suspect this has more to do with data availability and access rather than anything more technical (e.g. representative error distributions), but I think it should be addressed. Coming into Section 2.1, rainfall data are mentioned for the first time since line 38, which feels like an afterthought.
- Related to previous comment: the beginning of Section 3 is focused on data/background and should come earlier, before the model development of Section 2 (and possibly in the Introduction – notably the passages on the study area). Do you need Section 3.2? I suggest simply referring readers to Davids et al. (2019) at the end of Section 3.1.
- The use of “communities” (e.g. line 96) might be slightly confusing to readers more attuned to hearing it in terms of community-led programs. You could insert a caveat / clarification here that “communities” will be used in a statistical sense.
- Paragraph starting on Line 110 – the allocation of citizen scientists to a static single community is a huge simplification (necessary for the modelling), but this should probably be spelt out more explicitly beforehand e.g. in the Abstract.
- Paragraph starting on Line 362 – this is really exciting and perhaps one of the most important outcomes of the research. As such, I think you should place it more in the foreground (perhaps earlier in the Discussion, as well as a sentence in the Abstract / Conclusions?). Tailoring error messages and ways of improving observations to separate communities would be a significant step-change in enhancing the quality of citizen science data, and therefore their uptake.
- Line 18 - "measures that sometimes save hundreds, if not thousands of lives" - could you add a ref here? Seems a bit vague
- Line 20 on institutional capacity - this is a good point but could use a citation
- Could you fix the CS acronym? Defined as CS in the Abstract but CSs in the main text
- Line 40 – on time and effort spent on QC varying widely – repeats the beginning of that paragraph (Line 33)
- Line 63 – “Error modelling has only been employed … in a limited manner” – could you include a citation here?
- Beginning of Section 3 – to my mind the real value of citizen science rainfall observations is to capture rainfall extremes that are missed by satellite estimates. You could mention that Nepal has a lot of these extremes, as well as dramatic spatial variations, due to the interaction of the Monsoon with topography.
- Line 195: “rain gauges are notoriously inconsistent” – could you elaborate – in what way?
- Line 216: “100 mm diameter clear plastic bottle” rather than “clear plastic bottle with a 100 mm diameter”
- Line 233: might be worth explicitly defining “meniscus errors”
- Line 337: I did not fully understand this part of the Discussion on slope outliers – they seem fairly insignificant in the statistical sense to me?