Empirical Methodology for Crowdsourcing Ground Truth

Tracking #: 1887-3100

Anca Dumitrache
Oana Inel
Benjamin Timmermans
Carlos Ortiz
Robert-Jan Sips
Lora Aroyo
Chris Welty

Responsible editor: 
Guest Editors Human Computation and Crowdsourcing

Submission type: 
Full Paper
Abstract:
The process of gathering ground truth data through human annotation is a major bottleneck in the use of information extraction methods for populating the Semantic Web. Crowdsourcing-based approaches are gaining popularity in the attempt to solve the issues related to volume of data and lack of annotators. Typically these practices use inter-annotator agreement as a measure of quality. However, in many domains, such as event detection, there is ambiguity in the data, as well as a multitude of perspectives on the information examples. We present an empirically derived methodology for efficiently gathering ground truth data in a diverse set of use cases covering a variety of domains and annotation tasks. Central to our approach is the use of CrowdTruth metrics that capture inter-annotator disagreement. We show that measuring disagreement is essential for acquiring a high-quality ground truth. We achieve this by comparing the quality of data aggregated with CrowdTruth metrics against majority vote, over a set of diverse crowdsourcing tasks: medical relation extraction, Twitter event identification, news event extraction and sound interpretation. We also show that an increased number of crowd workers leads to growth and stabilization in the quality of annotations, going against the usual practice of employing a small number of annotators.
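The contrast at the heart of the abstract can be illustrated with a minimal sketch (not the authors' implementation; the real CrowdTruth metrics additionally weight annotations by worker quality and media unit clarity). Majority vote keeps only the single most frequent label per unit, while a disagreement-aware score retains every label with sufficient support, so ambiguous units with several valid interpretations keep multiple labels. The labels, threshold, and example data below are illustrative assumptions.

```python
from collections import Counter

def majority_vote(annotations):
    """Keep only the single most frequent label for a unit (ties broken arbitrarily)."""
    label, _ = Counter(annotations).most_common(1)[0]
    return {label}

def disagreement_aware(annotations, threshold=0.3):
    """Keep every label whose support among workers clears a threshold,
    so units with several valid interpretations retain multiple labels."""
    counts = Counter(annotations)
    n = len(annotations)
    return {label for label, c in counts.items() if c / n >= threshold}

# Hypothetical example: ten workers annotate one ambiguous sentence that
# plausibly expresses both a 'cause' and a 'treat' medical relation.
workers = ["cause"] * 5 + ["treat"] * 4 + ["none"] * 1

print(majority_vote(workers))       # {'cause'} -- the 'treat' signal is lost
print(disagreement_aware(workers))  # {'cause', 'treat'}
```

Under this toy aggregation, majority vote discards the second relation entirely, which is the recall penalty the evaluation in the paper (and Review #2 below) turns on.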
Minor Revision

Solicited Reviews:
Review #1
By Gerhard Wohlgenannt submitted on 03/May/2018
Review Comment:

After a quick check of the new version, I appreciate the improvements to the paper
by the authors, and stick with my previous recommendation (accept).

One remaining typo, p14: "scores perform provide better ground truth"

Review #2
Anonymous submitted on 11/Jun/2018
Minor Revision
Review Comment:

I reviewed the new version of the paper and read the authors' response letter (as well as the other previous reviews). I was reviewer number 4 on the previous version.

The authors definitely improved the result discussion and made their claims clearer and fairer, addressing some of my comments; they added some more details and clarifications here and there; and they reframed their claims about the Semantic Web as claims about contributing to knowledge base curation.

However, the authors say that the novelty of this paper w.r.t. their previous publications lies in the application of CrowdTruth to open-ended tasks; yet they did not need to modify anything in their methodology for closed tasks to apply it to open-ended tasks, so I fail to see the novelty/originality.

Moreover, I still have major doubts about the proposed evaluation. All results are "biased" towards the cases with "multiple truths", because the F1 of majority voting is by design penalized by recall values lower than those of CrowdTruth. The choice of taking as ground truth what the authors call "trusted judgement" is, even with the addition of the appendix, still very questionable, and I disagree that it was the only way to run such an evaluation; on the contrary, I would be curious to see the results of comparing CrowdTruth to majority voting with pure expert judgement as ground truth.

Therefore, I suggest a minor revision in which the authors extend the paper with the results of comparing CrowdTruth to majority voting considering pure expert judgement as ground truth.

Review #3
By Maribel Acosta submitted on 19/Jun/2018
Review Comment:

In this version of the manuscript, the authors have addressed my concerns raised previously, except for clarifying the time when the microtasks were submitted to CrowdFlower. This information is still not reported in the paper and only available for the sound task in the raw data available at https://github.com/CrowdTruth/Cross-Task-Majority-Vote-Eval. Especially if the tasks were not crowdsourced at the same time, it would be fitting to report the timeframes when the microtasks were executed in order to provide more context about the task prices that were paid on the platform at that time.