The Thin Red Line - To What Extent Can Crowd Workers Emulate the Performance of Experts?

Tracking #: 1071-2283

Ujwal Gadiraju
Patrick Siehndel
Besnik Fetahu
Stefan Dietze
Thomas Krijnen
Jakob Beetz

Responsible editor: 
Guest Editors Human Computation and Crowdsourcing

Submission type: 
Full Paper
Abstract:
Crowdsourcing has seen increasing popularity for solving a variety of tasks. Specifically for tasks that address Web data interoperability, such as entity interlinking or schema mapping, crowdsourcing has been widely adopted in research and practice. In earlier work, we investigated the behavioral patterns exhibited by workers, specifically with respect to malicious activity and its detection. However, little research has been conducted so far into the capability of crowd workers to substitute expert-based judgements, or the extent to which worker performance is influenced by the task environment. In this work, we present results from a study assessing the impact of the audience (crowd workers versus experts) and of the specific task design and environment on worker performance. Our study was conducted on the tasks of link prediction and schema mapping, carried out by both crowd workers and predefined experts. We compare the performance of crowd workers and experts in environments that differ with respect to the level of contextual information provided about the task at hand. Our results show that it is feasible to attain high-quality results from crowd workers, comparable to those of experts, when adequate context is presented to the crowd. In addition, we present a detailed analysis of the errors that the crowd and experts tend to commit in the link prediction and schema mapping tasks. Finally, we present a novel method to assess the inherent difficulty of the link prediction task, and the impact of this task difficulty on the quality of the results.
Solicited Reviews:
Review #1
Anonymous submitted on 02/Jun/2015
Review Comment:

The authors investigate the use of crowdsourcing for the task of link prediction and schema mapping. In particular, they investigate qualitative differences between annotations made by trained experts as well as lay people in the crowd.

The paper is well written and clearly communicates goals and methods.

I strongly urge the authors to rephrase the manuscript's title. Currently it is much too general in scope given what was actually done. The manuscript exclusively focuses on a single annotation task type and makes no statement about the multitude of possible applications (e.g., classification, surveys, etc.). It would be much preferable to let the title reflect the manuscript's true scope rather than leading the reader to believe this were a general investigation of experts vs. crowd.

Similarly, I strongly disagree with the authors' claims of novelty. Over the years, we have seen many papers that investigate both the expert/crowd comparison as well as the influence of different interfaces. Here are some concrete examples:
-[5], Eickhoff 2013
-Alonso, Omar, and Stefano Mizzaro. 2009 "Can we get rid of TREC assessors? Using Mechanical Turk for relevance assessment."
-Kazai, Gabriella, et al. 2011 "Crowdsourcing for book search evaluation: impact of hit design on comparative system ranking."
-Kazai, Gabriella, et al. 2012 "The face of quality in crowdsourcing relevance labels: Demographics, personality and labeling accuracy."

As a consequence, none of the proposed research questions struck me as particularly novel:
-Zhang, Chen Jason, et al. 2013 "Reducing uncertainty of schema matching via crowdsourcing."
-Wang, Jingjing, et al. 2014 "Learning an accurate entity resolution model from crowdsourced labels."
-Sarasua et al. 2012 [20]
-Mortensen, Jonathan M. 2013 "Crowdsourcing ontology verification."
-Eickhoff, Carsten. 2014 "Crowd-powered experts: Helping surgeons interpret breast cancer images."
-[5], Eickhoff 2013
-Kazai, Gabriella, et al. 2011 "Crowdsourcing for book search evaluation: impact of hit design on comparative system ranking."
-Alonso, Omar, and Ricardo Baeza-Yates. 2011 "Design and implementation of relevance assessments using crowdsourcing."
-Kazai, Gabriella. 2011 "In search of quality in crowdsourcing for search engine evaluation."

As far as I can see, this manuscript only ever compares the performance of single crowd workers with that of single experts. In practice, however, we typically find that, while a single crowd worker's performance does not compete with that of a single expert, aggregate solutions across multiple crowd workers meet the performance of a single expert. In order to obtain realistic results, comparable to the state of the art in crowdsourcing, I urge the authors to consider and evaluate the use of aggregation schemes such as majority voting as well.

The payment of 0.1 cent per annotation is extremely low even by crowdsourcing standards. It would be interesting to discuss how long people spent on each judgement task (both experts and the crowd) given this pay rate. Previous work often finds time-on-task to be a good proxy for quality.

There is another, undocumented difference between Interfaces 1 & 2. Where in the first condition the annotator immediately gets the full range of choices presented as a radio group, in the second condition the interaction seems more tedious, requiring the annotator to open a drop-down menu. Was this a conscious choice? And if so, what was the intended/observed effect?

In the overview of the authors' own previous work in Section 2, it would be helpful to briefly describe the concrete task that was addressed in [8].

It is interesting that 'people' and 'society' show such fundamentally different difficulties. Are there any explanations for this rather counter-intuitive observation?

=== Presentation Remarks ===
-Erred-on and Erred-as could be labeled as 'false negatives' and 'false positives', respectively, in order to increase the intuitive interpretability.
-Significance of results should be included in tables and figures rather than only mentioning them in the text.
-Sec. 2: "pattern" -> "patterns"

In summary, this manuscript gives an interesting overview of observations relating to the use of crowdsourcing for the task of link prediction and schema matching. Unfortunately, the authors significantly overstate the novelty of their contribution as well as the generality of their findings (see title discussion). Additionally, a standard practice, the aggregation of individual crowd workers' results, seems to have been ignored, yielding unrealistic results. As a consequence, I cannot recommend acceptance of this manuscript in its current state.

Review #2
By Elena Simperl submitted on 09/Jul/2015
Review Comment:

The paper studies how well crowd workers perform in comparison to experts. This is a critical question in many scenarios, in particular when dealing with large amounts of data or complex workflows, which require many answers from the workers or iterative sequences of tasks to be carried out. It is also important when experts are available but constrained in the time they can dedicate.
The authors compare the performance of experts and crowd workers on a specific link assignment task in two environments (high- and low-context), which are presumed to play a role in the cost/accuracy trade-off. The experiments revealed that workers can perform reasonably well in task environments with adequate access to contextual information.

1. Originality
While the topic of the paper is important, it is not particularly new or novel. The Semantic Web tasks that are subject to the experiments have been studied in similar works in the past; perhaps not at the same level of detail, but nevertheless. Putting the background aside (see also comments below) and taking into account the weaknesses of the experimental setup and of the discussion, I don't believe the work is yet mature enough to be published in a journal.

2. Significance of the results
The results move us a step forward in understanding how to design task interfaces that improve worker performance. However, none of them is really surprising, and the discussion is limited to a summary of the results accompanied by a few unsubstantiated opinions of the authors. The discussion is not framed in a broader context (previous studies, SW research on link prediction and ontology alignment), and it is not obvious how the findings could advance the state of the art. A lot of space is dedicated to background work, which doesn't fit very well with the scope of the experiments. It is unclear how all of it applies to the current paper, which aims for a very specific contribution. The experiment design and the results have a few issues as well, as identified below. In particular, it is unclear what the difference between RQ1 and RQ2 is, and several important design choices and results (e.g., Tables 4 and 5, see below) need to be explained and discussed in more detail.

3. Quality of writing

-Story line: The abstract and introduction present a good plot. The background section introduces material which is hard to fit into the overall picture. It seems too comprehensive for the specificity of the experiments, which represent the main contribution of the paper. E.g., the details on demographics, compensation and malicious activities are already well established. My suggestion would be to shorten that part significantly. The remaining sections follow well, with more points detailed below.

- Tables and figures: The tables and figures are quite clear. The anomalies in the results in table 4 need to be explained (details below). I’m not sure about the purpose of Figure 11 (details below).

- Writing: The writing is clear and easy to follow. A few sections need to be explained more clearly e.g., predicates erred on and erred as (details below). There were quite a few typos (details below).

- Literature review: The authors show a good understanding of background work.

4. Detailed comments

- Research has also been done on task design environment
- I can't see any form of "link prediction" task ... link prediction or link assignment?
- Context provided/presented or context "potentially" available

- The section holding "the remainder of this paper" needs to be revised

- enorporating -> incorporating

Objectives and Methodology
- What is the difference between RQ1 and RQ2? Nowhere in the paper is the difference made explicit, nor does any result apply specifically to one or the other
- How do you define "reliably", and what constitutes a "significant" performance gap?
- RQ3 enviroment -> environment
- There is a discrepancy in the way the "high context" is represented. In the text, it is said that the setup varies in the level of context "presented" (as in displayed). However Section 3.2 seems to define it as "access" (via a URI lookup).
- Environment 1 was suitable, Environment 2 was conducive; please elaborate.
- The description of Environment 2 could be improved.

Experiment setup
- Was it a between-subjects or within-subjects experiment?
- It is not stated whether the task defines what the predicates mean, especially how to differentiate between same and equivalent. Examples are said to be given, however, without a definition, it would be hard to generalise.
- URI paris -> URI pairs
- Could the payment have had an effect? 2 cents for 20 pieces of a cognitive task is rather on the lower end.

- How was the average accuracy computed? Based on majority voting or computed on all responders?
- Was the inter-annotator agreement calculated on the accurate scores only? Were there cases of unanimous agreement against the gold standard?
- The definitions of predicates erred on and erred as could be made much clearer. An example would also help e.g. in illustrating why some values are zero.
- Lots of owl:samAs instead of owl:sameAs
- relionship -> relationship
- Given the high confusion matrix scores between sameAs and equivalentClass, could the task definition and instructions be said to have a high impact?
- The probable reasons for the difference in the results in Table 4 were not well stated. Given the lack of context, the more familiar/popular URIs (characterised by more links pointing to them and by more text) should have been easier to annotate correctly.
- Why did Table 5 show a lower depth for the false entities when compared to the largest difference observed in Table 4?
- What's the purpose of Figure 11? What's the significance of the selected topics?

Review #3
Anonymous submitted on 27/Sep/2015
Review Comment:

The paper concerns how crowd workers can ‘emulate’ expert performance.
The main issue with the current shape and setup of the paper is that it feels as if it has fallen in between two different ambitions. Part of the presentation and argumentation goes into a wide coverage of the subject, with general challenges and general lessons to be learned. Another part, especially the more concrete sections of the paper, goes into a specific example of the subject with a rather detailed consideration and contribution.
It is strongly recommended that the authors ensure that the presentation and claims they make are in line with the concrete results and approach they deliver in the main sections in the text. [For now, it seems that the more concrete parts of the work point towards a more focused and modest ambition; on the other hand, that does require convincing the audience about significance and innovation.]

Introduction – Here, it would be effective to sharpen the specification of the contributions towards the more concrete work considered in this research. The current presentation of the contributions is too generic [for the actual results that are later on delivered]. Depending on what the authors choose as ambition [see above], it is also recommended to explain how the link prediction and schema mapping tasks are in fact good targets of research, either by being representative (!) examples of a more generic subject or by explaining their outright relevance and complexity.
Background – Here, the purpose of presenting the previous work needs to be explained and connected to the main story of this paper.
Objectives – This first ‘real’ section is compact and could benefit from more argumentation: now it is mostly ‘what is done’ and for the bigger story of convincing the readers of the more general lessons learned it would be good to also know ‘why these are the right things to do’. Ideally in this process, the aspects of the link prediction/schema mapping problem and its crowdsourcing solution are as independent as possible.
Setup – What is missing here is a good separation of the specific properties from the link prediction problem at hand from the properties of a chosen crowdsourcing approach to that problem. In order to judge that approach we would need to assure that we judge [mostly] the choices made in constructing that crowdsourcing approach. Why is the same task format and design used for crowd and experts?
Results – The presentation of results should ideally follow a detailed account of the results to be measured, to warrant that what is shown is ‘what we were looking for, not what we happened to see’. This also helps to zoom in on ‘results’ that relate to the crowdsourcing setup, rather than to the inherent properties of the link prediction task.
Impact – Following up on the previous remark, this section makes a more convincing opening, as it [implicitly] goes into properties that affect the crowdsourcing design. Again, though, it is recommended to make sure that the lessons learned fit the ambition chosen for the whole paper. Now, it feels that the lessons learned mostly relate to the specific problem considered, which affects the significance of the work.
Related work, Discussion – See the earlier remarks about the chosen ambition of this paper, and making it fit with what is concretely delivered. The RW section, for example, pursues the much more general ambition, while the Discussion section clearly pursues a much more restricted one.