GERBIL – Benchmarking Named Entity Recognition and Linking Consistently

Tracking #: 1577-2789

Michael Röder
Ricardo Usbeck
Axel-Cyrille Ngonga Ngomo

Responsible editor: 
Ruben Verborgh

Submission type: 
Full Paper
The ability to compare frameworks from the same domain is of central importance for their introduction into complex applications. In the domains of named entity recognition and entity linking, the large number of systems and their orthogonal evaluation w.r.t. measures and datasets has led to an unclear landscape pertaining to the abilities and weaknesses of the different frameworks. We present Gerbil, an improved platform for repeatable, storable and citable semantic annotation experiments, and how we extended it since its release. With Gerbil, we narrowed this evaluation gap by generating concise, archivable, human- and machine-readable experiments, analytics and diagnostics. The rationale behind our framework is to provide developers, end users and researchers with easy-to-use interfaces that allow for the agile, fine-grained and uniform evaluation of annotation tools on multiple datasets. By these means, we aim to ensure that both tool developers and end users can derive meaningful insights pertaining to the extension, integration and use of annotation applications. In particular, Gerbil provides comparable results to tool developers so as to allow them to easily discover the strengths and weaknesses of their implementations with respect to the state of the art. With the permanent experiment URIs provided by our framework, we ensure the reproducibility and archiving of evaluation results. Moreover, the framework generates data in machine-processable format, allowing for the efficient querying and post-processing of evaluation results. Additionally, the tool diagnostics provided by Gerbil allow deriving insights pertaining to the areas in which tools should be further refined, thus allowing developers to create an informed agenda for extensions and end users to detect the right tools for their purposes. Finally, we implemented additional types of experiments including entity typing.
Gerbil aims to become a focal point for the state of the art, driving the research agenda of the community by presenting comparable objective evaluation results. Furthermore, we tackle the central problem of the evaluation of entity linking, i.e., we answer the question of how an evaluation algorithm can compare two URIs to each other without being bound to a specific knowledge base. Our approach to this problem opens a way to address the deprecation of URIs of existing gold standards for named entity recognition and entity linking, a feature which is currently not supported by the state of the art. We derived the importance of this feature from usage and dataset requirements collected from the Gerbil user community, which has already carried out more than 24,000 single evaluations using our framework. Through the resulting updates, Gerbil now supports 8 tasks, 46 datasets and 20 systems.

Minor Revision

Solicited Reviews:
Review #1
Anonymous submitted on 03/Apr/2017
Minor Revision
Review Comment:

The present paper introduces recent improvements to the GERBIL benchmarking framework and provides an overview of it. Unifying annotation schemes and evaluation methods is a key ingredient in achieving comparable and reproducible experiments in many fields of science, and in Information Retrieval and the Semantic Web in particular.

Generally, the quality of the writing is good; however, some minor corrections are necessary, e.g. "extendedoutside" in Section 5 or the sometimes inconsistent typesetting of footnotes (e.g. footnote "36." vs. ".1"). In Section 4.3.1, I would like to ask the authors to double-check that the references are correct and meaningful. For example, for DBpedia Spotlight, the Mendes et al. (2011) paper is cited, but to the best of my knowledge no REST endpoint currently exists for the vector-space-based version described therein; all REST interfaces use the version of DBpedia Spotlight that is described in Daiber et al., 2013 (cf.

While the contribution of the paper is limited because most of the details of the framework have been introduced earlier, this paper is a good overview of the latest developments. I consider this a topic of high importance and one very relevant to the Semantic Web, and I therefore recommend accepting the paper with minor revisions.

Review #2
By Liang-Wei Chen submitted on 01/May/2017
Major Revision
Review Comment:

Writing: The level of writing is good. The paper clearly describes their approach and system.

Coverage and Reference: They clearly summarize existing state-of-the-art systems, but miss a few important references. For example, they do not mention coherence in the Illinois Wikifier (Chen and Roth 2013) and do not mention another key publication, "Design Challenges for Entity Linking" (Weld et al. 2015), which provides a very careful survey and comparison of Entity Linking (EL) systems. They are also missing a few of the key datasets people have used, such as the TAC KBP data sets.

Approach: They extend the BAT-framework by increasing the number of published annotators and the number of datasets. They also provide a new scheme for entity matching during evaluation.
However, they do not provide any quantitative evaluation of their system, apart from the number of task runs on their system. They do not compare their framework with others.
Some of their design principles and features are too vague despite Table 1. For instance, they cannot claim that they reach the goal of “easy integration of annotators” without any quantification or user feedback. It is also difficult to formally define their “diagnostics”.

Minor mistakes: Page 3, equation (3): The first two disjuncts, (s ≤ s’ ∧ e ≤ e’) and (s ≥ s’ ∧ e ≥ e’), should additionally contain the constraints (s’ ≤ e) and (s ≤ e’), respectively. Otherwise, they can match any annotation in the document with the same u.
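
To make the point about equation (3) concrete, here is a minimal Python sketch of the first two disjuncts, once as quoted above and once with the suggested extra constraints. The names Annotation, match_as_quoted and match_with_constraints are hypothetical and do not come from the paper or from GERBIL's code base. The remaining parts of equation (3) are not reproduced, and the sketch assumes that s and e are character offsets with s ≤ e.

```python
from typing import NamedTuple


class Annotation(NamedTuple):
    """Hypothetical annotation triple (s, e, u) as used in the comment above."""
    start: int  # s: start offset of the mention
    end: int    # e: end offset of the mention
    uri: str    # u: URI the mention is linked to


def match_as_quoted(a: Annotation, b: Annotation) -> bool:
    """First two disjuncts exactly as quoted, without the extra constraints."""
    return a.uri == b.uri and (
        (a.start <= b.start and a.end <= b.end)
        or (a.start >= b.start and a.end >= b.end)
    )


def match_with_constraints(a: Annotation, b: Annotation) -> bool:
    """Same disjuncts with the suggested constraints (s' <= e) and (s <= e')."""
    return a.uri == b.uri and (
        (a.start <= b.start and a.end <= b.end and b.start <= a.end)
        or (a.start >= b.start and a.end >= b.end and a.start <= b.end)
    )


# Two mentions of the same entity that are far apart in the document.
first = Annotation(0, 5, "http://example.org/E1")
far_away = Annotation(100, 105, "http://example.org/E1")
# A mention that really overlaps the first one.
overlapping = Annotation(3, 8, "http://example.org/E1")

print(match_as_quoted(first, far_away))            # unconstrained form accepts the pair
print(match_with_constraints(first, far_away))     # constrained form rejects it
print(match_with_constraints(first, overlapping))  # true overlaps still match
```

Under these assumptions, the unconstrained form accepts the pair of distant mentions because it only compares the ordering of the offsets, whereas the constrained form requires the two spans to actually intersect.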

Summary: Their paper is interesting from the perspective of evaluation platforms. They point out some of the important issues involved in evaluating named entity recognition and linking: “Entity Matching”, “Deprecated entities in datasets”, and “New tasks and Adapters”. Their proposed pipeline of the entity matching process provides a better evaluation setting that should be supplied by a platform rather than by individual researchers.
However, their paper is weak from the perspective of new research results. At the very least, they should do more experiments to evaluate their system so that they can support their claims about their design principles.

Review #3
Anonymous submitted on 02/May/2017
Minor Revision
Review Comment:

The paper presents an extension of previous work [reference 62]. The paper introduces GERBIL, a framework for experimenting with NER and NEL. It includes 20 annotation systems, 12 datasets and 6 experimental setups. Although the paper does not provide a new system or competitor, it contributes to the domain by providing a single location to test and compare many systems, which would make future comparisons easier and more reliable.
The paper needs to provide the following:
- A clear description of the contributions made in this version over the previous version presented in reference 62.
- More comparison between the provided systems in terms of processing time.