TermitUp: Generation and Enrichment of Linked Terminologies

Tracking #: 2885-4099

Patricia Martin-Chozas
Karen Vázquez-Flores
Pablo Calleja
Elena Montiel-Ponsoda
Víctor Rodríguez-Doncel

Responsible editor: 
Guest Editors Advancements in Linguistics Linked Data 2021

Submission type: 
Tool/System Report
Domain-specific terminologies play a central role in many language technology solutions. Substantial manual effort is still involved in the creation of such resources, and many of them are published in proprietary formats that cannot be easily reused in other applications. Automatic term extraction tools help alleviate this cumbersome task. However, their results usually take the form of plain lists of terms or unstructured data with limited linguistic information. Initiatives such as the Linguistic Linked Open Data cloud (LLOD) foster the publication of language resources in open structured formats, specifically RDF, and their linking to other resources on the Web of Data. In order to leverage the wealth of linguistic data in the LLOD and speed up the creation of linked terminological resources, we propose TermitUp, a service that generates enriched domain-specific terminologies directly from corpora, and publishes them in open and structured formats. TermitUp is composed of five modules performing terminology extraction, terminology post-processing, terminology enrichment, term relation validation and RDF publication. As part of the pipeline implemented by this service, existing resources in the LLOD are linked with the resulting terminologies, contributing in this way to the population of the LLOD cloud. TermitUp has been used in the framework of European projects tackling different fields, such as the legal domain, with promising results. Different alternatives on how to model enriched terminologies are considered and good practices illustrated with examples are proposed.

Solicited Reviews:
Review #1
By John McCrae submitted on 15/Oct/2021
Review Comment:

The paper has improved since the previous version. There is a more extensive comparison of the effectiveness of this system with other terminology extraction systems. This makes it easier for reviewers to see why this may be chosen over other competing systems. I would still like some direct side-by-side comparison of the accuracy compared to some of these systems, but the fact that this is open-source (unlike most of its competitors) shows the value of this work.

Minor issues:
"morning start" p13. l5
Check that TermitUp is always written the same way (e.g. p14 l14)

Review #2
Anonymous submitted on 14/Nov/2021
Review Comment:

Overall, the updated version of the paper reinforces the high quality and relevance of the work being developed, as well as TermitUp’s potential impact among the linguistic data science community. This enhanced version of the article provides a more thorough description of the tool’s architecture and its underlying methodology, as well as of TermitUp’s contribution to EU-funded projects (e.g. Prêt-à-LLOD, Lynx, and, more recently, SmarTerp) and to other initiatives, such as the collaboration with the DFKI and the work on the Emofinder database. Moreover, the authors managed to further highlight TermitUp’s complementary role when compared with other existing tools currently supporting terminological work. The topic concerning when and how human intervention may occur across the system has also been clarified. In addition, information has been provided by the authors on recent work in the proposal of an Ontolex Terminology module, which can certainly contribute to TermitUp’s evolution while, in turn, also benefitting from the tool’s development.
The links to the various repositories (namely GitHub and Zenodo) have been added, and the provided data resources appear to be complete, thereby enabling replication of experiments. Finally, even though most of the spelling issues have been solved, there are still some minor revisions pending:
p.4|9L: the ‘being’ before ‘half’ has still not been eliminated
p. 8|33-34R: “*Prospective* experiments to this module *were* already published *at the* LREC 2020 conference”
p. 9|48R: “be more suitable than *the* other”
p. 10|29L: “are divided *into* two parts…”; 43L: “editing platform connected to *the* TermitUp triple store”; 39-43L: The beginning of the sentence “Much as Prêt-à-LLOD…” is odd; please consider rephrasing.
p. 11|41L: please eliminate one of the “presented”
p. 13|50R: “similar to *those of TermitUp*;”
p. 14|11-12L: “community can *improve, contribute to or adapt it* to their specific *use cases*”

Review #3
By Andon Tchechmedjiev submitted on 01/Dec/2021
Review Comment:

"(1) Quality, importance, and impact of the described tool or system (convincing evidence must be provided). "
This paper presents TermitUp, a generic architecture integrating multiple state-of-the-art tools with the purpose of providing a one-stop shop for all terminology extraction needs.
The tool has been developed following FAIR and open science principles, using standard LLOD and LOD formats, guided by a set of requirements based on observations in the state of the art,
but also on discussions with terminology experts.
The tool will be extremely useful to the community, as integrating its different components separately for each project would be incredibly time-consuming and ad hoc.
There are high-impact uses of the tool in H2020 and other collaborative projects, showcasing the potential of the tool.

"(2) Clarity, illustration, and readability of the describing paper, which shall convey to the reader both the capabilities and the limitations of the tool. Please also assess the data file provided by the authors under “Long-term stable URL for resources”."
The paper is well written and organized, and clearly tells the story of TermitUp in terms of its capabilities and shortcomings.
There are small language corrections to be made for the camera-ready version, highlighted at the end of this review.

The review of literature is complete, although I would have preferred to see at least a mention of SOTA efforts for terminology extraction around TermEval2020/ACTER and an explanation as to why such systems, although theoretically very accurate, would be very difficult to integrate into a production system.
There's a very interesting multilingual extraction system in the Findings of ACL 2021: https://aclanthology.org/2021.findings-acl.316.pdf
I mention this because it's interesting, but the positioning of the paper doesn't necessarily require going into this particular literature.
Perhaps the mention in the paper that there are ongoing efforts for multilingual terminology extraction actually refers to this.

Regarding the perspectives, I would love to see the future integration of a knowledge-graph aware association rule mining approach in addition to the extraction of hierarchical relations.
I am not asking for this to be mentioned in the paper; it is just an interesting thought.

"In particular, assess (A) whether the data file is well organized and in particular contains a README file which makes it easy for you to assess the data,
(B) whether the provided resources appear to be complete for replication of experiments, and if not, why,
(C) whether the chosen repository, if it is not GitHub, Figshare or Zenodo, is appropriate for long-term repository discoverability, and
(4) whether the provided data artifacts are complete. Please refer to the reviewer instructions and the FAQ for further information."

The GitHub repository is easily accessible and significant information is given in the README or on the main website of the tool.
The overall explanations are clear and the documentation of the API provided through Swagger is very helpful.
The main improvement directions would be 1/ adding more technical documentation that explains how to deploy the service, 2/ adding some docstring documentation for developers, and
3/ potentially refactoring the code base if there is a subsequent increase in complexity. I would personally favour a generic class structure with polymorphic genericity rather than reflexively loading code modules that all include the same functions, even though the latter tends to create less overhead.
As the tool is still a prototype under active development, those are not significant issues with regard to the publication of the paper.

**I recommend acceptance of the paper**; the corrections are mostly cosmetic and can be made for the camera-ready version.

***Detailed corrections***

Page 2 line 12 left: the part of the sentence about DBpedia is confusing; I don't understand what it means.

P2 l19 left: is to find -> is finding

P2 l25: different backgrounds and expertise levels to face language and related needs [...]

P2 l30 right: discussions that have arisen

P2 l37 right: This section covers

P2 l38 right: different processes mobilized in our system

P3 l42-43 right: Combining Wikipedia and other resources, BabelNet constitutes a multilingual [...]

P4 l9 left: domains, with half being closely related to [...]

P4 l10 left: Several scientific works are devoted

P4 l12 left: a SPARQL

P4 l28 left: can be of great help

P4 l6 right: corpus -> corpora

P4 l36 right: These can correspond to different [...]

P4 l43-45 right: The meaning of a unit is to be discovered in text and constructed through relations to other terminological units.

P5 l22 right: can significantly contribute to improving performance [...]

P5 l22 right: to developing

P5 l45 right: this translates to a necessity [...]

There are additional small corrections like these to be made; I can transmit feedback to the authors later as time permits, which is preferable to delaying the review submission.