KINDEX – Automatic Subject Indexing with Knowledge Graph Services

Tracking #: 2361-3574

Lisa Wenige
Claus Stadler
Simon Bin
Lorenz Bühmann
Kurt Junghanns
Michael Martin

Responsible editor: 
Special Issue Cultural Heritage 2019

Submission type: 
Full Paper
Abstract:
Automatic subject indexing has been a longstanding goal of digital curators, as it facilitates effective retrieval access to large collections of both online and offline information resources. Controlled vocabularies are often used for this purpose, as they standardise annotation practices and help users navigate online resources by following interlinked topical concepts. However, to date, the assignment of suitable text annotations from a controlled vocabulary is still largely done manually, or at most (semi-)automatically, even though effective machine learning tools are already in place. This is because existing procedures require a sufficient amount of training data and have to be adapted anew to each vocabulary, language and application domain. Against the background of tight budgets and a lack of IT personnel in cultural heritage as well as research infrastructure institutions, the adoption of automatic subject annotation tools is hindered, while manual assignment of index terms places an even greater burden on the available financial resources. In this paper, we argue that there is a third solution to subject indexing which harnesses cross-domain knowledge graphs (i.e., DBpedia and Wikidata) to facilitate cost-effective automatic descriptor assignments without algorithm tuning or training corpora. Our KINDEX approach fuses distributed knowledge graph information from different sources. Experimental evaluation shows that the approach achieves good accuracy scores by exploiting correspondence links of publicly available knowledge graphs.


Solicited Reviews:
Review #1
Anonymous submitted on 30/Jan/2020
Minor Revision
Review Comment:

This contribution describes a solution for automatic subject indexing that utilizes entity extraction together with a lookup service to find indexing terms in knowledge graphs. The system additionally makes use of lexical matching.

The topic is particularly relevant for the special issue and presents a realistic and somewhat novel solution to a well-recognized problem. The novelty lies mainly in a practical architecture and the use of semantic web data/graphs to find subject identifiers in the required KOS.

The solution is well explained, although the emphasis is on the identity method. The lexical method may be trivial, but it should still be explained in some more detail. The presentation of background and related work is acceptable, with a reasonable focus on theory given that the paper mostly targets practitioners.
Evaluation is performed by constructing a gold standard: experts select among terms proposed from the same sources and may also propose their own. Precision and recall are measured in the experiment and F1 is calculated. Although the raw numbers show reasonably good results compared to other research, a direct comparison is difficult because the data differ. The comparison with the naïve lexical method is more informative, as it shows the improvement of the method that is novel in this contribution. The authors should explain better why identity+lexical differs from lexical+identity, as there is little information about how the results of these methods are combined.
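To make this concern concrete, here is a minimal sketch of one plausible combination scheme (the `combine` helper and all example descriptor lists are hypothetical illustrations, not taken from the paper) showing why the order in which two ranked result lists are merged can change the outcome:

```python
def combine(primary, secondary, k=5):
    """Merge two ranked descriptor lists, preferring the primary method.

    Duplicates keep their first (higher-priority) position, and the merged
    list is truncated to the top k suggestions.
    """
    merged = []
    for term in primary + secondary:
        if term not in merged:
            merged.append(term)
    return merged[:k]

# Invented example suggestions from the two methods:
identity = ["economic policy", "monetary union"]
lexical = ["monetary union", "exchange rate"]

print(combine(identity, lexical))  # identity+lexical ordering
print(combine(lexical, identity))  # lexical+identity ordering
```

Under such a scheme the two orderings rank shared terms differently, and once the merged list is cut off at k suggestions they can even contain different term sets; the paper should state which (if any) such policy is used.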

The success of this approach is of course largely dependent on the quality of the data service and the knowledge bases in use. I miss a discussion of how well the named entity recognition of DBpedia Spotlight matches the indexing concepts in typical thesauri. This service is also used to extract the terms used in the lexical method, and the authors need to explain why the lexical method is restricted to these terms (and why another method is not used to extract relevant terms for the lexical method).

The paper is well structured and presented, with good use of figures and tables. I have only marked a few sentences that should perhaps be revised:

p1, line 42: mediate between information needs and data quantity – not clear what this means.
p2, line 40: an indexing tool ... would enable users – it is the subject descriptors that enables the users to find, not the indexing tool.
p2, line 46: the research direction is relevant, but it is the solution that is promising.
p3, line 8: identify «content» from text – I guess you mean extract information/knowledge from text, the word “content” is somewhat the same as text
p4, line 48: showcase «in the context of» use cases - is a strange phrase. Better to say that you showcase with two use cases
p10, line 20 & 39: figure number is wrong
p11, table 1+2: somewhat confusing to place wikidata and dbpedia in parenthesis after «lexical» since they are related to the identity method

The use of references shows a good overview, but some of the references are incomplete, e.g. 21, 25, 30. The naming of conference papers is also inconsistent: some entries use only the conference acronym, while others use the full proceedings title.

Review #2
Anonymous submitted on 02/Feb/2020
Review Comment:

This article proposes a fully automated approach, called KINDEX, that uses different knowledge graph services for subject indexing. The motivation behind this new approach is twofold: the difficulty of manually indexing large-scale collections of machine-readable information, and the burden of training new ML models for automatic indexing. The originality of the approach consists of using identity links of knowledge graphs, published on the web of data, together with lexical matching to improve the indexing process.

General review:

This work addresses an important research problem and proposes a "rather" ambitious approach. However, the latter suffers from two major weaknesses: (1) it depends on many parameters (especially sameAs links) and seems difficult to generalize, and (2) it lacks originality, as it only consists of reusing existing techniques. The unpredictable results of the two use cases confirm that applying the approach to other use cases is difficult. Besides, only general indications are provided regarding the comparison with existing ML approaches.

Detailed review:

The motivations and contributions of the approach are "generally" well presented. However, there are some gray areas that should be clarified.

Section 1
It is indicated that one of the benefits of the proposed system is that "it does not need to be trained for a particular controlled vocabulary". However, "the KOS system from which keywords are generate[d]" must be published as Linked Open Data. This is a strong condition: if the KOS is not available in the LOD, its publication is a complex, very time-consuming and costly task.

Section 2
- "The automatic strategy has proven to be feasible with mostly high quality assignments of index terms [10, 11]" Regarding the chosen classification of related work (ML and associative indexing): to which category does this work belong?
- "Empirical investigations into the potential and effect […] high quality data sources." This idea is repeated several times.

Section 4
- "the most suitable strategy is to first match the surface form as was determined by the spotlight index with the STW thesaurus' preferred or alternative labels" Which technique is used to perform this matching? Syntactic matching? Semantic matching?
- "FlexiFusion approach presented by Frey et al." reference is missing!
- "For instance, the mappings to the GND descriptors are accessible via the property , whereas mappings to VIAF can be determined from " How is the property to use determined in each case?
- "Thus the matching of descriptors is enabled through traversing identity paths" Traversing the web of data through sameAs links is a very complex problem. How is it decided when to stop the process?
- "Hence, link-enabled …" and "Afterward, the cross-concordance …": these two paragraphs definitely show that the approach is based on a complex and context-dependent matching process!
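To illustrate the stopping question concretely: a hop-limited traversal is one common way to keep the (potentially unbounded) sameAs closure tractable. The following minimal Python sketch uses a toy in-memory identity graph; all URIs, prefixes and mappings in it are invented for illustration and are not taken from the paper:

```python
from collections import deque

# Toy identity graph: node -> owl:sameAs / skos:exactMatch neighbours.
# All identifiers below are illustrative, not real mappings.
SAME_AS = {
    "dbpedia:Inflation": ["wikidata:Q35865"],
    "wikidata:Q35865": ["gnd:4026887-1", "dbpedia:Inflation"],
    "gnd:4026887-1": ["stw:18581-0"],
}

def reachable_descriptors(start, target_prefix, max_hops=2):
    """Breadth-first traversal of identity links, cut off after max_hops.

    The hop limit is the stopping criterion: without it, chains of
    sameAs links could be followed indefinitely across datasets.
    """
    seen, frontier, hits = {start}, deque([(start, 0)]), []
    while frontier:
        node, depth = frontier.popleft()
        if node.startswith(target_prefix):
            hits.append(node)
        if depth < max_hops:
            for nxt in SAME_AS.get(node, []):
                if nxt not in seen:
                    seen.add(nxt)
                    frontier.append((nxt, depth + 1))
    return hits

print(reachable_descriptors("dbpedia:Inflation", "stw:", max_hops=3))
```

Whether the authors use such a fixed hop bound, a fixed set of traversal properties, or something else entirely is exactly what the paper should spell out.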

Section 5
- "For the LIMBO catalogue, it also seems to be the case that more input data (title+desc) leads to better F1 scores" In Figure 3, it seems that (desc), and not (title+desc), leads to better F1 scores!?
- "Figures 5.2 and 5.2" maybe Figures 3 and 4?
- "Tabs. 5.2-2 show the final evaluation" problem with the reference!
- "Please note that these findings give only an indication, since the evaluations could not be run on the same sample." This is a real problem, as it is not possible to make a judgment based solely on indications!

The language of the article is acceptable, but it contains some typos such as:

two real world usage scenario => two real world usage scenarios
has to be learnt => has to be learned
growing amount of publications => growing number of publications
and can not be easily => and cannot be easily
have been rarely taken => have rarely been taken
there are cross-domain indexing tools available that can be => to reformulate
There might be cases in which there neither => to reformulate
is even higher then directly => is even higher than directly
the cross-concordance that exist between => the cross-concordance that exists between
can be obained by querying => can be obtained by querying
are then send to => are then sent to
a few preliminary parameter => a few preliminary parameters
This might be explained with the fact that => This might be explained by the fact that

Review #3
By Gayo Diallo submitted on 26/Feb/2020
Major Revision
Review Comment:

This manuscript was submitted as 'full paper' and should be reviewed along the usual dimensions for research contributions which include (1) originality, (2) significance of the results, and (3) quality of writing.

KINDEX – Automatic Subject Indexing with Knowledge Graph Services addresses the issue of automatic subject indexing.

The motivation of the paper is that, under tight constraints on available resources, manual subject indexing is not feasible in the context of the cultural heritage domain.
It leverages relevant keyword suggestions through the use of semantic links (mappings) from the underlying knowledge graphs.
The paper is overall well-structured and written, with a related work section highlighting some relevant papers on the topic. Two different use cases are introduced which showcase the KINDEX functionalities (a semi-automatic scenario and an ML-based strategy). KINDEX has been evaluated according to the two described use cases.

Although interesting, the topic is not new and has been under investigation for decades now. This is particularly the case in the digital library and biomedical domains. However, what is quite original is the combination of the indexing strategy with the integration of multiple available knowledge graphs to enable additional keyword suggestions.

Below are my main concerns, comments and questions.

1. Introduction
After classifying key-phrase extraction approaches into two main categories, the authors argue that these approaches are not widespread. I would recommend looking into, for instance, the BioASQ challenge and all MeSH-related biomedical text indexing. A "training set" in this case is the manual indexing provided by the US NLM indexers (for the Medline database).
Another noticeable initiative is the Semantic Scholar search engine (DOI:10.18653/v1/N18-3011).

2. Related work
The authors seem to limit free-text matching solely to n-gram based strategies, which is only one way to describe text features.
This section is worth expanding with more recent work based on deep learning, or with work specifically addressing semantics-based technologies in the cultural domain (e.g., MuseumFinland—Finnish museums on the semantic web). In addition, there are approaches that rely on Explicit Semantic Analysis (ESA) or Latent Semantic Indexing.

3. Description of the approach
KINDEX relies on the DBpedia Spotlight service, which uses various indexing repositories according to the language of the text. Does KINDEX involve a language recognition process to redirect the request to the suitable Spotlight index?
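For illustration, such routing could be as simple as a lookup table keyed by a detected language code. The endpoint URLs, the table, and the `detect_language` stub below are illustrative assumptions about what such a component might look like, not details taken from the paper:

```python
# Hypothetical mapping of ISO 639-1 language codes to per-language
# DBpedia Spotlight annotation endpoints (URLs are illustrative).
SPOTLIGHT_ENDPOINTS = {
    "en": "https://api.dbpedia-spotlight.org/en/annotate",
    "de": "https://api.dbpedia-spotlight.org/de/annotate",
}

def route_request(text, detect_language):
    """Pick the Spotlight index matching the detected language of the
    input text, falling back to English when no matching index exists."""
    lang = detect_language(text)
    return SPOTLIGHT_ENDPOINTS.get(lang, SPOTLIGHT_ENDPOINTS["en"])

# detect_language is assumed to be an external language identifier;
# here a trivial stub stands in for it:
print(route_request("Die Geldpolitik der EZB", lambda text: "de"))
```

Whether KINDEX performs such detection itself, assumes a known input language, or delegates this to the caller is worth stating explicitly.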

Section Experiments.
It would be interesting to evaluate KINDEX against other alternative approaches. The evaluation against the DBpedia Spotlight service could be improved, in my opinion.