A simple method for inducing class taxonomies in knowledge graphs

Tracking #: 2708-3922

Marcin Pietrasik
Marek Reformat

Responsible editor: 
Guest Editors ESWC 2020

Submission type: 
Full Paper

Abstract:
The rise of knowledge graphs as a medium for storing and organizing large amounts of data has spurred research interest in automated methods for reasoning with and extracting information from this form of data. One area which seems to receive less attention is that of inducing a class taxonomy from such graphs. Ontologies, which provide the axiomatic foundation on which knowledge graphs are built, are often governed by a set of class subsumption axioms. These class subsumptions form a class taxonomy which hierarchically organizes the type classes present in the knowledge graph. Manually creating and curating these class taxonomies oftentimes requires expert knowledge and is time-consuming, especially in large-scale knowledge graphs. Thus, methods capable of inducing the class taxonomy from the knowledge graph data automatically are an appealing solution to the problem. In this paper, we propose a simple method for inducing class taxonomies from knowledge graphs that is scalable to large datasets. Our method borrows ideas from tag hierarchy induction methods, relying on class frequencies and co-occurrences, such that it requires no information outside the knowledge graph's triple representation. Furthermore, we show that the induced hierarchy may be used as a foundation for hierarchical clustering of knowledge graph subjects. We demonstrate the use of our method on four real-world datasets and compare our results with existing tag hierarchy induction methods. We show that our proposed method outperforms existing tag hierarchy induction methods, although both perform well when applied to knowledge graphs.

Solicited Reviews:
Review #1
By Patrick Westphal submitted on 24/Mar/2021
Major Revision
Review Comment:

With their submission "A simple method for inducing class taxonomies in knowledge graphs" the authors apply methods from the field of tag taxonomy induction to the problem of constructing a class hierarchy from an RDF knowledge graph, given type assertions of resources. The main content was published already in the proceedings of the European Semantic Web Conference 2020. The submission is a copy of the ESWC publication with the following additions:

1. Introduction: Three sentences
2. Related work: One subsection (2.3.) with two paragraphs covering methods for hierarchical clustering
4. Approach: One paragraph + One subsection (4.2.) with two paragraphs and one algorithm description about the hierarchical clustering procedure
5. Evaluation: One paragraph introducing additional measures (Doc-F1, Tag-F1) + one subsection (5.1.4.) explaining the IIMB dataset which was not used for evaluation in the ESWC publication + one paragraph discussing results for Doc-F1 and Tag-F1 + one subsection presenting performance of the authors' approach w.r.t. Doc-F1 and Tag-F1 + one paragraph touching on the computational complexity of Doc-F1 and Tag-F1
6. Conclusions: Two sentences

In terms of overall content the additions amount roughly to 1.5 pages of the total 13 pages. (Just in case this is of interest -- I could not find any information/requirements for this ESWC submission call.)

The overall topic of inducing terminological knowledge from RDF data to me seems relevant, especially in case of crowdsourced/user generated data which usually mainly comprises assertional knowledge. In terms of the newly introduced content regarding the hierarchical clustering considerations of RDF resources based on their types and the induced class taxonomy, IMO the authors could not convincingly motivate why this was of interest and what applications of this might be. It would be good if they could further provide some discussion and intuition when and why it makes sense to distinguish between the induced class hierarchy and its underlying hierarchical clustering ``with strong inheritance properties'' by means of this class hierarchy. Regarding the newly introduced F1 scores, Doc-F1 and Tag-F1, I would also appreciate some more details on what they actually measure and why it makes sense to consider them in the evaluation of an induced class taxonomy. My intuition for Doc-F1, for example, would be that whenever resources are `tagged' with the most specific class and only one more general class in the hierarchy, this would always lead to imperfect Doc-F1 scores (given that the class hierarchy depth is greater than 2 and there are no empty classes that were pruned out). So this would imply that the Doc-F1 score highly depends on the input data and is not a good measure for the induced class taxonomy/hierarchical clustering. Especially in case this intuition is wrong I would like to ask the authors to provide a more in-depth discussion of the Doc-F1 and Tag-F1 scores.

The overall approach of using a resource's class as a tag and applying tag taxonomy induction methods is easy to understand and well described in the submission. However, since I did not follow the most recent developments in the field of hierarchical clustering and taxonomy induction on tags and folksonomies I cannot reliably judge the originality of the submission. This also concerns the overview of related work which did not cover more recent publications on the topic.

I want to express two notational concerns that I found confusing:

(P1) p.4, l.11, l.13, l.16: s, s_i, d_i

As far as I understand s, s_i, as well as d_i refer to the same thing, i.e. resources on the subject position of a triple that have a type assigned (via rdf:type). If I'm not mistaken, s and s_i are never used again in the submission and I think it would be less confusing if they are dropped from the problem description.

(P2) p.4, l.39: ``the subsumption axioms {dbo:Person --> dbo:Artist} and {dbo:Artist --> dbo:Painter} imply that dbo:Painter is a dbo:Artist and that dbo:Artist is a dbo:Person''

To me this notation of {superclass --> subclass} is counter intuitive, and should be the other way around, i.e. {subclass --> superclass}. From a logical perspective, reading this as an implication would make more sense then. Furthermore, graphical notations of RDF graphs or UML also use arrows pointing to the superclasses, not to the subclasses. Same holds for the notation on p.5, l.31.

In terms of the correctness of the submission I have a few concerns detailed below:

(C1) p.1, l.22: ``we propose a simple method for inducing class taxonomies from knowledge graphs that is scalable to large datasets''; p.2, l.36: ``we construct a novel approach to inducing class taxonomies which outperforms existing tag hierarchy induction methods both in terms [of] scalability and quality of induced taxonomies''

I don't see the scalability issue discussed in the submission. Experimental results are only given for the F1 score quality measures. Further on p.9, l.19, it is said that the approach of ``Heymann and Garcia-Molina was not able to terminate sufficiently fast enough for us to obtain results'' without providing any numbers or discussion apart from the authors' judgement of being `not sufficiently fast enough'.

(C2) p.2, l.11: ``automated methods are not able to induce class taxonomies of the quality necessary to reliably apply to complex knowledge bases''

This statement seems to be far too general to me and comes without any discussion of which methods are meant and why they fail to induce class taxonomies. There is also no literature reference that would back this assertion.

(C3) p.3, l.37: ``Each of these rules has the relationship of premise and consequence which the authors treat as that of class and subclass.''

As far as I understood the paper referenced by [16] premises refer to subclasses and consequences to the superclasses, so `class' and `subclass' should be swapped in l.37.

The evaluation performed is divided into two parts: One part examining the quality of the induced taxonomies based on the respective gold standard taxonomies, and one touching on the quality of the hierarchical clustering outcomes. Whereas the authors compare their solution with methods from other publications in the former part, they only provide numbers for their own solution in the latter. This raises the question of why this wasn't also done, at least for the other methods re-implemented by the authors. The source code and datasets to re-run the first part of the evaluation are provided on GitHub. However, the code is tied to the preprocessed datasets used in the evaluation and thus cannot be run on arbitrary RDF knowledge bases. I could not find source code for the second part of the evaluation.

Given the open points mentioned above I would like to ask the authors for feedback and to revise the respective parts of the submission if possible and considered meaningful.

Minor comments and typos:
p.2, l.36: ``both in terms scalability and quality'' --> `in terms of'
p.3, l.35: ``the Aprioir algorithm'' --> `Apriori algorithm'
p.3, l.47: ``that have a high frequencies'' --> `have a high frequency'/`have high frequencies'
p.4, l.3: ``A knowledge graph, K, is repository'' --> `is a repository'
p.5, eq.(1): D_b can be 0
p.6, alg.2: decay factor \alpha mentioned as input but not used in the algorithm
p.6, alg.2, l.5: ``for c_a \in T* do'' --> T* does not have clusters c_a as elements but subsumption axioms
p.6, l.45: ``tags from in the vocabulary'' --> `from the vocabulary'?
p.7, l.13: ``followed by a comparison our method'' --> `of our method'
p.7, l.23: There should be a comma between the footnotes 3 and 4
p.7, l.43ff (Just a note): Taking 'is-a' as the `type tag' isn't really in line with what was said in Sec.3. is-a rather translates to rdfs:subClassOf and it seems classes/concepts are clustered here instead of individuals having a `type tag'.
p.8, l.15: number formatting; a thousands delimiter was used in previous numbers
p.9, Table 1: Stdev. value for Heymann and Garcia-Molina/DBpedia differs from original ESWC publication (0.0149 vs. 0.0159)
p.12, l.25: O(|V^2|) vs. O(|V|^2)

Review #2
By Heiko Paulheim submitted on 25/Mar/2021
Major Revision
Review Comment:

The authors introduce an approach for creating class taxonomies from knowledge graphs. Their approach draws from approaches for inducing hierarchies for tags. The approach tackles an interesting problem, and the code and data are publicly available.

The main signal the approach exploits is the co-occurrence of two classes (denoted as tags in the approach). In a nutshell, the approach assumes that if many entities are typed with "person" and "actor", and "person" has a higher generality than "actor" (also computed based on class co-occurrence), then actor can be considered a subclass of person.
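The co-occurrence heuristic summarized above can be sketched roughly as follows. This is an illustrative reconstruction, not the authors' actual algorithm; the function name, threshold parameter, and toy data are all hypothetical:

```python
from collections import Counter
from itertools import combinations

def induce_subsumptions(entities, min_cooccurrence=2):
    """Sketch of a frequency/co-occurrence subsumption heuristic.

    `entities` maps each entity to the set of classes ("tags") it is
    typed with. A class with higher overall frequency is treated as
    more general; when two classes co-occur often enough, the less
    frequent one is taken to be a subclass of the more frequent one.
    """
    # Overall class frequencies act as a generality proxy.
    freq = Counter(tag for tags in entities.values() for tag in tags)

    # Count how often each unordered pair of classes co-occurs.
    cooc = Counter()
    for tags in entities.values():
        for a, b in combinations(sorted(tags), 2):
            cooc[(a, b)] += 1

    subsumptions = set()
    for (a, b), n in cooc.items():
        if n < min_cooccurrence:
            continue
        superclass, subclass = (a, b) if freq[a] >= freq[b] else (b, a)
        subsumptions.add((subclass, superclass))  # subclass -> superclass
    return subsumptions

entities = {
    "e1": {"person", "actor"},
    "e2": {"person", "actor"},
    "e3": {"person", "painter"},
    "e4": {"person", "painter"},
    "e5": {"person"},
}
print(induce_subsumptions(entities))
# -> {('actor', 'person'), ('painter', 'person')}
```

On this toy data, "person" co-occurs with both "actor" and "painter" and has the highest frequency, so both are placed below it, matching the person/actor example above.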

What is not done is any discussion on how far this is met by common knowledge graphs out there, and if yes, why. The latter, however, has strong implications on the validity of the evaluation.

The DBpedia dataset, for example, is *fully materialized* based on the ontology, i.e., if a type assertion (x, rdf:type, C) is in the knowledge graph, then (x, rdf:type, S) is also contained for each superclass S of C, and the full transitive closure of the superclass relation is also materialized. In other words: the learning target (the class hierarchy) is actually present in the dataset. This, by the way, is also the reason why the Völker and Niepert approach has near-perfect F1, while the Ristoski et al. approach does not - in the latter, that information has been removed from DBpedia to make the evaluation sound. Without materializing the ontology, DBpedia only assigns one class per instance, which would make the proposed approach unusable on DBpedia.
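The materialization described here can be illustrated with a small sketch. The function name and toy hierarchy are hypothetical; the point is only that every entity typed with a class ends up typed with all of its transitive superclasses, so the class hierarchy is recoverable from the type assertions alone:

```python
def materialize_types(direct_types, superclasses):
    """Expand type assertions under a class hierarchy: every entity
    typed with C is also typed with every transitive superclass of C."""

    def ancestors(c):
        # Collect all transitive superclasses of c via a simple DFS.
        seen, stack = set(), [c]
        while stack:
            for sup in superclasses.get(stack.pop(), []):
                if sup not in seen:
                    seen.add(sup)
                    stack.append(sup)
        return seen

    return {e: set(ts) | {a for t in ts for a in ancestors(t)}
            for e, ts in direct_types.items()}

hierarchy = {"Painter": ["Artist"], "Artist": ["Person"]}
types = {"daVinci": {"Painter"}}
print(materialize_types(types, hierarchy))
# -> {'daVinci': {'Painter', 'Artist', 'Person'}}
```

After materialization the entity carries its whole superclass chain, which is exactly why frequency-based induction can read the hierarchy back off the data.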

The same seems to hold true for the Life dataset (all superclasses are always present) and the other datasets used. A more realistic evaluation setup would be one where the taxonomy, which is the target to be learned, is not used to produce the annotations for the training set.

The comparison to the competing approaches on DBpedia is a bit wacky. Since there is not *the* DBpedia, but different versions, and different subsets of those versions are used in the different experiments, the comparison against values from the literature is close to comparing apples and oranges. I would strongly encourage the authors to rather re-run these approaches on their datasets. Moreover, a more thorough discussion of the results would be desirable. From the results, I read that the proposed approach works better than Schmitz on Life and DBpedia, and worse on WordNet and IIMB. What I would have expected here is some discussion of the characteristics of those datasets which might explain that difference.

Along the same lines, knowledge graphs have many interesting characteristics, e.g., incompleteness, noise, skewed class distributions - all of which might interfere with the approach proposed. I would have liked to see a more thorough discussion here, and some statements on situations in which the proposed approach is expected to work better or worse.

Generally, I found it hard to understand the description of the creation of the datasets. For example, the original IIMB dataset contains only one class per instance. How was the training set created from that? The description of the WordNet dataset mentions WordNet, DBpedia, and YAGO, but it is unclear how the different components play into the dataset creation.

While I appreciate the complexity analysis, I miss this being seconded by a quantitative scalability analysis. Is the approach actually usable for large-scale knowledge graphs? The authors use samples of 50k instances, which corresponds to less than 1% of DBpedia. It would be interesting to see some scaling experiments (they could also be done on smaller scale, e.g., running on 1%, 2%, 5%, 10%). Along the same lines, the authors claim that "Heymann and Garcia-Molina was not able to terminate sufficiently fast enough" - what is "sufficiently fast enough"? Some more details would be appreciated here.

Summarizing my observations above, the authors present an approach to a well-known problem (taxonomy induction on KGs), but the experiments lack depth of discussion and scientific rigour. I strongly recommend a reworking of the experiments before resubmission.

Review #3
Anonymous submitted on 29/Mar/2021
Review Comment:

At least 80% of this paper has already been published at ESWC2020. The authors did not even change the language of many sections in the paper. The only additional content is a hierarchical clustering and evaluations on a new dataset, which I do not think qualify as significant contributions.