Recognizing the Truth: Automatically Ranking LOD Query Results with a Cluster Heuristic

Tracking #: 606-1814

Hans-Jorg Neth
Arjon Buikstra
Lael Schooler
Annette ten Teije
Frank van Harmelen

Responsible editor: 
Claudia d'Amato

Submission type: 
Full Paper
The primary contribution of this paper is the development and testing of a cluster heuristic that efficiently ranks the quality of answers obtained after querying Linked Open Data~(LOD). The heuristic derives from Van Rijsbergen's 1979 Cluster Hypothesis: Correct answers tend to be more similar to each other than incorrect ones. Using simple similarity metrics based on Tversky's 1977 feature matching model, we show that the cluster heuristic's answer rankings agree remarkably well with the rankings of a human rater. An additional contribution of the paper is to shed some light on the quality of LOD. We found that on our benchmark set of questions, on average 70% of the answers retrieved from FactForge are correct, while on average 20% of the answers are clearly incorrect. However, we find great deviations from this average across our set of benchmark questions, with some questions scoring 100% correct answers, whereas others yield over 80% incorrect answers. The final contribution of this paper is the construction of a publicly available benchmark collection of 50 general knowledge questions formulated as SPARQL queries, accompanied by gold standard answers and over 2000 answers obtained by posing the queries to FactForge, a large LOD repository. All these answers from FactForge have been manually rated on their quality. This collection is freely available to other researchers as a benchmarking tool.
Major Revision

Solicited Reviews:
Review #1
By Achim Rettinger submitted on 02/Feb/2014
Major Revision
Review Comment:

The contribution focuses on the task of ranking answers contained in a result set returned after querying a knowledge base. This process becomes necessary because the quality of the data is insufficient (it contains incorrect facts) in many places. The main contribution is a measure that helps to determine the similarity of entities within the result set. The measure counts the number of shared property-value pairs between two entities. The final score of an entity is the sum of its similarities to the other entities in the result set. The output of the heuristic is a re-ranked version of the result set, constructed in accordance with each entity's score. A further contribution by the authors is a reference dataset which contains only correct answers. This enables other researchers who focus on the same problem to evaluate their systems. Another contribution is a discussion of the quality of LOD.
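The scoring scheme summarized above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes each answer entity is represented by its set of property-value pairs, and all names and data are hypothetical.

```python
def similarity(a, b):
    """Number of property-value pairs shared by two entities (raw overlap)."""
    return len(a & b)

def rank_answers(answers):
    """Rank answers (name -> set of (property, value) pairs) by the
    sum of their similarities to all other answers in the result set."""
    scores = {
        name: sum(similarity(pairs, other)
                  for other_name, other in answers.items()
                  if other_name != name)
        for name, pairs in answers.items()
    }
    return sorted(scores, key=scores.get, reverse=True)

# Toy result set: two mutually similar answers and one outlier.
answers = {
    "A": {("type", "Person"), ("country", "SE"), ("role", "drummer")},
    "B": {("type", "Person"), ("country", "SE"), ("role", "manager")},
    "C": {("type", "Place"), ("area", "large")},
}
print(rank_answers(answers))  # → ['A', 'B', 'C']: the outlier C ranks last
```

The quadratic cost criticized below is visible directly: every entity is compared against every other entity, giving O(n²) similarity computations for n answers.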

Currently I see some major issues with the contribution:

x The title of the paper states "Recognizing the Truth". After reading the paper, I was certainly impressed and thought about how you could mark parts of the result set with meta-information, or visually with some color, to indicate that a particular set of results is likely to be wrong. However, I do not think this problem is solved, since the measure only ranks the results but does not distinguish correct from incorrect answers. Thus, your title implies a classification task, but you use a ranking approach, evaluation measure, and evaluation methodology. So either the title and motivation need to be adjusted, or your approach, evaluation measure, and evaluation methodology do.

x If you settle for a classification task, it seems more appropriate to use something like "Sorting out incorrect results" instead of "Recognizing the Truth". Even if you mark false answers as wrong, the person/machine querying for the answer cannot rely on the answer being complete (it might simply leave out correct results). In my view, the "truth" also includes some notion of recall over the whole data set, not only precision over the returned results.

x As stated in your paper, the reference dataset is likely to be outdated already; thus, in combination with the previous issue, what you describe as a "final contribution" is rather marginal.

x I would like to see a discussion of the computational performance, as you have a large number of similarities to measure if the query states "name living people". The runtime of the heuristic is O(n²); thus, though simple, it does not provide a practical solution, because of runtime issues with larger result sets. I would like to see the "how to select an appropriate evaluation measure" part shrunk down to one page, and rather have a second page that benchmarks the heuristic. To support this point: your set of queries seems to have been selected so as to "provide result sets with less than 100 entities".

x FactForge is a proprietary product by Ontotext AD. It includes various datasets such as DBpedia as well as Freebase. However, different reasoning modules and rules are in place, and thus the facts contained in FactForge do not reflect the quality of LOD (potentially, the band membership of Stig Anderson is a fact that was derived through reasoning). This re-focuses your second contribution from "shed light on the quality of LOD" to "shed light on the quality of facts in FactForge" and might require another change to the title.

x I miss a comparison with other approaches. Even without checking the details, numerous potential graph-based similarity measures come to mind that could be applied. At least a baseline performance could help to get a feeling for the task and results. Right now, there is no real "lesson learned" concerning the ranking approach.

x In general, the paper is well structured and the presentation is very solid, maybe even too detailed. Considering the content, more concise writing seems appropriate.

Review #2
By Nicola Fanizzi submitted on 03/Feb/2014
Major Revision
Review Comment:

The main goal of the paper is proving a sort of Cluster Hypothesis for ranking the results of [simple] queries on LOD.
This conjecture does not seem to be fully empirically supported with incontrovertible [and stable] results.

Interestingly, this work also has some by-products: a simple method + ad hoc measures, and a benchmark to assess the quality of LOD.

The simple methods and measures are borrowed and adapted from the literature [e.g. from IR].
Some evident limitations, which have also been highlighted in the paper, should be addressed.
The minimal goal should be to achieve more stability in the results.

The paper is well written and organized.
Relevant literature on similarity and ranking from the machine learning field is lacking, although it could improve the discussion.

A lot of this work likely derives from [30] and [31].
I'd suggest explicitly highlighting the novel contribution of this paper.


There is a huge literature on this kind of problem in machine learning and, more specifically, in the ILP subfield.
This related work is scarcely cited and discussed.
Most ML methods can be regarded as characterized by the optimization of some objective function; when a closed form is not available, such problems are generally solved through some heuristic search for sub-optimal solutions.

It should be better justified why heuristics are borrowed from IR when the target representation of the data is structured.
In the scenario of the paper, the authors are dealing with structured data represented by means of technologies such as Linked Data, whose specificity does not seem to be addressed directly (e.g. the reasoning level).

The Ugly Duckling theorem [DHS] asserts that classification is impossible without some sort of bias. Similarity has many forms; the choice of a specific measure for a given concept biases the solution to an instance classification problem.
One way to take into account a notion of context for a query may be the one adopted in [dFEL], where the concepts of interest may be elicited from the query itself.

There are many similarity measures that depend on the context and can be considered general forms of the simple one used in the paper; e.g. see [BWH; dSF].
Also, kernel functions designed for knowledge representations spanning from RDF to OWL (e.g. see [BS; FdE; FdE2]) may be regarded as similarity measures and would deserve at least a discussion [if not a comparison].

The similarity measure in Def. 1 is quite simple and does not seem to require any form of reasoning [instance-checking] but only a lookup on the model graph. It also requires the Unique Names Assumption for counting.
An easy extension that may be taken into account is also adding the similarity of o_1 and o_2 (under measurement) as fillers of the same subject, say s, for a given predicate p.

As mentioned before, this measure is absolute, yet it is plausible to assume that individuals that may be required to be similar w.r.t. a certain query could be better treated as dissimilar w.r.t. a different query. Again, this is a typical feature selection problem that is worth discussing in the paper.

Why is the Estimated Quality of an answer defined in absolute terms instead of as a percentage?
The queries considered in Def. 2 are limited to those with one answer variable. Discuss this limitation.

Search space topology: the Cluster Hypothesis in Def. 3 depends on the sparseness of the data. It works for concept queries having answers in a densely populated region of the graph, but in the case of sparser data (scarcely populated regions) this assumption may lead to cumbersome situations.

The involvement of humans in the loop has to be more strongly justified: the translation of NL queries into a formal form may itself represent a difficult step (see Sect. 3.3).
Matching hits on the grounds of rdfs:label seems a bit at odds with the nature of the structural representation of Linked Data.
The applicability of the method in the case of limited or sparse labeling has to be taken into account.

The adopted DCG measure seems to be a better choice over other standard indices only if the results are ranked enumerations.
When results are mere set-element enumerations, there are many reasons errors may occur [an individual is not a member of the target set].
In the absence of an explicit additional criterion, all errors should be judged as equally wrong.
[So why "The ideal ordering would have had the answers Ola Brunkert and Stig Anderson swapped"?]

In Def. 4: is the value rDCG in the formula for nDCG different for each call? Are you hinting at some averaging over the possible random sequences?
The same question arises for rMD in Def. 5.

Linearly scaling the cluster heuristic scores makes a strong assumption that must be justified. [(Multidimensional) scaling is a form of clustering in itself.]
The assumption tested is that the answers to any query can fit a linear scale of 5 degrees.
Discerning correct/incorrect answers may in principle need only 2 clusters, but there may occur distributions of results that are naturally clustered into a larger number of groups.

The queries considered in the paper require instance-checking only.
SPARQL queries are more general.
Solutions to this limitation do not appear to be sufficiently discussed in the paper.

The variance in the results seems to be the main problem affecting the significance of the conclusions that can be drawn.
I am sure that resorting to more elaborate similarity measures / clustering methods [e.g. exploiting the data semantics] could improve the stability of the results.


- Tab. 1 caption: "asterix" --> "asterisk"?

- p. 5: "[31]" in 3.1 and 3.2: I'd avoid using citations of papers as subjects for sentences.

- Tab 3: the min and max bars appear a bit confusing
[you may use a different way to represent the std. dev. wrt the avg values for example]

- Fig.3 the proposed measures may behave quite differently;
e.g. consider queries 45 and 46 where quite different MD scores were observed w.r.t. an equal max nDCG score.


[dFEL] d'Amato, Fanizzi, Esposito & Lukasiewicz: Representing Uncertain Concepts in Rough Description Logics via Contextual Indiscernibility Relations. URSW (LNCS Vol.7123) 2013: 300-314

[DHS] Duda, Hart & Stork: Pattern Classification, 2nd Ed. 2000

[BWH] Borgida, Walsh & Hirsh: Towards Measuring Similarity in Description Logics. Proceedings of the International Workshop on Description Logics (DL2005), volume 147 of CEUR Workshop Proceedings, 2005.

[dSF] d'Amato, Staab, Fanizzi: On the Influence of Description Logics Ontologies on Conceptual Similarity. EKAW 2008: 48-63

[BS] Bloehdorn & Sure: Kernel Methods for Mining Instance Data in Ontologies. Proc. ISWC/ASWC 2007: 58-71

[FdE] Fanizzi, d'Amato & Esposito: Learning with Kernels in Description Logics. Proc. ILP 2008: 210-225.

[FdE2] Fanizzi, d'Amato, Esposito: Induction of robust classifiers for web ontologies through kernel machines. J. Web Sem. 11: 1-13 (2012)

Review #3
By Mariano Rico submitted on 07/Mar/2014
Minor Revision
Review Comment:

Although I am not an expert on IR and cannot provide a wider view on other ranking metrics, this paper is stimulating and well organized. I recommend a careful review to clean up typos and concordance errors.
Some examples:
- In the abstract, I would remove the reference "[50]".
- In the abstract, I would rephrase the second sentence of the second paragraph (the one that starts "We found that").
- In the abstract, third paragraph, "the construction a publically" --> "the construction OF a publically"
- In the Introduction, first sentence, I would add a reference to support that affirmation.
- In the Introduction, first paragraph, "anecdotal" --> "experimental"??.
- In the Introduction, look for "seven continents" --> "world". In Europe we have five (or six if you consider Antarctica); seven is a United States cultural fact.
- In the Introduction, look for "to each other then they", then --> than.
- In section 4.6, "This, it is necessary" --> "That is, it is necessary".
- In section 4.6, last paragraph, "the the" --> "the".
- In section 6, paragraph "A second important...", "etc" --> "etc."
- In section 6, paragraph "A second important...", last sentence, remove the parenthesis in "(These categories...".

Concerning methodology, you say in section 4.4 that users had no time limitations to answer. I consider that this adds a user dependency, as it is not the same to spend 5 minutes looking for the right answer as to spend 20 minutes.

In figure 2, you show min, max and average. I think it is important to also show the standard deviation (e.g. as an error bar centered on top of the average column). In each column, under the percentage value, I would also show the number of queries.

Review #4
By Rabeeh Abbasi submitted on 16/Mar/2014
Review Comment:

This paper proposes a method for ranking answers to LoD queries. The ranking algorithm is based on the cluster hypothesis, which says that similar answers tend to be more likely correct than dissimilar ones. The paper is well written and the flow of the paper is excellent.

The two main drawbacks of the paper are: 1) the justification for choosing a simple method and 2) the evaluation. As noted by the authors, the time complexity of the simple method is O(n²). As the proposed method is only compared with a random baseline, it is not justified whether this simple (and costly) method is a better choice. For example, how is the proposed method better than an even simpler method of ranking the subjects based on a (Jaccard- or cosine-based) similarity matrix? A comparison with an existing state-of-the-art method would address these drawbacks.
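The simpler baseline alluded to above could look roughly like this. This is a sketch under the assumption that answers are again represented as feature sets; it merely swaps the raw overlap count for Jaccard similarity, and all names are hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity of two feature sets: |a ∩ b| / |a ∪ b|."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def rank_by_jaccard(answers):
    """Rank answers (name -> feature set) by summed Jaccard similarity
    to all other answers in the result set."""
    scores = {
        name: sum(jaccard(feats, other)
                  for other_name, other in answers.items()
                  if other_name != name)
        for name, feats in answers.items()
    }
    return sorted(scores, key=scores.get, reverse=True)
```

Note that this baseline has the same O(n²) pairwise-comparison structure; its appeal is that the normalized measure is insensitive to entities simply having many properties, which the raw overlap count is not.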

In section 2.1, the intersection is defined on discrete features; either the features should be binary, or some operator other than intersection should be used.

In section 2.1, the function f(x) should be defined formally, particularly the one used in the paper.

The example of sharks and dolphins in second para of section 2.1 should be replaced by a more suitable example.

Some minor typos:
Page 2: "... to each other then they are..." -> "... than ..."
Page 2: "... all the answers is correct..." -> "... are ..."
Page 3: "... and concludes." -> "... and conclusions"