Hybrid Reasoning on OWL RL

Tracking #: 508-1707

Jacopo Urbani
Robert Piro
Frank van Harmelen
Henri Bal

Responsible editor: 
Pascal Hitzler

Submission type: 
Full Paper
act. Both materialization and backward-chaining as different modes of performing inference have complementary advantages and disadvantages. Materialization enables very efficient responses at query time, but at the cost of an expensive up front closure computation, which needs to be redone every time the knowledge base changes. Backward-chaining does not need such an expensive and change-sensitive pre-computation, and is therefore suitable for more frequently changing knowledge bases, but has to perform more computation at query time. Materialization has been studied extensively in the recent semantic web literature, and is now available in industrial-strength systems. In this work, we focus instead on backward-chaining, and we present a general hybrid algorithm to perform efficient backward-chaining reasoning on very large RDF data sets. To this end, we analyze the correctness of our algorithm by proving its completeness using the theory developed in deductive databases and we introduce a number of techniques that exploit the characteristics of our method to execute efficiently (most of) the OWL RL rules. These techniques reduce the computation and hence improve the response time by reducing the size of the generated proof tree and the number of duplicates produced in the derivation. We have implemented these techniques in an experimental prototype called QueryPIE and present an evaluation on both realistic and artificial data sets of a size that is between five and ten billion of triples. The evaluation was performed using one machine with commodity hardware and it shows that (i) with our approach the initial pre-computation takes only a few minutes against the hours (or even days) necessary for a full materialization and that (ii) the remaining overhead introduced by reasoning still allows atomic queries to be processed with an interactive response time. To the best of our knowledge our method is the first that demonstrates complex rule-based reasoning at query time over an input of several billion triples and it takes a step forward towards truly large-scale reasoning by showing that complex and large-scale OWL inference can be performed without an expensive distributed hardware architecture.
Full PDF Version: 
Previous Version: 


Solicited Reviews:
Click to Expand/Collapse
Review #1
By Matthias Knorr submitted on 10/Jul/2013
Review Comment:

I would like to thank the authors for this revision. In my opinion, all raised issues were corrected/improved or explained to clarify any doubts. There are a few minor issues listed below (in the order of appearance ) - and only a few of them go beyond typos or even only suggestions (which the authors may or not follow). I am confident that the authors manage to fix these easily, and that no further review would be required, i.e., the paper can be accepted.

minor issues/typos:

- 3,l: "to the union of their set-representation." - plural "-s" missing at the end
- 4,l: "More formally, for each rule…let…"- maybe better: let each rule r\in P be of the form …"?
- 5,r: "organiz" - "e" missing
- 6,l: "This tree represents all the… In Figure 1 we report an example of such a tree…" - since Fig.1 does not show the complete tree, maybe it would be better to call it "a part of such a tree"?
- 6,r: Fig. 1 has two directed edges only: I guess not every edge is intended to be directed, but please consider using a consistent pattern, e.g., more arcs should (?) appear going into the various atom vertices with T(…), right?
- 8,l: in Algorithm 1 caption: "where Mat stores results previous materialization rounds…" - I assume "of" is missing?
- 8,r: in Example 6 "(Variables: x,y,z)" - I suppose variable should be lower-case?
- 8,r: the term "blocked" is only introduced on page 10, but used here already
-14,l: "and and"
-14,r: "(*)" is used for the second time as indicator (first time page 10); it's ok but maybe consider using a different symbol since Prop. 3 just refers to (*)?
-17,l: How can pre-materialization affect 44 out of 49 rules if there are only 44 left according to the info three paragraphs above?
-17,r: Now it "affects a further 30 rules…"? Is it 30 out of the 44 (or 49) rules? Please clarify what the "further" is referring to.
-17,r: "we chose two major duplicate sources…" - maybe better "pinpointed" or "determined" or "detected" since these were already there, i.e., it is nothing you deliberately chose?
-20,l: "although after a complete materialization reasoning is no longer needed…" - substitute "although" by "while"?
-22,r: "proceede"
-23,l: "one and three order…" - "orders"
-23,r: "Virtuoso [8] supports … of a few (but not all) OWL rules" - First, I guess you mean OWL RL rules - and since your work does not support all OWL RL rules either, I think it would be nice to add just one short phrase which quantifies this a bit more.
-23,r: The same matter applies to Oracle [11]: from the short description, it seems as if that work is basically the same as the authors which is probably not true nor what the authors want to transmit, i.e., differences should be stressed briefly (for approaches that are rather close in spirit).
-24,l: Maybe, my two previous complaints are covered by "none of these approaches supports a similar…" - but since it appears in a separate paragraph, it leaves the impression that this restriction only applies to references within that paragraph.

- throughout the paper: please check missing commas throughout the paper, e.g., there are quite a few instances of missing "," in the case of "if… , then"; also consider trying to separate quantifications (for each/all or even just for some technical term - see, e.g., Prop. 2 itself) in the formal parts to clarify the scope

Review #2
By Aidan Hogan submitted on 11/Jul/2013
Review Comment:

The authors have done a good job with the revisions. They are also correct on the points I erroneously raised about Datalog (I thought Datalog included stratified negation and built-ins as "standard", but these are indeed not part of "pure Datalog") and the variable substitution having been defined for constants (which I missed).

I recommend an accept. Here I list a few final typos and other nit-picks:

* "is [a] ground atom"
* The typewriter fonts in Example 1 and Example 2 are different. Make consistent.
* All tables: I believe the SWJ template style says to put table captions above tables.
* Table 1: Symmetric not Simmetric / rdfs not rdsf
* organiz[e]
* "subclass and subpropert[y] inheritance rules"
* Lemma 3: Tmp' is still using an inverted comma symbol, not a prime symbol.
* "Hence, if 1--3 hold[], ..."
* Lemma 4 proof: "r_0, \ldots [,] r_n"
* Lemma 4 proof: "Q_0, \ldots [,] Q_n"
* Table 2 caption: "1 is an abbreviation for the URI ..." That's a typed literal, not a URI.
* "... leaving 44 rules." ... "This affects 44 rules out of 49" Do you consider 44 rules or 49? Otherwise, where does 49 come from?
* "The second type of duplicate generation[] comes from"
* "we report [on] a set of experiments"
* When right aligning numbers with mixed decimal places in the tables, it makes it tricky to compare them vertically. *Maybe* consider the siunitx LaTeX package to typeset them nicely? Or maybe append trailing zeros in a fixed-width font. (Not a big deal.)
* "had access to" -> "had to access"
* "The behavior" -> "This behavior"
* "the range of [a] few seconds"
* "proceed[] as follows"
* "knowledge bases [in the order] of [] ten billion triples"
* Mixing British English and US English in parts. E.g., "-ize" (US) vs. "whilst/amongst" (UK). (Not a big deal.)