Review Comment:
Performing live time-traversal queries on RDF datasets
Arcangelo Massari and Silvio Peroni
The paper focuses on querying datasets over time (time-traversal queries), i.e., taking into account that a dataset changes over time. A basic use-case is querying a dataset at a given time (e.g., the lectures given by a certain teacher at time ti), or querying across time (e.g., the people that have been both student and teacher for the same course).
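(To make this concrete, a single-version query of the first kind could look like the sketch below, assuming, purely for illustration, that each version of the dataset is exposed as a named graph; none of the names come from the paper.)

```sparql
# Hypothetical single-version query: lectures given by a given teacher at
# time ti, assuming each dataset version is exposed as a named graph.
# All names (ex:, version-at-ti) are illustrative, not from the paper.
PREFIX ex: <http://example.org/>

SELECT ?lecture
WHERE {
  GRAPH ex:version-at-ti {        # the snapshot of the dataset valid at ti
    ?lecture ex:givenBy ex:teacher1 .
  }
}
```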
Querying a dataset over time is a well-known topic in the Semantic Web, with a substantial state of the art.
The paper states that “all existing solutions need to preprocess and index RDF data to work efficiently”, and that this preprocessing is not tractable for datasets that change frequently. On the other hand, “live” systems (those that do not require pre-processing) only provide materialisation of states or deltas, without query-processing integration.
Well, I need more evidence (with complexities/numbers) that pre-processing/indexing makes “existing solutions” intractable in the presence of frequent changes.
Next, the paper proposes an approach based on the OpenCitations Data Model (OCDM) to combine data and histories (following a change-based policy), which allows single-version, cross-version, single-delta, and cross-delta structured query processing. The paper evaluates the performance of the proposal on an OCDM dataset with some queries.
Strong points:
* I agree that proposing a working system allowing all kinds of time-traversal queries is a valuable contribution.
* The “live” constraint for executing time-traversal queries is interesting.
* The system has been experimentally evaluated.
Weak points:
* Some inconsistencies in the structure of the paper (see the Table 5 remarks below).
* The claimed limitations of the state of the art need more evidence.
* The related work needs more precision.
* The proposal lacks definitions, algorithms, and complexity analyses.
* Not all kinds of queries are supported with the same performance; some non-selective queries may be intractable.
* The experiments have no baseline and use no benchmark.
Introduction: I think the Introduction is not very convincing. The limitations of the state of the art are described too quickly and too poorly, to the detriment of a long description of OCDM (which could be presented more precisely later in the paper).
Related Works:
I am quite surprised not to see examples of the different types of time-traversal queries described in [44] (just to set the context): [44] is cited, but I had to re-read [44] to really understand this paper.
How is the OCDM storage model positioned with respect to [31] and R&WBase? Is it similar?
The Semantic Web Journal published http://www.semantic-web-journal.net/content/towards-fully-fledged-archiv... [TFF20], but this publication is not cited (I am not an author of that paper).
Table 1 is interesting, but how is the “scalable” column decided? I do not see why RDF* is not scalable; can you give evidence that RDF* does not scale? Also, concerning the standard, I agree that RDF* is currently not a standard, but there is a W3C working group, a first draft has been published, and quite solid implementations are available. Is it possible to represent the OCDM model in RDF*?
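(As a sketch of what I have in mind, a statement and its validity interval could in principle be annotated in RDF*/Turtle* along the following lines; this is a hypothetical example, not the paper's model:)

```turtle
# Hypothetical RDF* (Turtle*) sketch: a citation statement annotated with
# the time interval during which it held. All names are illustrative only.
PREFIX prov: <http://www.w3.org/ns/prov#>
PREFIX xsd:  <http://www.w3.org/2001/XMLSchema#>
PREFIX ex:   <http://example.org/>

<< ex:article1 ex:cites ex:article2 >>
    prov:generatedAtTime   "2021-06-01T00:00:00Z"^^xsd:dateTime ;
    prov:invalidatedAtTime "2021-09-01T00:00:00Z"^^xsd:dateTime .
```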
I really do not understand why Table 5 is at the end of the paper and not in the related work: Table 5 is partly described in the text of Section 2.3, yet it only summarises everything at the end of the paper.
As written, at the end of the related work we do not really know what scientific problem is to be solved. A strong restructuring of the paper is needed so that the reader leaves the related work with a crystal-clear problem to solve.
Section 3 describes the proposal.
The model is not easy to understand. Many drawings are provided (Fig. 3), but in the end the reader has to study Listing 1 to see where the first “hasUpdateQuery” is actually located, so the examples are quite difficult to follow. For instance, it takes time to understand that se1/se2 contain the history of the entity. (I think there is a typo on p. 10, line 38: “10.1111/j.1365 2648.2012.06023.x” instead of “10.1111/j.1365-2648.2012.06023.x”.)
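(For readers of this review, here is my reconstruction of what Listing 1 seems to express; a simplified sketch based on my reading, so the exact IRIs and the oco: prefix may differ from the paper's:)

```turtle
# My reading of the OCDM provenance layer (simplified): each snapshot se/N
# of an entity records its generation/invalidation times and the SPARQL
# update that transforms the previous snapshot into it (change-based policy).
PREFIX prov: <http://www.w3.org/ns/prov#>
PREFIX oco:  <https://w3id.org/oc/ontology/>
PREFIX xsd:  <http://www.w3.org/2001/XMLSchema#>

<https://w3id.org/oc/meta/br/1/prov/se/2> a prov:Entity ;
    prov:specializationOf <https://w3id.org/oc/meta/br/1> ;
    prov:wasDerivedFrom   <https://w3id.org/oc/meta/br/1/prov/se/1> ;
    prov:generatedAtTime  "2021-09-01T00:00:00Z"^^xsd:dateTime ;
    oco:hasUpdateQuery    "DELETE DATA { ... }; INSERT DATA { ... }" .
```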
Section 3.1 presents how Versions and Deltas can be materialised from the OCDM model. I am not a big fan of flow charts for describing how to do things: they lack definitions, they lack precision, and in the end it is nearly impossible to estimate the complexity of the operations. For these operations, is the proposal better than its competitors (PromptDiff, SemVersion, R&WBase) at the algorithmic level?
Section 3.2 makes the lack of definitions/algorithms even more problematic. What precisely is an “isolated triple pattern”? What is an “explicit solvable variable”? As definitions are not given, it takes a lot of time (reading and re-reading) to really understand how the approach works. The same remark about flow charts also applies here.
I understood that the main issue for this approach is to materialise only the required entities, at the right point in time, during query evaluation. Of course the approach works well if there is a root entity to start from; if not, the number of entities to materialise can, in the worst case, be the whole dataset. Such complexity bounds are important for understanding the limitations of the approach, and their absence is a serious drawback of the paper.
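(For instance, a cross-version query in which no term is bound, like the sketch below, offers no root entity to start from, and would seemingly force materialising the whole dataset at every snapshot:)

```sparql
# Hypothetical worst case: no term is bound, so there is no selective entry
# point; every entity in the dataset may have to be materialised at every
# snapshot before the pattern can be evaluated.
SELECT ?s ?o
WHERE {
  ?s ?p ?o .    # fully unbound triple pattern, evaluated across versions
}
```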
Again, on the algorithmic side, how does this approach compare with v-RDFCSA in terms of complexity? It seems to me that the proposed approach can be catastrophic for non-selective queries.
For me, Section 3 lacks definitions, algorithms, and complexity analysis. These are required to understand the limitations of the approach compared to baselines. As stated by the authors, the approach should work well when at least a subject is a literal, leading to a selective query.
Section 4 describes the implementation. It reads like a user manual and is quite boring to read. We understand that the cache is important for the library, but, at the very least, a simple formalisation is needed to understand how the cache works.
Section 5 describes the evaluation of the proposal. There are several problems with this evaluation. First, we do not know which questions the evaluation is supposed to answer: what is expected from these experiments?
Second, there is no baseline. According to Table 5, for the materialisation of all versions/a single version, PromptDiff, SemVersion, and R&WBase are competitors, so is the proposal faster or slower than they are? And even if OSTRICH and v-RDFCSA do not support the “live” setting, it was still possible to compare the performance of both approaches in that setting.
Third, there is no real benchmark. How can I know whether the proposed queries are challenging? [TFF20] cites several benchmarks, such as BEAR, and also conducts some experiments; how is your system positioned with respect to them?
For me, the results of the evaluation are not very useful as presented, because there is no baseline to compare with.
Section 6 presents a discussion.
Some of the limitations of the proposal are presented. Although the system is able to support all kinds of time-traversal queries, good performance is achieved only for a small class of queries, i.e., queries with a literal as subject. This significantly reduces the usefulness of the proposal.
I really do not understand why Table 5 appears only in Section 6. For me, it should be moved to the related work to present the positioning of the proposal against the state of the art.
Overall, the paper has some good elements (supporting all kinds of time-traversal queries), but, as it stands, it is closer to a resource paper than to a research paper. If the authors prefer to write a “resource paper”, a better description of potential use-cases is required.
On originality, the paper highlights the constraint of “live” time-traversal queries, i.e., querying without pre-processing, but this constraint needs to be better motivated.
On the significance of the results, the lack of complexity analysis when describing the algorithms, and the lack of baselines in the experiments, are really problematic.
On the quality of writing, the paper suffers from significant anomalies in its structure.