Resource-Level Versioning in Administrative Geography RDF Data

Paper Title: Resource-Level Versioning in Administrative Geography RDF Data
Alex Lohfink and Duncan McPhee
The Resource Description Framework (RDF) is a base technology of the semantic web, providing a web infrastructure that links distributed resources with semantically meaningful relationships at the data level. An issue with RDF data is that resources and their properties are subject, as with all data, to evolution through change, and this can lead to linked resources and their properties being either removed or outdated. In this paper we describe a simple model to address such an issue in administrative geography RDF data using RDF containers. The model version-enables such data at the resource level without disturbing the inherent simplicity of the RDF graph, and this translates to the querying of data, which retrieves version-specific properties by matching simple graph-patterns. Further, both information and non-information resources can be versioned by utilizing the model, and its inherent simplicity means that it could be implemented easily by data consumers and publishers. The model is evaluated using some UK administrative geography RDF datasets.
Submission type: Full Paper
Responsible editor: Krzysztof Janowicz

Review 1 by Prateek Jain

The work presents an approach to resource-level versioning in RDF. This problem has been investigated by several other researchers, whom the authors cite. The authors identify the issues that plague these earlier approaches and propose their own approach to mitigate them. Their approach relies on the RDF Collection framework (specifically containers and bags) to provide versions of resources. Older versions of a resource are linked to newer versions through containers, and the Dublin Core vocabulary is used to assign a date to each version.

The problem investigated by the authors is definitely interesting and worth investigating. Surprisingly, it has not been explored much, which in my opinion could be for one of two reasons: (i) there is no significant uptake for the desired solution, or (ii) most solutions do not meet the challenges. I am not sure that the solution proposed by the authors meets the requirements of a viable solution. Any viable solution, in my humble opinion, should at least have the following characteristics:

(a) Work seamlessly across implementations of RDF stores and SPARQL processors.
(b) Be implemented at the level of the triple store, so that it does not require any additional effort or implementation from the developer or ontologist.
(c) Support RDF-based inferencing without any obstacles.

While the work presents an approach to solving this problem, it is not well written and does not explain the approach, its effects, or how other practitioners can use it to achieve those effects. The work suffers from the following major issues:

(a) It is very poorly written. Various sentences run for 4-5 lines without proper punctuation or line breaks. It is difficult to read such sentences and identify what the authors are conveying. For example: "However, no real solution currently exists to enable data…………….". The paper is full of sentences like this, which makes it extremely hard to read and to assess the potential of the work. If the work is accepted, I would suggest a major overhaul of the writing. Similar issues exist with the grammar of the sentences.

(b) I am somewhat confused by the phrase 'non information resources', which the authors use extensively throughout the paper. It is not defined systematically, and it is hard to identify what kinds of scenarios fit in this category. The authors could give a few examples and use them to explain why these resources require a category separate from objectives 1 and 2.

(c) In the section on RDFS inferencing, the authors say that the "subClass" of property allows for a version hierarchy of the classes. I cannot find any specific details in the work, and I am not sure it is wise to make my own judgement about this. It is an important issue, as it can have many implications for reasoning and for the implementation of reasoners. Any such statement would probably require some kind of proof to establish its validity and implications. The authors simply say that it is beyond the scope of this work. While it is important for the authors to focus on the main issue at hand, leaving such an important topic out of the discussion makes it hard to judge the technical correctness and applicability of the work.

(d) In the section "RDF Containers and collections", the authors mention the three approaches, state that the RDF-based approach works best, and only then explain the three different approaches. Again, as pointed out above, the authors should rewrite the paper to fix this and the other issues raised.

(e) The authors seem to be proposing a one-size-fits-all approach. However, this approach appears to have an issue that the authors have not discussed. LOD is gaining significant momentum in the inclusion of streaming data, such as real-time data about social networks and sensor data. Considering the performance figures, it is extremely difficult to see the approach being useful in any of these scenarios. On the other hand, an approach that can handle such streaming data will probably also be useful for more or less static data like those examined by the authors (unless convincing arguments to the contrary are provided).

(f) I am curious as to what happens to the older versions vis-à-vis the properties of the newer versions. Are these properties propagated to the older versions? If so, what impact could this have on assertions performed by reasoners? While this is not the main focus of the work, the technique and approach have ramifications that should be investigated.

(g) Minor comment: The diagrams are extremely hard to read in a printed version of the paper.

(h) Minor comment: The authors claim to have implemented a triple store. I disagree. The authors have stored their data in an existing triple store, namely AllegroGraph.

(i) The citations provided by the authors are not systematic and are extremely poorly done. I am not sure what format they follow. They do not include the names of conferences, journals, publishers and the like. I understand that in the internet era a paper can be found on Google from its title alone. However, for the sake of proper publication practice, the authors should at least attempt to follow well-established conventions.

(j) The evaluation is not convincing enough to show how good the approach is. This kind of work requires both qualitative and quantitative evaluation. The qualitative evaluation is missing from the work, and the quantitative evaluation requires more effort than is evident in its current form.

Overall, the work addresses an interesting problem and tries to provide a solution. However, the approach merely scratches the surface of the problem, and the paper is not written well enough to be considered for the journal in its current form. I recommend that the authors take the points above into consideration, address them, and resubmit after revision.

Review 2 by Tomi Kauppinen

The article aims to tackle the problem of versioning in RDF. There is certainly a need for a means of versioning in the context of spatial information: different properties and relations of features are valid at different times, and there are always corrections and adjustments to be made. Merges, splits, name changes and boundary changes make it difficult to correctly integrate data referring to places that have undergone these kinds of changes.

However, reading the paper did not convince me that the authors have a good solution for versioning. First of all, the readability of the paper is not high:

1) The text includes sentences like "The model version-enables such data..." that are hard to interpret.
2) Please consider using only prefixed versions of URIs in the text; at present both full URIs and prefixed ones are in use.
3) Two paragraph layout with long URIs and SPARQL gives an unfinished feeling.

Furthermore, I was not convinced by the content of the paper. First of all, it is hard to see what the generic value of the paper is. I suggest that the authors make clearer what their evaluated contribution is, rather than just saying that "In this paper we describe a mechanism that could solve this problem". The related work lacks many important papers, including work on modeling the validity times of RDF triples. I was also not convinced by the use of anonymous nodes for the version numbers (Figure 3). Why could versions not be accessed via URIs?

Perhaps I missed something, but for me to recommend publication, the paper would need to be much clearer about its goals, solutions, and results. The paper would preferably be discussed first at a workshop and then revised for a conference or journal publication.