Facilitating Data Discovery by Connecting Related Resources

Tracking #: 484-1680

Antonia Rosati
Matthew Mayernik

Responsible editor: 
Guest editors Semantic Web Interfaces

Submission type: 
Full Paper
In this study, we investigate two approaches to increase the discoverability and connectivity of resources on the web. The first approach is the use of semantic web data structures in RDF/XML, in particular the Open Archives Initiative Object Reuse and Exchange (OAI-ORE) vocabulary for creating compound digital objects. The second approach is the use of Schema.org vocabularies for marking up HTML web pages to increase their visibility to web search engines. By applying these two markup approaches to three case studies within the geosciences, we identify factors that help to evaluate their applicability to research data archives. Our analysis points toward the most efficient and effective markup for aggregating resources within research data archiving settings. We focus on factors that can lead to increasing public discoverability of datasets. Our evaluations are based on the following characteristics of each markup approach: ease of use, the available standards and vocabularies, the ease of interoperability, and the relation to data citation tools and methods.
Reject and Resubmit

Solicited Reviews:
Review #1
Anonymous submitted on 15/Jun/2013
Review Comment:

The paper presents and compares RDF/XML (in the context of OAI-ORE) and Schema.org as means for enhancing discovery of datasets on the web.

The authors rightly point out that the problem of discovering datasets is important and requires attention distinct from discovering documents, the main task of current search engines.

The discussion is grounded with real life case studies related to existing datasets.

The article is easily readable and makes reference to quite a few related works.

However, the paper is not suitable for publication in SWJ for the following reasons:

The paper was submitted for the special issue of Semantic Web Interfaces. This paper does not match this topic well.

Sections 1-3 are in many places too introductory for SWJ, explaining, e.g., basic notions of RDF.

The "beef" of the paper is the case studies at NCAR in section 4, and the corresponding analysis results of sections 5 and 6. In particular, RDF/XML and Schema.org are compared in terms of ease of use, availability of standards and vocabularies, and relation to citation tools and methodologies. Comparing and contrasting Schema.org microdata and RDF markup along the selected dimensions is challenging, even though in practice these concerns have to be addressed. The problem from a scientific viewpoint here is that not enough formal detail is presented about the problems addressed or the solutions to them.

The goal of the work is to enhance dataset discovery using metadata, but it remains unclear how the work presented actually helps here. Opinions for and against RDF and Schema.org are discussed, but no "solution" or application system is presented, nor are results evaluated.

As a result, the paper does not qualify as a scientific journal paper with a measurable contribution, but is an introduction to and a general discussion of its topic. I think revising and resubmitting it does not help here because of the nature of the work.

References to the areas of semantic web search engines (Swoogle, Watson, SWSE, etc.), and to research on metadata schemas and catalogs for datasets (CKAN, VoID, etc.), are missing, if the goal is to discuss dataset discovery.

With enhancements the paper could be of interest to some magazine as an introduction to issues related to Schema.org and RDF.

Review #2
By Ghislain Hachey submitted on 17/Jun/2013
Review Comment:

This paper investigates two different approaches to increase discoverability and connectivity of resources on the web: the Open Archives Initiative Object Reuse and Exchange vocabulary (OAI-ORE) which is based on RDF/XML and the use of Schema.org vocabularies for marking up HTML pages in a search engine friendly way.

Although there has been a lot of work on automatic discovery and connectivity of web resources, not much quality work has been done to rigorously evaluate the alternatives, so as far as I know this work is important and original. That being said, I have several problems with the research as it stands.

The results feel more like anecdotal evidence than rigorous scientific analysis. The paper starts with a rather interesting (and essential) idea, which is to investigate discoverability and connectivity of resources on the Web. For such a study I would expect a carefully crafted questionnaire, or at least a clear list of criteria to look for, and a systematic grading of the three studied scientific projects against the investigated approaches. I understand this paper's purpose is not to conduct a survey, but its main purpose does seem to be to "investigate", and as such I think some elements of good evaluation papers such as [1,2,3] might provide insight into how to improve this research.

The discussion section does integrate criteria for an evaluation, but too informally. The criteria should be presented early, as part of the scientific method used, after which experiments can be conducted based on the criteria and *then* the results discussed. I do find the discussions an interesting read, but as they stand they cannot be considered scientific evidence.

It is false to claim that Semantic Web vocabularies are "innumerable"; there is in fact a relatively small (but growing) set of them, mainly in scientific communities. A simple wording adjustment would be better, I think.

The method of investigation also needs to be more clearly explained. The paper moves from giving background information (which I really liked, as some of it was new to me) straight to providing results. There is a clear gap where the methodology used should be outlined. It should be explicit (and repeatable) how results are compiled; right now I would find it hard to reproduce this evaluation/investigation.

The style of writing is formal and appropriate, but there are several grammatical errors and typos that could have been easily avoided. Here are some examples:

- p.6 "filesfor": missing space.
- p.6 "wasmanually": missing space.
- p.7 "of of": remove one "of".
- p.8 "reearch": missing 's'.
- p.9 "regulariety": ?!?

Finally, this seems to be outside the scope of the SWI SWJ special issue. I would encourage the authors to continue this important work, though, and perhaps submit to a conference first (if not already done). However, I do not think it is ready for journal publication as original work.

To summarise, the paper provides clear introductory text even for those new to the concepts discussed, but does not have the required overall balance: almost half the paper is background information, and much is missing in terms of methodology, results, and discussion. The paper reads very well, and I would definitely enjoy reading another, more structured and rigorous version of it. I think the subject explored is critical, but the community would benefit from a more scientifically sound evaluation.

[1] T. Dyba and T. Dingsøyr. Empirical studies of agile software development: A systematic review. Information and Software Technology, 50:833–859, August 2008.
[2] T. Dyba and T. Dingsøyr. Strength of evidence in systematic reviews in software engineering. In Proceedings of the Second ACM-IEEE International Symposium on Empirical Software Engineering and Measurement (ESEM '08), pages 178–187, New York, NY, USA, 2008. ACM.
[3] T. Dyba, T. Dingsøyr, and G. Hanssen. Applying systematic reviews to diverse study types: An experience report. In First International Symposium on Empirical Software Engineering and Measurement (ESEM 2007), pages 225–234, September 2007.

Review #3
By Ian Dickinson submitted on 18/Jun/2013
Review Comment:

This paper has a number of minor flaws, but my principal reason for recommending rejection is that it does not live up to the premise that the authors establish. After a long and overly general preamble, the authors describe two efforts to annotate three different datasets with metadata in RDF and schema.org microdata. The premise is that doing so will make the datasets more discoverable and better connected, but this conjecture is never tested. It is not even discussed what "more discoverable" or "better connected" would mean in practice, nor are concrete, measurable objectives suggested. Moreover, the two methods discussed seem somewhat incomparable: schema.org can, as the authors note, be used to affect search rankings. RDF metadata, however, requires another tool - such as Sindice or something similar - to find and process the published RDF. Attempting to compare apparently incomparable approaches leaves the reader little the wiser; the more so when no conclusions are drawn.
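For context on the asymmetry noted above: the schema.org approach embeds microdata attributes directly in a page's ordinary HTML, so general-purpose crawlers pick the metadata up without any extra tooling. A minimal, hypothetical sketch for a dataset landing page (names and URL invented for illustration, using properties from the schema.org Dataset type) might look like:

```html
<!-- Hypothetical microdata markup for a dataset landing page.
     itemtype/itemprop names are drawn from the schema.org Dataset type. -->
<div itemscope itemtype="http://schema.org/Dataset">
  <h1 itemprop="name">Example Climate Observations</h1>
  <p itemprop="description">Hourly surface observations, 1990–2010.</p>
  <a itemprop="url" href="http://example.org/data/obs">Download the data</a>
</div>
```

Published RDF metadata, by contrast, sits outside the human-readable page, which is why a semantic-web-aware harvester is needed to exploit it.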

The paper has many minor errors, too many typos, and many places where claims are made without citation. Thorough proofreading is required. Among the more concerning errors:

* "in order to find something, it must be named" (section 1). I disagree: anonymous things may be found, by their description. Perhaps it would be better to say "in order to find something, it must be identified", where identification is taken to include both naming and identifying reference expressions.

* "actionable identifiers" (section 2). The action of an identifier is to identify; therefore "actionable identifier" is a tautology. Later in this section, the authors appear to mean "resolvable" rather than "actionable".

* "Web 3.0 is essentially a way to bridge the gap between human users and computerized applications". I'm not sure quite what this means, but humans have been using computerized applications, successfully, for a long time. To the extent that Web 3.0 means anything (other than a rather vague marketing term), I don't believe that it means this.

* " Resource Description Framework ... is a standard" (section 3.1). Not being an accredited standards body, the W3C is careful to state that it makes recommendations, not that it sets standards. This should perhaps read "... is a specification"

* "RDF is built from XML triples" (section 3.1). This is most emphatically wrong. RDF and XML are completely orthogonal: one can encode RDF using XML, but XML is not fundamental to the definition of RDF.
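To illustrate this point, the same single triple can be written in the RDF/XML serialization or in Turtle; XML is just one encoding among several (the example resource URI is hypothetical):

```xml
<!-- RDF/XML serialization of one triple -->
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dc="http://purl.org/dc/elements/1.1/">
  <rdf:Description rdf:about="http://example.org/dataset/1">
    <dc:title>Example Dataset</dc:title>
  </rdf:Description>
</rdf:RDF>
```

```turtle
# Turtle serialization of the same triple
@prefix dc: <http://purl.org/dc/elements/1.1/> .
<http://example.org/dataset/1> dc:title "Example Dataset" .
```

Both documents describe the identical subject–predicate–object statement; the triple, not the XML, is the fundamental unit.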

* "RDF vocabularies are declared via namespace designations" (section 3.1). Also incorrect.

* "Prior to ORE, groups of related resources could not be made visible on the web via URLs" (section 3.2). I'm not sure what the authors are trying to convey here, but I disagree. Collections can be described in HTML as ul/li lists, or in RDF with seq and bag, or simply by publishing a list of URLs in a text file.
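The alternatives mentioned here are straightforward to sketch. A group of related resources can be exposed either as a plain HTML list of links or as an RDF container such as rdf:Seq (URLs hypothetical):

```html
<!-- An HTML list making a group of related resources visible via URLs -->
<ul>
  <li><a href="http://example.org/data/part1">Part 1</a></li>
  <li><a href="http://example.org/data/part2">Part 2</a></li>
</ul>
```

```turtle
# The same grouping as an ordered RDF container (rdf:Seq) in Turtle
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
<http://example.org/collection>
    a rdf:Seq ;
    rdf:_1 <http://example.org/data/part1> ;
    rdf:_2 <http://example.org/data/part2> .
```

Both predate ORE, which is why the claim in section 3.2 is too strong as stated.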

* "on a finite project" (section 4). Are there infinite projects?

* "RDF requires a triple store, which may be overwhelming to [..] users. It is based on XML" (section 6.1). Users do not need a triple store to publish and make use of RDF metadata, they only need a tool which can process it. Semantic web search engines, such as Sindice, can do this without the user ever creating a triple store themselves. Also, as noted above, RDF is not based on XML.

* Section 6 is correctly labelled "Discussion", which is all that it does. It would be more helpful to the reader if it were labelled "Evaluation", and then proceeded to evaluate the different metadata and identification approaches against measurable criteria. It is not apparent to me that a dataset creator wishing to make their dataset more discoverable could use the results of this paper as anything other than general background to a decision about how, and where, to publish metadata on the dataset.