Identifying, Querying, and Relating Large Heterogeneous RDF Sources

Tracking #: 3030-4244

This paper is currently under review
Andre Valdestilhas
Muhammad Saleem
Edgard Marx
Bernardo Pereira Nunes
Tommaso Soru
Wouter Beek
Claus Stadler
Konrad Höffner
Thomas Riechert

Responsible editor: 
Ruben Verborgh

Submission type: 
Full Paper
The Linked Open Data (LOD) principles have been widely adopted due to their undeniable advantages; however, publishing data and connecting it to third-party datasets remains a difficult and time-consuming task. A question often raised during the publication process is whether there is a dataset available on the Web with which we can connect. Although the question seems trivial, it unfolds into several complex issues: where the related datasets are, how many there are, how similar they are, and how to query them in a heterogeneous environment. This paper tackles these questions by introducing (i) a novel method to detect similarities between RDF datasets, including duplicated chunks of data; (ii) a publicly queryable index called ReLOD that identifies datasets sharing properties and classes; and (iii) a SPARQL query processing engine called wimuQ that can execute both federated and non-federated SPARQL queries over a large amount of RDF data. To create the ReLOD index and execute SPARQL queries over the Web of Data, we harvested more than 668k datasets from LOD Stats and LOD Laundromat, along with 559 active SPARQL endpoints, corresponding to 221.7 billion triples or 5 TB of data. We present an evaluation of the accuracy of ReLOD and of the query execution performance of wimuQ over this massive amount of data.