Using Berlin SPARQL Benchmark to Evaluate Relational Database Virtual SPARQL Endpoints

Tracking #: 2473-3687

Milos Chaloupka
Martin Necasky

Responsible editor: 
Guest Editors Web of Data 2020

Submission type: 
Full Paper
RDF is a well-documented and popular format for publishing structured data on the web. It enables consuming data without knowledge of how the data is stored internally. There are already several native RDF storage solutions that provide a SPARQL endpoint. However, they are not widely adopted; it is still more common to store data in relational databases. There are also several implementations of virtual SPARQL endpoints over a relational database, but a consistent evaluation of them is missing. For native RDF stores, on the other hand, there is a state-of-the-art benchmark. In this paper, we show how this well-defined benchmark can be used to evaluate virtual SPARQL endpoints.

Solicited Reviews:
Review #1
By Thomas Minier submitted on 07/Jun/2020
Major Revision
Review Comment:

————— Summary —————
In this article, the authors provide an experimental study to benchmark virtual SPARQL endpoints, i.e., SPARQL endpoints that query RDF data stored in relational tables. The authors focus on the R2RML-based approaches for virtual integration and compare the EVI, Ontop, Morph, and SparqlMap approaches using the BSBM benchmark. They also provide an R2RML mapping between the BSBM dataset and a relational database.
The experimental study ranks various approaches based on four criteria: usability, completeness, soundness, and performance. Experimental results show that the Ontop system outperforms the others.

Overall, I like the contributions of the article. Still, I find issues with the motivation and positioning of this work, as well as many open questions and some structural problems. Thus, I chose a major revision, and I am eager to see how the authors will respond to my comments.

————— Strengths —————
* The usability study is a good idea, and it shows which software can be used by real users.
* The experimental study is fully reproducible, as is the R2RML mapping used by the authors.
* The authors focus on completeness and soundness before raw performance, which is very reasonable for comparing systems that aim to be deployed as public SPARQL endpoints.

————— Weaknesses —————
* The motivation for this work is unclear. What are the benefits of virtual SPARQL endpoints over traditional RDF database systems?
* The paper as a whole focuses on R2RML-based approaches and provides neither a comparison with, nor related work on, other approaches for virtual integration and virtual SPARQL endpoints.
* The choice of the BSBM benchmark for this study is not well motivated compared to other existing benchmarks.
* The experimental study highlights interesting results but lacks an in-depth analysis of them.

————— Review comments —————

Motivation and related work

My main concerns regarding this article are the motivation and positioning of this work.
Why is native RDF storage not widely adopted? And why are virtual SPARQL endpoints more attractive than native solutions? Current commercial RDF data management systems like Virtuoso, Blazegraph, or Stardog are very efficient at storing and querying RDF data, and they support SPARQL endpoints. Mapping non-RDF data to RDF data surely mitigates the ingestion cost, but we lose the powerful indexing capabilities of a database system, which severely degrades SPARQL query processing performance. For me, if you want to build a SPARQL endpoint, performance is the key to ensuring a good quality of service.

Next, the definition of a virtual SPARQL endpoint is unclear. Based on my understanding of the whole article, it seems to be a database system that contains existing, non-RDF data and that uses R2RML so that the data can be queried as RDF with SPARQL queries. The authors should explicitly define a virtual SPARQL endpoint in their context, to avoid any confusion. For example, Virtuoso stores RDF triples in PostgreSQL and executes SPARQL queries by rewriting them into SQL queries. The same goes for Blazegraph and Stardog. Do these systems fit into the authors' definition?

Regarding the benchmark used, the authors do not motivate their choice of BSBM or compare it with other available SPARQL benchmarks [4,5,6]. In this article, the authors compare SPARQL query processing performance, but BSBM is not the best benchmark for this task. The Waterloo SPARQL Diversity Test Suite [4] should be the benchmark of choice in this situation.

Finally, the authors focus heavily on the R2RML standard, but I found the choice somewhat arbitrary. Why is this so important? Is R2RML the most widely accepted standard? The article covers virtual SPARQL endpoints, but R2RML is not the only way to create mappings between RDF and relational data [2,3]. Why are these approaches not covered in the related work and compared with R2RML?

Evaluation section

Completeness of query execution is a popular metric, but its usual definition does not correspond to the one the authors use (see [1]). The authors state that « we do not consider whether the returned results are correct, only whether the tool can process the mapping » (p.8, l.11-12). This is not the definition of answer completeness used in SPARQL experimental studies.
I do not think that the results shown are wrong, so I invite the authors to recompute answer completeness using the correct definition and update the results accordingly.

For the soundness metric, the authors use the correct definition, but soundness should be measured as a percentage, ranging from 0% to 100%, and not as simple checkboxes. That way, we could compare how unsound the systems are. Again, I invite the authors to update the results, as I think they should have all the data to do it quickly.
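
To illustrate the kind of percentage I have in mind, here is a minimal sketch. It assumes the usual recall-style and precision-style definitions (completeness as the share of expected answers that were returned, soundness as the share of returned answers that are expected) and treats results as sets, so duplicate rows are ignored. The function name and the sample rows are hypothetical and are not taken from the paper.

```python
def completeness_and_soundness(expected, returned):
    """Return (completeness %, soundness %) for one query.

    Assumed definitions (not taken from the paper under review):
      completeness = share of expected answers that were returned
      soundness    = share of returned answers that are expected
    An empty denominator is reported as 100%, since nothing is missing
    or wrong in that case.
    """
    expected, returned = set(expected), set(returned)
    correct = expected & returned
    completeness = 100.0 * len(correct) / len(expected) if expected else 100.0
    soundness = 100.0 * len(correct) / len(returned) if returned else 100.0
    return completeness, soundness


# Hypothetical result rows for one query, as tuples of bound values.
expected_rows = {("p1", "Laptop"), ("p2", "Phone"), ("p3", "Tablet"), ("p4", "Camera")}
returned_rows = {("p1", "Laptop"), ("p2", "Phone"), ("p3", "Tablet"), ("p9", "Radio")}

completeness, soundness = completeness_and_soundness(expected_rows, returned_rows)
print(f"completeness = {completeness:.1f}%, soundness = {soundness:.1f}%")
```

Averaging these two values over all queries of a query mix would give one figure per system, which is easier to compare than a checkbox.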

The supported-queries metric is a good idea, as it allows us to compare approaches on their support of the SPARQL query language. For me, this support is a mandatory feature of any system that aims to deploy a public SPARQL endpoint.

For the performance study, the authors claim to have used several client settings in the BSBM test driver (1, 2, 4, 8, 16, and 32), but I only see results for the configurations with one client and eight clients. Where are the results for the other settings?

About the results themselves, I'm impressed by the work done by the authors to perform the actual study and make sure it is fully reproducible. However, I find that it lacks more in-depth explanations of the results. For example, in section 6.4, the authors observe that the results are not the same with MS SQL and MySQL, but they do not elaborate. The same goes for almost all the results found in this section. Why do we see these results, and what explains the difference between the various approaches? Is it how they translate SPARQL queries to SQL queries, or the overhead of running the R2RML conversion process?

The authors choose to compare their results with Virtuoso, which is a good idea, as Virtuoso is one of the best commercial SPARQL endpoints available today. But the results are only summarized in text and not compared with the previous ones in a figure. I agree that Virtuoso will have trouble loading large-scale RDF datasets, but I would like to see how Virtuoso performs on small-scale datasets against the other approaches. The authors note that « For smaller datasets, the Virtuoso solution is significantly faster than any virtual SPARQL endpoint », but by how much? Another interesting piece of information is the actual turning point: when does the ingestion cost of Virtuoso make it more attractive to use a virtual SPARQL endpoint?
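
One simple way to express such a turning point is as a break-even number of queries. The sketch below assumes a one-off loading cost for the native store and a constant average time per query for each system; the function name and the numbers are hypothetical and are not measurements from the paper.

```python
import math


def break_even_queries(load_seconds, native_seconds_per_query, virtual_seconds_per_query):
    """Number of queries after which loading into a native store pays off.

    Returns None when the native store is not faster per query, because
    the one-off loading cost is then never recovered.
    """
    saving_per_query = virtual_seconds_per_query - native_seconds_per_query
    if saving_per_query <= 0:
        return None
    return math.ceil(load_seconds / saving_per_query)


# Hypothetical numbers: one hour of loading, 0.25 s vs. 0.75 s per query.
print(break_even_queries(load_seconds=3600,
                         native_seconds_per_query=0.25,
                         virtual_seconds_per_query=0.75))
```

Below that number of queries, the virtual endpoint has the lower total cost under these assumptions; reporting it per dataset scale would make the trade-off concrete.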

Finally, I can't entirely agree with the authors' decision to use the single-client configuration to compare all virtual SPARQL endpoints. By nature, a public web service like a SPARQL endpoint will be queried by a high number of concurrent clients, with an unpredictable query arrival rate. Experiments with a single client are an excellent way to assess answer completeness and soundness, but they are not representative of the performance of a SPARQL endpoint in action.

Other comments

There are some minor structural issues in this article.
In the introduction, the authors claim that « We define aspects that are used to evaluate a virtual SPARQL endpoint solution. » This is an excellent contribution, but I found it poorly outlined. Indeed, the evaluation aspects only appear in section 6. Sections 4 and 5 both cover related work and could easily be merged. Next, the related work section should appear much earlier in the article, as it provides vital information on the solutions evaluated in the benchmark.

The section describing the mapping using R2RML lacks explanatory figures and a concrete example, going from RDF data to a relational table, with an example of how SPARQL queries are actually translated into SQL. I am not familiar with R2RML, and I would have appreciated a gentler introduction to the mapping process, rather than a big piece of R2RML code. Maybe the authors can use a SPARQL query from BSBM as an example, which would fit with the RDF data from Figure 1.
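
To make concrete what I mean by a gentle example, here is a toy sketch of the idea. It is not R2RML and not the authors' mapping: the table, columns, IRI template, and predicate names are hypothetical, and only the single pattern `?s <predicate> ?o` is handled. The intent is just to show one table row becoming triples and one triple pattern becoming one SQL query.

```python
import sqlite3

# Hypothetical mapping in the spirit of R2RML: one table is one class,
# a template builds the subject IRI, and each predicate maps to one column.
MAPPING = {
    "table": "product",
    "key_column": "nr",
    "subject_template": "http://example.org/product/{key}",
    "predicates": {
        "rdfs:label": "label",
        "ex:producer": "producer",
    },
}


def translate_triple_pattern(predicate, mapping=MAPPING):
    """Translate the pattern `?s <predicate> ?o` into a SQL query string."""
    column = mapping["predicates"][predicate]
    key = mapping["key_column"]
    # NULL columns produce no triple, so they are filtered out in SQL.
    return (
        f"SELECT {key}, {column} FROM {mapping['table']} "
        f"WHERE {column} IS NOT NULL ORDER BY {key}"
    )


def answer(connection, predicate, mapping=MAPPING):
    """Run the translated SQL and rebuild (subject IRI, object) bindings."""
    rows = connection.execute(translate_triple_pattern(predicate, mapping)).fetchall()
    return [(mapping["subject_template"].format(key=key), value) for key, value in rows]


connection = sqlite3.connect(":memory:")
connection.execute(
    "CREATE TABLE product (nr INTEGER PRIMARY KEY, label TEXT, producer INTEGER)"
)
connection.executemany(
    "INSERT INTO product VALUES (?, ?, ?)",
    [(1, "Laptop", 10), (2, "Phone", None)],
)

# SPARQL counterpart: SELECT ?s ?o WHERE { ?s rdfs:label ?o }
print(translate_triple_pattern("rdfs:label"))
print(answer(connection, "rdfs:label"))
```

A figure showing the same three steps (relational row, mapping rule, resulting triples and SQL) for one BSBM query would serve the reader equally well.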

I also find that there is a heavy focus on the mapping between BSBM and a relational database, even though it is not the core contribution of the article. I acknowledge that the mapping is essential, but it is a rather technical contribution. Maybe it could be moved to an appendix.

————— References —————

1. Pang, H., Jain, A., Ramamritham, K., & Tan, K. L. (2005, June). Verifying completeness of relational query results in data publishing. In Proceedings of the 2005 ACM SIGMOD international conference on Management of data (pp. 407-418).
2. Dimou, A., Vander Sande, M., Colpaert, P., Verborgh, R., Mannens, E., & Van de Walle, R. (2014). RML: a generic language for integrated RDF mappings of heterogeneous data.
3. Scharffe, F., Bihanic, L., Képéklian, G., et al. (2012). Enabling linked data publication with the Datalift platform. In Workshops at the Twenty-Sixth AAAI Conference on Artificial Intelligence.
4. Aluç, G., Hartig, O., Özsu, M. T., & Daudjee, K. (2014, October). Diversified stress testing of RDF data management systems. In International Semantic Web Conference (pp. 197-212). Springer, Cham.
5. Morsey, M., Lehmann, J., Auer, S., & Ngomo, A. C. N. (2011, October). DBpedia SPARQL benchmark–performance assessment with real queries on real data. In International semantic web conference (pp. 454-469). Springer, Berlin, Heidelberg.
6. Görlitz, O., Thimm, M., & Staab, S. (2012). Splodge: Systematic generation of SPARQL benchmark queries for linked open data. In International Semantic Web Conference (pp. 116-132). Springer, Berlin, Heidelberg.

Review #2
By Freddy Priyatna submitted on 09/Jun/2020
Major Revision
Review Comment:

This manuscript was submitted as 'full paper' and should be reviewed along the usual dimensions for research contributions which include (1) originality, (2) significance of the results, and (3) quality of writing.

This paper deals with the lack of consistent evaluation for virtual SPARQL endpoints (R2RML engines that translate SPARQL queries into SQL queries) and proposes the use of the Berlin SPARQL Benchmark (BSBM). The engines being evaluated are Morph, Ontop, SparqlMap, and the EVI solution. As BSBM was not originally designed to work with R2RML engines, relevant R2RML mappings and several modifications, such as a slightly modified IRI scheme, are proposed. The evaluation is performed and results are reported. The claimed contributions are: (C1) the definition of virtual SPARQL evaluation aspects, (C2) the definition of BSBM mappings, (C3) the identification of gaps in using BSBM to evaluate virtual SPARQL endpoints, (C4) the use of BSBM to evaluate virtual SPARQL endpoints, and (C5) the provision of resources to repeat the evaluation.

Originality. The use of BSBM to evaluate virtual SPARQL endpoints is not a novel proposal. In fact, when R2RML was proposed as a W3C Recommendation in 2012, it was clear that no benchmark could be used as is to evaluate R2RML engines in virtual mode. Thus, some of the people involved in the development of R2RML engines decided to use BSBM, given its popularity at that time. Furthermore, it has been many years since the publication of R2RML, and other benchmarks more suitable for evaluating R2RML engines have been proposed. Thus, I am reluctant to accept (C4), the use of BSBM to evaluate virtual SPARQL endpoints, as an original contribution.

Significance of results. As for the mappings, a naive approach, similar to direct mapping in the sense that tables are mapped as classes and columns as properties, can be and has been used. This is why I don’t think that (C2), the definition of BSBM mappings, is significant enough. I was actually expecting the paper to include other aspects, such as an analysis and comparison of the query trees generated by the R2RML engines, the number of joins produced, the order of the tables, pushed-down SQL expressions, and so on. The same objection applies to (C1), the definition of virtual SPARQL evaluation aspects (usability, completeness, soundness, performance): I do not think it can be considered a significant contribution, as I believe it’s pretty normal to include those aspects in any evaluation.

Quality of writing. This paper is well written and easy to follow. I would like to propose this modification: Rename Sec 2 to Background and move Sec 3.1 R2RML under Sec 2. So the new Sec 2 should have two subsections: BSBM and R2RML.

Furthermore, I am also including these additional dimensions: reproducibility and related work.

Reproducibility. I appreciate the authors’ effort to provide a docker image, but unfortunately I am not using a Windows machine and am therefore not able to test this aspect. As a suggestion, take out the .NET Framework dependency and make sure the evaluation is repeatable on operating systems other than Windows, so that (C5), the provision of resources to repeat the evaluation, truly holds in all cases (not only on Windows machines).

Related Work (Sec 5). I would like the authors to include more relevant and recent proposals in the related work sections, especially those that propose benchmarks designed to evaluate R2RML engines, and to compare them with (C3), the identification of gaps in using BSBM to evaluate virtual SPARQL endpoints. Ideally, the authors should be able to justify the benefits and drawbacks of modifying BSBM for evaluating virtual SPARQL endpoints instead of designing a new benchmark, such as the NPD benchmark (which is not mentioned anywhere in the paper).

In summary, although I find the paper easy to follow and well written, I do not consider it suitable for publication in the Semantic Web Journal, given its lack of significant contributions. I suggest that the authors improve the aspects I have mentioned above, including rephrasing the contribution section.