Beyond Querying: A Scalable Architecture for Linked Data Monitoring

Tracking #: 2589-3803

Burak Yönyül
Oylum Alatlı
Rıza Cenk Erdur
Oğuz Dikenelli

Responsible editor: 
Guest Editors Web of Data 2020

Submission type: 
Full Paper
Monitoring data sources for possible changes is an important consumption requirement for applications running in interaction with the web of data. In this paper, MonARCh (Monitoring Architecture for Result Changes), a scalable architecture for monitoring result changes of registered SPARQL queries in the linked data environment, is introduced. Although MonARCh can be understood as a publish/subscribe system in the general sense, it differs in how the communication with the data sources is realized. The reason is that data sources in the linked data environment do not publish the changes on their data. MonARCh provides the necessary communication infrastructure between the data sources and the consumers for the notification of changes. Users subscribe SPARQL queries to the system, which are then converted to federated queries. MonARCh periodically checks for updates by re-executing sub-queries and notifies users in case of any result change. In addition, to provide scalability, MonARCh takes advantage of the concurrent computation of the actor model and a parallel join algorithm for faster query execution and result generation. The design science methodology has been used both during the design and implementation stages and for the evaluation of the architecture.


Solicited Reviews:
Review #1
By Aidan Hogan submitted on 14/Nov/2020
Major Revision
Review Comment:

This paper proposes a framework, called MonARCh, for processing federated SPARQL queries over dynamic data. The framework is based on an actor model (Akka), where each actor independently receives messages, runs computations, and takes some action, thus enabling concurrency. Actors are defined for distributing queries (source selection), distributing sub-queries, scheduling the execution of sub-queries and detecting changes, parallelising joins, and executing joins. Experiments are run for a query with three triple patterns that involve three different datasets (NYTimes, DBpedia, and a custom Stocks dataset). These experiments are run on a cluster of nodes, using Virtuoso as the backend SPARQL endpoint/RDF store. Federation is conducted using WoDQA. The analysis looks at CPU usage, result sizes, query processing times, memory usage, change notification times, etc., for the selected query.
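The periodic re-execution and change-detection loop summarized above can be sketched roughly as follows. This is a minimal, hypothetical Python sketch for illustration only: MonARCh itself is built on Akka actors, and all names below (`SubQueryMonitor`, `refresh`, the callable standing in for an endpoint) are invented, not taken from the paper.

```python
import hashlib
import json


class SubQueryMonitor:
    """Hypothetical sketch of one monitoring actor: it re-executes a
    sub-query against its endpoint on each refresh and reports whether
    the result set changed since the previous execution."""

    def __init__(self, execute):
        # `execute` stands in for a SPARQL endpoint call; it returns a
        # list of solution mappings (dicts).
        self.execute = execute
        self.last_digest = None

    @staticmethod
    def _digest(results):
        # Order-insensitive fingerprint of a result set.
        canonical = sorted(json.dumps(row, sort_keys=True) for row in results)
        return hashlib.sha256("\n".join(canonical).encode()).hexdigest()

    def refresh(self):
        """Re-run the sub-query; return (changed, results). The first
        execution only establishes the baseline and reports no change."""
        results = self.execute()
        digest = self._digest(results)
        changed = self.last_digest is not None and digest != self.last_digest
        self.last_digest = digest
        return changed, results
```

In the real system, a scheduler actor would trigger `refresh` periodically and, on a change, forward the new bindings to the join actors.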

## Strengths:

S1: The paper addresses an interesting problem of relevance to the Special Issue. To the best of my knowledge, not much work has been done with respect to federation over dynamic datasets, and the actor-based approach proposed by the authors seems quite promising.

S2: The system described clearly features quite a lot of work, with a lot of engineering effort. Code is provided in a GitHub repository. (It would be great to add a README as well as licenses.)

S3: The paper follows a formal research methodology.

## Weaknesses

W1: The paper is much longer than it needs to be, and is often quite repetitive. The main issue is that discussion of methodology dominates the paper, is often very confusing, and distracts in a significant way from the proposals of the paper, the experimental results achieved, etc. As a consequence, it takes a while before it becomes clear what the technical contributions will be: it is not until Page 8 that the details of the proposal become clear. As a result, I found this paper to be difficult to read. While it is good to follow a methodology, I think that the methodology itself should not become a focus of the paper. The focus of the paper should be on the ideas, and the lessons learnt. More generally, I am not all that certain what the lengthy discussion of methodology contributes or what I learn from it. There have been many papers on the topics of federation, dynamic Linked Data, etc., and there is an established experimental methodology, including benchmarks, metrics, etc. Rather than follow this accepted methodology, it feels like the authors have gone about inducing a similar methodology from first principles. Why not simply re-use the standard methodology/metrics/benchmarks? (Some FedBench data are used, but not the full benchmark.)

W2: While W1 relates perhaps more to a subjective question of style, perhaps the biggest shortcoming of the paper lies in the experiments. While they are quite detailed, they are all based on a single and relatively simple SPARQL query: one with three patterns, where each pattern is answerable by precisely one dataset. This is insufficient to draw general conclusions about the performance of the proposed framework. What happens, for example, when subqueries have multiple relevant sources? What happens when subqueries have more than one triple pattern? What happens when other features of SPARQL are used? It is particularly hard to justify using only one query given that a lot of work has gone into designing benchmarks -- such as FedBench and LargeRDFBench -- for federated querying. The authors argue that: "If a heterogeneous set of query patterns have been used, it would be hard to visualize the limits of the OoS." This is not a convincing argument for me; surely having more diverse queries would lead to a better understanding of the limits of the system. They further state that "Since SPARQL endpoints are part of the context the artifact operates in, this will not bring additional load on the system." The costs of the endpoints are not "free" in the sense that a system that overloads the endpoints with refresh queries is clearly not sustainable. They also state: "Only if the queries to be monitored gets too complex including more than two constantly changing sub-queries, the number of queries that can be monitored may get lower." I am not fully sure I understood this, but these sorts of claims should be supported by experimental data.

W3: The proposed architecture seems to generate a lot of communication between the actors, which would seem to lead to high network usage. While CPU and memory are analysed in the results, I did not notice much discussion about communication overheads.

## Recommendation

Though the paper proposes an interesting approach (actor-based) to address an interesting problem (federated queries over dynamic data) of relevance to the special issue, I think that the experimental section, based on a single query, is far too weak. I lean towards rejection based solely on this issue, but I think that there are some promising ideas exhibited in this paper, that a lot of work has been invested into developing the system, and that the problem is interesting. Hence I recommend a Major Revision, where in a revised version I would expect a greater diversity of queries (and perhaps datasets) to be considered for experiments, perhaps based on FedBench, LargeRDFBench, DBpedia live, etc. Also it would be interesting to see more data regarding communication costs (beyond just a count of results).

As a secondary issue, I personally find that the discussion of methodology distracts and greatly dilutes the technical discussion. I would recommend removing or compressing some of this discussion, perhaps moving parts to an appendix, etc. This is admittedly an issue of style, and thus at the discretion of the authors, but I suspect that most readers would prefer if the methodological discussion were greatly curtailed in favour of a more crisp technical exposition.

## Minor comments

(This is a subset mostly focused on the earlier parts of the paper. Though the English is in general quite readable, there are frequent minor typos. A thorough proof-read is recommended.)
- The title is a little confusing. It is not clear how the work goes beyond querying when the input is precisely a SPARQL query?
- "takes [] advantage"
- "[The] Linked Data concept"
- "on the web [] become"
- "pull-push" -> "push-pull" It's the same, right? If so, please use one consistently. Push-pull sounds more natural.
- "The rest of the paper"
- "but just differentiate on replacing publisher with pull component while embracing subscriber itself" Unclear.
- Add spaces before references; e.g., "semantic web[10]", "Linked Data[9]"
- Be consistent with capitalisation: "semantic web" vs. "Semantic Web", "Linked Data" vs. "linked data", "web of data" vs. "Web of Data", "rdf" vs. "RDF", "Dbpedia" vs. "DBpedia", etc.
- "since it is not obligatory for SPARQL endpoints to be publishers, we assume to have no publisher ..." It is unclear what this means. Publishers can publish their data through a SPARQL endpoint.
- "and current result[s] of a query"
- "[the] actor model"
- "the system has been [] built on"
- "What mechanism [leads to] not"
- "Why more queries cannot be be registered per unit of time" -> "What limits the number of queries being registered per unit of time"
- "[The i]nternal architecture" Also this paragraph is far too long! Given its importance, it urgently needs to be divided into various paragraphs, or turned into a list.
- "How do[es] the number ... affect[]"
- "[The m]onitoring engine identifies"
- "that in case of federated" -> "that in the federated case"
- "therefore join operation[s]"
- "[The s]ystem is able"
- "Data on the linked data space changes rapidly." [needs reference]
- "Additionally, a great number of the applications used on a daily basis are data sensitive." [needs reference]

Review #2
Anonymous submitted on 12/Dec/2020
Major Revision
Review Comment:

The authors tackle the problem of monitoring the result changes of SPARQL queries and propose a system called MonARCh.
MonARCh is a scalable result monitoring system using a pub/sub architecture which allows it to notify users about changes in their registered query results.

The problem addressed in the paper is very relevant and also nicely analysed in the manuscript.
Change notification for applications in the Linked Data / federated SPARQL endpoint setting is an important topic, even in closed spaces, e.g., large companies that have several SPARQL endpoints running.
Applications could rely on very timely and detailed notifications and as such save a lot of bandwidth and client logic.

However, the current state of the manuscript does not really provide a clear and structured solution to this problem.
The main weakness of this work is the lack of structure and also of evaluation.
I found it very hard to extract the exact problem addressed in the paper, the proposed solution and architecture, and the final results.
In contrast, the paper presents the research methodology in a very structured way, but fails to present the results in the same clear structure.
The authors should maybe focus less on presenting the methodology and rather try to slim down the scope and present in a clear way the problem, the solution design, the implementation details, and a systematic evaluation.
Also, introducing all methodology questions in Section 3 without providing the details is very confusing: the reader needs to read Sections 4 and 5 before they can understand Section 3.

The second major weakness is the evaluation, which is too unstructured and simplistic.
There are claims and conclusions which are not backed up by numbers, proofs, or references.
I don't see the scalability of the system at all. The numbers show me that I need a lot of memory for small datasets and a small user load.
There is only one simple query tested, with an actually small-scale experiment: 3000 companies on two nodes.
I would like to see that the system can scale with the number of statements, registered queries, query types (e.g., selectivity, result sizes, number of endpoints), and nodes.
The evaluation does not compare to similar systems from the related work; why is that? Also the claim that they are not scalable is not backed up.

It is clear that the authors invested a lot of effort in studying the problem, the related work, and existing systems, and in implementing MonARCh, and that they try to present their contribution.
The manuscript at hand, however, does not yet present this effort and the smart solution in the way it should.
As such, I highly encourage the authors to revise their manuscript by focusing on the core aspects of their contribution: the problem, the solution, the implementation, and the evaluation.
Slim down the design/research methodology section to its core points, and then invest in the presentation of the solution and its evaluation with respect to the claim of being scalable.

# Introduction

The authors talk about Linked Data but then focus entirely on SPARQL queries which are federated. Linked Data is a set of 4 principles saying that data should be stored as RDF and accessed via dereferencing URIs. SPARQL queries are typically executed by RDF stores which harvest a subset of the Linked Data Web.

What is unclear in the introduction is how MonARCh is deployed. Is it a global service or is it deployed as a kind of proxy for my application?

# Section 2

The part on page 3 about the semantic web and linked data reads oddly. The authors say that sources are interlinked, but a source typically refers to a dataset or a payload; the authors probably mean "interlinking resources or entities".
The argument about SPARQL endpoints that interact with each other should be backed up by some references.
A SPARQL endpoint is typically an interface over a collection of datasets (it could be crawled data, or a large data dump that is served).

What does it mean that it is not obligatory for SPARQL endpoints to be publishers (p3, l37)?

# Research Methodology

The first paragraph is still not clear to me.
The authors talk about monitoring changes in Linked Data, but with SPARQL queries, and at this point it is still unclear to me what that exactly means. SPARQL queries are typically executed via SPARQL engines over a graph database, or, in federation, over other SPARQL endpoints. There is one approach which allows SPARQL queries to be executed directly with HTTP GET requests, but the authors do not mention this at this stage. So overall, I don't know what the authors' approach is at this point of the paper.

Also it is not clear how to get from the motivation to the design methodology. I think the motivation is too vague and too broad: enhancing the ability to respond to changes in Linked Data could, in the purest case, be done with HTTP HEAD and If-Modified-Since.

Otherwise, I like that the authors introduce this clear methodology.

# Design Problem and Knowledge Question

I think the authors have to make clear how linked data monitoring and federated SPARQL queries relate to each other.

Some research questions only make sense if you know the remaining part of the paper, e.g., the pros and cons of the join algorithm: the join algorithm is mentioned for the first time on this page, so why is a join algorithm important here?

Why are some questions duplicated across several categories?

I think this section needs to be introduced later, or with much more detail.
The problem is that the questions refer to implementation, architecture, and SPARQL specifics which have not been introduced at this stage of the paper.

# Design Cycle

I think the same applies here as in Section 4. There is so much context required to understand it, e.g., what are cluster dynamics and why are they important?

The confusing part is seeing the architecture and implementation in the subsection Treatment Design.
I would put the architecture and design in their own section, with additional subsections, to get a better structure.

# Empirical Cycle
I am missing some clear overview of the experiment, the setup, and the results.

What does "the system can handle up to 3000 companies with 3000 queries at most" mean?
Is it queries per minute, or overall registered queries?
What happens if I have one query for the article count and one query for the stock markets?
Would the system behave differently?
Could it handle only one query, or both?

The query times are around 90 seconds at most with 2500 companies?

That also means that the system cannot be used for near real-time notification. This is not a negative thing, but the motivation should state that the goal is to monitor data with a change frequency above a certain minimum time.
Also, the query times seem to increase quite strongly with more data; is that still scalable?
If the scaling factor of 7 holds, the system would soon become unresponsive.

What is the reason that node 1 and node 2 behave differently?

Other questions:
There are no real acceptance criteria defined for evaluation success.

Why is only one simple query used? What happens if other SPARQL operators are used (e.g., OPTIONAL, FILTER)?
Why were no different queries tested?

The authors need to explain the figures and why the curves look the way they do.
One would expect from a scalable system that the relevant parameters increase linearly with the number of inputs. But the figures show that there is sometimes a linear increase and then a decrease.
There must be a clear reason why that is.

The system can monitor 48M tuples with 30GB.
Now consider that Linked Data spans billions of triples; how much memory would the system need?
Even in a moderate setup, with around 300M statements, this means the system needs about 180GB of RAM for each cluster node.
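The memory projection above is a simple linear extrapolation from the measured figures, spelled out below (assuming memory grows proportionally with the number of monitored tuples; variable names are illustrative):

```python
# Linear extrapolation of the reviewer's figures (assumption: memory
# usage grows proportionally with the number of monitored tuples).
measured_tuples_m = 48     # millions of tuples in the experiment
measured_memory_gb = 30    # GB of RAM used to monitor them
target_tuples_m = 300      # a "moderate" Linked Data setup

projected_gb = measured_memory_gb / measured_tuples_m * target_tuples_m
print(projected_gb)        # 187.5, i.e. roughly the ~180GB cited above
```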

Section 6.6.4 needs more structure. One needs to scroll back to find the questions and then map them to the answers (e.g., the first one asks ..., "the other question" -> which other?).

Why did you not study or prove your answers? E.g., you claim that the system depends on the selectivity of the triple patterns; why did your experiment not test that claim?
What happens if you join a low-selectivity query with a high-selectivity query, in comparison to two queries with medium selectivity?
Did you study join selectivity, instead of only triple pattern selectivity?

"In real world cases much less queries are expected " p22, l1-2.

I would argue the contrary. Your evaluation query is simple (notify me if the article count or stock price changes for any company); what makes you believe that an application would register fewer queries?
I would rather expect that several queries could be registered per company (tell me if the stock price changes, tell me if the company has a new product, tell me if the NYT writes a new article about a company, ...).

Overall I think this section requires much more evaluation to answer all the questions, or it should focus on one or two core aspects of the system and inspect them in detail.
The claim is that the system is able to scale, so I would start by investigating that claim with a clear methodology.
That means a truly scalable use case should be selected. The current setup is three datasets and one simple query; this is way too simplistic in my view.

Other remarks:
How are DoS attacks on the SPARQL EPs avoided?

Review #3
Anonymous submitted on 25/Jan/2021
Review Comment:

Beyond Querying: A Scalable Architecture for Linked Data Monitoring

In this paper the authors present a system for monitoring updates to SPARQL query results. If an RDF dataset is updated, queries accessing that specific portion of the data will have their results updated as well. The authors use a formal framework (the design science methodology) in their research to guarantee that their advances in linked data monitoring are sound. Unfortunately I think the use of that framework overcomplicates the article, making it hard to read.

Introduction section
The authors introduce the problem, arguing for the need to monitor SPARQL query results due to linked data dynamics. I agree with the authors that this is an interesting and important problem. The authors also briefly describe their approach, MonARCh, and the paper's layout. I think this is a promising, well-written section that is easy to follow.

Roots of the Linked Data Monitoring Architecture section
In this section the authors describe push and pull methods for notifying changes, and how to represent changes to data sources using the DaDy vocabulary.

In this section we start to see some typos, like "Actor is", which should be "An actor is". Also, the use of the passive voice in the paper makes it hard to read. In general there are many typos in the paper due to English mistakes, and I suggest the authors double-check the final version of the paper.

Research Methodology section

In this section the authors present the formal framework they used for conducting their research. In general this is what one expects from a research paper, and it is always a nice section to have; however, I find that it has too much detail, overcomplicating the paper.

Design Problem and Knowledge Questions section

In this section the authors apply the research methodology from the section before. From this point onwards the paper gets very difficult to read. The use of the passive voice in the text makes the ideas very hard to follow. While I think this section is interesting, I find it too long; it should be merged with the previous one, making it more concise.

Design Cycle section

In this section the authors describe the system's implementation to answer the questions from the previous section. Apart from using the passive voice throughout the section (making it hard to read), the authors use the terminology of the framework they selected. However, that terminology also makes the paper hard to read, since the community is not used to it (e.g., the use of "treatment" instead of "solution"), making the section even more confusing.

Regarding the actual solution, the authors base it on the problem that "approaches to linked data monitoring are limited to monitoring single data sources or dependency on the push mechanisms of data sources". However the authors also rely on monitoring single data sources, by implementing the query execution themselves to track whether there has been any change to the data. If I understood correctly, the authors implement a system that runs a query "constantly", retrieving the results, comparing them to those of the previous execution using a hash join, and returning the updated result if there was a change.
I find the architecture of the system a bit overwhelming: when queries run constantly against an endpoint, that endpoint may block the IP address of the machine that runs these queries. Also, the operation for finding whether there are new results in the result set is more of a set difference than a join. Have the authors considered using a CONSTRUCT query and a MINUS operation over the graphs?
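The set-difference view suggested here could be sketched as follows (a hypothetical Python helper, not from the paper; solution mappings are modelled as dicts, and the analogy to CONSTRUCT + MINUS over graphs is only conceptual):

```python
def result_delta(old_results, new_results):
    """Compute additions and removals between two SPARQL result sets,
    treating each solution mapping (a dict) as an element of a set.
    This is a set difference rather than a join, mirroring what a
    CONSTRUCT query followed by MINUS over the two graphs would yield."""
    old_set = {frozenset(row.items()) for row in old_results}
    new_set = {frozenset(row.items()) for row in new_results}
    added = [dict(row) for row in new_set - old_set]
    removed = [dict(row) for row in old_set - new_set]
    return added, removed
```

A changed binding (e.g., an updated stock price) then shows up as one removal plus one addition, which is exactly the delta a notification would need to carry.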

Empirical Cycle section

In this section the authors evaluate their approach. Again, I would recommend that the authors use the common terminology of the community to make the paper easier to read. Also, the authors repeat most of the content from the research methodology and design problem sections. Furthermore, I think the readers of this paper already know the general evaluation methodology.

Regarding the evaluation itself, the authors focus on the amount of data they join and the resources needed to join it (CPU, time, etc.), but I miss a chart showing the query frequency against the endpoints, whether they query live endpoints, and whether these endpoints rejected the IP at some point in time. Also, I miss some definitions, such as query selectivity. I also think that this section can be reduced drastically, since many of the explanations are superfluous and not needed by the current paper's audience.

The related work section seems complete to me.

In general I think this article needs heavy rewriting, removing most of the passive voice and many of the explanations of the methodology, since the readers already know it. Regarding the contribution, I believe that the architecture can be simplified significantly.