Review Comment:
The paper presents an algorithm for mapping keyword queries to RDF graphs, which includes a few improvements over existing approaches, especially over [14] and [15].
I'm not entirely sure about the claimed contributions though. The paper starts saying that existing approaches do not "exploit semantic characteristics of the RDF graph", although [14], for example, also exploits the types of edges (in particular type and subclass edges). Also indexing is mentioned as a challenge, but there are approaches (e.g. Le et al., cf. https://www.cs.utah.edu/~lifeifei/papers/rdfkeyword.pdf) using Hexastore-like indices for efficiency. At another point it is mentioned that "pruning unwanted nodes [...] is the fundamental essence of the approach adopted", but pruning is a part of a lot of other approaches as well.
In addition, it should be stated in which respect the paper goes beyond earlier publications of the authors.
The approach in general seems valid and sound; however, the presentation does not yet meet the standards of the Semantic Web journal. A lot is unclear or hard to follow; I include some more detailed comments below. For example, on several occasions the presented approach is called a "hybrid" approach, but in which sense is it hybrid? And with respect to grammar and spelling, the paper is in pretty bad shape. Most striking is "Philip Camiano" occuring throughout the paper.
Regarding evaluation, the algorithm was tested on datasets with very small schemas, so I wonder how well it would scale to datasets like DBpedia, where the schema is considerably bigger (and noisier), giving rise to a lot of ambiguity. How well could the approach handle ambiguities and the resulting huge amounts of potential search queries?
In the following some more detailed comments:
Definition 1:
* Slightly longer names for nodes (e.g. Class Nodes, Entity Nodes, Literal Nodes) and edges would help. For example, the description of "Component Subgraph Creation" is really hard to read, as one has to keep in mind what EN, EA, IE, etc. stand for.
* And I wonder why you need different kinds of edges. Do you ever use this distinction for anything? It seems easier to me to just talk of edges between nodes, which would make the labeling function much easier as well as a lot of descriptions in the text.
Figure 1: The arrows for the property "year" should be the other way around, I guess?
Page 3:
* The following is not clear: "For the query{publication,1999}, [15] does not function as it deals with RDF graphs with only data nodes."
Page 4:
* It is not clear to me how the summarization technique loses "too much information". What is wrong with building a graph that captures the information need also when there are no answers?
Page 5:
* Can a relationship chain include inverse relations?
* In the definition of a class chain, shouldn't D-node and C-node be swapped? (CC_n is a C-node if RC_n is an IE-edge, and a D-node it it is an EA-edge.) Also the definition is missing how you actually get CC_n.
* SQ is a pair, not a triple.
* You define answer graphs as subgraphs s.t. every keyword is mapped to a node or edge. What is the reason for this? For example, if you have a query like "number of people living in Boston", it might be necessary to map "living in" to a property "population", while "people" does not map to anything.
4. Algorithm Description:
* The algorithms miss a caption and thus numbering, so it's not immediately obvious what Algorithm 1, 2 and so is.
* Please add a short prose explanation of the create-answer-graph procedure.
* A few parts are left undefined. For example, Algorithm 1 uses a compute-subgraph-sets-for-pruning and a compute-subgraph method, which is neither defined nor explained. Also, in Algorithm 3 you use a tilde+slash as relation between g and h -- from the text it's clear what is meant, but in the pseudo code it should either be defined or replaced by prose. Also, for the definition of ACNS you use notation that is not intuitive enough to not require any mention.
* In general, your abbreviations are not very mnemonic, e.g. ICNS, ACNS, FNCS, PNCS in Algorithm 3, which makes it harder to follow than necessary. Please consider something like CNodes_active, CNodes_pruned, etc. or similar.
* During Subgraph Enlargement you add half of the nodes to one graph and half of them to the other; why is that? Why not add all to one?
* In the hook-graphs method, you say CS_new = merge{g,h} and PRGS = merge{g,h} and PRGS = PRGS union CS_new, so PRGS = merge{g,h} union merge{g,h}? Also, I missed a definition of "lowest node cardinality" and "minimum relationship chain". (The latter probably just means the shortest one?)
Page 9:
* You say "where chnl = ... + 1", but the formula also contains chnl + 1. I guess one "+ 1" is too much.
* The score TR is only explained as "standard IR approach"; please add at least a reference, even better a brief definition or explanation.
* The node relevance is a fixed value?
Examples:
* You always talk about RDF fragments used for the examples -- I guess this is only for presentation purposes? Or would the component subgraphs look differently when using the whole dataset?
The organization of section 8 is not optimal, in my eyes. For example 8.3 I would put into the architecture description, while the second paragraph of 8.4 seems to belong to implementation, not evaluation. I would even separate a section on design and implementation from a section on experiments and evaluation.
Page 16:
* In the query, the first filter regex should be ".*Philipp.*" not ".*XMedia.*".
|