Review Comment:
The paper presents a proposal for a set of algebraic mapping operators that can be applied to generate RDF from other sources of data like CSV, JSON, etc.
The paper defines the different operators using a formal notation based on multisets and also presents prototype implementations: a converter written in Rust that translates two different technologies, RML and ShExML, into the mapping operators, as well as an interpreter of the mapping operators written in JavaScript. The evaluation takes into account two aspects: expressiveness (or completeness, according to the authors) and performance compared with other RML implementations, using the results of the knowledge graph construction challenge.
Given that the manuscript has been submitted as a full paper, I review it taking into account the 3 dimensions: originality, significance of the results and quality of writing.
1.- Originality. As far as I know, the contents presented in the paper are original, and I think the idea of providing a set of formal mapping operators is new (although, as the authors indicate, the formalization evolves from the formalization of SPARQL, which is OK for me).
2.- Significance of the results. I think the results are significant and can be the basis for a new generation of mapping proposals built on a set of formal abstractions that can be used to optimize the mapping process. In fact, I think a natural next step for this paper would be to research optimization techniques based on those mappings.
3.- Quality of writing. Although the paper is well written and I think the authors have made an effort to make it readable, the current version still needs to be improved before publication. I would suggest the authors present a good running example which could be used throughout the paper to present the main topics. Although the authors start with a running example about people and pets in section 4, the example is not followed in sections 5 and 6, where the authors present an example about sports without including the data. I think it wouldn’t be difficult for the authors to extend the people/pets example and use it to exemplify the rest of the sections. Especially section 5, which presents the converter from RML and ShExML, could be improved if the authors included the RML mappings and the ShExML file for that people/pets domain, which could be used to exemplify some of the concepts described in that section.
About the availability of the data and code, the authors provide a Zenodo link, which I think can serve as a long-term stable URI, and also provide a .zip file. The contents point to two GitHub repositories: the Rust implementation algemaploom-rs, which translates RML and ShExML to the mapping operators, and the RMLWeaver-JS implementation, which provides the engine that runs those mapping operators.
More specific remarks:
- Although the mapping operators are defined throughout the paper, I think it would make sense to provide an abstract syntax defining the operators and indicating which parameters they take (see the first sketch at the end of these remarks for what I mean). I think the authors already have that language, because they implemented an engine based on it, but it is not explicit in the paper.
- Page 2, the sentence that starts with “On the one hand, homogeneous mapping languages…” gives 3 examples: R2RML, TARQL and SML, and explains that R2RML goes from relational data to RDF and TARQL from CSV to RDF…but leaves what SML does unexplained.
- In section 2.1, the authors consider ShExML a constraint-based language based on ShEx. I am not so sure about that, because although ShExML was inspired by ShEx in the sense that it uses shapes and a concise syntax, it does not use constraints internally. I think it could also fit in the dedicated mapping language category.
- Given that most of the operators defined in the paper are taken from SPARQL…I wonder if it would make sense to define a SPARQL-inspired mapping language with a more concise, SPARQL-like syntax than RML. In fact, I would like to know if the authors have considered defining an intermediate syntax for the mapping operators.
- As I indicated before, I would suggest the authors try to find a single running example to explain most of the operators and concepts.
- In Figure 1 the inputs seem to be only CSV files or databases…I think the authors also want to support JSON and XML, right?
- Table 1 seems to contain an extra blank line between the f_personal row and the f_friends row…maybe it is intended.
- I am not sure that I understand the sentence: “The set of fragments, F, is infinite and pairwise disjoint with the other sets defined in Section 4.1.”.
- The choice of the letter xi for mapping tuples (both in lower and upper case) is a bit unfortunate because it is not a popular Greek letter and I think readers may find it difficult. I would suggest t (for mapping tuple), which would probably make the paper more readable.
- In my opinion, the source operator is not very well defined and needs some extra definitions. It is defined in terms of Iterators and Fields…but the authors indicate that the work on iterators is currently being continued with logical views…I would suggest the authors try to explain that part a bit better. I also found the definitions of iterators and fields not easy to follow because they are not self-contained: they use subfields without indicating what a subfield is (see the second sketch at the end of these remarks for my guess at what is meant). Maybe with the help of the running example, this part could be more readable.
- Definition 5 says: “The extracted data value can have a datatype in the Literal if the datatype can be inferred from the data source.”, how can that datatype be inferred from the data source? I think this definition requires a bit more explanation because it can be important. Is it possible to generate IRIs? And Blank nodes?
- What is the concept of default fragment? Is it defined?
- Definition 7 introduces expr: \Omega \to T as an expression statement…what does it mean that it is an expression statement?
- Example 3, what is irify(_)?
- Definition 12, I found it a bit strange that the operands of the \theta expression are typeset quite small. I think the authors are using some macro, because the same happens in other places of the paper.
- Definition 15, I didn’t understand well what \Psi is; maybe add an example?
- It is interesting that Serialize can be defined in terms of Project and Extend…which suggests that it is not a primitive operator and probably not strictly necessary. Is that the case?
- Example 9, “...and binds the variable<,>...”, I think that comma is not necessary.
- Page 15, the footnote pointing to Apache Kafka is repeated…maybe only one footnote would be enough.
- Example 10, it is not clear to me why the result will go to /target/output.nt if it is not specified in the target.
- Example 10. The sentence “If multiple solution mappings exist, the target operator will append the serialized data to the file.” seems like an implementation detail; could it be configured with a different behaviour?
- Page 16, “the semantics of ...”
- Page 16, the statement “this way we ensure that our approach is language agnostic” is, in my opinion, too strong; you could state that you were able to implement it in different languages, but you cannot ensure that the approach is language agnostic.
- I found it very interesting that the authors use Rust to convert between RML and ShExML…in fact, at this moment, I am a big fan of Rust…however, the authors didn’t explain why they chose it, nor why they chose JavaScript for the engine…especially as I would probably choose Rust for the engine, given that in section 6 the authors do a performance comparison between FlexRML, which is implemented in C++, and the authors’ implementation in JavaScript. I would ask the authors to expand a little bit on those decisions…and on what they think could happen if the engine were implemented in Rust instead of JavaScript.
- I found the use of “interpreter” for the Rust converter from RML and ShExML to the mapping operators a bit strange. I would think that an interpreter is something that takes, for example, RML or ShExML code and interprets it; i.e., in my view, what the authors call the interpreter I would call a converter/translator, while what they call the engine I would probably also call an interpreter. This is not a big issue for me, but I was a bit surprised by that usage.
- Sections 5.1 and 5.2 contain a description which is not easy to follow…I suggest the authors use a running example of a mapping expressed in RML and show the output of the conversion, and do the same with ShExML…in this way, the explanation could be clarified with those examples.
- Page 18, “that could be generated from RML document…”
- I would suggest that Listings 3 and 4 use the same example based on people/pets.
- Page 20, I think the sentence “For adapted chosen ShExML test cases,” lacks something…
- Table 12 is interesting, but some of the concepts in the table are not explained in the paper…I would ask the authors to describe the meaning of those concepts and maybe justify why they cannot be implemented with their approach. I also think that the table lacks some features, such as the push/pop feature of ShExML.
- Page 22. “The virtual machine has 64 bit architecture” (I think it should be “a” instead of “an”).
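
To make the remark about an abstract syntax more concrete, here is a purely illustrative sketch of what I have in mind, written as Rust types only for concreteness and tied to the people/pets running example. The operator names follow the paper, but every parameter shown (and the plan constructed in main) is my own assumption; the actual signatures used in algemaploom-rs / RMLWeaver-JS may well differ.

// Hypothetical abstract syntax for the mapping operators; all parameter names are guesses.
enum DataFormat { Csv, Json, Xml }

enum Operator {
    Source { path: String, format: DataFormat, iterator: String, fields: Vec<String> },
    Extend { input: Box<Operator>, variable: String, expression: String },
    Project { input: Box<Operator>, variables: Vec<String> },
    Join { left: Box<Operator>, right: Box<Operator>, condition: String },
    Serialize { input: Box<Operator>, template: String },
    Target { input: Box<Operator>, path: String, serialization: String },
}

fn main() {
    // People/pets running example: read a CSV, build an IRI per person,
    // serialise one triple per row and write it to an N-Triples file.
    let plan = Operator::Target {
        input: Box::new(Operator::Serialize {
            input: Box::new(Operator::Extend {
                input: Box::new(Operator::Source {
                    path: "people.csv".to_string(),
                    format: DataFormat::Csv,
                    iterator: "row".to_string(),
                    fields: vec!["name".to_string(), "pet".to_string()],
                }),
                variable: "person_iri".to_string(),
                expression: "irify(concat(\"http://example.com/person/\", ?name))".to_string(),
            }),
            template: "?person_iri <http://example.com/hasPet> \"?pet\" .".to_string(),
        }),
        path: "output.nt".to_string(),
        serialization: "N-Triples".to_string(),
    };
    let _ = plan; // nothing is executed; the value only illustrates the shape of a plan
}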
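
Similarly, regarding the remark about iterators, fields and subfields: below is my guess, again only as a Rust sketch, at what a self-contained definition could look like, with a field that can carry nested subfields (e.g. for nested JSON or XML). If something like this is what the authors mean, stating it explicitly, ideally with the people/pets example, would make the definitions around the source operator much easier to follow.

// Hypothetical, self-contained sketch of iterators and (sub)fields; all names are mine.
struct Field {
    name: String,
    reference: String,      // a column name, JSONPath or XPath into the source
    subfields: Vec<Field>,  // empty for flat sources such as CSV
}

struct SourceIterator {
    pattern: String,        // the iteration pattern over the source document
    fields: Vec<Field>,
}

fn main() {
    // People/pets over a hypothetical nested JSON source: each person has a
    // name and a list of pets, and each pet has a name and a species.
    let person_iterator = SourceIterator {
        pattern: "$.people[*]".to_string(),
        fields: vec![
            Field { name: "name".into(), reference: "$.name".into(), subfields: vec![] },
            Field {
                name: "pets".into(),
                reference: "$.pets[*]".into(),
                subfields: vec![
                    Field { name: "petName".into(), reference: "$.name".into(), subfields: vec![] },
                    Field { name: "species".into(), reference: "$.species".into(), subfields: vec![] },
                ],
            },
        ],
    };
    println!("{} top-level fields over {}", person_iterator.fields.len(), person_iterator.pattern);
}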