An Ontological Approach for Integrating Declarative Mapping Languages

Tracking #: 3115-4329

Authors: 
Ana Iglesias-Molina
Andrea Cimmino
Edna Ruckhaus
David Chaves-Fraga
Raúl García-Castro
Oscar Corcho

Responsible editor: 
Tania Tudorache

Submission type: 
Ontology Description
Abstract: 
Knowledge Graphs are currently created using an assortment of techniques and tools: ad hoc code in a programming language, database export scripts, OpenRefine transformations, mapping languages, etc. Focusing on the latter, the wide variety of use cases, data peculiarities, and potential uses has had a substantial impact on how mappings have been created, extended, and applied. As a result, a large number of languages and their associated tools have been created. In this paper, we present the Conceptual Mapping ontology, which is designed to represent the features and characteristics of existing declarative mapping languages to construct Knowledge Graphs. This ontology is built upon the requirements extracted from experts' experience, a thorough analysis of the features and capabilities of current mapping languages presented as a comparative framework, and the languages' limitations discussed by the community and denoted as Mapping Challenges. The ontology is evaluated to ensure that it meets these requirements and has no inconsistencies, pitfalls or modelling errors, and is publicly available online along with its documentation and related resources.
Tags: 
Reviewed

Decision/Status: 
Minor Revision

Solicited Reviews:
Review #1
By Ben De Meester submitted on 19/May/2022
Suggestion:
Minor Revision
Review Comment:

I'd like to thank the authors for carefully taking my comments into account; I think they greatly improved the paper. What follows are some minor comments and two larger questions: (i) how do you measure that the extracted requirements cover the existing set of languages, and (ii) what is the relevance of including the mapping challenges in the requirements analysis? I elaborate below. For these reasons I recommend a minor revision.

### Response letter response

- protocols.ttl has 404 links in the skos:definition descriptions
- "The surveyed languages include the ones considered relevant because of their widespread use, unique features, and current maintenance. Deprecated or obsolete languages are not included." > That's not really a set of criteria in my opinion, it's not easy to reproduce your results based on this sentence. This makes the selection still seem a bit ad-hoc, which I would have tried to avoid.
- "The taxonomy proposed in the article was built upon the authors' experience and the current mapping language’s features" > I couldn't find this in section 5.1 to be honest. It is mentioned in the abstract, but I couldn't find it anywhere else.
- For Appendix B, since you state that you extract these features from the reference paper and/or specification, I'd suggest adding those references explicitly in the tables, so there's no confusion about what you use for which language. The same goes for adding references for the features annotated with a '*'.
- Q: "How do you specify that an expression is either an XPath or a JSONpath, or 'among others'? Why is this not a SKOS ConceptScheme?", A: "[...] we decided to reuse DCAT 2, more specifically, the class dcat:Distribution. This class enables the declaration of media types (dcat:mediaType), which already is a complete list of formats." > a list of formats, yes, but not of reference formulations. E.g., JSONPath-plus is an extension of JSONPath; where do you clarify that connection? Given that, e.g., RML supports different reference formulations for the same data connection, I think this should be supported by the ontology. This kind of remark makes me question how you extracted the features from the reference paper and/or specification; has there been an (internal) review for this?
- Q: "By linking the datatype to the statement, don't you get in trouble when you want to create mixed-type rdf:Lists?" A: "No to the best of our knowledge, rdf:List is an ordered list of items, so when an object is an rdf:List and the datatype is also an rdf:List there should not be an issue." > I'll clarify: what if you want to create a list that contains data of different types, e.g., ["string", 12, true]? It could be that's not supported by any existing mapping language, so then that's out of scope, but please clarify.
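
To make the mixed-type list question concrete: RDF itself allows a collection whose members carry different datatypes. A minimal sketch in plain Turtle follows; the `ex:` namespace, subjects and property are hypothetical and not part of the ontology.

```turtle
@prefix ex:  <http://example.org/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

# One collection whose members are an xsd:string, an xsd:integer and an xsd:boolean.
ex:a ex:values ( "string" 12 true ) .

# The same members written with explicit datatypes.
ex:b ex:values ( "string"^^xsd:string "12"^^xsd:integer "true"^^xsd:boolean ) .
```

A single datatype attached to the whole statement cannot describe the members of such a list individually, which is the situation the question above refers to.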

### Review

#### 4. Conceptual Mapping Requirements Specification

- Thanks _a lot_ for the requirements publication, highly interesting. Considering the requirements, it's a bit weird that you extract from both (i) existing languages and (ii) mapping challenges, as the scope of the ontology is "to gather the expressiveness of **existing** declarative mapping languages and represent their shared characteristics" (as mentioned in the introduction). I think that scope is OK, otherwise it's impossible to scope your work. By including one additional source (mapping challenges) I have the feeling you open a gate of 'trying to be exhaustive'. Reviewing your requirements publication, I also don't see the need for adding these challenges: everything is covered by at least one existing mapping language, afaict. I would remove it from the paper. That, or argue why you add that one single source, which is, in my opinion, a bit arbitrary: it's a call for papers for a 2021 workshop. A reference to the GitHub open challenges might be better suited, but then you'll probably find new requirements and need to update the ontology, etc.
- Considering the discussion of leaving procedural languages out of scope: do you have any ideas on how to tackle that gap? Is it just not possible, or are there mitigation strategies? I'd love to read a short reflection on this in the future work section.
- Considering the requirements: how do you measure them? There's a mapping effort on https://w3id.org/conceptual-mapping; however, there doesn't seem to be a complete mapping between the information conveyed in the different mapping languages and the proposed ontology: how do you know you cover everything as specified in the existing languages? See my remark on JSONPath before; e.g., by laying your ontology example next to the one of SPARQL-Generate, you could see there are some semantics you're not currently covering, namely the change from referencing via a CSV field to a JSONPath. Please explain how you measured that the requirements you extracted are sufficient.

#### 5. Conceptual Mapping Implementation

- I do not agree with the title of the "Ontology Design Patterns" section: I have the feeling this is ontology reuse, not design patterns, i.e., you don't use specific patterns as also used in SKOS, but you actually re-use SKOS. Please clarify
- I'm a bit confused with Themis: I see the test cases, but I don't see test results. At the moment it's not clear you actually (i) created test cases, or also (ii) executed and verified those test cases. Please clarify.
- For 5.4, could you clarify how you made the distinction between core and extension, i.e., why not create two ontologies as it is divided into two parts, Statements and Data Sources?

#### 7. Conclusion and Future Work

- "This analysis allows us to develop a unique model that integrates the common features among declarative and procedural languages" > Without comparative tables, these kinds of claims are hard to verify, so I would be cautious with these kinds of statements.
- The structure of your first paragraph is a bit off: you mention something like "we first ...", but there's no 'second' or 'then' that follows. Or do you mean by the second part the mapping challenges? Is that the 'two-fold analysis' you mention? Please clarify.

#### Appendix A

- Could you clarify whether all features of the core ontology are handled or not in the example, and if not, why?

Review #2
By Herminio Garcia-Gonzalez submitted on 23/May/2022
Suggestion:
Minor Revision
Review Comment:

This is a review of a previously reviewed paper presenting an ontology that aims to represent the shared constructions that practitioners can encounter while using available declarative mapping languages to build KGs from heterogeneous data sources. In this version, the paper has improved its scope, presentation and clarity, which makes it easier to read and follow. Nevertheless, before its publication, there are still some minor remarks that should be addressed.

The first thing that surprised me in comparison with the previous version is the argument used to not cover all the expressiveness and flexibility that SPARQL-based solutions offer. It is said that they are procedural (in this context?), so, as the ontological approach is based on RDF (and declarative), it has some inherited expressiveness problems. Therefore, it does not make much sense to represent these more procedural instructions in the ontology.

I agree that the border between imperative and declarative is a bit fuzzy for some functionalities. Nevertheless, as far as I can tell, SPARQL is a declarative language like other query languages (e.g., SQL). SPARQL-based solutions offer a bit more flexibility, such as the cited for iteration, whose for keyword is reminiscent of the imperative style. But I could argue that the same can be expressed with a functional programming construct like a range, which is more declarative but has exactly the same functionality: I state that I want to generate these values, not how to generate them. Interestingly, functions used in many examples (which would require an implementation, often an imperative one) are not out of scope. I think you should use a different kind of argument, such as waiting until there is more consensus in the community, as there is for functions.

The other interesting argument is that the RDF syntax imposes a limitation on the expressiveness of the language. This is indeed a very interesting discussion that maybe should be raised in the community, as the vast majority of solutions go this way. But for the given examples I do not see a problem or a limitation in the syntax, leaving aside the verbosity that such a solution would entail.

Another point that caught my attention is that some ShExML listings would not work due to an erroneous use of the syntax. I didn’t check all the listings in the paper, but it makes me think that this could be the case for more languages. It seems to me that they were not verified and executed to ensure that they were correct. This endangers the pedagogical aspect of the paper (which, by the way, I really like) but also the reproducibility. I would advise reviewing and executing all the mappings offered in the listings (you can find the corrected one for ShExML below) and also offering the complete mapping rules as supplemental material so other people can check and learn from your findings.

This point leads me to another aspect that in my opinion is quite unattended: implementation details. The idea often seems too abstract, and it is not possible to tell whether it would work in the future. As this is an ontology paper, this is outside the scope of the review process, but I think that it would be very beneficial to give some hints about the future implementation, for example, for accessing fields outside the iteration scope.

As a side note, ShExML has recently added support for functions and conditional statement generation. These changes will soon be reflected in the specification. Should you want to have the most up-to-date information in the paper, you could check them in the ShExML GitHub repo (https://github.com/herminiogg/ShExML) or drop me a line.

References should be thoroughly reviewed. For example, the venue is missing in [4, 8, 9, 19, 21, 22, 25, 29, 30, 31, 32, 33, 44, 46, 47, 48, 50, 51, 53, 54, 57, 60]; [58] is a duplicate of [32]; [12] is from ESWC; [3, 27] are not PhD theses but Research Reports; [37] has been peer-reviewed and formally published; and [42] needs to be completed.

Additionally, I leave below more comments and suggestions per section:

#Title

Integrating → (I don’t see that the ontology is really integrating in this stage, maybe use something like describing or representing)

#Related work

noSQL → NoSQL

is a commercial language used commercially → is a commercial language

providing more complex comparison frameworks → (from the description afterwards I would say they are qualitative comparisons but not necessarily more or less complex)

#Methodology

[23, 49] → (The followed methodologies remind of a normal software development methodology, it would be nice to add some clarification why these are more interesting and in which things they excel at)

there are no use cases as this ontology is a mechanism of representation of mapping language’s features → (Is not the representation of mapping languages features a use case?)

#Conceptual Mapping Requirement Specification

W3C Knowledge Graph Construction Community → (add a link?)

neither SPARQL-based languages nor languages based on other schemes consider this feature → (this sentence is a bit ambiguous, if you mean by other formats the rest of the languages neither following RDF nor SPARQL then it is not correct)

such as RML and ShExML [57, 58] → such as RML [57] and ShExML [58]

especially in tree-like data sources → (only JSON is affected due to JSONPath inability, XML using XPath is capable though)

#Conceptual Mapping Implementation

is a subclass of dcat:Distribution → (beware of the long empty gap)

and defines which is the data in the source that is retrieved → and defines which data is retrieved from the source

and can define shapes (shapes) for its restrictions → (what do you mean? this is the first time that we hear about shapes. What are they intended for in the ontology?)

has been created for each taxonomy has been created → has been created for each taxonomy

Listing 8: Data sources → (data sources of what? use longer descriptions in the captions so the reader can understand better what is the figure/listing about)

$.zipcodes.* → (How is this expression supposed to work? As commented earlier some implementation details may be needed to fully understand the solution)

#Conclusion and Future Work

the Conceptual Mapping, an ontology-based mapping language called Conceptual Mapping → the Conceptual Mapping, an ontology-based mapping language

such as SPARQL-Generate of Facade-X → such as SPARQL-Generate or Facade-X

#Appendix A

of the ontology that don’t appear → of the ontology that do not appear

This example show how to describe → This example shows how to describe

dcat:mediaType “text/json” → (but endpointURL is .csv)

ShExML listing corrected. It is also possible to define a QUERY (http://shexml.herminiogarcia.com/spec/#query) and then use the variable like ITERATOR it_cities :

SOURCE cities_rdb jdbc:mysql://localhost:3306/citydb
SOURCE coord_json
ITERATOR it_cities {
FIELD c_city
FIELD population
FIELD year
FIELD zipcode
}
ITERATOR it_coord {
FIELD lat
FIELD long
FIELD loc_city
}

Review #3
By Jose Emilio Labra-Gayo submitted on 21/Jun/2022
Suggestion:
Minor Revision
Review Comment:

This review is based on a revised version of the paper and on my previous review of it. I acknowledge the great job that the authors have done to improve the quality of the paper, taking into account the different suggestions made by the reviewers.

The paper describes an ontology about mapping languages that captures the commonalities between different mapping languages and technologies like R2RML, XSPARQL, CSVW, ShExML, etc.

Given that the paper has been submitted as an ontology description, I will use the criteria defined by the journal to assess these papers.

(1) Quality and relevance of the described ontology.

The ontology has been defined to support the problem of categorizing the different mapping technologies that have appeared in recent years. The exercise of defining this ontology can shed light on this topic and is relevant, as it can improve existing approaches.

The ontology has been developed following a well-established methodology and the ontology itself follows best practices.

In my previous review I already stated that one possible drawback is that the ontology seems to have been defined by members of the same research group, without a proper consensus work involving external stakeholders which could help improve the acceptance of the ontology. The authors indicate that they are planning to involve the W3C KG Construction Community Group which I think would be a really good idea.

I also indicated that it would be great to add some provenance information linking the concepts included in the ontology to the discussions about those concepts in the Community Group or in other venues, or at least some mechanism to track the relationship between them. The authors have improved that information on the web page.

(2) Illustration, clarity and readability of the describing paper.

Overall, the paper is very readable and the ontology has been defined following a sound methodological approach. The ontology has been published using tools like Widoco and Ontoology and has a dereferenceable URI at: http://vocab.linkeddata.es/def/conceptual-mapping#

The ontology is available in a GitHub repository that contains a README, and it has been improved since the last version.

The authors have added a new section about which ontology patterns they have used in the design of the ontology.

In the new version the authors indicate that they have not been using competency questions and that they have been using facts instead. This could be something to improve in the future.

In my previous review, I encouraged the authors to keep maintaining the ontology and creating more issues documenting that process and their response was that they will take it into consideration for future ontology releases.

(4) whether the provided data artifacts are complete.

In my opinion this ontology is a first version of an ontology proposal which has not yet been thoroughly reviewed or discussed by the community and stakeholders. I encourage the authors to continue working on this problem and take into account the “living” nature of the ontology, publishing different versions of it, and creating issues and mechanisms for its improvement, as well as more tools based on this ontology.

One of the other reviewers commented: “How do you specify that an expression is either an XPath or a JSONpath, or 'among others'?” The answer is not clear to me either, and I think it would be good to clarify that aspect in the paper. I mean, when you have, for example:

:FrameRoute a cm:SourceFrame ;
cm:expression "$.triples[*]" ; ….

How could a processor know that the contents of cm:expression are XPath, JSONPath, or a SPARQL query? One possible approach that seems more natural could be something like:

:FrameRoute a cm:SourceFrame ;
cm:expression [ cm:value "$.triples[*]" ; cm:language cm:XPath ]; ….
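
For comparison, RML states the language explicitly on the logical source through `rml:referenceFormulation`. The following is a sketch only: the subject IRI and the file name are hypothetical, and the prefixes are those of the legacy RML vocabulary.

```turtle
@prefix rml: <http://semweb.mmlab.be/ns/rml#> .
@prefix ql:  <http://semweb.mmlab.be/ns/ql#> .
@prefix ex:  <http://example.org/> .

ex:RouteSource a rml:LogicalSource ;
    rml:source "routes.json" ;                # hypothetical file name
    rml:referenceFormulation ql:JSONPath ;    # tells the processor how to read the iterator
    rml:iterator "$.triples[*]" .
```

With the formulation declared next to the expression, a processor does not need to guess whether the string is XPath, JSONPath or a query.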

In the case of cm:Constant, the property used to declare the value is “cm:returns”, which is not very intuitive. I would suggest using cm:value, which I think would improve readability, going from:

cm:hasInput ([cm:returns "http://ex.com/loc/"] :lat [cm:returns "-" ] :long)].

to

cm:hasInput ([cm:value "http://ex.com/loc/"] :lat [cm:value "-" ] :long)].

which I think would be more readable. I was tempted to create an issue about this in the repo, but as there were no issues it was not clear to me whether that was the right place to make this kind of suggestion.

Another suggestion I would like to make is to complement the ontology with a ShEx schema, which could be used to validate RDF data and check that it follows the specified structure. I prepared a first draft of such a schema here: https://gist.github.com/labra/bc9853f4bab277b6bf8d405ea5476288

It includes also the Turtle example from the Appendix after repairing some minor typos.
The following link shows the results of the validation: https://rdfshape.weso.es/link/16558286156

Some minor typos/comments:

Page 16, line 47 (line 8 of the listing) uses dcat:mediaType “text/json”, but JSON’s MIME type is “application/json”. I think this should be changed there and in Listing 11 in Appendix A.
Listing 9, line 5: the trailing colon “:” of lat is separated by the newline. I think it would be better to add a newline before “:lat” to keep it together.
Listing 9, line 33 should be: cm:hasInput (c_city " " "") instead of cm:hasInput (c_city, " ", ""), i.e., in an RDF list the separating commas should not be there.
Appendix A, line 4, “This example show how…”
Listing 11 in the Appendix contains several typos: line 14 lacks a semicolon “;” at the end.
In that line, the value of dcat:endpointURL is of type xsd:string; given the name of the property, it would seem more natural if it was really a URL, i.e. something like:
Line 27 lacks a semicolon at the end
Line 49 lacks a parenthesis at the end
Line 89 should remove the dot after :stop_id and lacks a dot at the end (after the parenthesis)
Line 101 should remove the last square bracket.
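
The remarks above on the media type, the URL value and RDF list syntax can be sketched together in Turtle. This is an illustration under stated assumptions rather than a corrected listing from the paper: all `ex:` names and the example URL are hypothetical.

```turtle
@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix ex:   <http://example.org/> .

# "application/json" instead of "text/json", and an IRI instead of a string as the URL value.
# Each predicate-object pair ends with ";" and the final one with ".".
ex:CoordinatesSource
    dcat:mediaType "application/json" ;
    dcat:endpointURL <http://example.org/coordinates.json> .

# Members of an RDF collection are separated by whitespace only, never by commas.
ex:Concat ex:hasInput ( ex:c_city " " "" ) .
```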