Lessons Learned in the Pursuit of Production-Readiness: Iterative Evaluation of the Kadaster Knowledge Graph Construction

Tracking #: 3749-4963

Authors: 
Alexandra Rowland
Erwin Folmer
Wouter Beek
Hans Schevers
Anjo Kolk

Responsible editor: 
Guest Editors KG Construction 2024

Submission type: 
Application Report
Abstract: 
The application of linked data has played a role in enhancing the accessibility and reusability of data at the Kadaster, the Dutch National Land Administration and Mapping Agency. Over the past decade, three distinct iterations of linked data creation, publication, and integration have emerged. The most recent iteration involves the KKG, now published using a nationally standardised model and an enterprise-ready architecture, transitioning it into a production-ready data product. This paper evaluates each iteration, identifies key requirements for production-readiness in a governmental context, and presents lessons learned to guide other organisations in adopting linked data technologies. A design science methodology is used to perform the evaluations and the findings reveal the importance of strategic alignment, specialised expertise, and ongoing evaluation in achieving and sustainability of production-readiness in linked data architectures. This research provides valuable insights into the practical adoption of linked data over a decade-long period, offering a unique longitudinal perspective. The lessons from Kadaster’s experience can serve as a roadmap for other institutions aiming to integrate linked data into their operations. Future research could expand on these findings to explore the scalability of the approach in different organisational contexts and its impact on data accessibility and utility.
Tags: 
Reviewed

Decision/Status: 
Reject

Solicited Reviews:
Review #1
By Herminio Garcia-Gonzalez submitted on 30/Oct/2024
Suggestion:
Reject
Review Comment:

This paper introduces a series of lessons learnt from adopting linked data technologies in the context of the Dutch Kadaster. The paper offers a longitudinal study which can be very interesting for other practitioners, especially in industry, who can benefit from the raised conclusions. Nevertheless, as a whole, I find that the paper lacks a more profound basis and further explanations.

Right now, Section 3 reads like an implementation logbook that highlights some of the problems encountered (both technical and organisational), but it is not clear how this differs from previous publications on the matter [1-4].

This iteration-based methodology is described as a “design science approach”, but this concept is neither introduced nor referenced in the paper. In general, this is one of the main problems of the paper: many of the concepts used are never introduced, and the bibliography lacks many essential references (for example, RML and Morph-KGC are mentioned but their companion scientific publications are not cited).

In my opinion, this paper would greatly benefit from defining the goals of implementing LD technologies in this specific context and establishing some quantifiable indicators beforehand. Based on these, each iteration can be analysed and conclusions drawn in a more structured and predictable manner. Finally, these results can be discussed in Section 4 together with the existing lessons learnt (which I find interesting but quite generic and applicable to any kind of scenario).

Following the guidelines for “Application Report” paper types, one important aspect is the uptake of the proposed solution. Even though the authors comment on the governmental set-up in which this application is being developed, there is no mention of the impact that it has on other government services and external users. I would suggest that the authors elaborate on this a bit more and relate it to my previous comments about clearly defining the goals of this implementation.

Even though the paper highlights some interesting points, due to the enumerated shortcomings, in its current form, I cannot suggest this paper for publication. Nevertheless, I would encourage the authors to rework this paper and resubmit it in the future.

Review #2
Anonymous submitted on 04/Nov/2024
Suggestion:
Reject
Review Comment:

The article presents a three-step (named iterations in the article) evolution of the Kadaster Knowledge Graph (KG). The article presents each iteration in terms of a model developed, a technological architecture illustrated by a figure and an evaluation. In the evaluation some shortcomings are presented.

Regarding the quality, importance, and impact of the described application (convincing evidence must be provided): the quality of the artefact, due to its size, is hard to evaluate; nevertheless, after experimenting a bit with the Kadaster Knowledge Graph, it seems to be a high-quality application in technological terms. The importance and the impact of the Kadaster Knowledge Graph, as with any other Kadaster, are clearly relevant. However, the article is about the evolution of this Knowledge Graph and the lessons learnt, and this must be taken into account. The quality, importance and impact of the lessons learnt, as well as of the iterations defined, are not very high:

• The Kadaster KG is available online; however, the article lacks links to the ontologies and other related artefacts (like SHACL shapes) mentioned in the paper that are relevant to understand and follow the evolution through the different iterations.
• Each iteration presents models, but it is unclear how these models and the Knowledge Graph were developed.
• The evaluation sections present some shortcomings. However, it is not clear how these shortcomings are addressed in the next iterations. Also, the article mentions an evaluation for each iteration but does not explain how this evaluation was carried out or what was evaluated.
• In general, the reviewer is not able to follow and understand the evolution of the different steps, since their explanation is shallow and generic.

Regarding the clarity of the article, the reviewer considers this one of its main drawbacks. The article is not written in clear language, particularly the first sections. Several sentences are repeated without providing more information about the contributions of the article. For instance, the evaluation and the iterations are mentioned in the abstract, in the introduction, and in the context. However, no information about the iterations or the evaluation is provided, only that there are three iterations and a design science methodology evaluation. The different sections offer no further insight or information. This gives the reviewer the impression of reading the same text several times without getting any further. A reader needs to reach Section 3 to understand what the authors mean by the iterations and the evaluation.

Following on from the previous comment, the contribution and the challenges are not clear, and they seem to change depending on the paragraph. They revolve around the iterations and the evaluation, but sometimes the contribution seems to be the definition of those iterations, at other times their implementation, and at others the evaluation, as the iterations were already defined. A reader needs to reach Section 3 to understand what these iterations stand for, since there is no prior explanation of them.

In addition, the article has many inaccuracies and sentences that need to be improved, including the following:

- In the abstract, the acronym KKG appears without a previous definition; what it stands for should be stated before it is used.

- "Over the past decade, three distinct iterations of linked data creation, publication, and integration have emerged" iterations do not emerge; they are identified or performed.

- "The most recent iteration involves the KKG" All iterations involve linked data. Which one is the most recent: the first iteration, the second, the third? Is the most recent a fourth iteration? This is not clear.

- "A design science methodology is used to perform the evaluations and the findings reveal the importance of strategic alignment," a punctuation mark is needed after the word "evaluations".

- "The impact of the paper presented in this work is twofold" sentence is a bit redundant

- "The transformation of a Web Feature Service (WFS) model to a linked data model involves the alignment of the underlying data schema with semantic web standards. Initially, the WFS model, which is typically structured in a format optimised for geospatial data exchange, is mapped. This mapping process involves translating the elements of the WFS schema—such as feature types, attributes, and relationships—into corresponding classes, properties, and relationships in ontologies based on standards like RDFS, OWL, SHACL and SKOS" --> The WFS schema, which is a model, should be developed as an ontology or mapped to an existing one so that data expressed with WFS can be expressed in RDF according to that ontology. SHACL is for validation; how is WFS mapped into SHACL? This paragraph needs to be explained in more detail, as it currently contains misleading and (seemingly) incorrect information.

- "The ETL begins with the extraction of relational data from the WFS service followed by loading it into a spatially enabled database. This step ensures that geographic features available via the WFS services are standardised into a structure which supports efficient querying and manipulation in the following steps of the ETL" How can this be ensured? Do the database and the fact that the data is stored ensure that geographic features are standardised into a structure compatible with the ETL? Which standard is that?

- "The relational data is then mapped to the model defined and the resultant triples are loaded into an instance of GraphDB during which the SHACL validation step is taken to ensure that the resultant linked data adheres to the linked data model" Which data model, i.e., which ontology? How are those SHACL shapes developed?

- "This architecture is illustrated in Figure 1." Figure 1 shows an RDF4J database, but the role of this component is not explained.

- "While this construction iteration resulted in the availability of a high volume of linked data" What does this mean? How much is a high volume? In Figure 1, only one relational database appears.

- "The Information Model details specific dataset information using the Shapes Constraint Language (SHACL) to ensure internal consistency and maintain recognizability for domain experts. Conversely, the Knowledge Model captures generic, shareable knowledge, facilitating integration with external linked data models through RDF(S), OWL, and SKOS vocabularies for improved reusability and interoperability. A model for each key register and required external source was manually defined." Something similar was done in the first iteration; what is the difference now? Also, SHACL defines restrictions, so it is a bit odd to call it an Information Model. How were the SHACL shapes developed? Are they available?

- In Figure 2, where are the original relational database and the WFS services? The reviewer expected an evolution of the previous iteration, but Figure 2 seems to show a new scenario.

In general, the article needs to be improved to explain accurately the evolution of the Kadaster Knowledge Graph. As currently written, the article seems to contain many inaccuracies, and the impact of the contribution is not clear. It would also be a good addition to explain how the lessons learnt can be adopted by third parties.

Review #3
By Julian Rojas submitted on 04/Nov/2024
Suggestion:
Minor Revision
Review Comment:

This Application Report paper describes the process and the different stages carried out for the creation and release of the Kadaster Knowledge Graph (KKG) by the Dutch National Land Registry and Mapping Agency (Kadaster). The description covers the stages from prototyping to production-readiness. The KKG constitutes an interoperable and integrated data resource that brings together multiple geospatial data sources mainly managed by Kadaster and known as geospatial key registries.

The KKG can be seen as a prominent example of how Semantic Web technologies can be used at scale in real organizational contexts to improve data interoperability. The insights provided in the paper are valuable for other organizations wanting to follow similar paths, and they highlight common challenges that may arise. The use of Semantic Web technologies by Kadaster for managing official and public data assets for the whole of the Netherlands constitutes an important impact that can motivate other organizations to follow their example. The technical and architectural details laid out in the paper are indeed a rare and valuable asset, since most Linked Data-related solution descriptions are usually of the prototype and demonstrator kind. The paper is well written and easy to follow.

In general terms, I have the following two main remarks:

- It is not clear enough to me how data updates are handled overall (e.g., nightly build, continuous stream, etc.) and across the different iterations. As mentioned in the paper, the proper management and implementation of data updates is a crucial aspect, especially for production-ready solutions. I would like to see some more details on this particular aspect in each of the iteration descriptions.

- One of the most valuable contributions of this paper is the provision of examples of how Semantic Web standards and well-known specifications can be applied for real-world data publishing and service provision. However, I didn't see any links to actual examples of the configurations, mapping rules, vocabularies, etc., used for the creation of the KKG. The official website does contain information about how to query the resulting KKG, but examples and documentation on how to accomplish the different steps to arrive at a fully-fledged knowledge graph are what other organizations could best use to follow this path.

Next I provide some more specific remarks on the description of the different iterations.

## Iteration 1:

- What spatially enabled DB was used? How did the mapping process take place? It would be good to know a bit more about the technical details that allowed you to quickly set up a working prototype, as this is often the first step needed to convince organizations to invest in Semantic Web technologies.

- Are the WFS schemas and semantic data models used publicly available? If so, please link them in the paper. It would be interesting to see the correspondence between them as an example for other KG creations that also start from existing WFS interfaces.

- How were updates handled (statically mapped or continuous process)?

## Iteration 2:

- Can you provide an example of a CONSTRUCT query used to map data from the key registers? Not necessarily within the paper, a link towards an external resource could also be valuable.

- Some additional info on the scale of the data and pipeline performance (e.g., mapping process, SHACL validation, etc.) could provide a better idea of the deployment costs of this type of solution.

## Iteration 3:

- How do you decide where to store the materialized GeoSPARQL relations? In one of the SPARQL endpoints? In both? How do you manage synchronization in that case?

- Was there any improvement achieved in terms of performance and scalability for the ETL processes at this stage compared with the previous one?

- Effective versioning is mentioned as crucial for managing updates. How is this handled in the case of the KKG? Are the versioning capabilities already considered in the IMX-GEO model, or were they defined externally and, if so, what vocabularies did you use?

## Section 4, 5 and 6:

- Section 4 is called "Discussion and Future Work" and Section 5 is again called "Future Work". Perhaps Section 4 can be renamed to "Lessons learned and Discussion" and the 4.1 header can be removed as there are no other subsections there.

- The use of more complex OWL-based classes is mentioned as future work. Do you already have some use cases in mind where (complex) reasoning could be useful for the KKG?

## Typos:

- Abstract: "...in achieving and sustainability..."
- Page 5, lines 47-48: "...the ETL process managed [is] managed..."
- Page 8, lines 30 and 33: duplicated statement about the scalability of the process that generates triples
- Page 10, line 29: missing parenthesis on acronym "...(LLMs[)]"

Review #4
By Paola Espinoza submitted on 02/Jan/2025
Suggestion:
Minor Revision
Review Comment:

This work highlights the main decisions made and lessons learned during the construction of the Kadaster Knowledge Graph (KKG). This KG enables Kadaster to publish its geospatial data following the linked data principles. The manuscript is a valuable contribution for the Semantic Web community, particularly for newcomers as it summarizes the main steps taken in the KG construction process. It illustrates the stack of technologies used and evaluates which of them worked better based on the organization’s requirements and needs at different points in time. The evolution of the construction process over the years from a “yet another format” stage to a “production environment” effectively captures the KKG’s journey.

This manuscript was submitted as 'Application Report'.
** (1) Quality, importance, and impact of the described application (convincing evidence must be provided) **
The quality of the article is good, and it is relevant for the community. It systematically outlines each iteration of the KG construction and provides detailed insights into the requirements and decisions made throughout the construction, while also summarizing the key lessons learned. The manuscript cites some scientific publications that reflect the work done by Kadaster in recent years, offering evidence of KKG's journey.
** (2) Clarity and readability of the describing paper, which shall convey to the reader the key ideas regarding the application of Semantic Web technologies in the application. **
The manuscript is well-written and easy to follow. Unfortunately, as mentioned by the authors, the manuscript does not yet include an evaluation of the third construction iteration, as it is relatively recent. This evaluation could provide valuable lessons that would enrich the content of the paper and enhance the understanding and relevance of the latest construction iteration for readers.
** (3) Assess the data file provided by the authors under “Long-term stable URL for resources”. In particular, assess: (A) whether the data file is well organized and in particular contains a README file which makes it easy for you to assess the data, (B) whether the provided resources appear to be complete for replication of experiments, and if not, why, (C) whether the chosen repository, if it is not GitHub, Figshare or Zenodo, is appropriate for long-term repository discoverability, and (D) whether the provided data artifacts are complete. **
This assessment is not applicable because the primary focus of this work is to provide a general overview of the Knowledge Graph Construction (KGC) journey, and the key lessons learned. However, it is important to note that the KKG is accessible through a SPARQL endpoint, along with a suite of services offered by the tool that hosts the knowledge graph.

Minor comments:
- Abstract:
  - Meaning of the “KKG” acronym needs to be provided.
  - To improve clarity, the definition of data product should be provided in the manuscript.
- Section 2, Application Context:
  - “Kadaster Knowledge Graph, in short the Graph” -> the term "Graph" is used infrequently in the manuscript. To enhance clarity and consistency, I recommend removing this reference and using "KKG" as the abbreviation for Kadaster Knowledge Graph, as this is already done in most parts of the manuscript.
- Section 3.1. Iteration 1: Linked Data as "Yet Another Format":
  - The meaning of “WFS” (Web Feature Service) should be introduced in this section rather than in Section 3.1.1. Implementation. This change will provide readers with relevant context at an earlier point in the document.
- Section 3.3.1. Implementation:
  - Morph-KGC could be appropriately cited, as this KG construction engine is supported by a scientific paper: https://content.iospress.com/articles/semantic-web/sw223135
- Section 3.3.2. Evaluation:
  - As the KG is operating in a production environment, it would be useful to provide a detailed evaluation of this construction iteration. For instance, what has been the impact of changes introduced by the new model (IMX-Geo) on the use cases supported by the previous model, the benefits and challenges encountered when adopting a mapping rules strategy (using RML or R2RML) for building the KKG, how the governance of the KKG has been improved since the alignment with Kadaster’s enterprise architecture, etc. Sharing these experiences could greatly enhance the understanding and relevance of the latest construction iteration.
- Section 4.1. Lessons Learned Across Construction Iterations:
  - Governance of Knowledge Graphs: This lesson concerns a highly relevant and extensive topic, particularly for large organizations. However, the manuscript lacks further details on how governance is managed within Kadaster. To enhance the reader's understanding, it would be beneficial to include references related to this topic, allowing them to explore further information on, for example, “roles and responsibilities”. A valuable reference could be Section 5.2.4.1, titled “Roles in the KGE Lifecycle,” from the recent Dagstuhl Seminar Report https://drops.dagstuhl.de/storage/04dagstuhl-reports/volume14/issue02/24... which outlines the various individuals involved in the KG construction.
- Section 5. Future work:
  - Evaluation of LLMs: Similarly to my previous comment, it would be useful to include references related to this topic. (e.g. https://drops.dagstuhl.de/storage/08tgdk/tgdk-vol001/tgdk-vol001-issue00... )

Typos:
- Section 3.2: “publication of concept lists in linked data” -> “publication of concept lists as linked data”
- Section 3.2.1:
  - In caption of Figure 2, “…the ETL process with Delivers Linked Data” -> “…the ETL process which delivers Linked Data”
  - In caption of Figure 3, the format of its reference should align with the style used for other citations throughout the manuscript
- Section 4. “Discussion and Future Work” should be renamed as it does not provide information about future work.
- Section 4.1: “linked data adoption suffers …” -> “Linked data adoption suffers …”
- Section 5: “(LLMs for knowledge graph” -> “(LLMs for knowledge graph)”