Investigating Knowledge Elicitation Automation with Large Language Models

Tracking #: 3868-5082

Authors: 
Sherida van den Bent
Romana Pernisch
Stefan Schlobach

Responsible editor: 
Guest Editors 2025 LLM GenAI KGs

Submission type: 
Full Paper
Abstract: 
Knowledge Elicitation, the process of extracting and structuring expert knowledge, is crucial for fields ranging from Artificial Intelligence (AI) to decision support systems. Traditionally, this process has relied on human experts, making it time-consuming and resource-intensive. With the rapid advancement of Large Language Models (LLMs), there is growing interest in their potential role in Knowledge Elicitation and ontology generation. This research investigates the feasibility of using LLMs, specifically ChatGPT v4, for automated Knowledge Elicitation and compares AI-led approaches to traditional human expert interviews. To evaluate this, a series of interviews were conducted with both human experts and an LLM, and the extracted knowledge was transformed into RDF ontologies using different pipelines, ranging from AI-generated to human-created ontologies. The research employs OQuaRE metrics and structural analysis to compare the generated ontologies against a base-truth ontology. The results indicate that AI-led interviews are more time-efficient and structured than human expert interviews. However, a human approach works better for ontology generation: AI-generated ontologies were more standardized but missed a lot of data, whereas human-created ontologies captured more information. These findings suggest that a hybrid approach, using LLMs as stand-ins for experts in the interview phase while relying on human knowledge engineers for ontology generation, offers the best balance between speed and quality in Knowledge Elicitation.
Tags: 
Reviewed

Decision/Status: 
Reject

Solicited Reviews:
Review #1
By Valentina Anita Carriero submitted on 08/Jun/2025
Suggestion:
Major Revision
Review Comment:

The paper presents an analysis and some experiments about how to automate the process of knowledge elicitation, and the consequent process of ontology creation based on the elicited knowledge, with the use of LLMs.
The paper is very well written and clear.
The results (but I believe not all the code needed to reproduce the experiments) are available on GitHub.
I strongly believe that the idea of investigating how to support the knowledge elicitation process is very good and has great potential for supporting ontology engineering; however, I have some major concerns about the experiments, which would require re-thinking at least some of them.

[Related Work / Knowledge Elicitation] when you talk about references 11 and 12, it is not clear to me if those are only applicable to ML projects, or beyond them. Also, the involvement of domain experts seems more related to the creation of a test set for training/evaluation of ML systems?

[Related Work / Large Language Models] about ref 22: I wanted to know whether this work proves that LLM-generated OWL modelling suggestions also support experienced ontology engineers, or only inexperienced ones

[Related Work] I think you are missing some relevant work in this section. E.g., wrt ontology generation with LLMs: "Ontogenia: Ontology generation with metacognitive prompting in large language models" and "Ontology Generation using Large Language Models" (see also their related work sections); wrt knowledge elicitation: "OntoChat: a Framework for Conversational Ontology Engineering using Language Models", and surely others; these are the ones I already knew (and I am a co-author of the last one I mention). This is also clear from your subsection Related Work / Discussion, where you detail your RQs taking into account only two studies

[Figure 1] At page 6 you write "for the intents of the pipeline, we consider the ChatGPT interviews as "transcribed by AI"": I don't understand why the figure cannot show that the transcription step is skipped when it comes to ChatGPT interviews.

[Experimental Design / Base Truth Ontology] It is not clear to me why you did not first try to define some clear requirements for your base truth ontology (you defined only some CQs afterwards, but I would have defined the scope of the ontology from the beginning). Especially since you want to use it for evaluating the other ontologies that actually come from a knowledge elicitation phase.
[Results / Metrics] I believe that the first paragraph of this section confirms my impression that you are comparing things not completely comparable.
[Results / Content Comparison] As you say in the first paragraph of this section, this result is not surprising since the interviews were open-ended and that the ontologies you are comparing have been created in such a different way.
[Results / Hierarchy comparison] the % for H1-H-H is anyway very low (still not surprising). "replicating the correct hierarchy": it does not make sense to talk about replication here

[Experimental Design / Testing The Pipeline] I understand you use the ACTA method for building the interviews; however, I am unaware of the use of such a method in requirements collection for ontology engineering. When building a real-world ontology, one usually has in mind a certain application/dataset or anyway some more concrete requirements. I don't think that an interview with only "tell me more about X" helps define a clear scope for the ontology. In general, I would have tried, in the experiments, to adhere more closely to state-of-the-art ontology engineering methodologies (including the ontological requirements collection phase). This would have made your research much stronger.

[Experimental Design / Testing The Pipeline] I think it would have made much sense to also compare the interviews themselves, e.g., which concepts were mentioned, how long the answers were, how many times you needed the "tell me more" prompt, etc.

Here's for example some differences I noticed in the interviews on GitHub:

ChatGPT
"Please tell me more about the core mechanics: role-playing"
"Please tell me more about the core mechanics: Dungeon Master (DM)"

vs

Interview with H1 - AI transcript (English).txt
"Can you tell us more about the characters that players can have? How does that work?"
"Can you Tell more About the monsters"
"Do you have anything else you'd like to tell us about the game that we haven't touched on yet?"

Instead, you only indirectly compare them based on the resulting ontology. But there are many other factors that influence how the ontology is created. Indeed, in general, I don't really see why you wanted to experiment also on the ontology development task; it would have been more interesting to me to focus only on the actual knowledge elicitation task (and envision other cases, e.g., using the LLM as an interviewer, which maybe does not save time for the domain experts, but is better at structuring afterwards, thus helping and speeding up the ontology engineers? just hypothesizing). Especially since you are missing some relevant work that has already tested extensively how to better use LLMs for building ontologies.

[Experimental Design / Testing The Pipeline] Can you give more details about how the domain experts were chosen? And about the ontology developer(s)? Is it always just one of the authors (the first author), as it seems from the URL of the ontologies created by humans? What is her level of expertise? And who evaluated the ontologies afterwards? The same human?

[Experimental Design / Testing The Pipeline] "several normalization steps were applied to standardize class names [...] terminology adjustments were made in non-base truth ontologies": the terminology is an important part of the ontology engineering activity, I believe I disagree with this choice by the author (e.g., the fact that in some classes, like CharacterAbilityModifiers, has been used a plural is relevant, it tells you something about how much "correct" an ontology has been modelled)
[Results / OQuaRE] Here for example you refer to the controlledVocabulary metric, which measures the terminology, but you have manually changed some terminology.

[Experimental Design / Testing The Pipeline] The prompts used to make the LLM develop the ontologies are not available on GitHub. They would have been relevant, also to understand how much the LLM was guided in doing it, whether related work on this was taken into account, whether everything happened within a single conversation (interview + ontology development), etc.
[Limitations and Future Work] Only much later in the paper did I realize that "the decision was made to manually construct the ontologies using the LLM's suggestions as a starting point": this should have been said from the beginning!

[Results / Metrics] What you discuss in the second paragraph of this section (about the higher number of axioms in AI-interview-ontologies) is interesting; however, how humans act is the reason why, if you need to develop an ontology with a clear scope, you do not use very general "talk me about this" questions.

[Results / OQuaRE] Redundancy: I probably got this definition wrong, because at first I thought that the higher the score, the better the ontology; instead, from your discussion of the results, it seems to be the contrary.

[Results / Hierarchy Comparison] if I understood correctly, you say that stating that Fighter is a specific type of Class (meaning CharacterClass, not owl:Class, correct?) instead of an instance of it is wrong; however, I believe these are just two different modelling choices, both valid, depending on the case.
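For concreteness, the two modelling choices discussed here can be sketched in Turtle (the `ex:` namespace and exact class names are hypothetical, based on the names quoted in the review):

```turtle
@prefix rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix ex:   <http://example.org/dnd#> .

# Choice 1: Fighter as a subclass of CharacterClass — appropriate
# when individual characters will later be typed as ex:Fighter.
ex:Fighter rdfs:subClassOf ex:CharacterClass .

# Choice 2: Fighter as an instance of CharacterClass — appropriate
# when character classes are treated as a fixed enumeration of
# individuals rather than as classes with their own members.
ex:Fighter rdf:type ex:CharacterClass .
```

Which option is preferable depends on whether the ontology needs to say things about individual fighters (Choice 1) or only about the class system itself (Choice 2), which is why neither can be called wrong in the abstract.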

[Results / Hierarchy Comparison] "many ontologies still captured a rough semantic understanding of the domain": how many?

[Results / Hierarchy Comparison] "high-level misclassifications" + "human-generated ontologies sometimes included Race but with variations in the specific instances used": is this something that you noticed also in the interviews, or just in the resulting ontologies?

[Results / Hallucinated data in AI ontologies] "it still represents a deviation from the interview data": yes, agree. Additionally, in general, when developing an ontology, one does not want to include things that are out of scope.

[Results] I hope I did not miss it, but a deeper discussion would have been nice comparing only the ontologies created by humans from human interviews vs. those from AI interviews, to understand which "input" is "better" for a human developing an ontology.

[Discussion / RQ2.1] did you verify the quality of the ChatGPT v4 interview in terms of correctness and relevance of answers? If so, how? You talk about hallucinations in the ontology: have you verified whether the hallucinations in the interview are the same as those in the ontology?

[Discussion / RQ2.2] "including relevant individuals": too subjective, can you motivate this "relevant" further?

[Discussion / RQ2.2] "and let humans handle what they excel at (interpretation, abstraction, and nuanced structuring)": I do believe humans (specifically, domain experts and users) also excel in providing all the *relevant and more appropriate* pieces of information about a domain of knowledge to be represented in the ontology. At least, if the application of the ontology is clear.

[Discussion] "current LLMs are highly useful in speeding up and structuring the early stages of knowledge elicitation": I think it is clear now I do not believe you completely proved this. I agree with your statement "AI can serve as a powerful collaborator when its outputs are validated and curated by human experts", but, in order to prove this, it would have been interesting to let a domain expert review and possibly correct/integrate/remove things from the ChatGPT initial interview, and see whether this still speeds up the process wrt the domain expert starting from scratch.

[Limitations and Future Work] "the LLM often produced inconsistent results: identical prompts could lead to differing ontological structures across runs": this is something that would need to be reported on GitHub and discussed in the discussion section, since it's a very relevant and common problem when it comes to LLMs.

Minor comments:
[template] it does not seem you used the usual latex template for SWJ
[abstract] missing keywords
[multiple places in the paper] please, fix the opening quotation mark
[page 2] "an LLM as a data source Next": missing dot
[page 2] "summize": summarize
[ref 21] I can see that this paper has also been published at the Forty-first International Conference on Machine Learning, I would correct the reference that only refers to arXiv
[page 4] you forgot a TODO in red
[page 6] "Among the various methodologies proposed for ontology evaluation, the OntoClean method and OQuaRE.": not sure if there was supposed to be a verb here
[page 7] "the time spend": spent
[Table 6] constrolled: controlled

Review #2
Anonymous submitted on 25/Aug/2025
Suggestion:
Reject
Review Comment:

Summary: This paper focuses on the process of knowledge elicitation (KE) and investigates the level to which this process can be automated with LLMs. To that end, the knowledge elicitation process is split into three stages: interviews, transcriptions of interviews, creating the ontology based on interviews. Each phase may be performed by humans or AI tools/LLM. The authors perform a set of experiments in which they test this 3-stage pipeline with different combinations of human/AI tools. They compare the pipelines in terms of time efficiency and the quality of the resulting ontologies in comparison to a manually built base truth ontology.

The topic of the paper is interesting, timely and fits well with the special issue’s scope. Despite this, the paper has considerable shortcomings, as follows.

The related work section requires a major revision. On the one hand, several details about KE and LLMs are of historic nature and not strictly relevant for the focus of this paper. On the other hand, much of the recent related work is missing, for example, related to the most recent ontology engineering methodologies (e.g., LOT), as well as the abundant body of work at the intersection of LLMs and knowledge engineering (as published in the special sessions that form the basis for this special issue). Indeed, from 27 references only 6 are dated after 2020. Given this lack of overview of the most recent work, the answer to RQ1, which is based on this section is not fully convincing.

In terms of task definition, the step of "generating ontologies from text" is very broad and ill-defined. It is not clear what this process entails exactly (e.g., is it a list of steps as envisioned in the Ontology 101 methodology?) and what the characteristics of the generated ontologies are (i.e., are they only taxonomies or fully-fledged ontologies with axioms such as property restrictions). This makes it difficult to understand parts of the paper that refer to this step.

Several major design decisions and experimental details are not explained. Why is the focus of the work on "niche subjects"? How do the authors define a "niche subject"? When building the base truth ontology: what ontology engineering methodology was followed (the paper does list a set of steps, but these fall short of a methodologically solid approach)? Who were the experts involved and how many were there? What are the characteristics of the resulting ontology (e.g., size, complexity)? Who developed the CQs in Table 1 and what does "results returned for each" mean — which results? returned from where? The evaluation criteria "topical relevance, conversational structure, overall efficiency" should be defined in more detail, as it cannot be assumed that the whole audience is familiar with these.

Another decision that is not explained is that completely different approaches were used to create the base ontology and the experimental ontologies. As a result, it is not clear how meaningful the comparison between these ontologies is. For example, none of the CQs that were covered by the base-truth ontology were covered by the generated ontologies.

Finally, the experimental evaluation is currently limited to one domain and makes use of a single LLM, thus making the results of this paper too preliminary for a journal publication.

Minor Comments:
* there is some interesting formatting, apparently for “Transportation Research Record” (see top right of article)
* P1: “This makes them prone to errors that are hard to pinpoint and fix (3). The output of these models, therefore, often lack the depth, nuance, and verifiability that human experts bring to the table.” - it is unclear how the second sentence follows from the first. Also, the second sentence would highly benefit from a reference that would prove what it states.
* P4 - some text in red seems to capture internal comments between authors.
* The first two paragraphs of the "Hierarchy Comparison" section have a content-wise overlap (they seem to be different verbalisations of the same information).

In summary, while addressing an interesting topic, this work needs to be better aligned with the latest related work, and its methodology should be revised and explained in more detail. The experimental evaluation should be extended to at least include more LLMs, and possibly also more domains, before submission to a journal.

Review #3
Anonymous submitted on 02/Oct/2025
Suggestion:
Reject
Review Comment:

This paper investigates using LLMs (specifically ChatGPT v4) for knowledge elicitation and ontology generation, comparing AI-led and human-led approaches. While the topic is relevant, the work appears incomplete and lacks sufficient rigor for publication. The experimental design has unresolved elements, the sample size is too small for statistical validity, and the contribution beyond confirming predictable findings is unclear.
The paper lacks clear novelty. There is excessive repetition between abstract and introduction, suggesting limited original contribution. The main finding that combining LLM interviews with human ontology engineering works best is predictable. What is the actual research advancement here? The paper does not clearly articulate what new knowledge or methodological innovation it contributes beyond confirming intuitive expectations.
In the introduction, the authors' claims are not supported by references, or at least by citations of the most relevant works.
The paper contains unfinished elements that indicate it is not ready for publication:
• Unresolved comment in Base Truth Ontology section: "[Romi]: we should execute these also for all the generated ontologies..."
• Competency questions (Table 1) are only evaluated against the base truth ontology, not against the generated ontologies as intended, undermining a key evaluation mechanism
The explanation of RQ1 in the introduction section is confusing. Before the methodology, RQ1 clearly addresses automatic elicitation without human intervention, and RQ2 compares automatic vs. human processes. However, in the further explanation in the fourth paragraph of the introduction, RQ1 is reformulated in ways that create confusion about what is actually being investigated.
The knowledge elicitation section is limited and does not include recent relevant work such as:
• Cogan Shimizu, Pascal Hitzler: "Accelerating knowledge graph and ontology engineering with large language models" (J. Web Semant.)
• "A Survey of Domain Knowledge Elicitation in Applied Machine Learning" (2021)
The introduction mentions reviewing state of the art without providing adequate summary or citing the most relevant supporting works.
With only two human experts interviewed, how are the results statistically significant? The sample size is insufficient to support the generalizable claims made about LLM vs. human performance.
What about experimenting with open LLMs? What are the limitations?
The manuscript contains grammatical errors and unclear explanations.

Minor Issues (not a complete list)
• RQ1 is split into two sub-questions, but RQ1.1 is explicitly stated while RQ1.2 is not clearly delineated
• Grammatical error: "Data source Next is the translation" is not a correct sentence
• Typo: "constrolledVocabulary" → "controlled vocabulary"