Review Comment:
The paper presents an analysis and some experiments on how to automate the process of knowledge elicitation, and the subsequent process of creating an ontology from the elicited knowledge, using LLMs.
The paper is very well written and clear.
The results (but I believe not all the code needed to reproduce the experiments) are available on GitHub.
I strongly believe that the idea of investigating how to support the knowledge elicitation process is very good and has great potential for supporting ontology engineering; however, I have some major concerns about the experiments, which would require re-thinking at least some of them.
[Related Work / Knowledge Elicitation] When you talk about references 11 and 12, it is not clear to me whether those apply only to ML projects or beyond them. Also, the involvement of domain experts seems more related to the creation of a test set for training/evaluating ML systems?
[Related Work / Large Language Models] About ref 22: I wanted to know whether this work proves that LLM-generated OWL modelling suggestions also support experienced ontology engineers, or only inexperienced ones.
[Related Work] I think you are missing some relevant work in this section. E.g., wrt ontology generation with LLMs: "Ontogenia: Ontology generation with metacognitive prompting in large language models" and "Ontology Generation using Large Language Models" (see also their related work sections); wrt knowledge elicitation: "OntoChat: a Framework for Conversational Ontology Engineering using Language Models", and surely others — these are the ones I already knew (and I am a co-author of the last one I mention). This gap is also clear from your subsection Related Work / Discussion, where you detail your RQs taking into account only two studies.
[Figure 1] On page 6 you write "for the intents of the pipeline, we consider the ChatGPT interviews as "transcribed by AI"": I do not understand why the figure cannot represent that the transcription step is skipped for ChatGPT interviews.
[Experimental Design / Base Truth Ontology] It is not clear to me why you did not first define some clear requirements for your base truth ontology (you defined only some CQs afterwards, but I would have preferred the scope of the ontology to be defined from the beginning). This matters especially because you use it to evaluate the other ontologies, which actually come from a knowledge elicitation phase.
[Results / Metrics] I believe that the first paragraph of this section confirms my impression that you are comparing things not completely comparable.
[Results / Content Comparison] As you say in the first paragraph of this section, this result is not surprising, given that the interviews were open-ended and that the ontologies you are comparing have been created in such different ways.
[Results / Hierarchy comparison] The % for H1-H-H is in any case very low (still not surprising). "replicating the correct hierarchy": it does not make sense to talk about replication here.
[Experimental Design / Testing The Pipeline] I understand you use the ACTA method for building the interviews; however, I am unaware of the use of such a method in requirements collection for ontology engineering. When building a real-world ontology, one usually has in mind a certain application/dataset or, in any case, some more concrete requirements. I do not think that an interview consisting only of "tell me more about X" helps define a clear scope for the ontology. In general, I would have tried, in the experiments, to adhere more closely to state-of-the-art ontology engineering methodologies (including the ontological requirements collection phase). This would have made your research much stronger.
[Experimental Design / Testing The Pipeline] I think it would have made much sense to also compare the results of the interviews themselves, e.g., which concepts were mentioned, how long the answers were, how many times the "tell me more" prompt was needed, etc.
Here's for example some differences I noticed in the interviews on GitHub:
ChatGPT
"Please tell me more about the core mechanics: role-playing"
"Please tell me more about the core mechanics: Dungeon Master (DM)"
vs
Interview with H1 - AI transcript (English).txt
"Can you tell us more about the characters that players can have? How does that work?"
"Can you Tell more About the monsters"
"Do you have anything else you'd like to tell us about the game that we haven't touched on yet?"
Instead, you only compare them indirectly based on the resulting ontology, but there are many other factors that influence how the ontology is created. Indeed, in general, I do not really get why you also wanted to experiment on the ontology development task; it would have been more interesting to me to focus only on the actual knowledge elicitation task (and to envision other cases, e.g., using the LLM as an interviewer, which maybe does not save time for the domain experts, but is better at structuring the output afterwards, thus helping and speeding up the ontology engineers? just hypothesizing). This is especially true since you are missing some relevant work that has already tested extensively how to better use LLMs for building ontologies.
[Experimental Design / Testing The Pipeline] Can you give more details about how the domain experts were chosen? And about the ontology developer(s)? Is it always just one of the authors (the first author), as it seems from the URLs of the ontologies created by humans? What is her level of expertise? And who evaluated the ontologies afterwards? The same human?
[Experimental Design / Testing The Pipeline] "several normalization steps were applied to standardize class names [...] terminology adjustments were made in non-base truth ontologies": terminology is an important part of the ontology engineering activity, and I believe I disagree with this choice by the authors (e.g., the fact that a plural has been used in some classes, like CharacterAbilityModifiers, is relevant; it tells you something about how "correctly" an ontology has been modelled).
[Results / OQuaRE] Here for example you refer to the controlledVocabulary metric, which measures the terminology, but you have manually changed some terminology.
[Experimental Design / Testing The Pipeline] The prompts used to make the LLM develop the ontologies are not available on GitHub. They would have been relevant, also to understand how much the LLM was guided in doing it, whether they took into account related work on this, whether everything happened within a single conversation (interview + ontology development), etc.
[Limitations and Future Work] Only much later in the paper did I realize that "the decision was made to manually construct the ontologies using the LLM's suggestions as a starting point": this should have been said from the beginning!
[Results / Metrics] What you discuss in the second paragraph of this section (about the higher number of axioms in AI-interview ontologies) is interesting; however, the way humans act is precisely why, if you need to develop an ontology with a clear scope, you do not use very general "tell me about this" questions.
[Results / OQuaRE] Redundancy: I probably got this definition wrong, because at first I thought that the higher the score, the better the ontology; however, from your discussion of the results, it seems to be the contrary.
[Results / Hierarchy Comparison] If I understood correctly, you say that stating that Fighter is a specific type of Class (meaning CharacterClass, not owl:Class, correct?) instead of an instance of it is wrong; however, I believe these are just two different modelling choices, both valid depending on the case.
[Results / Hierarchy Comparison] "many ontologies still captured a rough semantic understanding of the domain": how many?
[Results / Hierarchy Comparison] "high-level misclassifications" + "human-generated ontologies sometimes included Race but with variations in the specific instances used": is this something that you noticed also in the interviews, or just in the resulting ontologies?
[Results / Hallucinated data in AI ontologies] "it still represents a deviation from the interview data": yes, agree. Additionally, in general, when developing an ontology, one does not want to include things that are out of scope.
[Results] I hope I did not miss it, but a deeper discussion of the comparison between ontologies created by humans from human interviews vs. from AI interviews would have been nice, to understand which "input" is "better" for a human developing an ontology.
[Discussion / RQ2.1] Did you verify the quality of the ChatGPT v4 interview in terms of correctness and relevance of the answers? If so, how? You talk about hallucinations in the ontology: have you verified whether the hallucinations in the interview are the same as those in the ontology?
[Discussion / RQ2.2] "including relevant individuals": too subjective; can you motivate this "relevant" further?
[Discussion / RQ2.2] "and let humans handle what they excel at (interpretation, abstraction, and nuanced structuring)": I do believe humans (specifically, domain experts and users) also excel in providing all the *relevant and more appropriate* pieces of information about a domain of knowledge to be represented in the ontology. At least, if the application of the ontology is clear.
[Discussion] "current LLMs are highly useful in speeding up and structuring the early stages of knowledge elicitation": I think it is clear by now that I do not believe you have completely proved this. I agree with your statement "AI can serve as a powerful collaborator when its outputs are validated and curated by human experts", but, in order to prove this, it would have been interesting to let a domain expert review and possibly correct/integrate/remove things from the ChatGPT initial interview, and to see whether this still speeds up the process wrt the domain expert starting from scratch.
[Limitations and Future Work] "the LLM often produced inconsistent results: identical prompts could lead to differing ontological structures across runs": this is something that should be reported on GitHub and discussed in the discussion section, since it is a very relevant and common problem when it comes to LLMs.
*Minor comments
[template] It does not seem you used the usual LaTeX template for SWJ.
[abstract] missing keywords
[multiple places in the paper] please, fix the opening quotation mark
[page 2] "an LLM as a data source Next": missing dot
[page 2] "summize": summarize
[ref 21] I can see that this paper has also been published at the Forty-first International Conference on Machine Learning; I would correct the reference, which currently refers only to arXiv.
[page 4] you forgot a TODO in red
[page 6] "Among the various methodologies proposed for ontology evaluation, the OntoClean method and OQuaRE.": not sure if there was supposed to be a verb here
[page 7] "the time spend": spent
[Table 6] constrolled: controlled