On General and Biomedical Text-to-Graph Large Language Models

Tracking #: 3765-4979

Authors: 
Lorenzo Bertolini
Roel Hulsman
Sergio Consoli
Antonio Puertas Gallardo
Mario Ceresa

Responsible editor: 
Guest Editors KG Gen from Text 2023

Submission type: 
Full Paper
Abstract: 
Knowledge graphs and ontologies represent symbolic and factual information that can offer structured and interpretable knowledge. Extracting and manipulating this type of information is a crucial step in complex processes. While Large Language Models (LLMs) are known to be useful for extracting and enriching knowledge graphs and ontologies, previous work has largely focused on comparing architecture-specific models (e.g. encoder-decoder only) across benchmarks from similar domains. In this work, we provide a large-scale comparison of the performance of certain LLM features (e.g. model architecture and size) and task learning methods (fine-tuning vs. in-context learning (iCL)) on text-to-graph benchmarks in two domains, namely the general and biomedical ones. Experiments suggest that, in the general domain, small fine-tuned encoder-decoder models and mid-sized decoder-only models used with iCL reach overall comparable performance with high entity and relation recognition and moderate yet encouraging graph completion. Our results further tentatively suggest that, independent of other factors, biomedical knowledge graphs are notably harder to learn and better modelled by small fine-tuned encoder-decoder architectures. Pertaining to iCL, we analyse hallucinating behaviour related to sub-optimal prompt design, suggesting an efficient alternative to prompt engineering and prompt tuning for tasks with structured model output.
Tags: 
Reviewed

Decision/Status: 
Minor Revision

Solicited Reviews:
Review #1
By Finn Årup Nielsen submitted on 16/Oct/2024
Suggestion:
Minor Revision
Review Comment:

I thank the authors for their feedback on my review. I am mostly satisfied,
except for my last point about the ordering of the triplets.

Consider the WebNLG sentence: "A.E Dimitra Efxeinoupolis is located
in Greece, the capital of which is Athens.".
Two triplets are extracted:

<"A.E_Dimitra_Efxeinoupolis", "location", "Greece">
<"Greece", "capital", "Athens">

If we reorder the triplets, they become

<"Greece", "capital", "Athens">
<"A.E_Dimitra_Efxeinoupolis", "location", "Greece">

If ROUGE is computed *across* triplets, then in the first case one of
the bigrams is "Greece, Greece", while in the second case it is instead
"Athens, A.E_Dimitra_Efxeinoupolis". It was not clear to me from the
paper, and it is still not clear from the reply notes, whether the ROUGE
metrics are computed across triplets, i.e. whether the object of one
triplet is paired with the subject of the following extracted triplet.
If that is the case, then the metric can be manipulated by reordering
the triplets (when there are multiple triplets).
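The concern above can be illustrated with a minimal sketch. The snippet below is not the authors' evaluation code; it is a toy bigram-level ROUGE-2 recall computed over a flat linearisation of the triplets (an assumption about how cross-triplet scoring would behave), showing that swapping two identical triplets changes the score because the cross-triplet bigram "Greece, Greece" disappears:

```python
from collections import Counter

def bigrams(tokens):
    """Return the multiset of adjacent token pairs."""
    return Counter(zip(tokens, tokens[1:]))

def rouge2_recall(candidate, reference):
    """ROUGE-2 recall over flat token sequences: the fraction of
    reference bigrams that also occur in the candidate."""
    cand, ref = bigrams(candidate), bigrams(reference)
    overlap = sum((cand & ref).values())  # multiset intersection
    return overlap / sum(ref.values())

# The two gold triplets, linearised in their original order.
ref = ["A.E_Dimitra_Efxeinoupolis", "location", "Greece",
       "Greece", "capital", "Athens"]
# The same two triplets, reordered.
swapped = ["Greece", "capital", "Athens",
           "A.E_Dimitra_Efxeinoupolis", "location", "Greece"]

print(rouge2_recall(ref, ref))      # 1.0
print(rouge2_recall(swapped, ref))  # 0.8 -- the cross-triplet bigram
                                    # ("Greece", "Greece") is lost
```

Under this linearisation the same set of triplets scores 1.0 in one order and 0.8 in the other, which is exactly the manipulability the review points out.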

Reading Figure 6 in Appendix A, I see that the triplet (Ciudad Ayala #
country # Mexico) comes first in the output, while the same triplet sits
in the middle of the reference list. Would moving that triplet to the
"correct" position result in different ROUGE-2 and ROUGE-L scores? As
far as I can read the code, the authors use the 'evaluate' Python
library for computing ROUGE.

https://github.com/jrcf7/txt2graphLLMs/blob/main/scripts/utils/predictio...

But I have not examined the format of the data with which the 'compute'
method is called.

In graphs, the ordering of triplets is usually ignored. Therefore, if
the computation of the ROUGE-2 and ROUGE-L scores is order-dependent,
this is highly unusual and should be noted in the manuscript.

As a side note, I would like to point out that reviewer 3 points to a
Piglou paper, but the authors point to a Kuhn paper, reference [6].

Review #2
By Fidel Jiomekong submitted on 16/Nov/2024
Suggestion:
Accept
Review Comment:

The authors addressed the comments made in the previous reviews.
The authors should proofread the manuscript to correct grammatical errors and typos. The following comments should also be considered:

Page 5:
- line 3: Here we discuss -> This section presents
- line 7: Given that the authors make reference to datasets, it would be nice if they can present the datasets before this section (but, not an obligation)
- line 47: .. (we adopt version 3.0) … -> (we choose version 3.0) may be more appropriate
Page 6:
- line 1 - 12: it would be nice to illustrate with an example
Page 7:
- Hugging Face2 ,3 -> check the correct way to put the numbers referring to the footnote
Page 8:
- … Encoder-decoder architectures are a generic class of transformer models, introduced in [57] …
-> Encoder-decoder architectures are a generic class of transformer models [57]
Page 9:
- line 21: is the following correct?: Anther difference consists
Page 10:
- line 33: it would be nice at this point to highlight the prompt method used by the authors. Of course, the type of prompt used is presented in page 13, line 48, but it is too far for the reader who wants to understand the paper
Page 14:
- line 37: have a a log-linear -> have a log-linear

Review #3
Anonymous submitted on 05/Dec/2024
Suggestion:
Accept
Review Comment:

The authors have satisfactorily addressed all my comments. I believe the paper is now ready for acceptance.