Retrieval-Augmented Generation-based Relation Extraction

Tracking #: 3670-4884

Authors: 
Sefika Efeoglu
Adrian Paschke

Responsible editor: 
Guest Editors KG Gen from Text 2023

Submission type: 
Full Paper
Abstract: 
Information Extraction (IE) is a transformative process that converts unstructured text data into a structured format by employing entity and relation extraction (RE) methodologies. The identification of the relation between a pair of entities plays a crucial role within this framework. Despite the existence of various techniques for relation extraction, their efficacy heavily relies on access to labeled data and substantial computational resources. To address these challenges, Large Language Models (LLMs) emerge as promising solutions; however, they may return hallucinated responses due to their own training data. To overcome these limitations, Retrieval-Augmented Generation-based Relation Extraction (RAG4RE) is proposed in this work, offering a pathway to enhance the performance of relation extraction tasks. This work evaluates the effectiveness of our RAG4RE approach utilizing different LLMs. Through established benchmarks, namely the TACRED, TACREV, Re-TACRED, and SemEval RE datasets, our aim is to comprehensively evaluate the efficacy of our RAG4RE approach. In particular, we leverage prominent LLMs including Flan T5, Llama2, and Mistral in our investigation. The results of our study demonstrate that our RAG4RE approach surpasses the performance of traditional RE approaches based solely on LLMs, which is particularly evident on the TACRED dataset and its variations. Furthermore, our approach exhibits remarkable performance compared to previous RE methodologies across both the TACRED and TACREV datasets, underscoring its efficacy and potential for advancing RE tasks in natural language processing.
Tags: 
Reviewed

Decision/Status: 
Major Revision

Solicited Reviews:
Review #1
Anonymous submitted on 21/Apr/2024
Suggestion:
Major Revision
Review Comment:

The paper describes a retrieval-augmented generation-based relation extraction approach.
The core idea is to append a similar sentence from the same dataset (training subset) to the prompt of the LLM.
It shows that this addition can improve the results in three out of four datasets with every LLM used in the study.
The paper is well-written and easy to follow. Figures 5 and 6 are a bit redundant (tables with exact numbers exist), but they are still fine.
The provided resource is a GitHub link, which is fine, and the repository is well structured.
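
A minimal sketch of the core idea described above (the prompt template and function name are hypothetical illustrations, not the authors' actual implementation):

```python
# Sketch: build a relation-extraction prompt that appends one sentence
# retrieved from the training set to the query, as described for RAG4RE.
def build_rag_prompt(query_sentence, head, tail, similar_sentence, relation_labels):
    labels = ", ".join(relation_labels)
    return (
        f"Determine the relation between '{head}' and '{tail}' in the sentence below.\n"
        f"Choose exactly one of: {labels}\n"
        f"Sentence: {query_sentence}\n"
        f"Example Sentence (retrieved from the training set): {similar_sentence}\n"
        f"Relation:"
    )
```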

The description of the approach is lacking a few details:
In Section 3, it would be better to describe directly that similar sentences are chosen from the corresponding dataset (more specifically, from the training set).
In Section 3.1, it is unclear what result is returned if no specified relation is mentioned in the generated text. Maybe it is a good idea to introduce a new artificial relation (instead of using no_relation) to analyze later for how many examples the LLM is not able to produce a relation mentioned in the prompt.

For the main approach, it is unclear whether any prompt engineering was performed.
The output of an LLM can vary considerably when the prompt is changed, so if the authors already tried out different prompts, it would be good to include them in the results, because it is not clear to me why adding one similar sentence (without any ground truth) should improve the results by a large margin.
Maybe adding any other text or changing the prompt would have similar or even better results. Such an additional experiment would be useful and would make the results in the paper even stronger.
Similarly, a comparison to a prompt with a few examples, including the expected result (and its format), would be valuable; such demonstrations usually increase the chance of getting a defined relation as output. Min et al. [1] showed that even the correct labels in the demonstrations are not that important; what matters is that they show the LLM how it should respond (which also makes the result-parsing step easier and more reliable). Even though the paper is about the RAG part, this is very similar and is worth a comparison.
The LLM selection could also be improved by using larger LLMs, like Llama-13b or 70b, to see if larger models perform better (and what the influence of the choice of the model is).
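
To make the suggested few-shot comparison concrete, such a prompt could look like the sketch below (the demonstrations and label names are made-up TACRED-style examples, not taken from the paper):

```python
# Hypothetical few-shot prompt with labeled demonstrations; it only illustrates
# fixing the response format so that parsing the output becomes reliable.
FEW_SHOT_PROMPT = """Determine the relation between the head and tail entities.
Answer with exactly one relation label.

Sentence: Barack Obama was born in Honolulu.
Head: Barack Obama  Tail: Honolulu
Relation: per:city_of_birth

Sentence: Steve Jobs was a co-founder of Apple.
Head: Apple  Tail: Steve Jobs
Relation: org:founded_by

Sentence: {sentence}
Head: {head}  Tail: {tail}
Relation:"""

# usage: FEW_SHOT_PROMPT.format(sentence=..., head=..., tail=...)
```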

It is nice to see the authors present an error analysis, which already gives some insights, but a more in-depth analysis would be interesting.
This could include, e.g., an analysis at the level of individual relations (which relations are usually predicted correctly and which are usually wrong).
When using an additional artificial relation type for the case where no relation could be extracted, one can see whether the result is simply wrong or whether the LLM is producing text (maybe describing the correct relation) that does not mention the relation label itself. The micro average would then not hide these crucial details.

In Table 2, it would be nice to highlight the best values in each column (at least for F1 in SemEval).
Tables 4 and 5 could also be integrated into Table 2 to allow a direct comparison between the related work and the proposed approach (instead of combining differently selected proposed models in Tables 4/5; e.g., the best model for SemEval is not the best for TACRED).

On page 8, line 17, the authors say that "it is clear that our RAG4RE approach, consisting of the relevant example sentence about the query,
has improved F1 scores". This is not really clear to me. Why should the model predict better if a similar sentence is contained in the prompt?

The selection of related work is good. A similar approach using T5 is presented by Han et al. [2] and could be included as well.
Interestingly, the results for TACRED and TACREV are much better than all of the SOTA approaches (including [2], which achieved 75.3 on TACRED and 84.0 on TACREV). Even though other approaches, e.g. [2], include more training data and fine-tune the models, their results are worse.
If the authors have any insights into why this is the case, it would be good to include them in the paper as well (is it just due to the larger model, i.e., T5 XL, or due to the prompt or the sentence selected by the RAG approach?).
Furthermore, a discussion of why the model is better by a large margin on TACRED and TACREV but worse on Re-TACRED (in comparison to related work) is necessary and would give more insights into the dataset and the proposed RAG model (similarly with SemEval).

Thus, I am looking forward to the next revision of the paper, with more analysis and good justifications for why adding a similar sentence yields better results than the SOTA approaches.

[1] Rethinking the Role of Demonstrations: What Makes In-Context Learning Work? https://arxiv.org/pdf/2202.12837.pdf
[2] Generative Prompt Tuning for Relation Classification https://arxiv.org/pdf/2210.12435v1.pdf

Review #2
Anonymous submitted on 08/Dec/2024
Suggestion:
Minor Revision
Review Comment:

The paper presents a compelling approach to enhancing RE tasks by
integrating retrieval and augmentation into LLMs. Its contributions to
addressing hallucinations and leveraging context for better predictions are
valuable. With further work on logical inference and real-world testing,
RAG4RE could be a powerful tool in information extraction pipelines.

Strengths:

- The paper explains problems with traditional RE methods, like needing
labeled data and high computational costs. It presents RAG4RE to solve these
issues by using RAG and LLMs.
- RAG4RE uses retrieval, data augmentation, and generation to improve LLMs'
understanding of context. This helps reduce hallucinations by providing
relevant examples.
- The authors test RAG4RE on datasets like TACRED, TACREV, Re-TACRED, and
SemEval. They compare it with other advanced methods and use LLMs like Flan
T5, Llama2, and Mistral for evaluation.
- RAG4RE performs better than other LLM-based approaches, achieving high F1
scores on TACRED and TACREV. It also reduces false predictions.
- The authors note its limitations on datasets needing logical inference,
such as SemEval. They plan to test RAG4RE on real-world data in the future.

Weaknesses:

- The method has trouble with datasets like SemEval, where relations cannot
be directly extracted from text. This makes it less useful for tasks needing
complex reasoning.
- The evaluation is thorough on benchmark datasets, but the lack of
real-world testing raises concerns about its practical use.
- The paper includes error analysis but does not deeply explore why some
LLMs, like Mistral, perform poorly or how to improve them.
- The tables and results are detailed, but better visualizations or
explanations would make trends, like why Flan T5 does better than Llama2,
easier to understand.
- Although the source code is provided, the paper could add more details on
reproducing experiments, like data preparation and parameter settings.

Suggestions for Improvement:

- Include additional datasets from domains like healthcare or legal documents
to test the model's robustness outside of benchmarks.
- Explore techniques to incorporate logical inference into RAG4RE,
potentially by integrating symbolic reasoning methods.
- Provide more detailed explanations for the performance differences between
LLMs and SoTA methods.
- Use heatmaps or detailed charts to visualize false positive/negative trends
across datasets and models.

Review #3
Anonymous submitted on 09/Dec/2024
Suggestion:
Major Revision
Review Comment:

The paper proposes RAG4RE for relation extraction using LLMs. Please find my comments below.

1. One of the major claims the authors make concerns the computational power and time required for RE, which they use to argue for LLMs. However, LLMs are even more computationally intensive to train. The authors should clarify what exactly they are referring to.
2. LLMs are restricted by their training data: but is this not true of all deep learning algorithms and, in the case of expert systems, of their rules?
3. On p. 2, lines 20-23, the authors describe RE as expensive due to the need for labeled data. There has also been work addressing this with approaches such as semi-supervised learning. I would encourage the authors to discuss those aspects as well.
4. The introduction should state the challenges clearly; it currently only discusses them at a high level. The authors are encouraged to provide some tangible examples.
5. I also expected the introduction to briefly discuss what sets this study apart from existing work, but this is missing. Are there no RAG-supported relation extraction studies? If there are, why RAG4RE?
6. In Section 2.2, could you give an example of the post-processing steps for advanced RAG?
7. The authors should highlight the limitations of the related work and also identify the work that the proposed approach builds upon. It would be beneficial to have a table contrasting those limitations with the proposed approach (or study).
8. In Table 6, the purpose of the semicolon in the prompt is not clear; sometimes it is used with the head and tail entity and sometimes within the sentence (e.g., "Example Sentence: But there is also a bench factor: Williams").
9. The user query (sentence) is more exhaustive than the example sentence that is retrieved based on cosine similarity. The example in Figure 4 contains just one sentence, and I doubt that a single sentence would make such a huge performance difference. The authors should clarify how many example sentences are retrieved: just one, two, …, N, or is there no limit as long as there is a match? Also, what similarity threshold is used? (A sketch of what such a retrieval step could look like follows this list.)
10. One of the main arguments the authors make regarding the performance boost is based on the difference between the simple query and the RAG4RE prompt. With LLMs it is well known that the better the prompt, the better the response. So what is the main contribution or research challenge the paper is tackling: the prompt design?
11. I expected to see some more examples of the extracted relations along with some ablation studies, but these are missing. Have the authors not conducted any ablation study?
12. The authors also mention that their post-processing steps align the retrieved relations with prefixes, but looking at the results on GitHub, this does not seem to be the case. Not all of the results have a prefix.
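
A minimal sketch of the top-1 retrieval step questioned in point 9 (the embedding model, the top-1 choice, and the absence of a threshold are assumptions for illustration, not details confirmed by the paper):

```python
# Sketch: retrieve the single most similar training sentence via cosine
# similarity over sentence embeddings (assumed setup, not the authors' code).
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model

def retrieve_example(query_sentence, training_sentences):
    """Return the training sentence most similar to the query (top-1, no threshold)."""
    query_emb = model.encode([query_sentence])
    train_embs = model.encode(training_sentences)
    scores = util.cos_sim(query_emb, train_embs)[0]  # cosine similarities to the query
    best = int(scores.argmax())
    return training_sentences[best], float(scores[best])
```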

Review #4
By Garima Agrawal submitted on 09/Dec/2024
Suggestion:
Accept
Review Comment:

This is a novel approach for reducing hallucinations in Relation Extraction. Integrating RAG with RE via improved prompts is a good approach. The methodology is clearly explained, with a well-defined pipeline leveraging example sentences retrieved from embeddings. The standard datasets for RE are used for evaluation. The performance of RAG4RE could be further explored on datasets like SemEval. I think further analysis of failure cases can improve the evaluation: while you show that your method reduces false positives and negatives in most cases, the failure cases on SemEval could be analyzed or explained. In the method section, add more details on the data augmentation module; it would be good to describe how the similarity metrics and embeddings are calculated. The language is clear; please check for minor grammatical mistakes.