Review Comment:
The paper describes a retrieval-augmented generation-based relation extraction approach.
The core idea is to append a similar sentence from the same dataset (training subset) to the prompt of the LLM.
It shows that this addition can improve the results in three out of four datasets with every LLM used in the study.
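For clarity, this is roughly how I understand the prompt construction. The following minimal sketch uses a sentence-transformer retriever and a generic instruction template; these are my own assumptions for illustration and not necessarily the authors' exact setup.

```python
# Minimal sketch of retrieval-augmented prompt construction for relation extraction.
# The embedding model, similarity measure, and prompt template are assumptions for
# illustration only; the paper's concrete retriever and template may differ.
from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # hypothetical retriever model

def build_rag_prompt(query_sentence, head, tail, relation_labels, train_sentences):
    # Retrieve the single most similar training sentence (no gold label attached).
    query_emb = embedder.encode(query_sentence, convert_to_tensor=True)
    train_embs = embedder.encode(train_sentences, convert_to_tensor=True)
    best_idx = int(util.cos_sim(query_emb, train_embs).argmax())
    similar_sentence = train_sentences[best_idx]

    return (
        f"Example sentence: {similar_sentence}\n"
        f"Sentence: {query_sentence}\n"
        f"Head entity: {head}\nTail entity: {tail}\n"
        f"Which of the following relations holds between head and tail? "
        f"{', '.join(relation_labels)}"
    )
```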
The paper is well-written and easy to follow. Figures 5 and 6 are a bit redundant (tables with exact numbers exist), but they are still fine.
The provided resource is a GitHub link, which is fine, and the repository is well structured.
The description of the approach is lacking a few details:
In Section 3, it would be better to describe directly that similar sentences are chosen from the corresponding dataset (more specifically, from the training set).
In Section 3.1, it is unclear what result is returned if none of the specified relations is mentioned in the generated text. It might be a good idea to introduce a new artificial relation (instead of using no_relation) in order to analyze later for how many examples the LLM is not able to produce a relation mentioned in the prompt; a sketch of such a parsing step is given below.
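To make the suggestion concrete, here is a minimal sketch of such a parsing step; the label name "unparseable" and the substring-matching heuristic are my own placeholders.

```python
# Sketch of mapping raw LLM output to a relation label, with an explicit artificial
# label instead of silently falling back to "no_relation". The label name
# "unparseable" and the matching heuristic are hypothetical illustration choices.
UNPARSEABLE = "unparseable"

def parse_relation(generated_text: str, relation_labels: list[str]) -> str:
    text = generated_text.lower()
    # Try longer labels first, so e.g. "org:founded_by" is preferred over "org:founded".
    for label in sorted(relation_labels, key=len, reverse=True):
        if label.lower() in text:
            return label
    # No known relation mentioned: keep this case distinguishable from a genuine
    # "no_relation" prediction so it can be counted and analyzed separately.
    return UNPARSEABLE
```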
For the main approach, it is unclear whether any prompt engineering was performed.
The output of an LLM can vary greatly when the prompt is changed. Thus, if the authors already tried out different prompts, it would be good to include them in the results, because it is not clear to me why adding one similar sentence (without any ground truth) should improve the results by such a large margin.
Maybe adding any other text, or changing the prompt in another way, would yield similar or even better results. Such an additional experiment would therefore be useful and would make the results in the paper even stronger.
Similarly, a comparison to a prompt with a few labeled examples, including the expected result (and its format), would be worthwhile; such demonstrations usually increase the chance of getting a defined relation as output. Min et al. [1] showed that even the correctness of the labels is not that important; what matters is that the demonstrations show the LLM how it should respond (which also makes the result-parsing step easier and more reliable). Even though the paper is about the RAG part, this is very similar and worth a comparison; a sketch is given below.
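As an illustration of what such a few-shot comparison could look like (the demonstration format and template are placeholders, not the prompts used in the paper):

```python
# Sketch of a few-shot prompt that demonstrates the expected answer format, in the
# spirit of Min et al. [1]. The template and field names are illustrative only.
def build_few_shot_prompt(query_sentence, head, tail, demonstrations, relation_labels):
    # demonstrations: list of (sentence, head, tail, label) tuples, e.g. from the training set
    parts = []
    for d_sentence, d_head, d_tail, d_label in demonstrations:
        parts.append(
            f"Sentence: {d_sentence}\nHead: {d_head}\nTail: {d_tail}\nRelation: {d_label}"
        )
    parts.append(
        f"Sentence: {query_sentence}\nHead: {head}\nTail: {tail}\n"
        f"Relation (answer with one of: {', '.join(relation_labels)}):"
    )
    return "\n\n".join(parts)
```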
The LLM selection could also be improved by including larger LLMs, such as Llama 13B or 70B, to see whether larger models perform better (and what the influence of the model choice is).
It is nice to see the authors present an error analysis, which already gives some insights, but a more in-depth analysis would be interesting.
This could include, e.g., an analysis at the level of individual relations (which relations are usually predicted correctly and which are usually wrong).
When an additional artificial relation type is used for the case where no relation could be extracted, one can also see whether a result is simply wrong or whether the LLM produces text (perhaps describing the correct relation) that does not mention the relation label itself. The micro average would then no longer hide these crucial details; see the sketch below.
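A minimal sketch of the kind of per-relation breakdown I have in mind, assuming lists of gold and predicted labels (including the artificial label) are already available:

```python
# Sketch of a per-relation evaluation that also surfaces the artificial label, so the
# micro average no longer hides which relations (or parsing failures) drive the errors.
from sklearn.metrics import classification_report

def per_relation_report(gold_labels, predicted_labels):
    labels = sorted(set(gold_labels) | set(predicted_labels))  # includes e.g. "unparseable"
    return classification_report(gold_labels, predicted_labels,
                                 labels=labels, zero_division=0)
```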
In Table 2, it would be nice to highlight the best values in each column (at least for F1 in SemEval).
Tables 4 and 5 could also be integrated into Table 2 to allow a direct comparison between the related work and the proposed approach (instead of combining differently selected proposed models in Tables 4/5; e.g., the best model for SemEval is not the best one for TACRED).
On page 8, line 17, the authors say that "it is clear that our RAG4RE approach, consisting of the relevant example sentence about the query,
has improved F1 scores". This is not really clear to me. Why should the model predict better if a similar sentence is contained in the prompt?
The selection of related work is good. A similar approach using T5 is presented by Han et al. [2] and could be included as well.
Interestingly, the results for TACRED and TACREV are much better than all of the SOTA approaches (including [2], which achieved 75.3 on TACRED and 84.0 on TACREV). Even though other approaches, e.g. [2], include more training data and fine-tune the models, their results are worse.
If the authors have any insights into why this is the case, it would be good to include them in the paper as well (is it simply due to the larger model, i.e. using T5-XL, or due to the prompt or the sentence selected by the RAG approach?).
Furthermore, a discussion of why the model is better by a large margin on TACRED and TACREV but worse than related work on Re-TACRED is necessary and would give more insight into the datasets and the proposed RAG model (similarly for SemEval).
Thus, I am looking forward to the next revision of the paper, with more analysis and a good justification for why adding a similar sentence yields better results than the SOTA approaches.
[1] Min et al.: Rethinking the Role of Demonstrations: What Makes In-Context Learning Work? https://arxiv.org/pdf/2202.12837.pdf
[2] Han et al.: Generative Prompt Tuning for Relation Classification. https://arxiv.org/pdf/2210.12435v1.pdf