Review Comment:
## **Summary of the Paper**
This paper proposes a retrieval-augmented generation (RAG) pipeline to enrich general-purpose knowledge graphs with "NIL" entities—people mentioned in historical sources but absent from Wikidata.
The paper addresses an information extraction challenge, **focusing on NIL (out-of-KG) historical person entities** mentioned in 19th-century musical periodicals. The authors propose a **retrieval-augmented generation (RAG) pipeline** that extracts structured knowledge for new and unreferenced entities by:
1. Selecting 6 core Wikidata properties related to Person
2. Framing property extraction as a question-answering task supported by sparse and dense retrievers over a historical corpus
3. Running multiple LLMs and retrievers in parallel and aggregating results via majority voting
4. Linking generated answers to Wikidata-compliant identifiers
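
As an illustration of the aggregation in step 3, majority voting over the answers of several model–retriever runs can be sketched as follows. The function name `majority_vote`, the light normalization, and the tie-breaking rule are assumptions made for this example; the paper's own aggregation is the one specified in its algorithms and may differ.

```python
from collections import Counter
from typing import Optional, Sequence


def majority_vote(answers: Sequence[Optional[str]]) -> Optional[str]:
    """Return the most frequent non-empty answer, or None if there is none.

    Ties are broken by first occurrence (an assumption for this sketch).
    """
    # Normalize lightly so that "Female" and " female " count as the same vote.
    cleaned = [a.strip().lower() for a in answers if a and a.strip()]
    if not cleaned:
        return None
    counts = Counter(cleaned)
    # most_common orders equal counts by first insertion.
    return counts.most_common(1)[0][0]


# Hypothetical answers from five runs for one property of one entity.
votes = ["female", "Female", "male", None, " female "]
print(majority_vote(votes))
```

Empty or missing answers are simply ignored here; a real pipeline would have to decide whether abstentions should count as votes.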
The work introduces two evaluation datasets:
- **QID-KG**, derived from existing Wikidata persons mentioned in the corpus
- **NIL-KG**, a manually curated gold standard for 113 missing persons
Extensive experiments compare 6 LLMs and 6 retrievers, analyze property-specific behavior, and discuss implications for historical bias mitigation in Wikidata.
**Results:** Strong performance for gender and family name, promising results for occupation and country of citizenship, and low performance for date of birth. The paper highlights the limited effect of RAG methods, even on well-known QIDs, for some properties.
---
## **1. Originality**
- Tackles a **highly relevant and underexplored problem**: automatic KG enrichment with historical NIL entities, and addresses gender bias by operationalizing methods to describe underrepresented figures
- Moves beyond entity linking to **KG construction suitable for direct integration with Wikidata**, a significant step forward
- The creation and release of **NIL-KG** are valuable contributions to the community
- Positions RAG explicitly for structured KG enrichment of NIL entities—an underexplored setting that goes beyond populating existing entries and has many potential applications
- Provides a **carefully designed experimental study** across models, retrievers, properties, and aggregation strategies
The contribution is **clearly novel** in the Semantic Web context and well aligned with SWJ's focus on reproducible, data-centric research. The paper goes beyond incremental improvements and offers both conceptual and empirical advances.
---
## **2. Significance of the Results**
### **Strengths**
- The pipeline is well specified (Algorithms 1 & 2) and technically sound
- Experimental rigor and validation
- Evaluates a wide matrix of model–retriever combinations across two gold standards (one NIL and one non-NIL), providing nuanced insights into when RAG and model scale help
- Includes property-wise analyses and granular evaluation for dates (exact, year, decade, century) and multi-label occupation
- Releases code and data, enabling follow-up and potential reuse
- Exceptionally thorough experimental design:
  - 6 LLMs (7B–200B+ range)
  - 6 retrievers (sparse + dense)
  - Zero-shot vs. RAG
  - Single-model vs. cross-model aggregation
- Clear separation between QID-KG (parametric memory available) and NIL-KG (true long-tail setting)
- Insightful property-level analysis rather than relying on aggregate scores
- Strong error analysis and interpretation of negative results (e.g., date of birth failures)
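
The graded date evaluation mentioned above (exact, year, decade, century) can be read as a set of increasingly lenient comparisons. The sketch below is only my reading of such a scheme: the function name `date_match_levels` and the convention that a century runs from year 1 to year 100 (so 1801–1900 is the 19th century) are assumptions, not taken from the paper.

```python
from datetime import date
from typing import Dict


def date_match_levels(pred: date, gold: date) -> Dict[str, bool]:
    """Compare a predicted and a gold date at four levels of granularity."""
    return {
        "exact": pred == gold,
        "year": pred.year == gold.year,
        # Decades are taken as 1840-1849, 1850-1859, ...
        "decade": pred.year // 10 == gold.year // 10,
        # Assumption: centuries run from xx01 to xx00 (1801-1900 = 19th).
        "century": (pred.year - 1) // 100 == (gold.year - 1) // 100,
    }


# Hypothetical dates, for illustration only.
print(date_match_levels(date(1839, 1, 1), date(1843, 7, 9)))
```

Reporting all four levels side by side makes it visible whether a model misses the date entirely or only its fine-grained part.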
### **Concerns and Limitations**
- The reliance on Wikidata labels and aliases for answer linking may propagate existing Wikidata biases and omissions. Memorization effects could have been evaluated by:
  - Hiding labels and aliases
  - Attempting to retrieve the QID directly
- The answer linking pseudocode (Algorithm 2) is detailed but basic
**Q1:** Could it rely on an existing entity linker?
➔ I'm missing details on entity linking errors:
**Q2:** Is the current strategy sufficient? How does it handle ambiguous entities?
- The RAG system struggles with missing context and poorly described entities
**Q3:** Would it be beneficial to expand context directly from Wikipedia content rather than using corpus data?
- The number of examples given in context is 3, but this could be emphasized more clearly in the article
**Q4:** Did you also try 1-shot learning?
**Q5:** Could your method integrate adaptive k-example learning based on the LLM's initial knowledge?
- The initial knowledge of the LLM seems essential.
**Q6:** Did you try to include in the prompt information on the volume of knowledge available (in the corpus or on Wikipedia)?
An interesting extension would be to add provenance descriptions: the current method does not attach explicit evidence citations to the triples it produces, although these are often required for Wikidata curation and human validation.
- The property selection step (K-means over property frequencies + elbow method) seems unnecessarily complex and insufficiently justified; it likely reduces to selecting the most frequent properties
- An analysis of gender effects on results could have been conducted but was not
- Computational cost is high, which may limit reproducibility despite good reporting of CO₂ impact. The paper does not propose fine-tuning smaller language models, which would be an interesting comparison
- Micro and Macro F1 scores would allow better comparison with the state of the art
**Q7:** Do your prompts instruct models to output "NIL/unknown" when the property is not supported by evidence?
**Q8:** How are unlinked or empty answers counted in precision, recall, and F1 (especially for NIL targets)?
The paper would benefit from data on expected and generated NIL or NULL values.
- Evaluation metrics already account for property types (e.g., graded date of birth matching), but could be better adapted to multi-label citizenship and occupation, for example with **Krippendorff- or Jaccard-based measures and/or** by using ontology-based data
- Experimental gaps or methodological issues
  - Retrieval quality is not directly measured (e.g., recall@k of evidence mentions), making it difficult to attribute gains or variability to retriever effectiveness. Measuring it may be interesting, since LLMs generate many answers.
  - Retriever performance appears poor, even for already-known QIDs.
**Q9:** Does the retriever need to be fine-tuned to retrieve this type of information better?
**Q10:** Could new information emerge from the extraction? Are the false positives really factual errors?
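
To make the retrieval-quality suggestion concrete, recall@k could be computed per question from a small set of manually annotated evidence passages. The sketch below assumes such annotations exist; the function name `recall_at_k` and the passage identifiers are hypothetical.

```python
from typing import Sequence, Set


def recall_at_k(ranked_ids: Sequence[str], relevant_ids: Set[str], k: int) -> float:
    """Fraction of relevant passages that appear among the top-k retrieved ones."""
    if not relevant_ids:
        raise ValueError("recall@k is undefined when no passage is relevant")
    top_k = set(ranked_ids[:k])
    return len(top_k & relevant_ids) / len(relevant_ids)


# Hypothetical ranking returned by a retriever and annotated evidence passages.
ranking = ["p3", "p7", "p1", "p9"]
evidence = {"p1", "p2"}
print(recall_at_k(ranking, evidence, k=3))
```

Averaging this value per property and per retriever would show whether weak end-to-end scores come from the retrieval step or from the generation step.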
Regarding the positioning with respect to the state of the art, the paper could compare its results with these two related works:
- [https://arxiv.org/abs/2505.12533](https://arxiv.org/abs/2505.12533): interesting since it focuses on NYT + TACRED (newspaper corpora) vs. Wikipedia data.
- [https://arxiv.org/pdf/2511.03407](https://arxiv.org/pdf/2511.03407): since the graph pattern explored is very similar.
Overall, the approach is **significant** and **transparently discussed**, including its weaknesses.
---
## **3. Clarity and Organization**
**Strengths**
- The paper is pleasant to read and straightforward.
- The pipeline, motivation, and case study are clearly presented and effectively tied to cultural heritage and long-tail knowledge gaps.
- Error patterns and the influence of retrieval vs. parametric memory are discussed.
- The paper is clearly structured and easy to follow despite its length.
- Figures and tables (especially Figures 3–5) effectively support the analysis.
- The running example ("Miss Orton") helps ground the pipeline.
- Writing is precise and well-referenced.
- The code repository is clear, well-commented, and documented.
**Minor Issues**
- Some abbreviations should be normalized or better explained: the retrieval method abbreviation "Boyer-M" isn't sufficiently explained or defined.
- Figures 3 and 5 could be enlarged to facilitate comparison, since page count is not limited.
- Figures 3 and 5 could integrate visual aids to better compare zero-shot/aggregation and RAG methods.
- Minor typo: "Eblow curve" on page 8.
Overall, clarity is **very good** and appropriate for a journal audience.
---
## **4. Reproducibility and Open Science**
**Strengths**
- Code (GitHub) and data (Zenodo) are publicly released under CC-BY-SA-4.0.
- Standard resources are used (Wikidata, Polifonia, OpenRouter).
- Clear documentation of models, retrievers, and evaluation procedures.
- Explicit reporting of computational cost and CO₂ emissions.
This aligns **very well** with SWJ reproducibility standards.
---
## **5. Suggestions and Questions for Improvement**
I am aware that my review raises many questions and that addressing all of them may be better suited to an extension of this work, so I would like to prioritise the following:
* Solve clarity issues: clarify abbreviations and improve charts as listed above.
* Try to better develop the entity linking part and/or the NULL/NIL generation.
* Macro and micro F1 computed across all properties may help to better compare the proposed benchmark with the state of the art.
* Try to evaluate manually to what extent the RAG context extension could miss the target by retrieving sentences that do not express the relevant fact.
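
For the macro/micro F1 suggestion, the two averages can be derived from per-property counts of true positives, false positives and false negatives. The sketch below is illustrative only: the helper names and the counts are invented for the example and are not results from the paper.

```python
from typing import Dict, Tuple

# (true positives, false positives, false negatives) for one property
Counts = Tuple[int, int, int]


def f1(tp: int, fp: int, fn: int) -> float:
    """F1 from raw counts; defined as 0.0 when there is nothing to score."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1(per_property: Dict[str, Counts]) -> float:
    """Unweighted mean of per-property F1: every property counts equally."""
    return sum(f1(*c) for c in per_property.values()) / len(per_property)


def micro_f1(per_property: Dict[str, Counts]) -> float:
    """F1 over pooled counts: frequent properties weigh more."""
    tp = sum(c[0] for c in per_property.values())
    fp = sum(c[1] for c in per_property.values())
    fn = sum(c[2] for c in per_property.values())
    return f1(tp, fp, fn)


# Toy counts for two hypothetical properties (not taken from the paper).
toy = {"property_a": (90, 5, 5), "property_b": (10, 30, 60)}
print(round(macro_f1(toy), 3), round(micro_f1(toy), 3))
```

Reporting both is informative here because a property that is easy and frequent can mask a property that is hard and rare in the micro average.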
---
## **Overall Recommendation**
**Recommendation: Accept (minor revisions)**
This work is an **original and well-executed contribution** to the Semantic Web community. It combines methodological innovation, rigorous experimentation, high-quality datasets, and clear societal motivation. The identified limitations are well acknowledged and don't undermine the core contribution.
With minor refinements—primarily addressing the suggestions for improvement noted above—the paper is well suited for publication in the **Semantic Web Journal**.