Lost and Found: Enriching Knowledge Graphs with “NIL” Persons from Historical Documents

Tracking #: 3919-5133

Authors: 
Arianna Graciotti
Nicolas Lazzari
Enrico Daga
Valentina Presutti

Responsible editor: 
Guest Editors 2025 OD+CH

Submission type: 
Full Paper
Abstract: 
Vast community-driven knowledge graphs (KGs), like Wikidata, are the primary reference data for Entity Linking (EL) applications. However, they exhibit significant coverage bias towards information that is widely popular on the Web, leading to underrepresentation of long-tail entities, particularly from non-contemporary contexts. Concurrently, ongoing mass digitisation of cultural heritage resources reveals numerous named entities and associated knowledge currently missing from general-purpose KGs. Enriching such KGs with these ``NIL'' entities presents an opportunity to enhance their completeness and mitigate biases, such as gender disparity in the representation of historical figures. In this article, we investigate an approach based on retrieval-augmented generative AI to capture information about NIL entities and generate structured KGs suitable for integration into Wikidata. The approach is applied to the case of persons unknown to Wikidata but mentioned in a collection of musical periodicals of the 19th century. We empirically select 6 properties used in Wikidata on entities of that type and create a manually annotated NIL-entities KG as a gold standard for evaluation. Through comprehensive experiments, we evaluate 6 State-of-the-Art Large Language Models (LLMs) from different vendors, combined with 6 different State-of-the-Art retrievers. Our results demonstrate significant variations in performance across model-retriever combinations, with a high accuracy for gender identification and family name, promising results for occupation and country of citizenship, and low for date of birth. We report on a detailed error analysis and discuss the potential of our approach for mitigating historical bias in Wikidata.
Tags: 
Reviewed

Decision/Status: 
Minor Revision

Solicited Reviews:
Review #1
By Paul Kelly submitted on 06/Oct/2025
Suggestion:
Minor Revision
Review Comment:

The work demonstrates moderate originality in applying RAG to NIL entity enrichment, with significance stemming from property-specific performance insights and findings that mid-range models with retrieval can match larger models. Writing quality is generally strong, though some claims lack empirical support or citation (easily fixed, see later comments). The GitHub repo is well-organized with a clear README, but its completeness could be improved for replication: e.g. the QID-KG dataset (492 entities) appears to be missing.

Review #2
By Hang Dong submitted on 25/Oct/2025
Suggestion:
Minor Revision
Review Comment:

This work provides a RAG-based framework to address NIL entity insertion into the KG, Wikidata. The work focuses on the musical cultural heritage domain, and especially on persons. The work has very good originality in this case.

The results are significant on NIL-KG, an expert-annotated dataset contributed by this work, and in comparison to the QID-KG dataset with known entities. Extensive evaluation results show variations in performance across LLM and retriever settings, with a meaningful summary of results w.r.t. different types of properties and choice of LLMs (e.g. by size). The work has an impact on building gender-balanced KGs by reducing gender bias in Wikidata in the historical music domain. The quality of writing is also very good throughout the paper.

The strength of this work includes the meaningful methodology of RAG to use context to "hook" the LLM parameters to provide answers about a NIL entity, and the analysis and impact on mitigating the historical, recency, and gender bias in the KG. The extensive evaluation is well analysed, even with significant variations in the properties studied. The future work provides important directions as well.

There are several minor issues that can be discussed:
First, regarding the features used for K-means to cluster the properties, in Section 3.1: is the feature the number of entities per property, or are other features used?

Second, the methods for linking answers to Wikidata and for Majority voting are string-based or rule-based: they are effective but may produce some errors, and more complex methods may be possible. For example, for answer linking, answers may provide a new mention of a QID which is not in the look-up table, and for the voting, the truth may belong to the minority in LLMs for some cases. The impact of these methodological choices may be discussed.

Regarding the data file in GitHub/Zenodo, NIL-KG.csv, it would be good to specify how the “retrived_context_texts” are derived: which retrieval model was used to create them, given that there are several retrieval model options, and why there is only one set of “retrived_context_texts”.

Typos:
Page 6, “This will step can produce”
Page 12, “intuirively”

Review #3
By Célian Ringwald submitted on 14/Jan/2026
Suggestion:
Minor Revision
Review Comment:

## **Summary of the Paper**

This paper proposes a retrieval-augmented generation (RAG) pipeline to enrich general-purpose knowledge graphs with "NIL" entities—people mentioned in historical sources but absent from Wikidata.

The paper addresses an information extraction challenge, **focusing on NIL (out-of-KG) historical person entities** mentioned in 19th-century musical periodicals. The authors propose a **retrieval-augmented generation (RAG) pipeline** that extracts structured knowledge for new and unreferenced entities by:

1. Selecting 6 core Wikidata properties related to Person
2. Framing property extraction as a question-answering task supported by sparse and dense retrievers over a historical corpus
3. Running multiple LLMs and retrievers in parallel and aggregating results via majority voting
4. Linking generated answers to Wikidata-compliant identifiers
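Step 3 of the summary above can be illustrated with a minimal sketch of majority voting over the answers returned by several model-retriever runs for one property of one entity. This is not the authors' implementation: the function names, the string normalisation, and the `min_share` threshold are assumptions made for illustration only.

```python
from collections import Counter
from typing import Optional


def normalise(answer: str) -> str:
    """Lower-case and collapse whitespace so trivially different strings match."""
    return " ".join(answer.lower().split())


def majority_vote(answers: list[Optional[str]], min_share: float = 0.5) -> Optional[str]:
    """Return the most frequent non-empty answer, or None when no answer wins.

    `min_share` is a hypothetical threshold: the winner must account for
    strictly more than this share of the non-empty answers.
    """
    cleaned = [normalise(a) for a in answers if a and a.strip()]
    if not cleaned:
        return None  # every run abstained or returned an empty string
    winner, count = Counter(cleaned).most_common(1)[0]
    return winner if count / len(cleaned) > min_share else None
```

A vote of this kind returns `None` on a tie, which makes explicit the case where the correct answer is held by a minority of runs and is therefore lost.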

The work introduces two evaluation datasets:

- **QID-KG**, derived from existing Wikidata persons mentioned in the corpus
- **NIL-KG**, a manually curated gold standard for 113 missing persons

Extensive experiments compare 6 LLMs and 6 retrievers, analyze property-specific behavior, and discuss implications for historical bias mitigation in Wikidata.

**Results:** Strong performance for gender and family name, promising results for occupation and country of citizenship, and low performance for date of birth. The paper highlights the limited effect of RAG methods, even on well-known QIDs, for some properties.

---

## **1. Originality**

- Tackles a **highly relevant and underexplored problem**: automatic KG enrichment with historical NIL entities, and addresses gender bias by operationalizing methods to describe underrepresented figures
- Moves beyond entity linking to **KG construction suitable for direct integration with Wikidata**, a significant step forward
- The creation and release of **NIL-KG** are valuable contributions to the community
- Positions RAG explicitly for structured KG enrichment of NIL entities—an underexplored setting that goes beyond populating existing entries and has many potential applications
- Provides a **carefully designed experimental study** across models, retrievers, properties, and aggregation strategies

The contribution is **clearly novel** in the Semantic Web context and well aligned with SWJ's focus on reproducible, data-centric research. The paper goes beyond incremental improvements and offers both conceptual and empirical advances.

---

## **2. Significance of the Results**

### **Strengths**

- The pipeline is well specified (Algorithms 1 & 2) and technically sound
- Experimental rigor and validation
- Evaluates a wide matrix of model–retriever combinations across two gold standards (one NIL and one non-NIL), providing nuanced insights into when RAG and model scale help
- Includes property-wise analyses and granular evaluation for dates (exact, year, decade, century) and multi-label occupation
- Releases code and data, enabling follow-up and potential reuse
- Exceptionally thorough experimental design:
  - 6 LLMs (7B–200B+ range)
  - 6 retrievers (sparse + dense)
  - Zero-shot vs. RAG
  - Single-model vs. cross-model aggregation
  - Clear separation between QID-KG (parametric memory available) and NIL-KG (true long-tail setting)
- Insightful property-level analysis rather than relying on aggregate scores
- Strong error analysis and interpretation of negative results (e.g., date of birth failures)
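The graded date evaluation mentioned in the list above (exact, year, decade, century) can be sketched as follows. This is an illustrative reading, not the paper's evaluation code: the function name is hypothetical, decades are assumed to be calendar decades (1820 to 1829), and centuries are assumed to follow the 1801 to 1900 convention.

```python
from datetime import date


def date_match_levels(predicted: date, gold: date) -> dict[str, bool]:
    """Compare two dates at four granularities, from strictest to loosest."""
    return {
        "exact": predicted == gold,
        "year": predicted.year == gold.year,
        # Calendar decade: 1820..1829 share the same integer quotient.
        "decade": predicted.year // 10 == gold.year // 10,
        # Century convention assumed here: 1801..1900 is the 19th century.
        "century": (predicted.year - 1) // 100 == (gold.year - 1) // 100,
    }
```

Reporting all four levels separates predictions that are close in time from predictions that are entirely wrong, which a single exact-match score cannot do.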

### **Concerns and Limitations**

- The reliance on Wikidata labels and aliases for answer linking may propagate existing Wikidata biases and omissions. Memorization effects could have been evaluated by:
  - Hiding labels and aliases
  - Attempting to retrieve the QID directly
- The answer linking pseudocode (Algorithm 2) is detailed but basic

**Q1:** Could it rely on an existing entity linker?

➔ I'm missing details on entity linking errors:

**Q2:** Is the current strategy sufficient? How does it handle ambiguous entities?

- The RAG system struggles with missing context and poorly described entities

**Q3:** Would it be beneficial to expand context directly from Wikipedia content rather than using corpus data?

- The number of examples given in context is 3, but this could be emphasized more clearly in the article

**Q4:** Did you also try 1-shot learning?

**Q5:** Could your method integrate adaptive k-example learning based on the LLM's initial knowledge?

**Q6:** Given that the initial knowledge of the LLM seems essential, did you try to include in the prompt data on the volume of knowledge available (in the corpus or on Wikipedia)?

An interesting extension would integrate provenance description, as the current method does not attach explicit evidence citations to produced triples—often required for Wikidata curation and human validation.

- The property selection step (K-means over property frequencies + elbow method) seems unnecessarily complex and insufficiently justified; it likely reduces to selecting the most frequent properties
- An analysis of gender effects on results could have been conducted but was not
- Computational cost is high, which may limit reproducibility despite good reporting of CO₂ impact. The paper does not propose fine-tuning smaller language models, which would be an interesting comparison
- Micro and Macro F1 scores would allow better comparison with the state of the art
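The micro and macro F1 scores requested in the last point above differ only in where the averaging happens. The sketch below assumes that true positive, false positive, and false negative counts are available per property; the `Counts` container and the function names are hypothetical, not part of the paper's code.

```python
from dataclasses import dataclass


@dataclass
class Counts:
    tp: int  # true positives
    fp: int  # false positives
    fn: int  # false negatives


def f1(tp: int, fp: int, fn: int) -> float:
    """F1 = 2TP / (2TP + FP + FN); defined as 0.0 when there is nothing to score."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1(per_property: dict[str, Counts]) -> float:
    """Average the per-property F1 scores, so every property weighs the same."""
    if not per_property:
        return 0.0
    scores = [f1(c.tp, c.fp, c.fn) for c in per_property.values()]
    return sum(scores) / len(scores)


def micro_f1(per_property: dict[str, Counts]) -> float:
    """Pool the counts over all properties first, so frequent properties dominate."""
    tp = sum(c.tp for c in per_property.values())
    fp = sum(c.fp for c in per_property.values())
    fn = sum(c.fn for c in per_property.values())
    return f1(tp, fp, fn)
```

Macro F1 exposes weak properties such as date of birth, whereas micro F1 reflects overall triple-level quality; reporting both would support the comparison the reviewer asks for.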

**Q7:** Do your prompts instruct models to output "NIL/unknown" when the property is not supported by evidence?

**Q8:** How are unlinked or empty answers counted in precision, recall, and F1 (especially for NIL targets)?

The paper would benefit from data on expected and generated NIL or NULL values.

- Evaluation metrics already account for property types (e.g., graded date of birth matching), but could be better adapted for multi-label citizenship and occupation (Krippendorff- or Jaccard-based) and/or by using ontology-based data
- Experimental gaps or methodological issues:
  - Retrieval quality is not directly measured (e.g., recall@k of evidence mentions), making it difficult to attribute gains or variability to retriever effectiveness. Measuring it may be interesting, since LLMs generate many answers.
  - The retrievers seem to perform poorly, even for already-known QIDs.
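A direct measurement of retrieval quality, as suggested above, could take the form of recall@k over annotated evidence passages. The sketch below assumes that each query has a ranked list of retrieved passage identifiers and a non-empty set of passages annotated as expressing the relevant fact; the identifiers and function names are hypothetical.

```python
def recall_at_k(retrieved_ids: list[str], relevant_ids: set[str], k: int) -> float:
    """Fraction of the relevant passages that appear among the top-k retrieved ones."""
    if not relevant_ids:
        raise ValueError("recall@k is undefined without any relevant passage")
    top_k = set(retrieved_ids[:k])
    return len(top_k & relevant_ids) / len(relevant_ids)


def mean_recall_at_k(runs: list[tuple[list[str], set[str]]], k: int) -> float:
    """Average recall@k over (ranked list, relevant set) pairs, one pair per query."""
    if not runs:
        return 0.0
    return sum(recall_at_k(ranked, relevant, k) for ranked, relevant in runs) / len(runs)
```

Computed per retriever and per property, such a score would help separate failures caused by missing evidence from failures caused by the generation step.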

**Q9:** Would the retriever need to be fine-tuned to retrieve this type of information better?

**Q10:** Could new information emerge from the extraction? Are the false positives actual factual errors?

Concerning the state of the art and the positioning of the paper, the results could be compared with these two related works:

- [https://arxiv.org/abs/2505.12533](https://arxiv.org/abs/2505.12533): interesting since it focuses on NYT + TACRED (newspaper corpora) vs. Wikipedia data.
- [https://arxiv.org/pdf/2511.03407](https://arxiv.org/pdf/2511.03407): since the graph pattern explored is very similar.

Overall, the approach is **significant and transparently discussed**, including its weaknesses.

---

## **3. Clarity and Organization**

**Strengths**

- The paper is pleasant to read and straightforward.
- The pipeline, motivation, and case study are clearly presented and effectively tied to cultural heritage and long-tail knowledge gaps.
- Error patterns and the influence of retrieval vs. parametric memory are discussed.
- The paper is clearly structured and easy to follow despite its length.
- Figures and tables (especially Figures 3–5) effectively support the analysis.
- The running example ("Miss Orton") helps ground the pipeline.
- Writing is precise and well-referenced.
- The code repository is clear, well-commented, and documented.

**Minor Issues**

- Some abbreviations should be normalized or better explained: the retrieval method abbreviation "Boyer-M" isn't sufficiently explained or defined.
- Figures 3 and 5 could be enlarged to facilitate comparison, since page count is not limited.
- Figures 3 and 5 could integrate visual aids to better compare zero-shot/aggregation and RAG methods.
- Minor typo: "Eblow curve" on page 8.

Overall, clarity is **very good** and appropriate for a journal audience.

---

## **4. Reproducibility and Open Science**

**Strengths**

- Code (GitHub) and data (Zenodo) are publicly released under CC-BY-SA-4.0.
- Standard resources are used (Wikidata, Polifonia, OpenRouter).
- Clear documentation of models, retrievers, and evaluation procedures.
- Explicit reporting of computational cost and CO₂ emissions.

This aligns **very well** with SWJ reproducibility standards.

---

## **5. Suggestions and Questions for Improvement**

I am aware that my review is full of questions and that addressing all of them may be better suited to an extension, so I would like to prioritise the following:
* Solve clarity issues: clarify abbreviations and improve charts as listed above.
* Try to better develop the Entity Linking part and/or the NULL/NIL generation.
* F1 macro/micro computed across all properties may help better compare the benchmark proposed with the State of the Art.
* Try to evaluate manually to what extent the RAG context extension could miss the target by retrieving sentences that do not express the relevant fact.

---

## **Overall Recommendation**

**Recommendation: Accept (minor revisions)**

This work is an **original and well-executed contribution** to the Semantic Web community. It combines methodological innovation, rigorous experimentation, high-quality datasets, and clear societal motivation. The identified limitations are well acknowledged and don't undermine the core contribution.

With minor refinements—primarily addressing the suggestions for improvement noted above—the paper is well suited for publication in the **Semantic Web Journal**.