Constructing Domain-Specific Knowledge Graphs From Text: A Case Study on Subprime Mortgage Crisis

Tracking #: 3757-4971

Authors: 
Ali Hur
Naeem Janjua

Responsible editor: 
Guest Editors KG Construction 2024

Submission type: 
Full Paper
Abstract: 
This research paper details a novel methodology for constructing a domain-specific Knowledge Graph (KG) from unstructured text data, exemplified by a case study on the subprime mortgage crisis. The authors present a five-phase approach – specification, conceptualization, formalization, integration, and augmentation – to transform unstructured financial news articles from the MEANTIME corpus into a structured KG. This framework enables the extraction of valuable insights, revealing trends, correlations, and complex relationships among companies, market movements, and economic indicators. The KG's efficacy is demonstrated through its ability to answer complex queries related to the subprime mortgage crisis, highlighting its potential as a powerful tool for knowledge representation and decision-making in the financial domain.
Tags: 
Reviewed

Decision/Status: 
Reject

Solicited Reviews:
Review #1
By Christophe Cruz submitted on 06/Oct/2024
Suggestion:
Accept
Review Comment:

The paper "Constructing Domain-Specific Knowledge Graphs From Text: A Case Study on Subprime Mortgage Crisis" presents an innovative approach to constructing domain-specific knowledge graphs from textual data. The authors have chosen to focus on the subprime mortgage crisis as their case study, which provides a compelling and complex real-world application for their methodology.
In terms of originality, the paper offers a fresh perspective on extracting and structuring knowledge from complex financial texts. The methodology described appears novel in its application to this specific domain, potentially opening new avenues for research in other complex fields.
The significance of the results is evident in several aspects. The paper provides a methodology for creating domain-specific knowledge graphs and demonstrates its practical application to a high-impact scenario. The subprime mortgage crisis is a topic of considerable importance in the field of finance and economics, and the creation of a knowledge graph in this domain could serve as a valuable resource for researchers and practitioners alike. Furthermore, the methodology presented has the potential to be applied to other complex domains, extending its impact beyond the immediate case study.
The quality of writing in the paper is commendable. The authors have presented their ideas clearly and coherently, making the complex subject matter accessible to readers. The structure of the paper appears to be well-organized, allowing for a logical flow of ideas from the introduction of the methodology to its application in the case study.
Regarding the data file assessment, the authors have provided a GitHub repository with full details about the implementation and usage. This choice of platform is appropriate for long-term discoverability, as GitHub is widely used for code sharing and version control, making it easily accessible to other researchers. The provision of implementation details and usage instructions should facilitate reproducibility of the experiments described in the paper.
While a thorough assessment of the README file's clarity and completeness would require direct examination, the fact that the authors have included "full details" suggests a commitment to transparency and reproducibility. This is a significant strength of the paper, as it allows other researchers to build upon and verify the work presented.
The combination of a novel methodology, a relevant and impactful case study, and accessible resources makes this paper a significant and interesting contribution to the community.
In conclusion, this paper presents a valuable contribution to the field by providing a methodology for constructing domain-specific knowledge graphs and applying it to the complex domain of the subprime mortgage crisis. The high-quality writing, novel approach, and provision of resources for replication all contribute to making this a strong and impactful piece of research. Its potential to influence future work in both knowledge graph construction and financial analysis is considerable.

Review #2
By Paul Groth submitted on 09/Nov/2024
Suggestion:
Major Revision
Review Comment:

This article proposes a new methodology and system for knowledge graph construction and provides an implementation of that methodology in a well-documented framework. However, the evaluation is case-study based, and I found it difficult to assess the paper as a research contribution because of the lack of generalisability.

Specifically, the paper claims that "Existing methodologies to KG construction are predominantly constrained by their domain-specific focus, which limits their flexibility and generalizability for KG construction for other domains." However, it then proceeds to analyse the performance of the approach on a single domain in the financial sector. I would have expected to see the methodology applied to multiple domains to address this claim. Furthermore, I think there are in fact several general approaches to knowledge graph construction, for instance:

- Gerhard Weikum, Xin Luna Dong, Simon Razniewski and Fabian Suchanek (2021), "Machine Knowledge: Creation and Curation of Comprehensive Knowledge Bases", Foundations and Trends® in Databases: Vol. 10: No. 2-4, pp 108-490.
- Hofer, M., Obraczka, D., Saeedi, A., Köpcke, H., & Rahm, E. (2024). Construction of knowledge graphs: current state and challenges. Information, 15(8), 509.
- Gytė Tamašauskaitė and Paul Groth. 2023. Defining a Knowledge Graph Development Process Through a Systematic Review. ACM Trans. Softw. Eng. Methodol. 32, 1, Article 27 (January 2023), 40 pages. https://doi.org/10.1145/3522586
- Lingfeng Zhong, Jia Wu, Qian Li, Hao Peng, and Xindong Wu. 2023. A Comprehensive Survey on Automatic Knowledge Graph Construction. ACM Comput. Surv. 56, 4, Article 94 (April 2024), 62 pages. https://doi.org/10.1145/3618295

That being said, I think there are interesting elements of the proposed methodology in particular the focus on linguistic analysis. I also appreciated the underlying software text2graphs as a framework and the reflection on the role of KG construction with respect to outputs as property graphs.

I think the paper may have worked better as a systems paper or an application paper rather than as a research paper, for which evidence of generalisation is a key requirement.

Review #3
Anonymous submitted on 12/Dec/2024
Suggestion:
Reject
Review Comment:

* Summary:
The paper presents a methodology for constructing a domain-specific financial/stock market Labeled Property Graph from the MEANTIME corpus (unstructured text data). The presented results and validations revolve around a financial news article on the subprime mortgage crisis (Table #7). The methodology is based on a generic and well-known five-phase approach.

* Overall Evaluation (ranging from 0-10):
[Q]+ Quality: 5
[R]+ Importance/Relevance: 5
[I]+ Impact: 4
[N]+ Novelty: 4
[S]+ Stability: 6
[U]+ Usefulness: 4
[W]+ Clarity, illustration, and readability: 7
[P]+ Impression score: 5

* Dimensions for research contributions (ranging from 0-10):
(1) Originality (QRN): 4.7
(2) Significance of the results (ISU): 4.7
(3) Quality of writing (QWP): 5.7

* Overall Impression (1,2,3): 50.3
* Suggested Decision: Reject

* General comments:
- The paper is easy to read, well written, and articulates its main ideas clearly.
- §2-3 (Background and Related Work) cover appropriate and foundational references about methods, techniques, models, and tools for KG and ontology engineering.
- I had a good impression of the paper until §4. Unfortunately, from §5 onward, my positive impression rapidly decayed. In the end, I was disappointed with this work: to some extent, it feels like the paper presents a "cooking recipe" that only lists the ingredients.
- The "meat" of the content focuses on a case study about the subprime mortgage crisis: to transform an unstructured financial news article related to the stock market (table #7) from the MEANTIME corpus into a structured KG.
- The paper claims to present a "novel" five-phase approach. However, there's no apparent novelty in the proposed methodology, since it merely summarises generic and well-known steps from the KG/ontology engineering domain:
1) Specification: Ontology Requirement Specification Document—ORSD.
2) Conceptualization: concept map (high-level conceptual representation), intermediate representations: data dictionary.
3) Formalization.
4) Integration: the paper skips this part (§7.4).
5) Augmentation: sub-phases: rule-based enrichments (Entity Typing task and identifying "triggers" for events) and refinement.
- The presented work is weak and has several issues to address (see further below).
- The novelty claim is weak. One can read the following in some parts of the manuscript:
- "a novel methodology for constructing a domain-specific Knowledge Graph (KG) from unstructured text data".
- "a novel, domain-agnostic framework for the autonomous construction of Knowledge Graphs from unstructured text".
- However, the paper fails to show this claimed novelty.
- Additionally, this paper seems to overlap extensively with ref [104] (both papers are from the same authors). [104] has more technical merit than this one. It seems like the content of this manuscript is a "rehashed" version of [104]. If this appreciation is incorrect, the authors should describe how the manuscript differs and adds value compared to [104].
- In general, the presented paper requires major corrections and improvements.

* Specific comments:
- pag#05 / §4.1: "The announcement of the credit crisis came as a shock to the financial world" --> Not really. Any "seasoned financial analyst" like Bob would have known that the "bubble" was going to "burst" at any time, given the financial reports available since early 2007.
- §6: This section introduces an overview of the proposed 2-phase approach and architecture. It's too broad and doesn't go into detail. There is no apparent novelty, and there are no details on how the integration of the mentioned tasks/techniques was performed. (It's like a cooking recipe that only lists the ingredients.)
- §7.3 / Fig.#6: The ontology spec, in the form of a Turtle serialisation of the OWL2 definitions, should be available in the public repo. A more detailed concept map of the ontology should be presented instead of a Protégé screenshot.
- [pag#17 / §7.5] Define/cite/reference "TLINK" and "CLINK".
- §8.1 has only one sub-section (§8.1.1). This structure is not adequate. §8.1.1 should be removed (the title), and the content should be merged with §8.1.
- §8.1 refers to Figure #5, which comes directly from [104] (Figure #8). This should be cited.
- §8.2, §8.3: What is the value of these sub-sections? Basically, they only state that the KG will be used to answer the competency questions.
- Table #6: The intent of the "Contexts" and "Arguments" columns is unclear. Also, there are some typos and missing information:
- [row 1, Arguments]: Shouldn't it be "ARG1: [Market, StockMarketIndexMovement]"?
- [row 2, Arguments]: Why is there an ARG2? What does it mean?
- [rows 4 & 6, Arguments]: Why is there no ARG0? What does it mean?
- [pag#20 / §8.5] The proposed validation process is a set of queries based on Competency Questions. Did the authors involve any financial experts to corroborate the query output and the KG structure?
- [pag#21 / §8.5.2]
- "... and Lehman Brothers mentioned in the news"; "Lehman Brothers" doesn't even appear in the case study! (link from Table #7).
- "... statements from CEOs..."; the article does not include any statement from any CEO! (link from Table #7).
- How can the authors make such statements when the case study they used clearly doesn't mention many of these entities?
- This sub-section, along with §4.1, describes a "fairy tale" about "Bob" and nothing else. There's no evidence of any of those results. Where are they in the repo? Where are the queries and the results presented in the repo?
- This is highly disappointing. Many things don't make any sense with the content presented in this section.

* Questions:
- Why a Labeled Property Graph (LPG)?
- The implementation uses Neo4j's proprietary technologies (graph database and Cypher query language). The generated KG is an LPG, and it's not represented as an RDF-based KG.
- [pag#03 / §3] "... RDF presents several limitations... in the context of complex textual data. It often results in sparse graphs with limited structural detail, particularly when handling intricate relationships within diverse real-world scenarios". The authors developed an ontology as the KG schema; shouldn't this give a proper backbone structure for the KG?
- What about FAIR principles (https://www.go-fair.org/fair-principles/) if the KG is not based on Open Standards such as W3C's Semantic Web technologies? Are those principles not relevant to this proposed "framework"?
- The authors should make a more substantial justification for why their implementation is not based on Open Standards.
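- To make the trade-off concrete: in an LPG, role and provenance metadata can sit directly on a relationship, whereas plain RDF needs reification, named graphs, or RDF-star for the same statement. A minimal Cypher sketch (the labels and property names here are illustrative, not taken from the paper):

    // LPG: the semantic role and the source document live on the edge itself
    CREATE (c:Company {name: 'ACME'})
    CREATE (e:TEvent {trigger: 'fell'})
    CREATE (c)-[:PARTICIPATES_IN {type: 'ARG0', source: 'doc_12'}]->(e)

  If this is the "sparse graphs" limitation the authors allude to, the paper should spell the argument out rather than assert it.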
- Figures #1,2,3:
- Figure #1: It's too generic and resembles the NeOn Ontology Engineering Methodology.
- Are these figures significant contributions? The diagrams summarise well-known steps and techniques in Information Extraction, Computational Linguistics (NLP) tasks, and KG engineering.
- The claimed novelty is unclear. Additionally, these figures are not clearly aligned with the software artefact in the repo (see feedback below about Figure 5).
- §5:
1) Specification: Ontology Requirement Specification Document—ORSD. Where is the ORSD for this case study?
2) Conceptualization: concept map (high-level conceptual representation), intermediate representations: data dictionary. Where are the concept map and the data dictionary?
3) Formalization. Where is the OWL ontology file in the repo?
- [pag#07 / §5.4] Why didn't you include Wikidata in the "established knowledge graphs" spectrum list?
- [pag#08 / §5.4] Shouldn't the mapping go from the formalised model's entities into these external sources (i.e., the other way around)? (This corresponds to the Named-Entity Linking (NEL) task.)
- [pag#10 / §6.3.1] What is "text genre"?
- [pag#16 / §7.4] Define "R2ML".
- Figure #5: This figure should depict the mapping of how each component (Python script, Docker service, etc.) from the software artefact repo is used in each one of those boxes. What is the purpose of this diagram? Basically, there's no difference between this figure and figures 1-3: they are too broad and vague. The value of this figure is unclear.
- [pag#18 / §8.1.1] The presented validation results of Phase 1 raise many questions:
- What is the "ground truth" used to evaluate the performance of all the tasks (entity extraction, recognition of numeric values, temporal expression detection, etc.)? It is unclear and requires more explanations.
- "event recall (F1 score of 0.719)"; the recall metric has an F1 score?! This must be a typo, right? F1 is the harmonic mean of precision and recall; hence, recall can't have an F1 score.
- "precision in entity instances (F1 score of 0.550)"; same as above!
- The authors should include a detailed table of the results: the P, R, and F1 for all the tasks (similar to what you did in [104 / Table 3]). The evaluations should be included in the repo.
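- For reference, the standard definitions such a table should report per task, with TP/FP/FN counted against a gold-standard annotation:

    P  = TP / (TP + FP)
    R  = TP / (TP + FN)
    F1 = 2 * P * R / (P + R)

  Stating which gold standard was used would also resolve the "ground truth" question above.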
- [pag#19 / §8.4.2] Where is the *Domain Dictionary* located in the repo?
- [§8.4.1 / Table #5, §8.4.3 / Table #6] How are the rules applied? The algorithms are missing. A diagram depicting these "enrichments" would clarify the expected output after applying them.
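- To illustrate the level of detail that is missing: even one worked enrichment rule would help. For example, an entity-typing rule could be expressed in Cypher roughly as follows (purely illustrative; the label and property names here are not taken from the paper):

    // assign a domain label to entities whose surface form suggests a bank
    MATCH (e:Entity)
    WHERE e.name ENDS WITH 'Bank'
    SET e:FinancialInstitution

  The paper should present the actual rules at this level of precision, together with the algorithm that applies them.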
- [pag#22-23 / Table #9]
- What kind of id is "cb.id = 'Central banks across the world'"? Is "RegulatoryBody" a class in the ontology? Unclear.
- "(ec:EconomicCrisis)-[r {type:'ARG0'}]->(e:TEvent)", why only "ARG0"? What about "ARG1" and "ARG2"? Unclear.
- "... made by Bank of America Home Loans..." The entity doesn't even exist in the news article! (Link from Table #7).
- In the last query, why "Federal_Reserve" and not: "FederalReserve" or "Federal Reserve" or any other variation? Unclear.
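- As a concrete fix for the ARG0-only pattern above: if all argument roles are relevant, a query along these lines would retrieve them (a sketch that assumes the relationship property "type" and the labels quoted from Table #9):

    MATCH (ec:EconomicCrisis)-[r]->(e:TEvent)
    WHERE r.type IN ['ARG0', 'ARG1', 'ARG2']
    RETURN e, r.type

  If only ARG0 is meaningful for this competency question, the paper should say so explicitly.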
- [pag#21 / §8.5.4] How were the P, R, and F1 metrics calculated? What is the "ground truth" for the KG enrichment process (NER, events, and temporal expressions)? This is unclear. Did the authors manually build another KG from the article and run the same queries? How was the "relevancy" of the retrieved results measured? This requires a detailed explanation.
- [pag#23 / Figure #7] Shouldn't the *MarketMovement* event "fell" be unique in the graph? Why are there many "fell" bubbles? Explanation required.
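- For context: if each "fell" bubble deliberately represents a distinct event mention, that is a defensible design, but it should be stated. If a single canonical event node is intended instead, the usual Neo4j idiom is MERGE rather than CREATE (a sketch; the label and property name are illustrative):

    // collapses repeated mentions of the same trigger into one node
    MERGE (m:MarketMovement {trigger: 'fell'})

  Either way, the paper needs to justify the chosen event identity criterion.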
- [pag#24 / Figure #8]
- Why "ARG1" and not "ARG0"? What is the difference between "Frame" and "PublicStatement"?
- Why are "FrameArguments" modelled as entities and not literal values?
- Explanation required.
- [pag#24 / §9] From this section, it seems that the proposed framework only works for the specific stock market case study. So why are there claims about its generalisability, robustness, and versatility?
- After reviewing the paper, I have the following final observations regarding the first two stated "significant contributions to KG construction and semantic modelling" of this work (pag#02 / §1):
1) Robust stock market ontology.
- There's no evidence of this. As indicated above, the ontology spec (OWL file) doesn't exist in the repo.
2) Formalization method utilising Labeled Property Graphs (LPGs) to capture complex semantic relationships.
- How is the method formalised? There's little evidence in what is described in §6.

* About the companion resources ("Long-term stable URL for resources"). Title of the attached files: "Materials for mapping study: structured data to RDF".
(A) The repo resources are well organised. The content is clear and easy to understand.
(B) The provided resources appear to be complete. To some extent, one could replicate the presented case study.
(C) The chosen repository is appropriate for long-term repository discoverability.
(D) The provided artefacts (the companion documents for the study) are incomplete. Missing artefacts include the ontology spec, the concept map, the data dictionary, etc.
* Observations:
- There's an error in the following Jupyter notebook

* Minor corrections:
pag#03 / §3: "IE" is used but has not been defined ("Information Extraction").
pag#04 / §3: "semantic role labeling (SRL) (Shi & Lin, 2019)" --> Change this citation to the journal's numeric reference format.
pag#05 / §4.1: "Let’s assume, Bob is a seasoned financial analyst..." --> "Let’s assume, Bob, a seasoned financial analyst..."
pag#07~08 / §5.3~5.4: Missing citations/references/links for OWL, Description Logic, FrameNet, RDFS, and RDF.
pag#08 / §5.5: The two-phase listing should be "a)" and "b)".
pag#09 / Fig.#2 (Phase 1): "baed" should be "based"; the caption is too small and should be enlarged. (All table and figure captions should be enlarged.)
pag#10 / §7.1.1: "IPOs" is used but has not been defined ("Initial Public Offerings"); it's defined in Table #2.
pag#13 / Table #2; pag#19 / Table #5: Move each to the next page.
pag#14 / Table #2: "Chronicles chronological"
pag#15 / Table #4: The "Domain Label Types" should be in bold --> "FinancialIndicator", "EconomicActivity", etc.
Tables #5-6: they are not referenced in the text.
pag#20 / §8.5.1: The listing should start from "a)".
pag#21 / §8.5.2: "Table 7-3" does not exist. Shouldn't it be Table #9?
Figures #7-8: they are not referenced in the text.
pag#25~ / Ref:
- Many references are incomplete ([25], [49], [64], [71]...). See [48] as a complete example for arXiv ([25]).
- Add DOI/link and year to all.
- References [79] and [85] are duplicates.
- [91] "M. Fernández, A. Gómez-Pérez".