Hierarchical Blockmodelling for Knowledge Graphs

Tracking #: 3779-4993

Authors: 
Marcin Pietrasik
Marek Reformat
Anna Wilbik

Responsible editor: 
Maria Maleshkova

Submission type: 
Full Paper
Abstract: 
In this paper, we investigate the use of probabilistic graphical models, specifically stochastic blockmodels, for the purpose of hierarchical entity clustering on knowledge graphs. These models, seldom used in the Semantic Web community, decompose a graph into a set of probability distributions. The parameters of these distributions are then inferred, allowing for their subsequent sampling to generate a random graph. In a non-parametric setting, this allows for the induction of hierarchical clusterings without prior constraints on the hierarchy's structure. Specifically, this is achieved by integrating the Nested Chinese Restaurant Process and the Stick Breaking Process into the generative model. In this regard, we propose a model leveraging such integration and derive a collapsed Gibbs sampling scheme for its inference. To aid in understanding, we describe the steps in this derivation and provide an implementation of the sampler. We evaluate our model on synthetic and real-world datasets and quantitatively compare it against benchmark models. We further evaluate our results qualitatively and find that our model is capable of inducing coherent cluster hierarchies in small-scale settings. The work presented in this paper provides the first step towards the further application of stochastic blockmodels to knowledge graphs on a larger scale. We conclude the paper with potential avenues for future work on more scalable inference schemes.
Tags: 
Reviewed

Decision/Status: 
Minor Revision

Solicited Reviews:
Review #1
Anonymous submitted on 19/Mar/2025
Suggestion:
Accept
Review Comment:

As this manuscript was submitted as a 'full paper', it is reviewed along the usual dimensions for research contributions.

S1. This reasonably well-written 29-page manuscript (SWJ 3779-4993), plus 5 pages of references and four appendices, is original and presents scientific results.
S2. As a revised version of SWJ 3698-4912, it addresses the reviewers' previous comments.
S3. The data file on GitHub appears to be complete and well organized. It contains a README file, which allows replication of experiments.
S4. The authors, Pietrasik et al., integrated the Nested Chinese Restaurant Process and the Stick Breaking Process into the generative model. They also derived a collapsed Gibbs sampling scheme for its inference.
S5. They evaluated the model by comparing it against benchmark models using both synthetic and real-world datasets.

W1. Unfortunately, the evaluation results show that the model underperformed on the SBT data set when compared to the baseline approaches. Although the authors explained the reasons in their responses/supplementary file, it would be helpful if they discussed how the issue could be mitigated.

Review #2
By Yujia Zhang submitted on 11/Apr/2025
Suggestion:
Minor Revision
Review Comment:

This paper presents a solid background on probabilistic models and introduces a novel hierarchical clustering approach without prior constraints. The evaluation on synthetic and real-world datasets provides a basis for comparison with existing models. I have several concerns and suggestions for improvement, outlined below:
Equation (1), line 40: Please define 'ijr'.
Figure 1: Consider adding a workflow diagram to this figure. Illustrating how the techniques are integrated into the processing pipeline would significantly enhance clarity and conciseness.
Figure 7: Please clarify the node representations by using shading to distinguish between observed and hidden nodes. Also, please indicate the repetition of variables by using a rectangle (plate).
Page 22, lines 11 and 12: Include the equations inline, consistent with the formatting used earlier in the paper.
Figure 9: Please provide additional context regarding the entities in A, D, and H. What are their relationships, and how do they relate to the toy example?
Model Performance: The model's performance appears to be less competitive compared to benchmark models. Do you have any strategies for improvement?