NeuSIRE: Neural-Symbolic Image Representation and Enrichment for Visual Understanding and Reasoning

Tracking #: 3247-4461

Muhammad Jaleed Khan
John Breslin
Edward Curry

Responsible editor: 
Guest Editors NeSy 2022

Submission type: 
Full Paper
The adoption of neural-symbolic hybrid approaches in visual intelligence is essential for progress toward seamless high-level understanding of, and reasoning about, visual scenes. To this end, Scene Graph Generation (SGG) is a promising and challenging task that involves predicting the objects, their attributes and the pairwise visual relationships in a visual scene to create a structured, symbolic scene representation, known as a scene graph. Scene graphs are utilized in downstream visual reasoning to perform a desired task, such as image captioning, visual question answering, image retrieval, multimedia event processing or image synthesis. The crowdsourced training datasets used for this purpose are highly imbalanced, and it is nearly impossible to collect and collate training samples for every visual concept or visual relationship due to the huge number of possible combinations of objects and relationship predicates. Leveraging commonsense knowledge is a natural solution for augmenting data-driven approaches with external knowledge to enhance the expressiveness and autonomy of visual understanding and reasoning frameworks. In this paper, we propose a neural-symbolic visual understanding and reasoning framework based on commonsense knowledge enrichment. Deep neural network-based techniques are used for object detection and multi-modal pairwise relationship prediction to generate a scene graph of an image, which is followed by rule-based algorithms that refine and enrich the scene graph using commonsense knowledge. The commonsense knowledge is extracted from a heterogeneous knowledge graph in the form of related facts and background information about the scene graph elements. The enriched scene graphs are then leveraged in downstream visual reasoning pipelines. We performed a comprehensive evaluation of the proposed framework using common datasets and standard evaluation metrics.
As a result of commonsense knowledge enrichment, the relationship recall scores R@100 and mR@100 increased from 36.5 and 11.7 to 39.1 and 12.6, respectively, on the Visual Genome (VG) dataset, and similar results were observed for the COCO dataset. The proposed framework outperformed the state-of-the-art methods in terms of R@K and mR@K on the standard split of VG. We incorporated scene graph-based image captioning and image generation models as downstream tasks of SGG with knowledge enrichment. With the use of enriched scene graphs, the SPICE and CIDEr scores obtained by the image captioning model increased from 20.7 and 115.3 to 23.8 and 131.4, respectively, and the proposed approach outperformed the state-of-the-art scene graph-based image captioning techniques in terms of SPICE and CIDEr scores while achieving comparable performance in terms of BLEU, ROUGE and METEOR scores. The qualitative results of image generation showed that the enriched scene graphs result in more realistic images in which the semantic concepts in the input scene graph can be more clearly observed. The encouraging results validate the effectiveness of knowledge enrichment in scene graphs using heterogeneous knowledge graphs. The source code is available at
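For readers unfamiliar with the R@K metric quoted above, it measures the fraction of ground-truth relationship triplets recovered among the top-K scored predictions (mR@K averages this per predicate class). The following is a minimal, self-contained sketch of the idea, assuming triplets are plain (subject, predicate, object) tuples and predictions are already sorted by confidence; it is an illustration, not the evaluation code used in the paper.

```python
def recall_at_k(gt_triplets, pred_triplets, k):
    """Fraction of ground-truth (subject, predicate, object) triplets
    found among the top-k predictions (assumed sorted by confidence)."""
    top_k = set(pred_triplets[:k])
    hits = sum(1 for t in gt_triplets if t in top_k)
    return hits / len(gt_triplets)

# Toy example (hypothetical triplets, not from the VG dataset):
gt = [("person", "on", "surfboard"), ("person", "wearing", "wetsuit")]
pred = [("person", "on", "surfboard"), ("person", "near", "wave"),
        ("person", "wearing", "wetsuit")]
print(recall_at_k(gt, pred, 2))  # 0.5: one of two GT triplets in the top-2
print(recall_at_k(gt, pred, 3))  # 1.0: both GT triplets in the top-3
```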
Major Revision

Solicited Reviews:
Review #1
Anonymous submitted on 23/Dec/2022
Review Comment:

The paper proposes an architecture for scene graph generation from pictures and evaluates the effectiveness of the method on downstream tasks such as scene graph-based image captioning and image generation models.

The approach proposed by the authors decomposes the complex task into three main phases: scene graph extraction, scene graph enrichment, and a downstream task (caption/image generation). The authors propose an architecture obtained by combining several components, each used to solve one of these sub-tasks.

The authors claim to propose a “neural-symbolic visual understanding and reasoning framework”. This description does not reflect the contribution presented in the rest of the paper, for two main reasons. Neuro-symbolic architectures are intended to have a tight integration of the neural part and the symbolic part, where both parts affect one another. The architecture presented in the paper looks more like a pipeline in which the symbolic part, i.e., the enrichment with the information available in a (set of) knowledge graphs, produces results that are then taken as input by the system performing the specific downstream task. That cannot be considered a valid contribution to neuro-symbolic architectures, as the addition of knowledge is done in a separate step and only its result is provided as input to a neural network. In NeSy architectures, the knowledge is injected into the model itself, either in the form of a loss or in the form of some special construct.
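To make the reviewer's distinction concrete, "knowledge injected as a loss" typically means turning a symbolic rule into a differentiable penalty added to the training objective, so the constraint shapes the model's weights rather than post-processing its output. A toy sketch under the product t-norm, for a hypothetical rule such as riding(x, y) → on(x, y) (the rule and probabilities are illustrative, not taken from the paper under review):

```python
def rule_violation_loss(p_antecedent, p_consequent):
    """Differentiable penalty for violating the rule A -> B under the
    product t-norm: large when A is predicted likely but B is not.
    A toy illustration of 'knowledge as a loss term', not the authors' method."""
    return p_antecedent * (1.0 - p_consequent)

# rule: riding(x, y) -> on(x, y)
print(rule_violation_loss(0.9, 0.2))   # roughly 0.72: strong violation
print(rule_violation_loss(0.9, 0.95))  # near zero: rule almost satisfied
```

Because the penalty is differentiable in the predicted probabilities, it can be added to the task loss and minimized by gradient descent, which is the "tight integration" the review contrasts with a post-hoc enrichment pipeline.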

In the related work, the authors completely neglect to report and discuss the literature on neuro-symbolic architectures which have been used for image processing. See for instance
Hitzler, Pascal, et al. "Neuro-symbolic approaches in artificial intelligence." National Science Review 9.6 (2022): nwac035.
Hassan, Muhammad, et al. "Neuro-Symbolic Learning: Principles and Applications in Ophthalmology." arXiv preprint arXiv:2208.00374 (2022).
for a survey of the approaches, or specific papers on semantic image interpretation:
van Krieken, Emile, Erman Acar, and Frank van Harmelen. "Analyzing differentiable fuzzy logic operators." Artificial Intelligence 302 (2022): 103602.
Buffelli, Davide, and Efthymia Tsamoura. "Scalable regularization of scene graph generation models using symbolic theories." arXiv preprint arXiv:2209.02749 (2022).

For scene graph generation, they use Faster R-CNN for object detection and labeling of object types. This model is extended with an additional LSTM-based model that is used to predict the relationship between pairs of objects. An evaluation of this architecture is provided in Figure 4. The evaluation of SGG performance is usually divided into two parts, namely object type detection and relation detection, since the second is a much more difficult task than the first. Furthermore, for object type classification the authors use state-of-the-art architectures, while for relation extraction they propose a specific method, which should be evaluated on its own.
In addition, the comparison with state-of-the-art performance is unconvincing. Assuming that the reported evaluation concerns only the relationships between objects, the baseline appears to fall below the state of the art, which is 26.1 / 33.5 / 38.4 (R@20, R@50, R@100), as described in
Teng, Yao, and Limin Wang. "Structured Sparse R-CNN for Direct Scene Graph Generation." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022, pp. 19437-19446.
The relationships between objects added by the post-processing that injects background knowledge, however, outperform these metrics, so it looks like there is indeed a contribution of the semantics to the task. However, there is something that I don’t completely understand, which I believe should be clarified. To evaluate the enriched graphs, do you also enrich the graphs of the test set? Otherwise, how can you train the system end-to-end? If the test graphs are indeed enriched, then the comparison with the standard method is not fair, since you also change the test set.

About the scene graph enrichment: from the examples shown in the paper, it seems that you add certain background relationships and not others. For instance, in the first row of Figure 5, why do you add the superclass (ISA) relationship on the node of type person (isa Human) and not the superclass of surfboard? Do you have some method to limit the number of additional nodes and relationships added to a scene graph?
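One answer to the reviewer's last question would be an explicit cap on added facts. A hypothetical sketch of such selective enrichment, assuming the scene graph is a list of (subject, predicate, object) tuples and the commonsense KG is reduced to a simple object-to-superclass lookup (all names are illustrative, not the authors' data structures):

```python
def enrich_scene_graph(triplets, kg, max_added=3):
    """Add ISA facts from a commonsense KG for the objects in a scene
    graph, capping the number of added edges so the graph does not blow
    up. Hypothetical sketch, not the paper's enrichment algorithm."""
    objects = {s for s, _, o in triplets} | {o for _, _, o in triplets}
    added = []
    for obj in sorted(objects):  # deterministic traversal order
        if obj in kg and len(added) < max_added:
            added.append((obj, "isa", kg[obj]))
    return triplets + added

kg = {"person": "human", "surfboard": "sports equipment"}
sg = [("person", "riding", "surfboard")]
print(enrich_scene_graph(sg, kg, max_added=1))
# [('person', 'riding', 'surfboard'), ('person', 'isa', 'human')]
```

With `max_added=1` only the first eligible object (here, alphabetically, person) receives its superclass edge, which mirrors the selective behaviour the reviewer observed in Figure 5; a real system would presumably rank candidate facts by relevance rather than alphabetically.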

Concerning the downstream tasks, the authors only provide examples of the positive effects of enrichment and do not sufficiently explain and discuss the problems that might arise from such a method. This would be necessary to evaluate if such a methodology is applicable in a given context.

The scientific contribution of the paper is limited, so it should be considered an experimental paper. However, the positive experimental results have already been presented in a very similar conference paper:
Khan, Muhammad Jaleed, John G. Breslin, and Edward Curry. "Expressive Scene Graph Generation Using Commonsense Knowledge Infusion for Visual Understanding and Reasoning." European Semantic Web Conference, Springer, Cham, 2022, pp. 93-112.
The only novelty seems to be the experimental evaluation of the downstream task of caption generation. Since I believe that is not sufficient for a top journal paper, I suggest rejecting the paper due to the limitations of its scientific and experimental contribution.

Review #2
Anonymous submitted on 09/Jan/2023
Minor Revision
Review Comment:


This paper proposes a neural-symbolic visual understanding and reasoning framework based on enriched scene graphs. Specifically, DNNs (for object detection and pairwise relationship prediction) are leveraged to first construct symbolic scene graphs, then rule-based graph refinement and enrichment are used to enrich the scene graph for downstream usage (scene graph-based image captioning, scene graph-based image generation). The authors conduct experiments mainly on two datasets, Visual Genome and Microsoft COCO. According to the experimental results, the proposed method is able to construct better scene graphs (higher R@K and mR@K for SGG evaluation) and achieve good performance on the two downstream tasks (scene graph-based image captioning, scene graph-based image generation).


1. Originality: The main contribution of this paper seems to be the scene graph enrichment, while the image representation learning and the downstream applications directly leverage existing methods. The idea of leveraging knowledge graphs to enrich scene graphs is not new (e.g., reference [44]). Thus, the originality of this paper is relatively limited in my understanding.

2. Significance: According to the experimental results, the proposed method is able to significantly improve the R@K and mR@K for SGG, and also bring improvement for downstream tasks (scene graph-based image captioning, scene graph-based image generation), thus I think the significance of this paper is relatively high.

3. Writing Quality: The overall writing is Okay, with some minor flaws. Specifically,
i) Page 8 Line 1~25: It is unclear why and how an LSTM is applied to the detected objects. What is the order in which the detected objects are fed into the LSTM, and what is the intuition for using an LSTM here?
ii) Table 2: It would be ideal to add horizontal separation between different methods for better readability.
iii) Some quote symbols need to be fixed (e.g., Page 3 Line 42).
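As a side note on point (i): one common heuristic in LSTM-based SGG models (e.g., MOTIFS-style context encoders) is to feed detected objects in left-to-right order of their bounding-box centres, which gives the sequence a spatial interpretation. A sketch of that ordering, assuming each detection is a (label, (x1, y1, x2, y2)) pair; whether the authors use this order is exactly what the review asks them to clarify.

```python
def order_objects_for_lstm(detections):
    """Sort detections by the x-coordinate of the bounding-box centre,
    a common heuristic for sequencing objects before an LSTM encoder."""
    return sorted(detections, key=lambda d: (d[1][0] + d[1][2]) / 2.0)

# Hypothetical detections: (label, (x1, y1, x2, y2))
dets = [("surfboard", (50, 80, 120, 100)), ("person", (10, 20, 60, 90))]
print([label for label, _ in order_objects_for_lstm(dets)])
# ['person', 'surfboard']
```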

4. Reproducibility: After checking the long-term stable URL for resources, I notice that most of the resources are included in the provided GitHub link; however, some code is missing. Specifically,
A) The README file is included.
B) Most of the data and code are included. However, there is still some code missing. In “SGG_Evaluation.ipynb”, to run SGG, “!(CUDA_VISIBLE_DEVICES=0 python -m torch.distributed.launch --master_port 10027 --nproc_per_node=1 tools/ \ --config-file "configs/e2e_relation_X_101_32_8_FPN_1x.yaml"...” is executed, but both the code (tools/) and the config file (configs/e2e_relation_X_101_32_8_FPN_1x.yaml) are missing.
C) The provided link is a GitHub link; however, some third-party data/code is leveraged, which may not be long-term accessible.
D) The provided data artifacts seem to be complete.