MTab4D: Semantic Annotation of Tabular Data with DBpedia

Tracking #: 2894-4108

Phuc Nguyen
Natthawut Kertkeidkachorn
Ryutaro Ichise
Hideaki Takeda

Responsible editor: 
Jens Lehmann

Submission type: 
Full Paper
Abstract:
Semantic annotation of tabular data is the process of matching table elements with knowledge graphs. As a result, table contents can be interpreted or inferred using knowledge graph concepts, making them useful in downstream applications such as data analytics and management. Nevertheless, semantic annotation tasks are challenging due to insufficient tabular data descriptions, heterogeneous schemas, and vocabulary issues. This paper presents an automatic semantic annotation system for tabular data, called MTab4D, to generate annotations with DBpedia in three annotation tasks: 1) Cell-Entity (CEA), 2) Column-Type (CTA), and 3) Column Pair-Property (CPA). In particular, we propose an annotation pipeline that combines multiple matching signals from different table elements to address schema heterogeneity, data ambiguity, and noise. Additionally, this paper provides insightful analysis and extra resources on benchmarking semantic annotation with knowledge graphs. Experimental results on the original and adapted datasets of the Semantic Web Challenge on Tabular Data to Knowledge Graph Matching (SemTab 2019) show that our system achieves impressive performance on the three annotation tasks. MTab4D's repository is publicly available at
Minor Revision

Solicited Reviews:
Review #1
By Vasilis Efthymiou submitted on 23/Oct/2021
Minor Revision
Review Comment:

I would like to thank the authors for their effort to improve the paper. The writing has improved a lot in the revised version, although there are still many grammar errors and typos that need to be fixed.
The strong points of the previous version are still valid, so I am going to focus this time on the weaker points that still exist and need to be addressed, in my opinion.
Before that, I would like to add to the previous strong points the excellent reproducibility and commend the authors for their efforts on that.

More important points:

- Section 2 - matching targets: you introduced matching targets, as per another reviewer's request, but they are not really well-defined. Moreover, you then mention them in the definition of the annotation tasks as if they are given as input, whereas identifying them is part of the tasks. This also makes the task definitions unclear. I am fine with keeping the notion of matching targets, but you should be extra careful about how you define and use them.

- Section 2 - Assumption 1: "The system could return incorrect answers if table elements are not available in the knowledge graph." -> what do you mean "could"? 1. That it only returns errors when the table elements are not in the KG? 2. That it could also return a correct result (impossible)? 3. That it could also return no annotation result, which would be the ideal scenario, but it sometimes mistakenly returns something? 4. That it always returns something? 5. None of the above/something else? Please clarify.

- Section 2 - Assumption 3: Please elaborate a bit on what exactly "independent" means (e.g., not part of a relational database, no joins?).

- Section 3: In the revised version, you changed "learnable weights" to always-equal weights (w = 1). I don't see a point in keeping the weights in all the equations, when they are always equal to 1.

- Section 4: I would move Section 4 to an Appendix, as it breaks the flow and looks more like a section from a demo paper.

- Section 5: I don't see why Section 5 is a separate section and not just a part of the intro (e.g., placed just after the contributions bullets). In fact, many parts are also repeated in the intro.

- Error analysis: In my previous review, I pointed out that the "Error Analysis" section was focusing on the benchmark errors and not the MTab errors, which would have been more interesting. The authors removed the error analysis section in the revised version and created a new section about improving the benchmark, which is good. However, I am still missing an actual error analysis of MTab. Your results show that F1 is not always 1 (which is natural, it's not a weak point). It is interesting to check those cases that MTab does not get right and, if possible, explain why this happens, using examples or patterns of errors (again, using the cases where the original or revised benchmark has the correct ground truth annotation).

- The writing has improved, but there are still many errors. Some are, indicatively, listed next.

Minor points:

- Abstract does not end with a period ('.').
- page 1, line 45, left column: "knowledge graphs (KB)" -> "(KG)"
- page 1, lines 34-35, right column: "an semantic annotation system for tabular data designed" -> "a semantic annotation system for tabular data, designed"
- page 2, lines 25-27: "Wang et al., (...) complexity. [5]." -> "Wang et al. [5], (...) complexity."
- page 3, line 1 (and elsewhere): You are still comparing to your arXiv paper [7], not the SemTab paper. My previous comment was not to pick one or the other, but cite BOTH and compare to both together (if they are identical), or to each of them separately (if they are different).
- page 3, lines 40-51, left column: I would use the phrase "We denote (something) as (symbol)", instead of "(something) is (symbol)". I would also say that " $\in Triples$, consist of a subject (...), a predicate (...), and an object (...)". Finally, your subscripts and their definitions are a bit confusing. For i and j you use = 1...N and 1...M, respectively, but for j_1 and j_2, you use [1,M]. I would prefer the latter notation, and perhaps use j and j' (resp. c_j and c_j') instead of j_1 and j_2, as j is already a number.
- page 3, right column: You should be extra careful with the problem definitions and assumptions. For example, the CEA task is not to annotate THE table cell with an entity, but many (as many as possible) table cells.
- page 3, line 49, right column: "representing" -> "represent"
- page 4, Assumption 4: "and the entities related to cell values are of the same type" -> "and the entities related to cell values of the same column are of the same type"
- page 4 (and elsewhere): when referring to specific steps, capitalize 'S', e.g., ("Step 1")
- page 4, line 36, left column: "We perform the four processes" -> "We perform the four following processes"
- page 4, line 48, left column: extra white space between "(NE)" and "."
- page 4, line 51, left column: "a entity-name" -> "an entity-name"
- page 4, lines 1-4, right column: why are objects, wars and books not associated with a SpaCy tag?
- page 5, Section 3.3: Please explain what type of index you are using; it looks like an Elasticsearch index.
- page 7, Equation 10: missing ')'
- page 9, line 26, left column: delete this line "where t_e (...)", as t_e is already introduced
- page 9, Section 4.1, line 21: change to "We provide the five following APIs:"
- page 9, Section 4.1: replace "The use could (use this API to)" with something like "This API does that"
- page 9, line 30: "responded object include" -> either "objects" or "includes"
- page 10, lines 29-30, right column: "statistic" -> "statistics"
- page 13, lines 32, 33, right column: please cite here as well the ESWC 2020 paper ("SemTab 2019") that introduced the metrics AH and AP.
- page 14, lines 9-10, left column: "We compare MTab4D with the others reported results from SemTab 2019" -> "We compare MTab4D with other systems, using the results reported in SemTab 2019"
- page 15: Section 7.1 starts on the right column, while the left column is blank
- page 18, Assumption 2: Please elaborate a bit more on what "classifying table types before matching" means

Review #2
Anonymous submitted on 11/Nov/2021
Minor Revision
Review Comment:

The paper is clearly a very significant and major revision of the prior submission. The paper clearly has merits for publication, and there is even more significance and potential impact with the numerous revisions made to the system. As a result, I believe that this paper, with a minor revision, should be accepted, but not as a 'full paper' but as a 'tools & systems paper'. That is, the type described here:
and the impact is quite clear with the available system and its incredible performance compared with state-of-the-art solutions.

The reason I do not recommend acceptance as 'full paper' is primarily due to the quality and style of writing, and in general lack of scientific rigor that is expected from a research paper published at a high-quality journal.

There are two issues that need to be fixed before the paper is published:
1) The paper claims reproducibility based on the available and very well-documented code base that allows reproducing the results reported in the paper. However, the results are only reproducible if the server is up and running. We all know that software systems cannot be maintained indefinitely and have a limited lifespan. The paper once published will be accessible for many years to come, but a large number of URLs cited (mostly as footnotes) will have a much shorter lifespan, potentially broken even by the time the paper is published. I suggest rewriting all the parts that refer to URLs to make them readable and useful even if the URL is not available. This means using citations of papers if possible, replacing citations of the entity search API URL with reference to the section that describes the entity search function, and moving implementation details to a separate section, under Evaluation.
2) The notations in the paper still need serious work and rewriting. This applies to most of the formulas, equations, and definitions in the paper. Starting from the basic definition of a Knowledge Graph. Why do you need to define a KG as (E, T, P, Triples)? You can define a KG as a collection of triples and then define what E, T, P are, but what you have is not even complete. For example, T is not just a set of types, but it is derived from the rdf:type predicate of the Triples. And then a “literal value” is not formally defined. Couldn’t you just re-use an existing definition of an RDF knowledge graph? The other really strange notation is when it comes to probabilities, i.e., equations 7-15. What is Pr(t|SpaCy_{c_j})!? This could make sense if SpaCy_{c_j} were a Boolean function with an input, but it is not. The definition you have for it in 3.4.4 is not a definition of a probability function. It seems to be some score function that you then normalize to view it as a probability function, but the normalization should be in equation 14, not just described in the text as you have it now. And you are not using the conditional probability notation properly: when you say P(A | B), that means "the conditional probability of A given B”, and that “the event of interest is A and the event B is known or assumed to have occurred”. t and SpaCy_{c_j} are not events! Also, the name of the function being the name of the software library is yet another indication that this paper is written like code documentation. The fact that SpaCy is used is irrelevant for understanding the function; it is just an implementation detail. You could replace SpaCy with any other NLP library that does the same task.
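To illustrate the reviewer's point about normalization, the following is a minimal sketch (not MTab4D's actual code; the score values and type labels are hypothetical) of how raw type scores from any NLP tagger can be explicitly normalized into a distribution, so that the "probability" used in a formula like equation 14 is defined by the normalization itself rather than only in the surrounding text:

```python
def normalize_scores(scores):
    """Turn non-negative raw type scores into a probability distribution."""
    total = sum(scores.values())
    if total == 0:
        # No evidence at all: fall back to a uniform distribution.
        n = len(scores)
        return {t: 1.0 / n for t in scores}
    return {t: s / total for t, s in scores.items()}

# Hypothetical raw scores for a column's candidate types, produced by
# some NER/typing component (SpaCy or any comparable library):
raw = {"Person": 6.0, "Place": 3.0, "Work": 1.0}
dist = normalize_scores(raw)
# dist is now a proper distribution: its values sum to 1.0
```

Writing the normalization as its own function makes the implementation library interchangeable, which is exactly the reviewer's argument: the NLP tool used is an implementation detail, not part of the function's definition.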

You also need to further proof-read the paper for English issues. Examples:
8.2: session -> Section
7.3: environmental -> environment
Page 2 col 2 line 24-28: “According to Wang et al., they only could” -> “Wang et al. state that only our system could generate the annotations”, also remove the stray '.'
In abstract, remove the mention of CEA, CTA, and CPA and say what they are. That is:
…to generate annotations with DBpedia in three annotation tasks: 1) Cell-Entity (CEA), 2) Column-Type (CTA), and 3) Column Pair-Property (CPA).
…to generate annotations with DBpedia in three annotation tasks: 1) matching table cells to entities 2) matching columns to types, and 3) matching pairs of columns to properties.

Review #3
Anonymous submitted on 12/Nov/2021
Major Revision
Review Comment:

Thank you for your update.
- As stated in the paper, the difference between the extension and the previous work is reproducibility and insightful analysis of the ground-truth dataset, which is a minor extension.
- You removed the language prediction part from step 1 in your approach, so if the table data is written in non-English, how can you solve this issue? How can you deal with multilingual data?
- In figure 3, “Core Attribute Prediction” should be “Subject Column Prediction” as written in the text.
- The APIs referenced in the GitHub repository are not working
- There are many English mistakes and grammar issues such as:
- Page 2, line 10: an type prefix -> a type prefix
- Page 3, line 6: the table -> table. Line 9: the future studies -> future studies.
- Line 39: a entity -> an entity
- data type -> datatype
- Page 4, line 51: a entity-name -> an entity-name
- Page 7, line 43: the type candidates -> the type of candidates
- Page 9, line 32: This signals -> these signals
- Page 10, line 12: 0.06 second -> 0.06 seconds
- Page 13, line 26: Precision scores is -> Precision scores are
- Table 2: orginal -> original