CMDI Roadmap: Visualization, Interaction and Analysis of Heterogeneous Textbook Resources

Tracking #: 2456-3670

Christian Scheel
Francesca Fallucchi
Ernesto William De Luca

Responsible editor: 
Philipp Cimiano

Submission type: 
Full Paper
Historically grown research projects, headed by researchers with limited understanding of data sustainability, data reusability and standards, often lead to data silos. While the data is very valuable, it cannot be used by any service except the tool it was prepared for. Research institutes face the problem that, over the years, the number of such data graveyards will increase because new projects will always be designed from scratch. After describing the problem in detail and giving an overview of the lessons learnt from visualization, data exploration and interactive data approaches, we show how to overcome the dispersion produced by data silos, a multitude of metadata formats and outdated tools by using the CMDI, suggested by CLARIN. In this work we propose a CMDI-based approach for data rescue and data reuse, in which data is retroactively joined into one repository, minimising the implementation effort of future research projects.

Solicited Reviews:
Review #1
By Eetu Mäkelä submitted on 29/May/2020
Review Comment:

I think this manuscript suffers from a lack of coherence. It is framed as aiming to support digital humanities research and shows images of very nice-looking user interfaces for that purpose. Looking below the surface, however, 1) those interfaces seem to come from separate work reported elsewhere and do not seem to have anything to do with the data reuse aspect at the core of this paper, and 2) what comes out of the data reuse aspect seems to be a simple text and faceted search interface operating on quite heterogeneous content, which does little to demonstrate the worth of the data integration. In truth, very disparate search results easily lead to a confusing experience, but this is neither taken into account nor analyzed.

To improve the content, I would suggest either more explicitly targeting just the "various ways to do DH research on textbook resources" (possibly in another venue, depending on how much data integration those actually involve) or otherwise going into much more detail on the promises, problems and technical solutions of the central data integration work described herein (importantly including a discussion of how the different source types being integrated fit together and support concrete use cases).

After determining the focus, the introduction should be cut by at least half to get the point across more quickly, and all other content should be edited to follow a common narrative, instead of what now seem to be distinct bits and pieces interjected in between. For example, most of the article currently seems to talk about the integration of the datasets mentioned on page 3, but page 6 then seems to list and discuss completely different data with regard to the interfaces, and no explicit relationship (or lack thereof) between the two is raised.

Review #2
By Andrea Giovanni Nuzzolese submitted on 15/Jun/2020
Review Comment:

The paper presents a solution for opening data silos associated with Digital Humanities (DH) projects by using the Component Metadata Infrastructure (CMDI). The solution is applied to a number of datasets about textbooks managed by the Georg Eckert Institute (GEI), which are used as a test case. Those datasets are heterogeneous in (i) semantics, (ii) formats and (iii) sizes. The authors provide an analysis of the GEI datasets from different perspectives, which include data visualisation, exploration, and interaction. This analysis aims at identifying general requirements that Linked Data applications should have and linked datasets should provide. Finally, the authors conclude that the CMDI is a fair solution for enabling the reuse, interoperability, and interaction of data in the DH domain.

==== Overall comments ====
The paper is in general well structured.
Nevertheless, the paper shows significant weaknesses that, in my opinion, prevent its publication in its current form.

=== Strengths ===
The problem of opening data silos by solving heterogeneity issues is challenging and relevant to the Semantic Web Journal.
The idea of identifying requirements for homogenising heterogeneous datasets in DH in a bottom-up fashion (i.e. by analysing existing applications that use legacy data) is interesting.

=== Weaknesses ===
However, the analysis of existing datasets resulting from past GEI projects should be significantly extended. In more detail, the paper lacks a link between the description of data visualisation, exploration, and interaction schemata (as resulting from legacy applications) and the adoption of CMDI.
First, it is unclear why CMDI should be adopted, besides the fact that "CLARIN assumed that metadata for language resources and tools existed in a variety of formats, the descriptions of which contained specific information for a particular research community". Second, the authors completely overlook how CMDI should be adopted.
The paper misses a discussion about how CMDI has been used. To be honest, the proof of concept presented in Section 3.4 does not add any scientific value to the paper.
Furthermore, the paper lacks scientific rigour. It gives a high-level view of the problem, but the problem is never tackled as thoroughly as a work submitted as a research paper requires.
The related work section is not sufficient and in general too shallow.
The authors take into account only a very limited range of state-of-the-art works that tackle a similar problem.