datos.bne.es: a Library Linked Data Dataset

Paper Title: datos.bne.es: a Library Linked Data Dataset
Daniel Vila-Suero, Boris Villazón-Terrazas, Asunción Gómez-Pérez
We describe the datos.bne.es library dataset. The dataset makes available the authority and bibliography catalogue from the Biblioteca Nacional de España (BNE, National Library of Spain) as Linked Data. The catalogue contains around 7 million authority and bibliographic records. The records in MARC 21 format were transformed to RDF and modelled using IFLA (International Federation of Library Associations) ontologies and other well-established vocabularies such as RDA (Resource Description and Access) or the Dublin Core Metadata Element Set. A tool named MARiMBA automated the RDF generation process and the data linkage to DBpedia and other library linked data resources such as VIAF (Virtual International Authority File) or GND (Gemeinsame Normdatei, the authority dataset from the German National Library).
Submission type: Dataset Description
Responsible editor: Pascal Hitzler

Submission in response to http://www.semantic-web-journal.net/blog/semantic-web-journal-special-ca...

Revised submission after a "reject and resubmit" and a subsequent "accept with minor revisions", then again "accept with minor revisions", and finally accepted. First round reviews are beneath the second round reviews, which are beneath the third round reviews.

Solicited review by Sören Auer:

The authors addressed most of the editing issues. Still a few minor things are left, which should be fixed before final submission:

* accross => across
* althought => although
* Split into two sentences: "The tool performs a pre-processing step that generates a set of spreadsheets, based on the source records, that function as mapping templates where the librarians can manually map the different MARC metadata elements to appropriate RDFS and OWL classes and properties."
* consistent capitalization in the first column of Table 3

There might be some more - another careful round of proof-reading by the authors could be helpful. Anyway, I think another round of reviewing is not required.

Second round reviews:

Solicited review by Sören Auer:

This is a resubmission. The authors improved their paper, but it still contains quite a few issues (which I marked in the PDF). Before final acceptance, the authors should address all of these.

It would be interesting to learn a little more about the MARiMbA tool; the website just says: "The tool will be available early 2012. However, if you would like to have more information or would like to try it, contact: dvila@fi.upm.es or bvillazon@fi.upm.es"

[EiC note: pdf was provided to the authors separately]

Solicited review by Tomi Kauppinen:

In the authors' reply there were some unpolished answers (in Spanish), but they were not that crucial. The manuscript has been revised according to the recommendations, and thus I recommend accepting it.

Solicited review by Danh Le Phuoc:

The authors nicely addressed all the comments in my review. I'm happy to approve the paper for publication.

First round reviews:

Solicited review by Sören Auer:

This paper describes the datos.bne.es dataset of the Spanish national library. The dataset is very interesting and of significant size, and the article is in general suitable for publication in the special issue. Unfortunately, the paper is not very well edited (see some examples below) and leaves out some (from my point of view) important information. In particular, it would have been interesting to obtain some more details about the extraction/mapping and the linking. Which tools and link specifications were used, or did you perform just a simple string similarity matching? You could easily make space for these explanations by reducing the spacing in tables and/or resizing images (e.g. the visualization would fit into one column).
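To make the question about "simple string similarity matching" concrete, the following is a minimal sketch of what such a baseline could look like. It is an illustration only and not the linking method used by the authors: the example URIs, the labels, and the 0.9 threshold are assumptions, and `difflib.SequenceMatcher` from the Python standard library stands in for whatever similarity measure a real linking tool would apply.

```python
from difflib import SequenceMatcher


def normalize(label: str) -> str:
    """Lowercase a label, drop commas, and collapse whitespace."""
    return " ".join(label.lower().replace(",", " ").split())


def similarity(a: str, b: str) -> float:
    """Return a similarity ratio in [0, 1] for two normalized labels."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def candidate_links(local, remote, threshold=0.9):
    """Yield (local_uri, remote_uri, score) for label pairs at or above threshold.

    `local` and `remote` map resource URIs to labels. The result is a list of
    candidate owl:sameAs links that would still need manual checking.
    """
    for l_uri, l_label in local.items():
        for r_uri, r_label in remote.items():
            score = similarity(l_label, r_label)
            if score >= threshold:
                yield (l_uri, r_uri, score)


# Hypothetical example data; these URIs are placeholders, not real identifiers.
local = {"http://example.org/local/1": "Cervantes Saavedra, Miguel de"}
remote = {
    "http://example.org/remote/1": "Cervantes Saavedra, Miguel de",
    "http://example.org/remote/2": "Lope de Vega",
}

for link in candidate_links(local, remote):
    print(link)
```

A label-only comparison like this cannot distinguish different persons with the same name, which is one reason the choice of tool and link specification matters for judging the links.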

Minor comments:
* the paper contains an extensive number of footnotes - the authors should try to reduce them where possible
* Line break in RDF-S/OWL in Section 2.1
* Section 2.3: "Table 2 presents the number of relationshipS established"
* "there's is"=>"there is"
* it would be good to format resource identifiers occurring in the text such as "owl:sameAs" differently, e.g. ttfamily
* The last sentence of 2.3 reads: "The same approach can be used for enriching the dataset with other resources, for example by linking to geographical descriptions (e.g. geonames)."
I don't see the description of your linking approach in this section.
* There should be a non-breaking space before references, e.g. "VoID vocabulary[1]"
* "Miguel de Cervante's Graph visualization" should be "Visualization of the graph around the resource 'Miguel de Cervantes'"
* Inconsistent fonts and font sizes in Table 3
* "relationshuos" => "relationships"
* in general the paper would benefit from careful proof-reading

Solicited review by Danh Le Phuoc:

This paper presents the LD dataset of the National Library of Spain, which was transformed from MARC 21 metadata. The paper needs a major revision to address the following comments:

It isn't clear how the MARC metadata elements are mapped to the chosen vocabularies, even though the reviewer read the short article "Linked Data at the Spanish National Library and the Application of IFLA RDFS Models". It is important to show which MARC elements are mapped and how they are mapped/aligned to ontological vocabularies.
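As an illustration of the kind of information requested here, a mapping from MARC elements to vocabulary terms can be written down as a simple lookup table. The sketch below is hypothetical and does not reproduce the authors' actual mapping: the property URIs under `http://example.org/vocab/` and the sample record are assumptions. Only the MARC 21 tags are real (100 is the personal name main entry and 245 is the title statement, each with subfield `a`).

```python
# Hypothetical mapping: (MARC tag, subfield code) -> RDF property URI.
# The property URIs are placeholders, not terms from the IFLA vocabularies.
MAPPING = {
    ("100", "a"): "http://example.org/vocab/name",
    ("245", "a"): "http://example.org/vocab/title",
}


def record_to_triples(subject, record, mapping=MAPPING):
    """Turn (tag, subfield, value) entries into (subject, property, value) triples.

    Entries without a mapping are skipped.
    """
    triples = []
    for tag, subfield, value in record:
        prop = mapping.get((tag, subfield))
        if prop is not None:
            triples.append((subject, prop, value))
    return triples


def unmapped_elements(record, mapping=MAPPING):
    """Return the sorted (tag, subfield) pairs in the record that have no mapping."""
    return sorted({(t, s) for t, s, _ in record if (t, s) not in mapping})


# Illustrative record only; values and the subject URI are made up for the sketch.
record = [
    ("100", "a", "Cervantes Saavedra, Miguel de"),
    ("245", "a", "Don Quijote de la Mancha"),
    ("260", "b", "Example publisher"),
]

print(record_to_triples("http://example.org/resource/1", record))
print(unmapped_elements(record))
```

Listing the unmapped elements alongside the generated triples is one way to show how much of the original MARC metadata a given set of classes and properties covers.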

The paper lists a series of ontologies used to model the data, but it actually uses only 7 classes and 10 properties. It is not clear how these classes and properties can cover all the original MARC metadata. I doubt that the classes/instances listed in Tables 1 and 2 can cover the majority of MARC 21 fields and categories/taxonomies. The paper should clarify why and how those classes/properties were chosen to model MARC metadata elements.

There are only two types of external links. While the dcterms:language links are trivial, the coverage ratio of owl:sameAs links is less than 10%. Since external link metrics are one aspect by which to judge the quality of the dataset, these figures weaken the paper.
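The review does not define "coverage ratio". One plausible reading, taken here as an assumption, is the fraction of resources in the dataset that are the subject of at least one owl:sameAs link. A minimal sketch of that computation, using made-up identifiers rather than figures from the paper:

```python
def sameas_coverage(resources, sameas_links):
    """Fraction of resources that are the subject of at least one owl:sameAs link.

    `resources` is a set of resource URIs; `sameas_links` is an iterable of
    (subject_uri, target_uri) pairs. Links whose subject is not in `resources`
    are ignored, and several links from one subject count once.
    """
    if not resources:
        return 0.0
    linked = {subject for subject, _ in sameas_links if subject in resources}
    return len(linked) / len(resources)


# Hypothetical identifiers for illustration only.
resources = {"r1", "r2", "r3", "r4"}
links = [("r1", "x1"), ("r1", "x2"), ("r9", "x3")]

print(sameas_coverage(resources, links))
```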

I think Section 3 is interesting, but unfortunately Figure 2 lacks the detail needed to explain what is visualized. There is some explanation later, but it is quite confusing to follow. Better explanations and some labels on the figure might make the visualization and its interpretation more intuitive and understandable.

Some aspects of the dataset are not clearly mentioned in the paper, namely reported usage, maintenance of the dataset, language expressivity, quality of the dataset, etc.

Solicited review by Tomi Kauppinen:

The authors describe their project datos.bne.es, which publishes Linked Data originating from the raw library data of the National Library of Spain. The project seems to be established, and such a library dataset is foreseen to have applications, e.g. in cultural heritage content retrieval. The size of the dataset is also substantial and it is linked to other datasets. However, what remains unclear is how the linkage to other datasets is (or can be) evaluated. The authors say that there is a large number of owl:sameAs links, but it would be essential to argue why they are considered to be correct. Some examples of them would also be useful. It is a pity that the local names of the chosen IFLA-FRBR properties are cryptic codes (e.g. as used here: http://datos.bne.es/page/resource/XX3383563), but the authors of this paper are not to blame for that. However, some prefixes seem to be missing from the Pubby configuration in use, so I suggest adding them.

License issues remain quite unclear. In the paper the authors mention a CC0 license, but in the data I also found this sentence: "Todos los derechos de las imágenes y los contenidos mostrados en esta página pertenecen a Biblioteca Nacional de España" (in English: "All rights to the images and content shown on this page belong to the Biblioteca Nacional de España"). Please clarify the license issues.

The reference list is very weak and only contains references to IFLA reports and to the VoID article. Please consider referring to related work on publishing library and cultural heritage content as RDF (there are several examples, e.g. in the Netherlands, the UK and Finland). The references are also not properly described, i.e. they are missing publisher, address, etc.

For these reasons I think a major revision is needed to address the mentioned issues.

Some minor issues:

- Please introduce the acronym LLD separately. Now it is not clearly stated that it means Linked Library Data.
- It would be nice to have a link or a reference to the MARiMbA tool developed by OEG.
- The footnote 16 on page 2 links to a pdf of an article. I would propose to have it instead in the references list.
- When starting a sentence authors sometimes write datos.bne.es and sometimes Datos.bne.es: please harmonize the use of it.
- On page 4 GeoNames is mentioned, but it would be appropriate to give a link to it (geonames.org).
- On page 4, Section 3, there is a strange sentence: "Additionally, his experiment ….". Please check whether there is something wrong with the grammar.
- Table 3 contains URIs in a variety of fonts, please harmonize them.
- On page 6 the sentence "According to the FRBR ontology …" is misleading, please correct it.
- On page 6: correct the spelling of "short-commings"
- References: void -> VoID.