The SemanticXBRL Dataset, Semantic Financial Data from XBRL Filings

Tracking #: 869-2079

Roberto García
Hector Carretié
Rosa Gil

Responsible editor: 
Axel Polleres

Submission type: 
Dataset Description
One of the main ways of populating the Web of Data is by translating existing data sources. One interesting candidate for this approach is data based on the eXtensible Business Reporting Language (XBRL), a standard for business and financial reporting. Many regulatory agencies require financial reports based on this format, e.g. the United States Securities and Exchange Commission (US SEC). However, despite its proliferation, XBRL data is loosely interconnected and thus it is difficult to mix and query beyond the per-report context. Our contribution is an automatic translation from XBRL filings to Semantic Web technologies, which we have applied to almost 30000 filings by USA companies to the US SEC, obtaining more than 125 million triples. The resulting semantic data is easier to integrate and cross query, as some preliminary applications have shown. Moreover, it can be interconnected with the rest of the Web of Data in order to extract its full potential.

Solicited Reviews:
Review #1
By Andreas Harth submitted on 17/Oct/2014
Review Comment:

This manuscript was submitted as 'Data Description' and is reviewed along the following dimensions: (1) Quality of the dataset. (2) Usefulness (or potential usefulness) of the dataset. (3) Clarity and completeness of the descriptions (see review criteria on the SWJ website).

(1) Quality of the dataset
(2) Usefulness of the dataset

I question whether a direct translation of XBRL to RDF is ideal, especially with regard to the XBRL notion of context.
The XBRL data suffers from using XML (linkbases) as its underlying data model; RDF typically favours an entity-centric data model.
Thus, a more intuitive representation in the Semantic Web context is to follow the modelling constructs that established vocabularies use, such as Data Cube (QB) or SKOS.
However, the presented dataset does not use established vocabularies (such as the QB vocabulary which is applicable to XBRL and widely used for modelling numerical datasets).
Why the authors did not follow the QB model for representing numerical data is not explained.
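
To illustrate the kind of modelling meant here, a single XBRL fact could be expressed as a QB observation roughly as follows. This is only a minimal sketch under stated assumptions: the `EX` namespace, the `issuer` and `concept` dimension names, the dataset URI, and the sample values are hypothetical and not taken from the paper; only the `qb:` and SDMX terms come from the published vocabularies.

```python
# Namespaces of the published vocabularies.
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
QB = "http://purl.org/linked-data/cube#"
SDMX_DIM = "http://purl.org/linked-data/sdmx/2009/dimension#"
SDMX_MEASURE = "http://purl.org/linked-data/sdmx/2009/measure#"

# Hypothetical namespace for the dataset's own terms (an assumption).
EX = "http://example.org/xbrl/"


def fact_to_observation(fact_id, cik, concept, period, value):
    """Return (subject, predicate, object) triples for one XBRL fact.

    The XBRL context (entity and period) becomes dimensions attached
    directly to the observation instead of a separate context node.
    """
    obs = EX + "obs/" + fact_id
    return [
        (obs, RDF + "type", QB + "Observation"),
        (obs, QB + "dataSet", EX + "dataset/filings"),
        (obs, EX + "issuer", EX + "entity/" + cik),        # hypothetical dimension
        (obs, EX + "concept", EX + "concept/" + concept),  # hypothetical dimension
        (obs, SDMX_DIM + "refPeriod", period),
        (obs, SDMX_MEASURE + "obsValue", value),
    ]


# Made-up sample values, for illustration only.
triples = fact_to_observation("f1", "0000000000", "Assets", "2013-12-31", 1000.0)
for triple in triples:
    print(triple)
```

With such a representation, facts reported by different entities for the same concept and period differ only in the issuer dimension, which is what makes cross-report queries straightforward.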

Further, the authors do not provide their dataset as Linked Data, i.e., an HTTP GET on Semantic XBRL URIs does not return the corresponding RDF.
In an updated version, the URI schemes/templates for constructing entity URIs should be included (e.g., the URI template for identifying entities identified with a CIK).
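
As an example of what such a URI template and the corresponding Linked Data lookup could look like, consider the sketch below. It is not the authors' scheme: the base URI, the path layout, and the zero-padding of the CIK to ten digits are assumptions made purely for illustration. The request is only constructed, not sent.

```python
import urllib.request

# Hypothetical base URI; the actual Semantic XBRL scheme is not documented.
BASE = "http://example.org/semanticxbrl"


def entity_uri(cik):
    """Build an entity URI from a CIK using the template {BASE}/entity/cik/{cik}.

    Padding to ten digits is an assumption, chosen so that each entity
    has exactly one URI regardless of how the CIK was written.
    """
    digits = str(cik).strip()
    if not digits.isdigit() or len(digits) > 10:
        raise ValueError(f"not a valid CIK: {cik!r}")
    return f"{BASE}/entity/cik/{digits.zfill(10)}"


def rdf_request(uri):
    """Prepare an HTTP GET that asks for RDF (Turtle) via content negotiation."""
    return urllib.request.Request(uri, headers={"Accept": "text/turtle"})


uri = entity_uri(1234)
req = rdf_request(uri)
print(uri)
print(req.get_method(), req.get_header("Accept"))
```

For the dataset to count as Linked Data, sending such a request to an entity URI should return an RDF description of that entity.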

XBRL's calculationArcs, which describe how various measures relate to each other, are not mentioned in the paper.
The SIC taxonomy for classification of EDGAR entities is not mentioned.
The paper does not include licensing information.
Also, there is no reported usage of the data.

(3) Quality and completeness of description

My colleagues and I have done work in the XBRL space which is missing from the description (see, for example, the Linked EDGAR dataset, which even provides links to Freebase and DBpedia, and Kaempgen et al., "Accepting the XBRL Challenge with Linked Data for Financial Data Integration").

In sum, the modeling decisions are not sufficiently explained, the dataset is not available as Linked Data, and the contribution over existing work is unclear.

Review #2
By Xian Li submitted on 11/Dec/2014
Review Comment:

This manuscript was submitted as 'Data Description' and should be reviewed along the following dimensions: (1) Quality of the dataset. (2) Usefulness (or potential usefulness) of the dataset. (3) Clarity and completeness of the descriptions.

(1) Quality of the dataset.

This paper describes a dataset (named "SemanticXBRL") converted from XBRL filings published by the US SEC. According to the authors, the goal is to enhance interoperability and linking with the rest of the Web of Data. The "Evaluation" section aims to assess the quality of the approach and the resulting datasets along these two dimensions. Two concrete examples are provided to show how to relate comparable facts and how to perform complex aggregations.

However, there are three issues with the current results.

First, this paper did not explicitly define metrics to assess the quality of the datasets resulting from this approach. These metrics should be based on typical usage patterns of the resulting datasets. Experiments should be designed accordingly and these metrics should be measurable.

Second, the current "Evaluation" section did not assess the quality of SemanticXBRL compared to original XBRL datasets.

Third, there is not enough demonstration of the improvement in linking to other datasets on the Web. Section 3.1 described how "linking to external data" was done but did not provide much quality assessment of the linking. Although it was mentioned that 63 links about companies were generated, it is not clear what this means in terms of the usability of these links.

(2) Usefulness (or potential usefulness) of the dataset.

By mapping XML Schema to OWL and XML data to RDF data, the SemanticXBRL dataset improves integration within the same data source, as is shown in Section 4. However, major usages of the SEC XBRL datasets involve reasoning over domain-specific concepts defined in the US-GAAP taxonomies, constraint checking, reconciling common rules with exceptions, etc. These typical usage patterns are difficult to achieve with the current mapping mechanisms described in this paper.

(3) Clarity and completeness of the descriptions.
This paper provides clear and comprehensive descriptions of the approach and the datasets.