Describing the nature of legislation through roll call voting in the Chilean National Congress, a linked dataset description

Tracking #: 2749-3963

Francisco Cifuentes-Silva
Felipe Rivera-Polo
Jose Emilio Labra-Gayo
Hernán Astudillo

Responsible editor: 
Katja Hose

Submission type: 
Dataset Description
This article presents the dataset and Semantic Web model, which includes ontologies, RDF Shape expressions and URI patterns, for describing bills and roll call voting on bills (legislation) processed by the Library of the Chilean National Congress during the last three decades (1990 - 2020). The data includes over 1.2 million RDF triples about voting events belonging to more than 2,700 bills, related to more than 500 distinct Chilean Members of Congress. The dataset is integrated with others, such as those about parliamentary biographies, legal norms, bills, and debates from sessions. The dataset offers a public SPARQL endpoint that supports different kinds of analysis. As an example, we present an exploratory analysis of polarity and political alignment metrics, verifying that it is possible to establish quadrants to classify the nature of bills into four categories defined by: I) Ideological stance (high polarization, high alignment), II) Personal interests (high polarization, low alignment), III) Sectoral interest (low polarization, low alignment) and IV) Technical consensus (low polarization, high alignment). We have identified that in each quadrant there is a co-existence of bills related to underlying topics, evidencing that voting is not totally governed by ideological factors. This type of information is relevant for describing the political scene and for carrying out more in-depth studies in political science, establishing mechanisms that allow optimizing the legislative process, promoting more efficient parliamentary work, increasing transparency and helping to improve the image of Congress and its members among citizens.

Major Revision

Solicited Reviews:
Review #1
By Daniel Hernandez submitted on 11/Apr/2021
Major Revision
Review Comment:

The dataset is relevant, and its creation implies a significant amount of effort and resources for the institution. I know the context of this dataset, and of the other datasets published by the Library of the Chilean Congress (BCN), very well. I thus consider that this dataset is a fundamental component of the set of datasets published by the BCN and integrates well with them. However, I consider that the quality of the description of the dataset provided in the manuscript is not satisfactory and that there is room for improvement in the way the dataset is published. For this reason, and to improve this manuscript and the impact of the dataset, I recommend a major revision of the manuscript.

Usefulness of the dataset

The submitted dataset has clear relevance for the democratic process of Chile and constitutes an example of how to record legislative work. It focuses on the period from the end of the dictatorship to the present, but also includes some norms from before 1990. These earlier norms include those produced in the dictatorship period (1973-1990) and norms related to the constitution of 1925. To clarify the context of this work, it is necessary to say that the current Chilean constitution was enacted in 1980 during the dictatorship period, and that, after weeks of protests in 2019, citizens approved through a referendum the change of the 1980 constitution. Hence, the submitted dataset covers a relevant period of the legislative history of Chile, and some aspects of the processing of the law may be subject to change in the future constitution. These changes require a flexible model, and RDF is therefore a good option for publishing this data. The BCN's mission is to help the parliamentary community fulfill its functions, and to provide citizens with access to legislative information in order to guarantee the normal functioning of democracy and the preservation of the political heritage of Chile.

The impact of the dataset beyond the influence of the authors (e.g., other research groups using the dataset or datasets linked to this dataset) is not documented. I would like to know whether there are projects outside the BCN consuming this data, or whether there is any reason that hinders its use. The existence or nonexistence of such projects has to be made explicit. The authors describe the use of the dataset in a project within their range of influence. The dataset undoubtedly has a potential impact beyond the influence of the authors. The arguments in favor of the dataset uptake would be stronger if the consumption of the dataset beyond the influence of the authors were documented.

Clarity and completeness of the descriptions

The manuscript does not provide a clear and complete description of the dataset. To use the dataset, it is required to understand other datasets that have complex vocabularies. The dataset described in the manuscript is integrated into this collection of datasets, so I would expect to see more example queries explaining the use of the dataset, and diagrams describing how this dataset is connected with the other datasets published by the BCN and with external datasets.

There is not enough description of related work. For instance, a comparison with similar datasets from other countries is needed. There is no "related work" section. Reference [53], which addresses related work, is not cited in the body of the manuscript.

Recall that this manuscript has been submitted as a 'Data description', so it has to contain a concise description of a Linked Dataset that serves as a guide to its usage for various (possibly unforeseen) purposes. Section 4, Proof of Concept, does not serve this end. Describing a political analysis that uses the dataset is out of the scope of the track, and wastes several pages that could be used to improve the description of the dataset. Furthermore, that analysis is debatable. It is not clear whether this analysis considers projects that are rejected by deputies and thus not voted on by senators. Since the text of projects changes along their stages, the differences in consensus may not be comparable. There are many factors that have to be considered to explain the analysis in Section 4. Again, the 'Data description' track is not for publishing studies derived from the data, but for describing a dataset and improving its usage.

The manuscript does not clearly explain the bounds of the dataset. Lines 27-29 state that "[...] the main semantic web pieces of the dataset are: [...] the description of the dataset by means of DCAT [...]." Regarding the description of the dataset, an IRI of an instance of dcat:Dataset ( is provided. But the footnote indicates that this URI corresponds to a named graph. The footnote suggests that each dataset (including the one described in this manuscript) is published in a separate named graph. A description of the use of named graphs in the SPARQL endpoint is missing.

There are several datasets described in the document. The DCAT description refers to the dataset as the "Voting dataset", Figure 1 is about the "voting dataset", Section 3.1 is about the "Members of Congress and Political Parties" dataset, and Section 3.2 describes the "Bills" dataset. I suggest introducing a figure showing all datasets published in the BCN SPARQL endpoint and their relations.

To illustrate how complicated it can be to use the dataset, and thus the need for a better description, let us consider the query in Figure 5. The connection of a vote with a political party is not as clear as this query suggests, because some parliamentarians change their militancy. Also, some parliamentarians have been elected as independents (without militancy in a party) but in the same pact as candidates that have militancy. To write this query, the user has to know the schema of other datasets. In particular, the predicates bcnbio:hasMilitancy and bcnbio:hasPoliticalParty are not described in the manuscript, but in the ontology of the "Members and Political Parties" dataset (document in Spanish here: The description property bcnbio:hasMilitancy is not in the documentation, and the description of bcnbio:hasPoliticalParty says that its subject has to be a member of the class bcnbio:Candidato. An instance of the class bcnbio:Candidata is a subclass of org:Role, and is related with the person (foaf:Person) via the predicate bcnbio:candidatureOf. This shows how complicated it can be to use the dataset, and why a better explanation of the use of the dataset is needed.
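To make the militancy concern concrete, the following is a minimal sketch in Python of resolving a parliamentarian's party at the date of a vote rather than taking any party ever recorded for the person. The record layout, identifiers, party names and dates are all hypothetical and are not taken from the BCN dataset or its ontologies; they only illustrate the kind of time-aware lookup a query would need.

```python
from datetime import date
from typing import Optional

# Hypothetical militancy records: (party, start, end); end=None means ongoing.
# Identifiers, party names and dates are illustrative, not BCN data.
MILITANCIES = {
    "person/1": [
        ("Party A", date(1990, 3, 11), date(2005, 6, 30)),
        ("Party B", date(2005, 7, 1), None),
    ],
    "person/2": [],  # elected as an independent, no militancy on record
}


def party_at(person: str, vote_date: date) -> Optional[str]:
    """Return the party the person belonged to on vote_date, or None."""
    for party, start, end in MILITANCIES.get(person, []):
        # A militancy applies if the vote falls inside its (possibly open) interval.
        if start <= vote_date and (end is None or vote_date <= end):
            return party
    return None
```

Under these assumptions, the same person maps to different parties depending on the vote date, and an independent maps to no party at all, which is the case a simple vote-to-party join would miss.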

There are also some minor issues:

- "Semantic Web" is written in some parts in upper case, and in lower case in others.

- Line 6 of page 2 says, "Library of the Chilean National Congress (BCN)." It could be confusing that the initials do not correspond to the English name of the institution. It would be useful to explain that these initials correspond to the "Biblioteca del Congreso Nacional."

- The initials KG are used only once in the whole manuscript, without being introduced. It would be better to write "knowledge graph."

- Table 1 describes some URIs for the instances of the classes that appear in Figure 1. Why not all classes? Also, there are some classes in Table 1 that are not in Figure 1.

- Figures 1 and 5 have bad resolution and look pixelated.

Quality and stability of the dataset

The dataset has good quality in terms of the accuracy of the data and the integration with other datasets published by the BCN. However, there are some aspects of the dataset that can be improved:

- Lack of an explicit declaration of the licensing of the dataset. By its nature, the data is expected to be open; however, this is indicated neither in the DCAT description of the dataset nor in the manuscript. There is a reference on the BCN website to the licensing of data and information ( but it is not directly connected with the dataset. That license is based on the UK Open Government License but is not stated as a formal license.

- The manuscript does not describe a preservation strategy for the dataset. There is no versioning policy, nor are there ways to download or access a specific version. The DCAT description of the dataset ( states an accrual periodicity, but does not indicate its versions. Also, the DCAT description has dct:issued and dct:modified set to 2011-12-05. It is impossible that this date corresponds to the last modification of the dataset, because the dataset includes bill votes from 2021.

- The DCAT description of the dataset provides access to the current version of the dataset via two ways:

(1) using the Linked Data Platform, by navigating through the RDF files corresponding to each entity of the dataset (e.g., file, and

(2) the SPARQL endpoint.

No dumps of the dataset are provided. I know that the BCN has resources to maintain (1) and (2) for several years, but institutions are subject to change, so dumps may help third parties to preserve the dataset (and its versions).

- The dataset is published along with other datasets. Indeed, on page 5, line 24, it is said that several datasets have been published since [5]. The problem is that it is not clear how to separate the dataset described in the manuscript from previously published datasets.
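
The inconsistency noted above between dct:modified and the dates of the recorded votes is the kind of problem that could be checked automatically before each release. The following is only a minimal sketch in Python, assuming the metadata date and the voting dates have already been extracted; the function name and the example vote dates are hypothetical.

```python
from datetime import date
from typing import Iterable


def metadata_is_stale(modified: date, event_dates: Iterable[date]) -> bool:
    """Return True when any recorded event is later than the declared dct:modified."""
    return any(event > modified for event in event_dates)


# dct:modified as reported in the DCAT description.
declared_modified = date(2011, 12, 5)
# Illustrative vote dates only; the real ones would come from the dataset.
example_votes = [date(2010, 5, 4), date(2021, 1, 12)]
stale = metadata_is_stale(declared_modified, example_votes)
```

With a vote dated in 2021 and a declared modification date in 2011, the check flags the metadata as out of date.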

Review #2
Anonymous submitted on 02/Jun/2021
Major Revision
Review Comment:

This paper presents a dataset and semantic web model for bills and roll-call voting on bills of the Chilean Congress. Having a dataset of these characteristics can be important in the context of the legitimacy of representative democracies. The paper shows the use of the dataset by making an analysis of polarization and alignment in the bill voting for the Chilean elections.

Despite the interesting data, as far as I could see, the work in this paper does not show clear indicators of the impact of the proposed dataset and needs further improvement to be considered for publication.

1. I do not see a demonstrable impact or future impact. No clear uptake indicators are provided in the paper. So, you need to show the interest in this type of dataset beyond the research group, or within the research group for further investigation.

2. From the paper, the dataset is created from already existing datasets in the Chilean National Congress and other sources, which also seem to be available as Linked Open Data in RDF format. So, you need to state more clearly whether you are describing already existing data or a new dataset, as well as the type of maintenance it needs.

3. In the introduction, you use arguments without any reference. For example, "In the US and Europe there are numerous studies ...",..., "alignment and polarization have bene widely studies.."

4. In the proof of concept, what is the explanation of the thresholds for polarization and for alignment? Also, are the categories "Ideological Stance", "Personal Interest", "Sectorial Interest", and "Technical Consensus" found in the literature?

5. In the definition of polarization, what is the rationale for defining the degree of polarization in comparison to the polarization coefficient? Provide examples for this definition.

6. In the Data section, Table 4.2 does not exist, but Table 2 does. This table needs further explanation. What do you mean by "In General" and "In particular"?

7. It seems that a bill may have different voting events and that you aggregate these events. If so, why did you not model the different voting events?

8. Use abbreviations only after you indicate what they mean, for example LOD and KG.

9. Figure 1 is hard to read.

10. Proofread the English carefully. There are several grammar mistakes (verb conjugations).
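
On point 4, the four quadrants named in the abstract depend entirely on where the two thresholds are placed. The following is a minimal sketch in Python, assuming both metrics are normalized to [0, 1] and using placeholder thresholds of 0.5; neither the normalization nor the threshold values come from the manuscript.

```python
def classify_bill(polarization: float, alignment: float,
                  pol_threshold: float = 0.5,
                  ali_threshold: float = 0.5) -> str:
    """Map a bill to one of the four quadrants named in the abstract.

    The thresholds are placeholders (assumptions), not the authors' values.
    """
    high_pol = polarization >= pol_threshold
    high_ali = alignment >= ali_threshold
    if high_pol and high_ali:
        return "I) Ideological stance"
    if high_pol:
        return "II) Personal interests"
    if not high_ali:
        return "III) Sectoral interest"
    return "IV) Technical consensus"
```

In this sketch, a bill with polarization 0.55 and alignment 0.45 falls in quadrant II with the placeholder thresholds, but moves to quadrant III if the polarization threshold is raised to 0.6, which is why the choice of thresholds needs to be justified.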

Review #3
By Lorena Etcheverry submitted on 23/Jun/2021
Major Revision
Review Comment:

Summary review:

This article presents two datasets, published as LOD, processed by the Library of the Chilean Congress.
One corresponds to congress bills and voting events (1990 - 2020), and contains over 1.2 million RDF triples representing 2,700 bills. The other represents members of Congress and political parties from 1990 to the present (over 400k triples).
It also presents a use case where the authors analyze the polarization and political alignment in voting.
Although the info contained in the datasets is relevant, and the use case shows their potential, I have some concerns about the paper.

First, some relevant information and discussion on the datasets are missing. In particular, no info is provided on external and internal connectivity, nor on the quality and stability of the datasets. I strongly encourage the authors to provide details about the used vocabularies, ideally using the 5-star rating provided
Second, several paper sections contain grammatical and semantic issues that make it very difficult to read and understand. I suggest carefully reviewing the work.

Information on the dataset:
Two OWL ontologies are defined and used in these datasets: Biographies ontology and Legislative resources ontology. Also, terms and concepts from DC, SKOS, and Time ontology are used.

Biographies Ontology
Available at in N3 and RDF-XML
Documentation available at:
Created 10/12/2011
Last Updated 08/23/2012
Version 1.3
Language: Spanish (documentation in English)

Legislative Resources Ontology
Available at in N3 and RDF-XML
Documentation available at:
Created 10/12/2011
Last Updated 01/18/2020
Version 1.2
Language: Spanish (documentation in English)

Legislative Resources shape
Available at using ShEx

Data sources, data acquisition, and preparation
sources: Chilean National Congress chambers and BCN Archive (Political History portal and repository)

Although the authors describe two acquisition methods in Section 3 (consuming available XML Web Services and performing web scraping), a new data source is then described in Section 3.1 (a wiki with RDFa annotations).
The authors should clarify this point, stating which are all the data sources and acquisition methods.

Metrics and statistics on external and internal connectivity
No info is provided on this topic.

Some comments on the ontologies and design decisions:

- Check consistency between documentation and RDF files. Ex: Class Article ( is not defined in the Ontology (

- Cool URIs and Linked Data principles also suggest that URLs should be human-readable. Why have the authors decided to use numbers to identify people instead of using their names in the URI? In contrast, the suggested approach has been applied to Political Parties.

Linked data frontend
The authors provide information on a Linked Data frontend tool. Although they provide a link to a GitHub repository, no links are provided to use this frontend to access the described datasets.
How is this frontend used in the context of the project?

Strong points:
- the dataset is relevant
- the use case clearly shows the potential of the dataset

Weak points:
- I suggest a careful review of the document since there are some phrasing and grammar errors.
- most of the paper is devoted to the use case, while vital information on the dataset is missing. We strongly encourage authors of the dataset description paper to provide details about the used vocabularies; ideally using the 5-star rating

Quality and stability of the dataset:
The contents of the data sets are stable since they represent information from 1990-2020.
It also appears that the used ontologies are stable. For example, while both ontologies date from 2011, in one case,
the last modification date is 2012 (version 1.3), and in the other is 2020 (version 1.2).
No metrics or judgments about the quality of the datasets are presented in the article.

Usefulness of the dataset:
No evidence is provided on third-party uses of the datasets.

Clarity and completeness of the descriptions:
Two custom OWL ontologies are defined and used in these datasets: Biographies ontology and Legislative resources ontology.
Additionally, terms and concepts from DC, SKOS, and Time ontology are used.
In the case of the custom ontologies, dereferenceable human-readable information is provided (1-star).
Also, the information is available as a machine-readable explicit axiomatization of the vocabulary (2-stars).
Still, in both cases, the vocabularies are not linked to others, which is mandatory to reach the 3-stars level.

Impact: The analysis case described in the article, based on the published datasets,
can be considered evidence of the potential impact of the published datasets.
No evidence is provided on further impact beyond or within the authors' range of influence.