Automatizing experiment reproducibility using semantic models and container virtualization

Tracking #: 2264-3477

Carlos Buil Aranda
Maximiliano Osorio
Idafen Santana

Responsible editor: 
Guest Editors Semantic E-Science 2018

Submission type: 
Full Paper
Experimental reproducibility is a major cornerstone of the Scientific Method, allowing an experiment to be rerun to verify its validity and advancing science by building on top of previous results and introducing changes to them. In order to achieve this goal, in the context of current in-silico experiments, it is mandatory to address the conservation of the underlying infrastructure (i.e., computational resources and software components) in which the experiment is executed. This represents a major challenge, since the execution of the same experiment on different execution environments may lead to significant result differences, assuming the scientist manages to actually run that experiment. In this work, we propose a method that extends existing semantic models and systems to automatically describe the execution environment of scientific workflows. Our approach makes it possible to identify issues between different execution environments, easing experimental reproducibility. We have evaluated our approach using three different workflow management systems for a total of five different experiments, running on a container virtualization system (i.e., Docker). This showcases the feasibility of our approach both to reproduce the experiments and to identify potential execution issues.

Reject (Two Strikes)

Solicited Reviews:
Review #1
Anonymous submitted on 14/Aug/2019
Review Comment:

This manuscript was submitted as 'full paper' and should be reviewed along the usual dimensions for research contributions which include (1) originality, (2) significance of the results, and (3) quality of writing.

The authors addressed my main concern regarding the overselling of their solution and now refrain from claiming to address security issues, which they do not. The paper is now in better shape and describes the proposed solution faithfully. I would nevertheless recommend that the authors re-edit the introduction. It is still verbose, and it does not allow the reader to grasp what the conservation of an experiment/workflow necessarily implies. This may be achieved by giving a simple example that illustrates the distinction between what virtualization implies, on the one hand, and what conservation using containers does, on the other.

Review #2
Anonymous submitted on 05/Sep/2019
Major Revision
Review Comment:


The paper describes DockerPedia, an engine for designing reproducible experiments.
The authors designed DockerPedia following the conceptualisation presented in [3].
They aim at fostering experiment reproducibility by providing physical and logical conservation,
stressing the fact that little work has been done on the conservation of the experiment infrastructure (Section 1, paragraph 2, page 2).

The authors clarify that Virtual Machines (VMs) have been used to the same end. Nevertheless,
they require a lot of disk space as they store the whole operating system.

In contrast, containerization techniques seem promising. They require less space
while still offering physical conservation.

Strong Points
- the paper touches the critical problem of experimental reproducibility
- the authors try to combine conceptualisation with a realistic technological stack
- the software is available on GitHub

Weak Points

- There are many imprecisions about containerization;
- the authors still oversell their work a bit: why does DockerPedia do a better job than existing solutions?
- the paper structure can be improved;
- the evaluation is still not convincing: it does not quantify the value of the approach.

The paper shows two major drawbacks that convinced me not to accept it.

The first problem regards the imprecisions related to Docker. Quoting from the paper

Section 1

- "Docker container can be seen as lightweight virtual machines".

I think that a scientific paper should not state imprecise facts, even if the intent is to convey an intuition.

- "Docker containers are used intensively in both industry and science, mostly to preserve the execution environment of software applications and also to
preserve the physical environment of an experiment"

I think the authors mean "Docker images". Moreover, I think the second part of the sentence is quite a strong claim.

Section 2

- "Docker is a solution that allows virtualizing a minimal version of an Operating System (OS), sharing the resources from the host machine by means of software images."

The Docker architecture includes the Docker client, dockerd, containerd, and runc.

None of these components performs virtualization. Moreover, containers are isolated and constrained processes that run ON the host machine.
Software images are just software packages plus the necessary metadata that are used by runc and containerd to spawn containers.
An image is an immutable file system.

Sec 2.2.

- "A Dockerfile is a text file that contains all commands to build a Docker image and run a container using this image."

A Dockerfile only contains instructions to build an image. Some of these instructions may condition the runtime behaviour of the container (CMD and ENTRYPOINT),
but this should be specified.

- "The first line in such files is the FROM keyword, which imports the base OS on which all software will be installed."

Sec 2.3.1

- "With Docker and AuFS that user can share the 1 GB data between all the containers. If that user has 1,000 containe"

Docker containers are mutable, ephemeral copies of the image file system. Therefore, each spawned container creates a 1 GB copy.

Section 3

- "Moreover, as their evolution can be tracked along
the development process, it is possible to rollback to
previous Docker images in case new dependencies or
modifications introduce errors."

Docker images are immutable. Modifying an image requires rebuilding it, which in practice results in another image. An image is divided into layers, so some of these layers may not be affected by a change. Nevertheless, the resulting image is a different one (different hash).

Rollback requires storing the previous version.
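
To illustrate the point about immutability and hashes, here is a minimal sketch of content-addressed identifiers. It is a toy model and not Docker's actual image format: the functions `layer_digest` and `image_id`, the byte strings standing in for layer contents, and the `stored_versions` dictionary are all hypothetical names introduced for this example.

```python
import hashlib


def layer_digest(content: bytes) -> str:
    """Return a content-derived identifier for a blob of bytes."""
    return "sha256:" + hashlib.sha256(content).hexdigest()


def image_id(layer_contents):
    """Toy model: an image identifier is the digest of its ordered layer digests."""
    digests = [layer_digest(content) for content in layer_contents]
    manifest = "\n".join(digests).encode()
    return digests, layer_digest(manifest)


# Two versions of a hypothetical image that differ only in the middle layer.
v1_layers, v1_id = image_id([b"base os files", b"java 7", b"workflow code"])
v2_layers, v2_id = image_id([b"base os files", b"java 8", b"workflow code"])

print(v1_layers[0] == v2_layers[0])  # the untouched first layer keeps its digest
print(v1_layers[1] == v2_layers[1])  # the modified layer gets a new digest
print(v1_id == v2_id)                # so the resulting image is a different one

# Rolling back is only possible if the earlier identifier was kept somewhere.
stored_versions = {"v1": v1_id, "v2": v2_id}
rollback_target = stored_versions["v1"]
```

In this model an unchanged layer keeps its digest, while any change produces a new image identifier, so an "old" image exists only as long as it is stored.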

Sec 3.2.1

- "Docker builds an image by either reading a set of instructions from a Dockerfile or just deploying that image on a host in case the Dockerfile is nor present."

The first line of a Dockerfile (FROM keyword) indicates the parent image that is used to start building the current one.

Docker images are built directly from a running container using the "docker commit" command.

"docker build" automates the process. It spawns a container from the base image; it applies the following line in the

FROM 'scratch' indicates that the build starts from an empty image.

Sec 3.2.2

- ".. Docker image we are able to reproduce the Docker image only deploying it, and thus without modifying any parameter in the image. "

Docker images are immutable.

- "Each Docker image layer installs or removes software packages."

The second problem regards the proposed evaluation.

I understand that it is hard to assess the validity of Dockerpedia.
Nevertheless, the current proposal is not satisfying.

First of all, it does not assess that Dockerpedia works. What it "demonstrates" is that Docker can replace VMs as an execution environment. This is to be expected, as Docker is becoming the de facto standard.

I think the authors should measure how much Dockerpedia impacts the experimental workflow of a researcher.
How much does it cost to use Dockerpedia in place of previous solutions?

Moreover, many claims were made about the waste of storage caused by VMs, but I could not find the actual measurements for the Dockerpedia solution.

Finally, a side note regarding the results. The authors claim they were successful in 100% of the experiments they ran. Although this is possible,
it usually raises the doubt that the wrong instrument is being used to measure success. Research must be falsifiable, which requires understanding
the exact context in which it can be applied.

Two final minor remarks:

- In Section 2, the authors explain how they extended the WICUS ontology to describe docker images and steps in the Dockerfiles.
How does this differ from [29] "Describing Docker file in RDF"?

- In Section 4, Table 1 can be omitted, as lines 1 and 3 are always the same.

Review #3
By Daniel Nüst submitted on 17/Sep/2019
Review Comment:

This is my second review of the article, and I regret to say that major concerns have not been addressed by the authors. Some newly added or extended sections of the article are of rather low quality and require considerable work to improve readability and understandability. The current state of the manuscript needs work in language/writing, and while the work is original, the significance of the results is below average, at least in the current form of presentation. The valuable contribution (extract annotations from images and do reasoning on that) is still cluttered with somewhat related issues (image storage size, confirmation of successful reproduction, descriptions of how the example workflows are run - why not just reference existing usages?).

It also seems that revising the manuscript led to several mistakes and inconsistencies that require a thorough proofreading. There are too many simple typesetting errors and inconsistencies that hinder reading and understanding, e.g. missing spaces and different naming. Pointing these out should not be my task, but I stumbled over them:

- "Reprozip" and "ReproZip" (the latter being correct AFAIK)
- "Docker container" and "Docker Containers" and "docker containers"
- "Docker container" but "Docker Images" (capital I uncommon, "Docker" is the brand name)
- "Dockerfile", "DockerFile" and "dockerfile" (the first being commonly used)
- "DockerHub" and "Docker Hub" (the latter being correct)
- "Google Compute" and "Google Cloud"

## Introduction

- the authors follow the ACM definition, though it diverges from the predominant Claerbout/Donoho/Peng terminology (cf.
- I can't follow the statement that scientific workflows are _commonly_ executed on distributed computing systems. I don't think it is common yet, but maybe the authors can provide a reference?
- "build" a workflow is confusing terminology: Do you mean "execute" ?
- typo "distributing and experiment., "
- "it highly difficulties" - difficult is not a verb
- No explanation why "lightweight" solutions serve the need of "a more collaborative and agile approach". Too many buzzwords?
- "containers ... are around 1/10 of a VM [in size?]" - no support for this claim.
- "Docker containers [are] lightweight virtual machines" - the authors seem to take this as a given, though the file size is still unrelated to the main contribution of extracting semantic annotations from images. Why does the image size matter, when you can extract the relevant metadata and store them in a queryable database?

## Scientific workflows

- "mainly IT-oriented communities embracing container-based approaches" is not sufficiently supported by only one reference from computational biology
- section "Introduction to Docker" repeats content given a few paragraphs before
- "Docker repositories and files" would better be introduced via a good reference, as there are some mistakes that might confuse readers without Docker experience: Docker _instructions_, not "lines", are translated to layers, and an instruction can span multiple lines; FROM need not be the first line, and it can refer to more than a "base OS", namely a base image with specific software
- section "Publishing and Deploying images...": it is unclear how this section is relevant for the presented main contribution; as _images_ are analysed, the reader might not even need to be aware of what a _Dockerfile_ is (though I agree that presenting how Docker images are created is useful, but could be much shorter and focused)
- users do not need to use command line tools to publish an image on Docker Hub: they can use automated builds
- "the Dockerfile is not uploaded with the image": while technically correct, the instructions are stored in image metadata, e.g. run `docker history mosorio/pegasus_workflow_images:soykb --no-trunc`
- "even when Dockerfiles are available..." > I cannot follow this argument: Dockerfiles were most likely written/created by a human, so they should be understandable, too. I don't see why "intuitive" is a relevant criterion here.
- "some components might exist .. not specified by the Dockerfile" > that is a valid argument, I suggest using the term (transitive) dependencies to precisely describe what is missing;
- final paragraph is on point and _very_ useful as it describes the main contribution quite well, but does not fit in this introductory section
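
The remark above about instructions versus lines can be made concrete with a minimal sketch. The parser below is a deliberately simplified illustration, not Docker's own parser: `parse_instructions` and the `EXAMPLE` Dockerfile are hypothetical, and only comments, blank lines, and backslash line continuations are handled.

```python
def parse_instructions(dockerfile_text):
    """Group Dockerfile lines into (KEYWORD, arguments) instructions.

    Simplified on purpose: skips blank lines and comment lines, and joins
    lines that end with a backslash into a single instruction.
    """
    instructions = []
    pending = ""
    for raw in dockerfile_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        if line.endswith("\\"):
            # Continuation: keep accumulating the current instruction.
            pending += line[:-1].rstrip() + " "
            continue
        pending += line
        keyword, _, args = pending.partition(" ")
        instructions.append((keyword.upper(), args.strip()))
        pending = ""
    return instructions


# Hypothetical Dockerfile: FROM is not the first instruction, RUN spans three lines.
EXAMPLE = """\
# ARG may precede FROM
ARG BASE_TAG=18.04
FROM ubuntu:${BASE_TAG}
RUN apt-get update && \\
    apt-get install -y \\
    openjdk-8-jdk
CMD ["java", "-version"]
"""

instructions = parse_instructions(EXAMPLE)
for keyword, args in instructions:
    print(keyword, "|", args)
```

The seven lines of the example file yield four instructions, the first of which is ARG rather than FROM, and the base image is a tagged Ubuntu image rather than a bare "base OS".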

## Docker and Virtual Machines

- similar arguments are brought up repeatedly in the manuscript - suggest to group them in one place and cross-reference
- in the introduction, the authors claim that distributed computing systems are commonly used, which implies large datasets. How does the size of the VM or the image relate to dataset size? Also: even if a user wants hundreds of VMs at 1 GB each, that's still less than a common 1 TB hard drive.
- please provide data for the claim that Singularity is the most popular solution in the scientific community. If so, why does your solution not support the most popular container tool?
- typo "terated" ?

## Reproducibility in scientific workflows [..]

- "jeopardizing real [..] production infrastructures": please provide an example where research environments run next to production systems potentially jeopardizing them; I think this is an unrealistic issue
- "hosted on aforementioned repositories": only Docker Hub mentioned up to now, plural is confusing
- "installed software [..] should only be related to a single experiment" > can you provide a reference for that? Would be useful for the reader.
- "as input the repository name" > why not the image name?

## Annotator

- Docker does not "build an image by [..] or just deploying that image" - a Dockerfile is built, a container is executed
- a Dockerfile cannot be executed, but is built
- "which is always available..." > sentence partially redundant
- How does the image manifest relate to the instructions in the Dockerfile? I think you can make a good case here of advantage of an image vs. Dockerfile
- Listing 1 does not install Tensorflow at all
- "install 184 packages" > that is really useful
- "the most commong system package managers, such as THE MANAGERS FOR THE LINUX DISTRIBUTIONS Ubuntu, Debian, ...
- It would be great if you could provide a compare link between your fork of Clair and the original code yours is based on, for transparency what you did
- Listing 2

## Comparing different versions [..]

- Fig. 2. could use some numbers so the reader knows where in the diagram to start
- Listing 2/3: Reader needs to see the query _result_ to understand. Could you provide a link to an online version of the query? Or put truncated results in the manuscript?
- Title of Listing 2 may be ".. installed in the latest Pegasus _image_" ?

## Building and storing images

- "Container description vocabulary": where are examples for this, and reference to the actual vocabulary missing?

## Running the experiments

- Please synchronise the example (using mosorio/.. image) and the text (Footnote 17 points to
- "We include in each Docker image a README" - how do you put the README _in_ the image, and how can it be accessed by users?
- Listings 4 to 6: I tried to run the workflow but the execution gets stuck at 15.3 %.
- Why don't you run the whole analysis when running the container?
- Listing 6: The on screen instructions go further and require to run `pegasus-run` - please double check!

## Is the reproduced environment similar?

- Listing 7: Suggest to partially also show the result of the query (online it returns > 100 lines), and put a link to the online SPARQL endpoint.

## Pegasus

- "being probabilistic" > can you not set a seed?
- contributions of Fig. 3, 4 unclear towards main contribution of the paper
- Footnote 21 is not a link to Docker Hub but to Zenodo
- The records on Zenodo do only include empty README files > data repos should at least be able to soe something
- [13] - Why a reference here and not a footnote as before?
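
The question about setting a seed can be illustrated with a minimal sketch. The `simulate` function is a hypothetical stand-in for a probabilistic workflow step; it is not taken from the Pegasus workflows discussed in the manuscript.

```python
import random


def simulate(seed, n=5):
    """Hypothetical probabilistic step: draw n pseudo-random numbers from a seeded generator."""
    rng = random.Random(seed)  # private generator, independent of global state
    return [rng.random() for _ in range(n)]


run_a = simulate(seed=42)
run_b = simulate(seed=42)
run_c = simulate(seed=7)

print(run_a == run_b)  # same seed: identical sequence on every run
print(run_a == run_c)  # different seed: a different sequence
```

With a fixed seed recorded next to the image, two executions of such a step can be compared value by value instead of only statistically.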

## dispel4py, WINGS, ..

- Fig. 5: I am nitpicky, but you can never "validate a hypothesis", only fail at disproving it; please add a clear hypothesis in any case!
- Section contains a number of Links to Zenodo, which is good albeit being a bit extensive
- using an on-line service does _not_ automatically lead to different inputs and outputs (why not query historical data? It should always be the same!)
- Footnotes "Docker Hub 26" actually have URLs pointing to Zenodo.
- Workflow descriptions are quite extensive, maybe put them in an appendix on a GitHub repo > they take the attention away from the main contribution

## Results and discussion

- Which perceptual hash tool do you use?
- "it is possible to extend Clair" > was that unknown before? Then this should be mentioned early on as a research question
- what can the semantic annotations do besides capturing different versions? One could achieve that just by saving installed packages to a text file (sudo dpkg --list) and then diffing the different image metadata
- real cause of problems is related to version pinning (suggest to use that term)
- "we could reproduce the environment using the annotations" > Do you create a Dockerfile and build that, solely based on annotations? Should be clarified.

## Related work

- The authors continue to claim the higher storage requirements of VMs, but don't provide data.
- `docker images | grep soykb` gives me an image size of about 1.41GB. What is the size of a comparable VM, with the same software? Is it just that image plus "1GB for a Linux distribution" (can you provide a reference for that?)
- The final sentence "In the following ..." does not fit - the approach is described in earlier sections.
- "such as easily comparing two different execution ..." > The authors must demonstrate a case where the power of semantically annotated data is really leveraged. The presented example of different Java versions could have been detected by storing the installed software from a container in a plain text file and taking the diff of the two text files.
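
The plain-text baseline mentioned above can be sketched as follows. This is a minimal illustration under stated assumptions: `parse_dpkg_list` and `diff_packages` are hypothetical helpers, the two listings are invented sample data in the column layout of `dpkg --list`, and nothing here is taken from the images discussed in the manuscript.

```python
def parse_dpkg_list(text):
    """Map package name to version for installed ("ii") rows of `dpkg --list` output."""
    packages = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) >= 3 and fields[0] == "ii":
            packages[fields[1]] = fields[2]
    return packages


def diff_packages(old, new):
    """Return packages added, removed, and changed in version between two listings."""
    added = {name: ver for name, ver in new.items() if name not in old}
    removed = {name: ver for name, ver in old.items() if name not in new}
    changed = {
        name: (old[name], new[name])
        for name in old.keys() & new.keys()
        if old[name] != new[name]
    }
    return added, removed, changed


# Invented sample listings for two hypothetical image versions.
OLD_LISTING = """\
ii  openjdk-7-jre  7u211-2.6.17  amd64  OpenJDK Java runtime
ii  python         2.7.12-1      amd64  interactive language
ii  curl           7.47.0-1      amd64  command line tool
"""
NEW_LISTING = """\
ii  openjdk-8-jre  8u222-b10  amd64  OpenJDK Java runtime
ii  python         2.7.15-1   amd64  interactive language
ii  curl           7.47.0-1   amd64  command line tool
"""

added, removed, changed = diff_packages(
    parse_dpkg_list(OLD_LISTING), parse_dpkg_list(NEW_LISTING)
)
print("added:", added)
print("removed:", removed)
print("changed:", changed)
```

A comparison of this kind detects version changes without any semantic model, which is why a convincing use case would need to show queries that a text diff cannot express.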

## Conclusions

- First two sentences repeat "in this work"
- No backing of the statement that containers are "lightweight", which should be provided given that image size is discussed.
- I continue not to agree with the statement that uploading an image to a container registry ensures "physical conservation", but I understand this is a term used by the authors in a previous work.
- ".. can detect software components issues" > incorrect versions are mentioned as an example, but no other abilities besides detecting version changes are mentioned. Are there any other potential issues?
- Regarding perf events: That "tracing" approach is precisely what ReproZip does. Suggest to explain how a semantic modelling of the perf events would advance the current state of the art.
- "We prove [..] such as Conda." > not sure what you prove, suggest to relate this to a research question at the beginning.