# HDTcrypt: Efficient Compression and Encryption of RDF Datasets

### Tracking #: 1733-2945

Authors:
Javier D. Fernández
Sabrina Kirrane
Axel Polleres
Simon Steyskal

Responsible editor:
Ruben Verborgh

Submission type:
Full Paper
Abstract:
The publication and interchange of RDF datasets online has experienced significant growth in recent years, promoted by different but complementary efforts, such as Linked Open Data, the Web of Things and RDF stream processing systems. However, the current Linked Data infrastructure does not cater for the storage and exchange of sensitive or private data. On the one hand, data publishers need means to limit access to confidential data (e.g. health, financial, personal, or other sensitive data). On the other hand, the infrastructure needs to compress RDF graphs in a manner that minimises the amount of data that is both stored and transferred over the wire. In this paper, we demonstrate how HDT – a compressed serialization format for RDF – can be extended to support encryption. We propose a number of different graph partitioning strategies and discuss the benefits and tradeoffs of each approach.
Revised Version:
Tags:
Reviewed

Decision/Status:
Minor Revision

Solicited Reviews:
Review #1
By Miel Vander Sande submitted on 10/Oct/2017
Suggestion: Minor Revision

Review Comment:

The authors propose an extension to the compressed but queryable RDF format HDT. Three different methods to compress the Dictionary and the Triples are presented and evaluated. The paper is very well written, the research is well conducted, and it is an interesting contribution to a continuing trend of publishing Linked Data effectively. I do suggest some minor changes before it can be accepted.

First, Section 4.2 is a bit hard to grasp. I guess this is mainly because of the numerous inline definitions that result in a lack of visual cues. Second, it surprises me that no more advanced approach in service of querying was considered. This should at least be explained more elaborately in the conclusion. Third, the evaluation could be stronger. I wonder, for example, why DBpedia 3.8 was chosen (it is small and old). It seems to obscure size-scalability issues. Also, I would have liked to see a comparison with an unencrypted HDT file to see what the introduced overhead is, especially when integrating dictionary and triples in a new HDT for querying, or at least an explanation of why it was not included.

Minor remarks:
- S1;L5: "exchange" is a bit more appropriate than "interchange"
- S2;L16: could use a line break
- S2;L17-21: the limitations or relevance of the authors' previous work are not well explained
- S4.1;L9: watch horizontal text overflow
- S4.1;L14: the footnote is a strange choice; I suggest brackets or hyphenation
- S4.1;Fig. 3: mentioning the access rights would clarify the figure
- S4.1;L52: a one-to-one
- S5;L3: this sentence needs rephrasing
Review #2
Anonymous submitted on 24/Dec/2017
Suggestion: Minor Revision

Review Comment:

This paper extends the existing HDT compression format for RDF with several partitioning strategies that allow a given graph to be automatically subdivided into subgraphs that reduce terminological (dictionary) and/or assertional (triples) duplication. Since subgraphs can themselves be annotated with different metadata, one use case is to couple named-graph access rights to encryption keys.

# Originality

The big contribution of this paper is not that it performs automatic partitioning, nor that it provides encryption of sensitive data, but that it does these things in combination with the ability to query the data. This combination is very novel and has many concrete applications, both in Semantic Web research and in practice.

# Significance of the results

The significance of the results is not immediately clear, because there are subtle benefits and downsides to the various evaluated approaches. Tables 6 and 7 make a fair attempt at giving an overview of the various pros and cons. It is unfortunate that approach crypt-D cannot really be compared to the other three approaches, since it uses a very different encoding for assertions/triples to begin with. It is not entirely clear to me why crypt-D is implemented in such a different way, and/or whether the performance hit that is due to the different representation can be calculated independently (thereby making the performance results presented here more comparable).

# Approach

Overall the approach is easy to follow. Specifically, the use of examples helps in keeping track of the specifics of the described approach. There are some things that are a bit unclear to me, mainly with respect to the notion of dictionary identifiers:

- p3: If S and O both start at index |SO|+1, the latter cannot end at index |SO|+|S|+|O|. Maybe O starts at index |SO|+|S|+1?
- p3: Role is mentioned but not explained. The roles seem to be node and edge (which implies that an IRI that appears as a node and as an edge receives two IDs).
- p3: If G = {⟨x,y,y⟩}, id(y,D) does not seem to be functional (receiving IDs 1 and 2). Maybe the functions id and ids should return pairs that contain a role (node or edge) and an ID?
- p5: It is unclear why {G_1, …, G_n} must be a cover of G.
- p5: I do not understand the definition of the canonical partition. S is drawn from a set of Boolean _functions_. Is S' drawn from the same set? But then i and j cannot be members of S and S', because they are not functions.
- p6: The two admissibility conditions on R also require i and j to be pairs instead of integers (otherwise an edge i in ids(T) could be related, through R, to a node i in ids(D)).

# Evaluation

- Why was 'lzo' chosen to compress N-Triples and 'gz' to compress HDT? Is this a fair comparison?
- Is it correct/expected that crypt-C compresses _faster_ than regular HDT?
- To what extent does it make sense to compare performance results of crypt-D to results of crypt-A, crypt-B, and crypt-C, given that the triple representation is entirely different?
- While it makes sense that decryption time increases when there are more subgraphs (due to more duplication), I was expecting a tradeoff for certain types of queries that only require a subset of the D's and T's to be decrypted in the calculation of the query result set.

# Discussion

The paper has focused on uninformed partitioning of a given dataset into graphs. What about informed approaches that alter the D/T partitioning based on use/execution times measured while querying?

# Typos

Overall, the paper is well written; here are only a couple of typos that I came across:

- LaTeX quotes are sometimes the wrong way around, e.g., on p2 'access restricted'
- Text sometimes appears outside columns, e.g., on p5 'access-restricted' and the '='-sign; on p8 the union symbol; on p9 HDT_{crypt-B}
- p2: "investigate [on]"
- p2: "each of our partitioning strategies [is] able"
- p5: the double quotes in the example triple appear in code, so they should be the ASCII double-quote character
- p7: "the notation of T(G,D) from above", but T(G,D) has not been used or defined before
- p11: 'N-Triples' instead of 'NTriples'
- p13: "for the only matter of" → "for the sole purpose of"
- p14: "but not the total graph"; should this not be 'dataset'?
- p15: "all approaches show[s]"
- p19: "and i[s] not competitive"
- p19: "such [as] in DBpedia"