CN107704617A


Info

Publication number: CN107704617A
Application number: CN201711015971.3A
Authority: CN
Inventors: 黄莉
Original assignee: Wuhan University of Science and Technology WHUST
Current assignee: Wuhan University of Science and Technology WHUST
Priority date: 2017-10-25
Filing date: 2017-10-25
Publication date: 2018-02-16

Abstract

The invention discloses a compression method for linked data based on a hierarchical tree index, comprising the following steps: building a dictionary file; building a two-dimensional-matrix tree index that indexes the data blocks; and building a three-dimensional-matrix index of the block data together with the corresponding serialized ID-triple data. The method partitions the relational three-dimensional matrix into blocks according to the distribution of the matrix and of the predicate vector, builds an index in two-dimensional-matrix form for these blocks by mapping the three-dimensional matrix upward, and compresses and stores each block separately, which significantly improves the compression ratio without destroying the original structure.

Description


A Compression Method for Linked Data Based on a Hierarchical Tree Index

Technical Field

The invention relates to the technical field of data processing, and in particular to a compression method for linked data based on a hierarchical tree index.

Background

At present, compression of linked data at the serialization level follows two main approaches: one based on character length and one based on syntactic structure. Character-length compression reduces the average representation length of URI identifiers and similar terms in the RDF data model, thereby compressing the whole data set. The more mature schemes of this kind use a dictionary: the URI strings or constants that make up the subject, predicate and object of each RDF triple are mapped so that every repeated URI string becomes a unique integer ID, producing a data set of ID triples in one-to-one correspondence with the RDF triples. This significantly reduces the redundancy caused by long URI strings.

Syntactic-structure approaches achieve compression by changing how the data is stored, so they can be applied as a further step after character-length compression. Existing schemes differ mainly in how they process the data after dictionary encoding or a similar conversion. The BitMap scheme used in HDT (Header-Dictionary-Triples) extracts association rules among the ID triples produced by dictionary encoding and designs a storage structure that combines linear sequences of ID values with linear bit sequences, compressing the dictionary-encoded ID-triple data set further. Because many redundant data items are removed, the compression ratio is relatively high. However, since the structure of the data is changed, the relationships among subject, predicate and object in the compressed data become obscure, and query support is not ideal.

Another approach to syntactic-structure compression starts from the relational three-dimensional matrix. It treats each triple as a point in a three-dimensional matrix. Building on the dictionary mapping and the ID-triple representation, it preserves the semantic relationships that originally existed among the components of each triple, and the storage structure makes queries easy to implement. However, to maintain the matrix structure, combinations of subject, predicate and object that have no logical relationship to one another must also occupy storage space. As the linked data set grows, the redundant content in the sparse matrix it forms grows as well, and this large amount of redundant information in the sparse three-dimensional matrix is the bottleneck for the compression ratio of this storage structure. The BitMat scheme uses a compact bit-matrix structure, but it is not very efficient at removing the redundant content of the sparse matrix. The K²-triple method splits the linked data set by predicate into multiple sparse two-dimensional matrices and compresses them, achieving better compression than BitMat. However, the structure that K²-triple borrows was originally designed for compressed storage of two-dimensional network graphs, so when predicates are extracted to build sparse two-dimensional matrices from the RDF triples represented by the relational three-dimensional matrix, some storage space is wasted.
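To make the sparsity argument concrete, the following minimal Python sketch measures how little of a relational three-dimensional matrix is occupied and shows the per-predicate split that predicate-based methods rely on. The function names and the assumption that IDs are contiguous and 1-based are illustrative only and do not come from the patent or from any cited scheme.

```python
from collections import defaultdict


def split_by_predicate(id_triples):
    """Group (s, p, o) ID triples into one sparse (s, o) point set per predicate."""
    planes = defaultdict(set)
    for s, p, o in id_triples:
        planes[p].add((s, o))
    return dict(planes)


def fill_ratio(id_triples):
    """Fraction of cells in the bounding S x P x O matrix that hold a triple.

    Assumption: IDs are 1-based and contiguous, so the largest ID on each
    axis equals the length of that axis.
    """
    triples = set(id_triples)
    n_s = max(s for s, _, _ in triples)
    n_p = max(p for _, p, _ in triples)
    n_o = max(o for _, _, o in triples)
    return len(triples) / (n_s * n_p * n_o)


# Hypothetical ID triples (subject, predicate, object).
triples = [(1, 1, 2), (1, 2, 3), (2, 1, 3), (4, 3, 1)]
planes = split_by_predicate(triples)
ratio = fill_ratio(triples)
```

Here four triples occupy a 4 x 3 x 3 bounding matrix, so most cells are empty; real data sets are typically far sparser than this toy case.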

Summary of the Invention

In view of this, the technical problem addressed by the present invention is to provide a compression method for linked data based on a hierarchical tree index. The method partitions the relational three-dimensional matrix into blocks according to the distribution of the matrix and of the predicate vector, builds an index in two-dimensional-matrix form for these blocks by mapping the three-dimensional matrix upward, and compresses and stores each block separately, which significantly improves the compression ratio without destroying the original structure.

The present invention solves the above technical problem through the following technical solution:

A compression method for linked data based on a hierarchical tree index, comprising the following steps:

S1. Build a dictionary file: process the RDF triple set by dictionary mapping, building a dictionary over the components of the RDF triples and, through grouping and deduplication, assigning a unique integer ID to each distinct identifier in the data set;

S2. Build the two-dimensional-matrix tree index that indexes the data blocks: once dictionary construction and ID mapping have been applied to the RDF triple set, ID triples in one-to-one correspondence with the RDF triples are obtained;

S3. Build the three-dimensional-matrix index of the block data and the corresponding serialized ID-triple data; the actually stored ID-triple data is retrieved and restored with the previously built mapping dictionary, which decompresses the compressed file.

Optionally, in step S1, the long strings that express subjects, predicates and objects in the original data set are converted into short integer IDs, so that triples are expressed more concisely.

As an improvement of the above technical solution, the dictionary file is generated from the structure of the linked data set itself and implements the conversion from URI identifiers in string form to IDs.
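The dictionary step can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the patented implementation: the patent only says that terms are grouped and deduplicated, so the choice of one shared ID space for subjects and objects plus a separate one for predicates, the sorted ID order, and the names `build_dictionary` and `encode` are all hypothetical.

```python
def build_dictionary(triples):
    """Assign 1-based integer IDs to the distinct terms of an RDF triple set.

    Assumption: subjects and objects share one ID space ("nodes") and
    predicates get their own; IDs follow sorted term order.
    """
    nodes = sorted({term for s, _, o in triples for term in (s, o)})
    preds = sorted({p for _, p, _ in triples})
    node_id = {term: i for i, term in enumerate(nodes, start=1)}
    pred_id = {term: i for i, term in enumerate(preds, start=1)}
    return node_id, pred_id


def encode(triples, node_id, pred_id):
    """Replace every term of every triple by its integer ID."""
    return [(node_id[s], pred_id[p], node_id[o]) for s, p, o in triples]


# Hypothetical example data.
rdf = [
    ("http://example.org/alice", "http://xmlns.com/foaf/0.1/knows", "http://example.org/bob"),
    ("http://example.org/bob", "http://xmlns.com/foaf/0.1/knows", "http://example.org/alice"),
    ("http://example.org/alice", "http://xmlns.com/foaf/0.1/name", '"Alice"'),
]
node_id, pred_id = build_dictionary(rdf)
id_triples = encode(rdf, node_id, pred_id)
```

Each long URI is stored once in the dictionary, and every further occurrence costs only an integer, which is where the character-length saving comes from.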

Optionally, in step S2, drawing on the properties of the octree, the ID-triple data produced by dictionary encoding is arranged as a matrix, and the octree structure, originally used to store point elements of a spatial region, is adapted: sets of triple points become the final leaf nodes, and the three-dimensional matrix is partitioned into blocks, yielding a structure that can store the ID-triple matrix data.

Further, during data retrieval, if the RDF triple fails to match, a result indicating that no such data exists is returned directly; if the match succeeds, the corresponding data block is entered and searched.
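The two-stage lookup described above can be illustrated with a small Python sketch. Several details are assumptions made for illustration, since the patent does not fix them: blocks have a fixed side length `BLOCK`, the two-dimensional index is taken over the subject-object plane, and each block is held as a plain set rather than in compressed form.

```python
from collections import defaultdict

BLOCK = 4  # hypothetical block side length


def build_blocks(id_triples, block=BLOCK):
    """Return a 2D index {(subject block, object block): set of ID triples}.

    Projecting each triple onto the subject-object plane stands in for the
    "upward mapping" of the 3D matrix; only non-empty blocks get an entry.
    """
    index = defaultdict(set)
    for s, p, o in id_triples:
        index[(s // block, o // block)].add((s, p, o))
    return dict(index)


def contains(index, triple, block=BLOCK):
    """Check the 2D index first; search a block only on an index hit."""
    s, p, o = triple
    cell = index.get((s // block, o // block))
    if cell is None:
        return False  # index miss: report "no such data" without touching any block
    return triple in cell


# Hypothetical ID triples.
index = build_blocks([(1, 1, 2), (1, 2, 3), (9, 1, 10)])
hit = contains(index, (1, 2, 3))
index_miss = contains(index, (5, 1, 5))
block_miss = contains(index, (1, 3, 2))
```

The point of the first stage is that a query falling in an empty region is rejected by the small index alone, so block data is read only when it can actually contain the answer.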

Optionally, in step S3, the serialized ID-triple data is a compressed representation of a tree with N nodes. The structure stores the final traversal values of the tree, and the ID-triple point set of the entire three-dimensional matrix is expressed with two bit arrays, preserving the original logical structure while compressing.

Further, the index over the block data is a three-dimensional-matrix tree index. A query is matched against the index, and if it hits in the linear sequence of ID-triple values of a block, the corresponding RDF triple exists.
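One way to express an octree over a point set with two bit arrays is sketched below in Python. The patent does not spell out the layout, so this sketch borrows the level-order layout known from k²-trees and extends it to three dimensions: `T` holds the bits of the internal levels and `L` the bits of the last level. The names `T`, `L`, `build_octree_bits` and `contains`, the 0-based coordinates, and the requirement that `side` be a power of two and at least 2 are all assumptions.

```python
def build_octree_bits(points, side):
    """Serialize (x, y, z) points in a side^3 cube into bit lists T and L.

    Each node contributes 8 bits in level order, one per octant:
    1 if the octant contains at least one point, else 0.
    """
    T, L = [], []
    level = [(0, 0, 0, frozenset(points))]  # (origin x, y, z, points inside)
    size = side
    while size > 1:
        half = size // 2
        out = L if half == 1 else T  # the last level goes to L
        nxt = []
        for ox, oy, oz, pts in level:
            for dx in (0, 1):
                for dy in (0, 1):
                    for dz in (0, 1):
                        cx, cy, cz = ox + dx * half, oy + dy * half, oz + dz * half
                        sub = frozenset(
                            (x, y, z) for x, y, z in pts
                            if cx <= x < cx + half
                            and cy <= y < cy + half
                            and cz <= z < cz + half
                        )
                        out.append(1 if sub else 0)
                        if sub and half > 1:
                            nxt.append((cx, cy, cz, sub))
        level = nxt
        size = half
    return T, L


def contains(T, L, side, point):
    """Navigate T and L from the root; True if the point is stored."""
    bits = T + L
    x, y, z = point
    pos, size = 0, side  # pos: start of the current node's 8 child bits
    while True:
        half = size // 2
        i = pos + (x // half) * 4 + (y // half) * 2 + (z // half)
        if bits[i] == 0:
            return False
        if half == 1:
            return True
        pos = sum(T[: i + 1]) * 8  # rank of ones up to i locates the children
        x, y, z = x % half, y % half, z % half
        size = half


# Hypothetical 4 x 4 x 4 cube with three points.
T, L = build_octree_bits({(0, 0, 0), (3, 3, 3), (1, 0, 2)}, side=4)
```

In this example the three points are described by 8 + 24 = 32 bits instead of the 64 cells of the dense cube, and membership can still be tested directly on the bit arrays. A practical implementation would replace the linear `sum` with a constant-time rank structure.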

The compression method of the present invention partitions the relational three-dimensional matrix into blocks according to the distribution of the matrix and of the predicate vector, builds an index in two-dimensional-matrix form for these blocks by mapping the three-dimensional matrix upward, and compresses and stores each block separately, which significantly improves the compression ratio without destroying the original structure. Both the present invention and K²-triple achieve better compression ratios on data sets than HDT-BitMap. Whereas the compression ratio of HDT-BitMap is relatively stable, that of the present method varies more than those of the other two methods, which is a consequence of the different compressed storage structures. The compression ratio that the present invention achieves on an RDF data set is determined jointly by the planar sparsity and the degree of vertical overlap of the ID-triple data set in the relational three-dimensional matrix. K²-triple compresses poorly when there are many predicate elements, whereas the present invention maintains a high compression ratio even on data sets with many predicates.

The above is only an overview of the technical solution of the present invention. So that the technical means of the invention can be understood more clearly and implemented according to the description, and so that the above and other objects, features and advantages of the invention are easier to follow, preferred embodiments are described in detail below with reference to the accompanying drawings.

Brief Description of the Drawings

FIG. 1 is a schematic flow diagram of the compression method for linked data based on a hierarchical tree index according to the present invention;

FIG. 2 is an architecture diagram of the compression model for linked data based on a hierarchical tree index according to the present invention.

Detailed Description

The present invention is described in detail below with reference to the accompanying drawings, which form part of this specification and illustrate the principle of the invention through embodiments. Other aspects, features and advantages of the invention will become apparent from this detailed description. In the drawings referred to, the same or similar components in different figures are denoted by the same reference numerals.

As shown in FIG. 1 and FIG. 2, a compression method for linked data based on a hierarchical tree index provided by an embodiment of the present invention comprises the following steps:

Build a dictionary file. The RDF triple set is processed by dictionary mapping: a dictionary is built over the components of the RDF triples, and grouping and deduplication assign a unique integer ID to each distinct identifier in the data set. The long strings that express subjects, predicates and objects in the original data set are thereby converted into short integer IDs, so that triples are expressed more concisely. The dictionary file of the present invention is generated from the structure of the linked data set itself and is mainly responsible for converting URI identifiers and other string-form representations into IDs. It is the basis on which the data set is compressed, and it is also the key to querying RDF triples on top of the compressed model.

Build the two-dimensional-matrix tree index that indexes the data blocks. After dictionary construction and ID mapping have been applied to the RDF triple set, ID triples in one-to-one correspondence with the RDF triples are obtained. Then, drawing on the properties of the octree, the ID-triple data produced by dictionary encoding is arranged as a matrix, and the octree structure, originally used to store point elements of a spatial region, is adapted: sets of triple points become the final leaf nodes, and the three-dimensional matrix is partitioned into blocks, yielding a structure that can store the ID-triple matrix data. During data retrieval, if the RDF triple fails to match, a result indicating that no such data exists is returned directly; if the match succeeds, the corresponding data block is entered and searched.

The serialized data structure proposed by the present invention is in essence a compressed representation of a tree with N nodes. For the three-dimensional-matrix storage structure built above to hold the ID triples, a serialization structure is proposed that stores the final traversal values of that structure: the ID-triple point set of the entire three-dimensional matrix is expressed with two bit arrays, which compresses the data while still preserving the original logical structure clearly. The index over the data blocks is a three-dimensional-matrix tree index. A query is matched against the index, and if it hits in the linear sequence of ID-triple values of a block, the corresponding RDF triple exists. The two-level index also supports decompression of the compressed data set: the traversal result of the two-dimensional matrix is used to match the corresponding coordinates and return the data blocks that are actually stored, these blocks are traversed in parallel to retrieve the stored ID-triple data, and the ID triples are restored with the previously built mapping dictionary, which decompresses the compressed file.
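The decompression path described above (walk the two-dimensional index, traverse the stored blocks in parallel, map IDs back through the dictionary) can be sketched as follows. This is an illustration under assumptions: blocks are held as plain sets of ID triples rather than in compressed form, the index is a dictionary keyed by block coordinates, and the names `decompress`, `id_to_node` and `id_to_pred` are hypothetical. A thread pool is used only to show the structure of a parallel traversal.

```python
from concurrent.futures import ThreadPoolExecutor


def decompress(index, id_to_node, id_to_pred):
    """Traverse every block listed in the 2D index and restore RDF terms.

    index: {(block row, block column): set of (s, p, o) ID triples}
    id_to_node / id_to_pred: inverse mapping dictionaries (ID -> term).
    """
    def restore(block):
        # Map each ID triple of one block back to its string terms.
        return [
            (id_to_node[s], id_to_pred[p], id_to_node[o])
            for s, p, o in sorted(block)
        ]

    with ThreadPoolExecutor() as pool:
        # map() yields results in input order, so output order is deterministic.
        parts = pool.map(restore, (index[key] for key in sorted(index)))
        return [triple for part in parts for triple in part]


# Hypothetical index and inverse dictionaries.
index = {(0, 0): {(1, 1, 2), (2, 1, 1)}, (1, 0): {(5, 2, 3)}}
id_to_node = {1: "ex:alice", 2: "ex:bob", 3: '"Alice"', 5: "ex:carol"}
id_to_pred = {1: "foaf:knows", 2: "foaf:name"}
restored = decompress(index, id_to_node, id_to_pred)
```

Because each block is restored independently of the others, the per-block work can be distributed without coordination; only the final concatenation is sequential.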

The above describes preferred embodiments of the present invention. It should be noted that a person of ordinary skill in the art may make improvements and refinements without departing from the principle of the invention, and such improvements and refinements are also regarded as falling within the scope of protection of the invention.

Claims


1. A compression method for linked data based on a hierarchical tree index, characterized by comprising the following steps:

2. The compression method for linked data based on a hierarchical tree index according to claim 1, characterized in that, in step S1, the long strings that express subjects, predicates and objects in the original data set are converted into short integer IDs, so that triples are expressed more concisely.

3. The compression method for linked data based on a hierarchical tree index according to claim 2, characterized in that the dictionary file is generated from the structure of the linked data set itself and implements the conversion from URI identifiers in string form to IDs.

4. The compression method for linked data based on a hierarchical tree index according to claim 1, characterized in that, in step S2, drawing on the properties of the octree, the ID-triple data produced by dictionary encoding is arranged as a matrix, and the octree structure, originally used to store point elements of a spatial region, is adapted: sets of triple points become the final leaf nodes, and the three-dimensional matrix is partitioned into blocks, yielding a structure that can store the ID-triple matrix data.

5. The compression method for linked data based on a hierarchical tree index according to claim 4, characterized in that, during data retrieval, if the RDF triple fails to match, a result indicating that no such data exists is returned directly, and if the match succeeds, the corresponding data block is entered and searched.

6. The compression method for linked data based on a hierarchical tree index according to claim 1, characterized in that, in step S3, the serialized ID-triple data is a compressed representation of a tree with N nodes; the structure stores the final traversal values of the tree, and the ID-triple point set of the entire three-dimensional matrix is expressed with two bit arrays, preserving the original logical structure while compressing.

7. The compression method for linked data based on a hierarchical tree index according to claim 6, characterized in that the index over the block data is a three-dimensional-matrix tree index; a query is matched against the index, and if it hits in the linear sequence of ID-triple values of a block, the corresponding RDF triple exists.