CN114297260B - Distributed RDF data query method, device and computer equipment - Google Patents


Info

Publication number
CN114297260B
Authority
CN
China
Prior art keywords
query
rdf data
sub
rdf
sparql
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202111644837.6A
Other languages
Chinese (zh)
Other versions
CN114297260A (en)
Inventor
肖国庆
岑楚璇
李肯立
陈玥丹
李雪琪
周旭
刘楚波
阳王东
唐卓
廖清
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hunan University
Original Assignee
Hunan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hunan University
Priority to CN202111644837.6A
Publication of CN114297260A
Application granted
Publication of CN114297260B
Active
Anticipated expiration


Abstract


The present application relates to a distributed RDF data query method, device and computer equipment. The method comprises: obtaining an RDF data set and a SPARQL query statement; dividing the RDF data set among corresponding child nodes according to a constructed consistent hash ring, the consistent hash ring comprising a mapping relationship between the hash values of the subject values of the RDF data and the virtual node positions of the child nodes; sending, according to the mapping relationship, the sub-query statements in the SPARQL query statement to the corresponding child nodes according to a query data processing priority; and receiving the query results fed back by the child nodes and performing an intersection operation on the query results to obtain a target query result, each query result being obtained by a child node querying the RDF data assigned to it according to the received sub-query statement. The method can improve the efficiency of data queries.

Description

Distributed RDF data query method, device and computer equipment
Technical Field
The present application relates to the field of graph data management technology, and in particular, to a distributed RDF data query method, apparatus, computer device, storage medium, and computer program product.
Background
RDF (Resource Description Framework) is a framework formulated by the W3C that defines an abstract syntax (data model) linking all RDF-based languages and specifications. As RDF data management has attracted growing attention, the standard query language SPARQL, also defined by the W3C, has motivated work on efficiently executing SPARQL queries over large RDF data sets. As the amount of RDF data grows, the computational cost of indexing and querying large data sets becomes a challenge. Research on optimizing SPARQL queries over large-scale RDF data sets is therefore of considerable importance.
Most current SPARQL query systems are distributed systems, which are either built on a distributed data processing framework using shared-nothing computing clusters or implement a dedicated distributed computing method. The RDF graph is partitioned across multiple machines to handle large data sets, and query execution is parallelized to reduce run time. Answering a query typically involves processing the local data on each machine and exchanging data between machines.
However, while existing distributed RDF query processing systems increase parallelism to reduce run time, they suffer from problems such as excessive data partitioning overhead and unbalanced load, which directly reduce the efficiency of data queries.
Disclosure of Invention
In view of the foregoing, it is desirable to provide a distributed RDF data query method, apparatus, computer device, computer readable storage medium and computer program product that can improve data query efficiency.
In a first aspect, the present application provides a distributed RDF data query method. The method comprises the following steps:
obtaining an RDF data set and a SPARQL query statement;
dividing the RDF data set among corresponding child nodes according to a constructed consistent hash ring, wherein the consistent hash ring comprises a mapping relationship between the hash values of the subject values of the RDF data and the virtual node positions of the child nodes;
sending, according to the mapping relationship, the sub-query statements in the SPARQL query statement to the corresponding child nodes according to a query data processing priority; and
receiving the query results fed back by the child nodes, and performing an intersection operation on the query results to obtain a target query result.
In one embodiment, before dividing the RDF data set among the corresponding child nodes according to the constructed consistent hash ring, the method further comprises:
encoding the RDF data set to obtain an encoded RDF data set and a bidirectional dictionary;
and dividing the RDF data set among the corresponding child nodes according to the constructed consistent hash ring comprises:
dividing the encoded RDF data set among the corresponding child nodes according to the constructed consistent hash ring.
In one embodiment, before sending, according to the mapping relationship, the sub-query statements in the SPARQL query statement to the corresponding child nodes according to the preset query data processing priority, the method further comprises:
encoding the SPARQL query statement according to the bidirectional dictionary to obtain an encoded SPARQL query statement;
and sending, according to the mapping relationship, the sub-query statements in the SPARQL query statement to the corresponding child nodes according to the query data processing priority comprises:
sending, according to the mapping relationship, the sub-query statements in the encoded SPARQL query statement to the corresponding child nodes according to the query data processing priority.
In one embodiment, dividing the RDF data set among the corresponding child nodes according to the constructed consistent hash ring comprises:
performing a hash calculation on the subject value of each piece of RDF data in the RDF data set to obtain a hash value of the subject value;
searching the consistent hash ring for the virtual node position corresponding to the hash value of the subject value; and
sending each piece of RDF data to the child node corresponding to the virtual node position.
In one embodiment, sending, according to the mapping relationship, the sub-query statements in the SPARQL query statement to the corresponding child nodes according to the query data processing priority comprises:
sorting the SPARQL query statement according to a preset query data processing priority to obtain a sorted SPARQL query statement; and
sending the sorted SPARQL query statement to the corresponding child nodes according to the mapping relationship.
In one embodiment, the consistent hash ring is constructed as follows:
allocating a preset number of virtual nodes to each child node; and
obtaining the position of each virtual node by using a hash algorithm to obtain the consistent hash ring.
In a second aspect, the application further provides a distributed RDF data query device. The device comprises:
a data acquisition module, configured to obtain an RDF data set and a SPARQL query statement;
a data dividing module, configured to divide the RDF data set among corresponding child nodes according to a constructed consistent hash ring, wherein the consistent hash ring comprises a mapping relationship between the hash values of the subject values of the RDF data and the virtual node positions of the child nodes;
a query statement sending module, configured to send, according to the mapping relationship, the sub-query statements in the SPARQL query statement to the corresponding child nodes according to a query data processing priority; and
a query result processing module, configured to receive the query results fed back by the child nodes and perform an intersection operation on the query results to obtain a target query result.
In a third aspect, the present application also provides a computer device. The computer device comprises a memory storing a computer program and a processor which when executing the computer program performs the steps of:
obtaining an RDF data set and a SPARQL query statement;
dividing the RDF data set among corresponding child nodes according to a constructed consistent hash ring, wherein the consistent hash ring comprises a mapping relationship between the hash values of the subject values of the RDF data and the virtual node positions of the child nodes;
sending, according to the mapping relationship, the sub-query statements in the SPARQL query statement to the corresponding child nodes according to a query data processing priority; and
receiving the query results fed back by the child nodes, and performing an intersection operation on the query results to obtain a target query result.
In a fourth aspect, the present application also provides a computer-readable storage medium. The computer-readable storage medium has stored thereon a computer program which, when executed by a processor, performs the steps of:
obtaining an RDF data set and a SPARQL query statement;
dividing the RDF data set among corresponding child nodes according to a constructed consistent hash ring, wherein the consistent hash ring comprises a mapping relationship between the hash values of the subject values of the RDF data and the virtual node positions of the child nodes;
sending, according to the mapping relationship, the sub-query statements in the SPARQL query statement to the corresponding child nodes according to a query data processing priority; and
receiving the query results fed back by the child nodes, and performing an intersection operation on the query results to obtain a target query result.
In a fifth aspect, the present application also provides a computer program product. The computer program product comprises a computer program which, when executed by a processor, implements the steps of:
obtaining an RDF data set and a SPARQL query statement;
dividing the RDF data set among corresponding child nodes according to a constructed consistent hash ring, wherein the consistent hash ring comprises a mapping relationship between the hash values of the subject values of the RDF data and the virtual node positions of the child nodes;
sending, according to the mapping relationship, the sub-query statements in the SPARQL query statement to the corresponding child nodes according to a query data processing priority; and
receiving the query results fed back by the child nodes, and performing an intersection operation on the query results to obtain a target query result.
With the above distributed RDF data query method, device, computer equipment, storage medium and computer program product, the RDF data set is divided among the corresponding child nodes according to a consistent hash ring that comprises the mapping relationship between the hash values of the subject values of the RDF data and the virtual node positions of the child nodes. RDF data with the same subject value is assigned to the same node and the data is divided evenly among the nodes, which effectively addresses the problems of excessive data partitioning overhead and unbalanced load. Further, the sub-query statements in the SPARQL query statement are sent to the corresponding child nodes according to the mapping relationship and the query data processing priority, which optimizes the SPARQL query and distributes the query work to the nodes for distributed execution, effectively improving data query speed.
Drawings
FIG. 1 is an application environment diagram of a distributed RDF data query method in one embodiment;
FIG. 2 is a flow diagram of a distributed RDF data query method in one embodiment;
FIG. 3 is a flow chart of a distributed RDF data query method according to another embodiment;
FIG. 4 is a data flow diagram of a distributed RDF data query method in one embodiment;
FIG. 5 is a block diagram of a distributed RDF data query device in one embodiment;
FIG. 6 is a block diagram of a distributed RDF data query device in one embodiment;
FIG. 7 is an internal structural diagram of a computer device in one embodiment.
Detailed Description
The present application will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present application more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the application.
The distributed RDF data query method provided by the embodiments of the present application can be applied in the application environment shown in FIG. 1, in which a master node 102 communicates with child nodes (also referred to as working nodes) 104 via a network. A data storage system may store the data that the master node 102 needs to process. The data storage system may be integrated on the master node 102 or located on a cloud or other network server. Specifically, a worker inputs an RDF data set and a SPARQL query statement to the master node 102 and sends a data processing instruction to the master node 102. In response to the instruction, the master node 102 obtains the RDF data set and the SPARQL query statement and divides the RDF data set among the corresponding child nodes according to a constructed consistent hash ring, wherein the consistent hash ring comprises the mapping relationship between the hash values of the subject values of the RDF data and the virtual node positions of the child nodes. The master node 102 then sends, according to the mapping relationship, the sub-query statements in the SPARQL query statement to the corresponding child nodes according to a query data processing priority, receives the query results fed back by the child nodes, and performs an intersection operation on the query results to obtain a target query result, wherein each query result is obtained by a child node querying the RDF data assigned to it according to the received sub-query statement. The master node 102 and the child nodes 104 may be, but are not limited to, personal computers, notebook computers, smartphones, tablet computers, Internet of Things devices, and portable wearable devices. The Internet of Things devices may be, for example, intelligent air conditioning devices. The portable wearable devices may be smart watches, smart bracelets, headsets, and the like.
In one embodiment, as shown in FIG. 2, a distributed RDF data query method is provided. The method is described taking its application to the master node 102 in FIG. 1 as an example, and includes the following steps:
Step 202, obtaining an RDF data set and a SPARQL query statement.
The RDF data set is a set of RDF triples, each comprising a subject value, an attribute value (predicate), and an object value. The SPARQL query statement comprises a plurality of sub-query statements, each of which also takes the form of a triple.
As shown in FIG. 3, in one embodiment, before step 204, the method further includes: step 203, encoding the RDF data set to obtain an encoded RDF data set and a bidirectional dictionary.
In practical applications, data storage and data transmission incur significant overhead because the RDF data set is large and the individual pieces of RDF data are long. To solve this problem, in this embodiment the vertex and edge data of each RDF triple may be encoded. The codes start from 0 and each code corresponds to one data item, finally yielding a vertex dictionary and an edge dictionary, that is, the bidirectional dictionary, together with the encoded RDF data set. Specifically, one bidirectional dictionary may be constructed separately for the attribute values of the triples, and another bidirectional dictionary may be constructed for the subject values and object values of the triples. In this embodiment, encoding the RDF data set turns long character strings into numbers, which effectively reduces storage overhead.
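By way of a non-limiting illustration, the encoding described above can be sketched in Python as follows. The function name `encode_triples`, the `ex:` prefixed terms, and the choice of a dict plus a list to represent each bidirectional dictionary are assumptions made for this sketch and are not taken from the application.

```python
def encode_triples(triples):
    """Encode string triples as integer triples; codes start from 0."""
    vertex_to_id, id_to_vertex = {}, []   # subject and object values (vertices)
    edge_to_id, id_to_edge = {}, []       # attribute values (edges)

    def code(term, forward, backward):
        # Assign the next free code the first time a term is seen.
        if term not in forward:
            forward[term] = len(backward)
            backward.append(term)
        return forward[term]

    encoded = [
        (code(s, vertex_to_id, id_to_vertex),
         code(p, edge_to_id, id_to_edge),
         code(o, vertex_to_id, id_to_vertex))
        for s, p, o in triples
    ]
    return encoded, (vertex_to_id, id_to_vertex), (edge_to_id, id_to_edge)


# Hypothetical sample data.
triples = [
    ("ex:alice", "ex:knows", "ex:bob"),
    ("ex:bob", "ex:knows", "ex:carol"),
    ("ex:alice", "ex:worksAt", "ex:hnu"),
]
encoded, vertex_dict, edge_dict = encode_triples(triples)
```

Each dictionary is kept as a forward map (string to code) and a backward list (code to string), so that codes in query results can be decoded back to strings.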
Step 204, dividing the RDF data set among corresponding child nodes according to the constructed consistent hash ring, wherein the consistent hash ring comprises the mapping relationship between the hash values of the subject values of the RDF data and the virtual node positions of the child nodes.
The consistent hash ring comprises a mapping relationship between the hash values of the subject values of the RDF data (RDF triples) and the positions of the virtual nodes allocated to the child nodes, wherein the hash value of the subject value of each RDF triple corresponds to the position of one virtual node. Specifically, based on the constructed consistent hash ring, the RDF data set may be divided using a consistent hashing algorithm, so that the RDF data in the RDF data set is distributed among a plurality of child nodes. Each piece of RDF data may be sent to its child node through MPI (Message Passing Interface).
In another embodiment, step 204 includes: step 224, dividing the encoded RDF data set among the corresponding child nodes according to the constructed consistent hash ring.
In a specific implementation, the encoded RDF data set may be divided among the corresponding child nodes according to the constructed consistent hash ring. Likewise, each piece of encoded RDF data may be sent to its child node through MPI. In this embodiment, sending encoded RDF data to the child nodes significantly reduces the overhead of data transmission.
As shown in FIG. 3, in one embodiment, before step 206, the method further includes: step 205, encoding the SPARQL query statement according to the bidirectional dictionary to obtain an encoded SPARQL query statement.
In a specific implementation, the character strings contained in the SPARQL query statement can be found in the bidirectional dictionary together with their codes. Therefore, to further reduce the data transmission overhead of the SPARQL query statement, the SPARQL query statement may be encoded according to the bidirectional dictionary to obtain an encoded SPARQL query statement. Specifically, the subject values, attribute values and object values in the SPARQL query statement are traversed, and the bidirectional dictionary is searched for data identical to them. If such data exists, the subject value, attribute value and/or object value in the SPARQL query statement is encoded into the corresponding numeric code according to the correspondence between character strings and numeric codes in the bidirectional dictionary. In this way a lengthy SPARQL query statement is converted into a short one, greatly reducing the overhead of data transmission.
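The query encoding described above can be illustrated with the following minimal Python sketch. It assumes that sub-query statements are represented as (subject, attribute, object) tuples, that variables are strings beginning with `?`, and that constants missing from the dictionaries are left unchanged. These representation choices and the name `encode_query` are hypothetical.

```python
def encode_query(patterns, vertex_to_id, edge_to_id):
    """Replace constants found in the dictionaries with their numeric codes."""
    def enc(term, table):
        if term.startswith("?"):       # a variable has no code
            return term
        return table.get(term, term)   # constant absent from the dictionary stays as-is

    return [
        (enc(s, vertex_to_id), enc(p, edge_to_id), enc(o, vertex_to_id))
        for s, p, o in patterns
    ]


# Hypothetical dictionaries and query.
vertex_to_id = {"ex:alice": 0, "ex:bob": 1}
edge_to_id = {"ex:knows": 0}
patterns = [("ex:alice", "ex:knows", "?x"), ("?x", "ex:knows", "?y")]
encoded_query = encode_query(patterns, vertex_to_id, edge_to_id)
```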
Step 206, sending, according to the mapping relationship, the sub-query statements in the SPARQL query statement to the corresponding child nodes according to the query data processing priority.
In this embodiment, the query data processing priority may be as follows: sub-query statements whose subject value is known are executed first, followed by sub-query statements whose object value is known; if both the subject and the object are unknown, the subject variable that occurs most frequently in the sub-query statements is found and the sub-queries that have this subject variable as their object are executed; finally, the remaining sub-query statements are executed. Specifically, according to the mapping relationship between hash values and virtual node positions, the sub-query statements of the SPARQL query statement are sent to the corresponding child nodes, so that the child nodes execute the sub-query statements and feed back the corresponding query results.
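One possible reading of this priority rule, offered only as a sketch, is a stable sort over four ranks. The tuple representation of sub-queries, the `?` prefix for variables, and the names `is_var` and `order_subqueries` are assumptions introduced here.

```python
from collections import Counter


def is_var(term):
    return isinstance(term, str) and term.startswith("?")


def order_subqueries(patterns):
    # Most frequent subject variable across all sub-queries (None if there is none).
    subject_vars = Counter(s for s, _, _ in patterns if is_var(s))
    top_var = subject_vars.most_common(1)[0][0] if subject_vars else None

    def rank(pattern):
        s, _, o = pattern
        if not is_var(s):
            return 0   # subject value known
        if not is_var(o):
            return 1   # object value known
        if o == top_var:
            return 2   # object is the most frequent subject variable
        return 3       # remaining sub-queries

    return sorted(patterns, key=rank)   # stable: ties keep their input order


# Hypothetical query with four sub-queries.
patterns = [
    ("?x", "ex:knows", "?y"),
    ("?z", "ex:likes", "?x"),
    ("?x", "ex:worksAt", "ex:hnu"),
    ("ex:alice", "ex:knows", "?x"),
]
ordered = order_subqueries(patterns)
```

Here `?x` is the most frequent subject variable, so the sub-query with `?x` as its object is placed ahead of the one whose subject and object are both other variables.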
In another embodiment, step 206 includes: step 226, sending, according to the mapping relationship, the sub-query statements in the encoded SPARQL query statement to the corresponding child nodes according to the query data processing priority.
In this embodiment, likewise, the sub-query statements of the encoded SPARQL query statement may be sent to the corresponding child nodes according to the mapping relationship between hash values and virtual node positions, so that the child nodes execute the sub-query statements and feed back the corresponding query results, which greatly reduces the overhead of data transmission.
Step 208, receiving the query results fed back by the child nodes, and performing an intersection operation on the query results to obtain a target query result.
In practical applications, after receiving its sub-query statement, a child node executes the query task: it queries the RDF data assigned to it according to the received sub-query statement to obtain the corresponding query result, and then feeds the query result back to the master node. Because the query results fed back by the child nodes are numerous and may contain duplicates, in this embodiment, after the query results fed back by the child nodes are received, the query results belonging to each sub-query statement are joined to obtain the target query result of that sub-query statement, and the target query results of the sub-query statements are then intersected to obtain the final target query result.
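The following Python sketch illustrates this result-merging step under strong simplifying assumptions: every sub-query is taken to return candidate values for a single shared variable, the per-node results of one sub-query are combined with a set union (which also removes duplicates), and the per-sub-query results are then intersected. The name `merge_results` and the data layout are hypothetical, and a full implementation would join on all shared variables.

```python
def merge_results(results_per_subquery):
    """results_per_subquery holds one list per sub-query; each list contains
    the sets of candidate values fed back by the individual child nodes."""
    # Combine what the different child nodes returned for the same sub-query.
    per_subquery = [set().union(*node_results) for node_results in results_per_subquery]
    # Keep only the values that satisfy every sub-query.
    return set.intersection(*per_subquery) if per_subquery else set()


# Hypothetical encoded results for a shared variable ?x.
results = [
    [{1, 2}, {2, 3}],   # sub-query 1, answered by two child nodes
    [{2, 3, 4}],        # sub-query 2, answered by one child node
]
target = merge_results(results)
```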
In the above distributed RDF data query method, the RDF data set is divided among the corresponding child nodes according to a consistent hash ring that comprises the mapping relationship between the hash values of the subject values of the RDF data and the virtual node positions of the child nodes. RDF data with the same subject value is assigned to the same node and the data is divided more evenly among the nodes, which effectively addresses the problems of excessive data partitioning overhead and unbalanced load. Further, the sub-query statements in the SPARQL query statement are sent to the corresponding child nodes according to the mapping relationship and the query data processing priority, which optimizes the SPARQL query and distributes the query work to the nodes for distributed execution, effectively improving data query speed.
In one embodiment, step 204 includes: performing a hash calculation on the subject value of each piece of RDF data in the RDF data set to obtain a hash value of the subject value, searching the consistent hash ring for the virtual node position corresponding to the hash value of the subject value, and sending each piece of RDF data to the child node corresponding to the virtual node position.
In a specific implementation, a hash calculation is performed on the subject value of each RDF triple in the RDF data set to obtain the hash value of the subject value. The consistent hash ring is then searched clockwise for the virtual node position corresponding to that hash value, and each piece of RDF data is sent to the child node corresponding to the virtual node position. In this way, triples with the same subject value are assigned to the same child node and the RDF data is distributed evenly among the child nodes, which solves the problem of unbalanced node load. It will be appreciated that in other embodiments the virtual node position may be searched counterclockwise, depending on the actual situation, which is not limited herein.
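A minimal Python sketch of the clockwise lookup is given below. It assumes the ring is available as a list of (position, node) pairs sorted by position and that subject values are hashed with CRC32, the algorithm the description names for virtual node positions. Using the same hash for subject values is an assumption, as are the names `find_node` and `partition` and the three hand-placed ring positions.

```python
import bisect
import zlib


def find_node(ring, subject):
    """Return the child node owning `subject`; `ring` is sorted by position."""
    h = zlib.crc32(str(subject).encode("utf-8"))
    positions = [pos for pos, _ in ring]
    i = bisect.bisect_left(positions, h)   # first virtual node at or after h
    return ring[i % len(ring)][1]          # wrap around past the last position


def partition(triples, ring):
    """Group triples by the child node that owns their subject value."""
    parts = {}
    for s, p, o in triples:
        parts.setdefault(find_node(ring, s), []).append((s, p, o))
    return parts


# Hypothetical ring with one virtual node per child node, for brevity.
ring = [(2**30, "node-0"), (2**31, "node-1"), (3 * 2**30, "node-2")]
triples = [(0, 0, 1), (1, 0, 2), (0, 1, 3)]   # encoded (subject, attribute, object)
parts = partition(triples, ring)
```

Because the target node depends only on the subject value, the two triples with subject 0 necessarily end up in the same partition.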
In one embodiment, the consistent hash ring is constructed as follows: a preset number of virtual nodes is allocated to each child node, and the position of each virtual node is obtained by using a hash algorithm to obtain the consistent hash ring.
In practical applications, the consistent hash ring is constructed by allocating 100 virtual nodes to each physical child node and then calculating the position of each virtual node with the CRC32 algorithm. Specifically, the node name of the virtual node may be used as the key to be hashed, which gives the position of each virtual node on the hash ring. It is to be understood that in other embodiments 200 virtual nodes, or another number of virtual nodes, may be allocated to each physical child node; the number given here is merely exemplary and not limiting.
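The ring construction can be sketched in Python using the standard library's `zlib.crc32`. The virtual node naming scheme `"<node>#<index>"`, the function name `build_ring`, and the representation of the ring as a sorted list of (position, node) pairs are assumptions of this sketch; the description only states that the virtual node name serves as the hash key.

```python
import zlib


def build_ring(nodes, replicas=100):
    """Place `replicas` virtual nodes per physical child node on the ring."""
    ring = []
    for node in nodes:
        for i in range(replicas):
            name = f"{node}#{i}"   # hypothetical virtual node name used as the key
            ring.append((zlib.crc32(name.encode("utf-8")), node))
    ring.sort()                    # ascending position = clockwise order
    return ring


ring = build_ring(["node-0", "node-1", "node-2"])
```

With three child nodes and the default of 100 virtual nodes each, the ring holds 300 entries whose positions lie in the 32-bit range produced by CRC32.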
In one embodiment, sending, according to the mapping relationship, the sub-query statements in the SPARQL query statement to the corresponding child nodes according to the query data processing priority includes: sorting the SPARQL query statement according to the preset query data processing priority to obtain a sorted SPARQL query statement, and sending the sorted SPARQL query statement to the corresponding child nodes according to the mapping relationship.
In a specific implementation, the SPARQL query statement may be sorted and optimized according to the preset query data processing priority, and the sorted and optimized SPARQL query statement is then sent to the corresponding child nodes according to the mapping relationship. Specifically, sending the sorted and optimized SPARQL query statement to the corresponding child nodes may proceed as follows:
Step (1-1): select a sub-query statement (hereinafter referred to as a sub-query) whose subject value is known, hash its subject value to obtain a hash value, find the position of the virtual node corresponding to the hash value according to the mapping relationship contained in the consistent hash ring, and send the sub-query only to the child node corresponding to that virtual node position for execution, rather than to all nodes, thereby reducing communication overhead;
Step (1-2): select the sub-queries whose object value is known, first send them to all nodes for execution, obtain the values of their subject variables (that is, the subject values) fed back by the child nodes, and return to step (1-1) once the subject values have been obtained;
Step (1-3): if both the subject value and the object value are unknown, find the subject variable that occurs most frequently in the sub-queries, send the sub-queries that have this subject variable as their object to all child nodes, obtain the subject values of those sub-queries, and return to step (1-1) once the values of the subject variable have been obtained;
Step (1-4): send the sub-queries remaining after the screening in steps (1-1) through (1-3) to all nodes for querying.
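The routing decision in steps (1-1) through (1-4) can be illustrated with the following Python sketch. It covers only the choice of target nodes for a single sub-query. The `bindings` map (subject variables whose values have already been fed back), the `locate` callback standing in for the hash ring lookup, and the name `route` are hypothetical.

```python
def is_var(term):
    return isinstance(term, str) and term.startswith("?")


def route(pattern, bindings, locate, all_nodes):
    """Return the child nodes that should execute one sub-query."""
    s, _, _ = pattern
    if not is_var(s):
        return [locate(s)]                                # step (1-1): subject value known
    if s in bindings:
        # Subject values obtained from earlier feedback: back to step (1-1).
        return sorted({locate(v) for v in bindings[s]})
    return list(all_nodes)                                # steps (1-2) to (1-4): all nodes


# Hypothetical stand-in for the consistent hash ring lookup.
all_nodes = ["node-0", "node-1", "node-2"]
locate = lambda subject: f"node-{subject % 3}"

known_subject = route((4, 0, "?x"), {}, locate, all_nodes)
known_object = route(("?x", 0, 7), {}, locate, all_nodes)
resolved = route(("?x", 1, "?y"), {"?x": {3, 4, 6}}, locate, all_nodes)
```

Only the sub-query whose subject variable is still unresolved is broadcast; the other two reach one and two child nodes respectively, which is where the reduction in communication overhead comes from.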
In this embodiment, the sorted and optimized SPARQL query statement is sent to the corresponding child nodes in the above manner. Unlike the prior art, in which query statements are sent to all nodes, this reduces communication overhead and improves query speed.
In order to clearly illustrate the distributed RDF data query method provided by the present application, the following description is made with reference to a specific embodiment and fig. 4:
In response to the data processing instruction, the master node obtains an RDF data set and a SPARQL query statement and encodes the vertex and edge data of each RDF triple. The codes start from 0 and each code corresponds to one data item, finally yielding a vertex dictionary and an edge dictionary, that is, the bidirectional dictionary. The master node then divides the encoded RDF data set among the corresponding child nodes according to the constructed consistent hash ring. The consistent hash ring comprises the mapping relationship between the hash values of the subject values of the RDF data and the virtual node positions of the child nodes, and is constructed by allocating 100 virtual nodes to each physical child node and then calculating the position of each virtual node with the CRC32 algorithm. Next, the SPARQL query statement is encoded according to the bidirectional dictionary to obtain an encoded SPARQL query statement, the sub-query statements in the encoded SPARQL query statement are sorted according to the query data processing priority, and the sorted sub-query statements are sent to the corresponding child nodes according to the mapping relationship. Each child node executes the query task, queries the RDF data assigned to it according to the received sub-query statement to obtain the corresponding query result, and feeds the query result back to the master node.
Specifically, this may proceed as follows:
Step (1-1): select a sub-query statement (hereinafter referred to as a sub-query) whose subject value is known, hash its subject value to obtain a hash value, find the position of the virtual node corresponding to the hash value according to the mapping relationship contained in the consistent hash ring, and send the sub-query only to the child node corresponding to that virtual node position for execution, rather than to all nodes, thereby reducing communication overhead;
Step (1-2): select the sub-queries whose object value is known, first send them to all nodes for execution, obtain the values of their subject variables (that is, the subject values) fed back by the child nodes, and return to step (1-1) once the subject values have been obtained;
Step (1-3): if both the subject value and the object value are unknown, find the subject variable that occurs most frequently in the sub-queries, send the sub-queries that have this subject variable as their object to all child nodes, obtain the subject values of those sub-queries, and return to step (1-1) once the values of the subject variable have been obtained;
Step (1-4): send the sub-queries remaining after the screening in steps (1-1) through (1-3) to all nodes for querying.
The master node receives the query results fed back by the child nodes. Because these query results are numerous and may contain duplicates, in this embodiment, after receiving the query results fed back by the child nodes, the master node may join the query results belonging to each sub-query statement to obtain the target query result of that sub-query statement, and then intersect the target query results of the sub-query statements to obtain the final target query result.
It should be understood that, although the steps in the flowcharts related to the embodiments described above are sequentially shown as indicated by arrows, these steps are not necessarily sequentially performed in the order indicated by the arrows. The steps are not strictly limited to the order of execution unless explicitly recited herein, and the steps may be executed in other orders. Moreover, at least some of the steps in the flowcharts described in the above embodiments may include a plurality of steps or a plurality of stages, which are not necessarily performed at the same time, but may be performed at different times, and the order of the steps or stages is not necessarily performed sequentially, but may be performed alternately or alternately with at least some of the other steps or stages.
Based on the same inventive concept, the embodiments of the application also provide a distributed RDF data query device for implementing the distributed RDF data query method described above. The solution provided by the device is implemented in a manner similar to that described for the method, so for the specific limitations in the one or more embodiments of the distributed RDF data query device provided below, reference may be made to the limitations of the distributed RDF data query method above, which are not repeated here.
In one embodiment, as shown in FIG. 5, there is provided a distributed RDF data query device, comprising: a data acquisition module 510, a data partitioning module 520, a query statement sending module 530, and a query result processing module 540, wherein:
the data acquisition module 510 is configured to acquire an RDF data set and a SPARQL query statement;
the data partitioning module 520 is configured to divide the RDF data set among the corresponding child nodes according to the constructed consistent hash ring, where the consistent hash ring includes a mapping relationship between the hash value of the topic (subject) value of each piece of RDF data and the virtual node position of each child node;
the query statement sending module 530 is configured to send the sub-query statements in the SPARQL query statement to the corresponding child nodes respectively in order of query data processing priority;
the query result processing module 540 is configured to receive the query results fed back by the child nodes and perform an intersection operation on the query results to obtain a target query result.
According to the above distributed RDF data query device, the RDF data set is divided among the corresponding child nodes according to a consistent hash ring that includes the mapping relationship between the hash value of the topic value of each piece of RDF data and the virtual node position of each child node. RDF data with the same topic value is therefore placed on the same node and the data is distributed evenly across the nodes, which effectively addresses the problems of excessive data partitioning cost and unbalanced load. Further, according to the mapping relationship, the sub-query statements in the SPARQL query statement are sent to the corresponding child nodes in order of query data processing priority, which optimizes the SPARQL query and distributes the query work across the nodes for distributed execution, effectively improving the data query speed.
As shown in FIG. 6, in one embodiment, the device further includes an RDF data encoding processing module 550, configured to encode the RDF data set to obtain an encoded RDF data set and a bidirectional dictionary; the data partitioning module 520 is further configured to divide the encoded RDF data set among the corresponding child nodes according to the constructed consistent hash ring.
As shown in FIG. 6, in one embodiment, the device further includes a query data encoding processing module 560, configured to encode the SPARQL query statement according to the bidirectional dictionary to obtain an encoded SPARQL query statement; the query statement sending module 530 is further configured to send the sub-query statements in the encoded SPARQL query statement to the corresponding child nodes respectively in order of query data processing priority.
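The encoding performed by modules 550 and 560 can be illustrated with a simple dictionary encoder. The sketch below is hypothetical: the embodiment does not specify the encoding scheme in this passage, so sequential integer identifiers, the class name `BiDict`, and the functions `encode_dataset` and `encode_pattern` are all assumptions made for illustration.

```python
from typing import Dict, List, Optional, Tuple, Union

Triple = Tuple[str, str, str]


class BiDict:
    """Bidirectional dictionary: term -> integer ID and ID -> term."""

    def __init__(self) -> None:
        self.term_to_id: Dict[str, int] = {}
        self.id_to_term: List[str] = []

    def encode(self, term: str) -> int:
        # Assign the next free ID the first time a term is seen.
        if term not in self.term_to_id:
            self.term_to_id[term] = len(self.id_to_term)
            self.id_to_term.append(term)
        return self.term_to_id[term]

    def decode(self, term_id: int) -> str:
        return self.id_to_term[term_id]


def encode_dataset(triples: List[Triple], d: BiDict) -> List[Tuple[int, int, int]]:
    """Encode every term of every triple, growing the dictionary as needed."""
    return [(d.encode(s), d.encode(p), d.encode(o)) for s, p, o in triples]


def encode_pattern(pattern: Triple, d: BiDict) -> Tuple[Union[str, Optional[int]], ...]:
    """Encode a query pattern with the existing dictionary only.

    Variables (leading '?') are kept as-is; a constant that is absent from
    the dictionary becomes None, meaning the pattern cannot match any data.
    """
    return tuple(t if t.startswith("?") else d.term_to_id.get(t) for t in pattern)
```

Encoding the query with the same dictionary as the data lets the child nodes compare compact integers instead of long strings, and the reverse mapping is used to decode the final results.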
In one embodiment, the data partitioning module 520 is further configured to perform a hash computation on the topic value of each piece of RDF data in the RDF data set to obtain the hash value of the topic value, find the virtual node position on the consistent hash ring that corresponds to the hash value of the topic value, and send each piece of RDF data to the child node corresponding to that virtual node position.
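The partitioning described above can be sketched as a lookup on a sorted ring. This is an illustrative sketch under stated assumptions rather than the embodiment's code: MD5 is chosen arbitrarily as the hash algorithm, the ring is assumed to be a sorted list of `(position, node)` pairs, the first virtual node at or after the hash value (wrapping around) is taken as the corresponding position, and the names `h`, `locate`, and `partition` are hypothetical.

```python
import bisect
import hashlib
from collections import defaultdict
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]
Ring = List[Tuple[int, str]]  # sorted (virtual node position, child node name)


def h(key: str) -> int:
    # Assumption: MD5 is used only as a stable, well-spread hash.
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)


def locate(ring: Ring, positions: List[int], topic: str) -> str:
    """Return the child node owning the first virtual node at or after h(topic)."""
    i = bisect.bisect_left(positions, h(topic)) % len(ring)  # wrap around the ring
    return ring[i][1]


def partition(triples: List[Triple], ring: Ring) -> Dict[str, List[Triple]]:
    """Group triples by the child node that their topic value maps to."""
    positions = [pos for pos, _ in ring]
    parts: Dict[str, List[Triple]] = defaultdict(list)
    for t in triples:
        parts[locate(ring, positions, t[0])].append(t)
    return dict(parts)
```

Because only the topic value is hashed, all triples sharing a topic value land on the same child node, which is the property the embodiment relies on when routing sub-queries with a known topic value.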
In one embodiment, the query statement sending module 530 is further configured to sort the SPARQL query statements according to a preset query data processing priority to obtain sorted SPARQL query statements, and to send the sorted SPARQL query statements in sequence to the corresponding child nodes according to the mapping relationship.
In one embodiment, the apparatus further includes a hash ring construction module 570 configured to allocate a preset number of virtual nodes to each child node, and obtain the positions of each virtual node by using a hash algorithm, so as to obtain a consistent hash ring.
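A minimal sketch of such a ring construction follows. It is illustrative only: the hash algorithm (MD5 here), the labelling of virtual nodes as `"<node>#<index>"`, and the names `ring_position` and `build_ring` are assumptions, since the passage does not fix these details.

```python
import hashlib
from typing import List, Tuple


def ring_position(label: str) -> int:
    """Map a virtual node label to a position on the ring (assumed: MD5)."""
    return int(hashlib.md5(label.encode("utf-8")).hexdigest(), 16)


def build_ring(nodes: List[str], vnodes_per_node: int) -> List[Tuple[int, str]]:
    """Allocate a preset number of virtual nodes per child node and sort them.

    Returns a list of (position, child node name) pairs ordered by position,
    which serves as the consistent hash ring.
    """
    ring = [
        (ring_position(f"{node}#{i}"), node)  # hypothetical label format
        for node in nodes
        for i in range(vnodes_per_node)
    ]
    ring.sort()
    return ring
```

Using many virtual nodes per child node spreads each node's share of the ring over many small arcs, which tends to even out the amount of data each node receives and limits how much data moves when a node is added or removed.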
The modules in the distributed RDF data query device described above may be implemented in whole or in part by software, hardware, or a combination thereof. The modules may be embedded in, or be independent of, a processor in the computer device in the form of hardware, or may be stored in a memory of the computer device in the form of software, so that the processor can call and execute the operations corresponding to the modules.
In one embodiment, a computer device is provided, which may be a server and whose internal structure may be as shown in FIG. 7. The computer device includes a processor, a memory, and a network interface connected by a system bus. The processor of the computer device provides computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for running the operating system and the computer program stored in the non-volatile storage medium. The database of the computer device stores data such as RDF data sets, SPARQL query statements, and consistent hash rings. The network interface of the computer device is used to communicate with an external terminal through a network connection. The computer program, when executed by the processor, implements a distributed RDF data query method.
It will be appreciated by those skilled in the art that the structure shown in FIG. 7 is merely a block diagram of some of the structures associated with the present inventive arrangements and is not limiting of the computer device to which the present inventive arrangements may be applied, and that a particular computer device may include more or fewer components than shown, or may combine some of the components, or have a different arrangement of components.
In one embodiment, a computer device is provided that includes a memory having a computer program stored therein and a processor that when executing the computer program performs the steps of the distributed RDF data query method described above.
In one embodiment, a computer readable storage medium is provided having a computer program stored thereon which, when executed by a processor, implements the steps of the distributed RDF data query method described above.
In one embodiment, a computer program product is provided comprising a computer program which, when executed by a processor, implements the steps of the distributed RDF data query method described above.
The user information (including but not limited to user equipment information, user personal information, etc.) and the data (including but not limited to data for analysis, stored data, presented data, etc.) related to the present application are information and data authorized by the user or sufficiently authorized by each party.
Those skilled in the art will appreciate that all or part of the methods of the above embodiments may be implemented by a computer program stored on a non-transitory computer readable storage medium, and that the computer program, when executed, may include the steps of the method embodiments described above. Any reference to memory, database, or other medium used in the embodiments provided herein may include at least one of non-volatile and volatile memory. The non-volatile memory may include read-only memory (ROM), magnetic tape, floppy disk, flash memory, optical memory, high-density embedded non-volatile memory, resistive random access memory (ReRAM), magnetoresistive random access memory (MRAM), ferroelectric random access memory (FRAM), phase change memory (PCM), graphene memory, and the like. The volatile memory may include random access memory (RAM), external cache memory, and the like. By way of illustration and not limitation, RAM may take various forms, such as static random access memory (SRAM) or dynamic random access memory (DRAM). The databases referred to in the embodiments provided herein may include at least one of a relational database and a non-relational database. The non-relational database may include, but is not limited to, a blockchain-based distributed database. The processor referred to in the embodiments provided herein may be, but is not limited to, a general-purpose processor, a central processing unit, a graphics processor, a digital signal processor, a programmable logic unit, or a data processing logic unit based on quantum computing.
The technical features of the above embodiments may be combined arbitrarily. For brevity, not all possible combinations of these technical features are described; however, as long as a combination of technical features involves no contradiction, it should be considered to fall within the scope of this description.
The foregoing examples represent only a few embodiments of the application and are described specifically and in detail, but they should not be construed as limiting the scope of the application. It should be noted that those skilled in the art can make several variations and modifications without departing from the spirit of the application, all of which fall within the scope of the application. Accordingly, the scope of the application shall be determined by the appended claims.

Claims (10)

1. A distributed RDF data query method, characterized in that the method comprises:
acquiring an RDF data set and a SPARQL query statement;
dividing the RDF data set among corresponding child nodes according to a constructed consistent hash ring, wherein the consistent hash ring includes a mapping relationship between the hash value of the subject value of each piece of RDF data and the virtual node position of each child node;
sending, according to the mapping relationship, the sub-query statements in the SPARQL query statement to the corresponding child nodes respectively in order of query data processing priority;
receiving the query results fed back by the child nodes, and performing an intersection operation on the query results to obtain a target query result.
2. The distributed RDF data query method according to claim 1, characterized in that, before dividing the RDF data set among the corresponding child nodes according to the constructed consistent hash ring, the method further comprises:
encoding the RDF data set to obtain an encoded RDF data set and a bidirectional dictionary;
wherein dividing the RDF data set among the corresponding child nodes according to the constructed consistent hash ring comprises:
dividing the encoded RDF data set among the corresponding child nodes according to the constructed consistent hash ring.
3. The distributed RDF data query method according to claim 2, characterized in that, before sending the sub-query statements in the SPARQL query statement to the corresponding child nodes respectively according to the mapping relationship and the preset query data processing priority, the method further comprises:
encoding the SPARQL query statement according to the bidirectional dictionary to obtain an encoded SPARQL query statement;
wherein sending, according to the mapping relationship, the sub-query statements in the SPARQL query statement to the corresponding child nodes respectively in order of query data processing priority comprises:
sending, according to the mapping relationship, the sub-query statements in the encoded SPARQL query statement to the corresponding child nodes respectively in order of query data processing priority.
4. The distributed RDF data query method according to any one of claims 1 to 3, characterized in that dividing the RDF data set among the corresponding child nodes according to the constructed consistent hash ring comprises:
performing a hash calculation on the subject value of each piece of RDF data in the RDF data set to obtain the hash value of the subject value;
finding the virtual node position on the consistent hash ring that corresponds to the hash value of the subject value;
sending each piece of RDF data to the child node corresponding to the virtual node position.
5. The distributed RDF data query method according to any one of claims 1 to 3, characterized in that sending, according to the mapping relationship, the sub-query statements in the SPARQL query statement to the corresponding child nodes respectively in order of query data processing priority comprises:
sorting the SPARQL query statements according to the preset query data processing priority to obtain sorted SPARQL query statements;
sending the sorted SPARQL query statements to the corresponding child nodes in sequence according to the mapping relationship.
6. The distributed RDF data query method according to any one of claims 1 to 3, characterized in that the consistent hash ring is constructed in the following manner:
allocating a preset number of virtual nodes to each child node;
obtaining the position of each virtual node using a hash algorithm to obtain the consistent hash ring.
7. A distributed RDF data query device, characterized in that the device comprises:
a data acquisition module, configured to acquire an RDF data set and a SPARQL query statement;
a data partitioning module, configured to divide the RDF data set among corresponding child nodes according to a constructed consistent hash ring, wherein the consistent hash ring includes a mapping relationship between the hash value of the subject value of each piece of RDF data and the virtual node position of each child node;
a query statement sending module, configured to send, according to the mapping relationship, the sub-query statements in the SPARQL query statement to the corresponding child nodes respectively in order of query data processing priority;
a query result processing module, configured to receive the query results fed back by the child nodes and perform an intersection operation on the query results to obtain a target query result.
8. A computer device, comprising a memory and a processor, the memory storing a computer program, characterized in that the processor, when executing the computer program, implements the steps of the method according to any one of claims 1 to 6.
9. A computer-readable storage medium having a computer program stored thereon, characterized in that the computer program, when executed by a processor, implements the steps of the method according to any one of claims 1 to 6.
10. A computer program product, comprising a computer program, characterized in that the computer program, when executed by a processor, implements the steps of the method according to any one of claims 1 to 6.
CN202111644837.6A | 2021-12-29 | 2021-12-29 | Distributed RDF data query method, device and computer equipment | Active | CN114297260B (en)

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN202111644837.6A (CN114297260B (en)) | 2021-12-29 | 2021-12-29 | Distributed RDF data query method, device and computer equipment

Applications Claiming Priority (1)

Application Number | Priority Date | Filing Date | Title
CN202111644837.6A (CN114297260B (en)) | 2021-12-29 | 2021-12-29 | Distributed RDF data query method, device and computer equipment

Publications (2)

Publication Number | Publication Date
CN114297260A (en) | 2022-04-08
CN114297260B (en) | 2024-11-26

Family

ID=80971160

Family Applications (1)

Application Number | Title | Priority Date | Filing Date
CN202111644837.6A (Active, CN114297260B (en)) | Distributed RDF data query method, device and computer equipment | 2021-12-29 | 2021-12-29

Country Status (1)

Country | Link
CN (1) | CN114297260B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN114860811B (en) * | 2022-05-25 | 2024-09-17 | Hunan University | Method and device for searching median approximation value of data set and computer equipment
CN116610699A (en) * | 2023-04-26 | 2023-08-18 | Industrial and Commercial Bank of China Limited | Data query method, device, electronic device, medium and program product

Citations (2)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN104462609A (en) * | 2015-01-06 | 2015-03-25 | Fuzhou University | RDF data storage and query method combined with star figure coding
CN109325029A (en) * | 2018-08-30 | 2019-02-12 | Tianjin University | RDF data storage and query method based on sparse matrix

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
US10248701B2 (en) * | 2015-09-18 | 2019-04-02 | International Business Machines Corporation | Efficient distributed query execution
CN110825738B (en) * | 2019-10-22 | 2023-04-25 | Tianjin University | A method and device for data storage and query based on distributed RDF

Also Published As

Publication number | Publication date
CN114297260A (en) | 2022-04-08

Legal Events

Code | Title
PB01 | Publication
SE01 | Entry into force of request for substantive examination
GR01 | Patent grant
