CN109063191B

Movatterモバイル変換

Info

Publication number: CN109063191B
Application number: CN201810995475.7A
Authority: CN
Inventors: 陈榕; 王宁; 陈海波; 臧斌宇; 管海兵
Original assignee: Shanghai Jiao Tong University
Current assignee: Shanghai Jiao Tong University
Priority date: 2018-08-29
Filing date: 2018-08-29
Publication date: 2021-07-06
Anticipated expiration: 2038-08-29
Also published as: CN109063191A

Abstract

本发明提供了一种在RDF数据集上进行OPTIONAL查询的方法，将RDF数据集加载到内存，与客户端建立网络通信，接收含OPTIONAL查询的查询请求；解析查询请求，将查询请求中的查询语句分解成主查询和OPTIONAL查询两部分；执行主查询，产生主查询结果；执行OPTIONAL查询的分组查询，产生分组查询结果，汇总分组查询结果得到OPTIONAL查询结果，将主查询结果、OPTIONAL查询结果结合形成最终查询结果，将最终查询结果发送客户端。本发明提出的在RDF数据集上进行OPTIONAL查询的方法，通过建立OPTIONAL匹配表，标记OPTIONAL变量，调整查询语句顺序，动态修正查询结果等技术，极大地提高了OPTIONAL查询效率，显著降低了查询处理的开销，加快了查询速度。

The invention provides a method for OPTIONAL query on an RDF data set, which loads the RDF data set into memory, establishes network communication with a client, receives a query request containing OPTIONAL query; parses the query request, and converts the query in the query request The statement is decomposed into two parts: the main query and the OPTIONAL query; the main query is executed to generate the main query results; the grouped query of the OPTIONAL query is executed to generate the grouped query results, the grouped query results are aggregated to obtain the OPTIONAL query results, and the main query results and the OPTIONAL query results are combined. The final query result is formed, and the final query result is sent to the client. The method for OPTIONAL query on RDF data set proposed by the present invention greatly improves OPTIONAL query efficiency and significantly reduces query processing by establishing OPTIONAL matching table, marking OPTIONAL variables, adjusting query statement sequence, dynamically modifying query results and other technologies. The overhead, speeding up the query speed.

Description

Method and storage medium for performing OPTIONAL query on RDF dataset

Technical Field

The invention relates to the field of RDF graph query, in particular to a method for carrying out OPTIONAL query on an RDF data set.

Background

Human beings are in the big data age, and various types of mass data need to be stored and used in appropriate forms. Many of the knowledge-graph, social network data is typically stored in the form of RDF (Resource Description Framework). RDF is a standard model of data recommended by W3C (The World Wide Web Consortium) that can conveniently represent structural data on The World Wide Web. In the RDF model, the interrelated data may be represented as triples (triples). A triple is composed of a subject (subject), a predicate (predicate), and an object (object). For example, using the RDF model, we can conveniently store the data "shanghai transportation university located in shanghai city". Wherein "Shanghai transportation university" is the subject of the triple, "Shanghai City" is the predicate, and "Shanghai City" is the object. If the subject and the object are regarded as vertexes and the predicate is regarded as an edge, the triple can be represented as a directed graph connecting the vertex of the subject and the vertex of the object by the predicate. The entire RDF dataset can be viewed as an intricate, directed graph with vertices connected with various predicates.

SPARQL (simple Protocol and RDF Query language) is a language for querying RDF data. The SPARQL query consists of multiple query statements. The query statement is shaped like a triple in RDF, a subject, a predicate or an object in a triple is replaced with a variable to be queried. For a SPARQL statement, if a known element (one or more of a subject, predicate, or object) in the statement is consistent with a corresponding element in an RDF triple, the data is said to "match" the query statement. The data elements corresponding to the variables in the statement are the query results of the statement.

For the underlying SPARQL query, the relationships between statements are similar to joins (joins) in SQL. For example, statement 1 is executed to generate intermediate result 1, and statement 2 is executed based on each piece of data in intermediate result 1, and if the intermediate result does not match statement 2, the piece of data does not appear in intermediate result 2 after the end of the statement. In this specification, we will refer to the underlying SPARQL query as the "master query".

While for the OPTIONAL query, the relationship between statements is similar to the left outer join in SQL. For example, if statement 1 executes to generate intermediate result 1, and statement 2 executes based on each piece of data in intermediate result 1, and if the intermediate result does not match statement 2, the piece of data still appears in intermediate result 2 after the end of the statement, and only the variable required to be queried by statement 2, referred to as the option variable in this specification, will be replaced with a null value.

A SPARQL query with options typically consists of two parts: main queries and options queries. The main query may contain zero or more query statements, enclosed by curly brackets. Following the main query is one or more OPTIONAL query groups, each marked by an OPTIONAL keyword and bracketed around the query statement. Each query group contains zero or more query statements.

The traditional SPARQL query engine does not consider the characteristics of RDF graph structure data, which results in the need of final join (final join) to generate results, and the cost is very high. For basic SPARQL query, a query method for graph exploration exists, final connection is avoided, and efficiency is high. However, the graph exploration technique cannot be applied to the OPTIONAL query because the special nature of the "not matching" intermediate results is not removed by the OPTIONAL query. How to invent an efficient OPTIONAL query method becomes a technical problem in the current graph query field.

Disclosure of Invention

In view of the deficiencies in the prior art, it is an object of the present invention to provide a method for performing an OPTIONAL query on an RDF dataset.

According to the method for carrying out the OPTIONAL query on the RDF data set, provided by the invention, the method comprises the following steps: a request receiving step: loading the RDF data set into a memory, establishing network communication with a client, and receiving a query request containing an OPTIONAL query; analyzing the request: analyzing the query request, and decomposing a query statement in the query request into a main query part and an OPTIONAL query part; a main inquiry step: executing the main query to generate a main query result; optional query procedure: and executing the grouping query of the OPTIONAL query, generating an OPTIONAL grouping query result, summarizing the OPTIONAL grouping query result to obtain an OPTIONAL query result, combining the main query result and the OPTIONAL query result to form a final query result, and sending the final query result to the client.

Preferably, the OPTIONAL querying step comprises: a query grouping step: grouping the OPTIONAL queries according to a grammar specification, one OPTIONAL query being grouped into one or more OPTIONAL query groups; and (3) executing a query step: executing query on a single OPTIONAL query group to form a first query result, and collecting and summarizing the first query result to form a second query result after all the query is executed by a plurality of OPTIONAL query groups; a progress judgment step: judging whether all the OPTIONAL query groups are executed completely, if the not executed OPTIONAL query groups exist, returning to the step of executing the query and continuing to execute; and if all the query groups are executed, sending the second query result as a final query result to the client.

Preferably, the step of performing a query comprises: a pretreatment step: analyzing the OPTIONAL query statement and sequencing the OPTIONAL query statement; a task decomposition step: decomposing the query task to form one or more query subtasks, and sending the query subtasks to a plurality of working threads for execution; a thread query step: the working thread executes the query statements one by one in a graph exploration mode; and a result summarizing step: collecting query results of the working threads, obtaining summary results of the queries, and forming a first query result; and final result summarizing step: comparing the number of the working threads which are inquired with the total number of the working threads, judging whether the inquiry of all the working threads is finished or not, and if not, returning to the thread inquiry step for continuous execution; if the query is finished, summarizing the first query result to form a second query result.

Preferably, the pre-treating step comprises: and (3) variable analysis step: analyzing an OPTIONAL query statement and marking an OPTIONAL variable; establishing a matching table: establishing a matching table based on the main query result, wherein the line number of the matching table is equal to the line number of the main query result; a sorting step: and performing priority classification on the query sentences, wherein the query sentences with high priority are preferentially executed.

Preferably, the thread querying step includes: and (3) searching and screening: replacing a known variable in a query statement with first data corresponding to an intermediate result, querying the first data in an RDF data set, querying to obtain second data, if the second data is inconsistent with the first data, marking a corresponding unit of a matching table as unmatched, and if the second data is not inconsistent with the first data, marking a corresponding unit of the matching table as matched; interrupting subsequent query operations for statements marked as unmatched in units of the matching table; and (3) unknown variable query step: inquiring whether third data matching the statement exists in the RDF data set or not according to each query statement containing unknown variables and each intermediate result marked as matching in the matching table, and if so, updating the intermediate result of the statement by using the third data; if not, updating the mark of the strip in the matching table to be not matched; and a known variable query step: inquiring whether fourth data matching with the statement exists in the RDF data set or not according to the query statement containing the known variable and the intermediate result marked as matching in the matching table, and if so, keeping the matching mark of the statement in the matching table; if not, the tag of the strip in the match table is updated to a mismatch.

Preferably, the result summarizing step includes: collecting query results of the working threads, obtaining summary results of the queries, and forming a first query result; a working thread recording step: recording the number of the working threads which have returned to the query result, and recording as the queried number; recording the total number of the working threads of the sent tasks, and recording the total number as the total number of the threads; a counting updating step: if the working thread returns the query result, summarizing the query result into a summary result, and increasing the query quantity; a step of summarizing and judging: and if the queried number is equal to the total number of the threads, taking the summary result as a first query result.

According to the present invention, there is provided a computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the steps of the above-described method.

Compared with the prior art, the invention has the following beneficial effects:

1. reordering the query sentences according to the main query result and the query sentences, so that the sentences with higher selectivity can be preferentially executed, unnecessary calculation amount is reduced, and the query speed is increased;

2. the OPTIONAL query task is decomposed into a plurality of subtasks and is executed by a plurality of working threads in parallel, so that the execution efficiency is improved, and the query speed is increased;

3. the method has the advantages that the 'matched' intermediate result is kept in the query process through the OPTIONAL matching table marking, the unmatched intermediate result is removed, the solving range of an OPTIONAL query group is recorded in a mode of marking an OPTIONAL variable, the query result is dynamically corrected in the query process of the OPTIONAL, the OPTIONAL query can be conducted in a graph exploration mode, a final join (final join) step in the prior art is avoided, and the query speed is greatly increased;

5. the problem that the prior art cannot efficiently execute the OPTIONAL query is solved, and reference is provided for the technical scheme of utilizing a graph exploration mode to perform other complex types of queries.

Drawings

Other features, objects and advantages of the invention will become more apparent upon reading of the detailed description of non-limiting embodiments with reference to the following drawings:

FIG. 1 is a graphical representation of an RDF data set of an embodiment;

FIG. 2 is a flow chart of an OPTIONAL query on an RDF dataset.

Detailed Description

The present invention will be described in detail with reference to specific examples. The following examples will assist those skilled in the art in further understanding the invention, but are not intended to limit the invention in any way. It should be noted that it would be obvious to those skilled in the art that various changes and modifications can be made without departing from the spirit of the invention. All falling within the scope of the present invention.

Specifically, the option query step includes: a query grouping step: grouping the OPTIONAL queries according to a grammar specification, one OPTIONAL query being grouped into one or more OPTIONAL query groups; and (3) executing a query step: executing query on a single OPTIONAL query group to form a first query result, and collecting and summarizing the first query result to form a second query result after all the query is executed by a plurality of OPTIONAL query groups; a progress judgment step: judging whether all the OPTIONAL query groups are executed completely, if the not executed OPTIONAL query groups exist, returning to the step of executing the query and continuing to execute; and if all the query groups are executed, sending the second query result as a final query result to the client.

Specifically, the step of executing the query includes: a pretreatment step: analyzing the OPTIONAL query statement and sequencing the OPTIONAL query statement; a task decomposition step: decomposing the query task to form one or more query subtasks, and sending the query subtasks to a plurality of working threads for execution; a thread query step: the working thread executes the query statements one by one in a graph exploration mode; and a result summarizing step: collecting query results of the working threads, obtaining summary results of the queries, and forming a first query result; and final result summarizing step: comparing the number of the working threads which are inquired with the total number of the working threads, judging whether the inquiry of all the working threads is finished or not, and if not, returning to the step of executing the inquiry to continue the execution; if the query is finished, summarizing the first query result to form a second query result.

Specifically, the pretreatment step includes: and (3) variable analysis step: analyzing an OPTIONAL query statement and marking an OPTIONAL variable; establishing a matching table: establishing a matching table based on the main query result, wherein the line number of the matching table is equal to the line number of the main query result; a sorting step: and performing priority classification on the query sentences, wherein the query sentences with high priority are preferentially executed.

Specifically, the thread querying step includes: and (3) searching and screening: replacing a known variable in a query statement with first data corresponding to an intermediate result, querying the first data in an RDF data set, querying to obtain second data, if the second data is inconsistent with the first data, marking a corresponding unit of a matching table as unmatched, and if the second data is not inconsistent with the first data, marking a corresponding unit of the matching table as matched; interrupting subsequent query operations for statements marked as unmatched in units of the matching table; and (3) unknown variable query step: inquiring whether third data matching the statement exists in the RDF data set or not according to each query statement containing unknown variables and each intermediate result marked as matching in the matching table, and if so, updating the intermediate result of the statement by using the third data; if not, updating the mark of the strip in the matching table to be not matched; and a known variable query step: inquiring whether fourth data matching with the statement exists in the RDF data set or not according to the query statement containing the known variable and the intermediate result marked as matching in the matching table, and if so, keeping the matching mark of the statement in the matching table; if not, the tag of the strip in the match table is updated to a mismatch.

Specifically, the result summarizing step includes: collecting query results of the working threads, obtaining summary results of the queries, and forming a first query result; a working thread recording step: recording the number of the working threads which have returned to the query result, and recording as the queried number; recording the total number of the working threads of the sent tasks, and recording the total number as the total number of the threads; a counting updating step: if the working thread returns the query result, summarizing the query result into a summary result, and increasing the query quantity; a step of summarizing and judging: and if the queried number is equal to the total number of the threads, taking the summary result as a first query result.

According to the present invention, a computer-readable storage medium is provided, in which a computer program is stored, which, when being executed by a processor, carries out the steps of the above-mentioned method.

According to the method for carrying out OPTIONAL query on the RDF data set, provided by the invention, by the technologies of establishing an OPTIONAL matching table, marking OPTIONAL variables, adjusting the sequence of query statements, dynamically correcting query results and the like, the query efficiency is greatly improved, the query processing expense is obviously reduced, and the query speed is accelerated.

The present invention is explained in more detail below.

When a request is received, the RDF data set is loaded into the memory before the service is provided, and if the system comprises a server, the memory of the server can completely contain the RDF data set. If a plurality of servers are included, each server can load a complete data set, or the data set can be divided, and each server loads a part of the data set. The slicing of the dataset may be uneven, allowing some servers to load more data and others to load less data. The servers and clients in the system should be able to establish network communications based on various types of network communication techniques, including but not limited to TCP and RDMA. The client sends the query request to the server through the network, and the query result is returned to the client through the network.

When parsing a query request, the SPARQL syntax specification should be met. The main query portion may contain 0 to more query statements and the OPTIONAL query may contain a plurality of OPTIONAL query groups, each of which may contain 0 to more query statements. After the server analyzes the query request, the whole query is placed in a query structure, and the main query statement and each OPTIONAL query group are respectively stored in the structure. The OPTIONAL query set is stored in a vector (vector) in groups, and is fetched from the vector one at a time for execution. The query structure contains the id of the query to distinguish it from other queries. And recording the current execution progress in the query structure body to determine the subsequent steps.

When executing the main query, the main query may be implemented in various ways, including but not limited to final join, graph explorer. The main query is typically performed using graph exploration to provide better query performance. The main query result is stored in a vector (result _ table), and the number of rows (n _ rows) and the number of columns (n _ cols) of the query result are recorded in the query structure. The query structure records the column of the query result and the mapping (variable _ to _ column _ map) of the variable. Locating the position of a result in the vector by: the column (col) corresponding to the variable X in the query result is variable _ to _ column _ map [ X ]; the kth result value of the variable X is result table [ (k-1) × n _ cols + col ].

The server parses the options query portion, flagging the options variable. The specific process of the analysis is as follows: the query statement is scanned bar by bar, and for each element of each query statement (triplet), it is first checked whether it is a variable. If so, a lookup is made in the intermediate result whether the variable is a known variable. If the variable is unknown, then see if it has been marked. If not, it is marked as an OPTIONAL variable. The tags are stored in a set (set). The server establishes an OPTIONAL matching table based on the main query result. The OPTIONAL matching table is stored in a vector data structure (vector), and the arrangement of the matching table is equal to the number of lines of the main query result and corresponds to each line of the main query result one by one. At initialization, the value of each row in the match table is set to true (true). The server reorders the OPTIONAL statements and optimizes the query. When reordering, the OPTIONAL statements are divided into three priorities, wherein the high priority is executed preferentially, the medium priority is executed after the high priority, and the low priority is executed finally. The priority is generated according to the following principle: high priority: the subject and the object in the query statement are constant and known variables (index to below, below to const, const to below); medium priority: in the query statement, the subject is a constant or a known variable, and the object is an unknown variable (index to unknown, neighbor to unknown); low priority: the query statement is mainly unknown variable (unknown to unknown, unknown to const)

The respective work threads may be distributed on the same machine or may be distributed on a plurality of machines. If the working thread is on multiple machines, the multiple machines can all load complete RDF data, or the RDF data can be divided, and each machine stores a part of RDF data. If the working thread is on a plurality of machines, and each machine stores partial RDF data, the server divides the task by considering the distribution condition of the data. In particular, all the data required for a task assigned to a machine should be found on that machine.

When the thread queries, pruning is performed by using the OPTIONAL matching table in the query process, and the subsequent statements are not calculated for the intermediate results marked as 'unmatched' in the matching table. Specifically, the worker thread replaces the known variable of the current query statement triplet with the corresponding data in an intermediate result (marked as true in the match table) and then uses it to find the result in the dataset. If the found result is inconsistent with the intermediate result, the mismatching occurs, and the corresponding row in the matching table is marked as false (false); for query statements that involve unknown OPTIONAL variables, the RDF graph data is looked up for the presence of data that matches the statement based on the intermediate results of each item labeled "match" in the match table. If yes, updating the intermediate result; if not, marking the result as not matched; for query statements that refer to known variables, the RDF graph data is looked up for the presence of data that matches the current statement based on the intermediate results of each entry marked as a "match" in the match table. If yes, the intermediate result accords with the query condition, and the matching state is kept; if not, the intermediate result violates the statement, marks the intermediate result as 'not matched', and corrects the OPTIONAL variable of the intermediate result to be null. In particular, it will be checked whether each column of the intermediate result belongs to the OPTIONAL variable, and if so, the value of the column is modified to null (BLANK).

After the thread query is finished, the server records the number of the working threads which have returned results at present and the number of the working threads of the total scheduled tasks; when a worker thread returns a query result, adding the result into the summary result, and increasing the number count of the worker threads which have returned the result; when the number of work threads that have returned results is equal to the number of work threads for the total scheduled tasks, an aggregated result is declared to have been generated.

After an option query group is executed, the progress in the query structure is increased (OPTIONAL _ step + ═ 1). By comparing the number of progress (OPTIONAL _ step) and option query groups, it can be determined whether all queries have been completed. If not, the next unexecuted OPTIONAL query set is taken out from the vector, and the unexecuted OPTIONAL query set is continuously queried. If all the data is finished, the summary result is returned to the client side in a network mode.

In summary, the method for performing the optimal query on the RDF dataset according to the present invention greatly improves the optimal query efficiency, significantly reduces the query processing overhead, and accelerates the query speed by establishing the optimal matching table, marking the optimal variable, adjusting the query statement sequence, dynamically correcting the query result, and other techniques.

Specific examples of queries are listed below. As shown in Table one, the main query portion of the statement requires querying all entities in the RDF dataset that are of the identity type "software engineer" and querying the entities of their native school. The first OPTIONAL query team asked to query the entities of his Master school and asked that the Hospital school is a double university and that the address of the Master school was in Shanghai. The second option query group asks for entities in his doctor's school. The RDF data set is shown in FIG. 1, and FIG. 1 is converted into a table form, shown in Table two.

Table one query SPARQL description

Tabular form of the Table two RDF dataset

Table three final query results

？X	？B	？M	？D
				Engineer armor	FUDAN University	Shanghai Jiaotong University	-
Engineer B	FUDAN University	-	FUDAN University
				Engineer C	FUDAN University	-	-
Engineer D	Shanghai Jiaotong University	-	Peking University
				Engineer Wu	Shanghai Jiaotong University	-	-

As shown in table three, after the main query is executed, we will get the main query result, two columns of five rows (engineer a-e) (. Then execute the first OPTIONAL query set, it can be seen that Eethanbutyng all has data for Master school, however the second statement in the query set limits that the subject school should be a double university, so the two lines of Butyleng results are marked as not matching and their OPTIONAL variable is dynamically modified to null. The third statement requires that the address of the master school should be Shanghai, so the line with B is marked as a mismatch and its OPTIONAL variable is dynamically modified to null. The intermediate result becomes three columns of five rows after execution of the OPTIONAL query set, and the third column has only the value of A. Similarly, the results of the second OPTIONAL query group may be obtained, and the final results obtained.

Those skilled in the art will appreciate that, in addition to implementing the systems, apparatus, and various modules thereof provided by the present invention in purely computer readable program code, the same procedures can be implemented entirely by logically programming method steps such that the systems, apparatus, and various modules thereof are provided in the form of logic gates, switches, application specific integrated circuits, programmable logic controllers, embedded microcontrollers and the like. Therefore, the system, the device and the modules thereof provided by the present invention can be considered as a hardware component, and the modules included in the system, the device and the modules thereof for implementing various programs can also be considered as structures in the hardware component; modules for performing various functions may also be considered to be both software programs for performing the methods and structures within hardware components.

The foregoing description of specific embodiments of the present invention has been presented. It is to be understood that the present invention is not limited to the specific embodiments described above, and that various changes or modifications may be made by one skilled in the art within the scope of the appended claims without departing from the spirit of the invention. The embodiments and features of the embodiments of the present application may be combined with each other arbitrarily without conflict.