Multithreading loopback enumeration algorithm based on NumPyTechnical Field
The invention relates to the technical field of graph data processing, in particular to a multithreading loopback enumeration algorithm based on NumPy.
Background
The loop enumeration problem for graph data is a common problem in the field of graph analysis, and is often used in graph analysis scenarios such as reconnaissance and criminal investigation. In most real-world scenarios, the length of the ring to be enumerated is limited, and specifying an upper limit on the length of the enumerated ring can greatly mitigate computational costs. The existing loop enumeration algorithm is calculated based on the data of the structural graph and has the defects of large memory consumption and long calculation time.
Disclosure of Invention
The purpose of the invention is as follows: aiming at the problems in the background art, the invention provides a multithreading loopback enumeration algorithm based on NumPy, which improves the efficiency of loop calculation through a multithreading technology and a NumPy library and provides an efficient solution method for loop calculation under different scenes
The technical scheme is as follows: in order to achieve the purpose, the invention adopts the technical scheme that:
a multithreading loopback enumeration algorithm based on NumPy comprises the following steps:
step S1, preprocessing data; processing multi-source heterogeneous data into graph data of a point set and an edge set meeting requirements through an ETL process, and further constructing an adjacency matrix based on the graph data;
step S2, calculating a path index by the main thread; the main thread executes n iterations, respectively calculates path indexes of all rings with the lengths of 1-n, generates a path index of the current iteration relative to the previous iteration through each iteration calculation, and sends the path index to the auxiliary thread for expansion; after the iteration is finished, the main thread receives all the paths of the rings sent by the auxiliary thread;
step S3, the assistant thread receives the path index sent by the main thread, and combines the stored path between the nodes and the path index to obtain the path of the ring; when the iteration is over, the auxiliary thread sends the paths of all the rings to the main thread.
Further, the constructing of the adjacency matrix in step S1 specifically includes:
acquiring an adjacent matrix of nodes from the graph data according to the input graph data and the maximum length of a ring to be calculated, constructing the adjacent matrix into a Boolean NumPy matrix, representing that the nodes are connected by adopting True, and representing that the nodes are not connected by adopting False; constructing a path dictionary with the distance of 1 hop between nodes from graph data, wherein keys are the ids of a starting node and a terminating node of an edge, and the value is the id of the edge; taking the maximum length of the ring to be calculated as the total iteration number N; the generated path dictionary is used as an initial parameter to start the auxiliary thread.
Further, the step S2 of calculating the path index by the main thread specifically includes:
s2.1, calculating a dot product of an adjacent matrix of i-1 hop and an initial adjacent matrix in the last iteration turn based on a dot operator of NumPy, and taking the dot product as the adjacent matrix of the ith hop;
s2.2, acquiring the position of a non-zero element in the row and column of the ith hop of adjacent matrix by using a nonzero method of NumPy; expressing the position in a row vector and column vector mode, acquiring row data of an adjacent matrix of i-1 jump by taking the row vector as a coordinate, and acquiring initial column data of the adjacent matrix by taking the column vector as a column coordinate;
s2.3, searching the position where the row data and the column data are equal through the AND operation and the where operator of the NumPy; obtaining a path index of the current turn according to the position where the row data and the column data are equal, and transmitting the current iteration turn i and the path index to an auxiliary thread for expansion; the diagonal element in the result of the dot product in step S2.1 is reset to 0.
Further, the iterative process executed by the assist thread in step S3 specifically includes:
taking an edge in graph data as an initial value of an inter-node path stored by an auxiliary thread, acquiring a path index obtained by each iteration and the total number of times of iteration by the auxiliary thread from a main thread, acquiring a path value between nodes of the current iteration according to the path and the index between the nodes stored by the auxiliary thread, selecting all loop paths obtained in the current iteration from all the inter-node path values, and updating the node path stored by the auxiliary thread into the inter-node path value obtained by the current iteration; and after the iteration is finished, sending the paths of all the rings to the main thread, and finishing the execution of the auxiliary thread.
Further, the auxiliary thread obtains a message queue of a transmission path index and a message queue of a transmission loop path result from the main thread;
firstly, the auxiliary thread obtains an initial path index from a message queue of the path index as an expansion result of the path index of the step 0 iteration; in a specific iteration process, the auxiliary thread obtains a current iteration turn and a path index from a message queue of the path index; the path index refers to the path index in the last iteration;
expanding the path index of the iteration according to the path index of the previous iteration, wherein the key of the path index represents the initial node and the final node of the path, and the value of the path index with the same initial node and final node is used as the path of the ring and is added into a list representing all the paths of the ring; updating a variable representing the path index of the previous step according to the path expansion result;
and when the current iteration id is more than or equal to the maximum iteration times, transmitting the list representing all loop paths to the main thread through the message queue for transmitting the loop path result, and finishing the loop.
Further, the ETL in the step S1 refers to obtaining satisfactory data from the multi-source heterogeneous data through extraction, conversion, and loading.
Has the advantages that:
the multithreading loopback enumeration algorithm based on the NumPy is provided, discovery and expansion of a loopback are decoupled by means of multithreading technology, and execution efficiency of the algorithm is further improved by means of efficient operation of a NumPy library. In order to deal with data differences of different scenes, a link of multi-source data preprocessing is designed, and the general type of the algorithm is further improved.
Drawings
FIG. 1 is a flow chart of the NumPy-based multi-threaded loopback enumeration algorithm of the present invention;
FIG. 2 is a flow chart of main thread and assist thread execution in an embodiment of the invention.
Detailed Description
The present invention will be further described with reference to the accompanying drawings.
The multithreading loopback enumeration algorithm based on the NumPy is shown in figure 1 and specifically comprises the following steps:
step S1, preprocessing data; and processing the multi-source heterogeneous data into graph data of a point set and an edge set meeting requirements through an ETL process, and further constructing an adjacency matrix based on the graph data. The ETL process here refers to obtaining satisfactory data from multi-source heterogeneous data through various ways such as extraction, conversion, and loading. The finally constructed adjacency matrix is a NumPy matrix.
Acquiring an adjacent matrix of nodes from the graph data according to the input graph data and the maximum length of a ring to be calculated, constructing the adjacent matrix into a Boolean NumPy matrix, representing that the nodes are connected by adopting True, and representing that the nodes are not connected by adopting False; constructing a path dictionary with the distance of 1 hop between nodes from graph data, wherein keys are the ids of a starting node and a terminating node of an edge, and the value is the id of the edge; taking the maximum length of the ring to be calculated as the total iteration number N; the generated path dictionary is used as an initial parameter to start the auxiliary thread.
Step S2, calculating a path index by the main thread; the main thread executes n iterations, respectively calculates path indexes of all rings with the lengths of 1-n, generates a path index of the current iteration relative to the previous iteration through each iteration calculation, and sends the path index to the auxiliary thread for expansion; and after the iteration is finished, the main thread receives all the paths of the rings sent by the auxiliary thread. In particular, the amount of the solvent to be used,
and S2.1, calculating the dot product of the initial adjacency matrix and the adjacency matrix of the i-1 hop in the last iteration turn by using a dot operator based on the NumPy, and taking the dot product as the adjacency matrix of the i-th hop.
And calculating the i power of the adjacent matrix of the graph by using the i-1 power result of the last iteration and the adjacent matrix of the graph to perform dot product, wherein all the matrixes are stored by adopting a Boolean NumPy matrix, and the result is recorded as matrix _ i.
S2.2, acquiring the position of a nonzero element in matrix _ i by using a nonzero method of NumPy; its position is recorded with a row vector rows and a column vector columns. Extracting rows in matrix _ i according to row vectors rows to generate new row vectors new _ rows, extracting columns of an original image adjacent matrix according to column vectors, and generating new column vectors new _ columns; splitting a row vector new _ rows by using unique, cumsum and split operators of NumPy and performing grouping perfect operation on a column vector new _ columns, wherein the split row vector is new _ rows _ unique, and the grouped column vector is new _ columns _ group;
step S2.3, generating a new path index according to the new row vector and the new column vector, wherein the specific method comprises the following steps: taking elements of new _ rows _ unique and new _ columns _ group, packing the elements into tuples (row _ index, columns _ list) one by one, and generating new path indexes by taking (row [ row _ index ], columns [ row _ index ]) as keys, (row [ row _ index ], columns _ list, columns [ row _ index ]) as values, wherein (row [ row _ index ], column _ list) represents (row number, column number list) of the previous path dictionary, (column _ list, columns [ row _ index ]) represents (row number list, column number) of the initial path dictionary; initiating and sending the current round and the path index generated in the last step to an auxiliary thread; all diagonal elements of matrix _ i are set to 0, and the next round of the adjacency matrix dot product matrix _ (i-1) is updated with matrix _ i.
Step S3, the assistant thread receives the path index sent by the main thread, and combines the stored path between the nodes and the path index to obtain the path of the ring; when the iteration is over, the auxiliary thread sends the paths of all the rings to the main thread.
Taking an edge in graph data as an initial value of an inter-node path stored by an auxiliary thread, acquiring a path index obtained by each iteration and the total number of times of iteration by the auxiliary thread from a main thread, acquiring a path value between nodes of the current iteration according to the path and the index between the nodes stored by the auxiliary thread, selecting all loop paths obtained in the current iteration from all the inter-node path values, and updating the node path stored by the auxiliary thread into the inter-node path value obtained by the current iteration; and after the iteration is finished, sending the paths of all the rings to the main thread, and finishing the execution of the auxiliary thread. In particular, the amount of the solvent to be used,
the helper thread accepts several parameters as follows: the total number of iterations N, the queue path _ message _ queue of the transmission path index and the queue cycles _ queue of all rings transmitted. The auxiliary thread acquires an initial path dictionary from the path _ message _ queue as a path index of the step 0 iteration. And in the process of each iteration, the auxiliary thread acquires the updated path index of the main thread from the queue, expands the current path index by combining with the path dictionary of the previous iteration round, acquires the path dictionary of the current round, extracts all rings from the path dictionary, and adds the rings into a list representing all ring paths. The iteration ends, sending the results of all loops back to the main thread.
Further, obtaining the id of the current iteration and the path index of the current iteration from the path _ message _ queue; the path index is a variable of a dictionary type, represents an index of a path dictionary of the previous iteration, and expands the value of the path dictionary according to the variable, wherein the expanding method comprises the steps of obtaining a path list from the path dictionary of the previous iteration according to (rows [ row _ index ], column _ list ], obtaining the path list from an initial path dictionary according to (column _ list, columns [ row _ index ]), carrying out Cartesian product on the two lists, and storing the result as a new value in the path index of the current iteration round; assigning a variable of a path dictionary representing the previous round as an updated path index, acquiring all loop paths from the updated path index, and storing the loop paths into a list representing all loop paths; and judging whether the obtained id of the current round is larger than the total iteration number N, and if so, sending a list representing all the loop paths back to the main thread.
The above description is only of the preferred embodiments of the present invention, and it should be noted that: it will be apparent to those skilled in the art that various modifications and adaptations can be made without departing from the principles of the invention and these are intended to be within the scope of the invention.