CN113468275B

Movatterモバイル変換

Info

Publication number: CN113468275B
Application number: CN202110858726.9A
Authority: CN
Inventors: 杨福星; 周明伟; 朱林浩; 俞毅; 沈秋军; 何林强
Original assignee: Zhejiang Dahua Technology Co Ltd
Current assignee: Zhejiang Dahua Technology Co Ltd
Priority date: 2021-07-28
Filing date: 2021-07-28
Publication date: 2024-07-30
Anticipated expiration: 2041-07-28
Also published as: CN113468275A

Abstract

The invention discloses a data importing method and device of a graph database, a storage medium and electronic equipment. The method comprises the following steps: determining a data file to be uploaded, wherein the data file comprises: a data mapping file of point data and edge data corresponding to the graph data; processing the data mapping file through a master node to obtain a data block distribution list corresponding to the data mapping file and a slave node; the master node distributes the data blocks to be processed corresponding to the slave nodes through the data block distribution list, and distributes processing tasks corresponding to the data blocks; and under the condition that the slave node completes the distributed concurrency processing on the data block, the master node determines boundary data to be imported in the graph database according to the processing result of the distributed concurrency processing. The problems that the performance of a graph database is bottleneck and the importing efficiency of edge data is low when large data volume graph data is imported in the prior art are solved.

Description

Data importing method and device of graph database, storage medium and electronic equipment

Technical Field

The present invention relates to the field of communications, and in particular, to a method and apparatus for importing data into a graph database, a storage medium, and an electronic device.

Background

With the advent of the age of interconnection of big data and everything, data has emerged as a trend of eruption, and various complex relationships exist between data. The traditional relational database processes the data of the complex relations to present a phenomenon of insufficient performance, and the requirements of market users cannot be well met. The graph database is generated based on graph theory, and naturally supports the rapid processing of complex relationships. However, it becomes critical how to import a large amount of complex data of point-side association into the graph database.

In order to solve the above problems, in the prior art, when large data amount of graph data is imported, the performance of the graph database is a bottleneck, the importing efficiency of the edge data is low, and no effective solution has been proposed at present.

Disclosure of Invention

The embodiment of the invention provides a data importing method and device of a graph database, a storage medium and electronic equipment, which at least solve the problems that the performance of the graph database is bottleneck and the importing efficiency of edge data is low when large data amount of graph data is imported in the prior art.

According to an aspect of the embodiment of the present invention, there is provided a data importing method of a graph database, including: determining a data file to be uploaded, wherein the data file comprises: a data mapping file of point data and edge data corresponding to the graph data; processing the data mapping file through a master node to obtain a data block distribution list corresponding to the data mapping file and a slave node; the master node distributes the data blocks to be processed corresponding to the slave nodes through the data block distribution list, and distributes processing tasks corresponding to the data blocks; and under the condition that the slave node completes the distributed concurrency processing on the data block, the master node determines boundary data imported by the graph database according to a processing result of the distributed concurrency processing.

In an exemplary embodiment, the determining, by the master node, boundary data imported from the graph database according to a processing result of the distributed concurrency processing includes: the master node obtains boundary offset in the processing result of the data block on each slave node; and summarizing the boundary offset through a preset algorithm to obtain the imported boundary data corresponding to the boundary offset.

In an exemplary embodiment, the distributed concurrent processing of the data blocks by the slave node includes: under the condition that the data mapping file is determined to be the data mapping file corresponding to the point data, the graph database returns the vertex ID to the slave node so as to indicate the slave node to write the vertex ID in a local cache, wherein the slave node is used for importing the point data and/or the edge data of the data file corresponding to the graph data to be uploaded; and under the condition that the data mapping file is determined to be the data mapping file corresponding to the edge data, determining whether the vertex IDs at the two ends corresponding to the edge data exist in the local memory of the slave node or not so as to determine the confirmation mode of the edge data.

In an exemplary embodiment, determining whether the two end vertex IDs corresponding to the edge data exist in the local memory of the slave node to determine the confirmation mode of the edge data includes: under the condition that vertex IDs at two ends corresponding to the edge data do not exist in the local memory of the slave node, carrying out distributed query on the vertex IDs from other slave nodes corresponding to the master node; and under the condition that vertex IDs at two ends corresponding to the edge data exist in the local memory of the slave node, directly writing the edge data according to the vertex IDs queried in the cache of the local memory of the slave node.

In an exemplary embodiment, the distributed querying of the vertex IDs from other slave nodes corresponding to the master node includes: the master node obtains information of vertex IDs cached on the other slave nodes through a remote calling framework to inquire; and when the vertex ID corresponding to the edge data is found out according to the query result corresponding to the query, the vertex IDs on the other slave nodes are used for indicating the writing of the edge data.

In an exemplary embodiment, after the master node determines the boundary data imported by the graph database according to the processing result of the distributed concurrency processing in the case that the slave node completes the distributed concurrency processing on the data block, the method further includes: the master node gathers the offset and the boundary information of the data block determined on each slave node through a remote calling framework, and sorts the boundary information; acquiring complete boundary data according to the offset and the boundary information; and notifying the slave node to import the complete boundary data under the condition that the master node has acquired the complete boundary data.

In an exemplary embodiment, after determining the data file to be uploaded, the method further comprises: determining whether map title corresponding to the data file to be uploaded exists in a graph database, wherein the data file further comprises: map metadata files corresponding to the map data, wherein map title is used for indicating names of data files corresponding to the map data; and under the condition that the graph name does not exist in the graph database, creating a new graph data file in the graph database according to the graph name and the graph metadata file, and loading a target graph corresponding to the new graph data file.

According to another aspect of the embodiment of the present invention, there is also provided a data importing method and apparatus for a graph database, including: the determining module is configured to determine a data file to be uploaded, where the data file includes: a data mapping file of point data and edge data corresponding to the graph data; the processing module is used for processing the data mapping file through the master node to obtain a data block distribution list corresponding to the data mapping file and the slave node; the distribution module is used for distributing the data blocks to be processed corresponding to the slave nodes through the data block distribution list and distributing the processing tasks corresponding to the data blocks; and the importing module is used for determining boundary data imported by the graph database according to a processing result of the distributed concurrency processing under the condition that the slave node completes the distributed concurrency processing on the data block.

In an exemplary embodiment, the importing module is further configured to obtain, by the master node, a boundary offset in a processing result of the data block on each of the slave nodes; and summarizing the boundary offset through a preset algorithm to obtain the imported boundary data corresponding to the boundary offset.

In an exemplary embodiment, the importing module is further configured to, when determining that the data mapping file is a data mapping file corresponding to point data, send, by the graph database, a vertex ID to the slave node, so as to instruct the slave node to write the vertex ID in a local cache, where the slave node is configured to import point data and/or edge data of the data file corresponding to the graph data to be uploaded; and under the condition that the data mapping file is determined to be the data mapping file corresponding to the edge data, determining whether the vertex IDs at the two ends corresponding to the edge data exist in the local memory of the slave node or not so as to determine the confirmation mode of the edge data.

In an exemplary embodiment, the importing module is further configured to, when vertex IDs at two ends corresponding to the edge data do not exist in a local memory of the slave node, perform distributed query on the vertex IDs from other slave nodes corresponding to the master node; and under the condition that vertex IDs at two ends corresponding to the edge data exist in the local memory of the slave node, directly writing the edge data according to the vertex IDs queried in the cache of the local memory of the slave node.

In an exemplary embodiment, the above importing module is further configured to obtain, by the master node, information of vertex IDs cached on the other slave nodes through a remote invocation framework for querying; and when the vertex ID corresponding to the edge data is found out according to the query result corresponding to the query, the vertex IDs on the other slave nodes are used for indicating the writing of the edge data.

In an exemplary embodiment, the importing module is further configured to aggregate, by the master node, the offset and the boundary information of the data block determined on each of the slave nodes through a remote call framework, and sort the boundary information; acquiring complete boundary data according to the offset and the boundary information; and notifying the slave node to import the complete boundary data under the condition that the master node has acquired the complete boundary data.

In an exemplary embodiment, the above apparatus further includes: the creating module is configured to determine whether map title corresponding to a data file to be uploaded exists in the graph database, where the data file further includes: map metadata files corresponding to the map data, wherein map title is used for indicating names of data files corresponding to the map data; and under the condition that the graph name does not exist in the graph database, creating a new graph data file in the graph database according to the graph name and the graph metadata file, and loading a target graph corresponding to the new graph data file.

According to a further aspect of embodiments of the present invention, there is also provided a computer readable storage medium having a computer program stored therein, wherein the computer program is arranged to perform the method of any of the method embodiments when run.

According to a further aspect of the embodiments of the present invention, there is also provided an electronic device comprising a memory in which a computer program is stored, and a processor arranged to perform the method of any of the method embodiments described above by means of the computer program.

In the embodiment of the invention, a data file to be uploaded is determined, wherein the data file comprises: a data mapping file of point data and edge data corresponding to the graph data; processing the data mapping file through the master node to obtain a data block distribution list corresponding to the data mapping file and the slave node; the master node distributes the data blocks to be processed corresponding to the slave nodes through the data block distribution list, and distributes processing tasks corresponding to the data blocks; and under the condition that the slave node completes the distributed concurrency processing on the data block, the master node determines the boundary data imported by the graph database according to the processing result of the distributed concurrency processing. That is, the distributed import framework of the master-slave structure and the characteristic that each slave node processes the local block data block save IO overhead of the data reading network in the distributed scene, the vertex cache scheme when importing point data and the distributed vertex query scheme when importing side data avoid the cost of querying the database when importing side data, further improve the import efficiency of large-scale data files imported by the graph database, solve the problems that the performance of the graph database is bottleneck when importing large-data-volume graph data and the import efficiency of the side data is low in the prior art, and effectively improve the data importing capability of the graph database when importing large-data-volume.

Drawings

The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this specification, illustrate embodiments of the application and together with the description serve to explain the application and do not constitute a limitation on the application. In the drawings:

Fig. 1 is a block diagram of a hardware configuration of a computer terminal of a data import method of a graph database according to an embodiment of the present invention;

FIG. 2 is a schematic diagram of an application environment for data importation of a graph database according to an embodiment of the present invention;

FIG. 3 is a flow chart of a method of importing data from a graph database according to an embodiment of the present invention;

FIG. 4 is an interactive schematic diagram of a distributed concurrency import by calling tools according to an alternative embodiment of the present invention;

FIG. 5 is a flow chart of a vertex ID cache lookup in accordance with an alternative embodiment of the present invention;

fig. 6 is a schematic diagram of a data importing apparatus of a graph database according to an embodiment of the present invention.

Detailed Description

In order that those skilled in the art will better understand the present invention, a technical solution in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in which it is apparent that the described embodiments are only some embodiments of the present invention, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the present invention without making any inventive effort, shall fall within the scope of the present invention.

It should be noted that the terms "first," "second," and the like in the description and the claims of the present invention and the above figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments of the invention described herein may be implemented in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.

The method embodiments provided by the embodiments of the present application may be performed in a computer terminal, a mobile terminal, or similar computing device. Taking a computer terminal as an example, fig. 1 is a block diagram of a hardware structure of a computer terminal of a data importing method of a graph database according to an embodiment of the present application. As shown in fig. 1, the computer terminal 10 may include one or more (only one is shown in fig. 1) processors 102 (the processor 102 may include, but is not limited to, a microprocessor MCU or a processing device such as a programmable logic device FPGA) and a memory 104 for storing data, and optionally, a transmission device 106 for communication functions and an input-output device 108. It will be appreciated by those skilled in the art that the configuration shown in fig. 1 is merely illustrative and is not intended to limit the configuration of the computer terminal described above. For example, the computer terminal 10 may also include more or less components than those shown in FIG. 1, or have a different configuration than equivalent functions shown in FIG. 1 or more than those shown in FIG. 1.

The memory 104 may be used to store a computer program, for example, a software program of application software and a module, such as a computer program corresponding to a data importing method of a graph database in an embodiment of the present invention, and the processor 102 executes the computer program stored in the memory 104, thereby performing various functional applications and data processing, that is, implementing the above-mentioned method. Memory 104 may include high-speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, the memory 104 may further include memory located remotely from the processor 102, which may be connected to the computer terminal 10 via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.

The transmission means 106 is arranged to receive or transmit data via a network. The specific examples of the network described above may include a wireless network provided by a communication provider of the computer terminal 10. In one example, the transmission device 106 includes a network adapter (Network Interface Controller, simply referred to as a NIC) that can connect to other network devices through a base station to communicate with the internet. In one example, the transmission device 106 may be a Radio Frequency (RF) module, which is used to communicate with the internet wirelessly.

According to an aspect of the embodiment of the present invention, there is provided a data importing method of a graph database, and as an alternative implementation manner, the data importing method of the graph database may be applied to, but is not limited to, an environment as shown in fig. 2.

Alternatively, in the present embodiment, the above-mentioned terminal device may be a terminal device configured with a target client, and may include, but is not limited to, at least one of the following: video surveillance cameras, video surveillance gun cameras, inspection robots, mobile phones (such as Android Mobile phones, iOS Mobile phones, and the like), notebook computers, tablet computers, palm computers, MIDs (Mobile INTERNET DEVICES ), PADs, desktop computers, smart televisions, and the like. The target client may be a video client, an instant messaging client, a browser client, an educational client, and the like. The network may include, but is not limited to: a wired network, a wireless network, wherein the wired network comprises: local area networks, metropolitan area networks, and wide area networks, the wireless network comprising: bluetooth, WIFI, and other networks that enable wireless communications. The graph database may be a single server, a server cluster composed of a plurality of servers, or a cloud server. The above is merely an example, and is not limited in any way in the present embodiment.

Optionally, as an optional implementation manner, as shown in fig. 3, the data importing method of the graph database includes:

step S202, determining a data file to be uploaded, where the data file includes: a data mapping file of point data and edge data corresponding to the graph data;

Step S204, processing the data mapping file through a master node to obtain a data block distribution list corresponding to the data mapping file and a slave node;

Step S206, the master node distributes the data blocks to be processed corresponding to the slave nodes through the data block distribution list, and distributes processing tasks corresponding to the data blocks;

Step S208, when the slave node completes the distributed concurrency processing on the data block, the master node determines the boundary data imported by the graph database according to the processing result of the distributed concurrency processing.

Through the steps, the data file to be uploaded is determined, wherein the data file comprises: a data mapping file of point data and edge data corresponding to the graph data; processing the data mapping file through the master node to obtain a data block distribution list corresponding to the data mapping file and the slave node; the master node distributes the data blocks to be processed corresponding to the slave nodes through the data block distribution list, and distributes processing tasks corresponding to the data blocks; and under the condition that the slave node completes the distributed concurrency processing on the data block, the master node determines the boundary data imported by the graph database according to the processing result of the distributed concurrency processing. That is, the distributed import framework of the master-slave structure and the characteristic that each slave node processes the local block data block save IO overhead of the data reading network in the distributed scene, the vertex cache scheme when importing point data and the distributed vertex query scheme when importing side data avoid the cost of querying the database when importing side data, further improve the import efficiency of large-scale data files imported by the graph database, solve the problems that the performance of the graph database is bottleneck when importing large-data-volume graph data and the import efficiency of the side data is low in the prior art, and effectively improve the data importing capability of the graph database when importing large-data-volume.

It should be noted that, the master node is used for indicating a node in the graph database that can execute a writing operation, and the slave node is a node in the graph database that is used for storing data, when the master node executes a data storing operation on the latest graph data, all data changes will be sent to the slave node, and the slave node stores the graph data strictly in sequence after receiving the data changes.

In an exemplary embodiment, the main node determines boundary data imported by the graph database according to a processing result of the distributed concurrency processing, including: the master node obtains the boundary offset in the processing result of the data block on each slave node; and summarizing the boundary offset through a preset algorithm to obtain imported boundary data corresponding to the boundary offset.

In short, after determining the boundary offset of each slave node, in order to quickly determine the corresponding boundary data, the boundary offset of each slave node corresponding to the master node is summed up, so as to determine the boundary data corresponding to the boundary offset.

As an alternative embodiment, the overall flow of the present invention is as follows:

Determining whether map title corresponding to a data file to be uploaded to the graph database exists in the graph database, wherein the data file comprises: a map schema metadata file, a map corresponding point data and a data mapping file of edge data; creating a new graph and a graph schema corresponding to the loading graph in the graph database according to the graph name and the graph schema metadata file under the condition that the graph name does not exist in the graph database; based on the block distribution characteristics of the HDFS and the idea of locally processing local block data, a master node distributes the block blocks of the data file, a certain slave node is given the block blocks on the slave node, and a block data processing task is distributed to the slave node; processing the distributed block data blocks on the node from the node, caching the vertex IDs into a memory when the point data is imported, and directly acquiring the IDs of the vertices at two ends from the memories of other slave nodes when the edge data is imported, so that the database is prevented from being queried; the slave node gathers the boundary offset of the block to the master node, and the master node processes and imports the merged boundary data.

Further, the data mapping file is processed through the master node, a data block distribution list corresponding to the data mapping file and the slave node is determined according to the principle that the local node processes local block data, a data processing task is distributed, and the slave node processes data in a distributed and concurrent mode.

The vertex IDs are used to identify the foremost vertices of the left and right ends of the edge data, and each vertex ID is used to uniquely identify the vertex of the corresponding edge data.

In one exemplary embodiment, the distributed concurrent processing of data blocks by slave nodes includes: under the condition that the data mapping file is determined to be the data mapping file corresponding to the point data, the graph database returns the vertex ID to the slave node so as to indicate the slave node to write the vertex ID in the local cache, wherein the slave node is used for importing the point data and/or the edge data of the data file corresponding to the graph data to be uploaded; and under the condition that the data mapping file is determined to be the data mapping file corresponding to the edge data, determining whether the vertex IDs at the two ends corresponding to the edge data exist in the local memory of the slave node or not so as to determine the confirmation mode of the edge data.

It may be understood that, when the data mapping file is determined to be a data mapping file corresponding to point data, the slave node writes the point data into a graph database, and simultaneously needs to write a vertex ID returned by the graph database in a local cache, where the slave node is used to import the point data and/or edge data corresponding to the graph. And under the condition that the data mapping file is determined to be the data mapping file corresponding to the edge data, determining whether to perform distributed query on vertex IDs to other slave nodes according to whether the vertex IDs at two ends corresponding to the edge data exist in a local memory. If the vertex ID can be queried in the local cache of the slave node, the edge data is directly written, otherwise, the frame is called remotely to other slave nodes to query the vertex ID and then write the edge data.

In short, when the slave node processes the local point data block or the edge data block, the boundary data offset range can be calculated according to the line feed symbol of the data and the initial offset and the end offset of the block, the master node is used for determining the content of the boundary data by sorting the boundary data offset list according to the summary, and the slave node is used for determining the offset corresponding to the data block and the boundary information; the master node gathers the boundary information on each slave node through the remote calling frame, sorts the boundary information, acquires complete boundary data according to the offset and the boundary information, and informs one slave node to import the boundary data.

In order to better understand the technical solutions of the embodiments and the alternative embodiments of the present invention, the flow of the data importing method of the graph database is explained below with reference to examples, but the flow is not limited to the technical solutions of the embodiments of the present invention.

Since the edge is a relationship established between points, only the edge data cannot be imported in isolation when the edge data is imported, and the characteristic value of the associated point is acquired, in order to solve the problem, the problem of cache query of the vertex ID is involved. Therefore, the invention can be selectively implemented to solve the performance problem of the large-scale data import graph database and the problem that the vertex is required to be queried in the database when the edge data is imported.

As an optional implementation manner, an optional embodiment of the invention provides a distributed concurrency import method based on the block distribution characteristic of the HDFS and an RPC remote method calling tool, which solves the import performance problem when a large amount of data is needed, as shown in FIG. 4, which is an interactive schematic diagram of the distributed concurrency import performed by the calling tool of the optional embodiment of the invention.

Step 1, starting operation, accessing an import parameter, acquiring a corresponding file list Host list of a data file to be uploaded, and uploading an HDFS.

Step 2, transmitting the related file to the corresponding first slave node Follower1;

step 2.1, initializing idMap on the first slave node Follower according to the input mapping mapper file;

2.2, registering a slave remote call method, and starting a slave-server to wait for remote call;

Step 3, transmitting the related files to the master node importerLeader, and starting a master main program (namely a main process program);

step 4, the master node loads a map load schema according to the parameter entering map schema file;

Step 5, reading the point-edge integrated file, finding out the critical offset of the point edge, and outputting data block blocks information (covering offset and boundary information);

Optionally, starting vertex data processing;

step 6, the master node takes the information of the vertex file blocks as a parameter, and remotely calls the load of the slave node to load the local blocks of the slave node;

step 6.1, importing a vertex from the first slave node Follower1, and put longID to idMap, namely saving a vertex ID corresponding to the vertex in Idmap;

step 7, returning to the list of all the local blocks boundary data of the slave;

step 8, after the local blocks corresponding to the slave nodes are processed by all slave nodes and returned, the master node gathers all vertex block boundary data;

And 9, only remotely calling slave on the master machine to process all boundary data and importing the boundary data.

Optionally, starting edge data processing;

Step 10, determining the data to be processed as an edge file;

step 11, the master node takes vertex file blocks information as a parameter, and remotely calls load of slave nodes to load local blocks of slave nodes;

Step 12, firstly, detecting whether the local ID Map of the first slave node Follower has vertex IDs (long IDs) of left and right vertices, and directly importing edges; if any one exists, the record is stored;

step 13, if a long Id cannot be found in the local Id Map for the left or right vertex, then searching on other slave nodes is sequentially and remotely invoked, for example, searching is performed in the second slave node Follower;

Step 14, returning ARRAY LIST < Edge Record > containing Long Id, that is, determining the corresponding list return value first slave node Follower1 after finding the Edge data containing vertex Id in the second slave node Follower;

step 15, importing the edge data which does not find the long ID of the vertex locally before according to the edge data which is returned from the second slave node Follower and contains the ID of the vertex;

Step 16, the first slave node Follower1 returns all the blocks boundary data lists of the slave local;

Step 17, after the local blocks corresponding to the slave node are processed by all slave nodes and returned, the master node gathers all vertex block boundary data;

As an optional implementation manner, the optional embodiment of the invention also provides a vertex ID cache query scheme, solves the problem that a graph database needs to be queried when edge data is imported, and is particularly compatible with supporting input formats of point-edge separation data and point-edge integrated data. The method can effectively improve the data importing capability of the graph database when a large amount of data is available.

Optionally, fig. 5 is a flowchart of a vertex ID cache query according to an alternative embodiment of the present invention, including the following steps:

Step 1: and uploading the data file to an HDFS (Hadoop Distributed FILE SYSTEM, distributed file system, abbreviated as HDFS) corresponding to the graph database by executing a script program through the input map title, the graph schema file (equivalent to the graph metadata file), the data-mapper data mapping file of the point-edge mapping and the data file.

Step 2: and (3) detecting whether the picture name transmitted in the step (1) exists in the database, if not, creating a picture according to map title and loading a picture schema according to the schema file.

Step 3: the leader node sequentially processes the files according to the point-side mapping files, acquires a block data block distribution list of the files on the HDFS, and uniformly distributes local blocks on each of the follower nodes (equivalent to slave nodes in the implementation of the invention) so as to ensure that each of the follower nodes processes the local blocks.

Step 4: the leader (corresponding to a main node in the implementation of the present invention) informs each follower node to start concurrent processing of local blocks allocated to the follower node through an RPC (Remote Procedure Call, remote procedure call, abbreviated as RPC) remote call framework, calculates boundary data, sends the boundary data to the leader node through the RPC remote call, and the leader node imports the boundary data after receiving all the boundary data (points or edges).

Step 5: if the point file is processed, the following node (corresponding to the slave node in the implementation of the present invention) needs to write the vertex ID returned by the graph database (corresponding to the vertex ID in the implementation of the present invention) into the local memory when the point data is imported, so that the vertex ID in the distributed query cache is used when the edge data is imported in the subsequent step.

Step 6: if the edge file is processed, each follower node firstly queries the vertex id in the local cache, and if the vertex id cannot be found in the local cache, the frame is remotely called through the RPC to query other follower nodes remotely.

It should be noted that, the application scenarios of the above alternative embodiments mainly include: and storing an application scene of leading-in point edge graph data by using a graph database based on the HDFS component at the back end, and relating to a cache vertex ID scheme when leading-in edge data.

According to the embodiment, the distributed concurrency import is carried out based on the block distribution characteristics of the HDFS and the RPC remote method calling tool, so that each follower working node processes local blocks based on the distribution characteristics of the blocks of the HDFS, load balance is fully ensured, overall data import performance is quickened, the problem of edge import caused by the characteristics related to the edges of data points of a graph database is solved, when each follower working node is imported, vertex ids are cached in a memory through a vertex id caching scheme, when the vertex is reached, the memories of other followers are searched in a distributed mode, the import speed of edge data can be quickened, the problem that the graph database needs to be queried when the edge data is imported is solved, and the large-scale data import performance of the graph database is improved.

It should be noted that, for simplicity of description, the foregoing method embodiments are all described as a series of acts, but it should be understood by those skilled in the art that the present invention is not limited by the order of acts described, as some steps may be performed in other orders or concurrently in accordance with the present invention. Further, those skilled in the art will also appreciate that the embodiments described in the specification are all preferred embodiments, and that the acts and modules referred to are not necessarily required for the present invention.

From the description of the above embodiments, it will be clear to a person skilled in the art that the method according to the above embodiments may be implemented by means of software plus the necessary general hardware platform, but of course also by means of hardware, but in many cases the former is a preferred embodiment. Based on such understanding, the technical solution of the present invention may be embodied essentially or in a part contributing to the prior art in the form of a software product stored in a storage medium (e.g. ROM/RAM, magnetic disk, optical disk) comprising instructions for causing a terminal device (which may be a mobile phone, a computer, a server, or a network device, etc.) to perform the method according to the embodiments of the present invention.

According to another aspect of the embodiment of the present invention, there is also provided a data importing apparatus for implementing the data importing method of the graph database. As shown in fig. 6, the apparatus includes:

a determining module 62, configured to determine a data file to be uploaded, where the data file includes: a data mapping file of point data and edge data corresponding to the graph data;

the processing module 64 is configured to process the data mapping file through a master node, and obtain a data block distribution list corresponding to the data mapping file and a slave node;

The allocation module 66 is configured to allocate, through the data block distribution list, a data block to be processed corresponding to a slave node to the slave node, and allocate a processing task corresponding to the data block;

and the importing module 68 is configured to determine, by the master node, boundary data imported by the graph database according to a processing result of the distributed concurrency processing when the slave node completes the distributed concurrency processing on the data block.

By the device, the data file to be uploaded is determined, wherein the data file comprises: a data mapping file of point data and edge data corresponding to the graph data; processing the data mapping file through the master node to obtain a data block distribution list corresponding to the data mapping file and the slave node; the master node distributes the data blocks to be processed corresponding to the slave nodes through the data block distribution list, and distributes processing tasks corresponding to the data blocks; and under the condition that the slave node completes the distributed concurrency processing on the data block, the master node determines the boundary data imported by the graph database according to the processing result of the distributed concurrency processing. That is, the distributed import framework of the master-slave structure and the characteristic that each slave node processes the local block data block save IO overhead of the data reading network in the distributed scene, the vertex cache scheme when importing point data and the distributed vertex query scheme when importing side data avoid the cost of querying the database when importing side data, further improve the import efficiency of large-scale data files imported by the graph database, solve the problems that the performance of the graph database is bottleneck when importing large-data-volume graph data and the import efficiency of the side data is low in the prior art, and effectively improve the data importing capability of the graph database when importing large-data-volume.

In an exemplary embodiment, the importing module is configured to obtain, by a master node, a boundary offset in a processing result of a data block on each slave node; and summarizing the boundary offset through a preset algorithm to obtain imported boundary data corresponding to the boundary offset.

In the description of the present invention, it should be understood that the directions or positional relationships indicated by the terms "center", "upper", "lower", "front", "rear", "left", "right", etc. are based on the directions or positional relationships shown in the drawings, are merely for convenience of describing the present invention and simplifying the description, and do not indicate or imply that the apparatus or component to be referred to must have a specific orientation, be constructed and operated in a specific orientation, and thus should not be construed as limiting the present invention. Furthermore, the terms "first," "second," and the like, are used for descriptive purposes only and are not to be construed as indicating or implying relative importance.

In the description of the present invention, it should be noted that, unless explicitly specified and limited otherwise, the terms "mounted," "connected," and "connected" are to be construed broadly, and may be either fixedly connected, detachably connected, or integrally connected, for example; can be mechanically or electrically connected; can be directly connected or indirectly connected through an intermediate medium, and can be communicated with the inside of the two components. When an element is referred to as being "mounted" or "disposed" on another element, it can be directly on the other element or intervening elements may also be present. When an element is referred to as being "connected" to another element, it can be directly connected to the other element or intervening elements may also be present. It will be understood by those of ordinary skill in the art that the terms described above are in the specific sense of the present invention.

It should be noted that each of the above modules may be implemented by software or hardware, and for the latter, it may be implemented by, but not limited to: the modules are all located in the same processor; or the above modules may be located in different processors in any combination.

An embodiment of the invention also provides a storage medium having a computer program stored therein, wherein the computer program is arranged to perform the steps of any of the method embodiments described above when run.

Alternatively, in the present embodiment, the above-described storage medium may be configured to store a computer program for performing the steps of:

s1, determining a data file to be uploaded, wherein the data file comprises: a data mapping file of point data and edge data corresponding to the graph data;

s2, processing the data mapping file through a master node to obtain a data block distribution list corresponding to the data mapping file and a slave node;

S3, the master node distributes the data blocks to be processed corresponding to the slave nodes through the data block distribution list, and distributes processing tasks corresponding to the data blocks;

And S4, under the condition that the slave node completes the distributed concurrency processing on the data block, the master node determines boundary data imported by the graph database according to the processing result of the distributed concurrency processing.

Alternatively, in the present embodiment, the storage medium may include, but is not limited to: a usb disk, a Read-Only Memory (ROM), a random access Memory (Random Access Memory RAM), a removable hard disk, a magnetic disk, or an optical disk, or other various media capable of storing a computer program.

Alternatively, specific examples in this embodiment may refer to examples described in the foregoing embodiments and optional implementations, and this embodiment is not described herein.

An embodiment of the invention also provides an electronic device comprising a memory having stored therein a computer program and a processor arranged to run the computer program to perform the steps of any of the method embodiments described above.

Optionally, the electronic device may further include a transmission device and an input/output device, where the transmission device is connected to the processor, and the input/output device is connected to the processor.

Alternatively, in the present embodiment, the above-described processor may be configured to execute the following steps by a computer program:

Alternatively, in this embodiment, it will be understood by those skilled in the art that all or part of the steps in the methods of the above embodiments may be performed by a program for instructing a terminal device to execute the steps, where the program may be stored in a computer readable storage medium, and the storage medium may include: flash disk, read-Only Memory (ROM), random-access Memory (Random Access Memory, RAM), magnetic disk or optical disk, etc.

The foregoing embodiment numbers of the present invention are merely for the purpose of description, and do not represent the advantages or disadvantages of the embodiments.

The integrated units in the above embodiments may be stored in the above-described computer-readable storage medium if implemented in the form of software functional units and sold or used as separate products. Based on such understanding, the technical solution of the present invention may be embodied in essence or a part contributing to the prior art or all or part of the technical solution in the form of a software product stored in a storage medium, comprising several instructions for causing one or more computer devices (which may be personal computers, servers or network devices, etc.) to perform all or part of the steps of the method described in the embodiments of the present invention.

In the foregoing embodiments of the present invention, the descriptions of the embodiments are emphasized, and for a portion of this disclosure that is not described in detail in this embodiment, reference is made to the related descriptions of other embodiments.

In several embodiments provided by the present application, it should be understood that the disclosed client may be implemented in other manners. The above-described embodiments of the apparatus are merely exemplary, and the division of the units, such as the division of the units, is merely a logical function division, and may be implemented in another manner, for example, multiple units or components may be combined or may be integrated into another system, or some features may be omitted, or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be through some interfaces, units or modules, or may be in electrical or other forms.

The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.

In addition, each functional unit in the embodiments of the present invention may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.

The foregoing is merely a preferred embodiment of the present invention and it should be noted that modifications and adaptations to those skilled in the art may be made without departing from the principles of the present invention, which are intended to be comprehended within the scope of the present invention.

Claims

1. A data importing method of a graph database, comprising:

Determining a data file to be uploaded, wherein the data file comprises: a data mapping file of point data and edge data corresponding to the graph data;

processing the data mapping file through a master node to obtain a data block distribution list corresponding to the data mapping file and a slave node;

The master node distributes the data blocks to be processed corresponding to the slave nodes through the data block distribution list, and distributes processing tasks corresponding to the data blocks;

Under the condition that the slave node completes distributed concurrency processing on the data block, the master node determines boundary data imported by a graph database according to a processing result of the distributed concurrency processing;

The master node is a node for executing writing operation in the graph database, and the slave node is a node for storing data in the graph database.

2. The method according to claim 1, wherein the determining, by the master node, boundary data imported from the graph database according to a processing result of the distributed concurrency processing includes:

the master node obtains boundary offset in the processing result of the data block on each slave node;

And summarizing the boundary offset through a preset algorithm to obtain the imported boundary data corresponding to the boundary offset.

3. The method of claim 1, wherein the distributed concurrent processing of the data blocks by the slave node comprises:

Under the condition that the data mapping file is determined to be the data mapping file corresponding to the point data, the graph database returns the vertex ID to the slave node so as to indicate the slave node to write the vertex ID in a local cache, wherein the slave node is used for importing the point data and/or the edge data of the data file corresponding to the graph data to be uploaded;

And under the condition that the data mapping file is determined to be the data mapping file corresponding to the edge data, determining whether the vertex IDs at the two ends corresponding to the edge data exist in the local memory of the slave node or not so as to determine the confirmation mode of the edge data.

4. The method of claim 3, wherein determining whether the two-end vertex IDs corresponding to the edge data exist in the local memory of the slave node to determine the confirmation manner of the edge data comprises:

under the condition that vertex IDs at two ends corresponding to the edge data do not exist in the local memory of the slave node, carrying out distributed query on the vertex IDs from other slave nodes corresponding to the master node;

And under the condition that vertex IDs at two ends corresponding to the edge data exist in the local memory of the slave node, directly writing the edge data according to the vertex IDs queried in the cache of the local memory of the slave node.

5. The method of claim 4, wherein the distributed querying of the vertex IDs from other slave nodes corresponding to the master node comprises:

the master node obtains information of vertex IDs cached on the other slave nodes through a remote calling framework to inquire;

and when the vertex ID corresponding to the edge data is found out according to the query result corresponding to the query, the vertex IDs on the other slave nodes are used for indicating the writing of the edge data.

6. The method according to claim 1, wherein in the case that the slave node performs distributed concurrency processing on the data block, after the master node determines the boundary data imported from the graph database according to the processing result of the distributed concurrency processing, the method further comprises:

The master node gathers the offset and the boundary information of the data block determined on each slave node through a remote calling framework, and sorts the boundary information;

Acquiring complete boundary data according to the offset and the boundary information;

and notifying the slave node to import the complete boundary data under the condition that the master node has acquired the complete boundary data.

7. The method of claim 1, wherein after determining the data file to be uploaded, the method further comprises:

Determining whether map title corresponding to the data file to be uploaded exists in a graph database, wherein the data file further comprises: map metadata files corresponding to the map data, wherein map title is used for indicating names of data files corresponding to the map data;

and under the condition that the graph name does not exist in the graph database, creating a new graph data file in the graph database according to the graph name and the graph metadata file, and loading a target graph corresponding to the new graph data file.

8. A data importing apparatus of a graph database, comprising:

the determining module is configured to determine a data file to be uploaded, where the data file includes: a data mapping file of point data and edge data corresponding to the graph data;

The processing module is used for processing the data mapping file through the master node to obtain a data block distribution list corresponding to the data mapping file and the slave node;

the distribution module is used for distributing the data blocks to be processed corresponding to the slave nodes through the data block distribution list and distributing the processing tasks corresponding to the data blocks;

the importing module is used for determining boundary data imported by the graph database according to a processing result of the distributed concurrency processing when the slave node completes the distributed concurrency processing on the data block;

9. A computer-readable storage medium, characterized in that the computer-readable storage medium comprises a stored program, wherein the program, when run, performs the method of any one of claims 1 to 7.

10. An electronic device comprising a memory and a processor, characterized in that the memory has stored therein a computer program, the processor being arranged to execute the method according to any of the claims 1 to 7 by means of the computer program.