Disclosure of Invention
The invention claims a data synchronization method based on a distributed file system. Overall, the method adopts a dual-server working mode in which a physical server and a virtual server communicate interactively and data interaction is verified synchronously. Internally, it separates the memory data cluster from the metadata cluster, and each server performs its own tasks, thereby achieving the technical effects of timely data synchronization and accurate updating.
A data synchronization method based on a distributed file system, characterized by comprising:
monitoring the working states of a physical server and a virtual server while a data server, in a dual-server working mode, carries out communication interaction with the physical server and the virtual server simultaneously;
wherein the communication interaction comprises: performing signal interaction with the physical server, and performing data interaction with the physical server and the virtual server simultaneously;
establishing a database-based memory cluster and a database-based metadata cluster, storing the memory data of the distributed file system in a distributed database in the memory cluster, and simultaneously storing the metadata of the distributed database in the metadata cluster for processing operations;
if the data server determines that the physical server fails and the virtual server works normally, the data server sends a single-server working mode switching instruction to the virtual server;
automatically connecting the physical server to the virtual server, performing hierarchical resource management according to the virtual machine resources defined by the application logic architecture, computing an intelligent allocation of the resources, and dynamically expanding the resources online;
the data server receives a first confirmation response returned by the virtual server, and continues to perform the signal interaction and the data interaction with the virtual server in a single-server working mode;
and creating, by the data node of the virtual server, a temporary data block, writing data into the temporary data block according to the received signal difference, and replacing the file content in the original data node with the content of the temporary data block to realize data synchronization.
According to the invention, a dual-server working mode is adopted in which the physical server and the virtual server communicate interactively, each server takes charge of its own role, and the system switches to a single-server working mode when necessary, so that the operating efficiency of the system is effectively ensured. Meanwhile, the content to be synchronized is divided between the database-based memory cluster and metadata cluster and handled separately, so that a reasonable distribution of data synchronization resources and the accuracy of data synchronization are effectively guaranteed.
Detailed Description
The invention first claims a data synchronization method based on a distributed file system. Referring to Figure 1, which is a work flow chart of the method, the method is characterized by comprising:
monitoring the working states of a physical server and a virtual server while a data server, in a dual-server working mode, carries out communication interaction with the physical server and the virtual server simultaneously, wherein the communication interaction comprises: performing signal interaction with the physical server, and performing data interaction with the physical server and the virtual server simultaneously;
establishing a database-based memory cluster and a database-based metadata cluster, storing the memory data of the distributed file system in a distributed database in the memory cluster, and simultaneously storing the metadata of the distributed database in the metadata cluster for processing operations;
if the data server determines that the physical server fails and the virtual server works normally, the data server sends a single-server working mode switching instruction to the virtual server;
automatically connecting the physical server to the virtual server, performing hierarchical resource management according to the virtual machine resources defined by the application logic architecture, computing an intelligent allocation of the resources, and dynamically expanding the resources online;
the data server receives a first confirmation response returned by the virtual server, and continues to perform the signal interaction and the data interaction with the virtual server in a single-server working mode;
and creating, by the data node of the virtual server, a temporary data block, writing data into the temporary data block according to the received signal difference, and replacing the file content in the original data node with the content of the temporary data block to realize data synchronization.
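By way of non-limiting illustration only, the mode switching and the temporary-data-block replacement described in the steps above may be sketched in Python as follows. The names `Mode`, `DataServer`, `on_health_report` and `sync_block`, the callback used to deliver the switching instruction, and the representation of the signal difference as a list of (offset, bytes) patches are all assumptions introduced for this sketch; the specification does not prescribe them.

```python
from enum import Enum


class Mode(Enum):
    DUAL = "dual-server"
    SINGLE = "single-server"


class DataServer:
    """Hypothetical holder of the working mode of the data server."""

    def __init__(self):
        self.mode = Mode.DUAL
        self.signal_peer = "physical"  # signal interaction starts with the physical server

    def on_health_report(self, physical_ok, virtual_ok, send_switch_instruction):
        """Switch to single-server mode when only the physical server has failed.

        send_switch_instruction is a caller-supplied callable (an assumption)
        that delivers the switching instruction to the virtual server and
        returns True once the first confirmation response has been received.
        """
        if self.mode is Mode.DUAL and not physical_ok and virtual_ok:
            if send_switch_instruction():
                self.mode = Mode.SINGLE
                self.signal_peer = "virtual"  # signal interaction continues with the virtual server
        return self.mode


def sync_block(node_files, name, diff):
    """Apply a difference to a temporary block, then replace the original content.

    node_files: dict mapping file name -> bytes, standing in for a data node.
    diff: list of (offset, bytes) patches, an assumed form of the signal difference.
    """
    temp = bytearray(node_files.get(name, b""))  # temporary data block
    for offset, chunk in diff:
        if len(temp) < offset:
            temp.extend(b"\x00" * (offset - len(temp)))  # pad any gap with zero bytes
        temp[offset:offset + len(chunk)] = chunk
    node_files[name] = bytes(temp)  # original content is replaced only after all writes succeed
    return node_files[name]
```

In this sketch the original file content is left untouched until every patch has been written to the temporary block, so a failure part-way through does not leave the data node with a half-updated file.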
Preferably, the process of carrying out communication interaction with the physical server and the virtual server simultaneously further comprises:
establishing a heartbeat connection with the virtual server, so that the virtual server can learn the current state of the connected server, including its status, load, updates and other conditions; meanwhile, waiting for task requests sent by the virtual server and, after a client connection request sent by the virtual server is received, establishing a connection with the client, monitoring the operation requests of the client, and responding promptly;
after the current physical server submits an update from the client, the data cache server synchronizes the update to the physical server and, once the update is complete, submits an update message to the virtual server, which then directs the other data cache servers to synchronize with the physical server.
Synchronizing small files generally involves two steps: obtaining, from the metadata node, the position of the data node where the small file index is located, and obtaining the stream of the data file. The output stream of the data file is obtained when writing, and the input stream of the data file is obtained when reading. When small files are operated on, if consecutively accessed small files belong to the same directory, this information is the same for all of them. The system therefore caches at the client the index position information corresponding to a directory, together with the input and output streams of the data file under that directory, which reduces interaction with the metadata node and improves file access speed. In addition, the metadata node does not need to modify the update flag frequently; an update is carried out only when the update flag changes. By default the client caches 20 index position entries, and the user can change this number. The cache uses the LRU eviction policy by default, and no lock mechanism is employed for the cache because the client is single-threaded.
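As a minimal sketch of the client-side cache described above, assuming the cache is keyed by directory path, an LRU cache with a default capacity of 20 and no locking may be written in Python as follows. The class name `DirectoryIndexCache` and the shape of the cached value are hypothetical.

```python
from collections import OrderedDict


class DirectoryIndexCache:
    """LRU cache of per-directory index positions and stream handles.

    No lock is taken, matching the single-threaded client described above.
    """

    def __init__(self, capacity=20):
        self.capacity = capacity
        self._entries = OrderedDict()  # oldest entry first, most recently used last

    def get(self, directory):
        if directory not in self._entries:
            return None  # cache miss: the caller must ask the metadata node
        self._entries.move_to_end(directory)  # mark as most recently used
        return self._entries[directory]

    def put(self, directory, info):
        self._entries[directory] = info
        self._entries.move_to_end(directory)
        if len(self._entries) > self.capacity:
            self._entries.popitem(last=False)  # evict the least recently used entry
```

A lookup that returns `None` corresponds to the case in which the client must contact the metadata node; every hit avoids one such interaction.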
Because the client caches the index position and the update flag of the small files, the client does not need to access the metadata nodes frequently and the performance of the system is greatly improved, but a problem of index consistency across the data nodes is introduced. According to the selection rule, the main data node in the index position mapping table is used as the position for creating the index, and the content of the main data node is up to date. After client 1 queries replica data node 1, it changes the update flag in its cache to N and modifies the corresponding entry in the mapping table of the metadata node. Client 2 then acquires the mapping table information and learns that the update flag bit of replica data node 1 is N. If client 1 now creates an index, so that the update flag of replica data node 1 becomes Y, client 2 cannot learn of this on its own initiative and therefore will not see the small file just created by client 1.
Further preferably, the synchronization control node in the data nodes obtains a copy list from a metadata node of the source file system according to a source path input by the client, creates a thread pool, and allocates a source file to each thread according to the copy list, where the copy list is a list of all source files under the source path, including a file name, a size, and a file path of each source file.
Each thread of the synchronization control node acquires the metadata of the source file allocated to it from the metadata node of the source file system, and acquires the check code of each data block contained in the source file from the corresponding source data node according to the metadata of the source file.
Each thread of the synchronization control node acquires the metadata of the target file corresponding to its source file from the metadata node of the target file system, compares the sizes of the source file and the target file, and, according to the comparison result, applies to the metadata node of the target file system to create or delete data blocks of the target file so that the size of the target file is consistent with that of the source file.
Each thread of the synchronization control node then acquires the metadata of its target file from the metadata node of the target file system again, and acquires the check codes of all data blocks contained in the target file from the corresponding target data nodes according to the metadata of the target file.
Each thread of the synchronization control node generates a file check code list according to the metadata of its source and target files and the check codes of all the source and target data blocks, wherein the file check code list comprises: the serial number of the data block, the ID of the source data block, the check code of the source data block, the ID of the source data node, the ID of the target data block, the check code of the target data block, the ID of the target data node, and a flag bit indicating whether the target data block is a newly created data block.
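The file check code list described above may be illustrated, without limitation, by the following Python sketch. The use of CRC32 as the check code, the tuple layout of the block descriptions, and the helper `blocks_to_copy` (which selects blocks whose check codes differ or which are newly created) are assumptions made for this example; the specification does not name a check code algorithm or state how the list is consumed.

```python
import zlib
from dataclasses import dataclass


@dataclass
class CheckEntry:
    seq: int               # serial number of the data block
    source_block_id: str
    source_check: int      # check code of the source data block
    source_node_id: str
    target_block_id: str
    target_check: int      # check code of the target data block
    target_node_id: str
    newly_created: bool    # flag bit: target block was newly created


def build_check_list(source_blocks, target_blocks, new_target_ids=()):
    """Pair source and target blocks by position and record their check codes.

    Each block is a (block_id, node_id, data) tuple; both lists are assumed to
    have equal length because the target file was resized beforehand.
    """
    entries = []
    for seq, (src, dst) in enumerate(zip(source_blocks, target_blocks)):
        entries.append(CheckEntry(
            seq, src[0], zlib.crc32(src[2]), src[1],
            dst[0], zlib.crc32(dst[2]), dst[1],
            dst[0] in new_target_ids,
        ))
    return entries


def blocks_to_copy(entries):
    """Assumed use of the list: copy blocks that are new or whose check codes differ."""
    return [e.seq for e in entries
            if e.newly_created or e.source_check != e.target_check]
```

Under this assumed use, only the blocks returned by `blocks_to_copy` would need to be transferred from the source data nodes to the target data nodes.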
The method can select random segmentation of the source data space, average segmentation along the dimension with the longest span, clustering segmentation of each dimension, and the like. A cluster of a large distributed file system usually spans a plurality of racks; communication between computers on different racks must pass through a switch, so the transmission cost is high. In most cases, the bandwidth between two computers in different racks is less than that between two computers in the same rack. The replica strategy of the existing distributed file system is to store replicas in two different racks, which prevents data loss when one rack has a problem. When data is read, the node that stores the source data and is closest to the client computer can be accessed according to the principle of proximity, or the bandwidth of different racks can be used to reduce the reading time. Moving the computation to the vicinity of the data storage node is significantly more efficient and less expensive than moving the data to the vicinity of the computation node.
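The principle of proximity mentioned above may be sketched as follows. The distance values (0 for the same host, 2 for the same rack, 4 for different racks) and the (rack, host) tuple representation are assumptions chosen for illustration and are not specified by the present method.

```python
def distance(a, b):
    """Network distance between two nodes given as (rack, host) tuples."""
    if a == b:
        return 0  # same host
    if a[0] == b[0]:
        return 2  # same rack, traffic stays below the rack switch
    return 4      # different racks, traffic crosses a switch


def nearest_replica(client, replicas):
    """Return the replica location closest to the client."""
    return min(replicas, key=lambda r: distance(client, r))
```

For a client at `("r1", "h1")` and replicas at `("r2", "h5")` and `("r1", "h2")`, the same-rack replica is chosen, which avoids the lower inter-rack bandwidth.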
Further preferably, the distributed file system works by means of a MapReduce program. According to the number of directories to be created and the number of small files to be created under each directory, both given by the user, the MapReduce program creates in the distributed file system as many control files as there are directories, and these control files serve as the input files of the Map function.
In the write test, the Map function mainly uses the massive small file storage system to create small files of the designated quantity and size; in the read test, it uses the small-file reading interface of the massive small file storage system to read the small files under the directories created during the write test.
The same Reduce function is used for the write test and the read test. The function aggregates the output data of each Map, such as the total size, the total number and the total running time of the files tested by the MapReduce program, and stores these data in the distributed file system.
The result analysis function reads the statistics produced by Reduce from the distributed file system and, using a given formula, calculates figures such as the speed at which the distributed file system or the massive small file storage system writes and reads small files.
After each run of the MapReduce program, the memory occupancy of the massive small file storage system and of the distributed file system needs to be recorded.
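The aggregation performed by the Reduce function and the subsequent speed calculation may be sketched as follows. Because the specification refers only to "a given formula", the formula used here (total size or total file count divided by total running time) is an assumption, as are the function name `throughput` and the dictionary keys.

```python
def throughput(map_outputs):
    """Aggregate per-Map statistics and derive write or read speed.

    map_outputs: list of dicts with keys 'bytes', 'files' and 'seconds',
    one dict per Map task (a hypothetical record layout).
    """
    total_bytes = sum(m["bytes"] for m in map_outputs)
    total_files = sum(m["files"] for m in map_outputs)
    total_seconds = sum(m["seconds"] for m in map_outputs)
    if total_seconds == 0:
        raise ValueError("total running time must be greater than zero")
    return {
        "bytes_per_second": total_bytes / total_seconds,  # assumed formula
        "files_per_second": total_files / total_seconds,  # assumed formula
    }
```

The same function can be applied to the statistics of a write test and of a read test, mirroring the shared Reduce function described above.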
Preferably, the metadata cluster authorizes a user to access the data cache server cluster, quickly establishes a connection between the client and the data server, monitors the state of each data cache server in real time, and allocates to the user the data cache server capable of providing the optimal service according to the state information. Meanwhile, a cache consistency strategy is used to ensure the consistency and stability of the user data in the data cache server cluster. The metadata cluster is controlled by the virtual server, is responsible for data interaction with the client, and monitors the data state of each user in real time. The state information is also submitted to the virtual server, so that the control server can trace the state of the user data. A heartbeat connection is established between the service quality monitoring server and the transaction control server, and factors influencing the service quality, such as the available network bandwidth and the CPU utilization rate of the service quality monitoring server, are transmitted to the virtual server.
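The allocation of the data cache server capable of providing the optimal service may be illustrated by the following sketch. The scoring rule (a weighted sum of normalized available bandwidth and idle CPU share), the weight `cpu_weight`, and the field names are assumptions; the specification states only that allocation is based on state information such as available bandwidth and CPU utilization.

```python
def pick_cache_server(states, cpu_weight=0.5):
    """Choose a data cache server from monitored state information.

    states: dict mapping server name -> {'alive': bool,
            'bandwidth_mbps': float, 'cpu_util': float in [0, 1]}.
    Returns the name of the best server, or None if no server is alive.
    """
    alive = {name: s for name, s in states.items() if s["alive"]}
    if not alive:
        return None
    # Normalize bandwidth against the best alive server; fall back to 1.0 if all are zero.
    max_bw = max(s["bandwidth_mbps"] for s in alive.values()) or 1.0

    def score(s):
        bandwidth_part = s["bandwidth_mbps"] / max_bw
        idle_cpu_part = 1.0 - s["cpu_util"]
        return (1.0 - cpu_weight) * bandwidth_part + cpu_weight * idle_cpu_part

    return max(alive, key=lambda name: score(alive[name]))
```

Servers whose heartbeat has been lost are represented here by `'alive': False` and are never selected, regardless of their last reported bandwidth.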
In a storage system, a physical file corresponds to a logical representation, which constitutes the metadata information. When a file is read, the logical file is read first, the corresponding data blocks are then taken out of the storage system in the order given by the metadata information, and finally a copy of the physical file is restored. The data file stores the data of the small files in a key/value data structure, which reduces the scale of the metadata of massive small files in the distributed file system, increases the access speed of the small files by reducing interaction with the metadata nodes, facilitates MapReduce-based data processing, and provides support for distributed computing.
All the small files stored by the client in the same directory are stored in the data file of that directory, the data file being a file in the distributed file system. At the same time, an index is generated that records the specific position of each small file in the data file and other related information; the index is handed to the data nodes for maintenance and management, and the data nodes provide the index service to the client. The metadata node needs to record which data node maintains the index of the small files. When a client needs to send an index service request for a small file in a certain directory to a data node, it must first obtain the position of that data node from the distributed file system. The client-side cache mechanism caches and maintains the data node positions of the small file indexes and the data file information, which reduces the number of accesses to the metadata nodes and thereby greatly improves the access speed of the small files.
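The storage of the small files of one directory in a single data file, together with an index recording the position of each small file, may be sketched as follows. The class name `DirectoryDataFile`, the in-memory byte buffer standing in for the data file, and the (offset, length) index layout are assumptions made for illustration.

```python
class DirectoryDataFile:
    """Packs the small files of one directory into one buffer plus an index."""

    def __init__(self):
        self.data = bytearray()  # stands in for the data file in the distributed file system
        self.index = {}          # key/value index: file name -> (offset, length)

    def write(self, name, content):
        """Append a small file and record where it was placed."""
        self.index[name] = (len(self.data), len(content))
        self.data.extend(content)

    def read(self, name):
        """Look up the position in the index and return the small file's bytes."""
        offset, length = self.index[name]
        return bytes(self.data[offset:offset + length])
```

In this sketch the metadata of the file system needs to describe only one data file per directory, while the per-file positions are kept in the index, which is the part the data nodes would maintain.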
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like made within the spirit and principle of the present invention should be included in the scope of the present invention.