CN109299056A


Info

Publication number: CN109299056A
Application number: CN201811096362.XA
Authority: CN
Inventors: 张慧如; 周建明; 冯娜
Original assignee: Weifang Engineering Vocational College
Current assignee: Weifang Engineering Vocational College
Priority date: 2018-09-19
Filing date: 2018-09-19
Publication date: 2019-02-01
Anticipated expiration: 2038-09-19
Also published as: CN109299056B

Abstract

The present invention relates to a data synchronization method and device based on a distributed file system. A data server uses a dual-server working mode to communicate with a physical server and a virtual server simultaneously; a memory cluster and a metadata cluster are established based on a database; data is written to a temporary data block according to the received signal difference; and the file content in the original data node is replaced with the content of the temporary data block, thereby achieving data synchronization. The invention adopts a working mode in which the physical server and the virtual server communicate interactively as a dual-server pair, with each server performing its own function, and switches to a single-server working mode when necessary, which effectively ensures the operating efficiency of the system. Meanwhile, the content to be synchronized is clustered inside the database into the memory cluster and the metadata cluster and handled separately, which effectively ensures reasonable allocation of data synchronization resources and the accuracy of data synchronization.

Description

Data synchronization method and device based on distributed file system

Technical Field

The present application belongs to the field of distributed processing technologies, and in particular, to a data synchronization method and apparatus based on a distributed file system.

Background

As people's quality of life continues to improve, internet applications are becoming ever more widespread. To serve people more conveniently, internet applications keep developing and evolving, while network security problems keep multiplying, so market demand for internet security products keeps growing. Among the many network security problems, the change or loss of important information caused by accidental deletion of files is the most serious, and websites at home and abroad therefore continue to research techniques for preventing accidental deletion of web pages.

When web page tamper-resistant systems first appeared, their internal structure was very simple and they were divided into few functional modules. They could solve some basic website security problems but had many defects. If a hacker attacked an important website with large-scale, sustained tampering activity, the anti-tampering system could not protect the website. With growing dependence on the internet and growing web page traffic, a simple tamper-proof system can no longer cope. Therefore, to effectively prevent web pages from being tampered with and to protect website security, web page tamper-resistant systems have gradually been developed and refined. As anti-tampering technology has improved, the internal structure of these systems has become more complex and their functional modules more numerous. Such a system now fulfils its overall function through interaction and cooperation among all of its modules, which are closely related and tightly interlinked.

Therefore, within a web page tamper-resistant system, the distributed file synchronization system described herein needs further optimization, so that multi-machine publishing of files can be accomplished through simpler system configuration and the system becomes easier to use. The synchronization system also needs further in-depth design work on improving file transmission efficiency, so that its functions are fully optimized, it integrates better with the tamper-resistant system, and it performs its intended role.

Disclosure of Invention

The invention seeks to protect a data synchronization method based on a distributed file system. Overall, the method adopts a working mode in which a physical server and a virtual server communicate interactively as a dual-server pair that synchronously verifies data interaction. Internally, it separates memory data from a metadata cluster, and each server performs its corresponding work, thereby achieving the technical effects of timely data synchronization and accurate updating.

A data synchronization method based on a distributed file system is characterized in that:

monitoring the working states of a physical server and a virtual server in the process that a data server adopts a double-server working mode to simultaneously carry out communication interaction with the physical server and the virtual server;

wherein the communication interaction comprises: performing signal interaction with the physical server, and performing data interaction with the physical server and the virtual server simultaneously;

establishing a database-based memory cluster and a metadata cluster, storing the memory data of the distributed file system to a distributed database in the cluster, and simultaneously storing the metadata of the distributed database in the cluster for processing operation;

if the data server determines that the physical server fails and the virtual server works normally, the data server sends a single-server working mode switching instruction to the virtual server;

the physical server is automatically connected to the virtual server, resources are managed hierarchically according to the virtual machine resources defined by the application logic architecture, the intelligent allocation of resources is calculated, and resources are dynamically expanded online;

the data server receives a first confirmation response returned by the virtual server, and continues to perform the signal interaction and the data interaction with the virtual server in a single-server working mode;

and creating a temporary data block by the data node of the virtual server, writing data into the temporary data block according to the received signal difference, and replacing the file content in the original data node with the content of the temporary data block to realize data synchronization.
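The last step above (write to a temporary data block, then replace the original content) can be sketched as follows. This is only an illustration under stated assumptions: the patent does not define the format of the "signal difference", so it is modelled here as a list of `(offset, bytes)` patches, and the names `apply_difference` and `sync_block` are hypothetical.

```python
import os
import tempfile


def apply_difference(original, patches):
    """Return a copy of `original` with each (offset, data) patch written in.

    The patch format is an assumption made for this sketch.
    """
    buf = bytearray(original)
    for offset, data in patches:
        end = offset + len(data)
        if end > len(buf):
            buf.extend(b"\x00" * (end - len(buf)))  # grow the block if needed
        buf[offset:end] = data
    return bytes(buf)


def sync_block(path, patches):
    """Write the patched content to a temporary block, then swap it in."""
    with open(path, "rb") as f:
        original = f.read()
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)  # the temporary data block
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(apply_difference(original, patches))
        os.replace(tmp_path, path)  # replace the original file content
    except BaseException:
        os.unlink(tmp_path)  # do not leave a stale temporary block behind
        raise
```

Writing to a temporary file in the same directory and then calling `os.replace` means a reader sees either the old content or the new content, never a partially written block.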

The invention adopts a working mode in which the physical server and the virtual server communicate interactively as a dual-server pair, with each server performing its own role, and switches to a single-server working mode when necessary, which effectively ensures the operating efficiency of the system. Meanwhile, the content to be synchronized is clustered into the memory cluster and the metadata cluster in the database and handled separately, which effectively ensures reasonable allocation of data synchronization resources and the accuracy of data synchronization.
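The switch from dual-server to single-server mode can be sketched as a small state machine. This is a hedged illustration: the `Mode` enum, the `DataServer` class and the `send_switch` callback are hypothetical names, and how health is detected is left to the caller. The patent does not describe switching back to dual-server mode, so the sketch does not model it.

```python
from dataclasses import dataclass
from enum import Enum


class Mode(Enum):
    DUAL = "dual-server"
    SINGLE = "single-server"


@dataclass
class DataServer:
    mode: Mode = Mode.DUAL

    def check(self, physical_ok, virtual_ok, send_switch):
        """Switch to single-server mode when only the virtual server is healthy.

        `send_switch` is a callable that delivers the switching instruction to
        the virtual server and returns True once the confirmation response
        has been received.
        """
        if self.mode is Mode.DUAL and not physical_ok and virtual_ok:
            if send_switch():
                self.mode = Mode.SINGLE
        return self.mode
```

The mode changes only after the confirmation response arrives, which mirrors the order of the steps in the method.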

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to these drawings without creative efforts.

Fig. 1 is a flowchart of a data synchronization method based on a distributed file system according to the present invention.

Fig. 2 is a block diagram illustrating a data synchronization apparatus based on a distributed file system according to the present invention.

Detailed Description

The invention first provides a data synchronization method based on a distributed file system. Referring to Fig. 1, which is a flowchart of the method, the method is characterized in that:

while the data server uses a dual-server working mode to communicate with the physical server and the virtual server simultaneously, the working states of the physical server and the virtual server are monitored, wherein the communication interaction comprises: performing signal interaction with the physical server, and performing data interaction with the physical server and the virtual server simultaneously;

Preferably, the process of communicating with the physical server and the virtual server simultaneously further includes:

a heartbeat connection is established with the virtual server so that the virtual server can learn the current state, including load, updates and other conditions; meanwhile, a task request from the virtual server is awaited, and after a client connection request sent by the virtual server is received, a connection is established with the client to monitor the client's operation requests and respond to them promptly;

after the current physical server submits the update from the client, the data cache server synchronizes the update to the physical server, and then submits the update message to the virtual server after the update is completed, and the virtual server controls other data cache servers to synchronize with the physical server.

Synchronizing small files generally involves two steps: obtaining from the metadata node the location of the data node that holds the small-file index, and obtaining the stream of the data file. The output stream of the data file is obtained when writing, and the input stream is obtained when reading. When small files are operated on, if the small files accessed consecutively belong to the same directory, this information is the same for all of them. By caching at the client the index location information for a directory, together with the input and output streams of the data file under that directory, the system reduces interaction with the metadata node and thereby speeds up file access. The metadata node also does not need to modify the update flag frequently; an update is required only when the flag changes. By default the client caches 20 index location entries, and the user can change this number. The cache uses the LRU eviction policy by default, and no locking is used for the cache because the client is not multi-threaded.
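The client-side cache described above can be sketched with an ordered dictionary. This is a minimal, single-threaded illustration; the class name `IndexLocationCache` and the shape of the cached `info` value are assumptions, while the default capacity of 20 and the LRU policy come from the text.

```python
from collections import OrderedDict


class IndexLocationCache:
    """Client-side LRU cache of per-directory index locations."""

    def __init__(self, capacity=20):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self._entries = OrderedDict()

    def get(self, directory):
        if directory not in self._entries:
            return None  # cache miss: the caller would ask the metadata node
        self._entries.move_to_end(directory)  # mark as most recently used
        return self._entries[directory]

    def put(self, directory, info):
        self._entries[directory] = info
        self._entries.move_to_end(directory)
        if len(self._entries) > self.capacity:
            self._entries.popitem(last=False)  # evict the least recently used
```

No lock is taken around the dictionary, matching the statement that the client is not multi-threaded; a multi-threaded client would need one.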

Because the client caches the index location and update flag of small files, it does not need to access the metadata node frequently, which greatly improves system performance but introduces a consistency problem for the indexes held on each data node. According to the selection rule, the primary data node in the index location mapping table is used as the location for creating the index, so the content of the primary data node is always up to date. After client 1 queries replica data node 1, it changes the update flag in its cache to N and modifies the corresponding entry in the metadata node's mapping table. Client 2 then obtains the mapping table and sees that the update flag of replica data node 1 is N. If client 1 now creates an index, so that the update flag of replica data node 1 becomes Y, client 2 has no way of learning this on its own and therefore will not see the small file that client 1 has just created.

Further preferably, the synchronization control node in the data nodes obtains a copy list from a metadata node of the source file system according to a source path input by the client, creates a thread pool, and allocates a source file to each thread according to the copy list, where the copy list is a list of all source files under the source path, including a file name, a size, and a file path of each source file.

Each thread of the synchronization control node obtains the metadata of its assigned source file from the metadata node of the source file system and, according to that metadata, obtains the check code of each data block contained in the source file from the corresponding source data node.

Each thread of the synchronization control node obtains the metadata of the target file corresponding to its source file from the metadata node of the target file system, compares the sizes of the source file and the target file, and, according to the comparison result, requests the metadata node of the target file system to create or delete data blocks of the target file so that the size of the target file matches that of the source file.
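The size comparison can be reduced to a block count. The sketch below assumes a fixed block size, which the patent does not state, and the function name `blocks_to_adjust` is hypothetical.

```python
def blocks_to_adjust(source_size, target_size, block_size):
    """Return how many target blocks to create (> 0) or delete (< 0).

    Sizes are in bytes; a fixed block size is an assumption of this sketch.
    """
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    needed = (source_size + block_size - 1) // block_size   # ceiling division
    present = (target_size + block_size - 1) // block_size
    return needed - present
```

A result of zero means the target already has the right number of blocks, although the last block may still differ in length or content.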

Each thread of the synchronization control node then obtains the metadata of its target file from the metadata node of the target file system again and, according to that metadata, obtains the check codes of all data blocks contained in the target file from the corresponding target data nodes.

Each thread of the synchronization control node generates a file check code list from the metadata of its source and target files and the check codes of all source and target data blocks. The file check code list comprises: the serial number of the data block, the ID of the source data block, the check code of the source data block, the ID of the source data node, the ID of the target data block, the check code of the target data block, the ID of the target data node, and a flag bit indicating whether the target data block is newly created.
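The per-thread construction of a check code list can be sketched as below. Several things are assumptions made for illustration: CRC32 stands in for the unspecified check code, blocks are passed in as in-memory byte strings rather than fetched from data nodes, and the block and node IDs from the list above are omitted for brevity. The names `CheckEntry`, `build_check_list` and `build_all` are hypothetical.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Optional


@dataclass
class CheckEntry:
    seq: int                     # serial number of the data block
    source_check: int            # check code of the source block
    target_check: Optional[int]  # None when the target block is new
    newly_created: bool          # flag bit from the check code list

    @property
    def needs_copy(self):
        return self.newly_created or self.source_check != self.target_check


def build_check_list(source_blocks, target_blocks):
    """Compare source and target blocks position by position."""
    entries = []
    for seq, src in enumerate(source_blocks):
        is_new = seq >= len(target_blocks)
        tgt = None if is_new else zlib.crc32(target_blocks[seq])
        entries.append(CheckEntry(seq, zlib.crc32(src), tgt, is_new))
    return entries


def build_all(files, workers=4):
    """`files` maps a source path to (source_blocks, target_blocks)."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {path: pool.submit(build_check_list, *blocks)
                   for path, blocks in files.items()}
        return {path: fut.result() for path, fut in futures.items()}
```

Only blocks whose `needs_copy` is true would have to be transferred, which is the point of comparing check codes before moving data.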

The method may use random partitioning of the source data space, even partitioning along long-span dimensions, cluster-based partitioning of each dimension, and the like. A cluster of a large distributed file system usually spans multiple racks, and communication between computers on different racks must pass through a switch, so the transmission cost is high. In most cases, the bandwidth between two computers in different racks is lower than that between two computers in the same rack. The replica policy of existing distributed file systems stores replicas in two different racks, which prevents data loss when one rack fails. When data is read, the proximity principle can be used to access the node that stores the source data and is closest to the client computer, or the bandwidth of different racks can be used to reduce the read time. Moving the computation close to the data storage node is significantly more efficient and less expensive than moving the data close to the computation node.
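The proximity principle for reads can be sketched as picking the replica with the smallest network distance. The distance values (0 for the same host, 2 for the same rack, 4 for different racks) are an illustrative assumption, and `Location`, `distance` and `nearest_replica` are hypothetical names.

```python
from typing import NamedTuple


class Location(NamedTuple):
    rack: str
    host: str


def distance(a, b):
    """Assumed distances: 0 same host, 2 same rack, 4 different racks."""
    if a == b:
        return 0
    if a.rack == b.rack:
        return 2
    return 4


def nearest_replica(client, replicas):
    """Pick the replica with the smallest network distance to the client."""
    if not replicas:
        raise ValueError("no replicas available")
    return min(replicas, key=lambda r: distance(client, r))
```

With replicas spread over two racks, a client in either rack can usually be served without crossing the inter-rack switch.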

Further, preferably, the distributed file system operates using MapReduce. According to the user-specified number of directories to create and number of small files to create under each directory, the MapReduce program creates in the distributed file system as many control files as there are directories, and these control files serve as the input files of the Map function.

The Map function of the test mainly uses the interface of the mass small-file storage system to create small files of the specified number and size during the write test, and uses the small-file read interface of the mass small-file storage system to read all the small files under the directories created during the write test.

The write test and the read test use the same Reduce function, which aggregates the output of each Map, such as the total size and total number of the test files and the total running time of the MapReduce program, and stores these data in the distributed file system.

The result analysis function reads the aggregated Reduce results from the distributed file system and uses a given formula to calculate, among other things, the speed at which the distributed file system or the mass small-file storage system writes and reads small files.
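The aggregation and analysis steps can be sketched in plain Python, without a MapReduce framework. The patent refers to "a given formula" without stating it, so the formula here (total size or file count divided by total running time) is an assumption, as are the tuple layout and the names `reduce_stats` and `throughput`.

```python
def reduce_stats(map_outputs):
    """Sum the (bytes, file_count, seconds) tuples emitted by each Map task."""
    total_bytes = sum(o[0] for o in map_outputs)
    total_files = sum(o[1] for o in map_outputs)
    total_seconds = sum(o[2] for o in map_outputs)
    return total_bytes, total_files, total_seconds


def throughput(total_bytes, total_files, total_seconds):
    """Return (MiB per second, files per second) under the assumed formula."""
    if total_seconds <= 0:
        raise ValueError("total_seconds must be positive")
    mib = total_bytes / (1024 * 1024)
    return mib / total_seconds, total_files / total_seconds
```

Because the same reduction serves both the write test and the read test, the two speeds are directly comparable.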

After each run of the MapReduce program, the memory occupancy of the mass small-file storage system and of the distributed file system must be recorded.

Preferably, the metadata cluster authorizes users to access the data cache server cluster, quickly establishes the connection between the client and the data server, monitors the state of each data cache server in real time, and assigns to the user the data cache server that can provide the best service according to this state information. A cache consistency policy is used to ensure the consistency and stability of user data in the data cache server cluster. The metadata cluster is controlled by the virtual server, is responsible for data interaction with the client, and monitors the data state of each user in real time. It also submits the state information to the virtual server, so that the control server can trace the state of user data. A heartbeat connection is established between the service quality monitoring server and the transaction control server, and factors that affect service quality, such as the available network bandwidth and the CPU utilization of the service quality monitoring server, are transmitted to the virtual server.
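Choosing the data cache server "capable of providing the optimal service" can be sketched as ranking servers by a score built from the monitored state. The patent names available bandwidth and CPU utilization as factors but gives no weighting, so the scoring function below is purely an assumption, and `ServerState`, `score` and `pick_server` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class ServerState:
    name: str
    available_bandwidth_mbps: float
    cpu_utilization: float  # fraction between 0.0 and 1.0
    alive: bool = True      # result of the latest heartbeat


def score(state):
    """Higher is better; the weighting is an illustrative assumption."""
    return state.available_bandwidth_mbps * (1.0 - state.cpu_utilization)


def pick_server(states):
    """Return the best-scoring server among those that are alive."""
    candidates = [s for s in states if s.alive]
    if not candidates:
        raise RuntimeError("no data cache server is available")
    return max(candidates, key=score)
```

A server that has missed its heartbeat is excluded before scoring, so a fast but unreachable server is never assigned.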

In a storage system, a physical file corresponds to a logical representation, which constitutes metadata information. When reading the file, the logical file is read first, then the corresponding data block is taken out from the storage system according to the formed metadata information sequence, and finally the copy of the physical file is restored. The data file stores the data of the small files through a key/value data structure, so that the scale of metadata of massive small files in a distributed file system is reduced, the access speed of the small files is increased (by reducing interaction with metadata nodes), the MapReduce-based data processing is facilitated, and support is provided for distributed computing.

All the small files that a client stores in the same directory are stored in the data file of that directory, where the data file is itself a file in the distributed file system. At the same time an index is generated that records the position of each small file within the data file together with other related information. The index is handed to the data nodes for maintenance and management, and the data nodes provide the index service to clients. The metadata node needs to record which data node maintains the index of the small files. When a client needs to send an index service request for a small file in a given directory to a data node, it must first obtain the location of that data node from the distributed file system. The client-side cache mechanism caches and maintains the data node locations of small-file indexes and the data file information, which reduces the number of accesses to the metadata node and thereby greatly improves the access speed of small files.
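Packing the small files of one directory into a single data file with an index can be sketched as follows. The index layout (file name mapped to offset and length) is an assumption; the patent says only that the index records the position of each small file and other related information. The names `pack` and `read_small_file` are hypothetical, and an in-memory buffer stands in for the data file.

```python
import io


def pack(files):
    """Append small files to one data file; return (data, index).

    `files` maps a file name to its content as bytes. The index maps each
    file name to its (offset, length) inside the data file.
    """
    data = io.BytesIO()
    index = {}
    for name, content in files.items():
        index[name] = (data.tell(), len(content))
        data.write(content)
    return data.getvalue(), index


def read_small_file(data, index, name):
    """Look up the position in the index and slice the content out."""
    offset, length = index[name]
    return data[offset:offset + length]
```

The file system then tracks one data file per directory instead of one entry per small file, which is how the scheme reduces the volume of metadata.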

The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims

1. A data synchronization method based on a distributed file system is characterized in that:

2. The data synchronization method of a distributed file system according to claim 1, wherein:

the process that the physical server and the virtual server simultaneously carry out communication interaction further comprises the following steps:

a heartbeat connection is established with the virtual server so that the virtual server can learn the current state, including load, updates and other conditions; meanwhile, a task request from the virtual server is awaited, and after a client connection request sent by the virtual server is received, a connection is established with the client to monitor the client's operation requests and respond to them promptly; after the current physical server submits an update from the client, the data cache server synchronizes the update to the physical server and, once the update is complete, submits the update message to the virtual server, and the virtual server controls the other data cache servers to synchronize with the physical server.

3. The data synchronization method of a distributed file system according to claim 1, wherein:

a synchronous control node in the data nodes acquires a copy list from a metadata node of a source file system according to a source path input by a client, creates a thread pool and distributes a source file for each thread according to the copy list, wherein the copy list is a list of all source files under the source path and comprises the file name, the size and the file path of each source file;

each thread of the synchronous control node acquires metadata of a source file distributed by each thread from a metadata node of a source file system, and acquires check codes of each data block contained in the source file from a corresponding source data node according to the metadata of the source file;

each thread of the synchronous control node acquires metadata of a target file corresponding to each source file from a metadata node of a target file system, compares the sizes of the source file and the target file, and applies for creating or deleting a data block of the target file from the metadata node of the target file system according to a comparison result so that the size of the target file is consistent with that of the source file;

each thread of the synchronous control node acquires the metadata of each target file from the metadata node of the target file system again, and acquires the check codes of all data blocks contained in each target file from the corresponding target data node according to the metadata of each target file;

4. The data synchronization method of a distributed file system according to claim 1, wherein:

the distributed file system works by adopting a MapReduce thread, and the MapReduce program creates control files as many as the number of directories in the distributed file system as input files of a Map function according to the number of created directories given by a user and the number of created small files under each directory;

the Map function of the test mainly uses the interface of the mass small-file storage system to create small files of the specified number and size during the write test, and uses the small-file read interface of the mass small-file storage system to read all the small files under the directories created during the write test;

the same Reduce function is used for the write test and the read test, the function counts output data of each Map, such as the total size, the total amount, the total running time and the like of a MapReduce program test file, and the data are stored in a distributed file system;

the result analysis function reads out the result data of Reduce statistics from the distributed file system, and calculates the speed of writing and reading small files of the distributed file system or the mass small file storage system through a given formula;

5. The data synchronization method of a distributed file system according to claim 1, wherein:

the metadata cluster authorizes a user to access the data cache server cluster, and quickly establishes connection between the client and the data server;

meanwhile, the consistency and stability of the user data in the data cache server cluster are ensured by utilizing a cache consistency strategy; the metadata cluster is controlled by a virtual server and is responsible for data interaction with the client and real-time monitoring of the data state of each user;

meanwhile, the state information is submitted to the virtual server, so that the traceability of the control server to the user data state is ensured;

and a heartbeat connection is established between the service quality monitoring server and the transaction control server, and the factors influencing the service quality, such as the available bandwidth of the network, the CPU utilization rate and the like of the service quality monitoring server are transmitted to the virtual server.

6. A data synchronization device based on a distributed file system is characterized in that:

the device comprises a data server, a physical server, a virtual server and a database; wherein,

7. The data synchronization apparatus of a distributed file system according to claim 6, wherein:

establishing a heartbeat connection with the virtual server so that the virtual server can learn the current state, including load, updates and the like; waiting for the virtual server to send a task request; after receiving a client connection request sent by the virtual server, establishing a connection with the client to monitor the client's operation requests and respond in time; after the current physical server submits an update from the client, the data cache server synchronizes the update to the physical server and, once the update is complete, submits the update message to the virtual server, and the virtual server controls the other data cache servers to synchronize with the physical server.

8. The data synchronization apparatus of a distributed file system according to claim 6, wherein:

9. The data synchronization apparatus of a distributed file system according to claim 6, wherein:

10. The data synchronization apparatus of a distributed file system according to claim 6, wherein: