CN111124724A


Info

Publication number: CN111124724A
Application number: CN201911120925.9A
Authority: CN
Inventors: 李军站
Original assignee: Suzhou Inspur Intelligent Technology Co Ltd
Current assignee: Suzhou Metabrain Intelligent Technology Co Ltd
Priority date: 2019-11-15
Filing date: 2019-11-15
Publication date: 2020-05-08
Anticipated expiration: 2039-11-15
Also published as: CN111124724B

Abstract

The invention provides a node fault testing method and a device of a distributed block storage system, which comprises the following steps: setting a test script according to a preset client service model; selecting a test node according to the node information of the storage system; sending the test script to a test node; executing the test script according to a preset execution mode and generating an execution log; and monitoring the node state and the client service execution result of the storage system in the test script execution process. The node fault testing method and device of the distributed block storage system provided by the invention have the advantages that the testing cost is reduced, the testing efficiency is improved, and the testing flexibility and the software reliability are improved.

Description

Node fault testing method and device of distributed block storage system

Technical Field

The invention relates to the technical field of storage system testing, in particular to a node fault testing method and device of a distributed block storage system.

Background

Distributed block storage systems are widely used in enterprise IT, cloud computing, big data, virtualization and similar fields, and these fields place ever higher reliability requirements on the storage system. Reliability defects in a storage system can be fatal, so they must be discovered and resolved as early as possible in the project development stage. In actual customer deployments, node failures (such as a node unexpectedly hanging, shutting down, restarting or losing power) are often encountered, and the storage system must continue to operate normally within its redundancy rule, so node fault testing during the project development stage is very important. Because customers use a variety of service models, each node of the storage system runs different services, and redundancy rules differ between storage systems, the number of node fault test cases is large. If node faults are tested purely by hand, the workload is heavy, the test cycle is long, test coverage is low, and test efficiency is poor.

Disclosure of Invention

In view of the above disadvantages of the prior art, the present invention provides a node fault testing method and apparatus for a distributed block storage system.

In a first aspect, the present invention provides a node fault testing method for a distributed block storage system, including:

setting a test script according to a preset client service model;

selecting a test node according to the node information of the storage system;

sending the test script to a test node;

executing the test script according to a preset execution mode and generating an execution log;

and monitoring the node state and the client service execution result of the storage system in the test script execution process.

Further, before the executing the test script according to the preset execution mode and generating the execution log, the method further includes:

calling a volume mounting program according to a preset client service model;

performing clock synchronization with a storage system;

and issuing a test script execution command and synchronously starting monitoring of the storage system.

Further, the setting of the test script according to the preset customer service model includes:

setting test script execution content according to the client service model; the client service model comprises an 8k random read-write model, a 1024k sequential read-write model, an 8k and 1024k mixed read-write model, an OLTP service model and an OLAP service model; the test script execution content comprises reboot and power cycling (power-off followed by power-on).

Further, the selecting a test node according to the node information of the storage system includes:

setting the number of fault nodes according to a redundancy rule of a storage system;

and selecting a test node according to the service type of each node.

Further, the executing the test script and generating the execution log according to the preset execution mode includes:

setting an execution sequence and an execution time of the test script according to the IP of the test node where the test script is located, wherein the execution sequence and the execution time ensure that at least one normal node exists in the same group of redundant nodes executing the same subtask;

sequentially executing test scripts under the corresponding test node IP according to the script execution sequence and the execution time;

and screening and outputting error information in the execution log.

Further, the monitoring of the node state and the client service execution result of the storage system during the execution of the test script includes:

monitoring whether the state of the test node is consistent with the execution condition of the test script on the test node, and if not, outputting the execution failure information of the test script under the test node;

monitoring the execution of the client service, generating a client service operation record file, and screening the operation record file for operation error information and outputting any that is found, wherein the operation error information comprises data inconsistency errors and I/O flow interruption errors.

In a second aspect, the present invention provides a node fault testing apparatus for a distributed block storage system, including:

the script setting unit is configured for setting a test script according to a preset client service model;

the node selection unit is configured to select a test node according to the node information of the storage system;

the script issuing unit is configured to issue the test script to the test node;

the script execution unit is configured for executing the test script according to a preset execution mode and generating an execution log;

and the result monitoring unit is configured for monitoring the node state and the client service execution result of the storage system in the test script execution process.

Further, the node selecting unit includes:

the quantity determining module is configured for setting the quantity of the fault nodes according to the redundancy rule of the storage system;

and the position determining module is configured to select the test node according to the service type of each node.

Further, the script execution unit includes:

the execution setting module is configured to set an execution sequence and an execution time of the test script according to the IP of the test node where the test script is located, wherein the execution sequence and the execution time ensure that at least one normal node exists in a same group of redundant nodes executing the same subtask;

the script execution module is configured for sequentially executing the test scripts under the corresponding test node IP according to the script execution sequence and the execution time;

and the log output module is configured for screening and outputting the error information in the execution log.

Further, the result monitoring unit includes:

the execution monitoring module is configured to monitor whether the state of the test node is consistent with the execution condition of the test script on the test node or not, and if not, the execution monitoring module outputs the execution failure information of the test script under the test node;

and the task monitoring module is configured for monitoring the execution of the client service, generating a client service operation record file, and screening the operation record file for operation error information and outputting any that is found, wherein the operation error information comprises data inconsistency errors and I/O flow interruption errors.

The beneficial effect of the invention is that,

according to the node fault testing method and device of the distributed block storage system provided by the invention, test scripts that simulate faults at the test nodes are produced according to the client service model, and the execution sequence and execution time of the test scripts are preset. Test nodes are selected from the storage system according to its node information, the test scripts are issued to the test nodes, and the test scripts are then executed in turn according to the preset execution sequence and execution time. When a test node executes a test script, an execution log of the script is generated, and this execution log is part of the test result data. In addition, throughout the execution of all the test scripts, the node state of the storage system and the execution result of the client service are monitored, and these two monitoring results are also test result data. From the test result data, the tester can analyze the fault stability of the storage system more intuitively. The method and device therefore reduce testing cost, improve testing efficiency, and improve testing flexibility and software reliability.

In addition, the invention has reliable design principle, simple structure and very wide application prospect.

Drawings

In order to more clearly illustrate the embodiments or technical solutions in the prior art of the present invention, the drawings used in the description of the embodiments or prior art will be briefly described below, and it is obvious for those skilled in the art that other drawings can be obtained based on these drawings without creative efforts.

FIG. 1 is a schematic flow diagram of a method of one embodiment of the invention.

Fig. 2 is a schematic block diagram of an apparatus of one embodiment of the present invention.

Detailed Description

In order to make those skilled in the art better understand the technical solution of the present invention, the technical solution in the embodiment of the present invention will be clearly and completely described below with reference to the drawings in the embodiment of the present invention, and it is obvious that the described embodiment is only a part of the embodiment of the present invention, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

FIG. 1 is a schematic flow diagram of a method of one embodiment of the invention. The execution subject in fig. 1 may be a node failure testing system of a distributed block storage system.

As shown in fig. 1, the method 100 includes:

step 110, setting a test script according to a preset customer service model;

step 120, selecting a test node according to the node information of the storage system;

step 130, sending the test script to a test node;

step 140, executing the test script according to a preset execution mode and generating an execution log;

step 150, monitoring the node state and the client service execution result of the storage system in the test script execution process.

In order to facilitate understanding of the present invention, the node fault testing method of the distributed block storage system provided in the present invention is further described below with reference to the principle of the node fault testing method of the distributed block storage system of the present invention and in combination with the process of performing node fault testing on the distributed block storage system in the embodiment.

Specifically, the node fault testing method of the distributed block storage system includes:

First, a communication connection is established between the client and the server side of the distributed storage system. The connection uses a gigabit network, and several clients may be configured redundantly to guard against client downtime during the test.

S1: setting a test script according to the preset client service model.

The number of volumes to create is determined according to the state and configuration of the storage system (generally, 90% of the total capacity of the storage system divided into 4T volumes). The client service model used during the node fault test is then set, for example an 8k random read-write model, a 1024k sequential read-write model, an 8k and 1024k mixed read-write model, an OLTP service model or an OLAP service model, and may be supplemented and modified according to project requirements.
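The volume-count rule stated above can be sketched as a small calculation. This is a minimal illustration only: the function name and parameters are hypothetical, and treating "4T" as 4 TiB is an assumption, since the text does not say whether binary or decimal units are meant.

```python
TIB = 1024 ** 4  # assumption: "4T" is read as 4 TiB; the source does not specify the unit


def planned_volume_count(total_capacity_bytes: int,
                         usable_percent: int = 90,
                         volume_size_bytes: int = 4 * TIB) -> int:
    """Return how many whole volumes fit into the usable share of the capacity.

    Integer arithmetic is used throughout so that exact multiples are not
    affected by floating-point rounding.
    """
    if total_capacity_bytes < 0:
        raise ValueError("capacity must be non-negative")
    usable_bytes = total_capacity_bytes * usable_percent // 100
    return usable_bytes // volume_size_bytes


# Example: a 40 TiB system leaves 36 TiB usable, which holds nine 4 TiB volumes.
print(planned_volume_count(40 * TIB))
```

Rounding down is a design choice made here so that the created volumes never exceed the 90% share; the source does not state how a fractional volume is handled.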

After the client service model is determined, the execution content of the test script (reboot, or power-off followed by power-on) is set. Different client service models are matched with different execution contents, configured according to the test requirements.

S2: selecting a test node according to the node information of the storage system.

The number of test nodes is determined according to the redundancy rule of the storage system (2 copies or 3 copies): the more copies, the larger the number of test nodes. The specific nodes to be used as test nodes (such as a master node, a secondary master node or an ordinary node) are determined according to the services running on each node of the storage system.

If all nodes of the storage system are initially operating normally, any node can be used as a test node. If the storage system contains a node that has just restarted and is still in its data recovery period, then, taking a 2-copy redundant system as an example, provided the other node in the same redundancy group is working normally, the node in the recovery period is used as the test node. The purpose is to ensure that at least one normal node among the nodes processing the same subtask can continue to process the task.
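The selection rule can be illustrated with a short sketch. Everything named here is hypothetical: the redundancy groups are assumed to be disjoint lists of node names, node states are assumed to be the strings "normal" and "recovering", and the fault budget per group is assumed to be one less than the number of copies. The source only requires that at least one normal node remains in each group.

```python
from typing import Dict, List


def select_test_nodes(groups: List[List[str]],
                      state: Dict[str, str],
                      copies: int) -> List[str]:
    """Pick fault-injection targets while keeping one normal node per group.

    Assumptions (not from the source): groups are disjoint, and at most
    copies - 1 nodes per group may be faulted.
    """
    max_faults = copies - 1
    selected: List[str] = []
    for group in groups:
        recovering = [n for n in group if state[n] == "recovering"]
        normal = [n for n in group if state[n] == "normal"]
        if not normal:
            # No healthy replica would be left to serve the subtask; skip the group.
            continue
        # Nodes still recovering are chosen first; the last normal node is never chosen.
        candidates = recovering + normal[:-1]
        selected.extend(candidates[:max_faults])
    return selected


groups = [["n1", "n2"], ["n3", "n4"]]
state = {"n1": "normal", "n2": "recovering", "n3": "normal", "n4": "normal"}
print(select_test_nodes(groups, state, copies=2))
```

In the 2-copy example, the recovering node n2 is chosen in the first group so that the normal node n1 keeps serving the subtask, matching the case described above.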

S3: sending the test script to the test node.

The test scripts are issued to the corresponding test nodes so that each test node receives exactly one test script. After receiving its test script, a test node returns a reception success prompt, which avoids duplicate or missed sends.
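A minimal sketch of the issue-and-acknowledge step follows. The transport is not described in the source, so the `send` callable, its boolean return value and the function name are all hypothetical stand-ins. Keying the scripts by node IP ensures one script per node, and every node is contacted exactly once.

```python
from typing import Callable, Dict, List


def dispatch_scripts(scripts: Dict[str, str],
                     send: Callable[[str, str], bool]) -> List[str]:
    """Send one script to each node and return the nodes that did not acknowledge.

    `send(node_ip, script)` is a hypothetical transport function that returns
    True when the node reports successful reception.
    """
    unacknowledged: List[str] = []
    for node_ip, script in scripts.items():
        acknowledged = send(node_ip, script)
        if not acknowledged:
            unacknowledged.append(node_ip)
    return unacknowledged


def fake_send(node_ip: str, script: str) -> bool:
    # Stand-in transport for illustration: pretend one node never answers.
    return node_ip != "10.0.0.2"


print(dispatch_scripts({"10.0.0.1": "reboot", "10.0.0.2": "reboot"}, fake_send))
```

Returning the list of unacknowledged nodes leaves the decision about what to do with a missed send to the caller, since the source does not describe a retry policy.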

S4: executing the test script according to the preset execution mode and generating an execution log.

When the client starts the client service, it first calls the volume mounting program (for example, for 64 volumes or 32 volumes), then synchronizes its clock with the storage system, and then starts the test script. At the same time as the test script is started, a monitoring start command is sent to the storage system monitoring device. The storage system monitoring device can monitor the state of each node and the execution result of the client service on the storage system.

The execution sequence and execution time of the test scripts are preset at the client and are determined according to the service (the subtask executed) of the node on which each test script resides. Each test script is identified by the IP of the test node on which it resides, so each test script corresponds to one test node IP. When setting the execution time, the restart time of the node running the script and its data recovery time after restart must be considered. If two nodes execute the same subtask under the redundancy rule, then after the test script is executed on node 1, the waiting time must exceed the sum of the restart time and the data recovery time, and the test script on node 2 can be executed only after that waiting time has elapsed. There is no timing constraint between nodes executing different subtasks.
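The timing constraint can be sketched as a simple scheduler. The data class, field names and the extra safety margin are hypothetical; the only rule taken from the text is that two nodes serving the same subtask must be separated by more than the restart time plus the data recovery time, while nodes serving different subtasks are unconstrained.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class FaultTarget:
    ip: str            # test node IP, which also identifies its test script
    subtask: str       # subtask (redundancy group) the node serves
    restart_s: float   # expected restart time in seconds (assumed known)
    recovery_s: float  # expected data recovery time in seconds (assumed known)


def schedule(targets: List[FaultTarget], margin_s: float = 1.0) -> Dict[str, float]:
    """Return a start offset in seconds for each test node IP.

    Nodes sharing a subtask are serialized with a gap strictly larger than
    restart + recovery; the margin is an assumption added to make it strict.
    """
    next_free: Dict[str, float] = {}
    start: Dict[str, float] = {}
    for target in targets:
        begin = next_free.get(target.subtask, 0.0)
        start[target.ip] = begin
        next_free[target.subtask] = begin + target.restart_s + target.recovery_s + margin_s
    return start


plan = schedule([
    FaultTarget("10.0.0.1", "s1", 60.0, 300.0),
    FaultTarget("10.0.0.2", "s1", 60.0, 300.0),
    FaultTarget("10.0.0.3", "s2", 60.0, 300.0),
])
print(plan)
```

Here the second node of subtask s1 starts 361 seconds after the first, while the node of subtask s2 starts immediately.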

Executing each test script reboots the test node, or powers it off and restarts it. An execution log is generated while the test script runs, recording the restart time of the test node, system error information, any loss of test node data across the restart, and the like.

S5: monitoring the node state and the client service execution result of the storage system in the test script execution process.

The node states recorded by the storage system monitoring device over the whole test are collected. The node state includes information such as whether the node is running and the time of each state change. The node state is compared with the execution time of the test script: if the two agree, the test script was executed successfully; if they do not, execution failed and execution failure information is output.
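One way to picture the comparison is the sketch below. The event format, the "down" state label and the tolerance window are assumptions made for illustration; the source only says that node state changes are compared with script execution times.

```python
from typing import Dict, List, Tuple

# Hypothetical monitoring record: (node IP, timestamp in seconds, new state).
StateEvent = Tuple[str, float, str]


def find_failed_executions(exec_times: Dict[str, float],
                           events: List[StateEvent],
                           tolerance_s: float = 30.0) -> List[str]:
    """Return node IPs whose script ran but which were never observed going down.

    A script counts as executed successfully when a "down" event for the same
    node appears within tolerance_s seconds after the script's execution time.
    """
    failed: List[str] = []
    for ip, t_exec in exec_times.items():
        went_down = any(
            event_ip == ip and state == "down" and 0.0 <= t - t_exec <= tolerance_s
            for event_ip, t, state in events
        )
        if not went_down:
            failed.append(ip)
    return failed


events = [("10.0.0.1", 5.0, "down"), ("10.0.0.1", 70.0, "up")]
print(find_failed_executions({"10.0.0.1": 0.0, "10.0.0.2": 400.0}, events))
```

The returned list corresponds to the execution failure information that the method outputs for nodes whose state does not match the script execution.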

The monitored client service execution results are collected, and the error output is checked to determine whether data inconsistency or EIO (input/output error) problems occurred. The logfile file is checked to determine whether an I/O flow interruption occurred and, if so, at what time.

If neither the execution log from step S4 nor the monitoring information collected in this step contains any error or failure information, this node fault test passes; otherwise, the test fails.

As shown in fig. 2, the system 200 includes:

a script setting unit 210 configured to set a test script according to a preset customer service model;

a node selecting unit 220 configured to select a test node according to the node information of the storage system;

the script issuing unit 230 is configured to issue the test script to the test node;

the script execution unit 240 is configured to execute the test script according to a preset execution mode and generate an execution log;

and the result monitoring unit 250 is configured to monitor the node state and the client service execution result of the storage system during the execution process of the test script.

Optionally, as an embodiment of the present invention, the node selecting unit includes:

Optionally, as an embodiment of the present invention, the script execution unit includes:

Optionally, as an embodiment of the present invention, the result monitoring unit includes:

Although the present invention has been described in detail with reference to the drawings and in connection with the preferred embodiments, the present invention is not limited thereto. Those skilled in the art can make various equivalent modifications or substitutions to the embodiments of the present invention without departing from its spirit and scope, and such modifications or substitutions, as well as any changes or substitutions that a person skilled in the art can readily conceive within the technical scope disclosed by the present invention, fall within the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims

1. A node fault testing method of a distributed block storage system is characterized by comprising the following steps:

setting a test script according to a preset client service model;

selecting a test node according to the node information of the storage system;

sending the test script to a test node;

2. The method of claim 1, wherein before the executing the test script according to the preset execution mode and generating the execution log, the method further comprises:

calling a volume mounting program according to a preset client service model;

performing clock synchronization with a storage system;

3. The method according to claim 1, wherein the setting up the test script according to the preset customer service model comprises:

4. The method of claim 1, wherein selecting the test node according to the storage system node information comprises:

and selecting a test node according to the service type of each node.

5. The method of claim 1, wherein executing the test script according to a preset execution mode and generating the execution log comprises:

and screening and outputting error information in the execution log.

6. The method of claim 1, wherein monitoring node status and customer service execution results of the storage system during execution of the test script comprises:

7. A node failure testing apparatus of a distributed block storage system, comprising:

8. The apparatus of claim 7, wherein the node selection unit comprises:

9. The apparatus of claim 7, wherein the script execution unit comprises:

10. The apparatus of claim 7, wherein the result monitoring unit comprises: