Disclosure of Invention
The invention provides a task processing method and system based on distributed cluster datax, aiming at the problem that the existing datax tool does not support a distributed architecture.
The technical scheme adopted by the invention is that the task processing method based on the distributed cluster datax comprises the following steps: acquiring tasks to be executed; acquiring task load conditions of all current datax nodes, wherein each datax node is provided with a corresponding proxy service module, and the datax nodes are registered in a nacos registry through the corresponding proxy service modules; and issuing the task to the datax node based on a pre-configured distribution strategy.
In one embodiment, the obtaining task load conditions of all current datax nodes includes: the redis module acquires and stores the task load conditions of all the current dataxs in a key-value mode, wherein the task load conditions comprise the number of distributed synchronous tasks.
In one embodiment, the issuing the task to the datax node based on the pre-configured allocation policy includes: and issuing the task to one of the datax nodes in at least one datax node with the lowest task condition load based on the priority order preset by the task.
In one embodiment, the distributed cluster datax based task processing method further includes: dynamically adjusting the task memory of the datax node, including: configuring a task memory of the datax node by using the metadata of the nacos registry; or configuring the task memory of the task before the task is issued to the datax node.
Another aspect of the present invention further provides a task processing system based on distributed cluster datax, including: the acquisition module acquires tasks to be executed; the redis management module is used for acquiring the task load conditions of all current datax nodes and issuing the tasks to the datax nodes based on a pre-configured distribution strategy; each datacax node is configured with a corresponding proxy service module, and the datacax nodes are registered in the nacos registry through the corresponding proxy service modules.
In one embodiment, the redis module is further configured to: and acquiring and saving the task load conditions of all the current datax in a key-value form, wherein the task load conditions comprise the number of the distributed synchronous tasks.
In one embodiment, the redis module is further configured to: and issuing the task to one of the datax nodes in at least one datax node with the lowest task condition load based on the priority order preset by the task.
In one embodiment, the distributed cluster datax based task processing system further comprises: a memory adjustment module configured to: configuring a task memory of the datax node by using the metadata of the nacos registry; or configuring the task memory of the task before the task is issued to the datax node.
Another aspect of the present invention also provides an electronic device, including: memory, a processor and a computer program stored on the memory and executable on the processor, the computer program, when executed by the processor, implementing the steps of the distributed cluster datax based task processing method as described in any one of the above.
Another aspect of the present invention also provides a computer storage medium having a computer program stored thereon, where the computer program is used to implement the steps of the distributed cluster datax based task processing method according to any one of the above embodiments when executed by a processor.
By adopting the technical scheme, the invention at least has the following advantages:
the task processing method based on the distributed cluster datax realizes the distributed cluster of the datax by configuring the corresponding proxy service module for each datax node and enabling a plurality of datax nodes to be registered in the nacos registry through the module, and can realize automatic registration and service discovery in the nacos registry.
Detailed Description
To further illustrate the technical means and effects of the present invention adopted to achieve the predetermined purposes, the present invention is described in detail below with reference to the accompanying drawings and preferred embodiments.
The description of the method flow in the present specification and the steps of the flow chart in the drawings of the present specification are not necessarily strictly performed by the step numbers, and the execution order of the method steps may be changed. Moreover, certain steps may be omitted, multiple steps combined into one step execution, and/or a step broken into multiple step executions.
A first embodiment of the present invention provides a task processing method based on distributed cluster datax, as shown in fig. 1, including the following specific steps:
and S1, acquiring a task to be executed.
And S2, acquiring task load conditions of all current datax nodes, wherein each datax node is configured with a corresponding proxy service module, and the datax nodes are registered in the nacos registry through the corresponding proxy service modules.
And S3, issuing the task to the datax node based on the pre-configured distribution strategy.
The following describes in detail a task processing method based on distributed cluster datax provided in this embodiment step by step.
And S1, acquiring a task to be executed.
In this embodiment, the task may be obtained through a redis (Remote Dictionary Server) management module, or may be obtained through other interfaces, where the task may be multiple synchronous tasks.
And S2, acquiring the task load conditions of all the current datax nodes, wherein each datax node is provided with a corresponding proxy service module, and the datax nodes are registered in a nacos registry through the corresponding proxy service modules.
In this example, referring to fig. 2, at least two (generally, a plurality of) datax nodes may be provided, and since the open-source datax only supports a single machine mode, in order to make it form a distributed architecture, in each datax node, an agent service module (agent in fig. 2) corresponding to the datax node is configured, that is, each agent service module corresponds to one datax node, an agent for the datax is implemented by an agent service, and an external interface is provided in the agent to perform relevant operations on the datax, including starting and stopping a datax process; in addition, the running log of datax is also collected by the agent and sent to the distributed storage for unified management.
In this embodiment, the package in the datax is mounted through the directory of the container and is stored in the designated location of the container of the agent microservice, so that the registration and discovery of the datax node are realized through the automatic registration and discovery of the agent microservice, the nacos is used as a registration center, when a new agent service module is registered in the nacos registration center, the datax node corresponding to the newly added agent service module is in an available state, and the external part can operate the corresponding datax program through the external interface provided by the agent service module.
In this embodiment, the current distributed architecture may be further subjected to capacity expansion or capacity reduction according to the needs of practical application.
Specifically, when capacity expansion is needed, an agent service module corresponding to a new datax node needs to be started, so that the new agent service is registered in the nacos registration center, and at this time, the new datax node is added into the datax cluster, and then a corresponding task can be allocated to run.
When the capacity reduction is needed, only the proxy service corresponding to the datax node needs to be offline, and at this time, the datax node is separated from the management of the datax cluster, and does not accept the issue of the task any more.
And S3, based on a pre-configured distribution strategy, issuing the task to the datax node.
In this embodiment, the load control of the datax node determines to which datax node a new synchronization task is to be allocated according to the task load condition of each node of the current datax. Because the task issuing module is a micro-service architecture, a unified manager is required for load control, for example, a redis distributed lock may be used to perform unified node load calculation and node load update in the distributed lock.
Further, the load condition of each datax node is also stored in redis in the form of key-value, when a datax node is allocated with a new synchronization task, the load corresponding to the node is increased by 1, and when the task execution is finished, the load corresponding to the node is decreased by 1. That is, the task load of each datax node is determined by the number of synchronization tasks currently processed by the datax.
Further, when the redis module performs task allocation, a task priority policy is also designed, and a task with a high priority level preferentially allocates operating resources. Specifically, the tasks may be divided into a plurality of priorities according to actual needs, and the tasks are sequentially issued one by one to one in the at least one datax node with a lighter load condition of the current task, that is, the least number of the synchronous tasks, in an order from the highest priority to the lowest priority.
In some possible implementations in this embodiment, in order to improve the performance of task synchronization, in addition to using task fragments of the datax nodes themselves, the advantage of the datax distributed cluster may be fully utilized in a datax node fragmentation manner, and the tasks are distributed to multiple datax nodes to run in parallel after being split. That is, the same task may be fragmented and divided into multiple parts, which are sent to different datax nodes for processing.
Specifically, according to a split pk field specified in the task script, where the field may be a numerical type or a time type, the source end data is segmented by using the value of the field according to the number of nodes of the currently allocable resource, and finally the task is segmented into several small tasks to be executed.
In this embodiment, because different types of synchronization tasks have different requirements on the memory during the task running process, the dynamic adjustment of the task running memory of the datax node is also very important. In order to accurately control the memory of the datax node, the memory of the datax node can be controlled to run by the following two modes:
1) Global adjustment;
the method is based on the metadata of the nacos, the running memory of the datax is set by changing the configuration of the metadata in the nacos, and the configuration is transmitted to the start script of the datax node through the agent service.
2) Controlling task level;
in the method, when a data synchronization task is configured, a user can set memory parameters during the operation of the task, and the parameters are transmitted to an agent to be used as configuration when a datax is started.
Compared with the prior art, the embodiment has at least the following advantages;
1) And the agent of the datax is realized through the designed agent service, and the distributed cluster of the datax is realized through the cluster deployment of the agent service.
2) Automatic registration of nodes and service discovery: the datax software package is used as a part of the agent service, and the automatic registration of the agent service and the datax node and the discovery of the service are realized by using the characteristic that the agent service registers in a nacos registration center.
3) Dynamic expansion and contraction of nodes: and combining the service registration and service offline functions of the agent microservice to realize the expansion and contraction of the datax cluster, thereby realizing the dynamic expansion of the datax nodes.
4) redis load control and task allocation: the characteristics of a redis distributed lock are utilized to construct uniform load management, the load condition of each node is stored by utilizing the redis, and the priority degree of task operation is controlled by utilizing a task priority strategy.
5) Dynamic adjustment of the node memory: and dynamically adjusting the memory of the datax in operation through global adjustment and task level control.
6) Task fragmentation: based on the characteristic of the datax cluster, the source end data is segmented by using the split PK field, a large task is split into a plurality of small tasks according to the segmentation result, and the small tasks are distributed to the datax cluster to run, so that the task parallelism is improved.
A second embodiment of the present invention is a task processing system based on distributed cluster datax, which corresponds to the first embodiment, and as shown in fig. 3, includes the following components:
the acquisition module is configured to acquire tasks to be executed.
And the redis management module acquires the task load conditions of all the current datax nodes and issues the tasks to the datax nodes based on a pre-configured distribution strategy.
Each of the datax nodes is provided with a corresponding proxy service module, and the datax nodes are registered in the nacos registry through the corresponding proxy service modules.
In this embodiment, the redis module is further configured to: and acquiring and storing the task load conditions of all the current dataxs in a key-value mode, wherein the task load conditions comprise the number of distributed synchronous tasks.
In this embodiment, the redis module is further configured to: and issuing the task to one of the at least one datax node with the lowest task condition load in the datax nodes based on the priority sequence pre-configured by the task.
In this embodiment, the task processing system based on distributed cluster datax further includes: a memory adjustment module configured to: configuring a task memory of a datax node by using metadata of a nacos registry; or configuring a task memory of the task before the task is issued to the datax node.
A third embodiment of the present invention, an electronic device, as shown in fig. 4, can be understood as a physical device, and includes a processor and a memory storing instructions executable by the processor, and when the instructions are executed by the processor, the electronic device performs the following operations:
and S1, acquiring a task to be executed.
And S2, acquiring task load conditions of all current datax nodes, wherein each datax node is configured with a corresponding proxy service module, and the datax nodes are registered in the nacos registry through the corresponding proxy service modules.
And S3, based on a pre-configured distribution strategy, issuing the task to the datax node.
A fourth embodiment of the present invention is a task processing method based on distributed cluster datax, which has the same flow as the first, second, or third embodiments, and is different in that, in terms of engineering implementation, this embodiment may be implemented by software plus a necessary general hardware platform, and certainly may also be implemented by hardware, but the former is a better implementation in many cases. With this understanding in mind, the method of the present invention may be embodied in the form of a computer software product stored on a storage medium (e.g., ROM/RAM, magnetic disk, optical disk) and including instructions for causing a device (e.g., a network device such as a base station) to perform the method of the present invention.
In summary, compared with the prior art, the advantages of the present invention at least include:
1) And the agent of the datax is realized through the designed agent service, and the distributed cluster of the datax is realized through the cluster deployment of the agent service.
2) Automatic registration of nodes and service discovery: the datax software package is used as a part of the agent service, and the automatic registration of the agent service and the datax node and the discovery of the service are realized by using the characteristic that the agent service registers in a nacos registration center.
3) Dynamic expansion and contraction of nodes: and combining the service registration and service offline functions of the agent micro-service to realize the capacity expansion and the capacity reduction of the datax cluster, thereby realizing the dynamic expansion of the datax nodes.
4) redis load control and task allocation: the characteristics of a redis distributed lock are utilized to construct uniform load management, the load condition of each node is stored by utilizing the redis, and the priority degree of task operation is controlled by utilizing a task priority strategy.
5) Dynamic adjustment of the node memory: and dynamically adjusting the internal memory of the datax in operation through global adjustment and task level control.
6) Task fragmentation: based on the characteristic of the datax cluster, the source end data is segmented by using the split PK field, a large task is split into a plurality of small tasks according to the segmentation result, and the small tasks are distributed to the datax cluster to run, so that the task parallelism is improved.
While the present invention has been described in connection with the preferred embodiments, it is to be understood that the invention is not limited to the disclosed embodiments, but is intended to cover various modifications and equivalent arrangements included within the spirit and scope of the appended claims.