Disclosure of Invention
In view of the above defects, the invention aims to provide an adaptive computing power scheduling system for large model training.
The invention adopts the following technical scheme:
an adaptive computing power scheduling system for large model training comprises a computing power management module, a task analysis module, an adaptive scheduling module and a data interaction module;
The computing power management module is used for managing and monitoring the overall computing power resources, the task analysis module is used for analyzing the requirements and characteristics of training tasks, the adaptive scheduling module performs computing power scheduling according to real-time resource conditions and task requirements, and the data interaction module is used for managing data interaction among different computing power nodes;
The computing power management module comprises a computing power resource monitoring unit, a computing power pool management unit and a pre-estimated resource unit, wherein the computing power resource monitoring unit is used for detecting the load and the available state of computing power resources, the computing power pool management unit is used for dividing the available computing power into resource pools with different grades, and the pre-estimated resource unit predicts the required overall computing power resources based on the training characteristics of a large model;
The task analysis module comprises a task decomposition unit, a demand analysis unit and a priority evaluation unit, wherein the task decomposition unit is used for decomposing a large model training task into a plurality of subtasks and identifying the dependency relationship of the task, the demand analysis unit is used for analyzing the demand of each subtask and generating a calculation power demand parameter, and the priority evaluation unit is used for distributing priority to the subtasks;
The self-adaptive scheduling module comprises a dynamic resource allocation unit, a scheduling strategy generation unit and a self-adaptive optimization unit, wherein the dynamic resource allocation unit allocates proper computing power resources to each subtask based on a scheduling strategy, the scheduling strategy generation unit generates a scheduling strategy based on the priority of the task, the demand parameters and the resource state, and the self-adaptive optimization unit is used for adjusting the computing power allocation in the task execution process;
the data interaction module comprises a data caching unit, a data synchronizing unit and a transmission optimizing unit, wherein the data caching unit is used for caching common data and intermediate results, the data synchronizing unit is used for ensuring consistency and synchronism of the data among all nodes, and the transmission optimizing unit is used for selecting an optimal data transmission path and mode.
Further, the dynamic resource allocation unit comprises a policy management processor, a node selection processor and a task allocation processor, wherein the policy management processor is used for receiving scheduling policies, the node selection processor is used for selecting corresponding resource nodes for subtasks in each scheduling policy, and the task allocation processor is used for sorting task data and then sending the task data to the corresponding resource nodes.
Further, the process of selecting the resource node by the node selection processor includes the following steps:
S1, screening the nodes whose remaining resource amount is larger than the actual resource allocation amount in the scheduling policy; these nodes are called candidate nodes;
S2, calculating the data dependency index do of each candidate node according to the following formula:
do = (ch / cd) × (dh / da);
where ch is the remaining resource amount of the node, cd is the actual resource allocation amount of the subtask, dh is the amount of necessary data already held by the node, and da is the total amount of necessary data for the subtask;
S3, selecting the node with the largest data dependency index as the node allocated to execute the subtask.
Further, the adaptive optimization unit comprises an optimization detection processor, a resource change processor and an idle management processor, wherein the optimization detection processor judges whether optimization is needed based on the subtask's use of its special resources, the resource change processor adjusts and changes the task labels of resources based on the optimization information, and the idle management processor manages the subtasks' use of idle resources.
Further, when a subtask's utilization rate of its special resources continuously exceeds a first threshold for a duration T and idle resources exist at that moment, resource addition optimization is performed; when the utilization rate of the special resources remains below a second threshold for a duration T, resource deletion optimization is performed, where T is the optimization judgment duration;
the resource change processor calculates the added resource quantity cd+ according to the following formula:
cd+ = k × (ua - y1) × cd;
where k is a proportionality coefficient, ua is the average utilization over the duration T, and y1 is the first threshold;
the resource change processor calculates the pruned resource quantity cd- according to the following formula:
cd- = k × (y2 - ua) × cd;
where y2 is the second threshold.
The beneficial effects obtained by the invention are as follows:
The system tracks the state of every computing power node in real time through the computing power resource monitoring unit and dynamically adjusts resource allocation, avoiding resource overload or idleness; the computing power pool management unit dynamically allocates suitable resource types according to task demands to achieve fine-grained resource management; and the task analysis module identifies the critical paths and priorities of tasks through the demand analysis unit and the priority evaluation unit.
For a further understanding of the nature and the technical aspects of the present invention, reference should be made to the following detailed description of the invention and the accompanying drawings, which are provided for purposes of reference only and are not intended to limit the invention.
Detailed Description
The following embodiments of the present invention are described in terms of specific examples, and those skilled in the art will appreciate the advantages and effects of the present invention from the disclosure herein. The invention is capable of other and different embodiments and its several details are capable of modification and variation in various respects, all without departing from the spirit of the present invention. The drawings of the present invention are merely schematic illustrations, and are not intended to be drawn to actual dimensions. The following embodiments will further illustrate the related art content of the present invention in detail, but the disclosure is not intended to limit the scope of the present invention.
With reference to FIG. 1, this embodiment provides an adaptive computing power scheduling system for large model training, which comprises a computing power management module, a task analysis module, an adaptive scheduling module and a data interaction module;
The computing power management module is used for managing and monitoring the overall computing power resources, the task analysis module is used for analyzing the requirements and characteristics of training tasks, the adaptive scheduling module performs computing power scheduling according to real-time resource conditions and task requirements, and the data interaction module is used for managing data interaction among different computing power nodes;
The computing power management module comprises a computing power resource monitoring unit, a computing power pool management unit and a pre-estimated resource unit, wherein the computing power resource monitoring unit is used for detecting the load and the available state of computing power resources, the computing power pool management unit is used for dividing the available computing power into resource pools with different grades, and the pre-estimated resource unit predicts the required overall computing power resources based on the training characteristics of a large model;
The task analysis module comprises a task decomposition unit, a demand analysis unit and a priority evaluation unit, wherein the task decomposition unit is used for decomposing a large model training task into a plurality of subtasks and identifying the dependency relationship of the task, the demand analysis unit is used for analyzing the demand of each subtask and generating a calculation power demand parameter, and the priority evaluation unit is used for distributing priority to the subtasks;
The self-adaptive scheduling module comprises a dynamic resource allocation unit, a scheduling strategy generation unit and a self-adaptive optimization unit, wherein the dynamic resource allocation unit allocates proper computing power resources to each subtask based on a scheduling strategy, the scheduling strategy generation unit generates a scheduling strategy based on the priority of the task, the demand parameters and the resource state, and the self-adaptive optimization unit is used for adjusting the computing power allocation in the task execution process;
the data interaction module comprises a data caching unit, a data synchronizing unit and a transmission optimizing unit, wherein the data caching unit is used for caching common data and intermediate results, the data synchronizing unit is used for ensuring consistency and synchronism of the data among all nodes, and the transmission optimizing unit is used for selecting an optimal data transmission path and mode.
The dynamic resource allocation unit comprises a strategy management processor, a node selection processor and a task allocation processor, wherein the strategy management processor is used for receiving scheduling strategies, the node selection processor is used for selecting corresponding resource nodes for subtasks in each scheduling strategy, and the task allocation processor is used for sorting task data and then sending the task data to the corresponding resource nodes.
The process of selecting resource nodes by the node selection processor comprises the following steps:
S1, screening the nodes whose remaining resource amount is larger than the actual resource allocation amount in the scheduling policy; these nodes are called candidate nodes;
S2, calculating the data dependency index do of each candidate node according to the following formula:
do = (ch / cd) × (dh / da);
where ch is the remaining resource amount of the node, cd is the actual resource allocation amount of the subtask, dh is the amount of necessary data already held by the node, and da is the total amount of necessary data for the subtask;
S3, selecting the node with the largest data dependency index as the node allocated to execute the subtask.
The adaptive optimization unit comprises an optimization detection processor, a resource change processor and an idle management processor, wherein the optimization detection processor judges whether optimization is needed based on the subtasks' use of their special resources, the resource change processor adjusts and changes the task labels of resources based on the optimization information, and the idle management processor manages the subtasks' use of idle resources.
When a subtask's utilization rate of its special resources continuously exceeds the first threshold for a duration T and idle resources exist at that moment, resource addition optimization is performed; when the utilization rate of the special resources remains below the second threshold for a duration T, resource deletion optimization is performed, where T is the optimization judgment duration;
the resource change processor calculates the added resource quantity cd+ according to the following formula:
cd+ = k × (ua - y1) × cd;
where k is a proportionality coefficient, ua is the average utilization over the duration T, and y1 is the first threshold;
the resource change processor calculates the pruned resource quantity cd- according to the following formula:
cd- = k × (y2 - ua) × cd;
where y2 is the second threshold.
The second embodiment includes the entire content of the first embodiment and provides an adaptive computing power scheduling system for large model training, which comprises a computing power management module, a task analysis module, an adaptive scheduling module and a data interaction module;
The computing power management module is used for managing and monitoring the overall computing power resources, the task analysis module is used for analyzing the requirements and characteristics of training tasks, the adaptive scheduling module performs computing power scheduling according to real-time resource conditions and task requirements, and the data interaction module is used for managing data interaction among different computing power nodes;
Referring to fig. 2, the power management module includes a power resource monitoring unit, a power pool management unit, and an estimated resource unit, where the power resource monitoring unit is configured to detect a load and an available state of a power resource, the power pool management unit is configured to divide the available power into resource pools with different levels, and the estimated resource unit predicts a required overall power resource based on a large model training feature;
Referring to fig. 3, the task analysis module includes a task decomposition unit, a demand analysis unit and a priority evaluation unit, where the task decomposition unit is configured to decompose a large model training task into a plurality of subtasks and identify dependency relationships of the tasks, the demand analysis unit is configured to analyze demands of each subtask and generate a calculation power demand parameter, and the priority evaluation unit is configured to assign priorities to the subtasks;
referring to fig. 4, the adaptive scheduling module includes a dynamic resource allocation unit, a scheduling policy generation unit and an adaptive optimization unit, where the dynamic resource allocation unit allocates an appropriate computing power resource to each subtask based on a scheduling policy, the scheduling policy generation unit generates a scheduling policy based on a demand parameter and a resource state of a task, and the adaptive optimization unit is used to adjust computing power allocation in a task execution process;
Referring to fig. 5, the data interaction module includes a data caching unit, a data synchronization unit and a transmission optimization unit, where the data caching unit is used to cache common data and intermediate results, the data synchronization unit is used to ensure consistency and synchronization of data between nodes, and the transmission optimization unit is used to select an optimal data transmission path and mode;
The resource monitoring unit comprises a node state monitoring processor, a performance evaluation processor and a feedback transmission processor, wherein the node state monitoring processor is used for collecting resource state data of each node, the performance evaluation processor is used for performing performance evaluation on the collected resource state, and the feedback transmission processor is used for feeding monitoring information back to the computing pool management unit;
The computing power pool management unit comprises a monitoring receiving processor, a grading statistics processor and an information recording processor, wherein the monitoring receiving processor is used for receiving monitoring information of each node, the grading statistics processor is used for dividing computing power resources into different resource pools for statistics based on performance evaluation, and the information recording processor is used for recording resource information of each resource pool;
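As an illustrative sketch only (the specification does not give concrete tiers), the grading statistics processor's division of computing power into graded resource pools might look as follows; the pool names, score field and tier boundaries are assumptions:

```python
# Hypothetical sketch of the grading statistics processor: nodes are
# divided into graded resource pools by a performance evaluation score.
# The tier boundaries (80 / 50) are illustrative assumptions.
def grade_into_pools(nodes):
    """Group nodes into high/medium/low pools by their performance score."""
    pools = {"high": [], "medium": [], "low": []}
    for node in nodes:
        score = node["score"]
        if score >= 80:
            pools["high"].append(node["id"])
        elif score >= 50:
            pools["medium"].append(node["id"])
        else:
            pools["low"].append(node["id"])
    return pools

nodes = [
    {"id": "n1", "score": 92},
    {"id": "n2", "score": 61},
    {"id": "n3", "score": 34},
]
print(grade_into_pools(nodes))
```

The information recording processor would then store the membership and capacity of each pool for use by the scheduler.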
The estimated resource unit comprises a history training register, a comparison estimation processor and a computing pool allocation processor, wherein the history training register is used for storing case information of actual training runs, the comparison estimation processor is used for comparing the characteristics of the large model with the case information and determining the resources required by the whole training, and the computing pool allocation processor is used for marking and allocating the corresponding resources in the computing pool based on the estimated resources;
The task decomposition unit comprises a model disassembly processor, a task coding processor and a task management processor, wherein the model disassembly processor is used for disassembling a training model into a plurality of subtasks, the task coding processor is used for carrying out task coding based on the dependency relationship of the subtasks, and the task management processor is used for storing the detailed information of each subtask and managing the assignment of the tasks;
The demand analysis unit comprises a characteristic extraction processor, a resource mapping processor and a parameter feedback processor, wherein the characteristic extraction processor is used for extracting characteristic information of a subtask, the resource mapping processor is used for mapping to obtain a resource type based on the characteristic information and rule information, and the parameter feedback processor is used for generating parameters of corresponding type resources based on the characteristic information and feeding back the parameters to the task management processor;
The priority evaluation unit comprises a time sensitive evaluation processor, a resource occupation evaluation processor and a priority calculation processor, wherein the time sensitive evaluation processor is used for evaluating the time sensitivity of the subtasks, the resource occupation evaluation processor is used for evaluating the resource occupation of the subtasks, and the priority calculation processor is used for calculating the priority information of the subtasks based on the evaluation result;
The priority calculation processor calculates the priority P of a subtask according to the following formula:
P = w1 × T + w2 × R;
where T is the time sensitivity value, R is the resource occupancy value, w1 is the time coefficient, and w2 is the resource coefficient;
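Assuming the weighted-sum form P = w1 × T + w2 × R implied by the named coefficients, the priority calculation can be sketched as follows; the weight values are illustrative, not taken from the specification:

```python
def subtask_priority(t_sensitivity, r_occupancy, w1=0.6, w2=0.4):
    """P = w1*T + w2*R: weighted sum of time sensitivity and
    resource occupancy (weights w1, w2 are assumed defaults)."""
    return w1 * t_sensitivity + w2 * r_occupancy

# A time-critical, moderately resource-hungry subtask scores higher
# and is therefore dispatched earlier within its batch.
print(subtask_priority(0.9, 0.5))
```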
The task management processor divides the subtasks into a plurality of batches based on their task codes, sorts the subtasks within the same batch by priority, and sends them to the adaptive scheduling module in that order; the subtasks of the next batch are processed only after the current batch has been completed;
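The batching and in-batch ordering described above can be sketched as follows; the representation of a subtask as a dict with `batch` and `priority` fields is a hypothetical simplification (in the system, the batch index would be derived from the dependency-based task code):

```python
from collections import defaultdict

def order_batches(subtasks):
    """Group subtasks into batches, sort each batch by priority
    (highest first), and return the batches in dispatch order."""
    batches = defaultdict(list)
    for st in subtasks:
        batches[st["batch"]].append(st)
    ordered = []
    for batch_id in sorted(batches):            # earlier batches first
        ordered.append(sorted(batches[batch_id],
                              key=lambda st: st["priority"],
                              reverse=True))    # high priority first
    return ordered

subtasks = [
    {"name": "s1", "batch": 0, "priority": 2},
    {"name": "s2", "batch": 0, "priority": 5},
    {"name": "s3", "batch": 1, "priority": 1},
]
for batch in order_batches(subtasks):
    print([st["name"] for st in batch])
```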
The dynamic resource allocation unit comprises a strategy management processor, a node selection processor and a task allocation processor, wherein the strategy management processor is used for receiving scheduling strategies, the node selection processor is used for selecting corresponding resource nodes for subtasks in each scheduling strategy, and the task allocation processor is used for sorting task data and then sending the task data to the corresponding resource nodes;
The process of selecting resource nodes by the node selection processor comprises the following steps:
S1, screening the nodes whose remaining resource amount is larger than the actual resource allocation amount in the scheduling policy; these nodes are called candidate nodes;
S2, calculating the data dependency index do of each candidate node according to the following formula:
do = (ch / cd) × (dh / da);
where ch is the remaining resource amount of the node, cd is the actual resource allocation amount of the subtask, dh is the amount of necessary data already held by the node, and da is the total amount of necessary data for the subtask;
The necessary data of a subtask are the data generated by its preceding subtasks; these data are distributed on different nodes and must be synchronized to the node where the subtask is located before the subtask can be executed;
S3, selecting the node with the largest data dependency index as the node allocated to execute the subtask;
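The three-step selection process can be sketched as follows. Because the original formula image is not legible, the product form of the data dependency index used here (resource headroom ratio times data locality fraction) is an assumed reconstruction from the variable definitions:

```python
def select_node(nodes, cd, da):
    """S1-S3: filter nodes with ch > cd, score each candidate by the
    data dependency index do, and pick the node with the largest do.
    do = (ch/cd) * (dh/da) is an assumed form of the index."""
    candidates = [n for n in nodes if n["ch"] > cd]       # S1: screening
    if not candidates:
        return None

    def do_index(n):                                      # S2: index
        return (n["ch"] / cd) * (n["dh"] / da)

    return max(candidates, key=do_index)["id"]            # S3: selection

nodes = [
    {"id": "n1", "ch": 8, "dh": 10},
    {"id": "n2", "ch": 6, "dh": 40},
    {"id": "n3", "ch": 3, "dh": 50},   # screened out: ch <= cd
]
print(select_node(nodes, cd=4, da=50))
```

Favouring nodes that already hold more of the necessary data reduces the cross-node synchronization described above.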
The scheduling policy generation unit comprises a state evaluation processor, a resource adjustment processor and a policy output processor, wherein the state evaluation processor is used for evaluating the current resource state, the resource adjustment processor is used for adjusting the demand parameters based on the resource state evaluation result to obtain the actual resource allocation amount, and the policy output processor is used for sending policy information based on the actual resource allocation amount to the dynamic resource allocation unit;
The resource adjustment processor calculates the actual resource allocation amount cd according to the following formula:
cd = pn × (rs / s0);
where s0 is the standard state value, rs is the resource state evaluation value, and pn is the demand parameter;
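Assuming the demand parameter is simply scaled by the ratio of the current resource state evaluation to the standard state (a reconstruction, since the original formula is not legible), the adjustment is:

```python
def actual_allocation(pn, rs, s0=100):
    """Assumed form cd = pn * (rs / s0): scale the demand parameter pn
    by the ratio of the resource state evaluation rs to the standard
    state s0, so a tighter resource state yields a smaller grant."""
    return pn * (rs / s0)

print(actual_allocation(pn=40, rs=80))
```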
Each resource in a node carries two labels: an item label and a task label. The item label indicates that the resource is used for a corresponding training item, and the task label indicates that the resource is allocated to a corresponding subtask; a resource with an item label but no task label is called an idle resource, and a resource with both an item label and a task label is called a special resource;
The adaptive optimization unit comprises an optimization detection processor, a resource change processor and an idle management processor, wherein the optimization detection processor judges whether optimization is needed based on the subtasks' use of their special resources, the resource change processor adjusts and changes the task labels of resources based on the optimization information, and the idle management processor manages the subtasks' use of idle resources;
When a subtask's utilization rate of its special resources continuously exceeds the first threshold for a duration T and idle resources exist at that moment, resource addition optimization is performed; when the utilization rate of the special resources remains below the second threshold for a duration T, resource deletion optimization is performed, where T is the optimization judgment duration;
the resource change processor calculates the added resource quantity cd+ according to the following formula:
cd+ = k × (ua - y1) × cd;
where k is a proportionality coefficient, ua is the average utilization over the duration T, and y1 is the first threshold;
the resource change processor calculates the pruned resource quantity cd- according to the following formula:
cd- = k × (y2 - ua) × cd;
where y2 is the second threshold.
When a subtask's utilization of its special resources temporarily reaches 100%, the idle management processor lends idle resources to the subtask, and these idle resources are released as soon as the subtask finishes using them;
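A minimal sketch of the optimization detection and resource change logic described above. The cd+ and cd- expressions, the coefficient value, the threshold defaults and the representation of the judgment duration T as a list of utilization samples are all assumptions:

```python
def optimize_allocation(samples, cd, y1=0.9, y2=0.3, k=0.5,
                        idle_available=True):
    """Sketch of the optimization detection / resource change step.
    samples: utilization readings covering the judgment duration T.
    Returns the signed change to the special-resource amount cd,
    using the assumed forms cd+ = k*(ua - y1)*cd, cd- = k*(y2 - ua)*cd."""
    ua = sum(samples) / len(samples)            # average utilization over T
    if all(u > y1 for u in samples) and idle_available:
        return k * (ua - y1) * cd               # cd+: add resources
    if all(u < y2 for u in samples):
        return -k * (y2 - ua) * cd              # cd-: prune resources
    return 0.0                                  # no optimization needed

print(optimize_allocation([0.95, 0.97, 0.93], cd=10))   # sustained overload
print(optimize_allocation([0.10, 0.20, 0.15], cd=10))   # sustained underuse
```

Requiring every sample in the window to cross the threshold models the "continuously exceeds ... for a duration T" condition, so a single spike does not trigger a change.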
The data synchronization unit comprises a synchronization retrieval processor and a data transmission processor, wherein the synchronization retrieval processor is used for retrieving the nodes whose data need to be synchronized, and the data transmission processor is used for collating the data to be synchronized and transmitting them to the corresponding nodes;
Part of the system's code is as follows:
# Computing power management module
class ComputeManagementModule:
    def __init__(self):
        self.resource_monitor = ResourceMonitor()
        self.compute_pool_manager = ComputePoolManager()
        self.resource_predictor = ResourcePredictor()

    def get_resource_state(self):
        # Return the current resource status
        return self.resource_monitor.monitor_resources()

# Computing power resource monitoring unit
class ResourceMonitor:
    def monitor_resources(self):
        # Detect the load and availability status of computing resources
        return {"CPU": 70, "GPU": 80, "Memory": 60}

# Computing power pool management unit
class ComputePoolManager:
    def manage_pools(self):
        # Divide resources into resource pools of different grades
        print("Managing compute pools...")

# Estimated resource unit
class ResourcePredictor:
    def predict_resources(self, task_characteristics):
        # Predict the overall computing power demand
        return {"required_CPU": 100, "required_GPU": 200}

# Task analysis module
class TaskAnalysisModule:
    def __init__(self):
        self.task_splitter = TaskSplitter()
        self.demand_analyzer = DemandAnalyzer()
        self.priority_evaluator = PriorityEvaluator()

    def analyze_tasks(self):
        # Analyze task demands and features
        tasks = self.task_splitter.split_task("Main Training Task")
        for task in tasks:
            task["demands"] = self.demand_analyzer.analyze_demands(task)
            task["priority"] = self.priority_evaluator.evaluate_priority(task)
        return tasks

# Task decomposition unit
class TaskSplitter:
    def split_task(self, task):
        # Break the task down into subtasks
        return [{"name": "Subtask1"}, {"name": "Subtask2"}]

# Demand analysis unit
class DemandAnalyzer:
    def analyze_demands(self, task):
        # Generate the computing power demand parameters
        return {"CPU": 50, "GPU": 80}

# Priority evaluation unit
class PriorityEvaluator:
    def evaluate_priority(self, task):
        # Compute the task priority
        return 1

# Adaptive scheduling module
class AdaptiveSchedulingModule:
    def __init__(self):
        self.dynamic_allocator = DynamicAllocator()
        self.strategy_generator = StrategyGenerator()
        self.optimizer = AdaptiveOptimizer()

    def generate_schedule(self, tasks, resource_state):
        # Generate a scheduling policy based on task and resource status
        return self.strategy_generator.generate_strategy(tasks, resource_state)

    def allocate_resources(self, schedule):
        # Dynamically allocate resources
        self.dynamic_allocator.allocate(schedule)

# Dynamic resource allocation unit
class DynamicAllocator:
    def allocate(self, schedule):
        # Allocate computing resources
        print(f"Allocating resources based on schedule: {schedule}")

# Scheduling policy generation unit
class StrategyGenerator:
    def generate_strategy(self, tasks, resource_state):
        # Generate the scheduling policy
        return {"task_allocation": tasks, "resource_state": resource_state}

# Adaptive optimization unit
class AdaptiveOptimizer:
    def optimize(self, task_state, resource_state):
        # Dynamically adjust the computing power allocation
        print("Optimizing resource allocation...")

# Data caching, data synchronization and transmission optimization units
# (stub implementations so the data interaction module below is complete)
class DataCache:
    pass

class DataSynchronizer:
    pass

class TransferOptimizer:
    pass

# Data interaction module
class DataInteractionModule:
    def __init__(self):
        self.data_cache = DataCache()
        self.data_synchronizer = DataSynchronizer()
        self.transfer_optimizer = TransferOptimizer()
The training process was executed on 10 training samples by both this system and an ordinary system, and the training times were measured, yielding the comparison chart shown in FIG. 6.
The foregoing disclosure is only a preferred embodiment of the present invention and is not intended to limit its scope; all equivalent technical changes made according to the description and accompanying drawings of the present invention fall within the scope of the present invention. In addition, elements of the present invention may be updated as the technology develops.