CN119322682B - Self-adaptive calculation power scheduling system for large model training - Google Patents

Self-adaptive calculation power scheduling system for large model training

Info

Publication number
CN119322682B
CN119322682B (application CN202411874201.4A)
Authority
CN
China
Prior art keywords
resource
unit
task
processor
computing power
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202411874201.4A
Other languages
Chinese (zh)
Other versions
CN119322682A (en)
Inventor
张卫平
杨淦
梁昊星
刘安
陈静婷
李玲
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Global Numerical Technology Co ltd
Original Assignee
Global Numerical Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Global Numerical Technology Co ltd
Priority to CN202411874201.4A (patent CN119322682B/en)
Publication of CN119322682A (patent CN119322682A/en)
Application granted
Publication of CN119322682B (patent CN119322682B/en)
Legal status: Active (current)
Anticipated expiration

Links

Classifications

Landscapes

Abstract

Translated from Chinese


The present invention provides an adaptive computing power scheduling system for large model training, which relates to the field of electrical digital data processing, including a computing power management module, a task analysis module, an adaptive scheduling module and a data interaction module. The computing power management module is responsible for the management and monitoring of the overall computing power resources, the task analysis module is responsible for analyzing the needs and characteristics of training tasks, the adaptive scheduling module performs computing power scheduling according to real-time resource conditions and task requirements, and the data interaction module is used to manage data interaction between different computing power nodes. The system can effectively improve the training efficiency by dispatching training tasks to different nodes and adjusting the allocated resources during the execution of tasks.

Description

Self-adaptive calculation power scheduling system for large model training
Technical Field
The invention relates to the field of electric digital data processing, in particular to a self-adaptive calculation power scheduling system for large model training.
Background
In the fields of artificial intelligence and deep learning, large models have become an important tool for solving complex tasks. Such models are characterized by enormous parameter counts and complex network structures; they usually require massive data for training, and their demand for computing resources is extremely severe. As model scale continues to grow and computing demands rise further, the traditional static resource allocation mode can hardly meet the needs of large model training. What is needed is an adaptive scheduling system that can dynamically adjust resource allocation according to task demands, so as to optimize the utilization of computing power resources, improve task execution efficiency, and guarantee the robustness and flexibility of the system.
Many computing power scheduling systems have been developed; an extensive search turned up an existing system disclosed in publication CN117873734B. Methods of this kind generally work as follows: the number of GPU cards each algorithm engineer is permitted to use in a distributed training cluster is allocated in advance and stored as GPU-card allocation information; when a model training task is created, the stored allocation information and the current GPU-card usage in the cluster are read to determine how many GPU cards the engineer may select; once the GPU cards required for the training task are selected, training jobs are scheduled onto them; and whether to reduce the number of GPU cards in the training task is judged according to the number selected before a new training task is started. However, such a system cannot adjust the allocated resources during training, so the utilization rate of computing resources is low and overall training efficiency needs improvement.
Disclosure of Invention
Aiming at the above defects, the invention provides an adaptive computing power scheduling system for large model training.
The invention adopts the following technical scheme:
an adaptive computing power dispatching system for large model training comprises a computing power management module, a task analysis module, an adaptive dispatching module and a data interaction module;
The computing power management module is used for managing and monitoring the overall computing power resources, the task analysis module is used for analyzing the requirements and characteristics of training tasks, the self-adaptive scheduling module performs computing power scheduling according to real-time resource conditions and task requirements, and the data interaction module is used for managing data interaction among different computing power nodes;
The computing power management module comprises a computing power resource monitoring unit, a computing power pool management unit and a pre-estimated resource unit, wherein the computing power resource monitoring unit is used for detecting the load and the available state of computing power resources, the computing power pool management unit is used for dividing the available computing power into resource pools with different grades, and the pre-estimated resource unit predicts the required overall computing power resources based on the training characteristics of a large model;
The task analysis module comprises a task decomposition unit, a demand analysis unit and a priority evaluation unit, wherein the task decomposition unit is used for decomposing a large model training task into a plurality of subtasks and identifying the dependency relationship of the task, the demand analysis unit is used for analyzing the demand of each subtask and generating a calculation power demand parameter, and the priority evaluation unit is used for distributing priority to the subtasks;
The self-adaptive scheduling module comprises a dynamic resource allocation unit, a scheduling strategy generation unit and a self-adaptive optimization unit, wherein the dynamic resource allocation unit allocates proper computing power resources to each subtask based on a scheduling strategy, the scheduling strategy generation unit generates a scheduling strategy based on the priority of the task, the demand parameters and the resource state, and the self-adaptive optimization unit is used for adjusting the computing power allocation in the task execution process;
the data interaction module comprises a data caching unit, a data synchronizing unit and a transmission optimizing unit, wherein the data caching unit is used for caching common data and intermediate results, the data synchronizing unit is used for ensuring consistency and synchronism of the data among all nodes, and the transmission optimizing unit is used for selecting an optimal data transmission path and mode.
Further, the dynamic resource allocation unit comprises a policy management processor, a node selection processor and a task allocation processor, wherein the policy management processor is used for receiving scheduling policies, the node selection processor is used for selecting corresponding resource nodes for subtasks in each scheduling policy, and the task allocation processor is used for sorting task data and then sending the task data to the corresponding resource nodes.
Further, the process of selecting the resource node by the node selection processor includes the following steps:
S1. Screen the nodes whose resource amount is larger than the actual resource allocation amount in the scheduling strategy; these are called candidate nodes.
S2. Calculate the data dependency index do of each candidate node according to the following formula:

do = (dh / da) × (ch / cd)

where ch is the remaining resource amount of the node, cd is the actual resource allocation amount of the subtask, dh is the amount of necessary data already held by the node, and da is the total amount of necessary data of the subtask.
S3. Select the node with the largest data dependency index as the node allocated resources to execute the subtask.
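The three-step selection above can be sketched as follows. The multiplicative form of the dependency index, combining data locality (dh/da) with spare capacity (ch/cd), is an assumption for illustration, as the patent's formula image is not reproduced here; field names are likewise illustrative.

```python
def select_node(nodes, cd, da):
    """Pick the node with the largest data dependency index.

    nodes: list of dicts with 'ch' (remaining resource amount) and
    'dh' (amount of required data already held by the node).
    cd: actual resource allocation amount of the subtask.
    da: total amount of necessary data of the subtask.
    """
    # S1: keep only nodes whose remaining resources exceed the allocation
    candidates = [n for n in nodes if n["ch"] > cd]
    if not candidates:
        return None

    # S2: score each candidate by the assumed data dependency index
    def do_index(n):
        return (n["dh"] / da) * (n["ch"] / cd)

    # S3: the node with the largest index wins
    return max(candidates, key=do_index)
```

The screening step guarantees ch > cd for every scored node, so both ratios are well defined and favor nodes that already hold most of the subtask's input data.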
Further, the self-adaptive optimizing unit comprises an optimizing detection processor, a resource changing processor and an idle management processor, wherein the optimizing detection processor judges whether to optimize based on the use condition of the subtask on the special resource, the resource changing processor adjusts and changes the task label of the resource based on optimizing information, and the idle management processor is used for managing the use of the subtask on the idle resource.
Further, when the subtask's utilization rate of the special resources continuously exceeds the first threshold for a duration T and idle resources exist at that moment, resource-addition optimization is performed; when the subtask's utilization rate of the special resources stays below the second threshold for a duration T, resource-deletion optimization is performed, where T is the optimization judgment duration;
the resource change processor calculates the added resource quantity cd+ according to the following formula:

cd+ = cd × β × (ū − y1)

where β is the proportion base, ū is the average utilization over time T, and y1 is the first threshold;
the resource change processor calculates the deleted resource quantity cd− according to the following formula:

cd− = cd × β × (y2 − ū)

where y2 is the second threshold.
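The threshold-driven optimization just described can be sketched as follows. The linear add/delete adjustment formulas, the parameter names, and the idle-availability flag are illustrative assumptions rather than the patent's exact implementation.

```python
def optimize_allocation(cd, avg_util, y1, y2, beta, idle_available):
    """Return the adjusted allocation for a subtask's special resources.

    cd: current allocation, avg_util: mean utilization over the judgment
    window T, y1/y2: upper/lower thresholds, beta: proportion base,
    idle_available: whether idle resources exist to draw from.
    """
    if avg_util > y1 and idle_available:
        # Sustained overload: grow the allocation from the idle pool
        return cd + cd * beta * (avg_util - y1)
    if avg_util < y2:
        # Sustained underuse: shrink the allocation
        return cd - cd * beta * (y2 - avg_util)
    # Utilization within bounds: leave the allocation unchanged
    return cd
```

Scaling the adjustment by how far utilization overshoots (or undershoots) the threshold keeps small deviations from triggering large reallocations.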
The beneficial effects obtained by the invention are as follows:
The system tracks the state of every computing power node in real time through the computing power resource monitoring unit and dynamically adjusts resource allocation, avoiding resource overload or idleness. The computing power pool management unit dynamically allocates suitable resource types according to task demands, achieving fine-grained resource management, while the task analysis module identifies critical paths and task priorities through the demand analysis unit and the priority evaluation unit.
For a further understanding of the nature and the technical aspects of the present invention, reference should be made to the following detailed description of the invention and the accompanying drawings, which are provided for purposes of reference only and are not intended to limit the invention.
Drawings
FIG. 1 is a schematic diagram of the overall structural framework of the present invention;
FIG. 2 is a schematic diagram of the computing power management module according to the present invention;
FIG. 3 is a schematic diagram of a task analysis module according to the present invention;
FIG. 4 is a schematic diagram of an adaptive scheduling module according to the present invention;
FIG. 5 is a schematic diagram of a data interaction module according to the present invention;
FIG. 6 is a graph showing the comparative effects of the training test of the present invention.
Detailed Description
The following embodiments of the present invention are described in terms of specific examples, and those skilled in the art will appreciate the advantages and effects of the present invention from the disclosure herein. The invention is capable of other and different embodiments and its several details are capable of modification and variation in various respects, all without departing from the spirit of the present invention. The drawings of the present invention are merely schematic illustrations, and are not intended to be drawn to actual dimensions. The following embodiments will further illustrate the related art content of the present invention in detail, but the disclosure is not intended to limit the scope of the present invention.
The embodiment provides an adaptive computing power scheduling system for large model training, which comprises a computing power management module, a task analysis module, an adaptive scheduling module and a data interaction module, wherein the adaptive computing power scheduling system is combined with FIG. 1;
The computing power management module is used for managing and monitoring the overall computing power resources, the task analysis module is used for analyzing the requirements and characteristics of training tasks, the self-adaptive scheduling module performs computing power scheduling according to real-time resource conditions and task requirements, and the data interaction module is used for managing data interaction among different computing power nodes;
The computing power management module comprises a computing power resource monitoring unit, a computing power pool management unit and a pre-estimated resource unit, wherein the computing power resource monitoring unit is used for detecting the load and the available state of computing power resources, the computing power pool management unit is used for dividing the available computing power into resource pools with different grades, and the pre-estimated resource unit predicts the required overall computing power resources based on the training characteristics of a large model;
The task analysis module comprises a task decomposition unit, a demand analysis unit and a priority evaluation unit, wherein the task decomposition unit is used for decomposing a large model training task into a plurality of subtasks and identifying the dependency relationship of the task, the demand analysis unit is used for analyzing the demand of each subtask and generating a calculation power demand parameter, and the priority evaluation unit is used for distributing priority to the subtasks;
The self-adaptive scheduling module comprises a dynamic resource allocation unit, a scheduling strategy generation unit and a self-adaptive optimization unit, wherein the dynamic resource allocation unit allocates proper computing power resources to each subtask based on a scheduling strategy, the scheduling strategy generation unit generates a scheduling strategy based on the priority of the task, the demand parameters and the resource state, and the self-adaptive optimization unit is used for adjusting the computing power allocation in the task execution process;
the data interaction module comprises a data caching unit, a data synchronizing unit and a transmission optimizing unit, wherein the data caching unit is used for caching common data and intermediate results, the data synchronizing unit is used for ensuring consistency and synchronism of the data among all nodes, and the transmission optimizing unit is used for selecting an optimal data transmission path and mode.
The dynamic resource allocation unit comprises a strategy management processor, a node selection processor and a task allocation processor, wherein the strategy management processor is used for receiving scheduling strategies, the node selection processor is used for selecting corresponding resource nodes for subtasks in each scheduling strategy, and the task allocation processor is used for sorting task data and then sending the task data to the corresponding resource nodes.
The process of selecting resource nodes by the node selection processor comprises the following steps:
S1. Screen the nodes whose resource amount is larger than the actual resource allocation amount in the scheduling strategy; these are called candidate nodes.
S2. Calculate the data dependency index do of each candidate node according to the following formula:

do = (dh / da) × (ch / cd)

where ch is the remaining resource amount of the node, cd is the actual resource allocation amount of the subtask, dh is the amount of necessary data already held by the node, and da is the total amount of necessary data of the subtask.
S3. Select the node with the largest data dependency index as the node allocated resources to execute the subtask.
The self-adaptive optimizing unit comprises an optimizing detection processor, a resource changing processor and an idle management processor, wherein the optimizing detection processor judges whether to optimize special resources based on the use condition of the subtasks, the resource changing processor adjusts and changes task labels of the resources based on optimizing information, and the idle management processor is used for managing the use of the idle resources by the subtasks.
When the subtask's utilization rate of the special resources continuously exceeds the first threshold for a duration T and idle resources exist at that moment, resource-addition optimization is performed; when the subtask's utilization rate of the special resources stays below the second threshold for a duration T, resource-deletion optimization is performed, where T is the optimization judgment duration;
the resource change processor calculates the added resource quantity cd+ according to the following formula:

cd+ = cd × β × (ū − y1)

where β is the proportion base, ū is the average utilization over time T, and y1 is the first threshold;
the resource change processor calculates the deleted resource quantity cd− according to the following formula:

cd− = cd × β × (y2 − ū)

where y2 is the second threshold.
The second embodiment comprises the whole content of the first embodiment, and provides an adaptive computing power dispatching system for large model training, which comprises a computing power management module, a task analysis module, an adaptive dispatching module and a data interaction module;
The computing power management module is used for managing and monitoring the overall computing power resources, the task analysis module is used for analyzing the requirements and characteristics of training tasks, the self-adaptive scheduling module performs computing power scheduling according to real-time resource conditions and task requirements, and the data interaction module is used for managing data interaction among different computing power nodes;
Referring to fig. 2, the computing power management module includes a computing power resource monitoring unit, a computing power pool management unit, and an estimated resource unit, where the computing power resource monitoring unit is configured to detect the load and available state of computing power resources, the computing power pool management unit is configured to divide the available computing power into resource pools of different grades, and the estimated resource unit predicts the required overall computing power resources based on large model training characteristics;
Referring to fig. 3, the task analysis module includes a task decomposition unit, a demand analysis unit and a priority evaluation unit, where the task decomposition unit is configured to decompose a large model training task into a plurality of subtasks and identify dependency relationships of the tasks, the demand analysis unit is configured to analyze demands of each subtask and generate a calculation power demand parameter, and the priority evaluation unit is configured to assign priorities to the subtasks;
referring to fig. 4, the adaptive scheduling module includes a dynamic resource allocation unit, a scheduling policy generation unit and an adaptive optimization unit, where the dynamic resource allocation unit allocates an appropriate computing power resource to each subtask based on a scheduling policy, the scheduling policy generation unit generates a scheduling policy based on a demand parameter and a resource state of a task, and the adaptive optimization unit is used to adjust computing power allocation in a task execution process;
Referring to fig. 5, the data interaction module includes a data caching unit, a data synchronization unit and a transmission optimization unit, where the data caching unit is used to cache common data and intermediate results, the data synchronization unit is used to ensure consistency and synchronization of data between nodes, and the transmission optimization unit is used to select an optimal data transmission path and mode;
The resource monitoring unit comprises a node state monitoring processor, a performance evaluation processor and a feedback transmission processor, wherein the node state monitoring processor is used for collecting resource state data of each node, the performance evaluation processor is used for performing performance evaluation on the collected resource state, and the feedback transmission processor is used for feeding monitoring information back to the computing pool management unit;
The computing power pool management unit comprises a monitoring receiving processor, a grading statistics processor and an information recording processor, wherein the monitoring receiving processor is used for receiving monitoring information of each node, the grading statistics processor is used for dividing computing power resources into different resource pools for statistics based on performance evaluation, and the information recording processor is used for recording resource information of each resource pool;
The estimated resource unit comprises a history training register, a comparison estimated processor and an calculation pool allocation processor, wherein the history training register is used for storing case information of actual training, the comparison estimated processor is used for comparing the characteristics of the large model and the case information and determining resources required by the whole training, and the calculation pool allocation processor is used for carrying out marking allocation on corresponding resources in the calculation pool based on the estimated resources;
The task decomposition unit comprises a model disassembly processor, a task coding processor and a task management processor, wherein the model disassembly processor is used for disassembling a training model into a plurality of subtasks, the task coding processor is used for carrying out task coding based on the dependency relationship of the subtasks, and the task management processor is used for storing the detailed information of each subtask and managing the assignment of the tasks;
The demand analysis unit comprises a characteristic extraction processor, a resource mapping processor and a parameter feedback processor, wherein the characteristic extraction processor is used for extracting characteristic information of a subtask, the resource mapping processor is used for mapping to obtain a resource type based on the characteristic information and rule information, and the parameter feedback processor is used for generating parameters of corresponding type resources based on the characteristic information and feeding back the parameters to the task management processor;
The priority evaluation unit comprises a time sensitive evaluation processor, a resource occupation evaluation processor and a priority calculation processor, wherein the time sensitive evaluation processor is used for evaluating the time sensitivity of the subtasks, the resource occupation evaluation processor is used for evaluating the resource occupation of the subtasks, and the priority calculation processor is used for calculating the priority information of the subtasks based on the evaluation result;
The priority calculation processor calculates the priority P of each subtask according to the following formula:

P = w1 × T + w2 × R

where T is the time sensitivity value, R is the resource occupation value, w1 is the time coefficient, and w2 is the resource coefficient;
The task management processor divides the subtasks into several batches based on their task codes, sorts the subtasks within the same batch by priority, sends them to the self-adaptive scheduling module in that order, and processes the next batch only after the current batch is finished;
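The batching-and-priority logic above can be sketched as follows, assuming each subtask record carries a dependency-level task code `code`, a time sensitivity `T`, and a resource occupation `R` (the weighted-sum priority and all field names are illustrative assumptions).

```python
from collections import defaultdict

def schedule_batches(subtasks, w1, w2):
    """Group subtasks into batches by task code, then order each batch
    by priority P = w1*T + w2*R, highest first."""
    batches = defaultdict(list)
    for t in subtasks:
        # Subtasks sharing a task code form one batch
        batches[t["code"]].append(t)
    ordered = []
    for level in sorted(batches):
        # Within a batch, higher-priority subtasks are dispatched first
        batch = sorted(batches[level],
                       key=lambda t: w1 * t["T"] + w2 * t["R"],
                       reverse=True)
        ordered.append(batch)
    return ordered
```

Each inner list is dispatched in order, and the next batch starts only after the previous one completes, mirroring the dependency-respecting flow described above.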
The dynamic resource allocation unit comprises a strategy management processor, a node selection processor and a task allocation processor, wherein the strategy management processor is used for receiving scheduling strategies, the node selection processor is used for selecting corresponding resource nodes for subtasks in each scheduling strategy, and the task allocation processor is used for sorting task data and then sending the task data to the corresponding resource nodes;
The process of selecting resource nodes by the node selection processor comprises the following steps:
S1. Screen the nodes whose resource amount is larger than the actual resource allocation amount in the scheduling strategy; these are called candidate nodes.
S2. Calculate the data dependency index do of each candidate node according to the following formula:

do = (dh / da) × (ch / cd)

where ch is the remaining resource amount of the node, cd is the actual resource allocation amount of the subtask, dh is the amount of necessary data already held by the node, and da is the total amount of necessary data of the subtask.
The necessary data of a subtask are data generated by its preceding subtasks; they are distributed across different nodes, and the subtask can execute only after the data have been synchronized to the node where it resides.
S3. Select the node with the largest data dependency index as the node allocated resources to execute the subtask;
The scheduling policy generation unit comprises a state evaluation processor, a resource adjustment processor and a policy output processor, wherein the state evaluation processor is used for evaluating the current resource state, the resource adjustment processor is used for adjusting the demand parameters based on the resource state evaluation result to obtain the actual resource allocation amount, and the policy output processor is used for sending policy information based on the actual resource allocation amount to the dynamic resource allocation unit;
The resource adjustment processor calculates the actual resource allocation amount cd according to the following formula:

cd = pn × (rs / s0)

where s0 is the standard state, rs is the resource state evaluation value, and pn is the demand parameter;
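A one-line helper makes the state-scaled adjustment concrete; the proportional form below, scaling the demand parameter by the resource-state evaluation relative to the standard state, is an assumption for illustration.

```python
def actual_allocation(pn, rs, s0):
    """Scale the demand parameter pn by the resource state evaluation rs
    relative to the standard state s0 (proportional form is assumed)."""
    return pn * rs / s0
```

Under this form, a resource state below the standard shrinks the allocation proportionally, while a healthier-than-standard state grows it.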
Each resource in a node carries two labels: an item label and a task label. The item label indicates that the resource is used for the corresponding training item, while the task label indicates that the resource is allocated to the corresponding subtask. A resource with an item label but no task label is called an idle resource, and a resource with both an item label and a task label is called a special resource;
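The two-label model can be sketched with a small class; the class shape and the helper function are illustrative assumptions, not the patent's implementation.

```python
class Resource:
    """A resource carrying an item (training project) label and an
    optional task label. No task label means the resource is idle;
    both labels mean it is a special (dedicated) resource."""

    def __init__(self, item_label, task_label=None):
        self.item_label = item_label
        self.task_label = task_label

    @property
    def is_idle(self):
        # Idle: reserved for a project but not bound to any subtask
        return self.task_label is None


def idle_resources(pool, item):
    """Return the idle resources belonging to a given training item."""
    return [r for r in pool if r.item_label == item and r.is_idle]
```

Adding a resource to a subtask then amounts to setting its task label, and deleting it amounts to clearing the label back to idle.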
The self-adaptive optimizing unit comprises an optimizing detection processor, a resource changing processor and an idle management processor, wherein the optimizing detection processor judges whether to optimize special resources based on the use condition of the subtasks, the resource changing processor adjusts and changes task labels of the resources based on optimizing information, and the idle management processor is used for managing the use of idle resources by the subtasks;
When the subtask's utilization rate of the special resources continuously exceeds the first threshold for a duration T and idle resources exist at that moment, resource-addition optimization is performed; when the subtask's utilization rate of the special resources stays below the second threshold for a duration T, resource-deletion optimization is performed, where T is the optimization judgment duration;
the resource change processor calculates the added resource quantity cd+ according to the following formula:

cd+ = cd × β × (ū − y1)

where β is the proportion base, ū is the average utilization over time T, and y1 is the first threshold;
the resource change processor calculates the deleted resource quantity cd− according to the following formula:

cd− = cd × β × (y2 − ū)

where y2 is the second threshold.
When a subtask's utilization rate of its special resources temporarily reaches 100%, the idle management processor lends idle resources to the subtask, and the idle resources are released immediately after the subtask finishes using them;
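A minimal sketch of this borrow-and-release behaviour follows, with the 100% trigger and a list-based idle pool as illustrative assumptions.

```python
def handle_saturation(utilization, idle_pool, borrowed):
    """Lend an idle resource to a saturated subtask and release all
    borrowed resources as soon as utilization drops below 100%.

    idle_pool and borrowed are lists mutated in place.
    """
    if utilization >= 1.0 and idle_pool:
        # Dedicated resources are saturated: borrow one idle resource
        borrowed.append(idle_pool.pop())
    elif utilization < 1.0 and borrowed:
        # Pressure is gone: return everything to the idle pool at once
        idle_pool.extend(borrowed)
        borrowed.clear()
    return idle_pool, borrowed
```

Because releases happen immediately on the first sub-100% reading, a borrowed resource never stays bound to a subtask longer than the transient spike that triggered the loan.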
The data synchronization unit comprises a synchronization retrieval processor and a data transmission processor, wherein the synchronization retrieval processor is used for retrieving nodes of which the data needs to be synchronized, and the data transmission processor is used for sorting the data needing to be synchronized and then transmitting the data to the corresponding nodes;
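The synchronization retrieval step can be sketched as follows, assuming data placement is tracked as a mapping from node to the set of data items it holds (the mapping layout and names are assumptions).

```python
def nodes_needing_sync(data_locations, required_data, target_node):
    """Retrieve, per source node, the required data items that are not
    yet present on the target node and must be transmitted to it.

    data_locations: {node: set of data ids held by that node}.
    required_data: set of data ids the subtask needs.
    """
    have = data_locations.get(target_node, set())
    missing = required_data - have
    # Keep only source nodes that actually hold some of the missing data
    return {node: ids & missing
            for node, ids in data_locations.items()
            if node != target_node and ids & missing}
```

The data transmission processor can then sort each returned set and ship it to the target node, completing the synchronization before the subtask starts.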
part of the code of the system is as follows:
# calculation management module
class ComputeManagementModule:
def __init__(self):
self.resource_monitor = ResourceMonitor()
self.compute_pool_manager = ComputePoolManager()
self.resource_predictor = ResourcePredictor()
def get_resource_state(self):
Return resource status #
return self.resource_monitor.monitor_resources()
# Computing power resource monitoring unit
class ResourceMonitor:
def monitor_resources(self):
# Detecting load and availability status of computational resources
return {"CPU": 70, "GPU": 80, "Memory": 60}
Management unit for # calculation force pool
class ComputePoolManager:
def manage_pools(self):
# Dividing resources into resource pools of different classes
print("Managing compute pools...")
# Estimated resource unit
class ResourcePredictor:
def predict_resources(self, task_characteristics):
# Forecast overall calculation force demand
return {"required_CPU": 100, "required_GPU": 200}
Task analysis module
class TaskAnalysisModule:
def __init__(self):
self.task_splitter = TaskSplitter()
self.demand_analyzer = DemandAnalyzer()
self.priority_evaluator = PriorityEvaluator()
def analyze_tasks(self):
Task demand and feature analysis
tasks = self.task_splitter.split_task("Main Training Task")
for task in tasks:
task["demands"] = self.demand_analyzer.analyze_demands(task)
task["priority"] = self.priority_evaluator.evaluate_priority(task)
return tasks
Task decomposition unit
class TaskSplitter:
def split_task(self, task):
# Break down task into subtasks
return [{"name": "Subtask1"}, {"name": "Subtask2"}]
# Demand analysis unit
class DemandAnalyzer:
def analyze_demands(self, task):
# Generating the calculation force demand parameter
return {"CPU": 50, "GPU": 80}
# Priority assessment unit
class PriorityEvaluator:
def evaluate_priority(self, task):
Computing task priority
return 1
# Self-adaptive scheduling module
class AdaptiveSchedulingModule:
    def __init__(self):
        self.dynamic_allocator = DynamicAllocator()
        self.strategy_generator = StrategyGenerator()
        self.optimizer = AdaptiveOptimizer()

    def generate_schedule(self, tasks, resource_state):
        # Generate a scheduling policy from the task and resource status
        return self.strategy_generator.generate_strategy(tasks, resource_state)

    def allocate_resources(self, schedule):
        # Dynamically allocate resources
        self.dynamic_allocator.allocate(schedule)

# Dynamic resource allocation unit
class DynamicAllocator:
    def allocate(self, schedule):
        # Allocate the computing resources
        print(f"Allocating resources based on schedule: {schedule}")

# Scheduling policy generation unit
class StrategyGenerator:
    def generate_strategy(self, tasks, resource_state):
        # Generate the scheduling policy
        return {"task_allocation": tasks, "resource_state": resource_state}

# Self-adaptive optimization unit
class AdaptiveOptimizer:
    def optimize(self, task_state, resource_state):
        # Dynamically adjust the computing power allocation
        print("Optimizing resource allocation...")
# Data interaction module
class DataInteractionModule:
    def __init__(self):
        self.data_cache = DataCache()
        self.data_synchronizer = DataSynchronizer()
        self.transfer_optimizer = TransferOptimizer()

# Minimal placeholder units so the module above can be instantiated;
# their internals are not shown in this excerpt.
class DataCache:
    pass

class DataSynchronizer:
    pass

class TransferOptimizer:
    pass
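The modules above can be wired into an end-to-end flow: monitored resource state plus analyzed tasks feed the policy generator, and the resulting schedule drives allocation. This is a standalone sketch; the classes are restated minimally from the listing, the task list stands in for the task analysis module's output, and the values are the listing's illustrative constants.

```python
# End-to-end sketch: resource monitoring -> scheduling policy -> allocation.

class ResourceMonitor:
    def monitor_resources(self):
        # Detect the load of each resource class
        return {"CPU": 70, "GPU": 80, "Memory": 60}

class StrategyGenerator:
    def generate_strategy(self, tasks, resource_state):
        # Combine task demands and resource status into a policy
        return {"task_allocation": tasks, "resource_state": resource_state}

class DynamicAllocator:
    def allocate(self, schedule):
        # Apply the generated policy to the compute nodes
        print(f"Allocating resources based on schedule: {schedule}")

# Stand-in for TaskAnalysisModule output (same shape as the listing produces)
tasks = [{"name": "Subtask1", "demands": {"CPU": 50, "GPU": 80}, "priority": 1}]

resource_state = ResourceMonitor().monitor_resources()
schedule = StrategyGenerator().generate_strategy(tasks, resource_state)
DynamicAllocator().allocate(schedule)
```

During training, re-running the monitoring and policy-generation steps lets the allocator adjust resources as the load changes, which is the adaptive behavior the system claims.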
The training process was executed on 10 training samples by both the present system and a conventional system, and the training times measured in these tests yield the comparison chart shown in fig. 6.
The foregoing disclosure is only a preferred embodiment of the present invention and is not intended to limit its scope; all equivalent technical changes made according to the description and the accompanying drawings fall within the scope of the present invention, and elements of the invention may be updated as the technology develops.

Claims (3)

CN202411874201.4A | 2024-12-19 | 2024-12-19 | Self-adaptive calculation power scheduling system for large model training | Active | CN119322682B (en)

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN202411874201.4A (CN119322682B (en)) | 2024-12-19 | 2024-12-19 | Self-adaptive calculation power scheduling system for large model training

Applications Claiming Priority (1)

Application Number | Priority Date | Filing Date | Title
CN202411874201.4A (CN119322682B (en)) | 2024-12-19 | 2024-12-19 | Self-adaptive calculation power scheduling system for large model training

Publications (2)

Publication Number | Publication Date
CN119322682A (en) | 2025-01-17
CN119322682B (en) | 2025-03-18

Family

ID=94228913

Family Applications (1)

Application Number | Title | Priority Date | Filing Date
CN202411874201.4A (Active; CN119322682B (en)) | Self-adaptive calculation power scheduling system for large model training | 2024-12-19 | 2024-12-19

Country Status (1)

Country | Link
CN (1) | CN119322682B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN119537032B (en)* | 2025-01-21 | 2025-05-20 | 北京亿安天下科技股份有限公司 | Large model reasoning scheduling method based on off-network computing power server
CN119718688A (en)* | 2025-02-28 | 2025-03-28 | 北京涵鑫盛科技有限公司 | Cluster load balancing processing method based on cloud computing
CN120448134B (en)* | 2025-07-08 | 2025-09-26 | 新立讯科技集团股份有限公司 | GPU heterogeneous cluster scheduling method and system for large model training and reasoning

Citations (2)

Publication number | Priority date | Publication date | Assignee | Title
CN117349026A (en)* | 2023-12-04 | 2024-01-05 | 环球数科集团有限公司 | A distributed computing power scheduling system for AIGC model training
CN118550711A (en)* | 2024-07-29 | 2024-08-27 | 广脉科技股份有限公司 | Method and system for improving calculation efficiency

Family Cites Families (2)

Publication number | Priority date | Publication date | Assignee | Title
CN117519944A (en)* | 2023-12-05 | 2024-02-06 | 骆佳喜 | An unmanned vehicle and its collaboration method based on computing power awareness and edge cloud computing collaboration
CN118656198A (en)* | 2024-02-27 | 2024-09-17 | 马上消费金融股份有限公司 | Data processing method, device, electronic device and storage medium

Patent Citations (2)

Publication number | Priority date | Publication date | Assignee | Title
CN117349026A (en)* | 2023-12-04 | 2024-01-05 | 环球数科集团有限公司 | A distributed computing power scheduling system for AIGC model training
CN118550711A (en)* | 2024-07-29 | 2024-08-27 | 广脉科技股份有限公司 | Method and system for improving calculation efficiency

Also Published As

Publication number | Publication date
CN119322682A (en) | 2025-01-17

Similar Documents

Publication | Title
CN119322682B (en) | Self-adaptive calculation power scheduling system for large model training
CN117472587B (en) | Resource scheduling system of AI intelligent computation center
CN119576507A (en) | AI-based automatic optimization method and system for big data distributed computing tasks
CN118672790B (en) | A method and system for summarizing massive data based on task chain and divide-and-conquer method
Bommala et al. | Machine learning job failure analysis and prediction model for the cloud environment
CN118502918A (en) | Workflow intelligent management system based on artificial intelligence
CN117349026A (en) | A distributed computing power scheduling system for AIGC model training
US20230376800A1 (en) | Predicting runtime variation in big data analytics
CN115543577A (en) | Covariate-based Kubernetes resource scheduling optimization method, storage medium and equipment
CN118245234B (en) | Distributed load balancing method and system based on cloud computing
CN119739535A (en) | Computing power resource scheduling method, system and storage medium
CN119892905B (en) | Micro-service scheduling method, device, equipment and storage medium under k8s cluster
CN118550654A (en) | Knowledge graph-based simulation resource scheduling method
CN119225933A (en) | A process scheduling method, device, equipment, medium and computer program product
CN119356824B (en) | Reinforced learning-based computing power scheduling strategy optimization system
CN118245205A (en) | Hybrid multi-cloud resource scheduling method based on GPT technology and genetic optimization
CN119806827A (en) | An application resource adaptive allocation management system for concentrators
Tripathi et al. | Workload shifting based on low carbon intensity periods: A framework for reducing carbon emissions in cloud computing
CN118784597A (en) | A scheduling method and system based on Kubernetes in a cross-domain environment
CN118092817A (en) | Intelligent management method and system for space of tablet personal computer
CN117749832A (en) | Internet of things equipment management method and system combining block chains
De Mello et al. | A new migration model based on the evaluation of processes load and lifetime on heterogeneous computing environments
CN119759544B (en) | Resource scheduling method, device and computer equipment for power range simulation system
CN118409974B (en) | Optimization method of reverse hotel Ai intelligent robbery list platform based on big data analysis
CN118467178B (en) | Implementation method of self-service settlement system based on digital RMB

Legal Events

Date | Code | Title | Description
PB01 | Publication
SE01 | Entry into force of request for substantive examination
GR01 | Patent grant
