CN115756789B - GPU scheduling optimization method for deep learning reasoning service system - Google Patents

GPU scheduling optimization method for deep learning reasoning service system

Info

Publication number
CN115756789B
Authority
CN
China
Prior art keywords
model
time
throughput
models
scheduling
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202211456890.8A
Other languages
Chinese (zh)
Other versions
CN115756789A (en)
Inventor
彭亚琼
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hunan University
Original Assignee
Hunan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hunan University
Priority to CN202211456890.8A
Publication of CN115756789A
Application granted
Publication of CN115756789B
Active (current legal status)
Anticipated expiration

Abstract


The present invention discloses a GPU scheduling optimization method for a deep learning reasoning service system, including: initializing the deep learning reasoning service system; assigning all models contained in the initialized system to prediction threads; starting the prediction threads, which periodically execute a throughput demand prediction procedure and predict the throughput demand of their assigned models in the new cycle; and starting scheduling threads which, using the obtained predicted throughput demands, execute a throughput adjustment procedure based on a feedback control strategy at each scheduling moment to optimize the actual throughput allocation of each model in the system. The present invention can dynamically predict the throughput of each model service when it monopolizes the GPU, effectively adapting to complex and changeable workloads; it satisfies the different latency and throughput requirements of the model requests deployed on the same server and remedies the deficiencies of the task scheduling strategies in existing model serving systems.

Description

GPU scheduling optimization method for deep learning reasoning service system
Technical Field
The invention belongs to the technical fields of computer architecture and artificial intelligence, and particularly relates to a GPU scheduling optimization method for a deep learning reasoning service system.
Background
With the explosive growth of data, improvements in core algorithms, and increases in hardware computing power, deep learning has brought immeasurable value to the development and application of artificial intelligence and is widely applied in fields such as computer vision, speech recognition, and natural language processing. Deep learning mainly includes two stages: training and reasoning. The training stage is the process of continuously adjusting the weights of a deep neural network (Deep Neural Networks, DNN) with an optimization algorithm according to the input data, i.e. the process of constructing a model. Because of the large amount of input data and the large number of DNN weight parameters, training these DNN models typically requires significant computational effort and takes several hours to days to complete.
Hardware accelerators such as the GPU (Graphics Processing Unit) and TPU (Tensor Processing Unit) have become the mainstream hardware for accelerating DNN model training by virtue of their powerful ability to perform simple repetitive operations on data. After a DNN model is trained, it can make predictions on input data, which is the reasoning stage. Online reasoning service for DNN models is an important practical application scenario of deep learning: DNN models are deployed in cloud data centers to provide reasoning services for tenants, so that mobile devices with limited computing resources can also support deep learning applications through such services. In the early stage, reasoning tasks were mostly deployed on the CPU (Central Processing Unit). As DNN models grow, it becomes difficult to meet end-to-end real-time requirements (latency below 100 ms) by running reasoning tasks on the CPU. Therefore, a great deal of current work favors using GPUs to accelerate DNN reasoning tasks.
In order to improve the utilization of GPU resources, a current approach is to deploy multiple deep learning models on the same GPU server. In this scenario, requirements such as low reasoning latency and performance isolation among multiple tenants pose new challenges for the task scheduling mechanism in a shared GPU environment. Currently, task scheduling in a shared GPU environment is mainly classified into spatial sharing and time sharing. Spatial sharing, represented by NVIDIA Multi-Process Service (MPS), allows the GPU to run multiple tasks at any time and effectively improves the resource utilization of the reasoning service system. However, under a spatial sharing strategy, the performance of concurrently running tasks is highly uncertain, and performance isolation and real-time requirements among multiple tenants cannot be guaranteed. Under a time sharing strategy, the GPU runs only a single reasoning task in any scheduling unit, so the parallel execution capability of the GPU cores is not fully utilized. Compared with spatial sharing, the GPU resource utilization under time sharing is lower, but the execution time of reasoning tasks is more stable and the real-time requirements of reasoning service requests can be better guaranteed. However, when heterogeneous models are deployed on the same server, the service performance of relatively smaller models is more easily interfered with, which seriously affects the service experience of the corresponding tenants.
In summary, for the acceleration of DNN reasoning tasks, currently used shared GPU environments cannot schedule multiple tasks well, which introduces uncertainty and interference, and the real-time requirements of reasoning service requests cannot be well guaranteed.
Summary of the invention
The invention aims to provide a GPU scheduling optimization method for a deep learning reasoning service system, which is used for supporting a multi-tenant isolated deep learning reasoning service system in a shared GPU environment, dynamically predicting throughput requirements of various model services in the shared GPU environment in real time under complex and changeable workload conditions, and carrying out dynamic real-time GPU scheduling based on a prediction result so as to solve the problem of performance isolation among multiple tenants.
The GPU scheduling optimization method for the deep learning reasoning service system provided by the invention comprises the following steps:
s1, initializing a scheduling optimization parameter;
S2, acquiring the tasks to be processed of the deep learning reasoning service system at the current moment and predicting the throughput demand of each model among the tasks to be processed in the new cycle based on the system parameters of the deep learning reasoning service system, wherein model throughput refers to the number of reasoning requests successfully responded to by the model; each reasoning request has a deadline, and if and only if the client obtains the corresponding reasoning result before the deadline of the request, the request counts as successfully responded and is counted toward the throughput of the model;
s3, starting a new cycle, and adjusting and distributing the throughput of each model in the cycle duration based on the throughput requirement of each model in the new cycle, which is obtained in the step S2;
s4, after the current period is finished, repeating the steps S2-S3 until the deep learning reasoning service system stops running, and completing GPU scheduling optimization aiming at the deep learning reasoning service system.
The initialization of the scheduling optimization parameters in step S1 specifically comprises: the globally shared data defined by the system comprise a model state variable array models[2][m], a model state index si, and a model state lock variable si_lock corresponding to the model state index si, and a simulation queue and a request queue are set for each model; models[2][m] is used for storing state information, where the stored state information comprises the estimated throughput demand of each model and its actual throughput in the current period, m is the number of models, and each column of the array corresponds to the state information of one model; the initial values of the elements of models[2][m] and of si are set to 0, and the value of si is restricted to 0 or 1; to avoid read-write conflicts between the prediction processes and the scheduling processes on the model state information, the scheduling processes only read and write the model states stored in the columns of models[2][m] from the row designated by the value of si, while the prediction processes read and write the model states stored in the columns of models[2][m] from the row designated by the value of si', where si' = (si+1)%2 and % is the remainder operator; the number of prediction processes executed in parallel is n and the number of scheduling processes executed in parallel is k, where m, n, and k are natural numbers not less than 1 whose specific values are set according to the hardware configuration of the system, and each prediction process or scheduling process is executed by one thread; the request queue of each model is used for storing the reasoning requests of that model that are waiting to be responded to, and the simulation queue is used for storing the requests involved in the simulated scheduling of the prediction flow in step S2.
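As a concrete illustration of this layout, the following minimal Python sketch models the double-buffered state array, the state index, and its lock. The names models, si, si_lock, sim_solo, and goodput follow the text above, while the ModelState class, the deque-based queues, and the example value of m are illustrative assumptions rather than part of the patent.

```python
import threading
from collections import deque

class ModelState:
    """One element of models[2][m]: the per-model state kept in one buffer row."""
    def __init__(self):
        self.sim_solo = 0   # estimated throughput demand when the model monopolizes the GPU
        self.goodput = 0    # actual throughput (successfully answered requests) in the current period

m = 4                                                           # number of deployed models (example value)
models = [[ModelState() for _ in range(m)] for _ in range(2)]   # models[2][m], all fields initialized to 0
si = 0                                                          # model state index, restricted to 0 or 1
si_lock = threading.Lock()                                      # model state lock variable

# One request queue (pending inference requests) and one simulation queue per model.
request_queues = [deque() for _ in range(m)]
simulation_queues = [deque() for _ in range(m)]

def publish_new_states():
    """Prediction side of step D below: after filling row si' = (si + 1) % 2,
    flip the index under the lock so scheduling threads read the new row."""
    global si
    with si_lock:
        si = (si + 1) % 2
```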
In step S2, the throughput demand of each model among the tasks to be processed in the new cycle is predicted based on the system parameters of the deep learning reasoning service system, specifically by the following steps:
A. All models are evenly distributed to the prediction processes, where the number of models in the system is m and the number of prediction processes executed in parallel is n; m and n are natural numbers, the value of m is not less than 1, and the specific value of n is set according to the hardware configuration of the system and is not less than 1; the number of models allocated to each of the first n-1 prediction processes is ⌈m/n⌉, where ⌈·⌉ denotes rounding up, and the number of models allocated to the last prediction process is m % n, where % is the remainder operation (a sketch of this split appears after this list);
B. Starting the prediction processes to estimate the throughput demand of all models in the new cycle: for the current model, the prediction process assumes that the model monopolizes the GPU computing resources, simulates the scheduling of the model's uncompleted requests, computes the number of reasoning requests that can be responded to successfully within one period duration under this simulated scheduling, and takes the result as the throughput demand of the model in the new cycle;
C. all prediction processes are finished after the estimation task is completed;
D. Obtaining the model state lock variable si_lock, updating the model state index si to si = si' so that the model state data of the new cycle are released to the scheduling processes, and releasing the model state lock variable after the update is finished;
E. ending the prediction flow of the round;
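The split in step A can be written down directly. In the sketch below, the last prediction thread is given all remaining models so that every model is assigned exactly once, which is how the remainder stated in the text is interpreted here; the function name and the example values are illustrative assumptions.

```python
import math

def assign_models_to_predictors(m, n):
    """Distribute model indices 0..m-1 over n prediction threads: the first
    n-1 threads each take ceil(m/n) models, the last thread takes the rest."""
    per_thread = math.ceil(m / n)
    assignment, start = [], 0
    for t in range(n):
        end = m if t == n - 1 else min(start + per_thread, m)
        assignment.append(list(range(start, end)))
        start = end
    return assignment

# Example: 10 models over 3 prediction threads -> [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
print(assign_models_to_predictors(10, 3))
```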
The refined flow by which a prediction process estimates the throughput demand of a single model specifically comprises the following steps (a minimal sketch of these steps follows the list):
(1) Copying all requests in a request queue of a current model to a simulation queue of the model;
(2) Clearing the to-be-released state data of the current model, wherein the current model is numbered i and models[(si+1)%2][i] is the to-be-released state data of the model; each model state datum comprises two member variables, sim_solo and goodput, wherein models[(si+1)%2][i].sim_solo is the sim_solo member variable of the model numbered i and records the throughput demand estimated for the current model, and models[(si+1)%2][i].goodput is the goodput member variable of the model numbered i and records the actual throughput of the current model; assigning 0 to the sim_solo and goodput member variables of the state datum clears the to-be-released state data of the current model;
(3) Deleting from the simulation queue of the current model the requests that cannot be completed before their own deadlines;
(4) Setting a start variable and an end variable, respectively indicating the start time and the end time of the new cycle, reading the current system time and assigning it to start, and then assigning the result of adding the cycle duration to start to end;
(5) Assuming that the system has N GPUs, setting a simulated scheduling time sched_time for each GPU, and initializing sched_time to start;
(6) Judging whether the simulation queue is empty, if so, executing step (13), otherwise executing step (7);
(7) Acquiring the minimum sched_time value among all the GPUs and assigning the result to a variable min_sched_time, wherein min_sched_time is a temporary variable recording the minimum sched_time value among all the GPUs;
(8) Judging whether min_sched_time is greater than or equal to end, if so, executing step (13), otherwise executing step (9);
(9) Searching the simulation queue for as many consecutive requests as possible forming a request block that satisfies:
min_sched_time + inferTime(batch_size) < deadline
wherein deadline is the deadline of the first request in the request block, batch_size is the number of requests in the block, and inferTime(batch_size) is the completion time of executing batch_size requests of the current model as one batch; the above formula judges whether the request block can be executed on the GPU as a batch when the current model monopolizes the GPU;
(10) Updating the throughput of the current model under simulated scheduling to sim_solo + batch_size, wherein sim_solo is a member variable of the model state data;
(11) Updating the simulated scheduling time sched_time of GPU i under simulated scheduling to min_sched_time + inferTime(batch_size);
(12) Deleting all requests of the consecutive request block found in step (9) from the simulation queue and jumping to step (6);
(13) Ending the prediction flow for the current model.
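The simulated scheduling of steps (1)-(13) can be sketched as follows. The Request class, the infer_time latency model, and the handling of an infeasible head-of-queue request are illustrative assumptions; only the overall structure (copy the queue, walk the per-GPU simulated clocks, batch as many consecutive requests as still meet the first request's deadline, and count them into sim_solo) follows the steps above.

```python
import time
from collections import deque

class Request:
    def __init__(self, deadline):
        self.deadline = deadline        # absolute time by which the client needs the result

def infer_time(batch_size):
    # Illustrative latency model (seconds): fixed overhead plus a per-request cost.
    return 0.004 + 0.001 * batch_size

def estimate_throughput_demand(request_queue, period, num_gpus):
    """Simulate scheduling the model's pending requests as if the model
    monopolized all GPUs for one period, and return the number of requests
    that could be answered before their deadlines (sim_solo)."""
    sim_queue = deque(request_queue)                             # (1) copy the request queue
    sim_solo = 0                                                 # (2) cleared estimate
    start = time.time()                                          # (4) period start
    end = start + period
    sched_time = [start] * num_gpus                              # (5) per-GPU simulated clock

    # (3) drop requests that cannot meet their deadline even if run alone right now
    sim_queue = deque(r for r in sim_queue if start + infer_time(1) < r.deadline)

    while sim_queue:                                             # (6)
        gpu = min(range(num_gpus), key=lambda g: sched_time[g])  # (7) earliest-free GPU
        min_sched_time = sched_time[gpu]
        if min_sched_time >= end:                                # (8) the period is exhausted
            break
        # (9) grow a block of consecutive requests while the whole batch still
        # finishes before the deadline of the block's first request
        first_deadline = sim_queue[0].deadline
        batch_size = 0
        while (batch_size < len(sim_queue)
               and min_sched_time + infer_time(batch_size + 1) < first_deadline):
            batch_size += 1
        if batch_size == 0:                                      # head request is infeasible; skip it
            sim_queue.popleft()
            continue
        sim_solo += batch_size                                   # (10) count the batch
        sched_time[gpu] = min_sched_time + infer_time(batch_size)  # (11) advance this GPU's clock
        for _ in range(batch_size):                              # (12) remove the scheduled requests
            sim_queue.popleft()
    return sim_solo                                              # (13)
```

Under this sketch, calling estimate_throughput_demand(request_queues[i], period, N) for model i yields the sim_solo value that step D later publishes to the scheduling processes.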
Step S3, adjusting and allocating the throughput of each model within the cycle duration based on the throughput demand of each model in the new cycle obtained in step S2, specifically includes the following steps (a minimal sketch of these steps follows the list):
1) Obtaining a model state lock variable si_lock;
2) Copying all elements in the row indicated by the model state index si of the variable array models[2][m] into a local one-dimensional array ms, and then releasing the model state lock variable si_lock;
3) Traversing each element in the array ms and judging whether the actual throughput obtained by each model is lower than the standard value, wherein mi is the (i+1)-th element in the array ms;
4) Adding the models whose actual throughput is lower than the standard value to a list M whose initial value is empty; if no model's actual throughput is lower than the standard value, adding all models to the list M;
5) Searching the request queue of each model for as many consecutive requests as possible forming a request block that satisfies:
curTime + inferTime(batch_size) < deadline
wherein the arrival time of the first request in the request block is arrival_time, its deadline is deadline, the number of requests in the block is batch_size, the current system time is curTime, and inferTime(batch_size) is the time to execute batch_size requests of the current model; deadline - inferTime(batch_size) is the latest scheduling time at which all requests in the request block can still be successfully responded to, and the next time the current GPU executes the scheduling flow is curTime + inferTime(batch_size);
6) And exiting the scheduling process.
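A minimal sketch of one pass of steps 1)-6) is given below. Because the standard-value formula in step 3) is not reproduced in the text, the under-served test here (each model's goodput-to-sim_solo ratio compared with the average ratio) is only an illustrative stand-in; dispatch, infer_time, and the request objects (deques of items with a deadline attribute) are likewise assumptions.

```python
import time

def adjust_throughput_once(models, si, si_lock, request_queues, infer_time, dispatch):
    """One pass of the scheduling thread (steps 1-6 above)."""
    with si_lock:                                            # 1) acquire the model state lock
        ms = [(s.sim_solo, s.goodput) for s in models[si]]   # 2) snapshot row si, then release the lock
    if not ms:
        return

    # 3)-4) collect models whose actual throughput is below the standard value;
    # here: goodput/sim_solo ratio below the average ratio across all models (assumption).
    ratios = [g / s if s > 0 else 1.0 for s, g in ms]
    avg_ratio = sum(ratios) / len(ratios)
    M = [i for i, r in enumerate(ratios) if r < avg_ratio]
    if not M:
        M = list(range(len(ms)))                             # otherwise consider every model

    cur_time = time.time()
    for i in M:
        queue = request_queues[i]
        if not queue:
            continue
        # 5) batch as many consecutive requests as still meet the deadline of the
        # block's first request: curTime + inferTime(batch_size) < deadline
        first_deadline = queue[0].deadline
        batch_size = 0
        while (batch_size < len(queue)
               and cur_time + infer_time(batch_size + 1) < first_deadline):
            batch_size += 1
        if batch_size > 0:
            batch = [queue.popleft() for _ in range(batch_size)]
            dispatch(i, batch)                               # hand the batch to the GPU
    # 6) exit; the thread runs again at the next scheduling moment,
    #    cur_time + infer_time(batch_size) for the batch it just issued.
```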
According to the GPU scheduling optimization method for the deep learning reasoning service system provided by the invention, the deep learning reasoning service system is initialized and, in the shared GPU environment, the throughput of each model service when it monopolizes the GPU is dynamically predicted based on the real-time load conditions of the models; the prediction adapts effectively to complex and changeable workloads without introducing an offline profiling process. At the same time, the throughput of each model when placed alone is used as the measurement standard for adjusting the weight of each heterogeneous model in GPU resource allocation, so that the different latency and throughput requirements of the model requests deployed on the same server are satisfied, remedying the deficiency of the task scheduling strategies in existing model serving systems with respect to performance isolation among heterogeneous model requests.
Drawings
FIG. 1 is a diagram of a GPU scheduling framework for a deep learning reasoning service system.
FIG. 2 is a schematic flow chart of the method of the present invention.
Fig. 3 is a general flow diagram of throughput demand prediction.
FIG. 4 is a detailed flow diagram of a prediction thread estimating throughput requirements for a new round of cycles for a single model.
Fig. 5 is a schematic diagram of a throughput adjustment flow based on a feedback control strategy.
FIG. 6 is a schematic diagram comparing the performance of example 1 of the present invention with the prior art.
Detailed Description
The GPU scheduling framework for the deep learning reasoning service system is shown in FIG. 1. It mainly comprises two parts, a throughput demand prediction module and a throughput adjustment module based on a feedback control strategy, which interact with each other to jointly complete the overall operation. The scheduling system maintains an inference request queue for each model, continuously receives inference requests from clients, and inserts each request into the inference request queue of its target model. In the multi-model GPU sharing scenario, the throughput demand prediction module periodically estimates the throughput that each model would achieve if it ran alone under the current workload conditions, determines the throughput demand proportions among the models based on the estimation results, and provides this as guidance information to the throughput adjustment module based on the feedback control strategy. The throughput adjustment module dynamically monitors the actual throughput of each model in the current period and optimizes the actual throughput distribution at fine granularity based on the throughput demand proportions among the models, minimizing the difference in performance loss that GPU sharing causes among heterogeneous models and ensuring performance isolation among the models.
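The per-model request queues and the two cooperating thread pools described above can be sketched as follows; all of the names (InferenceRequest, receive_request, start_system) and the use of Python threads are illustrative assumptions, not terms from the patent.

```python
import threading
from collections import deque
from dataclasses import dataclass

@dataclass
class InferenceRequest:
    model_id: int          # target model
    arrival_time: float    # when the request reached the server
    deadline: float        # absolute time by which the client needs the result
    payload: object = None

pending = {}               # model_id -> deque of InferenceRequest awaiting a response

def receive_request(req: InferenceRequest):
    """Front end: insert each incoming request into its target model's queue."""
    pending.setdefault(req.model_id, deque()).append(req)

def start_system(prediction_loop, scheduling_loop, n_predictors=1, k_schedulers=1):
    """Launch the two cooperating components; the text runs each prediction or
    scheduling process as one thread."""
    for _ in range(n_predictors):
        threading.Thread(target=prediction_loop, daemon=True).start()
    for _ in range(k_schedulers):
        threading.Thread(target=scheduling_loop, daemon=True).start()
```

A front end would call receive_request for every incoming inference request, while prediction_loop and scheduling_loop would implement the throughput demand prediction flow of FIG. 3 and the feedback-controlled adjustment flow of FIG. 5, respectively.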
The method flow diagram of the method of the invention is shown in fig. 2, the GPU scheduling optimization method for the deep learning reasoning service system provided by the invention comprises the following steps:
S1, initializing a scheduling optimization parameter, which specifically comprises the following steps:
The globally shared data defined by the system comprise a model state variable array models[2][m], a model state index si, and a model state lock variable si_lock corresponding to the model state index si, and a simulation queue and a request queue are set for each model; models[2][m] is used for storing state information, where the stored state information comprises the estimated throughput demand of each model and its actual throughput in the current period; to avoid read-write conflicts between the prediction processes and the scheduling processes on the model state information, the scheduling processes only read and write the model states stored in the columns of the models[2][m] array from the row designated by the value of si, while the prediction processes read and write the model states stored in the columns of the models[2][m] array from the row designated by the value of si', where si' = (si+1)%2 and % is the remainder operator; the number of prediction processes executed in parallel is n and the number of scheduling processes executed in parallel is k, where m, n, and k are natural numbers and the specific values of n and k are set according to the hardware configuration of the system; each prediction process or scheduling process is executed by one thread; the request queue of each model is used for storing the reasoning requests of that model that are waiting to be responded to, and the simulation queue is used for storing the requests involved in the simulated scheduling of the prediction flow in step S2;
S2, acquiring a task to be processed of the deep learning reasoning service system at the current time, and predicting throughput requirements of each model in the task to be processed in a new cycle based on system parameters of the deep learning reasoning service system, wherein the method specifically comprises the following steps:
FIG. 3 is a general flow chart of the throughput demand prediction. Here the model throughput refers to the number of reasoning requests successfully responded to by the model; each reasoning request has a deadline, and if and only if the client obtains the corresponding reasoning result before the deadline of the request, the request counts as successfully responded and is counted toward the throughput of the model;
the throughput requirements of each model in a new cycle are predicted by the following steps:
A. All models are evenly distributed to the prediction processes, where the number of models in the system is m and the number of prediction processes executed in parallel is n; m and n are natural numbers, the value of m is not less than 1, and the specific value of n is set according to the hardware configuration of the system and is not less than 1; the number of models allocated to each of the first n-1 prediction processes is ⌈m/n⌉, where ⌈·⌉ denotes rounding up, and the number of models allocated to the last prediction process is m % n, where % is the remainder operation;
B. For the current model, the prediction thread assumes that the model monopolizes the GPU computing resources, simulates the scheduling of the model's uncompleted requests, computes the number of reasoning requests that can be responded to successfully within one period duration under this simulated scheduling, and takes the result as the throughput demand of the model in the new cycle;
C. all prediction processes are finished after the estimation task is completed;
D. Obtaining the model state lock variable si_lock, updating the model state index si to si = si' so that the model state data of the new cycle are released to the scheduling processes, and releasing the model state lock variable after the update is finished;
E. ending the prediction flow of the round;
a refined flow chart of the prediction process for estimating throughput requirements in a new round of cycles for a single model is shown in fig. 4:
the refinement flow of the prediction thread for estimating the throughput demand of a single model specifically comprises the following steps:
(1) Copying all requests in a request queue of a current model to a simulation queue of the model;
(2) Clearing the to-be-released state data of the current model, wherein the current model is numbered i and models[(si+1)%2][i] is the to-be-released state data of the model; each model state datum comprises two member variables, sim_solo and goodput, wherein models[(si+1)%2][i].sim_solo is the sim_solo member variable of the model numbered i and records the throughput demand estimated for the current model, and models[(si+1)%2][i].goodput is the goodput member variable of the model numbered i and records the actual throughput of the current model; assigning 0 to the sim_solo and goodput member variables of the state datum clears the to-be-released state data of the current model;
(3) Deleting from the simulation queue of the current model the requests that cannot be completed before their own deadlines;
(4) Setting a start variable and an end variable, respectively indicating the start time and the end time of the new cycle, reading the current system time and assigning it to start, and then assigning the result of adding the cycle duration to start to end;
(5) Assuming that the system has N GPUs, setting a simulated scheduling time sched_time for each GPU, and initializing sched_time to start;
(6) Judging whether the simulation queue is empty, if so, executing step (13), otherwise executing step (7);
(7) Acquiring the minimum sched_time value among all the GPUs and assigning the result to a variable min_sched_time, wherein min_sched_time is a temporary variable recording the minimum sched_time value among all the GPUs;
(8) Judging whether min_sched_time is greater than or equal to end, if so, executing step (13), otherwise executing step (9);
(9) Searching the simulation queue for as many consecutive requests as possible forming a request block that satisfies:
min_sched_time + inferTime(batch_size) < deadline
wherein deadline is the deadline of the first request in the request block, batch_size is the number of requests in the block, and inferTime(batch_size) is the completion time of executing batch_size requests of the current model as one batch; the above formula judges whether the request block can be executed on the GPU as a batch when the current model monopolizes the GPU;
(10) Updating the throughput of the current model under simulated scheduling to sim_solo + batch_size, wherein sim_solo is a member variable of the model state data;
(11) Updating the simulated scheduling time sched_time of GPU i under simulated scheduling to min_sched_time + inferTime(batch_size);
(12) Deleting all requests of the consecutive request block found in step (9) from the simulation queue and jumping to step (6);
(13) Ending the prediction flow for the current model;
S3, starting a new cycle, and adjusting and distributing the throughput of each model in the cycle duration based on the throughput requirement of each model in the new cycle obtained in the step S2, wherein the method specifically comprises the following steps:
As shown in fig. 5, which is a schematic diagram of a throughput adjustment flow based on a feedback control policy, a flow of each scheduling thread to perform a single throughput adjustment specifically includes:
1) Obtaining a model state lock variable si_lock;
2) Copying all elements in the row indicated by the model state index si of the variable array models[2][m] into a local one-dimensional array ms, and then releasing the model state lock variable si_lock;
3) Traversing each element in the array ms and judging whether the actual throughput obtained by each model is lower than the standard value, wherein mi is the (i+1)-th element in the array ms;
4) Adding the models whose actual throughput is lower than the standard value to a list M whose initial value is empty; if no model's actual throughput is lower than the standard value, adding all models to the list M;
5) Searching the request queue of each model for as many consecutive requests as possible forming a request block that satisfies:
curTime + inferTime(batch_size) < deadline
wherein the arrival time of the first request in the request block is arrival_time, its deadline is deadline, the number of requests in the block is batch_size, the current system time is curTime, and inferTime(batch_size) is the time to execute batch_size requests of the current model; deadline - inferTime(batch_size) is the latest scheduling time at which all requests in the request block can still be successfully responded to, and the next time the current GPU executes the scheduling flow is curTime + inferTime(batch_size);
6) And exiting the scheduling process.
In Embodiment 1, a GoogleNet model and a ResNet model were placed on a single NVIDIA Tesla V100 GPU; each model was then loaded with 325 inference requests per second, and the effective throughput per second (goodput) of each model was measured under Clockwork and under the method of the present invention.
FIG. 6 compares the performance of Embodiment 1 of the present invention with that of the prior art. It can be observed from the experimental results that the invention greatly improves the goodput of the GoogleNet model while hardly affecting the relatively large ResNet model, thereby meeting the real-time requirements of more reasoning service requests.
S4, after the current period is finished, repeating the steps S2-S3 until the deep learning reasoning service system stops running, and completing GPU scheduling optimization aiming at the deep learning reasoning service system.

Claims (3)

The globally shared data defined by the system comprise a model state variable array models[2][m], a model state index si, and a model state lock variable si_lock corresponding to the model state index si, and a simulation queue and a request queue are set for each model; models[2][m] is used for storing state information, where the stored state information comprises the estimated throughput demand of each model and its actual throughput in the current period; to avoid read-write conflicts between the prediction processes and the scheduling processes on the model state information, the scheduling processes only read and write the model states stored in the columns of the models[2][m] array from the row designated by the value of si, while the prediction processes read and write the model states stored in the columns of the models[2][m] array from the row designated by the value of si', where si' = (si+1)%2 and % is the remainder operator; the number of prediction processes executed in parallel is n and the number of scheduling processes executed in parallel is k, where m, n, and k are natural numbers and the specific values of n and k are set according to the hardware configuration of the system; the request queue of each model is used for storing the reasoning requests of that model that are waiting to be responded to, and the simulation queue is used for storing the requests involved in the simulated scheduling of the prediction flow in step S2;
(2) Clearing the to-be-released state data of the current model, wherein the current model is numbered i and models[(si+1)%2][i] is the to-be-released state data of the model; each model state datum comprises two member variables, sim_solo and goodput, wherein models[(si+1)%2][i].sim_solo is the sim_solo member variable of the model numbered i and records the throughput demand estimated for the current model, and models[(si+1)%2][i].goodput is the goodput member variable of the model numbered i and records the actual throughput of the current model; assigning 0 to the sim_solo and goodput member variables of the state datum clears the to-be-released state data of the current model;
CN202211456890.8A | 2022-11-21 | 2022-11-21 | GPU scheduling optimization method for deep learning reasoning service system | Active | CN115756789B (en)

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN202211456890.8A (CN115756789B) | 2022-11-21 | 2022-11-21 | GPU scheduling optimization method for deep learning reasoning service system

Applications Claiming Priority (1)

Application Number | Priority Date | Filing Date | Title
CN202211456890.8A (CN115756789B) | 2022-11-21 | 2022-11-21 | GPU scheduling optimization method for deep learning reasoning service system

Publications (2)

Publication Number | Publication Date
CN115756789A (en) | 2023-03-07
CN115756789B (en) | 2025-07-25

Family

ID=85333689

Family Applications (1)

Application Number | Title | Priority Date | Filing Date
CN202211456890.8A (Active, CN115756789B) | GPU scheduling optimization method for deep learning reasoning service system | 2022-11-21 | 2022-11-21

Country Status (1)

Country | Link
CN | CN115756789B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
TWI866184B (en)* | 2023-04-28 | 2024-12-11 | 緯創資通股份有限公司 | Resource allocation system and method for cloud environment
CN117349032B (en)* | 2023-12-05 | 2024-02-20 | 城云科技(中国)有限公司 | Method and device for improving throughput of large language model
CN119378693B (en)* | 2024-12-27 | 2025-04-18 | 杭州海康威视数字技术股份有限公司 | Engine parameter adjustment method and device of reasoning engine

Citations (2)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN113434303A (en)* | 2021-08-27 | 2021-09-24 | 湖北星地智链科技有限公司 | Batch-processed remote sensing image intelligent processing model prediction performance optimization system and method
CN115237586A (en)* | 2022-03-24 | 2022-10-25 | 华东师范大学 | GPU resource configuration method for deep learning inference performance interference perception

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
KR20210024993A (en)* | 2018-03-30 | 2021-03-08 | 엑스포지션 파크 홀딩스 에스이지씨 | Digital asset exchange
CN110795228B (en)* | 2018-08-03 | 2023-08-25 | 伊姆西IP控股有限责任公司 | Method and article of manufacture for training deep learning model, and computing system

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN113434303A (en)* | 2021-08-27 | 2021-09-24 | 湖北星地智链科技有限公司 | Batch-processed remote sensing image intelligent processing model prediction performance optimization system and method
CN115237586A (en)* | 2022-03-24 | 2022-10-25 | 华东师范大学 | GPU resource configuration method for deep learning inference performance interference perception

Also Published As

Publication number | Publication date
CN115756789A (en) | 2023-03-07

Similar Documents

Publication | Title
CN115756789B (en) | GPU scheduling optimization method for deep learning reasoning service system
Guo et al. | Cloud resource scheduling with deep reinforcement learning and imitation learning
CN110737529B (en) | Short-time multi-variable-size data job cluster scheduling adaptive configuration method
CN105956021B (en) | A kind of automation task suitable for distributed machines study parallel method and its system
US20220012089A1 | System for computational resource prediction and subsequent workload provisioning
CN112416585A (en) | GPU resource management and intelligent scheduling method for deep learning
US9934071B2 | Job scheduler for distributed systems using pervasive state estimation with modeling of capabilities of compute nodes
CN109857534A (en) | A kind of intelligent task scheduling strategy training method based on Policy-Gradient Reinforcement Learning
US11513866B1 | Method and system for managing resource utilization based on reinforcement learning
CN112732444A (en) | Distributed machine learning-oriented data partitioning method
CN119537032A (en) | Large model reasoning scheduling method based on off-grid computing server
Lin et al. | A scheduling algorithm based on reinforcement learning for heterogeneous environments
WO2021220616A1 | Information processing device and information processing method, computer program, and distributed training system
CN117194025A (en) | GPU spatio-temporal sharing method for deep learning services
Gong et al. | Chic: Experience-driven scheduling in machine learning clusters
US11551095B2 | Sharing preprocessing, computations, and hardware resources between multiple neural networks
Yao et al. | Workload-aware performance model based soft preemptive real-time scheduling for neural processing units
CN118245809B (en) | Batch size adjustment method in distributed data parallel online asynchronous training
CN119292771A (en) | Scheduling method, device, equipment and storage medium
CN119201443A (en) | A method and system for allocating and scheduling computing power of edge computing platform
WO2025001472A1 | Data reasoning method, model training method and device
CN118981360A (en) | Task scheduling method, device, storage medium, system and program product
CN118445036A (en) | Intelligent scheduling method for data sharing exchange task
CN117556933A (en) | Logistics robot cluster task scheduling method and device based on Double DQN and readable medium
Le Hai et al. | Irls: An improved reinforcement learning scheduler for high performance computing systems

Legal Events

Code | Title
PB01 | Publication
SE01 | Entry into force of request for substantive examination
GR01 | Patent grant
