CN112925630B

Movatterモバイル変換

Info

Publication number: CN112925630B
Application number: CN202110349693.5A
Authority: CN
Inventors: 徐曼; 刘一鸣; 李文昊
Original assignee: 4Paradigm Beijing Technology Co Ltd
Current assignee: 4Paradigm Beijing Technology Co Ltd
Priority date: 2021-03-31
Filing date: 2021-03-31
Publication date: 2025-08-29
Anticipated expiration: 2041-03-31
Also published as: CN112925630A

Abstract

The application discloses a method, a device, equipment and a medium for submitting and managing an artificial intelligence task, wherein the method comprises the steps of creating the artificial intelligence task according to the task type of the artificial intelligence task; the method comprises the steps of receiving each instance in an artificial intelligence task, respectively creating each container in which each instance in the artificial intelligence task is located, and operating each instance in the corresponding container.

Description

Method, device, equipment and medium for submitting management of artificial intelligent task

Technical Field

The present disclosure relates to the field of artificial intelligence technologies, and in particular, to a method, an apparatus, a device, and a medium for managing submission of an artificial intelligence task.

Background

With the rapid growth of data, the application of heterogeneous computing technology has become a topic of interest in industry and academia in recent years, and the biggest advantage of heterogeneous computing is that it can bring about higher operation efficiency, higher cost performance and lower delay. Therefore, heterogeneous computing technology is an important approach to improve the efficiency of "people" and "machines" in the development process of machine learning application.

However, existing heterogeneous computing cannot be run on a user's personal computer, and thus the user needs to put his own tasks on machines in a remote cluster to run. At present, most common in the industry describes a task through a computing framework, such as TensorFlow, pytorch, and divides the resource into users in advance and submits the task to a remote cluster, however, the mode of pre-allocating the resource does not support the fast sharing of the resource to other users, so that resource waste is caused, and when a plurality of users have a plurality of tasks, different tasks lack isolation means, and are incompatible with each other, so that the task fails to run.

Disclosure of Invention

It is an object of embodiments of the present disclosure to provide a new solution for submission management of artificial intelligence tasks.

According to a first aspect of the present disclosure, there is provided a method of submission management of an artificial intelligence task, comprising:

creating an artificial intelligence task according to the task type of the artificial intelligence task;

Receiving each instance in the artificial intelligence task;

Creating containers in which instances of the artificial intelligence task are located, respectively, and

Running the instances in the corresponding containers.

Optionally, the receiving each instance of the artificial intelligence task includes:

Creating a user group for a user submitting the artificial intelligence task through a user group creation command;

acquiring the work nodes distributed to the user group;

Acquiring resource requirements of each instance in the artificial intelligence task;

Selecting at least one working node from the distributed working nodes according to the resource requirements of the examples;

Instances of the artificial intelligence task are received based on the at least one working node.

Optionally, said running said instances in corresponding said containers comprises:

and running each instance in the corresponding container based on the work node allocated for each instance.

Optionally, the task types of the artificial intelligence tasks include an artificial intelligence development task and an artificial intelligence training task.

Optionally, the artificial intelligence task is an artificial intelligence development task,

The creating artificial intelligence tasks includes:

an artificial intelligence development task is created by a first task creation command.

Optionally, each instance in the artificial intelligence development task is each artificial intelligence development instance,

The method further comprises the steps of:

during the operation of each artificial intelligence development instance in each corresponding container, suspending the operation of the artificial intelligence development instance in any container through a task suspension command.

The method further comprises the steps of:

After each artificial intelligence development example is operated in each corresponding container, an artificial intelligence training program is obtained;

and saving the artificial intelligence training program as an image file by submitting a command.

Optionally, the method further comprises:

storing the image file into a first setting folder in a preset storage space;

wherein the first settings folder is a private folder created for a user submitting the artificial intelligence development task.

Optionally, the artificial intelligence task is an artificial intelligence training task,

The creating artificial intelligence tasks includes:

Creating an artificial intelligence training task through the second task creation command.

Optionally, the method further comprises:

Saving training data for training the artificial intelligence training program to a second setting folder;

Wherein the second set folder is shared by all users.

Optionally, each instance in the artificial intelligence training task is each artificial intelligence training instance,

The running of each instance in each corresponding container comprises the following steps:

Obtaining the image file from the first setting folder, and

Acquiring the training data from the second setting folder;

And operating each artificial intelligent training example in each corresponding container based on the work node distributed for each artificial intelligent training example so as to train the mirror image file by utilizing the training data to obtain an artificial intelligent model.

Optionally, during running the respective artificial intelligence training instances in the respective corresponding containers based on the assigned working nodes for the respective artificial intelligence training instances, the method further comprises:

viewing the running state information of each artificial intelligent training example through a first task viewing command, and/or,

And checking the operation log information of each artificial intelligent training example through a second task checking command, and/or,

And checking the working nodes running the artificial intelligent training examples and the resource use information of the artificial intelligent training examples through a third task checking command.

and dynamically adjusting the resources required by each artificial intelligent training example according to the resource use information of each artificial intelligent training example.

According to a second aspect of the present disclosure, there is also provided a submission management apparatus of an artificial intelligence task, including:

the first creating module is used for creating the artificial intelligent task according to the task type of the artificial intelligent task;

The receiving module is used for receiving each instance in the artificial intelligence task;

A second creation module for creating containers of the artificial intelligence task, and

And the operation module is used for operating the examples in the corresponding containers.

Optionally, the receiving module is specifically configured to:

acquiring the work nodes distributed to the user group;

Optionally, the operation module is specifically configured to:

Optionally, the artificial intelligence task is an artificial intelligence development task, and the creation module is specifically configured to:

Optionally, each instance in the artificial intelligence development task is each artificial intelligence development instance, and the device further comprises a stopping module for:

Optionally, each instance in the artificial intelligence development task is each artificial intelligence development instance, and the device further comprises an acquisition module for:

Optionally, the apparatus further comprises a storage module for:

storing the image file into a first setting folder in a preset storage space;

Optionally, the artificial intelligence task is an artificial intelligence training task, and the creating module is specifically configured to:

Optionally, the storage module is further configured to:

Wherein the second set folder is shared by all users.

Optionally, each instance in the artificial intelligence training task is each artificial intelligence training instance, and the operation module is specifically configured to:

Obtaining the image file from the first setting folder, and

Acquiring the training data from the second setting folder;

Optionally, the apparatus further includes a view module for, during execution of the respective artificial intelligence training instance in the respective container based on the assigned working node for the respective artificial intelligence training instance:

And checking the working nodes running the artificial intelligent training tasks and the resource use information of the artificial intelligent training tasks through a third task checking command.

Optionally, the view module is further configured to, during running of the respective artificial intelligence training instances in the respective containers based on the assigned working nodes for the respective artificial intelligence training instances:

According to a third aspect of the present disclosure there is also provided an apparatus comprising at least one computing device and at least one storage device, wherein the at least one storage device is adapted to store instructions for controlling the at least one computing device to perform the method according to the first aspect above, or the apparatus implements the device according to the second aspect above by the computing device and the storage device.

According to a fourth aspect of the present disclosure there is also provided a computer readable storage medium, wherein a computer program is stored thereon, which, when executed by a processor, implements the method according to the first aspect above.

The method, the device, the equipment and the medium have the beneficial effects that according to the embodiment of the disclosure, a corresponding container can be created for each instance in the artificial intelligence task, and different computing frames can be mounted in the container, so that isolation is provided for different tasks and different instances of the same task, and the different tasks and the different instances of the same task cannot interfere with each other. Moreover, as the method is an example of running in the container, a user can pause the running of the container at any time to release the resource and rapidly share the resource to other users, so that the resource waste caused by the occupation of the resource by the task is avoided.

Drawings

The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description, serve to explain the principles of the invention.

Fig. 1 is a schematic diagram of a hardware architecture of an electronic device according to an embodiment of the present disclosure;

FIG. 2 is a flow diagram of a method of submission management of artificial intelligence tasks, in accordance with an embodiment of the disclosure;

FIG. 3 is a schematic diagram of a task queue according to an embodiment of the present disclosure;

FIG. 4 is a functional block diagram of resource allocation according to an embodiment of the present disclosure;

FIG. 5 is a functional block diagram of an artificial intelligence task submission management apparatus according to an embodiment of the disclosure;

FIG. 6 is a functional block diagram of an electronic device according to an embodiment of the present disclosure;

Fig. 7 is a schematic diagram of a hardware structure of an electronic device according to another embodiment of the present disclosure.

Detailed Description

Various exemplary embodiments of the present disclosure will now be described in detail with reference to the accompanying drawings. It should be noted that the relative arrangement of the components and steps, numerical expressions and numerical values set forth in these embodiments do not limit the scope of the present disclosure unless it is specifically stated otherwise.

The following description of at least one exemplary embodiment is merely illustrative in nature and is in no way intended to limit the disclosure, its application, or uses.

Techniques, methods, and apparatus known to one of ordinary skill in the relevant art may not be discussed in detail, but are intended to be part of the specification where appropriate.

In all examples shown and discussed herein, any specific values should be construed as merely illustrative, and not a limitation. Thus, other examples of exemplary embodiments may have different values.

It should be noted that like reference numerals and letters refer to like items in the following figures, and thus once an item is defined in one figure, no further discussion thereof is necessary in subsequent figures.

< Hardware configuration >

The method of the embodiments of the present disclosure may be implemented by at least one electronic device, i.e. the means 5000 for implementing the method may be arranged on the at least one electronic device. Fig. 1 shows a hardware structure of any electronic device. The electronic device shown in fig. 1 may be a portable computer, a desktop computer, a workstation, a server, or any other device having a computing device such as a processor and a storage device such as a memory, and is not limited herein.

As shown in fig. 1, the electronic device 1000 may include a processor 1100, a memory 1200, an interface device 1300, a communication device 1400, a display device 1500, an input device 1600, a speaker 1700, a microphone 1800, and the like. Wherein the processor 1100 is adapted to execute a computer program. The computer program may be written in an instruction set of an architecture such as x86, arm, RISC, MIPS, SSE, etc. The memory 1200 includes, for example, ROM (read only memory), RAM (random access memory), nonvolatile memory such as a hard disk, and the like. The interface device 1300 includes, for example, a USB interface, a headphone interface, and the like. The communication device 1400 can be capable of wired or wireless communication, and specifically can include Wifi communication, bluetooth communication, 2G/3G/4G/5G communication, and the like. The display device 1500 is, for example, a liquid crystal display, a touch display, or the like. The input device 1600 may include, for example, a touch screen, keyboard, somatosensory input, and the like. The electronic device 1000 may output voice information through the speaker 1700, may collect voice information through the microphone 1800, and the like.

The electronic device shown in fig. 1 is merely illustrative and is in no way meant to limit the invention, its application or uses. In an embodiment of the present disclosure, the memory 1200 of the electronic device 1000 is used to store instructions for controlling the processor 1100 to operate to perform the method of submission management of artificial intelligence tasks of the embodiment of the present disclosure. The skilled person can design instructions according to the disclosed solution. How the instructions control the processor to operate is well known in the art and will not be described in detail here.

In one embodiment, an apparatus is provided that includes at least one computing device and at least one storage device for storing instructions for controlling the at least one computing device to perform a method according to any embodiment of the present disclosure.

The device may include at least one electronic device 1000 as shown in fig. 1 to provide at least one computing device such as a processor and at least one storage device such as a memory, without limitation.

< Method example >

FIG. 2 is a flow chart of a method of submission management of artificial intelligence tasks performed by an electronic device 1000, as shown in FIG. 2, according to an embodiment of the disclosure, the method may include the following steps S2100-S2400:

In step S2100, an artificial intelligence task is created according to the task type of the artificial intelligence task.

The task JOB is a total unit of task operation and is also a minimum unit submitted by a user, the user submits the task JOB as one task JOB each time, and after the user submits the task JOB, the electronic device automatically generates a JOB ID for the task JOB, and the JOB ID can be used as an index for searching the task JOB. Typically, one task job includes one or more instance tasks, where in the case that one task job includes multiple instance tasks, the multiple instance tasks are related to each other.

In one example, where the artificial intelligence task is an artificial intelligence development task, creating the artificial intelligence task in step S2100 may further include creating the artificial intelligence development task via a first task creation command.

The first task creation command may be a HSCTL CREATE lab command.

In this example, the artificial intelligence development task can be created by using HSCTL CREATE lab command, that is, the created artificial intelligence development task is a lab task, and according to the subsequent steps, it can be known that a corresponding container is created for each artificial intelligence development instance in the artificial intelligence development task, so that, on one hand, the user can pause the operation of the container where the artificial intelligence development instance is located at any time in the artificial intelligence development stage to release resources, and after releasing the resources, other users can immediately see the released visible resources and use the visible resources to operate their own tasks. On the other hand, the second-level recoverable resource after restarting the container corresponding to the artificial intelligence development example.

In one example, where the artificial intelligence task is an artificial intelligence training task, creating the artificial intelligence task in step S2100 may further include creating the artificial intelligence training task via a second task creation command.

The second task creation command may be a HSCTL CREATE flash command.

After creating the artificial intelligence task according to the task type of the artificial intelligence task, entering:

step S2200, receiving each instance in the artificial intelligence task.

In this embodiment, as can be seen from the description of the above step S2100, a single artificial intelligence task job often includes multiple instances tasks, for example, a single artificial intelligence development task job includes multiple artificial intelligence development instances tasks, and another artificial intelligence training task job includes multiple artificial intelligence training instances tasks.

In this embodiment, the receiving each instance of the artificial intelligence task in the step S2200 may further include the following steps S2210 to S2240:

step S2210, a user group is created for a user submitting an artificial intelligence task through a user group creation command.

The user group creation command may be a hsctl queue command.

In step S2210, the creation of the user group may be implemented by the hsctl queue command, where the computing resources may be allocated according to the user group, and the computing resources in one user group are dynamically allocated.

Step S2220 obtains the work nodes allocated to the user group.

In this step S2220, the working node running each instance of the task may be configured for each user group, so that the user may submit the task to the working node allocated for that user group. Illustratively, user group 1 may be assigned working node 1, working node 2, and working node 3, user group 2 may be assigned working node 4, working node 5, and working node 6, and so on. Of course, the same worker node may be configured as worker nodes within different user groups.

In step S2230, resource requirements for each instance in the artificial intelligence task are obtained.

In this step S2230, taking the artificial intelligence task as an example of the artificial intelligence training task, when submitting the artificial intelligence training task, it not only designates a python file (the python file includes code information for executing the artificial intelligence training task) required by the artificial intelligence training task, but also designates required resource requirements, i.e., computing resource information, such as the number of GPUs, the number of GPUs and the number of NPUs required by the artificial intelligence training task, where, unlike the prior art in which hardware resource information such as the number of CPU cores, the number of memory and the number of GPU devices is required to be configured in TensorFlow code files, in this embodiment, when submitting the task by using a computing framework such as TensorFlow, it is not required to designate the number of CPU cores, the number of memory and the number of GPU devices in TensorFlow code files, but is self-allocated by the electronic device during specific operation, thereby reducing the trouble preset by the user.

Step S2240, selecting at least one working node from the allocated working nodes according to the resource requirements of each instance.

In step S2240, different working nodes may be selected from the user group where the task is located according to the resource requirements of each instance in the task, so as to submit the task to the selected working nodes.

Step S2250 receives instances of the artificial intelligence task based on the at least one worker node.

According to the steps S2210-S2250, computing resources are allocated according to the user groups, and the computing resources in each user group are dynamically allocated, so that resource waste is avoided.

In this embodiment, queue management may also be completed through hsctl queue commands, where tasks in each user group may form a task queue, where tasks submitted by users are sequentially queued, and fig. 3 is a containment relationship of a containerized instance in the task queue, where computing frameworks such as TensorFlow are responsible for executing computing tasks in a container environment.

After receiving each instance of the artificial intelligence task, enter:

Step S2300, creating containers in which the instances of the artificial intelligence task are located, respectively.

In this embodiment, a corresponding container is created for each instance in the artificial intelligence task, and computing resources required by a user, such as the number of GPUs, the amount of GPU display and the number of NPUs required by the user, are pre-allocated during creation, and by creating the container, different tasks and isolation between instances of the same task can be achieved.

After creating each container in which each instance in the artificial intelligence task is located, respectively, entering:

Step S2400, running each instance in each corresponding container.

In this embodiment, running each instance in the corresponding container in step S2400 may further include running each instance in the corresponding container based on the work node allocated for each instance.

In one example, where the artificial intelligence task is an artificial intelligence development task, each instance in the artificial intelligence development task is each artificial intelligence development instance, which after running each artificial intelligence development instance in a corresponding container, will obtain an artificial intelligence training program, where the artificial intelligence training program can be saved as a mirror image file by submitting a command.

The commit command may be a hsctl commit command.

In this example, the obtained artificial intelligence training program may be saved as an image file by command hsctl commit.

In this example, after the above image file is obtained, the image file may also be saved to the first setting folder in the predetermined storage space.

The first setting folder is a private folder created by a user submitting an artificial intelligence development task, and data in the private folder only supports the user to read and write.

In this example, after the artificial intelligence training program is saved as a mirror image file, a large amount of training data is often needed to train the artificial intelligence training program to obtain the artificial intelligence model in the subsequent artificial intelligence training stage. In this case, in the present example, the electronic device has a data storage system, in which a user may create a private folder path/shared/users, where the users may be user names, through which different users may be defined, and through which the user may save the saved image file into a corresponding private folder, and simultaneously each user may only read and write data in the private folder under the own user name, so as to avoid private data leakage. Here, the user may also interact with the data storage system via HSCTL FILE commands, for example, using HSCTL FILE-hellp commands to display help documents, using HSCTL FILE download SOURCE DEST commands to effect data downloads within the private folder, using HSCTL FILE upload SOURCE DEST to effect data uploads within the private folder, and so on.

Meanwhile, in the data storage system, a public folder path/shared/public which can be accessed by all users is provided, the public can be the name of the public folder, through the public folder path, the user can upload the public common data set such as training data of various different application scenes to the public folder, the different application scenes can be an image processing scene, a voice recognition scene, a natural voice processing scene, an automatic control scene, an intelligent question-answer scene, a business decision scene, a recommended business scene, a search scene, an abnormal behavior detection scene and the like, and meanwhile, all users can read and write the data in the public folder, so that data sharing is realized. Here, the user may also interact with the data storage system via HSCTL FILE commands, for example, using HSCTL FILE-hellp commands to display help documents, using HSCTL FILE download SOURCE DEST commands to effect data downloads in the public folders, using HSCTL FILE upload SOURCE DEST to effect data uploads in the public folders, and so on.

In an example, in the case that the artificial intelligence task is an artificial intelligence training task, each instance in the artificial intelligence training task is each artificial intelligence training instance, where running each instance in each corresponding container in the step S2400 may further include the following steps S2410 to S2430:

in step S2410, an image file is acquired from the first setting folder.

The first setting folder may be the above private folder.

In this step S2410, the image file saved by the artificial intelligence training program may be obtained from the private folder of the user.

Step S2420, the training data is obtained from the second setting folder.

The second setting folder may be the public folder, and the second setting folder is shared by all users, where the users may save the training data for training the artificial intelligence training program into the second setting folder in advance, so as to obtain the training data for training the artificial intelligence training program from the second setting folder in the training stage.

In this step S2420, training data may be acquired from the above common folder. The more the number of training data, the more accurate the training result is, but after the training data reaches a certain number, the more and more slowly the accuracy of the training result increases until the orientation is stable. Here, the accuracy of the training result and the data processing cost can be combined to determine the amount of training data required.

Step S2430, based on the work nodes allocated to the artificial intelligence training examples, running the artificial intelligence training examples in the corresponding containers to train the mirror image file by using the training data, and obtaining the artificial intelligence model.

In this step S2430, after feature extraction and feature combination are performed on training data according to an automatic machine learning technique to obtain each target feature, a training sample is generated in combination with real tag data corresponding to the training data, and further, at least one model training algorithm is used to train the image file based on the training sample to obtain an artificial intelligent model.

According to the method, a corresponding container is created for each instance in the artificial intelligence task, and different computing frames can be mounted in the container, so that isolation is provided for different tasks and different instances of the same task, and the different tasks and the different instances of the same task cannot interfere with each other. Moreover, as the method is an example of running in the container, a user can pause the running of the container at any time to release the resource and rapidly share the resource to other users, so that the resource waste caused by the occupation of the resource by the task is avoided.

In one embodiment, in the case that the artificial intelligence task is an artificial intelligence development task, each instance in the artificial intelligence development task is each artificial intelligence development instance, where the method for submitting and managing the artificial intelligence task according to the disclosure may further include:

during the operation of each artificial intelligence development instance in each corresponding container, the operation of the artificial intelligence development instance in any container is suspended through a task suspension command.

The task pause command may be a hsctl stop command.

In this embodiment, since the created artificial intelligence development task is usually a lab task, during the period of running each artificial intelligence development instance in each corresponding container, the running of the artificial intelligence development instance in any container can be suspended by hsctl stop command to release resources, and the released resources can be used by other users in the user group, so as to realize resource sharing, and avoid resource waste.

In one embodiment, in the case where the artificial intelligence task is an artificial intelligence training task, during running each artificial intelligence training instance in a corresponding container based on the work node allocated for each artificial intelligence training instance according to the above step S2430, the submission management method of the artificial intelligence task of the present disclosure may further include:

And checking the resource use information of each artificial intelligent training example through a third task checking command, and dynamically adjusting the resources required by each artificial intelligent training example according to the resource use information of each artificial intelligent training example.

The third task view command may be a hsctl info job _use command.

In this embodiment, a user may view, through a hsctl info job _usage command, resource usage information of each instance in the artificial intelligence training task, for example, a historically consumed resource amount, a currently consumed resource amount, and the like, and dynamically adjust resources required by each artificial intelligence training instance, so as to ensure that each artificial intelligence training instance can operate normally.

In this embodiment, according to the analysis of step S2230, only the required number of GPUs, the required amount of GPU display and the required number of NPUs need to be specified, and the number of CPU cores, the required amount of memory, the required number of GPU devices and other computing resource information need not be specified in the TensorFlow code file, but the required information is allocated by the electronic device at the time of operation, for example, the required GPU device is specifically operated on which GPU device can be allocated by the electronic device, so that the trouble preset by the user is reduced. Here, in order to reduce the trouble of user-preset by eliminating the need for the user to specify the number of CPU cores, the amount of memory and the number of GPU devices, all hardware resources are virtualized by a virtualization module, as shown in fig. 4, so that the hardware resources are the resources of the minimum CPU core (1 CPU core), the minimum GPU (1 GPU), the 1MB GPU video memory, the 1MB memory and the like, and are put into a resource pool for unified perception by an upper layer application. The container is arranged to declare the required resources, and the scheduling module allocates the resources to the required virtualized resources. Because the allocation of resources is dynamically adjusted for the container requiring CPU and memory resources, the locking of the resources is also dynamic, so that the user does not need to preset the CPU core number and the memory quantity when submitting the task, but automatically adjusts in the running process, the trouble preset by the user is reduced, and the problem that the task fails due to the limitation of the resources in the running process caused by the resource preset is avoided. However, for the container requiring the GPU, the static locking resource is currently a physical resource, so that the user needs to set the number of GPUs and the amount of GPU display, and does not need to set the GPU device number.

It can be understood that when a user submits a task and starts to run each instance of the task after the task is completely scheduled to a resource, before the task is in a waiting state, once the resource is successfully scheduled, each instance of the task starts to run automatically, the user does not need to operate, and the strategy can be called an all or scheduling strategy.

According to the embodiment, a user can avoid setting resources, but dynamically adjust the use condition of the resources in the running process of each instance of the task, so that the waste of the resources is avoided.

in the first aspect, the running state information of each artificial intelligent training example is checked through a first task checking command.

The first task view command may be a HSCTL LIST job command.

In this aspect, the operational status information of each of the artificial intelligence training instances in the artificial intelligence training task may be viewed through HSCTL LIST jobs commands, such as, but not limited to, including an artificial intelligence training instance in a waiting (waiting) state, an artificial intelligence training instance in a running (running) state, an artificial intelligence training instance in a killing (killed) state, an artificial intelligence training instance in a fail (fail) state, and the like.

In a second aspect, the log information of each artificial intelligence training instance is reviewed via a second task review command.

The second task view command may be a hsctl log job command.

In this aspect, the running log information of each artificial intelligence training instance in the artificial intelligence training task can be checked through hsctl log job commands.

In a third aspect, a third task view command is used to view the working nodes running each artificial intelligence training instance and the resource usage information of each artificial intelligence training instance.

In the aspect, the service information such as the historical consumed resource amount, the current consumed resource amount and the like of each artificial intelligent training example in the artificial intelligent training task can be checked through hsctl info job _use command on which working nodes are operated.

It will be appreciated that when the amount of resources is insufficient, the artificial intelligence training instance is in a wait (waiting) state, at which time the user may adjust queuing information of the artificial intelligence training instance, for example, the artificial intelligence training instance may be enqueued to the forefront by command hsctl update, so that when there are available resources, the artificial intelligence training instance may begin to run.

According to the embodiment, a user can monitor the available resource condition of the resource pool and the running condition of each instance of the task in real time, timely feed back task scheduling information to acquire the running state of the task, and can check queuing information and queue insertion.

< Example >

Next, taking an example in which the artificial intelligence tasks include an artificial intelligence development task and an artificial intelligence training task, a method of managing submission of an artificial intelligence task is shown, in which the method of managing submission of an artificial intelligence task may include the steps of:

Step S6010, creating an artificial intelligence development task through HSCTL CREATE lab commands, creating a container where each instance of the artificial intelligence development task is located, designating required computing resources, and performing the whole development and debugging process to obtain an artificial intelligence training program.

Step S6020, save the artificial intelligence training program as an image file through hsctl commit command.

In step S6020, the debugging process can be completed by entering the container through hsctl exec command, and finally a debugged image file is generated by storing through hsctl commit command, where the resource can be released through hsctl stop command, and after the resource is released, other users in the same user group can check the released resource through hsctl status command, so as to occupy the part of the resource.

Step S6030, the stored image file is stored in the private folder of the user by interacting with the data storage system through hsctl commands.

In step S6030, the data storage system further includes a common folder that can be shared by all users, and training data of different application scenarios are stored in the common folder, where the different application scenarios may be the application scenarios mentioned in the above embodiment, and the description is omitted in this example.

Step S6040, creating an artificial intelligence training task through hsctl bash commands and creating containers where all the examples in the artificial intelligence training task are located.

In this step S6040, the user may specify a desired python file including code information for performing the artificial intelligence training task, where it may also specify desired computing resources such as the number of GPUs, the amount of GPU display, and the number of NPUs. Meanwhile, the CPU core number, the memory quantity, the GPU equipment number and other computing resource information are not required to be specified in TensorFlow code files, and the electronic equipment is allocated at the time of operation, for example, the GPU equipment which is specifically operated on can be allocated by the electronic equipment, so that the trouble preset by a user is reduced.

Step S6050, based on the work nodes distributed for the artificial intelligence training examples, running the artificial intelligence training examples in the corresponding containers to train the mirror image file by using the training data, and obtaining the artificial intelligence model.

< Device example >

In this embodiment, there is further provided an artificial intelligence task submission management apparatus 5000, as shown in fig. 5, where the artificial intelligence task submission management apparatus 5000 includes a first creation module 5100, a receiving module 5200, a second creation module 5300, and an operation module 5400, and is configured to implement the artificial intelligence task submission management method provided in this embodiment, and each module of the artificial intelligence task submission management apparatus 5000 may be implemented by software or hardware, and is not limited herein.

The first creating module 5100 is configured to create an artificial intelligence task according to a task type of the artificial intelligence task.

The receiving module 5200 is configured to receive each instance in the artificial intelligence task.

The second creating module 5300 is configured to create containers in which the instances in the artificial intelligence task are located, respectively.

An operation module 5400 is configured to operate the respective instances in the respective corresponding containers.

In one embodiment, the receiving module 5200 is specifically configured to create a user group for a user submitting the artificial intelligence task through a user group creation command, obtain working nodes allocated for the user group, obtain resource requirements of each instance in the artificial intelligence task, select at least one working node from the allocated working nodes according to the resource requirements of each instance, and receive each instance in the artificial intelligence task based on the at least one working node.

In one embodiment, the execution module 5400 is specifically configured to execute each instance in a corresponding container based on the work node allocated for each instance.

In one embodiment, the task types of the artificial intelligence tasks include artificial intelligence development tasks and artificial intelligence training tasks.

In one embodiment, the artificial intelligence task is an artificial intelligence development task, and the creating module 5300 is specifically configured to create the artificial intelligence development task through the first task creation command.

In one embodiment, each instance in the artificial intelligence development task is an artificial intelligence development instance, and the apparatus 5000 further includes a stopping module (not shown) for suspending the operation of the artificial intelligence development instance in any of the containers by a task pause command during the operation of each artificial intelligence development instance in the corresponding container.

In one embodiment, each instance in the artificial intelligence development task is an artificial intelligence development instance, and the apparatus 5000 further includes an obtaining module (not shown in the figure) for obtaining an artificial intelligence training program after each artificial intelligence development instance is run in a corresponding container, and storing the artificial intelligence training program as a mirror image file by submitting a command.

In one embodiment, the apparatus 5000 further comprises a storage module for saving the image file in a first setting folder in a predetermined storage space.

The first settings folder is a private folder created for a user submitting the artificial intelligence development task.

In one embodiment, the artificial intelligence task is an artificial intelligence training task, and the creating module 5300 is specifically configured to create the artificial intelligence training task through the second task creation command.

In one embodiment, the storage module is further configured to save training data for training the artificial intelligence training program to a second settings folder.

The second settings folder is shared by all users.

In one embodiment, each instance in the artificial intelligence training task is an artificial intelligence training instance, and the operation module 5400 is specifically configured to obtain the image file from the first setting folder, obtain the training data from the second setting folder, and operate each artificial intelligence training instance in each corresponding container based on the work node allocated to each artificial intelligence training instance, so as to train the image file by using the training data, and obtain an artificial intelligence model.

In one embodiment, the apparatus 5000 further comprises a viewing module (not shown in the figure) for viewing running state information of each artificial intelligence training instance through a first task viewing command and/or viewing running log information of each artificial intelligence training instance through a second task viewing command and/or viewing a working node for running each artificial intelligence training task and resource usage information of each artificial intelligence training task through a third task viewing command during running each artificial intelligence training instance in a corresponding container based on the working node allocated for each artificial intelligence training instance.

In one embodiment, the checking module is further configured to dynamically adjust resources required by each of the artificial intelligence training instances during running of each of the artificial intelligence training instances in the corresponding containers based on the assigned working nodes for each of the artificial intelligence training instances.

< Device example >

Corresponding to the above method embodiment, in this embodiment, an electronic device is further provided, as shown in fig. 6, which may include an artificial intelligence task submission management apparatus 5000 according to any embodiment of the disclosure, for implementing the artificial intelligence task submission management method of any embodiment of the disclosure.

As shown in fig. 7, the electronic device 6000 may further include a processor 6200 and a memory 6100, the memory 6100 for storing executable instructions, the processor 6200 for running the electronic device according to control of the instructions to perform the submission management method of the artificial intelligence task according to any embodiment of the present disclosure.

The various modules of the apparatus 5000 above may be implemented by the processor 6200 executing the instructions to perform methods according to any of the embodiments of the present disclosure.

The present invention may be a system, method, and/or computer program product. The computer program product may include a computer readable storage medium having computer readable program instructions embodied thereon for causing a processor to implement aspects of the present invention.

The computer readable storage medium may be a tangible device that can hold and store instructions for use by an instruction execution device. The computer readable storage medium may be, for example, but not limited to, an electronic storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer-readable storage medium include a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a Static Random Access Memory (SRAM), a portable compact disc read-only memory (CD-ROM), a Digital Versatile Disc (DVD), a memory stick, a floppy disk, a mechanical encoding device, punch cards or intra-groove protrusion structures such as those having instructions stored thereon, and any suitable combination of the foregoing. Computer-readable storage media, as used herein, are not to be construed as transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through waveguides or other transmission media (e.g., optical pulses through fiber optic cables), or electrical signals transmitted through wires.

The computer readable program instructions described herein may be downloaded from a computer readable storage medium to a respective computing/processing device or to an external computer or external storage device over a network, such as the internet, a local area network, a wide area network, and/or a wireless network. The network may include copper transmission cables, fiber optic transmissions, wireless transmissions, routers, firewalls, switches, gateway computers and/or edge servers. The network interface card or network interface in each computing/processing device receives computer readable program instructions from the network and forwards the computer readable program instructions for storage in a computer readable storage medium in the respective computing/processing device.

Computer program instructions for carrying out operations of the present invention may be assembly instructions, instruction Set Architecture (ISA) instructions, machine-related instructions, microcode, firmware instructions, state setting data, or source or object code written in any combination of one or more programming languages, including an object oriented programming language such as SMALLTALK, C ++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The computer readable program instructions may be executed entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider). In some embodiments, aspects of the present invention are implemented by personalizing electronic circuitry, such as programmable logic circuitry, field Programmable Gate Arrays (FPGAs), or Programmable Logic Arrays (PLAs), with state information for computer readable program instructions, which can execute the computer readable program instructions.

Various aspects of the present invention are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer-readable program instructions.

These computer readable program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. These computer readable program instructions may also be stored in a computer readable storage medium that can direct a computer, programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer readable medium having the instructions stored therein includes an article of manufacture including instructions which implement the function/act specified in the flowchart and/or block diagram block or blocks.

The computer readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer implemented process such that the instructions which execute on the computer, other programmable apparatus or other devices implement the functions/acts specified in the flowchart and/or block diagram block or blocks.

The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions. It is well known to those skilled in the art that implementation by hardware, implementation by software, and implementation by a combination of software and hardware are all equivalent.

The foregoing description of embodiments of the invention has been presented for purposes of illustration and description, and is not intended to be exhaustive or limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the various embodiments described. The terminology used herein was chosen in order to best explain the principles of the embodiments, the practical application, or the technical improvements in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein. The scope of the invention is defined by the appended claims.

Claims

1. A method of submission management of an artificial intelligence task, comprising:

Creating an artificial intelligence task according to the task type of the artificial intelligence task, wherein when the artificial intelligence task is an artificial intelligence development task, the artificial intelligence development task is created;

Receiving each instance in the artificial intelligence task, wherein when the artificial intelligence task is an artificial intelligence development task, receiving a plurality of instances in the artificial intelligence development task;

Respectively creating containers of all instances in the artificial intelligence task, wherein when the artificial intelligence task is an artificial intelligence development task, the containers of all instances in the artificial intelligence development task are created, when the artificial intelligence task is a training task, the containers of all instances in the artificial intelligence training task are received, and

And when the artificial intelligent task is an artificial intelligent training task, checking the resource use information of the artificial intelligent training examples, and dynamically adjusting the resources required by each artificial intelligent training example according to the resource use information, wherein the resource use information comprises the consumption amount of historical resources and the current consumption amount of resources.

2. The method of claim 1, wherein the receiving instances of artificial intelligence tasks comprises:

acquiring the work nodes distributed to the user group;

3. The method of claim 2, wherein the running the instances in the corresponding containers comprises:

4. The method of claim 1, wherein the artificial intelligence task is an artificial intelligence development task,

The creating artificial intelligence tasks includes:

5. The method of claim 4, wherein each instance in the artificial intelligence development task is an artificial intelligence development instance,

The method further comprises the steps of:

6. The method of claim 5, wherein the method further comprises:

storing the image file into a first setting folder in a preset storage space;

7. The method of claim 6, wherein the artificial intelligence task is an artificial intelligence training task,

The creating artificial intelligence tasks includes:

8. The method of claim 7, wherein the method further comprises:

Wherein the second set folder is shared by all users.

9. The method of claim 8, wherein each instance in the artificial intelligence training task is an artificial intelligence training instance,

Obtaining the image file from the first setting folder, and

Acquiring the training data from the second setting folder;

10. The method of claim 9, wherein during running the respective artificial intelligence training instances in the respective containers based on the assigned working nodes for the respective artificial intelligence training instances, the method further comprises:

11. The method of claim 10, wherein during running the respective artificial intelligence training instances in the respective containers based on the assigned working nodes for the respective artificial intelligence training instances, the method further comprises:

12. A commit management device of an artificial intelligence task, comprising:

the system comprises a first creating module, a second creating module and a third creating module, wherein the first creating module is used for creating an artificial intelligence task according to the task type of the artificial intelligence task, and creating the artificial intelligence development task when the artificial intelligence task is the artificial intelligence development task;

the system comprises a receiving module, a receiving module and a processing module, wherein the receiving module is used for receiving each instance in an artificial intelligence task, wherein when the artificial intelligence task is an artificial intelligence development task, a plurality of instances in the artificial intelligence development task are received;

The second creating module is used for respectively creating containers of all the instances in the artificial intelligence task, wherein when the artificial intelligence task is an artificial intelligence development task, the containers of all the instances in the artificial intelligence development task are created, when the artificial intelligence task is a training task, the containers of all the instances in the artificial intelligence training task are received, and

And when the artificial intelligent task is an artificial intelligent training task, the resource use information of the artificial intelligent training instance is checked in response to the checking command, and the resource use information is dynamically adjusted for the resources required by each artificial intelligent training instance according to the resource use information, wherein the resource use information comprises the consumption of historical resources and the current consumption of resources.

13. The apparatus of claim 12, wherein the receiving module is specifically configured to:

acquiring the work nodes distributed to the user group;

14. The apparatus of claim 13, wherein the operation module is specifically configured to:

15. The apparatus of claim 12, wherein the artificial intelligence task is an artificial intelligence development task, and the creation module is specifically configured to:

16. The apparatus of claim 15, wherein each instance in the artificial intelligence development task is an artificial intelligence development instance, the apparatus further comprising an acquisition module to:

17. The apparatus of claim 16, wherein the apparatus further comprises a storage module to:

storing the image file into a first setting folder in a preset storage space;

18. The apparatus of claim 17, wherein the artificial intelligence task is an artificial intelligence training task, and the creation module is specifically configured to:

19. The apparatus of claim 18, wherein the memory module is further configured to:

Wherein the second set folder is shared by all users.

20. The apparatus of claim 19, wherein each instance in the artificial intelligence training task is an artificial intelligence training instance, and the operation module is specifically configured to:

Obtaining the image file from the first setting folder, and

Acquiring the training data from the second setting folder;

21. The apparatus of claim 20, wherein the apparatus further comprises a view module to, during execution of the respective artificial intelligence training instances in the respective containers based on the assigned working nodes for the respective artificial intelligence training instances:

22. The apparatus of claim 21, wherein the view module is further to, during execution of the respective artificial intelligence training instance in the corresponding respective container based on the assigned working node for the respective artificial intelligence training instance:

23. An apparatus comprising at least one computing device and at least one storage device, wherein the at least one storage device is to store instructions to control the at least one computing device to perform the method according to any one of claims 1 to 11, or the apparatus implements the apparatus according to claims 12 to 22 by the computing device and the storage device.

24. A computer readable storage medium, wherein a computer program is stored thereon, which, when executed by a processor, implements the method of any of claims 1 to 11.