Disclosure of Invention
In order to solve the above problems in the prior art, a first aspect of the present disclosure provides a knowledge distillation-based model training method, wherein the method is applied to a student model and includes: setting, according to a distillation position, a second output layer that is identical to the first output layer at the distillation position; acquiring a training set, wherein the training set includes a plurality of training data; obtaining, based on the training data, first data output by the first output layer and second data output by the second output layer; acquiring, based on the training data, supervision data output by a teacher model at a teacher layer corresponding to the distillation position, wherein the teacher model is a complex model that has been trained and performs the same task as the student model; obtaining a distillation loss value according to a distillation loss function based on the gap between the supervision data and the first data, and on the second data; and updating parameters of the student model based on the distillation loss value.
In one example, the distillation loss function is set such that the gap between the supervision data and the first data is positively correlated with the distillation loss value, and such that the distillation loss value decreases as the second data increases.
In one example, the distillation loss function is:

$L_{distill}=\frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{d}\left(\frac{(y_{ij}^{t}-y_{ij}^{s})^{2}}{2\sigma_{ij}^{2}}+\frac{1}{2}\log\sigma_{ij}^{2}\right)$

wherein $y_{ij}^{t}$ is the supervision data, $y_{ij}^{s}$ is the first data, $\sigma_{ij}$ is the second data, $d$ is the dimension of the data, and $N$ is the number of training data in a batch.
In one example, the training set further includes standard annotation data in one-to-one correspondence with the training data, and the method further includes: obtaining student output data output by the student model based on the training data; and obtaining a task loss value according to a task loss function based on the standard annotation data and the student output data. Updating the parameters of the student model based on the distillation loss value includes: updating the parameters of the student model based on a total loss value of the task loss value and the distillation loss value.
In one example, the method further includes: after the total loss value falls below a training threshold over multiple iterations, deleting the second output layer to obtain the trained student model.
In one example, the distillation position includes one or more of: any feature extraction layer of the student model and a fully connected layer of the student model.
A second aspect of the present disclosure provides an image processing method, the method including: acquiring an image; and extracting image features from the image through a model to obtain an image recognition result, wherein the model is a student model obtained through the knowledge distillation-based model training method of the first aspect.
A third aspect of the present disclosure provides a knowledge distillation-based model training device applied to a student model, the device including: a model building module configured to set, according to a distillation position, a second output layer that is identical to the first output layer at the distillation position; a first acquisition module configured to acquire a training set, wherein the training set includes a plurality of training data; a data processing module configured to obtain, based on the training data, first data output by the first output layer and second data output by the second output layer; a second acquisition module configured to acquire, based on the training data, supervision data output by a teacher model at a teacher layer corresponding to the distillation position, wherein the teacher model is a complex model that has been trained and performs the same task as the student model; a loss calculation module configured to obtain a distillation loss value according to a distillation loss function based on the gap between the supervision data and the first data, and on the second data; and a parameter adjusting module configured to update parameters of the student model based on the distillation loss value.
A fourth aspect of the present disclosure provides an image processing apparatus, including: an image acquisition module configured to acquire an image; and an image recognition module configured to extract image features from the image through a model to obtain an image recognition result, wherein the model is a student model obtained through the knowledge distillation-based model training method of the first aspect.
A fifth aspect of the present disclosure provides an electronic device, including: a memory configured to store instructions; and a processor configured to invoke the instructions stored in the memory to perform the knowledge distillation-based model training method of the first aspect or the image processing method of the second aspect.
A sixth aspect of the present disclosure provides a computer-readable storage medium storing instructions that, when executed by a processor, perform the knowledge distillation-based model training method of the first aspect or the image processing method of the second aspect.
According to the knowledge distillation-based model training method and device of the present disclosure, an output layer is added at the distillation position and a corresponding loss function is used, so that during knowledge distillation the teacher model transfers the knowledge of simple data to the student model with greater weight; that is, adaptive knowledge transfer is realized, and the transfer of knowledge from dirty data and overly difficult sample data is reduced. The method and device can be adapted to any student model, and knowledge transfer can be realized at different positions as required, which ensures the training effect of a student model with a simple structure and few parameters, and thereby ensures the accuracy and reliability of the recognition results of the student model.
Detailed Description
The principles and spirit of the present disclosure will be described with reference to a number of exemplary embodiments. It is understood that these embodiments are given solely to enable those skilled in the art to better understand and to practice the present disclosure, and are not intended to limit the scope of the present disclosure in any way.
It should be noted that, although the expressions "first", "second", etc. are used herein to describe different modules, steps, data, etc. of the embodiments of the present disclosure, the expressions "first", "second", etc. are merely used to distinguish between different modules, steps, data, etc. and do not indicate a particular order or degree of importance. Indeed, the terms "first," "second," and the like are fully interchangeable.
In order to make the process of training a student network by knowledge distillation more efficient, and to transfer more reliable and useful knowledge from a complex teacher model to a student model with a more simplified structure, an embodiment of the present disclosure provides a knowledge distillation-based model training method 10 applied to a student model in a teacher-student knowledge distillation framework. As shown in fig. 1, the knowledge distillation-based model training method 10 may include steps S11-S16, which are described in detail below.
step S11, setting a second output layer identical to the first output layer of the distillation position according to the distillation position.
According to the position at which knowledge distillation is to be carried out, a simple modification is made to the student model: a second output layer is added in parallel with the original first output layer at that position. The second output layer has the same position and structure as the first output layer, but its parameters may differ, and in the following training process the parameters of the second output layer and of the first output layer are adjusted independently.
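A minimal sketch of this modification is given below, assuming a PyTorch implementation in which the distillation position is a fully connected first output layer; the class name DistillationHead and all other identifiers are illustrative only and are not part of the disclosed method.

```python
# Illustrative sketch only (assumed PyTorch implementation, not the patented code):
# wrap the first output layer at the distillation position and add a parallel
# second output layer with the same structure but independently trained parameters.
import copy
import torch
import torch.nn as nn

class DistillationHead(nn.Module):
    def __init__(self, first_output_layer: nn.Module):
        super().__init__()
        self.first_output_layer = first_output_layer
        # Same position and structure as the first output layer; the copied
        # parameters are adjusted independently during training.
        self.second_output_layer = copy.deepcopy(first_output_layer)

    def forward(self, x: torch.Tensor):
        first_data = self.first_output_layer(x)    # consumed by later layers / the task
        second_data = self.second_output_layer(x)  # consumed only by the distillation loss
        return first_data, second_data
```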
In some embodiments, the distillation position includes one or more of: any feature extraction layer of the student model and a fully connected layer of the student model. The distillation position may be selected according to actual needs; during training, a plurality of distillation positions may be selected, and the structural modification described above may be applied to each distillation position.
Step S12, a training set is obtained, the training set including a plurality of training data.
A plurality of training data for training are acquired. In the multi-iteration training, the training data are input into the model, and a loss is calculated on the result against the supervision data, so that the model parameters are updated.
Step S13, obtaining first data output by the first output layer and second data output by the second output layer based on the training data.
After the training data are input into the student model, forward propagation yields the first data through the first output layer and the second data through the second output layer arranged in parallel at the same position. Depending on the distillation position, that is, on where the first output layer and the second output layer are located, the first data and the second data may be feature data, a feature map, the logit output before the softmax of the neural network, or the like. The values of the first data and the second data may differ, but their format and dimensions are the same; for example, both are feature vector representations.
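Continuing the illustrative sketch above, the following (assumed) example attaches the hypothetical DistillationHead to a toy student model whose fully connected layer is taken as the distillation position, so that a single forward pass yields both the first data and the second data.

```python
# Illustrative only: a toy student whose .fc layer is the distillation position.
# DistillationHead is the hypothetical wrapper defined in the sketch after step S11.
import torch
import torch.nn as nn

class TinyStudent(nn.Module):
    def __init__(self, num_classes: int = 10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.fc = nn.Linear(16, num_classes)  # first output layer at the distillation position

    def forward(self, x: torch.Tensor):
        return self.fc(self.features(x))

student = TinyStudent()
student.fc = DistillationHead(student.fc)      # step S11: add the parallel second output layer

images = torch.randn(8, 3, 32, 32)             # a batch of training data (illustrative shape)
first_data, second_data = student(images)      # step S13: e.g. logit outputs before softmax
print(first_data.shape, second_data.shape)     # torch.Size([8, 10]) for both
```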
And step S14, acquiring supervision data output by a teacher model on a teacher layer corresponding to the distillation position based on the training data, wherein the teacher model is a complex model which is trained and completes the same task as the student model.
The teacher model is a more complex model: it may have more layers, a more complex structure, more parameters, and so on. As a result, the teacher model has very good performance and generalization capability, but it is difficult to deploy on some terminal devices because it requires larger storage space and more computation.
The teacher model in this embodiment is a model that has completed training and performs the same task as the student model, for example both are used for image recognition, so that the structures of their outputs are basically the same. Corresponding to the distillation position of the student model, the teacher model has a corresponding teacher layer that outputs the supervision data, and the supervision data has the same format and dimensions as the first data and the second data.
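As one possible way to realize this step, the sketch below uses a PyTorch forward hook to read out the supervision data at the teacher layer corresponding to the distillation position; the assumption that this layer is accessible as teacher.fc is purely illustrative.

```python
# Illustrative sketch of step S14: capture the teacher-layer output (supervision data)
# with a forward hook, without modifying or training the teacher model.
import torch
import torch.nn as nn

def capture_supervision_data(teacher: nn.Module, teacher_layer: nn.Module,
                             images: torch.Tensor) -> torch.Tensor:
    captured = {}

    def hook(_module, _inputs, output):
        captured["supervision_data"] = output.detach()  # the teacher is frozen

    handle = teacher_layer.register_forward_hook(hook)
    with torch.no_grad():
        teacher(images)
    handle.remove()
    return captured["supervision_data"]

# Usage (assumed names): supervision_data = capture_supervision_data(teacher, teacher.fc, images)
```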
And step S15, obtaining a distillation loss value according to the distillation loss function based on the difference between the supervision data and the first data and the second data.
In some conventional techniques, the parameters of the student model are adjusted based only on the gap between the supervision data and the first data. In the embodiment of the present disclosure, the second data is added: the distillation loss function takes as inputs the gap between the supervision data and the first data, together with the second data, to obtain the distillation loss value, and the student model updates its parameters according to the distillation loss value.
In one embodiment, the distillation loss function is set such that the gap is positively correlated with the distillation loss value, and such that the distillation loss value decreases as the second data increases. The larger the gap between the supervision data and the first data, the larger the distillation loss value, the larger the adjustment required of the student model's parameters, and the less well the output data at the distillation position of the student model expresses the features of the training data. On the other hand, an excessively large gap between the supervision data and the first data also indicates that the training data may be dirty data or overly difficult training data. For dirty data, knowledge transfer should be avoided as much as possible, while overly difficult training data may be unusual in the actual application environment, so there is little benefit in having a simplified student model learn it. Therefore, in the distillation loss function, when the gap between the supervision data and the first data is too large, the parameters of the second output layer can be adjusted to increase the second data and thereby reduce the distillation loss value. Since the training process is a process of reducing the loss value, when the student model updates its parameters according to the distillation loss value, it can increase the second data by updating the parameters of the second output layer, thereby reducing the distillation loss value. In this way, the harm caused by knowledge transfer from dirty data or overly difficult data during training is reduced, the training effect on clean data and data of appropriate difficulty is correspondingly improved, and training becomes more efficient.
In other embodiments, the distillation loss function $L_{distill}$ can be:

$L_{distill}=\frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{d}\left(\frac{(y_{ij}^{t}-y_{ij}^{s})^{2}}{2\sigma_{ij}^{2}}+\frac{1}{2}\log\sigma_{ij}^{2}\right)$

wherein $y_{ij}^{t}$ is the supervision data, $y_{ij}^{s}$ is the first data, $\sigma_{ij}$ is the second data, $d$ represents the dimension of the data, and $N$ represents the number of training data in a batch. The supervision data, the first data and the second data are all data of the corresponding distillation position and have the same form, for example $d$-dimensional feature vectors. The supervision data is output by the teacher model and conveys the knowledge to be transferred; the first data is output by the first output layer at the distillation position of the student model and represents the extracted features; and the second data is output by the second output layer arranged at the distillation position and represents the confidence of the training data, playing the role of an adaptive weight so that, when the training data is dirty data or the like, its influence on the other parameters of the model can be reduced.
Under the condition that the loss value exceeds the threshold, the parameters of the student model need to be updated to reduce the loss value, and the student model can bring the first data closer to the supervision data by updating its parameters, thereby reducing the distillation loss value. In the present embodiment, as can be seen from the gap term of the formula, besides reducing the loss value by updating the parameters of the student model so that the first data approaches the supervision data, the student model can also, when the gap between the supervision data and the first data is too large, reduce the distillation loss value by updating the parameters of the second output layer in particular so as to increase the second data. In that case the adjustment of the other parameters of the student model is reduced, and the adverse effect of dirty data or overly difficult data on the student model is reduced. Meanwhile, in order to prevent the student model from adjusting the parameters of the second output layer too heavily and merely increasing the second data to reduce the distillation loss value, which would keep knowledge from being transferred from the teacher model, a restriction term $\frac{1}{2}\log\sigma_{ij}^{2}$ is also set in the formula: as the second data increases, the value of the restriction term increases, so that the student model is prevented from blindly adjusting the parameters of the second output layer only to increase the second data.
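The sketch below implements a distillation loss with the properties described above: the gap term is scaled down by the second data, and a logarithmic restriction term grows with it. Because the original formula is reconstructed here, this should be read as one plausible realization rather than the definitive loss of the disclosure.

```python
# Illustrative implementation of the reconstructed distillation loss of step S15.
import torch

def distillation_loss(supervision_data: torch.Tensor,  # y^t, shape (N, d), from the teacher layer
                      first_data: torch.Tensor,        # y^s, shape (N, d), from the first output layer
                      second_data: torch.Tensor,       # sigma, shape (N, d) or (N, 1), from the second output layer
                      eps: float = 1e-6) -> torch.Tensor:
    var = second_data.pow(2).clamp_min(eps)                 # sigma^2, kept strictly positive
    gap = (supervision_data.detach() - first_data).pow(2)   # (y^t - y^s)^2, element-wise
    per_element = gap / (2.0 * var) + 0.5 * torch.log(var)  # gap term + restriction term
    return per_element.sum(dim=1).mean()                    # sum over d, average over N
```

Increasing the second data shrinks the gap term but inflates the logarithmic term, which is exactly the restriction behavior described above.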
And step S16, updating parameters of the student model based on the distillation loss value.
Updating the parameters of the model includes all parameters of the student model; adjusting the parameters of the first output layer has the strongest influence on the first data, and adjusting the parameters of the second output layer has the strongest influence on the second data. In this way, the parameters at the distillation position can be trained and updated better according to the distillation loss value, the distillation loss function of any of the above embodiments reduces the adverse effect of dirty data or overly difficult data on the student model, training becomes more efficient, and the output of the trained student model is more accurate.
In one embodiment, as shown in FIG. 2, in the knowledge distillation-based model training method 10 the training set further includes standard annotation data in one-to-one correspondence with the training data, and the knowledge distillation-based model training method 10 further includes: step S17, obtaining student output data output by the student model based on the training data; and step S18, obtaining a task loss value according to a task loss function based on the standard annotation data and the student output data. Accordingly, step S16 includes: updating the parameters of the student model based on a total loss value of the task loss value and the distillation loss value.
In order to train the student model completely and better, task-related training is also carried out. According to the actual task to be completed by the model, the training data are input into the student model in step S17, and final student output data, such as recognition results or clustering results, are output through all the structures of the student model. Then, in step S18, a task loss value is obtained according to the task loss function by comparing the student output data with the standard annotation data of the training data. Finally, in step S16, the parameters of the student model are updated according to the total loss value of the task loss value and the distillation loss value. In this way, the student model is trained comprehensively and accurately, and the influence of local distillation on the overall result of the model is avoided.
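Under the same illustrative assumptions as the previous sketches (the hypothetical TinyStudent, DistillationHead, capture_supervision_data and distillation_loss, plus a pretrained teacher with a matching teacher.fc layer), one training iteration combining the task loss and the distillation loss might look as follows.

```python
# Illustrative training step only; `student` and `teacher` come from the earlier sketches.
import torch
import torch.nn as nn

task_loss_fn = nn.CrossEntropyLoss()                       # task loss for, e.g., image recognition
optimizer = torch.optim.SGD(student.parameters(), lr=0.01)

def train_step(images: torch.Tensor, labels: torch.Tensor) -> float:
    first_data, second_data = student(images)                                  # steps S13 / S17
    supervision_data = capture_supervision_data(teacher, teacher.fc, images)   # step S14
    l_distill = distillation_loss(supervision_data, first_data, second_data)   # step S15
    # In this toy setup the distillation position is the final layer, so the
    # first data doubles as the student output data used by the task loss.
    l_task = task_loss_fn(first_data, labels)                                  # step S18
    total_loss = l_task + l_distill                                            # total loss value
    optimizer.zero_grad()
    total_loss.backward()
    optimizer.step()                                                           # step S16
    return total_loss.item()
```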
In one embodiment, the knowledge distillation-based model training method 10 further includes deleting the second output layer after the total loss value falls below a training threshold over multiple iterations, so as to obtain the trained student model. When the output of the student model has sufficiently converged, that is, when the total loss value is lower than the training threshold, the parameter update of the student model can be considered to meet the requirement, and the added second output layer can be deleted to obtain the trained student model. In the embodiment provided by the present disclosure, the second output layer is only used to provide the second data during the knowledge distillation-based model training so as to achieve the effects described in the foregoing embodiments; once the parameter update meets the requirement, the second output layer is no longer needed, the original structure of the student model can be restored, the storage space required by the student model is reduced, and unnecessary computation is avoided. This is particularly significant when distillation is performed at multiple distillation positions simultaneously. Restoring the original structure of the student model in this way also improves the universality of the embodiments of the present disclosure.
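Continuing the same illustrative assumptions, removing the auxiliary second output layer once training has converged can be as simple as unwrapping the hypothetical DistillationHead so that only the original first output layer remains.

```python
# Illustrative only: restore the original student structure after convergence.
def finalize_student(student: TinyStudent) -> TinyStudent:
    if isinstance(student.fc, DistillationHead):
        student.fc = student.fc.first_output_layer  # drop the second output layer
    return student

# Usage: once the total loss stays below the training threshold,
#   student = finalize_student(student)
#   torch.save(student.state_dict(), "student.pt")
```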
Based on the same inventive concept, fig. 3 illustrates an image processing method 20 provided by an embodiment of the present disclosure, which includes: step S21, acquiring an image; and step S22, extracting image features from the image through a model to obtain an image recognition result, where the model is a student model obtained by the knowledge distillation-based model training method 10 according to any one of the foregoing embodiments. In some scenarios the student model is used for image recognition; the student model obtained by the knowledge distillation-based model training method 10 is trained more efficiently, has a simple structure and a higher operation speed, can be used in terminal devices, and ensures the accuracy of the image processing result.
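A minimal inference sketch for the image processing method, assuming the finalized student model from the earlier sketches and an illustrative 3x32x32 input tensor, is shown below.

```python
# Illustrative sketch of steps S21-S22: run the trained, simplified student on one image.
import torch

def recognize(student: torch.nn.Module, image: torch.Tensor) -> int:
    student.eval()
    with torch.no_grad():
        logits = student(image.unsqueeze(0))   # extract image features and classify
    return int(logits.argmax(dim=1).item())    # the image recognition result (class index)
```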
Based on the same inventive concept, fig. 4 shows a knowledge distillation-based model training device 100 provided by an embodiment of the present disclosure, which is applied to a student model. As shown in fig. 4, the knowledge distillation-based model training device 100 includes: a model building module 110 configured to set, according to a distillation position, a second output layer identical to the first output layer at the distillation position; a first acquisition module 120 configured to acquire a training set, where the training set includes a plurality of training data; a data processing module 130 configured to obtain, based on the training data, first data output by the first output layer and second data output by the second output layer; a second acquisition module 140 configured to acquire, based on the training data, supervision data output by the teacher model at a teacher layer corresponding to the distillation position, where the teacher model is a complex model that has been trained and performs the same task as the student model; a loss calculation module 150 configured to obtain a distillation loss value according to a distillation loss function based on the gap between the supervision data and the first data, and on the second data; and a parameter adjusting module 160 configured to update the parameters of the student model based on the distillation loss value.
In one example, the distillation loss function is set such that the gap between the supervision data and the first data is positively correlated with the distillation loss value, and such that the distillation loss value decreases as the second data increases.
In one example, the distillation loss function is:

$L_{distill}=\frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{d}\left(\frac{(y_{ij}^{t}-y_{ij}^{s})^{2}}{2\sigma_{ij}^{2}}+\frac{1}{2}\log\sigma_{ij}^{2}\right)$

wherein $y_{ij}^{t}$ is the supervision data, $y_{ij}^{s}$ is the first data, $\sigma_{ij}$ is the second data, $d$ is the dimension of the data, and $N$ is the number of training data in a batch.
In one example, the training set further includes standard annotation data in one-to-one correspondence with the training data; the data processing module 130 is further configured to obtain student output data output by the student model based on the training data; the loss calculation module 150 is further configured to obtain a task loss value according to a task loss function based on the standard annotation data and the student output data; and the parameter adjusting module 160 is further configured to update the parameters of the student model based on a total loss value of the task loss value and the distillation loss value.
In one example, the model building module 110 is further configured to delete the second output layer after the total loss value falls below a training threshold over multiple iterations, so as to obtain the trained student model.
In one example, the distillation positions include one or more of: any feature extraction layer in the student model and a full connection layer of the student model.
With respect to the knowledge distillation-based model training device 100 in the above embodiment, the specific manner in which each module performs its operations has been described in detail in the embodiments related to the method and will not be detailed here.
Based on the same inventive concept, fig. 5 illustrates an image processing apparatus 200 according to an embodiment of the disclosure. As shown in fig. 5, the image processing apparatus 200 includes: an image acquisition module 210 configured to acquire an image; and an image recognition module 220 configured to extract image features from the image through a model to obtain an image recognition result, where the model is a student model obtained through the knowledge distillation-based model training method 10 according to any one of the foregoing embodiments.
With regard to the image processing apparatus 200 in the above-described embodiment, the specific manner in which the respective modules perform operations has been described in detail in the embodiment related to the method, and will not be described in detail here.
As shown in fig. 6, one embodiment of the present disclosure provides an electronic device 300. The electronic device 300 includes a memory 301, a processor 302, and an Input/Output (I/O) interface 303. The memory 301 is used for storing instructions. The processor 302 is used for calling the instructions stored in the memory 301 to execute the knowledge distillation-based model training method or the image processing method of the embodiments of the present disclosure. The processor 302 is connected to the memory 301 and the I/O interface 303, for example via a bus system and/or another form of connection mechanism (not shown). The memory 301 may be used to store programs and data, including programs of the knowledge distillation-based model training method or the image processing method involved in the embodiments of the present disclosure, and the processor 302 executes various functional applications and data processing of the electronic device 300 by running the programs stored in the memory 301.
The processor 302 in the embodiment of the present disclosure may be implemented in at least one hardware form of a Digital Signal Processor (DSP), a Field-Programmable Gate Array (FPGA), or a Programmable Logic Array (PLA), and the processor 302 may be one or a combination of a Central Processing Unit (CPU) and other processing units with data processing capability and/or instruction execution capability.
Memory 301 in the disclosed embodiments may comprise one or more computer program products, which may include various forms of computer-readable storage media, such as volatile memory and/or non-volatile memory. The volatile Memory may include, for example, a Random Access Memory (RAM), a cache Memory (cache), and/or the like. The non-volatile Memory may include, for example, a Read-only Memory (ROM), a Flash Memory (Flash Memory), a Hard Disk Drive (HDD), a Solid-State Drive (SSD), or the like.
In the embodiment of the present disclosure, the I/O interface 303 may be used to receive input instructions (e.g., numeric or character information, and generate key signal inputs related to user settings and function control of the electronic device 300, etc.), and may also output various information (e.g., images or sounds, etc.) to the outside. The I/O interface 303 in the embodiments of the present disclosure may include one or more of a physical keyboard, function keys (such as volume control keys, switch keys, etc.), a mouse, a joystick, a trackball, a microphone, a speaker, a touch panel, and the like.
It is to be understood that although operations are depicted in the drawings in a particular order, this is not to be understood as requiring that such operations be performed in the particular order shown or in serial order, or that all illustrated operations be performed, to achieve desirable results. In certain environments, multitasking and parallel processing may be advantageous.
The methods and apparatus related to embodiments of the present disclosure can be accomplished with standard programming techniques with rule-based logic or other logic to accomplish the various method steps. It should also be noted that the words "means" and "module" as used herein and in the claims are intended to encompass implementations using one or more lines of software code, and/or hardware implementations, and/or equipment for receiving input.
Any of the steps, operations, or procedures described herein may be performed or implemented using one or more hardware or software modules, alone or in combination with other devices. In one embodiment, the software modules are implemented using a computer program product comprising a computer readable medium containing computer program code, which is executable by a computer processor for performing any or all of the described steps, operations, or procedures.
The foregoing description of the implementations of the disclosure has been presented for purposes of illustration and description. It is not intended to be exhaustive or to limit the disclosure to the precise form disclosed, and modifications and variations are possible in light of the above teachings or may be acquired from practice of the disclosure. The embodiments were chosen and described in order to explain the principles of the disclosure and its practical application to enable one skilled in the art to utilize the disclosure in various embodiments and with various modifications as are suited to the particular use contemplated.