Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
With the development of network technology and intelligent mobile platforms, live broadcast and mobile live broadcast have become part of people's daily life. If video transmitted over the network is not monitored, it can easily become a means of spreading obscenity, pornography and violence, victimizing large numbers of network users. In order to monitor the content of video transmitted over the network, sensitive images need to be identified from the video. However, because the number of live broadcast platforms is huge, manual monitoring is laborious and consumes a great deal of cost. In the traditional method, sensitive images can be identified through a feature matching algorithm, but live broadcast environments are diverse, illumination variation is strong, resolution is low, and human body postures differ markedly, so accurate classification cannot be achieved through a simple feature matching algorithm. In addition, when the training sample size is too small and the training method is too simple, sensitive images with complicated and varied content cannot be reliably identified.
In the related art, sensitive images can be identified by a deep learning method, for example a convolutional neural network, which has achieved good results in the field of image identification. Specifically, high-level semantic features of the images are extracted through a pre-trained detection model, and the images are classified based on these high-level semantic features. However, in a live broadcast scene, video images have various sources and complex structures, and sensitive images are difficult to identify accurately and effectively in this manner, so the omission rate and the false detection rate of sensitive images are high. Based on the above, the image classification method, the training method of the feature extraction network and the training device of the feature extraction network provided by the embodiments of the present invention can be applied to devices such as mobile phones and computers, and in particular to devices with network live broadcast or network video playing functions.
For the sake of understanding the present embodiment, first, a detailed description will be given of an image classification method disclosed in the present embodiment, as shown in fig. 1, where the method includes the following steps:
step S102, inputting a target image into a feature extraction network which is trained in advance, and outputting image features of the target image;
The target image may be a video image propagated over a network or a video image in a network live platform; for example, a live scene image is usually an image containing a person. The pre-trained feature extraction network may be a network model such as a CNN (Convolutional Neural Network), an RNN (Recurrent Neural Network) or a DNN (Deep Neural Network), and usually contains multiple convolutional layers and may also contain a plurality of activation functions. The image features described above typically comprise multi-level features of the target image, and may include, for example, one or more of low-level (color, texture, etc.) features, middle-level (shape, etc.) features, or high-level (semantic, etc.) features of the target image. The image feature may be a feature vector.
Specifically, the target image may be represented as X ∈ R^{H×W×3}, where H represents the height of the target image, W represents its width, and 3 indicates that the target image is a three-channel image. The target image X ∈ R^{H×W×3} with size H×W×3 is input into the feature extraction network, the level features of at least two levels, which are at least two of the low-level features (color, texture, etc.), the middle-level features (shape, action, etc.) or the high-level features (semantics, etc.) of the target image, are extracted through the multi-level feature extraction layers therein, and the at least two level features are then fused to obtain the image features; in particular, feature fusion may be performed by splicing (concatenation).
Step S104, determining the category of the target image based on the image features, wherein the feature extraction network comprises a plurality of serially connected level feature extraction layers, each level feature extraction layer is used for outputting the level feature corresponding to the current level, and the image features are obtained by fusing the level features corresponding to at least two levels.
The categories of the target image may be of various kinds, for example a normal image and a sensitive image, or a normal image, a vulgar image, a pornographic image and a violent image, where vulgar, pornographic and violent images all belong to sensitive images. In actual implementation, the image features comprise multi-level features of the target image, such as features of the background and the person in the target image, the shape of objects, the semantics of characters, and the like, so the probability of each category of the target image can be determined from the image features by calculating probabilities, and the category of the target image is finally determined according to the calculated probability of each category. The category of the target image can also be obtained through a classifier according to the image features.
There are at least two multi-level feature extraction layers, for example two, three, four or more. Generally, the more feature extraction layers there are, the richer the level features finally extracted from the target image and the better the performance, but the time spent extracting features increases and the speed decreases; therefore, the number of feature extraction layers can be set according to the classification speed and precision required in practice. Each level feature extraction layer may include a plurality of convolutional networks (which may also be referred to as convolutional layers) and a plurality of activation functions. The function of a convolutional layer is the convolution operation, whose purpose is to extract different features of the target image: the first convolutional layer usually can only extract low-level image features such as edges, lines and corners, and deeper convolutional layers can iteratively extract more complex image features from the low-level features. Therefore, for the multi-level feature extraction layers, the image information of the level features output by each level feature extraction layer differs: for example, the level features output by a low-level feature extraction layer comprise relatively simple features such as the background color and texture of the target image, the level features output by a middle-level feature extraction layer comprise features such as the shape of each object in the target image, the person's action, skin color and area, and the level features output by a higher-level feature extraction layer comprise features such as the semantics of characters in the target image. The level features corresponding to different levels are fused to obtain the image features of the target image; a sketch of such serially connected extraction layers is given below. In this way, various level features can be extracted so as to enrich the finally fused image features.
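As an illustrative, non-limiting sketch (PyTorch is assumed here; the embodiment does not prescribe a specific framework, layer widths or number of levels), a set of serially connected level feature extraction layers could be built as follows, each level being a convolutional block whose output is the level feature of that level:

```python
import torch
import torch.nn as nn

class HierarchicalExtractor(nn.Module):
    """Hypothetical stack of serially connected level feature extraction layers."""
    def __init__(self, in_channels=3, widths=(32, 64, 128, 256, 512)):
        super().__init__()
        blocks, c_in = [], in_channels
        for c_out in widths:
            # one level feature extraction layer: convolution + activation + downsampling
            blocks.append(nn.Sequential(
                nn.Conv2d(c_in, c_out, kernel_size=3, padding=1),
                nn.ReLU(inplace=True),
                nn.MaxPool2d(2)))
            c_in = c_out
        self.levels = nn.ModuleList(blocks)

    def forward(self, x):                      # x: target image X ∈ R^{H×W×3}, batched as (N, 3, H, W)
        level_features = []
        for layer in self.levels:
            x = layer(x)                       # level feature of the current level
            level_features.append(x)
        return level_features                  # f1 ... f5, from low level to high level

# example: the level features of layers 1, 3 and 5 are the ones fused later (cf. fig. 3)
feats = HierarchicalExtractor()(torch.randn(1, 3, 224, 224))
f1, f3, f5 = feats[0], feats[2], feats[4]
```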
The above image classification method inputs a target image into a pre-trained feature extraction network and outputs image features of the target image, and determines the category of the target image based on the image features, wherein the feature extraction network comprises multiple serially connected level feature extraction layers, each level feature extraction layer outputs the level feature corresponding to the current level, and the image features are obtained by fusing the level features corresponding to at least two levels. In this manner, the image features comprise level features of at least two levels, so the feature levels contained in the image features are richer, and the image features can be used to recognize images in complex scenes such as live broadcast, so that sensitive images can be accurately and effectively recognized, and the omission rate and the false detection rate of sensitive images are reduced.
The present embodiment also provides another image classification method, which is implemented on the basis of the above embodiment, and the present embodiment focuses on the implementation procedure of the step of inputting the target image into the feature extraction network that is trained in advance, outputting the image features of the target image (implemented through steps S202-S204), and the implementation procedure of the step of determining the category of the target image based on the image features (implemented through steps S206-S208);
In this embodiment, the feature extraction network further comprises at least two feature processing modules and a feature fusion module. Each feature processing module is connected with one level feature extraction layer, and the feature extraction layers connected with any two feature processing modules are different. The feature processing modules are used for further processing the level features output by the corresponding feature extraction layers to obtain more accurate and more discriminative features, and the feature fusion module is used for fusing the features output by each feature processing module. The number of feature processing modules is generally less than or equal to the number of feature extraction layers.
As shown in fig. 2, the method comprises the steps of:
Step S202, processing the hierarchical features output by the feature extraction layer connected with the feature processing modules based on an attention mechanism through each feature processing module, and outputting intermediate features;
In order to make the level features output by the feature extraction layer more accurate and the feature information more salient, the level features can be processed through a feature processing module. The attention mechanism works somewhat like the human retina, in which different parts have different degrees of information processing capability: the level features output by the feature extraction layer connected with the feature processing module are scanned to obtain the target features that need attention, more attention resources are then devoted to these features to acquire more detailed information related to the target features, and other irrelevant information is ignored. By means of this mechanism, high-value features can be rapidly screened out of the large amount of information in the level features using limited attention resources, and intermediate features are then output.
The at least two feature processing modules comprise a first feature processing module, a second feature processing module and a third feature processing module, wherein the first feature processing module is connected with a feature extraction layer of the lowest level, the feature extraction layer of the lowest level is used for inputting a target image, the second feature processing module is connected with a feature extraction layer appointed in the middle level, and the third feature processing module is connected with the feature extraction layer of the highest level.
Referring to the schematic structural diagram of the feature extraction network shown in fig. 3, the feature extraction network is illustrated by taking an example that the feature extraction network comprises five feature extraction layers connected in series, namely a feature extraction layer 1, a feature extraction layer 2, a feature extraction layer 3, a feature extraction layer 4 and a feature extraction layer 5, wherein the feature extraction layer 1 corresponds to the feature extraction layer of the lowest level, the feature extraction layer 2, the feature extraction layer 3 and the feature extraction layer 4 correspond to the feature extraction layer of the middle level, and the feature extraction layer 5 corresponds to the feature extraction layer of the highest level. In addition, the feature extraction network comprises three feature processing modules, namely a first feature processing module, a second feature processing module and a third feature processing module, which are respectively connected with the feature extraction layer 1, the feature extraction layer 3 and the feature extraction layer 5.
In addition, as shown in fig. 3, the feature processing module further includes a pooling layer, a first full-connection layer, and a feature multiplication layer;
The pooling layer is mainly used for downsampling the input features to reduce the number of parameters. The first fully connected layer (Fully Connected layer, abbreviated FC) plays the role of a classifier in the whole convolutional neural network: it performs a weighted sum over the features output by the preceding layer and maps the feature space to the sample label space through a linear transformation. The feature multiplication layer (multiply) is mainly used for multiplying the level features by the features output by the first fully connected layer.
One possible implementation is:
The method comprises the steps of carrying out first pooling processing on input level features through a pooling layer, outputting a first pooling result, carrying out first full-connection processing on the first pooling result through a first full-connection layer, outputting a first full-connection result, multiplying the input level features and the first full-connection result through a feature multiplication layer to obtain a multiplication result, and outputting intermediate features based on the multiplication result.
Specifically, referring to the data flow shown in fig. 3 and taking the first feature processing module as an example, the target image X ∈ R^{H×W×3} with size H×W×3 is first input to the lowest-level feature extraction layer in the feature extraction network to obtain the level feature, i.e. the feature matrix f1 ∈ R^{h1×w1×c1}, where h1 represents the height of the feature matrix, w1 its width and c1 its number of channels. The level feature f1 ∈ R^{h1×w1×c1} is input to the pooling layer in the first feature processing module for the first pooling processing, and the first pooling result, i.e. the feature vector f1' ∈ R^{c1}, is output; the first pooling result f1' ∈ R^{c1} is input to the first fully connected layer for the first full-connection processing, and the first full-connection result f1'' ∈ R^{c1} is output; the level feature f1 ∈ R^{h1×w1×c1} and the first full-connection result f1'' ∈ R^{c1} are multiplied to obtain the multiplication result f1''' ∈ R^{h1×w1×c1}; finally, the intermediate feature is output based on the multiplication result, and this intermediate feature can represent the low-level features of the target image.
Referring to fig. 3, the feature processing module further includes a spatial pyramid pooling layer, which is connected with the feature multiplication layer; the spatial pyramid pooling layer (Spatial Pyramid Pooling, SPP) is mainly used for processing features of different levels to obtain features with the same dimension.
In the step of outputting the intermediate feature based on the multiplication result, a possible implementation manner is to perform second pooling processing on the multiplication result through a spatial pyramid pooling layer, and output the intermediate feature with a specified dimension. Wherein the specified dimension can be set according to the actual application.
After the multiplication result f1''' ∈ R^{h1×w1×c1} is obtained in the foregoing manner, the multiplication result f1''' ∈ R^{h1×w1×c1} may be input to the spatial pyramid pooling layer, the second pooling processing is performed on the multiplication result, and the intermediate feature of the specified dimension, i.e. the feature vector f1'''' ∈ R^{c}, is output.
Similarly, taking the second feature processing module as an example, the level feature f3 ∈ R^{h3×w3×c3} output by the designated middle-level feature extraction layer 3 is input into the second feature processing module to obtain the second intermediate feature f3'''' ∈ R^{c}; taking the third feature processing module as an example, the level feature f5 ∈ R^{h5×w5×c5} output by the highest-level feature extraction layer 5 is input into the third feature processing module to obtain the third intermediate feature f5'''' ∈ R^{c}. The spatial pyramid pooling layers in the first, second and third feature processing modules respectively output intermediate features with the same dimension.
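The data flow of a single feature processing module described above (first pooling, first full connection, feature multiplication, then spatial pyramid pooling to the specified dimension c) could be sketched as follows. This is only an assumed PyTorch realization: the gating non-linearity after the first fully connected layer and the linear projection that maps the spatial pyramid pooling output to the specified dimension are assumptions not stated in the embodiment.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureProcessingModule(nn.Module):
    """Assumed sketch: channel attention (pool -> FC -> multiply) followed by SPP."""
    def __init__(self, channels, out_dim, spp_bins=(1, 2, 4)):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)                       # first pooling: f -> f' ∈ R^{c1}
        self.fc1 = nn.Sequential(nn.Linear(channels, channels),   # first full connection: f' -> f''
                                 nn.Sigmoid())                    # gating assumed, as in SE-style attention
        self.spp_bins = spp_bins
        self.project = nn.Linear(channels * sum(b * b for b in spp_bins), out_dim)  # to the specified dimension c

    def forward(self, f):                                         # f: level feature (N, c1, h1, w1)
        n, c, _, _ = f.shape
        w = self.fc1(self.pool(f).view(n, c))                     # first full-connection result f'' ∈ R^{c1}
        f_mul = f * w.view(n, c, 1, 1)                            # multiplication result f''' ∈ R^{h1×w1×c1}
        # second pooling: spatial pyramid pooling to a fixed-length vector
        pooled = [F.adaptive_avg_pool2d(f_mul, b).view(n, -1) for b in self.spp_bins]
        return self.project(torch.cat(pooled, dim=1))             # intermediate feature f'''' ∈ R^{c}

# example: three modules attached to extraction layers 1, 3 and 5, all outputting the same dimension
m1, m3, m5 = (FeatureProcessingModule(ch, out_dim=128) for ch in (32, 128, 512))
```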
Step S204, fusing the intermediate features output by each feature processing module through a feature fusion module to obtain image features;
specifically, the intermediate features output by each feature processing module, namely the first feature processing module, the second feature processing module and the third feature processing module, are input to a feature fusion module, and each intermediate feature is fused in a feature splicing mode and the like to obtain a multi-level fusion feature, namely the image feature.
Referring to the schematic structural diagram of the feature extraction network shown in fig. 3, the feature fusion module includes a feature splicing layer, a second fully connected layer and a third fully connected layer, where the feature splicing layer (concatenation) is mainly used to splice the intermediate features f1'''' ∈ R^{c}, f3'''' ∈ R^{c} and f5'''' ∈ R^{c}, and the second and third fully connected layers have the same function as the first fully connected layer.
One possible implementation of the step of fusing the intermediate features output by each feature processing module through the feature fusion module to obtain the image features is as follows:
The method comprises the steps of performing splicing processing on intermediate features output by each feature processing module through a feature splicing layer to output spliced features, performing second full-connection processing on the spliced features through a second full-connection layer to output second full-connection results, performing third full-connection processing on the second full-connection results through a third full-connection layer to output image features.
Specifically, the intermediate features f1'''' ∈ R^{c}, f3'''' ∈ R^{c} and f5'''' ∈ R^{c} of the target image X are subjected to splicing processing and the spliced feature f ∈ R^{3c} is output; the spliced feature f ∈ R^{3c} is input to the second fully connected layer for the second full-connection processing and the second full-connection result is output; the second full-connection result is input to the third fully connected layer for the third full-connection processing, and the output vector of the network, i.e. the image feature z ∈ R^{3}, is output.
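A corresponding sketch of the feature fusion module, again under the assumption of a PyTorch implementation with an assumed hidden width for the second fully connected layer, might look like this:

```python
import torch
import torch.nn as nn

class FeatureFusionModule(nn.Module):
    """Assumed sketch: splice the intermediate features, then two fully connected layers."""
    def __init__(self, c, hidden=256, out_dim=3):
        super().__init__()
        self.fc2 = nn.Sequential(nn.Linear(3 * c, hidden), nn.ReLU(inplace=True))  # second full connection
        self.fc3 = nn.Linear(hidden, out_dim)                                       # third full connection

    def forward(self, f1, f3, f5):                    # intermediate features, each ∈ R^{c}
        f = torch.cat([f1, f3, f5], dim=1)            # splicing: f ∈ R^{3c}
        return self.fc3(self.fc2(f))                  # image feature z, here z ∈ R^{3}

# example with c = 128 and a batch of one image
fusion = FeatureFusionModule(c=128)
z = fusion(torch.randn(1, 128), torch.randn(1, 128), torch.randn(1, 128))
```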
Step S206, inputting the image features into a preset normalized exponential function and outputting probability distribution vectors, wherein the probability distribution vectors comprise a plurality of categories and probability values corresponding to each category;
The normalized exponential function may be a softmax function, the probability distribution vector may be represented as p, and the probability distribution vector may be calculated by:

p_i = e^{z_i} / Σ_{j=1}^{m} e^{z_j}

wherein p represents the probability distribution vector, z represents the image feature, m represents the dimension of z (i.e. the number of preset categories), p_i and z_i represent the i-th elements of p and z respectively, and e represents the natural constant;
In step S208, the category corresponding to the maximum probability value in the probability distribution vector is determined as the category of the target image.
Specifically, the index of the largest probability value in the probability distribution vector, i.e. the category of the target image, can be determined through the formula k = argmax_i(p_i). Taking three preset categories as an example, k = 1 indicates that the target image is a normal image, k = 2 indicates that the target image is a vulgar image, and k = 3 indicates that the target image is a pornographic image.
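Steps S206-S208 then amount to a softmax followed by an argmax; a minimal illustration (the feature values below are made up) is:

```python
import torch

z = torch.tensor([0.2, 2.1, -0.5])        # hypothetical image feature z ∈ R^{3}
p = torch.softmax(z, dim=0)               # probability distribution vector p, p_i = e^{z_i} / Σ_j e^{z_j}
k = int(torch.argmax(p)) + 1              # k = argmax_i(p_i), converted to the 1-based category index
# k == 2 here, i.e. the target image would be classified as a vulgar image
```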
According to the above method, multiple level features of the target image can be extracted through the plurality of feature extraction layers included in the feature extraction network and the first, second and third feature processing modules connected with different feature extraction layers; the level features are processed by the feature processing modules to obtain intermediate features, which increases the discriminability of the level features and makes the image information contained in the intermediate features more accurate and richer; the intermediate features are fused by the feature fusion module to obtain the image features, and the category of the target image is determined based on the image features. The features of the target image do not need to be designed manually: features effective for image classification are automatically extracted through the convolutional neural network, so the algorithm generalizes well and is robust, sensitive images in live broadcast scenes can be effectively recognized from the image features, the recognition accuracy of the feature extraction network is improved, and the omission rate and the false detection rate of sensitive images are reduced.
For live broadcast scenes, live broadcast images can be classified in the above manner so as to identify sensitive images among them, achieving intelligent monitoring of network live broadcast rooms while reducing labor cost.
The embodiment also provides a training method of the feature extraction network, as shown in fig. 4, the method includes the following steps:
Step S402, determining a training sample based on a preset sample set, wherein the training sample comprises a sample image and a category label of the sample image;
Specifically, a detailed image classification standard can be designed in advance, which may include normal, vulgar, pornographic and violent categories (for example, a landscape is a normal image, exposed genitalia is a pornographic image, kissing is a vulgar image, a knife attack is a violent image, and so on). A data set D can be obtained by manually labeling massive live broadcast images according to this standard, and a part of the data set is used as training samples Dtrain and the rest as test samples Dtest according to a certain proportion; specifically, the data set D can be divided into training samples Dtrain and test samples Dtest at a ratio of 10:1. A training sample comprises a sample image and a category label of the sample image; there may be four category labels, i.e. the labeling result can be expressed as y ∈ {1, 2, 3, 4}, where 1 represents normal, 2 represents vulgar, 3 represents pornographic and 4 represents violent. The image categories are not limited to those described in this application, and may also include other categories such as leaking state secrets or violating laws and regulations.
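The following is a hedged sketch of how such a labelled data set could be divided into training and test samples at the 10:1 ratio mentioned above; the file names and the way labels are stored are hypothetical:

```python
import random

# hypothetical manually labelled data set D: (image path, label y) with
# y ∈ {1, 2, 3, 4} for normal, vulgar, pornographic and violent images
D = [("live_frame_000001.jpg", 1), ("live_frame_000002.jpg", 3), ("live_frame_000003.jpg", 2)]

random.shuffle(D)
split = len(D) * 10 // 11                  # 10 parts training samples, 1 part test samples
D_train, D_test = D[:split], D[split:]
```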
Step S404, inputting a sample image into a feature extraction network to obtain sample features of the sample image, wherein the feature extraction network comprises a plurality of serially connected feature extraction layers, each of which is used for outputting the corresponding level features of the current level;
Step S406, determining a class identification result of the sample image based on the sample characteristics, determining a loss value based on a preset class loss function, a class label and the class identification result, and updating network parameters of the characteristic extraction network based on the loss value;
step S408, the step of determining training samples based on the preset sample set is continuously executed until the loss value converges, and the trained feature extraction network is obtained.
Specifically, the class identification result z of the sample image may be input to the softmax function, and the probability distribution vector p is calculated by the formula p_i = e^{z_i} / Σ_{j=1}^{m} e^{z_j}; the loss value is then calculated by the equation L = -log(p_y), where y is the label of the training sample image. Finally, the derivatives ∂L/∂W of the loss value with respect to all network parameters W of the feature extraction network are calculated through the back propagation algorithm, and the network parameters of the feature extraction network are updated based on the calculated loss value by stochastic gradient descent, namely:

W ← W − α·∂L/∂W

wherein α represents the learning rate (a preset hyper-parameter, commonly 0.01 or 0.001); the trained feature extraction network is obtained by continuously and iteratively updating the parameters of the feature extraction network in this way until the loss value converges.
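A possible training loop implementing the above procedure (softmax cross-entropy loss L = -log(p_y), back propagation of ∂L/∂W, and a stochastic gradient descent update with learning rate α) is sketched below; PyTorch is assumed, the network and data loader objects are placeholders, and the labels are assumed to have been converted to 0-based class indices:

```python
import torch
import torch.nn as nn

def train(network, train_loader, alpha=0.01, epochs=10):
    criterion = nn.CrossEntropyLoss()                          # softmax followed by L = -log(p_y)
    optimizer = torch.optim.SGD(network.parameters(), lr=alpha)
    for _ in range(epochs):
        for images, labels in train_loader:                    # labels: 0-based class indices
            z = network(images)                                # class identification result
            loss = criterion(z, labels)                        # loss value
            optimizer.zero_grad()
            loss.backward()                                    # back propagation: ∂L/∂W
            optimizer.step()                                   # W <- W - α · ∂L/∂W
    return network
```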
In addition, after training is completed, the trained feature extraction network needs to be tested with the test samples Dtest: a number of test images are selected from the test samples and input into the trained feature extraction network to obtain the output vectors, i.e. the image features; the category of each test image is determined based on the image features and compared with the annotated category label, and if a preset condition is met, the trained feature extraction network is obtained.
The above training method of the feature extraction network determines a training sample based on a preset sample set, wherein the training sample comprises a sample image and a category label of the sample image; inputs the sample image into the feature extraction network to obtain sample features of the sample image, wherein the feature extraction network comprises a plurality of serially connected level feature extraction layers, each level feature extraction layer is used for outputting the level feature corresponding to the current level, and the sample features are obtained by fusing the level features corresponding to at least two levels; determines a class recognition result of the sample image based on the sample features, determines a loss value based on a preset classification loss function, the category label and the class recognition result, and updates the network parameters of the feature extraction network based on the loss value; and continues to execute the step of determining a training sample based on the preset sample set until the loss value converges, obtaining the trained feature extraction network. In this manner, the feature extraction network comprises multiple level feature extraction layers, features of at least two levels can be extracted from the target image, the features of at least two levels are fused to obtain the image features, and the category of the target image is determined based on the image features, so that sensitive images can be recognized accurately and effectively.
In the live broadcast scene, massive data are collected and marked as training samples, a detailed tag classification standard is put forward to carry out strong marking on the training samples, and the obtained feature extraction network meets the supervision requirement of the live broadcast scene.
Corresponding to the above method embodiment, the present embodiment further provides an image classification device, as shown in fig. 5, including:
The output module 51 is configured to input the target image into a feature extraction network that is trained in advance, and output image features of the target image;
The classification module 52 is configured to determine the category of the target image based on the image features, where the feature extraction network includes a plurality of serially connected level feature extraction layers, each level feature extraction layer is configured to output the level feature corresponding to the current level, and the image features are obtained by fusing the level features corresponding to at least two levels.
The image classification device provided by the embodiment of the invention inputs a target image into a pre-trained feature extraction network to output image features of the target image, and determines the category of the target image based on the image features, wherein the feature extraction network comprises a plurality of serially connected level feature extraction layers, each level feature extraction layer is used for outputting the level feature corresponding to the current level, and the image features are obtained by fusing the level features corresponding to at least two levels. In this manner, the image features comprise level features of at least two levels, so the feature levels contained in the image features are richer, and the image features can be used to recognize images in complex scenes such as live broadcast, so that sensitive images can be accurately and effectively recognized, and the omission rate and the false detection rate of sensitive images are reduced.
Further, in the above image classification device, the feature extraction network further comprises at least two feature processing modules and a feature fusion module, each feature processing module is connected with one level feature extraction layer, and the feature extraction layers connected with any two feature processing modules are different; the output module is further used for processing, through each feature processing module, the level features output by the feature extraction layer connected with that feature processing module based on an attention mechanism and outputting intermediate features, and fusing the intermediate features output by each feature processing module through the feature fusion module to obtain the image features.
The at least two feature processing modules comprise a first feature processing module, a second feature processing module and a third feature processing module, wherein the first feature processing module is connected with a feature extraction layer of the lowest level, the feature extraction layer of the lowest level is used for inputting a target image, the second feature processing module is connected with a feature extraction layer appointed in the middle level, and the third feature processing module is connected with the feature extraction layer of the highest level.
The feature processing module comprises a pooling layer, a first full-connection layer and a feature multiplication layer, and is further used for carrying out first pooling processing on the input hierarchical features through the pooling layer, outputting first pooling results, carrying out first full-connection processing on the first pooling results through the first full-connection layer, outputting first full-connection results, multiplying the input hierarchical features with the first full-connection results through the feature multiplication layer to obtain multiplication results, and outputting intermediate features based on the multiplication results.
The feature processing module is further used for performing second pooling processing on multiplication results through the spatial pyramid pooling layer and outputting intermediate features with specified dimensions.
The feature fusion module comprises a feature splicing layer, a second full-connection layer and a third full-connection layer, and the output module is further used for splicing the middle features output by each feature processing module through the feature splicing layer to output splicing features, performing second full-connection processing on the splicing features through the second full-connection layer to output second full-connection results, performing third full-connection processing on the second full-connection results through the third full-connection layer to output image features.
The classification module is further used for inputting the image features into a preset normalized exponential function and outputting a probability distribution vector, wherein the probability distribution vector comprises a plurality of categories and probability values corresponding to the categories, and the category corresponding to the maximum probability value in the probability distribution vector is determined to be the category of the target image.
The image classification device provided by the embodiment of the invention has the same technical characteristics as the image classification method provided by the embodiment, so that the same technical problems can be solved, and the same technical effects can be achieved.
Corresponding to the above method embodiment, this embodiment further provides a training device for a feature extraction network, as shown in fig. 6, where the device includes:
The sample determining module 61 is used for determining a training sample based on a preset sample set, wherein the training sample comprises a sample image and a category label of the sample image;
The image input module 62 is configured to input a sample image into the feature extraction network to obtain sample features of the sample image, where the feature extraction network includes multiple serially connected feature extraction layers, each of the feature extraction layers is configured to output a level feature corresponding to a current level;
A parameter updating module 63, configured to determine a class recognition result of the sample image based on the sample feature, determine a loss value based on a preset class loss function, class label, and class recognition result, update a network parameter of the feature extraction network based on the loss value;
the network determining module 64 is configured to continue to perform the step of determining the training samples based on the preset sample set until the loss value converges, so as to obtain a trained feature extraction network.
The training device of the feature extraction network provided by the embodiment of the invention has the same technical characteristics as the training method of the feature extraction network provided by the embodiment, so that the same technical problems can be solved, and the same technical effects can be achieved.
The present embodiment also provides an electronic device including a processor and a memory, the memory storing machine-executable instructions executable by the processor, the processor executing the machine-executable instructions to implement the above-described image classification method, or a training method of a feature extraction network.
Referring to fig. 7, the electronic device includes a processor 100 and a memory 101, the memory 101 storing machine executable instructions that can be executed by the processor 100, the processor 100 executing the machine executable instructions to implement the above-described image classification method, or a training method of a feature extraction network.
Further, the electronic device shown in fig. 7 further includes a bus 102 and a communication interface 103, and the processor 100, the communication interface 103, and the memory 101 are connected through the bus 102.
The memory 101 may include a high-speed Random Access Memory (RAM), and may further include a non-volatile memory, such as at least one disk memory. The communication connection between the system network element and at least one other network element is implemented via at least one communication interface 103 (which may be wired or wireless), and may use the internet, a wide area network, a local area network, a metropolitan area network, etc. The bus 102 may be an ISA bus, a PCI bus, an EISA bus, or the like. Buses may be classified as address buses, data buses, control buses, etc. For ease of illustration, only one bi-directional arrow is shown in fig. 7, but this does not mean that there is only one bus or one type of bus.
The processor 100 may be an integrated circuit chip with signal processing capabilities. In implementation, the steps of the above method may be performed by integrated logic circuits of hardware in the processor 100 or by instructions in the form of software. The processor 100 may be a general-purpose processor, including a Central Processing Unit (CPU), a Network Processor (NP), a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, discrete gate or transistor logic device, or discrete hardware components, and can implement or perform the methods, steps and logic blocks disclosed in the embodiments of the present invention. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like. The steps of the method disclosed in connection with the embodiments of the present invention may be embodied directly as being executed by a hardware decoding processor, or executed by a combination of hardware and software modules in a decoding processor. The software modules may be located in a random access memory, flash memory, read-only memory, programmable read-only memory, electrically erasable programmable memory, registers or other storage media well known in the art. The storage medium is located in the memory 101, and the processor 100 reads the information in the memory 101 and, in combination with its hardware, performs the steps of the method of the previous embodiments.
The present embodiment also provides a machine-readable storage medium storing machine-executable instructions that, when invoked and executed by a processor, cause the processor to implement the above-described image classification method, or a training method of a feature extraction network.
The computer program product of the image classification method, the training method of the feature extraction network and the device provided by the embodiment of the invention comprises a computer readable storage medium storing program codes, wherein the instructions included in the program codes can be used for executing the method described in the method embodiment, and specific implementation can be referred to the method embodiment and will not be repeated here.
It will be clear to those skilled in the art that, for convenience and brevity of description, specific working procedures of the above-described system and apparatus may refer to corresponding procedures in the foregoing method embodiments, which are not described herein again.
In addition, in the description of embodiments of the present invention, unless explicitly stated and limited otherwise, the terms "mounted," "connected," and "connected" are to be construed broadly, and may be, for example, fixedly connected, detachably connected, or integrally connected, mechanically connected, electrically connected, directly connected, indirectly connected via an intermediate medium, or in communication between two elements. The specific meaning of the above terms in the present invention will be understood by those skilled in the art in specific cases.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention, or the part contributing to the prior art, or a part of the technical solution, may essentially be embodied in the form of a software product stored in a storage medium, comprising several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the methods according to the embodiments of the present invention. The storage medium includes a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, or other media capable of storing program code.
In the description of the present invention, it should be noted that the directions or positional relationships indicated by the terms "center", "upper", "lower", "left", "right", "vertical", "horizontal", "inner", "outer", etc. are based on the directions or positional relationships shown in the drawings, are merely for convenience of describing the present invention and simplifying the description, and do not indicate or imply that the devices or elements referred to must have a specific orientation, be configured and operated in a specific orientation, and thus should not be construed as limiting the present invention. Furthermore, the terms "first," "second," and "third" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance.
It should be noted that the foregoing embodiments are merely illustrative embodiments of the present invention, and not restrictive, and the scope of the invention is not limited to the foregoing embodiments, but it should be understood by those skilled in the art that any modification, variation or substitution of some technical features described in the foregoing embodiments may be easily made within the scope of the present invention without departing from the spirit and scope of the technical solutions of the embodiments. Therefore, the protection scope of the invention is subject to the protection scope of the claims.