CN112183672B - Image classification method, feature extraction network training method and device - Google Patents

Image classification method, feature extraction network training method and device

Info

Publication number
CN112183672B
CN112183672B (application CN202011227392.7A; publication CN112183672A)
Authority
CN
China
Prior art keywords
feature
feature extraction
features
image
layer
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011227392.7A
Other languages
Chinese (zh)
Other versions
CN112183672A (en)
Inventor
苏驰
李凯
刘弘也
王育林
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Kingsoft Cloud Network Technology Co Ltd
Original Assignee
Beijing Kingsoft Cloud Network Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Kingsoft Cloud Network Technology Co Ltd
Priority to CN202011227392.7A
Publication of CN112183672A
Application granted
Publication of CN112183672B
Legal status: Active (current)
Anticipated expiration

Abstract


The present invention provides an image classification method and a training method and device for a feature extraction network. A target image is input into a pre-trained feature extraction network, which outputs the image features of the target image; based on the image features, the category of the target image is determined. The feature extraction network includes multiple feature extraction layers connected in series; each feature extraction layer outputs the hierarchical feature corresponding to its level, and the image features are obtained by fusing the hierarchical features corresponding to at least two levels. Because the image features contain hierarchical features from at least two levels, the feature hierarchy they capture is richer, so image recognition in complex scenes such as live streaming can be handled; sensitive images can therefore be identified accurately and effectively, and the missed detection rate and false detection rate of sensitive images are reduced.

Description

Image classification method, training method and device of feature extraction network
Technical Field
The present invention relates to the field of image processing technologies, and in particular, to an image classification method, and a training method and apparatus for a feature extraction network.
Background
In order to supervise the content of video propagated over a network, sensitive images such as vulgar images, pornographic images, violent images, and horror images need to be identified from the video. In the related art, sensitive images can be identified by a deep learning method: first, high-level semantic features of the images are extracted by a pre-trained detection model, and then the images are classified based on those high-level semantic features. However, in a live-streaming scene, the sources and composition of video images are more complex, and this approach has difficulty identifying sensitive images accurately and effectively, so the missed detection rate and false detection rate of sensitive images remain high.
Disclosure of Invention
In view of the above, the present invention aims to provide an image classification method, a training method of a feature extraction network, and a device thereof, so as to improve accuracy of identifying sensitive images.
In a first aspect, an embodiment of the invention provides an image classification method, which comprises the steps of inputting a target image into a pre-trained feature extraction network, outputting image features of the target image, and determining the category of the target image based on the image features, wherein the feature extraction network comprises multiple feature extraction layers connected in series, each feature extraction layer is used for outputting the hierarchical feature corresponding to its level, and the image features are obtained by fusing the hierarchical features corresponding to at least two levels.
Further, the feature extraction network further comprises at least two feature processing modules and a feature fusion module; the method comprises the steps of connecting each feature processing module with a hierarchical feature extraction layer, inputting a target image into a feature extraction network which is trained in advance, outputting image features of the target image, processing hierarchical features output by the feature extraction layers connected with the feature processing modules based on an attention mechanism through each feature processing module, outputting intermediate features, and fusing the intermediate features output by each feature processing module through a feature fusion module to obtain the image features.
The at least two feature processing modules comprise a first feature processing module, a second feature processing module and a third feature processing module, wherein the first feature processing module is connected with a feature extraction layer of the lowest level, the feature extraction layer of the lowest level is used for inputting a target image, the second feature processing module is connected with a feature extraction layer appointed in the middle level, and the third feature processing module is connected with the feature extraction layer of the highest level.
The feature processing module comprises a pooling layer, a first full-connection layer and a feature multiplication layer, wherein the step of processing the hierarchical features output by the feature extraction layer connected with the feature processing module based on an attention mechanism and outputting intermediate features comprises the steps of carrying out first pooling processing on the input hierarchical features through the pooling layer and outputting first pooling results, carrying out first full-connection processing on the first pooling results through the first full-connection layer and outputting first full-connection results, multiplying the input hierarchical features with the first full-connection results through the feature multiplication layer to obtain multiplication results, and outputting the intermediate features based on the multiplication results.
The feature processing module further comprises a spatial pyramid pooling layer, wherein the spatial pyramid pooling layer is connected with the feature multiplication layer, and the step of outputting intermediate features based on multiplication results comprises the steps of carrying out second pooling processing on the multiplication results through the spatial pyramid pooling layer and outputting the intermediate features with specified dimensions.
The feature fusion module comprises a feature splicing layer, a second full-connection layer and a third full-connection layer, and the step of fusing the intermediate features output by each feature processing module through the feature fusion module to obtain image features comprises the steps of splicing the intermediate features output by each feature processing module through the feature splicing layer to output the spliced features, carrying out second full-connection processing on the spliced features through the second full-connection layer to output second full-connection results, and carrying out third full-connection processing on the second full-connection results through the third full-connection layer to output the image features.
The method comprises the steps of determining the category of a target image based on image features, and determining the category corresponding to the maximum probability value in the probability distribution vector as the category of the target image.
In a second aspect, an embodiment of the invention provides a training method of a feature extraction network, which comprises the steps of determining a training sample based on a preset sample set, wherein the training sample comprises a sample image and a category label of the sample image; inputting the sample image into the feature extraction network to obtain sample features of the sample image, wherein the feature extraction network comprises multiple serially connected feature extraction layers, each feature extraction layer is used for outputting the hierarchical feature corresponding to its level, and the sample features are obtained by fusing the hierarchical features corresponding to at least two levels; determining a category identification result of the sample image based on the sample features; determining a loss value based on a preset classification loss function, the category label and the category identification result; updating network parameters of the feature extraction network based on the loss value; and continuing to execute the step of determining the training sample based on the preset sample set until the loss value converges, to obtain the trained feature extraction network.
In a third aspect, an embodiment of the invention provides an image classification device, which comprises an output module and a classification module, wherein the output module is used for inputting a target image into a pre-trained feature extraction network and outputting image features of the target image, and the classification module is used for determining the category of the target image based on the image features; the feature extraction network comprises multiple feature extraction layers connected in series, each feature extraction layer is used for outputting the hierarchical feature corresponding to its level, and the image features are obtained by fusing the hierarchical features corresponding to at least two levels.
In a fourth aspect, an embodiment of the invention provides a training device of a feature extraction network, which comprises a sample determining module, an image input module, a parameter updating module and a network determining module, wherein the sample determining module is used for determining a training sample based on a preset sample set, the training sample comprising a sample image and a category label of the sample image; the image input module is used for inputting the sample image into the feature extraction network to obtain sample features of the sample image, the feature extraction network comprising multiple serially connected feature extraction layers, each feature extraction layer being used for outputting the hierarchical feature corresponding to its level, and the sample features being obtained by fusing the hierarchical features corresponding to at least two levels; the parameter updating module is used for determining a category identification result of the sample image based on the sample features, determining a loss value based on a preset classification loss function, the category label and the category identification result, and updating network parameters of the feature extraction network based on the loss value; and the network determining module is used for continuing to execute the step of determining the training sample based on the preset sample set until the loss value converges, to obtain the trained feature extraction network.
In a fifth aspect, an embodiment of the present invention provides an electronic device, including a processor and a memory, where the memory stores machine executable instructions executable by the processor, the processor executing the machine executable instructions to implement the image classification method of the first aspect, or the training method of the feature extraction network of the second aspect.
In a sixth aspect, embodiments of the present invention provide a machine-readable storage medium storing machine-executable instructions that, when invoked and executed by a processor, cause the processor to implement the image classification method of the first aspect, or the training method of the feature extraction network of the second aspect.
The embodiment of the invention has the following beneficial effects:
According to the image classification method, the training method of the feature extraction network and the devices provided by the embodiments of the invention, a target image is input into a pre-trained feature extraction network, the image features of the target image are output, and the category of the target image is determined based on the image features, wherein the feature extraction network comprises multiple serially connected feature extraction layers, each feature extraction layer is used for outputting the hierarchical feature corresponding to its level, and the image features are obtained by fusing the hierarchical features corresponding to at least two levels. In this approach, the image features comprise hierarchical features from at least two levels, so the feature hierarchy contained in the image features is richer and can be used for recognizing images in complex scenes such as live streaming; sensitive images can therefore be recognized accurately and effectively, and the missed detection rate and false detection rate of sensitive images are reduced.
Additional features and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention. The objectives and other advantages of the invention will be realized and attained by the structure particularly pointed out in the written description and claims hereof as well as the appended drawings.
In order to make the above objects, features and advantages of the present invention more comprehensible, preferred embodiments accompanied with figures are described in detail below.
Drawings
In order to more clearly illustrate the embodiments of the invention or the technical solutions in the prior art, the drawings that are needed in the description of the embodiments or the prior art will be briefly described, it being obvious that the drawings in the description below are some embodiments of the invention and that other drawings may be obtained from these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flowchart of an image classification method according to an embodiment of the present invention;
FIG. 2 is a flowchart of another image classification method according to an embodiment of the present invention;
fig. 3 is a schematic structural diagram of a feature extraction network according to an embodiment of the present invention;
FIG. 4 is a flowchart of a training method of a feature extraction network according to an embodiment of the present invention;
Fig. 5 is a schematic structural diagram of an image classification device according to an embodiment of the present invention;
Fig. 6 is a schematic structural diagram of a training device of a feature extraction network according to an embodiment of the present invention;
Fig. 7 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
With the development of network technology and intelligent mobile platforms, live streaming and mobile live streaming have become part of people's daily life. If video transmitted over the network is not monitored, it can easily become a channel for spreading obscene, pornographic and violent content, harming large numbers of internet users. In order to supervise the content of video transmitted over the network, sensitive images need to be identified from the video. However, because the number of live-streaming platforms is huge, manual monitoring is usually laborious and consumes a great deal of cost. In traditional methods, sensitive images can be identified through a feature matching algorithm, but live-streaming environments are diverse, with strong illumination variation, low resolution and obvious differences in human posture, so accurate classification cannot be achieved with a simple feature matching algorithm. In addition, when the training sample size is too small and the training method too simple, sensitive images with complicated and varied content cannot be reliably identified.
In the related art, sensitive images can be identified by a deep learning method, for example a convolutional neural network, which has achieved good results in the field of image recognition: the high-level semantic features of the images are extracted through a pre-trained detection model, and the images are classified based on those high-level semantic features. However, in a live-streaming scene, the video images have various sources and complex composition, and this approach has difficulty identifying sensitive images accurately and effectively, so the missed detection rate and false detection rate of sensitive images are high. On this basis, the image classification method, the training method of the feature extraction network and the corresponding devices provided by the embodiments of the invention can be applied to devices such as mobile phones and computers, and in particular to devices with network live-streaming or network video playing functions.
For the sake of understanding the present embodiment, first, a detailed description will be given of an image classification method disclosed in the present embodiment, as shown in fig. 1, where the method includes the following steps:
step S102, inputting a target image into a feature extraction network which is trained in advance, and outputting image features of the target image;
The target image may be a video image propagated over a network or a video image from a live-streaming platform; for example, a live-streaming scene image usually contains a person. The pre-trained feature extraction network may be a network model such as a CNN (Convolutional Neural Network), an RNN (Recurrent Neural Network) or a DNN (Deep Neural Network); the network usually contains multiple convolutional layers and may also contain multiple activation functions. The image features described above typically comprise multi-level features of the target image, and may include, for example, one or more of low-level features (color, texture, etc.), middle-level features (shape, etc.) or high-level features (semantics, etc.) of the target image. The image features may be represented as a feature vector.
Specifically, the target image may be represented as X ∈ R^{H×W×3}, where H is the height of the target image, W is its width, and 3 indicates that the target image has three channels. The target image X ∈ R^{H×W×3} of size H×W×3 is input into the feature extraction network, and hierarchical features of at least two levels, i.e. at least two of the low-level features (color, texture, etc.), middle-level features (shape, action, etc.) or high-level features (semantics, etc.) of the target image, are extracted through the multi-level feature extraction layers; the at least two hierarchical features are then fused to obtain the image features, and in particular, feature fusion may be performed by concatenation.
Step S104, determining the category of the target image based on the image characteristics, wherein the characteristic extraction network comprises a plurality of serially connected level characteristic extraction layers, each level characteristic extraction layer is used for outputting level characteristics corresponding to the current level, and the image characteristics are obtained by fusing the level characteristics corresponding to at least two levels.
The categories of the target image may take several forms: for example, normal image and sensitive image; or normal image, vulgar image, pornographic image and violent image, where the vulgar, pornographic and violent images all belong to sensitive images. In actual implementation, the image features include multi-level features of the target image, such as features of the background and the person in the target image, the shapes of objects, the semantics of text, and so on. Therefore, the probability of the target image belonging to each category can be determined from the image features by computing probabilities, and the category of the target image is finally determined from the computed per-category probabilities. The category of the target image can also be obtained from the image features through a classifier.
The multi-level feature extraction layers include at least two, three, four or more levels. Generally, the more feature extraction layers there are, the richer the hierarchical features finally extracted from the target image and the better the performance, but the longer feature extraction takes and the slower it becomes; therefore, the number of feature extraction layers can be set according to the classification speed and accuracy required in practice. Each level's feature extraction layer may include several convolutional networks (also called convolutional layers) and several activation functions. The role of the convolutional layers is the convolution operation, whose purpose is to extract different features of the target image: the first convolutional layer usually extracts only low-level image features such as edges, lines and corners, while deeper convolutional layers can iteratively extract more complex image features from those low-level features. Therefore, for the multi-level feature extraction layers, the image information in the hierarchical features output by each level differs: the hierarchical features output by a low-level feature extraction layer include simpler features such as the background color and texture of the target image; the hierarchical features output by a middle-level feature extraction layer include features such as the shapes of objects, a person's action, skin color and regions in the target image; and the hierarchical features output by a higher-level feature extraction layer include features such as the semantics of text in the target image. The hierarchical features corresponding to different levels are fused to obtain the image features of the target image. In this way, multiple kinds of hierarchical features can be extracted so as to enrich the finally fused image features.
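For illustration, the serial multi-level structure can be sketched as a small PyTorch module. This is only a minimal sketch and not the patent's actual network: the number of stages, channel widths, strides and the conv/BN/ReLU composition of each stage are assumptions made for demonstration.

```python
import torch
import torch.nn as nn

class MultiLevelBackbone(nn.Module):
    """Serially connected feature extraction layers; each stage outputs the
    hierarchical feature of its level. Stage depths/widths are illustrative."""
    def __init__(self, channels=(3, 32, 64, 128, 256, 512)):
        super().__init__()
        self.stages = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(channels[i], channels[i + 1], 3, stride=2, padding=1),
                nn.BatchNorm2d(channels[i + 1]),
                nn.ReLU(inplace=True),
            )
            for i in range(len(channels) - 1)
        ])

    def forward(self, x):
        level_features = []
        for stage in self.stages:
            x = stage(x)               # hierarchical feature of the current level
            level_features.append(x)
        return level_features          # e.g. [f1, f2, f3, f4, f5]

# Example: an H x W x 3 target image as a batch of size 1
feats = MultiLevelBackbone()(torch.randn(1, 3, 224, 224))
print([f.shape for f in feats])
```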
According to the image classification method, a target image is input into a pre-trained feature extraction network and the image features of the target image are output; the category of the target image is determined based on the image features; the feature extraction network comprises multiple feature extraction layers connected in series, each feature extraction layer outputs the hierarchical feature corresponding to its level, and the image features are obtained by fusing the hierarchical features corresponding to at least two levels. In this approach, the image features comprise hierarchical features from at least two levels, so the feature hierarchy contained in the image features is richer and can be used for recognizing images in complex scenes such as live streaming; sensitive images can therefore be recognized accurately and effectively, and the missed detection rate and false detection rate of sensitive images are reduced.
The present embodiment also provides another image classification method, which is implemented on the basis of the above embodiment, and the present embodiment focuses on the implementation procedure of the step of inputting the target image into the feature extraction network that is trained in advance, outputting the image features of the target image (implemented through steps S202-S204), and the implementation procedure of the step of determining the category of the target image based on the image features (implemented through steps S206-S208);
In this embodiment, the feature extraction network further comprises at least two feature processing modules and a feature fusion module. Each feature processing module is connected with one hierarchical feature extraction layer, and any two feature processing modules are connected with different feature extraction layers. The feature processing modules are used for further processing the hierarchical features output by the corresponding feature extraction layers to obtain more accurate and more discriminative features, and the feature fusion module is used for fusing the features output by each feature processing module. The number of feature processing modules is generally less than or equal to the number of feature extraction layers.
As shown in fig. 2, the method comprises the steps of:
Step S202, processing the hierarchical features output by the feature extraction layer connected with the feature processing modules based on an attention mechanism through each feature processing module, and outputting intermediate features;
In order to make the hierarchical features output by the feature extraction layer more accurate and their information more salient, the hierarchical features can be processed through a feature processing module. The attention mechanism works somewhat like the human retina, in which different parts have different information processing capabilities: the hierarchical features output by the feature extraction layer connected to the feature processing module are scanned to locate the target features that need attention, more attention resources are then devoted to those features to acquire more detailed information related to them, and other irrelevant information is ignored. With this mechanism, high-value features can be quickly screened from the large amount of information in the hierarchical features using limited attention resources, and the intermediate features are then output.
The at least two feature processing modules comprise a first feature processing module, a second feature processing module and a third feature processing module, wherein the first feature processing module is connected with a feature extraction layer of the lowest level, the feature extraction layer of the lowest level is used for inputting a target image, the second feature processing module is connected with a feature extraction layer appointed in the middle level, and the third feature processing module is connected with the feature extraction layer of the highest level.
Referring to the schematic structural diagram of the feature extraction network shown in fig. 3, the feature extraction network is illustrated by taking an example that the feature extraction network comprises five feature extraction layers connected in series, namely a feature extraction layer 1, a feature extraction layer 2, a feature extraction layer 3, a feature extraction layer 4 and a feature extraction layer 5, wherein the feature extraction layer 1 corresponds to the feature extraction layer of the lowest level, the feature extraction layer 2, the feature extraction layer 3 and the feature extraction layer 4 correspond to the feature extraction layer of the middle level, and the feature extraction layer 5 corresponds to the feature extraction layer of the highest level. In addition, the feature extraction network comprises three feature processing modules, namely a first feature processing module, a second feature processing module and a third feature processing module, which are respectively connected with the feature extraction layer 1, the feature extraction layer 3 and the feature extraction layer 5.
In addition, as shown in fig. 3, the feature processing module further includes a pooling layer, a first full-connection layer, and a feature multiplication layer;
The pooling layer is mainly used for downsampling the input features to reduce the number of parameters. The first fully connected layer (FC) plays the role of a classifier in the whole convolutional neural network: it applies a weighted sum to the features output by the preceding layer and maps the feature space to the sample label space through a linear transformation. The feature multiplication layer (multiply) is mainly used for multiplying the hierarchical features by the features output by the first fully connected layer.
One possible implementation is:
The method comprises the steps of carrying out first pooling processing on input level features through a pooling layer, outputting a first pooling result, carrying out first full-connection processing on the first pooling result through a first full-connection layer, outputting a first full-connection result, multiplying the input level features and the first full-connection result through a feature multiplication layer to obtain a multiplication result, and outputting intermediate features based on the multiplication result.
Specifically, referring to the data flow shown in fig. 3 and taking the first feature processing module as an example: first, the target image X ∈ R^{H×W×3} of size H×W×3 is input to the lowest-level feature extraction layer in the feature extraction network to obtain the hierarchical feature, i.e. the feature matrix f1 ∈ R^{h1×w1×c1}, where h1 is the height of the feature matrix, w1 is its width, and c1 is its number of channels. The hierarchical feature f1 ∈ R^{h1×w1×c1} is input to the pooling layer in the first feature processing module for the first pooling processing, and the first pooling result, i.e. the feature vector f1' ∈ R^{c1}, is output; the first pooling result f1' ∈ R^{c1} is input to the first fully connected layer, first full-connection processing is performed on it, and the first full-connection result f1'' ∈ R^{c1} is output; the hierarchical feature f1 ∈ R^{h1×w1×c1} is multiplied by the first full-connection result f1'' ∈ R^{c1} to obtain the multiplication result f1''' ∈ R^{h1×w1×c1}; finally, the intermediate feature is output based on the multiplication result. This intermediate feature can represent the low-level features of the target image.
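The per-module processing described above (pooling layer, first fully connected layer, feature multiplication layer) resembles a channel-attention block. A minimal PyTorch sketch follows; the global average pooling choice and the sigmoid gate after the fully connected layer are assumptions, since the text does not fix them, and the spatial pyramid pooling step is sketched separately below.

```python
import torch
import torch.nn as nn

class FeatureProcessingModule(nn.Module):
    """Pooling layer -> first fully connected layer -> feature multiplication,
    following the attention mechanism described above. Global average pooling
    and the sigmoid gate are illustrative assumptions."""
    def __init__(self, channels):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)       # first pooling: f -> f' in R^c
        self.fc = nn.Linear(channels, channels)   # first full connection: f' -> f''
        self.gate = nn.Sigmoid()

    def forward(self, f):                         # f in R^(h x w x c), as N,C,H,W
        n, c, _, _ = f.shape
        f1 = self.pool(f).view(n, c)                  # first pooling result f'
        f2 = self.gate(self.fc(f1)).view(n, c, 1, 1)  # first full-connection result f''
        return f * f2                                 # multiplication result f'''
```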
Referring to fig. 3, the feature processing module further includes a spatial pyramid pooling layer, which is connected with the feature multiplication layer. The spatial pyramid pooling layer (Spatial Pyramid Pooling, SPP) is mainly used for processing features of different levels to obtain features with the same dimension.
In the step of outputting the intermediate feature based on the multiplication result, a possible implementation manner is to perform second pooling processing on the multiplication result through a spatial pyramid pooling layer, and output the intermediate feature with a specified dimension. Wherein the specified dimension can be set according to the actual application.
After the multiplication result f1''' ∈ R^{h1×w1×c1} is obtained in the foregoing manner, it may be input to the spatial pyramid pooling layer, where the second pooling processing is performed on the multiplication result, and the intermediate feature of the specified dimension, i.e. the feature vector f1'''' ∈ R^c, is output.
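A hedged sketch of the second pooling step: a spatial pyramid pooling function that maps a multiplication result of arbitrary spatial size to a vector of fixed dimension. The pyramid bin sizes (1x1, 2x2, 4x4) and the use of max pooling are assumptions for illustration only.

```python
import torch
import torch.nn.functional as F

def spatial_pyramid_pool(x, levels=(1, 2, 4)):
    """Second pooling: map a feature map of any h x w to a fixed-length vector.
    The pyramid levels (1x1, 2x2, 4x4 bins) are an illustrative assumption."""
    n, c, _, _ = x.shape
    pooled = [F.adaptive_max_pool2d(x, level).view(n, -1) for level in levels]
    return torch.cat(pooled, dim=1)   # shape: N x (c * sum(level^2))
```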
Similarly, taking the second feature processing module as an example, the hierarchical feature f3 ∈ R^{h3×w3×c3} output by the designated middle-level feature extraction layer 3 is input into the second feature processing module to obtain the second intermediate feature f3'''' ∈ R^c; taking the third feature processing module as an example, the hierarchical feature f5 ∈ R^{h5×w5×c5} output by the highest-level feature extraction layer 5 is input into the third feature processing module to obtain the third intermediate feature f5'''' ∈ R^c. The spatial pyramid pooling layers in the first, second and third feature processing modules output intermediate features with the same dimension.
Step S204, fusing the intermediate features output by each feature processing module through a feature fusion module to obtain image features;
specifically, the intermediate features output by each feature processing module, namely the first feature processing module, the second feature processing module and the third feature processing module, are input to a feature fusion module, and each intermediate feature is fused in a feature splicing mode and the like to obtain a multi-level fusion feature, namely the image feature.
Referring to the schematic structural diagram of the feature extraction network shown in fig. 3, the feature fusion module includes a feature splicing layer, a second fully connected layer and a third fully connected layer, where the feature splicing layer (concatenation) is mainly used to splice the intermediate features f1'''' ∈ R^c, f3'''' ∈ R^c and f5'''' ∈ R^c, and the second and third fully connected layers function in the same way as the first fully connected layer.
The step of obtaining the image features by fusing the intermediate features output by each feature processing module through the feature fusion module, which is a possible implementation manner:
The method comprises the steps of performing splicing processing on intermediate features output by each feature processing module through a feature splicing layer to output spliced features, performing second full-connection processing on the spliced features through a second full-connection layer to output second full-connection results, performing third full-connection processing on the second full-connection results through a third full-connection layer to output image features.
Specifically, the intermediate features f1'''' ∈ R^c, f3'''' ∈ R^c and f5'''' ∈ R^c of the target image X are subjected to splicing processing, and the spliced feature f'''' ∈ R^{3c} is output; the spliced feature f'''' ∈ R^{3c} is input to the second fully connected layer for second full-connection processing and the second full-connection result is output; the second full-connection result is input to the third fully connected layer, third full-connection processing is performed on it, and the output vector of the network, i.e. the image feature z ∈ R^3, is output.
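A minimal sketch of the feature fusion module (feature splicing layer plus the second and third fully connected layers). The hidden width of the second fully connected layer, the ReLU between the two fully connected layers, and the output dimension of 3 (matching z ∈ R^3 in the example above) are assumptions.

```python
import torch
import torch.nn as nn

class FeatureFusionModule(nn.Module):
    """Concatenate the three intermediate features (each in R^c) and apply two
    fully connected layers. Hidden width, ReLU and the 3-way output are
    illustrative assumptions, not taken from the patent text."""
    def __init__(self, c, hidden=256, num_classes=3):
        super().__init__()
        self.fc2 = nn.Linear(3 * c, hidden)         # second full connection
        self.fc3 = nn.Linear(hidden, num_classes)   # third full connection

    def forward(self, f1, f3, f5):
        f = torch.cat([f1, f3, f5], dim=1)          # spliced feature in R^(3c)
        return self.fc3(torch.relu(self.fc2(f)))    # image feature z
```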
Step S206, inputting the image features into a preset normalized exponential function and outputting probability distribution vectors, wherein the probability distribution vectors comprise a plurality of categories and probability values corresponding to each category;
The normalized exponential function may be a softmax function, and the probability distribution vector may be denoted p. The probability distribution vector may be calculated by:

p_i = e^{z_i} / Σ_{j=1}^{m} e^{z_j}

where p is the probability distribution vector, z is the image feature, m is the dimension of z (i.e. the number of categories), p_i and z_i are the i-th elements of p and z respectively, and e is the natural constant;
In step S208, the category corresponding to the maximum probability value in the probability distribution vector is determined as the category of the target image.
Specifically, the index corresponding to the largest probability value in the probability distribution vector, i.e. the category of the target image, can be determined through the formula k = argmax_i(p_i). Taking three preset categories as an example, k = 1 indicates that the target image is a normal image, k = 2 indicates a vulgar image, and k = 3 indicates a pornographic image.
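The softmax-and-argmax step can be written in a few lines; the snippet below is a toy illustration using the hypothetical three-category mapping from the example above.

```python
import torch

def classify(z):
    """Normalized exponential (softmax) over the image feature z, followed by
    taking the category with the largest probability. Categories follow the
    example above: 1 = normal, 2 = vulgar, 3 = pornographic (an assumption)."""
    p = torch.softmax(z, dim=-1)     # probability distribution vector p
    k = int(torch.argmax(p)) + 1     # 1-based category index
    return p, k

p, k = classify(torch.tensor([0.2, 2.1, -0.5]))
print(p, k)   # highest probability -> category 2 (vulgar) in this toy example
```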
In the above manner, multiple levels of features in the target image can be extracted through the several feature extraction layers of the feature extraction network and the first, second and third feature processing modules connected with different feature extraction layers. The hierarchical features are processed by the feature processing modules to obtain intermediate features, which increases the discriminability of the hierarchical features and makes the image information contained in the intermediate features more accurate and richer. The intermediate features are then fused by the feature fusion module to obtain the image features, and the category of the target image is determined based on the image features. The features of the target image do not need to be designed manually; the features effective for image classification are extracted automatically by the convolutional neural network, so the algorithm has strong generalization ability and robustness. Sensitive images in live-streaming scenes can be recognized effectively from these image features, the recognition accuracy of the feature extraction network is improved, the missed detection rate and false detection rate of sensitive images are reduced, and the accuracy of sensitive image recognition is improved.
For live-streaming scenes, live images can be classified in the above manner so that sensitive images among them are identified, achieving intelligent monitoring of live-streaming rooms while reducing labor cost.
The embodiment also provides a training method of the feature extraction network, as shown in fig. 4, the method includes the following steps:
Step S402, determining a training sample based on a preset sample set, wherein the training sample comprises a sample image and a category label of the sample image;
Specifically, a detailed image classification standard can be designed in advance, covering for example normal, vulgar, pornographic and violent categories (e.g. a landscape is a normal image, exposed genitalia is a pornographic image, kissing is a vulgar image, a person injured by a knife is a violent image, and so on). A data set D can be obtained by manually labeling massive live-streaming images according to this standard; part of the data set is used as training samples Dtrain and the rest as test samples Dtest according to a certain proportion. Specifically, the data set D can be divided into training samples Dtrain and test samples Dtest at a ratio of 10:1. A training sample comprises a sample image and the category label of that sample image; the category labels may comprise four values, i.e. the labeling result can be expressed as y ∈ {1, 2, 3, 4}, where 1 represents normal, 2 represents vulgar, 3 represents pornographic and 4 represents violent. The image categories are not limited to those described in this application; other categories, such as images that leak state secrets or violate laws and regulations, may also be included.
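A minimal sketch of the 10:1 split of the annotated data set into Dtrain and Dtest. The random shuffle before splitting and the (image_path, y) pair representation are assumptions for illustration.

```python
import random

def split_dataset(dataset, ratio=10):
    """Split annotated (image_path, y) pairs, with y in {1: normal, 2: vulgar,
    3: pornographic, 4: violent}, into training and test samples at 10:1.
    Random shuffling before the split is an assumption."""
    data = list(dataset)
    random.shuffle(data)
    cut = len(data) * ratio // (ratio + 1)
    return data[:cut], data[cut:]        # D_train, D_test
```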
Step S404, inputting a sample image into a feature extraction network to obtain sample features of the sample image, wherein the feature extraction network comprises a plurality of serially connected feature extraction layers, each of which is used for outputting the corresponding level features of the current level;
Step S406, determining a class identification result of the sample image based on the sample characteristics, determining a loss value based on a preset class loss function, a class label and the class identification result, and updating network parameters of the characteristic extraction network based on the loss value;
step S408, the step of determining training samples based on the preset sample set is continuously executed until the loss value converges, and the trained feature extraction network is obtained.
Specifically, the class identification result z ∈ R^3 of the sample image may be input to the softmax function, and the probability distribution vector p is calculated by the formula p_i = e^{z_i} / Σ_{j=1}^{m} e^{z_j}; the loss value is then calculated by the equation L = -log(p_y), where y is the category label of the training sample image. Finally, the derivative ∂L/∂W of the loss value with respect to all network parameters W in the feature extraction network is calculated through the back-propagation algorithm, and the network parameters of the feature extraction network are updated based on the calculated loss value by the stochastic gradient descent algorithm, namely:

W = W - α · ∂L/∂W

where α is the learning rate (a preset hyper-parameter, commonly 0.01 or 0.001). The parameters of the feature extraction network are updated iteratively in this way until the loss value converges, yielding the trained feature extraction network.
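A hedged sketch of one training iteration matching the description above: cross-entropy loss L = -log(p_y), back-propagation, and a stochastic gradient descent update with learning rate alpha. The `network` object, the optimizer construction and the 1-based label convention are assumptions carried over from the text.

```python
import torch
import torch.nn.functional as F

def train_step(network, optimizer, image, y):
    """One update: forward pass, loss L = -log(p_y), back-propagation of dL/dW,
    and the SGD update W = W - alpha * dL/dW (alpha is the optimizer's learning
    rate, e.g. 0.01 or 0.001). `network` is assumed to return the image feature z;
    y is a tensor of 1-based labels as in the text, hence the `y - 1`."""
    z = network(image)                   # image feature z
    loss = F.cross_entropy(z, y - 1)     # equivalent to -log(p_y)
    optimizer.zero_grad()
    loss.backward()                      # dL/dW via back-propagation
    optimizer.step()                     # stochastic gradient descent update
    return loss.item()

# optimizer = torch.optim.SGD(network.parameters(), lr=0.01)
```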
In addition, after training is completed, the trained feature extraction network needs to be tested using the test samples Dtest: a number of test images are selected from the test samples and input into the trained feature extraction network to obtain output vectors, i.e. image features; the categories of the test images are determined based on the image features and compared with the annotated labels; and if a preset condition is met, the trained feature extraction network is obtained.
According to the training method of the feature extraction network, a training sample is determined based on a preset sample set, the training sample comprising a sample image and its category label; the sample image is input into the feature extraction network to obtain sample features of the sample image, where the feature extraction network comprises multiple serially connected feature extraction layers, each feature extraction layer outputs the hierarchical feature corresponding to its level, and the sample features are obtained by fusing the hierarchical features corresponding to at least two levels; a category recognition result of the sample image is determined based on the sample features, a loss value is determined based on a preset classification loss function, the category label and the category recognition result, and the network parameters of the feature extraction network are updated based on the loss value; the step of determining a training sample based on the preset sample set is executed repeatedly until the loss value converges, giving the trained feature extraction network. In this approach, the feature extraction network comprises multiple feature extraction layers, at least two levels of features in the target image can be extracted and fused to obtain the image features, and the category of the target image is determined based on the image features.
For the live-streaming scene, massive data are collected and annotated as training samples, and a detailed label classification standard is proposed so that the training samples carry strong annotations; the resulting feature extraction network meets the supervision requirements of the live-streaming scene.
Corresponding to the above method embodiment, the present embodiment further provides an image classification device, as shown in fig. 5, including:
The output module 51 is configured to input the target image into a feature extraction network that is trained in advance, and output image features of the target image;
The classification module 52 is configured to determine a class of the target image based on the image features, where the feature extraction network includes multiple serially connected feature extraction layers, each of the feature extraction layers is configured to output a level feature corresponding to a current level, and the image features are obtained by fusing at least two level features corresponding to the current level.
The image classification device provided by the embodiment of the invention inputs a target image into a feature extraction network which is trained in advance to output image features of the target image, determines the category of the target image based on the image features, wherein the feature extraction network comprises a plurality of serially connected feature extraction layers, each feature extraction layer is used for outputting a level feature corresponding to a current level, and the image features are obtained by fusing the level features corresponding to at least two levels. In the mode, the image features comprise at least two levels of level features, so that the feature levels contained in the image features are richer, and the image features can be used for recognizing images in complex scenes such as live broadcasting, so that sensitive images can be accurately and effectively recognized, and the omission rate and the false detection rate of the sensitive images are reduced.
The image feature processing system comprises a feature extraction network, an output module, a feature fusion module and a feature processing module, wherein the feature extraction network further comprises at least two feature processing modules and a feature fusion module, each feature processing module is connected with a hierarchical feature extraction layer, the feature extraction layers connected with any two feature processing modules are different, the output module is further used for processing hierarchical features output by the feature extraction layers connected with the feature processing modules based on an attention mechanism through each feature processing module and outputting intermediate features, and the intermediate features output by each feature processing module are fused through the feature fusion module to obtain the image feature.
The at least two feature processing modules comprise a first feature processing module, a second feature processing module and a third feature processing module, wherein the first feature processing module is connected with a feature extraction layer of the lowest level, the feature extraction layer of the lowest level is used for inputting a target image, the second feature processing module is connected with a feature extraction layer appointed in the middle level, and the third feature processing module is connected with the feature extraction layer of the highest level.
The feature processing module comprises a pooling layer, a first full-connection layer and a feature multiplication layer, and is further used for carrying out first pooling processing on the input hierarchical features through the pooling layer, outputting first pooling results, carrying out first full-connection processing on the first pooling results through the first full-connection layer, outputting first full-connection results, multiplying the input hierarchical features with the first full-connection results through the feature multiplication layer to obtain multiplication results, and outputting intermediate features based on the multiplication results.
The feature processing module is further used for performing second pooling processing on multiplication results through the spatial pyramid pooling layer and outputting intermediate features with specified dimensions.
The feature fusion module comprises a feature splicing layer, a second full-connection layer and a third full-connection layer, and the output module is further used for splicing the middle features output by each feature processing module through the feature splicing layer to output splicing features, performing second full-connection processing on the splicing features through the second full-connection layer to output second full-connection results, performing third full-connection processing on the second full-connection results through the third full-connection layer to output image features.
The classification module is further used for inputting the image features into a preset normalized exponential function and outputting a probability distribution vector, wherein the probability distribution vector comprises a plurality of categories and probability values corresponding to the categories, and the category corresponding to the maximum probability value in the probability distribution vector is determined to be the category of the target image.
The image classification device provided by the embodiment of the invention has the same technical characteristics as the image classification method provided by the embodiment, so that the same technical problems can be solved, and the same technical effects can be achieved.
Corresponding to the above method embodiment, this embodiment further provides a training device for a feature extraction network, as shown in fig. 6, where the device includes:
The sample determining module 61 is used for determining a training sample based on a preset sample set, wherein the training sample comprises a sample image and a category label of the sample image;
The image input module 62 is configured to input a sample image into the feature extraction network to obtain sample features of the sample image, where the feature extraction network includes multiple serially connected feature extraction layers, each of the feature extraction layers is configured to output a level feature corresponding to a current level;
A parameter updating module 63, configured to determine a class recognition result of the sample image based on the sample feature, determine a loss value based on a preset class loss function, class label, and class recognition result, update a network parameter of the feature extraction network based on the loss value;
the network determining module 64 is configured to continue to perform the step of determining the training samples based on the preset sample set until the loss value converges, so as to obtain a trained feature extraction network.
The training device of the feature extraction network provided by the embodiment of the invention has the same technical characteristics as the training method of the feature extraction network provided by the embodiment, so that the same technical problems can be solved, and the same technical effects can be achieved.
The present embodiment also provides an electronic device including a processor and a memory, the memory storing machine-executable instructions executable by the processor, the processor executing the machine-executable instructions to implement the above-described image classification method, or a training method of a feature extraction network.
Referring to fig. 7, the electronic device includes a processor 100 and a memory 101, the memory 101 storing machine executable instructions that can be executed by the processor 100, the processor 100 executing the machine executable instructions to implement the above-described image classification method, or a training method of a feature extraction network.
Further, the electronic device shown in fig. 7 further includes a bus 102 and a communication interface 103, and the processor 100, the communication interface 103, and the memory 101 are connected through the bus 102.
The memory 101 may include a high-speed random access memory (RAM, Random Access Memory), and may further include a non-volatile memory, such as at least one disk memory. The communication connection between the system network element and at least one other network element is implemented via at least one communication interface 103 (which may be wired or wireless), and may use the internet, a wide area network, a local area network, a metropolitan area network, etc. Bus 102 may be an ISA bus, a PCI bus, an EISA bus, or the like. The buses may be classified as address buses, data buses, control buses, etc. For ease of illustration, only one bi-directional arrow is shown in FIG. 7, but this does not mean there is only one bus or type of bus.
The processor 100 may be an integrated circuit chip with signal processing capabilities. In implementation, the steps of the above method may be performed by integrated logic circuits of hardware in the processor 100 or by instructions in the form of software. The processor 100 may be a general-purpose processor, including a Central Processing Unit (CPU), a Network Processor (NP), a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, discrete gate or transistor logic device, or discrete hardware components. The methods, steps, and logic blocks disclosed in the embodiments of the present invention may be implemented or performed. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like. The steps of the method disclosed in connection with the embodiments of the present invention may be embodied directly in a hardware decoding processor, or in a combination of hardware and software modules in a decoding processor. The software modules may be located in a random access memory, flash memory, read-only memory, programmable read-only memory, or electrically erasable programmable memory, registers, or other storage media well known in the art. The storage medium is located in the memory 101, and the processor 100 reads the information in the memory 101 and, in combination with its hardware, performs the steps of the method of the previous embodiments.
The present embodiment also provides a machine-readable storage medium storing machine-executable instructions that, when invoked and executed by a processor, cause the processor to implement the above-described image classification method, or a training method of a feature extraction network.
The computer program product of the image classification method, the training method of the feature extraction network and the device provided by the embodiment of the invention comprises a computer readable storage medium storing program codes, wherein the instructions included in the program codes can be used for executing the method described in the method embodiment, and specific implementation can be referred to the method embodiment and will not be repeated here.
It will be clear to those skilled in the art that, for convenience and brevity of description, specific working procedures of the above-described system and apparatus may refer to corresponding procedures in the foregoing method embodiments, which are not described herein again.
In addition, in the description of embodiments of the present invention, unless explicitly stated and limited otherwise, the terms "mounted," "connected," and "connected" are to be construed broadly, and may be, for example, fixedly connected, detachably connected, or integrally connected, mechanically connected, electrically connected, directly connected, indirectly connected via an intermediate medium, or in communication between two elements. The specific meaning of the above terms in the present invention will be understood by those skilled in the art in specific cases.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention, in essence, or the part thereof contributing to the prior art, or a part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the methods according to the embodiments of the present invention. The storage medium includes a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, or other media capable of storing program code.
In the description of the present invention, it should be noted that the orientations or positional relationships indicated by the terms "center", "upper", "lower", "left", "right", "vertical", "horizontal", "inner", "outer", etc. are based on the orientations or positional relationships shown in the drawings. They are used merely for convenience of describing the present invention and simplifying the description, and do not indicate or imply that the devices or elements referred to must have a specific orientation or be constructed and operated in a specific orientation, and thus should not be construed as limiting the present invention. Furthermore, the terms "first," "second," and "third" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance.
It should be noted that the foregoing embodiments are merely illustrative of the present invention and are not restrictive; the scope of the invention is not limited to these embodiments. Those skilled in the art will understand that modifications, variations, or substitutions of some of the technical features described in the foregoing embodiments may readily be made without departing from the spirit and scope of the technical solutions of the embodiments, and such changes remain within the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.
