CN110516512B - Training method of pedestrian attribute analysis model, pedestrian attribute identification method and device


Info

Publication number
CN110516512B
CN110516512B
Authority
CN
China
Prior art keywords
pedestrian
attribute
training
image
representing
Prior art date
2018-05-21
Legal status
Active
Application number
CN201810488759.7A
Other languages
Chinese (zh)
Other versions
CN110516512A (en)
Inventor
王睿
Current Assignee
Beijing Authenmetric Data Technology Co ltd
Original Assignee
Beijing Authenmetric Data Technology Co ltd
Priority date
2018-05-21
Filing date
2018-05-21
Application filed by Beijing Authenmetric Data Technology Co ltd
Priority to CN201810488759.7A
Publication of CN110516512A
Application granted
Publication of CN110516512B
Status: Active

Abstract

The embodiment of the invention discloses a training method and a training device for a pedestrian attribute analysis model. The training method comprises the following steps: inputting a pedestrian image and a probability map corresponding to the pedestrian image into a convolutional neural network to obtain a predicted attribute, wherein the probability map characterizes a set of probability values that each pixel node belongs to each of at least one pedestrian component area partitioned from the pedestrian image; calculating a training loss using the real attribute corresponding to the pedestrian image and the predicted attribute; and if the training loss converges, determining the current model parameters of the convolutional neural network as the model parameters of the pedestrian attribute analysis model, to obtain the pedestrian attribute analysis model. When the pedestrian attribute analysis model trained by this method is used to identify pedestrian images of unknown attributes, it can accurately identify the pedestrian attributes even from pedestrian images that differ greatly in application scene, pedestrian pose, and camera angle.

Description

Training method of pedestrian attribute analysis model, pedestrian attribute identification method and device
Technical Field
The invention relates to the technical field of image processing, in particular to a training method and a training system of a pedestrian attribute analysis model, and a pedestrian attribute identification method and an identification device.
Background
Pedestrian attribute identification (Pedestrian attribute recognition) refers to a technique of processing and analyzing pictures to identify pedestrian attributes, where the pedestrian attributes include body appearance characteristics (e.g., height, weight, etc.), wearing characteristics (e.g., type and color of coat, pants, backpack, etc.), facial characteristics (e.g., age, gender, race, etc.).
Currently, pedestrian attribute recognition is mainly based on neural network models and generally comprises two stages: model training and model recognition. In the model training stage, labeled pictures are used as input data to train the neural network model until model parameters that meet the requirements are obtained; the neural network model with these parameters fixed is the trained model. In the recognition stage, the picture to be identified is used as input data of the trained neural network model, and the output data gives the pedestrian attributes identified from the picture.
However, the pedestrian attribute identification method based on a neural network model has some problems in practical application. One of them is that, owing to the diversity of pedestrian pictures, the pedestrian attributes cannot be accurately identified in all pictures. Specifically, in actual application scenes the collected pedestrian pictures are highly diverse because of differences in the monitoring scene, camera angle, pedestrian clothing, pose, and so on; a given neural-network-based method may identify some pictures well, yet fail to accurately identify pedestrian attributes from other pictures whose application scene, camera angle, etc. differ greatly.
Disclosure of Invention
In order to solve the technical problems, the application provides a neural network model training method for identifying pedestrian attributes and a method for identifying pedestrian attributes by using the trained neural network model, so that pedestrian attributes can be accurately identified from various pictures.
In a first aspect, a training method of a pedestrian attribute analysis model is provided, including:
inputting a pedestrian image and a probability map corresponding to the pedestrian image into a convolutional neural network to obtain a predicted attribute; the probability map characterizes a set of probability values that each pixel node belongs to each of at least one pedestrian component area partitioned from the pedestrian image;
calculating training loss by using the real attribute corresponding to the pedestrian image and the predicted attribute;
and if the training loss converges, determining the current model parameters of the convolutional neural network as the model parameters of the pedestrian attribute analysis model to obtain the pedestrian attribute analysis model.
With reference to the first aspect, in a first possible implementation manner of the first aspect, the calculating of the probability map includes the following steps:
inputting a pedestrian image into a pedestrian analysis model to obtain a probability map corresponding to the pedestrian image; the pedestrian analysis model is a full convolution neural network trained by training images with real probability map labels.
With reference to the first implementation manner of the first aspect, in a second possible implementation manner of the first aspect, the convolutional neural network includes a first sub-network and a second sub-network;
the step of inputting the pedestrian image and the probability map corresponding to the pedestrian image into a convolutional neural network to obtain the prediction attribute comprises the following steps:
extracting pedestrian characteristics from the pedestrian image by using a first subnetwork;
updating the probability map;
obtaining fusion characteristics according to the updated probability map and the pedestrian characteristics;
and inputting the fusion characteristics into a second sub-network to obtain the prediction attribute.
With reference to the first aspect and the foregoing possible implementation manners, in a third possible implementation manner of the first aspect, the step of updating the probability map specifically includes computing:

p̂_i^{s,c} = 1, if c = argmax_{c'} p_i^{s,c'}; otherwise p̂_i^{s,c} = p_i^{s,c}

wherein x_i is the i-th pedestrian image;
p_i^{s,c} represents the probability value that the s-th pixel node in the i-th pedestrian image belongs to the c-th class pedestrian component area;
p̂_i^{s,c} represents the updated probability value;
argmax_{c'} p_i^{s,c'} denotes taking, for the s-th pixel node, the maximum of p_i^{s,c'} over the C' pedestrian component area classes;
and/or,
the step of obtaining the fusion characteristic according to the updated probability map and the pedestrian characteristic specifically comprises the following steps:
convolving and fusing the updated probability map and the pedestrian feature to obtain a first feature:

φ(x_i)^c = p̃_i^c ⊗ f_b(x_i)

φ(x_i) = [φ(x_i)^1, φ(x_i)^2, …, φ(x_i)^{C'}]

wherein φ(x_i)^c represents the first feature of the c-th channel of the i-th pedestrian image;
p̂_i^c represents the set of updated probability values of the c-th channel;
p̃_i^c represents the set of probability values of the c-th channel after being copied so that its channel number matches that of the pedestrian feature f_b(x_i);
⊗ represents multiplication of pixels at corresponding positions;
f_b(x_i) represents the pedestrian feature of the i-th pedestrian image;
φ(x_i) represents the first feature of the i-th pedestrian image;
and obtaining a fusion characteristic by utilizing the first characteristic and the pedestrian characteristic.
With reference to the first aspect and the foregoing possible implementation manners, in a fourth possible implementation manner of the first aspect, the step of calculating a training loss by using a real attribute corresponding to the pedestrian image and the predicted attribute specifically includes:
J(θ) = Σ_j λ_j · J(θ_j), with λ_j = 1/(2σ_j²) = m·K_j / (2 · Σ_{i=1}^{m} ‖ȳ_ij - P_ij‖²)

wherein J(θ) is the total training loss;
J(θ_j) is the training loss of the j-th task;
λ_j represents the task weight of the j-th task;
σ_j² represents the variance of the prediction uncertainty in the j-th task;
m represents the total number of pedestrian images used for training in the j-th task;
K_j represents the number of options for the value of the j-th attribute;
ȳ_ij represents the real attribute (in vector form) of the i-th pedestrian image in the j-th task;
P_ij represents the predicted attribute of the i-th pedestrian image in the j-th task.
In a second aspect, a training method of a pedestrian attribute analysis model is provided, including:
inputting the pedestrian image into a convolutional neural network to obtain a prediction attribute;
calculating a training loss using the real attribute corresponding to the pedestrian image and the predicted attribute:
J(θ) = Σ_j λ_j · J(θ_j), with λ_j = 1/(2σ_j²) = m·K_j / (2 · Σ_{i=1}^{m} ‖ȳ_ij - P_ij‖²)

wherein J(θ) is the total training loss;
J(θ_j) is the training loss of the j-th task;
λ_j represents the task weight of the j-th task;
σ_j² represents the variance of the prediction uncertainty in the j-th task;
m represents the total number of pedestrian images used for training in the j-th task;
K_j represents the number of options for the value of the j-th attribute;
ȳ_ij represents the real attribute (in vector form) of the i-th pedestrian image in the j-th task;
P_ij represents the predicted attribute of the i-th pedestrian image in the j-th task;
and if the training loss converges, determining the current model parameters of the convolutional neural network as the model parameters of the pedestrian attribute analysis model to obtain the pedestrian attribute analysis model.
In a third aspect, a pedestrian attribute identification method is provided, including the steps of:
inputting the pedestrian image to be identified into the pedestrian attribute analysis model trained by the training method in any one of the first aspect or the second aspect to obtain the identified pedestrian attribute.
In a fourth aspect, a pedestrian attribute analysis model training system is provided, comprising:
the first training unit is used for inputting the pedestrian image and the probability map corresponding to the pedestrian image into the convolutional neural network to obtain the prediction attribute; calculating training loss by using the real attribute corresponding to the pedestrian image and the predicted attribute; under the condition that the training loss converges, determining the current model parameters of the convolutional neural network as the model parameters of a pedestrian attribute analysis model to obtain the pedestrian attribute analysis model;
wherein the probability map characterizes a set of probability values that each pixel node belongs to each of at least one pedestrian component area partitioned from the pedestrian image.
In a fifth aspect, a pedestrian attribute analysis model training system is provided, including:
the second training unit is used for inputting the pedestrian image into the convolutional neural network to obtain a prediction attribute; calculating training loss by using the real attribute corresponding to the pedestrian image and the predicted attribute; under the condition that the training loss converges, determining the current model parameters of the convolutional neural network as the model parameters of a pedestrian attribute analysis model to obtain the pedestrian attribute analysis model;
The second training unit includes:
the second weight self-updating unit is used for adjusting task weights corresponding to the tasks according to the following formula:
λ_j = 1/(2σ_j²) = m·K_j / (2 · Σ_{i=1}^{m} ‖ȳ_ij - P_ij‖²)

wherein λ_j represents the task weight of the j-th task;
σ_j² represents the variance of the prediction uncertainty in the j-th task;
m represents the total number of pedestrian images used for training in the j-th task;
K_j represents the number of options for the value of the j-th attribute;
ȳ_ij represents the real attribute of the i-th pedestrian image in the j-th task;
P_ij represents the predicted attribute of the i-th pedestrian image in the j-th task;

the second training unit is further configured to calculate the total training loss using the task weights:

J(θ) = Σ_j λ_j · J(θ_j)

wherein J(θ) is the total training loss;
J(θ_j) is the training loss of the j-th task;
λ_j represents the task weight of the j-th task.
In a sixth aspect, there is provided a pedestrian attribute identification apparatus including:
the prediction unit is configured to input the pedestrian image to be identified into the neural network model trained by the training system according to any one of the fourth aspect or the fifth aspect, and output the identified pedestrian attribute.
In the training method of the pedestrian attribute analysis model of the first aspect, first, a pedestrian image and a probability map corresponding to the pedestrian image are input into a convolutional neural network to obtain a predicted attribute, wherein the probability map characterizes a set of probability values that each pixel node belongs to each of at least one pedestrian component area partitioned from the pedestrian image. Then, a training loss is calculated using the real attribute corresponding to the pedestrian image and the predicted attribute. If the training loss converges, the current model parameters of the convolutional neural network are determined as the model parameters of the pedestrian attribute analysis model, to obtain the pedestrian attribute analysis model. In this way, pedestrian component areas are divided at the pixel level and serve as indication information that guides the pedestrian attribute analysis network to learn more specific and robust features, so that the influence of the diversity of pedestrian poses, camera angles, and the like can be resisted to a certain extent. When the analysis model obtained by this training method is used to identify pedestrian images with unknown attributes, the pedestrian attributes can be accurately identified even from pedestrian images that differ greatly in application scene, pedestrian pose, and camera angle.
In addition, since the learning difficulty and convergence speed differ across attribute analysis tasks, the technical solution of the second aspect provides a training method for a pedestrian attribute analysis model that automatically updates the task weight corresponding to each task according to the differences among tasks. That is, during each round of training, the training condition of each attribute analysis task is used to adjust its task weight, so as to increase the contribution of simple tasks in model training, prevent the model from being dominated by difficult tasks, let the multiple tasks train in coordination, and help the feature learning and information exchange of each task. The analysis model obtained by this training method can accurately identify multiple pedestrian attributes from a pedestrian image.
Drawings
In order to more clearly illustrate the technical solution of the present application, the drawings that are needed in the embodiments will be briefly described below, and it will be obvious to those skilled in the art that other drawings can be obtained from these drawings without inventive effort.
FIG. 1 is a flow chart of one implementation of a first embodiment of a training method of a pedestrian attribute analysis model of the present application;
FIG. 2 is a flow chart of a second implementation in a first embodiment of the training method of the pedestrian attribute analysis model of the present application;
FIG. 3 is a flowchart of one implementation of the step S100 in a first embodiment of the training method of the pedestrian attribute analysis model of the present application;
FIG. 4 is a flowchart of a third implementation in the first embodiment of the training method of the pedestrian attribute analysis model of the present application;
FIG. 5 is a flow chart of one implementation of a second embodiment of the training method of the pedestrian attribute analysis model of the present application;
FIG. 6 is a flow chart of a second implementation of the training method of the pedestrian attribute analysis model of the present application;
FIG. 7 is a schematic diagram of a structure of one embodiment of a pedestrian attribute analysis model training system of the present application;
FIG. 8 is a schematic diagram of a second embodiment of a training system for pedestrian attribute analysis model in accordance with the present application;
FIG. 9 is a schematic diagram of a third embodiment of the training system for pedestrian attribute analysis model of the present application;
FIG. 10 is a schematic diagram of a structure of a training system for a pedestrian attribute analysis model in accordance with a fourth embodiment of the present application.
Detailed Description
The embodiments of the present application are described in detail below.
Convolutional neural networks (Convolutional Neural Networks, CNN) are multi-layer neural network models adept at machine learning problems involving images, particularly large images.
In order to solve the problem of accuracy of pedestrian attribute recognition caused by the diversity of pedestrian pictures, please refer to fig. 1, in a first embodiment of the present application, a training method of a pedestrian attribute analysis model is provided, which includes:
s100: and inputting the pedestrian image and the probability map corresponding to the pedestrian image into a convolutional neural network to obtain the prediction attribute.
The pedestrian image refers to an image including pedestrians, and may be an original picture, such as a picture from a monitoring video; or may be a preprocessed picture or the like. The pedestrian images in this step belong to a first training set, which is a set of training samples for training the pedestrian attribute analysis model, and each of the pedestrian images in the first training set is one training sample. Each pedestrian image is provided with a corresponding real attribute label for marking the real attribute of the pedestrian image. The real property tags herein may be manually annotated.
In one implementation manner, the pedestrian image is obtained by preprocessing an original picture, and specifically the method comprises the following steps:
detecting whether pedestrians are contained in the original picture;
if the pedestrian is contained, acquiring the pedestrian position of the pedestrian in the original picture;
and cutting out a pedestrian image from the original picture according to the pedestrian position.
Alternatively, the size of the clipped pedestrian image may be preset here, for example, the pixel size of the preset clipped pedestrian image is 224×224.
If the original picture contains no pedestrian, the original picture is discarded and the preprocessing step proceeds with the next original picture. If several pedestrians are detected in one original picture, several pedestrian images can be obtained by clipping.
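The following is a minimal sketch of this preprocessing step, assuming OpenCV is available; `detect_pedestrians` is a hypothetical detector standing in for any pedestrian detector that returns bounding boxes, and 224×224 is the preset crop size mentioned above.

```python
import cv2  # OpenCV, used here for image reading, cropping and resizing


def preprocess(picture_path, detect_pedestrians, size=(224, 224)):
    """Cut one pedestrian image per detected pedestrian; an empty list means
    the original picture contains no pedestrian and is discarded."""
    img = cv2.imread(picture_path)
    boxes = detect_pedestrians(img)  # hypothetical detector: [(x, y, w, h), ...]
    crops = []
    for x, y, w, h in boxes:
        crop = img[y:y + h, x:x + w]           # cut out the pedestrian region
        crops.append(cv2.resize(crop, size))   # rescale to the preset 224x224 size
    return crops
```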
A pedestrian image may include S' pixel nodes; e.g., if it is cut to size 224×224, then S' = 50176. The preset pedestrian component areas may include hair, face, upper body, lower body, etc. The probability map characterizes a set of probability values that each pixel node belongs to each of at least one preset pedestrian component area partitioned from the pedestrian image. Thus, the probability map may be represented as a matrix in which each probability value represents the probability that a pixel node belongs to a certain pedestrian component area.
The preset pedestrian component areas have C' classes, and the i-th pedestrian image includes S' pixel nodes. s (lower case) denotes the serial number of a pixel node, i.e., the s-th pixel node; c (lower case) denotes the serial number of a pedestrian component area, i.e., the c-th pedestrian component area. p_i^{s,c} denotes the probability value that the s-th pixel node in the i-th pedestrian image belongs to the c-th pedestrian component area, and p_i = {p_i^{s,c}} denotes the probability map of the i-th pedestrian image. Here, the capital C in C' denotes the total number of pedestrian component area classes divided in the i-th pedestrian image, while the lower-case c numbers one of the C' areas; the capital S in S' denotes the total number of pixel nodes in the i-th pedestrian image, while the lower-case s numbers one of the S' pixel nodes.
For example, for a pedestrian image in which the divided pedestrian component areas comprise 3 classes (hair, face, and upper body) and which includes 224×224 = 50176 pixel nodes, the probability map may be represented as a 3×224×224 three-dimensional matrix. Representing the probability map in two dimensions, with the pixel nodes numbered sequentially, gives the correspondence shown in Table 1.
Table 1. Correspondence of numerical meanings in the matrix

                  Hair (c=1)    Face (c=2)    Upper body (c=3)
  Pixel node 1    p_i^{1,1}     p_i^{1,2}     p_i^{1,3}
  Pixel node 2    p_i^{2,1}     p_i^{2,2}     p_i^{2,3}
  …               …             …             …
  Pixel node S'   p_i^{S',1}    p_i^{S',2}    p_i^{S',3}

wherein p_i^{1,1} represents the probability value that the 1st pixel node belongs to hair, p_i^{1,2} represents the probability value that the 1st pixel node belongs to the face, and the other entries have similar meanings.
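As a concrete illustration of this representation, the sketch below builds a toy probability map of the shape described above with NumPy; the values are random placeholders, whereas a real probability map comes from the pedestrian analysis model of step S400.

```python
import numpy as np

C, H, W = 3, 224, 224               # 3 component areas: hair, face, upper body
prob_map = np.random.rand(C, H, W)  # placeholder values standing in for model output
prob_map /= prob_map.sum(axis=0)    # each pixel's probabilities sum to 1 over areas

flat = prob_map.reshape(C, -1)      # number the S' = 50176 pixel nodes sequentially
p_1_hair = flat[0, 0]               # probability that the 1st pixel node is hair
```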
Referring to fig. 2, the probability map corresponding to the pedestrian image can be obtained by the following steps:
s400: and inputting the pedestrian image into a pedestrian analysis model to obtain a probability map corresponding to the pedestrian image. The pedestrian analysis model is a full convolution neural network trained by training images with real probability map labels.
In this step, the pedestrian analytical model is trained, i.e., a fully convolutional neural network (Fully Convolutional Networks, FCN) for which model parameters have been determined.
The training image here is also a pedestrian image, belonging to a second training set, which is a set of training samples for training the pedestrian analytical model. Each training image is provided with a corresponding real probability map label, and the real probability map label is used for labeling the real probability map of the training image.
For a certain training image, the real probability map label it carries characterizes the actual pedestrian component areas divided in the image and the probability value of each pixel node belonging to each actual pedestrian component area. For a certain pixel, the probability value of belonging to an actual pedestrian component area is usually 1 or 0: 1 indicates that the pixel belongs to that actual pedestrian component area, and 0 indicates that it does not.
The main process of training the pedestrian analysis model is as follows. The training images in the second training set are input into the fully convolutional neural network, which divides each training image into at least one predicted pedestrian component area and outputs, for each pixel node, the predicted probability value of belonging to each predicted pedestrian component area, yielding a predicted probability map. The training loss of the fully convolutional neural network is calculated using the predicted probability map and the real probability map of the training image. If the training loss converges, the current model parameters of the fully convolutional neural network are determined as the model parameters of the pedestrian analysis model. If the training loss does not converge, the model parameters of the fully convolutional neural network are updated and the foregoing training steps are repeated, until the calculated training loss converges, at which point the latest model parameters of the fully convolutional neural network are determined as the parameters of the pedestrian analysis model.
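A minimal sketch of this training loop follows, assuming PyTorch; `parsing_net` stands for any fully convolutional network emitting C' per-pixel class scores, `loader` yields (image, label map) pairs, and testing the change of the epoch loss against a tolerance is a simplification of the convergence check described above.

```python
import torch
import torch.nn as nn


def train_parsing_model(parsing_net, loader, lr=1e-3, tol=1e-4, max_epochs=100):
    opt = torch.optim.SGD(parsing_net.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()          # per-pixel classification loss
    prev_loss = float("inf")
    for _ in range(max_epochs):
        total = 0.0
        for image, true_map in loader:       # true_map: component area index per pixel
            pred = parsing_net(image)        # (N, C', H, W) scores -> predicted map
            loss = loss_fn(pred, true_map)   # compare with the real probability map label
            opt.zero_grad()
            loss.backward()
            opt.step()
            total += loss.item()
        if abs(prev_loss - total) < tol:     # training loss has converged
            break
        prev_loss = total
    return parsing_net                       # model parameters are now determined
```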
Pedestrian attributes may include body appearance characteristics (e.g., height, weight, etc.), wear characteristics (e.g., type and color of coat, pants, backpack, etc.), facial characteristics (e.g., age, gender, race), etc.
When the pedestrian attribute analysis model is trained, the pedestrian images input into the convolutional neural network come from the first training set, and each pedestrian image correspondingly yields a predicted pedestrian attribute, i.e., a predicted attribute. When there is only one pedestrian attribute to predict, such training may be called single-task attribute analysis training; when there are several pedestrian attributes to predict, it is called multi-task attribute analysis training.
Taking single-task attribute analysis training as an example: for a pedestrian image, suppose the only pedestrian attribute to predict is hair color, and the hair color has 6 possible values. Table 2 shows the probabilities, obtained by inputting the pedestrian image into the convolutional neural network, that the hair is each corresponding color.
Table 2. Prediction attribute example

              Yellow   Brown   White   Red    Green   Black
  Hair color  0.5      0.3     0.01    0.05   0.04    0.1
In particular, in one implementation, the attention mechanism may also be introduced simultaneously when introducing the probability map into the method of training the pedestrian attribute analysis model. Specifically, referring to fig. 3, the convolutional neural network includes a first sub-network and a second sub-network; the step of S100 may include:
s110: extracting pedestrian characteristics from the pedestrian image by using a first subnetwork;
s120: updating the probability map;
s130: obtaining fusion characteristics according to the updated probability map and the pedestrian characteristics;
s140: and inputting the fusion characteristics into a second sub-network to obtain the prediction attribute.
In step S110, the pedestrian feature may be denoted f_b(x_i). Extracting pedestrian features from the pedestrian image with the first sub-network may use existing implementations. More specifically, the first sub-network may comprise several convolution groups, each consisting of one convolution layer and one pooling layer. The pedestrian image is input into the first sub-network, features are extracted and downsampled by the successive convolution groups, and the pedestrian feature f_b(x_i) of the pedestrian image is finally obtained, as in the sketch below.
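A minimal sketch of such a first sub-network, assuming PyTorch; the two convolution groups and their channel sizes are illustrative choices, not the patent's actual configuration.

```python
import torch.nn as nn

# Two convolution groups, each one convolution layer plus one pooling layer.
first_subnetwork = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(64, 128, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
)
# A 224x224 pedestrian image x_i comes out as a feature f_b(x_i) of shape
# (128, 56, 56), matching the sizes used in the fusion example further below.
```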
In step S120, updating the probability values in the probability map specifically includes computing:

p̂_i^{s,c} = 1, if c = argmax_{c'} p_i^{s,c'}; otherwise p̂_i^{s,c} = p_i^{s,c}

wherein x_i is the i-th pedestrian image, and i is the index number of the pedestrian image;
p_i^{s,c} represents the probability value that the s-th pixel node in the i-th pedestrian image belongs to the c-th pedestrian component area;
p̂_i^{s,c} represents the updated probability value;
argmax_{c'} p_i^{s,c'} denotes taking the maximum of p_i^{s,c'} over the C' pedestrian component area classes for the s-th pixel node. That is, when the c-th area attains this maximum, the value of p̂_i^{s,c} is updated to 1; otherwise p̂_i^{s,c} keeps the original value p_i^{s,c}.
For example, for the s-th pixel node in Table 1, its probability values before the update are shown in Table 3, and its values after the update are shown in Table 4.
Table 3. Probability values of the s-th pixel node before the update

                       Hair (c=1)   Face (c=2)   Upper body (c=3)
  The s-th pixel node  0.7          0.2          0.1

Table 4. Probability values of the s-th pixel node after the update

                       Hair (c=1)   Face (c=2)   Upper body (c=3)
  The s-th pixel node  1            0.2          0.1
If the probability value that the s-th pixel node belongs to the c-th pedestrian component area is the largest, the c-th pedestrian component area is taken as the predicted class of the s-th pixel node, and the other (C' - 1) classes are its non-predicted classes. By updating the probability values in the probability map, the probability of the predicted class is set to 1, so that the information of the pixel node is kept as fully as possible in the subsequent feature fusion step. Meanwhile, the probabilities of the non-predicted classes are also kept, which prevents the information loss that a prediction error at the s-th pixel node would otherwise cause. A sketch of this update follows.
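A minimal NumPy sketch of this update rule; `p` stands for the probability map p_i of one pedestrian image, and only the array shapes are assumptions.

```python
import numpy as np


def update_probability_map(p):
    """p: (C', H, W) probability map. Set each pixel node's largest probability
    to 1 and keep all other probabilities unchanged."""
    p_hat = p.copy()
    winner = p.argmax(axis=0)             # predicted component area of every pixel node
    rows, cols = np.indices(winner.shape)
    p_hat[winner, rows, cols] = 1.0       # probability of the predicted class -> 1
    return p_hat


# A pixel node with probabilities [0.7, 0.2, 0.1] becomes [1.0, 0.2, 0.1],
# reproducing the change from Table 3 to Table 4.
```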
In step S130, obtaining a fusion feature according to the updated probability map and the pedestrian feature may specifically include:
s131: convolving and fusing the updated probability map and the pedestrian characteristic to obtain a first characteristic;
s132: abstracting the first feature to obtain a second feature;
s133: abstracting the pedestrian features to obtain third features;
s134: and adding and fusing the second feature and the third feature to obtain a fused feature.
In one implementation, the convolution fusion step of S131 may include computing:

φ(x_i)^c = p̃_i^c ⊗ f_b(x_i)

φ(x_i) = [φ(x_i)^1, φ(x_i)^2, …, φ(x_i)^{C'}]

wherein φ(x_i)^c represents the first feature of the c-th channel of the i-th pedestrian image;
p̂_i^c represents the set of updated probability values of the c-th channel, i.e., the set of (updated) probability values that each pixel node in the pedestrian image belongs to the c-th class pedestrian component area;
p̃_i^c represents the set of probability values of the c-th channel after being copied so that its channel number matches that of the pedestrian feature f_b(x_i);
⊗ represents multiplication of pixels at corresponding positions;
f_b(x_i) represents the pedestrian feature of the i-th pedestrian image;
φ(x_i) represents the first feature of the i-th pedestrian image.

For example, assume the input pedestrian feature f_b(x_i) has size (128, 56, 56), where 128 is the number of channels of the pedestrian feature and 56 and 56 are the height and width, respectively; and assume the updated probability map p̂_i has size (9, 56, 56), where 9 is the number of channels of the probability map, i.e., the total number of pedestrian component areas, and 56 and 56 are the height and width. The c-th channel p̂_i^c of the updated probability map is copied 128 times so that its channel number matches that of f_b(x_i), i.e., p̃_i^c has size (128, 56, 56). Then the corresponding elements of p̃_i^c and f_b(x_i) are multiplied to perform the convolution fusion, obtaining the first feature φ(x_i)^c.
Since the convolution fusion operation is performed separately for each channel of the probability map, i.e., for each pedestrian component area, the resulting first feature φ(x_i) = [φ(x_i)^1, φ(x_i)^2, …, φ(x_i)^{C'}] can be expressed as a matrix of size ((9×128) × 56 × 56).
After the convolution fusion of the updated probability map and the pedestrian features, the features of each semantic region of the pedestrian image are separated out individually, so that the convolution layers of the second sub-network can learn with the right emphasis in the subsequent steps: from the specific values in the first feature, they can learn which semantic regions deserve more attention and how to combine the features of the individual semantic regions.
In steps S132 to S134, the first feature is abstracted by several convolution layers to obtain a second feature whose number of channels equals that of the pedestrian feature. The pedestrian feature is abstracted by several convolution layers to obtain a third feature whose number of channels also equals that of the pedestrian feature. Finally, the second feature and the third feature are added and fused to obtain a more comprehensive fusion feature, so that the convolutional neural network can be trained more accurately and the prediction accuracy improves; see the sketch after this paragraph.
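The sketch below renders steps S131 to S134 in PyTorch under stated assumptions: `p_hat` is the updated probability map, `f_b` the pedestrian feature, and `conv_second` / `conv_third` stand for the small stacks of convolution layers that abstract the first feature and the pedestrian feature (their exact architecture is not specified by the text).

```python
import torch


def fuse(p_hat, f_b, conv_second, conv_third):
    """p_hat: (N, C', H, W) updated probability map; f_b: (N, D, H, W) pedestrian
    feature. Returns the fusion feature of steps S131-S134."""
    n, c_areas, h, w = p_hat.shape
    d = f_b.shape[1]
    # S131: replicate each probability channel D times to match f_b's channel
    # count, then multiply pixels at corresponding positions; the result holds
    # one first-feature slice per pedestrian component area: (N, C'*D, H, W).
    first = (p_hat.unsqueeze(2) * f_b.unsqueeze(1)).reshape(n, c_areas * d, h, w)
    second = conv_second(first)   # S132: abstract the first feature to D channels
    third = conv_third(f_b)       # S133: abstract the pedestrian feature to D channels
    return second + third         # S134: additive fusion of the two features
```

With D = 128 and C' = 9 as in the example above, `first` has shape (N, 1152, 56, 56), i.e., the ((9×128) × 56 × 56) matrix described in the text.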
In step S140, the second sub-network and the first sub-network may be regarded as part of a convolutional neural network, which together form a convolutional neural network. The second subnetwork may comprise a plurality of convolution groups and a plurality of fully-connected layers, each convolution group comprising a convolution layer and a pooling layer, the plurality of convolution groups being connected in sequence, the last convolution group being connected with the fully-connected layer. And inputting the fusion characteristics into a second sub-network, extracting and downsampling the characteristics of a plurality of convolution groups, and finishing prediction through a full-connection layer to obtain the prediction attribute.
For example, table 5 illustrates predicted attribute tags for a pedestrian image in a first training set. Wherein each numerical value represents a probability that the value of the attribute is the corresponding option. That is, 0.1 indicates that the probability value of the pedestrian image that the hair color is yellow is 0.1, the probability value of the pedestrian image that it is brown is 0.6, and the rest of the meanings are similar. The option with the highest probability is selected and can be used as the predicted value of the attribute which is finally output.
Table 5. Prediction attribute example

              Yellow   Brown   White   Red    Green   Black
  Hair color  0.1      0.6     0.05    0.15   0.05    0.05
S200: and calculating training loss by using the real attribute corresponding to the pedestrian image and the predicted attribute.
As mentioned above, each pedestrian image in the first training set is respectively provided with a real attribute label corresponding to the pedestrian image, and the real attribute labels are used for labeling the real attributes of the pedestrian images. The real property tags herein may be manually annotated.
For example, table 6 illustrates the true attribute tags for a pedestrian image in the first training set shown in table 5. Wherein, 1 indicates that the attribute is a corresponding value, and 0 indicates that the attribute is not the value. That is, the hair color of the pedestrian image is brown, not the other five colors.
Table 6. Real attribute example

              Yellow   Brown   White   Red    Green   Black
  Hair color  0        1       0       0      0       0
The training loss may be calculated using existing loss functions, such as a square error loss function, an SVM loss function, a softmax loss function, and the like.
As previously described, such training is called single-task attribute analysis training when there is only one pedestrian attribute to predict, and multi-task attribute analysis training when there are several. Assuming T attributes are to be predicted, there are correspondingly T training tasks in total, where the training task that predicts the j-th attribute is called the j-th task. The total training loss is:
J(θ) = Σ_j λ_j · J(θ_j)

wherein J(θ) is the total training loss;
J(θ_j) is the training loss of the j-th task;
λ_j represents the task weight of the j-th task; generally, λ_j may be set to a fixed value.
The training loss J(θ_j) of the j-th task can be obtained by existing calculation means; for example, J(θ_j) may be calculated with a softmax loss function, as follows:

J(θ_j) = -(1/m) · Σ_{i=1}^{m} Σ_{k=1}^{K_j} w_jk · δ(y_ij = k) · log P_ij^k

wherein m represents the total number of pedestrian images used for training in the j-th task, i.e., the total number of training samples of the j-th task, and i is the index number of the i-th pedestrian image among the m pedestrian images;

K_j represents the number of options for the value of the j-th attribute, and k is the index number of an option among the K_j options; for example, for the pedestrian attribute "sex", the value may be one of the two options "male" and "female", so K_j equals 2;

y_ij represents the real attribute of the i-th pedestrian image in the j-th task;

δ(·) represents a Dirac function: δ(y_ij = k) = 1 if and only if y_ij takes the value of the k-th option, otherwise δ(y_ij = k) = 0;

w_jk represents a penalty coefficient against unbalanced data, a decreasing function of r_jk, wherein r_jk is the ratio of the number of training samples of the j-th task whose j-th attribute takes the k-th option to the total number of training samples of the j-th task. In multi-task attribute analysis training, the number of training samples for each value of each attribute is often unbalanced; for example, there may be significantly fewer pedestrians wearing sunglasses among the training samples than pedestrians not wearing them. To make training more effective in this situation, the penalty coefficient w_jk against unbalanced data is introduced: when the proportion of samples containing the k-th option of the j-th attribute becomes large, w_jk becomes small, thereby applying a penalty;

P_ij^k represents the predicted probability value that the j-th attribute of the i-th pedestrian image takes the k-th option; more specifically, P_ij = [P_ij^1, P_ij^2, …, P_ij^{K_j}].
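A minimal NumPy sketch of this per-task loss; the exponential form of the penalty coefficient w_jk is an assumption, since the text only requires that w_jk decrease as the sample ratio r_jk grows.

```python
import numpy as np


def task_loss(P, y, r):
    """P: (m, K_j) predicted probabilities; y: (m,) index of the true option per
    image (the Dirac term selects exactly this option); r: (K_j,) ratio of
    samples taking each option. Returns J(theta_j)."""
    m = P.shape[0]
    w = np.exp(-r)                            # assumed penalty: shrinks as r_jk grows
    log_p_true = np.log(P[np.arange(m), y])   # log P_ij^k at the true option k
    return -(w[y] * log_p_true).sum() / m
```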
S300: and if the training loss converges, determining the current model parameters of the convolutional neural network as parameters of a pedestrian attribute analysis model to obtain the pedestrian attribute analysis model.
Referring to fig. 4, if the training loss does not converge, S301 is performed: updating the model parameters of the convolutional neural network. Steps S100 to S200 are then repeated until the calculated training loss converges, and the model parameters of the latest convolutional neural network are determined as the parameters of the pedestrian attribute analysis model. Here, the parameters of the convolutional neural network may be updated with an existing algorithm such as stochastic gradient descent (Stochastic Gradient Descent, SGD).
In addition to the aforementioned diversity of pedestrian pictures, which affects the accuracy of pedestrian attribute recognition, the neural-network-based pedestrian attribute identification method has another problem in practical application: the incoordination of multi-task attribute analysis training. Specifically, in practical applications it is often necessary to identify multiple pedestrian attributes from one picture. However, when the neural network model is trained, the learning difficulty and convergence speed are not the same for different pedestrian attributes; for example, it is often harder to identify a person's age than the color of clothing. In conventional model training methods, the task weights of different tasks are usually set to fixed values, which ignores these differences in difficulty and convergence, so it is hard to form coordinated training for a model that must identify multiple pedestrian attributes. For the same reason, when the trained analysis model is used to identify multiple pedestrian attributes in a pedestrian image, it is difficult for it to identify all of the pedestrian attributes accurately.
For this purpose, a second embodiment of the application proposes a further training method for a pedestrian attribute analysis model, in which the task weight λ_j can be adjusted according to the training situation.
Specifically, referring to fig. 5, a training method of a pedestrian attribute analysis model is provided, which includes steps S500 to S700.
S500: and inputting the pedestrian image into a convolutional neural network to obtain the prediction attribute.
The convolutional neural network takes the pedestrian image as input data and outputs the predicted attribute. In one implementation, the convolutional neural network may comprise a first sub-network and a second sub-network. The first sub-network is used to extract the pedestrian feature f_b(x_i) from the pedestrian image. In particular, the first sub-network may comprise several convolution groups, each consisting of one convolution layer and one pooling layer: the pedestrian image is input into the first sub-network, features are extracted and downsampled by the successive convolution groups, and the pedestrian feature f_b(x_i) is finally obtained. The second sub-network takes the pedestrian feature f_b(x_i) as input data and produces the predicted attribute. In particular, the second sub-network may comprise several convolution groups and a number of fully connected layers, each convolution group consisting of one convolution layer and one pooling layer, with the convolution groups connected in sequence and the last one connected to the fully connected layers. The pedestrian feature is input into the second sub-network, features are extracted and downsampled by the convolution groups, and the prediction is completed by the fully connected layers, giving the predicted attribute.
S600: calculating a training loss using the real attribute corresponding to the pedestrian image and the predicted attribute:
J(θ) = Σ_j λ_j · J(θ_j)    (5)

wherein J(θ) is the total training loss;
J(θ_j) is the training loss of the j-th task;
λ_j represents the task weight of the j-th task, computed by equation (4) below;
σ_j² represents the variance of the prediction uncertainty in the j-th task;
m represents the total number of pedestrian images used for training in the j-th task;
K_j represents the number of options for the value of the j-th attribute;
ȳ_ij represents the real attribute (in vector form) of the i-th pedestrian image in the j-th task;
P_ij represents the predicted attribute of the i-th pedestrian image in the j-th task.
The task weight λ_j is derived as follows.

For the pedestrian attribute identification and classification problem, the classification problem can first be viewed as a regression problem. That is, for the i-th pedestrian image, suppose its real attribute in the j-th task is y_ij, and convert the label marking the real attribute into vector form: ȳ_ij = [δ(y_ij = 1), δ(y_ij = 2), …, δ(y_ij = K_j)], wherein K_j represents the number of options for the value of the j-th attribute, k is the index number of an option among the K_j options, and δ(·) represents a Dirac function: δ(y_ij = k) = 1 if and only if y_ij takes the k-th option, otherwise δ(y_ij = k) = 0. For example, taking the real attribute of Table 6 as an example, K_j = 6, i.e., the j-th attribute, hair color, has 6 possible values; for the i-th pedestrian image, whose j-th attribute (hair color) has the real value brown, the vector form is ȳ_ij = [0, 1, 0, 0, 0, 0].

After the classification problem is viewed as a fitting/regression problem, its training goal remains consistent, i.e., make the predicted attribute P_ij approach the real attribute ȳ_ij. Considering homoscedastic uncertainty with a Gaussian distribution, the prediction process of the i-th pedestrian image in the j-th task can be modeled as:

p(ȳ_ij | P_ij) = N(ȳ_ij; P_ij, σ_j²)    (1)

wherein σ_j² is the variance of the uncertainty in the j-th task. Assuming the j-th task has m training samples and there are T prediction tasks in all, the overall process can be modeled as:

p(Ȳ | P) = Π_{j=1}^{T} Π_{i=1}^{m} N(ȳ_ij; P_ij, σ_j²)    (2)

The negative log-likelihood of equation (2) is written as:

-log p(Ȳ | P) ∝ Σ_{j=1}^{T} [ (1/(2σ_j²)) · Σ_{i=1}^{m} ‖ȳ_ij - P_ij‖² + m·K_j·log σ_j ]    (3)

In a common fitting/regression problem, Σ_{i=1}^{m} ‖ȳ_ij - P_ij‖² is the training objective function, whereas in equation (3) the multi-attribute analysis fitting/regression problem learns the task weights adaptively, with 1/(2σ_j²) acting as the task weight of the j-th task. The latter term K_j·log σ_j in equation (3) is a regularization term that restricts σ_j from becoming too large or too small. Considering that the classification problem and the regression problem above are essentially one problem with a consistent goal, namely making the predicted attribute P_ij approach the real attribute ȳ_ij, the task weight of each task can be estimated in this way and applied to the classification problem.

In equation (3), the uncertainty σ_j² can be estimated by the maximum likelihood method: setting the derivative of equation (3) with respect to σ_j² to zero gives σ_j² = Σ_{i=1}^{m} ‖ȳ_ij - P_ij‖² / (m·K_j). Thus the task weight of the j-th task may be set to:

λ_j = 1/(2σ_j²) = m·K_j / (2 · Σ_{i=1}^{m} ‖ȳ_ij - P_ij‖²)    (4)
from the above equation (4), in the process of automatically updating the task weights, the weight λ of each taskj As model training increases, but the speed of simple tasks increases relatively faster (simple tasksFast descent and corresponding loss is also smaller), while the difficult task increases relatively slowly. The method can increase the contribution degree of simple tasks in model training to a certain extent, prevent the model from being dominated by difficult tasks, and therefore enable training coordination of multi-task attribute analysis to be better.
S700: and if the training loss converges, determining the current model parameters of the convolutional neural network as the model parameters of a pedestrian attribute analysis model.
If the training loss is not converged, updating the model parameters in the convolutional neural network, repeating the steps from S500 to S600 for training until the calculated training loss is converged, and determining the model parameters in the latest convolutional neural network as the parameters of the pedestrian attribute analysis model.
It should be noted that, in the first embodiment of the present application, the method of introducing a probability map to perform feature fusion to train an analysis model and the method of automatically updating task weights in the second embodiment may be combined with each other.
Thus, alternatively, referring to FIG. 6, the step S500 in the second embodiment may be replaced with the step S800, i.e.:
S800: and inputting the pedestrian image and the probability map corresponding to the pedestrian image into a convolutional neural network to obtain the prediction attribute. Wherein the probability map characterizes a set of probability values for each pixel node belonging to a pedestrian component area in the pedestrian image partitioned out of at least one pedestrian component area.
The steps of S800 may refer to the descriptions related to S100 of the first embodiment, and the same specific implementation manner is adopted, which is not described herein.
In a third embodiment of the present application, there is provided a pedestrian attribute identification method including the steps of:
and (3) training the obtained pedestrian attribute analysis model by adopting the training method in the first embodiment or the second embodiment, and inputting the pedestrian image to be identified into the pedestrian attribute analysis model to obtain the identified pedestrian attribute.
Here, the pedestrian image to be recognized is also an image including a pedestrian, except that the pedestrian attribute of the pedestrian is unknown. The pedestrian image to be identified can also be an original picture, such as a picture from a surveillance video or the like; or may be a preprocessed picture or the like.
In one implementation manner, the pedestrian image to be identified is obtained by preprocessing an original picture, and specifically the method includes the following steps:
detecting whether pedestrians are contained in the original picture;
if the pedestrian is contained, acquiring the pedestrian position of the pedestrian in the original picture;
and cutting out the pedestrian image to be identified from the original picture according to the pedestrian position.
Alternatively, the size of the cut-out pedestrian image to be recognized may be preset here, for example, the pixel size is 224×224. In general, the size of the pedestrian image to be recognized may be identical to the size of the pedestrian image employed when training the pedestrian attribute analysis model.
And if the original picture does not contain pedestrians, discarding the original picture, and carrying out the step of preprocessing on the next original picture. If a plurality of pedestrians are detected in one original picture, a plurality of pedestrian images to be identified can be obtained by clipping.
The pedestrian attribute directly output by the pedestrian attribute analysis model can also be expressed as a matrix, similar to the way the predicted attribute is expressed during training. For example, the probability of the hair color being brown may be 0.95, while the probability of each of the other 5 colors (yellow, etc.) is 0.01. When the result is output to the user, the option with the highest probability is taken as the final predicted value of the attribute; continuing the example, the hair color of the pedestrian image to be recognized is output as brown.
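A minimal sketch of this final output step; the option list and probability row reuse the hair-color example from the text.

```python
import numpy as np

options = ["yellow", "brown", "white", "red", "green", "black"]
probs = np.array([0.01, 0.95, 0.01, 0.01, 0.01, 0.01])  # row output by the model
final_value = options[int(probs.argmax())]               # option with highest probability
print(final_value)                                       # -> "brown"
```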
In a fourth embodiment of the present application, corresponding to the training method in the first embodiment, please refer to fig. 7, there is provided a pedestrian attribute analysis model training system including:
the first training unit 1 is used for inputting a pedestrian image and a probability map corresponding to the pedestrian image into a convolutional neural network to obtain a prediction attribute; calculating training loss by using the real attribute corresponding to the pedestrian image and the predicted attribute; and under the condition that the training loss converges, determining the current model parameters of the convolutional neural network as the model parameters of the pedestrian attribute analysis model to obtain the pedestrian attribute analysis model. Wherein the probability map characterizes a set of probability values for each pixel node belonging to a pedestrian component area in the pedestrian image partitioned out of at least one pedestrian component area.
Optionally, referring to fig. 8, the training system may further include:
And the pedestrian analysis unit 2 is used for inputting the pedestrian image into the pedestrian analysis model to obtain a probability map corresponding to the pedestrian image. The pedestrian analysis model is a full convolution neural network trained by training images with real probability map labels.
Optionally, the first training unit 1 may further include:
a first pedestrian parsing auxiliary module 11 for updating the probability map; and obtaining a fusion characteristic according to the updated probability map and the pedestrian characteristic.
The first training unit 1 is further configured to extract pedestrian features from the pedestrian image by using a first sub-network; and inputting the fusion characteristic into a second sub-network to obtain the prediction attribute.
Optionally, the step in which the first pedestrian parsing auxiliary module 11 updates the probability map specifically includes computing:

p̂_i^{s,c} = 1, if c = argmax_{c'} p_i^{s,c'}; otherwise p̂_i^{s,c} = p_i^{s,c}

wherein x_i is the i-th pedestrian image;
p_i^{s,c} represents the probability value that the s-th pixel node in the i-th pedestrian image belongs to the c-th pedestrian component area;
p̂_i^{s,c} represents the updated probability value;
argmax_{c'} p_i^{s,c'} denotes taking the maximum of p_i^{s,c'} over the C' pedestrian component area classes for the s-th pixel node.
Optionally, the step in which the first pedestrian parsing auxiliary module 11 obtains the fusion feature from the updated probability map and the pedestrian feature specifically includes:

convolving and fusing the updated probability map and the pedestrian feature to obtain a first feature:

φ(x_i)^c = p̃_i^c ⊗ f_b(x_i)

φ(x_i) = [φ(x_i)^1, φ(x_i)^2, …, φ(x_i)^{C'}]

wherein φ(x_i)^c represents the first feature of the c-th channel of the i-th pedestrian image;
p̂_i^c represents the set of updated probability values of the c-th channel;
p̃_i^c represents the set of probability values of the c-th channel after being copied so that its channel number matches that of the pedestrian feature f_b(x_i);
⊗ represents multiplication of pixels at corresponding positions;
f_b(x_i) represents the pedestrian feature of the i-th pedestrian image;
φ(x_i) represents the first feature of the i-th pedestrian image;

and obtaining the fusion feature using the first feature and the pedestrian feature.
Optionally, the first training unit 1 further comprises:
a weight self-updating unit 12, configured to adjust a task weight corresponding to a task according to the following formula:
λ_j = 1/(2σ_j²) = m·K_j / (2 · Σ_{i=1}^{m} ‖ȳ_ij - P_ij‖²)

wherein λ_j represents the task weight of the j-th task;
σ_j² represents the variance of the prediction uncertainty in the j-th task;
m represents the total number of pedestrian images used for training in the j-th task;
K_j represents the number of options for the value of the j-th attribute;
ȳ_ij represents the real attribute of the i-th pedestrian image in the j-th task;
P_ij represents the predicted attribute of the i-th pedestrian image in the j-th task.

The first training unit 1 is further configured to calculate the total training loss using the task weights:

J(θ) = Σ_j λ_j · J(θ_j)

wherein J(θ) is the total training loss;
J(θ_j) is the training loss of the j-th task;
λ_j represents the task weight of the j-th task.
Optionally, the training system further comprises:
a preprocessing unit 3, configured to detect whether the original picture contains a pedestrian; if the pedestrian is contained, acquiring the pedestrian position of the pedestrian in the original picture; and cutting out the pedestrian image from the original picture according to the pedestrian position.
In a fifth embodiment of the present application, referring to fig. 9, corresponding to the training method in the second embodiment, a pedestrian attribute analysis model training system is provided, including:
the second training unit 4 is used for inputting the pedestrian image into the convolutional neural network to obtain a prediction attribute; calculating training loss by using the real attribute corresponding to the pedestrian image and the predicted attribute; under the condition that the training loss converges, determining the current model parameters of the convolutional neural network as the model parameters of a pedestrian attribute analysis model to obtain the pedestrian attribute analysis model;
the second training unit 4 comprises:
a second weight self-updating unit 42, configured to adjust a task weight corresponding to the task according to the following formula:
λ_j = 1/(2σ_j²) = m·K_j / (2 · Σ_{i=1}^{m} ‖ȳ_ij - P_ij‖²)

wherein λ_j represents the task weight of the j-th task;
σ_j² represents the variance of the prediction uncertainty in the j-th task;
m represents the total number of pedestrian images used for training in the j-th task;
K_j represents the number of options for the value of the j-th attribute;
ȳ_ij represents the real attribute of the i-th pedestrian image in the j-th task;
P_ij represents the predicted attribute of the i-th pedestrian image in the j-th task.

The second training unit 4 is further configured to calculate the total training loss using the task weights:

J(θ) = Σ_j λ_j · J(θ_j)

wherein J(θ) is the total training loss;
J(θ_j) is the training loss of the j-th task;
λ_j represents the task weight of the j-th task.
Optionally, the second training unit 4 is further configured to input the pedestrian image and the probability map corresponding to the pedestrian image into the convolutional neural network to obtain the predicted attribute, wherein the probability map characterizes a set of probability values that each pixel node belongs to each of at least one pedestrian component area partitioned from the pedestrian image.
Optionally, referring to fig. 10, the training system may further include:
a pedestrian analysis unit 2, configured to input the pedestrian image into a pedestrian analysis model to obtain the probability map corresponding to the pedestrian image. The pedestrian analysis model is a fully convolutional neural network trained on training images labeled with real probability maps.
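A probability map of this kind could, for example, be read out of a trained fully convolutional network as in the following sketch; the softmax read-out and logit shape are assumptions about the network head.

```python
import torch

def probability_map(parsing_model, image):
    """Run a trained fully convolutional parsing network on one pedestrian
    image (C x H x W tensor) and turn its per-pixel logits into class
    probabilities with a softmax over the class dimension."""
    parsing_model.eval()
    with torch.no_grad():
        logits = parsing_model(image.unsqueeze(0))   # (1, num_classes, H, W)
        return torch.softmax(logits, dim=1)[0]       # (num_classes, H, W)
```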
Optionally, the second training unit 4 may further include:
a second pedestrian parsing auxiliary module 41, configured to update the probability map and to obtain the fusion feature from the updated probability map and the pedestrian feature.
The second training unit 4 is further configured to extract the pedestrian feature from the pedestrian image using a first sub-network, and to input the fusion feature into a second sub-network to obtain the predicted attribute.
Optionally, the second pedestrian parsing auxiliary module 41 updates the probability map according to:
M̃_C(x_i, s) = max_{c∈C} M_c(x_i, s)
wherein x_i is the i-th pedestrian image;
M_c(x_i, s) represents the probability value that the s-th pixel node in the i-th pedestrian image belongs to the c-th class;
M̃_C(x_i, s) represents the updated probability value, obtained by taking, for the s-th pixel node, the maximum of M_c(x_i, s) over the classes c belonging to the C-th pedestrian component area.
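A minimal sketch of this update in Python/PyTorch, assuming the grouping of fine-grained classes into component areas is supplied as index lists (`area_classes` is a hypothetical encoding, not part of the patent):

```python
import torch

def update_probability_map(prob, area_classes):
    """For every pixel node, the probability of a pedestrian component area
    is the maximum probability over the classes grouped into that area.

    prob:         (num_classes, H, W) tensor; prob[c] is the map for class c.
    area_classes: list of class-index lists, one list per component area.
    """
    return torch.stack([prob[idx].max(dim=0).values for idx in area_classes])

# e.g. merging a fine-grained parsing output into four component areas:
# area_maps = update_probability_map(prob, [[1, 2], [5, 6, 7], [9, 12], [8, 18]])
```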
Optionally, the second pedestrian parsing auxiliary module 41 obtains the fusion feature from the updated probability map and the pedestrian feature as follows.
The updated probability map is fused with the pedestrian feature by convolution to obtain the first feature:
φ(x_i) = [φ(x_i)_1, φ(x_i)_2, …, φ(x_i)_C']
φ(x_i)_c = rep(M̃_c(x_i)) ⊙ f_b(x_i)
wherein φ(x_i)_c represents the first feature of the c-th channel of the i-th pedestrian image;
M̃_c(x_i) represents the set of updated probability values of the c-th channel;
rep(M̃_c(x_i)) represents that set of probability values replicated so that its channel count matches that of the pedestrian feature f_b(x_i);
⊙ represents pixel-wise multiplication at corresponding positions;
f_b(x_i) represents the pedestrian feature of the i-th pedestrian image;
φ(x_i) represents the first feature of the i-th pedestrian image.
The fusion feature is then obtained using the first feature and the pedestrian feature.
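A minimal sketch of this fusion in Python/PyTorch follows; `fuse_features` and its interface are illustrative, and concatenating φ(x_i) with f_b is one plausible reading of "obtaining a fusion feature using the first feature and the pedestrian feature", not the patented implementation.

```python
import torch

def fuse_features(prob_updated, f_b):
    """Each updated single-channel area map is replicated to match the
    channel count of the pedestrian feature f_b, multiplied pixel-wise with
    f_b at corresponding positions, and the per-area results are
    concatenated into the first feature phi(x_i).

    prob_updated: (C', H, W) updated probability maps, one per component area.
    f_b:          (K, H, W) pedestrian feature from the first sub-network.
    """
    K = f_b.shape[0]
    parts = [prob_updated[c].expand(K, -1, -1) * f_b   # pixel multiplication
             for c in range(prob_updated.shape[0])]
    phi = torch.cat(parts, dim=0)                      # first feature phi(x_i)
    return torch.cat([phi, f_b], dim=0)                # fusion feature
```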
Optionally, the training system further comprises:
a preprocessing unit 3, configured to detect whether an original picture contains a pedestrian; if so, to acquire the position of the pedestrian in the original picture; and to crop the pedestrian image out of the original picture according to that position.
In a sixth embodiment of the present application, corresponding to the identification method of the third embodiment, a pedestrian attribute identification apparatus is provided, comprising:
a prediction unit, configured to input a pedestrian image to be identified into the neural network model trained by the training system of the fourth or fifth embodiment and to output the identified pedestrian attributes.
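The prediction unit's inference step can be sketched as follows; a model exposing one output head per attribute task is an assumed interface, not prescribed by the patent.

```python
import torch

def identify_attributes(model, image):
    """Feed one pedestrian image of unknown attributes through the trained
    model and pick, for each attribute task, the option with the highest
    score."""
    model.eval()
    with torch.no_grad():
        outputs = model(image.unsqueeze(0))    # list of (1, K_j) per-task logits
        return [logits.argmax(dim=1).item() for logits in outputs]
```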
For the same or similar parts among the various embodiments in this specification, reference may be made from one embodiment to another. The embodiments of the present application described above do not limit the scope of the present application.
