CN118097732A

Movatterモバイル変換

Info

Publication number: CN118097732A
Application number: CN202211448977.0A
Authority: CN
Inventors: 夏瀚笙
Original assignee: Shanghai Jinsheng Communication Technology Co ltd
Current assignee: Shanghai Jinsheng Communication Technology Co ltd
Priority date: 2022-11-18
Filing date: 2022-11-18
Publication date: 2024-05-28

Abstract

The embodiment of the application discloses a model generation method, a face key point detection device and electronic equipment. The method comprises the following steps: acquiring a training data set, wherein the training data set comprises a plurality of large-pose face images and a plurality of small-pose face images, and the difference value between the number of the large-pose face images and the number of the small-pose face images is within a preset range; training the face key point detection model to be trained based on the training data set, so that the converged face key point detection model to be trained is used as the target face key point detection model. By the method, the key point detection model of the face to be trained can be trained based on the training data set of the large-gesture face image and the small-gesture face image, and the target face key point detection model is obtained, so that the key point detection and recognition accuracy of the target face key point detection model is improved.

Description

Model generation method, face key point detection device and electronic equipment

Technical Field

The present application relates to the field of computer technologies, and in particular, to a model generation method, a method and an apparatus for detecting key points of a face, and an electronic device.

Background

Along with the continuous development of artificial intelligence technology, the detection of key points of a human face is becoming more and more focused, the detection technology of key points of the human face can detect the position of preset characteristic points of the human face, the characteristic points of the human face can comprise the human face, eyes, nose, mouth, eyebrows and the like, and the tasks of beautifying face, beautifying make-up, reconstructing the human face, face recognition and the like can be further carried out based on the detected position of the characteristic points of the human face. In a related manner, a convolutional neural network (Convolutional Neural Network, CNN) may be employed to extract features in the face image, and then based on the extracted features, key point information may be obtained using a coded form of coordinate regression or heat map regression. In the related manner, however, the accuracy of the detection of the key points of the face needs to be improved.

Disclosure of Invention

In view of the above problems, the present application provides a model generating method, a method and an apparatus for detecting key points of a face, and an electronic device, so as to improve the above problems.

In a first aspect, the present application provides a model generating method, the method comprising: acquiring a training data set, wherein the training data set comprises a plurality of large-pose face images and a plurality of small-pose face images, and the difference value between the number of the large-pose face images and the number of the small-pose face images is within a preset range; training the face key point detection model to be trained based on the training data set, so that the converged face key point detection model to be trained is used as the target face key point detection model.

In a second aspect, the present application provides a method for detecting a key point of a face, where the method includes: acquiring a face image to be detected; inputting the face image to be detected into a target face key point detection model obtained based on the method of any one of claims 1-8 to obtain a face key point detection result of the face image to be detected.

In a third aspect, the present application provides a model generating apparatus, the apparatus comprising: the system comprises a data set acquisition unit, a data set acquisition unit and a data processing unit, wherein the data set acquisition unit is used for acquiring a training data set, the training data set comprises a plurality of large-posture face images and a plurality of small-posture face images, and the difference value between the number of the large-posture face images and the number of the small-posture face images is within a preset range; the model generating unit is used for training the face key point detection model to be trained based on the training data set, and taking the converged face key point detection model to be trained as a target face key point detection model.

In a fourth aspect, the present application provides a face key point detection apparatus, the apparatus comprising: the face image acquisition unit is used for acquiring a face image to be detected; the detection result obtaining unit is configured to input the face image to be detected into a target face key point detection model obtained based on the method according to any one of claims 1 to 8, so as to obtain a face key point detection result of the face image to be detected.

In a fifth aspect, the present application provides an electronic device comprising one or more processors and a memory; one or more programs are stored in the memory and configured to be executed by the one or more processors, the one or more programs configured to perform the methods described above.

In a sixth aspect, the present application provides a computer readable storage medium having program code stored therein, wherein the method described above is performed when the program code is run.

The application provides a model generation method, a face key point detection method, a device, electronic equipment and a storage medium. By the method, the human face key point detection model to be trained can be trained based on the training data set comprising a plurality of large-pose human face images and a plurality of small-pose human face images, wherein the difference value between the number of the large-pose human face images and the number of the small-pose human face images is in a preset range, namely the human face key point detection model to be trained can be trained through the training data set with the difference value between the number of positive samples (small-pose human face images) and the number of negative samples (large-pose human face images) in the preset range, the generalization capability of the human face key point detection model to be trained can be improved, the accuracy of the key point detection of the negative samples is improved, and therefore the accuracy of the target human face key point detection model to human face key point detection is improved.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings required for the description of the embodiments will be briefly described below, and it is apparent that the drawings in the following description are only some embodiments of the present application, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.

FIG. 1 is a flow chart of a model generation method according to an embodiment of the present application;

fig. 2 shows a schematic diagram of a face key point detection model to be trained according to an embodiment of the present application;

FIG. 3 illustrates a flow chart of one embodiment of S120 of FIG. 1 in accordance with the present application;

FIG. 4 is a schematic diagram of a network structure of a shallow feature extraction network according to an embodiment of the present application;

Fig. 5 is a schematic diagram showing a network structure of another face key point detection model to be trained according to an embodiment of the present application;

Fig. 6 is a schematic diagram of a network structure of a conversion module according to an embodiment of the present application;

Fig. 7 is a schematic diagram of a network structure of another face key point detection model to be trained according to an embodiment of the present application;

fig. 8 is a schematic diagram of a network structure of a feature fusion network according to an embodiment of the present application;

FIG. 9 illustrates a flow chart of one embodiment of S127 of FIG. 3 in accordance with the present application;

Fig. 10 shows a flowchart of a face key point detection method according to an embodiment of the present application;

FIG. 11 is a block diagram showing a model generating apparatus according to an embodiment of the present application;

fig. 12 is a block diagram illustrating a face key point detection apparatus according to an embodiment of the present application;

fig. 13 shows a block diagram of an electronic device according to the present application;

Fig. 14 is a storage unit for storing or carrying program codes for implementing a model generating method, a face key point detecting method according to an embodiment of the present application.

Detailed Description

The technical solutions in the embodiments of the present application will be clearly and completely described below in conjunction with the accompanying drawings in the embodiments of the present application. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.

With the continuous development of artificial intelligence technology, face key point detection is becoming an important task in the field of computer vision.

However, the inventor finds that the accuracy of the detection of the key points of the human face needs to be improved in the related research. For example, in the scheme of coordinate regression, the coordinates of each key point are predicted through the output of a fully connected network, and this method may result in loss of spatial information, thereby reducing prediction accuracy. For another example, in the scheme of heat map regression, although spatial information can be utilized, the calculated amount and the parameter amount are too large, which causes resource waste, and the scheme based on heat map regression is not end-to-end, so that the model may have inconsistency in the training process and the reasoning process, thereby reducing the prediction accuracy.

Embodiments of the present application will be described below with reference to the accompanying drawings.

Referring to fig. 1, the method for generating a model provided by the present application includes:

S110: the method comprises the steps of obtaining a training data set, wherein the training data set comprises a plurality of large-posture face images and a plurality of small-posture face images, and the difference value between the number of the large-posture face images and the number of the small-posture face images is in a preset range.

The small-pose face image may refer to a face image obtained when a person keeps his body straight and looks right at the image acquisition device. The large-pose face image may refer to a face image obtained in a case where a body or a head of a person is greatly offset with respect to the image capturing apparatus, for example, a side face image with an excessively large yaw angle, a low head or head face image with an excessively large pitch angle, or the like.

As a way, a plurality of face images can be acquired based on an image acquisition device (such as a camera, a mobile phone and the like), wherein the plurality of images can come from different people, each person can correspond to a unique Identity (ID), and each face image corresponding to each person can comprise a plurality of large-pose face images and a plurality of small-pose face images; after the plurality of face images are obtained, the position coordinates of the key points of each face image can be marked manually, the marked plurality of face images are used as a training data set, and the marked coordinates can be real key point coordinates corresponding to the plurality of face images.

Meanwhile, the number of the large-pose face images in the acquired plurality of face images may be far lower than that of the small-pose face images, so that the large-pose face images can be expanded to enable the difference between the number of the plurality of the large-pose face images and the number of the plurality of the small-pose face images to be in a preset range in order to balance data.

Optionally, the large-pose face image can be directly copied for multiple times, so as to realize expansion of the large-pose face image.

Optionally, the large pose face image may be augmented based on a data enhancement algorithm, which may include rotation, random erasure, addition of gaussian noise, changing brightness, changing contrast, changing saturation and hue, etc.

Alternatively, the preset range may be set based on actual conditions and experimental experience. For example, the preset range may be [ -1,1], and the number of large pose face images may be equal to the number of small pose face images, or the number of large pose face images may be 1 more than the number of small pose face images, or the number of large pose face images may be 1 less than the number of small pose face images.

S120: training the face key point detection model to be trained based on the training data set, so that the converged face key point detection model to be trained is used as the target face key point detection model.

The large-pose face images and the small-pose face images can be respectively corresponding to real key point coordinates. As shown in fig. 2, the face key point detection model to be trained may include a shallow feature extraction network, an image segmentation network, a hierarchical feature extraction network, a feature fusion network and a fully connected network, where the shallow feature extraction network may be used to extract shallow features of a face image, where each shallow feature characterizes low-level semantic information of a corresponding face image, for example: texture, brightness, etc. An image segmentation network (Patch Partition) may be used to divide a shallow feature into multiple, dimensionally identical local features. The hierarchical feature extraction network may be used to further feature extract local features to obtain a plurality of hierarchical features. The feature fusion network can be used for fusing a plurality of hierarchical features to obtain fusion features, and the fully-connected network can be used for outputting predicted key point coordinates of the face image.

As one way, as shown in fig. 3, training the face keypoint detection model to be trained based on the training data set, so as to take the converged face keypoint detection model to be trained as the target face keypoint detection model, including:

S121: acquiring a plurality of face images in the current training process from the plurality of large-pose face images and the plurality of small-pose face images, wherein the plurality of face images in the current training process comprise a plurality of large-pose face images and a plurality of small-pose face images, and performing data expansion on the plurality of face images in the current training process to obtain a plurality of expansion images in the current training process, and the plurality of face images in the current training process are respectively corresponding to a first expansion image and a second expansion image.

As a mode, a plurality of face images in the current training process can be obtained from a plurality of large-gesture face images and a plurality of small-gesture face images in a training data set, and then the plurality of face images in the current training process are expanded in a data enhancement mode, so that each face image in the current training process can be correspondingly provided with a first expansion image and a second expansion image.

Alternatively, the number of face images acquired in each training process may be the same, the number of face images acquired in each training process may be represented by the batch size, and the number of large-pose face images and small-pose face images in each training process may also be kept relatively balanced.

As another way, the training data set can be divided into the training set and the verification set according to a preset proportion, the face images of the same person can not appear in the training set and the verification set, then the face images in the current training process are obtained from the large-posture face images and the small-posture face images in the training set, and then the face images in the current training process are expanded in a data enhancement mode, so that each face image in the current training process can correspond to a first expansion image and a second expansion image.

S122: and inputting the plurality of face images and the plurality of expansion images in the current training process into the shallow feature extraction network to obtain a plurality of shallow features corresponding to the current training process, wherein each shallow feature represents low-level semantic information of the corresponding face image.

As shown in fig. 4, the shallow feature extraction network may be a step block, where the step block may include a first convolution layer, a second convolution layer, a third convolution layer, a fourth convolution layer, a pooling layer, and a connection layer (concat), where the convolution kernel of the first convolution layer may be 3×3, the number of convolution kernels may be 32, and stride may be 2; the convolution kernel size of the second convolution layer may be 3×3, the number of convolution kernels may be 16, and stride may be 1; the convolution kernel size of the third convolution layer may be 3×3, the number of convolution kernels may be 32, and stride may be 2; the convolution kernel size of the fourth convolution layer may be 1×1, the number of convolution kernels may be 32, and stride may be 1. The pooling layer may be a maximum pooling of 2×2 (MaxPool), and stride may be 2.

As a mode, a plurality of face images and a plurality of expansion images in the current training process can be input into a shallow feature extraction network to obtain a plurality of shallow features corresponding to the current training process.

Alternatively, the shallow feature extraction network may be other network structures, such as inceptionV, EFFICIENTNET, etc.

S123: and inputting the shallow features into the image segmentation network to obtain a plurality of local features.

As one approach, the image segmentation network may employ Patch Partition techniques to segment each shallow feature into multiple local features. For example, the shallow features may be represented by a feature map of h×w, and inputting the shallow features into the image segmentation network may obtain feature maps (Patch) corresponding to H/4x W/4 local features, where the size of the feature map corresponding to each local feature may be 4×4.

S124: and inputting the local features into the hierarchical feature extraction network to obtain a plurality of hierarchical features.

Wherein the plurality of hierarchical features may include a first hierarchical feature and a second hierarchical feature. As shown in fig. 5, the hierarchical feature extraction network may include a first hierarchical feature extraction network and a second hierarchical feature extraction network.

As a way, a plurality of local features may be input into the first-level feature extraction network to obtain a first-level feature, and then the first-level feature may be input into the second-level feature extraction network to obtain a second-level feature.

Referring to fig. 5 again, the first hierarchical feature extraction network includes a first reference feature extraction network, a second reference feature extraction network, and a third reference feature extraction network.

Optionally, a plurality of local features may be input into a first reference feature extraction network to obtain a first level reference feature; inputting the first level reference features into a second reference feature extraction network to obtain second level reference features; and inputting the second-level reference features into a third reference feature extraction network to obtain the first-level features.

Referring to fig. 5 again, the first reference feature extraction network may include a first embedded conversion module and a second embedded conversion module, and the network structure of the first embedded conversion module may be the same as that of the second embedded conversion module.

Optionally, a plurality of local features may be input into the first embedded conversion module to obtain a first conversion feature; and inputting the first conversion characteristic into a second embedded conversion module to obtain a first-level reference characteristic.

Referring again to fig. 5, the first embedding transformation module may include a local position embedding module and a transformation module, the local embedding module may include Patch Embedding (patch embedding) and Position Embedding (position embedding), patch Embedding may be used to transform a plurality of local features into a plurality of sequence features (token), and Position Embedding may be used to learn a positional relationship of each of the plurality of local features.

In an embodiment of the application, as shown in fig. 6, the conversion module (Transformer block) may include a plurality of layer normalization (Layer Normalization, LN) modules, linear attention (Linear Attention based Conformer, LAC) modules, multi-layer perceptron (Multilayer Perceptron, MLP) modules, and connection modules. Wherein, the expression of LAC may be:

Wherein the input of the LAC may beT may represent the length of the output sequence, d_m being the dimension of embeding layers. From X, a query matrix/>Key matrix/>Value matrix/>D_k is a preset value.

Whereas the expression of the attention mechanism commonly used in the prior art Transformer block may be:

While the prior art attention mechanism has a good parallel capability, as can be seen from the two above-mentioned attention expressions, the time complexity of LAC is O (t×d_k×d_k), the time complexity of the prior art attention mechanism may be O (t×t×d_k), and d_k will generally be less than T, as in the VIT model T will generally be 197 and d_k will generally be 64. Therefore, in the embodiment of the application, the LAC is adopted to ensure that the face key point detection model to be trained has good parallel capability, and simultaneously reduce the time complexity of the face key point detection model to be trained, thereby accelerating the training speed of the face key point detection model to be trained. And the LAC can compress the occupation of the memory/video memory by the network through simple deep convolution, thereby reducing the calculation complexity of the key point detection model of the face to be trained. Meanwhile, the Transformer block introduces position coding to avoid the loss of spatial information.

Optionally, a plurality of local features may be input into the local position embedding module to obtain a first embedded feature, where the first embedded feature may be a sequence feature including position information and image semantic information; the first embedded feature is input into a conversion module, and a first conversion feature is obtained based on a linear attention mechanism in the conversion module.

Alternatively, as shown in fig. 7, the second hierarchical feature extraction network, the second reference feature extraction network, and the third reference feature extraction network may each include a plurality of local merge transformation modules, each of which may include a local merge transformation module and a transformation module. The local merge module (PATCH MERGING) may be used to reduce the resolution of the corresponding feature and increase the number of channels of the corresponding feature. The conversion module may be configured to output a corresponding feature based on a linear attention mechanism, and the conversion module is configured identically to the conversion module in the first reference feature extraction network.

S125: and inputting the multiple layers of features into the feature fusion network to obtain fusion features.

As shown in fig. 8, the feature fusion network may include a feature semantic enhancement module and a connection module, the feature semantic enhancement module may be a LAM (Linear Attention Module ), the LAM may include CNN Embedding (convolutional neural network embedded) and LAC, and the output of CNN Embedding may be: Where T may represent the sequence length of the second level feature and d_e may represent the dimension of the embeding layer.

As a way, the second level feature may be input to the feature semantic enhancement module to obtain an enhanced level feature; and inputting the first level feature and the enhancement level feature into the connection module to obtain a fusion feature.

In the embodiment of the application, the feature fusion network can well reserve the dependency relationship and detail information of the first-level features and the second-level features in long distance through a linear attention mechanism and a feature fusion technology, so that the accuracy and generalization capability of the target face key point detection model can be improved.

S126: and inputting the fusion characteristics into the fully-connected network to obtain the predicted key point coordinates.

The number of the predicted key point coordinates of each face image may be 251.

As a mode, fusion characteristics corresponding to the face images and the expansion images in the current training process can be input into a full-connection network, and prediction key point coordinates corresponding to the face images and the expansion images in the current training process can be obtained.

S127: and training a shallow feature extraction network, an image segmentation network, a hierarchical feature extraction network, a feature fusion network and a full-connection network in the current training process based on the predicted key point coordinates, the real key point coordinates and the loss function, and entering the next training process after the current training process is finished so as to obtain a target face key point detection model based on multiple training processes.

The loss function may include a first loss function and a second loss function, and the predicted key point coordinates may include predicted key point coordinates of a plurality of face images of the current training process and predicted key point coordinates of a plurality of extended images of the current training process.

As one way, as shown in fig. 9, training a shallow feature extraction network, an image segmentation network, a hierarchical feature extraction network, a feature fusion network and a fully connected network in a current training process based on the predicted key point coordinates, the real key point coordinates and the loss function, and entering a next training process after the current training process is finished to obtain a target face key point detection model based on a plurality of training processes, including:

S1271: and obtaining a first loss value corresponding to each of the face images in the current training process based on the predicted key point coordinates, the real key point coordinates and the first loss function corresponding to each of the face images in the current training process, wherein the first loss function is used for reducing the difference between the predicted key point coordinates and the real key point coordinates of the face images in the current training process.

As a way, the predicted key point coordinates and the real key point coordinates corresponding to the face images in the current training process can be input into the first loss function, so as to obtain the first loss values corresponding to the face images in the current training process. Wherein the first loss function may be a modified Rwing loss, and the modified Rwing loss calculation formula may be:

wherein x may represent a difference between the real key point coordinates and the corresponding predicted key point coordinates, and r, ω, ε, and C may be preset values.

In the embodiment of the application, the first loss value is obtained by adopting the improved Rwing loss, and the corresponding first loss value can be obtained by selecting different calculation formulas based on the difference between the predicted key point coordinate and the real key point coordinate, so that when the error between the predicted key point coordinate and the real key point coordinate is smaller (the difference is smaller than r), the error can be ignored, thereby reducing the influence of artificial annotation noise on the training process of the face key point detection model to be trained, and further improving the accuracy of the target face key point detection model.

S1272: obtaining a second loss value corresponding to each of the face images in the current training process based on the predicted key point coordinates of the first extended image, the predicted key point coordinates of the second extended image and the second loss function corresponding to each of the face images in the current training process, wherein the second loss function is used for reducing the difference between the predicted key point coordinates of the first extended image corresponding to each of the face images in the current training process and the predicted key point coordinates of the second extended image corresponding to each of the face images in the current training process.

As a way, the predicted key point coordinates of the first extended image and the predicted key point coordinates of the second extended image corresponding to the face images in the current training process may be input into the second loss function, so as to obtain the second loss value corresponding to the face images in the current training process. Wherein the second loss function may be a modified Rwing loss, and the modified Rwing loss calculation formula may be:

Wherein x may represent a difference between the predicted key point coordinates of the first extended image and the predicted key point coordinates of the second extended image, and r, ω, ε, and C may be preset values.

In the embodiment of the application, the second loss value is obtained by adopting the improved Rwing loss, and the corresponding second loss value is obtained by selecting different calculation formulas based on the difference value between the predicted key point coordinates of the first extended image and the predicted key point coordinates of the second extended image, so that when the error between the predicted key point coordinates of the first extended image and the predicted key point coordinates of the second extended image is smaller (the difference value is smaller than r), the error can be ignored, thereby reducing the influence of artificial labeling noise on the training process of the face key point detection model to be trained, and further improving the accuracy of the target face key point detection model.

In addition, in this embodiment, the consistency constraint can be performed on the face key point detection model to be trained through the second loss function, that is, the prediction results of the target face key point detection model on the first extended image and the second extended image obtained through different data enhancement factors can be kept as consistent as possible.

Alternatively, rwing loss may be considered for the first and second loss functions for the threshold r and the potential for gradient discontinuity near zero error.

S1273: and obtaining a target loss value based on the first loss value and the second loss value.

As one aspect, the first loss value and the second loss value may each correspond to a weight coefficient, and the target loss value may be obtained based on the first loss value and the second loss value and the weight coefficient corresponding to each of the first loss value and the second loss value. The calculation formula of the target loss value may be:

L_total＝w1*L1+w2*L2

Optionally, the weight coefficients corresponding to the first loss value and the second loss value may be preset based on experience, or may be automatically generated in the training process.

In the embodiment of the application, the target loss function is obtained through the first loss function representing the coordinate regression constraint and the second loss function representing the consistency constraint, and the target loss function is used for training the face key point detection model to be trained, so that the accuracy of the target face key point detection model can be improved.

S1274: and training a shallow feature extraction network, an image segmentation network, a hierarchical feature extraction network, a feature fusion network and a full-connection network in the current training process based on the target loss value, and entering the next training process after the current training process is finished so as to obtain a target face key point detection model based on multiple training processes.

As a way, the shallow feature extraction network, the image segmentation network, the hierarchical feature extraction network, the feature fusion network and the fully connected network in the current training process can be trained based on the target loss value, the next training process is performed after the current training process is finished, the target loss value is converged (reaches the local minimum) based on the multiple training processes, and the face key point detection model with the converged target loss value is used as the target face key point detection model.

Referring to fig. 10, the method for detecting key points of a face provided by the present application includes:

s210: and acquiring a face image to be detected.

As one way, the face image to be detected may be acquired by an image acquisition device (e.g., camera, webcam, cell phone, etc.).

S220: and inputting the face image to be detected into a target face key point detection model obtained based on the method to obtain a face key point detection result of the face image to be detected.

The face key point detection result may be coordinates of the face key point.

As a way, the face image to be detected may be input into the target face key point detection model, to obtain a face key point detection result of the face image to be detected.

According to the face key point detection method, the face image to be detected can be input into the target face key point detection model which is obtained through training based on the training data set of sample balance and is based on the linear attention mechanism and the feature fusion technology, and a more accurate face key point detection result can be obtained.

Referring to fig. 11, the present application provides a model generating apparatus 600, where the apparatus 600 includes:

the data set obtaining unit 610 is configured to obtain a training data set, where the training data set includes a plurality of large-pose face images and a plurality of small-pose face images, and a difference between a number of the plurality of large-pose face images and a number of the plurality of small-pose face images is within a preset range.

The model generating unit 620 is configured to train the face key point detection model to be trained based on the training data set, so as to take the converged face key point detection model to be trained as the target face key point detection model.

As a way, the large pose face images and the small pose face images respectively correspond to real key point coordinates, the to-be-trained face key point detection model includes a shallow feature extraction network, an image segmentation network, a hierarchical feature extraction network, a feature fusion network and a full connection network, the model generation unit 620 is specifically configured to obtain a plurality of face images in a current training process from the large pose face images and the small pose face images, the plurality of face images in the current training process include a plurality of large pose face images and a plurality of small pose face images, and perform data expansion on the plurality of face images in the current training process to obtain a plurality of expansion images in the current training process, where the plurality of face images in the current training process respectively correspond to a first expansion image and a second expansion image; inputting the face images and the expansion images in the current training process into the shallow feature extraction network to obtain a plurality of shallow features corresponding to the current training process, wherein each shallow feature represents low-level semantic information of the corresponding face image; inputting the shallow features into the image segmentation network to obtain a plurality of local features; inputting the local features into the hierarchical feature extraction network to obtain a plurality of hierarchical features; inputting the multiple layers of features into the feature fusion network to obtain fusion features; inputting the fusion characteristics into the fully-connected network to obtain predicted key point coordinates; and training a shallow feature extraction network, an image segmentation network, a hierarchical feature extraction network, a feature fusion network and a full-connection network in the current training process based on the predicted key point coordinates, the real key point coordinates and the loss function, and entering the next training process after the current training process is finished so as to obtain a target face key point detection model based on multiple training processes.

Optionally, the first hierarchical feature extraction network includes a first reference feature extraction network, a second reference feature extraction network, and a third reference feature extraction network, and the model generating unit 620 is specifically configured to input the plurality of local features into the first reference feature extraction network to obtain a first hierarchical reference feature; inputting the first level reference features into the second reference feature extraction network to obtain the second level reference features; and inputting the second-level reference features into the third reference feature extraction network to obtain the first-level features.

Optionally, the first reference feature extraction network includes a first embedded conversion module and a second embedded conversion module, where a network structure of the first embedded conversion module is the same as that of the second embedded conversion module, and the model generating unit 620 is specifically configured to input the plurality of local features into the first embedded conversion module to obtain a first conversion feature; and inputting the first conversion characteristic into the second embedded conversion module to obtain the first-level reference characteristic.

Optionally, the first embedding conversion module includes a local position embedding module and a conversion module, and the model generating unit 620 is specifically configured to input the plurality of local features into the local position embedding module to obtain a first embedded feature, where the first embedded feature is a sequence feature including position information and image semantic information; inputting the first embedded feature into the conversion module, and obtaining the first conversion feature based on a linear attention mechanism in the conversion module.

Optionally, the feature fusion network includes a feature semantic enhancement module and a connection module, and the model generating unit 620 is specifically configured to input the second level feature into the feature semantic enhancement module to obtain an enhanced level feature; and inputting the first level feature and the enhancement level feature into the connection module to obtain the fusion feature.

Optionally, the second hierarchical feature extraction network, the second reference feature extraction network and the third reference feature extraction network each include a plurality of local merge conversion modules, each local merge conversion module includes a local merge module and a conversion module, the local merge module is used for reducing resolution of a corresponding feature and increasing a channel number of the corresponding feature, and the conversion module is used for outputting the corresponding feature based on a linear attention mechanism.

Referring to fig. 12, in the face key point detection apparatus 800 provided by the present application, the apparatus 800 includes:

the face image to be detected acquiring unit 810 is configured to acquire a face image to be detected.

A detection result obtaining unit 820, configured to input the face image to be detected into a target face key point detection model obtained based on the method of any one of claims 1 to 9, to obtain a face key point detection result of the face image to be detected.

An electronic device according to the present application will be described with reference to fig. 13.

Referring to fig. 13, another electronic device 100 capable of executing the foregoing model generation method and the face key point detection method is provided in the embodiment of the present application based on the foregoing model generation method and face key point detection method and device. The electronic device 100 includes one or more (only one shown) processors 102, memory 104 coupled to each other. The memory 104 stores therein a program capable of executing the contents of the foregoing embodiments, and the processor 102 can execute the program stored in the memory 104.

Wherein the processor 102 may include one or more processing cores. The processor 102 utilizes various interfaces and lines to connect various portions of the overall electronic device 100, perform various functions of the electronic device 100, and process data by executing or executing instructions, programs, code sets, or instruction sets stored in the memory 104, and invoking data stored in the memory 104. Alternatively, the processor 102 may be implemented in hardware in at least one of digital signal Processing (DIGITAL SIGNAL Processing, DSP), field-Programmable gate array (Field-Programmable GATE ARRAY, FPGA), programmable logic array (Programmable Logic Array, PLA). The processor 102 may integrate one or a combination of several of a central processing unit (Central Processing Unit, CPU), a graphics processing unit (Graphics Processing Unit, GPU), and a modem, etc. The CPU mainly processes an operating system, a user interface, an application program and the like; the GPU is used for being responsible for rendering and drawing of display content; the modem is used to handle wireless communications. It will be appreciated that the modem may not be integrated into the processor 102 and may be implemented solely by a single communication chip.

The Memory 104 may include random access Memory (Random Access Memory, RAM) or Read-Only Memory (ROM). Memory 104 may be used to store instructions, programs, code sets, or instruction sets. The memory 104 may include a stored program area and a stored data area, wherein the stored program area may store instructions for implementing an operating system, instructions for implementing at least one function (e.g., a touch function, a sound playing function, a video playing function, etc.), instructions for implementing the various method embodiments described below, etc. The storage data area may also store data created by the terminal 100 in use (such as phonebook, audio-video data, chat-record data), etc.

Referring to fig. 14, a structural block of a computer readable storage medium according to an embodiment of the present application is shown. The computer readable storage medium 1000 has stored therein program code that can be invoked by a processor to perform the methods described in the method embodiments described above.

The computer readable storage medium 1000 may be an electronic memory such as a flash memory, an EEPROM (electrically erasable programmable read only memory), an EPROM, a hard disk, or a ROM. Optionally, computer readable storage medium 1000 includes non-volatile computer readable storage medium (non-transitory computer-readable storage medium). The computer readable storage medium 1000 has storage space for program code 1010 that performs any of the method steps described above. The program code can be read from or written to one or more computer program products. Program code 1010 may be compressed, for example, in a suitable form.

Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present application, and are not limiting; although the application has been described in detail with reference to the foregoing embodiments, it will be appreciated by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not drive the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present application.

Claims

1. A method of generating a model, the method comprising:

Acquiring a training data set, wherein the training data set comprises a plurality of large-pose face images and a plurality of small-pose face images, and the difference value between the number of the large-pose face images and the number of the small-pose face images is within a preset range;

Training the face key point detection model to be trained based on the training data set, so that the converged face key point detection model to be trained is used as the target face key point detection model.

2. The method according to claim 1, wherein the plurality of large-pose face images and the plurality of small-pose face images each correspond to real keypoint coordinates, the face keypoint detection model to be trained includes a shallow feature extraction network, an image segmentation network, a hierarchical feature extraction network, a feature fusion network, and a full connection network, the face keypoint detection model to be trained is trained based on the training data set to take the converged face keypoint detection model to be trained as a target face keypoint detection model, and the method comprises:

Acquiring a plurality of face images in the current training process from the plurality of large-pose face images and the plurality of small-pose face images, wherein the plurality of face images in the current training process comprise a plurality of large-pose face images and a plurality of small-pose face images, and performing data expansion on the plurality of face images in the current training process to obtain a plurality of expansion images in the current training process, and the plurality of face images in the current training process are respectively corresponding to a first expansion image and a second expansion image;

Inputting the face images and the expansion images in the current training process into the shallow feature extraction network to obtain a plurality of shallow features corresponding to the current training process, wherein each shallow feature represents low-level semantic information of the corresponding face image;

inputting the shallow features into the image segmentation network to obtain a plurality of local features;

inputting the local features into the hierarchical feature extraction network to obtain a plurality of hierarchical features;

inputting the multiple layers of features into the feature fusion network to obtain fusion features;

inputting the fusion characteristics into the fully-connected network to obtain predicted key point coordinates;

and training a shallow feature extraction network, an image segmentation network, a hierarchical feature extraction network, a feature fusion network and a full-connection network in the current training process based on the predicted key point coordinates, the real key point coordinates and the loss function, and entering the next training process after the current training process is finished so as to obtain a target face key point detection model based on multiple training processes.

3. The method of claim 2, wherein the plurality of hierarchical features includes a first hierarchical feature and a second hierarchical feature, the hierarchical feature extraction network includes a first hierarchical feature extraction network and a second hierarchical feature extraction network, the inputting the plurality of local features into the hierarchical feature extraction network results in a plurality of hierarchical features, comprising:

inputting the local features into the first-level feature extraction network to obtain the first-level features;

and inputting the first level features into the second level feature extraction network to obtain the second level features.

4. A method according to claim 3, wherein the first hierarchical feature extraction network comprises a first reference feature extraction network, a second reference feature extraction network, and a third reference feature extraction network, and wherein the inputting the plurality of local features into the first hierarchical feature extraction network results in the first hierarchical feature comprises:

Inputting the local features into the first reference feature extraction network to obtain first-level reference features;

Inputting the first level reference features into the second reference feature extraction network to obtain the second level reference features;

and inputting the second-level reference features into the third reference feature extraction network to obtain the first-level features.

5. The method of claim 4, wherein the first reference feature extraction network comprises a first embedded conversion module and a second embedded conversion module, the network structure of the first embedded conversion module is the same as the network structure of the second embedded conversion module, the inputting the plurality of local features into the first reference feature extraction network to obtain a first level of reference features comprises:

Inputting the local features into the first embedded conversion module to obtain first conversion features;

And inputting the first conversion characteristic into the second embedded conversion module to obtain the first-level reference characteristic.

6. The method of claim 5, wherein the first embedded transformation module comprises a local position embedding module and a transformation module, wherein the inputting the plurality of local features into the first embedded transformation module results in a first transformation feature, comprising:

inputting the local features into the local position embedding module to obtain a first embedded feature, wherein the first embedded feature is a sequence feature containing position information and image semantic information;

inputting the first embedded feature into the conversion module, and obtaining the first conversion feature based on a linear attention mechanism in the conversion module.

7. The method of claim 4, wherein the second hierarchical feature extraction network, the second reference feature extraction network, and the third reference feature extraction network each comprise a plurality of local merge transformation modules, each comprising a local merge module for reducing resolution of a corresponding feature and increasing a number of channels of the corresponding feature, and a transformation module for outputting the corresponding feature based on a linear attention mechanism.

8. The method of claim 3, wherein the feature fusion network includes a feature semantic enhancement module and a connection module, the inputting the plurality of hierarchical features into the feature fusion network to obtain a fused feature includes:

inputting the second-level features into the feature semantic enhancement module to obtain enhanced-level features;

And inputting the first level feature and the enhancement level feature into the connection module to obtain the fusion feature.

9. The method of claim 2, wherein the loss function comprises a first loss function and the second loss function, wherein the predicted keypoint coordinates comprise predicted keypoint coordinates of the plurality of face images of the current training process and predicted keypoint coordinates of the plurality of augmented images of the current training process, wherein the training of the shallow feature extraction network, the image segmentation network, the hierarchical feature extraction network, the feature fusion network, and the fully connected network of the current training process based on the predicted keypoint coordinates, the true keypoint coordinates, and the loss function proceeds to the next training process after the end of the current training process to obtain the target face keypoint detection model based on the plurality of training processes, comprising:

Obtaining a first loss value corresponding to each of the face images in the current training process based on the predicted key point coordinates, the real key point coordinates and the first loss function corresponding to each of the face images in the current training process, wherein the first loss function is used for reducing the difference between the predicted key point coordinates and the real key point coordinates of the face images in the current training process;

Obtaining a second loss value corresponding to each of the face images in the current training process based on the predicted key point coordinates of the first extended image, the predicted key point coordinates of the second extended image and the second loss function corresponding to each of the face images in the current training process, wherein the second loss function is used for reducing the difference between the predicted key point coordinates of the first extended image corresponding to each of the face images in the current training process and the predicted key point coordinates of the second extended image corresponding to each of the face images in the current training process;

Obtaining a target loss value based on the first loss value and the second loss value;

And training a shallow feature extraction network, an image segmentation network, a hierarchical feature extraction network, a feature fusion network and a full-connection network in the current training process based on the target loss value, and entering the next training process after the current training process is finished so as to obtain a target face key point detection model based on multiple training processes.

10. The method for detecting the key points of the human face is characterized by comprising the following steps:

acquiring a face image to be detected;

Inputting the face image to be detected into a target face key point detection model obtained based on the method of any one of claims 1-9 to obtain a face key point detection result of the face image to be detected.

11. A model generation apparatus, characterized in that the apparatus comprises:

The system comprises a data set acquisition unit, a data set acquisition unit and a data processing unit, wherein the data set acquisition unit is used for acquiring a training data set, the training data set comprises a plurality of large-posture face images and a plurality of small-posture face images, and the difference value between the number of the large-posture face images and the number of the small-posture face images is within a preset range;

The model generating unit is used for training the face key point detection model to be trained based on the training data set, and taking the converged face key point detection model to be trained as a target face key point detection model.

12. A face key point detection apparatus, the apparatus comprising:

the face image acquisition unit is used for acquiring a face image to be detected;

The detection result obtaining unit is configured to input the face image to be detected into a target face key point detection model obtained based on the method according to any one of claims 1 to 9, so as to obtain a face key point detection result of the face image to be detected.

13. An electronic device comprising one or more processors and memory;

One or more programs stored in the memory and configured to be executed by the one or more processors, the one or more programs configured to perform the method of any of claims 1-10.

14. A computer readable storage medium, characterized in that the computer readable storage medium has stored therein a program code, wherein the method of any of claims 1-10 is performed when the program code is run.