Detailed Description
Hereinafter, embodiments of the present invention will be described in detail with reference to the accompanying drawings.
It should be noted that the terms "first," "second," and the like in the description and claims of the present invention and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order.
The method embodiments provided in the embodiments of the present application may be executed in a mobile terminal, a computer terminal, or a similar computing device. Taking the method running on a mobile terminal as an example, FIG. 1 is a block diagram of a hardware structure of a mobile terminal of the method for determining an image according to an embodiment of the present invention. As shown in FIG. 1, the mobile terminal may include one or more processors 102 (only one is shown in FIG. 1; the processor 102 may include, but is not limited to, a processing device such as a microprocessor (MCU) or a programmable logic device (FPGA)) and a memory 104 for storing data, and the mobile terminal may further include a transmission device 106 for communication functions and an input-output device 108. It will be understood by those skilled in the art that the structure shown in FIG. 1 is only an illustration and does not limit the structure of the mobile terminal. For example, the mobile terminal may also include more or fewer components than shown in FIG. 1, or have a different configuration than shown in FIG. 1.
The memory 104 may be used to store a computer program, for example, a software program and a module of application software, such as a computer program corresponding to the image determination method in the embodiment of the present invention. The processor 102 executes various functional applications and data processing by running the computer program stored in the memory 104, so as to implement the method described above. The memory 104 may include high-speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, the memory 104 may further include memory located remotely from the processor 102, which may be connected to the mobile terminal over a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The transmission device 106 is used to receive or transmit data via a network. Specific examples of the network described above may include a wireless network provided by a communication provider of the mobile terminal. In one example, the transmission device 106 includes a network interface card (NIC) that can be connected to other network devices through a base station to communicate with the internet. In another example, the transmission device 106 may be a radio frequency (RF) module, which is used to communicate with the internet in a wireless manner.
In the present embodiment, an image determining method is provided, and FIG. 2 is a flowchart of the method according to an embodiment of the present invention. As shown in FIG. 2, the flow includes the following steps:
step S202, performing feature extraction on an image containing a target object by using N sub-feature extraction networks in a trained target network model to obtain N sub-features of the target object; wherein different sub-feature extraction networks in the N sub-feature extraction networks are used for extracting features for different postures of the object, and N is an integer greater than 1;
step S204, performing fusion processing on the N sub-features to obtain a target feature vector;
step S206, determining, based on the target feature vector, a candidate image whose image similarity with the target object is greater than an image similarity threshold from among candidate images.
In the above embodiments, the target object may be a non-motor vehicle, for example, a motorcycle, an electric vehicle, a bicycle, a balance scooter, or the like. An image containing the target object is input into the target network model, and feature extraction is performed on the image by the target network model to obtain N sub-features of the target object under different postures. N may be 3, or another integer greater than 1, which is not limited in the present invention; when N is 3, the different postures may include a front, a side, and a back. That is, after an image including a target object is input, a front sub-feature, a side sub-feature, and a back sub-feature of the target object can be extracted from the image (the extraction of a sub-feature is a process of converting a picture into an n-dimensional vector by using a deep learning algorithm, and the n-dimensional vector contains semantic information of the picture). After the sub-features of the target object in different postures are extracted, the sub-features in different postures can be fused, so that the target feature vector is determined. A candidate image whose image similarity with the target object is greater than the image similarity threshold is then determined from the candidate images according to the target feature vector, so as to achieve the purpose of searching an image by an image. The image similarity threshold may be a preset threshold, and different thresholds may be set according to the application scene, which is not limited in the present invention.
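As a minimal illustration of steps S202 to S206, the following Python sketch (a non-authoritative example; it assumes a hypothetical PyTorch model whose forward pass returns the N pose sub-feature vectors) extracts the sub-features, concatenates them, and thresholds cosine similarity:

```python
import torch
import torch.nn.functional as F

def determine_image(model, query_image, candidate_images, threshold=0.8):
    # Step S202: extract the N sub-features of the target object,
    # one from each pose-specific sub-feature extraction network.
    sub_features = model(query_image.unsqueeze(0))      # list of N tensors, each (1, d)
    # Step S204: fuse the N sub-features into one target feature vector.
    target_vector = torch.cat(sub_features, dim=1)      # (1, N*d)
    results = []
    for image in candidate_images:
        candidate_vector = torch.cat(model(image.unsqueeze(0)), dim=1)
        similarity = F.cosine_similarity(target_vector, candidate_vector).item()
        # Step S206: keep candidates whose similarity exceeds the threshold.
        if similarity > threshold:
            results.append((image, similarity))
    return sorted(results, key=lambda r: r[1], reverse=True)
```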
Optionally, the execution subject of the above steps may be a background processor, another device with similar processing capabilities, or a machine integrating at least an image acquisition device and a data processing device, where the image acquisition device may include an image acquisition module such as a camera, and the data processing device may include a terminal such as a computer or a mobile phone, but is not limited thereto.
Through the above steps, N sub-feature extraction networks in a trained target network model perform feature extraction on the image containing the target object to obtain N sub-features of the target object, the N sub-features are subjected to fusion processing to obtain the target feature vector, and the candidate image whose image similarity with the target object is greater than the image similarity threshold is determined from the candidate images according to the target feature vector. Because different sub-feature extraction networks in the N sub-feature extraction networks perform feature extraction for different postures of the object, the trained target network model can accurately determine the sub-features of the target object in the image under different postures, and then the sub-features under different postures are fused to obtain an accurate target feature vector, so that the target image can be accurately determined according to the target feature vector. Therefore, the problem in the related art that determining an image from an image is inaccurate can be solved, and the accuracy of determining an image from an image is improved.
In an exemplary embodiment, the trained target network model is obtained by: acquiring a plurality of groups of image samples containing a historical object, wherein different groups of image samples in the plurality of groups comprise image samples acquired for the historical object from different angles, the different angles corresponding to the different postures; respectively performing a loss determination operation on the N sub-feature extraction networks to obtain a prediction loss value corresponding to each sub-feature extraction network in the N sub-feature extraction networks, wherein the loss determination operation includes: inputting one group of image samples in the multiple groups of image samples into a target sub-feature extraction network, and determining a prediction loss value corresponding to the target sub-feature extraction network based on a feature extraction result of the target sub-feature extraction network for the group of image samples, the target sub-feature extraction network being the sub-feature extraction network corresponding to the group of image samples; and adjusting parameters of each sub-feature extraction network based on the corresponding prediction loss value of each sub-feature extraction network to obtain the trained target network model. In this embodiment, the target network model may be obtained by model training. When performing the model training, a plurality of groups of image samples containing the historical object may be obtained, where each group of image samples includes image samples obtained by shooting the historical object from different angles. The initial network model of the target network model is trained through each group of image samples to obtain the target network model.
In the above embodiment, the target network model includes N sub-feature extraction networks, each sub-feature extraction network may be trained by using the multiple groups of image samples, and a corresponding prediction loss value of each sub-feature extraction network is determined. The prediction loss value may be determined by the loss determination operation: one group of image samples in the multiple groups of image samples is input into a target sub-feature extraction network, the target sub-feature extraction network determines a feature extraction result for the group of image samples, and the prediction loss value corresponding to the target sub-feature extraction network can be determined according to the extraction result. The target sub-feature extraction network is one of the N sub-feature extraction networks.
In the above embodiment, the posture feature to be extracted by each sub-feature extraction network may be preset. For example, when N is 3, the 3 sub-feature extraction networks may be predetermined to extract the front sub-feature, the side sub-feature, and the back sub-feature of the target object, respectively. The posture of each image sample may be marked in each group of input image samples; for example, each group of image samples includes 3 images: a front image, a side image, and a back image of the object. Therefore, during training, the front image is input into the sub-feature extraction network for extracting the front sub-feature, the side image is input into the sub-feature extraction network for extracting the side sub-feature, and the back image is input into the sub-feature extraction network for extracting the back sub-feature.
In the above embodiment, after the image sample is input to the target sub-feature extraction network, a prediction loss value may be determined, and parameter adjustment may be performed on each sub-feature extraction network according to the prediction loss value, so as to obtain a trained target network model.
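The loss determination operation can be sketched as follows (a hedged illustration only: the branch networks, the loss criterion, and the grouping of posture-labeled samples are hypothetical stand-ins, not the embodiment's exact implementation):

```python
def loss_determination_step(branches, criterion, grouped_samples, optimizer):
    """branches: dict mapping posture ('front'/'side'/'back') to its
    sub-feature extraction network; grouped_samples: dict mapping posture
    to an (images, id_labels) pair, i.e. one group of image samples."""
    optimizer.zero_grad()
    loss_values = {}
    for posture, (images, id_labels) in grouped_samples.items():
        # Each group is routed to the sub-network matching its posture label.
        features = branches[posture](images)
        loss_values[posture] = criterion(features, id_labels)
    # Adjust parameters based on each branch's prediction loss value.
    total_loss = sum(loss_values.values())
    total_loss.backward()
    optimizer.step()
    return {posture: loss.item() for posture, loss in loss_values.items()}
```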
In an exemplary embodiment, the parameter adjustment of each sub-feature extraction network based on the corresponding prediction loss value of the sub-feature extraction network includes performing the following operations for each sub-feature extraction network: performing parameter adjustment on one sub-feature extraction network based on the prediction loss value corresponding to that sub-feature extraction network; or performing weighted fusion on the prediction loss values corresponding to the sub-feature extraction networks to obtain a comprehensive prediction loss value, and performing parameter adjustment on the sub-feature extraction networks based on the comprehensive prediction loss value. In this embodiment, when performing parameter adjustment on each sub-feature extraction network, the parameters of a sub-feature extraction network may be directly adjusted by using its own prediction loss value, or the prediction loss values of the sub-feature extraction networks may be weighted and fused to obtain a comprehensive prediction loss value, and the parameters of each sub-feature extraction network are adjusted according to the comprehensive prediction loss value.
In the above embodiment, the weight corresponding to each sub-feature extraction network may be set by a user according to an application scenario, and the weight corresponding to each sub-feature extraction network may be a probability determined by a classification layer included in the target network model, or may be another value, which is not limited in this invention.
In an exemplary embodiment, the performing weighted fusion on the prediction loss values corresponding to the sub-feature extraction networks includes: respectively determining, by using a target classification layer included in the target network model, the probability that the image samples input into each sub-feature extraction network belong to that sub-feature extraction network; determining the probability as a target weight of the sub-feature extraction network; and performing weighted fusion on the target weights and the prediction loss values corresponding to the sub-feature extraction networks. In this embodiment, the target network model further includes a target classification layer, which may be a processing layer parallel to the N sub-feature extraction networks. In the training process, each group of image samples included in the multiple groups of image samples is input into the initial model of the target network model, and the target classification layer classifies each image in each group of image samples, so as to determine the probability that the image samples input into each sub-feature extraction network belong to that sub-feature extraction network; this probability is used to determine the target weight of the sub-feature extraction network. The product of each target weight and the prediction loss value determined by the corresponding sub-feature extraction network is determined, and the comprehensive prediction loss value is determined as the sum of these products.
In the above embodiment, when the N sub-feature extraction networks are 3 sub-feature extraction networks, the losses generated by the front, side, and back branch networks may each be multiplied by the probability of the corresponding posture output by the softmax classification layer S1, and finally the losses of the three branches are added to form the final loss of the network, which is used for back-propagating gradients to update the network parameters. In the training process, mAP (mean average precision) may be used as a test metric, and the model with the highest mAP on the test set is saved as the target network model. mAP is a metric for measuring the effect of searching an image by an image: for a retrieval target, the higher all true matches rank in the retrieval results, the higher the mAP, and vice versa.
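The weighted fusion described above can be illustrated with a brief sketch (an assumption-laden example: `pose_probs` stands in for the probabilities output by the softmax classification layer S1 for each group of samples, and the dictionary keys are hypothetical):

```python
def fuse_branch_losses(branch_losses, pose_probs):
    """branch_losses: dict posture -> prediction loss value (tensor);
    pose_probs: dict posture -> softmax probability of that posture."""
    # Comprehensive prediction loss value: sum of probability-weighted losses.
    return sum(pose_probs[p] * branch_losses[p] for p in branch_losses)

# Illustrative usage during a training step:
# total = fuse_branch_losses(
#     {"front": loss_f, "side": loss_s, "back": loss_b},
#     {"front": prob_f, "side": prob_s, "back": prob_b},
# )
# total.backward()  # back-propagate gradients to update network parameters
```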
In an exemplary embodiment, each of the groups of image samples comprises: a plurality of image samples obtained by shooting the historical object from different angles, and label information of each image sample, wherein the label information comprises identification information of the image sample and angle information of shooting the image sample. In this embodiment, when the target object is a non-motor vehicle, images of the same non-motor vehicle in different postures in different monitoring scenes may be collected as training images, and each image of the same non-motor vehicle is marked with its ID (i.e., identification information) and posture (i.e., angle information, such as front, side, and back).
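As a purely illustrative example of such label information (the file names and field names are hypothetical), one group of image samples for a single non-motor vehicle might be recorded as:

```python
# One group of image samples: the same vehicle ID shot from three angles.
sample_group = [
    {"path": "vehicle_0001_front.jpg", "id": 1, "pose": "front"},
    {"path": "vehicle_0001_side.jpg",  "id": 1, "pose": "side"},
    {"path": "vehicle_0001_back.jpg",  "id": 1, "pose": "back"},
]
```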
In one exemplary embodiment, inputting one of the groups of image samples into the target sub-feature extraction network comprises: performing weakening processing on a target area of the group of image samples to obtain a first image sample; performing an enhancement operation on the first image sample to obtain a second image sample; and inputting the second image sample into the target sub-feature extraction network. In this embodiment, when the target object is a non-motor vehicle ridden by a person, the person's legs usually block the body of the non-motor vehicle; therefore, the target area of the image sample is subjected to weakening processing, so that the influence of the human body blocking the vehicle body can be reduced. With the target area weakened, the model implicitly treats the target area as an unimportant area during training and suppresses the human body information in the target area to a certain extent.
In the above embodiment, after the weakening processing is performed on the training image to obtain the first image sample, the enhancement operation may be performed on the first image sample to obtain the second image sample. The enhancement operation may include image enhancement operations such as random cropping, random erasing, and color jittering. The target sub-feature extraction network is trained by using the second image sample to obtain the target network model.
In an exemplary embodiment, the step of performing weakening processing on the target area of the group of image samples to obtain the first image sample comprises: determining a target probability of weakening the target area; and performing zero-setting processing on the target area according to the target probability to obtain the first image sample. In this embodiment, when the target object is a non-motor vehicle, the acquired non-motor vehicle image includes a human body; in order to eliminate the interference of the human body with the non-motor vehicle, the upper half area of each image included in each group of image samples may be set to 0 with a target probability (for example, a probability of 0.5; this is only an exemplary description, and the probability may also be 0.4, 0.6, or the like, which is not limited by the present invention), that is, with a probability of 0.5 the upper half area of the image carries no information. The target area may be determined according to the type of the target object: when interference exists in the lower half of the image sample, the lower half of the image may be determined as the target area; when interference exists in the upper half, the upper half may be determined as the target area; when interference exists in the left half, the left half may be determined as the target area; and when interference exists in the right half, the right half may be determined as the target area. The target area is not limited by the present invention.
In the above embodiment, the target area of the image is set to 0 with a probability of 0.5. This operation is mainly intended to reduce the influence of the human body on the non-motor vehicle. With a probability of 0.5, the model does not receive the information of the upper half area of the image during training, so that the model implicitly treats the upper half area as an unimportant area and suppresses the human body information of the upper half area to a certain degree. The reason the upper half area is not erased outright is that, in a real monitoring scene, the postures of non-motor vehicles and the viewing angles of cameras vary widely, and directly erasing the upper half area could erase important information of the non-motor vehicle. Therefore, setting the upper half area to 0 with a probability of 0.5 helps weaken the interference of the human body with the non-motor vehicle retrieval task and improves the non-motor vehicle retrieval precision.
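A minimal sketch of this weakening step followed by the enhancement operations, assuming tensor images and torchvision transforms (the resize dimensions and augmentation parameters are assumptions for illustration, not values from the embodiment):

```python
import random
from torchvision import transforms

class ZeroUpperHalf:
    """Weakening operation: zero the upper half of a (C, H, W) image tensor
    with a target probability (0.5 in the embodiment above)."""
    def __init__(self, probability=0.5):
        self.probability = probability

    def __call__(self, img):
        if random.random() < self.probability:
            img = img.clone()
            img[:, : img.shape[1] // 2, :] = 0.0  # upper half carries no information
        return img

# Weaken first (first image sample), then enhance (second image sample).
train_transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Resize((256, 128)),
    ZeroUpperHalf(probability=0.5),
    transforms.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2),
    transforms.RandomCrop((256, 128), padding=8),
    transforms.RandomErasing(p=0.5),
])
```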
In one exemplary embodiment, determining the prediction loss value corresponding to the target sub-feature extraction network based on the feature extraction result of the target sub-feature extraction network for the group of image samples comprises: determining a first feature extraction image of the target sub-feature extraction network based on the feature extraction result; performing segmentation processing on the first feature extraction image according to a predetermined direction to obtain a plurality of second feature extraction images of different angle types; and determining the prediction loss value based on the second feature extraction images. In this embodiment, a first feature extraction image may be determined according to the feature extraction result of the target sub-feature extraction network, and then the first feature extraction image is segmented according to a predetermined direction to obtain second feature extraction images of different angle categories. The predetermined direction may be determined by the target area: it may be the left-right direction when the target area is the upper or lower half, and the up-down direction when the target area is the left or right half. For example, when the target object is a non-motor vehicle, the first feature extraction image may be cut in half along the left-right direction into an upper block L1 and a lower block L2, and then only the feature of the lower block L2 is taken for the subsequent softmax and TripletLoss calculations.
In the above embodiment, the feature map is split into upper and lower blocks, and only the lower block of the feature map is propagated forward to participate in the subsequent loss calculation, so that the feature information of the upper half does not contribute to the network update. Because the image of a non-motor vehicle includes a human body, and the human body and the non-motor vehicle have a relative up-down positional relationship, forbidding the upper half features from generating loss can weaken, to a certain extent, the influence of the human body on the non-motor vehicle retrieval precision, weaken the interference of the human body with the semantic information of the non-motor vehicle, extract more feature information of the non-motor vehicle, and increase the running speed of the model.
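A short sketch of this segmentation step (illustrative only; it assumes the branch outputs a feature map tensor of shape (B, C, H, W) and that global average pooling produces the embedding fed to the losses):

```python
def lower_half_embedding(feat_map):
    # Split the first feature extraction image into upper (L1) and lower (L2) blocks.
    h = feat_map.shape[2]
    lower_block = feat_map[:, :, h // 2:, :]   # keep only block L2
    # Pool the lower block into an embedding; the upper block never reaches
    # the softmax/TripletLoss, so it contributes no gradient to the update.
    return lower_block.mean(dim=(2, 3))        # shape (B, C)
```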
In an exemplary embodiment, before performing feature extraction on an image including a target object by using the N sub-feature extraction networks in the trained target network model, the method further includes: adding a target classification layer and the N sub-feature extraction networks in parallel after the last convolutional layer of a residual network model to obtain an initial network model, wherein the target classification layer is used for classifying the feature images output by the last convolutional layer into the N posture categories, and each sub-feature extraction network is used for outputting the feature image corresponding to that sub-feature extraction network; and training the initial network model by using the plurality of groups of image samples to obtain the target network model. In this embodiment, before using the target network model, an initial network model may first be constructed and trained to obtain the target network model. As shown in FIG. 3, a target classification layer and N sub-feature extraction networks may be added in parallel after the last convolutional layer of the residual network model to obtain an initial network model. The residual network model may be ResNet50, and the target classification layer may be a softmax classification layer. That is, ResNet50 can be used as a backbone network, and a softmax classification layer S1 is connected after the output of the last convolutional layer of ResNet50 to classify an image into three categories: front, side, and back. At the same time, three branch networks are connected to the last convolutional layer of ResNet50 and are respectively used for extracting the apparent features of the front, back, and side of the non-motor vehicle. The branch networks can extract the apparent features of the target object according to the angle information included in the label information of the training image.
In the above embodiment, three branch networks for extracting the front, side, and back, respectively, are introduced into the model. The introduction of the branch structure gives the target network model the ability to simultaneously extract posture information of the front, back, and side of the target object; it also lets each branch concentrate on extracting the apparent information of the target object under one particular posture, so that the model can fit better. Finally, the features of the three branches are fused to calculate the similarity; fusing the features of the three branches greatly reduces the semantic difference between different postures of the non-motor vehicle, the fused features contain richer information, the cross-posture mutual retrieval capability of the target object can be greatly improved, and the accuracy of searching an image by an image is improved.
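The structure in FIG. 3 might be sketched as follows in PyTorch (a hedged approximation: the embodiment specifies only a ResNet50 backbone with a softmax classification layer and three branches in parallel after the last convolutional layer; the branch design and feature dimension here are assumptions):

```python
import torch.nn as nn
from torchvision.models import resnet50

class PoseBranchModel(nn.Module):
    def __init__(self, feat_dim=256):
        super().__init__()
        backbone = resnet50(weights=None)
        # Keep everything up to and including the last convolutional stage.
        self.backbone = nn.Sequential(*list(backbone.children())[:-2])
        # Target classification layer S1: classify into 3 posture categories.
        self.pose_classifier = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(2048, 3)
        )
        # Three parallel sub-feature extraction branches (front, side, back).
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(2048, feat_dim, kernel_size=1),
                nn.BatchNorm2d(feat_dim),
                nn.ReLU(inplace=True),
            )
            for _ in range(3)
        ])

    def forward(self, x):
        fmap = self.backbone(x)                   # (B, 2048, H, W)
        pose_logits = self.pose_classifier(fmap)  # (B, 3), fed to softmax
        branch_maps = [b(fmap) for b in self.branches]
        return pose_logits, branch_maps
```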
In an exemplary embodiment, the fusing of the N sub-features to obtain the target feature vector includes: splicing (concatenating) the N sub-features to obtain a spliced feature vector; and determining the spliced feature vector as the target feature vector. In this embodiment, a non-motor vehicle image to be retrieved may be input into the target network model, and the feature vectors of the three branches (front, side, and back) are extracted and then spliced together to serve as the final target feature vector of the image. The same feature extraction operation is then performed on the non-motor vehicle base library images, the cosine similarity between the feature vector of the image to be retrieved and those of the base library images is calculated, and finally the base library images are sorted by similarity to the image to be retrieved from large to small, thereby obtaining the image retrieval result for the non-motor vehicle; a higher similarity indicates that the base library vehicle is more likely to be the same vehicle as the non-motor vehicle to be retrieved.
In one exemplary embodiment, determining, from the candidate images, a candidate image whose image similarity with the target object is greater than an image similarity threshold based on the target feature vector comprises: inputting the candidate images into the trained target network model, and extracting candidate features of candidate objects in the candidate images under different postures; performing fusion processing on the candidate features to obtain target candidate feature vectors; and determining, from the candidate images, a candidate image whose image similarity with the target object is greater than the image similarity threshold based on the target feature vector and the target candidate feature vectors. In this embodiment, when determining the target image, the candidate images may be input into the target network model to determine the candidate features of each image, and the candidate features are subjected to fusion processing to obtain the target candidate feature vectors. Each target candidate feature vector is compared with the target feature vector, the similarity between the two is determined, and the similarities are sorted. The top-ranked image may be determined as the result, or the images whose similarity is greater than the image similarity threshold may be determined as the result.
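A compact sketch of this retrieval step (hedged: `extract_fused` is a hypothetical helper that runs the model and concatenates its pooled branch outputs into one vector):

```python
import torch
import torch.nn.functional as F

def rank_candidates(extract_fused, query_image, candidate_images, threshold=0.5):
    # Target feature vector of the image to be retrieved.
    query_vec = F.normalize(extract_fused(query_image), dim=0)
    # Target candidate feature vectors of the base library images.
    cand_vecs = torch.stack(
        [F.normalize(extract_fused(img), dim=0) for img in candidate_images]
    )
    similarities = cand_vecs @ query_vec       # cosine similarity per candidate
    order = torch.argsort(similarities, descending=True)
    # Keep candidates above the image similarity threshold, best first.
    return [(i.item(), similarities[i].item())
            for i in order if similarities[i] > threshold]
```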
The following describes a method for determining an image in accordance with an embodiment:
FIG. 4 is a flowchart of a method for determining an image according to an embodiment of the present invention. As shown in FIG. 4, the flow includes:
step S402, inputting the non-motor vehicle image to be retrieved and the non-motor vehicle base library images into the non-motor vehicle image retrieval model (corresponding to the target network model);
step S404, obtaining the front, side, and back branch features of the non-motor vehicle;
step S406, fusing the front, side, and back features of the non-motor vehicle;
step S408, determining the features of the image to be retrieved and the features of the base library images;
step S410, similarity calculation;
step S412, similarity sorting;
step S414, determining the retrieval result.
In this embodiment, multi-posture information of the non-motor vehicle is introduced in the model training stage, and three posture branches are constructed, which improves the capability of cross-posture retrieval of the non-motor vehicle; the model implicitly ignores the human body in the upper half during training and focuses its attention on the non-motor vehicle in the lower half, which reduces the interference of human body information with the model and improves the model's ability to represent the non-motor vehicle. This solves the problem in the related art that large differences between different postures of non-motor vehicles make it difficult to rank images of the same non-motor vehicle highly when retrieving across postures, as well as the problem that, in an actual monitoring scene, a human body rides on the non-motor vehicle and the human body information interferes with non-motor vehicle retrieval.
Through the above description of the embodiments, those skilled in the art can clearly understand that the method according to the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but the former is a better implementation mode in many cases. Based on such understanding, the technical solutions of the present invention may be embodied in the form of a software product, which is stored in a storage medium (e.g., ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a terminal device (e.g., a mobile phone, a computer, a server, or a network device) to execute the method according to the embodiments of the present invention.
In this embodiment, an image determining apparatus is further provided. The apparatus is used to implement the foregoing embodiments and preferred implementations, and what has already been described is not repeated. As used below, the term "module" may be a combination of software and/or hardware that implements a predetermined function. Although the apparatus described in the embodiments below is preferably implemented in software, an implementation in hardware, or a combination of software and hardware, is also possible and contemplated.
FIG. 5 is a block diagram of a configuration of an image determination apparatus according to an embodiment of the present invention. As shown in FIG. 5, the apparatus includes:
an extracting module 52, configured to use N sub-feature extraction networks in the trained target network model to perform feature extraction on an image including a target object, so as to obtain N sub-features of the target object, wherein different sub-feature extraction networks in the N sub-feature extraction networks are used for extracting features for different postures of the object, and N is an integer greater than 1;
a processing module 54, configured to perform fusion processing on the N sub-features to obtain a target feature vector;
a determining module 56, configured to determine, based on the target feature vector, a candidate image whose image similarity with the target object is greater than an image similarity threshold from among the candidate images.
In an exemplary embodiment, the apparatus may obtain the trained target network model by: acquiring a plurality of groups of image samples containing historical objects; wherein different sets of image samples in the plurality of sets of image samples comprise image samples acquired for the historical object from different angles, the different angles corresponding to the different poses; respectively performing loss determination operation on the N sub-feature extraction networks to obtain a predicted loss value corresponding to each sub-feature extraction network in the N sub-feature extraction networks; the loss determination operation includes: inputting one group of image samples in the multiple groups of image samples into a target sub-feature extraction network, and determining a prediction loss value corresponding to the target sub-feature extraction network based on a feature extraction result of the target sub-feature extraction network for the group of image samples; the target sub-feature extraction network is a sub-feature extraction network corresponding to the group of image samples; and adjusting parameters of each sub-feature extraction network based on the corresponding prediction loss value of each sub-feature extraction network to obtain the trained target network model.
In an exemplary embodiment, the apparatus may perform parameter adjustment on each sub-feature extraction network based on a corresponding predicted loss value of the sub-feature extraction network by: and respectively carrying out the following operations aiming at each sub-feature extraction network: based on the prediction loss value corresponding to one sub-feature extraction network in each sub-feature extraction network, performing parameter adjustment on the sub-feature extraction network; or carrying out weighted fusion on the prediction loss values corresponding to the sub-feature extraction networks to obtain a comprehensive prediction loss value, and carrying out parameter adjustment on the sub-feature extraction networks based on the comprehensive prediction loss value.
In an exemplary embodiment, the apparatus may perform weighted fusion on the predicted loss values corresponding to the sub-feature extraction networks by: respectively determining the probability that the image samples input into the sub-feature extraction networks belong to the target sub-feature extraction network by using a target classification layer included in the target network model; determining the probability as a target weight of the sub-feature extraction network; and performing weighted fusion on the target weight and the prediction loss value corresponding to each sub-feature extraction network.
In an exemplary embodiment, each of the groups of image samples comprises: a plurality of image samples obtained by shooting the historical object from different angles, and label information of each image sample, wherein the label information comprises identification information of the image sample and angle information of shooting the image sample.
In an exemplary embodiment, the apparatus may input one of the sets of image samples to a target sub-feature extraction network by: weakening the target area of the group of image samples to obtain a first image sample; performing enhancement operation on the first image sample to obtain a second image sample; inputting the second image sample to the target sub-feature extraction network.
In an exemplary embodiment, the apparatus may perform the de-emphasis processing on the target area of the group of image samples to obtain the first image sample by: determining a target probability of deemphasizing the target region; and carrying out zero setting processing on the target area according to the target probability to obtain the first image sample.
In an exemplary embodiment, the apparatus may determine the prediction loss value corresponding to the target sub-feature extraction network based on the feature extraction result of the target sub-feature extraction network for the group of image samples by: determining a first feature extraction image of the target sub-feature extraction network based on the feature extraction result; carrying out segmentation processing on the first feature extraction image according to a preset direction to obtain a plurality of second feature extraction images of different angle types; determining the prediction loss value based on the second feature extraction image.
In an exemplary embodiment, the apparatus may be configured to, before analyzing an image including a target object using a target network model to extract feature vectors of the target object at different angles, add a target classification layer and N sub-feature extraction networks in parallel after a last convolutional layer of a residual network model to obtain an initial network model, where the target classification layer is configured to classify feature images output by the last convolutional layer into N pose categories, and each of the N sub-feature extraction networks is configured to output a feature image corresponding to each of the N sub-feature extraction networks; and training the initial network model by using the plurality of groups of image samples to obtain the target network model.
In an exemplary embodiment, the processing module 54 may perform fusion processing on the N sub-features to obtain the target feature vector as follows: splicing the N sub-features to obtain a spliced feature vector; and determining the spliced feature vector as the target feature vector.
In an exemplary embodiment, the determining module 56 may determine, based on the target feature vector, a candidate image having an image similarity with the target object greater than an image similarity threshold from among candidate images by: inputting the candidate images into the target network model, and extracting candidate features of candidate objects in the candidate images under different postures; performing fusion processing on the candidate features to obtain target candidate feature vectors; and determining a candidate image with the image similarity with the target object larger than an image similarity threshold value from the candidate images based on the target feature vector and the target candidate feature vectors.
It should be noted that, the above modules may be implemented by software or hardware, and for the latter, the following may be implemented, but not limited to: the modules are all positioned in the same processor; alternatively, the modules are respectively located in different processors in any combination.
Embodiments of the present invention also provide a computer-readable storage medium having a computer program stored thereon, wherein the computer program, when executed by a processor, implements the steps of the method as set forth in any of the above.
In an exemplary embodiment, the computer-readable storage medium may include, but is not limited to: various media capable of storing computer programs, such as a usb disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic disk, or an optical disk.
Embodiments of the present invention also provide an electronic device comprising a memory having a computer program stored therein and a processor arranged to run the computer program to perform the steps of any of the above method embodiments.
In an exemplary embodiment, the electronic apparatus may further include a transmission device and an input/output device, wherein the transmission device is connected to the processor, and the input/output device is connected to the processor.
For specific examples in this embodiment, reference may be made to the examples described in the above embodiments and exemplary embodiments, and details of this embodiment are not repeated herein.
It will be apparent to those skilled in the art that the various modules or steps of the invention described above may be implemented using a general purpose computing device, they may be centralized on a single computing device or distributed across a network of computing devices, and they may be implemented using program code executable by the computing devices, such that they may be stored in a memory device and executed by the computing device, and in some cases, the steps shown or described may be performed in an order different than that described herein, or they may be separately fabricated into various integrated circuit modules, or multiple ones of them may be fabricated into a single integrated circuit module. Thus, the present invention is not limited to any specific combination of hardware and software.
The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the principle of the present invention should be included in the protection scope of the present invention.