Disclosure of Invention
It is an object of the invention to provide an improved method of generating images, an improved method of training a human detection model, and corresponding computer program products and apparatuses.
According to a first aspect of the present invention, there is provided a method of generating an image, wherein the method comprises the steps of:
S1, providing an original image containing a person;
S2, cutting out at least one original human image block from the original image;
S3, generating a synthesized human image block based on each original human image block, wherein the synthesized human image block has the background of the corresponding original human image block and a human pose different from the human pose in the corresponding original human image block;
S4, replacing the corresponding original human image block in the original image with the synthesized human image block to generate a synthesized image.
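The four steps above can be summarized in the following minimal sketch (Python with NumPy). The helpers `crop_person_blocks` and `generate_synthetic_block` are hypothetical placeholders standing in for steps S2 and S3; the disclosure does not prescribe a concrete implementation of either.

```python
import numpy as np

def generate_synthesized_image(original: np.ndarray,
                               crop_person_blocks,
                               generate_synthetic_block) -> np.ndarray:
    """S1: `original` is an H x W x 3 image containing at least one person."""
    synthesized = original.copy()
    # S2: cut out each original human image block together with its crop box.
    for block, (x0, y0, x1, y1) in crop_person_blocks(original):
        # S3: synthesize a block with the same background but a new pose.
        new_block = generate_synthetic_block(block)
        # S4: replace the original block in place, leaving all other pixels
        # of the original image unchanged.
        synthesized[y0:y1, x0:x1] = new_block
    return synthesized
```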
According to the present invention, the newly generated synthesized human image block retains the background of the original human image block; that is, it contains background information matching the environmental information of the original image. The original human image block can therefore be replaced directly with the corresponding newly generated synthesized human image block, and nothing in the resulting complete synthesized image changes except the person. In the synthesized image, the new synthesized human image block has both a reasonable position and a reasonable size, while fitting the environmental information of the original image well. Compared with the original images, these synthesized images have different person poses and can therefore provide more diverse pose and bounding box information.
According to an alternative embodiment of the invention, the synthesized human image block has the appearance of the person in the corresponding original human image block.
According to an alternative embodiment of the present invention, step S1 further comprises providing target pose information, and in step S3 the synthesized human image block is generated based on each original human image block and in accordance with the target pose information, such that the synthesized human image block has the target human pose represented by the target pose information.
According to an alternative embodiment of the invention, the target pose information is provided by a pose image comprising pose key points connected according to real human skeleton links. Alternatively, the target pose information is provided by a target-pose person image containing a person having the target pose. Alternatively, the target pose information is provided by position data for a set of pose key points.
According to an alternative embodiment of the invention, the target pose information has associated annotation information, and the synthesized human image block generated in step S3 carries the annotation information associated with the corresponding target pose information, wherein the annotation information optionally includes person intent information and/or gesture information.
According to an alternative embodiment of the present invention, in step S3 the original human image block and the target pose information are input into a human generator, which generates the synthesized human image block.
According to an alternative embodiment of the invention, the human generator is configured to perform the following steps:
S31, identifying at least one pose key point of the person in the original human image block;
S32, cropping a plurality of foreground image patches and a plurality of background image patches from the original human image block based on the at least one pose key point;
S33, extracting at least one first feature vector from the plurality of foreground image patches and the plurality of background image patches;
S34, acquiring at least one second feature vector from the target pose information; and
S35, generating the synthesized human image block from the at least one first feature vector extracted in step S33 and the at least one second feature vector acquired in step S34.
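A minimal sketch of how steps S31 to S35 might be wired together follows; every callable passed in is a hypothetical placeholder, since the disclosure leaves the concrete key-point detector, encoders, and decoder open.

```python
def run_human_generator(original_block, target_pose_info,
                        detect_keypoints, crop_patches,
                        encode_patches, encode_pose, decode):
    # S31: identify pose key points of the person in the original block.
    keypoints = detect_keypoints(original_block)
    # S32: crop foreground and background patches around the key points.
    fg_patches, bg_patches = crop_patches(original_block, keypoints)
    # S33: extract the first feature vectors (appearance and background).
    first_features = encode_patches(fg_patches, bg_patches)
    # S34: acquire the second feature vectors from the target pose information.
    second_features = encode_pose(target_pose_info)
    # S35: generate the synthesized human image block from both feature sets.
    return decode(first_features, second_features)
```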
According to a second aspect of the present invention, there is provided a method of training a human detection model, wherein the method comprises the steps of: providing a training data set comprising synthesized images generated by the method of generating an image according to the invention; and training the human detection model with the training data set, wherein the training data set optionally also includes the original images.
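As a sketch, assembling such a training data set could look as follows; `synthesize` stands for the image-generation method above and is an assumed callable, not part of the disclosure.

```python
def build_training_set(original_images, synthesize, include_originals=True):
    # Synthesized images generated from the originals form the core of the set.
    dataset = [synthesize(image) for image in original_images]
    # Optionally, the original images are included as well.
    if include_originals:
        dataset.extend(original_images)
    return dataset
```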
As described above, on the one hand, in the synthesized image the new synthesized human image block has both a reasonable position and a reasonable size while conforming well to the surrounding environment information of the original image. On the other hand, these synthesized images provide more varied person poses and bounding box information. The synthesized images are therefore particularly suitable for training the human detection model, allowing the human detection model to achieve a higher recognition rate.
According to a third aspect of the invention, there is provided a computer program product comprising computer program instructions which, when executed by one or more processors, cause the one or more processors to perform the method of generating an image according to the invention or the method of training a human detection model according to the invention.
According to a fourth aspect of the present invention, there is provided an apparatus for processing an image, the apparatus comprising a processor and a computer-readable storage device communicatively connected to the processor, the computer-readable storage device having stored thereon a computer program which, when executed by the processor, implements the method of generating an image according to the present invention or the method of training a human detection model according to the present invention.
The invention achieves the following effects: the person in the generated synthesized image has a reasonable position and size, and the problem of the foreground not matching the background information is avoided. Furthermore, training the human detection model with the synthesized images improves the recognition rate of the human detection model.
Detailed Description
In order to make the technical problems, technical solutions and advantageous effects of the present invention more apparent, the present invention will be described in further detail with reference to the accompanying drawings and exemplary embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and do not limit the scope of the invention.
Fig. 1 illustrates a schematic structural block diagram of an apparatus for processing an image according to an exemplary embodiment of the present invention. The apparatus for processing images comprises a processor 1 and a computer-readable storage device 2 communicatively connected to the processor 1. The computer-readable storage device 2 stores a computer program which, when executed by the processor 1, implements a method of generating an image or a method of training a human detection model; both methods are explained in detail below.
According to an exemplary embodiment, a display device 3 is provided in communicative connection with the processor 1. By means of the display device 3, the user can view the original image 10 to be processed by the apparatus and the new synthesized image 20 generated by the apparatus.
According to an exemplary embodiment, an input device 4 is provided in communicative connection with the processor 1. By means of the input device 4, the user can select or input an original image 10 to be processed by the apparatus. The input device 4 may include, for example, a keyboard, a mouse, and/or a touch screen.
According to an exemplary embodiment, a camera device 5 is provided in communicative connection with the processor 1. By means of the camera device 5, the user can take a photograph containing a person as an original image 10 to be processed by the apparatus. In particular, the original image 10 includes not only the person but also surrounding environment information, such as the scene in which the person is located.
According to an exemplary embodiment, an original image set is provided which is made up of a plurality of original images 10. The original image set may be stored in the computer-readable storage device 2 or in another storage device communicatively connected to the processor 1.
Fig. 2 shows a flowchart of a method of generating an image according to an exemplary embodiment of the present invention.
In step S1, the original image 10 containing a person is provided. Fig. 3 schematically shows an original image 10 containing a person and the scene in which the person is located. The original image 10 may be any image in the original image set mentioned above. The original image 10 may be, for example, an image captured by the user via the camera device 5 or a frame containing a person captured from a video stream.
Then, in step S2, at least one original human image block 11 is cut out from the original image 10. As shown in fig. 4, two original human image blocks 11 can be cropped from the original image 10 shown in fig. 3. Each cropped original human image block 11 contains the complete person and also a small amount of background.
In one exemplary embodiment, pose key points of the person contained in the original image 10 are identified, and the original human image blocks 11 are cropped according to the identified pose key points, so that a single original human image block 11 contains a single whole person. For example, a person bounding box may be determined from the identified pose key points and expanded outward, e.g. by a factor of 1.5, to form a cropping box for cutting out the original human image block 11.
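A sketch of this cropping rule, assuming the key points are given as an N x 2 array of (x, y) pixel coordinates; the expansion factor 1.5 is the example value mentioned above.

```python
import numpy as np

def crop_box_from_keypoints(keypoints: np.ndarray, image_shape,
                            expand: float = 1.5):
    h, w = image_shape[:2]
    # Person bounding box determined from the identified pose key points.
    x0, y0 = keypoints.min(axis=0)
    x1, y1 = keypoints.max(axis=0)
    # Expand the box outward around its center, e.g. by a factor of 1.5.
    cx, cy = (x0 + x1) / 2.0, (y0 + y1) / 2.0
    half_w = (x1 - x0) / 2.0 * expand
    half_h = (y1 - y0) / 2.0 * expand
    # Clamp to the image borders and round to integer pixel coordinates.
    return (max(0, int(round(cx - half_w))), max(0, int(round(cy - half_h))),
            min(w, int(round(cx + half_w))), min(h, int(round(cy + half_h))))
```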
Next, in step S3, a synthesized human image block 21 is generated based on each original human image block 11, where the synthesized human image block 21 has the background of the corresponding original human image block 11 and a human pose different from the human pose in the corresponding original human image block 11. The synthesized human image block 21 carries new person pose and bounding box information, making the information in the full data set richer.
Optionally, the synthesized human image block 21 has the appearance of the person in the corresponding original human image block 11. In that case, the synthesized human image block 21 changes only the person's pose, while keeping the person's appearance and the background of the original human image block 11.
Illustratively, the synthesized human image block 21 may be generated by means of a human generator 30. As shown in fig. 4, the human generator 30 generates two corresponding synthesized human image blocks 21 based on the two original human image blocks 11.
Next, in step S4, the corresponding original human image block 11 in the original image 10 is replaced with the synthesized human image block 21 to generate the synthesized image 20. The newly generated synthesized human image block 21 already contains background information matching the background and environment information of the original image 10 at the pixel level, so the original human image block 11 can be replaced directly with the corresponding newly generated synthesized human image block 21. Thus, no element other than the person is changed in the newly generated complete synthesized image 20. In the synthesized image 20, the new synthesized human image block 21 has both a reasonable position and a reasonable size, while conforming well to the surrounding environment information. These synthesized images 20 can provide a greater variety of person poses and bounding box information. Using the original image set together with the synthesized image set to train the human detection model 40 allows the human detection model 40 to achieve a higher recognition rate.
Fig. 5 illustrates a process of generating the synthesized human image block 21 according to an exemplary embodiment of the present invention. In this exemplary embodiment, step S1 further comprises providing target pose information, and in step S3 the synthesized human image block 21 is generated based on each original human image block 11 and in accordance with the target pose information, such that the synthesized human image block 21 has the target human pose represented by the target pose information.
The target pose information may be provided by a pose image that contains pose key points connected according to real human skeleton links. The original human image block 11 and the pose image having the target pose are input into the human generator 30, which then outputs the synthesized human image block 21. The source of the target pose is not limited by the invention; it may be the pose of another person in the original image set or the pose of a person in another data set.
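Such a pose image can be rendered from key-point coordinates, for example as in the sketch below (using OpenCV). The link list here is an illustrative subset only; the actual skeleton links depend on the key-point convention used and are not fixed by the disclosure.

```python
import cv2
import numpy as np

# Illustrative subset of skeleton links (pairs of key-point indices).
SKELETON_LINKS = [(0, 1), (1, 2), (2, 3), (1, 4), (4, 5)]

def render_pose_image(keypoints, height, width, links=SKELETON_LINKS):
    canvas = np.zeros((height, width, 3), dtype=np.uint8)
    # Draw the skeleton links connecting the pose key points.
    for i, j in links:
        p = tuple(map(int, keypoints[i]))
        q = tuple(map(int, keypoints[j]))
        cv2.line(canvas, p, q, color=(255, 255, 255), thickness=2)
    # Mark each key point itself.
    for x, y in keypoints:
        cv2.circle(canvas, (int(x), int(y)), radius=3,
                   color=(0, 255, 0), thickness=-1)
    return canvas
```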
Alternatively, the target pose information may be provided by a target-pose person image containing a person having the target pose. The target-pose person image may or may not be selected from the original image set.
Alternatively, the target pose information may be provided by position data for a set of pose key points. It should be understood that the present invention is not limited to the specific form of the target pose information.
In one exemplary embodiment, the target pose information has associated annotation information, and the synthesized human image block 21 generated in step S3 carries the annotation information associated with the corresponding target pose information. The annotation information includes, for example, person intent information and/or gesture information. The resulting synthesized images 20 then also have associated annotation information, and such synthesized images 20 are particularly advantageous for training a human intent recognizer, a human gesture detector, or the like, to enhance its performance.
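One hypothetical way to carry the annotation along with the target pose information is a small container type; the field names below are assumptions for illustration, not part of the disclosure. A synthesized human image block generated from such a pose simply inherits the `intent` and `gesture` annotations.

```python
from dataclasses import dataclass
from typing import Optional
import numpy as np

@dataclass
class TargetPoseInfo:
    keypoints: np.ndarray          # N x 2 positions of the pose key points
    intent: Optional[str] = None   # person intent annotation, e.g. "crossing"
    gesture: Optional[str] = None  # gesture annotation, e.g. "waving"
```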
In one exemplary embodiment, the human generator 30 is configured to perform the following steps:
S31, identifying at least one pose key point of the person in the original human image block 11;
S32, cropping a plurality of foreground image patches and a plurality of background image patches from the original human image block 11 based on the at least one pose key point;
S33, extracting at least one first feature vector from the plurality of foreground image patches and the plurality of background image patches;
S34, acquiring at least one second feature vector from the target pose information; and
S35, generating the synthesized human image block 21 from the at least one first feature vector extracted in step S33 and the at least one second feature vector acquired in step S34.
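Step S32 could, for instance, be realized by cutting a fixed-size square patch around each key point as a foreground patch and taking patches away from the person, here the corners of the block, as background patches. Patch size and background placement are assumptions in this sketch, not requirements of the disclosure.

```python
import numpy as np

def crop_patches(block: np.ndarray, keypoints: np.ndarray,
                 patch_size: int = 32):
    half = patch_size // 2
    # Pad so that patches around key points near the border stay in bounds.
    padded = np.pad(block, ((half, half), (half, half), (0, 0)), mode="edge")
    # Foreground patches: one square patch centered on each pose key point.
    fg_patches = [padded[y:y + patch_size, x:x + patch_size]
                  for x, y in keypoints.astype(int)]
    # Background patches: the four corner regions of the block, which
    # typically contain the small amount of background around the person.
    h, w = block.shape[:2]
    bg_patches = [block[:patch_size, :patch_size],
                  block[:patch_size, w - patch_size:],
                  block[h - patch_size:, :patch_size],
                  block[h - patch_size:, w - patch_size:]]
    return fg_patches, bg_patches
```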
The human generator 30 may also be provided as another type of generator, as long as the human generator 30 is functionally able to control the appearance, pose, and background of the generated person image. It should be understood that the present invention is not limited to a particular type of human generator 30.
Fig. 6 shows a schematic diagram of a method of training a human detection model 40 according to an exemplary embodiment of the invention. In the method of training the human detection model 40, synthesized images 20 are first generated by the method of generating an image according to the present invention. Then, a training data set comprising the synthesized images 20 is provided. The training data set may be stored in a computer-readable storage medium. Optionally, the training data set further comprises the original images 10; that is, the original images 10 and the synthesized images 20 may be used together to train the human detection model 40.
As described above, the persons in the synthesized images 20 have a reasonable size and position, and the problem of the foreground not matching the background information does not occur. The synthesized images 20 provide more varied person poses and bounding box information, making the training data set more informative. Training the human detection model 40 on the original image set together with the synthesized image set allows the human detection model 40 to achieve a higher recognition rate.
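A sketch of such a training run follows, assuming a torchvision-style detection model that returns a dictionary of losses in training mode; `original_ds` and `synthetic_ds` are assumed PyTorch datasets yielding (image, target) pairs, and the batch size and learning rate are arbitrary example values.

```python
import torch
from torch.utils.data import ConcatDataset, DataLoader

def train_detector(model, original_ds, synthetic_ds, epochs=10, lr=1e-4):
    # Original and synthesized images are used together, as described above.
    loader = DataLoader(ConcatDataset([original_ds, synthetic_ds]),
                        batch_size=8, shuffle=True,
                        # Detection targets vary in size, so batches are
                        # collated as tuples rather than stacked tensors.
                        collate_fn=lambda batch: tuple(zip(*batch)))
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    model.train()
    for _ in range(epochs):
        for images, targets in loader:
            # Assumed torchvision-style API: a dict of losses in train mode.
            loss_dict = model(list(images), list(targets))
            loss = sum(loss_dict.values())
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```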
Furthermore, the invention relates to a computer program product comprising computer program instructions which, when executed by one or more processors, cause the one or more processors to perform the method of generating an image according to the invention or the method of training a human detection model according to the invention. The computer program instructions may be stored in a computer-readable storage medium. In the present invention, the computer-readable storage medium may include, for example, a high-speed random access memory, and may also include a non-volatile memory, such as a hard disk, a plug-in hard disk, a SmartMedia card (SMC), a Secure Digital (SD) card, a flash memory card, at least one magnetic disk storage device, a flash memory device, or other non-volatile solid-state storage device. The processor may be a central processing unit (CPU), another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, discrete gate or transistor logic, discrete hardware components, etc. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like.
Although specific embodiments of the invention have been described herein in detail, they have been presented for purposes of illustration only and are not to be construed as limiting the scope of the invention. Various substitutions, alterations, and modifications may be devised without departing from the spirit and scope of the present invention.