Disclosure of Invention
Embodiments of the invention provide a video positioning method and apparatus based on face recognition, aiming to solve one or more technical problems in the prior art.
In a first aspect, an embodiment of the present invention provides a video positioning method based on face recognition, including:
performing face recognition on each frame image of a video to obtain the frame images that include a face image;
performing target tracking on the frame images that include a face image, and grouping the frame images that include the face image of the same person into a set;
selecting a plurality of frame images from each set;
and comparing the face image of a target person with the face images in the frame images selected from each set, so as to determine the positions at which the target person appears in the video.
In one embodiment, performing target tracking on each frame image including a face image includes:
performing target tracking on each frame image including a face image by using a kernel correlation filtering (KCF) algorithm.
In one embodiment, performing target tracking on each frame image including a face image by using a kernel correlation filtering algorithm includes:
detecting the position of the face image in each frame image;
calculating the position offset of the face image in the adjacent frame image;
and if the position offset is smaller than a set threshold value, judging the adjacent frame images as frame images comprising face images of the same person.
In one embodiment, calculating the position offset of the face image in the adjacent frame image comprises:
and if the adjacent frame images are stretched or zoomed, aligning the coordinates in the adjacent frame images, and then calculating the position offset of the face image in the adjacent frame images.
In one embodiment, selecting a plurality of frame images from each set includes:
selecting a plurality of frame images from a set according to the sharpness and/or resolution of the frame images included in the set.
In one embodiment, comparing the facial image of the target person with the facial images in the frames of images selected from each set to determine the location of the target person in the video includes:
calculating the similarity between the facial image of the target person and the facial image in each frame of image selected from each set;
if the similarity between the facial image of the target person and the facial image in each frame of image selected from one set is larger than a set threshold value, determining that the target person appears in the set of the video;
and acquiring the frame number and/or the corresponding playing time included in the set of the target person.
In a second aspect, an embodiment of the present invention provides a video positioning apparatus based on face recognition, including:
the face recognition module is used for carrying out face recognition on each frame of image of the video to obtain each frame of image comprising the face image;
the target tracking module is used for carrying out target tracking on each frame of image comprising the face image and taking each frame of image comprising the face image of the same person as a set;
a selecting module for selecting a plurality of frame images from each set;
and the positioning module is used for comparing the face image of the target person with the face images in the frames of images selected from each set so as to determine the position of the target person appearing in the video.
In one embodiment, the target tracking module is further configured to perform target tracking on each frame of image including a face image by using a kernel correlation filtering algorithm.
In one embodiment, the target tracking module is further configured to detect a position of a face image in each frame image; calculating the position offset of the face image in the adjacent frame image; and if the position offset is smaller than a set threshold value, judging the adjacent frame images as frame images comprising face images of the same person.
In an embodiment, the target tracking module is further configured to, if the adjacent frame images are stretched or scaled, align coordinates in the adjacent frame images, and then calculate a position offset of the face image in the adjacent frame images.
In one embodiment, the selecting module is further configured to select a plurality of frame images from a set according to the sharpness and/or resolution of the frame images included in the set.
In one embodiment, the positioning module is further configured to calculate similarity between the facial image of the target person and the facial images in the frames of images selected from each set; if the similarity between the face image of the target person and the face image in each frame of image selected from one set is larger than a set threshold value, determining that the target person appears in the set of the video; and acquiring the frame number and/or the corresponding playing time included in the set of the target person.
In a third aspect, an embodiment of the present invention provides a video positioning apparatus based on face recognition, where functions of the apparatus may be implemented by hardware, or may be implemented by hardware executing corresponding software. The hardware or software includes one or more modules corresponding to the above-described functions.
In one possible design, the apparatus includes a processor and a memory, the memory is used for storing a program for supporting the apparatus to execute the above-mentioned video positioning method based on face recognition, and the processor is configured to execute the program stored in the memory. The apparatus may also include a communication interface for communicating with other devices or a communication network.
In a fourth aspect, an embodiment of the present invention provides a computer-readable storage medium for storing computer software instructions for a video positioning apparatus based on face recognition, which includes a program for executing the video positioning method based on face recognition.
One of the above technical solutions has the following advantage or beneficial effect: the frame images containing face images in a video can be identified quickly and accurately, and among them the frame images containing the face image of a target person can be further identified. This facilitates video editing and video optimization.
The foregoing summary is provided for the purpose of description only and is not intended to be limiting in any way. In addition to the illustrative aspects, embodiments, and features described above, further aspects, embodiments, and features of the present invention will be readily apparent by reference to the drawings and following detailed description.
Detailed Description
In the following, only certain exemplary embodiments are briefly described. As those skilled in the art will recognize, the described embodiments may be modified in various different ways, all without departing from the spirit or scope of the present invention. Accordingly, the drawings and description are to be regarded as illustrative in nature, and not as restrictive.
Fig. 1 shows a flow chart of a video positioning method based on face recognition according to an embodiment of the invention. As shown in fig. 1, the video positioning method based on face recognition may include:
and S11, carrying out face recognition on each frame of image of the video to obtain each frame of image comprising the face image.
And S12, carrying out target tracking on each frame of image including the face image, and taking each frame of image including the face image of the same person as a set.
And S13, selecting a plurality of frame images from each set.
And S14, comparing the facial image of the target person with the facial images in the frames of images selected from each set to determine the position of the target person in the video.
In general, a video consists of a number of frame images, each with a corresponding frame number, and typically each frame image also has a corresponding play time in the video. Some frame images show scenery and some show people. Among the frame images showing people, some contain a single face image and some contain multiple face images.
In embodiments of the invention, face recognition may be performed on all frame images of the video, or performed segment by segment, for example first on the first 10% of the frame images and then on the subsequent segments. In this way, the frame images in the video that include face images can be screened out.
In one embodiment, in step S12, performing target tracking on each frame image including a face image includes: performing target tracking on each frame image including a face image by using the KCF (Kernel Correlation Filter) algorithm. The frame images including face images of the same person may then be grouped into one set, yielding a plurality of sets.
For example, if the frame numbers 010 to 030 include the face image of person a, the frame numbers 040 to 050 include the face image of person B, the frame images of the frame numbers 010 to 030 are regarded as a set S1, and the frame images of the frame numbers 040 to 050 are regarded as a set S2.
The face images of a plurality of persons may appear in one frame image, and therefore, the same frame image may belong to different sets. For example, if the frame numbers 010 to 030 include a human face image of person a, the frame numbers 020 to 050 include a human face image of person C, the frame images of the frame numbers 010 to 030 are regarded as one set S1, and the frame images of the frame numbers 020 to 050 are regarded as one set S3.
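The overlap between sets can be sketched in a few lines; the person labels and frame ranges follow the hypothetical example above:

```python
# Hypothetical tracking output: person label -> frame numbers in which
# that person's face was tracked.
tracks = {
    "A": set(range(10, 31)),   # frames 010-030 -> set S1
    "C": set(range(20, 51)),   # frames 020-050 -> set S3
}

# Frames 020-030 contain both faces, so those frame images
# belong to both set S1 and set S3.
shared = tracks["A"] & tracks["C"]
print(min(shared), max(shared))  # 20 30
```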
In one embodiment, as shown in fig. 2, performing target tracking on each frame image including a face image by using a kernel correlation filtering algorithm includes:
S21: detecting the position of the face image in each frame image.
S22: calculating the position offset of the face image between adjacent frame images.
S23: if the position offset is smaller than a set threshold, determining that the adjacent frame images include face images of the same person.
In embodiments of the invention, whether the face images in different frame images belong to the same person can be determined from the position offset of the face image across frames: in general, the position of the same person does not shift far between adjacent frames. Therefore, the position offset of the face image between adjacent frame images can be calculated, and if it is smaller than a set threshold, the adjacent frame images can be judged to include the face image of the same person.
In one example, calculating the position offset of the face image between adjacent frame images may include calculating the difference or distance between the center-point coordinates of the face images in the adjacent frame images. For example, if the center coordinates of the face image in frame image F1 are (x1, y1) and those in frame image F2 are (x2, y2), the coordinate difference is (x2 - x1, y2 - y1); the distance between the two may be a Euclidean distance, a cosine distance, or the like.
In another example, after calculating the difference between the center-point coordinates of the face images in adjacent frame images, the ratio of the difference to the frame size may be used. With the same center coordinates as above, if the frame image has length x and width y, the ratios (x2 - x1)/x and (y2 - y1)/y may be used as the offset.
Therefore, the offset threshold may be set correspondingly according to the offset calculation method. For example, a threshold value of the length difference, a threshold value of the width difference, a threshold value of the euclidean distance, a proportional threshold value, and the like are set.
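The offset measures described above can be sketched in a few lines; the coordinates and frame size are made-up values, and in practice the face centers would come from the face detection in step S21:

```python
import math

def center_offset(c1, c2):
    """Coordinate difference of the face centers in adjacent frames."""
    return (c2[0] - c1[0], c2[1] - c1[1])

def euclidean_offset(c1, c2):
    """Euclidean distance between the two face centers."""
    dx, dy = center_offset(c1, c2)
    return math.hypot(dx, dy)

def proportional_offset(c1, c2, frame_w, frame_h):
    """Offset expressed as a fraction of the frame size."""
    dx, dy = center_offset(c1, c2)
    return (dx / frame_w, dy / frame_h)

# Hypothetical face centers in adjacent frames F1 and F2.
c_f1, c_f2 = (100, 80), (106, 88)
print(center_offset(c_f1, c_f2))     # (6, 8)
print(euclidean_offset(c_f1, c_f2))  # 10.0
print(proportional_offset(c_f1, c_f2, 1920, 1080))
```

Whichever measure is used, it is then compared against the corresponding threshold from step S23.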
In one embodiment, calculating the position offset of the face image in the adjacent frame image comprises:
and if the adjacent frame images are stretched or zoomed, aligning the coordinates in the adjacent frame images, and then calculating the position offset of the face image in the adjacent frame images.
During video shooting, the lens may stretch or zoom, so that the same face appears enlarged or reduced across frames. For example, if frame images F1 and F2 are adjacent and F2 is stretched relative to F1, the coordinates of F2 may first be converted according to the stretch ratio, and the position offset of the face images in F2 and F1 may then be compared.
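Under the assumption that the stretch ratio between the two frames is known, the coordinate alignment might look like the following sketch (all values are hypothetical):

```python
def align_center(center, scale_x, scale_y):
    """Map a face center detected in a stretched/zoomed frame back into
    the reference frame's coordinate system before computing the offset."""
    return (center[0] / scale_x, center[1] / scale_y)

# Hypothetical example: F2 is stretched 2x in both axes relative to F1.
center_f1 = (98, 82)
center_f2 = (200, 160)                          # as detected in the stretched F2
aligned_f2 = align_center(center_f2, 2.0, 2.0)  # -> (100.0, 80.0)

# The offset is now computed in a common coordinate system.
offset = (aligned_f2[0] - center_f1[0], aligned_f2[1] - center_f1[1])
print(offset)  # (2.0, -2.0): small, so likely the same person
```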
In one embodiment, step S13 selects a plurality of frame images from each set, including:
a plurality of frame images from a set is selected according to the sharpness and/or resolution of the frame images comprised in said set.
For example, if a set includes 20 frame images, several high-quality frame images, such as those with high resolution and high sharpness, can be selected from the 20. This helps produce an accurate result when the target person is identified later.
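One way to realize this selection is sketched below with a hypothetical quality score combining sharpness and pixel count; a real system might instead measure sharpness via, for example, the variance of the Laplacian:

```python
def select_frames(frames, k=3):
    """Return the k frames with the highest quality score.
    'sharpness' is assumed to be a precomputed scalar per frame."""
    def score(f):
        return f["sharpness"] * f["width"] * f["height"]
    return sorted(frames, key=score, reverse=True)[:k]

# Hypothetical set of five frames with varying quality.
frames = [
    {"id": 10, "sharpness": 0.2, "width": 640,  "height": 360},
    {"id": 11, "sharpness": 0.9, "width": 1280, "height": 720},
    {"id": 12, "sharpness": 0.5, "width": 1280, "height": 720},
    {"id": 13, "sharpness": 0.8, "width": 1920, "height": 1080},
    {"id": 14, "sharpness": 0.1, "width": 1920, "height": 1080},
]
best = select_frames(frames, k=2)
print([f["id"] for f in best])  # [13, 11]
```

Note that a high-resolution but blurry frame (id 14) scores below a sharp lower-resolution one (id 11), which is the intended behaviour of combining both criteria.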
In one embodiment, as shown in fig. 3, the step S14 of comparing the facial image of the target person with the facial images in the frames of images selected from each set to determine the position of the target person appearing in the video includes:
and S31, calculating the similarity between the face image of the target person and the face image in each frame of image selected from each set.
Step S32, if the similarity between the facial image of the target person and the facial image in each frame of image selected from one set is larger than a set threshold value, determining that the target person appears in the set of the video.
And S33, acquiring the frame number and/or the corresponding playing time included in the set of the target person.
If all the frame images selected from a certain set include the face image of the target person, it can be determined with high probability that the frame images in that set belong to the target person. The face image of the target person may be provided by the user, or images of certain persons whose appearance must be suppressed, such as photos of blacklisted persons, may be pre-stored in a database. When a video needs to be edited, the face images of one or more target persons can be retrieved from the database for real-time comparison.
For example, suppose a video contains 100 frame images with face regions in which the face image of the same person appears, and these 100 frame images form one set. Several high-quality frame images (high resolution and high sharpness) can be selected from the set, and their similarity to the face image of the target person can be computed. If the similarity is high, it can be determined that the target person appears in these 100 images.
Then, the frame numbers of the frame images containing the target person can be output. The corresponding play times of these frames in the video can also be output.
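Steps S31 to S33 can be sketched as follows. The embeddings, threshold, and frame rate are all hypothetical, and the cosine similarity stands in for whatever matching score the face recognizer actually provides; the play time is derived simply as frame number divided by frames per second:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two face-feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def locate_target(target, face_sets, threshold=0.8, fps=25.0):
    """For each set whose selected frames all match the target above the
    threshold, return the set's frame numbers and play times."""
    hits = {}
    for set_id, s in face_sets.items():
        if all(cosine_similarity(target, emb) > threshold for emb in s["selected"]):
            hits[set_id] = {
                "frames": s["frames"],
                "times": [n / fps for n in s["frames"]],  # play time in seconds
            }
    return hits

# Toy 2-D "embeddings": set S1 matches the target person, S2 does not.
target = (1.0, 0.0)
face_sets = {
    "S1": {"frames": [10, 11, 12], "selected": [(0.99, 0.05), (0.98, 0.1)]},
    "S2": {"frames": [40, 41],     "selected": [(0.1, 0.99)]},
}
hits = locate_target(target, face_sets)
print(sorted(hits))         # ['S1']
print(hits["S1"]["times"])  # [0.4, 0.44, 0.48]
```

Requiring all selected frames of a set to match, rather than any single frame, reflects step S32 and reduces false positives from a single lucky match.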
In an application example, during video post-production, face recognition is performed on the video; once the frame numbers or times at which a certain target person appears are obtained, those frame images can be located and the target person in them can be processed, for example by deleting the frame images or applying mosaic processing to the target person.
With embodiments of the invention, the frame images containing face images in a video can be identified quickly and accurately, and among them the frame images containing the face image of the target person can be further identified. This facilitates video editing and video optimization.
The embodiment of the invention can support the automatic tracking of the people in the video by utilizing the database and can also support the uploading of the images of the people to be tracked by the user. In addition, by identifying the people in the video, the number of times the people appear in the video, the positions of the people appearing each time and the like are located, and therefore the specific people can be conveniently processed in the post-production of the video.
Fig. 4 shows a block diagram of a video positioning apparatus based on face recognition according to an embodiment of the invention. As shown in fig. 4, the apparatus may include:
a face recognition module 41, configured to perform face recognition on each frame image of the video to obtain the frame images including a face image;
a target tracking module 42, configured to perform target tracking on each frame image including a face image, and to group the frame images including the face image of the same person into a set;
a selecting module 43, configured to select a plurality of frame images from each set;
a positioning module 44, configured to compare the face image of the target person with the face images in the frame images selected from each set, so as to determine the positions at which the target person appears in the video.
In one embodiment, the target tracking module 42 is further configured to perform target tracking on each frame image including a face image by using a kernel correlation filtering algorithm.
In one embodiment, the target tracking module 42 is further configured to detect the position of the face image in each frame image; calculate the position offset of the face image between adjacent frame images; and, if the position offset is smaller than a set threshold, determine that the adjacent frame images include face images of the same person.
In one embodiment, the target tracking module 42 is further configured to, if adjacent frame images are stretched or zoomed, align the coordinates in the adjacent frame images and then calculate the position offset of the face image between them.
In one embodiment, the selecting module 43 is further configured to select a plurality of frame images from a set according to the sharpness and/or resolution of the frame images included in the set.
In one embodiment, the positioning module 44 is further configured to calculate the similarity between the face image of the target person and the face images in the frame images selected from each set; if the similarity between the face image of the target person and the face image in each frame image selected from one set is greater than a set threshold, determine that the target person appears in that set of the video; and acquire the frame numbers and/or the corresponding play times included in the set in which the target person appears.
The functions of the modules in the apparatuses according to the embodiments of the present invention may refer to the corresponding descriptions in the above methods, and are not described herein again.
Fig. 5 shows a block diagram of a video positioning apparatus based on face recognition according to an embodiment of the invention. As shown in fig. 5, the apparatus includes a memory 910 and a processor 920, the memory 910 storing a computer program operable on the processor 920. When executing the computer program, the processor 920 implements the video positioning method based on face recognition in the above embodiments. There may be one or more memories 910 and processors 920.
The device also includes:
a communication interface 930, configured to communicate with external devices for data exchange.
The memory 910 may include a high-speed RAM memory, and may also include a non-volatile memory, such as at least one disk memory.
If the memory 910, the processor 920, and the communication interface 930 are implemented independently, they may be connected to one another through a bus and communicate with one another. The bus may be an Industry Standard Architecture (ISA) bus, a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, and so on. For ease of illustration, only one thick line is shown in fig. 5, but this does not mean there is only one bus or one type of bus.
Optionally, in an implementation, if the memory 910, the processor 920, and the communication interface 930 are integrated on one chip, they may communicate with one another through an internal interface.
Embodiments of the present invention provide a computer-readable storage medium, which stores a computer program, and when the program is executed by a processor, the computer program implements the method described in any of the above embodiments.
In the description herein, references to the description of the term "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Furthermore, various embodiments or examples and features of different embodiments or examples described in this specification can be combined and combined by one skilled in the art without contradiction.
Furthermore, the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. In the description of the present invention, "a plurality" means two or more unless specifically defined otherwise.
Any process or method descriptions in flow charts or otherwise described herein may be understood as representing modules, segments, or portions of code which include one or more executable instructions for implementing specific logical functions or steps of the process, and alternate implementations are included within the scope of the preferred embodiment of the present invention in which functions may be executed out of order from that shown or discussed, including substantially concurrently or in reverse order, depending on the functionality involved, as would be understood by those reasonably skilled in the art of the present invention.
The logic and/or steps represented in the flowcharts or otherwise described herein, e.g., an ordered listing of executable instructions that can be considered to implement logical functions, can be embodied in any computer-readable medium for use by or in connection with an instruction execution system, apparatus, or device, such as a computer-based system, processor-containing system, or other system that can fetch the instructions from the instruction execution system, apparatus, or device and execute the instructions. For the purposes of this description, a "computer-readable medium" can be any means that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device. More specific examples (a non-exhaustive list) of the computer-readable medium would include the following: an electrical connection (electronic device) having one or more wires, a portable computer diskette (magnetic device), a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber device, and a portable read-only memory (CDROM). Further, the computer-readable medium could even be paper or another suitable medium upon which the program is printed, as the program can be electronically captured, via for instance optical scanning of the paper or other medium, then compiled, interpreted or otherwise processed in a suitable manner if necessary, and then stored in a computer memory.
It should be understood that portions of the present invention may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, the various steps or methods may be implemented in software or firmware stored in memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, any one or combination of the following technologies, which are well known in the art, may be used: a discrete logic circuit having a logic gate circuit for implementing a logic function on a data signal, an application specific integrated circuit having an appropriate combinational logic gate circuit, a Programmable Gate Array (PGA), a Field Programmable Gate Array (FPGA), or the like.
It will be understood by those skilled in the art that all or part of the steps carried by the method for implementing the above embodiments may be implemented by hardware related to instructions of a program, which may be stored in a computer readable storage medium, and when the program is executed, the program includes one or a combination of the steps of the method embodiments.
In addition, functional units in the embodiments of the present invention may be integrated into one processing module, or each unit may exist alone physically, or two or more units are integrated into one module. The integrated module can be realized in a hardware mode, and can also be realized in a software functional module mode. The integrated module, if implemented in the form of a software functional module and sold or used as a separate product, may also be stored in a computer readable storage medium. The storage medium may be a read-only memory, a magnetic or optical disk, or the like.
The above description is only for the specific embodiment of the present invention, but the scope of the present invention is not limited thereto, and any person skilled in the art can easily conceive various changes or substitutions within the technical scope of the present invention, and these should be covered by the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the appended claims.