Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the invention and are not limiting of the invention. It should be further noted that, for the convenience of description, only some of the structures related to the present invention are shown in the drawings, not all of the structures.
Example one
Fig. 1 is a flowchart of a face recognition enhancement method according to an embodiment of the present invention. The embodiment is applicable wherever face recognition is required, for example in a video monitoring system with a face recognition function. The method can be executed by a face recognition enhancement device, which can be implemented in software and/or hardware and can be integrated in an electronic device, for example a camera with an audio collector, or a backend server.
As shown in fig. 1, the face recognition enhancement method specifically includes the following steps:
S101, obtaining a target image containing a target object, and performing face recognition on the target image to obtain a first recognition result.
In the embodiment of the present invention, the target image of the target object may be image data comprising a single frame collected by the monitoring camera, or video data comprising a plurality of frames. When face recognition is performed on the target image, optionally, a face area in the target image is detected first, a face image is extracted from the target image based on the face area, and face recognition is then performed on the face image to obtain the first recognition result. The first recognition result includes structured person information, such as gender, age group, race, emotion, whether glasses are worn, and whether there is a beard.
In an alternative embodiment, performing face recognition on a target image to obtain a first recognition result includes:
inputting the target image into a pre-trained face recognition model, and obtaining a first recognition result according to the output of the face recognition model, wherein the first recognition result comprises the recognition result of at least one biological feature and the confidence coefficient of the recognition result of each biological feature.
In the embodiment of the invention, the pre-trained face recognition model may be a trained convolutional neural network model. The biological features in the first recognition result include at least the structured person information mentioned above (gender, age group, race, emotion, whether glasses are worn, whether there is a beard, etc.). An example of the recognition results of the biological features is: gender: male; age: 20-24 years old; race: Asian. The confidence of a recognition result represents its credibility: the higher the confidence of a recognition result, the more likely the prediction is accurate.
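As a minimal sketch of how a recognition result with per-feature confidences might be represented, the following assumes that the model outputs a probability for each candidate label of each biological feature, and that the most probable label and its probability are taken as the recognition result and its confidence. The names `FeatureResult` and `to_recognition_result`, the feature keys, and all numeric values are hypothetical and are not taken from the embodiment.

```python
from typing import Dict, NamedTuple


class FeatureResult(NamedTuple):
    """Recognition result of one biological feature (hypothetical structure)."""
    value: str         # recognized label, e.g. "male"
    confidence: float  # credibility of the label, in [0, 1]


def to_recognition_result(
    probabilities: Dict[str, Dict[str, float]],
) -> Dict[str, FeatureResult]:
    """Keep, for each feature, the most probable label and its probability."""
    result = {}
    for feature, label_probs in probabilities.items():
        label = max(label_probs, key=label_probs.get)
        result[feature] = FeatureResult(label, label_probs[label])
    return result


# Assumed model output; the numbers are illustrative only.
model_output = {
    "gender": {"male": 0.6, "female": 0.4},
    "age_group": {"20-24": 0.7, "25-29": 0.3},
}
first_result = to_recognition_result(model_output)
```

Keeping the confidence next to each label is what later allows the face and voice results to be compared feature by feature.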
It should be noted that, if the quality of the acquired target image containing the target object is poor, for example its resolution is low, the target image may be preprocessed before being input into the pre-trained face recognition model. For example, the target image is cropped to obtain a cropped image, and the cropped image is then reconstructed using a pre-trained image reconstruction model. Face recognition can then be performed on the reconstructed target image, which helps maintain the accuracy of face recognition.
S102, obtaining audio data of the target object collected by the audio collector, and performing voice recognition on the audio data to obtain a second recognition result.
At least two audio collectors are arranged on the camera of the monitoring system, and each audio collector may be a sound pickup. The audio data of the target object is obtained through the audio collectors, and the collected audio data is then recognized using voice recognition technology to obtain the second recognition result.
In an alternative embodiment, performing voice recognition on the audio data to obtain the second recognition result includes:
inputting the audio data into a pre-trained voice recognition model, and obtaining a second recognition result according to the output of the voice recognition model, wherein the second recognition result comprises the recognition result of at least one biological feature and the confidence of the recognition result of each biological feature.
It should be noted that the voice recognition model is trained using audio data corresponding to different biological features as training samples; for example, audio data of males and females of different ages may be collected as training samples.
S103, correcting or supplementing the first recognition result based on the second recognition result to obtain a final recognition result.
In an alternative embodiment, correcting or supplementing the first recognition result based on the second recognition result to obtain a final recognition result includes:
comparing at least one biological feature and its confidence in the second recognition result with at least one biological feature and its confidence in the first recognition result, and correcting or supplementing the first recognition result according to the comparison result to obtain a final recognition result.
Optionally, for any biometric feature shared by the first recognition result and the second recognition result, if the confidence of the recognition result of the biometric feature in the second recognition result is greater than the confidence of the recognition result of the biometric feature in the first recognition result, the recognition result of the biometric feature in the first recognition result is replaced with the recognition result of the biometric feature in the second recognition result. For example, for the biological feature of "gender", in the first recognition result: gender male, confidence 60%; and in the second recognition result: gender female, confidence 85%; the second recognition result is considered to be more accurate, and the gender male in the first recognition result can be replaced by the gender female.
For any biometric feature that is present in the second recognition result but not in the first recognition result, the recognition result of the biometric feature is supplemented to the first recognition result. That is, features that can only be obtained by voice recognition are added to the face recognition result. Illustratively, the biological feature is the region to which the target object belongs: when voice recognition is performed on the audio data of the target object, it is determined from the voice features that the target object is from Northeast China, that is, from one of the three northeastern provinces, with a confidence of 80%. Therefore, "the target object belongs to the three northeastern provinces" can be supplemented to the first recognition result, thereby enriching the face recognition result.
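The correction and supplementation rules described above can be sketched as follows. This is an illustrative sketch only: the function name `merge_results`, the dictionary layout (feature mapped to a value and confidence pair), and the age-group values are assumptions. Carrying the confidence over together with the replaced value is likewise an assumption, since the embodiment only states that the recognition result is replaced.

```python
from typing import Dict, Tuple

# feature name -> (recognized value, confidence); hypothetical layout
Result = Dict[str, Tuple[str, float]]


def merge_results(first: Result, second: Result) -> Result:
    """Correct or supplement the face result (first) with the voice result (second)."""
    final = dict(first)  # leave the inputs untouched
    for feature, (value, confidence) in second.items():
        if feature not in first:
            # Feature only available from voice recognition: supplement.
            final[feature] = (value, confidence)
        elif confidence > first[feature][1]:
            # Shared feature with a more confident voice result: correct.
            final[feature] = (value, confidence)
    return final


face = {"gender": ("male", 0.60), "age_group": ("20-24", 0.90)}
voice = {
    "gender": ("female", 0.85),
    "age_group": ("25-29", 0.50),
    "region": ("three northeastern provinces", 0.80),
}
final = merge_results(face, voice)
```

With these inputs, gender is replaced by the more confident voice result, the age group from face recognition is kept, and the region is added as a new feature.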
In the embodiment of the invention, after the face recognition is carried out on the target image containing the target object, the voice recognition is carried out on the collected audio data of the target object, and then the face recognition result of the target object is corrected or supplemented based on the voice recognition result of the target object, so that the aim of improving the accuracy of the face recognition is fulfilled.
Example two
Fig. 2 is a flowchart of a face recognition enhancement method according to a second embodiment of the present invention. This embodiment is optimized on the basis of the foregoing embodiment and adds the operation of acquiring the audio data of the target object. As shown in fig. 2, the method includes:
S201, obtaining a target image containing a target object, and performing face recognition on the target image to obtain a first recognition result.
After the face recognition model is used for carrying out face recognition on the target image, the recognition result of at least one biological feature of the target object can be obtained, and the coordinate information (namely the position information) of the target object in the target image can also be obtained.
S202, acquiring position information of the target object in the target image and distance information between the target object and the camera.
After the position information of the target object in the target image is acquired from the first recognition result, the distance information between the target object and the camera can be calculated according to the following formula:
$$D = \frac{f \cdot H}{h}$$
wherein D is the distance between the camera lens and the target object; f denotes the focal length of the lens; h is the height of the target surface (image sensor) of the camera, which is fixed; and H is the height of the scene captured by the camera lens, which is known in advance.
Due to the characteristics of face recognition, the lenses of the cameras used are basically fixed-focus lenses, i.e. the focal length f of the camera lens is known. Therefore, the distance between the target object and the camera can be directly calculated from the values of f, h and H.
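As a worked illustration of the distance calculation, the sketch below assumes the usual pinhole relation D = f·H/h, with f the focal length, h the height of the target surface (image sensor) and H the height of the captured scene. The function name `distance_to_target` and the numeric values are hypothetical.

```python
def distance_to_target(
    focal_length_mm: float,
    scene_height_m: float,
    sensor_height_mm: float,
) -> float:
    """Distance D = f * H / h under an assumed pinhole camera model.

    f and h are given in millimetres, so their units cancel and the
    returned distance is in the same unit as H (metres here).
    """
    if sensor_height_mm <= 0:
        raise ValueError("sensor height must be positive")
    return focal_length_mm * scene_height_m / sensor_height_mm


# Illustrative values: an 8 mm fixed-focus lens, a 4.8 mm high sensor,
# and a captured scene height of 3 m give a distance of about 5 m.
distance_m = distance_to_target(8.0, 3.0, 4.8)
```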
S203, positioning the source of the audio data according to the audio data collected by the audio collector to obtain at least one sound source position.
Because the camera is provided with at least two audio collectors, the sound they collect can be localized, optionally based on methods such as distance difference and energy difference. For example, two audio collectors can determine the approximate position of a sound source in a two-dimensional plane within the camera's monitoring range; as the number of audio collectors increases, a more accurate position of the sound source can be obtained by combining the calculations. It should be noted that, since there may be multiple sound sources in a monitored scene, all the sound sources in the scene need to be localized to obtain at least one sound source position.
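One common way to use the distance difference between two audio collectors is sketched below. It assumes a far-field source, so that the difference in path length to the two collectors equals the collector spacing multiplied by the sine of the bearing angle. The embodiment does not specify its localization algorithm; the function names, the assumed speed of sound, and the sign convention are illustrative choices only.

```python
import math

SPEED_OF_SOUND_M_S = 343.0  # assumed value for air at about 20 degrees Celsius


def path_difference_from_delay(delay_s: float) -> float:
    """Distance difference corresponding to an arrival-time difference."""
    return SPEED_OF_SOUND_M_S * delay_s


def bearing_from_path_difference(path_diff_m: float, spacing_m: float) -> float:
    """Far-field bearing in degrees, measured from the broadside of the pair.

    Uses sin(theta) = path difference / collector spacing. The ratio is
    clamped to [-1, 1] so that measurement noise cannot break asin().
    """
    if spacing_m <= 0:
        raise ValueError("collector spacing must be positive")
    ratio = max(-1.0, min(1.0, path_diff_m / spacing_m))
    return math.degrees(math.asin(ratio))


# Two collectors 0.2 m apart; a 0.1 m path difference gives about 30 degrees.
bearing_deg = bearing_from_path_difference(0.1, 0.2)
```

A single pair of collectors only yields a bearing, which is consistent with the text's remark that two collectors give an approximate position and that more collectors refine it.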
S204, determining a target sound source position corresponding to the target object from at least one sound source position according to the position information and the distance information.
Since at least one sound source position is obtained in S203, to accurately acquire the audio data of the target object, a target sound source position corresponding to the target object needs to be determined from a plurality of sound source positions. Optionally, the coordinate information of the target object and the distance between the target object and the camera determined in S202 are compared with the positions of the sound sources, so as to determine the position of the target sound source corresponding to the target object.
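The comparison between the target object and the candidate sound sources could, for instance, be carried out as in the following sketch. It assumes that the horizontal pixel coordinate of the target is converted to a bearing with a pinhole model and a known horizontal field of view, that the calculated distance is treated as the straight-line range to the target, and that the nearest sound source in the resulting plane is selected. All names, the coordinate convention, and the example values are hypothetical.

```python
import math
from typing import List, Tuple

# (lateral offset, forward distance) in metres, with the camera at the origin
Point = Tuple[float, float]


def target_position(
    pixel_x: float,
    image_width: int,
    horizontal_fov_deg: float,
    distance_m: float,
) -> Point:
    """Place the target in the ground plane from its pixel column and range."""
    # Focal length expressed in pixels under the assumed pinhole model.
    focal_px = (image_width / 2) / math.tan(math.radians(horizontal_fov_deg) / 2)
    bearing = math.atan((pixel_x - image_width / 2) / focal_px)
    return (distance_m * math.sin(bearing), distance_m * math.cos(bearing))


def pick_target_source(target: Point, sources: List[Point]) -> int:
    """Return the index of the sound source closest to the target position."""
    if not sources:
        raise ValueError("at least one sound source position is required")
    return min(range(len(sources)), key=lambda i: math.dist(target, sources[i]))


# Target at the image centre, 5 m away; three located sound sources.
target = target_position(960, 1920, 90.0, 5.0)
sources = [(-2.0, 4.0), (0.3, 5.2), (3.0, 6.0)]
index = pick_target_source(target, sources)
```

In practice a maximum acceptable distance could also be applied, so that no sound source is chosen when the target object is silent; that threshold is not part of the embodiment and is left out here.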
S205, acquiring the audio data of the target object collected by the audio collector from the target sound source position.
After determining the target sound source position corresponding to the target object, the audio data of the target object may be collected from the target sound source position, and then S206 is performed to identify the audio data of the target object.
S206, performing voice recognition on the audio data of the target object to obtain a second recognition result.
S207, correcting or supplementing the first recognition result based on the second recognition result to obtain a final recognition result.
According to the embodiment of the invention, the target sound source position corresponding to the target object is determined according to the position of the target object, the distance between the target object and the camera and the position of the sound source, and the audio data collected from the target sound source position is further acquired, so that the accuracy of acquiring the audio data of the target object is ensured, and the accuracy of correcting the face recognition result based on the voice recognition result is further ensured.
Example three
Fig. 3 is a schematic structural diagram of a face recognition enhancement device in a third embodiment of the present invention. This embodiment is applicable to cases where face recognition is required, and the device may be configured on a camera provided with at least two audio collectors or on a back-end server. Referring to fig. 3, the device includes:
a face recognition module 301, configured to acquire a target image including a target object, and perform face recognition on the target image to obtain a first recognition result;
a voice recognition module 302, configured to obtain audio data of the target object collected by the audio collector, and perform voice recognition on the audio data to obtain a second recognition result;
and a result correction module 303, configured to correct or supplement the first recognition result based on the second recognition result, so as to obtain a final recognition result.
In the embodiment of the invention, after the face recognition is carried out on the target image containing the target object, the voice recognition is carried out on the collected audio data of the target object, and then the face recognition result of the target object is corrected or supplemented based on the voice recognition result of the target object, so that the aim of improving the accuracy of the face recognition is fulfilled.
On the basis of the foregoing embodiment, optionally, the voice recognition module includes:
a position and distance information acquiring unit for acquiring position information of the target object in the target image and distance information of the target object from the camera;
the first positioning unit is used for positioning the source of the audio data according to the audio data collected by the audio collector to obtain at least one sound source position;
the second positioning unit is used for determining a target sound source position corresponding to the target object from at least one sound source position according to the position information and the distance information;
and the voice acquisition unit is used for acquiring the audio data of the target object collected by the audio collector from the target sound source position.
On the basis of the above embodiment, optionally, the face recognition module is specifically configured to:
inputting the target image into a pre-trained face recognition model, and obtaining a first recognition result according to the output of the face recognition model, wherein the first recognition result comprises the recognition result of at least one biological feature and the confidence coefficient of the recognition result of each biological feature.
On the basis of the above embodiment, optionally, the voice recognition module is specifically configured to:
inputting the audio data into a pre-trained voice recognition model, and obtaining a second recognition result according to the output of the voice recognition model, wherein the second recognition result comprises a recognition result of at least one biological feature and the confidence of the recognition result of each biological feature.
On the basis of the foregoing embodiment, optionally, the result correction module includes:
a result correction unit, used for comparing at least one biological feature and its confidence in the second recognition result with at least one biological feature and its confidence in the first recognition result, and correcting or supplementing the first recognition result according to the comparison result to obtain a final recognition result.
On the basis of the foregoing embodiment, optionally, the result correction unit is specifically configured to:
for any biological feature shared by the first recognition result and the second recognition result, if the confidence coefficient of the recognition result of the biological feature in the second recognition result is greater than that of the recognition result of the biological feature in the first recognition result, replacing the recognition result of the biological feature in the first recognition result with the recognition result of the biological feature in the second recognition result;
for any biometric feature that is present in the second recognition result and is not present in the first recognition result, the recognition result of the biometric feature is supplemented to the first recognition result.
The face recognition enhancement device provided by the embodiment of the invention can execute the face recognition enhancement method provided by any embodiment of the invention, and has the corresponding functional modules and beneficial effects of the execution method.
Example four
Fig. 4 is a schematic structural diagram of an electronic device according to a fourth embodiment of the present invention. Fig. 4 shows a block diagram of an exemplary electronic device 12 suitable for implementing an embodiment of the present invention; in this embodiment, the electronic device may be a camera provided with an audio collector, or a backend server. The electronic device 12 shown in fig. 4 is only an example and should not impose any limitation on the function and scope of use of the embodiment of the present invention.
As shown in FIG. 4, electronic device 12 is embodied in the form of a general purpose computing device. The components of electronic device 12 may include, but are not limited to: one or more processors or processing units 16, a system memory 28, and a bus 18 that couples various system components including the system memory 28 and the processing unit 16.
Bus 18 represents one or more of any of several types of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, and a processor or local bus using any of a variety of bus architectures. By way of example, such architectures include, but are not limited to, Industry Standard Architecture (ISA) bus, Micro Channel Architecture (MCA) bus, Enhanced ISA bus, Video Electronics Standards Association (VESA) local bus, and Peripheral Component Interconnect (PCI) bus.
Electronic device 12 typically includes a variety of computer system readable media. Such media may be any available media that is accessible by electronic device 12 and includes both volatile and nonvolatile media, removable and non-removable media.
The system memory 28 may include computer system readable media in the form of volatile memory, such as Random Access Memory (RAM) 30 and/or cache memory 32. The electronic device 12 may further include other removable/non-removable, volatile/nonvolatile computer system storage media. By way of example only, storage system 34 may be used to read from and write to non-removable, nonvolatile magnetic media (not shown in FIG. 4, and commonly referred to as a "hard drive"). Although not shown in FIG. 4, a magnetic disk drive for reading from and writing to a removable, nonvolatile magnetic disk (e.g., a "floppy disk") and an optical disk drive for reading from or writing to a removable, nonvolatile optical disk (e.g., a CD-ROM, DVD-ROM, or other optical media) may be provided. In these cases, each drive may be connected to bus 18 by one or more data media interfaces. Memory 28 may include at least one program product having a set (e.g., at least one) of program modules that are configured to carry out the functions of embodiments of the invention.
A program/utility 40 having a set (at least one) of program modules 42 may be stored, for example, in memory 28. Such program modules 42 include, but are not limited to, an operating system, one or more application programs, other program modules, and program data; each of these examples, or some combination thereof, may include an implementation of a network environment. Program modules 42 generally carry out the functions and/or methodologies of the described embodiments of the invention.
Electronic device 12 may also communicate with one or more external devices 14 (e.g., keyboard, pointing device, display 24, etc.), with one or more devices that enable a user to interact with electronic device 12, and/or with any devices (e.g., network card, modem, etc.) that enable electronic device 12 to communicate with one or more other computing devices. Such communication may be through an input/output (I/O) interface 22. Also, the electronic device 12 may communicate with one or more networks (e.g., a Local Area Network (LAN), a Wide Area Network (WAN), and/or a public network, such as the Internet) via the network adapter 20. As shown, the network adapter 20 communicates with other modules of the electronic device 12 via the bus 18. It should be understood that although not shown in the figures, other hardware and/or software modules may be used in conjunction with electronic device 12, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage systems, among others.
The processing unit 16 executes various functional applications and data processing by running the programs stored in the system memory 28, for example, to implement the face recognition enhancement method provided by the embodiment of the present invention. The method includes:
acquiring a target image containing a target object, and performing face recognition on the target image to obtain a first recognition result;
acquiring audio data of the target object collected by an audio collector, and performing voice recognition on the audio data to obtain a second recognition result;
and based on the second recognition result, correcting or supplementing the first recognition result to obtain a final recognition result.
Example five
The fifth embodiment of the present invention further provides a computer-readable storage medium on which a computer program is stored. When executed by a processor, the computer program implements the face recognition enhancement method provided by the embodiments of the present invention, where the method includes:
acquiring a target image containing a target object, and performing face recognition on the target image to obtain a first recognition result;
acquiring audio data of the target object collected by an audio collector, and performing voice recognition on the audio data to obtain a second recognition result;
and based on the second recognition result, correcting or supplementing the first recognition result to obtain a final recognition result.
Computer storage media for embodiments of the invention may employ any combination of one or more computer-readable media. The computer readable medium may be a computer readable signal medium or a computer readable storage medium. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
A computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Computer program code for carrying out operations for aspects of the present invention may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, Smalltalk, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
It is to be noted that the foregoing is only illustrative of the preferred embodiments of the present invention and the technical principles employed. It will be understood by those skilled in the art that the present invention is not limited to the particular embodiments described herein, but is capable of various obvious changes, rearrangements and substitutions as will now become apparent to those skilled in the art without departing from the scope of the invention. Therefore, although the present invention has been described in greater detail by the above embodiments, the present invention is not limited to the above embodiments, and may include other equivalent embodiments without departing from the spirit of the present invention, and the scope of the present invention is determined by the scope of the appended claims.