CN112487246A - Method and device for identifying speakers in multi-person video - Google Patents

Method and device for identifying speakers in multi-person video

Info

Publication number
CN112487246A
CN112487246A
Authority
CN
China
Prior art keywords
image
data
face
frame
speaker
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011373431.4A
Other languages
Chinese (zh)
Inventor
陈均
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Kadoxi Technology Co ltd
Original Assignee
Shenzhen Kadoxi Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Kadoxi Technology Co ltd
Priority to CN202011373431.4A
Publication of CN112487246A
Legal status: Pending

Abstract

Translated from Chinese



The invention relates to the technical field of camera device control, and in particular to a method and device for identifying speakers in a multi-person video. The method includes: acquiring image data collected by a camera, calling a preset face recognition model to recognize each frame of the image data, and determining the position parameters of each acquired face feature within the image data to which it belongs; acquiring multi-channel audio data collected by a microphone array, and using a preset speech recognition model to determine the position parameters of the channel of audio data with the strongest vocal energy; determining the position parameters of the speaker in the image from the position parameters of that audio data; and, according to the speaker's position parameters in the image, obtaining image interception data of the speaker's face and performing pixel amplification on the image in the interception data. In this way, video picture structuring in a real-time live broadcast can be realized automatically, making the broadcast more engaging and enhancing human-computer interaction.


Description

Method and device for identifying speakers in multi-person video
Technical Field
The invention relates to the technical field of camera device control, in particular to a method and a device for identifying speakers in a multi-person video.
Background
Against the background of the rapid development of the prior art, more video and audio intelligent analysis technologies are available to produce structured video and audio data, and the fusion of this structured data with the video and audio streams can provide a more user-friendly application experience.
When audio and video data of multiple persons are displayed in the same picture, the system cannot determine which person in the current video stream is actually speaking, so it cannot automatically present the structured audio and video data. Such structured data is usually produced by manual post-processing and fusion of the recorded audio and video, which is difficult to adapt to real-time live broadcast applications.
Disclosure of Invention
In view of the above, embodiments of the present invention are proposed to provide a method and apparatus for identifying a speaker in a multi-person video that overcome or at least partially solve the above problems.
In order to solve the above problem, an embodiment of the present invention discloses a method for identifying a speaker in a multi-person video, including:
acquiring image data acquired by a camera, calling a preset face recognition model to recognize each frame of the image data, and determining the position parameter of each acquired face feature in the image data;
acquiring multi-channel audio data collected by a microphone array, and determining the position parameters of the channel of audio data with the strongest vocal energy by using a preset voice recognition model;
determining a position parameter of a speaker in the image according to the position parameter of the audio data;
and acquiring image interception data of the face of the speaker according to the position parameters of the speaker in the image, and amplifying the pixels of the image in the image interception data.
Further, the calling a preset face recognition model to recognize the image data of each frame includes:
extracting human face features in the sample image;
inputting the human face features and sample image data into a recognition network, and determining position information of a human face recognition frame and human face image information in the human face recognition frame;
intercepting the face image in the face recognition frame to obtain a face image interception frame, and inputting image data in the face image interception frame into the recognition network;
and training the face recognition frame and the face screenshot frame through the recognition network to obtain the face recognition model.
Further, the acquiring of the multi-channel audio data collected by the microphone array and the determining of the position parameters of the channel of audio data with the strongest vocal energy by using the preset speech recognition model include:
performing echo cancellation processing on each channel of acquired audio data according to a reference signal; specifically, the reference signal may be obtained from the loudspeaker or the sound card driver;
performing noise reduction and suppression on the signals not handled by echo cancellation, and obtaining recognizable voice data by applying automatic gain;
processing the human voice data in each channel of audio data with a beam forming algorithm to obtain multiple beam signals;
and performing voice recognition on each beam signal, determining the beam signal with the strongest vocal energy, and obtaining the position parameters of the audio data corresponding to that beam signal.
Further, the performing voice recognition on each beam signal respectively includes:
and respectively carrying out voice recognition on the keywords in each path of beam signal, and when detecting that the keyword information in one path of beam signal is matched with a preset keyword training result, determining that the path of beam signal is the keyword beam signal.
Further, the performing pixel amplification on the image in the image interception data includes:
acquiring pixel proportion data of an image amplification area;
calculating an amplification factor of the intercepted image amplified to the image amplification area according to pixel ratio data in the image interception data;
and carrying out pixel amplification on the image in the image interception data according to the amplification factor.
There is also provided an apparatus for identifying a speaker in a multi-person video, comprising:
the face recognition module is used for acquiring image data acquired by the camera, calling a preset face recognition model to recognize each frame of image data and determining the position parameter of each acquired face feature in the image data to which the face feature belongs;
the voice recognition module is used for acquiring multi-channel audio data collected by the microphone array and determining the position parameters of the channel of audio data with the strongest vocal energy by using a preset voice recognition model;
the position confirmation module is used for determining the position parameter of the speaker in the image according to the position parameter of the audio data;
and a pixel amplification module, configured to obtain image capture data of the speaker's face according to the position parameters of the speaker in the image, and to perform pixel amplification on the image in the image capture data.
Further, the face recognition module includes:
extracting human face features in the sample image;
inputting the human face features and sample image data into a recognition network, and determining position information of a human face recognition frame and human face image information in the human face recognition frame;
intercepting the face image in the face recognition frame to obtain a face image interception frame, and inputting image data in the face image interception frame into the recognition network;
and training a multi-convolution layer structure on the face recognition frame and the face cutout frame through the recognition network to obtain the face recognition model.
Further, the pixel amplifying module includes:
the enlarged region acquisition module is used for acquiring pixel proportion data of an image enlarged region;
the amplification factor calculation module is used for calculating the amplification factor of the intercepted image amplified to the image amplification area according to the pixel proportion data in the image interception data;
and the amplification submodule is used for carrying out pixel amplification on the image in the image interception data according to the amplification factor.
There is also provided an electronic device comprising a processor, a memory and a computer program stored on the memory and capable of running on the processor, the computer program, when executed by the processor, implementing the method of identifying a speaker in a multi-person video.
There is also provided a computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the method of identifying a speaker in a multi-person video.
The embodiment of the invention has the following advantages:
According to the method and device, all face targets in the image are located by applying face recognition technology, and the microphone array is used to determine the position of the specific speaker, so the speaker's exact position in the image can be pinpointed. The speaker's face image is then enlarged by computing an image amplification factor, so that video picture structuring in a real-time live broadcast can be realized automatically, making the broadcast more engaging and enhancing human-computer interaction.
Drawings
FIG. 1 is a flow chart illustrating steps of an embodiment of a method for identifying a speaker in a multi-person video according to the present invention;
FIG. 2 is a block diagram of an embodiment of an apparatus for identifying a speaker in a multi-person video according to the present invention;
fig. 3 is a block diagram of a computer apparatus for speaker identification in a multi-person video in accordance with the present invention.
Detailed Description
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in further detail below.
It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict.
The method for controlling the rotation of the camera based on the sound source positioning can be applied to any terminal equipment with a voice function and an image recognition function, such as terminal equipment of a smart phone, a tablet personal computer, a smart home and the like.
In the embodiment of the application, a single camera may be used, shooting in only one direction, with the microphone array correspondingly arranged as a linear array; alternatively, multiple cameras may be arranged in an annular array, with the microphones correspondingly arranged in an annular array as well.
One application scenario of the embodiment of the present application is identifying the actual speaker when multiple people appear simultaneously in the same video frame. As shown in fig. 1, a method for identifying a speaker in a multi-person video is provided, with the following specific steps:
S100, acquiring image data collected by a camera, calling a preset face recognition model to recognize each frame of the image data, and determining the position parameters of each acquired face feature in the image data;
S200, acquiring multi-channel audio data collected by a microphone array, and determining the position parameters of the channel of audio data with the strongest vocal energy by using a preset voice recognition model;
S300, determining the position parameters of the speaker in the image according to the position parameters of the audio data;
S400, acquiring image interception data of the speaker's face according to the position parameters of the speaker in the image, and performing pixel amplification on the image in the image interception data.
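The four steps above can be sketched as a single per-frame routine. Everything here is a hypothetical illustration, not detail taken from the patent: the three callables stand in for the face recognition model (S100), the speech recognition model over the microphone array (S200), and the interception-and-amplification step (S400), and the nearest-centre matching rule for S300 is an assumption.

```python
def identify_speaker_frame(frame, mic_channels,
                           detect_faces, locate_loudest_voice,
                           crop_and_zoom):
    """One pass of steps S100-S400 for a single video frame.

    detect_faces(frame)         -> list of (x, y, w, h) face boxes  (S100)
    locate_loudest_voice(chans) -> horizontal position in [0, 1]    (S200)
    crop_and_zoom(frame, box)   -> enlarged image of the face crop  (S400)
    All three callables are hypothetical stand-ins.
    """
    faces = detect_faces(frame)                      # S100
    if not faces:
        return None
    voice_pos = locate_loudest_voice(mic_channels)   # S200
    frame_w = len(frame[0])
    # S300: pick the face whose horizontal centre is nearest the voice
    speaker_box = min(
        faces,
        key=lambda b: abs((b[0] + b[2] / 2) / frame_w - voice_pos))
    return crop_and_zoom(frame, speaker_box)         # S400
```

With a stub detector returning two face boxes and the voice located at 80% of the frame width, the routine selects the right-hand face.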
In step S100, the preset face recognition model is obtained by continuously training a sample image with face features based on a convolutional neural network, and specifically includes:
extracting human face features in the sample image; this mainly involves the position information of the real face in the sample image, and the coordinate data and pixel proportion data of the face can be extracted from the sample image using an existing image feature selection tool.
Inputting the human face features and sample image data into a recognition network, and determining position information of a human face recognition frame and human face image information in the human face recognition frame;
intercepting the face image in the face recognition frame to obtain a face image interception frame, and inputting image data in the face image interception frame into the recognition network;
training a multi-convolution layer structure on the face recognition frame and the face cutout frame through the recognition network to obtain the face recognition model;
The recognition network is a convolutional neural network whose structure is not limited to convolutional layers; it also includes pooling layers, fully connected layers, and the like. Whichever structural combination is used for training, the purpose is the same: by inputting image data with face features into the face recognition model of this embodiment, obtain the position information of the face in the image and the image data of the face capture frame.
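The relation between a face recognition frame and the face image interception frame derived from it can be illustrated with plain coordinate arithmetic. The 20% padding ratio and the clamping behaviour below are illustrative assumptions of this sketch, not values specified by the patent:

```python
def interception_frame(box, img_w, img_h, pad=0.2):
    """Expand a face recognition frame (x, y, w, h) by a padding
    ratio to form the face image interception frame, clamped to the
    image bounds. The 20% padding is a hypothetical choice."""
    x, y, w, h = box
    dx, dy = int(w * pad), int(h * pad)
    x0 = max(0, x - dx)
    y0 = max(0, y - dy)
    x1 = min(img_w, x + w + dx)
    y1 = min(img_h, y + h + dy)
    return (x0, y0, x1 - x0, y1 - y0)
```

A box in the middle of the image simply grows by the padding; a box at the image edge is clipped so the interception frame never leaves the picture.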
In step S200, image data acquisition and audio data acquisition are performed synchronously. The image data can be quickly identified and located based on the face recognition model, while the audio data requires preprocessing before recognition. Specifically, the preprocessing includes:
performing echo cancellation processing on each channel of acquired audio data according to a reference signal; specifically, the reference signal may be obtained from the loudspeaker or the sound card driver;
and performing noise reduction and suppression on the signals not handled by echo cancellation, and obtaining recognizable voice data by applying automatic gain. Here the sound frequency of the human voice is taken to be 20 Hz to 20 kHz.
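As an illustration of the automatic gain step, here is a toy gain control that scales one block of samples toward a target peak level. The target peak and gain cap are hypothetical values, and a real implementation would smooth the gain across blocks rather than compute it per block:

```python
def auto_gain(samples, target_peak=0.9, max_gain=10.0):
    """Scale one block of audio samples so its peak approaches
    target_peak, capping the gain so near-silence is not blown up.
    Samples are floats in [-1.0, 1.0]; both thresholds are
    illustrative, not taken from the patent."""
    peak = max((abs(s) for s in samples), default=0.0)
    if peak == 0.0:
        return list(samples)
    gain = min(target_peak / peak, max_gain)
    return [s * gain for s in samples]
```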
After preprocessing the collected audio data, obtaining processed voice data, wherein the identification process of the audio data is as follows:
processing the human voice data in each channel of audio data with a beam forming algorithm to obtain multiple beam signals; beam forming applies time delay or phase compensation and amplitude weighting to the audio signal output by each microphone in the array, forming a beam that points in a specific direction;
and performing voice recognition on each beam signal, determining the beam signal with the strongest vocal energy, and obtaining the position parameters of the audio data corresponding to that beam signal.
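A toy delay-and-sum beamformer over integer sample delays, followed by picking the beam with the greatest energy, illustrates these two steps. The steering delays and signals below are synthetic; a real system would use fractional delays, voice-band filtering, and per-beam speech detection rather than raw energy:

```python
def delay_and_sum(channels, delays):
    """Align each microphone channel by its integer sample delay and
    average them, forming one beam signal (delay-and-sum beamforming)."""
    n = len(channels[0])
    beam = []
    for t in range(n):
        acc = 0.0
        for ch, d in zip(channels, delays):
            idx = t - d
            acc += ch[idx] if 0 <= idx < n else 0.0
        beam.append(acc / len(channels))
    return beam

def strongest_beam(channels, steering):
    """Form one beam per steering direction (each a list of
    per-channel delays) and return the index of the beam with the
    highest energy, i.e. the direction of the strongest signal."""
    energies = [
        sum(s * s for s in delay_and_sum(channels, d)) for d in steering
    ]
    return max(range(len(energies)), key=energies.__getitem__)
```

With a source that reaches the second microphone one sample late, the steering that advances that channel by one sample aligns the signals and wins on energy.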
In an embodiment, the beam signals further include keyword information, and the performing speech recognition on each beam signal respectively further includes:
and respectively carrying out voice recognition on the keywords in each path of beam signal, and when detecting that the keyword information in one path of beam signal is matched with a preset keyword training result, determining that the path of beam signal is the keyword beam signal.
In the above embodiment, when the keyword training result of the speech recognition model detects matching keyword information in one channel of the preprocessed voice data, the position parameters of that channel of audio data are treated as the position parameters of the audio data with the strongest vocal energy, and the position parameters of the audio data carrying the keyword information are used as the reference parameters for subsequent positioning.
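The keyword-matching selection can be sketched as follows. The per-beam transcripts are assumed to come from a hypothetical speech recognizer, and plain substring matching stands in for matching against the keyword training result:

```python
def keyword_beam(recognized_texts, keywords):
    """Given hypothetical speech-recognition transcripts, one per
    beam signal, return the index of the first beam whose transcript
    contains a preset keyword, or None if no beam matches."""
    for i, text in enumerate(recognized_texts):
        if any(kw in text for kw in keywords):
            return i
    return None
```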
After the position parameters of the channel of audio data with the strongest vocal energy are determined, the angle and direction of that audio data are found through the beam forming algorithm, so that the microphone in the array closest to the source is identified; obtaining that microphone's position parameters then gives the correspondence between the real speaker and the microphone.
Specifically, suppose 4 microphones are arranged in a linear array with a 45-degree angle between adjacent microphones, and each microphone corresponds to exactly one person, so that 4 face recognition frames should be recognized in the image. If every person appears to be in a speaking state, the system cannot identify the actual speaker through face recognition alone. But once the position parameters of the audio data with the strongest vocal energy are obtained, the system can locate which specific microphone received the strongest vocal energy and, combining this with the face position parameters, determine the actual speaker's face recognition frame.
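The 4-microphone example can be sketched directly: each microphone index maps to a horizontal region of the image, and the speaker is the face frame nearest that region. The even-split mapping from microphones to image regions is an illustrative assumption, not a geometry the patent specifies:

```python
def speaker_face(face_boxes, mic_index, mic_count, img_w):
    """Pick the face recognition frame (x, y, w, h) whose horizontal
    centre lies closest to the image region assigned to the strongest
    microphone. Hypothetically assumes the mic_count microphones
    evenly cover the image width from left to right."""
    region_centre = (mic_index + 0.5) * img_w / mic_count
    return min(
        face_boxes,
        key=lambda b: abs((b[0] + b[2] / 2) - region_centre))
```

With four evenly spaced faces across a 400-pixel frame, microphone index 2 selects the third face from the left.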
In step S400, the acquired image data includes an image enlargement area, that is, the identified image is enlarged in the specified image enlargement area, specifically, the method includes:
acquiring pixel proportion data of an image amplification area;
calculating an amplification factor of the intercepted image amplified to the image amplification area according to pixel ratio data in the image interception data;
and carrying out pixel amplification on the image in the image interception data according to the amplification factor.
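The amplification-factor computation can be illustrated as a ratio of pixel dimensions followed by a nearest-neighbour enlargement. Using an integer factor limited by the tighter dimension, so the enlarged crop still fits the image enlargement area, is an assumption of this sketch:

```python
def amplification_factor(crop_w, crop_h, area_w, area_h):
    """Largest integer factor by which the intercepted image can be
    enlarged while still fitting inside the image enlargement area."""
    return max(1, min(area_w // crop_w, area_h // crop_h))

def pixel_enlarge(pixels, factor):
    """Nearest-neighbour pixel amplification: each source pixel
    becomes a factor x factor block in the output grid."""
    out = []
    for row in pixels:
        wide = [p for p in row for _ in range(factor)]
        out.extend([list(wide) for _ in range(factor)])
    return out
```

For example, a 20x20 crop enlarged into a 100x60 area gets factor 3, limited by the 60-pixel height.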
In this way, the actual speaker is enlarged and displayed within the multi-person video picture, which improves interactivity in multi-person video and broadens its range of applications.
As shown in fig. 2, an apparatus for recognizing a speaker in a multi-person video is further provided in an embodiment of the present application, including:
the face recognition module 100 is configured to acquire image data collected by a camera, call a preset face recognition model to recognize each frame of the image data, and determine the position parameters of each acquired face feature in the image data to which it belongs;
the voice recognition module 200 is configured to acquire multi-channel audio data collected by the microphone array and to determine the position parameters of the channel of audio data with the strongest vocal energy by using a preset voice recognition model;
the position confirmation module 300 is configured to determine the position parameters of the speaker in the image according to the position parameters of the audio data;
and the pixel amplification module 400 is configured to obtain image capture data of the speaker's face according to the position parameters of the speaker in the image, and to perform pixel amplification on the image in the image capture data.
In one embodiment, the face recognition module 100 includes:
extracting human face features in the sample image;
inputting the human face features and sample image data into a recognition network, and determining position information of a human face recognition frame and human face image information in the human face recognition frame;
intercepting the face image in the face recognition frame to obtain a face image interception frame, and inputting image data in the face image interception frame into the recognition network;
and training a multi-convolution layer structure on the face recognition frame and the face cutout frame through the recognition network to obtain the face recognition model.
In one embodiment, the pixel amplification module 400 includes:
the enlarged region acquisition module is used for acquiring pixel proportion data of an image enlarged region;
the amplification factor calculation module is used for calculating the amplification factor of the intercepted image amplified to the image amplification area according to the pixel proportion data in the image interception data;
and the amplification submodule is used for performing pixel amplification on the image in the image interception data according to the amplification factor.
It should be noted that, for simplicity of description, the method embodiments are described as a series of acts or combination of acts, but those skilled in the art will recognize that the present invention is not limited by the illustrated order of acts, as some steps may occur in other orders or concurrently in accordance with the embodiments of the present invention. Further, those skilled in the art will appreciate that the embodiments described in the specification are presently preferred and that no particular act is required to implement the invention.
For the device embodiment, since it is basically similar to the method embodiment, the description is simple, and for the relevant points, refer to the partial description of the method embodiment.
Referring to fig. 3, a computer device for identifying a speaker in a multi-person video according to the present invention is shown, which may specifically include the following:
In an embodiment of the present invention, a computer device is further provided. The computer device 12 is represented as a general-purpose computing device, and its components may include but are not limited to: one or more processors or processing units 16, a system memory 28, and a bus 18 that couples various system components including the system memory 28 and the processing unit 16.
Bus 18 represents one or more of any of several types of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, and a processor or local bus using any of a variety of bus architectures. By way of example, such architectures include, but are not limited to, the Industry Standard Architecture (ISA) bus, the Micro Channel Architecture (MCA) bus, the Enhanced ISA bus, the Video Electronics Standards Association (VESA) local bus, and the Peripheral Component Interconnect (PCI) bus.
Computer device 12 typically includes a variety of computer system readable media. Such media may be any available media that is accessible by computer device 12, and includes both volatile and nonvolatile media, removable and non-removable media.
The system memory 28 may include computer system readable media in the form of volatile memory, such as random access memory (RAM) 31 and/or cache memory 32. Computer device 12 may further include other removable/non-removable, volatile/nonvolatile computer system storage media. By way of example only, storage system 34 may be used to read from and write to non-removable, nonvolatile magnetic media (commonly referred to as a "hard drive"). Although not shown in FIG. 3, a magnetic disk drive for reading from and writing to a removable, nonvolatile magnetic disk (e.g., a "floppy disk") and an optical disk drive for reading from or writing to a removable, nonvolatile optical disk (e.g., a CD-ROM, DVD-ROM, or other optical media) may be provided. In these cases, each drive may be connected to bus 18 by one or more data media interfaces. The memory may include at least one program product having a set (e.g., at least one) of program modules 42 configured to carry out the functions of embodiments of the invention.
A program/utility 41 having a set (at least one) of program modules 42 may be stored, for example, in memory. Such program modules 42 include, but are not limited to, an operating system, one or more application programs, other program modules 42, and program data; each of these examples, or some combination thereof, may comprise an implementation of a network environment. Program modules 42 generally carry out the functions and/or methodologies of the described embodiments of the invention.
Computer device 12 may also communicate with one or more external devices 14 (e.g., keyboard, pointing device, display 24, camera, etc.), with one or more devices that enable a user to interact with computer device 12, and/or with any devices (e.g., network card, modem, etc.) that enable computer device 12 to communicate with one or more other computing devices. Such communication may occur through an input/output (I/O) interface 22. Also, computer device 12 may communicate with one or more networks (e.g., a local area network (LAN), a wide area network (WAN), and/or a public network such as the Internet) via network adapter 20. As shown, the network adapter 20 communicates with the other modules of computer device 12 via bus 18. It should be understood that although not shown in the figures, other hardware and/or software modules may be used in conjunction with computer device 12, including but not limited to: microcode, device drivers, redundant processing units 16, external disk drive arrays, RAID systems, tape drives, data backup storage systems 34, etc.
The processing unit 16 executes various functional applications and data processing by running programs stored in the system memory 28, for example implementing the method for speaker recognition in a multi-person video provided by an embodiment of the present invention.
That is, when executing the program, the processing unit 16 implements: acquiring image data collected by a camera, calling a preset face recognition model to recognize each frame of the image data, and determining the position parameters of each acquired face feature in the image data; acquiring multi-channel audio data collected by a microphone array, and determining the position parameters of the channel of audio data with the strongest vocal energy by using a preset voice recognition model; determining the position parameters of the speaker in the image according to the position parameters of the audio data; and obtaining image interception data of the speaker's face according to the position parameters of the speaker in the image, and performing pixel amplification on the image in the image interception data.
In an embodiment of the present invention, the present invention further provides a computer-readable storage medium, on which a computer program is stored, which when executed by a processor, implements a method for identifying a speaker in a multi-person video as provided in all embodiments of the present application.
That is, when executed by the processor, the program implements: acquiring image data collected by a camera, calling a preset face recognition model to recognize each frame of the image data, and determining the position parameters of each acquired face feature in the image data; acquiring multi-channel audio data collected by a microphone array, and determining the position parameters of the channel of audio data with the strongest vocal energy by using a preset voice recognition model; determining the position parameters of the speaker in the image according to the position parameters of the audio data; and obtaining image interception data of the speaker's face according to the position parameters of the speaker in the image, and performing pixel amplification on the image in the image interception data.
Any combination of one or more computer-readable media may be employed. The computer-readable medium may be a computer-readable storage medium or a computer-readable signal medium. A computer-readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the computer-readable storage medium include the following: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer-readable storage medium may be any tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device.
A computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Computer program code for carrying out operations for aspects of the present invention may be written in any combination of one or more programming languages, including object-oriented programming languages such as Java, Smalltalk, and C++, and conventional procedural programming languages such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the latter case, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
The embodiments in this specification are described in a progressive manner; each embodiment focuses on its differences from the other embodiments, and for the parts that are the same or similar among the embodiments, reference may be made to one another.
As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, apparatus, or computer program product. Accordingly, embodiments of the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, embodiments of the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
Embodiments of the present invention are described with reference to flowchart illustrations and/or block diagrams of methods, terminal devices (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing terminal to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing terminal, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing terminal to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing terminal to cause a series of operational steps to be performed on the computer or other programmable terminal to produce a computer implemented process such that the instructions which execute on the computer or other programmable terminal provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While preferred embodiments of the present invention have been described, additional variations and modifications of these embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including preferred embodiments and all such alterations and modifications as fall within the scope of the embodiments of the invention.
Finally, it should also be noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or terminal that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or terminal. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other like elements in a process, method, article, or terminal that comprises the element.
The method for identifying speakers in a multi-person video provided by the present invention has been described in detail above. Specific examples are used herein to explain the principle and implementation of the invention, and the description of the above examples is only intended to help understand the method and its core idea. Meanwhile, a person skilled in the art may, based on the idea of the present invention, make changes to the specific embodiments and the application scope. In summary, the content of this specification should not be construed as limiting the present invention.

Claims (10)

1. A method for identifying a speaker in a multi-person video, comprising:
acquiring image data collected by a camera, calling a preset face recognition model to recognize each frame of the image data, and determining the position parameter, within the image data to which it belongs, of each acquired face feature;
acquiring multi-channel audio data collected by a microphone array, and determining, with a preset speech recognition model, the position parameter of the channel of audio data with the strongest human-voice energy;
determining a position parameter of a speaker in the image according to the position parameter of the audio data;
and acquiring image interception data of the speaker's face according to the position parameter of the speaker in the image, and performing pixel amplification on the image in the image interception data.
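The four steps of claim 1 amount to: detect faces in each frame, localize the loudest voice with the microphone array, map that position onto the image to pick the speaking face, then crop and magnify it. The last two steps can be sketched in plain Python; the box format, the estimated speaker position, and the toy frame are all illustrative assumptions, and a real system would use a trained face detector and a calibrated array-to-image mapping.

```python
def select_speaker_box(face_boxes, speaker_x):
    """Pick the detected face whose horizontal centre lies closest to the
    speaker position estimated from the microphone array.
    Boxes are (x, y, w, h) tuples; all values here are hypothetical."""
    return min(face_boxes, key=lambda b: abs((b[0] + b[2] / 2.0) - speaker_x))

def crop_and_enlarge(frame, box, factor):
    """Crop the speaker's face region and enlarge it by integer
    nearest-neighbour pixel replication (the 'pixel amplification' step)."""
    x, y, w, h = box
    crop = [row[x:x + w] for row in frame[y:y + h]]
    enlarged = []
    for row in crop:
        wide = [px for px in row for _ in range(factor)]   # widen each pixel
        enlarged.extend([wide[:] for _ in range(factor)])  # replicate each row
    return enlarged

# toy 8x8 grayscale frame and two hypothetical face boxes
frame = [[r * 8 + c for c in range(8)] for r in range(8)]
boxes = [(0, 0, 2, 2), (5, 5, 2, 2)]
box = select_speaker_box(boxes, speaker_x=6.0)  # audio points to the right
big = crop_and_enlarge(frame, box, factor=3)
print(box, len(big), len(big[0]))  # prints: (5, 5, 2, 2) 6 6
```

Nearest-neighbour replication is the simplest magnification that matches the claim's "pixel amplification" wording; an implementation could equally use bilinear or bicubic interpolation.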
2. The method according to claim 1, wherein invoking the preset face recognition model to recognize each frame of the image data comprises:
extracting face features from a sample image;
inputting the face features and sample image data into a recognition network, and determining position information of a face recognition frame and face image information within the face recognition frame;
intercepting the face image within the face recognition frame to obtain a face image interception frame, and inputting the image data in the face image interception frame into the recognition network;
and training the face recognition frame and the face image interception frame through the recognition network to obtain the face recognition model.
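Claim 2 derives a face image interception frame from the face recognition frame before feeding it back into the network. One plausible way to derive such a box, sketched below, is to pad the recognition box by a ratio and clamp it to the frame; the 20% pad and the function name are assumptions, not taken from the patent.

```python
def interception_box(det_box, frame_w, frame_h, pad=0.2):
    """Derive a face image interception (screenshot) box from a face
    recognition box (x, y, w, h): pad it by a ratio and clamp to the frame.
    The pad ratio is an assumed value for illustration."""
    x, y, w, h = det_box
    dx, dy = int(w * pad), int(h * pad)
    nx, ny = max(0, x - dx), max(0, y - dy)
    nw = min(frame_w, x + w + dx) - nx
    nh = min(frame_h, y + h + dy) - ny
    return (nx, ny, nw, nh)

# a box near the top-left corner gets clamped at the frame edge
print(interception_box((10, 10, 50, 50), 640, 480))  # prints: (0, 0, 70, 70)
```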
3. The method of claim 1, wherein acquiring the multi-channel audio data collected by the microphone array and determining, with a preset speech recognition model, the position parameter of the channel of audio data with the strongest human-voice energy comprises:
performing echo cancellation processing on each channel of acquired audio data according to a reference signal, wherein the reference signal may be obtained from a loudspeaker or a sound-card driver;
performing noise-reduction suppression on the signal components not removed by the echo cancellation, and obtaining recognizable voice data by applying automatic gain;
processing the voice data in each channel of audio data with a beamforming algorithm to obtain multiple beam signals;
and performing speech recognition on each beam signal separately, determining the beam signal with the strongest human-voice energy, and obtaining the position parameter of the audio data corresponding to that beam signal.
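The beamforming-and-selection steps above can be illustrated with a crude time-domain delay-and-sum beamformer; plain signal energy stands in here for the claim's speech-recognition-based "strongest human voice" test, and the delay values and signals are made-up examples.

```python
import math

def delay_and_sum(mics, delays):
    """Steer one beam by delaying each microphone signal (in whole samples)
    and averaging: a minimal time-domain delay-and-sum beamformer."""
    n = len(mics[0])
    beam = [0.0] * n
    for sig, d in zip(mics, delays):
        for i in range(n):
            if 0 <= i - d < n:
                beam[i] += sig[i - d]
    return [v / len(mics) for v in beam]

def strongest_beam(beams):
    """Index of the beam with the highest energy; a stand-in for the
    claim's voice-recognition-based strongest-voice test."""
    energies = [sum(v * v for v in b) for b in beams]
    return energies.index(max(energies))

# a voice reaching the second microphone 8 samples late (half a period)
s = [math.sin(2 * math.pi * k / 16) for k in range(64)]
mics = [s, [0.0] * 8 + s[:-8]]
beams = [delay_and_sum(mics, d) for d in ([0, 0], [8, 0])]
print(strongest_beam(beams))  # prints: 1  (the steered beam aligns the voice)
```

Because each steering-delay set corresponds to a direction of arrival, the index of the strongest beam is exactly the "position parameter" the claim extracts.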
4. The method of claim 1, wherein the performing speech recognition on each beam signal separately comprises:
performing keyword recognition on each beam signal, and when the keyword information in one beam signal is detected to match a preset keyword training result, determining that beam signal to be the keyword beam signal.
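Once each beam has been passed through a recognizer, the keyword-matching step of claim 4 reduces to a lookup over the per-beam transcripts. A sketch under the assumption that the transcripts already exist (the speech recognizer itself is out of scope here, and the function name and sample phrases are invented):

```python
def keyword_beam(transcripts, keywords):
    """Return the index of the first beam whose recognized transcript
    contains a preset keyword, else None. The recognizer producing the
    transcripts is assumed, not implemented."""
    for i, text in enumerate(transcripts):
        if any(k in text for k in keywords):
            return i
    return None

print(keyword_beam(["good morning everyone",
                    "please zoom in on the speaker"],
                   ["zoom", "camera"]))  # prints: 1
```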
5. The method of claim 1, wherein the performing pixel amplification on the image in the image interception data comprises:
acquiring pixel proportion data of an image amplification area;
calculating, according to the pixel proportion data in the image interception data, an amplification factor for enlarging the intercepted image to the image amplification area;
and performing pixel amplification on the image in the image interception data according to the amplification factor.
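The amplification factor in claim 5 is the ratio between the target amplification area and the intercepted image. A minimal sketch follows; keeping the aspect ratio by taking the smaller per-axis ratio is an assumption, not something the claim states, and the sizes used are hypothetical.

```python
def amplification_factor(crop_size, region_size):
    """Factor by which the intercepted face image must be scaled to fit
    the image amplification area. Taking min() of the per-axis ratios
    preserves aspect ratio (an assumed design choice)."""
    cw, ch = crop_size    # width/height of the intercepted image, in pixels
    rw, rh = region_size  # width/height of the amplification area, in pixels
    return min(rw / cw, rh / ch)

# a 160x120 face crop enlarged to fill a 640x480 display region
print(amplification_factor((160, 120), (640, 480)))  # prints: 4.0
```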
6. An apparatus for identifying a speaker in a multi-person video, comprising:
the face recognition module is used for acquiring image data acquired by the camera, calling a preset face recognition model to recognize each frame of image data and determining the position parameter of each acquired face feature in the image data to which the face feature belongs;
the speech recognition module is used for acquiring the multi-channel audio data collected by the microphone array and determining, with a preset speech recognition model, the position parameter of the channel of audio data with the strongest human-voice energy;
the position confirmation module is used for determining the position parameter of the speaker in the image according to the position parameter of the audio data;
and the pixel amplification module is used for acquiring image interception data of the face of the speaker according to the position parameters of the speaker in the image and amplifying the pixels of the image in the image interception data.
7. The apparatus of claim 6, wherein the face recognition module is configured for:
extracting human face features in the sample image;
inputting the human face features and sample image data into a recognition network, and determining position information of a human face recognition frame and human face image information in the human face recognition frame;
intercepting the face image in the face recognition frame to obtain a face image interception frame, and inputting image data in the face image interception frame into the recognition network;
and training a multi-convolution-layer structure on the face recognition frame and the face image interception frame through the recognition network to obtain the face recognition model.
8. The apparatus of claim 6, wherein the pixel amplification module comprises:
the enlarged region acquisition module is used for acquiring pixel proportion data of an image enlarged region;
the amplification factor calculation module is used for calculating the amplification factor of the intercepted image amplified to the image amplification area according to the pixel proportion data in the image interception data;
and the amplification submodule is used for carrying out pixel amplification on the image in the image interception data according to the amplification factor.
9. An electronic device, comprising a processor, a memory, and a computer program stored in the memory and executable on the processor, wherein the computer program, when executed by the processor, implements the method for identifying a speaker in a multi-person video according to any one of claims 1 to 5.
10. A computer-readable storage medium, on which a computer program is stored, wherein the computer program, when executed by a processor, implements the method for identifying a speaker in a multi-person video according to any one of claims 1 to 5.
CN202011373431.4A (filed 2020-11-30): Method and device for identifying speakers in multi-person video, published as CN112487246A (pending)

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN202011373431.4A | 2020-11-30 | 2020-11-30 | Method and device for identifying speakers in multi-person video

Applications Claiming Priority (1)

Application Number | Priority Date | Filing Date | Title
CN202011373431.4A | 2020-11-30 | 2020-11-30 | Method and device for identifying speakers in multi-person video

Publications (1)

Publication Number | Publication Date
CN112487246A | 2021-03-12

Family

Family ID: 74937375

Family Applications (1)

Application Number | Publication | Priority Date | Filing Date
CN202011373431.4A (pending) | CN112487246A | 2020-11-30 | 2020-11-30

Country Status (1)

Country | Link
CN | CN112487246A

Cited By (6)

* Cited by examiner, † Cited by third party

Publication number | Priority date | Publication date | Assignee | Title
CN113301372A* | 2021-05-20 | 2021-08-24 | 广州繁星互娱信息科技有限公司 | Live broadcast method, device, terminal and storage medium
CN114363553A* | 2021-12-17 | 2022-04-15 | 上海理想信息产业(集团)有限公司 | Dynamic code stream processing method and device in video conference
CN114594892A* | 2022-01-29 | 2022-06-07 | 深圳壹秘科技有限公司 | Remote interaction method, remote interaction device and computer storage medium
CN115019337A* | 2022-04-26 | 2022-09-06 | 浙江华创视讯科技有限公司 | Positioning method, positioning device, video conference system, electronic device and storage medium
US20220415003A1* | 2021-06-27 | 2022-12-29 | Realtek Semiconductor Corp. | Video processing method and associated system on chip
CN115701098A* | 2021-07-23 | 2023-02-07 | 海信集团控股股份有限公司 | Display method and device for multi-person picture

Citations (5)

* Cited by examiner, † Cited by third party

Publication number | Priority date | Publication date | Assignee | Title
CN101394679A* | 2007-09-17 | 2009-03-25 | 深圳富泰宏精密工业有限公司 | Sound source positioning system and method
CN103841357A* | 2012-11-21 | 2014-06-04 | 中兴通讯股份有限公司 | Microphone array sound source positioning method, device and system based on video tracking
US20150088515A1* | 2013-09-25 | 2015-03-26 | Lenovo (Singapore) Pte. Ltd. | Primary speaker identification from audio and video data
CN108737719A* | 2018-04-04 | 2018-11-02 | 深圳市冠旭电子股份有限公司 | Camera filming control method, device, smart machine and storage medium
CN109257559A* | 2018-09-28 | 2019-01-22 | 苏州科达科技股份有限公司 | A kind of image display method, device and the video conferencing system of panoramic video meeting


Cited By (7)

* Cited by examiner, † Cited by third party

Publication number | Priority date | Publication date | Assignee | Title
CN113301372A* | 2021-05-20 | 2021-08-24 | 广州繁星互娱信息科技有限公司 | Live broadcast method, device, terminal and storage medium
US20220415003A1* | 2021-06-27 | 2022-12-29 | Realtek Semiconductor Corp. | Video processing method and associated system on chip
CN115701098A* | 2021-07-23 | 2023-02-07 | 海信集团控股股份有限公司 | Display method and device for multi-person picture
CN114363553A* | 2021-12-17 | 2022-04-15 | 上海理想信息产业(集团)有限公司 | Dynamic code stream processing method and device in video conference
CN114594892A* | 2022-01-29 | 2022-06-07 | 深圳壹秘科技有限公司 | Remote interaction method, remote interaction device and computer storage medium
CN114594892B* | 2022-01-29 | 2023-11-24 | 深圳壹秘科技有限公司 | Remote interaction method, remote interaction device, and computer storage medium
CN115019337A* | 2022-04-26 | 2022-09-06 | 浙江华创视讯科技有限公司 | Positioning method, positioning device, video conference system, electronic device and storage medium

Similar Documents

Publication | Title
CN112487246A | Method and device for identifying speakers in multi-person video
Zmolikova et al. | Neural target speech extraction: An overview
US20240365081A1 | System and method for assisting selective hearing
CN112088402B | Federated neural network for speaker recognition
US10878824B2 | Speech-to-text generation using video-speech matching from a primary speaker
JP6464449B2 | Sound source separation apparatus and sound source separation method
US20230164509A1 | System and method for headphone equalization and room adjustment for binaural playback in augmented reality
US11513762B2 | Controlling sounds of individual objects in a video
CN111091845A | Audio processing method, device, terminal device and computer storage medium
CN112492207B | Method and device for controlling camera to rotate based on sound source positioning
WO2021000498A1 | Composite speech recognition method, device, equipment, and computer-readable storage medium
Yu et al. | Audio-visual multi-channel integration and recognition of overlapped speech
US20240428816A1 | Audio-visual hearing aid
WO2021120190A1 | Data processing method and apparatus, electronic device, and storage medium
JP7400364B2 | Speech recognition system and information processing method
CN110765868A | Lip reading model generation method, device, equipment and storage medium
CN112466306B | Method, device, computer equipment and storage medium for generating meeting minutes
Cabañas-Molero et al. | Multimodal speaker diarization for meetings using volume-evaluated SRP-PHAT and video analysis
CN111986680A | Method and device for evaluating spoken language of object, storage medium and electronic device
US20240170004A1 | Context aware audio processing
Vo et al. | Multimodal learning interfaces
CN118658487A | Smart glasses control method, smart glasses, storage medium and program product
CN115038014B | Audio signal processing method and device, electronic equipment and storage medium
WO2024158629A1 | Guided speech-enhancement networks
Abel et al. | Cognitively inspired audiovisual speech filtering: towards an intelligent, fuzzy based, multimodal, two-stage speech enhancement system

Legal Events

Date | Code | Title | Description
| PB01 | Publication |
| SE01 | Entry into force of request for substantive examination |
| RJ01 | Rejection of invention patent application after publication | Application publication date: 2021-03-12
