Disclosure of Invention
In view of the above, the present application provides a video call voice processing method, a communication terminal, and a readable storage medium, to solve the problem that, in existing voice call scenarios, the input gain of the microphone cannot be adjusted according to the other party's situation.
The application provides a video call voice processing method, which comprises the following steps:
identifying a target person in a current video image;
acquiring listening parameters of the target person, wherein the listening parameters identify the sound intensity heard by the target person and comprise at least one of: the position of the target person in the current video image, an action of the target person, and the voice of the target person; and
adjusting the gain of a sound collection device according to the listening parameters, and transmitting the voice signal collected by the adjusted sound collection device to a target terminal.
Optionally, the acquiring of the listening parameters of the target person includes:
setting priorities for the various types of listening parameters;
when the sound intensities identified by a plurality of listening parameters conflict with one another, selecting the listening parameter with the highest priority and discarding those with lower priority; and
when the sound intensities identified by the plurality of listening parameters do not conflict, performing the step of adjusting the gain of the sound collection device according to the listening parameters.
Optionally, the acquiring of the listening parameters of the target person includes:
acquiring a correspondence between shooting focal length and position for the target terminal; and
acquiring the shooting focal length used by the target terminal when the current video image was captured, and obtaining the position of the target person in the current video image according to the correspondence.
Optionally, the video call voice processing method further includes:
detecting whether the target person remains displayed in the current video image; and
when the target person disappears from the current video image, acquiring the current voice of the target person and determining the position of the target person in the current video image according to that voice.
Optionally, the position of the target person in the video image includes: the face of the target person being located in a center region, a left half, or a right half of the video image;
when the listening parameter is the position of the target person in the video image, the adjusting of the gain of the sound collection device according to the listening parameter includes:
increasing the gain of the left channel of the sound collection device when the face of the target person is located in the left half of the video image; increasing the gain of the right channel when the face is located in the right half; and keeping the gains of both channels unchanged when the face is located in the center region.
Optionally, the action of the target person includes: the target person turning an ear toward the target terminal;
when the listening parameter is the action of the target person, the adjusting of the gain of the sound collection device according to the listening parameter includes: increasing the gain of the sound collection device.
Optionally, the voice of the target person includes: a speech segment indicating the loudness of the heard sound;
when the listening parameter is the voice of the target person, the adjusting of the gain of the sound collection device according to the listening parameter includes:
increasing the gain of the sound collection device when a speech segment indicating that the sound is low is acquired; and decreasing the gain of the sound collection device when a speech segment indicating that the sound is loud is acquired.
The application provides a communication terminal, which includes an application processor, a digital signal processor, and a sound collection device, wherein:
the application processor is configured to acquire a current video image;
the digital signal processor is configured to identify a target person in the current video image and to acquire listening parameters of the target person, wherein the listening parameters identify the sound intensity heard by the target person and include at least one of the position of the target person in the current video image, an action of the target person, and the voice of the target person; and
the application processor is further configured to adjust the gain of the sound collection device according to the listening parameters and to transmit the voice signal collected by the adjusted sound collection device to a target terminal.
Optionally, the application processor is further configured to set priorities for the various types of listening parameters;
when the sound intensities identified by a plurality of listening parameters conflict with one another, to select the listening parameter with the highest priority and discard those with lower priority; and
when the sound intensities identified by the plurality of listening parameters do not conflict, to adjust the gain of the sound collection device according to the listening parameters.
The application provides a readable storage medium storing a program, the program being executable by a processor to perform one or more steps of any of the above video call voice processing methods.
According to the method and device of the present application, the gain of the sound collection device is adjusted according to the listening parameters of the target person in the video image, the listening parameters comprising at least one of the position of the target person in the current video image, an action of the target person, and the voice of the target person. Because the listening parameters identify the other party's real-time feedback on the call volume in a voice call scenario, the input gain of the local microphone can be adjusted according to the other party's situation, providing a high-quality voice call service.
Detailed Description
The technical solutions in the embodiments of the present application are clearly and completely described below with reference to the accompanying drawings, and it is obvious that the described embodiments are only a part of the embodiments, and not all of them. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application. The following embodiments and their technical features may be combined with each other without conflict.
It should be noted that, in the description, step numbers such as S11 and S12 are used for the purpose of more clearly and briefly describing the corresponding contents, and do not constitute a substantial limitation on the sequence, and those skilled in the art may perform S12 first and then S11 in specific implementation, but these should be within the protection scope of the present application.
Fig. 1 is a flowchart illustrating a video call voice processing method according to an embodiment of the present application. The video call voice processing method may be applied to mobile Internet devices (MIDs) such as smartphones (Android phones, iOS phones, and the like), tablet computers, PDAs (personal digital assistants), and learning machines, or to wearable devices with an audio/video call function that can be worn on human limbs or artificial limbs, or embedded in clothing, jewelry, or accessories; the embodiments of the present application are not limited in this respect.
In an audio/video call scenario, the subject performing the steps of the method may be the device of either party to the call: when the performing device is the calling device, the called device is the target terminal; when the performing device is the called device, the calling device is the target terminal. For convenience of distinction and description, the device performing the steps is referred to herein as the subject terminal.
Referring to FIG. 1, the method for processing video call voice may include steps S11-S13.
S11: a target person in the current video image is identified.
The current video image is the image displayed on the subject terminal in real time, and the target person is the person displayed in that image in real time, i.e., the other party to the call. The target terminal captures the target person and his or her surroundings through its camera (front or rear), forms the image, and transmits it to the subject terminal; the subject terminal can then recognize the target person in the image through face recognition, human body detection, and human posture/behavior/action recognition technologies.
During the call, when multiple people appear in the current video image, the subject terminal obtains the mouth features of each person through face recognition, determines who is currently speaking according to those features, and takes that person's face as the target face, for example, taking the face whose mouth is opening and closing as the target face. Accordingly, every person may become the target person.
Alternatively, the subject terminal selects one of the faces as the target face through face recognition and discards the others, so that only one person is the target person. For example, while the calling party is talking with person A, person B enters the framing range of the target terminal's camera; person B, even if speaking at that moment, may simply be passing by rather than talking with the calling party, so the subject terminal may take only person A's face as the target face.
Besides facial features, the subject terminal may also select the target person based on other parameters that uniquely identify a person, such as voiceprint features.
S12: listening parameters of the target person are acquired, wherein the listening parameters identify the sound intensity heard by the target person and comprise at least one of the position of the target person in the current video image, an action of the target person, and the voice of the target person.
S13: the gain of the sound collection device is adjusted according to the listening parameters, and the voice signal collected by the adjusted sound collection device is transmitted to the target terminal.
A listening parameter is a parameter that can identify the intensity (or loudness) of the calling party's voice as heard by the target person. If the listening parameter indicates that the currently heard sound is low, the calling party's voice collected by the subject terminal is too quiet, and the subject terminal can increase the gain of the sound collection device (such as a microphone). If the listening parameter indicates that the currently heard sound is loud, the collected voice is too loud, and the subject terminal can decrease the gain.
Because the listening parameters identify the other party's real-time feedback on the call volume in the voice call scenario, the input gain of the microphone can be adjusted according to the other party's situation, providing a high-quality voice call service.
The specific amount by which the gain is adjusted may be determined according to the value of the listening parameter; the specific algorithm adopted is not limited in the embodiments of the present application.
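The increase/decrease logic described above can be sketched as follows. This is a minimal illustration only: the function name, the target intensity, and the fixed gain step are assumptions for the example, since the embodiment deliberately leaves the concrete algorithm open.

```python
GAIN_STEP_DB = 3.0  # assumed per-adjustment step, in decibels (illustrative)

def adjust_gain(current_gain_db, heard_intensity_db, target_intensity_db=60.0):
    """Raise the microphone gain when the target person hears the caller
    too quietly, lower it when the caller sounds too loud, and otherwise
    leave it unchanged."""
    if heard_intensity_db < target_intensity_db:
        return current_gain_db + GAIN_STEP_DB
    if heard_intensity_db > target_intensity_db:
        return current_gain_db - GAIN_STEP_DB
    return current_gain_db
```

In a real implementation the step would likely depend on how far the heard intensity deviates from the target, rather than being a constant.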
For the three types of listening parameters exemplified in step S12 above, how to adjust the gain of the sound collection device according to each type is explained below.
Referring to fig. 2, the position of the target person in the video image can be divided into three types: the face of the target person (hereinafter referred to as the target face) is located in the center region, the left half, or the right half of the video image.
When the target face is located in the left half of the video image, for example, at position a shown in fig. 2, the subject terminal increases the gain of the left channel of the sound collection device.
When the target face is located in the right half of the video image, for example, at position B shown in fig. 2, the subject terminal increases the gain of the right channel of the sound collection device.
When the target face is located in the center region of the video image, for example, at position C shown in fig. 2, the subject terminal may keep the gains of the left and right channels of the sound collection device unchanged.
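The position-based rule for positions A, B, and C can be sketched as follows. The one-third split used to define the center region is an assumption for illustration; the text does not specify the region boundaries.

```python
def adjust_channel_gains(face_x, frame_width, gains, step_db=3.0):
    """Boost the channel on the side of the frame where the target face
    appears; keep both channel gains unchanged for the center region.
    `gains` is a (left_db, right_db) pair."""
    left_bound = frame_width / 3       # assumed boundary of the center region
    right_bound = 2 * frame_width / 3
    left, right = gains
    if face_x < left_bound:            # position A: left half -> boost left channel
        left += step_db
    elif face_x > right_bound:         # position B: right half -> boost right channel
        right += step_db
    # position C: center region -> both gains unchanged
    return (left, right)
```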
The action of the target person may be any physical movement that reflects how loud the heard sound is to the target person, such as turning an ear (left or right) toward the target terminal, possibly accompanied by other movements such as covering an ear with a hand or shaking the head.
If, in the current video image, the left or right ear of the target person is turned toward the target terminal, the subject terminal increases the gain of the sound collection device.
The most intuitive feedback on the heard sound is the target person's own voice. The embodiments of the present application can therefore acquire the target person's speech and, through speech recognition, extract speech segments that indicate the loudness of the sound.
When a speech segment indicating that the sound is low is acquired, for example, "the sound is too low, I can't hear clearly" or "speak louder", the subject terminal increases the gain of the sound collection device. When a speech segment indicating that the sound is loud is acquired, the subject terminal decreases the gain.
It should be understood that, in the voice dimension, the listening parameter may also be the target person's voice intensity: the current voice intensity may be compared with the previous voice intensity, and if the voice intensity has decreased, the gain of the sound collection device is increased; if it has increased, the gain is decreased.
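The two voice-based cues (recognized speech segments, then voice-intensity comparison as a secondary signal) can be sketched together as follows. The phrase lists and the +1/-1 direction encoding are assumptions for this example; a real implementation would use the output of the speech recognition algorithm run on the DSP.

```python
# Assumed keyword lists standing in for recognized speech segments.
QUIET_PHRASES = ("too low", "too quiet", "can't hear", "speak louder")
LOUD_PHRASES = ("too loud", "lower your voice")

def gain_direction_from_voice(transcript, prev_intensity=None, cur_intensity=None):
    """Return +1 (increase gain), -1 (decrease gain), or 0 (no change).
    Recognized speech segments take effect first; otherwise the current
    voice intensity is compared with the previous one."""
    text = transcript.lower()
    if any(p in text for p in QUIET_PHRASES):
        return +1
    if any(p in text for p in LOUD_PHRASES):
        return -1
    if prev_intensity is not None and cur_intensity is not None:
        if cur_intensity < prev_intensity:   # voice got weaker -> raise gain
            return +1
        if cur_intensity > prev_intensity:   # voice got stronger -> lower gain
            return -1
    return 0
```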
The subject terminal can acquire the other types of listening parameters in any suitable manner. For example, for the position of the target face in the current video image, the subject terminal may obtain from the target terminal a correspondence between the target terminal's shooting focal length and position, then obtain the focal length used when the current video image was captured and derive the position of the target face from the correspondence. Alternatively, the position of the target person may be obtained from his or her voice intensity: the subject terminal obtains a correspondence between voice intensity and distance in advance, and then looks up the position corresponding to the currently measured voice intensity.
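The focal-length-to-position lookup described above can be sketched as a simple table query. The table values below are invented for illustration; in the embodiment the correspondence would be supplied by the target terminal.

```python
import bisect

# Hypothetical correspondence obtained from the target terminal:
# (shooting focal length in mm, approximate subject distance in meters).
FOCAL_TO_DISTANCE = [(24, 0.5), (35, 1.0), (50, 2.0), (85, 4.0)]

def position_from_focal_length(focal_mm):
    """Look up the first tabulated focal length not smaller than the
    captured one and return the corresponding position (distance)."""
    focals = [f for f, _ in FOCAL_TO_DISTANCE]
    i = bisect.bisect_left(focals, focal_mm)
    i = min(i, len(FOCAL_TO_DISTANCE) - 1)  # clamp beyond-table focal lengths
    return FOCAL_TO_DISTANCE[i][1]
```

A voice-intensity-to-distance table would be queried the same way, with measured intensity replacing focal length as the key.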
In the foregoing embodiments, the subject terminal adjusts the gain of the sound collection device according to the face image in the video image. When the target person leaves the camera's framing range during the call, that is, when the target person is detected to have disappeared from the current video image, the subject terminal cannot acquire the target person's position in the image or his or her actions. In this case, the gain of the sound collection device may be adjusted based on the target person's voice, for example, by comparing the current voice intensity with the previous one and adjusting the gain directly from the change, or by deriving the target person's position from the change in voice intensity and then adjusting the gain according to the change in position.
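The fallback for an off-screen target person can be sketched as follows; the function name and fixed step are assumptions, and the direct intensity-comparison variant is shown (the position-based variant would insert the intensity-to-distance lookup in between).

```python
def gain_delta_when_offscreen(face_visible, prev_intensity, cur_intensity,
                              step_db=3.0):
    """When the target person has disappeared from the frame, fall back to
    comparing the current voice intensity with the previous one: a weaker
    voice suggests the person moved farther away, so the gain is raised."""
    if face_visible:
        return 0.0  # image-based listening parameters are still available
    if cur_intensity < prev_intensity:
        return +step_db
    if cur_intensity > prev_intensity:
        return -step_db
    return 0.0
```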
In the embodiments of the present application, the subject terminal may acquire multiple types of listening parameters and combine them to adjust the gain. However, the volume levels identified by these parameters may conflict. For example, while uttering the speech segment "the sound is too loud, stop shouting at me", the target person may tilt his or her head toward the target terminal because of an itchy ear, accompanied by covering the ear with a hand; the subject terminal then needs to decide which parameter to use for adjusting the gain.
In this regard, embodiments of the present application may be provided with a method as shown in fig. 3 below. As shown in FIG. 3, the video call voice processing method includes steps S21-S25.
S21: a target person in the current video image is identified.
S22: listening parameters of the target person are acquired, wherein the listening parameters identify the sound intensity heard by the target person and comprise at least two of the position of the target person in the current video image, an action of the target person, and the voice of the target person.
S23: priorities are set for the various types of listening parameters.
S24: when the sound intensities identified by a plurality of listening parameters conflict with one another, the listening parameter with the highest priority is selected and those with lower priority are discarded; when the sound intensities do not conflict, all of the acquired listening parameters are selected.
S25: the gain of the sound collection device is adjusted according to the selected listening parameters, and the voice signal collected by the adjusted sound collection device is transmitted to the target terminal.
Building on the foregoing embodiments, this embodiment avoids erroneous gain adjustments caused by misjudgment by the subject terminal and feeds back the target person's listening situation more accurately.
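The priority-based selection of steps S23-S24 can be sketched as follows. The particular priority ordering and the direction encoding (+1 louder, -1 quieter) are assumptions for illustration; the embodiment leaves the actual ordering to the implementer.

```python
# Assumed priority order, highest first (illustrative only).
PRIORITY = ("voice", "action", "position")

def resolve_listening_params(params):
    """`params` maps a parameter type to the gain direction it identifies
    (+1 increase, -1 decrease). If all directions agree, keep every acquired
    parameter; otherwise keep only the highest-priority one and discard
    the rest."""
    directions = set(params.values())
    if len(directions) <= 1:
        return dict(params)  # no conflict: select all acquired parameters
    for kind in PRIORITY:
        if kind in params:
            return {kind: params[kind]}  # conflict: highest priority wins
    return {}
```

For instance, in the "too loud" example above, the voice parameter (-1) would override the conflicting action parameter (+1) under this ordering.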
Fig. 4 is a schematic structural diagram of a communication terminal according to an embodiment of the present application. Referring to fig. 4, the communication terminal 40 may be one of the two parties of a video call, such as the subject terminal. The communication terminal 40 includes an application processor 41, a digital signal processor 42, a sound collection device 43, a camera 44, a left speaker 451, a right speaker 452, and an antenna 46. The application processor 41 and the digital signal processor 42 may be regarded as the core of the communication terminal 40 and are connected with the other structural elements to implement the corresponding functions during a video call.
The camera 44 is used to capture images of a person and the environment in which the person is located.
The antenna 46 is, for example, a Wi-Fi antenna, and is configured to receive and transmit electromagnetic waves and to convert between electromagnetic waves and electrical signals, thereby communicating with the other device in the video call.
The sound collection device 43, such as a microphone, is used to collect the voice of the calling party.
The left speaker 451 and the right speaker 452 are used to play the voice of the other party in the video call, playing the left-channel and right-channel voice signals respectively.
The application processor 41 is used to acquire the current video image and transmit it to the digital signal processor (DSP) 42.
The digital signal processor 42 is configured to run the corresponding algorithms to identify the target person in the current video image and acquire the target person's listening parameters, wherein the listening parameters identify the sound intensity heard by the target person and include at least one of the position of the target person in the current video image, an action of the target person, and the voice of the target person.
Specifically, the application processor 41 sends the current video image to the digital signal processor 42, and the digital signal processor 42 runs the corresponding algorithms to recognize the position of the target person in the image and the target person's actions. The application processor 41 also transmits the target person's voice to the digital signal processor 42, which runs a speech recognition algorithm to obtain the corresponding parameters, for example, detecting whether the voice signal contains a specific speech segment such as "your voice is too low", and returns the detection result to the application processor 41.
The application processor 41 is configured to generate a gain scheme according to the listening parameters, adjust the gain of the sound collection device 43 according to that scheme, and transmit the voice signal collected by the adjusted sound collection device to the target terminal through the antenna 46.
For the specific operation of each structural element, reference may be made to the steps of the method above, which are not repeated here. For example, the application processor 41 is further configured to set priorities for the various types of listening parameters, to select the listening parameter with the highest priority and discard those with lower priority when the sound intensities identified by the parameters conflict, and to adjust the gain of the sound collection device 43 according to the listening parameters when they do not conflict.
The communication terminal 40 thus achieves the advantageous effects of the foregoing method.
It should be understood that, in a practical application scenario, the above steps may be performed not by the aforementioned structural elements but by other modules and units, depending on the type of device to which the communication terminal 40 belongs.
It will be understood by those skilled in the art that all or part of the steps of the methods of the above embodiments may be performed by instructions or by associated hardware controlled by the instructions, which may be stored in a computer readable storage medium and loaded and executed by a processor. To this end, an embodiment of the present application provides a readable storage medium, where a plurality of instructions are stored in the readable storage medium, and the instructions can be loaded by a processor to execute the steps in any video call voice processing method provided in the embodiment of the present application.
The storage medium may include a Read Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and the like.
The instructions stored in the storage medium can perform the steps in any video call voice processing method provided in the embodiments of the present application, and can therefore achieve the beneficial effects achievable by any such method, as detailed in the foregoing embodiments.
Embodiments of the present application also provide a computer program product, which includes computer program code, when the computer program code runs on a computer, causes the computer to execute the method as described in the above various possible embodiments.
Embodiments of the present application further provide a chip, which includes a memory for storing a computer program and a processor for calling and executing the computer program from the memory, so that a device in which the chip is installed performs the method in the above various possible embodiments.
It should be noted that the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element preceded by the phrase "comprising a …" does not exclude the presence of additional identical elements in the process, method, article, or apparatus that comprises the element. Elements, features, or components having the same name in different embodiments may or may not have the same meaning; the particular meaning is determined by its explanation in the specific embodiment or from the context of that embodiment.
In addition, although the terms "first," "second," "third," and the like are used herein to describe various information, the information should not be limited by these terms, which serve only to distinguish one type of information from another. For example, without departing from the scope herein, first information may also be referred to as second information and, similarly, second information may be referred to as first information, depending on the context. The term "if" may be interpreted as "upon," "when," or "in response to determining." Furthermore, as used herein, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context indicates otherwise. The terms "or" and "and/or" are to be construed as inclusive, meaning any one or any combination. Thus, "A, B or C" or "A, B and/or C" means "any of the following: A; B; C; A and B; A and C; B and C; A, B and C." An exception to this definition occurs only when a combination of elements, functions, steps, or operations is inherently mutually exclusive in some way.
Further, although the steps in the flowcharts herein are shown in an order indicated by the arrows, they are not necessarily performed in that order; unless explicitly stated otherwise herein, there is no strict order restriction, and the steps may be performed in other orders. Moreover, at least some of the steps in the figures may include multiple sub-steps or stages that are not necessarily completed at the same moment but may be performed at different times, and they may be performed in turn or alternately with other steps or with at least part of the sub-steps or stages of other steps.
Although the application has been shown and described with respect to one or more implementations, equivalent alterations and modifications will occur to others skilled in the art upon reading and understanding this specification and the annexed drawings, and the present application includes all such modifications and variations. That is, the above embodiments are only some of the embodiments of the present application and are not intended to limit its scope; all equivalent structural changes made using the contents of the specification and drawings, such as combinations of technical features between embodiments or direct or indirect applications to other related technical fields, fall within the scope of the present application.