Disclosure of Invention
In view of the above, the present application provides a video call voice processing method, a communication terminal, and a readable storage medium, to solve the problem that, in existing voice call scenarios, the input gain of the microphone cannot be adjusted according to the other party's situation.
The application provides a video call voice processing method, which comprises the following steps:
identifying a target person in a current video image;
acquiring listening parameters of the target person, wherein the listening parameters identify the sound intensity heard by the target person and comprise at least one of: the position of the target person in the current video image, an action of the target person, and the voice of the target person; and
adjusting the gain of a sound collection device according to the listening parameters, and transmitting the voice signal collected by the adjusted sound collection device to a target terminal.
Optionally, the acquiring of the listening parameters of the target person includes:
setting priorities for the various types of listening parameters;
when the sound intensities identified by a plurality of listening parameters conflict with one another, selecting the listening parameter with the highest priority and discarding those with lower priority; and
when the sound intensities identified by the plurality of listening parameters do not conflict, performing the step of adjusting the gain of the sound collection device according to the listening parameters.
Optionally, the acquiring of the listening parameters of the target person includes:
acquiring a correspondence between shooting focal length and position for the target terminal; and
acquiring the shooting focal length used by the target terminal when the current video image was captured, and obtaining the position of the target person in the current video image according to the correspondence.
Optionally, the video call voice processing method further includes:
detecting whether the target person remains displayed in the current video image; and
when the target person disappears from the current video image, acquiring the current voice of the target person and determining the position of the target person in the current video image according to that voice.
Optionally, the position of the target person in the video image includes: the face of the target person being located in a center region, a left half, or a right half of the video image;
when the listening parameter is the position of the target person in the video image, the adjusting of the gain of the sound collection device according to the listening parameter includes:
increasing the gain of the left channel of the sound collection device when the face of the target person is located in the left half of the video image; increasing the gain of the right channel when the face is located in the right half; and keeping the gains of both channels unchanged when the face is located in the center region.
Optionally, the action of the target person includes: the target person turning an ear toward the target terminal;
when the listening parameter is the action of the target person, the adjusting of the gain of the sound collection device according to the listening parameter includes: increasing the gain of the sound collection device.
Optionally, the voice of the target person includes: a speech segment indicating the loudness of the heard sound;
when the listening parameter is the voice of the target person, the adjusting of the gain of the sound collection device according to the listening parameter includes:
increasing the gain of the sound collection device when a speech segment indicating that the sound is low is acquired; and decreasing the gain of the sound collection device when a speech segment indicating that the sound is loud is acquired.
The application provides a communication terminal, which includes an application processor, a digital signal processor, and a sound collection device, wherein:
the application processor is configured to acquire a current video image;
the digital signal processor is configured to identify a target person in the current video image and to acquire listening parameters of the target person, wherein the listening parameters identify the sound intensity heard by the target person and include at least one of the position of the target person in the current video image, an action of the target person, and the voice of the target person; and
the application processor is further configured to adjust the gain of the sound collection device according to the listening parameters and to transmit the voice signal collected by the adjusted sound collection device to a target terminal.
Optionally, the application processor is further configured to set priorities for the various types of listening parameters;
when the sound intensities identified by a plurality of listening parameters conflict with one another, to select the listening parameter with the highest priority and discard those with lower priority; and
when the sound intensities identified by the plurality of listening parameters do not conflict, to adjust the gain of the sound collection device according to the listening parameters.
The application provides a readable storage medium storing a program, the program being executable by a processor to perform one or more steps of any of the above video call voice processing methods.
According to the method and device of the present application, the gain of the sound collection device is adjusted according to the listening parameters of the target person in the video image, the listening parameters comprising at least one of the position of the target person in the current video image, an action of the target person, and the voice of the target person. Because the listening parameters identify the other party's real-time feedback on the call volume in a voice call scenario, the input gain of the local microphone can be adjusted according to the other party's situation, providing a high-quality voice call service.
Detailed Description
The technical solutions in the embodiments of the present application are clearly and completely described below with reference to the accompanying drawings, and it is obvious that the described embodiments are only a part of the embodiments, and not all of them. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application. The following embodiments and their technical features may be combined with each other without conflict.
It should be noted that, in the description, step numbers such as S11 and S12 are used for the purpose of more clearly and briefly describing the corresponding contents, and do not constitute a substantial limitation on the sequence, and those skilled in the art may perform S12 first and then S11 in specific implementation, but these should be within the protection scope of the present application.
Fig. 1 is a flowchart illustrating a video call voice processing method according to an embodiment of the present application. The video call voice processing method may be applied to mobile Internet devices (MIDs) such as smartphones (Android phones, iOS phones, and the like), tablet computers, PDAs (personal digital assistants), and learning machines, or to wearable devices with an audio/video call function that can be worn on human limbs or artificial limbs, or embedded in clothing, jewelry, or accessories; the embodiments of the present application are not limited in this respect.
In an audio/video call scenario, the subject performing the steps of the method may be the device of either party to the call: when the performing device is the calling device, the called device is the target terminal; when the performing device is the called device, the calling device is the target terminal. For convenience of distinction and description, the device performing the steps is referred to herein as the subject terminal.
Referring to FIG. 1, the method for processing video call voice may include steps S11-S13.
S11: a target person in the current video image is identified.
The current video image is the image displayed on the subject terminal in real time, and the target person is the person displayed in that image in real time, i.e., the other party to the call. The target terminal captures the target person and his or her surroundings through its camera (front or rear), forms the image, and transmits it to the subject terminal; the subject terminal can then recognize the target person in the image through face recognition, human body detection, and human posture/behavior/action recognition technologies.
During the call, when multiple people appear in the current video image, the subject terminal obtains the mouth features of each person through face recognition, determines who is currently speaking according to those features, and takes that person's face as the target face, for example, taking the face whose mouth is opening and closing as the target face. Accordingly, every person may become the target person.
Alternatively, the subject terminal selects one of the faces as the target face through face recognition and discards the others, so that only one person is the target person. For example, while the calling party is talking with person A, person B enters the framing range of the target terminal's camera; person B, even if speaking at that moment, may simply be passing by rather than talking with the calling party, so the subject terminal may take only person A's face as the target face.
Besides facial features, the subject terminal may also select the target person based on other parameters that uniquely identify a person, such as voiceprint features.
S12: listening parameters of the target person are acquired, wherein the listening parameters identify the sound intensity heard by the target person and comprise at least one of the position of the target person in the current video image, an action of the target person, and the voice of the target person.
S13: the gain of the sound collection device is adjusted according to the listening parameters, and the voice signal collected by the adjusted sound collection device is transmitted to the target terminal.
A listening parameter is a parameter that can identify the intensity (or loudness) of the calling party's voice as heard by the target person. If the listening parameter indicates that the currently heard sound is low, the calling party's voice collected by the subject terminal is too quiet, and the subject terminal can increase the gain of the sound collection device (such as a microphone). If the listening parameter indicates that the currently heard sound is loud, the collected voice is too loud, and the subject terminal can decrease the gain.
Because the listening parameters identify the other party's real-time feedback on the call volume in the voice call scenario, the input gain of the microphone can be adjusted according to the other party's situation, providing a high-quality voice call service.
The specific amount by which the gain is adjusted may be determined according to the value of the listening parameter; the specific algorithm adopted is not limited in the embodiments of the present application.
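The increase/decrease logic described above can be sketched as follows. This is a minimal illustration only: the function name, the target intensity, and the fixed gain step are assumptions for the example, since the embodiment deliberately leaves the concrete algorithm open.

```python
GAIN_STEP_DB = 3.0  # assumed per-adjustment step, in decibels (illustrative)

def adjust_gain(current_gain_db, heard_intensity_db, target_intensity_db=60.0):
    """Raise the microphone gain when the target person hears the caller
    too quietly, lower it when the caller sounds too loud, and otherwise
    leave it unchanged."""
    if heard_intensity_db < target_intensity_db:
        return current_gain_db + GAIN_STEP_DB
    if heard_intensity_db > target_intensity_db:
        return current_gain_db - GAIN_STEP_DB
    return current_gain_db
```

In a real implementation the step would likely depend on how far the heard intensity deviates from the target, rather than being a constant.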
For the three types of listening parameters exemplified in step S12 above, how to adjust the gain of the sound collection device according to each type is explained below.
Referring to fig. 2, the position of the target person in the video image can be divided into three types: the face of the target person (hereinafter referred to as the target face) is located in the center region, the left half, or the right half of the video image.
When the target face is located in the left half of the video image, for example, at position a shown in fig. 2, the subject terminal increases the gain of the left channel of the sound collection device.
When the target face is located in the right half of the video image, for example, at position B shown in fig. 2, the subject terminal increases the gain of the right channel of the sound collection device.
When the target face is located in the center region of the video image, for example, at position C shown in fig. 2, the subject terminal may keep the gains of the left and right channels of the sound collection device unchanged.
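The position-based rule for positions A, B, and C can be sketched as follows. The one-third split used to define the center region is an assumption for illustration; the text does not specify the region boundaries.

```python
def adjust_channel_gains(face_x, frame_width, gains, step_db=3.0):
    """Boost the channel on the side of the frame where the target face
    appears; keep both channel gains unchanged for the center region.
    `gains` is a (left_db, right_db) pair."""
    left_bound = frame_width / 3       # assumed boundary of the center region
    right_bound = 2 * frame_width / 3
    left, right = gains
    if face_x < left_bound:            # position A: left half -> boost left channel
        left += step_db
    elif face_x > right_bound:         # position B: right half -> boost right channel
        right += step_db
    # position C: center region -> both gains unchanged
    return (left, right)
```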
The action of the target person may be any physical movement that reflects how loud the heard sound is to the target person, such as turning an ear (left or right) toward the target terminal, possibly accompanied by other movements such as covering an ear with a hand or shaking the head.
If, in the current video image, the left or right ear of the target person is turned toward the target terminal, the subject terminal increases the gain of the sound collection device.
The most intuitive feedback on the heard sound is the target person's own voice. The embodiments of the present application can therefore acquire the target person's speech and, through speech recognition, extract speech segments that indicate the loudness of the sound.
When a speech segment indicating that the sound is low is acquired, for example, "the sound is too low, I can't hear clearly" or "speak louder", the subject terminal increases the gain of the sound collection device. When a speech segment indicating that the sound is loud is acquired, the subject terminal decreases the gain.
It should be understood that, in the voice dimension, the listening parameter may also be the target person's voice intensity: the current voice intensity may be compared with the previous voice intensity, and if the voice intensity has decreased, the gain of the sound collection device is increased; if it has increased, the gain is decreased.
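The two voice-based cues (recognized speech segments, then voice-intensity comparison as a secondary signal) can be sketched together as follows. The phrase lists and the +1/-1 direction encoding are assumptions for this example; a real implementation would use the output of the speech recognition algorithm run on the DSP.

```python
# Assumed keyword lists standing in for recognized speech segments.
QUIET_PHRASES = ("too low", "too quiet", "can't hear", "speak louder")
LOUD_PHRASES = ("too loud", "lower your voice")

def gain_direction_from_voice(transcript, prev_intensity=None, cur_intensity=None):
    """Return +1 (increase gain), -1 (decrease gain), or 0 (no change).
    Recognized speech segments take effect first; otherwise the current
    voice intensity is compared with the previous one."""
    text = transcript.lower()
    if any(p in text for p in QUIET_PHRASES):
        return +1
    if any(p in text for p in LOUD_PHRASES):
        return -1
    if prev_intensity is not None and cur_intensity is not None:
        if cur_intensity < prev_intensity:   # voice got weaker -> raise gain
            return +1
        if cur_intensity > prev_intensity:   # voice got stronger -> lower gain
            return -1
    return 0
```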
The subject terminal can acquire the other types of listening parameters in any suitable manner. For example, for the position of the target face in the current video image, the subject terminal may obtain from the target terminal a correspondence between the target terminal's shooting focal length and position, then obtain the focal length used when the current video image was captured and derive the position of the target face from the correspondence. Alternatively, the position of the target person may be obtained from his or her voice intensity: the subject terminal obtains a correspondence between voice intensity and distance in advance, and then looks up the position corresponding to the currently measured voice intensity.
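The focal-length-to-position lookup described above can be sketched as a simple table query. The table values below are invented for illustration; in the embodiment the correspondence would be supplied by the target terminal.

```python
import bisect

# Hypothetical correspondence obtained from the target terminal:
# (shooting focal length in mm, approximate subject distance in meters).
FOCAL_TO_DISTANCE = [(24, 0.5), (35, 1.0), (50, 2.0), (85, 4.0)]

def position_from_focal_length(focal_mm):
    """Look up the first tabulated focal length not smaller than the
    captured one and return the corresponding position (distance)."""
    focals = [f for f, _ in FOCAL_TO_DISTANCE]
    i = bisect.bisect_left(focals, focal_mm)
    i = min(i, len(FOCAL_TO_DISTANCE) - 1)  # clamp beyond-table focal lengths
    return FOCAL_TO_DISTANCE[i][1]
```

A voice-intensity-to-distance table would be queried the same way, with measured intensity replacing focal length as the key.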
In the foregoing embodiments, the subject terminal adjusts the gain of the sound collection device according to the face image in the video image. When the target person leaves the camera's framing range during the call, that is, when the target person is detected to have disappeared from the current video image, the subject terminal cannot acquire the target person's position in the image or his or her actions. In this case, the gain of the sound collection device may be adjusted based on the target person's voice, for example, by comparing the current voice intensity with the previous one and adjusting the gain directly from the change, or by deriving the target person's position from the change in voice intensity and then adjusting the gain according to the change in position.
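The fallback for an off-screen target person can be sketched as follows; the function name and fixed step are assumptions, and the direct intensity-comparison variant is shown (the position-based variant would insert the intensity-to-distance lookup in between).

```python
def gain_delta_when_offscreen(face_visible, prev_intensity, cur_intensity,
                              step_db=3.0):
    """When the target person has disappeared from the frame, fall back to
    comparing the current voice intensity with the previous one: a weaker
    voice suggests the person moved farther away, so the gain is raised."""
    if face_visible:
        return 0.0  # image-based listening parameters are still available
    if cur_intensity < prev_intensity:
        return +step_db
    if cur_intensity > prev_intensity:
        return -step_db
    return 0.0
```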
In the embodiments of the present application, the subject terminal may acquire multiple types of listening parameters and combine them to adjust the gain. However, the volume levels identified by these parameters may conflict. For example, while uttering the speech segment "the sound is too loud, stop shouting at me", the target person may tilt his or her head toward the target terminal because of an itchy ear, accompanied by covering the ear with a hand; the subject terminal then needs to decide which parameter to use for adjusting the gain.
In this regard, embodiments of the present application may be provided with a method as shown in fig. 3 below. As shown in FIG. 3, the video call voice processing method includes steps S21-S25.
S21: a target person in the current video image is identified.
S22: listening parameters of the target person are acquired, wherein the listening parameters identify the sound intensity heard by the target person and comprise at least two of the position of the target person in the current video image, an action of the target person, and the voice of the target person.
S23: priorities are set for the various types of listening parameters.
S24: when the sound intensities identified by a plurality of listening parameters conflict with one another, the listening parameter with the highest priority is selected and those with lower priority are discarded; when the sound intensities do not conflict, all of the acquired listening parameters are selected.
S25: the gain of the sound collection device is adjusted according to the selected listening parameters, and the voice signal collected by the adjusted sound collection device is transmitted to the target terminal.
Building on the foregoing embodiments, this embodiment avoids erroneous gain adjustments caused by misjudgment by the subject terminal and feeds back the target person's listening situation more accurately.
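The priority-based selection of steps S23-S24 can be sketched as follows. The particular priority ordering and the direction encoding (+1 louder, -1 quieter) are assumptions for illustration; the embodiment leaves the actual ordering to the implementer.

```python
# Assumed priority order, highest first (illustrative only).
PRIORITY = ("voice", "action", "position")

def resolve_listening_params(params):
    """`params` maps a parameter type to the gain direction it identifies
    (+1 increase, -1 decrease). If all directions agree, keep every acquired
    parameter; otherwise keep only the highest-priority one and discard
    the rest."""
    directions = set(params.values())
    if len(directions) <= 1:
        return dict(params)  # no conflict: select all acquired parameters
    for kind in PRIORITY:
        if kind in params:
            return {kind: params[kind]}  # conflict: highest priority wins
    return {}
```

For instance, in the "too loud" example above, the voice parameter (-1) would override the conflicting action parameter (+1) under this ordering.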
Fig. 4 is a schematic structural diagram of a communication terminal according to an embodiment of the present application. Referring to fig. 4, the communication terminal 40 may be one of the two parties of a video call, such as the subject terminal. The communication terminal 40 includes an application processor 41, a digital signal processor 42, a sound collection device 43, a camera 44, a left speaker 451, a right speaker 452, and an antenna 46. The application processor 41 and the digital signal processor 42 may be regarded as the core of the communication terminal 40 and are connected with the other structural elements to implement the corresponding functions during a video call.
The camera 44 is used to capture images of a person and the environment in which the person is located.
The antenna 46 is, for example, a Wi-Fi antenna, and is configured to receive and transmit electromagnetic waves and to convert between electromagnetic waves and electrical signals, thereby communicating with the other device in the video call.
The sound collection device 43, such as a microphone, is used to collect the voice of the calling party.
The left speaker 451 and the right speaker 452 are used to play the voice of the other party in the video call, playing the left-channel and right-channel voice signals respectively.
The application processor 41 is used to acquire the current video image and transmit it to the digital signal processor (DSP) 42.
The digital signal processor 42 is configured to run the corresponding algorithms to identify the target person in the current video image and acquire the target person's listening parameters, wherein the listening parameters identify the sound intensity heard by the target person and include at least one of the position of the target person in the current video image, an action of the target person, and the voice of the target person.
Specifically, the application processor 41 sends the current video image to the digital signal processor 42, and the digital signal processor 42 runs the corresponding algorithms to recognize the position of the target person in the image and the target person's actions. The application processor 41 also transmits the target person's voice to the digital signal processor 42, which runs a speech recognition algorithm to obtain the corresponding parameters, for example, detecting whether the voice signal contains a specific speech segment such as "your voice is too low", and returns the detection result to the application processor 41.
The application processor 41 is configured to generate a gain scheme according to the listening parameters, adjust the gain of the sound collection device 43 according to that scheme, and transmit the voice signal collected by the adjusted sound collection device to the target terminal through the antenna 46.
For the specific operation of each structural element, reference may be made to the steps of the method above, which are not repeated here. For example, the application processor 41 is further configured to set priorities for the various types of listening parameters, to select the listening parameter with the highest priority and discard those with lower priority when the sound intensities identified by the parameters conflict, and to adjust the gain of the sound collection device 43 according to the listening parameters when they do not conflict.
The communication terminal 40 thus achieves the advantageous effects of the foregoing method.
It should be understood that, in a practical application scenario, the above steps may be performed not by the aforementioned structural elements but by other modules and units, depending on the type of device to which the communication terminal 40 belongs.
It will be understood by those skilled in the art that all or part of the steps of the methods of the above embodiments may be performed by instructions or by associated hardware controlled by the instructions, which may be stored in a computer readable storage medium and loaded and executed by a processor. To this end, an embodiment of the present application provides a readable storage medium, where a plurality of instructions are stored in the readable storage medium, and the instructions can be loaded by a processor to execute the steps in any video call voice processing method provided in the embodiment of the present application.
The storage medium may include a Read Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and the like.
The instructions stored in the storage medium can perform the steps in any video call voice processing method provided in the embodiments of the present application, and can therefore achieve the beneficial effects achievable by any such method, as detailed in the foregoing embodiments.
Embodiments of the present application also provide a computer program product, which includes computer program code, when the computer program code runs on a computer, causes the computer to execute the method as described in the above various possible embodiments.
Embodiments of the present application further provide a chip, which includes a memory for storing a computer program and a processor for calling and executing the computer program from the memory, so that a device in which the chip is installed performs the method in the above various possible embodiments.
It should be noted that the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element preceded by the phrase "comprising a …" does not exclude the presence of additional identical elements in the process, method, article, or apparatus that comprises the element. Elements, features, or components having the same name in different embodiments may or may not have the same meaning; the particular meaning is determined by its explanation in the specific embodiment or from the context of that embodiment.
In addition, although the terms "first," "second," "third," and the like are used herein to describe various information, the information should not be limited by these terms, which serve only to distinguish one type of information from another. For example, without departing from the scope herein, first information may also be referred to as second information and, similarly, second information may be referred to as first information, depending on the context. The term "if" may be interpreted as "upon," "when," or "in response to determining." Furthermore, as used herein, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context indicates otherwise. The terms "or" and "and/or" are to be construed as inclusive, meaning any one or any combination. Thus, "A, B or C" or "A, B and/or C" means "any of the following: A; B; C; A and B; A and C; B and C; A, B and C." An exception to this definition occurs only when a combination of elements, functions, steps, or operations is inherently mutually exclusive in some way.
Further, although the steps in the flowcharts herein are shown in an order indicated by the arrows, they are not necessarily performed in that order; unless explicitly stated otherwise herein, there is no strict order restriction, and the steps may be performed in other orders. Moreover, at least some of the steps in the figures may include multiple sub-steps or stages that are not necessarily completed at the same moment but may be performed at different times, and they may be performed in turn or alternately with other steps or with at least part of the sub-steps or stages of other steps.
Although the application has been shown and described with respect to one or more implementations, equivalent alterations and modifications will occur to others skilled in the art upon reading and understanding this specification and the annexed drawings, and the present application includes all such modifications and variations. That is, the above embodiments are only some of the embodiments of the present application and are not intended to limit its scope; all equivalent structural changes made using the contents of the specification and drawings, such as combinations of technical features between embodiments or direct or indirect applications to other related technical fields, fall within the scope of the present application.