Detailed Description
To make the purpose, technical solutions, and beneficial effects of the present application clearer and more apparent, the present application is further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit it.
For convenience of understanding, terms referred to in the embodiments of the present application are explained below.
Cloud technology (Cloud technology): a general term for the network technology, information technology, integration technology, management platform technology, application technology, and the like applied under the cloud computing business model; it can form a resource pool to be used on demand, flexibly and conveniently. Cloud computing technology will become an important support. The background services of technical network systems, such as video websites, picture websites, and other web portals, require a large amount of computing and storage resources. With the rapid development and application of the internet industry, every article may come to have its own identification mark, which needs to be transmitted to a background system for logical processing; data of different levels will be processed separately, and all kinds of industry data need strong system background support, which can only be realized through cloud computing.
Cloud conference: the cloud conference is an efficient, convenient, and low-cost conference form based on cloud computing technology. A user only needs to perform simple, easy-to-use operations through an internet interface to quickly and efficiently share voice, data files, and video with teams and clients all over the world, while complex technologies such as the transmission and processing of conference data are handled by the cloud conference service provider. At present, domestic cloud conferences mainly focus on service content in the Software as a Service (SaaS) mode, including service forms such as telephone, network, and video; a video conference based on cloud computing is called a cloud conference. For example, in the embodiment of the present application, a multi-person voice or video conference may be held based on cloud computing.
In the cloud conference era, data transmission, processing, and storage are all handled by the computing resources of video conference providers; users do not need to purchase expensive hardware or install complicated software, and can hold an efficient teleconference simply by opening a browser and logging in to the corresponding interface.
The cloud conference system supports dynamic multi-server cluster deployment and provides a plurality of high-performance servers, which greatly improves conference stability, security, and usability. In recent years, video conferencing has been welcomed by many users because it greatly improves communication efficiency, continuously reduces communication costs, and upgrades internal management, and it has been widely applied in fields such as government, military, transportation, finance, operators, education, and enterprises. There is no doubt that after video conferencing adopts cloud computing, it becomes even more attractive in terms of convenience, speed, and ease of use, which will certainly stimulate a new wave of video conference applications.
Automatic gain control: Automatic Gain Control (AGC) refers to an automatic control method that automatically adjusts the gain of an amplifier circuit according to the signal strength. The circuit that implements this function is referred to as an AGC loop. The AGC loop is a closed-loop electronic circuit, i.e., a negative feedback system, and can be divided into two parts: a gain-controlled amplifying circuit and a control voltage forming circuit. The gain-controlled amplifying circuit is located in the forward amplification path, and its gain varies with the control voltage. The basic components of the control voltage forming circuit are an AGC detector and a low-pass smoothing filter; it may also include components such as a gate circuit and a DC amplifier.
At present, in a voice/video communication system with multi-person interaction, since several persons may speak at the same time, mixing is an essential step if a participant is to hear the voices of multiple speakers clearly. Currently, a server receives the audio data packets sent by a plurality of terminals and decodes them respectively, then calculates voice features from the decoded audio signals, and selects the audio data packets containing voice segments based on those voice features. The selected audio data packets are then mixed into a mixed audio data packet, which is sent to a receiving terminal. Exemplarily, as shown in fig. 1, suppose 5 terminals are in call connection with a server. The server receives the audio data packets sent by the terminals 1 to 4, and a decoding module in the server decodes the received audio data packets to obtain the audio signals corresponding to the terminals 1 to 4, respectively. An analysis module in the server calculates voice features based on the decoded audio signals, obtains the voice features 1 to 4, and sends them to a routing module in the server. The routing module in the server performs routing on the audio signals respectively sent by the terminal 1 to the terminal 4 according to the voice features. Assuming that the routing result consists of the audio signals sent by the terminal 1, the terminal 2, and the terminal 4, the routing result is sent to the audio mixing module of the server. The audio mixing module in the server mixes the audio signals sent by the terminal 1, the terminal 2, and the terminal 4 according to the routing result to obtain a mixed audio signal. The encoding module in the server encodes the mixed audio signal and then transmits the encoded mixed audio signal to the terminal 5.
Since the server decodes all received audio data streams, calculates speech features, and only then performs routing, mixing, and encoding, the decoding and speech feature calculation both consume a large amount of Central Processing Unit (CPU) resources, which results in excessive load on the server. Secondly, the audio decoded by the server has generally already passed through the audio processing chain of the terminal; for example, after the terminal performs Automatic Gain Control (AGC) on the audio, the non-speech segments are relatively amplified, so when the server extracts speech features from such audio, the obtained features cannot truly reflect the characteristics of the non-speech segments, which degrades the routing effect of the server. In view of this, in the embodiment of the present application, before the terminal processes the acquired audio signal, the terminal performs voice feature extraction on the acquired audio signal to obtain voice feature information, then performs automatic gain control and encoding on the acquired audio signal, packs the encoded audio signal together with the voice feature information to obtain an audio data packet, and sends the audio data packet to the server. After receiving the audio data packets sent by N terminals, the server obtains the voice feature information from each audio data packet, and then selects the target audio signals sent by M terminals from the encoded audio signals in the audio data packets according to the voice feature information corresponding to the encoded audio signal in each audio data packet, where M is a positive integer smaller than N. Mixing is then performed based on the target audio signals sent by the M terminals.
The terminal performs voice feature extraction on the collected audio signal to obtain the voice feature information, and then sends an audio data packet containing the voice feature information to the server; the server obtains the voice feature information directly from the received audio data packets and performs routing based on it. The server therefore does not need to decode every audio stream or calculate voice features itself, which reduces its CPU consumption; moreover, because the features are extracted before the terminal's own audio processing (such as AGC), they reflect the audio more truly, which improves the routing effect.
Referring to fig. 2, which shows a system architecture applicable to the embodiment of the present application, the system architecture includes at least N terminals 101 and a server 102. The N terminals 101 are the terminals 101-1 to 101-N shown in fig. 2, where N is a positive integer whose value is not limited in this embodiment.
An application program for multi-person conversation may be installed in the terminal 101; the application program may be a social application, an office application, or the like, and a user may use the application program to carry out a multi-person conversation.
Illustratively, suppose the terminal has the social application installed in advance. The user first establishes a chat group XXXX in the social application and then clicks the "voice call" icon in the chat group interface, as shown in fig. 3a, and then selects a plurality of members from the chat group to initiate a voice chat. When a selected member receives the voice chat request, the member clicks the answer icon displayed on the terminal to join the voice chat; the multi-user voice chat interface is specifically shown in fig. 3b. A member who has joined the voice chat can click the video icon to turn on the camera and switch the voice chat to a video chat.
Illustratively, suppose the terminal has an office application installed in advance, and the interface of the office application includes a "join meeting" icon, a "quick meeting" icon, and a "scheduled meeting" icon, as specifically shown in fig. 4a. User A may, as the meeting initiator, click the "quick meeting" icon in the interface of the office application to enter the conference interface. The conference interface includes the conference number, a "mute" icon, an "open video" icon, an "administrator" icon, an "end" icon, and the like. User A may invite people to participate in the conference from the conference interface; suppose user A invites user B and user C to the conference. When an invited person receives the conference request, the invited person clicks the answer icon displayed on the terminal to agree to join the voice conference; the interface of the multi-person voice conference is specifically shown in fig. 4b. A member participating in the conference can click the "open video" icon to turn on the camera and switch the voice conference to a video conference. The conference initiator may click the "end" icon to end the voice or video conference.
In addition, the terminal 101 may also be equipped with a browser; the user enters a web page for multi-person conversation using the browser and then carries out the multi-person conversation through the web page. The multi-person call may be a video call or a voice call, which is not specifically limited in the embodiment of the present application. The terminal 101 may include one or more processors 1011, a memory 1012, an I/O interface 1013 for interacting with the server 102, a display panel 1014, and the like. The memory 1012 of the terminal 101 may store program instructions for audio processing which, when executed by the processor 1011, can be used to process audio and to display an interface for the multi-person conversation on the display panel 1014. The terminal 101 may be, but is not limited to, a smart phone, a tablet computer, a notebook computer, a desktop computer, a smart speaker, a smart watch, and the like.
In a specific implementation, each terminal 101 exchanges information with the server 102 through signaling; the server 102 establishes a connection with a terminal 101 after the terminal 101 passes authentication, and the terminal 101 then sends audio data packets to the server 102. The server 102 is configured to unpack, decode, route, mix, encode, and pack the audio data packets sent by the multiple terminals to obtain a mixed audio data packet, and then send the mixed audio data packet to the terminals 101 in call connection with the server 102. In addition, the server 102 monitors the call process during the call, for example monitoring whether a new terminal joins or exits, and, when the call ends, releases the data connections and signaling connections with the terminals 101 together with the associated resources. The server 102 may include one or more processors 1021, a memory 1022, and an I/O interface 1023 for interacting with the terminal 101. The signaling interaction, authentication, call control, unpacking, decoding, routing, mixing, encoding, and packing functions of the server 102 may be implemented on the one or more processors 1021. In addition, the server 102 may also be configured with a database 1024. The server 102 may be an independent physical server, a server cluster or distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, a CDN, and big data and artificial intelligence platforms. The terminal 101 and the server 102 may be directly or indirectly connected through wired or wireless communication, which is not limited in the present application.
Based on the system architecture diagram shown in fig. 2, an embodiment of the present application provides a flow of an audio processing method, as shown in fig. 5, where the flow of the method is executed by a terminal and a server interactively, and the method includes the following steps:
step S501, the terminal extracts voice characteristics of the collected audio signals to obtain voice characteristic information.
Specifically, the terminal collects an audio signal through a microphone, then performs echo cancellation and denoising on the audio signal, and then performs voice feature extraction on the audio signal from which the echo and the noise are removed to obtain voice feature information.
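For ease of understanding, a minimal sketch of this step is given below in Python. The frame format, the sample range of [-1, 1], and the use of mean-square energy as the audio energy are illustrative assumptions; the embodiment does not prescribe a particular energy measure.

    import numpy as np

    def extract_voice_features(frame: np.ndarray) -> dict:
        # frame: one float32 audio frame in [-1.0, 1.0], taken after echo
        # cancellation and denoising but before automatic gain control,
        # so that AGC cannot distort the measured energy.
        energy = float(np.mean(frame.astype(np.float64) ** 2))  # mean-square energy
        return {"audio_energy": energy}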
Step S502, the terminal performs automatic gain control and encoding on the collected audio signal, and then packs the encoded audio signal together with the voice feature information to obtain an audio data packet.
In step S503, the terminal sends the audio data packet to the server.
Specifically, the terminal performs automatic gain control on the acquired audio signal to adjust it to a suitable volume, then compresses and encodes the audio signal, for example with a vocoder such as G.729, SILK, or Opus, to obtain an encoded audio signal, and then packs the encoded audio signal together with the voice feature information to obtain an audio data packet. In one possible implementation, the voice feature information is embedded in a specified field in the header of the Real-time Transport Protocol (RTP), and the packed audio data packet is then sent to the server over RTP.
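The embodiment only states that the voice feature information is embedded in a specified field of the RTP header. The sketch below realizes this with the generic RTP header-extension mechanism of RFC 3550; the payload type (96), the extension identifier (0x4656), and the layout of the extension word (16-bit quantized energy plus a voice flag) are assumptions of this sketch, not part of the embodiment.

    import struct

    def pack_rtp_with_features(payload: bytes, seq: int, timestamp: int,
                               ssrc: int, energy_q: int, vad_flag: int) -> bytes:
        # RTP fixed header: V=2, P=0, X=1 (header extension present), CC=0.
        first_byte = (2 << 6) | (1 << 4)
        header = struct.pack("!BBHII", first_byte, 96, seq & 0xFFFF,
                             timestamp & 0xFFFFFFFF, ssrc & 0xFFFFFFFF)
        # One 32-bit extension word: quantized audio energy + VAD flag.
        ext_word = struct.pack("!HH", energy_q & 0xFFFF, vad_flag & 0x1)
        # Generic header extension: profile-defined id, then length in 32-bit words.
        extension = struct.pack("!HH", 0x4656, len(ext_word) // 4) + ext_word
        return header + extension + payload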
When the user turns off the microphone of the terminal, or the energy of the audio signal detected by the terminal is less than a preset threshold, the terminal may refrain from sending the audio data packet to the server. Alternatively, the terminal may always send the audio data packet to the server; this is not limited in the embodiment of the present application.
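A rough sketch of this optional send gate follows; the threshold value and the always-send switch are hypothetical parameters.

    ENERGY_SEND_THRESHOLD = 1e-4  # hypothetical value; the preset threshold is not specified

    def should_send_packet(mic_enabled: bool, audio_energy: float,
                           always_send: bool = False) -> bool:
        # Suppress the packet when the microphone is off or the frame energy
        # is below the preset threshold, unless configured to always send.
        if always_send:
            return True
        return mic_enabled and audio_energy >= ENERGY_SEND_THRESHOLD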
Step S504, the server receives the audio data packets sent by the N terminals.
Specifically, each audio data packet includes voice feature information and a coded audio signal, the voice feature information is obtained by a terminal after voice feature extraction is performed on the acquired audio signal, and the N terminals are all or part of terminals in call connection with the server. If the user closes the microphone of the terminal or the energy of the audio signal detected by the terminal is less than the preset threshold value, the terminal does not send the audio data packet to the server, and the server may receive the audio data packet sent by a part of terminals in call connection with the server. And if the terminal always sends the audio data packet to the server, the server receives the audio data packets sent by all terminals in call connection with the server.
Step S505, the server selects, from the encoded audio signals in each audio data packet, target audio signals sent by M terminals according to the speech feature information corresponding to the encoded audio signals in each audio data packet.
Specifically, M is a positive integer smaller than N, the voice feature information is used for indicating whether voice exists in the audio signals, and the server selects target audio signals with voice sent by M terminals from the coded audio signals in each audio data packet according to the voice feature information.
In step S506, the server performs mixing processing based on the target audio signals sent by the M terminals.
In one possible embodiment, when M is greater than or equal to the number of terminals that transmit audio packets to the server, the server performs mixing processing based on the received audio signals encoded in the audio packets transmitted by all the terminals.
Illustratively, setting M to 5, and 10 terminals in call connection with the server, where 7 terminals have their microphones turned off, and 3 terminals transmit audio data packets to the server, the server generates and transmits a mixed audio data packet for each of the 10 terminals in call connection with the server based on the received encoded audio signals in the audio data packets transmitted by the 3 terminals.
The terminal performs voice feature extraction on the collected audio signal to obtain the voice feature information, and then sends an audio data packet containing the voice feature information to the server; the server obtains the voice feature information directly from the received audio data packets and performs routing based on it, so that the server does not need to decode every audio stream or calculate voice features itself, which reduces its CPU consumption.
Optionally, in step S505, the voice feature information at least includes audio energy, where the audio energy may be energy of voice in the audio signal, or energy of the audio signal, and generally, the audio energy of a voice segment is greater than that of a non-voice segment.
Optionally, the voice feature information further includes a voice flag, where the voice flag is either obtained by the terminal performing voice detection on the collected audio signal and sent to the server, or obtained by the server performing voice detection on the received encoded audio signal.
Specifically, the voice flag is a Voice Activity Detection (VAD) flag. The VAD flag is obtained by analyzing and calculating audio coding parameters and their feature values, and then determining, according to a preset logic criterion, whether a voice signal exists in the current audio signal. When the terminal generates the voice flag, the terminal may perform voice detection on the acquired audio signal automatically, or after receiving an instruction input by the user, to obtain the voice flag, and then send the audio energy and the voice flag to the server as the voice feature information; after the server receives the audio data packet, an unpacking module in the server unpacks it to obtain the encoded audio signal, the audio energy, and the voice flag. When the server generates the voice flag, after the server receives the audio data packet, the unpacking module in the server unpacks it to obtain the encoded audio signal and the audio energy, and voice detection is then performed on the encoded audio signal to obtain the voice flag.
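The embodiment derives the VAD flag from audio coding parameters and a preset logic criterion. The simpler energy-based detector sketched below is only meant to illustrate how a per-frame 0/1 voice flag with hangover could be produced; the threshold and hangover length are assumptions.

    class SimpleVad:
        def __init__(self, threshold: float = 1e-4, hangover_frames: int = 10):
            self.threshold = threshold
            self.hangover = hangover_frames
            self._count = 0

        def detect(self, audio_energy: float) -> int:
            # Return 1 (voice) or 0 (no voice) for the current frame.
            if audio_energy >= self.threshold:
                self._count = self.hangover
                return 1
            if self._count > 0:
                self._count -= 1  # hold the flag briefly to avoid clipping word ends
                return 1
            return 0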
When the server selects the target audio signals sent by M terminals from the encoded audio signals in the audio data packets according to the voice feature information corresponding to the encoded audio signal in each audio data packet, the embodiments of the present application provide at least the following implementations:
in a possible implementation, the voice feature information corresponding to the encoded audio signal includes the audio energy and the voice flag; the server screens out, from the encoded audio signals in the audio data packets, the audio signals whose voice flag indicates voice, sorts the screened audio signals in descending order of audio energy, and determines the audio signals ranked in the top M as the target audio signals.
Specifically, the voice flag takes two values, voice and no voice; in a specific implementation, 1 may indicate voice and 0 may indicate no voice. The server first screens out, from the encoded audio signals in the audio data packets, the audio signals whose voice flag indicates voice. When the number of screened audio signals is greater than M, the screened audio signals are sorted in descending order of audio energy, and the audio signals ranked in the top M are determined as the target audio signals. When the number of screened audio signals is not more than M, all of the screened audio signals are determined as target audio signals; in addition, from among the encoded audio signals whose voice flag indicates no voice, the server may select further encoded audio signals as target audio signals according to whether the corresponding terminal has sent voice recently, so that the number of target audio signals reaches M.
Illustratively, suppose M is 5 and the server receives the audio data packets sent by the terminal 1 to the terminal 10, namely the audio data packet 1 to the audio data packet 10, where the voice flag in 7 audio data packets (the audio data packet 1 to the audio data packet 7) is 1, and the voice flag in 3 audio data packets (the audio data packet 8 to the audio data packet 10) is 0. The server screens out, from the encoded audio signals in the 10 audio data packets, the 7 encoded audio signals whose voice flags indicate voice, namely the encoded audio signal 1 to the encoded audio signal 7. The encoded audio signal 1 to the encoded audio signal 7 are then sorted in descending order of audio energy; suppose the sorted order is: encoded audio signal 1, encoded audio signal 2, encoded audio signal 7, encoded audio signal 5, encoded audio signal 3, encoded audio signal 6, encoded audio signal 4. The encoded audio signals ranked in the top 5 are taken as the target audio signals, namely: encoded audio signal 1, encoded audio signal 2, encoded audio signal 7, encoded audio signal 5, encoded audio signal 3.
Illustratively, suppose M is 5 and the server receives the audio data packets sent by the terminal 1 to the terminal 10, namely the audio data packet 1 to the audio data packet 10, where the voice flag in 3 audio data packets (the audio data packet 1 to the audio data packet 3) is 1, and the voice flag in 7 audio data packets (the audio data packet 4 to the audio data packet 10) is 0. The server screens out, from the encoded audio signals in the 10 audio data packets, the 3 encoded audio signals whose voice flags indicate voice, namely the encoded audio signal 1, the encoded audio signal 2, and the encoded audio signal 3. Since the number of screened encoded audio signals is less than 5, all 3 screened encoded audio signals are taken as target audio signals. If the encoded audio signals received by the server from the terminal 4 and the terminal 6 in the past 1 minute contained speech, the encoded audio signal 4 and the encoded audio signal 6 transmitted this time by the terminal 4 and the terminal 6 are also taken as target audio signals. Combining the voice flag with the audio energy for routing takes more comprehensive voice features into account and thereby improves the accuracy of routing.
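A sketch of this selection rule is given below. The representation of the per-packet features, the backfill criterion based on the time of last detected voice, and the 60-second window are assumptions drawn from the example above.

    import time
    from typing import Dict, List

    def select_routes(packets: Dict[str, dict], m: int,
                      last_voice_time: Dict[str, float],
                      recent_window_s: float = 60.0) -> List[str]:
        # packets: terminal id -> {"audio_energy": float, "vad": 0 or 1}
        voiced = [tid for tid, p in packets.items() if p["vad"] == 1]
        voiced.sort(key=lambda tid: packets[tid]["audio_energy"], reverse=True)
        selected = voiced[:m]
        if len(selected) < m:
            now = time.monotonic()
            # Backfill with unvoiced streams whose terminal spoke recently.
            recent = [tid for tid in packets
                      if tid not in selected
                      and now - last_voice_time.get(tid, float("-inf")) <= recent_window_s]
            selected += recent[:m - len(selected)]
        return selected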
In a possible implementation, the voice feature information corresponding to the encoded audio signal is the audio energy; the server sorts the encoded audio signals in the audio data packets in descending order of audio energy and determines the encoded audio signals ranked in the top M as the target audio signals.
Illustratively, suppose M is 5 and the server receives the audio data packets sent by the terminal 1 to the terminal 10, namely the audio data packet 1 to the audio data packet 10. The server sorts the encoded audio signals in the audio data packet 1 to the audio data packet 10 in descending order of audio energy; suppose the sorted order is: encoded audio signal 1, encoded audio signal 2, encoded audio signal 10, encoded audio signal 9, encoded audio signal 7, encoded audio signal 8, encoded audio signal 5, encoded audio signal 3, encoded audio signal 6, encoded audio signal 4. The encoded audio signals ranked in the top 5 are determined as the target audio signals, namely: encoded audio signal 1, encoded audio signal 2, encoded audio signal 10, encoded audio signal 9, encoded audio signal 7. Because the audio energy of a voice segment is greater than that of a non-voice segment, sorting the audio signals by audio energy effectively screens out the audio signals that contain voice.
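The energy-only variant reduces to a single descending sort, as sketched below with the same assumed packet representation.

    def select_by_energy(packets: dict, m: int) -> list:
        # packets: terminal id -> {"audio_energy": float}
        ranked = sorted(packets, key=lambda tid: packets[tid]["audio_energy"],
                        reverse=True)
        return ranked[:m]  # encoded audio signals ranked in the top M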
Optionally, in step S506, the server performs mixing processing based on the target audio signals sent by the M terminals, which specifically includes the following several embodiments:
in one possible implementation, the target audio signals sent by the M terminals are sequentially decoded, mixed, and encoded to obtain a first mixed audio data packet, and the first mixed audio data packet is sent to every terminal other than the M terminals. For each target terminal among the M terminals, the target audio signals sent by the other M-1 terminals, i.e., excluding the target audio signal sent by that target terminal, are sequentially decoded, mixed, and encoded to obtain a second mixed audio data packet, which is sent to that target terminal.
Specifically, the target audio signals sent by the M terminals are decoded to obtain M audio digital signals; optionally, the audio digital signals may be Pulse Code Modulation (PCM) digital signals. The M audio digital signals are mixed to obtain a first mixed audio digital signal, which is then encoded to obtain the first mixed audio data packet. For each target terminal among the M terminals, the target audio signals sent by the other M-1 terminals, excluding the target audio signal sent by that target terminal, are decoded to obtain M-1 audio digital signals; the M-1 audio digital signals are mixed to obtain a second mixed audio digital signal, which is then encoded to obtain the second mixed audio data packet.
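The embodiment does not prescribe a mixing algorithm; plain summation of the decoded PCM samples with saturation, as sketched below, is one minimal choice.

    import numpy as np

    def mix_pcm(signals: list) -> np.ndarray:
        # signals: equally long int16 PCM frames (the decoded audio digital signals).
        acc = np.zeros_like(signals[0], dtype=np.int32)
        for s in signals:
            acc += s.astype(np.int32)  # widen to avoid overflow while summing
        return np.clip(acc, -32768, 32767).astype(np.int16)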
Illustratively, suppose M is 3 and 5 terminals are in call connection with the server, namely the terminal 1 to the terminal 5, where the terminal 1 to the terminal 4 each send an audio data packet to the server, the audio data packets including voice feature information and encoded audio signals. The terminal 5 has turned off its microphone and does not send audio data packets to the server. The server receives the audio data packets sent by the terminal 1 to the terminal 4, and the routing module in the server selects, from the encoded audio signals in the 4 audio data packets and according to the voice feature information in the 4 audio data packets, 3 target audio signals that contain voice, namely the encoded audio signals sent by the terminal 1, the terminal 2, and the terminal 4. The decoding module in the server decodes the 3 selected target audio signals to obtain 3 PCM digital signals, namely the PCM digital signal 1, the PCM digital signal 2, and the PCM digital signal 4.
The audio mixing module in the server mixes the 3 decoded PCM digital signals to obtain a first mixed PCM digital signal; the encoding module in the server encodes the first mixed PCM digital signal to obtain a first mixed audio data packet, and the first mixed audio data packet is sent to the terminal 3 and the terminal 5, as specifically shown in fig. 6.
The audio mixing module in the server mixes the PCM digital signal 2 and the PCM digital signal 4 to obtain a second mixed PCM digital signal corresponding to the terminal 1; the encoding module in the server encodes the second mixed PCM digital signal corresponding to the terminal 1 to obtain a second mixed audio data packet corresponding to the terminal 1, and the second mixed audio data packet corresponding to the terminal 1 is sent to the terminal 1, as shown in fig. 7.
The audio mixing module in the server mixes the PCM digital signal 1 and the PCM digital signal 4 to obtain a second mixed PCM digital signal corresponding to the terminal 2; the encoding module in the server encodes the second mixed PCM digital signal corresponding to the terminal 2 to obtain a second mixed audio data packet corresponding to the terminal 2, and sends the second mixed audio data packet corresponding to the terminal 2 to the terminal 2, as specifically shown in fig. 7.
The audio mixing module in the server mixes the PCM digital signal 1 and the PCM digital signal 2 to obtain a second mixed PCM digital signal corresponding to the terminal 4; the encoding module in the server encodes the second mixed PCM digital signal corresponding to the terminal 4 to obtain a second mixed audio data packet corresponding to the terminal 4, and sends the second mixed audio data packet corresponding to the terminal 4 to the terminal 4, as specifically shown in fig. 7.
Because different mixed audio data packets are generated and sent for different terminals, a terminal that does not participate in the mixing receives the audio signals sent by all terminals participating in the mixing, while a terminal that participates in the mixing receives all mixed audio signals except the one it sent itself. In this way, users who are not speaking can hear the voices of the other, speaking users, while a speaking user is prevented from hearing an echo of his or her own voice.
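A sketch of this per-terminal rule follows; the dictionary layout, with the full mix stored under the key None for terminals outside the selection, is an illustrative convention.

    import numpy as np

    def build_mixes(decoded: dict) -> dict:
        # decoded: terminal id -> int16 PCM frame of that terminal's target audio signal.
        def mix(frames):
            acc = sum(f.astype(np.int32) for f in frames)
            return np.clip(acc, -32768, 32767).astype(np.int16)

        mixes = {None: mix(decoded.values())}  # first mix, for non-mixed terminals
        for tid, pcm in decoded.items():
            others = [p for t, p in decoded.items() if t != tid]
            # second mix: everything except the terminal's own signal
            mixes[tid] = mix(others) if others else np.zeros_like(pcm)
        return mixes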
In one possible implementation, target audio signals sent by M terminals are sequentially decoded, mixed and encoded to obtain mixed audio data packets, and the mixed audio data packets are sent to each terminal in call connection with the server.
Illustratively, as shown in fig. 8, suppose M is 3 and 5 terminals are in call connection with the server, namely the terminal 1 to the terminal 5, where the terminal 1 to the terminal 4 each send an audio data packet to the server, the audio data packets including voice feature information and encoded audio signals. The terminal 5 has turned off its microphone and does not send audio data packets to the server. The server receives the audio data packets sent by the terminal 1 to the terminal 4, and the routing module in the server selects, from the encoded audio signals in the 4 audio data packets and according to the voice feature information in the 4 audio data packets, 3 target audio signals that contain voice, namely the encoded audio signals sent by the terminal 1, the terminal 2, and the terminal 4. The decoding module in the server decodes the 3 selected target audio signals to obtain 3 PCM digital signals, namely the PCM digital signal 1, the PCM digital signal 2, and the PCM digital signal 4. The audio mixing module in the server mixes the 3 decoded PCM digital signals to obtain a mixed PCM digital signal, and the encoding module in the server encodes the mixed PCM digital signal to obtain a mixed audio data packet, which is sent to the terminal 1 to the terminal 5.
Optionally, the server further includes a packet cache module and an audio cache module, where the packet cache module in the server is configured to cache the target audio signal selected by the routing module in the server, and the maximum cache number may be a fixed number, or may be automatically adjusted according to a network load condition. The audio buffer module in the server is used for buffering the audio digital signals obtained after decoding by the decoding module in the server, and the maximum buffer amount can be a fixed amount or can be automatically adjusted according to the network load condition.
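Both caches can be realized as bounded FIFO buffers; the sketch below uses a fixed default capacity, with a resize hook standing in for the load-based adjustment, whose policy the embodiment leaves open.

    from collections import deque

    class BoundedCache:
        def __init__(self, max_items: int = 50):  # default capacity is an assumption
            self._buf = deque(maxlen=max_items)

        def push(self, item):
            self._buf.append(item)  # the oldest entry is evicted when full

        def resize(self, max_items: int):
            # e.g. called when the network load changes
            self._buf = deque(self._buf, maxlen=max_items)

        def snapshot(self):
            return list(self._buf)  # for packet-loss recovery and smoothing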
Illustratively, as shown in fig. 9, suppose M is 3 and 5 terminals are in call connection with the server, namely the terminal 1 to the terminal 5, where the terminal 1 to the terminal 4 each send an audio data packet to the server, the audio data packets including voice feature information and encoded audio signals. The terminal 5 has turned off its microphone and does not send audio data packets to the server. The server receives the audio data packets sent by the terminal 1 to the terminal 4, and the routing module in the server selects, from the encoded audio signals in the 4 audio data packets and according to the voice feature information in the 4 audio data packets, 3 target audio signals that contain voice, namely the encoded audio signals sent by the terminal 1, the terminal 2, and the terminal 4. The packet cache module in the server caches the encoded audio signals sent by the terminal 1, the terminal 2, and the terminal 4. The decoding module in the server decodes the 3 selected target audio signals to obtain 3 PCM digital signals, namely the PCM digital signal 1, the PCM digital signal 2, and the PCM digital signal 4. The audio cache module in the server caches the PCM digital signal 1, the PCM digital signal 2, and the PCM digital signal 4. The audio mixing module in the server mixes the 3 decoded PCM digital signals to obtain a mixed PCM digital signal, and the encoding module in the server encodes the mixed PCM digital signal to obtain a mixed audio data packet. Caching the audio signals before and after decoding in the packet cache module and the audio cache module, on the one hand, makes it easy to recover audio data packets when packet loss occurs and, on the other hand, effectively resists network jitter and delay, so that the audio sounds more continuous.
Since the server mixes the target audio signals into one mixed audio signal, a user at the receiving terminal cannot tell from which terminal a voice in the mixed audio signal comes. To help the user distinguish the sources of the voices in the mixed audio signal, the mixed audio data packet in the embodiment of the present application further includes the voice flags corresponding to the target audio signals.
Specifically, the server receives the audio data packets sent by the N terminals, and an unpacking module in the server unpacks each received audio data packet to obtain the encoded audio signals and the voice feature information. When the received voice feature information includes the audio energy and the voice flag, the unpacking module in the server sends the audio energy and the voice flag to the routing module in the server. When the received voice feature information includes the audio energy but not the voice flag, the unpacking module in the server sends the encoded audio signal to an analysis module in the server; the analysis module performs voice detection on the encoded audio signal to obtain the voice flag and then sends the voice flag to the routing module in the server, while the unpacking module sends the audio energy to the routing module in the server.
The routing module in the server selects the target audio signals sent by the M terminals from the encoded audio signals in the audio data packets based on the voice feature information, and at the same time sends the voice flag corresponding to each target audio signal to the packing module of the server. The decoding module in the server decodes the M target audio signals to obtain M PCM digital signals. The audio mixing module and the encoding module in the server mix and encode the M PCM digital signals to obtain an encoded first mixed audio signal. For each target terminal among the M terminals, the audio mixing module and the encoding module in the server sequentially mix and encode the PCM digital signals corresponding to the other M-1 terminals, excluding the PCM digital signal corresponding to that target terminal, to obtain an encoded second mixed audio signal.
The packing module of the server packs the encoded first mixed audio signal together with the voice flags corresponding to the M target audio signals to obtain the first mixed audio data packet, and packs the encoded second mixed audio signal together with the voice flags corresponding to the M-1 target audio signals to obtain the second mixed audio data packet. The packing module of the server may specifically add the voice flags to a specified field in the RTP packet header. After the terminal receives a mixed audio data packet, it determines which users are speaking according to the voice flags, and then displays the avatars of the speaking users in a highlighted or bouncing form.
Exemplarily, as shown in fig. 10, suppose M is 3 and 5 terminals are in call connection with the server, namely the terminal 1 to the terminal 5, where the terminal 1 to the terminal 4 each send an audio data packet to the server, the audio data packets including voice feature information and encoded audio signals; the terminal 5 has turned off its microphone and does not send audio data packets to the server. The server receives the audio data packets sent by the terminal 1 to the terminal 4. The unpacking module in the server unpacks the 4 received audio data packets to obtain the encoded audio signals and the voice feature information. The routing module in the server selects 3 target audio signals from the encoded audio signals in the 4 audio data packets based on the voice feature information, namely the encoded audio signal 1, the encoded audio signal 2, and the encoded audio signal 4, and then sends the voice flag corresponding to each target audio signal to the packing module of the server. The decoding module in the server decodes the 3 target audio signals to obtain 3 PCM digital signals, namely the PCM digital signal 1, the PCM digital signal 2, and the PCM digital signal 4. The audio mixing module and the encoding module in the server sequentially mix and encode the 3 PCM digital signals to obtain an encoded first mixed audio signal. The packing module of the server packs the encoded first mixed audio signal together with the voice flags corresponding to the 3 target audio signals to obtain a first mixed audio data packet. The audio mixing module and the encoding module in the server sequentially mix and encode the PCM digital signal 1 and the PCM digital signal 2 to obtain an encoded second mixed audio signal, and the packing module of the server packs the encoded second mixed audio signal together with the voice flags corresponding to the encoded audio signal 1 and the encoded audio signal 2 to obtain a second mixed audio data packet. Since one terminal needs to receive the audio signals of at most 4 other terminals, a 4-bit integer field in the RTP packet can be used to store the voice flags, where 0 indicates no voice and 1 indicates voice.
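Packing and unpacking such a 4-bit field can be done with simple bit operations, as sketched below; the bit order (bit 0 = first mixed terminal) is an assumption of this sketch.

    def pack_voice_flags(flags: list) -> int:
        # flags: up to four 0/1 voice flags, one per terminal in the mix.
        value = 0
        for i, f in enumerate(flags[:4]):
            value |= (f & 1) << i
        return value

    def unpack_voice_flags(value: int, count: int) -> list:
        return [(value >> i) & 1 for i in range(count)]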
The server sends the first mixed audio data packet to the terminal 3 and the terminal 5. After receiving the first mixed audio data packet, the terminal 3 and the terminal 5 display the user avatars corresponding to the terminal 1, the terminal 2, and the terminal 4 in highlighted form; as specifically shown in fig. 11a, the users corresponding to the terminal 1, the terminal 2, and the terminal 4 are the user 1, the user 2, and the user 4 participating in the conference, respectively. The server sends the second mixed audio data packet to the terminal 4; after receiving it, the terminal 4 displays the user avatars corresponding to the terminal 1 and the terminal 2 in highlighted form, as specifically shown in fig. 11b, where the users corresponding to the terminal 1 and the terminal 2 are the user 1 and the user 2 participating in the conference, respectively.
The server adds the voice flags corresponding to the target audio signals to the mixed audio data packet and sends the mixed audio data packet to the terminals in call connection with the server; a terminal receiving the packet displays the avatar of each speaking user in a highlighted, shaking, or similar form, so that every user knows in time who is speaking, which improves the interactive experience.
In order to better explain the embodiment of the present application, an audio processing method provided by the embodiment of the present application is introduced below with reference to a specific implementation scenario, where the method is interactively executed by a terminal and a server, where the server includes an unpacking module, a routing module, a packet caching module, a decoding module, an audio caching module, a mixing module, an encoding module, and a packing module, and each module may be implemented on an independent server host or an independent CPU hardware, or may be implemented by being integrated on one server or CPU hardware, and the embodiment of the present application is not particularly limited.
Suppose M is 3 and 5 terminals are in call connection with the server, namely the terminal 1 to the terminal 5, where the terminal 5 has turned off its microphone and does not send audio data packets to the server. The terminal 1 to the terminal 4 each send an audio data packet to the server; the audio data packets include encoded audio signals and voice feature information, the voice feature information being obtained by the terminal performing voice feature extraction on the collected audio signal. The process by which the terminal 1 generates an audio data packet is described below, as shown in fig. 12. The terminal 1 collects an audio signal through a microphone, performs echo cancellation and denoising on the audio signal, and then performs voice feature extraction on the audio signal from which echo and noise have been removed, to obtain the voice feature information. The terminal 1 performs automatic gain control on the collected audio signal to adjust it to a suitable volume, then compresses and encodes the audio signal to obtain an encoded audio signal, packs the encoded audio signal together with the voice feature information, and sends the packed audio data packet to the server over the RTP protocol, the voice feature information being embedded in a specified field in the RTP packet header.
The server receives the audio data packets sent by the terminal 1 to the terminal 4, and an unpacking module in the server unpacks the 4 received audio data packets to obtain the encoded audio signals and the voice feature information, where the voice feature information includes the audio energy and the voice flag. The routing module in the server first screens out the encoded audio signals whose voice flags indicate voice; suppose the voice flags corresponding to all 4 encoded audio signals indicate voice, so the 4 encoded audio signals are sorted in descending order of audio energy. Suppose the sorted order is: the encoded audio signal 1 sent by the terminal 1, the encoded audio signal 2 sent by the terminal 2, the encoded audio signal 4 sent by the terminal 4, and the encoded audio signal 3 sent by the terminal 3; the encoded audio signal 1, the encoded audio signal 2, and the encoded audio signal 4 are then taken as the target audio signals. The packet cache module in the server caches the 3 target audio signals, and the routing module in the server sends the voice flags corresponding to the 3 target audio signals to the packing module of the server. The decoding module in the server decodes the 3 target audio signals to obtain 3 PCM digital signals, namely the PCM digital signal 1, the PCM digital signal 2, and the PCM digital signal 4. The audio cache module in the server caches the PCM digital signal 1, the PCM digital signal 2, and the PCM digital signal 4.
The audio mixing module in the server mixes the 3 decoded PCM digital signals to obtain a first mixed PCM digital signal, and the encoding module in the server encodes the first mixed PCM digital signal to obtain a first encoded mixed audio signal. The packing module of the server packs the first encoded mixed audio signal together with the voice flags corresponding to the 3 target audio signals to obtain a first mixed audio data packet, and sends the first mixed audio data packet to the terminal 3 and the terminal 5, as shown in fig. 13. After receiving the first mixed audio data packet, the terminal 3 and the terminal 5 display the user avatars corresponding to the terminal 1, the terminal 2, and the terminal 4 in highlighted form.
The audio mixing module in the server mixes the PCM digital signal 2 and the PCM digital signal 4 to obtain a second mixed PCM digital signal corresponding to the terminal 1, and the encoding module in the server encodes it to obtain a second encoded mixed audio signal corresponding to the terminal 1. The packing module of the server packs the second encoded mixed audio signal corresponding to the terminal 1 together with the voice flags respectively corresponding to the encoded audio signal 2 sent by the terminal 2 and the encoded audio signal 4 sent by the terminal 4, to obtain a second mixed audio data packet corresponding to the terminal 1, and sends it to the terminal 1, as specifically shown in fig. 14. After receiving the second mixed audio data packet, the terminal 1 displays the user avatars corresponding to the terminal 2 and the terminal 4 in highlighted form.
The audio mixing module in the server mixes the PCM digital signal 1 and the PCM digital signal 4 to obtain a second mixed PCM digital signal corresponding to the terminal 2, and the encoding module in the server encodes it to obtain a second encoded mixed audio signal corresponding to the terminal 2. The packing module of the server packs the second encoded mixed audio signal corresponding to the terminal 2 together with the voice flags respectively corresponding to the encoded audio signal 1 sent by the terminal 1 and the encoded audio signal 4 sent by the terminal 4, to obtain a second mixed audio data packet corresponding to the terminal 2, and sends it to the terminal 2, as specifically shown in fig. 14. After receiving the second mixed audio data packet, the terminal 2 displays the user avatars corresponding to the terminal 1 and the terminal 4 in highlighted form.
The audio mixing module in the server mixes the PCM digital signal 1 and the PCM digital signal 2 to obtain a second mixed PCM digital signal corresponding to the terminal 4, and the encoding module in the server encodes it to obtain a second encoded mixed audio signal corresponding to the terminal 4. The packing module of the server packs the second encoded mixed audio signal corresponding to the terminal 4 together with the voice flags respectively corresponding to the encoded audio signal 1 sent by the terminal 1 and the encoded audio signal 2 sent by the terminal 2, to obtain a second mixed audio data packet corresponding to the terminal 4, and sends it to the terminal 4, as specifically shown in fig. 14. After receiving the second mixed audio data packet, the terminal 4 displays the user avatars corresponding to the terminal 1 and the terminal 2 in highlighted form.
The terminal performs voice feature extraction on the collected audio signal to obtain the voice feature information, and then sends an audio data packet containing the voice feature information to the server; the server obtains the voice feature information directly from the received audio data packets and performs routing based on it, so that the server does not need to decode every audio stream or calculate voice features itself, which reduces its CPU consumption and improves the routing effect.
Based on the same technical concept, an embodiment of the present application provides a server, as shown in fig. 15, where the server 1500 includes:
the receiving module 1501 is configured to receive the audio data packets sent by N terminals, where each audio data packet includes voice feature information and an encoded audio signal, and the voice feature information is obtained by the terminal performing voice feature extraction on the collected audio signal;
a screening module 1502, configured to select, from the encoded audio signals in the audio data packets, the target audio signals sent by M terminals according to the voice feature information corresponding to the encoded audio signal in each audio data packet, where M is a positive integer smaller than N;
a first processing module 1503, configured to perform mixing processing based on the target audio signals sent by the M terminals.
Optionally, the voice feature information at least includes audio energy, and the audio energy is energy of voice in the audio signal.
Optionally, the voice feature information further includes a voice flag, where the voice flag is obtained by performing voice detection on the acquired audio signal by the terminal and sending the audio signal to the server, or is obtained by performing voice detection on the received encoded audio signal by the server.
Optionally, the screening module 1502 is specifically configured to:
screening out, from the encoded audio signals in the audio data packets, the audio signals whose voice flag indicates voice;
sorting the screened audio signals in descending order of audio energy;
determining the audio signals ranked in the top M as the target audio signals.
Optionally, the first processing module 1503 is specifically configured to:
sequentially decoding, mixing and encoding target audio signals sent by the M terminals to obtain a first mixed audio data packet, and respectively sending the first mixed audio data packet to each other terminal except the M terminals;
and, for each target terminal among the M terminals, sequentially decoding, mixing, and encoding the target audio signals sent by the other M-1 terminals, excluding the target audio signal sent by that target terminal, to obtain a second mixed audio data packet, and sending the second mixed audio data packet to that target terminal.
Optionally, the first mixed audio data packet further includes the voice flags corresponding to the M target audio signals, and the second mixed audio data packet further includes the voice flags corresponding to the M-1 target audio signals.
Optionally, the first processing module 1503 is specifically configured to:
and sequentially decoding, mixing and encoding target audio signals sent by the M terminals to obtain mixed audio data packets, and respectively sending the mixed audio data packets to each terminal connected with the server in a call mode, wherein the mixed audio data packets further comprise voice signs corresponding to the M target audio signals.
Based on the same technical concept, an embodiment of the present application provides a terminal, as shown in fig. 16, where the terminal 1600 includes:
the feature extraction module 1601 is configured to perform voice feature extraction on the collected audio signal to obtain voice feature information;
a second processing module 1602, configured to perform automatic gain control and encoding on the collected audio signal and then pack the encoded audio signal together with the voice feature information to obtain an audio data packet;
the sending module 1603 is configured to send the audio data packet to a server, so that the server selects, according to the voice feature information corresponding to the encoded audio signals in the audio data packets sent by N terminals, the target audio signals sent by M terminals from the encoded audio signals in the audio data packets, where M is a positive integer smaller than N, and performs mixing processing based on the target audio signals sent by the M terminals.
Based on the same technical concept, an embodiment of the present application provides an audio processing system, as shown in fig. 17, where the audio processing system 1700 includes:
a server 1701 and N terminals 1702;
each terminal 1702 of the N terminals 1702 is configured to perform voice feature extraction on the collected audio signal to obtain voice feature information; to perform automatic gain control and encoding on the collected audio signal and then pack the encoded audio signal together with the voice feature information to obtain an audio data packet; and to send the audio data packet to the server 1701;
the server 1701 is configured to select, according to the voice feature information corresponding to the encoded audio signals in the audio data packets sent by the N terminals 1702, the target audio signals sent by M terminals 1702 from the encoded audio signals in the audio data packets, where M is a positive integer smaller than N, and to perform mixing processing based on the target audio signals sent by the M terminals 1702.
Based on the same technical concept, an embodiment of the present application provides a computer device, as shown in fig. 18, including at least one processor 1801 and a memory 1802 connected to the at least one processor. The specific connection medium between the processor 1801 and the memory 1802 is not limited in this embodiment; in fig. 18, the processor 1801 and the memory 1802 are connected through a bus, by way of example. The bus may be divided into an address bus, a data bus, a control bus, and the like.
In this embodiment, the memory 1802 stores instructions executable by the at least one processor 1801, and the at least one processor 1801 may execute the steps of the audio processing method described above by executing the instructions stored in the memory 1802.
The processor 1801 is the control center of the computer device; it may be connected to various parts of the computer device through various interfaces and lines, and processes audio by running or executing the instructions stored in the memory 1802 and calling the data stored in the memory 1802. Optionally, the processor 1801 may include one or more processing units, and the processor 1801 may integrate an application processor and a modem processor, where the application processor mainly handles the operating system, the user interface, application programs, and the like, and the modem processor mainly handles wireless communication. It should be understood that the modem processor may also not be integrated into the processor 1801. In some embodiments, the processor 1801 and the memory 1802 may be implemented on the same chip; in other embodiments, they may be implemented separately on independent chips.
The processor 1801 may be a general-purpose processor, such as a Central Processing Unit (CPU), a digital signal processor, an Application Specific Integrated Circuit (ASIC), a field programmable gate array or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component, or any combination thereof, and may implement or perform the methods, steps, and logic blocks disclosed in the embodiments of the present application. A general-purpose processor may be a microprocessor or any conventional processor. The steps of the methods disclosed in connection with the embodiments of the present application may be executed directly by a hardware processor, or by a combination of hardware and software modules in the processor.
The memory 1802, as a non-volatile computer-readable storage medium, may be used to store non-volatile software programs, non-volatile computer-executable programs, and modules. The memory 1802 may include at least one type of storage medium, for example a flash memory, a hard disk, a multimedia card, a card-type memory, a Random Access Memory (RAM), a Static Random Access Memory (SRAM), a Programmable Read-Only Memory (PROM), a Read-Only Memory (ROM), an Electrically Erasable Programmable Read-Only Memory (EEPROM), a magnetic memory, a magnetic disk, or an optical disk. The memory 1802 may also be any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer, but is not limited thereto. The memory 1802 in the embodiments of the present application may also be a circuit or any other device capable of performing a storage function, for storing program instructions and/or data.
Based on the same inventive concept, embodiments of the present application provide a computer-readable storage medium storing a computer program executable by a computer device, which, when the program is run on the computer device, causes the computer device to perform the steps of the audio processing method described above.
While the preferred embodiments of the present application have been described, those skilled in the art may make additional alterations and modifications to these embodiments once they learn of the basic inventive concept. Therefore, the appended claims are intended to be interpreted as including the preferred embodiments and all alterations and modifications that fall within the scope of the present application.
It will be apparent to those skilled in the art that various changes and modifications may be made in the present application without departing from the spirit and scope of the application. Thus, if such modifications and variations of the present application fall within the scope of the claims of the present application and their equivalents, the present application is intended to include such modifications and variations as well.