CN104167210A


Info

Publication number: CN104167210A
Application number: CN201410414450.5A
Authority: CN
Inventors: 王田; 蔡奕侨; 钟必能; 陈永红; 田晖; 张国亮
Original assignee: Huaqiao University
Current assignee: Huaqiao University
Priority date: 2014-08-21
Filing date: 2014-08-21
Publication date: 2014-11-26

Abstract

Provided is a lightweight multi-party conference audio mixing method and device. The method comprises the following steps: (1) after a client encodes speech with an AMR encoder, voice PCM data and a data length are obtained; the encoded voice PCM data are split into frames, the speech energy value of each frame is computed, each frame is classified as a speech frame or a non-speech frame according to its speech energy value and data length, and the probability value of speech frames in the voice PCM data is obtained statistically; and (2) a server selects, according to the received speech probability values, the current voice streams of the two speakers with the highest speech probability values, decides from the two probability values whether to mix the at most two selected voice streams using the superposition principle, and finally forwards the mixed voice packet. The method compensates for the weak computing capability of portable devices such as mobile phones while greatly reducing the computation the server needs for mixing, and can be widely used in multimedia multi-party conference systems.

Description

Translated from Chinese

A lightweight multi-party conference audio mixing method and device

Technical field

The invention relates to the field of multi-party conference communication, and in particular to a lightweight multi-party conference audio mixing method and device.

Background

In a multi-party video conferencing system, audio mixing is an important technology. Mixing combines the audio from multiple sources into a single output stream according to the audio superposition principle, so that the receiver perceives a conversation among several participants.

Audio mixing can be implemented on the media controller (the server) or on the terminal (the client).

In server-side mixing, each client encodes its PCM voice signal with an encoder and sends it to the server. The server first decodes the audio from the multiple sources, mixes it into a single stream according to the superposition principle, and then re-encodes the result for output. Because the server must decode several streams and then encode again, both the amount of computation and the time complexity are high, which leads to high latency and limits the applicability of this scheme.

In terminal-side mixing, each client encodes its PCM voice signal and sends it to the server, the server forwards the audio of each terminal to all terminals other than the source, and each terminal mixes all the audio streams it receives. The computational load of mixing falls on the terminals, and this scheme also puts more pressure on the network. First, the computation on each terminal increases, and mobile terminals with weak computing capability cannot bear the load of mixing. Second, the voice packets of every terminal must be forwarded to all terminals other than the source, which consumes network bandwidth.

Other schemes require no encoding or decoding: the terminal sends voice packets directly to the server, which then mixes them. Because the terminal sends the voice packets without encoding them, network bandwidth is heavily consumed.

Summary of the invention

The main purpose of the invention is to propose a novel, simple, fast, real-time and lightweight multi-party conference audio mixing method and device that meets the practical requirements of multi-party conferences while taking into account the characteristics of small portable devices such as mobile phones.

The invention adopts the following technical scheme:

A lightweight multi-party conference audio mixing method, characterized in that: 1) a client encodes speech with an AMR encoder to obtain voice PCM data and a data length, splits the encoded voice PCM data into frames, computes the speech energy value of each frame, and determines from the frame's speech energy value and its data length whether the frame is a speech frame or a non-speech frame, thereby computing the probability value of speech frames in the voice PCM data; 2) a server selects, from the received speech probability values, the voice streams of the two speakers whose current speech probability values are highest, decides according to the magnitudes of the two speech probability values whether to mix the at most two selected voice streams using the superposition principle, and finally forwards the mixed voice packet.

Preferably, the following is preset: the client captures one frame of the voice signal at regular intervals, each frame contains m sample values, and the energy of each sample value is r_i; the statistics window contains n consecutive frames of the voice signal, and the relative energy reference value of the current frame is E_refer. Step 1) specifically comprises:

1.1) The client takes as input the voice PCM data and the output length after AMR encoding, and computes the energy value of the current frame of voice PCM data;

1.2) Determine whether the output length of the current frame after AMR encoding equals 31. If so, record the energy value of the frame as a speech energy reference value, classify the frame as a speech frame, add it to the statistics window, and go to step 1.4). If not, record the energy value of the frame as a non-speech energy reference value and go to step 1.3);

1.3) Determine whether the energy value of the current frame is greater than its relative energy reference value E_refer. If so, classify the frame as a speech frame; if not, classify it as a non-speech frame. Add the frame to the new statistics window and go to step 1.4);

1.4) Determine whether the statistics window is full. If so, compute the proportion of speech frames in the statistics window and express it as a speech probability value from 0 to 100. If not, move on to the next frame and return to step 1.1).
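The decisions in steps 1.2) to 1.4) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patent's implementation: the names `classify_frame` and `SpeechWindow` are hypothetical, the frame energy and E_refer are assumed to be computed elsewhere, and clearing the window after it fills is an assumption, since the text does not say whether the window restarts or slides.

```python
SPEECH_NSIZE = 31  # AMR output length that step 1.2) treats as speech


def classify_frame(nsize, energy, e_refer):
    """Return True if the frame counts as a speech frame (steps 1.2 and 1.3)."""
    if nsize == SPEECH_NSIZE:
        return True
    return energy > e_refer


class SpeechWindow:
    """Collects n frame decisions and reports a 0-100 speech probability."""

    def __init__(self, n):
        self.n = n
        self.flags = []

    def add(self, is_speech):
        """Add one decision; return the probability when the window fills, else None."""
        self.flags.append(bool(is_speech))
        if len(self.flags) < self.n:
            return None
        probability = 100 * sum(self.flags) // self.n
        self.flags = []  # assumption: start a fresh window after reporting
        return probability
```

A caller would feed one decision per captured frame and send the returned value to the server whenever it is not `None`.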

Preferably, let E_noise be the maximum of the non-speech energy reference values over the n consecutive frames preceding the current frame, and let E_voice be the maximum of the speech energy reference values. The relative energy reference value E_refer of the current frame is then computed by the following formula:

E_refer = E_noise + (E_voice - E_noise) / 10.
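The formula places the threshold one tenth of the way from the noise ceiling to the speech ceiling. A small Python sketch makes the arithmetic concrete; the function name `relative_energy_reference` and the energy values in the usage note are illustrative assumptions, not values from the patent.

```python
def relative_energy_reference(noise_refs, voice_refs):
    """E_refer = E_noise + (E_voice - E_noise) / 10, using the maxima of each list."""
    e_noise = max(noise_refs)  # largest non-speech energy reference
    e_voice = max(voice_refs)  # largest speech energy reference
    return e_noise + (e_voice - e_noise) / 10
```

For example, with non-speech references 200 and 350 and speech references 4000 and 5350, the call `relative_energy_reference([200, 350], [4000, 5350])` gives 350 + (5350 - 350) / 10 = 850.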

Preferably, step 2) is specifically as follows:

2.1) The server receives the speech probability values sent by the clients and selects the two voice streams F1 and F2 with the highest speech probability values, P1 and P2 respectively, where P1 > P2;

2.2) Determine whether P1 > 2P2 holds. If so, output only the voice stream corresponding to P1; if not, mix the two voice streams and output the result.
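The server-side decision in steps 2.1) and 2.2) can be sketched as below. The function name `select_streams` and the dictionary of client ids are hypothetical. How ties between equal probabilities are broken is not specified in the text, so this sketch simply keeps the order produced by Python's stable sort.

```python
def select_streams(probabilities):
    """Return the client ids whose streams should be output (one or two ids).

    probabilities maps a client id to its reported 0-100 speech probability.
    """
    ranked = sorted(probabilities.items(), key=lambda kv: kv[1], reverse=True)
    if len(ranked) < 2:
        return [cid for cid, _ in ranked]
    (id1, p1), (id2, p2) = ranked[0], ranked[1]
    if p1 > 2 * p2:
        return [id1]  # dominant speaker: forward a single stream, no mixing
    return [id1, id2]  # comparable activity: these two streams get mixed
```

With reported values of 90, 30 and 10, only the first client is forwarded because 90 > 2 x 30; with 80, 50 and 5, the first two are mixed because 80 is not greater than 100.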

A lightweight multi-party conference audio mixing device, comprising a client and a server, characterized in that:

The client comprises: an AMR encoder for encoding speech to obtain voice PCM data and a data length; a speech energy calculation unit for computing the speech energy value of each frame of the encoded voice PCM data; a decision unit that determines from the speech energy value and the data length whether a frame is a speech frame or a non-speech frame; and a statistics unit that computes the probability value of speech frames within the statistics window of the voice PCM data;

The server comprises: a receiving and selecting unit for receiving the speech probability values and selecting the voice streams of the two speakers whose current speech probability values are highest; a mixing unit that decides, according to the magnitudes of the two speech probability values, whether to mix the at most two selected voice streams using the superposition principle; and a sending unit for forwarding voice packets.

As the above description shows, the invention has the following advantages over the prior art:

1. Using probability analysis, the client analyzes the voice stream and the server makes decisions based on the received speech probability values. This makes full use of both server and client resources and lets them share the computational load. The algorithm is simple, easy to implement, and scales well.

2. The computational load on both server and client is small, and the response time is short. The client only needs to perform AMR encoding, compute the energy value of each data frame, and decide whether each frame is speech, silence or noise. The server only needs to compare the speech probability values of the clients; most of the time no mixing and encoding is required, and at most two voice streams need to be mixed.

3. It has a wide range of applications and is suitable for lightweight devices such as PDAs and mobile phones.

Brief description of the drawings

Fig. 1 is a flowchart of the client's operation in the invention;

Fig. 2 is a flowchart of the server's operation in the invention.

Detailed description

The invention is further described below through specific embodiments.

To meet the practical requirements of multi-party conferences while taking into account the characteristics of small portable devices such as mobile phones, a novel, simple, fast, real-time and lightweight multi-party conference audio mixing method is proposed. The basic idea is that, given how conference speech works, in the vast majority of cases only one person is speaking, at most two people speak at the same time, and everyone else is listening. The invention therefore selects on the server at most the two voice streams with the highest speaking probability for mixing and sends the mixed audio to the clients, so that the clients do not need to mix and the server's mixing computation is also small.
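Mixing two selected streams by superposition amounts to adding their samples. The sketch below assumes signed 16-bit PCM samples that have already been decoded and clips the sum to the 16-bit range; the sample format, the clipping, and the function name `mix_frames` are all assumptions, since the text specifies neither the sample format nor how overflow is handled.

```python
INT16_MIN, INT16_MAX = -32768, 32767


def mix_frames(frame_a, frame_b):
    """Mix two equal-length frames of 16-bit PCM samples by saturating addition."""
    if len(frame_a) != len(frame_b):
        raise ValueError("frames must have the same number of samples")
    # Superposition: add sample by sample, then clip to the int16 range.
    return [max(INT16_MIN, min(INT16_MAX, a + b)) for a, b in zip(frame_a, frame_b)]
```

Because at most two streams are ever added, the cost per output frame is one addition and one clip per sample, regardless of how many participants are in the conference.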

In the lightweight multi-party conference audio mixing method, the following is preset: the client captures one frame of the voice signal every 20 ms, each frame contains 160 sample values, and the energy of each sample value is r_i; the statistics window contains 20 consecutive frames of the voice signal, and the relative energy reference value of the current frame is E_refer. Let E_noise be the maximum of the non-speech energy reference values over the 20 consecutive frames preceding the current frame, and let E_voice be the maximum of the speech energy reference values. The relative energy reference value E_refer of the current frame is then computed by the following formula:

E_refer = E_noise + (E_voice - E_noise) / 10.

If the current frame is one of the first 20 frames after the session starts, fewer preceding frames are available: for frame 2, the energy value of the preceding frame is used as the relative energy reference value; for frame 3, the corresponding values of frames 1 and 2 are substituted into the formula; and so on.
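One way to handle both the steady state and the start of a session is to keep the references of the most recent frames in a bounded queue. The sketch below is an illustration only: the class name `ReferenceTracker` is hypothetical, and falling back to the single available maximum when only speech or only non-speech references exist is an assumption that generalizes the frame-2 rule described above.

```python
from collections import deque


class ReferenceTracker:
    """Keeps the energy references of the most recent n frames."""

    def __init__(self, n=20):
        self.history = deque(maxlen=n)  # (energy, is_voice_reference) pairs

    def record(self, energy, is_voice_reference):
        """Store the reference of the frame just processed."""
        self.history.append((energy, is_voice_reference))

    def e_refer(self):
        """Return E_refer from the stored frames, or None before the first frame."""
        noise = [e for e, is_voice in self.history if not is_voice]
        voice = [e for e, is_voice in self.history if is_voice]
        if noise and voice:
            e_noise, e_voice = max(noise), max(voice)
            return e_noise + (e_voice - e_noise) / 10
        if noise or voice:
            # assumption: with only one kind of reference, use its maximum directly
            return max(noise or voice)
        return None
```

After a single recorded frame the tracker returns that frame's energy, which matches the rule for frame 2; once both kinds of reference are present it applies the formula to whatever frames are available, up to the last 20.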

The method comprises the following steps:

1) The client encodes speech with an AMR encoder to obtain voice PCM data and a data length, splits the encoded voice PCM data into frames, computes the speech energy value of each frame, and determines from the frame's speech energy value and its data length whether the frame is a speech frame or a non-speech frame, thereby computing the probability value of speech frames in the voice PCM data. After the AMR encoder encodes the voice PCM data, the encoded data and its length are obtained; the length is denoted nsize. According to the rules of AMR encoder output, nsize takes only three values: nsize = 1 indicates silence, nsize = 6 indicates noise, and nsize = 31 indicates speech. This division is not accurate, however: when nsize is 31 the frame is almost always speech, but when nsize is 6 the frame may also be speech. Therefore, when nsize is 6, the energy value of the PCM data must also be analyzed. Referring to Fig. 1, the flow is as follows:

1.3) Determine whether the energy value of the frame is greater than the relative energy reference value E_refer. If so, classify the frame as a speech frame; if not, classify it as a non-speech frame. Add the frame to the statistics window and go to step 1.4)
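The parameters of this embodiment imply an 8 kHz sampling rate (160 samples every 20 ms) and a 400 ms statistics window (20 frames of 20 ms). The sketch below records those constants together with the three nsize values. The text available here does not give the frame energy formula, so `frame_energy` assumes a sum of squared sample values; that choice and all names in the block are assumptions made for illustration.

```python
FRAME_MS = 20            # capture interval per frame
SAMPLES_PER_FRAME = 160  # samples in one frame
WINDOW_FRAMES = 20       # frames per statistics window

SAMPLE_RATE_HZ = SAMPLES_PER_FRAME * 1000 // FRAME_MS  # 8000
WINDOW_MS = WINDOW_FRAMES * FRAME_MS                   # 400

# Coarse state suggested by the AMR output length alone.
AMR_STATE_BY_NSIZE = {1: "silence", 6: "noise", 31: "speech"}


def frame_energy(samples):
    """Assumed energy measure: sum of squared sample values over one frame."""
    return sum(s * s for s in samples)


def needs_energy_check(nsize):
    """Frames whose AMR output length is not 31 go on to the energy comparison."""
    return nsize != 31
```

The nsize lookup alone would mislabel speech that the encoder emitted at length 6, which is why such frames are passed on to the comparison against E_refer.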

2) The server selects, from the received speech probability values, the voice streams of the two speakers whose current speech probability values are highest, decides according to the magnitudes of the two speech probability values whether to mix the at most two selected voice streams using the superposition principle, and finally forwards the mixed voice packet. Referring to Fig. 2, the flow is as follows:

The invention also proposes a lightweight multi-party conference audio mixing device, comprising a client and a server.

The client comprises: an AMR encoder for encoding speech to obtain voice PCM data and a data length; a speech energy calculation unit for computing the speech energy value of each frame of the encoded voice PCM data; a decision unit that determines from the speech energy value and the data length whether a frame is a speech frame or a non-speech frame; and a statistics unit that computes the probability value of speech frames within the statistics window of the voice PCM data.

The device performs encoding on the client and distinguishes speech frames from non-speech frames by combining two factors, the speech energy value and the size of the AMR-encoded data, thereby computing the speech probability value. On the server, the voice streams of the current speakers (at most two) are determined from the speech probability values, the at most two selected streams are mixed using the superposition principle, and the mixed voice packet is forwarded. The method compensates for the weak computing capability of small portable devices such as mobile phones while greatly reducing the computation the server needs for mixing, and can be widely used in multimedia multi-party conference systems.

The above is only a specific embodiment of the invention, and the design concept of the invention is not limited to it. Any non-substantive modification of the invention that makes use of this concept shall be regarded as infringing the scope of protection of the invention.

Claims


1. A lightweight multi-party conference audio mixing method, characterized in that: 1) a client encodes speech with an AMR encoder to obtain voice PCM data and a data length, splits the encoded voice PCM data into frames, computes the speech energy value of each frame, and determines from the frame's speech energy value and its data length whether the frame is a speech frame or a non-speech frame, thereby computing the probability value of speech frames in the voice PCM data; 2) a server selects, from the received speech probability values, the voice streams of the two speakers whose current speech probability values are highest, decides according to the magnitudes of the two speech probability values whether to mix the at most two selected voice streams using the superposition principle, and finally forwards the mixed voice packet.

2. The lightweight multi-party conference audio mixing method according to claim 1, characterized in that the following is preset: the client captures one frame of the voice signal at regular intervals, each frame contains m sample values, and the energy of each sample value is r_i; the statistics window contains n consecutive frames of the voice signal, and the relative energy reference value of the current frame is E_refer; step 1) specifically comprises:

1.2) Determine whether the output length of the current frame after AMR encoding equals 31. If so, record the energy value of the frame as a speech energy reference value, classify the frame as a speech frame, add it to the statistics window, and go to step 1.4). If not, record the energy value of the frame as a non-speech energy reference value and go to step 1.3);

1.3) Determine whether the energy value of the current frame is greater than its relative energy reference value E_refer. If so, classify the frame as a speech frame; if not, classify it as a non-speech frame. Add the frame to the new statistics window and go to step 1.4);

1.4) Determine whether the statistics window is full. If so, compute the proportion of speech frames in the statistics window and express it as a speech probability value from 0 to 100. If not, move on to the next frame and return to step 1.1).

3. The lightweight multi-party conference audio mixing method according to claim 2, characterized in that: E_noise is the maximum of the non-speech energy reference values over the n consecutive frames preceding the current frame, E_voice is the maximum of the speech energy reference values, and the relative energy reference value E_refer of the current frame is computed by the following formula:

E_refer = E_noise + (E_voice - E_noise) / 10.

4. The lightweight multi-party conference audio mixing method according to claim 1, characterized in that step 2) is specifically as follows:

2.1) The server receives the speech probability values sent by the clients and selects the two voice streams F1 and F2 with the highest speech probability values, P1 and P2 respectively, where P1 > P2;

2.2) Determine whether P1 > 2P2 holds. If so, output only the voice stream corresponding to P1; if not, mix the two voice streams and output the result.

5. A lightweight multi-party conference audio mixing device, comprising a client and a server, characterized in that:

The client comprises: an AMR encoder for encoding speech to obtain voice PCM data and a data length; a speech energy calculation unit for computing the speech energy value of each frame of the encoded voice PCM data; a decision unit that determines from the speech energy value and the data length whether a frame is a speech frame or a non-speech frame; and a statistics unit that computes the probability value of speech frames within the statistics window of the voice PCM data;

The server comprises: a receiving and selecting unit for receiving the speech probability values and selecting the voice streams of the two speakers whose current speech probability values are highest; a mixing unit that decides, according to the magnitudes of the two speech probability values, whether to mix the at most two selected voice streams using the superposition principle; and a sending unit for forwarding voice packets.