Exemplary embodiments are described in detail here, with examples shown in the drawings. Where the following description refers to the drawings, the same numbers in different drawings denote the same or similar elements unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all implementations consistent with the present invention.
On the contrary, they are merely examples of devices and methods consistent with some aspects of the present invention as detailed in the appended claims. The terms used in the present invention are for the purpose of describing particular embodiments only and are not intended to limit the present invention. The singular forms "a", "the", and "said" used in the present invention and the appended claims are also intended to include the plural forms, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and includes any or all possible combinations of one or more of the associated listed items. It should be understood that although the terms first, second, third, and so on may be used in the present invention to describe various information, such information should not be limited to these terms. These terms are only used to distinguish information of the same type from one another. For example, without departing from the scope of the present invention, first information may also be referred to as second information, and similarly, second information may also be referred to as first information. Depending on the context, the word "if" as used herein may be interpreted as "upon", "when", or "in response to determining". Voice broadcasting is widely used in many areas of daily life, such as announcements of train information in stations, broadcasts of product promotions in supermarkets, and the payment-received announcements commonly played after an Alipay payment. Voice broadcasting relies on speech synthesis technology, in which characters or words of different syllables are spliced together to form the passage to be broadcast. Some current speech synthesis methods generate simulated speech based on deep learning models.
The speech synthesized by such methods sounds fairly natural, but because they require large amounts of training and computing resources, they are difficult to run on systems with weak processing capabilities, such as embedded systems. At present, systems with weak processing capabilities mainly use a splicing method: the pronunciation of each word is recorded in advance, and the pronunciations of all the words in the sentence to be played are then played back one after another. This method places low demands on the processing capability of the speech synthesis system, but the synthesized speech is of relatively poor quality and sounds unnatural. To solve the problem that speech synthesized by splicing sounds unnatural, this specification provides a speech synthesis method, which can be applied to a device that implements speech synthesis. A flowchart of the method is shown in FIG. 1, and it includes steps S102 to S106: S102, acquiring a voice file for each syllable in the text whose speech is to be synthesized, the voice file storing intensity data for the sampling points of that syllable; S104, acquiring, from the voice files of two adjacent syllables, the intensity data of specified sampling points, where the specified sampling points of the preceding syllable are its last N sampling points and the specified sampling points of the following syllable are its first N sampling points, N being an integer; S106, processing the intensity data of the specified sampling points of the two syllables to obtain the synthesized speech. After the text to be synthesized is received, the voice file of each syllable in the text is obtained according to the content of the text.
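Steps S102 to S106 can be sketched in a few lines. This is only an illustrative assumption of how the pipeline fits together: it assumes each syllable's voice file has already been decoded into a list of normalized intensity samples, and it uses a plain additive overlap at each junction; the function names are not from the specification.

```python
# Illustrative sketch of steps S102 to S106. Each syllable is assumed to be a
# list of normalized intensity samples (0.0 to 1.0); the simple additive
# overlap stands in for the processing described in S106.

def overlap_region(prev_samples, next_samples, n):
    """S104/S106: add the last n samples of the preceding syllable to the
    first n samples of the following syllable."""
    return [a + b for a, b in zip(prev_samples[-n:], next_samples[:n])]

def synthesize(syllables, n):
    """Concatenate syllables, replacing each junction with an n-sample
    processed overlap so adjacent syllables transition smoothly."""
    out = list(syllables[0][:-n])
    for i in range(len(syllables) - 1):
        out += overlap_region(syllables[i], syllables[i + 1], n)
        nxt = syllables[i + 1]
        # Interior syllables also give up their tail to the next junction.
        out += nxt[n:-n] if i + 1 < len(syllables) - 1 else nxt[n:]
    return out
```

Note that the total length of the output is the sum of the syllable lengths minus n samples per junction, since each overlap merges 2n samples into n.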
In some cases the voice files are stored locally, and the speech synthesis device can obtain them directly; in other cases the voice files are stored in the cloud, and the device downloads them when needed. A voice file may be a pre-recorded recording of a syllable, in a format such as WAV or MP3. When a syllable is recorded, the analog sound signal is sampled and converted into binary sample data, yielding the final voice file. When syllables are recorded and saved as voice files, each syllable can be recorded separately, or syllables can be recorded together as a word or idiom. For example, the syllables of the sentence "我喜歡跑步" ("I like running") can be recorded and saved as five separate voice files for "我", "喜", "歡", "跑", and "步", or the words can be combined and recorded so that there are three voice files, "我", "喜歡", and "跑步". Voice files can be recorded according to actual needs, and this specification imposes no limitation. In one embodiment, if the syllables were recorded as word combinations, the text to be synthesized can be segmented into words before the voice files of its syllables are obtained, so that the voice files can be fetched according to the segmentation result. For example, if the text to be synthesized is "我們在吃飯" ("we are eating") and the saved voice files were recorded and stored as the words "我們", "在", and "吃飯", the text can first be segmented so that the voice file corresponding to each word or character can be found. The segmentation can be performed by a word segmentation algorithm.
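The segmentation step described above can be sketched as a greedy longest-match lookup against the set of words for which voice files exist. The specification does not name a particular segmentation algorithm, so this is a hypothetical minimal version; single characters fall through as their own tokens.

```python
# Hypothetical word segmentation sketch: greedily match the longest word that
# has a recorded voice file, so the device can fetch one file per token.

def segment(text, recorded_words):
    """Greedy longest-match segmentation against available voice files."""
    longest = max(map(len, recorded_words))
    tokens, i = [], 0
    while i < len(text):
        for size in range(min(longest, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in recorded_words:
                tokens.append(piece)  # single characters always fall through
                i += size
                break
    return tokens
```

With voice files for "我們", "在", and "吃飯", the text "我們在吃飯" segments into the three tokens the text describes.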
After segmentation, "我們在吃飯" is divided into "我們", "在", and "吃飯", and the voice files of these three words are then obtained for the subsequent speech synthesis. For a device with weak processing capability, such as an embedded system, running a word segmentation algorithm in addition to performing speech synthesis may consume considerable memory and power and slow down processing. To reduce the resource consumption of the speech synthesis device, in one embodiment the word segmentation of the text is performed on the server side. Since the device's voice files are all downloaded from the server, the voice files saved on the server are identical to those on the device, so the server can segment the text to be synthesized according to the voice files and then send the segmented text down to the device. In addition, if the text to be synthesized is Chinese, the large number of Chinese characters means that storing a recording for the pinyin of every character would make the voice files very large and consume a great deal of memory. Instead, only the four tones of the Chinese syllables need to be stored, which reduces the size of the stored voice files and saves memory. In one embodiment, a voice file records the audio duration of the syllable, the intensity data of the sampling points, the sampling frequency, the sampling precision, and/or the number of sampling points. The audio duration is the length of the syllable's pronunciation; the shorter the duration, the more clipped the syllable sounds. The sampling frequency is the number of intensity samples collected per second.
For example, a sampling frequency of 48K means that 48K intensity samples are collected per second. The number of sampling points of a syllable is the product of its audio duration and the sampling frequency; for example, if the syllable "我" has an audio duration of 1.2 s and the sampling frequency is 48K, the syllable has 1.2 × 48K = 57.6K sampling points in total. The sampling precision is the resolution with which the capture card processes sound, and it reflects the precision of the sound waveform's amplitude (that is, the intensity). The higher the sampling precision, the more faithful the recorded and played-back sound. Sampling precision is also called the sampling bit depth. Because the sound signal is stored in binary form, the stored depth may be 8 or 16 bits: with 8 bits the collected intensity values lie between 0 and 255, and with 16 bits they lie between 0 and 65535. The more bits, the higher the sound quality, but the more storage space is required. When processing intensities, the intensity data are usually normalized first; for example, with an 8-bit sampling precision the intensity values lie between 0 and 255, and they are typically normalized to the range 0 to 1 for convenience of subsequent processing.
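The arithmetic above is simple enough to state directly as code. This is only an illustration of the two relationships the text describes: sample count as duration times rate, and 8-bit values scaled into the 0 to 1 range.

```python
# Illustrative helpers for the voice-file parameters described above.

def sample_count(duration_s, rate_hz):
    """Number of sampling points = audio duration x sampling frequency.
    round() guards against floating-point error in the product."""
    return round(duration_s * rate_hz)

def normalize_8bit(raw):
    """Scale raw 8-bit intensity values (0-255) into the range 0.0-1.0."""
    return [v / 255.0 for v in raw]
```

For the "我" example, 1.2 s at 48 kHz gives 57 600 sampling points, matching the 57.6K figure in the text.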
After the voice file of each syllable in the text is obtained, the intensity data of the specified sampling points of each pair of adjacent syllables can be read from the voice files: the specified sampling points of the preceding syllable are its last N sampling points, the specified sampling points of the following syllable are its first N sampling points, and N is an integer. Processing the intensity data of the last N sampling points of the preceding syllable together with those of the first N sampling points of the following syllable yields the synthesized speech. For example, the intensity data of the last 1000 sampling points of the preceding syllable can be processed with the data of the first 1000 sampling points of the following syllable so that the transition between the two syllables is more natural. FIG. 2 is a schematic diagram of speech synthesis for a text. When synthesizing the sentence "我喜歡跑步", the intensities of the specified sampling points of each preceding syllable and each following syllable can be processed one pair at a time to obtain the synthesized speech; the values 4.5% and 5% in the figure represent the ratio of the number of processed sampling points to the number of sampling points of the preceding syllable. By processing the intensity data of the specified sampling points at the junctions of adjacent syllables, synthesized speech with fairly natural transitions is obtained. When processing two adjacent syllables, the characteristics of each syllable must be preserved, so the processed portion should not be too large. The blank space before and after the two syllables must also be considered: if the blank space is too long, the processed speech will contain obvious pauses and the synthesized speech will sound distinctly unnatural.
Taking the above factors into consideration, in one embodiment the number N of sampling points to be processed when determining the specified sampling points may be calculated based on whether the two adjacent syllables form a word or a four-character idiom, the numbers of sampling points of the two syllables, the average intensity of the last M1 sampling points of the preceding syllable, and/or the average intensity of the first M2 sampling points of the following syllable, where M1 and M2 are integers. If the two syllables can form a word or idiom, somewhat more sampling points can be processed, so N can be determined according to whether the two adjacent syllables form a word. The intensity at the beginning and end of each syllable is also an important factor, so N can likewise be calculated from the average intensity of the last M1 sampling points of the preceding syllable or the first M2 sampling points of the following syllable. Furthermore, at a fixed sampling frequency, the number of sampling points reflects the audio duration of each syllable, and the difference between the durations of two adjacent syllables strongly affects the quality of the synthesized speech. If the durations of the two syllables differ greatly, the syllables differ in emphasis and pace, and more sampling points need to be processed; if the durations are similar, fewer sampling points need to be processed. Therefore, the numbers of sampling points of the syllables can also be considered when calculating N.
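The boundary statistics named above (the average intensity over the last M1 samples of the preceding syllable and the first M2 samples of the following syllable) can be computed as follows. The fraction 0.20 follows the 20% figure given later in the text; the function name and signature are illustrative assumptions, and the formula that combines these statistics into N is not reproduced here.

```python
# Illustrative computation of the boundary averages that feed into N.

def boundary_averages(prev_samples, next_samples, frac=0.20):
    """Average intensity of the tail of the preceding syllable and the head
    of the following syllable, each over the given fraction of its samples."""
    m1 = max(1, int(len(prev_samples) * frac))
    m2 = max(1, int(len(next_samples) * frac))
    tail_avg = sum(prev_samples[-m1:]) / m1
    head_avg = sum(next_samples[:m2]) / m2
    return tail_avg, head_avg
```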
To account for the blank space between two adjacent syllables, the average intensities at the beginning and end of the syllables can also be considered when calculating the number of sampling points to process. The average intensity at the end of a syllable is the average intensity of its last M1 sampling points, and the average intensity at the beginning is the average of its first M2 sampling points. M1 and M2 can be set according to the characteristics of the syllables; for example, M1 may be 10% of the total number of sampling points of the preceding syllable and M2 may be 5% of the total of the following syllable, or M1 may be 1000 and M2 may be 2000; this specification imposes no limitation. In one embodiment, after repeated experiments by the applicant, to achieve a better synthesis effect in which adjacent syllables show no obvious pause after synthesis, M1 may be taken as 20% of the total number of audio sampling points of the preceding syllable and M2 as 20% of the total of the following syllable. Further, in one embodiment, the number N of sampling points to be processed can be calculated by the following formula:
Here the value of Nw indicates whether the two adjacent syllables form a word or a four-character idiom, SNpre is the number of sampling points of the preceding syllable, and SNnext is the number of sampling points of the following syllable; the tail average intensity pre is the average intensity of the last M1 sampling points of the preceding syllable, the head average intensity next is the average intensity of the first M2 sampling points of the following syllable, and M1 and M2 are integers. Whether the two adjacent syllables form a word or idiom can be considered when calculating N; to simplify the calculation, this factor is quantized, with different values of Nw representing whether the two syllables form a word or idiom. In general, if the two adjacent syllables can form a word, Nw is larger than when they cannot. In one embodiment, to achieve a better synthesis effect, Nw is taken as 2 if the two adjacent syllables form a word, 1 if they are not part of a word or four-character idiom, and 2 if they lie within a four-character idiom. Of course, the value of Nw can be set according to the specific situation, and this specification imposes no limitation.
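The quantization of Nw described above can be sketched directly. The word and idiom sets and the function name are hypothetical; the values 2 and 1 follow the embodiment in the text.

```python
# Illustrative quantization of the word/idiom factor Nw: 2 when the two
# adjacent syllables form a word or sit inside a four-character idiom,
# otherwise 1.

def word_factor(pair, words, idioms):
    """pair is the two adjacent syllables concatenated, e.g. "我們"."""
    if pair in words or any(pair in idiom for idiom in idioms):
        return 2
    return 1
```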
For example, suppose the syllables "我" and "不" are to be synthesized, where the syllable "我" has 96K sampling points and the syllable "不" has 48K, that is, SNpre = 96K and SNnext = 48K. The two syllables do not form a word, so Nw is taken as 1. The intensities of the last 2K sampling points of "我" give an average tail intensity of 0.3 (pre = 0.3), and the intensities of the first 2K sampling points of "不" give an average head intensity of 0.2 (next = 0.2). Substituting into the formula gives N = 3920: the intensity data of the last 3920 sampling points of the preceding syllable and of the first 3920 sampling points of the following syllable are taken, and processing these intensity data yields the synthesized speech. After the intensity data of the specified sampling points are obtained, the specific way of processing them can be chosen according to the characteristics of the syllables. For example, in some embodiments the intensities of the last N sampling points of the preceding syllable are added directly to the intensities of the first N sampling points of the following syllable to obtain the superimposed intensities. For instance, suppose the last five sampling points of the preceding syllable and the first five sampling points of the following syllable are to be processed, the intensities of the last five sampling points of the preceding syllable being 0.15, 0.10, 0.05, 0.03, and 0.01, and the intensities of the first five sampling points of the following syllable being 0.005, 0.01, 0.04, 0.06, and 0.07. The intensities of the processed, superimposed portion are then 0.155, 0.11, 0.09, 0.09, and 0.08.
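The direct-addition example above can be written as a one-line helper; the values below are the five-sample illustration from the text, with rounding only to suppress floating-point noise.

```python
# Direct addition of the overlapping intensities, as in the example above.

def add_overlap(prev_tail, next_head):
    """Add the tail of the preceding syllable to the head of the following
    syllable, point by point, rounding to 3 decimals for display."""
    return [round(a + b, 3) for a, b in zip(prev_tail, next_head)]
```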
Of course, to obtain a higher-quality and more natural synthesis, in some embodiments the intensities of the last N sampling points of the preceding syllable and of the first N sampling points of the following syllable may each be multiplied by preset weights before being added to obtain the superimposed intensities, where the preset weights are set based on the order of the syllables and the order of the sampling points. When processing the intensities of two adjacent syllables, the intensity of each syllable can be multiplied by a weight before the addition: in the earlier part of the processed region the preceding syllable should dominate, so its weight can be larger there, while in the later part the following syllable should dominate, so its weight can be larger there. For example, suppose the intensities of the last five sampling points of the preceding syllable are 0.5, 0.4, 0.3, 0.2, and 0.1 with weights of 90%, 80%, 70%, 60%, and 50%, and the intensities of the first five sampling points of the following syllable are 0.1, 0.2, 0.3, 0.4, and 0.5 with weights of 10%, 20%, 30%, 40%, and 50%. The processed intensities are then 0.5×90%+0.1×10%, 0.4×80%+0.2×20%, 0.3×70%+0.3×30%, 0.2×60%+0.4×40%, and 0.1×50%+0.5×50%, that is, 0.46, 0.36, 0.3, 0.28, and 0.3. To ensure that the processed syllables do not clip, the intensities of the specified sampling points to be processed are generally kept small; in one embodiment, the ratio of the intensity of a specified sampling point to the maximum sampling-point intensity of its syllable is less than 0.5.
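The weighted superposition described above amounts to a crossfade: the preceding syllable's weight falls across the overlap while the following syllable's weight rises. This sketch reproduces the 90% to 50% and 10% to 50% weight illustration; the function name is an assumption.

```python
# Weighted superposition (crossfade) of the overlapping intensities: each
# sample is a weighted sum, with weights set by syllable order and sample
# position, rounded to 3 decimals for display.

def crossfade(prev_tail, next_head, w_prev, w_next):
    return [round(a * wp + b * wn, 3)
            for a, b, wp, wn in zip(prev_tail, next_head, w_prev, w_next)]
```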
For example, if the maximum intensity among all the sampling points of a syllable is 1, then the intensities of the specified sampling points to be processed are all less than 0.5. Several specific embodiments further explain the speech synthesis method provided in this specification. Suppose a voice device needs to synthesize the sentence "我喜歡跑步". Before synthesis, five voice files containing the pronunciations of the five Chinese characters "我", "喜", "歡", "跑", and "步" are recorded in advance and stored on the server. The beginning of each of the five voice files records its configuration information: a sampling frequency of 48K, a sampling precision of 16 bits, and the audio duration of the pronunciation. The audio durations of "我", "喜", "歡", "跑", and "步" are 1 s, 0.5 s, 1 s, 1.5 s, and 0.8 s respectively. After receiving the text to be synthesized, "我喜歡跑步", the speech synthesis device downloads the voice files of these five syllables from the server. It then processes consecutive pairs of syllables one by one in text order. For example, "我" and "喜" are processed first: the intensities of the last portion of the sampling points of "我" and of the first portion of the sampling points of "喜" must be processed, and before processing, the number of sampling points to process is calculated by the formula for N given above, in which the value of Nw indicates whether the two adjacent syllables form a word or a four-character idiom: Nw is 2 if the two syllables form a word, 1 if they are not part of a word or four-character idiom, and 2 if they lie within a four-character idiom.
SNpre is the number of sampling points of the preceding syllable and SNnext that of the following syllable; the tail average intensity pre is the average intensity of the last 20% of the sampling points of the preceding syllable, and the head average intensity next is the average of the first 20% of the sampling points of the following syllable. Since "我" and "喜" do not form a word or idiom, Nw in the formula is taken as 1. The number of sampling points of a syllable equals the sampling frequency multiplied by the audio duration, so SNpre = 1 × 48K = 48K for "我" and SNnext = 0.5 × 48K = 24K for "喜". The average intensity of the last 20% of the sampling points of "我" is 0.3, and the average intensity of the first 20% of the sampling points of "喜" is 0.1. Substituting these data into the formula gives 711 sampling points to process: the intensity data of the last 711 sampling points are read from the voice file of "我", the intensity data of the first 711 sampling points are read from the voice file of "喜", and the two sets of intensity data are added directly to obtain the processed intensities. "喜" and "歡", "歡" and "跑", and "跑" and "步" are processed in the same way, yielding the synthesized speech for "我喜歡跑步". As another example, suppose the text the voice device needs to synthesize is "我們愛天安門" ("We love Tiananmen") and the voice files were recorded in word form, that is, there are voice files for the three words "我們", "愛", and "天安門", downloaded from the server in advance and saved in a local directory on the voice device.
After the server receives the text "We love Tiananmen" to be synthesized, it segments the text according to the form of the voice files; the segmentation can be completed by a word segmentation algorithm. The text is divided into "we / love / Tiananmen", and the segmented text is then sent down to the speech synthesis device. After receiving the text, the speech synthesis device first obtains the voice files of the three words "we", "love" and "Tiananmen"; the sampling frequency is 48 kHz, the sampling accuracy is 8 bits, and the audio durations of the three word pronunciations are 2 s, 1 s and 3 s. "We" and "love" are processed first. Before processing, the number of sampling points to be processed is calculated according to the following formula, in which the value of Nw indicates whether the two adjacent syllables fall within a word or a four-character idiom: if the two adjacent syllables form a word, Nw is 2; if the two adjacent syllables are within a four-character idiom, Nw is also 2; if the two adjacent syllables are not within a word or four-character idiom, Nw is 1. SNpre represents the number of sampling points of the previous syllable and SNnext the number of sampling points of the next syllable; the end average sound intensity (pre) is the average sound intensity of the last 15% of the sampling points of the previous syllable; the beginning average sound intensity (next) is the average sound intensity of the first 20% of the sampling points of the next syllable; M1 and M2 are integers.
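The server-side segmentation step can be as simple as a greedy longest-match against the vocabulary of recorded voice files. The sketch below assumes such a lexicon-driven approach; the function name and the fallback to single characters are illustrative, and the specification only requires that some word segmentation algorithm divide the text to match the recorded files:

```python
def segment(text, lexicon):
    """Greedy longest-match segmentation against the recorded-file lexicon."""
    longest = max((len(w) for w in lexicon), default=1)
    result, i = [], 0
    while i < len(text):
        # Try the longest candidate first, shrinking until a lexicon
        # entry matches; a single character is always accepted.
        for size in range(min(longest, len(text) - i), 0, -1):
            word = text[i:i + size]
            if size == 1 or word in lexicon:
                result.append(word)
                i += size
                break
    return result
```

With the lexicon from the earlier example, `segment("我們在吃飯", {"我們", "在", "吃飯"})` returns `["我們", "在", "吃飯"]`, matching the recorded word-level voice files.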
From the sampling frequency and audio durations it can be calculated that SNpre = 96K and SNnext = 48K. The average sound intensity of the last 15% of the sampling points of "we" is 0.2, the average sound intensity of the first 20% of the sampling points of "love" is 0.3, and since the two adjacent syllables do not form a word, Nw = 1. Substituting these data into the formula, the number of sampling points to be processed is 5689; that is, the sound intensity data of the last 5689 sampling points of "we" and of the first 5689 sampling points of "love" are obtained from the voice files. After the sound intensity data of the sampling points to be processed are obtained, the sound intensity of each of these sampling points of "we" is multiplied by a certain weight, the sound intensity of each of these sampling points of "love" is multiplied by a certain weight, and the weighted values are added to obtain the sound intensities of the processed portion. "Love" and "Tiananmen" are processed in the same way, yielding the synthesized speech for "we love Tiananmen". Corresponding to the speech synthesis method described above, this specification also provides a speech synthesis apparatus. As shown in FIG. 3, the speech synthesis apparatus 300 includes: an obtaining unit 301, which obtains the voice file of each syllable in the text of the speech to be synthesized, where the voice file stores the sound intensity data of the sampling points of the syllable, and which obtains the sound intensity data of designated sampling points from the voice files of two adjacent syllables, where the designated sampling points of the previous syllable are the last N sampling points of that syllable and the designated sampling points of the next syllable are the first N sampling points of that syllable, N being an integer; and a processing unit 302, which processes the sound intensity data of the designated sampling points of the two syllables to obtain the synthesized speech.
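The weighted boundary processing in the second example can be sketched as a cross-fade: each boundary sampling point of the outgoing and incoming syllables is multiplied by a preset weight before the addition. The specification only says the weights are preset according to the order of the sampling points; the linear fade below, with each overlapping pair of weights summing to 1, is one plausible choice and is an assumption of this sketch:

```python
def crossfade(prev_samples, next_samples, n):
    """Overlap-add with position-dependent weights (linear fade assumed).

    The previous syllable fades out over its last n sampling points while
    the next syllable fades in over its first n, so the two weights of
    each overlapping pair always sum to 1.
    """
    merged = []
    for k in range(n):
        w_in = (k + 1) / (n + 1)   # weight for the incoming syllable
        w_out = 1.0 - w_in         # weight for the outgoing syllable
        merged.append(w_out * prev_samples[-n + k] + w_in * next_samples[k])
    return list(prev_samples[:-n]) + merged + list(next_samples[n:])
```

Unlike the direct addition of the first example, the weighting keeps the overlapped region's intensity in the range of the two inputs, which avoids a loudness bump at the syllable boundary.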
In one embodiment, the voice file records: the audio duration of the syllable, the sound intensity data of the sampling points, the sampling frequency, the sampling accuracy, and/or the number of sampling points. In one embodiment, processing the sound intensity data of the designated sampling points of the two syllables specifically includes: adding the sound intensity data of the last N sampling points of the previous syllable to the sound intensity data of the first N sampling points of the next syllable; or multiplying the sound intensity data of the last N sampling points of the previous syllable and of the first N sampling points of the next syllable by preset weights and then adding them, where the preset weights are set according to the order of the sampling points within the syllables. In one embodiment, the text of the speech to be synthesized is Chinese, and the voice files are voice files in which the four tones of the Chinese character syllables are recorded. In one embodiment, the ratio of the sound intensity data of a designated sampling point to the maximum sound intensity data among the sampling points of the syllable is less than 0.5. In one embodiment, N is calculated based on whether the two adjacent syllables form a word or four-character idiom, the numbers of sampling points of the two adjacent syllables, the average sound intensity of the last M1 sampling points of the previous syllable, and/or the average sound intensity of the first M2 sampling points of the next syllable, where M1 and M2 are integers. In one embodiment, M1 is 20% of the total number of audio sampling points of the previous syllable, and M2 is 20% of the total number of audio sampling points of the next syllable. In one embodiment, if the two adjacent syllables form a word, the conversion coefficient is 2; if the two adjacent syllables are not within a word or four-character idiom, the conversion coefficient is 1.
If the two adjacent syllables are within a four-character idiom, the conversion coefficient is 2. In one embodiment, N is calculated according to the following formula, in which the value of Nw indicates whether the current two adjacent syllables form a word or four-character idiom, SNpre indicates the number of sampling points of the previous syllable, and SNnext indicates the number of sampling points of the next syllable; the end average sound intensity (pre) indicates the average sound intensity of the last M1 sampling points of the previous syllable; the beginning average sound intensity (next) indicates the average sound intensity of the first M2 sampling points of the next syllable; and M1 and M2 are integers. In one embodiment, before the voice file of each syllable in the text of the speech to be synthesized is obtained, the method further includes performing word segmentation processing on the text. In one embodiment, the word segmentation processing of the text is completed on the server side. For the implementation process of the functions and roles of each unit in the above apparatus, refer to the implementation process of the corresponding steps in the above method; details are not repeated here. Since the apparatus embodiments basically correspond to the method embodiments, the relevant parts may refer to the description of the method embodiments. The apparatus embodiments described above are merely illustrative: the units described as separate parts may or may not be physically separate, and a part displayed as a unit may or may not be a physical unit, that is, it may be located in one place or distributed over multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution in this specification, which those of ordinary skill in the art can understand and implement without creative effort.
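The quantities that feed into the calculation of N described above can be derived as in the sketch below. The function names, the word/idiom lookup sets, and the substring-based membership check are illustrative assumptions (the specification does not fix how word membership is detected); the 20% ratio for M1 and M2 follows the embodiment above:

```python
def conversion_coefficient(prev_syl, next_syl, words, idioms):
    """Nw: 2 if the two adjacent syllables fall within a word or a
    four-character idiom, otherwise 1 (per the embodiments above)."""
    pair = prev_syl + next_syl
    if any(pair in w for w in words) or any(pair in i for i in idioms):
        return 2
    return 1

def boundary_averages(prev_samples, next_samples, ratio=0.2):
    """End/beginning average sound intensities over the last M1 and
    first M2 sampling points, with M1 and M2 taken as a fraction
    (20% by default) of each syllable's sampling-point count."""
    m1 = max(1, int(len(prev_samples) * ratio))
    m2 = max(1, int(len(next_samples) * ratio))
    end_avg = sum(prev_samples[-m1:]) / m1
    begin_avg = sum(next_samples[:m2]) / m2
    return end_avg, begin_avg
```

With these inputs in hand, N can be computed by the formula referenced above and passed to the boundary-processing step.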
In addition, this specification also provides a speech synthesis device. As shown in FIG. 4, the speech synthesis device includes a processor 401 and a memory 402; the memory is used to store executable computer instructions, and the processor implements the following steps when executing the computer instructions: obtaining the voice file of each syllable in the text of the speech to be synthesized, where the voice file stores the sound intensity data of the sampling points of the syllable; obtaining the sound intensity data of designated sampling points from the voice files of two adjacent syllables, where the designated sampling points of the previous syllable are the last N sampling points of that syllable and the designated sampling points of the next syllable are the first N sampling points of that syllable, N being an integer; and processing the sound intensity data of the designated sampling points of the two syllables to obtain the synthesized speech. The above descriptions are only preferred embodiments of this specification and are not intended to limit it; any modification, equivalent replacement, or improvement made within the spirit and principles of this specification shall fall within its scope of protection.