The present invention relates to an audio-visual broadcasting system and method, and more particularly to an audio-visual broadcasting system and method that generate an effect voice according to an input message.
When it comes to text-to-speech (TTS) technology, most people think of the Google Assistant voice or Siri; TTS is a voice-broadcasting method well known in the online video industry. In live streaming, for example, to encourage viewers to donate gifts to a host, live-streaming platforms have gradually introduced sound-effect gifts in addition to picture/animation gifts.
However, traditional speech synthesis is usually based on a single sound source, and the resulting voice may sound unnatural or fail to match the voice characteristics of a specific character. Such sound-effect gifts are usually limited to fixed content, such as pre-recorded audio files, and often cannot express what the user really wants to say. If a voice the user likes or is familiar with could be broadcast on the live platform for the other party to hear, it would resonate more strongly. Existing live-streaming platforms focus on the styles of text, stickers, and gifts, actively developing different kinds of gifts, sound effects, and audio stickers, all presented as a single voice paired with a single picture; these sound effects often cannot convey the words the audience wants to express.
Existing speech synthesis technology therefore still has the following shortcomings: it lacks more diversified means of expression, and speech synthesis models merely convert text into speech, lacking sound effects designed for specific characters and situations.
Accordingly, the industry in this technical field needs a new audio-visual broadcasting system and method that combines speech synthesis with real-time transmission, allows real-time interaction among users or between multiple users and a host, and supports speech synthesis with the voice characteristics of multiple characters to provide more natural and fluent voice effects. A message-exchange platform that requires fewer training sentences and no manual annotation, thereby reducing operating costs, while offering selectable customized voices, is an urgent need in this industry.
To solve the above problems, an object of the present invention is to provide an audio-visual broadcasting system that generates an effect voice from an input message for real-time interactive communication among users or between users and hosts.
To achieve the above object, the audio-visual broadcasting system of the present invention comprises: a speech synthesis model including a plurality of characteristic parameters; and a system backend that allows at least one client and at least one host end to connect and log in, and that receives a host number, a user number, a preset voice number, and an input message transmitted by the at least one client, wherein the preset voice number corresponds to one of a plurality of characters provided by a character menu; wherein the speech synthesis model maps the preset voice number to at least some of the characteristic parameters and generates an effect voice according to the mapped characteristic parameters and the input message, so that an audio-visual broadcast provided by the host end contains a host-end image, the user number, and the effect voice.
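The data flow described above can be sketched as an illustrative example (not part of the specification; all names, field layouts, and the dictionary-based "parameter table" are hypothetical stand-ins for the claimed speech synthesis model):

```python
from dataclasses import dataclass

# Hypothetical payload a client might transmit to the system backend.
@dataclass
class GiftMessage:
    host_number: str          # which host's broadcast the effect voice targets
    user_number: str          # sender's user number, shown in the broadcast
    preset_voice_number: int  # index into the character menu
    input_message: str        # text to be synthesized

def synthesize_effect_voice(msg, characteristic_params):
    """Map the preset voice number to its stored characteristic parameters
    and represent the synthesized effect voice as a plain dict."""
    params = characteristic_params[msg.preset_voice_number]
    return {"params": params, "text": msg.input_message}

# usage: preset voice number 1 selects the first character's parameters
params_table = {1: "duck-voice-params", 2: "mouse-voice-params"}
msg = GiftMessage("host42", "userA", 1, "Hello!")
voice = synthesize_effect_voice(msg, params_table)
```

The sketch only shows the routing of the four claimed fields; real synthesis would replace the dict with audio generation.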
According to one aspect of the audio-visual broadcasting system of the present invention, the speech synthesis model is configured at the system backend. The system backend transmits the user number and the effect voice to the host end corresponding to the host number, so that the audio-visual broadcast provided by that host end contains the host-end image, the user number, and the effect voice. The system backend further transmits the input message to the host end corresponding to the host number, so that the audio-visual broadcast provided by that host end contains the host-end image, the user number, the input message, and the effect voice.
According to another aspect of the audio-visual broadcasting system of the present invention, the speech synthesis model is configured at the host end. The system backend transmits the user number and the preset voice number to the host end corresponding to the host number, so that the audio-visual broadcast provided by that host end contains the host-end image, the user number, and the effect voice. The system backend further transmits the input message to the host end corresponding to the host number, so that the audio-visual broadcast provided by that host end contains the host-end image, the user number, the input message, and the effect voice. The audio-visual broadcast provided by the host end further contains a character image corresponding to the preset voice number. The input message is a text message. The speech synthesis model generates the characteristic parameters through supervised neural-network training, and the parameters are respectively used to generate the corresponding effect voices.
To achieve the above object, the present invention further provides an audio-visual broadcasting system comprising: a speech synthesis model including a plurality of characteristic parameters; and a system backend that allows at least one client and at least one host end to connect and log in, and that receives a host number, a user number, a customized voice, and an input message transmitted by the at least one client; wherein the speech synthesis model selects at least one characteristic parameter according to the customized voice and generates an effect voice according to the selected characteristic parameter and the input message, so that an audio-visual broadcast provided by the host end contains a host-end image, the user number, and the effect voice.
According to one aspect of this audio-visual broadcasting system, the speech synthesis model is configured at the system backend or at the host end, and the input message is a text message.
To solve the above problems, the present invention further provides an audio-visual broadcasting system comprising: a speech synthesis model including a plurality of characteristic parameters; and a system backend that allows at least two clients to connect and log in, receives a user number and an input message from the sending client, and receives one of a customized voice and a preset voice number; wherein when the system backend receives the preset voice number from the sending client, the speech synthesis model maps the preset voice number to one of the plurality of characteristic parameters and generates a first effect voice according to the mapped characteristic parameter and the input message, so that the client corresponding to the user number receives the input message and the first effect voice; and wherein when the system backend receives the customized voice from the sending client, the speech synthesis model selects several characteristic parameters according to the customized voice and generates a second effect voice according to the selected characteristic parameters and the input message, so that the client corresponding to the user number receives the input message and the second effect voice.
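The two branches above (preset voice number versus uploaded customized voice) amount to a dispatch on the received payload. A minimal sketch, assuming hypothetical names and representing the characteristic parameters and extraction step as plain Python values:

```python
def generate_effect_voice(payload, params_table, extract_params):
    """Dispatch sketch: a preset voice number indexes a stored parameter set
    (first effect voice), while an uploaded customized voice has parameters
    extracted on the fly (second effect voice)."""
    if "preset_voice_number" in payload:
        params = params_table[payload["preset_voice_number"]]
    else:
        params = extract_params(payload["customized_voice"])
    return {"params": params, "text": payload["input_message"]}

# first effect voice: preset number 2 looked up in the stored table
first = generate_effect_voice(
    {"preset_voice_number": 2, "input_message": "hi"}, {2: "p2"}, None)
# second effect voice: parameters extracted from the uploaded voice
second = generate_effect_voice(
    {"customized_voice": b"wav-bytes", "input_message": "yo"},
    {}, lambda v: "extracted")
```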
The present invention further provides an audio-visual broadcasting method of the audio-visual broadcasting system, in which an audio-visual broadcast provided by at least one host end contains a host-end image, the method comprising: establishing a speech synthesis model including a plurality of characteristic parameters; receiving a host number, a user number, and an input message from at least one client, and receiving one of a customized voice and a preset voice number; when the preset voice number is received, mapping, by the speech synthesis model, the preset voice number to at least some of the characteristic parameters and generating a first effect voice according to the mapped characteristic parameters and the input message, so that the audio-visual broadcast provided by the host end corresponding to the host number contains the host-end image corresponding to the host number and the first effect voice; and when the customized voice is received, selecting, by the speech synthesis model, several characteristic parameters according to the customized voice and generating a second effect voice according to the selected characteristic parameters and the input message, so that the audio-visual broadcast provided by the host end corresponding to the host number contains the host-end image corresponding to the host number and the second effect voice.
The present invention further provides an audio-visual broadcasting method of the audio-visual broadcasting system, wherein the speech synthesis model is configured at a system backend or at the host end corresponding to the host number, and the input message is a text message.
The present invention further provides another audio-visual broadcasting method of the audio-visual broadcasting system, comprising: establishing a speech synthesis model including a plurality of characteristic parameters; receiving a user number and an input message from at least one client, and receiving one of a customized voice and a preset voice number; when the preset voice number is received, mapping, by the speech synthesis model, the preset voice number to one of the plurality of characteristic parameters and generating a first effect voice according to the mapped characteristic parameter and the input message, so that the client corresponding to the user number receives the input message and the first effect voice; and when the customized voice is received from the sending client, selecting, by the speech synthesis model, several characteristic parameters according to the customized voice and generating a second effect voice according to the selected characteristic parameters and the input message, so that the client corresponding to the user number receives the input message and the second effect voice.
In summary, the audio-visual broadcasting system and method of the present invention combine speech synthesis with real-time transmission, allow real-time interaction among users or between multiple users and hosts, and support speech synthesis with the voice characteristics of multiple characters, providing more natural and fluent voice effects together with a message-exchange platform offering selectable customized voices.
1: Client
11: Character menu
12: Input module
13: User interface
14: Input/output module
2: System backend
21: Receive queue module
22: Cloud storage server
23: Model update server
24: Cloud server
3: Host end
32: Input/output module
33: User interface
34: Mixing equipment
4: Speech synthesis model
41: Phoneme encoder
42: Variance adaptor
43: Mel decoder
5: User interface
51: Host live screen
511: Effect voice broadcast
52: Auxiliary list
521: Daily task list
522: Activity list
523: Other hosts list
53: Chat room screen
531: Character menu
532: Text input module
54: Menu list
6: User interface
611: Effect voice broadcast
63: Chat screen
631: Character menu
632: Text input module
64: User avatar
FIG. 1 is an architecture diagram of the audio-visual broadcasting system of the present invention, showing the system backend architecture of the system.
FIG. 2A is a more detailed architecture diagram of the audio-visual broadcasting system of the present invention, showing the system backend combined with the host end and the client.
FIG. 2B is an architecture diagram of another embodiment of the audio-visual broadcasting system of the present invention, in which the speech synthesis model is located at the system backend.
FIG. 2C is an architecture diagram of a further embodiment of the audio-visual broadcasting system of the present invention, showing two clients performing audio-visual broadcasting through the system backend.
FIG. 3A is a block diagram of the speech synthesis model of the audio-visual broadcasting system of the present invention.
FIG. 3B is a block diagram of the speech synthesis model obtaining characteristic parameters from a preset voice number.
FIG. 3C is a block diagram of the speech synthesis model obtaining characteristic parameters from a customized voice.
FIG. 4 is a flow chart of the audio-visual broadcasting system performing audio-visual broadcasting with the preset character menu.
FIG. 5 is a flow chart of the audio-visual broadcasting system performing audio-visual broadcasting with a customized voice.
FIG. 6 is a schematic diagram of one embodiment of the user interface of the audio-visual broadcasting system of the present invention.
FIG. 7 is a schematic diagram of another embodiment of the user interface of the audio-visual broadcasting system of the present invention.
The embodiments of the present invention are described below by way of specific examples.
Referring first to FIG. 1, an architecture diagram of the audio-visual broadcasting system of the present invention is shown. The system backend 2 of the audio-visual broadcasting system allows at least one client 1 and at least one host end 3 to connect and log in, wherein the system backend 2 comprises at least: a receive queue module 21, a cloud storage server 22, and a model update server 23. The system backend 2 receives, through the receive queue module 21, data transmitted from the client 1, including but not limited to: the host number of the destination host end 3, the user number of the client 1, an input text message, and one of a preset voice number and a customized voice. The cloud storage server 22 of the system backend 2 records the received preset voice number or customized voice, and a speech synthesis model 4 generates an effect voice according to that preset voice number or customized voice and the input text message, so that the audio-visual broadcast provided by the host end corresponding to the host number contains the host-end image and the effect voice.
Referring to FIG. 2A in conjunction with FIG. 1, FIG. 2A shows a detailed architecture diagram of the audio-visual broadcasting system of the present invention. In one embodiment, the system backend 2 accepts connections from one or more clients 1, wherein the client 1 includes: a character menu 11, an input module 12, a user interface 13, and an input/output module 14. A user can operate the user interface 13 of the client 1 to select from the character menu 11, a selectable menu that includes the preset character numbers of several preset characters; the voices of the preset characters serve as reference voices for training the speech synthesis model 4 to generate an effect voice. The user enters the text message to be voiced through the input module 12 and then, by operating the user interface 13 of the client 1, transmits the preset character number and the text message through the input/output module 14 to the receive queue module 21 of the system backend 2.
The system backend 2 allows at least one client 1 and at least one host end 3 to connect and log in, and receives, through the receive queue module 21, data transmitted from the client 1, including but not limited to: the host number of the destination host end 3, the user number of the client 1, the preset voice number, and the input text message. The cloud storage server 22 of the system backend 2 records the preset voice number, and the preset voice number and input text message are forwarded, according to the destination host number, to the designated host end 3. The host end 3 includes: an input/output module 32, a user interface 33, and mixing equipment 34.
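The receive-and-forward behaviour of the receive queue module 21, cloud storage server 22, and host routing can be sketched as follows (an illustrative model only; the class and field names are hypothetical and the stores are simplified to in-memory containers):

```python
from collections import deque

class SystemBackend:
    """Sketch of the backend: queue incoming client data, record the
    preset voice number, and forward it with the text to the host end."""
    def __init__(self):
        self.queue = deque()   # stands in for receive queue module 21
        self.storage = {}      # stands in for cloud storage server 22
        self.hosts = {}        # host_number -> items delivered to that host end

    def receive(self, item):
        self.queue.append(item)

    def process_one(self):
        item = self.queue.popleft()
        # record the preset voice number against the sending user
        self.storage[item["user_number"]] = item["preset_voice_number"]
        # forward voice number and text to the designated host end
        self.hosts.setdefault(item["host_number"], []).append(
            (item["preset_voice_number"], item["input_message"]))

backend = SystemBackend()
backend.receive({"host_number": "h1", "user_number": "u1",
                 "preset_voice_number": 3, "input_message": "hey"})
backend.process_one()
```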
In this embodiment of the present invention, a speech synthesis model 4 is configured at each host end 3, and the system backend 2 includes the model update server 23. The input/output module 32 of the host end 3 receives the preset voice number and input text message transmitted by the system backend 2 and passes them to the speech synthesis model 4. The speech synthesis model 4 is trained with a neural network on the reference voices of different characters to produce a plurality of characteristic parameters; selecting different characteristic parameters enables the speech synthesis model 4 to generate a corresponding effect voice from the input text message. FIGS. 3A and 3B below further describe the speech synthesis model 4. In a specific embodiment of the present invention, after the system backend 2 completes the training of the speech synthesis model 4, the model update server 23 can update the speech synthesis model 4 at each host end 3 by downloading the characteristic parameters produced by the latest training to each speech synthesis model 4, so that each model can use the newest characteristic parameters to generate the corresponding effect voices.
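The parameter-update step above resembles a publish/push pattern, sketched below with hypothetical names (the parameter dict is a stand-in for the trained characteristic parameters):

```python
class HostSynthesisModel:
    """Stand-in for the speech synthesis model 4 at a host end."""
    def __init__(self):
        self.characteristic_params = {}

class ModelUpdateServer:
    """Sketch: after backend-side training finishes, push the newest
    characteristic parameters to every host end's local model."""
    def __init__(self):
        self.latest_params = {}

    def publish(self, params):
        self.latest_params = dict(params)

    def push_to(self, host_models):
        for model in host_models:
            # each host end receives its own copy of the latest parameters
            model.characteristic_params = dict(self.latest_params)

server = ModelUpdateServer()
models = [HostSynthesisModel(), HostSynthesisModel()]
server.publish({"donald_duck": [0.1, 0.9]})
server.push_to(models)
```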
For example, after the speech synthesis model 4 has been trained in advance with reference voices of characters such as Donald Duck or Mickey Mouse, it can select Donald Duck as the character according to the preset voice number and synthesize an effect voice of Donald Duck speaking the input text message. A host can, through the user interface 33, use the mixing equipment 34 to combine this effect voice with the host-end image and broadcast the result to the client 1 through the input/output module 32, the host-end image showing the host's live screen. Referring also to FIG. 6, this broadcast includes, on the host live screen 51, the effect voice broadcast 511 of the text message spoken by the selected character, and further displays the icon R2 representing the selected character together with the user number, so that both the host and the users can see which user sent the effect voice broadcast 511.
In a further embodiment of the present invention, still referring to FIG. 2A, the system backend 2 of the audio-visual broadcasting system accepts connections from one or more clients 1. A user can operate the user interface 13 of the client 1 to select from the character menu 11, one option of which allows the user to upload a customized voice file. The customized voice file may be in various audio/video file formats, not limited to mp3, mp4, and the like. For example, a user may wish to send a host an effect voice in the voice of Jay Chou; since Jay Chou is not one of the preset characters of the character menu 11 and therefore has no preset character number, the user can upload a voice file of Jay Chou. The user enters the text message through the input module 12 and, by operating the user interface 13 of the client 1, transmits the customized voice file and the text message through the input/output module 14 to the receive queue module 21 of the system backend 2. The speech synthesis model 4 of the present invention can then generate an effect voice that imitates Jay Chou speaking the text message.
In this further embodiment, the system backend 2 allows at least one client 1 and at least one host end 3 to connect and log in, and receives, through the receive queue module 21, data transmitted from the client 1, including but not limited to: the host number of the destination host end 3, the user number of the client 1, the customized voice file, and the input text message. The cloud storage server 22 of the system backend 2 records the customized voice file, and the customized voice file and input text message are forwarded, according to the destination host number, to the designated host end 3.
The input/output module 32 of the host end 3 receives the customized voice file and input text message transmitted by the system backend 2 and passes them to the speech synthesis model 4. The speech synthesis model 4 is trained with a neural network on the reference voices of different characters to produce a plurality of characteristic parameters; selecting different characteristic parameters enables the model to generate a corresponding customized effect voice from the input text message. FIG. 3C below further describes how the speech synthesis model 4 processes a customized voice. A host can, through the user interface 33, use the mixing equipment 34 to combine this customized effect voice with the host-end image and broadcast the result to the client 1 through the input/output module 32, the host-end image showing the host's live screen. Referring also to FIG. 6, this broadcast includes, on the host live screen 51, the effect voice broadcast 511 of the spoken text message, and further displays the user number, so that both the host and the users can see which user sent the effect voice broadcast 511.
Referring to FIG. 2B, which shows another embodiment of the audio-visual broadcasting system of the present invention, the client 1 is as described for FIG. 2A, and the speech synthesis model 4 is configured at the system backend 2. The speech synthesis model 4 accesses the preset voice number and the input text message from the cloud storage server 22, performs speech synthesis, and returns the generated effect voice to the cloud storage server 22; FIGS. 3A and 3B below further describe the speech synthesis model 4. The cloud storage server 22 then transmits the effect voice and the input text message, according to the destination host number, to the designated host end 3. A host can, through the user interface 33, use the mixing equipment 34 to combine the effect voice with the host-end image and broadcast the result to the client 1 through the input/output module 32, the host-end image showing the host's live screen.
In a further embodiment, still referring to FIG. 2B, the client 1 is as described for FIG. 2A and the speech synthesis model 4 is located at the system backend 2. The speech synthesis model 4 accesses the customized voice file from the cloud storage server 22, performs speech synthesis with the input text message, and transmits the generated effect voice to the cloud storage server 22; FIG. 3C below further describes how the speech synthesis model 4 processes a customized voice. The cloud storage server 22 then transmits the effect voice, according to the destination host number, to the designated host end 3. A host can, through the user interface 33, use the mixing equipment 34 to combine the effect voice with the host-end image and broadcast the result to the client 1 through the input/output module 32, the host-end image showing the host's live screen.
請參考圖2C,圖2C顯示本發明影音播送系統之用戶端間透過雲端伺服器24接收效果語音之一實施例示意圖,影音播送系統之系統後端2會接受來自一或複數個用戶端1的連線,用戶可透過用戶端1之使用者介面13操作選擇角色選單11,角色選單11中為可選擇式選單,其中包括數個預設角色之預設角色編號。用戶透過輸入模組12輸入傳送語音之文字訊息,再透過操作用戶端1之使用者介面13將預設角色編號與文字訊息透過輸出入模組14傳輸至系統後端2之雲端伺服器24。Please refer to FIG. 2C, which shows an embodiment of receiving effect voice between clients of the audio-visual broadcasting system of the present invention through the cloud server 24. The system backend 2 of the audio-visual broadcasting system will accept connections from one or more clients 1. The user can select the role menu 11 through the user interface 13 of the client 1. The role menu 11 is a selectable menu, which includes the default role numbers of several default roles. The user inputs the text message of the voice transmission through the input module 12, and then transmits the default role number and the text message to the cloud server 24 of the system backend 2 through the input/output module 14 by operating the user interface 13 of the client 1.
系統後端2之雲端伺服器24接收到用戶端1上傳之預設角色編號與輸入的文字訊息,語音合成模型4將預設角色編號對應特色參數提取並將輸入文字訊息進行語音合成,生成效果語音,以下圖3A與3B將進一步說明該語音合成模型4。在本發明一具體實施例中,當該系統後端2完成該語音合成模型4的訓練後,該模型更新伺服器23可以更新該語音合成模型4,而下載最新訓練產生的複數個特色參數至該語音合成模型4,俾使該語音合成模型4可以使用最新訓練產生的複數個特色參數來產生對應的效果語音。The cloud server 24 of the system backend 2 receives the preset character number and the input text message uploaded by the client 1, and the speech synthesis model 4 extracts the characteristic parameters corresponding to the preset character number and performs speech synthesis on the input text message to generate the effect speech. The following Figures 3A and 3B will further illustrate the speech synthesis model 4. In a specific embodiment of the present invention, after the system backend 2 completes the training of the speech synthesis model 4, the model update server 23 can update the speech synthesis model 4 and download the multiple characteristic parameters generated by the latest training to the speech synthesis model 4, so that the speech synthesis model 4 can use the multiple characteristic parameters generated by the latest training to generate the corresponding effect speech.
In a further specific embodiment of the present invention, still referring to FIG. 2C, the system backend 2 of the audio-visual broadcasting system accepts connections from one or more clients 1. Through the user interface 13 of the client 1, a user operates the character menu 11, in which the user may instead choose to upload a customized voice, namely a customized voice file selected by the user. The customized voice file may be in any of various audio or video file formats and is not limited to mp3, mp4, and the like. The user enters the text message to be voiced through the input module 12 and then, by operating the user interface 13 of the client 1, transmits the customized voice file and the text message through the input/output module 14 to the cloud server 24 of the system backend 2.
Upon receiving the customized voice file and the input text message uploaded from the client 1, the cloud server 24 of the system backend 2 lets the speech synthesis model 4 access the customized voice file from cloud storage and synthesize it with the input text message to generate an effect voice; how the speech synthesis model 4 processes a customized voice is further described below with reference to FIG. 3C. Another user can obtain the effect voice through the cloud server 24 of the system backend 2 and play and use it on a client 1.
The architecture of the speech synthesis model 4 of the present invention is shown in the block diagram of FIG. 3A. In a specific embodiment of the present invention, the building blocks of the speech synthesis model 4 include a phoneme embedding, a phoneme encoder 41 conditioned on a speaker condition, a variance adaptor 42, and a Mel decoder 43 conditioned on the same speaker condition, wherein the speaker condition of the phoneme encoder 41 and the Mel decoder 43 is represented by a plurality of characteristic parameters extracted by the speech feature extraction 44 for different characters. The present invention uses reference speech of different characters to train the speech synthesis model 4 and its speech feature extraction (Speech Extraction) 44; the speech synthesis model 4 may be trained with a supervised or unsupervised neural network. After training, the speech feature extraction 44 of the speech synthesis model 4 can extract the corresponding plurality of characteristic parameters for a character specified by a preset character number, or extract a selected plurality of characteristic parameters from a customized voice of a non-preset character.
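The block layout of FIG. 3A can be sketched with dependency-free stand-ins to make the data flow concrete. In the real system each block is a neural network (per the AdaSpeech 4 paper cited below); the toy functions here are assumptions that only show how the speaker's characteristic parameters condition both the encoder and the decoder.

```python
# Toy, dependency-free sketch of the FIG. 3A pipeline: phoneme embedding ->
# speaker-conditioned phoneme encoder 41 -> variance adaptor 42 ->
# speaker-conditioned Mel decoder 43. All arithmetic is illustrative.

EMBED = {"h": [1.0, 0.0], "i": [0.0, 1.0]}  # phoneme embedding table (toy)

def phoneme_encoder(phonemes, speaker):       # block 41
    # Scale each embedding by the speaker condition (toy conditioning).
    return [[speaker["scale"] * v for v in EMBED[p]] for p in phonemes]

def variance_adaptor(hidden):                 # block 42
    # Expand each frame by a fixed duration of 2 (stand-in for the
    # duration/pitch/energy prediction a real variance adaptor performs).
    return [frame for frame in hidden for _ in range(2)]

def mel_decoder(hidden, speaker):             # block 43
    # Add a speaker bias to every value to form the "mel" output (toy).
    return [[v + speaker["bias"] for v in frame] for frame in hidden]

def synthesize(phonemes, speaker):
    return mel_decoder(variance_adaptor(phoneme_encoder(phonemes, speaker)), speaker)

mel = synthesize(["h", "i"], {"scale": 2.0, "bias": 0.5})
```

Two phonemes expand to four "mel frames" here, and both the scaling in block 41 and the bias in block 43 depend on the speaker condition — which is the point the figure makes.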
Each building block of the speech synthesis model 4 of the present invention can be implemented according to the disclosure of the paper published by Yihan Wu et al. in April 2022, entitled "AdaSpeech 4: Adaptive Text to Speech in Zero-Shot Scenarios", publicly available at: https://www.isca-speech.org/archive/interspeech_2022/index.html. That paper discloses a speech synthesis technique in which reference speech is used to train a speech synthesis model based on a neural network.
Continuing with FIG. 3A, in a specific embodiment of the present invention, the speech synthesis model 4 is trained with reference speech of different characters, and may be trained with a supervised or unsupervised neural network. After training, the speech feature extraction 44 of the speech synthesis model 4 can extract a plurality of characteristic parameters from the reference speech of the different characters.
When the system of the present invention is given a preset voice number and an input text message, referring to FIG. 3B, the preset voice number and the phonemes of the input text message are fed into the speech synthesis model 4 of the present invention. Through the pre-trained speech feature extraction 44, the characteristic parameters of the character corresponding to the specified voice number are used to represent the speaker condition of the phoneme encoder 41 and the Mel decoder 43. The pre-trained speech synthesis model 4 passes the phonemes of the text message through the phoneme embedding, and the phoneme encoder 41 encodes the phonemes of the input text message together with the characteristic parameters. The variance adaptor 42 then performs adaptive adjustment of the audio track, and finally the Mel decoder 43 decodes according to the speaker condition to generate an effect voice, which is the voice of the character corresponding to the specified voice number speaking the text message.
Referring next to FIG. 3C, when a user feeds a customized file and an input text message into the speech synthesis model 4 of the present invention, the speech synthesis model 4 preprocesses the customized voice and extracts features through an interpolation operation via the speech feature extraction 44. Based on the preprocessed and interpolated customized voice, one or more characteristic parameters produced by the completed training are selected to represent the speaker condition of the phoneme encoder 41 and the Mel decoder 43. The pre-trained speech synthesis model 4 passes the phonemes of the text message through the phoneme embedding, and the phoneme encoder 41 encodes the phonemes of the input text message together with the characteristic parameters. The variance adaptor 42 then performs adaptive adjustment of the audio track, and finally the Mel decoder 43 decodes according to the speaker condition to generate an effect voice, which is the voice of the customized speaker speaking the text message.
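The "preprocessing and interpolation" step of FIG. 3C can be illustrated with a toy sketch in which the customized voice is reduced to a feature value and the speaker condition is then interpolated from the trained characteristic parameters by similarity. The patent does not fix any formula, so the one-dimensional features and the inverse-distance weighting below are purely illustrative assumptions.

```python
# Toy sketch of FIG. 3C's front end: preprocess a customized voice into a
# feature, then interpolate a speaker condition from trained characteristic
# parameters. The 1-D feature and the weighting scheme are assumptions.

TRAINED = {          # character id -> (reference feature, characteristic parameter)
    "A": (0.0, 10.0),
    "B": (1.0, 20.0),
}

def preprocess(samples):
    # Stand-in preprocessing: peak-normalize the waveform and reduce it
    # to a single mean feature value.
    peak = max(abs(s) for s in samples) or 1.0
    return sum(s / peak for s in samples) / len(samples)

def interpolate_params(feature):
    # Inverse-distance weighting between trained speakers (toy interpolation).
    weights = {k: 1.0 / (abs(feature - f) + 1e-6) for k, (f, _) in TRAINED.items()}
    total = sum(weights.values())
    return sum(weights[k] * p for k, (_, p) in TRAINED.items()) / total

feat = preprocess([0.2, 0.4, 0.4])   # customized upload reduced to one feature
params = interpolate_params(0.5)     # a voice halfway between A and B
```

A feature equidistant from both trained speakers yields a parameter halfway between theirs, which is the intuition behind selecting trained parameters for a non-preset voice.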
Turning to FIG. 4, which may be read together with FIG. 2A, FIG. 4 is a flowchart of a specific embodiment of the present invention performing audio-visual broadcasting with a preset character number. At the client 1, a preset voice number from the character menu 11 and an input message, e.g. a text message, from the input module 12 are received through the user interface 13 (S11), and the input/output module 14 outputs the anchor number, the user number, the preset voice number, and the input message to the system backend (S12). These two steps are completed at the client 1.
Next, the receiving queue module 21 of the system backend 2 receives the anchor number, the user number, the preset voice number, and the input message (S13), and the cloud storage server 22 forwards the received preset voice number and input message to the speech synthesis model 4 of an anchor terminal 3 (S14). These two steps are completed at the system backend 2.
The speech synthesis model 4 of the anchor terminal 3 then uses the characteristic parameters of the character corresponding to the preset voice number to represent the speaker condition of the phoneme encoder 41 and the Mel decoder 43, and performs speech synthesis with the input message to generate an effect voice (S15). The anchor terminal 3 mixes the effect voice through the mixing device 34 (S18); the mixing device outputs the effect voice together with the audio-visual picture, such as the audio-visual effects of the selected gift, and the mixed effect voice is broadcast through the input/output module 32 (S19). These three steps are completed at the anchor terminal 3.
When the speech synthesis model 4 receives the preset voice number, represents a speaker condition with the characteristic parameters of the corresponding character, and performs speech synthesis with the input message to generate an effect voice, if the parameters used for voice generation have been optimized, the speech synthesis model 4 synchronously uploads the optimized characteristic parameters to the model update server 23 of the system backend 2 (S16), and the model update server 23 updates the characteristic parameters so that the corresponding characteristic parameters are available the next time (S17).
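Steps S16 and S17 amount to a push-and-keep-newest update protocol between the synthesis model and the model update server 23. A minimal sketch follows; the version-number scheme is an assumption for illustration, as the patent only says the server updates the parameters for next use.

```python
# Minimal sketch of steps S16-S17: optimized characteristic parameters are
# pushed to the model update server 23, which keeps only the newest version
# per character. Versioning is an illustrative assumption.

class ModelUpdateServer:            # stands in for server 23
    def __init__(self):
        self.params = {}            # character id -> (version, parameters)

    def push(self, character_id, version, params):
        # Keep only the newest version of each character's parameters (S17).
        current_version = self.params.get(character_id, (-1, None))[0]
        if version > current_version:
            self.params[character_id] = (version, params)

    def latest(self, character_id):
        return self.params[character_id][1]

server = ModelUpdateServer()
server.push(7, 1, {"pitch": 1.0})
server.push(7, 2, {"pitch": 1.1})   # optimized after a synthesis run (S16)
server.push(7, 1, {"pitch": 0.9})   # stale update is ignored
```

This also matches the download direction described with FIG. 2C, where the model update server supplies the latest trained parameters back to the speech synthesis model 4.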
Turning to FIG. 5, which may be read together with FIG. 2A, FIG. 5 is a flowchart of another specific embodiment of the present invention performing audio-visual broadcasting with a customized voice file. At the client 1, a customized voice file for the character menu 11 and an input message from the input module 12 are received through the user interface 13 (S21), and the input/output module 14 outputs the anchor number, the user number, the customized voice file, and the input message to the system backend (S22). These two steps are completed at the client 1.
Next, the receiving queue module 21 of the system backend 2 receives the anchor number, the user number, the customized file, and the input message (S23), and the cloud storage server 22 forwards the received customized voice file and input message to the speech synthesis model 4 of an anchor terminal 3 (S24). These two steps are completed at the system backend 2.
The speech synthesis model 4 of the anchor terminal 3 then preprocesses the customized voice and extracts features through an interpolation operation via the speech feature extraction 44; based on the preprocessed and interpolated customized voice, one or more characteristic parameters produced by the completed training are selected to represent the speaker condition of the phoneme encoder 41 and the Mel decoder 43. The pre-trained speech synthesis model 4 passes the phonemes of the text message through the phoneme embedding, and the phoneme encoder 41 encodes the phonemes of the input text message together with the characteristic parameters. The variance adaptor 42 then performs adaptive adjustment of the audio track, and finally the Mel decoder 43 decodes according to the speaker condition to generate an effect voice (S25). The effect voice is mixed through the mixing device 34 (S26); the mixing device outputs the effect voice together with the audio-visual picture, such as the audio-visual effects of the selected gift, and the mixed effect voice is broadcast through the input/output module 32 (S27). These three steps are completed at the anchor terminal 3.
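The mixing step (S18/S26) overlays the generated effect voice onto the anchor's live audio before broadcast. A toy sketch is given below: sample-wise addition with a gain and clipping is an illustrative stand-in for whatever the real mixing device 34 does, which the patent does not specify.

```python
# Toy sketch of the mixing step: overlay the effect voice onto the live
# audio track before broadcast. Gain and clipping values are assumptions.

def mix(live, effect, gain=0.8):
    """Overlay `effect` onto `live`; pad the shorter track with silence."""
    n = max(len(live), len(effect))
    live = live + [0.0] * (n - len(live))
    effect = effect + [0.0] * (n - len(effect))
    # Clip to [-1, 1] so the broadcast signal stays in range.
    return [max(-1.0, min(1.0, a + gain * b)) for a, b in zip(live, effect)]

out = mix([0.5, 0.5, 0.5], [0.5, 1.0])
```

Padding with silence lets an effect voice shorter than the live segment end cleanly, and clipping keeps a loud gift sound from overdriving the stream.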
In a specific embodiment of the present invention, the characteristic parameters selected through a customized voice file are not stored in the speech synthesis model 4 as a preset voice; the speech feature extraction step must be performed anew each time a customized effect voice is generated.
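That caching policy — preset characters' parameters persist in the model while customized uploads are re-extracted on every request — can be sketched as follows. The function names and parameter values are assumptions for illustration only.

```python
# Sketch of the caching policy above: preset parameters are looked up from
# storage; customized voices trigger feature extraction every time and are
# never cached. `extract_features` is a hypothetical stand-in for block 44.

PRESET_PARAMS = {1: {"pitch": 1.0}, 2: {"pitch": 1.3}}  # stored in model 4

extraction_runs = 0

def extract_features(voice_file):
    # Stand-in for the speech feature extraction 44 on a customized voice.
    global extraction_runs
    extraction_runs += 1
    return {"pitch": 0.9, "source": voice_file}

def speaker_params(request):
    if "preset_id" in request:                 # preset path: table lookup
        return PRESET_PARAMS[request["preset_id"]]
    return extract_features(request["file"])   # custom path: always re-run

speaker_params({"file": "dad.mp3"})
speaker_params({"file": "dad.mp3"})            # same file, extracted again
speaker_params({"preset_id": 2})               # no extraction for presets
```

Re-extracting on every request trades latency for not persisting other people's voices in the model, consistent with the embodiment described above.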
Turning to FIG. 6, FIG. 6 is a schematic diagram of a specific embodiment of the user interface of the present invention. The screen is divided into three areas: the auxiliary list 52 on the left, the anchor's live picture 51 in the middle, and the chat room picture 53 on the right. The anchor's live picture 51 in the middle plays the anchor's live interactive picture and the effect voice broadcast 511 of the effect voice of the present invention; the effect voice broadcast 511 includes the effect voice and its corresponding text, the avatar of the selected menu character, the user number that sent the effect voice, the corresponding gift graphic, and the like.
The chat room picture 53 on the right displays the text chat between a plurality of users and the anchor. In a particular embodiment, the text of a sent effect voice broadcast 511 is also displayed in the chat room picture 53, with a corresponding special color, appearance, and so on. Below the chat room picture 53 are the character menu 531 and the text input module 532. FIG. 6 shows only one particular presentation of the character menu 531; it may also be designed as a drop-down menu, a pop-up window, or the like. For example, the character menu 531 may contain a plurality of avatar icons R1, R2, ..., which are easy for users to identify and select.
The auxiliary list 52 on the left holds lists of related interactive functions; the upper left provides a menu list 54 of various functions. The daily task list 521 can define tasks to be completed each day, such as sending five messages, to increase interaction between users and anchors, and events can be held on particular holidays and posted in the event list 522. The lower left is a list of other anchors 531, which can be tapped on screen to switch to a different anchor's live room for interaction.
Please refer to FIG. 7, which is a schematic diagram of another embodiment of the user interface of the present invention, in particular for message exchange between mobile phone users. The screen in this embodiment shows a chat picture 63, in which the user is on the right side of the chat picture 63 and the counterpart user is on the left side, as can be seen in FIG. 7. In a particular embodiment, the counterpart user has sent the text message "I love you" (「我愛你」); the user's avatar 64 is displayed on the right side of the screen, and the user selects character R2 through the character menu 631 while sending the text "I love you" (「我愛妳」) through the text input module 632. Through the audio-visual broadcasting system of the present invention, the preset voice number and the input text message are turned into an effect voice output, and the effect voice broadcast 611 is delivered to the counterpart user.
1: Client
11: Character menu
12: Input module
13: User interface
14: Input/output module
2: System backend
21: Receiving queue module
22: Cloud storage server
23: Model update server
3: Anchor terminal
32: Input/output module
33: User interface
34: Mixing device
4: Speech synthesis model
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| TW112115390A | 2023-04-25 | 2023-04-25 | Audio-visual broadcasting system and method thereof |
| Publication Number | Publication Date |
|---|---|
| TW202443557A | 2024-11-01 |
| TWI866178B | 2024-12-11 |