CN112562733A

Info

Publication number: CN112562733A
Application number: CN202011434920.6A
Authority: CN
Inventors: 张乐雨; 张慧敏
Original assignee: Ping An Puhui Enterprise Management Co Ltd
Current assignee: Ping An Puhui Enterprise Management Co Ltd
Priority date: 2020-12-10
Filing date: 2020-12-10
Publication date: 2021-03-26

Abstract

The application discloses a media data processing method and apparatus, a storage medium, and a computer device. The method comprises: receiving source media data, wherein the source media data comprises video data and source audio data; performing speech transcription on the source audio data to obtain transcribed text data, and translating the transcribed text data to obtain translated text data in a target language; acquiring text semantic parameters corresponding to the transcribed text data, and adjusting preset voice synthesis parameters based on the text semantic parameters; performing voice synthesis on the translated text data according to the adjusted voice synthesis parameters to obtain audio data corresponding to the target language; and synthesizing the audio data corresponding to the target language with the video data to obtain synthesized media data. The application makes the media data suitable for viewers with different language habits, preserves sound characteristics that better match the emotion of the source media data, and improves the user's viewing experience.

Description

Translated from Chinese

Media data processing method and apparatus, storage medium, and computer device

Technical Field

The present application relates to the technical field of data processing, and in particular to a media data processing method and apparatus, a storage medium, and a computer device.

Background

With the continuous development of communication technology, users do more with smart terminal devices such as mobile phones, tablet computers, and desktop computers than make calls or look up information. With the rapid growth of live-streaming platforms and short-video platforms, users who watch videos through such platforms are now spread all over the world.

In the current video viewing process, a video producer sends recorded audio and video data to a video server, and the video server forwards the recorded video to the terminal of a video viewer for playback. However, viewers may be located anywhere in the world and may not fully understand the language used in the uploaded audio and video. This results in a poor viewing experience and makes it difficult for the video platform to increase its playback volume.

Summary of the Invention

In view of this, the present application provides a media data processing method and apparatus, a storage medium, and a computer device.

According to one aspect of the present application, a media data processing method is provided, comprising:

receiving source media data, wherein the source media data includes video data and source audio data;

performing speech transcription on the source audio data to obtain transcribed text data, and translating the transcribed text data to obtain translated text data in a target language;

acquiring text semantic parameters corresponding to the transcribed text data, and adjusting preset voice synthesis parameters based on the text semantic parameters;

performing voice synthesis on the translated text data according to the adjusted voice synthesis parameters to obtain audio data corresponding to the target language;

synthesizing the audio data corresponding to the target language with the video data to obtain synthesized media data.

Optionally, translating the transcribed text data to obtain the translated text data in the target language specifically includes:

assembling the transcribed text data according to an input parameter assembly rule corresponding to a preset translation line, to obtain translation input data corresponding to the transcribed text data;

calling the preset translation line and inputting the translation input data into the preset translation line for translation, to obtain translation output data;

parsing the translation output data according to an output parameter parsing rule corresponding to the preset translation line, to obtain the translated text data.
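The assemble, call, and parse sequence above can be sketched as follows. This is a minimal illustration only: the application does not specify the assembly or parsing rules, so the JSON field names (`q`, `from`, `to`, `result.text`), the function names, and the stand-in `fake_line` service are all assumptions.

```python
import json
from typing import Callable


def assemble_input(transcribed_text: str, source_lang: str, target_lang: str) -> str:
    """Build the request body under an assumed input parameter assembly rule."""
    request = {"q": transcribed_text, "from": source_lang, "to": target_lang}
    return json.dumps(request, ensure_ascii=False)


def parse_output(translation_output: str) -> str:
    """Extract the translated text under an assumed output parameter parsing rule."""
    return json.loads(translation_output)["result"]["text"]


def translate(transcribed_text: str, source_lang: str, target_lang: str,
              call_line: Callable[[str], str]) -> str:
    """Assemble the input, call the translation line, and parse its output."""
    translation_input = assemble_input(transcribed_text, source_lang, target_lang)
    translation_output = call_line(translation_input)
    return parse_output(translation_output)


def fake_line(request_body: str) -> str:
    """Hypothetical stand-in for a real translation line, for this example only."""
    request = json.loads(request_body)
    canned = {("zh", "en", "你好"): "Hello"}
    text = canned[(request["from"], request["to"], request["q"])]
    return json.dumps({"result": {"text": text}})
```

Keeping the assembly and parsing rules in separate functions means each translation line can supply its own pair without changing the calling code.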

Optionally, before calling the preset translation line, the method further includes:

acquiring a verification seed corresponding to the preset translation line, and generating a verification token corresponding to the verification seed according to a token generation rule;

verifying the preset translation line using the verification token, and if the verification passes, determining that the preset translation line is in a callable state.
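One possible reading of the seed-to-token step is sketched below. The application does not define the token generation rule; an HMAC-SHA256 over the seed and a time window is used here purely as an assumed rule, and the names `generate_token`, `line_is_callable`, `secret`, and `time_window` are hypothetical.

```python
import hashlib
import hmac


def generate_token(seed: str, secret: bytes, time_window: int) -> str:
    """Derive a verification token from the seed (assumed rule: HMAC-SHA256)."""
    message = f"{seed}:{time_window}".encode("utf-8")
    return hmac.new(secret, message, hashlib.sha256).hexdigest()


def line_is_callable(seed: str, secret: bytes, time_window: int,
                     presented_token: str) -> bool:
    """Return True when the presented token matches the expected one."""
    expected = generate_token(seed, secret, time_window)
    # compare_digest avoids leaking information through comparison timing.
    return hmac.compare_digest(expected, presented_token)
```

In this sketch the translation line is treated as callable only when the token derived from its seed verifies; a token from a different time window is rejected.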

Optionally, acquiring the text semantic parameters corresponding to the transcribed text data specifically includes:

segmenting the transcribed text data according to the text structure of the transcribed text data, to obtain a plurality of sentences corresponding to the transcribed text data;

acquiring a semantic parameter for each sentence, and determining the text semantic parameters corresponding to the transcribed text data according to the semantic parameter of each sentence.
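The segmentation and aggregation steps above can be illustrated with a small sketch. The application does not say how a sentence's semantic parameter is computed or how the per-sentence values are combined; the tiny word lists, the score range of -1 to 1, and the use of a mean are assumptions made only for illustration.

```python
import re
from typing import List

# Hypothetical word lists; a real system would use a trained model.
POSITIVE = {"happy", "great", "love"}
NEGATIVE = {"sad", "terrible", "hate"}


def split_sentences(text: str) -> List[str]:
    """Split text after sentence-ending punctuation (Western or Chinese)."""
    parts = re.split(r"(?<=[.!?。！？])\s*", text)
    return [part.strip() for part in parts if part.strip()]


def sentence_score(sentence: str) -> float:
    """Score one sentence in [-1, 1] by counting assumed sentiment words."""
    words = re.findall(r"[a-z']+", sentence.lower())
    raw = sum((word in POSITIVE) - (word in NEGATIVE) for word in words)
    return max(-1.0, min(1.0, float(raw)))


def text_semantic_parameter(text: str) -> float:
    """Combine per-sentence scores into one text-level value (assumed: mean)."""
    scores = [sentence_score(sentence) for sentence in split_sentences(text)]
    return sum(scores) / len(scores) if scores else 0.0
```

Scoring each sentence separately keeps one strongly worded sentence from being diluted before aggregation, and it leaves room for per-sentence synthesis adjustments.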

Optionally, receiving the source media data specifically includes:

receiving the source media data sent by a video publishing terminal;

and synthesizing the audio data corresponding to the target language with the video data to obtain the synthesized media data specifically includes:

acquiring a playback language corresponding to a video playback terminal, and acquiring, from the audio data corresponding to the target language, the audio data corresponding to the playback language;

synthesizing the audio data corresponding to the playback language with the video data to obtain playback media data;

sending the playback media data to the video playback terminal.

Optionally, synthesizing the audio data corresponding to the playback language with the video data to obtain the playback media data specifically includes:

acquiring translated text data corresponding to the playback language;

synthesizing the translated text data and the audio data corresponding to the playback language with the video data to obtain the playback media data.

Optionally, acquiring the playback language corresponding to the video playback terminal specifically includes:

determining the playback language of the video playback terminal according to the geographic location of the video playback terminal; or,

determining the playback language of the video playback terminal according to the commonly used language corresponding to the video playback terminal; or,

parsing the playback language indicated by a playback instruction sent by the video playback terminal.
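The three alternatives above can be combined in a single lookup, sketched below. The application presents them as alternatives and does not rank them; the priority order used here (explicit instruction, then commonly used language, then location), the region-to-language table, and the default value are all assumptions.

```python
from typing import Optional

# Hypothetical mapping from a region code to its most common language.
REGION_LANGUAGE = {"JP": "ja", "CN": "zh", "US": "en"}


def playback_language(instruction_language: Optional[str] = None,
                      common_language: Optional[str] = None,
                      region_code: Optional[str] = None,
                      default: str = "en") -> str:
    """Pick a playback language from whichever signal is available."""
    if instruction_language:   # language carried in the playback instruction
        return instruction_language
    if common_language:        # e.g. the language chosen for the last viewing
        return common_language
    # Fall back to the terminal's geographic location, then to the default.
    return REGION_LANGUAGE.get(region_code or "", default)
```

For example, a terminal located in Japan that sends no explicit language and has no viewing history would resolve to `"ja"` under this assumed table.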

According to another aspect of the present application, a media data processing apparatus is provided, comprising:

a source data receiving module, configured to receive source media data, wherein the source media data includes video data and source audio data;

an audio data translation module, configured to perform speech transcription on the source audio data to obtain transcribed text data, and to translate the transcribed text data to obtain translated text data in a target language;

a voice parameter adjustment module, configured to acquire text semantic parameters corresponding to the transcribed text data, and to adjust preset voice synthesis parameters based on the text semantic parameters;

a voice synthesis module, configured to perform voice synthesis on the translated text data according to the adjusted voice synthesis parameters to obtain audio data corresponding to the target language;

a media data synthesis module, configured to synthesize the audio data corresponding to the target language with the video data to obtain synthesized media data.

Optionally, the audio data translation module specifically includes:

an input data assembly unit, configured to assemble the transcribed text data according to an input parameter assembly rule corresponding to a preset translation line, to obtain translation input data corresponding to the transcribed text data;

a translation data output unit, configured to call the preset translation line and input the translation input data into the preset translation line for translation, to obtain translation output data;

a translated text parsing unit, configured to parse the translation output data according to an output parameter parsing rule corresponding to the preset translation line, to obtain the translated text data.

Optionally, the apparatus further includes:

a verification token generation module, configured to, before the preset translation line is called, acquire a verification seed corresponding to the preset translation line and generate a verification token corresponding to the verification seed according to a token generation rule;

a line verification module, configured to verify the preset translation line using the verification token, and if the verification passes, determine that the preset translation line is in a callable state.

Optionally, the voice parameter adjustment module specifically includes:

a sentence segmentation unit, configured to segment the transcribed text data according to the text structure of the transcribed text data, to obtain a plurality of sentences corresponding to the transcribed text data;

a semantic parameter determination unit, configured to acquire a semantic parameter for each sentence, and to determine the text semantic parameters corresponding to the transcribed text data according to the semantic parameter of each sentence.

Optionally, the source data receiving module is specifically configured to receive the source media data sent by a video publishing terminal;

the media data synthesis module specifically includes:

a playback language acquisition unit, configured to acquire a playback language corresponding to a video playback terminal, and to acquire, from the audio data corresponding to the target language, the audio data corresponding to the playback language;

a playback data synthesis unit, configured to synthesize the audio data corresponding to the playback language with the video data to obtain playback media data;

a playback data sending unit, configured to send the playback media data to the video playback terminal.

Optionally, the playback data synthesis unit specifically includes:

a playback text acquisition subunit, configured to acquire translated text data corresponding to the playback language;

a playback data synthesis subunit, configured to synthesize the translated text data and the audio data corresponding to the playback language with the video data to obtain the playback media data.

Optionally, the playback language acquisition unit specifically includes:

a first language acquisition subunit, configured to determine the playback language of the video playback terminal according to the geographic location of the video playback terminal; or,

a second language acquisition subunit, configured to determine the playback language of the video playback terminal according to the commonly used language corresponding to the video playback terminal; or,

a third language acquisition subunit, configured to parse the playback language indicated by a playback instruction sent by the video playback terminal.

According to yet another aspect of the present application, a storage medium is provided on which a computer program is stored, and the program, when executed by a processor, implements the above media data processing method.

According to a further aspect of the present application, a computer device is provided, comprising a storage medium, a processor, and a computer program stored on the storage medium and executable on the processor, wherein the processor implements the above media data processing method when executing the program.

With the above technical solution, the media data processing method and apparatus, storage medium, and computer device provided by the present application, after receiving the source media data, first perform speech transcription on the source audio data contained in the source media data to obtain the corresponding transcribed text data, and then translate the transcribed text from the source language into translated text data in the target language. The voice synthesis parameters are adjusted according to the text semantic parameters of the transcribed text, the translated text data is synthesized into audio data in the target language based on the adjusted parameters, and that audio data is assembled with the video data contained in the source media data to obtain synthesized media data. Compared with the prior-art approach of playing the live video directly, the embodiments of the present application can convert the source media data into media data in multiple languages, which makes it convenient for users with different language habits to watch. They also determine the voice synthesis parameters from the text semantic parameters of the transcribed text, so that the synthesized voice better matches the emotion expressed by the source audio data. This makes the synthesized media data feel closer to the source media data, improves the user's viewing experience, and helps increase the playback volume of the video platform.

The above description is only an overview of the technical solution of the present application. So that the technical means of the present application can be understood more clearly and implemented according to the content of the specification, and so that the above and other objects, features, and advantages of the present application are easier to understand, specific embodiments of the present application are set out below.

Brief Description of the Drawings

The drawings described herein are provided for further understanding of the present application and constitute a part of it. The schematic embodiments of the present application and their descriptions are used to explain the present application and do not constitute an improper limitation of it. In the drawings:

FIG. 1 shows a schematic flowchart of a media data processing method provided by an embodiment of the present application;

FIG. 2 shows a schematic flowchart of another media data processing method provided by an embodiment of the present application;

FIG. 3 shows a schematic structural diagram of a media data processing apparatus provided by an embodiment of the present application;

FIG. 4 shows a schematic structural diagram of another media data processing apparatus provided by an embodiment of the present application.

Detailed Description

The present application is described in detail below with reference to the drawings and in conjunction with the embodiments. It should be noted that, where no conflict arises, the embodiments of the present application and the features of those embodiments may be combined with each other.

This embodiment provides a media data processing method. As shown in FIG. 1, the method includes:

Step 101, receiving source media data, wherein the source media data includes video data and source audio data;

The media data processing method provided by the embodiments of the present application can be used to process media data recorded by a streamer on a live-streaming terminal device of a live-streaming platform, and can also be used to process media data uploaded by a video uploader to a video platform. The embodiments explain the method using media data recorded during a streamer's live broadcast as an example, but they are not limited to this application scenario. In this embodiment, the live-streaming platform server receives the source media data, which includes video data and audio data. The language of the audio data is the language used by the streamer. For example, on most live-streaming platforms in China the streamers broadcast in Chinese, so the language of the audio data is Chinese.

Step 102, performing speech transcription on the source audio data to obtain transcribed text data, and translating the transcribed text data to obtain translated text data in a target language;

In this embodiment, after the source audio data is received, it is first transcribed to obtain the corresponding transcribed text data; that is, speech recognition is performed on the source audio data to convert the speech into text. To convert the language of the media data, the transcribed text data is then translated into the target language to obtain translated text data. For example, the transcribed text data can be translated from Chinese into English, Japanese, and so on.

Step 103, acquiring text semantic parameters corresponding to the transcribed text data, and adjusting preset voice synthesis parameters based on the text semantic parameters;

In this embodiment, to ensure that the processed media data sounds natural rather than stiff, the text semantic parameters corresponding to the transcribed text data are acquired once the transcribed text data is obtained. The text semantic parameters describe the semantic information expressed by the source media data. For example, if the source media data conveys that the author is happy, that emotion can be expressed through the text semantic parameters of the transcribed text. The preset voice synthesis parameters are then adjusted based on the text semantic parameters, so that the adjusted parameters reflect the text semantics through characteristics of the voice. The voice synthesis parameters may specifically include the amplitude of voice fluctuation, fundamental frequency, speech rate, volume, pause length between sentences, and so on. For example, a happy speaker talks faster and pauses less between sentences.
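The adjustment in step 103 can be sketched as a mapping from a semantic value to scaled synthesis parameters. The application names the parameter types but gives no formulas; the single `valence` value in the range -1 to 1, the scaling factors, and the field names below are assumptions chosen only to show the direction of the adjustment (a happier text gives a faster rate and shorter pauses).

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class SynthesisParams:
    rate: float = 1.0             # relative speech rate
    base_pitch_hz: float = 200.0  # fundamental frequency
    volume: float = 1.0           # relative loudness
    sentence_gap_s: float = 0.5   # pause between sentences, in seconds


def adjust(preset: SynthesisParams, valence: float) -> SynthesisParams:
    """Scale the preset parameters by an assumed valence in [-1, 1]."""
    v = max(-1.0, min(1.0, valence))
    return replace(
        preset,
        rate=preset.rate * (1 + 0.2 * v),                      # happier: faster
        base_pitch_hz=preset.base_pitch_hz * (1 + 0.1 * v),    # happier: higher
        sentence_gap_s=preset.sentence_gap_s * (1 - 0.3 * v),  # happier: shorter pauses
    )
```

A neutral valence of 0 leaves the preset unchanged, so the preset parameters remain the baseline for text with no detectable emotion.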

Step 104, performing voice synthesis on the translated text data according to the adjusted voice synthesis parameters to obtain audio data corresponding to the target language;

In this embodiment, voice synthesis is performed on the translated text data according to the adjusted voice synthesis parameters. Because these parameters carry the semantic information of the text, applying them during text-to-speech conversion yields audio data in the target language, so that the source audio data in the source language is converted into audio data in the target language.

Step 105, synthesizing the audio data corresponding to the target language with the video data to obtain synthesized media data.

In this embodiment, after the audio data corresponding to the target language is generated, it is assembled with the video data contained in the source media data to obtain synthesized media data. The source media data is thereby converted from the source language into synthesized media data in the target language, so that users with different language habits can understand the content of the video. This improves the user's viewing experience and increases the playback volume of the video platform.

By applying the technical solution of this embodiment, after the source media data is received, speech transcription is first performed on the source audio data contained in the source media data to obtain the corresponding transcribed text data, and the transcribed text is then translated from the source language into translated text data in the target language. The voice synthesis parameters are adjusted according to the text semantic parameters of the transcribed text, the translated text data is synthesized into audio data in the target language based on the adjusted parameters, and that audio data is assembled with the video data contained in the source media data to obtain synthesized media data. Compared with the prior-art approach of playing the live video directly, the embodiments of the present application can convert the source media data into media data in multiple languages, which makes it convenient for users with different language habits to watch. They also determine the voice synthesis parameters from the text semantic parameters of the transcribed text, so that the synthesized voice better matches the emotion expressed by the source audio data. This makes the synthesized media data feel closer to the source media data, improves the user's viewing experience, and helps increase the playback volume of the video platform.
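The flow of steps 101 to 105 can be summarized in a short orchestration sketch. The individual stages (transcription, translation, semantic analysis, parameter adjustment, synthesis) are passed in as functions because the application does not prescribe particular engines; every name here is hypothetical, and muxing is reduced to pairing the original video with the new audio.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class SourceMedia:
    video: bytes
    source_audio: bytes


@dataclass
class SynthesizedMedia:
    video: bytes
    audio: bytes
    language: str


def process_media(source: SourceMedia, target_language: str, preset_params: Any,
                  transcribe: Callable[[bytes], str],
                  translate: Callable[[str, str], str],
                  semantic_parameters: Callable[[str], Any],
                  adjust_parameters: Callable[[Any, Any], Any],
                  synthesize: Callable[[str, Any], bytes]) -> SynthesizedMedia:
    """Run steps 102 to 105 on media already received in step 101."""
    transcribed = transcribe(source.source_audio)          # step 102: speech to text
    translated = translate(transcribed, target_language)   # step 102: translation
    semantics = semantic_parameters(transcribed)           # step 103: from the transcribed text
    params = adjust_parameters(preset_params, semantics)   # step 103: adjust the presets
    audio = synthesize(translated, params)                 # step 104: text to speech
    return SynthesizedMedia(source.video, audio, target_language)  # step 105
```

Note that the semantic parameters are taken from the transcribed (source-language) text while synthesis runs on the translated text, which mirrors the order described in the steps above.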

Further, as a refinement and extension of the above embodiment, and to describe the implementation process of this embodiment in full, another media data processing method is provided. As shown in FIG. 2, the method includes:

Step 201, receiving source media data sent by a video publishing terminal, wherein the source media data includes video data and source audio data;

In this embodiment, when a streamer broadcasts live on the streamer terminal, the streamer terminal records the content to obtain source media data, which includes video data and audio data. The streamer terminal sends the source media data to the live-streaming server, and the live-streaming server receives it.

Step 202, performing speech transcription on the source audio data to obtain transcribed text data, and translating the transcribed text data to obtain translated text data in a target language;

Step 203, acquiring text semantic parameters corresponding to the transcribed text data, and adjusting preset voice synthesis parameters based on the text semantic parameters;

Step 204, performing voice synthesis on the translated text data according to the adjusted voice synthesis parameters to obtain audio data corresponding to the target language;

For descriptions of steps 202 to 204, refer to the descriptions of steps 102 to 104; details are not repeated here. Specifically, TTS (text-to-speech) technology can be used for voice synthesis. TTS converts text that is generated by a computer or input from outside into intelligible, fluent speech output.

Step 205, acquiring a playback language corresponding to a video playback terminal, and acquiring, from the audio data corresponding to the target language, the audio data corresponding to the playback language;

In step 205, the live-streaming server needs to process the source media data sent by the live-streaming terminal and forward it to the video playback terminal, so it must determine which language the source media data should be converted into. In this embodiment there may be multiple target languages. The playback language corresponding to the video playback terminal is acquired, and the audio data corresponding to the playback language is selected from the audio data of the multiple target languages. That audio data is then used to synthesize the media data, which makes it convenient for viewers with different language habits to watch the live video.

In the above embodiment, specifically, the playback language of the video playback terminal is determined according to the geographic location of the video playback terminal; or, the playback language is determined according to the commonly used language corresponding to the video playback terminal; or, the playback language indicated by a playback instruction sent by the video playback terminal is parsed from that instruction.

In this embodiment, the playback language can be determined according to the location of the video playback terminal. For example, if the video playback terminal is located in Japan, where the commonly used language is Japanese, the playback language can be determined to be Japanese. Alternatively, the playback language can be taken from the live viewing request that the live-streaming server receives from the video playback terminal; that is, the playback language is parsed from the playback instruction. Alternatively, the playback language can be determined directly from the commonly used language corresponding to the video playback terminal, for example the language selected the last time a video was watched.

Step 206: synthesize the audio data corresponding to the playback language with the video data to obtain playback media data;

Specifically, the translated text data corresponding to the playback language is acquired, and the translated text data and audio data corresponding to the playback language are synthesized with the video data to obtain the playback media data.

In the above embodiment, the translated text data corresponding to the playback language serves as subtitle data and is synthesized with the audio data and the video data to obtain the playback media data. As a result, both the sound and the subtitles of the playback media data match the viewing user's language habits, which further improves the user's video viewing experience.
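One possible way to combine the video data, the audio data of the playback language and the subtitle data is to hand the three inputs to a multiplexing tool. The sketch below only builds an argument list for the `ffmpeg` command-line tool and does not run it. The choice of ffmpeg, the MP4 container, the file names and the function name `build_mux_command` are assumptions for illustration; the application does not name a specific tool or container.

```python
from typing import List


def build_mux_command(
    video_path: str, audio_path: str, subtitle_path: str, output_path: str
) -> List[str]:
    """Build an ffmpeg argument list that muxes video, dubbed audio and subtitles."""
    return [
        "ffmpeg",
        "-i", video_path,      # input 0: source video
        "-i", audio_path,      # input 1: synthesized audio in the playback language
        "-i", subtitle_path,   # input 2: translated text as subtitles (e.g. SRT)
        "-map", "0:v:0",       # take the video stream from input 0
        "-map", "1:a:0",       # take the audio stream from input 1, dropping the source audio
        "-map", "2:s:0",       # take the subtitle stream from input 2
        "-c:v", "copy",        # keep the video stream without re-encoding
        "-c:a", "aac",         # encode the audio as AAC
        "-c:s", "mov_text",    # subtitle codec used by MP4 containers
        output_path,
    ]


print(" ".join(build_mux_command("segment.mp4", "ja.wav", "ja.srt", "segment_ja.mp4")))
```

The resulting list could be passed to `subprocess.run` on a machine where ffmpeg is installed.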

Step 207: send the playback media data to the video playback terminal.

In the above embodiment, after the playback media data is synthesized, it is sent to the video playback terminal for the user to watch.

It should be noted that, in a live streaming scenario, to ensure playback quality the live streaming server usually buffers the video for a period of time, for example 30 seconds, before sending it to the video playback terminal. The buffered video can then be split every 15 seconds to obtain pieces of source media data, and the playback language conversion is performed on each piece separately, so that the video received by the video playback terminal is continuous and free of stutter and the playback quality is maintained.
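The segmentation of the buffered video can be sketched as follows. The function name `split_buffer` and the representation of a segment as a `(start, end)` pair in seconds are assumptions for this example; the 30-second buffer and 15-second segment length are the values given above.

```python
from typing import List, Tuple


def split_buffer(buffer_seconds: float, segment_seconds: float) -> List[Tuple[float, float]]:
    """Split a buffered interval into consecutive (start, end) segments in seconds."""
    if buffer_seconds < 0 or segment_seconds <= 0:
        raise ValueError("buffer must be non-negative and segment length positive")
    segments = []
    start = 0.0
    while start < buffer_seconds:
        # The last segment may be shorter than segment_seconds.
        end = min(start + segment_seconds, buffer_seconds)
        segments.append((start, end))
        start = end
    return segments


print(split_buffer(30, 15))
```

Each resulting segment would then be processed as one piece of source media data.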

In any embodiment of the present application, translating the transcribed text data to obtain the translated text data of the target language in step 102 and step 202 specifically includes:

Step 102-1: assemble the transcribed text data according to the input parameter assembly rule corresponding to a preset translation line, to obtain translation input data corresponding to the transcribed text data;

Step 102-2: call the preset translation line and input the translation input data into it for translation, to obtain translation output data;

Step 102-3: parse the translation output data according to the output parameter parsing rule corresponding to the preset translation line, to obtain the translated text data.

In the above embodiment, the input parameter assembly rule corresponding to the preset translation line is obtained first, and the transcribed text data to be translated is assembled according to that rule to obtain the translation input data, which serves as the input parameter of the preset translation line. The preset translation line is then called with the translation input data, and the resulting output parameter is the translation output data. To obtain translated text data that can be recognized by the computer, the translation output data is further parsed according to the output parameter parsing rule corresponding to the preset translation line, finally yielding the translated text data. In this way the translation line translates the transcribed text data into translated text data, converting the text from the source language to the target language. The preset translation line may be an interface of various terminals or browsers, such as a Baidu Translate or Google Translate interface, or a preset translation database interface.
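Steps 102-1 to 102-3 amount to an adapter with three parts: an assembly rule, the call itself, and a parsing rule. A minimal Python sketch follows, under stated assumptions: the class `TranslationLine`, the JSON field names `q`, `from`, `to` and `result`, and the in-memory `fake_backend` are hypothetical and do not reflect the request format of any real translation service.

```python
import json
from dataclasses import dataclass
from typing import Callable


@dataclass
class TranslationLine:
    """A translation line described by its assembly rule, call and parsing rule."""
    assemble: Callable[[str, str, str], str]  # (text, source, target) -> input data
    call: Callable[[str], str]                # input data -> raw output data
    parse: Callable[[str], str]               # raw output data -> translated text

    def translate(self, text: str, source_lang: str, target_lang: str) -> str:
        payload = self.assemble(text, source_lang, target_lang)  # step 102-1
        raw = self.call(payload)                                  # step 102-2
        return self.parse(raw)                                    # step 102-3


def assemble_json(text: str, source_lang: str, target_lang: str) -> str:
    # Hypothetical assembly rule: a JSON object with assumed field names.
    return json.dumps({"q": text, "from": source_lang, "to": target_lang})


def parse_json(raw: str) -> str:
    # Hypothetical parsing rule: the translation is in the assumed "result" field.
    return json.loads(raw)["result"]


def fake_backend(payload: str) -> str:
    # Stand-in for a remote interface; looks the text up in a tiny glossary.
    request = json.loads(payload)
    glossary = {("你好", "en"): "Hello"}
    translated = glossary.get((request["q"], request["to"]), request["q"])
    return json.dumps({"result": translated})


line = TranslationLine(assemble_json, fake_backend, parse_json)
print(line.translate("你好", "zh", "en"))
```

Keeping the assembly and parsing rules as separate callables means that switching to another translation line only requires supplying a different pair of rules.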

In some application scenarios, a translation interface defines call verification rules in advance, and verification must be performed before the interface is called so that malicious calls do not waste resources. In the above embodiment, before step 102-2, the method further includes:

Step 102-4: obtain the verification seed corresponding to the preset translation line, and generate a verification token corresponding to the verification seed according to a token generation rule;

Step 102-5: verify the preset translation line with the verification token, and if the verification passes, determine that the preset translation line is in a callable state.

In the above embodiment, the verification seed corresponding to the preset translation line is obtained first, and a verification token is then generated by encrypting the verification seed according to the token generation rule agreed in advance for that translation line. Before the preset translation line is called, verification is performed with the token, and once verification passes the line is determined to be in a callable state. The preset translation line can be called only in the callable state. This prevents malicious calls from wasting translation line resources and improves translation efficiency. For example, the Google Translate interface is called to obtain a verification seed, and a verification token is generated from the seed and the timestamp of the current time using a preset encryption algorithm, so that the translation interface request can be verified.
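A token derived from a seed and a timestamp could, for instance, be computed as a keyed hash. The sketch below uses HMAC-SHA256 from the Python standard library purely as an assumed example of a "preset encryption algorithm". The function names, the 300-second validity window and the choice of HMAC are illustrative assumptions and do not describe the actual scheme used by any translation service.

```python
import hashlib
import hmac
import time
from typing import Optional


def make_token(seed: str, timestamp: int) -> str:
    """Derive a verification token from the seed and a Unix timestamp."""
    message = str(timestamp).encode("utf-8")
    return hmac.new(seed.encode("utf-8"), message, hashlib.sha256).hexdigest()


def verify_token(
    seed: str,
    timestamp: int,
    token: str,
    now: Optional[int] = None,
    max_age_seconds: int = 300,  # assumed validity window
) -> bool:
    """Return True if the token matches the seed and the timestamp is recent."""
    current = int(time.time()) if now is None else now
    if abs(current - timestamp) > max_age_seconds:
        return False  # stale or future-dated request
    expected = make_token(seed, timestamp)
    # Constant-time comparison avoids leaking how many characters matched.
    return hmac.compare_digest(expected, token)


token = make_token("example-seed", 1_700_000_000)
print(verify_token("example-seed", 1_700_000_000, token, now=1_700_000_010))
```

Only when `verify_token` returns `True` would the translation line be treated as callable.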

In any embodiment of the present application, acquiring the text semantic parameters corresponding to the transcribed text data in step 103 and step 203 specifically includes:

Step 103-1: split the transcribed text data according to its text structure to obtain a plurality of sentences corresponding to the transcribed text data;

Step 103-2: obtain the semantic parameter corresponding to each sentence, and determine the text semantic parameter corresponding to the transcribed text data according to the semantic parameters of the sentences.

In the above embodiment, the transcribed text is split according to its text structure, specifically according to the punctuation marks in the text (such as periods, question marks and exclamation marks), so that the transcribed text is converted into multiple sentences. After the sentences are extracted, feature words are extracted from each sentence. Feature words characterize the emotion implied by a sentence and may include, for example, conjunctions and negation words. Syntactic analysis is also performed on each sentence to determine the weights of the segmented words before and after each conjunction in the sentence, while negation words are checked for polarity reversal or double negation. A score for the sentence is then determined from the emotional vocabulary in the sentence together with the syntactic analysis results, and this score represents the semantic parameter of the sentence. The lower the score, the more negative the emotion represented by the sentence; the higher the score, the more positive the emotion. For example, a score of -10 indicates an extremely negative emotion (such as irritability or rage); a score of -2 indicates a relatively negative emotion (such as low mood); a score of 0 indicates a neutral emotion; and a score of +7 indicates a relatively positive emotion (such as great joy). The text semantic parameter corresponding to the transcribed text data is then determined from the semantic parameters of the sentences, for example by taking their average. Averaging prevents large differences between the semantic parameters of individual sentences from causing excessive emotional fluctuation in the synthesized voice.
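The sentence splitting, scoring and averaging described above can be sketched in a few lines of Python. This is a deliberately simplified illustration: the `LEXICON` and `NEGATIONS` tables and the clamping to the range -10 to +10 are assumptions, the example handles English words only, and the conjunction weighting from the syntactic analysis is omitted.

```python
import re
from typing import List

# Hypothetical emotion lexicon: word -> score contribution.
LEXICON = {"great": 7.0, "happy": 5.0, "sad": -2.0, "furious": -10.0}
NEGATIONS = {"not", "never", "no"}


def split_sentences(text: str) -> List[str]:
    """Split text at periods, question marks and exclamation marks."""
    parts = re.split(r"[.!?。！？]+", text)
    return [part.strip() for part in parts if part.strip()]


def score_sentence(sentence: str) -> float:
    """Score one sentence; a negation flips the polarity of the next emotion word."""
    score = 0.0
    negate = False
    for word in re.findall(r"[a-z']+", sentence.lower()):
        if word in NEGATIONS:
            negate = not negate  # two negations in a row cancel out
            continue
        if word in LEXICON:
            value = LEXICON[word]
            score += -value if negate else value
            negate = False
    return max(-10.0, min(10.0, score))  # keep the score within the assumed range


def text_semantic_parameter(text: str) -> float:
    """Average the sentence scores to obtain the text-level parameter."""
    scores = [score_sentence(sentence) for sentence in split_sentences(text)]
    return sum(scores) / len(scores) if scores else 0.0


print(text_semantic_parameter("I am happy. I am not happy!"))
```

In this example the two sentences score +5 and -5, so the averaged text semantic parameter is neutral.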

Further, as a specific implementation of the method of FIG. 1, an embodiment of the present application provides a media data processing apparatus. As shown in FIG. 3, the apparatus includes:

a source data receiving module 31, configured to receive source media data, wherein the source media data includes video data and source audio data;

an audio data translation module 32, configured to perform speech transcription on the source audio data to obtain transcribed text data, and to translate the transcribed text data to obtain translated text data of a target language;

a sound parameter adjustment module 33, configured to acquire text semantic parameters corresponding to the transcribed text data, and to adjust preset sound synthesis parameters based on the text semantic parameters;

a sound synthesis module 34, configured to perform sound synthesis on the translated text data according to the adjusted sound synthesis parameters to obtain audio data corresponding to the target language;

a media data synthesis module 35, configured to synthesize the audio data corresponding to the target language with the video data to obtain synthesized media data.
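To make the role of the sound parameter adjustment module more concrete, the following Python sketch shifts a set of preset synthesis parameters according to a text semantic parameter. The parameter names (`rate`, `pitch`, `volume`), the linear mapping, the score range of -10 to +10 and the `max_shift` value are all assumptions for illustration; this passage of the application does not specify which synthesis parameters are adjusted or how.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class SynthesisParams:
    """Hypothetical preset sound synthesis parameters (1.0 means unchanged)."""
    rate: float = 1.0
    pitch: float = 1.0
    volume: float = 1.0


def adjust_params(
    preset: SynthesisParams, semantic_score: float, max_shift: float = 0.2
) -> SynthesisParams:
    """Shift rate and pitch in proportion to a score in the range -10 to +10."""
    clamped = max(-10.0, min(10.0, semantic_score))
    shift = max_shift * clamped / 10.0
    # Positive emotion: slightly faster and higher; negative: slower and lower.
    return replace(preset, rate=preset.rate + shift, pitch=preset.pitch + shift)


print(adjust_params(SynthesisParams(), 7.0))
```

A neutral score of 0 leaves the preset unchanged, so the adjustment only takes effect when the transcribed text carries a detectable emotion.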

In a specific application scenario, as shown in FIG. 4, optionally, the audio data translation module 32 specifically includes:

an input data assembly unit 321, configured to assemble the transcribed text data according to the input parameter assembly rule corresponding to a preset translation line, to obtain translation input data corresponding to the transcribed text data;

a translation data output unit 322, configured to call the preset translation line and input the translation input data into it for translation, to obtain translation output data;

a translated text parsing unit 323, configured to parse the translation output data according to the output parameter parsing rule corresponding to the preset translation line, to obtain the translated text data.

In a specific application scenario, as shown in FIG. 4, optionally, the apparatus further includes:

a verification token generation module 36, configured to, before the preset translation line is called, obtain the verification seed corresponding to the preset translation line and generate a verification token corresponding to the verification seed according to a token generation rule;

a line verification module 37, configured to verify the preset translation line with the verification token, and if the verification passes, determine that the preset translation line is in a callable state.

In a specific application scenario, as shown in FIG. 4, optionally, the sound parameter adjustment module 33 specifically includes:

a sentence segmentation unit 331, configured to split the transcribed text data according to its text structure to obtain a plurality of sentences corresponding to the transcribed text data;

a semantic parameter determination unit 332, configured to obtain the semantic parameter corresponding to each sentence, and to determine the text semantic parameter corresponding to the transcribed text data according to the semantic parameters of the sentences.

In a specific application scenario, as shown in FIG. 4, optionally, the source data receiving module 31 is specifically configured to receive source media data sent by a video publishing terminal;

the media data synthesis module 35 specifically includes:

a playback language acquisition unit 351, configured to acquire the playback language corresponding to the video playback terminal, and to acquire, from the audio data corresponding to the target languages, the audio data corresponding to the playback language;

a playback data synthesis unit 352, configured to synthesize the audio data corresponding to the playback language with the video data to obtain playback media data;

a playback data sending unit 353, configured to send the playback media data to the video playback terminal.

Optionally, the playback data synthesis unit 352 specifically includes:

a playback text acquisition subunit 3521, configured to acquire the translated text data corresponding to the playback language;

a playback data synthesis subunit 3522, configured to synthesize the translated text data and audio data corresponding to the playback language with the video data to obtain the playback media data.

Optionally, the playback language acquisition unit 351 specifically includes:

a first language acquisition subunit 3511, configured to determine the playback language of the video playback terminal according to the geographic location of the video playback terminal; or,

a second language acquisition subunit 3512, configured to determine the playback language of the video playback terminal according to the commonly used language corresponding to the video playback terminal; or,

a third language acquisition subunit 3513, configured to parse, from a playback instruction sent by the video playback terminal, the playback language indicated by the playback instruction.

It should be noted that, for other descriptions of the functional units of the media data processing apparatus provided in the embodiments of the present application, reference may be made to the corresponding descriptions of the methods in FIG. 1 to FIG. 2, and details are not repeated here.

Based on the methods shown in FIG. 1 to FIG. 2, an embodiment of the present application correspondingly further provides a storage medium on which a computer program is stored. When executed by a processor, the computer program implements the media data processing method shown in FIG. 1 to FIG. 2.

Based on this understanding, the technical solution of the present application can be embodied in the form of a software product. The software product can be stored in a non-volatile storage medium (which may be a CD-ROM, a USB flash drive, a removable hard disk, etc.) and includes several instructions that cause a computer device (which may be a personal computer, a server, a network device, etc.) to execute the methods described in the implementation scenarios of the present application.

Based on the methods shown in FIG. 1 to FIG. 2 and the virtual apparatus embodiments shown in FIG. 3 to FIG. 4, and in order to achieve the above purpose, an embodiment of the present application further provides a computer device, which may specifically be a personal computer, a server, a network device, or the like. The computer device includes a storage medium and a processor: the storage medium is configured to store a computer program, and the processor is configured to execute the computer program to implement the media data processing method shown in FIG. 1 to FIG. 2.

Optionally, the computer device may further include a user interface, a network interface, a camera, a radio frequency (RF) circuit, a sensor, an audio circuit, a Wi-Fi module, and the like. The user interface may include a display screen and an input unit such as a keyboard, and may optionally also include a USB interface, a card reader interface, and the like. The network interface may optionally include a standard wired interface, a wireless interface (such as a Bluetooth interface or a Wi-Fi interface), and the like.

Those skilled in the art can understand that the computer device structure provided in this embodiment does not limit the computer device, which may include more or fewer components, combine certain components, or use a different arrangement of components.

The storage medium may further include an operating system and a network communication module. The operating system is a program that manages and maintains the hardware and software resources of the computer device and supports the running of information processing programs and other software and/or programs. The network communication module is used to implement communication between the components inside the storage medium, as well as communication with other hardware and software in the physical device.

From the description of the above embodiments, those skilled in the art can clearly understand that the present application can be implemented by software plus a necessary general-purpose hardware platform, or by hardware. After the source media data is received, speech transcription is first performed on the source audio data contained in the source media data to obtain the corresponding transcribed text data. The transcribed text is then translated from the source language into translated text data of the target language, and the sound synthesis parameters are adjusted according to the text semantic parameters corresponding to the transcribed text. Based on the adjusted sound synthesis parameters, the translated text data is synthesized into audio data of the target language, and that audio data is assembled with the video data contained in the source media data to obtain the synthesized media data. Compared with the prior-art approach of playing the live video directly, the embodiments of the present application can convert the source media data into media data in multiple languages, which is convenient for users with different language habits. They can also determine the sound synthesis parameters from the text semantic parameters corresponding to the transcribed text data of the source audio data and use those parameters for sound synthesis, so that the synthesized voice better matches the emotion expressed by the source audio data. This increases the similarity in look and feel between the synthesized media data and the source media data, improves the user's video viewing experience, and helps to increase the number of video views on the video platform.

Those skilled in the art can understand that the accompanying drawings are only schematic diagrams of a preferred implementation scenario, and that the modules or processes in the drawings are not necessarily required for implementing the present application. Those skilled in the art can also understand that the modules of the apparatus in an implementation scenario may be distributed in the apparatus as described for that scenario, or may be changed accordingly and located in one or more apparatuses different from that of the scenario. The modules of the above implementation scenarios may be combined into one module or further split into multiple sub-modules.

The serial numbers used above are for description only and do not indicate the relative merits of the implementation scenarios. What is disclosed above is merely several specific implementation scenarios of the present application. The present application is not limited thereto, and any variation that those skilled in the art can conceive of shall fall within the protection scope of the present application.