CN118782046A - Speech translation method, device, and electronic device - Google Patents

Speech translation method, device, and electronic device

Info

Publication number
CN118782046A
Authority
CN
China
Prior art keywords
audio
original
text
target
timestamps
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202410864944.7A
Other languages
Chinese (zh)
Inventor
张瑶
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wanxing Technology Hunan Co ltd
Original Assignee
Wanxing Technology Hunan Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wanxing Technology Hunan Co Ltd
Priority to CN202410864944.7A
Publication of CN118782046A
Legal status: Pending (current)

Abstract

An embodiment of the present application provides a speech translation method comprising the following steps: obtaining the original text and sentence-break timestamps of an original audio; translating the original text of the original audio into target-language text; synthesizing the target-language text into target audio; and comparing the sentence-break timestamps of the target audio with those of the original audio and performing a secondary translation of the original text of the original audio, so that the sync points of the target audio align with those of the original audio. The speech translation method provided by the embodiments of the present application can effectively improve how well the target audio matches and coordinates with the original audio. Embodiments of the present application also provide a speech translation device and an electronic device.

Description

Translated from Chinese
Speech translation method, device, and electronic device

Technical Field

The embodiments of the present application relate to the field of speech processing technology, and in particular to a speech translation method, device, and electronic device.

Background Art

As globalization deepens, cross-border communication and cooperation have become increasingly frequent, and video has become a primary way for people to communicate and obtain information. However, language barriers often make foreign-language videos difficult to understand. In the past, solving this problem required professional voice actors to translate and dub the original video manually, which was costly and inefficient.

In recent years, continuous breakthroughs in AI technologies such as machine translation, STT, and TTS have enabled automated multi-language dubbing: the original audio is transcribed into text, the text is translated, and the translated text is synthesized back into audio. However, the dubbed result often differs greatly from the original audio in rhythm. For example, the speaker in the video has finished talking while the automatically dubbed audio is still playing, or the speaker is still talking while the dubbed audio has already ended, causing a severe mismatch between the voice and the picture in the video.

Summary of the Invention

In view of the problems in the prior art described above, embodiments of the present application provide a speech translation method, device, and electronic device that can effectively improve how well the target audio matches and coordinates with the original audio.

In a first aspect, an embodiment of the present application provides a speech translation method, comprising:

obtaining the original text and sentence-break timestamps of an original audio;

translating the original text of the original audio into target-language text;

synthesizing the target-language text into target audio; and

comparing the sentence-break timestamps of the target audio with those of the original audio and performing a secondary translation of the original text of the original audio, so that the sync points of the target audio align with those of the original audio.

Further, translating the original text of the original audio into target-language text comprises:

translating the original text of the original audio into the target-language text using a large text model.

Further, obtaining the original text and sentence-break timestamps of the original audio comprises:

obtaining the original text and sentence-break timestamps of the original audio using an STT algorithm.

Further, synthesizing the target-language text into target audio comprises:

synthesizing the target-language text into the target audio using a TTS algorithm.

Further, comparing the sentence-break timestamps of the target audio with those of the original audio and performing a secondary translation of the original text of the original audio comprises:

comparing the sentence-break timestamps of the target audio with those of the original audio and, based on the text lengths of the target audio and the original audio, performing a secondary translation of the original text of the original audio by setting a secondary-translation prompt.

Further, after comparing the sentence-break timestamps of the target audio with those of the original audio and performing a secondary translation of the original text of the original audio, the method further comprises:

synthesizing the secondary-translated target-language text into target audio.

Further, the secondary-translated target-language text is synthesized into the target audio using a TTS algorithm.

In a second aspect, an embodiment of the present application further provides a speech translation device, comprising:

a text acquisition module, configured to obtain the original text and sentence-break timestamps of an original audio;

a text translation module, configured to translate the original text of the original audio into target-language text;

an audio synthesis module, configured to synthesize the target-language text into target audio; and

a secondary translation module, configured to compare the sentence-break timestamps of the target audio with those of the original audio and perform a secondary translation of the original text of the original audio, so that the sync points of the target audio align with those of the original audio.

In a third aspect, an embodiment of the present application further provides an electronic device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the program, implements the speech translation method according to the first aspect.

In a fourth aspect, an embodiment of the present application further provides a computer-readable storage medium on which a computer program is stored, the computer program being used to implement the speech translation method according to the first aspect.

In a fifth aspect, an embodiment of the present application further provides a computer program product on which a computer program is stored, the computer program being used to implement the speech translation method according to the first aspect.

Embodiments of the present application provide the following beneficial effects:

In the speech translation method provided by embodiments of the present application, after the original text and sentence-break timestamps of the original audio are obtained, the original text is translated into target-language text and synthesized into target audio; the sentence-break timestamps of the target audio are then compared with those of the original audio, and the original text is translated a second time so that the sync points of the target audio align with those of the original audio. This ensures that the target audio generated from the translated text matches the rhythm of the original audio and thereby improves how well the voice matches the picture.

BRIEF DESCRIPTION OF THE DRAWINGS

In order to more clearly describe the technical solutions in the embodiments of the present application or in the prior art, the drawings required for describing the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present application; a person of ordinary skill in the art can derive other drawings from these drawings without creative effort.

FIG. 1 is a flow chart of a speech translation method provided by an embodiment of the present application;

FIG. 2 is a structural block diagram of a speech translation device provided by an embodiment of the present application;

FIG. 3 is a schematic structural diagram of an electronic device provided by an embodiment of the present application.

The objectives, functional features, and advantages of the present application will be further described with reference to the embodiments and the accompanying drawings.

DETAILED DESCRIPTION

The technical solutions in the embodiments of the present application are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present application without creative effort fall within the scope of protection of the present application.

In the specification, the claims, and the drawings of the present application, the terms "first" and "second" are used for descriptive purposes only and should not be understood as indicating or implying relative importance or the number of the indicated technical features. Therefore, a feature qualified by "first" or "second" may explicitly or implicitly include one or more of that feature. In the description of the present application, unless otherwise specified, "multiple" means two or more. A person of ordinary skill in the art can understand the specific meanings of the above terms in the present application according to the specific context.

FIG. 1 is a flow chart of a speech translation method according to an embodiment of the present application. As shown in FIG. 1, the speech translation method of the embodiment of the present application includes the following steps:

S101: obtain the original text and sentence-break timestamps of the original audio;

Specifically, audio usually refers to a file that stores sound content, including human voice, natural sound, and so on; the original audio here specifically refers to an audio file containing human speech. When the object to be processed is a video file, the audio must first be separated from it before processing.
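Only as an illustrative aside (not part of the patent text): assuming the ffmpeg command-line tool is available, the audio-separation step described above could be sketched as follows; the file names are hypothetical.

```python
import subprocess

def extract_audio(video_path: str, audio_path: str) -> None:
    """Extract the audio track of a video as 16 kHz mono WAV (illustrative sketch)."""
    subprocess.run(
        [
            "ffmpeg", "-y",      # overwrite the output file if it already exists
            "-i", video_path,    # input video file
            "-vn",               # drop the video stream, keep audio only
            "-ac", "1",          # down-mix to mono
            "-ar", "16000",      # resample to 16 kHz, a common rate for STT models
            audio_path,
        ],
        check=True,
    )

# Hypothetical file names, for illustration only:
# extract_audio("input_video.mp4", "original_audio.wav")
```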

After the original audio is obtained, speech recognition is performed to obtain the original text, together with the sentence-break timestamps of the original audio.

S102: translate the original text of the original audio into target-language text;

Specifically, once the original text of the original audio has been obtained, it needs to be translated into target-language text. Translation methods include manual translation and intelligent machine translation, and the target language may be any language.

S103: synthesize the target-language text into target audio;

That is, after the original text has been translated into target-language text, the target-language text is synthesized into target audio, for example through track-splitting operations; however, the sync points of the target audio may not match those of the original audio.

S104: compare the sentence-break timestamps of the target audio with those of the original audio and perform a secondary translation of the original text of the original audio, so that the sync points of the target audio align with those of the original audio.

Specifically, the length of the original text and the length of the target-audio text, that is, the sentence-break timestamps of the two, are compared, and the original text of the original audio is translated a second time so that the two text lengths correspond, which makes the sync points of the target audio and the original audio consistent.
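A minimal sketch of this comparison step follows. The segment representation, helper name, and 0.5 s tolerance are assumptions of this sketch, not values defined in the patent.

```python
from dataclasses import dataclass

@dataclass
class Segment:
    start: float  # sentence-break start time, in seconds
    end: float    # sentence-break end time, in seconds
    text: str     # text spoken within this segment

    @property
    def duration(self) -> float:
        return self.end - self.start

def segments_needing_retranslation(
    original: list[Segment],
    target: list[Segment],
    tolerance: float = 0.5,   # hypothetical tolerance, in seconds
) -> list[int]:
    """Return indices of segments whose synthesized duration drifts from the
    original sentence-break timing by more than the tolerance."""
    mismatched = []
    for i, (orig_seg, tgt_seg) in enumerate(zip(original, target)):
        if abs(orig_seg.duration - tgt_seg.duration) > tolerance:
            mismatched.append(i)
    return mismatched
```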

In practical applications, for example when the target audio is the dubbing audio for the original audio, the speech translation method provided by the embodiment of the present application can solve the problem of matching and coordinating the dubbing audio with the picture.

Therefore, in the speech translation method provided by the embodiment of the present application, after the original text and sentence-break timestamps of the original audio are obtained, the original text is translated into target-language text and synthesized into target audio; the sentence-break timestamps of the target audio are then compared with those of the original audio, and the original text is translated a second time so that the sync points of the target audio align with those of the original audio. This ensures that the target audio generated from the translated text matches the rhythm of the original audio and improves how well the voice matches the picture.

Further, in some embodiments of the present application, translating the original text of the original audio into target-language text comprises:

translating the original text of the original audio into target-language text using a large text model.

Specifically, a large text model, that is, an AI large text model, is a powerful text understanding and generation model built on deep learning and natural language processing. By training on large-scale text datasets, it automatically learns the semantics, contextual information, and grammatical structure of text. Such models are highly accurate and intelligent and can perform tasks such as text classification, sentiment analysis, and machine translation.

A large text model such as GPT or Wenxin Yiyan (ERNIE Bot) is used here. By setting a translation prompt, the model acts as a multilingual text translator: the original text is provided as input, and the model outputs the target-language text. Compared with manual translation and conventional machine translation, a large text model is more accurate and produces more natural output, which ensures the translation quality of the original text of the original audio.
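A minimal sketch of this first-pass translation step, assuming an OpenAI-style chat-completion client; the model name, prompt wording, and function name are illustrative assumptions, not taken from the patent.

```python
from openai import OpenAI

client = OpenAI()  # assumes an API key is configured in the environment

def translate_text(original_text: str, target_language: str) -> str:
    """First-pass translation of the original transcript via a prompted LLM."""
    prompt = (
        f"You are a multilingual text translator. Translate the following text "
        f"into {target_language}, preserving its meaning and tone.\n\n{original_text}"
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model name
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```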

Further, in some embodiments of the present application, obtaining the original text and sentence-break timestamps of the original audio comprises:

obtaining the original text and sentence-break timestamps of the original audio using an STT algorithm.

Specifically, an STT (Speech to Text) algorithm is a technology that converts speech signals into text. Its basic principle is to convert the speech signal into a series of audio features and then map these features to text with a machine learning algorithm. Compared with other natural language generation technologies, it has the following advantages: (1) simple implementation: it can be implemented with programming languages such as Python and does not require dedicated hardware; (2) high accuracy: trained on large-scale text data, it can produce highly accurate text; (3) strong customizability: different algorithms and parameters can be flexibly selected and customized for different application scenarios and needs; (4) support for multiple generation methods: it supports rule-based, model-based, and generative-model-based text generation.

Therefore, the speech translation method provided by the embodiment of the present application obtains the original text and sentence-break timestamps of the original audio using an STT algorithm, which improves the accuracy and effectiveness of the operation.

It should be noted that the STT algorithm here may be a mainstream open-source technology such as Whisper or wav2vec, or a mature commercial interface such as those from Microsoft or Alibaba.
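A minimal sketch of this step using the open-source openai-whisper package mentioned above; the choice of the "base" model and the helper name are assumptions for illustration.

```python
import whisper

def transcribe_with_timestamps(audio_path: str):
    """Obtain the original text plus per-sentence segments with start/end
    timestamps, using the open-source Whisper package (illustrative sketch)."""
    model = whisper.load_model("base")   # small model chosen for illustration
    result = model.transcribe(audio_path)
    segments = [
        {"start": seg["start"], "end": seg["end"], "text": seg["text"].strip()}
        for seg in result["segments"]
    ]
    return result["text"], segments
```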

Further, in some embodiments of the present application, synthesizing the target-language text into target audio comprises:

synthesizing the target-language text into the target audio using a TTS algorithm.

Specifically, a TTS (Text to Speech) algorithm is a speech synthesis technology that converts text generated by a computer or input from an external source into intelligible, fluent spoken output. The speech translation method provided by the embodiment of the present application synthesizes the target-language text into the target audio using a TTS algorithm, with high accuracy and high efficiency.

It should be noted that the TTS algorithm used here may be an open-source system such as Coqui TTS or OpenTTS, or a commercial TTS interface such as Microsoft's.
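A minimal sketch of this synthesis step using the open-source Coqui TTS package mentioned above; the model name and helper function are illustrative assumptions.

```python
from TTS.api import TTS

def synthesize_speech(text: str, output_path: str) -> None:
    """Synthesize target-language text into a target audio file with Coqui TTS."""
    tts = TTS(model_name="tts_models/en/ljspeech/tacotron2-DDC")  # illustrative model
    tts.tts_to_file(text=text, file_path=output_path)
```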

Further, in some embodiments of the present application, comparing the sentence-break timestamps of the target audio with those of the original audio and performing a secondary translation of the original text of the original audio comprises:

comparing the sentence-break timestamps of the target audio with those of the original audio and, based on the text lengths of the target audio and the original audio, performing a secondary translation of the original text of the original audio by setting a secondary-translation prompt.

Specifically, a prompt usually refers to an input text passage or phrase that serves as the starting point or guide for the output of a generative model. A prompt may be a question, a textual description, a dialogue, or any other form of text input; the model generates the corresponding output text based on the context and semantic information provided by the prompt.

The speech translation method provided by the embodiment of the present application compares the sentence-break timestamps of the target audio with those of the original audio and, based on the text lengths of the target audio and the original audio, performs a secondary translation of the original text by setting a secondary-translation prompt, so that the sync points of the target audio align with those of the original audio. This ensures that the target audio generated from the translated text matches the rhythm of the original audio and improves how well the voice matches the picture.
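A minimal sketch of how such a secondary-translation prompt might be constructed; the prompt wording, the duration-based decision, and the function signature are assumptions of this sketch, not text from the patent.

```python
def build_retranslation_prompt(
    original_text: str,
    first_translation: str,
    original_duration: float,
    target_duration: float,
    target_language: str,
) -> str:
    """Build a secondary-translation prompt asking the model to shorten or
    lengthen the translation so the synthesized audio fits the original timing."""
    if target_duration > original_duration:
        adjustment = "shorter and more concise"
    else:
        adjustment = "slightly longer and more natural"
    return (
        f"The following {target_language} translation takes about "
        f"{target_duration:.1f} seconds to speak, but the original speech lasts "
        f"{original_duration:.1f} seconds.\n"
        f"Original text: {original_text}\n"
        f"Current translation: {first_translation}\n"
        f"Rewrite the translation so that it is {adjustment}, keeps the same "
        f"meaning, and can be spoken in about {original_duration:.1f} seconds."
    )
```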

Further, in some embodiments of the present application, after comparing the sentence-break timestamps of the target audio with those of the original audio and performing a secondary translation of the original text of the original audio, the method further comprises:

synthesizing the secondary-translated target-language text into target audio.

Specifically, after the secondary translation of the target-language text is completed, the secondary-translated text is synthesized into target audio so that the sync points of the target audio align with those of the original audio. This ensures that the target audio generated from the translated text matches the rhythm of the original audio and improves how well the voice matches the picture.
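A minimal end-to-end sketch of this translate, synthesize, compare, and retranslate loop, reusing the hypothetical helpers sketched above (translate_text, synthesize_speech, build_retranslation_prompt, client, and the Segment class); the round limit and tolerance are illustrative assumptions.

```python
import contextlib
import wave

def wav_duration(path: str) -> float:
    """Duration of a WAV file in seconds."""
    with contextlib.closing(wave.open(path, "rb")) as wav_file:
        return wav_file.getnframes() / wav_file.getframerate()

def dub_segment(segment, target_language: str, output_path: str,
                max_rounds: int = 3, tolerance: float = 0.5) -> str:
    """Translate one segment, synthesize it, and re-translate until the
    synthesized duration matches the original sentence-break timing."""
    translation = translate_text(segment.text, target_language)
    for _ in range(max_rounds):
        synthesize_speech(translation, output_path)
        target_duration = wav_duration(output_path)
        if abs(target_duration - segment.duration) <= tolerance:
            break
        prompt = build_retranslation_prompt(
            segment.text, translation, segment.duration,
            target_duration, target_language,
        )
        response = client.chat.completions.create(
            model="gpt-4o-mini",  # illustrative model name
            messages=[{"role": "user", "content": prompt}],
        )
        translation = response.choices[0].message.content
    return translation
```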

Further, in some embodiments of the present application, the secondary-translated target-language text is synthesized into the target audio using a TTS algorithm.

Specifically, a TTS (Text to Speech) algorithm is a speech synthesis technology that converts text generated by a computer or input from an external source into intelligible, fluent spoken output. The speech translation method provided by the embodiment of the present application synthesizes the target-language text into the target audio using a TTS algorithm, with high accuracy and high efficiency.

FIG. 2 is a structural block diagram of a speech translation device 200 according to an embodiment of the present application. As shown in FIG. 2, the speech translation device 200 of the embodiment of the present application comprises: a text acquisition module 210, a text translation module 220, an audio synthesis module 230, and a secondary translation module 240, wherein:

the text acquisition module 210 is configured to obtain the original text and sentence-break timestamps of the original audio;

the text translation module 220 is configured to translate the original text of the original audio into target-language text;

the audio synthesis module 230 is configured to synthesize the target-language text into target audio; and

the secondary translation module 240 is configured to compare the sentence-break timestamps of the target audio with those of the original audio and perform a secondary translation of the original text of the original audio, so that the sync points of the target audio align with those of the original audio.

In the speech translation device provided by the embodiment of the present application, after the original text and sentence-break timestamps of the original audio are obtained, the original text is translated into target-language text and synthesized into target audio; the sentence-break timestamps of the target audio are then compared with those of the original audio, and the original text is translated a second time so that the sync points of the target audio align with those of the original audio. This ensures that the target audio generated from the translated text matches the rhythm of the original audio and improves how well the voice matches the picture.

It should be noted that the specific implementation of the speech translation device of the embodiment of the present application is similar to that of the speech translation method of the embodiment of the present application; for details, refer to the description of the method, which is not repeated here.

FIG. 3 is a schematic structural diagram of an electronic device 300 according to an embodiment of the present application.

As shown in FIG. 3, the electronic device 300 includes a central processing unit (CPU) 301, which can perform various appropriate actions and processes according to a program stored in a read-only memory (ROM) 302 or a program loaded from a storage section 308 into a random access memory (RAM) 303. The RAM 303 also stores various programs and data required for the operation of the electronic device 300. The CPU 301, the ROM 302, and the RAM 303 are connected to one another via a bus 304. An input/output (I/O) interface 305 is also connected to the bus 304.

The following components are connected to the I/O interface 305: an input section 306 including a keyboard, a mouse, and the like; an output section 307 including a cathode ray tube (CRT) or liquid crystal display (LCD) and a speaker; a storage section 308 including a hard disk; and a communication section 309 including a network interface card such as a LAN card or a modem. The communication section 309 performs communication processing via a network such as the Internet. A drive 310 is also connected to the I/O interface 305 as needed. A removable medium 311, such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory, is mounted on the drive 310 as needed, so that a computer program read from it can be installed into the storage section 308 as needed.

In particular, according to embodiments of the present application, the process described above with reference to the flow chart may be implemented as a computer software program. For example, an embodiment of the present application includes a computer program product comprising a computer program carried on a machine-readable medium, the computer program containing program code for performing the method shown in the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network through the communication section 309 and/or installed from the removable medium 311. When the computer program is executed by the central processing unit (CPU) 301, the above functions defined in the electronic device of the present application are performed.

It should be noted that the computer-readable medium shown in the present application may be a computer-readable signal medium, a computer-readable storage medium, or any combination of the two. The computer-readable storage medium may be, for example, but is not limited to, an electric, magnetic, optical, electromagnetic, infrared, or semiconductor device or apparatus, or any combination of the above. More specific examples of the computer-readable storage medium may include, but are not limited to: an electrical connection with one or more wires, a portable computer disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disk read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above.

In the present application, a computer-readable storage medium may be any tangible medium that contains or stores a program which may be used by or in combination with an instruction-executing device or apparatus. A computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, carrying computer-readable program code. Such a propagated data signal may take a variety of forms, including but not limited to an electromagnetic signal, an optical signal, or any suitable combination of the above. A computer-readable signal medium may also be any computer-readable medium other than a computer-readable storage medium, which can send, propagate, or transmit a program for use by or in combination with an instruction-executing device or apparatus. The program code contained on the computer-readable medium may be transmitted using any suitable medium, including but not limited to wireless, wire, optical cable, RF, or any suitable combination of the above.

The flow charts and block diagrams in the drawings illustrate the architecture, functions, and operations of possible implementations of devices, methods, and computer program products according to various embodiments of the present application. In this regard, each block in a flow chart or block diagram may represent a module, a program segment, or a portion of code, which contains one or more executable instructions for implementing the specified logical function. It should also be noted that, in some alternative implementations, the functions marked in the blocks may occur in an order different from that marked in the drawings. For example, two consecutively shown blocks may in fact be executed substantially in parallel, or sometimes in the reverse order, depending on the functions involved. It should also be noted that each block in the block diagrams and/or flow charts, and combinations of blocks in the block diagrams and/or flow charts, may be implemented by a dedicated hardware-based device that performs the specified functions or operations, or by a combination of dedicated hardware and computer instructions.

The units or modules described in the embodiments of the present application may be implemented by software or by hardware. The described units or modules may also be provided in a processor, which, when executing the program, implements the speech translation method:

obtaining the original text and sentence-break timestamps of the original audio;

translating the original text of the original audio into target-language text;

synthesizing the target-language text into target audio; and

comparing the sentence-break timestamps of the target audio with those of the original audio and performing a secondary translation of the original text of the original audio, so that the sync points of the target audio align with those of the original audio.

In another aspect, the present application further provides a computer-readable storage medium, which may be included in the electronic device described in the above embodiments, or may exist separately without being assembled into the electronic device. The computer-readable storage medium stores one or more programs which, when executed by one or more processors, perform the speech translation method described in the present application:

obtaining the original text and sentence-break timestamps of the original audio;

translating the original text of the original audio into target-language text;

synthesizing the target-language text into target audio; and

comparing the sentence-break timestamps of the target audio with those of the original audio and performing a secondary translation of the original text of the original audio, so that the sync points of the target audio align with those of the original audio.

In another aspect, the present application further provides a computer program product, which may be included in the electronic device described in the above embodiments, or may exist separately without being assembled into the electronic device. The computer program product stores one or more programs which, when executed by one or more processors, perform the speech translation method described in the present application:

obtaining the original text and sentence-break timestamps of the original audio;

translating the original text of the original audio into target-language text;

synthesizing the target-language text into target audio; and

comparing the sentence-break timestamps of the target audio with those of the original audio and performing a secondary translation of the original text of the original audio, so that the sync points of the target audio align with those of the original audio.

The above are only preferred embodiments of the present application and do not limit its patent scope. Any equivalent structural transformation made using the content of the specification and drawings of the present application under its inventive concept, or any direct or indirect application in other related technical fields, falls within the patent protection scope of the present application.

Claims (10)

Translated from Chinese
1. A speech translation method, comprising:
obtaining the original text and sentence-break timestamps of an original audio;
translating the original text of the original audio into target-language text;
synthesizing the target-language text into target audio; and
comparing the sentence-break timestamps of the target audio with those of the original audio and performing a secondary translation of the original text of the original audio, so that the sync points of the target audio align with those of the original audio.

2. The speech translation method according to claim 1, wherein translating the original text of the original audio into target-language text comprises:
translating the original text of the original audio into the target-language text using a large text model.

3. The speech translation method according to claim 1, wherein obtaining the original text and sentence-break timestamps of the original audio comprises:
obtaining the original text and sentence-break timestamps of the original audio using an STT algorithm.

4. The speech translation method according to claim 1, wherein synthesizing the target-language text into target audio comprises:
synthesizing the target-language text into the target audio using a TTS algorithm.

5. The speech translation method according to claim 1, wherein comparing the sentence-break timestamps of the target audio with those of the original audio and performing a secondary translation of the original text of the original audio comprises:
comparing the sentence-break timestamps of the target audio with those of the original audio and, based on the text lengths of the target audio and the original audio, performing a secondary translation of the original text of the original audio by setting a secondary-translation prompt.

6. The speech translation method according to claim 1, wherein after comparing the sentence-break timestamps of the target audio with those of the original audio and performing a secondary translation of the original text of the original audio, the method further comprises:
synthesizing the secondary-translated target-language text into target audio.

7. The speech translation method according to claim 6, wherein the secondary-translated target-language text is synthesized into the target audio using a TTS algorithm.

8. A speech translation device, comprising:
a text acquisition module, configured to obtain the original text and sentence-break timestamps of an original audio;
a text translation module, configured to translate the original text of the original audio into target-language text;
an audio synthesis module, configured to synthesize the target-language text into target audio; and
a secondary translation module, configured to compare the sentence-break timestamps of the target audio with those of the original audio and perform a secondary translation of the original text of the original audio, so that the sync points of the target audio align with those of the original audio.

9. An electronic device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the program, implements the speech translation method according to any one of claims 1 to 7.

10. A computer-readable storage medium, wherein the computer-readable storage medium stores a computer program used to implement the speech translation method according to any one of claims 1 to 7.
Application CN202410864944.7A, filed 2024-06-28 (priority date 2024-06-28): Speech translation method, device, and electronic device. Status: Pending. Publication: CN118782046A (en).

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN202410864944.7A (CN118782046A) | 2024-06-28 | 2024-06-28 | Speech translation method, device, and electronic device

Applications Claiming Priority (1)

Application Number | Priority Date | Filing Date | Title
CN202410864944.7A (CN118782046A) | 2024-06-28 | 2024-06-28 | Speech translation method, device, and electronic device

Publications (1)

Publication Number | Publication Date
CN118782046A | 2024-10-15

Family

ID=92992543

Family Applications (1)

Application Number | Title | Priority Date | Filing Date
CN202410864944.7A (Pending, CN118782046A) | Speech translation method, device, and electronic device | 2024-06-28 | 2024-06-28

Country Status (1)

Country | Link
CN (1) | CN118782046A (en)

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN110769265A (en)* | 2019-10-08 | 2020-02-07 | Shenzhen Skyworth-RGB Electronic Co., Ltd. | Simultaneous caption translation method, smart television and storage medium
CN113822187A (en)* | 2021-09-10 | 2021-12-21 | Alibaba Damo Academy (Hangzhou) Technology Co., Ltd. | Sign language interpretation, customer service, communication method, apparatus and readable medium
WO2023185641A1 (en)* | 2022-03-31 | 2023-10-05 | Huawei Technologies Co., Ltd. | Data processing method and electronic device
US11763099B1 (en)* | 2022-04-27 | 2023-09-19 | VoyagerX, Inc. | Providing translated subtitle for video content
CN115240634A (en)* | 2022-07-20 | 2022-10-25 | Beijing Youzhuju Network Technology Co., Ltd. | Method, apparatus, device and storage medium for speech processing
CN117812434A (en)* | 2023-05-16 | 2024-04-02 | Vidaa International Holding (Netherlands) Co. | Display device and translation method of media file
CN117556835A (en)* | 2023-11-10 | 2024-02-13 | 脸萌有限公司 | Speech translation method, device, electronic equipment and medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Liang, Yuhai; Zhou, Qiang: "Automatic Construction of a Basic Annotated Corpus of Daily Conversation Based on TV Drama Subtitles and Scripts", Journal of Chinese Information Processing, no. 01, 15 January 2020 (2020-01-15)*

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN119274574A (en)* | 2024-12-05 | 2025-01-07 | Malanshan Audio and Video Laboratory (马栏山音视频实验室) | A method, device, equipment and storage medium for audio and video lip translation
CN119274574B (en)* | 2024-12-05 | 2025-03-11 | Malanshan Audio and Video Laboratory (马栏山音视频实验室) | Audio and video mouth shape translation method, device, equipment and storage medium

Similar Documents

Publication | Publication Date | Title
JP7181332B2 (en) Voice conversion method, device and electronic equipment
CN112786006B (en) Speech synthesis method, synthesis model training method, device, medium and equipment
CN108831437B (en)Singing voice generation method, singing voice generation device, terminal and storage medium
US11328133B2 (en)Translation processing method, translation processing device, and device
CN107945786B (en) Speech synthesis method and apparatus
CN109523989B (en)Speech synthesis method, speech synthesis device, storage medium, and electronic apparatus
US11960852B2 (en)Robust direct speech-to-speech translation
JP7314450B2 (en) Speech synthesis method, device, equipment, and computer storage medium
WO2020098269A1 (en)Speech synthesis method and speech synthesis device
US9412359B2 (en)System and method for cloud-based text-to-speech web services
US20230015112A1 (en)Method and apparatus for processing speech, electronic device and storage medium
CN114333903A (en)Voice conversion method and device, electronic equipment and storage medium
US20230005466A1 (en)Speech synthesis method, and electronic device
CN118782046A (en) Speech translation method, device, and electronic device
CN112420017A (en) Speech synthesis method and device
WO2024146328A1 (en)Training method for translation model, translation method, and device
Black et al. The festival speech synthesis system, version 1.4.2
US12073822B2 (en)Voice generating method and apparatus, electronic device and storage medium
CN112382269B (en) Audio synthesis method, device, equipment and storage medium
CN119068861A (en) Speech synthesis method, device, terminal equipment and medium based on artificial intelligence
Lu et al. DeSTA2.5-Audio: Toward General-Purpose Large Audio Language Model with Self-Generated Cross-Modal Alignment
CN113870833A (en)Speech synthesis related system, method, device and equipment
CN117975932B (en) Speech recognition method, system and medium based on network collection and speech synthesis
CN116030789B (en) A method and device for generating speech synthesis training data
CN118865939A (en) Speech translation method, device, and electronic device

Legal Events

Code | Title
PB01 | Publication
SE01 | Entry into force of request for substantive examination
