CN114780792A - A method, apparatus, device and medium for generating a video summary - Google Patents

A method, apparatus, device and medium for generating a video summary

Info

Publication number
CN114780792A
Authority
CN
China
Prior art keywords
target video
video
features
text
target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210499374.7A
Other languages
Chinese (zh)
Inventor
杜臣
周文
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Youzhuju Network Technology Co Ltd
Original Assignee
Beijing Youzhuju Network Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Youzhuju Network Technology Co Ltd
Priority to CN202210499374.7A
Publication of CN114780792A
Legal status: Pending (current)

Abstract

The present application provides a video summary generation method, apparatus, device, and medium. In the method, an electronic device extracts content features of a target video from the target video, extracts text features of the target video from auxiliary text of the target video, determines the similarity between the content features and the text features, determines target segments from the target video according to the similarity, and generates a video summary of the target video from the target segments. An accurate and efficient video summary generation method that avoids subjectivity is thereby provided.

Description

Translated from Chinese
A method, apparatus, device and medium for generating a video summary

Technical Field

The present application relates to the technical field of video processing, and in particular to a video summary generation method, apparatus, device, computer-readable storage medium, and computer program product.

Background Art

A video summary is a brief overview of the content of a long video; by watching the video summary, a user can grasp the main content of the long video. Video summaries help users browse video content efficiently across large numbers of videos and find the content they are interested in.

Conventionally, video summary generation relies on manually annotating key frames in a video as the video summary. This approach considers only the image features of the video, and the manual annotation is highly subjective, making it difficult to meet users' need to obtain video information through video summaries.

Therefore, an accurate and efficient video summary generation method is urgently needed.

Summary of the Invention

The purpose of the present disclosure is to provide a video summary generation method, apparatus, device, computer-readable storage medium, and computer program product, which can generate video summaries accurately and efficiently.

In a first aspect, the present disclosure provides a video summary generation method, the method comprising:

extracting content features of a target video from the target video, and extracting text features of the target video from auxiliary text of the target video;

determining a similarity between the content features and the text features, and determining a target segment from the target video according to the similarity; and

generating a video summary of the target video according to the target segment.

In a second aspect, the present disclosure provides a video summary generation apparatus, the apparatus comprising:

an extraction module, configured to extract content features of a target video from the target video, and to extract text features of the target video from auxiliary text of the target video;

a determination module, configured to determine a similarity between the content features and the text features, and to determine a target segment from the target video according to the similarity; and

a generation module, configured to generate a video summary of the target video according to the target segment.

In a third aspect, the present disclosure provides an electronic device, comprising: a storage apparatus storing a computer program; and a processing apparatus configured to execute the computer program in the storage apparatus to implement the steps of the method of the first aspect of the present disclosure.

In a fourth aspect, the present disclosure provides a computer-readable medium storing a computer program which, when executed by a processing apparatus, implements the steps of the method of the first aspect of the present disclosure.

In a fifth aspect, the present disclosure provides a computer program product comprising instructions which, when run on a device, cause the device to perform the steps of the method of the first aspect.

As can be seen from the above technical solutions, the present disclosure has at least the following advantages:

The electronic device extracts content features of the target video from the target video and text features of the target video from the auxiliary text of the target video, then determines the similarity between the content features and the text features, determines a target segment from the target video according to the similarity, and generates a video summary of the target video from the target segment.

The auxiliary text of the target video is provided by the user who uploaded the target video and therefore contains a textual summary of the target video. A video segment with high similarity to this textual summary can thus be selected from the target video, based on the similarity between the content features and the text features, and used as the video summary of the target video. Because the video summary is highly similar to the auxiliary text provided by the uploading user, the subjectivity of manual annotation is avoided, and the summary is obtained automatically by computing the similarity between the content features and the text features, which improves the efficiency of generating the video summary of the target video.

Other features and advantages of the present disclosure will be described in detail in the detailed description that follows.

Brief Description of the Drawings

To illustrate the technical methods of the embodiments of the present application more clearly, the accompanying drawings used in the embodiments are briefly introduced below.

FIG. 1 is a flowchart of a video summary generation method provided by an embodiment of the present application;

FIG. 2 is a schematic diagram of extracting features from a target video according to an embodiment of the present application;

FIG. 3 is a schematic diagram of obtaining a similarity through a similarity evaluation model according to an embodiment of the present application;

FIG. 4 is a schematic diagram of determining a target segment from multiple video segments according to an embodiment of the present application;

FIG. 5 is a schematic structural diagram of a video summary generation apparatus according to an embodiment of the present disclosure;

FIG. 6 is a schematic structural diagram of an electronic device according to an embodiment of the present disclosure.

Detailed Description

The terms "first" and "second" in the embodiments of the present application are used for descriptive purposes only and should not be understood as indicating or implying relative importance or implicitly indicating the number of the technical features referred to. Thus, a feature defined as "first" or "second" may explicitly or implicitly include one or more of that feature.

First, some technical terms involved in the embodiments of the present application are introduced.

Video is a media resource that carries rich information. A video can be played by a video player to present moving images to the user, where the moving images include continuous images and audio. For videos such as TV series, movies, and news, the images presented during playback may include text, for example subtitles, sponsor names, video producer names, or background text. The audio presented during playback may include speech and background sound (such as a theme song or interlude).

A video summary is a brief overview of the content of a long video; by watching the video summary, a user can grasp the main content of the long video. Video summaries help users browse video content efficiently across large numbers of videos and find the content they are interested in.

Conventionally, video summary generation can be treated as a classification problem of judging whether a video segment (or video frame) belongs to the video summary: the summary of the target video is determined by judging, for each video segment (or video frame) in the video, whether it belongs to the summary.

Specifically, this classification problem can be learned through supervised learning so that video summaries can be judged automatically. The training data consists of manually annotated key video segments (or key video frames) serving as ground-truth summaries. However, manual annotation is highly subjective, and different annotators do not necessarily produce identical labels. Moreover, a video includes both images and audio; a method that considers only the images may ignore important content in the audio, which affects the accuracy of the generated video summary.

In view of this, the present application provides a video summary generation method applied to an electronic device. An electronic device is a device with data processing capabilities, such as a server or a terminal. A terminal includes, but is not limited to, a smartphone, a tablet computer, a notebook computer, a personal digital assistant (PDA), or a smart wearable device. The server may be a cloud server, for example a central server in a central computing cluster or an edge server in an edge computing cluster. Of course, the server may also be a server in a local data center, that is, a data center directly controlled by the user.

Specifically, the electronic device extracts content features of a target video from the target video and text features of the target video from the auxiliary text of the target video, then determines the similarity between the content features and the text features, determines a target segment from the target video according to the similarity, and generates a video summary of the target video from the target segment.

The auxiliary text of the target video is provided by the uploading user when uploading the video and contains a textual summary of the target video. Text features are therefore extracted from the auxiliary text, and the target segment determined according to the similarity between the text features and the content features is highly similar to the textual summary. It can thus serve as the video summary of the target video, accurately summarizing the main content of the target video and avoiding the strong subjectivity of summaries determined by manual annotation.

To make the technical solution of the present disclosure clearer and easier to understand, the video summary generation method provided by the embodiments of the present disclosure is introduced below, taking an electronic device that is a terminal as an example.

S102: The terminal extracts content features of the target video from the target video.

The content features of the target video include one or more of image features of the target video, image semantic features of the target video, and speech semantic features of the target video. The image features are extracted from the images of the target video, the image semantic features are extracted from the text in the images of the target video, and the speech semantic features are extracted from the speech text of the target video. The following description takes the case where the content features include the image features, the image semantic features, and the speech semantic features of the target video as an example.

As shown in FIG. 2, the target video has a duration T and contains N frames. The terminal performs feature extraction on the images of the target video through an image feature extractor to obtain the image features of the target video.
Figure BDA0003634749910000031

The terminal performs optical character recognition (OCR) on the images of the target video to obtain the image text of the target video, and then performs feature extraction on this image text to obtain the image semantic features of the target video.
Figure BDA0003634749910000032

A video includes not only continuous images but also audio. The terminal therefore performs automatic speech recognition (ASR) on the audio of the target video to obtain the speech text, and then performs feature extraction on the speech text to obtain the speech semantic features of the target video.
Figure BDA0003634749910000041

When the content features include the image features, the image semantic features, and the speech semantic features of the target video, they carry multiple kinds of information, including image information and audio information, and can therefore comprehensively reflect the content of the target video.
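As a non-limiting illustration, the extraction of the three content feature streams described above might be sketched as follows in Python. The encoders here are toy stand-ins (the disclosure does not name concrete feature extractors), and the OCR and ASR steps are assumed to have already produced token ids upstream:

```python
import torch
import torch.nn as nn

# Toy stand-in encoders for illustration only; the disclosure does not
# specify concrete models for the image feature extractor or text encoders.
image_encoder = nn.Sequential(nn.Flatten(), nn.LazyLinear(512))  # per-frame image features
text_encoder = nn.Embedding(30000, 512)                          # token-level text features

def extract_content_features(frames, ocr_token_ids, asr_token_ids):
    """frames: (N, C, H, W) tensor of the N video frames.
    ocr_token_ids / asr_token_ids: 1-D LongTensors of token ids obtained from
    the OCR image text and the ASR speech text (OCR/ASR run upstream)."""
    image_feats = image_encoder(frames)       # (N, d)      image features
    ocr_feats = text_encoder(ocr_token_ids)   # (M_ocr, d)  image semantic features
    asr_feats = text_encoder(asr_token_ids)   # (M_asr, d)  speech semantic features
    return image_feats, ocr_feats, asr_feats

# Toy usage with random inputs
frames = torch.randn(8, 3, 64, 64)
ocr_ids = torch.randint(0, 30000, (20,))
asr_ids = torch.randint(0, 30000, (50,))
img_f, ocr_f, asr_f = extract_content_features(frames, ocr_ids, asr_ids)
```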

S104: The terminal extracts text features of the target video from the auxiliary text of the target video.

The auxiliary text of the target video includes at least one of the title of the target video and the introduction of the target video, and is provided by the uploading user when uploading the target video. For example, when a user uploads a target video, the video upload platform requires the user to enter a video title, which is usually short, for example within 80 characters. In some cases, after the user enters the video title, the platform also requires the uploading user to enter a video introduction, which is usually longer, for example within 200 characters.

The auxiliary text of the target video may include a textual summary of the target video. Moreover, this textual summary is determined by the user who uploaded the target video; compared with the differing interpretations of the target video by different annotators, the textual summary determined by the uploading user is more credible.

The terminal extracts the text features of the target video from the auxiliary text of the target video; the text features form a sequence of length M, where M is the number of words in the auxiliary text.
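A minimal sketch of extracting the text features from the auxiliary text is given below. Using a BERT-style pretrained encoder is an assumption on our part; the disclosure only refers to a language model that maps the auxiliary text to a sequence of word-level features:

```python
import torch
from transformers import AutoTokenizer, AutoModel

# Assumed encoder choice; any language model producing token-level features would do.
tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese")
encoder = AutoModel.from_pretrained("bert-base-chinese")

auxiliary_text = "黄山日出"  # a sample video title entered by the uploading user
inputs = tokenizer(auxiliary_text, return_tensors="pt")
with torch.no_grad():
    outputs = encoder(**inputs)

# One feature vector per token; the sequence length corresponds to M
# (plus the special tokens added by the tokenizer).
text_feats = outputs.last_hidden_state[0]
print(text_feats.shape)
```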

As shown in FIG. 3, taking content features that include image semantic features as an example, the image sequence of the target video and the auxiliary text sequence of the target video are input into a language model. The language model obtains the image text from the image sequence through OCR and extracts the image semantic features of the target video from this image text, and it also extracts the text features of the target video from the auxiliary text sequence.

It should be noted that this solution does not restrict the execution order of S102 and S104. The terminal may first extract the content features of the target video from the target video and then extract the text features from the auxiliary text of the target video; it may first extract the text features from the auxiliary text and then extract the content features from the target video; or it may extract the content features from the target video and the text features from the auxiliary text at the same time.

S106: The terminal determines the similarity between the content features and the text features, and determines a target segment from the target video according to the similarity.

When the content features of the target video include the image features, the image semantic features, and the speech semantic features of the target video, the terminal may separately obtain the similarity between the image features and the text features, the similarity between the image semantic features and the text features, and the similarity between the speech semantic features and the text features.

Specifically, the terminal may obtain these three similarities through three different similarity evaluation models, or it may obtain the similarity between the image features and the text features, the similarity between the image semantic features and the text features, and the similarity between the speech semantic features and the text features simultaneously through a single similarity evaluation model.

The terminal may input the content features and the text features into a similarity evaluation model to obtain the similarity between the content features and the text features. In some possible implementations, an attention mechanism may also be introduced to highlight some of the features; the attention-weighted content features and the query (text) features are then passed through a discriminator to obtain the similarity between the content features and the text features.

The following takes obtaining the similarity between the image semantic features and the text features of the target video through the similarity evaluation model as an example.

As shown in FIG. 3, the text features are used as the Q (query) vectors, and the image semantic features are used as the K (key) and V (value) vectors; both are input into the attention mechanism. The attention mechanism computes the similarity between Q and each K and weights all V accordingly to obtain a weight matrix A. The image semantic features weighted by the attention mechanism are denoted C'.

The similarity between the text features and the image semantic features is then obtained through a discriminator. Specifically, as shown in FIG. 3, the maximum value of each column of the text feature matrix, col_max(Q), and the maximum value of each column of the weighted image semantic feature matrix, col_max(C'), are extracted and concatenated to obtain concat(col_max(Q), col_max(C')). This concatenation is used as the input of the discriminator, which outputs the similarity between the image semantic features and the text features. For example, as shown in FIG. 3, the discriminator may consist of two fully connected layers of sizes 2d*128 and 128*2. The discrimination result, which characterizes the similarity between the image semantic features and the text features, is then obtained through a softmax activation function.
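The attention and discriminator steps described above might be sketched as follows. The single-head scaled dot-product attention, the ReLU between the two fully connected layers, and taking the second softmax output as the similarity score are assumptions; the disclosure itself only fixes the column-wise max pooling, the concatenation, the 2d*128 and 128*2 layer sizes, and the softmax output:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimilarityEvaluator(nn.Module):
    """Illustrative sketch of the attention + discriminator; class and
    dimension names are our own, not taken from the disclosure."""

    def __init__(self, d: int = 256):
        super().__init__()
        self.d = d
        # discriminator: two fully connected layers, 2d -> 128 -> 2
        self.fc1 = nn.Linear(2 * d, 128)
        self.fc2 = nn.Linear(128, 2)

    def forward(self, text_feats: torch.Tensor, content_feats: torch.Tensor) -> torch.Tensor:
        # text_feats:    (M, d)  queries Q from the auxiliary text
        # content_feats: (L, d)  keys/values K, V, e.g. image semantic features
        q, k, v = text_feats, content_feats, content_feats
        attn = torch.softmax(q @ k.t() / self.d ** 0.5, dim=-1)   # (M, L) attention weights (weight matrix A)
        c_prime = attn @ v                                        # (M, d) attention-weighted features C'
        pooled = torch.cat([q.max(dim=0).values,                  # col_max(Q)  -> (d,)
                            c_prime.max(dim=0).values], dim=-1)   # col_max(C') -> (d,)
        logits = self.fc2(F.relu(self.fc1(pooled)))               # (2,)
        return torch.softmax(logits, dim=-1)[1]                   # probability that content matches text

# Toy usage
model = SimilarityEvaluator(d=256)
text_feats = torch.randn(6, 256)        # M = 6 auxiliary-text tokens
image_sem_feats = torch.randn(40, 256)  # L = 40 image-text tokens
print(model(text_feats, image_sem_feats))
```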

The similarity evaluation model in this solution can be trained in a weakly supervised manner. During training, the terminal extracts training content features of a training video from the training video and training text features of the training video from the auxiliary text of the training video, and trains the similarity evaluation model using the training content features, the training text features, and labels. A label indicates whether the training content features and the training text features come from the same training video: when they do, the label is 1; when they do not, the label is 0.

Continuing with the example in which the similarity evaluation model evaluates text features against image semantic features: the terminal performs OCR on the images of the training video to obtain the image text of the training video, and then extracts the image semantic features of the training video from this image text. The terminal also extracts the training text features of the training video from the auxiliary text of the training video. When the image semantic features and the text features of the training video are input into the similarity evaluation model, a loss is computed from the output evaluation result and the corresponding label, and the parameters of the similarity evaluation model are updated through this loss. The loss may be a cross-entropy loss.
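A minimal sketch of this weakly supervised training loop is shown below, assuming pooled content and text feature vectors are already available for each training video. Building negative pairs by shuffling the text features across videos and using a plain two-layer discriminator are simplifications; in the scheme above, the model would be the attention-based similarity evaluator sketched earlier:

```python
import torch
import torch.nn as nn

# Toy stand-ins: pooled content/text feature vectors per training video (d = 256).
# In the scheme above these would come from the feature extractors; here they
# are random tensors purely to make the sketch executable.
d, num_videos = 256, 32
content_feats = torch.randn(num_videos, d)
text_feats = torch.randn(num_videos, d)

# Minimal discriminator over concatenated (content, text) features.
model = nn.Sequential(nn.Linear(2 * d, 128), nn.ReLU(), nn.Linear(128, 2))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
loss_fn = nn.CrossEntropyLoss()

for step in range(100):
    # positive pairs: content and text features from the same training video (label 1)
    pos = torch.cat([content_feats, text_feats], dim=1)
    # negative pairs: text features shuffled so they come from other videos (label 0)
    neg = torch.cat([content_feats, text_feats[torch.randperm(num_videos)]], dim=1)
    inputs = torch.cat([pos, neg], dim=0)
    labels = torch.cat([torch.ones(num_videos, dtype=torch.long),
                        torch.zeros(num_videos, dtype=torch.long)])
    logits = model(inputs)
    loss = loss_fn(logits, labels)   # cross-entropy loss, as stated above
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```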

In this solution, because the video content of a training video (including its images and speech) corresponds to the auxiliary text of that training video, the similarity correspondence between the video content and the auxiliary text can be learned through weak supervision. The trained similarity evaluation model can then determine, from multiple video segments, the video segment with the highest similarity to the auxiliary text.

The correspondence between video content and auxiliary text in training videos can therefore be fully exploited without manual annotation, which avoids the subjectivity of manual annotation and its cost in manpower and resources, and allows the similarity between a video segment and the auxiliary text to be determined accurately and efficiently.

After the terminal obtains the similarity between the content features and the text features, it can determine the target segment from the target video according to the similarity.

In some possible implementations, the terminal may use the similarity evaluation model to separately determine the similarity between the image features and the text features, the similarity between the image semantic features and the text features, and the similarity between the speech semantic features and the text features of the target video, and then determine the segments with the highest of these separately determined similarities as the target segments. Alternatively, the terminal may weight the similarity between the image features and the text features, the similarity between the image semantic features and the text features, and the similarity between the speech semantic features and the text features to obtain the similarity between the content features and the text features of the target video, and then determine the target segment from the target video according to this similarity.
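As a simple illustration of the weighting just described, the three similarities of one video segment could be combined as follows; the weight values are purely hypothetical, since the disclosure does not specify them:

```python
# Hypothetical weights; the disclosure only states that the three similarities may be weighted.
W_IMAGE, W_OCR, W_ASR = 0.4, 0.3, 0.3

def fuse_similarity(sim_image: float, sim_ocr: float, sim_asr: float) -> float:
    """Weighted combination of the image-feature, image-semantic-feature and
    speech-semantic-feature similarities of one video segment."""
    return W_IMAGE * sim_image + W_OCR * sim_ocr + W_ASR * sim_asr

print(fuse_similarity(0.8, 0.5, 0.6))  # approximately 0.65
```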

In other possible implementations, the terminal may use the similarity evaluation model to obtain the similarity between the text features and content features that include the image features, the image semantic features, and the speech semantic features, and then determine the target segment from the target video according to this similarity.

As shown in FIG. 4, in some possible implementations, the terminal may perform automatic speech recognition on the audio of the target video to obtain the speech text of the target video and split the target video into n video segments according to the speech text. The terminal may also perform optical character recognition on the images of the target video to obtain the image text of the target video and split the target video into n video segments according to the image text. The terminal then obtains the similarity of each of the n video segments through the similarity evaluation model; for example, the similarity of video segment 1 is 0.0019, that of video segment 2 is 0.2012, that of video segment 3 is 0.1520, that of video segment 4 is 0.0007, ..., and that of video segment n is 0.1046. The target segment can then be determined by ranking the segments by similarity; for example, the ranking is video segment 2, video segment 3, video segment m, and so on.
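Using the example similarities from FIG. 4, the ranking step amounts to sorting the segments by their similarity scores, for example:

```python
# Per-segment similarities taken from the FIG. 4 example (segment index -> similarity).
segment_scores = {1: 0.0019, 2: 0.2012, 3: 0.1520, 4: 0.0007}

# Rank the segments by similarity, highest first.
ranked = sorted(segment_scores, key=segment_scores.get, reverse=True)
print(ranked)  # [2, 3, 1, 4]
```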

S108: The terminal generates the video summary of the target video according to the target segment.

In some possible implementations, the target segment may be a single video segment. For example, the terminal separately obtains the video segment whose image features are most similar to the text features, the video segment whose image semantic features are most similar to the text features, and the video segment whose speech semantic features are most similar to the text features; if the three segments obtained are the same video segment, the video summary of the target video can be generated from that segment.

In some cases, the video summary of the target video is subject to a duration limit. The video summary can then be generated according to the duration and similarity of each video segment together with the allowed duration of the video summary.
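A greedy selection under the summary duration limit might look like the sketch below; the greedy rule itself is an assumption, since the disclosure only states that the segment durations, their similarities, and the allowed summary duration are taken into account:

```python
def select_summary_segments(segments, max_duration):
    """Greedy sketch: take the most similar segments until the summary
    duration budget is exhausted. `segments` is a list of
    (segment_id, similarity, duration_seconds) tuples."""
    chosen, total = [], 0.0
    for seg_id, sim, dur in sorted(segments, key=lambda s: s[1], reverse=True):
        if total + dur <= max_duration:
            chosen.append(seg_id)
            total += dur
    # keep the selected segments in their original temporal order
    return sorted(chosen)

# Toy usage: 120-second summary budget, 120-second segments
segments = [(1, 0.0019, 120), (2, 0.2012, 120), (3, 0.1520, 120), (4, 0.0007, 120)]
print(select_summary_segments(segments, max_duration=120))  # [2]
```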

Based on the above description, the present disclosure provides a video summary generation method. Specifically, the terminal extracts content features of the target video from the target video and text features of the target video from the auxiliary text of the target video, then determines the similarity between the content features and the text features, determines a target segment from the target video according to the similarity, and generates a video summary of the target video from the target segment.

The auxiliary text of the target video is provided by the user who uploaded the target video and therefore contains a textual summary of the target video. A video segment with high similarity to this textual summary can thus be selected from the target video, based on the similarity between the content features and the text features, and used as the video summary of the target video. Because the video summary is highly similar to the auxiliary text provided by the uploading user, the subjectivity of manual annotation is avoided, and the summary is obtained automatically by computing the similarity between the content features and the text features, which improves the efficiency of generating the video summary of the target video.

To make the technical solution of the present application clearer and easier to understand, an embodiment of the present application further provides a specific scenario to illustrate the video summary generation method.

The target video is a one-hour video, shot by the uploading user, of getting up early to watch the sunrise, and the title entered by the uploading user is "黄山日出" (Sunrise at Huangshan). To help other users quickly grasp the main content of the video, a two-minute video summary needs to be generated from the target video. Specifically, the terminal may divide the target video into 30 video segments at two-minute intervals. The terminal extracts features from the images of the target video to obtain the image features of each of the 30 segments. The terminal performs optical character recognition on the images of the target video to obtain the image text and then extracts the image semantic features of each of the 30 segments from this image text; for example, the images of the target video may include landmarks and signs that the user passed on the way up the mountain. The terminal performs automatic speech recognition on the audio of the target video to obtain the speech text and then extracts the speech semantic features of each of the 30 segments from this speech text; for example, the speech semantic features may include the user's admiration of the sunrise. The terminal also extracts the text features of the target video from its auxiliary text, that is, from "黄山日出". The similarities of the image features, image semantic features, and speech semantic features of the 30 video segments with the text features are then compared, the video segment with the highest weighted similarity is determined as the target segment, and the video summary of the target video is generated from the target segment.

FIG. 5 is a schematic diagram of a video summary generation apparatus according to an exemplary disclosed embodiment. As shown in FIG. 5, the video summary generation apparatus 500 includes:

an extraction module 502, configured to extract content features of a target video from the target video, and to extract text features of the target video from auxiliary text of the target video;

a determination module 504, configured to determine a similarity between the content features and the text features, and to determine a target segment from the target video according to the similarity;

a generation module 506, configured to generate a video summary of the target video according to the target segment.

Optionally, the content features include image features, image semantic features, and speech semantic features, the image features including features extracted from images of the target video, the image semantic features including features extracted from text in the images of the target video, and the speech semantic features including features extracted from the audio of the target video.

Optionally, the content features include image features;

the extraction module 502 may be configured to:

extract the image features of the target video from the images of the target video.

Optionally, the content features include image semantic features;

the extraction module 502 may be configured to:

perform optical character recognition on the images of the target video to obtain the image text of the target video; and

extract the image semantic features of the target video from the image text.

Optionally, the content features include speech semantic features;

the extraction module 502 may be configured to:

perform automatic speech recognition on the audio of the target video to obtain the speech text of the target video; and

extract the speech semantic features of the target video from the speech text.

Optionally, the determination module 504 may be configured to:

input the content features and the text features into a similarity evaluation model, and determine the similarity between the content features and the text features through the similarity evaluation model.

Optionally, the similarity evaluation model is obtained through training by a training module, the training module being specifically configured to:

extract training content features of a training video from the training video, and extract training text features of the training video from auxiliary text of the training video; and

train the similarity evaluation model using the training content features, the training text features, and labels, where a label indicates whether the training content features and the training text features come from the same training video.

Optionally, the auxiliary text of the target video includes the title of the target video and the introduction of the target video.

The functions of the above modules have been described in detail in the method steps of the previous embodiment and are not repeated here.

Referring now to FIG. 6, a schematic structural diagram of an electronic device 600 suitable for implementing an embodiment of the present disclosure is shown. Terminal devices in the embodiments of the present disclosure may include, but are not limited to, mobile terminals such as mobile phones, notebook computers, digital broadcast receivers, PDAs (personal digital assistants), PADs (tablet computers), PMPs (portable multimedia players), and vehicle-mounted terminals (for example, in-vehicle navigation terminals), as well as fixed terminals such as digital TVs and desktop computers. The electronic device shown in FIG. 6 is only an example and should not impose any limitation on the functions and scope of use of the embodiments of the present disclosure.

As shown in FIG. 6, the electronic device 600 may include a processing apparatus (for example, a central processing unit or a graphics processor) 601, which may perform various appropriate actions and processes according to a program stored in a read-only memory (ROM) 602 or a program loaded from a storage apparatus 608 into a random access memory (RAM) 603. The RAM 603 also stores various programs and data required for the operation of the electronic device 600. The processing apparatus 601, the ROM 602, and the RAM 603 are connected to one another through a bus 604. An input/output (I/O) interface 605 is also connected to the bus 604.

Typically, the following apparatuses may be connected to the I/O interface 605: input apparatuses 606 including, for example, a touch screen, a touchpad, a keyboard, a mouse, a camera, a microphone, an accelerometer, and a gyroscope; output apparatuses 607 including, for example, a liquid crystal display (LCD), a speaker, and a vibrator; storage apparatuses 608 including, for example, a magnetic tape and a hard disk; and a communication apparatus 609. The communication apparatus 609 may allow the electronic device 600 to communicate wirelessly or by wire with other devices to exchange data. Although FIG. 6 shows the electronic device 600 with various apparatuses, it should be understood that not all of the illustrated apparatuses are required to be implemented or provided; more or fewer apparatuses may alternatively be implemented or provided.

In particular, according to embodiments of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program carried on a non-transitory computer-readable medium, the computer program containing program code for performing the method illustrated in the flowchart. In such an embodiment, the computer program may be downloaded and installed from a network through the communication apparatus 609, installed from the storage apparatus 608, or installed from the ROM 602. When the computer program is executed by the processing apparatus 601, the above-mentioned functions defined in the methods of the embodiments of the present disclosure are performed.

It should be noted that the computer-readable medium described above in the present disclosure may be a computer-readable signal medium or a computer-readable storage medium, or any combination of the two. The computer-readable storage medium may be, for example, but is not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the above. More specific examples of computer-readable storage media may include, but are not limited to: an electrical connection with one or more wires, a portable computer disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above. In the present disclosure, a computer-readable storage medium may be any tangible medium that contains or stores a program that can be used by or in combination with an instruction execution system, apparatus, or device. A computer-readable signal medium, in contrast, may include a data signal propagated in baseband or as part of a carrier wave, carrying computer-readable program code. Such a propagated data signal may take a variety of forms, including but not limited to an electromagnetic signal, an optical signal, or any suitable combination of the above. A computer-readable signal medium may also be any computer-readable medium other than a computer-readable storage medium; it can send, propagate, or transmit a program for use by or in combination with an instruction execution system, apparatus, or device. The program code contained on the computer-readable medium may be transmitted using any suitable medium, including but not limited to an electric wire, an optical cable, RF (radio frequency), or any suitable combination of the above.

In some embodiments, the client and the server may communicate using any currently known or future-developed network protocol such as HTTP (HyperText Transfer Protocol), and may be interconnected by digital data communication in any form or medium (for example, a communication network). Examples of communication networks include a local area network ("LAN"), a wide area network ("WAN"), an internetwork (for example, the Internet), and a peer-to-peer network (for example, an ad hoc peer-to-peer network), as well as any currently known or future-developed network.

The computer-readable medium described above may be included in the electronic device described above, or it may exist separately without being assembled into the electronic device.

The computer-readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to: extract content features of a target video from the target video, and extract text features of the target video from auxiliary text of the target video; determine a similarity between the content features and the text features, and determine a target segment from the target video according to the similarity; and generate a video summary of the target video according to the target segment. Computer program code for performing the operations of the present disclosure may be written in one or more programming languages or a combination thereof, including but not limited to object-oriented programming languages such as Java, Smalltalk, and C++, as well as conventional procedural programming languages such as the "C" language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. Where a remote computer is involved, the remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider).

The flowcharts and block diagrams in the accompanying drawings illustrate possible architectures, functions, and operations of systems, methods, and computer program products according to various embodiments of the present disclosure. In this regard, each block in a flowchart or block diagram may represent a module, a program segment, or a portion of code that contains one or more executable instructions for implementing the specified logical functions. It should also be noted that, in some alternative implementations, the functions noted in the blocks may occur in an order different from that noted in the figures. For example, two blocks shown in succession may in fact be executed substantially in parallel, or they may sometimes be executed in the reverse order, depending on the functions involved. It should also be noted that each block of the block diagrams and/or flowcharts, and combinations of blocks in the block diagrams and/or flowcharts, may be implemented by a dedicated hardware-based system that performs the specified functions or operations, or by a combination of dedicated hardware and computer instructions.

The modules involved in the embodiments of the present disclosure may be implemented in software or in hardware. The name of a module does not, in some cases, constitute a limitation of the module itself.

The functions described herein above may be performed at least in part by one or more hardware logic components. For example, without limitation, exemplary types of hardware logic components that may be used include field programmable gate arrays (FPGAs), application-specific integrated circuits (ASICs), application-specific standard products (ASSPs), systems on chip (SOCs), and complex programmable logic devices (CPLDs).

In the context of the present disclosure, a machine-readable medium may be a tangible medium that may contain or store a program for use by or in combination with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium, and may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the above. More specific examples of machine-readable storage media include an electrical connection based on one or more wires, a portable computer disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above.

According to one or more embodiments of the present disclosure, Example 1 provides a video summary generation method, the method comprising: extracting content features of a target video from the target video, and extracting text features of the target video from auxiliary text of the target video; determining a similarity between the content features and the text features, and determining a target segment from the target video according to the similarity; and generating a video summary of the target video according to the target segment.

According to one or more embodiments of the present disclosure, Example 2 provides the method of Example 1, wherein the content features include image features, image semantic features, and speech semantic features, the image features including features extracted from images of the target video, the image semantic features including features extracted from text in the images of the target video, and the speech semantic features including features extracted from the audio of the target video.

According to one or more embodiments of the present disclosure, Example 3 provides the method of Example 1, wherein the content features include image features, and extracting the content features of the target video from the target video includes: extracting the image features of the target video from the images of the target video.

According to one or more embodiments of the present disclosure, Example 4 provides the method of Example 1, wherein the content features include image semantic features, and extracting the content features of the target video from the target video includes: performing optical character recognition on the images of the target video to obtain the image text of the target video; and extracting the image semantic features of the target video from the image text.

According to one or more embodiments of the present disclosure, Example 5 provides the method of Example 1, wherein the content features include speech semantic features, and extracting the content features of the target video from the target video includes: performing automatic speech recognition on the audio of the target video to obtain the speech text of the target video; and extracting the speech semantic features of the target video from the speech text.

According to one or more embodiments of the present disclosure, Example 6 provides the method of Example 1, wherein determining the similarity between the content features and the text features includes: inputting the content features and the text features into a similarity evaluation model, and determining the similarity between the content features and the text features through the similarity evaluation model.
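The internal structure of the similarity evaluation model is not fixed by this example; a minimal two-tower sketch in PyTorch, with all dimensions chosen purely for illustration, might look like this:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimilarityModel(nn.Module):
    """Projects content features and text features into a shared space and
    scores a pair by the cosine similarity of the projections.
    All dimensions are illustrative assumptions, not values from the disclosure."""

    def __init__(self, content_dim: int = 2048, text_dim: int = 768, shared_dim: int = 256):
        super().__init__()
        self.content_proj = nn.Sequential(
            nn.Linear(content_dim, shared_dim), nn.ReLU(), nn.Linear(shared_dim, shared_dim))
        self.text_proj = nn.Sequential(
            nn.Linear(text_dim, shared_dim), nn.ReLU(), nn.Linear(shared_dim, shared_dim))

    def forward(self, content_feat: torch.Tensor, text_feat: torch.Tensor) -> torch.Tensor:
        c = F.normalize(self.content_proj(content_feat), dim=-1)
        t = F.normalize(self.text_proj(text_feat), dim=-1)
        return (c * t).sum(dim=-1)  # cosine similarity in [-1, 1]

# scores = SimilarityModel()(torch.randn(4, 2048), torch.randn(4, 768))
```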

According to one or more embodiments of the present disclosure, Example 7 provides the method of Example 6, wherein the similarity evaluation model is obtained by training in the following manner: extracting training content features of a training video from the training video, and extracting training text features of the training video from auxiliary text of the training video; and performing model training on the similarity evaluation model using the training content features, the training text features, and a label, where the label indicates whether the training content features and the training text features come from the same training video.
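Continuing the sketch above, one plausible (assumed) objective that matches this description is binary classification of matched versus mismatched video/text pairs: the cosine score from the SimilarityModel sketched earlier is rescaled to [0, 1] and trained with binary cross-entropy against the same-video label. The disclosure only states that such labels are used; the specific loss is an assumption.

```python
import torch
import torch.nn.functional as F

def train_step(model, optimizer, content_feats, text_feats, labels):
    """labels[i] is 1.0 when content_feats[i] and text_feats[i] come from the
    same training video and 0.0 otherwise (a mismatched, i.e. negative, pair)."""
    optimizer.zero_grad()
    scores = (model(content_feats, text_feats) + 1.0) / 2.0        # map [-1, 1] to [0, 1]
    loss = F.binary_cross_entropy(scores.clamp(1e-6, 1 - 1e-6), labels)
    loss.backward()
    optimizer.step()
    return loss.item()

# Toy usage with the SimilarityModel sketched above and random stand-in features.
model = SimilarityModel()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
content = torch.randn(2, 2048)     # pooled content features of two videos
text = torch.randn(2, 768)         # text features of two auxiliary texts
labels = torch.tensor([1.0, 0.0])  # one matched pair, one mismatched pair
print(train_step(model, optimizer, content, text, labels))
```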

According to one or more embodiments of the present disclosure, Example 8 provides the method of any one of Examples 1 to 7, wherein the auxiliary text of the target video includes a title of the target video and an introduction of the target video.
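For concreteness only, the auxiliary text could be assembled by simply concatenating the two fields before encoding; the separator and ordering below are assumptions, not part of the disclosure.

```python
def auxiliary_text(title: str, introduction: str) -> str:
    # Join title and introduction into a single string for the text encoder.
    return f"{title} {introduction}".strip()

print(auxiliary_text("How to make dumplings", "A step-by-step home recipe video."))
```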

According to one or more embodiments of the present disclosure, Example 9 provides a video summary generation apparatus, the apparatus comprising: an extraction module configured to extract content features of a target video from the target video, and to extract text features of the target video from auxiliary text of the target video; a determination module configured to determine a similarity between the content features and the text features, and to determine a target segment from the target video according to the similarity; and a generation module configured to generate a video summary of the target video according to the target segment.

According to one or more embodiments of the present disclosure, Example 10 provides the apparatus of Example 9, wherein the content features include image features, image semantic features, and speech semantic features; the image features include features extracted from images of the target video, the image semantic features include features extracted from text in the images of the target video, and the speech semantic features include features extracted from the audio of the target video.

According to one or more embodiments of the present disclosure, Example 11 provides the apparatus of Example 9, wherein the content features include image features, and the extraction module may be configured to: extract the image features of the target video from images of the target video.

According to one or more embodiments of the present disclosure, Example 12 provides the apparatus of Example 9, wherein the content features include image semantic features, and the extraction module may be configured to: perform optical character recognition on images of the target video to obtain image text of the target video; and extract the image semantic features of the target video from the image text.

According to one or more embodiments of the present disclosure, Example 13 provides the apparatus of Example 9, wherein the content features include speech semantic features, and the extraction module may be configured to: perform automatic speech recognition on the audio of the target video to obtain speech text of the target video; and extract the speech semantic features of the target video from the speech text.

According to one or more embodiments of the present disclosure, Example 14 provides the apparatus of Example 9, wherein the determination module may be configured to: input the content features and the text features into a similarity evaluation model, and determine the similarity between the content features and the text features through the similarity evaluation model.

According to one or more embodiments of the present disclosure, Example 15 provides the apparatus of Example 14, wherein the similarity evaluation model is obtained through training by a training module, the training module being specifically configured to: extract training content features of a training video from the training video, and extract training text features of the training video from auxiliary text of the training video; and perform model training on the similarity evaluation model using the training content features, the training text features, and a label, where the label indicates whether the training content features and the training text features come from the same training video.

According to one or more embodiments of the present disclosure, Example 16 provides the apparatus of any one of Examples 9 to 15, wherein the auxiliary text of the target video includes a title of the target video and an introduction of the target video.

The above description is merely a description of preferred embodiments of the present disclosure and of the technical principles employed. Those skilled in the art should understand that the scope of disclosure involved herein is not limited to technical solutions formed by the specific combination of the above technical features, and should also cover other technical solutions formed by any combination of the above technical features or their equivalent features without departing from the disclosed concept, for example, technical solutions formed by replacing the above features with technical features having similar functions disclosed in (but not limited to) the present disclosure.

Furthermore, although the operations are depicted in a particular order, this should not be understood as requiring that the operations be performed in the particular order shown or in sequential order. In certain circumstances, multitasking and parallel processing may be advantageous. Likewise, although several specific implementation details are included in the above discussion, these should not be construed as limiting the scope of the present disclosure. Certain features that are described in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable subcombination.

Although the subject matter has been described in language specific to structural features and/or logical acts of methods, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are merely example forms of implementing the claims. With regard to the apparatus in the above embodiments, the specific manner in which each module performs its operations has been described in detail in the embodiments of the method, and will not be elaborated here.

Claims (12)

Translated from Chinese

1. A video summary generation method, characterized in that the method comprises:
extracting content features of a target video from the target video, and extracting text features of the target video from auxiliary text of the target video;
determining a similarity between the content features and the text features, and determining a target segment from the target video according to the similarity;
generating a video summary of the target video according to the target segment.

2. The method according to claim 1, characterized in that the content features include image features, image semantic features, and speech semantic features, the image features including features extracted from images of the target video, the image semantic features including features extracted from text in the images of the target video, and the speech semantic features including features extracted from the audio of the target video.

3. The method according to claim 1, characterized in that the content features include image features;
the extracting content features of the target video from the target video includes:
extracting the image features of the target video from images of the target video.

4. The method according to claim 1, characterized in that the content features include image semantic features;
the extracting content features of the target video from the target video includes:
performing optical character recognition on images of the target video to obtain image text of the target video;
extracting the image semantic features of the target video from the image text.

5. The method according to claim 1, characterized in that the content features include speech semantic features;
the extracting content features of the target video from the target video includes:
performing automatic speech recognition on the audio of the target video to obtain speech text of the target video;
extracting the speech semantic features of the target video from the speech text.

6. The method according to claim 1, characterized in that the determining a similarity between the content features and the text features includes:
inputting the content features and the text features into a similarity evaluation model, and determining the similarity between the content features and the text features through the similarity evaluation model.

7. The method according to claim 6, characterized in that the similarity evaluation model is obtained by training in the following manner:
extracting training content features of a training video from the training video, and extracting training text features of the training video from auxiliary text of the training video;
performing model training on the similarity evaluation model using the training content features, the training text features, and a label, wherein the label indicates whether the training content features and the training text features come from the same training video.

8. The method according to any one of claims 1 to 7, characterized in that the auxiliary text of the target video includes a title of the target video and an introduction of the target video.

9. A video summary generation apparatus, characterized in that the apparatus comprises:
an extraction module, configured to extract content features of a target video from the target video, and to extract text features of the target video from auxiliary text of the target video;
a determination module, configured to determine a similarity between the content features and the text features, and to determine a target segment from the target video according to the similarity;
a generation module, configured to generate a video summary of the target video according to the target segment.

10. A device, characterized in that the device comprises a processor and a memory;
the processor is configured to execute instructions stored in the memory, so that the device performs the method according to any one of claims 1 to 8.

11. A computer-readable storage medium, characterized in that it comprises instructions which instruct a device to perform the method according to any one of claims 1 to 8.

12. A computer program product, characterized in that, when the computer program product runs on a computer, it causes the computer to perform the method according to any one of claims 1 to 8.
CN202210499374.7A (priority date 2022-05-09, filing date 2022-05-09) A method, apparatus, device and medium for generating a video summary. Status: Pending. Publication: CN114780792A (en)

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN202210499374.7A | 2022-05-09 | 2022-05-09 | A method, apparatus, device and medium for generating a video summary (CN114780792A)

Applications Claiming Priority (1)

Application Number | Priority Date | Filing Date | Title
CN202210499374.7A | 2022-05-09 | 2022-05-09 | A method, apparatus, device and medium for generating a video summary (CN114780792A)

Publications (1)

Publication Number | Publication Date
CN114780792A | 2022-07-22

Family

ID=82436724

Family Applications (1)

Application Number | Title | Priority Date | Filing Date
CN202210499374.7A (status: Pending) | A method, apparatus, device and medium for generating a video summary (CN114780792A) | 2022-05-09 | 2022-05-09

Country Status (1)

Country | Link
CN (1) | CN114780792A (en)

Cited By (7)

* Cited by examiner, † Cited by third party

Publication number | Priority date | Publication date | Assignee | Title
CN115526318A (en)* | 2022-10-08 | 2022-12-27 | 北京达佳互联信息技术有限公司 | Knowledge extraction method, device, electronic device and storage medium
CN115526318B (en)* | 2022-10-08 | 2025-10-03 | 北京达佳互联信息技术有限公司 | Knowledge extraction method, device, electronic device and storage medium
CN115942005A (en)* | 2022-12-23 | 2023-04-07 | 北京有竹居网络技术有限公司 | Method, device, equipment and storage medium for generating commentary video
CN116912734A (en)* | 2023-06-30 | 2023-10-20 | 北京有竹居网络技术有限公司 | Video summary data set construction method, device, medium and electronic equipment
CN116778392A (en)* | 2023-07-06 | 2023-09-19 | 城云科技(中国)有限公司 | Training method and application of multi-mode-fused video abstract extraction model
WO2025045098A1 (en)* | 2023-08-31 | 2025-03-06 | 北京字跳网络技术有限公司 | Video generation method and apparatus, device, and storage medium
CN118172837A (en)* | 2024-05-13 | 2024-06-11 | 杭州海康威视数字技术股份有限公司 | Abnormal behavior recognition method and device

Similar Documents

Publication | Title
CN114780792A (en) A method, apparatus, device and medium for generating a video summary
WO2023083142A1 (en)Sentence segmentation method and apparatus, storage medium, and electronic device
JP7113000B2 (en) Method and apparatus for generating images
CN114494709B (en) Method for generating feature extraction model, method and device for extracting image features
CN114697760B (en)Processing method, processing device, electronic equipment and medium
CN114022668B (en) A method, device, equipment and medium for text-aligned speech
CN111897950A (en)Method and apparatus for generating information
WO2022037419A1 (en)Audio content recognition method and apparatus, and device and computer-readable medium
CN115967833A (en)Video generation method, device and equipment meter storage medium
CN114445754A (en) Video processing method, apparatus, readable medium and electronic device
WO2023016349A1 (en)Text input method and apparatus, and electronic device and storage medium
CN112214695A (en)Information processing method and device and electronic equipment
CN112949430A (en)Video processing method and device, storage medium and electronic equipment
CN111883139A (en) Method, apparatus, device and medium for screening target speech
CN113552984A (en) Text extraction method, device, equipment and medium
CN115052188B (en)Video editing method, device, equipment and medium
CN114117127B (en) Video generation method, device, readable medium and electronic device
CN111767259A (en) Method, apparatus, readable medium and electronic device for content sharing
CN111460214A (en)Classification model training method, audio classification method, device, medium and equipment
WO2023130925A1 (en)Font recognition method and apparatus, readable medium, and electronic device
CN114239501B (en) Contract generation method, device, equipment and medium
CN112487937B (en)Video identification method and device, storage medium and electronic equipment
WO2023000782A1 (en)Method and apparatus for acquiring video hotspot, readable medium, and electronic device
CN112163433B (en) Method, device, electronic device and storage medium for matching key words
CN109947526A (en) Method and apparatus for outputting information

Legal Events

Code | Title
PB01 | Publication
SE01 | Entry into force of request for substantive examination
RJ01 | Rejection of invention patent application after publication

Application publication date: 2022-07-22

