Cross-Reference to Related Applications

This patent application is a continuation-in-part of, and claims priority to, co-pending and commonly-owned U.S. Patent Application No. 17/393,373, filed on August 3, 2021, entitled "AUTOMATICALLY AND PRECISELY GENERATING HIGHLIGHT VIDEOS WITH ARTIFICIAL INTELLIGENCE," and listing Zhiyu Cheng, Le Kang, Xin Zhou, Hao Tian, and Xing Li as inventors (Docket No. 28888-2450 (BN201118USN1)), which claims priority to co-pending and commonly-owned U.S. Patent Application No. 63/124,832, filed on December 13, 2020, entitled "AUTOMATICALLY AND PRECISELY GENERATING HIGHLIGHT VIDEOS WITH ARTIFICIAL INTELLIGENCE," and listing Zhiyu Cheng, Le Kang, Xin Zhou, Hao Tian, and Xing Li as inventors (Docket No. 28888-2450P (BN201118USN1-Provisional)); each of the aforementioned patent documents is incorporated by reference herein in its entirety and for all purposes.
Technical Field

The present disclosure relates generally to systems and methods for computer learning that can provide improved computer performance, features, and uses. More particularly, the present disclosure relates to systems and methods for automatically generating summaries or highlights of content.
Background

With the rapid development of Internet technology and emerging tools, online-generated video content, such as videos of sports or other events, is growing at an unprecedented speed. Especially during the COVID-19 pandemic, online video viewing surged because fans were not allowed to attend events in person (e.g., at stadiums or theaters). Creating highlight videos or other event-related videos typically involves manual work editing the original, untrimmed videos. For example, the most popular sports videos usually comprise short clips of a few seconds, yet accurately understanding a video and spotting the key events in it is very challenging for machines. Combined with the vast amount of raw content that exists, distilling raw content into suitable highlight videos is very time-consuming and costly. Moreover, given the limited time viewers have to watch content, it is important for them to be able to obtain condensed content that appropriately captures the salient elements or events.

Accordingly, what is needed are systems and methods that can automatically and precisely generate distilled or condensed video content, such as highlight videos.
Summary

One aspect of the present disclosure provides a computer-implemented method comprising: given input text that mentions an event in an activity, parsing the input text using a text parsing module to identify the event mentioned in the input text, and converting the input text into TTS-generated audio using a text-to-speech (TTS) module; given an input video of at least a portion of the activity and of the identified event: performing time anchoring to correlate the running time of the input video with the running time of the activity; identifying an approximate time at which the event occurred during the activity, by using time information parsed from the input text, from an additional source related to the activity, or from both, together with the correlated times obtained via time anchoring, and generating, from the input video, an initial clip that includes the event; extracting features from the initial video clip; obtaining a final time value for the event in the initial video clip using the extracted features and a trained neural network model; responsive to the running time of the initial video clip not matching the running time of the TTS-generated audio, generating a final video clip by editing the initial video clip to have a running time that matches the running time of the TTS-generated audio; and responsive to the running time of the initial video clip matching the running time of the TTS-generated audio, using the initial video clip as the final video clip; and combining the TTS-generated audio with the final video clip to generate an event highlight video.

Another aspect of the present disclosure provides a system comprising: one or more processors; and a non-transitory computer-readable medium comprising one or more sets of instructions which, when executed by at least one of the one or more processors, cause steps to be performed comprising: given input text that mentions an event in an activity, parsing the input text using a text parsing module to identify the event mentioned in the input text, and converting the input text into TTS-generated audio using a text-to-speech (TTS) module; given an input video of at least a portion of the activity and of the identified event: performing time anchoring to correlate the running time of the input video with the running time of the activity; identifying an approximate time at which the event occurred during the activity, by using time information parsed from the input text, from an additional source related to the activity, or from both, together with the correlated times obtained via time anchoring, and generating, from the input video, an initial clip that includes the event; extracting features from the initial video clip; obtaining a final time value for the event in the initial video clip using the extracted features and a trained neural network model; responsive to the running time of the initial video clip not matching the running time of the TTS-generated audio, generating a final video clip by editing the initial video clip to have a running time that matches the running time of the TTS-generated audio; and responsive to the running time of the initial video clip matching the running time of the TTS-generated audio, using the initial video clip as the final video clip; and combining the TTS-generated audio with the final video clip to generate an event highlight video.

Yet another aspect of the present disclosure provides a non-transitory computer-readable medium comprising one or more sequences of instructions which, when executed by at least one processor, cause steps to be performed comprising: given input text that mentions an event in an activity, parsing the input text using a text parsing module to identify the event mentioned in the input text, and converting the input text into TTS-generated audio using a text-to-speech (TTS) module; given an input video of at least a portion of the activity and of the identified event: performing time anchoring to correlate the running time of the input video with the running time of the activity; identifying an approximate time at which the event occurred during the activity, by using time information parsed from the input text, from an additional source related to the activity, or from both, together with the correlated times obtained via time anchoring, and generating, from the input video, an initial clip that includes the event; extracting features from the initial video clip; obtaining a final time value for the event in the initial video clip using the extracted features and a trained neural network model; responsive to the running time of the initial video clip not matching the running time of the TTS-generated audio, generating a final video clip by editing the initial video clip to have a running time that matches the running time of the TTS-generated audio; and responsive to the running time of the initial video clip matching the running time of the TTS-generated audio, using the initial video clip as the final video clip; and combining the TTS-generated audio with the final video clip to generate an event highlight video.
Brief Description of the Drawings

Reference will be made to embodiments of the present disclosure, examples of which may be illustrated in the accompanying figures. These figures are intended to be illustrative, not limiting. Although the disclosure is generally described in the context of these embodiments, it should be understood that this is not intended to limit the scope of the disclosure to these particular embodiments. Items in the figures may not be drawn to scale.
FIG. 1 depicts an overview of a highlight generation system, according to an embodiment of the present disclosure;

FIG. 2 depicts an overview method for training a generation system, according to an embodiment of the present disclosure;

FIG. 3 depicts a general overview of a dataset generation process, according to an embodiment of the present disclosure;

FIG. 4 summarizes comments and tags of some of the cloud-sourced text data, according to an embodiment of the present disclosure;

FIG. 5 summarizes the collected untrimmed game videos, according to an embodiment of the present disclosure;

FIG. 6 shows an embodiment of a user interface designed for humans to annotate the event time of a video, according to an embodiment of the present disclosure;

FIG. 7 shows a method for correlating event time and video running time, according to an embodiment of the present disclosure;

FIG. 8 shows an example of recognizing timer digits in a game video, according to an embodiment of the present disclosure;

FIG. 9 depicts a method for generating clips from an input video, according to an embodiment of the present disclosure;

FIG. 10 illustrates feature extraction, according to an embodiment of the present disclosure;

FIG. 11 shows a pipeline for extracting features, according to an embodiment of the present disclosure;

FIG. 12 shows a neural network model that may be used to extract features, according to an embodiment of the present disclosure;

FIG. 13 depicts feature extraction using a SlowFast neural network model, according to an embodiment of the present disclosure;

FIG. 14 depicts a method for audio feature extraction and event-of-interest time prediction, according to an embodiment of the present disclosure;

FIG. 15A shows an example of a raw audio waveform, and FIG. 15B shows its corresponding mean absolute value feature, according to embodiments of the present disclosure;

FIG. 16 shows a method for predicting the time of an event of interest in a video, according to an embodiment of the present disclosure;

FIG. 17 shows a pipeline for temporal localization, according to an embodiment of the present disclosure;

FIG. 18 depicts a method for predicting the likelihood of an event of interest in a video clip, according to an embodiment of the present disclosure;

FIG. 19 shows a pipeline for action localization prediction, according to an embodiment of the present disclosure;

FIG. 20 depicts a method for predicting the likelihood of an event of interest in a video clip, according to an embodiment of the present disclosure;

FIG. 21 shows a pipeline for final time prediction using an ensemble neural network model, according to an embodiment of the present disclosure;

FIG. 22 depicts goal localization results compared with another method, according to an embodiment of the present disclosure;

FIG. 23 shows goal localization results for 3 clips, in which ensemble learning achieved the best results, according to an embodiment of the present disclosure;

FIGS. 24A and 24B depict a system for generating a summary or highlight video from a video input and a text summary input, according to an embodiment of the present disclosure;

FIG. 25 depicts a method for extracting information from input text, according to an embodiment of the present disclosure;

FIG. 26 shows a method for generating a player database, according to an embodiment of the present disclosure;

FIG. 27 depicts a method for combining video segments and corresponding audio segments to create a summary video, according to an embodiment of the present disclosure; and

FIG. 28 depicts a simplified block diagram of a computing device/information handling system, according to an embodiment of the present disclosure.
Detailed Description

In the following description, for purposes of explanation, specific details are set forth in order to provide an understanding of the present disclosure. It will be apparent, however, to one skilled in the art that the disclosure can be practiced without these details. Furthermore, one skilled in the art will recognize that embodiments of the present disclosure, described below, may be implemented in a variety of ways, such as a process, an apparatus, a system, a device, or a method on a tangible computer-readable medium.

Components or modules shown in the figures are illustrative of exemplary embodiments of the disclosure and are meant to avoid obscuring the disclosure. It should also be understood that throughout this discussion components may be described as separate functional units, which may comprise sub-units, but those skilled in the art will recognize that various components, or portions thereof, may be divided into separate components or may be integrated together, including, for example, being within a single system or component. It should be noted that functions or operations discussed herein may be implemented as components. Components may be implemented in software, hardware, or a combination thereof.

Furthermore, connections between components or systems within the figures are not intended to be limited to direct connections. Rather, data between these components may be modified, re-formatted, or otherwise changed by intermediary components. Also, additional or fewer connections may be used. It should also be noted that the terms "coupled," "connected," "communicatively coupled," "joined," "interfaced," or any of their derivatives shall be understood to include direct connections, indirect connections through one or more intermediary devices, and wireless connections. It should also be noted that any communication, such as a signal, response, reply, acknowledgement, message, query, etc., may comprise one or more exchanges of information.

Reference in the specification to "one or more embodiments," "preferred embodiment," "an embodiment," "embodiments," or the like means that a particular feature, structure, characteristic, or function described in connection with the embodiment is included in at least one embodiment of the present disclosure and may be in more than one embodiment. Also, the appearances of the above-noted phrases in various places in the specification are not necessarily all referring to the same embodiment or embodiments.

The use of certain terms in various places in the specification is for illustration and should not be construed as limiting. A service, function, or resource is not limited to a single service, function, or resource; usage of these terms may refer to a grouping of related services, functions, or resources, which may be distributed or aggregated. The terms "include," "including," "comprise," and "comprising" shall be understood to be open terms, and any lists that follow are examples and not meant to be limited to the listed items. A "layer" may comprise one or more operations. The words "optimal," "optimize," "optimization," and the like refer to an improvement of an outcome or a process and do not require that the specified outcome or process has achieved an "optimal" or peak state. The use of memory, database, information base, data store, tables, hardware, cache, and the like may be used herein to refer to a system component or components into which information may be entered or otherwise recorded.
In one or more embodiments, a stop condition may include: (1) a set number of iterations have been performed; (2) an amount of processing time has been reached; (3) convergence (e.g., the difference between consecutive iterations is less than a first threshold); (4) divergence (e.g., the performance deteriorates); and (5) an acceptable outcome has been reached.

One skilled in the art shall recognize that: (1) certain steps may optionally be performed; (2) steps may not be limited to the specific order set forth herein; (3) certain steps may be performed in different orders; and (4) certain steps may be done concurrently.

Any headings used herein are for organizational purposes only and shall not be used to limit the scope of the description or the claims. Each reference/document mentioned in this patent document is incorporated by reference herein in its entirety.

It shall be noted that any experiments and results provided herein are provided by way of illustration and were performed under specific conditions using a specific embodiment or embodiments; accordingly, neither these experiments nor their results shall be used to limit the scope of the disclosure of the current patent document.

It shall also be noted that although embodiments described herein may be within the context of sporting events (such as soccer), aspects of the present disclosure are not so limited. Accordingly, aspects of the present disclosure may be applied or adapted for use in other contexts.
A. General Introduction

1. General Overview

Presented herein are embodiments that automatically, at scale, and precisely generate highlight videos. For purposes of illustration, soccer games are used. It shall be noted, however, that embodiments herein may be used or adapted for use with other sporting and non-sporting events, such as concerts, performances, speeches, presentations, news, shows, video games, gaming, sporting events, animations, social media posts, movies, etc. Each of these activities may be referred to as an occurrence or event, and a highlight of the occurrence may be referred to as an event of interest, a happening, or a highlight moment.
Utilizing a large-scale multimodal dataset, state-of-the-art deep learning models are created and trained to detect one or more events in a game, such as goals, although other events of interest may also be used (e.g., penalties, injuries, fights, red cards, corner kicks, penalty kicks, etc.). An embodiment of an ensemble learning module is also presented herein to boost the performance of event-of-interest localization.

FIG. 1 depicts an overview of a highlight generation system, according to an embodiment of the present disclosure. In one or more embodiments, large-scale cloud-sourced text data and untrimmed soccer game videos are collected and fed into a series of data processing tools to generate candidate long clips (e.g., 70 seconds, although other lengths of time may be used) that contain the main game events of interest (e.g., goal events). In one or more embodiments, a novel event-of-interest localization pipeline precisely localizes the moment of the event within the clip. Finally, embodiments may build one or more customized highlight videos/stories around the detected highlight moments.
FIG. 2 depicts an overview method for training a generation system, according to an embodiment of the present disclosure. To train the generation system, a large-scale multimodal dataset of event-related data is generated or obtained for use as training data (205). Because the video running time may not correspond to the time within the event, in one or more embodiments, for each video of a set of training videos, time anchoring is performed to correlate the video running time with the event time (210). The metadata (e.g., comments and/or tags) and the correlated times obtained via time anchoring may then be used to identify an approximate time of an event of interest in order to generate, from the video, a clip that includes the event of interest (215). By using clips rather than the entire video, the processing requirements are dramatically reduced. For each clip, features are extracted (220). In one or more embodiments, a set of pre-trained models may be used to obtain the extracted features, which may be multimodal.

In one or more embodiments, for each clip, a final time value for the event of interest is obtained using a neural network model (225). In embodiments, the neural network model may be an ensemble module that receives features from a set of models and outputs the final time value. Given the predicted final time value for each clip, the predicted final time value is compared with its corresponding ground-truth value to obtain a loss value (230), and the loss value may be used to update the model (235).
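For illustration only, the following is a minimal PyTorch-style sketch of the training step described above (steps 225-235), assuming the per-clip features have already been extracted by the pre-trained models. The module name, layer sizes, and the use of an L1 regression loss are illustrative assumptions, not the exact implementation.

```python
# Sketch of a training step: predict the event time from clip features,
# compare it with the ground-truth time, and update the model.
import torch
import torch.nn as nn

class EnsembleLocalizer(nn.Module):
    """Maps stacked per-clip features to a single event-time prediction (seconds)."""
    def __init__(self, in_channels: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(in_channels, 64, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),
            nn.Flatten(),
            nn.Linear(64, 1),  # regression head: predicted event time
        )

    def forward(self, x):            # x: (batch, in_channels, clip_length)
        return self.net(x).squeeze(-1)

def train_step(model, optimizer, clip_features, gt_times):
    """One optimization step: predict, compute loss against ground truth, update."""
    optimizer.zero_grad()
    pred_times = model(clip_features)              # (batch,)
    loss = nn.functional.l1_loss(pred_times, gt_times)
    loss.backward()
    optimizer.step()
    return loss.item()
```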
Once trained, the generation system may be output and used, given an input event video, to generate a highlight video.
2. Related Work

In recent years, artificial intelligence has been applied to analyzing video content and generating videos. In sports analytics, many computer vision techniques have been developed to understand sports broadcasts. In soccer in particular, researchers have proposed algorithms that identify key game events and player actions, algorithms that analyze pass feasibility using players' body orientation, algorithms that combine audio and video streams to detect events, algorithms that incorporate broadcast streams and trajectory data to recognize group activities on the field, algorithms that aggregate deep frame features to spot major game events, and algorithms that leverage temporal context information around actions to handle the intrinsic temporal patterns representing those actions.

Deep neural networks have been trained with large-scale datasets for various video understanding tasks. Recent challenges include finding the temporal boundaries of activities or localizing events in the temporal domain. In soccer video understanding, some define the goal event as the moment the ball crosses the goal line.

In one or more embodiments, this definition of a goal is adopted; state-of-the-art deep learning models and methods together with audio stream processing techniques are leveraged, and, in embodiments, an ensemble learning module is employed to precisely localize the event in soccer video clips.
3. Some Contributions of Embodiments

In this patent document, embodiments of an automatic highlight generation system that can precisely identify event occurrences in videos are presented. In one or more embodiments, the system may be used to generate highlight videos at scale without conventional manual editing efforts. Some of the contributions provided by one or more embodiments include, but are not limited to, the following:

— A large-scale multimodal soccer dataset, including cloud-sourced text data and high-definition videos, was created. Moreover, in one or more embodiments, various data processing mechanisms are applied to parse, clean, and annotate the collected data.

— Multimodal data from multiple sources are aligned, and candidate long video clips are generated by cutting the original videos into 70-second clips using the tags parsed from the cloud-sourced commentary data.

— Embodiments of an event localization pipeline are presented herein. Embodiments extract high-level feature representations from multiple perspectives and apply temporal localization methodologies to help localize the event within a clip. In addition, embodiments are also designed with an ensemble learning module to boost the performance of event localization. It shall be noted that while the occurrence may be a soccer game and the event of interest may be a goal, embodiments may be used or adapted for other occurrences and other events of interest.

— Experimental results show that, for localizing the goal event within a clip, a tested embodiment achieves near-perfect accuracy (0.984) with a 5-second tolerance, which outperforms existing work and establishes a new state of the art. This result helps capture the goal moment accurately and generate highlight videos precisely.
4. Patent Document Layout

This patent document is organized as follows: Section B introduces the creation of the dataset and how the data were collected and annotated. Section C presents embodiments of a methodology for building highlight generation system embodiments, as well as embodiments of how the proposed methodology may be leveraged to precisely localize goal events in soccer video clips. Experimental results are summarized and discussed in Section D. It should be reiterated that the use of a soccer game as the overall content and a goal as an event within that content is presented by way of example only; those skilled in the art shall recognize that aspects herein may be applied to other content domains (including beyond games) and other events.
B. Data Processing Embodiments

To train and develop system embodiments, a large-scale multimodal dataset was created. FIG. 3 depicts a general overview of the dataset generation process, according to an embodiment of the present disclosure. In one or more embodiments, one or more comments and/or tags associated with videos of events are collected (305). For example, soccer game commentary and tags (e.g., corner kick, goal, block, header, etc.) may be crawled from websites or other sources (see, e.g., tags and comments 105 in FIG. 1) to obtain the data. Videos associated with the metadata (i.e., the comments and/or tags) are also collected (305). For embodiments herein, high-definition (HD) untrimmed soccer game videos were collected from various sources. Amazon Mechanical Turk (AMT) was used to annotate the game start time in the untrimmed raw videos (315). In one or more embodiments, the metadata (e.g., comment and/or tag information) may be used to help identify an approximate time of an event of interest in order to generate, from the video, a clip that includes the event of interest (e.g., a clip of a goal) (320). Finally, Amazon Mechanical Turk (AMT) was used to identify the precise time of the event of interest (e.g., the goal) in the processed video clips. The annotated goal times may be used as ground truth when training embodiments of the goal localization model.
1. Data Collection Embodiments

In one or more embodiments, sports websites were crawled to obtain more than 1,000,000 comments and tags, covering more than 10,000 soccer games from various leagues from the 2015 to 2020 seasons. FIG. 4 summarizes comments and tags in some of the cloud-sourced text data, according to an embodiment of the present disclosure.

The comments and tags provide rich information for each game. For example, they include the game date, team names, league, game event times (e.g., in minutes), event tags (such as goal, shot, corner kick, substitution, foul, etc.), and the associated player names. These comments and tags from the cloud-sourced data may be translated into, or may be considered as, rich metadata for the raw video processing embodiments as well as the highlight video generation embodiments.

More than 2,600 high-definition (720P or above) untrimmed soccer game videos were also collected from various online sources. The games come from various leagues from 2014 to 2020. FIG. 5 summarizes the collected untrimmed game videos, according to an embodiment of the present disclosure.
2. Data Annotation Embodiments

In one or more embodiments, the untrimmed raw videos are first sent to Amazon Mechanical Turk (AMT) workers to annotate the game start time (defined as the moment the referee blows the whistle to start the game), and the cloud-sourced game commentary and tags are then parsed to obtain the goal time, in minutes, for each game. By combining the goal-minute tag with the game start time in the video, candidate 70-second clips containing goal events are generated. Next, in one or more embodiments, these candidate clips are sent to AMT for annotating the goal time in seconds. FIG. 6 shows an embodiment of a user interface designed for AMT for goal time annotation, according to an embodiment of the present disclosure.

For goal time annotation on AMT, each HIT (Human Intelligence Task, one worker assignment) contains one (1) candidate clip. Each HIT was assigned to five (5) AMT workers, and the median timestamp value was collected as the ground-truth label.
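For illustration only, the following is a minimal sketch of the aggregation described above: the five per-clip worker timestamps are reduced to a single ground-truth label by taking the median. The example values are hypothetical.

```python
# Aggregate five AMT goal-time annotations (in seconds) into one ground-truth label.
import statistics

def ground_truth_second(worker_timestamps):
    """worker_timestamps: list of goal-time annotations (seconds) for one clip."""
    return statistics.median(worker_timestamps)

# Example: five worker annotations for one candidate clip.
print(ground_truth_second([41.8, 42.0, 42.3, 41.5, 44.0]))  # -> 42.0
```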
C. Method Embodiments

In this section, details of embodiments of each of the five modules of a highlight generation system are presented. As a brief overview, the first module embodiment, in Section C.1, is a game time anchoring embodiment, which checks the temporal integrity of the video and maps any time in the game to a time in the video.

The second module embodiment, in Section C.2, is a coarse interval extraction embodiment. This module is a major difference relative to commonly studied event localization pipelines. In embodiments of this module, 70-second intervals (although other interval sizes may be used) in which specific events are located are extracted by leveraging the text metadata. This approach has advantages over a vanilla end-to-end visual event localization pipeline for at least three reasons. First, clips extracted with metadata carry more contextual information and can be used along different dimensions: with the metadata, a clip may be used as a temporal clip (such as for a game highlight video) or may be used together with other clips of the same team or player to generate team, player, and/or season highlight videos. The second reason is robustness, which stems from the low event ambiguity of the text data. Third, by analyzing short clips around the events of interest rather than the entire video, many resources (processing, processing time, memory, energy consumption, etc.) are saved.

An embodiment of the third module of the system embodiments is multimodal feature extraction, in which video features are extracted from multiple perspectives.

An embodiment of the fourth module is precise temporal localization. Extensive studies of how embodiments of the feature extraction and temporal localization techniques may be designed and implemented are provided in Sections C.3 and C.4, respectively.

Finally, an embodiment of the ensemble learning module is described in Section C.5.
1. Game Time Anchoring Embodiments

The event clock in an event video is sometimes irregular. A major cause appears to be that at least some event video files collected from the Internet contain corrupted timestamps or frames. It was observed that, in the video collection, roughly 10% of the video files contain temporal corruption that shifts part of the video in time, sometimes by more than 10 seconds. Some of the severe corruption observed included more than 100 seconds of missing frames. Besides errors in the video files, some unexpected, rare incidents may have occurred during the occurrence/event, and the event clock had to be stopped for several minutes before they were resolved. Whether caused by corrupted video content or by an interruption of the game, the temporal irregularity may be viewed as a forward or backward time jump. To precisely localize the clip of any event specified by the metadata, in one or more embodiments, time jumps are detected and calibrated accordingly. Accordingly, in one or more embodiments, an anchoring mechanism is designed and used.

FIG. 7 shows a method for correlating event time and video running time, according to an embodiment of the present disclosure. In one or more embodiments, OCR (optical character recognition) is performed on video frames at 5-second intervals (although other intervals may be used) to read the game clock displayed in the video (705). The game start time in the video may be deduced from the recognized game clock (710). Whenever a time jump occurs, in one or more embodiments, a record of the game time after the time jump is kept, which is referred to as a time anchor (710). With the time anchors, in one or more embodiments, any time in the game can be mapped to a time in the video (i.e., a video running time) (715), and any clip specified by the metadata can be extracted precisely. FIG. 8 shows an example of recognizing timer digits in a game video, according to an embodiment of the present disclosure.

As shown in FIG. 8, timer digits 805-820 may be recognized and correlated to the video running time. Embodiments may collect multiple recognition results over time and may self-correct based on spatial stationarity and temporal continuity.
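For illustration only, the following is a simplified sketch of the anchoring idea, assuming OCR has already produced (video running time, game clock) pairs sampled every 5 seconds. The 2-second jump threshold, the helper names, and the data layout are assumptions made for this sketch, not the exact implementation.

```python
# Build time anchors from OCR readings and map a game time to a video running time.

def build_anchors(ocr_readings):
    """Keep an anchor whenever the game clock jumps relative to the video running time."""
    anchors = []
    prev_offset = None
    for video_t, game_t in ocr_readings:          # both in seconds
        offset = video_t - game_t                 # video time corresponding to game second 0
        if prev_offset is None or abs(offset - prev_offset) > 2:  # time jump detected
            anchors.append((game_t, video_t))
        prev_offset = offset
    return anchors

def game_to_video_time(game_t, anchors):
    """Map a game time to a video running time using the most recent anchor."""
    anchor_game, anchor_video = max(
        (a for a in anchors if game_t >= a[0]), key=lambda a: a[0]
    )
    return anchor_video + (game_t - anchor_game)
```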
2. Coarse Interval Extraction Embodiments

FIG. 9 depicts a method for generating clips from an input video, according to an embodiment of the present disclosure. In one or more embodiments, the metadata from the cloud-sourced game commentary and tags, which includes timestamps in minutes for goal events, is parsed (905). Combined with the game start time detected by an embodiment of the OCR tool (discussed above), the original video may be edited to generate x-second (e.g., 70-second) candidate clips containing the event of interest. In one or more embodiments, the extraction rule may be described by the following equations:
t_clipStart = t_gameStart + 60 × t_goalMinute − tolerance    (1)

t_clipEnd = t_clipStart + (base clip length + 2 × tolerance)    (2)
In one or more embodiments, given the goal minute t_goalMinute and the game start time t_gameStart, a clip is extracted from the video starting at t_clipStart seconds. In one or more embodiments, the duration of the candidate clip may be set to 70 seconds (with a base clip length of 60 seconds and a tolerance of 5 seconds, although it shall be noted that different values and different formulations may be used), since this covers the corner case in which the event of interest occurs very close to the turn of the goal minute, and it also tolerates small deviations in the OCR-detected game start time. In the next section, embodiments of a methodology for localizing the goal second (the moment the ball crosses the goal line) within a candidate clip are presented.
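For illustration only, the following is a direct transcription of Equations (1) and (2), using the 60-second base clip length and 5-second tolerance stated above; the example inputs are hypothetical.

```python
# Compute the candidate clip boundaries (in video seconds) from the goal minute
# and the detected game start time, per Equations (1) and (2).
BASE_CLIP_LENGTH = 60  # seconds
TOLERANCE = 5          # seconds

def candidate_clip_bounds(game_start_sec: float, goal_minute: int):
    clip_start = game_start_sec + 60 * goal_minute - TOLERANCE    # Eq. (1)
    clip_end = clip_start + (BASE_CLIP_LENGTH + 2 * TOLERANCE)    # Eq. (2)
    return clip_start, clip_end

# Example: the game starts 200 s into the video; the goal is reported in minute 27.
print(candidate_clip_bounds(200.0, 27))  # -> (1815.0, 1885.0)
```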
3. Multimodal Feature Extraction Embodiments

In this section, three embodiments for obtaining high-level feature representations from the candidate clips are disclosed.

a) Feature Extraction Embodiments Using a Pre-Trained Model
FIG. 10 illustrates feature extraction, according to an embodiment of the present disclosure. Given video data, in one or more embodiments, temporal frames are extracted (1005), resized in the spatial domain if needed to match the input size (1010), and fed into a deep neural network model to obtain high-level feature representations. In one or more embodiments, a ResNet-152 model pre-trained on an image dataset is used, although other networks may be used. In one or more embodiments, temporal frames are extracted at the native frames-per-second (fps) rate of the original video and then downsampled to 2 fps; that is, ResNet-152 feature representations are obtained for 2 frames per second of the original video. ResNet-152 is a very deep neural network with a final 1000-way fully connected layer, and it yields a 2048-dimensional feature representation per frame; in one or more embodiments, the output of the layer before the softmax layer may be used as the extracted high-level features. Note that ResNet-152 may be used to extract high-level features from a single image; it does not inherently embed temporal context information. FIG. 11 shows a pipeline 1100 for extracting high-level features, according to an embodiment of the present disclosure.
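For illustration only, the following is a hedged sketch of this kind of per-frame feature extraction using torchvision's pre-trained ResNet-152, assuming a torchvision version that accepts the "IMAGENET1K_V1" weight name and assuming frames have already been sampled at 2 fps, resized, and normalized. It is a sketch of the general technique, not the exact implementation.

```python
# Extract a 2048-d feature per frame from the layer before the classification head.
import torch
import torch.nn as nn
import torchvision

# Drop the final 1000-way fully connected layer; keep everything up to global pooling.
backbone = torchvision.models.resnet152(weights="IMAGENET1K_V1")
feature_extractor = nn.Sequential(*list(backbone.children())[:-1]).eval()

@torch.no_grad()
def frame_features(frames_2fps: torch.Tensor) -> torch.Tensor:
    """frames_2fps: (num_frames, 3, 224, 224), already resized and normalized.
    Returns a (num_frames, 2048) feature matrix."""
    return feature_extractor(frames_2fps).flatten(1)
```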
b) SlowFast Feature Extractor Embodiments

As part of the video feature extractor, in one or more embodiments, a SlowFast network architecture may be used, such as that proposed by Feichtenhofer et al. (Feichtenhofer, C., Fan, H., Malik, J., & He, K., "SlowFast Networks for Video Recognition," Proceedings of the IEEE International Conference on Computer Vision, pp. 6202-6211 (2019), which is incorporated by reference herein in its entirety) or by Xiao et al. ("Audiovisual SlowFast Networks for Video Recognition," arxiv.org/abs/2001.08740v1 (2020), which is incorporated by reference herein in its entirety), although it shall be noted that other network architectures may be used. FIG. 12 graphically depicts a neural network model that may be used to extract features, according to an embodiment of the present disclosure.
FIG. 13 depicts feature extraction using a SlowFast neural network model, according to an embodiment of the present disclosure. In one or more embodiments, a SlowFast network is initialized with pre-trained weights (1305) and fine-tuned as a classifier using the training dataset (1310). The second column of Table 1 below shows the event classification results of the baseline network on the test dataset. In one or more embodiments, the feature extractor is used to classify 4-second clips into 4 categories: 1) far away from the event of interest (e.g., a goal), 2) right before the event of interest, 3) the event of interest, and 4) right after the event of interest.

Several techniques may be implemented to find the best classifier, as evaluated by the top-1 error percentage. First, a network constructed as in FIG. 12 is applied, which adds audio as an extra pathway to the SlowFast network (AVSlowFast). The visual portion of the network may be initialized with the same weights. It was observed that directly jointly training the visual and audio features hurts performance, which is a common issue found when training multimodal networks. In one or more embodiments, a technique of adding different loss functions for the visual and audio modalities separately is applied, and the whole network is trained with a multi-task loss. In one or more embodiments, a cross-entropy loss on the audiovisual result and a linear combination from each audio/visual branch may be used. The linear combination may be a weighted combination, in which the weights may be learned or may be selected as hyperparameters. The best top-1 error results, shown in the bottom row of Table 1, were obtained with this approach.
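The exact form of the multi-task loss is not fully specified here, so the following is only one plausible reading, shown for illustration: a cross-entropy loss on the fused audiovisual prediction plus a weighted combination of per-branch cross-entropy losses. The branch weights and the names of the logits are assumptions.

```python
# One possible multi-task loss: fused audiovisual loss plus weighted per-branch losses.
import torch.nn as nn

ce = nn.CrossEntropyLoss()

def multitask_loss(av_logits, visual_logits, audio_logits, labels,
                   w_visual=0.5, w_audio=0.5):
    """labels: class indices for the 4 event categories described above."""
    loss_av = ce(av_logits, labels)
    loss_branches = w_visual * ce(visual_logits, labels) + w_audio * ce(audio_logits, labels)
    return loss_av + loss_branches
```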
Table 1. Results of event classification
In one or more embodiments of the goal localization pipeline, the feature extractor portion of this network (AVSlowFast with multi-task loss) may be leveraged. Hence, the aim is to drive down the top-1 error, which corresponds to stronger features.
c) Mean Absolute Value Audio Feature Embodiments

By listening to the audio track of an event (e.g., a game without live commentary), people can often tell when an event of interest occurs simply from the volume of the crowd. Inspired by this observation, a simple method was developed to extract key information about the event of interest directly from the audio.

FIG. 14 depicts a method for audio feature extraction and event-of-interest time prediction, according to an embodiment of the present disclosure. In one or more embodiments, the absolute value of the audio waveform is taken and downsampled to 1 hertz (Hz) (1405). This feature representation may be referred to as the mean absolute value feature, since it represents the average sound amplitude per second. FIG. 15A and FIG. 15B show, respectively, an example of the raw audio waveform of one clip and its mean absolute value feature, according to embodiments of the present disclosure.

For each clip, the maximum 1505 of the mean absolute value audio feature 1500B may be located (1410). By locating the maximum of this mean absolute value audio feature (e.g., maximum 1505) and its corresponding time (e.g., time 1510) for the clips in the test dataset, an accuracy of 79% was achieved for temporal localization (with a 5-second tolerance).

In one or more embodiments, the mean absolute value audio feature (e.g., 1500B in FIG. 15B) may be treated as a likelihood prediction, over time, of the event of interest within the clip. As will be discussed below, this mean absolute value audio feature may be one of the features input into the ensemble model that predicts the final time at which the event of interest occurs within the clip.
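For illustration only, the following is a minimal sketch of the mean absolute value audio feature described above: average the absolute waveform over each one-second window (downsampling to 1 Hz) and take the loudest second as the predicted event time within the clip. The function names are illustrative.

```python
# Mean-absolute-value audio feature and a simple argmax-based time prediction.
import numpy as np

def mean_abs_feature(waveform: np.ndarray, sample_rate: int) -> np.ndarray:
    """Returns one value per second: the mean absolute amplitude over that second."""
    n_seconds = len(waveform) // sample_rate
    trimmed = np.abs(waveform[: n_seconds * sample_rate])
    return trimmed.reshape(n_seconds, sample_rate).mean(axis=1)

def predict_event_second(waveform: np.ndarray, sample_rate: int) -> int:
    """Predicted event time (in seconds from clip start) is the loudest second."""
    return int(np.argmax(mean_abs_feature(waveform, sample_rate)))
```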
4. Action Localization Embodiments

In one or more embodiments, to precisely localize the moment of a goal in a soccer game video, what happens in the video is learned together with the temporal context information around that moment. For example, before a goal event happens, a player will take a shot (or header) and the ball will move toward the goal. In some cases, offensive and defensive players gather in the penalty box, not far from the goal. After a goal event, the scoring player typically runs toward the sideline and hugs teammates, and there are celebrations among the audience and between the coaches. Intuitively, these patterns in the video can help a model learn what happens and localize the moment of the goal event.
FIG. 16 depicts a method for predicting the likelihood of an event of interest in a video clip, according to an embodiment of the present disclosure. In one or more embodiments, to build the temporal localization model, a temporal convolutional neural network (CNN) that takes the extracted visual features as input is used (1605). In one or more embodiments, the input features may be features extracted by one or more of the previously discussed models. For each frame, the temporal CNN outputs a set of intermediate features that mix temporal information across frames. Then, in one or more embodiments, the intermediate features are input into a segmentation module that generates segmentation scores (1610), which are evaluated by a segmentation loss function. A cross-entropy loss function may be used as the segmentation loss function, for example:

L_segmentation = −Σ_i t_i log(p_i)

where t_i is the ground-truth label and p_i is the softmax probability for the i-th class.
In one or more embodiments, the segmentation scores and the intermediate features are concatenated and fed to an action localization module (1615), which generates a localization prediction (e.g., a prediction of the likelihood of the event of interest occurring at each time point within the span of the clip) (1620), which may be evaluated by a YOLO-like action localization loss function. An L2 loss function may be used for the action localization loss function.

FIG. 17 shows a pipeline for temporal localization, according to an embodiment of the present disclosure. In one or more embodiments, the temporal CNN may comprise a convolution layer, the segmentation module may comprise convolution layers and a batch normalization layer, and the action localization module may comprise a pooling layer and convolution layers.
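For illustration only, the following is a simplified, hedged sketch of the temporal localization pipeline of FIG. 17: a temporal CNN mixes information across frames, a segmentation module scores each frame, and the action localization module consumes the concatenation of the intermediate features and the segmentation scores to predict a per-time-step likelihood of the event. The layer sizes, kernel sizes, and class count are illustrative assumptions, not the exact architecture.

```python
# Temporal CNN -> segmentation scores -> concatenate -> action localization likelihoods.
import torch
import torch.nn as nn

class TemporalLocalizer(nn.Module):
    def __init__(self, feat_dim: int, num_classes: int = 2, hidden: int = 128):
        super().__init__()
        self.temporal_cnn = nn.Conv1d(feat_dim, hidden, kernel_size=9, padding=4)
        self.segmentation = nn.Sequential(
            nn.Conv1d(hidden, num_classes, kernel_size=1),
            nn.BatchNorm1d(num_classes),
        )
        self.localization = nn.Sequential(
            nn.AvgPool1d(kernel_size=3, stride=1, padding=1),
            nn.Conv1d(hidden + num_classes, 1, kernel_size=3, padding=1),
        )

    def forward(self, feats):                      # feats: (batch, feat_dim, T)
        h = torch.relu(self.temporal_cnn(feats))   # intermediate features across frames
        seg_scores = self.segmentation(h)          # per-frame segmentation scores
        fused = torch.cat([h, seg_scores], dim=1)  # concatenate features and scores
        return seg_scores, self.localization(fused).squeeze(1)  # (batch, T) likelihoods
```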
In one or more embodiments, to take temporal context information into account, model embodiments are trained with the segmentation and action localization loss functions described in Cioppa et al. (Cioppa, A., Deliège, A., Giancola, S., Ghanem, B., Van Droogenbroeck, M., Gade, R., & Moeslund, T., "A Context-Aware Loss Function for Action Spotting in Soccer Videos," 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 13123-13133, which is incorporated by reference herein in its entirety). In one or more embodiments, the segmentation module is trained using the segmentation loss, in which each frame is associated with a score representing the likelihood that the frame belongs to the action class, and the action localization module is trained using the action localization loss, in which the temporal position of the action class is predicted.

At least one key difference between the embodiments herein and the approach of Cioppa et al. is that the embodiments herein work on short clips, whereas Cioppa et al. take the entire game video as input and therefore, when deployed in real time, need much longer to process the video and extract features.
In one or more embodiments, the extracted feature input may be the features extracted by the ResNet model discussed above or by the AVSlowFast multi-task model discussed above. Alternatively, for the AVSlowFast multi-task model, the segmentation portion of the action localization model may be removed. FIG. 18 depicts a method for predicting the likelihood of an event of interest in a video clip, according to an embodiment of the present disclosure. In one or more embodiments, a temporal convolutional neural network receives the features extracted by the AVSlowFast multi-task model as input (1805). For each frame, it outputs a set of intermediate features that mix temporal information across frames. Then, in one or more embodiments, the intermediate features are input into an action localization module (1810), which generates a localization prediction (e.g., a prediction of the likelihood of the event of interest occurring at each time point within the span of the clip) (1815), which may be evaluated by an action localization loss function. FIG. 19 shows a pipeline for action localization prediction, according to an embodiment of the present disclosure.
5.整合学习实施例5. Integrating learning examples
在一个或多个实施例中,可以从上述三个模型中的每一个获得剪 辑中兴趣事件的单个预测时间(例如,选取最大值)。可以使用预测之 一,或者可以组合(例如,平均)预测。或者,可以使用整合模型组 合来自每个模型的信息,以便获得对剪辑中兴趣事件的最终预测。In one or more embodiments, a single predicted time (e.g., taking the maximum value) of an event of interest in a clip can be obtained from each of the three models described above. One of the forecasts can be used, or the forecasts can be combined (e.g., averaged). Alternatively, an ensemble model can be used to combine the information from each model in order to obtain a final prediction of the event of interest in the clip.
图20描述了根据本公开实施例的用于预测视频剪辑中兴趣事件 的可能性的方法,以及图21示出了根据本公开实施例的用于最终时间 预测的流水线。在一个或多个实施例中,可以以汇集在上述子节中描 述的三个模型/特征的输出的整合方式增强最终准确率。在一个或多个 实施例中,可将所有三个先前模型的输出与位置编码向量一起组合为整合模块的输入(2005)。组合可以使用级联来完成,例如,4个d维 向量变成4×d矩阵。对于ResNet和AVSlowfast多任务模型,输入可 以是来自上述第4节中它们的动作定位模型的可能性预测输出。而且, 对于音频,输入可以是剪辑的平均绝对值音频特征(例如,图15B)。在一个或多个实施例中,位置编码向量是表示剪辑的时间长度(即, 索引)的1-D向量。Figure 20 describes a method for predicting the likelihood of an event of interest in a video clip according to an embodiment of the disclosure, and Figure 21 shows a pipeline for final temporal prediction according to an embodiment of the disclosure. In one or more embodiments, the final accuracy can be enhanced in an integrated manner that pools the outputs of the three models/features described in the above subsections. In one or more embodiments, the outputs of all three previous models may be combined together with position encoding vectors as input to the integration module (2005). Combining can be done using concatenation, e.g., 4 d-dimensional vectors become 4×d matrices. For the ResNet and AVSlowfast multi-task models, the input can be the likelihood prediction output from their action localization models in
在一个或多个实施例中,整合模块的核心是具有回归头部的18 层1-D ResNet网络。本质上,整合模块学习从包括多模式的多维输入 特征到剪辑中兴趣事件的最终时间位置的映射。在一个或多个实施例 中,从整合模型输出最终时间值预测(2010),并且可以与真实值时间 进行比较以计算损失。各种剪辑的损失可用于更新整合模型的参数。In one or more embodiments, the core of the integration module is an 18-layer 1-D ResNet network with a regression head. Essentially, the integration module learns a mapping from multi-dimensional input features including multiple modalities to the final temporal locations of events of interest in clips. In one or more embodiments, a final time value prediction is output (2010) from the integrated model and can be compared to the ground truth time to calculate the loss. The loss of various clips can be used to update the parameters of the integrated model.
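A compact sketch, under stated assumptions (no batch normalization, eight residual blocks rather than a full 18-layer network, an L1 loss), of a 1-D residual regressor with a regression head that maps the 4×d ensemble input to a single time value and is trained against the ground-truth time; channel counts and the loss choice are illustrative, not the exact configuration of the embodiments.

```python
import torch
import torch.nn as nn

class BasicBlock1d(nn.Module):
    def __init__(self, ch: int):
        super().__init__()
        self.conv1 = nn.Conv1d(ch, ch, 3, padding=1)
        self.conv2 = nn.Conv1d(ch, ch, 3, padding=1)
        self.relu = nn.ReLU()

    def forward(self, x):
        # Residual connection around two temporal convolutions.
        return self.relu(x + self.conv2(self.relu(self.conv1(x))))

class EnsembleRegressor(nn.Module):
    def __init__(self, in_ch: int = 4, ch: int = 64, blocks: int = 8):
        super().__init__()
        self.stem = nn.Conv1d(in_ch, ch, 7, padding=3)
        self.body = nn.Sequential(*[BasicBlock1d(ch) for _ in range(blocks)])
        self.pool = nn.AdaptiveAvgPool1d(1)
        self.head = nn.Linear(ch, 1)              # regression head: one time value per clip

    def forward(self, x):                         # x: (batch, 4, d)
        h = self.pool(self.body(self.stem(x))).squeeze(-1)
        return self.head(h).squeeze(-1)           # (batch,) predicted event times

model = EnsembleRegressor()
x, t_true = torch.randn(8, 4, 70), torch.rand(8) * 70
loss = nn.functional.l1_loss(model(x), t_true)    # compare prediction to ground-truth time
loss.backward()                                   # gradients update the integration model
```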
6.推断实施例6. Inference Embodiments
一旦经过训练,就可以部署如图1所示的整个高光时刻生成系统。 在一个或多个实施例中,系统还可以包括如下输入,该输入允许用户 关于生成的剪辑选择一个或多个参数。例如,用户可以选择特定的球 员、比赛范围、一个或多个兴趣事件(例如,进球和惩罚)、以及制作 高光时刻视频的剪辑数量(或者每个剪辑和/或整个高光时刻编辑视频 的时间长度)。然后,高光时刻生成系统可以访问视频和元数据,并通 过级联剪辑产生高光时刻编辑视频。例如,用户可能想要每个兴趣事 件剪辑有10秒钟。因此,在一个或多个实施例中,定制的高光时刻视 频生成模块可以根据针对剪辑的最终预测时间,选择兴趣事件之前的8秒和之后的2秒。或者,如图1所示,球员职业生涯中的关键事件 可以是兴趣事件,并且它们可以被自动识别并汇编成球员职业生涯的 “故事”。可以由定制的高光时刻视频生成模块将可以由用户选择的音 频和其它多媒体特征添加到视频。本领域的技术人员将认识到高光时 刻生成系统的其它应用。Once trained, the entire highlight moment generation system as shown in Figure 1 can be deployed. In one or more embodiments, the system may also include an input that allows the user to select one or more parameters with respect to the generated clip. For example, the user can select specific players, the range of play, one or more events of interest (e.g., goals and penalties), and the number of clips to make the highlight video (or the time to edit the video for each clip and/or the entire highlight length). A highlight generation system can then access the video and metadata and produce a highlight edit video by cascading clips. For example, a user may want 10 seconds of each interest clip. Thus, in one or more embodiments, the customized highlight moment video generation module may select 8 seconds before and 2 seconds after the event of interest based on the final predicted time for the clip. Alternatively, as shown in Figure 1, key events in a player's career can be events of interest, and they can be automatically identified and compiled into a "story" of a player's career. Audio and other multimedia features, which may be selected by the user, may be added to the video by a customized highlight moment video generation module. Those skilled in the art will recognize other applications of the highlight moment generating system.
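A minimal sketch of the clip-window selection described above (e.g., 8 seconds before and 2 seconds after the final predicted event time); clamping to the video bounds is an assumption about how edge cases might be handled.

```python
def clip_window(event_time_s: float, before_s: float = 8.0, after_s: float = 2.0,
                video_len_s: float | None = None) -> tuple[float, float]:
    """Return the (start, end) times of a highlight clip around a predicted event time."""
    start = max(0.0, event_time_s - before_s)
    end = event_time_s + after_s
    if video_len_s is not None:
        end = min(end, video_len_s)
    return start, end

print(clip_window(754.0))   # a goal predicted at 12'34" -> (746.0, 756.0)
```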
D.实验结果D. Experimental results
应注意的是,这些实验和结果是通过举例说明的方式提供的,并且是使用一个或多个具体实施例在具体条件下进行的;因此,这些实验和它们的结果都不应用于限制本专利文件的公开范围。It should be noted that these experiments and results are provided by way of illustration and were performed under specific conditions using a specific embodiment or embodiments; accordingly, neither these experiments nor their results shall be used to limit the scope of the disclosure of this patent document.
1.进球定位1. Goal Positioning
为了与现有工作进行公平比较,用从数据集的训练集中的比赛提取的包含进球的候选剪辑来训练测试模型实施例,并且用从数据集的验证/测试集中的比赛提取的包含进球的候选剪辑来验证/测试。For a fair comparison with existing work, the tested model embodiments were trained with candidate clips containing goals extracted from games in the training set of the dataset, and were validated/tested with candidate clips containing goals extracted from games in the validation/test set of the dataset.
图22示出了主要结果:关于在70秒的剪辑中定位进球,测试的实施例2205显著优于现有技术的方法2210,该现有技术的方法2210被称为足球中定位进球的上下文感知(Context-Aware)方法。FIG. 22 shows the main results: for localizing the goal within a 70-second clip, the tested embodiment 2205 significantly outperforms the prior-art method 2210, which is referred to as the Context-Aware approach for localizing goals in soccer.
还示出了通过使用在C.3节或C.4节中描述的三个不同特征获得 的中间预测结果,并且通过在C.5节中描述的整合学习模块预测最终 结果。图23中堆叠了对3个剪辑的进球定位结果。如图23所示,整 合学习模块实施例的最终预测输出在其与真实值标记的接近程度方面 是最好的(用虚线椭圆示出)。Also shown are the intermediate prediction results obtained by using the three different features described in Section C.3 or C.4, and the final results predicted by the ensemble learning module described in Section C.5. The goal positioning results for the 3 clips are stacked in Figure 23. As shown in Figure 23, the final predicted output of the integrated learning module embodiment is best in terms of how close it is to the ground-truth label (shown by the dashed ellipse).
2.部分附注2. Partial notes
如图22所示,实施例可以以5秒的容差达到接近1的准确率 (0.984)。该结果是现象级的,因为它可以用于校正来自文本的错误 标记、并与定制音频评论同步。它还有助于精确地生成高光时刻,并 因此给予用户/编辑器选择以围绕精确的进球时刻定制他们的视频。流 水线实施例可自然地扩展为捕捉其它事件(例如角球、任意球和点球) 的时刻。As shown in Figure 22, the embodiment can achieve an accuracy rate close to 1 (0.984) with a tolerance of 5 seconds. The result is phenomenal, as it can be used to correct mismarks from the text, synchronized with custom audio commentary. It also helps to precisely generate highlight moments, and thus gives users/editors the option to tailor their video around the precise moment of the goal. The pipeline embodiment can be naturally extended to capture the moments of other events such as corner kicks, free kicks and penalties.
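A small sketch of one common way to compute the tolerance-based accuracy cited above (the fraction of clips whose predicted event time falls within ±5 seconds of the ground truth); it is an assumption that the reported metric is computed in exactly this way.

```python
import numpy as np

def accuracy_at_tolerance(pred_s: np.ndarray, true_s: np.ndarray, tol_s: float = 5.0) -> float:
    """Fraction of clips whose predicted time is within +/- tol_s of the labeled time."""
    return float(np.mean(np.abs(pred_s - true_s) <= tol_s))

print(accuracy_at_tolerance(np.array([31.0, 12.0, 55.0]), np.array([33.0, 20.0, 54.0])))  # 0.666...
```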
再次重申,使用足球比赛作为整体内容以及使用进球作为该内容中的事件仅仅是示例性的,并且本领域技术人员将认识到,本文的各方面可以应用于其他内容领域(包括比赛领域之外的其他内容领域)以及其他事件。Again, the use of a soccer game as the overall content and of a goal as an event within that content is merely exemplary, and those skilled in the art will recognize that aspects herein may be applied to other content domains (including content domains other than games) and to other events.
E.从文本和视频输入产生高光时刻视频的替代实施例E. Alternative Embodiment to Generate Video of Highlight Moments from Text and Video Input
如前所述,上述公开的系统和方法的应用是生成高光时刻视频的 能力。例如,能够自动生成诸如具有评论的体育高光时刻视频的事件 高光时刻视频,将是非常有益的。如上所述,对视频内容的需求在不 断增长,并且生成视频内容比生成基于文章的内容要花费更长的时间。 以前,诸如视频的生成是人工过程,其中视频编辑花费大量时间。本 文的实施例使得更容易地生成高光时刻视频,其中,在生成相应的视 频内容时使用输入文本,例如匹配的概要文章。在一个或多个实施例 中,人工智能/机器学习被用于查找、匹配或生成与文本相对应的视频 剪辑,上述文本与视频的部分有关;因此,一旦写了文本文章,就可以生成视频,其中,生成这种高光时刻视频的过程被简化为编写文章、 并且使用自动系统的实施例来编辑原始视频以生成高光时刻视频。As previously mentioned, an application of the above-disclosed systems and methods is the ability to generate video of highlight moments. For example, it would be very beneficial to be able to automatically generate event highlight video such as sports highlight video with commentary. As mentioned above, the demand for video content is constantly increasing, and it takes longer to generate video content than article-based content. Previously, generation such as video was a manual process where video editing took a lot of time. Embodiments herein make it easier to generate highlight moment videos where the input text is used in generating the corresponding video content, such as a matching synopsis article. In one or more embodiments, artificial intelligence/machine learning is used to find, match, or generate video clips that correspond to the text that pertains to the portion of the video; thus, once the text article is written, the video can be generated , wherein the process of generating such a highlight moment video is reduced to writing an article and using an embodiment of an automated system to edit the raw video to generate a highlight moment video.
图24A和图24B描述了根据本公开的实施例的用于从视频和文本生成一个或多个事件的高光时刻视频的系统。如图所示,系统2400可以作为输入接收描述比赛的文章或文本段2408以及整个比赛的一个或多个视频2402。注意,出于说明的目的,活动是体育比赛,但是应当注意,也可以使用其他活动(例如,演讲、集会、新闻广播、音乐会等)。FIGS. 24A and 24B depict a system for generating a highlight video of one or more events from video and text according to embodiments of the present disclosure. As shown, the system 2400 may receive as input an article or text passage 2408 describing a game and one or more videos 2402 of the entire game. Note that, for purposes of illustration, the activity is a sporting event, but it should be noted that other activities (e.g., speeches, rallies, news broadcasts, concerts, etc.) may also be used.
回到图24A,所公开的系统的一个任务是识别视频2402中与输入文本2408中所指出的元素相对应的一个或多个正确部分。图25描述了根据本公开的实施例的用于从输入文本提取信息的方法。如图所示,接收包括与活动中的一个或多个高光时刻事件相关的文本的输入文本(例如,图24A中的文本2408)(2505)。例如,输入文本可以是概述比赛的文章,并且该文章将成为系统输出的最终概要视频的基础。为了帮助视频分段选择,对输入文本进行解析以识别兴趣事件和相关数据(如果有的话)(2510)。在一个或多个实施例中,解析可以基于规则(例如,模式匹配、关键字匹配等),可以采用机器学习模型(例如被训练来提取的神经网络模型),或利用两者来提取和分类关键数据,例如:事件的数量、事件发生的分钟/时间、事件/动作的类型、球员和其它识别信息(谁、什么、何处、何时等)。下面提供了一些模板匹配的示例:Returning to FIG. 24A, one task of the disclosed system is to identify the correct portion or portions of the video 2402 that correspond to the elements noted in the input text 2408. FIG. 25 depicts a method for extracting information from input text according to embodiments of the present disclosure. As shown, input text (e.g., text 2408 in FIG. 24A) comprising text related to one or more highlight events in an activity is received (2505). For example, the input text may be an article summarizing a game, and that article will form the basis of the final summary video output by the system. To aid video segment selection, the input text is parsed to identify events of interest and related data, if any (2510). In one or more embodiments, the parsing may be rule-based (e.g., pattern matching, keyword matching, etc.), may employ a machine learning model (e.g., a neural network model trained to extract), or may use both to extract and classify key data, such as: the number of events, the minute/time at which an event occurred, the type of event/action, the player, and other identifying information (who, what, where, when, etc.). Some examples of template matching are provided below:
—阿森纳队的角球,由Adam Weber丢球:包含关键词“角球” 并且被分类为角球动作,并且所提取的球员是Adam Weber。—Arsenal's corner kick, conceded by Adam Weber: Contains the keyword "corner kick" and is classified as a corner kick action, and the extracted player is Adam Weber.
—Paul Grom(布莱顿队)在防守半场赢得一个任意球:这被分类 为任意球动作,球员是Paul Grom。- Paul Grom (Brighton) wins a free kick in the defensive half: this is classified as a free kick and the player is Paul Grom.
—进球,阿森纳队1分,布莱顿队0分。Nicolas Pem(阿森纳队) 从禁区中央射右脚射门至右下角,Clark Kamers协助他做横传:这包 含关键字“进球”和“射门”,所以这将被归类为进球和射门事件两者。 这个球员就是Nicolas Pem。- Goals, 1 point for Arsenal, 0 points for Brighton. Nicolas Pem (Arsenal) right footed shot from the center of the box to the bottom right corner, Clark Kamers assists him with a cross: this contains the keywords 'goal' and 'shot' so this would be classified as goal and shot event both. This player is Nicolas Pem.
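The keyword-and-pattern style of parsing illustrated by the examples above can be sketched as follows; the keyword table, the naive player-name pattern, and the minute pattern are illustrative assumptions rather than the rule set of any particular embodiment.

```python
import re

ACTION_KEYWORDS = {          # illustrative keyword table, not an exhaustive rule set
    "corner": "corner kick",
    "free kick": "free kick",
    "goal": "goal",
    "shot": "shot",
    "penalty": "penalty",
}

def classify_sentence(sentence: str) -> dict:
    s = sentence.lower()
    actions = sorted({label for kw, label in ACTION_KEYWORDS.items() if kw in s})
    minute = re.search(r"(\d+)(?:st|nd|rd|th)?\s*minute|(\d+)'", s)
    player = re.search(r"([A-Z][a-z]+\s[A-Z][a-z]+)", sentence)  # naive "First Last" pattern
    return {
        "actions": actions,
        "minute": int(next(g for g in minute.groups() if g)) if minute else None,
        "player": player.group(1) if player else None,
    }

print(classify_sentence("Goal! Nicolas Pem (Arsenal) right footed shot from the centre of the box."))
# -> {'actions': ['goal', 'shot'], 'minute': None, 'player': 'Nicolas Pem'}
```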
在一个或多个实施例中,所提取的包括关键事件和相关数据(如果有的话)的信息被提供给数据集(例如,图24A中的数据集2424)。该数据集由视频模型(例如,图24A中的视频深度学习模型2428)使用来生成用于视频概要的视频片段。In one or more embodiments, the extracted information, including the key events and related data (if any), is provided to a dataset (e.g., dataset 2424 in FIG. 24A). The dataset is used by a video model (e.g., video deep learning model 2428 in FIG. 24A) to generate the video segments for the video summary.
如图24A所示,数据可以与从其它源提取的文本数据组合。例如, 可以从在线发布、社交媒体(例如,Facebook、Twitter等)、高频或实 况报道文本、新闻推送、论坛、用户群等收集文本数据2404。在一个 或多个实施例中,可以使用相同或类似的基于规则的模型的或神经网 络模型来解析和分类该文本数据,或者可以使用更贴近输入文本数据 的不同的基于规则的模型或神经网络模型来解析和分类该文本数据。 通常,描述比赛动作的高频(实况)文本数据的web内容或用户生成 的内容包括时间信息,其也可以被提取并用于帮助识别视频中的正确 时间。该附加文本数据2404可以被解析并分类为句子组的集合,该句子包含分钟、动作和可能的其它数据(例如球员数据、时间等)。每个 句子或句子组描述比赛期间的事件。因此,文本解析/动作分类模型 2412和2416可以是相同或不同的模型。注意,信息2404的源提供附 加信息以帮助从输入概要文本2408中提取正确的事件As shown in Figure 24A, the data can be combined with textual data extracted from other sources. For example,
注意,还可以使用附加的数据输入来帮助解析和/或视频概要生成。作为示例,考虑系统2400可以使用球员数据数据库2406(其中,系统可以网络爬取球员信息或使用用户生成的球员信息内容)。还可以使用解析器2414来解析该信息,解析器2414可以使用相同或类似的基于规则的模型的或神经网络模型,或者可以使用更贴近输入的文本数据的不同的基于规则的模型或神经网络模型。例如,考虑图26所示的方法。Note that additional data inputs may also be used to aid the parsing and/or the video summary generation. As an example, consider that the system 2400 may use a player data database 2406 (where the system may web-crawl player information or use user-generated player information content). This information may also be parsed using a parser 2414, which may use the same or a similar rule-based or neural network model, or may use a different rule-based or neural network model that better fits the input text data. For example, consider the method shown in FIG. 26.
在一个或多个实施例中,解析器模块2414可将表信息归一化并存储实体信息(例如球队、球员、演员、乐队、组织等)。例如,球队的球员列表(该列表可以在球队网站处获得)可以包括按球员的号码和姓名的球员列表(2605)。解析器可以提取行并获得球员姓名和球衣号码(2610),从而得到具有该球队的球员的号码和姓名的数据库。输出是球员数据集2418,其可用于帮助补充经解析的数据(即,具有可能的球员数据的由动作分类的具有秒数的文本数据2420和数据收集(如分钟、动作、可能的其他相关数据)2422)。应当指出,类似的动作也可用于其他实体(如表演者、演员、主持人等)。In one or more embodiments, the parser module 2414 may normalize the table information and store entity information (e.g., teams, players, actors, bands, organizations, etc.). For example, a team's player list (which may be available at the team's website) may comprise a list of players by player number and name (2605). The parser may extract the rows and obtain the player names and jersey numbers (2610), resulting in a database with the numbers and names of the players of that team. The output is a player dataset 2418, which may be used to help supplement the parsed data (i.e., the text data with seconds, classified by action, with possible player data 2420, and the data collection (e.g., minute, action, possibly other related data) 2422). It should be noted that similar operations may also be used for other entities (e.g., performers, actors, hosts, etc.).
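A toy sketch of the roster normalization step: turning scraped "number, name" rows into a lookup table; the row format and the example players and numbers are made up for illustration.

```python
def parse_roster(rows: list[str]) -> dict[int, str]:
    """Parse 'jersey number, player name' rows into a {number: name} lookup."""
    roster = {}
    for row in rows:
        number, name = row.split(",", 1)
        roster[int(number.strip())] = name.strip()
    return roster

print(parse_roster(["19, Nicolas Pem", "4, Adam Weber"]))
# {19: 'Nicolas Pem', 4: 'Adam Weber'}
```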
回到图24A,给定输入视频2402,时间检测模块2410将视频的运行时间与比赛时间相关联。在一个或一个以上的实施例中,可以如关于时间锚定的上文论述来执行时间检测,但也可使用其它方法。Returning to FIG. 24A, given an input video 2402, the time detection module 2410 correlates the running time of the video with the game time. In one or more embodiments, the time detection may be performed as discussed above with respect to time anchoring, although other methods may also be used.
如图24A中的实施例所示,系统2400可以包括通过分钟和动作匹配模块2426,其也可以帮助识别事件的时间信息。例如,对于描述事件的每个句子组,模块2426可以将句子与高频文本数据相匹配,从而获得动作发生的秒数。在一个或多个实施例中,模型还可以关联附加的相关数据,例如球员。来自其它数据集(即,数据集2420和数据集2422)的信息可以被提供给模块2426,模块2426使用该信息用于匹配。As shown in the embodiment of FIG. 24A, the system 2400 may include a match-by-minute-and-action module 2426, which may also help identify the time information of events. For example, for each group of sentences describing an event, module 2426 may match the sentences against the high-frequency text data to obtain the second at which the action occurred. In one or more embodiments, the model may also associate additional related data, such as the player. Information from the other datasets (i.e., dataset 2420 and dataset 2422) may be provided to module 2426, which uses that information for the matching.
例如,在匹配概要中,在第4分钟,阿森纳队获得了一个角球, 并且实时流数据包括以下内容:For example, in the match summary, in the 4th minute, Arsenal were awarded a corner and the live stream data included the following:
3'30”由Bukayo犯规3'30" fouled by Bukayo
4'5”Adam Weber传球4'5" Adam Weber passes
4'12”Nicolas Pem射门得分4'12" Nicolas Pem scores
4'25”Pascal被Adam犯规4'25" Pascal was fouled by Adam
4'37”阿森纳队的角球,由Adam Weber丢球4'37" Arsenal Corner, lost by Adam Weber
5'10”阿森纳队换人,Arsenal.Sam Bukayo替换Emil Bowe5'10" Substitution for Arsenal, Arsenal.Sam Bukayo replaces Emil Bowe
在这种情况下,感兴趣的匹配概要数据在第4分钟出现,并且该 动作是角球。在一个或多个实施例中,匹配模型实施例过滤实况数据 并保留所有第4分钟数据,即:In this case, the match summary data of interest came in the 4th minute and the action was a corner kick. In one or more embodiments, the matching model embodiment filters live data and retains all 4th minute data, i.e.:
4'5”Adam Weber传球4'5" Adam Weber passes
4'12”Nicolas Pem射门得分4'12" Nicolas Pem scores
4'25”Pascal被Adam犯规4'25" Pascal was fouled by Adam
4'37”阿森纳队的角球,由Adam Weber丢球4'37" Arsenal Corner, lost by Adam Weber
然后,通过上述文本解析模块,这些动作是已知的:传球、进球、丢球和角球。因此,模块2426匹配至“4'37”的阿森纳队的角球,由Adam Weber丢球”,如果这是兴趣事件的话。在一个或多个实施例中,也可以将该信息提供给动作模块2430,动作模块2430使用时间信息来生成相应的视频片段。Then, via the text parsing module described above, these actions are known: pass, goal, conceded ball, and corner kick. Thus, module 2426 matches to the 4'37" entry, "Arsenal's corner kick, conceded by Adam Weber," if that is the event of interest. In one or more embodiments, this information may also be provided to the action module 2430, which uses the time information to generate the corresponding video segment.
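A sketch of the minute-and-keyword matching just described, using the illustrative live-feed lines from above; the timestamp format and the helper function are assumptions about one possible implementation.

```python
import re

live_feed = [
    ("3'30\"", "fouled by Bukayo"),
    ("4'5\"",  "Adam Weber passes"),
    ("4'12\"", "Nicolas Pem scores"),
    ("4'25\"", "Pascal was fouled by Adam"),
    ("4'37\"", "Arsenal corner, conceded by Adam Weber"),
]

def match_event(minute: int, action_keyword: str, feed) -> tuple[int, int] | None:
    """Return (minute, second) of the first live-feed entry in the given minute
    whose text contains the action keyword."""
    for stamp, text in feed:
        m, s = re.match(r"(\d+)'(\d+)\"", stamp).groups()
        if int(m) == minute and action_keyword.lower() in text.lower():
            return int(m), int(s)
    return None

print(match_event(4, "corner", live_feed))   # -> (4, 37)
```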
如实施例图24A所示,汇编数据集2424,其包括当一个或多个事件发生时的近似分钟时间以及这些事件的分类。在所示实施例中,数据集2424可以是来自时间检测模块2410的信息、来自文本数据信息2420和数据收集信息2422的汇编。然后,该信息可以被视频深度学习模型2428用于在视频中确定更精确的时间,以便生成包含兴趣事件的相应视频片段。模型2428可以是上面讨论的一个或多个模型或上面讨论的一个或多个模型的整合。As shown in the embodiment of FIG. 24A, a dataset 2424 is compiled that includes the approximate minute times at which one or more events occurred and the classification of those events. In the depicted embodiment, the dataset 2424 may be a compilation of information from the time detection module 2410, from the text data information 2420, and from the data collection information 2422. That information may then be used by the video deep learning model 2428 to determine a more precise time in the video in order to generate the corresponding video segments containing the events of interest. Model 2428 may be one or more of the models discussed above or an ensemble of one or more of the models discussed above.
例如,在一个或多个实施例中,对于描述事件的每个句子组,句 子可以包含动作发生的分钟。该系统可以从输入视频中提取一分钟(或 更多)的视频,并且在一个或多个实施例中,使用一个或多个深度学 习视频理解模型来识别事件/动作发生在哪一秒。如上所述,模型可以 是以上讨论的一个或多个模型或以上讨论的一个或多个模型的全部或 其子集的整合。For example, in one or more embodiments, for each group of sentences describing an event, the sentence may contain the minute the action occurred. The system can extract a minute (or more) of video from an input video and, in one or more embodiments, use one or more deep learning video understanding models to identify at which second an event/action occurred. As noted above, a model may be one or more of the models discussed above or an integration of all or a subset of one or more of the models discussed above.
在一个或多个实施例中,模块2428的输出是经识别的动作/事件和那些事件/动作在视频中发生的时间(以秒计)的集合2430。该信息可以与来自匹配模块2426的信息组合(并且多余的事件和时间可以被去除)。该最终时间和事件信息集合2430可用于生成包括兴趣动作/事件的视频片段2434(图24B)。片段可以是设定的时间量(例如,从事件之前x秒到事件之后y秒,片段时间跨度可以根据动作类型或从视频深度学习模型2428检测到的事件长度而变化)和/或可以具有与针对相应文本的,来自文本到语音模块的音频的长度相对应的长度,该相应文本与视频剪辑中的事件有关。In one or more embodiments, the output of module 2428 is a set 2430 of the identified actions/events and the times (in seconds) at which those events/actions occur in the video. That information may be combined with the information from the matching module 2426 (and redundant events and times may be removed). This final set 2430 of time and event information may be used to generate video segments 2434 (FIG. 24B) that include the actions/events of interest. A segment may be a set amount of time (e.g., from x seconds before the event to y seconds after the event, where the segment time span may vary according to the action type or the event length detected by the video deep learning model 2428) and/or may have a length corresponding to the length of the audio from the text-to-speech module for the corresponding text related to the event in the video clip.
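A minimal sketch of choosing a clip span that is long enough to cover the TTS narration for the event, with a short lead-in before the event and a short tail after the audio; the lead and tail values are assumptions.

```python
def clip_span_for_audio(event_s: float, audio_len_s: float,
                        lead_s: float = 2.0, tail_s: float = 1.0) -> tuple[float, float]:
    """Pick (start, end) so the clip begins a little before the event and is
    long enough to hold the TTS narration plus a short tail."""
    start = max(0.0, event_s - lead_s)
    end = start + lead_s + audio_len_s + tail_s
    return start, end

print(clip_span_for_audio(277.0, audio_len_s=7.5))   # -> (275.0, 285.5)
```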
在一个或多个实施例中,输入文本2408或描述事件/动作的一组一个或多个句子可以被输入到文本到语音(TTS)系统2432中,TTS系统2432将输入文本转换为音频。在一个或多个实施例中,文本可以是通过用TTS将句子转换为音频而生成的音频片段的汇编。In one or more embodiments, the input text 2408, or a set of one or more sentences describing an event/action, may be input into a text-to-speech (TTS) system 2432, which converts the input text into audio. In one or more embodiments, the resulting audio may be a compilation of audio segments generated by converting the sentences into audio with TTS.
本领域的技术人员应该认识到,可以使用许多TTS系统中的任何 一个。例如,若干著作解决了给定输入文本通过神经网络合成语音的 问题,包括但不限于:Those skilled in the art will recognize that any of a number of TTS systems may be used. For example, several works address the problem of synthesizing speech by neural networks given input text, including but not limited to:
—深度语音1(其公开在2018年1月29日提交的题为“用于实时神经文本到语音的系统和方法(SYSTEMS AND METHODS FOR REAL-TIME NEURAL TEXT-TO-SPEECH)”的共同受让人的美国专利申请15/882,926(案号28888-2105)、以及2017年2月24日提交的标题为“用于实时神经文本到语音的系统和方法”的美国临时专利申请62/463,482(案号28888-2105P)中,上述专利文献中的每一篇通过引用整体并入本文(为方便起见,这些公开内容可称为“深度语音1”或“DV1”);— Deep Voice 1 (the disclosure of which is found in commonly assigned U.S. Patent Application 15/882,926 (Docket No. 28888-2105), filed on 29 January 2018, entitled "SYSTEMS AND METHODS FOR REAL-TIME NEURAL TEXT-TO-SPEECH," and in U.S. Provisional Patent Application 62/463,482 (Docket No. 28888-2105P), filed on 24 February 2017, entitled "Systems and Methods for Real-Time Neural Text-to-Speech," each of the aforementioned patent documents being incorporated by reference herein in its entirety; for convenience, these disclosures may be referred to as "Deep Voice 1" or "DV1");
—深度语音2(其公开在2018年5月8日提交的题为“用于多扬声器神经文本到语音的系统和方法(SYSTEMS AND METHODS FOR MULTI-SPEAKER NEURAL TEXT-TO-SPEECH)”的共同受让人的美国专利申请15/974,397(案号8888-2144)、以及2017年5月19日提交的题为“用于多扬声器神经文本到语音的系统和方法”的美国临时专利申请62/508,579(案号28888-2144P)中,上述专利文献中的每一篇通过引用整体并入本文(为方便起见,这些公开内容可以称为“深度语音2”或“DV2”);— Deep Voice 2 (the disclosure of which is found in commonly assigned U.S. Patent Application 15/974,397 (Docket No. 8888-2144), filed on 8 May 2018, entitled "SYSTEMS AND METHODS FOR MULTI-SPEAKER NEURAL TEXT-TO-SPEECH," and in U.S. Provisional Patent Application 62/508,579 (Docket No. 28888-2144P), filed on 19 May 2017, entitled "Systems and Methods for Multi-Speaker Neural Text-to-Speech," each of the aforementioned patent documents being incorporated by reference herein in its entirety; for convenience, these disclosures may be referred to as "Deep Voice 2" or "DV2");
—深度语音3(其公开在2018年8月8日提交的题为“使用卷积序列学习的文本到语音神经系统与方法(SYSTEMS AND METHODS FOR NEURAL TEXT-TO-SPEECH USING CONVOLUTIONAL SEQUENCE LEARNING)”的共同受让人的美国专利申请16/058,265(案号28888-2175)、以及2017年10月19日提交的题为“使用卷积序列学习的文本到语音神经系统与方法”美国临时专利申请62/574,382(案号2888-2175P)中,并作为发明人列出了Sercan Arik、Wei Ping、Kainan Peng、Sharan Narang、Ajay Kannan、Andrew Gibiansky、Jonathan Raiman和John Miller(上述专利文献中的每一篇通过引用整体并入本文(为方便起见,其公开内容可称为“深度语音3”或“DV3”));— Deep Voice 3 (the disclosure of which is found in commonly assigned U.S. Patent Application 16/058,265 (Docket No. 28888-2175), filed on 8 August 2018, entitled "SYSTEMS AND METHODS FOR NEURAL TEXT-TO-SPEECH USING CONVOLUTIONAL SEQUENCE LEARNING," and in U.S. Provisional Patent Application 62/574,382 (Docket No. 2888-2175P), filed on 19 October 2017, entitled "Systems and Methods for Neural Text-to-Speech Using Convolutional Sequence Learning," and listing Sercan Arik, Wei Ping, Kainan Peng, Sharan Narang, Ajay Kannan, Andrew Gibiansky, Jonathan Raiman, and John Miller as inventors; each of the aforementioned patent documents is incorporated by reference herein in its entirety (for convenience, its disclosure may be referred to as "Deep Voice 3" or "DV3"));
—公开在2020年12月22日授权的共有的美国专利10,872,596(案 号28888-2269)中的实施例(该专利文件以全文引用的方式并入本文); 以及— Examples disclosed in commonly-owned U.S. Patent 10,872,596 (Docket No. 28888-2269), issued December 22, 2020 (which patent document is incorporated herein by reference in its entirety); and
—公开在2021年5月25日授权的共有的美国专利11,017,761(案 号28888-2326)中的实施例(该专利文件以全文引用的方式并入本文 中)。- Examples disclosed in co-owned U.S. Patent 11,017,761 (Docket No. 28888-2326) issued May 25, 2021 (which patent document is incorporated herein by reference in its entirety).
回到图24B,给定视频片段2434和相应的音频2436,可以将视频和音频片段组合成比赛高光时刻视频2440。图27描绘根据本公开实施例的用于产生组合视频的方法。Returning to FIG. 24B, given the video segments 2434 and the corresponding audio 2436, the video and audio segments may be combined into a game highlight video 2440. FIG. 27 depicts a method for producing a combined video according to embodiments of the present disclosure.
如上所述,对于每个事件,可以使用TTS和关于事件的一组一个 或多个句子来生成音频(2705),其中TTS将一组一个或多个句子转 换为具有特定长度的音频。已经确定了事件在视频中发生的确切或近 似时间,可以从完整视频中提取包括事件的视频片段(2710),并且可 以选择其长度,使得其足够长以用于相应生成的音频。例如,如果针 对关于一个事件(例如,角球)的句子或句子组TTS生成的音频需要 一定量的时间,则可以编辑相应的视频以匹配关于视频片段中的事件 的音频的长度(例如,视频可以在音频之前具有几秒钟并且在音频之 后具有几秒钟)。最后,视频片段和相应的音频可以被组合成关于事件 的多媒体视频。在一个或多个实施例中,可以使用工具(诸如FFmpeg 工具)来组合视频片段和音频。As described above, for each event, audio may be generated (2705) using a TTS and a set of one or more sentences about the event, wherein the TTS converts the set of one or more sentences into audio of a certain length. Having determined the exact or approximate time at which the event occurred in the video, the video segment comprising the event may be extracted from the full video (2710), and its length may be chosen such that it is long enough for the correspondingly generated audio. For example, if the audio generated for a sentence or group of sentences TTS about an event (e.g., a corner kick) takes a certain amount of time, the corresponding video can be edited to match the length of the audio about the event in the video clip (e.g., the video can have seconds before the audio and have seconds after the audio). Finally, the video clips and corresponding audio can be combined into a multimedia video about the event. In one or more embodiments, a tool such as the FFmpeg tool can be used to combine video clips and audio.
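A sketch of the muxing step using the FFmpeg command-line tool: cut the clip from the full video and pair it with the TTS narration. The paths, codecs, and exact trimming strategy are illustrative assumptions, and multiple clips produced this way can afterwards be concatenated (e.g., with FFmpeg's concat demuxer) into the final highlight video.

```python
import subprocess

def mux_clip(video_in: str, start_s: float, dur_s: float, audio_in: str, out_path: str) -> None:
    """Cut a segment from the full match video and replace its audio track with
    the TTS narration using the ffmpeg CLI."""
    cmd = [
        "ffmpeg", "-y",
        "-ss", str(start_s), "-t", str(dur_s), "-i", video_in,   # trimmed video segment
        "-i", audio_in,                                          # TTS narration
        "-map", "0:v:0", "-map", "1:a:0",                        # video from input 0, audio from input 1
        "-c:v", "libx264", "-c:a", "aac", "-shortest",
        out_path,
    ]
    subprocess.run(cmd, check=True)

# Example (hypothetical paths): mux_clip("match.mp4", 275.0, 10.5, "corner_tts.wav", "clip_01.mp4")
```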
应当注意,输入文本2408可以提到多个事件/动作,并且可以将具有相应音频的多个视频片段进行关联和组合。例如,在一个或多个实施例中,视频剪辑被级联成单个最终视频,并且音频与相应的事件同步。在一个或多个实施例中,可以使用工具(诸如FFmpeg工具)来组合视频片段和音频。It should be noted that the input text 2408 may mention multiple events/actions, and that multiple video segments with their corresponding audio may be correlated and combined. For example, in one or more embodiments, the video clips are concatenated into a single final video, and the audio is synchronized with the corresponding events. In one or more embodiments, a tool (such as the FFmpeg tool) may be used to combine the video segments and the audio.
F.计算系统实施例F. Computing System Embodiments
在一个或多个实施例中,本专利文件的各方面可以针对一个或多 个信息处理系统(或计算系统),可以包括一个或多个信息处理系统(或 计算系统),或者可以在一个或多个信息处理系统(或计算系统)上实 现。信息处理系统/计算系统可以包括任何可操作的工具或工具的集合 来计算、估计、确定、分类、处理、发送、接收、检索、始发、路由、 交换、存储、显示、通信、显现、检测、记录、再现、处理或利用任 何形式的信息、智能或数据。例如,计算系统可以是或可以包括个人 计算机(例如,膝上型计算机)、平板计算机、移动设备(例如,个人 数字助理(PDA)、智能电话、平板手机、平板等)、智能卡、服务器 (例如,刀片服务器或机架服务器)、网络存储设备、照相机或任何其 它合适的设备,并且可以在大小、形状、性能、功能和价格上变化。 计算系统可以包括随机存取存储器(RAM)、诸如中央处理单元(CPU) 或硬件或软件控制逻辑的一个或多个处理资源、只读存储器(ROM) 和/或其它类型的存储器。计算系统的附加部件可包括一个或多个驱动 器(例如,硬盘驱动器,固态驱动器或两者),用于与外部设备以及各 种输入和输出(I/O)设备(例如键盘、鼠标、手写笔、触摸屏和/或 视频显示器)通信的一个或多个网络端口。计算系统还可以包括一个 或多个总线,用于在各种硬件部件之间传输通信。In one or more embodiments, aspects of this patent document may be directed to one or more information processing systems (or computing systems), may include one or more information processing systems (or computing systems), or may be implemented in one or more Implemented on multiple information processing systems (or computing systems). An information handling system/computing system may include any implement or collection of implements operable to compute, estimate, determine, classify, process, send, receive, retrieve, originate, route, exchange, store, display, communicate, visualize, detect , record, reproduce, process or exploit information, intelligence or data of any kind. For example, a computing system can be or include a personal computer (e.g., a laptop), a tablet computer, a mobile device (e.g., a personal digital assistant (PDA), smartphone, phablet, tablet, etc.), a smart card, a server (e.g., , blade server or rack server), network storage device, camera, or any other suitable device, and may vary in size, shape, performance, functionality, and price. A computing system may include random access memory (RAM), one or more processing resources such as a central processing unit (CPU) or hardware or software control logic, read only memory (ROM), and/or other types of memory. Additional components of the computing system may include one or more drives (e.g., hard disk drive, solid-state drive, or both) for communicating with external devices and various input and output (I/O) devices (e.g., keyboard, mouse, stylus, , touch screen, and/or video display) communication. A computing system may also include one or more buses for carrying communications between various hardware components.
图28描绘了根据本公开实施例的信息处理系统(或计算系统)的简化框图。应当理解,系统2800所示的功能可以用于支持计算系统的各种实施例,尽管应当理解,计算系统可以被不同地配置并且包括不同的部件,包括具有如图28所示的更少或更多的部件。FIG. 28 depicts a simplified block diagram of an information handling system (or computing system) according to embodiments of the present disclosure. It should be understood that the functionality shown for system 2800 may operate to support various embodiments of a computing system, although it should be understood that a computing system may be differently configured and include different components, including having fewer or more components than depicted in FIG. 28.
如图28所示,计算系统2800包括一个或多个中央处理单元(CPU) 2801,其提供计算资源并控制计算机。CPU 2801可以用微处理器等来 实现,并且计算系统2800还可以包括用于数学计算的一个或多个图形 处理单元(GPU)2802和/或浮点协处理器。在一个或多个实施例中, 可在显示控制器2809内并入一个或一个以上GPU 2802,例如图形卡 的一部分。系统2800还可包括系统存储器2819,其可包括RAM、ROM 或两者。As shown in Figure 28,
如图28所示,也可以提供多个控制器和外围设备。输入控制器 2803表示到诸如键盘、鼠标、触摸屏和/或手写笔的各种输入设备2804 的接口。计算系统2800还可以包括用于与一个或多个存储设备2808 接合的存储控制器2807,所述一个或多个存储设备2808中的每一个 包括诸如磁带或磁盘的存储介质或光学介质,其可以记录用于操作系 统、实用程序和应用程序的指令程序,所述指令程序可以包括实现本 公开的各个方面的程序的实施例。存储设备2808也可用于存储根据本 公开的经处理的数据或待处理的数据。系统2800还可以包括用于提供 到显示设备2811的接口的显示控制器2809,,显示设备2811可以是 阴极射线管(CRT)显示器、薄膜晶体管(TFT)显示器、有机发光 二极管、电致发光面板、等离子体面板或任何其它类型的显示器。计 算系统2800还可以包括用于一个或多个外围设备2806的一个或多个 外围设备控制器或接口2805。外围设备的示例可以包括一个或多个打印机、扫描仪、输入设备、输出设备、传感器等。通信控制器2814 可以与一个或多个通信设备2815连接,这使得系统2800能够通过多 种网络中的任何一种或通过包括红外信号的任何合适的电磁载波信号 连接到远程设备,所述网络包括因特网、云资源(例如,以太网云、以太网上的光纤信道(FCoE)/数据中心桥接(DCB)云等)、局域网 (LAN)、广域网(WAN)、存储区域网(SAN)。如所描绘的实施例 所示,计算系统2800包括一个或多个风扇或风扇盘2818和一个或多 个冷却子系统控制器2817,冷却子系统控制器2817监视系统2800(或 其部件)的热温度并操作风扇/风扇盘2818以帮助调节温度。As shown in Figure 28, multiple controllers and peripherals may also be provided. The
在所示的系统中,所有主要的系统部件可以连接到总线2816,总 线2816可以表示多于一个的物理总线。然而,各种系统部件可彼此物 理接近或不物理接近。例如,输入数据和/或输出数据可以从一个物理 位置远程传输到另一个物理位置。此外,可以通过网络从远程位置(例 如,服务器)访问实现本公开的各个方面的程序。这种数据和/或程序 可以通过各种机器可读介质中的任何一种来传送,所述机器可读介质 例如包括:磁介质(诸如硬盘、软盘和磁带);光学介质(诸如光盘(CD) 和全息设备);磁光介质;以及被专门配置为存储或执行程序代码的硬 件设备(诸如专用集成电路(ASIC)、可编程逻辑设备(PLD)、闪存设备、其它非易失性存储器(NVM)设备(诸如基于3D XPoint的设 备)、以及ROM和RAM设备)。In the system shown, all major system components can be connected to
本公开的各方面可以被编码在一个或多个非暂时性计算机可读介 质上,所述非暂时性计算机可读介质具有用于一个或多个处理器或处 理单元的指令,以执行步骤。应当注意,一个或多个非暂时性计算机 可读介质应当包括易失性和/或非易失性存储器。应注意,替代实施例 是可能的,包括硬件实施例或软件/硬件实施例。硬件实现的功能可以 使用ASIC、可编程阵列、数字信号处理电路等来实现。因此,任何权 利要求中的“装置”术语旨在覆盖软件和硬件实现。类似地,这里使 用的术语“计算机可读介质”包括其上包含有指令程序的软件和/或硬 件、或其组合。考虑到这些实现替换,应当理解,附图和附带的描述 提供了本领域技术人员在编写程序代码(即,软件)和/或制造电路(即, 硬件)以执行所需处理时所需的功能信息。Aspects of the present disclosure may be encoded on one or more non-transitory computer-readable media having instructions for one or more processors or processing units to perform the steps. It should be noted that the one or more non-transitory computer readable media should include volatile and/or nonvolatile memory. It should be noted that alternative embodiments are possible, including hardware embodiments or software/hardware embodiments. Hardware-implemented functions may be implemented using ASICs, programmable arrays, digital signal processing circuits, and the like. Accordingly, a "means" term in any claim is intended to cover both software and hardware implementations. Similarly, the term "computer-readable medium" as used herein includes software and/or hardware, or a combination thereof, on which a program of instructions is embodied. With these implementation alternatives in mind, it should be understood that the drawings and accompanying descriptions provide the functionality required by one skilled in the art in writing program code (i.e., software) and/or fabricating circuits (i.e., hardware) to perform the desired processes information.
应注意,本公开的实施例还可涉及具有非暂时性、有形的计算机 可读介质的计算机产品,所述计算机可读介质在其上具有用于执行各 种计算机实施操作的计算机代码。介质和计算机代码可以是为了本公 开的目的而专门设计和构造的,或者它们可以是相关领域的技术人员 已知或可用的类型。有形计算机可读介质的示例包括,例如:磁介质 (诸如硬盘、软盘和磁带);光学介质(例如光盘(CD)和全息设备); 磁光介质;以及被专门配置为存储或执行程序代码的硬件设备(诸如 专用集成电路(ASIC)、可编程逻辑设备(PLD)、闪存设备、其它非 易失性存储器(NVM)设备(诸如基于3D XPoint的设备)、以及ROM 和RAM设备)。计算机代码的示例包括诸如由编译器产生的机器代码, 以及包含由使用解释器的计算机执行的高级代码的文件。本公开的实 施例可以全部或部分地作为机器可执行指令来实现,所述机器可执行 指令可以位于由处理设备执行的程序模块中。程序模块的示例包括库、 程序、例程、对象、部件和数据结构。在分布式计算环境中,程序模 块可以物理地位于本地、远程或两者的设置中。It should be noted that embodiments of the present disclosure may also relate to computer products having a non-transitory, tangible computer-readable medium having computer code thereon for performing various computer-implemented operations. The media and computer code may be specially designed and constructed for the purposes of the present disclosure, or they may be of a type known or available to those skilled in the relevant art. Examples of tangible computer-readable media include, for example: magnetic media (such as hard disks, floppy disks, and magnetic tape); optical media (such as compact discs (CDs) and holographic devices); magneto-optical media; and Hardware devices such as application specific integrated circuits (ASICs), programmable logic devices (PLDs), flash memory devices, other non-volatile memory (NVM) devices such as 3D XPoint based devices, and ROM and RAM devices). Examples of computer code include machine code such as produced by a compiler, and files containing high-level code executed by a computer using an interpreter. Embodiments of the present disclosure may be implemented in whole or in part as machine-executable instructions, which may reside in program modules executed by a processing device. Examples of program modules include libraries, programs, routines, objects, components, and data structures. In a distributed computing environment, program modules may be physically located in local, remote, or both settings.
本领域的技术人员将认识到,对于本公开的实践,计算系统或编 程语言不是关键的。本领域的技术人员还将认识到,上述的多个元件 可以物理地和/或功能地分离成模块和/或子模块或组合在一起。Those skilled in the art will recognize that the computing system or programming language is not critical to the practice of the present disclosure. Those skilled in the art will also appreciate that the various elements described above may be physically and/or functionally separated into modules and/or sub-modules or combined together.
本领域技术人员应当理解,前述示例和实施例是示例性的,而不 旨在限制本公开的范围。在阅读本说明书和研究附图后,本领域技术 人员显而易见的所有置换、增强、等同、组合和其改进都包括在本公 开的本质和范围内。还应当注意,可以不同地布置任何权利要求的元素, 包括具有多种从属关系、配置和组合。Those skilled in the art should appreciate that the foregoing examples and embodiments are exemplary and not intended to limit the scope of the present disclosure. All permutations, enhancements, equivalents, combinations and improvements thereof apparent to those skilled in the art after reading this specification and studying the accompanying drawings are included within the spirit and scope of this disclosure. It should also be noted that elements of any claims may be arranged differently, including in various affiliations, configurations and combinations.
Priority and related applications

| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| US17/533,769 (granted as US12223720B2) | Generating highlight video from video and text inputs | 2020-12-13 | 2021-11-23 |
| CN202210979659.0A (granted as CN116170651B, Active) | Method, system and storage medium for generating highlight moment video from video and text input | 2021-11-23 | 2022-08-16 |

Publications of this application

| Publication Number | Publication Date |
|---|---|
| CN116170651A | 2023-05-26 |
| CN116170651B | 2025-07-08 |

| Country | Link |
|---|---|
| CN (1) | CN116170651B |
Cited By (2)

| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| WO2025060682A1 (en)* | 2023-09-22 | 2025-03-27 | 北京字跳网络技术有限公司 | Video generation method and apparatus, and computer device and storage medium |
| CN120281995A (en)* | 2025-06-10 | 2025-07-08 | 浙江华智万像科技有限公司 | Video outline generation method and device, computer equipment and medium |
Patent Citations (13)

| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US20020051077A1 (en)* | 2000-07-19 | 2002-05-02 | Shih-Ping Liou | Videoabstracts: a system for generating video summaries |
| US20070171303A1 (en)* | 2003-05-26 | 2007-07-26 | Mauro Barbieri | System and method for generating audio-visual summaries for audio-visual program content |
| US9646651B1 (en)* | 2014-07-11 | 2017-05-09 | Lytx, Inc. | Marking stored video |
| CN107615766A (en)* | 2015-04-16 | 2018-01-19 | 维斯克体育科技有限公司 | System and method for creating and distributing multimedia content |
| CN109684511A (en)* | 2018-12-10 | 2019-04-26 | 上海七牛信息技术有限公司 | A kind of video clipping method, video aggregation method, apparatus and system |
| US20200134316A1 (en)* | 2018-10-31 | 2020-04-30 | Sony Interactive Entertainment Inc. | Scene annotation using machine learning |
| CN111953910A (en)* | 2020-08-11 | 2020-11-17 | 腾讯科技(深圳)有限公司 | Video processing method and device based on artificial intelligence and electronic equipment |
| KR102190658B1 (en)* | 2020-07-27 | 2020-12-14 | (주)하이퍼월 | Electronic device and method for searching video based on text |
| CN112597824A (en)* | 2020-12-07 | 2021-04-02 | 深延科技(北京)有限公司 | Behavior recognition method and device, electronic equipment and storage medium |
| CN113268515A (en)* | 2021-05-31 | 2021-08-17 | 北京理工大学 | Automatic explanation device and method for football match |
| CN113365147A (en)* | 2021-08-11 | 2021-09-07 | 腾讯科技(深圳)有限公司 | Video editing method, device, equipment and storage medium based on music card point |
| US20210319232A1 (en)* | 2020-04-13 | 2021-10-14 | Adobe Inc | Temporally distributed neural networks for video semantic segmentation |
| CN113596579A (en)* | 2021-07-29 | 2021-11-02 | 北京字节跳动网络技术有限公司 | Video generation method, device, medium and electronic equipment |
Non-Patent Citations (1)

| Title |
|---|
| 杨宁;杨敏;: "基于改进的混合高斯模型的运动目标提取", 计算机技术与发展, no. 07, 10 July 2012 (2012-07-10)* |
| Publication number | Publication date |
|---|---|
| CN116170651B (en) | 2025-07-08 |
Legal Events

| Date | Code | Title | Description |
|---|---|---|---|
| | PB01 | Publication | |
| | SE01 | Entry into force of request for substantive examination | |
| | GR01 | Patent grant | |