CN115035509A - Video detection method and device, electronic equipment and storage medium - Google Patents

Video detection method and device, electronic equipment and storage medium

Info

Publication number
CN115035509A
Authority
CN
China
Prior art keywords
video
segment
audio
information
clip
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210753941.7A
Other languages
Chinese (zh)
Inventor
毕泊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing IQIYI Science and Technology Co Ltd
Original Assignee
Beijing IQIYI Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing IQIYI Science and Technology Co Ltd
Priority to CN202210753941.7A
Publication of CN115035509A
Legal status: Pending


Abstract

Embodiments of the present invention provide a video detection method, a video detection device, an electronic device and a storage medium. The video detection method includes: acquiring a video file and determining a detection target for the video file; acquiring a plurality of consecutive video segments from the video file and respectively determining a plurality of pieces of audio-video feature information of the video segments; inputting the pieces of audio-video feature information respectively into a pre-trained classification model to obtain a plurality of corresponding output results; determining a candidate video segment from the plurality of video segments according to the output results; performing character recognition on the image frames of the candidate video segment to obtain a character recognition result; and determining, according to the character recognition result, the target image frame where the detection target is located. In the embodiments of the present invention, the candidate video segment containing the detection target is determined by combining picture information and audio information, and character recognition is performed within that segment, so that the image frame where the detection target is located is accurately positioned.

Description

Translated from Chinese

A video detection method and device, electronic device and storage medium

Technical Field

The present invention relates to the technical field of video processing, and in particular to a video detection method, a video detection device, an electronic device and a computer-readable storage medium.

Background Art

To improve the viewing experience, video streaming services mark specific time points in a video. Marking a specific time point helps users quickly locate the video content corresponding to that point, or allows corresponding service functions to be provided at that point in a targeted manner. For example, the time point at which the opening credits end can be marked: on the one hand, a skip function can be offered to help users jump straight to the main content; on the other hand, a recap of previous episodes can be inserted at that point to enrich the user experience. The time point at which the end credits begin can also be marked, so that similar videos can be recommended once playback finishes, increasing the time users spend in the player. A streaming service can further mark the time point at which a post-credits scene ("easter egg") begins in a film, and provide a jump function that takes users directly to it.

The time points marked in a video are produced by the content provider during editing, and their positions differ from video to video. Traditional marking is either done manually, which is inefficient, or by matching against a uniform image template, which cannot handle points that drift between videos or points whose placement is relatively flexible. Such marking lacks flexibility.

SUMMARY OF THE INVENTION

In view of the above problems, embodiments of the present invention are proposed in order to provide a video detection method, and a corresponding video detection device, electronic device and computer-readable storage medium, that overcome the above problems or at least partially solve them.

An embodiment of the present invention discloses a video detection method, the method including:

acquiring a video file and determining a detection target for the video file, the detection target including at least one of opening-credits end marker information, end-credits start marker information and end-credits end marker information;

acquiring a plurality of consecutive video segments from the video file, and respectively determining a plurality of pieces of audio-video feature information of the plurality of video segments;

inputting the pieces of audio-video feature information respectively into a pre-trained classification model to obtain a plurality of corresponding output results;

determining a candidate video segment from the plurality of video segments according to the plurality of output results;

performing character recognition on the image frames of the candidate video segment to obtain a character recognition result;

and determining, according to the character recognition result, the target image frame where the detection target is located.

An embodiment of the present invention further discloses a video detection device, the device including:

a first acquisition and determination module, configured to acquire a video file and determine a detection target for the video file, the detection target including at least one of opening-credits end marker information, end-credits start marker information and end-credits end marker information;

a second acquisition and determination module, configured to acquire a plurality of consecutive video segments from the video file and respectively determine a plurality of pieces of audio-video feature information of the plurality of video segments;

an input and output module, configured to input the pieces of audio-video feature information respectively into a pre-trained classification model to obtain a plurality of corresponding output results;

a first determination module, configured to determine a candidate video segment from the plurality of video segments according to the plurality of output results;

a character recognition module, configured to perform character recognition on the image frames of the candidate video segment to obtain a character recognition result;

and a second determination module, configured to determine, according to the character recognition result, the target image frame where the detection target is located.

An embodiment of the present invention further discloses an electronic device, including a processor, a memory, and a computer program stored on the memory and executable on the processor, where the computer program, when executed by the processor, implements the steps of the video detection method described above.

An embodiment of the present invention further discloses a computer-readable storage medium storing a computer program, where the computer program, when executed by a processor, implements the steps of the video detection method described above.

Embodiments of the present invention have the following advantages:

In the embodiments of the present invention, the audio-video feature information of a plurality of consecutive video segments in a video file can be used to determine which of those segments is a candidate segment that may contain the detection target, and character recognition can be performed on the image frames of the candidate segment to determine the target image frame where the detection target is located. With this method, the picture information and the audio information of the video file are combined, a deep learning model determines the candidate video segment containing the detection target, and character recognition within that segment localizes the exact image frame. This improves the recognition accuracy of the opening-credits end point, the end-credits start point and the post-credits-scene start point. The method requires neither image-template matching nor manual operation; point recognition is flexible and efficient.

Description of the Drawings

FIG. 1 is a flowchart of the steps of a video detection method according to an embodiment of the present invention;

FIG. 2 is a flowchart of the steps of another video detection method according to an embodiment of the present invention;

FIGS. 2A-2G are flowcharts of sub-steps of another video detection method according to an embodiment of the present invention;

FIG. 3 is a schematic diagram of a process for producing audio-video feature information according to an embodiment of the present invention;

FIG. 4 is a flowchart of a video detection method according to an embodiment of the present invention;

FIG. 5 is a flowchart of another video detection method according to an embodiment of the present invention;

FIG. 6 is a flowchart of yet another video detection method according to an embodiment of the present invention;

FIG. 7 is a structural block diagram of a video detection device according to an embodiment of the present invention.

Detailed Description

To make the above objects, features and advantages of the present invention clearer and easier to understand, the present invention is described in further detail below with reference to the accompanying drawings and specific embodiments. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. Based on the embodiments of the present invention, all other embodiments obtained by those of ordinary skill in the art fall within the protection scope of the present invention.

To improve the viewing experience, video streaming services mark the opening-credits end point, the end-credits start point, the post-credits-scene start point, and so on. Because the marked time points are produced by the content provider during editing, their positions differ from video to video.

Traditional time-point marking is either manual, which does not scale to massive video libraries because positions cannot be located quickly and marking is inefficient, or based on matching a uniform image template, which cannot handle points that drift between videos or points whose placement is relatively flexible.

One of the core ideas of the embodiments of the present invention is that the audio-video feature information of a plurality of consecutive video segments in a video file can be used to determine a candidate segment that may contain the detection target, and character recognition can then be performed on the image frames of the candidate segment to determine the target image frame where the detection target is located. With this method, the picture information and the audio information of the video file are combined, a deep learning model determines the candidate video segment containing the detection target, and character recognition within that segment localizes the exact image frame. This improves the recognition accuracy of the opening-credits end point, the end-credits start point and the post-credits-scene start point. The method requires neither image-template matching nor manual operation; point recognition is flexible and efficient.

Referring to FIG. 1, a flowchart of the steps of a video detection method according to an embodiment of the present invention is shown; the method may specifically include the following steps:

Step 101: acquire a video file and determine a detection target for the video file.

The detection target includes at least one of opening-credits end marker information, end-credits start marker information and end-credits end marker information.

In this embodiment of the present invention, the video file may be a multimedia video file such as a TV series episode, a variety show or a film, and may contain one or more of opening-credits content, main content, end-credits content and post-credits content.

The playback order of the opening credits and the main content can be set flexibly. In one example, the opening credits are played first, followed by the main content; in another example, part of the main content is played first, then the opening credits, and then the rest of the main content. The embodiments of the present invention place no specific restriction on this order.

The opening-credits end marker information is the marker indicating that the opening-credits content ends; the end-credits start marker information is the marker indicating that the end-credits content begins; the end-credits end marker information is the marker indicating that the end-credits content ends. For an episode of a TV series, the opening-credits end marker may be episode-number information or release-number information. For a film, the end-credits start marker or end-credits end marker may be text-box information.

By detecting the opening-credits end marker in a video file, the position of the opening-credits end point can be determined; by detecting the end-credits start marker, the position of the end-credits start point can be determined; and by detecting the end-credits end marker, the position of the post-credits-scene start point can be determined. Users can then jump directly to the corresponding time point, or corresponding service functions can be provided there in a targeted manner.

Step 102: acquire a plurality of consecutive video segments from the video file, and respectively determine a plurality of pieces of audio-video feature information of the plurality of video segments.

In this embodiment, the video file can be divided into a plurality of consecutive video segments and the audio-video feature information of each segment determined, so that main content (including post-credits content), opening-credits content and end-credits content can be distinguished by combining video features with audio features.

Step 103: input the pieces of audio-video feature information respectively into a pre-trained classification model to obtain a plurality of corresponding output results.

In this embodiment, a classification model is built in advance. Given input audio-video feature information, the classification model determines whether the corresponding video segment is an opening-credits segment, a main-content segment or an end-credits segment. For a film, a main-content segment played after the end-credits segment is usually treated as a post-credits segment.

Step 104: determine a candidate video segment from the plurality of video segments according to the plurality of output results.

Inputting the pieces of audio-video feature information into the pre-trained classification model yields a plurality of corresponding output results; these results can be analysed to determine which of the video segments are candidate segments that may contain the detection target.

Step 105: perform character recognition on the image frames of the candidate video segment to obtain a character recognition result.

In this embodiment, after the candidate video segment that may contain the detection target is determined, character recognition can be performed on its image frames to obtain a corresponding character recognition result.

Step 106: determine, according to the character recognition result, the target image frame where the detection target is located.

According to the character recognition result, the target image frame where the detection target is located is determined, so that the opening-credits end point, the end-credits start point or the post-credits-scene start point in the video file can be located, and the localization reaches frame-level precision.
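As a rough illustration only (not the patented implementation), steps 101-106 can be sketched as the following pipeline. The segment classifier, the OCR engine and the marker pattern below are hypothetical placeholders for the trained classification model and character-recognition component, which the text does not specify in code form.

```python
# Hypothetical sketch of steps 101-106: classify consecutive segments,
# pick candidate segments, then OCR their frames for the target marker.
# `classify_segment` and `ocr_frame` stand in for the trained model and
# the character-recognition engine; they are placeholders.
import re
from typing import Callable, Sequence

def locate_target_frame(
    segments: Sequence[dict],                 # each: {"frames": [...]}
    classify_segment: Callable[[dict], str],  # -> "credits" or "main"
    ocr_frame: Callable[[object], str],       # -> recognized text of one frame
    marker_pattern: str,                      # e.g. r"Episode\s+\d+"
) -> int:
    """Return the global index of the first frame whose OCR text matches
    the marker pattern, searching only candidate (credits) segments."""
    frame_index = 0
    for seg in segments:
        if classify_segment(seg) == "credits":            # step 104
            for offset, frame in enumerate(seg["frames"]):  # step 105
                if re.search(marker_pattern, ocr_frame(frame)):
                    return frame_index + offset           # step 106
        frame_index += len(seg["frames"])
    return -1  # marker not found

# Toy usage with fake segments: "frames" are just strings here.
segs = [
    {"frames": ["scenery", "scenery"]},
    {"frames": ["logo", "Episode 12", "cast list"]},
]
idx = locate_target_frame(
    segs,
    classify_segment=lambda s: "credits" if "logo" in s["frames"] else "main",
    ocr_frame=lambda f: f,
    marker_pattern=r"Episode\s+\d+",
)
print(idx)  # → 3
```

The point of the sketch is the ordering: classification narrows the search to one segment, and only then is frame-level OCR applied.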

In summary, in the embodiments of the present invention, the audio-video feature information of a plurality of consecutive video segments in a video file can be used to determine a candidate segment that may contain the detection target, and character recognition can be performed on the image frames of that candidate segment to determine the target image frame where the detection target is located. By combining the picture information and the audio information of the video file, a deep learning model determines the candidate video segment containing the detection target, and character recognition within that segment localizes the exact image frame. This improves the recognition accuracy of the opening-credits end point, the end-credits start point and the post-credits-scene start point; the method requires neither image-template matching nor manual operation, and point recognition is flexible and efficient.

Referring to FIG. 2, a flowchart of the steps of another video detection method according to an embodiment of the present invention is shown; the method may specifically include the following steps:

Step 201: acquire a video file and determine a detection target for the video file.

The detection target includes at least one of opening-credits end marker information, end-credits start marker information and end-credits end marker information.

In this embodiment, the video file to be analysed can be acquired and a detection target determined for it; the detection target may be one or more of opening-credits end marker information, end-credits start marker information and end-credits end marker information.

In one example, the detection target is determined according to the type of the video file. For a TV series episode, the opening-credits end marker and the end-credits start marker may be used as detection targets; for a film, the opening-credits end marker, the end-credits start marker and the end-credits end marker may all be used as detection targets.

Step 202: acquire a plurality of consecutive video segments from the video file, and respectively determine a plurality of pieces of audio-video feature information of the plurality of video segments.

In this embodiment, detecting the target requires acquiring a plurality of consecutive video segments from the video file, where "consecutive" means the segments are adjacent in playback time: as soon as one segment finishes playing, the next begins. After the consecutive segments are acquired, the audio-video feature information of each segment can be determined.

In an optional embodiment of the present invention, the step of acquiring a plurality of consecutive video segments from the video file in step 202 may specifically include the following sub-steps:

Sub-step S11: determine an interception time period according to the detection target.

Sub-step S12: intercept the video file according to the interception time period to obtain an intercepted video clip.

Sub-step S13: divide the intercepted video clip evenly to obtain the plurality of consecutive video segments.

Different interception time periods can be chosen for different detection targets; the clip for the chosen period is cut from the complete video file and then divided into a plurality of consecutive video segments.

For example, the clip in the time range t0-t1 can be intercepted and divided evenly into T video segments, each containing 64 image frames and 12.8 s of audio. Analysing only an intercepted portion of the video file improves efficiency.

In one example, if the target is the end-credits start marker or the end-credits end marker of a film, the last 30 minutes of the film can be intercepted for analysis.

After the consecutive video segments are obtained, the audio-video feature information of each segment can be determined; combining video features with audio features distinguishes main content (including post-credits content), opening-credits content and end-credits content.
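Sub-steps S11-S13 amount to simple time arithmetic. A minimal sketch follows, using the 12.8 s segment length stated above; the concrete window choices (first 10 minutes for the opening credits, last 30 minutes otherwise) are illustrative assumptions, with the 30-minute tail taken from the film example.

```python
# Sketch of sub-steps S11-S13: choose an interception window for the
# detection target, cut it from the file, and split it into T equal
# segments. Times are in seconds; window choices are illustrative.
SEGMENT_SECONDS = 12.8  # each segment: 64 frames / 12.8 s of audio

def interception_window(target: str, duration: float) -> tuple[float, float]:
    """S11: pick (t0, t1) according to the detection target."""
    if target == "opening_credits_end":
        return (0.0, min(duration, 600.0))          # assumed: first 10 min
    return (max(0.0, duration - 1800.0), duration)  # last 30 min for credits

def split_segments(t0: float, t1: float) -> list[tuple[float, float]]:
    """S12+S13: divide the intercepted clip [t0, t1] into equal segments."""
    n = int((t1 - t0) // SEGMENT_SECONDS)
    return [(t0 + i * SEGMENT_SECONDS, t0 + (i + 1) * SEGMENT_SECONDS)
            for i in range(n)]

# A two-hour film, detecting the end-credits start marker:
t0, t1 = interception_window("end_credits_start", duration=7200.0)
segments = split_segments(t0, t1)
print(t0, t1, len(segments))  # → 5400.0 7200.0 140
```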

In an optional embodiment of the present invention, the step of respectively determining the plurality of pieces of audio-video feature information of the plurality of video segments in step 202 may specifically include the following sub-step:

Sub-step S21: for each video segment, extract the corresponding audio feature information with a pre-trained VGG (Visual Geometry Group) model, extract the corresponding video feature information with a pre-trained two-stream Inflated 3D ConvNet (I3D) model, and merge the audio feature information and the video feature information to obtain the audio-video feature information corresponding to that segment.

The audio information of a video segment is fed into the audio feature extraction model for feature extraction; the image information of the segment is fed into the video feature extraction model for feature extraction.

The audio feature extraction model can be a VGG model trained on a public dataset. The VGG architecture repeatedly stacks small 3*3 convolution kernels and increases the depth of the network. For example, the 12.8 s of audio of a video segment can be input into the VGG model, which outputs an 8*128-dimensional audio feature.

The video feature extraction model can be a two-stream I3D model trained on a public dataset. I3D is an enhanced version of a 2D convolutional network in which the convolution and pooling kernels of the classification network are inflated to 3D. For example, the 64 image frames of a video segment can be input into the I3D model, which outputs a 6*1024-dimensional video feature.

For a given video segment, once its audio feature information and video feature information have been obtained, the two can be merged to obtain the audio-video feature information of that segment.

Referring to FIG. 3, a schematic diagram of a process for producing audio-video feature information according to an embodiment of the present invention: consecutive video frames are extracted from the video file to be detected, yielding T video-frame image sequences (image sequence 1, image sequence 2, ..., image sequence T, equivalent to T video segments); the audio corresponding to the video frames is extracted from the video file and divided into T matching audio clips (audio clip 1, audio clip 2, ..., audio clip T); video feature extraction is performed on the T image sequences to obtain video features 1 through T, and audio feature extraction is performed on the T audio clips to obtain audio features 1 through T; finally, merging each corresponding audio-feature/video-feature pair yields audio-video features 1 through T.
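The per-segment pipeline of FIG. 3 pairs one audio feature with one video feature per segment. A shape-level sketch follows; the two extractor functions are random stand-ins for the pre-trained VGG and I3D models (which are not reproduced here), and the merge shown is plain flatten-and-concatenate, before any attention is applied.

```python
# Shape-level sketch of FIG. 3: per segment, an audio extractor yields
# an 8x128 feature and a video extractor a 6x1024 feature; each pair is
# merged (here: flattened and concatenated) into one audio-video feature.
# The extractors are random placeholders, not the real VGG/I3D models.
import numpy as np

rng = np.random.default_rng(0)

def fake_vgg_audio(audio_clip) -> np.ndarray:
    return rng.standard_normal((8, 128))   # stands in for VGG output

def fake_i3d_video(frame_seq) -> np.ndarray:
    return rng.standard_normal((6, 1024))  # stands in for I3D output

T = 5  # number of consecutive segments
av_features = []
for t in range(T):
    a = fake_vgg_audio(None)   # audio clip t (12.8 s)
    v = fake_i3d_video(None)   # image sequence t (64 frames)
    av_features.append(np.concatenate([a.ravel(), v.ravel()]))

av = np.stack(av_features)     # one merged feature per segment
print(av.shape)  # → (5, 7168)
```

The 7168 dimensions are 8*128 = 1024 audio values plus 6*1024 = 6144 video values per segment.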

In an optional embodiment of the present invention, the step of merging the audio feature information and the video feature information in sub-step S21 to obtain the audio-video feature information of the segment may specifically include the following sub-steps:

Sub-step S211: based on a shifting attention mechanism, perform attention computation on the audio feature information and the video feature information respectively, to obtain corresponding attention audio feature information and attention video feature information.

Sub-step S212: concatenate the attention audio feature information and the attention video feature information to obtain the corresponding audio-video feature information.

The audio feature information and the video feature information are merged with a shifting attention mechanism. Experiments showed that simply concatenating audio features and video features does not train a satisfactory classification model, because picture and audio are different modalities whose feature vectors and values have different meanings. To solve this problem, a shifting attention mechanism is adopted: attention units are added for each modality and a shifting operation is applied to improve the audio/video feature representation. The attention computation is:

v = concat_{k=1,...,N} [ (α_k · a_k X + β_k) / ||α_k · a_k X + β_k||_2 ]

其中,X是输入的音频特征或者视频特征;α和β都是可学习的参数;a是注意力加权向量;N是注意力单元的个数;v是输出的注意力音频特征或者注意力视频特征。where X is the input audio feature or video feature; α and β are both learnable parameters; a is the attention weight vector; N is the number of attention units; v is the output attention audio feature or attention video feature.

由上式可知，采用可学习的参数对输入的音频特征或视频特征进行线性变换操作后，进行L2正则化处理(||α·aX+β||_2)，得到经过移位注意力机制变换后的注意力音频特征信息和注意力视频特征信息，再进行拼接得到音视频特征信息。其中，注意力单元的个数是通过实验进行设置的，音频特征和视频特征均可以设置为8个。It can be seen from the above formula that the input audio features or video features are linearly transformed with the learnable parameters and then L2-normalized (||α·aX+β||_2), yielding the attention audio feature information and attention video feature information transformed by the shift attention mechanism, which are then concatenated to obtain the audio-video feature information. The number of attention units is set experimentally; for both the audio features and the video features it may be set to 8.
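As an illustration of the shift-and-normalize step described above, a minimal NumPy sketch is given below. The softmax attention weights, the scalar per-unit parameters and the random inputs are assumptions for illustration: the text specifies the shift (α, β), the L2 normalization and the N=8 units, but not how the attention weight vector a is computed.

```python
import numpy as np

def shift_attention(X, W, b, alpha, beta):
    """Shift attention for one modality: each of the N attention units
    computes softmax weights over the T inputs, takes the weighted sum,
    applies the learnable shift (alpha_k, beta_k), and L2-normalizes;
    the N unit outputs are concatenated."""
    outputs = []
    for k in range(len(alpha)):
        logits = X @ W[k] + b[k]               # (T,) attention logits
        a = np.exp(logits - logits.max())
        a = a / a.sum()                        # attention weight vector a
        v_k = alpha[k] * (a @ X) + beta[k]     # weighted sum, then shift
        v_k = v_k / np.linalg.norm(v_k, 2)     # L2 normalization
        outputs.append(v_k)
    return np.concatenate(outputs)             # (N * D,)

rng = np.random.default_rng(0)
T, D, N = 6, 4, 8                              # 8 units, as in the text
X_video = rng.normal(size=(T, D))              # per-segment video features
W = rng.normal(size=(N, D))
b = rng.normal(size=N)
alpha = rng.normal(size=N)
beta = rng.normal(size=N)
v_video = shift_attention(X_video, W, b, alpha, beta)
print(v_video.shape)  # (32,)
```

The merged audio-video feature would then be the concatenation of the video-side output with the analogously computed audio-side output.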

步骤203,将所述多个音视频特征信息分别输入所述分类模型中,获得对应的多个置信度结果。Step 203: Input the plurality of audio and video feature information into the classification model respectively to obtain a plurality of corresponding confidence results.

其中，置信度结果用于表示对应的视频片段属于片头片段/正片片段/片尾片段的置信度。Wherein, the confidence result is used to indicate the confidence that the corresponding video segment belongs to the title segment/main segment/credits segment.

在本发明一种可选的实施例中,分类模型可以通过以下方式训练:In an optional embodiment of the present invention, the classification model can be trained in the following manner:

获取用于训练的样本视频片段集；所述样本视频片段集包括连续的多个样本视频片段；所述多个样本视频片段分别标注的片段类型为片头片段或正片片段或片尾片段；分别确定所述多个样本视频片段的多个样本音视频特征信息；使用所述多个样本音视频特征信息进行模型训练，得到用于识别片头片段/正片片段/片尾片段的所述分类模型。Obtain a sample video segment set for training, the sample video segment set including a plurality of consecutive sample video segments, each labeled with a segment type of title segment, main segment or credits segment; determine a plurality of sample audio-video feature information of the plurality of sample video segments respectively; and perform model training using the plurality of sample audio-video feature information to obtain the classification model for identifying title segments/main segments/credits segments.

用于进行模型训练的样本视频片段集中包含有连续的多个样本视频片段，每个样本视频片段标注为片头片段或正片片段或片尾片段，可以确定多个样本视频片段的多个样本音视频特征信息，然后将多个样本音视频特征信息分别输入模型训练系统中进行模型训练，得到具有片头片段/正片片段/片尾片段识别能力的分类模型。在一种示例中，分类模型可以为全连接FC(Fully Connected,全连接)分类模型。The sample video segment set used for model training contains a plurality of consecutive sample video segments, each labeled as a title segment, a main segment or a credits segment. A plurality of sample audio-video feature information of the sample video segments can be determined and input into the model training system respectively for training, yielding a classification model capable of identifying title segments/main segments/credits segments. In one example, the classification model may be a fully connected (FC) classification model.
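A minimal sketch of such an FC classifier is shown below, assuming a single fully connected layer with a softmax over the three segment types, trained by plain SGD on cross-entropy. The feature dimension, learning rate and toy data are illustrative assumptions, not details from the specification.

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()

class FCClassifier:
    """One fully connected layer over a merged audio-video feature,
    producing confidences for the three segment types."""
    CLASSES = ("title", "main", "credits")

    def __init__(self, dim, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.normal(scale=0.01, size=(dim, 3))
        self.b = np.zeros(3)

    def predict(self, x):
        return softmax(x @ self.W + self.b)

    def train_step(self, x, y, lr=0.1):
        """One SGD step on the cross-entropy loss for one labeled sample."""
        p = self.predict(x)
        grad = p.copy()
        grad[y] -= 1.0                  # gradient of CE w.r.t. the logits
        self.W -= lr * np.outer(x, grad)
        self.b -= lr * grad

rng = np.random.default_rng(1)
x = rng.normal(size=16)                 # stand-in merged audio-video feature
clf = FCClassifier(dim=16)
p0 = clf.predict(x)[0]                  # "title" confidence before training
for _ in range(20):
    clf.train_step(x, 0)                # pretend this sample is a title segment
p1 = clf.predict(x)[0]                  # "title" confidence after training
```

After training, `predict` returns the per-class confidence (probability score) used in step 203.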

在本发明实施例中,将一个视频片段的音视频特征信息输入分类模型,输出的置信度结果可以用于确定该一个视频片段是否属于片头片段,或者是否属于正片片段,或者是否属于片尾片段。In the embodiment of the present invention, the audio and video feature information of a video clip is input into the classification model, and the output confidence result can be used to determine whether the video clip belongs to the intro clip, or whether it belongs to the main clip, or whether it belongs to the end clip.

示例性地,分类模型输出的置信度结果可以为概率分数值。在一种示例中,将某一个视频片段的音视频特征信息输入训练好的分类模型中,可以得到该视频片段属于片头片段/正片片段/片尾片段的概率分数值。Exemplarily, the confidence result output by the classification model may be a probability score value. In an example, the audio and video feature information of a certain video clip is input into the trained classification model, and the probability score value of the video clip belonging to the introductory clip/main clip/outro clip can be obtained.

步骤204,根据所述多个输出结果,从所述多个视频片段中确定候选视频片段。Step 204: Determine candidate video segments from the multiple video segments according to the multiple output results.

在本发明实施例中,可以根据分类模型输出的置信度结果,从该多个视频片段中确定可能包含检测目标的候选视频片段。In this embodiment of the present invention, a candidate video segment that may contain a detection target may be determined from the plurality of video segments according to the confidence result output by the classification model.

在本发明一种可选的实施例中,步骤204具体可以包括如下子步骤:In an optional embodiment of the present invention, step 204 may specifically include the following sub-steps:

子步骤S31，将所述多个置信度结果分别与预设置信度阈值比较，获得对应的多个比较结果。Sub-step S31, compare the plurality of confidence results with a preset confidence threshold respectively to obtain a plurality of corresponding comparison results.

子步骤S32,根据所述多个比较结果,从所述多个视频片段中确定所述候选视频片段。Sub-step S32, according to the multiple comparison results, determine the candidate video segment from the multiple video segments.

在一种实施方式中，片头片段具有对应的第一置信度阈值，如果置信度结果大于该第一置信度阈值，则可以将对应的视频片段确定为片头片段；正片片段具有对应的第二置信度阈值，如果置信度结果大于该第二置信度阈值，则可以将对应的视频片段确定为正片片段；片尾片段具有对应的第三置信度阈值，如果置信度结果大于该第三置信度阈值，则可以将对应的视频片段确定为片尾片段。In one embodiment, the title segment has a corresponding first confidence threshold; if the confidence result is greater than the first confidence threshold, the corresponding video segment can be determined as a title segment. The main segment has a corresponding second confidence threshold; if the confidence result is greater than the second confidence threshold, the corresponding video segment can be determined as a main segment. The credits segment has a corresponding third confidence threshold; if the confidence result is greater than the third confidence threshold, the corresponding video segment can be determined as a credits segment.

在另一种实施方式中，可以只设置片头片段对应的第一置信度阈值和片尾片段对应的第三置信度阈值，如果置信度结果大于该第一置信度阈值，则可以将对应的视频片段确定为片头片段，否则为正片片段；如果置信度结果大于该第三置信度阈值，则可以将对应的视频片段确定为片尾片段，否则为正片片段。In another embodiment, only the first confidence threshold corresponding to the title segment and the third confidence threshold corresponding to the credits segment may be set. If the confidence result is greater than the first confidence threshold, the corresponding video segment can be determined as a title segment, otherwise as a main segment; if the confidence result is greater than the third confidence threshold, the corresponding video segment can be determined as a credits segment, otherwise as a main segment.

在本发明实施例中，根据多个比较结果确定对应的多个视频片段的片段类型，依据该多个视频片段的片段类型和该多个视频片段的播放次序，可以从该多个视频片段中确定可能包含检测目标的候选视频片段。In this embodiment of the present invention, the segment types of the corresponding video segments are determined according to the plurality of comparison results, and candidate video segments that may contain the detection target can then be determined from the plurality of video segments according to their segment types and playback order.

在本发明一种可选的实施例中，候选视频片段包括用于查找检测目标为片头结束标志信息的第一候选视频片段，子步骤S32具体可以包括如下子步骤：In an optional embodiment of the present invention, the candidate video segments include a first candidate video segment used for searching for the detection target when the detection target is the title end marker information, and sub-step S32 may specifically include the following sub-steps:

子步骤S321,若所述检测目标为所述片头结束标志信息,则根据所述多个比较结果,分别将所述多个视频片段分类为片头片段和正片片段。Sub-step S321, if the detection target is the title end marker information, classify the plurality of video clips into a title clip and a main clip respectively according to the plurality of comparison results.

子步骤S322，若所述多个视频片段中存在播放次序相邻的一个片头片段和一个正片片段，且所述一个正片片段在所述一个片头片段播放完之后播放，则将所述一个片头片段和所述一个正片片段确定为所述第一候选视频片段。Sub-step S322, if there is a title segment and a main segment adjacent in playback order among the plurality of video segments, and the main segment is played after the title segment finishes, determine the title segment and the main segment as the first candidate video segment.

如果检测目标为片头结束标志信息，则可以根据多个比较结果，分别将多个视频片段分类为片头片段和正片片段。如果多个视频片段中存在播放次序相邻的一个片头片段和一个正片片段，且该一个正片片段在该一个片头片段播放完之后播放，则将该一个片头片段和该一个正片片段确定为可能包含片头结束标志信息的第一候选视频片段。If the detection target is the title end marker information, the plurality of video segments can be classified into title segments and main segments according to the plurality of comparison results. If there is a title segment and a main segment adjacent in playback order, and the main segment is played after the title segment finishes, the title segment and the main segment are determined as the first candidate video segment that may contain the title end marker information.

举例而言，可以设置第一置信度阈值为0.8，当某一视频片段的置信度结果大于第一置信度阈值0.8时，可以认为该视频片段为片头片段，否则为正片片段。假设有连续的T=10个视频片段{t1,……,t10}，它们分别对应的置信度结果为[1.0,0.9,0.9,0.9,0.9,0.9,0.7,0.4,0.3,0.1]，将多个置信度结果分别与第一置信度阈值比较，可以确定t1-t6均属于片头片段，t7-t10均属于正片片段，也就是说片头为t1-t6，正片为t7-t10，t6和t7为播放次序相邻的一个片头片段和一个正片片段，且t7在t6播放完之后播放，则可以选择t6和t7为第一候选视频片段。For example, the first confidence threshold can be set to 0.8; when the confidence result of a video segment is greater than 0.8, the segment is considered a title segment, otherwise a main segment. Suppose there are T=10 consecutive video segments {t1, ..., t10} whose confidence results are [1.0, 0.9, 0.9, 0.9, 0.9, 0.9, 0.7, 0.4, 0.3, 0.1]. Comparing each confidence result with the first confidence threshold, t1-t6 belong to the title and t7-t10 belong to the main feature; that is, the title is t1-t6 and the main feature is t7-t10. Since t6 and t7 are a title segment and a main segment adjacent in playback order, and t7 is played after t6 finishes, t6 and t7 can be selected as the first candidate video segment.
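The selection rule of this example can be sketched as a short Python function; the function name and the list representation of per-segment confidences are illustrative assumptions.

```python
def first_candidate(confidences, threshold=0.8):
    """Label each segment 'title' if its title confidence exceeds the
    threshold, else 'main', then return the 0-based indices of the first
    adjacent (title, main) pair in playback order, or None."""
    labels = ["title" if c > threshold else "main" for c in confidences]
    for i in range(len(labels) - 1):
        if labels[i] == "title" and labels[i + 1] == "main":
            return i, i + 1
    return None

scores = [1.0, 0.9, 0.9, 0.9, 0.9, 0.9, 0.7, 0.4, 0.3, 0.1]
print(first_candidate(scores))  # (5, 6), i.e. segments t6 and t7
```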

在本发明一种可选的实施例中，候选视频片段包括用于查找检测目标为片尾开始标志信息的第二候选视频片段，子步骤S32具体可以包括如下子步骤：In an optional embodiment of the present invention, the candidate video segments include a second candidate video segment used for searching for the detection target when the detection target is the credits start marker information, and sub-step S32 may specifically include the following sub-steps:

子步骤S323,若所述检测目标为所述片尾开始标志信息,则根据所述多个比较结果,分别将所述多个视频片段分类为正片片段和片尾片段。Sub-step S323, if the detection target is the end-credits start flag information, classify the multiple video clips into a main movie segment and an end-credits segment respectively according to the multiple comparison results.

子步骤S324，若所述多个视频片段中存在播放次序相邻的一个正片片段和一个片尾片段，且所述一个片尾片段在所述一个正片片段播放完之后播放，则将所述一个正片片段和所述一个片尾片段确定为所述第二候选视频片段。Sub-step S324, if there is a main segment and a credits segment adjacent in playback order among the plurality of video segments, and the credits segment is played after the main segment finishes, determine the main segment and the credits segment as the second candidate video segment.

如果检测目标为片尾开始标志信息，则可以根据多个比较结果，分别将多个视频片段分类为正片片段和片尾片段。如果多个视频片段中存在播放次序相邻的一个正片片段和一个片尾片段，且该一个片尾片段在该一个正片片段播放完之后播放，则将该一个正片片段和该一个片尾片段确定为可能包含片尾开始标志信息的第二候选视频片段。If the detection target is the credits start marker information, the plurality of video segments can be classified into main segments and credits segments according to the plurality of comparison results. If there is a main segment and a credits segment adjacent in playback order, and the credits segment is played after the main segment finishes, the main segment and the credits segment are determined as the second candidate video segment that may contain the credits start marker information.

举例而言，可以设置第三置信度阈值为0.8，当某一视频片段的置信度结果大于或等于第三置信度阈值0.8时，可以认为该视频片段为片尾片段，否则为正片片段。假设有连续的T=10个视频片段{t1,……,t10}，它们分别对应的置信度结果为[0.1,0.9,0.9,0.9,0.1,0.1,0.1,0.8,0.8,0.9]，将多个置信度结果分别与第三置信度阈值比较，可以确定t2-t4以及t8-t10均属于片尾片段，t1以及t5-t7均属于正片片段，也就是说第一个正片为t1，第一个片尾为t2-t4，第二个正片为t5-t7，第二个片尾为t8-t10，t1和t2为播放次序相邻的一个正片片段和一个片尾片段，t7和t8也为播放次序相邻的一个正片片段和一个片尾片段，且t2在t1播放完之后播放，t8在t7播放完之后播放，则可以选择t1和t2，以及t7和t8为第二候选视频片段。For example, the third confidence threshold can be set to 0.8; when the confidence result of a video segment is greater than or equal to 0.8, the segment is considered a credits segment, otherwise a main segment. Suppose there are T=10 consecutive video segments {t1, ..., t10} whose confidence results are [0.1, 0.9, 0.9, 0.9, 0.1, 0.1, 0.1, 0.8, 0.8, 0.9]. Comparing each confidence result with the third confidence threshold, t2-t4 and t8-t10 belong to credits segments, while t1 and t5-t7 belong to main segments; that is, the first main part is t1, the first credits are t2-t4, the second main part is t5-t7, and the second credits are t8-t10. Both t1/t2 and t7/t8 are a main segment and a credits segment adjacent in playback order, with t2 played after t1 finishes and t8 played after t7 finishes, so t1 and t2, as well as t7 and t8, can be selected as second candidate video segments.
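The rule of this example can likewise be sketched as a small function; names and data layout are illustrative. The worked example labels a score equal to the threshold as a credits segment, so `>=` is used here.

```python
def second_candidates(confidences, threshold=0.8):
    """Label each segment 'credits' if its credits confidence is at least
    the threshold, else 'main', and return every adjacent (main, credits)
    index pair in playback order."""
    labels = ["credits" if c >= threshold else "main" for c in confidences]
    pairs = []
    for i in range(len(labels) - 1):
        if labels[i] == "main" and labels[i + 1] == "credits":
            pairs.append((i, i + 1))
    return pairs

scores = [0.1, 0.9, 0.9, 0.9, 0.1, 0.1, 0.1, 0.8, 0.8, 0.9]
print(second_candidates(scores))  # [(0, 1), (6, 7)]: t1/t2 and t7/t8
```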

在本发明一种可选的实施例中,候选视频片段包括用于查找检测目标为片尾结束标志信息的第三候选视频片段,子步骤S32具体可以包括如下子步骤:In an optional embodiment of the present invention, the candidate video segment includes a third candidate video segment for finding the detection target of the end-of-credits marker information, and sub-step S32 may specifically include the following sub-steps:

子步骤S325,若所述检测目标为所述片尾结束标志信息,则根据所述多个比较结果,分别将所述多个视频片段分类为正片片段和片尾片段。Sub-step S325, if the detection target is the end-of-credits information, classify the multiple video clips into a main segment and an ending segment respectively according to the multiple comparison results.

子步骤S326，若所述多个视频片段中存在播放次序相邻的一个片尾片段和一个正片片段，且所述一个正片片段在所述一个片尾片段播放完之后播放，则将所述一个片尾片段和所述一个正片片段确定为所述第三候选视频片段。Sub-step S326, if there is a credits segment and a main segment adjacent in playback order among the plurality of video segments, and the main segment is played after the credits segment finishes, determine the credits segment and the main segment as the third candidate video segment.

分类模型只能识别出片头片段、正片片段和片尾片段，不能识别出彩蛋片段。如果在截取出来进行分析的多个视频片段中识别到在播放完一个片尾片段之后继续播放一个正片片段，则可以将该一个正片片段认为是彩蛋片段。The classification model can only identify title segments, main segments and credits segments; it cannot identify easter egg segments. If, among the plurality of video segments cut out for analysis, a main segment is found to continue playing after a credits segment finishes, that main segment can be regarded as an easter egg segment.

在本发明实施例中，如果检测目标为片尾结束标志信息，则可以根据多个比较结果，分别将多个视频片段分类为正片片段和片尾片段。如果多个视频片段中存在播放次序相邻的一个片尾片段和一个正片片段，且该一个正片片段在该一个片尾片段播放完之后播放，则将该一个片尾片段和该一个正片片段确定为可能包含片尾结束标志信息的第三候选视频片段。In this embodiment of the present invention, if the detection target is the credits end marker information, the plurality of video segments can be classified into main segments and credits segments according to the plurality of comparison results. If there is a credits segment and a main segment adjacent in playback order, and the main segment is played after the credits segment finishes, the credits segment and the main segment are determined as the third candidate video segment that may contain the credits end marker information.

举例而言，可以设置第三置信度阈值为0.8，当某一视频片段的置信度结果大于或等于第三置信度阈值0.8时，可以认为该视频片段为片尾片段，否则为正片片段。假设有连续的T=10个视频片段{t1,……,t10}，它们分别对应的置信度结果为[1.0,0.9,0.9,0.9,0.1,0.1,0.1,0.8,0.8,0.9]，将多个置信度结果分别与第三置信度阈值比较，可以确定t1-t4以及t8-t10均属于片尾片段，t5-t7属于正片片段，也就是说第一个片尾为t1-t4，正片为t5-t7，第二个片尾为t8-t10，t4和t5为播放次序相邻的一个片尾片段和一个正片片段，且t5在t4播放完之后播放，则可以选择t4和t5为第三候选视频片段。For example, the third confidence threshold can be set to 0.8; when the confidence result of a video segment is greater than or equal to 0.8, the segment is considered a credits segment, otherwise a main segment. Suppose there are T=10 consecutive video segments {t1, ..., t10} whose confidence results are [1.0, 0.9, 0.9, 0.9, 0.1, 0.1, 0.1, 0.8, 0.8, 0.9]. Comparing each confidence result with the third confidence threshold, t1-t4 and t8-t10 belong to credits segments and t5-t7 belong to main segments; that is, the first credits are t1-t4, the main part is t5-t7, and the second credits are t8-t10. Since t4 and t5 are a credits segment and a main segment adjacent in playback order, and t5 is played after t4 finishes, t4 and t5 can be selected as the third candidate video segment.
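The credits-to-main boundary of this example (which also marks a possible easter egg segment) can be sketched as follows; names and data layout are illustrative, and `>=` follows the worked example's treatment of a score equal to the threshold.

```python
def third_candidate(confidences, threshold=0.8):
    """Label each segment 'credits' if its credits confidence is at least
    the threshold, else 'main', and return the first adjacent
    (credits, main) index pair in playback order, or None."""
    labels = ["credits" if c >= threshold else "main" for c in confidences]
    for i in range(len(labels) - 1):
        if labels[i] == "credits" and labels[i + 1] == "main":
            return i, i + 1
    return None

scores = [1.0, 0.9, 0.9, 0.9, 0.1, 0.1, 0.1, 0.8, 0.8, 0.9]
print(third_candidate(scores))  # (3, 4), i.e. segments t4 and t5
```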

需要说明的是，上述子步骤S321-S322、子步骤S323-S324以及子步骤S325-S326三对子步骤是并列执行的，如果检测目标为片头结束标志信息，则执行子步骤S321-S322；如果检测目标为片尾开始标志信息，则执行子步骤S323-S324；如果检测目标为片尾结束标志信息，则执行子步骤S325-S326。三对子步骤之间没有执行先后顺序的制约关系。It should be noted that the three pairs of sub-steps S321-S322, S323-S324 and S325-S326 are parallel alternatives: if the detection target is the title end marker information, sub-steps S321-S322 are executed; if the detection target is the credits start marker information, sub-steps S323-S324 are executed; if the detection target is the credits end marker information, sub-steps S325-S326 are executed. There is no execution-order constraint among the three pairs of sub-steps.

步骤205,对所述候选视频片段的图像帧进行文字识别,获得文字识别结果。Step 205: Perform text recognition on the image frames of the candidate video segments to obtain a text recognition result.

在本发明实施例中,可以将候选视频片段中的图像帧输入预先训练的文字识别模型中进行文字识别。文字识别模型具有检测图像中文本显示位置区域和辨别文本内容的能力,结合文本显示位置区域和文本内容进行文字识别,可以提高识别准确度。In this embodiment of the present invention, the image frames in the candidate video segments may be input into a pre-trained character recognition model for character recognition. The text recognition model has the ability to detect the text display location area in the image and identify the text content. Combining the text display location area and the text content for text recognition can improve the recognition accuracy.

在具体实施中,可以对候选视频片段的图像帧进行抽样,将抽样的图像帧输入文字识别模型中进行文字识别。例如,可以隔5帧进行采样,并将采样后的图像帧输入文字识别模型。In a specific implementation, the image frames of the candidate video segments may be sampled, and the sampled image frames may be input into the character recognition model for character recognition. For example, sampling can be performed every 5 frames, and the sampled image frames can be input into the character recognition model.
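The sampling mentioned above can be sketched as a one-liner; the step size of 5 follows the example, and the function name is an illustrative assumption.

```python
def sample_frames(frames, step=5):
    """Keep every step-th frame before running text recognition on it."""
    return frames[::step]

sampled = sample_frames(list(range(100)), step=5)
print(len(sampled), sampled[:3])  # 20 [0, 5, 10]
```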

步骤206,根据所述文字识别结果确定所述检测目标所在的目标图像帧。Step 206: Determine a target image frame where the detection target is located according to the character recognition result.

在本发明一种可选的实施例中,文字识别结果包括对第一候选视频片段的图像帧进行文字识别得到的第一文字识别结果,步骤206具体可以包括如下子步骤:In an optional embodiment of the present invention, the text recognition result includes a first text recognition result obtained by performing text recognition on the image frame of the first candidate video segment. Step 206 may specifically include the following sub-steps:

子步骤S41，将所述第一文字识别结果中的文本内容与预设关键词进行匹配，并在匹配到包含所述预设关键词的图像帧之后，对包含所述预设关键词的图像帧进行跟踪，将跟踪到的最后一个包含所述预设关键词的图像帧确定为所述片头结束标志信息所在的所述目标图像帧。Sub-step S41, match the text content in the first text recognition result with a preset keyword; after an image frame containing the preset keyword is matched, track the image frames containing the preset keyword, and determine the last tracked image frame containing the preset keyword as the target image frame where the title end marker information is located.

文字识别结果包括针对第一候选视频片段的第一文字识别结果，第一文字识别结果中可以包括识别得到的文本内容，可以将文本内容与预设关键词进行匹配，并在匹配到包含预设关键词的文本内容之后，确定该文本内容对应的图像帧，对该图像帧进行跟踪，将跟踪到的最后一个包含预设关键词的图像帧确定为片头结束标志信息所在的目标图像帧。The text recognition result includes a first text recognition result for the first candidate video segment, which may include the recognized text content. The text content can be matched against a preset keyword; after text content containing the preset keyword is matched, the image frame corresponding to that text content is determined and tracked, and the last tracked image frame containing the preset keyword is determined as the target image frame where the title end marker information is located.

将识别到的文字识别结果中的文本内容与预设关键词进行匹配。其中,预设关键词可以为集数信息或发行编号信息,例如,“第*集”、“第*章”、以及发行编号等。Match the text content in the recognized text recognition result with preset keywords. The preset keyword may be episode number information or release number information, for example, "Episode *", "Chapter *", and release number.

对包含预设关键词的图像帧进行跟踪，其中，可以将跟踪到像素抖动超过阈值为止时的图像帧确定为跟踪到的最后一个包含预设关键词的图像帧，跟踪到像素抖动超过阈值为止是为了捕捉预设关键词的字体画面渐变消失的那一个图像帧。例如，在第N帧时匹配到预设关键词，跟踪到第N+3帧时像素抖动超过阈值，此时预设关键词的字体画面渐变消失，则将第N+3帧作为目标图像帧。The image frames containing the preset keyword are tracked, wherein the image frame at which the pixel jitter exceeds a threshold can be determined as the last tracked image frame containing the preset keyword; tracking until the pixel jitter exceeds the threshold captures the image frame at which the caption showing the preset keyword fades out. For example, if the preset keyword is matched at frame N and the pixel jitter exceeds the threshold at frame N+3, where the caption of the preset keyword has faded out, then frame N+3 is taken as the target image frame.
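A minimal sketch of the keyword matching and jitter-based tracking described above. It assumes per-frame OCR text and a precomputed per-frame jitter value; the data layout, the jitter threshold and the keyword are illustrative assumptions.

```python
def find_title_end(frames_text, jitter, keywords, jitter_threshold=0.5):
    """frames_text[i]: OCR text of frame i; jitter[i]: pixel jitter between
    frame i-1 and frame i. After the first frame whose text contains a
    keyword, follow subsequent frames until the jitter exceeds the
    threshold; that frame (where the caption fades out) is returned."""
    match = next((i for i, t in enumerate(frames_text)
                  if any(k in t for k in keywords)), None)
    if match is None:
        return None                       # no keyword matched in this window
    for j in range(match + 1, len(jitter)):
        if jitter[j] > jitter_threshold:
            return j                      # caption fade-out frame
    return len(jitter) - 1                # jitter never spiked; take last frame

texts = ["", "", "第1集", "第1集", "第1集", "第1集", ""]
jit = [0.0, 0.1, 0.1, 0.2, 0.2, 0.9, 0.1]   # caption fades at frame 5
print(find_title_end(texts, jit, keywords=("第1集",)))  # 5
```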

在本发明一种可选的实施例中,文字识别结果包括对第二候选视频片段的图像帧进行文字识别得到的第二文字识别结果,步骤206具体可以包括如下子步骤:In an optional embodiment of the present invention, the text recognition result includes a second text recognition result obtained by performing text recognition on the image frame of the second candidate video segment. Step 206 may specifically include the following sub-steps:

子步骤S42，按时间顺序遍历所述第二候选视频片段中各个图像帧对应的第二文字识别结果，若连续的多个图像帧对应的第二文字识别结果中的文本框数量大于预设的数量阈值，则将第二文字识别结果中的文本框数量大于预设的数量阈值的第一个图像帧确定为所述片尾开始标志信息所在的所述目标图像帧。Sub-step S42, traverse the second text recognition results corresponding to the image frames in the second candidate video segment in chronological order; if the number of text boxes in the second text recognition results of a plurality of consecutive image frames is greater than a preset number threshold, determine the first image frame whose number of text boxes exceeds the preset number threshold as the target image frame where the credits start marker information is located.

文字识别结果包括针对第二候选视频片段的第二文字识别结果，如果检测目标为片尾开始标志信息，则可以对第二文字识别结果进行分析，按时间顺序遍历第二候选视频片段中各个图像帧对应的第二文字识别结果，如果有连续的多个图像帧对应的第二文字识别结果中的文本框数量大于预设的数量阈值，则可以认为片尾开始，可以将文本框数量大于预设的数量阈值的第一个图像帧确定为片尾开始标志信息所在的目标图像帧。The text recognition result includes a second text recognition result for the second candidate video segment. If the detection target is the credits start marker information, the second text recognition results of the image frames in the second candidate video segment are traversed in chronological order; if the number of text boxes in the second text recognition results of a plurality of consecutive image frames is greater than the preset number threshold, the credits can be considered to have started, and the first image frame whose number of text boxes exceeds the threshold is determined as the target image frame where the credits start marker information is located.
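The consecutive text-box-count rule can be sketched as follows. The run length and count threshold are illustrative assumptions; the specification only requires "a plurality of consecutive image frames" above a preset number threshold.

```python
def find_credits_start(box_counts, count_threshold=3, run_length=3):
    """box_counts[i]: number of OCR text boxes in frame i. Return the index
    of the first frame of a run of run_length consecutive frames whose box
    count exceeds count_threshold, taken as the credits start, or None."""
    run = 0
    for i, n in enumerate(box_counts):
        run = run + 1 if n > count_threshold else 0
        if run == run_length:
            return i - run_length + 1     # first frame of the qualifying run
    return None

counts = [0, 1, 0, 5, 6, 7, 8, 9]
print(find_credits_start(counts))  # 3
```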

在本发明一种可选的实施例中,文字识别结果包括对第三候选视频片段的图像帧进行文字识别得到的第三文字识别结果,步骤206具体可以包括如下子步骤:In an optional embodiment of the present invention, the text recognition result includes a third text recognition result obtained by performing text recognition on the image frame of the third candidate video segment. Step 206 may specifically include the following sub-steps:

子步骤S43，若所述第三文字识别结果中包含文本框，则对包含文本框的图像帧进行跟踪，并将跟踪到的最后一个包含文本框的图像帧确定为所述片尾结束标志信息所在的所述目标图像帧。Sub-step S43, if the third text recognition result contains a text box, track the image frames containing text boxes, and determine the last tracked image frame containing a text box as the target image frame where the credits end marker information is located.

文字识别结果包括针对第三候选视频片段的第三文字识别结果,如果检测目标为片尾结束标志信息,则可以对第三文字识别结果进行分析,确定第三文字识别结果中是否包含文本框,如果包含文本框,可以对包含文本框的图像帧进行跟踪,将跟踪到的最后一个包含文本框的图像帧确定为片尾结束标志信息所在的目标图像帧。The text recognition result includes the third text recognition result for the third candidate video segment. If the detection target is the end-of-credits sign information, the third text recognition result can be analyzed to determine whether the third text recognition result contains a text box. If the text box is included, the image frame including the text box can be tracked, and the last tracked image frame including the text box is determined as the target image frame where the end credit information is located.

由于视频文件片尾通常有滚动的制作信息，例如制作人员名单，制作公司等，因此片尾的图像帧中会存在滚动的多个文本框，可以检测文本框出现的位置坐标，对包含文本框的图像帧进行跟踪，将跟踪到的最后一个包含文本框的图像帧确定为片尾结束标志信息所在的目标图像帧。Since the end of a video file usually contains rolling production information, such as the cast and crew list and the production company, the credits image frames contain multiple rolling text boxes. The position coordinates where the text boxes appear can be detected, the image frames containing text boxes can be tracked, and the last tracked image frame containing a text box is determined as the target image frame where the credits end marker information is located.

需要说明的是，上述子步骤S41、子步骤S42以及子步骤S43三个子步骤是并列执行的，如果检测目标为片头结束标志信息，则执行子步骤S41；如果检测目标为片尾开始标志信息，则执行子步骤S42；如果检测目标为片尾结束标志信息，则执行子步骤S43。三个子步骤之间没有执行先后顺序的制约关系。此外，子步骤S41承接子步骤S321-S322执行，子步骤S42承接子步骤S323-S324执行，子步骤S43承接子步骤S325-S326执行。It should be noted that sub-steps S41, S42 and S43 are parallel alternatives: if the detection target is the title end marker information, sub-step S41 is executed; if the detection target is the credits start marker information, sub-step S42 is executed; if the detection target is the credits end marker information, sub-step S43 is executed. There is no execution-order constraint among the three sub-steps. In addition, sub-step S41 follows sub-steps S321-S322, sub-step S42 follows sub-steps S323-S324, and sub-step S43 follows sub-steps S325-S326.

参照图4所示,为本发明实施例的一种视频检测方法的流程图,其中,检测目标为片头结束标志信息,具体流程包括:Referring to FIG. 4 , it is a flowchart of a video detection method according to an embodiment of the present invention, wherein the detection target is title end marker information, and the specific process includes:

S4a、可以获取一集电视剧视频文件，截取该视频文件中时间段为T0-T1的片段作为截取视频片段用于分析，可以将该截取视频片段划分为多个视频片段，分别获取多个视频片段的多个图像信息和多个音频信息。S4a. A video file of one episode of a TV series can be obtained, and the segment of the video file in the time period T0-T1 is cut out as the intercepted video segment for analysis. The intercepted video segment can be divided into a plurality of video segments, and a plurality of image information and a plurality of audio information of the video segments are obtained respectively.

S4b、对于任一个视频片段,将该视频片段对应的图像信息和音频信息分别输入视频特征提取模型和音频特征提取模型提取特征,得到该视频片段对应的视频特征信息和音频特征信息。S4b. For any video clip, input the image information and audio information corresponding to the video clip into the video feature extraction model and the audio feature extraction model to extract features respectively, and obtain the video feature information and audio feature information corresponding to the video clip.

S4c、对于任一个视频片段,将该视频片段对应的视频特征信息和音频特征信息进行合并得到该视频片段对应的音视频特征信息。S4c. For any video clip, combine the video feature information and audio feature information corresponding to the video clip to obtain audio and video feature information corresponding to the video clip.

S4d、对于任一个视频片段,将该视频片段对应的音视频特征信息输入训练好的分类模型中,得到该视频片段属于片头片段/正片片段/片尾片段的置信度结果。S4d. For any video clip, input the audio and video feature information corresponding to the video clip into the trained classification model, and obtain a confidence result that the video clip belongs to the introductory clip/main clip/outro clip.

S4e、根据输出的多个置信度结果从多个视频片段中确定第一候选视频片段，具体的，可以根据输出的多个置信度结果，分别将多个视频片段分类为片头片段和正片片段，如果该多个视频片段中存在播放次序相邻的一个片头片段和一个正片片段，且该一个正片片段在该一个片头片段播放完之后播放，则将该一个片头片段和该一个正片片段确定为用于查找片头结束标志信息的第一候选视频片段。S4e. Determine the first candidate video segment from the plurality of video segments according to the plurality of output confidence results. Specifically, the plurality of video segments can be classified into title segments and main segments according to the confidence results; if there is a title segment and a main segment adjacent in playback order, and the main segment is played after the title segment finishes, the title segment and the main segment are determined as the first candidate video segment used for finding the title end marker information.

S4f、将第一候选视频片段的图像帧输入文字识别模型中,得到对应的第一文字识别结果。S4f: Input the image frame of the first candidate video segment into the text recognition model to obtain a corresponding first text recognition result.

S4g、将第一文字识别结果中的文本内容与预设关键词进行匹配，并在匹配到包含预设关键词的图像帧之后，对包含预设关键词的图像帧进行跟踪，将跟踪到的最后一个包含预设关键词的图像帧确定为片头结束标志信息所在的目标图像帧。而如果匹配不到包含预设关键词的图像帧，则可以对步骤S4a中的截取时间段进行调整，例如，此时可以将截取视频片段的截取时间段调整为T1-T2时间段，对截取时间段为T1-T2的截取视频片段进行分析，重复步骤S4b-S4g，直到匹配到包含预设关键词的图像帧或者已经对该视频文件所包含的视频片段均分析完为止。S4g. Match the text content in the first text recognition result with the preset keyword; after an image frame containing the preset keyword is matched, track the image frames containing the preset keyword, and determine the last tracked image frame containing the preset keyword as the target image frame where the title end marker information is located. If no image frame containing the preset keyword is matched, the interception time period in step S4a can be adjusted; for example, the interception time period can be adjusted to T1-T2, the intercepted video segment of the time period T1-T2 is analyzed, and steps S4b-S4g are repeated until an image frame containing the preset keyword is matched or all the video segments contained in the video file have been analyzed.
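The retry loop of steps S4a-S4g can be sketched as a small driver. `analyze_window` is a hypothetical stand-in for the whole per-window pipeline (feature extraction, classification, OCR and keyword matching); the window tuples and file name are illustrative assumptions.

```python
def locate_title_end(video, windows, analyze_window):
    """Try successive interception time windows until one yields a target
    frame; return (window, frame_index), or None if every window of the
    video has been analyzed without a match."""
    for w in windows:
        frame = analyze_window(video, w)   # runs S4b-S4g for this window
        if frame is not None:
            return w, frame
    return None

# Toy stand-in: pretend only the second window (T1-T2) contains the caption.
result = locate_title_end(
    "ep01.mp4",
    [(0, 60), (60, 120)],
    lambda video, w: 1742 if w == (60, 120) else None,
)
print(result)  # ((60, 120), 1742)
```

The same driver shape applies to the credits-start flow of FIG. 5 (steps S5a-S5g) with a different `analyze_window`.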

参照图5所示,为本发明实施例的另一种视频检测方法的流程图,其中,检测目标为片尾开始标志信息,具体流程包括:Referring to FIG. 5, it is a flowchart of another video detection method according to an embodiment of the present invention, wherein the detection target is the end-credits start marker information, and the specific process includes:

S5a、可以获取一部电影视频文件，截取该视频文件中可能包含片尾开始标志信息的片段进行分析，假设截取时间段为T9-T10的片段作为截取视频片段用于分析，可以将该截取视频片段划分为多个视频片段，分别获取多个视频片段的多个图像信息和多个音频信息。S5a. A movie video file can be obtained, and a segment of the video file that may contain the credits start marker information is cut out for analysis; suppose the segment of the time period T9-T10 is cut out as the intercepted video segment. The intercepted video segment can be divided into a plurality of video segments, and a plurality of image information and a plurality of audio information of the video segments are obtained respectively.

S5b、对于任一个视频片段,将该视频片段对应的图像信息和音频信息分别输入视频特征提取模型和音频特征提取模型提取特征,得到该视频片段对应的视频特征信息和音频特征信息。S5b. For any video clip, input the image information and audio information corresponding to the video clip into the video feature extraction model and the audio feature extraction model to extract features respectively, and obtain the video feature information and audio feature information corresponding to the video clip.

S5c、对于任一个视频片段,将该视频片段对应的视频特征信息和音频特征信息进行合并得到该视频片段对应的音视频特征信息。S5c. For any video clip, combine the video feature information and audio feature information corresponding to the video clip to obtain audio and video feature information corresponding to the video clip.

S5d、对于任一个视频片段,将该视频片段对应的音视频特征信息输入训练好的分类模型中,得到该视频片段属于片头片段/正片片段/片尾片段的置信度结果。S5d. For any video clip, input the audio and video feature information corresponding to the video clip into the trained classification model, and obtain a confidence result that the video clip belongs to the introductory clip/main clip/outro clip.

S5e、根据输出的多个置信度结果从多个视频片段中确定第二候选视频片段，具体的，可以根据输出的多个置信度结果，分别将多个视频片段分类为正片片段和片尾片段，如果该多个视频片段中存在播放次序相邻的一个正片片段和一个片尾片段，且该一个片尾片段在该一个正片片段播放完之后播放，则将该一个正片片段和该一个片尾片段确定为用于查找片尾开始标志信息的第二候选视频片段。S5e. Determine a second candidate video segment from the multiple video segments according to the multiple output confidence results. Specifically, the multiple video segments can be classified into main-feature segments and ending segments according to the multiple output confidence results. If the multiple video segments contain a main-feature segment and an ending segment that are adjacent in playback order, and the ending segment is played after the main-feature segment finishes, the main-feature segment and the ending segment are determined as the second candidate video segment used to find the end-credits start marker information.
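The adjacency test of step S5e can be sketched as a scan over the per-segment classification results. The names are illustrative, and the confidence-to-label step is assumed to have already been applied:

```python
def find_main_to_credits_boundary(labels):
    """labels: per-segment class in playback order, e.g. ["main", "credits"].
    Return the index pair (i, i+1) of the first adjacent (main, credits)
    pair, i.e. the two segments forming the second candidate video segment."""
    for i in range(len(labels) - 1):
        if labels[i] == "main" and labels[i + 1] == "credits":
            return (i, i + 1)
    return None  # no boundary in this clipped window; adjust the window
```

Step S6e is the mirror image: it looks for a (credits, main) pair instead.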

S5f、将第二候选视频片段的图像帧输入文字识别模型中,得到对应的第二文字识别结果。S5f: Input the image frame of the second candidate video segment into the text recognition model to obtain a corresponding second text recognition result.

S5g、按时间顺序遍历第二候选视频片段中各个图像帧对应的第二文字识别结果，若连续的多个图像帧对应的第二文字识别结果中的文本框数量大于预设的数量阈值，则将第二文字识别结果中的文本框数量大于预设的数量阈值的第一个图像帧确定为片尾开始标志信息所在的目标图像帧。而如果不存在连续的多个图像帧对应的第二文字识别结果中的文本框数量大于预设的数量阈值，则可以对步骤S5a中的截取时间段进行调整，例如，此时可以将截取视频片段的截取时间段调整为T10-T11时间段，对截取时间段为T10-T11的截取视频片段进行分析，重复步骤S5b–S5g，直到连续的多个图像帧对应的第二文字识别结果中的文本框数量大于预设的数量阈值或者已经对该视频文件所包含的视频片段均分析完为止。S5g. Traverse, in chronological order, the second text recognition results corresponding to the image frames in the second candidate video segment. If the number of text boxes in the second text recognition results of multiple consecutive image frames is greater than a preset number threshold, the first image frame whose second text recognition result contains more text boxes than the preset number threshold is determined as the target image frame where the end-credits start marker information is located. If no run of consecutive image frames has second text recognition results whose text-box counts exceed the preset number threshold, the clipping time period in step S5a can be adjusted; for example, it can be changed to the T10-T11 period, the clipped video segment covering T10-T11 can be analyzed, and steps S5b-S5g can be repeated until the text-box counts of multiple consecutive image frames exceed the preset number threshold or all video segments contained in the video file have been analyzed.
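The chronological traversal of step S5g can be sketched as follows, assuming the per-frame text-box counts have already been produced by the text recognition model (all names are illustrative):

```python
def find_credits_start_frame(box_counts, threshold, min_run):
    """box_counts: (frame_id, n_text_boxes) pairs for consecutive frames in
    chronological order. Return the FIRST frame of the earliest run of at
    least `min_run` consecutive frames whose box count exceeds `threshold`,
    i.e. the frame where the end credits start rolling."""
    run_start, run_len = None, 0
    for frame_id, n in box_counts:
        if n > threshold:
            if run_len == 0:
                run_start = frame_id   # candidate start of the run
            run_len += 1
            if run_len >= min_run:
                return run_start
        else:
            run_start, run_len = None, 0   # run broken, restart
    return None  # no qualifying run; fall back to another clipping window
```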

参照图6所示，为本发明实施例的又一种视频检测方法的流程图，其中，检测目标为片尾结束标志信息，具体流程包括：Referring to FIG. 6, which is a flowchart of yet another video detection method according to an embodiment of the present invention, where the detection target is the end-credits end marker information; the specific process includes:

S6a、可以获取一集电影视频文件，截取该视频文件中可能包含片尾结束标志信息的片段进行分析，假设截取时间段为T8-T9的片段作为截取视频片段用于分析，可以将该截取视频片段划分为多个视频片段，分别获取多个视频片段的多个图像信息和多个音频信息。S6a. A movie video file can be obtained, and the segment of the video file that may contain the end-credits end marker information is clipped out for analysis. Assuming the segment covering the T8-T9 period is taken as the clipped video segment for analysis, this clipped video segment can be divided into multiple video segments, and multiple pieces of image information and audio information of the multiple video segments are obtained respectively.

S6b、对于任一个视频片段,将该视频片段对应的图像信息和音频信息分别输入视频特征提取模型和音频特征提取模型提取特征,得到该视频片段对应的视频特征信息和音频特征信息。S6b. For any video clip, input the image information and audio information corresponding to the video clip into the video feature extraction model and the audio feature extraction model to extract features respectively, and obtain the video feature information and audio feature information corresponding to the video clip.

S6c、对于任一个视频片段,将该视频片段对应的视频特征信息和音频特征信息进行合并得到该视频片段对应的音视频特征信息。S6c. For any video clip, combine the video feature information and audio feature information corresponding to the video clip to obtain the audio and video feature information corresponding to the video clip.

S6d、对于任一个视频片段,将该视频片段对应的音视频特征信息输入训练好的分类模型中,得到该视频片段属于片头片段/正片片段/片尾片段的置信度结果。S6d. For any video clip, input the audio and video feature information corresponding to the video clip into the trained classification model, and obtain a confidence result that the video clip belongs to the introductory clip/main clip/outro clip.

S6e、根据输出的多个置信度结果从多个视频片段中确定第三候选视频片段，具体的，可以根据输出的多个置信度结果，分别将多个视频片段分类为正片片段和片尾片段，如果该多个视频片段中存在播放次序相邻的一个片尾片段和一个正片片段，且该一个正片片段在该一个片尾片段播放完之后播放，则将该一个片尾片段和该一个正片片段确定为用于查找片尾结束标志信息的第三候选视频片段。S6e. Determine a third candidate video segment from the multiple video segments according to the multiple output confidence results. Specifically, the multiple video segments can be classified into main-feature segments and ending segments according to the multiple output confidence results. If the multiple video segments contain an ending segment and a main-feature segment that are adjacent in playback order, and the main-feature segment is played after the ending segment finishes, the ending segment and the main-feature segment are determined as the third candidate video segment used to find the end-credits end marker information.

S6f、将第三候选视频片段的图像帧输入文字识别模型中,得到对应的第三文字识别结果。S6f: Input the image frame of the third candidate video segment into the text recognition model to obtain a corresponding third text recognition result.

S6g、如果第三文字识别结果中包含文本框，则对包含文本框的图像帧进行跟踪，并将跟踪到的最后一个包含文本框的图像帧确定为片尾结束标志信息所在的目标图像帧。而如果第三文字识别结果中不包含文本框，则可以对步骤S6a中的截取时间段进行调整，例如，此时可以将截取视频片段的截取时间段调整为T9-T10时间段，对截取时间段为T9-T10的截取视频片段进行分析，重复步骤S6b–S6g，直到第三文字识别结果中包含文本框或者已经对该视频文件所包含的视频片段均分析完为止。S6g. If the third text recognition result contains a text box, track the image frames containing a text box, and determine the last tracked image frame containing a text box as the target image frame where the end-credits end marker information is located. If the third text recognition result contains no text box, the clipping time period in step S6a can be adjusted; for example, it can be changed to the T9-T10 period, the clipped video segment covering T9-T10 can be analyzed, and steps S6b-S6g can be repeated until the third text recognition result contains a text box or all video segments contained in the video file have been analyzed.
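Step S6g's rule, keeping the last frame that still contains a text box, can be sketched as (illustrative names):

```python
def find_credits_end_frame(box_counts):
    """box_counts: (frame_id, n_text_boxes) pairs in chronological order.
    Return the last frame that still contains at least one text box
    (the credits-end target frame), or None if no frame has text."""
    last = None
    for frame_id, n in box_counts:
        if n > 0:
            last = frame_id   # keep tracking the most recent text frame
    return last
```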

在本发明实施例中，可以通过视频文件中连续的多个视频片段的音视频特征信息，确定在多个视频片段中可能包含检测目标的候选视频片段，并对该候选视频片段的图像帧进行文字识别，确定检测目标所在的目标图像帧。通过采用上述方法，结合视频文件的画面信息和音频信息，利用深度学习模型确定检测目标所在的候选视频片段，并在该片段范围进行文字识别，从而对准确的检测目标所在的图像帧进行定位，可以提高视频片头结束时间点位、片尾开始时间点位和彩蛋开始时间点位的识别精度，该检测方法无需进行图像模板匹配，也无需依赖人工操作，点位识别方式灵活且识别效率高。In this embodiment of the present invention, the audio-video feature information of multiple consecutive video segments in a video file can be used to determine candidate video segments that may contain the detection target, and text recognition can be performed on the image frames of the candidate video segments to determine the target image frame where the detection target is located. With the above method, the picture information and audio information of the video file are combined, a deep learning model determines the candidate video segment where the detection target is located, and text recognition is performed within that segment, so that the image frame where the detection target is located is accurately positioned. This improves the recognition accuracy of the title end time point, the end-credits start time point and the easter-egg start time point of a video. The detection method requires neither image template matching nor manual operation, so the point recognition is flexible and efficient.

本发明提供的片头检测方法对于自制剧中先播放正片后播放片头的情况也能对片头进行准确识别。The title detection method provided by the present invention can also accurately identify the title sequence for self-produced dramas in which the main feature is played before the title sequence.

本发明提供的片尾检测方法，首先将电影视频文件进行3D卷积网络特征提取，通过训练是否为正片还是片尾的分类网络，对视频画面进行识别，将正片结束时间点位作为片尾开始时间点位，对正片结束时间点位附近画面进行文字检测，将开始显示有多个文本框的时间点位作为片尾开始时间点位。In the end-credits detection method provided by the present invention, 3D convolutional network features are first extracted from the movie video file, and a classification network trained to distinguish main-feature segments from ending segments identifies the video pictures. The end time point of the main feature is taken as the end-credits start time point; text detection is then performed on the pictures near the end time point of the main feature, and the time point at which multiple text boxes start to be displayed is taken as the end-credits start time point.

本发明提供的彩蛋检测方法，可自动地将电影彩蛋开始时间点位检测出来，在第一个正片结束时间点位之后搜索是否还有正片类型的视频片段，如果有，则将第二个正片开始时间点位作为彩蛋开始时间点位。The easter-egg detection method provided by the present invention can automatically detect the start time point of a movie easter egg: after the end time point of the first main-feature segment, it searches for further video segments of the main-feature type, and if one is found, the start time point of the second main-feature segment is taken as the easter-egg start time point.
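The easter-egg search described above can be sketched as a scan over the per-segment labels in playback order (illustrative names; assumes the segments were already classified):

```python
def find_easter_egg_start(labels, starts):
    """labels/starts: per-segment class label and start time, in playback
    order. After the first run of "main" segments ends, the start time of
    the next "main"-type segment (if any) is taken as the easter-egg start."""
    i = 0
    while i < len(labels) and labels[i] != "main":
        i += 1                      # skip to the first main-feature segment
    while i < len(labels) and labels[i] == "main":
        i += 1                      # skip through the first main-feature run
    while i < len(labels):
        if labels[i] == "main":     # a second main-type segment: easter egg
            return starts[i]
        i += 1
    return None                     # no easter egg found
```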

需要说明的是，对于方法实施例，为了简单描述，故将其都表述为一系列的动作组合，但是本领域技术人员应该知悉，本发明实施例并不受所描述的动作顺序的限制，因为依据本发明实施例，某些步骤可以采用其他顺序或者同时进行。其次，本领域技术人员也应该知悉，说明书中所描述的实施例均属于优选实施例，所涉及的动作并不一定是本发明实施例所必须的。It should be noted that, for simplicity of description, the method embodiments are all expressed as a series of action combinations, but those skilled in the art should know that the embodiments of the present invention are not limited by the described sequence of actions, because, according to the embodiments of the present invention, certain steps may be performed in other sequences or simultaneously. Secondly, those skilled in the art should also know that the embodiments described in the specification are all preferred embodiments, and the actions involved are not necessarily required by the embodiments of the present invention.

参照图7,示出了本发明实施例的一种视频检测装置的结构框图,具体可以包括如下模块:Referring to FIG. 7 , a structural block diagram of a video detection apparatus according to an embodiment of the present invention is shown, which may specifically include the following modules:

第一获取及确定模块701，用于获取视频文件，并确定针对所述视频文件的检测目标；所述检测目标包括片头结束标志信息、片尾开始标志信息和片尾结束标志信息中的至少一种；The first acquisition and determination module 701 is configured to acquire a video file and determine a detection target for the video file; the detection target includes at least one of title end marker information, end-credits start marker information and end-credits end marker information;

第二获取及确定模块702,用于从所述视频文件中获取连续的多个视频片段,并分别确定所述多个视频片段的多个音视频特征信息;A second obtaining and determiningmodule 702, configured to obtain a plurality of consecutive video clips from the video file, and respectively determine a plurality of audio and video feature information of the plurality of video clips;

输入及输出模块703,用于将所述多个音视频特征信息分别输入预先训练的分类模型中,获得对应的多个输出结果;The input andoutput module 703 is used for inputting the plurality of audio and video feature information into the pre-trained classification model, respectively, to obtain a plurality of corresponding output results;

第一确定模块704,用于根据所述多个输出结果,从所述多个视频片段中确定候选视频片段;a first determiningmodule 704, configured to determine candidate video segments from the multiple video segments according to the multiple output results;

文字识别模块705,用于对所述候选视频片段的图像帧进行文字识别,获得文字识别结果;Atext recognition module 705, configured to perform text recognition on the image frames of the candidate video clips to obtain a text recognition result;

第二确定模块706,用于根据所述文字识别结果确定所述检测目标所在的目标图像帧。The second determiningmodule 706 is configured to determine the target image frame where the detection target is located according to the character recognition result.

在本发明实施例中,所述输入及输出模块,包括:In an embodiment of the present invention, the input and output module includes:

输入及输出子模块，用于将所述多个音视频特征信息分别输入所述分类模型中，获得对应的多个置信度结果；所述置信度结果用于表示对应的视频片段属于片头片段/正片片段/片尾片段的置信度。The input and output submodule is configured to input the multiple pieces of audio-video feature information into the classification model respectively to obtain multiple corresponding confidence results; a confidence result is used to indicate the confidence that the corresponding video segment belongs to a title segment, a main-feature segment or an ending segment.

在本发明实施例中,所述分类模型通过以下模块训练:In the embodiment of the present invention, the classification model is trained by the following modules:

获取模块，用于获取用于训练的样本视频片段集；所述样本视频片段集包括连续的多个样本视频片段；所述多个样本视频片段分别标注的片段类型为片头片段或正片片段或片尾片段；The acquisition module is configured to acquire a sample video segment set for training; the sample video segment set includes multiple consecutive sample video segments, each of which is labeled with a segment type of title segment, main-feature segment or ending segment;

第三确定模块,用于分别确定所述多个样本视频片段的多个样本音视频特征信息;a third determining module, configured to respectively determine a plurality of sample audio and video feature information of the plurality of sample video clips;

模型训练模块,用于使用所述多个样本音视频特征信息进行模型训练,得到用于识别片头片段/正片片段/片尾片段的所述分类模型。A model training module, configured to perform model training using the audio and video feature information of the plurality of samples to obtain the classification model for identifying the introductory clip/main clip/outro clip.

在本发明实施例中,所述第二获取及确定模块,包括:In this embodiment of the present invention, the second obtaining and determining module includes:

特征提取及合并子模块，用于针对各个视频片段，采用预先训练的超分辨率测试序列VGG模型提取对应的音频特征信息，以及，采用预先训练的双流膨胀三维卷积网络I3D模型提取对应的视频特征信息，将所述音频特征信息和所述视频特征信息进行合并，得到该视频片段对应的所述音视频特征信息。The feature extraction and merging submodule is configured to, for each video segment, extract the corresponding audio feature information using a pre-trained super-resolution test sequence VGG model, extract the corresponding video feature information using a pre-trained two-stream inflated 3D convolutional network I3D model, and merge the audio feature information and the video feature information to obtain the audio-video feature information corresponding to that video segment.

在本发明实施例中,所述特征提取及合并子模块,包括:In an embodiment of the present invention, the feature extraction and merging submodules include:

注意力计算单元,用于基于移位注意力机制分别对所述音频特征信息和所述视频特征信息进行注意力计算,得到对应的注意力音频特征信息和注意力视频特征信息;an attention calculation unit, configured to perform attention calculation on the audio feature information and the video feature information based on the shift attention mechanism, respectively, to obtain corresponding attention audio feature information and attention video feature information;

拼接单元,用于将所述注意力音频特征信息和所述注意力视频特征信息进行拼接,得到对应的所述音视频特征信息。A splicing unit, configured to splicing the attention audio feature information and the attention video feature information to obtain the corresponding audio and video feature information.
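The patent does not spell out the shift attention computation, so the following is only a generic stand-in for the attention-then-splice flow the two units describe: score each time step, softmax over time, pool, then concatenate the two modalities. `attention_pool`, `fuse` and the weight vectors are hypothetical, not the patented mechanism.

```python
import numpy as np

def attention_pool(features, w):
    """Rough stand-in for the per-modality attention step: score each time
    step with a learned vector `w` (shape (D,)), softmax the scores over
    time, and take the weighted sum of the (T, D) feature sequence."""
    scores = features @ w                    # (T,)
    weights = np.exp(scores - scores.max())  # numerically stable softmax
    weights /= weights.sum()
    return weights @ features                # (D,) attended feature

def fuse(audio_feats, video_feats, w_a, w_v):
    """Splice the attended audio and video features into the joint
    audio-video feature vector fed to the classification model."""
    return np.concatenate([attention_pool(audio_feats, w_a),
                           attention_pool(video_feats, w_v)])
```

With a zero scoring vector the softmax weights are uniform, so the pooled result reduces to the plain mean over time, which is a convenient sanity check.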

在本发明实施例中,所述第一确定模块,包括:In this embodiment of the present invention, the first determining module includes:

比较子模块，用于将所述多个置信度结果分别与预设置信度阈值比较，获得对应的多个比较结果；The comparison submodule is configured to compare the multiple confidence results with a preset confidence threshold respectively to obtain multiple corresponding comparison results;

第一确定子模块,用于根据所述多个比较结果,从所述多个视频片段中确定所述候选视频片段。The first determination submodule is configured to determine the candidate video segment from the multiple video segments according to the multiple comparison results.

在本发明实施例中，所述候选视频片段包括用于查找所述检测目标为所述片头结束标志信息的第一候选视频片段，所述第一确定子模块，包括：In this embodiment of the present invention, the candidate video segments include a first candidate video segment used to find the title end marker information when the detection target is the title end marker information, and the first determination submodule includes:

第一分类单元，用于若所述检测目标为所述片头结束标志信息，则根据所述多个比较结果，分别将所述多个视频片段分类为片头片段和正片片段；The first classification unit is configured to, if the detection target is the title end marker information, classify the multiple video segments into title segments and main-feature segments respectively according to the multiple comparison results;

第一确定单元，用于若所述多个视频片段中存在播放次序相邻的一个片头片段和一个正片片段，且所述一个正片片段在所述一个片头片段播放完之后播放，则将所述一个片头片段和所述一个正片片段确定为所述第一候选视频片段。The first determination unit is configured to, if the multiple video segments contain a title segment and a main-feature segment that are adjacent in playback order, and the main-feature segment is played after the title segment finishes, determine the title segment and the main-feature segment as the first candidate video segment.

在本发明实施例中，所述候选视频片段包括用于查找所述检测目标为所述片尾开始标志信息的第二候选视频片段，所述第一确定子模块，包括：In this embodiment of the present invention, the candidate video segments include a second candidate video segment used to find the end-credits start marker information when the detection target is the end-credits start marker information, and the first determination submodule includes:

第二分类单元，用于若所述检测目标为所述片尾开始标志信息，则根据所述多个比较结果，分别将所述多个视频片段分类为正片片段和片尾片段；The second classification unit is configured to, if the detection target is the end-credits start marker information, classify the multiple video segments into main-feature segments and ending segments respectively according to the multiple comparison results;

第二确定单元，用于若所述多个视频片段中存在播放次序相邻的一个正片片段和一个片尾片段，且所述一个片尾片段在所述一个正片片段播放完之后播放，则将所述一个正片片段和所述一个片尾片段确定为所述第二候选视频片段。The second determination unit is configured to, if the multiple video segments contain a main-feature segment and an ending segment that are adjacent in playback order, and the ending segment is played after the main-feature segment finishes, determine the main-feature segment and the ending segment as the second candidate video segment.

在本发明实施例中，所述候选视频片段包括用于查找所述检测目标为所述片尾结束标志信息的第三候选视频片段，所述第一确定子模块，包括：In this embodiment of the present invention, the candidate video segments include a third candidate video segment used to find the end-credits end marker information when the detection target is the end-credits end marker information, and the first determination submodule includes:

第三分类单元,用于若所述检测目标为所述片尾结束标志信息,则根据所述多个比较结果,分别将所述多个视频片段分类为正片片段和片尾片段;a third classification unit, configured to, if the detection target is the end-of-credits flag information, classify the multiple video clips into a main segment and an ending segment respectively according to the multiple comparison results;

第三确定单元，用于若所述多个视频片段中存在播放次序相邻的一个片尾片段和一个正片片段，且所述一个正片片段在所述一个片尾片段播放完之后播放，则将所述一个片尾片段和所述一个正片片段确定为所述第三候选视频片段。The third determination unit is configured to, if the multiple video segments contain an ending segment and a main-feature segment that are adjacent in playback order, and the main-feature segment is played after the ending segment finishes, determine the ending segment and the main-feature segment as the third candidate video segment.

在本发明实施例中,所述文字识别结果包括对所述第一候选视频片段的图像帧进行文字识别得到的第一文字识别结果,所述第二确定模块,包括:In the embodiment of the present invention, the text recognition result includes a first text recognition result obtained by performing text recognition on the image frame of the first candidate video segment, and the second determination module includes:

第二确定子模块，用于将所述第一文字识别结果中的文本内容与预设关键词进行匹配，并在匹配到包含所述预设关键词的图像帧之后，对包含所述预设关键词的图像帧进行跟踪，将跟踪到的最后一个包含所述预设关键词的图像帧确定为所述片头结束标志信息所在的所述目标图像帧。The second determination submodule is configured to match the text content in the first text recognition result against the preset keywords, and, after an image frame containing the preset keyword is matched, track the image frames containing the preset keyword and determine the last tracked image frame containing the preset keyword as the target image frame where the title end marker information is located.

在本发明实施例中,所述文字识别结果包括对所述第二候选视频片段的图像帧进行文字识别得到的第二文字识别结果,所述第二确定模块,包括:In the embodiment of the present invention, the text recognition result includes a second text recognition result obtained by performing text recognition on the image frame of the second candidate video segment, and the second determination module includes:

第三确定子模块，用于按时间顺序遍历所述第二候选视频片段中各个图像帧对应的第二文字识别结果，若连续的多个图像帧对应的第二文字识别结果中的文本框数量大于预设的数量阈值，则将第二文字识别结果中的文本框数量大于预设的数量阈值的第一个图像帧确定为所述片尾开始标志信息所在的所述目标图像帧。The third determination submodule is configured to traverse, in chronological order, the second text recognition results corresponding to the image frames in the second candidate video segment, and, if the number of text boxes in the second text recognition results of multiple consecutive image frames is greater than a preset number threshold, determine the first image frame whose second text recognition result contains more text boxes than the preset number threshold as the target image frame where the end-credits start marker information is located.

在本发明实施例中,所述文字识别结果包括对所述第三候选视频片段的图像帧进行文字识别得到的第三文字识别结果,所述第二确定模块,包括:In the embodiment of the present invention, the text recognition result includes a third text recognition result obtained by performing text recognition on the image frame of the third candidate video segment, and the second determination module includes:

第四确定子模块，用于若所述第三文字识别结果中包含文本框，则对包含文本框的图像帧进行跟踪，并将跟踪到的最后一个包含文本框的图像帧确定为所述片尾结束标志信息所在的所述目标图像帧。The fourth determination submodule is configured to, if the third text recognition result contains a text box, track the image frames containing a text box and determine the last tracked image frame containing a text box as the target image frame where the end-credits end marker information is located.

在本发明实施例中,所述第二获取及确定模块,包括:In this embodiment of the present invention, the second obtaining and determining module includes:

第五确定子模块,用于根据所述检测目标确定截取时间段;a fifth determination submodule, configured to determine the interception time period according to the detection target;

截取子模块,用于按照所述截取时间段对所述视频文件进行截取,获得截取视频片段;An interception submodule, configured to intercept the video file according to the interception time period to obtain the intercepted video segment;

分割子模块,用于对所述截取视频片段进行平均分割,获得连续的所述多个视频片段。A segmentation sub-module, configured to perform an average segmentation on the intercepted video clips to obtain the multiple consecutive video clips.
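The even division performed by the segmentation submodule can be sketched as follows (illustrative names only):

```python
def split_clip(t_start, t_end, n_segments):
    """Evenly split the clipped interval [t_start, t_end) into n consecutive
    sub-segments, returning (start, end) time pairs in playback order."""
    step = (t_end - t_start) / n_segments
    return [(t_start + i * step, t_start + (i + 1) * step)
            for i in range(n_segments)]
```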

综上，在本发明实施例中，可以通过视频文件中连续的多个视频片段的音视频特征信息，确定在多个视频片段中可能包含检测目标的候选视频片段，并对该候选视频片段的图像帧进行文字识别，确定检测目标所在的目标图像帧。通过采用上述方法，结合视频文件的画面信息和音频信息，利用深度学习模型确定检测目标所在的候选视频片段，并在该片段范围进行文字识别，从而对准确的检测目标所在的图像帧进行定位，可以提高视频片头结束时间点位、片尾开始时间点位和彩蛋开始时间点位的识别精度，该检测方法无需进行图像模板匹配，也无需依赖人工操作，点位识别方式灵活且识别效率高。To sum up, in this embodiment of the present invention, the audio-video feature information of multiple consecutive video segments in a video file can be used to determine candidate video segments that may contain the detection target, and text recognition can be performed on the image frames of the candidate video segments to determine the target image frame where the detection target is located. With the above method, the picture information and audio information of the video file are combined, a deep learning model determines the candidate video segment where the detection target is located, and text recognition is performed within that segment, so that the image frame where the detection target is located is accurately positioned. This improves the recognition accuracy of the title end time point, the end-credits start time point and the easter-egg start time point of a video. The detection method requires neither image template matching nor manual operation, so the point recognition is flexible and efficient.

对于装置实施例而言,由于其与方法实施例基本相似,所以描述的比较简单,相关之处参见方法实施例的部分说明即可。As for the apparatus embodiment, since it is basically similar to the method embodiment, the description is relatively simple, and reference may be made to the partial description of the method embodiment for related parts.

本发明实施例还提供了一种电子设备，包括：处理器、存储器及存储在所述存储器上并能够在所述处理器上运行的计算机程序，所述计算机程序被所述处理器执行时实现上述一种视频检测方法实施例的各个过程，且能达到相同的技术效果，为避免重复，这里不再赘述。An embodiment of the present invention further provides an electronic device, including a processor, a memory, and a computer program stored in the memory and executable on the processor. When the computer program is executed by the processor, each process of the above video detection method embodiment is implemented and the same technical effects can be achieved; to avoid repetition, details are not repeated here.

本发明实施例还提供了一种计算机可读存储介质，计算机可读存储介质上存储计算机程序，所述计算机程序被处理器执行时实现上述一种视频检测方法实施例的各个过程，且能达到相同的技术效果，为避免重复，这里不再赘述。An embodiment of the present invention further provides a computer-readable storage medium on which a computer program is stored. When the computer program is executed by a processor, each process of the above video detection method embodiment is implemented and the same technical effects can be achieved; to avoid repetition, details are not repeated here.


本说明书中的各个实施例均采用递进的方式描述,每个实施例重点说明的都是与其他实施例的不同之处,各个实施例之间相同相似的部分互相参见即可。The various embodiments in this specification are described in a progressive manner, and each embodiment focuses on the differences from other embodiments, and the same and similar parts between the various embodiments may be referred to each other.

本领域内的技术人员应明白，本发明实施例的实施例可提供为方法、装置、或计算机程序产品。因此，本发明实施例可采用完全硬件实施例、完全软件实施例、或结合软件和硬件方面的实施例的形式。而且，本发明实施例可采用在一个或多个其中包含有计算机可用程序代码的计算机可用存储介质（包括但不限于磁盘存储器、CD-ROM、光学存储器等）上实施的计算机程序产品的形式。Those skilled in the art should understand that the embodiments of the present invention may be provided as a method, an apparatus, or a computer program product. Accordingly, the embodiments of the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the embodiments of the present invention may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, and the like) containing computer-usable program code.

本发明实施例是参照根据本发明实施例的方法、终端设备（系统）、和计算机程序产品的流程图和/或方框图来描述的。应理解可由计算机程序指令实现流程图和/或方框图中的每一流程和/或方框、以及流程图和/或方框图中的流程和/或方框的结合。可提供这些计算机程序指令到通用计算机、专用计算机、嵌入式处理机或其他可编程数据处理终端设备的处理器以产生一个机器，使得通过计算机或其他可编程数据处理终端设备的处理器执行的指令产生用于实现在流程图一个流程或多个流程和/或方框图一个方框或多个方框中指定的功能的装置。The embodiments of the present invention are described with reference to the flowcharts and/or block diagrams of the methods, terminal devices (systems), and computer program products according to the embodiments of the present invention. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks therein, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general-purpose computer, a special-purpose computer, an embedded processor or other programmable data processing terminal device to produce a machine, so that the instructions executed by the processor of the computer or other programmable data processing terminal device produce an apparatus for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

这些计算机程序指令也可存储在能引导计算机或其他可编程数据处理终端设备以特定方式工作的计算机可读存储器中，使得存储在该计算机可读存储器中的指令产生包括指令装置的制造品，该指令装置实现在流程图一个流程或多个流程和/或方框图一个方框或多个方框中指定的功能。These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing terminal device to operate in a particular manner, so that the instructions stored in the computer-readable memory produce an article of manufacture including an instruction apparatus that implements the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

这些计算机程序指令也可装载到计算机或其他可编程数据处理终端设备上，使得在计算机或其他可编程终端设备上执行一系列操作步骤以产生计算机实现的处理，从而在计算机或其他可编程终端设备上执行的指令提供用于实现在流程图一个流程或多个流程和/或方框图一个方框或多个方框中指定的功能的步骤。These computer program instructions may also be loaded onto a computer or other programmable data processing terminal device, so that a series of operation steps are performed on the computer or other programmable terminal device to produce computer-implemented processing, and the instructions executed on the computer or other programmable terminal device thus provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

尽管已描述了本发明实施例的优选实施例,但本领域内的技术人员一旦得知了基本创造性概念,则可对这些实施例做出另外的变更和修改。所以,所附权利要求意欲解释为包括优选实施例以及落入本发明实施例范围的所有变更和修改。Although preferred embodiments of the embodiments of the present invention have been described, additional changes and modifications to these embodiments may be made by those skilled in the art once the basic inventive concepts are known. Therefore, the appended claims are intended to be construed to include the preferred embodiments as well as all changes and modifications that fall within the scope of the embodiments of the present invention.

最后，还需要说明的是，在本文中，诸如第一和第二等之类的关系术语仅仅用来将一个实体或者操作与另一个实体或操作区分开来，而不一定要求或者暗示这些实体或操作之间存在任何这种实际的关系或者顺序。而且，术语"包括"、"包含"或者其任何其他变体意在涵盖非排他性的包含，从而使得包括一系列要素的过程、方法、物品或者终端设备不仅包括那些要素，而且还包括没有明确列出的其他要素，或者是还包括为这种过程、方法、物品或者终端设备所固有的要素。在没有更多限制的情况下，由语句"包括一个……"限定的要素，并不排除在包括所述要素的过程、方法、物品或者终端设备中还存在另外的相同要素。Finally, it should also be noted that, in this document, relational terms such as first and second are used only to distinguish one entity or operation from another, and do not necessarily require or imply any such actual relationship or order between these entities or operations. Moreover, the terms "comprise", "include" or any other variant thereof are intended to cover non-exclusive inclusion, so that a process, method, article or terminal device including a series of elements includes not only those elements but also other elements not expressly listed, or also includes elements inherent to such a process, method, article or terminal device. Without further limitation, an element defined by the phrase "comprising a ..." does not preclude the presence of additional identical elements in the process, method, article or terminal device including said element.

以上对本发明所提供的一种视频检测方法、一种视频检测装置、一种电子设备和一种计算机可读存储介质，进行了详细介绍，本文中应用了具体个例对本发明的原理及实施方式进行了阐述，以上实施例的说明只是用于帮助理解本发明的方法及其核心思想；同时，对于本领域的一般技术人员，依据本发明的思想，在具体实施方式及应用范围上均会有改变之处，综上所述，本说明书内容不应理解为对本发明的限制。The video detection method, video detection apparatus, electronic device and computer-readable storage medium provided by the present invention have been described in detail above. Specific examples are used herein to explain the principles and implementations of the present invention, and the description of the above embodiments is only intended to help understand the method of the present invention and its core idea. Meanwhile, for those of ordinary skill in the art, there will be changes in the specific implementations and application scope based on the idea of the present invention. In summary, the contents of this specification should not be construed as limiting the present invention.

Claims (16)

1.一种视频检测方法,其特征在于,所述方法包括:1. a video detection method, is characterized in that, described method comprises:获取视频文件,并确定针对所述视频文件的检测目标;所述检测目标包括片头结束标志信息、片尾开始标志信息和片尾结束标志信息中的至少一种;Acquiring a video file, and determining a detection target for the video file; the detection target includes at least one of the end of credit information, the beginning of credit information and the end of credit information;从所述视频文件中获取连续的多个视频片段,并分别确定所述多个视频片段的多个音视频特征信息;Acquire a plurality of consecutive video clips from the video file, and respectively determine a plurality of audio and video feature information of the plurality of video clips;将所述多个音视频特征信息分别输入预先训练的分类模型中,获得对应的多个输出结果;Inputting the plurality of audio and video feature information into a pre-trained classification model, respectively, to obtain a plurality of corresponding output results;根据所述多个输出结果,从所述多个视频片段中确定候选视频片段;determining candidate video segments from the plurality of video segments according to the plurality of output results;对所述候选视频片段的图像帧进行文字识别,获得文字识别结果;Perform text recognition on the image frames of the candidate video clips to obtain a text recognition result;根据所述文字识别结果确定所述检测目标所在的目标图像帧。The target image frame where the detection target is located is determined according to the character recognition result.2.根据权利要求1所述的方法,其特征在于,所述将所述多个音视频特征信息分别输入预先训练的分类模型中,获得对应的多个输出结果,包括:2. The method according to claim 1, wherein the plurality of audio and video feature information are respectively input into a pre-trained classification model to obtain a plurality of corresponding output results, comprising:将所述多个音视频特征信息分别输入所述分类模型中,获得对应的多个置信度结果;所述置信度结果用于表示对应的视频片段属于片头片段/正片片段/片尾片段的置信度。Inputting the plurality of audio and video feature information into the classification model, respectively, to obtain a plurality of corresponding confidence results; the confidence results are used to indicate the confidence that the corresponding video segment belongs to the introductory clip/main clip/outer clip .3.根据权利要求1所述的方法,其特征在于,所述分类模型通过以下方式训练:3. 
The method of claim 1, wherein the classification model is trained in the following manner:获取用于训练的样本视频片段集;所述样本视频片段集包括连续的多个样本视频片段;所述多个样本视频片段分别标注的片段类型为片头片段或正片片段或片尾片段;Obtaining a sample video clip set for training; the sample video clip set includes a plurality of consecutive sample video clips; the clip types marked respectively by the plurality of sample video clips are introductory clips, main clips or end clips;分别确定所述多个样本视频片段的多个样本音视频特征信息;respectively determining a plurality of sample audio and video feature information of the plurality of sample video clips;使用所述多个样本音视频特征信息进行模型训练,得到用于识别片头片段/正片片段/片尾片段的所述分类模型。Model training is performed using the audio and video feature information of the plurality of samples to obtain the classification model for identifying the introductory clip/main clip/outro clip.4.根据权利要求1所述的方法,其特征在于,所述分别确定所述多个视频片段的多个音视频特征信息,包括:4. The method according to claim 1, wherein said determining a plurality of audio and video feature information of the plurality of video clips respectively comprises:针对各个视频片段,采用预先训练的超分辨率测试序列VGG模型提取对应的音频特征信息,以及,采用预先训练的双流膨胀三维卷积网络I3D模型提取对应的视频特征信息,将所述音频特征信息和所述视频特征信息进行合并,得到该视频片段对应的所述音视频特征信息。For each video segment, the pre-trained super-resolution test sequence VGG model is used to extract the corresponding audio feature information, and the pre-trained dual-stream dilated three-dimensional convolutional network I3D model is used to extract the corresponding video feature information, and the audio feature information is extracted. Combine with the video feature information to obtain the audio and video feature information corresponding to the video clip.5.根据权利要求4所述的方法,其特征在于,所述将所述音频特征信息和所述视频特征信息进行合并,得到该视频片段对应的所述音视频特征信息,包括:5. 
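Claims 4 and 5 merge a per-segment audio embedding (VGG) and video embedding (I3D) by attending to each modality and then concatenating. The patent does not spell out the shift-attention computation, so the sketch below substitutes a toy softmax self-attention as a stand-in; the tiny embedding sizes and all function names are illustrative assumptions, not the patent's implementation.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attend(features):
    """Toy self-attention pooling: weight each component of a feature
    vector by a softmax over the vector itself. This is only a stand-in
    for the patent's shift-attention step, whose exact form is not given."""
    weights = softmax(features)
    return [w * f for w, f in zip(weights, features)]

def fuse(audio_feat, video_feat):
    """Claims 4-5: attend to each modality separately, then concatenate."""
    return attend(audio_feat) + attend(video_feat)

audio = [0.2, 1.5, -0.3]   # in practice e.g. a 128-d VGGish-style embedding
video = [0.9, 0.1]         # in practice e.g. a 1024-d I3D embedding
fused = fuse(audio, video)
assert len(fused) == len(audio) + len(video)
```

The fused vector is what claim 1 feeds into the three-way classifier; keeping the two attended vectors separate until the final concatenation matches the claim's ordering (attention first, splice second).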
5. The method according to claim 4, wherein merging the audio feature information and the video feature information to obtain the audio-video feature information corresponding to the video segment comprises:
performing attention computation on the audio feature information and the video feature information respectively based on a shift-attention mechanism to obtain corresponding attention audio feature information and attention video feature information; and
concatenating the attention audio feature information and the attention video feature information to obtain the corresponding audio-video feature information.
6. The method according to claim 2, wherein determining the candidate video segment from the plurality of video segments according to the plurality of output results comprises:
comparing the plurality of confidence results respectively with a preset confidence threshold to obtain a plurality of corresponding comparison results; and
determining the candidate video segment from the plurality of video segments according to the plurality of comparison results.
7. The method according to claim 6, wherein the candidate video segment comprises a first candidate video segment used to locate the detection target when the detection target is the opening-credits end marker information, and determining the candidate video segment from the plurality of video segments according to the plurality of comparison results comprises:
if the detection target is the opening-credits end marker information, classifying the plurality of video segments into opening-credits segments and main-feature segments according to the plurality of comparison results; and
if the plurality of video segments contain an opening-credits segment and a main-feature segment that are adjacent in playback order, with the main-feature segment played after the opening-credits segment, determining that opening-credits segment and that main-feature segment as the first candidate video segment.
8. The method according to claim 6, wherein the candidate video segment comprises a second candidate video segment used to locate the detection target when the detection target is the closing-credits start marker information, and determining the candidate video segment from the plurality of video segments according to the plurality of comparison results comprises:
if the detection target is the closing-credits start marker information, classifying the plurality of video segments into main-feature segments and closing-credits segments according to the plurality of comparison results; and
if the plurality of video segments contain a main-feature segment and a closing-credits segment that are adjacent in playback order, with the closing-credits segment played after the main-feature segment, determining that main-feature segment and that closing-credits segment as the second candidate video segment.
9. The method according to claim 6, wherein the candidate video segment comprises a third candidate video segment used to locate the detection target when the detection target is the closing-credits end marker information, and determining the candidate video segment from the plurality of video segments according to the plurality of comparison results comprises:
if the detection target is the closing-credits end marker information, classifying the plurality of video segments into main-feature segments and closing-credits segments according to the plurality of comparison results; and
if the plurality of video segments contain a closing-credits segment and a main-feature segment that are adjacent in playback order, with the main-feature segment played after the closing-credits segment, determining that closing-credits segment and that main-feature segment as the third candidate video segment.
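Claims 6-9 reduce to a simple pattern: threshold the per-segment confidences into segment types, then scan consecutive segments for the type transition that brackets the marker being sought (opening-credits to main-feature for the title end, and so on). A minimal sketch, with the threshold value and label names assumed for illustration:

```python
def classify(confidences, threshold=0.5, positive="intro", negative="main"):
    """Claim 6: a segment whose confidence meets the preset threshold is
    labelled with the positive type, otherwise with the negative type."""
    return [positive if c >= threshold else negative for c in confidences]

def find_boundary_pair(labels, first="intro", second="main"):
    """Claims 7-9: return the index of the first adjacent (first, second)
    pair in playback order, or None if no such pair exists. The pair of
    segments at that index is the candidate video segment."""
    for i in range(len(labels) - 1):
        if labels[i] == first and labels[i + 1] == second:
            return i
    return None

conf = [0.95, 0.90, 0.30, 0.10]   # confidences of 4 consecutive segments
labels = classify(conf)            # ['intro', 'intro', 'main', 'main']
pair = find_boundary_pair(labels)  # segments 1 and 2 straddle the title end
assert pair == 1
```

Swapping the `first`/`second` labels gives the claim 8 case (main-feature followed by closing credits) and the claim 9 case (closing credits followed by main-feature, e.g. a mid-credits scene).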
10. The method according to claim 7, wherein the text recognition result comprises a first text recognition result obtained by performing text recognition on the image frames of the first candidate video segment, and determining the target image frame in which the detection target is located according to the text recognition result comprises:
matching the text content in the first text recognition result against a preset keyword and, after an image frame containing the preset keyword is matched, tracking the image frames containing the preset keyword, and determining the last tracked image frame containing the preset keyword as the target image frame in which the opening-credits end marker information is located.
11. The method according to claim 8, wherein the text recognition result comprises a second text recognition result obtained by performing text recognition on the image frames of the second candidate video segment, and determining the target image frame in which the detection target is located according to the text recognition result comprises:
traversing, in chronological order, the second text recognition results corresponding to the image frames of the second candidate video segment; and, if the number of text boxes in the second text recognition results of a plurality of consecutive image frames is greater than a preset count threshold, determining the first image frame whose second text recognition result contains more text boxes than the preset count threshold as the target image frame in which the closing-credits start marker information is located.
12. The method according to claim 9, wherein the text recognition result comprises a third text recognition result obtained by performing text recognition on the image frames of the third candidate video segment, and determining the target image frame in which the detection target is located according to the text recognition result comprises:
if the third text recognition result contains a text box, tracking the image frames containing a text box, and determining the last tracked image frame containing a text box as the target image frame in which the closing-credits end marker information is located.
13. The method according to claim 1, wherein acquiring the plurality of consecutive video segments from the video file comprises:
determining an interception time period according to the detection target;
intercepting the video file according to the interception time period to obtain an intercepted video clip; and
splitting the intercepted video clip evenly to obtain the plurality of consecutive video segments.
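Claims 10-12 locate the exact frame inside a candidate segment from OCR output alone: claim 10 tracks the last frame still showing a preset keyword (the title card disappearing), while claim 11 looks for the first frame of a sustained burst of text boxes (rolling credits starting). A hedged sketch with the OCR step abstracted to plain strings and box counts; the run length and thresholds are illustrative assumptions:

```python
def last_frame_with_keyword(frames_text, keyword):
    """Claim 10: return the index of the last frame whose recognised
    text contains the preset keyword, or None if it never appears."""
    hits = [i for i, texts in enumerate(frames_text)
            if any(keyword in t for t in texts)]
    return hits[-1] if hits else None

def first_frame_with_many_boxes(frame_box_counts, min_boxes, min_run=3):
    """Claim 11: find the first frame of a run of at least `min_run`
    consecutive frames whose text-box count exceeds `min_boxes`
    (rolling credits produce many boxes at once)."""
    run_start = None
    run_len = 0
    for i, n in enumerate(frame_box_counts):
        if n > min_boxes:
            run_start = i if run_len == 0 else run_start
            run_len += 1
            if run_len >= min_run:
                return run_start
        else:
            run_len = 0
    return None

frames = [["recap"], ["Episode 1", "director"], ["Episode 1"], ["scene"]]
assert last_frame_with_keyword(frames, "Episode 1") == 2

counts = [0, 1, 6, 7, 8, 2]
assert first_frame_with_many_boxes(counts, min_boxes=4) == 2
```

Claim 12's closing-credits end check follows the same shape as `last_frame_with_keyword`, but with "contains any text box at all" in place of the keyword match.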
14. A video detection apparatus, characterized in that the apparatus comprises:
a first acquisition and determination module, configured to acquire a video file and determine a detection target for the video file, the detection target comprising at least one of opening-credits end marker information, closing-credits start marker information, and closing-credits end marker information;
a second acquisition and determination module, configured to acquire a plurality of consecutive video segments from the video file and respectively determine a plurality of pieces of audio-video feature information of the plurality of video segments;
an input and output module, configured to input the plurality of pieces of audio-video feature information respectively into a pre-trained classification model to obtain a plurality of corresponding output results;
a first determination module, configured to determine a candidate video segment from the plurality of video segments according to the plurality of output results;
a text recognition module, configured to perform text recognition on image frames of the candidate video segment to obtain a text recognition result; and
a second determination module, configured to determine, according to the text recognition result, the target image frame in which the detection target is located.
15. An electronic device, characterized by comprising a processor, a memory, and a computer program stored in the memory and executable on the processor, wherein the computer program, when executed by the processor, implements the steps of the video detection method according to any one of claims 1-13.
16. A computer-readable storage medium, characterized in that a computer program is stored on the computer-readable storage medium, and the computer program, when executed by a processor, implements the steps of the video detection method according to any one of claims 1-13.
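The segment-acquisition step of claim 13 (intercept a time window chosen according to the detection target, then split the clip evenly into consecutive segments) can be sketched as follows; the 600-second window, the target names, and the segment count are assumptions for illustration only:

```python
def interception_window(duration_s, target):
    """Choose the interception time period by detection target: search the
    opening minutes for the opening-credits end, the closing minutes for
    the closing-credits markers. The 600 s window is an assumed value."""
    if target == "opening_end":
        return (0.0, min(600.0, duration_s))
    return (max(0.0, duration_s - 600.0), duration_s)

def split_evenly(start_s, end_s, n_segments):
    """Claim 13: divide the intercepted clip into n equal, consecutive
    (start, end) pieces, in playback order."""
    step = (end_s - start_s) / n_segments
    return [(start_s + i * step, start_s + (i + 1) * step)
            for i in range(n_segments)]

start, end = interception_window(3600.0, "closing_start")  # 1-hour file
segments = split_evenly(start, end, 10)                    # ten 60 s segments
assert len(segments) == 10
assert segments[0][0] == 3000.0 and segments[-1][1] == 3600.0
```

Each (start, end) pair is then decoded, featurised per claims 4-5, and classified per claim 2.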
CN202210753941.7A | Priority date: 2022-06-29 | Filing date: 2022-06-29 | Video detection method and device, electronic equipment and storage medium | Status: Pending | Published as CN115035509A (en)

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN202210753941.7A | 2022-06-29 | 2022-06-29 | Video detection method and device, electronic equipment and storage medium


Publications (1)

Publication Number | Publication Date
CN115035509A | 2022-09-09

Family

ID=83126799

Family Applications (1)

Application Number | Title | Priority Date | Filing Date
CN202210753941.7A (Pending) | Video detection method and device, electronic equipment and storage medium | 2022-06-29 | 2022-06-29

Country Status (1)

Country | Link
CN | CN115035509A (en)

Cited By (3)

* Cited by examiner, † Cited by third party

Publication number | Priority date | Publication date | Assignee | Title
CN115690804A * | 2022-11-03 | 2023-02-03 | Wuhan Zhonghaiting Data Technology Co., Ltd. | Method and system for constructing a text recognition model
CN116017036A * | 2022-12-27 | 2023-04-25 | Beijing QIYI Century Science & Technology Co., Ltd. | Audio and video analysis method and device, computer equipment and storage medium
CN116055816A * | 2023-01-10 | 2023-05-02 | Oriental Pearl New Media Co., Ltd. | Video head and tail detection method and device

Citations (6)

* Cited by examiner, † Cited by third party

Publication number | Priority date | Publication date | Assignee | Title
US9465996B1 * | 2015-09-15 | 2016-10-11 | EchoStar Technologies LLC | Apparatus, systems and methods for control of media content event recording
CN108769731A * | 2018-05-25 | 2018-11-06 | Beijing QIYI Century Science & Technology Co., Ltd. | Method, apparatus and electronic device for detecting a target video segment in a video
CN109257622A * | 2018-11-01 | 2019-01-22 | Guangzhou Baiguoyuan Information Technology Co., Ltd. | Audio/video processing method, device, equipment and medium
CN110532983A * | 2019-09-03 | 2019-12-03 | Beijing ByteDance Network Technology Co., Ltd. | Video processing method, device, medium and equipment
CN112291589A * | 2020-10-29 | 2021-01-29 | Tencent Technology (Shenzhen) Co., Ltd. | Video file structure detection method and device
CN114220057A * | 2021-12-16 | 2022-03-22 | Beijing QIYI Century Science & Technology Co., Ltd. | Video trailer identification method and device, electronic equipment and readable storage medium



Similar Documents

Publication | Title
Chen et al. | Localizing visual sounds the hard way
CN113613065B | Video editing method, apparatus, electronic device, and storage medium
CN114041165B | Method, device and apparatus for video similarity detection
CN115035509A | Video detection method and device, electronic equipment and storage medium
CN108024145B | Video recommendation method, apparatus, computer equipment and storage medium
US10108709B1 | Systems and methods for queryable graph representations of videos
US20200117906A1 | Space-time memory network for locating target object in video content
CN108460396B | Negative sampling method and device
US8719288B2 | Universal lookup of video-related data
CN111523566A | Target video clip positioning method and device
US20130148898A1 | Clustering objects detected in video
CN110234037A | Method and device for generating video clips, computer equipment and readable medium
US8989491B2 | Method and system for preprocessing the region of video containing text
Saba et al. | Analysis of vision based systems to detect real time goal events in soccer videos
SG194442A1 | In-video product annotation with web information mining
CN110198482B | Video key segment marking method, terminal and storage medium
CN111836118B | Video processing method, device, server and storage medium
US11263493B2 | Automatic metadata detector based on images
CN111209431A | Video searching method, device, equipment and medium
CN114332716B | Clustering method and device for scenes in video, electronic equipment and storage medium
US20110274359A1 | Time segment representative feature vector generation device
US11941885B2 | Generating a highlight video from an input video
CN112567416A | Apparatus and method for processing digital video
CN113033677A | Video classification method and device, electronic equipment and storage medium
US9412049B2 | Apparatus and method for recognizing object using correlation between object and content-related information

Legal Events

Code | Title
PB01 | Publication
SE01 | Entry into force of request for substantive examination
