CN114339362B - Video bullet screen matching method, device, computer equipment and storage medium


Info

Publication number
CN114339362B
Authority
CN
China
Prior art keywords
video
feature
target
video frame
matched
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202111494410.2A
Other languages
Chinese (zh)
Other versions
CN114339362A (en)
Inventor
张皓
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN202111494410.2A
Publication of CN114339362A
Application granted
Publication of CN114339362B
Legal status: Active (Current)
Anticipated expiration

Abstract

The application relates to a video bullet screen matching method and device, computer equipment, a storage medium and a computer program product. The method comprises the following steps: acquiring a real-time barrage corresponding to a target video and extracting initial text features corresponding to the real-time barrage; determining a plurality of video segments to be matched corresponding to the real-time barrage from the target video, and acquiring fusion video features corresponding to the video segments to be matched, where the fusion video features are obtained by performing feature fusion on the target video frame feature sequences corresponding to the video segments to be matched; calculating the matching degree between the real-time barrage and each video segment to be matched based on the initial text features and the fusion video features; determining a target video segment from the video segments to be matched based on the matching degree; and establishing an association relation between the real-time barrage and the target video segment, the association relation being used for synchronously playing the real-time barrage when the target video segment is played. By adopting the method, the matching accuracy between the barrage and video segments can be improved.

Description

Video bullet screen matching method, device, computer equipment and storage medium
Technical Field
The present disclosure relates to the field of computer technologies, and in particular to a video bullet screen matching method and apparatus, a computer device, and a storage medium.
Background
With the development of network media, bullet screen technology has emerged. A bullet screen refers to a comment subtitle that pops up while a video is being watched online. Bullet screens allow users to view video comments in real time while watching a video, and are a novel mode of information interaction.
In the conventional technology, a bullet screen is usually matched with a video clip according to its publishing time, and the bullet screen published by a user is displayed in the video clip corresponding to that publishing time. However, by the time a user publishes a bullet screen, the corresponding video plot may already have finished playing, so matching the bullet screen to a video clip based only on the publishing time may cause the bullet screen and the video clip to be mismatched.
Disclosure of Invention
In view of the foregoing, it is desirable to provide a video bullet screen matching method, apparatus, computer device, computer-readable storage medium, and computer program product that can improve matching accuracy of a bullet screen and a video clip.
In one aspect, the present application provides a video bullet screen matching method, where the method includes:
acquiring a real-time barrage corresponding to a target video, and extracting initial text features corresponding to the real-time barrage;
determining a plurality of video segments to be matched corresponding to the real-time barrage from the target video, and acquiring fusion video features corresponding to the video segments to be matched; the fusion video features are obtained by feature fusion of target video frame feature sequences corresponding to video clips to be matched, the target video frame feature sequences are obtained by feature extraction of all video frames in the video clips to be matched, and the target video frame feature sequences comprise target video frame features corresponding to all video frames in the same video clip to be matched;
calculating the matching degree of the real-time barrage and each video segment to be matched based on the initial text features and the fusion video features;
determining target video clips from the video clips to be matched based on the matching degree;
and establishing an association relation between the real-time barrage and the target video clip, wherein the association relation is used for synchronously playing the real-time barrage when the target video clip is played.
In another aspect, the present application further provides a video bullet screen matching device, where the device includes:
the bullet screen processing module is used for acquiring a real-time bullet screen corresponding to a target video and extracting initial text features corresponding to the real-time bullet screen;
the video feature acquisition module is used for determining a plurality of video segments to be matched corresponding to the real-time barrage from the target video and acquiring fusion video features corresponding to the video segments to be matched; the fusion video features are obtained by feature fusion of target video frame feature sequences corresponding to video clips to be matched, the target video frame feature sequences are obtained by feature extraction of all video frames in the video clips to be matched, and the target video frame feature sequences comprise target video frame features corresponding to all video frames in the same video clip to be matched;
the matching degree calculation module is used for calculating the matching degree of the real-time barrage and each video segment to be matched based on the initial text features and the fusion video features;
the target video segment determining module is used for determining target video segments from the video segments to be matched based on the matching degree;
the association relation establishing module is used for establishing the association relation between the real-time barrage and the target video clip, and the association relation is used for synchronously playing the real-time barrage when the target video clip is played.
The application also provides a video bullet screen matching method, which comprises the following steps:
acquiring a real-time barrage corresponding to a target video;
the real-time barrage is sent to a server, so that the server extracts initial text features corresponding to the real-time barrage, a plurality of video segments to be matched corresponding to the real-time barrage are determined from the target video, fusion video features corresponding to all the video segments to be matched are obtained, the matching degree of the real-time barrage and all the video segments to be matched is calculated based on the initial text features and the fusion video features, the target video segments are determined from all the video segments to be matched based on the matching degree, and the association relation between the real-time barrage and the target video segments is established; the fusion video features are obtained by feature fusion of target video frame feature sequences corresponding to video clips to be matched, the target video frame feature sequences are obtained by feature extraction of all video frames in the video clips to be matched, and the target video frame feature sequences comprise target video frame features corresponding to all video frames in the same video clip to be matched;
and acquiring the association relation returned by the server, and synchronously playing the real-time barrage when playing the target video clip based on the association relation.
The application also provides a video bullet screen matching device, which comprises:
the bullet screen acquisition module is used for acquiring a real-time bullet screen corresponding to the target video;
the data matching module is used for sending the real-time barrage to a server so that the server extracts initial text features corresponding to the real-time barrage, determines a plurality of video segments to be matched corresponding to the real-time barrage from the target video, acquires fusion video features corresponding to the video segments to be matched, calculates the matching degree of the real-time barrage and each video segment to be matched respectively based on the initial text features and the fusion video features, determines target video segments from each video segment to be matched based on the matching degree, and establishes the association relation between the real-time barrage and the target video segments; the fusion video features are obtained by feature fusion of target video frame feature sequences corresponding to video clips to be matched, the target video frame feature sequences are obtained by feature extraction of all video frames in the video clips to be matched, and the target video frame feature sequences comprise target video frame features corresponding to all video frames in the same video clip to be matched;
and the bullet screen playing module is used for acquiring the association relation returned by the server and synchronously playing the real-time bullet screen when the target video clip is played based on the association relation.
A computer device comprising a memory storing a computer program and a processor implementing the steps of the video bullet screen matching method described above when the processor executes the computer program.
A computer readable storage medium having stored thereon a computer program which when executed by a processor performs the steps of the video bullet screen matching method described above.
A computer program product comprising a computer program which when executed by a processor performs the steps of the video bullet screen matching method described above.
According to the video barrage matching method and device, the computer equipment and the storage medium, a real-time barrage corresponding to a target video is acquired and initial text features corresponding to the real-time barrage are extracted; a plurality of video segments to be matched corresponding to the real-time barrage are determined from the target video, and fusion video features corresponding to the video segments to be matched are acquired; the fusion video features are obtained by feature fusion of the target video frame feature sequences corresponding to the video segments to be matched, the target video frame feature sequences are obtained by feature extraction of the video frames in the video segments to be matched, and each target video frame feature sequence comprises the target video frame features corresponding to the video frames in the same video segment to be matched; the matching degree of the real-time barrage and each video segment to be matched is calculated based on the initial text features and the fusion video features, a target video segment is determined from the video segments to be matched based on the matching degree, and an association relationship between the real-time barrage and the target video segment is established, the association relationship being used for synchronously playing the real-time barrage when the target video segment is played. In this way, when the real-time barrage most recently published by a user watching the target video is acquired, the target video segment whose content matches the real-time barrage is accurately determined through the matching degree calculated from the text features of the real-time barrage and the video features of the video segments, and the real-time barrage is then played synchronously with the target video segment. This improves the matching accuracy of the barrage and the video segments and ensures that barrages are always accurately matched to and played with the video segments of the target video.
Drawings
FIG. 1 is a diagram of an application environment for a video bullet screen matching method in one embodiment;
FIG. 2 is a flow chart of a video bullet screen matching method according to one embodiment;
FIG. 3 is a schematic diagram of slicing video in one embodiment;
FIG. 4 is a flow diagram of generating a fused video feature in one embodiment;
FIG. 5A is a schematic diagram of feature shifting in one embodiment;
FIG. 5B is a schematic diagram of feature shifting in another embodiment;
FIG. 6 is a flow diagram of feature fusion in one embodiment;
FIG. 7 is a schematic diagram of a text processing model training process in one embodiment;
FIG. 8 is a flowchart of a video bullet screen matching method according to another embodiment;
FIG. 9A is a schematic view of an interface of a bullet screen according to one embodiment;
FIG. 9B is a schematic view of an interface of a bullet screen according to another embodiment;
FIG. 10 is a system frame diagram of a video bullet screen matching method in one embodiment;
FIG. 11 is a block diagram of a video bullet screen matching apparatus in one embodiment;
FIG. 12 is a block diagram of a video bullet screen matching apparatus in one embodiment;
FIG. 13 is an internal block diagram of a computer device in one embodiment;
FIG. 14 is an internal structural diagram of a computer device in one embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application will be further described in detail with reference to the accompanying drawings and examples. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the present application.
Artificial intelligence (AI) is the theory, method, technique and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use knowledge to obtain optimal results. In other words, artificial intelligence is a comprehensive technology of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can react in a way similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that machines have the capabilities of perception, reasoning and decision-making.
Artificial intelligence technology is a comprehensive discipline covering a wide range of fields, including both hardware-level and software-level technologies. Basic artificial intelligence technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operation/interaction systems and mechatronics. Artificial intelligence software technologies mainly include computer vision, speech processing, natural language processing, and machine learning/deep learning.
Computer Vision (CV) is the science of how to make machines "see"; more specifically, it replaces human eyes with cameras and computers to recognize, track and measure targets, and further performs graphic processing so that the computer produces images more suitable for human observation or for transmission to instruments for detection. As a scientific discipline, computer vision studies related theories and technologies in an attempt to build artificial intelligence systems that can acquire information from images or multidimensional data. Computer vision techniques typically include image processing, image recognition, image semantic understanding, image retrieval, OCR, video processing, video semantic understanding, video content/behavior recognition, three-dimensional object reconstruction, 3D techniques, virtual reality, augmented reality, and simultaneous localization and mapping, as well as common biometric recognition techniques such as face recognition and fingerprint recognition.
Machine Learning (ML) is a multi-field interdisciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory and other disciplines. It studies how a computer simulates or implements human learning behavior to acquire new knowledge or skills, and reorganizes existing knowledge structures to continuously improve its own performance. Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent, and it is applied throughout the various fields of artificial intelligence. Machine learning and deep learning typically include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and learning from instruction.
The solution provided in the embodiments of this application involves artificial intelligence technologies such as computer vision and machine learning, and is specifically described through the following embodiments:
the video bullet screen matching method provided by the application can be applied to the application environment shown in fig. 1, in which the playing terminal 102 communicates with the server 104 via a network. A data storage system may store data that the server 104 needs to process; the data storage system may be integrated on the server 104 or may be located on a cloud or other network server. The playing terminal 102 may be, but is not limited to, a desktop computer, a notebook computer, a smart phone, a tablet computer, an Internet of Things device, or a portable wearable device; the Internet of Things device may be a smart television, a smart vehicle device, or the like, and the portable wearable device may be a smart watch, a smart bracelet, a headset, or the like. An application program can be installed on the playing terminal. The application program may be a client installed in the terminal (also called an application client or APP client), that is, a program installed and running in the terminal; it may also be an installation-free application, i.e., an application that can be used without being downloaded and installed, commonly referred to as an applet, which typically runs as a subroutine in a client; it may also be a web application opened through a browser; and so on. The above applications are divided by the functions they provide, and the types of applications may include, but are not limited to, instant messaging applications, audio-visual applications, and the like. The server 104 may be implemented as a stand-alone server, a server cluster composed of multiple servers, or a cloud server.
Both the playing terminal 102 and the server 104 may separately perform the video bullet screen matching method provided in the embodiments of the present application.
For example, the server acquires a real-time barrage corresponding to the target video, extracts initial text features corresponding to the real-time barrage, determines a plurality of video segments to be matched corresponding to the real-time barrage from the target video, acquires fusion video features corresponding to each video segment to be matched, and calculates matching degrees of the real-time barrage and each video segment to be matched based on the initial text features and the fusion video features. The fusion video features are obtained by feature fusion of target video frame feature sequences corresponding to the video clips to be matched, the target video frame feature sequences are obtained by feature extraction of all video frames in the video clips to be matched, and the target video frame feature sequences comprise target video frame features corresponding to all video frames in the same video clips to be matched. Furthermore, the server determines target video clips from the video clips to be matched based on the matching degree, and establishes an association relationship between the real-time barrage and the target video clips, wherein the association relationship is used for synchronously playing the real-time barrage when the target video clips are played.
The playing terminal 102 and the server 104 may also cooperate to perform the video bullet screen matching method provided in the embodiments of the present application. For example, the playing terminal acquires a real-time barrage corresponding to the target video and sends the real-time barrage to the server. The server determines a target video segment from a plurality of video segments to be matched corresponding to the real-time barrage through data processing, and establishes an association relationship between the real-time barrage and the target video segment. The server can send the association relationship to the playing terminal, so that the playing terminal synchronously plays the real-time bullet screen when playing the target video clip.
In one embodiment, as shown in fig. 2, a video bullet screen matching method is provided, described here as being executed by a computer device; it can be understood that the computer device may be the playing terminal 102 shown in fig. 1 or the server 104. In this embodiment, the video bullet screen matching method includes the following steps:
step S202, acquiring a real-time barrage corresponding to the target video, and extracting initial text features corresponding to the real-time barrage.
The target video refers to the video currently being played by the playing terminal. The real-time barrage refers to the latest barrage obtained by the playing terminal in real time, that is, a barrage published in real time by a user watching the video. A user can post comments while watching the video, and any comment posted by a user can be displayed as a sliding subtitle on all the playing terminals that are playing the video, which improves interactivity among viewers. The real-time barrage may be entered by the user through typing, voice, or the like.
The initial text features are obtained by extracting features of the real-time barrage and can reflect text contents of the real-time barrage.
Specifically, the user can watch the target video on the playing terminal and release the barrage at any time. The computer equipment can acquire the real-time barrage published by the user when watching the target video, and extract the characteristics of the real-time barrage to obtain the initial text characteristics corresponding to the real-time barrage.
The computer device may extract initial text features corresponding to the real-time barrage through a machine learning algorithm, e.g., the computer device may extract initial text features corresponding to the real-time barrage through a machine learning model. The computer device may input the real-time barrage into a machine learning model with the output or intermediate data of the machine learning model as the initial text feature. For example, if the machine learning model is a text feature extraction model, the output of the text feature extraction model may be taken as an initial text feature, and if the machine learning model is a text classification model, the output of the feature extraction layer in the text classification model may be taken as an initial text feature, that is, intermediate data of the text classification model may be taken as an initial text feature.
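As an illustration of this step, the following is a minimal Python sketch assuming a generic learned text encoder whose pooled output is taken as the initial text feature. The class name TextEncoder, the vocabulary size and the feature dimension are illustrative assumptions, not details from the patent, which leaves the concrete text model open.

```python
import torch
import torch.nn as nn

class TextEncoder(nn.Module):
    """Illustrative text encoder: token embedding followed by mean pooling.

    A stand-in for whatever text feature extraction model is actually used;
    the vocabulary size and feature dimension are arbitrary assumptions.
    """
    def __init__(self, vocab_size: int = 30000, feat_dim: int = 256):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, feat_dim)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        # token_ids: [batch, seq_len] -> initial text feature: [batch, feat_dim]
        return self.embedding(token_ids).mean(dim=1)

# Usage: encode one real-time bullet-screen comment (token ids are placeholders).
encoder = TextEncoder()
bullet_tokens = torch.tensor([[101, 2023, 3185, 2003, 2307, 102]])
initial_text_feature = encoder(bullet_tokens)   # shape: [1, 256]
```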
Step S204, determining a plurality of video segments to be matched corresponding to the real-time barrage from the target video, and obtaining fusion video features corresponding to the video segments to be matched; the fusion video features are obtained by feature fusion of target video frame feature sequences corresponding to the video clips to be matched, the target video frame feature sequences are obtained by feature extraction of all video frames in the video clips to be matched, and the target video frame feature sequences comprise target video frame features corresponding to all video frames in the same video clips to be matched.
The target video can be segmented into a plurality of video clips, and the video clips to be matched refer to video clips which need to be matched with the real-time barrage. Each video segment of the target video can be used as a plurality of video segments to be matched corresponding to the real-time barrage. And the video clips containing the adjacent video frames corresponding to the real-time barrage can be used as video clips to be matched corresponding to the real-time barrage from each video clip of the target video. The adjacent video frames corresponding to the real-time barrage refer to video frames with a time distance between the video frame playing time and the barrage publishing time of the real-time barrage smaller than a preset time distance, that is, a plurality of video frames continuously played before and after the barrage publishing time of the real-time barrage can be respectively used as the adjacent video frames corresponding to the real-time barrage. The video segment formed by at least one adjacent video frame corresponding to the real-time barrage can also be used as a video segment to be matched corresponding to the real-time barrage, for example, ten video frames played last before the barrage publishing time of the real-time barrage and ten video frames played first after the barrage publishing time of the real-time barrage are obtained as adjacent video frames corresponding to the real-time barrage, each adjacent video frame is ordered according to the video frame time stamp, and four adjacent video frames are sequentially used as one video segment to be matched according to the ordering result, so that five video segments to be matched are obtained.
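As a sketch of the example above, the following Python snippet groups the frames nearest the barrage publishing time into candidate clips (ten frames before, ten after, four frames per clip). The function name, frame rate and timestamps are hypothetical and only illustrate the grouping logic.

```python
def candidate_clips(frame_timestamps, publish_time, n_before=10, n_after=10, clip_len=4):
    """Group the frames adjacent to the barrage publishing time into candidate
    clips to be matched (following the 10-before / 10-after / groups-of-4 example)."""
    before = [t for t in frame_timestamps if t <= publish_time][-n_before:]
    after = [t for t in frame_timestamps if t > publish_time][:n_after]
    adjacent = sorted(before + after)
    # Slice the time-ordered adjacent frames into fixed-length clips.
    return [adjacent[i:i + clip_len] for i in range(0, len(adjacent), clip_len)]

# 20 adjacent frames grouped by 4 -> 5 candidate clips to be matched.
timestamps = [i / 25.0 for i in range(500)]        # assume 25 fps, 20 s of video
clips = candidate_clips(timestamps, publish_time=10.0)
print(len(clips))  # 5
```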
It may be understood that each video segment of the target video may include at least one video frame, and the video segments may be randomly divided, or may be obtained by using a specific video frame in the target video as a slicing video frame, and performing video slicing on the target video based on the slicing video frame. The special video frames may specifically include at least one of black frames, scene-cut frames, and the like in the target video. The computer device may identify a particular video frame in the target video based on a custom algorithm or formula.
The target video frame characteristic sequence comprises target video frame characteristics corresponding to each video frame in the same video segment to be matched. The target video frame features in the target video frame feature sequence may be ordered or unordered. And extracting the characteristics of each video frame in the video segments to be matched, and obtaining a target video frame characteristic sequence based on the extracted characteristics of each video frame. The video frame features obtained through feature extraction can be directly used as target video frame features, each target video frame feature is combined to obtain a target video frame feature sequence, for example, three-dimensional convolution processing is performed on video frames in video clips to be matched to obtain target video frame features corresponding to each video frame, and each target video frame feature is combined to obtain the target video frame feature sequence. The video frame features obtained through feature extraction can be used as initial video frame features, further feature processing is performed on the initial video frame features to obtain target video frame features, each target video frame feature is combined to obtain a target video frame feature sequence, for example, feature extraction is performed on each video frame in a video segment to be matched to obtain initial video frame features, feature shift is performed on each initial video frame feature to obtain intermediate video frame features corresponding to each video frame, two-dimensional convolution processing is performed on each intermediate video frame feature to obtain target video frame features corresponding to each video frame, and each target video frame feature is combined to obtain the target video frame feature sequence. It will be appreciated that video frame features represent video frame level data, representing local features of video segments, and may characterize semantic information of video frames.
The fusion video features are obtained by carrying out feature fusion on target video frame feature sequences corresponding to the video segments to be matched. The fusion video features represent video-level data, represent global features of the video segments, and can represent semantic information of the whole video segments. The feature fusion is used for compressing data, converting the data at the video frame level into the data at the video level, and fusing the local features into global features. It can be understood that the target video frame feature sequence corresponding to a video segment to be matched is composed of a plurality of target video frame features, and the feature fusion is performed on the target video frame feature sequence, so that data composed of a plurality of feature vectors can be fused into a feature vector to be represented.
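The simplest form of such fusion is pooling the frame-level features into a single clip-level vector, as in the sketch below. Average pooling is only one possible choice; the patent does not commit to a specific fusion operator at this point, so treat this as an assumption.

```python
import torch

def fuse_video_features(target_frame_features: torch.Tensor) -> torch.Tensor:
    """Fuse a target video frame feature sequence [num_frames, feat_dim] into a
    single clip-level fused video feature [feat_dim].

    Average pooling is illustrative only; attention-weighted pooling or a small
    network could be used instead.
    """
    return target_frame_features.mean(dim=0)

frame_features = torch.randn(4, 256)          # 4 frames, 256-d features (assumed sizes)
fused_video_feature = fuse_video_features(frame_features)   # shape: [256]
```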
Specifically, after the real-time bullet screen corresponding to the target video is obtained, the computer device may determine a plurality of video segments to be matched corresponding to the real-time bullet screen in the target video, then obtain the fusion video features corresponding to the video segments to be matched, and subsequently match the bullet screen with the video features through the text features of the bullet screen and the video features of the video segments, thereby determining the bullet screen and the video segments which are matched with each other.
It may be appreciated that the fused video features may be pre-calculated before the real-time bullet screen is acquired, for example, in order to improve the matching efficiency, the computer device may pre-segment the target video to obtain a plurality of video segments, pre-extract the target video frame feature sequences corresponding to each video segment, and perform feature fusion on the target video frame feature sequences corresponding to each video segment, to obtain the fused video features corresponding to each video segment, and store each fused video feature. After the computer equipment acquires the real-time bullet screen corresponding to the target video, determining a plurality of video clips to be matched corresponding to the real-time bullet screen, for example, taking each video clip of the target video as the video clip to be matched corresponding to the real-time bullet screen, and then directly acquiring fusion video features corresponding to each video clip to be matched from the pre-stored data.
The fused video feature may also be obtained by real-time calculation after the real-time bullet screen is obtained, for example, after the computer device obtains the real-time bullet screen corresponding to the target video, the computer device determines a plurality of video segments to be matched corresponding to the real-time bullet screen, extracts the target video frame feature sequences corresponding to the video segments to be matched respectively, and performs feature compression on the target video frame feature sequences corresponding to the video segments respectively to obtain the fused video feature corresponding to the video segments.
The computer device may store the relevant data of the target video in association with the video identifier of the target video, for example, store each video segment of the target video in association with the video identifier of the target video, so that each video segment of the target video, and even each feature corresponding to each video segment, may be found based on the video identifier. The real-time barrage received by the computer equipment can carry the video identification corresponding to the target video, and the computer equipment can determine the target video corresponding to the real-time barrage based on the video identification, so that the related data corresponding to the target video is obtained. The real-time bullet screen received by the computer equipment can also carry bullet screen publishing time.
Step S206, calculating the matching degree of the real-time barrage and each video segment to be matched based on the initial text features and the fusion video features.
The matching degree refers to how well the real-time barrage matches a video segment to be matched, and may be expressed as a matching score. It will be appreciated that the greater the matching degree between the bullet screen and the video clip, the more similar and better matched the content of the bullet screen and the video clip.
Specifically, after obtaining the initial text feature corresponding to the real-time barrage and the fusion video feature corresponding to the video segment to be matched, the computer device may calculate the matching degree of the real-time barrage and the video segment to be matched based on the initial text feature and the fusion video feature, for example, calculate the feature similarity between the initial text feature and the fusion video feature, and use the feature similarity as the matching degree of the real-time barrage and the video segment to be matched. The computer device may calculate the degree of matching based on a custom algorithm or formula. The computer device may calculate the matching degree through a machine learning algorithm, for example, may calculate the matching degree through a machine learning model, input the initial text feature corresponding to the real-time barrage and the fusion video feature corresponding to the video segment to be matched into a trained video text matching model, and use the output data of the video text matching model as the matching degree of the real-time barrage and the video segment to be matched. The number of the video clips to be matched corresponding to the real-time barrage is multiple, so that the matching degree of the real-time barrage and each video clip to be matched can be calculated finally, and multiple matching degrees can be calculated finally.
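As a concrete, hedged illustration of steps S206 and S208, the sketch below scores each candidate clip by the cosine similarity between the initial text feature and its fused video feature, then picks the highest-scoring clip. Cosine similarity is the feature-similarity example mentioned above; a trained video-text matching model could equally be used in its place.

```python
import torch
import torch.nn.functional as F

def matching_degrees(text_feature: torch.Tensor,
                     fused_video_features: torch.Tensor) -> torch.Tensor:
    """Cosine similarity between the barrage text feature [feat_dim] and each
    candidate clip's fused video feature [num_clips, feat_dim]."""
    return F.cosine_similarity(text_feature.unsqueeze(0), fused_video_features, dim=1)

text_feat = torch.randn(256)
clip_feats = torch.randn(5, 256)                 # 5 candidate clips to be matched
scores = matching_degrees(text_feat, clip_feats)
target_clip_index = int(torch.argmax(scores))    # step S208: pick the best-matching clip
```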
Step S208, determining target video segments from the video segments to be matched based on the matching degree.
Specifically, the computer device may determine, based on the matching degree, a target video segment from the video segments to be matched, where the target video segment is a video segment that is the most matched or more matched with the real-time barrage, and the scenario of the target video segment and the content of the real-time barrage are consistent with each other, and may consider that the real-time barrage belongs to the target video segment.
In one embodiment, determining a target video clip from each video clip to be matched based on the degree of matching includes: and acquiring the video segment to be matched corresponding to the maximum matching degree from each matching degree as a target video segment.
Specifically, when determining the target video segment, the computer device may select, from each matching degree, the video segment to be matched corresponding to the maximum matching degree as the target video segment, so as to use the video segment to be matched that is the most matched as the target video segment corresponding to the real-time barrage.
It can be appreciated that the computer device may also select at least one video segment to be matched having a matching degree greater than a preset matching degree as the target video segment. One bullet screen can be matched with at least one video clip, and the same bullet screen can be synchronously played with at least one video clip. Of course, the matching degree may be a tag for indicating whether the video clips are matched, and the video clips to be matched corresponding to the tag for indicating the matching may be obtained as the target video clips.
Step S210, establishing an association relationship between the real-time barrage and the target video clip, wherein the association relationship is used for synchronously playing the real-time barrage when the target video clip is played.
Specifically, after determining the target video segment corresponding to the real-time barrage, the computer device may establish an association relationship between the real-time barrage and the target video segment, where the association relationship is used to play the real-time barrage synchronously when the target video segment is played. Therefore, in the process of playing the target video, any subsequent playing terminal synchronously plays the corresponding real-time barrage once playing the target video clip, and finally, the aim of barrage moment calibration is achieved. It will be appreciated that the real-time bullet screen may be played in synchronization with any one of the video frames of the target video clip, e.g., the real-time bullet screen may be played in synchronization with the starting video frame of the target video clip, i.e., the real-time bullet screen is played at the beginning of the target video clip.
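A minimal way to represent such an association relation is a small record linking the bullet screen to its target clip, which a playing terminal can look up while playing. The field names below are assumptions for illustration, not the patent's data format.

```python
from dataclasses import dataclass

@dataclass
class BulletScreenAssociation:
    """Illustrative association record between a real-time bullet screen and
    its target video clip (field names are hypothetical)."""
    bullet_id: str
    video_id: str
    clip_start: float   # seconds, start of the target video clip
    clip_end: float     # seconds, end of the target video clip

def bullets_for_clip(associations, video_id, clip_start):
    """Return the bullet screens to play synchronously with the clip that
    starts at clip_start in the given video."""
    return [a for a in associations
            if a.video_id == video_id and a.clip_start == clip_start]
```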
In one embodiment, there may be multiple playing terminals corresponding to the target video, and if the computer device is a server, each playing terminal may send a respective real-time barrage to the server, so that the server performs matching between the barrage and the video clip. The server can send the association relation to all the playing terminals when determining the association relation between one real-time barrage and the corresponding target video clip, so that all the playing terminals synchronously play the corresponding real-time barrages when playing the target video clip. If the computer equipment is a playing terminal, after any playing terminal obtains the real-time bullet screen sent by the user, matching between the bullet screen and the video clip can be carried out locally. And each time the playing terminal determines the association relation between one real-time barrage and the corresponding target video clip, the association relation can be sent to the server, so that the server can send the association relation to other playing terminals, and finally, when all the playing terminals play the target video clip, the corresponding real-time barrages are synchronously played.
For a given real-time barrage, consider the playing terminal on which it was published. After the association relation is determined, if that playing terminal has already played the target video clip corresponding to the real-time barrage, the playing terminal can remind the user through interaction information that the video clip corresponding to the published barrage content has already been played, for example by popping up a small window on the playing interface of the target video and reminding the user with text in the small window. The interaction information can further carry the position information of the target video segment corresponding to the real-time barrage in the target video, and by triggering the position information the user can return to and replay that target video segment, so that the user can check the barrage content that was sent. For other playing terminals, if the target video clip corresponding to the real-time barrage has already been played, their users do not need to be reminded; if the target video clip has not yet been played, the real-time barrage is played synchronously with the corresponding target video clip.
The number of the playing terminals corresponding to the target video can be multiple, and all users can publish the barrages at any time and any place when watching the target video, so that the number of the barrages corresponding to the target video can be multiple. All the barrages corresponding to the target video can determine the corresponding target video fragments by the method, and all the barrages corresponding to the target video can be synchronously played with the corresponding target video fragments.
In the above video barrage matching method, a real-time barrage corresponding to the target video is acquired and initial text features corresponding to the real-time barrage are extracted; a plurality of video segments to be matched corresponding to the real-time barrage are determined from the target video, and fusion video features corresponding to the video segments to be matched are acquired; the fusion video features are obtained by feature fusion of the target video frame feature sequences corresponding to the video segments to be matched, the target video frame feature sequences are obtained by feature extraction of the video frames in the video segments to be matched, and each target video frame feature sequence comprises the target video frame features corresponding to the video frames in the same video segment to be matched; the matching degree of the real-time barrage and each video segment to be matched is calculated based on the initial text features and the fusion video features, a target video segment is determined from the video segments to be matched based on the matching degree, and an association relationship between the real-time barrage and the target video segment is established, the association relationship being used for synchronously playing the real-time barrage when the target video segment is played. In this way, when the real-time barrage most recently published by a user watching the target video is acquired, the target video segment whose content matches the real-time barrage is accurately determined through the matching degree calculated from the text features of the real-time barrage and the video features of the video segments, and the real-time barrage is then played synchronously with the target video segment. This improves the matching accuracy of the barrage and the video segments and ensures that barrages are always accurately matched to and played with the video segments of the target video.
It can be appreciated that the bullet screen matching method of the present application can also be applied to calibrating bullet screen moments of historical bullet screens.
In one embodiment, determining a plurality of video segments to be matched corresponding to a real-time bullet screen from a target video includes:
determining segmentation video frames from each target video frame based on pixel information corresponding to each target video frame of the target video; video segmentation is carried out on the target video based on the segmented video frames, so that a plurality of initial video clips are obtained; and determining a plurality of video clips to be matched corresponding to the real-time barrage from each initial video clip.
The target video frame refers to any video frame in the target video. The pixel information corresponding to the target video frame is obtained based on pixel values of each pixel point in the target video frame, wherein the pixel values of each pixel point comprise pixel values of each pixel point in at least one color space.
Specifically, the computer device may segment the target video, segment the target video into a plurality of initial video segments, and determine a video segment to be matched corresponding to the real-time bullet screen from each initial video segment. When video segmentation is performed, the computer device can determine the segmented video frames from the target video frames based on the pixel information corresponding to the target video frames of the target video, wherein the pixel information of the segmented video frames meets the preset condition and has certain characteristics. For example, a black frame in the target video may be determined from each target video frame based on pixel information, the black frame being typically used for transition in the video, the black frame being taken as a sliced video frame. The scene change frame in the target video may also be determined from each target video frame based on the pixel information, and the scene change frame may be used as the slice video frame. The pixel characteristics of special video frames such as black frames, scene switching frames and the like can be extracted as preset conditions for determining the segmentation video frames.
After determining to segment the video frame, the computer device may segment the target video based on the segmented video frame, to segment the target video into a plurality of initial video segments with the segmented video frame as a segmentation point. The initial video clip obtained by slicing may or may not contain sliced video frames, for example, the initial video clip may not contain black frames. Furthermore, the computer device may determine a plurality of video segments to be matched corresponding to the real-time barrage from the initial video segments, for example, each initial video segment may be respectively used as a video segment to be matched corresponding to the real-time barrage, or a plurality of video segments may be selected from the initial video segments to be respectively used as video segments to be matched corresponding to the real-time barrage.
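Given the detected segmentation video frames, splitting the target video into initial video segments reduces to cutting the frame index range at those points, as in the sketch below. It assumes each segmentation frame starts a new segment; whether the segmentation frame itself is kept or dropped (for example, black frames may be discarded) is left open by the description.

```python
def split_into_clips(num_frames: int, slice_frames: list[int]) -> list[range]:
    """Split frame indices 0..num_frames-1 into initial video segments, using
    the detected segmentation (slice) frames as cut points."""
    cuts = sorted(set(slice_frames))
    bounds = [0] + [c for c in cuts if 0 < c < num_frames] + [num_frames]
    return [range(bounds[i], bounds[i + 1]) for i in range(len(bounds) - 1)]

clips = split_into_clips(1000, slice_frames=[120, 480, 760])
# -> [range(0, 120), range(120, 480), range(480, 760), range(760, 1000)]
```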
It can be understood that the video segmentation can be performed on the target video in advance before the real-time barrage is acquired, or the video segmentation can be performed on the target video after the real-time barrage is acquired.
In the above embodiment, based on the pixel information corresponding to each target video frame of the target video, the split video frame is determined from each target video frame, and the video splitting is performed on the target video based on the split video frame, so that the target video frame with stronger relevance can be divided into the same initial video segment, the target video frame with weaker relevance can be divided into different initial video segments, and the determination of the video segment to be matched corresponding to the real-time bullet screen from such initial video segments is helpful for improving the matching accuracy of the bullet screen and the video segment.
In one embodiment, determining a slice video frame from each target video frame of the target video based on pixel information corresponding to each target video frame includes:
acquiring a first pixel value of each pixel point in each target video frame under a first color space; counting all first pixel values corresponding to the same target video frame to obtain pixel information corresponding to all target video frames; and taking the target video frame with the pixel information smaller than the first threshold value as the segmentation video frame.
Wherein the first color space is an RGB color space. The first threshold for determining the slicing video frames may be set according to actual needs.
In particular, in video slicing, the computer device may treat black frames in the target video as sliced video frames. The computer equipment can acquire first pixel values of each pixel point in each target video frame under the RGB color space, count each first pixel value corresponding to the same target video frame, calculate the average value of RGB pixels of each target video frame, and acquire pixel information corresponding to each target video frame. The computer device may treat the target video frame with pixel information less than the first threshold as a black frame, which is typically used for transitions in video, and thus may treat the black frame as a sliced video frame.
For example, the first pixel value of the pixel point may be represented by (R, G, B), and R, G and B represent the values of the color of the pixel on the three color channels of red, green, and blue, respectively. The statistics of the respective first pixel values may be an average of the statistics over three color channels, respectively. The first threshold may include sub-thresholds corresponding to the three color channels, respectively, and a target video frame having an average value on the three color channels smaller than the corresponding sub-thresholds may be regarded as a black frame.
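A minimal sketch of this black-frame test follows, computing the per-channel RGB means of a frame and comparing them against a threshold. The threshold value and the use of NumPy are assumptions for illustration.

```python
import numpy as np

def is_black_frame(frame_rgb: np.ndarray, threshold: float = 10.0) -> bool:
    """Treat a frame as black if its mean value on each of the R, G and B
    channels is below the threshold (the threshold value is an assumption)."""
    channel_means = frame_rgb.reshape(-1, 3).mean(axis=0)   # [mean_R, mean_G, mean_B]
    return bool((channel_means < threshold).all())

# frame_rgb: H x W x 3 uint8 image decoded from the target video.
frame_rgb = np.zeros((720, 1280, 3), dtype=np.uint8)
print(is_black_frame(frame_rgb))   # True
```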
In the above embodiment, statistics is performed on each first pixel value corresponding to the same target video frame to obtain pixel information corresponding to each target video frame, the target video frame with the pixel information smaller than the first threshold value is used as a segmentation video frame, a black frame used for transition in the video is used as a segmentation video frame, video segmentation is performed based on the black frame, and video segments corresponding to different scenes can be segmented.
In one embodiment, determining a slice video frame from each target video frame of the target video based on pixel information corresponding to each target video frame includes:
acquiring second pixel values of all pixel points in the same target video frame under a second color space, and obtaining pixel information corresponding to all target video frames; calculating pixel change information between adjacent target video frames based on second pixel values corresponding to the matched pixel points in the adjacent target video frames; a sliced video frame is determined from adjacent target video frames for which the pixel change information is greater than a second threshold.
Wherein the second color space is an HSV color space. The second threshold may be set according to actual needs. The adjacent target video frames refer to two adjacent target video frames, and the matched pixels in the adjacent target video frames refer to pixels in the same position in the two target video frames, for example, the center pixel of the two video frames can be considered as the matched pixels.
In particular, in video slicing, the computer device may treat scene change frames in the target video as sliced video frames. Whether scene switching occurs or not can be judged through the variable quantity between adjacent frames in the HSV color space, and then the scene switching frame is determined. In contrast to the RGB color space, the HSV color space may separate color changes, which often imply that a scene has changed, from intensity changes, which often are due to factors such as illumination, and typically do not indicate that a scene change has occurred. Therefore, scene switching frames in the video can be accurately found based on the amount of change between adjacent frames in the HSV color space.
The computer equipment can acquire the second pixel value of each pixel point in the same target video frame under the second color space as the pixel information corresponding to the target video frame, and acquire the pixel information corresponding to each target video frame. Further, the computer device may calculate second pixel value differences based on second pixel values corresponding to the matched pixel points in the two adjacent target video frames, obtain a plurality of second pixel value differences, calculate pixel change information between the adjacent frames based on the respective second pixel value differences, for example, may calculate an average value of the respective second pixel value differences as the pixel change information, and may calculate a sum of the respective second pixel value differences as the pixel change information. All adjacent target video frames in the target video can be calculated to obtain corresponding pixel change information, and then a scene switching frame is determined from the target video frames based on the pixel change information, and the scene switching frame is used as a segmentation video frame. Specifically, the segmented video frame may be determined from adjacent target video frames whose pixel change information is greater than a second threshold, for example, the video frame a and the video frame B are adjacent target video frames, the pixel change information between the video frame a and the video frame B is greater than the second threshold, any one of the video frame a and the video frame B may be used as the segmented video frame, and one video segment includes the video frame a and one video segment includes the video frame B based on two video segments obtained by the segmented video frame.
For example, the video frame a and the video frame B include m×n pixels, and the video frame a and the video frame B are two adjacent video frames. Calculating the difference value of HSV pixels between a first pixel point in a first row in a video frame A and a first pixel point in a first row in a video frame B, calculating the difference value of HSV pixels between a second pixel point in the first row in the video frame A and a second pixel point in the first row in the video frame B, calculating the difference value of HSV pixels between a third pixel point in the first row in the video frame A and a third pixel point in the first row in the video frame B, and so on, calculating the difference value of HSV pixels between matching pixel points in the video frame A and the video frame B, finally obtaining M times N second pixel value difference values, and calculating the sum of the second pixel value difference values as pixel change information. If the pixel change information is larger than the second threshold value, determining that scene switching occurs, and taking any one of the video frame A and the video frame B as a segmentation video frame.
It will be appreciated that second pixel value differences between all matched pixel points in adjacent target video frames may be calculated, and pixel variation information may be derived based on each second pixel value difference. The second pixel value difference between the partially matched pixels of the adjacent target video frame may also be calculated, and the pixel change information may be obtained based on each second pixel value difference, for example, the target video frame is divided into a plurality of image areas, and at least one pixel point is selected from the pixels covered by each image area to calculate the second pixel value difference.
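The following sketch illustrates the scene-switch test, using OpenCV to convert adjacent frames to the HSV color space and comparing the mean per-pixel difference against a threshold. Using the mean rather than the sum, and the particular threshold value, are illustrative choices, since the description allows either aggregation.

```python
import cv2
import numpy as np

def is_scene_cut(prev_bgr: np.ndarray, curr_bgr: np.ndarray,
                 threshold: float = 30.0) -> bool:
    """Detect a scene-switch frame from the mean per-pixel HSV difference
    between two adjacent decoded frames (OpenCV frames are BGR)."""
    prev_hsv = cv2.cvtColor(prev_bgr, cv2.COLOR_BGR2HSV).astype(np.float32)
    curr_hsv = cv2.cvtColor(curr_bgr, cv2.COLOR_BGR2HSV).astype(np.float32)
    change = np.abs(curr_hsv - prev_hsv).mean()   # pixel change information
    return change > threshold
```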
In the above embodiment, the second pixel value of each pixel point in the same target video frame in the second color space is obtained, the pixel information corresponding to each target video frame is obtained, the pixel change information between the adjacent target video frames is calculated based on the second pixel value corresponding to the matched pixel point in the adjacent target video frame, the adjacent target video frames with the pixel change information larger than the second threshold value can be considered as video frames in video clips corresponding to different scenes respectively, the segmentation video frames are determined from the adjacent target video frames with the pixel change information larger than the second threshold value, and the video clips corresponding to different scenes can be segmented based on the segmentation video frames.
Referring to fig. 3, in performing video slicing, the computer device may refer to two pieces of information simultaneously: one is black frame information, i.e., the black frames in the video, and the other is scene switching information, i.e., the scene-cut frames in the video. The computer device may perform video black frame detection on the target video to obtain the black frames in the target video. For video black frame detection, each target video frame is first extracted from the target video, the average value of the RGB pixels of each frame image is calculated and compared with the first threshold value, and if the average RGB pixel value is smaller than the first threshold value, the frame is regarded as a black frame. Black frames are typically used for transitions in video, so the detected black frames can characterize the transition points of video segments. The computer device may perform scene switching detection on the target video to obtain the scene-cut frames in the target video. For scene switching detection, each target video frame is first extracted from the target video, each frame image is converted from the RGB color space to the HSV color space, and the variation between adjacent frames is then calculated in the HSV color space; if the variation is larger than the second threshold value, it is considered that scene switching occurs, and the scene-cut frame is determined. The detection results of the video black frame detection and the scene switching detection are integrated to determine a plurality of segmentation video frames from the complete video, and the complete video is segmented into a plurality of video segments according to the segmentation video frames.
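The combination of black frame detection and scene-cut detection could be sketched roughly as follows. The first threshold and the helper names are assumptions; `scene_cut_indices` would be produced, for example, by the HSV check sketched above:

```python
import numpy as np

def is_black_frame(frame, first_threshold=10):
    # A frame whose average RGB pixel value falls below the first threshold
    # is regarded as a black (transition) frame.
    return float(np.mean(frame)) < first_threshold

def segmentation_frames(frames, scene_cut_indices, first_threshold=10):
    """Merge black-frame indices with scene-cut indices into one sorted list of split points."""
    points = {i for i, f in enumerate(frames) if is_black_frame(f, first_threshold)}
    points.update(scene_cut_indices)
    return sorted(points)

def split_video(frames, points):
    """Slice the complete video into video segments at the segmentation video frames."""
    bounds = [0] + [p for p in points if 0 < p < len(frames)] + [len(frames)]
    return [frames[s:e] for s, e in zip(bounds, bounds[1:]) if e > s]
```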
In one embodiment, as shown in fig. 4, obtaining the fused video features corresponding to each video segment to be matched includes:
step S402, respectively extracting features of each video segment to be matched to obtain an initial video frame feature sequence corresponding to each video segment to be matched; the initial video frame feature sequence is obtained by sequencing initial video frame features corresponding to all video frames in the same video segment to be matched according to video frame time stamps, wherein the initial video frame features comprise video frame sub-features corresponding to a plurality of feature channels respectively.
Specifically, when generating the fusion video feature corresponding to the video segment to be matched, the computer device may perform feature extraction on the video segment to be matched to obtain an initial video frame feature sequence corresponding to the video segment to be matched, then perform feature shift on the video frame sub-feature corresponding to the target feature channel in the initial video frame feature sequence to obtain an intermediate video frame feature sequence, then perform two-dimensional convolution processing on the intermediate video frame feature sequence to obtain a target video frame feature sequence, and finally perform feature fusion on the target video frame feature sequence to obtain the fusion video feature.
The initial video frame feature sequence comprises initial video frame features corresponding to all video frames in the same video segment to be matched. The initial video frame feature sequence is specifically obtained by sequencing initial video frame features corresponding to each video frame in the same video segment to be matched according to the video frame time stamp. For example, the video clip to be matched includes four video frames, and the ordering result of the four video frames ordered according to the video frame time stamp is: video frame a-video frame B-video frame C-video frame D. The method comprises the steps of extracting features of video clips to be matched to obtain initial video frame features a corresponding to video frames A, initial video frame features B corresponding to video frames B, initial video frame features C corresponding to video frames C, initial video frame features D corresponding to video frames D, and sequencing the initial video frame features according to video frame time stamps to obtain an initial video frame feature sequence, wherein the initial video frame feature sequence is specifically an initial video frame feature a-an initial video frame feature B-an initial video frame feature C-an initial video frame feature D.
The initial video frame features corresponding to one video frame comprise video frame sub-features respectively corresponding to a plurality of feature channels, and the video frame sub-features corresponding to different feature channels represent features of different information extracted from the image. For example, an initial video frame feature may be represented by C×H×W, which denotes C feature channels, where each feature channel corresponds to a feature map of size H×W; that is, H×W represents the size of a video frame sub-feature. A video frame sub-feature may also be regarded as a feature map.
The computer device may perform feature extraction based on a custom algorithm or formula. The computer device may also perform feature extraction on each video segment to be matched through a machine learning algorithm, for example, by inputting the video segment to be matched into a convolutional neural network, performing feature extraction through the convolutional neural network, and using the output data of the convolutional neural network as the initial video frame feature sequence. Through the data processing of the computer device, the initial video frame feature sequences respectively corresponding to the video segments to be matched can be obtained. The initial video frame features in the initial video frame feature sequence mainly characterize the content information of each video frame itself.
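As an illustration only (the embodiment does not prescribe a specific network), a PyTorch-style sketch of extracting an initial video frame feature sequence with a small convolutional backbone might look like the following; the layer sizes and input resolution are assumptions:

```python
import torch
import torch.nn as nn

class FrameFeatureExtractor(nn.Module):
    """Maps each video frame to an initial video frame feature of shape C x H x W."""
    def __init__(self, out_channels=64):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, out_channels, kernel_size=3, stride=2, padding=1), nn.ReLU(),
        )

    def forward(self, clip):                  # clip: (T, 3, H, W), frames in timestamp order
        feats = self.backbone(clip)           # (T, C, H', W') initial video frame features
        return feats                          # ordering by video frame timestamp is preserved

clip = torch.randn(8, 3, 224, 224)            # a video segment of 8 frames (dummy data)
initial_sequence = FrameFeatureExtractor()(clip)
print(initial_sequence.shape)                 # torch.Size([8, 64, 56, 56])
```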
Step S404, in the initial video frame feature sequences corresponding to the same video segments to be matched, feature shifting is performed on the video frame sub-features corresponding to the target feature channels based on the ordering information of the initial video frame features, so as to obtain intermediate video frame feature sequences corresponding to the video segments to be matched.
The feature shift refers to moving the video frame sub-feature corresponding to the target feature channel in the initial video frame feature, so as to change the video frame sub-feature corresponding to the target feature channel in the initial video frame feature. The target feature channel may include at least one feature channel, and the target feature channel may be specifically set according to actual needs, for example, the video frame feature includes video frame sub-features corresponding to eight feature channels respectively, and two feature channels may be selected from the eight feature channels as the target feature channel.
The ranking information of the initial video frame features refers to the ranking order of the initial video frame features in the initial video frame feature sequence. The ordering information for the initial video frame characteristics may also be considered as a temporal ordering of video frame time stamps for the respective initial video frames.
Specifically, for any one initial video frame feature sequence, the computer device may perform feature shift on the video frame sub-features corresponding to the target feature channel based on the ordering information of the initial video frame features to obtain the intermediate video frame features corresponding to each video frame, and the intermediate video frame features corresponding to the video frames form the intermediate video frame feature sequence corresponding to the video segment to be matched. The intermediate video frame feature of one video frame comprises the video frame sub-features on the target feature channel after the feature shift and the original video frame sub-features on the other feature channels. The intermediate video frame features in the intermediate video frame feature sequence may be ordered or unordered. Through the data processing of the computer device, the intermediate video frame feature sequences respectively corresponding to the video segments to be matched are obtained.
It will be appreciated that the shift direction of the feature shift may be either forward along the time sequence or reverse along the time sequence. If there are multiple target feature channels, the shift directions corresponding to the target feature channels may be the same or different. The shift distance of the feature shift may be at least one time unit. For example, the video frame sub-feature corresponding to the target feature channel in the initial video frame feature of the current video frame may be used as the video frame sub-feature corresponding to the target feature channel in the intermediate video frame feature of the next video frame; alternatively, the video frame sub-feature corresponding to the target feature channel in the initial video frame feature of the current video frame may be used as the video frame sub-feature corresponding to the target feature channel in the intermediate video frame feature of the previous video frame.
Step S406, two-dimensional convolution processing is carried out on each intermediate video frame feature sequence to obtain a target video frame feature sequence corresponding to each current video segment to be matched.
The two-dimensional convolution processing refers to convolution processing within the video frame features corresponding to the same video frame. The specific process of the two-dimensional convolution processing can refer to various existing 2D convolutions. For example, for an intermediate video frame feature, a convolution kernel slides over the feature maps corresponding to the respective feature channels; at each position, the values on the feature maps are multiplied by the corresponding values of the convolution kernel, all the products are summed, and the sum is taken as the output value at the position corresponding to the center of the convolution kernel. After the convolution kernel has slid over the entire feature maps, the target video frame feature is obtained.
Specifically, for any intermediate video frame feature sequence, the computer device may perform two-dimensional convolution processing on the intermediate video frame feature sequence, that is, perform two-dimensional convolution processing on each intermediate video frame feature to obtain the target video frame feature corresponding to each video frame, and the target video frame features corresponding to the video frames form the target video frame feature sequence corresponding to the video segment to be matched. The target video frame feature sequence may be ordered by the video frame timestamps of the target video frame features. Through the data processing of the computer device, the target video frame feature sequences respectively corresponding to the video segments to be matched are obtained.
It can be understood that the conventional two-dimensional convolution processing only uses the information of the current frame, but in the application, the intermediate video frame feature sequence is obtained through feature shift, and the intermediate video frame feature corresponding to one video frame in the intermediate video frame feature sequence not only comprises the information of the current frame but also comprises the information of other frames, so that the intermediate video frame feature sequence is subjected to two-dimensional convolution processing, the information of different frames can be considered, the obtained target video frame feature sequence fuses the information in the time dimension, and the content of the video segment can be better represented by the target video frame feature sequence in consideration of the interconnection and the content relevance among the video frames.
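For illustration, the per-frame two-dimensional convolution over an intermediate video frame feature sequence might be sketched as follows (PyTorch; the shapes are assumptions, and the shift step itself is sketched in the later embodiments):

```python
import torch
import torch.nn as nn

# Intermediate sequence: T frames, C channels, each video frame sub-feature of size H x W.
T, C, H, W = 8, 64, 56, 56
intermediate_sequence = torch.randn(T, C, H, W)

# The frame dimension is treated as the batch dimension, so the 2D convolution runs
# independently on every (already shifted) frame feature; because each frame now carries
# channels borrowed from its neighbours, temporal information is still mixed in.
conv2d = nn.Conv2d(C, C, kernel_size=3, padding=1)
target_sequence = conv2d(intermediate_sequence)   # (T, C, H, W) target video frame features
```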
And step S408, respectively carrying out feature fusion on each target video frame feature sequence to obtain fusion video features corresponding to each current video segment to be matched.
Specifically, for any one target video frame feature sequence, the computer device may perform feature fusion on the target video frame feature sequence to obtain the fused video feature corresponding to the video segment to be matched. For example, each target video frame feature in the target video frame feature sequence may be weighted and summed to obtain the fused video feature. The weights corresponding to the respective target video frame features may be the same or may be different. Through the data processing of the computer device, the fused video features corresponding to each video segment to be matched can be obtained.
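As a simple illustration, a uniform-weight fusion of a target video frame feature sequence might look like the following sketch; the weights are assumptions and could equally be non-uniform:

```python
import torch

def fuse_by_weighted_sum(target_sequence, weights=None):
    """target_sequence: (T, C, H, W); returns one fused video feature of shape (C, H, W)."""
    T = target_sequence.shape[0]
    if weights is None:
        weights = torch.full((T,), 1.0 / T)          # equal contribution from every frame
    return (weights.view(T, 1, 1, 1) * target_sequence).sum(dim=0)
```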
In the above embodiment, feature extraction is performed on the video segments to be matched to obtain the initial video frame feature sequence, feature shift is performed on the video frame sub-features corresponding to the target feature channel in the initial video frame feature sequence so that information is exchanged between different video frames to obtain the intermediate video frame feature sequence, and two-dimensional convolution processing is performed on the intermediate video frame feature sequence so that the information of different video frames is fused to obtain the target video frame feature sequence. In this way, the effect of three-dimensional convolution can be achieved through the feature shift and the two-dimensional convolution processing, with a smaller amount of calculation and lower computational complexity than three-dimensional convolution. Feature fusion is then performed on the target video frame feature sequence, so that more accurate fused video features can be obtained.
In one embodiment, in an initial video frame feature sequence corresponding to the same video segment to be matched, feature shifting is performed on video frame sub-features corresponding to the target feature channel based on ordering information of initial video frame features, so as to obtain an intermediate video frame feature sequence corresponding to each video segment to be matched, including:
In an initial video frame characteristic sequence corresponding to a current video segment to be matched, taking video frame sub-characteristics corresponding to a target characteristic channel in each initial video frame characteristic as target sub-characteristics; updating target sub-features corresponding to adjacent video frames based on target sub-features corresponding to the current video frames aiming at the initial video frame features to obtain intermediate video frame features corresponding to each video frame of the current video segment to be matched; and sequencing all the intermediate video frame features according to the video frame time stamp to obtain an intermediate video frame feature sequence corresponding to the current video segment to be matched.
The current video clips to be matched refer to currently used video clips to be matched, and each video clip to be matched corresponding to the real-time barrage can be sequentially used as the current video clips to be matched. The current video frame refers to a currently used video frame, and each video frame in the current video segment to be matched can be sequentially used as the current video frame. The adjacent video frames include at least one of a forward video frame and a backward video frame of the current video frame.
Specifically, when performing feature shifting, the computer device may use the video frame sub-feature corresponding to the target feature channel in each initial video frame feature as a target sub-feature, and generate the intermediate video frame feature by updating the target sub-feature in the initial video frame feature. Based on the initial video frame characteristics, the computer equipment can update the target sub-characteristics corresponding to the adjacent video frames based on the target sub-characteristics corresponding to the current video frames, replace the original target sub-characteristics of the adjacent video frames with the target sub-characteristics corresponding to the current video frames, and the video frame sub-characteristics corresponding to other characteristic channels are kept unchanged, so that the intermediate video frame characteristics corresponding to the adjacent video frames are obtained. And after the target sub-features of each video frame in the current video segment to be matched are sequentially updated as the current video frame, the intermediate video frame features corresponding to each video frame of the current video segment to be matched can be obtained. Furthermore, the computer device may sort the features of each intermediate video frame according to the video frame time stamp to obtain an intermediate video frame feature sequence corresponding to the video segment to be matched currently.
For example, the video clip to be matched includes four video frames, in turn video frame A-video frame B-video frame C-video frame D. For the initial video frame features, the target sub-feature corresponding to video frame B may be updated based on the target sub-feature corresponding to video frame A, the target sub-feature corresponding to video frame C may be updated based on the target sub-feature corresponding to video frame B, the target sub-feature corresponding to video frame D may be updated based on the target sub-feature corresponding to video frame C, and the target sub-feature corresponding to video frame A may be updated based on the target sub-feature corresponding to video frame B, while the video frame sub-features corresponding to the other feature channels are not changed, so as to obtain the intermediate video frame features respectively corresponding to video frame A, video frame B, video frame C, and video frame D.
In the above embodiment, the content between the adjacent video frames has strong relevance and continuity, when the feature shift is performed, the target sub-feature corresponding to the adjacent video frame is updated based on the target sub-feature corresponding to the current video frame, and then, the two-dimensional convolution processing is performed, so that the context interaction and the context fusion can be performed, and the modeling capability in the time dimension is improved.
In one embodiment, updating the target sub-feature corresponding to the neighboring video frame based on the target sub-feature corresponding to the current video frame includes:
Updating the target sub-feature corresponding to the next video frame based on the target sub-feature corresponding to the current video frame, and configuring the target sub-feature corresponding to the starting video frame as a preset sub-feature.
The starting video frame refers to the first video frame in the current video segment to be matched. The preset sub-feature is a preset video frame sub-feature and is fixed data; for example, the preset sub-feature may be set to zero.
In particular, in performing the target sub-feature update, the computer device may update the target sub-feature corresponding to the next video frame based on the target sub-feature corresponding to the current video frame, i.e., move the target sub-feature by one time unit in the direction in which the video frame time stamp is incremented in the initial video frame feature sequence. Since the starting video frame in the video clip has no forward video frame, the computer device may configure the target sub-feature corresponding to the starting video frame as a preset sub-feature.
Referring to fig. 5A, a in fig. 5A represents the initial video frame feature sequence, and b represents the intermediate video frame feature sequence. For the initial video frame feature sequence, one row of cubes represents the initial video frame feature corresponding to one video frame, and one cube represents one video frame sub-feature. In the initial video frame feature sequence, the initial video frame features are ordered from small to large according to the video frame timestamps, and an initial video frame feature can be represented by C×H×W. The initial video frame feature in fig. 5A includes video frame sub-features corresponding to six feature channels, and two of the feature channels are taken as the target feature channels. In the initial video frame feature sequence, the video frame sub-features (i.e., the target sub-features) corresponding to the target feature channels are moved by one time unit along the direction in which the video frame timestamp increases, that is, the target sub-feature corresponding to the current video frame is used as the target sub-feature corresponding to the next video frame. The vacant positions after the shift are filled with zeros, that is, the target sub-feature corresponding to the starting video frame is configured as the preset sub-feature, thereby obtaining the intermediate video frame feature sequence.
In the above embodiment, the target sub-feature corresponding to the next video frame is updated based on the target sub-feature corresponding to the current video frame, and the past frame and the current frame may be blended, and the target sub-feature corresponding to the starting video frame is configured as a preset sub-feature, so that the intermediate video frame feature is quickly obtained.
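A sketch of this uni-directional feature shift is shown below, assuming the feature sequence is a tensor of shape (T, C, H, W) and the first `n_shift` channels play the role of the target feature channels:

```python
import torch

def shift_forward(initial_sequence, n_shift):
    """initial_sequence: (T, C, H, W). The target sub-features of frame t become the
    target sub-features of frame t+1; the starting frame gets the preset (zero) sub-feature."""
    shifted = initial_sequence.clone()
    # move the target channels one time unit in the direction of increasing timestamps
    shifted[1:, :n_shift] = initial_sequence[:-1, :n_shift]
    shifted[0, :n_shift] = 0.0                 # preset sub-feature for the starting video frame
    return shifted                             # intermediate video frame feature sequence
```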
In one embodiment, the target sub-feature includes a first sub-feature corresponding to a first one of the target feature channels and a second sub-feature corresponding to other ones of the target feature channels.
Updating the target sub-feature corresponding to the adjacent video frame based on the target sub-feature corresponding to the current video frame, comprising:
updating a first sub-feature corresponding to a next video frame based on the first sub-feature corresponding to the current video frame, and configuring the first sub-feature corresponding to the starting video frame as a preset sub-feature; and updating the second sub-feature corresponding to the previous video frame based on the second sub-feature corresponding to the current video frame, and configuring the second sub-feature corresponding to the ending video frame as a preset sub-feature.
The first characteristic channel may be set as required, for example, any one of the target characteristic channels is used as the first characteristic channel. If there are at least two target feature channels, the target sub-features may include a first sub-feature corresponding to a first feature channel in the target feature channel and a second sub-feature corresponding to other feature channels in the target feature channel. The end video frame refers to the last video frame sequenced in the current video segment to be matched.
Specifically, when the target sub-feature is updated, the computer device may update a first sub-feature corresponding to a next video frame based on a first sub-feature corresponding to a current video frame, update a second sub-feature corresponding to a previous video frame based on a second sub-feature corresponding to the current video frame, configure a first sub-feature corresponding to a start video frame as a preset sub-feature and configure a second sub-feature corresponding to an end video frame as a preset sub-feature, and truncate an excess portion. That is, in the initial video frame feature sequence, the first sub-feature is moved one time unit in the direction of increasing the video frame time stamp and the second sub-feature is moved one time unit in the direction of decreasing the video frame time stamp. Since the starting video frame in the video segment has no forward video frame, the computer device may configure the first sub-feature corresponding to the starting video frame as a preset sub-feature, and since the ending video frame in the video segment has no backward video frame, the computer device may configure the second sub-feature corresponding to the ending video frame as a preset sub-feature.
Referring to fig. 5B, a in fig. 5B represents the initial video frame feature sequence, and b represents the intermediate video frame feature sequence. For the initial video frame feature sequence, one row of cubes represents the initial video frame feature corresponding to one video frame, and one cube represents one video frame sub-feature. In the initial video frame feature sequence, the initial video frame features are ordered from small to large according to the video frame timestamps, and an initial video frame feature can be represented by C×H×W. The initial video frame feature in fig. 5B includes video frame sub-features corresponding to six feature channels, where the first and second feature channels are taken as the target feature channels, the first feature channel serves as the first feature channel among the target feature channels, and the second feature channel serves as the other feature channel among the target feature channels. In the initial video frame feature sequence, the video frame sub-feature (i.e., the first sub-feature) corresponding to the first feature channel is moved by one time unit along the direction in which the video frame timestamp increases, that is, the first sub-feature corresponding to the current video frame is used as the first sub-feature corresponding to the next video frame; the vacant positions after the shift are filled with zeros, that is, the first sub-feature corresponding to the starting video frame is configured as the preset sub-feature, and the excess portion is truncated. The video frame sub-feature (i.e., the second sub-feature) corresponding to the other feature channel in the target feature channels is moved by one time unit along the direction in which the video frame timestamp decreases, that is, the second sub-feature corresponding to the current video frame is used as the second sub-feature corresponding to the previous video frame; the vacant positions after the shift are filled with zeros, that is, the second sub-feature corresponding to the ending video frame is configured as the preset sub-feature, and the excess portion is truncated.
In the above embodiment, the first sub-feature corresponding to the next video frame is updated based on the first sub-feature corresponding to the current video frame, the second sub-feature corresponding to the previous video frame is updated based on the second sub-feature corresponding to the current video frame, the past frame and the future frame may be blended with the current frame, the first sub-feature corresponding to the start video frame is configured as a preset sub-feature, and the second sub-feature corresponding to the end video frame is configured as a preset sub-feature, so that the intermediate video frame feature is obtained quickly.
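A corresponding sketch for this bidirectional variant, where the first target channel is shifted toward later frames and the other target channels toward earlier frames, again with assumed shapes and zero as the preset sub-feature:

```python
import torch

def shift_bidirectional(initial_sequence, n_forward=1, n_backward=1):
    """initial_sequence: (T, C, H, W). First sub-features move toward later frames,
    second sub-features move toward earlier frames; out-of-range parts are truncated."""
    shifted = initial_sequence.clone()
    f, b = n_forward, n_forward + n_backward
    # first sub-feature: current frame -> next frame, starting video frame zero-filled
    shifted[1:, :f] = initial_sequence[:-1, :f]
    shifted[0, :f] = 0.0
    # second sub-feature: current frame -> previous frame, ending video frame zero-filled
    shifted[:-1, f:b] = initial_sequence[1:, f:b]
    shifted[-1, f:b] = 0.0
    return shifted
```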
In one embodiment, referring to fig. 6, feature fusion includes the steps of:
step S602, obtaining a plurality of cluster center features; each cluster center feature corresponds to a different video frame topic.
The video frame theme refers to a central idea and main content expressed by image information of one video frame. For example, the information represented by a video clip is an action, and each video frame in the video clip can respectively express the subject information of each action detail, action triggering object, auxiliary prop and the like which form the action. For example, the information of one video clip representation is a "shooting" action, and each video frame can respectively express detailed information such as "basketball", "ball control", "jump", "shooting", and the like, and each video frame has a corresponding video frame theme.
The clustering center features are obtained by performing clustering analysis on video frame features of a large number of video frames. Different cluster center features may correspond to different video frame topics. For example, the computer device may obtain video frame features of a large number of video frames, perform feature clustering on each video frame feature to obtain a plurality of cluster centers, where each cluster center corresponds to a cluster center feature, each video frame feature belongs to a cluster center closest to the cluster center, the video frame features belonging to the same cluster center have a great similarity, represent the same video frame subject, and the video frame features belonging to different cluster centers have a great difference, and represent different video frame subjects. The feature clustering may be performed by various clustering algorithms or by a machine learning algorithm.
In one embodiment, the cluster center feature may be learned end-to-end along with other parameters of the model as a parameter that can be learned by the machine learning model. For example, a video classification model may be trained based on training samples, which are video segments for which real classification results are known, the video classification model including a feature extraction layer, a feature fusion layer, and a feature classification layer. During model training, a video segment is input into a video classification model, video frame characteristics of each video frame in the video segment are extracted through a characteristic extraction layer, each video frame characteristic is fused into a video characteristic through a characteristic fusion layer, a prediction classification result is output through a characteristic classification layer based on the video characteristic, training loss is generated based on a real classification result and the prediction classification result, model parameters are adjusted based on the training loss until convergence conditions are met, and training is finished. The convergence condition may be at least one of a training loss less than a preset loss, a number of iterations greater than a preset number, and the like. The feature fusion layer fuses each video frame feature into a video feature based on the cluster center feature. The clustering center features are parameters to be learned in the model training process, and after model training is finished, the video features obtained by fusing video frame features based on the finally learned clustering center features can accurately represent semantic information of video segments, so that accurate video classification results can be obtained through a feature classification layer.
Specifically, when feature fusion is performed, video frame features representing local features of video clips can be aggregated according to cluster center features, and fused video features representing global features of video clips are generated through distances from each video frame feature to the cluster center feature to which each video frame feature belongs. For example, the distances from each video frame feature to the cluster center feature to which it belongs may be averaged and fused to obtain a fused video feature, that is, the semantic contributions of each video frame to the video segment are the same. The distances from each video frame feature to the cluster center feature can be fused unevenly to obtain a fused video feature, that is, the semantic contributions of each video frame to the video segment are different. For example, the greater the distance of each video frame feature from the belonging cluster center feature, the smaller the weight when feature fusion is performed.
Step S604, aiming at a target video frame characteristic sequence corresponding to the current video segment to be matched, determining target center characteristics from all the cluster center characteristics based on the distances between the target video frame characteristics corresponding to the same video frame and all the cluster center characteristics, and obtaining target center characteristics respectively corresponding to all the video frames of the current video segment to be matched.
Step S606, based on the distance between the target video frame feature and the target center feature corresponding to the same video frame, the target feature distance corresponding to each video frame of the current video segment to be matched is obtained.
Specifically, for any video segment to be matched, the computer device may acquire each cluster center feature, calculate a distance between a target video frame feature corresponding to any video frame in the video segment to be matched and each cluster center feature, and use the cluster center feature with the smallest distance as the target center feature corresponding to the video frame to obtain the target center feature corresponding to each video frame in the video segment to be matched. That is, the cluster center to which each video frame in the video segments to be matched belongs is determined.
Furthermore, the computer device may calculate a distance between the target video frame feature corresponding to each video frame and the target center feature corresponding to each video frame, so as to obtain a target feature distance corresponding to each video frame in the video segment to be matched. The target feature distance refers to the distance from the video frame feature to the cluster center feature to which it belongs.
Step S608, performing attention distribution on the features of each target video frame corresponding to the current video segment to be matched, to obtain the attention weights corresponding to each video frame of the current video segment to be matched.
Step S610, based on the attention weight, fusing the distances of the target features to obtain the fused video features corresponding to the video segments to be matched currently.
Attention allocation refers to allocating attention weights of different degrees to different target video frame features to distinguish important features from non-important features. The attention weight is used to represent the importance of a certain video frame in the video segment to the overall video segment and the semantic contribution.
Specifically, in order to improve the accuracy of the fused video features, the computer device may perform non-average fusion on the target feature distances to obtain the fused video features. The computer device can perform attention distribution on each target video frame feature corresponding to the video segment to be matched so as to distinguish important video frame features from non-important video frame features, and obtain the attention weights respectively corresponding to each video frame of the video segment to be matched. The computer device can then fuse the target feature distances based on the attention weights, that is, perform a weighted summation of the target feature distances based on the attention weights, so as to obtain the fused video features corresponding to the video segment to be matched.
In one embodiment, attention may also be allocated as a parameter that can be learned by the machine learning model, along with other parameters of the model, end-to-end learning. For example, the feature fusion layer fuses each video frame feature to a video feature based on the cluster center feature and the attention weight. The clustering center features and the parameters for performing attention distribution are parameters to be learned in the model training process, after model training is finished, target feature distances corresponding to all video frames are determined based on the finally learned clustering center features, attention weights corresponding to all video frames are determined based on the finally learned attention distribution parameters, and then the target feature distances are fused based on the attention weights to obtain fused video features. Attention distribution in the feature fusion layer may be specifically performed by the full connectivity layer and the Softmax layer.
In the above embodiment, based on the cluster center feature and the attention weight, feature fusion is performed on each target video frame feature in the target video frame feature sequence, so that a fused video feature with higher accuracy can be obtained.
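The cluster-center and attention fusion described above could be sketched roughly as follows (PyTorch). The dimensions are assumptions, the frame features are assumed to have been pooled to vectors for simplicity, and each "target feature distance" is treated here as the residual vector between a frame feature and its nearest cluster center; the cluster centers and the attention layer are learnable parameters:

```python
import torch
import torch.nn as nn

class ClusterAttentionFusion(nn.Module):
    """Fuse per-frame features into one video feature using cluster centers plus attention."""
    def __init__(self, dim=512, n_clusters=32):
        super().__init__()
        self.centers = nn.Parameter(torch.randn(n_clusters, dim))   # one feature per video-frame topic
        self.attn = nn.Linear(dim, 1)                               # attention distribution layer

    def forward(self, frame_feats):                                 # (T, dim) pooled frame features
        # nearest cluster center (target center feature) for every frame
        d = torch.cdist(frame_feats, self.centers)                  # (T, n_clusters) distances
        nearest = d.argmin(dim=1)                                   # index of the target center feature
        residual = frame_feats - self.centers[nearest]              # "target feature distance" per frame
        # attention weights over frames, then weighted fusion of the residuals
        weights = torch.softmax(self.attn(frame_feats).squeeze(-1), dim=0)   # (T,)
        return (weights.unsqueeze(-1) * residual).sum(dim=0)        # fused video feature (dim,)

fused = ClusterAttentionFusion()(torch.randn(8, 512))
```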
In one embodiment, the generation process of the fused video feature includes the steps of:
Inputting the current video segment to be matched into a video feature extraction model; the video feature extraction model comprises a first feature extraction layer, a second feature extraction layer and a feature fusion layer; performing feature extraction on the current video segment to be matched through a first feature extraction layer to obtain an initial video frame feature sequence corresponding to the current video segment to be matched; performing feature shifting on video frame sub-features corresponding to the target feature channels in the initial video frame feature sequence based on the ordering information of the initial video frame features through the second feature extraction layer to obtain an intermediate video frame feature sequence corresponding to the current video segment to be matched; performing two-dimensional convolution processing on the intermediate video frame feature sequence through the second feature extraction layer to obtain a target video frame feature sequence corresponding to the current video segment to be matched; and carrying out feature fusion on the target video frame feature sequence through a feature fusion layer to obtain fusion video features corresponding to the video segments to be matched currently.
Specifically, the computer device may obtain, by means of a machine learning model, a fused video feature corresponding to the video segments to be matched. The computer device may specifically extract video features of the video clip based on a video feature extraction model that includes a first feature extraction layer, a second feature extraction layer, and a feature fusion layer. The first feature extraction layer is used for extracting features of all video frames in the video segments of the input model to obtain initial video frame features corresponding to all video frames, and an initial video frame feature sequence is formed. The second feature extraction layer is used for obtaining an initial video frame feature sequence output by the first feature extraction layer, performing feature shift on video frame sub-features corresponding to the target feature channels in the initial video frame feature sequence based on the ordering information of the initial video frame features to obtain an intermediate video frame feature sequence, and performing two-dimensional convolution processing on the intermediate video frame feature sequence to obtain the target video frame feature sequence. The feature fusion layer is used for obtaining the target video frame feature sequence output by the second feature extraction layer, and carrying out feature fusion on the target video frame feature sequence to obtain fusion video features.
It will be appreciated that specific procedures for data processing such as feature shifting, feature fusion, etc. may be referred to in the context of the various related embodiments described above.
In one embodiment, the video feature extraction model may be obtained in a supervised training manner. And during model training, a feature classification layer can be added after an output layer of the video feature extraction model to be trained, so that the video classification model to be trained is obtained. And performing supervised training on the video classification model based on the training sample carrying the training label to obtain a trained video classification model. The network before the feature classification layer is acquired from the trained video classification model as a trained video feature extraction model. The training sample is a video clip, and the training label represents the real classification result of the video clip.
In the above embodiment, the fused video feature may be quickly generated by means of the video feature extraction model.
In one embodiment, calculating the matching degree of the real-time barrage and each video segment to be matched based on the initial text feature and the fusion video feature includes:
inputting the real-time barrage into a text processing model to obtain initial text characteristics; inputting the initial text features into a text feature extraction network in a trained video text matching model to obtain target text features corresponding to the real-time barrage; inputting the fused video features into a video feature extraction network in a video text matching model to obtain target video features; and obtaining the matching degree of the real-time barrage and each video segment to be matched based on the similarity between the target text features and the target video features.
Wherein the text processing model is a machine learning model for processing text. The video text matching model is a machine learning model that is used to determine whether text and video clips match. The video text matching model comprises two branches, wherein one branch is a text feature extraction network and is used for inputting text features of a barrage, and the other branch is a video feature extraction network and is used for inputting video features of a video clip. In one embodiment, the text feature extraction network and the video feature extraction network may be comprised of a plurality of fully connected layers.
Specifically, the computer device may extract text features of the barrage by means of a machine learning model, and specifically input the real-time barrage into a text processing model to obtain initial text features corresponding to the real-time barrage. The computer equipment can calculate the matching degree by means of a machine learning model, specifically, the initial text features corresponding to the real-time barrage are input into a text feature extraction network in a trained video text matching model, the fusion video features corresponding to the video segments to be matched are input into a video feature extraction network in the video text matching model, and the target text features corresponding to the real-time barrage and the target video features corresponding to the video segments to be matched are obtained through data processing of the text feature extraction network and the video feature extraction network. Then, the computer device can calculate and obtain the matching degree of the real-time barrage and the video clips to be matched based on the similarity between the target text features and the target video features. The computer equipment can directly take the feature similarity between the target text feature and the target video feature as the matching degree, and can also input the target text feature and the target video feature into a matching layer of a video text matching model, and the video text matching model outputs the matching degree of the real-time barrage and the video clips to be matched through data processing of the matching layer.
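A sketch of the two-branch matching computation is given below (fully connected towers plus cosine similarity). The layer sizes and the use of cosine similarity as the matching degree are assumptions consistent with the description above:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VideoTextMatcher(nn.Module):
    def __init__(self, text_dim=768, video_dim=512, joint_dim=256):
        super().__init__()
        # text feature extraction network: initial text feature -> target text feature
        self.text_net = nn.Sequential(nn.Linear(text_dim, joint_dim), nn.ReLU(),
                                      nn.Linear(joint_dim, joint_dim))
        # video feature extraction network: fused video feature -> target video feature
        self.video_net = nn.Sequential(nn.Linear(video_dim, joint_dim), nn.ReLU(),
                                       nn.Linear(joint_dim, joint_dim))

    def forward(self, initial_text_feat, fused_video_feat):
        t = self.text_net(initial_text_feat)
        v = self.video_net(fused_video_feat)
        # cosine similarity used directly as the matching degree
        return F.cosine_similarity(t, v, dim=-1)

matcher = VideoTextMatcher()
score = matcher(torch.randn(1, 768), torch.randn(1, 512))
```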
In one embodiment, the text processing model may employ a RoBERTa model. The RoBERTa model is a language model for processing various NLP (Natural Language Processing) tasks. The RoBERTa model adjusts some training strategies on the basis of the BERT model, for example, using larger training batch sizes, longer training sequences, and dynamically adjusted masks. Dynamic mask adjustment means that the mask processing is performed while the data is being loaded, whereas the BERT model performs the mask processing on the data in advance and directly loads the already masked data. Referring to fig. 7, the training process of the RoBERTa model includes two stages, the first stage being a pre-training stage and the second stage being a fine-tuning stage. In the pre-training stage, model training can be performed through pre-training task one and pre-training task two. Pre-training task one randomly masks a part of the words in a sentence and then predicts the masked words by using contextual information; for example, sentence A is mask-processed so that a part of its words are randomly masked, the masked sentence A is input into the BERT model, and the training objective is to predict the masked words according to the meaning understood from the full text. Pre-training task two is the next-sentence prediction task, whose main purpose is to enable the model to better understand the relationships between sentences, for example, predicting sentence B from sentence A. In the fine-tuning stage, the parameters of the BERT model are fine-tuned according to different learning tasks. For example, a question-answer task may be used as the learning task: a question and a text containing the answer are input into the BERT model, and the training objective is to find the location of the answer in the text, that is, to predict the starting location and ending location of the answer. The learning task may also include a single-sentence classification task, a sentence-pair classification task, and so on. An accurate RoBERTa model can finally be obtained through training in the pre-training stage and the fine-tuning stage.
During model training, a word vector sequence (Token sequence, where Tok is shorthand for Token) corresponding to a training sample is input into the BERT model, an initial feature vector (denoted by E) is obtained through the data processing of the input layer, the initial feature vector is processed by the subsequent fully connected layers to obtain a target feature vector (denoted by T), and a prediction result is determined based on the target feature vector. Taking pre-training task one as an example, the word vector sequences of the mask-processed sentence A and sentence B are input into the model, where the word vector sequence of the masked sentence A consists of Tok1 to TokN and the word vector sequence of the masked sentence B consists of Tok1' to TokM'. A sentence head identifier [CLS] is added before the word vector sequence of sentence A, and a sentence segmentation identifier [SEP] is added between the word vector sequences of sentence A and sentence B. The output vector corresponding to the sentence head identifier [CLS] serves as the semantic representation of the entire training sample and is used for text classification, and the sentence segmentation identifier [SEP] is used for distinguishing the feature vectors of the two sentences. After the word vector sequence is input into the model, the initial feature vectors are obtained through the data processing of the input layer; the initial feature vectors consist of E[CLS], E1 to EN, E[SEP], and E1' to EM', where E[CLS] represents the initial feature vector corresponding to [CLS], E1 to EN represent the initial feature vectors corresponding to Tok1 to TokN, E[SEP] represents the initial feature vector corresponding to [SEP], and E1' to EM' represent the initial feature vectors corresponding to Tok1' to TokM'. The initial feature vectors are processed by the subsequent fully connected layers to obtain the target feature vectors, which consist of C, T1 to TN, T[SEP], and T1' to TM', where C represents the target feature vector corresponding to [CLS], T1 to TN represent the target feature vectors corresponding to Tok1 to TokN, T[SEP] represents the target feature vector corresponding to [SEP], and T1' to TM' represent the target feature vectors corresponding to Tok1' to TokM'.
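As an illustration (not a mandated implementation), the initial text feature of a real-time bullet screen could be taken from a pretrained RoBERTa-style encoder, for example via the Hugging Face `transformers` library, using the [CLS] output as the sentence representation; the specific checkpoint name below is an assumption:

```python
import torch
from transformers import AutoTokenizer, AutoModel

# hypothetical choice of checkpoint; any RoBERTa/BERT-style encoder would do
tokenizer = AutoTokenizer.from_pretrained("hfl/chinese-roberta-wwm-ext")
encoder = AutoModel.from_pretrained("hfl/chinese-roberta-wwm-ext")

def initial_text_feature(danmaku: str) -> torch.Tensor:
    inputs = tokenizer(danmaku, return_tensors="pt", truncation=True)
    with torch.no_grad():
        outputs = encoder(**inputs)
    # [CLS] token output taken as the semantic representation of the whole bullet screen
    return outputs.last_hidden_state[:, 0]      # shape (1, hidden_size)
```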
Of course, other language models, other text models, may also be employed for the text processing model.
In the above embodiment, the matching degree of the bullet screen and the video clip is calculated by means of the text processing model and the video text matching model, so that the accuracy of the matching degree can be improved.
In one embodiment, the training process of the video text matching model includes the steps of:
acquiring a training sample and a training label corresponding to the training sample; the training sample comprises a training video clip and a training barrage; inputting the fusion video features corresponding to the training video segments and the initial text features corresponding to the training barrages into a video text matching model to be trained to obtain a prediction tag; the prediction labels are obtained based on the similarity between the text features output by the text feature extraction network and the video features output by the video feature extraction network; and calculating training loss based on the training label and the prediction label, and adjusting model parameters of the video text matching model to be trained based on the training loss until convergence conditions are met, so as to obtain the trained video text matching model.
The training samples comprise training video clips and training barrages. The training labels can be two kinds of labels, if the training labels corresponding to the training samples are matched labels, the training video clips and the training barrages in the training samples are matched with each other, and if the training labels corresponding to the training samples are unmatched labels, the training video clips and the training barrages in the training samples are unmatched. It will be appreciated that the training labels may also be the probability of matching the training video segments to the training barrage, the matching score, etc.
Specifically, the video text matching model may be trained by a supervised training approach. Before model training, the computer device may extract the fused video features of the training video segments in the training samples, and extract the initial text features corresponding to the training barrages in the training samples. Furthermore, during model training, the computer equipment can input the fusion video features corresponding to the training video segments and the initial text features corresponding to the training barrages into a video text matching model to be trained, specifically, the fusion video features are input into a video feature extraction network, the initial text features are input into a text feature extraction network, and the video text matching model finally outputs a prediction label through data processing in the model. The computer equipment can calculate training loss based on the training label and the prediction label, and adjusts model parameters of the model based on the training loss in a back propagation mode to obtain an updated video text matching model, and returns to the iterative execution of the step of inputting the fusion video feature corresponding to the training video segment and the initial text feature corresponding to the training barrage into the video text matching model, and continues training until convergence conditions are met, and training is completed to obtain the trained video text matching model. The convergence condition may be at least one of the number of model iterations reaching a preset number of times, the training loss being smaller than the preset loss, and the like.
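A training sketch under the assumptions above, reusing the `VideoTextMatcher` sketch: binary match/mismatch labels with binary cross-entropy on the similarity score; the loss choice and hyper-parameters are assumptions:

```python
import torch
import torch.nn.functional as F

def train_matcher(matcher, loader, epochs=5, lr=1e-4):
    """loader yields (fused_video_feat, initial_text_feat, label) with label 1=match, 0=mismatch."""
    optim = torch.optim.Adam(matcher.parameters(), lr=lr)
    for _ in range(epochs):
        for video_feat, text_feat, label in loader:
            score = matcher(text_feat, video_feat)          # predicted matching degree
            loss = F.binary_cross_entropy_with_logits(score, label.float())
            optim.zero_grad()
            loss.backward()                                 # back propagation
            optim.step()                                    # adjust model parameters
    return matcher
```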
In one embodiment, regarding the application of the model, in order to improve the matching efficiency, the computer device may perform video segmentation on the target video in advance, extract the fusion video features corresponding to each video segment obtained by the segmentation, input the fusion video features corresponding to each video segment into the trained video text matching model, obtain the target video features corresponding to each video segment, and store each target video feature. Subsequently, if the computer equipment receives the real-time barrage corresponding to the target video, the computer equipment only needs to use the video text matching model once. And the computer equipment extracts initial text features corresponding to the real-time barrage, and inputs the initial text features into a trained video text matching model to obtain target text features corresponding to the real-time barrage. Then, the computer device may obtain, from the data storage system, a video clip corresponding to a target video feature that is most similar to the target text feature as a target video clip corresponding to the real-time barrage. For example, the target video feature that is most similar to the target text feature is found by calculating the distance between the features, cosine similarity, etc.
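A sketch of this precompute-and-retrieve strategy, again reusing the `VideoTextMatcher` sketch; the storage layout and the use of cosine similarity for retrieval are assumptions:

```python
import torch
import torch.nn.functional as F

def precompute_segment_features(matcher, fused_video_feats):
    """Run the video branch once per segment and store the target video features."""
    with torch.no_grad():
        return matcher.video_net(fused_video_feats)          # (num_segments, joint_dim)

def match_realtime_danmaku(matcher, initial_text_feat, stored_video_feats):
    """At bullet-screen time only the text branch is run; retrieval is a similarity search."""
    with torch.no_grad():
        t = matcher.text_net(initial_text_feat)              # (1, joint_dim)
        sims = F.cosine_similarity(t, stored_video_feats)    # matching degree per segment
    return int(sims.argmax())                                # index of the target video segment
```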
Of course, after the computer device obtains the real-time bullet screen corresponding to the target video, the computer device may input the initial text feature corresponding to the real-time bullet screen and the fusion video feature corresponding to the video segment to be matched into the trained video text matching model, and the model outputs the matching degree of the real-time bullet screen and the video segment to be matched. And determining a target video segment corresponding to the real-time barrage from the plurality of video segments to be matched based on the matching degree of the model output.
In the above embodiment, the video text matching model may be obtained through rapid training in a supervised training manner.
In one embodiment, the method further comprises:
performing video medium replacement detection on the target video to obtain a detection result; when the detection result shows that the target video is replaced by the video medium, updating the video segments of the target video to obtain updated video segments corresponding to the target video; calculating the matching degree of each bullet screen corresponding to the target video and each updated video segment respectively to obtain a plurality of updated matching degrees; based on the updated matching degree, the association relation corresponding to each barrage is updated.
Wherein the video media replacement detection is used to detect whether media replacement has occurred in the video. Video media replacement refers to a change in the content of a video, for example, media replacement of a video due to insertion of an advertisement, media replacement of a video due to deletion of a partial shot, and the like.
Specifically, media replacement of a video may cause the duration of the new video to be inconsistent with that of the original video, so that the bullet screens of the original video no longer match the new video. Therefore, in order to ensure that the barrage and the video clips are always matched, the computer device can perform video medium replacement detection on the target video, and if the detection result of the video medium replacement detection indicates that the target video has undergone video medium replacement, the computer device can match all the barrages of the target video with the video clips again to update the target video clips respectively corresponding to the barrages. If the target video has undergone medium replacement, the computer device can update the video segments of the target video, that is, perform video segmentation on the target video again to obtain each updated video segment corresponding to the target video. The computer device can recalculate the matching degree of each barrage of the target video with each updated video segment to obtain a plurality of updated matching degrees, and update the association relation corresponding to each barrage based on each updated matching degree. It will be appreciated that each bullet screen corresponding to the target video includes historical bullet screens and current to future real-time bullet screens. A historical barrage refers to a real-time barrage received by the computer device at a historical time; the computer device can update the target video segment corresponding to the historical barrage based on the updated matching degree, update the association relation corresponding to the historical barrage, establish the association relation between the historical barrage and the newly determined target video segment, and recalibrate the barrage moment of the historical barrage. For current to future real-time barrages, the computer device can determine the target video segment corresponding to the real-time barrage from the updated video segments, and then establish the association relation between the real-time barrage and the target video segment.
It will be appreciated that the matching degree between the bullet screen and the video clip can be calculated by referring to the content of each of the foregoing related embodiments.
In one embodiment, whether the media of a video has been replaced may be determined by detecting whether the VID (video identification) of the video has changed. If the VID of a video changes, re-matching of the bullet screens and the video clips is triggered.
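For illustration, the following minimal Python sketch shows how such a check could trigger re-matching; the VID comparison, the segmenter and matcher objects, and the id attributes are hypothetical names introduced only for this example and are not part of the disclosed implementation.

```python
# Hypothetical sketch: detect a VID change and rebuild the bullet screen /
# clip associations. The segmenter and matcher objects are assumed helpers.
def media_replaced(stored_vid: str, current_vid: str) -> bool:
    """A video is considered media-replaced when its recorded VID has changed."""
    return stored_vid != current_vid

def rematch_bullet_screens(video, bullet_screens, segmenter, matcher):
    updated_clips = segmenter.segment(video)          # re-segment the new video
    associations = {}
    for bs in bullet_screens:                         # historical and real-time bullet screens
        degrees = [(clip, matcher.matching_degree(bs, clip)) for clip in updated_clips]
        best_clip, _ = max(degrees, key=lambda pair: pair[1])
        associations[bs.id] = best_clip.id            # updated association relation
    return associations
```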
In the above embodiment, when the media of the target video is replaced, video segmentation is performed again and the bullet screens are re-matched with the video clips, which ensures that the bullet screen times remain accurate and that each bullet screen is always played synchronously with its matching video clip.
In one embodiment, as shown in fig. 8, a video bullet screen matching method is provided, and the method is applied to the playing terminal in fig. 1 for illustration, and includes the following steps:
step S802, acquiring a real-time barrage corresponding to a target video.
In one embodiment, the playing terminal is provided with an application program, such as an audio-visual application or an instant messaging application. Referring to fig. 9A, a user can watch a video in the audio-visual application, where each bullet screen of the video is displayed as a subtitle sliding across the picture. The user may also view bullet screens on videos in a feed of the instant messaging application. Referring to fig. 9B, a user may record content in the instant messaging application and browse short videos published by others on the authoring platform; the platform supports like and comment interaction, and videos can also be forwarded to a circle of friends or shared with friends in a chat scene.
Step S804, the real-time barrage is sent to the server, so that the server extracts initial text features corresponding to the real-time barrage, a plurality of video segments to be matched corresponding to the real-time barrage are determined from the target video, fusion video features corresponding to the video segments to be matched are obtained, the matching degree of the real-time barrage and the video segments to be matched respectively is calculated based on the initial text features and the fusion video features, the target video segments are determined from the video segments to be matched based on the matching degree, and the association relation between the real-time barrage and the target video segments is established; the fusion video features are obtained by feature fusion of target video frame feature sequences corresponding to the video clips to be matched, the target video frame feature sequences are obtained by feature extraction of all video frames in the video clips to be matched, and the target video frame feature sequences comprise target video frame features corresponding to all video frames in the same video clips to be matched.
Specifically, after the playing terminal acquires a real-time bullet screen published by a user watching the target video, it sends the real-time bullet screen to the server, so that the server calibrates the bullet screen time and determines the target video segment corresponding to the real-time bullet screen, ensuring that the bullet screen appears in the video segment whose content it matches.
It will be appreciated that the data processing procedure of the server may refer to the content of the foregoing embodiments, and will not be described herein.
Step S806, obtaining the association relation returned by the server, and synchronously playing the real-time bullet screen when the target video clip is played based on the association relation.
Specifically, after determining the target video segment corresponding to the real-time barrage, the server can establish the association relationship between the real-time barrage and the corresponding target video segment, and send the association relationship to the playing terminal, so that the playing terminal can synchronously play the real-time barrage when playing the target video segment based on the association relationship. For example, the playing terminal may determine the start playing time of the target video clip, and synchronously play the corresponding real-time bullet screen when the start video frame of the target video clip starts playing.
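As a simple illustration of how a playback terminal might use the returned association relation, the sketch below maps each bullet screen to the start time of its target clip; the dictionary layout is an assumption made for the example.

```python
# Hypothetical sketch: schedule each bullet screen at the start time of its
# associated target video clip.
def schedule_bullet_screens(associations: dict, clip_start_times: dict) -> dict:
    """associations: {bullet_screen_id: target_clip_id};
    clip_start_times: {clip_id: start time in seconds}."""
    return {bullet_id: clip_start_times[clip_id]
            for bullet_id, clip_id in associations.items()}

# Example: bullet screen "b1" is associated with clip "c3" starting at 42.0 s,
# so it is rendered when playback reaches 42.0 s.
display_times = schedule_bullet_screens({"b1": "c3"}, {"c3": 42.0})
```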
In the video bullet screen matching method, the playing terminal sends the real-time bullet screen corresponding to the target video to the server. The server extracts the initial text features corresponding to the real-time bullet screen, determines a plurality of video segments to be matched corresponding to the real-time bullet screen from the target video, and acquires the initial video frame feature sequence corresponding to each video segment to be matched; the initial video frame feature sequence comprises the initial video frame features corresponding to the video frames in the same video segment to be matched. The server performs feature extraction on each initial video frame feature sequence to obtain the target video frame feature sequence corresponding to each video segment to be matched, performs feature fusion on each target video frame feature sequence to obtain the fusion video feature corresponding to each video segment to be matched, calculates the matching degree between the real-time bullet screen and each video segment to be matched based on the initial text features and the fusion video features, determines the target video segment from the video segments to be matched based on the matching degrees, and establishes the association relation between the real-time bullet screen and the target video segment. The server sends the association relation to the playing terminal, and the playing terminal synchronously plays the real-time bullet screen when playing the target video segment based on the association relation. In this way, whenever the playing terminal obtains the real-time bullet screen most recently published by a user watching the target video, it sends the bullet screen to the server; the server can determine the target video segment whose content matches the real-time bullet screen through the matching degree calculated from the text features of the real-time bullet screen and the video features of the video segments. When the playing terminal then plays the target video, it can synchronously play the content-matched real-time bullet screen while playing the target video segment, which improves the matching accuracy of bullet screens and video segments and ensures that the bullet screens and video segments of the target video are always accurately matched and played.
In a specific embodiment, referring to fig. 10, the video bullet screen matching method of the present application includes the following steps:
1. Data preparation
1. Video clip segmentation
For any video, the server segments the video into a number of video segments. For example, the server may perform video black frame detection and scene switching detection on the video, determine a plurality of split video frames from a complete video according to the detection result, and split the video into a plurality of video segments according to the split video frames.
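A rough Python sketch of this step is given below, assuming the frames are available as RGB arrays; the luminance proxy and the two thresholds are illustrative stand-ins for the colour-space statistics and thresholds described in the earlier embodiments.

```python
import numpy as np

def find_split_frames(frames, black_threshold=10.0, change_threshold=30.0):
    """Return indices of segmentation video frames found by black-frame
    detection and scene-switch detection (illustrative thresholds)."""
    splits, prev_gray = [], None
    for idx, frame in enumerate(frames):
        gray = frame.astype(np.float32).mean(axis=2)      # rough luminance map
        if gray.mean() < black_threshold:                  # black-frame detection
            splits.append(idx)
        elif prev_gray is not None and np.abs(gray - prev_gray).mean() > change_threshold:
            splits.append(idx)                              # scene-switch detection
        prev_gray = gray
    return splits

def split_into_clips(frames, splits):
    """Cut the frame sequence into video clips at the segmentation frames."""
    bounds = [0] + splits + [len(frames)]
    return [frames[s:e] for s, e in zip(bounds[:-1], bounds[1:]) if e > s]
```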
2. Video feature extraction
For any video segment, the server performs feature extraction on the video frames in the video segment to obtain the initial video frame feature corresponding to each video frame, and orders the initial video frame features by time to obtain an initial video frame feature sequence, wherein each initial video frame feature comprises video frame sub-features corresponding to a plurality of feature channels. The server shifts part of the feature channels along the time dimension to exchange information between adjacent frames, and then applies a two-dimensional convolution to improve the expressive power of the features. The two-dimensional convolution yields a target video frame feature sequence composed of the target video frame features corresponding to the video frames.
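The channel shift and two-dimensional convolution can be sketched as follows (in the spirit of a temporal shift module); the shift fraction, feature map sizes, and zero padding at the clip boundaries are assumptions made for illustration.

```python
import torch

def temporal_shift(x: torch.Tensor, shift_div: int = 8) -> torch.Tensor:
    """x: (T, C, H, W) frame features of one clip. One slice of channels is
    shifted to the next frame, another to the previous frame, and the
    boundary positions are filled with preset (zero) sub-features."""
    t, c, h, w = x.shape
    fold = c // shift_div
    out = torch.zeros_like(x)
    out[1:, :fold] = x[:-1, :fold]                   # shift forward in time
    out[:-1, fold:2 * fold] = x[1:, fold:2 * fold]   # shift backward in time
    out[:, 2 * fold:] = x[:, 2 * fold:]              # remaining channels unchanged
    return out

# After the shift, a 2D convolution mixes the temporally blended channels frame
# by frame to produce the target video frame features (toy sizes shown).
conv = torch.nn.Conv2d(in_channels=256, out_channels=256, kernel_size=3, padding=1)
clip_features = torch.randn(16, 256, 7, 7)              # 16 frames of one clip
target_features = conv(temporal_shift(clip_features))   # (16, 256, 7, 7)
```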
Further, the server can perform feature fusion on the target video frame feature sequence, fusing the video frame-level data into video-level data to obtain the fusion video feature corresponding to the video clip.
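One way to realize this fusion, following the cluster-centre and attention description in the embodiments above, is sketched below; the number of centres, the attention layer, and the feature dimensions are illustrative assumptions.

```python
import torch

def fuse_clip_features(frame_feats: torch.Tensor, centres: torch.Tensor,
                       attn: torch.nn.Linear) -> torch.Tensor:
    """frame_feats: (T, D) target video frame features of one clip;
    centres: (K, D) cluster centre features for different video frame topics."""
    dists = torch.cdist(frame_feats, centres)              # (T, K) frame-to-centre distances
    nearest = centres[dists.argmin(dim=1)]                 # target centre feature per frame
    residuals = frame_feats - nearest                      # target feature distance per frame
    weights = torch.softmax(attn(frame_feats).squeeze(-1), dim=0)   # attention over frames
    return (weights.unsqueeze(-1) * residuals).sum(dim=0)  # (D,) fusion video feature

attn = torch.nn.Linear(128, 1)
fused = fuse_clip_features(torch.randn(16, 128), torch.randn(64, 128), attn)
```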
3. Text feature extraction
The server performs feature extraction on the bullet screen to obtain the initial text feature corresponding to the bullet screen.
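The disclosure leaves the text processing model unspecified; purely as an illustrative stand-in, a pretrained BERT encoder could map a bullet screen to its initial text feature, for example by mean-pooling the token embeddings as sketched below.

```python
import torch
from transformers import AutoTokenizer, AutoModel

# Illustrative stand-in for the text processing model (not fixed by the disclosure).
tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese")
encoder = AutoModel.from_pretrained("bert-base-chinese")

def initial_text_feature(bullet_screen: str) -> torch.Tensor:
    """Encode one bullet screen into an initial text feature vector."""
    inputs = tokenizer(bullet_screen, return_tensors="pt", truncation=True)
    with torch.no_grad():
        hidden = encoder(**inputs).last_hidden_state    # (1, L, 768) token embeddings
    return hidden.mean(dim=1).squeeze(0)                # (768,) mean-pooled feature
```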
2. Training phase
The server trains a video text matching model for predicting the matching degree between a video clip and a bullet screen. The video text matching model comprises two branches: one branch takes the fusion video feature of a video clip as input, the other takes the initial text feature of a bullet screen as input, and each branch performs further feature extraction through several fully connected layers. During model training, the features of the training video clip and the training bullet screen in a training sample are input into the video text matching model, the two branches respectively output the target video feature and the target text feature, the cosine similarity between the target text feature and the target video feature is calculated, and the matching degree between the bullet screen and the video clip is predicted through a fully connected layer and a Sigmoid activation function to obtain a prediction label. The training loss is calculated based on the training label and the prediction label corresponding to the training sample, and the model parameters of the video text matching model are adjusted based on the training loss until the convergence condition is met, so that a trained video text matching model is obtained. It will be appreciated that if the training label is a classification label (match or mismatch), the training loss is a classification loss.
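A compact PyTorch sketch of such a two-branch model and one supervised training step is shown below; the layer sizes, optimizer, and toy batch are assumptions chosen only to make the example self-contained.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VideoTextMatcher(nn.Module):
    """Two branches of fully connected layers; the matching degree is predicted
    from the cosine similarity via a fully connected layer and a Sigmoid."""
    def __init__(self, text_dim=768, video_dim=128, hidden=256):
        super().__init__()
        self.text_branch = nn.Sequential(nn.Linear(text_dim, hidden), nn.ReLU(),
                                         nn.Linear(hidden, hidden))
        self.video_branch = nn.Sequential(nn.Linear(video_dim, hidden), nn.ReLU(),
                                          nn.Linear(hidden, hidden))
        self.head = nn.Linear(1, 1)

    def forward(self, text_feat, video_feat):
        t = self.text_branch(text_feat)                  # target text feature
        v = self.video_branch(video_feat)                # target video feature
        sim = F.cosine_similarity(t, v, dim=-1).unsqueeze(-1)
        return torch.sigmoid(self.head(sim)).squeeze(-1) # predicted matching degree

# One training step with binary (match / mismatch) labels and a classification loss.
model = VideoTextMatcher()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
pred = model(torch.randn(8, 768), torch.randn(8, 128))   # toy batch of 8 samples
loss = F.binary_cross_entropy(pred, torch.randint(0, 2, (8,)).float())
loss.backward()
optimizer.step()
```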
3. Offline stage
Once a video has been uploaded to the server, video media replacement does not occur frequently, so video clip segmentation and video feature extraction can be triggered automatically after the video is uploaded, and the fusion video feature corresponding to each video clip is input into the trained video text matching model to obtain the target video feature. The server may then store the target video feature corresponding to each video clip in a database, for example a hard disk or an in-memory database.
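As an illustration, and reusing the matcher sketched above, the offline step can be reduced to running the video branch over the fusion video feature of every clip and keeping the results in a feature library; an in-memory dictionary stands in here for the hard disk or in-memory database.

```python
import torch

def build_video_feature_library(model, fused_features_by_clip: dict) -> dict:
    """fused_features_by_clip: {clip_id: fusion video feature tensor}.
    Returns {clip_id: target video feature} for later retrieval."""
    library = {}
    with torch.no_grad():
        for clip_id, fused in fused_features_by_clip.items():
            library[clip_id] = model.video_branch(fused)   # target video feature
    return library
```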
4. Online stage
Whenever the playing terminal obtains a real-time bullet screen published by a user watching the target video, it sends the real-time bullet screen to the server. After the server acquires the real-time bullet screen corresponding to the target video, text feature extraction is triggered automatically, and the initial text feature corresponding to the real-time bullet screen is input into the trained video text matching model to obtain the target text feature. The server performs a nearest neighbor search in the video feature library corresponding to the target video based on the target text feature, takes the video clip corresponding to the retrieved target video feature as the video clip that best matches the real-time bullet screen, and considers that the real-time bullet screen belongs to that video clip. The server accordingly establishes an association relation between the real-time bullet screen and the video clip, and the association relation is used for synchronously playing the real-time bullet screen when the target video clip is played.
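The online lookup can then be sketched as a nearest-neighbour search over the stored target video features using cosine similarity; this again reuses the hypothetical matcher and feature library from the sketches above.

```python
import torch
import torch.nn.functional as F

def find_target_clip(model, library: dict, initial_text_feature: torch.Tensor):
    """Return the clip id whose target video feature is most similar to the
    target text feature of the new real-time bullet screen."""
    with torch.no_grad():
        query = model.text_branch(initial_text_feature)    # target text feature
        best_clip, best_sim = None, float("-inf")
        for clip_id, video_feature in library.items():
            sim = F.cosine_similarity(query, video_feature, dim=0).item()
            if sim > best_sim:
                best_clip, best_sim = clip_id, sim
    return best_clip
```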
If the media of the video is replaced, the server needs to redo the video processing, including video segment segmentation and video feature extraction, rerun the matching prediction for all bullet screens of the video, and recalibrate the bullet screen times.
It will be appreciated that the predictive phase in fig. 10 includes the offline phase and the online phase described above.
With the above video bullet screen matching method, bullet screen time calibration can be performed on bullet screens in a video platform whose text does not match the picture they appear with, so that these bullet screens appear in their corresponding video segments, improving the matching accuracy of bullet screens and video segments.
It should be noted that, the user information (including, but not limited to, user equipment information, user personal information, etc.) and the data (including, but not limited to, data for analysis, stored data, presented data, etc.) referred to in the present application are information and data authorized by the user or sufficiently authorized by each party, and the collection, use and processing of the related data are required to comply with the related laws and regulations and standards of the related countries and regions.
It should be understood that, although the steps in the flowcharts related to the embodiments described above are shown in sequence as indicated by the arrows, these steps are not necessarily performed in the order indicated by the arrows. Unless explicitly stated herein, the execution order of these steps is not strictly limited, and the steps may be performed in other orders. Moreover, at least some of the steps in the flowcharts related to the above embodiments may include a plurality of steps or stages, which are not necessarily performed at the same time but may be performed at different moments; these steps or stages are also not necessarily executed sequentially, and may be executed in turn or alternately with at least some of the other steps, or with at least part of the steps or stages in the other steps.
Based on the same inventive concept, the embodiment of the application also provides a video barrage matching device for realizing the video barrage matching method. The implementation of the solution provided by the device is similar to the implementation described in the above method, so the specific limitation in the embodiments of the video bullet screen matching device or devices provided below may be referred to the limitation of the video bullet screen matching method hereinabove, and will not be repeated herein.
In one embodiment, as shown in fig. 11, a video bullet screen matching apparatus 1100 is provided, which specifically includes: the bullet screen processing module 1102, the video feature acquisition module 1104, the matching degree calculation module 1106, the target video clip determination module 1108 and the association relation establishment module 1110, wherein:

The barrage processing module 1102 is used for acquiring the real-time barrage corresponding to the target video and extracting the initial text features corresponding to the real-time barrage.

The video feature acquisition module 1104 is configured to determine a plurality of video segments to be matched corresponding to the real-time bullet screen from the target video, and acquire fusion video features corresponding to the video segments to be matched; the fusion video features are obtained by feature fusion of target video frame feature sequences corresponding to the video clips to be matched, the target video frame feature sequences are obtained by feature extraction of all video frames in the video clips to be matched, and the target video frame feature sequences comprise target video frame features corresponding to all video frames in the same video clips to be matched.

The matching degree calculating module 1106 is configured to calculate matching degrees of the real-time barrage and each video segment to be matched based on the initial text feature and the fusion video feature.

The target video segment determining module 1108 is configured to determine a target video segment from the video segments to be matched based on the matching degree.

The association relationship establishing module 1110 is configured to establish an association relationship between the real-time bullet screen and the target video clip, where the association relationship is used to play the real-time bullet screen synchronously when the target video clip is played.
According to the video bullet screen matching device, when the latest real-time bullet screen published by a user when watching a target video is obtained, the target video clip matched with the real-time bullet screen content is accurately determined through the matching degree calculated based on the text characteristics of the real-time bullet screen and the video characteristics of the video clip, so that the real-time bullet screen is synchronously played when the target video clip is played, the matching accuracy of the bullet screen and the video clip is improved, and the bullet screen and the video clip of the target video are ensured to be accurately matched and played all the time.
In one embodiment, the video feature acquisition module comprises:
the video segment to be matched determining unit is used for determining segmentation video frames from all target video frames based on pixel information corresponding to all target video frames of the target video; video segmentation is carried out on the target video based on the segmented video frames, so that a plurality of initial video clips are obtained; and determining a plurality of video clips to be matched corresponding to the real-time barrage from each initial video clip.
In one embodiment, the video segment to be matched determining unit is further configured to obtain a first pixel value of each pixel point in each target video frame in the first color space; counting all first pixel values corresponding to the same target video frame to obtain pixel information corresponding to all target video frames; and taking the target video frame with the pixel information smaller than the first threshold value as the segmentation video frame.
In one embodiment, the video segment to be matched determining unit is further configured to obtain second pixel values of each pixel point in the same target video frame in a second color space, so as to obtain pixel information corresponding to each target video frame; calculate pixel change information between adjacent target video frames based on the second pixel values corresponding to the matched pixel points in the adjacent target video frames; and determine a segmentation video frame from adjacent target video frames whose pixel change information is greater than a second threshold.
In one embodiment, the video feature acquisition module comprises:
the fusion video feature acquisition unit is used for extracting features of each video segment to be matched respectively to obtain an initial video frame feature sequence corresponding to each video segment to be matched; the initial video frame feature sequence is obtained by sequencing initial video frame features corresponding to each video frame in the same video segment to be matched according to video frame time stamps, wherein the initial video frame features comprise video frame sub-features corresponding to a plurality of feature channels respectively; in an initial video frame feature sequence corresponding to the same video segment to be matched, carrying out feature shift on video frame sub-features corresponding to the target feature channel based on ordering information of initial video frame features to obtain an intermediate video frame feature sequence corresponding to each video segment to be matched; respectively carrying out two-dimensional convolution processing on each intermediate video frame characteristic sequence to obtain a target video frame characteristic sequence corresponding to each current video segment to be matched; and respectively carrying out feature fusion on each target video frame feature sequence to obtain fusion video features corresponding to each current video segment to be matched.
In one embodiment, the fusion video feature obtaining unit is further configured to use, in an initial video frame feature sequence corresponding to a video segment to be matched currently, a video frame sub-feature corresponding to a target feature channel in each initial video frame feature as a target sub-feature; updating target sub-features corresponding to adjacent video frames based on target sub-features corresponding to the current video frames aiming at the initial video frame features to obtain intermediate video frame features corresponding to each video frame of the current video segment to be matched; and sequencing all the intermediate video frame features according to the video frame time stamp to obtain an intermediate video frame feature sequence corresponding to the current video segment to be matched.
In one embodiment, the fused video feature obtaining unit is further configured to update a target sub-feature corresponding to a next video frame based on a target sub-feature corresponding to a current video frame, and configure a target sub-feature corresponding to a starting video frame as a preset sub-feature.
In one embodiment, the target sub-feature includes a first sub-feature corresponding to a first one of the target feature channels and a second sub-feature corresponding to other ones of the target feature channels. The fusion video feature acquisition unit is further used for updating a first sub-feature corresponding to a next video frame based on the first sub-feature corresponding to the current video frame, and configuring the first sub-feature corresponding to the initial video frame as a preset sub-feature; and updating the second sub-feature corresponding to the previous video frame based on the second sub-feature corresponding to the current video frame, and configuring the second sub-feature corresponding to the ending video frame as a preset sub-feature.
In one embodiment, the video bullet screen matching apparatus 1100 includes:
the feature fusion module is used for acquiring a plurality of cluster center features; each cluster center feature corresponds to a different video frame theme; aiming at a target video frame characteristic sequence corresponding to the current video segment to be matched, determining target center characteristics from all the cluster center characteristics based on the distances between the target video frame characteristics corresponding to the same video frame and all the cluster center characteristics, and obtaining target center characteristics respectively corresponding to all the video frames of the current video segment to be matched; obtaining the target feature distance corresponding to each video frame of the current video segment to be matched based on the distance between the target video frame feature and the target center feature corresponding to the same video frame; performing attention distribution on the characteristics of each target video frame corresponding to the current video segment to be matched to obtain attention weights respectively corresponding to each video frame of the current video segment to be matched; and fusing the target feature distances based on the attention weight to obtain the fused video features corresponding to the video segments to be matched currently.
In one embodiment, the video bullet screen matching apparatus 1100 includes:
The feature processing module is used for inputting the current video segment to be matched into the video feature extraction model; the video feature extraction model comprises a first feature extraction layer, a second feature extraction layer and a feature fusion layer; performing feature extraction on the current video segment to be matched through a first feature extraction layer to obtain an initial video frame feature sequence corresponding to the current video segment to be matched; performing feature shifting on video frame sub-features corresponding to the target feature channels in the initial video frame feature sequence based on the ordering information of the initial video frame features through the second feature extraction layer to obtain an intermediate video frame feature sequence corresponding to the current video segment to be matched; performing two-dimensional convolution processing on the intermediate video frame feature sequence through the second feature extraction layer to obtain a target video frame feature sequence corresponding to the current video segment to be matched; and carrying out feature fusion on the target video frame feature sequence through a feature fusion layer to obtain fusion video features corresponding to the video segments to be matched currently.
In one embodiment, the matching degree calculation module is further configured to input the real-time barrage into a text processing model to obtain an initial text feature; inputting the initial text features into a text feature extraction network in a trained video text matching model to obtain target text features corresponding to the real-time barrage; inputting the fused video features into a video feature extraction network in a video text matching model to obtain target video features; and obtaining the matching degree of the real-time barrage and each video segment to be matched based on the similarity between the target text features and the target video features.
In one embodiment, the video bullet screen matching apparatus 1100 includes:
the model training module is used for acquiring training samples and training labels corresponding to the training samples; the training sample comprises a training video clip and a training barrage; inputting the fusion video features corresponding to the training video segments and the initial text features corresponding to the training barrages into a video text matching model to be trained to obtain a prediction tag; the prediction labels are obtained based on the similarity between the text features output by the text feature extraction network and the video features output by the video feature extraction network; and calculating training loss based on the training label and the prediction label, and adjusting model parameters of the video text matching model to be trained based on the training loss until convergence conditions are met, so as to obtain the trained video text matching model.
In one embodiment, the target video segment determining module is further configured to obtain, from each matching degree, a video segment to be matched corresponding to the maximum matching degree as the target video segment.
In one embodiment, the video bullet screen matching apparatus 1100 includes:
the video medium replacement detection module is used for carrying out video medium replacement detection on the target video to obtain a detection result; when the detection result shows that the target video is replaced by the video medium, updating the video segments of the target video to obtain updated video segments corresponding to the target video; calculating the matching degree of each bullet screen corresponding to the target video and each updated video segment respectively to obtain a plurality of updated matching degrees; based on the updated matching degree, the association relation corresponding to each barrage is updated.
In one embodiment, as shown in fig. 12, a video bullet screen matching apparatus 1200 is provided, which specifically includes: a barrage acquisition module 1202, a data matching module 1204, and a barrage play module 1206, wherein:

The barrage acquisition module 1202 is configured to acquire a real-time barrage corresponding to the target video.

The data matching module 1204 is configured to send the real-time barrage to the server, so that the server extracts initial text features corresponding to the real-time barrage, determines a plurality of video segments to be matched corresponding to the real-time barrage from the target video, obtains fusion video features corresponding to each video segment to be matched, calculates matching degrees of the real-time barrage and each video segment to be matched respectively based on the initial text features and the fusion video features, determines the target video segments from each video segment to be matched based on the matching degrees, and establishes an association relationship between the real-time barrage and the target video segments; the fusion video features are obtained by feature fusion of target video frame feature sequences corresponding to the video clips to be matched, the target video frame feature sequences are obtained by feature extraction of all video frames in the video clips to be matched, and the target video frame feature sequences comprise target video frame features corresponding to all video frames in the same video clips to be matched.

The barrage playing module 1206 is configured to obtain an association relationship returned by the server, and play the real-time barrages synchronously when playing the target video clips based on the association relationship.
According to the video bullet screen matching device, when the playing terminal obtains the real-time bullet screen which is newly published when a user watches a target video, the real-time bullet screen is sent to the server, the server can determine the target video fragment matched with the content of the real-time bullet screen through the matching degree calculated based on the text characteristics of the real-time bullet screen and the video characteristics of the video fragment, and further, when the playing terminal plays the target video, the real-time bullet screen matched with the content can be synchronously played when the target video fragment is played, so that the matching accuracy of the bullet screen and the video fragment is improved, and the bullet screen and the video fragment of the target video are ensured to be accurately matched and played all the time.
The modules in the above video bullet screen matching apparatus may be implemented in whole or in part by software, hardware, or a combination thereof. The above modules may be embedded in or independent of a processor in the computer device in the form of hardware, or stored in a memory in the computer device in the form of software, so that the processor can call and execute the operations corresponding to the above modules.
In one embodiment, a computer device is provided, which may be a server, and the internal structure of which may be as shown in fig. 13. The computer device includes a processor, a memory, and a network interface connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, computer programs, and a database. The internal memory provides an environment for the operation of the operating system and computer programs in the non-volatile storage media. The database of the computer device is used for storing data such as target videos, fusion video features, target video features and the like. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program when executed by a processor implements a video bullet screen matching method.
In one embodiment, a computer device is provided, which may be a terminal, and the internal structure thereof may be as shown in fig. 14. The computer device includes a processor, a memory, a communication interface, a display screen, and an input device connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for the operation of the operating system and computer programs in the non-volatile storage media. The communication interface of the computer device is used for carrying out wired or wireless communication with an external terminal, and the wireless mode can be realized through WIFI, a mobile cellular network, NFC (near field communication) or other technologies. The computer program when executed by a processor implements a video bullet screen matching method. The display screen of the computer equipment can be a liquid crystal display screen or an electronic ink display screen, and the input device of the computer equipment can be a touch layer covered on the display screen, can also be keys, a track ball or a touch pad arranged on the shell of the computer equipment, and can also be an external keyboard, a touch pad or a mouse and the like.
It will be appreciated by those skilled in the art that the structures shown in fig. 13 and 14 are merely block diagrams of portions of structures related to the aspects of the present application and are not intended to limit the computer device on which the aspects of the present application may be applied, and that a particular computer device may include more or fewer components than shown, or may combine certain components, or may have different arrangements of components.
In an embodiment, there is also provided a computer device comprising a memory and a processor, the memory having stored therein a computer program, the processor implementing the steps of the method embodiments described above when the computer program is executed.
In one embodiment, a computer-readable storage medium is provided, storing a computer program which, when executed by a processor, implements the steps of the method embodiments described above.
In one embodiment, a computer program product or computer program is provided that includes computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions, so that the computer device performs the steps in the above-described method embodiments.
Those skilled in the art will appreciate that implementing all or part of the above described methods may be accomplished by way of a computer program stored on a non-transitory computer readable storage medium, which when executed, may comprise the steps of the embodiments of the methods described above. Any reference to memory, database, or other medium used in the various embodiments provided herein may include at least one of non-volatile and volatile memory. The nonvolatile Memory may include Read-Only Memory (ROM), magnetic tape, floppy disk, flash Memory, optical Memory, high density embedded nonvolatile Memory, resistive random access Memory (ReRAM), magnetic random access Memory (Magnetoresistive Random Access Memory, MRAM), ferroelectric Memory (Ferroelectric Random Access Memory, FRAM), phase change Memory (Phase Change Memory, PCM), graphene Memory, and the like. Volatile memory can include random access memory (Random Access Memory, RAM) or external cache memory, and the like. By way of illustration, and not limitation, RAM can be in the form of a variety of forms, such as static random access memory (Static Random Access Memory, SRAM) or dynamic random access memory (Dynamic Random Access Memory, DRAM), and the like. The databases referred to in the various embodiments provided herein may include at least one of relational databases and non-relational databases. The non-relational database may include, but is not limited to, a blockchain-based distributed database, and the like. The processors referred to in the embodiments provided herein may be general purpose processors, central processing units, graphics processors, digital signal processors, programmable logic units, quantum computing-based data processing logic units, etc., without being limited thereto.
The technical features of the above embodiments may be combined arbitrarily. For brevity, not all possible combinations of the technical features in the above embodiments are described; however, as long as there is no contradiction in a combination of these technical features, the combination should be considered to fall within the scope of this specification.
The above embodiments represent only a few implementations of the present application, and their descriptions are relatively specific and detailed, but they should not be construed as limiting the scope of the present application. It should be noted that those skilled in the art could make various modifications and improvements without departing from the concept of the present application, and these would all fall within the protection scope of the present application. Accordingly, the scope of protection of the present application shall be subject to the appended claims.

Claims (32)

the real-time barrage is sent to a server, so that the server extracts initial text features corresponding to the real-time barrage, a plurality of video segments to be matched corresponding to the real-time barrage are determined from the target video, fusion video features corresponding to all the video segments to be matched are obtained, the matching degree of the real-time barrage and all the video segments to be matched is calculated based on the initial text features and the fusion video features, the target video segments are determined from all the video segments to be matched based on the matching degree, and the association relation between the real-time barrage and the target video segments is established; the fusion video features are obtained by feature fusion of target video frame feature sequences corresponding to video clips to be matched, the target video frame feature sequences are obtained by feature extraction of all video frames in the video clips to be matched, and the target video frame feature sequences comprise target video frame features corresponding to all video frames in the same video clip to be matched;
the data matching module is used for sending the real-time barrage to a server so that the server extracts initial text features corresponding to the real-time barrage, determines a plurality of video segments to be matched corresponding to the real-time barrage from the target video, acquires fusion video features corresponding to the video segments to be matched, calculates the matching degree of the real-time barrage and each video segment to be matched respectively based on the initial text features and the fusion video features, determines target video segments from each video segment to be matched based on the matching degree, and establishes the association relation between the real-time barrage and the target video segments; the fusion video features are obtained by feature fusion of target video frame feature sequences corresponding to video clips to be matched, the target video frame feature sequences are obtained by feature extraction of all video frames in the video clips to be matched, and the target video frame feature sequences comprise target video frame features corresponding to all video frames in the same video clip to be matched;
CN202111494410.2A | Priority date: 2021-12-08 | Filing date: 2021-12-08 | Video bullet screen matching method, device, computer equipment and storage medium | Status: Active | Granted publication: CN114339362B (en)

Priority Applications (1)

Application Number: CN202111494410.2A | Priority Date: 2021-12-08 | Filing Date: 2021-12-08 | Title: Video bullet screen matching method, device, computer equipment and storage medium

Applications Claiming Priority (1)

Application Number: CN202111494410.2A | Priority Date: 2021-12-08 | Filing Date: 2021-12-08 | Title: Video bullet screen matching method, device, computer equipment and storage medium

Publications (2)

Publication Number: CN114339362A (en) | Publication Date: 2022-04-12
Publication Number: CN114339362B (en) | Publication Date: 2023-06-13

Family

ID=81049929

Family Applications (1)

Application Number: CN202111494410.2A | Status: Active | Granted publication: CN114339362B (en) | Priority Date: 2021-12-08 | Filing Date: 2021-12-08 | Title: Video bullet screen matching method, device, computer equipment and storage medium

Country Status (1)

Country: CN | Link: CN114339362B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party

CN115035451B (en)* | Priority date: 2022-06-15 | Publication date: 2025-08-12 | 咪咕文化科技有限公司 | Video switching identification method, device, equipment and medium
CN115174957B (en)* | Priority date: 2022-06-27 | Publication date: 2023-08-15 | 咪咕文化科技有限公司 | Barrage call method, device, computer equipment and readable storage medium
CN115361595B (en)* | Priority date: 2022-07-28 | Publication date: 2024-04-26 | 华中科技大学 | Video barrage generation method
CN115526318B (en)* | Priority date: 2022-10-08 | Publication date: 2025-10-03 | 北京达佳互联信息技术有限公司 | Knowledge extraction method, device, electronic device and storage medium
CN115988266B (en)* | Priority date: 2022-12-14 | Publication date: 2024-11-05 | 北京有竹居网络技术有限公司 | Method, device, equipment and storage medium for generating interactive video
CN117649477B (en)* | Priority date: 2024-01-30 | Publication date: 2024-06-04 | 腾讯科技(深圳)有限公司 | Image processing method, device, equipment and storage medium

Citations (7)

* Cited by examiner, † Cited by third party

WO2019101038A1 (en)* | Priority date: 2017-11-22 | Publication date: 2019-05-31 | 腾讯科技(深圳)有限公司 | Bullet screen content control method, computer device and storage medium
CN111526374A (en)* | Priority date: 2019-02-01 | Publication date: 2020-08-11 | 广州虎牙信息科技有限公司 | Live broadcast-based bullet screen processing method, stream pulling method and device
CN111614986A (en)* | Priority date: 2020-04-03 | Publication date: 2020-09-01 | 威比网络科技(上海)有限公司 | Bullet screen generation method, system, equipment and storage medium based on online education
CN111708915A (en)* | Priority date: 2020-06-12 | Publication date: 2020-09-25 | 腾讯科技(深圳)有限公司 | Content recommendation method and device, computer equipment and storage medium
CN111984823A (en)* | Priority date: 2020-09-01 | Publication date: 2020-11-24 | 咪咕文化科技有限公司 | Video searching method, electronic device and storage medium
CN112533051A (en)* | Priority date: 2020-11-27 | Publication date: 2021-03-19 | 腾讯科技(深圳)有限公司 | Bullet screen information display method and device, computer equipment and storage medium
CN113691838A (en)* | Priority date: 2021-08-24 | Publication date: 2021-11-23 | 北京快乐茄信息技术有限公司 | Audio bullet screen processing method and device, electronic equipment and storage medium

Family Cites Families (5)

* Cited by examiner, † Cited by third party

CN105228013B (en)* | Priority date: 2015-09-28 | Publication date: 2018-09-07 | 百度在线网络技术(北京)有限公司 | Barrage information processing method, device and barrage video player
CN106921891B (en)* | Priority date: 2015-12-24 | Publication date: 2020-02-11 | 北京奇虎科技有限公司 | Method and device for displaying video characteristic information
CN106101747B (en)* | Priority date: 2016-06-03 | Publication date: 2019-07-16 | 腾讯科技(深圳)有限公司 | A kind of barrage content processing method and application server, user terminal
CN111954052B (en)* | Priority date: 2019-05-17 | Publication date: 2022-04-05 | 上海哔哩哔哩科技有限公司 | Method for displaying bullet screen information, computer equipment and readable storage medium
CN110933511B (en)* | Priority date: 2019-11-29 | Publication date: 2021-12-14 | 维沃移动通信有限公司 | A video sharing method, electronic device and medium


Also Published As

Publication Number: CN114339362A (en) | Publication Date: 2022-04-12


Legal Events

PB01 | Publication
SE01 | Entry into force of request for substantive examination
REG | Reference to a national code | Ref country code: HK | Ref legal event code: DE | Ref document number: 40072023 | Country of ref document: HK
GR01 | Patent grant
