CN120640034A

Movatterモバイル変換

Info

Publication number: CN120640034A
Application number: CN202510968762.9A
Authority: CN
Inventors: 罗先荣; 倪方君; 倪红良; 吴晓涛; 包正辉; 郑春雷; 张驰
Original assignee: Paco Video Technology Hangzhou Co ltd
Current assignee: Paco Video Technology Hangzhou Co ltd
Priority date: 2025-07-15
Filing date: 2025-07-15
Publication date: 2025-09-12

Abstract

The application provides a video program feature extraction method based on artificial intelligence, which relates to the technical field of IPTV and is applied to cloud nodes, and comprises the steps of reading IPTV video files stored by the cloud nodes and decoding the IPTV video files into an original frame sequence; the method comprises the steps of performing shot disassembly on an original frame sequence to determine a plurality of shot fragments, performing self-adaptive key frame extraction on each shot fragment, determining sub-fragments by taking a key frame as a core, performing differential extraction on image features of each frame of image in the sub-fragments to obtain feature subsets corresponding to the sub-fragments, integrating the feature subsets belonging to the same shot fragment to obtain scene feature sets, and integrating the scene feature sets corresponding to each shot fragment to obtain video feature sets corresponding to IPTV video files. The scheme is provided with a new feature extraction scheme, is adaptive to a novel IPTV architecture, and is beneficial to improving the extraction efficiency of the IPTV video program feature set.

Description

Video program feature extraction method based on artificial intelligence

Technical Field

The application relates to the technical field of IPTV, in particular to a video program feature extraction method based on artificial intelligence.

Background

In the existing IPTV architecture, various auxiliary services are usually completed by a cloud end and embedded into video programs, the flexibility of the mode is poor, dynamic expansion or adjustment is difficult, if new functions (such as intelligent analysis on characters, articles and the like in the programs, intelligent advertisement recommendation and the like) are required to be added, the scheme of the existing scheme solidified to the coding flow is obviously difficult to apply and inflexible. Therefore, a new IPTV architecture is provided, by synchronously transmitting the video stream and the feature stream, the video stream is sent to the end side node, the end side node decodes the video stream, the feature stream is sent to the edge node, the edge node utilizes the feature stream to conduct intelligent processing and identification so as to provide auxiliary service, and the end side node can integrate and display the auxiliary service and video content, so that flexible expansion of the auxiliary service is realized, and the real-time interaction requirement of a user can be met (the technology is used for applying another patent).

In order to adapt to the application of the auxiliary service in the architecture, a large number of video programs stored in the cloud end need to be subjected to feature extraction, so as to provide support for the precoding of the feature stream. Although feature extraction (such as intelligent recommendation and highlight clip) is performed on a part of video programs because various intelligent applications are provided, the feature extraction is difficult to directly apply, because the feature flow corresponding to video frames of the video programs one by one is needed in an IPTV architecture (an end-side node-edge node-cloud node architecture) for design planning, the feature extracted by the traditional feature extraction scheme does not meet the condition, the feature extraction scheme needs to be designed pertinently, and a foundation is provided for flexibly deploying more intelligent auxiliary services by matching with the IPTV architecture.

However, if the existing feature extraction scheme is directly adopted to extract the full feature vector from frame to frame for the IPTV video program, the workload is too great, i.e., the number of video programs stored in the cloud node is great, each video program includes a large number of video frames (for example, a single movie is 100 minutes long and 30 frames per second, 6000 x 30=180000 frames), the full feature extraction is performed from frame to frame, the consumed resources are too great, and the efficiency is too low. Therefore, there is a need to design a more efficient feature extraction scheme suitable for use under this IPTV architecture.

Disclosure of Invention

The embodiment of the application aims to provide an artificial intelligence-based video program feature extraction method so as to improve the extraction efficiency of IPTV video program feature streams and adapt to a novel IPTV architecture.

In order to achieve the above object, an embodiment of the present application is achieved by:

The embodiment of the application provides a video program feature extraction method based on artificial intelligence, which is applied to cloud nodes and comprises the steps of reading IPTV video files stored in the cloud nodes, decoding the IPTV video files into an original frame sequence, carrying out lens disassembly on the original frame sequence to determine a plurality of lens fragments, carrying out self-adaptive key frame extraction on each lens fragment, taking the key frame as a core to determine sub-fragments, carrying out differential extraction on image features of each frame image in the sub-fragments to obtain feature subsets corresponding to the sub-fragments, integrating the feature subsets belonging to the same lens fragment to obtain scene feature sets, and integrating the scene feature sets corresponding to each lens fragment to obtain the video feature sets corresponding to the IPTV video files.

The method has the advantages that IPTV video files stored in cloud nodes are read, decoded into original frame sequences, shot disassembly is conducted to determine a plurality of shot segments, adaptive key frame extraction is conducted on each shot segment, key frames are used as cores to determine sub-segments, image features of each frame of images in the sub-segments are extracted in a differentiated mode (full feature vector extraction is conducted on the key frames, differential feature vector extraction is conducted on non-key frames in the sub-segments), feature subsets corresponding to the sub-segments are obtained, feature subsets belonging to the same shot segments are integrated to obtain scene feature sets, and scene feature sets corresponding to each shot segment are integrated to obtain video feature sets corresponding to the IPTV video files. According to the scheme, through layering processing logic of lens disassembly, key frame extraction and sub-segment feature extraction, the number of frames of the total features to be extracted is greatly reduced, feature extraction is focused on key frames and related sub-segments, and the calculated amount is remarkably reduced. According to the feature extraction scheme, features are strictly synchronous with video frames, so that the features of each video frame can be deduced through the key frame association of the sub-segment to which the features belong, and the requirements of an end-edge-cloud collaborative architecture (namely an IPTV architecture of an end-side node-edge node-cloud node) on a frame-level synchronous feature stream are met.

In combination with the first aspect, in a first possible implementation manner of the first aspect, performing shot decomposition on an original frame sequence to determine a plurality of shot segments, including downsampling the original frame sequence to obtain a preprocessed frame sequence, performing color space conversion on the preprocessed frame sequence to obtain a corresponding Y-channel frame sequence and UV-channel frame sequence, performing luminance histogram statistics on frame images in the Y-channel frame sequence, performing chrominance histogram statistics on frame images in the UV-channel frame sequence, calculating a luminance distribution difference corresponding to the i-th frame image based on a luminance histogram statistics result of the i-th frame image and the (i-1) -th frame image, calculating a chrominance distribution difference corresponding to the i-th frame image based on a chrominance histogram statistics result of the i-th frame image and the (i-1) -th frame image, wherein 1<i is less than or equal to n, n is an image total number of the original frame sequence, performing optical flow analysis based on the i-th frame image and the (i-1) -th frame image in the Y-channel frame sequence to obtain a pixel motion index corresponding to the i-th frame image, wherein the pixel motion index comprises an average motion amplitude, a motion vector consistency and a motion vector variance, determining whether all the shot segments are divided based on the luminance distribution difference node segments based on the i-th frame image and the difference, and determining whether the shot segments are divided based on the difference node segments.

The method has the beneficial effects that in the lens disassembly process, three indexes of a brightness histogram (brightness distribution difference is calculated by two adjacent frames), a chromaticity histogram (chromaticity distribution difference is calculated by two adjacent frames) and optical flow analysis (motion amplitude, direction consistency and vector variance) are utilized to comprehensively judge the lens segmentation nodes. The multi-mode fusion can more accurately identify lens switching, wherein the brightness and the chromaticity are mainly used for capturing abrupt changes of the overall color distribution of a picture (such as scene switching), and the optical flow analysis is mainly used for detecting severe changes of local motion of the picture (such as rapid transition or object motion). Therefore, the lens disassembly is performed, the accuracy of lens disassembly can be improved, the misjudgment rate is reduced, and the lens disassembly method is suitable for two lens switching types of abrupt change (such as hard cutting) and gradual change (such as fade-in fade-out). In consideration of the precision required by lens disassembly, the original frame sequence is downsampled (spatial downsampling, such as 1080P is reduced to 540P, even 360P) to generate the preprocessing frame sequence, so that the calculation amount of the follow-up histogram statistics and optical flow analysis can be effectively reduced while the lens disassembly analysis condition is met (the lens segmentation precision loss is smaller, the calculation resource consumption is reduced by more than 70 percent, and the cost performance is extremely high).

The method has the advantages that according to different shot switching types, several shot segmentation node evaluation rules are mainly designed, if the brightness distribution difference corresponding to an ith frame image is higher than the set brightness distribution difference, the chroma distribution difference corresponding to the ith frame image is higher than the set chroma distribution difference, the motion direction consistency in the pixel motion indexes corresponding to the ith frame image is lower than a first set value, the ith frame image is marked as a candidate abrupt change shot segmentation node (the method is suitable for fast and accurately identifying abrupt change shots, the combination of brightness/chroma distribution difference threshold and the motion direction consistency index is used for capturing the transient abrupt change of picture color and structure, such as hard cut, black field switching and the like, and avoiding the leak judgment of a single detection logic on gradual change scene, and meanwhile, the low motion consistency represents the motion confusion of a picture main body, so that false abrupt change signals can be filtered, such as flash lamp interference can be realized, and if the motion vector variance corresponding to continuous x frame (such as 5 frames, 10 frames and the like) images is higher than a second set value and the calculated accumulated difference is higher than the set accumulated difference, the continuous x frame image is determined to be the gradual change image, the candidate gradual change image is accurately identified, the candidate gradual change is gradually changed, the candidate image is obtained, the gradual change is gradually changed, the candidate color is gradually changed, the gradual change is obtained, the gradual change is gradually change is determined, the candidate is gradually has a transition change, the gradual change is gradually changed, the gradual change is gradually, the transition is gradually has a transition change, and the transition is gradually changed, and the transition is gradually has a high-changed, and is can be accurately, and is stably has a high, and has a high-changed, the candidate shot segmentation nodes with an interval between the culling and the adjacent candidate shot segmentation nodes being less than a set frame number (the set frame number is converted to the frame number by taking time as a standard, for example, the shortest shot of a common IPTV program is not less than 2 seconds long, if the frame rate is 30, the x is set to 60 frames, and if the frame rate is 60, the x is required to be set to 120), so as to obtain all the shot segmentation nodes. Therefore, the segment division of the IPTV video program at the lens level can be efficiently and accurately realized, redundant lens division nodes with less than set frame number interval with adjacent nodes are eliminated, the condition of short jitter in the same lens instead of real switching can be effectively eliminated, and excessive division is avoided.

With reference to the first possible implementation manner of the first aspect, in a third possible implementation manner of the first aspect, performing adaptive keyframe extraction on each shot segment includes generating a sliding window with a set size smaller than the set frame number for each shot segment, determining a plurality of keyframes with a start frame of a current shot segment as a start point and a set inter-frame interval for a first sliding window of the shot segment, determining a plurality of keyframes in the sliding window with a previous keyframe as a start point and based on a brightness distribution difference, a chromaticity distribution difference and a pixel motion index corresponding to each frame image in the current sliding window, and determining a plurality of keyframes with an end frame of the current shot segment as an end point and a set inter-frame interval for a previous keyframe backward transition for a last sliding window of the current shot segment.

The method has the beneficial effects that as the shot segments of the IPTV video program are various, the fixed key frame extraction strategy is difficult to adapt to various dynamic shot scenes, and proper key frames are difficult to select for analysis. According to the scheme, a sliding window analysis scheme is designed, sliding window analysis (the size of a sliding window is 30 frames, 60 frames and the like) is adopted for each lens segment, a scheme of determining key frames by setting inter-frame intervals is adopted for the head and tail windows, the lens segments leading to abrupt lens switching and gradual lens switching can be simultaneously processed, the coverage of the initial stage and the end stage of the lens segments is ensured, and the omission of head and tail contents is avoided. And for the middle window (namely a non-head-tail sliding window) of the shot segment, determining a plurality of key frames in the sliding window by taking the previous key frame as a starting point based on the brightness distribution difference, the chromaticity distribution difference and the pixel motion index corresponding to each frame of image in the current sliding window. In the analysis of non-head-to-tail sliding windows, the sliding window analysis of fixed step length may cause a large amount of redundant calculation, which is not beneficial to accurately determining a proper key frame (in the scheme, since each subsequent key frame is used as the core of a sub-segment, full feature extraction is performed, and other non-key frames in the sub-segment using the key frame as the core are all used for differential feature vector extraction based on the key frame, therefore, the determination of the key frame can have an important influence on the subsequent feature extraction process, and the extracted features can be used as edge nodes to perform feature analysis to provide the basis of auxiliary services).

The method has the advantages that aiming at different scenes (lens fragments), a motion state grading and color difference grading assessment mechanism is designed, the motion state grading reflects the picture dynamic complexity (such as in a fast action scene, the motion state grading is obviously higher) through quantification of average motion amplitude, direction consistency and motion vector variance, if images with the motion state grading exceeding a motion state threshold value exist in a current sliding window, based on the motion state grading and the color difference grading corresponding to each frame image, a key frame (determined according to the size design of the sliding window and the window size of 30 frames generally) is determined, the number of corresponding key frames a is 5), and the color difference grading is calculated by combining brightness and color differences, so that the capture of picture color mutation (such as light shadow change) is facilitated, if images with the motion state grading exceeding the motion state threshold value do not exist in the current sliding window, but the color difference grading exceeding the color difference grading threshold value exists, b key frames (the number b is 3 based on the window size of 30 frames corresponding to each frame) are determined, and if images with the motion state grading exceeding the motion state threshold value does not exist in the current sliding window, and the color difference grading is compared with the conventional color grading (c is 2) at the moment, and the key frame size corresponding to the color grading is determined based on the key frame size of 2. Therefore, the scheme can intensively sample in sliding windows of severe motion or color change (such as explosion and transition), sparsely sample in static areas (such as fixed lenses), realize resource allocation as required, and ensure that the determined key frames are more accurate and are more beneficial to high-quality extraction of the features. And the maximum key frame quantity under different scenes is constrained by preset values of parameters a, b and c, so that the condition of increasing the key frame sampling density without limit is avoided.

With reference to the fourth possible implementation manner of the first aspect, in a fifth possible implementation manner of the first aspect, calculating a motion state score corresponding to each frame of image based on a pixel motion index corresponding to each frame of image in the current sliding window includes calculating a motion state score corresponding to a j-th frame of image in the current sliding window by using the following formula:

,

Wherein, theScoring the motion state corresponding to the j-th frame image in the current sliding window,、AndAs a parameter of the weight-bearing element,For the average motion amplitude corresponding to the j-th frame image in the current sliding window,As a reference value for the average motion amplitude,For the consistency of the movement direction corresponding to the j-th frame image in the current sliding window,For the motion vector variance corresponding to the j-th frame image in the current sliding window,Calculating a color difference score corresponding to each frame of image based on the brightness distribution difference and the chromaticity distribution difference corresponding to each frame of image, wherein the color difference score corresponding to the j frame of image in the current sliding window is calculated by adopting the following formula:

,

Wherein, theScoring the color difference corresponding to the j-th frame image in the current sliding window,For the brightness distribution difference corresponding to the j-th frame image in the current sliding window,For the chromaticity distribution difference corresponding to the j-th frame image in the current sliding window,Is a synergistic enhancement coefficient.

The method has the advantages that in a calculation scheme of the motion state score, the motion state score is easily led by utilizing the dimension difference of logarithmic compression motion amplitude (such as the amplitude value of a high-speed motion scene is extremely large), the index is prevented from easily leading the motion state score, the high motion consistency is converted into a low contribution value, the index with disordered directions (such as the situation of rapid shaking of a lens or collision of an object) is highlighted, the motion vector variance is approximately standardized (not standardized but converted by utilizing a motion vector variance reference value), the corresponding weight coefficient (which can be an empirically set parameter value or a learnable parameter value) is allocated to each item, so that different scenes can be dealt with, but the scheme takes the experience setting as an example in consideration of being relatively complex, and meanwhile, different weights are assigned, so that different video contents are flexibly adapted, and the generalization capability of processing different IPTV video programs is improved. The reference value is set according to empirical analysis, and the global statistic can be calculated in real time. The calculation scheme of the color difference score is characterized in that the brightness distribution difference and the chromaticity distribution difference are not in a purely linear relationship in different lens scenes, a certain synergy exists, when the brightness distribution difference and the chromaticity distribution difference are both high, the overall score is obviously improved (the condition of greatly changing the light shadow and the color in a lens segment is adapted), the condition of a single factor (the condition that the brightness distribution difference and the chromaticity distribution difference are high or low) is expressed as a higher dominant condition, and the score is lower for a steady scene and is not suitable for being used as a key frame. Therefore, the calculation scheme of the color difference score is designed to more accurately realize the color difference score of each frame, and the collaborative enhancement coefficient can be used as a means for balancing the sensitivity to different types of lens fragments, so that the adaptability to the different types of lens fragments is improved.

With reference to the fifth possible implementation manner of the first aspect, in a sixth possible implementation manner of the first aspect, determining a key frames based on the motion state score and the color difference score corresponding to each frame of image includes calculating a composite score corresponding to a j-th frame of image in the current sliding window by using the following formula:

,

Wherein, theThe composite score corresponding to the j-th frame image in the current sliding window,And determining a key frame with the highest comprehensive score based on the comprehensive score corresponding to each frame of image in the current sliding window.

Beneficial effects of motion state scoring for high motion type lens segmentsIs shown to exceed the movement state threshold,Gradually approaches to 1, the motion state score ratio in the comprehensive score is higher, and the motion-dominant key frame selection can be enhanced.

With reference to the fifth possible implementation manner of the first aspect, in a seventh possible implementation manner of the first aspect, determining b key frames based on the motion state score and the color difference score corresponding to each frame of image includes calculating a composite score corresponding to a j-th frame of image in the current sliding window by using the following formula:

,

Wherein, theThe composite score corresponding to the j-th frame image in the current sliding window,And determining b key frames with highest comprehensive scores based on the comprehensive scores corresponding to each frame of image in the current sliding window.

Beneficial effects of motion state scoring for low motion type lens segmentsDoes not exceed a motion state thresholdAt this time, the color difference score and the motion state score are required to be comprehensively calculatedAs a coefficient [ ]Non-zero) so that the color difference weight can be adaptively enhanced according to the color difference score, and key frames can be triggered through the color difference in lens segments with gentle motion, thereby determining more proper key frames.

With reference to the fifth possible implementation manner of the first aspect, in an eighth possible implementation manner of the first aspect, determining c key frames based on the motion state score and the color difference score corresponding to each frame of image includes calculating a composite score corresponding to a j-th frame of image in the current sliding window by using the following formula:

,

Wherein, theAnd c key frames with highest comprehensive scores are determined based on the comprehensive scores corresponding to each frame of image in the current sliding window.

The method has the advantages that for the situation that the motion state score and the color difference score are low, the motion state score and the color difference score can be directly summed (of course, other specific weights can be adopted), and the motion state score and the color difference score are not selected as key frames, but if all the motion state score and the color difference score are in a certain sliding window, the motion state score and the color difference score are selected as key frames with the highest comprehensive score (in the situation, because the picture changes are not large, the number of determined key frames is small, compared with other scenes with large changes, one key frame can dominate more non-key frames, and the key frames can be used as main bones of more non-key frames.

With reference to the fifth possible implementation manner of the first aspect, in a ninth possible implementation manner of the first aspect, determining a sub-segment by using a key frame as a core, differentially extracting image features of each frame of image in the sub-segment to obtain a feature subset corresponding to the sub-segment, where the current lens segment is divided into a plurality of sub-segments based on the key frame, a first frame of image in each sub-segment is a key frame, where each sub-segment does not include a key frame other than the first frame of image, performing full-scale feature vector extraction on the key frame for performing the sub-segment by using ResNet, performing differential feature vector extraction on each non-key frame of the sub-segment by using MobileNetV based on the key frame of the sub-segment, and performing feature vector integration based on the frame sequence of the sub-segment to obtain the feature subset corresponding to the sub-segment.

The method has the beneficial effects that for each sub-segment, global features are extracted through ResNet whole-frame processing, and MobileNetV is adopted to conduct differential feature extraction on non-key frames, so that the processing capacity can be effectively reduced, mobileNetV3 can be designed in a lightweight manner, the calculation cost is further reduced, and the method is suitable for cloud large-scale video processing.

In order to make the above objects, features and advantages of the present application more comprehensible, preferred embodiments accompanied with figures are described in detail below.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are needed in the embodiments of the present application will be briefly described below, it should be understood that the following drawings only illustrate some embodiments of the present application and should not be considered as limiting the scope, and other related drawings can be obtained according to these drawings without inventive effort for a person skilled in the art.

Fig. 1 is a flowchart of an artificial intelligence-based video program feature extraction method according to an embodiment of the present application.

FIG. 2 is a schematic diagram of determining shot segmentation nodes.

FIG. 3 is a schematic diagram of determining key frames.

Fig. 4 is a schematic diagram of a differential feature extraction process.

Detailed Description

The technical solutions in the embodiments of the present application will be described below with reference to the accompanying drawings in the embodiments of the present application.

Referring to fig. 1, fig. 1 is a flowchart of an artificial intelligence-based video program feature extraction method applied to a cloud node according to an embodiment of the present application, including step S10, step S20, step S30, step S40, and step S50.

In order to realize feature extraction of an IPTV video file stored in a cloud node, it is ensured that each frame of image has a corresponding feature so as to adapt to an IPTV architecture of an end-side-cloud (end-side node-edge node-cloud node), and in this embodiment, a feature extraction process of an IPTV video file is taken as an example.

First, the cloud node may run step S10.

And S10, reading the IPTV video file stored in the cloud node, and decoding the IPTV video file into an original frame sequence.

In this embodiment, the cloud node may read the stored IPTV video file to be processed, decode the IPTV video file into an original frame sequence, where the number of frames of the original frame sequence is n. For example, a set of program content of a television show, for a duration of 40 minutes, and at a frame rate of 30, the original sequence of frames comprises 40 x 60 x 30=72000 frames, and has a resolution of 1080P (i.e., 1920 x 1080 pixels), and as an example, video program content of higher resolution or lower resolution, other durations, and other frame rates are possible.

After the original frame sequence is obtained, the cloud node may run step S20.

And S20, performing shot disassembly on the original frame sequence to determine a plurality of shot fragments.

In this embodiment, the cloud node may downsample the original frame sequence (before this, the bottom region, for example, the bottom 1/5 region, may be blocked or cut off, so as to avoid the subtitle from affecting the extraction of each index) (downsampling herein refers to spatial downsampling, reducing the resolution, for example, reducing to 540P, 360P, etc. (if the bottom region is blocked or cut off, the actual resolution may not be strictly 540P, but the size of each frame image is still the same, and the processing is not affected, so that no additional explanation is made), so as to reduce the pixel size of the single frame image, and greatly improve the processing speed under the condition of losing a small amount of lens splitting precision), so as to obtain the preprocessed frame sequence, where the number of frames of the preprocessed frame sequence is n, which is the same as the number of frames of the original frame sequence.

And then performing color space conversion on the preprocessed frame sequence, and converting from RGB color space to YUV color space to obtain a corresponding Y channel frame sequence and UV channel frame sequence.

Accordingly, luminance histogram statistics can be performed on frame images in the Y-channel frame sequence, chrominance histogram statistics can be performed on frame images in the UV-channel frame sequence, and optical flow analysis can be performed on the basis of an ith frame image and an (i-1) th frame image in the Y-channel frame sequence, so that a pixel motion index corresponding to the ith frame image is obtained. The three processes can be synchronously processed, so that the efficiency is improved.

The optical flow analysis takes the Y channel frame sequence as input, and outputs corresponding parameters by using the optical flow analysis, wherein the parameters refer to pixel motion indexes corresponding to the ith frame image (1<i less than or equal to n) and comprise average motion amplitude, motion direction consistency and motion vector variance. For frame 1, the initial value is set to be the average motion amplitudeConsistency of movement directionAnd motion vector variance. It should be noted that the pixel motion index corresponding to the i-th frame image is calculated from the frame images in the Y-channel frame sequence, and corresponds to the i-th frame image in the original frame sequence.

Average motion amplitude corresponding to the i-th frame image:

, (1)

Wherein, theFor the average motion amplitude corresponding to the i-th frame image,The total amount of feature points of the ith frame image in the Y channel frame sequence,Is the first of the frame imageThe feature points (the optical flow method can be a dense optical flow method, namely, each pixel of the frame image is taken as a feature point, the optical flow method can be a sparse optical flow method, namely, a part of pixel points fixedly extracted from the frame image are taken as feature points, or the combination of the pixel points in a region is calculated as a feature point, and the feature points are extracted from the frame imageIndividual areas, thereby obtainingThe number of feature points, not limited herein),Is the first of the frame imageThe motion vectors of the feature points are calculated by an optical flow method,Is thatIs a die length of the die.

Motion direction consistency corresponding to the ith frame of image:

, (2)

Wherein, theFor the consistency of the motion direction corresponding to the ith frame image,Is the firstThe direction angle of the motion vector of each feature point,The main direction angle is obtained by taking the direction angle average value of the motion vectors of all the characteristic points of the frame image.

Motion vector variance corresponding to the i-th frame image:

, (3)

Wherein, theThe motion vector variance corresponding to the ith frame image.

After the brightness distribution difference, the chromaticity distribution difference and the pixel motion index corresponding to each frame of image are obtained, the cloud node can determine whether the ith frame of image is a lens segmentation node or not based on the brightness distribution difference, the chromaticity distribution difference and the pixel motion index corresponding to the ith frame of image, and accordingly all the lens segmentation nodes are determined.

In the lens disassembly process, three indexes of a brightness histogram (brightness distribution difference is calculated by two adjacent frames), a chromaticity histogram (chromaticity distribution difference is calculated by two adjacent frames) and optical flow analysis (motion amplitude, direction consistency and vector variance) are utilized to comprehensively judge the lens segmentation nodes. The multi-mode fusion can more accurately identify lens switching, wherein the brightness and the chromaticity are mainly used for capturing abrupt changes of the overall color distribution of a picture (such as scene switching), and the optical flow analysis is mainly used for detecting severe changes of local motion of the picture (such as rapid transition or object motion). Therefore, the lens disassembly is performed, the accuracy of lens disassembly can be improved, the misjudgment rate is reduced, and the lens disassembly method is suitable for two lens switching types of abrupt change (such as hard cutting) and gradual change (such as fade-in fade-out). Considering the precision required by lens disassembly, the original frame sequence is downsampled (spatial downsampling, such as reducing 1080P to 540P, even 360P, for example), different downsampling indexes can be determined, for example, action types such as reducing sports event programs to 540P and urban soap operas to 360P are reduced, so as to generate a preprocessed frame sequence, and the calculation amount of follow-up histogram statistics and optical flow analysis can be effectively reduced while the lens disassembly analysis condition is met (the lens segmentation precision loss is smaller, the calculation resource consumption is reduced by more than 70 percent, and the cost performance is extremely high).

In this embodiment, according to different shot switching types (hard cut and gradual change), the following shot segmentation node evaluation rules are mainly designed:

if the brightness distribution difference corresponding to the ith frame image is higher than the set brightness distribution difference, the chromaticity distribution difference corresponding to the ith frame image is higher than the set chromaticity distribution difference, and the motion direction consistency in the pixel motion index corresponding to the ith frame image is lower than a first set value, the cloud node can mark the ith frame image as a candidate abrupt lens segmentation node.

The scheme is suitable for quickly and accurately identifying the abrupt lens (namely hard cut), and captures transient abrupt changes of picture colors and structures, such as hard cut, black field switching and the like, through the combination of brightness/chromaticity distribution difference threshold and motion direction consistency indexes, so that the single detection logic is prevented from missing judgment of gradual change scenes, and meanwhile, low motion consistency indicates picture main motion confusion, and false abrupt change signals, such as flash lamp interference, can be filtered.

If the variance of the motion vector corresponding to the continuous x-frame (e.g., 5 frames, 10 frames, etc.) image is greater than the second set value, and the calculated accumulated difference of the luminance distribution difference and the chrominance distribution difference corresponding to the continuous x-frame image is greater than the set accumulated difference, determining the continuous x-frame image as a candidate progressive lens segmentation node section, and determining a candidate progressive lens segmentation node from the candidate progressive lens segmentation node section.

The scheme is suitable for accurately identifying the gradual shot, and the gradual segmentation reliability can be enhanced through the motion vector variance of continuous frames and accumulated color differences, such as fade-in fade-out and dissolution special effects, so as to identify the shot boundary of slow transition, and the high variance indicates that the picture has complex motion, such as fast translation or rotation, and cooperates with the accumulated color differences.

And then the cloud node can integrate the candidate abrupt shot segmentation nodes and the candidate gradual shot segmentation nodes, reject the candidate shot segmentation nodes which have a gap smaller than the set frame number between the candidate shot segmentation nodes and the adjacent candidate shot segmentation nodes, obtain all shot segmentation nodes, and schematically show the determined few shot segmentation nodes as shown in fig. 2.

The set frame number is converted to the frame number based on time, for example, the shortest shot of a general IPTV program is not shorter than 2 seconds long, and if the frame rate is 30, x is set to 60 frames, and if the frame rate is 60, x is set to 120. Therefore, the segment division of the IPTV video program at the lens level can be efficiently and accurately realized, redundant lens division nodes with less than set frame number interval with adjacent nodes are eliminated, the condition of short jitter in the same lens instead of real switching can be effectively eliminated, and excessive division is avoided.

Accordingly, the cloud node can conduct shot disassembly on the original frame sequence based on the shot segmentation node, and a plurality of shot fragments are determined. Note that the original frame sequence is shot broken here, not the pre-processed frame sequence, nor the Y-channel frame sequence or the UV-channel frame sequence.

After determining the shot segment, the cloud node may operate step S30.

And S30, carrying out self-adaptive key frame extraction on each lens segment, determining sub-segments by taking the key frames as cores, and differentially extracting the image characteristics of each frame of image in the sub-segments to obtain a characteristic subset corresponding to the sub-segments.

Because the shot segments of the IPTV video program are various, the fixed key frame extraction strategy is difficult to adapt to various dynamic shot scenes, and the proper key frames are difficult to select for analysis. Accordingly, the sliding window analysis scheme is designed. In the present embodiment, for each lens segment:

the cloud node may generate a sliding window of a set size, where the set size is less than the set number of frames, e.g., the sliding window is sized to 30 frames (or 60 frames, or other size).

For the first sliding window of the current shot segment (for example, taking a shot segment of 300 frames as an example), a set inter-frame interval (for example, 5 frames intervals) is adopted to determine a plurality of key frames (the key frames are respectively 1 st frame, 7 th frame, 13 th frame, 19 th frame and 25 th frame) by taking the start frame of the shot segment as a starting point. For the last sliding window of the current shot segment (from the previous key frame to the end frame of the shot segment, just 30 frames or less than 30 frames), taking the end frame of the shot segment as an end point, determining a plurality of key frames by using the inter-frame interval (the same interval 5 frames are used for taking key frames) set by the backward shift of the previous key frame.

The head-to-tail window adopts a scheme of determining key frames by setting inter-frame intervals, can simultaneously cope with the lens segments leading from the abrupt lens switching and the gradual lens switching, ensures the coverage of the start stage and the end stage of the lens segments, and avoids the omission of head-to-tail content.

For the non-head-tail sliding window of the current shot segment, the cloud node needs to determine a plurality of key frames in the sliding window by taking the previous key frame as a starting point based on the brightness distribution difference, the chromaticity distribution difference and the pixel motion index corresponding to each frame of image in the current sliding window.

In the analysis of non-head-to-tail sliding windows, the sliding window analysis of fixed step length may cause a large amount of redundant calculation, which is not beneficial to accurately determining a proper key frame (in this embodiment, the selection of the key frame is critical, because each subsequent key frame is used as the core of a sub-segment, the total feature extraction is performed, and other non-key frames in the sub-segment using the key frame as the core are all used for performing differential feature vector extraction based on the key frame, so that the determination of the key frame can have an important influence on the subsequent feature extraction process, and the extracted features can be used as the basis for providing auxiliary services for the feature analysis by taking the edge node).

The cloud node may calculate a motion state score corresponding to each frame of image based on a pixel motion index corresponding to each frame of image in the current sliding window.

Specifically, the cloud node calculates a motion state score corresponding to the jth frame of image in the current sliding window by adopting the following formula:

, (4)

Wherein, theScoring the motion state corresponding to the j-th frame image (corresponding to the unique original frame number) in the current sliding window,、AndAs a parameter of the weight-bearing element,For the average motion amplitude corresponding to the j-th frame image in the current sliding window,As a reference value for the average motion amplitude,For the consistency of the movement direction corresponding to the j-th frame image in the current sliding window,For the motion vector variance corresponding to the j-th frame image in the current sliding window,Is a motion vector variance reference value.

In the calculation scheme of the motion state score, the high motion consistency is converted into a low contribution value, the index with disordered directions is highlighted (such as the situation of rapid shaking of a lens or collision of an object) by utilizing the dimension difference of logarithmic compression motion amplitude (such as the amplitude value of a high-speed motion scene is extremely large), the motion state score is prevented from being easily led by the index, the motion vector variance is approximately standardized (not standardized but converted by utilizing the motion vector variance reference value), the corresponding weight coefficient (which can be an empirically set parameter value or a learnable parameter value so as to cope with different scenes), the scheme takes the empirically set example, simultaneously, different IPTV video program types are considered, different weights are assigned, the generalization capability of different IPTV video programs is flexibly adapted, and the generalization capability of the different IPTV video programs is improved. The reference value is set according to empirical analysis, and the global statistic can be calculated in real time.

And the cloud node can calculate a color difference score corresponding to each frame of image based on the brightness distribution difference and the chromaticity distribution difference corresponding to each frame of image.

Specifically, the color difference score corresponding to the jth frame of image in the current sliding window is calculated by adopting the following formula:

, (5)

In the calculation scheme of the color difference score, as the brightness distribution difference and the chromaticity distribution difference are in different lens scenes and are not in a purely linear relationship, a certain synergy exists, when the brightness distribution difference and the chromaticity distribution difference are both higher, the overall score should be obviously improved (adapt to the situation of large variation of the light shadow and the color in the lens segment), and the situation of single factor (that is, the brightness distribution difference and the chromaticity distribution difference are higher or lower) should be expressed as higher dominant, and the score should be lower for a steady scene and not suitable as a key frame. Therefore, the calculation scheme of the color difference score is designed to more accurately realize the color difference score of each frame, and the collaborative enhancement coefficient can be used as a means for balancing the sensitivity to different types of lens fragments, so that the adaptability to the different types of lens fragments is improved.

After calculating the motion state score and the color difference score of each frame of image in the current sliding window, the cloud node can judge and correspondingly process:

If there is an image whose motion state score exceeds the motion state threshold in the current sliding window (in order to improve reliability, the determination condition may be set to be strict, for example, there is at least 3 images whose motion state score exceeds the motion state threshold in the current sliding window), and a key frames are determined based on the motion state score and the color difference score corresponding to each image.

Specifically, the cloud node may calculate the composite score corresponding to the jth frame of image in the current sliding window according to the following formula:

, (6)

Wherein, theThe composite score corresponding to the j-th frame image in the current sliding window,A sports status threshold (sports event program is exemplified by 0.7 and movie program is exemplified by 0.5). For high motion type shots, motion state scoringIs shown to exceed the movement state threshold,Gradually approaches to 1, the motion state score ratio in the comprehensive score is higher, and the motion-dominant key frame selection can be enhanced.

Accordingly, the cloud node can determine a key frames with the highest comprehensive scores (a sliding window with a size of 30 frames corresponds to the number of the key frames a is exemplified by 5) based on the comprehensive scores corresponding to each frame of image in the current sliding window.

If no image with the motion state score exceeding the motion state threshold exists in the current sliding window (corresponding to the image with the motion state score exceeding the motion state threshold in less than 3 frames exists in the current sliding window when the reliability judging condition is met), but an image with the color difference score exceeding the color difference threshold exists (also designed to be an image with the color difference score exceeding the color difference threshold in at least 3 frames exists in the current sliding window), the cloud node can determine b key frames based on the motion state score and the color difference score corresponding to each frame of image.

, (7)

Wherein, theThe composite score corresponding to the j-th frame image in the current sliding window,Is the threshold value of color differenceAnd the value is not zero, and generally takes between 0.3 and 0.4, and special scenes such as firework shows, lamplight shows and the like need to be heightened, and the value is not expected to be lower than 0.5). For low motion type shots, motion state scoringDoes not exceed a motion state thresholdAt this time, the color difference score and the motion state score are required to be comprehensively calculatedAs a coefficient, the color difference weight can be adaptively enhanced according to the color difference score, and in a lens segment with gentle motion, a key frame can be triggered through the color difference, so that a more proper key frame can be determined.

Accordingly, the cloud node may determine b key frames (b takes 3 in this embodiment) with the highest comprehensive scores based on the comprehensive scores corresponding to each frame of image in the current sliding window.

If there is no image whose motion state score exceeds the motion state threshold value in the current sliding window (corresponding to the image whose motion state score exceeds the motion state threshold value in less than 3 frames in the current sliding window, and there is no image whose color difference score exceeds the color difference threshold value in less than 3 frames in the current sliding window), the cloud node can determine c key frames based on the motion state score and the color difference score corresponding to each frame of image, wherein a > b > c >1.

, (8)

Wherein, theAnd (5) comprehensively grading the j-th frame image in the current sliding window. For the case of low both the motion state score and the color difference score, the two may be directly summed (of course, other weights may be adopted), and the key frame is not selected in general, but if all the conditions are in a sliding window, the key frame with the highest comprehensive score is selected as the key frame (in this case, because the picture changes are not large, the number of determined key frames is small, and compared with other scenes with large changes, one key frame can dominate more non-key frames as the main bone of more non-key frames.

Accordingly, the cloud node may determine c key frames with the highest comprehensive scores (in this embodiment, c takes 2) based on the comprehensive scores corresponding to each frame of image in the current sliding window.

Aiming at different scenes (lens fragments), the embodiment designs an evaluation mechanism of a motion state score and a color difference score, wherein the motion state score reflects the dynamic complexity of a picture (such as a fast motion scene, the motion state score is obviously higher) by quantifying average motion amplitude, direction consistency and motion vector variance, if an image with the motion state score exceeding a motion state threshold exists in a current sliding window, a key frame (determined according to the size design of the sliding window and the window size of a common 30 frames and the number of corresponding key frames a is 5) is determined based on the motion state score and the color difference score corresponding to each frame image, and the color difference score is calculated by combining brightness and the color difference, so that capturing of picture color mutation (such as light-shadow change) is facilitated, if an image with the motion state score exceeding the motion state threshold does not exist in the current sliding window, but the color difference score exceeds the color difference threshold is present, b key frames (the window size of 30 frames is 3) are determined based on the motion state score and color difference corresponding to each frame image of a2, and the color difference is compared with the conventional score (c is 2) based on the size of the window of the contrast score of the window of which is 2. Therefore, the scheme can intensively sample in sliding windows of severe motion or color change (such as explosion and transition), sparsely sample in static areas (such as fixed lenses), realize resource allocation as required, and ensure that the determined key frames are more accurate and are more beneficial to high-quality extraction of the features. And the maximum key frame quantity under different scenes is constrained by preset values of parameters a, b and c, so that the condition of increasing the key frame sampling density without limit is avoided.

After determining the key frame, the cloud node can determine sub-segments by taking the key frame as a core after de-duplicating the key frame, namely, dividing the current lens segment into a plurality of sub-segments by taking the key frame as a node, wherein the first frame image of each sub-segment is the key frame, and each sub-segment does not contain the key frames except the first frame image, as shown in fig. 3.

After the sub-segments are determined by taking the key frames as cores, the cloud node can differentially extract the image characteristics of each frame of image in the sub-segments to obtain the characteristic subsets corresponding to the sub-segments.

For each sub-segment, the cloud node may perform full feature vector extraction (for example, 1024 dimensions of the full feature vector extracted and corresponding to one frame of image in the original frame sequence) on the key frame of the sub-segment by using ResNet, perform differential feature vector extraction (for example, 256 dimensions of the full feature vector extracted and corresponding to the image in the original frame sequence) on each non-key frame of the sub-segment by using MobileNetV3 based on the key frame of the sub-segment, and perform feature vector integration (for example, integrate according to the true sequence) based on the frame sequence of the sub-segment, thereby obtaining the feature subset corresponding to the sub-segment.

Referring specifically to fig. 4, in the differential feature extraction process, a key frame (first frame) of the sub-segment is taken as a reference frame, optical flow estimation is performed on each non-key frame of the sub-segment, so as to obtain an optical flow field (pixel-level motion information, i.e. horizontal displacement and vertical displacement of each pixel), which is used for performing motion compensation alignment (i.e. alignment of the non-key frame with the key frame), further obtain a differential image between the non-key frame and the key frame, input the differential image to MobileNetV (a lightweight version, such as MobileNetV-Small, can be selected), and a 256-dimensional differential feature vector of the non-key frame is extracted and output. It should be noted that, the vector dimension of the key frame is different from the vector dimension of the non-key frame, so that the subsequent application is not affected, and the differential coding mode can be used for precoding, so that the implementation coding in the transmission process is not needed, and the running efficiency of the end-edge-cloud architecture of the IPTV is improved.

According to the scheme, global features are extracted through ResNet full-frame processing, mobileNetV3 is adopted for carrying out differential feature extraction on non-key frames, the processing capacity can be effectively reduced, mobileNetV3 can be designed in a lightweight mode, the calculation cost is further reduced, and the method is suitable for cloud large-scale video processing.

After obtaining the feature subset corresponding to each sub-segment, the cloud node may operate step S40.

And S40, integrating the feature subsets belonging to the same lens segment to obtain a scene feature set.

In this embodiment, the cloud node may integrate feature subsets belonging to the same shot segment according to a frame sequence to obtain a scene feature set (or called a shot feature set).

After that, the cloud node may run step S50.

And S50, integrating the scene feature set corresponding to each lens segment to obtain the video feature set corresponding to the IPTV video file.

In this embodiment, the cloud node may integrate the scene feature set corresponding to each lens segment according to the frame sequence to obtain the video feature set corresponding to the IPTV video file, where the feature number in the video feature set is the same as the frame number of the original frame sequence in the IPTV video file and may be in one-to-one correspondence.

In summary, the embodiment of the application provides an artificial intelligence-based video program feature extraction method, which comprises the steps of reading an IPTV video file stored in a cloud node, decoding the IPTV video file into an original frame sequence, performing lens disassembly to determine a plurality of lens fragments, performing adaptive key frame extraction on each lens fragment, determining sub-fragments by taking the key frames as cores, performing differential extraction on image features of each frame image in the sub-fragments (performing full feature vector extraction on the key frames and performing differential feature vector extraction on non-key frames in the sub-fragments), obtaining feature subsets corresponding to the sub-fragments, integrating feature subsets belonging to the same lens fragment to obtain scene feature sets, and integrating the scene feature sets corresponding to each lens fragment to obtain video feature sets corresponding to the IPTV video file. According to the scheme, through layering processing logic of lens disassembly, key frame extraction and sub-segment feature extraction, the number of frames of the total features to be extracted is greatly reduced, feature extraction is focused on key frames and related sub-segments, and the calculated amount is remarkably reduced. According to the feature extraction scheme, features are strictly synchronous with video frames, so that the features of each video frame can be deduced through the key frame association of the sub-segment to which the features belong, and the requirements of an end-edge-cloud collaborative architecture (namely an IPTV architecture of an end-side node-edge node-cloud node) on a frame-level synchronous feature stream are met.

The above description is only an example of the present application and is not intended to limit the scope of the present application, and various modifications and variations will be apparent to those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present application should be included in the protection scope of the present application.

Claims

1. The video program feature extraction method based on artificial intelligence is characterized by being applied to cloud nodes and comprising the following steps of:

reading an IPTV video file stored by a cloud node, and decoding the IPTV video file into an original frame sequence;

Performing shot disassembly on the original frame sequence to determine a plurality of shot fragments;

Performing self-adaptive key frame extraction on each lens segment, determining sub-segments by taking the key frames as cores, and performing differential extraction on image features of each frame of image in the sub-segments to obtain feature subsets corresponding to the sub-segments;

Integrating feature subsets belonging to the same lens segment to obtain a scene feature set;

and integrating the scene feature set corresponding to each lens segment to obtain the video feature set corresponding to the IPTV video file.

2. The method for extracting video program features based on artificial intelligence as claimed in claim 1, wherein the performing shot decomposition on the original frame sequence to determine a plurality of shot segments comprises:

downsampling an original frame sequence to obtain a preprocessed frame sequence, and performing color space conversion on the preprocessed frame sequence to obtain a corresponding Y-channel frame sequence and UV-channel frame sequence;

carrying out luminance histogram statistics on frame images in the Y channel frame sequence, and carrying out chrominance histogram statistics on frame images in the UV channel frame sequence;

calculating the brightness distribution difference corresponding to the ith frame image based on the brightness histogram statistical result of the ith frame image and the (i-1) th frame image, and calculating the chromaticity distribution difference corresponding to the ith frame image based on the chromaticity histogram statistical result of the ith frame image and the (i-1) th frame image, wherein 1<i is less than or equal to n, and n is the total number of images of the original frame sequence;

performing optical flow analysis based on an ith frame image and an (i-1) th frame image in the Y channel frame sequence to obtain a pixel motion index corresponding to the ith frame image, wherein the pixel motion index comprises average motion amplitude, motion direction consistency and motion vector variance;

Determining whether the ith frame image is a shot segmentation node or not based on the brightness distribution difference, the chromaticity distribution difference and the pixel motion index corresponding to the ith frame image, and determining all shot segmentation nodes according to the determination;

And performing shot disassembly on the original frame sequence based on the shot segmentation nodes to determine a plurality of shot fragments.

3. The method for extracting video program features based on artificial intelligence according to claim 2, wherein determining whether the i-th frame image is a shot segmentation node based on the luminance distribution difference, the chrominance distribution difference and the pixel motion index corresponding to the i-th frame image, and determining all shot segmentation nodes based on the determined i-th frame image comprises:

if the brightness distribution difference corresponding to the ith frame image is higher than the set brightness distribution difference, the chromaticity distribution difference corresponding to the ith frame image is higher than the set chromaticity distribution difference, the motion direction consistency in the pixel motion index corresponding to the ith frame image is lower than a first set value, and the ith frame image is marked as a candidate abrupt lens segmentation node;

If the variance of the motion vector corresponding to the continuous x-frame images is larger than a second set value, and the calculated accumulated difference of the brightness distribution difference and the chromaticity distribution difference corresponding to the continuous x-frame images is higher than the set accumulated difference, determining the continuous x-frame images as candidate graded lens segmentation node sections, and determining candidate graded lens segmentation nodes from the candidate graded lens segmentation node sections;

Integrating the candidate abrupt shot segmentation nodes and the candidate gradual shot segmentation nodes, and eliminating the candidate shot segmentation nodes which are less than the set frame number in interval between the candidate shot segmentation nodes and the adjacent candidate shot segmentation nodes to obtain all shot segmentation nodes.

4. The artificial intelligence based video program feature extraction method of claim 2, wherein performing adaptive keyframe extraction on each shot segment comprises:

for each lens segment:

generating a sliding window with a set size, wherein the set size is smaller than the set frame number;

For the first sliding window of the current shot segment, taking the initial frame of the shot segment as a starting point, adopting a set inter-frame interval to determine a plurality of key frames;

For a non-head-to-tail sliding window of the current lens segment, determining a plurality of key frames in the sliding window based on brightness distribution difference, chromaticity distribution difference and pixel motion index corresponding to each frame of image in the current sliding window by taking a previous key frame as a starting point;

And for the last sliding window of the current shot segment, determining a plurality of key frames by taking the end frame of the shot segment as an end point and using the inter-frame interval set by the backward shift of the previous key frame.

5. The method for extracting video program features based on artificial intelligence of claim 4, wherein determining a plurality of key frames in the sliding window based on a luminance distribution difference, a chrominance distribution difference and a pixel motion index corresponding to each frame of image in the current sliding window comprises:

Calculating a motion state score corresponding to each frame of image based on a pixel motion index corresponding to each frame of image in the current sliding window, and calculating a color difference score corresponding to each frame of image based on a brightness distribution difference and a chromaticity distribution difference corresponding to each frame of image;

If an image with the motion state score exceeding the motion state threshold exists in the current sliding window, determining a key frames based on the motion state score and the color difference score corresponding to each frame of image;

if no image with the motion state score exceeding the motion state threshold exists in the current sliding window, but an image with the color difference score exceeding the color difference threshold exists, b key frames are determined based on the motion state score and the color difference score corresponding to each frame of image;

If no image with the motion state score exceeding the motion state threshold exists in the current sliding window and no image with the color difference score exceeding the color difference threshold exists, c key frames are determined based on the motion state score and the color difference score corresponding to each frame of image, wherein a > b > c >1.

6. The method for extracting video program features based on artificial intelligence of claim 5, wherein calculating a motion state score corresponding to each frame of image based on a pixel motion index corresponding to each frame of image in a current sliding window comprises:

calculating a motion state score corresponding to a j-th frame image in the current sliding window by adopting the following formula:

,

Wherein, theScoring the motion state corresponding to the j-th frame image in the current sliding window,、AndAs a parameter of the weight-bearing element,For the average motion amplitude corresponding to the j-th frame image in the current sliding window,As a reference value for the average motion amplitude,For the consistency of the movement direction corresponding to the j-th frame image in the current sliding window,For the motion vector variance corresponding to the j-th frame image in the current sliding window,A motion vector variance reference value;

calculating a color difference score corresponding to each frame of image based on the brightness distribution difference and the chromaticity distribution difference corresponding to each frame of image, including:

calculating a color difference score corresponding to a j-th frame image in the current sliding window by adopting the following formula:

,

7. The method for extracting video program features based on artificial intelligence of claim 6, wherein determining a key frames based on the motion state score and the color difference score corresponding to each frame of image comprises:

the comprehensive score corresponding to the j-th frame image in the current sliding window is calculated by adopting the following formula:

,

Wherein, theThe composite score corresponding to the j-th frame image in the current sliding window,Is a motion state threshold;

And determining a key frame with the highest comprehensive score based on the comprehensive score corresponding to each frame of image in the current sliding window.

8. The method for extracting video program features based on artificial intelligence of claim 6, wherein determining b key frames based on the motion state score and the color difference score corresponding to each frame of image comprises:

,

Wherein, theThe composite score corresponding to the j-th frame image in the current sliding window,Is a color difference threshold;

And determining b key frames with highest comprehensive scores based on the comprehensive scores corresponding to each frame of image in the current sliding window.

9. The method for extracting video program features based on artificial intelligence of claim 6, wherein determining c key frames based on the motion state score and the color difference score corresponding to each frame of image comprises:

,

Wherein, theThe comprehensive score corresponding to the j-th frame image in the current sliding window is obtained;

And c key frames with highest comprehensive scores are determined based on the comprehensive scores corresponding to each frame of image in the current sliding window.

10. The method for extracting video program features based on artificial intelligence as claimed in claim 6, wherein determining the sub-segments by using the key frames as cores, and extracting image features of each frame of image in the sub-segments in a differentiated manner to obtain feature subsets corresponding to the sub-segments, comprises:

Dividing the current lens segment into a plurality of sub-segments based on the key frame, wherein the first frame image of each sub-segment is the key frame, and each sub-segment does not contain the key frames except the first frame image;

And for each sub-segment, carrying out full feature vector extraction on the key frames of the sub-segment by utilizing ResNet, carrying out differential feature vector extraction on each non-key frame of the sub-segment by utilizing MobileNetV based on the key frames of the sub-segment, and carrying out feature vector integration based on the frame sequence of the sub-segment to obtain a feature subset corresponding to the sub-segment.