Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present application more apparent, the technical solutions of the present application will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present application, and it is apparent that the described embodiments are some embodiments of the technical solutions of the present application, but not all embodiments. All other embodiments, based on the embodiments described in the present document, which can be obtained by a person skilled in the art without any creative effort, are within the scope of protection of the technical solutions of the present application.
First, some concepts involved in the embodiments of the present application are explained.
Core region detection: used for detecting the core effective area in a video; the core area refers to the core picture area that remains after invalid edge areas such as black edges, edge mappings, edge-irrelevant videos and frosted glass are removed.
Black edge: refers to black borders that exist around the original picture of a video in addition to the normal picture.
Edge mapping: refers to the practice, during video editing, of filling or partially covering the edges of a video with overlay images (mappings), so as to distinguish the edited video from the original video.
Edge-irrelevant video: refers to the practice, during video editing, of filling or partially covering the edges of a video with irrelevant videos, animations and the like, so as to distinguish the edited video from the original video.
Frosted glass: refers to a blurring or semitransparent effect that is globally or locally rendered on an image or video.
Video copyright infringement: refers to the act of copying, re-uploading, or distributing a modified version of a copyrighted video by a party other than the copyright owner, without permission.
Video duplicate checking: refers to the video platform performing duplicate detection on any video released on the platform, with the purpose of preventing the released video from infringing the copyrights of others.
Video frame: video is essentially composed of still pictures, which are called frames.
Video frame rate: the video frame rate (Frame rate) is a measure of the number of frames displayed, expressed in frames per second (Frames per Second, FPS) or "hertz" (Hz).
Supervised learning: refers to a machine learning task that infers a function from a labeled training sample.
Unsupervised learning: refers to solving various problems in pattern recognition based on training samples of unknown class (without labels).
Semi-supervised learning: uses a large amount of unlabeled data, together with labeled data, to perform pattern recognition tasks.
Artificial intelligence (Artificial Intelligence, AI) is the theory, method, technique and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use the knowledge to obtain optimal results. In other words, artificial intelligence is an integrated technology of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can react in a way similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning and decision-making.
Artificial intelligence technology is a comprehensive discipline that covers a wide range of fields, including both hardware-level and software-level technologies. Basic artificial intelligence technologies generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operation/interaction systems and mechatronics. Artificial intelligence software technologies mainly include directions such as computer vision, speech processing, natural language processing and machine learning/deep learning.
Machine Learning (ML) is a multi-domain interdisciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithmic complexity theory and other disciplines. It studies how a computer can simulate or implement human learning behavior to acquire new knowledge or skills, and reorganize existing knowledge structures to continuously improve its own performance. Machine learning is the core of artificial intelligence and the fundamental way to endow computers with intelligence, and it is applied throughout all areas of artificial intelligence. Machine learning and deep learning typically include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning and teaching learning.
The embodiment of the application mainly relates to a machine learning technology, in particular to training and application of an image detection model by using the machine learning technology. The training and application of specific models is described below.
The following is a brief description of the technical idea of the embodiment of the present application:
With the rapid development of computer technology, video infringement activities are becoming more frequent. In the related technology, the most common infringement mode is to evade video copyright supervision by adding interference content, such as black edges, a frosted glass blurring effect, edge mappings, or edge-irrelevant videos.
In the related art, each video frame image in a video to be detected is generally input into a detection model, and at least one image detection is performed, so as to obtain a core picture except for interference content in the video to be detected.
However, for the purpose of detecting the core picture of a video, an original video frame contains a large amount of data, so detecting the original video frames directly imposes a large data processing burden on the model and increases the learning difficulty, which in turn results in a poor model effect and lower image detection accuracy.
In the embodiment of the application, reference frames are extracted from the original video frames contained in a video to be detected, and time sequence change frames are also extracted from the original video frames; distribution information characterizing the value change of each pixel is obtained according to how the value of each pixel changes across the time sequence change frames; edge detection is performed on each edge detection frame extracted from the video frames, according to the values of the pixels it contains, to obtain corresponding edge information; and a target detection area is then obtained in the video to be detected based on the reference frames, the distribution information and the obtained edge information.
In this way, aiming at the characteristic difference between the target detection area (such as a core picture) and the non-target detection area (such as a non-core picture) in the video, the reference frames, the distribution information and the edge information are extracted on the basis of parsing the video into frames, and image detection is performed using the extracted information. This design matches the characteristics of the video detection task: the distinction between the target detection area and the non-target detection area can be effectively expressed, as far as possible, by information extracted beyond the original video input, so that the learning difficulty of the subsequent image detection model is effectively reduced, the capability of the image detection model is improved, and the efficiency and effect of image detection are further improved.
Fig. 1 is a schematic diagram of an application scenario provided in an embodiment of the present application. The application scenario at least includes the terminal device 110 and the server 120. The number of the terminal devices 110 may be one or more, and the number of the servers 120 may be one or more, and the number of the terminal devices 110 and the servers 120 is not particularly limited in the present application.
In the embodiment of the present application, the terminal device 110 may be a smart phone, a tablet computer, a notebook computer, a desktop computer, a smart speaker, a smart watch, an internet of things device, a smart home appliance, a vehicle-mounted terminal, etc., but is not limited thereto.
The server 120 may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or may be a cloud server providing cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, a content delivery network (Content Delivery Network, CDN), basic cloud computing services such as big data and an artificial intelligence platform.
The terminal device 110 and the server 120 may be directly or indirectly connected through wired or wireless communication, and the present application is not limited herein.
The image detection method mentioned in the embodiment of the present application may be applied to the terminal device 110 or the server 120, which is not limited. Hereinafter, the application to the server 120 will be described by way of example only.
Referring to fig. 2, a flowchart of an image detection method provided in an embodiment of the present application is shown, the method is applied to an electronic device, and the electronic device may be a server or a terminal device, and the specific flowchart is as follows:
S201, extracting each reference frame from each original video frame contained in the video to be detected according to a first frame extraction mode.
In the embodiment of the application, the video to be detected is an MP4-format video file; of course, in practical applications, video files in other formats can also be input, as long as they can be normally parsed or converted into an MP4-format video file.
The original video frames represent the original pictures of the video to be detected; however, because the video to be detected contains many frames, performing image detection on all the original video frames would affect the image detection efficiency.
The first frame extraction mode may be random extraction, but in order to better embody the overall information of the video frame, in the embodiment of the present application, the first frame extraction mode adopts an average extraction mode.
In the first frame extraction mode, based on the total frame number of the original video frames in the video to be detected, the original video frames are segmented according to a set first extraction number, and one or more original video frames are extracted from each segmented video frame sequence as reference frames.
Taking an example of extracting a reference frame from each video frame sequence, in S201, the method specifically includes the following steps:
a) Analyzing the video to be detected, and counting the total frame number of each original video frame in the video to be detected.
The frame number calculation may use an OpenCV open source component, but is not limited thereto.
b) Based on the total frame number, the average segment segmentation is carried out on each original video frame according to the set first extraction number.
Assume that for a target sample video V, the total frame number is A and the first extraction number is 10; starting from the first frame of the video to be detected, segmentation is performed once every A/10 frames, and a 10-segment video frame sequence is obtained.
For example, referring to fig. 3, assuming that the total frame number is 1000 and the first extraction number is 10, starting from the first frame of the video to be detected, segmentation is performed every 100 frames, and a 10-segment video frame sequence can be obtained by segmentation.
c) Extracting an original video frame at a preset position from each segmented video frame sequence as the reference frame corresponding to that video frame sequence, and storing the extracted reference frames in a target storage format.
The preset position may be a center, left, right, or random position, which is not limited. It should be noted that, when the preset position is centered and the number of each original video frame included in each video frame sequence is even, any one of the two centered original video frames may be used as a reference frame corresponding to the video frame sequence. The target storage format may employ, but is not limited to, JPG format image files.
For example, each video frame sequence contains 100 frames, and the most central original video frame, namely the 50th frame, is extracted from each video frame sequence to serve as a reference frame. Thus, from the 10 video frame sequences, 10 reference frames can be extracted.
d) Storing the reference frames in the target storage format in sequence to obtain a reference frame sequence. The reference frame sequence may be one of the input data for the subsequent model. Here, the sequence refers to the playing order of the reference frames in the video.
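As an illustration of steps a) to d), the following Python sketch extracts the center frame of each of several equal segments with OpenCV; the function name extract_reference_frames, the default first extraction number of 10 and the output file naming are assumptions introduced for illustration only.

import cv2

def extract_reference_frames(video_path, num_segments=10, out_prefix="ref"):
    """Split the video into num_segments equal segments and save the center
    frame of each segment in JPG format as a reference frame."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))       # a) total frame number
    seg_len = max(total // num_segments, 1)              # b) average segmentation
    reference_frames = []
    for i in range(num_segments):
        center = i * seg_len + seg_len // 2              # c) preset (center) position
        cap.set(cv2.CAP_PROP_POS_FRAMES, center)
        ok, frame = cap.read()
        if not ok:
            continue
        cv2.imwrite(f"{out_prefix}_{i:02d}.jpg", frame)  # target storage format: JPG
        reference_frames.append(frame)
    cap.release()
    return reference_frames                              # d) reference frame sequence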
S202, extracting each time sequence change frame from each original video frame according to a second frame extraction mode, and obtaining distribution information according to the value change condition of each pixel in each time sequence change frame, wherein the distribution information is used for representing: at least one pixel whose value changes among the pixels.
In the embodiment of the application, the distribution information captures how each pixel in the picture changes during video playback, and can therefore express, to a certain extent, the distribution of the core picture and the non-core picture in the video to be detected.
The second frame extraction mode may be random extraction, but in order to better reflect the overall information of the video frames, in the embodiment of the present application the second frame extraction mode may also be an average frame extraction mode. In the second frame extraction mode, based on the total frame number of the original video frames in the video to be detected, the original video frames are segmented according to a set second extraction number, and one or more original video frames are extracted from each segmented video frame sequence as time sequence change frames. The second extraction number may be the same as the first extraction number; however, considering that computing the value-change statistics over more data can effectively improve the accuracy of the distribution information, a larger second extraction number may also be used.
Taking one time sequence change frame extracted from each video frame sequence as an example, according to a second frame extraction mode, extracting each time sequence change frame from each original video frame specifically comprises the following steps:
a) Analyzing the video to be detected, and counting the total frame number of each original video frame in the video to be detected.
b) Based on the total frame number, average segment segmentation is performed on the original video frames according to the set second extraction number. Assume that for a target sample video V, the total frame number is A and the second extraction number is 100; starting from the first frame of the video to be detected, segmentation is performed once every A/100 frames, and a 100-segment video frame sequence is obtained. Referring to fig. 3, the total frame number is 1000 and the second extraction number is 100; starting from the first frame of the video to be detected, segmentation is performed every 1000/100 = 10 frames, and a 100-segment video frame sequence is obtained.
c) Extracting an original video frame at a preset position from each segmented video frame sequence to serve as a time sequence change frame corresponding to the segmented video frame sequence. The preset position may be a center, left, right, or the like, or may be a random position, which is not particularly limited. For example, from each video frame sequence, the most centered original video frame is extracted as the time-varying frame. Thus, from a sequence of 100 video frames, 100 time-varying frames can be extracted.
In some embodiments, considering that the resolution of different videos varies widely and that an excessively high resolution may cause excessive computation pressure or out-of-memory (OOM) problems, each time sequence change frame needs to be scaled while preserving as much information as possible. Specifically, each time sequence change frame is scaled according to a set image resolution to obtain the processed time sequence change frames, and the distribution information is then obtained according to the value change condition of each pixel in the processed time sequence change frames.
The image resolution may include a length threshold and a width threshold. If the length of a time sequence change frame is greater than the length threshold, the frame is scaled proportionally according to the length threshold to obtain the processed time sequence change frame; if the width of a time sequence change frame is greater than the width threshold, the frame is scaled proportionally according to the width threshold to obtain the processed time sequence change frame.
For example, assume that the set image resolution is 416×416, i.e. the length threshold and the width threshold are both 416, and that the resolution of a time sequence change frame is 1280×720. The length of the frame is greater than the length threshold and its width is greater than the width threshold, so the frame is scaled proportionally according to 416×416 to obtain the processed time sequence change frame, whose resolution is 740×416.
In some embodiments, the distribution information may be one or more of a mean and a variance of each pixel in each time-series variation frame, but is not limited thereto.
Obtaining the distribution information according to the value change condition of each pixel in the time sequence change frames includes at least one of the following operations:
operation 1: and obtaining the average value of the pixel values corresponding to each pixel based on the value of each pixel in each time sequence change frame, and taking the average value of the pixel values corresponding to each pixel as distribution information.
Operation 2: and obtaining pixel value variances corresponding to the pixels based on the values of the pixels in the time sequence change frames respectively, and taking the pixel value variances corresponding to the pixels as distribution information.
For example, referring to fig. 5, the resolution of the time-varying frame is 416×416, that is, the time-varying frame includes 416×416 pixels, for each of the 416×416 pixels, a mean value of pixel values of the pixel is obtained based on the values of the pixel in 100 time-varying frames, and after the calculation is performed for the 416×416 pixels, the mean value of pixel values corresponding to the 416×416 pixels can be obtained. Similarly, for each pixel in 416×416 pixels, based on the value of a pixel in 100 time-sequence variation frames, the pixel value variance of the pixel is obtained, and after calculation is performed for 416×416 pixels, the pixel value variance corresponding to each of 416×416 pixels can be obtained.
Further, the pixel-value means and pixel-value variances of the pixels along the time axis form a time-axis mean matrix and a time-axis variance matrix with the same length and width as the picture; these matrices can be used as time-axis change characteristic frames and serve as one of the input data of the subsequent model.
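The statistics of operations 1 and 2 can be sketched as follows; representing the processed time sequence change frames as equally sized grayscale arrays and stacking them with NumPy are assumptions made here for illustration.

import numpy as np

def build_timing_information(timing_frames):
    """timing_frames: list of processed time sequence change frames, each an
    H x W (e.g. 416 x 416) grayscale array. Returns the time-axis mean matrix
    and time-axis variance matrix used as time sequence information frames."""
    stack = np.stack(timing_frames, axis=0).astype(np.float32)  # shape (T, H, W)
    mean_matrix = stack.mean(axis=0)   # operation 1: pixel-value mean per pixel
    var_matrix = stack.var(axis=0)     # operation 2: pixel-value variance per pixel
    return mean_matrix, var_matrix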
S203, extracting each edge detection frame from each video frame according to a third frame extraction mode, and performing edge detection on each edge detection frame according to the values of the pixels it contains, to obtain corresponding edge information.
The third frame extraction mode may be random extraction or an average extraction mode. In the average extraction mode, based on the total frame number of the original video frames in the video to be detected, the original video frames are segmented according to a set third extraction number, and one or more original video frames are extracted from each segmented video frame sequence as edge detection frames. The third extraction number may be the same as the first extraction number. In S203, each reference frame obtained in S201 may also be directly used as an edge detection frame.
Taking an example of extracting an edge detection frame from each video frame sequence, according to a third frame extracting mode, extracting each edge detection frame from each original video frame, and specifically comprising the following steps:
a) Analyzing the video to be detected, and counting the total frame number of each original video frame in the video to be detected. The frame number calculation may use an OpenCV open source component, but is not limited thereto.
b) Based on the total frame number, average segment segmentation is performed on the original video frames according to the set third extraction number. Assume that for a target sample video V, the total frame number is A and the third extraction number is 10; starting from the first frame of the video to be detected, segmentation is performed once every A/10 frames, and a 10-segment video frame sequence is obtained. For example, referring to fig. 3, assuming that the total frame number is 1000 and the third extraction number is 10, starting from the first frame of the video to be detected, segmentation is performed every 100 frames, and a sequence of 10 video frame segments, each containing 100 frames, is obtained.
c) Extracting an original video frame at a preset position from each segmented video frame sequence as the edge detection frame corresponding to that video segment, and storing the extracted edge detection frames in the target storage format. The preset position may be the same as or different from the preset position in the first frame extraction mode, which is not limited. For example, each video frame sequence contains 100 frames, and the most central original video frame, namely the 50th frame, is extracted from each video frame sequence as an edge detection frame. Thus, from the 10 video frame sequences, 10 edge detection frames can be extracted.
As a possible implementation, edge detection is performed on each edge detection frame according to the values of the pixels it contains to obtain corresponding edge information. Specifically, for each edge detection frame in the target storage format, edges may be computed on the image matrix corresponding to the edge detection frame by using the Canny operator, so as to obtain the corresponding edge information. Of course, other edge detection operators may also be employed, which is not limited.
Furthermore, each piece of obtained edge information can be used as an edge information frame and serve as a subsequent model input.
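A minimal sketch of computing the edge information frames with the Canny operator in OpenCV follows; the threshold values 100 and 200 are assumed for illustration and are not specified by the embodiment.

import cv2

def compute_edge_information(edge_detection_frames):
    """Apply Canny edge detection to each edge detection frame and return the
    resulting edge information frames."""
    edge_information_frames = []
    for frame in edge_detection_frames:
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        edges = cv2.Canny(gray, 100, 200)   # assumed Canny thresholds
        edge_information_frames.append(edges)
    return edge_information_frames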
S204, obtaining a target detection area in the video to be detected based on each reference frame, the distribution information and the obtained edge information.
In the embodiment of the application, the target detection area is used for representing the area where the target detection object exists in the video to be detected. The target detection object may be one or more of a core picture, a caption, and a station caption, but is not limited thereto. Herein, the region including the core picture is referred to as a core picture region, the region including the subtitle may be referred to as a subtitle region, and the region including the station caption may be referred to as a station caption region.
If the target detection object is any one of the core picture, the subtitle and the station caption, the target detection area is correspondingly the core picture area, the subtitle area or the station caption area.
If the target detection object includes more than one of the core picture, the subtitle and the station caption, the target detection area includes the corresponding ones of the core picture area, the subtitle area and the station caption area.
Of course, the target detection area may be other areas using video as an identification carrier and some objects in the video as target detection objects, which is not limited.
Referring to fig. 4A, assume that the video to be detected is a related video of a certain television show with black edges, the target detection object is a core picture, and the core picture area can be obtained from the video to be detected based on each reference frame, the distribution information and each obtained edge information.
Referring to fig. 4B, assuming that the target detection object is a subtitle, a subtitle region may be obtained from the video to be detected based on each reference frame, distribution information, and obtained edge information.
Referring to fig. 4C, assuming that the target detection object is a station caption, a station caption area may be obtained from the video to be detected based on each reference frame, distribution information, and each obtained edge information.
Referring to fig. 4D, assuming that the target detection object is a core picture and a station caption, a core picture region and a station caption region may be obtained from the video to be detected based on each reference frame, distribution information, and obtained edge information.
Further, if the target detection area is a subtitle area, the target subtitle can be extracted from the video to be detected based on the target detection area; if the target detection area is the identification area, extracting a target identification (such as a station logo) from the video to be detected based on the target detection area; if the target detection area is a picture area, the target picture can be extracted from the video to be detected based on the target detection area.
Furthermore, video duplicate checking can be performed based on the target detection area. Specifically, the video duplicate checking method specifically comprises the following steps:
extracting a target video from the video to be detected based on the target detection area, and determining the similarity between the target video and each reference video by combining the respective video characteristics of each reference video based on the video characteristics of the target video;
and if at least one similarity reaching a similarity threshold exists in the determined similarities, determining that the video to be detected is an abnormal video.
In the embodiment of the present application, the extraction of the video features is not limited, and will not be described herein.
For example, assuming that the similarity threshold is 80%, extracting a core picture from a video to be detected based on a target detection area, determining the similarity between the core picture and each reference video by combining the respective video characteristics of each reference video based on the video characteristics of the core picture, and further determining that the video to be detected is an abnormal video when the similarity reaching 80% exists in the determined similarities.
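The duplicate-checking step can be sketched as follows, under the assumption that video features are fixed-length vectors and that cosine similarity is the similarity measure; the embodiment does not prescribe a particular feature extractor or similarity metric, so both are illustrative.

import numpy as np

def is_abnormal_video(target_features, reference_feature_list, threshold=0.8):
    """target_features: feature vector of the target video extracted from the
    target detection area; reference_feature_list: feature vectors of the
    reference videos. Returns True if any similarity reaches the threshold."""
    for ref_features in reference_feature_list:
        sim = np.dot(target_features, ref_features) / (
            np.linalg.norm(target_features) * np.linalg.norm(ref_features) + 1e-8)
        if sim >= threshold:
            return True   # the video to be detected is an abnormal video
    return False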
In this implementation, when identifying infringing videos, detecting the core picture in the video makes it possible to better identify and remove the infringing interference, so as to obtain the real body area of the video; as a result, the matching and deduplication of infringing data achieve higher accuracy and better recall.
In some embodiments, in order to improve the image detection efficiency and accuracy, in the embodiment of the present application, a machine learning model may be used to detect the target detection area, specifically, S204 specifically includes the following steps:
based on the distribution information, combining the positions of the pixels, constructing at least one time sequence information frame, and taking the obtained edge information as each edge information frame; and inputting each reference frame, at least one time sequence information frame and each obtained edge information frame into a target image detection model to obtain a target detection area in the video to be detected.
As a possible implementation manner, when at least one time sequence information frame is constructed based on the distribution information and combining the positions of the pixels, there are the following cases:
if the distribution information contains the average value of the pixel values corresponding to each pixel, a time axis average value matrix with the same length and width as the edge detection frame can be constructed based on the average value of the pixel values corresponding to each pixel and combined with the position of each pixel, and the time axis average value matrix is used as a time sequence information frame.
Each element in the time axis mean value matrix corresponds to one pixel, and the value of each element is the mean value of the pixel values of the corresponding pixel.
Still referring to fig. 5, the resolution of the time-varying frame is 416×416, that is, the time-varying frame includes 416×416 pixels, for each of the 416×416 pixels, a pixel value average of the pixel is obtained based on the values of one pixel in 100 time-varying frames, after the pixel value average corresponding to each of the 416×416 pixels is obtained, a time-axis average matrix of 416×416 is constructed based on the positions of each of the 416×416 pixels in the time-varying frame, and each element in the time-axis average matrix is the pixel value average of the corresponding pixel.
If the distribution information contains the pixel value variances corresponding to the pixels, a time axis variance matrix which is the same as the length and width of the edge detection frame can be constructed based on the pixel value variances corresponding to the pixels and combining the positions of the pixels in the edge detection frame, and the time axis variance matrix is used as a time sequence information frame.
Each element in the time axis variance matrix corresponds to one pixel, and the value of each element is the pixel value variance of the corresponding pixel.
Referring to fig. 5, the resolution of the time-varying frame is 416×416, that is, the time-varying frame includes 416×416 pixels, for each of the 416×416 pixels, the pixel value variance of the pixel is obtained based on the values of one pixel in 100 time-varying frames, and after the pixel value variances corresponding to the 416×416 pixels are obtained, a time-axis variance matrix of 416×416 is constructed based on the positions of the 416×416 pixels in the time-varying frame, where each element in the time-axis variance matrix is the pixel value variance of the corresponding pixel.
If the distribution information contains both the pixel-value variances and the pixel-value means corresponding to the pixels, a time-axis mean matrix with the same length and width as the edge detection frame is constructed based on the pixel-value means and the positions of the pixels in the edge detection frame, and used as one time sequence information frame; a time-axis variance matrix with the same length and width as the edge detection frame is constructed based on the pixel-value variances and the positions of the pixels in the edge detection frame, and used as another time sequence information frame. In the following description, the distribution information is taken to include both the pixel-value variances and the pixel-value means corresponding to the pixels, so that two time sequence information frames are obtained.
The target image detection model may be, but is not limited to, a Deep Learning (DL) model, including, but not limited to, a convolutional neural network (Convolutional Neural Networks, CNN).
Referring to fig. 6, in the embodiment of the present application, the target image detection model is composed of a feature extraction section and an image detection section. Considering that the three types of input frames are different, in the embodiment of the present application the feature extraction section adopts three independent model branches to perform independent feature extraction for the reference frames, the time sequence information frames and the edge information frames, where the model branch performing feature extraction for the reference frames may be referred to as the original feature extraction network, the model branch performing feature extraction for the time sequence information frames may be referred to as the time sequence feature extraction network, and the model branch performing feature extraction for the edge information frames may be referred to as the edge feature extraction network. The image detection section employs a detection network.
In the embodiment of the present application, the original feature extraction network, the time sequence feature extraction network, and the edge feature extraction network in the feature extraction part may adopt a structure based on Darknet, resNet, transformer, but are not limited thereto.
Taking the feature extraction part as an example, a Darknet53 network pre-trained on ImageNet is adopted; refer to fig. 7, which is a schematic structural diagram of the target image detection model provided in an embodiment of the present application. The target image detection model comprises an original feature extraction network (branch A), a time sequence feature extraction network (branch B), an edge feature extraction network (branch C) and a detection network.
The original feature extraction network is used for extracting the original features corresponding to each reference frame; the time sequence feature extraction network is used for extracting the time sequence features corresponding to the at least one time sequence information frame; the edge feature extraction network is used for extracting the edge features corresponding to each edge information frame; and the detection network is used for detecting the target detection area in the video to be detected according to the obtained original features, the at least one time sequence feature and the edge features.
In the embodiment of the application, five bottom-layer components, namely Convolution (CONV), Batch Normalization (BN), Leaky ReLU, Addition (Add) and Zero padding, are used to build up the DBL component, the Res unit component and the RESN component. See Table 1 for a description of the five bottom-layer components CONV, BN, Leaky ReLU, Add and Zero padding.
The DBL component consists of the three bottom-layer components CONV, BN and Leaky ReLU; the Res unit component consists of two DBL components and one Add component (it can also be understood as consisting of the four bottom-layer components CONV, BN, Leaky ReLU and Add); and the RESN component consists of a Zero padding component, a DBL component and n Res unit components (it can also be understood as consisting of the bottom-layer components CONV, BN, Leaky ReLU, Add and Zero padding), where n is a positive integer.
The original feature extraction network, the time sequence feature extraction network, the edge feature extraction network and the detection network can be respectively constructed by adopting the constructed DBL component, the Res unit component and the RESN component. The original feature extraction network is called a branch a, the time sequence feature extraction network is called a branch B, the edge feature extraction network is called a branch C, and each branch comprises a DBL component, a RES1 component (n=1), a RES2 component (n=2), a RES8 component (n=8) and a RES4 component (n=4).
Table 1: Description of the bottom-layer components
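The component hierarchy described above (DBL, Res unit and RESN built from CONV, BN, Leaky ReLU, Add and Zero padding) can be sketched in PyTorch as follows; the channel counts, kernel sizes and the Leaky ReLU negative slope are illustrative assumptions rather than values given in the embodiment.

import torch
import torch.nn as nn

class DBL(nn.Module):
    """DBL component: CONV + BN + Leaky ReLU."""
    def __init__(self, in_ch, out_ch, kernel=3, stride=1, padding=None):
        super().__init__()
        if padding is None:
            padding = kernel // 2
        self.conv = nn.Conv2d(in_ch, out_ch, kernel, stride, padding, bias=False)
        self.bn = nn.BatchNorm2d(out_ch)
        self.act = nn.LeakyReLU(0.1)   # assumed negative slope

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

class ResUnit(nn.Module):
    """Res unit component: two DBL components plus an Add (residual) connection."""
    def __init__(self, ch):
        super().__init__()
        self.dbl1 = DBL(ch, ch // 2, kernel=1)
        self.dbl2 = DBL(ch // 2, ch, kernel=3)

    def forward(self, x):
        return x + self.dbl2(self.dbl1(x))

class RESN(nn.Module):
    """RESN component: Zero padding + a down-sampling DBL + n Res unit components."""
    def __init__(self, in_ch, out_ch, n):
        super().__init__()
        self.pad = nn.ZeroPad2d((1, 0, 1, 0))
        self.down = DBL(in_ch, out_ch, kernel=3, stride=2, padding=0)
        self.blocks = nn.Sequential(*[ResUnit(out_ch) for _ in range(n)])

    def forward(self, x):
        return self.blocks(self.down(self.pad(x)))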
The image detection process based on the target image detection model will be described below with reference to the model structures shown in fig. 6 and 7.
In some embodiments, each reference frame, at least one time sequence information frame and each obtained edge information frame are input into a target image detection model to obtain a target detection area in a video to be detected, and the method specifically includes:
Inputting each reference frame into an original feature extraction network to obtain original features corresponding to each reference frame;
respectively inputting at least one time sequence information frame into a time sequence feature extraction network to obtain corresponding time sequence features;
inputting each edge information frame into an edge feature extraction network respectively to obtain corresponding edge features;
and inputting the obtained original features, at least one time sequence feature and edge features into a detection network to obtain a target detection area in the video to be detected.
That is, after feature extraction is performed through a feature extraction network designed for a reference frame, a time series information frame, and an edge information frame, detection of a target detection area is performed through a detection network according to various extracted features.
For example, referring to fig. 6, assume that there are 10 reference frames, 2 time sequence information frames and 10 edge information frames, i.e. 22 input frames in total, each having a resolution of 416×416. The 10 reference frames are input to the original feature extraction network to obtain the original features corresponding to the 10 reference frames; the 2 time sequence information frames are respectively input to the time sequence feature extraction network to obtain the corresponding time sequence features; the 10 edge information frames are respectively input to the edge feature extraction network to obtain the corresponding edge features; and then the obtained 10 original features, 2 time sequence features and 10 edge features are input to the detection network, so as to obtain the target detection region in the video to be detected.
As a possible implementation manner, each reference frame is input to an original feature extraction network to obtain an original feature corresponding to each reference frame, which specifically includes:
each reference frame is respectively input into a convolution component in an original feature extraction network for convolution processing, and a corresponding reference frame convolution result is obtained, wherein the convolution component comprises a convolution layer, a normalization layer and an activation function;
and based on the obtained convolution result of each reference frame, obtaining the original characteristic corresponding to each reference frame by utilizing at least one residual error component in the original characteristic extraction network.
Taking reference frame x as an example, where reference frame x may be any one of the reference frames, referring to fig. 7, reference frame x is input into branch A, and convolution processing is performed by the DBL component in branch A to obtain the reference frame convolution result corresponding to reference frame x; then, based on that convolution result, the original features corresponding to reference frame x are obtained by using the RES1 component, the RES2 component, the RES8 component, the RES8 component and the RES4 component.
As a possible implementation manner, at least one time sequence information frame is respectively input to a time sequence feature extraction network to obtain corresponding time sequence features, which specifically includes:
At least one time sequence information frame is respectively input into a convolution component in a time sequence feature extraction network to carry out convolution processing, and a corresponding time sequence frame convolution result is obtained, wherein the convolution component comprises a convolution layer, a normalization layer and an activation function;
and based on the obtained convolution result of the at least one time sequence frame, obtaining the time sequence features corresponding to the at least one time sequence information frame respectively by using at least one residual component in the time sequence feature extraction network.
Taking time sequence information frame y as an example, where frame y may be any one of the at least one time sequence information frame, referring to fig. 7, frame y is input into branch B, and convolution processing is performed by the DBL component in branch B to obtain the time sequence frame convolution result corresponding to frame y; then, based on that convolution result, the time sequence features corresponding to frame y are obtained by using the RES1 component, the RES2 component, the RES8 component, the RES8 component and the RES4 component.
As a possible implementation manner, each edge information frame is respectively input to an edge feature extraction network to obtain a corresponding edge feature, which specifically includes:
each edge information frame is respectively input into a convolution component in an edge feature extraction network to carry out convolution processing, and a corresponding edge frame convolution result is obtained, wherein the convolution component comprises a convolution layer, a normalization layer and an activation function;
And based on the obtained convolution result of each edge frame, at least one residual error component in the edge feature extraction network is utilized to obtain the edge feature corresponding to each edge information frame.
Taking edge information frame z as an example, where frame z may be any one of the edge information frames, referring to fig. 7, frame z is input into branch C, and convolution processing is performed by the DBL component in branch C to obtain the edge frame convolution result corresponding to frame z; then, based on that convolution result, the edge features corresponding to frame z are obtained by using the RES1 component, the RES2 component, the RES8 component, the RES8 component and the RES4 component.
Specifically, the obtained original features, at least one time sequence feature, and edge features are input to a detection network, and when a target detection area in a video to be detected is obtained, the following modes can be adopted, but are not limited to:
performing feature stitching on each original feature, at least one time sequence feature and each edge feature through a detection network to obtain a global feature, and performing convolution processing on the global feature at least once to obtain a target detection result;
and obtaining a target detection area in the video to be detected based on the target detection result.
Referring to fig. 8, 10 original features, 2 time sequence features and 10 edge features are obtained through feature extraction; feature stitching is performed on these 10 original features, 2 time sequence features and 10 edge features to obtain a global feature (a concatenation of the 22 features), and at least one convolution process is then performed through DBL components and a CONV component to obtain a target detection result, where the target detection result is a 13x13x18 three-dimensional matrix; the target detection region in the video to be detected is then obtained based on the target detection result.
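A minimal sketch of this detection part follows, under the assumption that each branch outputs a 13x13 feature map with an identical channel count and that the 22 per-frame features are concatenated along the channel dimension; the channel numbers are illustrative only.

import torch
import torch.nn as nn

class DetectionHead(nn.Module):
    """Concatenate the 22 per-frame features and apply DBL + CONV to produce the
    13 x 13 x 18 target detection result (3 anchors x (4 + 1 + 1) values)."""
    def __init__(self, per_frame_channels=256, num_frames=22,
                 num_anchors=3, num_values=6):
        super().__init__()
        in_ch = per_frame_channels * num_frames
        self.dbl = nn.Sequential(                        # DBL: CONV + BN + Leaky ReLU
            nn.Conv2d(in_ch, 512, 3, padding=1, bias=False),
            nn.BatchNorm2d(512),
            nn.LeakyReLU(0.1),
        )
        self.conv = nn.Conv2d(512, num_anchors * num_values, kernel_size=1)

    def forward(self, feature_list):
        # feature_list: 10 original + 2 time sequence + 10 edge features,
        # each of shape (batch, per_frame_channels, 13, 13)
        global_feature = torch.cat(feature_list, dim=1)  # feature stitching
        return self.conv(self.dbl(global_feature))       # (batch, 18, 13, 13)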
In some embodiments, the image detection may also be performed by using an anchor structure. The anchor structure means that each feature point on the feature map output by the model (for example, 13×13 feature points) can be used as an anchor point, and a set number of anchor frames are generated, centered on each feature point, according to set anchor frame sizes. The following description only takes the generation of 3 anchor frames as an example.
Specifically, in the embodiment of the present application, the target detection result includes a feature map and anchor frame information of a set number of anchor frames corresponding to each feature point in the feature map, where the anchor frame information of each anchor frame includes the position information, the target confidence and the category confidence of the anchor frame. The target confidence is used for representing the probability that a target detection object exists in the anchor frame, and the category confidence is used for representing the probability that the target detection object belongs to a certain category (core picture, subtitle or station caption). For example, when the target detection object is the core picture, the target confidence represents the probability that a target detection object exists in the anchor frame, and the category confidence represents the probability that the target detection object is the core picture. Hereinafter, only the core picture is taken as the target detection object by way of example. It should be noted that if there are multiple target detection objects, there will be multiple category confidences, each corresponding to one category of target detection object.
Taking a 13x13x18 three-dimensional matrix as the target detection result as an example: 13x13 means that the size of the output feature map is 13x13, and 18 means that each pixel in the feature map corresponds to 18 values. Assume that 4 position parameters are used to describe the position of one anchor frame, namely the abscissa t_x and the ordinate t_y of the anchor frame center, and the height b_h and the width b_w of the anchor frame. Each pixel point corresponds to 3 anchor frames, and each anchor frame has 4 position parameters, 1 target confidence and 1 category confidence; therefore, each pixel corresponds to 3×(4+1+1) = 18 values.
Referring to fig. 9, the size of the feature map is 13x13; each dotted-line box represents the image area corresponding to one feature point (i.e., one pixel point) in the feature map, and the solid lines represent anchor frames. The three solid-line anchor frames correspond to the feature point marked with a circle at its center, and the sizes of the three anchor frames are different.
In the embodiment of the present application, the sizes of the 3 anchor frames are different and may be set a priori. The anchor structure plays a role of prior guidance, and the sizes of the 3 anchor frames can be set based on the aspect ratios of target detection areas in the actual detection scene. The sizes of the 3 anchor frames are hyperparameters computed with a Kmeans clustering method according to the sizes of the target detection areas in the training set: specifically, the sizes of the candidate frames corresponding to all labeled target detection areas in all training samples are first extracted and stored in memory, the candidate frames are then clustered into 3 classes by the Kmeans clustering algorithm, and the sizes of the candidate frames at the centers of the 3 classes are used as the sizes of the 3 anchor frames.
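The anchor-size computation can be sketched as follows; scikit-learn's KMeans with Euclidean distance over (width, height) pairs is an assumed, simplified implementation, since the embodiment only names the Kmeans clustering method without specifying a distance measure or library.

import numpy as np
from sklearn.cluster import KMeans

def compute_anchor_sizes(annotated_boxes, num_anchors=3):
    """annotated_boxes: array of (width, height) pairs of all labeled target
    detection areas in the training set. Returns num_anchors (width, height)
    pairs used as the anchor frame sizes."""
    boxes = np.asarray(annotated_boxes, dtype=np.float32)
    kmeans = KMeans(n_clusters=num_anchors, n_init=10).fit(boxes)
    return kmeans.cluster_centers_   # sizes of the cluster centers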
As a possible implementation manner, in the embodiment of the present application, based on a target detection result, a target detection area in a video to be detected is obtained, which specifically includes the following steps:
for each category of target detection object, the following operations are respectively executed:
based on the category confidence of each anchor frame in the target detection result for the target detection object of the current category, screening out, from the anchor frames, at least one anchor frame meeting a preset screening condition;
and selecting a target detection area which meets preset selection conditions from the selected at least one anchor frame based on the target confidence coefficient corresponding to each selected at least one anchor frame in the target detection result.
Screening out, from the anchor frames, at least one anchor frame meeting the preset screening condition based on the category confidence of each anchor frame in the target detection result for the target detection object of the current category specifically includes the following steps:
at least one overlapping area comparison is performed for each anchor frame, and during each comparison, the following operations are performed:
acquiring each anchor frame which is not processed currently;
according to the category confidence of each currently unprocessed anchor frame in the target detection result for the target detection object of the current category, selecting, from the currently unprocessed anchor frames, one anchor frame whose category confidence meets a set category confidence condition;
Based on the overlapping area between the selected anchor frame and each of the other current anchor frames, determining whether at least one other anchor frame whose overlapping area with the selected anchor frame is larger than a preset area threshold exists, and deleting any such anchor frame from the current anchor frames;
taking the selected anchor frame as a processed anchor frame, and judging whether unprocessed anchor frames exist in the anchor frames which are remained after deletion; if so, performing the next overlapping area comparison, otherwise, outputting the rest anchor frames after deletion as at least one anchor frame meeting the preset screening conditions.
The set category confidence condition may be, but is not limited to, having the highest category confidence.
Referring to fig. 10, only 4 anchor frames are taken as an example for explanation, and it is assumed that the target detection object is the core picture. The 4 anchor frames are anchor frame A, anchor frame B, anchor frame C and anchor frame D, whose category confidences are 0.7, 0.4, 0.5 and 0.3, respectively.
In the first overlapping area comparison, the currently unprocessed anchor frames include anchor frame A, anchor frame B, anchor frame C and anchor frame D. First, according to the target detection result, anchor frame A, which has the highest category confidence for the core picture, is selected from anchor frames A, B, C and D according to their category confidences. Assuming that the overlapping areas of anchor frames B, C and D with anchor frame A are 0.6, 0.2 and 0.2 respectively and the preset area threshold is 60%, it is determined that the overlapping area of anchor frame B with anchor frame A meets the 60% threshold, so anchor frame B is deleted from anchor frames A, B, C and D. Then anchor frame A is taken as a processed anchor frame, and it is judged whether any unprocessed anchor frame exists among the remaining anchor frames (anchor frame A, anchor frame C, anchor frame D); at this time, both anchor frame C and anchor frame D are unprocessed, so the next overlapping area comparison is performed.
In the second overlapping area comparison, the currently unprocessed anchor frames include anchor frame C and anchor frame D. First, according to the target detection result, anchor frame C, which has the higher category confidence for the core picture, is selected from anchor frame C and anchor frame D according to their category confidences. Assuming that the overlapping area of anchor frame D with anchor frame C is 0.7, it is determined that this overlapping area is greater than 60%, so anchor frame D is deleted from the remaining anchor frames. Then anchor frame C is taken as a processed anchor frame, and it is judged whether any unprocessed anchor frame exists among the remaining anchor frames (anchor frame A, anchor frame C); at this time, no unprocessed anchor frame exists, so anchor frame A and anchor frame C are output as the two anchor frames meeting the preset screening condition.
In one possible implementation, when the target detection area meeting a preset selection condition is selected from the screened at least one anchor frame based on their respective target confidences in the target detection result: if only one anchor frame was screened out, that anchor frame is directly taken as the target detection area meeting the preset selection condition; if a plurality of anchor frames were screened out, the anchor frame with the highest target confidence among them may be taken as the target detection area. The preset selection condition is not limited to the highest target confidence.
For example, referring to fig. 10, based on the respective target confidences of anchor frame A and anchor frame C, anchor frame A, which has the higher target confidence, is taken as the target detection area.
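The screening by overlapping-area comparison and the subsequent selection can be sketched as follows; using intersection-over-union as the overlap measure and a dictionary per anchor frame are assumptions introduced for illustration.

def iou(a, b):
    """a, b: boxes as (x1, y1, x2, y2). Intersection-over-union, used here as
    an assumed measure of the overlapping area between two anchor frames."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-8)

def select_target_area(anchors, area_threshold=0.6):
    """anchors: list of dicts with keys 'box', 'cls_conf' (category confidence)
    and 'obj_conf' (target confidence). Screens anchor frames by repeated
    overlapping-area comparison, then returns the one with the highest target
    confidence as the target detection area."""
    remaining = sorted(anchors, key=lambda a: a["cls_conf"], reverse=True)
    kept = []
    while remaining:
        best = remaining.pop(0)                      # highest category confidence
        kept.append(best)
        remaining = [a for a in remaining
                     if iou(best["box"], a["box"]) <= area_threshold]
    return max(kept, key=lambda a: a["obj_conf"])    # highest target confidence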
Next, a training process of the target image detection model will be described. The model training process specifically relates to two parts of content of training data acquisition and iterative training.
Training data acquisition
In supervised learning, model training relies on labeled samples; therefore, labeled videos need to be acquired. In the semi-supervised learning mode, model training relies on a small number of labeled samples, and both labeled and unlabeled videos need to be acquired.
1. Capturing tagged video
Considering that training a deep learning model under the guidance of labeled data yields a better training effect, in the embodiment of the application video samples can be collected, and each collected video sample can be labeled with its corresponding target detection area (i.e., each video sample is labeled) to obtain labeled videos. The target detection area may also be referred to as a real detection area or a real box.
For example, the label corresponding to one video sample may include: the video path of the video sample, the resolution of the video sample, and the location information of the target detection area in the video sample.
Taking the target detection area as a rectangle as an example, the label format may be: [video path, width, height, x1, y1, x2, y2], where the horizontal axis is the x-axis and the vertical axis is the y-axis, (x1, y1) represents the upper left corner of the target detection area, and (x2, y2) represents the lower right corner of the target detection area.
Still taking the detection of the core picture area as an example, referring to fig. 11, which is a schematic diagram of one frame in a video sample provided in an embodiment of the present application, the video sample is a short video and its label is [data/1.mp4, 400, 800, 0, 291, 400, 800]; that is, the video sample is a file named 1.mp4 stored under the data folder, the resolution of the video sample is 400×800, the upper left corner coordinate of the target detection area is (0, 291), and the lower right corner coordinate of the target detection area is (400, 800).
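As a minimal sketch (the class and function names are illustrative assumptions), a label in the format [video path, width, height, x1, y1, x2, y2] can be represented and parsed as follows:

    from dataclasses import dataclass

    @dataclass
    class SampleLabel:
        video_path: str
        width: int
        height: int
        x1: int  # x of the upper left corner of the target detection area
        y1: int  # y of the upper left corner
        x2: int  # x of the lower right corner
        y2: int  # y of the lower right corner

    def parse_label(fields):
        """Parse one label in the format [video path, width, height, x1, y1, x2, y2]."""
        path, w, h, x1, y1, x2, y2 = fields
        return SampleLabel(str(path), int(w), int(h), int(x1), int(y1), int(x2), int(y2))

    label = parse_label(["data/1.mp4", 400, 800, 0, 291, 400, 800])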
In some embodiments, after the video samples are collected, the terminal device responds to a frame selection operation of the target object for each video sample, determines the target detection area corresponding to each video sample according to the selection frame input by the target object in that video sample, and then generates the label corresponding to each video sample according to its target detection area. The selection frame may be a rectangular selection frame, but is not limited thereto.
In the related data acquisition, when the embodiments of the present application are applied to specific products or technologies, the collection, use and processing of the related data should comply with the requirements of national laws and regulations, conform to the principles of legality, legitimacy and necessity, not involve data types whose acquisition is forbidden or restricted by laws and regulations, and not hinder the normal operation of the target website.
2. Capturing unlabeled video
In the embodiment of the application, the model can be jointly trained by using the unlabeled video and the labeled video, and the generalization of the model is improved in a semi-supervised training mode.
In the embodiment of the application, for each unlabeled video, data enhancement can be performed in manners including, but not limited to, the following, so as to obtain the enhanced video corresponding to each unlabeled video; enhancement video pairs are then constructed accordingly, where each enhancement video pair comprises one unlabeled video and its corresponding enhanced video.
In some embodiments, the enhanced video corresponding to one unlabeled video is generated using at least one of:
Operation 1: and adjusting the image parameters of each sample video frame contained in the unlabeled video A based on the target adjustment parameters to obtain an enhanced video A' corresponding to the unlabeled video A.
Wherein the target adjustment parameter is used to characterize the adjusted value of the image parameter, which includes, but is not limited to, one or more of saturation, contrast, hue, and brightness.
For example, the contrast of the non-tagged video a is 0% (normal), the target adjustment parameter characterizes that the contrast is adjusted to +20%, and based on the target adjustment parameter, the image parameters of each sample video frame included in the non-tagged video a are adjusted, so as to obtain an enhanced video a 'corresponding to the non-tagged video a, and the contrast of the enhanced video a' is +20%.
Operation 2: and performing time axis clipping on each sample video frame contained in the unlabeled video A to obtain an enhanced video A' corresponding to the unlabeled video A.
Time axis clipping refers to one or more operations of clipping, stretching, or cutting and splicing the time axis of the video, but is not limited thereto. Clipping the time axis of the video may shorten it by adjusting the start time and end time of the unlabeled video A; stretching the time axis of the video may mean stretching or compressing the time axis of the unlabeled video A without changing the picture, so as to change its playing speed (for example, slowing it down); cutting and splicing the time axis of the video means adjusting the playing order of the sample video frames contained in the unlabeled video A.
For example, the time axis of the unlabeled video A is shortened to obtain an enhanced video A' corresponding to the unlabeled video A, where the start time of the enhanced video A' is the second video frame of the unlabeled video A and the end time is the second-to-last video frame of the unlabeled video A.
Operation 3: and adding Gaussian noise into each sample video frame contained in the unlabeled video A to obtain an enhanced video A' corresponding to the unlabeled video A.
The Gaussian noise herein may also be referred to more generally as video noise, which includes, but is not limited to, amplifier noise, salt-and-pepper noise, shot noise, and the like.
For example, Gaussian noise is added to each sample video frame contained in the unlabeled video A to obtain an enhanced video A' corresponding to the unlabeled video A, and each sample video frame of the enhanced video A' contains the added noise (for example, salt-and-pepper noise).
After data enhancement, each unlabeled video in the unlabeled video set U has a corresponding enhanced video; such a pair of videos is referred to as an enhancement video pair, and the set of enhanced videos may be denoted U'. Both sets are used in the semi-supervised training process.
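A minimal sketch of how the three enhancement operations above could be combined to build enhancement video pairs follows; a video is assumed to be a list of frames stored as numpy arrays, and the parameter values and frame sizes are illustrative assumptions:

    import random
    import numpy as np

    def adjust_contrast(frames, factor=1.2):
        """Operation 1 (illustrative image parameter): adjust the contrast of every sample video frame."""
        return [np.clip((f.astype(np.float32) - 128.0) * factor + 128.0, 0, 255).astype(np.uint8)
                for f in frames]

    def clip_time_axis(frames):
        """Operation 2 (illustrative): shorten the time axis by dropping the first and last frames."""
        return frames[1:-1] if len(frames) > 2 else list(frames)

    def add_gaussian_noise(frames, sigma=5.0):
        """Operation 3: add Gaussian noise to every sample video frame."""
        return [np.clip(f.astype(np.float32) + np.random.normal(0.0, sigma, f.shape), 0, 255).astype(np.uint8)
                for f in frames]

    def build_enhancement_pair(unlabeled_video):
        """Apply one randomly chosen enhancement and return the (video, enhanced video) pair."""
        operation = random.choice([adjust_contrast, clip_time_axis, add_gaussian_noise])
        return unlabeled_video, operation(unlabeled_video)

    video = [np.random.randint(0, 256, (800, 400, 3), dtype=np.uint8) for _ in range(8)]
    pair = build_enhancement_pair(video)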
(II) Iterative training
After the training data preparation is completed, the constructed model can be trained by using the training data. In the embodiment of the application, an image detection model obtained after training is called a target image detection model, and an image detection model which is not trained is called an initial image detection model.
In some embodiments, parameters and data required by the initial image detection model may be set according to the structure of the model, where the parameters include the parameters to be trained; after hyperparameters such as the batch size (batch), the number of iterations (epoch) and the learning rate (learning rate) are set, training is started, finally obtaining the target image detection model.
In the embodiment of the application, either supervised training or semi-supervised training can be adopted.
Referring to fig. 12, which is a logic diagram of a supervised training method for the initial image detection model according to an embodiment of the present application. In the iterative training process, all training samples are divided into designated batches, and training is performed on the training samples of each batch in turn; since the steps performed for each batch in each iteration are similar, training on one batch is described here as an example. The training process for one batch comprises the following steps:
S1201, based on a plurality of tagged videos, obtaining first detection results corresponding to the tagged videos by using an initial image detection model.
In the embodiment of the application, for each tagged video in a plurality of tagged videos, a corresponding first detection result can be obtained by using an initial image detection model. The process of obtaining a first detection result of a tagged video using the initial image detection model is similar to the process of obtaining a target detection area of a video to be detected using the target image detection model above, and will not be described again.
The first detection result is used for representing the predicted position information of the detection area in the corresponding tagged video.
Still taking a model output format of 13x13x18 as an example, the first detection result corresponding to one tagged video includes the anchor frame information of the 3 anchor frames corresponding to each of the 13x13 feature points, and the anchor frame information of each anchor frame includes: the position information of the anchor frame, the target confidence and the category confidence.
Referring to fig. 13, in the first detection result, for the 3 anchor frames of the feature point in the 2nd row and 7th column, the detection information of anchor frame 1 includes: the position information of anchor frame 1 (abscissa 208 of the center of anchor frame 1, ordinate 336 of the center of anchor frame 1, height 100 of anchor frame 1, width 416 of anchor frame 1), a target confidence of 0.9 (representing a 90% probability that the target detection object exists in anchor frame 1) and a category confidence of 0.9 (representing a 90% probability that the target detection object in anchor frame 1 is the core picture); the detection information of anchor frame 2 includes: the position information of anchor frame 2 (abscissa 208 of the center of anchor frame 2, ordinate 80 of the center of anchor frame 2, height 64 of anchor frame 2, width 160 of anchor frame 2), a target confidence of 0.95 (representing a 95% probability that the target detection object exists in anchor frame 2) and a category confidence of 0.95 (representing a 95% probability that the target detection object in anchor frame 2 is the core picture); the detection information of anchor frame 3 includes: the position information of anchor frame 3 (abscissa 208 of the center of anchor frame 3, ordinate 80 of the center of anchor frame 3, height 64 of anchor frame 3, width 160 of anchor frame 3), a target confidence of 0.99 (representing a 99% probability that the target detection object exists in anchor frame 3) and a category confidence of 0.99 (representing a 99% probability that the target detection object in anchor frame 3 is the core picture).
S1202, determining the total loss of the model based on the first detection results corresponding to the tagged videos and combining the position information of the target detection areas of the tagged videos.
In the model training process, the target detection area of the tagged video may also be referred to as a true detection area.
In the embodiment of the application, the model total loss can be calculated by adopting loss functions such as the cross entropy (Cross Entropy) loss function, the square loss function (quadratic loss function) and the absolute value loss function (absolute loss function), but is not limited thereto.
Exemplarily, the model total loss Loss_N can be calculated using formula (1).
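As a sketch, formula (1) can be written out consistently with the row-by-row description that follows; the squared-error form of each term, the indicator functions 1_ij^obj / 1_ij^noobj and the exact summation ranges are assumptions rather than quotations of the original expression:

    \begin{aligned}
    Loss_N ={}& \lambda_{box}\sum_{i=1}^{N_1\times N_1}\sum_{j=1}^{3}\mathbb{1}_{ij}^{obj}
    \Big[(t_x-t'_x)^2+(t_y-t'_y)^2+(t_h-t'_h)^2+(t_w-t'_w)^2\Big]\\
    &+\lambda_{obj}\sum_{i=1}^{N_1\times N_1}\sum_{j=1}^{3}\mathbb{1}_{ij}^{obj}\big(C_{ij}-\hat{C}_{ij}\big)^2
    +\lambda_{noobj}\sum_{i=1}^{N_1\times N_1}\sum_{j=1}^{3}\mathbb{1}_{ij}^{noobj}\big((1-C_{ij})-(1-\hat{C}_{ij})\big)^2\\
    &+\lambda_{class}\sum_{i=1}^{N_1\times N_1}\sum_{j=1}^{3}\mathbb{1}_{ij}^{obj}\sum_{c=1}^{class}\big(p_{ij}(c)-p'_{ij}(c)\big)^2
    \end{aligned}

Here Ĉ_ij denotes the predicted target confidence of the ij-th region, and 1_ij^obj (respectively 1_ij^noobj) indicates whether the ij-th region is (respectively is not) responsible for a target detection object.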
In the first row of formula (1), λ_box is the loss weight of the anchor frame, N1 is the length and width of the feature map output by the model (the length and width are the same; according to the above setting, N1 = 13), t_x and t_y are the coordinates of the center point of the real box (i.e. the target detection area) in the label, t'_x and t'_y are the coordinates of the center point of the prediction box (i.e. the predicted detection area characterized by the detection result), t_h and t_w are the height and width of the real box, and t'_h and t'_w are the height and width of the prediction box; the first row measures the overall regional error loss between the prediction box and the real box.
In the second row of formula (1), λ_obj is the target confidence weight of the anchor frame, Ĉ_ij is the predicted target confidence of whether the target detection object exists in the ij-th region (the j-th anchor frame of the i-th feature point), and C_ij is the true target confidence of whether the target detection object exists in the ij-th region. λ_noobj is the no-object weight of the anchor frame; the predicted no-object confidence of the ij-th region is calculated from the predicted target confidence as: predicted no-object confidence = 1 − predicted target confidence.
In the third row of formula (1), λ_class is the class weight of the anchor frame, class represents the number of classes, p'_ij(c) is the predicted category confidence of the ij-th region for the c-th category, and p_ij(c) is the true category confidence of the ij-th region for the c-th category.
S1203, determining whether the initial image detection model satisfies the convergence condition.
In the embodiment of the present application, the convergence condition may include at least one of the following conditions:
(1) The total model loss for the current round is less than the total model loss for the previous round.
(2) The total loss value is not greater than a preset loss value threshold.
(3) The iteration number reaches a preset number upper limit value.
If the determination result in S1203 is no, that is, the initial image detection model does not meet the convergence condition, the model parameters are adjusted based on the total model loss, and the adjusted initial image detection model enters the next round of training; if the initial image detection model meets the convergence condition, training ends.
During training, the optimizer may adopt stochastic gradient descent (Stochastic Gradient Descent, SGD) with momentum, the initial learning rate is 0.001, and a step-decay schedule multiplies the learning rate by 0.96 every 5 epochs.
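A minimal PyTorch-style sketch of this optimizer configuration is shown below; the placeholder model, the momentum value of 0.9 and the number of epochs are assumptions, while torch.optim.SGD and StepLR are standard library APIs:

    import torch
    import torch.nn as nn

    model = nn.Conv2d(3, 18, kernel_size=1)  # placeholder standing in for the initial image detection model
    optimizer = torch.optim.SGD(model.parameters(), lr=0.001, momentum=0.9)  # SGD with momentum
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=5, gamma=0.96)  # x0.96 every 5 epochs

    for epoch in range(50):  # the number of epochs is illustrative
        # ... one epoch of batched training, computing the total loss of formula (1)
        # and calling optimizer.step() for every batch ...
        scheduler.step()     # apply the step-decay learning rate schedule once per epoch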
Fig. 14 is a logic schematic diagram of a semi-supervised training method for the initial image detection model according to an embodiment of the present application. In the iterative training process, all training samples are divided into designated batches, and training is performed on the training samples of each batch in turn; since the steps performed for each batch in each iteration are similar, training on one batch is described here as an example. The training for one batch comprises the following steps:
S1401, respectively inputting a plurality of tagged videos into an initial image detection model, obtaining corresponding first detection results, and obtaining tagged video losses based on the obtained first detection results.
In the embodiment of the application, for each tagged video in a plurality of tagged videos, a corresponding first detection result can be obtained by using an initial image detection model. The process of obtaining a first detection result of a tagged video using the initial image detection model is similar to the process of obtaining a target detection area of a video to be detected using the target image detection model above, and will not be described again.
The calculation mode of the tagged video loss is the same as the calculation mode of the model total loss in the supervised training, and is not described here again.
In some embodiments, to prevent rapid overfitting of the tagged data during training, a slow-release signal strategy is proposed in the embodiments of the present application.
The basic principle of the slow-release signal is that samples for which the model is already too confident on the labeled data, namely samples with an excessively high confidence, are not counted in the training loss, so that their errors are not back-propagated and the model is prevented from further overfitting to these samples.
Specifically, when the tagged video loss is calculated, the following manner may be adopted: screening at least one first detection result with the category confidence coefficient not greater than a preset confidence coefficient threshold value from the first detection results based on the category confidence coefficient contained in each obtained first detection result; and obtaining a tagged video loss based on the at least one first detection result.
Specifically, a threshold η_t is set at training time t, with 1/K ≤ η_t ≤ 1, where K is the number of categories. When the category confidence calculated for a certain piece of tagged data is greater than the threshold η_t, that tagged data is removed from the loss calculation, and the loss is calculated only over the remaining data in the current training batch.
For example, K = 2 (core picture or not); at training time t a threshold η_t is set with 1/2 ≤ η_t ≤ 1. Assuming η_t = 0.75, the first detection results whose category confidence is not greater than 0.75 are selected based on the category confidence contained in each obtained first detection result. If the first detection results include first detection result 1 to first detection result 126, and the category confidences of first detection result 1 and first detection result 2 are greater than 0.75, then the tagged video loss is calculated using formula (1) based on first detection result 3 to first detection result 126.
In the joint training of supervised and unsupervised data, the model can quickly overfit the labeled training data because there is relatively little supervised data; the slow-release signal is proposed to prevent this rapid overfitting of the labeled data.
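A sketch of the slow-release filtering, assuming that each first detection result is represented as a dict carrying its category confidence (the key name and example values are illustrative):

    # category confidences of the first detection results of one training batch (illustrative values)
    first_results = [{"class_conf": 0.92}, {"class_conf": 0.60}, {"class_conf": 0.75}, {"class_conf": 0.81}]

    def slow_release_filter(results, eta_t):
        """Remove over-confident labeled results (category confidence > eta_t) before computing the loss."""
        return [r for r in results if r["class_conf"] <= eta_t]

    kept = slow_release_filter(first_results, eta_t=0.75)  # with K = 2 classes, 1/K <= eta_t <= 1
    # The tagged video loss of formula (1) is then computed only over `kept`.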
S1402, respectively inputting a plurality of enhanced videos into an initial image detection model to obtain corresponding second detection results, and obtaining the label-free video loss based on the obtained second detection results.
In the embodiment of the application, for each of the plurality of unlabeled videos and enhanced videos, a corresponding second detection result can be obtained by using the initial image detection model. The process of obtaining the second detection result of an unlabeled video (or enhanced video) by using the initial image detection model is similar to the process of obtaining the target detection area of the video to be detected by using the target image detection model above, and is not described herein again.
In the embodiment of the application, the model learning target is set as consistency prediction. Consistency prediction is one of the main methods for extracting signals from unlabeled data in semi-supervised learning: after the data is perturbed, the model should still be able to judge the data accurately.
Consistency prediction specifically means that, for unlabeled data H and its massive, easily acquired enhanced data H', the objective function forces the neural network to make consistent predictions for data H and data H', i.e. the prediction distributions of the model for the two should be consistent. Consistency prediction is equivalent to setting a goal for the generalization ability of the model and guiding the model toward this goal with a large amount of unlabeled data information.
Specifically, the unlabeled video loss may be calculated as follows:
determining sample sub-losses corresponding to the enhanced video pairs based on second detection results of the unlabeled video and the enhanced video contained in the enhanced video pairs; and then, obtaining unlabeled video loss based on the sample sub-loss corresponding to each of the plurality of enhanced video pairs.
That is, according to the second detection results of the unlabeled video and the enhanced video in each pair of enhanced videos, the sample sub-losses corresponding to each pair of enhanced videos are calculated respectively, and then the sample sub-losses are summarized, so that the unlabeled video losses are obtained.
Illustratively, referring to equation (2), the sample sub-loss may be calculated using a mean square error (Mean Squared Error, MSE) function:
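A sketch of equation (2), reconstructed from the description below; whether the per-pair sub-losses are averaged over n or simply summed is an assumption:

    U_\theta = \frac{1}{n}\sum_{i=1}^{n}\bigl\| p_\theta(u_i) - p_\theta(u'_i) \bigr\|_2^{2}

where each summand is the sample sub-loss of the i-th enhanced video pair, i.e. the sum of squares of the point-to-point difference of the two model outputs.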
Here u represents an unlabeled video among the several enhanced video pairs, u' represents the corresponding enhanced video, p_θ(u_i) represents the detection result output by the model for the unlabeled video in the i-th enhanced video pair, p_θ(u'_i) represents the detection result output by the model for the enhanced video in the i-th enhanced video pair, and the function p_θ represents the computation performed by the initial image detection model. The subtraction in the above formula represents a point-to-point subtraction of two 13x13x18 matrices, the square represents the sum of squares of all matrix entries after subtraction, i is the data index, and n is the number of enhanced video pairs in the current batch.
In some embodiments, it is considered that when there is little tagged data, the model's knowledge of the samples is insufficient, and the prediction distribution on the untagged data may be very flat; in that case, when the loss is calculated, the main contribution comes from the tagged data, which runs counter to the idea of using untagged data, since a richer data distribution is more beneficial to model training. Therefore, in the embodiment of the present application, a signal sharpening strategy is proposed. Signal sharpening strategies include, but are not limited to, at least one of the following three implementations:
Implementation 1, confidence-based masking. For unlabeled videos with a poor prediction effect, i.e. unlabeled videos whose predicted category confidence is less than a certain threshold, the consistency prediction loss is not calculated (the sample sub-loss of such an unlabeled video and its corresponding enhanced video is not counted into the unlabeled video loss).
Implementation 2, minimizing entropy. Minimizing entropy requires the prediction on the enhanced data to have a lower entropy, so an entropy loss needs to be calculated in addition to the joint loss. Illustratively, the binary cross entropy BinaryCross can be expressed using the following formula:
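In its standard form (a sketch; P̂ denotes the predicted category confidence defined below):

    \mathrm{BinaryCross}(P,\hat{P}) = -\bigl[\, P\log\hat{P} + (1-P)\log(1-\hat{P}) \,\bigr]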
where P represents the true category confidence of the enhanced video and P̂ represents the predicted category confidence of the enhanced video.
Implementation 3, Softmax control. By adjusting the Softmax output, the category confidence is calculated as Softmax(l(X)/τ), where l(X) represents the class output described above and τ represents the temperature; the smaller τ is, the sharper the distribution.
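A minimal sketch of the Softmax temperature control (the τ values and the two-class example input are illustrative):

    import numpy as np

    def sharpened_confidence(class_output, tau=0.5):
        """Compute category confidences as Softmax(l(X)/tau); a smaller tau gives a sharper distribution."""
        z = np.asarray(class_output, dtype=np.float64) / tau
        z = z - z.max()          # subtract the maximum for numerical stability
        e = np.exp(z)
        return e / e.sum()

    print(sharpened_confidence([2.0, 1.0], tau=1.0))  # softer distribution
    print(sharpened_confidence([2.0, 1.0], tau=0.5))  # sharper distribution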
S1403, obtaining a joint loss based on the tagged video loss and the untagged video loss.
Exemplarily, the joint loss L_θ(y) can be expressed by formula (3):
L_θ(y) = Loss_N + λU_θ    formula (3)
where λ is a parameter controlling the ratio between the tagged video loss Loss_N and the untagged video loss U_θ.
S1404, judging whether the initial image detection model meets the convergence condition.
In the embodiment of the present application, the convergence condition may include at least one of the following conditions:
(1) The joint loss of the current round is smaller than the joint loss of the previous round.
(2) The joint loss is not greater than a preset loss value threshold.
(3) The iteration number reaches a preset number upper limit value.
If the determination result in S1404 is no, that is, the initial image detection model does not meet the convergence condition, the model parameters are adjusted based on the joint loss, and the adjusted initial image detection model enters the next round of training; if the initial image detection model meets the convergence condition, training ends.
In some embodiments, after the joint loss is obtained, the gradient to be back-propagated may be calculated based on the joint loss. The optimizer used for training may be SGD with momentum.
During training, a step-decay learning rate schedule is adopted; for example, the initial learning rate is 0.001 and the learning rate is multiplied by 0.96 every 5 epochs.
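Putting S1401 to S1404 together, one semi-supervised training batch could be sketched as follows; the placeholder model, the tensor shapes, the value of λ and the simplified labeled loss (a plain squared error standing in for formula (1)) are assumptions:

    import torch
    import torch.nn as nn

    model = nn.Conv2d(3, 18, kernel_size=1)   # placeholder standing in for the initial image detection model
    optimizer = torch.optim.SGD(model.parameters(), lr=0.001, momentum=0.9)
    lam = 1.0                                 # the λ of formula (3); the value is illustrative

    def semi_supervised_step(labeled_x, labeled_y, u, u_prime):
        """One semi-supervised batch (S1401-S1404); the labeled loss is a simplified stand-in for formula (1)."""
        pred = model(labeled_x)                                    # S1401: first detection results
        labeled_loss = ((pred - labeled_y) ** 2).mean()            # stand-in for the tagged video loss
        consistency = ((model(u) - model(u_prime)) ** 2).mean()    # S1402: consistency loss of equation (2)
        joint = labeled_loss + lam * consistency                   # S1403: joint loss of formula (3)
        optimizer.zero_grad()
        joint.backward()                                           # S1404: adjust parameters via the joint loss
        optimizer.step()
        return joint.item()

    x = torch.randn(2, 3, 416, 416); y = torch.randn(2, 18, 416, 416)   # illustrative shapes
    u = torch.randn(2, 3, 416, 416); u_prime = u + 0.01 * torch.randn_like(u)
    semi_supervised_step(x, y, u, u_prime)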
In some embodiments, in order to improve training efficiency, before the semi-supervised training is performed, the initial image detection model may first be trained with the labeled videos, and the trained initial image detection model then participates in the semi-supervised training process shown in fig. 14. Since the initial image detection model has already converged after the supervised training, the semi-supervised training process performs fine-tuning learning, so that the model can learn features from a large number of unlabeled videos and the target image detection model obtained by training is more accurate.
The application will now be described with reference to a specific example.
Example 1: scenes of objects are detected with subtitles as targets.
In the data acquisition stage of model training, tagged videos and unlabeled videos are acquired, and data enhancement is performed on the unlabeled videos to obtain corresponding enhanced videos, thereby obtaining enhanced data pairs. Each tagged video is marked with its subtitle area, and the position information of the subtitle area is represented by the upper left corner coordinate and the lower right corner coordinate of the subtitle area.
In the model training process, the supervised training process shown in fig. 12 is first used to train the initial image detection model with some or all of the acquired tagged videos, and then the trained initial image detection model participates in the semi-supervised training process shown in fig. 14 to obtain the target image detection model. The detection result output by the model is a 13x13x18 three-dimensional matrix, where 13x13 means that the format of the output feature map is 13x13, and 18 means that each pixel in the feature map corresponds to 18 values, which comprise: 4 position parameters for each of the 3 anchor frames, a target confidence for each of the 3 anchor frames, and a category confidence for each of the 3 anchor frames, where each category confidence is used to represent the probability that the target detection object is a subtitle.
In the model application stage, following the image detection flow shown in fig. 2, the trained target image detection model is utilized to obtain the subtitle region of the video to be detected.
Example 2: scenes of objects are detected with subtitles and station marks as targets.
In the data acquisition stage of model training, tagged videos and unlabeled videos are acquired, and data enhancement is performed on the unlabeled videos to obtain corresponding enhanced videos, thereby obtaining enhanced data pairs. Each tagged video is marked with the position information of its subtitle area and the position information of its station logo icon area, where the position information of the subtitle area is represented by the upper left corner coordinate and the lower right corner coordinate of the subtitle area, and the position information of the station logo icon area is represented by the upper left corner coordinate and the lower right corner coordinate of the station logo icon area.
In the model training process, first, the supervised training process shown in fig. 12 is used to train an initial image detection model by using some or all of the acquired tagged videos, and then the trained initial image detection model is used to participate in the semi-supervised training process shown in fig. 14 to obtain a target image detection model.
The detection result output by the model is a 13x13x21 three-dimensional matrix, where 13x13 means that the format of the output feature map is 13x13, and 21 means that each pixel in the feature map corresponds to 21 values, which comprise: 4 position parameters for each of the 3 anchor frames, a target confidence for each of the 3 anchor frames, and 2 category confidences for each of the 3 anchor frames, where one category confidence is used to represent the probability that the target detection object is a subtitle and the other is used to represent the probability that the target detection object is a station logo.
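As an illustrative sketch of how the 13x13x21 output described above could be unpacked per feature-map cell, under the assumption that each anchor's 7 values are stored contiguously as (4 position parameters, target confidence, 2 category confidences):

    import numpy as np

    def unpack_cell(cell_values):
        """Split the 21 values of one feature-map cell into 3 anchors of
        (4 position parameters, target confidence, 2 category confidences)."""
        anchors = []
        for a in np.asarray(cell_values, dtype=np.float32).reshape(3, 7):  # assumed per-anchor grouping
            box, target_conf, class_conf = a[:4], a[4], a[5:7]             # class_conf: [subtitle, station logo]
            anchors.append({"box": box, "target_conf": float(target_conf), "class_conf": class_conf})
        return anchors

    output = np.random.rand(13, 13, 21)        # stand-in for the 13x13x21 detection result
    cell_anchors = unpack_cell(output[2, 7])   # the 3 anchors of one feature-map cell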
In the model application stage, following the image detection flow shown in fig. 2, the trained target image detection model is utilized to obtain the position information of the subtitle region and the station logo icon region of the video to be detected.
Referring to fig. 15, a flowchart of another image detection method provided in an embodiment of the present application is shown, where the method may be applied to a terminal device or a server, and the specific flowchart is as follows:
S1501, acquiring a video to be detected, wherein the video to be detected comprises original video frames. Illustratively, the acquired video to be detected may be, but is not limited to, in MP4 format.
S1502, extracting each reference frame from each original video frame according to a first frame extraction mode. Referring specifically to S201, a detailed description thereof is omitted.
S1503, extracting each time sequence change frame from each original video frame according to a second frame extraction mode, obtaining distribution information according to the value change condition of each pixel in each time sequence change frame, and constructing at least one time sequence information frame by combining the positions of each pixel based on the distribution information; wherein the distribution information is used to characterize: at least one pixel with changed value in each pixel. Referring specifically to S202, the description thereof is omitted.
S1504, extracting each edge detection frame from each original video frame according to a third frame extraction mode, respectively carrying out edge detection on each edge detection frame according to each pixel value contained in each edge detection frame to obtain corresponding edge information, and taking each obtained edge information as each edge information frame. See S203 specifically, and will not be described in detail here.
S1505, inputting each reference frame, at least one time sequence information frame and each obtained edge information frame into the target image detection model to obtain a detection result, wherein the detection result contains the position information of the target detection area in the video to be detected. Referring specifically to S204, the description thereof is omitted.
Referring to fig. 16, first, a to-be-detected video in MP4 format is acquired, where the to-be-detected video includes original video frames.
Next, video input features are constructed, which are divided into three types; the three video input features can be obtained by performing video parsing on the input video to be detected. Specifically, according to a first frame extraction mode, each reference frame is extracted from each original video frame; according to a second frame extraction mode, each time sequence change frame is extracted from each original video frame, distribution information is obtained according to the value change condition of each pixel in each time sequence change frame, and at least one time sequence information frame is constructed by combining the positions of each pixel based on the distribution information; and according to a third frame extraction mode, each edge detection frame is extracted from each original video frame, edge detection is performed on each edge detection frame according to each pixel value contained therein to obtain corresponding edge information, and each obtained edge information is taken as each edge information frame. The extraction order among the reference frames, the time sequence change frames and the edge information frames is not limited.
And finally, inputting each extracted reference frame, at least one time sequence information frame and each obtained edge information frame into a target image detection model to obtain a detection result, wherein the detection result comprises the position information of a target detection area in the video to be detected, namely the detection result comprises the parameter coordinates of a core picture area of the video to be detected.
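As an illustrative sketch only, the construction of the three kinds of input features could look as follows; the sampling step, the per-pixel variance and the gradient-magnitude edge map are simplified stand-ins for the first, second and third frame extraction manners described above, not the embodiments' actual operations:

    import numpy as np

    def extract_reference_frames(frames, step=10):
        """Stand-in for the first frame extraction mode: sample one reference frame every `step` frames."""
        return frames[::step]

    def build_timing_info_frame(frames):
        """Stand-in for the distribution information: per-pixel variance over the time sequence change frames."""
        return np.stack(frames).astype(np.float32).var(axis=0)

    def build_edge_info_frame(frame):
        """Stand-in for the edge information frame: a simple gradient-magnitude edge map."""
        f = frame.astype(np.float32)
        gx = np.abs(np.diff(f, axis=1, prepend=f[:, :1]))
        gy = np.abs(np.diff(f, axis=0, prepend=f[:1, :]))
        return gx + gy

    frames = [np.random.randint(0, 256, (800, 400), dtype=np.uint8) for _ in range(30)]
    reference_frames = extract_reference_frames(frames)
    timing_info_frame = build_timing_info_frame(frames)
    edge_info_frames = [build_edge_info_frame(f) for f in reference_frames]
    # These three kinds of inputs are then fed to the target image detection model (S1505).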
Based on the same inventive concept, an embodiment of the present application provides an image detection apparatus. As shown in fig. 17, which is a schematic structural diagram of the image detection apparatus 1700, may include:
the original information extraction unit 1701 is configured to extract, according to a first frame extraction manner, each reference frame from each original video frame included in the video to be detected;
the time sequence information extraction unit 1702 is configured to extract each time sequence change frame from each original video frame according to a second frame extraction manner, and obtain distribution information according to a value change condition of each pixel in each time sequence change frame, where the distribution information is used for representing: at least one pixel with a value changed in each pixel;
an edge information extraction unit 1703, configured to extract each edge detection frame from each original video frame according to a third frame extraction manner, and perform edge detection on each edge detection frame according to each pixel value included in each edge detection frame, so as to obtain corresponding edge information;
And a region detection unit 1704 for obtaining a target detection region in the video to be detected based on the reference frames, the distribution information, and the obtained edge information.
As a possible implementation manner, when the target detection area is obtained in the video to be detected based on the reference frames, the distribution information and the obtained edge information, the area detection unit 1704 is specifically configured to:
based on the distribution information, combining the positions of the pixels in the edge detection frame to construct at least one time sequence information frame, and taking the obtained edge information as each edge information frame;
and inputting the reference frames, the at least one time sequence information frame and the obtained edge information frames into a target image detection model to obtain a target detection area in the video to be detected.
As one possible implementation manner, the target image detection model includes an original feature extraction network, a time sequence feature extraction network, an edge feature extraction network and a detection network;
the area detecting unit 1704 is specifically configured to, when inputting the reference frames, the at least one timing information frame, and the obtained edge information frames into a target image detection model to obtain a target detection area in the video to be detected:
Inputting each reference frame into the original feature extraction network to obtain the original features corresponding to each reference frame;
respectively inputting the at least one time sequence information frame into the time sequence feature extraction network to obtain corresponding time sequence features;
inputting each edge information frame into the edge feature extraction network respectively to obtain corresponding edge features;
and inputting the obtained original features, at least one time sequence feature and edge features into the detection network to obtain a target detection area in the video to be detected.
As a possible implementation manner, when the obtained original features, at least one time sequence feature, and edge features are input to the detection network, and a target detection area in the video to be detected is obtained, the area detection unit 1704 is specifically configured to:
performing feature stitching on each original feature, at least one time sequence feature and each edge feature through the detection network to obtain a global feature, and performing convolution processing on the global feature at least once to obtain a target detection result;
and obtaining a target detection area in the video to be detected based on the target detection result.
As a possible implementation manner, when the reference frames are input to the original feature extraction network to obtain the original features corresponding to the reference frames, the region detection unit 1704 is specifically configured to:
respectively inputting the reference frames into a convolution component in the original feature extraction network to carry out convolution processing to obtain a corresponding reference frame convolution result, wherein the convolution component comprises a convolution layer, a normalization layer and an activation function;
and based on the obtained convolution result of each reference frame, obtaining the original characteristic corresponding to each reference frame by utilizing at least one residual error component in the original characteristic extraction network.
As a possible implementation manner, when the target detection area in the video to be detected is obtained based on the target detection result, the area detection unit 1704 is specifically configured to:
for each category of target detection object, the following operations are respectively executed:
based on the target detection result, aiming at the current target detection object, screening at least one anchor frame meeting the preset screening condition from the anchor frames according to the category confidence degrees corresponding to the anchor frames;
and selecting a target detection area meeting preset selection conditions from the selected at least one anchor frame based on the target confidence corresponding to each selected at least one anchor frame in the target detection result.
As a possible implementation manner, the training unit 1705 is further included, where the training unit 1705 is configured to:
obtaining a training video set, the training video set comprising: each tagged video and each enhanced video pair, each enhanced video pair comprising an untagged video and its corresponding enhanced video;
based on a training video set, performing iterative training on an initial image detection model to obtain the target detection model, wherein in each iterative process, the following operations are performed:
respectively inputting a plurality of tagged videos into the initial image detection model to obtain corresponding first detection results, and obtaining tagged video losses based on the obtained first detection results;
respectively inputting a plurality of enhanced videos into an initial image detection model to obtain corresponding second detection results, and obtaining label-free video loss based on the obtained second detection results;
and obtaining joint loss based on the tagged video loss and the untagged video loss, and performing model parameter adjustment based on the joint loss.
As a possible implementation manner, the training unit 1705 is specifically configured to generate an enhanced video corresponding to the unlabeled video by at least one of the following operations:
Based on the target adjustment parameters, adjusting the image parameters of each sample video frame contained in the one unlabeled video to obtain an enhanced video corresponding to the one unlabeled video;
performing time axis clipping on each sample video frame contained in one unlabeled video to obtain an enhanced video corresponding to the one unlabeled video;
and adding Gaussian noise into each sample video frame contained in one unlabeled video to obtain an enhanced video corresponding to the one unlabeled video.
As a possible implementation manner, each first detection result includes a category confidence, where the category confidence is used to characterize the probability that the corresponding tagged video exists in the target detection area;
the training unit 1705 is specifically configured to, when the tagged video loss is obtained based on the obtained first detection results:
screening at least one first detection result with the category confidence coefficient not greater than a preset confidence coefficient threshold value from the first detection results based on the category confidence coefficient contained in each obtained first detection result;
and obtaining the tagged video loss based on the at least one first detection result.
As a possible implementation manner, the plurality of enhanced videos are respectively input into the initial image detection model, a corresponding second detection result is obtained, and when obtaining the label-free video loss based on each obtained second detection result, the training unit 1705 is specifically configured to:
Determining sample sub-losses corresponding to the enhanced video pairs based on second detection results of the unlabeled video and the enhanced video contained in the enhanced video pairs;
and obtaining the label-free video loss based on the sample sub-loss corresponding to each of the plurality of enhancement video pairs.
As a possible implementation manner, when obtaining the distribution information according to the value change condition of each pixel in each time sequence change frame, the training unit 1705 is specifically configured to perform at least one of the following operations:
based on the value of each pixel in each time sequence change frame, obtaining the average value of the pixel values corresponding to each pixel, and taking the average value of the pixel values corresponding to each pixel as the distribution information;
and obtaining pixel value variances corresponding to the pixels respectively based on the values of the pixels in the time sequence change frames respectively, and taking the pixel value variances corresponding to the pixels respectively as the distribution information.
As a possible implementation manner, after extracting each time-sequence variation frame from each original video frame according to the second frame extracting manner, the training unit 1705 is further configured to:
Scaling the time sequence change frames according to the set image resolution to obtain processed time sequence change frames;
and obtaining distribution information according to the value change condition of each pixel in each processed time sequence change frame.
As a possible implementation manner, based on the reference frames, the distribution information, and the obtained edge information, after obtaining the target detection area in the video to be detected, the area detection unit 1704 is further configured to:
extracting a target video from the video to be detected based on the target detection area, and determining the similarity between the target video and each reference video by combining the respective video characteristics of each reference video based on the video characteristics of the target video;
and if at least one similarity reaching a similarity threshold exists in the determined similarities, determining that the video to be detected is an abnormal video.
As a possible implementation manner, based on the reference frames, the distribution information, and the obtained edge information, after obtaining the target detection area in the video to be detected, the area detection unit 1704 is further configured to:
If the target detection area is a subtitle area, extracting a target subtitle from the video to be detected based on the target detection area;
if the target detection area is an identification area, extracting a target identification from the video to be detected based on the target detection area;
and if the target detection area is a core picture area, extracting a core picture from the video to be detected based on the target detection area.
For convenience of description, the above parts are described as being functionally divided into modules (or units) respectively. Of course, the functions of each module (or unit) may be implemented in the same piece or pieces of software or hardware when implementing the present application.
The specific manner in which the respective units execute the requests in the apparatus of the above embodiment has been described in detail in the embodiment concerning the method, and will not be described in detail here.
Those skilled in the art will appreciate that the various aspects of the application may be implemented as a system, method, or program product. Accordingly, aspects of the application may be embodied in the following forms, namely: an entirely hardware embodiment, an entirely software embodiment (including firmware, micro-code, etc.), or an embodiment combining hardware and software aspects, which may be referred to herein as a "circuit," "module," or "system."
Based on the same inventive concept, an embodiment of the present application provides another image detection apparatus. As shown in fig. 18, which is a schematic structural diagram of the image detection apparatus 1800, may include:
an input unit 1801, configured to obtain a video to be detected, where the video to be detected includes each original video frame;
the original information extraction unit 1802 is configured to extract, according to a first frame extraction manner, each reference frame from each original video frame;
the time sequence information extraction unit 1803 is configured to extract each time sequence change frame from each original video frame according to a second frame extraction manner, obtain distribution information according to a value change condition of each pixel in each time sequence change frame, and construct at least one time sequence information frame by combining a position of each pixel based on the distribution information; wherein the distribution information is used to characterize: at least one pixel with a value changed in each pixel;
an edge information extraction unit 1804, configured to extract each edge detection frame from each original video frame according to a third frame extraction manner, perform edge detection on each edge detection frame according to each pixel value included in each edge detection frame, obtain corresponding edge information, and use each obtained edge information as each edge information frame;
And a region detection unit 1805, configured to input the reference frames, the at least one timing information frame, and the obtained edge information frames into a target image detection model, and obtain a detection result, where the detection result includes position information of a target detection region in the video to be detected.
The input unit 1801, the original information extraction unit 1802, the timing information extraction unit 1803, the edge information extraction unit 1804, and the area detection unit 1805 cooperate with each other to realize the functions of the image detection apparatus in the above-described respective embodiments.
In the embodiment of the application, aiming at the characteristic distinction between the target detection area (such as a core picture) and the non-target detection area (such as a non-core picture) in the video, the reference frames, the distribution information and the edge information are extracted on the basis of parsing the video into frames, and image detection is performed using the extracted information. In this way, task features matched with the video detection task are designed, and the distinction between the target detection area and the non-target detection area can be effectively expressed, as far as possible, through the information extracted in addition to the original video input, thereby effectively reducing the learning difficulty of the subsequent image detection model, improving the capability of the image detection model, and further improving the detection efficiency and detection effect of image detection.
Furthermore, the core picture detection model designed and implemented based on deep learning can accurately detect the video core picture under the conditions of multiple scales, different blurring degrees and multiple kinds of edge decoration (black edges, frosted glass, edge mapping and edge-irrelevant video). The scheme uses a multi-branch pre-trained image recognition model as the feature extraction module of the model, and learns to recognize the distinction between various core and non-core picture areas in a video according to a multi-path anchor structure and a customized loss function, so as to detect the core picture area.
Furthermore, in the model training process, a semi-supervised training framework is provided for learning effective information from massive unknown video samples on the internet. This includes steps and techniques such as pseudo-label-based class balancing of unlabeled data, construction of similar video pairs, consistency prediction training, slow-release signals and information sharpening, which greatly reduce the dependence of the detection model on labeled data and further improve the recognition effect without requiring additional labeled data.
Based on the same inventive concept, the embodiment of the application also provides electronic equipment. In one embodiment, the electronic device may be a server or a terminal device. Referring to fig. 19, which is a schematic structural diagram of one possible electronic device provided in an embodiment of the present application, in fig. 19, an electronic device 1900 includes: a processor 1910, and a memory 1920.
The memory 1920 stores a computer program executable by the processor 1910, and the processor 1910 can execute the steps of the image detection method described above by executing instructions stored in the memory 1920.
The memory 1920 may be a volatile memory (volatile memory), such as a random-access memory (RAM); the memory 1920 may also be a non-volatile memory (non-volatile memory), such as a read-only memory (ROM), a flash memory (flash memory), a hard disk drive (HDD) or a solid-state drive (SSD); or the memory 1920 may be any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer, but is not limited thereto. The memory 1920 may also be a combination of the above memories.
The processor 1910 may include one or more central processing units (central processing unit, CPU) or digital processing units, etc. Processor 1910 is configured to implement the image detection method described above when executing the computer program stored in memory 1920.
In some embodiments, processor 1910 and memory 1920 may be implemented on the same chip, and in some embodiments, they may be implemented separately on separate chips.
The specific connection medium between the processor 1910 and the memory 1920 is not limited in the embodiments of the application. In the embodiment of the present application, the processor 1910 and the memory 1920 are connected by a bus, which is depicted in fig. 19 by a thick line; the connection manner between other components is merely schematically illustrated and is not limited thereto. The bus may be divided into an address bus, a data bus, a control bus, and the like. For ease of description, only one thick line is depicted in fig. 19, but this does not mean that there is only one bus or only one type of bus.
Based on the same inventive concept, an embodiment of the present application provides a computer readable storage medium comprising a computer program for causing an electronic device to perform the steps of the above-described image detection method when the computer program is run on the electronic device. In some possible embodiments, aspects of the image detection method provided by the present application may also be implemented in the form of a program product comprising a computer program for causing an electronic device to perform the steps of the image detection method described above, when the program product is run on the electronic device, e.g. the electronic device may perform the steps as shown in fig. 2 or 15.
The program product may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. The readable storage medium can be, for example, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium would include the following: an electrical connection having one or more wires, a portable disk, a hard disk, a RAM, a ROM, an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (Compact Disk Read Only Memory, CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
The program product of embodiments of the present application may take the form of a CD-ROM and comprise a computer program and may run on an electronic device. However, the program product of the present application is not limited thereto, and in this document, a readable storage medium may be any tangible medium that can contain, or store a computer program for use by or in connection with a command execution system, apparatus, or device.
The readable signal medium may comprise a data signal propagated in baseband or as part of a carrier wave in which a readable computer program is embodied. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A readable signal medium may also be any readable medium that is not a readable storage medium and that can communicate, propagate, or transport a computer program for use by or in connection with a command execution system, apparatus, or device.
While preferred embodiments of the present application have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. It is therefore intended that the following claims be interpreted as including the preferred embodiments and all such alterations and modifications as fall within the scope of the application.
It will be apparent to those skilled in the art that various modifications and variations can be made to the present application without departing from the spirit or scope of the application. Thus, it is intended that the present application also include such modifications and alterations insofar as they come within the scope of the appended claims or the equivalents thereof.