Detailed Description
Embodiments of the present application are described below with reference to the drawings. It should be understood that the embodiments described below are exemplary descriptions intended to explain the technical solutions of the embodiments of the present application, and do not limit those technical solutions.
As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless expressly stated otherwise, as understood by those skilled in the art. It will be further understood that the terms "comprises" and "comprising", when used in this application, specify the presence of stated features, information, data, steps, operations, elements, and/or components, but do not preclude the presence or addition of other features, information, data, steps, operations, elements, components, and/or groups thereof, all of which may be included in the present application. It will be understood that when an element is referred to as being "connected" or "coupled" to another element, it can be directly connected or coupled to the other element, or intervening elements may be present. Further, "connected" or "coupled" as used herein may include wirelessly connected or wirelessly coupled. The term "and/or" as used herein indicates at least one of the items it joins; e.g., "A and/or B" indicates implementation as "A", as "B", or as "A and B".
It will be appreciated that the specific embodiments of the present application involve data related to video detection. When the above embodiments of the present application are applied to specific products or technologies, user permission or consent needs to be obtained, and the collection, use and processing of the related data need to comply with the relevant laws, regulations and standards of the relevant countries and regions.
For the purpose of making the objects, technical solutions and advantages of the present application more apparent, the embodiments of the present application will be described in further detail below with reference to the accompanying drawings.
The embodiments of the present application provide a video detection method implemented by a video detection system, and relate to the technical fields of artificial intelligence and cloud technology. For example, the video detection referred to in the embodiments of the present application may utilize artificial intelligence to identify abnormal videos; for another example, the video detection in the embodiments of the present application may acquire, by using cloud technology, the feature sequence of each video to be detected in the video set to be detected from the cloud.
Cloud computing is a computing model that distributes computing tasks across a large pool of computers, enabling various application systems to acquire computing power, storage space, and information services as needed. The network that provides the resources is referred to as the "cloud". From the user's perspective, resources in the cloud can be expanded without limit, and can be acquired at any time, used as needed, expanded at any time, and paid for according to use.
As a basic capability provider of cloud computing, a cloud computing resource pool (cloud platform for short, generally referred to as an IaaS (Infrastructure as a Service) platform) is established, in which multiple types of virtual resources are deployed for external clients to select and use.
According to the division of logical functions, a PaaS (Platform as a Service) layer can be deployed on the IaaS (Infrastructure as a Service) layer, and a SaaS (Software as a Service) layer can be deployed above the PaaS layer; the SaaS layer can also be deployed directly on the IaaS layer. PaaS is a platform on which software runs, such as a database or a web container. SaaS is the wide variety of business software, such as web portals and SMS mass senders. Generally, SaaS and PaaS are upper layers relative to IaaS.
The artificial intelligence cloud service is also commonly referred to as AIaaS (AI as a Service). This is currently the mainstream service mode of artificial intelligence platforms; specifically, an AIaaS platform splits several common AI services and provides independent or packaged services in the cloud. This service mode is similar to an AI-themed mall: all developers can access one or more of the artificial intelligence services provided by the platform through an API interface, and some experienced developers can also use the AI framework and AI infrastructure provided by the platform to deploy, operate and maintain their own proprietary cloud artificial intelligence services.
Artificial intelligence (Artificial Intelligence, AI) comprises the theories, methods, techniques and application systems that use a digital computer, or a machine controlled by a digital computer, to simulate, extend and expand human intelligence, sense the environment, acquire knowledge, and use the knowledge to obtain optimal results. In other words, artificial intelligence is an integrated technology of computer science that attempts to understand the essence of intelligence and to produce new intelligent machines that can react in a way similar to human intelligence. Artificial intelligence is the study of the design principles and implementation methods of various intelligent machines, enabling the machines to have the functions of sensing, reasoning and decision-making.
Artificial intelligence technology is a comprehensive subject covering a wide range of fields, including both hardware-level and software-level technologies. Artificial intelligence infrastructure technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, mechatronics, and the like. Artificial intelligence software technologies mainly include computer vision, speech processing, natural language processing, machine learning/deep learning, automatic driving, intelligent traffic, and other directions.
Computer Vision (CV) is the science of studying how to make a machine "see"; more specifically, it replaces human eyes with cameras and computers to perform machine vision tasks such as recognizing and measuring targets, and further performs graphic processing so that the computer produces images more suitable for human observation or for transmission to instruments for detection. As a scientific discipline, computer vision studies the theories and technologies needed to build artificial intelligence systems that can acquire information from images or multidimensional data. Computer vision techniques typically include image processing, image recognition, image semantic understanding, image retrieval, OCR, video processing, video semantic understanding, video content/behavior recognition, three-dimensional object reconstruction, 3D techniques, virtual reality, augmented reality, simultaneous localization and mapping, automatic driving, intelligent transportation, and the like.
Machine Learning (ML) is a multi-field interdisciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory, and other disciplines. It specially studies how a computer simulates or implements human learning behavior to acquire new knowledge or skills, and reorganizes existing knowledge structures to continuously improve its own performance. Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent, and it is applied throughout the various areas of artificial intelligence. Machine learning and deep learning typically include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and learning from instruction.
In order to better understand and illustrate the embodiments of the present application, some technical terms related to the embodiments of the present application are briefly described below.
Yolact: a model that uses a simple fully convolutional network to achieve fast instance segmentation.
BTS: a model that uses scale pyramids and local planar guidance layers to implement depth map prediction.
MITSemseg: short for a maskRCNN-based model used for semantic segmentation.
Embedding: Embedding may be referred to as manifold embedding, and may also be referred to as feature vector embedding; high-dimensional, discrete vectors are reduced into a low-dimensional semantic space in which the original information is preserved to the greatest extent.
TSNE: the t-SNE (t-distributed Stochastic Neighbor Embedding) algorithm is an algorithm for reducing the dimension of high-dimensional vectors, which outputs two-dimensional feature embedding vectors.
Median filtering: median filtering is a nonlinear smoothing technique that sets the gray value of each pixel point to the median of the gray values of all pixel points within a neighborhood window of that point; the value of a point in a digital image or digital sequence is replaced by the median of the values of the points in its neighborhood, so that the surrounding pixel values are brought close to the true value and isolated noise points are eliminated. Median filtering is a classical method of smoothing noise that is commonly used in image processing because it protects edge information.
Median test: the median test is a nonparametric test; it is a hypothesis test used to infer, from two randomly sampled samples, whether there is a significant difference between the populations. The principle of the median test is: if the two samples are taken from the same population, the numbers of data points of the two samples lying above (or below) the median of the combined ordering show no significant difference; if the two samples are taken from different populations, there will be a significant difference.
K-nearest neighbor algorithm: the K Nearest Neighbor (KNN) classification algorithm is one of the simplest methods in data mining classification technology; "K nearest neighbors" means that each sample can be represented by its K closest neighbors. The KNN algorithm classifies each record in the data set according to its neighbors.
Principal component analysis: principal component analysis (Principal Component Analysis, PCA) is a linear dimension reduction method that is widely applied in fields such as image processing, face recognition, data compression, and signal denoising.
Random projection: random projection (Random Projection) is a nonlinear dimension reduction method in which a set of points in a high-dimensional Euclidean space is mapped into a low-dimensional space such that the relative distances between the points are preserved within a certain error range.
RGB: the RGB color mode is an industrial color standard in which a wide range of colors is obtained by varying the three color channels of red (R), green (G), and blue (B) and superimposing them on each other; RGB represents the colors of the red, green, and blue channels.
The solutions provided in the embodiments of the present application relate to artificial intelligence; the technical solutions of the present application and how they solve the above technical problems are described in detail below through specific embodiments. The following embodiments may be combined with each other, and the same or similar concepts or processes may not be described repeatedly in some embodiments. Embodiments of the present application will be described below with reference to the accompanying drawings.
The scheme provided by the embodiment of the application can be applied to any application scene needing video detection in the field of artificial intelligence.
In order to better understand the scheme provided by the embodiment of the present application, the scheme is described below in connection with a specific application scenario.
In one embodiment, fig. 1 shows a schematic architecture of a video detection system to which the embodiment of the present application is applicable, and it can be understood that the video detection method provided in the embodiment of the present application may be applicable, but not limited to, to an application scenario as shown in fig. 1.
In this example, as shown in fig. 1, the architecture of the video detection system in this example may include, but is not limited to, a server 10 and a terminal 20; wherein, the server 10 may call the object relation extraction and calculation module 111, the abnormality discrimination module 112, the abnormality deduplication module 113 and the abnormality text report generation module 114 in the server 10; the terminal 20 and the server 10 may interact with each other via a network. The terminal 20 transmits each video to be detected in the video set to be detected to the server 10; the server 10 obtains the feature sequence of each video to be detected in the video set to be detected by calling the object relation extracting and calculating module 111, wherein the feature sequence of each video to be detected is used for representing the relation between at least two objects in each video to be detected; the server 10 determines at least two abnormal videos from the video set to be detected based on the feature sequence of each video to be detected in the video set to be detected by calling the abnormal judging module 112; the server 10 performs de-duplication processing on at least two abnormal videos by calling an abnormal de-duplication module 113 to determine an abnormal video set, wherein the abnormal types of the abnormal videos in the abnormal video set are different; the server 10 generates an exception report based on the exception video collections by invoking the exception text report generation module 114.
It will be appreciated that the above is only an example, and the present embodiment is not limited thereto.
The terminal may be a smart phone (such as an Android phone or an iOS phone), a phone simulator, a tablet computer, a notebook computer, a digital broadcast receiver, an MID (Mobile Internet Device), a PDA (Personal Digital Assistant), a vehicle-mounted terminal (such as a vehicle-mounted navigation terminal), a smart speaker, a smart watch, etc. The server may be an independent physical server, a server cluster or distributed system formed by a plurality of physical servers, or a cloud server or server cluster providing cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDNs (Content Delivery Networks), and basic cloud computing services such as big data and artificial intelligence platforms. The network may include, but is not limited to: a wired network or a wireless network, where the wired network includes local area networks, metropolitan area networks, and wide area networks, and the wireless network includes Bluetooth, Wi-Fi, and other networks implementing wireless communication. The specific form may be determined based on the requirements of the actual application scenario and is not limited herein.
Referring to fig. 2, fig. 2 is a schematic flow chart of a video detection method provided in the embodiments of the present application, where the method may be performed by any electronic device, for example, a server, and as an optional implementation manner, the method may be performed by the server, and for convenience of description, in the following description of some optional embodiments, a description will be given taking the server as an implementation subject of the method. As shown in fig. 2, the video detection method provided in the embodiment of the present application includes the following steps:
S201, obtaining feature sequences of the videos to be detected in a video set to be detected, wherein the feature sequence of each video to be detected is used for characterizing the relationship between at least two objects in that video to be detected.
In particular, the video to be detected may be represented as a tensor $I_{1:N}$, wherein N is the number of frames of the video to be detected, $I_{1:N} \in \mathbb{R}^{H \times W \times C \times N}$, H, W and C respectively represent the height, the width and the number of channels, $\mathbb{R}$ represents the real number set, and $\mathbb{R}^{H \times W \times C \times N}$ represents the set of H×W×C×N-dimensional tensors. The feature sequence of the video to be detected may be a feature sequence of each video frame (image frame) in the video to be detected, and the feature sequence of each video frame may be used to characterize the relationship between at least two objects in that video frame; for example, if two objects in a video frame are a person and a car, the feature sequence of the video frame may be used to characterize the relationship between the person and the car.
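As a purely illustrative sketch of this tensor notation (the 480×640×3 frame size matches the corrected frames described in step A2 below; the frame count N = 120 is an arbitrary assumption):

```python
import numpy as np

# Illustrative only: a video to be detected as an H x W x C x N tensor I_{1:N}.
H, W, C, N = 480, 640, 3, 120  # N = 120 is an assumed frame count
video = np.zeros((H, W, C, N), dtype=np.float32)  # I_{1:N}
frame_1 = video[..., 0]                           # the first video frame I_1
```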
S202, determining at least two abnormal videos from the video set to be detected based on the characteristic sequences of the videos to be detected in the video set to be detected.
Specifically, the feature sequence of the video to be detected may be a feature sequence of each video frame in the video to be detected. When the video frame is determined to be abnormal based on the feature sequence of one video frame in the video to be detected, the video to be detected can be determined to be abnormal.
S203, performing de-duplication processing on at least two abnormal videos, and determining an abnormal video set, wherein the abnormal types of the abnormal videos in the abnormal video set are different from each other.
Specifically, when the anomaly types of two abnormal videos are the same, one of the two abnormal videos is screened out, and the other of the two abnormal videos is attributed to the abnormal video set.
In the embodiment of the application, the characteristic sequences of all videos to be detected in the video set to be detected are obtained, and the characteristic sequences of all videos to be detected are used for representing the relation between at least two objects in all videos to be detected; determining at least two abnormal videos from the video set to be detected based on the characteristic sequences of the videos to be detected in the video set to be detected; therefore, video anomaly detection is performed based on the relation between at least two objects in the video to be detected, and the accuracy of detecting video anomalies is improved. Performing de-duplication processing on at least two abnormal videos, and determining an abnormal video set, wherein the types of the abnormalities in the abnormal video set are different from each other; in this way, the content of the pushed abnormal video has a lower repetition rate by performing the de-duplication processing on the abnormal video with similar content.
In one embodiment, obtaining a feature sequence of each video to be detected in a video set to be detected includes steps A1-A3:
Step A1, median filtering is performed on the video frames in each video to be detected to obtain a denoised first video frame.
Specifically, median filtering is performed on each video frame $I_i$ in the video to be detected to filter out noise, and the denoised first video frame $\hat{I}_i$ is obtained by formula (1):

$\hat{I}_i(x, y) = \operatorname{median}\left\{ I_i(x + l_1,\ y + l_2) \;\middle|\; l_1, l_2 \in \mathbb{Z},\ |l_1| \le h,\ |l_2| \le h \right\}$  (1)

wherein x and y respectively represent the abscissa and the ordinate; $l_1$ represents the abscissa offset; $l_2$ represents the ordinate offset; h represents the window size; and $\mathbb{Z}$ represents the integer set. For points at the edges of the video frame, the filtering may be performed using only the points that lie within the image.
Step A2, size correction processing is performed on the first video frame to obtain a corrected second video frame.
Specifically, a size correction (resizing) process is performed on the first video frame $\hat{I}_i$; for example, a transform T is applied so that the second video frame $\tilde{I}_i = T(\hat{I}_i) \in \mathbb{R}^{480 \times 640 \times 3}$, i.e., the second video frame is an RGB image with a height of 480, a width of 640, and 3 channels.
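As an illustrative sketch of steps A1-A2 (not the claimed implementation), the median filtering and size correction can be expressed with OpenCV; the window size of 5 is an assumption:

```python
import cv2

def preprocess_frame(frame, ksize=5, size=(640, 480)):
    # Step A1: median filtering to remove isolated noise points;
    # ksize is the (assumed) window size and must be odd for OpenCV.
    denoised = cv2.medianBlur(frame, ksize)
    # Step A2: size correction so that the second video frame is a
    # 480 x 640 x 3 RGB image; cv2.resize takes (width, height).
    return cv2.resize(denoised, size)
```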
Step A3, a feature sequence of the video frames in each video to be detected is determined based on the second video frame, wherein the feature sequence of a video frame includes at least one of: an index of the video frame, a distance between objects in the video frame, an absolute pitch angle corresponding to the video frame, a horizontal angle corresponding to the video frame, a size of each object in the video frame, and an identification score of each object in the video frame.
In one embodiment, the feature sequence of a video frame may be represented by $block_i$; for example, $block_i$ is as follows:

$block_i = (index_i,\ d_{i,(person,car)},\ \theta_i,\ \phi_i,\ S_{person,i},\ S_{car,i},\ Score_{person,i},\ Score_{car,i})$

wherein the relationship between objects (e.g., people, cars) in the k-th video (containing $M_k$ frames) can be characterized by a block sequence, denoted $Block_k = \{block_i\}_{i=1}^{M_k}$; $index_i$ represents the index of the i-th video frame; $d_{i,(person,car)}$ represents the distance between the person and the car in the i-th video frame; $\theta_i$ represents the absolute pitch angle; $\phi_i$ represents the horizontal angle; $S_{person,i}$ represents the size of the person in the i-th frame; $S_{car,i}$ represents the size of the car in the i-th frame; $Score_{person,i}$ represents the identification score of the person in the i-th frame; and $Score_{car,i}$ represents the identification score of the car in the i-th frame.
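For illustration only, the per-frame feature block above can be organized as follows; the field names simply mirror the notation of $block_i$ and are not mandated by the method:

```python
from dataclasses import dataclass

@dataclass
class Block:
    """One block_i of the feature sequence Block_k."""
    index: int            # index_i: index of the i-th video frame
    d_person_car: float   # d_{i,(person,car)}: person-car distance
    theta: float          # theta_i: absolute pitch angle
    phi: float            # phi_i: horizontal angle
    s_person: float       # S_{person,i}: size of the person
    s_car: float          # S_{car,i}: size of the car
    score_person: float   # Score_{person,i}: identification score of the person
    score_car: float      # Score_{car,i}: identification score of the car
```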
In one embodiment, based on the second video frame, determining the absolute pitch angle corresponding to the video frame includes steps B1-B3:
Step B1, the second video frame is input into a preset image depth estimation function, and depth estimation processing is performed to obtain a depth map corresponding to the second video frame.
Specifically, the depth map $D_i \in \mathbb{R}^{480 \times 640}$ of the second video frame $\tilde{I}_i$ is estimated by BTS (an image depth estimation model), as shown in formula (2):

$D_i = \mathrm{BTS}(\tilde{I}_i)$  (2)
and B2, inputting the second video frame into a preset semantic segmentation model, and performing semantic segmentation processing to obtain a mask for identifying the ground corresponding to the second video frame.
Specifically, in order to calculate the horizontal angle and the pitch angle in the spherical coordinate system, the plane of the ground needs to be determined; due to the irregularity of the ground, the boundary of the ground can be calculated by the semantic segmentation model MITSemseg (ResNet), as shown in formula (3):

$Mask_{floor} = \mathrm{MITSemseg}(\tilde{I}_i)$  (3)

wherein the input of the semantic segmentation model is the second video frame $\tilde{I}_i$, and the output of the semantic segmentation model is a mask $Mask_{floor}$ for identifying the ground.
Step B3, an absolute pitch angle corresponding to the video frame is determined based on the depth map and the mask for identifying the ground.
Specifically, the absolute pitch angle is the complementary angle of the pitch angle in the spherical coordinate system. In the continuous image space $\Omega$, an angle $\alpha$ exists between the gradient direction vector of each pixel point on the ground and the vector perpendicular to the image, and the average included angle $\bar{\alpha}$ can be obtained by averaging the angle $\alpha$, as shown in formula (4):

$\bar{\alpha} = \frac{1}{|\Omega_{floor}|} \int_{\Omega_{floor}} \alpha(x, y)\, \mathrm{d}x\, \mathrm{d}y$  (4)

In the discretized case, the average included angle $\bar{\alpha}$ is calculated by formula (5):

$\bar{\alpha} = \frac{1}{Re - Le + 1} \sum_{y_0 = Le}^{Re} \frac{1}{\# Mask_{floor}(\cdot, y_0)} \sum_{x_0:\ Mask_{floor}(x_0, y_0) = 1} \operatorname{arctg}\big(\Delta_i(x_0, y_0)\big)$  (5)
The absolute pitch angle $\theta$ can then be obtained by combining $\bar{\alpha}$ with the included angle $\psi$ between the target object (the object in the video frame) and the direction perpendicular to the image, as shown in formula (6):
wherein $D_i(x, y)$ represents the value (depth) of the pixel located at (x, y) in the i-th depth map; $Mask_{floor}(\cdot, y)$ denotes the ground mask restricted to the ordinate y; arctg represents the arctangent function; Le represents the left end point of the ordinate interval; Re represents the right end point of the ordinate interval; $y_0$ takes the integer values in [Le, Re]; ground pixels are labeled 1, and $\# Mask_{floor}(\cdot, y)$ represents the length of the ground at ordinate y; and $\Delta_i(x_0, y_0)$ represents the differential value of the i-th image at position $(x_0, y_0)$ in the x-direction.
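A minimal sketch of the discrete average included angle of formula (5), under the assumption that $\Delta_i$ is a finite difference of the depth map along the x-direction on ground pixels (an interpretation consistent with, but not fixed by, the symbol definitions above):

```python
import numpy as np

def average_ground_angle(depth, mask_floor):
    """Approximate the average included angle of formula (5).

    depth:      (H, W) array, the depth map D_i estimated by the
                depth estimation model.
    mask_floor: (H, W) boolean array, the ground mask Mask_floor.
    """
    # Delta_i: finite difference of the depth map in the x-direction
    # (assumed interpretation of the differential value above).
    delta = np.diff(depth, axis=1, prepend=depth[:, :1])
    angles = np.arctan(delta[mask_floor])  # arctg of the differential
    return float(angles.mean()) if angles.size else 0.0
```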
In one embodiment, determining a horizontal angle corresponding to the video frame based on the second video frame includes:
and determining a horizontal angle corresponding to the video frame based on the depth map, the mask for identifying the ground and the distance between objects in the video frame.
Specifically, from the cone geometry relationships in the image, the horizontal angle $\phi$ in the horizontal plane, i.e., on the ground (a sphere $S^3$ locally homeomorphic to $\mathbb{R}^3$), can be obtained, as shown in formula (7):

wherein d represents the distance between the objects in the video frame, and the average included angle $\bar{\alpha}$ can be calculated by formula (4) or formula (5).
In one embodiment, determining the size of each object in the video frame based on the second video frame comprises:
inputting the second video frame into a preset instance segmentation model, and performing instance segmentation processing to obtain a box set corresponding to the second video frame;
Determining coordinates of boxes of objects in the video frame based on the box set;
the size of each object in the video frame is determined based on the coordinates of the box of each object in the video frame.
In one embodiment, the box set $Set_{Box}$, the score set $Set_{Score}$, and the mask set $Set_{Mask}$ are calculated by the instance segmentation model Yolact, as shown in formula (8):

$(Set_{Box},\ Set_{Score},\ Set_{Mask}) = \mathrm{Yolact}(\tilde{I}_i)$  (8)

wherein $Set_{Box}$ represents the set of boxes, each box containing a possible target object in the image (video frame); $Set_{Score}$ represents the score set, each score value in the score set representing the identification score of the object in the corresponding box, a higher score indicating that the model considers the object more likely to belong to the discriminated category; $Set_{Mask}$ represents the set of masks, each mask representing the mask of an object identified by the model in the image (video frame); and $\tilde{I}_i$ represents the second video frame.
In one embodiment, the relative distance between two objects in the video frame, such as a person and a car, can be calculated according to the distance formula in Euclidean space.
In one embodiment, the coordinates of the boxes of the person and the car can be found in $Set_{Box}$. Let the corner coordinates of the person's box be $(x_{person,1}, y_{person,1})$ and $(x_{person,2}, y_{person,2})$, and the corner coordinates of the car's box be $(x_{car,1}, y_{car,1})$ and $(x_{car,2}, y_{car,2})$. The center coordinates of the person and of the car are calculated by formula (9) and formula (10), respectively:

$(x_{person,center}, y_{person,center}) = [(x_{person,1}, y_{person,1}) + (x_{person,2}, y_{person,2})] / 2$  (9)

$(x_{car,center}, y_{car,center}) = [(x_{car,1}, y_{car,1}) + (x_{car,2}, y_{car,2})] / 2$  (10)

Correspondingly, the sizes of the person and of the car are measured by formula (11) and formula (12), respectively:

$S_{person} = (x_{person,2} - x_{person,1})(y_{person,2} - y_{person,1})$  (11)

$S_{car} = (x_{car,2} - x_{car,1})(y_{car,2} - y_{car,1})$  (12)
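A short sketch of formulas (9)-(12) together with the Euclidean person-car distance mentioned above; the box corner convention (x1, y1, x2, y2) is an assumption:

```python
def box_center(box):
    # box = (x1, y1, x2, y2), as found in Set_Box; formulas (9)/(10).
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2, (y1 + y2) / 2)

def box_size(box):
    # Formulas (11)/(12): area of the box.
    x1, y1, x2, y2 = box
    return (x2 - x1) * (y2 - y1)

def center_distance(c1, c2):
    # Euclidean distance d between the two box centers.
    return ((c1[0] - c2[0]) ** 2 + (c1[1] - c2[1]) ** 2) ** 0.5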
In one embodiment, a second video frame is input to a preset instance segmentation model, and instance segmentation processing is performed to obtain a score set corresponding to the second video frame;
determining an identification score for each object in the video frame based on the second video frame, comprising:
and determining the identification score of each object in the video frame based on the score set corresponding to the second video frame.
Specifically, the score set $Set_{Score}$ corresponding to the second video frame $\tilde{I}_i$ can be calculated by formula (8).
In one embodiment, determining at least two abnormal videos from the set of videos to be detected based on the feature sequence of each video to be detected in the set of videos to be detected comprises:
inputting the feature sequence of the video frames in each video to be detected into a sequence generation layer included in a preset long short-term memory neural network, and performing compression processing to obtain a multidimensional vector corresponding to the feature sequence of the video frames;
inputting the multidimensional vector to a full connection layer included in the long-short-term memory neural network, and converting the multidimensional vector into a two-dimensional vector;
Inputting the two-dimensional vector to an output layer included in the long-short-term memory neural network, and determining a probability value corresponding to the two-dimensional vector;
if the probability value corresponding to the two-dimensional vector is a preset abnormal index, determining that each video to be detected is an initial abnormal video;
and performing a statistical test on at least two initial abnormal videos, and if the at least two initial abnormal videos meet preset abnormality judgment conditions, determining the videos to be detected corresponding to the at least two initial abnormal videos as abnormal videos.
In one embodiment, each video yields a corresponding block sequence $Block_k$. Considering that $Block_k$ has an indefinite length, video abnormality can be judged by a method combining a statistical test with a spatio-temporal recurrent neural network, comprising steps C1-C2:
and C1, judging video abnormality through a long-term and short-term memory neural network.
Specifically, a dedicated LSTM (Long Short-Term Memory) network is constructed for the two objects, which performs a binary classification of whether the input block sequence $Block_k$ is abnormal; this structure finally compresses the temporal and spatial information of the video into a 150-dimensional vector (a multidimensional vector, such as a 150-dimensional vector), based on which the abnormality is judged. The structure of the LSTM is shown in FIG. 3. In the LSTM, each element of each block in the sequence $Block_k$ is sequentially input into the Series Generation layers (sequence generation layers) of the LSTM, which sequentially generate a sequence of 150-dimensional vectors. These then pass through the fully connected layer and the Softmax layer (output layer) in turn, yielding the final abnormality index $J_k = \mathrm{LSTM}_{softmax}(Block_k) \in \{0, 1\}$. Wherein $Block_k = \{block_i\}_{i=1}^{M_k}$ represents the block sequence; $M_k$ represents the frame number of the k-th video; $\mathrm{LSTM}_{softmax}$ represents the LSTM neural network (long short-term memory neural network) with softmax as the last layer; and $J_k$ is the index indicating whether an abnormality is present (0 or 1).
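A minimal PyTorch sketch of the structure of FIG. 3, under the assumptions that the per-frame feature vector has 8 components (matching $block_i$) and that only the 150-dimensional hidden size is fixed by the text:

```python
import torch
import torch.nn as nn

class AnomalyLSTM(nn.Module):
    """Sketch of FIG. 3: Series Generation layers compressing the block
    sequence into 150-dimensional vectors, then a fully connected layer
    and a softmax output layer."""
    def __init__(self, feature_dim=8, hidden_dim=150):
        super().__init__()
        self.series_generation = nn.LSTM(feature_dim, hidden_dim,
                                         batch_first=True)
        self.fc = nn.Linear(hidden_dim, 2)

    def forward(self, blocks):
        # blocks: (batch, M_k, feature_dim) sequence of block_i features.
        _, (h_n, _) = self.series_generation(blocks)
        logits = self.fc(h_n[-1])              # final 150-d vector -> 2 classes
        return torch.softmax(logits, dim=-1)   # abnormality index J_k = argmax
```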
Step C2, the abnormal videos (initial abnormal videos) judged in step C1 are checked through a statistical test; when a video is determined to be abnormal, the subsequent de-duplication processing is performed on it and an abnormality report is generated; when the video is determined to be normal, the video is screened out.
Specifically, a sample (initial abnormal video) discriminated as abnormal by the LSTM is checked again using a statistical test. The statistical test performs a median test between the feature distribution of the normal videos and the feature distribution of the sample to be tested, and judges the p value at a 95% confidence level. The null hypothesis is $H_0: \mu_0 = \mu_1$, and the alternative hypothesis is $H_1: \mu_0 \neq \mu_1$; the p value $\hat{p}$ is then calculated by formula (13):

$z = \dfrac{\bar{x}_1 - \bar{x}_2}{S_w \sqrt{\tfrac{1}{n_1} + \tfrac{1}{n_2}}}, \qquad \hat{p} = 2\big(1 - P(|z|)\big)$  (13)

wherein P represents the distribution function of the standard normal distribution; $\bar{x}_1$ represents the mean of the normal block sequence samples; $\bar{x}_2$ represents the mean of the abnormal block sequence samples; $S_1^2$ represents the variance of the normal block sequence samples; $S_2^2$ represents the variance of the abnormal block sequence samples; z represents a standard normal random variable; $S_w = \sqrt{\frac{(n_1 - 1)S_1^2 + (n_2 - 1)S_2^2}{n_1 + n_2 - 2}}$ represents the combined total variance; and $n_1$, $n_2$ are the degrees of freedom of the corresponding distributions. If $\hat{p}$ is smaller than 0.05, the median of the feature distribution of the test sample is considered inconsistent with the median of the feature distribution of the normal samples, and the test sample is further considered abnormal; otherwise, the feature distribution of the test sample is considered consistent with that of the normal samples, and the sample is considered normal. When a sample is determined to be normal by combining the discrimination of step C1 and step C2, the sample does not participate in the subsequent de-duplication processing and abnormality report generation.
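A sketch of the check in formula (13) as reconstructed above, i.e., a pooled two-sample comparison evaluated against the standard normal distribution; that the original formula used exactly this pooling is an assumption:

```python
import math
from statistics import NormalDist

def two_sample_p_value(normal_samples, test_samples):
    """p-hat of formula (13): compare the feature distribution of the
    sample to be tested with that of the normal videos."""
    n1, n2 = len(normal_samples), len(test_samples)
    m1, m2 = sum(normal_samples) / n1, sum(test_samples) / n2
    v1 = sum((x - m1) ** 2 for x in normal_samples) / (n1 - 1)
    v2 = sum((x - m2) ** 2 for x in test_samples) / (n2 - 1)
    s_w = math.sqrt(((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2))
    z = (m1 - m2) / (s_w * math.sqrt(1 / n1 + 1 / n2))
    return 2 * (1 - NormalDist().cdf(abs(z)))  # < 0.05 => abnormal
```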
In one embodiment, complex relationships between multiple objects may be considered. The video tensor $I_{1:N}$ can be abstracted into a graph sequence $\{G_i\}$, wherein $\{G_i\}$ represents the set of graph models of the frames in the k-th video, $G_i = (V_i, E_i)$ represents the graph model of the i-th frame, $V_i$ is the set of points in the graph model, and $E_i$ is the set of edges in the graph model. Each point represents an object or a feature of an object. If the points in the graph represent objects, the spatial distance and relative angle between points, as well as the class, score, color, etc. of the objects, can be used as distinguishing features. If the points represent features of objects (such as the hands, feet, or face of a person, or the wheels, doors, or lamps of a car), a connected subgraph can be used as the graph representation of an object (such as a person or a car); abnormal video discrimination, de-duplication processing, and abnormality report generation can then be completed by extracting the Laplacian matrix of the high-dimensional graph and inputting it into a spatio-temporal graph sequence model.
In one embodiment, performing deduplication processing on at least two abnormal videos to determine an abnormal video set, including:
determining a two-dimensional feature embedding vector of each abnormal video aiming at each abnormal video;
if the Euclidean distance between the two-dimensional feature embedded vectors of any two abnormal videos in the at least two abnormal videos is smaller than a preset distance threshold, determining that the abnormal types between any two abnormal videos are the same;
screening out one of any two abnormal videos with the same abnormal type, and attributing the other abnormal video of the any two abnormal videos to an abnormal video set.
In one embodiment, in order to increase the efficiency of pushing abnormality reports, de-duplication processing of the abnormal videos is required. The corresponding 150-dimensional vector $V_i$ of the i-th video can be obtained from the Series Generation layer of the LSTM in FIG. 3, as shown in formula (14):

$V_i = \mathrm{LSTM}_{SeriesGeneration}(Block_i)$  (14)

wherein $\mathrm{LSTM}_{SeriesGeneration}$ represents the LSTM that takes the block sequence of the i-th video as input and generates the 150-dimensional vector.
The 150-dimensional vector is reduced in dimension by an Embedding technique, such as t-SNE (t-distributed Stochastic Neighbor Embedding). Assume two video tensors whose 150-dimensional vectors calculated by the LSTM are $V_1$ and $V_2$; the two-dimensional Embedding vectors (two-dimensional feature embedding vectors) can then be calculated by t-SNE, as shown in formula (15):

$(E_1, E_2) = \mathrm{tSNE}(V_1, V_2)$  (15)
After the two-dimensional embedding vectors $E_1$ and $E_2$ are obtained by formula (15), the distance between them can be calculated, and a threshold method is used to judge whether the two sample points are similar (i.e., whether the content types are consistent, that is, whether the anomaly types of the two abnormal videos are the same). In the Embedding space, the Euclidean distance $dis(E_1, E_2)$ between the two points $E_1$ and $E_2$ is calculated by formula (16):

$dis(E_1, E_2) = \sqrt{(E_{1,x} - E_{2,x})^2 + (E_{1,y} - E_{2,y})^2}$  (16)

wherein $E_{1,x}$ represents the abscissa of the first point; $E_{2,x}$ represents the abscissa of the second point; $E_{1,y}$ represents the ordinate of the first point; and $E_{2,y}$ represents the ordinate of the second point.
It should be noted that the threshold method is established based on the distribution of Euclidean distances between points in the two-dimensional Embedding space. If the distance between $E_1$ and $E_2$ is smaller than a certain threshold M (the distance threshold), the anomalies associated with the two videos are considered consistent; otherwise, the contents (anomaly types) of the two videos are considered different. After the training data are counted, the threshold can be estimated such that, over the training data set, samples with consistent content types fall within the threshold range as much as possible and samples with inconsistent content fall outside it.
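An illustrative sketch of the de-duplication above: embed the 150-dimensional LSTM vectors into two dimensions with t-SNE and keep one representative per anomaly type using the distance threshold M; the perplexity value is an assumption:

```python
import numpy as np
from sklearn.manifold import TSNE

def deduplicate(vectors, threshold_m):
    """vectors: (num_videos, 150) array of LSTM Series Generation outputs.
    Returns indices of videos kept as distinct anomaly types."""
    emb = TSNE(n_components=2, perplexity=5).fit_transform(np.asarray(vectors))
    kept = []
    for i, point in enumerate(emb):
        # Formula (16): Euclidean distance in the embedding space.
        if all(np.linalg.norm(point - emb[j]) >= threshold_m for j in kept):
            kept.append(i)  # a new anomaly type
    return kept
```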
In one embodiment, the de-duplication processing mainly uses partial manifold Embedding, and other dimension reduction methods may also be used: for example, linear dimension reduction such as PCA (Principal Component Analysis), or nonlinear dimension reduction such as random projection. The threshold may be set manually in the de-duplication process, and whether samples coincide or have a large similarity may also be judged by methods such as KNN (K-Nearest Neighbor).
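For the alternatives mentioned here, a sketch of PCA and random projection as drop-in replacements for t-SNE, plus a KNN-style nearest-neighbor distance check; the component counts and sample data are assumptions:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.random_projection import GaussianRandomProjection
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
vectors = rng.normal(size=(19, 150))  # stand-in for the 150-d LSTM vectors

emb_pca = PCA(n_components=2).fit_transform(vectors)  # linear reduction
emb_rp = GaussianRandomProjection(n_components=2).fit_transform(vectors)

# KNN-style check: distance from each sample to its nearest other sample,
# usable in place of a manually set threshold.
nn = NearestNeighbors(n_neighbors=2).fit(emb_pca)
dists, _ = nn.kneighbors(emb_pca)
nearest_other = dists[:, 1]
```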
In one embodiment, in order to improve the efficiency of video anomaly repair, the abnormality report pushed to the operator needs to reflect the content of the abnormal video in a concise and efficient manner. Text keywords are generated from the title and brief introduction of the video; the abnormality indexes judged by the statistical test in the abnormal sample (abnormal video), such as the distance between objects in the video frame, the absolute pitch angle corresponding to the video frame, the horizontal angle corresponding to the video frame, the size of each object in the video frame, and the identification score of each object in the video frame, form the text abstract part; and finally the abnormality report (abnormal text report) is generated. An abnormality report template is shown in Table 1, where Object in Table 1 represents an object.
Table 1: abnormality report template
The application of the embodiment of the application has at least the following beneficial effects:
video anomaly detection is performed based on the relationship between at least two objects in the video to be detected, so that the accuracy of detecting video anomalies is improved; by performing de-duplication processing on abnormal videos with similar content, the content of the pushed abnormal videos has a lower repetition rate, so that the efficiency of pushing abnormality reports is improved.
In order to better understand the method provided by the embodiment of the present application, the scheme of the embodiment of the present application is further described below with reference to an example of a specific application scenario.
In one embodiment, the application scenario of the video detection may be game video detection. The method can push effective abnormal video with non-repeated content and corresponding text abstract reports to operators with high quality based on abnormal game videos issued by game masters on various large video websites.
Referring to fig. 4, fig. 4 is a schematic flow chart of a video detection method provided in the embodiments of the present application, where the method may be performed by any electronic device, for example, a server, and as an optional implementation manner, the method may be performed by the server, and for convenience of description, in the following description of some optional embodiments, a description will be given taking the server as an implementation subject of the method. As shown in fig. 4, the video detection method provided in the embodiment of the present application includes the following steps:
S401, median filtering is carried out on the video frames in each game video to be detected to obtain a denoised first video frame.
S402, size correction processing is performed on the first video frame to obtain a corrected second video frame.
S403, determining a characteristic sequence of the video frames in each game video to be detected based on the second video frames, wherein the characteristic sequence of the video frames is used for representing the relation between at least two objects in the video frames.
Specifically, the feature sequence of the video frame includes an index of the video frame, a distance between objects in the video frame, an absolute pitch angle corresponding to the video frame, a horizontal angle corresponding to the video frame, a size of each object in the video frame, and an identification score of each object in the video frame.
S404, the feature sequence of the video frames in each game video to be detected is input into the long short-term memory neural network, and a plurality of initial abnormal game videos are determined.
S405, carrying out statistical test on the plurality of initial abnormal game videos to determine at least two abnormal game videos.
S406, performing de-duplication processing on at least two abnormal game videos, and determining an abnormal game video set, wherein the abnormal types of the abnormal game videos in the abnormal game video set are different from each other.
S407, determining an abnormal text report based on the abnormal game video set, and pushing the abnormal text report to an operator.
The application of the embodiment of the application has at least the following beneficial effects:
video anomaly detection is performed based on the relationship between at least two objects in the game video to be detected, so that the accuracy of detecting game video anomalies is improved; by performing de-duplication processing on abnormal game videos with similar content, the content of the pushed abnormal game videos has a lower repetition rate, and the efficiency of pushing abnormal text reports is improved.
In one embodiment, according to a manual strategy, the relative size $S_{person}/S_{car}$ and $\theta$ can be used as video features to calculate the difference between the distributions of normal samples (normal videos) and abnormal samples (abnormal videos); for example, with 1077 normal sample features and 2869 abnormal sample features, the results obtained by the median test of the normal and abnormal samples are shown in Table 2:
table 2: median inspection table for important index distribution of normal sample and abnormal sample
The result obtained by LSTM training is that the classification accuracy for 9 normal videos and 10 abnormal videos is 84.2%, which indicates that steps C1-C2 are effective.
In one embodiment, from the 19 samples (9 normal videos and 10 abnormal videos) in the test data set, a two-dimensional map as shown in fig. 5, namely a two-dimensional scatter diagram of the 19 samples after t-SNE dimension reduction, can be obtained through t-SNE. As shown in fig. 5, the small dots are from normal samples, i.e., 9 small dots represent the 9 normal videos; the large dots are from abnormal samples, i.e., 10 large dots represent the 10 abnormal videos. When the sample contents differ considerably, the distances between the low-dimensional sample vectors obtained by t-SNE are approximately uniform.
Two abnormal samples with "man-car separation" content are shown in fig. 6 and fig. 7, respectively; both show that, while the person is driving, the person is at an abnormal position relative to the car. For these two abnormal samples with almost identical content, the low-dimensional data points obtained by t-SNE dimension reduction are shown in fig. 8; fig. 8 is a two-dimensional scatter diagram of the abnormal samples (with the normal samples that would influence the judgment removed), wherein the large points represent the other abnormal samples and the small points represent the two abnormal samples with "man-car separation" content. The two-dimensional distances under t-SNE between each abnormal sample and one "man-car separation" abnormal sample are shown in Table 3:
Table 3: two-dimensional distance meter for abnormal samples and one 'man-car separation' abnormal sample under t-SNE
| Sample | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
| Distance* | 13.81 | 34.48 | 10.93 | 22.46 | 14.94 | 21.01 | 20.66 | 37.51 | 38.46 |
*The distance is the two-dimensional t-SNE distance between each abnormal sample and one "man-car separation" abnormal sample; sample 1 is the other "man-car separation" abnormal sample. On samples with similar video content (e.g., the two "man-car separation" abnormal samples), the semantic distance is shorter than for videos of other, different content types.
In one embodiment, 4 abnormal videos that are discriminated as abnormal and obtained after the above processing are taken as test samples, and the generated abnormality reports are shown in Table 4:
table 4: abnormality reporting
Scale may represent a size measure, such as the size of the person, the size of the car, or the relative size of the person to the car.
The embodiment of the application further provides a video detection device, and a schematic structural diagram of the video detection device is shown in fig. 9, where the video detection device 60 includes a first processing module 601, a second processing module 602, and a third processing module 603.
The first processing module 601 is configured to obtain a feature sequence of each video to be detected in the video set to be detected, where the feature sequence of each video to be detected is used to characterize a relationship between at least two objects in each video to be detected;
The second processing module 602 is configured to determine at least two abnormal videos from the video to be detected set based on the feature sequences of each video to be detected in the video to be detected set;
the third processing module 603 is configured to perform deduplication processing on at least two abnormal videos, and determine an abnormal video set, where types of abnormalities between the abnormal videos in the abnormal video set are different from each other.
In one embodiment, the first processing module 601 is specifically configured to:
performing median filtering on video frames in each video to be detected to obtain a first video frame after noise removal;
performing size correction processing on the first video frame to obtain a corrected second video frame;
and determining a characteristic sequence of the video frames in each video to be detected based on the second video frames, wherein the characteristic sequence of the video frames comprises at least one of an index of the video frames, a distance between objects in the video frames, an absolute pitch angle corresponding to the video frames, a horizontal angle corresponding to the video frames, a size of each object in the video frames and an identification score of each object in the video frames.
In one embodiment, the first processing module 601 is specifically configured to:
inputting the second video frame into a preset image depth estimation function, and performing depth estimation processing to obtain a depth map corresponding to the second video frame;
Inputting the second video frame into a preset semantic segmentation model, and performing semantic segmentation processing to obtain a mask which is corresponding to the second video frame and used for identifying the ground;
and determining an absolute pitch angle corresponding to the video frame based on the depth map and a mask for identifying the ground.
In one embodiment, the first processing module 601 is specifically configured to:
and determining a horizontal angle corresponding to the video frame based on the depth map, the mask for identifying the ground and the distance between objects in the video frame.
In one embodiment, the first processing module 601 is specifically configured to:
inputting the second video frame into a preset instance segmentation model, and performing instance segmentation processing to obtain a box set corresponding to the second video frame;
determining coordinates of boxes of objects in the video frame based on the box set;
the size of each object in the video frame is determined based on the coordinates of the box of each object in the video frame.
In one embodiment, the first processing module 601 is further configured to:
inputting the second video frame into a preset instance segmentation model, and carrying out instance segmentation processing to obtain a score set corresponding to the second video frame;
determining an identification score for each object in the video frame based on the second video frame, comprising:
And determining the identification score of each object in the video frame based on the score set corresponding to the second video frame.
In one embodiment, the second processing module 602 is specifically configured to:
inputting the feature sequence of the video frames in each video to be detected into a sequence generation layer included in a preset long short-term memory neural network, and performing compression processing to obtain a multidimensional vector corresponding to the feature sequence of the video frames;
inputting the multidimensional vector to a full connection layer included in the long-short-term memory neural network, and converting the multidimensional vector into a two-dimensional vector;
inputting the two-dimensional vector to an output layer included in the long-short-term memory neural network, and determining a probability value corresponding to the two-dimensional vector;
if the probability value corresponding to the two-dimensional vector is a preset abnormal index, determining that each video to be detected is an initial abnormal video;
and performing a statistical test on at least two initial abnormal videos, and if the at least two initial abnormal videos meet preset abnormality judgment conditions, determining the videos to be detected corresponding to the at least two initial abnormal videos as abnormal videos.
In one embodiment, the third processing module 603 is specifically configured to:
determining a two-dimensional feature embedding vector of each abnormal video aiming at each abnormal video;
If the Euclidean distance between the two-dimensional feature embedded vectors of any two abnormal videos in the at least two abnormal videos is smaller than a preset distance threshold, determining that the abnormal types between any two abnormal videos are the same;
screening out one of any two abnormal videos with the same abnormal type, and attributing the other abnormal video of the any two abnormal videos to an abnormal video set.
The application of the embodiment of the application has at least the following beneficial effects:
acquiring a characteristic sequence of each video to be detected in a video set to be detected, wherein the characteristic sequence of each video to be detected is used for representing the relationship between at least two objects in each video to be detected; determining at least two abnormal videos from the video set to be detected based on the characteristic sequences of the videos to be detected in the video set to be detected; therefore, video anomaly detection is performed based on the relation between at least two objects in the video to be detected, and the accuracy of detecting video anomalies is improved. Performing de-duplication processing on at least two abnormal videos, and determining an abnormal video set, wherein the types of the abnormalities in the abnormal video set are different from each other; in this way, the content of the pushed abnormal video has a lower repetition rate by performing the de-duplication processing on the abnormal video with similar content.
The embodiment of the application further provides an electronic device, a schematic structural diagram of which is shown in fig. 10, and an electronic device 4000 shown in fig. 10 includes: a processor 4001 and a memory 4003. Wherein the processor 4001 is coupled to the memory 4003, such as via a bus 4002. Optionally, the electronic device 4000 may further comprise a transceiver 4004, the transceiver 4004 may be used for data interaction between the electronic device and other electronic devices, such as transmission of data and/or reception of data, etc. It should be noted that, in practical applications, the transceiver 4004 is not limited to one, and the structure of the electronic device 4000 is not limited to the embodiment of the present application.
The processor 4001 may be a CPU (Central Processing Unit), a general purpose processor, a DSP (Digital Signal Processor), an ASIC (Application Specific Integrated Circuit), an FPGA (Field Programmable Gate Array) or other programmable logic device, a transistor logic device, a hardware component, or any combination thereof. It may implement or perform the various exemplary logic blocks, modules, and circuits described in connection with this disclosure. The processor 4001 may also be a combination that implements computing functionality, e.g., a combination including one or more microprocessors, or a combination of a DSP and a microprocessor.
Bus 4002 may include a path to transfer information between the aforementioned components. Bus 4002 may be a PCI (Peripheral Component Interconnect, peripheral component interconnect standard) bus or an EISA (Extended Industry Standard Architecture ) bus, or the like. The bus 4002 can be divided into an address bus, a data bus, a control bus, and the like. For ease of illustration, only one thick line is shown in fig. 10, but not only one bus or one type of bus.
Memory 4003 may be, but is not limited to, ROM (Read Only Memory) or other type of static storage device that can store static information and instructions, RAM (Random Access Memory ) or other type of dynamic storage device that can store information and instructions, EEPROM (Electrically Erasable Programmable Read Only Memory ), CD-ROM (Compact Disc Read Only Memory, compact disc Read Only Memory) or other optical disk storage, optical disk storage (including compact discs, laser discs, optical discs, digital versatile discs, blu-ray discs, etc.), magnetic disk storage media, other magnetic storage devices, or any other medium that can be used to carry or store a computer program and that can be Read by a computer.
The memory 4003 is used for storing a computer program that executes an embodiment of the present application, and is controlled to be executed by the processor 4001. The processor 4001 is configured to execute a computer program stored in the memory 4003 to realize the steps shown in the foregoing method embodiment.
Among them, electronic devices include, but are not limited to: a server, etc.
The application of the embodiment of the application has at least the following beneficial effects:
acquiring a characteristic sequence of each video to be detected in a video set to be detected, wherein the characteristic sequence of each video to be detected is used for representing the relationship between at least two objects in each video to be detected; determining at least two abnormal videos from the video set to be detected based on the characteristic sequences of the videos to be detected in the video set to be detected; therefore, video anomaly detection is performed based on the relation between at least two objects in the video to be detected, and the accuracy of detecting video anomalies is improved. Performing de-duplication processing on at least two abnormal videos, and determining an abnormal video set, wherein the types of the abnormalities in the abnormal video set are different from each other; in this way, the content of the pushed abnormal video has a lower repetition rate by performing the de-duplication processing on the abnormal video with similar content.
Embodiments of the present application provide a computer readable storage medium having a computer program stored thereon, where the computer program, when executed by a processor, may implement the steps and corresponding content of the foregoing method embodiments.
The embodiments of the present application also provide a computer program product, which includes a computer program, where the computer program can implement the steps of the foregoing method embodiments and corresponding content when executed by a processor.
Based on the same principle as the method provided by the embodiments of the present application, the embodiments of the present application also provide a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The computer instructions are read from a computer-readable storage medium by a processor of a computer device, and executed by the processor, cause the computer device to perform the method provided in any of the alternative embodiments of the present application described above.
It should be understood that, although the flowcharts of the embodiments of the present application indicate the respective operation steps by arrows, the order of implementation of these steps is not limited to the order indicated by the arrows. In some implementations of embodiments of the present application, the implementation steps in the flowcharts may be performed in other orders as desired, unless explicitly stated herein. Furthermore, some or all of the steps in the flowcharts may include multiple sub-steps or multiple stages based on the actual implementation scenario. Some or all of these sub-steps or phases may be performed at the same time, or each of these sub-steps or phases may be performed at different times, respectively. In the case of different execution time, the execution sequence of the sub-steps or stages may be flexibly configured according to the requirement, which is not limited in the embodiment of the present application.
The foregoing is merely an optional implementation manner of the implementation scenario of the application, and it should be noted that, for those skilled in the art, other similar implementation manners based on the technical ideas of the application are adopted without departing from the technical ideas of the application, and also belong to the protection scope of the embodiments of the application.