Video recognition method and apparatus, computer device, and storage medium
Technical Field
Embodiments of the present invention relate to the technical field of video processing, and in particular to a video recognition method and apparatus, a computer device, and a storage medium.
Background Art
With the popularization of the global Internet and of communications, people all over the world can exchange information online and transmit multimedia messages through various communication devices. People upload their pictures, text, voice, video and the like to network platforms to share their status, moods, scenery and so on. Because video carries rich content information and lets people understand content more intuitively and clearly, it is transmitted and stored on network platforms in large quantities. However, among the videos that people upload there are many that local laws and public morals do not allow, such as pornographic, gambling-related, gory, vulgar, violent/terrorist and extreme-religion videos. When users download and spread such videos, they are easily harmed psychologically, especially teenagers. Relying purely on manual review to audit the massive number of videos on the Internet is extremely time-consuming, laborious and impractical. Video moderation technology came into being against this background.
Early video moderation technology generally used traditional machine learning methods. Such methods rely on hand-designed features tailored to a specific data set and lack generalization (the features apply well to one data set, but algorithm performance deteriorates on another). Later, manual review, in the form of 7×24-hour uninterrupted human inspection assisted by machines, was combined with conventional video moderation technology to reduce the appearance of illegal and infringing video content. In recent years, deep learning has developed rapidly in fields such as video, image and speech. Machine-intelligence moderation based on deep learning, image recognition and cloud computing has therefore become the main development trend: it can greatly reduce the cost that enterprises invest in manual review while yielding better moderation results. At present, domestic technology companies such as Baidu, NetEase, Tupu and SenseTime have all launched their own video moderation systems, and foreign companies such as Google, Facebook, Amazon and Valossa have also launched video moderation systems with their own characteristics.
In the course of implementing the present invention, the inventor found that the prior art has the following defects:
Although machine learning methods can identify some violating content, they cannot achieve accurate content recognition for short videos, live video and similar material, and when facing massive numbers of videos the algorithms cannot recognize video content well. Manual review combined with conventional video moderation technology requires a huge review team, which must be expanded further when the accuracy of the automatic review is not high. Meanwhile, reviewers auditing video without interruption become fatigued, which leads to missed and false detections of some videos. Enterprises also need to spend considerable time training reviewers, so that the cost of manual review far exceeds the cost of a machine learning algorithm. Existing machine-intelligence moderation technology based on deep learning, image recognition and cloud technology cannot adequately detect the large number of vulgar, worthless videos on today's networks; moreover, the content such systems recognize is relatively simple and their recognition range is small, and once the recognition dimensions increase, the amount of computation grows exponentially, making the demands on computing power excessive.
Summary of the invention
Embodiments of the present invention provide a video recognition method and apparatus, a computer device and a storage medium, so as to improve the richness, accuracy, efficiency and real-time performance of video recognition technology while reducing recognition cost.
In a first aspect, an embodiment of the present invention provides a video recognition method, including:
obtaining a pure-video subfile and a pure-audio subfile corresponding to a video file to be recognized, and obtaining a key frame set and a video segment set corresponding to the pure-video subfile;
performing multi-modal picture recognition on the key frame set to obtain a first recognition result, and performing video recognition on the video segment set to obtain a second recognition result;
performing audio recognition on the pure-audio subfile to obtain a third recognition result;
obtaining an integrated recognition result corresponding to the video file according to the first recognition result, the second recognition result and the third recognition result.
In a second aspect, an embodiment of the present invention further provides a video recognition apparatus, including:
a subfile obtaining module, configured to obtain a pure-video subfile and a pure-audio subfile corresponding to a video file to be recognized, and to obtain a key frame set and a video segment set corresponding to the pure-video subfile;
a first recognition module, configured to perform multi-modal picture recognition on the key frame set to obtain a first recognition result, and to perform video recognition on the video segment set to obtain a second recognition result;
a second recognition module, configured to perform audio recognition on the pure-audio subfile to obtain a third recognition result;
a recognition result obtaining module, configured to obtain an integrated recognition result corresponding to the video file according to the first recognition result, the second recognition result and the third recognition result.
In a third aspect, an embodiment of the present invention further provides a computer device, the computer device including:
one or more processors; and
a storage apparatus for storing one or more programs;
wherein, when the one or more programs are executed by the one or more processors, the one or more processors implement the video recognition method provided by any embodiment of the present invention.
In a fourth aspect, an embodiment of the present invention further provides a computer storage medium on which a computer program is stored, the program, when executed by a processor, implementing the video recognition method provided by any embodiment of the present invention.
In the embodiments of the present invention, a pure-video subfile and a pure-audio subfile corresponding to a video file to be recognized are obtained, together with a key frame set and a video segment set corresponding to the pure-video subfile; multi-modal picture recognition is performed on the key frame set to obtain a first recognition result, and video recognition is performed on the video segment set to obtain a second recognition result; audio recognition is performed on the pure-audio subfile to obtain a third recognition result; finally, the first, second and third recognition results are integrated to obtain an integrated recognition result for the video file. This solves the problems of existing video moderation technology that the recognized content is limited and the recognition range is small, enriches the recognition types, refines the recognized content, and performs multi-dimensional recognition of video content, thereby improving the richness, accuracy, efficiency and real-time performance of video recognition technology while reducing recognition cost.
Brief Description of the Drawings
Fig. 1 is a flowchart of a video recognition method provided by Embodiment One of the present invention;
Fig. 2 is a flowchart of a video recognition method provided by Embodiment Two of the present invention;
Fig. 3a is a flowchart of a video recognition method provided by Embodiment Three of the present invention;
Fig. 3b is a schematic diagram of a bounding-box size and position prediction effect provided by Embodiment Three of the present invention;
Fig. 3c is a schematic diagram of a face detection effect provided by Embodiment Three of the present invention;
Fig. 3d is a schematic diagram of a face key-point localization effect provided by Embodiment Three of the present invention;
Fig. 3e is a flowchart of a video recognition method provided by Embodiment Three of the present invention;
Fig. 3f is a schematic diagram of a video recognition system provided by Embodiment Three of the present invention;
Fig. 3g is a flowchart of a video recognition method provided by Embodiment Three of the present invention;
Fig. 3h is a schematic diagram of a log-Mel spectral feature provided by Embodiment Three of the present invention;
Fig. 3i is an architecture diagram of a video recognition algorithm provided by Embodiment Three of the present invention;
Fig. 4 is a schematic diagram of a video recognition apparatus provided by Embodiment Four of the present invention;
Fig. 5 is a structural schematic diagram of a computer device provided by Embodiment Five of the present invention.
Detailed Description of the Embodiments
The present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here are used only to explain the present invention and not to limit it.
It should also be noted that, for ease of description, the accompanying drawings show only the parts related to the present invention rather than the entire content. It should be mentioned that, before the exemplary embodiments are discussed in greater detail, some exemplary embodiments are described as processes or methods depicted as flowcharts. Although a flowchart describes the operations (or steps) as a sequential process, many of the operations can be performed in parallel, concurrently or simultaneously. In addition, the order of the operations can be rearranged. A process may be terminated when its operations are completed, but may also have additional steps not included in the drawing. A process may correspond to a method, a function, a procedure, a subroutine, a subprogram and the like.
Embodiment one
Fig. 1 is a flowchart of a video recognition method provided by Embodiment One of the present invention. This embodiment is applicable to the case of recognizing a video file accurately and quickly. The method can be executed by a video recognition apparatus, which can be implemented by means of software and/or hardware and can generally be integrated in a computer device. Accordingly, as shown in Fig. 1, the method includes the following operations:
S110, obtaining a pure-video subfile and a pure-audio subfile corresponding to a video file to be recognized, and obtaining a key frame set and a video segment set corresponding to the pure-video subfile.
Here, the video file to be recognized may include two kinds of data resources: video and audio. The pure-video subfile may be a file that includes only the video resource; similarly, the pure-audio subfile may be a file that includes only the audio resource. The key frame set can be used to store the key frames of the pure-video subfile, a key frame being the most representative video frame in the pure-video subfile, where "representative" refers to being representative of the semantic content of a video segment, with complete content and obvious semantics. The video segment set can be used to store the video segments of the pure-video subfile.
In the embodiments of the present invention, after the video file to be recognized is obtained, audio-video separation can be performed on it to obtain the pure-video subfile and the pure-audio subfile. When the video file to be recognized is recognized, the pure-video subfile and the pure-audio subfile can be recognized separately. Specifically, when the pure-video subfile is recognized, two recognition schemes can be applied: image recognition and video recognition. For image recognition, each key frame in the key frame set corresponding to the pure-video subfile is recognized; for video recognition, each video segment in the video segment set corresponding to the pure-video subfile is recognized.
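The audio-video separation step described above can be sketched as follows. This is only a minimal sketch: the use of the ffmpeg command-line tool, the output filenames and the audio parameters are assumptions for illustration, not part of the original disclosure.

```python
def demux_commands(video_path, video_out="video_only.mp4", audio_out="audio_only.wav"):
    """Build ffmpeg command lines that separate one input file into a
    pure-video subfile (-an drops the audio track) and a pure-audio
    subfile (-vn drops the video track)."""
    video_cmd = ["ffmpeg", "-y", "-i", video_path, "-c:v", "copy", "-an", video_out]
    audio_cmd = ["ffmpeg", "-y", "-i", video_path, "-vn",
                 "-acodec", "pcm_s16le", "-ar", "16000", "-ac", "1", audio_out]
    return video_cmd, audio_cmd
```

Each returned command list can be executed with `subprocess.run(cmd, check=True)`; keeping command construction separate from execution makes the step easy to test.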
S120, performing multi-modal picture recognition on the key frame set to obtain a first recognition result, and performing video recognition on the video segment set to obtain a second recognition result.
Here, multi-modal picture recognition can integrate or fuse two or more kinds of picture recognition features. The first recognition result may be a picture recognition result, and the second recognition result may be a video recognition result.
In the embodiments of the present invention, when the key frames in the key frame set corresponding to the pure-video subfile are recognized, multi-modal picture recognition can be applied to the key frame pictures to obtain the first recognition result. Recognizing each video segment in the video segment set corresponding to the pure-video subfile yields the second recognition result.
It should be noted that, in the embodiments of the present invention, the processes of obtaining the first recognition result and the second recognition result are mutually independent and do not affect each other. That is, multi-modal picture recognition and video recognition of the video segments are mutually independent links.
S130, performing audio recognition on the pure-audio subfile to obtain a third recognition result.
Here, the third recognition result may be an audio recognition result.
Accordingly, performing audio recognition on the pure-audio subfile yields the corresponding third recognition result.
S140, obtaining an integrated recognition result corresponding to the video file according to the first recognition result, the second recognition result and the third recognition result.
Here, the integrated recognition result may be a recognition result obtained by integrating the first, second and third recognition results according to a set rule.
In the embodiments of the present invention, after the first, second and third recognition results are obtained, the three recognition results can be integrated to obtain the integrated recognition result corresponding to the video file. Optionally, the union of the first, second and third recognition results can be taken directly as the integrated recognition result.
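The optional union rule above can be sketched in a few lines. The representation of each result as a set of labels is an assumption for illustration; the original text does not fix a data format.

```python
def integrate_results(first, second, third):
    """Integrate the three per-modality recognition results by taking the
    union of their label sets, per the optional rule described above."""
    return sorted(set(first) | set(second) | set(third))
```

A more elaborate integration rule (e.g. score-weighted merging) could replace the union without changing the callers.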
The video recognition method provided by the embodiments of the present invention can be used to audit videos for content not permitted by law or public morality, and to perform intelligent video moderation on illegal content in short videos, live video and long videos, so as to build a good Internet transmission and storage environment. It can effectively solve the problems currently faced by short-video and live-streaming platforms and greatly reduce an enterprise's investment in manual review. Meanwhile, the method has strong customization capability and high flexibility, and can be customized according to user behavior to meet user needs. It can also support the delivery of advertisements strongly associated with the video content, thereby improving advertising effectiveness.
In the embodiments of the present invention, a pure-video subfile and a pure-audio subfile corresponding to a video file to be recognized are obtained, together with a key frame set and a video segment set corresponding to the pure-video subfile; multi-modal picture recognition is performed on the key frame set to obtain a first recognition result, and video recognition is performed on the video segment set to obtain a second recognition result; audio recognition is performed on the pure-audio subfile to obtain a third recognition result; finally, the first, second and third recognition results are integrated to obtain an integrated recognition result for the video file. This solves the problems of existing video moderation technology that the recognized content is limited and the recognition range is small, enriches the recognition types, refines the recognized content, and performs multi-dimensional recognition of video content, thereby improving the richness, accuracy, efficiency and real-time performance of video recognition technology while reducing recognition cost.
Embodiment two
Fig. 2 is a flowchart of a video recognition method provided by Embodiment Two of the present invention. This embodiment is a concrete elaboration of the above embodiment; it gives a specific implementation of obtaining the key frame set and the video segment set corresponding to the pure-video subfile. Accordingly, as shown in Fig. 2, the method of this embodiment may include:
S210, obtaining a pure-video subfile and a pure-audio subfile corresponding to a video file to be recognized, and obtaining a key frame set and a video segment set corresponding to the pure-video subfile.
Accordingly, S210 may specifically include:
S211, filtering the pure-video subfile using a video frame coarse-filtering technique to obtain a filtered video frame set.
Here, the filtered video frame set can be used to store the video frames obtained after the pure-video subfile is filtered.
It is understood that processing every frame image in an entire video stream is very time-consuming and wastes computing resources. Common video processing systems generally subsample the video stream at uniform time intervals to reduce the number of video frames, but this method easily loses certain key frames.
In order to improve the accuracy of key frame extraction, the embodiments of the present invention first filter the pure-video subfile using a video frame coarse-filtering technique, effectively reducing the number of video frames. Specifically, dark frames, blurred frames and low-quality frames in the pure-video subfile can be filtered out so as to retain the video frames whose overall quality is good. The filtered video frame set obtained in this way allows clear, bright and high-quality video frames to be selected as key frames.
Specifically, dark frames can be filtered out by the following formula:
Luminance(I_rgb) = 0.2126·I_r + 0.7152·I_g + 0.0722·I_b
where Luminance(·) denotes image brightness, I_rgb denotes the three-channel RGB natural image, I_r denotes the red-channel image, I_g denotes the green-channel image, I_b denotes the blue-channel image, and r, g and b denote the red, green and blue channels of the three-channel image. After the image brightness of each video frame in the filtered video frame set is calculated by the above formula, video frames whose brightness does not meet the requirement can be filtered out by setting a threshold.
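The dark-frame filter can be sketched directly from the luma formula above. The threshold value is an assumption for illustration; the text leaves it as a configurable parameter.

```python
import numpy as np

def frame_luminance(frame_rgb):
    """Mean brightness of an RGB frame using the Rec. 709 luma weights
    from the formula above."""
    r = frame_rgb[..., 0].astype(float)
    g = frame_rgb[..., 1].astype(float)
    b = frame_rgb[..., 2].astype(float)
    return float(np.mean(0.2126 * r + 0.7152 * g + 0.0722 * b))

def filter_dark_frames(frames, threshold=40.0):
    """Keep only the frames whose mean luminance reaches the threshold."""
    return [f for f in frames if frame_luminance(f) >= threshold]
```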
Blurred frames can be filtered out by the following formula:
Sharpness(I_gray) = Σ_x Σ_y √((Δ_x I_gray)² + (Δ_y I_gray)²)
where Sharpness(·) denotes image sharpness, I_gray denotes the grayscale image, Δ_x denotes the horizontal gradient, Δ_y denotes the vertical gradient, x denotes the horizontal direction and y denotes the vertical direction. After the image sharpness of each video frame in the filtered video frame set is calculated by the above formula, video frames whose sharpness does not meet the requirement can be filtered out by setting a threshold.
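A blurred-frame score can be sketched as a gradient-energy sharpness measure. Note this is a sketch consistent with the symbols defined above (grayscale image, horizontal gradient Δx, vertical gradient Δy); the exact formula in the original filing did not survive conversion, so the squared-sum form here is an assumption.

```python
import numpy as np

def sharpness(gray):
    """Gradient-energy sharpness of a grayscale frame: the summed squared
    horizontal (dx) and vertical (dy) pixel differences. Blurred frames
    have weak gradients and score low."""
    g = gray.astype(float)
    dx = np.diff(g, axis=1)  # horizontal gradient
    dy = np.diff(g, axis=0)  # vertical gradient
    return float(np.sum(dx ** 2) + np.sum(dy ** 2))
```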
Low-quality frames can be filtered out by the following formula:
δ = (1/(M·N)) Σ_{i=1}^{M} Σ_{j=1}^{N} 1[P(i, j) ≥ μ]
where δ denotes image quality, M denotes the number of horizontal pixels, N denotes the number of vertical pixels, i denotes the horizontal coordinate, j denotes the vertical coordinate, P(·) denotes the pixel value, and μ denotes a threshold. After the image quality of each video frame in the filtered video frame set is calculated by the above formula, video frames whose image quality does not meet the requirement can be filtered out by setting a threshold.
In addition, a large number of blurred frames can appear during shot changes in a video. Therefore, unqualified video frames can be further filtered out according to a shot-boundary detection technique.
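The low-quality-frame filter can be sketched as follows. The quality formula itself did not survive conversion of the original filing; this sketch assumes one plausible reading of the symbols defined above, namely that δ is the fraction of the M×N pixels whose value P(i, j) reaches the threshold μ. Both μ and the cut-off on δ are illustrative values.

```python
import numpy as np

def quality_score(gray, mu=10.0):
    """Assumed quality measure: fraction of pixels whose value P(i, j)
    reaches the threshold mu, over an M x N grayscale frame."""
    M, N = gray.shape
    return float(np.sum(gray >= mu)) / (M * N)

def filter_low_quality(frames, mu=10.0, min_delta=0.5):
    """Keep only frames whose quality score delta reaches min_delta."""
    return [f for f in frames if quality_score(f, mu) >= min_delta]
```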
S212, calculating a feature vector corresponding to each filtered video frame in the filtered video frame set, and performing clustering on the filtered video frames in the filtered video frame set according to the feature vectors to obtain at least two clusters, where each cluster includes at least one filtered video frame.
The most common key frame extraction method is clustering. Clustering calculates the visual similarity between video frames and selects, from each cluster, the video frame closest to the cluster center as a key frame. In the embodiments of the present invention, key frames can be extracted according to the feature vectors of the filtered video frames in the filtered video frame set. Specifically, the feature vector corresponding to each filtered video frame can be calculated, and the filtered video frames can be clustered according to these feature vectors to obtain multiple clusters, each of which includes at least one filtered video frame.
In an optional embodiment of the present invention, calculating the feature vector corresponding to each filtered video frame in the filtered video frame set may include: performing feature extraction on each filtered video frame in the filtered video frame set using a convolutional neural network model; or performing feature extraction on each filtered video frame in the filtered video frame set using local binary patterns (LBP), and processing each feature extraction result into a statistical histogram as the LBP feature vector corresponding to each filtered video frame.
In the embodiments of the present invention, a convolutional neural network (CNN) model can be used to perform feature extraction on each filtered video frame in the filtered video frame set. Specifically, a classic CNN architecture such as AlexNet, VGGNet or Inception can be selected to obtain a high-dimensional feature vector representation of the video frame.
It is understood that many frames in a video are highly similar, so features that are easy to compute for a video frame can effectively distinguish the similarity between different frames, such as color and edge histogram features or LBP (Local Binary Pattern) features. Optionally, in the embodiments of the present invention, the LBP feature can be used as the feature descriptor of a video frame. First, the LBP-transformed matrix of the frame is obtained, and the statistical histogram of the LBP codes is then used as the feature vector of the video frame. In order to take the location information of the features into account, the video frame is divided into several small regions, a histogram is computed in each small region (that is, the number of pixels belonging to each pattern in that region is counted), and finally the histograms of all regions are concatenated into one feature vector that is passed to the next processing stage.
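The LBP descriptor with per-region histograms can be sketched as below. The 8-neighbour code, 256-bin histograms and 2×2 region grid are common choices assumed for illustration; the text does not fix these parameters.

```python
import numpy as np

def lbp_image(gray):
    """Basic 8-neighbour LBP code for each interior pixel: each neighbour
    at least as bright as the centre contributes one bit."""
    g = gray.astype(float)
    c = g[1:-1, 1:-1]
    shifts = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]
    code = np.zeros_like(c, dtype=np.int32)
    for bit, (dy, dx) in enumerate(shifts):
        neigh = g[1 + dy:g.shape[0] - 1 + dy, 1 + dx:g.shape[1] - 1 + dx]
        code |= (neigh >= c).astype(np.int32) << bit
    return code

def lbp_histogram(gray, grid=2):
    """Concatenate normalised 256-bin LBP histograms over grid x grid
    regions, preserving coarse location information."""
    code = lbp_image(gray)
    h, w = code.shape
    feats = []
    for i in range(grid):
        for j in range(grid):
            cell = code[i * h // grid:(i + 1) * h // grid,
                        j * w // grid:(j + 1) * w // grid]
            hist, _ = np.histogram(cell, bins=256, range=(0, 256))
            feats.append(hist / max(cell.size, 1))
    return np.concatenate(feats)
```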
S213, obtaining, from each cluster, the filtered video frame with the highest static degree; the selected frames form the key frame set.
In the embodiments of the present invention, optionally, key frames are extracted from the different clusters according to the static degree of the picture. Because the motion compensation used in video compression leads to blur artifacts, pictures with high motion energy are usually more blurred. Therefore, selecting pictures with low motion energy ensures that the extracted key frames are of higher quality. Specifically, the feature vectors of the extracted video frames can first be clustered using the K-means algorithm, and the number of clusters can be set to the number of shots in the video to obtain a better clustering result. Frames in the same cluster share the same subset ID, and the static degree of each picture is calculated separately. The static degree is the inverse of the sum of squared pixel differences between adjacent pictures; from each cluster, the picture with the highest static degree is selected as the key frame of that cluster.
It should be noted that, in the embodiments of the present invention, the purpose of selecting the filtered video frame with the highest static degree is to pick the most representative video frame in each cluster as the key frame. Besides screening each cluster's most representative frame by static degree, any other method that can pick the most representative frame in a cluster can also be used to extract key frames; the embodiments of the present invention place no limitation on the method used for key frame extraction.
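The per-cluster selection by static degree can be sketched as below, assuming the cluster id of each frame has already been computed (e.g. by K-means over the frame feature vectors). Comparing each frame against both neighbours is an illustrative choice; the text only specifies the inverse squared pixel difference to adjacent pictures.

```python
import numpy as np

def static_degree(frames):
    """Inverse of the squared pixel difference to the neighbouring frames;
    nearly static frames score high."""
    scores = []
    for k in range(len(frames)):
        nxt = frames[min(k + 1, len(frames) - 1)].astype(float)
        prv = frames[max(k - 1, 0)].astype(float)
        cur = frames[k].astype(float)
        motion = np.sum((cur - nxt) ** 2) + np.sum((cur - prv) ** 2)
        scores.append(1.0 / (motion + 1e-9))
    return scores

def select_keyframes(frames, cluster_ids):
    """Pick, for each cluster, the index of its most static frame."""
    scores = static_degree(frames)
    best = {}
    for idx, cid in enumerate(cluster_ids):
        if cid not in best or scores[idx] > scores[best[cid]]:
            best[cid] = idx
    return sorted(best.values())
```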
S214, determining a start time and a duration corresponding to each cluster according to the time parameters, in the pure-video subfile, of the filtered video frames included in the cluster, and slicing the pure-video subfile according to the start times and durations to obtain the video segment set.
Accordingly, after clustering the video frames, not only can key frames be extracted, but the video can also be divided into different segments according to the different classes. From the class boundaries and the number of video frames in each class, the start time and duration of each video segment can be obtained, so that the pure-video subfile can be decomposed into short video segments with particular characteristics; the slicing process is thereby completed and the video segment set is obtained. The video segment set can be used for video recognition.
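Deriving segment start times and durations from the class boundaries can be sketched as follows, assuming frames are in temporal order, each carries a cluster id, and the frame rate is known; these assumptions go beyond what the text states explicitly.

```python
def segments_from_clusters(cluster_ids, fps=25.0):
    """Turn a per-frame cluster id sequence into (start_time, duration)
    slices: a new segment starts whenever the cluster id changes."""
    segments = []
    start = 0
    for i in range(1, len(cluster_ids) + 1):
        if i == len(cluster_ids) or cluster_ids[i] != cluster_ids[start]:
            segments.append((start / fps, (i - start) / fps))
            start = i
    return segments
```

The resulting (start, duration) pairs are exactly what a slicing tool needs to cut the pure-video subfile into segments.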
S220, performing multi-modal picture recognition on the key frame set to obtain a first recognition result, and performing video recognition on the video segment set to obtain a second recognition result.
S230, performing audio recognition on the pure-audio subfile to obtain a third recognition result.
S240, obtaining an integrated recognition result corresponding to the video file according to the first recognition result, the second recognition result and the third recognition result.
With the above technical solution, key frames are extracted through operations such as video frame coarse filtering, video frame feature extraction and key frame extraction, which guarantees that the key frames satisfy performance indicators highly relevant to the video content; slicing the pure-video subfile yields the video segment set used for video recognition, thereby realizing multi-dimensional recognition of the video content.
Embodiment three
Fig. 3 a is a kind of flow chart for video frequency identifying method that the embodiment of the present invention three provides, and Fig. 3 b is the embodiment of the present inventionA kind of three bounding box sizes and position prediction effect diagram provided, Fig. 3 c are a kind of faces that the embodiment of the present invention three providesDetection effect schematic diagram, Fig. 3 e are a kind of flow charts for video frequency identifying method that the embodiment of the present invention three provides, and Fig. 3 g is this hairA kind of flow chart for video frequency identifying method that bright embodiment three provides.The present embodiment is carried out specifically based on above-described embodimentChange, in the present embodiment, gives the specific implementation for obtaining each recognition result.Correspondingly, as shown in Figure 3a, the present embodimentMethod may include:
S310, simple video subfile corresponding with video file to be identified and simple audio subfile are obtained, andObtain key frame set corresponding with the simple video subfile and video clip set.
Wherein, after simple video subfile in video file to be identified is identified available first recognition result andSecond recognition result;Available third recognition result after simple audio subfile is identified.
Accordingly, performing multi-modal picture recognition on the key frame set to obtain the first recognition result may specifically include the following two operations:
S320, performing picture classification on each key frame in the key frame set using a preset picture classification model, and taking the classification result as the first recognition result.
Here, the preset picture classification model may be a network model trained in advance to perform picture classification on key frames.
In the embodiments of the present invention, the training data used to train the preset picture classification model mainly comes from two sources: first, a background database including a self-annotated label data set of more than 20,000 classes; second, public data sets such as ImageNet. Because pictures are rich and varied in content, it is difficult to discriminate all categories accurately with a single model. Therefore, the embodiments of the present invention can use a multistage classification model to solve the problem of accurate recognition: the first stage separates major classes, such as quotations, sports and dishes; the second stage performs finer classification, for example subdividing the sports class into basketball and football. Where appropriate, a third stage of classification can also be applied, for example identifying which two teams are playing in a basketball game. Each stage of classifiers can, according to the actual situation, complete classification using CNN-based classification, object detection, OCR (Optical Character Recognition) and other methods. According to the actual situation of the image content, a series of processes for building the training data set are completed, including task formulation, picture crawling, picture annotation and quality inspection, so as to guarantee recognition quality.
A ResNet network can use residual learning to solve the degradation problem. The content that residual learning needs to learn is relatively small, so the learning difficulty is low and good results are easy to obtain; experiments show that, as depth increases, the results produced by a ResNet network are much better than those of earlier traditional networks. ResNet not only performs very well on the ImageNet data set but also performs well on data sets such as COCO, which shows that ResNet can serve as a general model. Therefore, in the embodiments of the present invention, ResNet can be used as the CNN network model when performing picture classification; further, ResNet-34 can be trained as the base model.
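The multistage dispatch described above can be sketched as a small cascade. The stub models below stand in for the actual ResNet-based classifiers, object detectors or OCR stages; treating each stage as a callable and keying fine-grained models by major class is an illustrative design, not part of the original disclosure.

```python
def multistage_classify(image, coarse_model, fine_models):
    """Two-stage cascade: the first-stage model predicts a major class;
    if a dedicated fine-grained model exists for that class, it refines
    the prediction, otherwise only the major class is returned."""
    major = coarse_model(image)
    fine = fine_models.get(major)
    return (major, fine(image) if fine else None)

# Stub models standing in for the trained ResNet-based classifiers
coarse = lambda img: "sport"
fines = {"sport": lambda img: "basketball"}
```

Only the classes that need refinement carry a second-stage model, so adding a new fine-grained category does not touch the first stage.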
S330: input each key frame in the key frame set into a YOLOv3 model trained in advance, and obtain from the output of the YOLOv3 model the target object identifier corresponding to each key frame and the position coordinates of the target object in the key frame as the first recognition result.
The target object can be an object other than a face, such as an animal, an automobile or a knife. The target object identifier can be a label in the label list used for picture or video recognition. Illustratively, the label list includes but is not limited to: (1) obscenity and pornography, including live-action pornography, live-action sexiness, animated pornography, animated sexiness and some special categories; (2) blood and violence, including propaganda for violent or terrorist organisations, bloody scenes of violence, and fighting; (3) political sensitivity, including politically sensitive persons and scenes; (4) prohibited goods, including drug trafficking, controlled knives, and army and police articles; (5) vulgar content, including bare torsos, smoking, vulgar venues and tattoos. In embodiments of the present invention, the label list can be updated. Fully training the entire video recognition model usually takes several weeks; because the label list is updated frequently, retraining the video recognition model with every label list update is clearly very time-consuming. In order to shorten the training time, model iteration can be carried out with transfer learning: starting from a fully trained model, only some of its neural network layers are fine-tuned so that it can recognise the new categories, which greatly saves training time and training resources. The specific steps are as follows: (1) change the number of nodes in the softmax layer to the new number of labels, leaving the rest of the network structure unchanged; (2) load the weights of the previously trained model; (3) retrain the model, which now takes substantially less time.
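The three transfer-learning steps above can be sketched framework-free as follows; the two-layer model, its shapes and names are hypothetical stand-ins for the patent's actual network.

```python
import numpy as np

rng = np.random.default_rng(0)

# Step (2): "load" the weights of a previously trained model (hypothetical shapes).
pretrained = {
    "backbone": rng.standard_normal((64, 16)),  # kept frozen during fine-tuning
    "head":     rng.standard_normal((16, 10)),  # old softmax layer: 10 labels
}

def adapt_to_new_labels(model, num_new_labels):
    """Step (1): re-create only the softmax/head layer with the new label count;
    the backbone weights are reused untouched."""
    adapted = dict(model)
    in_dim = model["head"].shape[0]
    adapted["head"] = rng.standard_normal((in_dim, num_new_labels)) * 0.01
    return adapted

new_model = adapt_to_new_labels(pretrained, 25)   # label list grew to 25 entries

# Step (3): retraining would now update only new_model["head"].
print(new_model["head"].shape)                     # → (16, 25)
print(np.shares_memory(new_model["backbone"], pretrained["backbone"]))  # → True
```

Only the freshly initialised head needs gradient updates, which is what makes iteration over a fast-changing label list cheap.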
Target detection locates and classifies the multiple objects in a picture: localisation marks where each object is in the picture, and classification gives the category of each object. Target detection handles multiple targets at once, which improves the speed and accuracy of video recognition. In embodiments of the present invention, when the target objects in each key frame are identified, a YOLOv3 model trained in advance can be used as the target detection network. The YOLOv3 model performs end-to-end detection without region proposals, combining target discrimination and target recognition into one step, which substantially improves recognition performance. After the pre-trained YOLOv3 model has identified the target objects in each key frame, the objects can be identified with labels from the label list, and the recognition result of each key frame's target objects can be matched one by one, according to the key frame's position, to the corresponding video clip in the video clip set. Illustratively, assume the YOLOv3 model identifies a controlled knife in the 3rd key frame, and that key frame is recognised as belonging to the 2nd video clip; then the recognition result of the 3rd key frame can be matched to the 2nd video clip.
The YOLOv3 model used in the embodiment of the present invention introduces a residual structure to construct the new Darknet-53. It also detects repeatedly across scales, with three different anchors set on feature maps of three different scales. In addition, it does not use softmax but per-class binary cross-entropy loss, classifying each class independently. The main process of target detection with the YOLOv3 model includes the following aspects:
(1) Bounding box prediction
The input picture is divided into S*S grid cells, and fixed anchor boxes are obtained by clustering; four coordinate values (t_x, t_y, t_w, t_h) are then predicted for each bounding box. For the cell making the prediction, the bounding box can be predicted from the offset (c_x, c_y) of the cell relative to the upper-left corner of the image, together with the prior width p_w and height p_h of the anchor box. A mean square error loss function can be used for these coordinate values when training the YOLOv3 model, and an objectness score is predicted for each bounding box by logistic regression. If the predicted bounding box overlaps the ground-truth box more than any other prediction does, its objectness score is 1. If the overlap does not reach a threshold (the threshold can be set to 0.5), the predicted bounding box is ignored and contributes no loss.
Fig. 3b is a schematic diagram of the bounding box size and position prediction effect provided by embodiment three of the present invention. With reference to Fig. 3b, the bounding box can be predicted with the following formulas:

b_x = σ(t_x) + c_x

b_y = σ(t_y) + c_y

b_w = p_w · e^(t_w)

b_h = p_h · e^(t_h)

where b_x and b_y denote the centre coordinates of the bounding box, b_w denotes the bounding box width, b_h denotes the bounding box height, t_x, t_y, t_w and t_h denote the four coordinate values predicted by the network model that generates the bounding box, c_x and c_y denote the offset of the grid cell from the upper-left corner of the image, p_w and p_h denote the width and height of the prior (anchor) bounding box, and σ(·) denotes the sigmoid activation function.
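As a check on these formulas, the decoding step can be sketched in a few lines of NumPy; the function name and the toy cell/prior values are illustrative, not from the patent.

```python
import numpy as np

def decode_box(t, cell, prior):
    """Decode YOLOv3 raw outputs (t_x, t_y, t_w, t_h) into a box (b_x, b_y, b_w, b_h).

    cell  = (c_x, c_y): grid-cell offset from the image's upper-left corner
    prior = (p_w, p_h): width/height of the clustered anchor box
    """
    sigmoid = lambda v: 1.0 / (1.0 + np.exp(-v))
    t_x, t_y, t_w, t_h = t
    c_x, c_y = cell
    p_w, p_h = prior
    b_x = sigmoid(t_x) + c_x
    b_y = sigmoid(t_y) + c_y
    b_w = p_w * np.exp(t_w)
    b_h = p_h * np.exp(t_h)
    return b_x, b_y, b_w, b_h

# Zero raw outputs: the centre sits mid-cell and the box equals its anchor prior.
print(decode_box((0.0, 0.0, 0.0, 0.0), cell=(3, 4), prior=(2.0, 5.0)))
# → (3.5, 4.5, 2.0, 5.0)
```

Note how the sigmoid keeps the centre inside its grid cell while the exponential keeps the width and height positive.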
(2) Class prediction
Each bounding box is classified with multiple labels. The single-label multi-class softmax layer is therefore replaced with logistic regression layers for multi-label classification, and a binary classification is performed for each class with a simple logistic regression layer. The logistic regression layer mainly uses the sigmoid function, which constrains its input to the range 0 to 1; after feature extraction, if the sigmoid-constrained output for a certain class is greater than 0.5, the picture is regarded as belonging to that class.
(3) Prediction across scales
The YOLOv3 model makes predictions by fusing multiple scales, predicting boxes at three different scales. The priors of the bounding boxes are obtained by clustering: 9 clusters and 3 scales are selected, and the 9 clusters are distributed evenly across these scales. The feature extraction model is modified with an FPN (Feature Pyramid Network), and the final prediction is a 3-d tensor that contains the bounding box information, the objectness information and the prediction information of multiple classes. In this way the YOLOv3 model gains access to richer semantic information.
(4) Feature extraction
In embodiments of the present invention, the YOLOv3 model uses the DarkNet-53 network as its feature extraction layer. On the one hand the network is essentially fully convolutional, with feature map downsampling done by convolutional layers; on the other hand the residual structure is introduced, which reduces the difficulty of training deep layers so that the network can reach 53 layers. Multiple 3*3 and 1*1 convolutional layers are used, improving network accuracy.
The YOLOv3 model in the embodiment of the present invention can improve recognition in multi-target, multi-label scenes and of small targets.
(5) Training
In embodiments of the present invention, a variety of methods, such as data augmentation, can be used when training the YOLOv3 model.
S340: carry out face recognition on each key frame of the key frame set to obtain the first recognition result.
Correspondingly, S340 can specifically include:
S341: carry out face detection on each key frame in the key frame set using the S3FD algorithm.
It can be understood that face detection is the first step of face recognition and is particularly important to it. Traditional face detection algorithms include face detection based on geometric features, face detection based on eigenfaces, face detection based on elastic graph matching, and face detection based on SVM (Support Vector Machine). Although these methods can detect faces, they suffer many false detections and missed detections, their detection effect against complex backgrounds is very poor, and they do not adapt to changes in illumination, angle and the like. To solve the above problems, the embodiment of the present invention uses the deep-learning-based S3FD (Single Shot Scale-invariant Face Detector) algorithm. The S3FD algorithm is especially suitable for small-face detection.
Specifically, the S3FD algorithm uses the different receptive fields of different convolutional layers to detect faces of different scales. The base network of the algorithm is VGG16, and a VGG16 pre-trained model can be loaded to accelerate network training. In order to detect faces at more scales simultaneously, the S3FD algorithm adds 6 convolutional layers on top of VGG16, which are ultimately used as the face detection layers. The S3FD algorithm makes two main improvements: 1) based on the difference between the theoretical receptive field and the effective receptive field, it improves the way anchors are proposed; 2) in order to detect small faces better, it adds more layers and scales. The embodiment of the present invention trains the VGG16 network on the open-source face data set WIDER FACE together with face data collected for its own needs; the detection effect is shown in Fig. 3c, from which it can be seen that the face detection of the embodiment of the present invention works well.
S342: carry out face key point location on the detected faces with the MTCNN algorithm to obtain the face key points.
Face key point location is the key to face alignment: the positions of the left and right eyes, the left and right corners of the mouth, and the nose need to be located, and the accuracy of face key point location greatly influences the effect of face feature extraction. Traditional face key point location methods are mostly based on local features of the face; their locating effect is unsatisfactory, their generalisation ability is poor, and they do not adapt to changes in influencing factors such as angle and illumination. To solve the above problems, the embodiment of the present invention uses the MTCNN (Multi-task Cascaded Convolutional Networks) algorithm to locate face key points. MTCNN is a convolutional neural network with a cascade structure, divided into three parts: P-Net, R-Net and O-Net. MTCNN can be regarded as three independent convolutional neural networks in series; the three networks complete the same task and differ only slightly in network structure. The main idea of the MTCNN algorithm is to cascade multiple networks and continuously optimise the same task: P-Net obtains a rough result, R-Net then improves it, and finally O-Net improves the result of R-Net again. This continual refinement makes the key point location increasingly accurate. The embodiment of the present invention trains the MTCNN network on the open-source data sets WIDER FACE and CelebA together with face data collected for its own needs. Fig. 3d is a schematic diagram of the effect of face key point location provided by embodiment three of the present invention; as shown in Fig. 3d, the face key point location of the embodiment of the present invention works well.
S343: carry out feature extraction on the face image with the Arcface algorithm according to the face key points.
Face features are extracted primarily so that faces can be compared: the features of the same person should be very similar, while the features of different faces should have very low similarity. Because all extracted features belong to the single broad category of faces, the key to feature extraction is making the features of the same person's face as similar as possible while keeping the discrimination between different faces as large as possible. In order to increase the discrimination between different faces, the embodiment of the present invention uses the Arcface algorithm to improve the loss function of the classification network (i.e. the deep neural network), increasing the discrimination between different classes, so that a good classification effect is retained even when the training set is unbalanced and the number of classes is large.
The Arcface algorithm directly maximises the classification boundary in angular space; that is, it modifies the original softmax loss function of the classification network and converts it into a loss expressed in angular space. The original softmax loss calculation formula is as follows:

L1 = -(1/m) · Σ_{i=1..m} log( e^(W_{y_i}^T x_i + b_{y_i}) / Σ_{j=1..n} e^(W_j^T x_i + b_j) )

The loss calculation formula after the Arcface improvement is as follows (the additive angular margin is written m_a here to avoid clashing with the batch size m):

L2 = -(1/m) · Σ_{i=1..m} log( e^(s·cos(θ_{y_i} + m_a)) / ( e^(s·cos(θ_{y_i} + m_a)) + Σ_{j≠y_i} e^(s·cos θ_j) ) )

where L1 denotes the original loss function, m denotes the batch size, n denotes the number of classes, i and j are natural numbers, W_{y_i} denotes the y_i-th column of the last fully connected layer for the i-th sample, x and y denote the feature vector and the class, x_i denotes the deep feature of the i-th sample and y_i denotes the class to which the i-th sample belongs, T denotes the transpose operation, b_{y_i} denotes the y_i-th entry of the bias term of the last fully connected layer, b denotes the bias term of the last fully connected layer, W_j denotes the j-th column of the last fully connected layer, b_j denotes the j-th entry of the bias term of the last fully connected layer, s denotes the normalised ||x||, θ_{y_i} denotes the angle between W_{y_i} and x_i, and θ_j denotes the angle between W_j and x_i.
Compared with the original softmax loss, the features extracted with the Arcface algorithm possess better performance and larger inter-class distances, and retain a good discrimination effect even when the number of classes is large.
S344: match the extracted face features against the features in the feature database, identify the person information corresponding to each key frame according to the matching result, and use the identification result of the face information as the first recognition result.
In embodiments of the present invention, after the face features in the key frame pictures are extracted, a face feature database of the persons to be recognised must be built, storing the person information together with the corresponding face features. In the recognition phase, the features extracted from the face to be detected are matched against the face features in the database, and the recognition result is given according to the matching similarity. Under normal conditions, there are two ways to measure feature vector similarity: Euclidean distance and cosine distance. Because the fluctuation range of the Euclidean distance is large, it is difficult to define similarity with a fixed threshold, so the embodiment of the present invention uses the cosine distance to describe feature similarity. The cosine distance ranges over [-1, 1], so a demarcation threshold can be determined very conveniently. As to the matching principle, the embodiment of the present invention uses a matching algorithm that combines the nearest-match method with the threshold method. The algorithm first calculates the similarity between the feature to be identified and the features in the feature database, and takes the person class of the most similar feature as the class of the feature to be identified; it then judges whether this similarity is greater than the set threshold. If it is greater than the threshold, the feature is recognised as that person class; if it is less than the threshold, the feature is judged not to belong to any person class in the feature database.
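The nearest-match-plus-threshold procedure can be sketched as follows; the 4-dimensional vectors, gallery names and the 0.6 threshold are hypothetical stand-ins for real Arcface embeddings and a tuned threshold.

```python
import numpy as np

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def match_face(query, gallery, threshold=0.6):
    """Nearest match + threshold: return the best-matching identity,
    or None when even the best similarity is below the threshold."""
    best_name, best_sim = None, -1.0
    for name, feat in gallery.items():
        sim = cosine_similarity(query, feat)
        if sim > best_sim:
            best_name, best_sim = name, sim
    return (best_name if best_sim >= threshold else None), best_sim

# Hypothetical 4-d features standing in for Arcface embeddings.
gallery = {"person_a": np.array([1.0, 0.0, 0.0, 0.0]),
           "person_b": np.array([0.0, 1.0, 0.0, 0.0])}
print(match_face(np.array([0.9, 0.1, 0.0, 0.0]), gallery))   # matches person_a
print(match_face(np.array([0.5, 0.5, 0.5, 0.5]), gallery))   # below threshold → None
```

The threshold is what turns open-set recognition into a sound decision: a face absent from the database is rejected rather than forced onto its nearest neighbour.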
It should be noted that, before multimodal recognition is carried out on the key frame pictures, the input key frame pictures also need to be pre-processed. Pre-processing mainly refers to standardising the pictures and scaling them to the same size as model input. The first stage feeds the different-category models that classify the first-level labels, and the second stage carries out finer recognition for one or a few major classes. Such a classification framework is very easy to extend, and some labels can be assisted by the target detection model and the face recognition model.
It should be noted that Fig. 3a is only a schematic diagram of one implementation. There is no sequential relationship between S320 and S330: S320 can be implemented first and then S330, S330 can be implemented first and then S320, the two can be implemented in parallel, or only one of them can be implemented.
Correspondingly, as shown in Fig. 3e, obtaining the second recognition result can specifically include the operations described below:
S350: carry out video recognition on the video clip set to obtain the second recognition result.
Specifically, S350 may include the operations described below:
S351: carry out time-domain down-sampling on each video clip in the video clip set respectively to obtain the sampled video frame set corresponding to each video clip.
The sampled video frame set can be used to store the video frames obtained by sampling according to a set rule.
Fig. 3f is a schematic diagram of a video recognition system provided by embodiment three of the present invention. As shown in Fig. 3f, in the embodiment of the present invention, 3D convolutional neural network (3DCNN) technology is used to accurately identify the different types of action in the video clips. The actions to be identified may include more than 20 bad or vulgar actions such as fighting, smoking, drinking, the 'society shake' and the seaweed dance, and may also include more than 100 conventional actions such as eating, rock climbing, jumping, playing football and kissing. Different actions require different resolution accuracy; for example, dancing is more likely a global action, while smoking is more likely a local action. In order to meet these different resolution demands, the embodiment of the present invention can construct a high-resolution 3DCNN network and a low-resolution 3DCNN network. Under normal conditions, the time-domain characteristics can be used in two ways: one is to use the original picture frames directly as the input of the 3DCNN; the other is to extract the x gradient, y gradient and optical flow features between picture frames as the input of the 3DCNN. It should be noted that when the 3DCNN is trained, for multi-class problems, the 3DCNN can use the multi-class cross-entropy loss.
A video sequence is a set of time-correlated images. In the time domain, the time interval between consecutive frames is very small; especially at higher time-domain sampling rates such as 25 fps, 30 fps, 50 fps and 60 fps, the correlation between consecutive frames is very high, and each input sample of the 3DCNN also requires a fixed time-domain frame count. In embodiments of the present invention, the time-domain frame count can be fixed at 16 frames, which provides the prerequisite for conversion between different frame rates. Specifically, as shown in Fig. 3f, the M1 module in the video recognition system of the embodiment of the present invention can use the following two sampling modes:
Mode (1): assuming the frame rate of the original video sequence is Q and the frame rate after sampling is P, down-sampling can be carried out based on time distance, with the conversion formula σ_i = λ·θ_{k+1} + (1-λ)·θ_k, where σ_i denotes the i-th video frame after down-sampling, λ denotes the interpolation weight, θ_{k+1} denotes the (k+1)-th frame image of the original video, θ_k denotes the k-th frame image of the original video, i is the frame index after down-sampling, and k and k+1 are the two adjacent frames of the original video sequence that straddle the sampling instant (so that, for the time position i·Q/P, k is its integer part and λ its fractional part). The down-sampled video frame sequence is then σ = [σ_1, σ_2 … σ_M], where the value of M is 16.
Mode (2): take 16 consecutive frames of the original video at a time, with an overlap of 8 frames between two adjacent 16-frame fragments; that is, the original video segment can be divided into multiple 16-frame fragments that overlap each other by 8 frames.
Mode (1) can guarantee the global coverage of the video clip, while mode (2) can guarantee the locality and information integrity of the video clip. The samples generated by both modes can take the label of the original video segment, which facilitates training.
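The two sampling modes can be sketched as follows; the reading of k and λ as the integer and fractional parts of i·Q/P is an assumption, and the 2*2 toy frames stand in for real video frames.

```python
import numpy as np

def downsample_interp(frames, q, p):
    """Mode (1) sketch: time-distance linear interpolation from frame rate q to p.

    For the i-th output frame the source time position is i*q/p; k is its integer
    part and lam its fractional part (an assumed reading of the patent's formula).
    """
    out = []
    i = 0
    while True:
        pos = i * q / p
        k = int(pos)
        if k + 1 >= len(frames):
            break
        lam = pos - k
        out.append(lam * frames[k + 1] + (1 - lam) * frames[k])
        i += 1
    return out

def clips_with_overlap(num_frames, clip_len=16, stride=8):
    """Mode (2): start indices of 16-frame clips overlapping by 8 frames."""
    return list(range(0, num_frames - clip_len + 1, stride))

frames = [np.full((2, 2), float(t)) for t in range(32)]   # toy 32-frame video
halved = downsample_interp(frames, q=30, p=15)            # keep every 2nd frame
print(len(halved), halved[1][0, 0])                       # → 16 2.0
print(clips_with_overlap(32))                             # → [0, 8, 16]
```

Halving a 30 fps clip lands each output exactly on an original frame (λ = 0); a non-integer ratio would blend the two straddling frames instead.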
S352: carry out the set processing operations on the sampled video frame set in space and time to obtain at least two types of input image, wherein the set processing operations include scaling, optical flow extraction and edge image extraction, and the types of input image include high-resolution images, low-resolution images, optical flow images and edge images.
Correspondingly, as shown in Fig. 3f, the M2 module in the video recognition system of the embodiment of the present invention is responsible for the set processing of the video frames in the sampled video frame set. In embodiments of the present invention, the M2 module provides 3 processing modes, namely scaling, optical flow extraction and edge image extraction. Scaling may further include two scaling manners: in order to meet the demand for high resolution, the original image is resized in the spatial domain to 224*224*3, while the low-resolution image is resized to 112*112*3. Optical flow is the significant information of object movement in the time domain; it uses the temporal variation of pixels in the image sequence and the correlation between consecutive frames to find the correspondence between the previous frame and the current frame, and this correspondence between consecutive frames is regarded as the motion information of objects. Optionally, the embodiment of the present invention calculates optical flow with OpenCV's cv2.calcOpticalFlowPyrLK() function. The edge image carries the structural attributes of the image and the significant information of object movement in the spatial domain. Optionally, the embodiment of the present invention uses the Canny operator to extract the edge image, and computes the edge features separately for the three RGB channels. The calculation process of Canny is: 1) use a Gaussian filter to filter out noise and smooth the image; 2) calculate the gradient intensity and direction of each pixel in the image; 3) apply non-maximum suppression to eliminate the spurious responses brought by edge detection; 4) apply double-threshold detection to determine true and potential edges; 5) finally complete edge detection by suppressing isolated weak edges.
S353: input each type of input image into the corresponding 3DCNN network, identify the input images with the 3DCNN networks, and obtain the outputs of the 3DCNN networks, namely the output probability values of the video labels corresponding to the input images.
Correspondingly, after the four types of input image are obtained, each type of input image can be input into the corresponding 3DCNN network. As shown in Fig. 3f, the 3DCNN network may include modules M3, M4 and M5. The M3 module is the backbone network of the 3DCNN; the inputs of 3DCNN1, 3DCNN2, 3DCNN3 and 3DCNN4 are the high-resolution images, low-resolution images, optical flow images and edge images respectively. For the convolution kernels of the 3DCNN network, the spatial sizes 3*3 and 5*5 can be chosen, with multiple small convolution kernels connected in series. Max pooling is used for pooling; the time-domain pooling size starts at 1 and is then gradually increased, successively 2, 3 and 4. This setting prevents the time-domain information from being fused prematurely. The M4 module is the fully connected layer (FC); in order to prevent the network parameters from becoming excessive, the embodiment of the present invention uses only one fully connected layer. The M5 module is essentially also a fully connected layer, but its node number is the number of classes, and what it predicts is the output probability value of each class for the current 3DCNN's type of input image.
S354: fuse the output probability values of each video label according to the set fusion mode, and combine the video labels obtained after fusion into the second recognition result.
Correspondingly, as shown in Fig. 3f, the Output module fuses the output probability values of the preceding 4 3DCNNs. The fusion mode multiplies each network's probability value for a class by that network's weight and sums the results, giving the fused output probability value of each class; at prediction time, the output of this layer is taken as the result.
Illustratively, assume the classification model has two classes, and the output probability values of the 4 3DCNNs are respectively: dancing: 0.9, smoking: 0.1; dancing: 0.8, smoking: 0.2; dancing: 0.9, smoking: 0.1; dancing: 0.7, smoking: 0.3. The weight corresponding to each 3DCNN's output probability value defaults to 0.25, so the second recognition result corresponding to the video clip can be: dancing: 0.825, calculated as 0.9*0.25+0.8*0.25+0.9*0.25+0.7*0.25=0.825; smoking: 0.175, calculated as 0.1*0.25+0.2*0.25+0.1*0.25+0.3*0.25=0.175.
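The worked example above can be reproduced with a one-line weighted fusion; this is a sketch of the fusion step only, assuming the equal 0.25 weights from the example.

```python
import numpy as np

def fuse(probs, weights):
    """Weighted fusion of per-class probabilities from several 3DCNN networks.

    probs:   (num_networks, num_classes) output probability values
    weights: (num_networks,) per-network weights summing to 1
    """
    return np.asarray(weights) @ np.asarray(probs)

# The worked example: 4 networks, classes [dancing, smoking], equal weights.
probs = [[0.9, 0.1],
         [0.8, 0.2],
         [0.9, 0.1],
         [0.7, 0.3]]
fused = fuse(probs, [0.25, 0.25, 0.25, 0.25])
print(fused)   # → approximately [0.825 0.175]
```

Non-uniform weights would let the fusion favour, say, the optical flow network for motion-dominated labels.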
It should be noted that, in embodiments of the present invention, a video clip is a segment of the simple video subfile whose content is relatively uniform after slicing. A long video segment may, for example, show smoking in the first 10 seconds and playing football in the next 10 seconds; such a video can be cut into two video clips. For a video in which smoking and playing football both appear within the same 10 seconds, no cut is needed, and the two labels "smoking" and "playing football" serve as the second recognition result of the video.
Correspondingly, as shown in Fig. 3g, obtaining the third recognition result can specifically include the operations described below:
S360: carry out audio recognition on the simple audio subfile to obtain the third recognition result.
Specifically, S360 may include the operations described below:
S361: pre-process the simple audio subfile and then carry out a fast Fourier transform to obtain the frequency-domain information of the simple audio subfile.
In embodiments of the present invention, when audio recognition is carried out on the simple audio subfile, the simple audio subfile can first be pre-processed (including audio signal pre-emphasis, signal windowing and the like), after which the frequency-domain information of the simple audio subfile is obtained with the fast Fourier transform (FFT). The fast Fourier transform is a fast algorithm for the discrete Fourier transform; it optimises the computation according to the even-odd symmetry of the discrete Fourier transform, reducing the complexity from O(n^2) to O(n·log n). The fast Fourier transform formula can be expressed as:

X(k) = Σ_{n=0..N-1} x(n) · e^(-j·2πnk/N), k = 0, 1, …, N-1

where X(k) denotes the signal sequence after the Fourier transform, x(n) denotes the sampled discrete audio sequence, n denotes the index in the audio sequence, k denotes the index in the frequency-domain sequence, and N denotes the length of the Fourier transform interval.
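A minimal check of the transform above, using NumPy's FFT (a stand-in for whatever FFT routine the audited system actually uses): a pure 8-cycle cosine over N = 64 samples should peak at frequency bin k = 8.

```python
import numpy as np

N = 64
n = np.arange(N)
x = np.cos(2 * np.pi * 8 * n / N)       # 8 full cycles across the window
X = np.fft.fft(x)                       # X(k) = sum_n x(n) e^{-j 2 pi n k / N}
magnitude = np.abs(X[: N // 2])         # keep the non-mirrored half
print(int(np.argmax(magnitude)))        # → 8
```

In the real pipeline this transform is applied per frame after pre-emphasis and windowing, yielding the per-frame frequency-domain information used in S362.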
S362: calculate the energy spectrum corresponding to the frequency-domain information of each frame of the simple audio subfile.
Correspondingly, after the frequency-domain information of the simple audio subfile is obtained, a modulus-square operation can be carried out on the frequency-domain information to calculate the energy spectrum of each frame signal, and the signal is then filtered with the Mel filter bank to calculate the Mel-filtered energy. Expressing the signal in complex form:

X(k) = a·e^(jθ_k) = a·cos θ_k + j·a·sin θ_k = a_k + j·b_k

the signal energy spectrum is then expressed as:

E(k) = |X(k)|^2 = a_k^2 + b_k^2

where E(k) denotes the energy of the k-th frequency component, which is subsequently weighted by the Mel filter bank to obtain the Mel-filtered energy.
S363: obtain the log Mel spectral energy according to the energy spectrum.
Log-spectral features retain high-frequency signals well and give more stable audio recognition performance in complex scenes. Therefore, in embodiments of the present invention, audio recognition can be carried out on the simple audio subfile according to the log Mel spectral energy.
Specifically, the log Mel spectral energy can be calculated by the following formula:

E(n) = log( Σ_k C(k) · H_n(k) )

where E(n) denotes the log Mel spectral energy corresponding to the n-th Mel filter, C(k) denotes the energy of the k-th segment of the audio signal, and H_n(k) denotes the frequency response of the n-th Mel filter.
S364: extract the log Mel spectral features according to the log Mel spectral energy.
Correspondingly, in embodiments of the present invention, in the processing stage of the log Mel spectral features, the Librosa library can be used to extract the audio features. The sample rate is set to 44100 Hz, the frame length to 30 ms, and the pre-emphasis coefficient to 0.89, with a Hamming window function. The Mel spectral feature coefficients have 32 dimensions; their first-order and second-order difference features are calculated, constituting a 96-dimensional feature vector in total.
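The 32 → 96 dimension construction can be sketched framework-free as follows; random values stand in for real log-Mel coefficients, and np.gradient is used as an assumed approximation of Librosa's delta computation.

```python
import numpy as np

def with_deltas(mel_feats):
    """Stack 32-dim log-Mel features with their first- and second-order
    differences into the 96-dim vector described above. NumPy stand-in for the
    Librosa pipeline; np.gradient approximates the difference features."""
    delta1 = np.gradient(mel_feats, axis=0)     # first-order difference over frames
    delta2 = np.gradient(delta1, axis=0)        # second-order difference
    return np.concatenate([mel_feats, delta1, delta2], axis=1)

rng = np.random.default_rng(0)
log_mel = rng.standard_normal((100, 32))        # 100 frames, 32 Mel coefficients
features = with_deltas(log_mel)
print(features.shape)                           # → (100, 96)
```

The deltas add short-term dynamics to each frame, which is what makes the 96-dim vector more informative than the static 32-dim spectrum alone.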
S365: reconstruct the log Mel spectral features to obtain two-dimensional audio features.
Fig. 3h is a schematic diagram of the log Mel spectral features provided by embodiment three of the present invention. As shown in Fig. 3h, in the embodiment of the present invention, after the log Mel spectral features are obtained, the one-dimensional log Mel spectral audio features are reconstructed to obtain a two-dimensional audio feature distribution, with feature map dimensions of (number of frequency bands * audio frame length).
The operations of S361-S365 above belong to the audio feature pre-processing; the operation of S366 below belongs to data classification based on deep learning.
S366: carry out feature extraction and audio classification on the two-dimensional audio features with CNN basic structural units.
In embodiments of the present invention, CNN basic structural units can be used to realise deeper feature extraction and audio classification. With its local connection and parameter sharing mechanisms, the convolutional layer not only reduces computational cost but also retains the spatial distribution regularities of the data. The basic structural unit mainly comprises a convolutional layer, a pooling layer and an activation function; the classification layer uses the Softmax function, the loss function uses the cross-entropy loss function, the initial learning rate is set to 0.001, the batch size is 128, and the stochastic gradient descent optimisation algorithm is used.
In an alternative embodiment of the present invention, when the CNN basic structural units use multiple classifiers, the simple audio subfile is identified by voting.
Illustratively, when 3 classifiers are used, suppose the obtained recognition results are respectively: label 1: smoking, confidence: 0.9, weight: 0.8; label 2: smoking, confidence: 0.8, weight: 0.1; label 3: dancing, confidence: 0.5, weight: 0.1. The voting mode may then be: take the label on which label 1 and label 2 agree, i.e. smoking; the corresponding final recognition result may then be: label: smoking, confidence: 0.8*0.9 + 0.1*0.8 + 0.1*0 = 0.8.
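This voting scheme can be sketched as follows, reproducing the worked example; treating a non-voting classifier as contributing zero confidence to the winning label is the reading assumed from the example.

```python
from collections import defaultdict

def weighted_vote(results):
    """Weighted voting over per-classifier (label, confidence, weight) triples.

    Each classifier contributes weight*confidence to its own label and
    weight*0 to the others; the label with the highest total wins.
    """
    scores = defaultdict(float)
    for label, confidence, weight in results:
        scores[label] += weight * confidence
    best = max(scores, key=scores.get)
    return best, scores[best]

# The worked example: two classifiers vote "smoking", one votes "dancing".
results = [("smoking", 0.9, 0.8),
           ("smoking", 0.8, 0.1),
           ("dancing", 0.5, 0.1)]
label, confidence = weighted_vote(results)
print(label, round(confidence, 3))   # → smoking 0.8
```

The weights let a stronger classifier dominate the vote while the weaker ones can still break ties.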
Fig. 3i is an architecture diagram of a video recognition algorithm provided in Embodiment Three of the present invention. In a specific example, as shown in Fig. 3i, after the video file to be identified is obtained, it is separated into a simple video subfile and a simple audio subfile, where the simple video subfile is a sequence of successive frames. The simple video subfile is processed by slicing and key-frame extraction to form video clips and key frames. Multimodal recognition can be performed on the key-frame pictures to obtain the first recognition result corresponding to picture recognition; multimodal recognition may include picture classification, target detection, face detection, and the like, where picture classification can be realized using OCR or NLP (Natural Language Processing) methods. For the simple video subfile, video recognition can be performed using a 3D-CNN network to obtain a video classification, that is, the second recognition result corresponding to video recognition. Audio classification can also be performed on the simple audio subfile: text information can be extracted from it and identified using NLP methods, while non-text information can be identified as speech audio or non-speech audio, and so on, to obtain the third recognition result corresponding to audio recognition. Finally, a post-processing module can combine the three recognition results to form an integrated recognition result, which includes the finally determined label and confidence information, etc.
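To make the data flow of Fig. 3i concrete, the orchestration can be sketched as below. Every function here is a hypothetical stub standing in for the real demuxer, recognizers and post-processor; only the wiring between them reflects the description above, and the returned labels are placeholder values.

```python
def demux(video_path):
    """Stub: separate the container into video and audio streams."""
    return f"{video_path}:video", f"{video_path}:audio"

def slice_and_extract(video_sub):
    """Stub: slicing plus key-frame extraction."""
    return ["keyframe-0", "keyframe-1"], ["clip-0"]

def picture_recognize(frame):       # multimodal picture recognition (stub)
    return {"label": "smoking", "confidence": 0.9}

def classify_clip(clip):            # 3D-CNN video classification (stub)
    return {"label": "smoking", "confidence": 0.7}

def audio_recognize(audio_sub):     # speech / non-speech plus NLP (stub)
    return {"label": "speech", "confidence": 0.6}

def post_process(r1, r2, r3):
    """Stub integration: keep the most confident single result."""
    return max(r1 + r2 + [r3], key=lambda r: r["confidence"])

def identify_video(video_path):
    video_sub, audio_sub = demux(video_path)
    key_frames, clips = slice_and_extract(video_sub)
    r1 = [picture_recognize(f) for f in key_frames]   # first recognition result
    r2 = [classify_clip(c) for c in clips]            # second recognition result
    r3 = audio_recognize(audio_sub)                   # third recognition result
    return post_process(r1, r2, r3)                   # integrated recognition result
```

A real post-processing module would fuse labels and confidences rather than pick a single maximum; the stub only shows where that integration sits in the pipeline.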
By adopting the above technical scheme for obtaining each recognition result, the problems of existing video audit technology, namely single identification content and a small identification range, can be solved: identification types are enriched, identification content is refined, and multi-dimensional identification of the video content is performed, improving the richness, accuracy, efficiency and real-time performance of video identification technology while reducing identification cost.
Embodiment four
Fig. 4 is a schematic diagram of a video identification device provided in Embodiment Four of the present invention. As shown in Fig. 4, the device includes: a subfile acquisition module 410, a first identification module 420, a second identification module 430 and a recognition result acquisition module 440, in which:
the subfile acquisition module 410 is configured to obtain a simple video subfile and a simple audio subfile corresponding to the video file to be identified, and to obtain a key frame set and a video clip set corresponding to the simple video subfile;
the first identification module 420 is configured to perform multimodal picture recognition on the key frame set to obtain a first recognition result, and to perform video recognition on the video clip set to obtain a second recognition result;
the second identification module 430 is configured to perform audio recognition on the simple audio subfile to obtain a third recognition result;
the recognition result acquisition module 440 is configured to obtain an integrated recognition result corresponding to the video file according to the first recognition result, the second recognition result and the third recognition result.
In the embodiment of the present invention, a simple video subfile and a simple audio subfile corresponding to the video file to be identified are obtained, together with a key frame set and a video clip set corresponding to the simple video subfile; multimodal picture recognition is performed on the key frame set to obtain a first recognition result, and video recognition is performed on the video clip set to obtain a second recognition result; audio recognition is performed on the simple audio subfile to obtain a third recognition result; finally, the first, second and third recognition results are integrated to obtain the integrated recognition result of the video file. This solves the problems of existing video audit technology, namely single identification content and a small identification range, enriches identification types, refines identification content, and performs multi-dimensional identification of the video content, improving the richness, accuracy, efficiency and real-time performance of video identification while reducing identification cost.
Optionally, the subfile acquisition module 410 comprises: a filtered video frame set acquisition unit, configured to filter the simple video subfile using a video frame coarse filtering technique to obtain a filtered video frame set; a clustering cluster acquisition unit, configured to calculate a feature vector corresponding to each filtered video frame in the filtered video frame set, and to perform clustering on the filtered video frames according to the feature vectors to obtain at least two clusters, where each cluster contains at least one filtered video frame; a key frame set forming unit, configured to take, from each cluster, the filtered video frame with the highest static-state score to form the key frame set; and a video clip set acquisition unit, configured to determine a start time and a duration corresponding to each cluster according to the time parameters, within the simple video subfile, of the filtered video frames contained in that cluster, and to slice the simple video subfile according to the start times and durations to obtain the video clip set.
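Under the assumption that each filtered frame comes with a timestamp, a feature vector and a static-state score, the clustering, key-frame selection and segment-boundary steps can be sketched as follows. A toy k-means stands in for whatever clustering algorithm the unit actually uses, and the data are invented 1-D features:

```python
import numpy as np

def kmeans(feats, k, iters=20, seed=0):
    """Toy k-means; returns a cluster label per row of feats."""
    rng = np.random.default_rng(seed)
    centers = feats[rng.choice(len(feats), size=k, replace=False)]
    for _ in range(iters):
        dists = np.linalg.norm(feats[:, None] - centers[None], axis=2)
        labels = dists.argmin(axis=1)
        for c in range(k):
            if (labels == c).any():
                centers[c] = feats[labels == c].mean(axis=0)
    return labels

def key_frames_and_segments(times, feats, static_scores, k=2):
    labels = kmeans(feats, k)
    key_idx, segments = [], []
    for c in range(k):
        members = np.flatnonzero(labels == c)
        # key frame: highest static-state score within the cluster
        key_idx.append(int(members[static_scores[members].argmax()]))
        t = times[members]                    # time parameters of the cluster
        segments.append((float(t.min()), float(t.max() - t.min())))  # (start, duration)
    return key_idx, segments

times = np.array([0.0, 1.0, 2.0, 10.0, 11.0])            # seconds into the subfile
feats = np.array([[0.0], [0.1], [0.2], [5.0], [5.1]])    # 1-D stand-in feature vectors
scores = np.array([0.3, 0.9, 0.5, 0.2, 0.8])             # static-state scores
key_idx, segments = key_frames_and_segments(times, feats, scores)
```

The two well-separated feature groups yield two clusters, one key frame each (indices 1 and 4), and the per-cluster (start, duration) pairs that would drive the slicing step.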
Optionally, the clustering cluster acquisition unit is specifically configured to perform feature extraction on each filtered video frame in the filtered video frame set using a convolutional neural network model; or to perform feature extraction on each filtered video frame using local binary patterns (LBP), and to process each feature extraction result into a statistical histogram serving as the LBP feature vector corresponding to that filtered video frame.
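The LBP-plus-histogram variant can be sketched as follows. This is a plain 8-neighbour LBP over the interior pixels, with no multi-scale or uniform-pattern refinements, offered only to illustrate how a statistical histogram becomes the per-frame feature vector:

```python
import numpy as np

def lbp_histogram(img):
    """8-neighbour LBP codes over interior pixels, binned into a 256-bin histogram."""
    img = np.asarray(img, dtype=np.int32)
    h, w = img.shape
    centre = img[1:-1, 1:-1]
    code = np.zeros_like(centre)
    # neighbour offsets, one bit each, clockwise from the top-left
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    for bit, (dy, dx) in enumerate(offsets):
        neigh = img[1 + dy:h - 1 + dy, 1 + dx:w - 1 + dx]
        code |= (neigh >= centre).astype(np.int32) << bit
    hist, _ = np.histogram(code, bins=256, range=(0, 256))
    return hist          # the 256-dimensional LBP feature vector
```

On a perfectly flat image every neighbour equals its centre, so every interior pixel produces code 255 and the histogram has all its mass in one bin; textured frames spread the mass across bins, which is what makes the histogram usable as a clustering feature.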
Optionally, the first identification module 420 is specifically configured to perform picture classification on each key frame in the key frame set using a preset picture classification model, and to take the classification results as the first recognition result;
And/or
to input each key frame in the key frame set separately into a pre-trained YOLOv3 model, obtain the output of the YOLOv3 model, and take the target object label corresponding to each key frame and the position coordinates of the target object within the key frame as the first recognition result;
And/or
to perform face detection on each key frame in the key frame set using the S3FD algorithm; perform face key-point localization on the detected faces through the MTCNN algorithm to obtain face key points; perform feature extraction on the face images through the Arcface algorithm according to the face key points; match the extracted face features against the features in a feature database, identify the person information corresponding to each key frame according to the matching results, and take the face identification results as the first recognition result.
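Once face features have been extracted (for example by the Arcface algorithm above), matching against the feature database is commonly a nearest-neighbour search under cosine similarity. A minimal sketch of only that matching step follows; the two-person gallery, names and threshold are hypothetical, not from the embodiment:

```python
import numpy as np

def match_face(embedding, gallery, names, threshold=0.5):
    """Cosine-similarity match of one face embedding against a feature database."""
    emb = embedding / np.linalg.norm(embedding)
    gal = gallery / np.linalg.norm(gallery, axis=1, keepdims=True)
    sims = gal @ emb                       # cosine similarity to every enrolled face
    best = int(sims.argmax())
    if sims[best] >= threshold:            # accept only sufficiently close matches
        return names[best], float(sims[best])
    return None, float(sims[best])         # unknown face

gallery = np.array([[1.0, 0.0, 0.0],       # hypothetical enrolled embeddings
                    [0.0, 1.0, 0.0]])
name, sim = match_face(np.array([0.9, 0.1, 0.0]), gallery, ["person_a", "person_b"])
```

Returning `None` for below-threshold similarities is what lets the module report "no known person" rather than forcing a label onto every detected face.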
Optionally, the first identification module 420 is further configured to perform temporal down-sampling on each video clip in the video clip set to obtain a sampled video frame set corresponding to each video clip;
to perform set processing operations on the sampled video frame sets in space and time to obtain at least two types of input images, where the set processing operations include scaling, optical flow extraction and edge image extraction, and the input image types include high-resolution images, low-resolution images, optical flow images and edge images; to input each type of input image into a corresponding 3D-CNN network, identify the input images using the 3D-CNN networks, and obtain the outputs of the 3D-CNN networks, namely the output probability values of the video labels corresponding to the input images; and to fuse the output probability values of the video labels according to a set fusion mode, taking the combination of video labels obtained after fusion as the second recognition result.
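One simple instance of the "set fusion mode" is a (weighted) average of the per-stream label probability vectors, sketched below with four hypothetical streams (high-resolution, low-resolution, optical-flow and edge inputs) over three invented labels:

```python
import numpy as np

def fuse_predictions(stream_probs, weights=None):
    """Late fusion: weighted average of per-stream label probability vectors."""
    p = np.asarray(stream_probs, dtype=float)
    w = (np.ones(len(p)) / len(p)) if weights is None else np.asarray(weights, dtype=float)
    fused = (w[:, None] * p).sum(axis=0)       # fused probability per video label
    return fused, int(fused.argmax())           # winning video label index

stream_probs = [
    [0.7, 0.2, 0.1],   # high-resolution 3D-CNN
    [0.6, 0.3, 0.1],   # low-resolution 3D-CNN
    [0.5, 0.4, 0.1],   # optical-flow 3D-CNN
    [0.8, 0.1, 0.1],   # edge-image 3D-CNN
]
fused, label_idx = fuse_predictions(stream_probs)
```

Passing an explicit `weights` vector lets more reliable streams (for example the high-resolution one) dominate the fused decision without changing the code.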
Optionally, the second identification module 430 is specifically configured to pre-process the simple audio subfile and then perform a fast Fourier transform to obtain the frequency-domain information of the simple audio subfile; calculate the energy spectrum corresponding to the frequency-domain information of each frame; obtain the log Mel spectral energy from the energy spectrum; extract the log Mel spectrum features from the log Mel spectral energy; reconstruct the log Mel spectrum features to obtain two-dimensional audio features; and perform feature extraction and audio classification on the two-dimensional audio features using a CNN basic structural unit.
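The S351-S355 audio chain (pre-processing, FFT, energy spectrum, Mel filterbank, log) can be sketched in miniature as follows. The parameters are assumptions, not from the embodiment: a 16 kHz mono signal, 25 ms frames with a 10 ms hop, a Hamming window and 40 HTK-style Mel bands.

```python
import numpy as np

def mel_filterbank(n_mels, n_bins, sr):
    """Triangular HTK-style Mel filterbank over n_bins FFT bins."""
    def hz_to_mel(f): return 2595.0 * np.log10(1.0 + f / 700.0)
    def mel_to_hz(m): return 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mels = np.linspace(0.0, hz_to_mel(sr / 2.0), n_mels + 2)
    bins = np.floor((n_bins - 1) * mel_to_hz(mels) / (sr / 2.0)).astype(int)
    fb = np.zeros((n_mels, n_bins))
    for m in range(1, n_mels + 1):
        lo, ctr, hi = bins[m - 1], bins[m], bins[m + 1]
        for k in range(lo, ctr):
            fb[m - 1, k] = (k - lo) / max(ctr - lo, 1)
        for k in range(ctr, hi):
            fb[m - 1, k] = (hi - k) / max(hi - ctr, 1)
    return fb

def log_mel_features(signal, sr=16000, frame_len=400, hop=160, n_mels=40):
    """Frame -> window -> FFT -> energy spectrum -> Mel energy -> log."""
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop:i * hop + frame_len] for i in range(n_frames)])
    frames *= np.hamming(frame_len)                   # windowing (pre-processing)
    spec = np.abs(np.fft.rfft(frames, axis=1)) ** 2   # per-frame energy spectrum
    fb = mel_filterbank(n_mels, frame_len // 2 + 1, sr)
    mel_energy = spec @ fb.T                          # Mel spectral energy
    return np.log(mel_energy + 1e-10)                 # 2-D (frames x bands) feature map
```

The returned (frames × bands) array is exactly the two-dimensional audio feature distribution that the CNN basic structural unit then classifies.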
Optionally, when the CNN basic structural unit uses multiple classifiers, the simple audio subfile is identified by voting.
The above video identification device can perform the video identification method provided by any embodiment of the present invention, and has the functional modules and beneficial effects corresponding to the method performed. For technical details not described in detail in this embodiment, reference may be made to the video identification method provided by any embodiment of the present invention.
Embodiment five
Fig. 5 is a structural schematic diagram of a computer equipment provided in Embodiment Five of the present invention. Fig. 5 shows a block diagram of a computer equipment 512 suitable for implementing embodiments of the present invention. The computer equipment 512 shown in Fig. 5 is only an example and should not impose any restriction on the functions and scope of use of the embodiments of the present invention.
As shown in Fig. 5, the computer equipment 512 takes the form of a general-purpose computing device. The components of the computer equipment 512 may include, but are not limited to: one or more processors 516, a storage device 528, and a bus 518 connecting the different system components (including the storage device 528 and the processor 516).
The bus 518 represents one or more of several classes of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, and a processor or local bus using any of a variety of bus architectures. By way of example, these architectures include, but are not limited to, the Industry Standard Architecture (ISA) bus, the Micro Channel Architecture (MCA) bus, the Enhanced ISA bus, the Video Electronics Standards Association (VESA) local bus, and the Peripheral Component Interconnect (PCI) bus.
The computer equipment 512 typically comprises a variety of computer-system-readable media. These media can be any usable media accessible by the computer equipment 512, including volatile and non-volatile media, and removable and non-removable media.
The storage device 528 may include computer-system-readable media in the form of volatile memory, such as Random Access Memory (RAM) 530 and/or cache memory 532. The computer equipment 512 may further include other removable/non-removable, volatile/non-volatile computer system storage media. By way of example only, the storage system 534 can be used for reading and writing non-removable, non-volatile magnetic media (not shown in Fig. 5, commonly referred to as a "hard drive"). Although not shown in Fig. 5, a disk drive for reading and writing a removable non-volatile magnetic disk (such as a "floppy disk") may be provided, as well as an optical disk drive for reading and writing a removable non-volatile optical disk (such as a Compact Disc Read-Only Memory (CD-ROM), a Digital Video Disc (DVD-ROM) or other optical media). In these cases, each drive can be connected to the bus 518 through one or more data media interfaces. The storage device 528 may include at least one program product having a group of (for example, at least one) program modules configured to perform the functions of the various embodiments of the present invention.
A program 536 having a group of (at least one) program modules 526 may be stored, for example, in the storage device 528. Such program modules 526 include, but are not limited to, an operating system, one or more application programs, other program modules and program data; each of these examples, or some combination thereof, may include an implementation of a network environment. The program modules 526 usually perform the functions and/or methods in the embodiments described in the present invention.
The computer equipment 512 can also communicate with one or more external devices 514 (such as a keyboard, a pointing device, a camera, a display 524, etc.), with one or more devices that enable a user to interact with the computer equipment 512, and/or with any device (such as a network card, a modem, etc.) that enables the computer equipment 512 to communicate with one or more other computing devices. Such communication can take place through an input/output (I/O) interface 522. Moreover, the computer equipment 512 can also communicate with one or more networks (such as a Local Area Network (LAN), a Wide Area Network (WAN) and/or a public network such as the Internet) through a network adapter 520. As shown, the network adapter 520 communicates with the other modules of the computer equipment 512 through the bus 518. It should be understood that, although not shown in the figure, other hardware and/or software modules can be used in conjunction with the computer equipment 512, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, Redundant Arrays of Independent Disks (RAID) systems, tape drives and data backup storage systems, etc.
The processor 516, by running the programs stored in the storage device 528, performs various functional applications and data processing, for example implementing the video identification method provided by the above embodiments of the present invention.
That is, when the processing unit executes the program, it realizes: obtaining a simple video subfile and a simple audio subfile corresponding to the video file to be identified, and obtaining a key frame set and a video clip set corresponding to the simple video subfile; performing multimodal picture recognition on the key frame set to obtain a first recognition result, and performing video recognition on the video clip set to obtain a second recognition result; performing audio recognition on the simple audio subfile to obtain a third recognition result; and obtaining, according to the first recognition result, the second recognition result and the third recognition result, an integrated recognition result corresponding to the video file.
Through the computer equipment, a simple video subfile and a simple audio subfile corresponding to the video file to be identified are obtained, together with a key frame set and a video clip set corresponding to the simple video subfile; multimodal picture recognition is performed on the key frame set to obtain a first recognition result, and video recognition is performed on the video clip set to obtain a second recognition result; audio recognition is performed on the simple audio subfile to obtain a third recognition result; finally, the first, second and third recognition results are integrated to obtain the integrated recognition result of the video file. This solves the problems of existing video audit technology, namely single identification content and a small identification range, enriches identification types, refines identification content, and performs multi-dimensional identification of the video content, improving the richness, accuracy, efficiency and real-time performance of video identification while reducing identification cost.
Embodiment six
Embodiment Six of the present invention also provides a computer storage medium storing a computer program which, when executed by a computer processor, performs the video identification method of any of the above embodiments of the present invention: obtaining a simple video subfile and a simple audio subfile corresponding to the video file to be identified, and obtaining a key frame set and a video clip set corresponding to the simple video subfile; performing multimodal picture recognition on the key frame set to obtain a first recognition result, and performing video recognition on the video clip set to obtain a second recognition result; performing audio recognition on the simple audio subfile to obtain a third recognition result; and obtaining, according to the first recognition result, the second recognition result and the third recognition result, an integrated recognition result corresponding to the video file.
The computer storage medium of the embodiment of the present invention can adopt any combination of one or more computer-readable media. A computer-readable medium can be a computer-readable signal medium or a computer-readable storage medium. A computer-readable storage medium can be, for example, but is not limited to, an electric, magnetic, optical, electromagnetic, infrared or semiconductor system, apparatus or device, or any combination of the above. More specific examples (a non-exhaustive list) of computer-readable storage media include: an electrical connection with one or more conducting wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a Read-Only Memory (ROM), an Erasable Programmable Read-Only Memory (EPROM or flash memory), an optical fiber, a portable Compact Disc Read-Only Memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above. In this document, a computer-readable storage medium can be any tangible medium that contains or stores a program which can be used by, or in connection with, an instruction execution system, apparatus or device.
A computer-readable signal medium may include a data signal propagated in a baseband or as part of a carrier wave, in which computer-readable program code is carried. Such a propagated data signal can take various forms, including but not limited to an electromagnetic signal, an optical signal, or any suitable combination of the above. A computer-readable signal medium can also be any computer-readable medium other than a computer-readable storage medium, which can send, propagate or transmit a program for use by, or in connection with, an instruction execution system, apparatus or device.
The program code contained on a computer-readable medium can be transmitted by any suitable medium, including but not limited to wireless, electric wire, optical cable, Radio Frequency (RF), etc., or any suitable combination of the above.
Computer program code for performing the operations of the present invention can be written in one or more programming languages or combinations thereof, including object-oriented programming languages such as Java, Smalltalk, C++ or Python, as well as conventional procedural programming languages such as the "C" language or similar programming languages. The program code can be executed entirely on the user's computer, partly on the user's computer, as an independent software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. In situations involving a remote computer, the remote computer can be connected to the user's computer through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or can be connected to an external computer (for example, through the Internet using an Internet service provider).
Note that the above are only preferred embodiments of the present invention and the technical principles applied. Those skilled in the art will appreciate that the invention is not limited to the specific embodiments described herein, and that various apparent changes, readjustments and substitutions can be made by those skilled in the art without departing from the scope of protection of the present invention. Therefore, although the present invention has been described in further detail through the above embodiments, the present invention is not limited to the above embodiments only and, without departing from the inventive concept, may also include more other equivalent embodiments, the scope of the invention being determined by the scope of the appended claims.