Summary of the invention
One object of the present invention is to provide an online video behavior detection system based on spatio-temporal context analysis, so that the context information of the video sequence can be used when detecting the action behavior in the current frame, behavior chains can be generated incrementally as video frames are continuously input, and video behaviors can be classified dynamically.
A further object of the present invention is to provide an online video behavior detection method based on spatio-temporal context analysis.
Compared with conventional methods, the method proposed by the present invention has two main improvements: 1) the method of the invention is based on the ConvGRU (Convolutional Gated Recurrent Unit), which, compared with the ConvLSTM, is a lightweight recurrent memory model with far fewer parameters and therefore reduces the risk of over-fitting on small-sample datasets; 2) existing models are single forward models, so only the video frames located at the rear end of the input video sequence can use fused spatio-temporal dynamic information when performing behavior detection, whereas the method proposed by the present invention is an encoding-decoding model in which the spatio-temporal context information of the video sequence can be used at every frame during decoding.
The principle of the present invention is as follows: 1) a deep convolutional neural network is used to extract single-frame video features; the features of several consecutive video frames are input into a ConvGRU to build an encoding-decoding video sequence description model, the behavior spatio-temporal context information is encoded during the forward pass, the encoded spatio-temporal dynamic information is decoded to each frame during the backward pass, and dynamic detection is completed in combination with the current frame information; 2) a dynamic behavior class candidate pool is maintained, which gradually narrows the range of possible behavior classes as the input video sequence grows, while the currently generated behavior chains are trimmed dynamically, including growing, terminating, and temporal trimming.
The technical solution provided by the present invention is as follows:
The spatio-temporal behavior detection method proposed by the present invention consists of two parts: behavior detection within a video clip and linking between video clips. Within a video clip, the algorithm uses an encoding-decoding model to generate candidate action regions by combining current frame information with spatio-temporal dynamic information; linking between video clips connects the candidate action regions into behavior chains, each of which continuously tracks a specified action object and predicts the class of the behavior in an online manner from the moment it appears until it terminates.
An online video behavior detection system based on spatio-temporal context analysis comprises a video behavior spatio-temporal context information fusion network and an online action frame linking and classification algorithm, wherein: the video behavior spatio-temporal context information fusion network is used to fuse the current frame information with the behavior spatio-temporal context information within a video clip; the online action frame linking and classification algorithm is used to link, in an online manner, the action frames corresponding to the same moving target into a complete behavior chain and to classify its behavior class.
The video behavior spatio-temporal context information fusion network specifically comprises: a single-frame feature extraction network, for extracting the deep representation features of the current-frame RGB image and optical flow image in a video clip; a video clip spatio-temporal context information fusion network, which builds an encoding-decoding module based on the ConvGRU model to extract the spatio-temporal context representation features of the video clip and fuses them with the current-frame features to obtain fused features; and a behavior detection network, for performing single-frame behavior detection on the fused features, obtaining the behavior class scores and the positions where behaviors occur, and generating action frames.
The online action frame linking and classification algorithm specifically comprises: construction of a behavior class candidate pool, used to maintain a specified number of behavior classes most likely to occur in the given video at present; a behavior class candidate pool update algorithm, used to score the behavior classes and gradually narrow the range of behavior classes to which the current video may belong, thereby realizing fast online classification of behavior chains; and an online behavior chain growth algorithm, used to link the behavior candidate region corresponding to a video clip into an existing behavior chain to realize online growth of the behavior chain, or to designate the behavior candidate region as a new behavior chain.
An online video behavior detection method based on spatio-temporal context analysis comprises the following steps:
Step 1: compute the optical flow image for the current frame, and extract the deep representation features of the RGB image and the optical flow image;
Step 2: build an encoding-decoding network to extract the video behavior spatio-temporal context information, and fuse it with the current frame information to obtain fused features;
Step 3: perform classification and position regression on the fused features to generate action frames, and link the action frames with the Viterbi algorithm to obtain behavior candidate regions;
Step 4: build a behavior class candidate pool and update the behavior classes that are likely to occur;
Step 5: link the behavior candidate region into an existing behavior chain in an online manner, or generate a new behavior chain;
Step 6: fuse the detection results of the RGB image branch and the optical flow image branch to obtain the final detection result.
Compared with the prior art, the beneficial effects of the present invention are:
With the technical solution provided by the present invention, the behavior spatio-temporal context information within a video clip is utilized when performing behavior detection on a single video frame, which improves the accuracy of behavior detection. At the same time, video behaviors can be detected online, which, compared with previous offline methods based on batch processing, improves the timeliness of video behavior detection, so the solution can be applied to occasions with relatively high real-time requirements, such as intelligent robots and human-computer interaction systems. Compared with existing video behavior detection techniques, the technology provided by the present invention achieves better detection results with fewer candidate proposals on a currently popular open test set.
The present invention is further described below by way of embodiments with reference to the accompanying drawings:
Specific embodiment
An online video behavior detection method based on spatio-temporal context analysis according to the present invention is implemented by the following steps:
1) A video sequence to be detected is evenly divided into several video clips (8 frames form one clip, with a one-frame overlap between adjacent clips);
2) Optical flow information is extracted for each video clip; the original RGB images and the optical flow images are input into the model as two independent computing branches. The following description takes the RGB branch as an example; the optical-flow branch is handled identically;
3) Each frame image in the video clip is separately input into a pre-trained deep convolutional network (trained for behavior classification) to extract motion features;
4) motion feature extracted is input in coding-decoding network that one is made of ConvGRU and extracts video lineIt for spatio-temporal context information, and is merged with the motion information of present frame, every frame exports fusion feature;
5) The fused features pass through a behavior classification network and a position regression network, which classify the behavior occurring in each frame and locate the position where the behavior occurs, generating behavior frames;
6) The behavior frames detected in each frame of the video clip are linked per behavior class with the Viterbi algorithm to form several behavior candidate regions; the following steps 7)-9) are executed in a loop until the input video sequence ends;
7) If the ordinal number of the current video clip is a multiple of 10, the behavior chain temporal trimming algorithm is executed, the score of each behavior chain relative to each behavior class is calculated, and the behavior class candidate pool is updated so that it contains only the several behavior classes with the highest scores;
8) For a current behavior chain, if there exists a behavior candidate region whose overlap area (the overlap area refers to the overlap between the last behavior frame of the behavior chain and the first behavior frame of the candidate region) is greater than a specified threshold, the candidate region with the highest score is linked to the behavior chain; if no behavior candidate region has an overlap area greater than the specified threshold, the behavior chain is terminated. This operation is executed separately for each class in the behavior class candidate pool;
9) If there exists a behavior candidate region that is not linked to any behavior chain, the behavior candidate region is taken as a new behavior chain;
10) The behavior chain detection results of the RGB branch and the Optical-flow branch are fused to obtain the final detection result.
Fig. 1 is the flowchart of the present invention, in which S1-S8 correspond in turn to steps 1)-8) of the specific embodiment. The concrete operational flow of an online video behavior detection method based on spatio-temporal context analysis is now described as follows:
1) The video is evenly divided into clips, S1: a given input video is evenly divided into several video clips, each containing 8 frame images. Each clip serves as an independent video unit from which behavior candidate regions are extracted;
2) RGB images or optical flow are extracted, S2: for each video unit, the optical flow image of every frame is extracted as the motion information description. The original RGB images and the optical flow images are separately input into the model and computed as two independent branches. The following description is based on the RGB image branch; the optical-flow branch is the same;
3) Extract the single-frame representation feature p'_i, S3: the behavior detection network framework for single-frame images is shown in Fig. 3, which shows the images 4 contained in one video unit. For each frame image, a representation feature is extracted with the feature extraction network 5, denoted p_i. The feature extraction network is obtained by fine-tuning a VGG-16 model (Simonyan K. and Zisserman A. 2014. Very Deep Convolutional Networks for Large-Scale Image Recognition. ArXiv (2014). https://doi.org/arXiv:1409.1556), and its conv5 layer features are taken. A dimensionality reduction network 6 is built, which reduces the number of feature channels of p_i from 512 to 128, denoted p'_i, to prevent over-fitting of the overall network model. The dimensionality reduction network is a convolution module composed of a 128-channel convolutional layer (an illustrative sketch is given below);
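A minimal sketch of this single-frame feature extraction and channel reduction, assuming a PyTorch environment, is given here for illustration only; truncating torchvision's VGG-16 after conv5_3 and using a 3x3 kernel for the 128-channel reduction layer are assumptions, not the exact fine-tuned networks of the embodiment.

    import torch
    import torch.nn as nn
    from torchvision.models import vgg16

    class SingleFrameFeatures(nn.Module):
        """VGG-16 backbone truncated at conv5, followed by a 128-channel
        convolution that reduces p_i (512 channels) to p'_i (128 channels)."""
        def __init__(self):
            super().__init__()
            # vgg16().features[:30] ends after conv5_3 + ReLU (before the last max pool)
            self.backbone = vgg16().features[:30]
            self.reduce = nn.Conv2d(512, 128, kernel_size=3, padding=1)  # kernel size assumed

        def forward(self, frame):            # frame: (B, 3, H, W)
            p_i = self.backbone(frame)       # roughly (B, 512, H/16, W/16)
            return self.reduce(p_i)          # p'_i: (B, 128, H/16, W/16)

    # Example usage on a dummy 300x300 RGB frame.
    if __name__ == "__main__":
        net = SingleFrameFeatures()
        print(net(torch.randn(1, 3, 300, 300)).shape)   # torch.Size([1, 128, 18, 18])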
4) Extract the fused representation feature p_d, S4: the representation feature p'_i of each frame image is input into the ConvGRU network 2 to build the spatio-temporal fused motion encoding of the video unit. The structure of the motion information codec (En-Decoder) model is shown in Fig. 2, which shows the current-frame input feature p'_i 1 and the fused representation feature p_d 2. The En-Decoder model acts over the entire video unit and includes a forward encoding process and a backward decoding process. The feature p'_i participates in both forward encoding and backward decoding; the specific input manner is shown in Fig. 2. Forward encoding accumulates the single-frame features p'_i over time to obtain a characterization of the video unit motion sequence; backward decoding propagates the motion sequence characterization back to each frame in the video unit and fuses it with the feature p'_i, yielding the feature p_d that fuses the current frame with the spatio-temporal context information (a minimal sketch is given below);
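The internal equations of the ConvGRU unit are not spelled out above, so the following is only a minimal sketch under common assumptions: a GRU cell whose linear maps are replaced by 3x3 convolutions, run forward over the video unit to encode and backward to decode, with the decoded state fused with p'_i through a 1x1 convolution. The class names, layer sizes, and concatenation-based fusion are illustrative choices.

    import torch
    import torch.nn as nn

    class ConvGRUCell(nn.Module):
        """GRU cell whose gates are 3x3 convolutions (a common ConvGRU formulation)."""
        def __init__(self, in_ch, hid_ch, k=3):
            super().__init__()
            pad = k // 2
            self.gates = nn.Conv2d(in_ch + hid_ch, 2 * hid_ch, k, padding=pad)  # update z, reset r
            self.cand = nn.Conv2d(in_ch + hid_ch, hid_ch, k, padding=pad)       # candidate state

        def forward(self, x, h):
            z, r = torch.sigmoid(self.gates(torch.cat([x, h], dim=1))).chunk(2, dim=1)
            h_tilde = torch.tanh(self.cand(torch.cat([x, r * h], dim=1)))
            return (1 - z) * h + z * h_tilde

    class EnDecoderFusion(nn.Module):
        """Forward pass encodes the clip; backward pass decodes the accumulated
        motion representation to every frame and fuses it with p'_i to give p_d."""
        def __init__(self, ch=128):
            super().__init__()
            self.encoder = ConvGRUCell(ch, ch)
            self.decoder = ConvGRUCell(ch, ch)
            self.fuse = nn.Conv2d(2 * ch, ch, 1)   # fusion of decoded state with p'_i (assumed 1x1 conv)

        def forward(self, feats):                  # feats: list of T tensors (B, ch, H, W)
            b, c, hgt, wid = feats[0].shape
            h = feats[0].new_zeros(b, c, hgt, wid)
            for x in feats:                        # forward encoding over the whole video unit
                h = self.encoder(x, h)
            fused, d = [], h                       # backward decoding starts from the clip encoding
            for x in reversed(feats):
                d = self.decoder(x, d)
                fused.append(self.fuse(torch.cat([x, d], dim=1)))   # p_d for this frame
            return list(reversed(fused))

    # Example: a clip of 8 frames of 128-channel features.
    if __name__ == "__main__":
        clip = [torch.randn(1, 128, 18, 18) for _ in range(8)]
        p_d = EnDecoderFusion()(clip)
        print(len(p_d), p_d[0].shape)              # 8 torch.Size([1, 128, 18, 18])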
5) Compute the outputs of the Detection Network and the RPN network, S5: the feature p_d is input into the RPN network 7, which computes the action proposal score 11, denoted s_r, and the action proposal 12, denoted p_r. The RPN network is a 2-layer 3*3 convolutional network that slides over p_d and computes the action proposal score s_r at each position; regions whose score is greater than a specified value (e.g. 0.5) are regarded as action proposals p_r. The Detection Network 8 receives p_d and p_r as input and outputs the behavior classification result 9, denoted s_c, and the position adjustment 10, denoted δ_r. The Detection Network consists of 2 fully connected layers containing 1024 hidden units each. The behavior classification result s_c contains the classification score for each behavior class and the background class; the position adjustment δ_r gives the corresponding 3 position offsets (centre, width and height) for each behavior class. From p_r and δ_r, the corrected behavior candidate frame b_t can be calculated (a simplified sketch follows);
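An illustrative, simplified sketch of these two heads is given below: a two-layer 3x3 convolutional RPN that slides over p_d to produce per-location proposal scores s_r, and a detection head of two 1024-unit fully connected layers producing class scores s_c and three position offsets per class. For brevity, pooling of the proposal region p_r is approximated here by pooling the whole feature map; the class count and module names are assumptions.

    import torch
    import torch.nn as nn

    NUM_CLASSES = 21            # e.g. the J-HMDB-21 behaviour classes (background added below)

    class RPNHead(nn.Module):
        """Two-layer 3x3 convolutional network sliding over p_d; outputs an
        'actionness' proposal score s_r at every spatial position."""
        def __init__(self, ch=128):
            super().__init__()
            self.net = nn.Sequential(
                nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(inplace=True),
                nn.Conv2d(ch, 1, 3, padding=1))

        def forward(self, p_d):
            return torch.sigmoid(self.net(p_d))     # (B, 1, H, W); threshold (e.g. 0.5) gives p_r

    class DetectionHead(nn.Module):
        """Two 1024-unit fully connected layers; outputs class scores s_c
        (behaviour classes + background) and 3 offsets per class (centre, w, h)."""
        def __init__(self, ch=128, pool=7, num_classes=NUM_CLASSES):
            super().__init__()
            self.pool = nn.AdaptiveAvgPool2d(pool)   # stand-in for pooling the proposal region p_r
            self.fc = nn.Sequential(
                nn.Flatten(), nn.Linear(ch * pool * pool, 1024), nn.ReLU(inplace=True),
                nn.Linear(1024, 1024), nn.ReLU(inplace=True))
            self.cls = nn.Linear(1024, num_classes + 1)      # s_c, including background
            self.reg = nn.Linear(1024, 3 * num_classes)      # delta_r: 3 offsets per class

        def forward(self, p_d):
            h = self.fc(self.pool(p_d))
            return self.cls(h), self.reg(h)

    if __name__ == "__main__":
        p_d = torch.randn(1, 128, 18, 18)
        s_r = RPNHead()(p_d)
        s_c, delta_r = DetectionHead()(p_d)
        print(s_r.shape, s_c.shape, delta_r.shape)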
6) Compute the video unit behavior candidate region p, S6: denote by s_r(b_t) the RPN action proposal score corresponding to b_t. The Viterbi algorithm is used to link the b_t of the different image frames within the same video unit, obtaining the behavior candidate region p, as shown in formula (1) (a minimal sketch of this dynamic program follows):

    p = argmax_{b_1, …, b_{T_p}} ( Σ_{t=1…T_p} s_r(b_t) + λ Σ_{t=2…T_p} IoU(b_t, b_{t-1}) )      (1)

where T_p is the video unit duration, taken here as 8; IoU(b_t, b_{t-1}) is the Intersection over Union (IoU) between b_t and b_{t-1}; and λ is a harmonic coefficient, taken as 0.5;
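A minimal sketch of this linking step as a dynamic program is given below, assuming that each frame contributes a small set of surviving proposals, each a box (x1, y1, x2, y2) with its score s_r; the function names viterbi_link and iou are illustrative.

    def iou(a, b):
        """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
        ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
        ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
        area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
        return inter / (area(a) + area(b) - inter + 1e-9)

    def viterbi_link(frames, lam=0.5):
        """Formula (1): pick one box b_t per frame maximising
        sum_t s_r(b_t) + lam * sum_t IoU(b_t, b_{t-1}).
        `frames` is a list of T_p lists of (box, score) pairs."""
        score = [s for _, s in frames[0]]           # best accumulated score per box of frame 0
        back = []                                   # back-pointers per frame transition
        for t in range(1, len(frames)):
            new_score, ptr = [], []
            for box, s in frames[t]:
                cand = [score[j] + lam * iou(prev_box, box)
                        for j, (prev_box, _) in enumerate(frames[t - 1])]
                j_best = max(range(len(cand)), key=cand.__getitem__)
                new_score.append(cand[j_best] + s)
                ptr.append(j_best)
            score, back = new_score, back + [ptr]
        # trace back the best path
        i = max(range(len(score)), key=score.__getitem__)
        path = [i]
        for ptr in reversed(back):
            i = ptr[i]
            path.append(i)
        path.reverse()
        return [frames[t][i][0] for t, i in enumerate(path)]

    # Example: two frames with two proposals each; the spatially consistent pair wins.
    if __name__ == "__main__":
        f1 = [((0, 0, 10, 10), 0.9), ((50, 50, 60, 60), 0.2)]
        f2 = [((1, 1, 11, 11), 0.8), ((48, 52, 58, 62), 0.3)]
        print(viterbi_link([f1, f2]))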
7) Compute the video behavior chain set T_d, S7: as the video is continuously input, the behavior candidate region p corresponding to each video unit is obtained, and the dynamically growing video behavior chain set T_d is obtained by the following rules (a)-(f). Fig. 4 is the operational flowchart of the online dynamic update of the video behavior chain set T_d; for each behavior chain T in T_d, rules (a)-(f) are executed. The main idea is as follows: a dynamically updated behavior class candidate pool is maintained, and the number of candidate behavior classes is gradually reduced according to the continuously input video; according to the set linking method, a newly generated candidate region p is either linked to an existing video behavior chain or taken as a new behavior chain. As shown in Fig. 4, the concrete steps are:
(a) Temporal trimming 13. If the number of elements of the current set T_d exceeds the upper limit N_d, the update terminates. Otherwise, the Viterbi algorithm is used to trim T in the temporal domain, as shown in formula (2) (a minimal sketch of this dynamic program is given after rule (f)):

    (l_1, …, l_{T_l}) = argmax_{l_1, …, l_{T_l}} ( Σ_{t=1…T_l} s_{l_t}(b_t) - λ_ω Σ_{t=2…T_l} ω(l_t, l_{t-1}) )      (2)

where T_l is the number of b_t contained in the behavior chain T; l_t ∈ {0, c} is the class to which b_t belongs, 0 representing the background class and c representing behavior class c; if l_t = c, then s_{l_t}(b_t) is the Detection Network classification score s_c(b_t) of b_t for class c, and if l_t = 0 it is defined as 1 - s_c(b_t); ω(l_t, l_{t-1}) = 0 if l_t = l_{t-1}, otherwise ω(l_t, l_{t-1}) = 0.7; λ_ω = 0.4. Through temporal trimming, the background blocks contained in T are reduced;
(b) Compute the behavior score 14: for T, compute its score s(T) relative to each behavior class; s(T) is defined as the average of the scores s(p) of all p belonging to T. Similarly, the score s(p) of p is defined as the average of the scores s_c(b_t) of all b_t belonging to p;
(c) Construct the behavior class candidate pool 15: a behavior class candidate pool is constructed in descending order of s(T), specifically: i) at the start, all classes are retained; ii) when processing the 10th video unit, the top 5 behavior classes are retained; iii) when processing the 20th video unit, the top 3 behavior classes are retained; iv) from the 30th video unit onwards, only the top-ranked behavior class is retained. Let the current upper limit of behavior classes in the candidate pool be N_p; for each behavior class j ≤ N_p in the candidate pool, rules (d)-(e) are executed:
(d) Construct the candidate set P_t 16: for a newly generated behavior candidate region p, if the IoU between T and p is greater than a specified threshold (e.g. 0.5), p is added to the set P_t (P_t is initially empty). The IoU between T and p is defined as the IoU between the last behavior candidate frame belonging to T and the first behavior candidate frame of p;
(e) Update the behavior chain 17: if P_t is not empty, the p' with the maximum score s(p') (p' ∈ P_t) is linked to T, i.e. p' is appended to the end of T to form a new behavior chain T', and T = T' is updated;
(f) Add a new behavior chain 18: if there exists a p_new (p_new ∈ P_t) that has not been linked to any T, then p_new is added to the set T_d as a new behavior chain (sketches of the trimming and chain update logic follow);
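A minimal sketch of the temporal trimming of rule (a), i.e. formula (2), as a two-state Viterbi pass over a chain is given below. Representing the chain by its per-frame class scores s_c(b_t) and the function name trim_chain are assumptions made for illustration.

    def trim_chain(scores, omega=0.7, lam_w=0.4):
        """Formula (2): label every b_t in the chain as behaviour class c (1) or
        background (0), maximising sum_t s_{l_t}(b_t) - lam_w * sum_t w(l_t, l_{t-1}).
        `scores` is the list of s_c(b_t) for the chain; returns the label sequence."""
        unary = lambda s: (1.0 - s, s)                    # state 0 = background, state 1 = class c
        acc = list(unary(scores[0]))                      # accumulated score per state
        back = []
        for s in scores[1:]:
            u = unary(s)
            new_acc, ptr = [0.0, 0.0], [0, 0]
            for cur in (0, 1):
                # transition cost w = 0 if the label is kept, 0.7 if it switches
                cand = [acc[prev] - lam_w * (0.0 if prev == cur else omega) for prev in (0, 1)]
                best = 0 if cand[0] >= cand[1] else 1
                new_acc[cur], ptr[cur] = cand[best] + u[cur], best
            acc, back = new_acc, back + [ptr]
        label = 0 if acc[0] >= acc[1] else 1
        labels = [label]
        for ptr in reversed(back):
            label = ptr[label]
            labels.append(label)
        labels.reverse()
        return labels              # frames labelled 0 at the ends are the background blocks to drop

    # Example: a chain whose first two frames look like background.
    if __name__ == "__main__":
        print(trim_chain([0.1, 0.2, 0.9, 0.8, 0.85]))      # [0, 0, 1, 1, 1]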
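The candidate pool schedule of rule (c) and the growth and creation rules (d)-(f) can be summarised by the sketch below. The data structures (a chain as a list of candidate regions, each region a (boxes, score) pair), the helper names, and the implicit handling of chain termination (a chain that receives no new region simply stops growing) are illustrative assumptions; the per-class bookkeeping of the candidate pool is omitted for brevity.

    def pool_size(unit_index, num_classes):
        """Rule (c): shrink the candidate pool as more video units are processed."""
        if unit_index < 10:
            return num_classes          # keep all classes at the start
        if unit_index < 20:
            return 5                    # from the 10th unit: top 5 classes
        if unit_index < 30:
            return 3                    # from the 20th unit: top 3 classes
        return 1                        # from the 30th unit: top-ranked class only

    def chain_score(chain):
        """s(T): average of the scores s(p) of the regions p belonging to T."""
        return sum(s for _, s in chain) / len(chain)

    def region_iou(chain, region, iou_fn):
        """IoU between a chain T and a region p: last frame of T vs first frame of p.
        `iou_fn` can be the box IoU from the linking sketch above."""
        return iou_fn(chain[-1][0][-1], region[0][0])

    def update_chains(chains, new_regions, iou_fn, thr=0.5):
        """Rules (d)-(f): grow every chain with its best overlapping region,
        then start new chains from regions that were not linked anywhere."""
        linked = set()
        for chain in chains:
            p_t = [i for i, r in enumerate(new_regions)
                   if region_iou(chain, r, iou_fn) > thr]          # rule (d)
            if p_t:                                                # rule (e)
                best = max(p_t, key=lambda i: new_regions[i][1])
                chain.append(new_regions[best])
                linked.add(best)
        for i, r in enumerate(new_regions):                        # rule (f)
            if i not in linked:
                chains.append([r])
        return chains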
8) Fuse the RGB and Optical-flow detection results, S8: the behavior chain detection results of the RGB branch and the Optical-flow branch are fused to obtain the final detection result. The fusion method is as follows: let T_rgb be a behavior chain of the RGB branch and T_opt a behavior chain of the Optical-flow branch; if the IoU between T_rgb and T_opt is greater than a specified threshold (e.g. 0.7), the behavior chain corresponding to max(s(T_rgb), s(T_opt)) is retained and the other is deleted; otherwise, both behavior chains are retained.
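A minimal sketch of this two-stream fusion is given below, assuming each chain carries a precomputed score and that a chain-level IoU function chain_iou is supplied by the caller (its exact definition, for example a mean per-frame IoU over overlapping frames, is not specified above and is therefore left as an assumption).

    def fuse_branches(rgb_chains, flow_chains, chain_iou, thr=0.7):
        """Keep the higher-scoring chain of every RGB/flow pair whose chain-level
        IoU exceeds `thr`; unmatched chains from both branches are kept as-is.
        Each chain is a (boxes, score) pair; chain_iou is supplied by the caller."""
        fused, used_flow = [], set()
        for t_rgb in rgb_chains:
            match = None
            for j, t_opt in enumerate(flow_chains):
                if j not in used_flow and chain_iou(t_rgb, t_opt) > thr:
                    match = j
                    break
            if match is None:
                fused.append(t_rgb)                       # no overlapping flow chain
            else:
                used_flow.add(match)
                t_opt = flow_chains[match]
                fused.append(t_rgb if t_rgb[1] >= t_opt[1] else t_opt)   # max(s(T_rgb), s(T_opt))
        fused.extend(t for j, t in enumerate(flow_chains) if j not in used_flow)
        return fused

    # Example with a dummy chain_iou that always reports full overlap (illustration only).
    if __name__ == "__main__":
        rgb = [("boxes_A", 0.8)]
        flow = [("boxes_B", 0.6)]
        print(fuse_branches(rgb, flow, lambda a, b: 1.0))   # keeps the RGB chain (score 0.8)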
Taking mAP (mean Average Precision) as the evaluation criterion, the method proposed by the present invention achieves the best behavior detection results to date on the J-HMDB-21 dataset; the comparison with other current methods is shown in Table 1:
| Method | mAP@0.5 | mAP@0.5:0.95 |
| Gurkirt et al.[1] | 72.0 | 41.6 |
| ACT[2] | 72.2 | 43.2 |
| Peng and Schmid[3] | 73.1 | - |
| Harkirat et al.[4] | 67.3 | 36.1 |
| The present invention | 75.9 | 44.8 |
Table 1. Comparison with other methods; '-' indicates the value was not reported; higher values are better.
The methods compared in Table 1 are listed below:
[1] S. G., S. S., and C. F., "Online real time multiple spatiotemporal action localisation and prediction on a single platform," ArXiv, 2016.
[2] V. Kalogeiton, P. Weinzaepfel, V. Ferrari, and C. Schmid, "Action tubelet detector for spatio-temporal action localization," in IEEE International Conference on Computer Vision, 2017, pp. 4415-4423.
[3] P. X. and S. C., "Multi-region two-stream R-CNN for action detection," European Conference on Computer Vision, pp. 744-759, 2016.
[4] B. H., S. M., S. G., S. S., C. F., and T. P. H., "Incremental tube construction for human action detection," ArXiv, 2017.
It should be noted that the purpose of publishing the embodiments is to help further understand the present invention, but those skilled in the art will understand that various substitutions and modifications are possible without departing from the spirit and scope of the present invention and the appended claims. Therefore, the present invention should not be limited to the content disclosed in the embodiments, and the scope of protection of the present invention shall be subject to the scope defined by the claims.