Summary of the invention
In order to overcome the deficiencies of the above existing technologies, the purpose of the present invention is to provide an action recognition method and system based on spatial features and multi-target detection, which improves the accuracy of action recognition by decomposing a compound action into simple targets and exploiting the spatial features between multiple targets.
In order to achieve the above object, the present invention proposes an action recognition method based on spatial features and multi-target detection, comprising the following steps:
Step S1: decompose the action to be detected to obtain each decomposition target, collect a data set for each decomposition target, and train each target data set based on deep learning to obtain a target detection model for each decomposition target;
Step S2: continuously acquire a video stream, use the target detection model to detect targets in the input video stream, obtain the position information of each single target in the video image, calculate the direction vector features between targets, compare the variation trend of the direction vector features across the video stream, and merge targets that keep approaching each other into a new target;
Step S3: extract the synthesized new target; when all targets of an action decomposition have been merged so that only a major target and a secondary target remain, judge the occurrence of the action according to the features of the direction vector generated between the two across video frames and the IOU between the target positions.
Preferably, step S1 further comprises:
Step S100: decompose the action to be detected to obtain several decomposition targets;
Step S101: collect a data set for each decomposition target to obtain multiple target data sets;
Step S102: preprocess the collected target data sets;
Step S103: train each target data set using the YoloV3 network to obtain the target detection model of each decomposition target.
Preferably, in step S102, the preprocessing includes but is not limited to: translating and mirroring the targets in the images of the target data set, adding Gaussian noise or salt-and-pepper noise at the target positions, randomly cropping part of the target image, and applying jitter and padding operations to the image.
Preferably, step S2 further comprises:
Step S200: perform target detection on the current frame of the video stream using the target detection model, obtain the position information of each target, calculate the direction vector between each pair of targets, and extract the features of the pairwise direction vectors;
Step S201: perform target detection on the next video frame using the target detection model, obtain the position information of each target, calculate the direction vector between each pair of targets, and extract the features of the direction vectors;
Step S202: compare the direction vector features obtained from the preceding and following frames, and, based on the variation trend of the length feature and direction feature of the direction vectors across the video stream, merge targets that keep approaching each other into a new target.
Preferably, in step S202, it is judged whether an approaching trend exists between each pair of targets; if an approaching trend exists, the next video frame is processed and step S201 is returned to, until the two targets approach to the point of overlapping and are merged into a new target.
Preferably, in step S202, if the length of the direction vector between a pair of targets keeps decreasing across the preceding and following video frames and its direction remains consistent between frames, the two targets are continuously approaching each other.
Preferably, the direction consistency of two direction vectors between video frames is expressed as:
u_{n-1,n} · u'_{n-1,n} = (x_{n-1} - x_n)(x'_{n-1} - x'_n) + (y_{n-1} - y_n)(y'_{n-1} - y'_n)
where u_{n-1,n} denotes the direction vector between targets in frame t_1, u'_{n-1,n} denotes the direction vector between targets in frame t_2, and (x_n, y_n) denotes the position coordinates of a target.
In the continuously input video stream, if |u_{n-1,n}| > |u'_{n-1,n}| and u_{n-1,n} · u'_{n-1,n} > 0, the two targets have a trend of approaching each other.
Preferably, in step S202, whether two targets should be merged into one new target is determined by the size of the IOU between the two targets.
Preferably, in step S202, the IOU between the two targets is calculated, and if it exceeds a certain threshold, the two targets are merged into a new target.
In order to achieve the above objectives, the present invention also provides an action recognition system based on spatial features and multi-target detection, comprising:
a target detection model training and acquisition unit, configured to decompose the action to be detected to obtain each decomposition target, collect a data set for each decomposition target, and train each target data set based on deep learning to obtain the target detection model of each decomposition target;
a target detection unit, configured to continuously acquire a video stream, detect targets in the input video stream using the target detection model, obtain the position information of each single target in the video image, calculate the direction vector features between targets, compare the variation trend of the direction vector features across the video stream, and merge targets that keep approaching each other into a new target;
an action recognition unit, configured to extract the synthesized new target and, when all targets of an action decomposition have been merged so that only a major target and a secondary target remain, judge the occurrence of the action according to the features of the direction vector generated between the two across video frames and the IOU between the target positions.
Compared with the prior art, the action recognition method and system based on spatial features and multi-target detection of the present invention decompose an action into multiple simple targets and establish a target detection model, make full use of the spatial vector features between multiple targets in the video, and detect the action through the inter-frame variation characteristics of the vectors, that is, through the motion and positional relationships of multiple targets across consecutive frames, thereby achieving the purpose of improving the accuracy of action recognition.
Specific embodiment
Embodiments of the present invention are described below by way of specific examples and with reference to the drawings, so that those skilled in the art can easily understand further advantages and effects of the invention from the content disclosed in this specification. The present invention can also be implemented or applied through other different specific examples, and the details in this specification can be modified and changed from different perspectives and applications without departing from the spirit of the invention.
Fig. 1 is a flow chart of the steps of an action recognition method based on spatial features and multi-target detection according to the present invention. As shown in Fig. 1, the action recognition method based on spatial features and multi-target detection of the present invention comprises the following steps:
Step S1: decompose the action to be detected to obtain each decomposition target. For example, the action of drinking can be decomposed into three targets: wineglass, hand, and mouth (i.e., the drinking action is manually defined as being decomposed into the three targets wineglass, hand, and mouth). A data set is then collected for each decomposition target; some of these data can be found in certain publicly available online data sets, while others require collecting pictures and annotating them with software such as labelImg. Each target data set is then trained based on deep learning to obtain the target detection model of each decomposition target. In the present invention, a single target detection model is trained on the data sets of all decomposition targets, and this target detection model can detect all targets in the data sets.
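As an illustration of this decomposition step, the mapping from a compound action to its detectable sub-targets can be kept as a simple lookup table. The sketch below is hypothetical; the dictionary name and the class labels are only those of the drinking example, not part of any fixed API:

```python
# Hypothetical decomposition table: each compound action is manually
# defined as a set of simple, independently detectable targets.
ACTION_DECOMPOSITION = {
    "drinking": ["wineglass", "hand", "mouth"],
}

def decomposition_targets(action_name):
    """Return the decomposition targets whose data sets must be collected."""
    return ACTION_DECOMPOSITION[action_name]
```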
Specifically, step S1 further comprises:
Step S100: decompose the action to be detected to obtain several decomposition targets.
Step S101: collect a data set for each decomposition target to obtain multiple target data sets. For example, pictures containing each decomposition target are collected from the network, and pictures containing the same decomposition target are gathered together to form the target data set of that decomposition target. For instance, pictures containing several decomposition targets, or pictures of a single decomposition target, can be downloaded from the Internet; the targets in the pictures are then marked with an annotation tool to form the target data set of the decomposition target, which includes the original images and the annotation files generated by the marking.
Step S102: preprocess the collected target data sets. Specifically, in order to improve the performance of target detection, before the target data sets are trained based on deep learning, the images in the collected target data sets are preprocessed. The preprocessing includes but is not limited to: translating and mirroring the targets in the images of the target data set, adding Gaussian noise or salt-and-pepper noise at the target positions, randomly cropping part of the target image, and applying jitter and padding operations to the image.
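A minimal sketch of such a preprocessing step is given below, assuming images are NumPy arrays loaded with OpenCV; the shift range and noise levels are illustrative choices, not values prescribed by the invention:

```python
import random
import numpy as np
import cv2

def augment(image):
    """Apply the augmentations described above to one training image."""
    h, w = image.shape[:2]

    # Horizontal mirror with 50% probability.
    if random.random() < 0.5:
        image = cv2.flip(image, 1)

    # Small random translation; padding fills the exposed border.
    tx, ty = random.randint(-10, 10), random.randint(-10, 10)
    m = np.float32([[1, 0, tx], [0, 1, ty]])
    image = cv2.warpAffine(image, m, (w, h), borderValue=(127, 127, 127))

    # Additive Gaussian noise.
    noise = np.random.normal(0, 8, image.shape).astype(np.float32)
    image = np.clip(image.astype(np.float32) + noise, 0, 255).astype(np.uint8)

    # Salt-and-pepper noise on a small fraction of pixels.
    mask = np.random.random((h, w))
    image[mask < 0.002] = 0
    image[mask > 0.998] = 255
    return image
```

Note that when the geometric operations (translation, mirroring, cropping) are applied, the annotated bounding boxes must be transformed consistently with the image.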
Step S103: train each target data set using the YoloV3 network to obtain the target detection model of each decomposition target.
In the present invention, the network structure of YoloV3 uses residual blocks formed of 3*3 and 1*1 convolutions as the basic components, detects targets at three outputs of different sizes, and then merges the detected targets through NMS to obtain the final targets; the output scales are 13*13, 26*26, and 52*52 respectively. The network structure of the YoloV3 network is shown in Fig. 2, in which the residual-block part contains a total of 21 convolutional layers (including several convolutional layers with 3*3 and 1*1 convolutions), the rest being res layers. The YOLO part is the feature interaction layer of the yolo network and is divided into three scales; at each scale, local feature interaction is realized by means of convolution kernels, acting similarly to a fully connected layer, except that the local feature interaction between feature maps is realized by means of convolution kernels (3*3 and 1*1). In the specific embodiment of the invention, the leftmost is the smallest-scale yolo layer, whose input is a 13*13 feature map; after a series of convolution operations it outputs a feature map of size 13*13, on which classification and position regression are carried out. The middle is the medium-scale yolo layer, which performs a series of convolution operations on the feature map output by the smallest-scale yolo layer, outputs a feature map of size 26*26, and then carries out classification and position regression on it. The rightmost is the large-scale yolo layer, which performs a series of convolution operations on the feature map output by the medium-scale yolo layer, outputs a feature map of size 52*52, and then carries out classification and position regression on it.
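Since the detections from the three output scales are pooled and merged by NMS, a minimal sketch of IOU-based non-maximum suppression is given below; the box format (x1, y1, x2, y2, score) and the 0.45 threshold are assumptions for illustration:

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def nms(boxes, thresh=0.45):
    """Keep the highest-scoring boxes, dropping overlaps above thresh.
    boxes: list of (x1, y1, x2, y2, score) pooled from all three scales."""
    boxes = sorted(boxes, key=lambda b: b[4], reverse=True)
    kept = []
    for b in boxes:
        if all(iou(b[:4], k[:4]) <= thresh for k in kept):
            kept.append(b)
    return kept
```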
Step S2: continuously acquire a video stream, use the target detection model to detect targets in the input video stream, obtain the position information of each single target in the video image, calculate the direction vector features between targets, compare the variation trend of the direction vector features across the video stream, and merge targets that keep approaching each other into a new target. In the specific embodiment of the invention, the direction vector between each pair of targets is calculated for the preceding and following video frames respectively, and the spatial features of the direction vector, namely the length and direction of the vector, are extracted together with their variation trend across the video stream. If the length of the direction vector between two targets keeps decreasing and its direction remains consistent between video frames, the two targets are continuously approaching each other; the IOU between the two targets is then calculated, and if it exceeds a certain threshold, the two are merged into a new target.
Specifically, step S2 further comprises:
Step S200: perform target detection on the current frame of the video stream using the target detection model trained in step S1, obtain the position information of each target, calculate the direction vector between each pair of targets, and extract the features of the pairwise direction vectors.
In the specific embodiment of the invention, the method for obtaining the direction vector between each pair of targets is as follows: suppose the position information of each single target in the image is represented as (x_1, y_1, t_1), (x_2, y_2, t_1), ..., (x_n, y_n, t_1), where t_1 denotes the t_1-th video frame and (x_n, y_n) denotes the position coordinates of a target; the direction vectors between targets can then be expressed as:
u_{1,2} = (x_1 - x_2, y_1 - y_2, t_1)
u_{1,n} = (x_1 - x_n, y_1 - y_n, t_1)
...
u_{n-1,n} = (x_{n-1} - x_n, y_{n-1} - y_n, t_1)
where u_{n-1,n} denotes the direction vector between the (n-1)-th target and the n-th target.
In the embodiment of the present invention, the direction vector features generally refer to the length feature and the direction feature of the direction vector; that is, extracting the features of the direction vector between each pair of targets means calculating the length and direction of the direction vector, where the length feature of the direction vector can be expressed as:

|u_{n-1,n}| = √((x_{n-1} - x_n)² + (y_{n-1} - y_n)²)
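A minimal sketch of this pairwise computation, assuming each detected target has been reduced to a center coordinate, is:

```python
import math
from itertools import combinations

def direction_vectors(centers):
    """Pairwise direction vectors between detected target centers.
    centers: {target_name: (x, y)} for one video frame. The frame index
    t_1 is dropped here since length and direction use only (dx, dy)."""
    vecs = {}
    for a, b in combinations(sorted(centers), 2):
        dx = centers[a][0] - centers[b][0]
        dy = centers[a][1] - centers[b][1]
        vecs[(a, b)] = (dx, dy, math.hypot(dx, dy))  # (dx, dy, length)
    return vecs
```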
Step S201: perform target detection on the next frame using the target detection model, obtain the position information of each target, calculate the direction vector between each pair of targets, and extract the features of the direction vectors. The target detection and direction vector calculation here are the same as in step S200 and will not be repeated.
Step S202: compare the direction vector features obtained in step S201 with the direction vector features obtained from the previous video frame, and, based on the variation trend of the length feature and direction feature of the direction vectors across the video stream, merge targets that keep approaching each other into a new target. That is, judge whether an approaching trend exists between each pair of targets; if an approaching trend exists, proceed to the next video frame and return to step S201, until a pair of targets is very close (e.g., nearly coincident) and the continuously approaching targets are merged into a new target. If no approaching trend exists between two targets over several consecutive frames, the two targets are unrelated, and the relationship between them is no longer tracked thereafter.
In the present invention, the approaching trend between targets is judged from the features that the length of the direction vector between a pair of targets keeps decreasing across the preceding and following video frames and that its direction remains consistent; that is, the direction consistency of two direction vectors between video frames can be expressed as:
u_{n-1,n} · u'_{n-1,n} = (x_{n-1} - x_n)(x'_{n-1} - x'_n) + (y_{n-1} - y_n)(y'_{n-1} - y'_n)
where u_{n-1,n} denotes the direction vector between targets in frame t_1 and u'_{n-1,n} denotes the direction vector between targets in frame t_2. In the continuously input video stream, if |u_{n-1,n}| > |u'_{n-1,n}| and u_{n-1,n} · u'_{n-1,n} > 0, the two targets have a trend of approaching each other.
In the specific embodiment of the invention, whether two targets should be merged into one new target is determined by the size of the IOU (Intersection-over-Union) between the two targets. Specifically, when the IOU between the two targets exceeds a certain threshold T, the two targets can be synthesized into one new target. The IOU between two targets can be expressed as:

IOU = area(A ∩ B) / area(A ∪ B)

where A and B denote the two targets; a schematic diagram of the IOU is shown in Fig. 3.
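The frame-to-frame approaching test and the threshold-based merge of step S202 can then be sketched as follows, reusing the `iou` and `direction_vectors` helpers above; the threshold value 0.5 and the choice of the union bounding box for the merged target are assumptions for illustration:

```python
def approaching(prev_vec, curr_vec):
    """True if the pair's direction vector shrank and kept its direction.
    Each vec is (dx, dy, length) as produced by direction_vectors()."""
    dot = prev_vec[0] * curr_vec[0] + prev_vec[1] * curr_vec[1]
    return curr_vec[2] < prev_vec[2] and dot > 0

def maybe_merge(box_a, box_b, thresh=0.5):
    """Merge two boxes into one new target once their IOU exceeds T."""
    if iou(box_a, box_b) > thresh:
        return (min(box_a[0], box_b[0]), min(box_a[1], box_b[1]),
                max(box_a[2], box_b[2]), max(box_a[3], box_b[3]))
    return None
```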
Step S3: extract the synthesized new target; when all targets of an action decomposition have been merged so that only a major target and a secondary target remain, judge the occurrence of the action according to the features of the direction vector generated between the two across video frames and the IOU between the target positions.
In the specific embodiment of the invention, the distinction between the major target and the secondary target is as follows: among the multiple targets of an action decomposition, the target whose motion changes little between video frames is the major target, while the remaining decomposition targets keep moving between frames, and the single new target they are finally synthesized into is called the secondary target. The secondary target keeps approaching the major target in the video stream, and whether the action occurs is judged from the spatial features formed by the direction vector between the two targets, namely the vector's length, direction, and the variation of its length, together with the IOU between the two targets.
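Putting the pieces together, the final decision for one action can be sketched as a small per-frame state update; `ActionState` and its parameters are hypothetical names, and `math`, `iou`, `approaching` are reused from the sketches above:

```python
class ActionState:
    """Tracks whether the secondary target keeps closing on the major target."""

    def __init__(self, iou_thresh=0.5):
        self.prev_vec = None          # (dx, dy, length) from the previous frame
        self.iou_thresh = iou_thresh  # assumed threshold T

    def update(self, major_box, secondary_box):
        """Feed one frame's boxes; returns True once the action is recognized."""
        (mx, my), (sx, sy) = center(major_box), center(secondary_box)
        dx, dy = mx - sx, my - sy
        vec = (dx, dy, math.hypot(dx, dy))
        closing = self.prev_vec is not None and approaching(self.prev_vec, vec)
        self.prev_vec = vec
        return closing and iou(major_box, secondary_box) > self.iou_thresh

def center(box):
    """Center point of a box given as (x1, y1, x2, y2)."""
    return ((box[0] + box[2]) / 2.0, (box[1] + box[3]) / 2.0)
```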
Fig. 4 is a system architecture diagram of an action recognition system based on spatial features and multi-target detection according to the present invention. As shown in Fig. 4, the action recognition system based on spatial features and multi-target detection of the present invention comprises:
Target detection model training and acquisition unit 401, configured to decompose the action to be detected to obtain each decomposition target, collect a data set for each decomposition target, and train each target data set based on deep learning to obtain the target detection model of each decomposition target. For example, the action of drinking can be decomposed into the three targets wineglass, hand, and mouth; a data set is collected for each decomposition target, some of which can be found in certain publicly available online data sets while others require collecting pictures and annotating them with software such as labelImg; each target data set is then trained based on deep learning to obtain the target detection model of each decomposition target, and this target detection model can detect all targets in the data sets.
Specifically, as shown in Fig. 5, the target detection model training and acquisition unit 401 further comprises:
Action decomposition unit 4010, configured to decompose the action to be detected to obtain several decomposition targets.
Target data set collection unit 4011, configured to collect a data set for each decomposition target to obtain multiple target data sets. For example, pictures containing each decomposition target are collected from the network, and pictures containing the same decomposition target are gathered together to form the target data set of that decomposition target.
Preprocessing unit 4012, configured to preprocess the collected target data sets. Specifically, in order to improve the performance of target detection, before the target data sets are trained based on deep learning, the preprocessing unit 4012 preprocesses the images in the collected target data sets. The preprocessing includes but is not limited to: translating and mirroring the targets in the images of the target data set, adding Gaussian noise or salt-and-pepper noise at the target positions, randomly cropping part of the target image, and applying jitter and padding operations to the image.
Model training unit 4013, configured to train each target data set using the YoloV3 network to obtain the target detection model of each decomposition target.
In the present invention, the network structure of YoloV3 uses residual blocks formed of 3*3 and 1*1 convolutions as the basic components, detects targets at three outputs of different sizes, and then merges the detected targets through NMS to obtain the final targets; the output scales are 13*13, 26*26, and 52*52 respectively.
Target detection unit 402, configured to continuously acquire a video stream, detect targets in the input video stream using the target detection model, obtain the position information of each single target in the video image, calculate the direction vector features between targets, compare the variation trend of the direction vector features across the video stream, and merge targets that keep approaching each other into a new target. In the specific embodiment of the invention, the target detection unit 402 calculates the direction vector between each pair of targets for the preceding and following video frames respectively, and extracts the spatial features of the direction vector, namely the variation trend of the vector's length and direction across the video stream; if the length of the direction vector between two targets keeps decreasing and its direction remains consistent between video frames, the two targets are continuously approaching each other, and the IOU between the two targets is then calculated; if it exceeds a certain threshold, the two are merged into a new target.
Specifically, as shown in Fig. 6, the target detection unit 402 further comprises:
Previous-frame target detection module 4021, configured to perform target detection on the current frame of the video stream using the target detection model, obtain the position information of each target, calculate the direction vector between each pair of targets, and extract the features of the pairwise direction vectors.
In the specific embodiment of the invention, the method for obtaining the direction vector between each pair of targets is as follows: suppose the position information of each single target in the image is represented as (x_1, y_1, t_1), (x_2, y_2, t_1), ..., (x_n, y_n, t_1), where t_1 denotes the t_1-th video frame and (x_n, y_n) denotes the position coordinates of a target; the direction vectors between targets can then be expressed as:

u_{1,2} = (x_1 - x_2, y_1 - y_2, t_1)
u_{1,n} = (x_1 - x_n, y_1 - y_n, t_1)
...
u_{n-1,n} = (x_{n-1} - x_n, y_{n-1} - y_n, t_1)

where u_{n-1,n} denotes the direction vector between the (n-1)-th target and the n-th target.
In the embodiment of the present invention, the direction vector features generally refer to the length feature and the direction feature of the direction vector; that is, extracting the features of the direction vector between each pair of targets means calculating the length and direction of the direction vector, where the length feature of the direction vector can be expressed as:

|u_{n-1,n}| = √((x_{n-1} - x_n)² + (y_{n-1} - y_n)²)

Next-frame target detection unit 4022, configured to perform target detection on the next frame using the target detection model, obtain the position information of each target, calculate the direction vector between each pair of targets, and extract the features of the direction vectors. The target detection and direction vector calculation here are the same as in module 4021 and will not be repeated.
Trend judgment processing unit 4023, configured to compare the direction vector features obtained by the next-frame target detection unit 4022 with the direction vector features obtained from the previous video frame, and judge whether an approaching trend exists between each pair of targets; if an approaching trend exists, proceed to the next video frame and return to the next-frame target detection unit 4022 until a pair of targets is very close (e.g., nearly coincident), whereupon the targets are merged. If no approaching trend exists between two targets over several consecutive frames, the two targets are unrelated, and the relationship between them is no longer tracked thereafter.
In the present invention, the approaching trend between targets is judged from the features that the length of the direction vector between a pair of targets keeps decreasing across the preceding and following video frames and that its direction remains consistent; that is, the direction consistency of two direction vectors between video frames can be expressed as:
u_{n-1,n} · u'_{n-1,n} = (x_{n-1} - x_n)(x'_{n-1} - x'_n) + (y_{n-1} - y_n)(y'_{n-1} - y'_n)
where u_{n-1,n} denotes the direction vector between targets in frame t_1 and u'_{n-1,n} denotes the direction vector between targets in frame t_2. In the continuously input video stream, if |u_{n-1,n}| > |u'_{n-1,n}| and u_{n-1,n} · u'_{n-1,n} > 0, the two targets have a trend of approaching each other.
In the specific embodiment of the invention, whether two targets should be merged into one new target is determined by the size of the IOU (Intersection-over-Union) between the two targets. Specifically, when the IOU between the two targets exceeds a certain threshold T, the two targets can be synthesized into one new target. The IOU between two targets can be expressed as:

IOU = area(A ∩ B) / area(A ∪ B)

where A and B denote the two targets.
Action recognition unit 403, configured to extract the synthesized new target and, when all targets of an action decomposition have been merged so that only a major target and a secondary target remain, judge the occurrence of the action according to the features of the direction vector generated between the two across video frames and the IOU between the target positions.
In the specific embodiment of the invention, the distinction between the major target and the secondary target is as follows: among the multiple targets of an action decomposition, the target whose motion changes little between video frames is the major target, while the remaining decomposition targets keep moving between frames, and the single new target they are finally synthesized into is called the secondary target. The secondary target keeps approaching the major target in the video stream, and whether the action occurs is judged from the spatial features formed by the direction vector between the two targets, namely the vector's length, direction, and the variation of its length, together with the IOU between the two targets.
Fig. 7 is a flow chart of the action recognition method based on spatial features and multi-target detection according to the specific embodiment of the invention. In this embodiment, the action recognition process based on spatial features and multi-target detection is as follows:
Step 1: decompose the action to be detected into multiple targets, collect the data sets of those targets, preprocess the data sets, and train them through the YoloV3 network in deep learning to obtain the target detection model.
In this embodiment, the methods for preprocessing the data sets include: translating and mirroring the targets in the images, adding Gaussian noise or salt-and-pepper noise at the target positions, randomly cropping part of the target image, and applying jitter and padding operations to the image. The network structure of YoloV3 uses residual blocks formed of 3*3 and 1*1 convolutions as the basic components, detects targets at three outputs of different sizes, and then merges the detected targets through NMS to obtain the final targets; the output scales are 13*13, 26*26, and 52*52 respectively.
Step 2: input a video frame, perform target detection on that frame, obtain the position information of the targets, calculate the direction vector between each pair of targets, and extract the features of the pairwise direction vectors.
In the embodiment of the invention, the method for obtaining the direction vector between each pair of targets is: the position information of a single target in the image is represented as (x_1, y_1, t_1), (x_2, y_2, t_1), ..., (x_n, y_n, t_1), where t_1 denotes the t_1-th video frame and (x_n, y_n) denotes the position coordinates of a target; the direction vectors can be expressed as:
u_{1,2} = (x_1 - x_2, y_1 - y_2, t_1)
u_{1,n} = (x_1 - x_n, y_1 - y_n, t_1)
...
u_{n-1,n} = (x_{n-1} - x_n, y_{n-1} - y_n, t_1)
where u_{n-1,n} denotes the direction vector between the (n-1)-th target and the n-th target.
In the embodiment of the invention, the length feature of the direction vector can be expressed as:

|u_{n-1,n}| = √((x_{n-1} - x_n)² + (y_{n-1} - y_n)²)
Step 3: input the video stream, similarly perform target detection again and calculate the direction vector between each pair of targets, compare it with the previous direction vector, and judge whether an approaching trend exists between each pair of targets; continue inputting video until the targets are very close, then merge the targets.
The direction consistency of two direction vectors between video frames can be expressed as:
u_{n-1,n} · u'_{n-1,n} = (x_{n-1} - x_n)(x'_{n-1} - x'_n) + (y_{n-1} - y_n)(y'_{n-1} - y'_n)
where u_{n-1,n} denotes the direction vector between targets in frame t_1 and u'_{n-1,n} denotes the direction vector between targets in frame t_2.
In the continuously input video stream, if |u_{n-1,n}| > |u'_{n-1,n}| and u_{n-1,n} · u'_{n-1,n} > 0, the two targets have a trend of approaching each other; when the IOU between the two targets exceeds a certain threshold T, the two targets can be synthesized into one new target. The IOU between two targets can be expressed as:

IOU = area(A ∩ B) / area(A ∪ B)

where A and B denote the two targets.
Step 4: when all targets of an action decomposition have been merged so that only a major target and a secondary target remain, judge the occurrence of the action according to the features of the direction vector generated between the two across video frames and the IOU between the target positions. In this embodiment, among the multiple targets of an action decomposition, the target with little movement variation is called the major target; the remaining single targets keep moving between frames and are synthesized into one new target, and the new target finally obtained is called the secondary target. The secondary target keeps approaching the major target in the video stream, and whether the action occurs is judged from the spatial features formed by the direction vector between the two targets, namely the vector's length, direction, and the variation of its length, together with the IOU between the two targets.
In conclusion a kind of action identification method and system based on space characteristics and multi-target detection of the present invention pass through byMovement decomposition is multiple simple targets and establishes target detection model, makes full use of the space vector in video between multiple target specialSign, by interframe vector variation characteristic, by multiple targets continuous interframe movement relation and positional relationship come detection operation,Realize the purpose for improving action recognition accuracy rate.
The above-described embodiments merely illustrate the principles and effects of the present invention, and is not intended to limit the present invention.AnyWithout departing from the spirit and scope of the present invention, modifications and changes are made to the above embodiments by field technical staff.Therefore,The scope of the present invention, should be as listed in the claims.