Summary of the invention
To this end, the present invention provides a target tracking scheme that seeks to solve, or at least alleviate, at least one of the problems described above.
According to one aspect of the invention, a target tracking method is provided. The method is executed in a mobile device having a camera function and comprises the steps of: determining a target position in an initial frame according to user input, wherein the target position is represented as a target box surrounding the target center; training a tracker and a detector based on the target position in the initial frame, wherein the tracker is adapted to track the target in the captured video and the detector is adapted to detect the target in the captured video; for each subsequent image frame in the captured video: tracking with the tracker to obtain the target position of the image frame and outputting a tracking response value; judging whether the tracking response value is greater than or equal to a threshold, and if so, continuing target tracking on the next image frame; if not, starting the detector and using the detector to output the target position of the corresponding image frame; and, after keeping the detector running continuously for a predetermined number of frames, switching back to the tracker to continue target tracking.
Optionally, in the target tracking method according to the present invention, the step of determining the target position in the initial frame according to user input comprises: based on the region of interest input by the user, outputting multiple candidate target boxes of the current image frame using an RPN network model; performing recognition and position regression with a Fast R-CNN network model and outputting a confidence for each candidate target box; and, after non-maximum suppression, selecting the candidate target box with the highest confidence as the target box characterizing the target position in the initial frame.
Optionally, in the target tracking method according to the present invention, the step of training the tracker based on the target position of the initial image frame comprises: collecting samples using a circulant matrix over the region surrounding the target box of the initial image frame; and outputting the initial tracking template of the tracker using a least-squares optimization method.
Optionally, in the target tracking method according to the present invention, the step of training the detector based on the target position of the initial image frame comprises: according to the target box of the initial image frame, outputting multiple sample boxes using a sliding window of predetermined scales, and generating a sample queue.
Optionally, in the target tracking method according to the present invention, the sliding window of predetermined scales is defined as follows: the initial scale of the sliding window is 10% of the original image, each scale is a first predetermined multiple or a second predetermined multiple of the adjacent scale, and the scale range is [0.1 times the initial scale, 10 times the initial scale].
Optionally, in the target tracking method according to the present invention, the step of tracking with the tracker to obtain the target position and outputting the tracking response value comprises: generating a tracking template with the tracker from the target position of the previous image frame; generating a search region for the image frame according to the target position of the previous image frame; convolving the tracking template with the neighborhood of each pixel in the search region to obtain a response value for each pixel; selecting the pixel with the maximum response value as the target center of the image frame, and outputting the maximum response value as the tracking response value; and determining the target position of the image frame from the target center and the size of the target box of the previous image frame.
Optionally, in the target tracking method according to the present invention, the step of generating the search region of the image frame according to the target position of the previous image frame comprises: taking the center of the target box of the previous image frame as the search center and twice each dimension of that target box as the search range, to form the search region of the image frame.
Optionally, in the target tracking method according to the present invention, the step of generating the search region of the image frame according to the target position of the previous image frame further comprises: scaling the image frame according to predetermined scaling factors to obtain multiple scaled image frames; and taking the center of the target box of the previous image frame as the search center and twice each dimension of that target box as the search range, to form the search region of each scaled image frame.
Optionally, in the target tracking method according to the present invention, the step of convolving the tracking template with the neighborhood of each pixel in the search region comprises: convolving the tracking template with the neighborhood of each pixel in the search region of each scaled image frame, to obtain response values under the different scaling factors.
Optionally, in the target tracking method according to the present invention, the step of determining the target position of the image frame from the target center and the size of the target box of the previous image frame further comprises: scaling the target box of the previous image frame by the scaling factor of the scaled image frame that contains the pixel with the maximum response value, and taking the result as the target box size of the image frame; and determining the target position of the image frame from the computed target center and the target box size of the image frame.
Optionally, in the target tracking method according to the present invention, the step of using the detector to output the target position of the corresponding image frame comprises: generating multiple candidate samples of the target in the image frame according to the sample boxes in the sample queue; and filtering the candidate samples through a three-stage cascade classifier and outputting the target position of the image frame.
Optionally, the target tracking method according to the present invention further comprises a step of updating the tracker: after the target box of each image frame is obtained, computing the tracking template of the image frame from the content of that frame; and weighting the tracking template of the image frame with the tracking template of the previous image frame to obtain an updated tracking template.
Optionally, in the target tracking method according to the present invention, the weighting coefficients of the image frame and the previous image frame are 0.015 and 0.985, respectively.
Optionally, the target tracking method according to the present invention further comprises a step of updating the detector: computing the IoU (intersection over union) index of the multiple candidate samples generated by the detector; and screening the sample queue according to the IoU index.
According to another aspect of the invention, a mobile device is provided, comprising: a camera subsystem adapted to capture video images; one or more processors; a memory; and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the one or more processors, the one or more programs including instructions for performing any of the methods described above.
According to another aspect of the invention, a computer-readable storage medium storing one or more programs is provided, the one or more programs including instructions which, when executed by a computing device, cause the computing device to perform any of the methods described above.
Compared with existing autofocus methods, the target tracking scheme according to the present invention provides a user-friendly interaction mode: the user only needs to tap or outline a region on the touch screen, and the scheme automatically determines the user's region of interest and generates a relatively accurate and fine target position, so as to ensure the accuracy of subsequent tracking.
Further, taking into account factors such as the real-time performance and accuracy of target tracking, the tracker is used to track the target in each image frame of the subsequently captured video, and when a tracking error occurs or the tracked target disappears, the standby detector is started to detect the target, thereby ensuring robust tracking over long videos.
Specific embodiment
Exemplary embodiments of the present disclosure are described in more detail below with reference to the accompanying drawings. Although the drawings show exemplary embodiments of the disclosure, it should be understood that the disclosure may be implemented in various forms and should not be limited by the embodiments set forth here. Rather, these embodiments are provided so that the disclosure will be understood more thoroughly and so that its scope can be fully conveyed to those skilled in the art.
Fig. 1 shows a schematic structural diagram of a mobile device 100 according to an embodiment of the invention. Referring to Fig. 1, the mobile device 100 includes: a memory interface 102; one or more data processors, image processors and/or central processing units 104; and a peripheral interface 106. The memory interface 102, the one or more processors 104 and/or the peripheral interface 106 may be discrete components or may be integrated in one or more integrated circuits. In the mobile device 100, the various components may be coupled by one or more communication buses or signal lines. Sensors, devices and subsystems may be coupled to the peripheral interface 106 to help implement multiple functions. For example, a motion sensor 110, a light sensor 112 and a distance sensor 114 may be coupled to the peripheral interface 106 to facilitate functions such as orientation, illumination and ranging. Other sensors 116 may likewise be connected to the peripheral interface 106, such as a positioning system (for example, a GPS receiver), an angular velocity sensor, a temperature sensor, a biometric sensor or other sensing devices, to help implement related functions.
The camera subsystem 120 and the optical sensor 122 may be used to facilitate camera functions such as recording photographs and video clips, where the optical sensor may be, for example, a charge-coupled device (CCD) or a complementary metal-oxide-semiconductor (CMOS) optical sensor.
Communication functions may be facilitated by one or more wireless communication subsystems 124, which may include radio-frequency receivers and transmitters and/or optical (for example, infrared) receivers and transmitters. The particular design and implementation of the wireless communication subsystem 124 may depend on the one or more communication networks supported by the mobile device 100. For example, the mobile device 100 may include a communication subsystem 124 designed to support GSM networks, GPRS networks, EDGE networks, Wi-Fi or WiMax networks, and Bluetooth™ networks. The audio subsystem 126 may be coupled to a loudspeaker 128 and a microphone 130 to help implement voice-enabled functions such as speech recognition, speech reproduction, digital recording and telephony.
The I/O subsystem 140 may include a touch screen controller 142 and/or one or more other input controllers 144. The touch screen controller 142 may be coupled to a touch screen 146. For example, the touch screen 146 and the touch screen controller 142 may detect contact, and movement or pauses of that contact, using any of a variety of touch-sensing technologies, including but not limited to capacitive, resistive, infrared and surface acoustic wave technologies. The one or more other input controllers 144 may be coupled to other input/control devices 148, such as one or more buttons, rocker switches, thumb wheels, infrared ports, USB ports, and/or pointer devices such as a stylus. The one or more buttons (not shown) may include up/down buttons for controlling the volume of the loudspeaker 128 and/or the microphone 130.
The memory interface 102 may be coupled to a memory 150. The memory 150 may include high-speed random access memory and/or non-volatile memory, such as one or more magnetic disk storage devices, one or more optical storage devices, and/or flash memory (for example, NAND or NOR). The memory 150 may store an operating system 152, such as Android, iOS or Windows Phone. The operating system 152 may include instructions for handling basic system services and performing hardware-dependent tasks. The memory 150 may also store applications 154. When running, these applications are loaded from the memory 150 onto the processor 104, run on top of the operating system already being run by the processor 104, and use the interfaces provided by the operating system and the underlying hardware to implement various functions desired by the user, such as instant messaging, web browsing and picture management. An application may be provided independently of the operating system or may be bundled with it. In some implementations, the applications 154 may be one or more programs.
According to an implementation of the present invention, the target tracking function used while the camera subsystem 120 captures video images, that is, the method 200 described below, is implemented by storing one or more corresponding programs in the memory 150 of the mobile device 100. It should be noted that the mobile device 100 referred to in the present invention may be a mobile phone, a tablet, a camera or the like having the above structure.
Fig. 2 shows a flow chart of a target tracking method 200 according to an embodiment of the invention. As shown in Fig. 2, the method 200 starts at step S210. When the camera subsystem 120 is opened for video capture, the user may input a region of interest or a target of interest, for example by tapping or outlining on the touch screen. The user's input is then refined to obtain the target position in the initial (image) frame, where the target position is represented as a target box surrounding the target center.
According to one embodiment, the region of interest input by the user is fed into a deep learning model trained offline, which outputs the target position. Specifically, an RPN (Region Proposal Network) model first outputs multiple candidate target boxes for the current image frame; a Fast R-CNN network model then performs recognition and position regression and outputs a confidence for each candidate target box; finally, after non-maximum suppression, the candidate target box with the highest confidence is selected as the target box characterizing the target position in the initial image frame. For an introduction to these network models, refer to the following paper, which is not described further here: Ren, Shaoqing, et al. "Faster R-CNN: Towards real-time object detection with region proposal networks." Advances in Neural Information Processing Systems. 2015.
Then, in step S220, a tracker and a detector are trained based on the target position in the initial image frame.
The tracker is adapted to track the target in the captured video. Optionally, the present embodiment uses a discriminative tracking method that distinguishes the target from its surroundings. In tracking, training a good classifier requires a large number of samples, which implies a large time cost. According to one embodiment of the present invention, training samples are generated from the target box of the initial image frame and its surrounding region using a circulant matrix, that is, the samples are cyclically shifted versions of the image patch. The benefit is that training on this sample set can be carried out with more efficient frequency-domain methods. The initial tracking template of the tracker is then output using a least-squares optimization method.
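The idea of generating samples by cyclic shifts can be illustrated with a minimal sketch on a one-dimensional signal. The function name `cyclic_shifts` is hypothetical and not part of the described method; an actual tracker would work on two-dimensional image patches and would typically exploit the circulant structure in the frequency domain rather than build this matrix explicitly.

```python
def cyclic_shifts(base):
    """Return all cyclic shifts of `base`, one per row.

    Row k is `base` shifted right by k positions, so the rows together
    form a circulant matrix built from a single base sample.
    """
    n = len(base)
    return [list(base[n - k:]) + list(base[:n - k]) for k in range(n)]


if __name__ == "__main__":
    # One base sample yields as many training samples as it has elements.
    for row in cyclic_shifts([1, 2, 3, 4]):
        print(row)
```

Every row is derived from the same base sample, which is why a single patch around the target box can stand in for a large training set.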
The detector is adapted to detect the target in the captured video. According to an embodiment of the invention, the detector is trained with a sliding-window sampling method: according to the target box of the initial image frame, multiple sample boxes are output using a sliding window of predetermined scales, and a sample queue is generated. Optionally, the initial scale of the sliding window is 10% of the original image size, each scale is a first predetermined multiple (for example, 1.2 times) or a second predetermined multiple (for example, 0.8 times) of the adjacent scale, and the scale range is [0.1 times the initial scale, 10 times the initial scale]; in particular, windows with an area of less than 20 pixels are rejected. The output sample boxes are divided into positive and negative classes according to the size of the region in which each sample box overlaps the target box: sample boxes with an overlap ratio greater than 50% are stored in the positive sample queue, and sample boxes with an overlap ratio less than 20% are stored in the negative sample queue.
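The scale schedule and the positive/negative split described above can be sketched as follows. This is an illustrative reading, not the definitive implementation: the function names are hypothetical, the scales are assumed to form a geometric ladder around the initial scale, and the overlap ratio is taken as a number already computed elsewhere, since the text does not state how it is normalized.

```python
def scale_ladder(initial_scale, up=1.2, down=0.8, lo=0.1, hi=10.0):
    """Return the sorted window scales within [lo, hi] times the initial scale.

    Assumption: adjacent scales differ by a factor of `up` (growing) or
    `down` (shrinking), starting from `initial_scale`.
    """
    scales = [initial_scale]
    s = initial_scale * up
    while s <= hi * initial_scale:
        scales.append(s)
        s *= up
    s = initial_scale * down
    while s >= lo * initial_scale:
        scales.append(s)
        s *= down
    return sorted(scales)


def window_is_valid(width, height, min_area=20):
    """Reject windows whose area is below `min_area` pixels."""
    return width * height >= min_area


def label_sample(overlap_ratio, pos_thresh=0.5, neg_thresh=0.2):
    """Return 'positive', 'negative', or None for samples in between."""
    if overlap_ratio > pos_thresh:
        return "positive"
    if overlap_ratio < neg_thresh:
        return "negative"
    return None


if __name__ == "__main__":
    print(len(scale_ladder(1.0)), "scales")
    print(label_sample(0.6), label_sample(0.1), label_sample(0.3))
```

Samples whose overlap ratio falls between the two thresholds are left out of both queues in this sketch, which follows from the thresholds in the text but is not stated there explicitly.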
As can be seen from the above description, the detector is more computationally intensive than the tracker. Taking into account factors such as the real-time performance and accuracy of target tracking, the tracker is used to track the target in each image frame of the subsequently captured video, and the detector is started to detect the target only when a tracking error occurs or the tracked target disappears.
The detailed process is described below.
In step S230, the tracker is used to obtain the target position of the image frame (for example, the second image frame) and to output a tracking response value.
Fig. 3 shows a flow chart of obtaining the target position of an image frame by tracking with the tracker, according to an embodiment of the invention.
In step S2302, the tracker generates a tracking template from the target position of the previous image frame. That is, after the target of each image frame is obtained by tracking, the tracker generates a tracking template for that image frame, to be used for tracking in the next image frame.
Then, in step S2304, the search region of the image frame is generated according to the target position of the previous image frame. Optionally, the center of the target box of the previous image frame is taken as the search center, and twice each dimension (that is, width and height) of that target box is taken as the search range, to form the search region of the image frame. For example, if the target position of the previous image frame is represented as a target box centered on pixel (200, 500) with a size of 100 × 100, then the search region generated for the image frame from that target position is a search box centered on pixel (200, 500) with a size of 200 × 200.
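The search region computation can be written as a short sketch that reproduces the numerical example above. The function name `search_region` and the (left, top, width, height) return layout are illustrative choices, not taken from the source.

```python
def search_region(center, box_size, factor=2.0):
    """Return (left, top, width, height) of the search region.

    The region keeps the previous target center and enlarges each
    dimension of the previous target box by `factor`.
    """
    cx, cy = center
    w, h = box_size
    sw, sh = w * factor, h * factor
    return (cx - sw / 2, cy - sh / 2, sw, sh)


if __name__ == "__main__":
    # Target box centered on (200, 500) with size 100 x 100.
    print(search_region((200, 500), (100, 100)))
```

A practical implementation would also need to clip the region to the image bounds; that detail is not covered by the text and is omitted here.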
Then, in step S2306, the tracking template is convolved with the neighborhood of each pixel in the search region (equivalently, the tracking template and the search region are transformed to the frequency domain and multiplied element-wise), yielding a response value for each pixel. The response value indicates the probability that the pixel is the final target center point.
Then, in step S2308, the pixel with the maximum response value is selected as the target center of the image frame, and the maximum response value is output as the tracking response value.
Then, in step S2310, the target position of the image frame is determined from the target center and the size of the target box of the previous image frame. That is, the target position of the image frame is represented as a target box centered on the pixel corresponding to the tracking response value and having the same size as the target box of the previous image frame.
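Steps S2308 and S2310 amount to locating the peak of the response map and re-centering the previous target box on it. The sketch below assumes the response map is a list of rows indexed as `response[row][col]` and that `region_origin` is the image coordinate of the search region's top-left pixel; both conventions, like the function name, are assumptions made for illustration.

```python
def locate_target(response, region_origin, prev_box_size):
    """Return (center, box, peak) for one response map.

    center: (x, y) image coordinates of the maximum response.
    box:    (left, top, width, height), same size as the previous box.
    peak:   the maximum response, used as the tracking response value.
    """
    peak = float("-inf")
    peak_rc = (0, 0)
    for r, row in enumerate(response):
        for c, value in enumerate(row):
            if value > peak:
                peak, peak_rc = value, (r, c)
    ox, oy = region_origin
    center = (ox + peak_rc[1], oy + peak_rc[0])  # x from column, y from row
    w, h = prev_box_size
    box = (center[0] - w / 2, center[1] - h / 2, w, h)
    return center, box, peak


if __name__ == "__main__":
    demo = [[0.1, 0.2], [0.9, 0.3]]
    print(locate_target(demo, (100, 400), (100, 100)))
```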
In a specific implementation, changes in shooting focal length or movement of the target object may cause the scale of the target object to change. Therefore, according to an embodiment of the present invention, the above steps S2302 to S2310 are carried out at several different scales.
That is, before step S2304 is executed, the image frame is scaled according to predetermined scaling factors to obtain multiple scaled image frames. Then, as in step S2304, the center of the target box of the previous image frame is taken as the search center and twice each dimension of that target box as the search range, to form the search region of each scaled image frame. According to an embodiment of the present invention, the predetermined scaling factors include one or more of the following set: {0.82, 0.88, 0.94, 1.06, 1.12, 1.2}.
In step S2306, the tracking template is convolved with the neighborhood of each pixel in the search region of each scaled image frame, yielding response values under the different scaling factors.
In the subsequent steps S2308 and S2310, the target box of the previous image frame is scaled by the scaling factor of the scaled image frame that contains the pixel with the maximum response value, and the result is taken as the target box size of the image frame; the target position of the image frame is then determined from the computed target center and this target box size.
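Selecting the scale can be sketched as picking the scaling factor whose response map has the highest peak. The names below are hypothetical. Two assumptions are worth flagging: the factor 1.0 in the demonstration stands for the unscaled frame and is not in the listed set, and the box is multiplied by the winning factor, which is the literal reading of the text; an implementation may need the reciprocal depending on whether the image or the template is rescaled.

```python
def best_scale(peak_by_factor):
    """Return the scaling factor whose peak response is largest.

    `peak_by_factor` maps each scaling factor to the maximum response
    found in the search region of the frame scaled by that factor.
    """
    return max(peak_by_factor, key=peak_by_factor.get)


def rescale_box(prev_size, factor):
    """Scale the previous target box size by `factor` (literal reading)."""
    w, h = prev_size
    return (w * factor, h * factor)


if __name__ == "__main__":
    peaks = {0.94: 0.31, 1.0: 0.42, 1.06: 0.55}
    factor = best_scale(peaks)
    print(factor, rescale_box((100, 50), factor))
```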
Then, in step S240, it is judged whether the tracking response value is greater than or equal to a threshold; optionally, the threshold is set to 0.27. If the tracking response value is greater than or equal to 0.27, the process returns to step S230 and target tracking continues on the next image frame.
If the tracking response value is less than the threshold, the tracking result is considered inaccurate and step S250 is executed: the detector is started and used to output the target position of the corresponding image frame. According to an embodiment of the invention, when the tracked target position is too close to the image border, the target is considered likely to have disappeared; in that case the detector is also started and step S250 is executed.
Specifically, using the detector trained in step S220, multiple candidate samples of the target in the image frame are generated according to the sample boxes in the sample queue. Because the number of candidate samples is large, applying nearest-neighbor matching directly would be inefficient. A three-stage cascade classifier is therefore used to filter the candidate samples and output the target position of the image frame. According to one embodiment, the first stage filters candidate samples by a variance constraint, the second stage further filters candidate samples with a random fern classifier, and the third stage performs nearest-neighbor matching; the candidate sample with the highest score is taken as the output of the detector.
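The structure of such a cascade, in which cheap tests discard most candidates before the expensive matching step runs, can be sketched as follows. Only the variance stage is implemented concretely. The random fern classifier and the nearest-neighbor score are passed in as callables because the text does not specify them, and the variance threshold is a hypothetical parameter.

```python
def patch_variance(pixels):
    """Population variance of a flat list of pixel intensities."""
    n = len(pixels)
    mean = sum(pixels) / n
    return sum((p - mean) ** 2 for p in pixels) / n


def cascade_detect(candidates, min_variance, fern_accepts, nn_score):
    """Filter candidates through three stages and return the best one.

    Stage 1: drop low-variance patches (cheap test).
    Stage 2: keep patches accepted by `fern_accepts` (placeholder callable).
    Stage 3: return the survivor with the highest `nn_score`, or None.
    """
    survivors = [c for c in candidates if patch_variance(c) >= min_variance]
    survivors = [c for c in survivors if fern_accepts(c)]
    if not survivors:
        return None
    return max(survivors, key=nn_score)


if __name__ == "__main__":
    patches = [[5, 5, 5, 5], [0, 10, 0, 10], [0, 20, 0, 20]]
    best = cascade_detect(
        patches,
        min_variance=1.0,
        fern_accepts=lambda p: True,           # stand-in for the fern stage
        nn_score=lambda p: -abs(sum(p) - 20),  # stand-in similarity score
    )
    print(best)
```

Ordering the stages from cheapest to most expensive means the nearest-neighbor step only sees the few candidates that survive the first two filters.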
In many cases, a target that has disappeared or been occluded does not reappear immediately, and running the detector only briefly may fail to find it. Therefore, in step S260, after the detector has been kept running continuously for a predetermined number of frames, the process switches back to the tracker and continues target tracking with the tracker, that is, it returns to step S230 and continues the target tracking flow. Optionally, the predetermined number of frames is set to 50.
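The switching between tracker and detector in steps S240 to S260 behaves like a two-state machine, sketched below. The names `step_mode`, `TRACK` and `DETECT` are hypothetical, and whether the frame on which tracking fails counts toward the predetermined number of frames is not stated in the text; this sketch starts counting on the following frame.

```python
TRACK, DETECT = "track", "detect"


def step_mode(mode, detect_count, tracking_ok, hold_frames=50):
    """Advance the mode by one frame and return (mode, detect_count).

    In TRACK mode, a failed tracking check switches to DETECT.
    In DETECT mode, the detector keeps running until it has processed
    `hold_frames` frames, after which control returns to the tracker.
    """
    if mode == TRACK:
        return (TRACK, 0) if tracking_ok else (DETECT, 0)
    detect_count += 1
    if detect_count >= hold_frames:
        return (TRACK, 0)
    return (DETECT, detect_count)


if __name__ == "__main__":
    mode, count = step_mode(TRACK, 0, tracking_ok=False)
    frames = 0
    while mode == DETECT:
        mode, count = step_mode(mode, count, tracking_ok=True)
        frames += 1
    print("detector ran for", frames, "frames")
```

Here `tracking_ok` would be the result of comparing the tracking response value with the threshold (optionally 0.27) and of the image-border check.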
According to an embodiment of the present invention, after the target position in each image frame has been determined, the system updates the tracker and the detector according to the content of that frame.
Specifically, the tracker is updated as follows: after the target box of each image frame is obtained, the tracking template of the image frame is computed from the content of that frame; the tracking template of the image frame and the tracking template of the previous image frame are then combined by weighting (that is, linear superposition) to obtain the updated tracking template. Optionally, the weighting coefficients of the image frame and the previous image frame are 0.015 and 0.985, respectively.
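The weighted update is a linear blend of the two templates. A minimal sketch, assuming templates are stored as lists of rows of equal shape (the storage format and the function name are illustrative):

```python
def update_template(prev_template, new_template, new_weight=0.015):
    """Blend the new template into the previous one, element by element.

    updated = (1 - new_weight) * previous + new_weight * new
    With new_weight = 0.015 the previous template keeps weight 0.985.
    """
    old_weight = 1.0 - new_weight
    return [
        [old_weight * p + new_weight * n for p, n in zip(prev_row, new_row)]
        for prev_row, new_row in zip(prev_template, new_template)
    ]


if __name__ == "__main__":
    print(update_template([[1.0, 0.0]], [[0.0, 1.0]]))
```

The small weight on the new frame makes the template change slowly, so a single poorly tracked frame has limited influence on it.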
Similarly, the detector is updated as follows: the IoU index of the multiple candidate samples generated by the detector is computed, and the sample queue is screened according to the IoU index. In other words, the samples that the detector judges likely to be the target in a new frame are classified according to the IoU index. If the IoU index is greater than 0.65, the sample is considered to overlap strongly with the tracking result and is assigned to the positive sample queue; if the IoU index is less than 0.2, the sample is considered to overlap weakly with the tracking result and is added to the negative sample queue. To keep the sample queues from growing too long, some samples are randomly forgotten so that the overall number of samples remains stable.
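The queue update can be sketched with a standard intersection-over-union computation on (left, top, width, height) boxes. The thresholds 0.65 and 0.2 come from the text; the box layout, the function names and the queue cap `max_len` are assumptions, since the text does not say how many samples are kept.

```python
import random


def iou(a, b):
    """Intersection over union of two (left, top, width, height) boxes."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    iw = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    ih = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = iw * ih
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0


def update_queues(candidates, tracked_box, positives, negatives,
                  max_len=100, rng=random):
    """Sort candidates into the queues, then randomly forget extras."""
    for box in candidates:
        score = iou(box, tracked_box)
        if score > 0.65:
            positives.append(box)
        elif score < 0.2:
            negatives.append(box)
    for queue in (positives, negatives):
        while len(queue) > max_len:
            queue.pop(rng.randrange(len(queue)))  # drop a random sample


if __name__ == "__main__":
    pos, neg = [], []
    boxes = [(0, 0, 10, 10), (20, 20, 10, 10), (5, 0, 10, 10)]
    update_queues(boxes, (0, 0, 10, 10), pos, neg)
    print(pos, neg)
```

Candidates whose IoU lies between the two thresholds are ignored, which keeps ambiguous samples out of both queues.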
In summary, compared with existing autofocus methods, the target tracking scheme according to the present invention first provides a user-friendly interaction mode: the user only needs to tap or outline a region on the touch screen, and the scheme automatically determines the user's region of interest and generates a relatively accurate and fine target position, so as to ensure the accuracy of subsequent tracking. Second, the tracker of this scheme uses cyclically shifted image samples, which gives better discrimination under target deformation, motion blur, background clutter and similar conditions, and the tracking algorithm runs at real-time speed, so that the position and corresponding scale of the target object in an image frame can be determined quickly and accurately. Finally, when the target temporarily disappears or is occluded, this scheme provides a standby detector that relies on a long-term memory of the target's appearance and is not bound by spatial constraints, so that the target's position can be determined again after it reappears, thereby ensuring robust tracking over long videos.
The various techniques described herein may be implemented in hardware, in software, or in a combination of the two. Thus, the method and apparatus of the present invention, or certain aspects or portions thereof, may take the form of program code (that is, instructions) embodied in tangible media, such as floppy disks, CD-ROMs, hard disk drives or any other machine-readable storage medium, wherein, when the program is loaded into and executed by a machine such as a computer, the machine becomes an apparatus for practicing the invention.
Where the program code executes on a programmable computer, the mobile device generally includes a processor, a storage medium readable by the processor (including volatile and non-volatile memory and/or storage elements), at least one input device, and at least one output device (as shown in Fig. 1). The memory is configured to store the program code; the processor is configured to execute the target tracking method of the invention according to the instructions in the program code stored in the memory.
By way of example and not limitation, computer-readable media include computer storage media and communication media. Computer storage media store information such as computer-readable instructions, data structures, program modules or other data. Communication media typically embody computer-readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism, and include any information delivery media. Combinations of any of the above are also included within the scope of computer-readable media.
It should be appreciated that, in order to streamline the disclosure and aid understanding of one or more of the various inventive aspects, features of the invention are sometimes grouped together in a single embodiment, figure or description thereof in the above description of exemplary embodiments. This method of disclosure, however, is not to be interpreted as reflecting an intention that the claimed invention requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in fewer than all features of a single embodiment disclosed above. Thus, the claims following the detailed description are hereby expressly incorporated into the detailed description, with each claim standing on its own as a separate embodiment of the invention.
Those skilled in the art should understand that the modules, units or components of the devices in the examples disclosed herein may be arranged in a device as described in the embodiment, or may alternatively be located in one or more devices different from the device in the example. The modules in the foregoing examples may be combined into one module or may be further divided into multiple sub-modules.
Those skilled in the art will understand that the modules in the devices of an embodiment may be adaptively changed and arranged in one or more devices different from that embodiment. The modules, units or components of an embodiment may be combined into one module, unit or component, and may furthermore be divided into multiple sub-modules, sub-units or sub-components. Except where at least some of such features and/or processes or units are mutually exclusive, all features disclosed in this specification (including the accompanying claims, abstract and drawings), and all processes or units of any method or device so disclosed, may be combined in any combination. Unless expressly stated otherwise, each feature disclosed in this specification (including the accompanying claims, abstract and drawings) may be replaced by an alternative feature serving the same, an equivalent or a similar purpose.
The present invention also discloses:
A9. The method of A8, wherein the step of convolving the tracking template with the neighborhood of each pixel in the search region comprises: convolving the tracking template with the neighborhood of each pixel in the search region of each of the scaled image frames, to obtain response values under the different scaling factors.
A10. The method of A9, wherein the step of determining the target position of the image frame from the target center and the size of the target box of the previous image frame further comprises: scaling the target box of the previous image frame by the scaling factor of the scaled image frame that contains the pixel with the maximum response value, and taking the result as the target box size of the image frame; and determining the target position of the image frame from the computed target center and the target box size of the image frame.
A11. The method of any one of A4-10, wherein the step of using the detector to output the target position of the corresponding image frame comprises: generating multiple candidate samples of the target in the image frame according to the sample boxes in the sample queue; and filtering the multiple candidate samples through a three-stage cascade classifier and outputting the target position of the image frame.
A12. The method of any one of A1-11, further comprising a step of updating the tracker: after the target box of each image frame is obtained, computing the tracking template of the image frame from the content of that frame; and weighting the tracking template of the image frame with the tracking template of the previous image frame to obtain an updated tracking template.
A13. The method of A12, wherein the weighting coefficients of the image frame and the previous image frame are 0.015 and 0.985, respectively.
A14. The method of any one of A1-13, further comprising a step of updating the detector: computing the IoU index of the multiple candidate samples generated by the detector; and screening the sample queue according to the IoU index.
In addition, those skilled in the art will appreciate that, although some embodiments described herein include certain features that are included in other embodiments and not other features, combinations of features of different embodiments are meant to be within the scope of the invention and to form different embodiments. For example, in the following claims, any of the claimed embodiments may be used in any combination.
In addition, some of the embodiments are described herein as a method, or a combination of method elements, that can be implemented by a processor of a computer system or by other means of carrying out the described function. A processor with the necessary instructions for carrying out such a method or method element therefore forms a means for carrying out the method or method element. Furthermore, an element of an apparatus embodiment described herein is an example of a means for carrying out the function performed by that element for the purpose of carrying out the invention.
As used herein, unless otherwise specified, the use of the ordinal adjectives "first", "second", "third" and so on to describe a common object merely indicates that different instances of like objects are being referred to, and is not intended to imply that the objects so described must be in a given sequence, whether temporally, spatially, in ranking, or in any other manner.
Although the invention has been described with respect to a limited number of embodiments, those skilled in the art, having the benefit of the above description, will appreciate that other embodiments can be envisaged within the scope of the invention thus described. It should also be noted that the language used in this specification has been selected principally for readability and instructional purposes, and not to delineate or circumscribe the subject matter of the invention. Accordingly, many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the appended claims. With respect to the scope of the invention, the disclosure made herein is illustrative and not restrictive, and the scope of the invention is defined by the appended claims.