CN114332488B - Target tracking method and system integrating salient information and multi-granularity context features - Google Patents


Info

Publication number
CN114332488B
Authority
CN
China
Prior art keywords
features
search
template
feature
branch
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202111671961.1A
Other languages
Chinese (zh)
Other versions
CN114332488A
Inventor
鲍华
束平
章洪潮
李亲
邹文杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Anhui University
Original Assignee
Anhui University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Anhui University
Priority to CN202111671961.1A
Publication of CN114332488A
Application granted
Publication of CN114332488B
Status: Active
Anticipated expiration

Abstract

The invention discloses a target tracking method and system fusing salient information and multi-granularity context features. The system comprises a twin sub-neural network, a multi-branch fusion module, a global context module, an attention map module, a depth cross-correlation module and a target position determination module. The system extracts a plurality of features of a template picture as template branch features and a plurality of features of a search picture as search branch features, obtains template features from the template branch features and search features from the search branch features, obtains attention maps of the search features and attention maps of the template features according to the search features and the template features, performs depth cross-correlation on the attention maps of the template features and the attention maps of the search features to obtain a score map, and performs classification and regression on the score map to determine the position of the target in the search picture. This avoids losing the target when occlusion, deformation, rotation and similar conditions occur during tracking.

Description

Target tracking method and system fusing salient information and multi-granularity context features
Technical Field
The invention relates to the technical field of computer vision, in particular to a target tracking method and system fusing salient information and multi-granularity context features.
Background
Target tracking is one of the fundamental problems in the field of computer vision. A target tracking algorithm extracts template features of the target from an initial video frame containing it, and then uses those template features to continuously and stably track the position of the target in subsequent video frames.
Target tracking algorithms based on the twin (Siamese) neural network are trained offline on large datasets to obtain high-precision network parameters and matching functions, and offer advantages such as high accuracy and high efficiency. However, such algorithms suffer from insufficient template feature extraction and a lack of connection between video frames during tracking. When occlusion, deformation, rotation or similar conditions occur while tracking the target, the target may be lost.
Disclosure of Invention
The embodiment of the invention aims to provide a target tracking method and system fusing salient information and multi-granularity context features, so as to avoid losing the target when occlusion, deformation, rotation and similar conditions occur during tracking.
The specific technical scheme is as follows:
In a first aspect of the present invention, there is provided a target tracking system that fuses salient information and multi-granularity context features, including a twin sub-neural network, a multi-branch fusion module, a global context module, an attention map module, a depth cross-correlation module, and a target position determination module, wherein:
The twin sub-neural network is used for acquiring a template picture and a search picture, extracting a plurality of features of the template picture as template branch features and extracting a plurality of features of the search picture as search branch features, wherein the template picture comprises appearance information of a target to be tracked;
the multi-branch fusion module is used for obtaining template characteristics of the template picture according to the template branch characteristics;
the global context module is used for obtaining the searching characteristics of the searching pictures according to the searching branch characteristics;
the attention map module is used for obtaining an attention map of the search features and an attention map of the template features according to the search features and the template features;
the depth cross-correlation module is used for carrying out depth cross-correlation on the attention map of the template features and the attention map of the search features to obtain a score map;
And the target position determining module is used for classifying and regressing the score map and determining the position of the target in the search picture.
In a second aspect of the present invention, there is provided a target tracking method incorporating salient information and multi-granularity contextual features, the method being applied to a twin neural network, the method comprising:
the method comprises the steps of obtaining a template picture and a search picture, extracting a plurality of features of the template picture as template branch features, and extracting a plurality of features of the search picture as search branch features, wherein the template picture contains appearance information of a target to be tracked;
obtaining template characteristics of the template picture according to the template branch characteristics;
obtaining search features of the search pictures according to the search branch features;
obtaining attention maps of the search features and attention maps of the template features according to the search features and the template features;
performing depth cross-correlation on the attention maps of the template features and the attention maps of the search features to obtain a score map;
and classifying and regressing the score map, and determining the position of the target in the search picture.
Optionally, the twin neural network includes a twin sub neural network, the obtaining a template picture and a search picture, extracting a plurality of features of the template picture as template branch features, and extracting a plurality of features of the search picture as search branch features includes:
obtaining a template picture and a search picture through the twin-sub neural network, wherein the size of the search picture is larger than that of the template picture;
Inputting the template picture into the ResNet network of the twin sub-neural network to obtain the convolution features of its five stages, denoted ft1, ft2, ft3, ft4 and ft5 respectively, and taking these as the template branch features;
Inputting the search picture into the same ResNet network of the twin sub-neural network to obtain the convolution features of its five stages, denoted fs1, fs2, fs3, fs4 and fs5 respectively, and taking these as the search branch features.
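The shared-parameter, multi-level extraction described above can be sketched minimally as follows. This is an illustrative numpy sketch, not the patent's actual ResNet: a toy strided "stage" (subsample plus 1×1 projection) stands in for each convolution stage, and the same stage objects are applied to both pictures to illustrate the weight sharing of the two branches.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_stage(cin, cout, stride):
    # Toy "stage": strided subsampling followed by a 1x1 channel projection.
    w = rng.standard_normal((cout, cin)) * 0.1
    def stage(x):  # x: (C, H, W)
        xs = x[:, ::stride, ::stride]                 # spatial downsampling
        return np.tensordot(w, xs, axes=([1], [0]))   # 1x1 conv = channel matmul
    return stage

# One set of stages shared by both branches (twin / Siamese weight sharing).
stages = [make_stage(3, 64, 2), make_stage(64, 256, 2), make_stage(256, 512, 2)]

def extract(img):
    feats, x = [], img
    for s in stages:
        x = s(x)
        feats.append(x)   # tap every stage output as a branch feature
    return feats

template = rng.standard_normal((3, 127, 127))  # template picture
search = rng.standard_normal((3, 255, 255))    # search picture
ft = extract(template)  # template branch features (multiple levels)
fs = extract(search)    # search branch features (multiple levels)
```

Because the stages are shared, the two branches embed both pictures into the same feature space, which is what makes the later cross-correlation matching meaningful.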
Optionally, the twin neural network includes a multi-branch fusion module, the obtaining the template feature of the template picture according to the template branch feature includes:
Carrying out channel compression on the ft3, ft4 and ft5 features of the template branch features to obtain features fn3, fn4 and fn5;
Passing the ft2 feature of the template branch features through the multi-branch fusion module to obtain a feature fn2 containing different receptive fields;
Adding fn3, fn4 and fn5 to fn2 respectively and performing the center-cropping operation to obtain the template features Ft3, Ft4 and Ft5 of the template picture.
Optionally, the twin neural network includes a global context module, and the obtaining the search feature of the search picture according to the search branch feature includes:
Carrying out channel compression on the fs3, fs4 and fs5 features of the search branch features to obtain features fm3, fm4 and fm5;
Passing the fm3, fm4 and fm5 features through the global context module to obtain the search features Fs3, Fs4 and Fs5.
Optionally, the twin neural network includes an attention map module, and the obtaining the attention map of the search features and the attention map of the template features according to the search features and the template features includes:
inputting the template features Ft3, Ft4 and Ft5 and the search features Fs3, Fs4 and Fs5 into the self-attention module and the cross-attention module of the attention map module respectively, to obtain the attention maps of the template features and the attention maps of the search features.
Optionally, the twin neural network includes a depth cross-correlation module, and the performing depth cross-correlation on the attention maps of the template features and the attention maps of the search features to obtain a score map includes:
through the depth cross-correlation module, performing depth cross-correlation operations on the attention maps of the template features and the attention maps of the search features respectively to obtain the score maps φ3, φ4 and φ5.
Optionally, the twin neural network includes a target position determination module, and the classifying and regressing of the score map to determine the position of the target in the search picture includes:
inputting the score maps φ3, φ4 and φ5 into the classification branch and the regression branch of the target position determination module respectively;
passing the score maps φ3, φ4 and φ5 through a convolution with kernel size 1×1 and stride 1 in the classification branch to obtain features with 2k channels, and multiplying these by preset weight values respectively to obtain the classification features, wherein the classification features comprise the foreground and background features of the target in the search picture;
passing the score maps φ3, φ4 and φ5 through a convolution with kernel size 1×1 and stride 1 in the regression branch to obtain features with 4k channels, and multiplying these by preset weight values respectively to obtain the regression features, wherein the regression features comprise features of the target;
and determining the position of the target in the search picture according to the classification features and the regression features.
In yet another aspect of the embodiment of the present invention, there is also provided an electronic device including a processor, a communication interface, a memory, and a communication bus, where the processor, the communication interface, and the memory complete communication with each other through the communication bus;
A memory for storing a computer program;
and the processor is used for implementing any one of the above target tracking methods fusing salient information and multi-granularity context features when executing the program stored in the memory.
In yet another aspect of the present invention, there is also provided a computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements any of the above-described target tracking methods that incorporate salient information and multi-granularity context features.
In yet another aspect of the invention, there is also provided a computer program product comprising instructions which, when run on a computer, cause the computer to perform any of the above target tracking methods fusing salient information and multi-granularity context features.
The embodiment of the invention provides a target tracking system fusing salient information and multi-granularity context features, which comprises a twin sub-neural network, a multi-branch fusion module, a global context module, an attention map module, a depth cross-correlation module and a target position determination module. The system acquires a template picture and a search picture through the twin sub-neural network, extracts a plurality of features of the template picture as template branch features and a plurality of features of the search picture as search branch features, obtains the template features of the template picture from the template branch features through the multi-branch fusion module, obtains the search features of the search picture from the search branch features through the global context module, obtains the attention maps of the search features and of the template features through the attention map module, performs depth cross-correlation on the attention maps of the template features and of the search features through the depth cross-correlation module to obtain a score map, and performs classification and regression on the score map through the target position determination module to determine the position of the target in the search picture. The system enhances the accuracy of template feature extraction through the multi-branch fusion module and enriches the relation between the search features and the template features through the attention map module, which avoids losing the target when occlusion, deformation, rotation and similar conditions occur during tracking.
Drawings
The following detailed description of specific embodiments of the invention refers to the accompanying drawings, in which:
FIG. 1 is a flow chart of a method for target tracking incorporating salient information and multi-granularity contextual features provided by an embodiment of the present invention;
FIG. 2 is a schematic flow chart of a target tracking method integrating salient information and multi-granularity context features according to an embodiment of the present invention;
FIG. 3 is a block diagram of a multi-branch fusion module provided by an embodiment of the present invention;
FIG. 4 is a block diagram of a global context module provided by an embodiment of the present invention;
FIG. 5 is a block diagram of an attention diagram module provided by an embodiment of the present invention;
FIG. 6 is a graph of accuracy and success rate tests of the target tracking method provided by the embodiment of the invention;
FIG. 7 is an EAO (Expected Average Overlap) value test chart for the target tracking method according to an embodiment of the present invention;
FIG. 8 is an EAO value test chart of the target tracking method according to the embodiment of the invention for various situations;
FIG. 9 is a diagram of representative visual results of a target tracking method according to an embodiment of the present invention;
fig. 10 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
In the prior art, target tracking algorithms based on the twin neural network suffer from insufficient template feature extraction and a lack of connection between video frames during tracking. When occlusion, deformation, rotation or similar conditions occur while tracking the target, the target may be lost.
In order to solve the above problems, the embodiment of the invention provides a target tracking system fusing salient information and multi-granularity context features. The target tracking system provided by the embodiment of the invention may comprise:
The twin sub-neural network is used for acquiring a template picture and a search picture, extracting a plurality of features of the template picture as template branch features and extracting a plurality of features of the search picture as search branch features;
the multi-branch fusion module is used for obtaining template characteristics of the template picture according to the template branch characteristics;
The global context module is used for obtaining the searching characteristics of the searching pictures according to the searching branch characteristics;
the attention map module is used for obtaining an attention map of the search features and an attention map of the template features according to the search features and the template features;
the depth cross-correlation module is used for carrying out depth cross-correlation on the attention map of the template features and the attention map of the search features to obtain a score map;
And the target position determining module is used for classifying and regressing the obtained score map and determining the position of the target in the search picture.
The target tracking system provided by the embodiment of the invention enhances the accuracy of template feature extraction through the multi-branch fusion module and enriches the relation between the search features and the template features through the attention map module, which avoids losing the target when occlusion, deformation, rotation and similar conditions occur during tracking.
Referring to fig. 1, fig. 1 is a flowchart of a target tracking method integrating salient information and multi-granularity context features, which is applied to a twin neural network and provided in an embodiment of the present invention, the method may include the following steps:
s101, acquiring a template picture and a search picture, extracting a plurality of features of the template picture as template branch features, and extracting a plurality of features of the search picture as search branch features.
S102, obtaining template characteristics of the template picture according to the template branch characteristics.
And S103, obtaining the search feature of the search picture according to the search branch feature.
S104, obtaining attention maps of the search features and attention maps of the template features according to the search features and the template features.
S105, performing depth cross-correlation on the attention maps of the template features and the attention maps of the search features to obtain a score map.
S106, classifying and regressing the score map, and determining the position of the target in the search picture.
The template picture comprises appearance information of a target to be tracked, and the search picture is a picture comprising the target.
According to the target tracking method fusing salient information and multi-granularity context features provided by the embodiment of the invention, the accuracy of template feature extraction is enhanced through the multi-branch fusion module, and the relation between the search features and the template features is enriched through the attention map module. This avoids losing the target when occlusion, deformation, rotation and similar conditions occur during tracking.
Referring to fig. 2, fig. 2 is a flow chart of a target tracking method integrating salient information and multi-granularity context features according to an embodiment of the present invention.
In one implementation, the size of the input template picture (template image) may be 127×127×3 (width and height 127 pixels, 3 channels), and the size of the input search picture (search image) may be 255×255×3 (width and height 255 pixels, 3 channels). The template picture and the search picture are input into the template branch and the search branch respectively; the two branches are ResNet networks sharing parameters. The five-stage convolution features produced by the ResNet of the template branch and the search branch are ft1, ft2, ft3, ft4, ft5 and fs1, fs2, fs3, fs4, fs5 respectively. The sizes of ft1, ft2, ft3, ft4 and ft5 are 61×61×64, 31×31×256, 15×15×512, 15×15×1024 and 15×15×2048; the sizes of fs1, fs2, fs3, fs4 and fs5 are 125×125×64, 61×61×256, 31×31×512, 31×31×1024 and 31×31×2048 respectively.
In one embodiment, the twin neural network comprises a twin sub-neural network, and step S101 includes:
Step one, obtaining a template picture and a search picture through a twin sub-neural network.
Step two, inputting the template picture into the ResNet network of the twin sub-neural network to obtain the convolution features of its five stages, denoted ft1, ft2, ft3, ft4 and ft5 respectively, as the template branch features.
Step three, inputting the search picture into the same ResNet network of the twin sub-neural network to obtain the convolution features of its five stages, denoted fs1, fs2, fs3, fs4 and fs5 respectively, as the search branch features.
The size of the search picture is larger than the size of the template picture.
In one embodiment, the twin neural network comprises a multi-branch fusion module, step S102 comprising:
Step one, carrying out channel compression on the ft3, ft4 and ft5 features of the template branch features to obtain features fn3, fn4 and fn5.
Step two, passing the ft2 feature of the template branch features through the multi-branch fusion module to obtain a feature fn2 containing different receptive fields.
Step three, adding fn3, fn4 and fn5 to fn2 respectively and performing the center-cropping operation to obtain the template features Ft3, Ft4 and Ft5 of the template picture.
In one implementation, the ft3, ft4 and ft5 features of the template branch features are compressed to 256 channels by 1×1 convolution, yielding the features fn3, fn4 and fn5, each of size 15×15×256. The template features Ft3, Ft4 and Ft5 of the template picture have size 7×7×256.
Referring to fig. 3, fig. 3 is a block diagram of a multi-branch fusion module according to an embodiment of the present invention.
The multi-branch fusion module works as follows. First, the second-stage convolution feature ft2 is input into a two-step convolution kernel (kernel size 3×3, stride 1), and a feature map of size 31×31×128 (hereinafter the first feature map) is output by the two-step convolution operation. Second, the first feature map is fed into two branches: the first branch keeps the input features unchanged, while the other branch is a convolution sub-network with the same two steps. Through this operation, features with different receptive fields are obtained from the second-stage convolution feature, together with deeper features containing more semantic information about the target. Third, the features of the two branches are concatenated. Finally, a downsampling operation is performed to obtain the refined feature map fn2.
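The two-branch structure above can be sketched in numpy. This is a minimal illustration with toy channel counts and spatial sizes: the naive 3×3 convolution helper and the strided 1×1 projection used for downsampling are stand-ins for the module's actual layers, and all weights are random.

```python
import numpy as np

rng = np.random.default_rng(0)

def conv3x3(x, w):
    """Naive 3x3 convolution, stride 1, zero padding 1. x: (Cin,H,W), w: (Cout,Cin,3,3)."""
    _, h, wd = x.shape
    xp = np.pad(x, ((0, 0), (1, 1), (1, 1)))
    out = np.empty((w.shape[0], h, wd))
    for i in range(h):
        for j in range(wd):
            out[:, i, j] = np.tensordot(w, xp[:, i:i+3, j:j+3],
                                        axes=([1, 2, 3], [0, 1, 2]))
    return out

def two_step_conv(x, cin, cout):
    # "Two-step convolution": two consecutive 3x3, stride-1 convolutions.
    w1 = rng.standard_normal((cout, cin, 3, 3)) * 0.05
    w2 = rng.standard_normal((cout, cout, 3, 3)) * 0.05
    return conv3x3(conv3x3(x, w1), w2)

def multi_branch_fusion(ft2):
    cin = ft2.shape[0]
    first = two_step_conv(ft2, cin, cin // 2)      # first feature map (channels halved)
    b1 = first                                     # branch 1: kept unchanged
    b2 = two_step_conv(first, cin // 2, cin // 2)  # branch 2: same two-step conv again
    cat = np.concatenate([b1, b2], axis=0)         # connect the two branches
    # Downsampling: strided 1x1 projection back to cin//2 channels (illustrative).
    wd = rng.standard_normal((cin // 2, cat.shape[0])) * 0.05
    return np.tensordot(wd, cat[:, ::2, ::2], axes=([1], [0]))

ft2 = rng.standard_normal((8, 16, 16))   # toy stand-in for the 31x31x256 feature
fn2 = multi_branch_fusion(ft2)           # refined feature map
```

Branch 1 preserves the receptive field of the first feature map while branch 2 enlarges it with two more 3×3 convolutions, so the concatenation mixes two receptive-field sizes, as the module intends.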
fn2 is added to fn3, fn4 and fn5 respectively, and then the center-cropping operation is performed, as shown in the following formula (1):
Ft3=Crop(fn3+fn2)
Ft4=Crop(fn4+fn2)(1)
Ft5=Crop(fn5+fn2)
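Formula (1) can be sketched directly: an element-wise addition followed by a center crop from the 15×15 template features down to 7×7. The sketch below uses random arrays in place of the real features; sizes match the ones stated in the text.

```python
import numpy as np

def center_crop(x, size):
    """Crop the spatial center of x (C,H,W) down to (C,size,size)."""
    _, h, w = x.shape
    top, left = (h - size) // 2, (w - size) // 2
    return x[:, top:top + size, left:left + size]

rng = np.random.default_rng(0)
fn2 = rng.standard_normal((256, 15, 15))   # refined feature from the fusion module
fn3 = rng.standard_normal((256, 15, 15))
fn4 = rng.standard_normal((256, 15, 15))
fn5 = rng.standard_normal((256, 15, 15))

# Formula (1): Ft_l = Crop(fn_l + fn2) for l = 3, 4, 5
Ft3, Ft4, Ft5 = (center_crop(f + fn2, 7) for f in (fn3, fn4, fn5))
```

Cropping to the center discards border responses of the padded template features, which is why the template features end up at 7×7×256.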
in one embodiment, the twin neural network comprises a global context module, step S103 comprising:
Step one, performing channel compression on the fs3, fs4 and fs5 features of the search branch features to obtain features fm3, fm4 and fm5.
Step two, passing the fm3, fm4 and fm5 features through the global context module to obtain the search features Fs3, Fs4 and Fs5.
In one implementation, the fs3, fs4 and fs5 features of the search branch features are compressed to 256 channels by 1×1 convolution, yielding the features fm3, fm4 and fm5, each of size 31×31×256. The search features Fs3, Fs4 and Fs5 then also have size 31×31×256.
Referring to fig. 4, fig. 4 is a block diagram of a global context module provided by an embodiment of the present invention.
The global context module includes three parts: a context modeling sub-module, a transform sub-module and a fusion sub-module. Assume that x and z represent the input and output of the global context module respectively, and that Np represents the number of elements (spatial positions) in the feature map.
The global context operation can be represented by formula (2), where W1, W2 and W3 represent the weight coefficients of the three 1×1 convolutions in fig. 4, LN() represents the layer normalization function (Layer Normalization) and ReLU() represents the piecewise linear function used for single-sided suppression:

z = x + W3·ReLU(LN(W2·Σj [exp(W1·xj) / Σm exp(W1·xm)]·xj))    (2)

where the sums over j and m run over the Np positions of the feature map.
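The three sub-modules of formula (2) can be sketched in numpy: the context modeling part pools the feature map into a single global vector via attention over positions, the transform part applies two 1×1 convolutions with layer normalization and ReLU between them, and the fusion part adds the result back to every position. Weight names and sizes below are illustrative, and the 1×1 convolutions are written as matrix products.

```python
import numpy as np

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

def layer_norm(v, eps=1e-5):
    return (v - v.mean()) / np.sqrt(v.var() + eps)

def global_context(x, w1, w2, w3):
    """Global context block in the spirit of formula (2).
    x: (C,H,W); w1: (C,) key conv; w2: (Cm,C) and w3: (C,Cm) transform convs."""
    c, h, w = x.shape
    flat = x.reshape(c, h * w)               # Np = H*W spatial positions
    attn = softmax(w1 @ flat)                # context modeling: attention over positions
    context = flat @ attn                    # (C,) global context vector
    t = w3 @ np.maximum(layer_norm(w2 @ context), 0.0)  # transform: LN, ReLU, 1x1 conv
    return x + t[:, None, None]              # fusion: broadcast residual addition

rng = np.random.default_rng(0)
C = 256
x = rng.standard_normal((C, 31, 31))         # a compressed search feature fm
w1 = rng.standard_normal(C) * 0.1
w2 = rng.standard_normal((64, C)) * 0.1
w3 = rng.standard_normal((C, 64)) * 0.1
z = global_context(x, w1, w2, w3)
```

The residual form means the block can only add context on top of the input; zeroing the final transform weights recovers the input exactly.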
In one embodiment, the twin neural network includes an attention map module, and step S104 is specifically:
inputting the template features Ft3, Ft4 and Ft5 and the search features Fs3, Fs4 and Fs5 into the self-attention module and the cross-attention module of the attention map module respectively, to obtain the attention maps of the template features and the attention maps of the search features.
Referring to fig. 5, fig. 5 is a block diagram of the attention map module according to an embodiment of the present invention.
As shown in fig. 5, the attention map module includes a self-attention module and a cross-attention module. To learn finer semantic features from both space and channels, self-attention and cross-attention sub-networks are proposed. From top to bottom in fig. 5 there are four dashed boxes: the contents of the first and fourth dashed boxes represent self-attention, and the contents of the second and third dashed boxes represent cross-attention. In detail, the template feature is denoted Z and the search feature X, where Z and X have sizes C×h×w and C×H×W respectively.
The self-attention module consists of spatial attention and channel attention.
For spatial attention, first the search feature X is divided by spatial location to obtain the set {Xi,j}, where Xi,j ∈ R^(C×1×1) corresponds to the parameters at spatial position (i, j). Second, the channels are compressed using a 1×1 convolution, Q = Wsq·X, where Wsq ∈ R^(1×C×1×1) is the parameter of the convolution kernel, yielding Q ∈ R^(H×W); the value of Q at each spatial position can be expressed as qi,j = Wsq·Xi,j. Then a feature with spatial information is generated as X̃ = σ(Q) ⊗ X. Finally, X̃ is weighted by a learnable parameter α and added to the original feature X to obtain the final feature Xsa, where σ() is the sigmoid activation function, as shown in the following formula (3):

Xsa = α·(σ(Q) ⊗ X) + X    (3)
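Formula (3) can be sketched in a few lines of numpy: a 1×1 convolution squeezes the channels into a single H×W map, a sigmoid turns it into per-position gates, and a learnable scalar α blends the gated feature back into the input. The weights and the fixed α below are illustrative.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def spatial_attention(x, w_sq, alpha):
    """Formula (3): Xsa = alpha * (sigmoid(Q) ⊗ X) + X, with Q a 1x1 conv C -> 1."""
    q = np.tensordot(w_sq, x, axes=([0], [0]))   # (H, W) spatial map Q
    x_tilde = sigmoid(q)[None, :, :] * x         # gate broadcast over channels
    return alpha * x_tilde + x                   # residual blend with learnable alpha

rng = np.random.default_rng(0)
x = rng.standard_normal((256, 31, 31))           # search feature X
w_sq = rng.standard_normal(256) * 0.05           # 1x1 channel-squeeze kernel Wsq
alpha = 0.5                                      # learnable parameter, fixed here
x_sa = spatial_attention(x, w_sq, alpha)
```

Setting α to zero recovers the input, so training can start near the identity mapping and learn how much spatial reweighting helps.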
For channel attention, first the input feature X is divided by channel into the set {Xk}, Xk ∈ R^(H×W). Second, a global average pooling operation over space produces a vector V ∈ R^(C×1×1), where the value of the k-th channel can be obtained by the following formula (4):

vk = (1 / (H×W)) Σ_{i=1..H} Σ_{j=1..W} Xk(i, j)    (4)
again, V is compressed and expanded using two convolution operations to obtain
Wherein, theAndParameters corresponding to two convolution kernels respectively, and then obtaining a characteristic diagram with channel aggregation characteristicsWherein σ () is a sigmoid activation function, and finallyA learnable parameter beta is given and added with the original characteristic X to obtain a final channel characteristic diagram Xca, as shown in the following formula (5):
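Formulas (4) and (5) together form a squeeze-and-excitation style channel gate, which can be sketched as follows. The compress/expand kernel names (`w_down`, `w_up`) and their sizes are illustrative assumptions, and the 1×1 convolutions on a C×1×1 vector are written as matrix products.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def channel_attention(x, w_down, w_up, beta):
    """Formulas (4)-(5): channel attention with a learnable residual weight beta."""
    v = x.mean(axis=(1, 2))                       # formula (4): global average pool -> (C,)
    v_tilde = w_up @ np.maximum(w_down @ v, 0.0)  # compress then expand with two 1x1 convs
    x_hat = sigmoid(v_tilde)[:, None, None] * x   # channel-wise gating of X
    return beta * x_hat + x                       # formula (5)

rng = np.random.default_rng(0)
C = 256
x = rng.standard_normal((C, 31, 31))
w_down = rng.standard_normal((C // 4, C)) * 0.05  # compression kernel (assumed ratio 4)
w_up = rng.standard_normal((C, C // 4)) * 0.05    # expansion kernel
x_ca = channel_attention(x, w_down, w_up, beta=0.5)
```

As with the spatial branch, β = 0 leaves the feature untouched, so the channel gate is a learned perturbation of the identity.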
Cross attention module
For the search branch, the template feature Z and the search feature X are input into the cross-attention module, which is located in the second and third dashed boxes in fig. 5. The template feature then undergoes global average pooling and two 1×1 convolution operations to finally obtain a channel feature map z̃ ∈ R^(C×1×1), where C is the number of channels. Second, a sigmoid activation is applied to z̃ and the result is multiplied with the initial feature X to obtain a preliminary feature X̃, as in the following formula (6):

X̃ = σ(z̃) ⊗ X    (6)
Again, the final cross-attention feature map is computed: the cross-attention map X_cro is obtained as the sum of the scaled preliminary feature X̂ and X, as shown in the following formula (7):

X_cro = λ · X̂ + X      (7)

where λ in formula (7) is a learnable parameter.
The attention map of the search feature is obtained by merging the features X_sa, X_ca and X_cro in parallel through an element-wise summation operation. The corresponding attention map of the template feature is obtained in the same way as for the search branch.
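A minimal sketch of the cross-attention path and the parallel fusion, under the same assumptions as the earlier sketches (1×1 convolutions as plain matrices, an assumed ReLU between them): the template branch is pooled into a channel gate that modulates the search feature, and the three attention features X_sa, X_ca, X_cro are merged by element-wise summation.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def cross_attention(x, z, w1, w2, lam):
    """Cross-attention sketch for the search branch. x: search feature
    (C, H, W); z: template feature (C, h, w); w1/w2: the two 1x1
    convolutions applied after global average pooling; lam: learnable
    residual scale from eq. (7)."""
    v = z.mean(axis=(1, 2))                      # GAP over the template feature
    b = w2 @ np.maximum(w1 @ v, 0.0)             # channel map from two 1x1 convs
    x_hat = x * sigmoid(b)[:, None, None]        # preliminary feature, eq. (6)
    return lam * x_hat + x                       # cross-attention map X_cro

def fuse(x_sa, x_ca, x_cro):
    """Merge the three attention features in parallel by element-wise sum."""
    return x_sa + x_ca + x_cro

rng = np.random.default_rng(2)
C = 8
x = rng.standard_normal((C, 6, 6))               # search feature
z = rng.standard_normal((C, 3, 3))               # template feature
x_cro = cross_attention(x, z, rng.standard_normal((4, C)),
                        rng.standard_normal((C, 4)), lam=0.5)
print(x_cro.shape)  # (8, 6, 6)
```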
In one embodiment, the twin neural network includes a deep cross correlation module, and step S105 is specifically:
Through the depth cross-correlation module, the attention maps of the template features and the attention maps of the search features are subjected to depthwise cross-correlation operations respectively to obtain the score maps φ3, φ4, φ5.
In one implementation, the attention map of the template feature and the attention map of the search feature at the same level are subjected to a depthwise cross-correlation operation to obtain φ3, and φ4 and φ5 are obtained in the same way.
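Depthwise cross-correlation treats each channel of the template attention map as a correlation kernel slid over the matching channel of the search attention map. A direct (unoptimized) NumPy sketch, with toy shapes assumed for illustration:

```python
import numpy as np

def depthwise_xcorr(search, template):
    """Depthwise cross-correlation sketch. search: (C, H, W);
    template: (C, h, w). Each template channel is correlated with the
    same-index search channel, giving a (C, H-h+1, W-w+1) score map."""
    C, H, W = search.shape
    _, h, w = template.shape
    out = np.zeros((C, H - h + 1, W - w + 1))
    for c in range(C):
        for i in range(H - h + 1):
            for j in range(W - w + 1):
                out[c, i, j] = np.sum(search[c, i:i + h, j:j + w] * template[c])
    return out

rng = np.random.default_rng(3)
score = depthwise_xcorr(rng.standard_normal((4, 7, 7)),
                        rng.standard_normal((4, 3, 3)))
print(score.shape)  # (4, 5, 5)
```

In practice this is usually implemented as a grouped convolution with the template as the kernel; the triple loop here only makes the per-channel sliding-window arithmetic explicit.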
In one embodiment, the twin neural network includes a target position determination module, step S106 comprising:
Step one: the score maps φ3, φ4, φ5 are respectively input into the classification branch and the regression branch of the target position determination module.

Step two: through the classification branch, the score maps φ3, φ4, φ5 are each passed through a convolution with kernel size 1×1 and stride 1 to obtain features with 2k channels; preset learnable weights are multiplied with these features respectively to obtain the classification features. The classification features contain the foreground and background features of the target in the search picture.

Step three: through the regression branch, the score maps φ3, φ4, φ5 are each passed through a convolution with kernel size 1×1 and stride 1 to obtain features with 4k channels; preset learnable weights are multiplied with these features respectively to obtain the regression features. The regression features contain the features of the target.

Step four: the position of the target in the search picture is determined according to the classification features and the regression features.
In one implementation, the classification features of the three levels may be multiplied by different learnable weighting coefficients respectively and then summed; likewise, the regression features of the three levels may be multiplied by different learnable weighting coefficients respectively and then summed.
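The weighted fusion of the per-level head outputs can be sketched as a 1×1 convolution per score map followed by a learnable-coefficient sum; the classification branch (2k output channels) is shown, and the regression branch works identically with 4k channels. The function and variable names here are illustrative assumptions.

```python
import numpy as np

def fuse_heads(score_maps, head_weights, coeffs):
    """Weighted head-fusion sketch. score_maps: list of (C, H, W) score
    maps; head_weights: list of (2k, C) matrices standing in for the 1x1
    convolutions; coeffs: learnable weighting coefficients, one per map."""
    fused = None
    for phi, w, a in zip(score_maps, head_weights, coeffs):
        head = np.tensordot(w, phi, axes=([1], [0]))   # 1x1 conv -> (2k, H, W)
        fused = a * head if fused is None else fused + a * head
    return fused

rng = np.random.default_rng(4)
k, C = 5, 8
phis = [rng.standard_normal((C, 6, 6)) for _ in range(3)]   # score maps phi3..phi5
ws = [rng.standard_normal((2 * k, C)) for _ in range(3)]
cls = fuse_heads(phis, ws, coeffs=[0.2, 0.3, 0.5])
print(cls.shape)  # (10, 6, 6)
```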
Referring to fig. 6, fig. 6 is a precision and success rate test chart of the target tracking method according to the embodiment of the present invention.
The target tracking method fusing salient information and multi-granularity context features provided by the embodiment of the invention and nine other mainstream target tracking methods were tested on the OTB2015 dataset to obtain the precision test chart (a) and the success rate test chart (b). As can be seen from fig. 6, the target tracking method provided by the embodiment of the invention is optimal in both precision and success rate.
Referring to fig. 7, fig. 7 is an EAO value test chart of the object tracking method according to the embodiment of the present invention.
The EAO value test chart obtained by testing the target tracking method fusing salient information and multi-granularity context features provided by the embodiment of the invention on the VOT2019 dataset is compared with current mainstream target tracking methods, where a larger EAO value indicates a better evaluation result. As can be seen from fig. 7, the target tracking method of the present invention performs best among the compared target tracking methods.
Referring to fig. 8, fig. 8 is an EAO value test chart of the target tracking method according to the embodiment of the present invention for various situations.
The target tracking method fusing salient information and multi-granularity context features provided by the embodiment of the invention was tested against other mainstream target tracking methods on VOT2019 to obtain EAO values under various conditions, where the conditions include camera motion, occlusion, size change, illumination change and motion change. Referring to fig. 8, it can be seen that the target tracking method of the present invention exhibits better performance in the face of camera motion, illumination change and motion change.
Referring to fig. 9, fig. 9 is a diagram of a representative visual result of a target tracking method according to an embodiment of the present invention.
Fig. 9 shows representative visual results of different target tracking methods on the OTB2015 dataset: 10 representative video sequences were selected from the OTB2015 dataset, and the target tracking method of the present invention was compared with other mainstream target tracking methods on these video sequences. The mainstream target tracking methods used for comparison include Ocean, MDNet, DaSiamRPN, ATOM, SiamRPN++ and SiamBAN. The target tracking method of the present invention shows more accurate tracking when faced with motion blur, rapid motion and low-resolution targets. For example, in the three video sequences BlurOwl, Soccer and DragonBaby, some of the compared target tracking methods fail, but the target tracking method of the present invention tracks more robustly while maintaining higher accuracy. Meanwhile, as can be seen from the other video sequences in the figure, the target tracking method also shows better tracking performance under target tracking conditions such as rotation, scale change, deformation and occlusion.
The embodiment of the invention also provides an electronic device, as shown in fig. 10, which comprises a processor 1001, a communication interface 1002, a memory 1003 and a communication bus 1004, wherein the processor 1001, the communication interface 1002 and the memory 1003 complete communication with each other through the communication bus 1004,
A memory 1003 for storing a computer program;
the processor 1001 is configured to implement any of the above-described target tracking methods that combine salient information and multi-granularity context features when executing a program stored on the memory 1003.
The communication bus mentioned above for the electronic device may be a Peripheral Component Interconnect (PCI) bus or an Extended Industry Standard Architecture (EISA) bus, or the like. The communication bus may be divided into an address bus, a data bus, a control bus, and the like. For ease of illustration, only one bold line is shown in the figure, but this does not mean there is only one bus or one type of bus.
The communication interface is used for communication between the electronic device and other devices.
The memory may include Random Access Memory (RAM) or Non-Volatile Memory (NVM), such as at least one disk memory. Optionally, the memory may also be at least one storage device located remotely from the aforementioned processor.
The processor may be a general-purpose processor, including a Central Processing Unit (CPU), a Network Processor (NP), and the like; it may also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or discrete hardware components.
In yet another embodiment of the present invention, there is also provided a computer readable storage medium having stored therein a computer program which, when executed by a processor, implements the steps of any of the above-described target tracking methods incorporating salient information and multi-granularity contextual features.
In yet another embodiment of the present invention, there is also provided a computer program product containing instructions that, when run on a computer, cause the computer to perform any of the above embodiments of target tracking methods that incorporate salient information and multi-granularity contextual features.
In the above embodiments, the implementation may be in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, it may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the flows or functions according to the embodiments of the present invention are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or other programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another, for example, by wire (e.g., coaxial cable, optical fiber, Digital Subscriber Line (DSL)) or wirelessly (e.g., infrared, radio, microwave, etc.). The computer-readable storage medium may be any available medium that can be accessed by a computer, or a data storage device such as a server or data center containing an integration of one or more available media. The available medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., Solid State Disk (SSD)), etc.
It is noted that relational terms such as first and second are used solely to distinguish one entity or action from another, and do not necessarily require or imply any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other like elements in the process, method, article, or apparatus that comprises the element.
In this specification, each embodiment is described in a related manner, and identical and similar parts of each embodiment are all referred to each other, and each embodiment mainly describes differences from other embodiments. In particular, for system, electronic device, and computer-readable storage medium embodiments, since they are substantially similar to method embodiments, the description is relatively simple, with reference to the description of method embodiments in part.
The foregoing description is only of the preferred embodiments of the present invention and is not intended to limit the scope of the present invention. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention are included in the protection scope of the present invention.

Claims (8)

1. A target tracking method fusing salient information and multi-granularity context features, wherein the method is applied to a twin neural network, and the method comprises:
obtaining a template picture and a search picture, extracting multiple features of the template picture as template branch features, and extracting multiple features of the search picture as search branch features; the template picture contains appearance information of a target to be tracked; the search picture is a picture containing the target;
obtaining template features of the template picture according to the template branch features;
obtaining search features of the search picture according to the search branch features;
obtaining an attention map of the search features and an attention map of the template features according to the search features and the template features;
performing deep cross-correlation between the attention map of the template features and the attention map of the search features to obtain a score map;
performing classification and regression operations on the score map to determine the position of the target in the search picture;
wherein the twin neural network comprises a twin sub-neural network, and obtaining the template picture and the search picture, extracting multiple features of the template picture as template branch features, and extracting multiple features of the search picture as search branch features comprises:
obtaining the template picture and the search picture through the twin sub-neural network, the size of the search picture being larger than the size of the template picture;
inputting the template picture into the ResNet50 network of the twin sub-neural network to obtain the vector convolution operation feature, two-dimensional matrix convolution operation feature, three-dimensional matrix convolution operation feature, four-dimensional matrix convolution operation feature and five-dimensional matrix convolution operation feature of the template picture, which are ft1, ft2, ft3, ft4, ft5 respectively, as the template branch features;
inputting the search picture into the ResNet50 network of the twin sub-neural network to obtain the vector convolution operation feature, two-dimensional matrix convolution operation feature, three-dimensional matrix convolution operation feature, four-dimensional matrix convolution operation feature and five-dimensional matrix convolution operation feature of the search picture, which are fs1, fs2, fs3, fs4, fs5 respectively, as the search branch features;
wherein the twin neural network comprises a multi-branch fusion module, and obtaining the template features of the template picture according to the template branch features comprises:
performing channel compression on the features ft3, ft4, ft5 of the template branch features to obtain features fn3, fn4, fn5;
passing the feature ft2 of the template branch features through the multi-branch fusion module to obtain a feature fn2 containing different receptive fields;
adding fn3, fn4, fn5 respectively to fs2 and performing a center cropping operation to obtain the template features Ft3, Ft4, Ft5 of the template picture.

2. The method according to claim 1, wherein the twin neural network comprises a global context module, and obtaining the search features of the search picture according to the search branch features comprises:
performing channel compression on the features fs3, fs4, fs5 of the search branch features to obtain features fm3, fm4, fm5; passing the features fm3, fm4, fm5 through the global context module to obtain the search features Fs3, Fs4, Fs5.

3. The method according to claim 2, wherein the twin neural network comprises an attention map module, and obtaining the attention map of the search features and the attention map of the template features according to the search features and the template features comprises:
inputting the template features Ft3, Ft4, Ft5 and the search features Fs3, Fs4, Fs5 respectively into a self-attention module and a cross-attention module of the attention map module to obtain the attention maps of the template features and the attention maps of the search features.

4. The method according to claim 3, wherein the twin neural network comprises a deep cross-correlation module, and performing deep cross-correlation between the attention map of the template features and the attention map of the search features to obtain the score map comprises:
performing, through the deep cross-correlation module, deep cross-correlation operations respectively on the attention maps of the template features and the attention maps of the search features to obtain score maps Φ3, Φ4, Φ5.

5. The method according to claim 4, wherein the twin neural network comprises a target position determination module, and performing classification and regression operations on the score map to determine the position of the target in the search picture comprises:
inputting the score maps Φ3, Φ4, Φ5 respectively into a classification branch and a regression branch of the target position determination module;
passing the score maps Φ3, Φ4, Φ5 through the classification branch, each through a convolution with kernel size 1×1 and stride 1, to obtain features with 2k channels, and multiplying preset learnable weights with these features respectively to obtain classification features, the classification features containing foreground and background features of the target in the search picture;
passing the score maps Φ3, Φ4, Φ5 through the regression branch, each through a convolution with kernel size 1×1 and stride 1, to obtain features with 4k channels, and multiplying preset learnable weights with these features respectively to obtain regression features, the regression features containing features of the target;
determining the position of the target in the search picture according to the classification features and the regression features.

6. An electronic device, comprising a processor, a communication interface, a memory and a communication bus, wherein the processor, the communication interface and the memory communicate with each other through the communication bus; the memory is configured to store a computer program; the processor is configured to implement the method steps of any one of claims 2 to 5 when executing the program stored in the memory.

7. A computer-readable storage medium, wherein a computer program is stored in the computer-readable storage medium, and when the computer program is executed by a processor, the method steps of any one of claims 2 to 5 are implemented.

8. A target tracking system fusing salient information and multi-granularity context features, for implementing the method of claim 1, comprising a twin sub-neural network, a multi-branch fusion module, a global context module, an attention map module, a deep cross-correlation module and a target position determination module, wherein:
the twin sub-neural network is configured to obtain a template picture and a search picture, extract multiple features of the template picture as template branch features, and extract multiple features of the search picture as search branch features; the template picture contains appearance information of a target to be tracked; the search picture is a picture containing the target;
the multi-branch fusion module is configured to obtain template features of the template picture according to the template branch features;
the global context module is configured to obtain search features of the search picture according to the search branch features;
the attention map module is configured to obtain an attention map of the search features and an attention map of the template features according to the search features and the template features;
the deep cross-correlation module is configured to perform deep cross-correlation between the attention map of the template features and the attention map of the search features to obtain a score map;
the target position determination module is configured to perform classification and regression operations on the score map to determine the position of the target in the search picture.
CN202111671961.1A — priority and filing date 2021-12-31 — Target tracking method and system integrating salient information and multi-granularity context features — Active — granted as CN114332488B (en)

Priority Applications (1)

- CN202111671961.1A — priority/filing date 2021-12-31 — Target tracking method and system integrating salient information and multi-granularity context features

Publications (2)

- CN114332488A — published 2022-04-12
- CN114332488B — published 2025-07-11

Family

ID: 81020004

Family Applications (1)

- CN202111671961.1A (Active; granted as CN114332488B) — priority/filing date 2021-12-31 — Target tracking method and system integrating salient information and multi-granularity context features

Country Status (1)

- CN: CN114332488B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party

- CN114926652B * — priority 2022-05-30, published 2025-07-08 — Xiamen University of Technology — Twin tracking method and system based on interaction and aggregation type feature optimization

Citations (2)

- CN111325155A * — priority 2020-02-21, published 2020-06-23 — Chongqing University of Posts and Telecommunications — Video action recognition method based on residual 3D CNN and multimodal feature fusion strategy
- CN112927209A * — priority 2021-03-05, published 2021-06-08 — Chongqing University of Posts and Telecommunications — CNN-based significance detection system and method

Family Cites Families (3)

- KR20210116966A * — priority 2020-03-18, published 2021-09-28 — Samsung Electronics — Method and apparatus for tracking target
- CN113744311A * — priority 2021-09-02, published 2021-12-03 — Beijing Institute of Technology — Twin neural network moving target tracking method based on full-connection attention module
- CN113705588B * — priority 2021-10-28, published 2022-01-25 — Nanchang Institute of Technology — Twin network target tracking method and system based on convolution self-attention module


Also Published As

- CN114332488A — published 2022-04-12

Similar Documents

- CN109493350B: Portrait segmentation method and device
- US12400302B2: Image processing method, image processing apparatus, electronic device and computer-readable storage medium
- CN114764868A: Image processing method, image processing device, electronic equipment and computer readable storage medium
- Kumar et al.: Multiple face detection using hybrid features with SVM classifier
- TW202036461A: System for disparity estimation and method for disparity estimation of system
- CN105069424B: Quick face recognition system and method
- CN109063776B: Image re-recognition network training method and device and image re-recognition method and device
- CN111027455B: Pedestrian feature extraction method and device, electronic equipment and storage medium
- CN110516803A: Implement traditional computer vision algorithms as neural networks
- CN113240012B: Two-dimensional image-based unsupervised multi-view three-dimensional target retrieval method and device
- CN108711144B: Augmented reality method and device
- Fang et al.: Deep3DSaliency: Deep stereoscopic video saliency detection model by 3D convolutional networks
- CN111340077A: Disparity map acquisition method and device based on attention mechanism
- CN109447023A: Method for determining image similarity, video scene switching recognition method and device
- CN116029946A: Image denoising method and system based on heterogeneous residual attention neural network model
- CN113160042A: Image style migration model training method and device and electronic equipment
- CN112069338A: Picture processing method and device, electronic equipment and storage medium
- Mao et al.: Visual arts search on mobile devices
- CN114332488B: Target tracking method and system integrating salient information and multi-granularity context features
- CN113920382A: Cross-domain image classification method and related device based on class-consistent structured learning
- Chamasemani et al.: Video abstraction using density-based clustering algorithm
- CN114445750A: Video object segmentation method, device, storage medium and program product
- CN114764936A: Image key point detection method and related equipment
- CN114566160B: Voice processing method, device, computer equipment, and storage medium
- CN110100263B: Image reconstruction method and device

Legal Events

- PB01: Publication
- SE01: Entry into force of request for substantive examination
- GR01: Patent grant
