CN114332488B - Target tracking method and system integrating salient information and multi-granularity context features - Google Patents


Info

Publication number
CN114332488B
Authority
CN
China
Prior art keywords
features
search
template
feature
branch
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202111671961.1A
Other languages
Chinese (zh)
Other versions
CN114332488A
Inventor
鲍华
束平
章洪潮
李亲
邹文杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Anhui University
Original Assignee
Anhui University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Anhui University
Priority to CN202111671961.1A
Publication of CN114332488A
Application granted
Publication of CN114332488B
Status: Active
Anticipated expiration

Abstract

The invention discloses a target tracking method and system fusing salient information and multi-granularity context features. The system comprises a twin sub-neural network, a multi-branch fusion module, a global context module, an attention map module, a depth cross-correlation module and a target position determination module. The system extracts a plurality of features of a template picture as template branch features and a plurality of features of a search picture as search branch features, obtains template features from the template branch features and search features from the search branch features, obtains attention maps of the search features and attention maps of the template features according to the search features and the template features, performs depth cross-correlation on the attention maps of the template features and the attention maps of the search features to obtain a score map, and performs classification and regression on the score map to determine the position of the target in the search picture. This avoids losing the target when occlusion, deformation, rotation and similar conditions occur during tracking.

Description

Target tracking method and system fusing salient information and multi-granularity context features
Technical Field
The invention relates to the technical field of computer vision, in particular to a target tracking method and system fusing salient information and multi-granularity context features.
Background
Target tracking is one of the fundamental problems in the field of computer vision. A target tracking algorithm extracts template features of the target from an initial video frame containing it, and then uses those template features to continuously and stably track the position of the target in subsequent video frames.
Target tracking algorithms based on the twin (Siamese) neural network are trained offline on large datasets to obtain high-precision network parameters and matching functions, and offer advantages such as high accuracy and high efficiency. However, such algorithms suffer from insufficient template feature extraction and a lack of connection between video frames during tracking. When occlusion, deformation, rotation or similar conditions occur while tracking the target, the target may be lost.
Disclosure of Invention
The embodiment of the invention aims to provide a target tracking method and system fusing salient information and multi-granularity context features, so as to avoid losing the target when occlusion, deformation, rotation and similar conditions occur during tracking.
The specific technical scheme is as follows:
In a first aspect of the present invention, there is provided a target tracking system that fuses salient information and multi-granularity context features, including a twin sub-neural network, a multi-branch fusion module, a global context module, an attention map module, a depth cross-correlation module, and a target position determination module, wherein:
The twin sub-neural network is used for acquiring a template picture and a search picture, extracting a plurality of features of the template picture as template branch features and extracting a plurality of features of the search picture as search branch features, wherein the template picture comprises appearance information of a target to be tracked;
the multi-branch fusion module is used for obtaining template characteristics of the template picture according to the template branch characteristics;
the global context module is used for obtaining the searching characteristics of the searching pictures according to the searching branch characteristics;
the attention map module is used for obtaining an attention map of the search features and an attention map of the template features according to the search features and the template features;
the depth cross-correlation module is used for carrying out depth cross-correlation on the attention map of the template features and the attention map of the search features to obtain a score map;
And the target position determining module is used for classifying and regressing the score map and determining the position of the target in the search picture.
In a second aspect of the present invention, there is provided a target tracking method incorporating salient information and multi-granularity contextual features, the method being applied to a twin neural network, the method comprising:
the method comprises the steps of obtaining a template picture and a search picture, extracting a plurality of features of the template picture as template branch features, and extracting a plurality of features of the search picture as search branch features, wherein the template picture contains appearance information of a target to be tracked;
obtaining template characteristics of the template picture according to the template branch characteristics;
obtaining search features of the search pictures according to the search branch features;
obtaining attention maps of the search features and attention maps of the template features according to the search features and the template features;
performing depth cross-correlation on the attention maps of the template features and the attention maps of the search features to obtain a score map;
and classifying and regressing the score map, and determining the position of the target in the search picture.
Optionally, the twin neural network includes a twin sub neural network, the obtaining a template picture and a search picture, extracting a plurality of features of the template picture as template branch features, and extracting a plurality of features of the search picture as search branch features includes:
obtaining a template picture and a search picture through the twin-sub neural network, wherein the size of the search picture is larger than that of the template picture;
Inputting the template picture into the ResNet network of the twin sub-neural network to obtain the convolution features of its five stages, denoted ft1, ft2, ft3, ft4 and ft5 respectively, and taking these as the template branch features;
Inputting the search picture into the same ResNet network of the twin sub-neural network to obtain the convolution features of its five stages, denoted fs1, fs2, fs3, fs4 and fs5 respectively, and taking these as the search branch features.
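The shared-parameter, multi-level extraction described above can be sketched minimally as follows. This is an illustrative numpy sketch, not the patent's actual ResNet: a toy strided "stage" (subsample plus 1×1 projection) stands in for each convolution stage, and the same stage objects are applied to both pictures to illustrate the weight sharing of the two branches.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_stage(cin, cout, stride):
    # Toy "stage": strided subsampling followed by a 1x1 channel projection.
    w = rng.standard_normal((cout, cin)) * 0.1
    def stage(x):  # x: (C, H, W)
        xs = x[:, ::stride, ::stride]                 # spatial downsampling
        return np.tensordot(w, xs, axes=([1], [0]))   # 1x1 conv = channel matmul
    return stage

# One set of stages shared by both branches (twin / Siamese weight sharing).
stages = [make_stage(3, 64, 2), make_stage(64, 256, 2), make_stage(256, 512, 2)]

def extract(img):
    feats, x = [], img
    for s in stages:
        x = s(x)
        feats.append(x)   # tap every stage output as a branch feature
    return feats

template = rng.standard_normal((3, 127, 127))  # template picture
search = rng.standard_normal((3, 255, 255))    # search picture
ft = extract(template)  # template branch features (multiple levels)
fs = extract(search)    # search branch features (multiple levels)
```

Because the stages are shared, the two branches embed both pictures into the same feature space, which is what makes the later cross-correlation matching meaningful.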
Optionally, the twin neural network includes a multi-branch fusion module, the obtaining the template feature of the template picture according to the template branch feature includes:
Carrying out channel compression on the ft3, ft4 and ft5 features of the template branch features to obtain features fn3, fn4 and fn5;
Passing the ft2 feature of the template branch features through the multi-branch fusion module to obtain a feature fn2 containing different receptive fields;
Adding fn3, fn4 and fn5 to fn2 respectively and performing the center-cropping operation to obtain the template features Ft3, Ft4 and Ft5 of the template picture.
Optionally, the twin neural network includes a global context module, and the obtaining the search feature of the search picture according to the search branch feature includes:
Carrying out channel compression on the fs3, fs4 and fs5 features of the search branch features to obtain features fm3, fm4 and fm5;
Passing the fm3, fm4 and fm5 features through the global context module to obtain the search features Fs3, Fs4 and Fs5.
Optionally, the twin neural network includes an attention map module, and the obtaining the attention map of the search features and the attention map of the template features according to the search features and the template features includes:
inputting the template features Ft3, Ft4 and Ft5 and the search features Fs3, Fs4 and Fs5 into the self-attention module and the cross-attention module of the attention map module respectively, to obtain the attention maps of the template features and the attention maps of the search features.
Optionally, the twin neural network includes a depth cross-correlation module, and the performing depth cross-correlation on the attention maps of the template features and the attention maps of the search features to obtain a score map includes:
through the depth cross-correlation module, performing depth cross-correlation operations on the attention maps of the template features and the attention maps of the search features respectively to obtain the score maps φ3, φ4 and φ5.
Optionally, the twin neural network includes a target position determination module, and the classifying and regressing of the score map to determine the position of the target in the search picture includes:
inputting the score maps φ3, φ4 and φ5 into the classification branch and the regression branch of the target position determination module respectively;
passing the score maps φ3, φ4 and φ5 through a convolution with kernel size 1×1 and stride 1 in the classification branch to obtain features with 2k channels, and multiplying these by preset weight values respectively to obtain the classification features, wherein the classification features comprise the foreground and background features of the target in the search picture;
passing the score maps φ3, φ4 and φ5 through a convolution with kernel size 1×1 and stride 1 in the regression branch to obtain features with 4k channels, and multiplying these by preset weight values respectively to obtain the regression features, wherein the regression features comprise features of the target;
and determining the position of the target in the search picture according to the classification features and the regression features.
In yet another aspect of the embodiment of the present invention, there is also provided an electronic device including a processor, a communication interface, a memory, and a communication bus, where the processor, the communication interface, and the memory complete communication with each other through the communication bus;
A memory for storing a computer program;
and the processor is used for implementing any one of the above target tracking methods fusing salient information and multi-granularity context features when executing the program stored in the memory.
In yet another aspect of the present invention, there is also provided a computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements any of the above-described target tracking methods that incorporate salient information and multi-granularity context features.
In yet another aspect of the invention, there is also provided a computer program product comprising instructions which, when run on a computer, cause the computer to perform any of the above target tracking methods fusing salient information and multi-granularity context features.
The embodiment of the invention provides a target tracking system fusing salient information and multi-granularity context features, which comprises a twin sub-neural network, a multi-branch fusion module, a global context module, an attention map module, a depth cross-correlation module and a target position determination module. The system acquires a template picture and a search picture through the twin sub-neural network, extracts a plurality of features of the template picture as template branch features and a plurality of features of the search picture as search branch features, obtains the template features of the template picture from the template branch features through the multi-branch fusion module, obtains the search features of the search picture from the search branch features through the global context module, obtains the attention maps of the search features and of the template features through the attention map module, performs depth cross-correlation on the attention maps of the template features and of the search features through the depth cross-correlation module to obtain a score map, and performs classification and regression on the score map through the target position determination module to determine the position of the target in the search picture. The system enhances the accuracy of template feature extraction through the multi-branch fusion module and enriches the relation between the search features and the template features through the attention map module, which avoids losing the target when occlusion, deformation, rotation and similar conditions occur during tracking.
Drawings
The following detailed description of specific embodiments of the invention refers to the accompanying drawings, in which:
FIG. 1 is a flow chart of a method for target tracking incorporating salient information and multi-granularity contextual features provided by an embodiment of the present invention;
FIG. 2 is a schematic flow chart of a target tracking method integrating salient information and multi-granularity context features according to an embodiment of the present invention;
FIG. 3 is a block diagram of a multi-branch fusion module provided by an embodiment of the present invention;
FIG. 4 is a block diagram of a global context module provided by an embodiment of the present invention;
FIG. 5 is a block diagram of an attention diagram module provided by an embodiment of the present invention;
FIG. 6 is a graph of accuracy and success rate tests of the target tracking method provided by the embodiment of the invention;
FIG. 7 is an EAO (Expected Average Overlap) value test chart for the target tracking method according to an embodiment of the present invention;
FIG. 8 is an EAO value test chart of the target tracking method according to the embodiment of the invention for various situations;
FIG. 9 is a diagram of representative visual results of a target tracking method according to an embodiment of the present invention;
fig. 10 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
In the prior art, target tracking algorithms based on the twin neural network suffer from insufficient template feature extraction and a lack of connection between video frames during tracking. When occlusion, deformation, rotation or similar conditions occur while tracking the target, the target may be lost.
In order to solve the above problems, the embodiment of the invention provides a target tracking system fusing salient information and multi-granularity context features. The target tracking system provided by the embodiment of the invention may comprise:
The twin sub-neural network is used for acquiring a template picture and a search picture, extracting a plurality of features of the template picture as template branch features and extracting a plurality of features of the search picture as search branch features;
the multi-branch fusion module is used for obtaining template characteristics of the template picture according to the template branch characteristics;
The global context module is used for obtaining the searching characteristics of the searching pictures according to the searching branch characteristics;
the attention map module is used for obtaining an attention map of the search features and an attention map of the template features according to the search features and the template features;
the depth cross-correlation module is used for carrying out depth cross-correlation on the attention map of the template features and the attention map of the search features to obtain a score map;
And the target position determining module is used for classifying and regressing the obtained score map and determining the position of the target in the search picture.
The target tracking system provided by the embodiment of the invention enhances the accuracy of template feature extraction through the multi-branch fusion module and enriches the relation between the search features and the template features through the attention map module, which avoids losing the target when occlusion, deformation, rotation and similar conditions occur during tracking.
Referring to fig. 1, fig. 1 is a flowchart of a target tracking method integrating salient information and multi-granularity context features, which is applied to a twin neural network and provided in an embodiment of the present invention, the method may include the following steps:
s101, acquiring a template picture and a search picture, extracting a plurality of features of the template picture as template branch features, and extracting a plurality of features of the search picture as search branch features.
S102, obtaining template characteristics of the template picture according to the template branch characteristics.
And S103, obtaining the search feature of the search picture according to the search branch feature.
S104, obtaining attention maps of the search features and attention maps of the template features according to the search features and the template features.
S105, performing depth cross-correlation on the attention maps of the template features and the attention maps of the search features to obtain a score map.
S106, classifying and regressing the score map, and determining the position of the target in the search picture.
The template picture comprises appearance information of a target to be tracked, and the search picture is a picture comprising the target.
According to the target tracking method fusing salient information and multi-granularity context features provided by the embodiment of the invention, the accuracy of template feature extraction is enhanced through the multi-branch fusion module, and the relation between the search features and the template features is enriched through the attention map module. This avoids losing the target when occlusion, deformation, rotation and similar conditions occur during tracking.
Referring to fig. 2, fig. 2 is a flow chart of a target tracking method integrating salient information and multi-granularity context features according to an embodiment of the present invention.
In one implementation, the size of the input template picture (template image) may be 127×127×3 (width and height 127 pixels, 3 channels), and the size of the input search picture (search image) may be 255×255×3 (width and height 255 pixels, 3 channels). The template picture and the search picture are input into the template branch and the search branch respectively; the two branches are ResNet networks sharing parameters. The five-stage convolution features produced by the ResNet of the template branch and the search branch are ft1, ft2, ft3, ft4, ft5 and fs1, fs2, fs3, fs4, fs5 respectively. The sizes of ft1, ft2, ft3, ft4 and ft5 are 61×61×64, 31×31×256, 15×15×512, 15×15×1024 and 15×15×2048; the sizes of fs1, fs2, fs3, fs4 and fs5 are 125×125×64, 61×61×256, 31×31×512, 31×31×1024 and 31×31×2048 respectively.
In one embodiment, the twin neural network comprises a twin sub-neural network, and step S101 includes:
Step one, obtaining a template picture and a search picture through a twin sub-neural network.
Step two, inputting the template picture into the ResNet network of the twin sub-neural network to obtain the convolution features of its five stages, denoted ft1, ft2, ft3, ft4 and ft5 respectively, as the template branch features.
Step three, inputting the search picture into the same ResNet network of the twin sub-neural network to obtain the convolution features of its five stages, denoted fs1, fs2, fs3, fs4 and fs5 respectively, as the search branch features.
The size of the search picture is larger than the size of the template picture.
In one embodiment, the twin neural network comprises a multi-branch fusion module, step S102 comprising:
Step one, carrying out channel compression on the ft3, ft4 and ft5 features of the template branch features to obtain features fn3, fn4 and fn5.
Step two, passing the ft2 feature of the template branch features through the multi-branch fusion module to obtain a feature fn2 containing different receptive fields.
Step three, adding fn3, fn4 and fn5 to fn2 respectively and performing the center-cropping operation to obtain the template features Ft3, Ft4 and Ft5 of the template picture.
In one implementation, the ft3, ft4 and ft5 features of the template branch features are compressed to 256 channels by 1×1 convolution, yielding the features fn3, fn4 and fn5, each of size 15×15×256. The template features Ft3, Ft4 and Ft5 of the template picture have size 7×7×256.
Referring to fig. 3, fig. 3 is a block diagram of a multi-branch fusion module according to an embodiment of the present invention.
The multi-branch fusion module works as follows. First, the second-stage convolution feature ft2 is input into a two-step convolution kernel (kernel size 3×3, stride 1), and a feature map of size 31×31×128 (hereinafter the first feature map) is output by the two-step convolution operation. Second, the first feature map is fed into two branches: the first branch keeps the input features unchanged, while the other branch is a convolution sub-network with the same two steps. Through this operation, features with different receptive fields are obtained from the second-stage convolution feature, together with deeper features containing more semantic information about the target. Third, the features of the two branches are concatenated. Finally, a downsampling operation is performed to obtain the refined feature map fn2.
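The two-branch structure above can be sketched in numpy. This is a minimal illustration with toy channel counts and spatial sizes: the naive 3×3 convolution helper and the strided 1×1 projection used for downsampling are stand-ins for the module's actual layers, and all weights are random.

```python
import numpy as np

rng = np.random.default_rng(0)

def conv3x3(x, w):
    """Naive 3x3 convolution, stride 1, zero padding 1. x: (Cin,H,W), w: (Cout,Cin,3,3)."""
    _, h, wd = x.shape
    xp = np.pad(x, ((0, 0), (1, 1), (1, 1)))
    out = np.empty((w.shape[0], h, wd))
    for i in range(h):
        for j in range(wd):
            out[:, i, j] = np.tensordot(w, xp[:, i:i+3, j:j+3],
                                        axes=([1, 2, 3], [0, 1, 2]))
    return out

def two_step_conv(x, cin, cout):
    # "Two-step convolution": two consecutive 3x3, stride-1 convolutions.
    w1 = rng.standard_normal((cout, cin, 3, 3)) * 0.05
    w2 = rng.standard_normal((cout, cout, 3, 3)) * 0.05
    return conv3x3(conv3x3(x, w1), w2)

def multi_branch_fusion(ft2):
    cin = ft2.shape[0]
    first = two_step_conv(ft2, cin, cin // 2)      # first feature map (channels halved)
    b1 = first                                     # branch 1: kept unchanged
    b2 = two_step_conv(first, cin // 2, cin // 2)  # branch 2: same two-step conv again
    cat = np.concatenate([b1, b2], axis=0)         # connect the two branches
    # Downsampling: strided 1x1 projection back to cin//2 channels (illustrative).
    wd = rng.standard_normal((cin // 2, cat.shape[0])) * 0.05
    return np.tensordot(wd, cat[:, ::2, ::2], axes=([1], [0]))

ft2 = rng.standard_normal((8, 16, 16))   # toy stand-in for the 31x31x256 feature
fn2 = multi_branch_fusion(ft2)           # refined feature map
```

Branch 1 preserves the receptive field of the first feature map while branch 2 enlarges it with two more 3×3 convolutions, so the concatenation mixes two receptive-field sizes, as the module intends.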
fn2 is added to fn3, fn4 and fn5 respectively, and then the center-cropping operation is performed, as shown in the following formula (1):
Ft3=Crop(fn3+fn2)
Ft4=Crop(fn4+fn2)(1)
Ft5=Crop(fn5+fn2)
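Formula (1) can be sketched directly: an element-wise addition followed by a center crop from the 15×15 template features down to 7×7. The sketch below uses random arrays in place of the real features; sizes match the ones stated in the text.

```python
import numpy as np

def center_crop(x, size):
    """Crop the spatial center of x (C,H,W) down to (C,size,size)."""
    _, h, w = x.shape
    top, left = (h - size) // 2, (w - size) // 2
    return x[:, top:top + size, left:left + size]

rng = np.random.default_rng(0)
fn2 = rng.standard_normal((256, 15, 15))   # refined feature from the fusion module
fn3 = rng.standard_normal((256, 15, 15))
fn4 = rng.standard_normal((256, 15, 15))
fn5 = rng.standard_normal((256, 15, 15))

# Formula (1): Ft_l = Crop(fn_l + fn2) for l = 3, 4, 5
Ft3, Ft4, Ft5 = (center_crop(f + fn2, 7) for f in (fn3, fn4, fn5))
```

Cropping to the center discards border responses of the padded template features, which is why the template features end up at 7×7×256.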
in one embodiment, the twin neural network comprises a global context module, step S103 comprising:
Step one, performing channel compression on the fs3, fs4 and fs5 features of the search branch features to obtain features fm3, fm4 and fm5.
Step two, passing the fm3, fm4 and fm5 features through the global context module to obtain the search features Fs3, Fs4 and Fs5.
In one implementation, the fs3, fs4 and fs5 features of the search branch features are compressed to 256 channels by 1×1 convolution, yielding the features fm3, fm4 and fm5, each of size 31×31×256. The search features Fs3, Fs4 and Fs5 then also have size 31×31×256.
Referring to fig. 4, fig. 4 is a block diagram of a global context module provided by an embodiment of the present invention.
The global context module includes three parts: a context modeling sub-module, a transform sub-module and a fusion sub-module. Assume that x and z represent the input and output of the global context module respectively, and that Np represents the number of elements (spatial positions) in the feature map.
The global context operation can be represented by formula (2), where W1, W2 and W3 represent the weight coefficients of the three 1×1 convolutions in fig. 4, LN() represents the layer normalization function (Layer Normalization) and ReLU() represents the piecewise linear function used for single-sided suppression:

z = x + W3·ReLU(LN(W2·Σj [exp(W1·xj) / Σm exp(W1·xm)]·xj))    (2)

where the sums over j and m run over the Np positions of the feature map.
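The three sub-modules of formula (2) can be sketched in numpy: the context modeling part pools the feature map into a single global vector via attention over positions, the transform part applies two 1×1 convolutions with layer normalization and ReLU between them, and the fusion part adds the result back to every position. Weight names and sizes below are illustrative, and the 1×1 convolutions are written as matrix products.

```python
import numpy as np

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

def layer_norm(v, eps=1e-5):
    return (v - v.mean()) / np.sqrt(v.var() + eps)

def global_context(x, w1, w2, w3):
    """Global context block in the spirit of formula (2).
    x: (C,H,W); w1: (C,) key conv; w2: (Cm,C) and w3: (C,Cm) transform convs."""
    c, h, w = x.shape
    flat = x.reshape(c, h * w)               # Np = H*W spatial positions
    attn = softmax(w1 @ flat)                # context modeling: attention over positions
    context = flat @ attn                    # (C,) global context vector
    t = w3 @ np.maximum(layer_norm(w2 @ context), 0.0)  # transform: LN, ReLU, 1x1 conv
    return x + t[:, None, None]              # fusion: broadcast residual addition

rng = np.random.default_rng(0)
C = 256
x = rng.standard_normal((C, 31, 31))         # a compressed search feature fm
w1 = rng.standard_normal(C) * 0.1
w2 = rng.standard_normal((64, C)) * 0.1
w3 = rng.standard_normal((C, 64)) * 0.1
z = global_context(x, w1, w2, w3)
```

The residual form means the block can only add context on top of the input; zeroing the final transform weights recovers the input exactly.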
In one embodiment, the twin neural network includes an attention map module, and step S104 is specifically:
inputting the template features Ft3, Ft4 and Ft5 and the search features Fs3, Fs4 and Fs5 into the self-attention module and the cross-attention module of the attention map module respectively, to obtain the attention maps of the template features and the attention maps of the search features.
Referring to fig. 5, fig. 5 is a block diagram of the attention map module according to an embodiment of the present invention.
As shown in fig. 5, the attention map module includes a self-attention module and a cross-attention module. To learn finer semantic features from both space and channels, self-attention and cross-attention sub-networks are proposed. From top to bottom in fig. 5 there are four dashed boxes: the contents of the first and fourth dashed boxes represent self-attention, and the contents of the second and third dashed boxes represent cross-attention. In detail, the template feature is denoted Z and the search feature X, where Z and X have sizes C×h×w and C×H×W respectively.
The self-attention module consists of spatial attention and channel attention.
For spatial attention, first the search feature X is divided by spatial location to obtain the set {Xi,j}, where Xi,j ∈ R^(C×1×1) corresponds to the parameters at spatial position (i, j). Second, the channels are compressed using a 1×1 convolution, Q = Wsq·X, where Wsq ∈ R^(1×C×1×1) is the parameter of the convolution kernel, yielding Q ∈ R^(H×W); the value of Q at each spatial position can be expressed as qi,j = Wsq·Xi,j. Then a feature with spatial information is generated as X̃ = σ(Q) ⊗ X. Finally, X̃ is weighted by a learnable parameter α and added to the original feature X to obtain the final feature Xsa, where σ() is the sigmoid activation function, as shown in the following formula (3):

Xsa = α·(σ(Q) ⊗ X) + X    (3)
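Formula (3) can be sketched in a few lines of numpy: a 1×1 convolution squeezes the channels into a single H×W map, a sigmoid turns it into per-position gates, and a learnable scalar α blends the gated feature back into the input. The weights and the fixed α below are illustrative.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def spatial_attention(x, w_sq, alpha):
    """Formula (3): Xsa = alpha * (sigmoid(Q) ⊗ X) + X, with Q a 1x1 conv C -> 1."""
    q = np.tensordot(w_sq, x, axes=([0], [0]))   # (H, W) spatial map Q
    x_tilde = sigmoid(q)[None, :, :] * x         # gate broadcast over channels
    return alpha * x_tilde + x                   # residual blend with learnable alpha

rng = np.random.default_rng(0)
x = rng.standard_normal((256, 31, 31))           # search feature X
w_sq = rng.standard_normal(256) * 0.05           # 1x1 channel-squeeze kernel Wsq
alpha = 0.5                                      # learnable parameter, fixed here
x_sa = spatial_attention(x, w_sq, alpha)
```

Setting α to zero recovers the input, so training can start near the identity mapping and learn how much spatial reweighting helps.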
For channel attention, first the input feature X is divided by channel into the set {Xk}, Xk ∈ R^(H×W). Second, a global average pooling operation over space produces a vector V ∈ R^(C×1×1), where the value of the k-th channel can be obtained by the following formula (4):

vk = (1 / (H×W)) Σ_{i=1..H} Σ_{j=1..W} Xk(i, j)    (4)
again, V is compressed and expanded using two convolution operations to obtain
Wherein, theAndParameters corresponding to two convolution kernels respectively, and then obtaining a characteristic diagram with channel aggregation characteristicsWherein σ () is a sigmoid activation function, and finallyA learnable parameter beta is given and added with the original characteristic X to obtain a final channel characteristic diagram Xca, as shown in the following formula (5):
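Formulas (4) and (5) together form a squeeze-and-excitation style channel gate, which can be sketched as follows. The compress/expand kernel names (`w_down`, `w_up`) and their sizes are illustrative assumptions, and the 1×1 convolutions on a C×1×1 vector are written as matrix products.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def channel_attention(x, w_down, w_up, beta):
    """Formulas (4)-(5): channel attention with a learnable residual weight beta."""
    v = x.mean(axis=(1, 2))                       # formula (4): global average pool -> (C,)
    v_tilde = w_up @ np.maximum(w_down @ v, 0.0)  # compress then expand with two 1x1 convs
    x_hat = sigmoid(v_tilde)[:, None, None] * x   # channel-wise gating of X
    return beta * x_hat + x                       # formula (5)

rng = np.random.default_rng(0)
C = 256
x = rng.standard_normal((C, 31, 31))
w_down = rng.standard_normal((C // 4, C)) * 0.05  # compression kernel (assumed ratio 4)
w_up = rng.standard_normal((C, C // 4)) * 0.05    # expansion kernel
x_ca = channel_attention(x, w_down, w_up, beta=0.5)
```

As with the spatial branch, β = 0 leaves the feature untouched, so the channel gate is a learned perturbation of the identity.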
Cross attention module
For the search branch, the template feature Z and the search feature X are input into the cross-attention module, which is located in the second and third dashed boxes in fig. 5. The template feature then undergoes global average pooling and two 1×1 convolution operations to finally obtain a channel feature map z̃ ∈ R^(C×1×1), where C is the number of channels. Second, a sigmoid activation is applied to z̃ and the result is multiplied with the initial feature X to obtain a preliminary feature X̃, as in the following formula (6):

X̃ = σ(z̃) ⊗ X    (6)
Again, the final cross-attention feature map is computed: the cross-attention map X_cro is obtained as the sum of the scaled preliminary feature X̂ and X, as shown in the following formula (7):

X_cro = λ · X̂ + X      (7)

where λ in formula (7) is a learnable parameter.
The attention map of the search feature is obtained by merging the features X_sa, X_ca and X_cro in parallel through an element-wise summation operation. The corresponding attention map of the template feature is obtained in the same way as for the search branch.
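A minimal sketch of the cross-attention path and the parallel fusion, under the same assumptions as the earlier sketches (1×1 convolutions as plain matrices, an assumed ReLU between them): the template branch is pooled into a channel gate that modulates the search feature, and the three attention features X_sa, X_ca, X_cro are merged by element-wise summation.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def cross_attention(x, z, w1, w2, lam):
    """Cross-attention sketch for the search branch. x: search feature
    (C, H, W); z: template feature (C, h, w); w1/w2: the two 1x1
    convolutions applied after global average pooling; lam: learnable
    residual scale from eq. (7)."""
    v = z.mean(axis=(1, 2))                      # GAP over the template feature
    b = w2 @ np.maximum(w1 @ v, 0.0)             # channel map from two 1x1 convs
    x_hat = x * sigmoid(b)[:, None, None]        # preliminary feature, eq. (6)
    return lam * x_hat + x                       # cross-attention map X_cro

def fuse(x_sa, x_ca, x_cro):
    """Merge the three attention features in parallel by element-wise sum."""
    return x_sa + x_ca + x_cro

rng = np.random.default_rng(2)
C = 8
x = rng.standard_normal((C, 6, 6))               # search feature
z = rng.standard_normal((C, 3, 3))               # template feature
x_cro = cross_attention(x, z, rng.standard_normal((4, C)),
                        rng.standard_normal((C, 4)), lam=0.5)
print(x_cro.shape)  # (8, 6, 6)
```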
In one embodiment, the twin neural network includes a deep cross correlation module, and step S105 is specifically:
Through the depth cross-correlation module, the attention maps of the template features and the attention maps of the search features are subjected to depthwise cross-correlation operations respectively to obtain the score maps φ3, φ4, φ5.
In one implementation, the attention map of the template feature and the attention map of the search feature at the same level are subjected to a depthwise cross-correlation operation to obtain φ3, and φ4 and φ5 are obtained in the same way.
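Depthwise cross-correlation treats each channel of the template attention map as a correlation kernel slid over the matching channel of the search attention map. A direct (unoptimized) NumPy sketch, with toy shapes assumed for illustration:

```python
import numpy as np

def depthwise_xcorr(search, template):
    """Depthwise cross-correlation sketch. search: (C, H, W);
    template: (C, h, w). Each template channel is correlated with the
    same-index search channel, giving a (C, H-h+1, W-w+1) score map."""
    C, H, W = search.shape
    _, h, w = template.shape
    out = np.zeros((C, H - h + 1, W - w + 1))
    for c in range(C):
        for i in range(H - h + 1):
            for j in range(W - w + 1):
                out[c, i, j] = np.sum(search[c, i:i + h, j:j + w] * template[c])
    return out

rng = np.random.default_rng(3)
score = depthwise_xcorr(rng.standard_normal((4, 7, 7)),
                        rng.standard_normal((4, 3, 3)))
print(score.shape)  # (4, 5, 5)
```

In practice this is usually implemented as a grouped convolution with the template as the kernel; the triple loop here only makes the per-channel sliding-window arithmetic explicit.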
In one embodiment, the twin neural network includes a target position determination module, step S106 comprising:
Step one: the score maps φ3, φ4, φ5 are respectively input into the classification branch and the regression branch of the target position determination module.

Step two: through the classification branch, the score maps φ3, φ4, φ5 are each passed through a convolution with kernel size 1×1 and stride 1 to obtain features with 2k channels; preset learnable weights are multiplied with these features respectively to obtain the classification features. The classification features contain the foreground and background features of the target in the search picture.

Step three: through the regression branch, the score maps φ3, φ4, φ5 are each passed through a convolution with kernel size 1×1 and stride 1 to obtain features with 4k channels; preset learnable weights are multiplied with these features respectively to obtain the regression features. The regression features contain the features of the target.

Step four: the position of the target in the search picture is determined according to the classification features and the regression features.
In one implementation, the classification features of the three levels may be multiplied by different learnable weighting coefficients respectively and then summed; likewise, the regression features of the three levels may be multiplied by different learnable weighting coefficients respectively and then summed.
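The weighted fusion of the per-level head outputs can be sketched as a 1×1 convolution per score map followed by a learnable-coefficient sum; the classification branch (2k output channels) is shown, and the regression branch works identically with 4k channels. The function and variable names here are illustrative assumptions.

```python
import numpy as np

def fuse_heads(score_maps, head_weights, coeffs):
    """Weighted head-fusion sketch. score_maps: list of (C, H, W) score
    maps; head_weights: list of (2k, C) matrices standing in for the 1x1
    convolutions; coeffs: learnable weighting coefficients, one per map."""
    fused = None
    for phi, w, a in zip(score_maps, head_weights, coeffs):
        head = np.tensordot(w, phi, axes=([1], [0]))   # 1x1 conv -> (2k, H, W)
        fused = a * head if fused is None else fused + a * head
    return fused

rng = np.random.default_rng(4)
k, C = 5, 8
phis = [rng.standard_normal((C, 6, 6)) for _ in range(3)]   # score maps phi3..phi5
ws = [rng.standard_normal((2 * k, C)) for _ in range(3)]
cls = fuse_heads(phis, ws, coeffs=[0.2, 0.3, 0.5])
print(cls.shape)  # (10, 6, 6)
```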
Referring to fig. 6, fig. 6 is a precision and success rate test chart of the target tracking method according to the embodiment of the present invention.
The target tracking method fusing salient information and multi-granularity context features provided by the embodiment of the invention and nine other mainstream target tracking methods were tested on the OTB2015 dataset to obtain the precision test chart (a) and the success rate test chart (b). As can be seen from fig. 6, the target tracking method provided by the embodiment of the invention is optimal in both precision and success rate.
Referring to fig. 7, fig. 7 is an EAO value test chart of the object tracking method according to the embodiment of the present invention.
The EAO value test chart obtained by testing the target tracking method fusing salient information and multi-granularity context features provided by the embodiment of the invention on the VOT2019 dataset is compared with current mainstream target tracking methods, where a larger EAO value indicates a better evaluation result. As can be seen from fig. 7, the target tracking method of the present invention performs best among the compared target tracking methods.
Referring to fig. 8, fig. 8 is an EAO value test chart of the target tracking method according to the embodiment of the present invention for various situations.
The target tracking method fusing salient information and multi-granularity context features provided by the embodiment of the invention was tested against other mainstream target tracking methods on VOT2019 to obtain EAO values under various conditions, where the conditions include camera motion, occlusion, size change, illumination change and motion change. Referring to fig. 8, it can be seen that the target tracking method of the present invention exhibits better performance in the face of camera motion, illumination change and motion change.
Referring to fig. 9, fig. 9 is a diagram of a representative visual result of a target tracking method according to an embodiment of the present invention.
Fig. 9 shows representative visual results of different target tracking methods on the OTB2015 dataset: 10 representative video sequences were selected from the OTB2015 dataset, and the target tracking method of the present invention was compared with other mainstream target tracking methods on these video sequences. The mainstream target tracking methods used for comparison include Ocean, MDNet, DaSiamRPN, ATOM, SiamRPN++ and SiamBAN. The target tracking method of the present invention shows more accurate tracking when faced with motion blur, rapid motion and low-resolution targets. For example, in the three video sequences BlurOwl, Soccer and DragonBaby, some of the compared target tracking methods fail, but the target tracking method of the present invention tracks more robustly while maintaining higher accuracy. Meanwhile, as can be seen from the other video sequences in the figure, the target tracking method also shows better tracking performance under target tracking conditions such as rotation, scale change, deformation and occlusion.
The embodiment of the invention also provides an electronic device, as shown in fig. 10, which comprises a processor 1001, a communication interface 1002, a memory 1003 and a communication bus 1004, wherein the processor 1001, the communication interface 1002 and the memory 1003 complete communication with each other through the communication bus 1004,
A memory 1003 for storing a computer program;
the processor 1001 is configured to implement any of the above-described target tracking methods that combine salient information and multi-granularity context features when executing a program stored on the memory 1003.
The communication bus mentioned above for the electronic device may be a Peripheral Component Interconnect (PCI) bus or an Extended Industry Standard Architecture (EISA) bus, or the like. The communication bus may be divided into an address bus, a data bus, a control bus, and the like. For ease of illustration, only one bold line is shown in the figure, but this does not mean there is only one bus or one type of bus.
The communication interface is used for communication between the electronic device and other devices.
The memory may include Random Access Memory (RAM) or Non-Volatile Memory (NVM), such as at least one disk memory. Optionally, the memory may also be at least one storage device located remotely from the aforementioned processor.
The processor may be a general-purpose processor, including a Central Processing Unit (CPU), a Network Processor (NP), and the like; it may also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or discrete hardware components.
In yet another embodiment of the present invention, there is also provided a computer readable storage medium having stored therein a computer program which, when executed by a processor, implements the steps of any of the above-described target tracking methods incorporating salient information and multi-granularity contextual features.
In yet another embodiment of the present invention, there is also provided a computer program product containing instructions that, when run on a computer, cause the computer to perform any of the above embodiments of target tracking methods that incorporate salient information and multi-granularity contextual features.
In the above embodiments, the implementation may be in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, it may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the flows or functions according to the embodiments of the present invention are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or other programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another, for example, by wire (e.g., coaxial cable, optical fiber, Digital Subscriber Line (DSL)) or wirelessly (e.g., infrared, radio, microwave, etc.). The computer-readable storage medium may be any available medium that can be accessed by a computer, or a data storage device such as a server or data center containing an integration of one or more available media. The available medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., Solid State Disk (SSD)), etc.
It is noted that relational terms such as first and second are used solely to distinguish one entity or action from another, and do not necessarily require or imply any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other like elements in the process, method, article, or apparatus that comprises the element.
In this specification, each embodiment is described in a related manner, and identical and similar parts of each embodiment are all referred to each other, and each embodiment mainly describes differences from other embodiments. In particular, for system, electronic device, and computer-readable storage medium embodiments, since they are substantially similar to method embodiments, the description is relatively simple, with reference to the description of method embodiments in part.
The foregoing description is only of the preferred embodiments of the present invention and is not intended to limit the scope of the present invention. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention are included in the protection scope of the present invention.

Claims (8)

1. A target tracking method fusing salient information and multi-granularity context features, wherein the method is applied to a twin neural network, and the method comprises:
obtaining a template picture and a search picture, extracting multiple features of the template picture as template branch features, and extracting multiple features of the search picture as search branch features; the template picture contains appearance information of a target to be tracked; the search picture is a picture containing the target;
obtaining template features of the template picture according to the template branch features;
obtaining search features of the search picture according to the search branch features;
obtaining an attention map of the search features and an attention map of the template features according to the search features and the template features;
performing deep cross-correlation between the attention map of the template features and the attention map of the search features to obtain a score map;
performing classification and regression operations on the score map to determine the position of the target in the search picture;
wherein the twin neural network comprises a twin sub-neural network, and obtaining the template picture and the search picture, extracting multiple features of the template picture as template branch features, and extracting multiple features of the search picture as search branch features comprises:
obtaining the template picture and the search picture through the twin sub-neural network, the size of the search picture being larger than the size of the template picture;
inputting the template picture into the ResNet50 network of the twin sub-neural network to obtain the vector convolution operation feature, two-dimensional matrix convolution operation feature, three-dimensional matrix convolution operation feature, four-dimensional matrix convolution operation feature and five-dimensional matrix convolution operation feature of the template picture, which are ft1, ft2, ft3, ft4, ft5 respectively, as the template branch features;
inputting the search picture into the ResNet50 network of the twin sub-neural network to obtain the vector convolution operation feature, two-dimensional matrix convolution operation feature, three-dimensional matrix convolution operation feature, four-dimensional matrix convolution operation feature and five-dimensional matrix convolution operation feature of the search picture, which are fs1, fs2, fs3, fs4, fs5 respectively, as the search branch features;
wherein the twin neural network comprises a multi-branch fusion module, and obtaining the template features of the template picture according to the template branch features comprises:
performing channel compression on the features ft3, ft4, ft5 of the template branch features to obtain features fn3, fn4, fn5;
passing the feature ft2 of the template branch features through the multi-branch fusion module to obtain a feature fn2 containing different receptive fields;
adding fn3, fn4, fn5 respectively to fs2 and performing a center cropping operation to obtain the template features Ft3, Ft4, Ft5 of the template picture.

2. The method according to claim 1, wherein the twin neural network comprises a global context module, and obtaining the search features of the search picture according to the search branch features comprises:
performing channel compression on the features fs3, fs4, fs5 of the search branch features to obtain features fm3, fm4, fm5; passing the features fm3, fm4, fm5 through the global context module to obtain the search features Fs3, Fs4, Fs5.

3. The method according to claim 2, wherein the twin neural network comprises an attention map module, and obtaining the attention map of the search features and the attention map of the template features according to the search features and the template features comprises:
inputting the template features Ft3, Ft4, Ft5 and the search features Fs3, Fs4, Fs5 respectively into a self-attention module and a cross-attention module of the attention map module to obtain the attention maps of the template features and the attention maps of the search features.

4. The method according to claim 3, wherein the twin neural network comprises a deep cross-correlation module, and performing deep cross-correlation between the attention map of the template features and the attention map of the search features to obtain the score map comprises:
performing, through the deep cross-correlation module, deep cross-correlation operations respectively on the attention maps of the template features and the attention maps of the search features to obtain score maps Φ3, Φ4, Φ5.

5. The method according to claim 4, wherein the twin neural network comprises a target position determination module, and performing classification and regression operations on the score map to determine the position of the target in the search picture comprises:
inputting the score maps Φ3, Φ4, Φ5 respectively into a classification branch and a regression branch of the target position determination module;
passing the score maps Φ3, Φ4, Φ5 through the classification branch, each through a convolution with kernel size 1×1 and stride 1, to obtain features with 2k channels, and multiplying preset learnable weights with these features respectively to obtain classification features, the classification features containing foreground and background features of the target in the search picture;
passing the score maps Φ3, Φ4, Φ5 through the regression branch, each through a convolution with kernel size 1×1 and stride 1, to obtain features with 4k channels, and multiplying preset learnable weights with these features respectively to obtain regression features, the regression features containing features of the target;
determining the position of the target in the search picture according to the classification features and the regression features.

6. An electronic device, comprising a processor, a communication interface, a memory and a communication bus, wherein the processor, the communication interface and the memory communicate with each other through the communication bus; the memory is configured to store a computer program; the processor is configured to implement the method steps of any one of claims 2 to 5 when executing the program stored in the memory.

7. A computer-readable storage medium, wherein a computer program is stored in the computer-readable storage medium, and when the computer program is executed by a processor, the method steps of any one of claims 2 to 5 are implemented.

8. A target tracking system fusing salient information and multi-granularity context features, for implementing the method of claim 1, comprising a twin sub-neural network, a multi-branch fusion module, a global context module, an attention map module, a deep cross-correlation module and a target position determination module, wherein:
the twin sub-neural network is configured to obtain a template picture and a search picture, extract multiple features of the template picture as template branch features, and extract multiple features of the search picture as search branch features; the template picture contains appearance information of a target to be tracked; the search picture is a picture containing the target;
the multi-branch fusion module is configured to obtain template features of the template picture according to the template branch features;
the global context module is configured to obtain search features of the search picture according to the search branch features;
the attention map module is configured to obtain an attention map of the search features and an attention map of the template features according to the search features and the template features;
the deep cross-correlation module is configured to perform deep cross-correlation between the attention map of the template features and the attention map of the search features to obtain a score map;
the target position determination module is configured to perform classification and regression operations on the score map to determine the position of the target in the search picture.
CN202111671961.1A — priority and filing date 2021-12-31 — Target tracking method and system integrating salient information and multi-granularity context features — Active — granted as CN114332488B (en)

Priority Applications (1)

- CN202111671961.1A — priority/filing date 2021-12-31 — Target tracking method and system integrating salient information and multi-granularity context features

Publications (2)

- CN114332488A — published 2022-04-12
- CN114332488B — published 2025-07-11

Family

ID: 81020004

Family Applications (1)

- CN202111671961.1A (Active; granted as CN114332488B) — priority/filing date 2021-12-31 — Target tracking method and system integrating salient information and multi-granularity context features

Country Status (1)

- CN: CN114332488B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party

- CN114926652B * — priority 2022-05-30, published 2025-07-08 — Xiamen University of Technology — Twin tracking method and system based on interaction and aggregation type feature optimization

Citations (2)

- CN111325155A * — priority 2020-02-21, published 2020-06-23 — Chongqing University of Posts and Telecommunications — Video action recognition method based on residual 3D CNN and multimodal feature fusion strategy
- CN112927209A * — priority 2021-03-05, published 2021-06-08 — Chongqing University of Posts and Telecommunications — CNN-based significance detection system and method

Family Cites Families (3)

- KR20210116966A * — priority 2020-03-18, published 2021-09-28 — Samsung Electronics — Method and apparatus for tracking target
- CN113744311A * — priority 2021-09-02, published 2021-12-03 — Beijing Institute of Technology — Twin neural network moving target tracking method based on full-connection attention module
- CN113705588B * — priority 2021-10-28, published 2022-01-25 — Nanchang Institute of Technology — Twin network target tracking method and system based on convolution self-attention module


Also Published As

- CN114332488A — published 2022-04-12

Similar Documents

- CN109493350B: Portrait segmentation method and device
- US12400302B2: Image processing method, image processing apparatus, electronic device and computer-readable storage medium
- CN114764868A: Image processing method, image processing device, electronic equipment and computer readable storage medium
- Kumar et al.: Multiple face detection using hybrid features with SVM classifier
- TW202036461A: System for disparity estimation and method for disparity estimation of system
- CN105069424B: Quick face recognition system and method
- CN109063776B: Image re-recognition network training method and device and image re-recognition method and device
- CN111027455B: Pedestrian feature extraction method and device, electronic equipment and storage medium
- CN110516803A: Implement traditional computer vision algorithms as neural networks
- CN113240012B: Two-dimensional image-based unsupervised multi-view three-dimensional target retrieval method and device
- CN108711144B: Augmented reality method and device
- Fang et al.: Deep3DSaliency: Deep stereoscopic video saliency detection model by 3D convolutional networks
- CN111340077A: Disparity map acquisition method and device based on attention mechanism
- CN109447023A: Method for determining image similarity, video scene switching recognition method and device
- CN116029946A: Image denoising method and system based on heterogeneous residual attention neural network model
- CN113160042A: Image style migration model training method and device and electronic equipment
- CN112069338A: Picture processing method and device, electronic equipment and storage medium
- Mao et al.: Visual arts search on mobile devices
- CN114332488B: Target tracking method and system integrating salient information and multi-granularity context features
- CN113920382A: Cross-domain image classification method and related device based on class-consistent structured learning
- Chamasemani et al.: Video abstraction using density-based clustering algorithm
- CN114445750A: Video object segmentation method, device, storage medium and program product
- CN114764936A: Image key point detection method and related equipment
- CN114566160B: Voice processing method, device, computer equipment, and storage medium
- CN110100263B: Image reconstruction method and device

Legal Events

- PB01: Publication
- SE01: Entry into force of request for substantive examination
- GR01: Patent grant
