Disclosure of Invention
The invention aims to provide a multi-camera target tracking and re-identification system that improves identification accuracy and reliability over the prior art.
The multi-camera pedestrian recognition method provided by the invention comprises the following steps:
S1, arranging a plurality of cameras whose imaging areas do not overlap;
S2, each camera records a monitorable area in the current environment and acquires a data set;
S3, tracking a target of interest in each monitoring area with each camera;
S31, obtaining features of the input picture by using a twin neural network and obtaining the deformation offsets of the deformable convolution.
Further, the step S31 further includes:
(1) The network consists of an offline pre-trained AlexNet; the AlexNet network model has five layers in total, and each convolutional layer is followed by a ReLU activation function;
(2) The fourth convolutional layer is a deformable convolution layer: it takes the feature map obtained from the previous convolutional layer as input, learns offsets for the feature map, and then applies them to the convolution kernel to achieve deformable convolution; the offset is added to the regular sampling grid $R$, giving $y(p_0)=\sum_{p_k\in R} w(p_k)\,x(p_0+p_k+\Delta p_k)$, where $K=|R|$ is the number of sampling positions, $p_0$ and $p_k$ denote pixel positions, $w$ is the weight, $x$ is the template frame, and $\Delta p_k$ represents the offset of grid $R$ (a minimal code sketch follows this list);
(3) The initial frame of the video sequence is the template frame and the current frame is the detection frame; the two are respectively input into the twin neural network to obtain the feature maps of the template frame and the detection frame, wherein the input size of the template frame is 127×127×3 and the resulting feature map is 6×6×256, and the input size of the detection frame is 256×256×3 and the resulting feature map is 22×22×256;
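As a concrete illustration of item (2), the following is a minimal sketch of a deformable convolution layer, assuming a PyTorch environment with torchvision's `deform_conv2d`; the channel sizes and the use of a plain convolution to predict the offsets are illustrative assumptions, not the exact configuration of the trained backbone.

```python
# Minimal deformable-convolution sketch of y(p0) = sum_k w(p_k) * x(p0 + p_k + dp_k).
# Assumes PyTorch + torchvision; channel sizes are illustrative only.
import torch
import torch.nn as nn
from torchvision.ops import deform_conv2d

class DeformableConvBlock(nn.Module):
    def __init__(self, in_ch=256, out_ch=384, k=3):
        super().__init__()
        # an ordinary convolution learns the offsets dp_k for every output position p0:
        # 2 * k * k channels = one (dy, dx) pair for each of the k*k points of grid R
        self.offset_conv = nn.Conv2d(in_ch, 2 * k * k, kernel_size=k, padding=1)
        self.weight = nn.Parameter(torch.randn(out_ch, in_ch, k, k) * 0.01)
        self.bias = nn.Parameter(torch.zeros(out_ch))

    def forward(self, x):
        offset = self.offset_conv(x)                      # learned deformation offsets
        return deform_conv2d(x, offset, self.weight,      # sample grid R shifted by the offsets
                             self.bias, padding=1)

feat = torch.randn(1, 256, 13, 13)                        # feature map from the previous layer
print(DeformableConvBlock()(feat).shape)                  # torch.Size([1, 384, 13, 13])
```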
S32, inputting the feature maps into an RPN (region proposal network) to generate candidate regions, the process being as follows:
(1) The candidate area network consists of two parts, one part is a classification branch for distinguishing a foreground from a background, and the other part is a regression branch for fine tuning a candidate area;
(2) For the classification branch, the candidate area network receives the template-frame and detection-frame feature maps generated in S31, performs convolution with new convolution kernels to shrink the feature maps and obtain the template-frame features and detection-frame features, and then uses the template-frame features as a convolution kernel to convolve the detection-frame features; the output feature map contains 2k channels, representing the foreground and background scores of the k anchors, and a response map is generated through region-of-interest pooling and offset pooling (a sketch of this branch follows this list);
(3) For the regression branch, the same operation is performed to obtain a position regression value for each sample, comprising the dx, dy, dw and dh values; that is, the output feature map contains 4k channels, representing the coordinate offset predictions of the k anchors;
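The classification branch of S32 can be sketched as below, assuming PyTorch: the template-frame feature is reshaped into a convolution kernel and correlated with the detection-frame feature, yielding 2k channels for the k anchors; the regression branch has the same structure but outputs 4k channels. The anchor count, the batch size of 1 and the intermediate kernel size are illustrative assumptions.

```python
# Hedged sketch of the classification branch: the template feature becomes the
# convolution kernel that scans the detection-frame feature (cross-correlation).
import torch
import torch.nn as nn
import torch.nn.functional as F

class RPNClassBranch(nn.Module):
    def __init__(self, ch=256, k_anchors=5):
        super().__init__()
        self.k = k_anchors
        # "new convolution kernels" that shrink both feature maps
        self.conv_template = nn.Conv2d(ch, ch * 2 * k_anchors, kernel_size=3)
        self.conv_search = nn.Conv2d(ch, ch, kernel_size=3)

    def forward(self, z_feat, x_feat):
        kernel = self.conv_template(z_feat)            # 1 x (2k*256) x 4 x 4
        kernel = kernel.view(2 * self.k, -1, 4, 4)     # template features used as conv kernels
        search = self.conv_search(x_feat)              # 1 x 256 x 20 x 20
        return F.conv2d(search, kernel)                # 1 x 2k x 17 x 17: fg/bg score per anchor

z = torch.randn(1, 256, 6, 6)        # template-frame feature map
x = torch.randn(1, 256, 22, 22)      # detection-frame feature map
print(RPNClassBranch()(z, x).shape)  # torch.Size([1, 10, 17, 17])
```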
S33, determining the tracking position by the following steps:
(1) Similarity measurement is carried out on the candidate frames of the template branches and the candidate frames of the detection branches, and a boundary frame of a tracking result is obtained;
(2) Screening the bounding box of the final predicted output by using non-maximum suppression (NMS) to obtain a final tracked target bounding box;
(3) The non-maximum suppression means that the optimal candidate box is retained by calculating the intersection-over-union (IoU), as illustrated in the sketch below.
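A minimal NumPy sketch of the intersection-over-union test and non-maximum suppression used here; the (x1, y1, x2, y2) box format and the 0.5 overlap threshold are illustrative assumptions.

```python
# Keep the best-scoring boxes and drop candidates that overlap them too much.
import numpy as np

def iou(box, boxes):
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_a = (box[2] - box[0]) * (box[3] - box[1])
    area_b = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area_a + area_b - inter)          # intersection-over-union

def nms(boxes, scores, iou_thresh=0.5):
    order = np.argsort(scores)[::-1]                  # best score first
    keep = []
    while order.size > 0:
        best = order[0]
        keep.append(best)
        rest = order[1:]
        order = rest[iou(boxes[best], boxes[rest]) < iou_thresh]
    return keep                                       # indices of retained candidate boxes
```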
S4, transforming the position coordinates of the camera positions and the coverage areas into a unified world coordinate system;
S5, transforming the target motion trajectory into the unified world coordinate system;
The specific conversion method for transforming the coordinate system into the unified world coordinate system is as follows:
$P_w = R\,P_c + t$, (1)
where $P_c$ is the camera-coordinate-system coordinate, $P_w$ is the world-coordinate-system coordinate, $R$ is the rotation matrix, and $t$ is the translation vector.
In particular, the pedestrian motion area can be treated as a plane, so the z-axis data can be omitted to simplify the calculation.
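A minimal NumPy sketch of equation (1): a point in a camera coordinate system is mapped into the unified world coordinate system by a rotation R and a translation t. The rotation angle and translation used here are illustrative values, not calibration results.

```python
# World coordinate = R @ camera coordinate + t (planar motion, so z can be dropped).
import numpy as np

def camera_to_world(p_cam, R, t):
    return R @ p_cam + t

theta = np.deg2rad(30.0)                          # assumed yaw of this camera
R = np.array([[np.cos(theta), -np.sin(theta), 0.0],
              [np.sin(theta),  np.cos(theta), 0.0],
              [0.0,            0.0,           1.0]])
t = np.array([4.0, 2.0, 0.0])                     # camera position in world coordinates
p_cam = np.array([1.0, 0.5, 0.0])                 # tracked point in this camera's coordinates
print(camera_to_world(p_cam, R, t))
```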
S6, predicting, by an optical flow method and according to the target motion trajectory, the camera whose monitoring area the target is likely to appear in;
The optical flow method calculates the offset of each pixel between adjacent frames over the whole image, forming an optical flow displacement field that represents the direction of pedestrian motion. It is assumed that the video has brightness constancy, temporal continuity and spatial consistency during shooting. The optical flow vector of each pixel in the region of interest is obtained for every frame, and the position change of the object to be detected relative to the camera is derived from these optical flow vectors.
$I(x,y,t) = I(x+dx,\, y+dy,\, t+dt)$, (2)
where $I(x,y,t)$ is the brightness of the pixel at space-time coordinates $(x,y,t)$, $I(x+dx, y+dy, t+dt)$ is the brightness of that pixel after it has moved by $(dx, dy)$, $x, y, t$ represent the camera coordinates and time, $dx, dy$ represent the displacement, and $dt$ represents the elapsed time. Expanding the right-hand side of the equation with a Taylor series and eliminating $I(x,y,t)$ gives the equation
$\frac{\partial I}{\partial x}dx + \frac{\partial I}{\partial y}dy + \frac{\partial I}{\partial t}dt = 0$, (3)
where $\partial I/\partial x$, $\partial I/\partial y$ and $\partial I/\partial t$ are the partial derivatives of the brightness with respect to $x$, $y$ and $t$. Dividing both sides by $dt$ gives
$I_x u + I_y v + I_t = 0$, (4)
where $u = dx/dt$ and $v = dy/dt$ are respectively the horizontal and vertical optical flow components and $I_x$, $I_y$, $I_t$ are the brightness derivatives. The optical flow energy minimization function $E(u,v)$ is specifically
$E(u,v) = \iint \left[(I_x u + I_y v + I_t)^2 + \alpha\left(\lVert\nabla u\rVert^2 + \lVert\nabla v\rVert^2\right)\right] dx\,dy$, (5)
where $\alpha$ is a parameter for adjusting the weight of the smoothness term.
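The following is a hedged NumPy sketch of this optical-flow step: the brightness-constancy constraint of equation (4) combined with the smoothness weight α of equation (5) is minimized with a Horn-Schunck-style iteration. The grayscale input frames, the 4-neighbour averaging and the iteration count are illustrative assumptions, not necessarily the exact solver of the invention.

```python
import numpy as np

def optical_flow(frame1, frame2, alpha=1.0, n_iter=100):
    f1 = frame1.astype(float)          # grayscale frame at time t
    f2 = frame2.astype(float)          # grayscale frame at time t + dt
    Iy, Ix = np.gradient(f1)           # brightness gradients along rows (y) and columns (x)
    It = f2 - f1                       # temporal brightness change
    u = np.zeros_like(f1)              # horizontal flow component
    v = np.zeros_like(f1)              # vertical flow component
    for _ in range(n_iter):
        # local average of the current flow estimate (4-neighbourhood)
        u_avg = (np.roll(u, 1, 0) + np.roll(u, -1, 0) + np.roll(u, 1, 1) + np.roll(u, -1, 1)) / 4.0
        v_avg = (np.roll(v, 1, 0) + np.roll(v, -1, 0) + np.roll(v, 1, 1) + np.roll(v, -1, 1)) / 4.0
        common = (Ix * u_avg + Iy * v_avg + It) / (alpha + Ix ** 2 + Iy ** 2)
        u = u_avg - Ix * common
        v = v_avg - Iy * common
    return u, v                        # per-pixel optical flow displacement field
```

Averaging the resulting (u, v) field over the region of interest gives the general motion direction of the target used to predict the next camera.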
S7, acquiring the data of the 2-3 cameras whose shooting areas the target may enter, and performing target re-identification on the captured video; when the target is re-identified, taking its first frame as the template frame and repeating S3-S6 to complete multi-camera target tracking and re-identification; and if the target is not found, jumping to S8;
S8, expanding the search area according to the optical flow method in S6, selecting other cameras where the target may appear, and re-performing S7;
S81. The scheme for selecting the other cameras described in S8 is as follows:
(1) Converting the target optical flow direction obtained in the step S6 into world coordinates of a camera obtained in the step S5;
(2) Obtaining the general direction of target movement, and judging 2-3 scenes in which the target is likely to appear by a K nearest neighbor method;
(3) Re-identification and tracking are completed under the 2-3 scenes;
S82. The K-nearest-neighbor metric selects the Euclidean distance:
$d_i = \sqrt{(x_1-c_{i1})^2 + (x_2-c_{i2})^2 + (x_3-c_{i3})^2}$,
where $d_i$ represents the distance, $x$ represents the tracked target position, $c_i$ represents the camera to be detected, $x_1$, $x_2$ and $x_3$ are the 1st, 2nd and 3rd dimensional coordinates of the tracked target $x$, the set $C=\{c_1, c_2, \dots, c_n\}$ is the set of coordinates of the $n$ cameras, and $c_i$ is the 3-dimensional coordinate of the $i$-th camera.
First, the distance $d_i$ from the current target to each camera is calculated with the formula above, in which $c_{i1}$, $c_{i2}$ and $c_{i3}$ are the 1st, 2nd and 3rd dimensional coordinates of the $i$-th camera. The $k$ smallest distances are then selected ($k=2$ or $3$, depending on the specific situation), and target re-identification is carried out on the corresponding $k$ cameras; when a target box matching the template frame is found in some frame of one of these cameras, target tracking switches to that camera and the other $k-1$ cameras are discarded.
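A minimal NumPy sketch of the selection in S81-S82: the Euclidean distance d_i from the tracked target to every camera is computed and the k = 2 or 3 closest cameras are kept. All coordinates below are illustrative.

```python
import numpy as np

def select_candidate_cameras(target_xyz, camera_xyz, k=3):
    d = np.linalg.norm(camera_xyz - target_xyz, axis=1)   # d_i for every camera
    return np.argsort(d)[:k]                              # indices of the k nearest cameras

target = np.array([3.0, 5.0, 0.0])                        # tracked target in world coordinates
cameras = np.array([[0.0, 0.0, 3.0],                      # world positions of the n cameras
                    [6.0, 4.0, 3.0],
                    [2.0, 9.0, 3.0],
                    [10.0, 1.0, 3.0]])
print(select_candidate_cameras(target, cameras, k=2))     # -> [1 2]
```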
S9, repeating the steps S3-S7 to finish multi-camera target tracking and re-identification.
Advantageous effects
The invention aims to provide a multi-camera target tracking and re-identification algorithm. First, the features and convolution deformation offsets of the template frame and the detection frame are extracted with a twin neural network; an RPN network distinguishes foreground from background and predicts anchor coordinate offsets; region-of-interest pooling and offset pooling are applied; similarity measurement is performed between the candidate regions obtained from the template branch and the detection branch to obtain predicted target boxes; and non-maximum suppression is used to screen the predicted boxes and obtain the final target position. The invention realizes target tracking and re-identification across cameras, and can also achieve re-identification and tracking for video sequences captured at lower resolution.
Detailed Description
The application will be described in further detail with reference to the drawings and examples of embodiments. The specific embodiments of the application described are intended to be illustrative of the application and are not intended to be limiting. It should be further noted that, for convenience of description, only the portions related to the present application are shown in the drawings.
Reference in the specification to "one embodiment" or "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the disclosure. The appearances of the phrase "an embodiment" in various places in the specification are not necessarily all referring to the same embodiment.
In the present application, an AlexNet network is adopted as the backbone network after comprehensively considering project running efficiency and implementation; an exemplary offline pre-trained AlexNet structure is shown in fig. 1. As shown in fig. 1, the AlexNet network has 5 layers in total, consisting of 5 convolutional layers. Taking the template-frame branch as an example: in convolutional layer 1, the input data is 227×227×3; a convolutional layer is constructed with an 11×11 filter and stride 4, the ReLU activation gives 55×55×96, and 3×3 max pooling with stride 2 gives 27×27×96. In convolutional layer 2, the input data is 27×27×96; a convolutional layer is constructed with a 5×5 filter, the ReLU activation gives 27×27×256, and 3×3 max pooling with stride 2 gives 13×13×256. In convolutional layer 3, the input data is 13×13×256; a convolutional layer is constructed with a 3×3 filter and stride 1, and the ReLU activation gives 13×13×384. Convolutional layer 4 is the deformable convolutional layer: the input data is 13×13×384, a convolutional layer is constructed with a 3×3 filter and stride 1, and the offset $\Delta p_k$ is added to the regular grid $R$ so that the sampling position $p$ is changed, giving
$y(p_0)=\sum_{p_k\in R} w(p_k)\,x(p_0+p_k+\Delta p_k)$, (1)
where $K=|R|$ is the number of sampling positions, $p$ represents the pixel position, $w$ is the weight, $x$ is the template frame, and $\Delta p_k$ represents the offset of grid $R$; the resulting data is 13×13×384. In convolutional layer 5, the input data is 13×13×384; a convolutional layer is constructed with a 3×3 filter and stride 1, the ReLU activation gives 13×13×256, and 3×3 max pooling with stride 2 gives 6×6×256. The detection-frame branch is identical except for the size of its input data.
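An illustrative PyTorch sketch of the five-layer backbone described above, reproducing the template-branch output sizes; convolutional layer 4 is shown here as a plain convolution where the design uses the deformable layer, and the padding values are assumptions chosen so that the printed shapes match the sizes stated in the text.

```python
import torch
import torch.nn as nn

backbone = nn.Sequential(
    nn.Conv2d(3, 96, kernel_size=11, stride=4), nn.ReLU(),     # conv1: 227x227x3 -> 55x55x96
    nn.MaxPool2d(kernel_size=3, stride=2),                     # -> 27x27x96
    nn.Conv2d(96, 256, kernel_size=5, padding=2), nn.ReLU(),   # conv2: -> 27x27x256
    nn.MaxPool2d(kernel_size=3, stride=2),                     # -> 13x13x256
    nn.Conv2d(256, 384, kernel_size=3, padding=1), nn.ReLU(),  # conv3: -> 13x13x384
    nn.Conv2d(384, 384, kernel_size=3, padding=1), nn.ReLU(),  # conv4 (deformable in the design)
    nn.Conv2d(384, 256, kernel_size=3, padding=1), nn.ReLU(),  # conv5: -> 13x13x256
    nn.MaxPool2d(kernel_size=3, stride=2),                     # -> 6x6x256
)

x = torch.randn(1, 3, 227, 227)          # template-branch input
print(backbone(x).shape)                 # torch.Size([1, 256, 6, 6])
```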
As shown in fig. 2, the multi-camera target tracking and re-recognition algorithm according to one embodiment of the present application includes:
S1, arranging a plurality of cameras whose imaging areas do not overlap;
S2, each camera records a monitorable area in the current environment and acquires a data set;
S3, tracking a target of interest in each monitoring area with each camera;
S4, transforming the position coordinates of the camera positions and the coverage areas into a unified world coordinate system;
S5, transforming the target motion trajectory into the unified world coordinate system;
S6, predicting, by an optical flow method and according to the target motion trajectory, the camera whose monitoring area the target is likely to appear in;
S7, acquiring the data of the 2-3 cameras whose shooting areas the target may enter, and performing target re-identification on the captured video; when the target is re-identified, taking its first frame as the template frame and repeating S3-S6 to complete multi-camera target tracking and re-identification; and if the target is not found, jumping to S8;
S8, expanding the search area according to the optical flow method in S6, selecting other cameras where the target may appear, and re-performing S7;
S9, repeating steps S3-S7 to complete multi-camera target tracking and re-identification.
Specific details of S4, S6, S8 are described below:
Specifically, the transformation into the unified world coordinate system described in S4 is performed as follows:
$P_w = R\,P_c + t$, (2)
where $P_c$ is the camera-coordinate-system coordinate, $P_w$ is the world-coordinate-system coordinate, $R$ is the rotation matrix, and $t$ is the translation vector.
In particular, the pedestrian motion area can be treated as a plane, so the z-axis data can be omitted to simplify the calculation.
Specifically, the optical flow method in S6 calculates the offset of each pixel between adjacent frames over the whole image, forming an optical flow displacement field that represents the direction of pedestrian motion. It is assumed that the video has brightness constancy, temporal continuity and spatial consistency during shooting. The optical flow vector of each pixel in the region of interest is obtained for every frame, and the position change of the object to be detected relative to the camera is derived from these optical flow vectors.
$I(x,y,t) = I(x+dx,\, y+dy,\, t+dt)$, (3)
where $I(x,y,t)$ is the brightness of the pixel at space-time coordinates $(x,y,t)$, $I(x+dx, y+dy, t+dt)$ is the brightness of that pixel after it has moved by $(dx, dy)$, $x, y, t$ represent the camera coordinates and time, $dx, dy$ represent the displacement, and $dt$ represents the elapsed time. Expanding the right-hand side of the equation with a Taylor series and eliminating $I(x,y,t)$ gives the equation:
$\frac{\partial I}{\partial x}dx + \frac{\partial I}{\partial y}dy + \frac{\partial I}{\partial t}dt = 0$, (4)
where $\partial I/\partial x$, $\partial I/\partial y$ and $\partial I/\partial t$ are the partial derivatives of the brightness with respect to $x$, $y$ and $t$. Dividing both sides by $dt$ gives:
$I_x u + I_y v + I_t = 0$, (5)
where $u = dx/dt$ and $v = dy/dt$ are respectively the horizontal and vertical optical flow components and $I_x$, $I_y$, $I_t$ are the brightness derivatives. The optical flow energy minimization function $E(u,v)$ is specifically
$E(u,v) = \iint \left[(I_x u + I_y v + I_t)^2 + \alpha\left(\lVert\nabla u\rVert^2 + \lVert\nabla v\rVert^2\right)\right] dx\,dy$, (6)
where $\alpha$ is a parameter for adjusting the weight of the smoothness term.
Specifically, the other camera selection schemes in S8 are as follows:
(1) Converting the target optical flow direction obtained in the step S6 into world coordinates of a camera obtained in the step S5;
(2) Obtaining the general direction of target motion and judging, by the K-nearest-neighbor method, the 2-3 scenes in which the target is likely to appear. The K-nearest-neighbor metric selects the Euclidean distance. Let $x$ denote the three-dimensional coordinate of the target tracked in the current frame, with $x_1$, $x_2$ and $x_3$ being the 1st, 2nd and 3rd dimensional coordinates of the tracked target $x$, and let $C=\{c_1, c_2, \dots, c_n\}$ be the set of coordinates of the $n$ cameras. Then
$d_i = \sqrt{(x_1-c_{i1})^2 + (x_2-c_{i2})^2 + (x_3-c_{i3})^2}$, (7)
where $d_i$ represents the distance between the tracked target and the $i$-th camera, $c_i$ is the 3-dimensional coordinate of the $i$-th camera, $c_{i1}$, $c_{i2}$ and $c_{i3}$ are its 1st, 2nd and 3rd dimensional coordinates, $i=1,2,\dots,n$, and $n$ is the total number of cameras.
First, the distance from the current target to each camera is calculated with equation (7); the $k$ smallest distances are then selected ($k=2$ or $3$, depending on the specific situation), and target re-identification is carried out on the corresponding $k$ cameras. When a target box matching the template frame is found in some frame of one of these cameras, target tracking switches to that camera and the other $k-1$ cameras are discarded.
(3) Re-identification and tracking are completed under the 2-3 scenes;
The algorithm network structure is shown in fig. 3; the algorithm consists of two parts, a twin neural network and a candidate area network. The twin network receives two inputs: the input at the top is called the template frame and is the position of the object given by the manually drawn box in the first frame of the video; the input below is called the detection frame and is any frame of the video segment other than the first frame. The twin network maps the two images into a 6×6×256 feature map and a 22×22×256 feature map, respectively. The candidate area network also consists of two branches, a classification branch used to distinguish foreground from background and a regression branch used to adjust the position of the prior boxes; each branch receives two inputs, namely the outputs of the preceding twin neural network. The specific flow of the algorithm is as follows:
(1) The network consists of an offline pre-trained AlexNet; the AlexNet network model has five layers in total, and each convolutional layer is followed by a ReLU activation function;
(2) The fourth convolutional layer is a deformable convolution layer: it takes the feature map obtained from the previous convolutional layer as input, learns offsets for the feature map, and then applies them to the convolution kernel to achieve deformable convolution; the offset is added to the regular sampling grid $R$, giving $y(p_0)=\sum_{p_k\in R} w(p_k)\,x(p_0+p_k+\Delta p_k)$, where $K=|R|$ is the number of sampling positions, $p_0$ and $p_k$ denote pixel positions, $w$ is the weight, $x$ is the template frame, and $\Delta p_k$ represents the offset of grid $R$;
(3) The initial frame of the video sequence is the template frame and the current frame is the detection frame; the two are respectively input into the twin neural network to obtain the feature maps of the template frame and the detection frame, wherein the input size of the template frame is 127×127×3 and the resulting feature map is 6×6×256, and the input size of the detection frame is 256×256×3 and the resulting feature map is 22×22×256;
(4) Inputting the feature map into a candidate area network to generate a candidate area, wherein the candidate area network consists of two parts, one part is a classification branch for distinguishing a foreground from a background, and the other part is a regression branch for fine tuning the candidate area;
(5) For the classification branch, the candidate area network receives the template-frame and detection-frame feature maps generated in S31, performs convolution with new convolution kernels to shrink the feature maps and obtain the template-frame features and detection-frame features, and then uses the template-frame features as a convolution kernel to convolve the detection-frame features; the output feature map contains 2k channels, representing the foreground and background scores of the k anchors, and a response map is generated through region-of-interest pooling and offset pooling;
(6) For the regression branch, the same operation is performed to obtain a position regression value for each sample, comprising the dx, dy, dw and dh values; that is, the output feature map contains 4k channels, representing the coordinate offset predictions of the k anchors (an anchor-decoding sketch follows this list);
(7) In order to determine the tracking position, carrying out similarity measurement on the candidate frames of the template branches and the candidate frames of the detection branches to obtain a boundary frame of a tracking result;
(8) Screening the bounding box of the final predicted output by using non-maximum suppression (NMS) to obtain a final tracked target bounding box;
(9) The non-maximum suppression means that the optimal candidate box is retained by calculating the intersection-over-union (IoU).
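As a concrete illustration of the regression output in item (6), the following hedged NumPy sketch applies one (dx, dy, dw, dh) prediction to an anchor to recover the predicted box; the anchor and delta values, and the linear/exponential parameterization, are illustrative assumptions rather than values taken from the invention.

```python
import numpy as np

def decode(anchor, delta):
    """anchor = (cx, cy, w, h); delta = (dx, dy, dw, dh) from the regression branch."""
    cx, cy, w, h = anchor
    dx, dy, dw, dh = delta
    pred_cx = cx + dx * w                 # shift the centre proportionally to the anchor size
    pred_cy = cy + dy * h
    pred_w = w * np.exp(dw)               # scale width and height in log space
    pred_h = h * np.exp(dh)
    return pred_cx, pred_cy, pred_w, pred_h

print(decode((60.0, 60.0, 32.0, 64.0), (0.1, -0.05, 0.2, 0.0)))
```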
Fig. 4 illustrates a multi-camera target tracking and re-recognition system in accordance with another aspect of the present application. Referring now to fig. 4, a schematic diagram of an electronic system 400 suitable for use in implementing embodiments of the present disclosure is shown. The electronic system shown in fig. 4 is merely an example, and should not impose any limitations on the functionality and scope of use of embodiments of the present disclosure.
As shown in fig. 4, the electronic system 400 may include a processing device (e.g., central processing unit, graphics processor, etc.) 401 that may perform various suitable actions and processes in accordance with programs stored in a Read Only Memory (ROM) 402 or loaded from a storage device 408 into a Random Access Memory (RAM) 403. In the RAM 403, various programs and data necessary for the operation of the electronic device 400 are also stored. The processing device 401, the ROM 402, and the RAM 403 are connected to each other by a bus 404. An input/output (I/O) interface 405 is also connected to bus 404.
In general, the following devices may be connected to the I/O interface 405: input devices 406 including, for example, a touch screen, touchpad, keyboard, mouse, camera, microphone, accelerometer, gyroscope, etc.; an output device 407 including, for example, a Liquid Crystal Display (LCD), a speaker, a vibrator, and the like; storage 408 including, for example, magnetic tape, hard disk, etc.; and a communication device 409. The communication means 409 may allow the electronic system 400 to communicate wirelessly or by wire with other devices to exchange data. While fig. 4 shows an electronic system 400 having various devices, it is to be understood that not all illustrated devices are required to be implemented or provided. More or fewer devices may be implemented or provided instead. Each block shown in fig. 4 may represent one device or a plurality of devices as needed.
In particular, according to embodiments of the present disclosure, the processes described above with reference to flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method shown in the flowcharts. In such an embodiment, the computer program may be downloaded and installed from a network via communications device 409, or from storage 408, or from ROM 402. The above-described functions defined in the methods of the embodiments of the present disclosure are performed when the computer program is executed by the processing device 401.
It should be noted that, the computer readable medium according to the embodiments of the present disclosure may be a computer readable signal medium or a computer readable storage medium, or any combination of the two. The computer readable storage medium can be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples of the computer-readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In an embodiment of the present disclosure, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. Whereas in embodiments of the present disclosure, the computer-readable signal medium may comprise a data signal propagated in baseband or as part of a carrier wave, with computer-readable program code embodied therein. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: electrical wires, fiber optic cables, RF (radio frequency), and the like, or any suitable combination of the foregoing.
The computer readable medium may be embodied in the electronic system (also referred to herein as a "deformation target tracking system"); or may exist alone without being assembled into the electronic system. The computer readable medium carries one or more programs which, when executed by the electronic device, cause the electronic system to:
1) Recording current camera data and acquiring a data set;
2) Each camera uses an algorithm to track an interested target in each monitoring area;
3) Transforming the position coordinates of the camera positions and the coverage areas into a unified world coordinate system;
4) Transforming the target motion trail into a unified world coordinate system;
5) Predicting a camera to which a monitoring area where the target possibly appears belongs according to a target motion track by an optical flow method;
6) Acquiring the data of the 2-3 cameras whose shooting areas the target may enter, performing target re-identification on the captured video, and, when the target is re-identified, taking its first frame as the template frame and repeating S3-S6 to complete multi-camera target tracking and re-identification.
The various algorithms and details in the deformed target tracking method according to the first aspect of the present application are equally applicable to the target tracking and re-recognition system 400 described above, and therefore a substantial portion of their description is omitted for brevity.
The foregoing description is only of the preferred embodiments of the present disclosure and an explanation of the technical principles employed. It will be appreciated by those skilled in the art that the scope of the invention in the embodiments of the present disclosure is not limited to the specific combination of the above technical features, but also encompasses other technical solutions formed by any combination of the above technical features or their equivalents without departing from the spirit of the invention, for example solutions in which the above features are replaced with (but not limited to) features having similar functions disclosed in the embodiments of the present disclosure.