1. Introduction
Multi-rotor aerial vehicles equipped with cameras are capable of covering large areas dynamically, which is why they are valuable for many applications, such as tracking moving objects. When objects in the scene are isolated and distinct from the background, tracking is generally an achievable task. However, despite decades of research, visual tracking remains a challenging problem in real-world applications due to factors such as partial occlusion, quick and abrupt object motion, lighting changes, and substantial variations in the viewpoint and pose of the target object.
Single-object tracking (SOT) is a fundamental computer vision problem that has several applications, including autonomous vehicles [
1,
2,
3] and surveillance systems [
4]. The objective of SOT is to track a defined target within a video sequence using its initial state (position and appearance). SOT approaches [
5,
6,
7,
8,
9] that utilize the Siamese paradigm are commonly used in both 2D and 3D SOT, as the paradigm provides a good compromise between performance and speed. Using an appearance-matching strategy, the Siamese model locates the target in the candidate region by comparing features of the target template and the search area, both extracted by a shared backbone.
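To illustrate this matching scheme, the minimal sketch below cross-correlates template features against search-region features extracted by a shared backbone; the `backbone` callable and the tensor sizes in the comments are illustrative assumptions, not the exact architecture used here.

```python
import torch
import torch.nn.functional as F

def siamese_response(backbone, template_img, search_img):
    """Illustrative Siamese matching: embed the template and the search
    region with a shared backbone, then cross-correlate the template
    features over the search features; the response-map peak indicates
    the most likely target location."""
    z = backbone(template_img)   # e.g. (1, C, 6, 6) template features
    x = backbone(search_img)     # e.g. (1, C, 22, 22) search-region features
    # Use the template features as a correlation kernel over the search area.
    return F.conv2d(x, z)        # e.g. (1, 1, 17, 17) response map
```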
Since there is no pre-trained object detector involved, single-object trackers are frequently referred to as “generic object trackers” or “model-free trackers” [
10,
11,
12]. From a learning standpoint, model-free visual object tracking is a difficult problem given that only one instance of the target is available in the first frame, and the tracker has to learn the target's appearance in the following frames. In our proposed system, we mitigate this issue by replacing the single template frame with a template set that is updated over time according to simple heuristic rules that we propose.
The fundamental distinction between detection and tracking is the application of dynamics. During detection, the object is detected independently in each frame. In tracking, we predict the new location of the object in the next frame using estimated dynamics, and then we detect that object. Based on the measurements, we update the estimated dynamics and iterate.
Even though the results of appearance matching for 3D SOT on the KITTI dataset are decent [
13], Zheng et al. [
7] noted that KITTI has the following characteristics:
- (i)
The subject moves very slightly between two successive frames, preventing a significant change in appearance;
- (ii)
There are few or no distractions in the target’s environment.
The aforementioned qualities do not hold in natural situations. When objects move quickly or the hardware can only support a limited frame sampling rate, self-occlusion may cause dramatic changes between successive LiDAR views. Additionally, the number of negative samples increases dramatically in scenes with heavy traffic. Even humans may find it difficult to identify a target based on its appearance in these situations.
As shown in
Figure 1, our focus is to track single objects using only a single RGBD camera rather than other sensors, such as LiDARs, for multiple reasons: cameras are cheaper than LiDARs; cameras are lighter in weight than LiDARs, which is very important in the case of multi-rotors; and neither sensor can see through obstructions such as heavy rain, snow, and fog.
Based on the above observations, we propose to tackle 3D SOT from a different perspective using the simplicity of a Siamese paradigm [
5] with the power of robust monocular depth estimation [
14]. The main contributions of the proposed system are as follows:
Utilize a monocular-based hybrid-depth estimation technique to overcome the limitations of sensor depth maps.
Employ the template set to cover the target object in a variety of poses over time.
Introduce the auxiliary object classifier to improve the overall performance of the visual tracker.
2. Related Work
In recent years, visual object tracking has been a prominent research field [
15,
16]. Due to its necessity and numerous applications, such as autonomous vehicles, robotics, and surveillance, more tracking algorithms are introduced each year [
15,
16]. Here, we examine the current advancements in the field of visual tracking. To do that, we compare various approaches in terms of the main technique (estimation-based tracking, feature-based tracking, and learning-based tracking); in our comparison, we consider the advantages, disadvantages, and other details, such as the number of tracked objects (single or multiple) and whether the tracker works in 2D or 3D space. By the end of this section, we will have outlined a comparison between state-of-the-art deep-learning single-object trackers.
There are many challenges in object tracking [
17,
18], such as background clutter, scale variation, occlusion, and fast motion.
Figure 2 shows some examples from the OTB dataset [
18], with each example depicting one of the main challenges. In recent years, many approaches have been proposed to tackle each of these problems; here, we discuss the state-of-the-art algorithms grouped by the tracking technique used.
Estimation-based tracking: To use the Kalman filter [
19] in object tracking [
20], a dynamic model of the target movement should be designed. A Kalman filter could be used to calculate the position in the case of linear systems with Gaussian errors. For nonlinear dynamic models, other suitable methods are used, such as the extended Kalman filter [
21].
Tracking can benefit from the main properties of the Kalman filter [
19], which include the following (a minimal sketch is given after the list):
Predicting the future location of the object.
Correcting the forecast based on current measurements.
Reducing the noise caused by incorrect detections.
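The sketch below illustrates these properties with a 2D constant-velocity Kalman filter; the unit time step and the noise parameters are illustrative assumptions.

```python
import numpy as np

class ConstantVelocityKF:
    """Minimal 2D constant-velocity Kalman filter sketch: the state is
    [x, y, vx, vy]; predict() propagates the dynamics to forecast the next
    location, update() corrects the forecast with a noisy position
    measurement."""
    def __init__(self, q=1e-2, r=1.0):
        self.x = np.zeros(4)                                    # state estimate
        self.P = np.eye(4)                                      # state covariance
        self.F = np.eye(4); self.F[0, 2] = self.F[1, 3] = 1.0   # dynamics (dt = 1)
        self.H = np.eye(2, 4)                                   # measure position only
        self.Q = q * np.eye(4)                                  # process noise
        self.R = r * np.eye(2)                                  # measurement noise

    def predict(self):
        self.x = self.F @ self.x
        self.P = self.F @ self.P @ self.F.T + self.Q
        return self.x[:2]                                       # predicted location

    def update(self, z):
        y = z - self.H @ self.x                                 # innovation
        S = self.H @ self.P @ self.H.T + self.R
        K = self.P @ self.H.T @ np.linalg.inv(S)
        self.x = self.x + K @ y
        self.P = (np.eye(4) - K @ self.H) @ self.P
        return self.x[:2]                                       # corrected location
```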
The vast majority of tracking problems are non-linear. As a result, particle filters have been investigated as a possible solution to such problems. The particle filter handles non-Gaussian measurement noise by employing a statistical calculation method known as recursive Monte Carlo sampling. The main concept of the particle filter is to represent the distribution by a set of particles. Each particle is assigned a probability weight, which represents the probability of sampling that particle from the probability density function. One disadvantage of this method is that, after a few iterations, only a few particles retain significant weight (degeneracy), which is mitigated by resampling [
22].
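A minimal bootstrap-filter sketch of one predict–weight–resample step is shown below; the 1D position state, random-walk motion model, and Gaussian measurement likelihood are illustrative assumptions.

```python
import numpy as np

def particle_filter_step(particles, weights, motion_std, measurement, meas_std):
    """One bootstrap particle-filter step (sketch): propagate particles with a
    noisy motion model, reweight them by the measurement likelihood, and
    resample to keep the weights from collapsing onto a few particles."""
    # Predict: diffuse particles with a random-walk motion model.
    particles = particles + np.random.normal(0.0, motion_std, particles.shape)
    # Weight: Gaussian likelihood of the measurement for each particle.
    weights = weights * np.exp(-0.5 * ((particles - measurement) / meas_std) ** 2)
    weights = weights / (weights.sum() + 1e-12)
    # Resample particles in proportion to their weights.
    idx = np.random.choice(len(particles), size=len(particles), p=weights)
    return particles[idx], np.full(len(particles), 1.0 / len(particles))
```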
The JPDA multi-object tracker [
23,
24,
25] is a tracker that can process numerous target detections from multiple sensors. To assign detections to tracks, it employs joint probabilistic data association. With a soft assignment, the tracker allows numerous detections to contribute to each track. Tracks are initialized, confirmed, corrected, predicted, and deleted by the tracker. The tracker receives detection reports from the object detector and sensor fusion as inputs. For each track, it estimates the state vector and the state estimate-error covariance matrix. Each detection is associated with at least one track. The tracker creates a new track if a detection cannot be assigned to an existing track.
Feature-based tracking: Using extracted attributes such as texture, color, and optical flow, this category of visual tracker identifies the most similar objects over the upcoming frames. Here, we discuss some of the most important algorithms associated with feature-based trackers. Lucas–Kanade template tracking [
26] is one of the first and most popular trackers, originally designed to monitor the optical flow of image pixels and determine how they move over time. The problem with this approach was the inaccurate assumption of constant flow (pure translation) for all pixels within a larger window over long periods of time. Zhao et al. [
27] used local binary patterns (LBP) to describe moving objects and a Kalman filter (KF) for target tracking. This approach inherited the advantages of both LBP and KF, such as computational simplicity (LBP and KF), good performance (LBP and KF), high discriminative power (LBP), and invariance to changes in grayscale (LBP). On the other hand, it also inherited their disadvantages, such as not being invariant to rotations (LBP) and requiring the state variables to be normally distributed (KF). Lastly, in terms of time and space, the computational complexity increases exponentially with the number of neighbors (LBP).
One of the proposed solutions is to compare features (SIFT and HoG) [
28,
29] instead of raw pixels, as these features are designed to be invariant to changes in scale and rotation. Although these algorithms have been used for a long time, they still suffer from issues such as high-dimensional feature descriptors (SIFT and HoG), high computational requirements (SIFT and HoG), and low matching accuracy at large rotation and viewpoint angles, which makes them less reliable than current deep-learning models.
Learning-based tracking: Discriminative, generative, and reinforcement learning are the three main paradigms used in learning-based tracking. We will review some of the main deep visual object-tracking (VOT) approaches. With the advancement of deep learning, deep features have been used in object trackers. SiamFC [
8] was one of the models that used fully convolutional Siamese networks to solve the problem of object tracking. Although SiamFC has many advantages, it has two main disadvantages [
30]. First, it only generates the final response scores using the features from the final layer. These high-level features are resistant to noise, but they lack specific target information, making these features insufficient for discrimination when the distractor falls into the same category as the target. Second, the SiamFC training method ensures that each patch in the search region contributes equally to the final response score map. Therefore, regardless of where a distractor appears in the search region, it may produce a high response score and lead the tracking to fail. Later, in SiamRPN [
9], the solution was developed using the region proposal network (RPN), which was originally introduced in Faster R-CNN [
31] to solve the object detection problem. Therefore, instead of applying a sliding approach on the features, a set of proposals is generated, and then the model both classifies and regresses these proposals using the RPN, which has two branches. The first branch classifies each proposal, i.e., whether or not it looks like the template object, and the second branch regresses an offset for the proposed box. DaSiamRPN [
6] focused on learning distractor-aware Siamese networks for precise and long-term tracking. To that end, the features of conventional Siamese trackers were initially examined. Zhu et al. [
6] recognized that imbalanced training data reduces the separability of the acquired features. To manage this distribution and direct the model’s attention to the semantic distractions, a powerful sampling approach is used during the off-line training phase. Zhu et al. [
6] designed a distractor-aware module to perform incremental learning during inference so that the generic embedding can be successfully transferred to the current video domain. Zhu et al. [
6] also introduced a straightforward yet efficient local-to-global search region approach for long-term tracking. ATOM (Accurate Tracking by Overlap Maximization) [
32] proposed a novel tracking architecture with explicit components for target estimation and classification. The estimation component is trained offline on large-scale datasets to predict the IoU overlap between the target and a bounding-box estimate. The classification component consists of a two-layer fully convolutional network head and is trained online using a dedicated optimization approach. In SiamMask [
5], the authors point out that additional information can be encoded in the RoW (response of a candidate window) produced by a fully convolutional Siamese network to generate a pixel-wise binary mask. Transformer meets tracker [
33]: In contrast to how the transformer is often used in natural language processing applications, Wang et al. [
33] redesigned the encoder and decoder of the transformer into two parallel branches within the Siamese-like tracking pipelines. Through attention-based feature reinforcement, the transformer encoder promotes the target templates, which benefits the construction of better tracking models. The object search procedure is made easier by the transformer decoder, which propagates tracking cues from earlier templates to the current frame.
Since our proposed approach is meant to track moving objects in 3D space, we will discuss object localization in 3D space. Indeed, the knowledge of the structure of the 3D environment and the motion of dynamic objects is essential for autonomous navigation [
34]. This importance comes from the fact that the 3D structure implicitly depicts the agent’s relative position, and it is also used to help with high-level scene understanding tasks such as detection and segmentation [
35]. Recent breakthroughs in deep neural networks (DNNs) have sparked a surge in interest in monocular depth prediction [
14,
35] and stereo image depth prediction [
36], as well as optical flow estimation [
37].
Now we will discuss some of the state-of-the-art visual object trackers that operate in three-dimensional space. Eye in the sky [
38] is a novel framework for drone-based tracking and 3D object localization systems. It combines CNN-based object detection, multi-object tracking, ground plane estimation, and, finally, 3D localization of the ground targets. In our proposed framework, we use MiDaS [
14], alongside the depth map generated by the depth sensor, to generate the hybrid-depth map, which is explained in
Section 3. Unlike the relative depth map generated by the MiDaS algorithm [
14], the proposed hybrid-depth map estimates the depth measured in meters.
TrackletNet tracker (TNT) [
39] is a multi-object tracking method based on a tracklet graph model, incorporating tracklet vertex creation with epipolar geometry and connectivity (edge) measurement using a multi-scale network, TrackletNet.
Beyond 3D Siamese tracking [
7]: The authors of this paper also presented a motion-centric paradigm to handle 3D SOT from a new perspective. In line with this concept, they proposed the matching-free two-stage tracker M²-Track. In the first stage, M²-Track uses motion transformation to localize the target across a series of frames. The target box is then refined in the second stage using motion-assisted shape completion.
While many 3D VOT studies focused on using LiDAR data as the essential data input to their systems [
7,
38,
39,
40,
41], in our work, we tried to solve the visual object-tracking task using RGBD data generated by a lightweight and affordable sensor [
42]. To do that, we relied on the SiamMask architecture [
5] which originally works in 2D space. We introduced the hybrid-depth maps (explained in detail in the
Section 3) to SiamMask to be able to track the objects in 3D environments. In addition, we replaced the template frame used in SiamMask [
5] with the template set to give the model the ability to track the moving object from different points of view and without the need to build a 3D model for the moving object. Lastly, we used stereo image triangulation [
43] to deproject the position into 3D space, which is followed by a 3D Kalman Filter [
19] which helps to remove the measurement noise.
In
Table 1, we outline a comparison of several state-of-the-art deep-learning trackers, along with their tracking framework, publication year, dimensional space (2D/3D), backbone network (the feature-extraction network), modules, classification/regression methods (BBR: bounding-box regression; BCE loss: binary cross-entropy loss), training schemes, update schemes (no update, linear update, non-linear update), input data format (RGB, LiDAR), tracking speed, and re-detection capability (Y/N).
3. Methodology
The main goal of this work is to track the moving object from the perspective of the multi-rotor aerial vehicle,
Figure 1, where the multi-rotor navigates through the environment while trying to follow the moving object using a real-time visual tracking approach. As shown in
Figure 3, the multi-rotor keeps following the moving object until it reaches the location of the moving object and the moving object stops.
There are two sub-tasks in the tracking task: identifying the tracked object and estimating its state. The objective of the tracking mission is to automatically predict the state of the moving object in consecutive frames given its initial state. The proposed framework combines 2D SOT with monocular depth estimation to track moving objects in 3D space. Depth maps are used primarily to maintain awareness of the relevant depth information about target objects.
The proposed system takes a stream of RGB frames and depth maps as inputs (
Figure 4). The target object is tracked by the Siamese network which produces a mask, a bounding box, an object class (optional), and an RPN score [
31] for the object. In order to estimate the corresponding point in 3D space, a relative depth map [
14] is aligned with the depth map [
42] to generate the hybrid-depth map. In our experiments, the proposed hybrid-depth map was able to estimate the distance to objects more than 20 m away, roughly four times the range of the depth sensor used, which can only estimate distances up to 5 m [
42].
Loss function. To train the classifier, we used binary cross entropy as the loss function (1) and Adam as the optimizer:
$$ L_{BCE} = -\left( y \log \hat{y} + (1 - y) \log (1 - \hat{y}) \right) \qquad (1) $$
where $\hat{y}$ is the scalar value in the model output, and y is the corresponding target value (0 or 1).
In our experiments, we augmented the architectures of SiamMask [
5] and MiDaS [
14], where each of them was trained separately:
For SiamMask, the overall objective is a weighted sum of the mask, score, and box losses:
$$ L = \lambda_1 L_{mask} + \lambda_2 L_{score} + \lambda_3 L_{box} \qquad (2) $$
We refer the reader to [5,8,9] for $L_{mask}$, $L_{score}$, and $L_{box}$. We have not performed hyperparameter optimization for Equation (2); we merely set the loss weights $\lambda_1$, $\lambda_2$, and $\lambda_3$ following [5].
For MiDaS, we refer to [14]; the loss function used is the scale- and shift-invariant loss, where the loss for a single sample is represented as
$$ L_{ssi}\!\left(\hat{d}, \hat{d}^{*}\right) = \frac{1}{2M} \sum_{i=1}^{M} \rho\!\left( \hat{d}_i - \hat{d}^{*}_i \right) \qquad (3) $$
where $\hat{d}$ and $\hat{d}^{*}$ are scaled and shifted versions of the prediction and the ground truth, M is the number of pixels with valid ground truth, and $\rho$ defines the specific type of loss function.
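As an illustration, the sketch below computes a scale- and shift-invariant loss by aligning the prediction to the ground truth with a closed-form least-squares scale and shift and a squared-error ρ; this is one of the alignment variants described in [14], while the function name and the choice of ρ are assumptions made for the example.

```python
import numpy as np

def ssi_loss(pred, gt, mask):
    """Scale- and shift-invariant loss sketch: align the prediction to the
    ground truth with a least-squares scale s and shift t, then apply
    rho(x) = x^2 / 2 to the residual over the valid pixels."""
    p, g = pred[mask], gt[mask]
    # Solve min over (s, t) of sum((s * p + t - g)^2) in closed form.
    A = np.stack([p, np.ones_like(p)], axis=1)
    (s, t), *_ = np.linalg.lstsq(A, g, rcond=None)
    residual = s * p + t - g
    return 0.5 * np.mean(residual ** 2)
```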
Before estimating the position of the target object, we need to estimate its depth by using the hybrid-depth map H. At each time step, H is calculated as follows
$$ H = D \odot B + s \, r \, \tilde{R} \odot \bar{B} \qquad (4) $$
where H is the hybrid-depth map which conveys the depth estimates in meters, s is a scaling factor, D is the depth map generated by the sensor, and the depth ratio r can be evaluated using the following expression
$$ r = \frac{\sum \left( D \odot B \right)}{\sum \left( \tilde{R} \odot B \right) + \epsilon} \qquad (5) $$
where $\epsilon$ is a small fraction to avoid division by zero, and $\tilde{R}$ is the normalized inverted relative depth map. To evaluate it, we normalize the inverted relative-depth map R by subtracting the mean $\mu$ and then dividing by the standard deviation $\sigma$. Therefore,
$$ \tilde{R} = \frac{R - \mu}{\sigma} \qquad (6) $$
Here, $\bar{B} = B \oplus 1$ is the output of applying the XOR operation (denoted by ⊕) on the binary mask B with 1, and B is the mask for depth points that have values bigger than a small fraction $\epsilon$, i.e.,
$$ B = \left[ D > \epsilon \right] \qquad (7) $$
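A minimal sketch of how Equations (4)–(7) can be realized is given below; the variable names and the NumPy formulation are illustrative, not the exact implementation.

```python
import numpy as np

def hybrid_depth_map(sensor_depth, relative_inv_depth, s=1.0, eps=1e-6):
    """Sketch of the hybrid-depth computation described above: keep the sensor
    depth where it is valid and fill the remaining pixels with the normalized
    relative depth, rescaled toward meters by the ratio r."""
    D = np.nan_to_num(sensor_depth, nan=0.0)          # sensor depth in meters
    B = (D > eps).astype(float)                       # valid-depth mask, Eq. (7)
    B_bar = np.logical_xor(B, 1).astype(float)        # complement mask, B xor 1
    R = relative_inv_depth                            # inverted relative depth map
    R_tilde = (R - R.mean()) / (R.std() + eps)        # normalization, Eq. (6)
    r = (D * B).sum() / ((R_tilde * B).sum() + eps)   # depth ratio, Eq. (5)
    return D * B + s * r * R_tilde * B_bar            # hybrid map, Eq. (4)
```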
Despite the fact that many tracking algorithms have been proposed for various tasks and object tracking has been studied for many years, it is still a challenging problem. There is no single tracking algorithm that can be used to accomplish all tasks and in different scenarios. In the case of SiamMask [
5], the tracker can sometimes be distracted from the target object by distractors or by similar objects in the scene. Solving such critical issues could help make multi-rotor-based tracking applications much safer and more reliable. For that purpose, we propose an auxiliary object classifier and a mechanism for initializing and updating the template set. This mechanism is designed to cover the target object from different points of view and at different distances.
At each time step, the proposed framework has two main inputs (the undistorted RGB and depth frames) and estimates the position of the moving object. At the initial step (t = 0), InitTracker initializes the tracker after adding the cropped template of the target object to the template set. The template set has a maximum size (l), and new candidates replace old templates over time. All the templates in the template set can be replaced except for the original template inserted at t = 0.
The proposed framework components are realized in Algorithm 1.
Algorithm 1 Object visual tracker
Input: Undistorted RGB & depth frames
1: InitTracker()
2: repeat
3:  EstimateRelativeDepthMap()
4:  GenerateHybridDepthMap()
5:  EstimateObjectState()
6:  if StateIsValid()
7:   ApplyProjectionAndKF()
8:  else
9:
10:
11:  end if
12: until end of sequence
Let us consider all the steps of Algorithm 1 in detail:
EstimateRelativeDepthMap takes the RGB image as input and estimates the relative depth map using the robust monocular depth-estimation network [14]; the predicted relative-depth map, together with the depth map produced by the sensor [42], forms the input for the next step.
GenerateHybridDepthMap applies Equation (4) to estimate the hybrid-depth map, which holds the depth information of the moving object in meters. The process of generating the hybrid-depth map is divided into three main steps, as follows
- (i)
Preprocess the depth map by replacing all non-numeric values with zeros. The result is a 2D array with values measured in meters.
- (ii)
Generate a 2D array of values from the relative depth map.
- (iii)
Align the two maps using Equations (4)–(7) to obtain the hybrid-depth map.
EstimateObjectState takes the state vector of the moving object as input and returns the updated state as output. The state vector holds the details of the target object, such as the hybrid-depth maps and the RGB frames of the current and previous time steps. The position is defined in 2D space as the center of the object, and the depth is read from the hybrid-depth map at that position.
To update the template set, there are conditions that the candidate template needs to meet. These conditions are as follows
Therefore, if the candidate template violates any of the corresponding conditions, the new candidate template will not be inserted into the template set. Before we explain how the template set is used, we need to know that the original template frame keeps the same index at every time step. The usage of the template set is as follows (a sketch is given after the list)
- (i)
Use the original template frame.
- (ii)
The conditions above must be satisfied and the objectness probability (RPN score) should be greater than a threshold. Otherwise, the tracker skips to the next most recent template frame in the template set.
- (iii)
Repeat (ii), if needed, up to n times (as set in our experiments).
- (iv)
Return the estimated object state.
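The fallback logic of steps (i)–(iv) can be sketched as follows; the `track` and `conditions_ok` callables stand in for the Siamese tracker and the heuristic template conditions, and are hypothetical names introduced only for this example.

```python
def estimate_with_template_set(track, templates, rpn_threshold, n, conditions_ok):
    """Sketch of the template-set fallback described above: start from the
    original template and, if the heuristic conditions or the RPN score are
    not satisfied, retry with the next most recent template, up to n times."""
    state = track(templates[0])                      # (i) original template first
    recent = list(reversed(templates[1:]))           # most recent candidates first
    tried = 0
    while (not conditions_ok(state) or state.rpn_score <= rpn_threshold) \
            and tried < n and tried < len(recent):
        state = track(recent[tried])                 # (ii)/(iii) retry with another template
        tried += 1
    return state                                     # (iv) estimated object state
```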
ApplyProjectionAndKF: A 3D Kalman filter [
19] will remove measurement noise after a stereo image triangulation [
43] deprojects the position into 3D space.
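For completeness, the sketch below shows how a pixel with a metric depth can be deprojected into the camera frame under a standard pinhole model with known intrinsics (fx, fy, cx, cy denote the focal lengths and principal point); the resulting 3D point is what the Kalman filter receives as a noisy measurement.

```python
import numpy as np

def deproject_to_3d(u, v, depth_m, fx, fy, cx, cy):
    """Deproject pixel (u, v) with depth in meters into a 3D point in the
    camera frame, assuming a standard pinhole camera model."""
    x = (u - cx) * depth_m / fx
    y = (v - cy) * depth_m / fy
    return np.array([x, y, depth_m])
```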
As shown in
Figure 4, the hybrid-depth map is generated from both the MiDaS relative-depth map [
14] and the Intel RealSense depth map [
42]. The alignment of these two maps was explained in detail in Algorithm 1.
Figure 5 shows the hybrid-depth map generated using the proposed approach.
Similarly to SiamMask [
5], ResNet50 [
49] is used up to the final convolutional layer of the 4th stage as the backbone $f_\theta$ for both Siamese network [
5] branches. The main goal of the mask branch $h_\phi$ is to predict a binary mask for each RoW. Basically, $h_\phi$ is a simple two-layer network with learnable parameters $\phi$.
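A minimal sketch of such a two-layer mask head is shown below; the 17×17 response-map and 63×63 mask sizes follow SiamMask [5], while the channel width and exact layer configuration here are assumptions for illustration.

```python
import torch.nn as nn

class MaskHead(nn.Module):
    """Illustrative two-layer mask branch h_phi: maps each RoW feature
    vector to a flattened 63x63 binary-mask prediction."""
    def __init__(self, in_channels=256, mask_size=63):
        super().__init__()
        self.head = nn.Sequential(
            nn.Conv2d(in_channels, 256, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(256, mask_size * mask_size, kernel_size=1),
        )

    def forward(self, row_features):
        # row_features: (B, C, 17, 17) response map; one mask per RoW.
        return self.head(row_features)
```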
In the proposed system, a ResNet18 [
49] was used to classify the object detected by the SiamRPN branches [
9], to stabilize tracking and make the model less prone to distractors. We have two different settings to run in the proposed system:
Generic moving-object tracker: in this case, the object will be classified using the Siamese tracker with the help of the frames in the template set.
Class-specific moving-object tracker: in this case, the tracker will have an auxiliary object classifier which makes sure to avoid distractors and the possible foreground/background confusions by the RPN network.
The experiment was performed by training ResNet18 with a binary classification layer. Two datasets were utilized to train the binary classifier. The moving object class is represented by the Stanford Cars dataset (16,185 photos of 196 car classes) [
50]. In contrast, the Describable Textures Dataset (DTD) [
51] was used for the non-moving object class. The DTD dataset contains 5640 images sorted into 47 categories inspired by human perception.
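A minimal training sketch for this binary classifier is given below; the ImageNet-pretrained initialization, learning rate, and loop structure are illustrative assumptions rather than the exact training recipe.

```python
import torch
import torch.nn as nn
from torchvision import models

# Sketch of the auxiliary binary classifier: a ResNet18 whose final layer is
# replaced by a single logit, trained with binary cross entropy (Eq. (1)) and Adam.
model = models.resnet18(weights="IMAGENET1K_V1")     # pretrained init is an assumption
model.fc = nn.Linear(model.fc.in_features, 1)        # binary classification layer
criterion = nn.BCEWithLogitsLoss()                   # binary cross entropy on logits
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

def train_step(images, labels):
    """One optimization step: labels are 1 for the moving-object class
    (Stanford Cars) and 0 for the non-moving class (DTD textures)."""
    optimizer.zero_grad()
    logits = model(images).squeeze(1)
    loss = criterion(logits, labels.float())
    loss.backward()
    optimizer.step()
    return loss.item()
```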