CN112801027A - Vehicle target detection method based on event camera - Google Patents

Vehicle target detection method based on event camera

Info

Publication number
CN112801027A
Authority
CN
China
Prior art keywords
dvs
aps
image
event
loss
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110182127.XA
Other languages
Chinese (zh)
Other versions
CN112801027B (en)
Inventor
孙艳丰
刘萌允
齐娜
施云惠
尹宝才
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing University of Technology
Original Assignee
Beijing University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing University of Technology
Priority to CN202110182127.XA
Publication of CN112801027A
Application granted
Publication of CN112801027B
Legal status: Active
Anticipated expiration


Abstract

The invention discloses a vehicle target detection method based on an event camera, studied using deep learning technology. An event camera can generate frames and event data asynchronously, which helps greatly in overcoming motion blur and extreme lighting conditions. First, events are converted into an event image; then the frame image and the event image are fed into a fusion convolutional neural network simultaneously, with additional convolutional layers for extracting features from the event image. A fusion module in the intermediate layers of the network fuses the features of the two. Finally, the loss function is redesigned to improve the effectiveness of vehicle target detection. The method compensates for the shortcomings of using only frame images for target detection in extreme scenes: by fusing event images with frame images in the fusion convolutional neural network, vehicle target detection in extreme scenes is enhanced.

Description

Vehicle target detection method based on event camera
Technical Field
The invention discloses a vehicle target detection method for extreme scenes, based on an event camera and deep learning technology. It belongs to the field of computer vision and particularly relates to technologies such as deep learning and target detection.
Background
With the rapid development of the automobile industry, autonomous driving technology has received extensive attention from academia and industry in recent years. Vehicle target detection is a challenging task in autonomous vehicle technology and an important application in autonomous driving and intelligent transportation systems, where it plays a key role. The purpose of vehicle target detection is to accurately locate the other vehicles in the surrounding environment and avoid accidents with them.
A great deal of current target detection research uses deep neural networks to enhance detection systems. These studies basically use frame-based cameras, i.e., Active Pixel Sensors (APS). Accordingly, many of the detected objects are stationary or slow-moving, and the lighting conditions are suitable. In practice, vehicles encounter a variety of complex and extreme scenarios. Under extreme lighting and fast motion, images from conventional frame-based cameras suffer from overexposure and motion blur, which poses a significant challenge to target detection.
Dynamic Vision Sensors (DVS) have the key features of high dynamic range and low latency. These characteristics enable them to capture environmental information and generate images faster than standard cameras. At the same time, they are not affected by motion blur, which complements frame cameras in extreme cases. Furthermore, their low latency and short response time can make autonomous vehicles more responsive. A Dynamic and Active Pixel Vision Sensor (DAVIS) can output regular gray-scale frames and asynchronous events through its APS and DVS channels, respectively. Regular gray-scale frames provide the main information for target detection, while asynchronous events provide information about fast motion and illumination changes. With this insight, detection performance can be improved by combining the two kinds of data.
In recent years, deep learning algorithms have achieved great success and are widely used in image classification and target detection. Deep neural networks have excellent feature extraction and strong learning capabilities, and can identify target categories and locate target positions in a recognition task. A Convolutional Neural Network (CNN) based on boundary regression can directly regress the position and class of a target from an input image without searching for candidate regions. However, this requires that the objects in the image fed into the CNN be sharp, whereas objects in images generated in extreme scenes may be blurred. Using a CNN alone on frame images generated in extreme scenes therefore cannot meet the requirement.
The proposed CNN-based vehicle detection method fuses the frame and event data output by a DAVIS camera. The event data are reconstructed into an image, the frame image and the event image are fed into a convolutional neural network simultaneously, and the features extracted from the event image are fused with those extracted from the frame image in the intermediate layers of the network through a fusion module. At the final detection layer, the loss function of the network is redesigned, adding a loss term for the DVS features. The dataset used in the experiments is a self-built vehicle target detection dataset (Dataset of APS and DVS, DAD). Comparison of different input modes shows that the vehicle detection results are significantly improved under different environmental conditions. Meanwhile, compared with other methods, such as networks using a single image input and networks using two kinds of data input simultaneously, the method proposed here achieves a significant improvement.
Disclosure of Invention
The invention provides a vehicle target detection method based on an event camera, using deep learning technology. Since an ordinary camera produces motion blur, overexposure, or under-exposure in fast-moving and extreme-brightness scenes, the event data generated by an event camera are used to enhance the detection effect. The event camera asynchronously outputs events for changes in brightness, each consisting of pixel coordinates, brightness polarity, and a timestamp, so the events are first converted into images; image-based target detection technology is mature, and detection on events is thus realized through image detection techniques. The frame image (APS) and the event image (DVS) are fed simultaneously into a fusion convolutional network framework (ADF) for convolution operations, with feature extraction and feature fusion performed inside the framework. In this way the features of both images are extracted, and the final features carry effective information from both. Finally, the loss function of the model is modified: a loss term for the DVS is added on top of the loss term computed on the APS alone. The overall framework of the method is shown in Fig. 1, and the method can be divided into the following four steps: converting the event data into an event image, extracting features with the overall fusion convolutional neural network framework, fusing features through the fusion module, and performing target detection on the extracted features through the detection layer.
(1) Converting event data into an event image
Considering that current target detection algorithms for images are relatively mature, the event data of the DVS channel are converted into an image and then sent into the network together with the APS image for target detection. The event data have five components: the pixel abscissa x, the pixel ordinate y, a polarity of +1 for a brightness increase, a polarity of -1 for a brightness decrease, and a timestamp. According to the changes in pixel coordinates and polarity, the event data within the accumulation time are converted into an event image of the same size as the frame image.
(2) Overall framework for feature extraction
The invention uses Darknet-53 as the basic framework and, on top of the convolution operations performed on the APS image alone, adds convolutional layers for extracting features from the DVS image. Because the data of the DVS channel are sparse, fewer convolutional layers are used to extract features at the different resolutions. As in Darknet-53, the DVS channel still uses successive 3×3 and 1×1 convolutional layers. The specific numbers of convolutional layers are shown in Table 1.
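As an illustration only, a lighter DVS feature-extraction branch built from successive 3×3 and 1×1 convolutions in the Darknet style might look like the following sketch; the channel widths and the number of blocks here are placeholder assumptions, not the configuration of Table 1.

```python
import torch.nn as nn

def conv_bn_leaky(in_ch, out_ch, kernel_size, stride=1):
    """A Darknet-style convolution block (3x3 or 1x1) with batch norm and LeakyReLU."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size, stride=stride,
                  padding=kernel_size // 2, bias=False),
        nn.BatchNorm2d(out_ch),
        nn.LeakyReLU(0.1, inplace=True),
    )

# A lighter DVS branch: a stride-2 3x3 convolution between resolutions followed by
# alternating 1x1 / 3x3 blocks. Channel widths and block counts are placeholders.
dvs_branch = nn.Sequential(
    conv_bn_leaky(1, 32, 3),
    conv_bn_leaky(32, 64, 3, stride=2),   # downsample to the next resolution
    conv_bn_leaky(64, 32, 1),
    conv_bn_leaky(32, 64, 3),
)
```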
(3) Fusion module
In the network structure, a fusion module is designed with reference to ResNet. The fusion module extracts the main DVS features at different resolutions and then fuses them with the APS features of the same size, so as to guide the network to learn more detailed features of both the APS and the DVS simultaneously. The fusion module is shown in Fig. 2.
(4) Target detection on the extracted features through the detection layer
The loss function of the network is modified at the detection layer: the loss function for the APS features is a cross-entropy loss, including losses on coordinates, classes and confidence. On this basis, a cross-entropy loss is also used to compute the loss on the DVS features. Finally, the detection results of the APS and the DVS are combined. A result from the APS or the DVS alone may still be correct, and taking only the intersection of the two results would lose many correct detections. Taking the union of the two sets of results reduces errors and improves accuracy.
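As a sketch of taking the union of the two result sets (illustrative only; the patent does not specify the merging code), detections from the APS and DVS heads can be pooled and deduplicated with a standard IOU-based non-maximum suppression so that a box found by either channel is kept:

```python
def merge_detections(aps_boxes, dvs_boxes, iou_threshold=0.5):
    """Union of APS and DVS detections: pool both lists, sort by confidence, and
    greedily keep boxes that do not heavily overlap an already-kept box (NMS).
    Each box is (x1, y1, x2, y2, score); the format is an assumption."""

    def iou(a, b):
        ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
        ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
        area_a = (a[2] - a[0]) * (a[3] - a[1])
        area_b = (b[2] - b[0]) * (b[3] - b[1])
        return inter / (area_a + area_b - inter + 1e-9)

    pooled = sorted(aps_boxes + dvs_boxes, key=lambda b: b[4], reverse=True)
    kept = []
    for box in pooled:
        if all(iou(box, k) < iou_threshold for k in kept):
            kept.append(box)
    return kept
```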
Compared with the prior art, the invention has the following obvious advantages and beneficial effects:
Based on the APS images and DVS data generated by the event camera, the invention uses convolutional neural network technology to detect vehicles in extreme scenes. Compared with using only traditional APS images, the event data are converted into event images and recognized with mature deep learning methods. A fusion module is then added to the convolutional neural network to perform feature-level fusion of the two kinds of information. Finally, by modifying the loss function, the network's ability to identify targets when the image suffers from target blur, unsuitable illumination and similar problems is improved, and good results are obtained in extreme scenes.
Drawings
FIG. 1 is a block diagram of an overall network architecture;
FIG. 2 is a schematic diagram of a fusion module;
FIG. 3 is a graph of experimental results;
Detailed Description
Based on the above description, a specific implementation flow is given below; the protection scope of this patent is, however, not limited to this implementation flow.
Step 1: event data converted into event image
Based on the generation mechanism of the event, there are three reconstruction methods to convert the event into the frame. They are a fixed event number method, a leaky integrator method, and a fixed time interval method, respectively. In the present invention, it is an object to be able to detect fast moving objects. The event reconstruction is set to a fixed frame length of 10ms using a fixed time interval method. In each time interval, according to the pixel position generated by the event, at the corresponding pixel point generated with polarity, the event with the polarity increased is drawn as a white pixel, the event with the polarity decreased is drawn as a black pixel, and the background color of the image is gray. And finally generating an event image with the same size as the APS image.
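A minimal sketch of this reconstruction step, assuming events arrive as (x, y, polarity, timestamp) tuples with timestamps in microseconds (an assumption; the patent does not fix the units): events inside each fixed 10 ms window are drawn onto a gray background, white for increasing polarity and black for decreasing polarity.

```python
import numpy as np

def events_to_image(events, t_start, height, width, interval_us=10_000):
    """Accumulate the events falling in [t_start, t_start + interval_us) into one image.

    events: iterable of (x, y, polarity, timestamp), polarity in {+1, -1},
            timestamps assumed to be in microseconds (10 ms window = 10_000 us).
    Returns a uint8 image: gray (128) background, white (255) where brightness
    increased, black (0) where it decreased, the same size as the APS frame.
    """
    img = np.full((height, width), 128, dtype=np.uint8)   # gray background
    for x, y, polarity, t in events:
        if t_start <= t < t_start + interval_us:
            img[y, x] = 255 if polarity > 0 else 0
    return img
```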
Step 2: feature extraction via a network ensemble framework
APS images and DVS images are simultaneously input into a network framework, and features are extracted through respective 3 × 3 and 1 × 1 convolutional layers, except that the number of convolutional layers for extracting the features is different, and the DVS is less than that of the APS. The network (2) predicts the input APS image and also predicts the DVS image. Both APS and DVS images are divided into S × S grids, each grid predicts B bounding boxes, and predicts C classes altogether. Each bbox was introduced into the Gaussian model, predicting 8 coordinate values, μ _ x, ε _ x, μ _ y, ε _ y, μ _ w, ε _ w, μ _ h, ε _ h. A confidence score p is also predicted. So at the last input detection layer of the network is a tensor of 2 × S × B × (C + 9). The three size tensors of the APS channel and the three same size tensors of the DVS channel are fed into the detection layer, respectively.
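To make the tensor bookkeeping concrete, a small illustrative helper computes the per-scale output size implied above: each box carries 8 Gaussian coordinate values, one confidence score and C class scores, and the APS and DVS channels together contribute the leading factor of 2. The grid size, box count and class count in the example call are assumptions.

```python
def detection_tensor_shape(S, B, C):
    """Per-scale, per-channel output shape: an S x S grid, B boxes per cell, and
    C + 9 values per box (8 Gaussian coordinate values, 1 confidence score, C classes)."""
    return (S, S, B, C + 9)

# The APS and DVS channels together give the leading factor of 2 in 2 x S x S x B x (C + 9).
# Example with an assumed 13 x 13 grid, 3 prior boxes and a single 'vehicle' class:
print((2,) + detection_tensor_shape(S=13, B=3, C=1))  # (2, 13, 13, 3, 10)
```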
Step 3: Fusion module
After the APS and the DVS pass through their respective convolutional layers, the resulting features F_aps and F_dvs are fed into the fusion module. F_aps and F_dvs first undergo a given transformation operation T_c: F → U, F ∈ R, U ∈ R^(M×N×C), U = [u_1, u_2, …, u_C], yielding the transformed features U_aps and U_dvs, where u_c is the feature matrix of size M×N of the c-th of the C channels. Simply put, the T_c operation is taken to be a convolution.
After obtaining the transformed feature U_dvs, the global information of all channels in the feature is considered and compressed into a single channel to obtain the aggregated information z_c. This is done by the global average pooling operation T_sq(U_dvs), formally expressed as:
z_c = T_sq(U_dvs) = (1 / (M × N)) Σ_{i=1}^{M} Σ_{j=1}^{N} u_c(i, j)   (1)
where u_c(i, j) is the (i, j)-th value of the feature matrix. In order to make use of the aggregated information z_c from the squeeze operation, an excitation operation is performed: the convolutional feature information of the channels is fused and the channel-wise dependency s is obtained, i.e.:
s = T_ex(z, E) = δ(E_2 σ(E_1 z))   (2)
where σ denotes the ReLU activation function, δ denotes the sigmoid activation function, and E_1 and E_2 are two weights. This is implemented with two fully connected layers.
The activation s is then used to rescale U_aps through the T_scale operation, giving the feature block U′:
U′ = T_scale(U_aps, s) = U_aps · s   (3)
Finally, the DVS feature block is fused with the APS feature to obtain the final fused feature F′_aps:
F′_aps = T_fuse(U′, U_dvs)   (4)
A splicing (concatenation) operation is used in the specific implementation.
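A minimal PyTorch-style sketch of a fusion block consistent with equations (1)-(4): squeeze the DVS features by global average pooling, excite through two fully connected layers (ReLU then sigmoid), rescale the APS features channel-wise, and splice (concatenate) them with the DVS features. Class and argument names, and the reduction ratio, are assumptions for illustration, not the patent's implementation.

```python
import torch
import torch.nn as nn

class FusionModule(nn.Module):
    """SE-style fusion guided by the DVS branch, following equations (1)-(4)."""

    def __init__(self, channels, reduction=16):
        super().__init__()
        hidden = max(channels // reduction, 1)
        self.squeeze = nn.AdaptiveAvgPool2d(1)        # T_sq: global average pooling, eq. (1)
        self.excite = nn.Sequential(                  # T_ex: two fully connected layers, eq. (2)
            nn.Linear(channels, hidden),              # weight E1
            nn.ReLU(inplace=True),                    # sigma (ReLU)
            nn.Linear(hidden, channels),              # weight E2
            nn.Sigmoid(),                             # delta (sigmoid)
        )

    def forward(self, u_aps, u_dvs):
        b, c, _, _ = u_dvs.shape
        z = self.squeeze(u_dvs).view(b, c)            # aggregated information z_c
        s = self.excite(z).view(b, c, 1, 1)           # channel-wise dependency s
        u_prime = u_aps * s                           # T_scale: rescale the APS features, eq. (3)
        return torch.cat([u_prime, u_dvs], dim=1)     # splicing with the DVS features, eq. (4)
```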
Step 4: Target detection on the extracted features through the detection layer
In the same way as for the APS part, the DVS detection results are added at the detection layer; binary cross-entropy loss is applied to the objects and classes detected by the DVS, and the negative log-likelihood (NLL) loss of the coordinate box is:
L_x^DVS = -Σ_{i=1}^{W} Σ_{j=1}^{H} Σ_{k=1}^{K} γ_ijk log( N(x^G_ijk | μ_x(i, j, k), ε_x(i, j, k)) + ξ )   (5)
where L_x^DVS is the NLL loss of the x coordinate of the DVS, W and H are the numbers of grid cells along the width and height respectively, and K is the number of prior boxes. The outputs of the detection layer at the k-th prior box of grid (i, j) are μ_x(i, j, k), denoting the x coordinate, and ε_x(i, j, k), denoting the uncertainty of the x coordinate. x^G_ijk is the Ground Truth of the x coordinate, which is computed in Gaussian YOLOv3 from the width and height of the resized image and the k-th prior box. ξ is a fixed value of 10^-9. The losses of the remaining coordinates y, w and h are defined in the same way as for the x coordinate.
γ_ijk = ω_scale × δ^obj_ijk   (6)
ω_scale = 2 - w^G × h^G   (7)
ω_scale provides different weights according to the object size (w^G, h^G) during training. δ^obj_ijk in (6) is a parameter that is applied in the loss only when the prior box contains the anchor that best fits the current object; its value is 1 or 0, determined by the intersection over union (IOU) of the Ground Truth and the k-th prior box of grid (i, j).
(Equation (8), the confidence loss, is given as an image in the original.) The value of C_ijk depends on whether the bounding box of the grid cell fits the predicted object: if it fits, C_ijk = 1; otherwise, C_ijk = 0. τ_noobj indicates that the k-th prior box of the grid does not fit the target. The loss also involves a term representing the correct category and an indicator that the k-th prior box of the grid is not responsible for predicting the target.
The class loss is as follows. (Equation (9), the class loss, is given as an image in the original.) P_ij denotes the probability that the currently detected object is the correct object.
The loss function of the DVS part is:
L_DVS = (L_x + L_y + L_w + L_h) + L_conf + L_class   (10)
where L_DVS denotes the sum of the coordinate losses, class loss and confidence loss of the DVS channel.
L_APS and L_DVS are identical in form, so the overall network loss function is:
L = L_APS + L_DVS   (11)
By adding the loss function of the DVS channel, the model becomes more robust to data from extreme environments, and the accuracy of the algorithm is improved.
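For illustration, a hedged sketch of the per-coordinate Gaussian negative log-likelihood of equation (5) and the combination of equations (10)-(11); it treats the predicted uncertainty as a variance and assumes the predictions, ground truth and γ weights have already been gathered per prior box, which the patent does not spell out.

```python
import math
import torch

def gaussian_nll(mu, uncertainty, target, xi=1e-9):
    """Per-coordinate negative log-likelihood -log(N(target | mu, uncertainty) + xi),
    mirroring eq. (5); the predicted uncertainty is treated as a variance here."""
    pdf = torch.exp(-(target - mu) ** 2 / (2 * uncertainty)) / torch.sqrt(
        2 * math.pi * uncertainty
    )
    return -torch.log(pdf + xi)

def coordinate_loss(mu, uncertainty, target, gamma):
    """Gamma-weighted sum of NLL terms over grid cells and prior boxes; gamma
    combines the 0/1 object indicator with the size weight omega_scale (eq. (6)-(7))."""
    return (gamma * gaussian_nll(mu, uncertainty, target)).sum()

def total_loss(l_aps, l_dvs):
    """Overall network loss of eq. (11): L = L_APS + L_DVS, where each term sums
    the coordinate, confidence and class losses of its channel (eq. (10))."""
    return l_aps + l_dvs
```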
To verify the validity of the proposed scheme, experiments were first performed on a custom dataset. Comparative experiments were carried out for different input modes, such as inputting only the APS image, inputting only the DVS image, inputting a pixel-wise superposition of the APS and DVS images, and inputting both images simultaneously; the experimental results are shown in Table 2. The effect of the different input modes is further shown in Fig. 3, where each column corresponds to one input mode and four scenes (fast motion, over-bright illumination, dark illumination, and normal) were selected for each method. In a scene where the object moves rapidly, the DVS-only input can detect a fast-moving vehicle but may miss a relatively stationary one; conversely, the APS-only input can detect a relatively stationary vehicle but not a fast-moving one. The effect of inputting the superposition of APS and DVS pixels is comparable to that of inputting the APS image alone. When the two images are input simultaneously, vehicles are detected well whether they are moving rapidly or stationary. Under too-strong or too-dark illumination, neither the APS-only input nor the superimposed image gives a good detection result, whereas inputting the APS and DVS images simultaneously allows the two sets of features to be fused well, so that the DVS compensates for the shortcomings of the APS. The DVS-only input performs worst in the normal scene, because only brightness changes generate information, and regions without brightness change correspond to the background and cannot be recognized. In general, the method of fusing the two images inside the ADF network is significantly superior to the other methods.
At the same time, several state-of-the-art single-input networks were selected for comparison on the custom dataset, as shown in Table 3. As can be seen from the table, when the proposed model is given only a single image as input, its performance is not as good as that of the other networks, because the network itself is designed for dual input. When the model receives frames and events simultaneously, the experimental results improve, demonstrating the benefit of using event data for recognition.
In addition, on the PKU-DDD17-CAR dataset, the method is compared with the JDF network, which also takes both kinds of data as input; the results are shown in Table 4. The event data in the dataset are converted into images and then fed into the ADF network. The results of inputting only the frame image and of inputting the frame image and the event data simultaneously are compared respectively. Although the proposed network is inferior to the JDF network when only a frame image is input, it outperforms the JDF network when the two kinds of data are input simultaneously.
TABLE 1  Number of convolutional layers in the network framework
(Table rendered as an image in the original.)
TABLE 2  Experimental results on the custom dataset
(Table rendered as an image in the original.)
TABLE 3  Comparison with single-image-input networks
(Table rendered as an image in the original.)
TABLE 4  Comparison of two data inputs into different networks
(Table rendered as an image in the original.)

Claims (5)

(Translated from Chinese)
1. A vehicle target detection method based on an event camera, characterized in that: based on the APS images and DVS data generated by the event camera, convolutional neural network technology is used to detect vehicle targets in extreme scenes, and the event data are converted into event images; according to the changes in pixel coordinates and polarity, the event data within the accumulation time are converted into an event image of the same size as the frame image; using a mature convolutional neural network based on the darknet-53 framework, convolutional layers for extracting features from the DVS image are added on top of the convolution operations performed only on the APS image, the DVS channel still using successive 3×3 and 1×1 convolutional layers; a fusion module is then added to the convolutional neural network, which, after extracting DVS features at different resolutions, weights the APS features of the same size so as to guide the network to learn more detailed features of both the APS and the DVS simultaneously; the loss function of the network is modified at the detection layer, the loss function for the APS features being a cross-entropy loss including losses on coordinates, classes and confidence; the cross-entropy loss function is also used to compute the loss on the DVS features.
2. The event-camera-based vehicle target detection method according to claim 1, characterized in that the events are converted into images using the fixed-time-interval method; in order to achieve detection at a speed of 100 frames per second (FPS), the frame reconstruction is set to a fixed frame length of 10 ms; within each time interval, according to the pixel positions at which events are generated, events with increasing polarity are drawn as white pixels and events with decreasing polarity as black pixels at the corresponding pixels, and the background color of the image is gray; finally an event image of the same size as the APS image is generated.
3. The event-camera-based vehicle target detection method according to claim 1, characterized in that successive 3×3 and 1×1 convolutional layers for extracting features from the DVS image are added; the APS image and the DVS image are input into the network framework simultaneously and features are extracted through their respective 3×3 and 1×1 convolutional layers, the difference being that the numbers of feature-extraction convolutional layers differ, with fewer for the DVS than for the APS; while the network predicts on the input APS image, it also predicts on the DVS image; both the APS image and the DVS image are divided into S×S grids, each grid predicts B bounding boxes, and C classes are predicted in total; each bbox is fed into the Gaussian model, and 8 coordinate values are predicted: μ_x, ε_x, μ_y, ε_y, μ_w, ε_w, μ_h, ε_h; a confidence score p is also predicted; therefore what is fed into the last detection layer of the network is a tensor of 2 × S × S × B × (C + 9); the three tensors of different sizes from the APS channel and the three tensors of the same sizes from the DVS channel are fed into the detection layer respectively.
4. The event-camera-based vehicle target detection method according to claim 1, characterized in that the fusion module effectively fuses the two sets of features; after the APS and the DVS pass through their respective convolutional layers, the features F_aps and F_dvs are obtained and fed into the fusion module; F_aps and F_dvs first undergo a given transformation operation T_c: F → U, F ∈ R, U ∈ R^(M×N×C), U = [u_1, u_2, …, u_C], yielding the transformed features U_aps and U_dvs, where u_c is the feature matrix of size M×N of the c-th of the C channels; simply put, the T_c operation is taken to be a convolution;
after obtaining the transformed feature U_dvs, the global information of all channels in the feature is considered and compressed into a single channel to obtain the aggregated information z_c; this is done by the global average pooling operation T_sq(U_dvs), formally expressed as:
z_c = T_sq(U_dvs) = (1 / (M × N)) Σ_{i=1}^{M} Σ_{j=1}^{N} u_c(i, j)   (1)
where u_c(i, j) is the (i, j)-th value of the feature matrix; in order to make use of the aggregated information z_c from the squeeze operation, an excitation operation is performed, the convolutional feature information of the channels is fused, and the channel-wise dependency s is obtained, i.e.:
s = T_ex(z, E) = δ(E_2 σ(E_1 z))   (2)
where σ denotes the ReLU activation function, δ denotes the sigmoid activation function, and E_1 and E_2 are two weights; two fully connected layers are used to implement this operation;
the activation s is used to rescale U_aps through the T_scale operation, giving the feature block U′:
U′ = T_scale(U_aps, s) = U_aps · s   (3)
finally, the DVS feature block is fused with the APS feature to obtain the final fused feature F′_aps:
F′_aps = T_fuse(U′, U_dvs)   (4)
a splicing (concatenation) operation is used in the specific implementation.
5. The event-camera-based vehicle target detection method according to claim 1, characterized in that a loss term for the DVS features is added at the detection layer; in the same way as for the APS part, the DVS detection results are added at the detection layer, binary cross-entropy loss is applied to the objects and classes detected by the DVS, and the negative log-likelihood (NLL) loss of the coordinate box is:
L_x^DVS = -Σ_{i=1}^{W} Σ_{j=1}^{H} Σ_{k=1}^{K} γ_ijk log( N(x^G_ijk | μ_x(i, j, k), ε_x(i, j, k)) + ξ )   (5)
where L_x^DVS is the NLL loss of the x coordinate of the DVS; W and H are the numbers of grid cells along the width and height respectively, and K is the number of prior boxes; the outputs of the detection layer at the k-th prior box of grid (i, j) are μ_x(i, j, k), denoting the x coordinate, and ε_x(i, j, k), denoting the uncertainty of the x coordinate; x^G_ijk is the Ground Truth of the x coordinate, which is computed in Gaussian YOLOv3 from the width and height of the resized image and the k-th prior box; ξ is a fixed value of 10^-9; the losses of the remaining coordinates y, w and h are defined in the same way as for the x coordinate;
γ_ijk = ω_scale × δ^obj_ijk   (6)
ω_scale = 2 - w^G × h^G   (7)
ω_scale provides different weights according to the object size (w^G, h^G) during training; δ^obj_ijk in (6) is a parameter applied in the loss only when the prior box contains the anchor that best fits the current object; its value is 1 or 0, determined by the intersection over union (IOU) of the Ground Truth and the k-th prior box of grid (i, j);
(equation (8), the confidence loss, is given as an image in the original;) the value of C_ijk depends on whether the bounding box of the grid cell fits the predicted object: if it fits, C_ijk = 1, otherwise C_ijk = 0; τ_noobj indicates that the k-th prior box of the grid does not fit the target; the loss also involves a term representing the correct category and an indicator that the k-th prior box of the grid is not responsible for predicting the target;
the class loss is as follows (equation (9), the class loss, is given as an image in the original), where P_ij denotes the probability that the currently detected object is the correct object;
the loss function of the DVS part is:
L_DVS = (L_x + L_y + L_w + L_h) + L_conf + L_class   (10)
where L_DVS denotes the sum of the coordinate losses, class loss and confidence loss of the DVS channel;
L_APS and L_DVS are identical in form, so the loss function of the entire network is:
L = L_APS + L_DVS   (11).
Priority Applications (1)

Application Number: CN202110182127.XA — Priority date: 2021-02-09 — Filing date: 2021-02-09 — Title: Vehicle target detection method based on event camera — Status: Active — Granted publication: CN112801027B (en)
Publications (2)

Publication Number — Publication Date
CN112801027A — 2021-05-14
CN112801027B (en) — 2024-07-12
