Technical Field

The present invention belongs to the technical field of neural networks and optical flow estimation, and in particular relates to an optical flow estimation method, system and device that fuse a visible light camera and an event camera.

Background Art
In recent years, learning-based optical flow estimation has become an important topic in computer vision. It computes the motion vector of every pixel between adjacent image frames without geometric priors and has a wide range of applications, such as motion estimation and autonomous driving. The accuracy of optical flow estimation is an important foundation for these tasks.

In scenes with ideal lighting and slow motion, existing methods based on visible light cameras can learn from the rich texture information of the scene and estimate optical flow well. In challenging scenes such as high-speed motion and high dynamic range, however, conventional visible light cameras suffer from performance degradation such as motion blur, under-exposure and over-exposure, and cannot capture the environment reliably, which in turn causes deep learning algorithms to fail.

The event camera is a new type of bio-inspired sensor. Its biggest difference from a conventional visible light camera is that it does not capture images at a fixed frame rate; instead, it responds asynchronously to the brightness change of each pixel in the field of view, and each such output is called an "event". An event generally contains the position, time and polarity of a brightness change. Compared with conventional frame cameras, event cameras have excellent characteristics such as high temporal resolution (microsecond level), high dynamic range (>120 dB) and high pixel bandwidth (kHz level), which can alleviate the impact of the adverse scenes mentioned above. However, the asynchronous nature of the event stream makes it difficult for algorithm frameworks designed for conventional visible light cameras to fully exploit the temporal information contained in the event data, and its discrete nature makes it hard to obtain reliable predictions in regions where no events are triggered. Therefore, optical flow estimated by algorithms that use only the event stream is not accurate enough.

To achieve accurate motion vector estimation in environments with varying illumination and motion speed, previous work has fused the two modalities of events and RGB images. Among them, Fusion-FlowNet directly concatenates event and image features after feature encoding, and Pan et al. introduce event brightness constancy and image deblurring as optimization objectives. However, these works ignore the modality gap between event and image data; the fusion lacks a principled basis and cannot make good use of the temporal and motion information carried by events.
Summary of the Invention

The purpose of the present invention is to provide an optical flow estimation method based on multi-source information fusion, which can obtain continuous, dense and highly accurate optical flow estimation results, overcome the shortcomings of existing visible-light-camera-based algorithms, whose performance degrades in challenging scenes, and of event-camera-based algorithms, which receive no events in low-texture regions, and fuse the two modalities in a more reasonable and effective way that fully exploits the spatio-temporal information of the event sequence and the texture information of the RGB image. Compared with visible-light-camera-based and event-camera-based deep learning algorithms built on the same framework, the present invention significantly improves the accuracy and real-time performance of optical flow estimation and is more robust across different scenes.

The present invention is implemented by the following technical solutions. In a first aspect, the present invention provides an optical flow estimation method that fuses a visible light camera and an event camera, the method comprising the following steps:
(1) acquiring image data with a visible light camera, and acquiring the event sequence of the images with an event camera;
(2) encoding the event sequence: representing the event sequence as a 3D voxel grid, extracting event features, and segmenting them by time scale to obtain multi-time-scale event features;
(3) extracting RGB image features and context features from the image data;
(4) jointly encoding the RGB image features and the multi-time-scale event features to obtain multiple interpolated ("pseudo-frame") image features;
(5) computing correlation volumes from the RGB image features and the interpolated pseudo-frame image features, and extracting pyramid local correlation maps according to the segmented iterative optical flow increments;
(6) encoding the local correlation maps at different time scales together with the corresponding iterative optical flow increments to obtain multi-time-scale motion features;
(7) iteratively refining the optical flow based on the context features and the multi-time-scale motion features to obtain a continuous and dense optical flow estimation result.
Further, the encoding process of the event sequence is specifically as follows:

(1) according to the sampling times of two consecutive image frames, extracting the event sequence {(xi, yi, ti, pi)}, i ∈ [1, N], of the corresponding time slice, discretizing the time dimension, and converting the event sequence from its four-tuple representation into a 3D voxel grid by bilinear interpolation; the dimension of the event representation is B × H × W, where B is the number of discrete time bins, and H and W are the image height and width, respectively;
(2) normalizing the 3D voxel event representation and extracting event features through convolutional neural network layers and instance normalization layers; the event feature dimension is C × h × w, where C is the number of channels;
(3) segmenting the event features of the complete time slice at equal intervals to obtain event features at different time scales; the output multi-time-scale event features have dimension nc × h × w, where nc is the number of channels, and h and w are the feature height and width, respectively.
Further, the RGB image features and the multi-time-scale event features are jointly encoded to obtain multiple interpolated pseudo-frame image features, specifically:
(1) concatenating the multi-time-scale event features, with dimension nc × h × w;
(2) encoding the RGB image features and the concatenated event features separately through convolutional neural network layers, then concatenating the two and encoding them into multiple concatenated pseudo-frame image features with dimension nc × h × w, which are split to obtain multiple interpolated pseudo-frame image feature results.
Further, in step (5), pyramid correlation volumes of the two frame features corresponding to different time slices are computed from the RGB image features and the interpolated pseudo-frame image features, and pyramid local correlation maps are extracted according to the segmented iterative optical flow increments, specifically:
(1) computing a correlation volume between the RGB image features and the interpolated pseudo-frame image features at each time position, as follows:

Ci = F0 (Fi)^T / √c'

where F0 denotes the RGB image features, Fi the i-th interpolated pseudo-frame image features, c' their number of channels, and T matrix transposition;
(2) reducing the size of the correlation volume with average pooling layers to obtain pyramid correlation volumes with dimension h × w × (h/2^k) × (w/2^k), where k is the pyramid level;
(3) looking up the pyramid correlation volumes according to the segmented iterative optical flow increments, with the search range set to a local window of radius r; the extracted pyramid local correlation maps have dimension h × w × (2r+1) × (2r+1).
Further, in step (6), the multi-time-scale motion features are encoded and enhanced with a cross-attention mechanism, specifically:

(1) encoding the local correlation maps at different time scales together with the corresponding iterative optical flow increments through convolutional neural network layers, obtaining multi-time-scale motion features of dimension h × w × d;
(2) encoding the motion features of the different time scales and of the complete time scale through linear layers, and using a cross-attention mechanism to strengthen the components consistent with the complete-time-scale motion features, where Q is the encoded motion feature Mi, i ∈ [1, n], of a given time scale, and K and V are the encoded complete-time-scale motion feature Mn, as follows:

A = softmax(Q K^T / √dk) V
(3) encoding the weights A with an MLP built from linear layers and layer normalization layers, and adding the result to the motion features at different time scales for pattern enhancement, obtaining enhanced multi-time-scale motion features.
Further, the iterative optical flow refinement process is specifically as follows:

(1) processing the semantic (context) features and the enhanced motion features through a convolutional gated recurrent unit, iteratively updating the hidden state;
(2) processing the hidden state through convolutional neural network layers, outputting an optical flow increment of dimension h × w × 2, and updating the current optical flow for iterative refinement;
(3) upsampling the current optical flow by a factor of eight using 3 × 3 convolution kernels to obtain the continuous and dense optical flow estimation result.
In a second aspect, the present invention further provides an optical flow estimation system that fuses a visible light camera and an event camera, the system comprising:

a visible light camera, configured to acquire image data;
an event camera, configured to acquire the event sequence of the images;
an event encoder, configured to encode the event sequence, represent it as a 3D voxel grid, extract event features, and segment them by time scale to obtain multi-time-scale event features;
an RGB image feature and context feature encoder, configured to extract RGB image features and context features from the image data;
an image-event fusion module, configured to jointly encode the RGB image features and the multi-time-scale event features to obtain multiple interpolated pseudo-frame image features;
a pyramid correlation map module, configured to compute correlation volumes from the RGB image features and the interpolated pseudo-frame image features and to extract pyramid local correlation maps according to the segmented iterative optical flow increments;
a motion feature encoding module, configured to encode the local correlation maps at different time scales together with the corresponding iterative optical flow increments to obtain multi-time-scale motion features;
an optical flow iterative updater, configured to iteratively refine the optical flow based on the context features and the multi-time-scale motion features to obtain a continuous and dense optical flow estimation result.
In a third aspect, the present invention further provides an optical flow estimation device that fuses a visible light camera and an event camera, comprising a memory and one or more processors, wherein executable code is stored in the memory, and when the processor executes the executable code, the optical flow estimation method fusing a visible light camera and an event camera is implemented.

In a fourth aspect, the present invention further provides a computer-readable storage medium on which a program is stored; when the program is executed by a processor, the optical flow estimation method fusing a visible light camera and an event camera is implemented.

In a fifth aspect, the present invention further provides a computer program product comprising a computer program; when the computer program is executed by a processor, the optical flow estimation method fusing a visible light camera and an event camera is implemented.
Compared with the prior art, the present invention has the following advantages. The present invention fuses the texture information of the RGB images from the visible light camera with the spatio-temporal information of the event sequence from the event camera. It proposes an image-event fusion module that fuses the different modalities according to the sensor principles and constructs interpolated correlation volumes at different time scales, and a multi-time-scale motion feature encoding and enhancement module that effectively exploits the temporal information of the event sequence between visible light camera frames and uses a cross-attention mechanism to enhance the motion features, making the interpolated motion feature patterns more accurate. The present invention avoids the performance failures of deep-learning optical flow estimation in challenging scenes and its unreliable predictions in low-texture scenes, operates robustly in environments with varying illumination and speed, and improves the accuracy of the system; it uses fewer network parameters and fewer iterations, which improves the real-time performance of the system.
Brief Description of the Drawings

FIG. 1 is a schematic flow chart of an optical flow estimation method fusing a visible light camera and an event camera provided by an embodiment of the present invention;
FIG. 2 is a schematic structural diagram of an optical flow estimation system fusing a visible light camera and an event camera provided by an embodiment of the present invention;
FIG. 3 is a schematic diagram of the image-event fusion network structure provided by an embodiment of the present invention;
FIG. 4 is a schematic diagram of the motion feature encoding and enhancement network structure using a cross-attention mechanism provided by an embodiment of the present invention;
FIG. 5 is a structural diagram of an optical flow estimation device fusing a visible light camera and an event camera provided by an embodiment of the present invention.
Detailed Description

To make the above objects, features and advantages of the present invention more apparent and easier to understand, specific embodiments of the present invention are described in detail below with reference to the accompanying drawings.

Many specific details are set forth in the following description to facilitate a full understanding of the present invention, but the present invention can also be implemented in other ways different from those described herein, and those skilled in the art can make similar generalizations without departing from the essence of the present invention; therefore, the present invention is not limited to the specific embodiments disclosed below.
The present invention provides an optical flow estimation method that fuses a visible light camera and an event camera, as shown in FIG. 1. The method feeds RGB images and event sequences into a deep-learning-based network model to obtain continuous and dense optical flow estimation results, and is particularly suitable for application scenarios such as motion estimation and autonomous driving for fast-moving unmanned vehicles and mobile robots in complex, changing environments.

As shown in FIG. 1, the event sequence is represented as a 3D voxel grid and the corresponding segmented event features are obtained by the event encoder; the corresponding image features and context features are obtained by feeding the RGB image into the image and context feature encoders; the image features and event features are fused to obtain interpolated pseudo-frame image features, from which enhanced multi-time-scale motion features are further obtained; finally, a continuous and dense optical flow estimation result is obtained through the optical flow iterative updater.
Further, the encoding process of the event encoder for the event sequence is specifically as follows:
(1) According to the sampling times of two consecutive image frames, the event sequence {(xi, yi, ti, pi)}, i ∈ [1, N], of the corresponding time slice is extracted, the continuous time dimension is discretized, and the event sequence is converted from its four-tuple representation into a 3D voxel grid V(x, y, t) by bilinear interpolation (a code sketch of this voxelization is given after this list), specifically:

V(x, y, tj) = Σ_{i=1}^{N} pi · kb(x − xi) · kb(y − yi) · kb(tj − ti*)

kb(·) = max(0, 1 − |·|)

where tj is the j-th discrete time bin, ti* is the timestamp of the i-th event normalized to the range [0, B−1], B is the number of discrete time bins, N is the total number of events, and kb(·) is the bilinear interpolation kernel. The resulting 3D voxel event representation has dimension B × H × W, where H and W are the image height and width, respectively; xi and yi denote the pixel position of the i-th event, ti its sampling timestamp, and pi its polarity, which indicates the direction of the pixel brightness change; x, y and t are the pixel position and timestamp of the 3D voxel grid;
(2) The 3D voxel event representation is normalized, and event features are extracted through convolutional neural network layers and instance normalization layers, comprising one convolutional (Conv) layer with stride 2 and two ResBlocks with stride 2; the event feature dimension is C × h × w, where C is the number of channels of the complete time scale;
(3) The event features of the complete time slice are segmented at equal intervals to obtain event features at different time scales; the output multi-time-scale event features have dimension nc × h × w, where nc is the number of channels, h and w are the feature height and width, respectively, and c is the number of channels of the shortest time scale.
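As an illustration of the voxelization in step (1), the following is a minimal PyTorch sketch (the function and variable names are ours, not taken from the patent) that accumulates an event stream into a B × H × W voxel grid with the bilinear kernel kb; since event coordinates are integer pixel positions, the spatial kernels collapse to direct accumulation at (xi, yi).

```python
import torch

def events_to_voxel(x, y, t, p, num_bins, height, width):
    """Accumulate an event stream (x_i, y_i, t_i, p_i) into a B x H x W voxel
    grid, interpolating each event bilinearly over the neighbouring time bins."""
    voxel = torch.zeros(num_bins, height, width, dtype=torch.float32)
    # Normalize timestamps to [0, B - 1].
    t_star = (num_bins - 1) * (t - t[0]) / max(float(t[-1] - t[0]), 1e-9)
    for b in range(num_bins):
        # k_b(b - t_i*) = max(0, 1 - |b - t_i*|): weight of event i for time bin b.
        weights = p * torch.clamp(1.0 - torch.abs(b - t_star), min=0.0)
        voxel[b].index_put_((y.long(), x.long()), weights, accumulate=True)
    return voxel

# Example usage with a handful of synthetic events:
# x = torch.tensor([5, 6, 7]); y = torch.tensor([3, 3, 4])
# t = torch.tensor([0.00, 0.01, 0.02]); p = torch.tensor([1.0, -1.0, 1.0])
# grid = events_to_voxel(x, y, t, p, num_bins=5, height=480, width=640)
```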
Further, the RGB image feature encoder and the context feature encoder have similar network structures: image features are extracted through convolutional neural network layers and instance normalization layers, and context features are extracted through convolutional neural network layers and batch normalization layers, each comprising one convolutional (Conv) layer with stride 2 and two ResBlocks with stride 2; the dimension of both features is c × h × w, where c is the number of channels and h and w are the feature height and width, respectively.
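A rough sketch of such an encoder (one stride-2 convolution followed by two stride-2 residual blocks, giving features at 1/8 of the input resolution) is given below; the channel widths and module names are illustrative assumptions rather than values specified by the patent.

```python
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    """Simple residual block with a configurable stride for downsampling."""
    def __init__(self, in_ch, out_ch, stride, norm="instance"):
        super().__init__()
        Norm = nn.InstanceNorm2d if norm == "instance" else nn.BatchNorm2d
        self.conv1 = nn.Conv2d(in_ch, out_ch, 3, stride=stride, padding=1)
        self.conv2 = nn.Conv2d(out_ch, out_ch, 3, padding=1)
        self.norm1, self.norm2 = Norm(out_ch), Norm(out_ch)
        self.skip = nn.Conv2d(in_ch, out_ch, 1, stride=stride)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        h = self.relu(self.norm1(self.conv1(x)))
        h = self.norm2(self.conv2(h))
        return self.relu(h + self.skip(x))

class FeatureEncoder(nn.Module):
    """Stride-2 conv + two stride-2 residual blocks: H x W -> H/8 x W/8.
    Use norm="instance" for image/event features, norm="batch" for context features."""
    def __init__(self, in_ch=3, out_ch=256, norm="instance"):
        super().__init__()
        Norm = nn.InstanceNorm2d if norm == "instance" else nn.BatchNorm2d
        self.stem = nn.Sequential(
            nn.Conv2d(in_ch, 64, 7, stride=2, padding=3), Norm(64), nn.ReLU(inplace=True))
        self.layer1 = ResBlock(64, 128, stride=2, norm=norm)
        self.layer2 = ResBlock(128, out_ch, stride=2, norm=norm)

    def forward(self, x):
        return self.layer2(self.layer1(self.stem(x)))
```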
Further, as shown in FIG. 3, the RGB image features and the multi-time-scale event features are fused and encoded to obtain multiple interpolated pseudo-frame image features, specifically:
(1) The multi-time-scale event features are concatenated, with dimension nc × h × w;
(2) The RGB image features and the concatenated event features are encoded separately through convolutional neural network layers, then the two are concatenated and encoded into multiple concatenated pseudo-frame image features with dimension nc × h × w, which are split to obtain n interpolated feature results. The last pseudo-frame image feature corresponds to the feature of the really sampled image, and a feature similarity constraint between the two is imposed to supervise the learning of the image-event fusion module; the loss constraint is specifically:

||Fn − FI2||2

where FI2 is the feature of the really sampled RGB image I2, Fn is the n-th interpolated pseudo-frame image feature, and ||·||2 is the two-norm.
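A minimal sketch of such a fusion module and of the feature similarity constraint is given below; the module names, single-convolution encoders and kernel sizes are simplifying assumptions, since the patent only specifies that convolutional layers are used and that the constraint is the two-norm between the last pseudo-frame feature and the feature of the really sampled second image.

```python
import torch
import torch.nn as nn

class ImageEventFusion(nn.Module):
    """Fuse image features with concatenated multi-time-scale event features
    into n interpolated ("pseudo-frame") feature maps."""
    def __init__(self, img_ch, evt_ch, out_ch, n_scales):
        super().__init__()
        self.n_scales = n_scales
        self.img_enc = nn.Conv2d(img_ch, out_ch, 3, padding=1)
        self.evt_enc = nn.Conv2d(evt_ch, out_ch, 3, padding=1)
        self.fuse = nn.Conv2d(2 * out_ch, n_scales * out_ch, 3, padding=1)

    def forward(self, img_feat, evt_feats):
        # evt_feats: list of n event features, concatenated along the channel dim.
        evt = self.evt_enc(torch.cat(evt_feats, dim=1))
        fused = self.fuse(torch.cat([self.img_enc(img_feat), evt], dim=1))
        # Split the nc-channel result into n pseudo-frame features.
        return list(torch.chunk(fused, self.n_scales, dim=1))

def similarity_loss(pseudo_last, feat_i2):
    """|| F_n - F_I2 ||_2: the last pseudo-frame feature should match the
    feature of the really sampled second image (per-pixel two-norm, averaged)."""
    return torch.norm(pseudo_last - feat_i2, p=2, dim=1).mean()
```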
Further, the pyramid correlation volumes of the two frame features corresponding to different time slices are computed, and the pyramid correlation maps are extracted by means of the segmented iterative optical flow increments, specifically:
(1) A correlation volume is computed between the RGB image features and the interpolated pseudo-frame image features at each time position, as follows:

Ci = F0 (Fi)^T / √c'

where F0 denotes the features of the really sampled RGB image I1, Fi denotes the i-th interpolated pseudo-frame image features, c' is the number of channels of the RGB image features and of the interpolated pseudo-frame image features, T denotes matrix transposition, and Ci denotes the correlation volume between the i-th interpolated pseudo-frame image features and the RGB image features;

(2) The size of the correlation volume is reduced with average pooling layers of stride 2, yielding pyramid correlation volumes with dimension h × w × (h/2^k) × (w/2^k), where k denotes the downscaling level of the pyramid correlation volume.
(3) The optical flow is first initialized to zero; in subsequent iterations it is updated with the iterative optical flow increments output by the network. Multi-scale optical flows are obtained according to the multi-time-scale segmentation principle and used to look up the corresponding pyramid correlation volumes; the search range is set to a local window of radius r, and the extracted pyramid local correlation maps have dimension h × w × (2r+1) × (2r+1).
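The correlation construction and lookup in steps (1)–(3) can be sketched as follows, in the spirit of RAFT-style all-pairs correlation; the helper names, the use of grid_sample for the windowed lookup, and the assumption that flow channels are ordered (x-displacement, y-displacement) are ours, not taken from the patent.

```python
import torch
import torch.nn.functional as F

def correlation_volume(f0, fi):
    """All-pairs correlation between F0 and a pseudo-frame feature Fi, both (B, c, h, w).
    Returns a tensor of shape (B*h*w, 1, h, w) for pooling and lookup."""
    b, c, h, w = f0.shape
    corr = torch.einsum("bchw,bcuv->bhwuv", f0, fi) / c ** 0.5
    return corr.reshape(b * h * w, 1, h, w)

def correlation_pyramid(corr, levels=4):
    """Average-pool the last two dimensions to build the correlation pyramid."""
    pyramid = [corr]
    for _ in range(levels - 1):
        corr = F.avg_pool2d(corr, 2, stride=2)
        pyramid.append(corr)
    return pyramid

def lookup(pyramid, flow, radius=4):
    """Sample a (2r+1) x (2r+1) window around the flow-displaced position at every level."""
    b, _, h, w = flow.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    centers = torch.stack([xs, ys], dim=-1).to(flow) + flow.permute(0, 2, 3, 1)
    d = torch.arange(-radius, radius + 1, device=flow.device, dtype=flow.dtype)
    dy, dx = torch.meshgrid(d, d, indexing="ij")
    delta = torch.stack([dx, dy], dim=-1).reshape(1, -1, 1, 2)   # (1, (2r+1)^2, 1, 2)
    out = []
    for lvl, corr in enumerate(pyramid):
        coords = centers.reshape(b * h * w, 1, 1, 2) / 2 ** lvl + delta
        hl, wl = corr.shape[-2:]
        # Normalize coordinates to [-1, 1] for grid_sample on the pooled grid.
        grid = torch.cat([2 * coords[..., :1] / (wl - 1) - 1,
                          2 * coords[..., 1:] / (hl - 1) - 1], dim=-1)
        samp = F.grid_sample(corr, grid, align_corners=True)
        out.append(samp.reshape(b, h, w, -1).permute(0, 3, 1, 2))
    return torch.cat(out, dim=1)  # local correlation map: (2r+1)^2 channels per level
```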
Further, as shown in FIG. 4, the multi-time-scale motion features are encoded and enhanced with a cross-attention mechanism, specifically:

(1) The local correlation maps at different time scales and the corresponding iterative optical flow increments are encoded through convolutional neural network layers, yielding multi-time-scale motion features of dimension h × w × d, where d is the number of channels of the motion features.
(2) The motion features of the different time scales and the motion features of the complete time scale are encoded through linear layers, and the motion features of the different time scales are enhanced through a cross-attention mechanism based on the complete-time-scale motion features: Q is the encoded motion feature Mi, i ∈ [1, n], of a given time scale, and K and V are the encoded complete-time-scale motion feature Mn, specifically:

A = softmax(Q K^T / √dk) V

where dk is the number of channels of K, and A represents the weights of the components of the other time scales that are consistent with the complete-time-scale motion features;
(3) The weights A are encoded through one linear layer and one layer normalization layer; an MLP constructed from two linear layers and one layer normalization layer strengthens, in the motion features Mi of the different time scales, the pattern information that is consistent with the complete-time-scale features, yielding the enhanced multi-time-scale motion features Mi', specifically:

Mi' = Mi + MLP([Mi, A]), i ∈ [1, n].
Further, the process by which the optical flow iterative updater refines the optical flow is specifically as follows:

(1) The context features and the enhanced motion features are processed through a convolutional gated recurrent unit, which iteratively updates the hidden state; the convolutional gated recurrent unit updates the hidden state along the y direction and along the x direction, each update comprising three convolutional neural network layers;

(2) The hidden state is processed through an optical flow output head comprising two convolutional neural network layers, which outputs an optical flow increment of dimension h × w × 2; the current optical flow is updated and iteratively refined;

(3) The current optical flow is upsampled by a factor of eight using 3 × 3 convolution kernels to obtain the continuous and dense optical flow estimation result.
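The update loop can be sketched as follows; here a single (non-separable) convolutional GRU and plain bilinear 8× upsampling are used as simplifying stand-ins for the two-direction GRU and the learned 3×3-kernel upsampling described above, and all channel counts are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConvGRU(nn.Module):
    def __init__(self, hidden_ch, input_ch):
        super().__init__()
        self.convz = nn.Conv2d(hidden_ch + input_ch, hidden_ch, 3, padding=1)
        self.convr = nn.Conv2d(hidden_ch + input_ch, hidden_ch, 3, padding=1)
        self.convq = nn.Conv2d(hidden_ch + input_ch, hidden_ch, 3, padding=1)

    def forward(self, h, x):
        hx = torch.cat([h, x], dim=1)
        z = torch.sigmoid(self.convz(hx))
        r = torch.sigmoid(self.convr(hx))
        q = torch.tanh(self.convq(torch.cat([r * h, x], dim=1)))
        return (1 - z) * h + z * q

class FlowUpdater(nn.Module):
    """One refinement iteration: GRU over (context, motion) -> delta flow."""
    def __init__(self, hidden_ch=128, ctx_ch=128, motion_ch=128):
        super().__init__()
        self.gru = ConvGRU(hidden_ch, ctx_ch + motion_ch)
        self.flow_head = nn.Sequential(
            nn.Conv2d(hidden_ch, 256, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(256, 2, 3, padding=1))

    def forward(self, hidden, context, motion, flow):
        hidden = self.gru(hidden, torch.cat([context, motion], dim=1))
        delta = self.flow_head(hidden)            # h x w x 2 flow increment
        flow = flow + delta
        # Stand-in for the learned 8x upsampling: magnitudes are scaled with resolution.
        flow_up = 8.0 * F.interpolate(flow, scale_factor=8,
                                      mode="bilinear", align_corners=False)
        return hidden, flow, flow_up
```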
In another aspect, the present invention further provides an optical flow estimation system that fuses a visible light camera and an event camera, comprising a visible light camera, an event camera, and a fused multimodal optical flow estimation neural network; the fused multimodal optical flow estimation neural network comprises an event encoder, an image-event backbone network, and an optical flow iterative updater. As shown in FIG. 2, the specific implementation is as follows:

The event encoder is used to represent the event sequence as a 3D voxel grid containing pixel positions and timing information and to encode it, using a segmentation strategy, into multi-time-scale event features; it comprises convolutional neural network layers and instance normalization layers.

The image-event backbone network is used to encode RGB context features and to construct multi-time-scale motion features from the image and event features. It is formed by connecting four parts in series: an RGB image feature and context feature encoder, an image-event fusion module, a pyramid correlation map module, and a motion feature encoding and enhancement module. The RGB image feature and context feature encoder comprises convolutional neural network layers, instance normalization layers and batch normalization layers, and is used to extract RGB image features and context features. The image-event fusion module implements cross-modal information conversion and is used to construct interpolated pseudo-frame image features from a single image feature under the guidance of the event features; it comprises convolutional neural network layers. The pyramid correlation map module is used to construct multi-time-scale pyramid correlation volumes from the interpolation results and to extract the corresponding pyramid correlation maps with the segmented iterative optical flow increments; it comprises average pooling layers. The motion feature encoding and enhancement module is used to encode motion features from the local correlation maps and the iterative optical flow increments and to enhance them with a cross-attention mechanism; it comprises convolutional neural network layers, linear layers and layer normalization layers.

The optical flow iterative updater is used to iteratively refine the optical flow from the enhanced motion features and the context features and to upsample the output optical flow by a factor of eight to obtain the refined optical flow result; it comprises convolutional neural network layers and convolutional gated recurrent units.

The image-event fusion network module of the fused multimodal optical flow estimation neural network can be used as a paradigm for fusing different modalities, and the motion feature encoding and enhancement module can be used for motion feature extraction and enhancement in other computer vision tasks.
DSEC-Flow and MVSEC are commonly used real-world datasets for event-camera-based optical flow estimation. DSEC-Flow contains 7,800 training samples and 2,100 test samples in 24 sequences covering daytime, night, urban and lakeside scenes, and in particular scenes where the RGB images are under- or over-exposed when entering or leaving tunnels; the RGB image resolution of its visible light camera is 1440 × 1920, and the resolution of its event camera is 480 × 640. MVSEC contains three indoor sequences of about 1,880 samples and an outdoor sequence of 2,700 samples; the DAVIS 346B camera used has a resolution of 346 × 260 and captures grayscale images and event data with its visible light and event sensors. In the DSEC-Flow dataset, the 80th-percentile pixel displacement magnitude reaches 22 pixels and the maximum optical flow magnitude reaches 210 pixels; normalized by the image width, the 80th-percentile normalized optical flow magnitude reaches 3.4%, whereas in the MVSEC dataset 80% of the optical flow magnitudes are below 4 pixels. The DSEC-Flow dataset therefore contains both small-displacement and large-displacement scenes and is more challenging. The present invention uses these two datasets to verify the optical flow estimation performance of different neural network methods.

To demonstrate the progress made by the proposed method, the fused multimodal optical flow estimation neural network of the present invention is first compared on the DSEC-Flow dataset with different single-modality neural network methods built on the same network framework, using the accuracy of the optical flow estimation results and the real-time performance of model inference as metrics; then, on the MVSEC dataset, the present invention is compared with different single-modality and multimodal neural network methods, using accuracy and outlier percentage as metrics.
Table 1

Table 1 compares the accuracy and real-time performance, on the DSEC-Flow dataset, of the fused multimodal optical flow estimation neural network proposed by the present invention with the single-modality neural network methods RAFT (RGB images only) and E-RAFT (event sequences only) built on the same framework. EPE is the end-point error, the core metric for optical flow estimation; NPE (N = 1, 3) is the percentage of pixels whose EPE exceeds N pixels and measures robustness to large displacements. As Table 1 shows, the RGB-only method cannot cope with complex, changing environments and degrades in challenging scenes; the event-only method cannot reliably predict optical flow in regions where no events are triggered, leading to poorer performance. The method of the present invention combines the advantages of the RGB image and event sequence modalities, achieves the best results in EPE, 1PE and 3PE, and reaches good performance with a smaller network structure and fewer iterations, which greatly reduces the inference time.
Table 2

Table 2 compares the accuracy, on the MVSEC dataset, of the fused multimodal optical flow estimation neural network proposed by the present invention with the single-modality neural network methods RAFT (RGB images only) and E-RAFT (event sequences only) and with the multimodal neural network method Fusion-FlowNet. Here, %Out is the percentage of pixels whose EPE is greater than 3 pixels or greater than 5% of the ground-truth optical flow magnitude. As Table 2 shows, in the indoor scenes indoor_flying 1 and indoor_flying 2 the object displacements are small and the RGB-only method performs well, while the method of the present invention effectively improves the accuracy of optical flow estimation compared with the event-based method; in the outdoor scene outdoor_day 1 the object displacements are large and motion blur occurs easily, and compared with the other RGB-only and event-only single-modality methods and with the multimodal fusion method, the method of the present invention shows a clear improvement, adapts to complex and changing application scenarios, and is more robust.
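For reference, the metrics reported in Tables 1 and 2 can be computed as follows (a small sketch with illustrative names; pred and gt are dense flow fields of shape B × 2 × H × W).

```python
import torch

def flow_metrics(pred, gt, valid=None, n_pixels=(1, 3)):
    """End-point error (EPE), N-pixel error rates (NPE) and the MVSEC-style
    outlier rate (%Out: EPE > 3 px or > 5% of the ground-truth magnitude)."""
    epe = torch.norm(pred - gt, p=2, dim=1)           # per-pixel end-point error
    mag = torch.norm(gt, p=2, dim=1)
    if valid is None:
        valid = torch.ones_like(epe, dtype=torch.bool)
    epe, mag = epe[valid], mag[valid]
    metrics = {"EPE": epe.mean().item()}
    for n in n_pixels:
        metrics[f"{n}PE"] = (epe > n).float().mean().item() * 100.0
    metrics["%Out"] = ((epe > 3.0) | (epe > 0.05 * mag)).float().mean().item() * 100.0
    return metrics
```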
Corresponding to the foregoing embodiments of the optical flow estimation method fusing a visible light camera and an event camera, the present invention further provides embodiments of an optical flow estimation device fusing a visible light camera and an event camera.

Referring to FIG. 5, an optical flow estimation device fusing a visible light camera and an event camera provided by an embodiment of the present invention comprises a memory and one or more processors; executable code is stored in the memory, and when the processor executes the executable code, it implements the optical flow estimation method fusing a visible light camera and an event camera of the above embodiments.

The embodiments of the optical flow estimation device fusing a visible light camera and an event camera provided by the present invention can be applied to any device with data processing capability, which may be a device or apparatus such as a computer. The device embodiments may be implemented by software, or by hardware or a combination of software and hardware. Taking software implementation as an example, as a device in the logical sense, it is formed by the processor of the device with data processing capability on which it runs reading the corresponding computer program instructions from non-volatile storage into memory and executing them. At the hardware level, FIG. 5 shows a hardware structure diagram of a device with data processing capability on which the optical flow estimation device fusing a visible light camera and an event camera of the present invention is located; in addition to the processor, memory, network interface and non-volatile storage shown in FIG. 5, the device with data processing capability on which the device of the embodiment is located may also include other hardware according to its actual functions, which will not be described in detail here.

The implementation process of the functions and effects of each unit in the above device is described in detail in the implementation process of the corresponding steps of the above method and will not be repeated here.

Since the device embodiments basically correspond to the method embodiments, reference may be made to the relevant description of the method embodiments. The device embodiments described above are merely illustrative; the units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units, i.e. they may be located in one place or distributed over multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present invention. Those of ordinary skill in the art can understand and implement it without creative effort.

An embodiment of the present invention further provides a computer-readable storage medium on which a program is stored; when the program is executed by a processor, it implements the optical flow estimation method fusing a visible light camera and an event camera of the above embodiments.

The computer-readable storage medium may be an internal storage unit of any device with data processing capability described in any of the foregoing embodiments, such as a hard disk or a memory. The computer-readable storage medium may also be an external storage device of any device with data processing capability, such as a plug-in hard disk, a smart media card (SMC), an SD card or a flash card provided on the device. Further, the computer-readable storage medium may include both an internal storage unit and an external storage device of any device with data processing capability. The computer-readable storage medium is used to store the computer program and other programs and data required by the device with data processing capability, and may also be used to temporarily store data that has been or will be output.

The present invention further provides a computer program product comprising a computer program; when the computer program is executed by a processor, it implements the optical flow estimation method fusing a visible light camera and an event camera.

The above is only a preferred embodiment of the present invention. Although the present invention has been disclosed above by way of preferred embodiments, they are not intended to limit the present invention. Any person skilled in the art may, without departing from the scope of the technical solution of the present invention, make many possible changes and modifications to the technical solution of the present invention using the methods and technical content disclosed above, or modify it into equivalent embodiments with equivalent changes. Therefore, any simple modification, equivalent change or modification made to the above embodiments according to the technical essence of the present invention without departing from the content of the technical solution of the present invention still falls within the scope of protection of the technical solution of the present invention.