CN117011815A - Vehicle Neural Network - Google Patents

Vehicle Neural Network

Info

Publication number
CN117011815A
Authority
CN
China
Prior art keywords
image
vae
real
vehicle
decoder
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210452824.7A
Other languages
Chinese (zh)
Inventor
N. Raghavan
S. Shrivastava
Punarjay Chakravarty
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ford Global Technologies LLC
Original Assignee
Ford Global Technologies LLC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ford Global Technologies LLC
Priority to CN202210452824.7A
Publication of CN117011815A
Legal status: Pending

Abstract

The present disclosure provides a "vehicle neural network." A computer comprises a processor and a memory, the memory including instructions executable by the processor to receive a monocular image and provide the image to a variational autoencoder neural network (VAE), wherein the VAE has been trained in a twin configuration comprising a first encoder-decoder pair that receives an unlabeled real image as input and outputs a reconstructed real image, and a second encoder-decoder pair that receives a synthetic image as input and outputs a reconstructed synthetic image, and wherein the VAE includes a third decoder and a fourth decoder trained using labeled synthetic images, segmentation ground truth, and depth ground truth. The instructions may also include instructions for outputting a segmentation map and a depth map from the VAE based on the input monocular image.

Description

Translated from Chinese
Vehicle Neural Network

Technical Field

The present disclosure relates to vehicle neural networks.

Background

Vehicles may be equipped with computing devices, networks, sensors, and controllers to acquire and/or process data about the vehicle's environment and to operate the vehicle based on that data. Vehicle sensors can provide data about the route to be traveled and about objects to be avoided in the vehicle's environment. While a vehicle is operating on a roadway, its operation may depend on acquiring accurate and timely data about objects in its environment.

Summary of the Invention

A computing device in a traffic infrastructure system can be programmed to acquire data about the environment external to a vehicle and to use the data to determine a vehicle path upon which to operate the vehicle in an autonomous or semi-autonomous mode. The vehicle can operate on a roadway based on the vehicle path by determining commands to direct the vehicle's powertrain, braking, and steering components to operate the vehicle so as to travel along the path. Data about the external environment can include the locations of one or more objects, such as vehicles and pedestrians, in the environment around the vehicle, and can be used by a computing device in the vehicle to operate the vehicle.

A computing device in a vehicle can be programmed to detect objects and regions based on image data acquired by sensors included in the vehicle. The computing device can include a neural network trained to detect objects and regions in the image data. Detecting objects and regions in the context of this document means determining the label, location, and size of objects and regions in image data. Object and region labels typically include a substantially unique identifier of the object or region, such as a text string identifying the object or region, where an object or region is a physical item that occupies three dimensions, for example a road, a vehicle, a pedestrian, a building, or foliage. Locating an object or region in an image can include determining the pixel locations of the image that include the object. A neural network is typically implemented as a computer software program that can be trained to detect objects and regions in image data using a training dataset that includes images with examples of objects and regions along with corresponding ground truth identifying the objects and regions. Ground truth is data about an object obtained from a source independent of the neural network. Ground truth data is data determined or deemed to correspond to (i.e., represent) an actual real-world condition or state. For example, ground truth about an object can be obtained by having a human observer view an image and determine the object's label, location, and size.

One technique for detecting objects and regions in image data is to train a neural network to generate segmentation maps. A segmentation map is an image in which objects in the input image are identified by determining labels, which can be numbers corresponding to the objects in the image, together with their locations and sizes. For example, labeled objects can include roads, vehicles, pedestrians, buildings, and foliage. The location and size of an object in an image can be indicated by replacing the pixels corresponding to the object with a solid color. For example, objects in the input image corresponding to roads can be assigned a first number and replaced with green, regions of the input image corresponding to vehicles can be assigned a second number and replaced with red, regions of the image corresponding to foliage can be assigned a third number and replaced with yellow, and so on. An instance segmentation map is a segmentation map in which multiple instances of a single type of region, such as vehicles, are each assigned a different number and color. A neural network can be trained to determine segmentation maps from input monocular color (RGB) images by training it with a large number (typically >1000) of training images with corresponding ground truth. A monocular image is an image acquired by a single camera, as opposed to a stereo image that includes two or more images acquired by two or more cameras. A neural network can also be trained to process images acquired from sensors including monochrome cameras, infrared cameras, or cameras that acquire a combination of color and infrared data. In this example, the ground truth includes segmented images obtained from a source independent of the neural network. For example, images in a training dataset can be segmented by a human observer using image processing software to assign values to regions in the training images.
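
Purely as an illustration of how a class-index segmentation map can be rendered as solid colors in the manner described above, the following Python/NumPy sketch maps per-pixel class numbers to colors. The class IDs and color choices are assumptions for illustration only, not values taken from this disclosure.

```python
import numpy as np

# Hypothetical class IDs and display colors (assumed for illustration only).
CLASS_COLORS = {
    0: (0, 255, 0),    # road -> green
    1: (255, 0, 0),    # vehicle -> red
    2: (255, 255, 0),  # foliage -> yellow
}

def colorize_segmentation(seg_map: np.ndarray) -> np.ndarray:
    """Convert an (H, W) array of class indices into an (H, W, 3) color image."""
    color_image = np.zeros((*seg_map.shape, 3), dtype=np.uint8)
    for class_id, color in CLASS_COLORS.items():
        color_image[seg_map == class_id] = color
    return color_image

# Example: a tiny 2x3 segmentation map containing road, vehicle, and foliage pixels.
seg = np.array([[0, 0, 1],
                [2, 1, 0]])
print(colorize_segmentation(seg).shape)  # (2, 3, 3)
```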

A depth map is an image in which the pixels of the image are assigned values based on the distance or range from the sensor that acquired the image to the point in real-world three-dimensional (3D) space corresponding to the image pixel. A neural network can be trained to determine a depth map from a monocular RGB image by training it with a large number (typically >1000) of training images and corresponding ground truth. In this example, the ground truth includes depth maps obtained from a source independent of the neural network, for example a lidar sensor or a stereo camera. A lidar sensor outputs range or distance data that can be processed to register the distance data from the lidar sensor to the field of view of a color video sensor. Likewise, image data from a stereo camera, which includes two or more cameras mounted to provide a fixed baseline or distance between the cameras, can be processed to provide distance or range data corresponding to the field of view of a color camera. Ground truth depth maps obtained in this fashion can be paired with corresponding monocular RGB images and used to train a neural network to produce depth maps from monocular RGB images.
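
To make the meaning of a depth map concrete, the sketch below back-projects per-pixel depth values into 3D points using a generic pinhole camera model. This is an illustrative example only; the intrinsic parameters are assumed, and this procedure is not stated in the disclosure.

```python
import numpy as np

def depth_to_points(depth: np.ndarray, fx: float, fy: float, cx: float, cy: float) -> np.ndarray:
    """Back-project an (H, W) depth map (meters) to an (H, W, 3) array of 3D points
    using a pinhole camera model: X = (u - cx) * Z / fx, Y = (v - cy) * Z / fy."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return np.stack([x, y, depth], axis=-1)

# Example with assumed intrinsics and a small synthetic depth map.
depth = np.full((4, 6), 10.0)  # every pixel 10 m from the sensor
points = depth_to_points(depth, fx=500.0, fy=500.0, cx=3.0, cy=2.0)
print(points.shape)  # (4, 6, 3)
```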

A neural network can be trained by presenting it with a large number (typically >1000) of training images that include objects, along with corresponding ground truth. During training, the neural network processes the input images, and the results (referred to herein as output states) are compared to the ground truth. The neural network can process an input image a plurality of times, changing the processing parameters each time the image is processed. The output states of the neural network are compared to the ground truth to determine the set of processing parameters that achieves the correct output state when the input image is presented. Because human judgment is involved, acquiring training datasets and ground truth suitable for training a neural network can be expensive, time-consuming, and unreliable, and is inefficient and challenging in terms of consumption of computing resources.

The techniques discussed herein improve the training and operation of neural networks by generating simulated images that correspond to scenes included in real images acquired with real image sensors viewing real-world scenes. Because the simulated images are generated by photorealistic image rendering software, the identities and locations of objects and regions in 3D space, segmentation data, and 3D distances to points in the image are known. A neural network can be configured as discussed herein to permit the neural network to be trained using simulated images and to transfer the training to real images. In this fashion, a neural network can be trained to operate on real images without the expense, time, and computing resources required to determine ground truth data for real images in a training dataset. The techniques discussed herein can be used to train neural networks to produce output that can be used to operate, for example, a vehicle, a stationary robot, a mobile robot, a drone, or a surveillance system.

Disclosed herein is a method including receiving a monocular image and providing the image to a variational autoencoder neural network (VAE), wherein the VAE has been trained in a twin configuration including a first encoder-decoder pair and a second encoder-decoder pair, the first encoder-decoder pair receiving an unlabeled real image as input and outputting a reconstructed real image, the second encoder-decoder pair receiving a synthetic image as input and outputting a reconstructed synthetic image, and wherein the VAE includes a third decoder and a fourth decoder trained using labeled synthetic images, segmentation ground truth, and depth ground truth, and outputting a segmentation map and a depth map from the VAE based on the input monocular image. Training the VAE in the twin configuration can include the third decoder outputting the segmentation map and the fourth decoder outputting the depth map. The segmentation ground truth can include labels for a plurality of objects in the labeled synthetic images, and the depth ground truth can include distances from a sensor to a plurality of locations in the labeled synthetic images. The segmentation map can include labeled objects including roads, buildings, foliage, vehicles, and pedestrians. The depth map can include distances from the sensor to a plurality of locations. The real images can be acquired by a real-world sensor viewing a real-world scene.

The synthetic images can be generated by photorealistic image rendering software based on data input to the photorealistic image rendering software that describes the scene to be rendered by the photorealistic image rendering software. The segmentation ground truth and the depth ground truth can be generated based on a scene description input to the photorealistic image rendering software that describes the scene to be rendered by the photorealistic image rendering software. The VAE can include a first encoder and a second encoder for the unlabeled real images and the labeled synthetic images, and further, the first encoder and the second encoder can each include layers that share weights with the other of the first encoder and the second encoder, a shared latent space, and respective first and second decoders for the unlabeled real images and the labeled synthetic images. The VAE can be further trained based on determining cycle consistency between the first encoder-decoder and the second encoder-decoder. Training the VAE can be based on determining cycle consistency, including comparing an input real image and a reconstructed real image by determining a Kullback-Leibler divergence loss and a maximum mean discrepancy loss. A device can be operated based on the segmentation map and the depth map. The device can be one of a vehicle, a mobile robot, a stationary robot, a drone, and a surveillance system. A vehicle can be operated by controlling one or more of vehicle propulsion, vehicle braking, and vehicle steering based on determining a vehicle path from the segmentation map and the depth map.

Disclosed is a computer-readable medium storing program instructions for executing some or all of the above method steps. Also disclosed is a computer programmed to execute some or all of the above method steps, the computer including a computing apparatus programmed to receive a monocular image and provide the image to a variational autoencoder neural network (VAE), wherein the VAE has been trained in a twin configuration including a first encoder-decoder pair and a second encoder-decoder pair, the first encoder-decoder pair receiving an unlabeled real image as input and outputting a reconstructed real image, the second encoder-decoder pair receiving a synthetic image as input and outputting a reconstructed synthetic image, and wherein the VAE includes a third decoder and a fourth decoder trained using labeled synthetic images, segmentation ground truth, and depth ground truth, and to output a segmentation map and a depth map from the VAE based on the input monocular image. Training the VAE in the twin configuration can include the third decoder outputting the segmentation map and the fourth decoder outputting the depth map. The segmentation ground truth can include labels for a plurality of objects in the labeled synthetic images, and the depth ground truth can include distances from a sensor to a plurality of locations in the labeled synthetic images. The segmentation map can include labeled objects including roads, buildings, foliage, vehicles, and pedestrians. The depth map can include distances from the sensor to a plurality of locations. The real images can be acquired by a real-world sensor viewing a real-world scene.

The computer can be further configured to generate the synthetic images using photorealistic image rendering software based on data input to the photorealistic image rendering software that describes the scene to be rendered by the photorealistic image rendering software. The segmentation ground truth and the depth ground truth can be generated based on a scene description input to the photorealistic image rendering software that describes the scene to be rendered by the photorealistic image rendering software. The VAE can include a first encoder and a second encoder for the unlabeled real images and the labeled synthetic images, and further, the first encoder and the second encoder can each include layers that share weights with the other of the first encoder and the second encoder, a shared latent space, and respective first and second decoders for the unlabeled real images and the labeled synthetic images. The VAE can be further trained based on determining cycle consistency between the first encoder-decoder and the second encoder-decoder. Training the VAE can be based on determining cycle consistency, including comparing an input real image and a reconstructed real image by determining a Kullback-Leibler divergence loss and a maximum mean discrepancy loss. A device can be operated based on the segmentation map and the depth map. The device can be one of a vehicle, a mobile robot, a stationary robot, a drone, and a surveillance system. A vehicle can be operated by controlling one or more of vehicle propulsion, vehicle braking, and vehicle steering based on determining a vehicle path from the segmentation map and the depth map.

Brief Description of the Drawings

Figure 1 is a diagram of an example vehicle.

Figure 2 is a diagram of an example twin variational autoencoder neural network.

Figure 3 is a diagram of an example twin variational autoencoder neural network configured to be trained based on cycle consistency.

Figure 4 is a diagram of another example twin variational autoencoder neural network configured to be trained based on cycle consistency.

Figure 5 is a diagram of an example variational autoencoder neural network configured to produce segmentation maps and depth maps.

Figure 6 is a diagram of an example real image and a corresponding segmentation map.

Figure 7 is a diagram of an example real image and a corresponding depth map.

Figure 8 is a flowchart of an example process for training and operating a neural network to produce segmentation maps and depth maps.

Detailed Description

Figure 1 is a diagram of a vehicle 110, which can operate in an autonomous ("autonomous" by itself in this disclosure means "fully autonomous") mode, a semi-autonomous mode, and an occupant-driven (also referred to as non-autonomous) mode. A semi-autonomous or fully autonomous mode means an operating mode in which the vehicle can be driven partly or entirely by a computing device that is part of a system having sensors and controllers. The vehicle may be occupied or unoccupied, but in either case the vehicle can be driven partly or entirely without occupant assistance. For purposes of this disclosure, an autonomous mode is defined as a mode in which each of vehicle propulsion (e.g., via a powertrain including an internal combustion engine and/or an electric motor), braking, and steering is controlled by one or more vehicle computers; in a semi-autonomous mode, the vehicle computer controls one or more of vehicle propulsion, braking, and steering. In a non-autonomous mode, none of these is controlled by a computer.

Accordingly, the computing device 115 of one or more vehicles 110 can receive data regarding the operation of the vehicle 110 from sensors 116. The computing device 115 may operate the vehicle 110 in an autonomous mode, a semi-autonomous mode, or a non-autonomous mode.

The computing device (or computer) 115 includes a processor and a memory such as are known. Further, the memory includes one or more forms of computer-readable media and stores instructions executable by the processor to perform various operations, including as disclosed herein. For example, the computing device 115 may include programming to operate one or more of vehicle brakes, propulsion (e.g., controlling acceleration of the vehicle 110 by controlling one or more of an internal combustion engine, an electric motor, a hybrid engine, etc.), steering, climate control, interior and/or exterior lights, etc., as well as to determine whether and when the computing device 115, as opposed to a human operator, is to control such operations.

The computing device 115 may include, or be communicatively coupled to (e.g., via a vehicle communication bus as described further below), more than one computing device, e.g., controllers or the like included in the vehicle 110 for monitoring and/or controlling various vehicle components, such as a powertrain controller 112, a brake controller 113, a steering controller 114, etc. The computing device 115 is generally arranged for communication over a vehicle communication network, e.g., including a bus in the vehicle 110 such as a controller area network (CAN) or the like; the vehicle 110 network can additionally or alternatively include wired or wireless communication mechanisms such as are known, e.g., Ethernet or other communication protocols.

Via the vehicle network, the computing device 115 may transmit messages to and/or receive messages from various devices in the vehicle, e.g., controllers, actuators, sensors (including the sensors 116), etc. Alternatively or additionally, in cases where the computing device 115 actually comprises multiple devices, the vehicle communication network may be used for communication between the devices represented as the computing device 115 in this disclosure. Further, as mentioned below, various controllers or sensing elements, such as the sensors 116, may provide data to the computing device 115 via the vehicle communication network.

In addition, the computing device 115 may be configured to communicate with a remote server computer (e.g., a cloud server) via a network through a vehicle-to-infrastructure (V2I) interface 111, which, as described below, includes hardware, firmware, and software that permit the computing device 115 to communicate with the remote server computer via a network such as wireless Internet or a cellular network. Accordingly, the V2I interface 111 may include processors, memory, transceivers, etc., configured to utilize various wired and/or wireless networking technologies, e.g., cellular and wired and/or wireless packet networks. The computing device 115 may be configured to communicate with other vehicles 110 through the V2I interface 111 using a vehicle-to-vehicle (V2V) network, e.g., according to dedicated short range communications (DSRC) and/or the like, formed on an ad hoc basis among nearby vehicles 110 or formed through infrastructure-based networks. The computing device 115 also includes non-volatile memory such as is known. The computing device 115 can log data by storing the data in non-volatile memory for later retrieval and transmission via the vehicle communication network and the vehicle-to-infrastructure (V2I) interface 111 to a server computer or a user mobile device.

As already mentioned, programming for operating one or more vehicle 110 components (e.g., braking, steering, propulsion, etc.) without intervention of a human operator is generally included in instructions stored in the memory and executable by the processor of the computing device 115. Using data received in the computing device 115, e.g., sensor data from the sensors 116, data from a server computer, etc., the computing device 115 may make various determinations and/or control various vehicle 110 components and/or operations to operate the vehicle 110 without a driver. For example, the computing device 115 may include programming to regulate vehicle 110 operational behaviors (i.e., physical manifestations of vehicle 110 operation) such as speed, acceleration, deceleration, steering, etc., as well as tactical behaviors (i.e., control of operational behaviors typically in a manner intended to achieve safe and efficient traversal of a route) such as a distance between vehicles and/or an amount of time between vehicles, lane changes, a minimum gap between vehicles, a left-turn-across-path minimum, a time of arrival at a particular location, and a minimum time from arrival at an intersection (without a signal light) to crossing the intersection.

The one or more controllers 112, 113, 114 for the vehicle 110 may include conventional electronic control units (ECUs) or the like, including, as non-limiting examples, one or more powertrain controllers 112, one or more brake controllers 113, and one or more steering controllers 114. Each of the controllers 112, 113, 114 may include respective processors and memories and one or more actuators. The controllers 112, 113, 114 may be programmed and connected to a vehicle 110 communication bus, such as a controller area network (CAN) bus or a local interconnect network (LIN) bus, to receive instructions from the computing device 115 and control the actuators based on the instructions.

The sensors 116 may include a variety of devices known to provide data via the vehicle communication bus. For example, a radar fixed to a front bumper (not shown) of the vehicle 110 may provide a distance from the vehicle 110 to a next vehicle in front of the vehicle 110, or a global positioning system (GPS) sensor disposed in the vehicle 110 may provide geographical coordinates of the vehicle 110. For example, the distance(s) provided by the radar and/or other sensors 116 and/or the geographical coordinates provided by the GPS sensor may be used by the computing device 115 to operate the vehicle 110 autonomously or semi-autonomously.

The vehicle 110 is generally a land-based vehicle 110 (e.g., a passenger car, light truck, etc.) capable of autonomous and/or semi-autonomous operation and having three or more wheels. The vehicle 110 includes one or more sensors 116, the V2I interface 111, the computing device 115, and one or more controllers 112, 113, 114. The sensors 116 may collect data related to the vehicle 110 and the environment in which the vehicle 110 is operating. By way of example and not limitation, the sensors 116 may include, e.g., altimeters, cameras, lidar, radar, ultrasonic sensors, infrared sensors, pressure sensors, accelerometers, gyroscopes, temperature sensors, Hall sensors, optical sensors, voltage sensors, current sensors, mechanical sensors such as switches, etc. The sensors 116 may be used to sense the environment in which the vehicle 110 is operating; for example, the sensors 116 can detect phenomena such as weather conditions (precipitation, external ambient temperature, etc.), the grade of a road, the location of a road (e.g., using road edges, lane markings, etc.), or the locations of target objects such as neighboring vehicles 110. The sensors 116 may further be used to collect data, including dynamic vehicle 110 data related to operations of the vehicle 110 such as velocity, yaw rate, steering angle, engine speed, brake pressure, oil pressure, the power level applied to the controllers 112, 113, 114 in the vehicle 110, connectivity between components, and accurate and timely performance of components of the vehicle 110.

Figure 2 is a diagram of a twin variational autoencoder (VAE) 200. The twin VAE 200 includes two encoders (RGBR, RGBS) 206, 208 and two decoders (RDEC, SDEC) 214, 216 joined by a shared latent space 212. A VAE is a neural network that can learn to encode input data, typically to reduce the dimensionality or size of the input data. A VAE operates by encoding the input data into a latent space (shared latent space 212) using encoding layers (encoders 206, 208). The latent space includes data corresponding to the encoded input data. The encoded input data typically retains the essential characteristics of the input data while discarding noisy or non-essential elements of the data. The VAE also includes decoding layers (decoders 214, 216) that reconstruct the encoded data in the latent space into a reconstructed real image 218 and a reconstructed simulated image 220 corresponding to the input real image 202 and the input simulated image 204, respectively. A VAE can be trained to encode and decode the data by comparing the output data to the input data. A VAE is typically trained in an unsupervised fashion, in which the VAE attempts to encode and decode the input data a plurality of times while varying the encoding and decoding parameters. The VAE can determine a loss function by comparing the output to the input, thereby retaining the parameters that cause the output data to match the input data. Loss functions are discussed below.
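
The network code itself is not published in this disclosure; the following PyTorch sketch is only a minimal illustration of the structure described above — two convolutional encoders mapping images into one shared latent space and two decoders reconstructing images from it. All layer sizes, the latent dimension, and the module names are assumptions, not the actual embodiment.

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Maps an RGB image to the mean and log-variance of a latent distribution."""
    def __init__(self, latent_dim: int = 1000):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.ReLU(),
        )
        self.fc_mu = nn.Linear(128 * 16 * 16, latent_dim)
        self.fc_logvar = nn.Linear(128 * 16 * 16, latent_dim)

    def forward(self, x):
        h = self.conv(x).flatten(1)
        return self.fc_mu(h), self.fc_logvar(h)

class Decoder(nn.Module):
    """Reconstructs an RGB image from a latent vector."""
    def __init__(self, latent_dim: int = 1000):
        super().__init__()
        self.fc = nn.Linear(latent_dim, 128 * 16 * 16)
        self.deconv = nn.Sequential(
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1), nn.Sigmoid(),
        )

    def forward(self, z):
        h = self.fc(z).view(-1, 128, 16, 16)
        return self.deconv(h)

def reparameterize(mu, logvar):
    """Sample a latent vector from N(mu, sigma^2) using the reparameterization trick."""
    std = torch.exp(0.5 * logvar)
    return mu + std * torch.randn_like(std)

# Real and simulated branches share a single latent space by construction: both
# encoders output latent vectors of the same dimension, and either decoder can
# be applied to either branch's latent vector.
real_enc, sim_enc = Encoder(), Encoder()
real_dec, sim_dec = Decoder(), Decoder()

real_image = torch.rand(1, 3, 128, 128)   # dummy 128x128 input for illustration
mu, logvar = real_enc(real_image)
z = reparameterize(mu, logvar)
reconstructed_real = real_dec(z)          # same-domain reconstruction
reconstructed_sim = sim_dec(z)            # cross-domain decoding via the shared latent space
print(reconstructed_real.shape, reconstructed_sim.shape)
```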

In this example, the real encoder 206 and the simulated encoder 208 input the real image 202 and the simulated image 204, respectively, and map the input real image 202 and simulated image 204 to latent variables included in the shared latent space 212. A real image 202 is an image acquired by a real-world sensor, such as a video camera, viewing a real-world scene. A simulated image 204 is an image generated by photorealistic image rendering software, such as Unreal Engine, produced by Epic Games of Cary, North Carolina 27518. Photorealistic image rendering software is a software program that generates images that appear to an observer as if they had been acquired with a real-world camera viewing a real-world scene. Photorealistic image rendering software generates images based on a scene description file, which can be a text file that can include mathematical descriptions of the 3D shapes to be included in the rendered image. For example, the scene description can describe 3D shapes in terms of intersections of rectangular prisms, cylinders, etc. The scene description also includes the colors and textures of surfaces in the scene. Rendering the scene includes projecting simulated light sources onto the 3D shapes and determining how the shapes will reflect the light onto a simulated camera sensor. Photorealistic image rendering software can produce images with sufficient detail that, to a human observer, they appear almost as if they had been acquired with a real-world camera. For example, photorealistic rendering software can be used to create realistic images for video game software.

By forcing the real encoder 206 and the simulated encoder 208 to use a shared latent space 212, the twin VAE 200 can describe both the encoded real image 202 and the encoded simulated image 204 with a single set of latent variables. The latent space is the set of variables output by the encoders 206, 208 in response to input data such as a real image 202 or a simulated image 204. The shared latent space 212 includes latent variables corresponding to encoded versions of the input real image 202 or simulated image 204 data, where the number of latent variables is selected to be smaller than the number of pixels used to represent the real image 202 or simulated image 204 data. For example, an input real image 202 or simulated image 204 can include more than three million pixels, while the shared latent space 212 can represent the input real image 202 or simulated image 204 with one thousand or fewer latent variables. That the shared latent space 212 correctly corresponds to the input real image 202 or simulated image 204 is demonstrated by correctly reconstructing the input real image 202 or simulated image 204 with the real decoder 214 and the simulated decoder 216, respectively, which process the latent variables and output a reconstructed real image 218 and a reconstructed simulated image 220. Correct reconstruction of the input images is verified by comparing the input real image 202 and simulated image 204 with the corresponding reconstructed real image 218 and reconstructed simulated image 220, respectively.

The twin VAE 200 is trained in two separate stages. In a first stage, referred to as sim2real training, the twin VAE 200 is trained to input a simulated image 204 and output a reconstructed real image 218. In a second stage, referred to as sim2depth and sim2seg, the twin VAE 200 is trained to input a simulated image 204 and output a depth map and a segmentation map. In sim2real training, the simulated encoder 208 and decoder 216 pair outputs a reconstructed simulated image 220, and the real encoder 206 and decoder 214 pair outputs a reconstructed real image 218. The real image 202 is compared with the reconstructed real image 218, and the simulated image 204 is compared with the reconstructed simulated image 220, by computing a loss function based on a mean squared error (MSE) calculation. The MSE calculation determines the per-pixel mean squared error between the real image 202 and the reconstructed image 218 and between the simulated image 204 and the reconstructed image 220. During training, the programming parameters governing the operation of the encoders 206, 208 and the decoders 214, 216 are selected so as to minimize the MSE loss function.
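
As a rough, hedged sketch of what one such reconstruction-stage training step with a per-pixel MSE loss might look like, the example below reuses the illustrative Encoder/Decoder modules from the earlier sketch. The optimizer choice, loss weighting, and sampling details are assumptions, not taken from this disclosure.

```python
import torch
import torch.nn.functional as F

def mse_training_step(real_enc, real_dec, sim_enc, sim_dec, optimizer,
                      real_batch, sim_batch):
    """One reconstruction-training step: each encoder-decoder pair reconstructs
    its own domain and is penalized with a per-pixel mean squared error."""
    optimizer.zero_grad()

    # Real branch: real image -> shared latent space -> reconstructed real image.
    mu_r, logvar_r = real_enc(real_batch)
    z_r = mu_r + torch.exp(0.5 * logvar_r) * torch.randn_like(mu_r)
    recon_real = real_dec(z_r)

    # Simulated branch: simulated image -> shared latent space -> reconstructed simulated image.
    mu_s, logvar_s = sim_enc(sim_batch)
    z_s = mu_s + torch.exp(0.5 * logvar_s) * torch.randn_like(mu_s)
    recon_sim = sim_dec(z_s)

    # Per-pixel MSE between the inputs and the reconstructions, as described above.
    loss = F.mse_loss(recon_real, real_batch) + F.mse_loss(recon_sim, sim_batch)
    loss.backward()
    optimizer.step()
    return loss.item()

# Example usage (assumes the Encoder/Decoder sketch above):
# optimizer = torch.optim.Adam(
#     list(real_enc.parameters()) + list(real_dec.parameters()) +
#     list(sim_enc.parameters()) + list(sim_dec.parameters()), lr=1e-4)
# mse_training_step(real_enc, real_dec, sim_enc, sim_dec, optimizer,
#                   torch.rand(4, 3, 128, 128), torch.rand(4, 3, 128, 128))
```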

Because the latent variables are included in the shared latent space 212, an image of one type, for example a real image 202, can be encoded by the real encoder 206 into latent variables in the shared latent space 212 and then decoded by the simulated image decoder 216 into a reconstructed simulated image 220. Likewise, a simulated image 204 can be encoded into latent variables in the shared latent space 212 and decoded by the real decoder 214 into a reconstructed real image 218. This is aided by sharing the last three layers of each of the real encoder 206 and the simulated encoder 208, as indicated by the double-headed arrow 210. Sharing the last three layers means that, at training time, the parameters governing the encoding are forced to be identical for each of the last three layers of the encoders 206, 208, respectively. The real decoder 214 and the simulated decoder 216 decode the shared latent variables in the shared latent space 212 into a reconstructed real image 218 and a reconstructed simulated image 220. The twin VAE 200 is trained to encode and decode the real images 202 and simulated images 204 into reconstructed real images 218 and simulated images 220 by varying the parameters that govern encoding and decoding of the images and comparing the reconstructed real images 218 and simulated images 220 with the input real images 202 and simulated images 204, respectively. By limiting the number of latent variables used to represent the encoded images in the shared latent space 212, a compact encoding that encodes both real images 202 and simulated images 204 can be achieved.
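
One common way to realize the "shared last layers" coupling described above is to let both encoders hold references to the same trailing modules, so that their parameters are literally identical during training. The split below into a domain-specific front end and a shared tail is an assumed illustration in PyTorch, not the exact layering of this disclosure.

```python
import torch
import torch.nn as nn

# Domain-specific front ends (different weights for real and simulated images).
def make_private_stem():
    return nn.Sequential(
        nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU(),
        nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),
    )

# Trailing layers shared by both encoders: a single module object is reused,
# so both branches read and update the same parameters during training.
shared_tail = nn.Sequential(
    nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.ReLU(),
    nn.Conv2d(128, 128, 3, padding=1), nn.ReLU(),
    nn.Conv2d(128, 128, 3, padding=1), nn.ReLU(),
)

real_stem, sim_stem = make_private_stem(), make_private_stem()

def encode_real(x):
    return shared_tail(real_stem(x))

def encode_sim(x):
    return shared_tail(sim_stem(x))

# Features from a real image land in the same space as features from a
# simulated image, so either decoder can consume them (cross-domain decoding).
real_image = torch.rand(1, 3, 128, 128)
sim_image = torch.rand(1, 3, 128, 128)
print(encode_real(real_image).shape, encode_sim(sim_image).shape)  # identical shapes
```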

Once the twin VAE 200 has been trained on real images 202 and simulated images 204, the decoders 214, 216 can be disconnected from the shared latent space 212, and a segmentation decoder (SGDEC) 222 and a depth decoder (DDEC) 224 can be connected to the shared latent space 212 and trained using labeled simulated data 204, which includes ground truth based on the scene description data used to render the simulated data 204. The segmentation decoder 222 is discussed in relation to Figure 5, and the depth decoder is discussed in relation to Figure 6.
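
A minimal sketch of this decoder-swap step follows: the image decoders are set aside and new segmentation and depth heads are trained against labeled simulated data on the same shared latent features. The head architectures, loss choices, number of classes, and whether the encoders are frozen during this stage are all assumptions made only for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

NUM_CLASSES = 5  # e.g., road, building, foliage, vehicle, pedestrian (assumed)

class SegmentationHead(nn.Module):
    """Decodes shared latent features into per-pixel class logits."""
    def __init__(self, in_channels=128, num_classes=NUM_CLASSES):
        super().__init__()
        self.net = nn.Sequential(
            nn.ConvTranspose2d(in_channels, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, num_classes, 4, stride=2, padding=1),
        )
    def forward(self, z):
        return self.net(z)

class DepthHead(nn.Module):
    """Decodes shared latent features into a one-channel depth map."""
    def __init__(self, in_channels=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.ConvTranspose2d(in_channels, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 1, 4, stride=2, padding=1), nn.Softplus(),
        )
    def forward(self, z):
        return self.net(z)

# The heads are trained on labeled simulated data only; ground truth comes from
# the scene descriptions used to render the simulated images.
seg_head, depth_head = SegmentationHead(), DepthHead()
features = torch.rand(2, 128, 16, 16)                   # stand-in for shared latent features
seg_gt = torch.randint(0, NUM_CLASSES, (2, 128, 128))   # segmentation ground truth (class IDs)
depth_gt = torch.rand(2, 1, 128, 128) * 100.0           # depth ground truth (meters)

loss = F.cross_entropy(seg_head(features), seg_gt) + F.l1_loss(depth_head(features), depth_gt)
loss.backward()
```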

The encoder 206, 208 layers and the decoder 214, 216 layers are configured as packing and unpacking layers to improve the generation of the latent variables included in the shared latent space 212. Typically, real image encoders 206 and simulated image encoders 208, and real decoders 214 and simulated decoders 216, are configured to increase and decrease convolutional striding and to pool and un-pool data, respectively. Convolutional striding is an encoding technique used to reduce resolution, and thereby perform data reduction on the input data, by skipping pixels in the x and y dimensions. For example, a convolution can be performed on every second column or row of pixels in an image. Convolutional striding is combined with pooling, in which a neighborhood of pixels is treated as a single pixel for output to the next layer. A typical operation is max pooling, in which the maximum value included in a pixel neighborhood is used to represent the entire neighborhood for output, thereby reducing, for example, a 2x2 pixel neighborhood to a single pixel. The process can be reversed for decoding, for example, where the output of a convolutional layer can be replicated to increase resolution. Following pixel replication, the output can be filtered with a smoothing filter, for example, to invert the max pooling operation and at least partially restore the original data.

Packing and unpacking can improve the generation of latent variables and the recovery of input data from latent variables by replacing convolutional striding and pooling with 3D convolutions that reduce spatial resolution while increasing depth resolution, thereby preserving the input data. A packing layer first performs a space-to-depth transformation that encodes spatial data as bit-depth data. Packing then performs a 3D convolution that reduces the spatial resolution while preserving the bit-depth data. Packing then performs a reshaping operation that further encodes the bit-depth data, followed by a 2D convolution that filters the output latent variables. The decoding layers invert this sequence to restore the latent variables to full resolution. Packing and unpacking are described in "3D Packing for Self-Supervised Monocular Depth Estimation" by Vitor Guizilini, Rares Ambrus, Sudeep Pillai, Allan Raventos, and Adrien Gaidon, Toyota Research Institute, arXiv.org, 1905.02693v4, March 28, 2020.
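
The sketch below illustrates a packing block in the spirit of the sequence just described (space-to-depth, a 3D convolution, a reshape, then a 2D convolution). The kernel sizes, the synthetic depth size, and the channel counts are assumptions for illustration and are not taken from the cited paper or this disclosure.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PackingLayer(nn.Module):
    """Illustrative packing block: space-to-depth, a 3D convolution over the
    packed channels, a reshape, then a 2D convolution (sizes are assumed)."""
    def __init__(self, in_channels: int, out_channels: int, r: int = 2, d: int = 8):
        super().__init__()
        self.r = r
        # 3D convolution over a synthetic depth axis of size d.
        self.conv3d = nn.Conv3d(1, d, kernel_size=3, padding=1)
        # 2D convolution that mixes the packed, 3D-filtered features.
        self.conv2d = nn.Conv2d(in_channels * r * r * d, out_channels, kernel_size=3, padding=1)

    def forward(self, x):
        # Space-to-depth: halve the spatial resolution, multiply channels by r*r.
        x = F.pixel_unshuffle(x, self.r)            # (b, c*r*r, h/r, w/r)
        # Add a unit depth axis and filter with a 3D convolution.
        x = self.conv3d(x.unsqueeze(1))             # (b, d, c*r*r, h/r, w/r)
        # Fold the 3D output back into channels and filter with a 2D convolution.
        x = x.flatten(1, 2)                         # (b, d*c*r*r, h/r, w/r)
        return self.conv2d(x)

# Example: a 3-channel 64x64 input packed down to a 64-channel 32x32 output.
layer = PackingLayer(in_channels=3, out_channels=64)
print(layer(torch.rand(1, 3, 64, 64)).shape)  # torch.Size([1, 64, 32, 32])
```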

Figure 3 is a diagram of a twin VAE 300 configured to determine cycle consistency. The twin VAE 300 includes a first configuration 302 and a second configuration 304. The first configuration 302 inputs a real image 202 and outputs a reconstructed simulated image 220. The twin VAE 300 is then arranged in the second configuration 304, in which the reconstructed simulated image 220 output from the first configuration 302 is input to the simulated encoder 208 to be decoded by the real decoder 214 into a reconstructed real image 218. Cycle consistency is a technique for training the twin VAE 300 to produce reconstructed real images 218 from simulated images 220 based on unpaired data. Paired data is image data in which a simulated image 204 is generated to match each real image 202, i.e., in which the simulated scene matches the objects, their appearance, and their arrangement in the real image. Producing paired training data requires a user to analyze the real image 202 data, estimating the scene description needed to produce a simulated copy of each real image 202 by determining a scene description file that includes all of the objects in the real image 202 at the same locations as the objects in the real image 202. The real-world locations of objects appearing in a real image 202 can be determined by photogrammetry. Photogrammetry is a technique for determining the real-world size and location of objects using data about the real-world camera location and orientation. For example, a road can be assumed to define a plane upon which objects such as vehicles rest. Data about the location and orientation of the camera with respect to the road and the magnification of the camera lens can be used to translate pixel locations in the image into real-world locations.

The scene description file must contain instructions for rendering a simulated image 204 in a manner that produces a realistic copy of the real image 202, including the appearance and location of every object appearing in the real image 202. The photorealistic rendering software inputs the scene description file, which includes the real-world locations of the objects, and renders a 2D image by simulating a camera and lens and by tracing rays of light reflected or emitted from the objects through the simulated lens onto a simulated image plane in the simulated camera. Producing paired image data is expensive and time-consuming, and requires considerable manual effort to determine a scene description file that includes the real-world 3D location of every object in a real-world image. This task could in theory be automated; however, producing and executing the software required to analyze real images 202 and produce paired simulated images would require considerable human programming effort and considerable computer resources. Unpaired data is image data in which the real images 202 and the simulated images 204 do not match, i.e., in which the scene description files used to generate the simulated images 204 were not generated from the real images 202. Producing a training dataset that includes unpaired simulated images 204 requires a fraction of the human and computer resources required to produce paired image data. Training the twin VAE 300 using cycle consistency as described herein permits the twin VAE 300 to be trained using unpaired data, which reduces the time, expense, human effort, and computing resources required to produce the training dataset.

The twin VAE 200 is first trained as discussed in relation to Figure 2 to train the real encoder 206 and the real decoder 214 to input a real image 202 and output a reconstructed real image 218. The twin VAE 200 is also trained as discussed in relation to Figure 2 to train the simulated encoder 208 and the simulated decoder 216 to input a simulated image 204 and output a reconstructed simulated image 220. Following this training, the twin VAE 200 is configured to form the twin VAE 300, first configuration 302 and second configuration 304. The twin VAE 300 first configuration 302 encodes a real image 202 using the real encoder 206 to form latent variables in the shared latent space 212. Because the shared latent space 212 is shared between the real and simulated datasets, the latent variables in the shared latent space 212 can be output to the simulated decoder 216 to produce a reconstructed simulated image 220 based on the real image 202 input. The reconstructed simulated image 220 is then input to the twin VAE 300 second configuration 304 and encoded using the simulated encoder 208 to produce latent variables in the shared latent space 212. Because the latent variables are included in the shared latent space 212, the latent variables can be output to the real decoder 214 to be decoded into a reconstructed real image 218.
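
A condensed sketch of the real-to-simulated-to-real cycle just described is given below, again reusing the illustrative encoder/decoder modules from the earlier sketches; the distribution-based losses discussed next would be applied between the input real image and the cycled output.

```python
import torch

def cycle_real_to_sim_to_real(real_enc, sim_enc, real_dec, sim_dec, real_image):
    """First configuration 302: real image -> shared latent space -> simulated decoder.
    Second configuration 304: reconstructed simulated image -> shared latent space -> real decoder."""
    # First configuration: encode the real image and decode it as a simulated image.
    mu, logvar = real_enc(real_image)
    z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
    recon_sim = sim_dec(z)

    # Second configuration: re-encode the simulated output and decode it as a real image.
    mu2, logvar2 = sim_enc(recon_sim)
    z2 = mu2 + torch.exp(0.5 * logvar2) * torch.randn_like(mu2)
    cycled_real = real_dec(z2)
    return cycled_real
```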

Cycle consistency works by comparing the input real image 202 with the reconstructed real image 218 to determine the consistency between real encoding and decoding and simulated encoding and decoding. To provide a more accurate comparison between input and output, and to compensate for differences in image encoding and decoding between real and simulated images, a Kullback-Leibler (KL) divergence and a maximum mean discrepancy (MMD) loss are computed rather than MSE. Because encoding and decoding images using the real decoders and encoders and the simulated decoders and encoders can introduce visual artifacts into the compared images, a simple MSE loss function cannot be used successfully to minimize the loss function. For example, the overall intensity or brightness of an image can be changed by encoding and decoding the input image twice, as required to determine cycle consistency. While the overall brightness would not affect segmentation or depth processing by the neural network, it would affect an MSE calculation. KL divergence and MMD losses are measures based on the probability distributions of pixel values, rather than absolute measures like MSE, and are therefore less affected by the artifacts introduced by repeated encoding and decoding.

KL divergence measures the difference between multivariate probability distributions and does not depend on the distributions having the same mean. For example, the probability distributions of pixel values can be compared between the input real image 202 and the output reconstructed real image 218 following repeated encoding and decoding. Training the twin VAE 300 can be based on minimizing a loss function based on the difference between the distributions rather than on pixel-by-pixel differences. The KL divergence DKL is based on the expected value of the logarithmic difference between two probability distributions P and Q, as described by the following equation:
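
The equation referenced here is not reproduced in this text; the standard definition of the KL divergence between discrete distributions P and Q, consistent with the description above, is:

```latex
D_{KL}(P \parallel Q) \;=\; \mathbb{E}_{x \sim P}\!\left[\log P(x) - \log Q(x)\right]
\;=\; \sum_{x} P(x)\,\log\frac{P(x)}{Q(x)} \tag{1}
```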

The twin VAE 300 first configuration 302 and second configuration 304 can also be trained to maximize cycle consistency by minimizing a loss function based on the MMD loss. The MMD loss is calculated by determining the squared mean distance M_k between two distributions P and Q according to the following equation:
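
The equation itself is likewise not reproduced in this text; a standard kernel form of the squared MMD, consistent with the variables defined in the following paragraph, is:

```latex
M_k(P, Q) \;=\; \lVert \mu_P - \mu_Q \rVert^2
\;=\; \mathbb{E}_{P}\!\left[k(x, x')\right] + \mathbb{E}_{Q}\!\left[k(y, y')\right]
- 2\,\mathbb{E}_{P,Q}\!\left[k(x, y)\right] \tag{2}
```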

where μ_P and μ_Q are the means of the distributions, E_P, E_Q, and E_{P,Q} are the expected values and the joint expected value of the distributions P and Q, respectively, and k is a kernel function, in this example a Gaussian kernel. The mean squared distance M_k is zero if and only if P = Q. Calculating the MMD based on equation (2) yields a loss function that measures the distance between distributions of pixel values, assuming the pixel values follow Gaussian distributions. A loss function calculated based on MMD can determine whether images are similar despite being based on different objects in different configurations, and can therefore be used to compare the input real images 202 and simulated images 204 with the reconstructed real images 218 and simulated images 220 following the repeated encoding and decoding performed by the twin VAE 300 first configuration 302 and second configuration 304, as described below in relation to Figures 4 and 5.

FIG. 4 is a diagram of the twin VAE 200 in a first configuration 402 and a second configuration 404 that mirror the first configuration 302 and second configuration 304 of FIG. 3. In a similar manner to that described above with respect to FIG. 3, cycle consistency can be used to train the twin VAE 200 first configuration 402 and second configuration 404 to input a simulated image 204 and output a reconstructed simulated image 220. In the twin VAE 200, the first configuration 402 inputs the simulated image 204, encodes it using the simulated encoder 208, and produces latent variables included in the shared latent space 212. The latent variables are then output to the real image decoder 214 to be decoded into a reconstructed real image 218. The reconstructed real image 218 is input to the twin VAE 200 second configuration 404 to be encoded by the real encoder 206 into latent variables included in the shared latent space 212. The latent variables are then output to the simulated image decoder 216 to be output as a reconstructed simulated image 220. The reconstructed simulated image 220 is compared to the input simulated image 204 using the KL divergence and MMD losses to train the twin VAE as described above with respect to FIG. 3. Training the twin VAE 200 as described with respect to FIGS. 3 and 4 allows the twin VAE 200 to input a real image 202 or a simulated image 204 and produce a reconstructed real image 218 or a reconstructed simulated image 220.
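
The following is a sketch, under stated assumptions, of one simulated-to-real-to-simulated cycle as described for FIG. 4; the encoder and decoder modules are hypothetical placeholders for the trained networks, and the loss terms are only indicated in comments.

```python
import torch
import torch.nn as nn

def cycle_pass(sim_image: torch.Tensor,
               sim_encoder: nn.Module, real_decoder: nn.Module,
               real_encoder: nn.Module, sim_decoder: nn.Module):
    """One sim -> real -> sim cycle through the twin VAE configurations 402 and 404."""
    z_sim = sim_encoder(sim_image)       # latent variables in the shared latent space
    fake_real = real_decoder(z_sim)      # reconstructed real image 218
    z_real = real_encoder(fake_real)     # re-encode with the real image encoder 206
    recon_sim = sim_decoder(z_real)      # reconstructed simulated image 220
    return fake_real, recon_sim

# Training step outline (loss terms are distribution-based, per the text above):
#   loss = kl_term(sim_image, recon_sim) + mmd_term(sim_image, recon_sim)
#   loss.backward(); optimizer.step()
```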

Once this training has been completed, the twin VAE 200 can be configured as described with respect to FIG. 5 and trained, using simulated images 204 that include ground truth data, to produce segmentation maps and depth maps. Because the twin VAE 200 has also been trained on both real images 202 and simulated images 204, as described with respect to FIGS. 3 and 4, the twin VAE 200 can input a real image 202 and produce a segmentation map and a depth map even though it has not been trained on real images 202 with ground truth. In this way, the twin VAE 200 can be trained without having to produce expensive and time-consuming ground truth data for real images 202, or expensive and time-consuming paired real images 202 and simulated images 204.

Training the twin VAE 200 in this way also reduces problems caused by training a neural network on simulated images alone. Training a neural network only with simulated images can lead to difficulties when real images are presented to the neural network in operation: because of subtle differences in appearance between real and simulated images, a neural network trained on simulated images may have difficulty processing real images well enough to correctly determine segmentation maps and depth maps. Training the twin VAE 200 with the cycle consistency approach discussed herein improves the performance of the neural network on real images even though it is trained with simulated images.

FIG. 5 is a diagram of the twin VAE 500 configured to be trained to produce segmentation maps and depth maps. Because the real encoder 206 and the simulated encoder 208 have been trained as discussed above with respect to FIGS. 2, 3, and 4, the latent variables included in the shared latent space 212 will be consistent whether a real image 202 or a simulated image 204 is input to the twin VAE 500. This allows the twin VAE 500 to be trained using simulated images 204 that include ground truth data for both segmentation and depth. Because the scene description data used to generate a simulated image 204 includes a detailed 3D description of all surfaces included in the simulated image, accurate and highly detailed segmentation and depth ground truth can be obtained without the laborious, time-consuming, expensive, and computer-resource-intensive process of generating ground truth data. The twin VAE 500 can be trained to encode a simulated image 204 into latent variables in the shared latent space 212 using the simulated encoder 208, and then to decode the latent variables using the segmentation decoder 222 and the depth decoder 224 to produce a segmentation map (SGOUT) 226 and a depth map (DOUT) 228, respectively. To train the segmentation decoder 222 and depth decoder 224, the output segmentation map 226 and output depth map 228 can be compared to the segmentation ground truth and depth ground truth corresponding to the input simulated image 204 using an MSE loss function as discussed with respect to FIG. 2, to select the decoding parameters that correspond to the most accurate results.
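
A minimal sketch of one such decoder-training step follows; the MSE loss matches what is described above, while the module and argument names are illustrative assumptions rather than identifiers from this disclosure.

```python
import torch
import torch.nn as nn

mse = nn.MSELoss()

def decoder_training_step(sim_image, seg_gt, depth_gt,
                          sim_encoder, seg_decoder, depth_decoder, optimizer):
    """One training step for the segmentation decoder 222 and depth decoder 224."""
    optimizer.zero_grad()
    z = sim_encoder(sim_image)       # latent variables in the shared latent space 212
    seg_out = seg_decoder(z)         # segmentation map (SGOUT) 226
    depth_out = depth_decoder(z)     # depth map (DOUT) 228
    loss = mse(seg_out, seg_gt) + mse(depth_out, depth_gt)
    loss.backward()
    optimizer.step()
    return loss.item()
```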

Because the twin VAE 500 has been trained to produce latent variables that are consistent between real images 202 and simulated images 204, a real image 202 can be input to the real image encoder 206 to form latent variables in the shared latent space 212. The latent variables can then be output to the segmentation decoder 222 to form a segmentation map 226, and to the depth decoder 224 to form a depth map 228. Because the segmentation decoder 222 and depth decoder 224 were trained using synthetic images 204 that form latent variables in the shared latent space 212, latent variables formed in the shared latent space 212 from an input real image 202 can be processed by the segmentation decoder 222 and depth decoder 224 as if they were formed from synthetic images 204, without requiring ground truth data corresponding to real images 202 to train the VAE 200. The techniques discussed herein improve the training of the twin VAE 500 to determine segmentation maps and depth maps based on input real images by using the shared latent space 212, which allows the twin VAE 500 to be trained with a small number (typically <100) of unlabeled real images 202 and a large number (typically >1000) of labeled simulated images 204, reducing the expense, time, and labor required to train the twin VAE 500 to generate segmentation maps 226 and depth maps 228. As discussed above, a simulated image 204 is generated based on a scene description that includes the real-world positions and sizes of the objects appearing in the simulated image 204.
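
A short sketch of inference on a real monocular image once training is complete is shown below; the module names follow the reference numerals used above, but the code itself is only illustrative.

```python
import torch

@torch.no_grad()
def infer_seg_and_depth(real_image, real_encoder, seg_decoder, depth_decoder):
    """Run a real image 202 through the trained twin VAE to obtain maps 226 and 228."""
    z = real_encoder(real_image)     # latent variables in the shared latent space 212
    seg_map = seg_decoder(z)         # segmentation map 226
    depth_map = depth_decoder(z)     # depth map 228
    return seg_map, depth_map
```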

Because the real-world positions and sizes of the objects in a simulated image 204 are available in the scene description file, images corresponding to the segmentation map and depth map can be generated from the scene description data in the same way that the simulated image is rendered. For the segmentation map, instead of rendering the reflection of ambient light onto an image sensor, segmentation rendering software generates an image that identifies the regions corresponding to the objects in the image, forming a segmentation map. For the depth map, depth rendering software generates an image in which the pixels correspond to the distance from the sensor to points in the scene, forming a depth map. Segmentation maps and depth maps that correspond to simulated images 204 in this way can be used to train the segmentation decoder 222 and depth decoder 224 to produce a segmentation map 226 and a depth map 228 based on a simulated image 204 input. After training, the twin VAE 500 can input a real image 202 and produce a segmentation map 226 and a depth map 228 without retraining, because the twin VAE 500 has been trained to produce reconstructed real images 218 and reconstructed simulated images 220 in a cycle-consistent manner as discussed above with respect to FIGS. 3 and 4.
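
As one minimal sketch of how such ground truth might be assembled, the following assumes the renderer exposes a per-pixel object-ID buffer and a camera-space distance buffer for each simulated frame; the buffer names, the id-to-class table, and the class values are assumptions for illustration, not outputs of any specific rendering package.

```python
import numpy as np

def ground_truth_from_buffers(instance_ids: np.ndarray,
                              z_buffer: np.ndarray,
                              id_to_class: dict) -> tuple:
    """Build segmentation and depth ground truth images from rendered buffers.

    instance_ids: (H, W) integer object IDs from the scene description
    z_buffer:     (H, W) distance from the sensor to the nearest surface, in meters
    """
    seg_gt = np.zeros_like(instance_ids, dtype=np.uint8)
    for obj_id, class_label in id_to_class.items():
        seg_gt[instance_ids == obj_id] = class_label   # label the region for this object
    depth_gt = z_buffer.astype(np.float32)             # already per-pixel sensor distance
    return seg_gt, depth_gt

# Example: object IDs 1-3 are vehicles (class 4), ID 7 is a pedestrian (class 5).
# seg_gt, depth_gt = ground_truth_from_buffers(ids, z, {1: 4, 2: 4, 3: 4, 7: 5})
```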

FIG. 6 is a diagram of a real image 602 and a corresponding segmentation map 604. The segmentation map 604 is generated by inputting the real image 202 into the trained twin VAE 500, which includes the trained segmentation decoder 222. In the segmentation map 604, the contours of objects, including vehicles, have been processed and replaced by regions with a single grayscale or color value corresponding to the "vehicle" label. The segmentation map 604 can also be an "instance" segmentation map, in which each vehicle is determined to be a separate instance of a vehicle and is assigned a unique color or grayscale value that identifies each vehicle individually.

FIG. 7 is a diagram of a real image 702 and a depth map 704. The depth map 704 is generated by inputting the real image 202 into the trained twin VAE 500, which includes the trained depth decoder 224. In the depth map 704, each pixel of the input real image 702 is replaced by a grayscale value corresponding to the distance between the sensor that acquired the real image 702 and the objects in the scene.

FIG. 8 is a flowchart of the process described with respect to FIGS. 1-7 for generating a segmentation map 226 and a depth map 228 based on a real image 202. Process 800 can be implemented by a processor of a computing device, taking, for example, information from sensors as input, executing commands, and outputting object information. Process 800 includes multiple blocks that can be executed in the illustrated order. Process 800 could alternatively or additionally include fewer blocks, or can include the blocks executed in different orders.

Process 800 begins at block 802, where a computing device, for example in a server computer, trains the twin VAE 200 neural network to generate reconstructed real images 218 and reconstructed simulated images 220 based on real image 202 and simulated image 204 input, using the cycle consistency approach discussed above with respect to FIGS. 2, 3, and 4. The twin VAE 500 can then be trained, using simulated images 204 and the corresponding ground truth, to generate a segmentation map 226 and a depth map 228 in response to an input real image 202, as discussed with respect to FIG. 5.

At block 804, the trained twin VAE 500 can be downloaded to a computing device 115 in a vehicle 110. The twin VAE can be used to input a real image 202 and, in response to the real image 202 input, output a segmentation map 226 and a depth map 228, as discussed with respect to FIGS. 6 and 7. The real image 202 can be generated, for example, by a vehicle sensor such as a color camera.

At block 806, the twin VAE 500 can output the segmentation map 226 and depth map 228 to software executing in the computing device 115 for use in operating the vehicle 110. The segmentation map 226 and depth map 228 can be used to determine a vehicle path. One technique for determining a vehicle path includes using the segmentation map and depth map to produce a cognitive map of the environment around the vehicle. A cognitive map is a top-down view of the environment around the vehicle that includes, for example, roads and objects such as vehicles and pedestrians. A vehicle path can be determined by selecting, on the cognitive map, a local route consistent with the vehicle route plan. The vehicle route plan can include a route from a starting point to a final destination such as "work" or "home", and can be determined, for example, using locations and maps stored in memory of the computing device 115 or downloaded from a server computer via the Internet. The vehicle path is a polynomial function that describes the travel of the vehicle 110 from its current location to a local destination on the vehicle route plan. The polynomial function can be determined so as to maintain vehicle lateral and longitudinal accelerations within predetermined limits. The computing device 115 can control vehicle steering, braking, and powertrain via the controllers 112, 113, 114 to move the vehicle 110 along the polynomial function and thereby travel on the planned vehicle path. Following block 806, process 800 ends.
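
As a minimal sketch of the acceleration check described above, the following evaluates the lateral acceleration implied by a cubic polynomial path at a given speed and compares it to a limit; the coefficients, speed, look-ahead distance, and limit value are illustrative assumptions, not values from this disclosure.

```python
import numpy as np

def lateral_acceleration(coeffs, x, speed_mps):
    """Lateral acceleration a_lat = v^2 * kappa along a path y(x) = c0 + c1*x + c2*x^2 + c3*x^3."""
    c0, c1, c2, c3 = coeffs
    dy = c1 + 2 * c2 * x + 3 * c3 * x ** 2      # first derivative y'(x)
    d2y = 2 * c2 + 6 * c3 * x                   # second derivative y''(x)
    curvature = np.abs(d2y) / (1.0 + dy ** 2) ** 1.5
    return speed_mps ** 2 * curvature

# Check a candidate path against an assumed 3 m/s^2 lateral-acceleration limit
# over an assumed 30 m look-ahead at 15 m/s.
xs = np.linspace(0.0, 30.0, 100)
a_lat = lateral_acceleration((0.0, 0.0, 0.01, -0.0002), xs, speed_mps=15.0)
path_ok = bool(np.all(a_lat < 3.0))
```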

Computing devices such as those discussed herein generally each include commands executable by one or more computing devices such as those identified above, and for carrying out blocks or steps of the processes described above. For example, the process blocks discussed above can be embodied as computer-executable commands.

Computer-executable commands can be compiled or interpreted from computer programs created using a variety of programming languages and/or technologies, including, without limitation, and either alone or in combination, Java™, C, C++, Python, Julia, SCALA, Visual Basic, Java Script, Perl, HTML, etc. In general, a processor (e.g., a microprocessor) receives commands, e.g., from a memory, a computer-readable medium, etc., and executes these commands, thereby performing one or more processes, including one or more of the processes described herein. Such commands and other data can be stored in files and transmitted using a variety of computer-readable media. A file in a computing device is generally a collection of data stored on a computer-readable medium, such as a storage medium, a random access memory, etc.

A computer-readable medium includes any medium that participates in providing data (e.g., commands) that can be read by a computer. Such a medium can take many forms, including, but not limited to, non-volatile media, volatile media, etc. Non-volatile media include, for example, optical or magnetic disks and other persistent memory. Volatile media include dynamic random access memory (DRAM), which typically constitutes a main memory. Common forms of computer-readable media include, for example, a floppy disk, a flexible disk, a hard disk, magnetic tape, any other magnetic medium, a CD-ROM, a DVD, any other optical medium, punch cards, paper tape, any other physical medium with patterns of holes, a RAM, a PROM, an EPROM, a FLASH EEPROM, any other memory chip or cartridge, or any other medium from which a computer can read.

All terms used in the claims are intended to be given their plain and ordinary meanings as understood by those skilled in the art unless an explicit indication to the contrary is made herein. In particular, use of the singular articles such as "a", "the", "said", etc. should be read to recite one or more of the indicated elements unless a claim recites an explicit limitation to the contrary.

The term "exemplary" is used herein in the sense of signifying an example, e.g., a reference to an "exemplary widget" should be read as simply referring to an example of a widget.

The adverb "approximately" modifying a value or result means that a shape, structure, measurement, value, determination, calculation, etc. may deviate from an exactly described geometry, distance, measurement, value, determination, calculation, etc., because of imperfections in materials, machining, manufacturing, sensor measurements, computations, processing time, communication time, etc.

In the drawings, the same reference numbers indicate the same elements. Further, some or all of these elements could be changed. With regard to the media, processes, systems, methods, etc. described herein, it should be understood that, although the steps or blocks of such processes, etc. have been described as occurring according to a certain ordered sequence, such processes could be practiced with the described steps performed in an order other than the order described herein. It further should be understood that certain steps could be performed simultaneously, that other steps could be added, or that certain steps described herein could be omitted. In other words, the descriptions of processes herein are provided for the purpose of illustrating certain embodiments, and should in no way be construed so as to limit the claimed invention.

According to the present invention, there is provided a computer having: a processor; and a memory, the memory including instructions executable by the processor to: receive a monocular image and provide the image to a variational autoencoder neural network (VAE), wherein the VAE has been trained in a twin configuration including a first encoder-decoder pair that receives unlabeled real images as input and outputs reconstructed real images and a second encoder-decoder pair that receives synthetic images as input and outputs reconstructed synthetic images, and wherein the VAE includes a third decoder and a fourth decoder that are trained using labeled synthetic images, segmentation ground truth, and depth ground truth; and output a segmentation map and a depth map from the VAE based on the input monocular image.

According to one embodiment, the VAE trained in the twin configuration includes the third decoder outputting segmentation maps and the fourth decoder outputting depth maps.

According to one embodiment, the segmentation ground truth includes labels for a plurality of objects in the labeled synthetic images, and the depth ground truth includes distances from a sensor to a plurality of locations in the labeled synthetic images.

According to one embodiment, the segmentation map includes labeled objects including roadways, buildings, foliage, vehicles, and pedestrians.

According to one embodiment, the depth map includes distances from a sensor to a plurality of locations.

According to one embodiment, the real images are acquired by a real-world sensor viewing a real-world scene.

According to one embodiment, the synthetic images are generated by photorealistic image rendering software based on data, input to the photorealistic image rendering software, describing a scene to be rendered by the photorealistic image rendering software.

According to one embodiment, the segmentation ground truth and depth ground truth are generated based on a scene description, input to the photorealistic image rendering software, describing the scene to be rendered by the photorealistic image rendering software.

According to one embodiment, the VAE includes first and second encoders for the unlabeled real images and the labeled synthetic images, and further wherein the first and second encoders each include layers that share weights with the other of the first or second encoder, a shared latent space, and respective first and second decoders for the unlabeled real images and the labeled synthetic images.

According to one embodiment, the VAE is further trained based on determining cycle consistency between the first encoder-decoder and the second encoder-decoder.

According to one embodiment, training the VAE based on determining cycle consistency includes comparing input real images and reconstructed real images by determining a Kullback-Leibler divergence loss and a maximum mean discrepancy loss.

According to one embodiment, the instructions further include instructions to operate a device based on the segmentation map and the depth map.

According to one embodiment, the device is one of a vehicle, a mobile robot, a stationary robot, a drone, and a surveillance system.

According to one embodiment, the instructions further include instructions to operate a vehicle by controlling one or more of vehicle propulsion, vehicle braking, and vehicle steering based on determining a vehicle path from the segmentation map and the depth map.

According to the present invention, a method includes: receiving a monocular image and providing the image to a variational autoencoder neural network (VAE), wherein the VAE has been trained in a twin configuration including a first encoder-decoder pair that receives unlabeled real images as input and outputs reconstructed real images and a second encoder-decoder pair that receives synthetic images as input and outputs reconstructed synthetic images, and wherein the VAE includes a third decoder and a fourth decoder that are trained using labeled synthetic images, segmentation ground truth, and depth ground truth; and outputting a segmentation map and a depth map from the VAE based on the input monocular image.

In one aspect of the invention, the VAE trained in the twin configuration includes the third decoder outputting segmentation maps and the fourth decoder outputting depth maps.

In one aspect of the invention, the segmentation ground truth includes labels for a plurality of objects in the labeled synthetic images, and the depth ground truth includes distances from a sensor to a plurality of locations in the labeled synthetic images.

In one aspect of the invention, the segmentation map includes labeled objects including roadways, buildings, foliage, vehicles, and pedestrians.

In one aspect of the invention, the depth map includes distances from a sensor to a plurality of locations.

In one aspect of the invention, the real images are acquired by a real-world sensor viewing a real-world scene.

Claims (15)

CN202210452824.7A2022-04-272022-04-27Vehicle Neural NetworkPendingCN117011815A (en)

Priority Applications (1)

Application NumberPriority DateFiling DateTitle
CN202210452824.7ACN117011815A (en)2022-04-272022-04-27Vehicle Neural Network

Applications Claiming Priority (1)

Application NumberPriority DateFiling DateTitle
CN202210452824.7ACN117011815A (en)2022-04-272022-04-27Vehicle Neural Network

Publications (1)

Publication NumberPublication Date
CN117011815Atrue CN117011815A (en)2023-11-07

Family

ID=88565842

Family Applications (1)

Application NumberTitlePriority DateFiling Date
CN202210452824.7APendingCN117011815A (en)2022-04-272022-04-27Vehicle Neural Network

Country Status (1)

CountryLink
CN (1)CN117011815A (en)

Similar Documents

PublicationPublication DateTitle
US11551429B2 (en)Photorealistic image simulation with geometry-aware composition
US11138751B2 (en)Systems and methods for semi-supervised training using reprojected distance loss
US11312372B2 (en)Vehicle path prediction
JP7666530B2 (en) Monocular depth management from 3D bounding boxes
US11562571B2 (en)Vehicle neural network
US11144818B2 (en)Network architecture for ego-motion estimation
CN112912920A (en) Point cloud data conversion method and system for 2D convolutional neural networks
US11042758B2 (en)Vehicle image generation
US11975738B2 (en)Image annotation for deep neural networks
US11922320B2 (en)Neural network for object detection and tracking
CN110726399A (en) pose estimation
US11138452B2 (en)Vehicle neural network training
US12103530B2 (en)Vehicle data augmentation
JP2023503827A (en) Depth data model training with upsampling, loss and loss balance
JP2024518570A (en) Sensor simulation with integrated multi-sensor view
CN116311216A (en) 3D object detection
CN112440970A (en)Vehicle neural network
US20220392014A1 (en)Image rectification
US11529916B1 (en)Multiplexing sensor data
CN118850062A (en) Using machine learning to operate devices such as vehicles
CN117011815A (en)Vehicle Neural Network
CN116630929A (en)Attitude estimation
DE102022109949A1 (en) NEURONAL NETWORK OF A VEHICLE
CN120431548A (en)Sensing method, sensing device, vehicle, storage medium and program product

Legal Events

DateCodeTitleDescription
PB01Publication
PB01Publication
SE01Entry into force of request for substantive examination
SE01Entry into force of request for substantive examination
