Claiming Priority
This application is a continuation-in-part of U.S. Patent Application No. 16/418,317 (Docket No. 741868/18-HE-0369-US03), filed May 21, 2019, entitled "A Style-Based Architecture For Generative Neural Networks," which claims the benefit of U.S. Provisional Application No. 62/767,417 (Docket No. 510998/17-HE-0369-US01), filed November 14, 2018, entitled "A Style-Based Architecture For Generative Neural Networks," and U.S. Provisional Application No. 62/767,985 (Docket No. 510893/17-HE-0369-US02), filed November 15, 2018, entitled "A Style-Based Architecture For Generative Neural Networks," the entire contents of which are incorporated herein by reference. This application also claims the benefit of U.S. Provisional Application No. 63/010,511 (Docket No. 513138/20-HE-0162-US01), filed April 15, 2020, entitled "Generative Neural Network Assisted Video Compression and Decompression," the entire contents of which are incorporated herein by reference.
Technical Field
The present invention relates to neural networks, and more particularly, to a generator architecture for synthesizing data using scale-specific controls.
Background
The resolution and quality of images produced by generative adversarial networks (GANs) have improved recently. Nevertheless, GANs continue to operate as black boxes, and despite recent efforts, an understanding of various aspects of the image synthesis process, e.g., the origin of stochastic features, is still lacking. The properties of the latent space are also poorly understood, and the commonly demonstrated latent space interpolations provide no quantitative way to compare different GANs against each other. There is a need for addressing these issues and/or other issues associated with the prior art.
Summary of the Invention
A style-based generative network architecture enables scale-specific control of synthesized output data (e.g., images). During training, a style-based generative neural network (generator neural network) includes a mapping network and a synthesis network. During prediction, the mapping network may be omitted, duplicated, or evaluated several times. The synthesis network may be used to generate highly varied, high-quality output data with a wide variety of attributes. For example, when used to generate images of people's faces, the attributes that may vary are age, ethnicity, camera viewpoint, pose, face shape, eyeglasses, colors (eyes, hair, etc.), hair style, lighting, background, and so on. Depending on the task, the generated output data may include images, audio, video, three-dimensional (3D) objects, text, etc.
A latent code defined in an input space is processed by the mapping neural network to produce an intermediate latent code defined in an intermediate latent space. The intermediate latent code may be used as an appearance vector that is processed by the synthesis neural network to generate an image. The appearance vector is a compressed encoding of data, such as video frames including a person's face, audio, and other data. A captured image may be converted into an appearance vector at a local device and transmitted to a remote device using much less bandwidth than transmitting the captured image would require. A synthesis neural network at the remote device reconstructs the image for display.
Methods, computer-readable media, and systems are disclosed for generative adversarial neural network assisted video compression and broadcasting. Replica data specific to a first subject is transmitted for configuring a remote synthesis neural network to reconstruct images of a face that include features based on the replica data. A generator neural network processes captured images of the first subject or a second subject to produce appearance vectors encoding facial attributes of the first or second subject, and the appearance vectors are transmitted to the remote synthesis neural network.
BRIEF DESCRIPTION OF THE DRAWINGS
FIG. 1A illustrates a block diagram of a style-based generator system, in accordance with an embodiment.
FIG. 1B illustrates images generated by the style-based generator system, in accordance with an embodiment.
FIG. 1C illustrates a flowchart of a method for style-based generation, in accordance with an embodiment.
FIG. 2A illustrates a block diagram of the mapping neural network shown in FIG. 1A, in accordance with an embodiment.
FIG. 2B illustrates a block diagram of the synthesis neural network shown in FIG. 1A, in accordance with an embodiment.
FIG. 2C illustrates a flowchart of a method for applying spatial noise using the style-based generator system, in accordance with an embodiment.
FIG. 2D illustrates a block diagram of a GAN system, in accordance with an embodiment.
FIG. 3 illustrates a parallel processing unit, in accordance with an embodiment.
FIG. 4A illustrates a general processing cluster within the parallel processing unit of FIG. 3, in accordance with an embodiment.
FIG. 4B illustrates a memory partition unit of the parallel processing unit of FIG. 3, in accordance with an embodiment.
FIG. 5A illustrates the streaming multiprocessor of FIG. 4A, in accordance with an embodiment.
FIG. 5B is a conceptual diagram of a processing system implemented using the PPU of FIG. 3, in accordance with an embodiment.
FIG. 5C illustrates an exemplary system in which the various architecture and/or functionality of the various previous embodiments may be implemented.
FIG. 5D illustrates components of an exemplary system that can be used to train and utilize machine learning, for use in implementing some embodiments of the present disclosure.
FIG. 6A illustrates an exemplary video streaming system suitable for use in implementing some embodiments of the present disclosure.
FIG. 6B illustrates various appearance vectors suitable for use in implementing some embodiments of the present disclosure.
FIG. 6C illustrates a flowchart of a method for GAN-assisted video compression, in accordance with an embodiment.
FIG. 6D illustrates a flowchart of a method for GAN-assisted video reconstruction, in accordance with an embodiment.
FIG. 7A is a conceptual diagram of a synthesis neural network training configuration, for use in implementing some embodiments of the present disclosure.
FIG. 7B is a conceptual diagram of an end-to-end system including the projector of FIG. 7A, for use in implementing some embodiments of the present disclosure.
FIG. 7C is a conceptual diagram of a configuration for generating training data, for use in implementing some embodiments of the present disclosure.
FIG. 7D is a conceptual diagram of a training configuration that uses landmarks to predict appearance vectors, for use in implementing some embodiments of the present disclosure.
FIG. 7E is a conceptual diagram of another end-to-end system including a synthesis neural network, for use in implementing some embodiments of the present disclosure.
FIG. 8A is a conceptual diagram of another synthesis neural network training configuration, for use in implementing some embodiments of the present disclosure.
FIG. 8B is a conceptual diagram of yet another synthesis neural network training configuration, for use in implementing some embodiments of the present disclosure.
DETAILED DESCRIPTION
A style-based generative network architecture enables scale-specific control of the synthesized output. The style-based generator system includes a mapping network and a synthesis network. Conceptually, in an embodiment, feature maps (containing spatially varying information representing the content of the output data, where each feature map is one channel of intermediate activations) generated by different layers of the synthesis network are modified based on style control signals provided by the mapping network. The style control signals for the different layers of the synthesis network may be generated from the same or different latent codes. As used herein, a "style control signal" controls attributes of the synthesized image of a subject, such as pose, general hair style, face shape, eyeglasses, colors (eyes, hair, lighting), and micro-structure. A latent code may be a random N-dimensional vector drawn, for example, from a Gaussian distribution. The style control signals for the different layers of the synthesis network may be generated by the same or different mapping networks. Additionally, spatial noise may be injected into each layer of the synthesis network.
FIG. 1A illustrates a block diagram of a style-based generator system 100, in accordance with an embodiment. The style-based generator system 100 includes a mapping neural network 110, a style conversion unit 115, and a synthesis neural network 140. After the synthesis neural network 140 is trained, the synthesis neural network 140 may be deployed without the mapping neural network 110 when one or more intermediate latent codes and/or the style signals produced by the style conversion unit 115 are pre-computed. In an embodiment, an additional style conversion unit 115 may be included to convert the intermediate latent code generated by the mapping neural network 110 into a second style signal, or to convert a different intermediate latent code into the second style signal. One or more additional mapping neural networks 110 may be included in the style-based generator system 100 to produce additional intermediate latent codes from the latent code or from additional latent codes.
The style-based generator system 100 may be implemented by a program, custom circuitry, or by a combination of custom circuitry and a program. For example, the style-based generator system 100 may be implemented using a GPU (graphics processing unit), a CPU (central processing unit), or any processor capable of performing the operations described herein. Furthermore, persons of ordinary skill in the art will understand that any system that performs the operations of the style-based generator system 100 is within the scope and spirit of embodiments of the present invention.
Conventionally, a latent code is provided to a generator through an input layer, such as the first layer of a feedforward neural network. In contrast, in an embodiment, instead of receiving the latent code, the synthesis neural network 140 starts from a learned constant, and the latent code is input to the mapping neural network 110. In an embodiment, the first intermediate data is the learned constant. Given a latent code z in the input latent space Z, a non-linear mapping network f: Z → W first produces an intermediate latent code w ∈ W. The mapping neural network 110 may be configured to implement the non-linear mapping network. In an embodiment, the dimensionalities of the input and output activations in the input latent space Z and the intermediate latent space W are equal (e.g., 512). In an embodiment, the mapping function f is implemented using an 8-layer MLP (multi-layer perceptron, i.e., a neural network consisting of only fully-connected layers).
While a conventional generator feeds the latent code only through its input layer, the mapping neural network 110 maps the input latent code z into the intermediate latent space W to produce the intermediate latent code w. The style conversion unit 115 converts the intermediate latent code w into a first style signal. One or more intermediate latent codes w are converted into spatially invariant styles, including the first style signal and a second style signal. In contrast to conventional style transfer techniques, the spatially invariant style is computed from a vector (namely the intermediate latent code w) instead of from an example image. The one or more intermediate latent codes w may be generated by one or more mapping neural networks 110 for one or more corresponding latent codes z. The synthesis neural network 140 processes the first intermediate data (e.g., a learned constant encoded as feature maps) according to the style signals, for example increasing the resolution of the first intermediate data from 4×4 to 8×8, and continuing until the output data resolution is reached.
In an embodiment, the style conversion unit 115 performs an affine transformation. The style conversion unit 115 may be trained to learn the affine transformation during training of the synthesis neural network 140. The first style signal controls operations at a first layer 120 of the synthesis neural network 140 to produce modified first intermediate data. In an embodiment, the first style signal controls an adaptive instance normalization (AdaIN) operation within the first layer 120 of the synthesis network 140. In an embodiment, the AdaIN operation receives a set of content feature maps and a style signal and modifies the first-order statistics (i.e., the "style") of the content feature maps to match first-order statistics defined by the style signal. The modified first intermediate data output by the first layer 120 is processed by one or more processing layers 125 to produce second intermediate data. In an embodiment, the one or more processing layers 125 include a 3×3 convolutional layer. In an embodiment, the one or more processing layers 125 include a 3×3 convolutional layer followed by an AdaIN operation that receives an additional style signal not explicitly shown in FIG. 1A.
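For illustration, the following is a minimal PyTorch sketch of a style conversion unit as a learned affine transformation that maps an intermediate latent code w to a style signal y = (y_s, y_b), with one scale and one bias per feature map. The class name and dimensions are assumptions of this sketch, not the literal unit 115.

```python
import torch
import torch.nn as nn

class StyleConversion(nn.Module):
    """Learned affine transformation: w -> style signal y = (y_s, y_b)."""
    def __init__(self, w_dim=512, num_features=512):
        super().__init__()
        # One scale and one bias per feature map, so 2 * num_features outputs.
        self.affine = nn.Linear(w_dim, 2 * num_features)

    def forward(self, w):
        y = self.affine(w)            # shape: (batch, 2 * num_features)
        y_s, y_b = y.chunk(2, dim=1)  # split into scales and biases
        return y_s, y_b
```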
The second intermediate data is input to a second layer 130 of the synthesis neural network 140. The second style signal controls operations at the second layer 130 to produce modified second intermediate data. In an embodiment, the first style signal modifies first attributes encoded in the first intermediate data, and the second style signal modifies second attributes encoded in the first intermediate data and the second intermediate data. For example, the first intermediate data is coarse data compared with the second intermediate data, and the first style is transferred to coarse feature maps at the first layer 120 while the second style is transferred to higher-resolution feature maps at the second layer 130.
In an embodiment, the second layer 130 upsamples the second intermediate data and includes a 3×3 convolutional layer followed by an AdaIN operation. In an embodiment, the second style signal controls the AdaIN operation within the second layer 130 of the synthesis network 140. The modified second intermediate data output by the second layer 130 is processed by one or more processing layers 135 to produce output data including content corresponding to the second intermediate data. In an embodiment, multiple (e.g., 32, 48, 64, 96, etc.) channels of features in the modified second intermediate data are converted into output data encoded as color channels (e.g., red, green, blue).
In an embodiment, the one or more processing layers 135 include a 3×3 convolutional layer. In an embodiment, the output data is an image including first attributes corresponding to a first scale and second attributes corresponding to a second scale, where the first scale is coarser than the second scale. The first scale may correspond to the scale of the feature maps processed by the first layer 120 and the second scale may correspond to the scale of the feature maps processed by the second layer 130. Thus, the first style signal modifies the first attributes at the first scale and the second style signal modifies the second attributes at the second scale.
More illustrative information will now be set forth regarding various optional architectures and features with which the foregoing framework may be implemented, per the desires of the user. It should be strongly noted that the following information is set forth for illustrative purposes and should not be construed as limiting in any manner. Any of the following features may be optionally incorporated with or without the exclusion of other features described.
FIG. 1B illustrates images generated by the style-based generator system 100, in accordance with an embodiment. The images are generated at 1024×1024 resolution. In other embodiments, images may be generated at different resolutions. Two different latent codes are used to control styles of the images generated by the style-based generator system 100. Specifically, a first portion of the styles is generated by the mapping neural network 110 and the style conversion unit 115 from "source" latent codes in the top row. A second portion of the styles is generated by the same or an additional mapping neural network 110 and a corresponding style conversion unit 115 from "destination" latent codes in the leftmost column. The style-based generator system 100 starts from a learned constant input at the synthesis neural network 140 and adjusts the "style" of the image at each convolutional layer based on the latent codes, thereby directly controlling the strength of image attributes encoded in the feature maps at different scales. In other words, a given set of styles from the "source" data is copied to the "destination" data. Thus, the copied styles (coarse, middle, or fine) are taken from the "source" data while all other styles are kept the same as in the "destination" data.
The first portion of the styles (destination) is applied by the synthesis neural network 140 to process the learned constant, with a first subset of the first portion of the styles replaced by a corresponding second subset of the second portion of the styles (source). In an embodiment, the learned constant is a 4×4×512 constant tensor. In the second, third, and fourth rows of images in FIG. 1B, the second portion of the styles (source) replaces the first portion of the styles (destination) at coarse layers of the synthesis neural network 140. In an embodiment, the coarse layers correspond to coarse spatial resolutions of 4²–8². In an embodiment, high-level attributes such as pose, general hair style, face shape, and eyeglasses are copied from the source, while other attributes, such as all colors (eyes, hair, lighting) and the finer facial features of the destination, are retained.
In the fifth and sixth rows of images in FIG. 1B, the second portion of the styles (source) replaces the first portion of the styles (destination) at middle layers of the synthesis neural network 140. In an embodiment, the middle layers correspond to spatial resolutions of 16²–32². Smaller-scale facial features, hair style, and eyes open/closed are inherited from the source, while the pose, general face shape, and eyeglasses from the destination are preserved. Finally, in the last row of images in FIG. 1B, the second portion of the styles (source) replaces the first portion of the styles (destination) at high-resolution (fine) layers of the synthesis neural network 140. In an embodiment, the fine layers correspond to spatial resolutions of 64²–1024². Using the styles from the second portion of the styles (source) for the fine layers inherits the color scheme and micro-structure from the source, while the pose and general face shape from the destination are preserved.
The architecture of the style-based generator system 100 enables control of the image synthesis via scale-specific modifications to the styles. The mapping network 110 and the affine transformations performed by the style conversion unit 115 can be viewed as a way to draw samples for each style from a learned distribution, and the synthesis network 140 provides a mechanism to generate a novel image based on a collection of styles. The effects of each style are localized in the synthesis network 140, i.e., modifying a specific subset of the styles can be expected to affect only certain attributes of the image.
As shown in FIG. 1B, using style signals from at least two different latent codes is referred to as style mixing or mixing regularization. Style mixing during training decorrelates neighboring styles and enables more fine-grained control over the generated imagery. In an embodiment, during training, a given percentage of images are generated using two random latent codes instead of one. When generating such an image, a random location (e.g., a crossover point) in the synthesis neural network 140 may be selected where processing switches from the style signals generated using a first latent code to the style signals generated using a second latent code. In an embodiment, two latent codes z₁, z₂ are processed by the mapping neural network 110, and the corresponding intermediate latent codes w₁, w₂ control the styles so that w₁ applies before the crossover point and w₂ applies after the crossover point. The mixing regularization technique prevents the synthesis neural network 140 from assuming that adjacent styles are correlated.
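A minimal sketch of style mixing follows, assuming a trained mapping network is available as a callable; the function and variable names are illustrative, not part of the embodiment above.

```python
import torch

def mixed_styles(mapping, z1, z2, num_layers, crossover):
    """Return per-layer intermediate latent codes: w1 controls the styles
    before the crossover point and w2 controls the styles after it."""
    w1, w2 = mapping(z1), mapping(z2)
    return [w1 if layer < crossover else w2 for layer in range(num_layers)]

# Usage sketch: pick a random crossover point for a training sample.
# z1, z2 = torch.randn(1, 512), torch.randn(1, 512)
# crossover = int(torch.randint(1, 18, ()).item())
# per_layer_w = mixed_styles(mapping, z1, z2, num_layers=18, crossover=crossover)
```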
Table 1 illustrates how enabling mixing regularization during training improves the localization of the styles considerably, indicated by improved (lower is better) Fréchet inception distances (FIDs) in scenarios where multiple latent codes are mixed at test time. The images shown in FIG. 1B are examples of images synthesized by mixing two latent codes at various scales. Each subset of the styles controls meaningful high-level attributes of the image.
Table 1: FIDs for different mixing regularization ratios
The mixing ratio indicates the percentage of training examples for which mixing regularization is enabled. During testing, up to four different latent codes are selected at random, and the crossover points between them are also selected at random. Mixing regularization improves the tolerance to these adverse operations considerably.
As confirmed by the FIDs, the average quality of the images generated by the style-based generator system 100 is high, and even accessories such as eyeglasses and hats are successfully synthesized. For the images shown in FIG. 1B, sampling from the extreme regions of W is avoided using the so-called truncation trick, which can be performed in W instead of Z. Note that the style-based generator system 100 may be implemented such that the truncation is applied selectively to low resolutions only, so that high-resolution details are not affected.
Considering the distribution of the training data, areas of low density are poorly represented and thus likely to be difficult for the style-based generator system 100 to learn. Non-uniform distribution of training data presents a significant open question in all generative modeling techniques. However, it is known that drawing latent vectors from a truncated or otherwise shrunk sampling space tends to improve average image quality, although some amount of variation is lost. In an embodiment, to improve training of the style-based generator system 100, the centroid of W is computed as w̄ = E_{z∼P(z)}[f(z)]. In the case of a dataset of human faces (e.g., FFHQ, Flickr-Faces-HQ), this point represents a sort of average face (ψ = 0). The deviation of a given w from the center can then be scaled down as w′ = w̄ + ψ(w − w̄), where ψ < 1. In conventional generative modeling systems, only a subset of the neural networks are amenable to such truncation even when orthogonal regularization is used, whereas truncation in W space appears to work reliably even without changes to the loss function.
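The following is a sketch of the truncation trick in W, under the assumption that the centroid w̄ is estimated by averaging f(z) over many random samples; the names are illustrative.

```python
import torch

def truncate(w, w_avg, psi=0.7):
    """Scale the deviation of w from the centroid w_avg by psi < 1,
    trading some variation for improved average image quality."""
    return w_avg + psi * (w - w_avg)

# Estimating the centroid (assumes `mapping` maps z to w in batches):
# zs = torch.randn(10000, 512)
# w_avg = mapping(zs).mean(dim=0, keepdim=True)
```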
FIG. 1C illustrates a flowchart of a method 150 for style-based generation, in accordance with an embodiment. The method 150 may be performed by a program, custom circuitry, or by a combination of custom circuitry and a program. For example, the method 150 may be executed by a GPU (graphics processing unit), a CPU (central processing unit), or any processor capable of performing the operations of the style-based generator system 100. Persons skilled in the art will understand that any system that performs the method 150 is within the scope and spirit of embodiments of the present invention.
At step 155, the mapping neural network 110 processes a latent code defined in an input space to produce an intermediate latent code defined in an intermediate latent space. At step 160, the intermediate latent code is converted into a first style signal by the style conversion unit 115. At step 165, the first style signal is applied at a first layer 120 of the synthesis neural network 140 to modify first intermediate data according to the first style signal, producing modified first intermediate data. At step 170, the modified first intermediate data is processed by one or more processing layers 125 to produce second intermediate data. At step 175, a second style signal is applied at a second layer 130 of the synthesis neural network 140 to modify the second intermediate data according to the second style signal, producing modified second intermediate data. At step 180, the modified second intermediate data is processed by one or more processing layers 135 to produce output data including content corresponding to the second intermediate data.
There are various definitions for disentanglement, but a common goal is a latent space that consists of linear subspaces, each of which controls one factor of variation. However, the sampling probability of each combination of factors in the latent space Z needs to match the corresponding density in the training data.
A major benefit of the style-based generator system 100 is that the intermediate latent space W does not have to support sampling according to any fixed distribution; the sampling density of the style-based generator system 100 is induced by the learned piecewise continuous mapping f(z). The mapping can be adapted to "unwarp" W so that the factors of variation become more linear. The style-based generator system 100 will naturally tend to unwarp W, as it should be easier to generate realistic images based on a disentangled representation than based on an entangled representation. As such, the training may yield a less entangled W in an unsupervised setting, i.e., when the factors of variation are not known in advance.
FIG. 2A illustrates a block diagram of the mapping neural network 110 shown in FIG. 1A, in accordance with an embodiment. The distribution of the training data may be missing combinations of attributes, such as children wearing eyeglasses. Compared with the latent space Z, the distribution of the factors of variation for the combination of eyeglasses and age becomes more linear in the intermediate latent space W.
In an embodiment, the mapping neural network 110 includes a normalization layer 205 and multiple fully-connected layers 210. In an embodiment, eight fully-connected layers 210 are coupled sequentially to produce the intermediate latent code. The parameters (e.g., weights) of the mapping neural network 110 are learned during training and are used to process the input latent codes when the style-based generator system 100 is deployed to generate output data. In an embodiment, the mapping neural network 110 generates one or more intermediate latent codes that are used at a later time by the synthesis neural network 140 to generate output data.
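A sketch of the mapping neural network 110 follows, under the assumptions that the normalization layer 205 normalizes each latent vector to unit length and that leaky ReLU activations join the fully-connected layers; both choices are assumptions of this sketch rather than requirements of the embodiment.

```python
import torch
import torch.nn as nn

class PixelNorm(nn.Module):
    """Normalizes each latent vector to unit length (assumed behavior
    of normalization layer 205 in this sketch)."""
    def forward(self, x):
        return x * torch.rsqrt(x.pow(2).mean(dim=1, keepdim=True) + 1e-8)

class MappingNetwork(nn.Module):
    """Normalization layer followed by eight fully-connected layers,
    mapping a latent code z to an intermediate latent code w."""
    def __init__(self, z_dim=512, w_dim=512, num_layers=8):
        super().__init__()
        layers = [PixelNorm()]
        in_dim = z_dim
        for _ in range(num_layers):
            layers += [nn.Linear(in_dim, w_dim), nn.LeakyReLU(0.2)]
            in_dim = w_dim
        self.net = nn.Sequential(*layers)

    def forward(self, z):
        return self.net(z)
```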
There are many attributes in portraits that can be regarded as stochastic, such as the exact placement of hairs, stubble, freckles, or skin pores. Any of these attributes can be randomized without affecting the perception of the image, as long as the randomization follows the correct distribution. Artificially omitting the noise when generating images leads to images with a featureless "painterly" look. In particular, when portraits are generated, coarse noise may cause large-scale curling of hair and the appearance of larger background features, while fine noise may bring out finer curls of hair, finer background detail, and skin pores.
A conventional generator may generate stochastic variation based only on the input to the neural network, provided through the input layer. During training, whenever pseudorandom numbers are needed, the conventional generator may be forced to learn to generate spatially varying pseudorandom numbers from earlier activations. In other words, pseudorandom number generation is not intentionally built into a conventional generator. Instead, generation of the pseudorandom numbers emerges on its own during training so that the conventional generator satisfies the training objective. Generating the pseudorandom numbers consumes neural network capacity, and hiding the periodicity of the generated signal is difficult and not always successful, as evidenced by commonly seen repetitive patterns in generated images. In contrast, the style-based generator system 100 may be configured to avoid these limitations by adding per-pixel noise after each convolution.
In an embodiment, the style-based generator system 100 is configured with a direct means to generate stochastic detail by introducing explicit noise inputs. In an embodiment, a noise input is a single-channel image consisting of uncorrelated Gaussian noise, and a dedicated noise image is input to one or more layers of the synthesis network 140. The noise image may be broadcast to all feature maps using learned per-feature scaling factors and then added to the output of the corresponding convolution.
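A sketch of the explicit noise input follows: a single-channel Gaussian noise image broadcast across all feature maps with learned per-feature scaling factors. Initializing the scales to zero is an assumption of this sketch.

```python
import torch
import torch.nn as nn

class NoiseInjection(nn.Module):
    """Adds a single-channel noise image, scaled by learned per-feature
    factors, to the output of a convolution."""
    def __init__(self, num_features):
        super().__init__()
        self.scale = nn.Parameter(torch.zeros(1, num_features, 1, 1))

    def forward(self, x, noise=None):
        if noise is None:  # draw fresh uncorrelated Gaussian noise
            n, _, h, w = x.shape
            noise = torch.randn(n, 1, h, w, device=x.device)
        return x + self.scale * noise  # broadcast across all feature maps
```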
FIG. 2B illustrates a block diagram of the synthesis neural network 140 shown in FIG. 1A, in accordance with an embodiment. The synthesis neural network 140 includes a first processing block 200 and a second processing block 230. In an embodiment, the processing block 200 processes 4×4 resolution feature maps and the processing block 230 processes 8×8 resolution feature maps. One or more additional processing blocks may be included in the synthesis neural network 140 after, before, and/or between the processing blocks 200 and 230.
The first processing block 200 receives the first intermediate data, first spatial noise, and second spatial noise. In an embodiment, the first spatial noise is scaled by learned per-channel scaling factors before being combined with (e.g., added to) the first intermediate data. In an embodiment, the first spatial noise, the second spatial noise, third spatial noise, and fourth spatial noise are independent per-pixel Gaussian noise.
The first processing block 200 also receives the first style signal and the second style signal. As previously explained, the style signals may be obtained by processing the intermediate latent code according to learned affine transformations. A learned affine transformation specializes w to a style y = (y_s, y_b) that controls the adaptive instance normalization (AdaIN) operations implemented by modules 220 in the synthesis neural network 140. Due to its efficiency and compact representation, AdaIN is particularly well suited for implementation in the style-based generator system 100.
The AdaIN operation is defined as

AdaIN(x_i, y) = y_{s,i} · (x_i − μ(x_i)) / σ(x_i) + y_{b,i},    (Equation 1)

where each feature map x_i is normalized separately and then scaled and biased using the corresponding scalar components from the style y. Thus, the dimensionality of y is twice the number of feature maps on that layer. In an embodiment, the dimensionality of a style signal is a multiple of the number of feature maps in the layer where the style signal is applied. In contrast to conventional style transfer, the spatially invariant style y is computed from the vector w instead of from an example image.
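A direct sketch of Equation 1 follows, assuming feature maps in NCHW layout and style components y_s, y_b of shape (batch, channels):

```python
import torch

def adain(x, y_s, y_b, eps=1e-8):
    """Normalize each feature map x_i to zero mean and unit variance per
    instance, then scale by y_s,i and bias by y_b,i (Equation 1)."""
    mu = x.mean(dim=(2, 3), keepdim=True)          # per-feature-map mean
    sigma = x.std(dim=(2, 3), keepdim=True) + eps  # per-feature-map std
    return y_s[:, :, None, None] * (x - mu) / sigma + y_b[:, :, None, None]
```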
The effects of each style signal are localized in the synthesis neural network 140, i.e., modifying a specific subset of the style signals can be expected to affect only certain attributes of the image represented by the output data. To see the reason for the localization, consider how the AdaIN operation (Equation 1) implemented by the module 220 first normalizes each channel to zero mean and unit variance, and only then applies scales and biases based on the style signal. The new per-channel statistics, as dictated by the style, modify the relative importance of features for the subsequent convolution operation, but because of the normalization, the new per-channel statistics do not depend on the original statistics. Thus, each style signal controls only a predetermined number of convolutions 225 before being overridden by the next AdaIN operation. In an embodiment, the scaled spatial noise is added to the features after each convolution and before processing by another module 220.
Each module 220 may be followed by a convolutional layer 225. In an embodiment, the convolutional layer 225 applies a 3×3 convolution kernel to the input. Within the processing block 200, the second intermediate data output by the convolutional layer 225 is combined with the second spatial noise and input to a second module 220 that applies the second style signal to generate the output of the processing block 200. In an embodiment, the spatial noise is scaled by learned per-channel scaling factors before being combined with (e.g., added to) the second intermediate data.
The processing block 230 receives the feature maps output by the processing block 200, and the feature maps are upsampled by an upsampler 235. In an embodiment, 4×4 feature maps are upsampled by the upsampler 235 to generate 8×8 feature maps. The upsampled feature maps are input to another convolutional layer 225 to produce third intermediate data. Within the processing block 230, the third intermediate data is combined with the third spatial noise and input to a third module 220 that applies a third style signal via an AdaIN operation. In an embodiment, the third spatial noise is scaled by learned per-channel scaling factors before being combined with (e.g., added to) the third intermediate data. The output of the third module 220 is processed by another convolutional layer 225 to produce fourth intermediate data. The fourth intermediate data is combined with the fourth spatial noise and input to a fourth module 220 that applies a fourth style signal via an AdaIN operation. In an embodiment, the fourth spatial noise is scaled by learned per-channel scaling factors before being combined with (e.g., added to) the fourth intermediate data.
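Putting the pieces together, the following sketch mirrors the structure of processing block 230 (upsample, then two rounds of convolution, noise addition, and AdaIN), reusing the StyleConversion, NoiseInjection, and adain sketches above; bilinear upsampling is an assumption of this sketch.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SynthesisBlock(nn.Module):
    """Sketch of one processing block: upsample, then two rounds of
    (3x3 conv -> add scaled noise -> AdaIN with a style signal)."""
    def __init__(self, in_ch, out_ch, w_dim=512):
        super().__init__()
        self.conv1 = nn.Conv2d(in_ch, out_ch, 3, padding=1)
        self.conv2 = nn.Conv2d(out_ch, out_ch, 3, padding=1)
        self.noise1, self.noise2 = NoiseInjection(out_ch), NoiseInjection(out_ch)
        self.style1 = StyleConversion(w_dim, out_ch)
        self.style2 = StyleConversion(w_dim, out_ch)

    def forward(self, x, w):
        x = F.interpolate(x, scale_factor=2, mode="bilinear")  # upsampler 235
        x = adain(self.noise1(self.conv1(x)), *self.style1(w))
        x = adain(self.noise2(self.conv2(x)), *self.style2(w))
        return x
```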
In an embodiment, the resolution of the output data is 1024², and the synthesis neural network 140 includes 18 layers, two for each power-of-two resolution (4²–1024²). The output of the last layer of the synthesis neural network 140 may be converted to RGB using a separate 1×1 convolution. In an embodiment, the synthesis neural network 140 has a total of 26.2M trainable parameters, compared to 23.1M in a conventional generator with the same numbers of layers and feature maps.
Introducing spatial noise affects only the stochastic aspects of the output data, leaving the overall composition and high-level attributes, such as identity, intact. Separate noise inputs to the synthesis neural network 140 enable stochastic variation to be applied to different subsets of layers. Applying a spatial noise input to a particular layer of the synthesis neural network 140 leads to stochastic variation at a scale that matches the scale of that particular layer.
The effect of noise appears tightly localized in the synthesis neural network 140. At every point in the synthesis neural network 140, there is pressure to introduce new content as soon as possible, and the easiest way for the synthesis neural network 140 to create stochastic variation is to rely on the spatial noise inputs. A fresh set of spatial noise is available for every layer in the synthesis neural network 140, so there is no incentive to generate the stochastic effects from earlier activations, leading to a localized effect. Thus, the noise affects only inconsequential stochastic variation (differently combed hair, beard, etc.). In contrast, changes to the style signals have global effects (changing pose, identity, etc.).
In the synthesis neural network 140, when the output data is an image, the style signals affect the entire image because complete feature maps are scaled and biased with the same values. Therefore, global effects such as pose, lighting, or background style can be controlled coherently. Meanwhile, the spatial noise is added independently to each pixel and is thus ideally suited for controlling stochastic variation. If the synthesis neural network 140 tried to control, e.g., pose using the noise, that would lead to spatially inconsistent decisions that would be penalized during training. Thus, the synthesis neural network 140 learns to use the global and local channels appropriately, without explicit guidance.
FIG. 2C illustrates a flowchart of a method 250 for applying spatial noise using the style-based generator system 100, in accordance with an embodiment. The method 250 may be performed by a program, custom circuitry, or by a combination of custom circuitry and a program. For example, the method 250 may be executed by a GPU (graphics processing unit), a CPU (central processing unit), or any processor capable of performing the operations of the style-based generator system 100. Furthermore, persons skilled in the art will understand that any system that performs the method 250 is within the scope and spirit of embodiments of the present invention.
At step 255, a first set of spatial noise is applied at a first layer of the synthesis neural network 140 to produce first intermediate data including content corresponding to source data modified based on the first set of spatial noise. In an embodiment, the source data is the first intermediate data and the first layer is a layer including the module 220 and/or the convolutional layer 225. At step 258, the modified first intermediate data is processed by one or more processing layers 225 to produce second intermediate data. At step 260, a second set of spatial noise is applied at a second layer of the synthesis neural network 140 to produce second intermediate data including content corresponding to the first intermediate data modified based on the second set of spatial noise. In an embodiment, the first intermediate data is modified by at least the module 220 to produce the second intermediate data. At step 265, the second intermediate data is processed to produce output data including content corresponding to the second intermediate data. In an embodiment, the second intermediate data is processed by another module 220 and the block 230 to produce the output data.
Noise may be injected into the layers of the synthesis neural network 140 to cause synthesis of stochastic variation at a scale corresponding to the layer. Importantly, the noise should be injected during both training and generation. Additionally, during generation, the strength of the noise may be modified to further control the "look" of the output data. Providing style signals instead of inputting the latent code directly into the synthesis neural network 140, combined with noise injected directly into the synthesis neural network 140, leads to automatic, unsupervised separation of high-level attributes (e.g., pose, identity) from stochastic variation (e.g., freckles, hair) in the generated images, and enables intuitive scale-specific mixing and interpolation operations.
In particular, the style signals directly adjust the strength of image attributes at different scales in the synthesis neural network 140. During generation, the style signals may be used to modify selected image attributes. Additionally, during training, the mapping neural network 110 may be configured to perform style mixing regularization to improve localization of the styles.
The mapping neural network 110 embeds the input latent code into an intermediate latent space, which has a profound effect on how the factors of variation are represented in the synthesis neural network 140. The input latent space follows the probability density of the training data, which likely leads to some degree of unavoidable entanglement. The intermediate latent space is free from that restriction and is therefore allowed to be disentangled. Compared with a conventional generator architecture, the style-based generator system 100 admits a more linear, less entangled representation of the different factors of variation. In an embodiment, replacing a conventional generator with a style-based generator may not require modifying any other components of the training framework (loss function, discriminator, optimization method, etc.).
The style-based generative neural network 100 may be trained using, for example, a GAN (generative adversarial network), a VAE (variational autoencoder) framework, a flow-based framework, or the like. FIG. 2D illustrates a block diagram of a GAN 270 training framework, in accordance with an embodiment. The GAN 270 may be implemented by a program, custom circuitry, or by a combination of custom circuitry and a program. For example, the GAN 270 may be implemented using a GPU, a CPU, or any processor capable of performing the operations described herein. Furthermore, persons of ordinary skill in the art will understand that any system that performs the operations of the GAN 270 is within the scope and spirit of embodiments of the present invention.
The GAN 270 includes a generator, such as the style-based generator system 100, a discriminator (neural network) 275, and a training loss unit 280. The topologies of both the generator and the discriminator 275 may be modified during training. The GAN 270 may operate in an unsupervised setting or in a conditional setting. The style-based generator system 100 receives input data (e.g., at least one latent code and/or noise input) and produces output data. Depending on the task, the output data may be images, audio, video, or other types of data (e.g., configuration settings). The discriminator 275 is an adaptive loss function that is used during training of the style-based generator system 100. The style-based generator system 100 and the discriminator 275 are trained simultaneously using a training dataset that includes example output data with which the output data generated by the style-based generator system 100 should be consistent. The style-based generator system 100 generates output data in response to the input data, and the discriminator 275 determines whether the output data appears similar to the example output data included in the training data. Based on this determination, the parameters of the discriminator 275 and/or the style-based generative neural network 100 are adjusted.
In the unsupervised setting, the discriminator 275 outputs a continuous value indicating how well the output data matches the example output data. For example, in an embodiment, the discriminator 275 outputs a first training stimulus (e.g., a high value) when the output data is determined to match the example output data, and a second training stimulus (e.g., a low value) when the output data is determined to not match the example output data. The training loss unit 280 adjusts the parameters (weights) of the GAN 270 based on the output of the discriminator 275. When the style-based generator system 100 is trained for a specific task, such as generating images of human faces, the discriminator outputs high values when the output data is an image of a human face. The output data generated by the style-based generator system 100 is not required to be identical to the example output data for the discriminator 275 to determine that the output data matches the example output data. In the context of the following description, the discriminator 275 determines that the output data matches the example output data when the output data is perceptually similar to any of the example output data.
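A sketch of one training iteration in the unsupervised setting follows, assuming the non-saturating logistic GAN loss; the framework described above does not mandate this particular loss, and the names are illustrative.

```python
import torch
import torch.nn.functional as F

def gan_training_step(generator, discriminator, g_opt, d_opt, real, z_dim=512):
    """One discriminator update followed by one generator update."""
    z = torch.randn(real.shape[0], z_dim, device=real.device)

    # Discriminator: push scores of real data up and of generated data down.
    d_opt.zero_grad()
    fake = generator(z).detach()
    d_loss = F.softplus(discriminator(fake)).mean() + \
             F.softplus(-discriminator(real)).mean()
    d_loss.backward()
    d_opt.step()

    # Generator: fool the discriminator (non-saturating loss).
    g_opt.zero_grad()
    g_loss = F.softplus(-discriminator(generator(z))).mean()
    g_loss.backward()
    g_opt.step()
```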
In the conditional setting, the inputs to the style-based generative neural network 100 may include other data, such as an image, a classification label, segmentation contours, and other (additional) types of data (distribution, audio, etc.). The additional data may be specified in addition to the random latent code, or the additional data may replace the random latent code entirely. The training dataset may include input/output data pairs, and the task of the discriminator 275 may be to determine whether the output of the style-based generative neural network 100 appears consistent with the input, based on the example input/output pairs that the discriminator 275 has seen in the training data.
In an embodiment, the style-based generative neural network 100 may be trained using a progressive growing technique. In an embodiment, the mapping neural network 110 and/or the synthesis neural network 140 are initially implemented as the generative neural network portion of a GAN and trained using the progressive growing technique described by Karras et al., "Progressive Growing of GANs for Improved Quality, Stability, and Variation," Sixth International Conference on Learning Representations (ICLR), (April 30, 2018), which is incorporated by reference herein in its entirety.
并行处理架构Parallel processing architecture
图3示出了根据一个实施例的并行处理单元(PPU)300。在一个实施例中,PPU 300是在一个或更多个集成电路器件上实现的多线程处理器。PPU 300是设计用于并行处理许多线程的延迟隐藏体系架构。线程(即,执行线程)是被配置为由PPU 300执行的指令集的实例。在一个实施例中,PPU 300是图形处理单元(GPU),其被配置为实现用于处理三维(3D)图形数据的图形渲染管线,以便生成用于在显示装置(诸如液晶显示(LCD)设备)上显示的二维(2D)图像数据。在另一个实施例中,PPU 300被配置为实现神经网络系统100。在其他实施例中,PPU 300可以用于执行通用计算。尽管为了说明的目的本文提供了一个示例性并行处理器,但应特别指出的是,该处理器仅出于说明目的进行阐述,并且可使用任何处理器来补充和/或替代该处理器。FIG. 3 illustrates a parallel processing unit (PPU) 300 according to one embodiment. In one embodiment, PPU 300 is a multithreaded processor implemented on one or more integrated circuit devices. PPU 300 is a latency-hiding architecture designed for processing many threads in parallel. A thread (i.e., an execution thread) is an instance of an instruction set configured to be executed by PPU 300. In one embodiment, PPU 300 is a graphics processing unit (GPU) configured to implement a graphics rendering pipeline for processing three-dimensional (3D) graphics data to generate two-dimensional (2D) image data for display on a display device (such as a liquid crystal display (LCD) device). In another embodiment, PPU 300 is configured to implement neural network system 100. In other embodiments, PPU 300 can be used to perform general-purpose computations. Although an exemplary parallel processor is provided herein for illustrative purposes, it should be specifically noted that the processor is described only for illustrative purposes and any processor can be used to supplement and/or replace the processor.
一个或更多个PPU 300可以被配置为加速数千个高性能计算(HPC)、数据中心,云计算和机器学习应用。PPU 300可被配置为加速众多深度学习系统和应用,用于自动驾驶汽车、模拟、射线或路径追踪等计算图形、深度学习、高精度语音、图像和文本识别系统、智能视频分析、分子模拟、药物研发、疾病诊断、天气预报、大数据分析、天文学、分子动力学模拟、金融建模、机器人技术、工厂自动化、实时语言翻译、在线搜索优化和个性化用户推荐,等等。One or more PPU 300s can be configured to accelerate thousands of high-performance computing (HPC), data center, cloud computing and machine learning applications. PPU 300 can be configured to accelerate numerous deep learning systems and applications for self-driving cars, simulations, computational graphics such as ray or path tracing, deep learning, high-precision speech, image and text recognition systems, intelligent video analysis, molecular simulations, drug development, disease diagnosis, weather forecasting, big data analysis, astronomy, molecular dynamics simulations, financial modeling, robotics, factory automation, real-time language translation, online search optimization and personalized user recommendations, etc.
如图3所示,PPU 300包括输入/输出(I/O)单元305、前端单元315、调度器单元320、工作分配单元325、集线器330、交叉开关(Xbar)370、一个或更多个通用处理集群(GPC)350以及一个或更多个存储器分区单元380。PPU 300可以经由一个或更多个高速NVLink 310互连连接到主机处理器或其他PPU 300。PPU 300可以经由互连302连接到主机处理器或其他外围设备。PPU 300还可以连接到包括多个存储器设备的本地存储器304。在一个实施例中,本地存储器可以包括多个动态随机存取存储器(DRAM)设备。DRAM设备可以被配置为高带宽存储器(HBM)子系统,其中多个DRAM裸晶(die)堆叠在每个设备内。As shown in FIG3 , the PPU 300 includes an input/output (I/O) unit 305, a front-end unit 315, a scheduler unit 320, a work distribution unit 325, a hub 330, a crossbar switch (Xbar) 370, one or more general processing clusters (GPCs) 350, and one or more memory partition units 380. The PPU 300 can be connected to a host processor or other PPUs 300 via one or more high-speed NVLink 310 interconnects. The PPU 300 can be connected to a host processor or other peripheral devices via an interconnect 302. The PPU 300 can also be connected to a local memory 304 including multiple memory devices. In one embodiment, the local memory may include multiple dynamic random access memory (DRAM) devices. The DRAM device can be configured as a high bandwidth memory (HBM) subsystem in which multiple DRAM dies are stacked within each device.
NVLink 310互连使得系统能够扩展并且包括与一个或更多个CPU结合的一个或更多个PPU 300,支持PPU 300和CPU之间的高速缓存一致性,以及CPU主控。数据和/或命令可以由NVLink 310通过集线器330发送到PPU 300的其他单元或从其发送,例如一个或更多个复制引擎、视频编码器、视频解码器、电源管理单元等(未明确示出)。结合图5B更详细地描述NVLink 310。The NVLink 310 interconnect enables the system to scale and include one or more PPUs 300 in conjunction with one or more CPUs, supports cache coherence between the PPU 300 and the CPU, and CPU mastering. Data and/or commands may be sent by the NVLink 310 to or from other units of the PPU 300, such as one or more copy engines, video encoders, video decoders, power management units, etc. (not explicitly shown), through the hub 330. NVLink 310 is described in more detail in conjunction with FIG. 5B.
I/O单元305被配置为通过互连302从主机处理器(未示出)发送和接收通信(例如,命令、数据等)。I/O单元305可以经由互连302直接与主机处理器通信,或通过一个或更多个中间设备(诸如内存桥)与主机处理器通信。在一个实施例中,I/O单元305可以经由互连302与一个或更多个其他处理器(例如,一个或更多个PPU 300)通信。在一个实施例中,I/O单元305实现外围组件互连高速(PCIe)接口,用于通过PCIe总线进行通信,并且互连302是PCIe总线。在替代的实施例中,I/O单元305可以实现其他类型的已知接口,用于与外部设备进行通信。I/O unit 305 is configured to send and receive communications (e.g., commands, data, etc.) from a host processor (not shown) via interconnect 302. I/O unit 305 may communicate with the host processor directly via interconnect 302, or through one or more intermediate devices (such as a memory bridge). In one embodiment, I/O unit 305 may communicate with one or more other processors (e.g., one or more PPUs 300) via interconnect 302. In one embodiment, I/O unit 305 implements a peripheral component interconnect express (PCIe) interface for communicating over a PCIe bus, and interconnect 302 is a PCIe bus. In alternative embodiments, I/O unit 305 may implement other types of known interfaces for communicating with external devices.
I/O单元305对经由互连302接收的数据包进行解码。在一个实施例中,数据包表示被配置为使PPU 300执行各种操作的命令。I/O单元305按照命令指定将解码的命令发送到PPU 300的各种其他单元。例如,一些命令可以被发送到前端单元315。其他命令可以被发送到集线器330或PPU 300的其他单元,诸如一个或更多个复制引擎、视频编码器、视频解码器、电源管理单元等(未明确示出)。换句话说,I/O单元305被配置为在PPU 300的各种逻辑单元之间和之中路由通信。I/O unit 305 decodes packets received via interconnect 302. In one embodiment, the packets represent commands configured to cause PPU 300 to perform various operations. I/O unit 305 sends the decoded commands to various other units of PPU 300 as specified by the commands. For example, some commands may be sent to front end unit 315. Other commands may be sent to hub 330 or other units of PPU 300, such as one or more copy engines, video encoders, video decoders, power management units, etc. (not explicitly shown). In other words, I/O unit 305 is configured to route communications between and among various logical units of PPU 300.
在一个实施例中,由主机处理器执行的程序在缓冲区中对命令流进行编码,该缓冲区向PPU 300提供工作量用于处理。工作量可以包括要由那些指令处理的许多指令和数据。缓冲区是存储器中可由主机处理器和PPU 300两者访问(例如,读/写)的区域。例如,I/O单元305可以被配置为经由通过互连302传输的存储器请求访问连接到互连302的系统存储器中的缓冲区。在一个实施例中,主机处理器将命令流写入缓冲区,然后向PPU 300发送指向命令流开始的指针。前端单元315接收指向一个或更多个命令流的指针。前端单元315管理一个或更多个流,从流读取命令并将命令转发到PPU 300的各个单元。In one embodiment, a program executed by a host processor encodes a command stream in a buffer that provides a workload to the PPU 300 for processing. The workload may include many instructions and data to be processed by those instructions. A buffer is an area in memory that is accessible (e.g., read/write) by both the host processor and the PPU 300. For example, the I/O unit 305 may be configured to access a buffer in a system memory connected to the interconnect 302 via a memory request transmitted through the interconnect 302. In one embodiment, the host processor writes a command stream to the buffer and then sends a pointer to the start of the command stream to the PPU 300. The front end unit 315 receives a pointer to one or more command streams. The front end unit 315 manages one or more streams, reads commands from the streams, and forwards the commands to the various units of the PPU 300.
前端单元315耦合到调度器单元320,其配置各种GPC 350以处理由一个或更多个流定义的任务。调度器单元320被配置为跟踪与由调度器单元320管理的各种任务相关的状态信息。状态可以指示任务被指派给哪个GPC 350,该任务是活动的还是不活动的,与该任务相关联的优先级等等。调度器单元320管理一个或更多个GPC 350上的多个任务的执行。The front end unit 315 is coupled to a scheduler unit 320, which configures the various GPCs 350 to process the tasks defined by one or more streams. The scheduler unit 320 is configured to track state information related to the various tasks managed by the scheduler unit 320. The state may indicate which GPC 350 the task is assigned to, whether the task is active or inactive, a priority associated with the task, etc. The scheduler unit 320 manages the execution of multiple tasks on one or more GPCs 350.
调度器单元320耦合到工作分配单元325,其被配置为分派任务以在GPC 350上执行。工作分配单元325可以跟踪从调度器单元320接收到的若干调度的任务。在一个实施例中,工作分配单元325为每个GPC 350管理待处理(pending)任务池和活动任务池。待处理任务池可以包括若干时隙(例如,32个时隙),其包含被指派为由特定GPC 350处理的任务。活动任务池可以包括若干时隙(例如,4个时隙),用于正在由GPC 350主动处理的任务。当GPC350完成任务的执行时,该任务从GPC 350的活动任务池中逐出,并且来自待处理任务池的其他任务之一被选择和调度以在GPC 350上执行。如果GPC 350上的活动任务已经空闲,例如在等待数据依赖性被解决时,那么活动任务可以从GPC 350中逐出并返回到待处理任务池,而待处理任务池中的另一个任务被选择并调度以在GPC 350上执行。Scheduler unit 320 is coupled to work distribution unit 325, which is configured to dispatch tasks for execution on GPC 350. Work distribution unit 325 may keep track of a number of scheduled tasks received from scheduler unit 320. In one embodiment, work distribution unit 325 manages a pending task pool and an active task pool for each GPC 350. The pending task pool may include a number of time slots (e.g., 32 time slots) containing tasks assigned to be processed by a particular GPC 350. The active task pool may include a number of time slots (e.g., 4 time slots) for tasks that are being actively processed by GPC 350. When a GPC 350 completes execution of a task, the task is evicted from the active task pool of GPC 350, and one of the other tasks from the pending task pool is selected and scheduled for execution on GPC 350. If an active task on a GPC 350 has become idle, such as while waiting for a data dependency to be resolved, then the active task may be evicted from the GPC 350 and returned to the pending task pool, while another task in the pending task pool is selected and scheduled for execution on the GPC 350 .
工作分配单元325经由XBar(交叉开关)370与一个或更多个GPC 350通信。XBar370是将PPU 300的许多单元耦合到PPU 300的其他单元的互连网络。例如,XBar 370可以被配置为将工作分配单元325耦合到特定的GPC 350。虽然没有明确示出,但PPU 300的一个或更多个其他单元也可以经由集线器330连接到XBar 370。Work distribution unit 325 communicates with one or more GPCs 350 via XBar (crossbar) 370. XBar 370 is an interconnect network that couples many units of PPU 300 to other units of PPU 300. For example, XBar 370 can be configured to couple work distribution unit 325 to a particular GPC 350. Although not explicitly shown, one or more other units of PPU 300 can also be connected to XBar 370 via hub 330.
任务由调度器单元320管理并由工作分配单元325分派给GPC 350。GPC 350被配置为处理任务并生成结果。结果可以由GPC 350内的其他任务消耗,经由XBar 370路由到不同的GPC 350,或者存储在存储器304中。结果可以经由存储器分区单元380写入存储器304,存储器分区单元380实现用于从存储器304读取数据和向存储器304写入数据的存储器接口。结果可以通过NVLink 310发送到另一个PPU 300或CPU。在一个实施例中,PPU 300包括数目为U的存储器分区单元380,其等于耦合到PPU 300的独立且不同的存储器设备304的数目。下面将结合图4B更详细地描述存储器分区单元380。Tasks are managed by the scheduler unit 320 and dispatched to a GPC 350 by the work distribution unit 325. The GPC 350 is configured to process the tasks and generate results. The results may be consumed by other tasks within the GPC 350, routed to a different GPC 350 via the XBar 370, or stored in the memory 304. The results can be written to the memory 304 via the memory partition units 380, which implement a memory interface for reading data from and writing data to the memory 304. The results can be transmitted to another PPU 300 or a CPU via the NVLink 310. In one embodiment, the PPU 300 includes a number U of memory partition units 380 that is equal to the number of separate and distinct memory devices 304 coupled to the PPU 300. The memory partition unit 380 will be described in more detail below in conjunction with FIG. 4B.
在一个实施例中,主机处理器执行实现应用程序编程接口(API)的驱动程序内核,其使得能够在主机处理器上执行一个或更多个应用程序以调度操作用于在PPU 300上执行。在一个实施例中,多个计算应用由PPU 300同时执行,并且PPU 300为多个计算应用程序提供隔离、服务质量(QoS)和独立地址空间。应用程序可以生成指令(例如,API调用),其使得驱动程序内核生成一个或更多个任务以由PPU 300执行。驱动程序内核将任务输出到正在由PPU 300处理的一个或更多个流。每个任务可以包括一个或更多个相关线程组,本文称为线程束(warp)。在一个实施例中,线程束包括可以并行执行的32个相关线程。协作线程可以指代包括执行任务的指令并且可以通过共享存储器交换数据的多个线程。结合图5A更详细地描述线程和协作线程。In one embodiment, the host processor executes a driver kernel that implements an application programming interface (API) that enables one or more applications to be executed on the host processor to schedule operations for execution on the PPU 300. In one embodiment, multiple computing applications are executed simultaneously by the PPU 300, and the PPU 300 provides isolation, quality of service (QoS), and independent address spaces for multiple computing applications. The application can generate instructions (e.g., API calls) that cause the driver kernel to generate one or more tasks to be executed by the PPU 300. The driver kernel outputs the tasks to one or more streams being processed by the PPU 300. Each task can include one or more related thread groups, referred to herein as warps. In one embodiment, a warp includes 32 related threads that can be executed in parallel. Collaborative threads can refer to multiple threads that include instructions to execute tasks and can exchange data through shared memory. Threads and collaborative threads are described in more detail in conjunction with FIG. 5A.
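下面是一个最小的主机端草图(使用标准CUDA运行时API;内核名scale为示意性的),示意应用程序如何将内核启动排入流中:每次启动成为一个由线程块组成的网格,线程块再被硬件划分为32线程的线程束。其与PPU 300内部单元的确切对应关系仅为说明。A minimal host-side sketch follows (standard CUDA runtime API; the kernel name scale is illustrative), showing how an application might enqueue kernel launches into a stream: each launch becomes a grid of thread blocks, which the hardware splits into 32-thread warps. The exact mapping onto the internal units of the PPU 300 is illustrative only.

#include <cuda_runtime.h>
#include <cstdio>

// Trivial kernel: each thread processes one element.
__global__ void scale(float* data, float factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // unique thread ID
    if (i < n)
        data[i] *= factor;
}

int main()
{
    const int n = 1 << 20;
    float* d_data;
    cudaMalloc(&d_data, n * sizeof(float));

    // Commands issued into a stream are executed by the device in order.
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Each launch is a grid of 4096 blocks of 256 threads (8 warps per block).
    scale<<<(n + 255) / 256, 256, 0, stream>>>(d_data, 2.0f, n);
    scale<<<(n + 255) / 256, 256, 0, stream>>>(d_data, 0.5f, n);

    cudaStreamSynchronize(stream);  // wait for the stream's work to finish
    cudaStreamDestroy(stream);
    cudaFree(d_data);
    printf("done\n");
    return 0;
}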
图4A示出了根据一个实施例的图3的PPU 300的GPC 350。如图4A所示,每个GPC 350包括用于处理任务的多个硬件单元。在一个实施例中,每个GPC 350包括管线管理器410、预光栅操作单元(PROP)415、光栅引擎425、工作分配交叉开关(WDX)480、存储器管理单元(MMU)490以及一个或更多个数据处理集群(DPC)420。应当理解,图4A的GPC 350可以包括代替图4A中所示单元的其他硬件单元或除图4A中所示单元之外的其他硬件单元。FIG. 4A illustrates a GPC 350 of the PPU 300 of FIG. 3 according to one embodiment. As shown in FIG. 4A, each GPC 350 includes a number of hardware units for processing tasks. In one embodiment, each GPC 350 includes a pipeline manager 410, a pre-raster operations unit (PROP) 415, a raster engine 425, a work distribution crossbar (WDX) 480, a memory management unit (MMU) 490, and one or more Data Processing Clusters (DPCs) 420. It should be understood that the GPC 350 of FIG. 4A may include other hardware units in lieu of or in addition to the units shown in FIG. 4A.
在一个实施例中,GPC 350的操作由管线管理器410控制。管线管理器410管理用于处理分配给GPC 350的任务的一个或更多个DPC 420的配置。在一个实施例中,管线管理器410可以配置一个或更多个DPC 420中的至少一个来实现图形渲染管线的至少一部分。例如,DPC 420可以被配置为在可编程流式多处理器(SM)440上执行顶点着色程序。管线管理器410还可以被配置为将从工作分配单元325接收的数据包路由到GPC 350中适当的逻辑单元。例如,一些数据包可以被路由到PROP 415和/或光栅引擎425中的固定功能硬件单元,而其他数据包可以被路由到DPC 420以供图元引擎435或SM 440处理。在一个实施例中,管线管理器410可以配置一个或更多个DPC 420中的至少一个以实现神经网络模型和/或计算管线。In one embodiment, the operation of the GPC 350 is controlled by a pipeline manager 410. The pipeline manager 410 manages the configuration of one or more DPCs 420 for processing tasks assigned to the GPC 350. In one embodiment, the pipeline manager 410 may configure at least one of the one or more DPCs 420 to implement at least a portion of a graphics rendering pipeline. For example, a DPC 420 may be configured to execute a vertex shading program on a programmable streaming multiprocessor (SM) 440. The pipeline manager 410 may also be configured to route packets received from the work distribution unit 325 to appropriate logic units in the GPC 350. For example, some packets may be routed to fixed-function hardware units in the PROP 415 and/or the raster engine 425, while other packets may be routed to the DPC 420 for processing by the primitive engine 435 or the SM 440. In one embodiment, the pipeline manager 410 may configure at least one of the one or more DPCs 420 to implement a neural network model and/or a computational pipeline.
PROP单元415被配置为将由光栅引擎425和DPC 420生成的数据路由到光栅操作(ROP)单元,结合图4B更详细地描述。PROP单元415还可以被配置为执行颜色混合的优化,组织像素数据,执行地址转换等。PROP unit 415 is configured to route data generated by raster engine 425 and DPC 420 to a raster operations (ROP) unit, described in more detail in conjunction with FIG4B. PROP unit 415 may also be configured to perform optimizations for color blending, organize pixel data, perform address translations, and the like.
光栅引擎425包括被配置为执行各种光栅操作的若干固定功能硬件单元。在一个实施例中,光栅引擎425包括设置引擎、粗光栅引擎、剔除引擎、裁剪引擎、精细光栅引擎和图块聚合引擎。设置引擎接收变换后的顶点并生成与由顶点定义的几何图元关联的平面方程。平面方程被发送到粗光栅引擎以生成图元的覆盖信息(例如,图块的x、y覆盖掩码)。粗光栅引擎的输出被发送到剔除引擎,其中与未通过z-测试的图元相关联的片段被剔除,发送到裁剪引擎,其中位于视锥体之外的片段被裁剪掉。那些经过裁剪和剔除后留下来的片段可以被传递到精细光栅引擎,以基于由设置引擎生成的平面方程生成像素片段的属性。光栅引擎425的输出包括例如要由在DPC 420内实现的片段着色器处理的片段。The raster engine 425 includes several fixed-function hardware units configured to perform various raster operations. In one embodiment, the raster engine 425 includes a setup engine, a coarse raster engine, a culling engine, a clipping engine, a fine raster engine, and a tile aggregation engine. The setup engine receives the transformed vertices and generates plane equations associated with the geometric primitives defined by the vertices. The plane equations are sent to the coarse raster engine to generate coverage information for the primitives (e.g., x, y coverage masks for tiles). The output of the coarse raster engine is sent to the culling engine, where fragments associated with primitives that fail the z-test are culled, and to the clipping engine, where fragments outside the viewing cone are clipped. Those fragments that remain after clipping and culling can be passed to the fine raster engine to generate attributes of pixel fragments based on the plane equations generated by the setup engine. The output of the raster engine 425 includes, for example, fragments to be processed by a fragment shader implemented in the DPC 420.
包括在GPC 350中的每个DPC 420包括M管线控制器(MPC)430、图元引擎435和一个或更多个SM 440。MPC 430控制DPC 420的操作,将从管线管理器410接收到的数据包路由到DPC 420中的适当单元。例如,与顶点相关联的数据包可以被路由到图元引擎435,图元引擎435被配置为从存储器304提取与顶点相关联的顶点属性。相反,与着色程序相关联的数据包可以被发送到SM 440。Each DPC 420 included in the GPC 350 includes an M-Pipe Controller (MPC) 430, a primitive engine 435, and one or more SMs 440. The MPC 430 controls the operation of the DPC 420, routing packets received from the pipeline manager 410 to the appropriate units in the DPC 420. For example, packets associated with vertices may be routed to the primitive engine 435, which is configured to fetch vertex attributes associated with the vertices from the memory 304. In contrast, packets associated with shading programs may be transmitted to the SM 440.
SM 440包括被配置为处理由多个线程表示的任务的可编程流式处理器。每个SM 440是多线程的并且被配置为同时执行来自特定线程组的多个线程(例如,32个线程)。在一个实施例中,SM 440实现SIMD(单指令、多数据)体系架构,其中线程组(例如,线程束)中的每个线程被配置为基于相同的指令集来处理不同的数据集。线程组中的所有线程都执行相同的指令。在另一个实施例中,SM 440实现SIMT(单指令、多线程)体系架构,其中线程组中的每个线程被配置为基于相同的指令集处理不同的数据集,但是其中线程组中的各个线程在执行期间被允许发散。在一个实施例中,为每个线程束维护程序计数器、调用栈和执行状态,当线程束内的线程发散时,实现线程束之间的并发以及线程束内的串行执行。在另一个实施例中,为每个单独的线程维护程序计数器、调用栈和执行状态,从而在线程束内和线程束之间的所有线程之间实现相等的并发。当为每个单独的线程维护执行状态时,执行相同指令的线程可以被收敛并且并行执行以获得最大效率。下面结合图5A更详细地描述SM 440。The SM 440 comprises a programmable streaming processor that is configured to process tasks represented by a number of threads. Each SM 440 is multi-threaded and configured to execute a plurality of threads (e.g., 32 threads) from a particular group of threads concurrently. In one embodiment, the SM 440 implements a SIMD (Single-Instruction, Multiple-Data) architecture where each thread in a group of threads (e.g., a warp) is configured to process a different set of data based on the same set of instructions. All threads in the group of threads execute the same instructions. In another embodiment, the SM 440 implements a SIMT (Single-Instruction, Multiple-Thread) architecture where each thread in a group of threads is configured to process a different set of data based on the same set of instructions, but where individual threads in the group of threads are allowed to diverge during execution. In one embodiment, a program counter, call stack, and execution state are maintained for each warp, enabling concurrency between warps and serial execution within warps when the threads within the warp diverge. In another embodiment, a program counter, call stack, and execution state are maintained for each individual thread, enabling equal concurrency between all threads, within and between warps. When execution state is maintained for each individual thread, threads executing the same instructions may be converged and executed in parallel for maximum efficiency. The SM 440 is described in more detail below in conjunction with FIG. 5A.
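下面是一个最小的CUDA草图(内核名divergent为示意性的),演示上述SIMT发散行为:同一线程束中走不同分支的线程被串行执行,之后线程束在分支后的指令处重新汇聚。A minimal CUDA sketch follows (the kernel name divergent is illustrative), demonstrating the SIMT divergence behavior described above: threads of the same warp that take different branches execute serially, after which the warp reconverges at the instruction following the branch.

#include <cuda_runtime.h>

__global__ void divergent(int* out)
{
    int lane = threadIdx.x % 32;      // lane index within the warp
    // Half of each warp takes each branch: under SIMT the two paths
    // are executed serially for the warp.
    if (lane < 16)
        out[threadIdx.x] = lane * 2;
    else
        out[threadIdx.x] = lane + 100;
    // Reconverged: all 32 threads of the warp execute this add together.
    out[threadIdx.x] += 1;
}

int main()
{
    int* d_out;
    cudaMalloc(&d_out, 64 * sizeof(int));
    divergent<<<1, 64>>>(d_out);      // one block of two 32-thread warps
    cudaDeviceSynchronize();
    cudaFree(d_out);
    return 0;
}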
MMU 490提供GPC 350和存储器分区单元380之间的接口。MMU 490可以提供虚拟地址到物理地址的转换、存储器保护以及存储器请求的仲裁。在一个实施例中,MMU 490提供用于执行从虚拟地址到存储器304中的物理地址的转换的一个或更多个转换后备缓冲器(TLB)。MMU 490 provides an interface between GPC 350 and memory partition unit 380. MMU 490 can provide virtual address to physical address translation, memory protection, and arbitration of memory requests. In one embodiment, MMU 490 provides one or more translation lookaside buffers (TLBs) for performing translations from virtual addresses to physical addresses in memory 304.
图4B示出了根据一个实施例的图3的PPU 300的存储器分区单元380。如图4B所示,存储器分区单元380包括光栅操作(ROP)单元450、二级(L2)高速缓存460和存储器接口470。存储器接口470耦合到存储器304。存储器接口470可以实现用于高速数据传输的32、64、128、1024位数据总线等。在一个实施例中,PPU 300合并了U个存储器接口470,每对存储器分区单元380有一个存储器接口470,其中每对存储器分区单元380连接到对应的存储器设备304。例如,PPU 300可以连接到多达Y个存储器设备,诸如高带宽存储器堆叠或图形双数据速率版本5的同步动态随机存取存储器或其他类型的持久存储器。FIG4B illustrates a memory partition unit 380 of the PPU 300 of FIG3 according to one embodiment. As shown in FIG4B , the memory partition unit 380 includes a raster operation (ROP) unit 450, a level 2 (L2) cache 460, and a memory interface 470. The memory interface 470 is coupled to the memory 304. The memory interface 470 may implement a 32, 64, 128, 1024 bit data bus, etc., for high speed data transfer. In one embodiment, the PPU 300 incorporates U memory interfaces 470, one for each pair of memory partition units 380, wherein each pair of memory partition units 380 is connected to a corresponding memory device 304. For example, the PPU 300 may be connected to up to Y memory devices, such as a high bandwidth memory stack or a graphics double data rate version 5 synchronous dynamic random access memory or other types of persistent memory.
在一个实施例中,存储器接口470实现HBM2存储器接口,并且Y等于U的一半。在一个实施例中,HBM2存储器堆叠位于与PPU 300相同的物理封装上,提供与常规GDDR5 SDRAM系统相比显著的功率和面积节约。在一个实施例中,每个HBM2堆叠包括四个存储器裸晶并且Y等于4,其中HBM2堆叠包括每个裸晶两个128位通道,总共8个通道和1024位的数据总线宽度。In one embodiment, the memory interface 470 implements an HBM2 memory interface, and Y is equal to half of U. In one embodiment, the HBM2 memory stacks are located on the same physical package as the PPU 300, providing significant power and area savings compared to conventional GDDR5 SDRAM systems. In one embodiment, each HBM2 stack includes four memory dies and Y is equal to 4, where each HBM2 stack includes two 128-bit channels per die, for a total of 8 channels and a data bus width of 1024 bits.
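作为核对,所述1024位数据总线宽度可由每个裸晶的通道布局直接得出:As a check, the stated 1024-bit data bus width follows directly from the per-die channel layout:

4 \text{ dies} \times 2\ \tfrac{\text{channels}}{\text{die}} \times 128\ \tfrac{\text{bits}}{\text{channel}} = 1024 \text{ bits}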
在一个实施例中,存储器304支持单错校正双错检测(SECDED)纠错码(ECC)以保护数据。对于对数据损毁敏感的计算应用程序,ECC提供了更高的可靠性。在大型集群计算环境中,PPU 300处理非常大的数据集和/或长时间运行应用程序,可靠性尤其重要。In one embodiment, memory 304 supports single error correction double error detection (SECDED) error correction code (ECC) to protect data. ECC provides higher reliability for computing applications that are sensitive to data corruption. Reliability is particularly important in large cluster computing environments where PPU 300 processes very large data sets and/or long-running applications.
在一个实施例中,PPU 300实现多级存储器层次。在一个实施例中,存储器分区单元380支持统一存储器以为CPU和PPU 300存储器提供单个统一的虚拟地址空间,使得虚拟存储器系统之间的数据能够共享。在一个实施例中,跟踪PPU 300对位于其他处理器上的存储器的访问频率,以确保存储器页面被移动到更频繁地访问该页面的PPU 300的物理存储器。在一个实施例中,NVLink 310支持地址转换服务,其允许PPU 300直接访问CPU的页表并且提供由PPU 300对CPU存储器的完全访问。In one embodiment, the PPU 300 implements a multi-level memory hierarchy. In one embodiment, the memory partition unit 380 supports unified memory to provide a single unified virtual address space for the CPU and PPU 300 memory, enabling data sharing between virtual memory systems. In one embodiment, the frequency of PPU 300 accesses to memory located on other processors is tracked to ensure that memory pages are moved to the physical memory of the PPU 300 that accesses the page more frequently. In one embodiment, the NVLink 310 supports address translation services that allow the PPU 300 to directly access the CPU's page tables and provide full access to the CPU memory by the PPU 300.
在一个实施例中,复制引擎在多个PPU 300之间或在PPU 300与CPU之间传输数据。复制引擎可以为未映射到页表的地址生成页面错误。然后,存储器分区单元380可以服务页面错误,将地址映射到页表中,之后复制引擎可以执行传输。在常规系统中,存储器被固定(例如,不可分页)以用于多个处理器之间的多个复制引擎操作,这显著减少了可用存储器。借助硬件分页错误,地址可以传递到复制引擎而不用担心存储器页面是否驻留,并且复制过程是透明的。In one embodiment, the copy engines transfer data between multiple PPUs 300 or between a PPU 300 and a CPU. The copy engines can generate page faults for addresses that are not mapped into the page tables. The memory partition unit 380 can then service the page faults, mapping the addresses into the page table, after which the copy engine can perform the transfer. In conventional systems, memory is pinned (e.g., non-pageable) for multiple copy engine operations between multiple processors, substantially reducing the available memory. With hardware page faulting, addresses can be passed to the copy engines without worrying whether the memory pages are resident, and the copy process is transparent.
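下面是一个假设性的统一存储器草图,使用CUDA运行时的cudaMallocManaged和cudaMemPrefetchAsync来示意上文描述的按需页面迁移;这是一个说明性示例,而不是对复制引擎本身的直接编程。A hypothetical unified-memory sketch follows, using cudaMallocManaged and cudaMemPrefetchAsync from the CUDA runtime to illustrate the on-demand page migration described above; it is an illustrative example, not direct programming of the copy engines themselves.

#include <cuda_runtime.h>

__global__ void increment(int* data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] += 1;
}

int main()
{
    const int n = 1 << 20;
    int* data;
    // One pointer valid on both CPU and GPU; pages migrate on demand
    // via hardware page faults rather than requiring pinned memory.
    cudaMallocManaged(&data, n * sizeof(int));

    for (int i = 0; i < n; ++i)       // pages are first touched on the CPU
        data[i] = i;

    int device = 0;
    cudaGetDevice(&device);
    // Optional hint: migrate the pages to the GPU before the kernel runs.
    cudaMemPrefetchAsync(data, n * sizeof(int), device);

    increment<<<(n + 255) / 256, 256>>>(data, n);
    cudaDeviceSynchronize();          // CPU reads below fault pages back

    int checksum = data[0] + data[n - 1];
    (void)checksum;
    cudaFree(data);
    return 0;
}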
来自存储器304或其他系统存储器的数据可以由存储器分区单元380取回并存储在L2高速缓存460中,L2高速缓存460位于芯片上并且在各个GPC 350之间共享。如图所示,每个存储器分区单元380包括与对应的存储器304相关联的L2高速缓存460的一部分。然后可以在GPC 350内的多个单元中实现较低级高速缓存。例如,每个SM 440可以实现一级(L1)高速缓存。L1高速缓存是专用于特定SM 440的专用存储器。来自L2高速缓存460的数据可以被获取并存储在每个L1高速缓存中,以在SM 440的功能单元中进行处理。L2高速缓存460被耦合到存储器接口470和XBar 370。Data from memory 304 or other system memory may be retrieved by memory partition unit 380 and stored in L2 cache 460, which is located on chip and shared between various GPCs 350. As shown, each memory partition unit 380 includes a portion of L2 cache 460 associated with the corresponding memory 304. Lower level caches may then be implemented in multiple units within GPC 350. For example, each SM 440 may implement a level 1 (L1) cache. The L1 cache is a dedicated memory dedicated to a particular SM 440. Data from L2 cache 460 may be retrieved and stored in each L1 cache for processing in the functional units of SM 440. L2 cache 460 is coupled to memory interface 470 and XBar 370.
ROP单元450执行与诸如颜色压缩、像素混合等像素颜色相关的图形光栅操作。ROP单元450还与光栅引擎425一起实现深度测试,从光栅引擎425的剔除引擎接收与像素片段相关联的样本位置的深度。测试与片段关联的样本位置相对于深度缓冲区中的对应深度的深度。如果片段通过样本位置的深度测试,则ROP单元450更新深度缓冲区并将深度测试的结果发送给光栅引擎425。将理解的是,存储器分区单元380的数量可以不同于GPC 350的数量,并且因此每个ROP单元450可以耦合到每个GPC 350。ROP单元450跟踪从不同GPC 350接收到的数据包并且确定由ROP单元450生成的结果通过Xbar 370被路由到哪个GPC 350。尽管在图4B中ROP单元450被包括在存储器分区单元380内,但是在其他实施例中,ROP单元450可以在存储器分区单元380之外。例如,ROP单元450可以驻留在GPC 350或另一个单元中。The ROP unit 450 performs graphics raster operations related to pixel color, such as color compression, pixel blending, etc. The ROP unit 450 also implements depth testing with the raster engine 425, receiving the depth of the sample position associated with the pixel fragment from the culling engine of the raster engine 425. The depth of the sample position associated with the fragment is tested relative to the corresponding depth in the depth buffer. If the fragment passes the depth test of the sample position, the ROP unit 450 updates the depth buffer and sends the result of the depth test to the raster engine 425. It will be understood that the number of memory partition units 380 can be different from the number of GPCs 350, and therefore each ROP unit 450 can be coupled to each GPC 350. The ROP unit 450 tracks the data packets received from different GPCs 350 and determines which GPC 350 the results generated by the ROP unit 450 are routed to via the Xbar 370. Although the ROP unit 450 is included in the memory partition unit 380 in FIG. 4B, in other embodiments, the ROP unit 450 can be outside the memory partition unit 380. For example, ROP unit 450 may reside in GPC 350 or another unit.
图5A示出了根据一个实施例的图4A的流式多处理器440。如图5A所示,SM 440包括指令高速缓存505、一个或更多个调度器单元510、寄存器文件520、一个或更多个处理核心550、一个或更多个特殊功能单元(SFU)552、一个或更多个加载/存储单元(LSU)554、互连网络580、共享存储器/L1高速缓存570。FIG5A illustrates the streaming multiprocessor 440 of FIG4A according to one embodiment. As shown in FIG5A , the SM 440 includes an instruction cache 505, one or more scheduler units 510, a register file 520, one or more processing cores 550, one or more special function units (SFUs) 552, one or more load/store units (LSUs) 554, an interconnection network 580, and a shared memory/L1 cache 570.
如上所述,工作分配单元325调度任务以在PPU 300的GPC 350上执行。任务被分配给GPC 350内的特定DPC 420,并且如果该任务与着色器程序相关联,则该任务可以被分配给SM 440。调度器单元510接收来自工作分配单元325的任务并且管理指派给SM 440的一个或更多个线程块的指令调度。调度器单元510调度线程块以作为并行线程的线程束执行,其中每个线程块被分配至少一个线程束。在一个实施例中,每个线程束执行32个线程。调度器单元510可以管理多个不同的线程块,将线程束分配给不同的线程块,然后在每个时钟周期期间将来自多个不同的协作组的指令分派到各个功能单元(即,核心550、SFU 552和LSU554)。As described above, the work distribution unit 325 schedules tasks to be executed on the GPC 350 of the PPU 300. Tasks are assigned to specific DPCs 420 within the GPC 350, and if the task is associated with a shader program, the task may be assigned to the SM 440. The scheduler unit 510 receives tasks from the work distribution unit 325 and manages the scheduling of instructions assigned to one or more thread blocks assigned to the SM 440. The scheduler unit 510 schedules the thread blocks to execute as warps of parallel threads, where each thread block is assigned at least one warp. In one embodiment, each warp executes 32 threads. The scheduler unit 510 can manage a plurality of different thread blocks, assign warps to different thread blocks, and then dispatch instructions from a plurality of different cooperative groups to various functional units (i.e., core 550, SFU 552, and LSU 554) during each clock cycle.
协作组是用于组织通信线程组的编程模型,其允许开发者表达线程正在进行通信所采用的粒度,使得能够表达更丰富、更高效的并行分解。协作启动API支持线程块之间的同步性,以执行并行算法。常规的编程模型为同步协作线程提供了单一的简单结构:跨线程块的所有线程的栅栏(barrier)(例如,syncthreads()函数)。然而,程序员通常希望以小于线程块粒度的粒度定义线程组,并在所定义的组内同步,以集体的全组功能接口(collective group-wide function interface)的形式使能更高的性能、设计灵活性和软件重用。Cooperative Groups is a programming model for organizing groups of communicating threads that allows developers to express the granularity at which threads are communicating, enabling the expression of richer and more efficient decompositions of parallelism. The cooperative launch API supports synchronization between thread blocks to execute parallel algorithms. Conventional programming models provide a single simple structure for synchronizing cooperative threads: a barrier across all threads of a thread block (e.g., the syncthreads() function). However, programmers often want to define thread groups at a granularity smaller than the thread block granularity and synchronize within the defined group, enabling higher performance, design flexibility, and software reuse in the form of a collective group-wide function interface.
协作组使得程序员能够在子块(例如,像单个线程一样小)和多块粒度处明确定义线程组并且执行集体操作,诸如协作组中的线程上的同步性。编程模型支持跨软件边界的干净组合,以便库和效用函数可以在他们本地环境中安全地同步,而无需对收敛进行假设。协作组图元启用协作并行的新模式,包括生产者-消费者并行、机会主义并行,以及跨整个线程块网格的全局同步。Cooperative Groups enables programmers to explicitly define thread groups at sub-block (e.g., as small as a single thread) and multi-block granularity and perform collective operations such as synchronization on threads in a cooperative group. The programming model supports clean composition across software boundaries so that libraries and utility functions can safely synchronize in their local environment without making assumptions about convergence. Cooperative Groups primitives enable new patterns of cooperative parallelism, including producer-consumer parallelism, opportunistic parallelism, and global synchronization across the entire grid of thread blocks.
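下面是一个使用CUDA cooperative_groups头文件的简短草图(内核名reduce_warp_sums为示意性的),在小于线程块的粒度(32线程的图块)上同步并执行集体归约。A short sketch using the CUDA cooperative_groups header follows (the kernel name reduce_warp_sums is illustrative), synchronizing and performing a collective reduction at a granularity smaller than a thread block (32-thread tiles).

#include <cuda_runtime.h>
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

// Each 32-thread tile produces one partial sum.
__global__ void reduce_warp_sums(const float* in, float* out)
{
    cg::thread_block block = cg::this_thread_block();
    // Partition the block into warp-sized, 32-thread tiles.
    cg::thread_block_tile<32> tile = cg::tiled_partition<32>(block);

    float v = in[block.group_index().x * block.size() + block.thread_rank()];

    // Collective reduction within the tile; only these 32 threads synchronize.
    for (int offset = tile.size() / 2; offset > 0; offset /= 2)
        v += tile.shfl_down(v, offset);

    if (tile.thread_rank() == 0)      // one result per tile
        out[block.group_index().x * (block.size() / 32)
            + block.thread_rank() / 32] = v;
}

int main()
{
    const int n = 1 << 10;
    float *d_in, *d_out;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, (n / 32) * sizeof(float));
    cudaMemset(d_in, 0, n * sizeof(float));
    reduce_warp_sums<<<n / 128, 128>>>(d_in, d_out);
    cudaDeviceSynchronize();
    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}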
分派单元515被配置为向一个或更多个功能单元传送指令。在该实施例中,调度器单元510包括两个分派单元515,其使得能够在每个时钟周期期间调度来自相同线程束的两个不同指令。在替代实施例中,每个调度器单元510可以包括单个分派单元515或附加分派单元515。The dispatch unit 515 is configured to transmit instructions to one or more functional units. In this embodiment, the scheduler unit 510 includes two dispatch units 515, which enables two different instructions from the same thread warp to be scheduled during each clock cycle. In alternative embodiments, each scheduler unit 510 may include a single dispatch unit 515 or additional dispatch units 515.
每个SM 440包括寄存器文件520,其提供用于SM 440的功能单元的一组寄存器。在一个实施例中,寄存器文件520在每个功能单元之间被划分,使得每个功能单元被分配寄存器文件520的专用部分。在另一个实施例中,寄存器文件520在由SM 440执行的不同线程束之间被划分。寄存器文件520为连接到功能单元的数据路径的操作数提供临时存储器。Each SM 440 includes a register file 520 that provides a set of registers for the functional units of the SM 440. In one embodiment, the register file 520 is divided between each functional unit such that each functional unit is assigned a dedicated portion of the register file 520. In another embodiment, the register file 520 is divided between different warps executed by the SM 440. The register file 520 provides temporary storage for operands connected to the data paths of the functional units.
每个SM 440包括L个处理核心550。在一个实施例中,SM 440包括大量(例如128个等)不同的处理核心550。每个核心550可以包括完全管线化的、单精度、双精度和/或混合精度处理单元,其包括浮点运算逻辑单元和整数运算逻辑单元。在一个实施例中,浮点运算逻辑单元实现用于浮点运算的IEEE 754-2008标准。在一个实施例中,核心550包括64个单精度(32位)浮点核心、64个整数核心、32个双精度(64位)浮点核心和8个张量核心(tensor core)。Each SM 440 includes L processing cores 550. In one embodiment, the SM 440 includes a large number (e.g., 128, etc.) of distinct processing cores 550. Each core 550 may include a fully pipelined, single-precision, double-precision, and/or mixed-precision processing unit that includes a floating-point arithmetic logic unit and an integer arithmetic logic unit. In one embodiment, the floating-point arithmetic logic units implement the IEEE 754-2008 standard for floating-point arithmetic. In one embodiment, the cores 550 include 64 single-precision (32-bit) floating-point cores, 64 integer cores, 32 double-precision (64-bit) floating-point cores, and 8 tensor cores.
张量核心被配置为执行矩阵运算,并且在一个实施例中,一个或更多个张量核心被包括在核心550中。具体地,张量核心被配置为执行深度学习矩阵运算,诸如用于神经网络训练和推理的卷积运算。在一个实施例中,每个张量核心在4×4矩阵上运算并且执行矩阵乘法和累加运算D=A×B+C,其中A、B、C和D是4×4矩阵。The tensor cores are configured to perform matrix operations, and in one embodiment, one or more tensor cores are included in core 550. Specifically, the tensor cores are configured to perform deep learning matrix operations, such as convolution operations for neural network training and inference. In one embodiment, each tensor core operates on a 4×4 matrix and performs a matrix multiplication and accumulation operation D=A×B+C, where A, B, C, and D are 4×4 matrices.
在一个实施例中,矩阵乘法输入A和B是16位浮点矩阵,而累加矩阵C和D可以是16位浮点或32位浮点矩阵。张量核心在16位浮点输入数据以及32位浮点累加上运算。16位浮点乘法需要64次运算,产生全精度的积,然后使用32位浮点加法与4×4×4矩阵乘法的其他中间积累加。在实践中,张量核心用于执行由这些较小的元素建立的更大的二维或更高维的矩阵运算。API(诸如CUDA 9 C++ API)公开了专门的矩阵加载、矩阵乘法和累加以及矩阵存储运算,以便有效地使用来自CUDA-C++程序的张量核心。在CUDA层面,线程束级接口假定16×16尺寸矩阵跨越线程束的全部32个线程。In one embodiment, the matrix multiply inputs A and B are 16-bit floating-point matrices, while the accumulation matrices C and D may be 16-bit floating-point or 32-bit floating-point matrices. Tensor cores operate on 16-bit floating-point input data with 32-bit floating-point accumulation. The 16-bit floating-point multiply requires 64 operations and results in a full-precision product that is then accumulated using 32-bit floating-point addition with the other intermediate products for a 4×4×4 matrix multiply. In practice, tensor cores are used to perform much larger two-dimensional or higher-dimensional matrix operations built up from these smaller elements. An API (such as the CUDA 9 C++ API) exposes specialized matrix load, matrix multiply and accumulate, and matrix store operations to efficiently use tensor cores from a CUDA-C++ program. At the CUDA level, the warp-level interface assumes 16×16 size matrices spanning all 32 threads of the warp.
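下面是一个最小的草图,使用CUDA的nvcuda::wmma接口(mma.h)在张量核心上计算一个16×16图块的D=A×B+C;假设计算能力7.0或更高,且内核名为示意性的。A minimal sketch follows, using CUDA's nvcuda::wmma interface (mma.h) to compute one 16×16 tile of D=A×B+C on the tensor cores; compute capability 7.0 or higher is assumed, and the kernel name is illustrative.

#include <cuda_runtime.h>
#include <cuda_fp16.h>
#include <mma.h>
using namespace nvcuda;

// One warp cooperatively computes one 16x16 tile: D = A * B + C (C = 0 here).
__global__ void wmma_tile(const half* a, const half* b, float* d)
{
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> fa;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> fb;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> acc;

    wmma::fill_fragment(acc, 0.0f);         // accumulator C = 0
    wmma::load_matrix_sync(fa, a, 16);      // leading dimension 16
    wmma::load_matrix_sync(fb, b, 16);
    wmma::mma_sync(acc, fa, fb, acc);       // 16-bit inputs, 32-bit accumulate
    wmma::store_matrix_sync(d, acc, 16, wmma::mem_row_major);
}

int main()
{
    half *a, *b;
    float* d;
    cudaMalloc(&a, 16 * 16 * sizeof(half));
    cudaMalloc(&b, 16 * 16 * sizeof(half));
    cudaMalloc(&d, 16 * 16 * sizeof(float));
    wmma_tile<<<1, 32>>>(a, b, d);          // exactly one 32-thread warp
    cudaDeviceSynchronize();
    cudaFree(a);
    cudaFree(b);
    cudaFree(d);
    return 0;
}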
每个SM 440还包括执行特殊函数(例如,属性评估、倒数平方根等)的M个SFU 552。在一个实施例中,SFU 552可以包括树遍历单元,其被配置为遍历分层树数据结构。在一个实施例中,SFU 552可以包括被配置为执行纹理图过滤操作的纹理单元。在一个实施例中,纹理单元被配置为从存储器304加载纹理图(例如,纹理像素的2D阵列)并且对纹理图进行采样以产生经采样的纹理值,用于在由SM 440执行的着色器程序中使用。在一个实施例中,纹理图被存储在共享存储器/L1高速缓存570中。纹理单元实现纹理操作,诸如使用mip图(即,不同细节层次的纹理图)的过滤操作。在一个实施例中,每个SM 440包括两个纹理单元。Each SM 440 also includes M SFUs 552 that perform special functions (e.g., attribute evaluation, reciprocal square root, etc.). In one embodiment, the SFUs 552 may include a tree traversal unit configured to traverse a hierarchical tree data structure. In one embodiment, the SFUs 552 may include a texture unit configured to perform texture map filtering operations. In one embodiment, the texture unit is configured to load a texture map (e.g., a 2D array of texels) from the memory 304 and sample the texture map to produce sampled texture values for use in shader programs executed by the SM 440. In one embodiment, the texture maps are stored in the shared memory/L1 cache 570. The texture units implement texture operations, such as filtering operations using mip maps (i.e., texture maps of varying levels of detail). In one embodiment, each SM 440 includes two texture units.
每个SM 440还包括N个LSU 554,其实现共享存储器/L1高速缓存570和寄存器文件520之间的加载和存储操作。每个SM 440包括将每个功能单元连接到寄存器文件520以及将LSU 554连接到寄存器文件520、共享存储器/L1高速缓存570的互连网络580。在一个实施例中,互连网络580是交叉开关,其可以被配置为将任何功能单元连接到寄存器文件520中的任何寄存器,以及将LSU 554连接到寄存器文件和共享存储器/L1高速缓存570中的存储器位置。Each SM 440 also includes N LSUs 554 that implement load and store operations between the shared memory/L1 cache 570 and the register file 520. Each SM 440 includes an interconnect network 580 that connects each functional unit to the register file 520 and connects the LSUs 554 to the register file 520, the shared memory/L1 cache 570. In one embodiment, the interconnect network 580 is a crossbar switch that can be configured to connect any functional unit to any register in the register file 520 and to connect the LSUs 554 to memory locations in the register file and the shared memory/L1 cache 570.
共享存储器/L1高速缓存570是片上存储器阵列,其允许数据存储和SM 440与图元引擎435之间以及SM 440中的线程之间的通信。在一个实施例中,共享存储器/L1高速缓存570包括128KB的存储容量并且在从SM 440到存储器分区单元380的路径中。共享存储器/L1高速缓存570可以用于高速缓存读取和写入。共享存储器/L1高速缓存570、L2高速缓存460和存储器304中的一个或更多个是后备存储。The shared memory/L1 cache 570 is an on-chip memory array that allows data storage and communication between the SM 440 and the primitive engine 435, as well as between threads in the SM 440. In one embodiment, the shared memory/L1 cache 570 includes 128KB of storage capacity and is in the path from the SM 440 to the memory partition unit 380. The shared memory/L1 cache 570 can be used to cache reads and writes. One or more of the shared memory/L1 cache 570, the L2 cache 460, and the memory 304 is a backing store.
将数据高速缓存和共享存储器功能组合成单个存储器块为两种类型的存储器访问提供最佳的总体性能。该容量可由不使用共享存储器的程序用作高速缓存。例如,如果将共享存储器配置为使用一半容量,则纹理和加载/存储操作可以使用剩余容量。在共享存储器/L1高速缓存570内的集成使共享存储器/L1高速缓存570起到用于流式传输数据的高吞吐量管线的作用,并且同时提供对频繁重用数据的高带宽和低延迟的访问。Combining the data cache and shared memory functionality into a single memory block provides the best overall performance for both types of memory accesses. The capacity is usable as a cache by programs that do not use shared memory. For example, if the shared memory is configured to use half of the capacity, texture and load/store operations can use the remaining capacity. Integration within the shared memory/L1 cache 570 enables the shared memory/L1 cache 570 to function as a high-throughput conduit for streaming data while simultaneously providing high-bandwidth and low-latency access to frequently reused data.
当被配置用于通用并行计算时,与图形处理相比,可以使用更简单的配置。具体地,图3所示的固定功能图形处理单元被绕过,创建了更简单的编程模型。在通用并行计算配置中,工作分配单元325将线程块直接指派并分配给DPC 420。块中的线程执行相同的程序,使用计算中的唯一线程ID来确保每个线程生成唯一结果,使用SM 440执行程序并执行计算,使用共享存储器/L1高速缓存570以在线程之间通信,以及使用LSU 554通过共享存储器/L1高速缓存570和存储器分区单元380读取和写入全局存储器。当被配置用于通用并行计算时,SM 440还可以写入调度器单元320可用来在DPC 420上启动新工作的命令。When configured for general parallel computing, a simpler configuration can be used compared to graphics processing. Specifically, the fixed-function graphics processing unit shown in Figure 3 is bypassed, creating a simpler programming model. In the general parallel computing configuration, the work distribution unit 325 assigns and distributes thread blocks directly to the DPC 420. The threads in the block execute the same program, use unique thread IDs in the calculation to ensure that each thread generates a unique result, use SM 440 to execute the program and perform calculations, use shared memory/L1 cache 570 to communicate between threads, and use LSU 554 to read and write global memory through shared memory/L1 cache 570 and memory partition unit 380. When configured for general parallel computing, SM 440 can also write commands that the scheduler unit 320 can use to start new work on DPC 420.
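下面是一个说明性的块内求和草图(内核名block_sum为示意性的),对应上述通用计算配置:唯一线程ID、通过共享存储器在线程之间通信,以及经由LSU对全局存储器的读写。An illustrative in-block summation sketch follows (the kernel name block_sum is illustrative), corresponding to the general-purpose configuration described above: unique thread IDs, communication between threads through shared memory, and reads and writes of global memory via the LSUs.

#include <cuda_runtime.h>

__global__ void block_sum(const float* in, float* out, int n)
{
    __shared__ float tile[256];                        // shared memory per block
    int i = blockIdx.x * blockDim.x + threadIdx.x;     // unique thread ID

    tile[threadIdx.x] = (i < n) ? in[i] : 0.0f;        // global load
    __syncthreads();                                   // block-wide barrier

    // Tree reduction: threads communicate through the shared tile.
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (threadIdx.x < stride)
            tile[threadIdx.x] += tile[threadIdx.x + stride];
        __syncthreads();
    }

    if (threadIdx.x == 0)
        out[blockIdx.x] = tile[0];                     // one global store per block
}

int main()
{
    const int n = 1 << 16;
    float *d_in, *d_out;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, (n / 256) * sizeof(float));
    cudaMemset(d_in, 0, n * sizeof(float));
    block_sum<<<n / 256, 256>>>(d_in, d_out, n);
    cudaDeviceSynchronize();
    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}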
PPU 300可以被包括在台式计算机、膝上型计算机、平板电脑、服务器、超级计算机、智能电话(例如,无线、手持设备)、个人数字助理(PDA)、数码相机、运载工具、头戴式显示器、手持式电子设备等中。在一个实施例中,PPU 300包含在单个半导体衬底上。在另一个实施例中,PPU 300与一个或更多个其他器件(诸如附加PPU 300、存储器304、精简指令集计算机(RISC)CPU、存储器管理单元(MMU)、数字-模拟转换器(DAC)等)一起被包括在片上系统(SoC)上。The PPU 300 may be included in a desktop computer, a laptop computer, a tablet computer, a server, a supercomputer, a smartphone (e.g., wireless, handheld device), a personal digital assistant (PDA), a digital camera, a vehicle, a head-mounted display, a handheld electronic device, etc. In one embodiment, the PPU 300 is included on a single semiconductor substrate. In another embodiment, the PPU 300 is included on a system on a chip (SoC) along with one or more other devices (such as an additional PPU 300, a memory 304, a reduced instruction set computer (RISC) CPU, a memory management unit (MMU), a digital-to-analog converter (DAC), etc.).
在一个实施例中,PPU 300可以被包括在图形卡上,图形卡包括一个或更多个存储器304。图形卡可以被配置为与台式计算机的主板上的PCIe插槽接口。在又一个实施例中,PPU 300可以是包含在主板的芯片集中的集成图形处理单元(iGPU)或并行处理器。In one embodiment, PPU 300 may be included on a graphics card that includes one or more memories 304. The graphics card may be configured to interface with a PCIe slot on a motherboard of a desktop computer. In yet another embodiment, PPU 300 may be an integrated graphics processing unit (iGPU) or parallel processor included in a chipset of a motherboard.
示例性计算系统Exemplary Computing System
具有多个GPU和CPU的系统被用于各种行业,因为开发者在应用(诸如人工智能计算)中暴露和利用更多的并行性。在数据中心、研究机构和超级计算机中部署具有数十至数千个计算节点的高性能GPU加速系统,以解决更大的问题。随着高性能系统内处理设备数量的增加,通信和数据传输机制需要扩展以支持该增加带宽。Systems with multiple GPUs and CPUs are used in a variety of industries as developers expose and exploit more parallelism in applications such as artificial intelligence computing. High-performance GPU-accelerated systems with tens to thousands of computing nodes are deployed in data centers, research institutions, and supercomputers to solve larger problems. As the number of processing devices within high-performance systems increases, communication and data transmission mechanisms need to scale to support this increased bandwidth.
图5B是根据实施例的使用图3的PPU 300实现的处理系统500的概念图。可以将处理系统500配置为分别实现图1C,2C,6C和6D中所示的方法150、250、650和675中的一个或更多个。处理系统500包括CPU 530、交换机510以及多个PPU 300和各自的存储器304。PPU 300可各自包括和/或被配置为执行一个或更多个处理核心和/或其组件,例如张量核心(TC),张量处理单元(TPU),像素视觉核心(PVC),视觉处理单元(VPU),图形处理集群(GPC),纹理处理集群(TPC),流式多处理器(SM),树遍历单元(TTU),人工智能加速器(AIA),深度学习加速器(DLA),算术逻辑单元(ALU),专用集成电路(ASIC),浮点单元(FPU),输入/输出(I/O)元件,外围组件互连(PCI)或外围组件互连快速(PCIe)元件等等。FIG. 5B is a conceptual diagram of a processing system 500 implemented using the PPU 300 of FIG. 3, in accordance with an embodiment. The processing system 500 may be configured to implement one or more of the methods 150, 250, 650, and 675 shown in FIGS. 1C, 2C, 6C, and 6D, respectively. The processing system 500 includes a CPU 530, a switch 510, and multiple PPUs 300, each with a respective memory 304. The PPUs 300 may each include, and/or be configured to perform functions of, one or more processing cores and/or components thereof, such as Tensor Cores (TCs), Tensor Processing Units (TPUs), Pixel Visual Cores (PVCs), Vision Processing Units (VPUs), Graphics Processing Clusters (GPCs), Texture Processing Clusters (TPCs), Streaming Multiprocessors (SMs), Tree Traversal Units (TTUs), Artificial Intelligence Accelerators (AIAs), Deep Learning Accelerators (DLAs), Arithmetic Logic Units (ALUs), Application-Specific Integrated Circuits (ASICs), Floating-Point Units (FPUs), input/output (I/O) elements, peripheral component interconnect (PCI) or peripheral component interconnect express (PCIe) elements, and the like.
NVLink 310提供每个PPU 300之间的高速通信链路。尽管图5B中示出了特定数量的NVLink 310和互连302连接,但是连接到每个PPU 300和CPU 530的连接的数量可以改变。交换机510在互连302和CPU 530之间接口。PPU 300、存储器304和NVLink 310可以位于单个半导体平台上以形成并行处理模块525。在一个实施例中,交换机510支持两个或更多个在各种不同连接和/或链路之间接口的协议。NVLink 310 provides a high-speed communication link between each PPU 300. Although a specific number of NVLink 310 and interconnect 302 connections are shown in FIG. 5B , the number of connections connected to each PPU 300 and CPU 530 may vary. Switch 510 interfaces between interconnect 302 and CPU 530. PPU 300, memory 304, and NVLink 310 may be located on a single semiconductor platform to form a parallel processing module 525. In one embodiment, switch 510 supports two or more protocols that interface between various different connections and/or links.
在另一个实施例(未示出)中,NVLink 310在每个PPU 300和CPU 530之间提供一个或更多个高速通信链路,并且交换机510在互连302和每个PPU 300之间进行接口。PPU 300、存储器304和互连302可以位于单个半导体平台上以形成并行处理模块525。在又一个实施例(未示出)中,互连302在每个PPU 300和CPU 530之间提供一个或更多个通信链路,并且交换机510使用NVLink 310在每个PPU 300之间进行接口,以在PPU 300之间提供一个或更多个高速通信链路。在另一个实施例(未示出)中,NVLink 310在PPU 300和CPU 530之间通过交换机510提供一个或更多个高速通信链路。在又一个实施例(未示出)中,互连302在每个PPU 300之间直接地提供一个或更多个通信链路。可以使用与NVLink310相同的协议将一个或更多个NVLink 310高速通信链路实现为物理NVLink互连或者片上或裸晶上互连。In another embodiment (not shown), NVLink 310 provides one or more high-speed communication links between each PPU 300 and CPU 530, and switch 510 interfaces between interconnect 302 and each PPU 300. PPU 300, memory 304, and interconnect 302 may be located on a single semiconductor platform to form a parallel processing module 525. In yet another embodiment (not shown), interconnect 302 provides one or more communication links between each PPU 300 and CPU 530, and switch 510 interfaces between each PPU 300 using NVLink 310 to provide one or more high-speed communication links between PPU 300. In another embodiment (not shown), NVLink 310 provides one or more high-speed communication links between PPU 300 and CPU 530 through switch 510. In yet another embodiment (not shown), interconnect 302 provides one or more communication links directly between each PPU 300. One or more NVLink 310 high-speed communication links may be implemented as a physical NVLink interconnect or as an on-chip or on-die interconnect using the same protocol as NVLink 310 .
在本说明书的上下文中,单个半导体平台可以指在裸晶或芯片上制造的唯一的单一的基于半导体的集成电路。应该注意的是,术语单个半导体平台也可以指具有增加的连接的多芯片模块,其模拟片上操作并通过利用常规总线实现方式进行实质性改进。当然,根据用户的需要,各种电路或器件还可以分开放置或以半导体平台的各种组合来放置。可选地,并行处理模块525可以被实现为电路板衬底,并且PPU 300和/或存储器304中的每一个可以是封装器件。在一个实施例中,CPU 530、交换机510和并行处理模块525位于单个半导体平台上。In the context of this specification, a single semiconductor platform may refer to a unique single semiconductor-based integrated circuit manufactured on a bare die or chip. It should be noted that the term single semiconductor platform may also refer to a multi-chip module with increased connectivity that simulates on-chip operations and is substantially improved by utilizing conventional bus implementations. Of course, various circuits or devices may also be placed separately or in various combinations of semiconductor platforms, depending on the needs of the user. Optionally, the parallel processing module 525 may be implemented as a circuit board substrate, and each of the PPU 300 and/or memory 304 may be a packaged device. In one embodiment, the CPU 530, switch 510, and parallel processing module 525 are located on a single semiconductor platform.
在一个实施例中,每个NVLink 310的信令速率是20到25千兆位/秒,并且每个PPU 300包括六个NVLink 310接口(如图5B所示,每个PPU 300包括五个NVLink 310接口)。每个NVLink 310在每个方向上提供25千兆位/秒的数据传输速率,其中六条链路提供300千兆位/秒。当CPU 530还包括一个或更多个NVLink 310接口时,NVLink 310可专门用于如图5B所示的PPU到PPU通信,或者PPU到PPU以及PPU到CPU的某种组合。In one embodiment, the NVLink 310 signaling rate is 20 to 25 Gigabits/second, and each PPU 300 includes six NVLink 310 interfaces (as shown in FIG. 5B, five NVLink 310 interfaces are included for each PPU 300). Each NVLink 310 provides a data transfer rate of 25 Gigabits/second in each direction, with six links providing 300 Gigabits/second. The NVLinks 310 can be used exclusively for PPU-to-PPU communication as shown in FIG. 5B, or some combination of PPU-to-PPU and PPU-to-CPU, when the CPU 530 also includes one or more NVLink 310 interfaces.
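将300千兆位/秒理解为六条链路的双向聚合带宽,其算术核对如下(这只是对该数字的一种说明性解读):Reading the 300 Gigabits/second figure as the bidirectional aggregate over six links, the arithmetic checks out as follows (an illustrative reading of the figure only):

6 \text{ links} \times 25\ \tfrac{\text{Gb/s}}{\text{direction}} \times 2 \text{ directions} = 300 \text{ Gb/s}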
在一个实施例中,NVLink 310允许从CPU 530到每个PPU 300的存储器304的直接加载/存储/原子访问。在一个实施例中,NVLink 310支持一致性操作,允许从存储器304读取的数据被存储在CPU 530的高速缓存分层结构中,减少了CPU 530的高速缓存访问延迟。在一个实施例中,NVLink 310包括对地址转换服务(ATS)的支持,允许PPU 300直接访问CPU530内的页表。一个或更多个NVLink 310还可以被配置为以低功率模式操作。In one embodiment, NVLink 310 allows direct load/store/atomic access from CPU 530 to memory 304 of each PPU 300. In one embodiment, NVLink 310 supports coherency operations, allowing data read from memory 304 to be stored in the cache hierarchy of CPU 530, reducing cache access latency of CPU 530. In one embodiment, NVLink 310 includes support for address translation services (ATS), allowing PPU 300 to directly access page tables within CPU 530. One or more NVLinks 310 may also be configured to operate in a low power mode.
图5C示出了示例性系统565,在其中可以实现各种先前实施例的各种体系结构和/或功能。示例性系统565可以被配置为分别实现图1C,2C,6C和6D中所示的方法150、250、650和675中的一个或更多个。5C shows an exemplary system 565 in which various architectures and/or functions of various previous embodiments can be implemented. The exemplary system 565 can be configured to implement one or more of the methods 150, 250, 650, and 675 shown in FIGS. 1C, 2C, 6C, and 6D, respectively.
如图所示,提供了一种系统565,该系统565包括至少一个连接到通信总线575的中央处理单元530。通信总线575可以直接或间接耦合以下设备中的一个或更多个:主存储器540,网络接口535,一个或更多个CPU 530,一个或更多个显示设备545,一个或更多个输入设备560,交换机510和并行处理系统525。通信总线575可以使用任何合适的协议来实现,并且可以表示一个或更多个链路或总线,例如地址总线,数据总线,控制总线或其组合。通信总线575可以包括一种或更多种总线或链路类型,诸如工业标准架构(ISA)总线,扩展工业标准架构(EISA)总线,视频电子标准协会(VESA)总线,外围组件互连(PCI)总线,外围组件互连快速(PCIe)总线,超传输和/或另一种类型的总线或链路。在一些实施例中,组件之间存在直接连接。作为示例,一个或更多个CPU 530可以直接连接到主存储器540。此外,一个或更多个CPU 530可以直接连接到并行处理系统525。在组件之间采用直接或点对点连接的情况下,通信总线575可以包括用于实现该连接的PCIe链路。在这些示例中,PCI总线不需要被包括在系统565中。As shown, a system 565 is provided that includes at least one central processing unit 530 connected to a communication bus 575. The communication bus 575 may directly or indirectly couple one or more of the following devices: a main memory 540, a network interface 535, one or more CPUs 530, one or more display devices 545, one or more input devices 560, a switch 510, and a parallel processing system 525. The communication bus 575 may be implemented using any suitable protocol and may represent one or more links or buses, such as an address bus, a data bus, a control bus, or a combination thereof. The communication bus 575 may include one or more bus or link types, such as an industry standard architecture (ISA) bus, an extended industry standard architecture (EISA) bus, a video electronics standards association (VESA) bus, a peripheral component interconnect (PCI) bus, a peripheral component interconnect express (PCIe) bus, HyperTransport, and/or another type of bus or link. In some embodiments, there are direct connections between components. As an example, the one or more CPUs 530 may be directly connected to the main memory 540. Further, the one or more CPUs 530 may be directly connected to the parallel processing system 525. Where there is a direct, or point-to-point, connection between components, the communication bus 575 may include a PCIe link to carry out the connection. In these examples, a PCI bus need not be included in the system 565.
尽管图5C的各个框被示为经由通信总线575用线连接,但这并不旨在进行限制并且仅是为了清楚。例如,在一些实施例中,呈现组件,例如一个或更多个显示设备545,可以被认为是I/O组件,例如一个或更多个输入设备560(例如,如果显示器是触摸屏)。作为另一个示例,一个或更多个CPU 530和/或并行处理系统525可以包括存储器(例如,除了并行处理系统525,CPU 530和/或其他组件之外,主存储器540可以表示存储设备)。换句话说,图5C的计算设备仅是说明性的。在“工作站”,“服务器”,“笔记本电脑”,“台式机”,“平板电脑”,“客户端设备”,“移动设备”,“手持设备”,“游戏机”,“电子控制单元(ECU)”,“虚拟现实系统”和/或其他设备或系统类型等类别之间未进行区分,均在图5C的计算设备的范围内。Although the various boxes of FIG. 5C are shown as being connected by wires via a communication bus 575, this is not intended to be limiting and is only for clarity. For example, in some embodiments, a presentation component, such as one or more display devices 545, may be considered an I/O component, such as one or more input devices 560 (e.g., if the display is a touch screen). As another example, one or more CPUs 530 and/or parallel processing systems 525 may include a memory (e.g., in addition to the parallel processing system 525, the CPU 530 and/or other components, the main memory 540 may represent a storage device). In other words, the computing device of FIG. 5C is merely illustrative. No distinction is made between categories such as "workstation", "server", "laptop", "desktop", "tablet", "client device", "mobile device", "handheld device", "game console", "electronic control unit (ECU)", "virtual reality system" and/or other device or system types, all of which are within the scope of the computing device of FIG. 5C.
系统565还包括主存储器540。控制逻辑(软件)和数据被存储在主存储器540中,其可以采取各种计算机可读介质的形式。计算机可读介质可以是系统565可以访问的任何可用介质。计算机可读介质可以包括易失性和非易失性介质以及可移除和不可移除介质。作为示例而非限制,计算机可读介质可以包括计算机存储介质和通信介质。The system 565 also includes a main memory 540. Control logic (software) and data are stored in the main memory 540, which can take the form of various computer-readable media. Computer-readable media can be any available media that can be accessed by the system 565. Computer-readable media can include volatile and non-volatile media and removable and non-removable media. By way of example and not limitation, computer-readable media can include computer storage media and communication media.
计算机存储介质可以包括以用于存储信息(诸如计算机可读指令,数据结构,程序模块和/或其他数据类型)的任何方法或技术实现的易失性和非易失性介质和/或可移除和不可移除介质两者。例如,主存储器540可以存储计算机可读指令(例如,表示诸如操作系统的程序和/或程序元素)。计算机存储介质可以包括但不限于RAM,ROM,EEPROM,闪存或其他存储技术,CD-ROM,数字多功能磁盘(DVD)或其他光盘存储设备,磁带盒,磁带,磁盘存储设备或其他磁性存储设备,或可以用于存储期望信息并且可以由系统565访问的任何其他介质。如本文所使用的,计算机存储介质本身不包括信号。Computer storage media may include both volatile and nonvolatile media and/or removable and non-removable media implemented in any method or technology for storage of information such as computer-readable instructions, data structures, program modules, and/or other data types. For example, the main memory 540 may store computer-readable instructions (e.g., that represent a program and/or a program element, such as an operating system). Computer storage media may include, but are not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium that can be used to store the desired information and that can be accessed by the system 565. As used herein, computer storage media does not include signals per se.
通信介质可以在诸如载波或其他传输机制的已调制数据信号中体现计算机可读指令,数据结构,程序模块和/或其他数据类型,并且包括任何信息传递介质。术语“调制数据信号”可以指的是具有以对信号中的信息进行编码的方式设置或改变其一个或更多个特征的信号。作为示例而非限制,通信介质可以包括有线介质(诸如有线网络或直接有线连接),以及无线介质(诸如声学,RF,红外和其他无线介质)。以上任何内容的组合也应包括在计算机可读介质的范围内。Communication media may embody computer-readable instructions, data structures, program modules, and/or other data types in a modulated data signal such as a carrier wave or other transport mechanism, and includes any information delivery media. The term “modulated data signal” may refer to a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal. By way of example, and not limitation, communication media may include wired media, such as a wired network or direct-wired connection, and wireless media, such as acoustic, RF, infrared, and other wireless media. Combinations of any of the above should also be included within the scope of computer-readable media.
计算机程序在被执行时使系统565能够执行各种功能。一个或更多个CPU 530可以被配置为执行计算机可读指令中的至少一些,以控制系统565的一个或更多个组件以执行本文描述的方法和/或过程中的一个或更多个。一个或更多个CPU 530的每个可以包括一个或更多个能够同时处理多个软件线程的核心(例如,一个,两个,四个,八个,二十八个,七十二个等等)。一个或更多个CPU 530可以包括任何类型的处理器,并且可以取决于所实现的系统565的类型而包括不同类型的处理器(例如,具有较少核心的用于移动设备的处理器和具有较多核心的用于服务器的处理器)。例如,取决于系统565的类型,处理器可以是使用精简指令集计算(RISC)实现的高级RISC机器(ARM)处理器或使用复杂指令集计算(CISC)实现的x86处理器。除了一个或更多个微处理器或辅助协处理器,例如数学协处理器,系统565还可包括一个或更多个CPU 530。The computer program enables the system 565 to perform various functions when executed. One or more CPUs 530 may be configured to execute at least some of the computer-readable instructions to control one or more components of the system 565 to perform one or more of the methods and/or processes described herein. Each of the one or more CPUs 530 may include one or more cores (e.g., one, two, four, eight, twenty-eight, seventy-two, etc.) capable of simultaneously processing multiple software threads. One or more CPUs 530 may include any type of processor, and may include different types of processors (e.g., processors for mobile devices with fewer cores and processors for servers with more cores) depending on the type of system 565 implemented. For example, depending on the type of system 565, the processor may be an advanced RISC machine (ARM) processor implemented using reduced instruction set computing (RISC) or an x86 processor implemented using complex instruction set computing (CISC). In addition to one or more microprocessors or auxiliary coprocessors, such as a math coprocessor, the system 565 may also include one or more CPUs 530.
作为一个或更多个CPU 530的补充或替代,并行处理模块525可配置为执行至少一些计算机可读指令,以控制系统565的一个或更多个组件执行本文描述的一个或更多个方法和/或过程。系统565可以使用并行处理模块525来渲染图形(例如3D图形)或执行通用计算。例如,并行处理模块525可以用于GPU上的通用计算(GPGPU)。在实施例中,一个或更多个CPU 530和/或并行处理模块525可以离散地或联合地执行方法,过程和/或其部分的任何组合。In addition to or in lieu of one or more CPUs 530, the parallel processing module 525 may be configured to execute at least some computer-readable instructions to control one or more components of the system 565 to perform one or more methods and/or processes described herein. The system 565 may use the parallel processing module 525 to render graphics (e.g., 3D graphics) or perform general-purpose computations. For example, the parallel processing module 525 may be used for general-purpose computations on a GPU (GPGPU). In embodiments, the one or more CPUs 530 and/or the parallel processing module 525 may perform any combination of methods, processes, and/or portions thereof, either discretely or in conjunction.
系统565还包括一个或更多个输入设备560,并行处理系统525和一个或更多个显示设备545。一个或更多个显示设备545可包括显示器(例如,监视器,触摸屏,电视屏幕,平视显示器(HUD),其他显示类型或其组合),扬声器和/或其他演示组件。一个或更多个显示设备545可以从其他组件(例如,并行处理系统525,CPU 530等)接收数据,并且输出数据(例如,作为图像,视频,声音等)。The system 565 also includes one or more input devices 560, a parallel processing system 525, and one or more display devices 545. The one or more display devices 545 may include a display (e.g., a monitor, a touch screen, a television screen, a head-up display (HUD), other display types, or a combination thereof), speakers, and/or other presentation components. The one or more display devices 545 may receive data from other components (e.g., parallel processing system 525, CPU 530, etc.) and output data (e.g., as images, video, sound, etc.).
网络接口535可以使系统565逻辑耦合到其他设备,包括输入设备560,一个或更多个显示设备545和/或其他组件,其中一些可以内置于(例如,集成到)系统565。说明性输入设备560包括麦克风,鼠标,键盘,操纵杆,游戏手柄,游戏控制器,碟形卫星天线,扫描仪,打印机,无线设备等。输入设备560可提供自然的用户界面(NUI)处理手势,语音或用户生成的其他生理输入。在某些情况下,可以将输入传输到适当的网络元素以进行进一步处理。NUI可以实现语音识别,手写笔识别,面部识别,生物特征识别,屏幕上以及与屏幕相邻的手势识别,空中手势,头部和眼睛跟踪以及与系统565的显示器相关的触摸识别(如下更详细描述)的任意组合。系统565可以包括用于姿势检测和识别的深度相机,例如立体相机系统,红外相机系统,RGB相机系统,触摸屏技术及其组合。另外,系统565可包括使得能够检测运动的输入设备560,诸如加速度计或陀螺仪(例如,作为惯性测量单元(IMU)的一部分)。在一些示例中,系统565可以使用加速度计或陀螺仪的输出来渲染身临其境的增强现实或虚拟现实。The network interface 535 may enable the system 565 to be logically coupled to other devices, including input devices 560, one or more display devices 545, and/or other components, some of which may be built into (e.g., integrated into) the system 565. Illustrative input devices 560 include microphones, mice, keyboards, joysticks, gamepads, game controllers, satellite dishes, scanners, printers, wireless devices, and the like. The input devices 560 may provide a natural user interface (NUI) that processes gestures, speech, or other physiological input generated by a user. In some cases, the input may be transmitted to an appropriate network element for further processing. The NUI may implement any combination of speech recognition, stylus recognition, facial recognition, biometric recognition, gesture recognition on and adjacent to the screen, mid-air gestures, head and eye tracking, and touch recognition associated with a display of the system 565 (described in more detail below). The system 565 may include a depth camera for gesture detection and recognition, such as a stereo camera system, an infrared camera system, an RGB camera system, touch screen technology, and combinations thereof. Additionally, system 565 may include an input device 560 that enables detection of motion, such as an accelerometer or gyroscope (e.g., as part of an inertial measurement unit (IMU)). In some examples, system 565 may use the output of the accelerometer or gyroscope to render an immersive augmented reality or virtual reality.
此外,系统565可以通过网络接口535耦合到网络(例如,电信网络,局域网(LAN),无线网络,诸如互联网的广域网(WAN),对等网络,电缆网络等)以进行通信。系统565可以被包括在分布式网络和/或云计算环境内。In addition, the system 565 can be coupled through the network interface 535 to a network (e.g., a telecommunications network, a local area network (LAN), a wireless network, a wide area network (WAN) such as the Internet, a peer-to-peer network, a cable network, etc.) for communication. The system 565 can be included in a distributed network and/or cloud computing environment.
网络接口535可以包括一个或更多个接收机,发射机和/或收发机,其使系统565能够经由包括有线和/或无线通信的电子通信网络与其他计算设备进行通信。网络接口535可以包括组件和功能,以使得能够在许多不同网络中的任何一个上进行通信,例如无线网络(例如,Wi-Fi,Z-Wave,蓝牙,蓝牙LE,ZigBee等),有线网络(例如,通过以太网或InfiniBand进行通信),低功耗广域网(例如LoRaWAN,SigFox等)和/或因特网。The network interface 535 may include one or more receivers, transmitters, and/or transceivers that enable the system 565 to communicate with other computing devices via an electronic communication network including wired and/or wireless communications. The network interface 535 may include components and functionality to enable communication over any of a number of different networks, such as a wireless network (e.g., Wi-Fi, Z-Wave, Bluetooth, Bluetooth LE, ZigBee, etc.), a wired network (e.g., communicating over Ethernet or InfiniBand), a low power wide area network (e.g., LoRaWAN, SigFox, etc.), and/or the Internet.
系统565还可包括辅助存储器(未示出)。辅助存储设备包括例如硬盘驱动器和/或可移除存储驱动器,诸如软盘驱动器,磁带驱动器,光盘驱动器,数字多功能磁盘(DVD)驱动器,记录设备或通用串行总线(USB)闪存。可移除存储驱动器以众所周知的方式从可移除存储单元读取和/或写入可移除存储单元。系统565还可包括硬连线电源,电池电源或其组合(未示出)。电源可以向系统565提供电力以使系统565的组件能够操作。The system 565 may also include secondary storage (not shown). The secondary storage includes, for example, a hard disk drive and/or a removable storage drive, such as a floppy disk drive, a magnetic tape drive, an optical disk drive, a digital versatile disk (DVD) drive, a recording device, or universal serial bus (USB) flash memory. The removable storage drive reads from and/or writes to a removable storage unit in a well-known manner. The system 565 may also include a hardwired power supply, a battery power supply, or a combination thereof (not shown). The power supply can provide power to the system 565 to enable the components of the system 565 to operate.
前述模块和/或设备中的每一个甚至可以被放置在单个半导体平台上以形成系统565。或者,根据用户的需求,各种模块也可以被独立地放置或以半导体平台的各种组合放置。尽管上面已经描述了各种实施例,但是应该理解,它们仅是示例性的,而非限制性的。因此,优选实施例的广度和范围不应由任何上述示例性实施例限制,而应仅根据所附权利要求及其等同物来限定。Each of the aforementioned modules and/or devices can even be placed on a single semiconductor platform to form system 565. Alternatively, the various modules can also be placed independently or in various combinations of semiconductor platforms according to the needs of the user. Although various embodiments have been described above, it should be understood that they are only exemplary and not restrictive. Therefore, the breadth and scope of the preferred embodiments should not be limited by any of the above exemplary embodiments, but should only be defined in accordance with the appended claims and their equivalents.
示例网络环境Example network environment
适用于实现本公开的实施例的网络环境可以包括一个或更多个客户端设备,服务器,网络附加存储(NAS),其他后端设备和/或其他设备类型。客户端设备,服务器和/或其他设备类型(例如,每个设备)可以在图5B的处理系统500和/或图5C的示例性系统565的一个或更多个实例上实现,例如,每个设备可以包括处理系统500和/或示例性系统565的类似组件,特征和/或功能。A network environment suitable for implementing embodiments of the present disclosure may include one or more client devices, servers, network attached storage (NAS), other backend devices, and/or other device types. The client devices, servers, and/or other device types (e.g., each device) may be implemented on one or more instances of the processing system 500 of FIG. 5B and/or the exemplary system 565 of FIG. 5C, e.g., each device may include similar components, features, and/or functionality of the processing system 500 and/or the exemplary system 565.
网络环境的组件可以经由可以是有线,无线或两者的一个或更多个网络相互通信。该网络可以包括多个网络或网络的网络。举例来说,网络可以包括一个或更多个WAN,一个或更多个LAN,一个或更多个公共网络,例如互联网和/或公共交换电话网络(PSTN),和/或一个或更多个专用网络。在网络包括无线电信网络的情况下,诸如基站,通信塔,甚至接入点之类的组件(以及其他组件)可以提供无线连接。The components of the network environment can communicate with each other via one or more networks that can be wired, wireless, or both. The network can include multiple networks or networks of networks. For example, the network can include one or more WANs, one or more LANs, one or more public networks such as the Internet and/or the Public Switched Telephone Network (PSTN), and/or one or more private networks. In the case where the network includes a wireless telecommunications network, components such as base stations, communication towers, and even access points (among other components) can provide wireless connectivity.
兼容的网络环境可以包括一个或更多个对等网络环境(在这种情况下,服务器可能不包含在网络环境中)以及一个或更多个客户端-服务器网络环境(在这种情况下,一个或更多个服务器可能包含在网络环境中)。在对等网络环境中,本文针对一个或更多个服务器描述的功能可以在任意数量的客户端设备上实现。Compatible network environments may include one or more peer-to-peer network environments (in which case a server may not be included in the network environment) and one or more client-server network environments (in which case one or more servers may be included in the network environment). In a peer-to-peer network environment, the functionality described herein for one or more servers may be implemented on any number of client devices.
在至少一个实施例中,网络环境可以包括一个或更多个基于云的网络环境,分布式计算环境,其组合等。基于云的网络环境可以包括框架层,作业调度器,资源管理器以及在一个或更多个服务器上实现的分布式文件系统,该服务器可以包括一个或更多个核心网络服务器和/或边缘服务器。框架层可以包括用于支持软件层的软件和/或应用程序层的一个或更多个应用程序的框架。该软件或一个或更多个应用程序可以分别包括基于网络的服务软件或应用程序。在实施例中,一个或更多个客户端设备可以使用基于网络的服务软件或应用程序(例如,通过经由一个或更多个应用程序编程接口(API)访问服务软件和/或应用程序)。框架层可以是但不限于一种自由和开源软件网页应用程序框架,其例如可以使用分布式文件系统进行大规模数据处理(例如"大数据")。In at least one embodiment, the network environment may include one or more cloud-based network environments, distributed computing environments, combinations thereof, and the like. The cloud-based network environment may include a framework layer, a job scheduler, a resource manager, and a distributed file system implemented on one or more servers, which may include one or more core network servers and/or edge servers. The framework layer may include a framework for supporting software at the software layer and/or one or more applications at the application layer. The software or one or more applications may include network-based service software or applications, respectively. In an embodiment, one or more client devices may use network-based service software or applications (e.g., by accessing the service software and/or applications via one or more application programming interfaces (APIs)). The framework layer may be, but is not limited to, a free and open-source software web application framework that may, for example, use a distributed file system for large-scale data processing (e.g., "big data").
基于云的网络环境可以提供执行本文所述的计算和/或数据存储功能的任何组合(或其一个或更多个部分)的云计算和/或云存储。这些各种功能中的任何一个都可以从中央或核心服务器(例如,可以跨州,地区,国家,全球等分布的一个或更多个数据中心的)分布在多个位置。如果到用户(例如,客户端设备)的连接相对靠近一个或更多个边缘服务器,则一个或更多个核心服务器可以向一个或更多个边缘服务器指定功能的至少一部分。基于云的网络环境可以是私有的(例如,限于单个组织),可以是公共的(例如,可供许多组织使用)和/或其组合(例如,混合云环境)。A cloud-based network environment can provide cloud computing and/or cloud storage that performs any combination (or one or more portions thereof) of the computing and/or data storage functions described herein. Any of these various functions can be distributed across multiple locations from a central or core server (e.g., one or more data centers that can be distributed across states, regions, countries, the world, etc.). If the connection to a user (e.g., a client device) is relatively close to one or more edge servers, one or more core servers can assign at least a portion of the functionality to one or more edge servers. A cloud-based network environment can be private (e.g., limited to a single organization), public (e.g., available to many organizations), and/or a combination thereof (e.g., a hybrid cloud environment).
一个或更多个客户端设备可以包括图5B的示例处理系统500和/或图5C的示例性系统565的至少一些组件,特征和功能。作为示例而非限制,客户端设备可以体现为个人计算机(PC),膝上型计算机,移动设备,智能手机,平板计算机,智能手表,可穿戴计算机,个人数字助理(PDA),MP3播放器,虚拟现实耳机,全球定位系统(GPS)设备,视频播放器,摄像机,监视设备或系统,车辆,船只,飞船,虚拟机,无人机,机器人,手持通信设备,医院设备,游戏设备或系统,娱乐系统,车辆计算机系统,嵌入式系统控制器,遥控器,家用电器,消费电子设备,工作站,边缘设备,这些描绘的设备的任何组合或任何其他合适的设备。One or more client devices may include at least some components, features, and functionality of the example processing system 500 of FIG. 5B and/or the example system 565 of FIG. 5C. By way of example and not limitation, a client device may be embodied as a personal computer (PC), a laptop computer, a mobile device, a smartphone, a tablet computer, a smart watch, a wearable computer, a personal digital assistant (PDA), an MP3 player, a virtual reality headset, a global positioning system (GPS) device, a video player, a camera, a surveillance device or system, a vehicle, a boat, a spacecraft, a virtual machine, a drone, a robot, a handheld communication device, a hospital device, a gaming device or system, an entertainment system, a vehicle computer system, an embedded system controller, a remote control, an appliance, a consumer electronic device, a workstation, an edge device, any combination of the depicted devices, or any other suitable device.
机器学习Machine Learning
在处理器(诸如PPU 300)上开发的深度神经网络(DNN)已经用于各种使用情况:从自驾车到更快药物开发,从在线图像数据库中的自动图像字幕到视频聊天应用中的智能实时语言翻译。深度学习是一种技术,它建模人类大脑的神经学习过程,不断学习,不断变得更聪明,并且随着时间的推移更快地传送更准确的结果。一个孩子最初是由成人教导,以正确识别和分类各种形状,最终能够在没有任何辅导的情况下识别形状。同样,深度学习或神经学习系统需要在对象识别和分类方面进行训练,以便在识别基本对象、遮挡对象等同时还有为对象分配情景时变得更加智能和高效。Deep Neural Networks (DNNs) developed on processors such as the PPU 300 have been used in a variety of use cases: from self-driving cars to faster drug development, from automatic image captioning in online image databases to intelligent real-time language translation in video chat applications. Deep learning is a technology that models the neural learning process of the human brain, constantly learning, constantly getting smarter, and delivering more accurate results faster over time. A child is initially taught by an adult to correctly identify and classify various shapes, and eventually is able to recognize shapes without any coaching. Similarly, deep learning or neural learning systems need to be trained in object recognition and classification in order to become smarter and more efficient in recognizing basic objects, occluded objects, etc., while also assigning context to objects.
在最简单的层面上,人类大脑中的神经元查看接收到的各种输入,将重要性水平分配给这些输入中的每一个,并且将输出传递给其他神经元以进行处理。人造神经元或感知器是神经网络的最基本模型。在一个示例中,感知器可以接收一个或更多个输入,其表示感知器正被训练为识别和分类的对象的各种特征,并且在定义对象形状时,这些特征中的每一个基于该特征的重要性指派一定的权重。At the simplest level, neurons in the human brain look at the various inputs they receive, assign importance levels to each of these inputs, and pass outputs to other neurons for processing. An artificial neuron, or perceptron, is the most basic model of a neural network. In one example, a perceptron can receive one or more inputs representing various features of an object that the perceptron is being trained to recognize and classify, and each of these features is assigned a certain weight based on the importance of that feature in defining the shape of the object.
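作为示例而非限制,以下Python草图(假设使用NumPy;函数名和数值是为说明而假设的,并非本公开的一部分)示出了感知器如何对其输入加权,求和,并将结果与阈值进行比较。By way of example and not limitation, the following Python sketch (assuming NumPy; the names and values are hypothetical, for illustration only, and are not part of this disclosure) shows how a perceptron weighs its inputs, sums them, and compares the result to a threshold:

    import numpy as np

    def perceptron(features, weights, bias):
        # Each input feature is scaled by the weight (importance)
        # assigned to it, and the weighted evidence is summed.
        activation = np.dot(weights, features) + bias
        # Output 1 only if the weighted sum crosses the threshold.
        return 1 if activation > 0 else 0

    # Hypothetical example: three features describing a shape.
    features = np.array([0.9, 0.2, 0.4])
    weights = np.array([1.5, -0.5, 0.8])  # importance per feature
    print(perceptron(features, weights, bias=-1.0))  # prints 1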
深度神经网络(DNN)模型包括许多连接节点(例如,感知器、玻尔兹曼机器、径向基函数、卷积层等)的多个层,其可以用大量输入数据来训练以快速高精度地解决复杂问题。在一个示例中,DNN模型的第一层将汽车的输入图像分解为各个部分,并查找基本图案(诸如线条和角)。第二层组装线条以寻找更高水平的图案,诸如轮子、挡风玻璃和镜子。下一层识别运载工具类型,最后几层为输入图像生成标签,识别特定汽车品牌的型号。Deep neural network (DNN) models include multiple layers of many connected nodes (e.g., perceptrons, Boltzmann machines, radial basis functions, convolutional layers, etc.), which can be trained with large amounts of input data to solve complex problems quickly and accurately. In one example, the first layer of a DNN model breaks down an input image of a car into its parts and looks for basic patterns (such as lines and angles). The second layer assembles lines to look for higher-level patterns, such as wheels, windshields, and mirrors. The next layer identifies the type of vehicle, and the last few layers generate labels for the input image, identifying the model of a specific car brand.
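作为说明,以下PyTorch草图(假设使用torch;层的数量和规模是任意选择的,并不描述本文公开的任何特定网络)粗略对应于上述层次:较早的卷积层响应基本图案,较深的层响应更高级的部件,最后的线性层为每个候选标签输出分数。As an illustration, the following PyTorch sketch (assuming torch; the number and sizes of the layers are arbitrary and do not describe any particular network disclosed herein) loosely mirrors the hierarchy described above, with early convolutions responding to basic patterns, deeper layers to higher-level parts, and a final linear layer emitting a score per candidate label:

    import torch
    import torch.nn as nn

    model = nn.Sequential(
        nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),   # basic patterns (lines, corners)
        nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),  # higher-level parts (wheels, mirrors)
        nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        nn.Linear(32, 10),                                       # one score per candidate label
    )
    logits = model(torch.randn(1, 3, 64, 64))  # one 64x64 RGB input image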
一旦DNN被训练,DNN就可以被部署并用于在被称为推理(inference)的过程中识别和分类对象或图案。推理的示例(DNN从给定输入中提取有用信息的过程)包括识别储蓄在ATM机中的支票存款上的手写数字、识别照片中朋友的图像、向超过五千万用户提供电影推荐、识别和分类不同类型的汽车、行人和无人驾驶汽车中的道路危险、或实时翻译人类言语。Once a DNN is trained, it can be deployed and used to recognize and classify objects or patterns in a process called inference. Examples of inference (the process by which a DNN extracts useful information from a given input) include recognizing handwritten numbers on check deposits at an ATM, identifying images of friends in photos, providing movie recommendations to over 50 million users, recognizing and classifying different types of cars, pedestrians, and road hazards in self-driving cars, or translating human speech in real time.
在训练期间,数据在前向传播阶段流过DNN,直到产生预测为止,其指示对应于输入的标签。如果神经网络没有正确标记输入,则分析正确标签和预测标签之间的误差,并且在后向传播阶段期间针对每个特征调整权重,直到DNN正确标记该输入和训练数据集中的其他输入为止。训练复杂的神经网络需要大量的并行计算性能,包括由PPU 300支持的浮点乘法和加法。与训练相比,推理的计算密集程度更低,是一个延迟敏感过程,其中经训练的神经网络应用于它以前没有见过的新的输入,以进行图像分类、翻译语音以及通常推理新的信息。During training, data flows through the DNN in a forward propagation phase until a prediction is produced, which indicates a label corresponding to the input. If the neural network does not correctly label the input, the error between the correct label and the predicted label is analyzed, and the weights are adjusted for each feature during a backward propagation phase until the DNN correctly labels that input and other inputs in the training data set. Training complex neural networks requires a large amount of parallel computing performance, including floating-point multiplications and additions supported by the PPU 300. Inference is less computationally intensive than training and is a latency-sensitive process in which a trained neural network is applied to new inputs it has not seen before to classify images, translate speech, and generally reason about new information.
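作为示例而非限制,以下PyTorch草图(假设性的;model和optimizer是占位符)将上述前向传播,误差分析和权重调整表示为单个训练步骤。By way of example and not limitation, the following PyTorch sketch (hypothetical; model and optimizer are placeholders) expresses the forward propagation, error analysis, and weight adjustment described above as a single training step:

    import torch.nn.functional as F

    def train_step(model, optimizer, images, labels):
        logits = model(images)                  # forward propagation to a prediction
        loss = F.cross_entropy(logits, labels)  # error between predicted and correct labels
        optimizer.zero_grad()
        loss.backward()                         # backward propagation of the error
        optimizer.step()                        # adjust the weights to reduce the error
        return loss.item()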
神经网络严重依赖于矩阵数学运算,并且复杂的多层网络需要大量的浮点性能和带宽来提高效率和速度。采用数千个处理核心,针对矩阵数学运算进行了优化,并传送数十到数百TFLOPS的性能,PPU 300是能够传送基于深度神经网络的人工智能和机器学习应用所需性能的计算平台。Neural networks rely heavily on matrix math operations, and complex multi-layer networks require a lot of floating point performance and bandwidth for efficiency and speed. With thousands of processing cores optimized for matrix math operations and delivering tens to hundreds of TFLOPS of performance, the PPU 300 is a computing platform capable of delivering the performance required for deep neural network-based artificial intelligence and machine learning applications.
此外,应用本文公开的一种或更多种技术产生的图像可以用于训练,测试或证明用于识别现实世界中的对象和环境的DNN。这样的图像可以包括道路,工厂,建筑物,城市环境,农村环境,人类,动物以及任何其他物理对象或真实环境的场景。此类图像可用于训练,测试或认证在机器或机器人中使用的DNN,以操纵,处理或修改现实世界中的物理对象。此外,此类图像可用于训练,测试或认证在自动驾驶车辆中使用的DNN,以在现实世界中导航和移动车辆。另外,应用本文公开的一种或更多种技术产生的图像可以用于向这些机器,机器人和车辆的用户传达信息。In addition, images generated by applying one or more of the techniques disclosed herein can be used to train, test, or certify DNNs for recognizing objects and environments in the real world. Such images can include scenes of roads, factories, buildings, urban environments, rural environments, humans, animals, and any other physical objects or real environments. Such images can be used to train, test, or certify DNNs used in machines or robots to manipulate, process, or modify physical objects in the real world. In addition, such images can be used to train, test, or certify DNNs used in autonomous vehicles to navigate and move vehicles in the real world. In addition, images generated by applying one or more of the techniques disclosed herein can be used to convey information to users of these machines, robots, and vehicles.
图5D示出了根据至少一个实施例的可用于训练和利用机器学习的示例性系统555的组件。如将要讨论的,可以由计算设备和资源或单个计算系统的各种组合来提供各种组件,其可以在单个实体或多个实体的控制下。此外,方面可以由不同实体触发,发起或请求。在至少一个实施例中,可以由与提供商环境506相关联的提供商来指导神经网络的训练,而在至少一个实施例中,可以通过能够通过客户端设备502或其他此类资源访问提供商环境的顾客或其他用户来请求训练神经网络。在至少一个实施例中,可以由提供者,用户或第三方内容提供者524提供训练数据(或要由经训练的神经网络分析的数据)。在至少一个实施例中,客户端设备502可以是要代表用户进行导航的车辆或对象,例如,其可以提交请求和/或接收有助于设备导航的指令。FIG. 5D illustrates components of an exemplary system 555 that may be used to train and utilize machine learning in accordance with at least one embodiment. As will be discussed, various components may be provided by various combinations of computing devices and resources or a single computing system, which may be under the control of a single entity or multiple entities. Additionally, aspects may be triggered, initiated, or requested by different entities. In at least one embodiment, the training of a neural network may be directed by a provider associated with a provider environment 506, while in at least one embodiment, the training of a neural network may be requested by a customer or other user who has access to the provider environment through a client device 502 or other such resource. In at least one embodiment, the training data (or data to be analyzed by the trained neural network) may be provided by a provider, a user, or a third-party content provider 524. In at least one embodiment, a client device 502 may be a vehicle or object that is to be navigated on behalf of a user, which, for example, may submit requests and/or receive instructions that facilitate navigation of the device.
在至少一个实施例中,请求能够在至少一个网络504上被提交,以被提供商环境506接收。在至少一个实施例中,客户端设备502可以是使用户能够生成并发送此类请求的任何适当的电子和/或计算设备,例如但不限于台式计算机,笔记本计算机,计算机服务器,智能电话,平板计算机,游戏机(便携式或其他),计算机处理器,计算逻辑和机顶盒。一个或更多个网络504可以包括用于传输请求或其他此类数据的任何适当的网络,例如可以包括互联网,内联网,蜂窝网络,局域网(LAN),广域网(WAN),个人区域网络(PAN),对等方之间直接无线连接的ad hoc网络等。In at least one embodiment, the request can be submitted over at least one network 504 to be received by the provider environment 506. In at least one embodiment, the client device 502 can be any suitable electronic and/or computing device that enables a user to generate and send such a request, such as, but not limited to, a desktop computer, a laptop computer, a computer server, a smart phone, a tablet computer, a game console (portable or otherwise), a computer processor, computing logic, and a set-top box. The one or more networks 504 may include any suitable network for transmitting requests or other such data, such as the Internet, an intranet, a cellular network, a local area network (LAN), a wide area network (WAN), a personal area network (PAN), an ad hoc network for direct wireless connection between peers, etc.
在至少一个实施例中,在该示例中,可以在接口层508接收请求,该接口层可以将数据转发到训练和推理管理器532。训练和推理管理器532可以是包括用于管理请求和服务对应数据或内容的硬件和软件的系统或服务。在至少一个实施例中,训练和推理管理器532可以接收训练神经网络的请求,并且可以将该请求的数据提供给训练模块512。在至少一个实施例中,如果请求未指定,则训练模块512可以选择要使用的合适的模型或神经网络,并且可以使用相关的训练数据来训练模型。在至少一个实施例中,训练数据可以是存储在训练数据存储库514中,从客户端设备502接收或从第三方提供商524获得的一批数据。在至少一个实施例中,训练模块512可以负责训练数据。神经网络可以是任何适当的网络,例如递归神经网络(RNN)或卷积神经网络(CNN)。一旦训练了神经网络并成功地对其进行了评估,就可以将经训练的神经网络存储在例如模型存储库516中,该模型存储库可以存储用于用户,应用程序或服务等的不同模型或网络。在至少一个实施例中,单个应用程序或实体可以基于多个不同因素使用多个模型。In at least one embodiment, in this example, the request may be received at the interface layer 508, which may forward the data to the training and inference manager 532. The training and inference manager 532 may be a system or service including hardware and software for managing requests and serving corresponding data or content. In at least one embodiment, the training and inference manager 532 may receive a request to train a neural network and may provide data for the request to the training module 512. In at least one embodiment, the training module 512 may select an appropriate model or neural network to use, if one is not specified by the request, and may train the model using the relevant training data. In at least one embodiment, the training data may be a batch of data stored in the training data repository 514, received from the client device 502, or obtained from the third-party provider 524. In at least one embodiment, the training module 512 may be responsible for the training data. The neural network may be any appropriate network, such as a recurrent neural network (RNN) or a convolutional neural network (CNN). Once the neural network is trained and successfully evaluated, the trained neural network may be stored in, for example, a model repository 516, which may store different models or networks for users, applications, services, and the like. In at least one embodiment, multiple models may be used by a single application or entity, based on a number of different factors.
在至少一个实施例中,在随后的时间点,可以从客户端设备502(或另一种这样的设备)接收至少部分地受经训练的神经网络确定或影响的内容(例如,路径确定)或数据的请求。该请求可以包括例如要使用神经网络处理以获得一个或更多个推论或其他输出值,分类或预测的输入数据。在至少一个实施例中,输入数据可以由接口层508接收并被引导至推理模块518,尽管也可以使用不同的系统或服务。在至少一个实施例中,如果推理模块518中尚未本地存储,则推理模块518可以从模型存储库516获得适当的经训练的网络,例如本文所讨论的经训练的深度神经网络(DNN)。推理模块518可以提供数据作为经训练的网络的输入,然后可以生成一个或更多个推理作为输出。例如,这可以包括输入数据实例的分类。在至少一个实施例中,然后可以将推理发送到客户端设备502,以向用户显示或进行其他通信。在至少一个实施例中,还可以将用于用户的上下文数据存储到用户上下文数据存储库522中,该用户上下文数据存储库522可以包括关于用户的数据,该数据可以在生成推理或在获取实例后确定要返回给用户的数据时用作对网络的输入。在至少一个实施例中,也可以将可能包括输入或推理数据中的至少一些的相关数据存储到本地数据库534中,以处理将来的请求。在至少一个实施例中,用户可以使用账户信息或其他信息来访问提供商环境的资源或功能。在至少一个实施例中,如果允许和可用,则还可以收集用户数据并将其用于进一步训练模型,以便为将来的请求提供更准确的推理。在至少一个实施例中,可以通过用户界面接收对在客户端设备502上执行的机器学习应用程序526的请求,并通过相同的界面显示结果。客户端设备可以包括诸如用于生成请求和处理结果或响应的处理器528和存储器562之类的资源,以及至少一个用于存储用于机器学习应用程序526的数据的数据存储元件552。In at least one embodiment, at a later point in time, a request for content (e.g., a path determination) or data that is at least partially determined or influenced by a trained neural network may be received from the client device 502 (or another such device). The request may include, for example, input data to be processed using the neural network to obtain one or more inferences or other output values, classifications, or predictions. In at least one embodiment, the input data may be received by the interface layer 508 and directed to the inference module 518, although a different system or service may also be used. In at least one embodiment, if not already stored locally to the inference module 518, the inference module 518 may obtain an appropriate trained network, such as a trained deep neural network (DNN) as discussed herein, from the model repository 516. The inference module 518 may provide the data as input to the trained network, which may then generate one or more inferences as output. For example, this may include a classification of the input data instance. In at least one embodiment, the inferences may then be sent to the client device 502 for display or other communication to the user. In at least one embodiment, context data for the user may also be stored in a user context data repository 522, which may include data about the user that may be useful as input to the network when generating inferences or determining data to return to the user after obtaining an instance. In at least one embodiment, relevant data, which may include at least some of the input or inference data, may also be stored in a local database 534 for processing future requests. In at least one embodiment, a user may use account information or other information to access resources or functionality of the provider environment. In at least one embodiment, if permitted and available, user data may also be collected and used to further train the models in order to provide more accurate inferences for future requests. In at least one embodiment, requests for a machine learning application 526 executing on the client device 502 may be received through a user interface, and the results may be displayed through the same interface. The client device may include resources such as a processor 528 and a memory 562 for generating requests and processing results or responses, as well as at least one data storage element 552 for storing data for the machine learning application 526.
在至少一个实施例中,处理器528(或训练模块512或推理模块518的处理器)可以是中央处理单元(CPU)。但是,如上所述,这种环境中的资源可以利用GPU来处理至少某些类型的请求的数据。具有数千个内核的GPU(例如PPU 300)被设计为处理大量的并行工作负载,因此,在用于训练神经网络和生成预测的深度学习中变得很流行。尽管使用GPU进行离线构建可以更快地训练更大,更复杂的模型,但离线生成预测意味着无法使用请求时间输入特征,或者必须针对所有特征排列生成预测并将其存储在查找表中才能服务实时请求。如果深度学习框架支持CPU模式,并且模型又小又简单,足以以合理的延迟在CPU上执行前馈,则CPU实例上的服务可以托管模型。在这种情况下,可以在GPU上离线进行训练,而在CPU上实时进行推理。如果CPU方法不可行,则服务可以在GPU实例上运行。但是,由于GPU具有与CPU不同的性能和成本特征,因此运行将运行时算法卸载到GPU的服务可能要求其设计与基于CPU的服务不同。In at least one embodiment, the processor 528 (or a processor of the training module 512 or the inference module 518) may be a central processing unit (CPU). However, as described above, resources in such an environment may utilize GPUs to process data for at least some types of requests. GPUs with thousands of cores (such as the PPU 300) are designed to handle massively parallel workloads and, as a result, have become popular in deep learning for training neural networks and generating predictions. Although using GPUs for offline builds can train larger and more complex models faster, generating predictions offline means that request-time input features cannot be used, or predictions must be generated for all feature permutations and stored in a lookup table in order to serve real-time requests. If the deep learning framework supports a CPU mode and the model is small and simple enough to execute a feed-forward pass on the CPU with reasonable latency, a service on a CPU instance can host the model. In this case, training can be performed offline on a GPU, while inference can be performed in real time on the CPU. If the CPU approach is not feasible, the service can run on a GPU instance. However, because GPUs have different performance and cost characteristics than CPUs, running a service that offloads runtime algorithms to a GPU may require that its design be different from that of a CPU-based service.
在至少一个实施例中,可以从客户端设备502提供视频数据以在提供商环境506中进行增强。在至少一个实施例中,可以对视频数据进行处理以在客户端设备502上进行增强。在至少一个实施例中,视频数据可以从第三方内容提供者524进行流传输,并由第三方内容提供者524,提供者环境506或客户端设备502增强。在至少一个实施例中,可以从客户端设备502提供视频数据以用作在提供者环境506中训练数据。In at least one embodiment, video data may be provided from client device 502 for enhancement in provider environment 506. In at least one embodiment, video data may be processed for enhancement on client device 502. In at least one embodiment, video data may be streamed from third party content provider 524 and enhanced by third party content provider 524, provider environment 506, or client device 502. In at least one embodiment, video data may be provided from client device 502 for use as training data in provider environment 506.
在至少一个实施例中,可以由客户端设备502和/或提供商环境506来执行有监督和/或无监督训练。在至少一个实施例中,一组训练数据514(例如,分类或标记的数据)作为输入提供,以用作训练数据。在一个实施例中,该组训练数据可以在生成对抗训练配置中使用以训练生成器神经网络。In at least one embodiment, supervised and/or unsupervised training can be performed by the client device 502 and/or the provider environment 506. In at least one embodiment, a set of training data 514 (e.g., classified or labeled data) is provided as input to be used as training data. In one embodiment, the set of training data can be used in a generative adversarial training configuration to train a generator neural network.
在至少一个实施例中,训练数据可以包括要对其训练神经网络的至少一个人类对象,化身或角色的图像。在至少一个实施例中,训练数据可以包括要为其训练神经网络的至少一种对象类型的实例,以及标识该对象类型的信息。在至少一个实施例中,训练数据可以包括一组图像,每个图像包括一种对象类型的表示,其中,每个图像还包括标签,元数据,分类或标识相应图像中表示的对象类型的其他信息或与之相关联。各种其他类型的数据也可以用作训练数据,可以包括文本数据,音频数据,视频数据等等。在至少一个实施例中,训练数据514被提供作为训练输入到训练模块512。在至少一个实施例中,训练模块512可以是包括硬件和软件的系统或服务,例如执行训练应用的一个或更多个计算设备,用于训练神经网络(或其他模型或算法等)。在至少一个实施例中,训练模块512接收指示要用于训练的模型的类型的指令或请求。在至少一个实施例中,模型可以是对于这种目的有用的任何适当的统计模型,网络或算法,可以包括人工神经网络,深度学习算法,学习分类器,贝叶斯网络等。在至少一个实施例中,训练模块512可以从适当的存储库516中选择初始模型或其他未经训练的模型,并利用训练数据514来训练模型,从而生成经训练的模型(例如,经训练的深度神经网络),其可用于对相似类型的数据进行分类或生成其他此类推论。在不使用训练数据的至少一个实施例中,训练模块512仍然可以选择适当的初始模型以对输入数据进行训练。In at least one embodiment, the training data may include images of at least one human subject, avatar, or character for which the neural network is to be trained. In at least one embodiment, the training data may include instances of at least one type of object for which the neural network is to be trained, as well as information identifying that type of object. In at least one embodiment, the training data may include a set of images, each image including a representation of one type of object, wherein each image also includes, or is associated with, a label, metadata, classification, or other information identifying the type of object represented in the corresponding image. Various other types of data may also be used as training data, including text data, audio data, video data, and the like. In at least one embodiment, the training data 514 is provided as training input to the training module 512. In at least one embodiment, the training module 512 may be a system or service including hardware and software, such as one or more computing devices executing a training application, for training a neural network (or other model or algorithm, etc.). In at least one embodiment, the training module 512 receives an instruction or request indicating the type of model to be used for training. In at least one embodiment, the model may be any suitable statistical model, network, or algorithm useful for such a purpose, and may include an artificial neural network, a deep learning algorithm, a learning classifier, a Bayesian network, and the like. In at least one embodiment, the training module 512 may select an initial model or other untrained model from an appropriate repository 516 and train the model using the training data 514, thereby generating a trained model (e.g., a trained deep neural network) that can be used to classify similar types of data or generate other such inferences. In at least one embodiment in which training data is not used, the training module 512 may still select an appropriate initial model to train on the input data.
在至少一个实施例中,可以以多种不同的方式来训练模型,这可以部分地取决于所选择的模型的类型。在至少一个实施例中,可以向机器学习算法提供一组训练数据,其中模型是通过训练过程创建的模型工件。在至少一个实施例中,训练数据的每个实例包含正确答案(例如,分类),其可以被称为目标或目标属性。在至少一个实施例中,学习算法在训练数据中找到将输入数据属性映射到目标(即要预测的答案)的模式,并且输出捕获这些模式的机器学习模型。在至少一个实施例中,然后可以使用机器学习模型来获得未指定目标的新数据的预测。In at least one embodiment, the model can be trained in a number of different ways, which can depend in part on the type of model selected. In at least one embodiment, a set of training data can be provided to a machine learning algorithm, where the model is a model artifact created by the training process. In at least one embodiment, each instance of the training data contains a correct answer (e.g., a classification), which can be referred to as a target or target attribute. In at least one embodiment, the learning algorithm finds patterns in the training data that map the input data attributes to the target (the answer to be predicted) and outputs a machine learning model that captures these patterns. In at least one embodiment, the machine learning model can then be used to obtain predictions for new data for which the target is not specified.
在至少一个实施例中,训练和推理管理器532可以从包括二进制分类,多分类,生成和回归模型的一组机器学习模型中进行选择。在至少一个实施例中,要使用的模型的类型可以至少部分取决于要预测的目标的类型。In at least one embodiment, the training and inference manager 532 can select from a group of machine learning models including binary classification, multi-classification, generative, and regression models. In at least one embodiment, the type of model to be used can depend at least in part on the type of target to be predicted.
GAN辅助的视频编码和重建GAN-assisted video coding and reconstruction
视频会议和类似的应用程序需要大量带宽才能通过网络将图像传输到边缘设备。如果没有足够的带宽,则图像和/或音频质量会受到影响。可以采用常规的图像压缩技术来在传输之前压缩图像并解压缩图像以在接收设备处显示。但是,当带宽极度受限或连接不可靠时,常规技术可能不够鲁棒。Video conferencing and similar applications require a lot of bandwidth to transmit images over the network to edge devices. If there is not enough bandwidth, the image and/or audio quality will be affected. Conventional image compression techniques can be used to compress images before transmission and decompress images for display at the receiving device. However, conventional techniques may not be robust enough when bandwidth is extremely limited or connections are unreliable.
在诸如视频会议(VC)之类的应用程序中,其中在相对一致的情况下传输单个对象的大量的连续镜头,可以使用生成器神经网络(诸如基于样式的生成器系统100)对视频进行编码以作为中间潜码或外观矢量。合成神经网络(诸如合成神经网络140)然后可以从外观矢量重建图像。In applications such as video conferencing (VC), where a large number of consecutive shots of a single object are transmitted in relatively consistent conditions, a generator neural network (such as style-based generator system 100) can be used to encode the video as an intermediate latent code or appearance vector. A synthetic neural network (such as synthetic neural network 140) can then reconstruct the image from the appearance vector.
在一个实施例中,外观矢量包括抽象潜码(例如,中间潜码),一组(面部)界标点,一组与众所周知的面部动作编码系统(FACS)有关的系数,或表示学习的特征嵌入空间中面部外观的矢量中的至少一个。In one embodiment, the appearance vector comprises at least one of an abstract latent code (e.g., an intermediate latent code), a set of (facial) landmark points, a set of coefficients related to the well-known Facial Action Coding System (FACS), or a vector representing facial appearance in a learned feature embedding space.
可以捕获和处理对象的图像,以将每个视频帧投影或映射到合成神经网络的潜在空间中,以产生被发送到接收设备的外观矢量。外观矢量编码对象的属性,并且是图像的压缩表示。在潜在空间中运行的合成神经网络可以配置为渲染外观矢量,以在接收器处重建图像,从而有效地对外观矢量进行解压缩。例如,在视频会议期间,在照相机,姿势和照明相当稳定的条件下,对象通常是一个人。这样的人讲话和收听的视频流在很大程度上是多余的,因为视频帧仅包含同一人的微小变化。Images of objects can be captured and processed to project or map each video frame into the latent space of a synthetic neural network to produce an appearance vector that is sent to a receiving device. The appearance vector encodes the properties of the object and is a compressed representation of the image. The synthetic neural network operating in the latent space can be configured to render the appearance vector to reconstruct the image at the receiver, effectively decompressing the appearance vector. For example, during a video conference, the subject is often a single person under fairly stable conditions of camera, pose, and lighting. Video streams of such a person speaking and listening are largely redundant because the video frames contain only minor variations of the same person.
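作为说明而非限制,以下草图(假设性的;encoder和synthesis是表示本文所述组件的占位符)概述了该压缩/解压缩流程:每个捕获帧被投影为外观矢量,接收器再从该矢量重建图像。As an illustration and not limitation, the following sketch (hypothetical; encoder and synthesis are placeholders for the components described herein) outlines the compression/decompression flow, in which each captured frame is projected to an appearance vector and the receiver reconstructs an image from that vector:

    import torch

    def compress_frame(encoder, frame):
        # Project the captured frame into the latent space; the
        # appearance vector is the compressed payload to transmit.
        with torch.no_grad():
            return encoder(frame)  # e.g., a 512-value vector

    def decompress_frame(synthesis, appearance_vector):
        # Render the appearance vector back into pixels at the receiver.
        with torch.no_grad():
            return synthesis(appearance_vector)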
此外,标准视频广播通常在诸如对象的外观之类的方面几乎不提供高级控制。相反,合成神经网络能够增强控制能力,特别是能够使特定对象的特征与其图像在视频帧中被捕获的人的运动解耦合的能力。因此,如本文进一步所述,使用合成神经网络来重建压缩视频使得能够在重建期间进行控制修改。Furthermore, standard video broadcasts typically provide little high-level control over aspects such as the appearance of an object. In contrast, synthetic neural networks are able to enhance control capabilities, particularly the ability to decouple the characteristics of a particular object from the motion of the person whose image is captured in the video frames. Thus, as further described herein, the use of synthetic neural networks to reconstruct compressed video enables controlled modifications during reconstruction.
外观矢量为重建的视频帧提供姿势,表情等的实时信息,并且复制数据传达其相似性被捕获和广播的人的基本特征。复制数据(例如,经训练的神经网络的权重)可以在训练期间确定并发送到接收器。The appearance vector provides real-time information such as pose and expression for the reconstructed video frames, and the replica data conveys the essential features of the person whose likeness is captured and broadcast. The replica data (e.g., the weights of a trained neural network) can be determined during training and sent to the receiver.
在合成神经网络的训练期间使用的人类对象的特征可以应用于重建的视频帧,即使当不同的人类对象出现在从中生成外观矢量的捕获图像中时也是如此。换句话说,复制数据通过合成神经网络传输到重建的视频帧。复制数据可以使用其相似性被捕获和广播的同一个人来生成,但是具有不同的属性,例如不同的发型,衣服和/或场景照明。例如,当该人具有她喜欢的发型,穿着制服以及在演播室照明条件下时,可以生成该人的复制数据。相反,可以在同一个人的头发样式不同,戴着帽子或眼镜且光线不足的情况下生成外观矢量。复制数据将被传输到重建图像上,因此她似乎拥有自己喜欢的发型,穿着制服,并且在演播室照明条件下捕获了她的图像。在另一个示例中,可以为另一个人生成复制数据,并且将该另一个人的属性转移到重建图像。因此,复制数据可用于修改重建图像的一个或更多个方面。在另一个实施例中,由合成神经网络用于重建的各个属性由外观神经网络提供。例如,其图像被捕获的人可以选择广播给接收机的不同属性(例如,戴眼镜,眼睛的颜色等)。在另一示例中,接收机可以选择一个或更多个不同的属性以用于重建。The characteristics of the human subject used during training of the synthetic neural network can be applied to the reconstructed video frames, even when a different human subject appears in the captured images from which the appearance vectors are generated. In other words, the replica data is transferred to the reconstructed video frames by the synthetic neural network. The replica data can be generated using the same person whose likeness is captured and broadcast, but with different attributes, such as a different hairstyle, clothing, and/or scene lighting. For example, the replica data for a person can be generated when the person has her preferred hairstyle, is wearing a uniform, and is under studio lighting conditions. Conversely, the appearance vectors can be generated while the same person has a different hair style, is wearing a hat or glasses, and is poorly lit. The replica data will be transferred onto the reconstructed images, so she appears to have her preferred hairstyle, to be wearing a uniform, and to have had her image captured under studio lighting conditions. In another example, the replica data can be generated for another person, and the attributes of that other person are transferred to the reconstructed images. Thus, the replica data can be used to modify one or more aspects of the reconstructed images. In another embodiment, the individual attributes used by the synthetic neural network for reconstruction are provided by the appearance neural network. For example, the person whose image is captured can select different attributes (e.g., wearing glasses, eye color, etc.) that are broadcast to the receiver. In another example, the receiver can select one or more different attributes to use for reconstruction.
现在将根据用户的需求,给出关于可用于实现前述框架的各种可选架构和特征的更多说明性信息。应该特别注意的是,以下信息是出于说明目的而提出的,不应以任何方式解释为限制。下列任何特征都可以视需要可选地并入,无论是否排除所描述的其他特征。More illustrative information will now be set forth regarding various optional architectures and features with which the foregoing framework may be implemented, per the desires of the user. It should be particularly noted that the following information is set forth for illustrative purposes and should not be construed as limiting in any manner. Any of the following features may optionally be incorporated, with or without the exclusion of the other features described.
图6A示出了适用于实现本公开的一些实施例的示例性视频流传输系统600。应该理解,本文描述的这种和其他布置仅作为示例阐述。除了或代替所示的那些,可以使用其他布置和元件(例如,机器,界面,功能,命令,功能的分组等),并且可以完全省略一些元件。此外,本文描述的许多元件是功能实体,其可以被实现为离散或分布式组件或与其他组件结合并且以任何合适的组合和位置来实现。本文描述为由实体执行的各种功能可以由硬件,固件和/或软件来执行。例如,各种功能可以由执行存储在存储器中的指令的处理器来执行。此外,本领域普通技术人员将理解,执行示例性视频流系统600的操作的任何系统都在本公开的实施例的范围和精神内。Fig. 6A shows an exemplary video streaming system 600 suitable for implementing some embodiments of the present disclosure. It should be understood that this and other arrangements described herein are only set forth as examples. In addition to or in place of those shown, other arrangements and elements (e.g., machines, interfaces, functions, commands, groupings of functions, etc.) can be used, and some elements can be omitted completely. In addition, many elements described herein are functional entities, which can be implemented as discrete or distributed components or combined with other components and implemented in any suitable combination and position. The various functions described herein as being performed by an entity can be performed by hardware, firmware and/or software. For example, various functions can be performed by a processor executing instructions stored in a memory. In addition, it will be appreciated by those of ordinary skill in the art that any system that performs the operation of the exemplary video streaming system 600 is within the scope and spirit of the embodiments of the present disclosure.
图6A包括客户端设备603和604(其可以包括与图5B的示例处理系统500,图5C的示例系统565和/或图5D的示例系统555相似的组件,特征和/或功能)以及一个或更多个网络504(可能与本文描述的一个或更多个网络类似)。在本公开的一些实施例中,系统600可以包括提供商环境506,并且客户端设备603和/或604可以分别是客户端设备502。尽管发送客户端设备603被描述为源或发送者,但是发送客户端设备603可以被配置为同时执行目的地或接收客户端设备604的操作。类似地,尽管将一个或更多个接收客户端设备604描述为目标或接收者,但是一个或更多个接收客户端设备604可以被配置为同时执行源或发送客户端设备603的操作。FIG. 6A includes client devices 603 and 604 (which may include components, features, and/or functionality similar to the example processing system 500 of FIG. 5B, the example system 565 of FIG. 5C, and/or the example system 555 of FIG. 5D) and one or more networks 504 (which may be similar to the one or more networks described herein). In some embodiments of the present disclosure, the system 600 may include the provider environment 506, and the client devices 603 and/or 604 may each be a client device 502. Although the sending client device 603 is described as a source or sender, the sending client device 603 may be configured to simultaneously perform the operations of a destination or receiving client device 604. Similarly, although the one or more receiving client devices 604 are described as targets or recipients, the one or more receiving client devices 604 may be configured to simultaneously perform the operations of a source or sending client device 603.
在系统600中,对于视频会议会话,发送客户端设备603可以使用数据捕获组件614来捕获输入数据。输入数据可以是图像,音频,注视方向,注视位置和由输入设备560捕获的其他类型的数据。数据捕获组件614提供捕获的数据,用于训练编码器616将输入潜在空间(例如潜在空间Z)中的输入投影到与合成神经网络关联的潜在空间(例如中间潜空间W)中。当数据是捕获人605(例如与发送客户端设备603交互或正在观看发送客户端设备603的人)的视频帧时,编码器616被训练为每个帧产生外观矢量。一旦训练了编码器616以将输入潜在空间中的输入投影到潜在空间W中的外观矢量,编码器就将人605的捕获帧转换为外观矢量。In the system 600, for a video conferencing session, the sending client device 603 may use the data capture component 614 to capture input data. The input data can be images, audio, gaze direction, gaze location, and other types of data captured by the input device 560. The data capture component 614 provides the captured data for training the encoder 616 to project inputs in an input latent space (e.g., latent space Z) into a latent space (e.g., intermediate latent space W) associated with a synthetic neural network. When the data is video frames capturing a person 605 (e.g., a person interacting with or viewing the sending client device 603), the encoder 616 is trained to produce an appearance vector for each frame. Once the encoder 616 is trained to project inputs in the input latent space to appearance vectors in the latent space W, the encoder converts captured frames of the person 605 into appearance vectors.
在训练期间学习编码器616的参数(例如权重),并且当部署编码器616以生成外观矢量时,使用该参数来处理输入的潜码。在一个实施例中,编码器616包括映射神经网络110,并且一个或更多个接收客户端设备604内的解码器622包括合成神经网络140和样式转换单元115。Parameters (e.g., weights) of the encoder 616 are learned during training, and when the encoder 616 is deployed to generate appearance vectors, the parameters are used to process the input latent code. In one embodiment, the encoder 616 includes the mapping neural network 110, and the decoder 622 within the one or more receiving client devices 604 includes the synthesis neural network 140 and the style transfer unit 115.
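作为示例而非限制,以下草图(假设性的;以简单的L1重建损失代替任何特定的训练目标)示出了可以如何训练编码器616,使得(冻结的)合成网络能够从其产生的外观矢量重现输入帧。By way of example and not limitation, the following sketch (hypothetical; a simple L1 reconstruction loss stands in for any particular training objective) shows how the encoder 616 might be trained so that the (frozen) synthesis network can reproduce the input frames from the appearance vectors the encoder produces:

    import torch.nn.functional as F

    def encoder_train_step(encoder, synthesis, optimizer, frames):
        w = encoder(frames)           # appearance vectors in latent space W
        reconstructed = synthesis(w)  # frozen synthesis network renders them
        loss = F.l1_loss(reconstructed, frames)
        optimizer.zero_grad()         # optimizer holds only the encoder parameters
        loss.backward()
        optimizer.step()
        return loss.item()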
在一实施例中,编码器616可产生一帧的外观矢量,然后为一个或更多个后续帧产生外观矢量调整。在一个实施例中,编码器616可以被配置为基于度量或以预定间隔生成外观矢量,以替代或补充一个或更多个帧的外观矢量调整。In one embodiment, the encoder 616 may generate an appearance vector for one frame and then generate appearance vector adjustments for one or more subsequent frames. In one embodiment, the encoder 616 may be configured to generate appearance vectors based on a metric or at predetermined intervals, instead of or in addition to the appearance vector adjustments for one or more frames.
在另一个实施例中,一个或更多个接收客户端设备604可以在两个外观矢量之间进行内插以生成附加的外观矢量和附加的重建帧。以这种方式,重建的帧可以多于捕获的帧。使解码器622重建不同图像的外观矢量可以被视为高维空间中的矢量,并且可以对这些"关键"外观矢量进行插值,以产生与对应于"关键"外观矢量的重建帧"之间"的图像相对应的外观矢量。成功训练的解码器622往往具有"更平滑"的潜在空间,其中内插的外观矢量忠实地捕获了捕获图像之间平滑自然的视觉过渡。In another embodiment, one or more receiving client devices 604 may interpolate between two appearance vectors to generate additional appearance vectors and additional reconstructed frames. In this manner, more frames may be reconstructed than were captured. The appearance vectors from which the decoder 622 reconstructs different images may be viewed as vectors in a high-dimensional space, and these "key" appearance vectors may be interpolated to produce appearance vectors corresponding to images "between" the reconstructed frames that correspond to the "key" appearance vectors. A successfully trained decoder 622 tends to have a "smoother" latent space, in which the interpolated appearance vectors faithfully capture smooth, natural visual transitions between the captured images.
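作为说明,以下草图(假设性的)在两个"关键"外观矢量之间进行线性插值,以产生上述附加的外观矢量。As an illustration, the following sketch (hypothetical) linearly interpolates between two "key" appearance vectors to produce the additional appearance vectors described above:

    import torch

    def interpolate_appearance(w0, w1, num_steps):
        # Produce appearance vectors "between" two key vectors; each
        # interpolated vector can be decoded into an additional frame.
        return [torch.lerp(w0, w1, t) for t in torch.linspace(0.0, 1.0, num_steps)]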
当发送客户端设备603没有为每个捕获图像生成外观矢量时,可以为没有生成外观矢量的帧重建附加帧。可以通过重建附加帧来获得慢动作效果。当一个或更多个外观矢量被破坏或丢弃(由于网络拥塞等)并且未被接收客户端设备604接收到时,可以为丢失的外观矢量重建附加帧。此外,可以由发送客户端设备603压缩外观矢量和外观矢量调整中的一个或更多个,并由一个或更多个接收客户端设备604使用常规技术解压缩。When the sending client device 603 does not generate an appearance vector for every captured image, additional frames may be reconstructed for the frames for which no appearance vector was generated. A slow-motion effect can be obtained by reconstructing additional frames. When one or more appearance vectors are corrupted or dropped (due to network congestion, etc.) and are not received by the receiving client device 604, additional frames may be reconstructed for the lost appearance vectors. In addition, one or more of the appearance vectors and appearance vector adjustments may be compressed by the sending client device 603 and decompressed by the one or more receiving client devices 604 using conventional techniques.
外观矢量(或外观矢量调整)由发送客户端设备603经由一个或更多个网络504发送到一个或更多个接收客户端设备604。发送客户端设备603还可以将复制数据615经由一个或更多个网络504发送到一个或更多个接收客户端设备604。在一个实施例中,复制数据615以安全的方式被存储在耦合到提供商环境506(如图6A所示)的存储器中和/或在发送客户端设备603中。在一个实施例中,人605可以选择一个或更多个单独的用户属性612,其被发送到一个或更多个接收客户端设备604以进行重建。The appearance vector (or appearance vector adjustment) is sent by the sending client device 603 to the one or more receiving client devices 604 via the one or more networks 504. The sending client device 603 may also send the replica data 615 to the one or more receiving client devices 604 via the one or more networks 504. In one embodiment, the replica data 615 is stored in a secure manner in a memory coupled to the provider environment 506 (as shown in FIG. 6A) and/or in the sending client device 603. In one embodiment, the person 605 may select one or more individual user attributes 612 that are sent to the one or more receiving client devices 604 for reconstruction.
在一个实施例中,通过训练合成神经网络来生成用于特定对象(诸如人605)的复制数据615。特定对象可以是真实的或合成的角色,包括人类和/或计算机生成的化身,例如人,动物,生物等。训练数据可以包括包含该对象的渲染的或捕获的视频的帧。可以使用特定对象面部的视频帧而不是使用许多不同人的面部图像来训练合成神经网络。在一个实施例中,训练可以从已经在许多不同的人的面部上进行训练的预先训练的合成神经网络开始,然后对特定对象的面部的视频帧进行微调。In one embodiment, replicated data 615 for a particular object, such as a person 605, is generated by training a synthetic neural network. The particular object may be a real or synthetic character, including a human and/or a computer-generated avatar, such as a person, animal, creature, etc. The training data may include frames of a rendered or captured video containing the object. The synthetic neural network may be trained using video frames of the face of the particular object rather than using facial images of many different people. In one embodiment, training may start with a pre-trained synthetic neural network that has been trained on the faces of many different people and then fine-tuned on the video frames of the face of the particular object.
当执行附加训练时,可以预先生成复制数据615,或者可以连续地生成和/或更新复制数据615,或者可以周期性地更新复制数据615。在一个实施例中,发送客户端设备603可以使用图2D所示的生成对抗网络270配置来连续训练合成神经网络,以细化与人605相关联的复制数据615。在一个实施例中,除了编码器616之外,发送客户端设备603还包括解码器622;可以将人605的捕获图像与发送客户端设备603内的解码器622从外观矢量生成的重建图像进行比较。然后更改复制数据615,以减少在发送客户端设备内生成的重建图像与捕获的图像之间的差异。在另一个实施例中,当捕获图像也可用时,提供商环境506改变复制数据615,并且提供商环境506实现图2D中所示的GAN 270训练框架以执行连续或周期性训练。When additional training is performed, the replicated data 615 may be pre-generated, or the replicated data 615 may be continuously generated and/or updated, or the replicated data 615 may be periodically updated. In one embodiment, the sending client device 603 may continuously train the synthetic neural network using the generative adversarial network 270 configuration shown in FIG. 2D to refine the replicated data 615 associated with the person 605. In one embodiment, in addition to the encoder 616, the sending client device 603 also includes a decoder 622; the captured image of the person 605 may be compared to the reconstructed image generated from the appearance vector by the decoder 622 within the sending client device 603. The replicated data 615 is then altered to reduce the difference between the reconstructed image generated within the sending client device and the captured image. In another embodiment, the provider environment 506 alters the replicated data 615 when the captured image is also available, and the provider environment 506 implements the GAN 270 training framework shown in FIG. 2D to perform continuous or periodic training.
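作为示例而非限制,以下草图(假设性的;以简单的重建损失代替图2D的对抗性目标)示出了可以如何更改复制数据615(即合成网络的权重),以减少重建图像与捕获图像之间的差异。By way of example and not limitation, the following sketch (hypothetical; a simple reconstruction loss stands in for the adversarial objective of FIG. 2D) shows how the replica data 615 (i.e., the synthesis network weights) might be altered to reduce the difference between the reconstructed and captured images:

    import torch
    import torch.nn.functional as F

    def refine_replica_data(encoder, synthesis, optimizer, captured):
        with torch.no_grad():
            w = encoder(captured)     # appearance vector for the captured frame
        reconstructed = synthesis(w)
        loss = F.l1_loss(reconstructed, captured)
        optimizer.zero_grad()         # optimizer holds the synthesis weights
        loss.backward()
        optimizer.step()              # the updated weights become the replica data
        return loss.item()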
接收客户端设备604可以经由通信接口621接收外观矢量和复制数据615,并且解码器622可以根据复制数据615来重建由外观矢量编码的图像。接收客户端设备604然后可以经由显示器624显示重建图像。在一个实施例中,一个或更多个接收客户端设备604还接收一个或更多个用户属性612,该用户属性612也影响重建图像。解码器622可至少包括合成神经网络140,被训练以产生复制数据615的合成神经网络的实例,或者另一合成神经网络。The receiving client device 604 may receive the appearance vector and the replica data 615 via the communication interface 621, and the decoder 622 may reconstruct the image encoded by the appearance vector based on the replica data 615. The receiving client device 604 may then display the reconstructed image via a display 624. In one embodiment, the one or more receiving client devices 604 also receive one or more user attributes 612, which also affect the reconstructed image. The decoder 622 may include at least the synthesis neural network 140, an instance of a synthesis neural network trained to produce the replica data 615, or another synthesis neural network.
在一个实施例中,合成神经网络140从512个16位浮点数的潜码重建高度逼真的1024×1024像素图像。在另一个实施例中,潜码包括少于512个浮点或整数格式的值。潜码由发送客户端设备603发送到接收客户端设备604,并用于合成人605的视频流。以30或60FPS为每个生成帧发送8Kb的潜码表示240或480Kbps,这只是百万像素视频流通常所需带宽的一小部分。In one embodiment, the synthesis neural network 140 reconstructs a highly realistic 1024×1024 pixel image from a latent code of 512 16-bit floating-point numbers. In another embodiment, the latent code includes fewer than 512 values in floating-point or integer formats. The latent code is sent by the sending client device 603 to the receiving client device 604 and used to synthesize a video stream of the person 605. Sending an 8 Kb latent code for each generated frame at 30 or 60 FPS amounts to 240 or 480 Kbps, a small fraction of the bandwidth typically required for a megapixel video stream.
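上述带宽数字可以如下验算(此处1Kb取1024比特)。The bandwidth figures above can be checked as follows (taking 1 Kb as 1,024 bits):

    bits_per_frame = 512 * 16  # 8,192 bits = 8 Kb per latent code
    for fps in (30, 60):
        print(fps, "FPS ->", bits_per_frame * fps / 1024, "Kbps")  # 240.0, 480.0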
图6B示出了用于实现本公开的一些实施例的各种外观矢量。输入数据可以是图像606,音频607,注视方向,注视位置以及由输入设备560捕获的其他类型的数据。编码器616处理输入数据以生成外观矢量。外观矢量对输入数据的属性进行编码,并且是输入数据的压缩表示。在一个实施例中,外观矢量包括抽象潜码610(例如,中间潜码),一组(面部)界标点611,一组FACS系数613或表示学习特征嵌入空间中的面部外观的矢量中的至少一个。6B illustrates various appearance vectors for implementing some embodiments of the present disclosure. The input data may be an image 606, audio 607, gaze direction, gaze position, and other types of data captured by the input device 560. The encoder 616 processes the input data to generate an appearance vector. The appearance vector encodes the attributes of the input data and is a compressed representation of the input data. In one embodiment, the appearance vector includes at least one of an abstract latent code 610 (e.g., an intermediate latent code), a set of (facial) landmark points 611, a set of FACS coefficients 613, or a vector representing the facial appearance in the learning feature embedding space.
图6C示出了根据一个实施例的用于GAN辅助视频压缩的方法650的流程图。本文所述的方法650的每个方框包括可以使用硬件,固件,和/或软件的任意组合执行的计算过程。例如,各种功能可以由执行存储在存储器中的指令的处理器来执行。方法650也可以体现为存储在计算机存储介质上的计算机可用指令。该方法可以由独立应用程序,服务或托管服务(独立或与另一托管服务组合)或另一产品的插件(仅举几例)提供。另外,通过示例的方式,参照图6A的系统描述了方法650。但是,该方法可以附加地或替代地由任何一个系统或系统的任何组合来执行,包括但不限于本文所述的那些。此外,本领域普通技术人员将理解,执行方法650的任何系统在本公开的实施例的范围和精神内。FIG6C shows a flowchart of a method 650 for GAN-assisted video compression according to one embodiment. Each box of the method 650 described herein includes a computational process that can be performed using any combination of hardware, firmware, and/or software. For example, various functions can be performed by a processor executing instructions stored in a memory. Method 650 can also be embodied as computer-usable instructions stored on a computer storage medium. The method can be provided by a standalone application, a service or a hosted service (standalone or in combination with another hosted service), or a plug-in for another product (to name a few). In addition, by way of example, method 650 is described with reference to the system of FIG6A. However, the method may be performed in addition or alternatively by any one system or any combination of systems, including but not limited to those described herein. In addition, a person of ordinary skill in the art will understand that any system that performs method 650 is within the scope and spirit of the embodiments of the present disclosure.
在步骤655处,传输特定于第一对象的复制数据615,用于配置远程合成神经网络以基于复制数据615重建包括特征在内的面部图像。在一个实施例中,远程合成神经网络在客户端设备(诸如一个或更多个接收客户端设备604)内。在一个实施例中,远程合成神经网络包括解码器622。At step 655, the replicated data 615 specific to the first object is transmitted for configuring the remote synthetic neural network to reconstruct a facial image including the features based on the replicated data 615. In one embodiment, the remote synthetic neural network is within a client device, such as one or more receiving client devices 604. In one embodiment, the remote synthetic neural network includes a decoder 622.
在步骤660,生成器神经网络处理第一对象或第二对象的捕获图像,以生成编码第一对象或第二对象的面部的外观矢量。第一和第二对象可以分别是真实的或合成的角色,包括人类和/或计算机生成的化身,例如人,动物,生物等。在一个实施例中,生成器神经网络包括映射神经网络,例如映射神经网络110。在一个实施例中,生成器神经网络包括映射神经网络和合成神经网络。在一个实施例中,抽象潜码由合成神经网络处理以产生第一对象的面部的预测图像。在一个实施例中,训练生成器神经网络以产生第一对象的预测图像,其与第一对象的捕获图像进行比较以学习复制数据615。在一个实施例中,将预测图像与捕获图像进行比较,并且生成器神经网络的参数会更新,以减少预测图像和捕获图像之间的差异。在一个实施例中,抽象潜码基于差异而被增量地更新,并且由合成神经网络进行处理以预测面部的后续图像。In step 660, a generator neural network processes a captured image of a first object or a second object to generate an appearance vector encoding a face of the first object or the second object. The first and second objects may be real or synthetic characters, including humans and/or computer-generated avatars, such as people, animals, creatures, etc., respectively. In one embodiment, the generator neural network includes a mapping neural network, such as mapping neural network 110. In one embodiment, the generator neural network includes a mapping neural network and a synthesis neural network. In one embodiment, the abstract latent code is processed by the synthesis neural network to produce a predicted image of the face of the first object. In one embodiment, the generator neural network is trained to produce a predicted image of the first object, which is compared to the captured image of the first object to learn to replicate the data 615. In one embodiment, the predicted image is compared to the captured image, and the parameters of the generator neural network are updated to reduce the difference between the predicted image and the captured image. In one embodiment, the abstract latent code is incrementally updated based on the difference and processed by the synthesis neural network to predict subsequent images of the face.
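作为示例而非限制,以下草图(假设性的;synthesis是合成神经网络的占位符)示出了基于预测图像与捕获图像之间的差异来增量更新抽象潜码的一种方式。By way of example and not limitation, the following sketch (hypothetical; synthesis is a placeholder for the synthesis neural network) shows one way the abstract latent code might be incrementally updated based on the difference between the predicted and captured images:

    import torch
    import torch.nn.functional as F

    def update_latent(synthesis, w, captured, step_size=0.01):
        w = w.clone().requires_grad_(True)
        predicted = synthesis(w)               # predicted image of the face
        loss = F.l1_loss(predicted, captured)  # difference from the captured image
        loss.backward()
        with torch.no_grad():
            w -= step_size * w.grad            # incremental update of the latent code
        return w.detach()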
In one embodiment, the attributes include head pose and facial expression. In one embodiment, the appearance vector is a compressed encoding of the face. In one embodiment, the appearance vector comprises an abstract latent code, such as an intermediate latent code and/or one or more style signals. In one embodiment, the appearance vector encodes at least one additional attribute associated with clothing, hairstyle, or lighting. The at least one additional attribute may be derived from the captured image or from a sensor. In one embodiment, the appearance vector comprises a latent code of 512 16-bit floating-point numbers.
In one embodiment, the appearance vector further includes attributes of an additional portion of the first subject or the second subject, and the remote synthesis neural network is further configured to reconstruct the image to include the additional portion. The additional portion may include at least one of a shoulder, a neck, an arm, or a hand.
In one embodiment, facial landmark points are detected in the captured image and used to generate the appearance vector. In one embodiment, the abstract latent code is computed by transforming the facial landmark points according to a learned or optimized matrix. In one embodiment, the optimized matrix is implemented by a neural network that learns the optimized matrix. In one embodiment, the abstract latent code is compared with a predicted abstract latent code computed by transforming the facial landmark points according to the learned or optimized matrix, and the matrix is updated to reduce differences between the abstract latent code and the predicted abstract latent code. In one embodiment, the abstract latent code is processed by the synthesis neural network to produce a predicted image of the face of the first subject, the predicted image is compared with the captured image, and the matrix is updated to reduce differences between the predicted image and the captured image. In one embodiment, the generator neural network includes a mapping neural network, and the facial landmark points detected in the captured image are input to the mapping neural network to compute the abstract latent code. In one embodiment, a predicted image paired with the facial landmark points is compared with the captured image paired with the facial landmark points to produce differences, and parameters of the generator neural network are updated to reduce the differences.
At step 670, the appearance vector is transmitted to the remote synthesis neural network. In one embodiment, the appearance vector is transmitted to the remote synthesis neural network during a video conferencing session. In one embodiment, the captured image is a frame of a video, and the generator neural network is configured to generate an appearance vector adjustment value for each additional frame of the video corresponding to an additional captured image.
FIG. 6D shows a flowchart of a method 675 for GAN-assisted video reconstruction, according to one embodiment. At step 680, replica data 615 specific to a subject is obtained to configure a synthesis neural network. In one embodiment, the decoder 622 comprises the synthesis neural network. The subject may be real or synthetic. In one embodiment, the replica data 615 comprises weights learned during training of the synthesis neural network.
At step 685, an appearance vector is received that encodes attributes of a human face captured in a frame of video. In one embodiment, the face is the face of the person 605. In another embodiment, the face is not the face of the person 605. For example, the replica data 615 may represent the characteristics of a subject such as an elf avatar. During the video conferencing session, the appearance vector is processed by the decoder 622 to reconstruct an image of the face of the person 605 that includes the characteristics defined by the replica data 615. In other words, a viewer of the reconstructed images at the display 624 sees an elf avatar whose expression and pose match those of the person 605.
In one embodiment, the appearance vector is a compressed encoding of the face of the person 605. In one embodiment, an appearance vector adjustment value is received for each additional frame of the video. In one embodiment, each appearance vector adjustment is successively applied to the appearance vector to reconstruct an additional image of the face, including the characteristics, for each additional frame of the video.
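A sketch of how the per-frame adjustment values might be applied at the receiver follows, assuming purely for illustration that each adjustment is an additive delta on the appearance vector; the embodiments do not fix a particular composition rule, and `decoder` stands in for the synthesis neural network.

```python
import numpy as np

def reconstruct_stream(decoder, base_vector, adjustments):
    """Apply each appearance vector adjustment successively to the running
    appearance vector and decode one reconstructed frame per adjustment."""
    vector = np.asarray(base_vector, dtype=np.float32)
    for delta in adjustments:
        vector = vector + np.asarray(delta, dtype=np.float32)
        yield decoder(vector)  # synthesis network reconstructs the face
```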
At step 690, the synthesis neural network processes the appearance vector to reconstruct an image of the face that includes the characteristics defined by the replica data. In one embodiment, the reconstructed image of the face is displayed in a viewing environment, where the synthesis neural network reconstructs the image according to lighting in the viewing environment. For example, instead of using lighting attributes encoded in the appearance vector and/or the replica data 615, the synthesis neural network reconstructs the image based on lighting in the environment where the reconstructed image is displayed. In one embodiment, the lighting or other information may be provided by a sensor.
FIG. 7A is a conceptual diagram of a synthesis neural network training configuration including a projector 700, for use in implementing some embodiments of the present disclosure. In one embodiment, an encoder 710 and a synthesis neural network 715 are trained with a GAN objective, using the generative adversarial network 270 configuration shown in FIG. 2D, to generate appearance vectors for images and reconstructed images, respectively. The synthesis neural network 715 and the encoder 710 may then be trained jointly using the configuration shown in FIG. 7A. The trained encoder 710 and synthesis neural network 715 may be deployed as the encoder 616 and the decoder 622, respectively, for use during real-time video conferencing.
In one embodiment, the projector 700 is implemented within the sending client device 603 and is used to improve the quality of the appearance vectors generated by the sending client device 603. The projector 700 mimics the operation of the one or more receiving client devices 604 by including the synthesis neural network 715. The decoder 622 may be an instance of the synthesis neural network 715. As previously described, the encoder 616 may perform projection to map each captured image of the person 605 to produce an appearance vector. The synthesis neural network 715 reconstructs a predicted image 720 from the appearance vector.
A training loss unit 725 compares the predicted image 720 with the corresponding captured image 705 to identify differences between the predicted image 720 and the captured image 705. In one embodiment, the training loss unit 725 may use the learned perceptual image patch similarity (LPIPS) technique to identify differences in pixel values between corresponding image patches. The training loss unit 725 updates parameters (e.g., weights) of the encoder 710 to reduce differences between the predicted image 720 and the captured image 705. In one embodiment, the training loss unit 725 is configured to update parameters used by the synthesis neural network 715, thereby updating the replica data 615 for the subject in the captured images.
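One projector training step might look like the following sketch, using the publicly available `lpips` package for the perceptual loss and PyTorch for optimization; the names `encoder` and `synthesis` are placeholders for the encoder 710 and the synthesis neural network 715, and the images are assumed to be normalized tensors.

```python
import torch
import lpips  # learned perceptual image patch similarity (pip install lpips)

perceptual = lpips.LPIPS(net='vgg')  # compares corresponding image patches

def projector_step(encoder, synthesis, captured, optimizer):
    """One training-loss update: encode the captured frame, reconstruct the
    predicted image, and update the encoder to shrink the LPIPS difference."""
    appearance = encoder(captured)     # appearance vector for this frame
    predicted = synthesis(appearance)  # predicted image 720
    loss = perceptual(predicted, captured).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()  # e.g., torch.optim.Adam(encoder.parameters())
    return loss.item()
```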
During inference, while the decoder 622 in the one or more receiving client devices 604 reconstructs images based on the appearance vectors received from the sending client device 603, the sending client device 603 may continue to operate the projector 700 to continuously improve the performance of the encoder 710 and/or the synthesis neural network 715. In this manner, the encoder 710 and/or the synthesis neural network 715 may be "trained" during inference using more captured images having a greater variety of attributes. Continuing to update only the parameters of the synthesis neural network 715 within the encoder 710 does not affect the quality of the appearance vectors. The sending client device 603 may be configured to update the replica data 615 and/or parameters of the decoder 622 as the performance of the synthesis neural network 715 improves, so that the performance of the decoder 622 may improve as well.
Rather than performing a projection to map each captured image 705 to produce an appearance vector, the encoder 710 initially performs a projection to produce a first appearance vector for a first captured image. After the first captured image, the encoder 710 may use the first appearance vector as an input from which a second appearance vector is produced. When the projection operation is computationally expensive, predicting an appearance vector from the previous appearance vector can be computationally efficient, enabling real-time performance. Because adjacent video frames are typically very similar, performing incremental projection using the previous appearance vector can improve performance in terms of both computation speed and image quality. Instead of starting from an arbitrary point in the latent space, the projection operation starts from the latent vector that the projection algorithm produced for the previous frame. The encoder 710 effectively performs a local search rather than a global search to produce each subsequent appearance vector. Incremental projection may also produce a resulting video that is more temporally coherent, reducing flicker or frame-to-frame distortion caused by different choices made by a global search, and thereby avoiding a different, nearly equivalent point in the latent space being produced for each frame.
In one embodiment, the projector 700 generates correction data for each appearance vector, where the correction data is computed based on a comparison between the predicted image 720 and the captured image 705. The predicted image 720 may be used as an additional macroblock prediction scheme available to a conventional encoder, such as an H.265 High Efficiency Video Coding (HEVC) format encoder. In one embodiment, when a receiving client device 604 only supports conventionally encoded video data, the sending client device 603 may generate conventionally encoded video data that can be decoded by the receiving client device 604.
In one embodiment, when a significant change occurs in a captured image compared with the previously captured image, a projection operation may be performed to map the captured image to another appearance vector, after which incremental projection resumes. A confidence metric may be computed to determine when a significant change has occurred. For example, the confidence metric may indicate the number of pixels that have changed in the captured image compared with the previously captured image. When the confidence metric exceeds a threshold, the projection operation may be initiated. In another embodiment, the confidence metric may be evaluated for the predicted image, or by comparing the corresponding predicted and captured images. In one embodiment, a receiving client device 604 (e.g., the remote synthesis neural network) may request an appearance vector generated via the projection operation. The client device 604 may initiate the request based on evaluating a confidence metric computed for the reconstructed images.
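The confidence-metric dispatch just described can be pictured as follows; the pixel tolerance and the threshold are illustrative assumptions, and `project`/`refine` stand for the full (global) and incremental (local) projection routines.

```python
import numpy as np

def changed_fraction(frame, prev_frame, tol=8):
    """Confidence metric: fraction of pixels that changed by more than
    `tol` intensity levels relative to the previously captured frame."""
    diff = np.abs(frame.astype(np.int16) - prev_frame.astype(np.int16))
    return float((diff > tol).mean())

def next_appearance(frame, prev_frame, prev_vector, project, refine,
                    threshold=0.25):
    """Run a full projection when the scene changes significantly;
    otherwise resume incremental projection from the previous vector."""
    if prev_vector is None or changed_fraction(frame, prev_frame) > threshold:
        return project(frame)           # global search in the latent space
    return refine(prev_vector, frame)   # local search from the previous vector
```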
FIG. 7B is a conceptual diagram of an end-to-end system 730 including the projector 700 of FIG. 7A, for use in implementing some embodiments of the present disclosure. The system 730 includes at least the sending client device 603 and a receiving client device 604. The encoder 710 is within the sending client device 603 and generates appearance vectors that are transmitted over one or more networks 504 to a decoder 722 within the receiving client device 604. The decoder 722 may be an instance of the synthesis neural network 715. The decoder 722 processes the appearance vector according to the replica data to produce a reconstructed image 712. The reconstructed image 712 may then be displayed to a viewer at the display 624. The replica data may be selected by the person in the captured image 705 or by the viewer.
FIG. 7C is a conceptual diagram of a configuration for generating training data, for use in implementing some embodiments of the present disclosure. An alternative to using projection to generate the latent codes is to use a matrix or a neural network to transform facial landmark points into latent codes. When facial landmark points are used, the parameters of the matrix or the weights of the neural network are learned using landmark training data. The facial landmark points may be extracted from the captured images by a landmark detector 735 to produce the landmark training data. The landmark detector 735 may be implemented using conventional computer vision techniques or neural analysis techniques. Facial landmark points describe the locations of key points of the face (the edges of the eyelids and lips, the centers of the pupils, the bridge of the nose, etc.) and capture the important motions and deformations of the face. The landmark detector 735 may be used to detect other types of landmarks, including facial landmarks that are not limited to image space. For example, a landmark may be a set of coefficients related to the Facial Action Coding System (FACS), other attributes of facial appearance, or a vector representing facial appearance in a learned feature embedding space. FACS defines a set of facial muscle movements that correspond to displayed emotions.
The extracted facial landmark points may be used as the appearance vector, used to generate the appearance vector, or provided separately from the appearance vector. In general, different training datasets may be used to generate replica data for different subjects (real and synthetic). Furthermore, different training datasets may be used to generate different replica data for the same subject, with attributes that differ per day or per session, such as the person's clothing and hairstyle, and variations specific to each set of replica data due to makeup, lighting, and the like.
FIG. 7D is a conceptual diagram of a training configuration for predicting appearance vectors using facial landmark points, for use in implementing some embodiments of the present disclosure. In one embodiment, linear regression is used to learn or optimize a matrix that transforms a vector of facial landmark points into an appearance vector (e.g., a latent code vector). Compared with projecting captured images to produce appearance vectors, using facial landmarks to produce appearance vectors may be more resilient to variations between the training images and the live captured images (different hairstyles, different clothing, etc.).
The landmark training data is transformed by a regression matrix 740 into the latent space associated with the synthesis neural network or the decoder 722 within the receiving client device 604. Specifically, the facial landmark points of each training image are transformed according to the regression matrix, or by a neural network, to produce a predicted appearance vector. A training loss unit 745 compares the projected appearance vector generated by the trained encoder 710 with the predicted appearance vector and updates the parameters of the regression matrix to reduce differences between the projected appearance vector and the predicted appearance vector.
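The linear-regression fit can be sketched as an ordinary least-squares solve; the array shapes below (68 two-dimensional landmarks, 512-dimensional latents) are illustrative assumptions rather than values fixed by the embodiments.

```python
import numpy as np

def fit_regression_matrix(landmarks, latents):
    """Least-squares fit of a matrix (with bias) mapping flattened facial
    landmark vectors to projected appearance vectors.

    landmarks: (N, L) array, e.g., L = 68 points * 2 coordinates
    latents:   (N, D) array of projected appearance vectors, e.g., D = 512
    """
    X = np.hstack([landmarks, np.ones((landmarks.shape[0], 1))])  # bias column
    W, *_ = np.linalg.lstsq(X, latents, rcond=None)               # (L+1, D)
    return W

def predict_appearance(W, landmark_vector):
    """Transform one landmark vector into a predicted appearance vector."""
    return np.append(landmark_vector, 1.0) @ W
```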
FIG. 7E is a conceptual diagram of an end-to-end system 750 for use in implementing some embodiments of the present disclosure. The system 750 includes at least the sending client device 603 and a receiving client device 604. The landmark detector 735 is within the sending client device 603 and generates appearance vectors that are transmitted over the one or more networks 504 to the decoder 722 within the receiving client device 604. The decoder 722 may be an instance of the synthesis neural network 715. The decoder 722 processes the appearance vector according to the replica data to produce a reconstructed image 712. The reconstructed image 712 may then be displayed to a viewer at the display 624. The replica data may be selected by the person in the captured image 705 or by the viewer.
In one embodiment, instead of transmitting the appearance vectors to the decoder 722, the sending client device 603 may instead transmit the detected landmarks for each captured image. In such an embodiment, the regression matrix 740 is included within the receiving client device 604 and processes the detected landmarks to produce the appearance vectors within the receiving client device 604. The parameters used by the regression matrix 740 that were learned during training, along with the replica data, may be provided to the receiving client device 604.
Conventional compression techniques may be applied to the appearance vectors, for example by quantizing and delta-encoding the coordinates of the facial landmarks. In one embodiment, when the detected landmarks are used to generate the appearance vectors, the replica data may also be used to control characteristics of the reconstructed image relative to the characteristics of the person in the captured image. Attributes of the reconstructed human subject, such as hairstyle, clothing, and/or lighting, may be provided with the appearance vector or the replica data (e.g., as filters).
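As a hedged illustration of the conventional compression mentioned above, landmark coordinates can be quantized to a coarse grid and delta-encoded so that only small integer differences are transmitted per frame; the quantization step size is an assumption.

```python
import numpy as np

def delta_encode(landmark_frames, step=4):
    """Quantize landmark coordinates to a grid of `step` pixels; transmit the
    first frame absolute, followed by per-frame integer deltas."""
    q = [np.round(f / step).astype(np.int16) for f in landmark_frames]
    return [q[0]] + [curr - prev for prev, curr in zip(q, q[1:])]

def delta_decode(stream, step=4):
    """Accumulate the deltas and dequantize back to pixel coordinates."""
    frames, acc = [], np.zeros_like(stream[0])
    for packet in stream:
        acc = acc + packet
        frames.append(acc.astype(np.float32) * step)
    return frames
```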
Because a set of landmark training data may project or transform equally well to many different predicted appearance vectors, the regression matrix learned by the regression matrix 740 during training may have a large "null space" in the algebraic sense. In other words, there may be many regions of the high-dimensional latent space from which reconstructed images that map well to the landmark training data can be generated. However, the reconstructed images may sometimes fail to match the captured images, causing temporal artifacts. For example, in the animation seen by the recipient, the temporal artifacts may appear as a subtle but noticeable shimmering or an odd wobbling distortion. The quality of the reconstructed images may be improved by learning the regression matrix in a way that directly improves the mapping of the facial landmarks to the latent space.
FIG. 8A is a conceptual diagram of an end-to-end system training configuration 800 for use in implementing some embodiments of the present disclosure. The synthesis neural network 715 may first be trained with a GAN objective, using the generative adversarial network 270 configuration shown in FIG. 2D, to generate appearance vectors for images and reconstructed images, respectively. In the configuration 800, the regression matrix 740 is then trained together with the synthesis neural network 715 to predict appearance vectors for images and to reconstruct the images, respectively. The configuration 800 may be used to perform end-to-end regression for transforming facial landmarks into appearance vectors. The trained regression matrix 740 and synthesis neural network 715 may be deployed as the encoder 710 and the decoder 722, respectively, for use during real-time video conferencing.
A training loss unit 825 compares the reconstructed images with the image training data to identify differences between the reconstructed images and the image training data. In one embodiment, the training loss unit 825 may use the LPIPS technique to identify the differences. The training loss unit 825 updates the parameters of the regression matrix 740 to reduce differences between the reconstructed images and the image training data. In one embodiment, the training loss unit 825 is configured to update parameters used by the synthesis neural network 715, thereby updating the replica data for the subject in the captured images.
FIG. 8B is a conceptual diagram of an end-to-end system training configuration 850 for use in implementing some embodiments of the present disclosure. The configuration 850 may be used to jointly train an encoder 716, which transforms facial landmarks into appearance vectors, and the synthesis neural network 715 using a conditional GAN objective. The generative adversarial network 270 configuration shown in FIG. 2D may be used with the encoder 716 and the synthesis neural network 715 to predict appearance vectors for images and to reconstruct the images, respectively. A discriminator neural network 875 determines whether a reconstructed image, paired with the landmark training data that is also input to the encoder, appears similar to image training data paired with correct landmarks (landmarks that correctly match the image training data). Based on this determination, a training loss unit 835 adjusts parameters of the discriminator 875, the synthesis neural network 715, and/or the encoder 716. Once the synthesis neural network 715 has been trained with the conditional GAN objective, the encoder 716 and/or the synthesis neural network 715 may be used during real-time video conferencing. The trained encoder 716 and synthesis neural network 715 may be deployed as the encoder 616 and the decoder 622, respectively.
In one embodiment, the synthesis neural network 715 is configured to produce a foreground portion of each reconstructed image that is separate from a background portion. The foreground portion includes at least the face portion of the reconstructed image and may also include the shoulders and other portions of the person that appear in the captured images. The synthesis neural network 715 may also be configured to generate an alpha mask (channel) or matte indicating the separate foreground and background portions. In one embodiment, the one or more receiving client devices 604 may composite the foreground portion of the reconstructed head-and-face images onto an arbitrary background, or modify or remove the background portion entirely.
During training, the foreground portion of each reconstructed image and the alpha mask may be randomly shifted relative to the background portion before they are composited to produce the reconstructed image that the discriminator neural network 875 receives and assesses for realism. The effect is to train the synthesis neural network 715 to generate high-quality alpha masks. Relative displacement (via the shifting) is a simple and robust technique whereby the discriminator neural network 875 assigns high realism scores after the randomly displaced portions have been composited. The relative displacement may also strengthen the ability of the encoder 716 and/or the synthesis neural network 715 (e.g., the generator neural network) to disentangle the background, pose, and texture attributes of the reconstructed images, improving the effectiveness of the generator neural network. Similar techniques may be used to encourage the synthesis neural network 715 to segment and composite other aspects of the person 605 captured in the images, such as clothing, or the hands and arms used in gesturing.
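The random relative displacement can be sketched as follows; `torch.roll` (wrap-around shifting) stands in for a padded translation purely for brevity, and the maximum shift is an illustrative assumption.

```python
import torch

def composite_with_random_shift(foreground, alpha, background, max_shift=16):
    """Randomly translate the foreground and its alpha mask relative to the
    background before alpha-compositing the image that is shown to the
    discriminator neural network."""
    dx = int(torch.randint(-max_shift, max_shift + 1, (1,)))
    dy = int(torch.randint(-max_shift, max_shift + 1, (1,)))
    foreground = torch.roll(foreground, shifts=(dy, dx), dims=(-2, -1))
    alpha = torch.roll(alpha, shifts=(dy, dx), dims=(-2, -1))
    return alpha * foreground + (1.0 - alpha) * background
```

Because the discriminator only ever sees the composited result, the generator is pushed to produce a mask that composites cleanly at any offset, which is what yields a high-quality alpha channel.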
In one embodiment, the encoder 616, 710, or 716 is configured to separate at least the face in the captured image from background image data (e.g., the background portion), encode the background image data, and transmit the encoded background image data from the sending client device 603 to be combined with the reconstructed images of the face by the one or more receiving client devices 604. In one embodiment, the background image data is compressed by the sending client device 603 using conventional techniques for transmission to the one or more receiving client devices 604. In one embodiment, the bandwidth needed to transmit the background image data is reduced by operating the synthesis neural network 715 in the sending client device 603 and removing the regions of the background image data that are covered by the foreground portion of the reconstructed image, according to the alpha mask used to reconstruct the image. Portions of the covered regions of the background may be transmitted to the one or more receiving client devices 604 when there is high confidence that the background portion has changed compared with the previously reconstructed image.
In one embodiment, the attention of the discriminator neural network 875 may be focused on semantically critical regions of the face, such as the eyes, mouth, and eyebrows. In one embodiment, the semantically most important regions of each training image are predicted, for example, using hand-coded heuristics or an image saliency network trained on human gaze tracking data. Outside of the semantically important regions in some of the images input to the discriminator neural network 875, the image resolution may be artificially reduced, or the images may be otherwise perturbed. Modifying the regions outside of the semantically important regions may cause the synthesis neural network 715 to allocate additional capacity to the regions of the reconstructed images that will matter most to human viewers.
In one embodiment, audio data is incorporated into the generator neural network training configuration 800 or 850, either directly as a waveform, spectrogram, or similar low-level representation of the audio, or encoded as a higher-level representation such as phonemes. When the audio data is phonemes, the phonemes may be detected in a manner similar to the facial landmarks. In one embodiment, the discriminator neural network 875 learns to judge the realism of a facial image in the context of the sound, phoneme, or utterance the face is supposed to be producing. The synthesis neural network 715, in turn, learns to produce faces that correspond well to the input audio data. In other words, the one or more receiving client devices receive audio data that may be used by the decoder 622 to reconstruct the facial images.
In one embodiment, the synthesis neural network 715 is augmented with memory for processing the audio data. For example, the synthesis neural network 715 may be implemented using recurrent neural networks (RNNs), long short-term memory (LSTM) networks, and transformer attention networks. Incorporating audio processing capability may be used to improve the reconstructed images, including in situations where packet loss or degraded network quality of service disrupts the video stream but preserves the audio data.
In one embodiment, the reconstructed images are used to improve the quality of reconstructed audio data. The extremely bandwidth-efficient stream of appearance vectors may encode useful "lip-reading"-style information (e.g., the shapes of the mouth, tongue, and cheeks, the exact instants at which the lips close and part, etc.) for improving a poor audio stream. The generator neural network may be trained to produce an improved, denoised, source-separated, or spatialized audio stream at the one or more receiving client devices 604.
The ability to control aspects of the facial appearance provides opportunities to control attributes of the reconstruction, such as controlling the gaze direction of the reconstructed face based on the gaze of the viewer. A common problem with video conferencing is the lack of apparent eye contact from the viewer's perspective. Because the camera is rarely positioned near the eyes of the person whose video images are captured, conversants in a video conference rarely feel as if they are making eye contact. Eye contact is an important social cue, and its absence is believed to be a reason why people prefer in-person meetings over video conferencing. Likewise, in multi-person video conferences, people cannot tell who is looking at whom, particularly because the layout of the video windows on each participant's screen may differ. Prior work has explored re-rendering the eyes to create a sense of eye contact, but perceived eye contact may instead be increased using manipulations of the latent codes, facial landmarks, or other appearance vectors that define the gaze position and/or direction.
For example, the synthesis neural network 715 may slightly alter the brow, or even the head direction, of the subject in the reconstructed images to account for different gaze points or positions. The modifications may be coupled with a training protocol designed to encourage the synthesis neural network 715 to decouple gaze direction from other aspects, such as facial identity. Because reconstruction in the video conferencing system occurs at the one or more receiving client devices 604, the video conferencing system can exploit local knowledge about the layout of the participants' videos, and a camera or sensor of the one or more receiving client devices 604 can provide the viewer's gaze position. The gaze position is the location on the display intersected by the viewer's gaze direction. The gaze of a subject in the reconstructed images displayed to the viewer may be modified by the decoder so that the subject appears to be looking at the location where another reconstructed image is displayed, or at the viewer's gaze position. The gaze position may be located at the reconstructed images of the subject who is speaking. The concept of manipulating gaze and attention applies broadly beyond video conferencing, for example to telepresence avatars, whose spatial relationships may appear different to different participants in a telepresence system.
In one embodiment, the reconstructed images of the face are displayed in a viewing environment, where the synthesis neural network 715 reconstructs the images according to a gaze position of a viewer that is captured in the viewing environment. In one embodiment, the gaze direction of the face in the images is toward the gaze position. In one embodiment, the appearance vector includes a gaze position corresponding to a second image at which the face is looking, and the gaze direction of the reconstructed images of the face in the viewing environment is toward the second image, which is also reconstructed and displayed in the viewing environment.
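One way to picture the gaze manipulation is as an edit along latent directions associated with horizontal and vertical gaze. The directions themselves are assumed to have been identified during training (the embodiments do not specify how), and the bounded offsets below are illustrative only.

```python
import numpy as np

def redirect_gaze(appearance, gaze_dirs, gaze_xy, strength=1.0):
    """Nudge the appearance vector so the reconstructed face appears to look
    toward a target on screen (e.g., the current speaker's window).

    gaze_dirs: (2, D) latent directions for horizontal and vertical gaze
    gaze_xy:   desired gaze offset in [-1, 1]^2 toward the target position
    """
    offset = strength * (gaze_xy[0] * gaze_dirs[0] + gaze_xy[1] * gaze_dirs[1])
    return np.asarray(appearance) + offset
```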
The ability to modify the appearance of a face during image reconstruction enables the lighting to be changed. The lighting may be changed based on the replica data or on environment data from sensors in the viewing environment. Matching the lighting of the viewing environment may make the illusion of the broadcast subject's presence more convincing. When eye tracking data for the viewer is available, the decoder 622 or 722 may also apply motion parallax. In general, life-size 3D content that respects the viewer's motion parallax and mimics the lighting of the environment is inherently more compelling than imagery without these features.
At a higher level, the social efficiency and flow of conversations conducted over video conferencing may be improved by introducing asynchronous interruptions. Interruptions are missing from today's video conferences because it is difficult to catch the speaker's eye or attention when one person wants to interject. Likewise, each interruption is more disruptive to the flow of the conversation because of the unnatural lag between starting to speak and the other party hearing the interruption. A possible solution to this problem is to model the effect of an interruption on the reconstructed images displayed to a first person, beginning when the first person starts to speak, in anticipation of the reaction of a second person viewing the reconstructed images of the first person at a remote client device. The key observation is that a social transaction takes place when the second person attempts to interrupt the first person during a video conferencing session. However, to succeed, the two people's interactions need not be identical. For example, the interrupter may begin speaking and the interruptee's image/dialog will react immediately and naturally, while the interruptee may see/hear the interrupter behaving as if interrupting someone until the interruptee stops speaking. In this example, the interruption "transaction" is functionally completed even though the two parties have different experiences.
In one embodiment, style mixing may be performed by the synthesis neural network 715 or the decoder 622 or 722, using appearance vectors generated from two different captured images, to produce the reconstructed images. As previously described in conjunction with FIGS. 1A, 1B, and 2B, an appearance vector may be transformed into a set of statistical parameters, referred to as styles, that influence the synthesis neural network 715 at different levels of a pyramidal hierarchy. For example, after the synthesis neural network 715 has been trained to generate images of faces, the "coarse styles" that influence the 4×4 and 8×8 resolutions of the synthesis network tend to control high-level aspects of the generated face images, such as pose, gender, and hair length, while the "middle styles" that influence the 16×16 and 32×32 resolutions control the facial features: what makes a given person look distinctive, resemble their parents, and so on.
In one embodiment, style mixing is used to alter the appearance of the reconstructed images in subtle ways. For example, a frame in which the person 605 moves quickly will be captured with motion blur, and an image reconstructed from the resulting appearance vector will faithfully recreate the motion blur. However, by mixing the coarse styles of each frame's corresponding appearance vector with the fine styles of a selected frame that contains no motion blur, the reconstructed images can correctly capture the motion and deformation of the face while preserving the fine details, appearing sharp and free of motion blur. Similar style mixing may be used to produce video in which the subject appears more awake or alert, or is wearing makeup or particular clothing, or holds a specific expression.
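A per-layer style selection for this kind of mixing might look like the following sketch; the 18-layer count follows the common StyleGAN convention, and the crossover index separating coarse from fine layers is an illustrative assumption.

```python
def mix_styles(motion_latent, sharp_latent, num_layers=18, crossover=4):
    """Build a per-layer style list: coarse layers (pose, motion, deformation)
    come from the current, possibly blurred frame, while the remaining layers
    (fine detail) come from a selected frame without motion blur."""
    return [motion_latent if layer < crossover else sharp_latent
            for layer in range(num_layers)]
```

A synthesis network that accepts per-layer styles would then consume this list to render a sharp frame that still follows the blurred frame's motion.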
In one embodiment, style mixing may be performed by the synthesis neural network 715 to sharpen motion-blurred portions of an image by combining fine style control from a sharp image with coarse style control from a blurred image. For example, the appearance vector includes a first portion corresponding to a first frame of the video in which the face is blurred and a second portion corresponding to a second frame of the video in which the face is sharply defined. The processing performed by the synthesis neural network 715 combines the first and second portions, using the first portion to control the coarse styles and the second portion to control the fine styles, to reconstruct an image in which the face is sharply defined. In another example, when the face captured in a frame is blurred, the synthesis neural network 715 reconstructs an image in which the face is sharply defined by using the appearance vector to control the coarse styles and the replica data to control the fine styles.
Training and deploying the generative neural network components to encode and reconstruct images using appearance vectors, replica data, and specific attribute data can provide a more engaging video conferencing experience. The appearance vectors contribute real-time information, such as pose and expression, for the reconstructed video frames, while the replica data contributes the underlying characteristics of the person whose likeness is being broadcast. The replica data (e.g., the weights of the trained synthesis neural network) is determined during training and transmitted to the receiver. The characteristics of the human subject in the images used for training may be applied to the reconstructed video frames even when a different human subject appears in the captured images used to generate the appearance vectors. Attributes of the reconstructed human subject (e.g., hairstyle, clothing, and/or lighting) may be provided with the appearance vectors or the replica data. For example, the gaze of one or more reconstructed human subjects may be controlled based on the viewer's gaze direction or the relative positions of the participants' images displayed during the video conference.
Transmitting low-bandwidth appearance vectors for reconstructing images at the remote client devices reduces the bandwidth needed to provide the performance required for an interactive video conferencing experience. Temporal upsampling may be used to generate additional frames by interpolating between different appearance vectors. Conventional compression techniques may be applied to the appearance vectors and/or the background images. Audio data may be transmitted and used to assist in the reconstruction of the video frames.
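Temporal upsampling by interpolating appearance vectors can be sketched as a simple linear blend between consecutive vectors; linear interpolation is an assumption here, since the embodiments do not prescribe a particular interpolant.

```python
import numpy as np

def interpolate_vectors(v0, v1, factor=2):
    """Produce factor - 1 intermediate appearance vectors between two
    consecutive frames' vectors, for decoding into additional frames."""
    v0, v1 = np.asarray(v0), np.asarray(v1)
    return [(1 - t) * v0 + t * v1
            for t in np.linspace(0.0, 1.0, factor + 1)[1:-1]]
```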
Note that the techniques described herein may be embodied in executable instructions stored in a computer-readable medium for use by or in connection with a processor-based instruction execution machine, system, apparatus, or device. Those skilled in the art will appreciate that, for some embodiments, various types of computer-readable media may be included for storing data. As used herein, a "computer-readable medium" includes one or more of any suitable media for storing the executable instructions of a computer program such that the instruction execution machine, system, apparatus, or device may read (or fetch) the instructions from the computer-readable medium and execute the instructions for carrying out the described embodiments. Suitable storage formats include one or more of an electronic, magnetic, optical, or electromagnetic format. A non-exhaustive list of conventional exemplary computer-readable media includes: a portable computer diskette; random-access memory (RAM); read-only memory (ROM); erasable programmable read-only memory (EPROM); a flash memory device; and optical storage devices, including a portable compact disc (CD), a portable digital video disc (DVD), and the like.
It should be understood that the arrangement of components illustrated in the attached figures is for illustrative purposes and that other arrangements are possible. For example, one or more of the elements described herein may be realized, in whole or in part, as electronic hardware components. Other elements may be implemented in software, hardware, or a combination of software and hardware. Moreover, some or all of these other elements may be combined, some may be omitted altogether, and additional components may be added while still achieving the functionality described herein. Thus, the subject matter described herein may be embodied in many different variations, and all such variations are contemplated to be within the scope of the claims.
To facilitate an understanding of the subject matter described herein, many aspects are described in terms of sequences of actions. Those skilled in the art will recognize that the various actions may be performed by specialized circuits or circuitry, by program instructions being executed by one or more processors, or by a combination of both. The description herein of any sequence of actions is not intended to imply that the specific order described for performing that sequence must be followed. All methods described herein may be performed in any suitable order unless otherwise indicated herein or otherwise clearly contradicted by context.
The use of the terms "a" and "an" and "the" and similar references in the context of describing the subject matter (particularly in the context of the following claims) is to be construed to cover both the singular and the plural, unless otherwise indicated herein or clearly contradicted by context. The term "at least one" followed by a list of one or more items (for example, "at least one of A and B") is to be construed to mean one item selected from the listed items (A or B) or any combination of two or more of the listed items (A and B), unless otherwise indicated herein or clearly contradicted by context. Furthermore, the foregoing description is for the purpose of illustration only, and not for the purpose of limitation, as the scope of protection sought is defined by the claims set forth hereinafter together with any equivalents thereof. The use of any and all examples, or exemplary language (e.g., "such as") provided herein, is intended merely to better illustrate the subject matter and does not pose a limitation on the scope of the subject matter unless otherwise claimed. The use of the term "based on" and other like phrases indicating a condition for bringing about a result, both in the claims and in the written description, is not intended to foreclose any other conditions that bring about that result. No language in the specification should be construed as indicating any non-claimed element as essential to the practice of the invention as claimed.