CN117314733A - Video filling method, device, equipment and storage medium based on diffusion model - Google Patents

Video filling method, device, equipment and storage medium based on diffusion model

Info

Publication number
CN117314733A
CN117314733A
Authority
CN
China
Prior art keywords
video
sequence
diffusion model
video frame
encoder
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311020072.8A
Other languages
Chinese (zh)
Inventor
姜智卓
杨思远
刘瑜
李耀文
李徵
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen International Graduate School of Tsinghua University
Original Assignee
Shenzhen International Graduate School of Tsinghua University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen International Graduate School of Tsinghua University
Priority to CN202311020072.8A
Publication of CN117314733A
Legal status: Pending (current)


Abstract

The invention provides a video filling method, device, equipment and storage medium based on a diffusion model, and relates to the technical field of artificial intelligence. The method comprises the following steps: obtaining a trained diffusion model, wherein the U-shaped network model in the trained diffusion model comprises a first encoder, a first intermediate layer and a decoder; the attention modules in the first encoder and the decoder are both spatiotemporal attention modules; the attention calculation dimensions of the spatiotemporal attention module comprise a channel dimension, a width dimension and a height dimension, the channel dimension being the dimension represented by the product of the number of channels and the number of frames; and inputting the video frame sequence to be filled into the trained diffusion model for video filling to obtain a target video frame sequence. The invention can improve the inter-frame consistency of the target video frame sequence.

Description

Translated from Chinese
Video filling method, device, equipment and storage medium based on a diffusion model

Technical Field

The present invention relates to the field of artificial intelligence technology, and in particular to a video filling method, device, equipment and storage medium based on a diffusion model.

Background

The video filling task refers to reconstructing a complete video sequence from several given frames of a video sequence. Video filling technology has broad application prospects. For example, in the field of autonomous driving, accurate prediction of a target's future state by a video prediction model enables an agent to make faster and better-informed decisions.

Currently, existing video filling schemes represent the video as four-dimensional B×CF×H×W data and replace three-dimensional convolutions with two-dimensional convolutions, where B denotes the batch size, C the number of channels, F the number of frames, H the height and W the width. Although such a scheme can greatly reduce video generation time, its attention module is a spatial attention module whose attention calculation dimensions only include the width and height dimensions, which leads to poor inter-frame consistency of the generated complete video sequence.

Summary of the Invention

The present invention provides a video filling method, device, equipment and storage medium based on a diffusion model, in order to overcome the defect of poor inter-frame consistency of the complete video sequences generated by the prior art and to improve the inter-frame consistency of the target video frame sequence.

The present invention provides a video filling method based on a diffusion model, comprising:

obtaining a trained diffusion model, wherein the U-shaped network model in the trained diffusion model comprises a first encoder, a first intermediate layer and a decoder; the attention modules in the first encoder and the decoder are both spatiotemporal attention modules; the attention calculation dimensions of the spatiotemporal attention module comprise a channel dimension, a width dimension and a height dimension, the channel dimension being the dimension represented by the product of the number of channels and the number of frames; and

inputting the video frame sequence to be filled into the trained diffusion model for video filling, to obtain a target video frame sequence.

According to a video filling method based on a diffusion model provided by the present invention, the diffusion model further comprises a sequence encoder, and the attention module in the first intermediate layer is a cross-attention module; during video filling, the method further comprises:

encoding all outputs of the video sequence predicted at the previous moment as a global feature by inputting them into the sequence encoder, to obtain a feature map; and

inputting the feature map into the cross-attention module, and using at least one video frame extracted from the end of the video sequence predicted at the previous moment as a local condition frame, to predict the video sequence at the next moment.

According to a video filling method based on a diffusion model provided by the present invention, the trained diffusion model is obtained by training based on the following steps:

inputting the initial video frames in the sample set into the noise-adding formula of the forward process of the diffusion model and gradually adding Gaussian noise, to obtain a noisy video frame sequence;

inputting the noisy video frame sequence into the U-shaped network model, and using at least one video frame extracted from the end of the video sequence predicted at the previous moment as a local condition frame to predict the video sequence at the next moment, until the recursive prediction process ends, to obtain an estimate of the noisy video frame sequence; and

calculating a loss based on the estimate of the noisy video frame sequence and the noisy video frame sequence, and adjusting the parameters of the U-shaped network model based on the loss.

According to a video filling method based on a diffusion model provided by the present invention, the trained diffusion model is obtained by training based on the following steps:

inputting the initial video frames in the sample set into the noise-adding formula of the forward process of the diffusion model and gradually adding Gaussian noise, to obtain a noisy video frame sequence;

inputting the noisy video frame sequence into the U-shaped network model, and using at least one video frame extracted from the ground-truth sequence in the noisy video frame sequence as a local condition frame, to predict the video sequence of the first training stage;

using at least one video frame extracted from the end of the video sequence predicted at the previous moment as a local condition frame to predict the video sequence at the next moment, until the recursive prediction process ends, to obtain the video sequence of the second training stage as an estimate of the noisy video frame sequence; and

calculating a loss based on the estimate of the noisy video frame sequence and the noisy video frame sequence, and adjusting the parameters of the U-shaped network model based on the loss.

According to a video filling method based on a diffusion model provided by the present invention, the method further comprises:

after randomly extracting at least one video frame from the ground-truth sequence in the noisy video frame sequence as a local condition frame, concatenating the position encoding of the local condition frame along the channel dimension, wherein the position encoding is a single-channel tensor of the same size as the local condition frame, and each element of the single-channel tensor is the index value of the local condition frame in the noisy video frame sequence.

According to a video filling method based on a diffusion model provided by the present invention, the first encoder and the decoder each comprise two first residual blocks and two spatiotemporal attention modules, and the first intermediate layer comprises one second residual block and one cross-attention module.

According to a video filling method based on a diffusion model provided by the present invention, the sequence encoder comprises a second encoder and a second intermediate layer, the second encoder comprises two third residual blocks and two first attention modules, and the second intermediate layer comprises one fourth residual block and one second attention module.

The present invention also provides a video filling device based on a diffusion model, comprising:

an acquisition module, configured to obtain a trained diffusion model, wherein the U-shaped network model in the trained diffusion model comprises an encoder, an intermediate layer and a decoder; the attention modules in the encoder and the decoder are both spatiotemporal attention modules; the attention calculation dimensions of the spatiotemporal attention module comprise a channel dimension, a width dimension and a height dimension, the channel dimension being the dimension represented by the product of the number of channels and the number of frames; and

a filling module, configured to input the video frame sequence to be filled into the trained diffusion model for video filling, to obtain a target video frame sequence.

The present invention also provides an electronic device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the program, implements any one of the above video filling methods based on a diffusion model.

The present invention also provides a non-transitory computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements any one of the above video filling methods based on a diffusion model.

The present invention also provides a computer program product comprising a computer program, wherein the computer program, when executed by a processor, implements any one of the above video filling methods based on a diffusion model.

In the video filling method, device, equipment and storage medium based on a diffusion model provided by the present invention, a trained diffusion model is first obtained; the U-shaped network model in the trained diffusion model comprises a first encoder, a first intermediate layer and a decoder; the attention modules in the first encoder and the decoder are both spatiotemporal attention modules; the attention calculation dimensions of the spatiotemporal attention module comprise a channel dimension, a width dimension and a height dimension, the channel dimension being the dimension represented by the product of the number of channels and the number of frames; the video frame sequence to be filled is then input into the trained diffusion model for video filling to obtain a target video frame sequence. Since the attention modules in the first encoder and the decoder are both spatiotemporal attention modules, video sequences with coherent content can be generated; and since the attention calculation dimensions of the spatiotemporal attention module also include the channel dimension, the spatiotemporal attention module can take the relationships between different frames into account, thereby improving the inter-frame consistency of the target video frame sequence.

Brief Description of the Drawings

In order to more clearly illustrate the technical solutions of the present invention or the prior art, the drawings required in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description show some embodiments of the present invention, and those of ordinary skill in the art can obtain other drawings from them without creative effort.

Figure 1 is a schematic diagram of noise addition and denoising in a diffusion model provided by the prior art;

Figure 2 is a schematic flowchart of a video filling method based on a diffusion model provided by an embodiment of the present invention;

Figure 3 is a schematic structural diagram of a spatiotemporal attention module provided by an embodiment of the present invention;

Figure 4 is a schematic structural diagram of a cross-attention module provided by an embodiment of the present invention;

Figure 5 is a schematic diagram of two-stage model training provided by an embodiment of the present invention;

Figure 6 is a schematic structural diagram of a video filling device based on a diffusion model provided by an embodiment of the present invention;

Figure 7 is a schematic structural diagram of an electronic device provided by an embodiment of the present invention.

Detailed Description of the Embodiments

In order to make the purpose, technical solutions and advantages of the present invention clearer, the technical solutions of the present invention are described clearly and completely below in conjunction with the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. Based on the embodiments of the present invention, all other embodiments obtained by those of ordinary skill in the art without creative effort fall within the protection scope of the present invention.

The video filling task refers to reconstructing a complete video sequence from several given frames of a video sequence. Current research on video filling mainly focuses on video prediction and video frame interpolation. Video prediction is the task of predicting a coherent video sequence given the first few frames of the sequence, while video frame interpolation generates a coherent video sequence given the first and last few frames of the sequence.

For the video prediction problem, early methods used recurrent neural networks to build image autoregressive models that implicitly model temporal correspondence. More recent methods use Transformers to capture global spatiotemporal correspondence and build video prediction models on generative adversarial networks to further improve the detail of the generated video. Some methods also propose building a temporal shift module on the generative network to model the spatiotemporal consistency between adjacent frames. In addition, some methods construct implicit neural representations based on Generative Adversarial Networks (GANs) and improve motion dynamics by manipulating spatial and temporal coordinates separately.

In the field of video frame interpolation, existing techniques combine optical flow estimation and image warping to generate intermediate frames between two consecutive key frames. More specifically, under the assumptions of linear inter-frame motion and constant brightness, these works compute optical flow and warp the input key frames to the target frame, while leveraging designs such as contextual information, spatial Transformer networks, or dynamic blending filters to improve the results.

Recently, diffusion models have begun to replace generative adversarial networks as a new paradigm for many computer vision tasks, owing to the high quality of their generated images and their stable training, and recent techniques have started to apply diffusion models to video filling. As shown in Figure 1, a diffusion model comprises a forward process (diffusion process) and a reverse process (denoising process).

Given initial data x_0 ~ q(x), where q(x) denotes the data distribution, the forward process is a fixed Markov chain that gradually adds Gaussian noise to x_0 according to a variance schedule β_1, …, β_T, destroying the structure in the data and producing a sequence of noisy samples x_1, …, x_T, where T denotes the number of diffusion steps. As T→∞, x_T approaches an isotropic Gaussian distribution. The forward process is defined by the following formulas:

$$q(x_{1:T} \mid x_0)=\prod_{t=1}^{T} q(x_t \mid x_{t-1}) \qquad (1)$$

$$q(x_t \mid x_{t-1})=\mathcal{N}\!\left(x_t;\ \sqrt{1-\beta_t}\,x_{t-1},\ \beta_t I\right) \qquad (2)$$

where q(x_{1:T} ∣ x_0) denotes the probability of the forward process, q(x_t ∣ x_{t-1}) denotes the probability of transitioning from x_{t-1} to x_t, and I denotes the identity matrix.

Using the transition kernel defined above, a noisy sample x_t at an arbitrary time step t can be drawn according to formula (3):

$$q(x_t \mid x_0)=\mathcal{N}\!\left(x_t;\ \sqrt{\bar{\alpha}_t}\,x_0,\ (1-\bar{\alpha}_t) I\right) \qquad (3)$$

where $\alpha_t := 1-\beta_t$ and $\bar{\alpha}_t := \prod_{s=1}^{t}\alpha_s$.

During the training phase, the reverse process attempts to trace the initial data x_0 back from Gaussian noise x_T. Since the reverse distribution q(x_{t-1} ∣ x_t) of the forward process cannot be obtained exactly, it is replaced by a Markov chain with learnable Gaussian kernels, defined as follows:

$$p_\theta(x_{0:T})=p(x_T)\prod_{t=1}^{T} p_\theta(x_{t-1} \mid x_t) \qquad (4)$$

$$p_\theta(x_{t-1} \mid x_t)=\mathcal{N}\!\left(x_{t-1};\ \mu_\theta(x_t,t),\ \Sigma_\theta(x_t,t)\right) \qquad (5)$$

The negative log-likelihood is then optimized through the variational lower bound:

$$\mathbb{E}\!\left[-\log p_\theta(x_0)\right] \le \mathbb{E}_q\!\left[-\log \frac{p_\theta(x_{0:T})}{q(x_{1:T} \mid x_0)}\right] \qquad (6)$$

Formula (6) essentially uses the KL divergence to make p_θ(x_{t-1} ∣ x_t) estimate the posterior distribution of the forward process:

$$q(x_{t-1} \mid x_t, x_0)=\mathcal{N}\!\left(x_{t-1};\ \tilde{\mu}_t(x_t,x_0),\ \tilde{\beta}_t I\right) \qquad (7)$$

where $\alpha_t := 1-\beta_t$, $\bar{\alpha}_t := \prod_{s=1}^{t}\alpha_s$, $\tilde{\mu}_t(x_t,x_0) := \frac{\sqrt{\bar{\alpha}_{t-1}}\,\beta_t}{1-\bar{\alpha}_t}x_0+\frac{\sqrt{\alpha_t}\,(1-\bar{\alpha}_{t-1})}{1-\bar{\alpha}_t}x_t$ and $\tilde{\beta}_t := \frac{1-\bar{\alpha}_{t-1}}{1-\bar{\alpha}_t}\beta_t$.

Since the initial data x_0 cannot be obtained during the sampling stage, a time-conditional neural network parameterized by θ, usually adopting a U-Net structure, is used to estimate the noise ε_t; the predicted initial data $\hat{x}_0$ is then obtained through formula (3) as $\hat{x}_0=\big(x_t-\sqrt{1-\bar{\alpha}_t}\,\epsilon_\theta(x_t,t)\big)/\sqrt{\bar{\alpha}_t}$. The loss function L(θ) of this network is the L2 loss.

Finally, the initial data $\hat{x}_0$ predicted by the neural network replaces x_0 in formula (7) to reverse the diffusion process and reconstruct the data structure from Gaussian noise.
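To make formulas (1)-(3) and the L2 noise-prediction objective concrete, the following is a minimal PyTorch-style sketch of the forward noising and of one training step; it is an illustration under assumptions (a linear β schedule, image-shaped tensors, and a generic `unet(x_t, t)` callable), not the patent's actual implementation.

```python
import torch
import torch.nn.functional as F

# Assumed linear variance schedule beta_1..beta_T and derived alpha terms.
T = 1000
betas = torch.linspace(1e-4, 2e-2, T)        # beta_t
alphas = 1.0 - betas                         # alpha_t = 1 - beta_t
alphas_bar = torch.cumprod(alphas, dim=0)    # alpha_bar_t = prod_s alpha_s

def q_sample(x0, t, noise):
    """Sample x_t ~ q(x_t | x_0) in closed form (formula (3))."""
    a_bar = alphas_bar.to(x0.device)[t].view(-1, 1, 1, 1)
    return a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * noise

def training_step(unet, x0):
    """One denoising-network update: predict the added noise with an L2 loss."""
    b = x0.shape[0]
    t = torch.randint(0, T, (b,), device=x0.device)   # random diffusion step per sample
    noise = torch.randn_like(x0)                      # epsilon_t
    x_t = q_sample(x0, t, noise)
    eps_pred = unet(x_t, t)                           # time-conditioned U-Net estimates the noise
    return F.mse_loss(eps_pred, noise)                # L(theta): L2 loss

def predict_x0(x_t, t, eps_pred):
    """Invert formula (3) to recover the predicted initial data x0_hat."""
    a_bar = alphas_bar.to(x_t.device)[t].view(-1, 1, 1, 1)
    return (x_t - (1.0 - a_bar).sqrt() * eps_pred) / a_bar.sqrt()
```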

In the existing techniques that apply diffusion models to video filling, a specific type of 3D U-Net is used as the neural network architecture to process video data. To reduce computational overhead, this method replaces each 2-dimensional 3×3 convolution kernel with a 3-dimensional 1×3×3 convolution kernel (the first axis represents the time of the video frames, and the second and third axes represent spatial height and width), and inserts a temporal attention module after every spatial attention module; the temporal attention module treats the spatial axes as batch axes and computes attention over the temporal axis. This 3D U-Net model is mainly aimed at unconditional video generation. Using it for video filling requires replacing the noise frames at the corresponding positions in the video sequence with noised reference frames during denoising, and forward diffusion must also be carried out during the denoising process to keep the generated frames consistent; this greatly slows down video generation. In addition, the heavy use of 3D convolutions in the 3D U-Net also makes video generation slow. To alleviate this, the Masked Conditional Video Diffusion (MCVD) algorithm introduces reference frames during training so that the network learns to fill the video from reference frames; during sampling, the replacement operation no longer needs to be repeated and no forward noising is required during denoising. Furthermore, MCVD proposes concatenating the video sequence along the channel axis, representing data in the five-dimensional B×C×F×H×W format as four-dimensional B×CF×H×W data, and then replacing 3-dimensional convolutions with 2-dimensional convolutions, which greatly reduces video generation time; however, the results generated by this network structure have poor inter-frame consistency.
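The channel-concatenation idea described above can be pictured with the short sketch below, which reshapes B×C×F×H×W video data into B×(CF)×H×W so that an ordinary 2D convolution mixes information from all frames at once; the tensor sizes and layer widths are illustrative assumptions rather than the configuration used by MCVD or by the patent.

```python
import torch
import torch.nn as nn

video = torch.randn(2, 3, 8, 64, 64)    # B x C x F x H x W
B, C, Fr, H, W = video.shape

# Merge the channel and frame axes: B x (C*F) x H x W.
x = video.reshape(B, C * Fr, H, W)

# A 2D convolution now mixes information from every frame at once,
# replacing a 3D convolution over (F, H, W).
conv2d = nn.Conv2d(in_channels=C * Fr, out_channels=64, kernel_size=3, padding=1)
features = conv2d(x)                     # B x 64 x H x W
print(features.shape)
```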

In addition, when predicting long videos, the existing techniques need to call the network recursively, i.e., the last few frames output by the network are used as reference conditions to predict the next video sequence. With this approach, the network's prediction error accumulates as the number of frames increases.

On this basis, an embodiment of the present invention provides a video filling method based on a diffusion model. First, a trained diffusion model is obtained; the U-shaped network model in the trained diffusion model comprises a first encoder, a first intermediate layer and a decoder; the attention modules in the first encoder and the decoder are both spatiotemporal attention modules; the attention calculation dimensions of the spatiotemporal attention module comprise a channel dimension, a width dimension and a height dimension, the channel dimension being the dimension represented by the product of the number of channels and the number of frames. The video frame sequence to be filled is then input into the trained diffusion model for video filling to obtain a target video frame sequence. Since the attention modules in the first encoder and the decoder are both spatiotemporal attention modules, video sequences with coherent content can be generated; and since the attention calculation dimensions of the spatiotemporal attention module also include the channel dimension, the spatiotemporal attention module can take the relationships between different frames into account, thereby improving the inter-frame consistency of the target video frame sequence.

The video filling method based on a diffusion model according to embodiments of the present invention is described below with reference to Figures 2 to 5.

Please refer to Figure 2, which is a schematic flowchart of a video filling method based on a diffusion model provided by an embodiment of the present invention. As shown in Figure 2, the method may include the following steps:

Step 201: obtain a trained diffusion model; the U-shaped network model in the trained diffusion model includes a first encoder, a first intermediate layer and a decoder; the attention modules in the first encoder and the decoder are both spatiotemporal attention modules; the attention calculation dimensions of the spatiotemporal attention module include a channel dimension, a width dimension and a height dimension, the channel dimension being the dimension represented by the product of the number of channels and the number of frames;

Step 202: input the video frame sequence to be filled into the trained diffusion model for video filling, to obtain a target video frame sequence.

In step 201, the diffusion model includes a forward process and a U-shaped network model. The forward process of the diffusion model is similar to the forward process shown in Figure 1 and is not described again here. The U-shaped network model may be a U-Net.

The U-shaped network model may include a first encoder, a first intermediate layer and a decoder. Optionally, the first encoder and the decoder each include two first residual blocks and two spatiotemporal attention modules, and the first intermediate layer includes one second residual block and one cross-attention module.

The attention modules in the first encoder and the decoder are both spatiotemporal attention modules, which can generate video sequences with coherent content. As shown in Figure 3, the attention calculation dimensions of the spatiotemporal attention module include the channel dimension C1, the width dimension W and the height dimension H, where the channel dimension C1 is the dimension represented by the product of the number of channels and the number of frames; C1 = C/N and B1 = B×N, where N denotes the number of attention heads and C denotes the product of the number of channels and the number of frames in the multi-head attention. The spatiotemporal attention module computes attention over the channel, width and height dimensions of the data, which fits the data processing scheme of concatenating video frames along the channel axis; and since its attention calculation dimensions also include the channel dimension, the spatiotemporal attention module can take the relationships between different frames into account, thereby improving the inter-frame consistency of the target video frame sequence.
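One simple way to let attention act along the concatenated channel-frame axis, so that different frames can attend to one another, is the channel-wise self-attention sketched below, with the head count folded into the batch (B1 = B×N, C1 = C/N) as described above. This is only a plausible reading of Figure 3; the head count, the 1×1 projection layers and the residual connection are assumptions.

```python
import torch
import torch.nn as nn

class ChannelFrameAttention(nn.Module):
    """Illustrative self-attention along the concatenated channel-frame axis.

    Input is B x C x H x W with C = channels * frames.  Heads are folded into
    the batch (B1 = B * N, C1 = C / N) and attention scores are computed
    between the C1 channel-frame slices, so different frames can exchange
    information; the H and W axes supply the features of each slice.
    """
    def __init__(self, channels, num_heads=4):
        super().__init__()
        assert channels % num_heads == 0
        self.num_heads = num_heads
        self.qkv = nn.Conv2d(channels, channels * 3, kernel_size=1)
        self.proj = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, x):
        B, C, H, W = x.shape
        N, C1 = self.num_heads, C // self.num_heads
        q, k, v = self.qkv(x).chunk(3, dim=1)
        # Fold heads into the batch: (B*N, C1, H*W).
        q = q.reshape(B * N, C1, H * W)
        k = k.reshape(B * N, C1, H * W)
        v = v.reshape(B * N, C1, H * W)
        attn = torch.softmax(q @ k.transpose(1, 2) / (H * W) ** 0.5, dim=-1)  # (B*N, C1, C1)
        out = attn @ v                                                        # (B*N, C1, H*W)
        return self.proj(out.reshape(B, C, H, W)) + x                         # residual connection

# Usage on channel-concatenated video features (B x CF x H x W).
layer = ChannelFrameAttention(channels=3 * 8, num_heads=4)
y = layer(torch.randn(2, 24, 32, 32))
```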

In step 202, the video frame sequence to be filled is input into the trained diffusion model for video filling to obtain a target video frame sequence. Since the spatiotemporal attention module of the diffusion model computes attention over the channel, width and height dimensions of the data, the added channel dimension lets the spatiotemporal attention module consider the relationships between different frames during video filling, thereby improving the inter-frame consistency of the target video frame sequence.

In the video filling method based on a diffusion model provided by this embodiment of the present invention, a trained diffusion model is first obtained; the U-shaped network model in the trained diffusion model includes a first encoder, a first intermediate layer and a decoder; the attention modules in the first encoder and the decoder are both spatiotemporal attention modules; the attention calculation dimensions of the spatiotemporal attention module include a channel dimension, a width dimension and a height dimension, the channel dimension being the dimension represented by the product of the number of channels and the number of frames; the video frame sequence to be filled is then input into the trained diffusion model for video filling to obtain a target video frame sequence. Since the attention modules in the first encoder and the decoder are both spatiotemporal attention modules, video sequences with coherent content can be generated; and since the attention calculation dimensions of the spatiotemporal attention module also include the channel dimension, the spatiotemporal attention module can consider the relationships between different frames, thereby improving the inter-frame consistency of the target video frame sequence.

Based on the video filling method based on a diffusion model of the embodiment corresponding to Figure 1, in an example embodiment the diffusion model further includes a sequence encoder, and the attention module in the first intermediate layer is a cross-attention module; during video filling, the method further includes:

Step 203: encode all outputs of the video sequence predicted at the previous moment as a global feature by inputting them into the sequence encoder, to obtain a feature map;

Step 204: input the feature map into the cross-attention module, and use at least one video frame extracted from the end of the video sequence predicted at the previous moment as a local condition frame, to predict the video sequence at the next moment.

In step 203, the diffusion model is extended with an additional sequence encoder. Optionally, the sequence encoder includes a second encoder and a second intermediate layer; the second encoder includes two third residual blocks and two first attention modules, and the second intermediate layer includes one fourth residual block and one second attention module. That is, the sequence encoder includes three residual blocks and three attention modules.

During video filling, all outputs of the video sequence predicted at the previous moment are encoded as a global feature by the sequence encoder to obtain a feature map, which can fully extract all the information of the video sequence predicted at the previous moment.

In step 204, as shown in Figure 4, the feature map output by the sequence encoder is input into the cross-attention module, and at least one video frame extracted from the end of the video sequence predicted at the previous moment is used as the local condition frame input to the first intermediate layer, to predict the video sequence at the next moment; the video sequence is then grown recursively.

In this embodiment, the global condition (all outputs of the previously predicted video sequence, encoded as a global feature) and the local condition (at least one video frame extracted from the end of the previously predicted video sequence, used as local condition frames) are combined to predict the video sequence at the next moment. This fully exploits all the information of the previously predicted video sequence, improves the model's ability to predict long videos, and avoids the inaccurate long-video prediction caused by using only part of the information of the previously predicted video sequence.
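A schematic sketch of this recursive inference loop is given below: each step conditions on the whole previously predicted clip (the global feature produced by the sequence encoder and consumed by cross-attention) and on its last frames (the local condition frames). The component names `denoiser`, `seq_encoder` and `sample_fn`, and all tensor shapes, are placeholders rather than the patent's interfaces.

```python
import torch

@torch.no_grad()
def predict_long_video(denoiser, seq_encoder, sample_fn, first_clip,
                       clip_len=8, num_clips=4, num_local=2):
    """Recursively grow a video: each step is conditioned on the whole previous
    clip (global feature via the sequence encoder / cross-attention) and on its
    last `num_local` frames (local condition frames)."""
    clips = [first_clip]                                 # first_clip: (B, F, C, H, W)
    for _ in range(num_clips):
        prev = clips[-1]
        global_feat = seq_encoder(prev)                  # feature map fed to cross-attention
        local_cond = prev[:, -num_local:]                # last frames of the previous clip
        noise = torch.randn(prev.shape[0], clip_len, *prev.shape[2:],
                            device=prev.device)
        # sample_fn runs the reverse diffusion process of `denoiser`
        # under the given global and local conditions (placeholder signature).
        next_clip = sample_fn(denoiser, noise, global_feat, local_cond)
        clips.append(next_clip)
    return torch.cat(clips, dim=1)                       # (B, total_frames, C, H, W)
```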

Based on the video filling method based on a diffusion model of the above embodiment, in an example embodiment the trained diffusion model is obtained by training based on the following steps:

Step 301: input the initial video frames in the sample set into the noise-adding formula of the forward process of the diffusion model and gradually add Gaussian noise, to obtain a noisy video frame sequence;

Step 302: input the noisy video frame sequence into the U-shaped network model, and use at least one video frame extracted from the end of the video sequence predicted at the previous moment as a local condition frame to predict the video sequence at the next moment, until the recursive prediction process ends, to obtain an estimate of the noisy video frame sequence;

Step 303: calculate a loss based on the estimate of the noisy video frame sequence and the noisy video frame sequence, and adjust the parameters of the U-shaped network model based on the loss.

In step 301, the noise-adding formulas of the forward process of the diffusion model may include the aforementioned formulas (1)-(3). After Gaussian noise is gradually added to the initial video frames of the input sample set based on formulas (1)-(3), a noisy video frame sequence is obtained.

In step 302, the attention calculation dimensions of the spatiotemporal attention module in the U-shaped network model also include the channel dimension, which lets the spatiotemporal attention module consider the relationships between different frames and thereby improves the inter-frame consistency of the target video frame sequence.

The noisy video frame sequence is input into the U-shaped network model to obtain the video sequence predicted at the previous moment. All outputs of the video sequence predicted at the previous moment are encoded as a global feature by the sequence encoder to obtain a feature map, which can fully extract all the information of that video sequence. The feature map output by the sequence encoder is input into the cross-attention module, and at least one video frame extracted from the end of the previously predicted video sequence is used as a local condition frame to predict the video sequence at the next moment; the video sequence is then grown recursively until the recursive prediction process ends, yielding an estimate of the noisy video frame sequence.

In step 303, the loss between the estimate of the noisy video frame sequence and the noisy video frame sequence is calculated, and the parameters of the U-shaped network model are adjusted according to the loss until the loss converges.

In this embodiment, during training of the diffusion model, on the one hand, the attention calculation dimensions of the spatiotemporal attention module in the U-shaped network model also include the channel dimension, which lets the spatiotemporal attention module consider the relationships between different frames and thereby improves the inter-frame consistency of the target video frame sequence; on the other hand, the global condition (all outputs of the previously predicted video sequence, encoded as a global feature) and the local condition (at least one video frame extracted from the end of the previously predicted video sequence, used as local condition frames) are combined to predict the video sequence at the next moment, which fully exploits all the information of the previously predicted video sequence and improves the model's ability to predict long videos, thus improving the prediction capability of the trained diffusion model.

In another example embodiment, the trained diffusion model is obtained by training based on the following steps:

Step 401: input the initial video frames in the sample set into the noise-adding formula of the forward process of the diffusion model and gradually add Gaussian noise, to obtain a noisy video frame sequence;

Step 402: input the noisy video frame sequence into the U-shaped network model, and use at least one video frame extracted from the ground-truth sequence in the noisy video frame sequence as a local condition frame, to predict the video sequence of the first training stage;

Step 403: use at least one video frame extracted from the end of the video sequence predicted at the previous moment as a local condition frame to predict the video sequence at the next moment, until the recursive prediction process ends, to obtain the video sequence of the second training stage as an estimate of the noisy video frame sequence;

Step 404: calculate a loss based on the estimate of the noisy video frame sequence and the noisy video frame sequence, and adjust the parameters of the U-shaped network model based on the loss.

In step 401, the noise-adding formulas of the forward process of the diffusion model may include the aforementioned formulas (1)-(3). After Gaussian noise is gradually added to the initial video frames of the input sample set based on formulas (1)-(3), a noisy video frame sequence is obtained.

In step 402, as shown in Figure 5, at least one video frame extracted from the end of the ground-truth sequence of the previously predicted video sequence is used as a local condition frame to predict the video sequence at the next moment, yielding the video sequence of the first training stage.

In the first training stage, since the local condition frames are extracted from the ground-truth sequence of the previously predicted video sequence, no error of the previously predicted video sequence is introduced.

In one implementation, after at least one video frame is randomly extracted from the ground-truth sequence in the noisy video frame sequence as a local condition frame, the position encoding of the local condition frame is concatenated along the channel dimension; the position encoding is a single-channel tensor of the same size as the local condition frame, and each element of the single-channel tensor is the index value of the local condition frame in the noisy video frame sequence.

For example, suppose the video sequence predicted at the previous moment is a video sequence of 14 frames. In the first training stage, 0-2 local condition frames are randomly extracted from the ground-truth sequence of the previously predicted video sequence, and the position encodings of the extracted local condition frames are concatenated along the channel dimension; a position encoding is a single-channel tensor of the same size as the local condition frame, each element of which is the index value of the local condition frame in the noisy video frame sequence. By introducing conditions in this way, the model can simultaneously perform video filling tasks such as video prediction, video frame interpolation and unconditional video generation. In this stage, the network predicts a noise ε_θ, and the first predicted 8-frame video sequence, i.e., the video sequence of the first training stage, is then obtained from the predicted noise via formula (3).
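The position encoding described above can be constructed as in the following sketch: a single-channel tensor with the same spatial size as the local condition frame, filled with the frame's index in the sequence and concatenated along the channel dimension. The tensor shapes are illustrative assumptions.

```python
import torch

def attach_position_encoding(cond_frame, frame_index):
    """cond_frame: (B, C, H, W) local condition frame;
    frame_index: its index in the noisy video frame sequence."""
    B, _, H, W = cond_frame.shape
    # Single-channel tensor of the same size as the frame, every element
    # equal to the frame's index in the sequence.
    pos = torch.full((B, 1, H, W), float(frame_index),
                     dtype=cond_frame.dtype, device=cond_frame.device)
    # Concatenate the position encoding along the channel dimension.
    return torch.cat([cond_frame, pos], dim=1)           # (B, C + 1, H, W)

# Example: the frame with index 5 in the sequence is used as a local condition frame.
frame = torch.randn(2, 3, 64, 64)
cond = attach_position_encoding(frame, frame_index=5)
print(cond.shape)  # torch.Size([2, 4, 64, 64])
```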

In step 403, all outputs of the video sequence of the first training stage are encoded as a global feature by the sequence encoder to obtain a feature map, which can fully extract all the information of the previously predicted video sequence. The feature map output by the sequence encoder is input into the cross-attention module, and at least one video frame extracted from the end of the previously predicted video sequence is used as a local condition frame to predict the video sequence at the next moment; the video sequence is then grown recursively until the recursive prediction process ends, and the video sequence of the second training stage is obtained as an estimate of the noisy video frame sequence.

For example, all outputs of the video sequence of the first training stage are encoded as a global feature by the sequence encoder to obtain a feature map z. The feature map z output by the sequence encoder is input into the cross-attention module, and the last two frames of the video sequence of the first training stage are taken as local condition frames to predict the next 8-frame video sequence.

Since model prediction errors usually take the form of motion blur, color deviation and the like, the second training stage uses such errors as a form of data augmentation, letting the model learn to resist these prediction errors; this greatly alleviates the error accumulation problem of recursively predicting long videos.
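A highly condensed sketch of this two-stage schedule is given below: stage one conditions on frames taken from the ground-truth sequence, while stage two conditions on the clip predicted in stage one, so that prediction artifacts act as data augmentation. The `model(...)` interface, clip shapes and number of local condition frames are assumptions, not the patent's implementation.

```python
import torch

def two_stage_step(model, gt_clip1, gt_clip2, num_local=2):
    """One training step under the two-stage scheme.

    `model(target_clip, local_cond, global_cond)` is a placeholder interface
    that runs the diffusion training objective on `target_clip` under the
    given conditions and returns (predicted_clip, loss).
    gt_clip1 / gt_clip2: consecutive ground-truth clips, shape (B, F, C, H, W).
    """
    # Stage 1: local condition frames are drawn from the ground-truth sequence,
    # so no prediction error is introduced.
    cond1 = gt_clip1[:, -num_local:]
    pred1, loss1 = model(gt_clip1, local_cond=cond1, global_cond=None)

    # Stage 2: condition on the clip *predicted* in stage 1; its motion blur and
    # color drift act as data augmentation, teaching the model to resist the
    # error accumulation of recursive long-video prediction.
    pred1 = pred1.detach()
    _, loss2 = model(gt_clip2, local_cond=pred1[:, -num_local:], global_cond=pred1)
    return loss1 + loss2
```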

In step 404, the loss between the estimate of the noisy video frame sequence and the noisy video frame sequence is calculated, and the parameters of the U-shaped network model are adjusted according to the loss until the loss converges.

In this embodiment, in the first training stage, since the local condition frames are extracted from the ground-truth sequence of the previously predicted video sequence, no error of the previously predicted video sequence is introduced; in the second training stage, the model's prediction error is used as a form of data augmentation, letting the model learn to resist this prediction error, which greatly alleviates the error accumulation problem of recursively predicting long videos. This can further improve the long-video prediction capability of the trained diffusion model.

The solution of the embodiments of the present invention is verified below through specific experiments.

(1) The data set used in the experiment is the road scene data set Cityscapes. On the Cityscapes data set, the prediction results before and after the model improvements are compared. The experimental results are shown in Table 1, where a downward arrow indicates that a decrease in the value is better and an upward arrow indicates that an increase in the value is better:

Table 1

| Cityscapes | p/k/n | FVD↓ | LPIPS↓ | SSIM↑ |
| --- | --- | --- | --- | --- |
| Baseline | 2/6/28 | 276.27 | 0.111 | 0.708 |
| + Two-stage training | 2/6/28 | 143.91 | 0.071 | 0.742 |
| + Sequence encoder | 2/6/28 | 141.08 | 0.067 | 0.750 |
| Embodiment of the invention | 2/6/28 | 124.62 | 0.069 | 0.743 |

Here, p denotes the number of reference frames, k the number of frames predicted by the model at each step, and n the total number of frames predicted by the model. The Fréchet Video Distance (FVD) compares the statistical similarity of the predicted sequence and the ground-truth sequence in the feature space of an Inflated 3D ConvNet (I3D), and measures the coherence and visual quality of a video sequence; the lower the value, the better the visual quality. It is calculated as:

$$\mathrm{fvd}(P_R,P_G)=\left\lVert \mu_R-\mu_G \right\rVert^2+\mathrm{Tr}\!\left(\Sigma_R+\Sigma_G-2\left(\Sigma_R\Sigma_G\right)^{1/2}\right)$$

where fvd(P_R, P_G) denotes the evaluation metric FVD, P_R denotes the real data distribution, P_G denotes the data distribution generated by the model, μ_R denotes the mean of the real data distribution, μ_G denotes the mean of the generated data distribution, Σ_R and Σ_G denote the corresponding covariance matrices, Tr denotes the trace of a matrix, R stands for "real" and G stands for "generative".
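For reference, the Fréchet distance above can be computed from the Gaussian statistics of the two I3D feature sets as in the following sketch (the I3D feature extraction itself is not shown, and the function name is an assumption):

```python
import numpy as np
from scipy import linalg

def frechet_distance(feats_real, feats_gen):
    """feats_real, feats_gen: (num_videos, feature_dim) I3D features."""
    mu_r, mu_g = feats_real.mean(axis=0), feats_gen.mean(axis=0)
    sigma_r = np.cov(feats_real, rowvar=False)
    sigma_g = np.cov(feats_gen, rowvar=False)
    # Matrix square root of the product of the two covariance matrices.
    covmean = linalg.sqrtm(sigma_r @ sigma_g)
    if np.iscomplexobj(covmean):
        covmean = covmean.real          # drop tiny imaginary parts from sqrtm
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(sigma_r + sigma_g - 2.0 * covmean))
```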

Learned Perceptual Image Patch Similarity (LPIPS) measures the similarity between the predicted sequence and the ground-truth sequence on the feature maps of AlexNet (the neural network proposed by Alex Krizhevsky et al.); the lower the value, the better the result. Structural Similarity (SSIM) measures the similarity between the prediction and the ground truth, and is calculated as:

$$\mathrm{SSIM}(x,y)=\frac{(2\mu_x\mu_y+C_1)(2\sigma_{xy}+C_2)}{(\mu_x^2+\mu_y^2+C_1)(\sigma_x^2+\sigma_y^2+C_2)}$$

where SSIM(x, y) denotes the evaluation metric SSIM, x denotes the current image, y denotes the target image, μ_x denotes the mean of the current image, μ_y denotes the mean of the target image, σ_xy denotes the covariance of the current and target images, σ_x denotes the standard deviation of the current image, σ_y denotes the standard deviation of the target image, and C_1 and C_2 are constants.

From the experimental results in Table 1, it can be seen that on the Cityscapes data set, the two-stage training scheme and the scheme combining local and global conditions proposed by the embodiments of the present invention can both significantly improve the quality of the model's video prediction and video frame interpolation results.

(2) The data set used in the experiment is the road scene data set Cityscapes. On the Cityscapes data set, the prediction results of existing solutions (Hier-vRNN, GHVAE, MCVD spatin, MCVD concat) and of the embodiment of the present invention are compared. The experimental results are shown in Table 2:

Table 2

| Cityscapes | p/k/n | FVD↓ | LPIPS↓ | SSIM↑ |
| --- | --- | --- | --- | --- |
| Hier-vRNN | 2/10/28 | 567.51 | 0.264 | 0.628 |
| GHVAE | 2/10/28 | 418 | 0.193 | 0.740 |
| MCVD spatin | 2/5/28 | 184.81 | 0.121 | 0.720 |
| MCVD concat | 2/5/28 | 141.31 | 0.112 | 0.690 |
| Embodiment of the invention | 2/6/28 | 124.62 | 0.069 | 0.743 |

Here, Hier-vRNN denotes the Hierarchical Variational Recurrent Neural Network, GHVAE denotes the Greedy Hierarchical Variational Autoencoder, MCVD spatin denotes the MCVD space-time adaptive normalization algorithm, and MCVD concat denotes the MCVD concatenation algorithm.

From the experimental results in Table 2, it can be seen that on the Cityscapes data set, compared with the existing solutions (Hier-vRNN, GHVAE, MCVD spatin, MCVD concat), the solution proposed by the embodiment of the present invention can significantly improve the quality of the model's video prediction and video frame interpolation results.

(3) The data set used in the experiment is the BAIR data set. On the BAIR data set, the prediction results of existing solutions (SVG-LP, SepConv, MCVD) and of the embodiment of the present invention are compared. The experimental results are shown in Table 3:

Table 3

| BAIR | p/k/r | PSNR↑ | SSIM↑ |
| --- | --- | --- | --- |
| SVG-LP | 18/7/100 | 18.648 | 0.846 |
| SepConv | 18/7/100 | 21.615 | 0.877 |
| MCVD | 4/5/100 | 25.162 | 0.932 |
| Embodiment of the invention | 2/6/100 | 26.732 | 0.952 |

Here, PSNR denotes the Peak Signal-to-Noise Ratio, SepConv denotes the Separable Convolution algorithm, and SVG-LP denotes the Stochastic Video Generation with learned prior algorithm.

From the experimental results in Table 3, it can be seen that on the BAIR data set, compared with the existing solutions (SVG-LP, SepConv, MCVD), the solution proposed by the embodiment of the present invention can significantly improve the quality of the model's video prediction and video frame interpolation results.

(4) The data set used in the experiment is the BAIR data set. On the BAIR data set, the prediction results of existing solutions (MCVD spatin, MCVD concat) and of the embodiment of the present invention are compared. The experimental results are shown in Table 4:

Table 4

From the experimental results in Table 4, it can be seen that on the BAIR data set, compared with the existing solutions (MCVD spatin, MCVD concat), the solution proposed by the embodiment of the present invention can significantly improve the quality of the model's video prediction and video frame interpolation results.

The video filling device based on a diffusion model provided by the present invention is described below. The video filling device based on a diffusion model described below and the video filling method based on a diffusion model described above may be referred to in correspondence with each other.

Please refer to Figure 6, which is a schematic structural diagram of a video filling device based on a diffusion model provided by an embodiment of the present invention. As shown in Figure 6, the device may include:

an acquisition module 10, configured to obtain a trained diffusion model, wherein the U-shaped network model in the trained diffusion model includes an encoder, an intermediate layer and a decoder; the attention modules in the encoder and the decoder are both spatiotemporal attention modules; the attention calculation dimensions of the spatiotemporal attention module include a channel dimension, a width dimension and a height dimension, the channel dimension being the dimension represented by the product of the number of channels and the number of frames; and

a filling module 20, configured to input the video frame sequence to be filled into the trained diffusion model for video filling, to obtain a target video frame sequence.

在一种示例实施例中,扩散模型还包括序列编码器,第一中间层中的注意力模块为交叉注意力模块;在视频填充的过程中,该装置还包括:In an example embodiment, the diffusion model further includes a sequence encoder, and the attention module in the first intermediate layer is a cross-attention module; during the video filling process, the device further includes:

编码模块,用于将上一时刻预测得到的视频序列的全部输出编码作为全局特征输入序列编码器中进行编码,得到特征图;The encoding module is used to input all output encodings of the video sequence predicted at the previous moment, as global features, into the sequence encoder for encoding to obtain a feature map;

预测模块,用于将特征图输入交叉注意力模块中,并以上一时刻预测得到的视频序列中末尾抽取的至少一个视频帧作为局部条件帧,预测下一时刻的视频序列。The prediction module is used to input the feature map into the cross attention module, and use at least one video frame extracted at the end of the video sequence predicted at the previous moment as a local condition frame to predict the video sequence at the next moment.
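下面给出交叉注意力条件化的一个示意性写法(假设实现,符号与维度仅为示例):U型网络中间层特征(已带有局部条件帧信息)作为 query,序列编码器输出的全局特征图作为 key/value。A sketch of how the cross-attention conditioning could look (assumed implementation; symbols and shapes are examples only): the U-Net intermediate features, which already carry the local condition frames, act as queries, while the sequence-encoder feature map provides the keys and values as the global condition.

```python
import torch
import torch.nn as nn

class CrossAttentionBlock(nn.Module):
    """中间层交叉注意力示意:U 型网络特征为 query,序列编码器特征图为 key/value。
    Sketch of the intermediate-layer cross-attention: U-Net features are queries,
    the sequence-encoder feature map (global condition) gives keys/values."""

    def __init__(self, dim: int, cond_dim: int, num_heads: int = 4):
        super().__init__()
        self.norm_q = nn.LayerNorm(dim)
        self.norm_kv = nn.LayerNorm(cond_dim)
        self.proj_kv = nn.Linear(cond_dim, dim)  # 把条件特征投影到 query 维度 / project condition to query dim
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        # x:    (B, N, dim)      中间层特征(含局部条件帧信息) / mid-layer features
        # cond: (B, M, cond_dim) 序列编码器输出的全局特征图(展平后) / flattened global feature map
        q = self.norm_q(x)
        kv = self.proj_kv(self.norm_kv(cond))
        out, _ = self.attn(q, kv, kv)
        return x + out  # 残差连接 / residual connection
```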

在一种示例实施例中,该装置还包括训练模块,训练模块具体用于:In an example embodiment, the device further includes a training module, and the training module is specifically used to:

将样本集中的初始视频帧输入扩散模型的前向过程的加噪公式中逐渐添加高斯噪声,得到带有噪声的视频帧序列;The initial video frames in the sample set are input into the noise-adding formula of the forward process of the diffusion model and Gaussian noise is gradually added to obtain a video frame sequence with noise;

将带有噪声的视频帧序列输入U型网络模型中,以上一时刻预测得到的视频序列中末尾抽取的至少一个视频帧作为局部条件帧,预测下一时刻的视频序列,直至递归预测过程结束,得到带有噪声的视频帧序列的估计值;Input the noisy video frame sequence into the U-shaped network model, use at least one video frame extracted from the end of the video sequence predicted at the previous moment as a local condition frame, and predict the video sequence at the next moment until the recursive prediction process ends, obtaining an estimate of the noisy video frame sequence;

基于带有噪声的视频帧序列的估计值和带有噪声的视频帧序列计算损失,并基于损失调整U型网络模型的参数。The loss is calculated based on the estimated value of the noisy video frame sequence and the noisy video frame sequence, and the parameters of the U-shaped network model are adjusted based on the loss.
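下面按标准DDPM式的前向加噪公式给出一个训练步骤的示意(假设写法;本实施例未给出具体公式,损失此处假设为估计值与带噪序列之间的均方误差)。Below is a sketch of one training step using the standard DDPM-style forward noising formula (an assumption; the embodiment does not spell out its exact formula, and the loss is assumed here to be a mean-squared error between the estimate and the noisy sequence).

```python
import torch
import torch.nn.functional as F

def q_sample(x0: torch.Tensor, t: torch.Tensor, alphas_bar: torch.Tensor):
    """前向过程逐步加高斯噪声(标准 DDPM 公式,此处为假设):
    x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps, eps ~ N(0, I)."""
    eps = torch.randn_like(x0)
    a_bar = alphas_bar[t].view(-1, *([1] * (x0.dim() - 1)))  # 按样本广播 / broadcast per sample
    x_t = a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * eps
    return x_t, eps

def training_step(model, x0, cond_frames, t, alphas_bar):
    """以局部条件帧为条件得到带噪序列的估计值,并与带噪序列计算损失(此处假设为 MSE)。
    Condition on the local condition frames, obtain an estimate of the noisy sequence,
    and compare it with the noisy sequence (MSE assumed here)."""
    x_t, _ = q_sample(x0, t, alphas_bar)
    x_t_hat = model(x_t, t, cond_frames)  # U 型网络给出的估计值 / estimate from the U-Net (assumed signature)
    return F.mse_loss(x_t_hat, x_t)
```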

在一种示例实施例中,该装置还包括训练模块,训练模块具体用于:In an example embodiment, the device further includes a training module, and the training module is specifically used to:

将样本集中的初始视频帧输入扩散模型中的前向过程的加噪公式中逐渐添加高斯噪声,得到带有噪声的视频帧序列;The initial video frames in the sample set are input into the noise-adding formula of the forward process in the diffusion model and Gaussian noise is gradually added to obtain a video frame sequence with noise;

将带有噪声的视频帧序列输入U型网络模型中,以带有噪声的视频帧序列中的真值序列中抽取的至少一个视频帧作为局部条件帧,预测得到第一训练阶段的视频序列;Input the noisy video frame sequence into the U-shaped network model, use at least one video frame extracted from the true value sequence in the noisy video frame sequence as a local condition frame, and predict the video sequence in the first training stage;

以上一时刻预测得到的视频序列中末尾抽取的至少一个视频帧作为局部条件帧,预测下一时刻的视频序列,直至递归预测过程结束,得到第二训练阶段的视频序列,作为带有噪声的视频帧序列的估计值;Use at least one video frame extracted from the end of the video sequence predicted at the previous moment as a local condition frame to predict the video sequence at the next moment until the recursive prediction process ends, obtaining the video sequence of the second training stage as the estimate of the noisy video frame sequence;

基于带有噪声的视频帧序列的估计值和带有噪声的视频帧序列计算损失,并基于损失调整U型网络模型的参数。The loss is calculated based on the estimated value of the noisy video frame sequence and the noisy video frame sequence, and the parameters of the U-shaped network model are adjusted based on the loss.
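两阶段训练中的递归预测环节可以示意如下(假设写法;条件帧数量、片段数量以及模型调用方式均为示例)。The recursive prediction in the two-stage training can be sketched as follows (an assumed formulation; the number of condition frames, the number of clips, and the model call signature are examples only).

```python
import torch

def recursive_predict(model, gt_frames: torch.Tensor, num_clips: int, cond_len: int = 2):
    """两阶段递归预测示意:第一段以真值序列中抽取的帧为局部条件帧,
    其后每段以上一段末尾抽取的帧为局部条件帧。
    Sketch of two-stage recursive prediction: the first clip is conditioned on frames
    taken from the ground-truth sequence, each later clip on frames taken from the end
    of the previously predicted clip."""
    cond = gt_frames[:, :cond_len]           # 第一训练阶段的局部条件帧 / stage-one condition frames
    clips = []
    for _ in range(num_clips):
        clip = model(cond)                   # 预测下一时刻的视频序列 / predict the next clip (B, F, C, H, W)
        clips.append(clip)
        cond = clip[:, -cond_len:]           # 末尾抽取的帧作为新的局部条件帧 / new condition frames from the end
    return torch.cat(clips, dim=1)           # 第二训练阶段得到的完整估计序列 / full estimated sequence
```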

在一种示例实施例中,训练模块还用于:在从带有噪声的视频帧序列中的真值序列中随机抽取至少一个视频帧作为局部条件帧之后,在通道维度上并置局部条件帧的位置编码;位置编码为与局部条件帧尺寸相同的单通道张量,单通道张量中的每个元素为局部条件帧在带有噪声的视频帧序列中的索引值。In an example embodiment, the training module is further configured to: after randomly extracting at least one video frame from the ground-truth sequence in the noisy video frame sequence as a local condition frame, concatenate the position encoding of the local condition frame along the channel dimension; the position encoding is a single-channel tensor with the same size as the local condition frame, and each element of the single-channel tensor is the index value of the local condition frame in the noisy video frame sequence.
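位置编码的构造与通道维并置可以示意如下(假设写法)。The construction and channel-wise concatenation of the position encoding can be sketched as follows (an assumed formulation).

```python
import torch

def add_index_encoding(cond_frame: torch.Tensor, frame_index: int) -> torch.Tensor:
    """构造与条件帧尺寸相同的单通道位置编码(元素为该帧在序列中的索引),
    并在通道维上与条件帧并置。
    Build a single-channel position encoding the same size as the condition frame,
    filled with the frame's index in the sequence, and concatenate it along channels."""
    b, _, h, w = cond_frame.shape            # cond_frame: (B, C, H, W)
    pos = torch.full((b, 1, h, w), float(frame_index),
                     dtype=cond_frame.dtype, device=cond_frame.device)
    return torch.cat([cond_frame, pos], dim=1)   # (B, C+1, H, W)
```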

在一种示例实施例中,第一编码器和解码器均包括两个第一残差块和两个时空注意力模块,第一中间层包括一个第二残差块和一个交叉注意力模块。In an example embodiment, both the first encoder and the decoder include two first residual blocks and two spatiotemporal attention modules, and the first intermediate layer includes one second residual block and one cross-attention module.

在一种示例实施例中,序列编码器包括第二编码器和第二中间层,第二编码器包括两个第三残差块和两个第一注意力模块,第二中间层包括一个第四残差块和一个第二注意力模块。In an example embodiment, the sequence encoder includes a second encoder including two third residual blocks and two first attention modules, and a second intermediate layer. Four residual blocks and a second attention module.

图7示例了一种电子设备的实体结构示意图,如图7所示,该电子设备可以包括:处理器(processor)710、通信接口(Communications Interface)720、存储器(memory)730和通信总线740,其中,处理器710,通信接口720,存储器730通过通信总线740完成相互间的通信。处理器710可以调用存储器730中的逻辑指令,以执行基于扩散模型的视频填充方法,该方法包括:获取训练好的扩散模型;训练好的扩散模型中的U型网络模型包括第一编码器、第一中间层和解码器;第一编码器和解码器中的注意力模块均为时空注意力模块;时空注意力模块的注意力计算维度包括通道维度、宽度维度和高度维度,通道维度为通道数和帧数的乘积所表示的维度;将待填充的视频帧序列输入至训练好的扩散模型中进行视频填充,得到目标视频帧序列。Figure 7 illustrates a schematic diagram of the physical structure of an electronic device. As shown in Figure 7, the electronic device may include: a processor (processor) 710, a communications interface (Communications Interface) 720, a memory (memory) 730, and a communication bus 740. Among them, the processor 710, the communication interface 720, and the memory 730 complete communication with each other through the communication bus 740. The processor 710 can call logical instructions in the memory 730 to perform a video filling method based on a diffusion model. The method includes: obtaining a trained diffusion model; the U-shaped network model in the trained diffusion model includes a first encoder, The first intermediate layer and decoder; the attention modules in the first encoder and decoder are both spatiotemporal attention modules; the attention calculation dimensions of the spatiotemporal attention module include channel dimensions, width dimensions and height dimensions, and the channel dimension is channel The dimension represented by the product of the number and the number of frames; input the video frame sequence to be filled into the trained diffusion model for video filling, and obtain the target video frame sequence.

此外,上述的存储器730中的逻辑指令可以通过软件功能单元的形式实现并作为独立的产品销售或使用时,可以存储在一个计算机可读取存储介质中。基于这样的理解,本发明的技术方案本质上或者说对现有技术做出贡献的部分或者该技术方案的部分可以以软件产品的形式体现出来,该计算机软件产品存储在一个存储介质中,包括若干指令用以使得一台计算机设备(可以是个人计算机,服务器,或者网络设备等)执行本发明各个实施例所述方法的全部或部分步骤。而前述的存储介质包括:U盘、移动硬盘、只读存储器(ROM,Read-Only Memory)、随机存取存储器(RAM,Random Access Memory)、磁碟或者光盘等各种可以存储程序代码的介质。In addition, the above-mentioned logical instructions in the memory 730 can be implemented in the form of software functional units and can be stored in a computer-readable storage medium when sold or used as an independent product. Based on this understanding, the technical solution of the present invention essentially or the part that contributes to the existing technology or the part of the technical solution can be embodied in the form of a software product. The computer software product is stored in a storage medium, including Several instructions are used to cause a computer device (which may be a personal computer, a server, or a network device, etc.) to execute all or part of the steps of the methods described in various embodiments of the present invention. The aforementioned storage media include: U disk, mobile hard disk, read-only memory (ROM, Read-Only Memory), random access memory (RAM, Random Access Memory), magnetic disk or optical disk and other media that can store program code. .

另一方面,本发明还提供一种计算机程序产品,所述计算机程序产品包括计算机程序,计算机程序可存储在非暂态计算机可读存储介质上,所述计算机程序被处理器执行时,计算机能够执行上述各方法所提供的基于扩散模型的视频填充方法,该方法包括:获取训练好的扩散模型;训练好的扩散模型中的U型网络模型包括第一编码器、第一中间层和解码器;第一编码器和解码器中的注意力模块均为时空注意力模块;时空注意力模块的注意力计算维度包括通道维度、宽度维度和高度维度,通道维度为通道数和帧数的乘积所表示的维度;将待填充的视频帧序列输入至训练好的扩散模型中进行视频填充,得到目标视频帧序列。On the other hand, the present invention also provides a computer program product. The computer program product includes a computer program, which can be stored on a non-transitory computer-readable storage medium. When the computer program is executed by a processor, the computer can execute the diffusion model-based video filling method provided by each of the above methods. The method includes: obtaining a trained diffusion model; the U-shaped network model in the trained diffusion model includes a first encoder, a first intermediate layer and a decoder; the attention modules in the first encoder and the decoder are both spatiotemporal attention modules; the attention calculation dimensions of the spatiotemporal attention module include a channel dimension, a width dimension and a height dimension, and the channel dimension is the dimension represented by the product of the number of channels and the number of frames; the video frame sequence to be filled is input into the trained diffusion model for video filling to obtain a target video frame sequence.

又一方面,本发明还提供一种非暂态计算机可读存储介质,其上存储有计算机程序,该计算机程序被处理器执行时实现以执行上述各方法提供的基于扩散模型的视频填充方法,该方法包括:获取训练好的扩散模型;训练好的扩散模型中的U型网络模型包括第一编码器、第一中间层和解码器;第一编码器和解码器中的注意力模块均为时空注意力模块;时空注意力模块的注意力计算维度包括通道维度、宽度维度和高度维度,通道维度为通道数和帧数的乘积所表示的维度;将待填充的视频帧序列输入至训练好的扩散模型中进行视频填充,得到目标视频帧序列。In another aspect, the present invention also provides a non-transitory computer-readable storage medium on which a computer program is stored, which is implemented when executed by a processor to perform the diffusion model-based video filling method provided by the above methods, The method includes: obtaining a trained diffusion model; the U-shaped network model in the trained diffusion model includes a first encoder, a first intermediate layer and a decoder; the attention modules in the first encoder and decoder are both Spatiotemporal attention module; the attention calculation dimensions of the spatiotemporal attention module include channel dimension, width dimension and height dimension. The channel dimension is the dimension represented by the product of the number of channels and the number of frames; input the video frame sequence to be filled into the trained Video filling is performed in the diffusion model to obtain the target video frame sequence.

以上所描述的装置实施例仅仅是示意性的,其中所述作为分离部件说明的单元可以是或者也可以不是物理上分开的,作为单元显示的部件可以是或者也可以不是物理单元,即可以位于一个地方,或者也可以分布到多个网络单元上。可以根据实际的需要选择其中的部分或者全部模块来实现本实施例方案的目的。本领域普通技术人员在不付出创造性的劳动的情况下,即可以理解并实施。The device embodiments described above are only illustrative. The units described as separate components may or may not be physically separated, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Persons of ordinary skill in the art can understand and implement it without creative effort.

通过以上的实施方式的描述,本领域的技术人员可以清楚地了解到各实施方式可借助软件加必需的通用硬件平台的方式来实现,当然也可以通过硬件。基于这样的理解,上述技术方案本质上或者说对现有技术做出贡献的部分可以以软件产品的形式体现出来,该计算机软件产品可以存储在计算机可读存储介质中,如ROM/RAM、磁碟、光盘等,包括若干指令用以使得一台计算机设备(可以是个人计算机,服务器,或者网络设备等)执行各个实施例或者实施例的某些部分所述的方法。Through the above description of the embodiments, those skilled in the art can clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and of course, it can also be implemented by hardware. Based on this understanding, the part of the above technical solution that essentially contributes to the existing technology can be embodied in the form of a software product. The computer software product can be stored in a computer-readable storage medium, such as ROM/RAM, magnetic disk, optical disk, etc., including a number of instructions to cause a computer device (which can be a personal computer, a server, or a network device, etc.) to execute the methods described in various embodiments or certain parts of the embodiments.

最后应说明的是:以上实施例仅用以说明本发明的技术方案,而非对其限制;尽管参照前述实施例对本发明进行了详细的说明,本领域的普通技术人员应当理解:其依然可以对前述各实施例所记载的技术方案进行修改,或者对其中部分技术特征进行等同替换;而这些修改或者替换,并不使相应技术方案的本质脱离本发明各实施例技术方案的精神和范围。Finally, it should be noted that the above embodiments are only intended to illustrate the technical solutions of the present invention, not to limit them; although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments can still be modified, or some of their technical features can be equivalently replaced; and such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims (10)

Translated from Chinese
1. A video filling method based on a diffusion model, characterized by comprising:
obtaining a trained diffusion model, wherein the U-shaped network model in the trained diffusion model comprises a first encoder, a first intermediate layer and a decoder; the attention modules in the first encoder and the decoder are both spatiotemporal attention modules; the attention calculation dimensions of the spatiotemporal attention module comprise a channel dimension, a width dimension and a height dimension, the channel dimension being the dimension represented by the product of the number of channels and the number of frames;
inputting a video frame sequence to be filled into the trained diffusion model for video filling to obtain a target video frame sequence.

2. The video filling method based on a diffusion model according to claim 1, wherein the diffusion model further comprises a sequence encoder, and the attention module in the first intermediate layer is a cross-attention module; during video filling, the method further comprises:
inputting all output encodings of the video sequence predicted at the previous moment, as global features, into the sequence encoder for encoding to obtain a feature map;
inputting the feature map into the cross-attention module, and predicting the video sequence at the next moment with at least one video frame extracted from the end of the video sequence predicted at the previous moment as a local condition frame.

3. The video filling method based on a diffusion model according to claim 2, wherein the trained diffusion model is trained by:
inputting initial video frames in a sample set into the noise-adding formula of the forward process of the diffusion model and gradually adding Gaussian noise to obtain a noisy video frame sequence;
inputting the noisy video frame sequence into the U-shaped network model, and predicting the video sequence at the next moment with at least one video frame extracted from the end of the video sequence predicted at the previous moment as a local condition frame, until the recursive prediction process ends, to obtain an estimate of the noisy video frame sequence;
calculating a loss based on the estimate of the noisy video frame sequence and the noisy video frame sequence, and adjusting parameters of the U-shaped network model based on the loss.

4. The video filling method based on a diffusion model according to claim 2, wherein the trained diffusion model is trained by:
inputting initial video frames in a sample set into the noise-adding formula of the forward process in the diffusion model and gradually adding Gaussian noise to obtain a noisy video frame sequence;
inputting the noisy video frame sequence into the U-shaped network model, and predicting the video sequence of the first training stage with at least one video frame extracted from the ground-truth sequence in the noisy video frame sequence as a local condition frame;
predicting the video sequence at the next moment with at least one video frame extracted from the end of the video sequence predicted at the previous moment as a local condition frame, until the recursive prediction process ends, to obtain the video sequence of the second training stage as the estimate of the noisy video frame sequence;
calculating a loss based on the estimate of the noisy video frame sequence and the noisy video frame sequence, and adjusting parameters of the U-shaped network model based on the loss.

5. The video filling method based on a diffusion model according to claim 3 or 4, wherein the method further comprises:
after randomly extracting at least one video frame from the ground-truth sequence in the noisy video frame sequence as a local condition frame, concatenating the position encoding of the local condition frame along the channel dimension; the position encoding is a single-channel tensor with the same size as the local condition frame, and each element of the single-channel tensor is the index value of the local condition frame in the noisy video frame sequence.

6. The video filling method based on a diffusion model according to claim 2, wherein the first encoder and the decoder each comprise two first residual blocks and two spatiotemporal attention modules, and the first intermediate layer comprises one second residual block and one cross-attention module.

7. The video filling method based on a diffusion model according to claim 2, wherein the sequence encoder comprises a second encoder and a second intermediate layer, the second encoder comprises two third residual blocks and two first attention modules, and the second intermediate layer comprises one fourth residual block and one second attention module.

8. A video filling device based on a diffusion model, characterized by comprising:
an acquisition module, configured to obtain a trained diffusion model, wherein the U-shaped network model in the trained diffusion model comprises an encoder, an intermediate layer and a decoder; the attention modules in the encoder and the decoder are both spatiotemporal attention modules; the attention calculation dimensions of the spatiotemporal attention module comprise a channel dimension, a width dimension and a height dimension, the channel dimension being the dimension represented by the product of the number of channels and the number of frames;
a filling module, configured to input a video frame sequence to be filled into the trained diffusion model for video filling to obtain a target video frame sequence.

9. An electronic device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein when the processor executes the program, the video filling method based on a diffusion model according to any one of claims 1 to 7 is implemented.

10. A non-transitory computer-readable storage medium on which a computer program is stored, wherein when the computer program is executed by a processor, the video filling method based on a diffusion model according to any one of claims 1 to 7 is implemented.
CN202311020072.8A | 2023-08-14 | 2023-08-14 | Video filling method, device, equipment and storage medium based on diffusion model | Pending | CN117314733A (en)

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN202311020072.8A | 2023-08-14 | 2023-08-14 | Video filling method, device, equipment and storage medium based on diffusion model

Applications Claiming Priority (1)

Application Number | Priority Date | Filing Date | Title
CN202311020072.8A | 2023-08-14 | 2023-08-14 | Video filling method, device, equipment and storage medium based on diffusion model

Publications (1)

Publication Number | Publication Date
CN117314733A (en) | 2023-12-29

Family

ID=89280128

Family Applications (1)

Application Number | Title | Priority Date | Filing Date
CN202311020072.8A (Pending, CN117314733A (en)) | Video filling method, device, equipment and storage medium based on diffusion model | 2023-08-14 | 2023-08-14

Country Status (1)

Country | Link
CN (1) | CN117314733A (en)



Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
US20230067841A1 (en)* | 2021-08-02 | 2023-03-02 | Google LLC | Image Enhancement via Iterative Refinement based on Machine Learning Models
CN114598833A (en)* | 2022-03-25 | 2022-06-07 | 西安电子科技大学 | Video frame insertion method based on spatiotemporal joint attention
CN116402987A (en)* | 2023-04-20 | 2023-07-07 | 中国科学院深圳先进技术研究院 | Three-dimensional segmentation method, system, equipment and medium based on diffusion model and 3D Transformer
CN116233491A (en)* | 2023-05-04 | 2023-06-06 | 阿里巴巴达摩院(杭州)科技有限公司 | Video generation method and server

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
SIYUAN YANG et al.: "Video Diffusion Models with Local-Global Context Guidance", arXiv:2306.02562v1, 5 June 2023, pages 1-8 *
VIKRAM VOLETI et al.: "MCVD: Masked Conditional Video Diffusion for Prediction, Generation, and Interpolation", arXiv:2205.09853v4, 12 October 2022, pages 1-25 *

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN119152060A (en)* | 2024-08-30 | 2024-12-17 | 摩尔线程智能科技(成都)有限责任公司 | Model training method, video generating method, device and storage medium
CN118784939A (en)* | 2024-09-09 | 2024-10-15 | 浙江大学 | A controllable generative video frame insertion method based on diffusion model
CN118984414A (en)* | 2024-10-22 | 2024-11-19 | 星凡星启(成都)科技有限公司 | Video generation method, device and equipment based on diffusion model
CN118984414B (en)* | 2024-10-22 | 2025-01-24 | 星凡星启(成都)科技有限公司 | Video generation method, device and equipment based on diffusion model
CN119169205A (en)* | 2024-11-20 | 2024-12-20 | 浙江科技大学 | Potential space human motion generation method based on diffusion model
CN119316668A (en)* | 2024-12-17 | 2025-01-14 | 浙江大学 | Text-driven zero-sample 6-DOF video editing method and system
CN119316668B (en)* | 2024-12-17 | 2025-03-21 | 浙江大学 | Text-driven zero-sample 6-DOF video editing method and system
CN120529150A (en)* | 2025-07-24 | 2025-08-22 | 东北大学 | Diffusion model video generation method integrating dynamic perception


Legal Events

Code | Title
PB01 | Publication
SE01 | Entry into force of request for substantive examination
RJ01 | Rejection of invention patent application after publication

Application publication date: 2023-12-29

