
Video transcoding method, device, medium and equipment based on geometric generative model

Info

Publication number
CN113344786B
Authority
CN
China
Prior art keywords
geometric
video
layer
resolution
input
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110652621.8A
Other languages
Chinese (zh)
Other versions
CN113344786A (en)
Inventor
刘文顺
孙季丰
赵帅
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
South China University of Technology SCUT
Original Assignee
South China University of Technology SCUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by South China University of Technology SCUT
Priority to CN202110652621.8A
Publication of CN113344786A
Application granted
Publication of CN113344786B
Legal status: Active (Current)
Anticipated expiration


Abstract

The invention discloses a video transcoding method, device, medium, and equipment based on a geometric generative model, comprising: constructing a geometric generative model and training it to obtain a super-resolution reconstruction model; obtaining an MPEG-4 video; decoding the MPEG-4 video and saving the result as a sequence of still images; performing super-resolution enlargement and reconstruction on each decoded frame with the super-resolution reconstruction model; and encoding the reconstructed images into an H.265 video. In the method of the present invention, the geometric generative model resolves the mode-collapse problem of generative adversarial networks from a geometric point of view and avoids adversarial learning, so that the generated images are richer and more realistic.

Description

Video transcoding method, device, medium, and equipment based on a geometric generative model

Technical Field

The present invention relates to video transcoding methods, and in particular to a video transcoding method, device, medium, and equipment based on a geometric generative model.

Background Art

With the widespread adoption of the Internet, the network has become the main way we obtain information in daily life. Industries such as online video and live streaming have risen rapidly, users demand ever higher video quality, and the need for high-definition and even ultra-high-definition video keeps growing. In recent years, domestic 4K and 8K video devices have been embraced by the public, yet there are not enough ultra-high-definition video resources online, so most of the time these devices can only play ordinary high-definition video and cannot exploit their hardware. The shortage of ultra-high-definition content has two causes: on one hand, production costs are high, so supply cannot keep up with public demand; on the other hand, limited network transmission bandwidth makes storing and distributing ultra-high-definition video difficult. To address the high production cost, video super-resolution technology can reconstruct existing standard-definition video into the ultra-high-definition video we need; to address the storage and distribution difficulty, better video coding methods are needed to compress video and reduce bandwidth usage. Most video on the Internet today is compressed under the MPEG and H.264 standards, but with the rise of ultra-high-definition and even 4K and 8K video, higher compression ratios are needed to cut storage and transmission bandwidth costs. The latest video coding standard, H.265, was finalized by the video coding experts group in 2012; compared with H.264, it adopts a higher compression ratio, roughly halving the network transmission bandwidth at essentially unchanged picture quality, and it supports resolutions up to 8K, so H.265 is of great significance for storing and transmitting ultra-high-definition video.

In the prior art, restoring high-definition or ultra-high-definition video from lower-resolution sources usually requires video transcoding, and existing transcoding techniques typically rely on interpolation; however, traditional interpolation clearly falls short in quality and suffers from low transcoding efficiency. Deep-learning-based single-image super-resolution (SISR) instead uses a convolutional neural network (CNN) as the learning model and learns, from large amounts of data, the high-frequency information such as texture detail that low-resolution images lack, achieving an end-to-end mapping from low-resolution to high-resolution images. Compared with traditional interpolation, deep learning shows great advantages and yields significant improvements on evaluation metrics such as PSNR (peak signal-to-noise ratio) and SSIM (structural similarity). Among these methods, super-resolution based on generative adversarial networks (GANs) can produce more realistic and sharper textures and is therefore widely used in image generation, for example super-resolution, image translation, and human pose generation. In practice, however, the discriminator D and generator G must be trained adversarially; if the data distribution has multiple clusters or lies on several isolated manifolds, the generator struggles to learn all the modes, which easily leads to GAN mode collapse and convergence difficulties.

Summary of the Invention

The first object of the present invention is to overcome the shortcomings and deficiencies of the prior art by providing a video transcoding method based on a geometric generative model, in which the geometric generative model resolves the mode-collapse problem of generative adversarial networks from a geometric point of view and avoids adversarial learning, so that the generated images are richer and more realistic.

The second object of the present invention is to provide a video transcoding device based on a geometric generative model.

The third object of the present invention is to provide a storage medium.

The fourth object of the present invention is to provide a computing device.

The first object of the present invention is achieved by the following technical solution: a video transcoding method based on a geometric generative model, the steps comprising:

constructing a geometric generative model and training it to obtain a super-resolution reconstruction model;

obtaining an MPEG-4 video;

decoding the MPEG-4 video and saving the result as a sequence of still images;

performing super-resolution enlargement and reconstruction on each decoded frame with the super-resolution reconstruction model;

encoding the super-resolution enlarged and reconstructed images into an H.265 video.

Preferably, the constructed geometric generative model comprises an encoding part and a decoding part;

the encoding part comprises two downsampling layers connected in sequence, a first downsampling layer and a second downsampling layer;

the decoding part comprises a first upsampling layer, a second upsampling layer, and a convolutional layer; the upsampling in both upsampling layers is performed by sub-pixel convolution;

wherein:

the features input to the geometric generative model are fed to the first downsampling layer;

the output features of the first downsampling layer serve as the input of the second downsampling layer, and the output of the second downsampling layer is the output of the encoding part;

the features input to the decoding part pass through a channel attention residual block and are then fed to the first upsampling layer;

the features output by the first downsampling layer and the features output by the first upsampling layer are concatenated along the channel dimension, passed through a channel attention residual block, and used as the input of the second upsampling layer;

the features output by the second upsampling layer and the features input to the geometric generative model are concatenated along the channel dimension as the input of the convolutional layer, whose output is the output of the geometric generative model.
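For illustration, the topology described above can be sketched in PyTorch as follows. Only the wiring (two downsampling layers, a channel attention residual block before each sub-pixel upsampling layer, channel-dimension concatenations, and a final convolution) is taken from the text; the channel widths, kernel sizes, strides, and the RCAB-style internals of the attention block are assumptions, and the patent's Figure 2 may differ in these details.

import torch
import torch.nn as nn

class ChannelAttentionResidualBlock(nn.Module):
    # RCAB-style block: two convolutions plus a squeeze-and-excitation gate.
    # The reduction ratio and kernel sizes are assumptions, not from the patent.
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1))
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, 1), nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1), nn.Sigmoid())

    def forward(self, x):
        f = self.body(x)
        return x + f * self.gate(f)

class GeometricGenerator(nn.Module):
    def __init__(self, c=64):
        super().__init__()
        # Encoding part: two downsampling layers (strided convolutions assumed).
        self.down1 = nn.Sequential(nn.Conv2d(3, c, 3, 2, 1), nn.ReLU(inplace=True))
        self.down2 = nn.Sequential(nn.Conv2d(c, c, 3, 2, 1), nn.ReLU(inplace=True))
        # Decoding part: attention blocks and sub-pixel (PixelShuffle) upsampling.
        self.rcab1 = ChannelAttentionResidualBlock(c)
        self.up1 = nn.Sequential(nn.Conv2d(c, 4 * c, 3, 1, 1), nn.PixelShuffle(2))
        self.rcab2 = ChannelAttentionResidualBlock(2 * c)
        self.up2 = nn.Sequential(nn.Conv2d(2 * c, 4 * c, 3, 1, 1), nn.PixelShuffle(2))
        self.out_conv = nn.Conv2d(c + 3, 3, 3, 1, 1)

    def forward(self, x):                    # x: bicubically pre-upsampled image
        d1 = self.down1(x)                   # first downsampling layer output
        z = self.down2(d1)                   # encoding-part output (latent features)
        u1 = self.up1(self.rcab1(z))         # attention block, then first upsampling
        u1 = torch.cat([d1, u1], dim=1)      # concatenate with down1 output
        u2 = self.up2(self.rcab2(u1))        # attention block, then second upsampling
        return self.out_conv(torch.cat([x, u2], dim=1))  # concatenate with the input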

Preferably, in the geometric generative model:

first, the encoding part maps the input data distribution ν to the latent space Z, obtaining the latent feature distribution μ:

μ = f_θ : Σ → Z

where Σ denotes a submanifold of the input data distribution ν, f_θ denotes the encoding map, and θ is the parameter to be learned;

then the optimal transport map T : Z → Z of the latent distribution μ is computed, transforming the uniform distribution ζ into the latent distribution μ:

T : Z → Z, ζ → μ

where the optimal transport map T of the latent distribution is computed by convex optimization;

finally, the distribution obtained after the transformation T is fed into the decoding part, which generates the final high-resolution image.
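The latent-space transport step can be illustrated with a minimal Monte-Carlo sketch of semi-discrete optimal transport. The text states only that T is obtained by convex optimization; the gradient scheme, the sample counts, and the [0, 1]^d source domain below are assumptions. The returned map is the piecewise-constant map that sends each source point to the latent code of the power cell it falls in.

import torch

def fit_brenier_heights(latents, steps=2000, batch=4096, lr=0.1):
    # Brenier potential u_h(x) = max_i(<x, y_i> + h_i); at the optimum each
    # power cell carries the mass 1/n of its latent code y_i. Monte-Carlo
    # gradient descent on the convex transport energy.
    n, d = latents.shape
    target = torch.full((n,), 1.0 / n)        # empirical latent distribution
    h = torch.zeros(n)
    for _ in range(steps):
        x = torch.rand(batch, d)              # uniform source samples on [0, 1]^d
        cells = (x @ latents.T + h).argmax(dim=1)
        mass = torch.bincount(cells, minlength=n).float() / batch
        h -= lr * (mass - target)             # gradient step on the convex energy
    return h

def transport(x, latents, h):
    # Piecewise-constant optimal map T: each source point is sent to the
    # latent code whose power cell contains it.
    return latents[(x @ latents.T + h).argmax(dim=1)]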

Further, the training process of the geometric generative model is as follows:

obtaining low-resolution images whose corresponding high-resolution images are known, as training samples;

upsampling the low-resolution training images with bicubic interpolation to obtain images at the target resolution;

using the target-resolution image corresponding to each low-resolution image as the input of the geometric generative model and the corresponding high-resolution image as the label image, training the geometric generative model to obtain the super-resolution reconstruction model.
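A minimal training-loop sketch of this procedure is shown below, assuming the GeometricGenerator sketch given earlier, a plain Charbonnier objective (the full edge-aware loss is sketched after the edge-detection discussion below), and an hr_loader that yields batches of high-resolution label images; the optimizer and learning rate are likewise assumptions.

import torch
import torch.nn.functional as F

def make_training_pair(hr, scale=4):
    # The LR image is bicubically upsampled back to the target resolution
    # before entering the model, as described above.
    lr = F.interpolate(hr, scale_factor=1 / scale, mode='bicubic', align_corners=False)
    inp = F.interpolate(lr, scale_factor=scale, mode='bicubic', align_corners=False)
    return inp, hr

model = GeometricGenerator()                          # sketch given earlier
opt = torch.optim.Adam(model.parameters(), lr=1e-4)   # optimizer and lr assumed
for hr in hr_loader:                                  # HR label batches (assumed loader)
    inp, label = make_training_pair(hr)
    sr = model(inp)
    loss = torch.sqrt((sr - label) ** 2 + 1e-3 ** 2).mean()   # Charbonnier term
    opt.zero_grad()
    loss.backward()
    opt.step()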

Further, edge-aware loss processing is performed during the training of the geometric generative model, specifically:

first performing edge detection on the label image to obtain the edge value at each position;

when computing the training error, assigning higher weight to edge positions.

Further, when performing edge detection on the label image, a Laplacian filter is used as the edge-detection operator. Combining the edge-aware loss with the Charbonnier loss, the training loss function L_final finally used is:

L_final = L_Charbonnier + λ||M * (I_SR - I_HR)||;

where L_Charbonnier is the Charbonnier loss, a variant of the L1 loss, specifically:

L_Charbonnier = √((I_SR - I_HR)² + ε²)

ε is a constant, and λ is a scaling coefficient that balances the Charbonnier loss and the edge-aware loss;

||M * (I_SR - I_HR)|| is the edge-aware loss, where I_SR denotes the high-resolution image reconstructed by the geometric generative model and I_HR denotes the label image;

when performing edge detection on the label image, the edge value of each position is obtained, and a threshold δ is then set to mark whether each position is an edge position, positions whose edge value exceeds the threshold being marked 1 and the others 0:

M(i, j) = 1 if L(i, j) > δ, and M(i, j) = 0 otherwise,

where L(i, j) is the edge value obtained by Laplacian edge detection at position (i, j) of the label image, M(i, j) is the edge mark at position (i, j), and M is the matrix formed by the M(i, j).
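Putting these pieces together, the loss described here can be sketched as follows. The 4-neighborhood Laplacian kernel, λ = 0.1, and δ = 0.1 follow the text; the luminance approximation and ε = 1e-3 are assumptions (the text leaves ε as an unspecified constant).

import torch
import torch.nn.functional as F

LAPLACIAN_4 = torch.tensor([[0., 1., 0.],
                            [1., -4., 1.],
                            [0., 1., 0.]]).view(1, 1, 3, 3)

def edge_mask(label, delta=0.1):
    # M(i, j) = 1 where the Laplacian edge value of the label exceeds delta.
    gray = label.mean(dim=1, keepdim=True)            # luminance approximation (assumed)
    edges = F.conv2d(gray, LAPLACIAN_4, padding=1).abs()
    return (edges > delta).float()

def edge_aware_loss(sr, hr, eps=1e-3, lam=0.1):
    # L_final = L_Charbonnier + lambda * ||M * (I_SR - I_HR)||
    charbonnier = torch.sqrt((sr - hr) ** 2 + eps ** 2).mean()
    edge_term = (edge_mask(hr) * (sr - hr)).abs().mean()
    return charbonnier + lam * edge_term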

Further, the Laplacian operator ∇²f is computed as:

∇²f = f(x-1, y) + f(x+1, y) + f(x, y-1) + f(x, y+1) - 4f(x, y)

where (x, y) denotes the current image position, f(x, y) the gray value at (x, y), f(x-1, y) the gray value at (x-1, y), f(x+1, y) the gray value at (x+1, y), f(x, y-1) the gray value at (x, y-1), and f(x, y+1) the gray value at (x, y+1).

The second object of the present invention is achieved by the following technical solution: a video transcoding device based on a geometric generative model, comprising:

a model construction module for constructing the geometric generative model;

a model training module for training the constructed geometric generative model to obtain the super-resolution reconstruction model;

an acquisition module for obtaining an MPEG-4 video;

a video decoding module for decoding the MPEG-4 video and saving the result as a sequence of still images;

a video reconstruction module for performing super-resolution enlargement and reconstruction on each decoded frame through the super-resolution reconstruction model;

a video encoding module for encoding the super-resolution enlarged and reconstructed images into an H.265 video.

The third object of the present invention is achieved by the following technical solution: a storage medium storing a program which, when executed by a processor, implements the video transcoding method based on the geometric generative model described in Embodiment 1.

The fourth object of the present invention is achieved by the following technical solution: a computing device comprising a processor and a memory for storing a program executable by the processor; when the processor executes the program stored in the memory, the video transcoding method based on the geometric generative model described in Embodiment 1 is implemented.

Compared with the prior art, the present invention has the following advantages and effects:

(1) In the video transcoding method of the present invention based on a geometric generative model, a geometric generative model is first constructed and trained to obtain a super-resolution reconstruction model; the MPEG-4 video to be transcoded is obtained; the MPEG-4 video is decoded and saved as a sequence of still images; each decoded frame is super-resolution enlarged and reconstructed by the super-resolution reconstruction model; and the reconstructed images are encoded into an H.265 video. In the method of the present invention, image super-resolution reconstruction is carried out from a geometric point of view by the geometric generative model, which avoids the adversarial learning process of adversarial networks, makes training more stable, and solves the mode-collapse problem of generative adversarial networks, so that the generated images are richer and more realistic. With the method of the present invention, low-resolution MPEG-4 video can be conveniently converted into high-resolution H.265 video; H.265 adopts a higher compression ratio that roughly halves the network transmission bandwidth at essentially unchanged picture quality, and the H.265 standard supports resolutions up to 8K, which greatly benefits the storage and transmission of ultra-high-definition video.

(2) In the video transcoding method of the present invention, after the MPEG-4 video is decoded, the super-resolution reconstruction model enlarges and reconstructs each frame; the magnification can be four times. Compared with images reconstructed by the traditional interpolation methods in common use, the deep-learning super-resolution reconstruction model of the present invention achieves a better enlargement effect, and the reconstructed information is finally added back into the encoded image, ensuring that the image information after encoding and decoding does not deviate.

(3) In the video transcoding method of the present invention, the geometric generative model comprises an encoding part and a decoding part; the encoding part comprises two downsampling layers connected in sequence, a first and a second downsampling layer; the decoding part comprises a first upsampling layer, a second upsampling layer, and a convolutional layer, with the upsampling in both upsampling layers performed by sub-pixel convolution. After an image is input into the geometric generative model, its features are compressed by the two downsampling operations of the encoding part and then restored by the upsampling operations of the first and second upsampling layers of the decoding part; before each upsampling, a channel attention residual block adjusts the feature maps, improving the feature-extraction capability of the model. In addition, implementing the first and second upsampling layers of the decoding part with sub-pixel convolution effectively reduces checkerboard artifacts, as the following sketch illustrates.
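As a brief illustration of the sub-pixel convolution mentioned here (channel width and scale factor are arbitrary choices, not taken from the patent):

import torch
import torch.nn as nn

r = 2                                            # upscale factor (arbitrary here)
upsample = nn.Sequential(nn.Conv2d(64, 64 * r * r, 3, padding=1), nn.PixelShuffle(r))
x = torch.randn(1, 64, 32, 32)
print(upsample(x).shape)                         # torch.Size([1, 64, 64, 64])
# The convolution expands channels by r**2 and PixelShuffle rearranges
# (B, C*r*r, H, W) -> (B, C, H*r, W*r); since every output pixel comes from
# the same convolution, the uneven kernel overlap that causes checkerboard
# artifacts in transposed convolutions never occurs.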

(4) In the video transcoding method of the present invention, the features output by the encoding part undergo a latent-feature-space distribution transformation that converts the latent feature vectors, and the transformed feature vectors are finally fed into the decoding part. In this image super-resolution method based on a geometric generative model, the encoding and decoding maps between the data manifold and the latent space are implemented with an Encoder-Decoder structure, and in the latent space a geometric method, namely convex optimization, computes the Brenier potential function to obtain the optimal transport map, which maps a uniform distribution onto the distribution of the data in the latent space. This learning model is extremely simple, the theory of the probability-transformation part is transparent, and the optimized energy is convex, guaranteeing the existence, uniqueness, and numerical stability of the optimal transport map. The black box of that part of the geometric generative model's network thus becomes transparent, yielding a semi-transparent generative model, and the approximation order from the discrete solution to the smooth solution is also theoretically guaranteed.

(5) In the video transcoding method of the present invention, to mitigate the problem of blurred edges, an edge-aware loss is adopted when training the geometric generative model: edge detection is first performed on the label image (the real image) to obtain the edge information at each position, and these edge positions are then given higher weight when the training error is computed, so that the network pays more attention to these edge features during training.

Brief Description of the Drawings

Figure 1 is a flowchart of the method of the present invention.

Figure 2 is a structural diagram of the geometric generative model in the method of the present invention.

Figure 3 is a schematic diagram of the spatial distribution transformation process of the geometric generative model in the method of the present invention.

Detailed Description of the Embodiments

The present invention is described in further detail below with reference to the embodiments and the accompanying drawings, but the embodiments of the present invention are not limited thereto.

Embodiment 1

This embodiment discloses a video transcoding method based on a geometric generative model, which reconstructs common standard-definition MPEG-4 video through video super-resolution and then encodes it into H.265 ultra-high-definition video, facilitating the transmission and storage of ultra-high-definition video. As shown in Figure 1, the steps of the method include:

S1. Construct the geometric generative model. In this embodiment, as shown in Figure 2, the constructed geometric generative model is an improved Encoder-Decoder structure comprising an encoding part and a decoding part, wherein:

the encoding part comprises two downsampling layers connected in sequence, a first downsampling layer and a second downsampling layer;

the decoding part comprises a first upsampling layer, a second upsampling layer, and a convolutional layer; the upsampling in both upsampling layers is performed by sub-pixel convolution;

and wherein:

(1) The features input to the geometric generative model are fed to the first downsampling layer. In this embodiment, the input to the geometric generative model is a low-resolution image upsampled to the target resolution by bicubic interpolation.

(2) The output features of the first downsampling layer serve as the input of the second downsampling layer, and the output of the second downsampling layer is the output of the encoding part.

(3) The features input to the decoding part have their feature maps adjusted by a channel attention residual block and are then fed to the first upsampling layer.

(4) The features output by the first downsampling layer and the features output by the first upsampling layer are concatenated along the channel dimension, have their feature maps adjusted by a channel attention residual block (RACB), and are then fed to the second upsampling layer.

(5) The features output by the second upsampling layer and the features input to the geometric generative model are concatenated along the channel dimension as the input of the convolutional layer, whose output is the output of the geometric generative model.

As shown in Figure 3, in the geometric generative model of this embodiment, the encoding part maps the input data distribution ν to the latent space Z, obtaining the latent feature distribution μ:

μ = f_θ : Σ → Z

where Σ denotes a submanifold of the input data distribution ν, f_θ denotes the encoding map, and θ is the parameter to be learned. We then compute the optimal transport map T : Z → Z of the latent distribution μ, which transforms the uniform distribution ζ into the latent distribution μ:

T : Z → Z, ζ → μ

The optimal transport map T of the latent distribution can be computed with a transparent geometric method, namely convex optimization. In this way, manifold dimensionality reduction and probability transformation are separated, and the black box is partially replaced with a transparent optimal transport model, yielding a semi-transparent network model. Finally, the distribution obtained after the transformation T is fed into the decoding part to generate the final high-resolution image.

S2. Train the geometric generative model to obtain the super-resolution reconstruction model. The specific training process is as follows:

S21. Obtain low-resolution images whose corresponding high-resolution images are known, as training samples.

In this embodiment, the DIV2K dataset, commonly used in the super-resolution field, forms the training sample set; it contains 800 training image pairs of low-resolution and high-resolution images and 100 image pairs for validation.

S22. Upsample the low-resolution training images with bicubic interpolation to obtain images at the target resolution.

S23. Use the target-resolution image corresponding to each low-resolution image as the input of the geometric generative model and the corresponding high-resolution image as the label image, and train the geometric generative model to obtain the super-resolution reconstruction model.

When the training in this step is complete, the encoding part and decoding part of the Encoder-Decoder are obtained, and the latent vector between the encoding part and the decoding part is the latent feature distribution of the image.

In this step, edge-aware loss processing is performed during the training of the geometric generative model, specifically:

(1) First perform edge detection on the label image to obtain the edge value at each position. In this embodiment, a Laplacian filter is used as the edge-detection operator. The Laplacian is a high-pass linear filter based on image derivatives that measures the curvature of the image function through second derivatives. For an image, a larger derivative of the pixel values means a sharper gray-level change, i.e., an image edge; at the edge positions where these first derivatives reach an extremum, the second derivative is 0, and the Laplacian operator exploits this property to detect edge positions. Since an image is two-dimensional, second-order partial derivatives must be taken in both the x and y directions:

∇²f = ∂²f/∂x² + ∂²f/∂y²

To better suit discrete digital images, its second-order difference form is commonly used:

∂²f/∂x² ≈ f(x+1, y) + f(x-1, y) - 2f(x, y)

∂²f/∂y² ≈ f(x, y+1) + f(x, y-1) - 2f(x, y)

Rearranging gives the computation formula of the Laplacian operator:

∇²f = f(x-1, y) + f(x+1, y) + f(x, y-1) + f(x, y+1) - 4f(x, y)

that is, the sum of the gray values of the four neighbors minus four times the gray value of the current position, which can be written as a template matrix:

0   1   0
1  -4   1
0   1   0

If the two diagonal directions are also considered, another, 8-neighborhood template is obtained:

1   1   1
1  -8   1
1   1   1

where (x, y) denotes the current image position, f(x, y) the gray value at (x, y), f(x-1, y) the gray value at (x-1, y), f(x+1, y) the gray value at (x+1, y), f(x, y-1) the gray value at (x, y-1), and f(x, y+1) the gray value at (x, y+1).
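A small numerical check of the two templates, run with SciPy (the test image is an arbitrary step edge, not from the patent):

import numpy as np
from scipy.ndimage import convolve

lap4 = np.array([[0, 1, 0],
                 [1, -4, 1],
                 [0, 1, 0]])
lap8 = np.array([[1, 1, 1],
                 [1, -8, 1],
                 [1, 1, 1]])

# Tiny test image with a vertical step edge between columns 1 and 2.
img = np.array([[10., 10., 50., 50.],
                [10., 10., 50., 50.],
                [10., 10., 50., 50.],
                [10., 10., 50., 50.]])

print(convolve(img, lap4, mode='nearest'))   # nonzero only along the step edge
print(convolve(img, lap8, mode='nearest'))   # same edge-localized structure
# Flat regions give 0, so thresholding the absolute response yields the mask M.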

In this embodiment, when performing edge detection on the label image, the edge value at each position is obtained, and a threshold δ is then set to mark whether each position is an edge position, positions whose edge value exceeds the threshold being marked 1 and the others 0. Specifically:

M(i, j) = 1 if L(i, j) > δ, and M(i, j) = 0 otherwise,

where L(i, j) is the edge value obtained by Laplacian edge detection at position (i, j) of the label image and M(i, j) is the edge mark at position (i, j); the threshold δ may be taken as 0.1.

(2) When computing the training error, edge positions are given higher weight than the other positions in the label image, so that the network pays more attention to these edge features during training.

In this embodiment, combining the edge-aware loss with the Charbonnier loss, the training loss function L_final finally used is:

L_final = L_Charbonnier + λ||M * (I_SR - I_HR)||;

where L_Charbonnier is the Charbonnier loss, a variant of the L1 loss, specifically:

L_Charbonnier = √((I_SR - I_HR)² + ε²)

ε is a constant, and λ is a scaling coefficient that balances the Charbonnier loss and the edge-aware loss, i.e., a weight coefficient, generally taken as 0.1.

||M * (I_SR - I_HR)|| is the edge-aware loss, where I_SR denotes the high-resolution image reconstructed by the geometric generative model, I_HR denotes the label image, and M is the matrix formed by the M(i, j).

S3. Obtain the MPEG-4 video, decode it, and save the result as a sequence of still images.

Decoding a video is the inverse of encoding it. For the MPEG-4 codec, encoding applies the DCT transform and quantization to image blocks in sequence, while decoding first dequantizes and then applies the inverse DCT transform. Finally, the reconstructed information is added back into the encoded image, ensuring that the image information after encoding and decoding does not deviate.
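The block pipeline described here can be illustrated with a round trip on one 8x8 block. The flat quantization step below is a simplification of MPEG-4's quantization matrices, used only to show the lossy encode/decode structure (DCT plus quantization, then dequantization plus inverse DCT):

import numpy as np
from scipy.fft import dctn, idctn

def block_roundtrip(block, q=16):
    # Encoding: forward DCT then quantization; decoding: dequantization then
    # inverse DCT, mirroring the MPEG-4 block pipeline described above.
    coeffs = dctn(block, norm='ortho')
    quantized = np.round(coeffs / q)        # the lossy step
    return idctn(quantized * q, norm='ortho')

block = np.random.default_rng(0).uniform(0, 255, (8, 8))
rec = block_roundtrip(block)
print(np.abs(block - rec).max())            # small but nonzero: the coding is lossy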

In this step, the multimedia framework ffmpeg can be used to decode the MPEG-4 video; the specific command is:

ffmpeg -i xx.y4m -vsync 0 xx%3d.bmp -y

S4. For each frame obtained from decoding the MPEG-4 video, perform super-resolution enlargement and reconstruction with the super-resolution reconstruction model.

In this embodiment, the super-resolution reconstruction model obtained through steps S1 and S2 enlarges each frame by four times. Before being input into the super-resolution reconstruction model, each frame is first upsampled by bicubic interpolation to an image of the target resolution and then fed into the model.

S5. Encode the super-resolution enlarged and reconstructed images into an H.265 video. Since the enlarged, reconstructed frames have a high resolution that is unsuited to storage and transmission, this embodiment compresses them with the H.265 encoding method to reduce redundant information. In this step, the encoding process can be carried out with the multimedia framework ffmpeg; the specific command is:

ffmpeg -i xx%3d.bmp -pix_fmt yuv420p -vsync 0 xx.y4m -y -vcodec libx265
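For illustration, steps S3 to S5 can be strung together around the two ffmpeg commands above in a short Python sketch; here model stands for a trained super-resolution reconstruction model (for example, the GeometricGenerator sketch above with trained weights), and the file-name pattern follows the commands in the text:

import glob
import subprocess
import torch
from PIL import Image
from torchvision.transforms.functional import to_tensor, to_pil_image

# Step S3: decode the MPEG-4 video into numbered BMP frames.
subprocess.run(['ffmpeg', '-i', 'xx.y4m', '-vsync', '0', 'xx%3d.bmp', '-y'], check=True)

model.eval()                                     # trained SR model (assumed)
with torch.no_grad():
    for path in sorted(glob.glob('xx*.bmp')):
        img = Image.open(path).convert('RGB')
        # Step S4: bicubic pre-upsampling to 4x, then SR reconstruction.
        img = img.resize((img.width * 4, img.height * 4), Image.BICUBIC)
        sr = model(to_tensor(img).unsqueeze(0)).clamp(0, 1)
        to_pil_image(sr.squeeze(0)).save(path)   # overwrite with the SR frame

# Step S5: encode the reconstructed frames with H.265 (libx265).
subprocess.run(['ffmpeg', '-i', 'xx%3d.bmp', '-pix_fmt', 'yuv420p', '-vsync', '0',
                'xx.y4m', '-y', '-vcodec', 'libx265'], check=True)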

Those skilled in the art will understand that all or some of the steps in the method of this embodiment can be completed by instructing the relevant hardware through a program, and the corresponding program can be stored in a computer-readable storage medium. It should be noted that although the method operations of Embodiment 1 are described in a specific order in the drawings, this does not require or imply that the operations must be performed in that order, or that all of the illustrated operations must be performed, to achieve the desired result. Rather, the depicted steps may change their execution order, some steps may be executed concurrently, certain steps may additionally or alternatively be omitted, multiple steps may be combined into one step, and/or one step may be decomposed into multiple steps.

Embodiment 2

This embodiment discloses a video transcoding device based on a geometric generative model. The device comprises a model construction module, a model training module, an acquisition module, a video decoding module, a video reconstruction module, and a video encoding module, whose functions are as follows:

The model construction module is used to construct the geometric generative model; in this embodiment, the constructed geometric generative model is shown in Figure 2, and its detailed structure is described in Embodiment 1.

The model training module is used to train the constructed geometric generative model to obtain the super-resolution reconstruction model.

The acquisition module is used to obtain an MPEG-4 video.

The video decoding module is used to decode the MPEG-4 video and save the result as a sequence of still images.

The video reconstruction module is used to perform super-resolution enlargement and reconstruction on each decoded frame through the super-resolution reconstruction model.

The video encoding module is used to encode the super-resolution enlarged and reconstructed images into an H.265 video.

For the specific implementation of each module in this embodiment, see Embodiment 1; details are not repeated here. It should be noted that the device provided by this embodiment is illustrated only by the above division of functional modules; in practical applications, these functions may be assigned to different functional modules as needed, that is, the internal structure may be divided into different functional modules to complete all or part of the functions described above.

Embodiment 3

This embodiment discloses a storage medium storing a program which, when executed by a processor, implements the video transcoding method based on the geometric generative model of Embodiment 1, as follows:

constructing a geometric generative model and training it to obtain a super-resolution reconstruction model;

obtaining an MPEG-4 video;

decoding the MPEG-4 video and saving the result as a sequence of still images;

performing super-resolution enlargement and reconstruction on each decoded frame with the super-resolution reconstruction model;

encoding the super-resolution enlarged and reconstructed images into an H.265 video.

For the specific implementation of each of the above processes, see Embodiment 1; details are not repeated here.

In this embodiment, the storage medium may be a magnetic disk, an optical disc, a computer memory, a read-only memory (ROM), a USB flash drive, a removable hard disk, or a similar medium.

Embodiment 4

This embodiment discloses a computing device comprising a processor and a memory for storing a program executable by the processor; when the processor executes the program stored in the memory, the video transcoding method based on the geometric generative model of Embodiment 1 is implemented, as follows:

constructing a geometric generative model and training it to obtain a super-resolution reconstruction model;

obtaining an MPEG-4 video;

decoding the MPEG-4 video and saving the result as a sequence of still images;

performing super-resolution enlargement and reconstruction on each decoded frame with the super-resolution reconstruction model;

encoding the super-resolution enlarged and reconstructed images into an H.265 video.

For the specific implementation of each of the above processes, see Embodiment 1; details are not repeated here.

In this embodiment, the computing device may be a desktop computer, a notebook computer, a PDA handheld terminal, a tablet computer, or a similar terminal device.

The above embodiments are preferred implementations of the present invention, but the implementations of the present invention are not limited by them; any other change, modification, substitution, combination, or simplification made without departing from the spirit and principle of the present invention shall be an equivalent replacement and is included within the scope of protection of the present invention.

Claims (8)

1. A video transcoding method based on a geometric generative model, characterized in that the steps comprise:

constructing a geometric generative model and training it to obtain a super-resolution reconstruction model;

the constructed geometric generative model comprising an encoding part and a decoding part;

the encoding part comprising two downsampling layers connected in sequence, a first downsampling layer and a second downsampling layer;

the decoding part comprising a first upsampling layer, a second upsampling layer, and a convolutional layer, the upsampling in both upsampling layers being performed by sub-pixel convolution;

wherein:

the features input to the geometric generative model are fed to the first downsampling layer;

the output features of the first downsampling layer serve as the input of the second downsampling layer, and the output of the second downsampling layer is the output of the encoding part;

the features input to the decoding part pass through a channel attention residual block and are then fed to the first upsampling layer;

the features output by the first downsampling layer and the features output by the first upsampling layer are concatenated along the channel dimension, passed through a channel attention residual block, and used as the input of the second upsampling layer;

the features output by the second upsampling layer and the features input to the geometric generative model are concatenated along the channel dimension as the input of the convolutional layer, whose output is the output of the geometric generative model;

in the geometric generative model:

first, the encoding part maps the input data distribution ν to the latent space Z, obtaining the latent feature distribution μ:

μ = f_θ : Σ → Z

where Σ denotes a submanifold of the input data distribution ν, f_θ denotes the encoding map, and θ is the parameter to be learned;

then the optimal transport map T : Z → Z of the latent distribution μ is computed, transforming the uniform distribution ζ into the latent distribution μ:

T : Z → Z, ζ → μ

where the optimal transport map T of the latent distribution is computed by convex optimization;

finally, the distribution obtained after the transformation T is fed into the decoding part, which generates the final high-resolution image;

obtaining an MPEG-4 video;

decoding the MPEG-4 video and saving the result as a sequence of still images;

performing super-resolution enlargement and reconstruction on each decoded frame with the super-resolution reconstruction model;

encoding the super-resolution enlarged and reconstructed images into an H.265 video.

2. The video transcoding method based on a geometric generative model according to claim 1, characterized in that the training process of the geometric generative model is as follows:

obtaining low-resolution images whose corresponding high-resolution images are known, as training samples;

upsampling the low-resolution training images with bicubic interpolation to obtain images at the target resolution;

using the target-resolution image corresponding to each low-resolution image as the input of the geometric generative model and the corresponding high-resolution image as the label image, training the geometric generative model to obtain the super-resolution reconstruction model.

3. The video transcoding method based on a geometric generative model according to claim 2, characterized in that edge-aware loss processing is performed during the training of the geometric generative model, specifically:

first performing edge detection on the label image to obtain the edge value at each position;

when computing the training error, assigning higher weight to edge positions.

4. The video transcoding method based on a geometric generative model according to claim 3, characterized in that, when performing edge detection on the label image, a Laplacian filter is used as the edge-detection operator, and, combining the edge-aware loss with the Charbonnier loss, the training loss function L_final finally used is:

L_final = L_Charbonnier + λ||M * (I_SR - I_HR)||;

where L_Charbonnier is the Charbonnier loss, a variant of the L1 loss, specifically:

L_Charbonnier = √((I_SR - I_HR)² + ε²)

ε is a constant, and λ is a scaling coefficient that balances the Charbonnier loss and the edge-aware loss;

||M * (I_SR - I_HR)|| is the edge-aware loss, where I_SR denotes the high-resolution image reconstructed by the geometric generative model and I_HR denotes the label image;

when performing edge detection on the label image, the edge value of each position is obtained, and a threshold δ is then set to mark whether each position is an edge position, positions whose edge value exceeds the threshold being marked 1 and the others 0:

M(i, j) = 1 if L(i, j) > δ, and M(i, j) = 0 otherwise,

where L(i, j) is the edge value obtained by Laplacian edge detection at position (i, j) of the label image, M(i, j) is the edge mark at position (i, j), and M is the matrix formed by the M(i, j).

5. The video transcoding method based on a geometric generative model according to claim 4, characterized in that the Laplacian operator ∇²f is computed as:

∇²f = f(x-1, y) + f(x+1, y) + f(x, y-1) + f(x, y+1) - 4f(x, y);

where (x, y) denotes the current image position, f(x, y) the gray value at (x, y), f(x-1, y) the gray value at (x-1, y), f(x+1, y) the gray value at (x+1, y), f(x, y-1) the gray value at (x, y-1), and f(x, y+1) the gray value at (x, y+1).

6. A video transcoding device based on a geometric generative model, characterized by comprising:

a model construction module for constructing the geometric generative model;

the constructed geometric generative model comprising an encoding part and a decoding part;

the encoding part comprising two downsampling layers connected in sequence, a first downsampling layer and a second downsampling layer;

the decoding part comprising a first upsampling layer, a second upsampling layer, and a convolutional layer, the upsampling in both upsampling layers being performed by sub-pixel convolution;

wherein:

the features input to the geometric generative model are fed to the first downsampling layer;

the output features of the first downsampling layer serve as the input of the second downsampling layer, and the output of the second downsampling layer is the output of the encoding part;

the features input to the decoding part pass through a channel attention residual block and are then fed to the first upsampling layer;

the features output by the first downsampling layer and the features output by the first upsampling layer are concatenated along the channel dimension, passed through a channel attention residual block, and used as the input of the second upsampling layer;

the features output by the second upsampling layer and the features input to the geometric generative model are concatenated along the channel dimension as the input of the convolutional layer, whose output is the output of the geometric generative model;

in the geometric generative model:

first, the encoding part maps the input data distribution ν to the latent space Z, obtaining the latent feature distribution μ:

μ = f_θ : Σ → Z

where Σ denotes a submanifold of the input data distribution ν, f_θ denotes the encoding map, and θ is the parameter to be learned;

then the optimal transport map T : Z → Z of the latent distribution μ is computed, transforming the uniform distribution ζ into the latent distribution μ:

T : Z → Z, ζ → μ

where the optimal transport map T of the latent distribution is computed by convex optimization;

finally, the distribution obtained after the transformation T is fed into the decoding part, which generates the final high-resolution image;

a model training module for training the constructed geometric generative model to obtain the super-resolution reconstruction model;

an acquisition module for obtaining an MPEG-4 video;

a video decoding module for decoding the MPEG-4 video and saving the result as a sequence of still images;

a video reconstruction module for performing super-resolution enlargement and reconstruction on each decoded frame through the super-resolution reconstruction model;

a video encoding module for encoding the super-resolution enlarged and reconstructed images into an H.265 video.

7. A storage medium storing a program, characterized in that, when the program is executed by a processor, the video transcoding method based on a geometric generative model according to any one of claims 1 to 5 is implemented.

8. A computing device comprising a processor and a memory for storing a program executable by the processor, characterized in that, when the processor executes the program stored in the memory, the video transcoding method based on a geometric generative model according to any one of claims 1 to 5 is implemented.
CN202110652621.8A · Priority date 2021-06-11 · Filing date 2021-06-11 · Video transcoding method, device, medium and equipment based on geometric generative model · Active · Granted as CN113344786B (en)

Priority Applications (1)

CN202110652621.8A (granted as CN113344786B) · Priority date: 2021-06-11 · Filing date: 2021-06-11 · Title: Video transcoding method, device, medium and equipment based on geometric generative model

Applications Claiming Priority (1)

CN202110652621.8A (granted as CN113344786B) · Priority date: 2021-06-11 · Filing date: 2021-06-11 · Title: Video transcoding method, device, medium and equipment based on geometric generative model

Publications (2)

CN113344786A (en) · 2021-09-03
CN113344786B (en) · 2023-02-14 (grant)

Family

Family ID: 77476713

Family Applications (1)

CN202110652621.8A · Active · CN113344786B (en) · Priority date: 2021-06-11 · Filing date: 2021-06-11 · Title: Video transcoding method, device, medium and equipment based on geometric generative model

Country Status (1)

CN (1) · CN113344786B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
US12169913B2 (en)* · Priority: 2022-02-10 · Published: 2024-12-17 · Assignee: Lemon Inc. · Method and system for a high-frequency attention network for efficient single image super-resolution

Citations (2)

* Cited by examiner, † Cited by third party
CN109862370A (en)* · Priority: 2017-11-30 · Published: 2019-06-07 · Assignee: 北京大学 · Video super-resolution processing method and device
CN111970513A (en)* · Priority: 2020-08-14 · Published: 2020-11-20 · Assignee: 成都数字天空科技有限公司 · Image processing method and device, electronic equipment and storage medium

Family Cites Families (1)

* Cited by examiner, † Cited by third party
US10944996B2 (en)* · Priority: 2019-08-19 · Published: 2021-03-09 · Assignee: Intel Corporation · Visual quality optimized video compression

Patent Citations (2)

* Cited by examiner, † Cited by third party
CN109862370A (en)* · Priority: 2017-11-30 · Published: 2019-06-07 · Assignee: 北京大学 · Video super-resolution processing method and device
CN111970513A (en)* · Priority: 2020-08-14 · Published: 2020-11-20 · Assignee: 成都数字天空科技有限公司 · Image processing method and device, electronic equipment and storage medium

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Closed-loop Matters: Dual Regression Networks for Single Image Super-Resolution; Yong Guo et al.; Conference on Computer Vision and Pattern Recognition; 2020-06-13; p. 5410 and Fig. 3 *
Vehicle-mounted image super-resolution reconstruction based on weight quantization and information compression (基于权重量化与信息压缩的车载图像超分辨率重建); 许德智 et al.; Journal of Computer Applications (计算机应用); 2019-12-10; pp. 3644-3649 *
Blind restoration of super-resolution images based on joint interpolation and restoration (基于联合插值-恢复的超分辨率图像盲复原); 谢颂华 et al.; Journal of Computer Applications (计算机应用); 2010-02-28; p. 341 *

Also Published As

CN113344786A (en) · 2021-09-03

Similar Documents

Gao et al. · Implicit diffusion models for continuous super-resolution
CN112218072B · Video coding method based on deconstruction compression and fusion
CN102722865B · Super-resolution sparse representation method
WO2023000179A1 · Video super-resolution network, and video super-resolution, encoding and decoding processing method and device
CN108737823A · Image encoding method and device, and decoding method and device, based on super-resolution technology
CN111800629A · Video decoding method, encoding method, and video decoder and encoder
CN115829876A · Blind restoration method for real degraded images based on a cross-attention mechanism
CN109922339A · Image coding framework combining multi-sampling-rate downsampling and super-resolution reconstruction technology
CN112991169B · Image compression method and system based on an image pyramid and a generative adversarial network
CN115499666B · Video compression method, video decompression method, devices, and storage medium
Wang et al. · Compressed screen content image super resolution
CN113344786B · Video transcoding method, device, medium and equipment based on geometric generative model
CN116542889A · Panoramic video enhancement method with stabilized viewpoint
CN115294222A · Image coding method and image processing method, terminal and medium
CN113822801B · Compressed video super-resolution reconstruction method based on a multi-branch convolutional neural network
CN110148087A · Image compression and reconstruction method based on sparse representation
CN114418845A · Image resolution improving method and device, storage medium and electronic equipment
Qi et al. · Generative latent coding for ultra-low bitrate image and video compression
Cheng et al. · A remote sensing satellite image compression method based on conditional generative adversarial network
CN118678102A · Inter-frame error concealment method, device and medium based on a variational autoencoder
CN115205117B · Image reconstruction method and device, computer storage medium and electronic equipment
CN116258640A · Image color enhancement method based on Transformer
Yang et al. · An optimization method for video upsampling and downsampling using interpolation-dependent image downsampling
CN116600107A · HEVC-SCC fast coding method and device based on IPMS-CNN and spatially neighboring CU coding modes
CN103152586B · 2D-to-3D video transmission and reconstruction method based on depth templates

Legal Events

PB01 · Publication
SE01 · Entry into force of request for substantive examination
GR01 · Patent grant
