Detailed Description
The invention is further described below with reference to the accompanying drawings:
The QP-adaptive compressed video super-resolution method based on the compressed prior comprises the following steps:
(1) Constructing a feature extraction block, namely using a feature extraction block formed by multi-scale convolutions to separately extract the multi-scale shallow features of a compressed low-resolution video frame and of its corresponding compressed prior information (prediction signal, block partition map and residual);
(2) Constructing a QP modulation module, wherein the QP modulation module takes the frame-level QP as input, generates a modulation map through a modulation unit composed of 1×1 convolution layers and Leaky ReLU layers, recalibrates the input features channel-wise with the modulation map, and outputs the modulated features;
(3) Constructing a prior fusion module, namely constructing an adaptive multi-scale prior information fusion module on the basis of the multi-scale convolution and the QP modulation module, stacking a plurality of such adaptive multi-scale prior information fusion modules together with a simple feature fusion block to form the prior fusion module, fusing the multi-scale spatial features of the video with the multi-scale features of the corresponding prior information, and outputting the fused features;
(4) Employing an existing bidirectional propagation alignment module to fuse the spatially fused features in the temporal domain, perform feature alignment, and output the aligned features;
(5) Constructing a reconstruction module, namely constructing an adaptive enhancement modulation module on the basis of the QP modulation module, stacking the adaptive enhancement modulation modules to form the reconstruction module, and adaptively enhancing the features under the guidance of the frame-level QP;
(6) Combining the feature extraction block, the prior fusion module, the bidirectional propagation alignment module and the reconstruction module into the final QP-adaptive network;
(7) Training the network in step (6) with training data;
(8) During testing, the low-resolution compressed video and the corresponding compressed priors (frame-level QP, prediction signal, block partition map and residual) are taken as the input of the network, and the final super-resolution video is output.
Specifically, the input sequence of low-resolution compressed video frames is defined as X_CLR = {x_(t-r), …, x_t, …, x_(t+r)}, its corresponding compressed prior information (prediction signal, block partition map, residual) is defined as X_CP = {(x_CP)_(t-r), …, (x_CP)_t, …, (x_CP)_(t+r)}, and the frame-level QP is defined as QP = {qp_(t-r), …, qp_t, …, qp_(t+r)}, where r represents the radius of the temporal neighborhood.
In step (1), the process of extracting the multi-scale shallow features of the low-resolution compressed video X_CLR and of the corresponding compressed prior X_CP with the feature extraction block formed by multi-scale convolutions, as shown in Fig. 1(b), can be expressed as:
[F_V^H, F_V^L] = f_FEV(X_CLR), [F_P^H, F_P^L] = f_FEP(X_CP)
where F_V^H and F_V^L denote the high-scale and low-scale features extracted from X_CLR, F_P^H and F_P^L denote the high-scale and low-scale features extracted from the compressed prior of the corresponding video sequence, and f_FEV(·) and f_FEP(·) denote the feature extraction operations.
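The precise layer composition of the feature extraction block is given by Fig. 1(b); purely as an illustration of the interface described above, the following PyTorch sketch produces high-scale and low-scale shallow features from a frame or from its stacked prior maps. The layer choices, channel count and the use of average pooling for the low-scale branch are assumptions, not the patented structure.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiScaleFeatureExtractor(nn.Module):
    """Illustrative sketch of f_FEV / f_FEP: extracts high- and low-scale
    shallow features from a frame (or stacked prior maps). All layer
    choices here are assumptions made for illustration only."""
    def __init__(self, in_channels=3, channels=64):
        super().__init__()
        self.head = nn.Sequential(
            nn.Conv2d(in_channels, channels, 3, padding=1),
            nn.LeakyReLU(0.1, inplace=True),
        )
        self.high_branch = nn.Conv2d(channels, channels, 3, padding=1)
        self.low_branch = nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, x):
        feat = self.head(x)
        f_high = self.high_branch(feat)                 # full-resolution (high-scale) features
        f_low = self.low_branch(F.avg_pool2d(feat, 2))  # half-resolution (low-scale) features
        return f_high, f_low

# Example: features of a video frame and of its stacked compressed priors
# (prediction signal, partition map, residual), each treated as image-like maps.
frame = torch.randn(1, 3, 64, 64)
priors = torch.randn(1, 3, 64, 64)
f_v_high, f_v_low = MultiScaleFeatureExtractor(3)(frame)
f_p_high, f_p_low = MultiScaleFeatureExtractor(3)(priors)
```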
The QP modulation module constructed in step (2) is shown in Fig. 2. Given a feature map F as input, a guidance map G related to the feature content is generated by one CLC structure and one sigmoid layer, where the CLC structure consists of two 1×1 convolution layers and one Leaky ReLU layer, i.e. G = σ(f_CLC(F)), with f_CLC(·) denoting the function of the CLC structure and σ the sigmoid function. The frame-level QP is fed into the modulation unit to generate a 1D modulation map M = f_MU(QP), where f_MU(·) denotes the function of the modulation unit. The modulated feature F_M output by the QP modulation module is then obtained by recalibrating the input feature F with the guidance map G and the modulation map M through element-wise multiplication.
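To make the data flow of the QP modulation module concrete, the following PyTorch sketch wires up the pieces named above (CLC structure, sigmoid guidance map G, modulation unit producing the 1D map M from the frame-level QP). The way G and M are combined with the input feature, the use of linear layers in place of 1×1 convolutions in the 1D modulation unit, and the QP normalization are assumptions for illustration; the exact combination follows Fig. 2.

```python
import torch
import torch.nn as nn

class QPModulation(nn.Module):
    """Sketch of the QP modulation module: a CLC structure (Conv1x1 -
    LeakyReLU - Conv1x1) plus sigmoid produces a content guidance map G,
    and a modulation unit maps the frame-level QP to a 1D per-channel
    modulation map M. The plain element-wise combination below is an
    assumption, not the exact patented formula."""
    def __init__(self, channels=64):
        super().__init__()
        self.clc = nn.Sequential(                     # CLC structure
            nn.Conv2d(channels, channels, 1),
            nn.LeakyReLU(0.1, inplace=True),
            nn.Conv2d(channels, channels, 1),
        )
        self.modulation_unit = nn.Sequential(         # frame-level QP -> 1D map M
            nn.Linear(1, channels),                   # linear layers stand in for 1x1 convs (assumption)
            nn.LeakyReLU(0.1, inplace=True),
            nn.Linear(channels, channels),
        )

    def forward(self, feat, qp):
        # feat: (B, C, H, W); qp: (B, 1) normalized frame-level QP
        g = torch.sigmoid(self.clc(feat))                          # content-related guidance map G
        m = self.modulation_unit(qp).unsqueeze(-1).unsqueeze(-1)   # M, broadcast over H and W
        return feat * g * m                                        # channel-wise recalibrated feature F_M

feat = torch.randn(2, 64, 32, 32)
qp = torch.full((2, 1), 27.0) / 51.0    # QP scaled to [0, 1] (assumption)
out = QPModulation(64)(feat, qp)
```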
The prior fusion module in step (3) is composed of m adaptive multi-scale prior information fusion modules, as shown in Fig. 3, and a feature fusion block, as shown in Fig. 1(c). The multi-scale shallow features and the frame-level QP obtained in step (1) are fed into the prior fusion module to realize adaptive multi-scale multi-feature fusion, which can be expressed as:
F_PFM = f_PFM(F_V^H, F_V^L, F_P^H, F_P^L, QP)
where F_PFM denotes the fused features and f_PFM(·) denotes the function of the prior fusion module.
The adaptive multi-scale prior information fusion module in step (3) consists of a multi-scale convolution with C channels, a three-dimensional attention structure and a multi-scale QP modulation module. Specifically, for the i-th adaptive multi-scale prior information fusion module, the inputs are the high-scale feature F_(i-1)^H and the low-scale feature F_(i-1)^L produced by the fusion of the previous-stage module, the high-scale feature F_P^H and the low-scale feature F_P^L extracted from the compressed prior, and the frame-level QP. First, the features at each scale are concatenated along the channel dimension to achieve a preliminary fusion; communication and aggregation between the multi-scale features are then achieved by the multi-scale convolution, in which f(·, W) denotes a convolution with parameters W, δ denotes the Leaky ReLU function, Upsample(·, s) denotes nearest-neighbor interpolation with upsampling factor s, and Pool(·, k) denotes average pooling with a k×k kernel and stride k, yielding the aggregated high-scale and low-scale features. The three-dimensional attention structure is then used to further enhance the representation of the multi-scale features: an energy function E guides the generation of the three-dimensional attention weights, the features are recalibrated according to the importance of each position in the feature map, and the recalibrated features are adaptively blended with the input high-scale and low-scale features through skip connections.
The adaptively fused high-scale and low-scale features obtained in this way are blended under the control of four additional learnable parameters initialized to 1, and the energy function E is computed with the method of Yang et al., reference "Yang L, Zhang R Y, Li L, et al. SimAM: A simple, parameter-free attention module for convolutional neural networks[C]//International Conference on Machine Learning. PMLR, 2021: 11863-11874." The adaptively fused features, together with the frame-level QP, are then fed into the multi-scale QP modulation module shown in Fig. 1(c), whose function is denoted f_MSQPMM(·); its output [F_i^H, F_i^L] is the output of the i-th adaptive multi-scale prior information fusion module. The adaptive multi-scale prior information fusion module therefore not only adaptively fuses the multi-scale features but also introduces the prior information, improving the model's ability to represent the features of videos compressed with different QPs.
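The following PyTorch sketch illustrates one adaptive multi-scale prior information fusion module under stated assumptions: the energy function E is implemented with the parameter-free SimAM formulation from the cited Yang et al. reference, the cross-scale aggregation uses nearest-neighbor upsampling and average pooling as defined above, and the concluding multi-scale QP modulation step is omitted (it would apply a QP modulation module at each scale). The layer arrangement is illustrative rather than the exact patented structure.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def simam(x, e_lambda=1e-4):
    """Parameter-free 3D attention weights from the SimAM energy function
    (Yang et al., ICML 2021), used here as the energy function E."""
    b, c, h, w = x.shape
    n = h * w - 1
    d = (x - x.mean(dim=(2, 3), keepdim=True)).pow(2)
    v = d.sum(dim=(2, 3), keepdim=True) / n
    e_inv = d / (4 * (v + e_lambda)) + 0.5
    return x * torch.sigmoid(e_inv)

class MultiScalePriorFusion(nn.Module):
    """Rough sketch of one adaptive multi-scale prior information fusion
    module: channel-wise concatenation of video/prior features per scale,
    cross-scale aggregation with pooling/upsampling, SimAM attention, and
    learnable blending with the inputs. Layer choices are illustrative."""
    def __init__(self, channels=64, s=2, k=2):
        super().__init__()
        self.fuse_high = nn.Conv2d(2 * channels, channels, 3, padding=1)
        self.fuse_low = nn.Conv2d(2 * channels, channels, 3, padding=1)
        self.agg_high = nn.Conv2d(2 * channels, channels, 3, padding=1)
        self.agg_low = nn.Conv2d(2 * channels, channels, 3, padding=1)
        self.act = nn.LeakyReLU(0.1, inplace=True)
        self.alpha = nn.Parameter(torch.ones(4))   # four learnable blending parameters, init to 1
        self.s, self.k = s, k

    def forward(self, fv_high, fv_low, fp_high, fp_low):
        # preliminary fusion: channel-wise concatenation at each scale
        h = self.act(self.fuse_high(torch.cat([fv_high, fp_high], dim=1)))
        l = self.act(self.fuse_low(torch.cat([fv_low, fp_low], dim=1)))
        # cross-scale communication: upsample the low scale, pool the high scale
        h_agg = self.act(self.agg_high(torch.cat(
            [h, F.interpolate(l, scale_factor=self.s, mode='nearest')], dim=1)))
        l_agg = self.act(self.agg_low(torch.cat(
            [l, F.avg_pool2d(h, self.k, stride=self.k)], dim=1)))
        # 3D attention plus adaptive skip connections back to the inputs
        out_high = self.alpha[0] * simam(h_agg) + self.alpha[1] * fv_high
        out_low = self.alpha[2] * simam(l_agg) + self.alpha[3] * fv_low
        return out_high, out_low

fv_h, fv_l = torch.randn(1, 64, 64, 64), torch.randn(1, 64, 32, 32)
fp_h, fp_l = torch.randn(1, 64, 64, 64), torch.randn(1, 64, 32, 32)
out_h, out_l = MultiScalePriorFusion(64)(fv_h, fv_l, fp_h, fp_l)
```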
The bidirectional propagation alignment module in step (4) is shown in Fig. 1(a); its specific structure is consistent with that of the bidirectional propagation and alignment structure in BasicVSR++, reference "Chan K C K, Zhou S, Xu X, et al. BasicVSR++: Improving video super-resolution with enhanced propagation and alignment[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022: 5972-5981."
The reconstruction module in step (5) is composed of n adaptive enhancement modulation modules, as shown in Fig. 4. The adaptive enhancement modulation module first uses a wide-activation structure to enhance the feature stream: a 3×3 convolution layer before the activation layer expands the channels of the input feature F_in from C to s×C, and after activation another 3×3 convolution layer reduces the number of channels back to C, yielding the feature F_WA. Global average pooling, a Leaky ReLU layer and a sigmoid layer are then used to generate attention weights for the different channels, thereby adaptively recalibrating F_WA. An adaptive residual connection combines the initial feature and the recalibrated feature through the learnable parameters χ and μ to obtain the adaptively enhanced feature F_E. Finally, a QP modulation module is embedded at the tail, and the modulated feature is adaptively fused with the initial feature through the learnable parameters γ and η to obtain the adaptive enhancement modulation feature F_EMM:
F_EMM = γ × f_QPMM(C_1×1(F_E)) + η × F_in
where f_QPMM(·) denotes the function of the QP modulation module and C_1×1 denotes a 1×1 convolution operation.
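A rough PyTorch sketch of one adaptive enhancement modulation module follows, reusing the QPModulation sketch given earlier; the initialization of χ, μ, γ and η to 1 and the exact shape of the channel-attention branch are assumptions, while the final combination follows the formula above.

```python
import torch
import torch.nn as nn

class AdaptiveEnhancementModulation(nn.Module):
    """Sketch of one adaptive enhancement modulation module: wide-activation
    convolutions (C -> s*C -> C), channel attention from global average
    pooling, adaptive residual blending with learnable chi/mu, then QP
    modulation blended with learnable gamma/eta as in
    F_EMM = gamma * f_QPMM(C_1x1(F_E)) + eta * F_in.
    Assumes the QPModulation sketch above is in scope; exact layer
    details follow Fig. 4."""
    def __init__(self, channels=64, s=2):
        super().__init__()
        self.expand = nn.Conv2d(channels, s * channels, 3, padding=1)
        self.act = nn.LeakyReLU(0.1, inplace=True)
        self.reduce = nn.Conv2d(s * channels, channels, 3, padding=1)
        self.channel_attn = nn.Sequential(            # GAP + LeakyReLU + sigmoid attention
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels, 1),
            nn.LeakyReLU(0.1, inplace=True),
            nn.Conv2d(channels, channels, 1),
            nn.Sigmoid(),
        )
        self.conv1x1 = nn.Conv2d(channels, channels, 1)
        self.qpmm = QPModulation(channels)            # QP modulation module embedded at the tail
        # learnable blending parameters (initialization to 1 is an assumption)
        self.chi = nn.Parameter(torch.ones(1))
        self.mu = nn.Parameter(torch.ones(1))
        self.gamma = nn.Parameter(torch.ones(1))
        self.eta = nn.Parameter(torch.ones(1))

    def forward(self, f_in, qp):
        f_wa = self.reduce(self.act(self.expand(f_in)))   # wide-activation feature F_WA
        f_wa = f_wa * self.channel_attn(f_wa)             # channel-wise recalibration
        f_e = self.chi * f_in + self.mu * f_wa            # adaptive residual -> F_E
        return self.gamma * self.qpmm(self.conv1x1(f_e), qp) + self.eta * f_in

feat = torch.randn(1, 64, 32, 32)
qp = torch.full((1, 1), 32.0) / 51.0
out = AdaptiveEnhancementModulation(64)(feat, qp)
```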
As shown in Fig. 1(a), the QP-adaptive network in step (6) is formed by the feature extraction block of step (1), the prior fusion module of step (3), the bidirectional propagation alignment module of step (4) and the reconstruction module of step (5), and the whole network is trained jointly. The loss function of the QP-adaptive network is the sum of a pixel-domain reconstruction term governed by a penalty factor ε, set here to 1×10^-6, and a fast-Fourier-transform-based loss term L_FFT weighted by a hyperparameter λ, set here to 0.05, where ŷ_i denotes the super-resolution result reconstructed by the network, y_i denotes the original high-resolution video frame, and A(·) in L_FFT denotes the arithmetic square root of the sum of the squares of the real and imaginary parts after the Fourier transform, i.e. the Fourier amplitude spectrum.
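As a minimal sketch of the training objective, the snippet below assumes a Charbonnier-style pixel term sqrt((ŷ - y)^2 + ε^2) for the penalty factor ε and an L1 distance between Fourier amplitude spectra for L_FFT; the exact norms used in the patented loss follow the description above.

```python
import torch

def fft_amplitude(x):
    """A(x): square root of the sum of squares of the real and imaginary
    parts of the 2D Fourier transform, i.e. the amplitude spectrum."""
    return torch.abs(torch.fft.fft2(x, dim=(-2, -1)))

def sr_loss(sr, hr, eps=1e-6, lam=0.05):
    """Sketch of the total loss: a Charbonnier-style pixel term with penalty
    factor eps plus lam times an FFT-amplitude term (L1 form is assumed)."""
    pixel = torch.sqrt((sr - hr) ** 2 + eps ** 2).mean()
    freq = torch.abs(fft_amplitude(sr) - fft_amplitude(hr)).mean()
    return pixel + lam * freq

sr = torch.rand(1, 3, 64, 64, requires_grad=True)
hr = torch.rand(1, 3, 64, 64)
loss = sr_loss(sr, hr)
loss.backward()
```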
During training in step (7), the hyperparameters r, C, m, n and s are set to 3, 64, 5, 7 and 2, respectively. All videos in the training, validation and test sets are downsampled by a factor of 4 and then compressed with the HEVC reference software HM16.2 under the LDB and RA configurations at 4 different QPs (22, 27, 32, 37) to obtain the compressed low-resolution videos and the compressed priors (frame-level QP, prediction signal, block partition map and residual). In the training phase, video clips of different configurations and different QPs are mixed together to train a single general model: 64×64 patches are randomly cropped from the compressed low-resolution videos together with the 256×256 patches at the corresponding positions of the original high-resolution videos, the data are augmented with flipping and rotation, and the batch size is set to 4. Adam with default parameters is used as the optimizer; the initial learning rate is set to 0.0001 and is reduced to 0.00001 during training.
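A small sketch of the corresponding training setup follows; the placeholder model and the milestone at which the learning rate drops from 1e-4 to 1e-5 are assumptions, since the schedule is not specified above.

```python
import torch

# Illustrative training setup with the hyperparameters stated above.
model = torch.nn.Conv2d(3, 3, 3, padding=1)          # placeholder for the assembled QP-adaptive network
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[200_000], gamma=0.1)      # 1e-4 -> 1e-5 (milestone assumed)

batch_size = 4                                       # as stated above
lr_patch, hr_patch = 64, 256                         # 4x super-resolution training patches
radius, channels, m, n, s = 3, 64, 5, 7, 2           # r, C, m, n, s
```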
To verify the effectiveness of the present invention, comparative experiments were conducted on 10 standard test sequences of the Joint Collaborative Team on Video Coding (JCT-VC) with resolutions of 720p (1280×720), 1080p (1920×1080) and WQXGA (2560×1600). In the experiments, the method of the invention is compared with 8 typical video super-resolution algorithms and 1 compressed video super-resolution algorithm that exploits the compressed prior. These methods are:
Method 1, Jo et al., reference "Jo Y, Oh S W, Kang J, et al. Deep video super-resolution network using dynamic upsampling filters without explicit motion compensation[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018: 3224-3232."
Method 2, Xue et al., reference "Xue T, Chen B, Wu J, et al. Video enhancement with task-oriented flow[J]. International Journal of Computer Vision, 2019, 127: 1106-1125."
Method 3, Wang et al., reference "Wang L, Guo Y, Lin Z, et al. Learning for video super-resolution through HR optical flow estimation[C]//Computer Vision – ACCV 2018: 14th Asian Conference on Computer Vision, Perth, Australia, December 2-6, 2018, Revised Selected Papers, Part I 14. Springer International Publishing, 2019: 514-529."
Method 4, Tian et al., reference "Tian Y, Zhang Y, Fu Y, et al. TDAN: Temporally-deformable alignment network for video super-resolution[C]//2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 2020: 3357-3366."
Method 5, Haris et al., reference "Haris M, Shakhnarovich G, Ukita N. Recurrent back-projection network for video super-resolution[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2019: 3897-3906."
Method 6, Wang et al., reference "Wang X, Chan K C K, Yu K, et al. EDVR: Video restoration with enhanced deformable convolutional networks[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops. 2019."
Method 7, Chan et al., reference "Chan K C K, Wang X, Yu K, et al. BasicVSR: The search for essential components in video super-resolution and beyond[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2021: 4947-4956."
Method 8, Chan et al., reference "Chan K C K, Zhou S, Xu X, et al. BasicVSR++: Improving video super-resolution with enhanced propagation and alignment[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022: 5972-5981."
Method 9, Chen et al., reference "Chen P, Yang W, Wang M, et al. Compressed domain deep video super-resolution[J]. IEEE Transactions on Image Processing, 2021, 30: 7156-7169."
The contents of the comparative experiments are as follows:
Method 1 through Method 9 are retrained with the same training set, compressed video super-resolution is performed on the test set, and the results are compared with those of the present method. It is noted that Methods 1-9 require training a specific model for each coding configuration and QP, whereas the present method requires training only one model. In this experiment, super-resolution was performed on low-resolution videos compressed with QP = 22, 27, 32, 37 in the LDB mode. Table 1 compares the average PSNR (Peak Signal-to-Noise Ratio) and SSIM (Structural Similarity Index) of the reconstruction results of each method on the test set; larger PSNR (dB) and SSIM values indicate better reconstructed video quality, and the best result in each experiment is marked in bold. In addition, "BasketballDrive (LDB, QP=27)", "BQTerrace (RA, QP=32)", "Johnny (RA, QP=27)", "KristenAndSara (LDB, QP=22)" and "PeopleOnStreet (RA, QP=27)" are selected for visual comparison, as shown in Fig. 5: Fig. 5(a) shows the original high-resolution video frame, and Fig. 5(b) and Fig. 5(c) show the image block selected for presentation on the original high-resolution frame and its corresponding decoded low-resolution image block, respectively. Fig. 5(d), 5(e), 5(f), 5(g), 5(h) and 5(i) show the subjective visual comparison after super-resolution by Method 1, Method 5, Method 6, Method 7, Method 8 and the present invention, respectively.
Table 1. Comparison of the average PSNR (dB) and SSIM of the reconstruction results of each method on the test set.
From the experimental results, it can be seen that:
Methods 1-8 are general video super-resolution algorithms and achieve good results after being retrained with compressed videos, while Method 9 additionally uses the coding prior. The present method makes full use of the coding prior and, while deploying only one model, is superior to Methods 1-9 in objective metrics at different QPs. In addition, compared with Methods 1, 5, 6, 7 and 8, the reconstruction results of the QP-adaptive compressed video super-resolution method based on the compressed prior are visually more pleasing, with clearer textures.
In conclusion, the videos reconstructed by the present method show clear advantages in both subjective visual quality and objective evaluation metrics. The invention is therefore an effective method for compressed video super-resolution.