Generating high-resolution video frames from given low-resolution ones
This article is about the video frame restoration technique. For the video upscaling tool by Nvidia, see Video Super Resolution.
Comparison of the outputs of VSR and SISR methods. VSR restores more details by using temporal information.
Video super-resolution (VSR) is the process of generating high-resolution video frames from given low-resolution video frames. Unlike single-image super-resolution (SISR), the goal is not only to restore fine details while preserving coarse ones, but also to preserve motion consistency.
Many approaches have been proposed for this task, but the problem remains popular and challenging.
Most research considers the degradation process of frames as

$y_t = (x_t \circledast k)\downarrow_s + n_t$

where:
$x_t$ — original high-resolution frame sequence,
$k$ — blur kernel,
$\circledast$ — convolution operation,
$\downarrow_s$ — downscaling operation,
$n_t$ — additive noise,
$y_t$ — low-resolution frame sequence.
Super-resolution is the inverse operation, so the problem is to estimate a frame sequence $\hat{x}_t$ from the frame sequence $y_t$ so that $\hat{x}_t$ is close to the original. The blur kernel, the downscaling operation and the additive noise should be estimated for a given input to achieve better results.
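As an illustration, the degradation model above can be simulated in a few lines. The following is a minimal sketch, assuming a Gaussian blur kernel, stride-based downscaling and additive Gaussian noise; the `degrade` function name is purely illustrative.

```python
# Sketch of the degradation model y_t = (x_t * k) downscaled by s, plus noise,
# assuming Gaussian blur, stride-based downscaling and Gaussian noise.
import numpy as np
from scipy.ndimage import gaussian_filter

def degrade(hr_frames, blur_sigma=1.5, scale=4, noise_std=0.01):
    """hr_frames: array of shape (T, H, W) with values in [0, 1]."""
    lr_frames = []
    for x in hr_frames:
        blurred = gaussian_filter(x, sigma=blur_sigma)              # x * k (blur kernel)
        down = blurred[::scale, ::scale]                            # downscaling by s
        noisy = down + np.random.normal(0, noise_std, down.shape)   # additive noise
        lr_frames.append(np.clip(noisy, 0.0, 1.0))
    return np.stack(lr_frames)

# Example: degrade a random 10-frame 256x256 "video" to 64x64.
y = degrade(np.random.rand(10, 256, 256))
```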
Video super-resolution approaches tend to have more components than their image counterparts, as they need to exploit the additional temporal dimension. Complex designs are not uncommon. The most essential components of VSR are guided by four basic functionalities: Propagation, Alignment, Aggregation, and Upsampling.[1] A minimal pipeline sketch follows the list below.
Propagation refers to the way in which features are propagated temporally
Alignment concerns the spatial transformation applied to misaligned images/features
Aggregation defines the steps to combine aligned features
Upsampling describes the method to transform the aggregated features to the final output image
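The four functionalities can be pictured as a tiny pipeline skeleton. The sketch below uses deliberately trivial stand-ins (identity alignment, temporal averaging, interpolation-based upsampling) and hypothetical function names; a real VSR network would replace each stage with learned components.

```python
# Illustrative skeleton of the four functionalities: propagation, alignment,
# aggregation, upsampling. All stages are trivial placeholders.
import numpy as np
from scipy.ndimage import zoom

def align(frame, reference):
    # Stand-in: a real model would warp `frame` toward `reference`
    # using estimated motion or deformable convolution.
    return frame

def aggregate(aligned_frames):
    # Stand-in: simple temporal averaging instead of learned fusion.
    return np.mean(aligned_frames, axis=0)

def upsample(features, scale=4):
    # Stand-in: cubic interpolation instead of a learned decoder.
    return zoom(features, scale, order=3)

def vsr_pipeline(lr_frames, scale=4):
    outputs = []
    state = lr_frames[0]                       # propagated temporal information
    for frame in lr_frames:
        aligned = [align(f, frame) for f in (state, frame)]
        fused = aggregate(np.stack(aligned))   # aggregation of aligned features
        outputs.append(upsample(fused, scale))
        state = fused                          # propagation to the next step
    return outputs

hr_estimate = vsr_pipeline(np.random.rand(5, 64, 64))
```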
When working with video, temporal information can be used to improve upscaling quality. Single-image super-resolution methods can be used as well, generating high-resolution frames independently from their neighbours, but this is less effective and introduces temporal instability. There are a few traditional methods, which consider the video super-resolution task as an optimization problem. In recent years, deep learning based methods for video upscaling have outperformed traditional ones.
There are several traditional methods for video upscaling. These methods exploit natural image priors and estimate motion between frames. The high-resolution frame is reconstructed based on both the priors and the estimated motion.
First, the low-resolution frame is transformed to the frequency domain. The high-resolution frame is estimated in this domain. Finally, the resulting frame is transformed back to the spatial domain. Some methods use the Fourier transform, which helps to extend the spectrum of the captured signal and thus increase resolution. There are different approaches for these methods: using weighted least squares theory,[2] the total least squares (TLS) algorithm,[3] space-varying[4] or spatio-temporal[5] varying filtering. Other methods use the wavelet transform, which helps to find similarities in neighboring local areas.[6] Later the second-generation wavelet transform was used for video super-resolution.[7]
Iterative back-projection methods assume some function between low-resolution and high-resolution frames and try to improve this guessed function in each step of an iterative process.[8] Projections onto convex sets (POCS), which defines a specific cost function, can also be used for iterative methods.[9]
Iterative adaptive filtering algorithms use the Kalman filter to estimate the transformation from the low-resolution frame to the high-resolution one.[10] To improve the final result, these methods consider temporal correlation among low-resolution sequences. Some approaches also consider temporal correlation among the high-resolution sequence.[11] A common way to approximate the Kalman filter is to use least mean squares (LMS).[12] One can also use steepest descent,[13] least squares (LS),[14] or recursive least squares (RLS).[14]
Direct methods estimate motion between frames, upscale a reference frame, and warp neighboring frames to the high-resolution reference. To construct the result, these upscaled frames are fused together by a median filter,[15] weighted median filter,[16] adaptive normalized averaging, AdaBoost classifier[17] or SVD-based filters.[18]
Non-parametric algorithms join motion estimation and frame fusion into one step, performed by considering patch similarities. Weights for fusion can be calculated by nonlocal-means filters.[19] To strengthen the search for similar patches, one can use a rotation-invariant similarity measure[20] or an adaptive patch size.[21] Calculating intra-frame similarity helps to preserve small details and edges.[22] Parameters for fusion can also be calculated by kernel regression.[23]
In approaches with alignment, neighboring frames are first aligned with the target one. Frames can be aligned by performing motion estimation and motion compensation (MEMC) or by using deformable convolution (DC). Motion estimation gives information about the motion of pixels between frames. Motion compensation is a warping operation, which aligns one frame to another based on motion information (a minimal warping sketch is given after the list below). Examples of such methods:
Deep-DE[30] (deep draft-ensemble learning) generates a series of SR feature maps and then processes them together to estimate the final frame
VSRnet[31] is based on SRCNN (a model for single image super-resolution), but takes multiple frames as input. Input frames are first aligned by the Druleas algorithm
VESPCN[32] uses a spatial motion compensation transformer module (MCT), which estimates and compensates motion. Then a series of convolutions is performed to extract features and fuse them
DRVSR[33] (detail-revealing deep video super-resolution) consists of three main steps: motion estimation, motion compensation and fusion. The motion compensation transformer (MCT) is used for motion estimation. The sub-pixel motion compensation layer (SPMC) compensates motion. The fusion step uses an encoder-decoder architecture with a ConvLSTM module to unite information from both spatial and temporal dimensions
RVSR[34] (robust video super-resolution) has two branches: one for spatial alignment and another for temporal adaptation. The final frame is a weighted sum of the branches' outputs
FRVSR[35] (frame-recurrent video super-resolution) estimates low-resolution optical flow, upsamples it to high resolution and warps the previous output frame using this high-resolution optical flow
STTN[36] (the spatio-temporal transformer network) estimates optical flow with a U-style network based on U-Net and compensates motion by a trilinear interpolation method
SOF-VSR[37] (super-resolution optical flow for video super-resolution) calculates high-resolution optical flow in a coarse-to-fine manner. Then the low-resolution optical flow is estimated by a space-to-depth transformation. The final super-resolution result is obtained from the aligned low-resolution frames
TecoGAN[38] (the temporally coherent GAN) consists of a generator and a discriminator. The generator estimates LR optical flow between consecutive frames, approximates HR optical flow from it and yields the output frame. The discriminator assesses the quality of the generator
TOFlow[39] (task-oriented flow) is a combination of an optical flow network and a reconstruction network. The estimated optical flow is tailored to a particular task, such as video super-resolution
MMCNN[40] (the multi-memory convolutional neural network) aligns frames with the target one and then generates the final HR result through the feature extraction, detail fusion and feature reconstruction modules
RBPN[41] (the recurrent back-projection network). The input of each recurrent projection module consists of features from the previous frame, features from the sequence of frames, and the optical flow between neighboring frames
MEMC-Net[42] (the motion estimation and motion compensation network) uses both a motion estimation network and a kernel estimation network to warp frames adaptively
RTVSR[43] (real-time video super-resolution) aligns frames with an estimated convolutional kernel
MultiBoot VSR[44] (the multi-stage multi-reference bootstrapping method) aligns frames and then applies two stages of SR reconstruction to improve quality
BasicVSR[45] aligns frames with optical flow and then fuses their features in a recurrent bidirectional scheme
IconVSR[45] is a refined version of BasicVSR with a recurrent coupled propagation scheme
UVSR[46] (unrolled network for video super-resolution) adapts unrolled optimization algorithms to solve the VSR problem
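A building block shared by many of the MEMC methods above is backward warping of a neighboring frame with a dense optical flow field. The sketch below shows one possible implementation with PyTorch's `grid_sample`; the `warp` helper and the flow source are assumptions, not code from any particular paper.

```python
# Minimal sketch of motion compensation: backward-warp a neighboring frame to the
# target frame with a dense optical flow field. The flow is assumed to come from
# any motion-estimation network.
import torch
import torch.nn.functional as F

def warp(frame, flow):
    """frame: (N, C, H, W); flow: (N, 2, H, W) in pixels (dx, dy)."""
    n, _, h, w = frame.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    grid_x = xs.float()[None] + flow[:, 0]          # shifted x coordinates
    grid_y = ys.float()[None] + flow[:, 1]          # shifted y coordinates
    # Normalize coordinates to [-1, 1] as required by grid_sample.
    grid = torch.stack((2 * grid_x / (w - 1) - 1,
                        2 * grid_y / (h - 1) - 1), dim=-1)
    return F.grid_sample(frame, grid, align_corners=True)

# Zero flow leaves the frame unchanged.
aligned = warp(torch.rand(1, 3, 64, 64), torch.zeros(1, 2, 64, 64))
```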
Another way to align neighboring frames with the target one is deformable convolution. While the usual convolution has a fixed kernel, deformable convolution first estimates shifts for the kernel sampling positions and then performs the convolution (a minimal sketch is given after the list below). Examples of such methods:
EDVR[47] (the enhanced deformable video restoration) can be divided into two main modules: the pyramid, cascading and deformable (PCD) module for alignment and the temporal-spatial attention (TSA) module for fusion
DNLN[48] (the deformable non-local network) has an alignment module (based on deformable convolution with a hierarchical feature fusion module (HFFB) for better quality) and a non-local attention module
TDAN[49] (the temporally deformable alignment network) consists of an alignment module and a reconstruction module. Alignment is performed by deformable convolution based on feature extraction and alignment
Multi-Stage Feature Fusion Network[50] for Video Super-Resolution uses the multi-scale dilated deformable convolution for frame alignment and the Modulative Feature Fusion Branch to integrate aligned frames
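For illustration, the sketch below shows a single deformable convolution step using `torchvision.ops.deform_conv2d`, with offsets predicted by an ordinary convolution as in DCN-style alignment; the layer sizes are arbitrary and not taken from any of the methods above.

```python
# Minimal sketch of deformable convolution: offsets are predicted by an ordinary
# convolution, then the main kernel samples the input at the shifted positions.
import torch
import torch.nn as nn
from torchvision.ops import deform_conv2d

in_ch, out_ch, k = 16, 16, 3
offset_pred = nn.Conv2d(in_ch, 2 * k * k, kernel_size=3, padding=1)  # 2 offsets per kernel tap
weight = torch.randn(out_ch, in_ch, k, k)

x = torch.rand(1, in_ch, 32, 32)
offset = offset_pred(x)                        # (1, 2*k*k, 32, 32): learned shifts
y = deform_conv2d(x, offset, weight, padding=1)
print(y.shape)                                 # torch.Size([1, 16, 32, 32])
```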
Some methods align frames using a homography calculated between frames; a minimal homography-alignment sketch is given after the example below.
TGA[51] (temporal group attention) divides input frames into N groups depending on time difference and extracts information from each group independently. A fast spatial alignment module based on homography is used to align frames
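A minimal homography-based alignment can be sketched with OpenCV as below; this is a generic ORB-plus-RANSAC pipeline and the `align_by_homography` helper is illustrative, not the exact fast spatial alignment used in TGA.

```python
# Minimal sketch of homography-based frame alignment: match ORB keypoints,
# estimate a homography with RANSAC, and warp the neighbor onto the reference.
import cv2
import numpy as np

def align_by_homography(neighbor, reference):
    orb = cv2.ORB_create()
    kp1, des1 = orb.detectAndCompute(neighbor, None)
    kp2, des2 = orb.detectAndCompute(reference, None)
    matches = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True).match(des1, des2)
    src = np.float32([kp1[m.queryIdx].pt for m in matches]).reshape(-1, 1, 2)
    dst = np.float32([kp2[m.trainIdx].pt for m in matches]).reshape(-1, 1, 2)
    H, _ = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)   # robust homography estimate
    h, w = reference.shape[:2]
    return cv2.warpPerspective(neighbor, H, (w, h))         # warp neighbor onto reference
```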
Methods without alignment do not perform alignment as a first step and just process input frames.
VSRResNet[52], like a GAN, consists of a generator and a discriminator. The generator upsamples input frames, extracts features and fuses them. The discriminator assesses the quality of the resulting high-resolution frames
FFCVSR[53] (frame and feature-context video super-resolution) takes unaligned low-resolution frames together with previously output high-resolution frames to simultaneously restore high-frequency details and maintain temporal consistency
MRMNet[54] (the multi-resolution mixture network) consists of three modules: bottleneck, exchange, and residual. The bottleneck module extracts features that have the same resolution as the input frames. The exchange module exchanges features between neighboring frames and enlarges feature maps. The residual module extracts features after the exchange module
STMN[55] (the spatio-temporal matching network) uses the discrete wavelet transform to fuse temporal features. A non-local matching block integrates super-resolution and denoising. At the final step, the SR result is obtained in the global wavelet domain
MuCAN[56] (the multi-correspondence aggregation network) uses a temporal multi-correspondence strategy to fuse temporal features and cross-scale nonlocal correspondence to extract self-similarities in frames
DUF[57] (the dynamic upsampling filters) uses deformable 3D convolution for motion compensation. The model estimates kernels for specific input frames
FSTRN[58] (the fast spatio-temporal residual network) includes a few modules: an LR video shallow feature extraction net (LFENet), an LR feature fusion and up-sampling module (LSRNet), and two residual modules: spatio-temporal and global
3DSRnet[59] (the 3D super-resolution network) uses 3D convolutions to extract spatio-temporal information. The model also has a special approach for frames where a scene change is detected
MP3D[60] (the multi-scale pyramid 3D convolutional network) uses 3D convolution to extract spatial and temporal features simultaneously, which are then passed through a reconstruction module with 3D sub-pixel convolution for upsampling
DMBN[61] (the dynamic multiple branch network) has three branches to exploit information from multiple resolutions. Finally, information from the branches is fused dynamically
Recurrent convolutional neural networks perform video super-resolution by storing temporal dependencies.
STCN[62] (the spatio-temporal convolutional network) extracts features in the spatial module and passes them through the recurrent temporal module and the final reconstruction module. Temporal consistency is maintained by a long short-term memory (LSTM) mechanism
BRCN[63] (the bidirectional recurrent convolutional network) has two subnetworks: one with forward fusion and one with backward fusion. The result of the network is a composition of the two branches' outputs
RISTN[64] (the residual invertible spatio-temporal network) consists of spatial, temporal and reconstruction modules. The spatial module is composed of residual invertible blocks (RIB), which extract spatial features effectively. The output of the spatial module is processed by the temporal module, which extracts spatio-temporal information and then fuses the important features. The final result is calculated in the reconstruction module by a deconvolution operation
RRCN[65] (the residual recurrent convolutional network) is a bidirectional recurrent network, which calculates a residual image. The final result is then obtained by adding a bicubically upsampled input frame
RRN[66] (the recurrent residual network) uses a recurrent sequence of residual blocks to extract spatial and temporal information
BTRPN[67] (the bidirectional temporal-recurrent propagation network) uses a bidirectional recurrent scheme. The final result is combined from the two branches with a channel attention mechanism
RLSP[68] (recurrent latent space propagation) is a fully convolutional network cell with highly efficient propagation of temporal information through a hidden state
RSDN[69] (the recurrent structure-detail network) divides the input frame into structure and detail components and processes them in two parallel streams
Non-local methods extract both spatial and temporal information. The key idea is to compute each output position as a weighted sum over all possible positions (a minimal sketch of this weighting is given after the list below). This strategy may be more effective than local approaches. The progressive fusion non-local method extracts spatio-temporal features by non-local residual blocks, then fuses them by a progressive fusion residual block (PFRB). The result of these blocks is a residual image. The final result is obtained by adding a bicubically upsampled input frame
NLVSR[70] (the novel video super-resolution network) aligns frames with the target one by a temporal-spatial non-local operation. To integrate information from the aligned frames, an attention-based mechanism is used
MSHPFNL[71] also incorporates a multi-scale structure and hybrid convolutions to extract wide-range dependencies. To avoid artifacts like flickering or ghosting, it uses generative adversarial training
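The non-local weighting idea itself is compact: each output position receives a softmax-weighted sum over all positions of the feature map. The sketch below is a bare-bones version without the learned projections used in the methods above; the `non_local` function name is illustrative.

```python
# Minimal sketch of a non-local operation: every output position is a weighted
# sum over all positions, with weights given by feature similarity.
import torch
import torch.nn.functional as F

def non_local(x):
    """x: (N, C, H, W) feature map."""
    n, c, h, w = x.shape
    flat = x.view(n, c, h * w)                       # (N, C, HW)
    sim = torch.bmm(flat.transpose(1, 2), flat)      # (N, HW, HW) pairwise similarities
    attn = F.softmax(sim, dim=-1)                    # weights over all positions
    out = torch.bmm(flat, attn.transpose(1, 2))      # weighted sum of all positions
    return out.view(n, c, h, w) + x                  # residual connection

y = non_local(torch.rand(1, 8, 16, 16))
```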
LPIPS (Learned Perceptual Image Patch Similarity) compares the perceptual similarity of frames based on high-order image structure
tOF measures pixel-wise motion similarity with the reference frame based on optical flow (a minimal sketch is given after this list)
tLP calculates how LPIPS changes from frame to frame in comparison with the reference sequence
FSIM (Feature Similarity Index for Image Quality) uses phase congruency as the primary feature to measure the similarity between two corresponding frames.
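As an example of a temporal-consistency measurement in the spirit of tOF, the sketch below compares the motion fields of the restored and reference videos using OpenCV's Farneback optical flow; the exact formulation of the published metric may differ, and the `tOF` helper is illustrative.

```python
# Sketch of a tOF-style metric: compare motion fields of restored and reference
# videos, frame pair by frame pair, using Farneback optical flow.
import cv2
import numpy as np

def flow(a, b):
    return cv2.calcOpticalFlowFarneback(a, b, None, 0.5, 3, 15, 3, 5, 1.2, 0)

def tOF(restored, reference):
    """restored, reference: lists of grayscale uint8 frames of equal length."""
    diffs = []
    for t in range(1, len(reference)):
        f_res = flow(restored[t - 1], restored[t])
        f_ref = flow(reference[t - 1], reference[t])
        diffs.append(np.mean(np.abs(f_res - f_ref)))  # pixel-wise motion difference
    return float(np.mean(diffs))
```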
Currently, there are few objective metrics to verify a video super-resolution method's ability to restore real details. Research is ongoing in this area.
Another way to assess the performance of a video super-resolution algorithm is to organize a subjective evaluation. People are asked to compare the corresponding frames, and the final mean opinion score (MOS) is calculated as the arithmetic mean over all ratings.
While deep learning approaches to video super-resolution outperform traditional ones, it's crucial to form a high-quality dataset for evaluation. It's important to verify models' ability to restore small details, text, and objects with complicated structure, and to cope with large motion and noise.
A few benchmarks in video super-resolution were organized by companies and conferences. The purposes of such challenges are to compare diverse algorithms and to find the state-of-the-art for the task.
The NTIRE 2019 Challenge was organized by CVPR and proposed two tracks for video super-resolution: clean (only bicubic degradation) and blur (blur added first). Each track had more than 100 participants and 14 final results were submitted. The REDS dataset was collected for this challenge. It consists of 30 videos of 100 frames each. The resolution of ground-truth frames is 1280×720. The tested scale factor is 4. PSNR and SSIM were used to evaluate models' performance. The best participants' results are presented in the table:
The Youku-VESR Challenge was organized to check models' ability to cope with the degradation and noise that are typical for the Youku online video-watching application. The proposed dataset consists of 1000 videos, each 4–6 seconds long. The resolution of ground-truth frames is 1920×1080. The tested scale factor is 4. PSNR and VMAF metrics were used for performance evaluation. Top methods are presented in the table:
The challenge was held by ECCV and had two tracks on video extreme super-resolution: the first track checks the fidelity with the reference frame (measured by PSNR and SSIM); the second track checks the perceptual quality of videos (MOS). The dataset consists of 328 video sequences of 120 frames each. The resolution of ground-truth frames is 1920×1080. The tested scale factor is 16. Top methods are presented in the table:
The MSU Video Super-Resolution Benchmark was organized by MSU and proposed three types of motion, two ways to lower resolution, and eight types of content in the dataset. The resolution of ground-truth frames is 1920×1280. The tested scale factor is 4. 14 models were tested. PSNR and SSIM with shift compensation were used to evaluate models' performance. A few new metrics were also proposed: ERQAv1.0, QRCRv1.0, and CRRMv1.0.[72] Top methods are presented in the table:
The MSU Super-Resolution for Video Compression Benchmark was organized by MSU. This benchmark tests models' ability to work with compressed videos. The dataset consists of 9 videos, compressed with different video codec standards and different bitrates. Models are ranked by BSQ-rate[73] over subjective score. The resolution of ground-truth frames is 1920×1080. The tested scale factor is 4. 17 models were tested. 5 video codecs were used to compress the ground-truth videos. Top combinations of super-resolution methods and video codecs are presented in the table:
In many areas of working with video, we deal with different types of video degradation, including downscaling. The resolution of video can be degraded because of imperfections of measuring devices, such as optical degradations and the limited size of camera sensors. Bad light and weather conditions add noise to video. Object and camera motion also decrease video quality. Super-resolution techniques help to restore the original video. They are useful in a wide range of applications, such as
video surveillance (to improve video captured from a camera and recognize car number plates and faces)
medical imaging (to better discern organs or tissues for clinical analysis and medical intervention)
forensic science (to help in the investigation during the criminal procedure)
astronomy (to improve the quality of video of stars and planets)
Simulating the natural hand movements by "jiggling" the camera
Video super-resolution finds its practical use in some modern smartphones and cameras, where it is used to reconstruct digital photographs.
Reconstructing details in digital photographs is a difficult task since these photographs are already incomplete: the camera sensor elements measure only the intensity of the light, not directly its color. A process called demosaicing is used to reconstruct the photos from partial color information. A single frame doesn't give us enough data to fill in the missing colors; however, we can recover some of the missing information from multiple images taken one after the other. This process is known as burst photography and can be used to restore a single image of good quality from multiple sequential frames.
When we capture a lot of sequential photos with a smartphone or handheld camera, there is always some movement present between the frames because of the hand motion. We can take advantage of this hand tremor by combining the information on those images. We choose a single image as the "base" or reference frame and align every other frame relative to it.
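A toy version of this align-and-merge idea, assuming only a global translation between frames, can be written with OpenCV's phase correlation; real burst pipelines use tile-wise alignment and robust merging, so this is only a sketch, and the `merge_burst` helper is illustrative.

```python
# Minimal sketch of burst merging: estimate the shift of every frame against the
# reference with phase correlation, undo it, and average.
import cv2
import numpy as np

def merge_burst(frames):
    """frames: list of float32 grayscale images; frames[0] is the reference."""
    ref = frames[0]
    accum, count = ref.copy(), 1
    for frame in frames[1:]:
        (dx, dy), _ = cv2.phaseCorrelate(ref, frame)        # shift of frame vs reference
        M = np.float32([[1, 0, -dx], [0, 1, -dy]])          # translation that undoes it
        aligned = cv2.warpAffine(frame, M, (ref.shape[1], ref.shape[0]))
        accum += aligned
        count += 1
    return accum / count
```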
There are situations where hand motion is simply not present because the device is stabilized (e.g. placed on a tripod). There is a way to simulate natural hand motion by intentionally moving the camera very slightly. The movements are extremely small, so they don't interfere with regular photos. You can observe these motions on the Google Pixel 3[74] phone by holding it perfectly still (e.g. pressing it against a window) and maximally pinch-zooming the viewfinder.
^Chan, Kelvin CK, et al. "BasicVSR: The search for essential components in video super-resolution and beyond."Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2021.
^Kim, S. P.; Bose, N. K.; Valenzuela, H. M. (1989). "Reconstruction of high resolution image from noise undersampled frames".Lecture Notes in Control and Information Sciences. Vol. 129. Berlin/Heidelberg: Springer-Verlag. pp. 315–326.doi:10.1007/bfb0042742.ISBN3-540-51424-4.
^Bose, N.K.; Kim, H.C.; Zhou, B. (1994). "Performance analysis of the TLS algorithm for image reconstruction from a sequence of undersampled noisy and blurred frames".Proceedings of 1st International Conference on Image Processing. Vol. 3. IEEE Comput. Soc. Press. pp. 571–574.doi:10.1109/icip.1994.413741.ISBN0-8186-6952-7.
^Tekalp, A.M.; Ozkan, M.K.; Sezan, M.I. (1992). "High-resolution image reconstruction from lower-resolution image sequences and space-varying image restoration".[Proceedings] ICASSP-92: 1992 IEEE International Conference on Acoustics, Speech, and Signal Processing. IEEE. pp. 169–172 vol.3.doi:10.1109/icassp.1992.226249.ISBN0-7803-0532-9.
^Goldberg, N.; Feuer, A.; Goodwin, G.C. (2003). "Super-resolution reconstruction using spatio-temporal filtering".Journal of Visual Communication and Image Representation.14 (4). Elsevier BV:508–525.doi:10.1016/s1047-3203(03)00042-7.ISSN1047-3203.
^Bose, N.K.; Lertrattanapanich, S.; Chappalli, M.B. (2004). "Superresolution with second generation wavelets".Signal Processing: Image Communication.19 (5). Elsevier BV:387–391.doi:10.1016/j.image.2004.02.001.ISSN0923-5965.
^Cohen, B.; Avrin, V.; Dinstein, I. (2000). "Polyphase back-projection filtering for resolution enhancement of image sequences".2000 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings (Cat. No.00CH37100). Vol. 4. IEEE. pp. 2171–2174.doi:10.1109/icassp.2000.859267.ISBN0-7803-6293-4.
^Katsaggelos, A.K. (1997). "An iterative weighted regularized algorithm for improving the resolution of video sequences".Proceedings of International Conference on Image Processing. IEEE Comput. Soc. pp. 474–477.doi:10.1109/icip.1997.638811.ISBN0-8186-8183-7.
^Farsiu, Sina; Elad, Michael; Milanfar, Peyman (2006-01-15). "A practical approach to superresolution". In Apostolopoulos, John G.; Said, Amir (eds.).Visual Communications and Image Processing 2006. Vol. 6077. SPIE. p. 607703.doi:10.1117/12.644391.
^Jing Tian; Kai-Kuang Ma (2005). "A new state-space approach for super-resolution image sequence reconstruction".IEEE International Conference on Image Processing 2005. IEEE. pp. I-881.doi:10.1109/icip.2005.1529892.ISBN0-7803-9134-9.
^Costa, Guilherme Holsbach; Bermudez, Jos Carlos Moreira (2007). "Statistical Analysis of the LMS Algorithm Applied to Super-Resolution Image Reconstruction".IEEE Transactions on Signal Processing.55 (5). Institute of Electrical and Electronics Engineers (IEEE):2084–2095.Bibcode:2007ITSP...55.2084C.doi:10.1109/tsp.2007.892704.ISSN1053-587X.S2CID52857681.
^Elad, M.; Feuer, A. (1999). "Super-resolution reconstruction of continuous image sequences".Proceedings 1999 International Conference on Image Processing (Cat. 99CH36348). Vol. 3. IEEE. pp. 459–463.doi:10.1109/icip.1999.817156.ISBN0-7803-5467-2.
^abElad, M.; Feuer, A. (1999). "Superresolution restoration of an image sequence: adaptive filtering approach".IEEE Transactions on Image Processing.8 (3). Institute of Electrical and Electronics Engineers (IEEE):387–395.Bibcode:1999ITIP....8..387E.doi:10.1109/83.748893.ISSN1057-7149.PMID18262881.
^Pickering, M.; Frater, M.; Arnold, J. (2005). "Arobust approach to super-resolution sprite generation".IEEE International Conference on Image Processing 2005. IEEE. pp. I-897.doi:10.1109/icip.2005.1529896.ISBN0-7803-9134-9.
^Nasonov, Andrey V.; Krylov, Andrey S. (2010). "Fast Super-Resolution Using Weighted Median Filtering".2010 20th International Conference on Pattern Recognition. IEEE. pp. 2230–2233.doi:10.1109/icpr.2010.546.ISBN978-1-4244-7542-1.
^Simonyan, K.; Grishin, S.; Vatolin, D.; Popov, D. (2008). "Fast video super-resolution via classification".2008 15th IEEE International Conference on Image Processing. IEEE. pp. 349–352.doi:10.1109/icip.2008.4711763.ISBN978-1-4244-1765-0.
^Nasir, Haidawati; Stankovic, Vladimir; Marshall, Stephen (2011). "Singular value decomposition based fusion for super-resolution image reconstruction".2011 IEEE International Conference on Signal and Image Processing Applications (ICSIPA). IEEE. pp. 393–398.doi:10.1109/icsipa.2011.6144138.ISBN978-1-4577-0242-6.
^Zhuo, Yue;Liu, Jiaying; Ren, Jie; Guo, Zongming (2012). "Nonlocal based Super Resolution with rotation invariance and search window relocation".2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE. pp. 853–856.doi:10.1109/icassp.2012.6288018.ISBN978-1-4673-0046-9.
^Huhle, Benjamin; Schairer, Timo; Jenke, Philipp; Straßer, Wolfgang (2010). "Fusion of range and color images for denoising and resolution enhancement with a non-local filter".Computer Vision and Image Understanding.114 (12). Elsevier BV:1336–1345.doi:10.1016/j.cviu.2009.11.004.ISSN1077-3142.
^Elad, M.; Feuer, A. (1997). "Restoration of a single superresolution image from several blurred, noisy, and undersampled measured images".IEEE Transactions on Image Processing.6 (12). Institute of Electrical and Electronics Engineers (IEEE):1646–1658.Bibcode:1997ITIP....6.1646E.doi:10.1109/83.650118.ISSN1057-7149.PMID18285235.
^Farsiu, Sina; Robinson, Dirk; Elad, Michael; Milanfar, Peyman (2003-11-20). "Robust shift and add approach to superresolution". In Tescher, Andrew G. (ed.).Applications of Digital Image Processing XXVI. Vol. 5203. SPIE. p. 121.doi:10.1117/12.507194.
^Rajan, D.; Chaudhuri, S. (2001). "Generation of super-resolution images from blurred observations using Markov random fields".2001 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings (Cat. No.01CH37221). Vol. 3. IEEE. pp. 1837–1840.doi:10.1109/icassp.2001.941300.ISBN0-7803-7041-4.
^Zibetti, Marcelo Victor Wust; Mayer, Joceli (2006). "Outlier Robust and Edge-Preserving Simultaneous Super-Resolution".2006 International Conference on Image Processing. IEEE. pp. 1741–1744.doi:10.1109/icip.2006.312718.ISBN1-4244-0480-0.
^Joshi, M.V.; Chaudhuri, S.; Panuganti, R. (2005). "A Learning-Based Method for Image Super-Resolution From Zoomed Observations".IEEE Transactions on Systems, Man, and Cybernetics - Part B: Cybernetics.35 (3). Institute of Electrical and Electronics Engineers (IEEE):527–537.doi:10.1109/tsmcb.2005.846647.ISSN1083-4419.PMID15971920.S2CID3162908.
^Liao, Renjie; Tao, Xin; Li, Ruiyu; Ma, Ziyang; Jia, Jiaya (2015). "Video Super-Resolution via Deep Draft-Ensemble Learning".2015 IEEE International Conference on Computer Vision (ICCV). IEEE. pp. 531–539.doi:10.1109/iccv.2015.68.ISBN978-1-4673-8391-2.
^Kappeler, Armin; Yoo, Seunghwan; Dai, Qiqin; Katsaggelos, Aggelos K. (2016). "Video Super-Resolution With Convolutional Neural Networks".IEEE Transactions on Computational Imaging.2 (2). Institute of Electrical and Electronics Engineers (IEEE):109–122.doi:10.1109/tci.2016.2532323.ISSN2333-9403.S2CID9356783.
^Caballero, Jose; Ledig, Christian; Aitken, Andrew; Acosta, Alejandro; Totz, Johannes; Wang, Zehan; Shi, Wenzhe (2016-11-16). "Real-Time Video Super-Resolution with Spatio-Temporal Networks and Motion Compensation".arXiv:1611.05250v2 [cs.CV].
^Tao, Xin; Gao, Hongyun; Liao, Renjie; Wang, Jue; Jia, Jiaya (2017). "Detail-Revealing Deep Video Super-Resolution".2017 IEEE International Conference on Computer Vision (ICCV). IEEE. pp. 4482–4490.arXiv:1704.02738.doi:10.1109/iccv.2017.479.ISBN978-1-5386-1032-9.
^Liu, Ding; Wang, Zhaowen; Fan, Yuchen; Liu, Xianming; Wang, Zhangyang; Chang, Shiyu; Huang, Thomas (2017). "Robust Video Super-Resolution with Learned Temporal Dynamics".2017 IEEE International Conference on Computer Vision (ICCV). IEEE. pp. 2526–2534.doi:10.1109/iccv.2017.274.ISBN978-1-5386-1032-9.
^Sajjadi, Mehdi S. M.; Vemulapalli, Raviteja; Brown, Matthew (2018). "Frame-Recurrent Video Super-Resolution".2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. IEEE. pp. 6626–6634.arXiv:1801.04590.doi:10.1109/cvpr.2018.00693.ISBN978-1-5386-6420-9.
^Kim, Tae Hyun; Sajjadi, Mehdi S. M.; Hirsch, Michael; Schölkopf, Bernhard (2018). "Spatio-Temporal Transformer Network for Video Restoration".Computer Vision – ECCV 2018. Lecture Notes in Computer Science. Vol. 11207. Cham: Springer International Publishing. pp. 111–127.doi:10.1007/978-3-030-01219-9_7.ISBN978-3-030-01218-2.ISSN0302-9743.
^Chu, Mengyu; Xie, You; Mayer, Jonas; Leal-Taixé, Laura; Thuerey, Nils (2020-07-08). "Learning temporal coherence via self-supervision for GAN-based video generation".ACM Transactions on Graphics.39 (4). Association for Computing Machinery (ACM).arXiv:1811.09393.doi:10.1145/3386569.3392457.ISSN0730-0301.S2CID209460786.
^Xue, Tianfan; Chen, Baian; Wu, Jiajun; Wei, Donglai; Freeman, William T. (2019-02-12). "Video Enhancement with Task-Oriented Flow".International Journal of Computer Vision.127 (8). Springer Science and Business Media LLC:1106–1125.arXiv:1711.09078.doi:10.1007/s11263-018-01144-2.ISSN0920-5691.S2CID40412298.
^Wang, Zhongyuan; Yi, Peng; Jiang, Kui; Jiang, Junjun; Han, Zhen; Lu, Tao; Ma, Jiayi (2019). "Multi-Memory Convolutional Neural Network for Video Super-Resolution".IEEE Transactions on Image Processing.28 (5). Institute of Electrical and Electronics Engineers (IEEE):2530–2544.Bibcode:2019ITIP...28.2530W.doi:10.1109/tip.2018.2887017.ISSN1057-7149.PMID30571634.S2CID58595890.
^Haris, Muhammad; Shakhnarovich, Gregory; Ukita, Norimichi (2019). "Recurrent Back-Projection Network for Video Super-Resolution".2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). IEEE. pp. 3892–3901.arXiv:1903.10128.doi:10.1109/cvpr.2019.00402.ISBN978-1-7281-3293-8.
^Bao, Wenbo; Lai, Wei-Sheng; Zhang, Xiaoyun; Gao, Zhiyong; Yang, Ming-Hsuan (2021-03-01). "MEMC-Net: Motion Estimation and Motion Compensation Driven Neural Network for Video Interpolation and Enhancement".IEEE Transactions on Pattern Analysis and Machine Intelligence.43 (3). Institute of Electrical and Electronics Engineers (IEEE):933–948.arXiv:1810.08768.doi:10.1109/tpami.2019.2941941.ISSN0162-8828.PMID31722471.S2CID53046739.
^Kalarot, Ratheesh; Porikli, Fatih (2019). "MultiBoot Vsr: Multi-Stage Multi-Reference Bootstrapping for Video Super-Resolution".2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW). IEEE. pp. 2060–2069.doi:10.1109/cvprw.2019.00258.ISBN978-1-7281-2506-0.
^abChan, Kelvin C. K.; Wang, Xintao; Yu, Ke; Dong, Chao; Loy, Chen Change (2020-12-03). "BasicVSR: The Search for Essential Components in Video Super-Resolution and Beyond".arXiv:2012.02181v1 [cs.CV].
^Naoto Chiche, Benjamin; Frontera-Pons, Joana; Woiselle, Arnaud; Starck, Jean-Luc (2020-11-09). "Deep Unrolled Network for Video Super-Resolution".2020 Tenth International Conference on Image Processing Theory, Tools and Applications (IPTA). IEEE. pp. 1–6.arXiv:2102.11720.doi:10.1109/ipta50016.2020.9286636.ISBN978-1-7281-8750-1.
^Wang, Xintao; Chan, Kelvin C. K.; Yu, Ke; Dong, Chao; Loy, Chen Change (2019-05-07). "EDVR: Video Restoration with Enhanced Deformable Convolutional Networks".arXiv:1905.02716v1 [cs.CV].
^Jo, Younghyun; Oh, Seoung Wug; Kang, Jaeyeon; Kim, Seon Joo (2018). "Deep Video Super-Resolution Network Using Dynamic Upsampling Filters Without Explicit Motion Compensation".2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. IEEE. pp. 3224–3232.doi:10.1109/cvpr.2018.00340.ISBN978-1-5386-6420-9.
^Li, Sheng; He, Fengxiang; Du, Bo; Zhang, Lefei; Xu, Yonghao; Tao, Dacheng (2019-04-05). "Fast Spatio-Temporal Residual Network for Video Super-Resolution".arXiv:1904.02870v1 [cs.CV].
^Kim, Soo Ye; Lim, Jeongyeon; Na, Taeyoung; Kim, Munchurl (2019). "Video Super-Resolution Based on 3D-CNNS with Consideration of Scene Change".2019 IEEE International Conference on Image Processing (ICIP). pp. 2831–2835.doi:10.1109/ICIP.2019.8803297.ISBN978-1-5386-6249-6.S2CID202763112.
^Luo, Jianping; Huang, Shaofei; Yuan, Yuan (2020). "Video Super-Resolution using Multi-scale Pyramid 3D Convolutional Networks".Proceedings of the 28th ACM International Conference on Multimedia. pp. 1882–1890.doi:10.1145/3394171.3413587.ISBN9781450379885.S2CID222278621.
^Zhang, Dongyang; Shao, Jie; Liang, Zhenwen; Liu, Xueliang; Shen, Heng Tao (2020). "Multi-branch Networks for Video Super-Resolution with Dynamic Reconstruction Strategy".IEEE Transactions on Circuits and Systems for Video Technology.31 (10):3954–3966.doi:10.1109/TCSVT.2020.3044451.ISSN1051-8215.S2CID235057646.
^Fuoli, Dario; Gu, Shuhang; Timofte, Radu (2019-09-17). "Efficient Video Super-Resolution through Recurrent Latent Space Propagation".arXiv:1909.08080 [eess.IV].
^Zvezdakova, A. V.; Kulikov, D. L.; Zvezdakov, S. V.; Vatolin, D. S. (2020). "BSQ-rate: a new approach for video-codec performance comparison and drawbacks of current solutions".Programming and Computer Software.46 (3):183–194.doi:10.1134/S0361768820030111.S2CID219157416.