Disclosure of Invention
In view of the above problems, embodiments of the present invention are proposed to provide a video processing method, apparatus and device that overcome or at least partially solve the above problems.
According to an aspect of an embodiment of the present invention, there is provided a video processing method, including:
acquiring at least two frames of a video to be coded;
extracting the features of the at least two frames to obtain at least four feature maps;
recombining the at least four feature maps to obtain at least two feature sequences;
determining a key frame contained in each of the at least two feature sequences to obtain a key frame sequence;
and performing inter-frame prediction according to the key frame sequence to obtain feature data of a key frame, wherein the feature data is used for obtaining coded data of the video to be coded.
Optionally, recombining the at least four feature maps to obtain at least two feature sequences, including:
classifying the at least four feature maps according to feature types;
and combining feature maps of the same feature type into a feature sequence to obtain the at least two feature sequences.
Optionally, determining a key frame included in each of the at least two feature sequences to obtain a sequence of key frames, including:
respectively acquiring key frames in the at least two feature sequences according to a preset rule;
and combining the acquired key frames to obtain a key frame sequence, wherein the frames in each key frame sequence contain only one data feature type.
Optionally, performing inter-frame prediction according to the key frame sequence to obtain feature data of a key frame, including:
determining a target frame in the sequence of key frames as an independently encoded frame;
processing the feature map of at least one non-target frame in the key frame sequence based on the feature map of the independently encoded frame to respectively obtain at least one prediction feature map, wherein the at least one prediction feature map is in one-to-one correspondence with the at least one non-target frame;
and, for each prediction feature map, calculating a motion residual between the prediction feature map and the feature map of the key frame corresponding to that prediction feature map, to obtain the feature data of the key frame.
Optionally, after the key frames in the at least two feature sequences are respectively obtained according to a preset rule, the method further includes:
sorting the non-key frames in the at least two feature sequences to obtain a non-key frame sequence;
and encoding the non-key frame sequence to obtain the feature data of the non-key frames.
Optionally, after obtaining the feature data of the key frame, the method further includes:
quantizing the feature data of the key frame and the feature data of the non-key frame simultaneously to obtain quantized data;
transforming the quantized data to obtain transformed data;
and inputting the transformed data into an entropy coder for processing to obtain code stream data of the video to be coded.
According to another aspect of the embodiments of the present invention, there is provided a video processing apparatus including:
an acquisition module, configured to acquire at least two frames of a video to be encoded;
a processing module, configured to extract features of the at least two frames to obtain at least four feature maps; recombine the at least four feature maps to obtain at least two feature sequences; determine a key frame contained in each of the at least two feature sequences to obtain a key frame sequence; and perform inter-frame prediction according to the key frame sequence to obtain feature data of a key frame, wherein the feature data is used for obtaining coded data of the video to be coded.
According to still another aspect of the embodiments of the present invention, there is provided a computing device, including: a processor, a memory, a communication interface, and a communication bus, wherein the processor, the memory, and the communication interface communicate with one another via the communication bus;
the memory is used for storing at least one executable instruction, and the executable instruction causes the processor to perform the operations corresponding to the video processing method described above.
According to a further aspect of the embodiments of the present invention, there is provided a computer storage medium, in which at least one executable instruction is stored, and the executable instruction causes a processor to perform operations corresponding to the video processing method.
According to the scheme provided by the embodiments of the present invention, at least two frames of a video to be coded are acquired; features of the at least two frames are extracted to obtain at least four feature maps; the at least four feature maps are recombined to obtain at least two feature sequences; a key frame contained in each of the at least two feature sequences is determined to obtain a key frame sequence; and inter-frame prediction is performed according to the key frame sequence to obtain feature data of a key frame, the feature data being used for obtaining coded data of the video to be coded. Because the multiple feature frames corresponding to the same feature in the video data can be regarded as one video, the background can remain unchanged when inter-frame prediction is performed within such a same-feature video, which simplifies the inter-frame prediction process, increases the accuracy of the inter-frame prediction, and improves its efficiency.
The foregoing description is only an overview of the technical solutions of the embodiments of the present invention. In order that the technical means of the embodiments may be understood more clearly and implemented according to the content of this description, and in order to make the above and other objects, features, and advantages of the embodiments more readily apparent, detailed descriptions of the embodiments of the present invention are provided below.
Detailed Description
Exemplary embodiments of the present invention will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the invention are shown in the drawings, it should be understood that the invention can be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the invention to those skilled in the art.
Fig. 1 shows a flowchart of a video processing method according to an embodiment of the present invention. As shown in Fig. 1, the method includes the following steps:
step 11, acquiring at least two frames of a video to be coded;
step 12, extracting the features of the at least two frames to obtain at least four feature maps;
step 13, recombining the at least four feature maps to obtain at least two feature sequences;
step 14, determining a key frame contained in each of the at least two feature sequences to obtain a key frame sequence;
step 15, performing inter-frame prediction according to the key frame sequence to obtain feature data of a key frame, wherein the feature data is used for obtaining coded data of the video to be coded.
In this embodiment, at least two frames of a video to be encoded are acquired; features of the at least two frames are extracted to obtain at least four feature maps; the at least four feature maps are recombined to obtain at least two feature sequences; a key frame contained in each of the at least two feature sequences is determined to obtain a key frame sequence; and inter-frame prediction is performed according to the key frame sequence to obtain feature data of a key frame, the feature data being used for obtaining coded data of the video to be coded. Because the multiple feature frames corresponding to the same feature in the video data can be regarded as one video, the background can remain unchanged when inter-frame prediction is performed within such a same-feature video, which simplifies the inter-frame prediction process, increases the accuracy of the inter-frame prediction, and improves its efficiency.
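For illustration only, the following minimal sketch shows one way steps 11 through 15 might be wired together in code. It assumes frames are numpy arrays, and extract_features, predict, and entropy_code are hypothetical callables supplied by the surrounding codec; none of these names come from the disclosed embodiment.

```python
import numpy as np

def encode_video(frames, extract_features, predict, entropy_code):
    """Hypothetical end-to-end sketch of steps 11-15.

    frames: list of H x W x C numpy arrays (at least two).
    extract_features(frame) is assumed to return dicts carrying a feature
    "type" label and a 2-D feature "map" (an assumption for the sketch).
    """
    # Step 12: each frame yields several feature maps (at least four total).
    feature_maps = [fm for frame in frames for fm in extract_features(frame)]
    # Step 13: regroup maps of the same feature type into feature sequences.
    sequences = {}
    for fm in feature_maps:
        sequences.setdefault(fm["type"], []).append(fm["map"])
    # Step 14: pick one key frame per sequence (a stand-in preset rule).
    key_frames = [seq[0] for seq in sequences.values()]
    # Step 15: the first key frame is the independently encoded target;
    # the others are predicted from it and only residuals are kept.
    target, others = key_frames[0], key_frames[1:]
    residuals = [other - predict(target) for other in others]
    return entropy_code(target, residuals)
```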
In an alternative embodiment of the present invention, step 13 may include:
step 131, classifying the at least four feature maps according to feature types;
step 132, combining feature maps of the same feature type into a feature sequence to obtain the at least two feature sequences.
In this embodiment, the feature type mainly refers to an image feature, and image features mainly include the color feature, texture feature, shape feature, and spatial relationship feature of an image. The color feature is a global feature describing the surface properties of the scene corresponding to an image or an image region. The texture feature is also a global feature that likewise describes the surface properties of the scene corresponding to the image or image region. Shape features have two kinds of representation: contour features, which mainly concern the outer boundary of an object, and region features, which concern the entire shape region. The spatial relationship feature refers to the mutual spatial positions or relative directional relationships among multiple targets segmented from an image; these relationships can be divided into connection/adjacency relationships, overlap/occlusion relationships, inclusion/containment relationships, and the like. A feature sequence is a sequence in which image features of the same type are arranged together.
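As a sketch of step 13 only, the grouping below sorts feature maps into per-type sequences. The four type labels and the classify callable are assumptions made for illustration, not part of the embodiment.

```python
from collections import defaultdict

FEATURE_TYPES = ("color", "texture", "shape", "spatial")  # illustrative labels

def regroup(feature_maps, classify):
    """feature_maps: iterable of 2-D arrays, in frame order.
    classify(fm) -> one of FEATURE_TYPES (a hypothetical classifier).
    Returns {feature_type: [maps of that type, still in frame order]}."""
    sequences = defaultdict(list)
    for fm in feature_maps:
        sequences[classify(fm)].append(fm)
    return dict(sequences)  # step 132: one feature sequence per type
```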
In yet another alternative embodiment of the present invention, step 14 may comprise:
step 141, respectively acquiring key frames in the at least two feature sequences according to a preset rule;
step 142, combining the acquired key frames to obtain a key frame sequence, wherein the frames in each key frame sequence contain only one data feature type.
In this embodiment, the preset rule is a rule, set in advance manually or otherwise, that determines which frames are acquired as key frames. For example, preset rule 1: the first N frames in a GOP (Group of Pictures, a structure used in MPEG video compression) are key frames, where N is a positive integer greater than or equal to 1; preset rule 2: the encoded frames compressed as full frames are key frames. The preset rules include, but are not limited to, preset rule 1 and preset rule 2.
The key frame sequence contains all the features in the video to be coded, and within the key frame sequence each frame characterizes a different feature.
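A minimal sketch of preset rule 1 follows; the GOP size and the value of N are illustrative parameters, and the function name is hypothetical.

```python
def select_key_frames(feature_sequence, gop_size=8, n=1):
    """Preset rule 1: the first n frames of every GOP are key frames.

    feature_sequence: list of feature maps of one feature type.
    gop_size and n are illustrative values, not fixed by the embodiment.
    Returns (key_frames, non_key_frames), both in original order."""
    key_frames, non_key_frames = [], []
    for i, frame in enumerate(feature_sequence):
        if i % gop_size < n:
            key_frames.append(frame)   # first n frames of this GOP
        else:
            non_key_frames.append(frame)
    return key_frames, non_key_frames
```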
In yet another alternative embodiment of the present invention, step 15 may comprise:
step 151, determining a target frame in the key frame sequence as an independently encoded frame;
step 152, processing the feature map of at least one non-target frame in the key frame sequence based on the feature map of the independently encoded frame to obtain at least one prediction feature map, wherein the at least one prediction feature map is in one-to-one correspondence with the at least one non-target frame;
step 153, for each prediction feature map, calculating a motion residual between the prediction feature map and a feature map of a key frame corresponding to the prediction feature map to obtain feature data of the key frame.
As shown in Fig. 2, in this embodiment a prediction neural network is first trained, so that the trained network can select a target frame from the key frame sequence, encode the target frame independently, predict the feature maps of the non-target frames from the feature map of the selected target frame, and calculate the residual between each predicted feature map and the corresponding original feature map. This finally yields the feature data of the key frames, which includes: the independently encoded target frame and the residual data of the non-target frames.
For example, frames 1 through N in Fig. 2 constitute the obtained key frame sequence, and feature matrix 1 is the feature matrix of the selected target frame. The target frame serves as the independently encoded frame, and feature matrix 1 is encoded independently in an intra-frame prediction manner. Feature matrix 1 is then input into prediction neural network 12 to generate prediction feature matrix 2, and the residual between prediction feature matrix 2 and feature matrix 2 is calculated. Similarly, for the residual of feature matrix N, feature matrix 1 only needs to be input into prediction neural network 1N to generate prediction feature matrix N, after which the residual between prediction feature matrix N and feature matrix N is calculated. In total, N-1 prediction neural networks are used, namely prediction neural network 12, prediction neural network 13, and so on up to prediction neural network 1N. Each prediction network operates on the whole feature matrix, i.e., the input and output of each deep prediction network are of size N×M, where N×M is the size of the feature matrix, but the parameters of each prediction network differ.
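The architecture of these networks is not disclosed, so the sketch below stands in with a small convolutional network in PyTorch; the class name, layer sizes, and the 32×32 matrix size are assumptions made purely for illustration.

```python
import torch
import torch.nn as nn

class FeaturePredictor(nn.Module):
    """Stand-in for one per-matrix prediction network (e.g. network 12).
    The real architecture and parameters are not specified in the text."""
    def __init__(self, channels=1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(channels, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(16, channels, kernel_size=3, padding=1),
        )

    def forward(self, x):
        # Input and output share the N x M feature-matrix size.
        return self.net(x)

# Feature matrix 1 (the independently encoded target frame) drives every prediction.
feature_matrix_1 = torch.randn(1, 1, 32, 32)   # stand-in data
predictor_12 = FeaturePredictor()              # each network has its own weights
prediction_2 = predictor_12(feature_matrix_1)  # prediction of feature matrix 2
feature_matrix_2 = torch.randn(1, 1, 32, 32)   # stand-in data
residual_2 = feature_matrix_2 - prediction_2   # motion residual to be coded
```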
The prediction neural network mentioned in the above embodiment is trained by the following method:
training step 1, acquiring a training key frame sequence;
training step 2, determining a training target frame in the training key frame sequence as a training independently encoded frame;
training step 3, processing the feature map of at least one non-training-target frame in the training key frame sequence based on the feature map of the training independently encoded frame to respectively obtain at least one prediction feature map, wherein the at least one prediction feature map is in one-to-one correspondence with the at least one non-training-target frame;
training step 4, for each prediction feature map, calculating a motion residual between the prediction feature map and the feature map of the training key frame corresponding to that prediction feature map, to obtain feature data of the training key frame.
When the prediction neural network is trained, at least two feature matrices can be obtained from each original image frame, so the encoder can easily obtain massive training data for training each prediction neural network.
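A training loop consistent with training steps 1 through 4 might look like the sketch below; the optimizer, learning rate, and the choice of mean-squared error on the residual are all assumptions, since the text does not specify a loss.

```python
import torch

def train_predictor(predictor, training_pairs, epochs=10, lr=1e-3):
    """training_pairs: iterable of (target_matrix, key_matrix) tensors,
    i.e. feature matrix 1 and the key frame matrix it should predict.
    All hyperparameters here are illustrative, not from the embodiment."""
    optimizer = torch.optim.Adam(predictor.parameters(), lr=lr)
    loss_fn = torch.nn.MSELoss()
    for _ in range(epochs):
        for target_matrix, key_matrix in training_pairs:
            optimizer.zero_grad()
            prediction = predictor(target_matrix)
            # Minimizing this loss minimizes the motion residual energy.
            loss = loss_fn(prediction, key_matrix)
            loss.backward()
            optimizer.step()
    return predictor
```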
In an optional embodiment of the present invention, step 15 may also perform inter-frame prediction by a method based on conventional motion compensation. As shown in Fig. 3, assuming that the I-frame sequence is the key frame sequence, after the key frame sequence is obtained, the key frame corresponding to feature matrix 1 is designated as the independently encoded frame, and the other key frames are compressed by the conventional motion-compensation-based inter-frame prediction method; that is, each block in feature matrix 2 and feature matrix 3 is motion-compensated based on a block in feature matrix 1. Due to the spatial correlation between feature matrix 1, feature matrix 2, and feature matrix 3, MV1 and MV2 in Fig. 3 are in fact direct copies from feature matrix 1, i.e., an in-place prediction with MV(X=0, Y=0). Finally, when encoding the feature data of the key frames, the prediction residual of each block in feature matrix 2 and feature matrix 3 is quantized and encoded. Intra prediction is used for feature matrix 1, and the prediction residual of each of its blocks is quantized and encoded. After encoding, the encoder has used intra-frame prediction for only one target frame; the other non-target frames in the key frame sequence are inter-predicted from the target frame. As shown in Fig. 3, feature matrix 1 contains the boundary of a small ball, while feature matrix 3 contains part of that boundary, so under motion-compensated prediction feature matrix 3 is compressed as "temporal" redundancy. Note that the different key frames have no actual temporal order; they only form a virtual temporal relationship after being arranged. In the original video stream to be encoded, all frames of a key frame video correspond to image samples at the same instant.
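Since MV1 and MV2 reduce to MV(X=0, Y=0), the motion compensation degenerates to a co-located block copy. The sketch below illustrates that degenerate case; the block size and the function name are illustrative choices.

```python
import numpy as np

def zero_mv_block_residuals(reference, frame, block=8):
    """Block-wise residual under MV(X=0, Y=0): each block of `frame`
    is predicted by the co-located block of `reference` (feature matrix 1).
    The block size of 8 is an assumption made for illustration."""
    height, width = frame.shape
    residual = np.zeros_like(frame)
    for y in range(0, height, block):
        for x in range(0, width, block):
            ref_block = reference[y:y + block, x:x + block]
            cur_block = frame[y:y + block, x:x + block]
            residual[y:y + block, x:x + block] = cur_block - ref_block
    return residual  # each block's residual is quantized and coded downstream
```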
In yet another alternative embodiment of the present invention, after step 141, the method may further include:
step 143, sorting the non-key frames in the at least two feature sequences to obtain a non-key frame sequence;
step 144, encoding the non-key frame sequence to obtain the feature data of the non-key frame.
In this embodiment, the non-key frame sequence is encoded by a video encoder. A non-key frame refers to any frame remaining after the key frames are acquired according to the preset rule, for example a P-frame or a B-frame, but is not limited thereto.
In still another alternative embodiment of the present invention, after step 15, the method may further include:
step 16, quantizing the feature data of the key frame and the feature data of the non-key frame simultaneously to obtain quantized data;
step 17, transforming the quantized data to obtain transformed data;
and step 18, inputting the transformed data into an entropy coder for processing to obtain code stream data of the video to be coded.
In this embodiment, the feature data of the key frames and the feature data of the non-key frames are first quantized simultaneously to obtain quantized data. The quantized data are then transformed, so that the errors of different features are concentrated in the same region and the transformed frequency-domain components are concentrated more at the low-frequency components. Finally, the transformed data are input to an entropy coder to obtain the code stream data of the video to be coded.
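Keeping the order described here (quantize, then transform, then entropy-code), a sketch with stand-in components might look as follows; the uniform quantizer step, the 1-D DCT, and the function name are assumptions, and a real entropy coder would consume the result.

```python
import numpy as np
from scipy.fft import dct

def prepare_entropy_coder_input(key_data, non_key_data, step=0.5):
    """Steps 16-18 in miniature: quantize both feature-data streams
    together, transform the quantized data, and return the result for
    the entropy coder. The quantizer and DCT are illustrative stand-ins."""
    # Step 16: quantize key-frame and non-key-frame feature data together.
    merged = np.concatenate([key_data.ravel(), non_key_data.ravel()])
    quantized = np.round(merged / step)
    # Step 17: transform; energy concentrates in the low-frequency components.
    transformed = dct(quantized, norm="ortho")
    # Step 18: this array would be fed to an entropy coder for the bitstream.
    return transformed
```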
Fig. 4 and Fig. 5 show a schematic diagram of the overall encoding and decoding process and a schematic diagram of the feature matrix transformation in the encoding process according to an embodiment of the present invention. As shown in Fig. 4, the original image sequentially undergoes feature extraction, feature video ordering, and I/P/B frame determination, and then passes through inter-frame prediction, feature I-frame ordering, and the feature predictor. After these operations, the data sequentially enter the quantizer, the transformer, and entropy encoding for storage or transmission, followed by entropy decoding, the inverse transformer, and the inverse quantizer, from which the non-selected feature I-frame prediction residuals, the selected feature I-frame, the feature P/B-frame prediction residuals, and the control parameters are respectively extracted. After the selected feature I-frame is extracted, it is further combined with the extracted non-selected feature I-frame prediction residuals to obtain the separated feature I-frames, which are then combined with the extracted feature P/B-frame prediction residuals and control parameters to obtain the reordered feature video, finally yielding the depth features. Fig. 5 further details the processes of feature extraction, feature video ordering, I/P/B frame determination, inter-frame prediction, feature I-frame ordering, and feature prediction.
The inter-frame prediction in Fig. 4 also includes a number of modules, such as a motion vector calculation module and a motion compensation residual calculation module, but is not limited thereto. The feature predictor in Fig. 4 may use either the deep-neural-network-based prediction compensation method or the conventional motion-compensation-based method of the above embodiments.
In the above embodiments of the present invention, at least two frames of a video to be encoded are acquired; features of the at least two frames are extracted to obtain at least four feature maps; the at least four feature maps are recombined to obtain at least two feature sequences; a key frame contained in each of the at least two feature sequences is determined to obtain a key frame sequence; and inter-frame prediction is performed according to the key frame sequence to obtain feature data of a key frame, the feature data being used for obtaining coded data of the video to be coded. Because the multiple feature frames corresponding to the same feature in the video data can be regarded as one video, the background can remain unchanged when inter-frame prediction is performed within such a same-feature video, which simplifies the inter-frame prediction process, increases the accuracy of the inter-frame prediction, and improves its efficiency.
Fig. 6 shows a schematic structural diagram of a video processing apparatus 60 according to an embodiment of the present invention. As shown in Fig. 6, the apparatus includes:
an acquisition module 61, configured to acquire at least two frames of a video to be encoded;
a processing module 62, configured to extract features of the at least two frames to obtain at least four feature maps; recombine the at least four feature maps to obtain at least two feature sequences; determine a key frame contained in each of the at least two feature sequences to obtain a key frame sequence; and perform inter-frame prediction according to the key frame sequence to obtain feature data of a key frame, wherein the feature data is used for obtaining coded data of the video to be coded.
Optionally, the processing module 62 is further configured to classify the at least four feature maps according to feature types,
and to combine feature maps of the same feature type into a feature sequence to obtain the at least two feature sequences.
Optionally, the processing module 62 is further configured to respectively acquire key frames in the at least two feature sequences according to a preset rule,
and to combine the acquired key frames to obtain a key frame sequence, wherein the frames in each key frame sequence contain only one data feature type.
Optionally, the processing module 62 is further configured to determine a target frame in the key frame sequence as an independently encoded frame;
to process the feature map of at least one non-target frame in the key frame sequence based on the feature map of the independently encoded frame to respectively obtain at least one prediction feature map, wherein the at least one prediction feature map is in one-to-one correspondence with the at least one non-target frame;
and, for each prediction feature map, to calculate the motion residual between the prediction feature map and the feature map of the key frame corresponding to that prediction feature map, to obtain the feature data of the key frame.
Optionally, the processing module 62 is further configured to sort the non-key frames in the at least two feature sequences to obtain a non-key frame sequence,
and to encode the non-key frame sequence to obtain the feature data of the non-key frames.
Optionally, the processing module 62 is further configured to quantize the feature data of the key frame and the feature data of the non-key frame simultaneously to obtain quantized data;
to transform the quantized data to obtain transformed data;
and to input the transformed data into an entropy coder for processing to obtain code stream data of the video to be coded.
It should be noted that this embodiment is an apparatus embodiment corresponding to the above method embodiment, and all the implementations in the above method embodiment are applicable to this apparatus embodiment, and the same technical effects can be achieved.
An embodiment of the present invention provides a non-volatile computer storage medium, where the computer storage medium stores at least one executable instruction, and the computer executable instruction may execute a video processing method in any of the above method embodiments.
Fig. 7 is a schematic structural diagram of a computing device according to an embodiment of the present invention, and a specific embodiment of the present invention does not limit a specific implementation of the computing device.
As shown in Fig. 7, the computing device may include: a processor, a communications interface, a memory, and a communications bus.
Wherein: the processor, the communication interface, and the memory communicate with each other via a communication bus. A communication interface for communicating with network elements of other devices, such as clients or other servers. And a processor for executing the program, and specifically may perform the relevant steps in the above-described video processing method embodiment for the computing device.
In particular, the program may include program code comprising computer operating instructions.
The processor may be a central processing unit (CPU), an application-specific integrated circuit (ASIC), or one or more integrated circuits configured to implement embodiments of the present invention. The computing device includes one or more processors, which may be of the same type, such as one or more CPUs, or of different types, such as one or more CPUs together with one or more ASICs.
The memory is used for storing programs. The memory may comprise high-speed RAM memory, and may also include non-volatile memory, such as at least one disk memory.
The program may in particular be adapted to cause a processor to perform the video processing method in any of the method embodiments described above. For specific implementation of each step in the program, reference may be made to corresponding steps and corresponding descriptions in units in the foregoing embodiments of the video processing method, which are not described herein again. It can be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described devices and modules may refer to the corresponding process descriptions in the foregoing method embodiments, and are not described herein again.
The algorithms or displays presented herein are not inherently related to any particular computer, virtual system, or other apparatus. Various general purpose systems may also be used with the teachings herein. The required structure for constructing such a system will be apparent from the description above. In addition, embodiments of the present invention are not directed to any particular programming language. It is appreciated that a variety of programming languages may be used to implement the teachings of embodiments of the present invention as described herein, and any descriptions of specific languages are provided above to disclose the best modes of embodiments of the invention.
In the description provided herein, numerous specific details are set forth. It is understood, however, that embodiments of the invention may be practiced without these specific details. In some instances, well-known methods, structures and techniques have not been shown in detail in order not to obscure an understanding of this description.
Similarly, it should be appreciated that in the foregoing description of exemplary embodiments of the invention, various features of the embodiments of the invention are sometimes grouped together in a single embodiment, figure, or description thereof for the purpose of streamlining the disclosure and aiding in the understanding of one or more of the various inventive aspects. However, the disclosed method should not be interpreted as reflecting an intention that: that is, the claimed embodiments of the invention require more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in less than all features of a single foregoing disclosed embodiment. Thus, the claims following the detailed description are hereby expressly incorporated into this detailed description, with each claim standing on its own as a separate embodiment of this invention.
Those skilled in the art will appreciate that the modules in the device in an embodiment may be adaptively changed and disposed in one or more devices different from the embodiment. The modules or units or components of the embodiments may be combined into one module or unit or component, and furthermore they may be divided into a plurality of sub-modules or sub-units or sub-components. All of the features disclosed in this specification (including any accompanying claims, abstract and drawings), and all of the processes or elements of any method or apparatus so disclosed, may be combined in any combination, except combinations where at least some of such features and/or processes or elements are mutually exclusive. Each feature disclosed in this specification (including any accompanying claims, abstract and drawings) may be replaced by alternative features serving the same, equivalent or similar purpose, unless expressly stated otherwise.
Furthermore, those skilled in the art will appreciate that while some embodiments herein include some features included in other embodiments, rather than other features, combinations of features of different embodiments are meant to be within the scope of the invention and form different embodiments. For example, in the following claims, any of the claimed embodiments may be used in any combination.
The various component embodiments of the invention may be implemented in hardware, or in software modules running on one or more processors, or in a combination thereof. Those skilled in the art will appreciate that a microprocessor or Digital Signal Processor (DSP) may be used in practice to implement some or all of the functionality of some or all of the components according to embodiments of the present invention. Embodiments of the invention may also be implemented as apparatus or device programs (e.g., computer programs and computer program products) for performing a portion or all of the methods described herein. Such programs implementing embodiments of the present invention may be stored on computer-readable media or may be in the form of one or more signals. Such a signal may be downloaded from an internet website or provided on a carrier signal or in any other form.
It should be noted that the above-mentioned embodiments illustrate rather than limit the invention, and that those skilled in the art will be able to design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. Embodiments of the invention may be implemented by means of hardware comprising several distinct elements, and by means of a suitably programmed computer. In the unit claims enumerating several means, several of these means may be embodied by one and the same item of hardware. The usage of the words first, second and third, etcetera do not indicate any ordering. These words may be interpreted as names. The steps in the above embodiments should not be construed as limiting the order of execution unless specified otherwise.