The present application claims priority to the Chinese patent application No. 202310813148.6, filed with the China National Intellectual Property Administration on July 4, 2023 and entitled "A video transcoding method and apparatus", the entire contents of which are incorporated herein by reference.
Detailed Description
Before describing the technical solutions provided by the application, some terms involved in the application are explained first, so as to facilitate understanding by those skilled in the art.
(1) Frame insertion, which means that new frames are generated between video frames of the original video sequence.
(2) Skip mode, which is a Prediction mode of a Prediction stage when a video frame is encoded, belongs to one of Inter Prediction (Inter Prediction) modes. In Skip mode, the coding block obtains a predicted pixel value through motion estimation, and directly uses the predicted pixel value as a reconstructed pixel value, so that the residual value is not required to be transmitted, and the method is generally used under the condition that the predicted pixel value is relatively accurate.
(3) Intra mode, which is a Prediction mode of a Prediction stage when a video frame is encoded, belongs to an Intra Prediction (Intra Prediction) mode. In Intra mode, the coded block obtains predicted pixel values in the current frame using surrounding reference pixels without referencing coded blocks of other frames.
(4) Video transcoding (Video Transcoding) refers to converting a video stream that has been compression encoded into another video stream to accommodate different network bandwidths, different terminal processing capabilities, and different user requirements.
Embodiments of the present application will be described in detail below with reference to the accompanying drawings.
In the following, possible application scenarios to which the video processing method provided by the present application is applicable are described. It should be noted that these descriptions are intended to facilitate understanding by those skilled in the art, and are not intended to limit the scope of the present application.
Fig. 1 schematically illustrates one possible application scenario to which the embodiment of the present application is applicable. As shown in fig. 1, the application scenario may include a client 100 and a server 200. Optionally, the terminal device where the client 100 is located and the server 200 may be connected by communication through one or more networks. The network may be a wired network or a wireless network; for example, the wireless network may be a wireless fidelity (Wi-Fi) network, a mobile cellular network, or another type of network, which is not limited in the embodiments of the present application.
Wherein the client 100 may be installed on a terminal device for providing a video service (such as a live broadcast service, a video recording service, a live broadcast viewing service, or a video viewing service, etc.) to a user. Optionally, in the video transcoding scenario based on the end cloud collaboration, after the client 100 collects the relevant video of the user, the relevant video of the user is encoded into a corresponding code stream and transmitted to the server 200, so that the server 200 performs transcoding operation on the relevant video of the user. For example, after the server 200 obtains the relevant video of the user, the relevant video of the user may be subjected to a time-domain downsampling operation, that is, some video frames included in the relevant video are discarded by the server, for example, the server may discard the first target video frame (or a video frame that may be understood to belong to the discarded frame) included in the relevant video. The first target video frame may refer to a video frame that is more suitable as a frame to be interpolated (alternatively may be referred to as a relatively easily interpolated video frame).
Alternatively, the terminal device may also be referred to as a terminal, a User Equipment (UE), an access terminal device, a vehicle terminal, an industrial control terminal, a UE unit, a UE station, a Mobile Station (MS), a Mobile Terminal (MT), a remote station, a remote terminal device, a mobile device, a UE terminal device, a wireless communication device, a UE agent, or a UE apparatus, etc. In the embodiment of the present application, the terminal device may be fixed in position or mobile, which is not limited in the implementation of the present application.
Illustratively, the terminal device may be a mobile phone, a tablet (Pad), a subscriber unit, a cellular phone, a smart phone, a wireless data card, a personal digital assistant (PDA) computer, a wireless modem, a handheld device (handset), a laptop computer, a computer with a wireless transceiver function, a virtual reality (VR) terminal device, an augmented reality (AR) terminal device, a wireless terminal in industrial control, a terminal device capable of sidelink communication (such as a vehicle-mounted terminal device or a handheld terminal capable of V2X communication), a wireless terminal in self driving, a wireless terminal in a remote medical system, a wireless terminal in a smart grid, a wireless terminal in transportation safety, a wireless terminal in a smart city, a wireless terminal in a smart home, a machine-type communication device, a robot, or the like. The embodiment of the application does not limit the specific technology and the specific device form adopted by the terminal device.
Illustratively, the client 100 may include an acquisition module (or may be referred to as a video acquisition module), an encoding module (or may be referred to as a video encoding module), a decoding module (or may be referred to as a video decoding module), and a video frame inserting module. It should be noted that, the connection relationship between the respective functional modules in the client 100 illustrated in fig. 1 is only an example, and does not limit the present application. The functions of the respective functional modules included in the client 100 are described below, respectively.
And the acquisition module is used for providing video acquisition service for the user. For example, the acquisition module is used for acquiring videos generated by live broadcasting of a user or videos recorded by the user, and sending relevant videos of the user to the encoding module.
The encoding module is used for encoding the video acquired by the acquisition module to obtain a plurality of encoded video frames (or called pictures or images) corresponding to the video and encoding information corresponding to the video frames respectively. The encoding module may then generate a corresponding code stream according to the plurality of encoded video frames and the encoding information corresponding to the plurality of video frames, and may send the code stream to the server 200. Optionally, the encoded plurality of video frames and the encoding information corresponding to the plurality of video frames are carried in the code stream. For example, taking the acquisition module acquiring the video a as an example, where the video a includes 5 video frames (such as the video frame A1, the video frame A2, the video frame A3, the video frame A4 and the video frame A5), the encoding module may encode the 5 video frames included in the video a respectively, so as to obtain 5 encoded video frames and encoding information corresponding to the 5 video frames respectively. Then, the encoding module may generate a corresponding code stream according to the 5 encoded video frames and the encoding information corresponding to the 5 video frames, and send the code stream to the server 200. Optionally, the 5 encoded video frames and the encoding information corresponding to the 5 video frames are carried in the code stream. Illustratively, at least one of block partition information, motion vector (MV) information, prediction mode, position information of a reference video frame, block residual information, or the like may be included in the encoding information corresponding to any one of the video frames.
A decoding module, configured to decode a code stream (such as a standard code stream or a private code stream) from the server 200. For example, the decoding module decodes the standard code stream from the server side 200 to obtain a plurality of video frames belonging to the normal encoded frames, and the decoding module decodes the private code stream from the server side 200 to obtain attribute information of at least one video frame belonging to the discarded frames.
The video frame inserting module is used for generating at least one frame to be inserted according to the attribute information of each video frame belonging to the discarded frame and the video frame belonging to the normal coding frame, which is near the playing position of each video frame belonging to the discarded frame, obtained by decoding by the decoding module, and generating a complete video according to the at least one frame to be inserted and a plurality of video frames belonging to the normal coding frame.
Server side 200 may refer to a device (e.g., a cloud server or cloud computing device) that provides services for a user to perform corresponding operations on client 100. The server side 200 may be a cloud server (or may be referred to as a cloud, server side, or cloud computing device) for providing cloud computing services such as cloud services, cloud computing, cloud storage, cloud communication, network services, security services, and big data, or may be a general data center or server or other form of computing device.
Illustratively, the server side 200 may include a decoding module, a video characteristic analysis module, a time domain downsampling module, an insertion frame quality feedback module, and an encoding module. It should be noted that, the connection relationship between the functional modules in the server side 200 illustrated in fig. 1 is only an example, and is not meant to limit the present application. The functions of the respective functional modules included in the server side 200 are described below, respectively.
The decoding module is configured to decode the code stream from the client 100 to obtain a plurality of video frames and encoding information corresponding to the video frames.
And the video characteristic analysis module is used for analyzing the coding information corresponding to the video frames respectively to obtain a video frame analysis result. Optionally, the video frame analysis result includes an attribute of each video frame in the plurality of video frames (i.e., whether the video frame belongs to a dropped frame or a normal encoded frame). Illustratively, the video frame analysis results may include, but are not limited to, which video frames belong to dropped frames (or which video frames are marked as dropped frames or which video frames have dropped attributes) or which video frames belong to normally encoded frames (or which video frames are marked as normally encoded frames or which video frames have reserved attributes, i.e., video frames that need to be reserved for subsequent encoding), and so forth. For example, continuing to take the video a as an example, the video characteristic analysis module obtains the video frame analysis result corresponding to the video a after performing analysis statistics on the coding information corresponding to each of the 5 video frames included in the video a. Illustratively, the video frame analysis result corresponding to the video a includes that the video frame A3 and the video frame A5 belong to discarded frames and that the video frame A1, the video frame A2 and the video frame A4 belong to normal encoding frames.
Optionally, the video characteristic analysis module is further configured to determine, for each video frame belonging to the dropped frames, motion vector (MV) information of a plurality of coding blocks corresponding to the video frame according to the prediction modes of the plurality of coding blocks corresponding to the video frame.
The time domain downsampling module is used for discarding the video frames belonging to the discarded frames according to the video frame analysis result of the video characteristic analysis module and reserving the video frames belonging to the normal coding frames, so that the quality of video insertion frames can be ensured. For example, continuing with the above video a as an example, after acquiring the attributes of the 5 video frames included in the video a, the time-domain downsampling module may discard the video frame A3 and the video frame A5 and reserve the video frame A1, the video frame A2 and the video frame A4 according to the attributes of the 5 video frames.
The frame insertion quality feedback module is used for providing feedback to the video characteristic analysis module, so as to change the strategy by which the video characteristic analysis module influences other modules. Because a video processing scheme with fixed settings may not have good robustness, dynamically adjusting the scheme policy (such as the frame dropping policy) according to the actual situation of the encountered video sequence during execution allows the video processing scheme to achieve a better processing effect in most scenes. Optionally, each time it determines that the decoding module has decoded a number of video frames meeting a first number threshold (for example, 50, an integer multiple of 50, or another number of video frames), the frame insertion quality feedback module may generate, according to the motion vector information of one video frame belonging to the dropped frames and the two video frames belonging to the normal encoded frames located adjacent to and before and after that dropped video frame, a frame to be inserted corresponding to that dropped video frame. Then, the frame insertion quality feedback module may determine a peak signal-to-noise ratio (PSNR) corresponding to the frame to be inserted according to the frame to be inserted and the video frame belonging to the dropped frames.
In one example, when the peak signal-to-noise ratio is greater than or equal to a set threshold, the frame insertion quality feedback module decreases a second threshold (e.g., th_intra) and sends the adjusted second threshold to the video characteristic analysis module. The video characteristic analysis module needs to use a first threshold and a second threshold when analyzing the coding information corresponding to the plurality of video frames respectively. In another example, when the peak signal-to-noise ratio is greater than or equal to the set threshold, the frame insertion quality feedback module generates indication information and sends the indication information to the video characteristic analysis module. The indication information is used for indicating to decrease the second threshold. In yet another example, when the peak signal-to-noise ratio is less than the set threshold and the number of video frames belonging to the dropped frames among the first-number-threshold video frames is less than a second number threshold, the frame insertion quality feedback module increases the second threshold (e.g., th_intra) and decreases the first threshold (e.g., th_skip), and sends the decreased first threshold and the increased second threshold to the video characteristic analysis module. Wherein the second number threshold is smaller than the first number threshold. In yet another example, when the peak signal-to-noise ratio is less than the set threshold and the number of video frames belonging to the dropped frames among the first-number-threshold video frames is less than the second number threshold, the frame insertion quality feedback module generates indication information and sends the indication information to the video characteristic analysis module. The indication information is used for indicating to decrease the first threshold and increase the second threshold.
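To make the feedback loop concrete, the following is a minimal Python sketch of the threshold adjustment described above. The function name, the adjustment step size, and the use of a single step value are assumptions for illustration only; the direction of each adjustment follows the examples above.

```python
def adjust_thresholds(psnr, dropped_count, th_skip, th_intra,
                      psnr_threshold, second_number_threshold, step=0.05):
    """Hypothetical sketch of the frame insertion quality feedback logic.

    psnr: PSNR between the trial frame to be inserted and the actual dropped frame.
    dropped_count: number of dropped frames among the last first-number-threshold frames.
    step: adjustment step size (an assumption; not specified in the text).
    """
    if psnr >= psnr_threshold:
        # Interpolation quality is acceptable: decrease the second threshold (th_intra).
        th_intra -= step
    elif dropped_count < second_number_threshold:
        # Quality is below the set threshold and few frames were dropped:
        # decrease the first threshold (th_skip) and increase the second threshold (th_intra).
        th_skip -= step
        th_intra += step
    # The new thresholds are then sent to the video characteristic analysis module.
    return th_skip, th_intra
```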
The encoding module is configured to perform a normal encoding operation on the plurality of video frames belonging to the normal encoded frames, obtain a standard code stream, and transmit the standard code stream to the client 100. Illustratively, the standard code stream may carry the plurality of encoded video frames belonging to the normal encoded frames. It can be appreciated that the video (or a video sequence including a plurality of video frames) is changed into a new video sequence with a smaller number of video frames after the time-domain downsampling operation, and the new video sequence forms a standard code stream after standard encoding and is transmitted to the client 100. In addition, the encoding module is further configured to perform an encoding operation on the attribute information of at least one video frame belonging to the dropped frames, obtain a private code stream, and transmit the private code stream to the client 100.
Optionally, the video processing method provided by the embodiment of the application can be applied to live transcoding scenes based on end cloud cooperation, and can also be applied to other transcoding scenes (such as video recording transcoding scenes, video on demand transcoding scenes or video broadcasting transcoding scenes).
It should be noted that fig. 1 only schematically provides one possible application scenario, and the schematic application scenario is for more clearly describing the technical solution of the embodiment of the present application, and is not limited to the application scenario configuration of the video processing method provided by the present application. The form and number of the respective modules in the application scenario shown in fig. 1 are merely examples, and do not limit the present application. Moreover, the names of the modules in the application scenario shown in fig. 1 are only an example, and the names of the modules in the specific implementation may be other names, which is not limited in the present application. In addition, as a person skilled in the art can know, with the appearance of a new application scenario, the technical scheme provided by the embodiment of the application is also applicable to similar technical problems.
The following describes in detail a specific implementation of the video processing method according to the embodiment of the present application based on the application scenario illustrated in fig. 1.
Fig. 2 is a schematic flow chart of a video processing method according to an embodiment of the present application. The method may be, but is not limited to being, applied to the application scenario illustrated in fig. 1. The method flow may be implemented by data interaction by a plurality of video processing devices (e.g., a first video processing device and a second video processing device). Alternatively, the first video processing device may be a server side (such as a cloud server or a data center) or a component capable of supporting functions required by the server side to implement the method (such as a plug-in, a component, a chip or a circuit, etc.), and the second video processing device may be a client side or a component capable of supporting functions required by the client side to implement the method (such as a plug-in, a component, a chip or a circuit, etc.). Illustratively, the first video processing device may be the server side 200 illustrated in fig. 1. For example, the server 200 may be a certain computing device with a video processing function or a certain computing device cluster with a video processing function, and the second video processing apparatus may be the client 100 illustrated in fig. 1, where the client 100 has a video processing function. For example, the terminal device on which the client 100 is installed may be a certain computing device. For example, taking the server side 200 as a certain computing device with a video processing function as an example, the first video processing apparatus may be the computing device with the video processing function, or may be a separate unit embedded in the computing device for implementing the video processing function, or may be a functional component (such as a chip) encapsulated in the computing device for implementing the video processing function. In order to facilitate the description of the technical solution provided by the embodiments of the present application, a flow of implementing a video processing method by performing data interaction between a first video processing device and a second video processing device is described below by taking the first video processing device as a server side and taking the second video processing device as a client side as an example. As shown in fig. 2, the method includes:
Step 201, the server decodes the first code stream to obtain a plurality of video frames and encoding information of the plurality of video frames included in the first video.
Optionally, after the server side obtains the first code stream, the server side may decode the first code stream to obtain a plurality of video frames and encoding information of the plurality of video frames included in the first video. The first code stream may be transmitted by a user through a certain client, or may be transmitted by a user through a certain device.
Illustratively, the first code stream may carry a plurality of video frames included in the first video and encoding information of the plurality of video frames. Optionally, the first code stream may further carry location information of a plurality of video frames included in the first video. For example, the position information of any one video frame may include information that may characterize a play position, such as a play order (or may be referred to as a play sequence number), a play time, or a play index of the video frame in the first video. Illustratively, the coding information of any video frame may include a prediction mode (such as an inter prediction mode or an intra prediction mode) of a plurality of coding blocks corresponding to the video frame, block partition information, block residual information, positions of the plurality of coding blocks in the video frame, and the like. Optionally, the coding information of one or several coding blocks in the plurality of coding blocks corresponding to the video frame may further include motion vector information of the one or several coding blocks and position information of the reference video frame of the one or several coding blocks. The reference video frame of a certain coding block may refer to a video frame to which motion vector information of the coding block needs to be referred. In other words, there are one or more encoded blocks in the reference video frame that are compared to the pixel values of the encoded block. In one example, when the prediction mode of a certain coding block is an inter prediction mode, the motion vector information of the coding block and the position information of the reference video frame of the coding block may be included in the coding information of the coding block. In another example, when the prediction mode of a certain coding block is an intra prediction mode, the motion vector information of the coding block and the position information of the reference video frame of the coding block are not included in the coding information of the coding block.
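For illustration only, the per-block encoding information described above can be pictured as a simple container such as the following Python sketch; all field names are assumptions rather than a normative bitstream layout, and the motion vector and reference-frame fields are present only for inter-predicted blocks, as stated above.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class CodingBlockInfo:
    """Illustrative (assumed) container for the coding information of one coding block."""
    prediction_mode: str                              # e.g. "skip", "inter" or "intra"
    block_position: Tuple[int, int]                   # top-left corner of the block in the frame
    block_size: Tuple[int, int]                       # width and height of the block
    residual: Optional[bytes] = None                  # block residual information
    motion_vector: Optional[Tuple[int, int]] = None   # only for inter-predicted blocks
    reference_frame_position: Optional[int] = None    # play position of the reference video frame
```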
As users' requirements on video image quality become higher and higher, in order to meet users' pursuit of a smoother and more realistic video experience, and to solve the problem that a low frame rate causes an obvious sense of stuttering in some intense motion scenes and affects the visual experience, the embodiment of the application relies on a video processing technology at the server side. In addition, by means of the video processing technology at the server side, the smoothness of video pictures can be improved, so that the audio and video experience of users is improved, while the cost of introducing professional equipment and manual processing can be reduced.
Taking a live transcoding scenario as an example, the first code stream comes from a certain client (such as a short video client). In order to ensure that the live video can be watched clearly and smoothly by a user in scenarios such as a weak network environment or network stuttering, the client may encode the collected live video to form the first code stream. The first code stream carries a plurality of video frames included in the live video and the encoding information of the plurality of video frames. Then, the client may transmit the first code stream to the server, so that the server performs corresponding processing on the first code stream.
Optionally, when the client encodes the collected live video, the video frame may be divided into a plurality of blocks for each video frame included in the live video. Then, the client can encode a plurality of blocks corresponding to the video frame to obtain a plurality of encoding blocks corresponding to the video frame and encoding information of the encoding blocks. Thus, the first code stream actually carries the coding information of the plurality of coding blocks corresponding to each video frame and the plurality of coding blocks corresponding to each video frame.
For example, taking live video collected by the client as video 1, where video 1 includes 4 video frames (such as video frame 1, video frame 2, video frame 3, and video frame 4), it is assumed that the client divides video frame 1 into 2 blocks (such as block 11 and block 12), video frame 2 into 3 blocks (such as block 21, block 22, and block 23), video frame 3 into 3 blocks (such as block 31, block 32, and block 33), and video frame 4 into 2 blocks (such as block 41 and block 42). Then, for the video frame 1, the client may encode the block 11 and the block 12 corresponding to the video frame 1, respectively, to obtain the encoded block 11, the encoded block 12, the encoding information of the encoded block 11, and the encoding information of the encoded block 12. For video frame 2, the client may encode block 21, block 22, and block 23 corresponding to video frame 2, respectively, to obtain encoded block 21, encoded block 22, encoded block 23, encoding information of encoded block 21, encoding information of encoded block 22, and encoding information of encoded block 23. For video frame 3, the client may encode block 31, block 32, and block 33 corresponding to video frame 3, respectively, to obtain encoded block 31, encoded block 32, encoded block 33, encoding information of encoded block 31, encoding information of encoded block 32, and encoding information of encoded block 33. For the video frame 4, the client may encode the block 41 and the block 42 corresponding to the video frame 4, respectively, to obtain the encoded block 41, the encoded block 42, the encoding information of the encoded block 41, and the encoding information of the encoded block 42. The client may then send the first code stream to the server. Optionally, the first code stream carries coding block 11, coding block 12, coding information of coding block 11 and coding information of coding block 12 corresponding to video frame 1, coding block 21, coding block 22, coding block 23, coding information of coding block 21, coding information of coding block 22 and coding information of coding block 23 corresponding to video frame 2, coding block 31, coding block 32, coding block 33, coding information of coding block 31, coding information of coding block 32 and coding information of coding block 33 corresponding to video frame 3, and coding block 41, coding block 42, coding information of coding block 41 and coding information of coding block 42 corresponding to video frame 4.
Step 202, the server determines at least one first target video frame included in the plurality of video frames according to the encoding information of the plurality of video frames.
After receiving the first code stream, the server decodes the first code stream to obtain a plurality of coding blocks corresponding to each video frame in the plurality of video frames and coding information of the plurality of coding blocks corresponding to each video frame. Optionally, when the first code stream also carries the position information of each video frame in the plurality of video frames, the server decodes the first code stream, and may also obtain the position information of each video frame in the plurality of video frames in the first video. Then, for any video frame (such as the first video frame) in the plurality of video frames, the server side may determine, according to the encoding information of the plurality of encoding blocks corresponding to the first video frame, an attribute of the first video frame, that is, whether the first video frame is the first target video frame or the second target video frame (or may also be understood as whether the first video frame has a discarding characteristic (i.e. is relatively suitable to be discarded as a frame to be inserted) or has a retaining characteristic). Wherein a first target video frame may refer to a video frame that needs to be discarded and a second target video frame (or a video frame that may be considered to belong to a normally encoded frame) may refer to a video frame that needs to be preserved.
Optionally, the server may determine the texture complexity index value of the first video frame according to the encoding information of the first video frame. The first video frame is any one of the plurality of video frames. Then, the server side may determine whether the first video frame is the first target video frame by determining whether the texture complexity index value of the first video frame meets the texture complexity threshold. When the texture complexity index value of the first video frame meets the texture complexity threshold, the server side may determine that the first video frame is the first target video frame. Illustratively, the texture complexity index may include, but is not limited to, the block partition depth, the block residual, the sum of the absolute values of the motion vector abscissas of all coding blocks in the frame, the proportion of coding blocks using discrete cosine transform type 2 (DCT2), the first ratio, or the second ratio, etc.
For example, a video frame with a smaller block partition depth is more suitable as the first target video frame, a video frame with a smaller sum of absolute values of the motion vector abscissas of all its blocks is more suitable as the first target video frame, and a video frame with a smaller block residual is more suitable as the first target video frame. In addition, the transform schemes of the coding blocks may also be used to determine which video frames are more suitable as the first target video frame; for example, a frame with a higher proportion of blocks using DCT2 is theoretically more suitable for frame interpolation. It should be noted that the texture complexity indices may be combined with each other to make a decision, or some special processing may be performed on the texture complexity indices first, and the resulting indices may then be used to measure whether a video frame is more suitable for frame interpolation. In summary, from the perspective of information theory, the simpler the texture information of an image, the easier the frame is to interpolate; by analogy, comparing an image containing only a circle with a landscape painting, the image with the simple circle texture is easier to interpolate and generate.
Based on the foregoing, taking the example that the texture complexity threshold includes the first threshold and/or the second threshold, the implementation process of determining, by the server side, the attribute that the first video frame has is described in the following several possible implementations.
In a first implementation manner, when a first ratio corresponding to the first video frame is greater than a first threshold (for example, th_skip) and/or a second ratio corresponding to the first video frame is less than a second threshold (for example, th_intra), the server side may determine that the first video frame is the first target video frame.
The first ratio may be used to indicate a pixel number ratio of an area where a coded block in the first video frame with a prediction mode being an inter prediction mode (such as Skip mode) is located, and the second ratio may be used to indicate a pixel number ratio of an area where a coded block in the first video frame with a prediction mode being an Intra prediction mode (such as Intra mode) is located.
The following describes how the server side calculates the first ratio and the second ratio corresponding to the first video frame. The server side may determine a first number of pixels included in the region where the coding blocks whose prediction mode is the inter prediction mode are located in the first video frame, and may determine a second number of pixels included in the region where the coding blocks whose prediction mode is the intra prediction mode are located in the first video frame. The server may then take the ratio of the first number to the total number of pixels included in the first video frame as the first ratio, and the ratio of the second number to the total number of pixels included in the first video frame as the second ratio.
In the second implementation manner, when the first ratio corresponding to the first video frame is smaller than or equal to a first threshold (for example, th_skip), or the second ratio corresponding to the first video frame is larger than or equal to a second threshold (for example, th_intra), or the first ratio corresponding to the first video frame is larger than the first threshold and the second ratio corresponding to the first video frame is larger than or equal to the second threshold, the server side may determine that the first video frame is the second target video frame.
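As a rough illustration of the first and second implementations, the following Python sketch computes the two ratios and classifies a frame. The function name and the representation of each block as a (prediction_mode, pixel_count) pair are assumptions, and the sketch uses the combined variant in which a frame is treated as a first target video frame only when both conditions hold, consistent with the second implementation above.

```python
def classify_frame(blocks, th_skip, th_intra):
    """Classify a decoded frame as a first target (droppable) or second target
    (retained) video frame from its per-block prediction modes.

    blocks: assumed list of (prediction_mode, pixel_count) pairs, where
            prediction_mode is "skip"/"inter" or "intra".
    """
    total_pixels = sum(pixels for _, pixels in blocks)
    inter_pixels = sum(pixels for mode, pixels in blocks if mode in ("skip", "inter"))
    intra_pixels = sum(pixels for mode, pixels in blocks if mode == "intra")

    first_ratio = inter_pixels / total_pixels    # share of pixels in inter-predicted blocks
    second_ratio = intra_pixels / total_pixels   # share of pixels in intra-predicted blocks

    if first_ratio > th_skip and second_ratio < th_intra:
        return "first_target"    # suitable to discard and later interpolate
    return "second_target"       # retained and normally encoded
```

For instance, applying this sketch to the video frame 2 example below, with blocks (Skip, b1), (Intra, b2) and (Skip, b3), reproduces the ratios (b1+b3)/(b1+b2+b3) and b2/(b1+b2+b3).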
Illustratively, based on the first and second implementations described above, and taking the above-described video 1 as an example, the implementation procedure in which the server side determines the attribute of each video frame included in the video 1 is described below.
Regarding the video frame 1, the prediction mode of the coding block 11 corresponding to the video frame 1 is assumed to be Skip mode, and the prediction mode of the coding block 12 is assumed to be Intra mode. For the coding block 11, the server side may count the number of pixels a1 included in the region corresponding to the coding block 11 in the video frame 1. For the coding block 12, the server side may count the number of pixels a2 included in the region corresponding to the coding block 12 in the video frame 1. Then, the server side may calculate the first ratio corresponding to the video frame 1 as a1/(a1+a2), and may calculate the second ratio corresponding to the video frame 1 as a2/(a1+a2).
In one example, when the first ratio a1/(a1+a2) corresponding to the video frame 1 is greater than th_skip and/or the second ratio a2/(a1+a2) corresponding to the video frame 1 is less than th_intra, the server may determine that the video frame 1 is the first target video frame. In another example, when the first ratio a1/(a1+a2) corresponding to the video frame 1 is less than or equal to th_skip, or the second ratio a2/(a1+a2) corresponding to the video frame 1 is greater than or equal to th_intra, or the first ratio a1/(a1+a2) corresponding to the video frame 1 is greater than th_skip and the second ratio a2/(a1+a2) corresponding to the video frame 1 is greater than or equal to th_intra, the server side may determine that the video frame 1 is the second target video frame.
Regarding the video frame 2, the prediction mode of the coding block 21 corresponding to the video frame 2 is assumed to be Skip mode, the prediction mode of the coding block 22 is assumed to be Intra mode, and the prediction mode of the coding block 23 is assumed to be Skip mode. For the coding block 21, the server side may count the number b1 of pixels included in the region corresponding to the coding block 21 in the video frame 2. For the coding block 22, the server side may count the number of pixels b2 included in the region corresponding to the coding block 22 in the video frame 2. For the coding block 23, the server side may count the number of pixels b3 included in the region corresponding to the coding block 23 in the video frame 2. Then, the server side may calculate that the number of pixels included in the region corresponding to the coding blocks in the Skip mode in the video frame 2 is (b1+b3), and may calculate that the number of pixels included in the region corresponding to the coding block in the Intra mode in the video frame 2 is b2. Then, the server may calculate the first ratio corresponding to the video frame 2 as (b1+b3)/(b1+b2+b3), and may calculate the second ratio corresponding to the video frame 2 as b2/(b1+b2+b3).
In one example, when the first ratio (b1+b3)/(b1+b2+b3) corresponding to the video frame 2 is greater than th_skip and/or the second ratio b2/(b1+b2+b3) corresponding to the video frame 2 is less than th_intra, the server side may determine the video frame 2 as the first target video frame. In another example, when the first ratio (b1+b3)/(b1+b2+b3) corresponding to the video frame 2 is smaller than or equal to th_skip, or the second ratio b2/(b1+b2+b3) corresponding to the video frame 2 is larger than or equal to th_intra, or the first ratio (b1+b3)/(b1+b2+b3) corresponding to the video frame 2 is larger than th_skip and the second ratio b2/(b1+b2+b3) corresponding to the video frame 2 is larger than or equal to th_intra, the server side may determine the video frame 2 as the second target video frame.
For the video frame 3, the prediction mode of the coding block 31 corresponding to the video frame 3 is Intra mode, the prediction mode of the coding block 32 is Intra mode, and the prediction mode of the coding block 33 is Skip mode. For the coding block 31, the server side may count the number of pixels c1 included in the corresponding region of the coding block 31 in the video frame 3. For the coding block 32, the server side may count the number of pixels c2 included in the corresponding region of the coding block 32 in the video frame 3. For the coding block 33, the server side may count the number of pixels c3 included in the corresponding region of the coding block 33 in the video frame 3. Then, the server may calculate that the number of pixels included in the region corresponding to the coding block in the Skip mode in the video frame 3 is c3, and may calculate that the number of pixels included in the region corresponding to the coding blocks in the Intra mode in the video frame 3 is (c1+c2). Then, the server side may calculate a first ratio corresponding to the video frame 3 as c3/(c1+c2+c3), and may calculate a second ratio corresponding to the video frame 3 as (c1+c2)/(c1+c2+c3).
In one example, when the first ratio c3/(c1+c2+c3) corresponding to the video frame 3 is greater than th_skip and/or the second ratio (c1+c2)/(c1+c2+c3) corresponding to the video frame 3 is less than th_intra, the server side may determine the video frame 3 as the first target video frame. In another example, when the first ratio c3/(c1+c2+c3) corresponding to the video frame 3 is smaller than or equal to th_skip, or the second ratio (c1+c2)/(c1+c2+c3) corresponding to the video frame 3 is larger than or equal to th_intra, or the first ratio c3/(c1+c2+c3) corresponding to the video frame 3 is larger than th_skip and the second ratio (c1+c2)/(c1+c2+c3) corresponding to the video frame 3 is larger than or equal to th_intra, the server side may determine that the video frame 3 is the second target video frame.
For the video frame 4, the prediction mode of the coding block 41 corresponding to the video frame 4 is Intra mode, and the prediction mode of the coding block 42 is Skip mode. For the coding block 41, the server side may count the number d1 of pixels included in the corresponding region of the coding block 41 in the video frame 4. For the coding block 42, the server side may count the number d2 of pixels included in the corresponding region of the coding block 42 in the video frame 4. Then, the server side may calculate a first ratio corresponding to the video frame 4 as d2/(d1+d2), and may calculate a second ratio corresponding to the video frame 4 as d1/(d1+d2).
In one example, the server may determine that the video frame 4 is the first target video frame when the first ratio d2/(d1+d2) corresponding to the video frame 4 is greater than th_skip and/or the second ratio d1/(d1+d2) corresponding to the video frame 4 is less than th_intra. In another example, when the first ratio d2/(d1+d2) corresponding to the video frame 4 is smaller than or equal to th_skip, or the second ratio d1/(d1+d2) corresponding to the video frame 4 is larger than or equal to th_intra, or the first ratio d2/(d1+d2) corresponding to the video frame 4 is larger than th_skip and the second ratio d1/(d1+d2) corresponding to the video frame 4 is larger than or equal to th_intra, the server side may determine that the video frame 4 is the second target video frame.
Step 203, the server sends the second code stream to the client. The client receives the second code stream.
Illustratively, a plurality of second target video frames may be carried in the second bitstream.
Optionally, after determining the attribute of each video frame in the plurality of video frames included in the first video, the server side may perform a time domain downsampling operation on the first video according to the attribute of each video frame in the plurality of video frames, that is, discard at least one first target video frame included in the first video, and reserve a plurality of second target video frames included in the first video. The reserved second target video frames can be used as a new video sequence to form, through a normal encoding operation, a code stream to be transmitted to the client, and the attribute information of the discarded first target video frames can also be used to form, through an encoding operation, another code stream to be transmitted to the client.
In one possible implementation, after encoding the plurality of second target video frames to obtain a second code stream, the server may send the second code stream to the client.
In another possible implementation manner, in order to assist the client in generating the frame to be inserted more accurately, so as to ensure the quality of the frame to be inserted, the server side may send the second code stream to the client and may also send the third code stream to the client. The third code stream may carry attribute information of at least one first target video frame. For example, the attribute information of any one of the first target video frames may include, but is not limited to, motion vector information of a plurality of encoded blocks corresponding to the first target video frame, position information of the first target video frame in the first video, and the like. Optionally, the attribute information of any one of the first target video frames may further include a pixel number ratio of an area where a coding block in which a prediction mode is an inter-frame prediction mode in the first target video frame is located, a pixel number ratio of an area where a coding block in which a prediction mode is an intra-frame prediction mode in the first target video frame is located, a first threshold or a second threshold, and the like. Optionally, the third code stream may also carry a frame dropping policy of the server side for the first video, or may also carry a dropping rule (or may be called a dropping policy) of at least one first target video frame. For example, the frame dropping policy may include information about how to drop at least one first target video frame.
The implementation process of the server side sending the second code stream to the client side is described below.
The server side may encode the plurality of second target video frames that remain to form a second code stream. The server side may then send a second code stream to the client side. Illustratively, the encoded plurality of second target video frames are carried in a second bitstream. Optionally, the second code stream may further carry position information of the plurality of second target video frames in the first video, respectively.
For example, continue taking video 1 as an example. In one example, assume that video frame 1 included in video 1 is a second target video frame, video frame 2 is a first target video frame, video frame 3 is a second target video frame, and video frame 4 is a second target video frame. The server side may take the remaining video frame 1, video frame 3, and video frame 4 as new video sequences, and may encode the new video sequences into a standard code stream (i.e., a second code stream) using standard encoding techniques. The server side may then send the standard code stream to the client side. Optionally, the standard code stream carries the encoded video frame 1, the encoded video frame 3 and the encoded video frame 4.
In another example, assume that video frame 1 included in video 1 is a second target video frame, video frame 2 is a first target video frame, video frame 3 is a first target video frame, and video frame 4 is a second target video frame. The server side may take the remaining video frames 1 and 4 as new video sequences and may encode the new video sequences into a standard code stream (i.e., a second code stream) using standard encoding techniques. The server side may then send the standard code stream to the client side. Optionally, the standard code stream carries the encoded video frame 1 and the encoded video frame 4.
Optionally, after the server side performs step 203, step 204 may also be performed. By performing step 204, the quality of the video frames interpolated by the client can be further ensured, and the problem that the quality of a generated frame to be inserted is poor because the motion vector information of a video frame with severe motion is estimated inaccurately after that video frame is discarded can be further alleviated. For example, if the server performs step 203 but does not perform step 204, after receiving the second code stream, the client may generate the second video according to the plurality of second target video frames carried in the second code stream. If the server performs step 204 in addition to step 203, after receiving the second code stream and the third code stream respectively, the client may generate the second video according to the plurality of second target video frames carried in the second code stream and the attribute information of the at least one first target video frame carried in the third code stream.
Step 204, the server sends the third code stream to the client. The client receives the third code stream.
The following describes the implementation process that the server side sends the third code stream to the client side:
The server side may encode attribute information of the discarded at least one first target video frame to form a third code stream. The server side may then send a third code stream to the client side.
Illustratively, the implementation of the server-side determination of motion vector information for each encoded block corresponding to any one of the first target video frames is described below.
For a third target video frame included in the first video, the server may identify a prediction mode of each of the plurality of coding blocks according to coding information of the plurality of coding blocks corresponding to the third target video frame. Wherein the third target video frame may be any one of the at least one first target video frame.
When the prediction mode of the first coding block included in the plurality of coding blocks is an inter prediction mode, the server side may obtain motion vector information of the first coding block from coding information of the first coding block, and may scale the motion vector information of the first coding block to obtain scaled motion vector information of the first coding block.
For example, the server side may obtain the motion vector information of the first coding block from the coding information of the first coding block, and may obtain the position information of the reference video frame of the first coding block from the coding information of the first coding block. The first coding block may be any one of the plurality of coding blocks corresponding to the third target video frame. Then, the server side may calculate the play interval between the reference video frame of the first coding block and the third target video frame according to the position information of the reference video frame of the first coding block and the position information of the third target video frame, and may calculate the play interval between the third target video frame and the second target video frame that is adjacent to and located before the third target video frame according to the position information of the third target video frame and the position information of that second target video frame. Then, the server side may scale the motion vector information of the first coding block according to the play interval between the reference video frame of the first coding block and the third target video frame and the play interval between the third target video frame and the second target video frame that is adjacent to and located before the third target video frame, to obtain scaled motion vector information. The scaled motion vector information may be used as the current motion vector information of the first coding block. Illustratively, the play interval may be one of a play order difference value, a play index difference value, a play time difference value, or the like.
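The scaling step above can be pictured with the short Python sketch below. The text only states that both play intervals are used; the linear scaling by the ratio of the two intervals (assuming roughly uniform motion) is an assumption here, as are the function and parameter names.

```python
def scale_motion_vector(mv, ref_frame_pos, dropped_frame_pos, prev_kept_frame_pos):
    """Scale a coding block's motion vector of a dropped (first target) frame.

    mv: (x, y) motion vector taken from the block's coding information.
    ref_frame_pos: play position of the block's reference video frame.
    dropped_frame_pos: play position of the dropped frame (third target video frame).
    prev_kept_frame_pos: play position of the retained (second target) frame
                         adjacent to and located before the dropped frame.
    """
    ref_interval = dropped_frame_pos - ref_frame_pos           # dropped frame -> reference frame
    target_interval = dropped_frame_pos - prev_kept_frame_pos  # dropped frame -> preceding kept frame
    if ref_interval == 0:
        return (0, 0)  # degenerate case; not covered by the text
    scale = target_interval / ref_interval
    return (mv[0] * scale, mv[1] * scale)
```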
When the prediction mode of the first coding block is the intra prediction mode, the server side cannot acquire the motion vector information of the first coding block from the coding information of the first coding block. Likewise, the server side cannot acquire the position information of the reference video frame of the first coding block from the coding information of the first coding block. At this time, the server side may determine whether a second coding block whose prediction mode is the inter prediction mode exists to the left of, above, or to the upper left of the first coding block. When there is a second coding block whose prediction mode is the inter prediction mode to the left of, above, or to the upper left of the first coding block, the server side may obtain the motion vector information of the second coding block from the coding information of the second coding block, and may scale the motion vector information of the second coding block to obtain scaled motion vector information, where the scaled motion vector information is used as the current motion vector information of the first coding block.
For example, the server side may first determine whether there are encoded blocks whose prediction modes are inter prediction modes in three positions of the left side, the upper side, and the upper left side of the first encoded block. If one of the three positions on the left side, the upper side and the upper left side of the first coding block has a coding block whose prediction mode is the inter prediction mode, the server side may acquire motion vector information of the coding block of the position from the coding information of the coding block of the position, and may acquire position information of a reference video frame of the coding block of the position from the coding information of the coding block of the position. Then, the server side may calculate a play interval between the reference video frame of the coding block at the position and the third target video frame according to the position information of the reference video frame of the coding block at the position and the position information of the third target video frame, and may calculate a play interval between the third target video frame and a second target video frame adjacent to the third target video frame before the third target video frame according to the position information of the third target video frame and the position information of the second target video frame adjacent to the third target video frame before the third target video frame. Then, the server side may scale the motion vector information of the second coding block according to the playing interval between the reference video frame and the third target video frame of the coding block at the position and the playing interval between the third target video frame and a second target video frame adjacent to the first target video frame before the third target video frame, so as to obtain scaled motion vector information corresponding to the coding block at the position. The scaled motion vector information corresponding to the coding block at the position can be used as the current motion vector information of the first coding block.
If there are a plurality of encoding blocks whose prediction mode is the inter prediction mode in three positions of the left side, the upper side and the upper left side of the first encoding block, the server side may select one encoding block from the plurality of encoding blocks and may acquire motion vector information of the encoding block of the position from encoding information of the encoding block of the position. The server may then use the motion vector information of the encoded block at that location to participate in determining the current motion vector information of the first encoded block. After the server selects a coding block at a certain position from the coding blocks at a plurality of positions, the position information of the reference video frame of the coding block at the position can also be obtained from the coding information of the coding block at the position. Then, the server side may calculate a play interval between the reference video frame of the coding block at the position and the third target video frame according to the position information of the reference video frame of the coding block at the position and the position information of the third target video frame, and may calculate a play interval between the third target video frame and a second target video frame adjacent to the third target video frame before the third target video frame according to the position information of the third target video frame and the position information of the second target video frame adjacent to the third target video frame before the third target video frame. Then, the server side may scale the motion vector information of the second coding block according to the playing interval between the reference video frame and the third target video frame of the coding block at the position and the playing interval between the third target video frame and a second target video frame adjacent to the first target video frame before the third target video frame, so as to obtain scaled motion vector information corresponding to the coding block at the position. The scaled motion vector information corresponding to the coding block at the position can be used as the current motion vector information of the first coding block.
If none of the coding blocks at the above positions has an inter prediction mode as its prediction mode, the server side may use (0, 0) as the current motion vector information of the first coding block.
When there are no coding blocks at the three positions to the left of, above, and to the upper left of the first coding block, the server side may likewise use (0, 0) as the current motion vector information of the first coding block.
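Purely as a minimal illustrative sketch of the neighbour-based fallback described above (not the claimed implementation), the logic might look as follows in Python; the names Neighbour, scale_mv and derive_current_mv, the proportional scaling rule, and the choice of the first qualifying neighbour are all assumptions introduced here for illustration:

```python
from dataclasses import dataclass
from typing import Optional, Tuple

MV = Tuple[float, float]

@dataclass
class Neighbour:
    is_inter: bool   # True when the neighbour's prediction mode is an inter prediction mode
    mv: MV           # motion vector taken from the neighbour's coding information
    d_ref: float     # playing interval between the neighbour's reference video frame
                     # and the third target video frame

def scale_mv(mv: MV, d_ref: float, d_prev: float) -> MV:
    """Scale a motion vector by the ratio of the two playing intervals.

    d_prev is the playing interval between the third target video frame and the
    second target video frame that precedes and is adjacent to it. Proportional
    scaling is assumed here; the text does not fix the exact formula.
    """
    if d_ref == 0:
        return (0.0, 0.0)
    factor = d_prev / d_ref
    return (mv[0] * factor, mv[1] * factor)

def derive_current_mv(left: Optional[Neighbour],
                      above: Optional[Neighbour],
                      above_left: Optional[Neighbour],
                      d_prev: float) -> MV:
    """Derive the current MV of a first coding block from its three neighbours."""
    candidates = [n for n in (left, above, above_left) if n is not None and n.is_inter]
    if not candidates:
        # No neighbour exists, or none of them uses an inter prediction mode.
        return (0.0, 0.0)
    chosen = candidates[0]   # when several qualify, any deterministic selection would fit
    return scale_mv(chosen.mv, chosen.d_ref, d_prev)
```

When several neighbours qualify, the text above leaves the selection rule open, so the sketch simply takes the first available one.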
After obtaining the current motion vector information of the plurality of coding blocks corresponding to each of the at least one first target video frame, the server side may generate a third code stream according to that motion vector information and the position information of each of the at least one first target video frame in the first video. The server side may then send the third code stream to the client.
For example, continue taking video 1 as an example. In one example, assume that video frame 2 included in video 1 is the first target video frame. The server may encode the current motion vector information of coding block 21, coding block 22 and coding block 23 corresponding to video frame 2, together with the position information of video frame 2 in video 1, into a private code stream (i.e., the third code stream). The server side may then send the private code stream to the client. The private code stream carries the encoded current motion vector information of coding blocks 21, 22 and 23 and the encoded position information of video frame 2 in video 1.
In another example, assume that both video frame 2 and video frame 3 included in video 1 are first target video frames. The server may encode the current motion vector information of coding block 21, coding block 22 and coding block 23 corresponding to video frame 2, the position information of video frame 2 in video 1, the current motion vector information of coding block 31, coding block 32 and coding block 33 corresponding to video frame 3, and the position information of video frame 3 in video 1 into a private code stream. The server side may then send the private code stream to the client. The private code stream carries the encoded current motion vector information of coding blocks 21, 22 and 23, the encoded position information of video frame 2 in video 1, the encoded current motion vector information of coding blocks 31, 32 and 33, and the encoded position information of video frame 3 in video 1.
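As an illustration only of the kind of payload these examples describe (the concrete syntax of the private code stream is not specified here), one record per discarded frame could be packed as follows; the field layout, the integer sizes and the function name are assumptions:

```python
import struct
from typing import List, Tuple

MV = Tuple[int, int]

def pack_private_record(frame_position: int, block_mvs: List[MV]) -> bytes:
    """Pack the attribute information of one discarded (first target) video frame.

    Illustrative layout: frame position, block count, then one (mv_x, mv_y) pair
    per coding block, all as little-endian 32-bit integers.
    """
    payload = struct.pack("<ii", frame_position, len(block_mvs))
    for mv_x, mv_y in block_mvs:
        payload += struct.pack("<ii", mv_x, mv_y)
    return payload

# Example: video frame 2 of video 1 with hypothetical current MVs of coding blocks 21-23.
record = pack_private_record(frame_position=2, block_mvs=[(4, -2), (0, 0), (3, 1)])
```

In the second example above, the private code stream would simply concatenate one such record for video frame 2 and one for video frame 3.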
Illustratively, referring to fig. 3, taking the first target video frame as video frame 2 included in video 1 as an example, the implementation process of determining, by the server, the motion vector information of each coding block corresponding to video frame 2 is described. It is assumed that the plurality of coding blocks corresponding to video frame 2 are coding block 21, coding block 22 and coding block 23, that the prediction mode of coding block 21 is Skip mode, the prediction mode of coding block 22 is Intra mode and the prediction mode of coding block 23 is Skip mode, and that video frame 1, video frame 3 and video frame 4 are second target video frames. As shown in fig. 3, the method includes:
step 301, a server obtains a prediction mode of each coding block in a plurality of coding blocks corresponding to a video frame 2.
Optionally, the server may determine, from the coding information of coding block 21, the coding information of coding block 22 and the coding information of coding block 23 corresponding to video frame 2, that the prediction mode of coding block 21 is Skip mode, the prediction mode of coding block 22 is Intra mode, and the prediction mode of coding block 23 is Skip mode.
Step 302, the server determines whether the prediction mode of the first coding block included in the multiple coding blocks corresponding to the video frame 2 is Skip mode. If yes, go to step 303, otherwise go to step 304.
Step 303, the server acquires the motion vector information of the first coding block, and scales the motion vector information of the first coding block according to the playing interval between the reference video frame of the first coding block and video frame 2 and the playing interval between video frame 2 and video frame 1, to obtain scaled motion vector information.
Optionally, video frame 1 is the second target video frame that immediately precedes video frame 2.
For example, when the first coding block is coding block 21, the server side may determine that the prediction mode of coding block 21 is Skip mode. The server side may then obtain the motion vector information of coding block 21 from the coding information of coding block 21, and may scale the motion vector information of coding block 21 according to the playing interval between the reference video frame of coding block 21 and video frame 2 and the playing interval between video frame 2 and video frame 1, to obtain scaled motion vector information, which is used as the current motion vector information of coding block 21.
For another example, when the first coding block is the coding block 22, the server side may determine that the prediction mode of the coding block 22 is the Intra mode. At this time, the server side does not perform step 303 described above, but performs step 304.
For another example, when the first coding block is coding block 23, the server side may determine that the prediction mode of coding block 23 is Skip mode. The server side may then obtain the motion vector information of coding block 23 from the coding information of coding block 23, and may scale the motion vector information of coding block 23 according to the playing interval between the reference video frame of coding block 23 and video frame 2 and the playing interval between video frame 2 and video frame 1, to obtain scaled motion vector information, which is used as the current motion vector information of coding block 23.
Step 304, when determining that the prediction mode of the first coding block is Intra mode, the server determines whether there is a second coding block whose prediction mode is Skip mode to the left of, above, or to the upper left of the first coding block. If yes, go to step 305; if not, go to step 306.
Step 305, the server obtains the motion vector information of the second coding block from the coding information of the second coding block, and scales the motion vector information of the second coding block according to the playing interval between the reference video frame of the second coding block and video frame 2 and the playing interval between video frame 2 and video frame 1, so as to obtain the scaled motion vector information corresponding to the second coding block.
Illustratively, taking the first coding block as coding block 22 as an example, the server performs the above step 304 when determining that the prediction mode of coding block 22 is Intra mode. If one or more of the coding blocks located to the left of, above, or to the upper left of coding block 22 have an inter prediction mode as their prediction mode, the server side may perform step 305. If none of the coding blocks located to the left of, above, or to the upper left of coding block 22 has an inter prediction mode as its prediction mode, the server side may perform step 306.
The implementation of step 305 by the server side is described below by way of the following several possible examples.
In the first example, suppose the coding block located to the left of (or above, or to the upper left of) coding block 22 is coding block 21, and coding block 23 is located at some other position (for example to the right of or below coding block 22). After the server determines that the prediction mode of coding block 22 is Intra mode, the server may determine whether there is a coding block whose prediction mode is an inter prediction mode to the left of, above, or to the upper left of coding block 22. When the prediction mode of coding block 21, located to the left of (or above, or to the upper left of) coding block 22, is an inter prediction mode, the server side may obtain the motion vector information of coding block 21 from the coding information of coding block 21, and may scale the motion vector information of coding block 21 according to the playing interval between the reference video frame of coding block 21 and video frame 2 and the playing interval between video frame 2 and video frame 1, to obtain scaled motion vector information corresponding to coding block 21. The scaled motion vector information corresponding to coding block 21 is used as the current motion vector information of coding block 22.
In the second example, the coding block located to the left of coding block 22 is coding block 21, and the coding block located above (or to the upper left of) coding block 22 is coding block 23. After the server determines that the prediction mode of coding block 22 is Intra mode, the server may determine whether there is a coding block whose prediction mode is an inter prediction mode to the left of, above, or to the upper left of coding block 22. When the prediction modes of coding block 21, located to the left of coding block 22, and coding block 23, located above (or to the upper left of) coding block 22, are both inter prediction modes, the server side may obtain the motion vector information of coding block 21 from the coding information of coding block 21 and the motion vector information of coding block 23 from the coding information of coding block 23, and may select the motion vector information of either one of them to participate in determining the current motion vector information of coding block 22. For example, taking the case where the server side selects the motion vector information of coding block 23, the server side may scale the motion vector information of coding block 23 according to the playing interval between the reference video frame of coding block 23 and video frame 2 and the playing interval between video frame 2 and video frame 1, to obtain scaled motion vector information corresponding to coding block 23. The scaled motion vector information corresponding to coding block 23 is used as the current motion vector information of coding block 22.
If only one of coding block 21 and coding block 23 has an inter prediction mode as its prediction mode, the server side may obtain the motion vector information of that coding block from its coding information and use it to participate in determining the current motion vector information of coding block 22.
In the third example, the coding block located above coding block 22 is coding block 21, and the coding block located to the upper left of (or to the left of) coding block 22 is coding block 23. After the server determines that the prediction mode of coding block 22 is Intra mode, the server may determine whether there is a coding block whose prediction mode is an inter prediction mode to the left of, above, or to the upper left of coding block 22. When the prediction modes of coding block 21, located above coding block 22, and coding block 23, located to the upper left of (or to the left of) coding block 22, are both inter prediction modes, the server side may obtain the motion vector information of coding block 21 from the coding information of coding block 21 and the motion vector information of coding block 23 from the coding information of coding block 23, and may select the motion vector information of either one of them to participate in determining the current motion vector information of coding block 22. For example, taking the case where the server side selects the motion vector information of coding block 23, the server side may scale the motion vector information of coding block 23 according to the playing interval between the reference video frame of coding block 23 and video frame 2 and the playing interval between video frame 2 and video frame 1, to obtain scaled motion vector information corresponding to coding block 23. The scaled motion vector information corresponding to coding block 23 is used as the current motion vector information of coding block 22.
If only one of coding block 21 and coding block 23 has an inter prediction mode as its prediction mode, the server side may obtain the motion vector information of that coding block from its coding information and use it to participate in determining the current motion vector information of coding block 22.
In the fourth example, the coding block located to the upper left of coding block 22 is coding block 21, and the coding block located above (or to the left of) coding block 22 is coding block 23. After the server determines that the prediction mode of coding block 22 is Intra mode, the server may determine whether there is a coding block whose prediction mode is an inter prediction mode to the left of, above, or to the upper left of coding block 22. When the prediction modes of coding block 21, located to the upper left of coding block 22, and coding block 23, located above (or to the left of) coding block 22, are both inter prediction modes, the server side may obtain the motion vector information of coding block 21 from the coding information of coding block 21 and the motion vector information of coding block 23 from the coding information of coding block 23, and may select the motion vector information of either one of them to participate in determining the current motion vector information of coding block 22. For example, taking the case where the server side selects the motion vector information of coding block 21, the server side may scale the motion vector information of coding block 21 according to the playing interval between the reference video frame of coding block 21 and video frame 2 and the playing interval between video frame 2 and video frame 1, to obtain scaled motion vector information corresponding to coding block 21. The scaled motion vector information corresponding to coding block 21 is used as the current motion vector information of coding block 22.
If only one of coding block 21 and coding block 23 has an inter prediction mode as its prediction mode, the server side may obtain the motion vector information of that coding block from its coding information and use it to participate in determining the current motion vector information of coding block 22.
Step 306, the server takes (0, 0) as the current motion vector information of the first coding block.
The implementation of step 306 by the server side is described below by way of the following several possible examples.
In the first example, suppose the coding block located to the lower left of (or below) coding block 22 is coding block 21, and the coding block located to the right of (or to the upper right or lower right of) coding block 22 is coding block 23. After determining that the prediction mode of coding block 22 is Intra mode, when the server side determines that there is no coding block to the left of, above, or to the upper left of coding block 22, or that none of the coding blocks at those three positions has an inter prediction mode as its prediction mode, the server side may use (0, 0) as the current motion vector information of coding block 22.
In the second example, suppose the coding block located to the right of (or to the upper right or lower right of) coding block 22 is coding block 21, and the coding block located to the lower left of (or to the lower right of) coding block 22 is coding block 23. After determining that the prediction mode of coding block 22 is Intra mode, when the server side determines that there is no coding block to the left of, above, or to the upper left of coding block 22, or that none of the coding blocks at those three positions has an inter prediction mode as its prediction mode, the server side may use (0, 0) as the current motion vector information of coding block 22.
As can be seen from the foregoing steps 301 to 306, for each first target video frame, after identifying the prediction mode of each of the plurality of coding blocks corresponding to that first target video frame, the server may select, for each coding block, a motion vector calculation manner suited to the prediction mode of that coding block to calculate its current motion vector information. In this way, the calculated current motion vector information of each coding block is more accurate and better matches the actual coding conditions of the different coding blocks, which assists in accurately generating the frame to be inserted in the subsequent steps and helps to ensure the quality of the frame to be inserted.
Optionally, after performing the temporal downsampling operation on the first video, the server side may evaluate the frame insertion quality of the first video so as to dynamically adjust the frame-dropping strategy. This gives the frame-dropping strategy greater flexibility: it can flexibly adapt to different video transcoding scenarios, and different frame-dropping strategies can be applied to different scenarios, thereby ensuring the frame insertion quality of the video.
Each time the server decodes a number of video frames satisfying a first number threshold (for example 50, an integer multiple of 50, or another number), it may generate, for a certain first target video frame included in those video frames (for example a fourth target video frame), the corresponding frame to be inserted according to the motion vector information of that first target video frame and the two adjacent normally encoded video frames located before and after it. For example, the server may input the motion vector information of the first target video frame and the two adjacent normally encoded video frames into a neural network model (such as a convolutional neural network model) to obtain the frame to be inserted corresponding to that first target video frame. The neural network model is trained on a large set of video frame training samples and is used to generate the frame to be inserted corresponding to a first target video frame. The server side may then determine the peak signal-to-noise ratio corresponding to the frame to be inserted from the frame to be inserted and the first target video frame. Optionally, the peak signal-to-noise ratio may be used as a quality index value of the frame to be inserted to evaluate its quality. The server side can then decide how to adjust the frame-dropping policy, that is, how to adjust the first threshold and/or the second threshold, according to the relationship between the peak signal-to-noise ratio and a quality threshold. When the peak signal-to-noise ratio is less than the quality threshold, the server side may decrease the second threshold (e.g., Th_intra). When the peak signal-to-noise ratio is greater than or equal to the quality threshold, the server may adjust the first threshold (e.g., Th_skip) and/or the second threshold (e.g., Th_intra) if the number of first target video frames included in the video frames of the first number threshold is less than or equal to a second number threshold; if that number is greater than the second number threshold, the server side does not need to adjust the first threshold and/or the second threshold. The fourth target video frame is any one of the first target video frames included in the video frames of the first number threshold.
When the server side has not yet decoded a number of video frames satisfying the first number threshold, it does not need to adjust the frame-dropping policy, that is, it does not need to adjust the first threshold and/or the second threshold.
Illustratively, taking the first number threshold as 50 and the second number threshold as 20, every time the server side decodes 50 video frames, it may select one first target video frame from the at least one first target video frame (i.e., video frames having the discard characteristic) included in those 50 decoded video frames to participate in evaluating the frame insertion quality of the video. For example, suppose the 50 video frames decoded by the server side include video frame 1, video frame 2 and video frame 3, where video frame 2 is a first target video frame and video frame 1 and video frame 3 are second target video frames; the server side may then select video frame 2 to participate in evaluating the frame insertion quality of the video. Video frame 1 is the second target video frame adjacent to and preceding video frame 2, and video frame 3 is the second target video frame adjacent to and following video frame 2. The server side may then input the motion vector information of the plurality of coding blocks corresponding to video frame 2, together with video frame 1 and video frame 3, into the convolutional neural network model to obtain the frame to be inserted corresponding to video frame 2, for example frame to be inserted a. The server side can then calculate the peak signal-to-noise ratio corresponding to frame to be inserted a from video frame 2 and frame to be inserted a. When the peak signal-to-noise ratio corresponding to frame to be inserted a is smaller than the quality threshold, the server side reduces Th_intra. When the peak signal-to-noise ratio corresponding to frame to be inserted a is greater than or equal to the quality threshold, if the server determines that the number of first target video frames included in the 50 decoded video frames is less than or equal to 20, the server may reduce Th_skip and/or increase Th_intra; if that number is greater than 20, the server does not need to adjust Th_skip and/or Th_intra.
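A hedged sketch of this feedback rule is given below. The threshold names Th_skip and Th_intra follow the text; the PSNR helper, the adjustment step size, and the decision to apply both adjustments together (the text says "and/or") are assumptions introduced for illustration:

```python
import numpy as np

def psnr(reference: np.ndarray, candidate: np.ndarray, peak: float = 255.0) -> float:
    """Peak signal-to-noise ratio between the dropped frame and the interpolated frame."""
    mse = np.mean((reference.astype(np.float64) - candidate.astype(np.float64)) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(peak ** 2 / mse)

def update_thresholds(th_skip: float, th_intra: float,
                      psnr_value: float, quality_threshold: float,
                      dropped_count: int, second_number_threshold: int = 20,
                      step: float = 0.05):
    """Adjust the frame-dropping thresholds after every batch of decoded frames."""
    if psnr_value < quality_threshold:
        # Per the text: lower Th_intra when interpolation quality is poor.
        th_intra -= step
    elif dropped_count <= second_number_threshold:
        # Per the text: lower Th_skip and/or raise Th_intra when few frames were dropped;
        # this sketch applies both for simplicity.
        th_skip -= step
        th_intra += step
    # Otherwise: keep the thresholds unchanged.
    return th_skip, th_intra
```

In the worked example above, the batch size would be 50 decoded frames and second_number_threshold would be 20.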
Step 205, the client generates the second video according to the plurality of second target video frames, or generates the second video according to the plurality of second target video frames and the attribute information of the at least one first target video frame.
In one possible implementation, after receiving the second code stream from the server, the client may decode the second code stream to obtain a plurality of second target video frames. The client may then generate corresponding frames to be inserted between the plurality of second target video frames in a conventional frame insertion manner. The client may then generate a second video from the plurality of second target video frames and the corresponding frames to be inserted.
In another possible implementation, after receiving the second code stream and the third code stream from the server, the client may decode the second code stream to obtain the plurality of second target video frames, and decode the third code stream to obtain the attribute information of the at least one first target video frame. The client can then generate the second video according to the plurality of second target video frames and the attribute information of the at least one first target video frame. In this implementation, the attribute information of the video frames having the discard characteristic and the video frames having the retention characteristic can both participate in generating the corresponding frames to be inserted, so the quality of the frames to be inserted can be further ensured. This alleviates the problem in existing schemes that inaccurate estimation of the motion vector information of the frame to be inserted causes more artifacts in the generated video, and further relieves the phenomenon that, after video frames with intense motion are discarded, inaccurate estimation of their motion vector information leads to poor quality of the generated frames to be inserted.
For example, for each first target video frame in the at least one first target video frame, the client may determine, according to the position information of the first target video frame in the first video, the nearest second target video frame before the playing position of that first target video frame in the first video and the nearest second target video frame after that playing position. The client can then generate the frame to be inserted corresponding to the first target video frame according to the motion vector information of the plurality of coding blocks corresponding to the first target video frame and the two adjacent second target video frames before and after it. The position information of the frame to be inserted corresponding to any first target video frame is the same as the position information of that first target video frame in the first video; in other words, the position (such as the playing position) of the first target video frame in the first video is the same as the position at which the corresponding frame to be inserted needs to be inserted (or placed). The client can then insert (or place) the frame to be inserted corresponding to each first target video frame among the plurality of second target video frames according to the position information of each first target video frame in the first video, so as to generate the second video, which has a complete video sequence.
Illustratively, continuing with video 1, the server side sends the second code stream and the third code stream to the client. Assume that video frame 1 included in video 1 is a second target video frame, video frame 2 is a first target video frame, video frame 3 is a second target video frame and video frame 4 is a second target video frame, and assume that the playing order of the video frames in video 1 is video frame 1, video frame 2, video frame 3, video frame 4. The second code stream sent by the server to the client carries the encoded video frame 1, video frame 3 and video frame 4, and the third code stream sent by the server to the client carries the encoded attribute information of video frame 2. Illustratively, the attribute information of video frame 2 may include the current motion vector information of each of coding block 21, coding block 22 and coding block 23 corresponding to video frame 2, and the position information (such as the playing position) of video frame 2 in video 1. After receiving the second code stream and the third code stream from the server, the client decodes the second code stream to obtain video frame 1, video frame 3 and video frame 4, and decodes the third code stream to obtain the attribute information of video frame 2. Then, according to the position information of video frame 2 in video 1, the client may determine that the nearest second target video frame before the playing position of video frame 2 in video 1 is video frame 1 and that the nearest second target video frame after that playing position is video frame 3, and may input the current motion vector information of each of coding block 21, coding block 22 and coding block 23, together with video frame 1 and video frame 3, into the convolutional neural network model (i.e., the trained convolutional neural network model for generating the frame to be inserted) to obtain the frame to be inserted corresponding to video frame 2. The playing position of video frame 2 in video 1 is the same as the playing position at which the frame to be inserted corresponding to video frame 2 needs to be inserted. The client can then insert the frame to be inserted corresponding to video frame 2 among video frame 1, video frame 3 and video frame 4 according to the position information of video frame 2 in video 1, so as to generate a complete video.
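To make the reconstruction step concrete, a sketch of the client-side assembly is shown below; rebuild_video, the interpolate callable and the data layout are hypothetical stand-ins for the decoder output and the trained interpolation model mentioned above:

```python
import numpy as np
from typing import Callable, Dict, List, Tuple

BlockMVs = List[Tuple[int, int]]

def rebuild_video(kept_frames: Dict[int, np.ndarray],
                  dropped_attrs: Dict[int, BlockMVs],
                  interpolate: Callable[[BlockMVs, np.ndarray, np.ndarray], np.ndarray]
                  ) -> List[np.ndarray]:
    """Reassemble the full sequence from retained frames and dropped-frame attributes.

    kept_frames   : playing position -> decoded second target video frame
    dropped_attrs : playing position -> current MVs of the coding blocks of a
                    discarded first target video frame (from the third code stream)
    interpolate   : stand-in for the trained model that generates a frame to be
                    inserted from (block MVs, previous kept frame, next kept frame)
    Each dropped position is assumed to have a kept frame both before and after it,
    as in the example above.
    """
    kept_positions = sorted(kept_frames)
    video: List[np.ndarray] = []
    for pos in sorted(set(kept_positions) | set(dropped_attrs)):
        if pos in kept_frames:
            video.append(kept_frames[pos])
        else:
            prev_pos = max(p for p in kept_positions if p < pos)
            next_pos = min(p for p in kept_positions if p > pos)
            video.append(interpolate(dropped_attrs[pos],
                                     kept_frames[prev_pos],
                                     kept_frames[next_pos]))
    return video
```

In the example above, kept_frames would hold video frames 1, 3 and 4 at positions 1, 3 and 4, dropped_attrs would hold the current MVs of coding blocks 21 to 23 at position 2, and the interpolated frame would be placed at position 2.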
As can be seen from the above steps 201 to 205, the server side analyses the attribute of each video frame (i.e., whether each video frame has the discard characteristic or the retention characteristic) according to the coding information of each of the plurality of video frames, and can thus determine more accurately which video frames are suitable to be discarded. The server side may then encode the video frames having the retention characteristic into a code stream for transmission to the client. After decoding the received code stream, the client obtains the video frames having the retention characteristic; because the server has discarded the video frames that are easier to interpolate, the client can generate the corresponding frames to be inserted relatively easily from the video frames having the retention characteristic, so the quality of the frames to be inserted can be ensured, and the phenomenon that discarding video frames with intense motion leads to inaccurate motion vector estimation and hence poor-quality generated frames can be alleviated. In addition, because the server side performs a more accurate temporal downsampling operation on the plurality of video frames, video frames can be discarded to the greatest extent possible, and the data transmitted to the client is encoded, so the amount of data transmitted from the server to the client can be further reduced, the transmission bandwidth between the server and the client can be further reduced, and the poor video-viewing experience of users in weak network environments can be alleviated. Optionally, when the server side encodes the attribute information of the video frames having the discard characteristic into a code stream and transmits it to the client, the client can effectively and comprehensively use the attribute information of the video frames having the discard characteristic together with the video frames having the retention characteristic to generate the corresponding frames to be inserted, so that the quality of the frames to be inserted can be further ensured, the problem in existing schemes that inaccurate estimation of the motion vector information of the frame to be inserted causes more artifacts in the generated video can be solved, and the phenomenon that discarding video frames with intense motion leads to inaccurate motion vector estimation and hence poor-quality generated frames can be further effectively alleviated.
In the description of the present application, "at least one" means one or more, and "a plurality" means two or more. "And/or" describes an association relationship of associated objects and indicates that three relationships may exist; for example, A and/or B may indicate that A exists alone, that A and B exist together, or that B exists alone, where A and B may be singular or plural. The character "/" generally indicates that the associated objects are in an "or" relationship. "At least one of" or similar expressions means any combination of the listed items, including any combination of single items or plural items. For example, "at least one of A, B, and C" includes A, B, C, AB, AC, BC, or ABC. Moreover, unless otherwise specified, the ordinal words "first", "second", "third", etc. in the embodiments of the present application are used for distinguishing between multiple objects and are not used for limiting the order, timing, priority, or importance of those objects. Furthermore, the terms "comprising", "including", "having", and variations thereof herein mean "including but not limited to" unless otherwise specifically stated.
In addition, each step in the foregoing embodiments may be performed by a corresponding device, or may be performed by a component such as a chip, a processor, or a chip system in the device, which is not limited by the embodiment of the present application. The above embodiments are described only as examples to be executed by the respective apparatuses.
In the above embodiments, some steps may be selectively performed, or the order of the steps in the drawings may be adjusted, which is not limited by the present application. It should be understood that performing only some of the illustrated steps, adjusting the order of the steps, or combining them with each other is within the scope of the present application.
It will be appreciated that, in order to implement the functions of the above embodiments, each device involved in the above embodiments includes a corresponding hardware structure and/or software module for performing each function. Those of skill in the art will readily appreciate that the various illustrative elements and method steps described in connection with the embodiments disclosed herein may be implemented as hardware or combinations of hardware and computer software. Whether a function is implemented as hardware or computer software driven hardware depends upon the particular application scenario and design constraints imposed on the solution.
It should be noted that "step" in the embodiment of the present application is merely illustrative, and is used to better understand a representation method adopted by the embodiment, and does not essentially limit the implementation of the scheme of the present application, for example, the "step" may be further understood as "feature". In addition, the execution sequence of the scheme of the application is not limited in any way, and any operation such as step sequence change or step combination or step splitting which does not affect the implementation of the whole scheme is made on the basis, so that the formed new technical scheme is also within the scope of the disclosure of the application.
Based on the same conception, an embodiment of the present application also provides a possible video processing apparatus, which is suitable for the application scenario illustrated in fig. 1. Alternatively, the video processing apparatus may be a computing device (such as a first video processing apparatus or a second video processing apparatus) or a component (such as a chip, a system on a chip, or a circuit) capable of supporting the functions required for the computing device to implement the video processing method. In one example, when the video processing apparatus is a first video processing apparatus (such as the server side 200 illustrated in fig. 1), the video processing apparatus, or a module (such as a chip) of the video processing apparatus, is configured to implement the technical solution related to the first video processing apparatus in the above embodiments, so that the beneficial effects of the first video processing apparatus in the above embodiments can also be achieved. Illustratively, taking the case where the video processing apparatus is a chip provided in a computing device, the video processing apparatus includes a transceiver and a processor and does not include a memory. The transceiver takes the form of an input/output interface, through which the chip receives and transmits signals for the computing device. The input/output interface may include an input interface that enables reception for the computing device and/or an output interface that enables transmission for the computing device. The processor is configured to read and execute the corresponding computer program or instructions so that the corresponding functions of the video processing apparatus are implemented. Optionally, when implementing the corresponding functions of the first video processing apparatus in the above embodiments, the input/output interface may implement the transceiving operations performed by the first video processing apparatus, and the processor may implement the operations other than the transceiving operations. For specific details, reference is made to the embodiments described above, which are not repeated here.
Referring to fig. 4a, the video processing apparatus 400a includes a decoding module 401, a video characteristic analysis module 402, and a transceiving module 403 (for transmitting and receiving data). Optionally, the video processing apparatus 400a may further include a temporal downsampling module 404, an encoding module 405, and an interpolation frame quality feedback module 406.
The temporal downsampling module 404 is configured to discard the at least one first target video frame and retain the plurality of second target video frames. The interpolation frame quality feedback module 406 is configured to feed quality information back to the video characteristic analysis module 402; for example, the module 406 may generate a frame to be inserted and determine, according to the quality of that frame, whether the first threshold and the second threshold need to be adjusted. The encoding module 405 is configured to encode the plurality of second target video frames to obtain the second code stream, or to encode the attribute information of the at least one first target video frame to obtain the third code stream. The third code stream carries the attribute information of the at least one first target video frame.
When the video processing apparatus 400a is configured to implement the function of a server (such as a cloud server) in the method embodiment illustrated in fig. 2 or fig. 3, the transceiver module 403 is configured to obtain the first code stream. The decoding module 401 is configured to decode the first code stream to obtain a plurality of video frames and encoded information of the plurality of video frames included in the first video. Wherein the encoding information may include one or more of block partition information, motion vector information, prediction mode, and block residual information of the plurality of video frames. The video characteristic analysis module 402 is configured to determine at least one first target video frame included in the plurality of video frames according to the encoding information of the plurality of video frames. Wherein the at least one first target video frame is a video frame of the plurality of video frames that needs to be discarded. The transceiver module 403 is further configured to send the second code stream. The second code stream carries a plurality of second target video frames, and the plurality of second target video frames are other video frames except at least one first target video frame in the plurality of video frames. The transceiver module 403 is further configured to send the third code stream.
When the video processing apparatus 400a is used to implement the server-side function in the method embodiment illustrated in fig. 2 or fig. 3, for more detailed description of the decoding module 401, the video characteristic analysis module 402, the transceiver module 403, the time-domain downsampling module 404, the encoding module 405, and the frame quality feedback module 406, reference may be made to the related description of the server-side in the method embodiment illustrated in fig. 2 or fig. 3, which is not repeated herein.
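As a non-authoritative sketch of how the modules of the video processing apparatus 400a might be composed in software (the module names follow fig. 4a, while the class name, method names and call signatures are assumptions):

```python
class VideoProcessingApparatus400a:
    """Illustrative wiring of the modules described for fig. 4a (server side)."""

    def __init__(self, decoding_module, video_characteristic_analysis_module,
                 transceiving_module, temporal_downsampling_module=None,
                 encoding_module=None, interpolation_quality_feedback_module=None):
        self.decoding = decoding_module                       # decodes the first code stream
        self.analysis = video_characteristic_analysis_module  # selects first target frames
        self.transceiver = transceiving_module                # obtains/sends code streams
        self.downsampling = temporal_downsampling_module      # discards first target frames
        self.encoder = encoding_module                        # produces the 2nd/3rd streams
        self.feedback = interpolation_quality_feedback_module  # adjusts the thresholds

    def handle_first_code_stream(self):
        first_stream = self.transceiver.receive()
        frames, coding_info = self.decoding.decode(first_stream)
        dropped = self.analysis.select_first_target_frames(coding_info)
        kept = self.downsampling.keep_second_target_frames(frames, dropped)
        self.transceiver.send(self.encoder.encode_frames(kept))         # second code stream
        self.transceiver.send(self.encoder.encode_attributes(dropped))  # third code stream
```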
In another example, when the video processing apparatus is a second video processing apparatus (such as the client 100 illustrated in fig. 1), the video processing apparatus, or a module (such as a chip) of the video processing apparatus, is configured to implement the technical solution related to the second video processing apparatus in the above embodiments, so that the beneficial effects of the second video processing apparatus in the above embodiments can also be achieved. Illustratively, taking the case where the video processing apparatus is a chip provided in a computing device, the video processing apparatus includes a transceiver and a processor and does not include a memory. The transceiver takes the form of an input/output interface, through which the chip receives and transmits signals for the computing device. The input/output interface may include an input interface that enables reception for the computing device and/or an output interface that enables transmission for the computing device. The processor is configured to read and execute the corresponding computer program or instructions so that the corresponding functions of the video processing apparatus are implemented. Optionally, when the chip implements the corresponding functions of the second video processing apparatus in the above embodiments, the input/output interface may implement the transceiving operations performed by the second video processing apparatus, and the processor may implement the operations other than the transceiving operations. For specific details, reference is made to the embodiments described above, which are not repeated here.
Referring to fig. 4b, the video processing apparatus 400b includes a transceiver module 401 and a video frame inserting module 402. Optionally, the video processing apparatus 400b further includes an acquisition module 403, an encoding module 404 and a decoding module 405. The acquisition module 403 is configured to acquire the relevant video of a user, such as a video recorded by the user or a video generated by the user's live broadcast; the encoding module 404 is configured to encode the relevant video (such as the first video) acquired by the acquisition module 403 to generate the corresponding code stream (such as the first code stream corresponding to the first video); and the transceiver module 401 is configured to send the corresponding code stream (such as the first code stream) generated by the encoding module 404, or to receive the corresponding code streams (such as the second code stream and the third code stream) from the server.
When the video processing apparatus 400b is used to implement the client function in the embodiment of the method illustrated in fig. 2, the transceiver module 401 is configured to receive the second code stream from the server. The second code stream carries a plurality of second target video frames. The decoding module 405 is configured to decode the second code stream to obtain a plurality of second target video frames. The video frame inserting module 402 is configured to generate a second video according to the plurality of second target video frames.
For a more detailed description of the transceiver module 401, the video frame inserting module 402, the acquisition module 403, the encoding module 404 and the decoding module 405, reference may be made to the related description of the client in the method embodiment illustrated in fig. 2, which is not repeated here.
Alternatively, the decoding module, the video characteristic analysis module, the transceiver module, the time domain downsampling module, the frame inserting quality feedback module and the encoding module illustrated in fig. 4a may be implemented by software, or may be implemented by hardware, and the decoding module, the video frame inserting module, the acquisition module, the encoding module and the transceiver module illustrated in fig. 4b may be implemented by software, or may be implemented by hardware. By way of example, the implementation of the video characteristic analysis module illustrated in fig. 4a above will be described next. Similarly, the implementation manners of the decoding module, the transceiver module, the time domain downsampling module, the frame inserting quality feedback module and the encoding module illustrated in fig. 4a may refer to the implementation manner of the video characteristic analysis module, and the implementation manners of the decoding module, the video frame inserting module, the acquisition module, the encoding module and the transceiver module illustrated in fig. 4b may also refer to the implementation manner of the video characteristic analysis module, which are not described herein again.
When implemented in software, the video characteristic analysis module may be an application or block of code running on a computer device as an example of a software functional unit. The computer device may be at least one of a physical host, a virtual machine, a container, and the like. Further, the computer device may be one or more. For example, the video characteristic analysis module may be an application running on multiple hosts/virtual machines/containers. It should be noted that, a plurality of hosts/virtual machines/containers for running the application may be distributed in the same availability area (availability zone, AZ) or may be distributed in different AZs. Multiple hosts/virtual machines/containers for running the application may be distributed in the same region (region) or may be distributed in different regions. Wherein typically a region may comprise a plurality of AZs.
Also, multiple hosts/virtual machines/containers for running the application may be distributed in the same virtual private cloud (virtual private cloud, VPC) or may be distributed in multiple VPCs. Where typically a region may comprise multiple VPCs and a VPC may comprise multiple AZs.
When implemented in hardware, the video characteristic analysis module may include at least one computing device, such as a server, as an example of a hardware functional unit. Alternatively, the video characteristic analysis module may be a device implemented using an application-specific integrated circuit (ASIC), a programmable logic device (PLD), or the like. The PLD may be a complex programmable logic device (CPLD), a field-programmable gate array (FPGA), generic array logic (GAL), or any combination thereof.
The plurality of computing devices included in the video characteristic analysis module may be distributed in the same AZ or may be distributed in different AZ. Multiple computing devices included in the video characteristic analysis module may be distributed in the same region or may be distributed in different regions. Likewise, multiple computing devices included in the video characteristic analysis module may be distributed in the same VPC or may be distributed in multiple VPCs. Wherein the plurality of computing devices may be any combination of computing devices such as servers, ASIC, PLD, CPLD, FPGA, and GAL.
It should be noted that the steps that the decoding module, the video characteristic analysis module, the transceiver module, the temporal downsampling module, the interpolation frame quality feedback module and the encoding module illustrated in fig. 4a are respectively responsible for implementing may be specified as required; these modules implement different steps of the video processing method so as to realize all the functions of the first video processing apparatus. Likewise, the steps that the decoding module, the video frame inserting module, the acquisition module, the encoding module and the transceiver module illustrated in fig. 4b are respectively responsible for implementing may be specified as required; these modules implement different steps of the video processing method so as to realize all the functions of the second video processing apparatus.
In addition, it should be noted that, in the embodiment of the present application, the division of the modules is schematic, which is merely a logic function division, and other division manners may be adopted in actual implementation. The functional modules in the embodiments of the present application may be integrated into one module, or each module may exist alone physically, or two or more modules may be integrated into one module. For example, taking the above-described plurality of modules illustrated in fig. 4a as an example, the decoding module, the video characteristic analysis module, the transceiver module, the time-domain downsampling module, the frame insertion quality feedback module, and the encoding module may be integrated into one module, or the decoding module, the video characteristic analysis module, the transceiver module, the time-domain downsampling module, the frame insertion quality feedback module, and the encoding module may be the same module. The integrated units may be implemented in hardware or in software functional units.
Based on the same conception, the embodiment of the present application further provides a possible computing device, which is configured to perform the video processing method illustrated in the above method embodiment, and relevant features may be referred to the above method embodiment and will not be described herein. As shown in fig. 5a, computing device 500 includes a bus 501, a processor 502, and a memory 504. Optionally, the computing device 500 may also include a communication interface 503. Communication between the processor 502, the communication interface 503 and the memory 504 is via the bus 501. Computing device 500 may be a server or a terminal device. It should be understood that the present application is not limited to the number of processors, memories in computing device 500.
Bus 501 may be a peripheral component interconnect (PCI) bus, an extended industry standard architecture (EISA) bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, and so on. For ease of illustration, only one line is shown in fig. 5a, but this does not mean that there is only one bus or only one type of bus. Bus 501 may include a path to transfer information between various components of computing device 500 (e.g., memory 504, processor 502, communication interface 503).
The processor 502 may include any one or more of a central processing unit (CPU), a graphics processing unit (GPU), a microprocessor (MP), or a digital signal processor (DSP).
The memory 504 may include volatile memory, such as random access memory (RAM). The memory 504 may also include non-volatile memory, such as read-only memory (ROM), flash memory, a hard disk drive (HDD) or a solid state drive (SSD).
The memory 504 stores executable program code, and the processor 502 executes the executable program code to implement the functions of the aforementioned first video processing apparatus (such as the decoding module, the video characteristic analysis module and the transceiver module, and optionally one or more of the temporal downsampling module, the interpolation frame quality feedback module and the encoding module), thereby implementing the video processing method provided by the embodiments of the present application. That is, the memory 504 stores computer program instructions for performing the video processing method.
Alternatively, the memory 504 stores executable code, and the processor 502 executes the executable code to implement the functions of the aforementioned first video processing apparatus (such as the decoding module, the video characteristic analysis module and the transceiver module, and optionally one or more of the temporal downsampling module, the interpolation frame quality feedback module and the encoding module), thereby implementing the video processing method provided by the embodiments of the present application. That is, the memory 504 stores computer program instructions for the first video processing apparatus to perform the video processing method provided by the embodiments of the present application.
The communication interface 503 enables communication between the computing device 500 and other devices or communication networks using a transceiver module such as, but not limited to, a network interface card, transceiver, or the like.
As shown in fig. 5b, computing device 500 includes a bus 501, a processor 502, and a memory 504. Optionally, the computing device 500 may also include a communication interface 503. Communication between the processor 502, the communication interface 503 and the memory 504 is via the bus 501. Computing device 500 may be a server or a terminal device. It should be understood that the present application is not limited to the number of processors, memories in computing device 500.
The bus 501 may be a peripheral component interconnect (peripheral component interconnect, PCI) bus, an extended industry standard architecture (extended industry standard architecture, EISA) bus, or the like. Buses may be classified into address buses, data buses, control buses, and so on. For ease of illustration, only one line is shown in fig. 5b, but this does not mean that there is only one bus or only one type of bus. The bus 501 may include a path for transferring information between the components of the computing device 500 (e.g., the memory 504, the processor 502, and the communication interface 503).
The processor 502 may include any one or more of a central processing unit (central processing unit, CPU), a graphics processing unit (graphics processing unit, GPU), a microprocessor (micro processor, MP), or a digital signal processor (digital signal processor, DSP).
The memory 504 may include a volatile memory (volatile memory), such as a random access memory (random access memory, RAM). The memory 504 may also include a non-volatile memory (non-volatile memory), such as a read-only memory (read-only memory, ROM), a flash memory, a mechanical hard disk (hard disk drive, HDD), or a solid state drive (solid state drive, SSD).
The memory 504 stores executable program code, and the processor 502 executes the executable program code to implement the functions of the aforementioned second video processing apparatus (for example, the transceiver module and the video frame interpolation module, and one or more of an acquisition module, an encoding module, or a decoding module), thereby implementing the video processing method provided by the embodiments of the present application. That is, the memory 504 stores computer program instructions for performing the video processing method.
Alternatively, the memory 504 stores executable code, and the processor 502 executes the executable code to implement the functions of the aforementioned second video processing apparatus (for example, the transceiver module and the video frame interpolation module, and one or more of the acquisition module, the encoding module, or the decoding module), thereby implementing the video processing method provided by the embodiments of the present application. That is, the memory 504 stores computer program instructions used by the second video processing apparatus to perform the video processing method provided by the embodiments of the present application.
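Similarly, the following is a minimal, hypothetical sketch of the second video processing apparatus, focusing on the video frame interpolation module that restores frames discarded by the temporal downsampling on the first apparatus side; the names and the trivial interpolation placeholder are assumptions for exposition only.

```python
# Minimal sketch (assumed names, trivial placeholder interpolation) of the
# second video processing apparatus restoring frames that were discarded by
# the temporal downsampling performed on the first apparatus side.

class FrameInterpolationModule:
    def interpolate(self, prev_frame, next_frame):
        # Placeholder: a real implementation would generate an intermediate
        # frame, e.g., by motion-compensated interpolation.
        return ("interpolated", prev_frame, next_frame)


class SecondVideoProcessingApparatus:
    def __init__(self) -> None:
        self.interpolator = FrameInterpolationModule()

    def restore(self, frames: list) -> list:
        # Re-insert one generated frame between every pair of received frames.
        if not frames:
            return []
        restored = []
        for prev_frame, next_frame in zip(frames, frames[1:]):
            restored.append(prev_frame)
            restored.append(self.interpolator.interpolate(prev_frame, next_frame))
        restored.append(frames[-1])
        return restored


if __name__ == "__main__":
    print(SecondVideoProcessingApparatus().restore(["f0", "f2", "f4"]))
```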
The communication interface 503 uses a transceiver module, such as, but not limited to, a network interface card or a transceiver, to implement communication between the computing device 500 and other devices or communication networks.
Based on the same conception, an embodiment of the present application further provides a possible computing device cluster, which is configured to perform the method illustrated in the above method embodiment; for relevant features, reference may be made to the above method embodiment, and details are not repeated herein. The computing device cluster includes at least one computing device. The computing device may be a server, such as a central server, an edge server, or a local server in a local data center. In some embodiments, the computing device may also be a terminal device such as a desktop computer, a notebook computer, or a smart phone.
Illustratively, take the case in which the functions of the aforementioned first video processing apparatus are implemented by instructions stored in the memory 504 of the computing device 500 illustrated in fig. 5a. As shown in fig. 6a, the computing device cluster 600 includes at least one computing device 500. The same instructions for performing the video processing method provided by the embodiments of the present application may be stored in the memories 504 of one or more computing devices 500 in the computing device cluster 600.
In some possible implementations, parts of the instructions for performing the video processing method may instead be stored separately in the memories 504 of one or more computing devices 500 in the computing device cluster 600. In other words, a combination of one or more computing devices 500 may jointly execute the instructions for performing the video processing method.
It should be noted that the memories 504 of different computing devices 500 in the computing device cluster 600 may store different instructions that respectively perform part of the functions of the first video processing apparatus. That is, the instructions stored in the memories 504 of the different computing devices 500 may implement the functions of one or more of the decoding module, the video characteristic analysis module, the transceiver module, the temporal downsampling module, the frame interpolation quality feedback module, or the encoding module.
In some possible implementations, one or more computing devices in the computing device cluster 600 may be connected through a network. The network may be a wide area network, a local area network, or the like. By way of example, fig. 7a illustrates one possible implementation, again taking the case in which the functions of the aforementioned first video processing apparatus are implemented by instructions stored in the memory 504 of the computing device 500 illustrated in fig. 5a. As shown in fig. 7a, two computing devices 500A and 500B are connected through the network; specifically, each computing device connects to the network through its communication interface. In this implementation, the memory 504 in the computing device 500A stores instructions for performing the functions of the decoding module, the video characteristic analysis module, and the transceiver module, while the memory 504 in the computing device 500B stores instructions for performing the functions of the temporal downsampling module, the frame interpolation quality feedback module, and the encoding module. In some possible implementations, parts of the instructions for performing the video processing method may also be stored separately in the memories 504 of one or more computing devices 500 in the computing device cluster. In other words, a combination of one or more computing devices 500 may jointly execute the instructions for performing the video processing method.
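As a purely illustrative aid, the module-to-device split described for fig. 7a could be captured in a deployment map such as the following Python sketch; the device identifiers and module names are assumptions used only to make the distribution concrete.

```python
# Hypothetical deployment map for the split of fig. 7a: which module
# functions are stored, as instructions, on which computing device.
CLUSTER_DEPLOYMENT = {
    "computing_device_500A": [
        "decoding_module",
        "video_characteristic_analysis_module",
        "transceiver_module",
    ],
    "computing_device_500B": [
        "temporal_downsampling_module",
        "frame_interpolation_quality_feedback_module",
        "encoding_module",
    ],
}


def modules_on(device_name: str) -> list:
    # Return the modules whose instructions are stored on the given device.
    return CLUSTER_DEPLOYMENT.get(device_name, [])


if __name__ == "__main__":
    print(modules_on("computing_device_500B"))
```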
It should be appreciated that the functions of the computing device 500A shown in fig. 7a may also be performed by a plurality of the computing devices 500 illustrated in fig. 5a. Likewise, the functions of the computing device 500B may also be performed by a plurality of the computing devices 500 illustrated in fig. 5a.
Illustratively, take the case in which the functions of the aforementioned second video processing apparatus are implemented by instructions stored in the memory 504 of the computing device 500 illustrated in fig. 5b. As shown in fig. 6b, the computing device cluster 600 includes at least one computing device 500. The same instructions for performing the video processing method provided by the embodiments of the present application may be stored in the memories 504 of one or more computing devices 500 in the computing device cluster 600.
In some possible implementations, parts of the instructions for performing the video processing method may instead be stored separately in the memories 504 of one or more computing devices 500 in the computing device cluster 600. In other words, a combination of one or more computing devices 500 may jointly execute the instructions for performing the video processing method.
It should be noted that the memories 504 of different computing devices 500 in the computing device cluster 600 may store different instructions that respectively perform part of the functions of the second video processing apparatus. That is, the instructions stored in the memories 504 of the different computing devices 500 may implement the functions of one or more of the decoding module, the video frame interpolation module, the acquisition module, the encoding module, or the transceiver module.
In some possible implementations, one or more computing devices in the computing device cluster 600 may be connected through a network. The network may be a wide area network, a local area network, or the like. By way of example, fig. 7b illustrates one possible implementation, in which the functions of the aforementioned second video processing apparatus are implemented by instructions stored in the memory 504 of the computing device 500 illustrated in fig. 5b. As shown in fig. 7b, two computing devices 500A' and 500B' are connected through the network; specifically, each computing device connects to the network through its communication interface. In this implementation, the memory 504 in the computing device 500A' stores instructions for performing the functions of the transceiver module and the video frame interpolation module, while the memory 504 in the computing device 500B' stores instructions for performing the functions of the acquisition module, the encoding module, and the decoding module. In some possible implementations, parts of the instructions for performing the video processing method may also be stored separately in the memories 504 of one or more computing devices 500 in the computing device cluster. In other words, a combination of one or more computing devices 500 may jointly execute the instructions for performing the video processing method.
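The following sketch illustrates, again only as a hypothetical example, how two such devices could exchange data through their communication interfaces over the network in the split described for fig. 7b; the host name, port, and payload format are assumptions and do not reflect any particular protocol of the embodiments.

```python
# Hypothetical exchange between the two devices of fig. 7b through their
# communication interfaces. Host name, port, and payload are assumptions.
import socket

PORT = 50007  # assumed port


def run_device_500B_prime(frames: list) -> None:
    # 500B' (acquisition/encoding/decoding side): sends the temporally
    # downsampled frames to 500A' over the network.
    with socket.create_connection(("device-500a-prime", PORT)) as conn:
        conn.sendall(repr(frames).encode("utf-8"))


def run_device_500A_prime() -> None:
    # 500A' (transceiver and video frame interpolation side): receives the
    # frames; the frame interpolation module would then restore the frames
    # discarded on the sending side.
    with socket.create_server(("", PORT)) as server:
        conn, _ = server.accept()
        with conn:
            payload = conn.recv(65536)
            print("received", len(payload), "bytes for interpolation")
```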
It should be appreciated that the functions of the computing device 500A' shown in fig. 7b may also be performed by a plurality of the computing devices 500 illustrated in fig. 5b. Likewise, the functions of the computing device 500B' may also be performed by a plurality of the computing devices 500 illustrated in fig. 5b.
It will be appreciated that the computing device that stores the instructions for implementing the functions of the modules included in the first video processing apparatus may also, together with the computing device that stores the instructions for implementing the functions of the modules included in the second video processing apparatus, form a computing device cluster as illustrated in fig. 7a or fig. 7b.
Based on the same conception, an embodiment of the present application further provides a computer program product containing instructions. The computer program product may be software, or a program product containing instructions, that is capable of running on a computing device or being stored in any usable medium. When the computer program product runs on at least one computing device, the at least one computing device is caused to perform the video processing method provided by the embodiments of the present application.
Based on the same conception, an embodiment of the present application further provides a computer-readable storage medium. The computer-readable storage medium may be any usable medium that a computing device can store, or a data storage device, such as a data center, containing one or more usable media. The usable medium may be a magnetic medium (e.g., a floppy disk, a hard disk, or a magnetic tape), an optical medium (e.g., a DVD), a semiconductor medium (e.g., a solid state drive), or the like. The computer-readable storage medium includes instructions that instruct a computing device to perform the video processing method provided by the embodiments of the present application.
Based on the same conception, an embodiment of the present application further provides a computer chip, which is coupled to a memory and is configured to read a computer program stored in the memory and execute the video processing method provided by the embodiments of the present application.
Based on the same conception, an embodiment of the present application further provides a chip system, which includes a processor and is configured to support a computer device in implementing the video processing method provided by the embodiments of the present application. In one possible design, the chip system further includes a memory for storing the programs and data necessary for the computer device. The chip system may consist of a computer chip, or may include a computer chip and other discrete devices.
It should be noted that the above embodiments are merely intended to illustrate, rather than to limit, the technical solutions of the present application. Although the present application has been described in detail with reference to the above embodiments, those skilled in the art should understand that the technical solutions described in the above embodiments may still be modified, or some of their technical features may be equivalently replaced, and such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the scope of the technical solutions of the embodiments of the present application.