Disclosure of Invention
In view of this, the present invention provides a video transcoding method, apparatus, storage medium and electronic device, by which the average code rates under different transcoding parameters can be predicted in advance, and the transcoding parameter corresponding to an appropriate average code rate is selected from them for the final CRF transcoding of the video, so that video quality is ensured while the bandwidth requirement is still met.
A method of video transcoding, comprising:
 obtaining a first video segment, and determining a maximum code rate supported by the first video segment;
 performing CRF pre-transcoding on the first video segment according to preset initial transcoding parameters to obtain a second video segment, wherein the initial transcoding parameters at least comprise an initial transcoding resolution and an initial code rate factor;
 obtaining a peak signal-to-noise ratio, a structural similarity index and a number of code stream bytes of each video frame in the second video segment, wherein the peak signal-to-noise ratio is used for characterizing the image quality of the video frames after CRF transcoding, and the structural similarity index is used for characterizing the similarity of the video frames before and after CRF transcoding;
 determining a target video sub-segment in the second video segment based on the number of code stream bytes of each video frame in the second video segment, wherein the sum of the numbers of code stream bytes of the video frames in the target video sub-segment is larger than that of any other video sub-segment with the same frame number in the second video segment;
 acquiring transcoding characteristic information corresponding to the target video sub-segment based on the peak signal-to-noise ratio and the structural similarity index of each video frame in the target video sub-segment;
 extracting code stream layer characteristic information corresponding to the target video sub-segment, wherein the code stream layer characteristic information at least comprises distribution characteristics of quantization parameters, motion vector characteristics, frame size characteristics and macroblock type characteristics;
 applying a pre-trained prediction model to process the transcoding characteristic information and the code stream layer characteristic information corresponding to the target video sub-segment, to obtain first average code rates, predicted by the prediction model, of the first video segment after CRF transcoding under a plurality of other transcoding parameters; and
 selecting, from at least one average code rate to be selected, the transcoding parameter corresponding to the maximum average code rate to be selected, and performing final CRF transcoding on the first video segment, wherein an average code rate to be selected is a first average code rate smaller than the maximum code rate.
Optionally, the determining the target video sub-segment in the second video segment based on the number of bytes of the code stream of each video frame in the second video segment includes:
 Generating a frame size list corresponding to the second video segment, wherein the frame size list comprises the number of bytes of the code stream of each video frame in the second video segment which are arranged according to the time sequence;
 determining a preset window width of a sliding window;
 sliding the sliding window through the frame size list, and calculating, for each window position, the sum of the numbers of code stream bytes of the video frames within the window width;
 and selecting, as the target video sub-segment, the video frame region in the frame size list covered by the sliding window with the maximum sum of code stream bytes.
Optionally, the obtaining the transcoding feature information corresponding to the target video sub-segment based on the peak signal-to-noise ratio and the structural similarity index of each video frame in the target video sub-segment includes:
 and calculating an average peak signal-to-noise ratio and an average structural similarity index of the target video sub-segment based on the peak signal-to-noise ratio and the structural similarity index of each video frame in the target video sub-segment, and taking the average peak signal-to-noise ratio and the average structural similarity index as transcoding characteristic information corresponding to the target video sub-segment.
Optionally, the process of training the prediction model includes:
 collecting a video segment set according to the frame number of the target video sub-segment, wherein the video segment set comprises at least one third video segment, the third video segment being a video segment with the same frame number as the target video sub-segment;
 performing CRF pre-transcoding on the third video segment according to the initial transcoding parameters to obtain transcoding characteristic information and code stream layer characteristic information of the pre-transcoded third video segment;
 performing CRF transcoding on the third video segment according to the at least one other transcoding parameter to obtain a second average code rate corresponding to the at least one other transcoding parameter; and
 taking the transcoding characteristic information and the code stream layer characteristic information of the pre-transcoded third video segment as training data, and taking the second average code rate as a training label, iteratively training the prediction model until the prediction model converges, thereby completing the training of the prediction model.
Optionally, the obtaining process of the peak signal-to-noise ratio and the structural similarity index of each video frame in the second video segment includes:
 Scaling the first video segment based on the video resolution of the second video segment to obtain a fourth video segment;
 and calculating the image pixel difference and the image similarity between each video frame in the fourth video segment and the corresponding video frame in the second video segment, and obtaining the peak signal-to-noise ratio and the structural similarity index of each video frame in the second video segment.
A video transcoding device, comprising:
 a first acquisition unit, configured to acquire a first video segment and determine a maximum code rate supported by the first video segment;
 a first transcoding unit, configured to perform CRF pre-transcoding on the first video segment according to preset initial transcoding parameters to obtain a second video segment, wherein the initial transcoding parameters at least comprise an initial transcoding resolution and an initial code rate factor;
 a second acquisition unit, configured to obtain a peak signal-to-noise ratio, a structural similarity index and a number of code stream bytes of each video frame in the second video segment, wherein the peak signal-to-noise ratio is used for characterizing the image quality of the video frames after CRF transcoding, and the structural similarity index is used for characterizing the similarity of the video frames before and after CRF transcoding;
 a determining unit, configured to determine, based on the number of code stream bytes of each video frame in the second video segment, a target video sub-segment in the second video segment, wherein the sum of the numbers of code stream bytes of the video frames in the target video sub-segment is greater than that of any other video sub-segment with the same frame number in the second video segment;
 a third acquisition unit, configured to acquire transcoding characteristic information corresponding to the target video sub-segment based on the peak signal-to-noise ratio and the structural similarity index of each video frame in the target video sub-segment;
 a fourth acquisition unit, configured to extract code stream layer characteristic information corresponding to the target video sub-segment, wherein the code stream layer characteristic information at least comprises distribution characteristics of quantization parameters, motion vector characteristics, frame size characteristics and macroblock type characteristics;
 a prediction unit, configured to apply a pre-trained prediction model to process the transcoding characteristic information and the code stream layer characteristic information corresponding to the target video sub-segment, to obtain first average code rates, predicted by the prediction model, of the first video segment after CRF transcoding under a plurality of other transcoding parameters; and
 a second transcoding unit, configured to select, from at least one average code rate to be selected, the transcoding parameter corresponding to the maximum average code rate to be selected, and perform final CRF transcoding on the first video segment, wherein an average code rate to be selected is a first average code rate smaller than the maximum code rate.
Optionally, the determining unit determines, based on the number of bytes of the code stream of each video frame in the second video segment, a target video sub-segment in the second video segment, specifically for:
 Generating a frame size list corresponding to the second video segment, wherein the frame size list comprises the number of bytes of the code stream of each video frame in the second video segment which are arranged according to the time sequence;
 determining a preset window width of a sliding window;
 sliding the sliding window through the frame size list, and calculating, for each window position, the sum of the numbers of code stream bytes of the video frames within the window width;
 and selecting, as the target video sub-segment, the video frame region in the frame size list covered by the sliding window with the maximum sum of code stream bytes.
Optionally, the apparatus further includes: a training unit;
 The training unit is used for:
 collecting a video segment set according to the frame number of the target video sub-segment, wherein the video segment set comprises at least one third video segment, the third video segment being a video segment with the same frame number as the target video sub-segment;
 performing CRF pre-transcoding on the third video segment according to the initial transcoding parameters to obtain transcoding characteristic information and code stream layer characteristic information of the pre-transcoded third video segment;
 performing CRF transcoding on the third video segment according to the at least one other transcoding parameter to obtain a second average code rate corresponding to the at least one other transcoding parameter; and
 taking the transcoding characteristic information and the code stream layer characteristic information of the pre-transcoded third video segment as training data, and taking the second average code rate as a training label, iteratively training the prediction model until the prediction model converges, thereby completing the training of the prediction model.
A storage medium comprising stored instructions that, when executed, control a device on which the storage medium resides to perform any one of the video transcoding methods described above.
An electronic device comprising a memory and one or more instructions, wherein the one or more instructions are stored in the memory and configured to be executed by one or more processors to perform any one of the video transcoding methods described above.
Compared with the prior art, the invention has the following advantages:
 The invention provides a video transcoding method, apparatus, storage medium and electronic device. The method comprises the following steps: obtaining a first video segment, and determining a maximum code rate supported by the first video segment; performing CRF pre-transcoding on the first video segment according to initial transcoding parameters to obtain a second video segment; obtaining the peak signal-to-noise ratio, structural similarity index and number of code stream bytes of each video frame in the second video segment; determining a target video sub-segment in the second video segment based on the number of code stream bytes of each video frame; obtaining transcoding characteristic information corresponding to the target video sub-segment based on the peak signal-to-noise ratio and structural similarity index of each video frame in the target video sub-segment; extracting code stream layer characteristic information corresponding to the target video sub-segment; applying a prediction model to process the transcoding characteristic information and code stream layer characteristic information corresponding to the target video sub-segment to obtain a plurality of first average code rates; and selecting, from at least one average code rate to be selected, the transcoding parameter corresponding to the maximum average code rate to be selected, and performing final CRF transcoding on the first video segment, wherein an average code rate to be selected is a first average code rate smaller than the maximum code rate. By applying the method provided by the invention, the average code rates under different transcoding parameters can be predicted in advance, and the transcoding parameter corresponding to an appropriate average code rate is selected from them for the final CRF transcoding of the video, so that video quality is ensured while the bandwidth requirement is still met.
Detailed Description
The embodiments of the present invention are described below clearly and completely with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those skilled in the art based on the embodiments of the invention without inventive effort fall within the scope of the invention.
In the present disclosure, relational terms such as first and second are used solely to distinguish one entity or action from another, and do not necessarily require or imply any actual such relationship or order between those entities or actions. The terms "comprise," "include," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements, but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element preceded by the phrase "comprising a" does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
The invention is operational with numerous general purpose or special purpose computing environments or configurations, for example: personal computers, server computers, hand-held or portable devices, tablet devices, multiprocessor systems, distributed computing environments that include any of the above systems or devices, and the like.
The embodiment of the invention provides a video transcoding method that can be applied to various system platforms. The execution subject of the method may be a processor of a computer terminal or of various mobile devices. A flow chart of the method is shown in fig. 1, and the method specifically comprises the following steps:
 S101: obtaining a first video segment, and determining the maximum code rate supported by the first video segment.
S102: performing CRF pre-transcoding on the first video segment according to the preset initial transcoding parameters to obtain a second video segment.
The initial transcoding parameters include at least an initial transcoding resolution and an initial code rate factor, and may further include, but are not limited to, encoding formats such as H264, H265, AV1, AVS, etc.
S103: obtaining the peak signal-to-noise ratio, structural similarity index and number of code stream bytes of each video frame in the second video segment.
Here, the Peak Signal-to-Noise Ratio (PSNR) is used to characterize the image quality of video frames after CRF (Constant Rate Factor) transcoding, and the Structural Similarity Index (SSIM) is used to characterize the similarity of video frames before and after CRF transcoding.
It should be noted that calculating the peak signal-to-noise ratio and the structural similarity index of two video frames requires that the two images have the same resolution. The embodiment of the invention therefore scales the video segment before transcoding to the resolution of the transcoded video segment, and then compares the two for the calculation.
Optionally, the obtaining process of the peak signal-to-noise ratio and the structural similarity index of each video frame in the second video segment provided by the embodiment of the present invention may include:
 And scaling the first video segment based on the video resolution of the second video segment to obtain a fourth video segment.
And calculating the image pixel difference and the image similarity between each video frame in the fourth video segment and the corresponding video frame in the second video segment, and obtaining the peak signal-to-noise ratio and the structural similarity index of each video frame in the second video segment.
It can be understood that the embodiment of the invention can also scale the transcoded video clip into the original video clip resolution and then compare the video clip with the original video clip for calculation.
Optionally, the obtaining process of the peak signal-to-noise ratio and the structural similarity index of each video frame in the second video segment provided by the embodiment of the present invention may include:
 And scaling the second video segment based on the video resolution of the first video segment to obtain a fifth video segment.
And calculating the image pixel difference and the image similarity between each video frame in the fifth video segment and the corresponding video frame in the first video segment, and obtaining the peak signal-to-noise ratio and the structural similarity index of each video frame in the second video segment.
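As a concrete illustration of the frame-quality metric discussed above, the following is a minimal sketch of the PSNR calculation for two same-resolution 8-bit frames. The function name and the toy frames are illustrative only, not part of the invention; real frames would come from the decoded video segments.

```python
import math

def psnr(original, transcoded, peak=255.0):
    """Peak signal-to-noise ratio between two same-resolution 8-bit frames,
    each given as a flat list of pixel values."""
    if len(original) != len(transcoded):
        raise ValueError("frames must have the same resolution")
    mse = sum((a - b) ** 2 for a, b in zip(original, transcoded)) / len(original)
    if mse == 0:
        return math.inf  # identical frames: infinite PSNR
    return 10.0 * math.log10(peak * peak / mse)

# Toy 4x4 frame: a flat gray frame vs. a copy with one pixel off by 2.
frame_a = [128] * 16
frame_b = [130] + [128] * 15
# MSE = 4/16 = 0.25, so PSNR = 10*log10(255^2 / 0.25) ≈ 54.15 dB
print(round(psnr(frame_a, frame_b), 2))
```

The same pairwise comparison is done per frame between the scaled source segment and the transcoded segment; SSIM is computed analogously but over local windows of the two images.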
Optionally, the embodiment of the invention can directly acquire the peak signal-to-noise ratio and the structural similarity index of each video frame in the process of pre-transcoding the first video segment.
S104: a target video sub-segment in the second video segment is determined based on the number of bytes of the stream of each video frame in the second video segment.
The sum of the numbers of code stream bytes of the video frames in the target video sub-segment is larger than that of any other video sub-segment with the same frame number in the second video segment.
Specifically, a sliding window can be set, and the sum of the byte numbers of the code stream of each video frame in the window is calculated in the sliding process of the window.
Alternatively, according to the preset frame number, several adjacent video frames can be sequentially selected as a video sub-segment to calculate the sum of the byte numbers of the code stream.
It should be noted that the larger the sum of the numbers of code stream bytes of a video sub-segment, the more complex that sub-segment is, and the more difficult it is to transcode.
S105: and obtaining transcoding characteristic information corresponding to the target video sub-segment based on the peak signal-to-noise ratio and the structural similarity index of each video frame in the target video sub-segment.
The transcoding characteristic information consists of the average peak signal-to-noise ratio and the average structural similarity index of the target video sub-segment: both averages are calculated from the peak signal-to-noise ratio and structural similarity index of each video frame in the target video sub-segment, and together they serve as the transcoding characteristic information corresponding to the target video sub-segment.
S106: and extracting code stream layer characteristic information corresponding to the target video sub-segment.
The code stream layer characteristic information includes at least the distribution characteristics of quantization parameters (quantizer parameter, QP), motion vector characteristics, frame size characteristics, and macroblock type characteristics.
Further, the code stream layer characteristic information may specifically include 45 features, as listed in Table 1 and Table 2:
 S107: and processing transcoding characteristic information and code stream layer characteristic information corresponding to the target video sub-segment by applying a pre-trained prediction model to obtain a first average code rate after CRF transcoding of the first video segment predicted by the prediction model according to a plurality of other transcoding parameters.
It should be noted that the prediction model may be a regression model; alternative models include LightGBM, Random Forest, SVM (support vector machine), and the like.
It should be noted that other transcoding parameters correspond to different code rate factors, and the prediction model mainly predicts the average code rate of the video clips after transcoding under the different code rate factors.
Specifically, the transcoding characteristic information and the code stream layer characteristic information are combined and input into a prediction model to predict the average code rate.
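The combination step can be sketched as a simple feature-vector concatenation. All feature names and values below are hypothetical placeholders for illustration; an actual deployment would use the 45 code stream layer features together with the two in-process averages.

```python
def build_feature_vector(transcode_feats, stream_feats):
    """Concatenate the in-process transcoding features (average PSNR/SSIM)
    with the code stream layer features into one model input vector.
    A fixed key order keeps the vector layout stable across samples."""
    vec = [transcode_feats["avg_psnr"], transcode_feats["avg_ssim"]]
    vec += [stream_feats[k] for k in sorted(stream_feats)]
    return vec

# Hypothetical feature values for one target video sub-segment.
vec = build_feature_vector(
    {"avg_psnr": 41.2, "avg_ssim": 0.972},
    {"frame_size_mean": 18500.0, "mv_mag_mean": 3.1, "qp_mean": 26.4},
)
print(len(vec))  # 5 features in this toy example
```

The resulting vector is what the regression model consumes, one vector per target sub-segment.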
S108: and selecting a transcoding parameter corresponding to the maximum average code rate to be selected from at least one average code rate to be selected, and performing final CRF transcoding on the first video segment.
Wherein the average code rate to be selected is a first average code rate smaller than the maximum code rate.
Based on the above method, the process of selecting the maximum average code rate to be selected includes the following scene embodiments:
 For example, suppose a certain video is to be output at 1080P, and for CRF values (code rate factors) of 24, 25, 26, 27 and 28 the bitrates (average code rates) obtained by the method are 3200 Kbps, 3000 Kbps, 2800 Kbps, 2400 Kbps and 2000 Kbps, respectively. If, balancing image quality against bandwidth cost, the average code rate must not exceed 2500 Kbps (i.e. the maximum code rate supported by the video is 2500 Kbps), then code rate factor = 27 can be chosen as the final coding factor for transcoding the video.
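The selection rule in this scenario can be sketched as follows; the figures mirror the example above, and the function name is an illustrative placeholder:

```python
def select_code_rate_factor(predicted_rates, max_code_rate):
    """Among the CRFs whose predicted average code rate stays below the
    maximum supported code rate, pick the one with the largest rate
    (highest quality that still fits the bandwidth budget)."""
    candidates = {crf: rate for crf, rate in predicted_rates.items()
                  if rate < max_code_rate}
    if not candidates:
        return None  # no code rate factor satisfies the cap
    return max(candidates, key=candidates.get)

# Predicted average code rates (Kbps) from the 1080P example above.
predicted = {24: 3200, 25: 3000, 26: 2800, 27: 2400, 28: 2000}
print(select_code_rate_factor(predicted, 2500))  # CRF 27 is chosen
```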
By applying the method provided by the embodiment of the invention, the average code rate under different code conversion parameters can be predicted in advance, and the video is finally transcoded by selecting the appropriate code conversion parameters corresponding to the average code rate from the average code rates, so that the video quality is ensured, and the bandwidth requirement is ensured to be met.
In the method provided by the embodiment of the present invention, referring to fig. 2, determining a target video sub-segment in a second video segment based on the number of bytes of a code stream of each video frame in the second video segment includes:
 S201: and generating a frame size list corresponding to the second video segment.
The frame size list contains the byte number of the code stream of each video frame in the second video segment arranged according to the time sequence.
S202: and determining the window width of a preset sliding window.
S203: sliding a sliding window within the frame size list and calculating a sum of the number of bytes of the stream of each video frame within the window width each time the sliding window is slid.
Referring to fig. 3, fig. 3 is a schematic view of the sliding process of the sliding window. The sliding window slides along the time sequence; after each slide, one video frame enters the window's coverage and one video frame exits it. The window width of the sliding window spans a fixed number of consecutive video frames.
S204: and selecting a video frame region where a sliding window with the maximum sum of the byte numbers of the code streams is positioned in the frame size list as a target video sub-segment.
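Steps S201 to S204 can be sketched as a rolling-sum scan over the frame size list. The per-frame byte counts below are made up for illustration:

```python
def find_target_subsegment(frame_sizes, window_width):
    """Return (start_index, byte_sum) of the window of `window_width`
    consecutive frames whose code stream byte total is largest.
    A rolling sum keeps the scan O(n) instead of O(n * window_width)."""
    if window_width > len(frame_sizes):
        raise ValueError("window wider than the segment")
    current = sum(frame_sizes[:window_width])
    best_sum, best_start = current, 0
    for i in range(window_width, len(frame_sizes)):
        # Slide by one frame: one frame enters the window, one exits.
        current += frame_sizes[i] - frame_sizes[i - window_width]
        if current > best_sum:
            best_sum, best_start = current, i - window_width + 1
    return best_start, best_sum

# Hypothetical per-frame byte counts; frames 2-3 form the busiest region.
sizes = [1200, 800, 4000, 5200, 900, 700]
print(find_target_subsegment(sizes, 2))  # (2, 9200)
```

The frames covered by the winning window position are the target video sub-segment.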
Based on the method provided in the above embodiment, the video transcoding process may have the following implementation scenarios:
 (1) Perform one H264 pre-transcoding of the target video with a constant rate factor (CRF) according to the preset transcoding resolution, preset code rate factor and other preset parameters, to obtain an output video file, and calculate the peak signal-to-noise ratio and structural similarity index during the pre-transcoding. The peak signal-to-noise ratio and structural similarity index are computed at the preset transcoding resolution rather than after scaling back to the source resolution. These features cannot be obtained from the transcoded video file alone, and are therefore best collected during the transcoding process.
(2) And counting the byte number of the code stream of each frame of the pre-transcoding output video file to form a time-sequence frame size list.
 (3) Traverse the list with a sliding window T frames wide, and calculate the sum of the numbers of code stream bytes of the frames within the window. Record the maximum value of this sum and the position of the corresponding sliding window, and define that window position as the target sliding window.
According to the CRF coding principle, the code rate distribution after pre-transcoding reflects the complexity of the picture to a certain extent, and the sliding window is used to select the most complex segment of the target video as its representative segment, for the following reasons. First, the complexity distribution of the scenes in a video is uncertain; extracting features over the whole video would mix and confound the features of simple and complex scenes. Second, the average code rate of the most complex segment is more useful for estimating the file volume occupied and the transmission bandwidth consumed. In addition, if a desired average code rate is selected, from the image-quality standpoint, among the average code rates predicted for each CRF, then a video produced with that code rate factor as the transcoding parameter is guaranteed to keep its image quality at or above the desired level, because the expected average code rate is established on the most complex segment.
 (4) Extract the average peak signal-to-noise ratio and average structural similarity index within the target sliding window as the in-process transcoding characteristic information.
 (5) Analyze the pre-transcoded video file generated in step (1) within the target sliding window and extract the code stream layer features in the window; this scheme selects 45 code stream layer features in total, mainly comprising the distribution characteristics of quantization parameters, motion vector characteristics, frame size characteristics and macroblock type characteristics.
 (6) Merge the in-process transcoding characteristic information generated in step (4) with the code stream layer characteristic information generated in step (5), and feed the combined features to the trained machine learning model. The model predicts the average code rate of the video within the target sliding window after constant-rate-factor coding under the specified code rate factor and specified coding parameters. The specified coding parameters are consistent with the coding parameters used when creating the dataset labels in the model training stage, and include, but are not limited to, the coding resolution and the coding format, such as H264, H265, AV1, AVS, etc.
 (7) The average code rate of the video within the target sliding window after coding with each code rate factor is thus obtained; the average code rate of the whole target video after coding is necessarily smaller than this value under the same code rate factor. An appropriate code rate factor is then selected for the final transcoding according to the average code rates and the actual service conditions.
In the present invention, the selected code stream layer characteristic information can be divided into the following categories: frame size characteristics, quantization parameter distribution characteristics, motion vector characteristics, macroblock type characteristics, and in-process characteristics. Each feature is first calculated within one frame, and the average over the sliding window is then taken. Take the quantization parameter maximum as an example: the maximum quantization parameter of each frame is calculated, and the average of these per-frame maxima is taken as the feature value.
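The quantization-parameter-maximum example above can be sketched as follows; the QP values are hypothetical:

```python
def windowed_qp_max_feature(per_frame_qps):
    """Feature value = average, over the frames in the sliding window,
    of each frame's maximum quantization parameter."""
    per_frame_maxima = [max(frame_qps) for frame_qps in per_frame_qps]
    return sum(per_frame_maxima) / len(per_frame_maxima)

# Three frames' macroblock QP lists inside the target window (hypothetical).
window_qps = [[22, 24, 26], [23, 25, 27], [24, 26, 28]]
print(windowed_qp_max_feature(window_qps))  # mean of 26, 27, 28 -> 27.0
```

The other per-frame features (motion vector magnitudes, frame sizes, macroblock type counts) are aggregated over the window in the same fashion.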
The greater the correlation between a feature and the label data, the higher the prediction accuracy. The distribution of quantization parameter values reflects the picture coding difficulty to a certain extent: when coded with the same code rate factor, simpler pictures have a smaller average quantization parameter and more complex pictures a larger one. In CRF coding, the coding difficulty is ultimately reflected in the volume of the generated file, so the quantization parameter distribution is a relatively effective feature. Likewise, motion vectors can represent the coding difficulty. In this scheme, besides the traditional motion vector magnitude, the error between the predicted and true motion vectors is also introduced as a feature; it reflects, to a certain extent, the irregularity and unpredictability of motion in the picture, and the greater the error, the higher the coding complexity. The macroblock partitioning mode reflects the complexity of the texture in the picture. Introducing the peak signal-to-noise ratio and structural similarity index features from the transcoding process notably improves the model's prediction accuracy: the more difficult a source is for the encoder, the lower the peak signal-to-noise ratio and structural similarity index under the same code rate factor, and vice versa.
In the invention, the extraction of the code stream layer features is completed in the compressed domain: no pixel-level operations are needed, the required computational cost is low, and engineering implementation is easy; experimental data show that these features achieve good inference accuracy. The extraction of the peak signal-to-noise ratio and structural similarity index features during transcoding likewise requires only a small computational cost, because it reuses the pre-transcoding process: the peak signal-to-noise ratio and structural similarity index between the current frame to be coded and the reconstructed frame are computed inside the encoder.
In the method provided by the embodiment of the invention, referring to fig. 4, a process of training a prediction model includes:
 S401: and collecting a video clip set according to the frame number of the target video sub-clip, wherein the video clip set comprises at least one third video clip.
The third video segment is a video segment with the same frame number as the target video sub-segment.
It should be noted that video segments with the same frame number as the target video sub-segment may be obtained directly by a web crawler or from a database, or may be intercepted from a long video according to the frame number of the target video sub-segment.
S402: and after the CRF pre-transcoding is carried out on the third video segment according to the initial transcoding parameters, transcoding characteristic information and code stream layer characteristic information of the pre-transcoded third video segment are obtained.
It should be noted that the process of pre-transcoding the third video segment in S402 is identical to the process of pre-transcoding in S102, which will not be repeated here.
S403: and performing CRF transcoding on the third video segment according to at least one other transcoding parameter to obtain a second average code rate corresponding to the at least one other transcoding parameter.
S404: and taking the transcoding characteristic information and the code stream layer characteristic information of the third video segment after pre-transcoding as training data, and taking the second average code rate as a training label to carry out iterative training on the prediction model until the prediction model converges, so that training on the prediction model is completed.
It can be understood that, after the third video segment is pre-transcoded, transcoding processes corresponding to the other transcoding parameters are performed on the third video segment multiple times in the CRF transcoding mode, so as to obtain the second average code rate corresponding to each of the other transcoding parameters. The transcoding characteristic information obtained from the pre-transcoding and the code stream layer characteristic information are input into the prediction model as training data, and the prediction model outputs a predicted average code rate corresponding to each of the other transcoding parameters. The predicted average code rate is compared with the standard second average code rate; if the difference between the two is large, the model parameters of the prediction model are adjusted and prediction is performed again, until the error between the predicted average code rate and the standard second average code rate falls within a preset range, which characterizes convergence of the prediction model and completes its training.
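The adjust-and-predict-again loop described above can be sketched as follows. This is a minimal illustration only: the single-feature linear model fitted by gradient descent and the synthetic data stand in for the real prediction model (e.g. a gradient-boosted regressor) and the extracted feature vectors; the function name and tolerance are hypothetical.

```python
def train_until_converged(features, labels, lr=0.01, tol=1e-4, max_iter=10000):
    """Fit labels ~ w * feature + b by gradient descent; stop once the
    mean absolute error between predictions and labels is within tol,
    mirroring 'error within a preset range' in the text."""
    w, b = 0.0, 0.0
    n = len(features)
    mae = float("inf")
    for _ in range(max_iter):
        preds = [w * x + b for x in features]
        errors = [p - y for p, y in zip(preds, labels)]
        mae = sum(abs(e) for e in errors) / n
        if mae < tol:  # predicted rate close enough to the label rate
            break
        # difference still large: adjust model parameters, then predict again
        w -= lr * 2 * sum(e * x for e, x in zip(errors, features)) / n
        b -= lr * 2 * sum(errors) / n
    return w, b, mae

# hypothetical training data: feature values and second-average-code-rate labels
w, b, mae = train_until_converged([1, 2, 3, 4], [3, 5, 7, 9])
```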
The training phase of the prediction model comprises the steps of dataset creation and model training. At this stage, the video is transcoded with different code rate factors to obtain the tag data, i.e. the answers, under each code rate factor. Each data-answer combination is trained to obtain a model file. In the actual inference stage (the engineering production stage), only one pre-transcoding is needed: the feature information extracted after pre-transcoding is fed into each model file (model), and the average code rate under each of the different code rate factors can be inferred.
In the present invention, a specific embodiment process of training a prediction model may include the following implementation scenarios:
 A. Select T consecutive frames of video from the original film source material to form a video clip, and collect a certain number of such clips to form a dataset. T takes the value of 250 frames, corresponding to a duration of 10 seconds for a 25 fps film source. The video clips are typically cut from key frame positions in the original film source material without transcoding (stream copy).
B. Perform one H264 pre-transcoding with a constant rate factor (CRF) on the video clips in the dataset according to the preset transcoding resolution, preset code rate factor, and other preset parameters, to obtain an output video file.
C. Extract the peak signal-to-noise ratio and structural similarity index from the pre-transcoding process as characteristic information, and extract the code stream layer characteristic information of the pre-transcoded output video file; the two kinds of characteristic information total 47 features.
D. Transcode the video clips with the code rate factor CRF set to N, specifying the coding parameters required by the service, such as coding format and resolution. N is determined by the service scenario; for example, if N takes the values 20-32, the video source segment is transcoded 13 times. The service-specified coding parameters must be consistent with the coding parameters used when the average code rate prediction is finally performed with the model; for example, the coding format may be H264, H265, AV1, AVS, and the like. The specified resolution must likewise be consistent with the resolution used when the average code rate prediction is finally performed with the model, such as 1080P.
E. Calculate the average code rate of the transcoded video file output in step D and use it as the label of the dataset. For example, if N takes the values 20-32, 13 labels are formed, and the labels and the characteristic information together form 13 datasets.
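The label computation in step E reduces to file size over duration. A minimal sketch, with hypothetical file size and duration values; the function name is illustrative:

```python
def average_bitrate_kbps(file_size_bytes, duration_seconds):
    """Average code rate of a transcoded file in kilobits per second."""
    return file_size_bytes * 8 / duration_seconds / 1000

# 250 frames at 25 fps -> a 10-second clip; a hypothetical 1.5 MB output file
label = average_bitrate_kbps(1_500_000, 250 / 25)  # 1200.0 kbps

# one label per code rate factor: CRF 20..32 gives 13 labels / 13 datasets
crf_values = list(range(20, 33))
assert len(crf_values) == 13
```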
The machine learning model used in this scheme is a regression model; alternative models include LightGBM, random forest, SVM (support vector machine), and the like. A separate model file for prediction is obtained by training for each of the different code rate factors N. For example, if the average code rate after transcoding needs to be predicted for CRF values 20-32, 13 model files are obtained through training.
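The one-model-per-CRF structure can be sketched as a mapping from each code rate factor to its trained model. The "model" below is a trivial mean predictor standing in for LightGBM, random forest, or SVM, and the per-CRF label sets are hypothetical:

```python
def fit_mean_model(labels):
    """Placeholder 'training': returns a model that predicts the mean label.
    A real implementation would fit a regressor on the feature vectors."""
    mean = sum(labels) / len(labels)
    return lambda features: mean  # ignores features; illustration only

# hypothetical label sets: average bitrates of training clips per CRF value,
# decreasing as CRF grows (higher CRF -> lower bitrate)
labels_by_crf = {crf: [4000 - 250 * (crf - 20) + d for d in (-100, 0, 100)]
                 for crf in range(20, 33)}

# one model file per code rate factor: CRF 20..32 -> 13 models
models = {crf: fit_mean_model(lbls) for crf, lbls in labels_by_crf.items()}
assert len(models) == 13
```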
Based on the method provided by the above embodiments, the invention further provides an application scenario of the video transcoding method, embodied as follows. In this example, constant rate factor coding at 1080P resolution needs to be performed on a certain video, with the code rate factor CRF in the range 24-32:
 (1) Perform constant rate factor coding on the video film source to generate a pre-transcoded video. Preset parameters: coding standard H264, coding resolution 540P, frame rate 25, CRF value 28, 2 reference frames, B-frame coding enabled, IDR (Instantaneous Decoding Refresh) key frame interval of 75 frames, and peak signal-to-noise ratio and structural similarity index calculation enabled. Other parameters use the x264 defaults, and the peak signal-to-noise ratio and structural similarity index of the Y component of each frame are recorded. The same encoding parameters as in the dataset preprocessing stage are used here.
(2) Output and record the size of each frame of the pre-transcoded video using a tool such as ffprobe.
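Step (2) might look like the following sketch: invoke ffprobe to dump per-frame coded sizes, then parse them into a frame size list. The exact ffprobe flags are indicative of one common invocation, and the sample output and file name are hypothetical:

```python
# Indicative command (run outside Python):
#   ffprobe -v quiet -select_streams v:0 -show_entries frame=pkt_size \
#           -of csv=p=0 pre_transcoded.mp4

# hypothetical captured output: bytes of each coded frame, one per line
sample_output = "1843\n512\n498\n2210\n"

def parse_frame_sizes(csv_text):
    """Turn ffprobe's one-value-per-line output into a list of ints."""
    return [int(line) for line in csv_text.split() if line]

frame_sizes = parse_frame_sizes(sample_output)
assert frame_sizes == [1843, 512, 498, 2210]
```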
(3) Traverse the frame size list with a 250-frame window and calculate the sum of the code stream bytes of the frames within the sliding window. Record the maximum value of this sum and the position of the corresponding sliding window, which is defined as the target sliding window.
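Step (3) can be sketched as follows, using a running sum so each slide costs O(1). The function name is illustrative; the toy example uses a 3-frame window in place of the 250-frame one:

```python
def find_target_window(frame_sizes, window=250):
    """Return (start index, byte sum) of the window whose total coded
    size is largest, i.e. the target sliding window in the text."""
    if len(frame_sizes) < window:
        raise ValueError("clip shorter than window")
    best_sum = cur = sum(frame_sizes[:window])
    best_start = 0
    for i in range(window, len(frame_sizes)):
        cur += frame_sizes[i] - frame_sizes[i - window]  # slide by one frame
        if cur > best_sum:
            best_sum, best_start = cur, i - window + 1
    return best_start, best_sum

# toy frame-size list: the heaviest 3-frame run is [5, 80, 90] at index 2
start, total = find_target_window([10, 50, 5, 80, 90, 1], window=3)
```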
(4) Calculate the mean values of the Y-component peak signal-to-noise ratio and the structural similarity index within the target sliding window as the in-transcoding features, and calculate the code stream layer features of the pre-transcoded video within the window.
(5) Merge the features from the pre-transcoding process with the code stream layer features, feed the merged features in turn to the machine learning models corresponding to CRF 24-32, and output the predicted code rate under each CRF. According to the service conditions, select as the formal code rate factor the one whose predicted code rate meets the expected code rate.
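The selection rule in step (5) can be sketched as: among the CRFs whose predicted rate fits within the bandwidth budget, pick the one with the largest predicted rate (i.e. the best quality still within budget). The predicted values below are hypothetical model outputs, and the function name is illustrative:

```python
def select_formal_crf(predicted_kbps_by_crf, max_kbps):
    """Pick the CRF whose predicted average code rate is the largest
    value not exceeding the bandwidth budget."""
    candidates = {crf: rate for crf, rate in predicted_kbps_by_crf.items()
                  if rate <= max_kbps}
    if not candidates:
        raise ValueError("no CRF meets the bandwidth budget")
    # highest candidate rate -> best quality within the budget
    return max(candidates, key=candidates.get)

# hypothetical per-CRF predictions from the models (kbps)
predictions = {24: 5200, 26: 4300, 28: 3500, 30: 2900, 32: 2400}
formal_crf = select_formal_crf(predictions, max_kbps=4000)  # -> 28
```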
(6) Perform constant rate factor transcoding on the video film source using the formal code rate factor, with coding parameters determined by the service conditions, e.g. coding standard H265, coding resolution 1080P, frame rate 25, 2 reference frames, B-frame coding enabled, IDR interval of 75 frames, and so on. The same encoding parameters as in the dataset labelling stage are used here.
In the method provided by the embodiment of the invention, the most complex segment of the target video is selected as a representative segment by means of the sliding window, the features of the representative segment are extracted, and the average code rate of the representative segment after CRF coding is predicted. By predicting the code rate of the most complex segment after transcoding, the average code rate, and hence the file volume, of the entire CRF-coded video can be kept under control. Compared with the prior art, which uses only the characteristic information of the pre-transcoded video file, this method additionally introduces the peak signal-to-noise ratio and structural similarity index from the pre-transcoding process as auxiliary characteristic information, which greatly improves feature correlation and prediction accuracy. The method extracts 45 code stream layer features from the transcoded video file, offering high extraction performance and complete feature coverage. Among these features, the motion vector prediction error is not addressed by existing schemes.
The specific implementation process and derivative manner of the above embodiments are all within the protection scope of the present invention.
Corresponding to the method of fig. 1, an embodiment of the present invention further provides a video transcoding apparatus for implementing the method of fig. 1. The video transcoding apparatus provided in the embodiment of the present invention may be applied to a computer terminal or various mobile devices; its structural schematic diagram is shown in fig. 5, and it specifically includes:
 the first obtaining unit 501 is configured to obtain a first video clip, and determine a maximum code rate supported by the first video clip.
The first transcoding unit 502 is configured to perform CRF pre-transcoding on the first video segment according to a preset initial transcoding parameter, so as to obtain a second video segment, where the initial transcoding parameter at least includes an initial transcoding resolution and an initial code rate factor.
The second obtaining unit 503 is configured to obtain a peak signal-to-noise ratio, a structural similarity index, and a number of bytes of the code stream of each video frame in the second video segment, where the peak signal-to-noise ratio is used to represent an image quality of the video frame after CRF transcoding, and the structural similarity index is used to represent a similarity of the video frame before and after CRF.
A determining unit 504, configured to determine, based on the number of bytes of the code stream of each video frame in the second video segment, a target video sub-segment in the second video segment, where the sum of the number of bytes of the code stream of each video frame in the target video sub-segment is greater than the sum of the number of bytes of the code stream of other video sub-segments with the same frame number in the second video segment.
The third obtaining unit 505 is configured to obtain transcoding feature information corresponding to the target video sub-segment based on the peak signal-to-noise ratio and the structural similarity index of each video frame in the target video sub-segment.
The fourth obtaining unit 506 is configured to extract stream layer feature information corresponding to the target video sub-segment, where the stream layer feature information at least includes distribution features of quantization parameters, motion vector features, frame size features, and macroblock type features.
The prediction unit 507 is configured to apply a pre-trained prediction model to process the transcoding feature information and the code stream layer feature information corresponding to the target video sub-segment, so as to obtain a first average code rate after the first video segment predicted by the prediction model is subjected to CRF transcoding according to a plurality of other transcoding parameters.
And the second transcoding unit 508 is configured to select a transcoding parameter corresponding to the maximum average code rate to be selected from at least one average code rate to be selected, and perform final CRF transcoding on the first video segment, where the average code rate to be selected is a first average code rate smaller than the maximum code rate.
In the device provided by the embodiment of the invention, the determining unit determines the target video sub-segment in the second video segment based on the number of bytes of the code stream of each video frame in the second video segment, and is specifically configured to:
 generating a frame size list corresponding to the second video segments, wherein the frame size list comprises the byte number of the code stream of each video frame in the second video segments arranged according to the time sequence.
And determining the window width of a preset sliding window.
Sliding a sliding window within the frame size list and calculating a sum of the number of bytes of the stream of each video frame within the window width each time the sliding window is slid.
And selecting a video frame region where a sliding window with the maximum sum of the byte numbers of the code streams is positioned in the frame size list as a target video sub-segment.
The device provided by the embodiment of the invention further comprises: training unit.
Training unit for:
 And collecting a video clip set according to the frame number of the target video sub-clip, wherein the video clip set comprises at least one third video clip which is a video clip consistent with the frame number of the target video sub-clip.
And after the CRF pre-transcoding is carried out on the third video segment according to the initial transcoding parameters, transcoding characteristic information and code stream layer characteristic information of the pre-transcoded third video segment are obtained.
And performing CRF transcoding on the third video segment according to at least one other transcoding parameter to obtain a second average code rate corresponding to the at least one other transcoding parameter.
And taking the transcoding characteristic information and the code stream layer characteristic information of the third video segment after pre-transcoding as training data, and taking the second average code rate as a training label to carry out iterative training on the prediction model until the prediction model converges, so that training on the prediction model is completed.
For the specific working process of each unit in the video transcoding apparatus disclosed in the above embodiment of the present invention, reference may be made to the corresponding content in the video transcoding method disclosed in the above embodiment of the present invention, which will not be described herein again.
The embodiment of the invention further provides a storage medium comprising stored instructions, wherein, when the instructions run, the device on which the storage medium is located is controlled to execute the above video transcoding method.
The embodiment of the present invention further provides an electronic device, whose structural schematic diagram is shown in fig. 6, specifically including a memory 601 and one or more instructions 602, where the one or more instructions 602 are stored in the memory 601 and configured to be executed by one or more processors 603 to perform the following operations:
 and obtaining the first video clip and determining the maximum code rate supported by the first video clip.
And performing CRF pre-transcoding on the first video segment according to preset initial transcoding parameters to obtain a second video segment, wherein the initial transcoding parameters at least comprise initial transcoding resolution and initial code rate factors.
And obtaining peak signal-to-noise ratio, structural similarity index and code stream byte number of each video frame in the second video segment, wherein the peak signal-to-noise ratio is used for representing image quality of the video frames after CRF transcoding, and the structural similarity index is used for representing similarity of the video frames before and after CRF.
And determining a target video sub-segment in the second video segment based on the number of code stream bytes of each video frame in the second video segment, wherein the sum of the code stream bytes of each video frame in the target video sub-segment is larger than the sum of the code stream bytes of other video sub-segments with the same frame number in the second video segment.
And obtaining transcoding characteristic information corresponding to the target video sub-segment based on the peak signal-to-noise ratio and the structural similarity index of each video frame in the target video sub-segment.
And extracting code stream layer characteristic information corresponding to the target video sub-segment, wherein the code stream layer characteristic information at least comprises distribution characteristics of quantization parameters, motion vector characteristics, frame size characteristics and macroblock type characteristics.
And processing transcoding characteristic information and code stream layer characteristic information corresponding to the target video sub-segment by applying a pre-trained prediction model to obtain a first average code rate after CRF transcoding of the first video segment predicted by the prediction model according to a plurality of other transcoding parameters.
And selecting a transcoding parameter corresponding to the maximum average code rate to be selected from at least one average code rate to be selected, and performing final CRF transcoding on the first video segment, wherein the average code rate to be selected is a first average code rate smaller than the maximum code rate.
In this specification, the embodiments are described in a progressive manner; identical or similar parts of the embodiments may be referred to one another, and each embodiment focuses on its differences from the other embodiments. In particular, for the system or apparatus embodiments, since they are substantially similar to the method embodiments, the description is relatively simple, and reference may be made to the relevant parts of the method embodiments. The systems and system embodiments described above are merely illustrative: the units described as separate components may or may not be physically separate, and the components displayed as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement this without creative effort.
Those of skill would further appreciate that the elements and algorithm steps of the examples described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both.
To clearly illustrate this interchangeability of hardware and software, various illustrative components and steps have been described above generally in terms of their functionality. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.