CN110505519B

Movatterモバイル変換

Info

Publication number: CN110505519B
Application number: CN201910750378.6A
Authority: CN
Inventors: 张进; 杜欧杰; 莫东松; 赵璐; 张健; 马丹; 钟宜峰; 马晓琳
Original assignee: China Mobile Communications Group Co Ltd; MIGU Culture Technology Co Ltd
Current assignee: China Mobile Communications Group Co Ltd; MIGU Culture Technology Co Ltd
Priority date: 2019-08-14
Filing date: 2019-08-14
Publication date: 2021-12-03
Anticipated expiration: 2039-08-14
Also published as: CN110505519A

Abstract

Translated fromChinese

本发明实施例提供一种视频剪辑方法、电子设备及存储介质，方法包括：从待剪辑视频中提取特征帧图像，其中所述特征帧图像为包括有预设类别画面的图像；将所述特征帧图像输入至视频剪辑模型中，得到所述视频剪辑模型输出的检测结果，所述检测结果表示所述特征帧图像的精彩程度；根据所述检测结果，对所述待剪辑视频进行剪辑；其中，所述视频剪辑模型为根据预先获取的视频样本训练得到，所述视频样本中的图像帧标记有表示视频精彩程度的权重。本发明实施例提高了视频剪辑的准确性，进而保证了所剪辑视频的较高的精彩程度。

Embodiments of the present invention provide a video editing method, an electronic device, and a storage medium. The method includes: extracting a feature frame image from a video to be edited, wherein the feature frame image is an image including a preset category picture; The frame image is input into the video editing model, and the detection result output by the video editing model is obtained, and the detection result indicates the wonderful degree of the feature frame image; according to the detection result, the video to be edited is edited; wherein , the video editing model is obtained by training based on pre-acquired video samples, and the image frames in the video samples are marked with weights representing the wonderful degree of the video. The embodiment of the present invention improves the accuracy of video clipping, thereby ensuring a high degree of brilliance of the clipped video.

Description

Video editing method, electronic equipment and storage medium

Technical Field

The present invention relates to the field of video technologies, and in particular, to a video editing method, an electronic device, and a storage medium.

Background

Along with the development of user demands and media technologies, the number of videos also increases exponentially, especially, the timeliness of sports live broadcast games, user interactivity and other characteristics meet the user experience when watching videos, and the number of experience live broadcast videos is large. However, a live event is often long in duration, typically tens of minutes to hours, but the audience is only interested in a small portion of the event, which requires cutting a more wonderful video of the live event so that the user can view the interested portion through the wonderful video.

However, in the prior art, the more wonderful part of the video is mostly edited by human. During the editing process, it is usually determined manually which part of the video is more wonderful, and then the part which is considered to be more wonderful by the person is edited. However, since each person has a different understanding of the degree of wonderness, when the same video is edited, the degree of wonderness of the edited part of each person is different, and the accuracy of the degree of wonderness of the edited wonderness video is low.

Disclosure of Invention

The embodiment of the invention provides a video clipping method, electronic equipment and a storage medium, which aim to solve the problem that the accuracy of the highlight degree of a clipped highlight video is low when the highlight video is clipped in the prior art.

The embodiment of the invention provides a video clipping method, which comprises the following steps:

extracting a characteristic frame image from a video to be clipped, wherein the characteristic frame image is an image comprising a preset category picture;

inputting the characteristic frame image into a video clip model to obtain a detection result output by the video clip model, wherein the detection result represents the wonderful degree of the characteristic frame image;

according to the detection result, the video to be clipped is clipped;

the video clip model is obtained by training according to a video sample acquired in advance, and image frames in the video sample are marked with weights representing the video wonderful degree.

The embodiment of the invention provides a ball live video editing method, which comprises the following steps:

extracting a characteristic frame image from a ball video to be clipped, wherein the characteristic frame image is an image comprising a preset ball scene and a preset ball action;

inputting the characteristic frame image into a ball video clip model to obtain a detection result output by the ball video clip model, wherein the detection result represents the wonderful degree of the characteristic frame image;

according to the detection result, the ball video is edited;

the ball video clip model is obtained by training according to a pre-acquired ball scene video sample and a ball action video sample, and image frames in the ball scene video sample and the ball action video sample are respectively marked with weights for representing the video wonderful degree.

The embodiment of the invention provides electronic equipment which comprises a memory, a processor and a program stored on the memory and capable of running on the processor, wherein the processor executes the program to realize the steps of the video clipping method or the ball live video clipping method.

Embodiments of the present invention provide a non-transitory computer readable storage medium having stored thereon a computer program that, when executed by a processor, implements the steps of the video clipping method or the ball-live video clipping method described above.

According to the video clipping method and device provided by the embodiment of the invention, the characteristic frame image is extracted from the video to be clipped, then the characteristic frame image is input into the video clipping model to obtain the detection result output by the video clipping model, then the video to be clipped is clipped according to the detection result, at the moment, the characteristic frame image is obtained by training according to the video sample obtained in advance based on the video clipping model, and the image frame in the video sample is marked with the weight representing the video wonderful degree, so that the accuracy of the video clipping model is improved, the accuracy of the detection result output by the video clipping model is ensured, and the high-color degree of the video frame obtained by clipping is further ensured.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and those skilled in the art can also obtain other drawings according to the drawings without creative efforts.

FIG. 1 is a flowchart illustrating the steps of a video editing method according to an embodiment of the present invention;

FIG. 2 is a schematic flow chart of training a video clip model according to an embodiment of the present invention;

fig. 3 is a schematic structural diagram of an electronic device in an embodiment of the invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

As shown in fig. 1, which is a flowchart illustrating steps of a video clipping method according to an embodiment of the present invention, the video clipping method includes:

step 101: and extracting characteristic frame images from the video to be edited.

In this step, specifically, when the video needs to be clipped, the video to be clipped may be obtained first, and then the feature frame image may be extracted from the video to be clipped.

Specifically, the feature frame image is an image including a preset category picture.

In addition, specifically, when the feature frame image is extracted from the video to be edited, image recognition may be performed on each video frame in the video to be edited, and the feature frame image in which the preset category picture is recognized may be extracted.

In addition, it should be noted that the preset category screen may be defined differently according to different properties of the video to be edited. For example, when the video to be clipped is a table tennis video, the preset category picture may include at least one of the following categories: player standing, player sports footage, player hitting action, player hitting distance, player hitting time, director viewing angle, and captions. Of course, the nature of the video to be edited is not specifically limited herein.

Therefore, the image comprising the preset category picture is taken as the characteristic frame image and extracted from the video to be edited, so that the situations that the identification workload is large and the invalid identification amount is large when the highlight identification is directly carried out on the video frame to be edited in the video to be edited are avoided.

Step 102: and inputting the characteristic frame image into the video clip model to obtain a detection result output by the video clip model.

In this step, specifically, the embodiment trains in advance to obtain the video clip model. Then, when the characteristic frame image is obtained, the characteristic frame image can be directly input into the video clip model, so that the detection result output by the video clip model is obtained.

Specifically, the detection result indicates the degree of wonderness of the feature frame image.

The highlight may be represented by a score, and the expression form of the highlight is not particularly limited herein.

In addition, specifically, the video clip model is obtained by training according to a video sample acquired in advance, and image frames in the video sample are marked with weights representing the video highlights.

Therefore, the video clip model is trained based on the video samples marked with the weights of the video wonderful degree, and the accuracy of the trained video clip model is ensured.

Step 103: and according to the detection result, clipping the video to be clipped.

In this step, specifically, when the detection result output by the video clipping model is obtained, the video to be clipped may be clipped according to the detection result.

When the video to be clipped is clipped according to the detection result, whether the wonderful degree of the characteristic frame image is greater than the preset wonderful degree or not can be judged according to the detection result, and when the wonderful degree of the characteristic frame image is determined to be greater than the preset wonderful degree according to the detection result, the video frame corresponding to the characteristic frame image is determined as the video frame to be clipped, and then the video frame to be clipped is clipped.

Therefore, when the wonderness degree of the characteristic frame image is larger than the preset wonderness degree, the video frame corresponding to the characteristic frame image is determined as the video frame to be clipped, and then the video frame to be clipped is clipped from the video to be clipped, so that the wonderness degree of the clipped video frame is ensured to be higher.

In addition, specifically, the clipped video frames can be synthesized to obtain the clipped video, so that the video with high wonderful degree can be obtained.

In this way, in the embodiment, the feature frame image is extracted from the video to be edited, the feature frame image is input into the video editing model to obtain the detection result output by the video editing model, the video to be edited is edited according to the detection result, the feature frame image is obtained by training according to the video sample obtained in advance based on the video editing model, and the image frame in the video sample is marked with the weight representing the video wonderful degree, so that the accuracy of the video editing model is ensured, the accuracy of the detection result output by the video editing model is ensured, and the high-accuracy degree of the video frame obtained by editing is further ensured.

Further, on the basis of the above embodiment, in this embodiment, before the feature frame image is input into the video clip model to obtain the detection result output by the video clip model, the preset neural network model needs to be trained to obtain the video clip model.

Specifically, the video clip model may also be obtained by training an audio sample corresponding to the video sample, where an audio frame in the audio sample is labeled with a label indicating the video highlight degree, and the label is set according to a preset sound type set in advance.

Therefore, the label representing the video wonderful degree is set according to the preset sound type which is preset, the audience truly reacts to the video wonderful degree based on the sound in the video, the accuracy of the set label can be guaranteed when the label is set according to the preset sound type, and the efficiency of the label setting is guaranteed.

For example, assuming that the video to be clipped is a table tennis video, the preset sound type includes at least one of the following types: ping-pong ball impact sounds, player and coach sounds, spectator sounds, referee sounds, and silence. At this time, a label representing the degree of wonderness may be set for each type of the preset sound types, and then, in the audio sample, the sound type is identified for each audio frame in the audio sample, and a label corresponding to the identified preset sound type is marked.

In addition, specifically, the video clip model is obtained by training the audio sample based on the fact that the audio sample corresponds to the video sample, and the video clip model is obtained by further training the audio sample based on the label of the audio sample, so that the accuracy of the video clip model is ensured when the video clip model is obtained by training the video sample and the audio sample corresponding to the video sample, and the accuracy of the feature frame image detection result output by the video clip model is ensured.

Specifically, as shown in fig. 2, when the preset neural network model is trained to obtain the video clip model, the method may specifically include the following steps:

step 201: and acquiring a characteristic frame sample image and an audio sample training frame corresponding to the characteristic frame sample image according to the video sample and the audio sample corresponding to the video sample.

In this step, specifically, when the video clip model is obtained by training, a training sample, that is, a feature frame sample image in the video sample and an audio sample training frame corresponding to the feature frame sample image, needs to be obtained first.

When a feature frame sample image and an audio sample training frame corresponding to the feature frame sample image are obtained according to a video sample and an audio sample corresponding to the video sample, framing processing can be respectively performed on the video sample and the audio sample according to a preset framing mode to obtain a plurality of video frame images and a plurality of audio frames corresponding to the plurality of video frame images; then, based on a preset category picture, identifying the plurality of video frame images to obtain the characteristic frame sample image; and then acquiring an audio sample training frame corresponding to the characteristic frame sample image from the plurality of audio frames, and marking a label corresponding to the preset sound type on the audio sample training frame based on the preset sound type.

The above-described process is explained below.

Specifically, the present embodiment extracts an audio sample from a video sample. Then, when the video sample and the audio sample are respectively subjected to framing processing according to a preset framing manner, frame cutting may be performed according to time, for example, an audio frame is generated every 125ms in ms units, and a video frame image is correspondingly generated in the video sample. Of course, the specific form of the preset framing manner is not limited specifically, and the framing may be performed for 130ms, for example.

In addition, specifically, in this embodiment, after the audio frame is obtained, the audio frame may be sliced, so as to improve the accuracy of sound recognition on the audio frame. For example, the slices may be performed in a unit of 5ms in a 50% overlapping manner, so as to obtain multiple pieces of audio in one audio frame; at this time, when the voice recognition module performs voice recognition on one audio frame, multiple audio segments can be recognized, so as to improve the accuracy of the voice recognition.

In addition, specifically, after obtaining a plurality of video frame images by framing, based on a preset category picture, performing preset category picture recognition on each of the video frame images obtained by framing to obtain a feature frame sample image, and then obtaining an audio sample training frame corresponding to the feature frame sample image from the plurality of audio frames. Of course, the preset category screen can be set according to the nature of the video sample. For example, when the video frame to be clipped is a table tennis direct playing video, the preset category picture may include at least one of the following scenes: player standing, player sports footage, player hitting action, player hitting distance, player hitting time, director viewing angle, and captions.

Meanwhile, the present embodiment is also preset with preset sound types, for example, the preset sound types include ping-pong ball impact sound, player and coach sound, audience sound, referee sound, mute sound, and the like. At this time, when the audio sample training frame is labeled with the label based on the preset sound type, the sound recognition module may perform the sound recognition on the audio sample training frame, then determine the preset sound type to which the audio sample training frame belongs, and finally label the label corresponding to the preset sound type as the label of the audio sample training frame. Of course, the label indicates the degree of highlights of the sample image of the corresponding video frame. The embodiment sets the label representing the wonderful degree based on the preset sound type, and because the sound is the real reaction of the audience to the wonderful degree of the video, the accuracy of the set label is ensured, and the accurate cognition of the wonderful degree of the video frame image is further ensured.

In addition, specifically, after the audio sample is obtained by performing audio extraction on the video sample, the audio sample may also be subjected to analog-to-digital conversion to generate a Pulse Code Modulation (PCM) binary file. Specifically, when generating the PCM binary file, the sound continuous waveform may be sampled and quantized, that is, converted into discrete data points at a certain sampling rate and sampling number, and the MP3 format audio file is converted into a 16-bit mono PCM file at a 16KHz sampling frequency by using the computer program ffmpeg, so as to improve the accuracy of audio sample identification.

Step 202: substituting initial weights corresponding to different scene highlights in preset type pictures into a preset neural network model, and inputting the characteristic frame sample image into the preset neural network model to obtain an image result output by the preset neural network model.

Specifically, the present embodiment further presets a preset neural network model and initial weights corresponding to different scene highlights in a preset category picture. In this step, the initial weights corresponding to the highlights of different scenes in the preset category picture may be substituted into the preset neural network model, and the feature frame sample image is input into the preset neural network model to obtain an image result output by the preset neural network model, where of course the image result represents the highlights of the feature frame sample image.

Specifically, the image result can be represented by the following formula:

wherein n represents the general category of the preset category screen, F (X)ⁱ) And expressing the wonderful degree corresponding to each type of preset category picture. In this embodiment, n may have a value of 7, i.e. include 7 types of preset category pictures.

The initial weights corresponding to the highlights of different scenes in the preset category picture are set, so that the highlights of different scenes in the preset category picture can be distinguished, and the accuracy is higher when the preset neural network model with the initial weights is substituted for feature frame sample image recognition.

In addition, specifically, the specific parameters of the preset neural network model may be set as follows: the preset neural network model is provided with 13 convolutional layers in total, the convolutional kernel of the input layer is 7 by 7, and the output channel is 128; the convolution kernel of the second layer is 7 by 7, and the output channel is 128; the convolution kernels of the third layer to the eleventh layer are 5 by 5, and the output channel is 512; the twelfth layer and the output layer are a full connection layer plus a softmax layer.

Step 203: and training the initial weight substituted into a preset neural network model according to the image result and the label corresponding to the audio sample training frame to obtain a video clip model.

In this step, specifically, after the image result output by the preset neural network model is obtained, the initial weight substituted into the preset neural network model may be trained according to the image result and the label corresponding to the audio sample training frame, so as to obtain the video clip model.

Specifically, since the audio sample training frame corresponds to the feature frame sample image, the label corresponding to the audio sample training frame may be used to reflect the highlight degree of the feature frame sample image. At this time, the label corresponding to the audio sample training frame can be regarded as a real label of the feature frame sample image, so that when the preset neural network model is trained, the preset neural network model can be trained based on the feature frame sample image and the label of the audio sample training frame, that is, the initial weight substituted into the preset neural network model is trained, and then the video clip model is obtained.

The method comprises the steps of training initial weights substituted into a preset neural network model according to labels corresponding to image results and audio sample training frames to obtain a video clip model, detecting the accuracy of the image results based on the labels corresponding to the audio sample training frames, adjusting the initial weights when the accuracy of the detected image results is lower than a preset threshold, verifying the accuracy of the preset neural network model after the initial weights are adjusted based on the video samples and the audio samples, and determining the preset neural network model substituted with the adjusted weights as the video clip model when the accuracy of the image results obtained through verification is higher than the preset threshold.

Specifically, the training frame based on the audio sample corresponds to the sample image of the feature frame, so that the label corresponding to the training frame based on the audio sample can be regarded as a real label of the sample image of the feature frame, that is, the label of the training frame based on the audio sample can be used to detect the accuracy of the image result. At this time, when it is detected that the accuracy of the image result is lower than the preset threshold, that is, when the image result obviously does not conform to the label of the audio sample training frame, it indicates that the initial weight setting of each scene substituted into the preset neural network model is unreasonable, and at this time, the initial weight needs to be adjusted to ensure the precision of the preset neural network model substituted with the adjusted weight. Finally, the preset neural network model substituted with the adjusted weights may be determined as a video clip model.

Of course, it should be noted here that, when the initial weight is adjusted, an adjustment ratio of the initial weight may be obtained; and then forbidding to adjust the initial weight when the adjustment proportion of the initial weight is detected to be larger than a preset proportion threshold value, and adjusting the initial weight when the adjustment proportion of the initial weight is detected to be smaller than or equal to the preset proportion threshold value.

It should be noted that, the specific value of the adjustment ratio is not specifically limited, for example, the adjustment ratio may be 40%, that is, when the adjustment ratio of the initial weight is greater than 40%, the initial weight is prohibited from being adjusted, that is, the training sample is not used.

Therefore, when the initial weight is adjusted, whether the initial weight is adjusted or not is determined according to the adjustment proportion, and the interference of invalid training samples on the training of the preset network neural model is avoided.

According to the embodiment, the initial weights corresponding to the highlights of different scenes in the preset type picture are preset, the initial weights are substituted into the preset neural network model, then the initial weights are trained through the audio samples corresponding to the video samples, the actual reaction of audiences to the highlights of the video is based on the sound in the audio samples, the accuracy of the trained weights is guaranteed, and the accuracy of the trained video editing model is further guaranteed.

In addition, specifically, taking a table tennis live-action video as an example, the initial weight setting conditions of different scenes and different scenes in the preset type of pictures are described herein.

Wherein the preset category screen includes at least one of the following categories: player standing, player sports footage, player hitting action, player hitting distance, player hitting time, director viewing angle, and captions.

Specifically, the player station comprises at least one of the following scenes: the system comprises a near station position, a middle far station position and a far station position, wherein the initial weights of the far station position, the near station position, the middle far station position and the middle near station position are sequentially reduced. Assuming that the initial weight is a fractional value, the station position close to the table can be 40-50 cm away from the table for the player, and the initial weight value is set to be 30 points because the close table is usually in a ball serving state of the player and is more wonderful; the station position of the middle and near table can be 50-70 cm away from the table for the player, and at the moment, as the middle and near table is usually used for the player to catch or pick, the wonderful degree is general, and the initial weight value is set to be 15 points; the station position of the middle-distance table can be 70-100 cm away from the table for the player, and because the middle-distance table is usually the holding stage of the player, the wonderful degree is general, the initial weight value is set to be 20 points; the far station position can be 70-100 cm away from the table for the player, and since the far station is usually the player's killing or saving stage, the wonderful degree is the highest, and the initial weight value is set to 35.

The player motion footwork comprises at least one of the following scenes: the method comprises a single step method, a parallel step method, a jump step method, a stepping step method, a cross step method, a side step method and a small step crushing method, wherein the initial weights of the side step method, the jump step method, the cross step method, the single step method, the stepping step method, the small step crushing method and the parallel step method are sequentially reduced. For example, assuming that the initial weight is a score value, the step method is adopted when returning to the net or pursuing a ball, the wonderful degree is high, and the initial weight value is set to be 12 scores; the step-by-step method is generally adopted for left and right movement during ball catching, the wonderful degree is general, and an initial weight value is set to be 6 points; the stepping method is more when a user rightly holds a large-angle ball, the wonderful degree is higher, and the initial weight value is set to be 15 points; the leapfrog method is used when the ball comes fast and the angle is large, the wonderful degree is high, and the initial weight value is set to be 18 points; the step-by-step method is adopted when a user catches a ball or simply moves, the wonderful degree is general, and an initial weight value is set to be 8 points; the cross step method is used for dealing with incoming balls far away from the body, the wonderful degree is high, and an initial weight value is set to be 13 points; the side-body step method is generally adopted when the coming ball approaches the body of a player or the coming ball reaches the reverse position of the batter, the wonderful degree is high, and the initial weight value is set to be 21 minutes; the small step-breaking method is a step method adopted when the body gravity center, the ball receiving position and the time are adjusted, the wonderful degree is general, and the initial weight value is set to be 7 points.

The player hitting action includes at least one of the following scenarios: the swing arm swinging motion, the ball attack swinging motion, the ball touch motion, the swing with following motion and the relaxing motion, wherein the initial weights of the ball attack swinging motion, the swing arm swinging motion, the ball touch motion, the swing with following motion and the relaxing motion are sequentially reduced. For example, assuming that the initial weight is a score value, the swing arm swing motion determines the ball hitting motion and the ball hitting power, the wonderful degree is high, and the initial weight value is set to be 25 scores; the rotary property of the returned ball, the flying arc line and the hitting line of the returned ball are determined by the swing action of the head-on ball, the wonderful degree is high, and the initial weight value is set to be 30 minutes; the ball hitting action of the racket determines the hand-off angle, the ball hitting speed and the rotating property of the return ball, the wonderful degree is high, and the initial weight value is set to be 20 minutes; the swing action with the swing is beneficial to ensuring the integrity, coordination and stability of the batting action at the batting end stage, the wonderful degree is general, and the initial weight value is set to be 10 points; the relaxation action is a short relaxation stage which appears when the swing is finished, the wonderful degree is low, and the initial weight value is set to be 5 points.

The player hitting distance comprises at least one of the following scenes: short, medium, and long shots, wherein the initial weights of the long, medium, and short shots decrease in sequence. For example, assuming that the initial weight is a fractional value, when a player hits a ball at a short distance, the player usually knows the speed and the point of fall, and the wonderful degree is high, the initial weight is set to 35 points; when hitting the ball at a middle distance, the player usually excels in pushing and blocking, the wonderful degree is low, and the initial weight value is set to be 25 points; when a player hits a ball for a long distance, the player usually has the greatest strength, spin and wonderful degree, and an initial weight value of 40 points is set.

The player hitting time includes at least one of the following scenes: the initial weight of the initial hitting, the descending hitting, the late hitting, the initial hitting, the descending, the late hitting, the high-point hitting, the initial hitting, and the descending late-point hitting are sequentially decreased. For example, assuming that the initial weight is a score value, when the early stage of the rise is usually fast push, the wonderful degree is high, and the initial weight value is set to 20 points; the later rising stage is usually a forward-impulse arc-shaped ball, the wonderful degree is high, and the initial weight value is set to be 21 points; the high point period is usually when the force is exerted, the wonderful degree is higher, and the initial weight value is set to be 23 points; when the early stage of descent is usually attack of a middle-distance station, the wonderful degree is the highest, and an initial weight value is set to be 25 points; when the late stage of the descent is usually the middle-distance table ball cutting, the wonderful degree is general, and the initial weight value is set to be 11 points.

The director view includes at least one of the following scenes: a large panorama perspective, a small panorama perspective, a player close-up perspective, a player action close-up perspective, and a spectator close-up perspective; wherein the initial weights of the player action close-up perspective, the small panoramic perspective, the player close-up perspective, the audience close-up perspective and the large panoramic perspective are sequentially decreased. For example, assuming that the initial weight is a score value, the large panoramic view angle usually displays the battle name of the match clubs of both parties and the battle list of the athletes of both parties at the moment, the wonderful degree is low, and the initial weight value is set to 8 scores; when the close-up visual angle of the athlete closes up the athlete and contains a caption bar of the sportsman, the athlete is displayed to practice the ball, the wonderful degree is low, and the initial weight value is set to be 12 points; the close view angle of the athlete indicates that the athlete is serving, the wonderful degree is higher, and the initial weight value is set to be 20 points; the small panoramic view angle indicates that the athletes of the two parties are stiff and have high wonderful degree, and the initial weight value is set to be 23 points; the athlete action close-up visual angle indicates that the score of the athlete or the score lost of the athlete is the highest in the wonderful degree, and the initial weight value is set to be 25 points; the close-up visual angle of the audience indicates that the focus of the shot is not in the match, the wonderful degree is the lowest, and the initial weight value is set to be 12 points.

The subtitle comprises at least one of the following scenes: both parties count the subtitles of the bitmap, the score subtitles, the local point subtitles, the technical statistics subtitles and the full-field technical statistics subtitles; the initial weights of the local point caption and the score caption for the bitmap caption, the technical statistical caption and the full-field technical statistical caption are reduced in sequence. For example, assuming that the initial weight is a score value, the two parties indicate that the match does not start yet for the battlefield subtitles, the wonderful degree is low, and the initial weight value is set to 15 scores; the score subtitles show that the turn is about to start, the wonderful degree is high, and an initial weight value is set to be 30 scores; the local point caption indicates that the last turn of the local is started, the wonderful degree is highest, and an initial weight value is set to be 35 points; the technical statistics of the subtitles shows that one part is finished, the wonderful degree is low, and an initial weight value is set to be 15 points; the full-field technical statistics caption indicates the end of the competition, the wonderful degree is the lowest, and the initial weight value is set to be 5 points.

In addition, the preset sound type will be described by taking a table tennis direct playing video as an example. Specifically, the preset sound type includes at least one of the following types: table tennis ball impact sound, player and coach sound, audience sound, referee sound, and silence; wherein, the sound of table tennis impact, the sound of athletes and coaches and the sound of audiences are marked as wonderful at the same time, and the sound of referees and silence are marked as wonderful.

According to the video clipping method provided by the embodiment, the characteristic frame image is extracted from the video to be clipped, the characteristic frame image is input into the video clipping model to obtain the detection result output by the video clipping model, the video to be clipped is clipped according to the detection result, the video to be clipped is obtained by training according to the video sample obtained in advance based on the video clipping model, and the image frame in the video sample is marked with the weight representing the video wonderful degree, so that the accuracy of the video clipping model is improved, the accuracy of the detection result output by the video clipping model is ensured, and the high-accuracy degree of the clipped video frame is ensured.

In addition, the embodiment also provides a method for editing a ball live video, which may include the following steps:

step A: extracting a characteristic frame image from a ball video to be clipped, wherein the characteristic frame image is an image comprising a preset ball scene and a preset ball action;

and B: inputting the characteristic frame image into a ball video clip model to obtain a detection result output by the ball video clip model, wherein the detection result represents the wonderful degree of the characteristic frame image;

and C: according to the detection result, the ball video is edited;

It should be noted that, the preset ball scene and the preset ball action may refer to the example description of the preset category picture in the video editing method, and are not described herein again; in addition, the method steps in the ball live video editing method are the same as those in the video editing method, and are not described herein again.

In addition, as shown in fig. 3, an entity structure schematic diagram of the electronic device provided in the embodiment of the present invention is shown, where the electronic device may include: a processor (processor)310, a communication Interface (communication Interface)320, a memory (memory)330 and acommunication bus 340, wherein theprocessor 310, thecommunication Interface 320 and thememory 330 communicate with each other via thecommunication bus 340. Theprocessor 310 may invoke a computer program stored on thememory 330 and executable on theprocessor 310 to perform the methods provided by the various embodiments described above, including, for example: extracting a characteristic frame image from a video to be clipped, wherein the characteristic frame image is an image comprising a preset category picture; inputting the characteristic frame image into a video clip model to obtain a detection result output by the video clip model, wherein the detection result represents the wonderful degree of the characteristic frame image; according to the detection result, the video to be clipped is clipped; the video clip model is obtained by training according to a video sample acquired in advance, and image frames in the video sample are marked with weights representing the video wonderful degree.

Further examples include: extracting a characteristic frame image from a ball video to be clipped, wherein the characteristic frame image is an image comprising a preset ball scene and a preset ball action; inputting the characteristic frame image into a ball video clip model to obtain a detection result output by the ball video clip model, wherein the detection result represents the wonderful degree of the characteristic frame image; according to the detection result, the ball video is edited; the ball video clip model is obtained by training according to a pre-acquired ball scene video sample and a ball action video sample, and image frames in the ball scene video sample and the ball action video sample are respectively marked with weights for representing the video wonderful degree.

In addition, the logic instructions in thememory 330 may be implemented in the form of software functional units and stored in a computer readable storage medium when the software functional units are sold or used as independent products. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.

Embodiments of the present invention further provide a non-transitory computer-readable storage medium, on which a computer program is stored, where the computer program is implemented to perform the method provided in the foregoing embodiments when executed by a processor, and the method includes: extracting a characteristic frame image from a video to be clipped, wherein the characteristic frame image is an image comprising a preset category picture; inputting the characteristic frame image into a video clip model to obtain a detection result output by the video clip model, wherein the detection result represents the wonderful degree of the characteristic frame image; according to the detection result, the video to be clipped is clipped; the video clip model is obtained by training according to a video sample acquired in advance, and image frames in the video sample are marked with weights representing the video wonderful degree.

The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.

Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments.

Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims

1. A video clipping method, comprising:

according to the detection result, the video to be clipped is clipped;

the video clip model is obtained by training according to a pre-acquired video sample, and image frames in the video sample are marked with weights representing the video wonderful degree;

wherein, before inputting the characteristic frame image into a video clip model and obtaining a detection result output by the video clip model, training a preset neural network model to obtain the video clip model, and the method comprises the following steps:

acquiring a characteristic frame sample image and an audio sample training frame corresponding to the characteristic frame sample image according to a video sample and an audio sample corresponding to the video sample; the characteristic frame sample image is an image comprising a preset category picture, and the audio sample training frame is provided with a label corresponding to a preset sound type;

substituting initial weights corresponding to different scene highlights in preset category pictures into the preset neural network model, and inputting the characteristic frame sample image into the preset neural network model to obtain an image result output by the preset neural network model; wherein the image result represents a degree of highlights of the feature frame sample image;

and training the initial weight substituted into the preset neural network model according to the image result and the label corresponding to the audio sample training frame to obtain the video clip model.

2. The video clipping method according to claim 1, wherein the video clipping model is further trained according to an audio sample corresponding to the video sample, wherein the audio frames in the audio sample are labeled with a label representing the video highlight degree, and the label is set according to a preset sound type set in advance.

3. The video clipping method according to claim 1, wherein the obtaining of the feature frame sample image and the audio sample training frame corresponding to the feature frame sample image from the video sample and the audio sample corresponding to the video sample comprises:

according to a preset framing mode, respectively carrying out framing processing on the video sample and the audio sample to obtain a plurality of video frame images and a plurality of audio frames corresponding to the plurality of video frame images;

based on a preset category picture, identifying the plurality of video frame images to obtain the characteristic frame sample image;

and acquiring an audio sample training frame corresponding to the characteristic frame sample image from the plurality of audio frames, and marking a label corresponding to the preset sound type on the audio sample training frame based on the preset sound type.

4. The method of claim 1, wherein the training initial weights substituted into the preset neural network model according to the image result and the labels corresponding to the training frames of the audio samples to obtain the video clip model comprises:

detecting the accuracy of the image result based on the label corresponding to the audio sample training frame;

when the accuracy of the image result is detected to be lower than a preset threshold value, adjusting the initial weight;

and based on the video sample and the audio sample, carrying out accuracy verification on the preset neural network model with the adjusted initial weight, and determining the preset neural network model with the adjusted weight as the video clip model when the accuracy of the image result obtained by verification is greater than the preset threshold value.

5. The video clipping method of claim 4, wherein said adjusting said initial weight comprises:

obtaining the adjustment proportion of the initial weight;

when the adjustment proportion of the initial weight is detected to be larger than a preset proportion threshold value, the initial weight is forbidden to be adjusted;

and when the adjustment proportion of the initial weight is detected to be smaller than or equal to the preset proportion threshold value, adjusting the initial weight.

6. The video clipping method according to claim 1, wherein when the video to be clipped is a table tennis direct playing video, the preset category pictures include at least one of: player standing, player sports footage, player hitting action, player hitting distance, player hitting time, director viewing angle, and captions.

7. A method of ball live video editing, comprising:

according to the detection result, the ball video is edited;

the ball video clip model is obtained by training according to a pre-acquired ball scene video sample and a ball action video sample, and image frames in the ball scene video sample and the ball action video sample are respectively marked with weights for representing the video wonderful degree;

before the feature frame image is input into a ball video clip model and a detection result output by the ball video clip model is obtained, training a preset neural network model to obtain the ball video clip model, including:

acquiring a characteristic frame sample image and an audio sample training frame corresponding to the characteristic frame sample image according to a video sample and an audio sample corresponding to the ball scene video sample and the ball action video sample; the characteristic frame sample image is an image comprising a preset category picture, and the audio sample training frame is provided with a label corresponding to a preset sound type;

and training the initial weight substituted into the preset neural network model according to the image result and the label corresponding to the audio sample training frame to obtain the ball video clip model.

8. An electronic device comprising a memory, a processor and a program stored on the memory and executable on the processor, wherein the steps of the video clipping method of any one of claims 1 to 6 or the steps of the ball live video clipping method of claim 7 are implemented by the processor when executing the program.

9. A non-transitory computer readable storage medium having stored thereon a computer program, which when executed by a processor performs the steps of the video clipping method of any one of claims 1 to 6 or the steps of the ball live video clipping method of claim 7.