CN112954453A - Video dubbing method and apparatus, storage medium, and electronic device - Google Patents

Video dubbing method and apparatus, storage medium, and electronic device

Info

Publication number
CN112954453A
CN112954453A
Authority
CN
China
Prior art keywords
video
dubbed
sub
style
scene
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110179770.7A
Other languages
Chinese (zh)
Other versions
CN112954453B (en)
Inventor
张同新
姚佳立
张昊宇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Youzhuju Network Technology Co Ltd
Original Assignee
Beijing Youzhuju Network Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Youzhuju Network Technology Co Ltd
Priority to CN202110179770.7A
Publication of CN112954453A
Application granted
Publication of CN112954453B
Legal status: Active
Anticipated expiration

Abstract

Translated from Chinese

The present disclosure relates to a video dubbing method and apparatus, a storage medium, and an electronic device. The method includes: splitting a video to be dubbed into a plurality of sub-videos according to video scenes; inputting the to-be-dubbed copy of a sub-video to be dubbed into a style prediction model and obtaining a style label output by the style prediction model; and generating the dubbing audio of the sub-video to be dubbed based on the style label and the to-be-dubbed copy. The present disclosure can make the dubbing of a video more vivid and natural.

Description

Video dubbing method and apparatus, storage medium, and electronic device
Technical Field
The present disclosure relates to the field of video processing, and in particular, to a video dubbing method and apparatus, a storage medium, and an electronic device.
Background
Video is a common multimedia form, and obtaining information through video has become a common part of daily life amid the rapid development of information technology. Functions for automatically dubbing videos have emerged, but these automatic dubbing schemes generally cannot adjust the dubbing style according to the content of the video, so the resulting dubbing is not vivid or natural.
Disclosure of Invention
This summary is provided to introduce a selection of concepts in a simplified form that are further described below in the detailed description. This summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter.
In a first aspect, the present disclosure provides a video dubbing method, the method comprising: splitting a video to be dubbed into a plurality of sub-videos according to video scenes; inputting the to-be-dubbed copy of a sub-video to be dubbed into a style prediction model, and obtaining a style label output by the style prediction model; and generating dubbing audio of the sub-video to be dubbed based on the style label and the to-be-dubbed copy.
In a second aspect, the present disclosure provides a video dubbing apparatus, the apparatus comprising: a scene determining module, configured to split a video to be dubbed into a plurality of sub-videos according to video scenes; a style determining module, configured to input the to-be-dubbed copy of a sub-video to be dubbed into a style prediction model and obtain a style label output by the style prediction model; and a dubbing generation module, configured to generate dubbing audio of the sub-video to be dubbed based on the style label and the to-be-dubbed copy.
In a third aspect, the present disclosure provides a computer readable medium having stored thereon a computer program which, when executed by a processing apparatus, performs the steps of the method of the first aspect of the present disclosure.
In a fourth aspect, the present disclosure provides an electronic device, including a storage device having a computer program stored thereon, and a processing device configured to execute the computer program to implement the steps of the method according to the first aspect of the present disclosure.
Through the above technical solution, at least the following technical effects can be achieved:
the video to be dubbed is split into different sub-videos according to scenes, style labels are generated for the sub-videos, and dubbing is generated for the sub-videos based on the style labels, so that sub-videos of different styles receive different dubbing styles and the dubbing of the video is more natural and vivid.
Additional features and advantages of the disclosure will be set forth in the detailed description which follows.
Drawings
The above and other features, advantages and aspects of various embodiments of the present disclosure will become more apparent by referring to the following detailed description when taken in conjunction with the accompanying drawings. Throughout the drawings, the same or similar reference numbers refer to the same or similar elements. It should be understood that the drawings are schematic and that elements and features are not necessarily drawn to scale. In the drawings:
fig. 1 is a flow chart illustrating a method of dubbing video in accordance with an exemplary disclosed embodiment.
Fig. 2 is a block diagram illustrating a video dubbing apparatus according to an exemplary disclosed embodiment.
FIG. 3 is a block diagram illustrating an electronic device according to an exemplary disclosed embodiment.
Detailed Description
Embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While certain embodiments of the present disclosure are shown in the drawings, it is to be understood that the present disclosure may be embodied in various forms and should not be construed as limited to the embodiments set forth herein, but rather are provided for a more thorough and complete understanding of the present disclosure. It should be understood that the drawings and embodiments of the disclosure are for illustration purposes only and are not intended to limit the scope of the disclosure.
It should be understood that the various steps recited in the method embodiments of the present disclosure may be performed in a different order, and/or performed in parallel. Moreover, method embodiments may include additional steps and/or omit performing the illustrated steps. The scope of the present disclosure is not limited in this respect.
The term "include" and variations thereof as used herein are open-ended, i.e., "including but not limited to". The term "based on" is "based, at least in part, on". The term "one embodiment" means "at least one embodiment"; the term "another embodiment" means "at least one additional embodiment"; the term "some embodiments" means "at least some embodiments". Relevant definitions for other terms will be given in the following description.
It should be noted that the terms "first", "second", and the like in the present disclosure are only used for distinguishing different devices, modules or units, and are not used for limiting the order or interdependence relationship of the functions performed by the devices, modules or units.
It is noted that references to "a", "an", and "the" modifications in this disclosure are intended to be illustrative rather than limiting, and that those skilled in the art will recognize that "one or more" may be used unless the context clearly dictates otherwise.
The names of messages or information exchanged between devices in the embodiments of the present disclosure are for illustrative purposes only, and are not intended to limit the scope of the messages or information.
Fig. 1 is a flow chart illustrating a method of dubbing video in accordance with an exemplary disclosed embodiment, the method comprising, as shown in fig. 1:
And S11, splitting the video to be dubbed into a plurality of sub-videos according to video scenes.
A video scene refers to a scene produced by a change in a preset shooting condition of the video; for example, different storyboard shots produce different video scenes, and different shooting environments also produce different video scenes. A sub-video may be a video clip obtained by using the time points at which the video scene changes as division points. For example, the video content shot at camera angle 1 and the video content shot at camera angle 2 may be regarded as two different sub-videos; the video content shot at location 1 and the video content shot at location 2 may be regarded as two different sub-videos; and the video content in which the shooting subject is person 1 and the video content in which the shooting subject is person 2 may be regarded as two different sub-videos.
In a possible implementation, a sub-video whose length is below a preset threshold may be merged with an adjacent sub-video to obtain a new sub-video. This reduces the number of scenes for which style prediction is needed, provides more features for generating the style label, and improves the efficiency of style label generation.
For example, suppose a video is divided into 8 sub-videos whose lengths are 10 seconds (sub-video 1), 5 seconds (sub-video 2), 7 seconds (sub-video 3), 40 seconds (sub-video 4), 8 seconds (sub-video 5), 17 seconds (sub-video 6), 3 seconds (sub-video 7), and 22 seconds (sub-video 8). With a preset threshold of 20 seconds, sub-videos 1, 2, and 3 may be merged into one sub-video, sub-video 4 may be kept as an independent sub-video, sub-videos 5 and 6 may be merged into one sub-video, and sub-videos 7 and 8 may be merged into one sub-video.
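As an illustration of the merging just described, the following is a minimal sketch, assuming sub-videos are represented simply by their durations in seconds and that adjacent sub-videos are accumulated greedily until each group reaches the threshold; the greedy policy is one possible reading of the example and is not mandated by the disclosure.

```python
def merge_short_subvideos(durations, threshold=20.0):
    """Greedily merge adjacent sub-videos until each group reaches the threshold.

    `durations` is a list of sub-video lengths in seconds; the return value is a
    list of (start_index, end_index) pairs over the original sub-video list.
    """
    groups, start, total = [], 0, 0.0
    for i, d in enumerate(durations):
        total += d
        if total >= threshold:           # group is long enough, close it
            groups.append((start, i))
            start, total = i + 1, 0.0
    if start < len(durations):           # trailing short group: fold into the last one
        if groups:
            last_start, _ = groups.pop()
            groups.append((last_start, len(durations) - 1))
        else:
            groups.append((start, len(durations) - 1))
    return groups

# Reproduces the example in the text with a 20-second threshold.
print(merge_short_subvideos([10, 5, 7, 40, 8, 17, 3, 22]))
# [(0, 2), (3, 3), (4, 5), (6, 7)]
```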
In a possible implementation, the video may also be split into scenes according to a preset number of scenes. For example, if the preset number of scenes is four, corresponding respectively to a background portion, a narration portion, a highlight portion, and an ending portion, the video may be divided into four sub-videos based on the image content and/or text content of the video.
In a possible implementation, the video to be dubbed may be split into a plurality of sub-videos according to video scenes based on a scene segmentation model.
The training steps of the scene segmentation model are as follows:
inputting a sample video into a scene segmentation model to be trained, obtaining the segmentation points output by the scene segmentation model, and adjusting the parameters of the scene segmentation model based on the sample-annotated segmentation points of the sample video, the segmentation points output by the scene segmentation model, and a preset loss function, until the difference between the sample segmentation points and the segmentation points output by the model satisfies a training condition or the number of iterations satisfies a training condition.
The preset loss function is used to penalize the difference between the segmentation points output by the model and the sample-annotated segmentation points.
The sample-annotated segmentation points of the sample video may be determined based on the segmentation requirement. For example, when the video needs to be divided by storyboard shot, the sample-annotated segmentation points may be segmentation points annotated in advance according to the shots; when the video needs to be divided according to its plot, the segmentation points may be annotated according to the plot, and a plot label may be added to each segmentation point, such as a background segmentation point, a narration segmentation point, a highlight segmentation point, an ending segmentation point, and so on. To improve the accuracy of the scene segmentation model, the type of the sample video may be chosen according to the type of the video to be segmented: for example, when the video to be segmented is a movie, the sample video is also a movie; when the video to be segmented is a popular-science video, the sample video is also a popular-science video. In a possible implementation, multiple types of scene segmentation models can be trained in advance for different video categories, and when splitting scenes, the scene segmentation model corresponding to the type of the video to be segmented can be selected.
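The training procedure above can be sketched as follows, assuming a PyTorch model that predicts a per-frame scene-boundary probability; the model architecture, the data representation, and the choice of binary cross-entropy as the preset loss are illustrative assumptions, not specified by the disclosure.

```python
import torch
import torch.nn as nn

def train_scene_segmenter(model, loader, epochs=10, lr=1e-4):
    # Binary cross-entropy penalizes the difference between the predicted
    # segmentation points and the sample-annotated segmentation points.
    criterion = nn.BCEWithLogitsLoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        for frames, boundary_labels in loader:
            # frames: (batch, time, C, H, W); boundary_labels: (batch, time) in {0, 1}
            logits = model(frames)                    # (batch, time) boundary logits
            loss = criterion(logits, boundary_labels.float())
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return model
```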
In a possible implementation, when the video to be dubbed has corresponding copy content, the copy content can be divided into scenes by a semantic discrimination model, with copy content expressing different things assigned to different scenes, and the video to be dubbed can then be divided into a plurality of sub-videos according to the scenes into which the copy content is divided.
S12, inputting the to-be-dubbed copy of the sub-video to be dubbed into the style prediction model, and obtaining the style label output by the style prediction model.
The to-be-dubbed copy corresponding to a sub-video to be dubbed may be obtained in either of the following two ways:
the first method comprises the following steps: and identifying caption content from the sub-video to be dubbed, and taking the caption content as the file to be dubbed. When the video has text subtitles, the file to be dubbed can be obtained by identifying the subtitle content, and the subtitle content and the time axis of the video have a corresponding relation, so that the time axis corresponding to the subtitle content is determined while the subtitle content is obtained, and the dubbing audio is added at the corresponding time axis position after the dubbing is finished.
And the second method comprises the following steps: acquiring the file content of the audio and video to be matched; acquiring time information of a sub video to be dubbed; and determining the to-be-dubbed file corresponding to the to-be-dubbed sub-video from the file content based on the time information. When the audio and video to be dubbed has corresponding file content, the file content corresponding to the sub video to be dubbed can be directly extracted from the file content, for example, when the file content has time axis information, the time information of the sub video can be extracted, the time axis corresponding to the time information is determined, the file content corresponding to the time axis is determined to be the file to be dubbed, when the time axis information does not exist in the file content, the time information of the sub video can be extracted, the video position of the sub video in the audio and video to be dubbed is determined, the character position corresponding to the video position is determined in the file content, and the file content at the character position is taken as the file to be dubbed.
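The second way can be sketched in code as follows. This is a minimal sketch, assuming the copy content is either a list of (start, end, text) triples when timeline information is available or a plain string otherwise; the proportional character mapping used in the untimed case is an illustrative assumption rather than a requirement of the disclosure.

```python
def copy_for_subvideo(copy_content, sub_start, sub_end, video_duration):
    if isinstance(copy_content, list):
        # Copy content carries timeline information: take the entries whose
        # time ranges overlap the sub-video's time range.
        return "".join(text for start, end, text in copy_content
                       if start < sub_end and end > sub_start)
    # No timeline information: map the sub-video's position in the full video
    # to a character range in the copy content.
    n = len(copy_content)
    lo = int(n * sub_start / video_duration)
    hi = int(n * sub_end / video_duration)
    return copy_content[lo:hi]
```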
Through the style prediction model, the to-be-dubbed copy can be annotated with a style label. Style labels may include emotion-class labels, such as excited, happy, and sad; plot-class labels, such as horror, comedy, and tragedy; and plot-trend-class labels, such as highlight, ending, and narration.
In one possible embodiment, the training step of the style prediction model is as follows:
inputting sample text into a style prediction model to be trained, obtaining the style label output by the style prediction model, and adjusting the parameters of the style prediction model based on the sample-annotated style label of the sample text, the style label output by the style prediction model, and a preset loss function, so that the style label output by the style prediction model approaches the sample-annotated style label; training may stop when the difference between the two labels satisfies a preset condition or the number of training iterations reaches a preset number.
The preset loss function is used to penalize the difference between the style label output by the model and the sample-annotated style label.
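A comparable sketch of the style prediction model's training step, assuming a PyTorch text classifier that maps a tokenized to-be-dubbed copy to one of a fixed set of style labels; the encoder, the tokenization, and the choice of cross-entropy as the preset loss are illustrative assumptions.

```python
import torch
import torch.nn as nn

def train_style_predictor(model, loader, epochs=5, lr=2e-5):
    # Cross-entropy penalizes the difference between the predicted style label
    # and the sample-annotated style label.
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    for _ in range(epochs):
        for token_ids, style_labels in loader:
            logits = model(token_ids)          # (batch, num_styles)
            loss = criterion(logits, style_labels)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return model
```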
And S13, generating the dubbing audio of the sub-video to be dubbed based on the style label and the to-be-dubbed copy.
The dubbing audio can be generated by a dubbing model, program, or engine with a stylized dubbing function, and the selectable labels of the style prediction model may be set according to the style types supported by different dubbing programs. For example, when the style types of a dubbing program include happy, excited, and sad, the labels that the style prediction model can output may be mapped onto these three style types; for instance, labels that all express happiness, such as joyful, pleased, and delighted, may all be mapped to the happy style type.
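The label-to-style mapping might look like the sketch below. The `tts_engine` object and its `synthesize` method are hypothetical placeholders for whatever stylized dubbing program is actually used, and the label names in the table are illustrative.

```python
# Labels output by the style prediction model mapped onto the style types
# supported by a hypothetical stylized TTS engine.
LABEL_TO_ENGINE_STYLE = {
    "joyful": "happy", "pleased": "happy", "delighted": "happy",
    "excited": "excited",
    "sad": "sad",
}

def generate_dubbing(tts_engine, copy_text: str, style_label: str) -> bytes:
    engine_style = LABEL_TO_ENGINE_STYLE.get(style_label, "neutral")
    # Returns raw audio bytes for the sub-video's dubbing track.
    return tts_engine.synthesize(text=copy_text, style=engine_style)
```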
After the dubbing audio of a sub-video is generated, it can be inserted at the position in the video corresponding to that sub-video. Dubbing all the sub-videos in this way yields the complete dubbed video, in which the dubbing style of each scene matches the video style of that scene, making the video more vivid and natural.
In a possible implementation, the video length corresponding to each piece of subtitle content may be determined, a dubbing speed may be determined based on that video length and the text length of the to-be-dubbed copy, and the dubbing audio of the sub-video to be dubbed may be generated at the dubbing speed based on the style label and the to-be-dubbed copy.
That is, when the to-be-dubbed copy is obtained by recognizing text subtitles from the video, the length of video covered by each text subtitle can be determined, and the dubbing speed can be determined from that video length and the length of the copy, so that the dubbing finishes playing within the time the text subtitle is on screen. For example, when the covered video length is short, the dubbing speed may be increased; when it is long, the dubbing speed may be decreased, or a preset dubbing speed may be used, which may be a natural recitation speed determined experimentally.
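One way to realize the speed adjustment, as a hedged sketch: the speaking rate is scaled so that the copy fits within the video span covered by its subtitle. The characters-per-second baseline and the clamping range are illustrative assumptions.

```python
def dubbing_speed(text_len: int, video_len_s: float,
                  natural_cps: float = 4.0,
                  min_rate: float = 0.8, max_rate: float = 1.5) -> float:
    natural_duration = text_len / natural_cps          # seconds at natural recitation speed
    rate = natural_duration / max(video_len_s, 1e-6)   # rate > 1 means speak faster
    return min(max(rate, min_rate), max_rate)          # keep the rate within a natural range
```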
In a possible implementation, the copy content may be split into sentences to obtain a plurality of copy clauses, the timeline information of each copy clause within the sub-video may be determined based on the time information of the sub-video to be dubbed, and the text subtitle corresponding to each copy clause and the dubbing audio corresponding to that clause may be added to the sub-video to be dubbed based on the timeline information of each copy clause.
That is, the pacing of the copy content can be adjusted sentence by sentence, which makes the dubbing speed more natural and allows the text subtitles to be added to the video in a reasonable way, avoiding the reading inconvenience caused by subtitles that are too long or too short.
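A minimal sketch of the clause-level timing, assuming sentences are split on end-of-sentence punctuation and each clause receives a slice of the sub-video's time range in proportion to its character length; both choices are illustrative assumptions rather than requirements of the disclosure.

```python
import re

def clause_timeline(copy_text: str, sub_start: float, sub_end: float):
    # Split after Chinese or Latin end-of-sentence punctuation.
    clauses = [c for c in re.split(r"(?<=[。！？.!?])", copy_text) if c.strip()]
    total_chars = sum(len(c) for c in clauses)
    timeline, cursor = [], sub_start
    for clause in clauses:
        span = (sub_end - sub_start) * len(clause) / total_chars
        timeline.append((cursor, cursor + span, clause))   # (start, end, subtitle text)
        cursor += span
    return timeline
```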
In a possible implementation, the video to be dubbed can be split into a plurality of sub-videos by a scene segmentation model that also outputs a scene label for each sub-video; the to-be-dubbed copy of the sub-video to be dubbed and the scene label of that sub-video are then input into the style prediction model, and the style label output by the style prediction model is obtained.
That is, the scene segmentation model is used not only to split the video into sub-videos but also to add scene labels to them. For example, labels such as 'office building', 'forest', and 'street' can be added according to the background, or labels such as 'background', 'highlight', and 'ending' can be added according to the plot. The style prediction model then generates the style label based on both the scene label and the to-be-dubbed copy, so that the generated style label more accurately represents the style of the video.
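One plausible way to condition the style prediction model on the scene label, sketched below, is simply to prepend the scene label to the to-be-dubbed copy before tokenization; the disclosure does not fix the exact fusion, so this combined-prompt form is an assumption.

```python
def style_input(scene_label: str, copy_text: str) -> str:
    # The combined string is what gets tokenized and fed to the style prediction model.
    return f"[scene: {scene_label}] {copy_text}"

# Example usage with an illustrative scene label and copy.
print(style_input("highlight", "The hero finally returns home."))
```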
Through the above technical solution, at least the following technical effects can be achieved:
the video to be dubbed is split into different sub-videos according to scenes, style labels are generated for the sub-videos, and dubbing is generated for the sub-videos based on the style labels, so that sub-videos of different styles receive different dubbing styles and the dubbing of the video is more natural and vivid.
Fig. 2 is a block diagram illustrating a video dubbing apparatus according to an exemplary disclosed embodiment. As shown in fig. 2, the apparatus 200 comprises:
The scene determining module 210 is configured to split the video to be dubbed into a plurality of sub-videos according to video scenes.
The style determining module 220 is configured to input the to-be-dubbed copy of the sub-video to be dubbed into the style prediction model and obtain the style label output by the style prediction model.
And a dubbing generating module 230, configured to generate the dubbing audio of the sub-video to be dubbed based on the style label and the to-be-dubbed copy.
In a possible implementation, the apparatus further includes a text recognition module configured to recognize subtitle content from the sub-video to be dubbed and use the subtitle content as the to-be-dubbed copy.
In a possible implementation, the apparatus further includes a copy acquisition module configured to acquire the copy content of the video to be dubbed, acquire the time information of the sub-video to be dubbed, and determine, from the copy content, the to-be-dubbed copy corresponding to the sub-video to be dubbed based on the time information.
In a possible implementation, the scene determining module 210 is configured to split the video to be dubbed into a plurality of sub-videos through a scene segmentation model and obtain a scene label of each sub-video; the style determining module 220 is configured to input the to-be-dubbed copy of the sub-video to be dubbed and the scene label of that sub-video into the style prediction model and obtain the style label output by the style prediction model.
In a possible implementation, the apparatus further includes a length determining module configured to determine the video length corresponding to each piece of subtitle content; the dubbing generating module 230 is configured to determine a dubbing speed based on the video length corresponding to each piece of subtitle content and the text length of the to-be-dubbed copy, and to generate the dubbing audio of the sub-video to be dubbed at the dubbing speed based on the style label and the to-be-dubbed copy.
In a possible implementation, the apparatus further includes a time determining module configured to split the copy content into sentences to obtain a plurality of copy clauses, determine the timeline information of each copy clause within the sub-video based on the time information of the sub-video to be dubbed, and add, to the sub-video to be dubbed, the text subtitle corresponding to each copy clause and the dubbing audio corresponding to that clause based on the timeline information of each copy clause.
In a possible implementation, the scene determining module 210 is configured to split the video to be dubbed into a plurality of sub-videos according to video scenes based on a scene segmentation model, where the scene segmentation model is trained as follows: inputting a sample video into a scene segmentation model to be trained, obtaining the segmentation points output by the scene segmentation model, and adjusting the parameters of the scene segmentation model based on the sample-annotated segmentation points of the sample video, the segmentation points output by the scene segmentation model, and a first loss function.
In a possible implementation, the style prediction model is trained as follows: inputting sample text into a style prediction model to be trained, obtaining the style label output by the style prediction model, and adjusting the parameters of the style prediction model based on the sample-annotated style label of the sample text, the style label output by the style prediction model, and a second loss function.
The steps specifically performed by each module have been described in detail in the corresponding method embodiments and are not repeated here.
Through the above technical solution, at least the following technical effects can be achieved:
the video to be dubbed is split into different sub-videos according to scenes, style labels are generated for the sub-videos, and dubbing is generated for the sub-videos based on the style labels, so that sub-videos of different styles receive different dubbing styles and the dubbing of the video is more natural and vivid.
Referring now to FIG. 3, a block diagram of an electronic device 300 suitable for use in implementing embodiments of the present disclosure is shown. The terminal device in the embodiments of the present disclosure may include, but is not limited to, a mobile terminal such as a mobile phone, a notebook computer, a digital broadcast receiver, a PDA (personal digital assistant), a PAD (tablet computer), a PMP (portable multimedia player), a vehicle terminal (e.g., a car navigation terminal), and the like, and a stationary terminal such as a digital TV, a desktop computer, and the like. The electronic device shown in fig. 3 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present disclosure.
As shown in fig. 3, the electronic device 300 may include a processing means (e.g., a central processing unit, a graphics processor, etc.) 301 that may perform various appropriate actions and processes in accordance with a program stored in a Read Only Memory (ROM) 302 or a program loaded from a storage means 308 into a Random Access Memory (RAM) 303. In the RAM 303, various programs and data necessary for the operation of the electronic apparatus 300 are also stored. The processing device 301, the ROM 302, and the RAM 303 are connected to each other via a bus 304. An input/output (I/O) interface 305 is also connected to the bus 304.
Generally, the following devices may be connected to the I/O interface 305: input devices 306 including, for example, a touch screen, touch pad, keyboard, mouse, camera, microphone, accelerometer, gyroscope, etc.; an output device 307 including, for example, a Liquid Crystal Display (LCD), a speaker, a vibrator, and the like; storage devices 308 including, for example, magnetic tape, hard disk, etc.; and a communication device 309. The communication means 309 may allow the electronic device 300 to communicate wirelessly or by wire with other devices to exchange data. While fig. 3 illustrates an electronic device 300 having various means, it is to be understood that not all illustrated means are required to be implemented or provided. More or fewer devices may alternatively be implemented or provided.
In particular, according to an embodiment of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program carried on a non-transitory computer readable medium, the computer program containing program code for performing the method illustrated by the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network through the communication means 309, or installed from the storage means 308, or installed from the ROM 302. The computer program, when executed by the processing device 301, performs the above-described functions defined in the methods of the embodiments of the present disclosure.
It should be noted that the computer readable medium in the present disclosure can be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present disclosure, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In contrast, in the present disclosure, a computer readable signal medium may comprise a propagated data signal with computer readable program code embodied therein, either in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: electrical wires, optical cables, RF (radio frequency), etc., or any suitable combination of the foregoing.
In some implementations, the electronic devices may communicate using any currently known or future developed network protocol, such as HTTP (HyperText Transfer Protocol), and may be interconnected with digital data communication in any form or medium (e.g., a communication network). Examples of communication networks include a local area network ("LAN"), a wide area network ("WAN"), an internetwork (e.g., the Internet), and peer-to-peer networks (e.g., ad hoc peer-to-peer networks), as well as any currently known or future developed network.
The computer readable medium may be embodied in the electronic device; or may exist separately without being assembled into the electronic device.
Computer program code for carrying out operations for the present disclosure may be written in any combination of one or more programming languages, including but not limited to an object oriented programming language such as Java, Smalltalk, C++, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The modules described in the embodiments of the present disclosure may be implemented by software or hardware. Wherein the name of a module in some cases does not constitute a limitation on the module itself.
The functions described herein above may be performed, at least in part, by one or more hardware logic components. For example, without limitation, exemplary types of hardware logic components that may be used include: field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), systems on a chip (SOCs), Complex Programmable Logic Devices (CPLDs), and the like.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
Example 1 provides, in accordance with one or more embodiments of the present disclosure, a video dubbing method comprising: splitting a video to be dubbed into a plurality of sub-videos according to video scenes; inputting the to-be-dubbed copy of a sub-video to be dubbed into a style prediction model, and obtaining a style label output by the style prediction model; and generating dubbing audio of the sub-video to be dubbed based on the style label and the to-be-dubbed copy.
Example 2 provides the method of example 1, further comprising, in accordance with one or more embodiments of the present disclosure: identifying subtitle content from the sub-video to be dubbed, and using the subtitle content as the to-be-dubbed copy.
Example 3 provides the method of example 1, further comprising, in accordance with one or more embodiments of the present disclosure: acquiring the copy content of the video to be dubbed; acquiring the time information of the sub-video to be dubbed; and determining, from the copy content, the to-be-dubbed copy corresponding to the sub-video to be dubbed based on the time information.
Example 4 provides the method of example 1, wherein splitting the video to be dubbed into a plurality of sub-videos according to video scenes includes: splitting the video to be dubbed into a plurality of sub-videos through a scene segmentation model, and acquiring a scene label of each sub-video; and inputting the to-be-dubbed copy of the sub-video to be dubbed into the style prediction model and acquiring the style label output by the style prediction model includes: inputting the to-be-dubbed copy of the sub-video to be dubbed and the scene label of that sub-video into the style prediction model, and acquiring the style label output by the style prediction model.
Example 5 provides the method of example 2, further comprising, in accordance with one or more embodiments of the present disclosure: determining the video length corresponding to each piece of subtitle content; wherein generating the dubbing audio of the sub-video to be dubbed based on the style label and the to-be-dubbed copy includes: determining a dubbing speed based on the video length corresponding to each piece of subtitle content and the text length of the to-be-dubbed copy; and generating the dubbing audio of the sub-video to be dubbed at the dubbing speed based on the style label and the to-be-dubbed copy.
Example 6 provides the method of example 3, further comprising, in accordance with one or more embodiments of the present disclosure: splitting the copy content into sentences to obtain a plurality of copy clauses; determining the timeline information of each copy clause within the sub-video based on the time information of the sub-video to be dubbed; and adding, to the sub-video to be dubbed, the text subtitle corresponding to each copy clause and the dubbing audio corresponding to that clause based on the timeline information of each copy clause.
Example 7 provides the method of example 1, wherein splitting the video to be dubbed into a plurality of sub-videos according to video scenes includes: splitting the video to be dubbed into a plurality of sub-videos according to video scenes based on a scene segmentation model; where the scene segmentation model is trained as follows: inputting a sample video into a scene segmentation model to be trained, acquiring the segmentation points output by the scene segmentation model, and adjusting the parameters of the scene segmentation model based on the sample-annotated segmentation points of the sample video, the segmentation points output by the scene segmentation model, and a first loss function.
Example 8 provides the method of example 1, wherein the style prediction model is trained as follows: inputting sample text into a style prediction model to be trained, acquiring the style label output by the style prediction model, and adjusting the parameters of the style prediction model based on the sample-annotated style label of the sample text, the style label output by the style prediction model, and a second loss function.
Example 9 provides, in accordance with one or more embodiments of the present disclosure, a video dubbing apparatus comprising: a scene determining module, configured to split a video to be dubbed into a plurality of sub-videos according to video scenes; a style determining module, configured to input the to-be-dubbed copy of a sub-video to be dubbed into a style prediction model and acquire a style label output by the style prediction model; and a dubbing generation module, configured to generate dubbing audio of the sub-video to be dubbed based on the style label and the to-be-dubbed copy.
Example 10 provides the apparatus of example 9, further including, in accordance with one or more embodiments of the present disclosure, a text recognition module configured to identify subtitle content from the sub-video to be dubbed and use the subtitle content as the to-be-dubbed copy.
Example 11 provides the apparatus of example 9, further including, in accordance with one or more embodiments of the present disclosure, a copy acquisition module configured to acquire the copy content of the video to be dubbed, acquire the time information of the sub-video to be dubbed, and determine, from the copy content, the to-be-dubbed copy corresponding to the sub-video to be dubbed based on the time information.
Example 12 provides the apparatus of example 9, in accordance with one or more embodiments of the present disclosure, where the scene determining module is configured to split the video to be dubbed into a plurality of sub-videos through a scene segmentation model and acquire a scene label of each sub-video; and the style determining module is configured to input the to-be-dubbed copy of the sub-video to be dubbed and the scene label of that sub-video into the style prediction model and acquire the style label output by the style prediction model.
Example 13 provides the apparatus of example 10, further including, in accordance with one or more embodiments of the present disclosure, a length determining module configured to determine the video length corresponding to each piece of subtitle content; the dubbing generation module is configured to determine a dubbing speed based on the video length corresponding to each piece of subtitle content and the text length of the to-be-dubbed copy, and to generate the dubbing audio of the sub-video to be dubbed at the dubbing speed based on the style label and the to-be-dubbed copy.
Example 14 provides the apparatus of example 11, further including, in accordance with one or more embodiments of the present disclosure, a time determining module configured to split the copy content into sentences to obtain a plurality of copy clauses, determine the timeline information of each copy clause within the sub-video based on the time information of the sub-video to be dubbed, and add, to the sub-video to be dubbed, the text subtitle corresponding to each copy clause and the dubbing audio corresponding to that clause based on the timeline information of each copy clause.
Example 15 provides the apparatus of example 9, in accordance with one or more embodiments of the present disclosure, where the scene determining module is configured to split the video to be dubbed into a plurality of sub-videos according to video scenes based on a scene segmentation model; the scene segmentation model is trained as follows: inputting a sample video into a scene segmentation model to be trained, acquiring the segmentation points output by the scene segmentation model, and adjusting the parameters of the scene segmentation model based on the sample-annotated segmentation points of the sample video, the segmentation points output by the scene segmentation model, and a first loss function.
Example 16 provides the apparatus of example 9, where the style prediction model is trained as follows: inputting sample text into a style prediction model to be trained, acquiring the style label output by the style prediction model, and adjusting the parameters of the style prediction model based on the sample-annotated style label of the sample text, the style label output by the style prediction model, and a second loss function.
The foregoing description is only a description of the preferred embodiments of the present disclosure and of the technical principles employed. It will be appreciated by those skilled in the art that the scope of the disclosure is not limited to technical solutions formed by the particular combination of features described above, but also covers other technical solutions formed by any combination of the above features or their equivalents without departing from the concept of the disclosure, for example, technical solutions formed by interchanging the above features with features having similar functions disclosed in (but not limited to) the present disclosure.
Further, while operations are depicted in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order. Under certain circumstances, multitasking and parallel processing may be advantageous. Likewise, while several specific implementation details are included in the above discussion, these should not be construed as limitations on the scope of the disclosure. Certain features that are described in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable subcombination.
Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are disclosed as example forms of implementing the claims. With regard to the apparatus in the above-described embodiment, the specific manner in which each module performs the operation has been described in detail in the embodiment related to the method, and will not be elaborated here.

Claims (11)

Translated from Chinese

1. A video dubbing method, characterized in that the method comprises: splitting a video to be dubbed into a plurality of sub-videos according to video scenes; inputting the to-be-dubbed copy of a sub-video to be dubbed into a style prediction model, and obtaining a style label output by the style prediction model; and generating dubbing audio of the sub-video to be dubbed based on the style label and the to-be-dubbed copy.

2. The method according to claim 1, characterized in that the method further comprises: identifying subtitle content from the sub-video to be dubbed, and using the subtitle content as the to-be-dubbed copy.

3. The method according to claim 1, characterized in that the method further comprises: obtaining the copy content of the video to be dubbed; obtaining the time information of the sub-video to be dubbed; and determining, from the copy content, the to-be-dubbed copy corresponding to the sub-video to be dubbed based on the time information.

4. The method according to claim 1, characterized in that splitting the video to be dubbed into a plurality of sub-videos according to video scenes comprises: splitting the video to be dubbed into a plurality of sub-videos through a scene segmentation model, and obtaining a scene label of each sub-video; and inputting the to-be-dubbed copy of the sub-video to be dubbed into a style prediction model and obtaining a style label output by the style prediction model comprises: inputting the to-be-dubbed copy of the sub-video to be dubbed and the scene label of that sub-video into the style prediction model, and obtaining the style label output by the style prediction model.

5. The method according to claim 2, characterized in that the method further comprises: determining the video length corresponding to each piece of subtitle content; wherein generating the dubbing audio of the sub-video to be dubbed based on the style label and the to-be-dubbed copy comprises: determining a dubbing speed based on the video length corresponding to each piece of subtitle content and the text length of the to-be-dubbed copy; and generating the dubbing audio of the sub-video to be dubbed at the dubbing speed based on the style label and the to-be-dubbed copy.

6. The method according to claim 3, characterized in that the method further comprises: splitting the copy content into sentences to obtain a plurality of copy clauses; determining the timeline information of each copy clause within the sub-video based on the time information of the sub-video to be dubbed; and adding, to the sub-video to be dubbed, a text subtitle corresponding to each copy clause and the dubbing audio corresponding to that clause based on the timeline information of each copy clause.

7. The method according to claim 1, characterized in that splitting the video to be dubbed into a plurality of sub-videos according to video scenes comprises: splitting the video to be dubbed into a plurality of sub-videos according to video scenes based on a scene segmentation model; wherein the scene segmentation model is trained as follows: inputting a sample video into a scene segmentation model to be trained, obtaining the segmentation points output by the scene segmentation model, and adjusting the parameters of the scene segmentation model based on the sample-annotated segmentation points of the sample video, the segmentation points output by the scene segmentation model, and a first loss function.

8. The method according to claim 1, characterized in that the style prediction model is trained as follows: inputting sample text into a style prediction model to be trained, obtaining the style label output by the style prediction model, and adjusting the parameters of the style prediction model based on the sample-annotated style label of the sample text, the style label output by the style prediction model, and a second loss function.

9. A video dubbing apparatus, characterized in that the apparatus comprises: a scene determining module, configured to split a video to be dubbed into a plurality of sub-videos according to video scenes; a style determining module, configured to input the to-be-dubbed copy of a sub-video to be dubbed into a style prediction model and obtain a style label output by the style prediction model; and a dubbing generation module, configured to generate dubbing audio of the sub-video to be dubbed based on the style label and the to-be-dubbed copy.

10. A computer-readable medium on which a computer program is stored, characterized in that, when the program is executed by a processing apparatus, the steps of the method according to any one of claims 1-8 are implemented.

11. An electronic device, characterized by comprising: a storage device on which a computer program is stored; and a processing device configured to execute the computer program in the storage device to implement the steps of the method according to any one of claims 1-8.
CN202110179770.7A | 2021-02-07 | 2021-02-07 | Video dubbing method and device, storage medium and electronic equipment | Active | CN112954453B (en)

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN202110179770.7A (CN112954453B) | 2021-02-07 | 2021-02-07 | Video dubbing method and device, storage medium and electronic equipment

Applications Claiming Priority (1)

Application Number | Priority Date | Filing Date | Title
CN202110179770.7A (CN112954453B) | 2021-02-07 | 2021-02-07 | Video dubbing method and device, storage medium and electronic equipment

Publications (2)

Publication Number | Publication Date
CN112954453A | 2021-06-11
CN112954453B (en) | 2023-04-28

Family

ID=76244958

Family Applications (1)

Application Number | Title | Priority Date | Filing Date
CN202110179770.7A (Active; granted as CN112954453B) | Video dubbing method and device, storage medium and electronic equipment | 2021-02-07 | 2021-02-07

Country Status (1)

Country | Link
CN (1) | CN112954453B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN114205671A (en)* | 2022-01-17 | 2022-03-18 | 百度在线网络技术(北京)有限公司 | Video content editing method and device based on scene alignment
CN114938473A (en)* | 2022-05-16 | 2022-08-23 | 上海幻电信息科技有限公司 | Comment video generation method and device

Citations (6)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
JP2018120362A (en)* | 2017-01-24 | 2018-08-02 | 日本放送協会 | Scene change point model learning device, scene change point detection device, and program thereof
CN109391842A (en)* | 2018-11-16 | 2019-02-26 | 维沃移动通信有限公司 | A kind of dubbing method, mobile terminal
CN110971969A (en)* | 2019-12-09 | 2020-04-07 | 北京字节跳动网络技术有限公司 | Video dubbing method and device, electronic equipment and computer readable storage medium
CN111031386A (en)* | 2019-12-17 | 2020-04-17 | 腾讯科技(深圳)有限公司 | Video dubbing method and device based on voice synthesis, computer equipment and medium
CN112188117A (en)* | 2020-08-29 | 2021-01-05 | 上海量明科技发展有限公司 | Video synthesis method, client and system
CN112270920A (en)* | 2020-10-28 | 2021-01-26 | 北京百度网讯科技有限公司 | Voice synthesis method and device, electronic equipment and readable storage medium


Cited By (3)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN114205671A (en)* | 2022-01-17 | 2022-03-18 | 百度在线网络技术(北京)有限公司 | Video content editing method and device based on scene alignment
CN114938473A (en)* | 2022-05-16 | 2022-08-23 | 上海幻电信息科技有限公司 | Comment video generation method and device
CN114938473B (en)* | 2022-05-16 | 2023-12-12 | 上海幻电信息科技有限公司 | Comment video generation method and device

Also Published As

Publication number | Publication date
CN112954453B (en) | 2023-04-28

Similar Documents

Publication | Publication Date | Title
CN112929746B (en)Video generation method and device, storage medium and electronic equipment
WO2022068533A1 (en)Interactive information processing method and apparatus, device and medium
CN113259740A (en)Multimedia processing method, device, equipment and medium
CN114697760B (en)Processing method, processing device, electronic equipment and medium
CN113889113A (en) Clause method, device, storage medium and electronic device
CN113778419A (en)Multimedia data generation method and device, readable medium and electronic equipment
CN113886612A (en) A kind of multimedia browsing method, apparatus, equipment and medium
CN113010698A (en)Multimedia interaction method, information interaction method, device, equipment and medium
CN112397104B (en)Audio and text synchronization method and device, readable medium and electronic equipment
CN111897950A (en)Method and apparatus for generating information
CN115967833A (en)Video generation method, device and equipment meter storage medium
CN114780792A (en) A method, apparatus, device and medium for generating a video summary
CN113011169A (en)Conference summary processing method, device, equipment and medium
CN112380365A (en)Multimedia subtitle interaction method, device, equipment and medium
CN112954453B (en)Video dubbing method and device, storage medium and electronic equipment
CN113761865A (en)Sound and text realignment and information presentation method and device, electronic equipment and storage medium
CN114117127B (en) Video generation method, device, readable medium and electronic device
CN115052188B (en)Video editing method, device, equipment and medium
CN114697762B (en)Processing method, processing device, terminal equipment and medium
CN112309389A (en) Information interaction method and device
CN112530472B (en)Audio and text synchronization method and device, readable medium and electronic equipment
CN113885741A (en) A multimedia processing method, device, equipment and medium
CN112905838A (en)Information retrieval method and device, storage medium and electronic equipment
CN109947526B (en)Method and apparatus for outputting information
CN113628097A (en)Image special effect configuration method, image recognition method, image special effect configuration device and electronic equipment

Legal Events

Date | Code | Title | Description
PB01 | Publication
SE01 | Entry into force of request for substantive examination
GR01 | Patent grant
