CN114120992B

Movatterモバイル変換

Info

Publication number: CN114120992B
Application number: CN202010906384.9A
Authority: CN
Inventors: 付平非; 王宇嘉; 杨乐; 潘世光; 杨杰豪
Original assignee: Douyin Vision Co Ltd
Current assignee: Douyin Vision Co Ltd
Priority date: 2020-09-01
Filing date: 2020-09-01
Publication date: 2025-01-07
Anticipated expiration: 2040-09-01
Also published as: CN114120992A

Abstract

The disclosure provides a method, a device, electronic equipment and a computer readable medium for generating video by voice, and relates to the technical field of video production. The method for generating the video through the voice comprises the steps of obtaining a voice function starting instruction to start a voice function corresponding to the voice function starting instruction, obtaining voice information, recognizing semantics corresponding to the voice information according to the voice information, and generating the video of the first type according to the semantics. According to the technical scheme, after the voice function starting instruction is acquired, voice information is acquired, corresponding functions can be prevented from being started when videos do not need to be generated, the generated videos are automatically generated according to semantics, the generated videos meet the requirements of users, the users do not need to manually search video materials, the users do not need to manually manufacture videos, the video generation is convenient and quick, and the video generation efficiency is high.

Description

Method, device, electronic equipment and computer readable medium for generating video through voice

Technical Field

The embodiment of the disclosure relates to the technical field of video production, in particular to a method, a device, electronic equipment and a computer readable medium for generating video by voice.

Background

With the development of the internet and intelligent terminals, more and more users use the terminals to make videos and share the videos to others so as to obtain attention, click volume, vermicelli and the like. For example, in a short video viewing platform, a user may publish a short video to publish the short video for viewing by others.

Video is released, and then video needs to be produced. In the prior art, when a user makes a video, the user needs to manually make the video, such as manually shooting the video, or manually searching for video materials, then making the video, and making the video takes more time and has low efficiency.

Disclosure of Invention

This summary is provided to introduce a selection of concepts in a simplified form that are further described below in the detailed description. This summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter.

In a first aspect, there is provided a method of speech generating video, the method comprising:

acquiring a voice function starting instruction to start a voice function corresponding to the voice function starting instruction;

acquiring voice information;

recognizing semantics corresponding to the voice information according to the voice information;

And generating a first type video according to the semantics.

In a second aspect, there is also provided an apparatus for speech generating video, the apparatus comprising:

the instruction acquisition module is used for acquiring a voice function starting instruction so as to start a voice function corresponding to the voice function starting instruction;

The voice acquisition module is used for acquiring voice information;

The voice recognition module is used for recognizing the semantics corresponding to the voice information according to the voice information;

and the video generation module is used for generating a first type video according to the semantics.

In a third aspect, there is also provided an electronic device comprising:

one or more processors;

A memory;

One or more applications, wherein the one or more applications are stored in memory and configured to be executed by one or more processors, the one or more applications configured to perform the method of speech generating video as shown in the first aspect of the present disclosure.

In a fourth aspect, there is also provided a computer readable medium having stored thereon a computer program which when executed by a processor implements the method of speech generating video shown in the first aspect of the present disclosure.

Compared with the prior art, the embodiment of the disclosure provides a method, a device, electronic equipment and a computer readable medium for generating a video by voice, after a voice function starting instruction is acquired, voice information is acquired, corresponding functions can be prevented from being started when the video is not required to be generated, the video is automatically generated according to the semantics, the video is not only related to keywords or keywords in the voice information, but also related to the whole semantics corresponding to the voice information, the generated video accords with the use scene expressed by the user semantics, the generated video accords with the requirements of multiple aspects in the user semantics, the user does not need to manually search for video materials, the user does not need to manually manufacture the video, the video generation is convenient and quick, and the video generation efficiency is high.

Drawings

The above and other features, advantages, and aspects of embodiments of the present disclosure will become more apparent by reference to the following detailed description when taken in conjunction with the accompanying drawings. The same or similar reference numbers will be used throughout the drawings to refer to the same or like elements. It should be understood that the figures are schematic and that elements and components are not necessarily drawn to scale.

Fig. 1 is a flowchart of a method for generating a video by voice according to an embodiment of the disclosure;

fig. 2 is a schematic position diagram of a virtual control a in a display interface of a terminal according to an embodiment of the present disclosure;

FIG. 3 is a detailed flowchart of step S104 in FIG. 1;

FIG. 4 is a schematic diagram of an interface for displaying voice prompt information according to an embodiment of the disclosure;

fig. 5 is a schematic flow chart of a method for generating video by voice according to an embodiment of the disclosure;

Fig. 6 is a schematic structural diagram of an apparatus for generating video by voice according to an embodiment of the disclosure;

fig. 7 is a schematic structural diagram of an electronic device for generating video by voice according to an embodiment of the present disclosure.

Detailed Description

Embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While certain embodiments of the present disclosure have been shown in the accompanying drawings, it is to be understood that the present disclosure may be embodied in various forms and should not be construed as limited to the embodiments set forth herein, but are provided to provide a more thorough and complete understanding of the present disclosure. It should be understood that the drawings and embodiments of the present disclosure are for illustration purposes only and are not intended to limit the scope of the present disclosure.

It should be understood that the various steps recited in the method embodiments of the present disclosure may be performed in a different order and/or performed in parallel. Furthermore, method embodiments may include additional steps and/or omit performing the illustrated steps. The scope of the present disclosure is not limited in this respect.

The term "including" and variations thereof as used herein are intended to be open-ended, i.e., including, but not limited to. The term "based on" is based at least in part on. The term "one embodiment" means "at least one embodiment," another embodiment "means" at least one additional embodiment, "and" some embodiments "means" at least some embodiments. Related definitions of other terms will be given in the description below.

It should be noted that the terms "first," "second," and the like in this disclosure are used merely to distinguish one device, module, or unit from another device, module, or unit, and are not intended to limit the order or interdependence of the functions performed by the devices, modules, or units.

It should be noted that references to "one", "a plurality" and "a plurality" in this disclosure are intended to be illustrative rather than limiting, and those of ordinary skill in the art will appreciate that "one or more" is intended to be understood as "one or more" unless the context clearly indicates otherwise.

The names of messages or information interacted between the various devices in the embodiments of the present disclosure are for illustrative purposes only and are not intended to limit the scope of such messages or information.

The method, the device, the electronic equipment and the medium for generating the video through the voice provided by the disclosure aim to solve the technical problems in the prior art.

The following describes the technical solutions of the present disclosure and how the technical solutions of the present disclosure solve the above technical problems in detail with specific embodiments. The following embodiments may be combined with each other, and the same or similar concepts or processes may not be described in detail in some embodiments. Embodiments of the present disclosure will be described below with reference to the accompanying drawings.

As will be appreciated by those skilled in the art, the "terminal" as used in the present disclosure may be a cell phone, a tablet computer, a PDA (Personal DIGITAL ASSISTANT ), a MID (Mobile INTERNET DEVICE, mobile internet device), or the like.

Referring to fig. 1, in an embodiment of the present disclosure, a method for generating a video by voice is provided, where the method for generating a video by voice may be applied to a terminal, and the method includes:

step S101, a voice function starting instruction is obtained to start the voice function corresponding to the voice function starting instruction.

Referring to fig. 2, the voice function start command may be triggered by the user operating terminal. Specifically, the voice function activation instruction may be triggered by the user pressing a preset virtual control for a preset duration. The preset duration is not limited, for example, the preset duration may be 1s,2s, 2.5s, or the like. When the preset duration is 2s, if the user presses the virtual control for more than or equal to 2s, the terminal can acquire a voice function starting instruction. Specifically, for example, an app (Application) for capturing a short video, a virtual control a is provided at the middle-lower part of a display interface of the app, and a user presses the virtual control a with a finger for longer than a preset time period, so that the terminal can obtain a voice function opening instruction. After the terminal starts the voice function, the terminal can acquire the voice information.

Step S102, voice information is acquired.

After the terminal starts the voice function, if the user inputs voice information, namely, the user speaks, the terminal can acquire the voice information corresponding to the user speaking. If the user words as ' I want to make the travel going to Chongqing as a video of the check-in, the terminal can acquire the voice information of ' I want to make the travel going to Chongqing as a video of the check-in '.

And step 103, recognizing the semantics corresponding to the voice information according to the voice information.

If the voice information is 'I want to make a trip to Chongqing a video of a point of the trip', the terminal analyzes the semantics of the voice information to acquire the semantics of the voice information, and the terminal can determine what the user wants to do and what type of video is needed. The specific scheme for acquiring the semantics corresponding to the voice information is the prior art, and the disclosure is not described.

And step S104, generating the first type video according to the semantics.

Generating the first type of video according to the semantics can comprise the steps of acquiring video materials corresponding to the semantics and generating the first type of video according to the video materials.

If the voice information is 'I want to make a trip to Chongqing a video of a stuck point', the semantics corresponding to the voice information are obtained, and then the video material corresponding to the semantics can be obtained locally at the terminal. And (3) the inventor wants to make a video of the trip to the Chongqing as a stuck point, and the corresponding video material is a photo of the trip to the Chongqing and a locally preset audio of the terminal. When a user takes a picture, the obtained picture can comprise picture basic information, such as the picture basic information comprises place information and time information when the picture is taken, the terminal obtains the picture basic information, the picture basic information accords with the meaning that I want to take a trip to be taken as a video of a point of sale, and preset audio is obtained as a video material, and a video is generated according to the picture and the audio. The preset audio can be multiple, and when one audio is selected as the video material, one audio can be randomly selected as the video material, or one audio conforming to the semantics can be selected as the video material. Specifically, if the mood of the user is judged according to the semantics, the mood of the user is cheerful, and the cheerful type audio is selected as the video material.

According to the method for generating the video by the voice, the voice information is acquired after the voice function starting instruction is acquired, the corresponding function can be prevented from being started when the video is not required to be generated, the video is automatically generated according to the semantics, the video is not only related to keywords or keywords in the voice information, but is related to the whole semantics corresponding to the voice information, the generated video accords with the use scene expressed by the user semantics, the generated video accords with the various requirements in the user semantics, the user does not need to manually search for video materials, the user does not need to manually manufacture the video, the video generation is convenient and quick, and the video generation efficiency is high.

Referring to fig. 3, optionally, generating the first type of video according to semantics includes:

and S301, determining a video dimension of the generated video according to the semantics, wherein the video dimension comprises at least two of time, place, person, scenery spot, emotion, voice-to-text effect, audio, scene and special effect.

In this disclosure, video temperature includes, but is not limited to, time, place, person, object, attraction, emotion, voice-to-text effect, audio, scene, special effects.

The time, place and person can be the time, place and person corresponding to the video material, and the object can be the object of the video material, such as a photo. The speech-to-text effect may be a letter effect in the produced video. The scene can be a scene of a material, and if the object is a photo and the scene is a grassland, the photo of the corresponding grassland is obtained. The special effect may be an effect added to the video, such as using a filter. After the voice corresponding to the voice information is acquired, the dimension of the video can be determined. If the voice information is "help me make photo movie of yesterday's cat", after determining the meaning of the voice information, the dimensions of the video including time "yesterday", character "cat" and object "photo" can be determined according to the voice.

And S302, acquiring video materials according to the video dimension.

The video material is pre-stored in the terminal. The video material can be collected and generated in daily life of the user, such as photos taken by the user. The type of video material is not limited and may include, for example, photographs, audio, text, and the like. The video material comprises preset basic information. The basic information of the video material corresponds to the video dimension. After the video dimension is determined, the video material can be obtained according to the video dimension. If the video dimension includes time "yesterday", character "cat" and object "photo", then the photo of the cat photographed yesterday needs to be obtained, so that the obtained photo is used as the video material, that is, the video material corresponding to the video dimension is obtained according to the video dimension and the basic information corresponding to the video material.

Optionally, preset video material, such as preset audio, may also be obtained. The preset audio can be multiple, when one of the audio is selected as the video material, one of the audio can be randomly selected as the video material, or one of the audio conforming to the semantics can be selected as the video material.

And S303, generating a first type video according to the video material and the semantics.

After the video material is obtained, a first type video can be generated according to the video material and the semantics.

If the voice information is "help me make photo movie with yesterday my photo, add A special effect to my image", then when the video dimension is obtained as time "yesterday", person "me" and object "photo", then according to the video dimension the correspondent material is obtained, i.e. i yesterday photo, and in the generated video, A special effect is added in the image of "me".

According to the scheme of the embodiment of the disclosure, at least two video dimensions are considered when the video is generated, and the generated video meets the requirements of users.

Optionally, after the voice function starting instruction is acquired, the method for generating the video by voice further comprises the following steps:

And displaying voice prompt information, wherein the voice prompt information is used for prompting a user to input the voice information according to the voice prompt information.

The specific type of the voice prompt information is not limited. Alternatively, the voice prompt may include various prompts for causing the terminal to perform different functions. Optionally, the voice prompt information includes first type information and second type information, wherein the first type information is used for generating the first type video, and the second type information is used for realizing other functions.

Optionally, the voice prompt information comprises first type information, second type information and third type information, and the voice prompt information identifies the semantics corresponding to the voice information according to the voice information:

When the type of the voice information is the first type voice, the corresponding semantics of the voice information are recognized according to the voice information, and a first type video is generated according to the semantics.

The method comprises the steps of obtaining a voice function starting instruction to start a voice function corresponding to the voice function starting instruction, obtaining voice information, displaying voice prompt information, wherein the voice prompt information is used for prompting a user to input the voice information according to the voice prompt information, and executing steps S103 and S104 in the method for generating a first type video when the type of the voice information is a first type voice.

Referring to fig. 5, after the voice information is acquired, the method for generating the video by voice further includes:

S501, determining the type of voice information according to a preset language model corresponding to voice prompt information, wherein the type of the voice information comprises a first type voice, a second type voice and a third type voice.

The language model corresponds to the voice prompt information. The language model may be trained from a plurality of pre-set speech samples. The first type voice and the second type voice respectively correspond to a plurality of voice new books, and the language model is obtained through training according to a plurality of voice samples respectively corresponding to the first type voice and the second type voice. The voice information can be determined to be the first type voice or the second type voice according to a preset language model corresponding to the voice prompt information, and the voice information is determined to be the third type voice when the voice information does not belong to the first type voice and the second type voice.

Optionally, determining the type of the voice information according to a preset language model corresponding to the voice prompt information includes:

Determining that the voice information is first type voice or second type voice according to a preset language model corresponding to the voice prompt information;

When the voice does not belong to the first type voice and the second type voice, the voice information is determined as the third type voice.

When the type of the voice information is the first type voice, the semantics corresponding to the voice information are recognized according to the voice information, and the first type video is generated according to the semantics.

S502, when the voice information is of the second type, starting the function of the corresponding voice information according to the voice information.

The second type of voice is used to turn on the function of the corresponding voice information. If the voice information is "I want to try on the latest fire album", the terminal opens the album mode and sets the latest fire album on top so that the user views the preview of the most fire album, and if the voice information is "I want to try on the latest prop", the terminal opens the front camera and opens the latest prop so that the user uses the latest prop to self-photograph. Specific voice information corresponding to the second type of voice is not limited, and the terminal can be enabled to start the function corresponding to the voice information. It can be understood that after the terminal starts the function corresponding to the voice information, the user can directly and manually operate the terminal to use the corresponding function, or the user continues to send out the voice, so that the terminal uses the corresponding function according to the voice information corresponding to the voice. How the user of the terminal uses the corresponding function is not limited in this disclosure.

And S503, when the type of the voice information is the third type voice, generating a second type video according to the voice information.

Optionally, generating the second type of video from the speech information may include:

Acquiring characters corresponding to the voice information;

And generating a second type video according to the preset audio corresponding to the text and the third type voice.

After the voice information is obtained, the text corresponding to the voice information can be obtained through analysis. If the acquired voice information corresponding to the third type of voice is 'happy today', the text 'happy today' corresponding to the voice information can be acquired.

The third type of speech may be preset to correspond to one or more tones. Audio, i.e. music. When generating the second video, one or more audios corresponding to the third type may be selected as video materials to generate the second type video. It can be understood that when the second video is played, the user can watch the text corresponding to the voice information of the user, and can also hear the corresponding audio.

Generating a dynamic wave corresponding to the frequency of the voice information according to the voice information;

acquiring an avatar of a preset account of a user;

And generating a second type video according to the dynamic wave, the head portrait and the preset audio corresponding to the third type voice.

Dynamic boep is a dynamic curve that varies with the frequency variation of speech information. The speech information is formed by the sound of the user speaking, which has a frequency, so that dynamic boils can be generated.

And presetting an account number which is an account number of an application program logged on the terminal. Optionally, the solution of the present disclosure is executed by an application program on the terminal. The user can have the account of the application program, the head portrait corresponding to the account is displayed when the account is logged in, and the terminal can acquire the head portrait of the account of the user.

And when the video is generated, generating a second type video according to the dynamic wave, the head portrait and the third type voice. It can be understood that when the second video is played, the user can watch the dynamic pop corresponding to the own voice information and the head portrait of the own account, and can also hear the corresponding audio.

The above is just an example of generating the second type of video from the third type of speech, and in the present disclosure, no limitation is made as to how the second type of video is generated from the third type of speech.

The technical proposal of the present disclosure can determine the type of the voice information according to the voice information of the user so as to execute different functions, the voice recognition method has the advantages that the voice effect is diversified, the type of the voice information is determined according to the preset language model corresponding to the voice prompt information, and the voice type recognition speed is high.

Referring to fig. 6, an embodiment of the present disclosure provides a device 60 for generating a voice video, where the device for generating a voice video may implement the method for generating a voice video of the foregoing embodiment, and the device 60 for generating a voice video may include:

the instruction obtaining module 601 is configured to obtain a voice function starting instruction to start a voice function corresponding to the voice function starting instruction;

a voice acquisition module 602, configured to acquire voice information;

A voice recognition module 603, configured to recognize semantics corresponding to the voice information according to the voice information;

the video generation module 604 is configured to generate a first type of video according to semantics.

According to the device for generating the video by the voice, the voice information is acquired after the voice function starting instruction is acquired, the corresponding function can be prevented from being started when the video is not required to be generated, the video is automatically generated according to the semantics, the video is not only related to keywords or keywords in the voice information, but is related to the whole semantics corresponding to the voice information, the generated video accords with the use scene expressed by the user semantics, the generated video accords with the various requirements in the user semantics, the user does not need to manually search for video materials, the user does not need to manually manufacture the video, the video generation is convenient and quick, and the video generation efficiency is high.

The video generating module 604 may include:

The dimension acquisition unit is used for determining and generating video dimensions of the video according to the semantics, wherein the video dimensions comprise at least two of time, places, characters, objects, scenic spots, moods, voice-to-word effects, audios, scenes and special effects;

The material acquisition unit is used for acquiring video materials according to the video dimension;

And the first video generation unit is used for generating a first type of video according to the video material and the semantics.

The apparatus 60 for generating video through voice may further include:

the prompt display module is used for displaying voice prompt information, and the voice prompt information is used for prompting a user to input the voice information according to the voice prompt information.

The voice recognition module 603 is specifically configured to recognize semantics corresponding to the voice information according to the voice information when the type of the voice information is the first type voice.

Wherein the means 60 for generating a video from speech may further comprise:

The voice type determining module is used for determining the type of voice information according to a preset language model corresponding to the voice prompt information, wherein the type of the voice information comprises a first type voice, a second type voice and a third type voice;

The function starting module is used for starting the function of the corresponding voice information according to the voice information when the type of the voice information is the second type voice;

And the video module is used for generating a second type video according to the voice information when the voice information is of a third type.

Wherein, the video module may include:

The text acquisition unit is used for acquiring text corresponding to the voice information;

a second video generating unit for generating a second type video according to the preset audio corresponding to the text and the third type voice

Or a video module, may include:

the wave acquisition unit is used for generating dynamic wave corresponding to the frequency of the voice information according to the voice information;

The head portrait acquiring unit is used for acquiring head portraits of preset account numbers of users;

and the third video generation unit is used for generating a second type of video according to the dynamic wave, the head portrait and the preset audio corresponding to the third type of voice.

The voice type determining module may include:

the first determining unit is used for determining that the voice information is first type voice or second type voice according to a language model of the preset corresponding voice prompt information;

And a second determining unit configured to determine the voice information as a third type voice when the voice does not belong to the first type voice and the second type voice.

Referring to fig. 7, a schematic diagram of an electronic device 700 suitable for use in implementing embodiments of the present disclosure is shown. The terminal devices in the embodiments of the present disclosure may include, but are not limited to, mobile terminals such as mobile phones, notebook computers, digital broadcast receivers, PDAs (personal digital assistants), PADs (tablet computers), PMPs (portable multimedia players), in-vehicle terminals (e.g., in-vehicle navigation terminals), and the like, and stationary terminals such as digital TVs, desktop computers, and the like. The electronic device shown in the drawings is merely an example and should not be construed to limit the functionality and scope of use of the disclosed embodiments.

The electronic device comprises a memory and a processor, wherein the processor may be referred to as a processing means 701 hereinafter, and the memory may comprise at least one of a Read Only Memory (ROM) 702, a Random Access Memory (RAM) 703, and a storage means 708 hereinafter, as shown in detail below:

As shown, the electronic device 700 may include a processing means (e.g., a central processor, a graphics processor, etc.) 701, which may perform various suitable actions and processes in accordance with a program stored in a Read Only Memory (ROM) 702 or a program loaded from a storage means 708 into a Random Access Memory (RAM) 703. In the RAM 703, various programs and data required for the operation of the electronic device 700 are also stored. The processing device 701, the ROM 702, and the RAM 703 are connected to each other through a bus 704. An input/output (I/O) interface 705 is also connected to bus 704.

In general, devices may be connected to I/O interface 705 including input devices 706 such as a touch screen, touch pad, keyboard, mouse, camera, microphone, accelerometer, gyroscope, etc., output devices 707 including a Liquid Crystal Display (LCD), speaker, vibrator, etc., storage devices 708 including, for example, magnetic tape, hard disk, etc., and communication devices 709. The communication means 709 may allow the electronic device 700 to communicate wirelessly or by wire with other devices to exchange data. While an electronic device 700 having various means is shown, it is to be understood that not all of the illustrated means are required to be implemented or provided. More or fewer devices may be implemented or provided instead.

In particular, according to embodiments of the present disclosure, the processes described above with reference to flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a non-transitory computer readable medium, the computer program comprising program code for performing the method shown in the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network via communication device 709, or installed from storage 708, or installed from ROM 702. The above-described functions defined in the methods of the embodiments of the present disclosure are performed when the computer program is executed by the processing device 701.

It should be noted that the computer readable medium described in the present disclosure may be a computer readable signal medium or a computer readable storage medium, or any combination of the two. The computer readable storage medium can be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples of a computer-readable storage medium may include, but are not limited to, an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this disclosure, a computer-readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In the present disclosure, however, the computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, with the computer-readable program code embodied therein. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to electrical wiring, fiber optic cable, RF (radio frequency), and the like, or any suitable combination of the foregoing.

In some embodiments, the clients, servers may communicate using any currently known or future developed network protocol, such as HTTP (HyperText Transfer Protocol ), and may be interconnected with any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include a local area network ("LAN"), a wide area network ("WAN"), the internet (e.g., the internet), and peer-to-peer networks (e.g., ad hoc peer-to-peer networks), as well as any currently known or future developed networks.

The computer readable medium may be included in the electronic device or may exist alone without being incorporated into the electronic device.

The computer readable medium carries one or more programs, which when executed by the electronic device, cause the electronic device to acquire a voice function start instruction to start a voice function corresponding to the voice function start instruction, acquire voice information, recognize semantics corresponding to the voice information according to the voice information, and generate a first type video according to the semantics.

Computer program code for carrying out operations of the present disclosure may be written in one or more programming languages, including, but not limited to, an object oriented programming language such as Java, smalltalk, C ++ and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider).

The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.

The modules or units involved in the embodiments of the present disclosure may be implemented by means of software, or may be implemented by means of hardware. The name of a module or unit is not limited to the unit itself in some cases, and for example, the instruction acquisition module may also be described as "a module that acquires external information".

The functions described above herein may be performed, at least in part, by one or more hardware logic components. For example, without limitation, exemplary types of hardware logic that may be used include Field Programmable Gate Arrays (FPGAs), application Specific Integrated Circuits (ASICs), application Specific Standard Products (ASSPs), systems-on-a-chip (SOCs), complex Programmable Logic Devices (CPLDs), and the like.

In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.

In accordance with one or more embodiments of the present disclosure, there is provided a method of speech generating video, comprising:

acquiring voice information;

identifying semantics corresponding to the voice information according to the voice information;

A first type of video is generated based on semantics.

According to one or more embodiments of the present disclosure, generating a first type of video from semantics includes:

Determining a video dimension of the generated video according to the semantics, wherein the video dimension comprises at least two of time, place, person, object, scenic spot, emotion, voice-to-text effect, audio, scene and special effect;

Acquiring video materials according to the video dimension;

a first type of video is generated from the video material and semantics.

According to one or more embodiments of the present disclosure, after acquiring the voice function opening instruction, the method further includes:

According to one or more embodiments of the present disclosure, the voice prompt information includes a first type of information, a second type of information, and a third type of information, and when the type of the voice information is the first type of voice, semantics corresponding to the voice information are recognized according to the voice information.

In accordance with one or more embodiments of the present disclosure, after obtaining the voice information, the method of voice generating the video further comprises:

determining the type of voice information according to a preset language model corresponding to the voice prompt information, wherein the type of the voice information comprises a first type voice, a second type voice and a third type voice;

when the type of the voice information is the second type voice, starting the function of the corresponding voice information according to the voice information;

and when the type of the voice information is the third type voice, generating a second type video according to the voice information.

According to one or more embodiments of the present disclosure, generating a second type of video from speech information includes:

Acquiring characters corresponding to the voice information;

Generating a second type of video according to the preset audio corresponding to the text and the third type of voice, or

Generating a second type of video from the speech information, comprising:

acquiring an avatar of a preset account of a user;

According to one or more embodiments of the present disclosure, determining a type of voice information according to a language model of a preset corresponding voice prompt information includes:

According to one or more embodiments of the present disclosure, there is provided an apparatus for generating video from speech, comprising:

The voice acquisition module is used for acquiring voice information;

and the video generation module is used for generating the first type of video according to the semantics.

In accordance with one or more embodiments of the present disclosure, a video generation module may include:

In accordance with one or more embodiments of the present disclosure, an apparatus for speech generating video may further include:

According to one or more embodiments of the present disclosure, the voice recognition module is specifically configured to recognize semantics corresponding to the voice information according to the voice information when the type of the voice information is the first type voice.

In accordance with one or more embodiments of the present disclosure, the apparatus for speech generating video may further include:

In accordance with one or more embodiments of the present disclosure, a video module may include:

Or a video module, may include:

In accordance with one or more embodiments of the present disclosure, a voice type determination module may include:

According to one or more embodiments of the present disclosure, there is provided an electronic device including:

one or more processors;

A memory;

One or more applications, wherein the one or more applications are stored in memory and configured to be executed by one or more processors, the one or more applications configured to perform the method of speech generating video according to any of the above embodiments.

According to one or more embodiments of the present disclosure, there is provided a computer readable medium having stored thereon a computer program which, when executed by a processor, implements the method of speech generating video of any of the above embodiments.

The foregoing description is only of the preferred embodiments of the present disclosure and description of the principles of the technology being employed. It will be appreciated by persons skilled in the art that the scope of the disclosure referred to in this disclosure is not limited to the specific combinations of features described above, but also covers other embodiments which may be formed by any combination of features described above or equivalents thereof without departing from the spirit of the disclosure. Such as those described above, are mutually substituted with the technical features having similar functions disclosed in the present disclosure (but not limited thereto).

Moreover, although operations are depicted in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order. In certain circumstances, multitasking and parallel processing may be advantageous. Likewise, while several specific implementation details are included in the above discussion, these should not be construed as limiting the scope of the present disclosure. Certain features that are described in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable subcombination.

Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are example forms of implementing the claims.

Claims

Translated fromChinese

1.一种语音生成视频的方法，其特征在于，包括：1. A method for generating video from speech, comprising:

获取语音功能开启指令，以开启所述语音功能开启指令对应的语音功能；Obtain a voice function activation instruction to activate the voice function corresponding to the voice function activation instruction;

获取语音信息；Get voice information;

根据所述语音信息识别所述语音信息对应的语义；Identify semantics corresponding to the voice information according to the voice information;

根据所述语义生成第一类型视频；所述第一类型视频是根据视频素材和所述语义生成的；所述视频素材是根据视频维度获取的；所述视频维度是根据所述语义确定生成的。A first type of video is generated according to the semantics; the first type of video is generated according to video material and the semantics; the video material is obtained according to video dimensions; the video dimensions are determined and generated according to the semantics.

2.根据权利要求1所述的语音生成视频的方法，其特征在于，所述视频维度包括时间、地点、人物、对象、景点、情绪、语音转文字效果、音频、场景、特效效果中至少两个。2. The method for generating video from speech according to claim 1 is characterized in that the video dimensions include at least two of time, place, person, object, scenic spot, emotion, speech-to-text effect, audio, scene, and special effects.

3.根据权利要求1所述的语音生成视频的方法，其特征在于，所述获取语音功能开启指令之后，所述方法还包括：3. The method for generating video from voice according to claim 1, characterized in that after obtaining the voice function activation instruction, the method further comprises:

显示语音提示信息，所述语音提示信息用于提示用户根据所述语音提示信息输入语音信息。Display voice prompt information, wherein the voice prompt information is used to prompt the user to input voice information according to the voice prompt information.

4.根据权利要求3所述的语音生成视频的方法，其特征在于，所述语音提示信息包括第一类型信息、第二类型信息和第三类型信息；所述语音信息的类型包括第一类型语音、第二类型语音和第三类型语音；4. The method for generating video from speech according to claim 3, characterized in that the speech prompt information includes first type information, second type information and third type information; the types of the speech information include first type speech, second type speech and third type speech;

所述根据所述语音信息识别所述语音信息对应的语义；根据所述语义生成第一类型视频，包括：The step of identifying semantics corresponding to the voice information according to the voice information; and generating a first type of video according to the semantics includes:

所述语音信息的类型为所述第一类型语音时，根据所述语音信息识别所述语音信息对应的语义；根据所述语义生成第一类型视频；When the type of the voice information is the first type of voice, identifying semantics corresponding to the voice information according to the voice information; and generating a first type of video according to the semantics;

所述获取语音信息之后，所述方法还包括：After acquiring the voice information, the method further includes:

根据预设的对应所述语音提示信息的语言模型确定所述语音信息的类型；Determining the type of the voice information according to a preset language model corresponding to the voice prompt information;

所述语音信息的类型为所述第二类型语音时，根据所述语音信息开启所对应所述语音信息的功能；When the type of the voice information is the second type of voice, enabling a function corresponding to the voice information according to the voice information;

所述语音信息的类型为所述第三类型语音时，根据所述语音信息生成第二类型视频。When the type of the voice information is the third type of voice, a second type of video is generated according to the voice information.

5.根据权利要求4所述的语音生成视频的方法，其特征在于，所述根据所述语音信息生成第二类型视频，包括：5. The method for generating video from speech according to claim 4, wherein generating the second type of video according to the speech information comprises:

获取所述语音信息对应的文字；Obtaining text corresponding to the voice information;

根据所述文字和所述第三类型语音对应的预设的音频生成第二类型视频；或generating a second type of video according to the preset audio corresponding to the text and the third type of voice; or

所述根据所述语音信息生成第二类型视频，包括：The generating a second type of video according to the voice information includes:

根据所述语音信息生成对应所述语音信息频率的动态波普；Generate a dynamic wave corresponding to the frequency of the voice information according to the voice information;

获取用户的预设账号的头像；Get the user's preset account avatar;

根据所述动态波普、所述头像和所述第三类型语音对应的预设的音频生成第二类型视频。A second type of video is generated according to the preset audio corresponding to the dynamic pop, the avatar and the third type of voice.

6.根据权利要求4所述的语音生成视频的方法，其特征在于，所述根据预设的对应所述语音提示信息的语言模型确定所述语音信息的类型，包括：6. The method for generating video from speech according to claim 4, characterized in that the step of determining the type of the speech information according to a preset language model corresponding to the speech prompt information comprises:

根据预设的对应所述语音提示信息的语言模型确定所述语音信息为第一类型语音或第二类型语音；Determining whether the voice information is a first type of voice or a second type of voice according to a preset language model corresponding to the voice prompt information;

在所述语音不属于第一类型语音和第二类型语音时，将所述语音信息确定为第三类型语音。When the voice does not belong to the first type of voice or the second type of voice, the voice information is determined to be a third type of voice.

7.根据权利要求1所述的语音生成视频的方法，其特征在于：所述语音功能开启指令由用户按压预设的虚拟控件预设时长触发。7. The method for generating video by voice according to claim 1 is characterized in that: the voice function activation instruction is triggered by the user pressing a preset virtual control for a preset duration.

8.一种语音生成视频的装置，其特征在于，包括：8. A device for generating video from speech, comprising:

指令获取模块，用于获取语音功能开启指令，以开启所述语音功能开启指令对应的语音功能；An instruction acquisition module, used to acquire a voice function activation instruction to activate the voice function corresponding to the voice function activation instruction;

语音获取模块，用于获取语音信息；A voice acquisition module is used to acquire voice information;

语音识别模块，用于根据所述语音信息识别所述语音信息对应的语义；A speech recognition module, used for recognizing the semantics corresponding to the speech information according to the speech information;

视频生成模块，用于根据所述语义生成第一类型视频；所述第一类型视频是根据视频素材和所述语义生成的；所述视频素材是根据视频维度获取的；所述视频维度是根据所述语义确定生成的。A video generation module is used to generate a first type of video according to the semantics; the first type of video is generated according to video material and the semantics; the video material is obtained according to video dimensions; the video dimensions are determined and generated according to the semantics.

9.一种电子设备，其特征在于，其包括：9. An electronic device, characterized in that it comprises:

一个或多个处理器；one or more processors;

存储器；Memory;

一个或多个应用程序，其中所述一个或多个应用程序被存储在所述存储器中并被配置为由所述一个或多个处理器执行，所述一个或多个程序配置用于：执行根据权利要求1-7任一项所述的语音生成视频的方法。One or more applications, wherein the one or more applications are stored in the memory and configured to be executed by the one or more processors, and the one or more programs are configured to: execute the method for generating video from speech according to any one of claims 1-7.

10.一种计算机可读介质，其上存储有计算机程序，其特征在于，该程序被处理器执行时实现权利要求1-7任一项所述的语音生成视频的方法。10. A computer-readable medium having a computer program stored thereon, characterized in that when the program is executed by a processor, the method for generating video from speech according to any one of claims 1 to 7 is implemented.