Summary of the Invention
In view of this, the present application provides a video data generating method, a computer-readable storage medium, and an electronic device, with the object of automatically adding text information to the generated video and realizing one-key operation, so that the user is freed from dependence on an input method and can freely input video, voice, images, and text.
In a first aspect, the present application proposes a video data generating method, comprising:
obtaining image stream data and audio stream data;
obtaining text data converted by speech recognition; and
generating video data to be published according to the image stream data, the audio stream data, and the text data converted by speech recognition.
Preferably, when an operation on a first control is detected, acquisition of the image stream data and the audio stream data is started, and when an operation on a second control located at the same position is detected, generation of the video data to be published is triggered, thereby realizing one-key operation.
Preferably, when an operation on the first control is detected, acquisition of the image stream data and the audio stream data is started, and when the recording time is detected to reach a predetermined duration, the method automatically jumps to generating the video data to be published, thereby realizing one-key operation.
Preferably, the method further comprises:
uploading the video data to be published to a content server when a publish instruction is received.
Preferably, obtaining the text data converted by speech recognition comprises:
obtaining, in real time, text stream information converted by speech recognition.
Preferably, the method further comprises:
displaying, in real time during video recording, the processed image stream data and the text stream information converted by speech recognition.
Preferably, displaying the processed image stream data and the text stream information in real time comprises:
displaying the obtained image stream data in a first layer;
adding a selected mask image and/or filter effect in a second layer; and
displaying the text stream information converted by speech recognition in a third layer.
Preferably, the method further comprises:
obtaining a mask image and/or a filter selected by the user.
Preferably, displaying the text stream information converted by speech recognition in the third layer comprises:
displaying the text stream information converted by speech recognition step by step according to division marks in the text stream information.
Preferably, obtaining the text data corresponding to the audio stream data comprises:
sending the audio stream data to an online speech recognition server and receiving the text stream information converted by speech recognition; or
calling an offline speech recognition application programming interface to recognize the audio stream data so as to obtain the text stream information converted by speech recognition.
Preferably, the video data comprises a video file and a subtitle file, and generating the video data to be published according to the processed image stream data, the audio stream data, and the text data converted by speech recognition comprises:
generating the video file according to the processed image stream data and the audio stream data; and
generating, according to the correspondence between the text data converted by speech recognition and the audio stream data, the subtitle file with synchronization information, the synchronization information being used to synchronize the text data converted by speech recognition with the video file in time during playback.
In another aspect, the present application further proposes a computer-readable storage medium on which computer program instructions are stored, wherein the computer program instructions, when executed by a processor, implement the method described above.
Meanwhile the application also proposed a kind of electronic equipment, including memory and processor, wherein, the memory is usedIn storing one or more computer program instructions, wherein, one or more computer program instructions are by the processorIt performs to realize method as described in relation to the first aspect.
The present application obtains the audio stream data in a video, recognizes the audio stream data to obtain text data converted by speech recognition, and generates video data to be published or shared according to the synchronously acquired image stream data and the text data converted by speech recognition. As a result, the present application can embed, in the video data, text characterizing the voice information, so that the voice information in the video is expressed both visually and aurally, providing a richer user experience. Moreover, the video data generating method of the present application can generate, with a 'one-key' operation, multimedia content carrying video, voice, images, and text, which is convenient for the user. In mobile Internet information publishing and interaction scenarios, the video generation and publishing method disclosed in the present application is more convenient and efficient than publishing text information with an input method and a keyboard.
Specific Embodiments
The present invention is described below based on embodiments, but the present invention is not limited to these embodiments. In the following detailed description of the present invention, some specific details are described in detail; those skilled in the art can still fully understand the present invention without the description of these details. To avoid obscuring the essence of the present invention, well-known methods, processes, flows, elements, and circuits are not described in detail.
In addition, those of ordinary skill in the art should understand that the drawings provided herein are for illustrative purposes and are not necessarily drawn to scale.
Unless the context clearly requires otherwise, throughout the specification and claims, words such as 'comprise' and 'include' should be construed in an inclusive sense rather than an exclusive or exhaustive sense, that is, in the sense of 'including, but not limited to'.
In the description of the present invention, unless otherwise indicated, 'a plurality of' means two or more.
Fig. 1 is a system block diagram of a video sharing system according to an embodiment of the present application. As shown in Fig. 1, the video sharing system of this embodiment may include a plurality of user terminals 101, a network 102, and a content server 103. The user terminals 101 communicate with the content server 103 through the network 102. On the one hand, a user terminal 101 can record and generate video data and upload it to the content server 103 through the network 102 for publishing. On the other hand, a user terminal 101 can obtain video data from the content server 103 for browsing and commenting. The content server 103 is configured to receive the video data uploaded by the user terminals 101 and store it in a database, and meanwhile to provide video data to user terminals in response to requests from other user terminals 101.
In this embodiment, the user terminal 101 may be a communication data processing device loaded with a predetermined application, for example, a smart mobile terminal, a smart TV, or a general-purpose computer. Under the control of the application, the user terminal 101 can acquire image stream data and audio stream data and further process the acquired data. The content server 103 may be a communication data processing device. Since the content server 103 needs to store and publish the information after receiving the video data sent by the user terminals 101, the content server 103 should in general possess a relatively large storage capacity and relatively good data processing capability. It should be understood that the content server 103 may be one or more centrally connected servers, or a plurality of server clusters interconnected in a distributed manner. The network 102 may be a local area network (LAN) or a wide area network (WAN), and may be a network accessed by wired connection or a network accessed by wireless connection. Preferably, the network 102 is the Internet accessed through a wireless connection.
Fig. 2 is a flowchart of the video data generating method according to an embodiment of the present application. As shown in Fig. 2, the video data generating method of this embodiment includes the following steps:
Step S210: acquiring image stream data and audio stream data.
Step S220: acquiring the text data converted by speech recognition.
Step S230: at the end of the video recording, generating video data to be published according to the image stream data, the audio stream data, and the text data converted by speech recognition.
When applied to a terminal device with a touch input device, step S210 of this embodiment can be triggered when the program detects that the user clicks a start-recording control. After step S210 is triggered, the video recording ends when the program detects that the user clicks an end-recording control located at the same position. The user only needs to click twice at the same position on the touch input device to generate a piece of video content with sound and text, which simplifies the operation of video data generation and realizes 'one-key' operation.
Meanwhile, in this embodiment, the triggering of the generation operation can also be realized by automatic jumping: when an operation on the first control is detected, acquisition of the image stream data and the audio stream data is started, and when the recording time is detected to reach a predetermined duration, the method automatically jumps to generating the video data to be published, thereby realizing one-key operation.
Thus, after recording starts, if the recording lasts long enough (for example, 60 seconds), the method automatically jumps to generating the video data to be published, which further simplifies the operation.
In step S210, when a start-recording instruction input by the user is received, the user terminal 101 activates the camera and the microphone to capture external dynamic images and audio, which are formed into image stream data and audio stream data. In this embodiment, the image stream data and the audio stream data are streaming media data. Streaming media data is continuous time-based media data that is suitable for use on a network and employs streaming technology; it can be transmitted as a continuous stream of data packets and processed in real time.
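The single-control behavior described above can be illustrated with the following Kotlin sketch. It is an illustration only, not part of the claimed method; the CameraSource and MicSource interfaces and the callback names are hypothetical stand-ins for the terminal's actual capture interfaces.

// Hypothetical capture interfaces standing in for the terminal's camera and microphone APIs.
interface CameraSource { fun start(onFrame: (ByteArray) -> Unit); fun stop() }
interface MicSource { fun start(onChunk: (ByteArray) -> Unit); fun stop() }

// A single control at one screen position toggles between start (control 11) and end (control 12).
class OneKeyRecorder(
    private val camera: CameraSource,
    private val mic: MicSource,
    private val onFinished: (images: List<ByteArray>, audio: List<ByteArray>) -> Unit
) {
    private val imageStream = mutableListOf<ByteArray>()
    private val audioStream = mutableListOf<ByteArray>()
    private var recording = false

    // Called for every tap on the record control.
    fun onRecordControlClicked() {
        if (!recording) {                       // first tap: start acquisition (step S210)
            recording = true
            camera.start { frame -> imageStream += frame }
            mic.start { chunk -> audioStream += chunk }
        } else {                                // second tap: stop and trigger generation (step S230)
            recording = false
            camera.stop()
            mic.stop()
            onFinished(imageStream, audioStream)
        }
    }
}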
Meanwhile the process of text data that above-mentioned acquisition is come by speech recognition conversion, it can terminate in video recordWhen proceed by, corresponding all pass through speech recognition in this way, can be obtained according to the audio stream data file of a completionConverted next text data.
In step S220, voice knowledge is carried out by calling the speech recognition application programming interface interface of offline or online formNot.The corresponding text data of audio stream data namely the text to be come by speech recognition conversion can be obtained by speech recognitionNotebook data.Above-mentioned text data can effectively characterize voice messaging included in audio stream data.
The characteristics of being handled in real time using stream medium data, obtains the textual data to come by speech recognition conversionAccording to process can during the video record by almost in real time in a manner of carry out.In this way, can with recording process intoExhibition obtains the feedback of the text data to come by speech recognition conversion in real time.Meanwhile it obtains in real time in this way logicalIt crosses the text data that speech recognition conversion comes to can be also used for being shown in recording process, to improve in recording processUser experience.
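A minimal Kotlin sketch of this real-time feedback loop follows; the StreamingRecognizer interface is a hypothetical model of a recognizer that returns partial results while audio is still being captured, not the API of any particular speech service.

// Hypothetical streaming recognizer: audio chunks in, partial text out.
interface StreamingRecognizer {
    fun feed(pcmChunk: ByteArray)
    fun setOnPartialResult(listener: (String) -> Unit)
}

// Accumulates partial results so the recording screen can show the recognized text
// almost in real time while the audio stream is still being captured.
class LiveTranscript(recognizer: StreamingRecognizer, private val onUpdate: (String) -> Unit) {
    private val builder = StringBuilder()

    init {
        recognizer.setOnPartialResult { partial ->
            builder.append(partial)
            onUpdate(builder.toString())   // refresh the on-screen text layer
        }
    }

    fun current(): String = builder.toString()
}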
In step S230, the image stream data and the audio stream data acquired during the entire recording process and the text data converted by speech recognition are integrated together into video data for publishing and sharing. The video data may be a single independent data file, or a file package composed of multiple files. As a result, a viewer of the video data can not only see the images and hear the sound, but also see the text information displayed in various forms in the video, so that the information conveyed by the video can be fully obtained regardless of the viewing environment, the producer's accent, and the viewer's own abilities. This improves, on the one hand, the experience of the producer who shares the video and, on the other hand, the experience of the viewer.
In an optional implementation, the above process of acquiring the text data may start at the end of the video recording, so that all of the corresponding text data can be obtained from a completed audio stream data file.
In the technical solution of the present application, the audio stream data in the video is acquired, the audio stream data is recognized to obtain text data converted by speech recognition, and the video data to be published or shared is generated according to the synchronously acquired image stream data and the text data converted by speech recognition. Thus, text characterizing the voice information can be embedded in the video data, so that the voice information in the video is expressed both visually and aurally, providing a richer user experience. Moreover, the video data generating method of the present application can generate, with a 'one-key' operation, multimedia content carrying video, voice, and text, which is convenient for the user.
Meanwhile, each step of the embodiment of the present invention can be optimized to further improve the user experience of video recording.
Fig. 3 is a flowchart of a video data generating method according to an optional implementation of the embodiment of the present application. As shown in Fig. 3, the method includes the following steps:
Step S310: entering a recording configuration interface and prompting the user to select a mask image and/or a filter.
Step S320: acquiring the mask image and/or filter selected by the user.
Step S330: after receiving an instruction to start recording, acquiring image stream data and audio stream data in real time.
Step S340: acquiring the text data converted by speech recognition.
Step S350: during the video recording, displaying in real time the image stream data processed by rendering and the text stream information converted by speech recognition.
Step S360: at the end of the video recording, generating video data to be published according to the processed image stream data, the audio stream data, and the text data converted by speech recognition.
Step S370: uploading the video data to be published to the content server when a publish instruction is received.
In step S310, before the recording starts, the user can pre-select a mask, a filter, or another special effect to be rendered into the video later. An exemplary graphical user interface is shown in Fig. 4. The user can select a specific type of video option, for example the 'mood recording' option, to enter the interface shown in Fig. 4. In the graphical user interface shown in Fig. 4, a list presents multiple selectable masks, including 'Black Duster', 'Dreamlike Space', and 'Pixel World'. It should be understood that, in addition to masks, various special effects such as filters and frames can also be selected. Meanwhile, a start-recording control 11 is provided in the graphical user interface, and the user can start recording by clicking the start-recording control 11.
Regarding step S320, it should be understood that in other implementations, the step of selecting the mask image or filter may also be arranged during or after the video recording. In these implementations, the processing of the image stream data based on the mask image and/or filter may likewise be deferred to during or after the video recording.
Optionally, the user may switch the selected mask image and/or filter multiple times during the recording; thus, the finally obtained processed image stream data can present different effects in different time periods according to the moments at which the user makes the selections. For example, the user starts recording at time t0 with the 'Black Duster' mask initially selected, switches to the 'Dreamlike Space' mask at time t1, and ends the recording at time t2. In this case, in the finally output image stream data, the image stream data from t0 to t1 is processed with the 'Black Duster' mask, and the image stream data from t1 to t2 is processed with the 'Dreamlike Space' mask selected after the switch.
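A possible way to keep track of such mid-recording switches is sketched below in Kotlin; the mask identifiers and timestamps are purely illustrative assumptions.

// Records when the user switched masks and resolves which mask applies to a frame timestamp,
// so different time segments of the output receive different effects.
class MaskTimeline {
    private val switches = mutableListOf<Pair<Long, String>>()   // (timestampMs, maskId)

    fun select(maskId: String, timestampMs: Long) {
        switches += timestampMs to maskId
    }

    // The mask active at a frame is the most recent selection not later than its timestamp.
    fun maskAt(frameTimestampMs: Long): String? =
        switches.lastOrNull { it.first <= frameTimestampMs }?.second
}

fun main() {
    val timeline = MaskTimeline()
    timeline.select("black_duster", 0L)          // t0: recording starts with 'Black Duster'
    timeline.select("dreamlike_space", 12_000L)  // t1: user switches to 'Dreamlike Space'
    println(timeline.maskAt(5_000L))   // black_duster
    println(timeline.maskAt(20_000L))  // dreamlike_space
}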
In step S330, the image stream data and the audio stream data are streaming media data and can be acquired, processed, and forwarded in real time. Specifically, YUV420 frame data of the video output signal and PCM data of the audio output signal are received as the image stream data and the audio stream data, respectively.
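For illustration, the two streams can be modeled as simple timestamped containers; the Kotlin names below are assumptions and do not correspond to any particular platform API.

// YUV420 video frames and PCM audio chunks, each stamped with a presentation time
// so the two streams can be kept in sync downstream.
data class VideoFrame(val yuv420: ByteArray, val ptsMs: Long)
data class AudioChunk(val pcm: ShortArray, val ptsMs: Long)

// Hypothetical callback interface through which the capture pipeline delivers the streams.
interface StreamListener {
    fun onVideoFrame(frame: VideoFrame)
    fun onAudioChunk(chunk: AudioChunk)
}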
In step S340, the voice information in the audio stream data is recognized by calling a speech recognition application programming interface (API), so as to obtain text stream information converted by speech recognition as the text data. In this implementation, the audio stream data is sent to an online speech recognition program interface, and the text stream information converted by speech recognition returned by the online speech recognition program interface is received. Optionally, an online speech recognition application programming interface provided by a third party such as Sogou, iFLYTEK, or Baidu may be used.
Further, in application scenarios where network traffic needs to be saved, an offline speech recognition application programming interface may also be used for the recognition.
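The choice between the online and offline paths can be sketched as follows; the Recognizer abstraction and both implementations are hypothetical placeholders rather than the API of any specific speech service.

// Hypothetical recognizer abstraction; real implementations would wrap a third-party
// online speech service or an on-device engine, respectively.
interface Recognizer { fun recognize(pcm: ByteArray): String }

class OnlineRecognizer : Recognizer {
    override fun recognize(pcm: ByteArray): String {
        // Would upload the audio to the online speech recognition server and return its text.
        return "online result placeholder"
    }
}

class OfflineRecognizer : Recognizer {
    override fun recognize(pcm: ByteArray): String {
        // Would run a local engine, avoiding any network traffic.
        return "offline result placeholder"
    }
}

// Prefer the offline path when network traffic must be saved or no network is available.
fun chooseRecognizer(saveTraffic: Boolean, networkAvailable: Boolean): Recognizer =
    if (saveTraffic || !networkAvailable) OfflineRecognizer() else OnlineRecognizer()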
In step S350, the acquired image stream data is processed according to the mask image and/or filter selected in step S320, and the image stream data processed by rendering is output for display. Meanwhile, the text stream information corresponding to the voice information in the audio stream, obtained in real time in step S340, is also attached to the display.
Specifically, the rendering of the image stream data may be performed by OpenGL API texture rendering to obtain a preview layer (the first layer), on top of which a mask layer or filter layer (the second layer) is added; then the text stream information converted by speech recognition is processed into a text layer (the third layer), and the layers are displayed stacked in the order of the image stream layer, the mask layer, and the text layer.
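The three-layer stacking can be expressed as the following Kotlin sketch; the Canvas and Layer types are illustrative stand-ins for the actual OpenGL render target and textures, and the drawing itself is left as comments.

// Layers are drawn back to front: preview (camera frames), then mask/filter, then text.
interface Canvas
interface Layer { fun draw(canvas: Canvas) }

class PreviewLayer(private val latestFrame: () -> ByteArray?) : Layer {
    override fun draw(canvas: Canvas) { /* render the latest YUV frame as a texture (first layer) */ }
}
class MaskLayer(private val maskId: String?) : Layer {
    override fun draw(canvas: Canvas) { /* blend the selected mask or filter over the preview (second layer) */ }
}
class TextLayer(private val recognizedText: () -> String) : Layer {
    override fun draw(canvas: Canvas) { /* draw the recognized text stream on top (third layer) */ }
}

// Stacking order matters: the first layer drawn ends up at the bottom, the last one on top.
fun renderFrame(canvas: Canvas, layers: List<Layer>) = layers.forEach { it.draw(canvas) }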
As a result, while recording the video the user can see the selected mask image attached to the dynamic images being recorded, and at the same time the text data corresponding to the input voice (that is, the text data converted by speech recognition) is attached to the video.
Further, the text stream information obtained in step S340 includes division marks obtained by the speech recognition according to the pauses of the voice information in the audio stream data (including punctuation marks and marks dividing different words or phrases). When displaying the text stream information in real time, the text data can be displayed step by step in segments according to the division marks, so as to achieve the effect of inputting text in real time by voice and improve the user experience.
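As a concrete illustration of segmenting by division marks, the following self-contained Kotlin snippet splits a recognized text stream at punctuation so the interface can reveal it segment by segment; the mark set and sample sentence are examples only.

// Division marks: punctuation the recognizer inserts at pauses in the speech.
val DIVISION_MARKS = charArrayOf('，', '。', '！', '？', ',', '.', '!', '?')

fun segmentByDivisionMarks(textStream: String): List<String> =
    textStream
        .split(*DIVISION_MARKS)
        .map { it.trim() }
        .filter { it.isNotEmpty() }

fun main() {
    // Each segment would be appended to the text layer as it arrives.
    val recognized = "今天天气不错，我们去公园吧。好不好？"
    segmentByDivisionMarks(recognized).forEach { println(it) }   // prints three segments
}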
Figs. 5-7 are schematic diagrams of the graphical user interface during video recording. After the user inputs the instruction to start recording a video, the graphical user interface switches to Fig. 5. The start-recording control 11 in Fig. 4 is switched to an end-recording control 12, and the image stream data acquired in real time and the mask image are displayed in the region 13 above the end-recording control 12. In the example of Fig. 5, the user has selected a granular translucent mask image, so the image acquired in real time is occluded by the granular translucent mask and shows a dimmed effect, which can effectively protect the user's privacy. As shown in Figs. 6-7, after recording starts, the user can record his or her own voice or the voice of others. After the user terminal 101 acquires the audio stream data, it calls the speech recognition application programming interface for recognition, so as to obtain the text stream information converted by speech recognition. The text stream information converted by speech recognition is displayed step by step in the graphical interface in units of characters, words, or short sentences. During the display, the text already shown can also change position according to the amount of subsequent text information, presenting a gradually growing effect and providing a better user experience.
It should be understood that the recording, recognition, and display operations of steps S330, S340, and S350 are repeated in real time in the above order as the image stream data and the audio stream data are continuously acquired, until the user inputs the end-recording instruction.
The user can trigger step S360 and end the recording by inputting an end-recording instruction. The program can also monitor the recording time and, when the recording time reaches a predetermined limit, automatically jump to ending the recording.
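The automatic time-limited stop can be sketched with a plain JVM timer, as below; the limit value and the callback names are illustrative assumptions.

import java.util.Timer
import kotlin.concurrent.timerTask

// Ends the recording automatically once the predetermined limit is reached, so the user
// never has to press the end-recording control.
class RecordingTimeLimit(private val limitMs: Long, private val stopRecording: () -> Unit) {
    private var timer: Timer? = null

    fun onRecordingStarted() {
        timer = Timer().apply {
            schedule(timerTask { stopRecording() }, limitMs)   // auto-jump to step S360
        }
    }

    fun onRecordingStoppedManually() {
        timer?.cancel()   // the user ended recording before the limit
        timer = null
    }
}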
In this step, the image stream data, the audio stream data, and the text data converted by speech recognition are integrated into a single video file, or into a file package including a video file and a subtitle file. Thus, after the video data is published, a video with the mask image and the text data converted by speech recognition can be played by downloading the video file or the file package.
Further, in the video data, the display of the text data converted by speech recognition is substantially synchronized with the voice information in the video data. In this way, during playback the user can quickly grasp the progress and content of the voice information from the voice and the synchronously displayed text data converted by speech recognition.
To achieve this, one way is to synchronously embed the text data converted by speech recognition directly into the synthesized video file.
Another way is to generate a subtitle file with synchronization information according to the text data. Specifically, the following steps may be included:
Step S361: generating a video file according to the image stream data processed by rendering and the audio stream data.
Preferably, the video file is an MP4 (MPEG-4 Part 14) file.
Step S362: generating a subtitle file with synchronization information according to the correspondence between the text data and the audio stream data, wherein the synchronization information is used to synchronize the text data with the video file in time during playback.
Preferably, the subtitle file may be a general subtitle file in a format such as srt, smi, or ssa.
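As an illustration of a subtitle file carrying the synchronization information, the following self-contained Kotlin snippet writes recognized segments and their audio positions into a standard srt file; the entry texts, timestamps, and file name are invented examples.

import java.io.File

// One recognized segment together with the audio interval it corresponds to.
data class SubtitleEntry(val startMs: Long, val endMs: Long, val text: String)

private fun srtTime(ms: Long): String {
    val h = ms / 3_600_000
    val m = ms % 3_600_000 / 60_000
    val s = ms % 60_000 / 1_000
    val millis = ms % 1_000
    return "%02d:%02d:%02d,%03d".format(h, m, s, millis)
}

// The timestamps are the synchronization information that lets a player show each piece
// of text in step with the audio of the MP4 file.
fun writeSrt(entries: List<SubtitleEntry>, target: File) {
    val body = entries.mapIndexed { index, e ->
        "${index + 1}\n${srtTime(e.startMs)} --> ${srtTime(e.endMs)}\n${e.text}\n"
    }.joinToString("\n")
    target.writeText(body)
}

fun main() {
    writeSrt(
        listOf(
            SubtitleEntry(0, 2_500, "Hello, this is a mood recording."),
            SubtitleEntry(2_500, 5_000, "The text follows the voice."),
        ),
        File("recording.srt"),
    )
}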
Fig. 8 is a schematic diagram of the graphical user interface at the end of video recording. As shown in Figs. 7 and 8, after the user clicks the end-recording control 12 in Fig. 7, the user terminal 101 ends the video acquisition and jumps to the interface shown in Fig. 8. The main part of the graphical user interface in Fig. 8 plays the video data generated in this step (including images, audio, and text) in a loop. Meanwhile, a publish control 14 is also displayed. In this interface, the user can preview the effect of the video and decide whether to publish or share it. By clicking the publish control 14 to input a publish instruction, the user can upload the generated video data to the content server 103.
After the video data is published, other users of the application can log in to the content server through the application to obtain the video data recorded and generated in the above steps S310-S360.
Fig. 9 is a schematic diagram of the graphical user interface in the video playback state. As shown in Fig. 9, when the above video data is played, the rendered images are displayed while the audio stream data is played, and the text data is displayed in a synchronized manner in the top layer. The text data can be displayed in a color-changing manner according to the synchronization information, so as to indicate the corresponding playback progress.
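The progress-coloring behavior can be sketched in Kotlin as follows; SubtitleEntry is the same (start, end, text) triple used when generating the subtitle file, and the actual coloring is left to the player.

// Finds the subtitle entry active at the current playback position and how far through it
// playback is, so the displayed text can change color to indicate progress.
data class SubtitleEntry(val startMs: Long, val endMs: Long, val text: String)

fun activeSubtitle(entries: List<SubtitleEntry>, positionMs: Long): Pair<SubtitleEntry, Float>? {
    val entry = entries.firstOrNull { positionMs in it.startMs until it.endMs } ?: return null
    val progress = (positionMs - entry.startMs).toFloat() / (entry.endMs - entry.startMs)
    return entry to progress   // e.g. color the first `progress` fraction of the characters
}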
In the technical solution of the present application, the audio stream data in the video is acquired, the audio stream data is recognized to obtain text data converted by speech recognition, and the video data to be published or shared is generated according to the image stream data acquired synchronously with the audio stream data and the text data converted by speech recognition. Thus, text characterizing the voice information in the video can be embedded in the video data, so that the voice information in the video is expressed simultaneously in both the visual and auditory aspects, providing a richer user experience.
Further, the video data generating method of the present application can generate, with a 'one-key' operation, multimedia content carrying video, voice, and text, which is convenient for the user. Meanwhile, the video generating method disclosed in the present application can replace a conventional input method by directly inputting text and voice through speech, effectively improving information input efficiency.
Further, the embodiment of the present application can process the image stream data according to the selected mask image and/or filter, so as to achieve different image display effects, providing the user with privacy protection or an improved display effect.
Further, the embodiment of the present application recognizes the audio stream data in real time, so that the corresponding text stream information can be obtained in near real time. By synchronously displaying the recognized text stream while the video is being recorded, feedback is formed for the producer, thereby further improving the user experience.
Further, after the recording ends, the text data converted by speech recognition is incorporated into the generated video data in a manner synchronized with the audio stream data, so that during subsequent playback the playing of the text data converted by speech recognition is synchronized with the playing of the audio stream data, and the generated video data has better readability and presentation effects.
The technical solution of the embodiment of the present application has been described above taking a video sharing application as the example scenario. It should be understood that the embodiment of the present application is not limited to this specific application scenario; the video generating method of the present application can also be applied to any other scenario in which a video needs to be recorded for playback on other terminals. For example, the video data generating method of the embodiment of the present application can be applied to instant messaging software: a short video carrying text data converted by speech recognition is recorded and sent to another user terminal or group of user terminals specified by the user, realizing video sharing with rich presentation effects. As another example, the video data generating method of the embodiment of the present application can be applied to the comment sections of social applications, e-commerce applications, and the like: based on the method of the embodiment of the present application, the user can record a video carrying text data converted by speech recognition as comment information on an information entity displayed by such an application, thereby providing a richer way of commenting and improving the user experience.
Fig. 10 is a schematic diagram of a terminal device for implementing the method of the embodiment of the present invention. The terminal device 10 includes a display device A1, a memory A2 (which may include one or more computer-readable storage media), a storage controller A3, one or more processors (CPU) A4, a peripheral interface A5, a radio frequency circuit A6, an input/output (I/O) subsystem A7, and one or more optical sensors A8 capable of capturing images. These components can communicate through one or more communication buses or signal lines A9. It should be understood that the electronic device 10 shown in Fig. 10 is an example; the electronic device 10 may have more or fewer components than shown, may combine two or more components, or may have different component configurations or arrangements.
The memory A2 can store software components, such as an operating system, a communication module, an interaction module, and application programs. Each of the modules and application programs described above corresponds to a set of executable program instructions for accomplishing one or more functions and the methods described in the embodiments of the invention.
Meanwhile as skilled in the art will be aware of, the various aspects of the embodiment of the present application may be implemented as beSystem, method or computer program product.Therefore, the various aspects of the embodiment of the present application can take following form:Complete hardwareEmbodiment, complete software embodiment (including firmware, resident software, microcode etc.) usually can all claim hereinFor the embodiment for being combined software aspects with hardware aspect of " circuit ", " module " or " system ".In addition, the side of the applicationFace can take following form:The computer program product realized in one or more computer-readable mediums, computer canReading medium has the computer readable program code realized on it.
Any combination of one or more computer-readable mediums can be utilized.Computer-readable medium can be computerReadable signal medium or computer readable storage medium.Computer readable storage medium can be such as (but not limited to) electronics,Magnetic, optical, electromagnetism, infrared or semiconductor system, device or foregoing any suitable combination.MeterThe more specific example (exhaustive to enumerate) of calculation machine readable storage medium storing program for executing will include the following:With one or more electric wireElectrical connection, portable computer diskette, random access memory (RAM), read-only memory (ROM), erasable are compiled hard diskJourney read-only memory (EPROM or flash memory), optical fiber, portable optic disk read-only storage (CD-ROM), light storage device,Magnetic memory apparatus or foregoing any suitable combination.In the context of this hair application embodiment, computer-readable storage mediumMatter can be can include or store the program used by instruction execution system, device or combined command perform system,The arbitrary tangible medium for the program that device uses.
Computer-readable signal media can include the data-signal propagated, and the data-signal of the propagation has whereinThe computer readable program code realized such as the part in a base band or as carrier wave.The signal of such propagation may be employedAny form in diversified forms includes but not limited to:Electromagnetism, optical or its any appropriate combination.It is computer-readableSignal media can be following arbitrary computer-readable medium:It is not computer readable storage medium, and can be to by instructingThe program that execution system, device use or combined command performs system, device uses is communicated, propagatedOr transmission.
Including but not limited to wireless, wired, fiber optic cables, RF etc. or foregoing can be used arbitrary appropriately combined arbitrarySuitable medium transmits the program code realized on a computer-readable medium.
Computer program code for performing for the operation of the application each side can be with one or more programming languagesAny combination of speech is write, and the programming language includes:The programming language of object-oriented such as Java, Python,Smalltalk, C++ etc.;And conventional process programming language such as " C " programming language or similar programming language.Program code canFully on the user computer, partly to be performed on the user computer as independent software package;Partly counted in userIt partly performs on calculation machine and on the remote computer;Or it fully performs on a remote computer or server.LatterIt, can be by remote computer by including any type of network connection of LAN (LAN) or wide area network (WAN) in the case of kind(such as the Yin Te by using ISP can be attached to subscriber computer or with outer computerNet).
Aspects of the present application are described above with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the present application. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general-purpose computer, a special-purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable medium that can direct a computer, other programmable data processing apparatus, or other devices to function in a particular manner, such that the instructions stored in the computer-readable medium produce an article of manufacture including instructions that implement the function/act specified in the flowchart and/or block diagram block or blocks.
The computer program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus, or other devices to produce a computer-implemented process, such that the instructions executed on the computer or other programmable apparatus provide processes for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.
The foregoing is merely the preferred embodiments of the present application and is not intended to limit the present application. For those skilled in the art, the present application may have various modifications and changes. Any modification, equivalent replacement, improvement, or the like made within the spirit and principles of the present application shall be included within the protection scope of the present application.