Summary of the Invention
Accordingly, the present invention provides an audio processing method and an audio processing device that substantially obviate one or more problems due to the limitations and disadvantages of the related art.
Additional advantages, objects, and features of the invention will be set forth in part in the description that follows, and in part will become apparent to those of ordinary skill in the art upon examination of the following or may be learned from practice of the invention. The objects and advantages of the invention may be realized and attained by the structure particularly pointed out in the written description, the claims, and the appended drawings.
The present invention provides an audio processing device, wherein the audio processing device comprises:
a first extraction unit, configured to extract, by a mobile terminal, audio data carrying content to be translated from an audio stream;
a recognition unit, configured to recognize text content corresponding to the audio data;
a second extraction unit, configured to obtain a user's preferred language as a target language;
a conversion unit, configured to convert the text content into text content in a target-language form, the text content in the target-language form being text content expressed in the target language;
a substitution unit, configured to convert the text content in the target-language form into audio data in the target-language form, so as to substitute for the audio data to be translated.
Preferably, the recognition unit recognizes the text content corresponding to the audio data by using a speech recognition technique.
Preferably, the audio processing device further comprises:
a video extraction unit, configured to extract, by the mobile terminal, subtitle-related video data from a video stream;
a video recognition unit, configured to recognize subtitle content from the subtitle-related video data.
Preferably, the audio processing device further comprises:
a video conversion unit, configured to convert the subtitle content into subtitle content in the target-language form, the subtitle content in the target-language form being subtitle content expressed in the target language;
a video substitution unit, configured to convert the subtitle content in the target-language form into video data in the target-language form, so as to substitute for the subtitle-related video data.
Preferably, the audio processing device further comprises:
a timestamp unit, configured to obtain in advance a synchronization timestamp of the audio data and the video data;
a synchronization unit, configured to control, by means of the synchronization timestamp, the audio data in the target-language form to be synchronized with the video data in the target-language form.
The present invention further provides an audio processing method, wherein the method comprises:
extracting, by a mobile terminal, audio data carrying content to be translated from an audio stream;
recognizing text content corresponding to the audio data;
obtaining a user's preferred language as a target language;
converting the text content into text content in a target-language form, the text content in the target-language form being text content expressed in the target language;
converting the text content in the target-language form into audio data in the target-language form, so as to substitute for the audio data to be translated.
Preferably, the text content corresponding to the audio data is recognized by using a speech recognition technique.
Preferably, the method further comprises:
extracting, by the mobile terminal, subtitle-related video data from a video stream;
recognizing subtitle content from the subtitle-related video data;
converting the subtitle content into subtitle content in the target-language form, the subtitle content in the target-language form being subtitle content expressed in the target language;
converting the subtitle content in the target-language form into video data in the target-language form, so as to replace the subtitle-related video data.
The present invention converts an audio stream in an unfamiliar language into an audio stream in the user's preferred-language form and presents the content to the user in the preferred language, which is more user-friendly and more versatile.
Embodiments
Embodiment 1:
Fig. 1 shows a flowchart of the audio processing method according to an embodiment of the present invention; the specific steps are detailed as follows:
Step S101: extracting, by the mobile terminal, the audio data carrying the content to be translated from the audio stream.
Playback software plays an audio stream that contains audio data recording both background music and spoken content. If required, the audio data carrying the content to be translated can be extracted from the audio stream. For example, when the user listens to music on a mobile terminal and wishes to hear the music rendered in a voice of the user's choosing, the audio stream is first extracted from the music file; after the background music is removed, the voice-related audio data, for example the vocals of a song, is extracted from the audio stream.
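The extraction step above can be sketched as follows. The document specifies no data format, so the stream representation (a sequence of labeled track segments) and the `extract_voice_data` helper are illustrative assumptions; a real implementation would need an actual demuxer and a source-separation step to strip the background music.

```python
# Hypothetical sketch of step S101: a demultiplexed audio stream is modeled
# as (track_type, payload) tuples; only the voice payloads that carry the
# content to be translated are kept, dropping background-music tracks.

def extract_voice_data(audio_stream):
    """Return the payloads of all 'voice' tracks in stream order."""
    return [payload for track_type, payload in audio_stream
            if track_type == "voice"]

stream = [("music", b"\x01\x02"), ("voice", b"hello"), ("music", b"\x03")]
print(extract_voice_data(stream))  # [b'hello']
```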
As another embodiment of the present invention, before the step of extracting, by the mobile terminal, the audio data carrying the content to be translated from the audio stream, the method further comprises:
obtaining the user's preferred language as the translation language.
The preferred language may be any dialect or any national mother tongue from anywhere in the world.
First, upon receiving a user instruction to set the translation language, the mobile terminal pops up a language selection dialog box whose language column lists all language categories available locally and/or on a server. The user may choose at least one preferred language; each selected preferred language is set as a translation language, and a preference order is set according to the user's choices, for example: Chinese is set as the first translation language, Sichuanese as the second translation language, and English as the third translation language. After the translation-language setting is confirmed, when the text content corresponding to the audio data is to be translated into text content in the first translation language, if no literal pool corresponding to the first translation language is found either locally or on the server, the literal pool corresponding to the second translation language is searched for according to the preference order; if the search succeeds, the text content corresponding to the audio data is translated into text content in the second translation language according to that literal pool, where a literal pool contains the mapping relations between words to be translated and their translations. The search continues in this manner down the preference order; if no corresponding literal pool is found for any of the translation languages, the original audio stream is retained and played.
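The preference-ordered fallback described above can be sketched as a simple lookup loop. The dictionary-of-dictionaries shape of the literal pools and the `pick_translation_pool` helper are assumptions for illustration only; the text does not define a literal-pool storage format.

```python
# Sketch of the fallback: try each translation language in preference order
# and use the first one whose literal pool (word-mapping table) is available
# locally or on the server; a (None, None) result means the original audio
# stream should be retained and played unchanged.

def pick_translation_pool(preferred_languages, available_pools):
    for lang in preferred_languages:          # e.g. Chinese, Sichuanese, English
        pool = available_pools.get(lang)
        if pool is not None:
            return lang, pool
    return None, None                          # no pool found for any language

pools = {"English": {"bonjour": "hello"}}      # only the third choice exists
lang, pool = pick_translation_pool(["Chinese", "Sichuanese", "English"], pools)
print(lang)  # English
```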
Preferably, while video and/or audio is being played, the user may change the translation language according to his or her own preference. Specifically, upon receiving a change instruction, the language selection dialog box is invoked to effect the change of the translation language.
Preferably, the voice recorded by the user can be captured through the microphone carried by the mobile terminal, and the language category of the recorded voice is identified according to a language library; the identified language is then used as the translation language. Of course, different languages may also be recorded multiple times, and a preference order is then set for all of the obtained translation languages.
Step S102: recognizing, by using a speech recognition technique, the text content corresponding to the audio data.
The binary audio data is fed into a speech recognition device, which uses a speech recognition technique to recognize the text content corresponding to the audio data.
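The recognition step can be sketched as below. The text names no concrete speech recognition engine, so `recognize` is a stand-in that fakes the binary-audio-to-text decoding with a lookup table; in practice this call would go to an actual speech recognition device.

```python
# Illustrative sketch of step S102: binary voice data goes in, the
# corresponding text content comes out. The table-based recognizer is a
# stub, not a real ASR engine.

def recognize(voice_data: bytes) -> str:
    """Pretend to decode PCM bytes into text; empty string if unrecognized."""
    fake_transcripts = {b"\x00\x01": "hello world"}   # stub lookup only
    return fake_transcripts.get(voice_data, "")

print(recognize(b"\x00\x01"))  # hello world
```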
Step S103: translating the text content into text content in the translation-language form, the text content in the translation-language form being text content expressed in the translation language.
Existing language translation software is used to translate the text content into text content in the translation-language form.
Step S104: converting the text content in the translation-language form into audio data in the translation-language form, so as to replace the audio data to be translated.
The audio data in the translation-language form is audio data recorded and formed in the translation language.
According to the timestamp, recorded in the audio stream, that corresponds to the audio data carrying the content to be translated, the text content in the translation-language form is re-recorded as audio data in the translation language; the audio data in the translation-language form then replaces the audio data carrying the content to be translated. Specifically, with the synchronization timestamp of the audio data carrying the content to be translated kept unchanged, the audio data in the translation-language form replaces the audio data carrying the content to be translated, so that the audio stream remains synchronized during playback and the transformation of the audio speech is achieved.
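The timestamp-preserving replacement can be sketched as follows. Segments are modeled as (timestamp, payload) pairs, and `synthesize` is a stand-in for a text-to-speech recording step; both are assumptions, as the document does not specify a container format or a TTS engine. The key point the sketch shows is that each segment keeps its original timestamp, so the stream stays synchronized.

```python
# Sketch of step S104: re-record the translated text as audio against the
# original timestamps, then substitute it into the stream.

def synthesize(text: str) -> bytes:
    return text.encode("utf-8")                # stub for a TTS recording step

def replace_audio(segments, translations):
    """Replace each segment's payload, keeping its timestamp unchanged."""
    return [(ts, synthesize(translations.get(ts, ""))) for ts, _ in segments]

original = [(0.0, b"\x10"), (2.5, b"\x20")]    # audio carrying content to be translated
print(replace_audio(original, {0.0: "hello", 2.5: "world"}))
# [(0.0, b'hello'), (2.5, b'world')]
```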
As another embodiment of the present invention, the method further comprises:
extracting, by the mobile terminal, subtitle-related video data from a video stream;
recognizing subtitle content from the subtitle-related video data;
translating the subtitle content into subtitle content in the translation-language form, the subtitle content in the translation-language form being subtitle content expressed in the translation language;
converting the subtitle content in the translation-language form into video data in the translation-language form, so as to replace the subtitle-related video data.
The mobile terminal plays a video file through video software, the video file comprising a video stream and/or an audio stream. After the video stream is obtained, the subtitle-related video data is extracted from the video stream; specifically, the subtitle-related video data is the video data that carries the text content contained in the subtitles, and the timestamps of those subtitles are extracted at the same time. After the subtitle content has been recognized, it is translated into subtitle content in the translation-language form, which is then converted into video data in the translation-language form. Then, according to the subtitle timestamps, the video data in the translation-language form is controlled to replace the subtitle-related video data. When the translated video file is replayed, the subtitles display the subtitle content in the translation-language form.
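The subtitle path above can be sketched in the same shape as the audio path. The (timestamp, caption_text) track model and the `translate_caption` helper backed by a literal pool are illustrative assumptions; the lookup falls back to the original caption when no mapping exists, mirroring the retain-the-original behavior described earlier.

```python
# Sketch of the subtitle embodiment: each cue keeps its timestamp and only
# the caption text is translated via a (stub) literal pool.

def translate_caption(caption: str, literal_pool: dict) -> str:
    return literal_pool.get(caption, caption)   # keep original if no mapping

def replace_captions(subtitle_track, literal_pool):
    return [(ts, translate_caption(text, literal_pool))
            for ts, text in subtitle_track]

pool = {"bonjour": "hello"}
print(replace_captions([(1.0, "bonjour"), (3.0, "???")], pool))
# [(1.0, 'hello'), (3.0, '???')]
```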
As another embodiment of the present invention, the method further comprises:
obtaining in advance the synchronization timestamp of the audio data and the video data;
controlling, by means of the synchronization timestamp, the audio data in the translation-language form to be synchronized with the video data in the translation-language form.
When a video is watched, in order to translate and display better and to keep the video stream and the audio stream synchronized, the synchronization timestamps of the audio data and the video data are obtained in advance. These comprise: the timestamp of the audio data, the timestamps of the subtitles, and the synchronization timestamp between the audio data in the translation-language form and the video data in the translation-language form. Through the above three timestamps, the following synchronization controls are achieved simultaneously:
through the timestamp of the audio data, the audio data in the translation-language form is controlled to replace the audio data carrying the content to be translated;
through the timestamps of the subtitles, the video data in the translation-language form is controlled to replace the original subtitle-related video data;
through the synchronization timestamp between the audio data in the translation-language form and the video data in the translation-language form, the audio data in the translation-language form is controlled to be synchronized with the video data in the translation-language form.
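The synchronization control above can be sketched as merging the replaced audio and subtitle tracks into one time-ordered event sequence. The (timestamp, kind, payload) event shape is an assumption for illustration; real players would feed separate decoder pipelines clocked against a shared timeline.

```python
# Sketch of the synchronization control: events from both translated tracks
# are interleaved by timestamp so playback emits matching audio and caption
# segments together.

def synchronized_playlist(audio_segments, video_segments):
    """Merge (timestamp, payload) tracks into time-ordered tagged events."""
    events = [(ts, "audio", p) for ts, p in audio_segments]
    events += [(ts, "video", p) for ts, p in video_segments]
    return sorted(events)          # tuples sort by timestamp first

a = [(0.0, "hello (speech)"), (2.0, "world (speech)")]
v = [(0.0, "hello (caption)"), (2.0, "world (caption)")]
print(synchronized_playlist(a, v)[0])  # (0.0, 'audio', 'hello (speech)')
```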
This embodiment provides a mobile-terminal-based audio processing method. When a user listens through a mobile terminal, the user's preferred language is obtained in advance as the translation language. When translation is needed, the audio data carrying the content to be translated and its timestamp are extracted from the audio stream; a speech recognition technique is used to recognize the text content corresponding to the audio data, which is translated into text content in the translation-language form; the text content in the translation-language form is converted into audio data in the translation-language form, so as to replace the audio data to be translated. Further, if the playback medium is a video, while the speech content is being translated, the subtitle-related video data and the synchronization timestamp are extracted from the video stream; the audio data in the translation-language form replaces the audio data to be translated, and the video data in the translation-language form replaces the subtitle-related video data. Further still, by means of the synchronization timestamp, the audio data in the translation-language form is controlled to be synchronized with the video data in the translation-language form. Audio and/or video in an unfamiliar language is thereby converted into the preferred-language form and presented to the user, which is more user-friendly and more versatile.
Embodiment 2:
Fig. 2 shows the composition of the mobile-terminal-based audio processing device provided by an embodiment of the present invention; for convenience of description, only the parts relevant to the embodiment of the present invention are shown.
The mobile-terminal-based audio processing device may be a software unit, a hardware unit, or a unit combining software and hardware that runs in a mobile terminal device, or it may be integrated into the terminal device as an independent component or run in an application system of the terminal device.
The mobile-terminal-based audio processing device may comprise an extraction unit 21, a recognition unit 22, a translation unit 23, and a replacement unit 24. The specific function of each functional unit is described below:
The extraction unit 21 is configured to extract, by the mobile terminal, the audio data carrying the content to be translated from the audio stream.
Playback software plays an audio stream that contains audio data recording both background music and spoken content. If required, the extraction unit 21 can extract the audio data carrying the content to be translated from the audio stream. For example, when the user listens to music on a mobile terminal and wishes to hear the music rendered in a voice of the user's choosing, the audio stream is first extracted from the music file; after the background music is removed, the extraction unit 21 extracts the voice-related audio data, for example the vocals of a song, from the audio stream.
As another embodiment of the present invention, the device further comprises:
an acquiring unit 25, configured to obtain the user's preferred language as the translation language.
The preferred language may be any dialect or any national mother tongue from anywhere in the world.
First, upon receiving a user instruction to set the translation language, the acquiring unit 25 pops up a language selection dialog box whose language column lists all language categories available locally and/or on a server. The user may choose at least one preferred language; the acquiring unit 25 sets each selected preferred language as a translation language and sets a preference order according to the user's choices, for example: the acquiring unit 25 sets Chinese as the first translation language, Sichuanese as the second translation language, and English as the third translation language. After the translation-language setting is confirmed, when the text content corresponding to the audio data is to be translated into text content in the first translation language, if no literal pool corresponding to the first translation language is found either locally or on the server, the literal pool corresponding to the second translation language is searched for according to the preference order; if the search succeeds, the text content corresponding to the audio data is translated into text content in the second translation language according to that literal pool, where a literal pool contains the mapping relations between words to be translated and their translations. The search continues in this manner down the preference order; if no corresponding literal pool is found for any of the translation languages, the original audio stream is retained and played.
Preferably, while video and/or audio is being played, the user may change the translation language according to his or her own preference. Specifically, upon receiving a change instruction, the acquiring unit 25 invokes the language selection dialog box to effect the change of the translation language.
Preferably, the voice recorded by the user can be captured through the microphone carried by the mobile terminal, and the language category of the recorded voice is identified according to a language library; the identified language is then used as the translation language. Of course, different languages may also be recorded multiple times, and a preference order is then set for all of the obtained translation languages.
The recognition unit 22 is configured to recognize, by using a speech recognition technique, the text content corresponding to the audio data.
The recognition unit 22 feeds the binary audio data into a speech recognition device, which uses a speech recognition technique to recognize the text content corresponding to the audio data.
The translation unit (that is, the conversion unit) 23 is configured to translate the text content into text content in the translation-language form, the text content in the translation-language form being text content expressed in the translation language.
The translation unit 23 uses existing language translation software to translate the text content into text content in the translation-language form.
The replacement unit 24 is configured to convert the text content in the translation-language form into audio data in the translation-language form, so as to replace the audio data to be translated.
The audio data in the translation-language form is audio data recorded and formed in the translation language.
According to the timestamp, recorded in the audio stream, that corresponds to the audio data carrying the content to be translated, the replacement unit 24 re-records the text content in the translation-language form as audio data in the translation language; the replacement unit 24 then replaces the audio data carrying the content to be translated with the audio data in the translation-language form. Specifically, with the synchronization timestamp of the audio data carrying the content to be translated kept unchanged, the replacement unit 24 replaces the audio data carrying the content to be translated with the audio data in the translation-language form, so that the audio stream remains synchronized during playback and the transformation of the audio speech is achieved.
As another embodiment of the present invention, the device further comprises:
a video extraction unit 26, configured to extract, by the mobile terminal, subtitle-related video data from a video stream;
a video recognition unit 27, configured to recognize subtitle content from the subtitle-related video data;
a video translation unit 28, configured to translate the subtitle content into subtitle content in the translation-language form, the subtitle content in the translation-language form being subtitle content expressed in the translation language;
a video replacement unit 29, configured to convert the subtitle content in the translation-language form into video data in the translation-language form, so as to replace the subtitle-related video data.
The mobile terminal plays a video file through video software, the video file comprising a video stream and/or an audio stream. After the video stream is obtained, the video extraction unit 26 extracts the subtitle-related video data from the video stream; specifically, the subtitle-related video data is the video data that carries the text content contained in the subtitles, and the timestamps of those subtitles are extracted at the same time. After the video recognition unit 27 recognizes the subtitle content, the video translation unit 28 translates it into subtitle content in the translation-language form, and the video replacement unit 29 converts the subtitle content in the translation-language form into video data in the translation-language form. Then, according to the subtitle timestamps, the video replacement unit 29 controls the video data in the translation-language form to replace the subtitle-related video data. When the translated video file is replayed, the subtitles display the subtitle content in the translation-language form.
As another embodiment of the present invention, the device further comprises:
a timestamp unit 30, configured to obtain in advance the synchronization timestamp of the audio data and the video data;
a synchronization unit 31, configured to control, by means of the synchronization timestamp, the audio data in the translation-language form to be synchronized with the video data in the translation-language form.
When a video is watched, in order to translate and display better and to keep the video stream and the audio stream synchronized, the timestamp unit 30 obtains the synchronization timestamps of the audio data and the video data in advance. These comprise: the timestamp of the audio data, the timestamps of the subtitles, and the synchronization timestamp between the audio data in the translation-language form and the video data in the translation-language form. Through the above three timestamps, the following synchronization controls are achieved simultaneously:
through the timestamp of the audio data, the replacement unit 24 controls the audio data in the translation-language form to replace the audio data carrying the content to be translated;
through the timestamps of the subtitles, the video replacement unit 29 controls the video data in the translation-language form to replace the original subtitle-related video data;
through the synchronization timestamp between the audio data in the translation-language form and the video data in the translation-language form, the synchronization unit 31 controls the audio data in the translation-language form to be synchronized with the video data in the translation-language form.
In this way, the playback timing of the voice or the video is kept correct before and after the language translation.
This embodiment provides a mobile-terminal-based audio processing device. When a user listens through a mobile terminal, the acquiring unit obtains the user's preferred language in advance as the translation language. When translation is needed, the extraction unit extracts the audio data carrying the content to be translated and its timestamp from the audio stream; the recognition unit uses a speech recognition technique to recognize the text content corresponding to the audio data, which is translated into text content in the translation-language form; the translation unit converts the text content in the translation-language form into audio data in the translation-language form, and the replacement unit replaces the audio data to be translated with it. Further, if the playback medium is a video, while the speech content is being translated, the timestamp unit extracts the subtitle-related video data and the synchronization timestamp from the video stream; the audio data in the translation-language form replaces the audio data to be translated, and the video data in the translation-language form replaces the subtitle-related video data. Further still, by means of the synchronization timestamp, the synchronization unit controls the audio data in the translation-language form to be synchronized with the video data in the translation-language form. Audio and/or video in an unfamiliar language is thereby converted into the preferred-language form and presented to the user, which is more user-friendly and more versatile.
As an embodiment of the present invention, the invention further provides a mobile terminal comprising the above-described mobile-terminal-based audio processing device.
The mobile terminal may be, but is not limited to, a smartphone, an iPad, and the like.
The embodiments of the present invention provide a mobile-terminal-based audio processing method and device. When a user listens through a mobile terminal, the user's preferred language is obtained in advance as the translation language. When translation is needed, the audio data carrying the content to be translated and its timestamp are extracted from the audio stream; a speech recognition technique is used to recognize the text content corresponding to the audio data, which is translated into text content in the translation-language form; the text content in the translation-language form is converted into audio data in the translation-language form, so as to replace the audio data to be translated. Further, if the playback medium is a video, while the speech content is being translated, the subtitle-related video data and the synchronization timestamp are extracted from the video stream; the audio data in the translation-language form replaces the audio data to be translated, and the video data in the translation-language form replaces the subtitle-related video data. Further still, by means of the synchronization timestamp, the audio data in the translation-language form is controlled to be synchronized with the video data in the translation-language form. Audio and/or video in an unfamiliar language is thereby converted into the preferred-language form and presented to the user, which is more user-friendly and more versatile.
It will be appreciated by those skilled in the art that the units included in Embodiment 2 above are divided merely according to functional logic, and the division is not limited thereto as long as the corresponding functions can be realized; in addition, the specific names of the functional units are merely for ease of mutual distinction and do not limit the protection scope of the present invention.
Those of ordinary skill in the art will appreciate that all or part of the steps of the methods in the above embodiments may be completed by a program instructing the relevant hardware, and the program may be stored in a computer-readable storage medium such as a ROM/RAM, a magnetic disk, or an optical disc.
The above content is merely a preferred embodiment of the present invention. For those of ordinary skill in the art, changes may be made in specific implementations and application scopes according to the idea of the present invention, and this description shall not be construed as limiting the present invention.