CN101516005A

Info

Publication number: CN101516005A
Application number: CNA2008100654170A
Authority: CN
Inventors: 吴治国; 张勤伟
Original assignee: Huawei Technologies Co Ltd
Current assignee: Huawei Technologies Co Ltd
Priority date: 2008-02-23
Filing date: 2008-02-23
Publication date: 2009-08-26
Also published as: WO2009103226A1

Abstract

The present invention provides a speech recognition channel selection system, a method and a channel conversion device. The method comprises: a controller receives the user's voice input signal; the channel conversion device recognizes a name to be matched according to the input voice signal and a recognition vocabulary; the name to be matched is matched against a matching table to obtain the channel to switch to; and the device switches to that channel. The invention avoids the complexity and high cost of performing speech recognition on the controller, makes operation convenient for the user, and makes full use of the processing capability of the channel conversion device, which reduces the cost of the controller. Because the name to be matched is recognized by the channel conversion device, no dedicated speech recognition server needs to be set up in the network; this prevents long response times, avoids data loss during network transmission, and saves the cost of building the network.

Description

Speech recognition channel selection system, method and channel conversion device

Technical field

The present invention relates to the field of communication technology, and in particular to a system, device and method for selecting channels by speech recognition.

Background art

With the development of information technology and broadcast television technology, services such as cable digital TV and IPTV have grown rapidly in recent years. As set-top boxes (Set-top Box, STB), such as IP set-top boxes and digital set-top boxes, become increasingly market-oriented, full-featured set-top boxes have gradually replaced traditional VCD and DVD players. Meanwhile, advances in automatic speech recognition have made it possible for a set-top box to select channels by voice, and this technology has become a focus of industry research and development.

Traditional speech recognition channel selection takes one of two forms. In the first, a speech recognition processor is added to the remote control; during recognition, the speech data the user inputs is matched against the speech data of voice templates that the user has imported or downloaded, and the channel is switched accordingly. In the second, a dedicated speech recognition server is set up in the network.

In the course of realizing the present invention, the inventors found that the traditional approaches have at least the following shortcomings. When a speech recognition processor is added to the remote control, every update of the voice templates requires the user to download them manually to the remote control, which is complicated and inconvenient, and the added processor also raises the cost of the remote control. When a dedicated speech recognition server is set up in the network, the voice signal must be uploaded to the network for recognition, so the response time is longer; the uplink and downlink transmissions each add a chance of packet loss; and the dedicated server increases the cost of building the network.

Summary of the invention

In view of this, it is necessary to provide a speech recognition channel selection method that is easy to operate and saves cost.

A speech recognition channel selection system that is easy to operate and saves cost is also provided.

A channel conversion device that is easy to operate and saves cost is also provided.

A speech recognition channel selection method comprises the following steps:

A controller receives the user's voice input signal;

A channel conversion device recognizes a name to be matched according to the input voice signal and a recognition vocabulary;

The name to be matched is matched against a matching table to obtain the channel to switch to;

The channel conversion device switches to that channel.

A speech recognition channel selection system comprises a controller and a channel conversion device that communicate with each other;

The controller is configured to receive the user's voice input signal;

The channel conversion device is configured to recognize a name to be matched according to the input voice signal and a recognition vocabulary, to match the name to be matched against a matching table to obtain the channel to switch to, and to switch to that channel.

A channel conversion device comprises:

A receiving module, configured to receive the user's voice input signal sent by a controller;

A recognition processing module, configured to recognize a name to be matched according to the input voice signal and a recognition vocabulary;

A match query module, configured to match the name to be matched against a matching table to obtain the channel to switch to;

A channel switching control module, configured to switch to that channel.

Compared with the prior art, in the embodiments of the invention the controller receives the user's voice input signal, and the channel conversion device recognizes the name to be matched from that signal, matches it against the matching table to obtain the channel to switch to, and switches to that channel. This avoids the complexity and high cost of performing speech recognition on the controller, makes operation convenient for the user, and makes full use of the processing capability of the channel conversion device, which reduces the cost of the controller. Because the name to be matched is recognized by the channel conversion device, no dedicated speech recognition server needs to be set up in the network; this prevents long response times, avoids data loss during network transmission, and saves the cost of building the network.

Description of the drawings

Fig. 1 is a schematic structural diagram of the speech recognition channel selection system according to an embodiment of the invention.

Fig. 2 is a schematic structural diagram of the controller according to an embodiment of the invention.

Fig. 3 is a schematic structural diagram of the channel conversion device according to an embodiment of the invention.

Fig. 4 is a flow chart of the speech recognition channel selection method according to an embodiment of the invention.

Fig. 5 is a flow chart of the method for updating the channel and program table according to an embodiment of the invention.

Fig. 6 is a flow chart of the method for updating the recognition vocabulary and the matching table according to an embodiment of the invention.

Detailed description of the embodiments

Referring to Fig. 1, the speech recognition channel selection system 100 of an embodiment of the invention comprises a controller 102, a channel conversion device 104 and an electronic program guide (Electronic Program Guide, EPG) server 106. The controller 102 receives the user's voice input signal. The channel conversion device 104 recognizes a name to be matched according to the input voice signal and a recognition vocabulary, matches the name against a matching table to obtain the channel to switch to, and switches to that channel. The EPG server 106 provides the latest matching table and/or the latest recognition vocabulary; the channel conversion device 104 can update its matching table according to the latest matching table, and/or update its recognition vocabulary according to the latest recognition vocabulary. The controller 102 can be a controller external to the system, an HS (Handset, mobile phone) or a remote control; this embodiment takes a remote control as the example. The channel conversion device 104 can be a PC (Personal Computer), an STB (Set-top Box), an NB (Notebook Computer), an HS (Handset, mobile phone), a GP (Game Player, game console), an ODD (Optical Disc Drive) or the like; this embodiment takes an STB as the example.

Referring also to Fig. 2, in this embodiment the controller 102 comprises a voice signal receiving module 202, a voice signal processing module 204, an input module 210, a controller receiving module 212 and a sending module 216.

The voice signal receiving module 202 receives the user's voice input signal. In this embodiment it can be a microphone on the remote control.

The voice signal processing module 204 processes the user's voice input signal. It comprises a speech conversion unit 206 and a speech coding unit 208. The speech conversion unit 206 converts the voice signal into a digital signal; in this embodiment it can be an A/D conversion circuit. The speech coding unit 208 encodes the digital signal produced by the speech conversion unit 206; the encoding can be compression coding, either lossy or lossless. Different schemes can be used to collect and process the user's speech. In this embodiment the speech is sampled at 16 kHz and quantized with 16-bit or 8-bit precision, and the sampled and processed voice signal is encoded in PCM (Pulse Code Modulation) format.
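The sampling parameters above fix the raw data rate of the uncompressed PCM stream. The following minimal Python sketch works through that arithmetic and shows one way to quantize samples to 16-bit PCM. A single (mono) channel and little-endian byte order are assumptions; the function names are illustrative only and do not come from the patent.

```python
import struct

def pcm_bitrate(sample_rate_hz, bits_per_sample, channels=1):
    """Raw PCM data rate in bits per second (mono is an assumption)."""
    return sample_rate_hz * bits_per_sample * channels

def quantize_16bit(samples):
    """Map floats in [-1.0, 1.0] to little-endian signed 16-bit PCM bytes."""
    ints = [max(-32768, min(32767, round(s * 32767))) for s in samples]
    return struct.pack("<%dh" % len(ints), *ints)

print(pcm_bitrate(16000, 16))  # 256000 bits per second
print(pcm_bitrate(16000, 8))   # 128000 bits per second
```

At 16 kHz, 16-bit mono speech therefore needs 256 kbit/s before any compression coding, and 8-bit quantization halves that figure.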

The input module 210 receives instructions entered by the user, for example a voice activation instruction, which controls the channel conversion device to activate speech recognition. In this embodiment the input module 210 can be a keyboard or a touch screen.

The controller receiving module 212 receives signals sent by the channel conversion device 104, including returned command signals and notification messages.

The sending module 216 sends the encoded voice signal and the operation signals entered by the user. In this embodiment the sending module 216 can be a wireless communication device such as an infrared or Bluetooth device, using for example Bluetooth 2.0, ZigBee, a high-speed infrared protocol, or another high-speed wireless technology that can carry PCM (Pulse Code Modulation) speech data in real time. The sending module 216 comprises an operation signal sending unit 218 and a voice signal sending unit 214. The operation signal sending unit 218 sends the operation signals entered by the user, for example keyboard and touch-screen input. The voice signal sending unit 214 sends the voice signal entered by the user; this signal can be the digital signal produced by A/D conversion, or that signal after compression coding.

Referring also to Fig. 3, in this embodiment the channel conversion device 104 (an STB) comprises a receiving module 302, a mute control module 308, a speech selection module 310, a recognition processing module 312, a sending module 322, a rejection prompt module 324, a storage module 326, a match query module 336, a channel switching control module 338 and an update module 340.

The receiving module 302 receives the user's voice input signal and the user's operation control commands sent by the controller. In this embodiment the user input comprises both; if all input is by voice, operation control commands may be absent. The user's voice input signal is a digital speech signal produced by analog-to-digital (A/D) conversion. The receiving module 302 comprises an operation signal receiving unit 304 and a voice signal receiving unit 306. The operation signal receiving unit 304 receives the user's operation control commands, for example the voice activation command. The voice signal receiving unit 306 receives the user's voice input signal.

The mute control module 308 puts the channel conversion device into the mute state according to the voice activation instruction entered by the user, and switches it back to the non-mute state after voice collection.

The speech selection module 310 selects, according to a speech selection signal entered by the user, the acoustic model that corresponds to that signal.

The recognition processing module 312 recognizes the name to be matched according to the input voice signal and the recognition vocabulary. It comprises a voice activity detection unit 314, a speech feature extraction unit 316, a speech recognition unit 318 and a speech judging unit 320.

The voice activity detection unit 314 detects the start point and end point of the actual speech segment. In this embodiment it uses a robust endpoint detection algorithm to find those points, separating the actual speech segment from the non-speech segments of the input voice signal.
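The patent does not name the endpoint detection algorithm. As a rough illustration of the idea only, the sketch below marks a frame as speech when its short-time energy exceeds a fixed threshold and returns the span from the first to the last such frame. The energy criterion, the 160-sample frame (10 ms at 16 kHz) and the threshold value are all assumptions, and a robust detector would be considerably more elaborate.

```python
def detect_endpoints(samples, frame_len=160, energy_threshold=0.01):
    """Return (start, end) sample indices of the speech segment, or None.

    A frame counts as speech when its mean squared amplitude exceeds
    energy_threshold. Both parameters are illustrative assumptions.
    """
    speech_frames = []
    for i in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[i:i + frame_len]
        energy = sum(x * x for x in frame) / frame_len
        if energy > energy_threshold:
            speech_frames.append(i)
    if not speech_frames:
        return None
    # end index is exclusive: one past the last speech frame
    return speech_frames[0], speech_frames[-1] + frame_len
```

Only the samples between the returned indices would then be passed on for feature extraction.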

The speech feature extraction unit 316 extracts speech features from the voice signal. In this embodiment it processes the voice signal passed on by the voice activity detection unit 314 and extracts speech feature data. The feature type can be MFCC (Mel-Frequency Cepstral Coefficients), PLP (Perceptual Linear Prediction) or LPCC (Linear Prediction Cepstral Coefficients). To improve noise robustness, cepstral mean subtraction can be applied during feature extraction. Because MFCC features exploit the perceptual characteristics of the human ear and are relatively robust to noise, MFCC is the preferred feature type. Speech is a short-time stationary signal, so adjacent frames are correlated; appending the first-order differences, or the first- and second-order differences, of the MFCC features can therefore improve recognition accuracy.
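Cepstral mean subtraction and difference features can be sketched in a few lines once the per-frame cepstral vectors exist. The sketch below assumes the MFCC vectors have already been computed elsewhere and are given as lists of floats; the simple central difference with repeated edge frames is also an assumption, since practical front ends often use a regression over a wider window.

```python
def cepstral_mean_subtract(frames):
    """Subtract the per-coefficient mean over the utterance from every frame."""
    n = len(frames)
    dim = len(frames[0])
    mean = [sum(f[d] for f in frames) / n for d in range(dim)]
    return [[f[d] - mean[d] for d in range(dim)] for f in frames]

def delta(frames):
    """First-order difference: (next - previous) / 2, edge frames repeated."""
    n = len(frames)
    out = []
    for t in range(n):
        prev = frames[max(t - 1, 0)]
        nxt = frames[min(t + 1, n - 1)]
        out.append([(b - a) / 2.0 for a, b in zip(prev, nxt)])
    return out

def add_deltas(frames):
    """Append first- and second-order differences to each static frame."""
    d1 = delta(frames)
    d2 = delta(d1)
    return [s + a + b for s, a, b in zip(frames, d1, d2)]
```

Applying `add_deltas` to 13-dimensional static features would yield 39-dimensional vectors, one per frame.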

The speech recognition unit 318 calculates, according to the acoustic model and the recognition vocabulary, the acoustic distance of the input speech feature data relative to each entry. In this embodiment it obtains the shortest accumulated acoustic distance for each isolated word from the acoustic model data and the isolated-word vocabulary data, and then takes the isolated word with the smallest such distance as the first-choice recognition result for the utterance. The acoustic models used for speech recognition include continuous HMM (Hidden Markov Model) models and discrete HMM models. The speech recognition unit 318 can also offer several candidate results for the user to choose from, ranked by shortest accumulated acoustic distance.
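Choosing the first-choice result and the candidate list amounts to sorting the vocabulary entries by accumulated acoustic distance. The sketch below assumes a decoder has already produced one distance per isolated word; the function name, the `n_best` parameter and the sample words are hypothetical.

```python
def rank_candidates(distances, n_best=3):
    """Sort vocabulary entries by accumulated acoustic distance, smallest first.

    distances: dict mapping each isolated word to its shortest accumulated
    acoustic distance (assumed to come from the HMM decoder).
    """
    ranked = sorted(distances.items(), key=lambda kv: kv[1])
    return ranked[:n_best]

# Hypothetical distances for three vocabulary entries.
scores = {"Channel A": 12.5, "Channel B": 7.2, "News": 30.1}
best_word, best_distance = rank_candidates(scores, n_best=1)[0]
print(best_word)  # Channel B
```

The first element is the first-choice result, and the remaining elements are the alternatives that could be shown to the user.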

The speech judging unit 320 judges whether the acoustic distance of the speech feature data relative to an entry is less than a threshold. If it is, the unit determines the channel name corresponding to the current utterance according to the recognition vocabulary and the matching table.

The sending module 322 sends a recognition-complete signal to the controller 102; once recognition has finished, the controller 102 can stop collecting the user's voice input signal. In this embodiment the sending module 322 can also transmit signals wirelessly, for example by Bluetooth or infrared.

The rejection prompt module 324 prompts the user to input speech again when the recognition result is non-speech. The prompt can be a message, an on-screen display or a sound; in this embodiment the user is prompted by text displayed on the screen.

The storage module 326 stores data such as the channel and program table, the recognition vocabulary, the acoustic model and the matching table. In this embodiment it comprises a channel and program table storage unit 328, a recognition vocabulary storage unit 330, an acoustic model storage unit 332 and a matching table storage unit 334.

The channel and program table storage unit 328 stores the table that maps channels to programs. In this embodiment each entry of the table holds the name of a live TV channel and the name of the program that channel is playing at the current time. The table can be updated from the EPG server 106; the update period can be set to one day or one week, and the specific interval can follow the EPG server update interval of the IPTV or cable digital TV system.

The recognition vocabulary storage unit 330 stores the recognition vocabulary. In this embodiment the recognition vocabulary also includes an isolated-word vocabulary used for isolated-word speech recognition.

The acoustic model storage unit 332 stores the acoustic model to be matched against. In this embodiment the stored data are the parameters of an HMM-based acoustic model built by mixed modelling of two languages. The parameters of this bilingual mixed acoustic model are speaker-independent, that is, the model targets non-specific speakers. The model parameters must be trained in advance by a trainer on annotated corpus data; only the trained parameters are fixed into the acoustic model parameter store for use in isolated-word recognition. The acoustic model parameters comprise the state parameters of the hidden Markov model and the probability distribution functions with which each state outputs observed feature vectors.
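One common way to obtain an accumulated distance from such parameters is to run the Viterbi algorithm in the negative-log domain, so that the score of the best state path is a sum of per-frame costs. The sketch below does this for a discrete HMM. Equating the patent's "accumulated acoustic distance" with this negative log-probability is an assumption, as are the parameter layout and the toy model values.

```python
import math

def viterbi_distance(obs, start_p, trans_p, emit_p):
    """Smallest accumulated -log probability of obs over all state paths.

    obs:     list of discrete observation symbols (ints)
    start_p: start_p[s] is the probability of starting in state s
    trans_p: trans_p[r][s] is the probability of moving from r to s
    emit_p:  emit_p[s][o] is the probability that state s emits symbol o
    """
    def nlog(p):
        return -math.log(p) if p > 0 else math.inf

    n_states = len(start_p)
    cost = [nlog(start_p[s]) + nlog(emit_p[s][obs[0]]) for s in range(n_states)]
    for o in obs[1:]:
        cost = [
            min(cost[r] + nlog(trans_p[r][s]) for r in range(n_states))
            + nlog(emit_p[s][o])
            for s in range(n_states)
        ]
    return min(cost)

# Toy two-state left-to-right model (illustrative values only).
start = [1.0, 0.0]
trans = [[0.5, 0.5], [0.0, 1.0]]
emit = [[0.9, 0.1], [0.1, 0.9]]
distance = viterbi_distance([0, 1], start, trans, emit)
```

Running one such model per vocabulary entry and keeping the smallest result gives a per-word distance of the kind the recognition step compares.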

The matching table storage unit 334 stores the matching table, which records the correspondence between the channel to switch to and the channel named in the user's voice input.

The match query module 336 matches the name to be matched against the matching table to obtain the channel to switch to. In this embodiment the recognized isolated word serves as the query keyword, and the module first queries the channel table contained in the channel and program table for entries that match the keyword.

The channel switching control module 338 switches to the required channel. If matching entries exist and the query returns a single entry, the module controls the set-top box to switch the live broadcast to the channel identified by that entry's channel-name attribute. If the query returns several entries, the module controls the TV screen to display the channel-name values of those entries and prompts the user to select one of the channels with the remote control; once the user has made the selection, the module controls the TV to switch to the selected channel.
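The query itself can be sketched as a lookup over (channel name, program name) pairs. In the sketch below the table layout, the exact-match comparison and the fallback from channel names to program names are assumptions, and the channel and program names are made-up sample data.

```python
def query_channels(channel_program_table, keyword):
    """Return the channel names whose entry matches the recognized keyword.

    channel_program_table: list of (channel_name, program_name) pairs.
    """
    hits = [ch for ch, _ in channel_program_table if ch == keyword]
    if not hits:
        # Assumption: fall back to the program currently on air.
        hits = [ch for ch, prog in channel_program_table if prog == keyword]
    return hits

# Hypothetical table contents.
table = [("Channel A", "News"), ("Channel B", "Football"), ("Channel C", "News")]
print(query_channels(table, "Channel B"))  # ['Channel B']
print(query_channels(table, "News"))       # ['Channel A', 'Channel C']
```

A single hit would lead to an immediate switch, while several hits would be listed on screen for the user to choose from.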

The update module 340 updates the matching table and/or the recognition vocabulary according to the EPG server. It comprises an update timing unit 342 and an update control unit 344. The update timing unit 342 records the time of each update and triggers an update when the update time arrives or has passed. In this embodiment the channel and program table can be set to update once a day, and the recognition vocabulary and matching table can be set to update once a minute. The update control unit 344 controls the updating of the matching table and/or the recognition vocabulary when the update time is reached.
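The timing behaviour described above can be sketched as a small interval timer, one instance per table. The class and method names are hypothetical, and time is passed in explicitly as seconds so that the logic does not depend on a real clock.

```python
class UpdateTimer:
    """Tracks when an update last ran and whether the next one is due."""

    def __init__(self, interval_s, now=0.0):
        self.interval_s = interval_s
        self.last_update = now

    def due(self, now):
        """True once the interval has elapsed (on time or overdue)."""
        return now - self.last_update >= self.interval_s

    def mark_updated(self, now):
        self.last_update = now

# Intervals from the embodiment: daily and once a minute.
program_table_timer = UpdateTimer(interval_s=24 * 60 * 60)
vocabulary_timer = UpdateTimer(interval_s=60)
```

A control loop would poll `due(now)`, run the corresponding update when it returns true, and then call `mark_updated(now)`.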

In the embodiments of the invention, the controller receives the user's voice input signal, and the channel conversion device recognizes the name to be matched from that signal, matches it against the matching table to obtain the channel to switch to, and switches to that channel. This avoids the complexity and high cost of performing speech recognition on the controller, makes operation convenient for the user, and makes full use of the processing capability of the channel conversion device, which reduces the cost of the controller. Because the name to be matched is recognized by the channel conversion device, no dedicated speech recognition server needs to be set up in the network; this prevents long response times, avoids data loss during network transmission, and saves the cost of building the network. Extracting only the actual speech segment improves recognition accuracy. The mute control module mutes the set-top box during voice input, which prevents the sound of the TV program from interfering with the user's speech. The update module automatically updates the channel and program table, the recognition vocabulary and the matching table from the EPG server, sparing the user the inconvenience of manual updates.

Referring also to Fig. 4, the speech recognition channel selection method of an embodiment of the invention comprises the following steps:

Step 402: the controller receives the voice activation instruction entered by the user. In this embodiment the voice activation instruction can be a key signal; the user can enter the command through an input device such as a keyboard or touch screen.

Step 404: the controller sends a start-speech-recognition control command to the channel conversion device. In this embodiment the remote control sends this command to the set-top box over a wireless link such as Bluetooth, a high-speed infrared protocol or ZigBee.

Step 406: the channel conversion device enters the mute state.

Step 408: the channel conversion device sends a start-voice-collection control command to the controller. If the mute function is not used, the corresponding step above may be omitted; this is not described further.

Step 410: the controller receives the user's voice input signal, and collects and processes it. In this embodiment an A/D converter converts the analog voice signal into a digital speech signal, which is sent to the channel conversion device wirelessly.

Step 412: the channel conversion device detects the start point and end point of the actual speech segment, which are then used in recognizing the name to be matched. In this embodiment voice activity detection uses a robust endpoint detection algorithm to find the start and end of the actual speech, separating the actual speech segment from the non-speech segments of the input voice signal.

Step 414: the channel conversion device sends a stop-voice-collection control signal to the controller. Once recognition has finished, the controller can stop collecting the user's voice input signal. In this embodiment the signal can also be sent wirelessly, for example by Bluetooth, a high-speed infrared protocol or ZigBee.

Step 416: the controller stops collecting and processing the voice signal in response to the stop-voice-collection control signal from the channel conversion device.

Step 418: the signal of the actual speech segment between the start point and the end point is sent to the speech feature extraction unit. Steps 418 and 414 need not be performed in a fixed order, and step 418 may also be performed before step 416; this is not described further.

Step 420: the speech feature extraction unit extracts speech features from the input voice signal. In this embodiment, if the actual speech segment was detected in an earlier step, features need only be extracted from that segment. The feature type can be MFCC, PLP or LPCC. To improve noise robustness, cepstral mean subtraction can be applied during feature extraction. Because MFCC features exploit the perceptual characteristics of the human ear and are relatively robust to noise, MFCC is the preferred feature type. Speech is a short-time stationary signal, so adjacent frames are correlated; appending the first-order differences, or the first- and second-order differences, of the MFCC features can therefore improve recognition accuracy.

Step 422: the acoustic distance of the input speech feature data relative to each entry is calculated according to the acoustic model and the recognition vocabulary. In this embodiment speech recognition obtains the shortest accumulated acoustic distance for each isolated word from the acoustic model data and the isolated-word vocabulary data, and then takes the isolated word with the smallest such distance as the first-choice recognition result for the utterance. The acoustic models used include continuous HMM models and discrete HMM models. Speech recognition can also offer several candidate results for the user to choose from, ranked by shortest accumulated acoustic distance. In this embodiment the model parameters are those of an HMM-based acoustic model built by mixed modelling of two languages. The parameters of this bilingual mixed acoustic model are speaker-independent, that is, the model targets non-specific speakers. The model parameters must be trained in advance by a trainer on annotated corpus data; only the trained parameters are fixed into the acoustic model parameter store for use in isolated-word recognition. The acoustic model parameters comprise the state parameters of the HMM and the probability distribution functions with which each state outputs observed feature vectors. Before this step, the method can also include a step of selecting, according to a speech selection signal entered by the user, the acoustic model that corresponds to that signal.

Step 424: judge whether the acoustic distance of the speech feature data relative to each entry is less than the threshold. If the acoustic distance is not less than the threshold, perform step 426; if it is less than the threshold, perform step 428.

Step 426: if the acoustic distance of the speech feature data relative to the entries is greater than or equal to the threshold, the recognition result is non-speech and the user is prompted to input again. The prompt can be a message, an on-screen display or a sound; in this embodiment the user is prompted by text displayed on the screen. After step 426 is performed, this recognition process ends.

Step 428: if the acoustic distance of the speech feature data relative to an entry is less than the threshold, the channel name corresponding to the current utterance is determined according to the recognition vocabulary and the matching table. In this embodiment the shortest accumulated acoustic distance for each isolated word is obtained from the acoustic model data and the isolated-word vocabulary data, and the isolated word with the smallest such distance is taken as the first-choice recognition result for the utterance. The acoustic models used include continuous HMM models and discrete HMM models. Several candidate results can also be offered for the user to choose from, ranked by shortest accumulated acoustic distance.

Step 430: switch to the required channel according to the recognized channel name. If matching entries exist and the query returns a single entry, the set-top box is controlled to switch the live broadcast to the channel identified by that entry's channel-name attribute. If the query returns several entries, the TV screen is controlled to display the channel-name values of those entries and the user is prompted to select one of the channels with the remote control; once the user has made the selection, the TV is controlled to switch to the selected channel.
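The decision logic of steps 424 to 430 can be condensed into one function. The sketch below assumes that step 422 has already produced a distance per vocabulary entry and that the matching table maps each entry to a list of channel names; the action labels, the function name and the sample data are hypothetical.

```python
def select_channel(distances, threshold, matching_table):
    """Return an (action, channels) pair covering steps 424 to 430."""
    word, distance = min(distances.items(), key=lambda kv: kv[1])
    if distance >= threshold:
        return ("reprompt", [])                # step 426: rejected as non-speech
    channels = matching_table.get(word, [])    # step 428: look up channel names
    if len(channels) == 1:
        return ("switch", channels)            # step 430: single entry
    if channels:
        return ("ask_user", channels)          # step 430: several entries
    return ("no_match", [])

# Hypothetical matching table and distances.
matching = {"News": ["Channel A", "Channel C"], "Football": ["Channel B"]}
print(select_channel({"News": 5.0, "Football": 2.0}, 4.0, matching))
# ('switch', ['Channel B'])
```

The `no_match` branch is not described in the steps above and is included only so that the function always returns a value.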

Referring also to Fig. 5, the method for updating the channel and program table according to an embodiment of the invention comprises the following steps:

Step 502: check whether the channel and program table meets the configured update condition. The update condition can be set according to the user's needs; the update interval of the channel and program table can be set to one day. If the update condition is met, perform step 504; otherwise return to step 502.

Step 504: the channel conversion device downloads the latest channel and program table data from the EPG server and updates the channel and program table.

The source of the update can be the EPG server, or alternatively a local network, an optical disc or the like.

Referring also to Fig. 6, the method for updating the recognition vocabulary and the matching table according to an embodiment of the invention comprises the following steps:

Step 602: check whether the recognition vocabulary and the matching table meet the configured update condition. The update condition can be set according to the user's needs; the update interval of the recognition vocabulary and matching table can be set to one minute. If the update condition is met, perform step 604; otherwise return to step 602.

Step 604: update the local recognition vocabulary and matching table according to the channel and program table.
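Step 604 derives the two recognition-side tables from the channel and program table. The patent does not spell out the derivation, so the sketch below is only one plausible reading: every channel name and every current program name becomes a vocabulary entry, and the matching table maps each entry to the channels it refers to. The function name and data layout are assumptions.

```python
def rebuild_tables(channel_program_table):
    """Derive (recognition_vocabulary, matching_table) from the program table.

    channel_program_table: list of (channel_name, program_name) pairs.
    """
    matching_table = {}
    for channel, program in channel_program_table:
        for name in (channel, program):
            targets = matching_table.setdefault(name, [])
            if channel not in targets:
                targets.append(channel)
    vocabulary = sorted(matching_table)
    return vocabulary, matching_table

# Hypothetical table contents.
vocab, matching = rebuild_tables([("Channel A", "News"), ("Channel C", "News")])
print(vocab)             # ['Channel A', 'Channel C', 'News']
print(matching["News"])  # ['Channel A', 'Channel C']
```

Because program names change as broadcasts progress, rebuilding these tables on a short interval keeps the spoken program names in step with what is actually on air.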

A person of ordinary skill in the art will appreciate that all or part of the steps of the above methods can be carried out by hardware under the instruction of a program. The program can be stored in a computer-readable storage medium such as RAM, ROM or an optical disc.

In the embodiments of the invention, the controller receives the user's voice input signal, and the channel conversion device recognizes the name to be matched from that signal, matches it against the matching table to obtain the channel to switch to, and switches to that channel. This avoids the complexity and high cost of performing speech recognition on the controller, makes operation convenient for the user, and makes full use of the processing capability of the channel conversion device, which reduces the cost of the controller. Because the name to be matched is recognized by the channel conversion device, no dedicated speech recognition server needs to be set up in the network; this prevents long response times, avoids data loss during network transmission, and saves the cost of building the network. Extracting only the actual speech segment improves recognition accuracy and removes interference from noise. The mute control module mutes the set-top box during voice input, which prevents the sound of the TV program from interfering with the user's speech. The update module automatically updates the channel and program table, the recognition vocabulary and the matching table from the EPG server, sparing the user the inconvenience of manual updates.

In summary, the above are merely preferred embodiments of the present invention and are not intended to limit its scope of protection. Any modification, equivalent replacement or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.