Summary of the invention
The present invention is an intelligent speech recognition system. The main content of the invention is as follows:
Speech recognition system structure
The speech recognition system is based on a distributed architecture, which makes it flexible, reliable and cost-effective. The system structure is shown in Fig. 1. Each component of the system is described below.
Recognition client
The recognition client handles the interaction between the application program and the speech recognition system. It processes audio input and output and supports limited telephony control. Echo cancellation can optionally be applied to the audio input, which is then endpointed. Audio output supports playback of pre-recorded prompts and provides a framework for third-party text-to-speech (TTS) systems. Under certain configurations, call control and prompt playback may be handled by components outside the system. Finally, the recognition client passes the audio to the recognition server and returns events and results to the application program.
Recognition server
The recognition server performs speech recognition and natural language understanding on the endpointed audio received from the recognition client. To recognize speech and return a natural language interpretation of what was said, the recognition server needs a set of acoustic models and grammars. The acoustic models and grammars help the recognition server determine what was spoken; the grammars are also used to interpret the meaning of the spoken words. The application program specifies the acoustic models and grammars in the recognition package that the recognition server loads.
Resource manager
The resource manager performs real-time load balancing to ensure that recognition tasks are distributed evenly across the available recognition servers, thereby reducing hardware requirements and improving quality of service.
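The load-balancing role described above can be illustrated with a minimal sketch. It assumes a "least active requests" policy, which the text does not specify; the class names, the server names "rs-1" and "rs-2", and the active_requests counter are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class RecognitionServer:
    name: str                 # hypothetical server identifier
    active_requests: int = 0  # recognition tasks currently in progress


class ResourceManager:
    """Hands each new recognition task to the least-loaded server (assumed policy)."""

    def __init__(self, servers):
        self.servers = list(servers)

    def assign(self):
        # Pick the server with the fewest tasks in progress; ties go to the first one.
        server = min(self.servers, key=lambda s: s.active_requests)
        server.active_requests += 1
        return server

    def release(self, server):
        # Called when a recognition task finishes.
        server.active_requests -= 1


manager = ResourceManager([RecognitionServer("rs-1"), RecognitionServer("rs-2")])
picked = [manager.assign().name for _ in range(4)]
print(picked)
```

With two idle servers, four consecutive tasks alternate between them, so each server ends up with two tasks.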
Database
The speech recognition system uses a database (text files, ODBC and other relational databases are supported) to store dynamic grammars and user data. Depending on the use case, some speech recognition applications may not need a database.
Speech recognition process
To understand the structure of the speech recognition system, it is most important to have a rough understanding of its recognition process, with emphasis on the client, the server and the application program. Fig. 2 and Fig. 3 show a schematic diagram and the steps of the speech recognition process; each step is explained below.
The recognition process of the speech recognition system comprises roughly the following steps:
1. A call arrives at the recognition client; the recognition client notifies the application program, and the system answers the call;
2. The system requests the recognition client to play the first prompt, and the caller responds. For a text-to-speech prompt, the recognition client sends the text to be synthesized to the TTS server over a socket and receives the returned audio samples;
3. To recognize the caller's response, the recognition client sends a server request to the resource manager (buffering the audio data in the meantime), and the resource manager directs the recognition client to the most suitable recognition server;
4. The recognition client sends a recognition request to the recognition server. Each request consists of an audio stream and a grammar entry used by the application. The grammar entry implies the acoustic model, because both are built into the recognition package loaded by the recognition server;
5. After the recognition server receives the request, it performs the recognition task and returns the recognition result to the recognition client;
6. During this period, the resource manager monitors the current load of the recognition server;
7. The recognition client sends the recognition result to the application program;
8. The application program responds accordingly, for example by performing a database query or by asking the recognition client to play another prompt as the response to the user;
9. The caller responds, and the recognition client sends the next recognition request (see step 4).
The above is a simple recognition process. For applications with a large volume of speech recognition, multiple recognition servers can be deployed, and the resource manager distributes the recognition work among them appropriately.
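Steps 3, 4, 5 and 7 of the process above can be sketched with in-memory stand-ins. This is an illustration only: every class and method name is hypothetical, the stub server returns a canned result instead of decoding audio, and round-robin selection stands in for whatever policy the resource manager actually uses to choose the most suitable server.

```python
from dataclasses import dataclass


@dataclass
class RecognitionRequest:
    audio: bytes
    grammar_entry: str  # the grammar entry implies the acoustic model (step 4)


class StubRecognitionServer:
    """Stand-in for a recognition server; returns a canned result."""

    def __init__(self, name):
        self.name = name

    def recognize(self, request):
        # Step 5: a real server would decode request.audio here.
        return {"server": self.name,
                "grammar": request.grammar_entry,
                "text": "<recognized text>"}


class StubResourceManager:
    def __init__(self, servers):
        self.servers = servers
        self.next_index = 0

    def pick_server(self):
        # Step 3: direct the client to a server (round-robin, an assumption).
        server = self.servers[self.next_index % len(self.servers)]
        self.next_index += 1
        return server


class RecognitionClient:
    def __init__(self, resource_manager, on_result):
        self.resource_manager = resource_manager
        self.on_result = on_result  # application callback (step 7)

    def handle_utterance(self, audio, grammar_entry):
        server = self.resource_manager.pick_server()        # step 3
        request = RecognitionRequest(audio, grammar_entry)  # step 4
        result = server.recognize(request)                  # step 5
        self.on_result(result)                              # step 7
        return result


received = []  # plays the role of the application program
client = RecognitionClient(
    StubResourceManager([StubRecognitionServer("rs-1"),
                         StubRecognitionServer("rs-2")]),
    received.append,
)
client.handle_utterance(b"\x00\x01", "MainMenu")
client.handle_utterance(b"\x02\x03", "MainMenu")
```

The second utterance is served by a different server than the first, which mirrors the point that each recognition request is routed independently.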
Speech recognition result
After each recognition is completed, the system passes the recognition result to the application program, which responds accordingly. The recognition result contains rich information for the application program to use, including:
1. the recognized text (transcript) and its confidence;
2. the natural language result: the value of each slot and its confidence score;
3. the verification score.
Fig. 4 is a schematic diagram of a recognition result, including the recognized text, the confidence level and the natural language interpretation.
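The three kinds of information listed above could be carried in a structure like the following. This is a sketch under assumptions: the field names, the 0 to 100 confidence scale and the sample values are illustrative and are not taken from Fig. 4.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class SlotValue:
    value: str
    confidence: int  # assumed 0-100 scale


@dataclass
class RecognitionResult:
    text: str                                  # recognized transcript
    confidence: int                            # confidence of the transcript
    slots: Dict[str, SlotValue] = field(default_factory=dict)  # natural language result
    verification_score: Optional[float] = None  # set only when verification is used


# Illustrative values only.
result = RecognitionResult(
    text="call Li Qiang",
    confidence=91,
    slots={"name": SlotValue("Li Qiang", 93), "action": SlotValue("call", 88)},
)

# An application can inspect per-slot confidence to decide what to re-prompt.
low_confidence_slots = [name for name, slot in result.slots.items()
                        if slot.confidence < 90]
```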
Similar-sound recognition
Similar-sounding words are encountered frequently, especially in Chinese pronunciation. Take an automatic voice switchboard application as an example: a company has several employees whose names sound identical or nearly identical, for instance a male employee named "Li Xiang" and a female employee named "Li Xiang", as well as others such as Li Qiang and Li Xiang. If a user asks for Li Xiang, the system finds that the recognition scores of the two Li Xiang entries are very close and both exceed an empirical threshold (for example, 85). In this situation the application flow cannot determine the user's choice from the result alone, but it can prompt the user further to choose between the male Li Xiang and the female Li Xiang. If the user says "the male Li Xiang", the system can easily determine the recognition result and complete the user's operation, as shown in Fig. 5.
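The decision described above, prompting again when several candidates exceed the threshold and score close to each other, can be sketched as follows. The threshold of 85 comes from the example in the text; the margin of 5 points, the function name and the candidate labels are assumptions.

```python
def needs_disambiguation(candidates, threshold=85, margin=5):
    """Return the labels the user should choose between, or [] if the top result is clear.

    candidates: list of (label, confidence) pairs in any order.
    """
    # Keep only candidates above the empirical threshold, best first.
    strong = sorted((c for c in candidates if c[1] > threshold),
                    key=lambda c: c[1], reverse=True)
    if len(strong) < 2:
        return []
    best_score = strong[0][1]
    # Candidates within `margin` points of the best are treated as "very close".
    close = [label for label, score in strong if best_score - score <= margin]
    return close if len(close) > 1 else []


ambiguous = needs_disambiguation(
    [("Li Xiang (male)", 90), ("Li Xiang (female)", 88), ("Li Qiang", 60)])
clear = needs_disambiguation([("Li Qiang", 95), ("Li Xiang (male)", 70)])
```

In the first call both Li Xiang entries are returned, so the application would ask the follow-up question; in the second call the list is empty and the top result can be used directly.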
Fault-tolerant processing
In speech recognition applications, unclear speech or an accent will occasionally cause a wrong recognition result, which inconveniences the user. Consider the voice phone book application shown in Fig. 5.
The user's phone book stores two contacts, Li Xiang and Li Xiang, and for speed and convenience the user has not enabled similar-sound handling. If, while the call is being forwarded, the user hears a name other than the one spoken, the user does not need to hang up; saying "return" or "wrong" is enough for the system to go back to the previous level automatically and let the user choose again. This both avoids a misdirected call and lets the user re-enter the request easily. The above is a simple example; in applications such as voice search, this kind of fault-tolerant processing is of considerable value.
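The "return" / "wrong" behaviour described above can be sketched as a stack of dialog states in which a correction command pops back to the previous level. The state names and the DialogFlow class are hypothetical; only the two command words come from the text.

```python
CORRECTION_COMMANDS = {"return", "wrong"}


class DialogFlow:
    """Keeps the dialog states visited so far so the user can step back one level."""

    def __init__(self, root_state):
        self.stack = [root_state]

    @property
    def current(self):
        return self.stack[-1]

    def advance(self, next_state):
        self.stack.append(next_state)

    def handle(self, utterance):
        """Go back one level on a correction command; return True if we did."""
        is_correction = utterance.strip().lower() in CORRECTION_COMMANDS
        if is_correction and len(self.stack) > 1:
            self.stack.pop()
            return True
        return False


flow = DialogFlow("choose_contact")
flow.advance("transferring_call")     # system announces the wrong name
went_back = flow.handle("Wrong")      # user corrects without hanging up
state_after_correction = flow.current
```

At the root level there is nothing to go back to, so a correction command is ignored there rather than ending the call.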
Key characteristics of the speech recognition system
1. Cloud computing (distributed) architecture. The resource manager balances load among the recognition servers, ensuring efficient use of hardware. CPU-intensive recognition can be performed on remote machines that do not run the application program or the audio interface;
2. High-density interfaces. The small amount of client-side processing is separated from the CPU-intensive server processing, which allows the client to support a high density of interfaces while improving CPU utilization on the server side;
3. Fault tolerance and reliability. The failure of an individual server does not bring the system down, and not even a single recognition request is lost. When a recognition server fails, the resource manager automatically stops sending requests to it, and automatically resumes when the server recovers;
4. Ease of maintenance. A recognition server can be shut down for maintenance with little or no effect on overall system performance. Some types of maintenance can even be carried out without shutting down the recognition server;
5. Scalability. As recognition requests from clients increase, instances of the recognition server, recognition client and application can be added without stopping any running application program or shutting down the recognition system;
6. Multi-channel requests. The system supports recognition service requests from different networks, such as the Internet (TCP/IP and SIP) and the telephone network (fixed-line and mobile);
7. Algorithm optimization. A single recognition server can process more than 300 concurrent recognitions (Intel Xeon E5 CPU, 8 GB RDIMM RAM, RAID 5), and a single recognition takes less than 0.1 second.
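The fault-tolerance behaviour in item 3 above (stop routing to a failed server, retry the request elsewhere, resume when the server recovers) can be sketched as follows. The FailoverPool class, the use of ConnectionError to signal a failed server and the fake_send function are assumptions made for illustration.

```python
class FailoverPool:
    """Tracks which recognition servers are healthy and retries requests on failure."""

    def __init__(self, server_names):
        self.healthy = {name: True for name in server_names}

    def mark_failed(self, name):
        self.healthy[name] = False

    def mark_recovered(self, name):
        self.healthy[name] = True

    def available(self):
        return [name for name, ok in self.healthy.items() if ok]

    def dispatch(self, request, send):
        """Try healthy servers in turn so that a single failure does not lose the request."""
        for name in self.available():
            try:
                return send(name, request)
            except ConnectionError:
                # Stop routing to this server until it is marked recovered.
                self.mark_failed(name)
        raise RuntimeError("no recognition server available")


def fake_send(name, request):
    # Hypothetical transport: "rs-1" is simulated as being down.
    if name == "rs-1":
        raise ConnectionError("rs-1 is down")
    return f"{name} handled {request}"


pool = FailoverPool(["rs-1", "rs-2"])
outcome = pool.dispatch("req-42", fake_send)
available_after_failure = pool.available()
pool.mark_recovered("rs-1")
```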
Main functions of the speech recognition system
1. Large-vocabulary, speaker-independent recognition
The speech recognition system can reliably perform large-vocabulary recognition in multiple languages and can provide the confidence of each recognition result. The system provides highly accurate speech recognition over large vocabularies. Applications developed with the speech recognition system have been tested to exceed 96% accuracy;
2. Built-in natural language understanding
A natural language understanding system can be developed with the speech recognition system: it takes a sentence as input and returns an interpretation of the sentence's meaning, so that the application program can take the appropriate action for the user's request. The system also provides class-based confidence scoring, which distinguishes more precisely which parts of a phrase were recognized accurately (or inaccurately). The application program can then handle error checking or re-prompting more naturally and effectively;
3. Host-based client/server architecture
The speech recognition system is based on an open client/server architecture designed specifically for the stability and scalability required by large-scale applications. The caller's speech is collected by the client, and the recognition load is distributed evenly across multiple separate servers on the network;
4. Single-word correction
This is also called per-slot confidence scoring. If one word in a long sentence is not recognized, the application program can prompt the user to repeat only that fragment rather than the whole sentence;
5. Hotword recognition
Hotword recognition lets the system monitor the speaker, wait for a specific word or phrase, and then return control to the application program. With this function the recognizer can listen silently and interact with the user only when the user speaks a specific phrase as a request;
6. Intelligent endpointing
Endpointing is the process of determining where an utterance starts and ends in the incoming sample stream. After the start and end points of the utterance are found, the utterance region is extended by a predetermined length before the start and after the end. As soon as the start of an utterance is detected, samples begin to flow to the recognition server and continue until the end point is found. In this way the recognition server begins processing the speech while the user is still talking, yet does not process the unnecessary silence before and after the speech, which saves CPU time and network bandwidth;
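The start/end detection and padding described above can be illustrated with a simplified offline sketch. It assumes that an utterance is detected by comparing per-frame energy with a fixed threshold, which the text does not state, and it works on a complete list of frames rather than on a live stream; the function name and sample values are hypothetical.

```python
def find_endpoints(frame_energies, threshold, padding_frames):
    """Return (start, end) frame indices of the utterance, end exclusive, or None.

    The detected region is extended by `padding_frames` on each side, clipped
    to the bounds of the input.
    """
    voiced = [i for i, energy in enumerate(frame_energies) if energy > threshold]
    if not voiced:
        return None  # nothing above the threshold: no utterance found
    start = max(0, voiced[0] - padding_frames)
    end = min(len(frame_energies), voiced[-1] + 1 + padding_frames)
    return start, end


# Illustrative energies: silence, three frames of speech, silence.
energies = [0.1, 0.1, 0.2, 0.9, 1.2, 0.8, 0.1, 0.1, 0.1, 0.1]
region = find_endpoints(energies, threshold=0.5, padding_frames=1)
frames_to_send = energies[region[0]:region[1]]
```

Only five of the ten frames would be sent to the recognition server in this example, which is the bandwidth and CPU saving the text refers to.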
7. Barge-in
Barge-in lets the user interrupt a prompt and respond without waiting for the prompt to finish. It makes the exchange between user and system faster and more natural, particularly for frequent users of the system;
8. N-best processing
Some application programs need the recognition engine to produce a set of possible recognition results rather than a single best result. The N-best processing of the system provides this: it returns a list of possible recognition results, ordered from most to least likely;
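The ordering described above can be shown with a short sketch. The hypothesis texts and scores are made-up values, and n_best is a hypothetical helper rather than an interface of the system.

```python
def n_best(hypotheses, n):
    """Return the n highest-scoring (text, score) hypotheses, best first."""
    return sorted(hypotheses, key=lambda h: h[1], reverse=True)[:n]


hypotheses = [("Li Qiang", 62), ("Li Xiang", 90), ("Li Xia", 75)]
top_two = n_best(hypotheses, 2)
```

An application can walk this list in order, for example offering the second entry if the user rejects the first.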
9. Grammar probabilities
The system allows probabilities to be assigned in the grammar to particular words or phrases that the caller may say. This is useful when the probability of a word or phrase can be estimated from actual usage. Adding probabilities to a grammar can improve both the accuracy and the speed of recognition;
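One way grammar probabilities can influence a result is sketched below. The text does not say how the system combines them with acoustic evidence; this example assumes the common approach of adding the log of the phrase probability to an acoustic log-likelihood. The phrases, scores and the rescore function are hypothetical.

```python
import math


def rescore(acoustic_log_likelihoods, phrase_probabilities):
    """Rank phrases by acoustic log-likelihood plus log grammar probability, best first."""
    combined = {
        phrase: log_likelihood + math.log(phrase_probabilities[phrase])
        for phrase, log_likelihood in acoustic_log_likelihoods.items()
    }
    return sorted(combined, key=combined.get, reverse=True)


# Acoustically "savings" is slightly ahead, but usage data says callers
# ask for "checking" four times as often.
acoustic = {"checking": -10.0, "savings": -9.8}
usage = {"checking": 0.8, "savings": 0.2}

ranking = rescore(acoustic, usage)
acoustic_only = sorted(acoustic, key=acoustic.get, reverse=True)
```

Here the usage-based probabilities reverse the ranking of two acoustically similar phrases, which is the kind of accuracy gain the text attributes to grammar probabilities.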
10. Noise reduction
When incoming calls contain steady background noise, the system uses a mechanism that lets the recognition server recognize more accurately. The recognition server enhances the incoming speech to filter out noise such as tones, hum, whine and hiss. The mechanism works best when a considerable share of calls contain steady background noise, for example hands-free calls made from a car;
11. Prompt playback
The system can play both pre-recorded prompts and prompts generated by a text-to-speech system. If the application program uses multiple text-to-speech servers, the resource manager balances the synthesis load across them to improve hardware efficiency;
12. SNMP support
The system provides Simple Network Management Protocol (SNMP) support for remote monitoring, and a dedicated visual tool makes configuration, management and operation convenient.
Voice application technology platform based on the speech recognition system
In practical applications the speech recognition system is usually integrated with related communication, voice, network and database components to form an intelligent voice application technology platform that provides application services with speech recognition at their core. Fig. 7 is the software structure diagram of a cloud-computing-based intelligent voice application service technology platform that our company has built with the speech recognition system as its key foundation. A brief description follows, for reference in specific applications only:
1. Access layer
The access layer comprises a platform interconnection module and an end-user access module. The platform interconnection module supports the H.323 and SIP protocols; the end-user access module supports registration of SIP-type endpoints on the application platform;
2. Call control layer
The call control layer implements the call-related functions, such as incoming and outgoing calls, call state analysis, call forwarding, recording and playback, DTMF reception and transfer to an agent, and communicates with the billing service of the accounting server;
3. Session layer
The session layer mainly implements the dialog between the user and the system, including media playback, voice sampling for speech recognition and media output of synthesized text, as well as the interfaces to and interaction with the speech recognition service and the text-to-speech service;
4. Flow parsing layer
The flow parsing layer mainly implements parsing of VoiceXML flow scripts and controls the user's service flow according to service requests from the business flow control layer;
5. Business flow control layer
The business flow control layer receives service requests from the application server and, after evaluating them, hands them to the flow parsing layer for processing;
6. External interface module
The external interface module mainly covers the application server (including the database server and the Web server), accounting server, speech recognition server, text-to-speech server, content server, agent seats, IP terminals, and administration and maintenance terminals.
Description of drawings
Fig. 1 is the structural diagram of the speech recognition system; Fig. 2 is a schematic diagram of the recognition process of the speech recognition system; Fig. 3 is a diagram of the recognition steps of the speech recognition system; Fig. 4 is a diagram of a recognition result of the speech recognition system; Fig. 5 is a diagram of similar-sound handling in the speech recognition system; Fig. 6 is a diagram of fault-tolerant processing in the speech recognition system; Fig. 7 is the software structure diagram of the cloud-computing-based intelligent voice application technology platform.