CN102760431A


Info

Publication number: CN102760431A
Application number: CN2012102408980A
Authority: CN
Inventors: 余金环; 陈洪林
Original assignee: SHANGHAI YULIAN INFORMATION TECHNOLOGY Co Ltd
Current assignee: SHANGHAI YULIAN INFORMATION TECHNOLOGY Co Ltd
Priority date: 2012-07-12
Filing date: 2012-07-12
Publication date: 2012-10-31

Abstract

The invention discloses an intelligent speech recognition system. It belongs to the technical field of electronic information and draws on several background technologies, including acoustics, linguistics, artificial intelligence, and cloud computing. Speech is the most convenient, fast, and natural means of interpersonal communication. Using natural speech as the means of interaction between people and computers, so that computers can listen, speak, and understand as people do, is the basis for applying and developing intelligent speech technology. Building on many years of research and development of speech recognition systems, the invention introduces several innovations, concentrated mainly on the structure of the speech recognition system, its specific recognition functions, and its intelligent characteristics, so that users can develop and deploy various speech recognition services effectively and conveniently.

Description

Intelligent speech recognition system

Technical field

The present invention is an intelligent speech recognition software system. It belongs to the technical field of electronic information and draws on several background technologies, including acoustics, linguistics, artificial intelligence, computer networks, and cloud computing.

Background technology

Speech is the most convenient, fast, and natural means of interpersonal communication. Using natural speech as the means of interaction between people and computers, so that computers can listen, speak, and understand as people do, is the basis for applying and developing intelligent speech technology. Of the various technologies this requires, speech recognition is the most challenging, and numerous foreign media outlets and experts therefore selected it as one of the ten scientific and technological advances expected to have a significant impact on the human way of life in the first decade of the 21st century.

Speech recognition technology is quite complex. It is an integrated technology spanning acoustics, linguistics, digital signal processing, statistical modeling, probability theory and information theory, the mechanisms of speech production and hearing, artificial intelligence, and other disciplines. Research in the field requires a large investment of people and resources and takes a relatively long time.

Speech recognition belongs to the field of multi-dimensional pattern recognition and intelligent computer interfaces. The basic goal of speech recognition research is to develop a machine with an auditory function that accepts spoken commands directly, understands the speaker's intention, and responds accordingly. Getting machines to understand human language has long been a human ideal, and the demand for it is broad. For example, a computer with a speech interface could change the way people operate computers today and bring about a revolution in operating systems. Direct communication between two languages could be achieved by converting one language into another through "speech recognition, machine translation, text-to-speech synthesis". Voice portals could let users search databases directly by voice, much like voice search on an internet search engine, to obtain the information they need, or place calls by voice dialing. This is especially important and convenient in particular environments, such as while driving a car.

These application demands derive from the essential characteristics of the speech signal. On the one hand, speech is the most natural and convenient communication tool people have: it requires no special training, and its response time is very short, on the order of milliseconds. On the other hand, the speech signal has no strict directional constraint and can propagate in the dark, so it cannot be replaced by visual or tactile information such as pictures, text, or buttons.

However, getting a computer to understand human language faces many difficulties, chiefly the following:

1. The acoustic features of a speech sound vary greatly with the sounds that precede and follow it, and there are no clear boundaries between speech units in continuous speech.

2. Speech features vary greatly from speaker to speaker and with changes in the speaker's psychological or physiological state.

3. Differences in transmission equipment and interference from ambient noise directly affect the accurate extraction of speech features.

4. The meaning a sentence expresses depends on factors such as its context and the environment and background in which it is spoken, and sentence syntax is highly varied, yet automatic speech recognition can make almost no use of contextual information.

5. In concrete applications, speech recognition cannot be just a recognition technique; it must form a distributed system that can serve a large number of concurrent recognition requests.

The present invention is an intelligent continuous speech recognition system. Beyond the speech recognition technology itself, its emphasis is on several innovations in the structure of the speech recognition system. The resulting system architecture offers high accuracy, ample room for expansion, and stable, reliable quality, and can be used to build high-quality speech recognition applications.

Summary of the invention

The present invention is an intelligent speech recognition system. Its main content is summarized as follows:

Speech recognition system structure

The speech recognition system is built on a distributed architecture, which makes it flexible, reliable, and cost-efficient. The system architecture is shown in Fig. 1. Each component of the system is described below.

Recognition client

The recognition client handles the interaction between the application and the speech recognition system. It processes audio input and output and supports limited telephony control. Audio input can optionally pass through echo cancellation and then endpointing (utterance segmentation). Audio output supports playback of prerecorded prompts and provides a framework for third-party text-to-speech (TTS) systems. In certain configurations, call control and prompt playback can be handled by components outside the system. Finally, the recognition client passes the audio to the recognition server and returns events and results to the application.

Recognition server

The recognition server performs speech recognition and natural language understanding on the audio it receives from the recognition client. To recognize speech and return a natural language interpretation of what was said, the recognition server needs a set of acoustic models and grammars. The acoustic models and grammars help the recognition server determine what was spoken; the grammars are also used to interpret the meaning of the spoken words. The application specifies, in a recognition package, the acoustic models and grammars that the recognition server loads.

Resource manager

The resource manager performs real-time load balancing so that recognition tasks are distributed evenly across the available recognition servers, which reduces hardware requirements and improves quality of service.
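The load-balancing role described above can be illustrated with a minimal sketch. The source does not say which balancing policy is used; the version below assumes a simple least-loaded policy, and the names `RecognitionServer`, `ResourceManager`, `assign`, and `release` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class RecognitionServer:
    """Hypothetical record a resource manager might keep per server."""
    name: str
    active_requests: int = 0  # recognition tasks currently running


class ResourceManager:
    """Least-loaded dispatch: one simple way to spread tasks evenly (assumed policy)."""

    def __init__(self, servers):
        self.servers = list(servers)

    def assign(self):
        # Pick the server with the fewest active tasks. Ties go to the
        # earliest server in the list because min() returns the first minimum.
        server = min(self.servers, key=lambda s: s.active_requests)
        server.active_requests += 1
        return server

    def release(self, server):
        # Called when a recognition task finishes.
        server.active_requests -= 1


manager = ResourceManager([RecognitionServer("rec-1"), RecognitionServer("rec-2")])
assigned = [manager.assign().name for _ in range(4)]
print(assigned)  # ['rec-1', 'rec-2', 'rec-1', 'rec-2']
```

With equal starting load, four requests alternate between the two servers, which is the even distribution the text describes.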

Database

The speech recognition system uses a database (text files, ODBC, and similar data sources are supported) to store dynamic grammars and user profiles. Depending on the application, some speech recognition applications may not need a database.

Speech recognition process

To understand the structure of the speech recognition system, it helps most to understand its recognition process in outline, focusing on the client, the server, and the application. Fig. 2 and Fig. 3 show a schematic and the steps of the recognition process; an explanation of each step follows.

The recognition process consists roughly of the following steps:

1. A call arrives at the recognition client; the recognition client notifies the application, and the system answers the call.

2. The system asks the recognition client to play the first prompt, and the caller responds. For a text-to-speech prompt, the recognition client sends the text to be synthesized to the TTS server over a socket and receives the audio samples in return.

3. To recognize the caller's response, the recognition client sends a server request to the resource manager (buffering the audio data in the meantime), and the resource manager directs the recognition client to the most suitable recognition server.

4. The recognition client sends a recognition request to the recognition server. Each request consists of an audio stream and a grammar entry in the application. The grammar entry implies the acoustic model, because both are built into the recognition package that the recognition server loads.

5. On receiving the request, the recognition server performs the recognition task and returns the recognition result to the recognition client.

6. Meanwhile, the resource manager monitors the current load of each recognition server.

7. The recognition client sends the recognition result to the application.

8. The application responds accordingly, for example by running a database query or asking the recognition client to play another prompt in response to the user.

9. The caller responds, and the recognition client sends the next recognition request (see step 4).

The above is a simple recognition process. For applications with heavy recognition traffic, multiple recognition servers can be started, and resource management distributes the recognition work among them sensibly.
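Steps 3 to 7 above can be sketched as a small simulation. This is only an illustration under stated assumptions: all class and field names are hypothetical, the "recognition package" is modeled as a plain dictionary, and real decoding is replaced by a stand-in that treats the audio bytes as text.

```python
from dataclasses import dataclass


@dataclass
class RecognitionRequest:
    audio: bytes         # audio stream buffered by the client (step 3)
    grammar_entry: str   # grammar entry named by the application (step 4)


class RecognitionServer:
    def __init__(self, package):
        # package: grammar entry -> (acoustic model name, phrases in the grammar).
        # This mirrors the idea that the grammar entry implies the acoustic model.
        self.package = package

    def recognize(self, request):
        model, phrases = self.package[request.grammar_entry]
        # Stand-in for real decoding: treat the audio bytes as UTF-8 text and
        # accept it only if the grammar contains that phrase.
        heard = request.audio.decode("utf-8")
        text = heard if heard in phrases else None
        return {"text": text, "acoustic_model": model}


class RecognitionClient:
    def __init__(self, pick_server):
        self.pick_server = pick_server  # stands in for the resource manager lookup

    def handle_utterance(self, audio, grammar_entry):
        server = self.pick_server()                          # step 3
        request = RecognitionRequest(audio, grammar_entry)   # step 4
        return server.recognize(request)                     # steps 5 and 7


server = RecognitionServer({"main_menu": ("telephony-model", {"check balance", "transfer"})})
client = RecognitionClient(pick_server=lambda: server)
result = client.handle_utterance(b"check balance", "main_menu")
print(result)  # {'text': 'check balance', 'acoustic_model': 'telephony-model'}
```

Note that the client never names an acoustic model; it is resolved from the grammar entry on the server side, as step 4 describes.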

Speech recognition result

After each recognition completes, the system passes the recognition result to the application, and the application responds according to the result. The recognition result contains rich information for the application to use, including:

1. The recognized transcript of the speech and its confidence

2. The natural language result: the value of each slot and its corresponding confidence score

3. The verification score

Fig. 4 is a schematic of a recognition result, including the recognized text, the confidence level, and the natural language interpretation.
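One way an application might represent such a result is sketched below. The field names, the 0 to 1 confidence scale, and the example values are assumptions made for illustration; the source does not define a concrete data format.

```python
from dataclasses import dataclass, field


@dataclass
class SlotValue:
    value: str
    confidence: float  # assumed scale: 0.0 (no confidence) to 1.0 (certain)


@dataclass
class RecognitionResult:
    text: str                   # recognized transcript
    confidence: float           # confidence of the transcript
    slots: dict = field(default_factory=dict)  # natural language result per item
    verification_score: float = 0.0

    def low_confidence_slots(self, threshold):
        # Names of the items whose confidence falls below the threshold.
        return [name for name, slot in self.slots.items()
                if slot.confidence < threshold]


result = RecognitionResult(
    text="call Li Xiang at the office",
    confidence=0.82,
    slots={"contact": SlotValue("Li Xiang", 0.55),
           "location": SlotValue("office", 0.91)},
    verification_score=0.77,
)
print(result.low_confidence_slots(threshold=0.6))  # ['contact']
```

An application could use the returned names to decide which part of the utterance to confirm with the user.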

Similar-sound recognition

Fault-tolerant processing

In speech recognition applications, recognition errors are occasionally unavoidable, for example when the user's speech input is slightly unclear or accented, and such errors inconvenience the user. Consider the voice phone book application shown in Fig. 5.

The user's phone book contains two contacts whose names sound alike (both are rendered "Li Xiang"), and for speed and convenience the user has not set up any similar-sound handling. If, while the call is being transferred, the user hears a name other than the one intended, there is no need to hang up: the user only has to say "go back" or "wrong", and the system automatically returns to the previous level so the user can choose again. This avoids a wrong transfer and lets the user re-enter the name easily. This is a simple example; in applications such as voice search, this kind of fault-tolerant processing is considerably more valuable.
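The "go back" behavior in the phone book example can be sketched as a two-state dialog. The state names, prompts, and the `PhoneBookDialog` class are hypothetical; the sketch only shows the control flow of returning to the previous level instead of ending the call.

```python
class PhoneBookDialog:
    """Hypothetical two-level dialog: choose a contact, then confirm by announcing it."""

    BACK_COMMANDS = {"go back", "wrong"}

    def __init__(self):
        self.state = "choose_contact"
        self.pending = None  # contact currently being announced

    def on_recognized(self, text):
        if self.state == "confirming" and text in self.BACK_COMMANDS:
            # Return to the previous level rather than forcing a hang-up.
            self.state = "choose_contact"
            self.pending = None
            return "Please say the contact name again."
        if self.state == "choose_contact":
            self.pending = text
            self.state = "confirming"
            return f"Calling {text}."
        return "Sorry, please repeat."


dialog = PhoneBookDialog()
print(dialog.on_recognized("Li Xiang (office)"))  # Calling Li Xiang (office).
print(dialog.on_recognized("wrong"))              # Please say the contact name again.
print(dialog.on_recognized("Li Xiang (home)"))    # Calling Li Xiang (home).
```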

Key properties of the speech recognition system

1. Cloud computing (distributed) architecture. The resource manager balances load across the recognition servers, which ensures efficient use of hardware. CPU-intensive recognition can be carried out on remote machines that do not run the application or the audio interface.

2. High-density interfaces. The light client-side processing is separated from the CPU-intensive server-side processing, which lets clients host high-density interfaces while also improving server-side CPU utilization.

3. Fault tolerance and reliability. Even if an individual server fails, the system does not crash, and not a single recognition request is lost. When a recognition server fails, the resource manager automatically stops sending requests to it; when the server recovers, the resource manager automatically resumes sending requests to it.

4. Ease of maintenance. A recognition server can be shut down for maintenance with little or no effect on overall system performance. Some kinds of maintenance can even be carried out without shutting down the recognition server.

5. Scalability. As client recognition requests grow, instances of the recognition server, recognition client, and application can be added without stopping any running application or shutting down the recognition system.

6. Multi-channel requests. The system supports recognition service requests from different networks, including the internet (TCP/IP and SIP) and telephone networks (fixed-line and mobile).

7. Algorithm optimization. A single recognition server can process more than 300 concurrent recognitions (Intel Xeon E5 CPU, 8 GB RDIMM RAM, RAID 5), and a single recognition takes less than 0.1 second.
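The fault-tolerance property in item 3 can be sketched as follows. The source does not describe how failures are detected or how requests are retried; this sketch assumes a failed call raises `ConnectionError`, that the request is retried on the next server so it is not lost, and that a periodic `probe` re-admits recovered servers. All names are hypothetical.

```python
class RecognitionServer:
    def __init__(self, name):
        self.name = name
        self.healthy = True  # toggled here to simulate failure and recovery

    def recognize(self, audio):
        if not self.healthy:
            raise ConnectionError(f"{self.name} is down")
        return f"{self.name} recognized {len(audio)} bytes"


class ResourceManager:
    def __init__(self, servers):
        self.servers = list(servers)
        self.excluded = set()  # names of servers currently considered failed

    def dispatch(self, audio):
        for server in self.servers:
            if server.name in self.excluded:
                continue  # stop sending requests to a failed server
            try:
                return server.recognize(audio)
            except ConnectionError:
                # Mark the server as failed and retry the same request
                # on the next server, so the request is not lost.
                self.excluded.add(server.name)
        raise RuntimeError("no recognition server available")

    def probe(self):
        # Re-admit servers that have recovered (assumed periodic health check).
        for server in self.servers:
            if server.healthy:
                self.excluded.discard(server.name)


a, b = RecognitionServer("rec-1"), RecognitionServer("rec-2")
manager = ResourceManager([a, b])
a.healthy = False
first = manager.dispatch(b"abcd")   # rec-1 fails, request is retried on rec-2
a.healthy = True
manager.probe()                     # rec-1 is re-admitted
second = manager.dispatch(b"abcd")
print(first)   # rec-2 recognized 4 bytes
print(second)  # rec-1 recognized 4 bytes
```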

Main functions of the speech recognition system

1. Massive vocabulary and powerful speaker-independent recognition

The speech recognition system can reliably perform large-vocabulary recognition in multiple languages and can provide the confidence of each recognition result. The system is designed for high accuracy on large vocabularies. Applications developed with the speech recognition system have exceeded 96% accuracy in testing.

2. Built-in natural language understanding

A natural language understanding system can be developed with the speech recognition system: it takes a sentence as input and returns an interpretation of the sentence's meaning, so the application can take the appropriate action for the user's request. The system also provides class-based confidence scoring, which distinguishes more finely which parts of a phrase were recognized accurately (or inaccurately). The application can then be revised more naturally and effectively to handle error checking or re-prompting.

3. Host-based client/server architecture

The speech recognition system is based on an open client/server architecture designed specifically for the stability and scalability that large-scale applications require. The caller's speech is collected by the client, and the recognition load is distributed evenly across multiple separate servers on the network.

4. Single-word correction

This is also called per-slot confidence scoring. If one word in a long sentence is not recognized, the application can prompt the user to repeat just that fragment rather than the whole sentence.

5. Hotword recognition

Hotword recognition lets the system monitor the speaker, wait for a specific word or phrase, and then return control to the application. An application can use this function so that the recognizer listens silently and interacts with the user only when the user says a specific phrase to request it.

6. Intelligent endpointing

Endpointing is the process of determining where an utterance starts and ends in the incoming sample stream. Once the start and end points of the utterance are found, the utterance region is extended forward and backward by a predetermined length. As soon as the start of an utterance is detected, samples begin streaming to the recognition server and continue until the end of the utterance is found. In this way the recognition server actually begins processing the speech while the user is still talking, yet does not process the unneeded silence before and after the speech, which saves CPU time and network bandwidth.
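The start/end detection with padding described above can be sketched in a few lines. The source does not specify how speech is detected; this sketch assumes a simple per-frame energy threshold, works on a complete list of frame energies rather than a live stream, and uses hypothetical names (`find_endpoints`, `padding_frames`).

```python
def find_endpoints(frame_energies, threshold, padding_frames):
    """Return (start, end) frame indices of the utterance, end exclusive.

    A frame counts as speech when its energy reaches the threshold. The
    detected region is widened by padding_frames on each side, clamped to
    the bounds of the input. Returns None when no frame contains speech.
    """
    voiced = [i for i, energy in enumerate(frame_energies) if energy >= threshold]
    if not voiced:
        return None
    start = max(0, voiced[0] - padding_frames)
    end = min(len(frame_energies), voiced[-1] + 1 + padding_frames)
    return start, end


energies = [0.01, 0.02, 0.01, 0.40, 0.55, 0.05, 0.60, 0.02, 0.01, 0.01]
print(find_endpoints(energies, threshold=0.1, padding_frames=1))  # (2, 8)
```

Only frames 2 through 7 would be sent for recognition in this example; the leading and trailing silence is dropped, which is where the CPU and bandwidth savings come from. A streaming version would apply the same test frame by frame and start sending as soon as the first voiced frame appears.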

7. Barge-in

Barge-in lets the user interrupt a prompt and respond without waiting for the prompt to finish. It makes the exchange between user and system faster and more natural, particularly for frequent users of the system.

8. N-best processing

Some applications may need the recognition engine to produce a set of possible recognition results rather than a single best result. The system's N-best processing provides this: it returns a list of possible recognition results ordered from most to least likely.
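As a minimal sketch of N-best output, the function below keeps the `n` highest-scoring hypotheses in descending order. The hypothesis texts and scores are made-up illustration data, and `n_best` is a hypothetical helper, not an interface of the system.

```python
import heapq


def n_best(hypotheses, n):
    """Return the n most likely (text, score) pairs, highest score first."""
    return heapq.nlargest(n, hypotheses, key=lambda h: h[1])


hypotheses = [("Li Xiang", 0.61), ("Li Qiang", 0.58),
              ("Lin Xiang", 0.22), ("Li Xian", 0.35)]
print(n_best(hypotheses, 3))
# [('Li Xiang', 0.61), ('Li Qiang', 0.58), ('Li Xian', 0.35)]
```

An application could read the candidates back to the caller in this order when the top score is not decisive.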

9. Grammar probabilities

The system allows probabilities to be assigned in the grammar to particular words or phrases the caller may say. This is useful when the probability of a spoken word or phrase can be estimated from actual usage. Adding probabilities to a grammar can improve both the accuracy and the speed of recognition.
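To see why grammar probabilities can help accuracy, consider two phrases that sound equally plausible acoustically. The sketch below assumes the recognizer combines an acoustic score with the grammar probability by multiplying and normalizing (a Bayes-style product); the source does not state how the system actually combines them, and the phrases and numbers are invented for illustration.

```python
def rescore(acoustic_scores, grammar_probs):
    """Weight each phrase's acoustic score by its grammar probability.

    Phrases missing from grammar_probs get probability 0. The weighted
    scores are normalized to sum to 1; an empty dict is returned when
    every weighted score is 0.
    """
    weighted = {phrase: score * grammar_probs.get(phrase, 0.0)
                for phrase, score in acoustic_scores.items()}
    total = sum(weighted.values())
    if total == 0:
        return {}
    return {phrase: w / total for phrase, w in weighted.items()}


acoustic = {"Boston": 0.5, "Austin": 0.5}   # acoustically indistinguishable
grammar = {"Boston": 0.8, "Austin": 0.2}    # estimated from actual usage
posterior = rescore(acoustic, grammar)
best = max(posterior, key=posterior.get)
print(best)  # Boston
```

With a tie in the acoustic scores, the usage-based probabilities break the tie in favor of the phrase callers say more often.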

10. Noise reduction

When incoming calls contain steady background noise, the system uses a dedicated mechanism to help the recognition server recognize more accurately. The recognition server enhances the incoming speech so that noise such as tones, buzzing, humming, and hissing is filtered out effectively. The mechanism works best when a considerable share of calls contain steady background noise, for example hands-free calls made from a car.

11. Prompt playback

The system can play both prerecorded prompts and prompts generated by a text-to-speech system. If the application uses multiple text-to-speech servers, the resource manager balances the conversion load across them to improve hardware efficiency.

12. SNMP support

The system provides Simple Network Management Protocol (SNMP) support for remote monitoring, and its own visualization tools make configuration, management, and operation convenient.

Voice application technology platform based on the speech recognition system

In concrete applications, the speech recognition system is usually integrated with related communication, voice, network, and database components to form an intelligent voice application technology platform that provides application services with speech recognition at the core. Fig. 7 is the software architecture diagram of a cloud-computing-based intelligent voice application service platform that our company built with the speech recognition system as its key foundation. A brief description follows, for reference only in concrete applications:

1. Access layer

The access layer comprises a platform access module and an end-user access module. The platform access module supports the H.323 and SIP protocols; the end-user access module supports registration of SIP endpoints on the application platform.

2. Call control layer

The call control layer implements the call-related functions, such as inbound and outbound calls, call state analysis, call transfer, recording and playback, DTMF reception, and transfer to an agent, and communicates with the billing server for billing services.

3. Session layer

The session layer mainly implements the dialog between the user and the system, including media handling, voice sampling for speech recognition, and media output of synthesized text, as well as the interfaces and interactions with the speech recognition and text synthesis services.

4. Flow parsing layer

The flow parsing layer mainly implements parsing of VoiceXML flow scripts and controls the user service flow according to service requests from the business flow control layer.

5. Business flow control layer

The business flow control layer receives service requests from the application server and, after evaluating them, hands them to the flow parsing layer for processing.

6. External interface modules

The external interface modules mainly include application servers (including database servers and web servers), billing servers, speech recognition servers, text synthesis servers, content servers, operator agent seats, IP terminals, and management and maintenance terminals.

Description of drawings

Fig. 1 is a structural diagram of the speech recognition system; Fig. 2 is a schematic of the recognition process; Fig. 3 shows the recognition steps; Fig. 4 shows a recognition result; Fig. 5 illustrates similar-sound recognition; Fig. 6 illustrates fault-tolerant processing; Fig. 7 is the software architecture diagram of the cloud-computing-based intelligent voice application technology platform.

Claims

1. The system architecture of an intelligent speech recognition system, characterized in that it adopts a distributed architecture with intelligent load balancing, providing flexible and efficient application deployment.

2. The speech recognition process of the intelligent speech recognition system, characterized by speech preprocessing and multi-result matching, which improve recognition execution efficiency and recognition accuracy.

4. The reliability of the intelligent speech recognition system, characterized in that even if an individual server fails, the system does not crash and not a single recognition request is lost; when a recognition server fails, the resource manager automatically stops sending requests to it, and when the server recovers, the resource manager automatically resumes sending requests to it.

5. The fault-tolerant processing of the intelligent speech recognition system, characterized in that the system can accept re-entry from the user while it is returning a result.

6. The intelligent endpointing of the intelligent speech recognition system, characterized in that endpointing is the process of determining where an utterance starts and ends in the incoming sample stream.

7. The hotword recognition of the intelligent speech recognition system, characterized in that the recognition server listens silently and interacts with the user only when the user says a specific phrase to request it.

8. The barge-in function of the intelligent speech recognition system, characterized in that the user can interrupt a voice prompt and respond without waiting for the prompt to finish; barge-in makes the exchange between user and system faster and more natural, particularly for frequent users of the system.

9. The noise reduction of the intelligent speech recognition system, characterized in that the system uses a dedicated mechanism that lets the recognition server recognize more accurately; the recognition server enhances the incoming speech so that noise such as tones, buzzing, humming, and hissing is filtered out effectively.