Summary of the invention
Existing machine learning methods often require repeated user intervention before the user obtains the desired result. To solve this problem, the present invention proposes a user word recognition method and a machine learning system for a sentence-level Chinese character input method.
The user word recognition method in the sentence-level Chinese character input method of the present invention is a position-based user word recognition method, in which:
For a root c, the probability that the root c occurs in a word at position rp is taken as the word-formation ability IWP(c, rp) of the root c:
$$\mathrm{IWP}(c, rp) = \frac{C(\mathrm{Word}(c, rp))}{C(c)}$$
where C(Word(c, rp)) is the number of words in which the root c occurs at position rp in the corpus used to train the language model, and C(c) is the number of times the root c occurs in that corpus. When the word-formation ability IWP(c, rp) is greater than a threshold δ (0<δ<1), the corresponding word is taken as a user word; otherwise it is not taken as a user word;
For a word string S = c1, c2, ..., cl (l > 1), the geometric mean of the word-formation abilities of the roots in the string is taken as the word-formation ability IWP(S) of the string:
$$\mathrm{IWP}(S) = \left( \prod_{i=1}^{l} \mathrm{IWP}(c_i, rp_i) \right)^{1/l}$$
where rp_i denotes the position of c_i within S. When IWP(S) ≥ δ (0<δ≤1), S is taken as a user word; otherwise S is not taken as a user word.
When evaluating whether single characters and words combine into a user word, the above position-based user word recognition method uses the word-formation ability at a relative position, IWP(c, rp), as its evaluation criterion: candidate user words are selected by calculating IWP(S), and statistical information then determines whether a candidate is a user word.
The present invention also provides a machine learning system for a sentence-level Chinese character input method. The system consists of a user word recognition module and an online one-shot learning module, wherein:
The user word recognition module is used to recognize whether the final output obtained through user intervention in the sentence-level Chinese character input method is a user word, to encode each word judged to be a user word, and then to store the encoded user word in the user lexicon of the sentence-level Chinese character input method;
The online one-shot learning module is used, when the optimal path output by the sentence-level Chinese character input method is inconsistent with the final path, to perform online one-shot learning from the optimal path output by the input method and the final path obtained through user intervention, to adjust the weights of the corresponding words according to the learning result, and then to revise the user language model library.
In the user word recognition module, user words are recognized as follows:
For a root c, the probability that the root c occurs in a word at position rp is taken as the word-formation ability IWP(c, rp) of the root c:
$$\mathrm{IWP}(c, rp) = \frac{C(\mathrm{Word}(c, rp))}{C(c)}$$
where C(Word(c, rp)) is the number of words in which the root c occurs at position rp in the corpus used to train the language model, and C(c) is the number of times the root c occurs in that corpus. When the word-formation ability IWP(c, rp) is greater than a threshold δ (0<δ<1), the corresponding word is taken as a user word; otherwise it is not taken as a user word;
For a word string S = c1, c2, ..., cl (l > 1), the geometric mean of the word-formation abilities of the roots in the string is taken as the word-formation ability IWP(S) of the string:
$$\mathrm{IWP}(S) = \left( \prod_{i=1}^{l} \mathrm{IWP}(c_i, rp_i) \right)^{1/l}$$
where rp_i denotes the position of c_i within S. When IWP(S) ≥ δ (0<δ≤1), S is taken as a user word; otherwise S is not taken as a user word;
The online one-shot learning method in the online one-shot learning module proceeds as follows:
Step 1: align the pinyin-to-character conversion output path cRoad[M] and the final candidate path wRoad[N] by length, obtaining the aligned conversion output path cRoadA[L] and the aligned final candidate path wRoadA[L]; M, N and L denote the number of words contained in the respective paths;
Step 2: set i = 1;
Step 3: from the information in the language model, calculate p(cRoadA[i] | cRoadA[i-1]) and p(wRoadA[i] | wRoadA[i-1]); then, using these two values, apply the maximum a posteriori (MAP) probability method to calculate the user adjustment value C_A that maximizes the posterior probability; add (wRoad[i-1], wRoad[i]) and the corresponding C_A to the user language model library as a bigram;
Step 4: set i = i + 1; if i ≤ L, return to Step 3; otherwise the one-shot learning pass is complete.
Here p(cRoadA[i] | cRoadA[i-1]) denotes the probability, in the conversion output path cRoadA[L], that the i-th word is cRoadA[i] given that the (i-1)-th word is cRoadA[i-1]; p(wRoadA[i] | wRoadA[i-1]) denotes the probability, in the final candidate path wRoadA[L], that the i-th word is wRoadA[i] given that the (i-1)-th word is wRoadA[i-1].
The user adjustment value C_A that maximizes the posterior probability is calculated by the maximum a posteriori (MAP) probability method from the following quantities: C_B, the word frequency of each word node in the user's candidate path; the word frequency of each word node in the calculated maximum-probability path; and ε, a small constant, for example a natural number between 1 and 10. In addition, w denotes a word, W denotes the path formed by the words w, and σ denotes a weighting factor, which may be a rational number between 0 and 2; σ(·) means that the expression in parentheses is evaluated first and then multiplied by σ.
The above online one-shot learning method is applicable to any existing Chinese input method or input system that uses a statistical language model, where it is used to build and revise the user language model library.
To make effective use of the user's input information, the result of pinyin-to-character conversion and the final result obtained through user intervention must both serve as inputs to the adaptive model. After processing by the adaptive model, they influence the pinyin-to-character conversion model, thereby achieving online learning.
Applied to an existing Chinese input method or system, the above online one-shot learning method effectively solves the problems of existing Chinese character input methods or systems described in the background art. For the first phenomenon, "1. one code corresponds to several Chinese characters": once the language model has been adjusted by the online one-shot learning method of the present invention, entering the same pinyin string a second time directly yields the output "Harbin Institute of Technology Intelligent Computing Center". For the second phenomenon, "2. combination ambiguity between words": with the online one-shot learning method of the present invention, "Pei Jianli" is added to the user lexicon after the user intervenes on the first input. Thereafter, whether this name is entered on its own or within a sentence, the conversion result the user wants is obtained. For the third phenomenon, "3. change of user habits": with the online one-shot learning method of the present invention, the word is remembered after the first user intervention and is output as the best candidate from then on, so repeated intervention is unnecessary and this kind of intervention is greatly reduced.
Embodiment
Embodiment one: the user word recognition method in the Chinese input method of this embodiment is as follows:
For a root c, the probability that the root c occurs in a word at position rp is taken as the word-formation ability IWP(c, rp) of the root c:
$$\mathrm{IWP}(c, rp) = \frac{C(\mathrm{Word}(c, rp))}{C(c)}$$
where C(Word(c, rp)) is the number of words in which the root c occurs at position rp in the corpus used to train the language model, and C(c) is the number of times the root c occurs in that corpus. When the word-formation ability IWP(c, rp) is greater than a threshold δ (0<δ<1), the corresponding word is taken as a user word; otherwise it is not taken as a user word;
For a word string S = c1, c2, ..., cl (l > 1), the geometric mean of the word-formation abilities of the roots in the string is taken as the word-formation ability IWP(S) of the string:
$$\mathrm{IWP}(S) = \left( \prod_{i=1}^{l} \mathrm{IWP}(c_i, rp_i) \right)^{1/l}$$
where rp_i denotes the position of c_i within S. When IWP(S) ≥ δ (0<δ≤1), S is taken as a user word; otherwise S is not taken as a user word.
The user lexicon in this embodiment is stored as a hash table. The storage format is shown in Fig. 2. Each record comprises a label i, a keyword w0, the attribute value of the keyword, and a linked list of data related to w0. This linked list contains: the related unit (w0, WK0), the attribute value of (w0, WK0), a pointer to the next item, ..., the related unit (w0, WK0+n0), the attribute value of (w0, WK0+n0), and a terminator. This storage scheme is well suited to user words, which must be changed dynamically.
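The storage layout described above can be sketched in Python as a hash table keyed by the keyword w0, with each record holding an attribute value and a chain of related units. This is a minimal sketch only: the names `Entry` and `UserLexicon`, the use of a count as the attribute value, and the use of a nested dict in place of an explicit linked list are assumptions for illustration and are not taken from Fig. 2.

```python
from dataclasses import dataclass, field


@dataclass
class Entry:
    """One hash-table record: keyword attribute plus its related units."""
    attribute: int = 0                           # assumed: a count-like value
    related: dict = field(default_factory=dict)  # WK -> attribute of (w0, WK)


class UserLexicon:
    """Hash table keyed by keyword w0 (hypothetical structure)."""

    def __init__(self):
        self._table = {}

    def add_word(self, w0, attribute=1):
        # Create the record on first use, then accumulate its attribute value.
        entry = self._table.setdefault(w0, Entry())
        entry.attribute += attribute
        return entry

    def add_related(self, w0, wk, attribute=1):
        # Append or update the related unit (w0, wk) in the record's chain.
        entry = self._table.setdefault(w0, Entry())
        entry.related[wk] = entry.related.get(wk, 0) + attribute

    def lookup(self, w0):
        # Returns None when w0 is not a stored user word.
        return self._table.get(w0)
```

Because records are created and updated in place, adding or adjusting a user word does not require rebuilding the table, which matches the dynamic-update motivation given above.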
A user word is a word that the user needs but that is not in the dictionary of the Chinese input system. Recording such words improves the user's input efficiency.
From a linguistic point of view, user words can be roughly divided by source into the following classes:
1. named entities: including personal names, place names, trade names, company names, organization names, etc.;
2. abbreviations: such as "World Trade Organization", "South Airways", etc.;
3. dialect words: such as "beautiful", "footing the bill", etc.;
4. coinages: such as "sharp brother", "phoenix elder sister", etc.;
5. technical terms: such as "wireless network", "trigger", etc.;
6. transliterated words: such as "extremely", "show", "clone", etc.;
7. alphabetic words: such as "WTO", "UN", etc.;
8. old words whose meaning or usage has changed: such as "step down", "charging", etc.;
9. words created by the user: such as "An Bami", etc.
The main difficulty in user word recognition is establishing a reasonable criterion. When evaluating whether single characters and words combine into a user word, the user word recognition method of this embodiment uses the word-formation ability at a relative position, IWP(c, rp), as its evaluation criterion: candidate user words are selected by calculating IWP(S), and statistical information then determines whether a candidate is a user word.
The user word recognition method of this embodiment is illustrated by the following example:
Chinese contains a large number of words derived from a root, such as "straw hat" and "sunbonnet". The word-formation position is therefore divided into three classes:
1. The root is at the head of the word: such as "working", "top", etc.; relative position rp=1.
2. The root is in the middle of the word: such as "not understanding puzzled", "cannot bear to part", etc.; relative position rp=2.
3. The root is at the end of the word: such as "mouse", "squirrel", etc.; relative position rp=3.
Therefore, using the probability that a root appears in a word at position rp improves the accuracy of the word-formation judgment.
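The position classes above, together with the definition of IWP(c, rp) as C(Word(c, rp)) / C(c), can be sketched as a one-off statistics pass over a segmented corpus. This is a sketch under stated assumptions: the function names, the representation of the corpus as a list of already segmented words, counting word occurrences rather than distinct words, and the decision that single-character words contribute to C(c) but to no position class are all illustrative choices, not details given in the text.

```python
from collections import Counter


def relative_position(index, length):
    """Map a character index in a word to rp: 1 = head, 2 = middle, 3 = tail."""
    if index == 0:
        return 1
    if index == length - 1:
        return 3
    return 2


def build_iwp_table(segmented_corpus):
    """Return {(c, rp): IWP(c, rp)} computed from a list of segmented words."""
    word_counts = Counter()  # C(Word(c, rp))
    char_counts = Counter()  # C(c)
    for word in segmented_corpus:
        for i, c in enumerate(word):
            char_counts[c] += 1
            if len(word) > 1:  # assumption: a lone character has no position class
                word_counts[(c, relative_position(i, len(word)))] += 1
    return {key: n / char_counts[key[0]] for key, n in word_counts.items()}


# Illustrative toy corpus (not data from the source).
table = build_iwp_table(["草帽", "太阳帽", "帽", "草"])
```

In this toy corpus the character 帽 occurs three times, twice at the tail of a multi-character word, so its tail-position value is 2/3.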
For a word string S = c1, c2, ..., cl (l > 1), the word-formation ability is obtained as a geometric mean, which avoids the risk of "short words first":
$$\mathrm{IWP}(S) = \left( \prod_{i=1}^{l} \mathrm{IWP}(c_i, rp_i) \right)^{1/l}$$
This user word recognition method performs effective user word recognition at very low computational cost, because all statistics are computed in advance and saved to files. When judging a user word, the method reads the precomputed statistics directly from these files and does not need to compute them again. This greatly reduces the computational load on the system and satisfies the real-time requirement of an input system; the files holding the statistics are also small.
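The lookup-based judgment described above can be sketched as follows. The sketch assumes the precomputed statistics have already been loaded into a dict keyed by (character, rp); the function names, the threshold value 0.5, and the treatment of an unseen (character, rp) pair as probability zero are assumptions for illustration.

```python
import math


def iwp_string(chars, table):
    """Geometric mean of IWP(c_i, rp_i) over the string; 0.0 if any pair is unseen."""
    n = len(chars)
    log_sum = 0.0
    for i, c in enumerate(chars):
        rp = 1 if i == 0 else 3 if i == n - 1 else 2  # head, tail, else middle
        p = table.get((c, rp), 0.0)
        if p <= 0.0:
            return 0.0  # assumption: an unseen pair rules the candidate out
        log_sum += math.log(p)
    return math.exp(log_sum / n)


def is_user_word(chars, table, delta=0.5):
    """Accept S as a user word when l > 1 and IWP(S) >= delta (delta is assumed)."""
    return len(chars) > 1 and iwp_string(chars, table) >= delta


# Hypothetical precomputed statistics.
stats = {("a", 1): 0.8, ("b", 3): 0.5}
score = iwp_string("ab", stats)
```

Working in log space keeps the product numerically stable for longer strings, and taking the l-th root means a long candidate is not penalized merely for having more factors.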
The user word recognition method of this embodiment can be applied to any existing Chinese input method or input system that uses a statistical language model.
Embodiment two: the online one-shot learning method in the sentence-level input method of this embodiment is as follows:
Step 1: align the pinyin-to-character conversion output path cRoad[M] and the final candidate path wRoad[N] by length, obtaining the aligned conversion output path cRoadA[L] and the aligned final candidate path wRoadA[L]; M, N and L denote the number of words contained in the respective paths;
Step 2: set i = 1;
Step 3: from the information in the language model, calculate p(cRoadA[i] | cRoadA[i-1]) and p(wRoadA[i] | wRoadA[i-1]); then, using these two values, apply the maximum a posteriori (MAP) probability method to calculate the user adjustment value C_A that maximizes the posterior probability; add (wRoad[i-1], wRoad[i]) and the corresponding C_A to the user language model library as a bigram;
Step 4: set i = i + 1; if i ≤ L, return to Step 3; otherwise the one-shot learning pass is complete.
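The control flow of Steps 2 to 4 can be sketched as a single pass over two paths that have already been aligned to a common length. This is only a sketch of the loop structure: `bigram_prob` and `adjust` are hypothetical callables supplied by the caller, `adjust` stands in for the MAP computation of C_A and is not that formula, the `"<s>"` start marker used as the predecessor of the first word is an assumption, and skipping positions where the two paths agree is an assumption consistent with adjusting only the erroneous part.

```python
def one_shot_learn(c_road_a, w_road_a, bigram_prob, adjust, user_model):
    """Walk two aligned paths and record an adjustment for each differing bigram."""
    if len(c_road_a) != len(w_road_a):
        raise ValueError("paths must be aligned to the same length L")
    prev_c = prev_w = "<s>"  # assumed start-of-sentence marker for i = 1
    for c_word, w_word in zip(c_road_a, w_road_a):
        if (prev_c, c_word) != (prev_w, w_word):  # only the mismatching part
            p_c = bigram_prob(prev_c, c_word)     # p(cRoadA[i] | cRoadA[i-1])
            p_w = bigram_prob(prev_w, w_word)     # p(wRoadA[i] | wRoadA[i-1])
            user_model[(prev_w, w_word)] = adjust(p_c, p_w)
        prev_c, prev_w = c_word, w_word
    return user_model


# Hypothetical bigram probabilities and a placeholder adjustment rule.
probs = {("<s>", "a"): 0.5, ("a", "x"): 0.4, ("a", "y"): 0.1}
model = one_shot_learn(
    ["a", "x"],                              # conversion output path
    ["a", "y"],                              # path chosen by the user
    lambda u, v: probs.get((u, v), 0.0),
    lambda p_c, p_w: p_c - p_w,              # placeholder, not the MAP formula
    {},
)
```

Here only the bigram ("a", "y") is stored, because the first position is identical in both paths.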
The online one-shot learning method of this embodiment is started only when the optimal path output by the sentence-level Chinese character input method is inconsistent with the final path. It changes the language model quickly, bringing the original language model as close as possible to the user's language habits. It also avoids the over-learning problem, because once it has achieved its goal it hardly changes the language model any further.
The purpose of Step 1 in this embodiment is as follows: because one pinyin string may correspond to several combinations of Chinese characters and words, the lengths of the conversion output path cRoad[M] and the final candidate path wRoad[N] may differ, so the two paths must be length-aligned; L denotes the length after alignment.
In Step 3 of this embodiment, p(cRoadA[i] | cRoadA[i-1]) denotes the probability, in the conversion output path cRoadA[L], that the i-th word is cRoadA[i] given that the (i-1)-th word is cRoadA[i-1]; p(wRoadA[i] | wRoadA[i-1]) denotes the probability, in the final candidate path wRoadA[L], that the i-th word is wRoadA[i] given that the (i-1)-th word is wRoadA[i-1].
The probability calculation in this embodiment is based on the N-gram probability calculation already used by sentence-level input methods. It obtains the probability P(S) of a sentence S composed of m words w1, w2, ..., wm by the formula:
$$P(S) = \prod_{i=1}^{m} P(w_i \mid w_{i-n+1}, \ldots, w_{i-1})$$
where n is the value of N in the N-gram, and P(w_i | ...) is the statistical probability of the word w_i in the language model given the preceding n-1 words.
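As an illustration of the N-gram product, the following sketch evaluates log P(S) with a history of at most n-1 words. The callable `cond_prob(history, word)` is a hypothetical interface to the language model and is assumed to return a strictly positive probability; truncating the history at the start of the sentence is also an assumption.

```python
import math


def sentence_log_prob(words, cond_prob, n=2):
    """log P(S) = sum over i of log P(w_i | previous n-1 words)."""
    total = 0.0
    for i, w in enumerate(words):
        # History is shorter than n-1 for the first words of the sentence.
        history = tuple(words[max(0, i - n + 1):i])
        total += math.log(cond_prob(history, w))
    return total


# Hypothetical bigram model: P(a) = 0.5, P(b | a) = 0.2.
toy = {((), "a"): 0.5, (("a",), "b"): 0.2}
log_p = sentence_log_prob(["a", "b"], lambda h, w: toy[(h, w)])
```

Summing logarithms rather than multiplying probabilities avoids underflow on long sentences and leaves the ranking of candidate paths unchanged.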
This embodiment uses the maximum a posteriori (MAP) probability method with a dynamic weighting factor to calculate the user adjustment value C_A that maximizes the posterior probability. The main idea of the MAP method is to adjust P(w) for certain characters and words so that they win in the next probability calculation. In this way, for the adjusted sentence S', the newly calculated P(S') is larger than the P(S) calculated for the erroneous word combination S. In this embodiment, C_A is obtained from the following quantities: C_B, the word frequency of each word node in the user's candidate path; the word frequency of each word node in the calculated maximum-probability path; and ε, a small constant, for example a natural number between 1 and 10. In addition, w denotes a word, W denotes the path formed by the words w, and σ denotes a weighting factor, which may be a rational number between 0 and 2; σ(·) means that the expression in parentheses is evaluated first and then multiplied by σ. The present invention adjusts only the erroneous part.
In adjusting the language model, the main problem is how to incorporate the user's input habits into the background model, that is, how to re-estimate the parameters of the original language model. For online learning, this re-estimation must be as fast as possible, and it must be possible to restore the original parameters when the adaptation is detected to be unreasonable.
If the probability P(w) of some character or word is higher than that of the other homophonous characters or words in the same path, the input system takes it as the best candidate. If it is not what the user wants, the user must select a candidate manually. In an input system without learning, the same word appears as the best candidate again the next time, and the user must again select manually, which greatly reduces the user's input efficiency.
Because the language model is statistical, its parameters can be adjusted by dynamically recording the user's input history. The online one-shot learning method of the present invention changes the language model quickly, bringing the original language model as close as possible to the user's language habits. It also avoids the over-learning problem, because once it has achieved its goal it hardly changes the language model any further.
Analysis shows that the time and space complexity of the one-shot learning method of this embodiment fully satisfies the real-time requirement of an input system. First, the method runs only when the result is not what the user wants, so the probability that it is invoked is low. Second, when it runs, it modifies only the erroneous part of the sentence rather than recalculating every word in the sentence. Finally, when making a correction, it considers only the word in question and the words competing with it; such words are few, so the amount of computation is small. For storage, a hash table is used, so the time cost of lookup and storage is small. As for storage space, the method stores only the words that need adjustment, so it occupies far less space than the system's statistics library.
Embodiment three: the machine learning system in the sentence-level Chinese character input method of this embodiment is implemented with the user word recognition method of embodiment one and the online one-shot learning method of embodiment two. The system consists of a user word recognition module and an online one-shot learning module, wherein:
The user word recognition module is used to recognize whether the final output obtained through user intervention in the sentence-level Chinese character input method is a user word, to encode each word judged to be a user word, and then to store the encoded user word in the user lexicon of the sentence-level Chinese character input method;
The online one-shot learning module is used, when the optimal path output by the sentence-level Chinese character input method is inconsistent with the final path, to perform online one-shot learning from the optimal path output by the input method and the final path obtained through user intervention, to adjust the weights of the corresponding words according to the learning result, and then to revise the user language model library.
In the user word recognition module of this embodiment, user words are recognized by the user word recognition method of embodiment one.
In the online one-shot learning module of this embodiment, the online one-shot learning method is the learning method of embodiment two.
The application model formed by applying the machine learning system of this embodiment to an existing sentence-level input system or method is shown in Fig. 1.
The final output of this application model is obtained as follows: the Chinese character conversion result from the adaptation module, the conversion result from the user language model library, and the conversion result from the language model library of the original input method or input system are each multiplied by a weighting coefficient (a rational number between 0 and 1) and then summed; the most probable Chinese character combination is calculated from the sum and sent to the input method or input system as the final Chinese character combination.
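The weighted combination described above can be sketched as a linear mix of per-source candidate scores. The function name, the representation of each source's result as a dict from candidate string to score, and the specific weights are assumptions for illustration; the text specifies only that each weight is a rational number between 0 and 1.

```python
def combine_candidates(scored_sources, weights):
    """Sum weight * score per candidate over all sources; return best and totals."""
    totals = {}
    for scores, weight in zip(scored_sources, weights):
        for candidate, score in scores.items():
            totals[candidate] = totals.get(candidate, 0.0) + weight * score
    return max(totals, key=totals.get), totals


# Hypothetical scores from the three sources named above.
adaptation = {"x": 0.2, "y": 0.6}
user_model = {"y": 0.5}
base_model = {"x": 0.9, "y": 0.1}
best, totals = combine_candidates(
    [adaptation, user_model, base_model],
    [0.3, 0.3, 0.4],  # assumed weighting coefficients in [0, 1]
)
```

With these numbers candidate "x" totals 0.42 and candidate "y" totals 0.37, so "x" is selected; raising the weight of the user language model would shift the choice toward "y".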
In this model, during user word recognition the original system must provide its own dictionary to the user word recognition method, which can then judge whether the user's input should form a user word and build the user lexicon. The original Chinese character input method only needs to read user words from the lexicon built by this method and use them in the optimal-path calculation.
In the online one-shot learning method, the original system must provide the statistical information of its own language model. From this information and the user's input, the method calculates the adjustment amount and generates the user language model. The original Chinese character input method only needs to read the statistical information from the user language model built by this method and use it in the optimal-path calculation.