CN102004560B

Info

Publication number: CN102004560B
Application number: CN 201010567997
Authority: CN
Inventors: 刘秉权; 王晓龙; 刘峰; 刘远超; 林磊; 孙承杰; 单丽莉; 刘铭
Original assignee: Harbin Institute of Technology Shenzhen
Current assignee: Harbin Institute of Technology Shenzhen
Priority date: 2010-12-01
Filing date: 2010-12-01
Publication date: 2013-07-24
Anticipated expiration: 2030-12-01
Also published as: CN102004560A

Abstract

A user word recognition method and a machine learning system in a sentence-level Chinese character input method, relating to the technical field of machine learning for Chinese character input. The invention addresses the problem that existing machine learning methods often require user intervention before the final result is obtained. The user word recognition method identifies user words using relative-position word-forming ability as the evaluation criterion. The learning method starts only when the optimal path output by the input method differs from the final output path; it obtains probability values with an N-gram-based probability calculation, then uses maximum a posteriori (MAP) estimation to obtain a user adjustment value C_A, and stores C_A together with the corresponding word in the user language model library. The machine learning system is a learning system implemented by applying the above user word recognition method and learning method. The invention reduces the number of interventions needed during input, so that the user obtains the desired output more easily.

Description

User word recognition method and machine learning system in a sentence-level Chinese character input method

Technical field

The present invention relates to a user word recognition method and an online learning method within machine learning methods for Chinese character input.

Background technology

Machine learning methods for sentence-level Chinese character input automatically adjust the best Chinese character combination according to the user's input habits, and are applicable to various Chinese character input methods and input systems.

With the continuous advance of natural language processing and artificial intelligence theory, Chinese character input technology has improved accordingly, but no input technology has yet achieved perfect conversion; each has its own shortcomings. For pinyin input methods in particular, no existing product reaches 100% conversion accuracy, and all require user intervention, to varying degrees and in various ways, before producing the output the user needs. Improving these systems with the present method can greatly reduce the number of required user interventions and thereby improve conversion accuracy.

In input methods and input systems that encode Chinese characters for computers, such as pinyin input, speech input and fuzzy handwriting recognition, one code often corresponds to several Chinese characters. This shows up in the following ways:

1. One code corresponds to several Chinese characters. For example, when the pinyin "cheng" is entered, the corresponding characters include "one-tenth", "city", "title", "being", etc. When the pinyin string "chengshi" is entered, the corresponding words include "city", "honesty", "formula", "succeeding", etc. The same happens when longer sentences are entered. If the first choice offered by the input system is not what the user wants, the user has to select the desired result manually. With an online one-shot learning function, the input method can record the user's input habits and offer the user's most frequent result as the first choice. For example, in a common input method system, when the pinyin "haerbingongyedaxuezhinengjisuanzhongxin" is entered for the first time, the result is "function computing center of Harbin Institute of Technology", because the word "function" has a higher frequency in the statistics library. After the user intervenes once, the language model is adjusted. Here, intervention means that the user manually replaces "function" with the candidate "intelligence".

2. Combination ambiguity between words. When a short sentence containing the personal name "Pei Jianli" is entered (roughly, "tell Pei Jianli to attend a meeting tomorrow"), the pinyin string "mingtianjiaopeijianlikaihui" is usually converted to "meeting is set up in mating tomorrow". This is because the input system's lexicon contains the two words "mating" (jiaopei) and "establish" (jianli) but not the name "Pei Jianli", so more user intervention is needed to obtain the correct result. If the input system has no user word construction and no corresponding online learning function, the user has to make extensive corrections every time.

3. Changes in the user's habits. For example, a user who studies electronic engineering constantly uses the word "chip"; he is also a film fan and writes "new film" recommendations every evening.

With existing input method learning approaches, the user has to keep intervening when entering these two words.

Summary of the invention

To solve the problem that existing machine learning methods often need user intervention before the result the user needs is obtained, the present invention proposes a user word recognition method and a machine learning system in a sentence-level Chinese character input method.

The user word recognition method in the sentence-level Chinese character input method of the present invention is a position-based user word recognition method, in which:

For a root c, the probability that root c occurs at position rp within a word is taken as the word-forming ability IWP(c, rp) of the root:

$$\mathrm{IWP}(c, rp) = \frac{C(\mathrm{Word}(c, rp))}{C(c)} \tag{1}$$

where C(Word(c, rp)) is the number of words in which root c occurs at position rp in the corpus used to train the language model, and C(c) is the number of times root c occurs in that corpus. When IWP(c, rp) is greater than a threshold δ (0 < δ < 1), the corresponding word is taken as a user word; otherwise it is not.

For a word string S = c₁, c₂, ..., c_l (l > 1), the geometric mean of the word-forming abilities of its roots is taken as the word-forming ability IWP(S) of the string:

$$\mathrm{IWP}(S) = \sqrt[l]{\prod_{i=1}^{l} \mathrm{IWP}(c_i, rp)} \tag{2}$$

When IWP(S) ≥ δ (0 < δ ≤ 1), S is taken as a user word; otherwise it is not.

When evaluating whether single characters and words combine into a user word, the above position-based method uses the relative-position word-forming ability IWP(c, rp) as the evaluation criterion: candidate user words are selected by computing IWP(S), and statistical information then determines whether each candidate is a user word.
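Formulas (1) and (2) can be illustrated with a minimal Python sketch. The count tables, the placeholder roots "a", "b", "c", the default threshold of 0.4 and all function names are assumptions made for illustration; the position mapping (1 = word-initial, 2 = middle, 3 = word-final) follows the relative positions described in Embodiment one.

```python
import math

# Hypothetical precomputed statistics (illustrative numbers, not from the patent):
# word_pos_count[(c, rp)] = C(Word(c, rp)), char_count[c] = C(c).
word_pos_count = {("a", 1): 30, ("b", 2): 20, ("c", 3): 45}
char_count = {"a": 60, "b": 80, "c": 50}


def iwp_char(c, rp):
    """Formula (1): IWP(c, rp) = C(Word(c, rp)) / C(c)."""
    total = char_count.get(c, 0)
    if total == 0:
        return 0.0  # assumption: an unseen root carries no word-forming evidence
    return word_pos_count.get((c, rp), 0) / total


def relative_position(i, length):
    """Map a 0-based index to rp: 1 = word-initial, 3 = word-final, 2 = middle."""
    if i == 0:
        return 1
    if i == length - 1:
        return 3
    return 2


def iwp_string(s):
    """Formula (2): geometric mean of the per-root IWP values."""
    scores = [iwp_char(c, relative_position(i, len(s))) for i, c in enumerate(s)]
    if any(x == 0.0 for x in scores):
        return 0.0
    # Summing logarithms avoids underflow for long strings.
    return math.exp(sum(math.log(x) for x in scores) / len(scores))


def is_user_word(s, delta=0.4):
    """Accept S (with l > 1) as a user word when IWP(S) >= delta."""
    return len(s) > 1 and iwp_string(s) >= delta
```

Because the score is a geometric mean, one root with a very low word-forming ability pulls the whole string down, while the l-th root keeps longer strings comparable with shorter ones.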

The present invention also provides a machine learning system in a sentence-level Chinese character input method. The system consists of a user word recognition module and an online one-shot learning module, wherein:

The user word recognition module identifies whether the final output of the sentence-level Chinese character input method, obtained through user intervention, is a user word; it encodes each word judged to be a user word and stores the resulting user word code in the user lexicon of the sentence-level Chinese character input method;

The online one-shot learning module operates when the optimal path output by the sentence-level Chinese character input method differs from the final path. It performs online one-shot learning from the optimal path output by the input method and the final path obtained through user intervention, adjusts the weights of the corresponding words according to the learning result, and then updates the user language model library.

In the user word recognition module, user words are recognized as follows:

$$\mathrm{IWP}(c, rp) = \frac{C(\mathrm{Word}(c, rp))}{C(c)} \tag{1}$$

where C(Word(c, rp)) is the number of words in which root c occurs at position rp in the corpus used to train the language model, and C(c) is the number of times root c occurs in that corpus. When the word-forming ability IWP(c, rp) is greater than a threshold δ (0 < δ < 1), the corresponding word is taken as a user word; otherwise it is not;

$$\mathrm{IWP}(S) = \sqrt[l]{\prod_{i=1}^{l} \mathrm{IWP}(c_i, rp)} \tag{2}$$

When IWP(S) ≥ δ (0 < δ ≤ 1), S is taken as a user word; otherwise it is not;

The online one-shot learning method in the online one-shot learning module proceeds as follows:

Step 1: align the pinyin-to-character conversion output path cRoad[M] and the final candidate path wRoad[N] by length, obtaining the aligned conversion output path cRoadA[L] and the aligned final candidate path wRoadA[L]; M, N and L denote the number of words contained in the respective paths;

Step 2: let i = 1;

Step 3: using the information in the language model, compute p(cRoadA[i] | cRoadA[i-1]) and p(wRoadA[i] | wRoadA[i-1]); then, using these two values, apply the maximum a posteriori (MAP) probability method to compute the user adjustment value C_A that maximizes the posterior probability; add (wRoad[i-1], wRoad[i]) and the corresponding C_A to the user language model library as a bigram;

Step 4: let i = i + 1; if i ≤ L, return to Step 3; otherwise this round of learning is finished.

Here p(cRoadA[i] | cRoadA[i-1]) denotes the probability that, in the conversion output path cRoadA[L], the i-th word is cRoadA[i] given that the (i-1)-th word is cRoadA[i-1]; p(wRoadA[i] | wRoadA[i-1]) denotes the probability that, in the final candidate path wRoadA[L], the i-th word is wRoadA[i] given that the (i-1)-th word is wRoadA[i-1].
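The loop in Steps 2 to 4 can be sketched in Python as follows. This is only an illustration under stated assumptions: the paths are taken as already aligned (Step 1 is not implemented because the text does not specify the alignment procedure), index 0 holds a hypothetical sentence-start marker so that index i - 1 exists when i = 1, positions where both paths share the same bigram are skipped, and `compute_c_a` is a hypothetical callback standing in for the MAP calculation.

```python
def online_one_shot_learning(c_road_a, w_road_a, bigram_prob, compute_c_a, user_model):
    """Walk two aligned paths and store one adjustment value per differing bigram.

    c_road_a, w_road_a: aligned paths of equal length L + 1; index 0 is a
        sentence-start marker (assumption).
    bigram_prob(prev, cur): returns p(cur | prev) from the language model.
    compute_c_a(p_c, p_w): hypothetical callback returning the adjustment C_A.
    user_model: dict mapping (prev, cur) bigrams to C_A.
    """
    if len(c_road_a) != len(w_road_a):
        raise ValueError("paths must be aligned to the same length first")
    length = len(w_road_a) - 1
    i = 1                                    # Step 2
    while i <= length:                       # loop condition of Step 4
        c_bigram = (c_road_a[i - 1], c_road_a[i])
        w_bigram = (w_road_a[i - 1], w_road_a[i])
        if c_bigram != w_bigram:             # assumption: touch only the wrong part
            p_c = bigram_prob(*c_bigram)     # Step 3: the two conditional probabilities
            p_w = bigram_prob(*w_bigram)
            user_model[w_bigram] = compute_c_a(p_c, p_w)
        i += 1                               # Step 4
    return user_model


# Toy usage with made-up words and probabilities.
toy_probs = {("<s>", "x"): 0.5, ("x", "y"): 0.4, ("y", "z"): 0.3,
             ("x", "q"): 0.1, ("q", "z"): 0.2}
learned = online_one_shot_learning(
    ["<s>", "x", "y", "z"],                  # conversion output path cRoadA
    ["<s>", "x", "q", "z"],                  # final path wRoadA chosen by the user
    lambda prev, cur: toy_probs[(prev, cur)],
    lambda p_c, p_w: p_c - p_w,              # placeholder, NOT the patent's formula (4)
    {},
)
```

In the toy run only the two bigrams that differ from the conversion output, ("x", "q") and ("q", "z"), end up in the user model.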

The maximum a posteriori (MAP) probability method computes the user adjustment value C_A that maximizes the posterior probability as follows:

$$C_A = \frac{C_B \sum_{w \in W} C_B^{*} - C_B^{*} \sum_{w \in W} C_B}{\sigma \left( \sum_{w \in W} C_B - C_B \right)} + \epsilon \tag{4}$$

where C_B is the word frequency of each word node on the user's candidate path; C_B^* is the word frequency of each word node on the computed maximum-probability path; ε is a small constant, for example a natural number between 1 and 10; w denotes a word and W denotes the path composed of said words w; σ is a weighting factor that may take a rational value between 0 and 2; and σ(·) means that the expression in parentheses is evaluated first and then multiplied by σ.
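As a hedged illustration, formula (4) can be evaluated directly once the four frequency quantities are known. The function name, the parameter names and the default values σ = 1 and ε = 1 are assumptions; the sketch takes the two path sums as inputs rather than deciding how they are gathered, since the text leaves that to the language model.

```python
def user_adjustment(c_b, c_b_star, sum_c_b, sum_c_b_star, sigma=1.0, epsilon=1.0):
    """Evaluate formula (4) for one word node.

    c_b:          word frequency of the node on the user's candidate path (C_B)
    c_b_star:     word frequency of the node on the maximum-probability path (C_B*)
    sum_c_b:      sum of C_B over the words w of path W
    sum_c_b_star: sum of C_B* over the words w of path W
    sigma:        weighting factor, a value between 0 and 2
    epsilon:      small constant, e.g. a natural number between 1 and 10
    """
    # The parenthesised difference is evaluated first, then multiplied by sigma.
    denominator = sigma * (sum_c_b - c_b)
    if denominator == 0:
        raise ValueError("formula (4) is undefined when sigma * (sum C_B - C_B) is 0")
    numerator = c_b * sum_c_b_star - c_b_star * sum_c_b
    return numerator / denominator + epsilon
```

For example, with the made-up values C_B = 10, C_B* = 2, ΣC_B = 12, ΣC_B* = 30, σ = 1 and ε = 1, the numerator is 10·30 − 2·12 = 276, the denominator is 2, and the function returns 139.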

The above online one-shot learning method is applicable to any existing Chinese input method or input system that uses a statistical language model, where it builds and updates the user language model library.

To make effective use of the user's input information, the result of pinyin-to-character conversion and the final result obtained through user intervention must both be fed to the adaptive model; after processing, the adaptive model influences the pinyin-to-character conversion model in some way, thereby achieving online learning.

Applied to an existing Chinese input method or system, the above online one-shot learning method effectively solves the problems described in the background section. For the first phenomenon, "1. One code corresponds to several Chinese characters", once the language model has been adjusted by the online one-shot learning method of the invention, entering the same pinyin string a second time directly yields "intelligent computing center of Harbin Institute of Technology". For the second phenomenon, "2. Combination ambiguity between words", the method adds "Pei Jianli" to the user lexicon after the user's first intervention; from then on, the name is converted as the user wants whether it is entered alone or inside a sentence. For the third phenomenon, "3. Changes in the user's habits", the method remembers the word after the first user intervention and outputs it as the best option, so repeated intervention is unnecessary and this kind of operation is greatly reduced.

Description of drawings

Fig. 1 shows the application model of the machine learning method described in Embodiment three.

Fig. 2 shows the storage structure of the user lexicon described in Embodiment one.

Embodiment

Embodiment one: the user word recognition method in the Chinese input method of this embodiment is:

$$\mathrm{IWP}(c, rp) = \frac{C(\mathrm{Word}(c, rp))}{C(c)} \tag{1}$$

$$\mathrm{IWP}(S) = \sqrt[l]{\prod_{i=1}^{l} \mathrm{IWP}(c_i, rp)} \tag{2}$$

The user lexicon in this embodiment is stored as a hash table. The concrete storage format is shown in Fig. 2: each record contains a label i, a keyword w₀, the attribute value of the keyword, and a linked list of data related to w₀. The linked list contains: the related unit (w₀, W_K0), the attribute value of (w₀, W_K0), a pointer to the next item, ..., the related unit (w₀, W_K0+n0), the attribute value of (w₀, W_K0+n0), and a terminator. This storage scheme suits user words, which need to change dynamically.
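A minimal Python sketch of such a storage structure is given below, under the assumption that a Python dict serves as the hash table and that `None` plays the role of the terminator of the linked list. The class names, method names and the numeric attribute values are hypothetical; the record label i of Fig. 2 is omitted because the dict handles indexing.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RelatedUnit:
    """One related unit (w0, wk) with its attribute value and a next pointer."""
    word: str
    value: float
    next: Optional["RelatedUnit"] = None    # None acts as the terminator


@dataclass
class Entry:
    """Record for keyword w0: its attribute value and the head of its linked list."""
    value: float
    head: Optional[RelatedUnit] = None


class UserLexicon:
    def __init__(self):
        self.table = {}                      # hash table keyed by the keyword w0

    def set_word(self, w0, value):
        entry = self.table.get(w0)
        if entry is None:
            self.table[w0] = Entry(value)
        else:
            entry.value = value

    def set_related(self, w0, wk, value):
        entry = self.table.setdefault(w0, Entry(0.0))
        node = entry.head
        while node is not None:              # update in place if (w0, wk) exists
            if node.word == wk:
                node.value = value
                return
            node = node.next
        entry.head = RelatedUnit(wk, value, entry.head)   # otherwise prepend

    def get_related(self, w0, wk):
        entry = self.table.get(w0)
        node = entry.head if entry is not None else None
        while node is not None:
            if node.word == wk:
                return node.value
            node = node.next
        return None
```

Prepending to the list keeps insertion cheap, which fits a lexicon that is modified while the user types.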

A user word is a word the user needs that is not in the lexicon of the Chinese input system. Recording such words improves the user's input efficiency.

From a linguistic point of view, user words can be roughly divided by source into the following classes:

1. Named entities: personal names, place names, trade names, company names, organization names, etc.;

2. Abbreviations: such as "World Trade Organization", "South Airways", etc.;

3. Dialect words: such as "beautiful", "footing the bill", etc.;

4. Coinages: such as "sharp brother", "phoenix elder sister", etc.;

5. Technical terms: such as "wireless network", "trigger", etc.;

6. Transliterated words: such as "extremely", "show", "clone", etc.;

7. Letter words: such as "WTO", "UN", etc.;

8. Old words whose meaning or usage has changed: such as "step down", "charging", etc.;

9. Words created by the user: such as "An Bami", etc.

The main difficulty in user word recognition is establishing a reasonable criterion. When evaluating whether single characters and words combine into a user word, the method of this embodiment uses the relative-position word-forming ability IWP(c, rp) as the evaluation criterion: candidate user words are selected by computing IWP(S), and statistical information then determines whether each candidate is a user word.

The user word recognition method of this embodiment is illustrated by example below:

Chinese has many derivational phenomena, such as "straw hat", "sunbonnet", etc. Based on the root, word-formation positions are divided into three classes:

1. The root is at the beginning of the word: such as "working", "top", etc.; relative position rp = 1.

2. The root is in the middle of the word: such as "not understanding puzzled", "cannot bear to part", etc.; relative position rp = 2.

3. The root is at the end of the word: such as "mouse", "squirrel", etc.; relative position rp = 3.

Therefore, using the probability that a root occurs at position rp within a word increases the accuracy of word-formation judgments.
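The counts behind these position probabilities could be gathered from a segmented corpus roughly as in the Python sketch below. It is an assumption-laden illustration: the corpus is a list of already segmented word tokens, every token counts once per occurrence, single-character tokens contribute to C(c) but to no position class, and the placeholder tokens stand in for real Chinese words.

```python
from collections import Counter


def build_position_counts(segmented_corpus):
    """Return (word_pos_count, char_count) for a list of segmented word tokens.

    word_pos_count[(c, rp)] approximates C(Word(c, rp)); char_count[c] is C(c).
    rp is 1 for the first root of a word, 3 for the last, 2 for any root in between.
    """
    word_pos_count = Counter()
    char_count = Counter()
    for word in segmented_corpus:
        last = len(word) - 1
        for i, c in enumerate(word):
            char_count[c] += 1
            if last == 0:
                continue                     # single-root token: no position class
            if i == 0:
                rp = 1
            elif i == last:
                rp = 3
            else:
                rp = 2
            word_pos_count[(c, rp)] += 1
    return word_pos_count, char_count


# Placeholder tokens: "b" ends two different words, like a shared suffix root.
toy_corpus = ["ab", "cdb", "b", "a"]
toy_word_pos, toy_chars = build_position_counts(toy_corpus)
```

In the toy corpus the root "b" occurs three times, twice in word-final position, so its suffix word-forming ability would be 2/3.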

For a word string S = c₁, c₂, ..., c_l (l > 1), the word-forming ability is computed as a geometric mean in order to avoid the "short words first" risk.

This user word recognition method performs effective user word recognition at very small computational cost, because all statistics are computed in advance and saved to files. When deciding whether something is a user word, the method reads the precomputed statistics directly from those files and does not need to recount them. This greatly reduces the system's computation and satisfies the real-time requirements of an input system; the statistics files are also small.

The user word recognition method of this embodiment can be applied to any existing Chinese input method or input system that uses a statistical language model.

Embodiment two: the online one-shot learning method in the sentence-level input method of this embodiment is:

Step 2: let i = 1;

Step 4: let i = i + 1; if i ≤ L, return to Step 3; otherwise this round of learning is finished.

The online one-shot learning method of this embodiment starts only when the optimal path output by the sentence-level Chinese character input method differs from the final path. It changes the language model quickly, bringing the original language model as close as possible to the user's language habits. It also avoids over-learning, because once the goal is reached it makes almost no further change to the language model.

The purpose of Step 1 in this embodiment is as follows: because one pinyin string corresponds to several combinations of Chinese characters and words, the lengths of the conversion output path cRoad[M] and the final candidate path wRoad[N] may differ, so the two paths are aligned by length; L denotes the unified length.

In Step 3 of this embodiment, p(cRoadA[i] | cRoadA[i-1]) denotes the probability that, in the conversion output path cRoadA[L], the i-th word is cRoadA[i] given that the (i-1)-th word is cRoadA[i-1]; p(wRoadA[i] | wRoadA[i-1]) denotes the probability that, in the final candidate path wRoadA[L], the i-th word is wRoadA[i] given that the (i-1)-th word is wRoadA[i-1].

The probability calculation in this embodiment is the existing N-gram probability calculation already used by the sentence-level input method. It gives the probability P(S) of a sentence S composed of m words w₁, w₂, ..., w_m:

$$P(S) = P(w_1 w_2 \ldots w_m) = \prod_{i=1}^{m} P(w_i \mid w_{i-1} w_{i-2} \ldots w_{i-n+1}) \tag{3}$$

where n is the order N of the N-gram model and P(w_i) is the statistical probability of word w_i in the language model.
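Formula (3) can be sketched in Python as below. The function name, the `cond_prob` callback and the toy probability table are assumptions for illustration; the sketch works in log space, a common way to avoid numerical underflow, and truncates the history at the start of the sentence.

```python
import math


def sentence_log_prob(words, cond_prob, n=2):
    """Return log P(S) for formula (3), using at most n - 1 preceding words as history.

    cond_prob(word, history): returns P(word | history), where history is a tuple
    of the preceding words in sentence order (empty at the sentence start).
    """
    total = 0.0
    for i, w in enumerate(words):
        history = tuple(words[max(0, i - n + 1):i])
        total += math.log(cond_prob(w, history))
    return total


# Toy bigram table with made-up probabilities.
toy_table = {("a", ()): 0.5, ("b", ("a",)): 0.4, ("c", ("b",)): 0.25}
toy_log_p = sentence_log_prob(["a", "b", "c"], lambda w, h: toy_table[(w, h)])
```

With the toy table, P(S) is the product 0.5 · 0.4 · 0.25 = 0.05, which `math.exp(toy_log_p)` reproduces up to floating-point rounding.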

This embodiment uses the maximum a posteriori (MAP) probability method with a dynamic weighting factor to compute the user adjustment value C_A that maximizes the posterior probability. The main idea of the maximum a posteriori method is to adjust P(w) between certain words so that the desired combination wins in the next probability computation. After the adjustment, the newly computed P(S') of the adjusted sentence S' is larger than the P(S) computed for the wrong word combination S. C_A is obtained in this embodiment as follows:

$$C_A = \frac{C_B \sum_{w \in W} C_B^{*} - C_B^{*} \sum_{w \in W} C_B}{\sigma \left( \sum_{w \in W} C_B - C_B \right)} + \epsilon \tag{4}$$

where C_B is the word frequency of each word node on the user's candidate path; C_B^* is the word frequency of each word node on the computed maximum-probability path; ε is a small constant, for example a natural number between 1 and 10; w denotes a word and W denotes the path composed of said words w; σ is a weighting factor that may take a rational value between 0 and 2; and σ(·) means that the expression in parentheses is evaluated first and then multiplied by σ. The invention adjusts only the erroneous part.

When adjusting the language model, the main question is how to incorporate the user's input habits into the background model, that is, how to re-estimate the parameters of the original language model. For online learning, the re-estimation must be as fast as possible, and it must be possible to restore the original parameters when an adaptation is found to be unreasonable.

If the probability P(w) of some character or word is higher than that of the other homophonic characters or words on the same path, the input system offers it as the best candidate. If it is not what the user wants, the user has to select a candidate manually. In an input system without learning, the same words appear as the best candidate next time, and the user has to select again, which greatly reduces input efficiency.

Because the language model is statistical, its statistical parameters can be adjusted by dynamically recording the user's input history. The online one-shot learning method of the invention changes the language model quickly, bringing the original language model as close as possible to the user's language habits. It also avoids over-learning, because once the goal is reached it makes almost no further change to the language model.

Analysis shows that the time and space complexity of the one-shot learning method of this embodiment fully meets the real-time requirements of an input system. First, the method runs only when the result is not what the user wants, so the probability that it is invoked is low. Second, when it runs, it modifies only the erroneous part of the sentence rather than recomputing every word in it. Finally, when making a correction it considers only the word concerned and the words competing with it; such words are few, so the computation is small. For storage, a hash table is used, so lookup and storage time are small. As for storage space, the method stores only the words that need adjustment, so it takes far less space than the system statistics library.

Embodiment three: the machine learning system for sentence-level Chinese character input of this embodiment is implemented with the user word recognition method of Embodiment one and the online one-shot learning method of Embodiment two. The system consists of a user word recognition module and an online one-shot learning module, wherein:

In the user word recognition module of this embodiment, user words are recognized with the user word recognition method of Embodiment one.

In the online one-shot learning module of this embodiment, the online one-shot learning method is the learning method of Embodiment two.

Fig. 1 shows the application model formed when the machine learning system of this embodiment is applied to an existing sentence-level input system or method.

The final output of this application model is obtained from the adaptive module: the Chinese character conversion result obtained from the user language model library and the one obtained from the language model library of the original input method or input system are each multiplied by a weighting coefficient (a rational number between 0 and 1) and summed; the most probable Chinese character combination is then computed and sent to the input method or input system as the final Chinese character combination.
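The weighted combination described above might look like the following Python sketch. The function name, the default weights 0.6 and 0.4, the treatment of a missing candidate as score 0, and the toy scores are all assumptions; the text only states that each result is multiplied by a coefficient between 0 and 1 and that the products are summed.

```python
def best_candidate(base_scores, user_scores, base_weight=0.6, user_weight=0.4):
    """Pick the candidate with the highest weighted sum of both models' scores.

    base_scores: candidate -> score from the original language model library
    user_scores: candidate -> score from the user language model library
    """
    candidates = set(base_scores) | set(user_scores)
    if not candidates:
        raise ValueError("no candidates to choose from")

    def combined(candidate):
        return (base_weight * base_scores.get(candidate, 0.0)
                + user_weight * user_scores.get(candidate, 0.0))

    # Sorting first makes the result deterministic when two candidates tie.
    return max(sorted(candidates), key=combined)


# Toy scores: the original model prefers "A", the user model prefers "B".
toy_base = {"A": 0.7, "B": 0.3}
toy_user = {"A": 0.1, "B": 0.9}
toy_choice = best_candidate(toy_base, toy_user)
```

With these made-up numbers the combined scores are 0.46 for "A" and 0.54 for "B", so the user's preference wins; with a user weight of 0 the original model's first choice "A" is returned.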

In this model, user word recognition requires the original system to supply its own lexicon to the user word recognition method, which can then decide whether the user's input should form a user word and build the user lexicon. The original Chinese character input method only needs to read user words from the lexicon built by this method and use them in the optimal path calculation.

The online one-shot learning method requires the original system to supply the statistics of its own language model. From these statistics and the user's input, the method computes the adjustment amount and generates the user language model. The original Chinese character input method only needs to read statistics from the user language model built by this method and use them in the optimal path calculation.

Claims

1. A user word recognition method in a sentence-level Chinese character input method, characterized in that it is a position-based user word recognition method,

$$\mathrm{IWP}(c, rp) = \frac{C(\mathrm{Word}(c, rp))}{C(c)} \tag{1}$$

where C(Word(c, rp)) is the number of words in which root c occurs at position rp in the corpus used to train the language model, and C(c) is the number of times root c occurs in that corpus; when the word-forming ability IWP(c, rp) is greater than a threshold δ (0 < δ < 1), the corresponding word is taken as a user word; otherwise it is not;

$$\mathrm{IWP}(S) = \sqrt[l]{\prod_{i=1}^{l} \mathrm{IWP}(c_i, rp)} \tag{2}$$

2. The user word recognition method in a sentence-level Chinese character input method according to claim 1, characterized in that the user lexicon is stored as a hash table.

3. A machine learning system in a sentence-level Chinese character input method, the system consisting of a user word recognition module and an online one-shot learning module, wherein:

the user word recognition module identifies whether the final output of the sentence-level Chinese character input method, obtained through user intervention, is a user word, encodes each word judged to be a user word to obtain a user word code, and stores the user word code in the user lexicon of the sentence-level Chinese character input method;

the online one-shot learning module operates when the optimal path output by the sentence-level Chinese character input method differs from the final path; it performs online one-shot learning from the optimal path output by the input method and the final path obtained through user intervention, adjusts the weights of the corresponding words according to the learning result, and then updates the user language model library;

$$\mathrm{IWP}(c, rp) = \frac{C(\mathrm{Word}(c, rp))}{C(c)} \tag{1}$$

$$\mathrm{IWP}(S) = \sqrt[l]{\prod_{i=1}^{l} \mathrm{IWP}(c_i, rp)} \tag{2}$$

Step 2: let i = 1;

Step 3: using the information in the language model, compute p(cRoadA[i] | cRoadA[i-1]) and p(wRoadA[i] | wRoadA[i-1]); then, using these two values, apply the maximum a posteriori (MAP) probability method to compute the user adjustment value C_A that maximizes the posterior probability; add (wRoad[i-1], wRoad[i]) and the corresponding C_A to the user language model library as a bigram;

the MAP probability method computes the user adjustment value C_A that maximizes the posterior probability as follows:

$$C_A = \frac{C_B \sum_{w \in W} C_B^{*} - C_B^{*} \sum_{w \in W} C_B}{\sigma \left( \sum_{w \in W} C_B - C_B \right)} + \epsilon \tag{4}$$

where C_B is the word frequency of each word node on the user's candidate path;