CN102004560B - User character recognition method in sentence-level Chinese character input method and machine learning system - Google Patents

User character recognition method in sentence-level Chinese character input method and machine learning system

Publication number
CN102004560B
Authority
CN
China
Prior art keywords
speech
user
iwp
word
root
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN 201010567997
Other languages
Chinese (zh)
Other versions
CN102004560A (en)
Inventor
刘秉权
王晓龙
刘峰
刘远超
林磊
孙承杰
单丽莉
刘铭
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Harbin Institute of Technology Shenzhen
Original Assignee
Harbin Institute of Technology Shenzhen
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Harbin Institute of Technology Shenzhen
Priority to CN 201010567997
Publication of CN102004560A
Application granted
Publication of CN102004560B
Expired - Fee Related
Anticipated expiration

Abstract


A user word recognition method and a machine learning system in a sentence-level Chinese character input method relate to the technical field of machine learning for Chinese character input. The invention solves the problem in the existing machine learning method that user intervention is often required to obtain the final result. The user word recognition method is to use the relative position word forming ability as the evaluation standard to identify user words. The learning method starts only when the optimal path output by the input method is inconsistent with the final output path. This method uses the probability calculation method based on N-grams to obtain the probability value, and then uses the maximum a posteriori MAP to obtain the user adjustment value CA . The value CA and the corresponding words are stored in the user language model library. The machine learning system is a learning system implemented by applying the above-mentioned user word recognition method and learning method. By adopting the technology of the invention, the number of times of intervention during user input can be reduced, and the user can obtain the desired output result more easily.

Figure 201010567997

Description

User word recognition method and machine learning system in a sentence-level Chinese character input method
Technical field
The present invention relates to a user word recognition method and an online learning method within machine learning methods for Chinese character input.
Background technology
Machine learning methods for sentence-level Chinese character input automatically adjust the best-ranked Chinese character combination according to the user's input habits, and are applicable to a wide range of Chinese character input methods and input systems.
As natural language processing and artificial intelligence theory have advanced, Chinese character input technology has improved accordingly. Even so, no input technology has yet achieved perfect conversion, and each has its own shortcomings. For pinyin input methods in particular, no product currently reaches 100% conversion accuracy; all of them require some degree of user intervention, in one form or another, before they produce the output the user wants. Applying the present method to such systems can greatly reduce the number of user interventions required and thereby improve conversion accuracy.
In input methods and input systems that encode Chinese characters for computers, such as pinyin input, speech input, and fuzzy handwriting recognition, one code frequently corresponds to several Chinese characters. This shows up in the following ways:
1. One code corresponds to several Chinese characters. For example, when the pinyin "cheng" is entered, the corresponding characters include "become", "city", "title", "being", and so on. When the pinyin string "chengshi" is entered, the corresponding words include "city", "honesty", "formula", "succeeding", and so on. The same situation arises when longer sentences are entered. If the first candidate offered by the input system is not what the user wants, the user has to select the desired input manually. When an online one-shot learning function is available, the input method can record the user's input habits and offer the user's most frequent result as the first candidate. For example, in a common input method system, when the pinyin "haerbingongyedaxuezhinengjisuanzhongxin" is entered for the first time, the result is "Harbin Institute of Technology Function Computing Center", because the word "function" has a higher frequency in the statistics library. After the user intervenes once, the language model is adjusted. Here, intervention means that the user manually replaces "function" with the candidate "intelligence".
2. Combination ambiguity between words. When a short sentence containing the personal name "Pei Jianli" is entered, such as "Ask Pei Jianli to come to the meeting tomorrow", the pinyin string "mingtianjiaopeijianlikaihui" is usually converted into a sentence built from the words "mating" and "establish" instead of the name. This is because the input system's lexicon contains the words "mating" and "establish" but not the name "Pei Jianli". Obtaining the correct result in this situation takes more user intervention. If the input system has no user word construction and no corresponding online learning function, the user has to make these extensive corrections every time.
3. Changes in the user's habits. For example, a user who works in electronic engineering constantly uses the word "chip"; the same user is also a film fan who writes "new film" recommendations every evening. In pinyin the two words are spelled identically.
With existing input-method learning methods, the user has to keep intervening whenever either of these two words is entered.
Summary of the invention
To solve the problem that existing machine learning methods often need user intervention before the user obtains the desired result, the present invention proposes a user word recognition method and a machine learning system for sentence-level Chinese character input methods.
The user word recognition method in the sentence-level Chinese character input method of the present invention is a position-based user word recognition method, in which:
For a root c, the probability that c occurs at position rp within a word is taken as the word-formation ability IWP(c, rp) of that root:
$$\mathrm{IWP}(c, rp) = \frac{C(\mathrm{Word}(c, rp))}{C(c)} \qquad (1)$$
where C(Word(c, rp)) is the number of words in which root c occurs at position rp in the corpus used to train the language model, and C(c) is the number of times root c occurs in that corpus. When the word-formation ability IWP(c, rp) is greater than a threshold δ (0 < δ < 1), the corresponding word is taken as a user word; otherwise it is not.
For a word string S = c_1, c_2, ..., c_l (l > 1), the geometric mean of the word-formation abilities of its roots is taken as the word-formation ability IWP(S) of the string:
$$\mathrm{IWP}(S) = \sqrt[l]{\prod_{i=1}^{l} \mathrm{IWP}(c_i, rp)} \qquad (2)$$
When IWP(S) ≥ δ (0 < δ ≤ 1), S is taken as a user word; otherwise it is not.
When evaluating whether characters and words combine into a user word, the position-based user word recognition method above uses the relative-position word-formation ability IWP(c, rp) as the evaluation criterion: candidate user words are selected by computing IWP(S), and statistical information then determines whether each candidate is a user word.
The present invention also provides a machine learning system for sentence-level Chinese character input methods. The system consists of a user word recognition module and an online one-shot learning module, wherein:
The user word recognition module identifies whether the final output obtained through user intervention in the sentence-level Chinese character input method is a user word, encodes each word judged to be a user word, and stores the resulting user word code in the user lexicon of the input method;
The online one-shot learning module is used when the optimal path output by the sentence-level Chinese character input method is inconsistent with the final path. It performs online one-shot learning from the optimal path output by the input method and the final path obtained through user intervention, adjusts the weights of the corresponding words according to the learning result, and then revises the user language model library.
In the user word recognition module, user words are recognized as follows:
For a root c, the probability that c occurs at position rp within a word is taken as the word-formation ability IWP(c, rp) of that root:
$$\mathrm{IWP}(c, rp) = \frac{C(\mathrm{Word}(c, rp))}{C(c)} \qquad (1)$$
where C(Word(c, rp)) is the number of words in which root c occurs at position rp in the corpus used to train the language model, and C(c) is the number of times root c occurs in that corpus. When the word-formation ability IWP(c, rp) is greater than a threshold δ (0 < δ < 1), the corresponding word is taken as a user word; otherwise it is not.
For a word string S = c_1, c_2, ..., c_l (l > 1), the geometric mean of the word-formation abilities of its roots is taken as the word-formation ability IWP(S) of the string:
$$\mathrm{IWP}(S) = \sqrt[l]{\prod_{i=1}^{l} \mathrm{IWP}(c_i, rp)} \qquad (2)$$
When IWP(S) ≥ δ (0 < δ ≤ 1), S is taken as a user word; otherwise it is not.
The online one-shot learning method in the online one-shot learning module proceeds as follows:
Step 1: Align the pinyin-to-character conversion output path cRoad[M] and the final candidate path wRoad[N] by length, obtaining the aligned conversion output path cRoadA[L] and the aligned final candidate path wRoadA[L], where M, N, and L are the numbers of words in the respective paths;
Step 2: Set i = 1;
Step 3: Using the information in the language model, compute p(cRoadA[i] | cRoadA[i-1]) and p(wRoadA[i] | wRoadA[i-1]); from these two values, use the maximum a posteriori (MAP) probability method to compute the user adjustment value C_A that maximizes the posterior probability; then add (wRoad[i-1], wRoad[i]) and the corresponding C_A to the user language model library as a bigram entry;
Step 4: Set i = i + 1. If i ≤ L, return to Step 3; otherwise, one round of learning is complete.
Here p(cRoadA[i] | cRoadA[i-1]) denotes the probability, in the conversion output path cRoadA[L], that the i-th word is cRoadA[i] given that the (i-1)-th word is cRoadA[i-1]; p(wRoadA[i] | wRoadA[i-1]) denotes the probability, in the final candidate path wRoadA[L], that the i-th word is wRoadA[i] given that the (i-1)-th word is wRoadA[i-1].
The user adjustment value C_A that maximizes the posterior probability is computed with the maximum a posteriori (MAP) probability method as:
$$C_A = \frac{C_B \sum_{w \in W} C_B^{*} - C_B^{*} \sum_{w \in W} C_B}{\sigma \left( \sum_{w \in W} C_B - C_B \right)} + \epsilon \qquad (4)$$
where C_B is the word frequency of each word node in the user's candidate path; C_B^* is the word frequency of each word node in the computed maximum-probability path; ε is a small constant, for example a natural number between 1 and 10; w denotes a word; W denotes the path formed by the words w; and σ is a weighting factor that may take a rational value between 0 and 2. The notation σ(·) means that the expression in parentheses is evaluated first and then multiplied by σ.
The online one-shot learning method above can be used to build and revise a user language model library in any existing Chinese input method or input system that uses a statistical language model.
To make effective use of the user's input, both the pinyin-to-character conversion result and the final result obtained through user intervention must be fed to the adaptive model. The adaptive model processes them and in turn influences the conversion model, thereby achieving online learning.
Applied to an existing Chinese input method or system, the online one-shot learning method above effectively addresses the problems described in the Background. For the first phenomenon (one code corresponds to several Chinese characters), once the language model has been adjusted by the online one-shot learning method of the invention, entering the same pinyin string a second time directly yields "Harbin Institute of Technology Intelligence Computing Center". For the second phenomenon (combination ambiguity between words), "Pei Jianli" is added to the user lexicon after the user's first intervention; from then on, whether the name is entered alone or within a sentence, the user obtains the desired conversion result. For the third phenomenon (changes in the user's habits), the method remembers a word after each first intervention and outputs it as the top candidate, so repeated intervention is unnecessary and this kind of intervention is greatly reduced.
Description of drawings
Fig. 1 shows the application model of the machine learning method described in Embodiment 3.
Fig. 2 shows the storage structure of the user lexicon described in Embodiment 1.
Embodiment
Embodiment 1: The user word recognition method in the Chinese input method of this embodiment is as follows:
For a root c, the probability that c occurs at position rp within a word is taken as the word-formation ability IWP(c, rp) of that root:
$$\mathrm{IWP}(c, rp) = \frac{C(\mathrm{Word}(c, rp))}{C(c)} \qquad (1)$$
where C(Word(c, rp)) is the number of words in which root c occurs at position rp in the corpus used to train the language model, and C(c) is the number of times root c occurs in that corpus. When the word-formation ability IWP(c, rp) is greater than a threshold δ (0 < δ < 1), the corresponding word is taken as a user word; otherwise it is not.
For a word string S = c_1, c_2, ..., c_l (l > 1), the geometric mean of the word-formation abilities of its roots is taken as the word-formation ability IWP(S) of the string:
$$\mathrm{IWP}(S) = \sqrt[l]{\prod_{i=1}^{l} \mathrm{IWP}(c_i, rp)} \qquad (2)$$
When IWP(S) ≥ δ (0 < δ ≤ 1), S is taken as a user word; otherwise it is not.
The user lexicon in this embodiment is stored as a hash table. The storage format is shown in Fig. 2. Each record contains an index i, a keyword w0, the attribute value of the keyword, and a linked list of data related to w0. The linked list contains: the related unit (w0, WK0), the attribute value of (w0, WK0), a pointer to the next entry, ..., the related unit (w0, WK0+n0), the attribute value of (w0, WK0+n0), and a terminator. This storage scheme suits user words, which need to change dynamically.
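The record layout described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `UserEntry`, `UserLexicon`, `add_pair`, and `pair_attribute` are hypothetical, and a nested dictionary is used in place of the linked list of related units that the text describes, which preserves the lookup behaviour but not the exact layout of Fig. 2.

```python
from dataclasses import dataclass, field


@dataclass
class UserEntry:
    """One hash-table record: a keyword's attribute plus its related units."""
    attribute: float = 0.0
    # Maps a related word wk to the attribute value of the unit (w0, wk).
    # The source describes a linked list here; a dict is a stand-in.
    related: dict = field(default_factory=dict)


class UserLexicon:
    """Hash table keyed by the keyword w0 (hypothetical helper class)."""

    def __init__(self):
        self._table = {}

    def add_word(self, w0, attribute=0.0):
        entry = self._table.setdefault(w0, UserEntry())
        entry.attribute = attribute

    def add_pair(self, w0, wk, attribute):
        # Creates the record for w0 on demand, then stores the unit (w0, wk).
        self._table.setdefault(w0, UserEntry()).related[wk] = attribute

    def pair_attribute(self, w0, wk, default=0.0):
        entry = self._table.get(w0)
        return default if entry is None else entry.related.get(wk, default)
```

Because entries are created on demand and overwritten in place, adding or adjusting a user word never requires rebuilding the table, which is the property the text points to.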
A user word is a word that the user needs but that is not in the lexicon of the Chinese input system. Recording such words improves the user's input efficiency.
From a linguistic point of view, user words fall roughly into the following classes according to their source:
1. Named entities: personal names, place names, trade names, company names, organization names, and so on;
2. Abbreviations: e.g. "World Trade Organization", "South Airways";
3. Dialect words: e.g. "beautiful", "footing the bill";
4. Newly coined words: e.g. "sharp brother", "phoenix elder sister";
5. Technical terms: e.g. "wireless network", "trigger";
6. Transliterated words: e.g. "extremely", "show", "clone";
7. Letter words: e.g. "WTO", "UN";
8. Old words whose meaning or usage has changed: e.g. "step down", "charging";
9. Words created by the user: e.g. "An Bami".
The main difficulty in user word recognition is establishing a sound criterion. When evaluating whether characters and words combine into a user word, the method of this embodiment uses the relative-position word-formation ability IWP(c, rp) as the evaluation criterion: candidate user words are selected by computing IWP(S), and statistical information then determines whether each candidate is a user word.
The user word recognition method of this embodiment is illustrated by the following example:
Chinese has many derivational patterns built on a root, as in "straw hat" and "sunbonnet", so word-formation positions are divided into three classes:
1. The root is word-initial: e.g. "working", "top"; relative position rp = 1.
2. The root is word-medial: e.g. "not understanding puzzled", "cannot bear to part"; relative position rp = 2.
3. The root is word-final: e.g. "mouse", "squirrel"; relative position rp = 3.
Using the probability that a root occurs at position rp within a word therefore improves the accuracy of word-formation judgments.
For a word string S = c_1, c_2, ..., c_l (l > 1), the word-formation ability is computed as a geometric mean in order to avoid a bias toward short words.
This user word recognition method performs effective user word recognition at very low computational cost, because all statistics are computed in advance and saved in files. When judging a user word, the method reads the precomputed statistics directly from those files instead of recounting them. This greatly reduces the system's computation and meets the real-time requirements of an input system; the statistics files are also small.
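As a rough illustration of equations (1) and (2), the following Python sketch looks up precomputed counts and takes the geometric mean over a candidate string. The counts, the example word 松鼠 ("squirrel"), the threshold value, and all function names are assumptions made for the example; the source does not give concrete statistics.

```python
import math

# Hypothetical precomputed statistics (illustrative numbers only).
# word_count[(c, rp)]: number of words in which root c occurs at position rp.
# root_count[c]: number of occurrences of root c in the corpus.
word_count = {("松", 1): 30, ("鼠", 3): 40}
root_count = {"松": 60, "鼠": 50}


def iwp(c, rp):
    """Equation (1): IWP(c, rp) = C(Word(c, rp)) / C(c)."""
    total = root_count.get(c, 0)
    if total == 0:
        return 0.0
    return word_count.get((c, rp), 0) / total


def relative_position(i, length):
    """1 = word-initial, 2 = word-medial, 3 = word-final."""
    if i == 0:
        return 1
    if i == length - 1:
        return 3
    return 2


def iwp_string(s):
    """Equation (2): geometric mean of the per-root IWP values."""
    values = [iwp(c, relative_position(i, len(s))) for i, c in enumerate(s)]
    if any(v == 0.0 for v in values):
        return 0.0
    # Summing logs avoids underflow for long strings.
    return math.exp(sum(math.log(v) for v in values) / len(values))


def is_user_word(s, delta=0.5):
    """Accept S as a user word when l > 1 and IWP(S) >= delta."""
    return len(s) > 1 and iwp_string(s) >= delta
```

With these made-up counts, IWP(松, 1) = 0.5 and IWP(鼠, 3) = 0.8, so IWP(松鼠) is the square root of 0.4, roughly 0.63. The geometric mean keeps the score comparable across string lengths, whereas a plain product would shrink with every added root.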
The user word recognition method of this embodiment can be applied in any existing Chinese input method or input system that uses a statistical language model.
Embodiment 2: The online one-shot learning method in the sentence-level input method of this embodiment is as follows:
Step 1: Align the pinyin-to-character conversion output path cRoad[M] and the final candidate path wRoad[N] by length, obtaining the aligned conversion output path cRoadA[L] and the aligned final candidate path wRoadA[L], where M, N, and L are the numbers of words in the respective paths;
Step 2: Set i = 1;
Step 3: Using the information in the language model, compute p(cRoadA[i] | cRoadA[i-1]) and p(wRoadA[i] | wRoadA[i-1]); from these two values, use the maximum a posteriori (MAP) probability method to compute the user adjustment value C_A that maximizes the posterior probability; then add (wRoad[i-1], wRoad[i]) and the corresponding C_A to the user language model library as a bigram entry;
Step 4: Set i = i + 1. If i ≤ L, return to Step 3; otherwise, one round of learning is complete.
The online one-shot learning method of this embodiment starts only when the optimal path output by the sentence-level Chinese character input method is inconsistent with the final path. It changes the language model quickly, bringing the original language model as close as possible to the user's language habits. It also avoids over-learning, because once the goal is reached it makes almost no further changes to the language model.
Step 1 of this embodiment is needed because one pinyin string can correspond to several combinations of Chinese characters and words, so the lengths of the conversion output path cRoad[M] and the final candidate path wRoad[N] may differ. The two paths are therefore aligned by length, and L denotes the unified length.
In Step 3 of this embodiment, p(cRoadA[i] | cRoadA[i-1]) denotes the probability, in the conversion output path cRoadA[L], that the i-th word is cRoadA[i] given that the (i-1)-th word is cRoadA[i-1]; p(wRoadA[i] | wRoadA[i-1]) denotes the probability, in the final candidate path wRoadA[L], that the i-th word is wRoadA[i] given that the (i-1)-th word is wRoadA[i-1].
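The control flow of Steps 2 to 4 can be sketched as below. This is a simplified illustration, not the patented procedure: it assumes the two paths have already been aligned to the same length (Step 1 is not implemented), it uses 0-based indices so the loop starts at the second word, it skips bigrams that are identical in both paths, and it takes the MAP computation of C_A as a caller-supplied function `compute_ca`, a hypothetical name.

```python
def learn_once(c_road_a, w_road_a, compute_ca, user_model):
    """One round of learning over two aligned word paths.

    c_road_a: aligned conversion output path (list of words).
    w_road_a: aligned final path chosen by the user (list of words).
    compute_ca: callable returning the adjustment value for one bigram
                (stands in for the MAP step, which is not shown here).
    user_model: dict mapping (previous word, word) -> adjustment value.
    """
    # Learning starts only when the two paths disagree.
    if c_road_a == w_road_a:
        return user_model
    if len(c_road_a) != len(w_road_a):
        raise ValueError("paths must be aligned to the same length first")

    for i in range(1, len(w_road_a)):
        # Bigrams that already match need no adjustment.
        if c_road_a[i - 1:i + 1] == w_road_a[i - 1:i + 1]:
            continue
        key = (w_road_a[i - 1], w_road_a[i])
        user_model[key] = compute_ca(
            c_road_a[i - 1], c_road_a[i], w_road_a[i - 1], w_road_a[i]
        )
    return user_model
```

For example, with the output path `["a", "x", "c"]` and the user's path `["a", "y", "c"]`, only the bigrams `("a", "y")` and `("y", "c")` are written to the user model; a matching pair of paths leaves the model untouched.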
The probability computation in this embodiment is based on the existing N-gram probability method used by sentence-level input methods. It obtains the probability P(S) of a sentence S composed of m words w_1, w_2, ..., w_m by:
$$P(S) = P(w_1 w_2 \ldots w_m) = \prod_{i=1}^{m} P(w_i \mid w_{i-1} w_{i-2} \ldots w_{i-n+1}) \qquad (3)$$
where n is the order N of the N-gram grammar and P(w_i) is the statistical probability of word w_i in the language model.
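Equation (3) can be evaluated with a short Python function. The sketch below is an assumption-laden illustration: the table `cond_prob`, its key layout (history tuple, word), and the `floor` value used for unseen events are all hypothetical, and the sum is taken in log space to avoid numerical underflow on long sentences.

```python
import math


def sentence_log_prob(words, cond_prob, n=2, floor=1e-9):
    """Log of equation (3): sum of log P(w_i | previous n-1 words).

    cond_prob maps (history_tuple, word) to a conditional probability.
    Unseen events fall back to `floor`, a placeholder for real smoothing.
    """
    log_p = 0.0
    for i, w in enumerate(words):
        # Shorter histories are used at the start of the sentence.
        history = tuple(words[max(0, i - n + 1):i])
        log_p += math.log(cond_prob.get((history, w), floor))
    return log_p
```

With `cond_prob = {((), "a"): 0.5, (("a",), "b"): 0.2}`, the sentence `["a", "b"]` has probability 0.5 × 0.2 = 0.1 under a bigram model (n = 2).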
This embodiment uses the maximum a posteriori (MAP) probability method with a dynamic weighting factor to compute the user adjustment value C_A that maximizes the posterior probability. The main idea of the MAP method is to adjust the probability P(w) between certain words so that they win in subsequent probability computations. After adjustment, the probability P(S') newly computed for the corrected sentence S' is larger than the probability P(S) computed for the wrong word combination S. In this embodiment C_A is obtained as follows:
$$C_A = \frac{C_B \sum_{w \in W} C_B^{*} - C_B^{*} \sum_{w \in W} C_B}{\sigma \left( \sum_{w \in W} C_B - C_B \right)} + \epsilon \qquad (4)$$
where C_B is the word frequency of each word node in the user's candidate path; C_B^* is the word frequency of each word node in the computed maximum-probability path; ε is a small constant, for example a natural number between 1 and 10; w denotes a word; W denotes the path formed by the words w; and σ is a weighting factor that may take a rational value between 0 and 2. The notation σ(·) means that the expression in parentheses is evaluated first and then multiplied by σ. The present invention adjusts only the erroneous part.
In adjusting the language model, the main problem is how to incorporate the user's input habits into the background model, that is, how to re-estimate the parameters of the original language model. For online learning, this re-estimation must be as fast as possible, and it must be possible to restore the original parameters when the adaptation is found to be unreasonable.
If the probability P(w) of certain characters or words is higher than that of other homophonous characters or words on the same path, the input system takes them as the best candidate. If they are not what the user wants, the user has to select the right candidate manually. In an input system without learning, the same words appear as the best candidate the next time too, and the user has to select again, which greatly reduces input efficiency.
Because the language model is statistical, its statistical parameters can be adjusted by dynamically recording the user's input history. The online one-shot learning method of the invention changes the language model quickly, bringing the original language model as close as possible to the user's language habits. It also avoids over-learning, because once the goal is reached it makes almost no further changes to the language model.
Analysis shows that the time and space complexity of the one-shot learning method of this embodiment fully meet the real-time requirements of an input system. First, the method runs only when the result is not what the user wants, so it is invoked with low probability. Second, when it runs, it modifies only the erroneous part of the sentence rather than recomputing every word in the sentence. Finally, when it makes a correction, it computes only the word in question and the words competing with it; there are few such words, so the computation is small. For storage, a hash table is used, so lookup and storage overhead are low. As for space, the method stores only the words that need adjustment, so it occupies far less space than the system's statistics library.
Embodiment 3: The machine learning system for sentence-level Chinese character input of this embodiment is implemented with the user word recognition method of Embodiment 1 and the online one-shot learning method of Embodiment 2. The system consists of a user word recognition module and an online one-shot learning module, wherein:
The user word recognition module identifies whether the final output obtained through user intervention in the sentence-level Chinese character input method is a user word, encodes each word judged to be a user word, and stores the resulting user word code in the user lexicon of the input method;
The online one-shot learning module is used when the optimal path output by the sentence-level Chinese character input method is inconsistent with the final path. It performs online one-shot learning from the optimal path output by the input method and the final path obtained through user intervention, adjusts the weights of the corresponding words according to the learning result, and then revises the user language model library.
In the user word recognition module of this embodiment, user words are recognized with the method of Embodiment 1.
In the online one-shot learning module of this embodiment, the learning method of Embodiment 2 is used.
Fig. 1 shows the application model formed when the machine learning system of this embodiment is applied to an existing sentence-level input system or method.
The final output of this application model is obtained as follows: the Chinese character conversion result that the adaptation module obtains from the user language model library and the conversion result obtained from the language model library of the original input method or input system are each multiplied by a weighting coefficient (a rational number between 0 and 1) and summed; the most probable Chinese character combination is then computed and sent to the input method or input system as the final Chinese character combination.
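One way to read the weighted combination is as a linear interpolation of two scores per candidate, followed by picking the highest-scoring candidate. The sketch below follows that reading; the function name, the score dictionaries, and the default weights 0.7 and 0.3 are illustrative assumptions, since the source states only that each coefficient lies between 0 and 1.

```python
def best_candidate(candidates, base_score, user_score, w_base=0.7, w_user=0.3):
    """Pick the candidate with the highest weighted sum of two model scores.

    base_score: candidate -> score from the original language model library.
    user_score: candidate -> score from the user language model library.
    Candidates missing from a model contribute 0 for that model.
    """
    def combined(candidate):
        return (w_base * base_score.get(candidate, 0.0)
                + w_user * user_score.get(candidate, 0.0))

    return max(candidates, key=combined)
```

Using the example from the Background: if the original model scores "function" at 0.6 and "intelligence" at 0.4, while the user model scores "intelligence" at 0.9, the combined scores are 0.42 and 0.55, so "intelligence" becomes the top candidate.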
In this model, user word recognition needs the original system to provide its own lexicon to the user word recognition method; the method can then judge whether the user's input should form a user word and build the user lexicon. The original Chinese character input method only needs to read user words from the lexicon built by this method and use them in the optimal path computation.
The online one-shot learning method needs the original system to provide the statistics of its own language model; from these statistics and the user's input, the method computes the adjustment amounts and generates the user language model. The original Chinese character input method only needs to read statistics from the user language model built by this method and use them in the optimal path computation.

Claims (3)

1. A user word recognition method in a sentence-level Chinese character input method, characterized in that it is a position-based user word recognition method, wherein:
For a root c, the probability that c occurs at position rp within a word is taken as the word-formation ability IWP(c, rp) of that root:
$$\mathrm{IWP}(c, rp) = \frac{C(\mathrm{Word}(c, rp))}{C(c)} \qquad (1)$$
where C(Word(c, rp)) is the number of words in which root c occurs at position rp in the corpus used to train the language model, and C(c) is the number of times root c occurs in that corpus; when the word-formation ability IWP(c, rp) is greater than a threshold δ (0 < δ < 1), the corresponding word is taken as a user word, and otherwise it is not;
For a word string S = c_1, c_2, ..., c_l (l > 1), the geometric mean of the word-formation abilities of its roots is taken as the word-formation ability IWP(S) of the string:
$$\mathrm{IWP}(S) = \sqrt[l]{\prod_{i=1}^{l} \mathrm{IWP}(c_i, rp)} \qquad (2)$$
When IWP(S) ≥ δ (0 < δ ≤ 1), S is taken as a user word; otherwise it is not.
2. The user word recognition method in a sentence-level Chinese character input method according to claim 1, characterized in that the user lexicon is stored as a hash table.
3. A machine learning system in a sentence-level Chinese character input method, the system consisting of a user word recognition module and an online one-shot learning module, wherein:
The user word recognition module is used to identify whether the final output obtained through user intervention in the sentence-level Chinese character input method is a user word, to encode each word judged to be a user word so as to obtain a user word code, and then to store that user word code in the user lexicon of the sentence-level Chinese character input method;
The online one-shot learning module is used, when the optimal path output by the sentence-level Chinese character input method is inconsistent with the final path, to perform online one-shot learning from the optimal path output by the input method and the final path obtained through user intervention, to adjust the weights of the corresponding words according to the learning result, and then to revise the user language model library;
In said user word recognition module, the user word recognition method is:
For a root c, the probability that c occurs at position rp in word combinations is taken as the word-forming ability IWP(c, rp) of c:
$$\mathrm{IWP}(c, rp) = \frac{C(\mathrm{Word}(c, rp))}{C(c)} \tag{1}$$
Wherein C(Word(c, rp)) is the number of words in which root c occurs at position rp in the training corpus used to build the language model, and C(c) is the number of times root c occurs in that corpus; when the word-forming ability IWP(c, rp) is greater than a threshold δ (0 < δ < 1), the corresponding word is taken as a user word, and otherwise it is not;
For a word string S = c_1, c_2, ..., c_l (l > 1), the geometric mean of the word-forming abilities of the roots in the string is taken as the word-forming ability IWP(S) of the string:
$$\mathrm{IWP}(S) = \sqrt[l]{\prod_{i=1}^{l} \mathrm{IWP}(c_i, rp)} \tag{2}$$
When IWP(S) ≥ δ (0 < δ ≤ 1), S is taken as a user word; otherwise S is not taken as a user word;
In said user word recognition module, the process of the online one-time learning method in the online one-time learning module is:
Step 1: align the sound-to-character conversion output path cRoad[M] with the final candidate path wRoad[N] on the basis of length, obtaining the aligned sound-to-character conversion output path cRoadA[L] and the aligned final candidate path wRoadA[L]; M, N and L denote the number of words contained in the respective paths;
Step 2: set i = 1;
Step 3: using the information in the language model, compute p(cRoadA[i] | cRoadA[i-1]) and p(wRoadA[i] | wRoadA[i-1]), then use these two values with the maximum a posteriori (MAP) probability method to compute the user adjustment value C_A that maximizes the posterior probability; add (wRoad[i-1], wRoad[i]) and the corresponding C_A to the user language model library as a bigram entry;
Step 4: set i = i + 1; if i ≤ L, return to step 3; otherwise this round of learning is complete;
p(cRoadA[i] | cRoadA[i-1]) denotes the probability that, in the sound-to-character conversion output path cRoadA[L], the i-th word is cRoadA[i] given that the (i-1)-th word is cRoadA[i-1]; p(wRoadA[i] | wRoadA[i-1]) denotes the probability that, in the final candidate path wRoadA[L], the i-th word is wRoadA[i] given that the (i-1)-th word is wRoadA[i-1];
The user adjustment value C_A that maximizes the posterior probability is computed by the maximum a posteriori (MAP) probability method as follows:
$$C_A = \frac{C_B \sum_{w \in W} C_B^{*} - C_B^{*} \sum_{w \in W} C_B}{\sigma\left(\sum_{w \in W} C_B - C_B\right)} + \varepsilon \tag{4}$$
where C_B is the word frequency of each word node in the user's candidate path; C_B^* is the word frequency of each word node in the computed maximum-probability path; ε is a small constant, for example a natural number between 1 and 10; w denotes a word, and W denotes the path composed of the words w; σ is a weighting factor that may be a rational number between 0 and 2; and σ(·) means that the expression in parentheses is evaluated first and then multiplied by σ.
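The learning loop of steps 2 to 4 together with formula (4) can be sketched as below. Several things here are assumptions rather than details taken from the text: formula (4) is read as a fraction with σ(ΣC_B − C_B) in the denominator and ε added at the end, the two paths are taken as already aligned to the same length, word frequencies come from a hypothetical `freq` lookup, and the conditional-probability computation of step 3 is omitted because the text does not say how those values enter formula (4). Indices are 0-based, so the loop starts at the second word of each path.

```python
def adjustment(c_b, sum_c_b, c_b_star, sum_c_b_star, sigma=1.0, eps=1.0):
    """Evaluate C_A under the assumed reading of formula (4)."""
    denom = sigma * (sum_c_b - c_b)  # sigma(...) : evaluate the bracket, then multiply
    if denom == 0:
        raise ValueError("formula (4) is undefined when sigma * (sum C_B - C_B) is 0")
    return (c_b * sum_c_b_star - c_b_star * sum_c_b) / denom + eps


def learn_once(c_road_a, w_road_a, freq, user_model, sigma=1.0, eps=1.0):
    """One pass over two aligned paths, storing (bigram -> C_A) in user_model.

    c_road_a: aligned conversion output path (maximum-probability path)
    w_road_a: aligned final candidate path chosen by the user
    freq:     hypothetical callable returning the word frequency of a word
    """
    if len(c_road_a) != len(w_road_a):
        raise ValueError("paths must be aligned to the same length first")
    sum_c_b = sum(freq(w) for w in w_road_a)       # sum over the user's path
    sum_c_b_star = sum(freq(w) for w in c_road_a)  # sum over the best path
    for i in range(1, len(w_road_a)):
        c_a = adjustment(freq(w_road_a[i]), sum_c_b,
                         freq(c_road_a[i]), sum_c_b_star, sigma, eps)
        # Store the user's bigram with its adjustment value.
        user_model[(w_road_a[i - 1], w_road_a[i])] = c_a
    return user_model
```

A plain `dict` keyed by word pairs stands in for the user language model library; a real system would presumably persist it and merge the adjustment values into the N-gram scores at decoding time.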
CN 201010567997 | 2010-12-01 | 2010-12-01 | User character recognition method in sentence-level Chinese character input method and machine learning system | Expired - Fee Related | CN102004560B (en)

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN 201010567997 (CN102004560B (en)) | 2010-12-01 | 2010-12-01 | User character recognition method in sentence-level Chinese character input method and machine learning system

Applications Claiming Priority (1)

Application Number | Priority Date | Filing Date | Title
CN 201010567997 (CN102004560B (en)) | 2010-12-01 | 2010-12-01 | User character recognition method in sentence-level Chinese character input method and machine learning system

Publications (2)

Publication Number | Publication Date
CN102004560A (en) | 2011-04-06
CN102004560B (en) | 2013-07-24

Family

ID=43811964

Family Applications (1)

Application Number | Title | Priority Date | Filing Date
CN 201010567997 (CN102004560B (en), Expired - Fee Related) | User character recognition method in sentence-level Chinese character input method and machine learning system | 2010-12-01 | 2010-12-01

Country Status (1)

Country | Link
CN (1) | CN102004560B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN112053686B (en)* | 2020-07-28 | 2024-01-02 | 出门问问信息科技有限公司 | Audio interruption method, device and computer readable storage medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN101038596A (en)* | 2007-04-29 | 2007-09-19 | 北京搜狗科技发展有限公司 | Method and system for classifying website
CN101127042A (en)* | 2007-09-21 | 2008-02-20 | 浙江大学 | A Sentiment Classification Method Based on Language Model
CN101539907A (en)* | 2008-03-19 | 2009-09-23 | 日电(中国)有限公司 | Part-of-speech tagging model training device and part-of-speech tagging system and method thereof

Also Published As

Publication number | Publication date
CN102004560A (en) | 2011-04-06

Similar Documents

Publication | Publication Date | Title
US10997370B2 (en) | Hybrid classifier for assigning natural language processing (NLP) inputs to domains in real-time
CN113010693A (en) | Intelligent knowledge graph question-answering method fusing pointer to generate network
CN107836000B (en) | Improved artificial neural network method and electronic device for language modeling and prediction
CN109800307B (en) | Analysis method, device, computer equipment and storage medium for product evaluation
CN109376222B (en) | Question-answer matching degree calculation method, question-answer automatic matching method and device
CN110147421B (en) | Target entity linking method, device, equipment and storage medium
CN113743099B (en) | System, method, medium and terminal for extracting terms based on self-attention mechanism
CN111611791B (en) | Text processing method and related device
WO2023103914A1 (en) | Text sentiment analysis method and device, and computer-readable storage medium
US20210405765A1 (en) | Method and Device for Input Prediction
US12190621B2 (en) | Generating weighted contextual themes to guide unsupervised keyphrase relevance models
CN116342167B (en) | Intelligent cost measurement method and device based on sequence labeling named entity recognition
CN113761890A (en) | A Multi-level Semantic Information Retrieval Method Based on BERT Context Awareness
CN103869998A (en) | Method and device for sorting candidate items generated by input method
KR20250047390A (en) | Data processing method and device, entity linking method and device, and computer device
CN105183803A (en) | Personalized search method and search apparatus thereof in social network platform
CN109710921A (en) | Calculation method, device, computer equipment and the storage medium of Words similarity
CN110188926A (en) | A kind of order information forecasting system and method
CN115269834A (en) | High-precision text classification method and device based on BERT
WO2025194913A1 (en) | Data query method and apparatus, computer device, and storage medium
CN116991877A (en) | A method, device and application for generating structured query statements
CN107329951A (en) | Build name entity mark resources bank method, device, storage medium and computer equipment
CN111738008B (en) | Entity identification method, device and equipment based on multilayer model and storage medium
CN119089886A (en) | File processing method, electronic device and readable storage medium
CN102004560B (en) | User character recognition method in sentence-level Chinese character input method and machine learning system

Legal Events

Date | Code | Title | Description
C06 | Publication
PB01 | Publication
C10 | Entry into substantive examination
SE01 | Entry into force of request for substantive examination
C14 | Grant of patent or utility model
GR01 | Patent grant
CF01 | Termination of patent right due to non-payment of annual fee | Granted publication date: 2013-07-24; Termination date: 2014-12-01
EXPY | Termination of patent right or utility model
