CN101122901B - Chinese integral sentence generation method and device - Google Patents

Chinese integral sentence generation method and device

Publication number
CN101122901B
Authority
CN
China
Prior art keywords
candidate word
word
probability
candidate
frequency
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN200710151332XA
Other languages
Chinese (zh)
Other versions
CN101122901A (en)
Inventor
张会鹏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Shiji Guangsu Information Technology Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN200710151332XA
Publication of CN101122901A
Application granted
Publication of CN101122901B
Status: Active
Anticipated expiration

Abstract

The present invention discloses a method and device for generating complete Chinese sentences. The method includes: obtaining the previously generated candidate word and the candidate words that occur in a pinyin string; constructing a candidate-word directed graph; selecting, from the candidate words corresponding to the initial arc of the directed graph, the candidate word with the largest conditional probability given the previously generated candidate word; and obtaining the complete-sentence result for the pinyin string based on that candidate word. An embodiment of the invention also provides a corresponding device. When computing the probability of a candidate word on the initial arc, the embodiment uses the word frequency of the combination of that candidate word with the previously generated candidate word, together with the word frequency of the previously generated candidate word. In other words, it uses context information to generate the complete sentence, which improves the accuracy of both complete-sentence generation and candidate-word generation.

Description

Chinese complete sentence generating method and device
Technical field
The present invention relates to Chinese input technology, and in particular to a method and device for generating complete Chinese sentences.
Background
At present, most Chinese input software supports complete-sentence generation. For example, a user who wants to enter "People's Republic of China" only needs to type the pinyin string "zhonghuarenmingongheguo" continuously into the input method software to obtain the correct complete-sentence result; see Fig. 1.
Fig. 2 is a flowchart of the Chinese complete sentence generating method of the prior art, which comprises:
Step 201: segment the pinyin string into syllables;
Step 202: according to the segmentation result, look up in the pinyin dictionary all candidate words that occur in the pinyin string, and construct a candidate-word directed graph. Each arc of the directed graph corresponds to one or more candidate words, and each arc carries the word frequency of its highest-frequency candidate word;
The pinyin dictionary records the mapping from pinyin to candidate words, and the word frequency is the number of times a candidate word occurs.
Step 203: obtain the probability of each arc from the word frequency carried by the directed graph;
Specifically, the probability of each arc is the word frequency carried by that arc divided by the sum of the word frequencies of all words in the pinyin dictionary.
Step 204: use a shortest-path algorithm (such as Dijkstra's algorithm or the Viterbi algorithm) to obtain the path (candidate-word combination) with the largest probability as the complete-sentence result;
Step 205: display the complete-sentence result in the first position of the candidate window, and display the candidate words of the initial arc of the directed graph in the candidate window in descending order of word frequency.
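Steps 202 and 203 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation described in the patent: the function name build_graph, the dictionary layout (an apostrophe-joined syllable string mapped to candidate words and their frequencies), and all counts are hypothetical.

```python
def build_graph(syllables, pinyin_dict, total_freq):
    """Return the arcs of a candidate-word directed graph.

    Node i sits before syllable i, so an arc (start, end) spans
    syllables[start:end]. pinyin_dict is a hypothetical mapping such as
    {"wo'shi": {"I am": 90, "bedroom": 20}}.
    """
    arcs = []
    n = len(syllables)
    for start in range(n):
        for end in range(start + 1, n + 1):
            key = "'".join(syllables[start:end])
            candidates = pinyin_dict.get(key)
            if not candidates:
                continue
            # Step 202: candidates in descending order of word frequency.
            ranked = sorted(candidates, key=candidates.get, reverse=True)
            top_freq = candidates[ranked[0]]
            arcs.append({
                "start": start,
                "end": end,
                "candidates": ranked,
                # Step 203: top frequency / total frequency of all words.
                "prob": top_freq / total_freq,
            })
    return arcs


# Toy data (assumed counts, English glosses stand in for Chinese words).
toy_dict = {
    "wo": {"I": 120, "hold": 5},
    "wo'shi": {"bedroom": 20, "I am": 90},
    "shi": {"am": 100},
}
arcs = build_graph(["wo", "shi"], toy_dict, total_freq=1000)
```

Each arc keeps its full ranked candidate list because Step 205 displays the candidates of the initial arc in frequency order, while only the top frequency feeds the arc probability.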
Taking the Viterbi algorithm as an example, Step 204 is implemented as follows.
Starting from the start node of the directed graph, compute the accumulated probability (a product of probabilities) of each node, with the accumulated probability of the start node initialized to 1. For each node, among the accumulated probabilities obtained through its incoming arcs, take the largest as the accumulated probability of the node and record the number of the corresponding predecessor node, until the accumulated probability and predecessor of the last node of the directed graph have been obtained. Then, starting from the last node, trace back through the recorded predecessor numbers to the start node to obtain the path with the largest probability, and concatenate in order the candidate words of the arcs on this path to obtain the complete-sentence result. The accumulated probability is computed as: accumulated probability of the current node = accumulated probability of its predecessor node × probability of the connecting arc.
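The procedure described above can be sketched as a short Python function. It is an illustrative sketch only: the name best_path and the arc tuple layout (start, end, word, probability) are assumptions, and nodes are assumed to be numbered so that every arc runs from a lower to a higher number.

```python
from collections import defaultdict


def best_path(num_nodes, arcs):
    """Return (words, probability) of the maximum-probability path.

    arcs is a list of (start, end, word, probability) with start < end;
    node 0 is the start node and node num_nodes - 1 is the last node.
    """
    incoming = defaultdict(list)
    for start, end, word, prob in arcs:
        incoming[end].append((start, word, prob))

    best = [0.0] * num_nodes      # accumulated probability per node
    back = [None] * num_nodes     # (predecessor node, word) per node
    best[0] = 1.0                 # start node is initialized to 1

    # Increasing node order is a topological order because start < end.
    for node in range(1, num_nodes):
        for start, word, prob in incoming[node]:
            score = best[start] * prob
            if score > best[node]:
                best[node] = score
                back[node] = (start, word)

    # Trace back from the last node to the start node.
    words = []
    node = num_nodes - 1
    while node != 0:
        if back[node] is None:
            raise ValueError("last node is unreachable from the start node")
        node, word = back[node]
        words.append(word)
    words.reverse()
    return words, best[-1]


# Toy graph with assumed probabilities: one long arc beats two short ones.
toy_arcs = [(0, 1, "a", 0.5), (1, 2, "b", 0.5), (0, 2, "ab", 0.3)]
toy_words, toy_prob = best_path(3, toy_arcs)
```

In the toy graph the path through node 1 has accumulated probability 0.5 × 0.5 = 0.25, which is lower than the 0.3 of the direct arc, so the direct arc is chosen.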
The following example illustrates the existing Chinese complete sentence generating method.
Suppose the user enters the pinyin string "womendoushipingfanren". Syllable segmentation yields "wo'men'dou'shi'ping'fan'ren". According to this result, all candidate words that occur in the pinyin string are looked up in the pinyin dictionary, and the candidate-word directed graph shown in Fig. 3(a) is constructed. Each arc of the graph corresponds to one or more candidate words (listed from top to bottom in descending order of word frequency), and each arc carries the word frequency (not marked in the figure) of its highest-frequency candidate word, that is, the topmost candidate in the figure. The Viterbi algorithm yields the complete-sentence result "we are all ordinary people", which is displayed in the first position of the candidate window, as shown in Fig. 3(b). Starting from the second position of the candidate window, the candidate words of the initial arc of the graph, "we", "I" and "hold", are displayed in descending order of word frequency.
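The syllable segmentation in this example can be sketched as a longest-match search with backtracking. The patent does not say which segmentation algorithm is used, so the approach, the function name segment, and the small syllable set below are all assumptions made for illustration.

```python
from functools import lru_cache


def segment(pinyin, syllables):
    """Split a pinyin string into syllables, or return None if impossible.

    Longer syllables are tried first; the search backtracks when a
    choice leaves a remainder that cannot be segmented.
    """
    max_len = max(map(len, syllables))

    @lru_cache(maxsize=None)
    def solve(pos):
        if pos == len(pinyin):
            return ()
        for length in range(min(max_len, len(pinyin) - pos), 0, -1):
            piece = pinyin[pos:pos + length]
            if piece in syllables:
                rest = solve(pos + length)
                if rest is not None:
                    return (piece,) + rest
        return None

    result = solve(0)
    return list(result) if result is not None else None


# Hypothetical syllable inventory, just large enough for the example.
SYLLABLES = frozenset(
    {"wo", "men", "me", "dou", "shi", "pin", "ping", "fan", "fa", "ren"}
)
pieces = segment("womendoushipingfanren", SYLLABLES)
joined = "'".join(pieces)
```

A real input method would need the full pinyin syllable inventory and a policy for genuinely ambiguous strings, which this sketch does not address.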
In practice, however, users are generally not accustomed to typing one very long pinyin string continuously; they tend to type word by word. For example, suppose the user wants to enter "this bedroom is very big" in two inputs. The user first types "zhejian", which produces the candidates shown in Fig. 4(a), and selects "this". The user then types "woshihenda", which produces the candidates shown in Fig. 4(b). The first candidate is "I am very big", which is not what the user wants. The user must first select candidate 2 to obtain the result shown in Fig. 4(c), and then select candidate 1 to obtain the correct complete-sentence result "bedroom is very big".
The defect of the prior art is that it considers only the highest-frequency candidate word when generating a complete sentence. When the user enters a sentence in several parts, the complete-sentence result shown in the first position of the candidate window is therefore often inaccurate, and the user needs several selection operations to obtain the correct result, which slows down input.
Summary of the invention
The technical problem to be solved by the embodiments of the invention is to provide a Chinese complete sentence generating method and device that can produce accurate complete-sentence results.
To solve the above technical problem, an embodiment of the invention provides a Chinese complete sentence generating method, comprising:
obtaining the previously generated candidate word;
obtaining from the pinyin dictionary the candidate words that occur in the pinyin string, and constructing a candidate-word directed graph;
selecting, from the candidate words corresponding to the initial arc of the directed graph, the candidate word with the largest conditional probability given the previously generated candidate word;
obtaining the complete-sentence result of the pinyin string based on the candidate word with the largest conditional probability.
Preferably, selecting the candidate word with the largest conditional probability given the previously generated candidate word specifically comprises:
combining each candidate word corresponding to the initial arc of the directed graph with the previously generated candidate word;
looking up the word frequency of each candidate-word combination, the word frequency of each candidate word corresponding to the initial arc, and the word frequency of the previously generated candidate word;
computing, from these three word frequencies, the conditional probability of each candidate word corresponding to the initial arc, and selecting the candidate word with the largest conditional probability.
Preferably, computing the conditional probability of a candidate word corresponding to the initial arc specifically comprises:
computing the co-occurrence probability of the candidate-word combination from the word frequency of the combination and the word frequency of the previously generated candidate word;
computing the independent probability of the candidate word from the word frequency of the candidate word corresponding to the initial arc;
adding the co-occurrence probability and the independent probability to obtain the conditional probability of the candidate word corresponding to the initial arc.
Preferably, computing the co-occurrence probability of the candidate-word combination is specifically:
dividing the word frequency of the candidate-word combination by the word frequency of the previously generated candidate word, and multiplying the quotient by a first parameter, to obtain the co-occurrence probability of the combination;
and computing the independent probability of the candidate word is specifically:
dividing the word frequency of the candidate word corresponding to the initial arc by the sum of the word frequencies of all words in the pinyin dictionary, and multiplying the quotient by a second parameter, to obtain the independent probability of the candidate word;
wherein the first parameter and the second parameter are each greater than 0 and less than 1, and their sum is less than 1.
Preferably, obtaining the complete-sentence result of the pinyin string based on the selected candidate word is specifically:
taking the conditional probability of the candidate word with the largest conditional probability as the probability of the initial arc of the candidate-word directed graph;
computing the probabilities of the arcs of the candidate-word directed graph other than the initial arc;
using a shortest-path algorithm to obtain the path with the largest probability as the complete-sentence result of the pinyin string.
Preferably, the method further comprises:
displaying the complete-sentence result in the first position of the candidate window.
Preferably, the method further comprises:
saving the previously generated candidate word in a buffer;
after the complete-sentence result is obtained, replacing the candidate word saved in the buffer with the complete-sentence result.
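The buffer behaviour can be sketched as a small Python class. The replace-on-commit rule comes from the text above, and the rule that punctuation clears the buffer comes from the detailed description of Step 503; the class name, method names, and the particular punctuation set are assumptions made for illustration.

```python
class PreviousWordBuffer:
    """Holds the most recently committed word or complete sentence."""

    # Assumed punctuation set; the patent does not enumerate one.
    PUNCTUATION = frozenset("，。！？；：,.!?;:")

    def __init__(self):
        self._text = None

    def commit(self, text):
        """Record what the user just committed to the document."""
        if text and all(ch in self.PUNCTUATION for ch in text):
            # Punctuation ends the context, so the buffer is cleared.
            self._text = None
        else:
            # Any other commit replaces the previous content.
            self._text = text

    def get(self):
        """Return the previously generated candidate word, or None."""
        return self._text


buf = PreviousWordBuffer()
buf.commit("this")
first = buf.get()
buf.commit("bedroom is very big")
second = buf.get()
buf.commit("。")
third = buf.get()
```

Keeping a single slot is enough here because the method only conditions on the immediately preceding word or sentence.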
An embodiment of the invention also provides a Chinese complete sentence generating device, comprising:
a directed graph construction unit, configured to obtain from the pinyin dictionary the candidate words that occur in the pinyin string and to construct a candidate-word directed graph;
a previous candidate word acquiring unit, configured to obtain the previously generated candidate word;
a candidate word selecting unit, configured to select, from the candidate words corresponding to the initial arc of the directed graph, the candidate word with the largest conditional probability given the previously generated candidate word;
a complete sentence generating unit, configured to obtain the complete-sentence result based on the candidate word with the largest conditional probability.
Preferably, the candidate word selecting unit specifically comprises a candidate word combining unit, a word frequency querying unit, and a selecting unit;
the candidate word combining unit is configured to combine each candidate word corresponding to the initial arc of the directed graph with the previously generated candidate word;
the word frequency querying unit is configured to look up the word frequency of each candidate-word combination, the word frequency of each candidate word corresponding to the initial arc, and the word frequency of the previously generated candidate word;
the selecting unit is configured to compute, from these three word frequencies, the conditional probability of each candidate word corresponding to the initial arc, and to select the candidate word with the largest conditional probability.
Preferably, the selecting unit specifically comprises a co-occurrence probability computing unit, an independent probability computing unit, a conditional probability computing unit, and a word selecting unit;
the co-occurrence probability computing unit is configured to divide the word frequency of the candidate-word combination by the word frequency of the previously generated candidate word and multiply the quotient by a first parameter, to obtain the co-occurrence probability of the combination;
the independent probability computing unit is configured to divide the word frequency of the candidate word corresponding to the initial arc by the sum of the word frequencies of all words in the pinyin dictionary and multiply the quotient by a second parameter, to obtain the independent probability of the candidate word;
wherein the first parameter and the second parameter are each greater than 0 and less than 1, and their sum is less than 1;
the conditional probability computing unit is configured to add the co-occurrence probability and the independent probability to obtain the conditional probability of the candidate word corresponding to the initial arc;
the word selecting unit is configured to select the candidate word with the largest conditional probability.
As can be seen from the above technical solutions, the embodiments of the invention have the following advantages:
The embodiments use the previously generated candidate word to select, from the candidate words corresponding to the initial arc of the directed graph, the candidate word with the largest conditional probability given the previously generated candidate word, and obtain the complete-sentence result of the pinyin string based on that candidate word. Because the conditional probability of each candidate word on the initial arc is computed from the word frequency of the candidate-word combination and the word frequency of the previously generated candidate word, context information is used to generate the complete sentence, which improves the accuracy of both complete-sentence generation and candidate-word generation.
Description of drawings
Fig. 1 shows complete-sentence result example one of the prior art;
Fig. 2 is a flowchart of the Chinese complete sentence generating method of the prior art;
Fig. 3(a) shows the directed graph for a complete-sentence result of the prior art;
Fig. 3(b) shows complete-sentence result example two of the prior art;
Fig. 4(a) shows complete-sentence result example three of the prior art;
Fig. 4(b) shows complete-sentence result example three of the prior art;
Fig. 4(c) shows complete-sentence result example three of the prior art;
Fig. 5 is a flowchart of the Chinese complete sentence generating method provided by an embodiment of the invention;
Fig. 6 shows the directed graph for a complete-sentence result provided by an embodiment of the invention;
Fig. 7(a) is a schematic diagram of the composition of the Chinese complete sentence generating device provided by an embodiment of the invention;
Fig. 7(b) is a schematic diagram of the composition of the directed graph construction unit provided by an embodiment of the invention;
Fig. 7(c) is a schematic diagram of the composition of the candidate word selecting unit provided by an embodiment of the invention;
Fig. 7(d) is a schematic diagram of the composition of the selecting unit provided by an embodiment of the invention;
Fig. 7(e) is a schematic diagram of the composition of the complete sentence generating unit provided by an embodiment of the invention.
Embodiment
The embodiments of the invention provide a Chinese complete sentence generating method and device. To make the purpose, technical solutions, and advantages of the embodiments clearer, the embodiments are described in detail below with reference to the accompanying drawings.
In the embodiments of the invention, a complete sentence means a word or a combination of words.
The Chinese complete sentence generating method provided by the embodiments comprises: obtaining the previously generated candidate word;
obtaining the candidate words that occur in the pinyin string and constructing a candidate-word directed graph; selecting, from the candidate words corresponding to the initial arc of the directed graph, the candidate word with the largest conditional probability given the previously generated candidate word; and obtaining the complete-sentence result of the pinyin string based on that candidate word.
Fig. 5 is a flowchart of the Chinese complete sentence generating method provided by an embodiment of the invention, which comprises:
Step 501: segment the pinyin string into syllables;
Step 502: according to the segmentation result, look up in the pinyin dictionary all candidate words that occur in the pinyin string, and construct a candidate-word directed graph;
Step 503: obtain the previously generated candidate word;
The previously generated candidate word is the word or complete sentence that the user entered before the current input operation. It is stored in a buffer: each time the user performs an input operation, the word or complete sentence held in the buffer is replaced with the new one, and if the next thing the user enters is a punctuation mark, the buffer is cleared. For example, suppose the user is currently typing "woshihenda" and, before that, typed "zhejian" and selected "this"; "this" is then stored in the buffer. After typing "woshihenda", the user selects "bedroom is very big", and the word "this" held in the buffer is replaced with the complete sentence "bedroom is very big".
Step 504: combine each candidate word corresponding to the initial arc of the directed graph with the previously generated candidate word;
The initial arc of the candidate-word directed graph is an arc whose starting point is the start node of the directed graph.
Step 505: look up the word frequency of each candidate-word combination, the word frequency of each candidate word corresponding to the initial arc, and the word frequency of the previously generated candidate word;
In the embodiments of the invention, the pinyin dictionary is used in advance to segment raw text into a set of words. This set is scanned to count the number of times each word in the pinyin dictionary, and each combination of such words, occurs in it; that is, the word frequencies of the words and word combinations in the pinyin dictionary are counted, together with the sum of the word frequencies of all words in the pinyin dictionary. This word-frequency information is saved in a word-frequency information file. Note that if a word or phrase in the pinyin dictionary does not occur in the set of segmented words, its word frequency is counted as zero.
In Step 505, the candidate words and candidate-word combinations are matched against the words and word combinations saved in the word-frequency information file to find their word frequencies.
Step 506: compute, from the word frequency of each candidate-word combination, the word frequency of each candidate word corresponding to the initial arc, and the word frequency of the previously generated candidate word, the conditional probability of each candidate word corresponding to the initial arc, and select the candidate word with the largest conditional probability;
Step 507: obtain the complete-sentence result of the pinyin string based on the selected candidate word of the initial arc;
Step 508: display the complete-sentence result in the first position of the candidate window.
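The word-frequency statistics that Step 505 relies on can be sketched as follows. This is an assumption-laden illustration: the function name count_frequencies is hypothetical, the corpus is assumed to be already segmented into word lists, and a word combination is taken to mean two adjacent words, which the patent does not state explicitly.

```python
from collections import Counter


def count_frequencies(segmented_sentences, vocabulary):
    """Count word and adjacent-pair frequencies over a segmented corpus.

    Returns (word_freq, pair_freq, total_freq). Words of the vocabulary
    that never occur keep a frequency of zero, and unseen pairs also
    read as zero because Counter returns 0 for missing keys.
    """
    word_freq = Counter({word: 0 for word in vocabulary})
    pair_freq = Counter()
    for words in segmented_sentences:
        for word in words:
            if word in vocabulary:
                word_freq[word] += 1
        for left, right in zip(words, words[1:]):
            if left in vocabulary and right in vocabulary:
                pair_freq[(left, right)] += 1
    total_freq = sum(word_freq.values())
    return word_freq, pair_freq, total_freq


# Toy corpus with English glosses standing in for Chinese words.
corpus = [
    ["this", "bedroom", "very big"],
    ["I am", "very big"],
    ["this", "bedroom"],
]
vocab = {"this", "bedroom", "I am", "very big", "hold"}
word_freq, pair_freq, total_freq = count_frequencies(corpus, vocab)
```

In a real system these counts would be computed offline and stored, which corresponds to the word-frequency information file mentioned in the description.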
Step 507 of the embodiment is implemented as follows:
taking the conditional probability of the candidate word with the largest conditional probability as the probability of the initial arc;
computing the probabilities of the arcs of the candidate-word directed graph other than the initial arc, where the probability of each such arc equals the word frequency of the highest-frequency candidate word carried by that arc divided by the sum of the word frequencies of all words in the pinyin dictionary;
using a shortest-path algorithm (such as Dijkstra's algorithm or the Viterbi algorithm) to obtain the path (candidate-word combination) with the largest probability as the complete-sentence result.
Taking the Viterbi algorithm as an example, the process of obtaining the complete-sentence result with a shortest-path algorithm is as follows.
Starting from the start node of the directed graph, compute the accumulated probability (a product of probabilities) of each node, with the accumulated probability of the start node initialized to 1. For each node, among the accumulated probabilities obtained through its incoming arcs, take the largest as the accumulated probability of the node and record the number of the corresponding predecessor node, until the accumulated probability and predecessor of the last node of the directed graph have been obtained. Then, starting from the last node, trace back through the recorded predecessor numbers to the start node to obtain the path with the largest probability, and concatenate in order the candidate words of the arcs on this path to obtain the complete-sentence result. The accumulated probability is computed as: accumulated probability of the current node = accumulated probability of its predecessor node × probability of the connecting arc.
As the above process shows, the embodiment differs from the prior art as follows: in the embodiment, the probability of the initial arc is the conditional probability of the candidate word with the largest conditional probability, whereas in the prior art the probability of the initial arc is computed from the word frequency of the highest-frequency candidate word.
The above is the Chinese complete sentence generating method provided by the embodiment. In other embodiments of the invention, the conditional probabilities of the candidate words corresponding to the initial arc may be computed either while the candidate-word directed graph is being constructed or after it has been constructed; neither choice affects the implementation of the embodiments.
When implementing the above method, the conditional probability of a candidate word corresponding to the initial arc of the candidate-word directed graph may be computed as follows:
compute the co-occurrence probability of the candidate-word combination from the word frequency of the combination and the word frequency of the previously generated candidate word;
compute the independent probability of the candidate word from the word frequency of the candidate word corresponding to the initial arc;
add the co-occurrence probability and the independent probability to obtain the conditional probability of the candidate word corresponding to the initial arc.
In the embodiments of the invention, the co-occurrence probability, the independent probability, and the conditional probability may specifically be computed with the following formulas:
co-occurrence probability = first parameter × (word frequency of the candidate-word combination / word frequency of the previously generated candidate word);
independent probability = second parameter × (word frequency of the candidate word corresponding to the initial arc / sum of the word frequencies of all words in the pinyin dictionary);
conditional probability = co-occurrence probability + independent probability + offset δ
where the first parameter and the second parameter are each greater than 0 and less than 1, and their sum is less than 1; offset δ = (1 − first parameter − second parameter) / total number of words in the pinyin dictionary, and δ may be treated as approximately 0.
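The three formulas can be written as one small Python function. The function name, the argument names, the default parameter values 0.6 and 0.3, and the handling of a zero frequency for the previous word are assumptions; the patent only constrains the two parameters to lie between 0 and 1 with a sum below 1.

```python
def conditional_probability(pair_freq, prev_freq, word_freq,
                            total_freq, vocab_size,
                            first_param=0.6, second_param=0.3):
    """Conditional probability of a candidate word on the initial arc.

    pair_freq   frequency of (previous word, candidate word) together
    prev_freq   frequency of the previously generated candidate word
    word_freq   frequency of the candidate word itself
    total_freq  sum of the frequencies of all words in the dictionary
    vocab_size  total number of words in the dictionary
    """
    if not (0 < first_param < 1 and 0 < second_param < 1
            and first_param + second_param < 1):
        raise ValueError("parameters must be in (0, 1) with a sum below 1")

    # Assumption: an unseen previous word contributes no co-occurrence.
    if prev_freq > 0:
        co_occurrence = first_param * pair_freq / prev_freq
    else:
        co_occurrence = 0.0
    independent = second_param * word_freq / total_freq
    delta = (1 - first_param - second_param) / vocab_size
    return co_occurrence + independent + delta


# Assumed toy counts: "this" occurs 50 times, "this bedroom" 10 times.
p_bedroom = conditional_probability(10, 50, 20, 1000, 8)
p_i_am = conditional_probability(0, 50, 90, 1000, 8)
```

With these toy counts the candidate "bedroom" outranks "I am" even though its own word frequency is lower, because only "bedroom" has a nonzero co-occurrence term with the previous word.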
In other embodiments of the invention, other formulas may be used to compute these three probabilities without affecting the implementation of the embodiments.
The following example illustrates the complete sentence generating method provided by the embodiment. Suppose the user wants to enter "this bedroom is very big" in two inputs. The user first types "zhejian" and selects "this", so the buffer now holds "this". The user then types "woshihenda", whose syllable segmentation is "wo'shi'hen'da".
All candidate words in the pinyin string are looked up in the pinyin dictionary, and the candidate-word directed graph shown in Fig. 6 is constructed. The graph has 5 nodes in total; the start node is numbered 0 and the last node is numbered 4. The candidate words corresponding to the initial arc of the graph are each combined with "this", giving candidate-word combinations such as "this I", "this holds", "this bedroom", and "I make for this". The word frequencies of these combinations are looked up in the word-frequency information file: the word frequency of "this bedroom" is an integer greater than zero, and the word frequencies of the other combinations are zero. The conditional probability of "bedroom" is therefore greater than that of the other candidate words corresponding to the initial arc, and it is taken as the probability of the initial arc. The probabilities of the other arcs are then computed from the word frequency of the highest-frequency candidate word that each arc carries. The accumulated probability of node 0, the start node, is initialized to 1; starting from node 0, the accumulated probability and predecessor node number of each node are computed. Finally, starting from node 4, the recorded predecessor node numbers are traced back to node 0 to obtain the path with the largest probability.
In this example, tracing back from node 4 gives predecessor node 2, and tracing back from node 2 gives predecessor node 0, where the trace ends. The nodes of the maximum-probability path are 0-2-4, and concatenating in order the candidate words along the path gives "bedroom is very big". Because the conditional probability of "bedroom" is the largest among the probabilities of the initial arcs, node 2 records node 0 as its predecessor. Node 4 records node 2 rather than node 3 as its predecessor because the probability of "very big" multiplied by the accumulated probability of node 2 is greater than the probability of "big" multiplied by the accumulated probability of node 3. The complete-sentence result is therefore "bedroom is very big" rather than the "I am very big" produced by the prior art.
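The example can be reproduced end to end with a small Python sketch. Everything numeric here is invented for illustration: the word and pair frequencies, the dictionary size, the parameter values 0.6 and 0.3, and the exact arc layout of Fig. 6 are assumptions, and English glosses stand in for the Chinese candidate words. The sketch also assumes that each arc leaving node 0 takes its best conditional probability, which is one reading of the description.

```python
TOTAL_FREQ = 1000   # assumed sum of all word frequencies
VOCAB_SIZE = 8      # assumed number of dictionary words
WORD_FREQ = {"this": 50, "I": 120, "I am": 90, "bedroom": 20,
             "am": 100, "very": 80, "big": 70, "very big": 40}
PAIR_FREQ = {("this", "bedroom"): 10}
# (start node, end node, candidate words) for "wo'shi'hen'da".
ARCS = [(0, 1, ["I"]), (0, 2, ["I am", "bedroom"]), (1, 2, ["am"]),
        (2, 3, ["very"]), (3, 4, ["big"]), (2, 4, ["very big"])]
LAST_NODE = 4


def cond_prob(previous, word, a=0.6, b=0.3):
    """Co-occurrence + independent probability + offset delta."""
    co = a * PAIR_FREQ.get((previous, word), 0) / WORD_FREQ[previous]
    return co + b * WORD_FREQ[word] / TOTAL_FREQ + (1 - a - b) / VOCAB_SIZE


def score_arcs(previous):
    """Pick one word and one probability per arc."""
    scored = []
    for start, end, words in ARCS:
        if start == 0 and previous is not None:
            # Embodiment: initial arcs use the conditional probability.
            probs = {w: cond_prob(previous, w) for w in words}
        else:
            # Prior art and non-initial arcs: plain word frequency.
            probs = {w: WORD_FREQ[w] / TOTAL_FREQ for w in words}
        word = max(probs, key=probs.get)
        scored.append((start, end, word, probs[word]))
    return scored


def generate(previous=None):
    """Viterbi search with backtracking over the scored arcs."""
    best = {0: (1.0, None, None)}  # node -> (probability, predecessor, word)
    for node in range(1, LAST_NODE + 1):
        for start, end, word, prob in score_arcs(previous):
            if end == node and start in best:
                score = best[start][0] * prob
                if node not in best or score > best[node][0]:
                    best[node] = (score, start, word)
    words, node = [], LAST_NODE
    while node != 0:
        _, node, word = best[node]
        words.append(word)
    return words[::-1]


without_context = generate(None)
with_context = generate("this")
```

Under these assumed counts, calling generate without context follows the path through "I am" and "very big", while passing "this" as the previous word switches the initial arc to "bedroom", mirroring the path 0-2-4 described above.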
An embodiment of the invention also provides a Chinese whole-sentence generating device. Referring to Fig. 7(a), the device comprises:
Directed graph construction unit 701, configured to obtain the candidate words occurring in the pinyin string and construct the candidate word directed graph;
Previous candidate word acquiring unit 702, configured to obtain the previously generated candidate word;
Candidate word selection unit 703, configured to select, from the candidate words corresponding to the initial arc of the directed graph, the candidate word with the maximum conditional probability with respect to the previously generated candidate word;
Whole-sentence generation unit 704, configured to obtain the whole-sentence generation result based on the candidate word with the maximum conditional probability.
In a specific implementation, the directed graph construction unit 701 may consist of the following three units, see Fig. 7(b):
Syllabification unit 7011, configured to split the pinyin string into syllables;
Candidate word lookup unit 7012, configured to look up in the pinyin dictionary, according to the syllabification result, the candidate words occurring in the pinyin string;
Directed graph generation unit 7013, configured to construct the candidate word directed graph from the candidate words obtained by the candidate word lookup unit.
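The three units above can be pictured with the following minimal Python sketch. It is a simplified illustration under stated assumptions, not the implementation of the embodiment: the function names, the tiny syllable set, the pinyin string "woshihenda" and the toy dictionary that maps pinyin to the English glosses used in this description are all hypothetical.

```python
def split_syllables(pinyin, syllable_set, max_len=6):
    """Split a pinyin string into syllables by backtracking, longest match first.

    Returns a list of syllables, or None if no full segmentation exists.
    """
    if not pinyin:
        return []
    for size in range(min(max_len, len(pinyin)), 0, -1):
        head = pinyin[:size]
        if head in syllable_set:
            rest = split_syllables(pinyin[size:], syllable_set, max_len)
            if rest is not None:
                return [head] + rest
    return None


def build_word_graph(syllables, pinyin_dict):
    """Build the candidate word directed graph.

    Node i sits before syllable i, so n syllables give n + 1 nodes.
    Each arc is (start node, end node, candidate words for that span).
    """
    n = len(syllables)
    arcs = []
    for start in range(n):
        for end in range(start + 1, n + 1):
            words = pinyin_dict.get("".join(syllables[start:end]))
            if words:
                arcs.append((start, end, list(words)))
    return n + 1, arcs


# Hypothetical data for illustration only.
SYLLABLES = {"wo", "shi", "hen", "da"}
PINYIN_DICT = {
    "wo": ["I", "holds"],
    "woshi": ["bedroom", "I am"],
    "shi": ["am"],
    "hen": ["very"],
    "henda": ["very big"],
    "da": ["greatly"],
}

if __name__ == "__main__":
    syllables = split_syllables("woshihenda", SYLLABLES)
    print(build_word_graph(syllables, PINYIN_DICT))
```

Four syllables produce a graph with five nodes numbered 0 to 4, and the arcs leaving node 0 carry the candidate words of the initial arc.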
In a specific implementation, the candidate word selection unit 703 may consist of the following units, see Fig. 7(c):
Candidate word combination unit 7031, configured to combine each candidate word corresponding to the initial arc of the directed graph with the previously generated candidate word;
Word frequency query unit 7032, configured to query the word frequency of each candidate word combination, the word frequency of each candidate word corresponding to the initial arc, and the word frequency of the previously generated candidate word;
Selection unit 7033, configured to calculate the conditional probability of each candidate word corresponding to the initial arc from the word frequency of the candidate word combination, the word frequency of the candidate word corresponding to the initial arc, and the word frequency of the previously generated candidate word, and to select the candidate word with the maximum conditional probability.
In a specific implementation, the selection unit 7033 may consist of the following four units, see Fig. 7(d):
Co-occurrence probability calculation unit 70331, configured to calculate the co-occurrence probability of each candidate word combination from the word frequency of the candidate word combination and the word frequency of the previously generated candidate word;
Independent probability calculation unit 70332, configured to calculate the independent probability of each candidate word from the word frequency of the candidate word corresponding to the initial arc;
Conditional probability calculation unit 70333, configured to add the co-occurrence probability and the independent probability to obtain the conditional probability of the candidate word corresponding to the initial arc;
Word selection unit 70334, configured to select the candidate word with the maximum conditional probability.
The co-occurrence probability calculation unit 70331 and the independent probability calculation unit 70332 may use the formulas for the co-occurrence probability and the independent probability given earlier in this description; refer to that content for details, which are not repeated here.
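The division of work among units 70331 to 70334 can be sketched in Python as below. The formulas follow the wording of the claims (combination frequency divided by the frequency of the previously generated word, times a first parameter; candidate frequency divided by the total frequency of all words, times a second parameter; their sum plus an offset). The function names, the parameter values 0.6 and 0.3, the default offset of 0 and all word frequencies are assumptions made for this example.

```python
def cooccurrence_probability(pair_freq, prev_freq, first_param):
    """Co-occurrence probability of (previous word, candidate word)."""
    if prev_freq == 0:
        return 0.0
    return first_param * pair_freq / prev_freq


def independent_probability(word_freq, total_freq, second_param):
    """Independent probability of a candidate word on its own."""
    return second_param * word_freq / total_freq


def select_candidate(prev_word, candidates, unigram, bigram,
                     first_param=0.6, second_param=0.3, offset=0.0):
    """Return (best candidate, {candidate: conditional probability}).

    Both parameters are assumed to lie in (0, 1) with a sum below 1.
    """
    total = sum(unigram.values())
    scores = {}
    for word in candidates:
        co = cooccurrence_probability(
            bigram.get((prev_word, word), 0), unigram.get(prev_word, 0), first_param)
        ind = independent_probability(unigram.get(word, 0), total, second_param)
        scores[word] = co + ind + offset
    return max(scores, key=scores.get), scores


# Hypothetical word frequencies (illustrative values only).
UNIGRAM = {"this": 100, "I": 500, "holds": 50, "bedroom": 200, "I am": 150}
BIGRAM = {("this", "bedroom"): 40}

if __name__ == "__main__":
    print(select_candidate("this", ["I", "holds", "bedroom", "I am"], UNIGRAM, BIGRAM))
```

With these assumed frequencies "I" is the most frequent word on its own, yet "bedroom" is selected because only the combination "this bedroom" has a non-zero frequency, which is the effect of using context that the embodiment describes.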
In a specific implementation, the whole-sentence generation unit 704 may consist of the following units, see Fig. 7(e):
Initial arc probability acquiring unit 7041, configured to take the conditional probability of the candidate word with the maximum conditional probability as the probability of the initial arc of the candidate word directed graph;
Other arc probability calculation unit 7042, configured to calculate the probabilities of the arcs in the candidate word directed graph other than the initial arc;
Path selection unit 7043, configured to use a shortest path algorithm to obtain the path of maximum probability as the whole-sentence generation result of the pinyin string.
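One common way to find a maximum-probability path with a shortest path algorithm is to give each arc the cost -log(p), so that the path with the largest product of probabilities becomes the path with the smallest sum of costs. The Python sketch below uses Dijkstra's algorithm for this; the description does not state which shortest path algorithm the embodiment uses, so the choice of Dijkstra, the function name and the probability values are all assumptions.

```python
import heapq
import math


def most_probable_path(num_nodes, arcs):
    """Return (word sequence, probability) of the most probable path 0 -> last node.

    arcs: iterable of (start, end, word, prob) with 0 < prob <= 1, so that
    every cost -log(prob) is non-negative, as Dijkstra's algorithm requires.
    """
    graph = [[] for _ in range(num_nodes)]
    for start, end, word, prob in arcs:
        graph[start].append((end, word, -math.log(prob)))

    dist = [math.inf] * num_nodes   # smallest accumulated cost per node
    back = [None] * num_nodes       # (predecessor node, word) per node
    dist[0] = 0.0
    heap = [(0.0, 0)]
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist[node]:
            continue                # stale heap entry
        for nxt, word, cost in graph[node]:
            if d + cost < dist[nxt]:
                dist[nxt] = d + cost
                back[nxt] = (node, word)
                heapq.heappush(heap, (dist[nxt], nxt))

    # Trace back from the last node to node 0.
    words, node = [], num_nodes - 1
    while node != 0:
        if back[node] is None:
            raise ValueError("last node is not reachable from node 0")
        node, word = back[node]
        words.append(word)
    return words[::-1], math.exp(-dist[-1])


# Hypothetical arc probabilities (illustrative values only).
ARCS = [
    (0, 1, "I", 0.01), (0, 2, "bedroom", 0.2), (1, 2, "am", 0.05),
    (2, 3, "very", 0.03), (2, 4, "very big", 0.01), (3, 4, "greatly", 0.02),
]

if __name__ == "__main__":
    print(most_probable_path(5, ARCS))
```

Working with sums of logarithms instead of products also avoids numerical underflow when a pinyin string is long.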
To display the whole-sentence generation result, the above device may further comprise:
Whole-sentence display unit, configured to display the whole-sentence generation result in the first position of the candidate word window.
In addition, in embodiments of the invention, the user may enter a single word in two separate inputs. For example, the user enters "motorcycle" in two parts, "rubbing" the first time and "holder car" the second time. In this case, the pinyin of "rubbing" saved in the buffer is combined with the pinyin "tuoche" of the second input to obtain "motuoche"; the word corresponding to "motuoche" is then looked up in the pinyin dictionary, and "holder car", the part of "motorcycle" that corresponds to the second input, is displayed in the first position of the candidate word window as the generation result.
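A minimal Python sketch of this split-input handling is given below. The function name, the dictionary layout and the rule that a match must begin with the buffered word are assumptions made for illustration; the Chinese strings are the original words behind the translated labels "rubbing" (摩), "holder car" (托车) and "motorcycle" (摩托车).

```python
def complete_split_word(buffer_pinyin, buffer_word, new_pinyin, pinyin_dict):
    """Return candidate completions for a word typed in two parts.

    The buffered pinyin and the new pinyin are joined and looked up as one
    word; for each match that starts with the buffered word, only the part
    belonging to the second input is returned.
    """
    combined = buffer_pinyin + new_pinyin
    completions = []
    for word in pinyin_dict.get(combined, []):
        if word.startswith(buffer_word) and len(word) > len(buffer_word):
            completions.append(word[len(buffer_word):])
    return completions


# Hypothetical one-entry dictionary for illustration only.
EXAMPLE_DICT = {"motuoche": ["摩托车"]}

if __name__ == "__main__":
    print(complete_split_word("mo", "摩", "tuoche", EXAMPLE_DICT))
```

The first element of the returned list is what would be placed in the first position of the candidate word window.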
The Chinese complete sentence generating method and device provided by the present invention have been described in detail above. Those of ordinary skill in the art may, following the idea of the embodiments of the invention, make changes to the specific implementations and the scope of application. In summary, the content of this description should not be construed as limiting the present invention.

Claims (10)

1. A Chinese complete sentence generating method, characterized by comprising:
obtaining the previously generated candidate word;
obtaining, from a pinyin dictionary, the candidate words occurring in a pinyin string, and constructing a candidate word directed graph;
selecting, from the candidate words corresponding to the initial arc of the directed graph, the candidate word with the maximum conditional probability with respect to the previously generated candidate word, wherein the conditional probability equals the co-occurrence probability plus the independent probability plus an offset;
obtaining, based on the candidate word with the maximum conditional probability, the whole-sentence generation result of the pinyin string.
2. The method of claim 1, characterized in that selecting the candidate word with the maximum conditional probability with respect to the previously generated candidate word specifically comprises:
combining each candidate word corresponding to the initial arc of the directed graph with the previously generated candidate word;
querying the word frequency of each candidate word combination, the word frequency of each candidate word corresponding to the initial arc, and the word frequency of the previously generated candidate word;
calculating the conditional probability of each candidate word corresponding to the initial arc from the word frequency of the candidate word combination, the word frequency of the candidate word corresponding to the initial arc, and the word frequency of the previously generated candidate word, and selecting the candidate word with the maximum conditional probability.
3. The method of claim 2, characterized in that calculating the conditional probability of the candidate word corresponding to the initial arc specifically comprises:
calculating the co-occurrence probability of the candidate word combination from the word frequency of the candidate word combination and the word frequency of the previously generated candidate word;
calculating the independent probability of the candidate word from the word frequency of the candidate word corresponding to the initial arc;
adding the co-occurrence probability, the independent probability and an offset to obtain the conditional probability of the candidate word corresponding to the initial arc.
4. The method of claim 3, characterized in that calculating the co-occurrence probability of the candidate word combination specifically comprises:
dividing the word frequency of the candidate word combination by the word frequency of the previously generated candidate word and multiplying the result by a first parameter, to obtain the co-occurrence probability of the candidate word combination;
and calculating the independent probability of the candidate word specifically comprises:
dividing the word frequency of the candidate word corresponding to the initial arc by the sum of the word frequencies of all words in the pinyin dictionary and multiplying the result by a second parameter, to obtain the independent probability of the candidate word;
wherein the first parameter and the second parameter are each greater than 0 and less than 1, and the sum of the first parameter and the second parameter is less than 1.
5. The method of any one of claims 1 to 4, characterized in that obtaining the whole-sentence generation result of the pinyin string based on the selected candidate word specifically comprises:
taking the conditional probability of the candidate word with the maximum conditional probability as the probability of the initial arc of the candidate word directed graph;
calculating the probabilities of the arcs in the candidate word directed graph other than the initial arc;
using a shortest path algorithm to obtain the path of maximum probability as the whole-sentence generation result of the pinyin string.
6. The method of any one of claims 1 to 4, characterized by further comprising:
displaying the whole-sentence generation result in the first position of the candidate word window.
7. The method of any one of claims 1 to 4, characterized by further comprising:
saving the previously generated candidate word in a buffer;
after the whole-sentence generation result is obtained, replacing the candidate word saved in the buffer with the whole-sentence generation result.
8. the whole sentence of a Chinese generating apparatus is characterized in that, comprising:
The digraph construction unit is used for obtaining the candidate word that occurs the pinyin string from the phonetic dictionary, makes up the candidate word digraph;
Last time the candidate word acquiring unit was used to obtain the candidate word that last time generated;
The candidate word selected cell is used for from the candidate word of the initial arc correspondence of described digraph, selects the candidate word of the conditional probability maximum of the described candidate word correspondence that last time generated, and wherein, described conditional probability equals co-occurrence probabilities and adds independent probability and add side-play amount;
Whole sentence generation unit is used for the candidate word based on described conditional probability maximum, obtains whole sentence and generates the result.
9. device as claimed in claim 8 is characterized in that, described candidate word selected cell specifically comprises: candidate word assembled unit, word frequency inquiry unit, selected cell;
Described candidate word assembled unit, be used for the candidate word of the initial arc correspondence of described digraph respectively with the candidate word combination that last time generated;
Described word frequency inquiry unit is used for inquiring about respectively the word frequency that described candidate word makes up, the word frequency of the candidate word of described initial arc correspondence, and the word frequency of the described candidate word that last time generated;
Described selected cell, be used for word frequency, the word frequency of the candidate word of described initial arc correspondence, and the word frequency of the described candidate word that last time generated according to described candidate word combination, calculate the conditional probability of the candidate word of described initial arc correspondence respectively, the candidate word of alternative condition probability maximum.
10. device as claimed in claim 9 is characterized in that, described selected cell specifically comprises: the co-occurrence probabilities computing unit, and independent probability calculation unit, the conditional probability computing unit selects the speech unit;
Described co-occurrence probabilities computing unit is used for, and the word frequency that makes up with described candidate word multiply by first parameter again divided by the described last time word frequency of the candidate word of generation, obtains the co-occurrence probabilities of described candidate word combination;
Described independent probability calculation unit is used for, and with the word frequency of the candidate word of the described initial arc correspondence word frequency summation divided by all speech in the phonetic dictionary, multiply by second parameter again, obtains the independent probability of described candidate word;
Wherein, described first parameter and second parameter are greater than 0 less than 1 positive number, and described first parameter and second parameter and less than 1;
Described conditional probability computing unit is used for described co-occurrence probabilities and described independent probability addition are added side-play amount, obtains the conditional probability of the candidate word of described initial arc correspondence;
The described speech unit that selects is used for the candidate word of alternative condition probability maximum.
CN200710151332XA | 2007-09-25 | 2007-09-25 | Chinese integral sentence generation method and device | Active | CN101122901B (en)

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN200710151332XA, CN101122901B (en) | 2007-09-25 | 2007-09-25 | Chinese integral sentence generation method and device

Applications Claiming Priority (1)

Application Number | Priority Date | Filing Date | Title
CN200710151332XA, CN101122901B (en) | 2007-09-25 | 2007-09-25 | Chinese integral sentence generation method and device

Publications (2)

Publication Number | Publication Date
CN101122901A (en) | 2008-02-13
CN101122901B (en) | 2011-11-09

Family

ID=39085238

Family Applications (1)

Application Number | Status | Publication | Priority Date | Filing Date | Title
CN200710151332XA | Active | CN101122901B (en) | 2007-09-25 | 2007-09-25 | Chinese integral sentence generation method and device

Country Status (1)

Country | Link
CN (1) | CN101122901B (en)

Also Published As

Publication Number | Publication Date
CN101122901A (en) | 2008-02-13

Legal Events

Date | Code | Title | Description
 | C06 | Publication |
 | PB01 | Publication |
 | C10 | Entry into substantive examination |
 | SE01 | Entry into force of request for substantive examination |
 | C14 | Grant of patent or utility model |
 | GR01 | Patent grant |
 | ASS | Succession or assignment of patent right | Owner name: SHENZHEN SHIJI LIGHT SPEED INFORMATION TECHNOLOGY; Free format text: FORMER OWNER: TENGXUN SCI-TECH (SHENZHEN) CO., LTD.; Effective date: 20131021
 | C41 | Transfer of patent application or patent right or utility model |
 | COR | Change of bibliographic data | Free format text: CORRECT: ADDRESS; FROM: 518044 SHENZHEN, GUANGDONG PROVINCE TO: 518057 SHENZHEN, GUANGDONG PROVINCE
 | TR01 | Transfer of patent right | Effective date of registration: 20131021; Address after: 518057 Tencent Building, 16, Nanshan District hi tech park, Guangdong, Shenzhen; Patentee after: Shenzhen Shiji Guangsu Information Technology Co., Ltd.; Address before: 2, 518044, East 410 room, SEG science and Technology Park, Zhenxing Road, Shenzhen, Guangdong, Futian District; Patentee before: Tencent Technology (Shenzhen) Co., Ltd.

