Fig. 6 is a schematic diagram of experimental results for the embodiment of the invention, showing the relation between the number of EM iterations and the perplexity on the corpus. Fig. 7 is another schematic diagram of the experimental results, showing how the mutual information between text and pinyin changes as the number of iterations increases.

Tables 2 and 3 give the comparison results for the pinyin-to-character conversion rate, contrasting the Optimized SLM method with Baseline I and with Baseline II, respectively.

Table 2

Table 3

As shown in Fig. 6, the perplexity of both Baseline II and the Optimized language model decreases as the iterations proceed, and both reach a local optimum after six iterations. Baseline II has a lower language model perplexity than the method of the present invention. As shown in Fig. 7, the mutual information between text and pinyin increases gradually with the number of iterations and converges after eight iterations.
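The language model complexity (perplexity) plotted in Fig. 6 is conventionally computed as the exponential of the negative average log-probability per token. The following Python sketch assumes that natural-log token probabilities are already available from some language model; the function name `perplexity` is illustrative and is not taken from the embodiment.

```python
import math


def perplexity(token_log_probs):
    """Perplexity from the natural-log probability of each token."""
    n = len(token_log_probs)
    if n == 0:
        raise ValueError("at least one token is required")
    # exp of the negative mean log-probability
    return math.exp(-sum(token_log_probs) / n)


# Toy check: if every token has probability 1/4, the perplexity is about 4.
uniform_quarter = perplexity([math.log(0.25)] * 8)
```

A lower value means the model finds the corpus less surprising, which is why a decreasing curve in Fig. 6 is read as an improvement of the language model.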

Compared with Baseline I, the method of the present invention is superior on both the training set and the test set: the relative reduction in the pinyin-to-character conversion error rate is 87.04% and 19.72%, respectively. Compared with Baseline II, the error rate obtained by the system of the present invention on the training set and the test set is reduced by 82.8% and 10.3%, respectively.
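The relative reductions quoted above follow the usual definition: the difference between the baseline error rate and the new error rate, divided by the baseline error rate. A minimal Python sketch of that arithmetic is given below; the error rates in the example are made up for illustration only and are not the values of Tables 2 and 3.

```python
def relative_error_reduction(baseline_error, new_error):
    """Relative reduction of new_error with respect to baseline_error, in percent."""
    if baseline_error <= 0:
        raise ValueError("baseline_error must be positive")
    return 100.0 * (baseline_error - new_error) / baseline_error


# Hypothetical error rates (in percent), chosen only to show the calculation.
print(relative_error_reduction(10.0, 8.0))  # 20.0
```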

The experimental results show that the method proposed by the present invention achieves the best pinyin-to-character conversion accuracy, and shows particularly high accuracy on the training set. Compared with Baseline II, which optimizes the language model by the conventional perplexity criterion, the method of the present invention has a higher perplexity but also a higher accuracy. This indicates that language model perplexity does not characterize system performance well.

After about eight iterations, the final dictionary contains 147,784 entries. About 36,000 of these entries coincide with entries of a conventional dictionary; the remaining entries are data-driven, obtained automatically by optimizing the mutual information between text and pinyin.

The new entries fall into two types: 1. Adjacent words with a very high co-occurrence rate, such as "he", "will come" and "as us". Under the word-formation rules of Chinese, such entries are usually considered ill-formed and therefore would not be admitted to a standard dictionary. 2. Newly discovered words and terms, such as "domain name", "Quanjude" and "Beijing Capital Iron and Steel". Adding these words reduces the uncertainty of pinyin-to-Chinese-character conversion and thereby improves the accuracy of pinyin-to-character conversion.

As can be seen from the above embodiment, performing Chinese pinyin-to-character conversion with a discriminative dictionary built on the mutual information between text and pinyin can further improve conversion accuracy.
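Under a simple assumption, the mutual information between text and pinyin can be estimated from co-occurrence counts of (word, pinyin) pairs in an aligned corpus. The Python sketch below is a generic plug-in estimator and is not necessarily the estimator used in the embodiment; the function name and the toy pairs are hypothetical.

```python
import math
from collections import Counter


def mutual_information(pairs):
    """Empirical mutual information, in bits, between words and their pinyin."""
    pairs = list(pairs)
    n = len(pairs)
    joint = Counter(pairs)                    # counts of (word, pinyin)
    words = Counter(w for w, _ in pairs)      # marginal counts of words
    pinyins = Counter(y for _, y in pairs)    # marginal counts of pinyin
    total = 0.0
    for (w, y), c in joint.items():
        # p(w, y) * log2( p(w, y) / (p(w) * p(y)) )
        total += (c / n) * math.log2(c * n / (words[w] * pinyins[y]))
    return total


# Toy data: one pinyin shared by two words carries no information about the word,
# whereas entries with distinct pinyin are fully determined by it.
ambiguous = mutual_information([("是", "shi"), ("事", "shi")])
unambiguous = mutual_information([("是", "shi"), ("北京", "bei jing")])
```

In the toy data the ambiguous case yields 0 bits and the unambiguous case 1 bit, which mirrors the observation above that longer, less ambiguous entries reduce the uncertainty of the conversion.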

Embodiment 2

The embodiment of the invention provides a Chinese pinyin-to-character conversion system based on a discriminative dictionary, corresponding to the Chinese pinyin-to-character conversion method of Embodiment 1; content identical to Embodiment 1 is not repeated here.

Fig. 8 is a schematic diagram of the structure of the Chinese pinyin-to-character conversion system of the embodiment of the invention. As shown in Fig. 8, the Chinese pinyin-to-character conversion system 800 comprises a first generation unit 801 and a path acquisition unit 802; for the other parts of the Chinese pinyin-to-character conversion system 800, reference may be made to the prior art.

The first generation unit 801 generates a word lattice corresponding to the input pinyin string according to that pinyin string and a pre-built discriminative dictionary, wherein the discriminative dictionary is built on the mutual information between text and pinyin. The path acquisition unit 802 decodes the word lattice according to a statistical language model and obtains the conversion path with the highest probability, thereby realizing Chinese pinyin-to-character conversion.
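The cooperation of units 801 and 802 can be sketched as follows, assuming the dictionary maps tuples of pinyin syllables to candidate words and the statistical language model is a bigram model exposed as a log-probability function. All names here (`build_lattice`, `decode`, `TOY_DICT`, `toy_bigram_logprob`) and the toy probabilities are hypothetical and serve only to illustrate lattice generation followed by Viterbi decoding.

```python
import math
from collections import defaultdict


def build_lattice(syllables, dictionary, max_len=4):
    """Return edges (start, end, word) for every dictionary entry matching a syllable span."""
    edges = []
    n = len(syllables)
    for i in range(n):
        for j in range(i + 1, min(n, i + max_len) + 1):
            for word in dictionary.get(tuple(syllables[i:j]), ()):
                edges.append((i, j, word))
    return edges


def decode(syllables, dictionary, bigram_logprob):
    """Viterbi search for the highest-scoring word path through the lattice."""
    n = len(syllables)
    by_start = defaultdict(list)
    for i, j, word in build_lattice(syllables, dictionary):
        by_start[i].append((j, word))
    # best[pos][last_word] = (log-score, path ending at pos with last_word)
    best = [dict() for _ in range(n + 1)]
    best[0]["<s>"] = (0.0, [])
    for i in range(n):
        for prev, (score, path) in best[i].items():
            for j, word in by_start[i]:
                cand = score + bigram_logprob(prev, word)
                if word not in best[j] or cand > best[j][word][0]:
                    best[j][word] = (cand, path + [word])
    if not best[n]:
        return None  # no dictionary path covers the whole input
    return max(best[n].values(), key=lambda t: t[0])[1]


# Hypothetical toy dictionary and bigram probabilities.
TOY_DICT = {
    ("bei", "jing"): ["北京", "背景"],
    ("bei",): ["被", "北"],
    ("jing",): ["京", "经"],
}
TOY_BIGRAMS = {("<s>", "北京"): 0.5, ("<s>", "背景"): 0.2}


def toy_bigram_logprob(prev, word):
    # Unseen bigrams get a small floor probability (an assumption of this sketch).
    return math.log(TOY_BIGRAMS.get((prev, word), 1e-4))


best_path = decode(["bei", "jing"], TOY_DICT, toy_bigram_logprob)  # ["北京"]
```

Keeping the last word as part of the search state is what lets a bigram model score each edge exactly; a higher-order model would need a correspondingly longer history in the state.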

As shown in Fig. 8, the Chinese pinyin-to-character conversion system 800 may further comprise a dictionary construction unit 803, which builds the discriminative dictionary by adjusting all possible word boundaries in a sentence according to the mutual information between text and pinyin.

Fig. 9 is a schematic diagram of the structure of the dictionary construction unit of the embodiment of the invention. As shown in Fig. 9, the dictionary construction unit 803 may comprise a second generation unit 901, a mode determination unit 902 and a text segmentation unit 903.

The second generation unit 901 builds a word lattice from a training pinyin string and an initial dictionary, and decodes the word lattice with a statistical language model to obtain different pinyin conversion modes. The mode determination unit 902 determines, from the different pinyin conversion modes, the one with the maximum mutual information. The text segmentation unit 903 segments the text corresponding to the training pinyin string according to the pinyin conversion mode with the maximum mutual information, and collects statistics on the segmented text to obtain a new dictionary.
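The selection and counting performed by units 902 and 903 can be sketched as below. This passage does not specify how the mutual information of a single conversion mode is computed, so the sketch takes a caller-supplied `score` function as a stand-in for that criterion; `select_segmentation`, `rebuild_dictionary`, `min_count` and the toy score table are all hypothetical.

```python
from collections import Counter


def select_segmentation(candidates, score):
    """Pick the candidate segmentation with the highest score."""
    return max(candidates, key=score)


def rebuild_dictionary(corpus_candidates, score, min_count=1):
    """Count the (word, pinyin) entries of the selected segmentations."""
    counts = Counter()
    for candidates in corpus_candidates:
        best = select_segmentation(candidates, score)
        counts.update(best)  # each item is a (word, pinyin) pair
    return {entry: c for entry, c in counts.items() if c >= min_count}


# Hypothetical per-entry scores standing in for the mutual-information criterion.
TOY_SCORES = {("全聚德", "quan ju de"): 3.0}


def toy_score(segmentation):
    return sum(TOY_SCORES.get(entry, 0.5) for entry in segmentation)


# One training sentence with two candidate segmentations.
toy_corpus = [[
    [("全", "quan"), ("聚", "ju"), ("德", "de")],
    [("全聚德", "quan ju de")],
]]
new_dictionary = rebuild_dictionary(toy_corpus, toy_score)
```

With the toy scores, the single-entry segmentation wins, so the rebuilt dictionary contains one entry for the whole term rather than three single-character entries.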

As shown in Fig. 9, the dictionary construction unit 803 may further comprise an information evaluation unit 904 and an iteration judging unit 905. The information evaluation unit 904 evaluates the mutual information between the training pinyin string and the text. When the change in the evaluated mutual information exceeds a predetermined threshold, the iteration judging unit 905 selects a new training pinyin string and iteratively trains the new dictionary.
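The stopping rule applied by units 904 and 905 can be sketched as a loop that continues while the change in mutual information exceeds the threshold. The sketch assumes caller-supplied `rebuild` and `evaluate_mi` functions; these names, the `max_iters` safeguard and the toy convergence sequence are assumptions of this illustration, not details of the embodiment.

```python
from itertools import islice, repeat


def iterate_until_converged(dictionary, batches, rebuild, evaluate_mi,
                            threshold=1e-3, max_iters=50):
    """Retrain the dictionary on successive batches until the MI change is small."""
    previous = None
    iterations = 0
    for batch in islice(batches, max_iters):
        dictionary = rebuild(dictionary, batch)
        current = evaluate_mi(dictionary, batch)
        iterations += 1
        # Stop once the change no longer exceeds the threshold.
        if previous is not None and abs(current - previous) <= threshold:
            break
        previous = current
    return dictionary, iterations


# Toy stand-ins: the "dictionary" is just an iteration counter, and the
# mutual information follows the made-up sequence 1 - 0.5**k.
final_dictionary, n_iters = iterate_until_converged(
    0,
    repeat(None),
    rebuild=lambda d, batch: d + 1,
    evaluate_mi=lambda d, batch: 1 - 0.5 ** d,
    threshold=0.01,
)
```

In the toy run the change first drops to 0.0078125 at the seventh pass, so the loop stops there; `max_iters` only guards against a criterion that never converges.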

Those skilled in the art will further appreciate that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein can be implemented in electronic hardware, computer software, or a combination of the two. To illustrate the interchangeability of hardware and software clearly, the composition and steps of each example have been described above in general terms of their function. Whether these functions are performed in hardware or in software depends on the particular application and the design constraints of the technical solution. Those skilled in the art may implement the described functions in different ways for each particular application, but such implementations should not be considered to go beyond the scope of the present invention.

The steps of the method or algorithm described in connection with the embodiments disclosed herein may be implemented in hardware, in a software module executed by a processor, or in a combination of the two. The software module may reside in random access memory (RAM), internal memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.

The embodiments described above further explain the objects, technical solutions and beneficial effects of the present invention in detail. It should be understood that the above are merely embodiments of the present invention and are not intended to limit its scope of protection; any modification, equivalent replacement or improvement made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.

Claims

1. A Chinese pinyin-to-character conversion method based on a discriminative dictionary, characterized in that said Chinese pinyin-to-character conversion method comprises:

2. The Chinese pinyin-to-character conversion method according to claim 1, wherein said Chinese pinyin-to-character conversion method further comprises:

Building said discriminative dictionary by adjusting all possible word boundaries in a sentence according to the mutual information between text and pinyin.

3. The Chinese pinyin-to-character conversion method according to claim 2, wherein building said discriminative dictionary by adjusting all possible word boundaries in a sentence according to the mutual information between text and pinyin specifically comprises:

Segmenting the text corresponding to said training pinyin string according to said pinyin conversion mode with the maximum mutual information, and collecting statistics on the segmented text to obtain a new dictionary.

4. The Chinese pinyin-to-character conversion method according to claim 3, wherein building said discriminative dictionary by adjusting all possible word boundaries in a sentence according to the mutual information between text and pinyin specifically further comprises:

Evaluating the mutual information between said training pinyin string and the text;

If the change in the evaluated mutual information exceeds a predetermined threshold, selecting a new training pinyin string and iteratively training said new dictionary.

5. A method for constructing a discriminative dictionary, characterized in that said construction method comprises:

6. The construction method according to claim 5, wherein said construction method further comprises:

Evaluating the mutual information between said training pinyin string and the text;

7. A Chinese pinyin-to-character conversion system based on a discriminative dictionary, characterized in that said Chinese pinyin-to-character conversion system comprises:

8. The Chinese pinyin-to-character conversion system according to claim 7, wherein said Chinese pinyin-to-character conversion system further comprises:

A dictionary construction unit, which builds said discriminative dictionary by adjusting all possible word boundaries in a sentence according to the mutual information between text and pinyin.

9. The Chinese pinyin-to-character conversion system according to claim 7, wherein said dictionary construction unit specifically comprises:

A second generation unit, which builds a word lattice from a training pinyin string and an initial dictionary, and decodes said word lattice with a statistical language model to obtain different pinyin conversion modes;

A mode determination unit, which determines, from said different pinyin conversion modes, the pinyin conversion mode with the maximum mutual information;

A text segmentation unit, which segments the text corresponding to said training pinyin string according to said pinyin conversion mode with the maximum mutual information, and collects statistics on the segmented text to obtain a new dictionary.

10. The Chinese pinyin-to-character conversion system according to claim 7, wherein said dictionary construction unit specifically further comprises:

An information evaluation unit, which evaluates the mutual information between said training pinyin string and the text;

An iteration judging unit, which, if the change in the evaluated mutual information exceeds a predetermined threshold, selects a new training pinyin string and iteratively trains said new dictionary.