CN107807918A - Method and device for Thai word recognition - Google Patents

Method and device for Thai word recognition
Publication number
CN107807918A
Authority
CN
China
Prior art keywords
character string
Thai
slices
language character
information entropy
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201710982841.0A
Other languages
Chinese (zh)
Inventor
张凯
闫昊
车双武
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
TRANSN (BEIJING) INFORMATION TECHNOLOGY Co Ltd
Original Assignee
TRANSN (BEIJING) INFORMATION TECHNOLOGY Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by TRANSN (BEIJING) INFORMATION TECHNOLOGY Co Ltd
Priority to CN201710982841.0A
Publication of CN107807918A
Legal status: Pending

Abstract

The invention discloses a method and device for Thai word recognition, belonging to the technical field of information retrieval. The method includes: performing filtering and segmentation on a Thai document to be recognized according to a set step size, to obtain a slice set containing at least one Thai slice string; screening the slice set according to the information entropy processing parameter values of each Thai slice string, to form a vocabulary output slice set; and determining a set number of Thai slice strings from the vocabulary output slice set as the recognized Thai words. Thai words can thus be identified from a Thai document through information entropy processing, which improves the efficiency of Thai word recognition and can also increase the speed of browsing and reading Thai documents.

Description

Method and device for Thai word recognition
Technical field
The present invention relates to the technical field of information retrieval, and in particular to a method and device for Thai word recognition.
Background
Thai, also referred to as the Dai language, is the language of the Dai-Thai peoples and is a language of the East Asian / Sino-Tibetan language family. About 68 million people worldwide use Thai. In Thai documents there is no punctuation or spacing between words; a sentence is spelled out continuously from beginning to end, and typically a gap of two letters, or a pause within the sentence, marks a sentence. As a result, it is difficult for Thai learners, translators and other Thai users to identify Thai words in a Thai document with existing word recognition methods based on word frequency, word length, spaces or punctuation marks.
Summary of the invention
The embodiments of the present invention provide a method and device for Thai word recognition. To give a basic understanding of some aspects of the disclosed embodiments, a brief summary is given below. This summary is not an extensive overview, nor is it intended to identify key or critical elements or to delineate the scope of protection of these embodiments. Its sole purpose is to present some concepts in a simplified form as a prelude to the detailed description that follows.
According to a first aspect of the embodiments of the present invention, a method for Thai word recognition is provided, including:
Performing filtering and segmentation on a Thai document to be recognized according to a set step size, to obtain a slice set containing at least one Thai slice string;
Screening the slice set according to the information entropy processing parameter values of each Thai slice string, to form a vocabulary output slice set;
Determining a set number of Thai slice strings from the vocabulary output slice set as the recognized Thai words.
In one embodiment of the invention, when the information entropy processing parameter values include the occurrence frequency, the cohesion value and the information entropy freedom value, screening the slice set according to the information entropy processing parameter values of each Thai slice string to form the vocabulary output slice set includes:
Forming a first to-be-output slice set from the Thai slice strings whose occurrence frequency exceeds a set frequency;
Determining the cohesion value of each Thai slice string in the first to-be-output slice set, and forming a second to-be-output slice set from the Thai slice strings whose cohesion value is greater than a first set value;
Determining the information entropy freedom value of each Thai slice string in the second to-be-output slice set, and forming the vocabulary output slice set from the Thai slice strings whose information entropy freedom value is greater than a second set value.
In one embodiment of the invention, determining the cohesion value of each Thai slice string in the first to-be-output slice set includes:
Determining, according to formula (2), the cohesion value of the current Thai slice string in the first to-be-output slice set;
Where Pi is the occurrence frequency of the current Thai slice string, Pij is the occurrence frequency of the corresponding sub-slice Thai strings within the current Thai slice string, and co is the cohesion value.
In one embodiment of the invention, determining the information entropy freedom value of each Thai slice string in the second to-be-output slice set includes:
Determining, according to formula (3), the left-neighbor information entropy and the right-neighbor information entropy of the current Thai slice string;
Determining, according to formula (4), the smaller of the left-neighbor information entropy and the right-neighbor information entropy as the information entropy freedom value of the current Thai slice string;
Where Pi is the occurrence frequency of each Thai slice string, and H(U) is the information entropy;
$\mathrm{free} = \min\{H(U)_1, H(U)_2, \dots, H(U)_n\}$    formula (4)
Where H(U) is the information entropy, and free is the information entropy freedom value.
In one embodiment of the invention, determining a set number of Thai slice strings from the vocabulary output slice set as the recognized Thai words includes:
Sorting the Thai slice strings in the vocabulary output slice set by occurrence frequency, from highest to lowest;
Determining the set number of Thai slice strings at the front of the ordering as the recognized Thai words.
According to a second aspect of the embodiments of the present invention, a device for Thai word recognition is provided, including:
A filtering and segmentation unit, configured to perform filtering and segmentation on a Thai document to be recognized according to a set step size, to obtain a slice set containing at least one Thai slice string;
An information entropy screening unit, configured to screen the slice set according to the information entropy processing parameter values of each Thai slice string, to form a vocabulary output slice set;
A word determination unit, configured to determine a set number of Thai slice strings from the vocabulary output slice set as Thai words.
In one embodiment of the invention, the information entropy screening unit includes:
A frequency screening module, configured to form a first to-be-output slice set from the Thai slice strings whose occurrence frequency exceeds a set frequency;
A cohesion screening module, configured to determine the cohesion value of each Thai slice string in the first to-be-output slice set, and to form a second to-be-output slice set from the Thai slice strings whose cohesion value is greater than a first set value;
A freedom screening module, configured to determine the information entropy freedom value of each Thai slice string in the second to-be-output slice set, and to form the vocabulary output slice set from the Thai slice strings whose information entropy freedom value is greater than a second set value.
In one embodiment of the invention, the cohesion screening module is specifically configured to determine, according to formula (2), the cohesion value of the current Thai slice string in the first to-be-output slice set;
Where Pi is the occurrence frequency of the current Thai slice string, Pij is the occurrence frequency of the corresponding sub-slice Thai strings within the current Thai slice string, and co is the cohesion value.
In one embodiment of the invention, the freedom screening module is specifically configured to determine, according to formula (3), the left-neighbor information entropy and the right-neighbor information entropy of the current Thai slice string, and to determine, according to formula (4), the smaller of the left-neighbor information entropy and the right-neighbor information entropy as the information entropy freedom value of the current Thai slice string;
Where Pi is the occurrence frequency of each Thai slice string, and H(U) is the information entropy;
$\mathrm{free} = \min\{H(U)_1, H(U)_2, \dots, H(U)_n\}$    formula (4)
Where H(U) is the information entropy, and free is the information entropy freedom value.
In one embodiment of the invention, the word determination unit is specifically configured to sort the Thai slice strings in the vocabulary output slice set by occurrence frequency, from highest to lowest, and to determine the set number of Thai slice strings at the front of the ordering as the recognized Thai words.
The technical solutions provided by the embodiments of the present invention may have the following beneficial effects:
In the embodiments of the present invention, Thai words can be identified from a Thai document through information entropy processing, which improves the efficiency of Thai word recognition and can also increase the speed of browsing and reading Thai documents.
It should be understood that the above general description and the following detailed description are only exemplary and explanatory, and do not limit the present invention.
Brief description of the drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, show embodiments consistent with the present invention and, together with the specification, serve to explain the principles of the invention.
Fig. 1 is a flowchart of a Thai word recognition method according to an exemplary embodiment;
Fig. 2 is a flowchart of a Thai word recognition method according to an exemplary embodiment;
Fig. 3 is a block diagram of a Thai word recognition device according to an exemplary embodiment;
Fig. 4 is a block diagram of a Thai word recognition device according to an exemplary embodiment.
Detailed description
The following description and drawings illustrate specific embodiments of the invention sufficiently to enable those skilled in the art to practice them. The embodiments represent only possible variations. Unless explicitly required, individual components and functions are optional, and the order of operations may vary. Portions and features of some embodiments may be included in, or substituted for, those of other embodiments. The scope of the embodiments of the invention covers the full scope of the claims and all available equivalents of the claims. Herein, the embodiments may be referred to individually or collectively by the term "invention" merely for convenience; if more than one invention is in fact disclosed, this is not intended to automatically limit the scope of the application to any single invention or inventive concept. Herein, relational terms such as first and second are used only to distinguish one entity or operation from another, and do not require or imply any actual relationship or order between those entities or operations. Moreover, the terms "comprise", "include" and any variants thereof are intended to cover a non-exclusive inclusion, so that a process, method or device that includes a list of elements includes not only those elements but also other elements not expressly listed. The embodiments herein are described progressively: each embodiment focuses on its differences from the others, and the identical or similar parts of the embodiments may be cross-referenced. Because the structures and products disclosed in the embodiments correspond to the parts disclosed in the method embodiments, they are described more briefly; for the relevant details, refer to the description of the method.
In Thai documents there is no punctuation or spacing between words, and a sentence is spelled out continuously from beginning to end, so Thai words are difficult to identify in a Thai document. In the embodiments of the present invention, Thai words can be identified from a Thai document through information entropy processing, which improves the efficiency of Thai word recognition and can also increase the speed of browsing and reading Thai documents.
Fig. 1 is a flowchart of a Thai word recognition method according to an exemplary embodiment. As shown in Fig. 1, the Thai word recognition process includes:
Step 101: Perform filtering and segmentation on the Thai document to be recognized according to a set step size, to obtain a slice set containing at least one Thai slice string.
When a user obtains information from a Thai document, the words in that document need to be identified; the Thai document from which information is extracted is the Thai document to be recognized. Most characters in the Thai document to be recognized are Thai characters, but it may also contain digits, website addresses, e-mail addresses, English characters and the like. This information needs to be filtered out, so the Thai document to be recognized is filtered to form a first Thai document containing only Thai characters.
In Thai documents there is no punctuation or spacing between words, and a sentence is spelled out continuously from beginning to end. A Thai document can therefore be divided into several segments which, split further, yield several short sentences, each consisting of consecutive Thai characters. At least one Thai short sentence in the first Thai document can thus be segmented according to the set step size, forming a slice set containing at least one Thai slice string.
For example: after the Thai document to be recognized is filtered, a first Thai document D1 is formed, containing Thai short sentences Si, i = 1, 2, …, n. A Thai short sentence Si can be segmented according to the set step size to form one, two or more Thai slice strings. Cutting Si with step = 1, step = 2 or step = 3 gives a corresponding slice set in each case. Each Thai short sentence Si can be sliced in turn, forming the corresponding slice set M, which includes one, two or more Thai slice strings.
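The slicing in Step 101 can be sketched as follows. This is a minimal sketch rather than the patent's implementation: it assumes that the step size is the length of each slice and that the window advances one character at a time, neither of which the text states explicitly. The function name `slice_sentence` is hypothetical.

```python
def slice_sentence(sentence: str, step: int) -> list[str]:
    """Return every substring of length `step`, sliding one character at a time.

    Assumption: `step` is the slice length and consecutive slices overlap.
    """
    if step < 1:
        raise ValueError("step must be at least 1")
    return [sentence[i:i + step] for i in range(len(sentence) - step + 1)]


# A Thai string of 7 code points yields 7, 6 and 5 slices for step = 1, 2, 3.
sentence = "ภาษาไทย"
slice_sets = {step: slice_sentence(sentence, step) for step in (1, 2, 3)}
```

Python indexes strings by code point, so a slice boundary can fall between a Thai consonant and a combining vowel or tone mark that follows it; whether that matters depends on how the slices are used downstream.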
Step 102: Screen the slice set according to the information entropy processing parameter values of each Thai slice string, to form a vocabulary output slice set.
In the embodiments of the present invention, information entropy processing is performed on each Thai slice string in the slice set, and the slice set is then screened according to the resulting information entropy processing parameter values to form the vocabulary output slice set. The information entropy processing parameter values include at least one of the occurrence frequency, the cohesion value and the information entropy freedom value. The occurrence frequency indicates how often a Thai slice string occurs: the higher the value, the more frequent the Thai slice string. A Thai slice string may contain a single word, or it may be a phrase formed from two or more words; the cohesion value therefore indicates the probability that the Thai slice string is a single word, and the larger the cohesion value, the higher that probability. Information entropy describes the uncertainty of an information source. In general, which symbol a source will emit is uncertain, and that uncertainty can be measured by the symbol's probability of occurrence: a high probability means the symbol occurs often and the uncertainty is small; conversely, the uncertainty is large. Suppose the source symbol takes n values U1, …, Ui, …, Un with corresponding probabilities P1, …, Pi, …, Pn, and the symbols occur independently of one another. The average uncertainty of the source is then the statistical average (E) of the single-symbol uncertainty -log Pi, which is called the information entropy. Here, when a Thai slice string has corresponding left-neighbor and right-neighbor information, the information entropy freedom value can be used to indicate the uncertainty of the source corresponding to the Thai slice string.
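Written out in symbols, the definition just given is the standard Shannon entropy. The expression below only restates the paragraph above; the base of the logarithm is not specified in the text and is left open here.

```latex
H(U) = E\left[-\log P_i\right] = -\sum_{i=1}^{n} P_i \log P_i
```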
In the embodiments of the present invention, the slice set can be screened with one, two or more information entropy processing parameter values to form the vocabulary output slice set. For example, the vocabulary output slice set may be formed from the Thai slice strings whose occurrence frequency exceeds the set frequency, or from the Thai slice strings whose information entropy freedom value is greater than the second set value, and so on. To further improve the precision of Thai word recognition, the slice set can be screened according to the occurrence frequency, the cohesion value and the information entropy freedom value together, to form the vocabulary output slice set.
Specifically, this may include: the slice set M contains one, two or more Thai slice strings; the occurrence frequency of each Thai slice string is counted, and the first to-be-output slice set is then formed from the Thai slice strings whose occurrence frequency exceeds the set frequency.
The occurrence frequency of each Thai slice string can be determined according to formula (1).
$P_i = W_i / \sum_{M} W_i$    formula (1)
Where Wi is the occurrence count of each Thai slice string, Pi is the occurrence frequency of each Thai slice string, and M is the slice set.
Wi is the number of times the Thai slice string occurs during segmentation. With the set frequency denoted A, the occurrence frequency Pi of each Thai slice string is compared with A; if the occurrence frequency Pi of the current Thai slice string is greater than A, the current Thai slice string is put into the first to-be-output slice set. In this way, after the first round of screening by occurrence frequency, the first to-be-output slice set is formed.
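Formula (1) and the comparison against the set frequency A can be sketched as follows. The function names and the sample slices are hypothetical and serve only to illustrate the calculation.

```python
from collections import Counter


def occurrence_frequencies(slices: list[str]) -> dict[str, float]:
    """Formula (1): P_i = W_i / (sum of W over the slice set M)."""
    counts = Counter(slices)       # W_i: how often each slice string occurs
    total = sum(counts.values())   # sum of all W_i in M
    return {s: w / total for s, w in counts.items()}


def screen_by_frequency(freqs: dict[str, float], set_frequency: float) -> dict[str, float]:
    """Keep the slice strings whose occurrence frequency exceeds the set frequency A."""
    return {s: p for s, p in freqs.items() if p > set_frequency}


# Hypothetical slice set in which "ab" occurs twice among four slices.
freqs = occurrence_frequencies(["ab", "bc", "ab", "cd"])
first_set = screen_by_frequency(freqs, set_frequency=0.3)
```

Here `freqs` maps "ab" to 0.5 and the other two slices to 0.25, so only "ab" passes a set frequency of 0.3.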
A Thai slice string with a high occurrence frequency may be a single word, or it may be a phrase formed from two or more words. The first to-be-output slice set therefore needs further screening. In the embodiments of the present invention, the cohesion value of each Thai slice string in the first to-be-output slice set is determined, and the second to-be-output slice set is formed from the Thai slice strings whose cohesion value is greater than the first set value.
The cohesion value of the current Thai slice string in the first to-be-output slice set can be determined according to formula (2);
Where Pi is the occurrence frequency of the current Thai slice string, Pij is the occurrence frequency of the corresponding sub-slice Thai strings within the current Thai slice string, and co is the cohesion value.
In this embodiment, the occurrence frequency indicates how often a Thai slice string occurs, i.e. Pi can be expressed as the probability of the current Thai slice string. For example, suppose the current Thai slice string has two corresponding sub-slice Thai strings, the probability of the current Thai slice string is P = 0.0005, the probability of the first sub-slice Thai string is P11 = 0.0002, the probability of the second sub-slice Thai string is P12 = 0.0003, and so on; the cohesion value co of the current Thai slice string can then be determined according to formula (2).
Then the cohesion value of each Thai slice string is compared with the first set value; if the cohesion value of the current Thai slice string is greater than the first set value, the current Thai slice string is put into the second to-be-output slice set. That is, the second to-be-output slice set is formed from the Thai slice strings whose cohesion value is greater than the first set value.
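Formula (2) itself is not reproduced in this text, so the sketch below does not claim to be it. It assumes one commonly used cohesion measure: the probability of the whole string divided by the largest product of the probabilities of its two parts, taken over every way of splitting the string in two. The function name, the frequency table and the placeholder strings are all hypothetical; only the three probabilities come from the example above.

```python
import math


def cohesion(s: str, freq: dict[str, float]) -> float:
    """ASSUMED form, not the patent's formula (2):
    co = P(s) / max over split points k of P(s[:k]) * P(s[k:]).
    """
    if len(s) < 2:
        return math.inf  # a single character cannot be split into sub-slices
    products = [freq.get(s[:k], 0.0) * freq.get(s[k:], 0.0) for k in range(1, len(s))]
    largest = max(products)
    return math.inf if largest == 0.0 else freq[s] / largest


# Probabilities from the example above, attached to placeholder strings.
freq = {"ab": 0.0005, "a": 0.0002, "b": 0.0003}
co = cohesion("ab", freq)  # 0.0005 / (0.0002 * 0.0003)
```

Under this assumed form, a string that occurs far more often than its parts would predict if they were independent gets a large cohesion value. The sketch also assumes that the frequency table covers the sub-slices as well as the slice itself.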
The second to-be-output slice set is then screened further according to the information entropy freedom value of each Thai slice string in it. In the embodiments of the present invention, the information entropy freedom value of each Thai slice string in the second to-be-output slice set is determined, and the vocabulary output slice set is formed from the Thai slice strings whose information entropy freedom value is greater than the second set value.
The left-neighbor information entropy and the right-neighbor information entropy of the current Thai slice string can be determined according to formula (3); then, according to formula (4), the smaller of the two is determined as the information entropy freedom value of the current Thai slice string.
Where Pi is the occurrence frequency of each Thai slice string, and H(U) is the information entropy;
$\mathrm{free} = \min\{H(U)_1, H(U)_2, \dots, H(U)_n\}$    formula (4)
Where H(U) is the information entropy, and free is the information entropy freedom value.
Formula (3) yields multiple left-neighbor and right-neighbor information entropies H(U), i.e. H(U)1, H(U)2, H(U)3, …, H(U)n; formula (4) then takes the minimum of these information entropies, giving the information entropy freedom value. Once the information entropy freedom value of each Thai slice string in the second to-be-output slice set has been determined, it is compared with the second set value; if the information entropy freedom value of the current Thai slice string is greater than the second set value, the current Thai slice string is added to the vocabulary output slice set. That is, the vocabulary output slice set is formed from the Thai slice strings whose information entropy freedom value is greater than the second set value.
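The freedom calculation can be sketched as follows, assuming that formula (3) is the Shannon entropy of the distribution of characters adjacent to the slice string, and assuming natural logarithms (the text does not state the base). The function names and the placeholder text are hypothetical.

```python
import math
from collections import Counter


def neighbor_entropy(neighbors: list[str]) -> float:
    """Shannon entropy (natural log) of the observed neighbor characters."""
    if not neighbors:
        return 0.0
    total = len(neighbors)
    return -sum((c / total) * math.log(c / total) for c in Counter(neighbors).values())


def freedom(text: str, s: str) -> float:
    """Formula (4): the smaller of the left-neighbor and right-neighbor entropies of s."""
    left, right = [], []
    start = text.find(s)
    while start != -1:
        if start > 0:
            left.append(text[start - 1])
        end = start + len(s)
        if end < len(text):
            right.append(text[end])
        start = text.find(s, start + 1)
    return min(neighbor_entropy(left), neighbor_entropy(right))


# Placeholder text: "AB" occurs four times, with left neighbors p, q, p, q
# and right neighbors r, s, t, s.
value = freedom("pABrqABspABtqABs", "AB")
```

In the placeholder text the left neighbors are split evenly between two characters while the right neighbors take three values, so the left-neighbor entropy is the smaller one and becomes the freedom value.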
The above screens the slice set in the order occurrence frequency, cohesion value, information entropy freedom value to form the vocabulary output slice set. The embodiments of the present invention are of course not limited to this: the slice set can also be screened in the order cohesion value, occurrence frequency, information entropy freedom value, or in the order occurrence frequency, information entropy freedom value, cohesion value, and so on. These variants are not enumerated one by one.
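The interchangeable ordering of the screening stages can be sketched as a generic cascade. This is an illustrative sketch only: the function `screen`, the score tables and the thresholds are hypothetical, and each score is assumed to be precomputed and independent of which other slices survive.

```python
from typing import Callable, Iterable

Stage = tuple[Callable[[str], float], float]  # (scoring function, threshold)


def screen(candidates: Iterable[str], stages: list[Stage]) -> list[str]:
    """Apply each stage in turn, keeping candidates that score above its threshold."""
    kept = list(candidates)
    for score, threshold in stages:
        kept = [c for c in kept if score(c) > threshold]
    return kept


# Hypothetical precomputed scores for three candidate slices.
frequency = {"a": 0.30, "b": 0.05, "c": 0.20}
cohesion = {"a": 120.0, "b": 300.0, "c": 8.0}
freedom = {"a": 1.2, "b": 0.9, "c": 1.5}

stages = [
    (frequency.__getitem__, 0.1),   # set frequency
    (cohesion.__getitem__, 50.0),   # first set value
    (freedom.__getitem__, 1.0),     # second set value
]
result = screen(frequency, stages)
reordered = screen(frequency, [stages[1], stages[0], stages[2]])
```

Under the stated assumption, reordering the stages changes the intermediate sets and the amount of work per stage, but not the final set, because a slice survives only if it clears every threshold.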
Step 103: Determine a set number of Thai slice strings from the vocabulary output slice set as the recognized Thai words.
Here, the set number of Thai slice strings can be selected at random from the vocabulary output slice set and determined as the recognized Thai words. Alternatively, the set number of Thai slice strings can be selected from the vocabulary output slice set according to the information entropy processing parameter values, which include the occurrence frequency, the cohesion value or the information entropy freedom value, and determined as the recognized Thai words.
For example, the Thai slice strings in the vocabulary output slice set can be sorted by occurrence frequency from highest to lowest, and the set number of Thai slice strings at the front of the ordering determined as the recognized Thai words.
It can be seen that the Thai slice strings in a Thai document can be screened by the information entropy processing parameter values, and Thai words finally identified from the Thai document, which improves the efficiency of Thai word recognition and can also increase the speed of browsing and reading Thai documents.
The method provided by the embodiments of the present disclosure is illustrated below by combining the operating process into a specific embodiment.
In this embodiment, the information entropy processing parameter values include the occurrence frequency, the cohesion value and the information entropy freedom value. The set frequency, the first set value and the second set value can therefore be configured in advance.
Fig. 2 is a flowchart of a Thai word recognition method according to an exemplary embodiment. As shown in Fig. 2, the Thai word recognition process includes:
Step 201: Filter the Thai document to be recognized to form a first Thai document containing only Thai characters.
Non-Thai characters in the Thai text, including full-width and half-width characters such as English letters and mathematical symbols, are filtered out, and only characters in the Thai range [0x0E00, 0x0E7F] are retained. This yields a complete pure-Thai document, i.e. the first Thai document containing only Thai characters.
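The range filter in Step 201 can be sketched with a regular expression over the code points 0x0E00 to 0x0E7F. The function names are hypothetical, and treating each maximal run of Thai characters as one short sentence is an assumption made here for illustration, not something the text specifies.

```python
import re

# Matches one or more consecutive characters in the Thai range [0x0E00, 0x0E7F].
THAI_RUN = re.compile(r"[\u0E00-\u0E7F]+")


def thai_runs(text: str) -> list[str]:
    """Return the maximal runs of Thai characters (assumed here to act as short sentences)."""
    return THAI_RUN.findall(text)


def keep_thai(text: str) -> str:
    """Drop every character outside the Thai range, leaving a pure-Thai document."""
    return "".join(thai_runs(text))


runs = thai_runs("abc ภาษาไทย 123 ＡＢ ไทย")
pure = keep_thai("x@y.com ไทย 2017")
```

Digits, Latin letters, full-width letters, spaces and punctuation all fall outside the range, so they are removed without needing separate rules.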
Step 202: Segment at least one Thai short sentence in the first Thai document according to the set step size, to form a slice set containing at least one slice string.
For example: a Thai short sentence of length N = 8 in the first Thai document is segmented with step = 2 to form the slice set.
Step 203: Determine, according to formula (1), the occurrence frequency of each Thai slice string in the slice set.
Step 204: Form the first to-be-output slice set from the Thai slice strings whose occurrence frequency exceeds the set frequency.
Step 205: Determine, according to formula (2), the cohesion value of each Thai slice string in the first to-be-output slice set.
Step 206: Form the second to-be-output slice set from the Thai slice strings whose cohesion value is greater than the first set value.
Step 207: Determine, according to formula (3) and formula (4), the information entropy freedom value of each Thai slice string in the second to-be-output slice set.
For example: a Thai slice string occurs four times in a Thai short sentence. Its left adjacent characters take two distinct values, each with probability 1/2, and its right adjacent characters take three distinct values with probabilities 1/2, 1/4 and 1/4. According to formula (3), with natural logarithms, the left-neighbor information entropy of this Thai slice string is -(1/2)log(1/2) - (1/2)log(1/2) ≈ 0.693, and its right-neighbor information entropy is -(1/2)log(1/2) - (1/4)log(1/4) - (1/4)log(1/4) ≈ 1.04. The corresponding information entropy freedom value is therefore the smaller of the two, 0.693.
Step 208: Form the vocabulary output slice set from the Thai slice strings whose information entropy freedom value is greater than the second set value.
Step 209: Sort the Thai slice strings in the vocabulary output slice set by occurrence frequency, from highest to lowest.
For example: the vocabulary output slice set includes 50 Thai slice strings whose occurrence counts, from highest to lowest, are 25, 23, 19, 15, 10, 7, 5, 4, 4, 4, 3, 3, 2, 2, …. The corresponding Thai slice strings are sorted in that order.
Step 210: Determine the set number of Thai slice strings at the front of the ordering as the recognized Thai words.
If the set number is 5, the Thai slice strings with occurrence counts 25, 23, 19, 15 and 10 are determined as the recognized Thai words.
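Steps 209 and 210 amount to a sort followed by taking the first few entries. The sketch below uses the first seven occurrence counts from the example; the slice labels `s1` to `s7` and the function name are hypothetical placeholders.

```python
def top_words(counts: dict[str, int], n: int) -> list[str]:
    """Sort slice strings by occurrence count, highest first, and keep the first n."""
    ranked = sorted(counts, key=counts.__getitem__, reverse=True)
    return ranked[:n]


# Counts from the example above, attached to placeholder labels in arbitrary order.
counts = {"s6": 7, "s1": 25, "s4": 15, "s2": 23, "s7": 5, "s3": 19, "s5": 10}
words = top_words(counts, 5)
```

Python's sort is stable, so slice strings with equal counts keep their original relative order; the text does not say how such ties should be broken.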
It can be seen that, in this embodiment, the slice Thai character strings in a Thai document are screened by occurrence frequency, solidification degree value and information entropy degree-of-freedom value, so that Thai words are identified from the Thai document more accurately. This improves the efficiency and accuracy of Thai word recognition and can also speed up browsing and reading of Thai documents.
The following are device embodiments of the present disclosure, which can be used to perform the method embodiments of the present disclosure.
Based on the Thai word recognition process described above, a Thai word recognition device can be constructed.
Fig. 3 is a block diagram of a Thai word recognition device according to an exemplary embodiment. As shown in Fig. 3, the device includes a filtering and segmentation unit 310, an information entropy screening unit 320 and a word determination unit 330, wherein:
The filtering and segmentation unit 310 is configured to filter and segment the Thai document to be identified according to a set step length, obtaining a slice set that includes at least one slice Thai character string.
The information entropy screening unit 320 is configured to screen the slice set according to the information entropy processing parameter values of each slice Thai character string, forming a word output slice set.
The word determination unit 330 is configured to determine a set number of slice Thai character strings from the word output slice set as the identified Thai words.
In one embodiment of the invention, the information entropy screening unit 320 includes:
A frequency screening module, configured to form a first to-be-output slice set from the slice Thai character strings whose occurrence frequency exceeds a set frequency.
A solidification degree screening module, configured to determine the solidification degree value of each slice Thai character string in the first to-be-output slice set, and to form a second to-be-output slice set from the slice Thai character strings whose solidification degree value is greater than a first set value.
A degree-of-freedom screening module, configured to determine the information entropy degree-of-freedom value of each slice Thai character string in the second to-be-output slice set, and to form the word output slice set from the slice Thai character strings whose information entropy degree-of-freedom value is greater than a second set value.
In one embodiment of the invention, the solidification degree screening module is specifically configured to determine, according to formula (2), the solidification degree value of the current slice Thai character string in the first to-be-output slice set.
Wherein Pi is the occurrence frequency of the current slice Thai character string, Pij is the occurrence frequency of the corresponding sub-slice Thai character string within the current slice Thai character string, and co is the solidification degree value.
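Formula (2) itself is not reproduced in this text. As a rough illustration only, the sketch below assumes one common formulation from new-word discovery: the frequency of the slice divided by the product of the frequencies of its two parts, minimized over all split points. The function name and the frequency table are hypothetical, and the patent's formula (2) may differ.

```python
def solidification_degree(slice_str, freq):
    """Assumed form: min over split points of P(slice) / (P(left part) * P(right part))."""
    if len(slice_str) < 2:
        raise ValueError("a slice needs at least two characters to be split")
    p_slice = freq[slice_str]
    ratios = []
    for k in range(1, len(slice_str)):
        left_part, right_part = slice_str[:k], slice_str[k:]
        ratios.append(p_slice / (freq[left_part] * freq[right_part]))
    return min(ratios)


# Hypothetical relative frequencies for a three-character slice and its sub-slices.
freq = {"abc": 0.01, "a": 0.05, "bc": 0.02, "ab": 0.02, "c": 0.1}
co = solidification_degree("abc", freq)
print(co)
```

Here the split a|bc gives a ratio of about 10 and the split ab|c gives about 5, so the smaller value is taken. A slice would then be kept when this value exceeds the first set value.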
In one embodiment of the invention, the degree-of-freedom screening module is specifically configured to determine, according to formula (3), the left-adjacent character information entropy and the right-adjacent character information entropy of the current slice Thai character string, and, according to formula (4), to determine the smaller of the left-adjacent character information entropy and the right-adjacent character information entropy as the information entropy degree-of-freedom value of the current slice Thai character string.
$$H(U) = -\sum_{i} P_i \log P_i \qquad \text{formula (3)}$$
Wherein $P_i$ is the occurrence frequency of each slice Thai character string, and $H(U)$ is the information entropy;
$$\mathit{free} = \min\{H(U)_1, H(U)_2, \ldots, H(U)_n\} \qquad \text{formula (4)}$$
Wherein $H(U)$ is the information entropy, and $\mathit{free}$ is the information entropy degree-of-freedom value.
In one embodiment of the invention, the word determination unit 330 is specifically configured to sort the slice Thai character strings in the word output slice set in descending order of occurrence frequency, and to determine the set number of slice Thai character strings at the front of the ordering as the identified Thai words.
The device provided by the embodiments of the present disclosure is described below by way of example.
Fig. 4 is a block diagram of a Thai word recognition device according to an exemplary embodiment. As shown in Fig. 4, the device includes a filtering and segmentation unit 310, an information entropy screening unit 320 and a word determination unit 330, wherein the information entropy screening unit 320 includes a frequency screening module 321, a solidification degree screening module 322 and a degree-of-freedom screening module 323.
The filtering and segmentation unit 310 can filter the Thai document to be identified to form a first Thai document containing only Thai characters, and then, according to the set step length, segment at least one Thai short sentence in the first Thai document to form a slice set including at least one slice character string.
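The filtering and segmentation just described can be sketched as follows. This is a hedged illustration under several assumptions that the text does not confirm: Thai characters are taken to be the Unicode Thai block (U+0E00 to U+0E7F), any run of non-Thai characters is treated as a boundary between Thai short sentences, and the set step length is read as the maximum slice length. The function names are hypothetical.

```python
import re

# Runs of characters from the Unicode Thai block; everything else acts as a separator.
THAI_RUN = re.compile(r"[\u0E00-\u0E7F]+")


def split_short_sentences(document):
    """Drop non-Thai characters and return the remaining Thai runs as short sentences."""
    return THAI_RUN.findall(document)


def make_slices(short_sentence, step_length):
    """Every substring of 1..step_length characters, in order of starting position."""
    slices = []
    for start in range(len(short_sentence)):
        for length in range(1, step_length + 1):
            if start + length <= len(short_sentence):
                slices.append(short_sentence[start:start + length])
    return slices


sentences = split_short_sentences("กขค abc งจ")
slice_set = [s for sentence in sentences for s in make_slices(sentence, 2)]
print(sentences)
print(slice_set)
```

Slicing here operates on Unicode code points, so Thai vowel and tone marks are treated as separate characters; a real implementation might prefer to group them with their base consonant.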
The frequency screening module 321 in the information entropy screening unit 320 can then determine, according to formula (1), the occurrence frequency of each slice Thai character string in the slice set, and form the first to-be-output slice set from the slice Thai character strings whose occurrence frequency exceeds the set frequency. The solidification degree screening module 322 in the information entropy screening unit 320 can determine, according to formula (2), the solidification degree value of each slice Thai character string in the first to-be-output slice set, and form the second to-be-output slice set from the slice Thai character strings whose solidification degree value is greater than the first set value. The degree-of-freedom screening module 323 in the information entropy screening unit 320 can determine, according to formula (3) and formula (4), the information entropy degree-of-freedom value of each slice Thai character string in the second to-be-output slice set, and form the word output slice set from the slice Thai character strings whose information entropy degree-of-freedom value is greater than the second set value.
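The three modules form a cascade in which each stage only examines the survivors of the previous one. A minimal sketch of that control flow follows, assuming the three scores have already been computed and stored in dictionaries; the function name, the score tables and the threshold values are hypothetical.

```python
def screen_slices(candidates, freq, solid, free, set_freq, first_value, second_value):
    """Apply the frequency, solidification degree and degree-of-freedom filters in turn."""
    # Stage 1: keep slices whose occurrence frequency exceeds the set frequency.
    first_set = [s for s in candidates if freq[s] > set_freq]
    # Stage 2: keep slices whose solidification degree exceeds the first set value.
    second_set = [s for s in first_set if solid[s] > first_value]
    # Stage 3: keep slices whose degree-of-freedom value exceeds the second set value.
    return [s for s in second_set if free[s] > second_value]


# Hypothetical scores for four candidate slices.
candidates = ["s1", "s2", "s3", "s4"]
freq = {"s1": 25, "s2": 2, "s3": 19, "s4": 15}
solid = {"s1": 12.0, "s2": 40.0, "s3": 1.5, "s4": 9.0}
free = {"s1": 0.9, "s2": 1.2, "s3": 1.1, "s4": 0.1}

word_output = screen_slices(candidates, freq, solid, free,
                            set_freq=3, first_value=5.0, second_value=0.5)
print(word_output)
```

In this example `s2` fails the frequency filter, `s3` fails the solidification degree filter and `s4` fails the degree-of-freedom filter, so only `s1` reaches the word output slice set.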
Accordingly, the word determination unit 330 can sort the slice Thai character strings in the word output slice set in descending order of occurrence frequency, and determine the set number of slice Thai character strings at the front of the ordering as the identified Thai words.
It can be seen that, in this embodiment, the slice Thai character strings in a Thai document are screened by occurrence frequency, solidification degree value and information entropy degree-of-freedom value, so that Thai words are identified from the Thai document more accurately. This improves the efficiency and accuracy of Thai word recognition and can also speed up browsing and reading of Thai documents.
Those skilled in the art should understand that embodiments of the present invention may be provided as a method, a system or a computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Moreover, the present invention may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to magnetic disk storage and optical storage) that contain computer-usable program code.
The present invention is described with reference to flowcharts and/or block diagrams of the method, the device (system) and the computer program product according to embodiments of the present invention. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to the processor of a general-purpose computer, a special-purpose computer, an embedded processor or another programmable data processing device to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing device produce means for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or another programmable data processing device to operate in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means, the instruction means implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be loaded onto a computer or another programmable data processing device, so that a series of operational steps is performed on the computer or other programmable device to produce computer-implemented processing, whereby the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
It should be understood that the present invention is not limited to the flows and structures described above and shown in the drawings, and that various modifications and changes may be made without departing from its scope. The scope of the present invention is limited only by the appended claims.

Claims (10)

CN201710982841.0A | 2017-10-20 | 2017-10-20 | The method and device of Thai words recognition | Pending | CN107807918A (en)

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN201710982841.0A (CN107807918A (en)) | 2017-10-20 | 2017-10-20 | The method and device of Thai words recognition

Applications Claiming Priority (1)

Application Number | Priority Date | Filing Date | Title
CN201710982841.0A (CN107807918A (en)) | 2017-10-20 | 2017-10-20 | The method and device of Thai words recognition

Publications (1)

Publication Number | Publication Date
CN107807918A (en) | 2018-03-16

Family

ID=61592904

Family Applications (1)

Application Number | Title | Priority Date | Filing Date
CN201710982841.0A (Pending, CN107807918A (en)) | The method and device of Thai words recognition | 2017-10-20 | 2017-10-20

Country Status (1)

Country | Link
CN (1) | CN107807918A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN111209946A (en) * | 2019-12-31 | 2020-05-29 | 上海联影智能医疗科技有限公司 | Three-dimensional image processing method, image processing model training method, and medium
WO2021051600A1 (en) * | 2019-09-19 | 2021-03-25 | 平安科技(深圳)有限公司 | Method, apparatus and device for identifying new word based on information entropy, and storage medium
CN114860911A (en) * | 2022-05-13 | 2022-08-05 | 阳光保险集团股份有限公司 | Method and device for acquiring vocabulary to be matched and electronic equipment

Citations (4)

* Cited by examiner, † Cited by third party
Publication numberPriority datePublication dateAssigneeTitle
Publication number | Priority date | Publication date | Assignee | Title
US20110137642A1 (en) * | 2007-08-23 | 2011-06-09 | Google Inc. | Word Detection
CN105320960A (en) * | 2015-10-14 | 2016-02-10 | 北京航空航天大学 | Voting based classification method for cross-language subjective and objective sentiments
CN106815190A (en) * | 2015-11-27 | 2017-06-09 | 阿里巴巴集团控股有限公司 | A kind of words recognition method, device and server
CN107180025A (en) * | 2017-03-31 | 2017-09-19 | 北京奇艺世纪科技有限公司 | A kind of recognition methods of neologisms and device

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
WO2021051600A1 (en) * | 2019-09-19 | 2021-03-25 | 平安科技(深圳)有限公司 | Method, apparatus and device for identifying new word based on information entropy, and storage medium
CN111209946A (en) * | 2019-12-31 | 2020-05-29 | 上海联影智能医疗科技有限公司 | Three-dimensional image processing method, image processing model training method, and medium
CN111209946B (en) * | 2019-12-31 | 2024-04-30 | 上海联影智能医疗科技有限公司 | Three-dimensional image processing method, image processing model training method and medium
CN114860911A (en) * | 2022-05-13 | 2022-08-05 | 阳光保险集团股份有限公司 | Method and device for acquiring vocabulary to be matched and electronic equipment

Similar Documents

Publication | Title
CN104881458B (en) | A kind of mask method and device of Web page subject
CN110941959B (en) | Text violation detection, text restoration method, data processing method and equipment
CN108845982B (en) | A Chinese word segmentation method based on word association features
CN106708798B (en) | Character string segmentation method and device
CN112445912A (en) | Fault log classification method, system, device and medium
CN106445915B (en) | New word discovery method and device
CN107688630B (en) | Semantic-based weakly supervised microbo multi-emotion dictionary expansion method
CN113934910A (en) | Automatic optimization and updating theme library construction method and hot event real-time updating method
CN108304377A (en) | A kind of extracting method and relevant apparatus of long-tail word
CN103955450A (en) | Automatic extraction method of new words
WO2019100458A1 (en) | Method and device for segmenting thai syllables
CN107807918A (en) | The method and device of Thai words recognition
CN104978354A (en) | Text classification method and text classification device
CN105608075A (en) | Related knowledge point acquisition method and system
CN106446051A (en) | Deep search method of Eagle media assets
CN104077274B (en) | Method and device for extracting hot word phrases from document set
CN109213974B (en) | Electronic document conversion method and device
CN107665188A (en) | A kind of semantic understanding method and device
CN106126495B (en) | One kind being based on large-scale corpus prompter method and apparatus
CN106598997A (en) | Method and device for computing membership degree of text subject
CN117972025B (en) | Massive text retrieval matching method based on semantic analysis
CN106933818A (en) | A kind of quick multiple key text matching technique and device
CN110674286A (en) | Text abstract extraction method and device and storage equipment
CN116303987A (en) | Recommendation method, device, computer equipment and storage medium for bidding document
CN106933797B (en) | Target information generation method and device

Legal Events

Code | Title
PB01 | Publication
SE01 | Entry into force of request for substantive examination
RJ01 | Rejection of invention patent application after publication

Application publication date: 2018-03-16


©2009-2025 Movatter.jp