Detailed Description of Embodiments
To make the objects, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments are described clearly and completely below with reference to the accompanying drawings. The described embodiments are evidently only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art on the basis of these embodiments without creative effort fall within the protection scope of the present invention.
Figure 1 is a flow chart of a training method for a text similarity model provided by an embodiment of the present invention. The method includes the following steps:
S11: receive a dictionary training set, perform word segmentation on each default sentence in the dictionary training set, and determine the text-string of each default sentence;
S12: according to the text-string of each default sentence, determine the term vector corresponding to the text-string and the text phonetic corresponding to the text-string;
S13: according to the text-string, text phonetic, and term vector corresponding to each default sentence, determine the feature vector corresponding to each default sentence, and train the text similarity model.
In this embodiment, similarity is no longer judged only by counting the words that two text-strings directly share. New parameters are introduced so that several aspects are considered together, and the text similarity model therefore needs further training.
For step S11, a dictionary training set is received. The dictionary training set contains a large number of words and phrases that users may use in daily life, for example "The First Affiliated Hospital of Peking University", "The Second Affiliated Hospital of Peking University", "The Third Affiliated Hospital of Peking University", "The Fourth Affiliated Hospital of Peking University", "KFC", "McDonald", "thousand hand-pulled noodles of taste", "pepper work mill", "Friendship Bridge", "Shahe Bridge", "Yongdinghe River Bridge", "Zhenyang Bridge", "Yangtze Bridge", "Caobai River Bridge", and so on. After the dictionary training set is received, word segmentation is performed on each default sentence in it to determine the text-string of that sentence. For example, the text-string of "Yangtze Bridge" is s1 = Yangtze_Bridge, with the segmented words joined by an underscore. The word segmentation system may split off a single character or a whole word.
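The segmentation in step S11 can be illustrated with a minimal sketch. The embodiment does not name a segmentation algorithm, so forward maximum matching is used here purely as an assumed stand-in. The vocabulary `VOCAB`, the function `segment`, and the Chinese form 长江大桥 of the "Yangtze Bridge" example (chosen to agree with the pinyin chang jiang | da qiao given in the text) are all hypothetical.

```python
# Hypothetical vocabulary; a real word segmentation system uses a large lexicon.
VOCAB = {"长江", "大桥"}

def segment(sentence, vocab, max_len=4):
    """Forward maximum matching: at each position take the longest known word,
    falling back to a single character when no vocabulary word matches."""
    words = []
    i = 0
    while i < len(sentence):
        for size in range(min(max_len, len(sentence) - i), 0, -1):
            piece = sentence[i:i + size]
            if size == 1 or piece in vocab:
                words.append(piece)
                i += size
                break
    return words

# Join the segmented words with an underscore to form the text-string.
s1 = "_".join(segment("长江大桥", VOCAB))
```

With this vocabulary the sentence is split into two whole words. A character that belongs to no known word is emitted on its own, which corresponds to the remark that the segmentation system may split off either a single character or a word.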
For step S12, the term vector and text phonetic corresponding to the text-string are determined according to the text-string of each default sentence. After step S11, the text-string s1 = Yangtze_Bridge has been determined. Its text phonetic p1 and term vector w1 are then determined from the text-string, giving p1 = chang jiang | da qiao and w1 = (0.323, 0.123, ...) (0.564, 0.348, ...). When the text-string includes Chinese characters, each Chinese character is mapped to its corresponding text phonetic (pinyin); when the text-string includes English characters, the text phonetic of an English character is the English character itself.
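The phonetic mapping rule in step S12 can be sketched as follows. This is only an illustration under stated assumptions: the per-character table `PINYIN_TABLE` and the function `text_phonetic` are hypothetical, and a per-character lookup ignores characters with several readings, which a real pinyin lexicon would have to resolve.

```python
# Hypothetical character-to-pinyin table covering only the example sentence.
PINYIN_TABLE = {"长": "chang", "江": "jiang", "大": "da", "桥": "qiao"}

def text_phonetic(words, table=PINYIN_TABLE):
    """Map each segmented word to its phonetic form.
    Chinese characters are looked up in the table; English words are kept as-is."""
    parts = []
    for word in words:
        if word.isascii():
            parts.append(word)  # English characters are their own phonetic
        else:
            # Characters missing from the table are left unchanged.
            parts.append(" ".join(table.get(ch, ch) for ch in word))
    return " | ".join(parts)

p1 = text_phonetic(["长江", "大桥"])
kfc = text_phonetic(["KFC"])
```

Here `p1` reproduces the phonetic given for s1 in the text, and an English entry such as "KFC" passes through unchanged.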
For step S13, the feature vector corresponding to each default sentence is determined according to that sentence's text-string, text phonetic, and term vector. The feature vector covers the text-string feature, the text phonetic feature, and the term vector feature of the default sentence, and the text similarity model is then trained with these feature vectors.
As this embodiment shows, the text similarity model is trained on multiple features of each word. The model parameters are richer and more features are involved, so the text similarity that is determined is more accurate.
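As a rough sketch of what the feature vector of step S13 might look like in code, the three features can be held in one record per default sentence. The class `FeatureVector`, the function `build_feature`, and the lookup tables are hypothetical, and the two-component vectors are only the leading components quoted in the text, not full embeddings. The embodiment does not specify the training procedure itself, so the sketch stops at building the records.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class FeatureVector:
    text_string: str      # segmented words joined by "_"
    text_phonetic: str    # per-word phonetics joined by " | "
    term_vectors: tuple   # one vector per segmented word

def build_feature(words, phonetics, embeddings):
    """Combine the three features of one sentence into a single record."""
    return FeatureVector(
        text_string="_".join(words),
        text_phonetic=" | ".join(phonetics[w] for w in words),
        term_vectors=tuple(embeddings[w] for w in words),
    )

# Hypothetical lookup tables for the "Yangtze Bridge" example.
PHONETICS = {"Yangtze": "chang jiang", "Bridge": "da qiao"}
EMBEDDINGS = {"Yangtze": (0.323, 0.123), "Bridge": (0.564, 0.348)}

feature = build_feature(["Yangtze", "Bridge"], PHONETICS, EMBEDDINGS)
```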
A flow chart of a text matching method based on a text similarity model, provided by an embodiment of the present invention, includes the following steps:
S21: receive text information and determine the feature vector of the text information, where the feature vector includes at least: a text-string, a text phonetic, and a term vector;
S22: input the feature vector into the text similarity model;
S23: obtain the characteristic similarity output by the text similarity model;
S24: according to the characteristic similarity, determine at least one default sentence that reaches the default characteristic threshold as the matched text of the text information.
In this embodiment, the text similarity model trained by the method of claim 1 is put to practical use.
For step S21, text information is received. The text information may be obtained by speech recognition on the corresponding device when the user provides input by voice, or it may be entered by the user through the input method of the corresponding device. For example, the user enters text through an input method and, because of a shaky hand, carelessness, or some other reason, types "Yangtze Bridging". The feature vector of the "Yangtze Bridging" entered by the user is then determined, including the text-string, text phonetic, and term vector: text-string s2 = Yangtze_Bridging, text phonetic p2 = chang jiang | da qiao, and term vector w2 = (0.1234, 0.2133, ...) (0.823, 0.234, ...).
For step S22, the feature vector determined in step S21 is input into the text similarity model, where it is compared, feature by feature, with the default sentences in the text similarity model.
For step S23, after step S22, the characteristic similarity output by the text similarity model is obtained. The characteristic similarity includes the similarity between the words entered by the user and each default sentence in the text similarity model.
For step S24, according to the characteristic similarity determined in step S23, at least one default sentence that reaches the default threshold is determined as the matched text of the text information.
As this embodiment shows, a text similarity model that considers feature vectors of several dimensions is used to determine the characteristic similarity between the user's input sentence and each default sentence in the model, and a matched text with relatively high accuracy is then determined. The default dictionary is relatively easy to collect, and the cost is relatively low.
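Steps S21 to S24 can be outlined with a short sketch. The trained model is not described in enough detail to reproduce, so a simple character-overlap score, `char_overlap`, stands in for the characteristic similarity; it, the function `match_text`, and the threshold value are assumptions made only for illustration.

```python
def char_overlap(a, b):
    """Stand-in similarity: Jaccard overlap of the character sets of two strings."""
    sa, sb = set(a), set(b)
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0

def match_text(query, default_sentences, score_fn, threshold):
    """Score every default sentence, keep those reaching the threshold,
    and return them in descending order of similarity."""
    scored = [(score_fn(query, s), s) for s in default_sentences]
    hits = [(score, s) for score, s in scored if score >= threshold]
    hits.sort(key=lambda pair: pair[0], reverse=True)
    return [s for _, s in hits]

DEFAULT_SENTENCES = ["Yangtze Bridge", "Friendship Bridge", "KFC"]
matches = match_text("Yangtze Bridging", DEFAULT_SENTENCES, char_overlap, 0.8)
```

In the embodiment the score would come from the text similarity model and would combine the text-string, text phonetic, and term vector features rather than a single character-level measure.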
As an implementation, in this embodiment the default characteristic threshold includes a pre-set text threshold, and obtaining the characteristic similarity output by the text similarity model includes:
when the feature vector includes at least a text-string, determining the text similarity between the text information and each default sentence according to the text-string of the text information and the text-string of each default sentence in the text similarity model;
determining the default sentences whose text similarity exceeds the pre-set text threshold as a matched character string set;
determining the characteristic similarity between the text information and the default sentences in the matched character string set according to at least the text-string, text phonetic, and term vector in the feature vector.
In this embodiment, the default characteristic threshold includes a pre-set text threshold. When the feature vector includes at least a text-string, the text similarity between the text information and each default sentence is determined according to the text-string of the text information entered by the user and the text-string of each default sentence in the text similarity model. In other words, one of the several features is used first for a similarity comparison, which yields a smaller matched character string set made up of the default sentences that exceed the pre-set text threshold.
After the matched character string set is determined, the characteristic similarity between the text information entered by the user and the default sentences in the matched character string set is determined using all of the features together.
As this embodiment shows, a single feature is first used for a preliminary screening of the default sentences in the text similarity model. A relatively small matched character string set is filtered out, and the corresponding matched text is then determined from it using the several features, which speeds up the determination of the matched text.
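The two-stage screening described above can be sketched as follows. The embodiment does not say how text similarity is computed, so the ratio from the Python standard library's `difflib.SequenceMatcher` is used as an assumed measure; `two_stage_match` and the `full_score` callback, which stands for the multi-feature comparison, are hypothetical names.

```python
from difflib import SequenceMatcher

def text_similarity(a, b):
    """Assumed text-string similarity: SequenceMatcher ratio in [0, 1]."""
    return SequenceMatcher(None, a, b).ratio()

def two_stage_match(query, default_sentences, text_threshold, full_score):
    """Stage 1: keep only sentences whose text similarity exceeds the threshold.
    Stage 2: run the more expensive multi-feature score on the survivors only."""
    matched_string_set = [
        s for s in default_sentences if text_similarity(query, s) > text_threshold
    ]
    return {s: full_score(query, s) for s in matched_string_set}

DEFAULT_SENTENCES = ["Yangtze Bridge", "Friendship Bridge", "KFC"]
# Placeholder for the multi-feature score; a real system would call the model here.
result = two_stage_match(
    "Yangtze Bridging", DEFAULT_SENTENCES, 0.8, lambda q, s: 1.0
)
```

The saving comes from calling `full_score` only on the sentences that survive the first stage instead of on the whole dictionary.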
As an implementation, in this embodiment the default characteristic threshold includes a default phonetic threshold, and obtaining the characteristic similarity output by the text similarity model includes:
when the feature vector includes at least a text phonetic, determining the pinyin similarity between the text information and each default sentence according to the text phonetic of the text information and the text phonetic of each default sentence in the text similarity model;
determining the default sentences whose pinyin similarity exceeds the default phonetic threshold as a matching phonetic set;
determining the characteristic similarity between the text information and the default sentences in the matching phonetic set according to at least the text-string, text phonetic, and term vector in the feature vector.
In this embodiment, the default characteristic threshold includes a default phonetic threshold. When the feature vector includes at least a text phonetic, the pinyin similarity between the text information and each default sentence is determined according to the text phonetic of the text information entered by the user and the text phonetic of each default sentence in the text similarity model. Similarly, one of the several features is used first for a similarity comparison, which yields a smaller matching phonetic set made up of the default sentences that exceed the default phonetic threshold.
After the matching phonetic set is determined, the characteristic similarity between the text information entered by the user and the default sentences in the matching phonetic set is determined using all of the features together.
As this embodiment shows, a single feature is first used for a preliminary screening of the default sentences in the text similarity model. A relatively small matching phonetic set is filtered out, and the corresponding matched text is then determined from it using the several features, which speeds up the determination of the matched text.
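One possible way to compute the pinyin similarity is an edit distance over pinyin syllables, normalised by the longer sequence. The embodiment does not prescribe a measure, so this choice is an assumption; the functions `edit_distance` and `pinyin_similarity` and the phonetic "you yi | qiao" used for a second sentence are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences (here: lists of syllables)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # delete x
                           cur[j - 1] + 1,           # insert y
                           prev[j - 1] + (x != y)))  # substitute or match
        prev = cur
    return prev[-1]

def pinyin_similarity(p, q):
    """Similarity in [0, 1]; word separators '|' are ignored."""
    a = p.replace("|", " ").split()
    b = q.replace("|", " ").split()
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest

p1 = "chang jiang | da qiao"   # phonetic of the default sentence
p2 = "chang jiang | da qiao"   # phonetic of the mistyped input
same = pinyin_similarity(p1, p2)
other = pinyin_similarity(p1, "you yi | qiao")
```

Because the mistyped input in the example has the same phonetic as the intended default sentence, a phonetic comparison of this kind ranks that sentence highest even though the characters differ.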
As an implementation, in this embodiment the default characteristic threshold includes a default vector threshold, and obtaining the characteristic similarity output by the text similarity model includes:
when the feature vector includes at least a term vector, determining the vector similarity between the text information and each default sentence according to the term vector of the text information and the term vector of each default sentence in the text similarity model;
determining the default sentences whose vector similarity exceeds the default vector threshold as a matching vector set;
determining the characteristic similarity between the text information and the default sentences in the matching vector set according to at least the text-string, text phonetic, and term vector in the feature vector.
In this embodiment, the default characteristic threshold includes a default vector threshold. When the feature vector includes at least a term vector, the vector similarity between the text information and each default sentence is determined according to the term vector of the text information entered by the user and the term vector of each default sentence in the text similarity model. Similarly, one of the several features is used first for a similarity comparison, which yields a smaller matching vector set made up of the default sentences that exceed the default vector threshold.
After the matching vector set is determined, the characteristic similarity between the text information entered by the user and the default sentences in the matching vector set is determined using all of the features together.
As this embodiment shows, a single feature is first used for a preliminary screening of the default sentences in the text similarity model. A relatively small matching vector set is filtered out, and the corresponding matched text is then determined from it using the several features, which speeds up the determination of the matched text.
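A vector-based screening might be sketched as below. The embodiment does not state how per-word term vectors are combined or compared, so averaging the word vectors and using cosine similarity are both assumptions, and the names `mean_vector`, `cosine`, and `matching_vector_set` as well as the toy vectors are hypothetical.

```python
import math

def mean_vector(vectors):
    """Average per-word vectors into one sentence vector (assumed pooling)."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]

def cosine(u, v):
    """Cosine similarity; returns 0.0 when either vector has zero length."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0

def matching_vector_set(query_vectors, presets, vector_threshold):
    """presets maps each default sentence to its list of per-word vectors."""
    q = mean_vector(query_vectors)
    return [
        sentence
        for sentence, vectors in presets.items()
        if cosine(q, mean_vector(vectors)) > vector_threshold
    ]

# Toy two-dimensional vectors chosen only to make the behaviour visible.
PRESETS = {"A": [(1.0, 0.0), (1.0, 0.0)], "B": [(0.0, 1.0)]}
kept = matching_vector_set([(2.0, 0.0)], PRESETS, 0.5)
```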
As an implementation, in this embodiment, determining, according to the characteristic similarity, at least one default sentence that reaches the default characteristic threshold as the matched text of the text information includes:
when, in descending order of similarity, only one default sentence exceeding the default characteristic threshold is determined as the matched text of the text information, using that default sentence as the matched text of the text information; or
when, in descending order of similarity, at least two default sentences exceeding the default characteristic threshold are determined as matched texts of the text information, sending the at least two default sentences to the user;
receiving the default sentence selected by the user;
using the selected default sentence as the matched text of the text information.
In this embodiment, the default sentences that reach the default characteristic threshold can be determined in descending order of similarity as the matched text of the text information. When only one matched text is determined, for example when the text information entered by the user is "Yangtze Bridging" and the single matched text determined in descending order of similarity is "Yangtze Bridge", "Yangtze Bridge" is used as the matched text of the "Yangtze Bridging" entered by the user.
When at least two matched texts are determined, for example when the text information entered by the user is "BJ Univ Hospital" and the matched texts determined by similarity are "Peking University First Hospital", "Peking University Second Hospital", "Peking University Third Hospital", ..., these are fed back to the user and the default sentence selected by the user is received. If, for instance, the user selects "The Third Affiliated Hospital of Peking University", the selected default sentence is used as the matched text of the text information.
As this embodiment shows, determining a specified number of matched texts gives the user more ways to match, widens the matching range, and improves the user experience.
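The choice between a single match and a user selection can be sketched as follows. The function `resolve_match` and the `ask_user` callback are hypothetical, the scores are invented for illustration, and returning `None` when nothing reaches the threshold is an assumption, since the text does not describe that case. The text speaks both of sentences that "reach" and that "exceed" the threshold; the sketch uses greater-than-or-equal as an assumed reading.

```python
def resolve_match(scored, threshold, ask_user):
    """scored: list of (default_sentence, similarity) pairs.
    Returns the matched text, asking the user when several candidates qualify."""
    ranked = sorted(scored, key=lambda pair: pair[1], reverse=True)
    hits = [sentence for sentence, score in ranked if score >= threshold]
    if not hits:
        return None          # assumed behaviour: no match found
    if len(hits) == 1:
        return hits[0]       # single candidate: use it directly
    return ask_user(hits)    # several candidates: let the user choose

# Illustrative scores only.
single = resolve_match(
    [("Friendship Bridge", 0.4), ("Yangtze Bridge", 0.9)], 0.8, lambda opts: opts[0]
)
chosen = resolve_match(
    [("Peking University First Hospital", 0.85),
     ("Peking University Third Hospital", 0.9)],
    0.8,
    lambda opts: "Peking University Third Hospital",
)
```

In a deployed system `ask_user` would send the candidate list to the user's device and wait for the selection.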
A structural schematic diagram of a training system for a text similarity model is provided by an embodiment of the present invention. The system can execute the training method of the text similarity model described in any of the above embodiments and is configured in a terminal.
The training system for a text similarity model provided in this embodiment includes: a text-string determining program module 11, a term vector and text phonetic determining program module 12, and a text similarity model training program module 13.
The text-string determining program module 11 is configured to receive a dictionary training set, perform word segmentation on each default sentence in the dictionary training set, and determine the text-string of each default sentence. The term vector and text phonetic determining program module 12 is configured to determine, according to the text-string of each default sentence, the term vector corresponding to the text-string and the text phonetic corresponding to the text-string. The text similarity model training program module 13 is configured to determine the feature vector corresponding to each default sentence according to the text-string, text phonetic, and term vector corresponding to that sentence, and to train the text similarity model.
A structural schematic diagram of a text matching system based on a text similarity model is provided by an embodiment of the present invention. The system can execute the text matching method based on the text similarity model described in any of the above embodiments and is configured in a terminal.
The text matching system based on a text similarity model provided in this embodiment includes: a feature vector determining program module 21, a feature vector input program module 22, a characteristic similarity obtaining program module 23, and a text matching program module 24.
The feature vector determining program module 21 is configured to receive text information and determine the feature vector of the text information, where the feature vector includes at least: a text-string, a text phonetic, and a term vector. The feature vector input program module 22 is configured to input the feature vector into the text similarity model. The characteristic similarity obtaining program module 23 is configured to obtain the characteristic similarity output by the text similarity model. The text matching program module 24 is configured to determine, according to the characteristic similarity, at least one default sentence that reaches the default characteristic threshold as the matched text of the text information.
Further, the default characteristic threshold value includes pre-set text threshold value, and the characteristic similarity obtains program moduleFor:
When described eigenvector include at least text-string when, according to the text-string of the text information with it is describedThe text-string of each default sentence determines the text of the text information and each default sentence in text similarity modelSimilarity;
The default sentence that the text similarity is more than pre-set text threshold value is determined as matched character string set;
According at least to text-string, the text phonetic, term vector in described eigenvector, determine the text information withThe characteristic similarity of default sentence in the matched character string set.
Further, the default characteristic threshold value includes default phonetic threshold value, and the characteristic similarity obtains program moduleFor:
When described eigenvector includes at least text phonetic, according to the text phonetic of the text information and the textThe text phonetic of each default sentence determines the pinyin similarity of the text information and each default sentence in similarity model;
The pinyin similarity is determined to be more than to preset the default sentence of phonetic threshold value as matching phonetic set;
According at least to text-string, the text phonetic, term vector in described eigenvector, determine the text information withThe characteristic similarity of default sentence in the matching phonetic set.
Further, the default characteristic threshold value includes default vector threshold, and the characteristic similarity obtains program moduleFor:
It is similar to the text according to the term vector of the text information when described eigenvector includes at least term vectorThe term vector of each default sentence determines the vector similarity of the text information and each default sentence in degree model;
The vector similarity is determined to be more than to preset the default sentence of vector threshold as matching vector set;
According at least to text-string, the text phonetic, term vector in described eigenvector, determine the text information withThe characteristic similarity of default sentence in the matching vector set.
Further, the text matching program module is configured to:
when, in descending order of similarity, only one default sentence exceeding the default characteristic threshold is determined as the matched text of the text information, use that default sentence as the matched text of the text information; or
when, in descending order of similarity, at least two default sentences exceeding the default characteristic threshold are determined as matched texts of the text information, send the at least two default sentences to the user;
receive the default sentence selected by the user;
use the selected default sentence as the matched text of the text information.
An embodiment of the present invention also provides a non-volatile computer storage medium. The computer storage medium stores computer-executable instructions that can execute the training method of the text similarity model in any of the above method embodiments.
As an implementation, the non-volatile computer storage medium of the present invention stores computer-executable instructions, and the computer-executable instructions are set to:
receive a dictionary training set, perform word segmentation on each default sentence in the dictionary training set, and determine the text-string of each default sentence;
according to the text-string of each default sentence, determine the term vector corresponding to the text-string and the text phonetic corresponding to the text-string;
according to the text-string, text phonetic, and term vector corresponding to each default sentence, determine the feature vector corresponding to each default sentence, and train the text similarity model.
An embodiment of the present invention also provides a non-volatile computer storage medium. The computer storage medium stores computer-executable instructions that can execute the text matching method based on the text similarity model in any of the above method embodiments.
As an implementation, the non-volatile computer storage medium of the present invention stores computer-executable instructions, and the computer-executable instructions are set to:
receive text information and determine the feature vector of the text information, where the feature vector includes at least: a text-string, a text phonetic, and a term vector;
input the feature vector into the text similarity model;
obtain the characteristic similarity output by the text similarity model;
according to the characteristic similarity, determine at least one default sentence that reaches the default characteristic threshold as the matched text of the text information.
A non-volatile computer-readable storage medium can be used to store non-volatile software programs, non-volatile computer-executable programs, and modules, such as the program instructions/modules corresponding to the method of the test software in the embodiments of the present invention. One or more program instructions are stored in the non-volatile computer-readable storage medium and, when executed by a processor, perform the training method of the text similarity model and the text matching method based on the text similarity model in any of the above method embodiments.
The non-volatile computer-readable storage medium may include a program storage area and a data storage area. The program storage area can store an operating system and an application program required by at least one function; the data storage area can store data created through use of the test software device, and so on. In addition, the non-volatile computer-readable storage medium may include high-speed random access memory and may also include non-volatile memory, such as at least one magnetic disk storage device, a flash memory device, or another non-volatile solid-state storage device. In some embodiments, the non-volatile computer-readable storage medium optionally includes memory located remotely from the processor, and such remote memory can be connected to the test software device through a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
An embodiment of the present invention also provides an electronic device, which includes at least one processor and a memory communicatively connected to the at least one processor. The memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor so that the at least one processor can perform the steps of the training method of the text similarity model and of the text matching method based on the text similarity model in any embodiment of the present invention.
The clients of the embodiments of the present application exist in a variety of forms, including but not limited to:
(1) Mobile communication devices: these devices have mobile communication functions and are mainly intended to provide voice and data communication. Such terminals include smart phones (such as the iPhone), multimedia phones, feature phones, and low-end phones.
(2) Ultra-mobile personal computer devices: these devices belong to the category of personal computers, have computing and processing functions, and generally also have mobile internet access. Such terminals include PDA, MID, and UMPC devices, such as the iPad.
(3) Portable entertainment devices: these devices can display and play multimedia content. They include audio and video players (such as the iPod), handheld devices, e-books, smart toys, and portable in-vehicle navigation devices.
(4) Other electronic devices with data processing functions.
Herein, relational terms such as "first" and "second" are used only to distinguish one entity or operation from another, and do not necessarily require or imply any actual relationship or order between these entities or operations. Moreover, the terms "include" and "comprise" cover not only the elements listed but also other elements not explicitly listed, as well as elements inherent to the process, method, article, or device in question. Without further limitation, an element defined by the phrase "including ..." does not exclude the presence of other identical elements in the process, method, article, or device that includes the element.
The apparatus embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. A person of ordinary skill in the art can understand and implement this without creative effort.
From the description of the above embodiments, those skilled in the art can clearly understand that each embodiment can be implemented by software plus a necessary general-purpose hardware platform, and of course also by hardware. Based on this understanding, the above technical solutions, in essence or in the part that contributes over the prior art, can be embodied in the form of a software product. The computer software product may be stored in a computer-readable storage medium, such as a ROM/RAM, magnetic disk, or optical disc, and includes several instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to perform the methods described in the embodiments or in certain parts of the embodiments.
Finally, it should be noted that the above embodiments are intended only to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that they may still modify the technical solutions described in the foregoing embodiments or make equivalent replacements of some of the technical features, and that such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present invention.