Automatic question-answering method, apparatus, and storage medium
Technical field
The present invention relates to the field of artificial intelligence, and in particular to an automatic question-answering method, apparatus, and storage medium.
Background
A chat robot system is an artificial intelligence system that, by means of communication tools, can stay online at all times and communicate with people in natural language. A chat robot system is in essence an automatic question answering (QA) system. Such a system, also called a question answering system, is a computer processing system that stores a large corpus and automatically retrieves and looks up answers to questions raised by users.
Specifically, after a user inputs a question, the chat robot system retrieves an answer matching the question from a database and then outputs the retrieved answer as a reply to the question input by the user, thereby realizing a chat.
However, current chat robot systems often return answers that do not match the question. The answers are poorly relevant, which reduces the accuracy of the answers output by the chat robot system.
Summary of the invention
Embodiments of the present invention provide an automatic question-answering method, apparatus, and storage medium that can improve the accuracy of the answers output by a chat robot system.
An embodiment of the present invention provides an automatic question-answering method, including:
forming multiple question-answer pairs based on social data in a social platform, each question-answer pair including a question and its corresponding answer;
establishing an inverted index of the questions and their phrases;
obtaining a search question, and determining similar questions close to the search question according to the question phrases of the search question and the inverted index;
obtaining candidate answers to the search question according to the similar questions and the question-answer pairs, to obtain a candidate answer set of the search question;
selecting a target answer to the search question from the candidate answer set.
Correspondingly, an embodiment of the present invention further provides an automatic question-answering apparatus, including:
a question-answer pair forming unit, configured to form multiple question-answer pairs based on social data in a social platform, each question-answer pair including a question and its corresponding answer;
an index establishing unit, configured to establish an inverted index of the questions and their phrases;
a question obtaining unit, configured to obtain a search question and to determine similar questions close to the search question according to the question phrases of the search question and the inverted index;
a candidate answer obtaining unit, configured to obtain candidate answers to the search question according to the similar questions and the question-answer pairs, to obtain a candidate answer set of the search question;
an answer selection unit, configured to select a target answer to the search question from the candidate answer set.
Correspondingly, an embodiment of the present invention further provides a storage medium storing instructions which, when executed by a processor, implement any automatic question-answering method provided by the embodiments of the present invention.
In the embodiments of the present invention, multiple question-answer pairs are formed based on social data in a social platform, each question-answer pair including a question and its corresponding answer; an inverted index of the questions and their phrases is then established; a search question is obtained, and similar questions close to the search question are determined according to the question phrases of the search question and the inverted index; candidate answers to the search question are obtained according to the similar questions and the question-answer pairs, yielding a candidate answer set of the search question; and a target answer to the search question is selected from the candidate answer set. This solution first looks up the similar questions close to the search question, then looks up the answers corresponding to those similar questions, and finally selects the most suitable answer from among them. The solution can therefore output an answer that matches the search question, improving the accuracy and quality of the answers output by a chat robot system.
Brief description of the drawings
To describe the technical solutions in the embodiments of the present invention more clearly, the accompanying drawings required for describing the embodiments are briefly introduced below. Apparently, the drawings in the following description show only some embodiments of the present invention, and a person skilled in the art may derive other drawings from these drawings without creative effort.
Fig. 1a is a schematic flowchart of the automatic question-answering method provided by an embodiment of the present invention;
Fig. 1b is a schematic flowchart of sentence synthesis provided by an embodiment of the present invention;
Fig. 1c is a schematic diagram of obtaining a sentence vector from word vectors provided by an embodiment of the present invention;
Fig. 1d is a schematic diagram of calculating sentence similarity based on a convolutional neural network provided by an embodiment of the present invention;
Fig. 1e is a schematic diagram of word co-occurrence count statistics provided by an embodiment of the present invention;
Fig. 2a is a schematic diagram of a scenario of the automatic question-answering system provided by an embodiment of the present invention;
Fig. 2b is a schematic flowchart of the automatic question-answering method provided by an embodiment of the present invention;
Fig. 2c is a schematic diagram of a robot chat interface provided by an embodiment of the present invention;
Fig. 2d is another schematic diagram of the robot chat interface provided by an embodiment of the present invention;
Fig. 3a is an architecture diagram of the automatic question-answering system provided by an embodiment of the present invention;
Fig. 3b is a schematic structural diagram of the ranking system provided by an embodiment of the present invention;
Fig. 4a is a first schematic structural diagram of the automatic question-answering apparatus provided by an embodiment of the present invention;
Fig. 4b is a second schematic structural diagram of the automatic question-answering apparatus provided by an embodiment of the present invention;
Fig. 4c is a third schematic structural diagram of the automatic question-answering apparatus provided by an embodiment of the present invention;
Fig. 4d is a fourth schematic structural diagram of the automatic question-answering apparatus provided by an embodiment of the present invention.
Detailed description
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings in the embodiments of the present invention. Apparently, the described embodiments are only some, rather than all, of the embodiments of the present invention. All other embodiments obtained by a person skilled in the art based on the embodiments of the present invention without creative effort shall fall within the protection scope of the present invention.
Embodiments of the present invention provide an automatic question-answering method, apparatus, and storage medium, each of which is described in detail below.
Embodiment 1
This embodiment is described from the perspective of an automatic question-answering apparatus, which may specifically be integrated in one entity or in multiple entities; for example, the automatic question-answering apparatus may be integrated in a device such as a server.
An automatic question-answering method includes: forming multiple question-answer pairs based on social data in a social platform, each question-answer pair including a question and its corresponding answer; establishing an inverted index of the questions and their phrases; obtaining a search question, and determining similar questions close to the search question according to the question phrases of the search question and the inverted index; obtaining candidate answers to the search question according to the similar questions and the question-answer pairs, to obtain a candidate answer set of the search question; and selecting a target answer to the search question from the candidate answer set.
As shown in Fig. 1a, the detailed flow of the automatic question-answering method may be as follows:
101. Form multiple question-answer pairs based on social data in a social platform, each question-answer pair including a question and its corresponding answer.
Here, a social platform is a platform on which users share information such as personal information, moods, and impressions, for example a social platform based on instant messaging. A social platform may also include chat robot and question-answering systems; that is, a chat robot or question-answering system may itself be a social platform on which social information is exchanged.
The social data in a social platform is the social information data exchanged by users on that platform. The social data may include the UGC (User Generated Content) in the social platform. For example, the social data may include content (such as text content) that a user publishes on the social platform, as well as the comments or replies that other users post on the social platform in response to that content.
In this embodiment, a question-answer pair (QA pair) is a pair of social data items, such as text data, that stand in a question-and-reply relationship. A question-answer pair may specifically include a question and its corresponding answer. For example, suppose the social data includes content a published on the social platform, together with comment content b, comment content c, and comment content d posted in response to content a. Content a can then be taken as the question, and comment contents b, c, and d can each be taken as an answer to that question, finally forming the question-answer pairs (a, b), (a, c), and (a, d).
In this embodiment, a large volume of social data may be obtained from the social platform and then disassembled into question-answer pairs. That is, the step of "forming multiple question-answer pairs based on social data in a social platform" may include:
obtaining social data in the social platform;
performing question-answer disassembly on the social data to form multiple question-answer pairs.
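The disassembly step can be sketched as follows. This is only a minimal illustration under assumptions: the record layout (a dict with hypothetical "content" and "comments" keys) and the function name form_qa_pairs are not taken from the embodiment, which does not specify how social data is stored.

```python
from typing import Dict, List, Tuple


def form_qa_pairs(posts: List[Dict]) -> List[Tuple[str, str]]:
    """Pair each published content item (question) with each of its comments (answers)."""
    pairs = []
    for post in posts:
        question = post["content"]  # hypothetical field: the published content
        for comment in post.get("comments", []):  # hypothetical field: replies to it
            pairs.append((question, comment))
    return pairs


# Content a with comments b, c, d yields (a, b), (a, c), (a, d).
posts = [{"content": "a", "comments": ["b", "c", "d"]}]
qa_pairs = form_qa_pairs(posts)
```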
In practice, to reduce the processing load and improve answer quality, data filtering (that is, data cleansing) may also be performed on the social data, for example to filter out private or sensitive data; question-answer disassembly is then performed on the filtered social data to form multiple question-answer pairs. That is, the step of "performing question-answer disassembly on the social data" may include:
performing data filtering on the social data to obtain filtered social data;
performing question-answer disassembly on the filtered social data.
Specifically, the social data may first be segmented into words, and some phrase data may then be filtered out according to a vocabulary filtering rule. The vocabulary filtering rule may be set according to actual requirements. For example, it may filter out named-entity data such as person names, place names, and organization names; private data such as telephone numbers, instant messaging accounts, and financial accounts (such as bank card numbers); and uncivil vocabulary such as profanity. That is, the step of "performing data filtering on the social data" may include:
performing word segmentation on the social data to obtain phrase data corresponding to the social data;
performing data filtering on the phrase data corresponding to the social data according to a preset vocabulary filtering rule.
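The two filtering sub-steps might look like the sketch below. It rests on assumptions that are not part of the embodiment: whitespace splitting stands in for a real word segmenter, a caller-supplied blocklist stands in for named-entity and profanity detection, and a run of seven or more digits is treated as a phone or account number.

```python
import re
from typing import List, Set

# Assumption: a token made only of 7+ digits is a phone number or account number.
ACCOUNT_RE = re.compile(r"^\d{7,}$")


def segment(text: str) -> List[str]:
    """Placeholder segmenter: splits on whitespace (a real system would use a word segmenter)."""
    return text.split()


def filter_phrases(phrases: List[str], blocked: Set[str]) -> List[str]:
    """Drop phrases on the blocklist and phrases that look like account numbers."""
    return [
        p for p in phrases
        if p.lower() not in blocked and not ACCOUNT_RE.match(p)
    ]


cleaned = filter_phrases(segment("call Alice at 13800000000 tomorrow"), {"alice"})
```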
In this embodiment, the data may be filtered offline; for example, an offline processing system may be used to obtain the social data in the social platform and then filter it.
102. Establish an inverted index of the questions and their phrases.
An inverted index arises from the practical need to look up records by attribute value. Each entry in such an index table includes an attribute value and the addresses of the records that have that attribute value. Because the attribute value is not determined by the record, but rather the position of the record is determined by the attribute value, the index is called an inverted index. A file that has an inverted index is called an inverted index file, or inverted file for short.
In this embodiment, the inverted index of the questions and their phrases may be an index in which a question is located through the phrases it contains. The inverted index may include multiple index entries, or index pairs; each index entry or index pair includes an index keyword and its corresponding index item, where the index keyword may be a phrase in a question and the index item may be the question to which that phrase corresponds. Establishing the inverted index of the questions and their phrases in this embodiment therefore means establishing index pairs, or index entries, that represent the correspondence between phrases and questions. Specifically, the step of "establishing an inverted index of the questions and their phrases" may include:
performing word segmentation on the question in each question-answer pair to obtain the phrases of the question;
establishing index pairs according to the phrases of the question and the question, where each index pair includes an index keyword and its corresponding index item, the index keyword being a phrase of the question and the index item being the question.
For example, after word segmentation is performed on question Q to obtain phrases q1, q2 ... qn, the index pairs (q1, Q), (q2, Q) ... (qn, Q) can be established, thereby obtaining the inverted index of the question and its phrases.
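A minimal sketch of building such index pairs is shown here. The whitespace split again stands in for word segmentation, and the name build_inverted_index is illustrative only; the pairs (q1, Q) ... (qn, Q) are stored as a mapping from each phrase to the set of questions that contain it.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set


def build_inverted_index(questions: Iterable[str]) -> Dict[str, Set[str]]:
    """Map each phrase (index keyword) to the questions (index items) containing it."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for question in questions:
        for phrase in question.lower().split():  # assumption: whitespace segmentation
            index[phrase].add(question)
    return index


index = build_inverted_index(["how to cook rice", "how to fry rice"])
```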
After the inverted index of the questions and their phrases has been established, similar questions close to a search question can be looked up based on the inverted index.
103. Obtain a search question, and determine similar questions close to the search question according to the question phrases of the search question and the inverted index.
(1) Obtaining the search question:
The search question is the question for which an answer needs to be retrieved. It may be obtained in many ways; for example, it may be obtained from a sentence input by the user. Specifically, the sentence input by the user may be used as the search question.
The sentence input by the user may be one or more passages of content input by the user, such as text content. The sentence may be composed of phrases, and it may be a complete utterance or an incomplete one.
Optionally, to improve the accuracy and quality of the answer, this embodiment may also use syntactic analysis to synthesize the sentences input by the user, so as to obtain an accurate search question. Specifically, the step of "obtaining a search question" may include:
obtaining the sentence currently input by the user and a history sentence input by the user earlier;
when the sentence contains neither a subject-predicate structure nor a verb-object structure, and the history sentence contains a subject-predicate structure or a verb-object structure, using the subject or object in the history sentence as the subject of the sentence to synthesize a new sentence;
using the new sentence as the search question.
Here, a history sentence is a sentence input by the user earlier; for example, the history sentence may be the sentence the user input last time. For example, if the user inputs "apple is nice" and then inputs "red", then "apple is nice" is the history sentence and "red" is the sentence currently input by the user.
In this embodiment, a sentence input by the user may be a sentence that the user inputs through a robot chat client on a terminal. For example, when the user enters message content in the input box of the robot chat interface, the terminal sends the message content to the server, and the server receives it.
In this embodiment, syntactic analysis may be performed separately on the sentence currently input by the user and on the history sentence input earlier, to obtain syntactic analysis results. Based on these results, it is then determined whether the currently input sentence contains a subject-predicate structure or a verb-object structure, and whether the history sentence contains a subject-predicate structure or a verb-object structure.
Syntactic analysis (parsing) refers to analyzing the grammatical function of the words in a sentence. For example, syntactic analysis of "I came late" shows that "I" is the subject, "came" is the predicate, and "late" is the complement.
In this embodiment, when the sentence currently input by the user contains neither a subject-predicate structure nor a verb-object structure, and the history sentence input earlier contains a subject-predicate structure or a verb-object structure, the subject or object in the history sentence may be used as the subject of the current sentence to synthesize a new sentence.
For example, the user inputs "apple is nice" in the first round and "red" in the second round. Syntactic analysis shows that the sentence input in the second round (the sentence currently input by the user) contains neither a subject-predicate structure nor a verb-object structure, whereas the sentence input in the first round (the history sentence) contains a subject. The subject of the first-round sentence can therefore be used as the subject of the current sentence and spliced into the new sentence "apple is red", which is then used as the search question.
Optionally, the above synthesis may fail in practice. To improve the accuracy of the search question in that case, the method of this embodiment further includes:
when synthesizing the new sentence fails, extracting a corresponding keyword from the sentence according to the part-of-speech types of its phrases;
replacing, with the keyword, the target phrase in the history sentence whose part-of-speech type is the same as that of the keyword, to obtain a replaced sentence;
using the replaced sentence as the search question.
Here, the part-of-speech types of phrases may be divided by part of speech; for example, phrases may be divided into nouns, verbs, adjectives, numerals, quantifiers, pronouns, function words, and so on. The division may also be based on the meaning a phrase expresses; for example, phrases may be divided into person names, place names, organization names, and so on.
In this embodiment, the keyword may be extracted from the sentence according to part-of-speech type priority. For example, the priority of nouns may be set higher than that of verbs; in that case, a phrase whose part-of-speech type is noun is extracted from the sentence as the keyword, and if none can be extracted, a phrase whose part-of-speech type is verb is extracted as the keyword. That is, keywords are extracted in the priority order of nouns first and verbs second.
Optionally, when extracting nouns, the keyword may further be extracted from the sentence in descending priority order of named-entity words (such as person names, place names, and organization names) followed by non-named-entity words. When extracting named-entity words, the keyword may be extracted in descending priority order of person name, place name, and organization name. When extracting non-named-entity words, the keyword may be chosen according to the tf-idf (term frequency-inverse document frequency, a weighting technique commonly used in information retrieval and data mining) values of the non-named-entity words; for example, the non-named-entity word with the highest tf-idf value may be extracted as the keyword.
Specifically, a phrase in the sentence whose part-of-speech type is person name may first be extracted as the keyword; if none can be extracted, a phrase whose type is place name is extracted as the keyword; if none can be extracted, a phrase whose type is organization name is extracted as the keyword; if none can be extracted, the noun with the highest tf-idf value in the sentence is extracted as the keyword; and if that also fails, a phrase whose part-of-speech type is verb is extracted as the keyword.
After a keyword has been successfully extracted from the sentence, the target phrase in the history sentence whose part-of-speech type is the same as that of the keyword may be replaced with the keyword. For example, if the extracted keyword is a person name, place name, or organization name, the person name, place name, or organization name in the history sentence may be replaced with the keyword; if the extracted keyword is a verb, the predicate or verb in the history sentence is replaced with the keyword; and if the extracted keyword is a non-named-entity word, the subject or object in the history sentence may be replaced with the keyword.
For example, the user inputs "go to Beijing on business" in the first round and "Shanghai" in the second round. Syntactic analysis shows that the sentence input in the second round (the sentence currently input by the user) contains neither a subject-predicate structure nor a verb-object structure, whereas the sentence input in the first round (the history sentence) contains a verb-object structure. Synthesizing a sentence by splicing in a subject as described above would therefore fail. In this case, the named-entity word "Shanghai" can be extracted from the second-round sentence as the keyword, and the named-entity word "Beijing" in the first-round sentence can be replaced with the keyword "Shanghai", synthesizing the new sentence "go to Shanghai on business". Finally, the new sentence is used as the search question.
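The keyword extraction and replacement just described can be sketched as follows. This is an illustration under assumptions, not the embodiment's implementation: the sentences are given as already tagged (word, type) pairs, the tag names are hypothetical, and the first noun found stands in for the noun with the highest tf-idf value.

```python
from typing import List, Optional, Tuple

Tagged = List[Tuple[str, str]]  # (word, type) pairs; the tag names below are hypothetical

# Priority from high to low: person name, place name, organization name, noun, verb.
PRIORITY = ["person", "place", "org", "noun", "verb"]


def extract_keyword(current: Tagged) -> Optional[Tuple[str, str]]:
    """Return the first (word, type) found when scanning types in priority order."""
    for tag in PRIORITY:
        for word, word_tag in current:
            if word_tag == tag:  # simplification: first noun, not the highest tf-idf noun
                return word, tag
    return None


def replace_by_type(history: Tagged, keyword: Tuple[str, str]) -> Optional[str]:
    """Replace the first history phrase of the keyword's type; None means replacement failed."""
    word, tag = keyword
    out, replaced = [], False
    for hist_word, hist_tag in history:
        if hist_tag == tag and not replaced:
            out.append(word)
            replaced = True
        else:
            out.append(hist_word)
    return " ".join(out) if replaced else None


history = [("go", "verb"), ("to", "other"), ("Beijing", "place"),
           ("on", "other"), ("business", "noun")]
current = [("Shanghai", "place")]
keyword = extract_keyword(current)
new_sentence = replace_by_type(history, keyword)
```

Returning None when no phrase of the same type exists gives the caller a clear signal to fall back to another synthesis strategy.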
Optionally, when keyword replacement fails to synthesize a sentence, this embodiment may also synthesize a new sentence using a statistics-based synthesis strategy and use it as the search question. For example, corresponding keywords may be extracted from the history sentence and then spliced to synthesize a new sentence. In this embodiment, the idf values of the phrases in the history sentence may be computed, and the keywords are then selected according to those idf values; for example, the word in the history sentence with the highest global idf value is selected as a keyword.
Based on the above description of how the search question is obtained, this embodiment may define three strategies for synthesizing a new sentence to be used as the search question. The three strategies are:
Syntactic analysis synthesis strategy 1: when the sentence currently input by the user contains neither a subject-predicate structure nor a verb-object structure, and the history sentence input earlier contains a subject-predicate structure or a verb-object structure, the subject or object in the history sentence is used as the subject of the currently input sentence.
Syntactic analysis synthesis strategy 2: when the sentence currently input by the user contains neither a subject-predicate structure nor a verb-object structure, and the history sentence input earlier contains a subject-predicate structure or a verb-object structure, a corresponding keyword is extracted from the sentence according to the part-of-speech types of its phrases, and the target phrase in the history sentence whose part-of-speech type is the same as that of the keyword is replaced with the keyword.
For example, the keyword is extracted from the current sentence in descending priority order of person name, place name, organization name, the noun with the highest tf-idf value, and verb, and the corresponding phrase in the history sentence is then replaced. Specifically:
Case 1: if a person name, place name, or organization name is extracted as the keyword, the corresponding entity of the same type is looked up directly in the previous query and replaced.
Case 2: if a noun is extracted, its similarity to the candidate subject or object is calculated, and the replacement target is selected accordingly.
Case 3: if a verb is extracted, the predicate or verb is replaced.
Statistical synthesis strategy: corresponding keywords are extracted from the history sentence and then spliced to synthesize a new sentence. Specifically, the idf values of the phrases in the history sentence are computed, and the keywords are then selected according to those idf values; for example, the word in the history sentence with the highest global idf value is selected as a keyword.
Referring to Fig. 1b, the sentence synthesis flow of this embodiment is as follows:
1031. Obtain the current sentence input by the user and a history sentence input by the user earlier.
Here, the current sentence is the content the user currently inputs through the robot chat client on the terminal, and the history sentence is content the user previously input through the robot chat client on the terminal.
1032. Perform syntactic analysis on the current sentence and the history sentence separately to obtain syntactic analysis results.
1033. Determine, according to the syntactic analysis results, whether the current sentence contains neither a subject-predicate structure nor a verb-object structure while the history sentence contains a subject-predicate structure or a verb-object structure; if so, perform step 1034; if not, no synthesis is needed and the synthesis flow ends.
1034. Perform sentence synthesis using syntactic analysis synthesis strategy 1.
1035. Determine whether the synthesis succeeded; if not, perform step 1036; if so, the synthesis has succeeded and the synthesis flow ends.
1036. Perform sentence synthesis using syntactic analysis synthesis strategy 2.
1037. Determine whether the synthesis succeeded; if not, perform step 1038; if so, the synthesis has succeeded and the synthesis flow ends.
1038. Perform sentence synthesis using the statistical synthesis strategy.
1039. Determine whether the synthesis succeeded; if not, perform step 1040; if so, the synthesis has succeeded and the synthesis flow ends.
1040. Determine that synthesis is not possible.
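Steps 1034 to 1040 amount to trying the three strategies in order until one succeeds. The sketch below shows only that control flow; the three strategy functions are hypothetical stubs (they do not perform real syntactic analysis or idf statistics), and the check in step 1033 is assumed to have passed already.

```python
from typing import Callable, List, Optional

# A strategy takes (current sentence, history sentence) and returns the
# synthesized sentence, or None when it fails.
Strategy = Callable[[str, str], Optional[str]]


def synthesize(current: str, history: str, strategies: List[Strategy]) -> Optional[str]:
    """Try each strategy in order (steps 1034-1039); None corresponds to step 1040."""
    for strategy in strategies:
        result = strategy(current, history)
        if result is not None:
            return result
    return None


# Hypothetical stubs standing in for the three strategies.
def syntactic_strategy_1(current: str, history: str) -> Optional[str]:
    return None  # stub: pretend subject splicing failed


def syntactic_strategy_2(current: str, history: str) -> Optional[str]:
    # stub: pretend "Beijing" was found as the same-type phrase to replace
    return history.replace("Beijing", current) if "Beijing" in history else None


def statistical_strategy(current: str, history: str) -> Optional[str]:
    return history + " " + current  # stub: naive splice


result = synthesize(
    "Shanghai",
    "go to Beijing on business",
    [syntactic_strategy_1, syntactic_strategy_2, statistical_strategy],
)
```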
(2) Determining similar questions:
Here, the similar questions close to the search question are questions that are retrieved through the inverted index and match the question phrases of the search question, such as questions whose phrases are close or similar to those question phrases.
In this embodiment, the inverted index of the questions and their phrases may include index pairs, where each index pair includes an index keyword and its corresponding index item, the index keyword being a phrase of a question and the index item being the question. In this case, a similar question is the question in an index pair that matches a question phrase of the search question; specifically, it is the question in an index pair whose keyword matches a question phrase of the search question. That is, the step of "determining similar questions close to the search question according to the question phrases of the search question and the inverted index" may include:
querying the index pairs for questions that match the question phrases of the search question, to obtain similar questions close to the search question.
For example, word segmentation may be performed on the search question to obtain its question phrases, and the index pairs are then queried for questions that match those question phrases. When the search question has multiple phrases, the index pairs may be queried for questions that match each question phrase, yielding a group of similar questions close to the search question.
For example, after search question A is segmented into phrases a1, a2 ... ai ... an, the index pairs are queried for those whose keyword matches a1, those whose keyword matches a2, and so on through an. The questions in the index pairs whose keyword matches ai are then taken as similar questions close to search question A, which yields a series of similar questions. For example, when each question phrase corresponds to a different question, the similar questions Q1, Q2 ... Qn are obtained.
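The lookup can be sketched as a union over the index items of every question phrase. The index contents, the whitespace segmentation, and the exact-match rule for phrases are assumptions made for the illustration.

```python
from typing import Dict, Set


def find_similar_questions(search_question: str, index: Dict[str, Set[str]]) -> Set[str]:
    """Collect every indexed question that shares at least one phrase with the search question."""
    similar: Set[str] = set()
    for phrase in search_question.lower().split():  # assumption: whitespace segmentation
        similar |= index.get(phrase, set())
    return similar


# Toy inverted index: phrase -> questions containing it.
index = {
    "cook": {"how to cook rice"},
    "fry": {"how to fry rice"},
    "rice": {"how to cook rice", "how to fry rice"},
}
similar = find_similar_questions("cook noodles", index)
```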
104. Obtain candidate answers to the search question according to the similar questions and the question-answer pairs, to obtain a candidate answer set of the search question.
After multiple similar questions have been obtained, the answer corresponding to each similar question can be looked up in the question-answer pairs. This yields multiple candidate answers, which form the candidate answer set.
Specifically, the matching question-answer pairs, whose question matches a similar question, can be determined among the question-answer pairs, and the answers in those matching pairs are then taken as candidate answers to the search question. That is, the answer in any question-answer pair whose question matches a similar question can be used as a candidate answer to the search question; for example, the answer in a question-answer pair whose question is similar or identical to a similar question may be used as a candidate answer.
For example, when the question-answer pairs are (Q1, A1), (Q2, A2) ... (Qi, Ai) ... (Qn, An) and the similar questions are Q1, Q2 ... Qi, the answer corresponding to each similar question can be obtained from the question-answer pairs, giving the candidate answer set {A1, A2 ... Ai}.
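As a small sketch of this step, the candidate set can be collected by keeping the answers of the pairs whose question is among the similar questions. Exact string equality is used for "matches" here, which is an assumption; the embodiment also allows similar rather than identical questions.

```python
from typing import List, Set, Tuple


def collect_candidates(similar_questions: Set[str],
                       qa_pairs: List[Tuple[str, str]]) -> List[str]:
    """Return the answers of all question-answer pairs whose question is a similar question."""
    return [answer for question, answer in qa_pairs if question in similar_questions]


qa_pairs = [("Q1", "A1"), ("Q2", "A2"), ("Q3", "A3")]
candidates = collect_candidates({"Q1", "Q2"}, qa_pairs)
```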
105. Select a target answer to the search question from the candidate answer set.
For example, one or more answers may be selected from the candidate answer set {A1, A2 ... Ai} as the final answer to the search question.
In practice, the candidate answers may be scored and then sorted by score to obtain a sorted candidate answer set, from which the final answer is selected. That is, the step of "selecting a target answer to the search question from the candidate answer set" may include:
scoring the candidate answers in the candidate answer set to obtain the score of each candidate answer;
sorting the candidate answers in the candidate answer set according to their scores to obtain a sorted candidate answer set;
selecting the target answer to the search question from the sorted candidate answer set.
The candidate answers may be sorted in various ways; for example, they may be sorted by score from high to low, or by score from low to high.
In this embodiment, the candidate answer ranked first or ranked last may be selected as the target answer to the search question. The specific manner may be set according to actual requirements.
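The score-sort-select steps can be sketched generically as below. The scoring function is passed in as a parameter because the scoring method is left open at this point; the toy score table and the function names are illustrative assumptions.

```python
from typing import Callable, List, Optional


def rank_candidates(candidates: List[str],
                    score: Callable[[str], float],
                    descending: bool = True) -> List[str]:
    """Sort candidate answers by score, high-to-low by default."""
    return sorted(candidates, key=score, reverse=descending)


def pick_target(ranked: List[str]) -> Optional[str]:
    """Take the first answer of the sorted set (None if there are no candidates)."""
    return ranked[0] if ranked else None


# Toy scores standing in for a real scoring model.
toy_scores = {"A1": 0.2, "A2": 0.9, "A3": 0.5}
ranked = rank_candidates(list(toy_scores), toy_scores.get)
target = pick_target(ranked)
```

With descending=False the best answer would instead sit at the end of the list, which corresponds to selecting the last-ranked candidate.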
In order to improve the correlation of output answer and search problem, to promote the quality of answer, the present embodiment may be usedFollowing manner chooses the target answer of search problem:
(1), based on the similarity between answer and problem:
Specifically, the sentence similarity information between candidate answers and search problem in candidate answers set can be obtained;According to the sentence similarity information in candidate answers set between candidate answers and search problem, selected from the candidate answers setTake the target answer of the search problem.
Sentence similarity information characterizes the similarity between two sentences. This embodiment may use a vector space model to compute sentence similarity, in which case the sentence similarity information includes the vector similarity between sentence vectors. The vector similarity between sentence vectors can be measured by the cosine of the angle between the vectors (i.e., cosine similarity) or by the distance between the vectors (such as Euclidean distance or Manhattan distance). That is, the vector similarity between sentence vectors may include the cosine similarity between the sentence vectors, the distance between the sentence vectors, and so on.
That is, the step "obtain sentence similarity information between the candidate answers in the candidate answer set and the retrieval question" may include:
Obtaining the answer sentence vector corresponding to each candidate answer in the candidate answer set and the question sentence vector corresponding to the retrieval question;
Obtaining the vector similarity between the answer sentence vector and the question sentence vector.
In this case, the step "choose the target answer of the retrieval question from the candidate answer set according to the sentence similarity information between the candidate answers and the retrieval question" may include: choosing the target answer of the retrieval question from the candidate answer set according to the vector similarity between the answer sentence vector and the question sentence vector.
This embodiment may obtain the vector similarity, such as the cosine similarity, between the answer sentence vector of each candidate answer in the candidate answer set and the question sentence vector, and then choose the final reply to the retrieval question from the candidate answer set based on these vector similarities.
For example, the candidate answers in the candidate answer set can be scored according to the vector similarity between their answer sentence vectors and the question sentence vector, then sorted by score to obtain a sorted candidate answer set, from which the target answer of the retrieval question is chosen.
In this embodiment, sentence vectors can be obtained in several ways. To form accurate sentence vectors, and thus compute vector similarity accurately, this embodiment may obtain sentence vectors as follows:
(1-1) Obtaining sentence vectors based on word vectors:
Specifically, the step "obtain the answer sentence vector corresponding to each candidate answer in the candidate answer set and the question sentence vector corresponding to the retrieval question" may include:
Obtaining the word vectors corresponding to the answer words of a candidate answer in the candidate answer set, and obtaining the answer sentence vector of the candidate answer from those word vectors;
Obtaining the word vectors corresponding to the question words of the retrieval question, and obtaining the question sentence vector of the retrieval question from those word vectors.
Word vectors can be obtained by training on data. For example, this embodiment may use the word2vec tool to train word vectors on a preset number of question-answer pairs (QA pairs). Specifically, a preset number (such as 100,000,000) of question-answer pairs can be chosen as training data; word vector training is then performed on the answer words of the candidate answers based on the training data to obtain the word vectors corresponding to the answer words, and on the question words of the retrieval question to obtain the word vectors corresponding to the question words. The question-answer pairs may be those formed from the social data on the social platform.
In practice, the dimension of the word vectors can be preset when training them, so that word vectors of that dimension are formed and sentence vectors of the same dimension can then be obtained. That is, word vector training can be performed on the answer words based on the training data and the preset vector dimension, and on the question words based on the training data and the preset vector dimension.
After the word vectors are obtained, this embodiment may form a sentence vector by vector addition; preferably, the sentence vector is obtained as a weighted sum of the word vectors. For example, given word vectors W1, W2, W3, ..., Wn of the same dimension, the sentence vector can be computed by the formula W1*X1 + W2*X2 + W3*X3 + ... + Wi*Xi + ... + Wn*Xn = S, where S is the sentence vector and Xi is the weight corresponding to word vector Wi.
For example, referring to Fig. 1c, the word vectors include the 100-dimensional word vectors W1, W2, W3, W4, W5 and W6; a 100-dimensional sentence vector can then be formed as the weighted sum of W1 to W6, by the formula W1*X1 + W2*X2 + W3*X3 + W4*X4 + W5*X5 + W6*X6 = S.
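The weighted sum S = W1*X1 + ... + Wn*Xn can be sketched in a few lines. The function name `weighted_sentence_vector`, the toy 3-dimensional vectors and the weights are assumptions made for illustration; the text does not say how the weights Xi are chosen.

```python
from typing import List, Sequence


def weighted_sentence_vector(
    word_vectors: Sequence[Sequence[float]],
    weights: Sequence[float],
) -> List[float]:
    """Return S = sum_i Wi * Xi for word vectors Wi and scalar weights Xi."""
    if not word_vectors:
        raise ValueError("at least one word vector is required")
    if len(word_vectors) != len(weights):
        raise ValueError("one weight per word vector is required")
    dim = len(word_vectors[0])
    if any(len(vec) != dim for vec in word_vectors):
        raise ValueError("all word vectors must share the same dimension")
    sentence = [0.0] * dim
    for vec, weight in zip(word_vectors, weights):
        for d, value in enumerate(vec):
            sentence[d] += weight * value
    return sentence


# Toy 3-dimensional word vectors and hypothetical weights.
W = [[1.0, 0.0, 2.0], [0.0, 1.0, 1.0], [2.0, 2.0, 0.0]]
X = [0.5, 0.25, 0.25]
S = weighted_sentence_vector(W, X)
```

The sentence vector has the same dimension as the word vectors, so 100-dimensional word vectors yield a 100-dimensional sentence vector, as in the Fig. 1c example.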
(1-2) Obtaining sentence vectors based on a convolutional neural network:
Since the manner of obtaining sentence vectors in (1-1) above may lose word-order information and thus yield inaccurate sentence vectors, this embodiment may use a convolutional neural network model to obtain sentence vectors in order to improve their accuracy.
Specifically, the step "obtain the answer sentence vector corresponding to each candidate answer in the candidate answer set and the question sentence vector corresponding to the retrieval question" may include:
Representing the retrieval question as a corresponding question matrix, and performing convolution processing on the question matrix based on a convolutional neural network model to obtain the question sentence vector corresponding to the retrieval question;
Representing each candidate answer in the candidate answer set as a corresponding answer matrix, and performing convolution processing on the answer matrix based on the convolutional neural network model to obtain the answer sentence vector corresponding to the candidate answer.
In this embodiment, the matrices can be obtained from word vectors. That is, the step "represent the retrieval question as a corresponding question matrix" may include: obtaining the word vectors corresponding to the question words of the retrieval question, and then obtaining the question matrix corresponding to the retrieval question from those word vectors.
The step "represent each candidate answer in the candidate answer set as a corresponding answer matrix" may include: obtaining the word vectors corresponding to the answer words of the candidate answer, and then obtaining the answer matrix corresponding to the candidate answer from those word vectors.
For example, if the word vector of every word is known and has 100 dimensions, and a sentence is assumed to contain at most 50 words, a 50*100 sentence matrix can be formed.
The word vectors can be obtained by training on sample data, for example by training word vectors on a preset number of question-answer pairs.
Preferably, this embodiment may use multiple different convolution kernels to perform convolution operations on a matrix to obtain the corresponding sentence vector. For example, each of the different convolution kernels can be applied to the matrix to obtain the convolution result corresponding to that kernel, and the sentence vector is then built from the convolution results of the different kernels.
That is, the step "perform convolution processing on the question matrix based on a convolutional neural network model to obtain the question sentence vector corresponding to the retrieval question" may include:
Performing convolution operations on the question matrix with multiple different convolution kernels respectively, to obtain the convolution results corresponding to the different convolution kernels;
Building the question sentence vector corresponding to the retrieval question from the convolution results corresponding to the different convolution kernels.
The step "perform convolution processing on the answer matrix based on the convolutional neural network model to obtain the answer sentence vector corresponding to the candidate answer" may include:
Performing convolution operations on the answer matrix with multiple different convolution kernels respectively, to obtain the convolution results corresponding to the different convolution kernels;
Building the answer sentence vector corresponding to the candidate answer from the convolution results corresponding to the different convolution kernels.
After obtaining the convolution results corresponding to the different convolution kernels, this embodiment may apply pooling to the convolution result of each kernel to obtain a feature value for that kernel, and then build the corresponding sentence vector, i.e., the question sentence vector of the retrieval question or the answer sentence vector of a candidate answer, from the feature values of all the kernels.
For example, a sentence forms a 50*100 matrix S, and the convolution kernels are weight matrices of sizes 1*100, 2*100, 3*100 and 5*100, with 500 kernels of each size. Features are extracted from the sentence matrix through convolution, a non-linear transformation and a pooling operation, finally producing a vectorized representation of the sentence (a vector of length 2000). Referring to Fig. 1d, which shows the convolution processing applied to a 100*6 question matrix and to an answer matrix, the question sentence vector can be extracted from the question matrix and the answer sentence vector from the answer matrix based on the convolutional neural network model, after which the cosine similarity between the question sentence vector and the answer sentence vector is computed.
Taking a convolution kernel that is a 3*100 weight matrix M as an example, with the sentence forming a 50*100 matrix S, the convolution slides the 3*100 window M over the matrix S, covering (50-3+1)=48 positions in total. At each position, the convolution of the weight matrix M with the part of the sentence matrix covered by the window is computed (the dark rectangle in Fig. 1d). The whole sliding process therefore produces 48 results for a kernel of size 3; kernels of other sizes produce different numbers of results: 50 for size 1, 49 for size 2 and 46 for size 5. Max pooling is applied to these results, and since there are 2000 convolution kernels in total, a 2000-dimensional abstract representation of the sentence, i.e., the sentence vector, is generated.
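The convolution and max-pooling procedure can be sketched as below. Several details are assumptions made for illustration: the kernels are random and untrained, tanh stands in for the unspecified non-linear transformation, and only 4 kernels per size are used instead of 500 so that the pure-Python demo runs quickly. All function names are hypothetical.

```python
import math
import random
from typing import Dict, List

Matrix = List[List[float]]


def conv_max_pool(sentence: Matrix, kernel: Matrix) -> float:
    """Slide one kernel (h rows, full embedding width) over the sentence
    matrix, apply tanh to each response, and max-pool over all positions."""
    n_words, height = len(sentence), len(kernel)
    if height > n_words:
        raise ValueError("kernel is taller than the sentence matrix")
    responses = []
    for start in range(n_words - height + 1):
        total = 0.0
        for k in range(height):
            total += sum(a * b for a, b in zip(sentence[start + k], kernel[k]))
        responses.append(math.tanh(total))  # assumed non-linearity
    return max(responses)


def encode_sentence(sentence: Matrix, kernels: List[Matrix]) -> List[float]:
    """One pooled feature value per kernel forms the sentence vector."""
    return [conv_max_pool(sentence, kernel) for kernel in kernels]


def random_matrix(rows: int, cols: int, rng: random.Random) -> Matrix:
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
N_WORDS, DIM = 50, 100          # 50*100 sentence matrix, as in the text
KERNEL_SIZES = (1, 2, 3, 5)     # kernel heights named in the text
KERNELS_PER_SIZE = 4            # the text uses 500; reduced for the demo

kernels = [random_matrix(h, DIM, rng) for h in KERNEL_SIZES for _ in range(KERNELS_PER_SIZE)]
sentence_matrix = random_matrix(N_WORDS, DIM, rng)
sentence_vec = encode_sentence(sentence_matrix, kernels)

# Number of sliding-window positions per kernel size: n_words - h + 1.
window_counts: Dict[int, int] = {h: N_WORDS - h + 1 for h in KERNEL_SIZES}
```

With 500 kernels per size the same code would produce the 2000-dimensional sentence vector described above; `window_counts` reproduces the 50, 49, 48 and 46 positions quoted for sizes 1, 2, 3 and 5.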
After obtaining the answer sentence vector and the question sentence vector, the method of this embodiment can obtain the vector similarity between the two sentence vectors, such as the cosine similarity. Cosine similarity evaluates the similarity of two vectors by computing the cosine of the angle between them. The smaller the angle, the closer the cosine is to 1, the more closely the directions of the vectors agree, and the more similar they are.
For example, given vectors A = (A1, A2, ..., An) and B = (B1, B2, ..., Bn), the cosine similarity between A and B can be calculated by the following formula:
$$\cos\theta = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2}\,\sqrt{\sum_{i=1}^{n} B_i^2}}$$
For example, the cosine similarity between the answer sentence vector of each candidate answer in the candidate answer set and the question sentence vector of the retrieval question can be computed; the candidate answers are then scored based on the cosine similarity (for example, the larger the cosine, the higher the score), sorted according to their scores (for example, from high to low), and the target answer of the retrieval question is chosen from the sorted set.
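A minimal sketch of scoring candidate answers by cosine similarity follows. The function names and the toy 2-dimensional vectors are assumptions; returning 0.0 for a zero vector is a convention chosen here, not something the text specifies.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between vectors a and b."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # assumed convention for zero vectors
    return dot / (norm_a * norm_b)


def rank_by_cosine(
    question_vec: Sequence[float],
    answer_vecs: Dict[str, Sequence[float]],
) -> Tuple[List[str], Dict[str, float]]:
    """Score each candidate answer and sort from high to low."""
    scores = {name: cosine_similarity(question_vec, vec) for name, vec in answer_vecs.items()}
    ranking = sorted(scores, key=scores.__getitem__, reverse=True)
    return ranking, scores


# Toy vectors, for illustration only.
question_vec = [1.0, 0.0]
answer_vecs = {"A1": [2.0, 0.0], "A2": [0.0, 3.0], "A3": [1.0, 1.0]}
ranking, scores = rank_by_cosine(question_vec, answer_vecs)
```

Here A1 points in the same direction as the question vector and A2 is orthogonal to it, so A1 ranks first and A2 last.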
(2) Based on word co-occurrence statistics:
This embodiment may count the number of pairwise co-occurrences between the words in an answer and the words in a question, together with the number of occurrences of each word in questions, and then choose the answer to the retrieval question based on these statistics. That is, the step "choose the target answer of the retrieval question from the candidate answer set" may include:
Obtaining the question occurrence count of each question word in the retrieval question and the question-answer co-occurrence count of each answer word in a candidate answer;
Choosing the target answer of the retrieval question from the candidate answer set according to the question occurrence counts of the question words in the retrieval question and the question-answer co-occurrence counts of the answer words in the candidate answers.
The question occurrence count is the number of times a question word occurs in the questions of the question-answer pairs, i.e., the number of question-answer pairs whose question contains the question word.
The question-answer co-occurrence count is the number of pairwise co-occurrences, within question-answer pairs, of an answer word of the answer and a question word of the retrieval question, i.e., the number of question-answer pairs whose question contains the question word and whose answer contains the answer word.
For example, the words obtained after segmenting question Q are q1, q2, ..., qi, ..., qn. If the questions of 800 question-answer pairs contain q1, the question occurrence count of q1 is 800; if the questions of 789 question-answer pairs contain q2, the question occurrence count of q2 is 789; and if the questions of m question-answer pairs contain qi, the question occurrence count of qi is m. The question occurrence count of every question word can be obtained in this way.
As another example, the words obtained after segmenting question Q are q1, q2, ..., qi, ..., qn, and the words obtained after segmenting candidate answer A are a1, a2, ..., ai, ..., aj. If there are k question-answer pairs whose question contains qi and whose answer contains ai, the co-occurrence count of ai and qi is k.
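Both statistics can be collected in one pass over the segmented question-answer pairs. The sketch below is an illustration under stated assumptions: the function name `count_statistics` and the toy English pairs are hypothetical, and each question-answer pair is counted at most once per word, following the definition "number of question-answer pairs whose question contains the word".

```python
from collections import Counter
from typing import Iterable, List, Tuple


def count_statistics(
    qa_pairs: Iterable[Tuple[List[str], List[str]]],
) -> Tuple[Counter, Counter]:
    """qa_pairs holds (question_words, answer_words), already segmented.

    Returns (question_count, cooccurrence_count), where
    question_count[q] is the number of pairs whose question contains q and
    cooccurrence_count[(q, a)] is the number of pairs whose question
    contains q and whose answer contains a.
    """
    question_count: Counter = Counter()
    cooccurrence_count: Counter = Counter()
    for question_words, answer_words in qa_pairs:
        # Sets ensure a pair is counted once even if a word repeats.
        question_set, answer_set = set(question_words), set(answer_words)
        for q in question_set:
            question_count[q] += 1
            for a in answer_set:
                cooccurrence_count[(q, a)] += 1
    return question_count, cooccurrence_count


# Toy segmented question-answer pairs, for illustration only.
toy_pairs = [
    (["you", "eat", "meal"], ["eat", "already"]),
    (["you", "busy"], ["not", "busy"]),
    (["eat", "meal"], ["eat", "later"]),
]
question_count, cooccurrence_count = count_statistics(toy_pairs)
```

The ratio `cooccurrence_count[(q, a)] / question_count[q]` is the quantity used later to derive the target answer probability.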
Referring to Fig. 1e, word occurrence counts can be collected for retrieval question Q, candidate answer A1 and candidate answer A2, namely the pairwise co-occurrence counts between the words in the candidate answers and the words in the retrieval question, and the occurrence count of each word in the retrieval question. As the table in Fig. 1e shows, when "have a meal" appears in the question, "eat" appears in the good answer and "electricity" in the bad answer; the statistics show that "eat" co-occurs with the four words of the question more often than "electricity" does.
After obtaining the question occurrence counts of the question words in the retrieval question and the question-answer co-occurrence counts of the answer words in the candidate answers, this embodiment can distinguish good candidate answers from bad ones based on these counts, so as to choose the best answer to the retrieval question.
For example, the candidate answers in the candidate answer set can be scored according to the question occurrence counts of the question words in the retrieval question and the question-answer co-occurrence counts of the answer words in the candidate answers; the candidate answers are then sorted by score, and finally the answer to the retrieval question is chosen from the sorted candidate answer set.
Preferably, after collecting the word occurrence counts, this embodiment may compute the ratio between the question-answer co-occurrence count and the question occurrence count, obtain from this ratio the probability that each candidate answer is the target answer, and finally choose the target answer based on the probability. That is, the step "choose the target answer of the retrieval question from the candidate answer set according to the question occurrence counts of the question words in the retrieval question and the question-answer co-occurrence counts of the answer words in the candidate answers" may include:
Obtaining the ratio of the question-answer co-occurrence count of each answer word in a candidate answer to the question occurrence count of the corresponding question word in the retrieval question;
Obtaining the target answer probability of the candidate answer from the ratios, the target answer probability being the probability that the answer is the target answer of the retrieval question;
Choosing the target answer of the retrieval question from the candidate answer set according to the target answer probabilities of the answers.
For example, the words obtained after segmenting question Q are q1, q2, ..., qi, ..., qn, and the words obtained after segmenting each candidate answer A are a1, a2, ..., ai, ..., aj. The question occurrence count Count(qi) of qi, i.e., the number of question-answer pairs in which qi occurs, and the question-answer co-occurrence count Count(qi, ai) of (qi, ai), i.e., the number of question-answer pairs in which qi and ai appear together, can then be collected, giving the ratio Count(qi, ai)/Count(qi). Assuming that segmenting candidate answer Ai yields a1, a2, ..., am, with m less than or equal to j, the target answer probability of Ai is the sum, over every answer word and every question word, of the corresponding ratio multiplied by 1/t:
$$\mathrm{Score}(Q, A_i) = \frac{1}{t} \sum_{k=1}^{m} \sum_{l=1}^{n} \frac{\mathrm{Count}(q_l, a_k)}{\mathrm{Count}(q_l)}$$
where t is the sum of the numbers of words in question Q and in the candidate answer. For example, if segmenting question Q yields z words and segmenting answer A yields c words, then t = z + c.
Referring to Fig. 1e, Score(Q, A1) and Score(Q, A2) can be computed as follows:
Score(Q, A1) = 1/8*Count(you, eat)/Count(you) + 1/8*Count(have a meal, eat)/Count(have a meal) + 1/8*Count(, eat)/Count() + 1/8*Count(, eat)/Count() + 1/8*Count(you, do not have)/Count(you) + 1/8*Count(have a meal, do not have)/Count(have a meal) + 1/8*Count(, do not have)/Count() + 1/8*Count(, do not have)/Count() = 0.055
Score(Q, A2) = 1/8*Count(you, electricity)/Count(you) + 1/8*Count(have a meal, electricity)/Count(have a meal) + 1/8*Count(, electricity)/Count() + 1/8*Count(, electricity)/Count() + 1/8*Count(you, do not have)/Count(you) + 1/8*Count(have a meal, do not have)/Count(have a meal) + 1/8*Count(, do not have)/Count() + 1/8*Count(, do not have)/Count() = 0.017
As can be seen, Score(Q, A1) > Score(Q, A2), so A1 is a more suitable answer to Q than A2.
After the target answer probability Score of each candidate answer A is obtained, each candidate answer in the candidate answer set can be scored based on its Score (for example, the larger the Score, the higher the score); the candidate answers are then sorted, and the target answer is chosen from the sorted set, for example the candidate answer ranked first.
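The target answer probability can be sketched as a double loop over answer words and question words. The function name, the toy counts and the English words below are assumptions for illustration. The text defines t as the number of question words plus the number of answer words, whereas the worked example uses a factor of 1/8 for a four-word question and a two-word answer, so the sketch accepts t as an optional argument and falls back to the stated definition.

```python
from typing import Dict, Optional, Sequence, Tuple


def target_answer_score(
    question_words: Sequence[str],
    answer_words: Sequence[str],
    question_count: Dict[str, int],
    cooccurrence_count: Dict[Tuple[str, str], int],
    t: Optional[int] = None,
) -> float:
    """Sum Count(q, a) / Count(q) over all (q, a) pairs, scaled by 1/t."""
    if t is None:
        # Definition given in the text: t = z + c.
        t = len(question_words) + len(answer_words)
    score = 0.0
    for a in answer_words:
        for q in question_words:
            occurrences = question_count.get(q, 0)
            if occurrences:  # skip question words never seen in any pair
                score += cooccurrence_count.get((q, a), 0) / occurrences / t
    return score


# Hypothetical counts, for illustration only.
toy_question_count = {"you": 10, "eat": 5}
toy_cooccurrence = {("you", "rice"): 2, ("eat", "rice"): 4, ("you", "power"): 1}
Q = ["you", "eat"]
score_a1 = target_answer_score(Q, ["rice"], toy_question_count, toy_cooccurrence)
score_a2 = target_answer_score(Q, ["power"], toy_question_count, toy_cooccurrence)
```

The answer whose words co-occur more often with the question words receives the larger score, which mirrors the comparison of Score(Q, A1) and Score(Q, A2) above.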
As can be seen from the above, the embodiment of the present invention forms multiple question-answer pairs based on the social data on a social platform, each question-answer pair including a question and its corresponding answer; establishes an inverted index of the questions and their words; obtains a retrieval question and determines similar questions close to the retrieval question according to the question words of the retrieval question and the inverted index; obtains candidate answers to the retrieval question according to the similar questions and the question-answer pairs, yielding a candidate answer set; and chooses the target answer of the retrieval question from the candidate answer set. This scheme first finds the questions similar to the retrieval question, then looks up the answers corresponding to those similar questions, and chooses the most suitable answer among them. It can therefore output an answer that matches the retrieval question, improving the accuracy and quality of the answers output by a chat robot system.
Embodiment Two
Based on the method described in Embodiment One, a further detailed description is given below by way of example.
This embodiment describes the automatic question-answering method provided by the present invention, taking as an example the case where the automatic question-answering device is integrated in a server.
Referring to Fig. 2a, an embodiment of the present invention provides an automatic question-answering system including a server and a terminal; the automatic question-answering device is integrated in the server, and the server and the terminal are connected through a network.
As shown in Fig. 2b, the detailed flow of an automatic question-answering method may be as follows:
201, The server obtains the social data on the social platform and performs data filtering on the social data.
A social platform is a platform on which users share information such as personal information, moods and impressions, for example a social platform based on instant messaging. The social platform may also include a chat robot or question-answering system; that is, a chat robot or question-answering system may itself be a social platform for exchanging social information.
The social data on the social platform are the social information data exchanged by users on the platform, and may include the UGC (User Generated Content) on the social platform.
In this embodiment, the server may filter out some word data according to a vocabulary filtering rule, which can be set according to actual requirements. For example, named-entity data such as person names, place names and organization names can be filtered out; private data such as phone numbers, instant messaging accounts and financial accounts (such as bank card numbers) can be filtered out; and uncivil vocabulary such as swear words can be filtered out.
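A very rough sketch of such a filter is given below. Everything in it is an assumption made for illustration: the regular expressions for phone and card numbers, the placeholder blocked-word list, and the choice to drop whole messages rather than individual words. Named-entity filtering is omitted because it would require a separate recognition tool.

```python
import re
from typing import Iterable, List

# Assumed patterns: 11-digit mobile numbers and 16 to 19 digit card numbers.
PHONE_PATTERN = re.compile(r"\b\d{11}\b")
CARD_PATTERN = re.compile(r"\b\d{16,19}\b")
# Placeholder vocabulary; a real list would be configured per deployment.
BLOCKED_WORDS = {"badword"}


def keep_message(text: str) -> bool:
    """Return False if the message contains private data or blocked words."""
    if PHONE_PATTERN.search(text) or CARD_PATTERN.search(text):
        return False
    return not any(word in BLOCKED_WORDS for word in text.lower().split())


def filter_social_data(messages: Iterable[str]) -> List[str]:
    """Keep only the messages that pass the vocabulary filtering rule."""
    return [message for message in messages if keep_message(message)]


raw_messages = [
    "have you eaten",
    "call me at 13800000000",
    "my card is 6222020000000000",
    "you badword",
]
clean_messages = filter_social_data(raw_messages)
```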
202, The server forms multiple question-answer pairs from the filtered social data and stores the question-answer pairs.
A question-answer pair (QA pair) is a pair of social data items, such as text data, consisting of one question and one answer.
203, The server segments the questions in the question-answer pairs and establishes an inverted index of the questions and their words.
The inverted index may include multiple index entries, also called index pairs. Each index pair includes an index keyword and its corresponding index item, where the index keyword may be a word of a question and the index item may be the question that the word belongs to.
For example, the server may perform word segmentation on the questions in the question-answer pairs to obtain the words of each question, and establish index pairs from the questions and their words; each index pair includes an index keyword, which is a word of a question, and its corresponding index item, which may be that question.
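Building the inverted index can be sketched as a mapping from each word to the questions that contain it. The function name `build_inverted_index` is hypothetical, and whitespace splitting stands in for the word segmentation step, which for Chinese text would need a real segmenter.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List


def build_inverted_index(
    questions: Iterable[str],
    segment: Callable[[str], List[str]] = str.split,
) -> Dict[str, List[str]]:
    """Map each word (index keyword) to the questions containing it (index items)."""
    index: Dict[str, List[str]] = defaultdict(list)
    for question in questions:
        # dict.fromkeys drops repeated words while keeping their order.
        for word in dict.fromkeys(segment(question)):
            if question not in index[word]:
                index[word].append(question)
    return dict(index)


# Toy English questions; whitespace splitting is a stand-in for segmentation.
questions = ["have you eaten", "what have you eaten today", "where are you"]
inverted_index = build_inverted_index(questions)
```

Each key of `inverted_index` plays the role of an index keyword and each list element the role of an index item.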
204, The terminal sends the message content currently input by the user to the server.
For example, referring to Fig. 2c, after the user opens the robot chat application, the user clicks the pet on the application interface to enter the dialog interface with the robot and can input message content in the input box of the dialog interface. After the user enters the content and clicks the send button, the terminal sends the message content currently input by the user to the server, so that the server returns the feedback message content for it, i.e., the answer. For example, if the user inputs "you have had a meal" in the dialog input box, the terminal sends this content to the server.
205, The server obtains the retrieval question according to the message content currently input by the user.
For example, the server may use the sentence in the message content as the retrieval question; "you have had a meal" can be used directly as retrieval question Q.
As another example, to improve the accuracy and quality of the answer, this embodiment may perform syntactic analysis on the sentences in the message content to obtain an accurate retrieval question. Specifically, the server may obtain the current sentence in the user's current input and the history sentences in the user's earlier input, and then perform syntactic analysis on the current sentence and the history sentences respectively, to determine whether the current sentence contains a subject-predicate or verb-object structure and whether the history sentences contain a subject-predicate or verb-object structure.
If the current sentence contains neither a subject-predicate nor a verb-object structure while a history sentence contains one, a new sentence can be synthesized using the synthesis strategy introduced above and used as the retrieval question. For the specific sentence synthesis, refer to the synthesis process of Embodiment One, which is not repeated here.
206, The server looks up, based on the inverted index, the questions that match the question words of the retrieval question, and obtains a group of similar questions close to the retrieval question.
The similar questions close to the retrieval question are the questions retrieved from the inverted index that match the question words of the retrieval question, such as questions that are close to, similar to or the same as those words.
The inverted index of the questions and their words may include index pairs, each including an index keyword, which is a word of a question, and its corresponding index item, which may be that question. In this case, word segmentation can be performed on the retrieval question to obtain its question words, and the questions matching those question words are then queried in the index pairs. When the retrieval question has multiple words, the questions matching each question word can be queried in the index pairs, which yields a group of similar questions close to the retrieval question.
For example, words q1, q2 and q3 are obtained after word segmentation of question Q, and the inverted index includes the index pairs (q1, Q1), (q2, Q2), (q3, Q3), .... The similar questions of question Q can then be determined from the inverted index to be Q1, Q2 and Q3.
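The lookup can be sketched as a union of the index items found for each question word. The function name `find_similar_questions` and the dictionary layout of the index are assumptions; the sketch treats a match as an exact keyword hit, which is only one possible reading of "matching".

```python
from typing import Dict, List, Sequence


def find_similar_questions(
    question_words: Sequence[str],
    inverted_index: Dict[str, List[str]],
) -> List[str]:
    """Collect, without duplicates, every indexed question that shares
    at least one word with the retrieval question."""
    seen = set()
    similar: List[str] = []
    for word in question_words:
        for question in inverted_index.get(word, []):
            if question not in seen:
                seen.add(question)
                similar.append(question)
    return similar


# Index pairs in the spirit of the example: (q1, Q1), (q2, Q2), (q3, Q3).
toy_index = {"q1": ["Q1"], "q2": ["Q2", "Q1"], "q3": ["Q3"]}
similar_questions = find_similar_questions(["q1", "q2", "q3"], toy_index)
```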
207, The server queries the question-answer pairs for the answers that match the similar questions and uses the matching answers as candidate answers to the retrieval question, obtaining the candidate answer set of the retrieval question.
Specifically, the server may determine the matching question-answer pairs, i.e., those whose question matches a similar question, and then use the answers in the matching question-answer pairs as candidate answers to the retrieval question.
For example, the answers to similar questions Q1, Q2 and Q3 are determined from the question-answer pairs. Assuming the question-answer pairs include (Q1, A1), (Q2, A2), (Q3, A3), ..., the candidate answer set {A1, A2, A3} of retrieval question Q is obtained.
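Collecting the candidate answer set then amounts to filtering the stored question-answer pairs by the similar questions. The function name `collect_candidate_answers` and the list-of-tuples storage are assumptions made for illustration.

```python
from typing import Iterable, List, Sequence, Tuple


def collect_candidate_answers(
    similar_questions: Sequence[str],
    qa_pairs: Iterable[Tuple[str, str]],
) -> List[str]:
    """Return the answers of all pairs whose question is a similar question."""
    wanted = set(similar_questions)
    candidates: List[str] = []
    for question, answer in qa_pairs:
        # Keep each distinct answer once, in the order it is found.
        if question in wanted and answer not in candidates:
            candidates.append(answer)
    return candidates


stored_pairs = [("Q1", "A1"), ("Q2", "A2"), ("Q3", "A3"), ("Q4", "A4")]
candidate_answers = collect_candidate_answers(["Q1", "Q2", "Q3"], stored_pairs)
```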
208, The server scores the candidate answers in the candidate answer set to obtain the score of each candidate answer.
For example, the server may score each answer in the candidate answer set {A1, A2, A3}. On a 100-point scale, suppose A1, A2 and A3 score 80, 76 and 79 points respectively.
The candidate answers can be scored in several ways, including for example:
(1) Based on the similarity between the answer and the question:
Specifically, the server may obtain sentence similarity information between each candidate answer in the candidate answer set and the retrieval question, and score the candidate answers according to this sentence similarity information.
Sentence similarity information characterizes the similarity between two sentences. This embodiment may use a vector space model to compute sentence similarity, in which case the sentence similarity information includes the vector similarity between sentence vectors. The vector similarity between sentence vectors can be measured by the cosine of the angle between the vectors (i.e., cosine similarity) or by the distance between the vectors (such as Euclidean distance or Manhattan distance). That is, the vector similarity between sentence vectors may include the cosine similarity between the sentence vectors, the distance between the sentence vectors, and so on.
For example, the server may obtain the answer sentence vector corresponding to each candidate answer in the candidate answer set and the question sentence vector corresponding to the retrieval question, obtain the vector similarity between the answer sentence vector and the question sentence vector, and then score the candidate answers according to that vector similarity. In practice, the vector similarity between sentences can be computed using the Word2vec algorithm.
The specific scoring rule can be set according to actual requirements. For example, when the vector similarity is the cosine similarity, the larger the cosine between the answer vector and the question vector, the higher the score given to the answer.
The sentence vectors may be obtained based on word vectors, for example by obtaining the word vectors corresponding to the words of a sentence and obtaining the sentence vector of the sentence from those word vectors.
In addition, to improve the accuracy of the sentence vectors, they may also be obtained based on a convolutional neural network, for example by representing a sentence, such as the retrieval question, as a corresponding sentence matrix and performing convolution processing on the sentence matrix based on a convolutional neural network model and multiple different convolution kernels to obtain the sentence vector corresponding to the sentence.
For details of obtaining the word vectors, the sentence vectors and the vector similarity, refer to the detailed description in Embodiment One.
(2) Based on word co-occurrence statistics:
For example, the number of times that words in the answer and words in the question co-occur pairwise, and the number of times that each word in the question occurs, may be counted, and the candidate answers are then scored based on the statistical results.
Specifically, the question occurrence count of the question phrases in the search question and the question-answer co-occurrence count of the answer phrases in the candidate answer are obtained, and the candidate answer is scored according to these two counts.
Preferably, after counting the occurrences, the server of this embodiment may calculate the ratio of the question-answer co-occurrence count to the question occurrence count, obtain from this ratio the probability that each candidate answer is the target answer, and finally score the candidate answers based on that probability.
In practice, the larger the target answer probability of a candidate answer, the higher the score given to that candidate answer.
Specifically, for the acquisition of the occurrence counts, the count ratio, and the target answer probability, refer to the detailed description in Embodiment One.
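The co-occurrence scoring described above can be sketched roughly as below. The sketch assumes QA pairs are already segmented into lists of phrases, and it aggregates the per-pair ratios into a single probability by taking their mean. That aggregation is an assumption made here for illustration only; the actual formula is the one given in Embodiment One. All function names are hypothetical.

```python
from collections import Counter
from itertools import product


def build_counts(qa_pairs):
    """Count, over all QA pairs, how many questions contain each phrase and
    how many pairs contain a given (question phrase, answer phrase) combination.

    qa_pairs is a list of (question_phrases, answer_phrases) tuples.
    """
    occurrence = Counter()
    cooccurrence = Counter()
    for q_phrases, a_phrases in qa_pairs:
        q_set, a_set = set(q_phrases), set(a_phrases)
        for q in q_set:
            occurrence[q] += 1
        for q, a in product(q_set, a_set):
            cooccurrence[(q, a)] += 1
    return occurrence, cooccurrence


def cooccurrence_score(q_phrases, a_phrases, occurrence, cooccurrence):
    """Mean of co-occurrence count / occurrence count over all phrase pairs.

    Taking the mean is an illustrative assumption, not the patent's formula.
    """
    ratios = []
    for q, a in product(set(q_phrases), set(a_phrases)):
        if occurrence[q] > 0:
            ratios.append(cooccurrence[(q, a)] / occurrence[q])
    return sum(ratios) / len(ratios) if ratios else 0.0
```

A candidate answer whose phrases frequently appear in answers to questions containing the search question's phrases receives a higher score under this sketch.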
209. The server sorts the candidate answers in the candidate answer set according to their scores.
For example, the server may sort the candidate answers in descending order of score. For instance, based on the scores of A1, A2, and A3, the sorted candidate answer set is {A1, A3, A2}.
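The sorting step can be sketched in a few lines. The score values below are hypothetical and chosen only so that the result reproduces the ordering {A1, A3, A2} from the example; the function name is likewise an assumption.

```python
def rank_candidates(scores, descending=True):
    """Return candidate answer identifiers ordered by score.

    scores maps a candidate identifier to its numeric score.
    """
    ordered = sorted(scores.items(), key=lambda item: item[1], reverse=descending)
    return [candidate for candidate, _ in ordered]


# Hypothetical scores for the three candidates in the example.
ranked = rank_candidates({"A1": 0.9, "A2": 0.2, "A3": 0.5})
print(ranked)
```

When sorting in descending order the target answer is the first element of the list; when sorting in ascending order it is the last.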
210. The server selects the target answer to the search question from the sorted candidate answer set and sends that answer to the terminal.
Specifically, the server may select the first-ranked candidate answer in the set as the best answer to the search question. In some other embodiments, the last-ranked candidate answer may instead be selected as the best answer, for example when the set is sorted in ascending order of score.
For example, the server may select A1 from the sorted candidate answer set {A1, A3, A2} as the target answer, i.e., the best answer, to search question Q, and then send A1 to the terminal. On receiving A1, the terminal may display it in the robot chat interface, thereby realizing the robot chat.
For example, referring to Figure 2d, when the message content that the user inputs in the input box of the robot chat interface is "Have you had a meal?", the terminal sends the message content to the server, and the server takes the message content as search question Q. Then, by querying the similar questions of search question Q and their answers, the server obtains the candidate answer list {A1="not eating", A2="not eating electricity", A3="not eating fruit"}. The server may score each candidate answer in the candidate answer list and sort the list accordingly, obtaining the sorted candidate answer list {A1="not eating", A3="not eating fruit", A2="not eating electricity"}. The server may then select the first-ranked candidate answer A1 from the candidate answer list as the best answer to search question Q and send candidate answer A1 to the terminal; referring to Figure 2d, the terminal displays A1="not eating" in the robot chat interface.
As can be seen from the above, the embodiment of the present invention can form multiple question-answer pairs based on social data on a social platform, each question-answer pair including a question and its corresponding answer; establish an inverted index of the questions and their phrases; obtain a search question, and determine similar questions close to the search question according to the question phrases of the search question and the inverted index; obtain candidate answers to the search question according to the similar questions and the question-answer pairs, yielding a candidate answer set for the search question; score the candidate answers in the candidate answer set based on a sentence vector similarity algorithm, a similarity algorithm based on a convolutional neural network, or word co-occurrence statistics; sort the candidate answers based on their scores; and finally select the target answer to the search question from the sorted candidate answer set. This scheme can output an answer that matches the search question, improving the accuracy and quality of the answers output by the chat robot system.
Embodiment Three
The method described in Embodiments One and Two is described in further detail below by way of example.
An embodiment of the present invention provides an automatic question-answering system. Referring to Figure 3a, the automatic question-answering system may include an online retrieval system, an offline processing system, and a ranking system.
The automatic question-answering system implements chat as follows:
The offline processing system obtains massive social data (such as UGC data) from the social platform, performs data cleaning (i.e., data filtering) on the massive social data, then disassembles the social data into question-answer (QA) pairs and stores the QA pairs in the online retrieval system. For example, the offline processing system may filter out named entities such as person names, place names, and organization names; filter out private data such as telephone numbers, instant messaging numbers, and bank card numbers; and filter out uncivil vocabulary such as profanity. Specifically, for QA pair formation and data cleaning, refer to the detailed description in Embodiment One.
The online retrieval system indexes the massive QA pairs that have been entered and constructs an inverted index to facilitate retrieval, so that the questions closest to the question Q input by the user can be found by retrieval. Specifically, the questions in the QA pairs are segmented into words, and an inverted index between the questions and their words is established; in this inverted index, the keyword is a phrase of a question and the indexed object is the question. Specifically, for establishing the inverted index and querying similar questions, refer to the detailed description in Embodiment One.
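The indexing and retrieval described above can be sketched as follows. This is a minimal illustration under stated assumptions: whitespace splitting stands in for real word segmentation, questions are identified by their position in the input list, and similar questions are ranked by the number of phrases they share with the search question. The ranking criterion and all names are assumptions for illustration, not the method of Embodiment One.

```python
from collections import defaultdict


def build_inverted_index(questions, tokenize=str.split):
    """Map each phrase (index keyword) to the set of question ids containing it."""
    index = defaultdict(set)
    for qid, text in enumerate(questions):
        for phrase in tokenize(text):
            index[phrase].add(qid)
    return index


def find_similar_questions(index, search_question, tokenize=str.split, top_k=3):
    """Return ids of indexed questions sharing the most phrases with the query."""
    hits = {}
    for phrase in set(tokenize(search_question)):
        for qid in index.get(phrase, ()):
            hits[qid] = hits.get(qid, 0) + 1
    # Most shared phrases first; ties broken by question id for determinism.
    ranked = sorted(hits.items(), key=lambda item: (-item[1], item[0]))
    return [qid for qid, _ in ranked[:top_k]]
```

Because lookup goes from phrase to question, only questions that share at least one phrase with the search question are ever examined, which is what makes retrieval over a large QA collection practical.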
The online retrieval system receives the message content input by the user and sent by the terminal, and then obtains the search question Q input by the user based on the message content. It queries the massive QA pairs for the answers to the questions closest to the question Q input by the user, uses them as candidate answers to search question Q, and obtains a candidate answer list. The online retrieval system outputs the candidate answer list and the search question Q input by the user to the ranking system.
The ranking system includes a scoring module and a sorting module, where the scoring module is configured to score the candidate answers in the candidate answer list. Referring to Figure 3b, the scoring module may score the candidate answers based on the following three algorithms:
(1) Word co-occurrence statistics:
The number of times that words in the answer and words in the question co-occur pairwise, and the number of times that each word in the question occurs, are counted, and the answer to the search question is then selected based on the statistical results. Specifically, the question occurrence count of the question phrases in the search question and the question-answer co-occurrence count of the answer phrases in the candidate answer may be obtained; the ratio of the question-answer co-occurrence count to the question occurrence count is calculated; the probability that each candidate answer is the target answer is then obtained based on this ratio; and finally the candidate answers are scored based on the probability.
Here, the question occurrence count is the number of times that a question phrase occurs in the QA pairs, that is, the number of QA pairs whose question contains the question phrase.
The question-answer co-occurrence count is the number of times that an answer phrase in the answer and a question phrase in the search question co-occur pairwise in the QA pairs, that is, the number of QA pairs whose question contains the question phrase and whose answer contains the answer phrase.
(2) Word2vec vector similarity calculation:
The word2vec tool is used to train, on a preset number of question-answer pairs (QA pairs), the word vectors of the phrases in the answers and questions. The answer sentence vector corresponding to a candidate answer is obtained from the word vectors corresponding to the answer phrases, and the question sentence vector corresponding to the search question is obtained from the word vectors corresponding to the question phrases. The cosine similarity between the answer sentence vector and the question sentence vector is calculated, and the candidate answer is scored based on the cosine similarity between its answer sentence vector and the question sentence vector.
For example, after the word vectors are obtained, the sentence vector may be formed by adding the vectors; preferably, the sentence vector may be obtained by a weighted sum of the word vectors. For instance, given word vectors W1, W2, W3, ..., Wn of the same dimension, the sentence vector can be obtained by the formula S = W1*X1 + W2*X2 + W3*X3 + ... + Wn*Xn, where S is the sentence vector and Xi is the weight corresponding to word vector Wi.
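The weighted sum S = W1*X1 + ... + Wn*Xn can be sketched as below, assuming word vectors are lists of floats of the same dimension. How the weights Xi are chosen is not specified in this passage, so they are simply passed in; with no weights given, the sketch falls back to plain vector addition. The function name is hypothetical.

```python
def sentence_vector(word_vectors, weights=None):
    """Combine word vectors W1..Wn into a sentence vector S = sum(Xi * Wi).

    If weights is None, every Xi is 1.0, which reduces to plain addition.
    """
    if not word_vectors:
        raise ValueError("at least one word vector is required")
    dim = len(word_vectors[0])
    if weights is None:
        weights = [1.0] * len(word_vectors)
    if len(weights) != len(word_vectors):
        raise ValueError("one weight is required per word vector")
    result = [0.0] * dim
    for vector, weight in zip(word_vectors, weights):
        if len(vector) != dim:
            raise ValueError("all word vectors must have the same dimension")
        for i, component in enumerate(vector):
            result[i] += weight * component
    return result
```

Because the sum is commutative, the resulting sentence vector is the same for any ordering of the words, which is the loss of word-order information that motivates the convolutional approach.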
(3) Similarity calculation based on a convolutional neural network:
Because a word2vec weighted sentence vector loses word order information, this embodiment may use a convolutional neural network model and capture the order of words by setting convolution kernels of different sizes. Specifically, a sentence is expressed as a corresponding matrix, and the matrix is convolved using the convolutional neural network model with multiple different convolution kernels to obtain the sentence vector corresponding to the sentence. For example, the answer matrix and the question matrix are each convolved using the convolutional neural network model with multiple different convolution kernels, yielding the answer sentence vector corresponding to the candidate answer and the question sentence vector corresponding to the search question. After the sentence vectors are obtained, the vector similarity between the answer sentence vector and the question sentence vector may be calculated, and the candidate answer is then scored according to that vector similarity.
The specific scoring rule may be set according to actual requirements. For example, when the vector similarity is cosine similarity, the larger the cosine value between the answer vector and the question vector, the higher the score given to the answer.
Referring to Figure 3b, after each candidate answer has been scored, the sorting module may sort the answers in the candidate answer list based on the score of each candidate answer and output the sorted candidate answer list.
After the sorted candidate answer list is output, the automatic question-answering system may select the candidate answer at the corresponding position in the list as the final answer to search question Q. For example, the first candidate answer in the list is selected as the final answer to search question Q.
As can be seen from the above, the automatic question-answering system provided by the embodiment of the present invention can form multiple question-answer pairs based on social data on a social platform, each question-answer pair including a question and its corresponding answer; establish an inverted index of the questions and their phrases; obtain a search question, and determine similar questions close to the search question according to the question phrases of the search question and the inverted index; obtain candidate answers to the search question according to the similar questions and the question-answer pairs, yielding a candidate answer set for the search question; score the candidate answers in the candidate answer set based on a sentence vector similarity algorithm, a similarity algorithm based on a convolutional neural network, or word co-occurrence statistics; sort the candidate answers based on their scores; and finally select the target answer to the search question from the sorted candidate answer set. This scheme can output an answer that matches the search question, improving the accuracy and quality of the answers output by the chat robot system.
Embodiment Four
To better implement the above method, an embodiment of the present invention further provides an automatic question-answering device. As shown in Figure 4a, the automatic question-answering device may include a question-answer pair forming unit 401, an index establishing unit 402, a question acquiring unit 403, a candidate answer acquiring unit 404, and an answer selecting unit 405, as follows:
(1) Question-answer pair forming unit 401;
The question-answer pair forming unit 401 is configured to form multiple question-answer pairs based on social data on a social platform, each question-answer pair including a question and its corresponding answer.
Here, a social platform refers to a platform that allows users to share information about themselves, their moods, their impressions, and the like, for example a social platform based on instant messaging.
The social data on the social platform is the social information data of user interaction on the social platform, and may include the UGC (User Generated Content) on the social platform. For example, the social data may include content (such as text content) published by a user on the social platform, and comment information or reply content posted by other users on the social platform in response to that content.
In this embodiment, a question-answer pair (QA pair) is a pair of social data items, such as text data, in a question-and-reply relationship. The question-answer pair may specifically include a question and its corresponding answer.
To reduce the processing load and improve answer quality, the question-answer pair forming unit 401 may also perform data filtering on the social data to obtain filtered social data, and disassemble the filtered social data into question-answer pairs to form multiple question-answer pairs.
For example, the question-answer pair forming unit 401 may first segment the social data into words and then filter out certain phrase data according to a vocabulary filtering principle. For instance, it may filter out named entity data such as person names, place names, and organization names; filter out private data such as telephone numbers, instant messaging accounts, and financial accounts (such as bank card numbers); and filter out uncivil vocabulary data such as profanity.
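The vocabulary filtering described above might look roughly like the sketch below. It assumes the text has already been segmented into tokens, that named entities and uncivil words are supplied as lookup sets (a real system would more likely rely on a named-entity recognizer), and that any token consisting of seven or more digits is treated as a telephone number or account number. These rules, the threshold, and all names are illustrative assumptions only.

```python
import re

# Assumption: a run of 7 or more digits is treated as a phone or account number.
NUMERIC_ID = re.compile(r"\d{7,}")


def filter_tokens(tokens, named_entities=frozenset(), uncivil_words=frozenset()):
    """Drop tokens that are named entities, uncivil words, or numeric identifiers."""
    kept = []
    for token in tokens:
        if token in named_entities:
            continue  # person, place, or organization name
        if token.lower() in uncivil_words:
            continue  # profanity and similar vocabulary
        if NUMERIC_ID.fullmatch(token):
            continue  # telephone number, messaging account, bank card number
        kept.append(token)
    return kept
```

Filtering before question-answer disassembly keeps private and low-quality data out of the QA collection, so it can never surface in an answer.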
(2) Index establishing unit 402;
The index establishing unit 402 is configured to establish the inverted index of the questions and their phrases.
In practical applications, an inverted index is used when records need to be looked up by attribute value. Each entry in such an index table contains an attribute value and the addresses of the records that have that attribute value. Because the position of a record is determined by the attribute value, rather than the attribute value being determined by the record, it is called an inverted index. A file with an inverted index is called an inverted index file, or inverted file for short.
In this embodiment, the inverted index of the questions and their phrases may be used to determine a question by the phrases in the question. The inverted index may include multiple index entries or index pairs, each of which includes an index keyword and its corresponding index item, where the index keyword may be a phrase in a question and the index item may be the question corresponding to that phrase. Therefore, establishing the inverted index of the questions and their phrases in this embodiment means establishing index pairs or index entries that characterize the correspondence between phrases and questions.
That is, the index establishing unit 402 may be specifically configured to perform word segmentation on the questions in the question-answer pairs to obtain the phrases of each question, and to establish index pairs according to the phrases of the question and the question. Each index pair includes an index keyword and its corresponding index item, where the index keyword is a phrase of the question and the index item may be the question.
(3) Question acquiring unit 403;
The question acquiring unit 403 is configured to obtain a search question, and to determine similar questions close to the search question according to the question phrases of the search question and the inverted index.
Referring to Figure 4b, the question acquiring unit 403 may include a search question obtaining subunit 4031 and a similar question obtaining subunit 4032;
The search question obtaining subunit 4031 is configured to:
Obtain the sentence currently input by the user and a history sentence input by the user before that sentence;
When the sentence contains no subject-predicate or verb-object structure and the history sentence contains a subject-predicate or verb-object structure, use the subject or object in the history sentence as the subject of the sentence to synthesize a new sentence;
Use the new sentence as the search question;
The similar question obtaining subunit 4032 is configured to determine similar questions close to the search question according to the question phrases of the search question and the inverted index.
Here, the history sentence is a sentence input earlier by the user; for example, the history sentence may be the sentence that the user input last time.
The search question obtaining subunit 4031 may be further configured to:
When synthesizing the new sentence fails, extract a corresponding keyword from the sentence according to the part-of-speech type of the phrases in the sentence, replace the target phrase in the history sentence whose part-of-speech type is the same as that of the keyword with the keyword to obtain a replaced sentence, and use the replaced sentence as the search question.
As another example, the search question obtaining subunit 4031 may be further configured to, when synthesizing a sentence by keyword replacement fails, synthesize a new sentence based on a statistical synthesis strategy and use it as the search question. For example, corresponding keywords may be extracted from the history sentence and then spliced together to synthesize a new sentence.
(4) Candidate answer acquiring unit 404;
The candidate answer acquiring unit 404 is configured to obtain candidate answers to the search question according to the similar questions and the question-answer pairs, yielding the candidate answer set of the search question.
Specifically, the candidate answer acquiring unit 404 may determine, among the question-answer pairs, the matching question-answer pairs whose question matches a similar question, and then use the answers in the matching question-answer pairs as candidate answers to the search question. That is, the answers in the question-answer pairs whose question matches a similar question may be used as the candidate answers to the search question.
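The matching step above can be sketched as follows, assuming QA pairs are stored as (question, answer) string tuples and that "matching" means exact equality with one of the similar questions. Dropping duplicate answers while preserving order is an extra assumption added for illustration; the function name is hypothetical.

```python
def collect_candidate_answers(similar_questions, qa_pairs):
    """Return the answers of all QA pairs whose question is a similar question.

    Duplicate answers are kept only once, in first-seen order (an assumption).
    """
    similar = set(similar_questions)
    seen = set()
    candidates = []
    for question, answer in qa_pairs:
        if question in similar and answer not in seen:
            seen.add(answer)
            candidates.append(answer)
    return candidates
```

The resulting list is the candidate answer set that is subsequently scored and sorted by the answer selecting unit.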
(5) Answer selecting unit 405;
The answer selecting unit 405 is configured to select the target answer to the search question from the candidate answer set.
For example, the answer selecting unit 405 may be configured to score the candidate answers in the candidate answer set to obtain the score of each candidate answer, sort the candidate answers in the candidate answer set according to their scores to obtain a sorted candidate answer set, and select the target answer to the search question from the sorted candidate answer set.
The candidate answers may be sorted in various ways; for example, they may be sorted in descending order of score or, alternatively, in ascending order of score.
To improve the relevance of the output answer to the search question and thereby improve answer quality, this embodiment may select the target answer to the search question in the following manner.
For example, referring to Figure 4c, the answer selecting unit 405 may include:
A similarity obtaining subunit 4051, configured to obtain the sentence similarity information between the candidate answers in the candidate answer set and the search question;
An answer selecting subunit 4052, configured to select the target answer to the search question from the candidate answer set according to the sentence similarity information between the candidate answers in the candidate answer set and the search question.
Here, sentence similarity information is information that characterizes the similarity between two sentences. This embodiment may use a vector space model to calculate sentence similarity, in which case the sentence similarity information includes the vector similarity between sentence vectors. The vector similarity between sentence vectors may be measured by the cosine of the angle between the vectors (i.e., cosine similarity), by the distance between the vectors (e.g., Euclidean distance or Manhattan distance), and so on. That is, the vector similarity between sentence vectors may include the cosine similarity between sentence vectors, the distance between sentence vectors, and the like.
Specifically, the similarity obtaining subunit 4051 is configured to obtain the answer sentence vector corresponding to each candidate answer in the candidate answer set and the question sentence vector corresponding to the search question, and to obtain the vector similarity between the answer sentence vector and the question sentence vector;
The answer selecting subunit 4052 is configured to select the target answer to the search question from the candidate answer set according to the vector similarity between the answer sentence vector and the question sentence vector.
The sentence vectors may be obtained in various ways. For example, the similarity obtaining subunit 4051 may be configured to obtain the word vectors corresponding to the answer phrases of a candidate answer in the candidate answer set and obtain the answer sentence vector corresponding to the candidate answer from those word vectors; and to obtain the word vectors corresponding to the question phrases of the search question and obtain the question sentence vector corresponding to the search question from those word vectors.
As another example, the similarity obtaining subunit 4051 may be configured to:
Express the search question as a corresponding question matrix, and convolve the question matrix based on a convolutional neural network model to obtain the question sentence vector corresponding to the search question;
Express each candidate answer in the candidate answer set as a corresponding answer matrix, and convolve the answer matrix based on the convolutional neural network model to obtain the answer sentence vector corresponding to the candidate answer.
To improve the accuracy of the sentence vector, this embodiment may also use multiple different convolution kernels to perform convolution operations on the matrix to obtain the corresponding sentence vector. For example:
The similarity obtaining subunit 4051 may be configured to:
Perform convolution operations on the question matrix with each of multiple different convolution kernels to obtain the convolution result corresponding to each convolution kernel, and construct the question sentence vector corresponding to the search question from the convolution results corresponding to the different convolution kernels;
Perform convolution operations on the answer matrix with each of the multiple different convolution kernels to obtain the convolution result corresponding to each convolution kernel;
Construct the answer sentence vector corresponding to the candidate answer from the convolution results corresponding to the different convolution kernels.
In this embodiment, after the convolution results corresponding to the different convolution kernels are obtained, pooling may be applied to the convolution result of each convolution kernel to obtain a feature value for that kernel, and the corresponding sentence vector is then constructed from the feature values of the convolution kernels. That is, the question sentence vector corresponding to the search question, or the answer sentence vector corresponding to a candidate answer, is constructed from the feature values obtained by pooling the convolution result of each convolution kernel.
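The convolution and pooling steps above can be sketched in plain Python as follows. The sketch assumes a sentence matrix whose rows are word vectors, kernels that span the full word-vector width and slide over consecutive words, and max pooling as the pooling operation. The passage does not say which pooling is used, so max pooling is an assumption, as are the omission of bias terms and activation functions and all function names.

```python
def convolve(sentence_matrix, kernel):
    """Slide a kernel of height h over consecutive rows of the sentence matrix.

    Each position yields one value: the sum of element-wise products between
    the kernel and the h word vectors it currently covers.
    """
    height = len(kernel)
    outputs = []
    for start in range(len(sentence_matrix) - height + 1):
        total = 0.0
        for offset in range(height):
            row = sentence_matrix[start + offset]
            for x, k in zip(row, kernel[offset]):
                total += x * k
        outputs.append(total)
    return outputs


def cnn_sentence_vector(sentence_matrix, kernels):
    """Build a sentence vector with one max-pooled feature value per kernel."""
    features = []
    for kernel in kernels:
        convolution_result = convolve(sentence_matrix, kernel)
        if not convolution_result:
            raise ValueError("sentence is shorter than the kernel height")
        features.append(max(convolution_result))  # assumed: max pooling
    return features
```

Kernels of different heights cover different numbers of adjacent words, so their features depend on word order in a way that a weighted sum of word vectors does not.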
Referring to Figure 4d, the answer selecting unit 405 in this embodiment may include:
A count obtaining subunit 4053, configured to obtain the question occurrence count of the question phrases in the search question and the question-answer co-occurrence count of the answer phrases in the candidate answer; the question occurrence count is the number of times that a question phrase occurs in the question-answer pairs, and the question-answer co-occurrence count is the number of times that an answer phrase in the candidate answer and a question phrase in the search question co-occur pairwise in the question-answer pairs;
An answer selecting subunit 4054, configured to select the target answer to the search question from the candidate answer set according to the question occurrence count of the question phrases in the search question and the question-answer co-occurrence count of the answer phrases in the candidate answer.
The answer selecting subunit 4054 may be configured to:
Obtain the ratio of the question-answer co-occurrence count of the answer phrases in the candidate answer to the question occurrence count of the question phrases in the search question;
Obtain the target answer probability corresponding to the candidate answer according to the ratio, the target answer probability being the probability that the answer is the target answer to the search question;
Select the target answer to the search question from the candidate answer set according to the target answer probability corresponding to each answer.
In specific implementations, each of the above units may be implemented as an independent entity, or the units may be combined arbitrarily and implemented as one or several entities. For the specific implementation of each of the above units, refer to the foregoing method embodiments; details are not repeated here.
The automatic question-answering device may be implemented by one or more entities, for example integrated in a device such as a server; as another example, the automatic question-answering device may be implemented by an offline processing server, an online processing server, a ranking server, and the like.
As can be seen from the above, in the automatic question-answering device of the embodiment of the present invention, the question-answer pair forming unit 401 forms multiple question-answer pairs based on social data on a social platform, each question-answer pair including a question and its corresponding answer; the index establishing unit 402 then establishes the inverted index of the questions and their phrases; the question acquiring unit 403 obtains a search question and determines similar questions close to the search question according to the question phrases of the search question and the inverted index; the candidate answer acquiring unit 404 obtains candidate answers to the search question according to the similar questions and the question-answer pairs, yielding the candidate answer set of the search question; and the answer selecting unit 405 selects the target answer to the search question from the candidate answer set. This scheme first queries the similar questions close to the search question and the answers corresponding to those similar questions, and then selects the most suitable answer from the answers to the similar questions. Therefore, this scheme can output an answer that matches the search question, improving the accuracy and quality of the answers output by the chat robot system.
A person of ordinary skill in the art will understand that all or some of the steps in the various methods of the above embodiments may be completed by a program instructing the relevant hardware. The program may be stored in a computer-readable storage medium, and the storage medium may include a read-only memory (ROM, Read Only Memory), a random access memory (RAM, Random Access Memory), a magnetic disk, an optical disc, and the like.
The automatic question-answering method and device provided by the embodiments of the present invention have been described in detail above. Specific examples are used herein to explain the principles and implementations of the present invention, and the description of the above embodiments is intended only to help in understanding the method of the present invention and its core idea. Meanwhile, for a person skilled in the art, the specific implementations and the scope of application may vary in accordance with the idea of the present invention. In conclusion, the content of this specification should not be construed as limiting the present invention.