Detailed Description
The general concept of the invention is as follows: a user history dialog log is analyzed in combination with context prior knowledge of a target task, and a plurality of new system state features (first system state features) of the target task are mined; the new system state features are combined with the traditional state features (second system state features) to obtain rich system state features (third system state features); and an action decision model is established using the rich system state features, with labeled data extracted from the user history dialog log serving as training samples, and model training is carried out. In this way, the learning of a fine-grained dialog strategy is guided under a unified framework, and the user experience of the dialog task is enhanced.
A method and apparatus for generating a dialogue action strategy model according to an exemplary embodiment of the present invention will be described in detail with reference to the accompanying drawings.
Fig. 1 is a flowchart illustrating a method of generating a dialogue action policy model according to an exemplary embodiment of the present invention.
Referring to fig. 1, in step S110, a user history dialog log is acquired.
According to an exemplary embodiment of the present invention, the user history dialog log may comprise a number of rounds of dialog initiated in a dialog system to complete a target task.
Fig. 2 illustrates example data of a user history dialog log according to an exemplary embodiment of the present invention. Fig. 2 shows a dialog record of a user performing an airline ticket reservation task.
Referring to fig. 2, the data of the user history dialog log includes, but is not limited to: the date the dialog occurred (e.g., 06-13), the time (e.g., 09:08), the user's identity ID (e.g., USER_04E15FFC$D261B6D2032B6316CBD36F4), the user query words (e.g., "tickets to Nanjing" in the figure), and the system return results (e.g., destination Nanjing, origin Ma'anshan, departure date 2014-…). In practical applications, different user history dialog logs and the corresponding data in the logs can be obtained according to different dialog tasks.
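By way of non-limiting illustration, such log records might be parsed as in the following Python sketch; the tab-separated layout and field names are assumptions made for illustration, since the invention does not prescribe any particular log serialization.

```python
from dataclasses import dataclass

@dataclass
class LogRecord:
    date: str     # date the dialog occurred, e.g. "06-13"
    time: str     # time of day, e.g. "09:08"
    user_id: str  # anonymized user identity ID
    query: str    # user query words
    result: str   # system return results

def parse_log_line(line: str) -> LogRecord:
    # Assumed layout: date, time, user ID, query, result, tab-separated.
    date, time, user_id, query, result = line.rstrip("\n").split("\t", 4)
    return LogRecord(date, time, user_id, query, result)

# Hypothetical record for illustration:
rec = parse_log_line(
    "06-13\t09:08\tUSER_04E15FFC\ttickets to Nanjing\tdestination=Nanjing"
)
```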
In step S120, the user history dialog log is analyzed in combination with the context prior knowledge of the target task, and a plurality of first system state features of the target task are mined.
Specifically, mining the plurality of first system state features of the target task comprises: counting, according to the user history dialog log, the distribution states of a plurality of user behavior features over the preselected features of the target task; and further, verifying the context prior features of the target task according to the distribution states of the user behavior features over the preselected features of the target task, and extracting the plurality of first system state features (new system state features) from the context prior features. The preselected features are dialog state features selected in advance that are closely associated with the performance of the task. For example, in an airline ticket booking task, the preselected features may be "origin", "destination", and the like.
Wherein the plurality of user behavior features include at least one of the following statistical features: the proportion of users completing the target task query, the proportion of users not completing the target task query, the proportion of users continuing the dialog after obtaining the query result, the proportion of users explicitly expressing no intent, and the average number of interactive dialog rounds. Thus, for example, the distribution of the proportion of users who completed the airline reservation query over the origin feature, the distribution of the proportion of users who did not complete the airline reservation query over the origin feature, the proportion of users who continued the dialog (e.g., to book a hotel) after completing the airline reservation, and so forth may be counted.
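By way of non-limiting illustration, counting the distribution of these behavior features over a preselected feature such as "origin" might look as follows; the session fields ("origin", "completed", "continued", "no_intent", "turns") are assumed names, not prescribed by the invention.

```python
from collections import defaultdict

def behavior_distribution(sessions, preselected_feature):
    """Count user behavior features grouped by one preselected feature.

    `sessions` is assumed to be a list of dicts, one per dialog session,
    carrying the behavior outcomes named below (illustrative keys).
    """
    groups = defaultdict(list)
    for s in sessions:
        groups[s[preselected_feature]].append(s)
    stats = {}
    for value, group in groups.items():
        n = len(group)
        stats[value] = {
            "completed_ratio": sum(s["completed"] for s in group) / n,
            "not_completed_ratio": sum(not s["completed"] for s in group) / n,
            "continued_ratio": sum(s["continued"] for s in group) / n,
            "no_intent_ratio": sum(s["no_intent"] for s in group) / n,
            "avg_turns": sum(s["turns"] for s in group) / n,
        }
    return stats

# e.g. behavior_distribution(sessions, "origin") yields, per origin city,
# the distribution of the five behavior features over that feature.
```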
Fig. 3 is an exemplary diagram illustrating context prior features according to an exemplary embodiment of the present invention.
The context prior features include various elements that may affect the effect of the results returned by the system. As shown in fig. 3, these include, for example: time prior information (e.g., the time and date of the user's dialog), region prior information (e.g., the city where the user's dialog takes place, the size type of that city, and whether it is a tourist area or an industrial city), and historical action information (e.g., the number of dialog turns, relative to the current turn, since the system last performed a confirmation, a clarification, or an inquiry during the user's dialog). It can be understood by those skilled in the art that, for a given target task, different factors may affect the dialog effect; for a ticket booking task, for example, the day of the week, the time of day, the origin, the destination, and the number of dialog turns since the last specific system action may all serve as context prior knowledge.
Specifically, regarding the time prior information: for example, a user booking an air ticket late at night tends to submit complete ticket booking requirement information more often than a user booking in the evening, and other user behavior features at the two times also differ significantly. Regarding the region prior information: for example, compared with users booking air tickets to a tourist area, users departing from a tourist area show a higher proportion of completed target task queries, a lower proportion of uncompleted target task queries, and different values for the proportion of continuing the dialog after obtaining the query result, the proportion of explicitly expressing no intent, and the average number of interactive dialog rounds; a probable reason is that users departing from a tourist area generally need to return urgently, whereas users traveling to a tourist area may only browse the results before making a decision. In addition, in the user dialog experience, the frequency of system actions such as information confirmation and clarification affects the user experience and in turn the user behavior data; in summary, the historical action information of the system also needs to be counted.
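A minimal sketch of extracting such context prior features from a session record is given below; the field names, the late-night cutoff, and the placeholder tourist-city set are illustrative assumptions.

```python
from datetime import datetime

TOURIST_CITIES = {"Sanya", "Lijiang"}  # illustrative placeholder set

def context_prior_features(session):
    """Extract time, region, and historical-action prior features.

    `session` is assumed to carry a timestamp, a city descriptor, and
    per-turn system action history; all keys are illustrative.
    """
    ts = datetime.strptime(session["timestamp"], "%m-%d %H:%M")
    return {
        # time prior information
        "hour": ts.hour,
        "is_late_night": ts.hour < 6 or ts.hour >= 23,
        # region prior information
        "city": session["city"],
        "is_tourist_area": session["city"] in TOURIST_CITIES,
        # historical action information: turns since the system last
        # confirmed or clarified, relative to the current turn
        "turns_since_confirm": session["turn"] - session["last_confirm_turn"],
        "turns_since_clarify": session["turn"] - session["last_clarify_turn"],
    }
```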
In step S130, a plurality of preset second system state features of the target task are combined with the plurality of first system state features obtained in step S120 to obtain a plurality of third system state features (i.e., rich system state features, wherein the plurality of third system state features are respectively represented as feature vectors), so as to form a more complete system state vector.
According to an exemplary embodiment of the present invention, in step S130, the plurality of second system state features are conventional system state features (which may be, but are not limited to being, represented by the states of feature attribute slots), such as the filling states and filling types of attribute slots. Specifically, in an airline ticket booking task, for example, such a feature may indicate whether the destination attribute slot is unfilled, is filled (assigned) with an ambiguous value, or is filled with a high-confidence value, and the like.
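For illustration, the slot filling states described above might be encoded as follows; the three-way taxonomy mirrors the example just given (unfilled, ambiguous, high-confidence), while the one-hot encoding is an assumed representation rather than one prescribed by the invention.

```python
from enum import Enum

class SlotState(Enum):
    """Filling state of an attribute slot (a second system state feature)."""
    UNFILLED = 0          # slot not yet assigned
    FILLED_AMBIGUOUS = 1  # assigned, but the value is ambiguous
    FILLED_CONFIDENT = 2  # assigned with a high confidence level

def slot_state_vector(slots):
    """One-hot encode each slot's state into a flat feature vector.

    `slots` maps slot names (e.g. "destination", "origin") to SlotState;
    slots are sorted so dimensions keep a fixed meaning across samples.
    """
    vec = []
    for name in sorted(slots):
        one_hot = [0.0] * len(SlotState)
        one_hot[slots[name].value] = 1.0
        vec.extend(one_hot)
    return vec
```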
Fig. 4 is an exemplary diagram illustrating a third system status feature of an exemplary embodiment of the present invention.
Further, the plurality of preset second system state features of the target task are combined with the plurality of first system state features to obtain the third system state features, which are output in the form of a feature vector whose different dimension ranges represent different meaning types. As shown in fig. 4, the third system state features comprise the preset traditional system state features (second system state features) together with the newly added time prior information, region prior information, and historical action information; the traditional system state features, the time prior information, the region prior information, and the historical action information are each represented as feature vectors. In this way, not only can the logic of the system task be expressed, but a fine-grained dialog strategy that better matches individual user action features can also be described.
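A minimal sketch of this combination, assuming each constituent feature group has already been rendered as a flat vector of floats:

```python
def third_state_vector(slot_vec, time_vec, region_vec, action_vec):
    """Concatenate the second system state features (slot states) with
    the mined first system state features (time, region, and
    historical-action priors) into one rich state vector.

    Concatenation keeps the dimension ranges disjoint, so each range
    retains the meaning type described above.
    """
    return slot_vec + time_vec + region_vec + action_vec
```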
In step S140, labeled data extracted from the user history dialog log is used as training samples, an action decision model is established using the plurality of third system state features obtained in step S130 as training features, model training is performed, and a parameter vector over the third system state features is learned.
Specifically, in step S140, the original system log is formatted into the sample format required by the training model, based on the new system state features (first system state features) and the user behavior features counted from the user history dialog log; in this step, the training samples serve as the input and the action decision serves as the output.
According to a preferred embodiment of the invention, the action decision model is a Markov Decision Process (MDP) based model or a Partially Observable Markov Decision Process (POMDP) based model, and each of the training samples comprises parameter values of a plurality of third system state features, action data and a reward score for labeling the action.
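For illustration, one training sample might be represented as follows; consecutive samples from the same session then supply the state transitions needed by the decision process described next. The field names are assumptions.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class TrainingSample:
    """One labeled sample extracted from the user history dialog log."""
    state: List[float]  # parameter values of the third system state features
    action: str         # action data: the system action taken at this turn
    reward: float       # reward score labeling the action (fitted from user
                        # behavior features, or manually annotated)
```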
Fig. 5 is an exemplary diagram illustrating a Markov decision process based decision model according to an exemplary embodiment of the present invention.
As shown in fig. 5, a Markov Decision Process (MDP) based model is taken as an example for the description. Specifically, one round of the dialog process may be represented as follows: based on the current system state s_1, the system takes action a_1; the external environment gives a reward r_1 for that system action; after the user gives the next query term (i.e., the user requirement), the system enters the next system state s_2; and the above process repeats until the session ends (e.g., with the reward r_3 shown in the figure).
The system state s_i can be described by the new state vector from the dialog log; the system action a_i can be extracted directly from the user history dialog log; and the reward r_i can be obtained by fitting the user behavior features or by means of manual labeling, where i denotes the turn number.
In particular, the Q-value function of the Markov Decision Process (MDP) model or the Partially Observable Markov Decision Process (POMDP) model is estimated by means of function approximation as Q(s, a) = θ^T φ(s, a), where φ(s, a) is a feature function that maps a pair consisting of a system state s and a system action a to a K-dimensional space, and θ is the action decision model. The goal of offline model training is to learn the action decision model θ from the training corpus. The model parameters are learned by temporal-difference computation, continuously iterating the update Q(s_t, a_t) ← Q(s_t, a_t) + α(r_{t+1} + γQ(s_{t+1}, a_{t+1}) − Q(s_t, a_t)), where α is the learning step size, γ is the discount coefficient, and r_{t+1} is the reward given by the external environment for the system action at time t+1, obtained from the user behavior features. An approximately optimal model parameter vector is thereby learned, and the corresponding action decision model is output.
Further, based on the decision model θ obtained by training, the process of obtaining a system action a from the current state s constitutes the system decision. When the system makes an action decision, the feature function φ in Q(s, a) = θ^T φ(s, a) is first used to map the current system state features together with each valid action to the K-dimensional space; the Q function value corresponding to each action is then obtained based on the decision model θ, and the action corresponding to the maximum Q value is output as the system action.
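The training update and the decision step above might be sketched as follows; the feature function φ is stood in for by a seeded random projection, and the step size, discount coefficient, and dimension K are illustrative values, so this is a sketch of the described temporal-difference procedure rather than the invention's implementation.

```python
import numpy as np

K = 128  # dimension of the joint state-action feature space (assumed)

def phi(state, action):
    # Feature function: maps a (state, action) pair to a K-dim vector.
    # A real system uses task-specific features; this stand-in uses a
    # seeded random projection so the mapping is stable within a run.
    seed = abs(hash((tuple(state), action))) % (2**32)
    return np.random.default_rng(seed).standard_normal(K)

def q_value(theta, state, action):
    # Linear function approximation: Q(s, a) = theta^T phi(s, a).
    return float(theta @ phi(state, action))

def td_update(theta, s_t, a_t, r_next, s_next, a_next, alpha=0.05, gamma=0.9):
    # One temporal-difference update following the iterative formula above;
    # for linear Q, the gradient with respect to theta is phi(s_t, a_t).
    delta = (r_next + gamma * q_value(theta, s_next, a_next)
             - q_value(theta, s_t, a_t))
    return theta + alpha * delta * phi(s_t, a_t)

def decide(theta, state, valid_actions):
    # System decision: map the state with every valid action, output argmax Q.
    return max(valid_actions, key=lambda a: q_value(theta, state, a))

# Usage: theta = np.zeros(K); iterate td_update over logged transitions,
# then call decide(theta, current_state, actions) at inference time.
```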
In order to make the technical scheme of the present invention easier to understand, it is further explained below with the following application examples:
scene 1: suppose a user has a conversation with the system in Beijing.
The user: and helping I order a ticket of the economy class for going to III.
The system comprises the following steps: good to help you find an economy class airline ticket tomorrow from beijing to saint, as follows, what are other needs left?
If the dialogue is performed in the scene 1, the user is in a metropolitan city (Beijing) and a destination tourist attraction (III), a radical default assignment dialogue strategy can be learned through model iterative training, a GPS address (Beijing) is automatically used as a starting place, and a starting date is set as tomorrow, so that the aim of rapidly displaying results and screening the results by the user is fulfilled.
Scene 2: suppose a user is conversing with the system at three.
The user: helping me to order an economic cabin air ticket back to Beijing.
The system comprises the following steps: good, to beijing economy class ticket, where do you go? When to want to walk?
The user: tomorrow started from san.
The system comprises the following steps: help you find the result as follows (display result), do you want to find a few tickets?
As shown in the dialog of Scene 2, after statistical learning, the following strategy can be learned: a user who books an air ticket while traveling generally has a clear return plan, so the system can inquire about detailed needs (for example, "Where are you departing from? When do you want to leave?") rather than aggressively assigning default values.
The above takes two application scenarios of region prior information as examples for explanation; for the application scenarios of time prior information and historical action information, fine-grained dialog strategies that best meet the requirements of the target task can likewise be learned through the technical scheme of the present invention.
According to the method for generating a dialog action strategy model provided by the present invention, the context prior features of the dialog are effectively utilized and the user behavior features are analyzed based on the dialog log, so that rich system state features are obtained for establishing and training an action decision model, and the learning of a fine-grained dialog strategy is guided under a unified framework. A more accurate result and a dialog strategy that best meets the requirements of the target task are thereby provided to the user, and the user experience is improved.
Fig. 6 is a logic block diagram illustrating an apparatus for generating a dialogue action policy model according to an exemplary embodiment of the present invention.
Referring to fig. 6, the apparatus for generating a dialogue action policy model according to an exemplary embodiment of the present invention includes a log obtaining unit 610, a state obtaining unit 620, a state combining unit 630, and a decision model generating unit 640.
The log obtaining unit 610 is used for obtaining a user history dialog log.
The state obtaining unit 620 is configured to analyze the user history dialog log in combination with context prior knowledge of a target task, and mine a plurality of first system state features of the target task.
According to a preferred embodiment of the present invention, the state obtaining unit 620 further includes a statistical unit (not shown in the figure) and a state feature extraction unit (not shown in the figure). The statistical unit is used for counting, according to the user history dialog log, the distribution states of a plurality of user behavior features over the preselected features of the target task; the state feature extraction unit is used for verifying the context prior features of the target task according to the distribution states of the user behavior features over the preselected features of the target task, and extracting the plurality of first system state features from the context prior features.
Wherein the plurality of user behavior features include at least one of the following statistical features: the proportion of users completing the target task query, the proportion of users not completing the target task query, the proportion of users continuing the dialog after obtaining the query result, the proportion of users explicitly expressing no intent, and the average number of interactive dialog rounds.
In addition, the context prior characteristics of the target task include time prior information, region prior information, and historical action information.
The state combining unit 630 is configured to combine a plurality of preset second system state features of the target task with the plurality of first system state features to obtain a plurality of third system state features.
According to another preferred embodiment of the present invention, the state combining unit comprises a state feature representing unit (not shown in the figure) for representing the plurality of third system state features as feature vectors, respectively.
The decision model generating unit 640 is configured to use the labeled data extracted from the user history dialog log as training samples, establish an action decision model using the plurality of third system state features as training features, perform model training, and learn the parameter vector over the third system state features.
Further, the action decision model is a Markov Decision Process (MDP) based model or a Partially Observable Markov Decision Process (POMDP) based model, and each of the training samples applied at the decision model generating unit 640 includes parameter values of a plurality of third system state features, action data, and a reward score for labeling the action.
According to the apparatus for generating a dialog action strategy model provided by the present invention, the context prior features of the dialog are effectively utilized and the user behavior features are analyzed based on the dialog log, so that rich system state features are obtained for establishing and training the action decision model, and the learning of a fine-grained dialog strategy is guided under a unified framework. A more accurate result and a dialog strategy that best meets the requirements of the target task are thereby provided to the user, and the user experience is improved.
It should be noted that, according to the implementation requirement, each step described in the present application can be divided into more steps, and two or more steps or partial operations of the steps can be combined into a new step to achieve the purpose of the present invention.
The above-described method according to the present invention can be implemented in hardware or firmware, or as software or computer code that can be stored in a recording medium such as a CD-ROM, a RAM, a floppy disk, a hard disk, or a magneto-optical disk, or as computer code originally stored in a remote recording medium or a non-transitory machine-readable medium and downloaded through a network to be stored in a local recording medium, so that the method described herein can be processed by such software stored on a recording medium using a general-purpose computer, a dedicated processor, or programmable or dedicated hardware such as an ASIC or FPGA. It will be appreciated that the computer, processor, microprocessor controller, or programmable hardware includes memory components (e.g., RAM, ROM, flash memory, etc.) that can store or receive software or computer code that, when accessed and executed by the computer, processor, or hardware, implements the processing methods described herein. Further, when a general-purpose computer accesses code for implementing the processes shown herein, execution of the code transforms the general-purpose computer into a special-purpose computer for performing the processes shown herein.
The above description is only for the specific embodiments of the present invention, but the scope of the present invention is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present invention, and all the changes or substitutions should be covered within the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the appended claims.