Detailed Description
The general concept of the invention is as follows: a user history dialog log is analyzed in combination with context prior knowledge of a target task, and a plurality of new system state features (first system state features) of the target task are mined; the new system state features are combined with the traditional state features (second system state features) to obtain rich system state features (third system state features); and an action decision model is established using the rich system state features, with labeled data extracted from the user history dialog log serving as training samples, and model training is carried out. In this way, the learning of a fine-grained dialog strategy is guided under a unified framework, and the user experience of the dialog task is enhanced.
A method and apparatus for generating a dialogue action strategy model according to an exemplary embodiment of the present invention will be described in detail with reference to the accompanying drawings.
Fig. 1 is a flowchart illustrating a method of generating a dialogue action policy model according to an exemplary embodiment of the present invention.
Referring to fig. 1, in step S110, a user history dialog log is acquired.
According to an exemplary embodiment of the present invention, the user history dialog log may comprise a number of rounds of dialog initiated in a dialog system to complete a target task.
Fig. 2 illustrates example data of a user history dialog log according to an exemplary embodiment of the present invention. Fig. 2 shows a dialog record of a user performing an airline ticket reservation task.
Referring to fig. 2, the data of the user history dialog log includes, but is not limited to: the date the dialog occurred (e.g., 06-13), the time (e.g., 09:08), the user's identity ID (e.g., USER_04E15FFC$D261B6D2032B6316CBD36F4), the user query words (e.g., "tickets to Nanjing" in the figure), and the system return results (e.g., destination Nanjing, origin Ma'anshan, departure date 2014-…). In practical applications, different user history dialog logs and the corresponding data in the logs can be obtained according to different dialog tasks.
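By way of non-limiting illustration, such log records might be parsed as in the following Python sketch; the tab-separated layout and field names are assumptions made for illustration, since the invention does not prescribe any particular log serialization.

```python
from dataclasses import dataclass

@dataclass
class LogRecord:
    date: str     # date the dialog occurred, e.g. "06-13"
    time: str     # time of day, e.g. "09:08"
    user_id: str  # anonymized user identity ID
    query: str    # user query words
    result: str   # system return results

def parse_log_line(line: str) -> LogRecord:
    # Assumed layout: date, time, user ID, query, result, tab-separated.
    date, time, user_id, query, result = line.rstrip("\n").split("\t", 4)
    return LogRecord(date, time, user_id, query, result)

# Hypothetical record for illustration:
rec = parse_log_line(
    "06-13\t09:08\tUSER_04E15FFC\ttickets to Nanjing\tdestination=Nanjing"
)
```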
In step S120, the user history dialog log is analyzed in combination with the context prior knowledge of the target task, and a plurality of first system state features of the target task are mined.
Specifically, mining the plurality of first system state features of the target task comprises: counting, according to the user history dialog log, the distribution states of a plurality of user behavior features over the preselected features of the target task; and further, verifying the context prior features of the target task according to the distribution states of the user behavior features over the preselected features of the target task, and extracting the plurality of first system state features (new system state features) from the context prior features. The preselected features are dialog state features selected in advance that are closely associated with the performance of the task. For example, in an airline ticket booking task, the preselected features may be "origin", "destination", and the like.
Wherein the plurality of user behavior features include at least one of the following statistical features: the proportion of users completing the target task query, the proportion of users not completing the target task query, the proportion of users continuing the dialog after obtaining the query result, the proportion of users explicitly expressing no intent, and the average number of interactive dialog rounds. Thus, for example, the distribution of the proportion of users who completed the airline reservation query over the origin feature, the distribution of the proportion of users who did not complete the airline reservation query over the origin feature, the proportion of users who continued the dialog (e.g., to book a hotel) after completing the airline reservation, and so forth may be counted.
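By way of non-limiting illustration, counting the distribution of these behavior features over a preselected feature such as "origin" might look as follows; the session fields ("origin", "completed", "continued", "no_intent", "turns") are assumed names, not prescribed by the invention.

```python
from collections import defaultdict

def behavior_distribution(sessions, preselected_feature):
    """Count user behavior features grouped by one preselected feature.

    `sessions` is assumed to be a list of dicts, one per dialog session,
    carrying the behavior outcomes named below (illustrative keys).
    """
    groups = defaultdict(list)
    for s in sessions:
        groups[s[preselected_feature]].append(s)
    stats = {}
    for value, group in groups.items():
        n = len(group)
        stats[value] = {
            "completed_ratio": sum(s["completed"] for s in group) / n,
            "not_completed_ratio": sum(not s["completed"] for s in group) / n,
            "continued_ratio": sum(s["continued"] for s in group) / n,
            "no_intent_ratio": sum(s["no_intent"] for s in group) / n,
            "avg_turns": sum(s["turns"] for s in group) / n,
        }
    return stats

# e.g. behavior_distribution(sessions, "origin") yields, per origin city,
# the distribution of the five behavior features over that feature.
```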
Fig. 3 is an exemplary diagram illustrating context prior features according to an exemplary embodiment of the present invention.
The context prior features include various elements that may affect the effect of the results returned by the system. As shown in fig. 3, these include, for example: time prior information (e.g., the time and date of the user's dialog), region prior information (e.g., the city where the user's dialog takes place, the size type of that city, and whether it is a tourist area or an industrial city), and historical action information (e.g., the number of dialog turns, relative to the current turn, since the system last performed a confirmation, a clarification, or an inquiry during the user's dialog). It can be understood by those skilled in the art that, for a given target task, different factors may affect the dialog effect; for a ticket booking task, for example, the day of the week, the time of day, the origin, the destination, and the number of dialog turns since the last specific system action may all serve as context prior knowledge.
Specifically, regarding the time prior information: for example, a user booking an air ticket late at night tends to submit complete ticket booking requirement information more often than a user booking in the evening, and other user behavior features at the two times also differ significantly. Regarding the region prior information: for example, compared with users booking air tickets to a tourist area, users departing from a tourist area show a higher proportion of completed target task queries, a lower proportion of uncompleted target task queries, and different values for the proportion of continuing the dialog after obtaining the query result, the proportion of explicitly expressing no intent, and the average number of interactive dialog rounds; a probable reason is that users departing from a tourist area generally need to return urgently, whereas users traveling to a tourist area may only browse the results before making a decision. In addition, in the user dialog experience, the frequency of system actions such as information confirmation and clarification affects the user experience and in turn the user behavior data; in summary, the historical action information of the system also needs to be counted.
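A minimal sketch of extracting such context prior features from a session record is given below; the field names, the late-night cutoff, and the placeholder tourist-city set are illustrative assumptions.

```python
from datetime import datetime

TOURIST_CITIES = {"Sanya", "Lijiang"}  # illustrative placeholder set

def context_prior_features(session):
    """Extract time, region, and historical-action prior features.

    `session` is assumed to carry a timestamp, a city descriptor, and
    per-turn system action history; all keys are illustrative.
    """
    ts = datetime.strptime(session["timestamp"], "%m-%d %H:%M")
    return {
        # time prior information
        "hour": ts.hour,
        "is_late_night": ts.hour < 6 or ts.hour >= 23,
        # region prior information
        "city": session["city"],
        "is_tourist_area": session["city"] in TOURIST_CITIES,
        # historical action information: turns since the system last
        # confirmed or clarified, relative to the current turn
        "turns_since_confirm": session["turn"] - session["last_confirm_turn"],
        "turns_since_clarify": session["turn"] - session["last_clarify_turn"],
    }
```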
In step S130, a plurality of preset second system state features of the target task are combined with the plurality of first system state features obtained in step S120 to obtain a plurality of third system state features (i.e., rich system state features, wherein the plurality of third system state features are respectively represented as feature vectors), so as to form a more complete system state vector.
According to an exemplary embodiment of the present invention, in step S130, the plurality of second system state features are conventional system state features (which may be, but are not limited to being, represented by the states of feature attribute slots), such as the filling states and filling types of attribute slots. Specifically, in an airline ticket booking task, for example, such a feature may indicate whether the destination attribute slot is unfilled, is filled (assigned) with an ambiguous value, or is filled with a high-confidence value, and the like.
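For illustration, the slot filling states described above might be encoded as follows; the three-way taxonomy mirrors the example just given (unfilled, ambiguous, high-confidence), while the one-hot encoding is an assumed representation rather than one prescribed by the invention.

```python
from enum import Enum

class SlotState(Enum):
    """Filling state of an attribute slot (a second system state feature)."""
    UNFILLED = 0          # slot not yet assigned
    FILLED_AMBIGUOUS = 1  # assigned, but the value is ambiguous
    FILLED_CONFIDENT = 2  # assigned with a high confidence level

def slot_state_vector(slots):
    """One-hot encode each slot's state into a flat feature vector.

    `slots` maps slot names (e.g. "destination", "origin") to SlotState;
    slots are sorted so dimensions keep a fixed meaning across samples.
    """
    vec = []
    for name in sorted(slots):
        one_hot = [0.0] * len(SlotState)
        one_hot[slots[name].value] = 1.0
        vec.extend(one_hot)
    return vec
```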
Fig. 4 is an exemplary diagram illustrating a third system status feature of an exemplary embodiment of the present invention.
Further, the plurality of preset second system state features of the target task are combined with the plurality of first system state features to obtain the third system state features, which are output in the form of a feature vector whose different dimension ranges represent different meaning types. As shown in fig. 4, the third system state features comprise the preset traditional system state features (second system state features) together with the newly added time prior information, region prior information, and historical action information; the traditional system state features, the time prior information, the region prior information, and the historical action information are each represented as feature vectors. In this way, not only can the logic of the system task be expressed, but a fine-grained dialog strategy that better matches individual user action features can also be described.
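A minimal sketch of this combination, assuming each constituent feature group has already been rendered as a flat vector of floats:

```python
def third_state_vector(slot_vec, time_vec, region_vec, action_vec):
    """Concatenate the second system state features (slot states) with
    the mined first system state features (time, region, and
    historical-action priors) into one rich state vector.

    Concatenation keeps the dimension ranges disjoint, so each range
    retains the meaning type described above.
    """
    return slot_vec + time_vec + region_vec + action_vec
```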
In step S140, labeled data extracted from the user history dialog log is used as training samples, an action decision model is established using the plurality of third system state features obtained in step S130 as training features, model training is performed, and a parameter vector over the third system state features is learned.
Specifically, in step S140, the original system log is formatted into the sample format required by the training model, based on the new system state features (first system state features) and the user behavior features counted from the user history dialog log; in this step, the training samples serve as the input and the action decision serves as the output.
According to a preferred embodiment of the invention, the action decision model is a Markov Decision Process (MDP) based model or a Partially Observable Markov Decision Process (POMDP) based model, and each of the training samples comprises parameter values of a plurality of third system state features, action data and a reward score for labeling the action.
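For illustration, one training sample might be represented as follows; consecutive samples from the same session then supply the state transitions needed by the decision process described next. The field names are assumptions.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class TrainingSample:
    """One labeled sample extracted from the user history dialog log."""
    state: List[float]  # parameter values of the third system state features
    action: str         # action data: the system action taken at this turn
    reward: float       # reward score labeling the action (fitted from user
                        # behavior features, or manually annotated)
```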
Fig. 5 is an exemplary diagram illustrating a Markov decision process based decision model according to an exemplary embodiment of the present invention.
As shown in fig. 5, a Markov Decision Process (MDP) based model is taken as an example for the description. Specifically, one round of the dialog process may be represented as follows: based on the current system state s_1, the system takes action a_1; the external environment gives a reward r_1 for that system action; after the user gives the next query term (i.e., the user requirement), the system enters the next system state s_2; and the above process repeats until the session ends (e.g., with the reward r_3 shown in the figure).
The system state s_i can be described by the new state vector from the dialog log; the system action a_i can be extracted directly from the user history dialog log; and the reward r_i can be obtained by fitting the user behavior features or by means of manual labeling, where i denotes the turn number.
In particular, the Q-value function of the Markov Decision Process (MDP) model or the Partially Observable Markov Decision Process (POMDP) model is estimated by means of function approximation as Q(s, a) = θ^T φ(s, a), where φ(s, a) is a feature function that maps a pair consisting of a system state s and a system action a to a K-dimensional space, and θ is the action decision model. The goal of offline model training is to learn the action decision model θ from the training corpus. The model parameters are learned by temporal-difference computation, continuously iterating the update Q(s_t, a_t) ← Q(s_t, a_t) + α(r_{t+1} + γQ(s_{t+1}, a_{t+1}) − Q(s_t, a_t)), where α is the learning step size, γ is the discount coefficient, and r_{t+1} is the reward given by the external environment for the system action at time t+1, obtained from the user behavior features. An approximately optimal model parameter vector is thereby learned, and the corresponding action decision model is output.
Further, based on the decision model θ obtained by training, the process of obtaining a system action a from the current state s constitutes the system decision. When the system makes an action decision, the feature function φ in Q(s, a) = θ^T φ(s, a) is first used to map the current system state features together with each valid action to the K-dimensional space; the Q function value corresponding to each action is then obtained based on the decision model θ, and the action corresponding to the maximum Q value is output as the system action.
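The training update and the decision step above might be sketched as follows; the feature function φ is stood in for by a seeded random projection, and the step size, discount coefficient, and dimension K are illustrative values, so this is a sketch of the described temporal-difference procedure rather than the invention's implementation.

```python
import numpy as np

K = 128  # dimension of the joint state-action feature space (assumed)

def phi(state, action):
    # Feature function: maps a (state, action) pair to a K-dim vector.
    # A real system uses task-specific features; this stand-in uses a
    # seeded random projection so the mapping is stable within a run.
    seed = abs(hash((tuple(state), action))) % (2**32)
    return np.random.default_rng(seed).standard_normal(K)

def q_value(theta, state, action):
    # Linear function approximation: Q(s, a) = theta^T phi(s, a).
    return float(theta @ phi(state, action))

def td_update(theta, s_t, a_t, r_next, s_next, a_next, alpha=0.05, gamma=0.9):
    # One temporal-difference update following the iterative formula above;
    # for linear Q, the gradient with respect to theta is phi(s_t, a_t).
    delta = (r_next + gamma * q_value(theta, s_next, a_next)
             - q_value(theta, s_t, a_t))
    return theta + alpha * delta * phi(s_t, a_t)

def decide(theta, state, valid_actions):
    # System decision: map the state with every valid action, output argmax Q.
    return max(valid_actions, key=lambda a: q_value(theta, state, a))

# Usage: theta = np.zeros(K); iterate td_update over logged transitions,
# then call decide(theta, current_state, actions) at inference time.
```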
In order to make the technical scheme of the present invention easier to understand, it is further explained below with the following application examples:
scene 1: suppose a user has a conversation with the system in Beijing.
The user: and helping I order a ticket of the economy class for going to III.
The system comprises the following steps: good to help you find an economy class airline ticket tomorrow from beijing to saint, as follows, what are other needs left?
If the dialogue is performed in the scene 1, the user is in a metropolitan city (Beijing) and a destination tourist attraction (III), a radical default assignment dialogue strategy can be learned through model iterative training, a GPS address (Beijing) is automatically used as a starting place, and a starting date is set as tomorrow, so that the aim of rapidly displaying results and screening the results by the user is fulfilled.
Scene 2: suppose a user is conversing with the system at three.
The user: helping me to order an economic cabin air ticket back to Beijing.
The system comprises the following steps: good, to beijing economy class ticket, where do you go? When to want to walk?
The user: tomorrow started from san.
The system comprises the following steps: help you find the result as follows (display result), do you want to find a few tickets?
As shown in the dialog of Scene 2, after statistical learning, the following strategy can be learned: a user who books an air ticket while traveling generally has a clear return plan, so the system can inquire about detailed needs (for example, "Where are you departing from? When do you want to leave?") rather than aggressively assigning default values.
The above takes two application scenarios of region prior information as examples for explanation; for the application scenarios of time prior information and historical action information, fine-grained dialog strategies that best meet the requirements of the target task can likewise be learned through the technical scheme of the present invention.
According to the method for generating a dialog action strategy model provided by the present invention, the context prior features of the dialog are effectively utilized and the user behavior features are analyzed based on the dialog log, so that rich system state features are obtained for establishing and training an action decision model, and the learning of a fine-grained dialog strategy is guided under a unified framework. A more accurate result and a dialog strategy that best meets the requirements of the target task are thereby provided to the user, and the user experience is improved.
Fig. 6 is a logic block diagram illustrating an apparatus for generating a dialogue action policy model according to an exemplary embodiment of the present invention.
Referring to fig. 6, the apparatus for generating a dialogue action policy model according to an exemplary embodiment of the present invention includes a log obtaining unit 610, a state obtaining unit 620, a state combining unit 630, and a decision model generating unit 640.
The log obtaining unit 610 is used for obtaining a user history dialog log.
The state obtaining unit 620 is configured to analyze the user history dialog log in combination with context prior knowledge of a target task, and mine a plurality of first system state features of the target task.
According to a preferred embodiment of the present invention, the state obtaining unit 620 further includes a statistical unit (not shown in the figure) and a state feature extraction unit (not shown in the figure). The statistical unit is used for counting, according to the user history dialog log, the distribution states of a plurality of user behavior features over the preselected features of the target task; the state feature extraction unit is used for verifying the context prior features of the target task according to the distribution states of the user behavior features over the preselected features of the target task, and extracting the plurality of first system state features from the context prior features.
Wherein the plurality of user behavior features include at least one of the following statistical features: the proportion of users completing the target task query, the proportion of users not completing the target task query, the proportion of users continuing the dialog after obtaining the query result, the proportion of users explicitly expressing no intent, and the average number of interactive dialog rounds.
In addition, the context prior characteristics of the target task include time prior information, region prior information, and historical action information.
The state combining unit 630 is configured to combine a plurality of preset second system state features of the target task with the plurality of first system state features to obtain a plurality of third system state features.
According to another preferred embodiment of the present invention, the state combining unit comprises a state feature representing unit (not shown in the figure) for representing the plurality of third system state features as feature vectors, respectively.
The decision model generating unit 640 is configured to use the labeled data extracted from the user history dialog log as training samples, establish an action decision model using the plurality of third system state features as training features, perform model training, and learn the parameter vector over the third system state features.
Further, the action decision model is a Markov Decision Process (MDP) based model or a Partially Observable Markov Decision Process (POMDP) based model, and each of the training samples applied at the decision model generating unit 640 includes parameter values of a plurality of third system state features, action data, and a reward score for labeling the action.
According to the apparatus for generating a dialog action strategy model provided by the present invention, the context prior features of the dialog are effectively utilized and the user behavior features are analyzed based on the dialog log, so that rich system state features are obtained for establishing and training the action decision model, and the learning of a fine-grained dialog strategy is guided under a unified framework. A more accurate result and a dialog strategy that best meets the requirements of the target task are thereby provided to the user, and the user experience is improved.
It should be noted that, according to the implementation requirement, each step described in the present application can be divided into more steps, and two or more steps or partial operations of the steps can be combined into a new step to achieve the purpose of the present invention.
The above-described method according to the present invention can be implemented in hardware or firmware, or as software or computer code that can be stored in a recording medium such as a CD-ROM, a RAM, a floppy disk, a hard disk, or a magneto-optical disk, or as computer code originally stored in a remote recording medium or a non-transitory machine-readable medium and downloaded through a network to be stored in a local recording medium, so that the method described herein can be processed by such software stored on a recording medium using a general-purpose computer, a dedicated processor, or programmable or dedicated hardware such as an ASIC or FPGA. It will be appreciated that the computer, processor, microprocessor controller, or programmable hardware includes memory components (e.g., RAM, ROM, flash memory, etc.) that can store or receive software or computer code that, when accessed and executed by the computer, processor, or hardware, implements the processing methods described herein. Further, when a general-purpose computer accesses code for implementing the processes shown herein, execution of the code transforms the general-purpose computer into a special-purpose computer for performing the processes shown herein.
The above description is only for the specific embodiments of the present invention, but the scope of the present invention is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present invention, and all the changes or substitutions should be covered within the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the appended claims.