CN119808879A - A recommendation system optimization method based on user satisfaction - Google Patents

A recommendation system optimization method based on user satisfaction

Info

Publication number
CN119808879A
Authority
CN
China
Prior art keywords
user
model
recommendation system
satisfaction
state
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202510004388.0A
Other languages
Chinese (zh)
Inventor
候亚庆
高一凡
赵梦辰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Dalian University of Technology
Original Assignee
Dalian University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Dalian University of Technology
Priority to CN202510004388.0A
Publication of CN119808879A
Legal status: Pending (current)

Abstract

The invention discloses a recommendation system optimization method based on user satisfaction, and belongs to the technical field of recommendation systems. The invention models the user's decision process as a Markov decision process and assumes that the user always attempts to maximize satisfaction during interactions with the recommendation system. Under this assumption, the user's interaction data set may be regarded as expert behavior data. The invention then provides an inverse-reinforcement-learning-based method to train a user satisfaction model. Finally, the invention designs an auxiliary alignment task so that the recommendation system maximizes user satisfaction during recommendation; this task can be combined with any sequential recommendation model to align the recommendation system with user satisfaction. The invention has strong generality and broad applicability, and can be widely applied to recommendation scenarios such as news, music, and e-commerce.

Description

Recommendation system optimization method based on user satisfaction
Technical Field
The invention belongs to the technical field of recommendation systems, and relates to a recommendation system optimization method based on user satisfaction.
Background
A recommendation system aims to select content that meets a user's interests based on the user's historical behavior. With advances in technology, recommendation systems have made remarkable progress on cold-start problems, diversity guarantees, and improving long-term user engagement, but they still fall clearly short in understanding user behavior and demands. For example, when a user clicks on news related to some topic, the system tends to keep recommending similar news. Such a recommendation strategy is reasonable from a positive-feedback point of view, but in practice the user may have lost interest in the topic after obtaining enough information. This phenomenon suggests that a recommendation strategy depending only on the user's explicit interactions with the system (e.g., clicks and browsing) may deviate severely from the user's real interest preferences.
When a user consumes content recommended by the recommendation system, a subjective feeling toward that content forms in the user's mind, which the present invention defines as user satisfaction. Regarding user satisfaction, the invention makes two observations: 1. user satisfaction directly affects the user's interest distribution and subsequent behavior; for example, when a user clicks on a news item but finds that its content repeats previously read content, satisfaction may be low and the user becomes unwilling to click on similar content; 2. users typically tend to maximize their satisfaction when interacting with the recommender system, which is intuitive, as users prefer content that brings more emotional value or pleasure. Therefore, the recommendation system should not only pay attention to the user's explicit feedback but should also satisfy the user to the greatest extent. However, since satisfaction is a subjective perception of the user, it is generally unknown to the recommender system, and new techniques are needed to help the recommender system align with user satisfaction during its interactions with the user.
In recent years, the natural language processing field has proposed alignment algorithms that guide large language models (LLMs) to generate content more consistent with human values, while research on using alignment algorithms to improve user experience in recommendation systems is still at an early stage. Moreover, the recommendation setting differs significantly from LLMs: for LLMs, sufficient annotated data (e.g., conversation quality) is lacking, so research focuses on how to construct annotations that reflect human values and feed them into model training; in recommendation systems, a large number of interaction trajectories between users and the system are already recorded and can serve as annotation data without manual labeling. Therefore, the key challenge of the recommendation alignment problem is to mine user satisfaction information from the user's interaction trajectories while learning how to align with the user's actual satisfaction.
Existing recommendation algorithms include traditional recommendation algorithms, sequential recommendation algorithms, and reinforcement-learning-based recommendation algorithms, where traditional algorithms include content-based and collaborative-filtering approaches. Traditional algorithms treat each user-item interaction as an isolated event, so they can only mine the user's static preferences and cannot capture dynamic changes in user interest. To address this, sequential recommendation algorithms predict the items the user may be interested in next by considering the user's historical behavior sequence, which significantly improves personalization and dynamic adaptability. Reinforcement-learning-based recommendation algorithms model the recommendation process as a Markov decision process and optimize the recommendation policy with generated reward signals to promote long-term user satisfaction. However, existing algorithms still cannot effectively solve the following two problems:
(1) Existing algorithms can mimic user behavior based on user history data, but cannot understand the real motivation underlying the user behavior. When the user interests change, the predictive performance of existing models tends to drop significantly.
(2) The reward signals used by existing algorithms are generally produced by rule-based, hand-crafted models, and such signals often deviate from the user's real preferences; this can mislead the optimization direction of the recommendation system and produce poor-quality recommendations.
Disclosure of Invention
To address the problem that current recommendation algorithms deviate from user satisfaction, the invention provides a recommendation system optimization method based on user satisfaction.
In the invention, interaction data between the user and the system are first used to learn the motivations and interests behind user behavior, which are modeled as a user satisfaction model; this model is then used to guide the training of the main recommendation model, realizing alignment between the recommendation system and user satisfaction. The most critical issue in this process is how to quantify the satisfaction the user obtains when consuming the recommended content, i.e., how to train the user satisfaction model. Learning the user satisfaction model directly is a significant challenge because satisfaction is hidden behind user behavior. To this end, the invention first models the user's decision process as a Markov decision process (MDP) and assumes that the user always tries to maximize satisfaction during interaction with the recommender system. Under this assumption, the user's interaction data set may be regarded as expert behavior data. The invention then proposes an inverse-reinforcement-learning-based method to mine the user satisfaction model hidden behind the expert behavior data. Finally, the invention designs an auxiliary task that guides the recommendation system to maximize user satisfaction during recommendation; this task can be combined with any sequential recommendation model to align the recommendation system with user satisfaction.
On this basis, the process according to the invention is largely divided into two stages:
(1) User satisfaction model training stage: the user satisfaction model is trained with inverse reinforcement learning. The goal of conventional reinforcement learning is to train an agent policy that maximizes cumulative reward given known environment transition and reward functions, while the goal of inverse reinforcement learning is to derive, from given expert trajectories, a reward function under which the agent is most likely to generate those expert trajectories. In the invention, the user is regarded as the agent, the recommendation system as the environment, and the historical interaction data between the user and the recommendation system as expert policy trajectories. Meanwhile, it is assumed that the user always follows the optimal policy when interacting with the recommender system, i.e., the user always chooses to maximize his or her own reward. Based on this assumption, the invention formalizes the user satisfaction model as the reward model in inverse reinforcement learning and recovers implicit user satisfaction by analyzing the user's interaction history. This process effectively solves the problem of directly quantifying subjective user satisfaction and provides a reliable guidance signal for the subsequent optimization of the recommendation system.
(2) Recommendation system training and optimization stage: the invention mainly considers sequential recommendation models, which take the user's interaction history sequence as input and predict the next item likely to interest the user. Because user satisfaction cannot be quantified directly, the recommendation system cannot maximize it during training, so the system is often biased away from the user's real interests. To solve this problem, the invention designs an auxiliary task that uses the user satisfaction model trained in the first stage to optimize the recommendation system so that it aligns with user interests. Specifically, the invention designs a new training objective that satisfies the original objective of the recommendation system while also maximizing user satisfaction.
In order to achieve the above purpose, the technical scheme adopted by the invention is as follows:
a recommendation system optimization method based on user satisfaction comprises the following specific steps:
Step 1, setting a problem model, and carrying out mathematical modeling:
1.1 Markov decision process modeling. In reinforcement learning or inverse reinforcement learning, the problem is typically modeled as a Markov decision process (MDP) represented by the five-tuple <S, A, p, r, π>, where S is the state space, A the action space, p the environment transition function, r the reward function, and π the policy function. The invention performs MDP modeling from the user's perspective: the user is regarded as the agent and the recommendation system as the environment. The specific modeling is as follows:
State space S: the state s_t ∈ S represents the user's state at time t. The invention defines the user state as s_t = (h_{t-1}, i_t), where h_{t-1} = (σ_1, σ_2, …, σ_{t-1}) is the interaction history up to time t-1, each interaction is σ = <u, i, a>, u is the user feature, i is the item the user interacted with, and a is the user action.
Action space A: a user action a ∈ A represents the user's feedback on the interacted item. In different scenarios user feedback varies, e.g., clicking, purchasing, liking, forwarding. In the invention, to simplify the problem, user feedback is divided into two categories, A = {a_P, a_N}, where a_P denotes positive feedback and a_N denotes negative feedback.
Transition function p: S × A × S → [0, 1]. When the user performs action a_t at time t, the system transitions to a new state s_{t+1} according to the transition function.
Reward function r: modeled from the user's perspective, r quantifies the satisfaction the user experiences after consuming an item recommended by the recommendation system. This function is the primary model to be trained in the invention.
Policy π: the user policy is a mapping from the state space to the action space. The invention assumes that the user follows the optimal policy π* during interaction with the recommender system, i.e., the user always tends to maximize his or her own satisfaction (the reward function).
1.2 Modeling of the recommender alignment problem. The objective of the invention is to optimize the recommendation model so that it aligns with the user satisfaction model, where alignment means maximizing user satisfaction over a complete interaction between the user and the recommendation system. Specifically, the invention uses a reward function r(s, a) to quantify user satisfaction, where s = <h, i> is the user state, h is the user's interaction history, and i is the item the user is currently interacting with, while a is the user action. The goal of alignment is to maximize Σ_t η^t · r(s_t, a_t) during the interaction between the recommender system and the user, where i_t is the item recommended at time t, a_t is the user's action at time t, and η is a decay factor that balances short-term and long-term rewards.
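For illustration, the following minimal Python sketch shows one possible way to represent the interaction σ = <u, i, a>, the user state s_t = (h_{t-1}, i_t), and the discounted alignment objective defined above; all class names, field names, and the episode format are assumptions made for this sketch, not part of the claimed method.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Interaction:
    """One interaction sigma = <u, i, a>."""
    user: int      # user feature / id (u)
    item: int      # item the user interacted with (i)
    action: int    # user action (a): 1 = positive feedback a_P, 0 = negative feedback a_N

@dataclass
class State:
    """User state s_t = (h_{t-1}, i_t)."""
    history: List[Interaction]   # interaction history h_{t-1} = (sigma_1, ..., sigma_{t-1})
    item: int                    # item i_t the user is currently interacting with

def discounted_satisfaction(rewards: List[float], eta: float) -> float:
    """Alignment objective for one episode: sum_t eta^t * r(s_t, a_t)."""
    return sum((eta ** t) * r for t, r in enumerate(rewards))
```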
Step 2, user satisfaction model training
The invention models the user satisfaction model as the reward function of a Markov decision process and solves for this reward function given the user's state sequence and corresponding actions. In this step, the classical inverse reinforcement learning algorithm IQ-Learn is used: it models the user's state-action value Q(s, a) with a soft-Q function and derives an analytical expression of the reward function from the state transition probabilities and the policy distribution, avoiding the complexity of modeling the reward function directly. The specific steps are as follows:
2.1 Data processing: construct an experience pool D. Existing recommender system datasets cannot be used directly to train a user satisfaction model. This step processes the recommendation dataset so that it conforms to the Markov decision process modeling of step 1, facilitating subsequent model training.
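As a sketch of how step 2.1 could be realized, the snippet below turns chronologically ordered per-user interaction logs into (s_t, a_t, s_{t+1}) transitions for the experience pool D, reusing the Interaction and State types from the previous sketch; the log format and function name are assumptions for illustration.

```python
from typing import Dict, List, Tuple

def build_experience_pool(user_logs: Dict[int, List[Interaction]]) -> List[Tuple[State, int, State]]:
    """Turn chronologically ordered per-user logs into (s_t, a_t, s_{t+1}) transitions
    for the experience pool D, following the state definition s_t = (h_{t-1}, i_t)."""
    pool = []
    for user, interactions in user_logs.items():
        for t in range(len(interactions) - 1):
            s_t = State(history=interactions[:t], item=interactions[t].item)
            a_t = interactions[t].action
            s_next = State(history=interactions[:t + 1], item=interactions[t + 1].item)
            pool.append((s_t, a_t, s_next))
    return pool
```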
2.2 Randomly initialize the network parameters w of the agent soft-Q model Q_w(s, a).
2.3 Training the model parameters w using the training data.
2.3.1 Sample m samples {(s_t^j, a_t^j, s_{t+1}^j)}, j = 1, …, m, from the experience pool D and compute Q_w(s_t^j, a_t^j), V^j(s_t) and V^j(s_{t+1}), where s_t^j and a_t^j are the state and action of the user at time t in the j-th sample, s_{t+1}^j is the state of the user at time t+1 in the j-th sample, Q_w(s_t^j, a_t^j) is the state-action value of the user at time t in the j-th sample, and V^j(s_t), V^j(s_{t+1}) are the state values of s_t and s_{t+1} in the j-th sample, respectively.
Q and V satisfy the soft-Bellman equation; in the soft-Q formulation used by IQ-Learn, the soft state value takes the form V(s) = α · log Σ_{a∈A} exp(Q(s, a)/α).
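The patent's original equations are not reproduced in this text; the sketch below follows the standard soft-Q formulation used by IQ-Learn, in which a network Q_w(s, a) outputs one value per action and the soft state value is computed by a log-sum-exp over actions. The network architecture and dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn

class SoftQNet(nn.Module):
    """Q_w(s, a): maps a state vector to one Q-value per action (|A| = 2 here)."""
    def __init__(self, state_dim: int, num_actions: int = 2, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, num_actions),
        )

    def forward(self, state_vec: torch.Tensor) -> torch.Tensor:
        return self.net(state_vec)        # shape: (batch, num_actions)

def soft_value(q_values: torch.Tensor, alpha: float) -> torch.Tensor:
    """Soft state value V(s) = alpha * log sum_a exp(Q(s, a) / alpha)."""
    return alpha * torch.logsumexp(q_values / alpha, dim=-1)
```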
2.3.2 Calculate the inverse reinforcement learning loss L_IRL:
where α is a hyper-parameter.
2.3.3 Calculate the Reward Distinction Enlargement (RDE) regularization loss L_RDE:
This regularization term amplifies the difference in reward signal between different user behaviors (e.g., click vs. no click), improving the reward function's ability to explain user behavior so that the learned user satisfaction model reflects the user's real preferences more accurately.
2.3.4 Calculate the total loss and update all parameters w of the Q_w(s, a) network by gradient back-propagation.
loss = L_IRL + β · L_RDE    (6)
where β is a hyper-parameter that balances the weight of the regularization term.
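Because equations (3)-(5) are not reproduced in this text, the following training-step sketch should be read as an approximation: L_IRL follows the publicly documented IQ-Learn objective with a χ²-style regularizer, and L_RDE is an illustrative hinge placeholder that widens the reward gap between the two user actions, standing in for the patent's RDE term. The soft_value helper comes from the previous sketch, and the margin value is an assumption; hyper-parameter defaults mirror those used later in the embodiment.

```python
import torch

def iq_learn_style_loss(q_net, states, actions, next_states,
                        alpha=0.5, gamma=1.0, beta=0.5, margin=1.0):
    """Sketch of step 2.3: recover rewards from the soft-Q network and build the loss."""
    q_all = q_net(states)                                      # (batch, 2)
    q_taken = q_all.gather(1, actions.view(-1, 1)).squeeze(1)  # Q_w(s_t, a_t)
    v_s = soft_value(q_all, alpha)                             # V(s_t)
    v_next = soft_value(q_net(next_states), alpha)             # V(s_{t+1})

    # Recovered per-sample reward, as in equation (7): r = Q_w(s, a) - gamma * V(s').
    reward = q_taken - gamma * v_next

    # L_IRL: reward logged (expert) behaviour, keep the value telescoping term small,
    # and regularize reward magnitude (chi^2 term), following the IQ-Learn objective.
    l_irl = -reward.mean() + (v_s - gamma * v_next).mean() + (reward ** 2).mean() / (4 * alpha)

    # L_RDE placeholder: enlarge the reward distinction between the two user actions.
    # The gap r(s, a_P) - r(s, a_N) reduces to Q(s, 1) - Q(s, 0) since V(s') cancels.
    r_gap = (q_all[:, 1] - q_all[:, 0]).abs()
    l_rde = torch.relu(margin - r_gap).mean()

    return l_irl + beta * l_rde
```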
Thus, the user satisfaction r(s_t, a_t) obtained when action a_t is executed in state s_t can be calculated:
r(s_t, a_t) = Q_w(s_t, a_t) − γ · V(s_{t+1})    (7)
where γ is a hyper-parameter and V(s_{t+1}) is the user's state value at state s_{t+1}.
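Once training has converged, equation (7) can be evaluated directly to score the satisfaction of logged interactions; a minimal inference helper, again reusing soft_value from the sketch above, might look as follows.

```python
import torch

@torch.no_grad()
def user_satisfaction(q_net, states, actions, next_states, alpha=0.5, gamma=1.0):
    """Equation (7): r(s_t, a_t) = Q_w(s_t, a_t) - gamma * V(s_{t+1}),
    evaluated with the trained soft-Q network (inference only)."""
    q_taken = q_net(states).gather(1, actions.view(-1, 1)).squeeze(1)
    v_next = soft_value(q_net(next_states), alpha)
    return q_taken - gamma * v_next
```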
Step 3, optimizing recommendation system model training by using user satisfaction model
The goal of this stage is to align the recommender system with the user satisfaction model during training. The user satisfaction model r(s_t, a_t) has been obtained in step 2. The goal of the alignment problem is for the recommender system to maximize the cumulative reward Σ_t η^t · r(s_t, a_t) while recommending items i_t.
3.1 Initialize the recommendation system data experience pool and the recommendation system model with parameters ψ.
3.2 Train the recommender system model to align with the user satisfaction model:
3.2.1 Extract n pieces of user interaction history data {h_t^u}, u = 1, …, n, from the data experience pool, where h_t^u is the interaction sequence of user u from time 0 to time t. Suppose the user interacts with item i_t^u at time t, so that h_t^u = h_{t-1}^u ∪ σ_t^u, where h_{t-1}^u is the interaction sequence of user u from time 0 to time t-1.
3.3.2 The training objective of a sequential recommendation model is typically the click-through rate (CTR). The user's interaction sequence from time 0 to time t-1, h_{t-1}^u, together with the candidate item information, is input into the recommendation model, which predicts the probability ŷ^u that the user clicks the item at time t.
3.3.3 Calculate the recommender system loss function L_CE using cross entropy:
where a_P = 1 and a_N = 0 indicate whether the user clicks, and ŷ^u is the model's probability estimate that user u clicks the item.
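Equation (8) is not reproduced in this text; as a minimal sketch, the standard binary cross-entropy between the predicted click probability and the observed label named in step 3.3.3 can be computed as below.

```python
import torch
import torch.nn.functional as F

def ctr_cross_entropy(click_prob: torch.Tensor, clicked: torch.Tensor) -> torch.Tensor:
    """L_CE: binary cross-entropy between the predicted click probability y_hat^u and
    the observed label (a_P = 1, a_N = 0)."""
    return F.binary_cross_entropy(click_prob, clicked.float())
```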
3.3.4 Convert the recommendation system data into the Markov decision process format using the modeling method of step 1, and input it into the user satisfaction model of step 2 to obtain the user's satisfaction r(s_t, a_t) at that moment, where s_t = (h_{t-1}^u, i_t^u) and a_t is the user's action on the item.
3.3.5 Calculate the auxiliary task loss. To align the recommendation system training process with user satisfaction, an alignment loss L_Align is designed by analogy with the cross-entropy loss used in the recommendation system, guiding the recommendation system to maximize user satisfaction during training:
3.3.6 Calculate the final loss L_Rec and update the recommendation system parameters ψ by back-propagation:
L_Rec = L_CE + κ · L_Align    (10)
where κ is a hyper-parameter that controls the weight of the alignment loss in the recommendation task.
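Equation (9) is likewise not reproduced in this text, so the alignment term below is only an illustrative placeholder following the stated analogy with cross-entropy: each logged interaction's loss is weighted by its sigmoid-normalized satisfaction, and the total objective combines it with L_CE as in equation (10). The function names and the exact weighting scheme are assumptions, not the patent's formula.

```python
import torch

def alignment_loss(click_prob, clicked, satisfaction):
    """Illustrative placeholder for L_Align: a cross-entropy-style term in which each
    logged interaction is weighted by its sigmoid-normalized satisfaction r(s_t, a_t),
    so that high-satisfaction behaviour dominates the recommender's gradient."""
    clicked = clicked.float()
    weight = torch.sigmoid(satisfaction)
    ce = -(clicked * torch.log(click_prob + 1e-8)
           + (1.0 - clicked) * torch.log(1.0 - click_prob + 1e-8))
    return (weight * ce).mean()

def recommendation_loss(click_prob, clicked, satisfaction, kappa=0.6):
    """Equation (10): L_Rec = L_CE + kappa * L_Align."""
    clicked = clicked.float()
    l_ce = -(clicked * torch.log(click_prob + 1e-8)
             + (1.0 - clicked) * torch.log(1.0 - click_prob + 1e-8)).mean()
    return l_ce + kappa * alignment_loss(click_prob, clicked, satisfaction)
```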
The invention has the beneficial effects that:
The invention adopts a new optimization method in the training process of the recommendation system. First, the invention uses inverse reinforcement learning to mine the motivation hidden behind user behavior from user interaction data and obtain a user satisfaction model, which can quantify the user's satisfaction at the current moment from the user's state and action. The optimization method also benefits industrial recommendation systems: (1) it can be extended to any sequential recommendation scenario, such as news or music recommendation, requiring only a simple modification of the MDP modeling; (2) all training is performed offline, saving the cost of building an online user interaction environment; (3) the alignment task can easily be combined with any existing sequential recommendation framework, so the method has strong generality.
Drawings
FIG. 1 is a block diagram of the present invention.
Fig. 2 is a schematic diagram of an alignment task of a recommendation system according to the present invention.
FIG. 3 is a diagram of a DIN network architecture of a click rate estimation model used in the examples.
Detailed Description
The following describes the embodiments of the present invention further with reference to the drawings and technical schemes.
The method can be used for the training process of the recommendation system in the e-commerce recommendation scene, and the flow chart of the method is shown in figure 1.
The invention relates to an alignment task optimization method, and a framework of the alignment task optimization method is shown in fig. 2.
The following describes embodiments of the present invention in detail (as shown in fig. 1), and specifically includes the following steps:
(1) Data processing. This embodiment uses the Amazon datasets, which collect review information posted by users of the e-commerce website Amazon on products from 1996 to 2014 and are divided into several sub-datasets by product category; two of them, Amazon Electronics and Amazon Book, are used in the invention. The specific Markov decision process is modeled as follows:
The state space S: the user's state at time t is defined as s_t = (h_{t-1}, i_t), where h_{t-1} = (σ_1, σ_2, …, σ_{t-1}) is the interaction history up to time t-1, each interaction is σ = <u, i, a>, u is the user id, i is the product the user interacted with (represented in the invention by product id and category), and a is the user action.
The action space A = {a_P, a_N}: a_P = 1 indicates that the user posted a review of the product, and a_N = 0 indicates that the user did not post a review.
Transition function p: S × A × S → [0, 1]. When the user performs action a_t at time t, the state transitions to the new state s_{t+1} = s_t ∪ σ_{t+1}.
(2) User satisfaction model training process:
(2.1) Initialize the Q-network parameters of the user satisfaction model and fill the data experience pool with the processed data.
(2.2) Sample 64 samples from the data experience pool and calculate the model loss.
First, calculate the corresponding state-action values and state values from the user states and actions:
Then calculate the inverse reinforcement learning loss, where α = 0.5, γ = 1:
Then calculate the reward distinction enlargement (RDE) regularization loss:
The sum of the two is the total loss of the user satisfaction model, where β = 0.5:
loss = L_IRL + β · L_RDE
(2.3) Update the network parameters of the user satisfaction model by back-propagation with gradient descent.
When the loop terminates, i.e., after the preset number of training steps (200,000) has been reached, a trained soft-Q model is obtained, and the user's satisfaction r(s_t, a_t) at any time can then be calculated, where γ = 1:
r(s_t, a_t) = Q_w(s_t, a_t) − γ · V(s_{t+1})
This model can be used to quantitatively calculate satisfaction during the training of the recommendation system.
(3) The satisfaction model provided by the invention can be combined with any sequential recommendation model; here it is described using the classical click-through-rate estimation model DIN (Deep Interest Network) as an example. DIN is a recommendation algorithm based on the user's historical behavior sequence; by introducing an interest-extraction mechanism, it dynamically recommends to each user the items most relevant to their current interests, improving personalization and accuracy. The structure of the model is shown in fig. 3.
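A highly simplified sketch of DIN's interest-extraction idea is shown below: the candidate item attends over the user's behaviour embeddings, and the pooled interest vector is concatenated with the candidate embedding before the prediction MLP. Embedding sizes, layer widths, and the softmax normalization of the attention weights are simplifications introduced here for brevity, not a faithful reproduction of the architecture in fig. 3.

```python
import torch
import torch.nn as nn

class DINAttention(nn.Module):
    """Simplified DIN-style interest extraction: the candidate item attends over the
    user's behaviour embeddings; the pooled interest vector and the candidate embedding
    feed an MLP that outputs the click (review) probability."""
    def __init__(self, emb_dim: int = 32, hidden: int = 64):
        super().__init__()
        self.att = nn.Sequential(nn.Linear(emb_dim * 4, hidden), nn.ReLU(), nn.Linear(hidden, 1))
        self.out = nn.Sequential(nn.Linear(emb_dim * 2, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, hist_emb: torch.Tensor, cand_emb: torch.Tensor) -> torch.Tensor:
        # hist_emb: (batch, seq_len, emb_dim); cand_emb: (batch, emb_dim)
        cand = cand_emb.unsqueeze(1).expand_as(hist_emb)
        att_in = torch.cat([hist_emb, cand, hist_emb - cand, hist_emb * cand], dim=-1)
        weights = torch.softmax(self.att(att_in).squeeze(-1), dim=-1)   # (batch, seq_len)
        interest = (weights.unsqueeze(-1) * hist_emb).sum(dim=1)        # (batch, emb_dim)
        logit = self.out(torch.cat([interest, cand_emb], dim=-1)).squeeze(-1)
        return torch.sigmoid(logit)       # predicted probability of a click / review
```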
(3.1) Initializing network parameters in the recommendation system model.
(3.2) Extract 64 pieces of user interaction history data from the data experience pool and calculate the model loss.
First, input the samples into the recommendation system model. Each sample comprises user features (user id), features of the products the user has reviewed (product id and product category), and features of the product to be recommended (product id and product category). Calculate the probability that the user reviews the product to be recommended given the interaction history sequence, then calculate the cross-entropy loss between the model's estimated probability and the sample label:
Process the sample into the state s and action a of the Markov decision process defined in step (1), and input them into the user satisfaction model to obtain the user satisfaction r(s, a) in this state.
Calculate the auxiliary alignment task loss:
Calculate the total loss, where κ = 0.6:
L_Rec = L_CE + κ · L_Align
(3.3) Update the recommendation system model network parameters by back-propagation with gradient descent.
When the loop terminates, i.e., after the preset number of training steps or the preset metric has been reached, a recommendation system model aligned with user satisfaction is obtained.
In order to measure the performance of the recommendation system model, the invention uses the following two metrics:
(1) AUC (Area Under the Curve), which measures the ranking capability of the recommender model:
where x and y are the numbers of positive and negative samples, respectively. In a training sample <h_{t-1}, i_t> of the recommender model, if the user's action on item i_t is a_P, i.e., the user reviews the item, the sample is a positive sample; otherwise it is a negative sample. P(positive) is the recommendation system's predicted probability for the positive-sample action a_P, and P(negative) is its predicted probability for the negative-sample action a_N.
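The AUC formula itself is not reproduced in this text; the helper below implements the standard pairwise estimator consistent with the description above (ties counted as one half).

```python
def auc(pos_scores, neg_scores):
    """Pairwise AUC: fraction of (positive, negative) pairs in which the model scores
    the positive sample higher; ties count as 0.5. x = len(pos_scores) and
    y = len(neg_scores), as in the description above."""
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos_scores for n in neg_scores)
    return wins / (len(pos_scores) * len(neg_scores))
```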
(2) NCIS (Normalised Capped Importance Sampling), which estimates the online performance of the recommender model; the longer the interaction sequences users generate under the recommender model, the larger the NCIS value:
where u indexes the u-th user, U is the number of users in the data set, ρ_u is the probability that the user generates the corresponding interaction trajectory under the current recommendation system, and L_u is the length of the user's interaction trajectory.
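The NCIS formula is also not reproduced in this text; the sketch below shows one common capped, self-normalized importance-sampling estimate of the expected trajectory length, where the clipping constant is an assumed hyper-parameter rather than a value taken from the patent.

```python
def ncis(rho, lengths, cap=10.0):
    """Capped, self-normalized importance-sampling estimate of expected trajectory
    length: weights rho[u] are clipped at `cap` (an assumed constant), normalized,
    and used to average the per-user trajectory lengths L_u."""
    capped = [min(r, cap) for r in rho]
    total = sum(capped)
    return sum(w * l for w, l in zip(capped, lengths)) / total if total > 0 else 0.0
```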
Through experiments, the metric values of the recommendation model DIN before and after alignment with the user satisfaction model in this embodiment are as follows:
The experimental results show that aligning the recommendation model with the user satisfaction model improves both the AUC and NCIS metrics, indicating that the proposed optimization method can improve the ranking capability of the recommendation system, lengthen users' interaction sequences, and increase user stickiness.

Claims (1)

CN202510004388.0A (priority 2025-01-02, filed 2025-01-02): A recommendation system optimization method based on user satisfaction — Pending — CN119808879A (en)

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN202510004388.0A (CN119808879A, en) | 2025-01-02 | 2025-01-02 | A recommendation system optimization method based on user satisfaction

Applications Claiming Priority (1)

Application Number | Priority Date | Filing Date | Title
CN202510004388.0A (CN119808879A, en) | 2025-01-02 | 2025-01-02 | A recommendation system optimization method based on user satisfaction

Publications (1)

Publication Number | Publication Date
CN119808879A (en) | 2025-04-11

Family

ID=95264067

Family Applications (1)

Application Number | Title | Priority Date | Filing Date
CN202510004388.0A (Pending, CN119808879A, en) | A recommendation system optimization method based on user satisfaction | 2025-01-02 | 2025-01-02

Country Status (1)

Country | Link
CN (1) | CN119808879A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN120653850A (en)* | 2025-08-21 | 2025-09-16 | University of Science and Technology of China | An anti-reward-hacking recommendation system optimization method and anti-reward-hacking recommendation system


Similar Documents

Publication | Publication Date | Title
Pan et al., Study on convolutional neural network and its application in data mining and sales forecasting for E-commerce.
CN111753209B (en) A Sequence Recommendation List Generation Method Based on Improved Time Series Convolutional Network
CN108648049A (en) A kind of sequence of recommendation method based on user behavior difference modeling
CN114202061A (en) Article recommendation method, electronic device and medium based on generation of confrontation network model and deep reinforcement learning
CN114463091B (en) Information push model training and information push method, device, equipment and medium
CN112256866B (en) Text fine-grained emotion analysis algorithm based on deep learning
Wang, A survey of online advertising click-through rate prediction models
CN119808879A (en) A recommendation system optimization method based on user satisfaction
Suddle et al., Metaheuristics based long short term memory optimization for sentiment analysis
Wang et al., A spatiotemporal graph neural network for session-based recommendation
CN114896515B (en) Time interval-based self-supervised learning collaborative sequence recommendation method, device and medium
CN114329193B (en) Click rate prediction method based on time perception interest evolution
CN111753918A (en) A gender-biased image recognition model based on adversarial learning and its application
CN115525835A (en) Long-short term attention cycle network recommendation method
CN115712777A (en) Ranking method of literature recommendation system based on logistic regression
Zhu et al., Learning from interpretable analysis: Attention-based knowledge tracing
Cao et al., Feature-enhanced deep learning method for electric vehicle charging demand probabilistic forecasting of charging station
Xu et al., [Retracted] Research on the Construction of Crossborder e‐Commerce Logistics Service System Based on Machine Learning Algorithms
CN119474552A (en) A cultural and tourism content recommendation system that analyzes preferences for cultural and tourism attractions
Chang et al., Construction of a personalised online learning resource recommendation model based on self-adaptation
Shen et al., Online teaching course recommendation based on autoencoder
CN116611499A (en) Method and apparatus for training reinforcement learning system for automatic bidding
Zhang et al., Fusion model with attention mechanism for carbon-neutral sports competitions
Wang, Design and Implementation of Student Job Matching System Based on Personalized Recommendation Algorithm
CN114219530B (en) An explanation recommendation method, device and equipment based on sentiment analysis

Legal Events

Date | Code | Title | Description
| PB01 | Publication |
| SE01 | Entry into force of request for substantive examination |
