Disclosure of Invention
In order to solve the problems in the prior art, the invention provides a commodity recommendation method and a commodity recommendation system, which are used for completing matching by respectively extracting characteristics of a user and commodities and giving out comprehensive recommendation scores based on relations among the commodities, so that accuracy and richness of commodity recommendation are improved.
The aim of the invention can be achieved by the following technical scheme:
A first aspect of the present disclosure provides a commodity recommendation method, including the following steps:
Collecting and preprocessing data, namely collecting user information data and commodity information data after obtaining user permission, cleaning and encoding the collected data so that the data format is consistent, and desensitizing the data;
Constructing a user portrait, namely constructing user features from the preprocessed data features, converting all the user features into vector form to form a user portrait vector, and adding labels to the user portrait;
Acquiring the commodity information topic distribution, namely extracting topic features of the commodity information from the commodity information data by using a topic modeling algorithm, and calculating the distribution of each topic feature;
Constructing a commodity relation time function, namely mining commodity relations from the commodity information data and the user behavior data, and establishing a time function of the change trend based on the complementary and substitute relations of the commodities;
Generating a recommendation list, namely establishing a recommendation scoring function for recommended commodities based on the user portrait, the commodity information data and the commodity relations, and generating a recommendation list of commodity information according to the recommendation scores;
The construction of the commodity relation time function comprises the following steps:
Mining commodity combinations containing the recommended commodity from all commodity purchase records by an association rule mining algorithm, and then screening the association rules based on the user behavior data to establish a complementary commodity set;
Calculating, by a similarity calculation method, the similarity between the recommended commodity and the commodities purchased in the user behavior data, and screening out a substitute commodity set according to a set threshold;
Establishing, based on the complementary commodity set and the substitute commodity set, a time function based on commodity relations according to how the complementary and substitute relations change over time.
Further, the establishing the complementary commodity set comprises the following steps:
Finding frequent item sets about recommended commodities in all transaction records through an association rule mining algorithm; generating association rules from the frequent item set, and calculating the support and confidence of the rules;
Setting a minimum support threshold and a minimum confidence threshold to screen the association rules, and then, based on the user behavior data, removing the rules that contain no commodity purchased by the user;
and acquiring the contained commodity combination according to the association rule obtained by final screening, and establishing a complementary commodity set of recommended commodities.
Further, the data acquisition and preprocessing comprises the following steps:
Data division, namely identifying the sensitivity of the collected data and dividing the data into sensitive fields and conventional fields, de-identifying the sensitive fields, and extracting features from the conventional fields by a data engine;
De-identification, namely desensitizing the sensitive fields of the user information, wherein the desensitization comprises:
Deleting direct identification information, namely deleting fields that directly identify an individual, including names, telephone numbers, account numbers and the like;
Replacing identifiers with anonymous identifiers, namely allocating a random anonymous identifier to each user ID, and using a hash algorithm to convert sensitive fields into fixed-length hash values;
Blurring the geographic position, namely coarsening accurate longitude and latitude information to the level of a city or region;
Blurring time, namely converting accurate time data into months or quarters;
Feature extraction, namely scanning the user data record by record according to the rule features defined for the data engine and generating the corresponding data features, including:
Rule feature construction, namely constructing rules in a plurality of dimensions, including user basic data rules, consumption data rules, behavior data rules, social data rules and feedback data rules, and logically defining each type of data rule, wherein the logical definition comprises range definition, condition combination and dynamic addition;
Generating the data features of each piece of user data through the logical definitions according to the constructed data rule features.
Further, the user information data comprises user basic data, user behavior data and user feedback data, wherein the user basic data comprises user ID, gender, age, geographic position and registration time, the user behavior data comprises purchase records, browsing history and search keywords, and the user feedback data comprises click rate, purchase rate and evaluation feedback of a user on recommended commodities.
Further, the acquiring the distribution of commodity information topics comprises the following steps:
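Constructing a document-word matrix, namely converting the preprocessed text data of the commodity information data into a document-word matrix; given M documents and V words, the document-word matrix D is expressed as:
$$D = \begin{bmatrix} d_{11} & d_{12} & \cdots & d_{1V} \\ d_{21} & d_{22} & \cdots & d_{2V} \\ \vdots & \vdots & \ddots & \vdots \\ d_{M1} & d_{M2} & \cdots & d_{MV} \end{bmatrix};$$
where $d_{MV}$ represents the frequency of occurrence of word V in document M;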
Training an LDA model on the document-word matrix, setting the number of topics N, and generating an N-dimensional topic distribution vector I for each document D, wherein the topic distribution vector I is expressed as:
$$I = [P(R_1 \mid D), P(R_2 \mid D), \ldots, P(R_N \mid D)];$$
where $P(R_N \mid D)$ represents the probability of occurrence of the N-th topic $R_N$ given document D.
Further, the time function based on commodity relation has the expression:
Km,c(Δt)=N(Δt∣0,σc);
Kn,s(Δt)=-N(Δt∣0,σs1)+N(Δt∣us,σs2);
Where Km,c(Δt) represents the degree of influence between the commodity m and the complementary commodity c, N(Δt∣0,σc) represents a normal distribution with a mean of 0 and a standard deviation of σc, Δt represents the time interval since the last purchase record of the complementary commodity c, Kn,s(Δt) represents the degree of influence between the commodity n and the substitute commodity s, N(Δt∣0,σs1) represents a normal distribution with a mean of 0 and a standard deviation of σs1, N(Δt∣us,σs2) represents a normal distribution with a mean of us and a standard deviation of σs2, us represents the long-term mean of the substitution relation, and the subscripts s1 and s2 of σs1 and σs2 distinguish the standard deviation of the negative influence from that of the positive influence.
Further, the expression formula of the recommendation scoring function F is as follows:
;
Wherein S(A, B) represents the similarity between user portraits A and B, M(H, I) represents the degree of matching between the user behavior data H and the topic distribution vector I, α, β and γ are weight coefficients, Km,c(Δt) represents the degree of influence between commodity m and the complementary commodity c, Kn,s(Δt) represents the degree of influence between commodity n and the substitute commodity s, and the two score terms represent, respectively, the user's score for commodity m in the complementary commodity set C and the user's score for commodity n in the substitute commodity set S.
Further, the generating the recommendation list further includes:
And setting the minimum recommendation score according to the recommendation requirement, screening out commodity information meeting the recommendation condition, and pushing the commodity information to the user.
A second aspect of the present disclosure provides a commodity recommendation system for implementing a commodity recommendation method as described above, including a user portrayal module, a commodity information modeling module, and a commodity recommendation module;
The user portrait module is used for collecting user information data and constructing a user portrait by extracting user characteristics, and comprises the following steps:
Cleaning and preprocessing the collected original data, including removing duplication, filling missing values, converting data formats and detecting and processing abnormal values;
Identifying the collected data sensitive state and dividing the data into sensitive fields and conventional fields, de-identifying the sensitive fields, and extracting features from the conventional fields by adopting a data engine;
Extracting features for constructing a user representation from the preprocessed data, including extracting features from demographics, behavioral features, hobbies of interest, lifestyle and value aspects, and emotion analysis;
Combining features of different dimensions into a composite feature and creating a unified vector representation for each user.
As a preferred technical scheme of the invention, the commodity information modeling module is used for extracting topic features from the commodity information data and calculating the distribution of the topic features, and comprises the following steps:
Training an LDA model on the document-word matrix, setting the number of topics, and generating for each document a topic distribution vector whose dimension equals the number of topics;
The commodity recommendation module is used for matching the user to a user group with similar interest preferences according to the user portrait, obtaining the degree of matching between commodities and the current user from the commodity topic feature distribution, applying a weighted correction to the scores of recommended commodities based on a time function that accounts for commodity relations, and pushing commodity information according to the comprehensive scores of the recommended commodities;
The time function of the commodity relations comprises a time function established according to the time-series change trend of commodities with a complementary relation and a time function established according to the time-series change trend of commodities with a substitute relation.
The beneficial effects of the invention are as follows:
The invention first collects user data and, to protect data privacy, applies de-identification and data-engine feature extraction to the collected data: the sensitive fields of the user data are de-identified, and the conventional fields are processed by the data engine according to its rules so that only the data features are retained. Developers receive only the de-identified data and the extracted features and never directly access the original data, which reduces the risk of leaking users' private data through defects in the desensitization rules and protects the users' private data.
The method first processes the user information data and the commodity information data separately, establishing a user portrait for the user and a topic distribution for the commodity information. Users are then matched to groups according to their user portraits so that commodities of interest to the group can be recommended, and the commodities are matched to the user according to the topic distribution, so that the recommendation result both meets the user's interest preferences and satisfies the attribute requirements of the commodity information. Because the influence on the target item of commodities having a complementary or substitute relation with purchased commodities changes over time and cannot be ignored, a time function of the commodity relations is established on the basis of the first two steps and incorporated into the recommended commodity score as a weighted term. This mines the user's potential needs from the relations among commodities and improves both the accuracy of the recommended commodities and the richness of their categories.
Detailed Description
To further explain the technical means adopted by the invention to achieve its intended purpose and the resulting effects, the specific implementation, structure, features and effects of the invention are described in detail below with reference to the accompanying drawings and the preferred embodiment.
The embodiment provides a commodity recommendation method, as shown in FIG. 1, comprising the following steps:
S1, data acquisition and preprocessing, namely collecting user information data and commodity information data after obtaining user permission, cleaning and encoding the collected data so that the data format is consistent, and desensitizing the data.
It can be understood that the user information data include user basic data, user behavior data and user feedback data, and are collected together with the commodity information data. The user basic data include the user ID, gender, age, geographic position, registration time and the like, and provide the basic information of the user portrait, helping to understand the user's basic features and potential needs. The user behavior data include purchase records (commodity ID, purchase time, purchase quantity, payment amount), browsing history (commodity ID, browsing time, stay time), search keywords (search time, search content) and the like, and are used to analyze the user's interest preferences and consumption habits and to form a user behavior model that supports personalized recommendation. The commodity information data include product information (commodity ID, name, category, price, stock), promotional activities (activity ID, activity type, start time, end time), user evaluations (commodity ID, user ID, score, comment content) and the like, and are used to evaluate marketing effect and to analyze how attractive different products and activities are to users, providing basic data for personalized recommendation. The user feedback data include the click rate, purchase rate and evaluation feedback of the user on recommended commodities, and help optimize the effect of the recommendation system.
S11, data division, namely identifying the collected data sensitive state and dividing the data into sensitive fields and conventional fields, de-identifying the sensitive fields, and extracting features from the conventional fields by adopting a data engine;
S12, de-identification, namely desensitizing the sensitive fields of the user information, which comprises:
Deleting direct identification information, namely deleting fields that directly identify an individual, including names, telephone numbers, account numbers and the like;
Replacing identifiers with anonymous identifiers, namely allocating a random anonymous identifier to each user ID, and using a hash algorithm (such as SHA-256 or MD5) to convert sensitive fields into fixed-length hash values;
Blurring the geographic position, namely blurring geographic position data (such as GPS coordinates) and blurring accurate longitude and latitude information to the level of a city or region;
Time blurring-converting precise time data (e.g., registration time, access time, etc.) into months or quarters.
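The four de-identification operations of S12 can be sketched as follows. This is a minimal illustration, not the embodiment's implementation: the field names (`name`, `phone`, `account`, `email`, `lat`, `lon`, `city`, `registered_at`), the salt, and the choice to blur location by keeping only a city field are all assumptions made for the example.

```python
import hashlib
import secrets
from datetime import datetime

# Hypothetical names of fields that directly identify an individual.
DIRECT_IDENTIFIERS = {"name", "phone", "account"}


def deidentify(record, id_map, salt):
    """Return a de-identified copy of one user record (illustrative only)."""
    # 1. Delete direct identification information.
    out = {k: v for k, v in record.items() if k not in DIRECT_IDENTIFIERS}

    # 2. Replace the user ID with a random anonymous identifier that stays
    #    stable for the same user, and hash a remaining sensitive field.
    uid = out.pop("user_id")
    if uid not in id_map:
        id_map[uid] = secrets.token_hex(8)
    out["anon_id"] = id_map[uid]
    if "email" in out:  # "email" is an assumed example of a sensitive field
        digest = hashlib.sha256((salt + out["email"]).encode("utf-8"))
        out["email"] = digest.hexdigest()  # fixed length: 64 hex characters

    # 3. Blur the geographic position: drop exact coordinates, keep the city.
    out.pop("lat", None)
    out.pop("lon", None)

    # 4. Blur time: keep only the year and quarter of the registration time.
    if "registered_at" in out:
        ts = datetime.fromisoformat(out["registered_at"])
        out["registered_at"] = f"{ts.year}-Q{(ts.month - 1) // 3 + 1}"
    return out
```

Keeping `id_map` outside the function lets the same user receive the same anonymous identifier across records; where that mapping is stored and who may read it is a design decision the text does not address.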
S13, feature extraction, namely scanning the user data record by record according to the rule features defined for the data engine and generating the corresponding data features, which comprises:
Rule feature construction, namely constructing rules in a plurality of dimensions, including user basic data rules, consumption data rules, behavior data rules, social data rules and feedback data rules, and logically defining each type of data rule, wherein the logical definition comprises range definition (such as ranges of age or consumption amount), condition combination (such as judging how active a user is) and dynamic addition (such as adding a data feature when a user's purchase frequency rises sharply);
and generating data characteristics of each piece of user data through logic definition according to the constructed data rule characteristics.
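One way to picture the rule-based data engine of S13 is as a registry of named predicates applied to each record. The sketch below is an assumption-laden illustration: the class `RuleEngine`, the helpers `in_range` and `all_of`, and the field names and thresholds are hypothetical and do not come from the embodiment.

```python
def in_range(field, low, high):
    """Range definition: true when low <= record[field] < high."""
    # A missing field becomes NaN, for which every comparison is false.
    return lambda rec: low <= rec.get(field, float("nan")) < high


def all_of(*rules):
    """Condition combination: true when every sub-rule is true."""
    return lambda rec: all(rule(rec) for rule in rules)


class RuleEngine:
    def __init__(self):
        self.rules = {}

    def add_rule(self, name, rule):
        """Dynamic addition: rules can be registered at any time."""
        self.rules[name] = rule

    def extract(self, record):
        """Scan one record and emit one boolean data feature per rule."""
        return {name: rule(record) for name, rule in self.rules.items()}


engine = RuleEngine()
engine.add_rule("age_18_30", in_range("age", 18, 30))
engine.add_rule(
    "active_user",
    all_of(in_range("logins_30d", 10, float("inf")),
           in_range("orders_30d", 1, float("inf"))),
)
features = engine.extract({"age": 25, "logins_30d": 14, "orders_30d": 2})
```

Only the boolean features leave the engine, which matches the stated goal that downstream consumers receive extracted features rather than raw records.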
It should be noted that, basic data, behavior data and feedback data of the user are collected, and feature extraction and data de-identification are performed to construct a user portrait, and commodity information data of the commodity platform are collected at the same time, so that personalized recommendation is performed.
S2, constructing a user portrait, namely constructing user features from the preprocessed data features, converting all the user features into vector form to form a user portrait vector, and adding labels to the user portrait.
It should be noted that the user features include user basic features (such as age and geographic position), consumption features (such as consumption frequency, average consumption amount and most-purchased categories), behavior features (such as browsing duration and diversity of search keywords), social features (such as numbers of likes and comments), and feedback features (such as the average score given to purchased commodities). The user portrait is labeled according to the features and behaviors of the user, for example 'high consumption user', 'electronic product fan' or 'socially active'.
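As a rough sketch of S2, the features of one user can be concatenated into a vector and labels attached by simple thresholds. The feature names, the one-hot encoding of the top category, and the threshold of 500 are assumptions chosen for the example; the embodiment does not specify them.

```python
def build_portrait(features, categories):
    """Turn a feature dict into (vector, labels). Illustrative only."""
    # Numeric features first (the age scaling is an arbitrary choice here).
    vector = [
        features["age"] / 100.0,
        float(features["orders_per_month"]),
        float(features["avg_order_amount"]),
    ]
    # One-hot encode the most-purchased category.
    vector += [1.0 if features["top_category"] == c else 0.0
               for c in categories]

    # Labels derived from hypothetical thresholds.
    labels = []
    if features["avg_order_amount"] >= 500:
        labels.append("high consumption user")
    if features["top_category"] == "electronics":
        labels.append("electronic product fan")
    return vector, labels


vector, labels = build_portrait(
    {"age": 30, "orders_per_month": 4, "avg_order_amount": 650.0,
     "top_category": "electronics"},
    categories=["books", "electronics"],
)
```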
S3, acquiring the commodity information topic distribution, namely extracting topic features of the commodity information from the commodity information data by using a topic modeling algorithm, and calculating the distribution of each topic feature, comprising the following steps:
S31, constructing a document-word matrix, namely converting the preprocessed text data of the commodity information data into a document-word matrix; given M documents and V words, the document-word matrix D can be expressed as:
$$D = \begin{bmatrix} d_{11} & d_{12} & \cdots & d_{1V} \\ d_{21} & d_{22} & \cdots & d_{2V} \\ \vdots & \vdots & \ddots & \vdots \\ d_{M1} & d_{M2} & \cdots & d_{MV} \end{bmatrix};$$
Where $d_{MV}$ denotes the frequency of occurrence of word V in document M.
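The document-word matrix of S31 can be built with plain word counts, as in the sketch below. Whitespace tokenization and lowercasing are simplifying assumptions; real commodity text would need the preprocessing the embodiment mentions.

```python
from collections import Counter


def document_word_matrix(documents):
    """Return (vocabulary, matrix), where matrix[i][j] is the number of
    times vocabulary[j] occurs in documents[i]."""
    tokenized = [doc.lower().split() for doc in documents]
    vocab = sorted({word for tokens in tokenized for word in tokens})
    index = {word: j for j, word in enumerate(vocab)}
    matrix = []
    for tokens in tokenized:
        row = [0] * len(vocab)
        for word, count in Counter(tokens).items():
            row[index[word]] = count
        matrix.append(row)
    return vocab, matrix


docs = ["wireless mouse usb mouse", "usb keyboard", "running shoes"]
vocab, D = document_word_matrix(docs)
# vocab is the sorted word list; each row of D has one count per word.
```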
S32, generating a topic distribution vector, namely training an LDA model on the document-word matrix, setting the number of topics N, and generating an N-dimensional topic distribution vector I for each document D, wherein the topic distribution vector I is expressed as:
$$I = [P(R_1 \mid D), P(R_2 \mid D), \ldots, P(R_N \mid D)];$$
Where $P(R_N \mid D)$ represents the probability of occurrence of the N-th topic $R_N$ given document D.
It can be appreciated that common topic modeling algorithms include Latent Dirichlet Allocation (LDA), Non-negative Matrix Factorization (NMF) and the like; in this embodiment, LDA is used for topic modeling. Extracting the topic distribution from the commodity information data of the e-commerce platform facilitates topic-based matching with the user in personalized recommendation.
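The embodiment does not say how the LDA model is trained. As one possibility, the sketch below uses collapsed Gibbs sampling, a standard inference method for LDA, to produce a per-document topic distribution from a document-word count matrix. The hyperparameters `alpha` and `beta`, the iteration count and the seed are assumptions; a production system would more likely rely on an established library.

```python
import random


def lda_gibbs(D, n_topics, alpha=0.1, beta=0.01, iters=200, seed=0):
    """Collapsed Gibbs sampling for LDA on a document-word count matrix.
    Returns one topic distribution (a list of n_topics probabilities)
    per document."""
    rng = random.Random(seed)
    n_docs, n_words = len(D), len(D[0])
    # Expand counts into token lists: one word index per occurrence.
    tokens = [[w for w, c in enumerate(row) for _ in range(c)] for row in D]
    # Random initial topic assignment for every token.
    z = [[rng.randrange(n_topics) for _ in doc] for doc in tokens]
    doc_topic = [[0] * n_topics for _ in range(n_docs)]
    topic_word = [[0] * n_words for _ in range(n_topics)]
    topic_total = [0] * n_topics
    for d, doc in enumerate(tokens):
        for i, w in enumerate(doc):
            k = z[d][i]
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1

    for _ in range(iters):
        for d, doc in enumerate(tokens):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                k = z[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                # Conditional probability of each topic for this token.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + beta * n_words)
                    for t in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                z[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1

    # Smoothed estimate of P(topic | document) for every document.
    theta = []
    for d in range(n_docs):
        total = len(tokens[d]) + alpha * n_topics
        theta.append([(doc_topic[d][t] + alpha) / total
                      for t in range(n_topics)])
    return theta


counts = [[0, 2, 0, 0, 1, 1], [1, 0, 0, 0, 1, 0], [0, 0, 1, 1, 0, 0]]
topic_vectors = lda_gibbs(counts, n_topics=2)
```

Each row of `topic_vectors` plays the role of the topic distribution vector I for one document: it has N entries that sum to 1.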
S4, constructing a commodity relation time function, namely mining commodity relations from the commodity information data and the user behavior data, and establishing a time function of the change trend based on the complementary and substitute relations of the commodities, as shown in FIG. 2, comprising the following steps:
S41, mining commodity combinations containing the recommended commodity from all commodity purchase records by an association rule mining algorithm, and then screening the association rules based on the user behavior data to establish a complementary commodity set, comprising:
Finding frequent item sets containing the recommended commodity in all transaction records by an association rule mining algorithm (e.g., Apriori or FP-Growth); for example, if the recommended commodity E and commodity F are often purchased together, {E, F} is a frequent item set.
Generating association rules from the frequent item set, calculating the support degree and the confidence degree of the rules, setting a minimum support degree threshold value and a confidence degree threshold value to screen the association rules, and then eliminating rules which do not contain the commodity purchased by the user from the association rules based on the user behavior data;
and acquiring the contained commodity combination according to the association rule obtained by final screening, and establishing a complementary commodity set of recommended commodities.
It should be noted that, when obtaining the complementary commodity set, the association rule mining algorithm is first applied to the purchase records of all users for the recommended commodity, so that a more comprehensive set of complementary commodity combinations is obtained. These combinations are then screened against the purchase records in the current user's behavior data, which yields complementary combinations specific to the current user, makes the set more personalized, and improves the accuracy of the subsequent commodity recommendation.
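The screening logic of S41 can be illustrated with a deliberately simplified miner that only considers item pairs containing the recommended commodity, rather than running full Apriori or FP-Growth. The rule direction (purchased item implies recommended commodity), the threshold values and the function name are assumptions made for this sketch.

```python
def complementary_set(transactions, target, user_purchased,
                      min_support=0.2, min_confidence=0.5):
    """Pairwise association rules {item} -> {target}, screened by
    support, confidence and the current user's purchase history."""
    n = len(transactions)
    target_count = sum(1 for t in transactions if target in t)
    result = {}
    if n == 0 or target_count == 0:
        return result
    # Candidate items: everything ever bought together with the target.
    candidates = {i for t in transactions if target in t for i in t}
    candidates.discard(target)
    for item in candidates:
        both = sum(1 for t in transactions if target in t and item in t)
        item_count = sum(1 for t in transactions if item in t)
        support = both / n
        confidence = both / item_count            # P(target | item)
        lift = confidence / (target_count / n)    # > 1: positive association
        keep = support >= min_support and confidence >= min_confidence
        # Drop rules that involve no commodity purchased by this user.
        if keep and item in user_purchased:
            result[item] = {"support": support,
                            "confidence": confidence,
                            "lift": lift}
    return result


transactions = [{"phone", "case"}, {"phone", "case", "charger"},
                {"phone", "charger"}, {"case"}, {"laptop"}]
complements = complementary_set(transactions, "phone",
                                user_purchased={"case", "laptop"})
```

In this toy data the charger rule passes both thresholds but is removed because the user never bought a charger, so only the case remains in the complementary set.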
S42, calculating, by a similarity calculation method, the similarity between the recommended commodity and the commodities purchased in the user behavior data, and screening out a substitute commodity set according to a set threshold.
In this embodiment, the commodity set with a substitute relation is obtained by calculating the cosine distance between commodities. By analyzing the user's purchase history, commodities that are often purchased together (complementary relation) and commodities that are often purchased in place of one another (substitute relation) are identified, yielding the complementary commodity set and the substitute commodity set; the details are not repeated here.
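A minimal sketch of the substitute screening in S42 follows. It assumes each commodity is already represented by a numeric feature vector and uses cosine similarity (one minus the cosine distance) with a hypothetical threshold of 0.8; how the commodity vectors are built is not specified in the text.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def substitute_set(target_vec, purchased, threshold=0.8):
    """Keep purchased commodities whose similarity to the recommended
    commodity reaches the threshold."""
    result = {}
    for item, vec in purchased.items():
        sim = cosine_similarity(target_vec, vec)
        if sim >= threshold:
            result[item] = sim
    return result


substitutes = substitute_set(
    [1.0, 0.0, 1.0],
    {"similar_item": [1.0, 0.0, 1.0], "unrelated_item": [0.0, 1.0, 0.0]},
)
```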
S43, based on the established complementary commodity set and substitute commodity set, establishing a time function based on commodity relations according to how the complementary and substitute relations change over time:
Km,c(Δt)=N(Δt∣0,σc);
Kn,s(Δt)=-N(Δt∣0,σs1)+N(Δt∣us,σs2);
Where Km,c(Δt) represents the degree of influence between the commodity m and the complementary commodity c, N(Δt∣0,σc) represents a normal distribution with a mean of 0 and a standard deviation of σc, Δt represents the time interval since the last purchase record of the complementary commodity c, Kn,s(Δt) represents the degree of influence between the commodity n and the substitute commodity s, N(Δt∣0,σs1) represents a normal distribution with a mean of 0 and a standard deviation of σs1, N(Δt∣us,σs2) represents a normal distribution with a mean of us and a standard deviation of σs2, us represents the mean of the substitution relation, reflecting the point in time at which the user may wish to replace the commodity after a period of use, and the subscripts s1 and s2 distinguish the standard deviation of the negative influence from that of the positive influence.
It should be noted that the purchase relations between commodities change over time. For example, the expectation of purchasing a complementary commodity is stronger in the short term, whereas the influence of a substitute commodity may be negative in the short term and turn positive over time. The time function of the complementary relation is therefore designed to have a positive initial value and to decay quickly, using a zero-mean normal distribution, while the time function of the substitute relation is built from two normal distributions of opposite sign. Introducing the time factor in this way accounts for commodity relations that change over time and improves the accuracy of recommendation.
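The two time functions of S43 translate directly into code when N(Δt∣mean, std) is read as the normal probability density evaluated at Δt. In the sketch below, measuring Δt in days and the parameter values used in the example calls are assumptions, and `mu_s` stands for the mean written as us in the text.

```python
import math


def normal_pdf(x, mean, std):
    """Density of a normal distribution with the given mean and std."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def k_complement(dt, sigma_c):
    """K_{m,c}: positive right after the purchase, decaying with time."""
    return normal_pdf(dt, 0.0, sigma_c)


def k_substitute(dt, sigma_s1, mu_s, sigma_s2):
    """K_{n,s}: negative shortly after the purchase, positive around
    mu_s, when a replacement becomes likely."""
    return -normal_pdf(dt, 0.0, sigma_s1) + normal_pdf(dt, mu_s, sigma_s2)


# Hypothetical parameters, with dt measured in days.
right_after = k_substitute(0, sigma_s1=7, mu_s=180, sigma_s2=30)
half_year_later = k_substitute(180, sigma_s1=7, mu_s=180, sigma_s2=30)
```

With these illustrative parameters, `right_after` is negative and `half_year_later` is positive, which is the sign change the paragraph above describes for substitute commodities.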
S5, generating a recommendation list, namely establishing a recommendation scoring function for recommended commodities based on the user portrait, the commodity information data and the commodity relations, and generating the recommendation list of commodity information according to the recommendation scores.
Wherein, the formula of the recommendation scoring function F is as follows:
;
Wherein S(A, B) represents the similarity between user portraits A and B, M(H, I) represents the degree of matching between the user behavior data H and the topic distribution vector I, α, β and γ are weight coefficients, Km,c(Δt) represents the degree of influence between commodity m and the complementary commodity c, Kn,s(Δt) represents the degree of influence between commodity n and the substitute commodity s, and the two score terms represent, respectively, the user's score for commodity m in the complementary commodity set C and the user's score for commodity n in the substitute commodity set S.
It should be noted that a minimum recommendation score is set according to the recommendation requirement, and commodity information meeting the recommendation condition is screened out and pushed to the user. In the recommendation scoring function, the user's score for commodity m characterizes the preference for commodity m by the lift calculated during the association rule mining that produces the complementary commodity set, and the user's score for commodity n characterizes the preference for commodity n by the cosine distance obtained when producing the substitute commodity set.
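The expression of F is not reproduced in this text, so the sketch below assumes one plausible form consistent with the listed symbols: a weighted sum of S(A, B), M(H, I) and a relation term that adds up influence times score over the complementary and substitute sets. Both this form and the default weights are assumptions, not the claimed formula.

```python
def recommendation_score(s_ab, m_hi, complement_terms, substitute_terms,
                         alpha=0.4, beta=0.4, gamma=0.2):
    """Assumed form: F = alpha*S + beta*M + gamma*(sum of K*r terms).
    complement_terms and substitute_terms are lists of (K, r) pairs,
    where K is the time-function value and r is the user's score."""
    relation = sum(k * r for k, r in complement_terms)
    relation += sum(k * r for k, r in substitute_terms)
    return alpha * s_ab + beta * m_hi + gamma * relation


def recommend(scores, min_score):
    """Keep commodities at or above the minimum recommendation score,
    ordered from highest to lowest score."""
    kept = [(item, s) for item, s in scores.items() if s >= min_score]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


score = recommendation_score(
    0.5, 0.5,
    complement_terms=[(0.1, 2.0)],     # e.g. K_{m,c} = 0.1, lift = 2.0
    substitute_terms=[(-0.05, 1.0)],   # e.g. K_{n,s} = -0.05, score = 1.0
)
ranked = recommend({"a": 0.43, "b": 0.1, "c": 0.9}, min_score=0.3)
```

A negative substitute term lowers the score of a commodity the user has only just bought an equivalent of, which is the weighted correction the commodity relation time function is meant to provide.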
The embodiment also provides a commodity recommendation system, as shown in FIG. 3, which comprises a user portrait module, a commodity information modeling module and a commodity recommendation module.
The user portrait module is used for collecting user information data and constructing a user portrait by extracting user characteristics, and comprises the following steps:
cleaning and preprocessing the collected original data, including:
Removing duplicate records, to avoid misleading analysis results caused by duplicate data;
Filling missing values, namely handling missing items in the user information by methods such as mean filling, median filling, most-frequent-value filling or predictive filling;
Data format conversion, namely converting data into a standard format, for example converting time fields into a unified date format or converting categorical variables into numeric variables;
Abnormal value detection and processing, namely identifying and handling abnormal values, such as a purchase amount that is too high or too low, to avoid negative influence on subsequent analysis;
Identifying the collected data sensitive state and dividing the data into sensitive fields and conventional fields, de-identifying the sensitive fields, and extracting features from the conventional fields by adopting a data engine;
Extracting features for constructing the user representation from the preprocessed data, including extracting features from:
Demographic features, such as age, gender, region and occupation;
Behavior features, such as purchase history, browsing records and search keywords;
Interests and hobbies, such as favorite commodity categories, brand preferences and content consumption habits;
Lifestyle and values, such as the user's values, consumption level and risk preference (obtainable by indirect analysis);
Emotion analysis, namely extracting the user's emotional tendency (positive, negative or neutral) toward a product or service by analyzing text data such as the user's comments and feedback.
Combining features of different dimensions into a composite feature and creating a unified vector representation for each user.
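Three of the cleaning steps listed for the user portrait module (removing duplicates, filling missing values, handling abnormal values) can be sketched as below; format conversion is omitted for brevity. The field name `amount`, median filling, and the outlier rule (clipping values farther than `k` median absolute deviations from the median) are assumptions chosen for the example, since the text leaves the concrete methods open.

```python
from statistics import median


def clean(records, field="amount", k=10):
    """Deduplicate records, fill missing values of `field` with the
    median, and clip outliers to median +/- k * MAD. Illustrative only."""
    # 1. Remove exact duplicate records, keeping the first occurrence.
    seen, unique = set(), []
    for rec in records:
        key = tuple(sorted(rec.items()))
        if key not in seen:
            seen.add(key)
            unique.append(dict(rec))

    # 2. Fill missing values with the median of the observed values.
    observed = [r[field] for r in unique if r.get(field) is not None]
    if not observed:
        return unique
    med = median(observed)
    for r in unique:
        if r.get(field) is None:
            r[field] = med

    # 3. Clip abnormal values using the median absolute deviation (MAD).
    mad = median(abs(v - med) for v in observed)
    if mad > 0:
        low, high = med - k * mad, med + k * mad
        for r in unique:
            r[field] = min(max(r[field], low), high)
    return unique


raw = [
    {"user_id": 1, "amount": 10}, {"user_id": 1, "amount": 10},
    {"user_id": 2, "amount": 12}, {"user_id": 3, "amount": 11},
    {"user_id": 4, "amount": None}, {"user_id": 5, "amount": 13},
    {"user_id": 6, "amount": 1000},
]
cleaned = clean(raw)
```

Clipping is only one way to handle an abnormal value; dropping or flagging the record are equally valid readings of "detection and processing".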
The commodity information modeling module is used for extracting topic features from the commodity information data and calculating the distribution of the topic features, and comprises the following steps:
Training an LDA model on the document-word matrix, setting the number of topics, and generating for each document a topic distribution vector whose dimension equals the number of topics.
The commodity recommendation module is used for matching the user to a user group with similar interest preferences according to the user portrait, obtaining the degree of matching between commodities and the current user from the commodity topic feature distribution, applying a weighted correction to the scores of recommended commodities based on a time function that accounts for commodity relations, and pushing commodity information according to the comprehensive scores of the recommended commodities.
The time function of the commodity relations comprises a time function established according to the time-series change trend of commodities with a complementary relation and a time function established according to the time-series change trend of commodities with a substitute relation.
The invention first collects user data and, to protect data privacy, applies de-identification and data-engine feature extraction to the collected data: the sensitive fields of the user data are de-identified, and the conventional fields are processed by the data engine according to its rules so that only the data features are retained. Developers receive only the de-identified data and the extracted features and never directly access the original data, which reduces the risk of leaking users' private data through defects in the desensitization rules and protects the users' private data.
The method first processes the user information data and the commodity information data separately, establishing a user portrait for the user and a topic distribution for the commodity information. Users are then matched to groups according to their user portraits so that commodities of interest to the group can be recommended, and the commodities are matched to the user according to the topic distribution, so that the recommendation result both meets the user's interest preferences and satisfies the attribute requirements of the commodity information. Because the influence on the target item of commodities having a complementary or substitute relation with purchased commodities changes over time and cannot be ignored, a time function of the commodity relations is established on the basis of the first two steps and incorporated into the recommended commodity score as a weighted term. This mines the user's potential needs from the relations among commodities and improves both the accuracy of the recommended commodities and the richness of their categories.
The above are only preferred embodiments of the present invention and do not limit the invention in any form. Although the invention has been disclosed above by way of preferred embodiments, they are not intended to limit it. Any person skilled in the art may, without departing from the scope of the technical solution of the invention, use the technical content disclosed above to make changes or modifications that result in equivalent embodiments; any simple modifications, equivalent variations and alterations made to the above embodiments according to the technical principles of the invention still fall within the scope of the invention.