CN112925973B - Data processing method and device - Google Patents

Data processing method and device

Info

Publication number
CN112925973B
Authority
CN
China
Prior art keywords
users
user
user set
candidate
term
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201911243337.4A
Other languages
Chinese (zh)
Other versions
CN112925973A (en)
Inventor
张美娜
仲济源
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Jingdong Century Trading Co Ltd
Beijing Wodong Tianjun Information Technology Co Ltd
Original Assignee
Beijing Jingdong Century Trading Co Ltd
Beijing Wodong Tianjun Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Jingdong Century Trading Co Ltd, Beijing Wodong Tianjun Information Technology Co Ltd
Priority to CN201911243337.4A
Publication of CN112925973A
Application granted
Publication of CN112925973B
Active
Anticipated expiration

Abstract

The invention discloses a data processing method and device, and relates to the technical field of computers. The method comprises the following steps: in response to the triggering of a crowd expansion task, constructing a candidate user set for crowd expansion; extracting some users from the candidate user set according to a first extraction rule, and taking the extracted users together with a seed user set as positive sample users; extracting some users as negative sample users according to a second extraction rule; training a first machine learning model on the user characteristic data of the positive sample users and the negative sample users to obtain a trained first machine learning model; and screening an expanded user set out of the candidate user set according to the trained first machine learning model. Through these steps, the training effect of the machine learning model used in crowd expansion can be improved, and the accuracy of crowd expansion is improved accordingly.

Description

Data processing method and device
Technical Field
The present invention relates to the field of computer technologies, and in particular, to a data processing method and apparatus.
Background
Crowd expansion is commonly used in merchants' advertising or marketing campaigns. For example, the seed crowd provided by an advertiser is often small, so advertisement delivery based on the seed crowd alone suffers from limited coverage and traffic that falls short of expectations. An advertisement data platform or shopping data platform (DMP) therefore analyzes the salient features of the seed crowd, expands the seed crowd according to those features, and delivers advertisements to the expanded crowd, with the aim of improving the click conversion rate or purchase conversion rate.
Existing crowd expansion schemes fall mainly into two types. The first expands the crowd based on user portraits: portrait feature labels are assigned to users through user portrait analysis, the labels shared by most users in the seed crowd are identified, and users in the database whose labels are highly similar are classified as the expanded crowd. The second expands the crowd based on a classification algorithm: the seed crowd is taken as positive samples and the candidate crowd as negative samples, and the candidate crowd is then screened with the trained classification model to obtain the expanded crowd.
In the course of implementing the present invention, the inventors found at least the following problems in the prior art. The first scheme relies entirely on user portraits for crowd expansion and suffers from low accuracy and poor timeliness. In the second scheme, the seed crowd may have been selected by a specific rule, so using the seed crowd directly as positive samples easily causes the model to overfit; in addition, negative samples are drawn coarsely, so the model trains poorly and the final crowd expansion result suffers.
Disclosure of Invention
In view of the above, the invention provides a data processing method and device, which can improve the training effect of a machine learning model in crowd expansion and improve the accuracy of crowd expansion.
To achieve the above object, according to one aspect of the present invention, there is provided a data processing method.
The data processing method of the invention comprises the following steps: responding to the triggering of the crowd expansion task, and determining a candidate user set for crowd expansion; extracting partial users from the candidate user set according to a first extraction rule, and then taking the extracted partial users and seed user set as positive sample users; extracting part of users as negative sample users according to the second extraction rule; training a first machine learning model according to the user characteristic data of the positive sample user and the negative sample user to obtain a trained first machine learning model; and screening out an expanded user set from the candidate user set according to the trained first machine learning model.
Optionally, determining the candidate user set for crowd expansion includes: acquiring business activity information for which crowd expansion is required; and querying a database table according to the business activity information to obtain the candidate user set corresponding to the business activity information; wherein the business activity information comprises at least one of a brand identifier of a target commodity related to the business activity, a category identifier of the target commodity related to the business activity, and a shop identifier related to the business activity.
Optionally, the candidate user set includes a short-term interest user set and a medium-long-term interest user set; the short-term interest user set is a set of users screened, based on their short-term behavior characteristic data, as being interested in the target commodity; the medium-long-term interest user set is a set of users screened, based on their medium-long-term behavior characteristic data, as being interested in the target commodity.
Optionally, the short-term interest user set includes a first short-term interest user set, a second short-term interest user set, and a third short-term interest user set. The method further comprises: screening the first short-term interest user set from a first user set that has recently had a first type of operation behavior on the target commodity; determining similar commodities of the target commodity, and then screening the second short-term interest user set from a second user set that has recently had the first type of operation behavior on the similar commodities; and screening the third short-term interest user set from a third user set that has recently had a second type of operation behavior on the target commodity or the similar commodities.
Optionally, screening the first short-term interest user set from the first user set that has recently had the first type of operation behavior on the target commodity includes: acquiring the first user set that has recently had the first type of operation behavior on the target commodity; determining the preference degree of each user in the first user set for the target commodity according to a trained second machine learning model; and taking all users whose preference degree is greater than a preset threshold, or a preset number of users with the highest preference degrees, as the first short-term interest user set.
Optionally, the medium-long-term interest user set includes a first medium-long-term interest user set and a second medium-long-term interest user set. The method further comprises: counting the value distribution of the portrait labels of all users in the seed user set to determine the group portrait corresponding to the seed user set; constructing the first medium-long-term interest user set from users similar to the group portrait; and constructing the second medium-long-term interest user set from users who have not purchased the target commodity recently but purchased it at some earlier time.
Optionally, the screening the extended user set from the candidate user set according to the trained first machine learning model includes: determining the preference degree of each user in the candidate user set to the target commodity according to the trained first machine learning model; and taking all users with preference degrees larger than a preset threshold value or a preset number of users with the greatest preference degrees as an expanding user set.
To achieve the above object, according to another aspect of the present invention, there is provided a data processing apparatus.
The data processing device of the present invention includes: the determining module is used for responding to the triggering of the crowd expansion task and determining a candidate user set for crowd expansion; the extraction module is used for extracting part of users from the candidate user set according to a first extraction rule, and then taking the extracted part of users and the seed user set as positive sample users; and is further configured to extract a portion of the users as negative sample users according to the second extraction rule; the training module is used for training the first machine learning model according to the user characteristic data of the positive sample user and the negative sample user so as to obtain a trained first machine learning model; and the screening module is used for screening out an expanded user set from the candidate user set according to the trained first machine learning model.
To achieve the above object, according to still another aspect of the present invention, there is provided an electronic apparatus.
The electronic device of the present invention includes: one or more processors; and a storage means for storing one or more programs; the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the data processing method of the present invention.
To achieve the above object, according to still another aspect of the present invention, a computer-readable medium is provided.
The computer readable medium of the present invention has stored thereon a computer program which, when executed by a processor, implements the data processing method of the present invention.
One embodiment of the above invention has the following advantages or benefits: by constructing a candidate user set for crowd expansion, extracting part of users from the candidate user set according to a first extraction rule, then taking the extracted part of users and seed user set as positive sample users, extracting part of users as negative sample users according to a second extraction rule, and training a first machine learning model according to user characteristic data of the positive sample users and the negative sample users, the training effect of the machine learning model in crowd expansion can be improved, and the accuracy of crowd expansion is further improved.
Further effects of the optional implementations described above are explained below in connection with the specific embodiments.
Drawings
The drawings are included to provide a better understanding of the invention and are not to be construed as unduly limiting the invention. Wherein:
FIG. 1 is a schematic diagram of the main flow of a data processing method according to a first embodiment of the present invention;
FIG. 2 is a schematic diagram of the main flow of a data processing method according to a second embodiment of the present invention;
FIG. 3 is a schematic diagram of main modules of a data processing apparatus according to a third embodiment of the present invention;
FIG. 4 is a schematic diagram of main modules of a data processing apparatus according to a fourth embodiment of the present invention;
FIG. 5 is an exemplary system architecture diagram in which embodiments of the present invention may be applied;
FIG. 6 is a schematic diagram of a computer system suitable for use in implementing an embodiment of the invention.
Detailed Description
Exemplary embodiments of the present invention will now be described with reference to the accompanying drawings, in which various details of the embodiments of the present invention are included to facilitate understanding, and are to be considered merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the invention. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
It is noted that embodiments of the invention and features of the embodiments may be combined with each other without conflict.
Fig. 1 is a schematic diagram of the main flow of a data processing method according to a first embodiment of the present invention. As shown in fig. 1, the data processing method in the embodiment of the present invention includes:
Step S101, responding to the triggering of the crowd expansion task, and determining a candidate user set for crowd expansion.
For example, the crowd expansion task, that is, the data processing flow of this embodiment, may start executing after a crowd expansion request submitted by a demand party (such as an advertiser or a marketer) is received. The crowd expansion request may include the business activity information for which crowd expansion is required and the seed user set related to the business activity. Further, the business activity information may include at least one of the following: the brand identifier of the target commodity related to the business activity, the category identifier of the target commodity related to the business activity, and the shop identifier related to the business activity.
In an alternative embodiment, determining the candidate user set for crowd expansion includes: acquiring the business activity information for which crowd expansion is required; and querying a database table according to the business activity information to obtain the candidate user set corresponding to the business activity information. The database table stores candidate user information (such as candidate user identifiers) corresponding to the business activity information.
In a specific example of this alternative embodiment, the acquired business activity information is the brand identifier of the target commodity related to the business activity. The database table may be queried by this brand identifier to find the corresponding candidate user identifiers, and the candidate user set is then constructed from the identifiers found.
In another specific example of this alternative embodiment, the acquired business activity information is the category identifier of the target commodity related to the business activity. The database table may be queried by this category identifier to find the corresponding candidate user identifiers, and the candidate user set is then constructed from the identifiers found.
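The lookup described in step S101 can be illustrated with a minimal sketch. The in-memory dictionary below merely stands in for the database table; the table layout, the identifier values, and the function name `lookup_candidates` are hypothetical, and taking the union over several identifiers is an assumption rather than something the text specifies.

```python
from typing import Dict, Set, Tuple

# Hypothetical stand-in for the database table:
# (identifier type, identifier value) -> candidate user identifiers.
CANDIDATE_TABLE: Dict[Tuple[str, str], Set[str]] = {
    ("brand", "brand_001"): {"u1", "u2", "u3"},
    ("category", "cat_010"): {"u2", "u4"},
}


def lookup_candidates(activity_info: Dict[str, str]) -> Set[str]:
    """Collect the candidate users stored for each identifier in the request.

    Assumption: when several identifiers are given, their candidate
    sets are merged by union.
    """
    candidates: Set[str] = set()
    for id_type, id_value in activity_info.items():
        candidates |= CANDIDATE_TABLE.get((id_type, id_value), set())
    return candidates
```

A request carrying only a brand identifier would call `lookup_candidates({"brand": "brand_001"})`; an identifier with no entry in the table contributes no users.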
Step S102, extracting partial users from the candidate user set according to a first extraction rule, and taking the extracted partial users and seed user set as positive sample users; and extracting part of the users as negative sample users according to the second extraction rule.
The seed user set provided by the demand party may come from a specific strong rule, such as users who have recently purchased goods of the brand or category, so using the seed users directly as positive samples tends to make the model overfit. In view of this, in the embodiment of the present invention, some users are extracted from the candidate user set according to the first extraction rule so that the seed user set is appropriately generalized, which mitigates overfitting and enhances the generalization capability of the machine learning model.
In an alternative embodiment, the first extraction rule may include: determining the users to be extracted from the candidate user set based on the purchase conversion rate and/or click conversion rate. For example, if the candidate user set contains multiple categories of candidate users, the users who had purchase and click behaviors on the commodity brand related to the business activity in the last month may be counted first, the proportion of these users among the candidate users of each category is analyzed, and some users are extracted from the candidate users of each category according to that proportion as a supplement to the positive samples. As another example, if the candidate user set contains multiple categories of candidate users, the users who had purchase or click behaviors on the commodity category related to the business activity in the last month may be counted first, the proportion of these users among the candidate users of each category is analyzed, and some users are extracted from the candidate users of each category according to that proportion as a supplement to the positive samples.
Further, in the above alternative embodiment, the first extraction rule may also require that the number of users extracted from the candidate user set be no less than one fifth and no more than one third of the number of seed users. Setting the first extraction rule in this way keeps the generalized positive samples similar to the seed users while limiting their number so that the seed users are not diluted, which helps improve the model training effect.
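The first extraction rule can be sketched as follows. This is one possible reading, not the patented procedure: each category of candidates receives a share of the upper bound proportional to its number of converted users, and the function names, the data layout, and the use of floor division are all assumptions made for illustration.

```python
import math
from typing import Dict, List, Set, Tuple


def count_bounds(n_seed: int) -> Tuple[int, int]:
    """Return (lower, upper) for the number of extra positive samples:
    at least one fifth and at most one third of the seed users."""
    return math.ceil(n_seed / 5), n_seed // 3


def supplement_positive_samples(
    seed_users: Set[str],
    candidates_by_category: Dict[str, List[str]],
    converted_users: Set[str],
) -> List[str]:
    """Pick converted candidates per category, in proportion to each
    category's share of all converted candidates (an assumed reading)."""
    _, upper = count_bounds(len(seed_users))
    hits = {
        cat: [u for u in users if u in converted_users and u not in seed_users]
        for cat, users in candidates_by_category.items()
    }
    total_hits = sum(len(users) for users in hits.values())
    if total_hits == 0:
        return []
    picked: List[str] = []
    for users in hits.values():
        quota = upper * len(users) // total_hits  # floor keeps the sum <= upper
        picked.extend(users[:quota])
    # A user may appear in several categories: drop duplicates, keep order.
    return list(dict.fromkeys(picked))[:upper]
```

The caller would compare the length of the returned list against the lower bound from `count_bounds` and, if too few users were found, relax the conversion criterion.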
In the prior art, users other than the seed users are taken as negative samples. This selection is coarse and leads to a poor training effect. In view of this, in the embodiment of the present invention, users who are related to the positive samples yet differ substantially from them are extracted as negative sample users according to the second extraction rule, so as to improve the training effect of the model.
In an alternative embodiment, the second extraction rule includes: taking as negative sample users those users who have made purchases recently (e.g., within the past half year or one year) and who have clicked on, but not purchased, the brand or commodity category related to the business activity. In another alternative embodiment, the second extraction rule includes: taking as negative sample users those users who have made purchases recently and who have clicked on, but not purchased, the brand related to the business activity and similar brands.
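Expressed with sets, the second extraction rule might look like the sketch below. The function and parameter names are hypothetical, and excluding users who are already positive samples is an added assumption that the text does not state.

```python
from typing import Set


def select_negative_users(
    recent_buyers: Set[str],      # users with any purchase in the recent window
    clicked_target: Set[str],     # users who clicked the target brand/category
    purchased_target: Set[str],   # users who purchased the target brand/category
    positive_users: Set[str],     # seed users plus extracted candidates
) -> Set[str]:
    """Active shoppers who clicked the target but did not buy it."""
    negatives = (recent_buyers & clicked_target) - purchased_target
    # Assumption: a user cannot be both a positive and a negative sample.
    return negatives - positive_users
```

The intersection keeps the negatives related to the positive samples (they are active buyers who looked at the target), while the subtraction ensures they differ on the behavior the model is meant to predict.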
Step S103, training the first machine learning model according to the user characteristic data of the positive sample users and the negative sample users to obtain a trained first machine learning model.
The user characteristic data may be constructed from user features in one or more of the following dimensions: user portrait features, features of the user's behavior on the target commodity related to the business activity (such as purchasing, adding to the shopping cart, clicking, and following), features of the user's behavior on similar commodities, and word segmentation features of purchased commodities.
Illustratively, the first machine learning model may be an XGBoost model (XGBoost, short for eXtreme Gradient Boosting, is a boosting algorithm). The first machine learning model may also be another machine learning model without affecting the practice of the present invention.
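Before training, the positive and negative sample users have to be turned into labeled feature rows. The sketch below shows only that assembly step and deliberately stops short of fitting a model; the feature vectors, the label convention (1 for positive, 0 for negative), and the function name are assumptions.

```python
from typing import Dict, List, Set, Tuple


def build_training_set(
    features: Dict[str, List[float]],
    positives: Set[str],
    negatives: Set[str],
) -> Tuple[List[List[float]], List[int]]:
    """Return (rows, labels) with label 1 for positive and 0 for negative users.

    Users without feature data are skipped; a user listed in both sets
    is treated as positive (an assumption).
    """
    rows: List[List[float]] = []
    labels: List[int] = []
    for user in sorted(positives):
        if user in features:
            rows.append(features[user])
            labels.append(1)
    for user in sorted(negatives - positives):
        if user in features:
            rows.append(features[user])
            labels.append(0)
    return rows, labels
```

The resulting rows and labels could then be handed to any binary classifier, for instance a gradient boosting model of the kind the text mentions.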
Step S104, screening out an expanded user set from the candidate user set according to the trained first machine learning model.
In this step, the preference degree of each user in the candidate user set for the target commodity may be determined according to the trained first machine learning model; then all users whose preference degree is greater than a preset threshold, or a preset number of users with the highest preference degrees, are taken as the expanded user set. In a specific implementation, the preset threshold and the preset number may be set by the demand party, or set flexibly by the party executing the crowd expansion task according to specific business needs.
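The two screening options in step S104 (a preference threshold or a fixed number of top users) can be sketched as below. The scores are assumed to be the model's predicted preference degrees; the function name and the alphabetical tie-breaking are illustrative choices.

```python
from typing import Dict, List, Optional


def screen_users(
    scores: Dict[str, float],
    threshold: Optional[float] = None,
    top_n: Optional[int] = None,
) -> List[str]:
    """Select users by preference degree, highest first.

    Exactly one of `threshold` and `top_n` must be given.
    """
    if (threshold is None) == (top_n is None):
        raise ValueError("give exactly one of threshold or top_n")
    # Sort by descending score; ties are broken by user id for determinism.
    ranked = sorted(scores, key=lambda u: (-scores[u], u))
    if threshold is not None:
        return [u for u in ranked if scores[u] > threshold]
    return ranked[:top_n]
```

For example, `screen_users(scores, threshold=0.5)` keeps every candidate scoring above 0.5, while `screen_users(scores, top_n=1000)` keeps the 1000 highest-scoring candidates regardless of their absolute scores.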
In the embodiment of the invention, crowd expansion is realized through the above steps. Compared with the prior art, the method provided by the embodiment of the invention can improve the training effect of the machine learning model in crowd expansion and improve the accuracy of crowd expansion. In addition, the embodiment of the invention does not rely on users' social networks when expanding the crowd, which makes the crowd expansion more widely applicable.
Fig. 2 is a schematic diagram of the main flow of a data processing method according to a second embodiment of the present invention. As shown in fig. 2, the data processing method according to the embodiment of the present invention includes:
Step S201, constructing a database table, wherein the database table is used for storing the candidate user sets corresponding to commodity brand identifiers or commodity category identifiers.
In an alternative embodiment, the candidate user set includes a short-term interest user set and a medium-long-term interest user set. The short-term interest user set is a set of users screened, based on their short-term behavior characteristic data, as being interested in the target commodity; the medium-long-term interest user set is a set of users screened, based on their medium-long-term behavior characteristic data, as being interested in the target commodity. In the embodiment of the invention, the candidate user set thus takes into account not only users with a short-term interest in the target commodity but also users with a medium-long-term interest in it, so that users' long-term preferences and diversity are both considered. This improves the timeliness and accuracy of crowd expansion and thus the effect of business activities (such as advertisement delivery) carried out on the expanded crowd.
Illustratively, the short-term interest user set may include a first short-term interest user set, a second short-term interest user set, and a third short-term interest user set. The first short-term interest user set is screened from a first user set that has recently (e.g., within the last month) had a first type of operation behavior on the target commodity, and may be referred to simply as the "high potential user set of target commodity"; the second short-term interest user set is screened from a second user set that has recently had the first type of operation behavior on similar commodities of the target commodity, and may be referred to simply as the "high potential user set of similar commodity"; the third short-term interest user set is screened from a third user set that has recently had a second type of operation behavior on the target commodity or the similar commodities. The first type of operation behavior may be purchasing, adding to the shopping cart, following, clicking, and the like; the second type of operation behavior may be a search behavior, in which case the third short-term interest user set may be referred to simply as the "search recall user set". Furthermore, the short-term interest user set may include only one or two of the first to third short-term interest user sets without affecting the practice of the present invention.
Illustratively, the medium-long-term interest user set includes a first medium-long-term interest user set and a second medium-long-term interest user set. The first medium-long-term interest user set is constructed from users similar to the group portrait of the seed user set, and may be referred to simply as the "portrait tag similar user set"; the second medium-long-term interest user set is constructed from users who have not purchased the target commodity recently but purchased it at some earlier time, and may be referred to simply as the "churn user set". Furthermore, the medium-long-term interest user set may include only one of the first and second medium-long-term interest user sets without affecting the practice of the present invention.
In a specific example, the candidate set of users specifically includes: "high potential user set of target commodity", "high potential user set of similar commodity", "search recall user set", "portrait tag similar user set", and "churn user set". The construction process for these five candidate user sets is described below.
In this particular example, the construction process of the "high potential user set of target commodity" includes: acquiring a first user set that has recently (such as in the last month) had a first type of operation behavior on the target commodity; determining the preference degree of each user in the first user set for the target commodity according to the trained second machine learning model; and taking all users whose preference degree is greater than a preset threshold, or a preset number of users with the highest preference degrees, as the first short-term interest user set. The second machine learning model may be an XGBoost model (XGBoost, short for eXtreme Gradient Boosting, is a boosting algorithm). The second machine learning model may also be another machine learning model without affecting the practice of the present invention.
In this particular example, the construction process of the "high potential user set of similar commodities" includes: determining similar commodities of the target commodity, and then screening out 'high-potential user sets of similar commodities' from a second user set which has a first type of operation behaviors on the similar commodities recently.
In an alternative embodiment, similar commodities of the target commodity may be determined as follows: a correspondence table between commodity brands and similar brands is queried by the brand identifier of the target commodity, and a preset number of the similar brands found (such as the 10 or 5 most similar, or another number) are taken as the similar commodities of the target commodity. In implementation, users' behavior sequences over one month may be obtained and processed with a text processing algorithm (such as the word2vec algorithm) to obtain an embedding for each brand; the similarity between brands is then calculated from these embeddings, the similar brands of each brand are determined, and the correspondence table between commodity brands and similar brands is generated from them.
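Once brand embeddings exist, finding the most similar brands reduces to a nearest-neighbor search. The sketch below assumes the embeddings are already available as plain vectors (the word2vec training itself is not shown) and uses cosine similarity, which is a common choice but is not named in the text; all names are hypothetical.

```python
import math
from typing import Dict, List


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_similar_brands(
    embeddings: Dict[str, List[float]], brand: str, k: int
) -> List[str]:
    """Return the k brands whose embeddings are closest to `brand`."""
    target = embeddings[brand]
    scored = [
        (cosine(target, vec), name)
        for name, vec in embeddings.items()
        if name != brand
    ]
    # Highest similarity first; ties broken by brand name.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for _, name in scored[:k]]
```

Running `top_similar_brands` once per brand yields the rows of a brand-to-similar-brands correspondence table.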
In another alternative embodiment, similar commodities of the target commodity may be determined as follows: a correspondence table between commodity categories and related categories is queried by the category identifier of the target commodity, and a preset number of the related categories found are taken as the similar commodities of the target commodity. The related categories are obtained based on the "shopping basket" concept; for example, a customer buying toothpaste may also buy a toothbrush, so toothpaste and toothbrush are related categories. Specifically, frequent pattern mining may be used to mine the related categories of each commodity category. For example, user order data from the last year may be acquired first, the lift may be calculated from the order data, and the related categories of a commodity category may then be determined from the lift. The lift measures the degree of correlation between commodity categories. For example, the lift between commodity category A and commodity category B may be defined as the ratio of "the number of users who purchased both category A and category B divided by the number of users who purchased category A" to "the number of users who purchased category B divided by the total number of users".
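The lift defined above can be computed directly from per-user purchase records. In the sketch below, the mapping from user to the set of categories purchased is a hypothetical data layout, as is the function name.

```python
from typing import Dict, Set


def lift(purchases: Dict[str, Set[str]], cat_a: str, cat_b: str) -> float:
    """Lift between two categories:
    (users buying A and B / users buying A) / (users buying B / all users)."""
    total = len(purchases)
    n_a = sum(1 for cats in purchases.values() if cat_a in cats)
    n_b = sum(1 for cats in purchases.values() if cat_b in cats)
    n_ab = sum(1 for cats in purchases.values() if cat_a in cats and cat_b in cats)
    if total == 0 or n_a == 0 or n_b == 0:
        raise ValueError("lift is undefined when a count is zero")
    return (n_ab / n_a) / (n_b / total)
```

A lift above 1 means buyers of category A purchase category B more often than users in general, so B is a plausible related category for A; a lift near 1 indicates no particular association.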
In this particular example, the construction process of the "search recall user set" includes: obtaining the search keyword records of all users in the last month, finding the users who searched for the target commodity or similar commodities, and constructing the "search recall user set" from these users.
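A minimal sketch of the search recall step follows. It assumes the keyword records are (user, keyword) pairs already restricted to the last month and treats a case-insensitive substring match as "searched for the commodity"; both the record format and the matching rule are assumptions.

```python
from typing import Iterable, Set, Tuple


def search_recall_users(
    search_records: Iterable[Tuple[str, str]],
    terms: Iterable[str],
) -> Set[str]:
    """Users whose search keywords mention any target or similar-commodity term."""
    lowered = [term.lower() for term in terms]
    return {
        user
        for user, keyword in search_records
        if any(term in keyword.lower() for term in lowered)
    }
```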
In this particular example, the construction process of the portrait tag similar user set includes: counting the value distribution conditions of portrait labels corresponding to all users in a seed user set to determine group portraits corresponding to the seed user set; then, a 'portrait tag similar user set' is constructed according to users similar to the group portrait.
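One way to picture the group portrait is as the most frequent value of each portrait label among the seed users, with a candidate counted as similar when enough of its labels match. The sketch below follows that reading; the label names, the majority-value summary, and the `min_match` fraction are illustrative assumptions, since the text does not fix a similarity measure.

```python
from collections import Counter
from typing import Dict, List


def group_portrait(seed_profiles: List[Dict[str, str]]) -> Dict[str, str]:
    """Most common value of each portrait label across the seed users."""
    counters: Dict[str, Counter] = {}
    for profile in seed_profiles:
        for tag, value in profile.items():
            counters.setdefault(tag, Counter())[value] += 1
    return {tag: counts.most_common(1)[0][0] for tag, counts in counters.items()}


def similar_users(
    profiles: Dict[str, Dict[str, str]],
    portrait: Dict[str, str],
    min_match: float = 0.5,
) -> List[str]:
    """Users whose labels agree with the portrait on at least `min_match` of tags."""
    if not portrait:
        return []
    result = []
    for user, profile in profiles.items():
        matches = sum(1 for tag, value in portrait.items() if profile.get(tag) == value)
        if matches / len(portrait) >= min_match:
            result.append(user)
    return result
```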
Further, in this particular example, the construction process of the "churn user set" includes: obtaining the users who have not purchased the target commodity in the last year but purchased it at some earlier time, and constructing the "churn user set" from these users.
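The churn condition (purchased once, but not within the last year) can be sketched with two date-filtered sets. The order record format, the 365-day window, and the function name are assumptions for illustration.

```python
from datetime import date, timedelta
from typing import Iterable, Set, Tuple


def churned_users(
    target_orders: Iterable[Tuple[str, date]],
    today: date,
    window_days: int = 365,
) -> Set[str]:
    """Users who bought the target commodity before the window but not within it."""
    cutoff = today - timedelta(days=window_days)
    orders = list(target_orders)
    recent = {user for user, day in orders if day >= cutoff}
    earlier = {user for user, day in orders if day < cutoff}
    return earlier - recent
```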
Step S202, responding to the triggering of the crowd expansion task, and acquiring business activity information and seed user sets which need to be subjected to crowd expansion.
Illustratively, the crowd expansion task may start executing, that is, step S202 may start, after a crowd expansion request submitted by the demand party (such as an advertiser or a marketer) is received. The crowd expansion request may include the business activity information for which crowd expansion is required and the seed user set related to the business activity. Further, the business activity information may include the following: the brand identifier of the target commodity related to the business activity, the category identifier of the target commodity related to the business activity, and the shop identifier related to the business activity. In addition, the crowd expansion request may further include the recall ratio set by the demand party for each type of candidate user set. For example, assuming there are five candidate user sets, namely the "high potential user set of target commodity", "high potential user set of similar commodity", "search recall user set", "portrait tag similar user set", and "churn user set", the demand party can set the recall ratio flexibly, for example to 3:3:2:1:1.
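The text does not say how the recall ratio is applied. One plausible interpretation, sketched below under that assumption, is to split a total recall budget among the candidate user sets in proportion to the ratio; the function name, the set keys, and the rule for distributing leftover slots are hypothetical.

```python
from typing import Dict


def allocate_quota(ratios: Dict[str, int], total: int) -> Dict[str, int]:
    """Split `total` recalled users across candidate sets by their ratio weights."""
    weight_sum = sum(ratios.values())
    quotas = {name: total * weight // weight_sum for name, weight in ratios.items()}
    # Integer division can leave a few slots unassigned; give them to the
    # sets with the largest weights first (ties keep insertion order).
    leftover = total - sum(quotas.values())
    for name in sorted(ratios, key=ratios.get, reverse=True)[:leftover]:
        quotas[name] += 1
    return quotas
```

With the ratio 3:3:2:1:1 and a budget of 100 users, the five sets would contribute 30, 30, 20, 10, and 10 users respectively.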
Step S203, querying the database table according to the business activity information to obtain the candidate user set corresponding to the business activity information.
In one example, a database table may be queried according to brand identification of the target good to find candidate user identifications corresponding thereto, and a candidate user set is constructed based on the found candidate user identifications.
In another example, the database table may be queried according to the category identification of the target commodity to find a candidate user identification corresponding thereto, and a candidate user set may be constructed based on the found candidate user identification.
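The lookup in step S203 can be sketched with an in-memory SQLite table. The table name `candidate_users`, its columns, and the helper functions below are hypothetical; the embodiment only requires that the database table map a brand or class identification to candidate user identifications.

```python
import sqlite3


def build_table(conn, rows):
    """Create the (assumed) candidate table and load pre-computed rows."""
    conn.execute(
        "CREATE TABLE candidate_users ("
        "key_type TEXT, key_value TEXT, user_id TEXT, source_set TEXT)"
    )
    conn.executemany("INSERT INTO candidate_users VALUES (?, ?, ?, ?)", rows)


def query_candidates(conn, key_type, key_value):
    """Return the candidate user ids stored for a brand or class identifier."""
    cur = conn.execute(
        "SELECT DISTINCT user_id FROM candidate_users "
        "WHERE key_type = ? AND key_value = ?",
        (key_type, key_value),
    )
    return {row[0] for row in cur}
```

Because the table is built ahead of time, the query at task time reduces to a keyed lookup, and a user who appears in several source sets is returned only once.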
Step S204, extracting partial users from the candidate user set according to a first extraction rule, and taking the extracted partial users and seed user set as positive sample users.
Considering that the set of seed users provided by the demander may come from a specific strong rule, such as a user who has recently purchased the brand or class of goods, the direct use of seeds as a positive sample tends to cause a problem of over-fitting the model. In view of this, in the embodiment of the present invention, a part of users are extracted from the candidate user set according to the first extraction rule, and the seed user set is appropriately generalized to solve the problem of overfitting, and enhance the generalization capability of the machine learning model.
In an alternative embodiment, the first extraction rule may include: determining the users to be extracted from the candidate user set based on the purchase conversion rate and/or the click conversion rate. For example, the recent purchase conversion rate of each user in the candidate user set on the commodity brand involved in the business activity can be analyzed first, and the 10 users with the highest purchase conversion rate are selected and, together with the seed user set, taken as positive sample users. For another example, the recent click conversion rate of each user in the candidate user set on the commodity class involved in the business activity may be analyzed first, and the 20 users with the highest click conversion rate may be selected and, together with the seed user set, taken as positive sample users.
Further, in the above optional embodiment, the first extraction rule may further include: the number of users extracted from the candidate user set is not less than one fifth and not more than one third of the number of seed users. With this first extraction rule, the generalized positive samples remain similar to the seed users while their number is kept under control, so that the seed users are not diluted, which helps improve the model training effect.
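The first extraction rule can be sketched as a top-k selection by conversion rate, clamped to the one-fifth to one-third range. The function name, the `k` parameter, and the decision to skip candidates that are already seed users are assumptions for illustration.

```python
import math


def extract_generalized_positives(seed_users, candidate_rates, k):
    """Pick the top candidates by conversion rate and merge them with the seeds.

    The extracted count is clamped to [len(seeds)/5, len(seeds)/3]. For very
    small seed sets no integer may lie in that range; the lower bound then
    wins, which is an arbitrary choice made by this sketch.
    """
    seeds = set(seed_users)
    lower = math.ceil(len(seeds) / 5)
    upper = len(seeds) // 3
    n = max(lower, min(k, upper))
    ranked = sorted(
        (u for u in candidate_rates if u not in seeds),
        key=lambda u: candidate_rates[u],
        reverse=True,
    )
    extracted = ranked[:n]  # may be shorter than n if candidates run out
    return seeds | set(extracted), extracted
```

For 30 seed users the bounds are 6 and 10, so a request for the top 20 candidates is reduced to 10, and a request for the top 3 is raised to 6.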
Step S205, extracting part of users as negative sample users according to the second extraction rule.
In the prior art, all users other than the seed users are taken as negative samples. This way of selecting negative samples is coarse and leads to a poor model training effect. In view of this, in the embodiment of the present invention, users who are related to the positive samples yet differ markedly from them are extracted as negative sample users according to the second extraction rule, so as to improve the training effect of the model.
In an alternative embodiment, the second extraction rule includes: users who have purchased recently (e.g., within the last half year or the last year) and have clicked on the brands or items related to the business activity, but have not purchased them, are taken as negative sample users. In another alternative embodiment, the second extraction rule includes: users who have purchased recently and have clicked on the brands involved in the business activity and on similar brands, but have not purchased them, are taken as negative sample users.
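A minimal sketch of the second extraction rule over a behavior log follows. The event tuple layout `(user, action, brand, day)`, the action names, the 180-day default window, and the `exclude` parameter (used here to keep positive sample users out of the negatives) are assumptions, not details given by the embodiment.

```python
from datetime import date, timedelta


def select_negatives(events, target_brands, today, window_days=180, exclude=()):
    """Users who bought something recently and clicked a target brand,
    but did not buy any target brand within the same window."""
    cutoff = today - timedelta(days=window_days)
    bought_any, clicked_target, bought_target = set(), set(), set()
    for user, action, brand, day in events:
        if day < cutoff:
            continue  # ignore behavior outside the recent window
        if action == "purchase":
            bought_any.add(user)
            if brand in target_brands:
                bought_target.add(user)
        elif action == "click" and brand in target_brands:
            clicked_target.add(user)
    return (bought_any & clicked_target) - bought_target - set(exclude)
```

The second variant of the rule, which also covers similar brands, could be expressed by passing both the target brands and their similar brands in `target_brands`.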
Step S206, training the first machine learning model according to the user characteristic data of the positive sample users and the negative sample users to obtain a trained first machine learning model.
Wherein the user feature data may be constructed based on user features in one or more of the following dimensions: user portrayal features, user behavior features on target merchandise related to business activities (such as purchasing, shopping cart adding, clicking, focusing, etc.), user behavior features on similar merchandise, and word segmentation features on purchased merchandise.
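Before training, the user feature data of the positive and negative sample users has to be arranged into labeled rows. The sketch below shows one possible layout; the feature names, the zero fill for missing features, and the function name `build_training_set` are assumptions for illustration.

```python
def build_training_set(user_features, positive_users, negative_users):
    """Assemble a dense feature matrix and 0/1 labels for the sample users.

    `user_features` maps a user id to a dict of numeric features; features a
    user lacks are filled with 0.0 (an assumed default).
    """
    positives, negatives = set(positive_users), set(negative_users)
    if positives & negatives:
        raise ValueError("a user cannot be both a positive and a negative sample")
    labeled = [(u, 1) for u in sorted(positives)] + [(u, 0) for u in sorted(negatives)]
    columns = sorted({name for u, _ in labeled for name in user_features.get(u, {})})
    rows = [
        [float(user_features.get(u, {}).get(c, 0.0)) for c in columns]
        for u, _ in labeled
    ]
    labels = [label for _, label in labeled]
    return columns, rows, labels
```

The resulting `rows` and `labels` are in the tabular form that gradient-boosting libraries and most other classifiers accept as training input.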
Illustratively, the first machine learning model may be an XGBoost model (XGBoost, short for eXtreme Gradient Boosting, is a boosting algorithm). The first machine learning model may also be another machine learning model without affecting the practice of the present invention.
Step S207, screening out an expanded user set from the candidate user set according to the trained first machine learning model.
In this step, the preference degree of each user in the candidate user set for the target commodity can be determined according to the trained first machine learning model; then, either all users whose preference degree is greater than a preset threshold, or a preset number of users with the greatest preference degree, are taken as the expanded user set. In specific implementations, the preset threshold and the preset number can be set by the demand party, or flexibly set by the executing party of the crowd expansion task according to specific business needs.
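The two screening options in step S207 can be sketched as follows, assuming the trained model has already produced a preference score per candidate user. The function name and the requirement that exactly one of `threshold` or `top_n` be given are choices made for this sketch.

```python
import heapq


def screen_expanded_users(scores, threshold=None, top_n=None):
    """Select users by score threshold or by top-N, highest score first."""
    if (threshold is None) == (top_n is None):
        raise ValueError("specify exactly one of threshold or top_n")
    if threshold is not None:
        kept = [u for u, s in scores.items() if s > threshold]
        return sorted(kept, key=lambda u: scores[u], reverse=True)
    # Iterating a dict yields its keys; rank them by their score.
    return heapq.nlargest(top_n, scores, key=scores.get)
```

A threshold yields an expanded set whose size varies with the score distribution, whereas top-N fixes the size in advance, which may suit a campaign with a fixed delivery budget.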
Further, the method of the embodiment of the invention may further include: evaluating the quality of the expanded user set screened out by the trained first machine learning model. In specific implementations, the evaluation can be based on multiple metrics such as accuracy and recall.
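One way to carry out such an evaluation is to compare the expanded user set with the users who later converted. The sketch below reads "accuracy" as precision over the expanded set, which is an interpretation rather than something the embodiment states; the function name and the notion of a `converted` set are likewise assumed.

```python
def precision_recall(expanded, converted):
    """Precision and recall of the expanded set against converted users."""
    expanded, converted = set(expanded), set(converted)
    hits = len(expanded & converted)
    precision = hits / len(expanded) if expanded else 0.0
    recall = hits / len(converted) if converted else 0.0
    return precision, recall
```

For example, if two of four expanded users are among three converted users, precision is 0.5 and recall is 2/3.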
In the embodiment of the invention, crowd expansion is realized through the above steps. Compared with the prior art, the method provided by the embodiment of the invention can improve the training effect of the machine learning model in crowd expansion and improve the accuracy of crowd expansion. In addition, the embodiment of the invention does not need to rely on users' social networks when performing crowd expansion, which improves the universality of crowd expansion.
Fig. 3 is a schematic diagram of main modules of a data processing apparatus according to a third embodiment of the present invention. As shown in fig. 3, a data processing apparatus 300 of an embodiment of the present invention includes: a determining module 301, an extracting module 302, a training module 303 and a screening module 304.
The determining module 301 is configured to determine a candidate user set for crowd expansion in response to a trigger of the crowd expansion task.
For example, the data processing apparatus 300 may start performing the crowd expansion task after receiving the crowd expansion request submitted by the demand party (such as an advertiser or a marketer), that is, start determining, by the determining module, the candidate user set for crowd expansion. Wherein, the crowd expansion request may include: business activity information that needs to be expanded by the crowd, and a seed user set to which the business activity relates. Further, the business activity information needing crowd expansion may include at least one of the following information: the brand identification of the target commodity related to the business activity, the class identification of the target commodity related to the business activity and the shop identification related to the business activity.
In an alternative embodiment, the determining module 301 determining the candidate user set for crowd expansion includes: the determining module 301 obtains the business activity information for which crowd expansion is needed; the determining module 301 queries the database table according to the business activity information to obtain the candidate user set corresponding to that information. The database table stores candidate user information (such as candidate user identifications) corresponding to the business activity information.
In a specific example of this alternative embodiment, the business activity information obtained by the determining module 301 is specifically a brand identifier of a target commodity related to the business activity, and then the determining module 301 may query the database table according to the brand identifier of the target commodity to find a candidate user identifier corresponding to the brand identifier, and further construct a candidate user set based on the found candidate user identifier.
In another specific example of this alternative embodiment, the business activity information obtained by the determining module 301 is specifically a category identifier of a target commodity involved in the business activity, and then the determining module 301 may query the database table according to the category identifier of the target commodity to find a candidate user identifier corresponding to the category identifier, and further construct a candidate user set based on the found candidate user identifier.
An extraction module 302, configured to extract a part of users from the candidate user set according to a first extraction rule, and then use the extracted part of users and seed user set as positive sample users; and extracting part of the users as negative sample users according to the second extraction rule.
Considering that the set of seed users provided by the demander may come from a specific strong rule, such as a user who has recently purchased the brand or class of goods, the direct use of seeds as a positive sample tends to cause a problem of over-fitting the model. In view of this, in an embodiment of the present invention, a portion of the users are extracted from the candidate user set by the extraction module 302 according to the first extraction rule, and the seed user set is appropriately generalized to solve the problem of overfitting and enhance the generalization capability of the machine learning model.
In an alternative embodiment, the first extraction rule may include: determining the users to be extracted from the candidate user set based on the purchase conversion rate and/or the click conversion rate. For example, if the candidate user set contains multiple categories of candidate users, the users who purchased and clicked on the commodity brand involved in the business activity in the last month can be counted first, their proportion among the candidate users of each category is analyzed, and part of the candidate users of each category are extracted according to that proportion as a supplement to the positive samples. For another example, if the candidate user set contains multiple categories of candidate users, the users who purchased or clicked on the commodity class involved in the business activity in the last month can be counted first, their proportion among the candidate users of each category is analyzed, and part of the candidate users of each category are extracted according to that proportion as a supplement to the positive samples.
Further, in the above optional embodiment, the first extraction rule may further include: the number of users extracted from the candidate user set is not less than one fifth and not more than one third of the number of seed users. With this first extraction rule, the generalized positive samples remain similar to the seed users while their number is kept under control, so that the seed users are not diluted, which helps improve the model training effect.
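The per-category extraction can be read in more than one way. The following sketch takes one reading: each category's share of the extraction quota is proportional to the fraction of its candidates who showed the qualifying behavior, and the extracted users are drawn from those active candidates. The function name, the `total` quota, and the deterministic ordering by user id are all assumptions.

```python
def stratified_extract(categories, active_users, total):
    """Extract positive-sample supplements per candidate category.

    `categories` maps a category name to its candidate user ids;
    `active_users` are users with the qualifying purchase/click behavior.
    """
    active = set(active_users)
    ratios = {
        name: (len(active & set(members)) / len(members)) if members else 0.0
        for name, members in categories.items()
    }
    ratio_sum = sum(ratios.values())
    extracted = {}
    for name, members in categories.items():
        # Quotas are floored, so they may sum to slightly less than `total`.
        quota = int(total * ratios[name] / ratio_sum) if ratio_sum else 0
        pool = sorted(active & set(members))
        extracted[name] = pool[:quota]
    return ratios, extracted
```

Under this reading, a category in which three quarters of the candidates were recently active contributes three times as many supplementary positive samples as a category in which one quarter were.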
In the prior art, all users other than the seed users are taken as negative samples. This way of selecting negative samples is coarse and leads to a poor model training effect. In view of this, in the embodiment of the present invention, users who are related to the positive samples yet differ markedly from them are extracted as negative sample users by the extraction module 302 according to the second extraction rule, so as to improve the training effect of the model.
In an alternative embodiment, the second extraction rule includes: users who have purchased recently (e.g., within the last half year or the last year) and have clicked on the brands or items related to the business activity, but have not purchased them, are taken as negative sample users. In another alternative embodiment, the second extraction rule includes: users who have purchased recently and have clicked on the brands involved in the business activity and on similar brands, but have not purchased them, are taken as negative sample users.
And the training module 303 is configured to train the first machine learning model according to the user characteristic data of the positive sample user and the negative sample user, so as to obtain a trained first machine learning model.
Wherein the user feature data may be constructed based on user features in one or more of the following dimensions: user portrayal features, user behavior features on target merchandise related to business activities (such as purchasing, shopping cart adding, clicking, focusing, etc.), user behavior features on similar merchandise, and word segmentation features on purchased merchandise.
Illustratively, the first machine learning model may be an XGBoost model (XGBoost, short for eXtreme Gradient Boosting, is a boosting algorithm). The first machine learning model may also be another machine learning model without affecting the practice of the present invention.
A screening module 304, configured to screen out an expanded user set from the candidate user set according to the trained first machine learning model.
Illustratively, the screening module 304 may determine the preference degree of each user in the candidate user set for the target commodity based on the trained first machine learning model; then, the screening module 304 takes either all users whose preference degree is greater than a preset threshold, or a preset number of users with the greatest preference degree, as the expanded user set. In specific implementations, the preset threshold and the preset number can be set by the demand party, or flexibly set by the executing party of the crowd expansion task according to specific business needs.
In the embodiment of the invention, crowd expansion is realized through the above device. Compared with the prior art, the device provided by the embodiment of the invention can improve the training effect of the machine learning model in crowd expansion and improve the accuracy of crowd expansion. In addition, the embodiment of the invention does not need to rely on users' social networks when performing crowd expansion, which improves the universality of crowd expansion.
Fig. 4 is a schematic diagram of main modules of a data processing apparatus according to a fourth embodiment of the present invention. As shown in fig. 4, a data processing apparatus 400 of an embodiment of the present invention includes: a construction module 401, a determination module 402, an extraction module 403, a training module 404, a screening module 405.
A construction module 401, configured to construct a database table, where the database table is configured to store a candidate user set corresponding to a brand identifier or a category identifier of a commodity.
In an alternative embodiment, the candidate user set includes: a short-term interest user set and a medium-long-term interest user set. The short-term interest user set is a set of users interested in the target commodity that is screened out based on the users' short-term behavior characteristic data; the medium-long-term interest user set is a set of users interested in the target commodity that is screened out based on the users' medium-long-term behavior characteristic data. In the embodiment of the invention, determining the candidate user set considers not only users interested in the target commodity in the short term but also users interested in it over the long term. Taking users' long-term preferences and diversity into account improves the timeliness and accuracy of crowd expansion and, in turn, the effect of business activities (such as advertisement delivery) carried out on the expanded crowd.
Illustratively, the short-term interest user set may include: a first short-term interest user set, a second short-term interest user set, and a third short-term interest user set. The first short-term interest user set is a set of users screened from a first user set that recently (e.g., within the last month) had a first type of operation behavior on the target commodity, and may be referred to simply as the "high-potential user set of the target commodity"; the second short-term interest user set is a set of users screened from a second user set that recently had a first type of operation behavior on commodities similar to the target commodity, and may be referred to simply as the "high-potential user set of similar commodities"; the third short-term interest user set is a set of users screened from a third user set that recently had a second type of operation behavior on the target commodity or the similar commodities. The first type of operation behavior may be purchasing, adding to the shopping cart, following, clicking, and the like; the second type of operation behavior may be a search behavior. When the second type of operation behavior is a search behavior, the third short-term interest user set may be referred to simply as the "search recall user set". Furthermore, the short-term interest user set may also include only one or two of the first to third short-term interest user sets without affecting the practice of the present invention.
Illustratively, the medium-long term interest user set includes: a first set of medium-and-long-term-interest users, and a second set of medium-and-long-term-interest users. Wherein the first medium-long term interest user set is a user set constructed based on users similar to the group portraits of the seed user set, which may be referred to simply as a "portrayal tag similar user set"; the second medium-and-long-term-interest user set is a user set constructed based on users who have recently had no purchase activity for the target commodity, but who have once had purchase activity for the target commodity, which may be simply referred to as a "churn user set". Furthermore, the set of mid-to-long-term-interest users may also include only one of the first through second sets of mid-to-long-term-interest users without affecting the practice of the present invention.
The determining module 402 is configured to determine a candidate user set for crowd expansion in response to triggering of the crowd expansion task.
For example, the data processing apparatus 400 may begin determining, by the determination module, a candidate set of users for crowd expansion upon receipt of a crowd expansion request submitted by a demand party (such as an advertiser or marketer). Wherein, the crowd expansion request may include: business activity information that needs to be expanded by the crowd, and a seed user set to which the business activity relates. Further, the business activity information needing crowd expansion may include at least one of the following information: the brand identification of the target commodity related to the business activity, the class identification of the target commodity related to the business activity and the shop identification related to the business activity.
In an embodiment of the present invention, the determining module 402 determining the candidate user set for crowd expansion includes: the determining module 402 obtains the business activity information for which crowd expansion is needed; the determining module 402 queries the database table according to the business activity information to obtain the candidate user set corresponding to that information. The database table stores candidate user information (such as candidate user identifications) corresponding to the business activity information.
An extraction module 403, configured to extract a part of users from the candidate user set according to a first extraction rule, and then use the extracted part of users and seed user set as positive sample users; and extracting part of the users as negative sample users according to the second extraction rule.
Considering that the set of seed users provided by the demander may come from a specific strong rule, such as a user who has recently purchased the brand or class of goods, the direct use of seeds as a positive sample tends to cause a problem of over-fitting the model. In view of this, in the embodiment of the present invention, a portion of the users are extracted from the candidate user set by the extraction module 403 according to the first extraction rule, and the seed user set is appropriately generalized to solve the problem of overfitting and enhance the generalization capability of the machine learning model.
In the prior art, all users other than the seed users are taken as negative samples. This way of selecting negative samples is coarse and leads to a poor model training effect. In view of this, in the embodiment of the present invention, users who are related to the positive samples yet differ markedly from them are extracted as negative sample users by the extraction module 403 according to the second extraction rule, so as to improve the training effect of the model.
The training module 404 is configured to train the first machine learning model according to the user feature data of the positive sample user and the negative sample user, so as to obtain a trained first machine learning model.
Wherein the user feature data may be constructed based on user features of the following dimensions: user portrayal features, user behavior features on target merchandise related to business activities (such as purchasing, shopping cart adding, clicking, focusing, etc.), user behavior features on similar merchandise, and word segmentation features on purchased merchandise. In specific implementation, the user characteristic data can be pre-built, for example, the step of building the user characteristic data can be routinely executed every day, so that after the crowd expansion task is triggered, the user characteristic data can be directly obtained from the database, and the execution efficiency of the crowd expansion task is improved.
Illustratively, the first machine learning model may be an XGBoost model (XGBoost, short for eXtreme Gradient Boosting, is a boosting algorithm). The first machine learning model may also be another machine learning model without affecting the practice of the present invention.
A screening module 405, configured to screen out an expanded user set from the candidate user set according to the trained first machine learning model.
Illustratively, the screening module 405 may determine the preference degree of each user in the candidate user set for the target commodity based on the trained first machine learning model; then, the screening module 405 takes either all users whose preference degree is greater than a preset threshold, or a preset number of users with the greatest preference degree, as the expanded user set. In specific implementations, the preset threshold and the preset number can be set by the demand party, or flexibly set by the executing party of the crowd expansion task according to specific business needs.
In the embodiment of the invention, crowd expansion is realized through the above device. Compared with the prior art, the device provided by the embodiment of the invention can improve the training effect of the machine learning model in crowd expansion and improve the accuracy of crowd expansion. In addition, the embodiment of the invention does not need to rely on users' social networks when performing crowd expansion, which improves the universality of crowd expansion.
Fig. 5 illustrates an exemplary system architecture 500 in which a data processing method or data processing apparatus of an embodiment of the present invention may be applied.
As shown in fig. 5, the system architecture 500 may include terminal devices 501, 502, 503, a network 504, and a server 505. The network 504 is used as a medium to provide communication links between the terminal devices 501, 502, 503 and the server 505. The network 504 may include various connection types, such as wired, wireless communication links, or fiber optic cables, among others.
A user may interact with the server 505 via the network 504 using the terminal devices 501, 502, 503 to receive or send messages or the like. Various communication client applications, such as shopping class applications, web browser applications, search class applications, instant messaging tools, mailbox clients, social platform software, etc., may be installed on the terminal devices 501, 502, 503.
The terminal devices 501, 502, 503 may be a variety of electronic devices having a display screen and supporting web browsing, including but not limited to smartphones, tablets, laptop and desktop computers, and the like.
The server 505 may be a server providing various services, such as a background management server providing support for crowd expansion tasks submitted by users using the terminal devices 501, 502, 503. The background management server can perform analysis and other processing after receiving the crowd expansion task, and feed back a processing result (for example, an expansion user set) to the terminal equipment.
It should be noted that, the data processing method provided by the embodiment of the present invention is generally executed by the server 505, and accordingly, the data processing apparatus is generally disposed in the server 505.
It should be understood that the number of terminal devices, networks and servers in fig. 5 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
Referring now to FIG. 6, there is illustrated a schematic diagram of a computer system 600 suitable for use in implementing an electronic device of an embodiment of the present invention. The computer system shown in fig. 6 is merely an example, and should not be construed as limiting the functionality and scope of use of embodiments of the present invention.
As shown in fig. 6, the computer system 600 includes a Central Processing Unit (CPU) 601, which can perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM) 602 or a program loaded from a storage section 608 into a Random Access Memory (RAM) 603. In the RAM 603, various programs and data required for the operation of the system 600 are also stored. The CPU 601, ROM 602, and RAM 603 are connected to each other through a bus 604. An input/output (I/O) interface 605 is also connected to bus 604.
The following components are connected to the I/O interface 605: an input portion 606 including a keyboard, mouse, etc.; an output portion 607 including a Cathode Ray Tube (CRT), a Liquid Crystal Display (LCD), and the like, a speaker, and the like; a storage section 608 including a hard disk and the like; and a communication section 609 including a network interface card such as a LAN card, a modem, or the like. The communication section 609 performs communication processing via a network such as the internet. The drive 610 is also connected to the I/O interface 605 as needed. Removable media 611 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like is installed as needed on drive 610 so that a computer program read therefrom is installed as needed into storage section 608.
In particular, according to embodiments of the present disclosure, the processes described above with reference to flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method shown in the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network through the communication portion 609, and/or installed from the removable medium 611. The above-described functions defined in the system of the present invention are performed when the computer program is executed by a Central Processing Unit (CPU) 601.
The computer readable medium shown in the present invention may be a computer readable signal medium or a computer readable storage medium, or any combination of the two. The computer readable storage medium can be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples of the computer-readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In the present invention, however, the computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, with the computer-readable program code embodied therein. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, RF, etc., or any suitable combination of the foregoing.
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The modules involved in the embodiments of the present invention may be implemented in software or in hardware. The described modules may also be provided in a processor; for example, a processor may be described as including a determination module, an extraction module, a training module, and a screening module. In some cases the names of these modules do not limit the modules themselves; for example, the determination module may also be described as a "module for determining a candidate user set".
As another aspect, the present invention also provides a computer-readable medium, which may be contained in the apparatus described in the above embodiments, or may exist separately without being assembled into the apparatus. The computer-readable medium carries one or more programs which, when executed by a device, cause the device to perform the following: in response to the triggering of a crowd expansion task, determining a candidate user set for crowd expansion; extracting some users from the candidate user set according to a first extraction rule, and taking the extracted users together with the seed user set as positive sample users; extracting some users as negative sample users according to a second extraction rule; training a first machine learning model on the user feature data of the positive sample users and the negative sample users to obtain a trained first machine learning model; and screening an expanded user set out of the candidate user set according to the trained first machine learning model.
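The steps listed above can be illustrated with a minimal sketch. It is not the implementation of the present invention: the function names (`expand_crowd`, `train_logistic`), the two extraction rules (nearest to, and farthest from, the centroid of the seed users), the choice to draw negative samples from the candidate user set, and the use of a small logistic regression as the first machine learning model are all assumptions made purely for illustration.

```python
import math
from typing import Dict, List, Sequence, Tuple

Vector = List[float]


def centroid(vectors: Sequence[Vector]) -> Vector:
    """Component-wise mean of a non-empty list of equal-length vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def distance(a: Vector, b: Vector) -> float:
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def score(w: Vector, b: float, x: Vector) -> float:
    """Predicted probability that feature vector x is a positive sample."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


def train_logistic(xs: Sequence[Vector], ys: Sequence[int],
                   epochs: int = 500, lr: float = 0.5) -> Tuple[Vector, float]:
    """Stand-in for the 'first machine learning model': logistic regression
    fitted by plain stochastic gradient descent (an assumption)."""
    w = [0.0] * len(xs[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(xs, ys):
            err = score(w, b, x) - y
            w = [wi - lr * err * xi for wi, xi in zip(w, x)]
            b -= lr * err
    return w, b


def expand_crowd(seed: Dict[str, Vector], candidates: Dict[str, Vector],
                 n_extra_pos: int, n_neg: int,
                 threshold: float = 0.5) -> List[str]:
    """Return the ids of candidate users kept by the trained model."""
    if n_extra_pos + n_neg > len(candidates):
        raise ValueError("not enough candidates for the requested samples")
    center = centroid(list(seed.values()))
    ranked = sorted(candidates, key=lambda u: distance(candidates[u], center))
    # Assumed first extraction rule: candidates closest to the seed centroid.
    extra_pos = ranked[:n_extra_pos]
    # Assumed second extraction rule: candidates farthest from the centroid.
    negatives = ranked[len(ranked) - n_neg:]
    xs = list(seed.values()) + [candidates[u] for u in extra_pos + negatives]
    ys = [1] * (len(seed) + len(extra_pos)) + [0] * len(negatives)
    w, b = train_logistic(xs, ys)
    # Screening step: keep candidates the model scores at or above threshold.
    return [u for u in ranked if score(w, b, candidates[u]) >= threshold]


if __name__ == "__main__":
    seed_users = {"s1": [1.0, 0.9], "s2": [0.9, 1.0]}
    candidate_users = {"c1": [0.95, 0.95], "c2": [0.8, 0.85],
                       "c3": [0.1, 0.0], "c4": [0.0, 0.2], "c5": [0.7, 0.9]}
    print(expand_crowd(seed_users, candidate_users, n_extra_pos=1, n_neg=2))
```

In this sketch the extracted candidates only enlarge the training data; the screening step still scores every candidate user, so users that were never sampled can also enter the expanded user set.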
The above embodiments do not limit the scope of the present invention. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives can occur depending upon design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present invention should be included in the scope of the present invention.

Claims (9)

CN201911243337.4A | 2019-12-06 | 2019-12-06 | Data processing method and device | Active | CN112925973B (en)

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN201911243337.4A, CN112925973B (en) | 2019-12-06 | 2019-12-06 | Data processing method and device

Applications Claiming Priority (1)

Application Number | Priority Date | Filing Date | Title
CN201911243337.4A, CN112925973B (en) | 2019-12-06 | 2019-12-06 | Data processing method and device

Publications (2)

Publication Number | Publication Date
CN112925973A (en) | 2021-06-08
CN112925973B (en) | 2024-06-18

Family

ID=76161682

Family Applications (1)

Application Number | Title | Priority Date | Filing Date
CN201911243337.4A (Active), CN112925973B (en) | Data processing method and device | 2019-12-06 | 2019-12-06

Country Status (1)

Country | Link
CN (1) | CN112925973B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN113407763B (en) * | 2021-06-24 | 2025-04-04 | 腾讯音乐娱乐科技(深圳)有限公司 | Hot music mining method, electronic device and computer readable storage medium
CN114090401B (en) * | 2021-11-01 | 2024-09-10 | 支付宝(杭州)信息技术有限公司 | Method and device for processing user behavior sequence
CN114706864B (en) * | 2022-03-04 | 2022-11-01 | 阿波罗智能技术(北京)有限公司 | Model updating method and device for automatically mining scene data and storage medium
CN114792256B (en) * | 2022-06-23 | 2023-05-26 | 上海维智卓新信息科技有限公司 | Crowd expansion method and device based on model selection
TWI871662B (en) * | 2023-06-13 | 2025-02-01 | 鼎鼎企業管理顧問股份有限公司 | The intelligence analysis system for retail marketing
CN119338530A (en) * | 2024-10-21 | 2025-01-21 | 广州钛动科技股份有限公司 | Advertisement crowd expansion method, device, equipment and medium based on CatBoost model

Citations (1)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN107273454A (en) * | 2017-05-31 | 2017-10-20 | 北京京东尚科信息技术有限公司 | User data sorting technique, device, server and computer-readable recording medium

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN105447730B (en) * | 2015-12-25 | 2020-11-06 | 腾讯科技(深圳)有限公司 | Target user orientation method and device
CN108205766A (en) * | 2016-12-19 | 2018-06-26 | 阿里巴巴集团控股有限公司 | Information-pushing method, apparatus and system
CN108427690B (en) * | 2017-02-15 | 2022-09-13 | 腾讯科技(深圳)有限公司 | Information delivery method and device
US20190102674A1 (en) * | 2017-09-29 | 2019-04-04 | Here Global B.V. | Method, apparatus, and system for selecting training observations for machine learning models

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN107273454A (en) * | 2017-05-31 | 2017-10-20 | 北京京东尚科信息技术有限公司 | User data sorting technique, device, server and computer-readable recording medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Overview of a user profiling system based on big data; 徐璐瑶; 姜增祺; 黄婷婷; 刘云鹏; 电子世界; 2018-01-23 (02); full text *

Also Published As

Publication number | Publication date
CN112925973A (en) | 2021-06-08

Similar Documents

Publication | Publication Date | Title
CN112925973B (en) | | Data processing method and device
CN108665329B (en) | | Commodity recommendation method based on user browsing behavior
CN109145280B (en) | | Information pushing method and device
CN113450172B (en) | | Commodity recommendation method and device
CN107426328B (en) | | Information pushing method and device
CN110059172B (en) | | Method and device for recommending answers based on natural language understanding
CN112418932B (en) | | Marketing information pushing method and device based on user tag
CN108243219B (en) | | Information pushing method and device
CN110020162B (en) | | User identification method and device
CN107274209A (en) | | The method and apparatus for predicting advertising campaign sales data
CN112749323B (en) | | Method and device for constructing user portrait
CN113516524B (en) | | Method and device for pushing information
CN113077292B (en) | | User classification method and device, storage medium and electronic equipment
CN115423555A (en) | | Commodity recommendation method and device, electronic equipment and storage medium
CN110232581B (en) | | Method and device for providing coupons for users
CN110827101B (en) | | Shop recommending method and device
CN113378043A (en) | | User screening method and device
CN111787042B (en) | | Method and device for pushing information
CN108512674B (en) | | Method, device and equipment for outputting information
CN110555745B (en) | | Information pushing method and system, computer system and computer readable storage medium
CN111125514B (en) | | Method, device, electronic equipment and storage medium for analyzing user behaviors
CN113327145B (en) | | Article recommendation method and device
CN112784861B (en) | | Similarity determination method, device, electronic equipment and storage medium
CN113781062A (en) | | User label display method and device
CN113269600A (en) | | Information sending method and device

Legal Events

Date | Code | Title | Description
 | PB01 | Publication | Publication
 | SE01 | Entry into force of request for substantive examination | Entry into force of request for substantive examination
 | GR01 | Patent grant | Patent grant
