Detailed Description
Exemplary embodiments of the present invention will now be described with reference to the accompanying drawings, in which various details of the embodiments of the present invention are included to facilitate understanding, and are to be considered merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the invention. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
It is noted that the embodiments of the invention and the features of those embodiments may be combined with each other provided that they do not conflict.
Fig. 1 is a schematic diagram of the main flow of a data processing method according to a first embodiment of the present invention. As shown in Fig. 1, the data processing method in the embodiment of the present invention includes:
Step S101, in response to the triggering of a crowd expansion task, determining a candidate user set for crowd expansion.
For example, execution of the crowd expansion task, that is, of the data processing flow of the embodiment of the invention, may begin after a crowd expansion request submitted by a demand party (such as an advertiser or a marketer) is received. The crowd expansion request may include: information on the business activity for which crowd expansion is needed, and a seed user set related to the business activity. Further, the business activity information may include at least one of the following: the brand identifier of the target commodity involved in the business activity, the category (class) identifier of the target commodity involved in the business activity, and the identifier of the shop involved in the business activity.
In an alternative embodiment, determining the candidate user set for crowd expansion includes: acquiring the business activity information for which crowd expansion is needed; and querying a database table according to the business activity information to obtain the candidate user set corresponding to the business activity information. The database table stores candidate user information (such as candidate user identifiers) corresponding to the business activity information.
In a specific example of this alternative embodiment, the acquired business activity information is the brand identifier of the target commodity involved in the business activity. The database table may then be queried according to that brand identifier to find the corresponding candidate user identifiers, and the candidate user set is constructed from the candidate user identifiers found.
In another specific example of this alternative embodiment, the acquired business activity information is the category identifier of the target commodity involved in the business activity. The database table may then be queried according to that category identifier to find the corresponding candidate user identifiers, and the candidate user set is constructed from the candidate user identifiers found.
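By way of illustration only, the lookup described in the two examples above might be sketched as follows. The table name `candidate_users`, its columns `key_type`, `key_id` and `user_id`, and the identifiers `B001` and `C010` are hypothetical and are not taken from the embodiment; an in-memory SQLite database merely stands in for whatever storage an implementation uses.

```python
import sqlite3

def build_candidate_table(conn, rows):
    """Create and fill a hypothetical table mapping brand/category keys to candidate users."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS candidate_users ("
        "key_type TEXT, key_id TEXT, user_id TEXT)"
    )
    conn.executemany("INSERT INTO candidate_users VALUES (?, ?, ?)", rows)
    conn.commit()

def query_candidate_set(conn, key_type, key_id):
    """Return the candidate user identifiers stored for a brand or category identifier."""
    cursor = conn.execute(
        "SELECT user_id FROM candidate_users WHERE key_type = ? AND key_id = ?",
        (key_type, key_id),
    )
    return {row[0] for row in cursor}

# Hypothetical data: two brand rows and two category rows.
conn = sqlite3.connect(":memory:")
build_candidate_table(conn, [
    ("brand", "B001", "u1"),
    ("brand", "B001", "u2"),
    ("category", "C010", "u2"),
    ("category", "C010", "u3"),
])
brand_candidates = query_candidate_set(conn, "brand", "B001")
category_candidates = query_candidate_set(conn, "category", "C010")
```

Keying the table on a (type, identifier) pair lets the same query serve both the brand-based and the category-based example.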
Step S102, extracting some users from the candidate user set according to a first extraction rule, and taking the extracted users together with the seed user set as positive sample users; and extracting some users as negative sample users according to a second extraction rule.
Considering that the seed user set provided by the demand party may come from a specific strong rule, such as users who have recently purchased goods of the brand or category, directly using only the seed users as positive samples tends to cause the model to overfit. In view of this, in the embodiment of the present invention, some users are extracted from the candidate user set according to the first extraction rule so that the seed user set is appropriately generalized, which addresses the overfitting problem and enhances the generalization capability of the machine learning model.
In an alternative embodiment, the first extraction rule may include: determining the users to be extracted from the candidate user set based on the purchase conversion rate and/or the click conversion rate. For example, if the candidate user set contains several types of candidate users, the users who have both purchased and clicked on the commodity brand involved in the business activity within the last month may be counted first; the proportion of these users among the candidate users of each type is then analyzed, and some of the candidate users of each type are extracted according to that proportion as a supplement to the positive samples. As another example, if the candidate user set contains several types of candidate users, the users who have purchased or clicked on the commodity category involved in the business activity within the last month may be counted first; the proportion of these users among the candidate users of each type is then analyzed, and some of the candidate users of each type are extracted according to that proportion as a supplement to the positive samples.
Further, in the above alternative embodiment, the first extraction rule may further include: the number of users extracted from the candidate user set is not less than one fifth and not more than one third of the number of seed users. With the first extraction rule set in this way, the number of generalized positive samples can be controlled while ensuring that they are similar to the seed users, so that the seed users are not diluted, which helps improve the model training effect.
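The following is a minimal sketch of one possible reading of the first extraction rule, not the implementation of the embodiment. It assumes that "according to the proportion" means each candidate type contributes in proportion to its share of recently active users, and it applies the one-fifth and one-third bounds to the total number of supplementary users. The function name, parameter names and sample data are hypothetical.

```python
import math
import random

def supplement_positive_samples(candidates_by_type, active_users, seed_users, rng=None):
    """Draw extra positive samples from each candidate type in proportion to its active users."""
    rng = rng or random.Random(0)  # fixed seed keeps the sketch reproducible
    lower = math.ceil(len(seed_users) / 5)  # at least one fifth of the seed users
    upper = len(seed_users) // 3            # at most one third of the seed users
    # Only non-seed candidates with recent purchase/click behavior are eligible.
    eligible = {
        name: sorted((set(users) & active_users) - seed_users)
        for name, users in candidates_by_type.items()
    }
    total = sum(len(users) for users in eligible.values())
    if total == 0 or upper < lower:
        return set()
    budget = min(upper, total)
    picked = set()
    for users in eligible.values():
        pool = [u for u in users if u not in picked]
        quota = min(round(budget * len(users) / total), len(pool), budget - len(picked))
        picked.update(rng.sample(pool, quota))
    # Top up if rounding left the draw below the lower bound.
    if len(picked) < lower:
        remaining = sorted({u for users in eligible.values() for u in users} - picked)
        picked.update(remaining[: lower - len(picked)])
    return picked

# Hypothetical data: 30 seed users and two candidate types.
seed_users = {f"s{i}" for i in range(30)}
candidates_by_type = {
    "high_potential": [f"a{i}" for i in range(20)],
    "search_recall": [f"b{i}" for i in range(20)],
}
active_users = {f"a{i}" for i in range(8)} | {f"b{i}" for i in range(4)}
extra_positives = supplement_positive_samples(candidates_by_type, active_users, seed_users)
positive_samples = seed_users | extra_positives
```

With 30 seed users the bounds are 6 and 10, so the supplement can never outnumber the seed users, which is the dilution concern the rule is meant to address.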
In the prior art, all users other than the seed users are taken as negative samples. This way of selecting negative samples is crude and results in a poor training effect. In view of this, in the embodiment of the present invention, users who are related to the positive samples but differ substantially from them are extracted as negative sample users according to the second extraction rule, so as to improve the training effect of the model.
In an alternative embodiment, the second extraction rule includes: taking as negative sample users those users who have made purchases recently (e.g., within the last half year or the last year) and who have clicked on the brands or items involved in the business activity but have not purchased them. In another alternative embodiment, the second extraction rule includes: taking as negative sample users those users who have made purchases recently and who have clicked on the brands involved in the business activity and on similar brands but have not purchased them.
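As a hedged illustration, the second extraction rule can be expressed with plain set operations. The three input sets (users with any recent purchase, users who clicked the target brand, users who purchased the target brand) are assumed to have been computed elsewhere; their names and contents are hypothetical.

```python
def select_negative_samples(recent_buyers, target_clickers, target_purchasers):
    """Users who bought something recently and clicked the target brand but did not buy it."""
    return (recent_buyers & target_clickers) - target_purchasers

# Hypothetical behavior sets.
recent_buyers = {"u1", "u2", "u3", "u4"}   # any purchase within the last half year
target_clickers = {"u2", "u3", "u5"}       # clicked the brand involved in the activity
target_purchasers = {"u3"}                 # purchased that brand
negative_samples = select_negative_samples(recent_buyers, target_clickers, target_purchasers)
```

Here `u2` is the only negative sample: `u3` purchased the brand, `u5` has no recent purchase, and `u1` and `u4` never clicked the brand.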
Step S103, training a first machine learning model according to the user feature data of the positive sample users and the negative sample users to obtain a trained first machine learning model.
The user feature data may be constructed based on user features in one or more of the following dimensions: user portrait (profile) features, features of user behavior toward the target commodity involved in the business activity (such as purchasing, adding to the shopping cart, clicking, and following), features of user behavior toward similar commodities, and word segmentation features of purchased commodities.
Illustratively, the first machine learning model may be an XGBoost model (XGBoost, short for eXtreme Gradient Boosting, is a boosting algorithm). The first machine learning model may also be another machine learning model without affecting the practice of the present invention.
Step S104, screening out an expanded user set from the candidate user set according to the trained first machine learning model.
In this step, the preference degree of each user in the candidate user set for the target commodity may be determined according to the trained first machine learning model; all users whose preference degree is greater than a preset threshold, or a preset number of users with the highest preference degrees, are then taken as the expanded user set. In a specific implementation, the preset threshold and the preset number may be set by the demand party, or may be set flexibly by the party executing the crowd expansion task according to specific business requirements.
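The two screening options in this step can be sketched as follows, assuming the trained model has already produced a preference score for each candidate user. The score values, the threshold 0.5 and the count 2 are hypothetical.

```python
import heapq

def screen_by_threshold(scores, threshold):
    """Keep every user whose preference score is strictly greater than the threshold."""
    return {user for user, score in scores.items() if score > threshold}

def screen_top_n(scores, n):
    """Keep the n users with the highest preference scores."""
    return set(heapq.nlargest(n, scores, key=scores.get))

# Hypothetical preference scores output by the trained model.
scores = {"u1": 0.91, "u2": 0.42, "u3": 0.77, "u4": 0.15}
by_threshold = screen_by_threshold(scores, 0.5)
top_two = screen_top_n(scores, 2)
```

A threshold yields a set whose size depends on the score distribution, whereas a preset number fixes the audience size, which may matter when the demand party has a delivery budget.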
In the embodiment of the invention, crowd expansion is realized through the above steps. Compared with the prior art, the method of the embodiment of the invention can improve the training effect of the machine learning model used in crowd expansion and improve the accuracy of crowd expansion. In addition, the embodiment of the invention does not need to rely on users' social networks when performing crowd expansion, which improves the general applicability of crowd expansion.
Fig. 2 is a schematic diagram of the main flow of a data processing method according to a second embodiment of the present invention. As shown in Fig. 2, the data processing method according to the embodiment of the present invention includes:
Step S201, constructing a database table, wherein the database table is used for storing the candidate user sets corresponding to commodity brand identifiers or commodity category identifiers.
In an alternative embodiment, the candidate user set includes: a short-term interest user set and a medium-and-long-term interest user set. The short-term interest user set is a set of users screened out, based on the users' short-term behavior feature data, as being interested in the target commodity; the medium-and-long-term interest user set is a set of users screened out, based on the users' medium-and-long-term behavior feature data, as being interested in the target commodity. In the embodiment of the invention, determining the candidate user set takes into account not only users with a short-term interest in the target commodity but also users with a medium-and-long-term interest in it. Users' long-term preferences and diversity are thus both considered, which improves the timeliness and accuracy of crowd expansion and improves the effect of business activities (such as advertisement delivery) based on the expanded crowd.
Illustratively, the short-term interest user set may include: a first short-term interest user set, a second short-term interest user set, and a third short-term interest user set. The first short-term interest user set is a set of users screened from a first user set that has recently (e.g., within the last month) had a first type of operation behavior on the target commodity, and may be referred to simply as the "high potential user set of target commodity"; the second short-term interest user set is a set of users screened from a second user set that has recently had a first type of operation behavior on similar commodities of the target commodity, and may be referred to simply as the "high potential user set of similar commodity"; the third short-term interest user set is a set of users screened from a third user set that has recently had a second type of operation behavior on the target commodity or the similar commodities. The first type of operation behavior may be purchasing, adding to the shopping cart, following, clicking, and the like; the second type of operation behavior may be a search behavior. When the second type of operation behavior is a search behavior, the third short-term interest user set may be referred to simply as the "search recall user set". Furthermore, the short-term interest user set may also include only one or two of the first to third short-term interest user sets without affecting the practice of the present invention.
Illustratively, the medium-and-long-term interest user set includes: a first medium-and-long-term interest user set and a second medium-and-long-term interest user set. The first medium-and-long-term interest user set is a user set constructed from users similar to the group portrait of the seed user set, and may be referred to simply as the "portrait tag similar user set"; the second medium-and-long-term interest user set is a user set constructed from users who have not purchased the target commodity recently but purchased it at some earlier time, and may be referred to simply as the "churn user set". Furthermore, the medium-and-long-term interest user set may also include only one of the first and second medium-and-long-term interest user sets without affecting the practice of the present invention.
In a specific example, the candidate set of users specifically includes: "high potential user set of target commodity", "high potential user set of similar commodity", "search recall user set", "portrait tag similar user set", and "churn user set". The construction process for these five candidate user sets is described below.
In this specific example, the construction process of the "high potential user set of target commodity" includes: acquiring a first user set that has recently (e.g., within the last month) had a first type of operation behavior on the target commodity; determining the preference degree of each user in the first user set for the target commodity according to a trained second machine learning model; and taking all users whose preference degree is greater than a preset threshold, or a preset number of users with the highest preference degrees, as the first short-term interest user set. The second machine learning model may be an XGBoost model (XGBoost, short for eXtreme Gradient Boosting, is a boosting algorithm). The second machine learning model may also be another machine learning model without affecting the practice of the present invention.
In this specific example, the construction process of the "high potential user set of similar commodity" includes: determining the similar commodities of the target commodity, and then screening out the "high potential user set of similar commodity" from a second user set that has recently had a first type of operation behavior on the similar commodities.
In an alternative embodiment, the similar commodities of the target commodity may be determined as follows: a correspondence table between commodity brands and similar brands is queried according to the brand identifier of the target commodity, and a preset number of the similar brands found (e.g., the 10, the 5, or some other number with the highest similarity) are taken as the similar commodities of the target commodity. In a specific implementation, the users' behavior sequences over one month may be acquired and processed by a text processing algorithm (such as the word2vec algorithm) to obtain an embedding for each brand; the similarity between brands is then calculated from these embeddings to determine the similar brands of each brand, and the correspondence table between commodity brands and similar brands is generated from the result.
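The sketch below illustrates how a brand-to-similar-brand table could be derived once brand embeddings are available. It assumes the embeddings have already been learned (for instance by word2vec over behavior sequences) and uses cosine similarity, which is a common choice but is not specified by the embodiment. The brand names and two-dimensional vectors are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def similar_brand_table(embeddings, top_k):
    """Map each brand to its top_k most similar other brands, most similar first."""
    table = {}
    for brand, vec in embeddings.items():
        ranked = sorted(
            ((cosine(vec, other_vec), other)
             for other, other_vec in embeddings.items() if other != brand),
            reverse=True,
        )
        table[brand] = [other for _, other in ranked[:top_k]]
    return table

# Hypothetical brand embeddings.
embeddings = {
    "brand_a": [1.0, 0.0],
    "brand_b": [0.9, 0.1],
    "brand_c": [0.0, 1.0],
}
table = similar_brand_table(embeddings, top_k=1)
```

The all-pairs comparison is quadratic in the number of brands; for a large catalogue an approximate nearest-neighbor index would likely be preferable.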
In another alternative embodiment, the similar commodities of the target commodity may be determined as follows: a correspondence table between commodity categories and related categories is queried according to the category identifier of the target commodity, and a preset number of the related categories found are taken as the similar commodities of the target commodity. Related categories are obtained based on the "shopping basket" concept: for example, a customer may purchase a toothbrush while buying toothpaste, so toothpaste and toothbrushes are related categories. In a specific implementation, frequent pattern mining may be used to mine the related categories of each commodity category. For example, user order data of the last year may be acquired first, the lift may be calculated based on the user order data, and the related categories may then be determined based on the lift. The lift measures the degree of correlation between commodity categories. For example, the lift between commodity category A and commodity category B may be defined as the ratio of (the number of users who purchased both category A and category B divided by the number of users who purchased category A) to (the number of users who purchased category B divided by the total number of users).
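The ratio defined above for the degree of correlation between two commodity categories (the lift measure of association-rule mining) can be computed directly from per-user purchase records, as in the following minimal sketch. The data structure (a mapping from user to the set of categories purchased) and the sample orders are assumptions made for illustration.

```python
def lift(orders, cat_a, cat_b):
    """Lift of cat_b given cat_a: (N(A and B) / N(A)) / (N(B) / N)."""
    total = len(orders)
    n_a = sum(1 for cats in orders.values() if cat_a in cats)
    n_b = sum(1 for cats in orders.values() if cat_b in cats)
    n_ab = sum(1 for cats in orders.values() if cat_a in cats and cat_b in cats)
    if total == 0 or n_a == 0 or n_b == 0:
        return 0.0  # undefined ratio; treated as "no evidence of correlation" here
    return (n_ab / n_a) / (n_b / total)

# Hypothetical order data: categories purchased by each user in the last year.
orders = {
    "u1": {"toothpaste", "toothbrush"},
    "u2": {"toothpaste", "toothbrush"},
    "u3": {"toothpaste"},
    "u4": {"shampoo"},
}
paste_brush = lift(orders, "toothpaste", "toothbrush")
paste_shampoo = lift(orders, "toothpaste", "shampoo")
```

In this data, 2 of the 3 toothpaste buyers also bought a toothbrush while 2 of all 4 users did, giving a lift of (2/3)/(2/4) = 4/3. A value above 1 indicates that the two categories are bought together more often than independence would predict.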
In this specific example, the construction process of the "search recall user set" includes: acquiring the search keyword records of all users within the last month, finding the users who searched for the target commodity or similar commodities, and constructing the "search recall user set" from these users.
In this specific example, the construction process of the "portrait tag similar user set" includes: counting the distribution of values of the portrait tags of all users in the seed user set to determine the group portrait corresponding to the seed user set; and then constructing the "portrait tag similar user set" from users similar to the group portrait.
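One simple way to realize this construction is sketched below. It assumes that the group portrait is the most frequent value of each tag among the seed users and that a user is "similar" when at least a chosen number of tags match; the embodiment does not fix either choice. Tag names, values and the `min_matches` parameter are hypothetical.

```python
from collections import Counter

def group_portrait(seed_profiles):
    """Most frequent value of each portrait tag among the seed users."""
    counters = {}
    for profile in seed_profiles.values():
        for tag, value in profile.items():
            counters.setdefault(tag, Counter())[value] += 1
    return {tag: counter.most_common(1)[0][0] for tag, counter in counters.items()}

def similar_users(profiles, portrait, min_matches):
    """Users whose tags agree with the group portrait on at least min_matches tags."""
    return {
        user for user, profile in profiles.items()
        if sum(profile.get(tag) == value for tag, value in portrait.items()) >= min_matches
    }

# Hypothetical portrait tags.
seed_profiles = {
    "s1": {"age": "25-34", "city_tier": "1"},
    "s2": {"age": "25-34", "city_tier": "2"},
    "s3": {"age": "25-34", "city_tier": "1"},
}
profiles = {
    "u1": {"age": "25-34", "city_tier": "1"},
    "u2": {"age": "35-44", "city_tier": "1"},
    "u3": {"age": "35-44", "city_tier": "3"},
}
portrait = group_portrait(seed_profiles)
portrait_similar = similar_users(profiles, portrait, min_matches=2)
```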
Further, in this specific example, the construction process of the "churn user set" includes: acquiring users who have not purchased the target commodity within the last year but purchased it at some earlier time, and constructing the "churn user set" from these users.
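A minimal sketch of the churn user set construction follows, assuming purchase records of the target commodity are available as (user, purchase date) pairs and taking "the last year" as a 365-day window. The record layout, dates and parameter names are hypothetical.

```python
from datetime import date, timedelta

def churn_user_set(purchases, today, window_days=365):
    """Users who purchased the target commodity before the window but not within it."""
    cutoff = today - timedelta(days=window_days)
    recent = {user for user, day in purchases if day >= cutoff}
    ever = {user for user, _ in purchases}
    return ever - recent

# Hypothetical purchase records of the target commodity.
purchases = [
    ("u1", date(2020, 1, 5)),
    ("u1", date(2023, 5, 1)),
    ("u2", date(2021, 3, 1)),
    ("u3", date(2023, 4, 1)),
]
churned = churn_user_set(purchases, today=date(2023, 6, 1))
```

Only `u2` is returned: `u1` bought again inside the window and `u3` has no purchase before it.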
Step S202, in response to the triggering of a crowd expansion task, acquiring the business activity information for which crowd expansion is needed and the seed user set.
Illustratively, execution of the crowd expansion task, that is, of step S202, may begin after a crowd expansion request submitted by a demand party (such as an advertiser or a marketer) is received. The crowd expansion request may include: information on the business activity for which crowd expansion is needed, and the seed user set related to the business activity. Further, the business activity information may include the following: the brand identifier of the target commodity involved in the business activity, the category identifier of the target commodity involved in the business activity, and the identifier of the shop involved in the business activity. In addition, the crowd expansion request may further include the recall ratio set by the demand party for each type of candidate user set. For example, assuming there are five types of candidate user sets, namely the "high potential user set of target commodity", the "high potential user set of similar commodity", the "search recall user set", the "portrait tag similar user set", and the "churn user set", the demand party can set the recall ratio flexibly, for example to 3:3:2:1:1.
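How a recall ratio such as 3:3:2:1:1 translates into per-set quotas is not detailed in the embodiment. The sketch below shows one plausible approach, assuming a hypothetical total number of candidates to recall and using largest-remainder rounding so that the integer quotas sum exactly to that total. The dictionary keys and totals are illustrative only.

```python
def allocate_by_ratio(ratio, total):
    """Split total into integer quotas proportional to ratio (largest-remainder rounding)."""
    weight_sum = sum(ratio.values())
    exact = {name: total * weight / weight_sum for name, weight in ratio.items()}
    quotas = {name: int(value) for name, value in exact.items()}
    leftover = total - sum(quotas.values())
    # Hand the remaining units to the sets with the largest fractional parts.
    by_remainder = sorted(ratio, key=lambda name: exact[name] - quotas[name], reverse=True)
    for name in by_remainder[:leftover]:
        quotas[name] += 1
    return quotas

# Recall ratio 3:3:2:1:1 over the five candidate user sets.
recall_ratio = {
    "target_high_potential": 3,
    "similar_high_potential": 3,
    "search_recall": 2,
    "portrait_similar": 1,
    "churn": 1,
}
quotas = allocate_by_ratio(recall_ratio, total=1000)
```

For a total of 1000 the quotas are 300, 300, 200, 100 and 100; the rounding step only matters when the total is not a multiple of the ratio sum.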
Step S203, querying the database table according to the business activity information to obtain the candidate user set corresponding to the business activity information.
In one example, the database table may be queried according to the brand identifier of the target commodity to find the corresponding candidate user identifiers, and the candidate user set is constructed from the candidate user identifiers found.
In another example, the database table may be queried according to the category identifier of the target commodity to find the corresponding candidate user identifiers, and the candidate user set is constructed from the candidate user identifiers found.
Step S204, extracting some users from the candidate user set according to a first extraction rule, and taking the extracted users together with the seed user set as positive sample users.
Considering that the seed user set provided by the demand party may come from a specific strong rule, such as users who have recently purchased goods of the brand or category, directly using only the seed users as positive samples tends to cause the model to overfit. In view of this, in the embodiment of the present invention, some users are extracted from the candidate user set according to the first extraction rule so that the seed user set is appropriately generalized, which addresses the overfitting problem and enhances the generalization capability of the machine learning model.
In an alternative embodiment, the first extraction rule may include: determining the users to be extracted from the candidate user set based on the purchase conversion rate and/or the click conversion rate. For example, the recent purchase conversion rate of each user in the candidate user set for the commodity brand involved in the business activity may be analyzed first, and the 10 users with the highest purchase conversion rate are selected and taken, together with the seed user set, as positive sample users. As another example, the recent click conversion rate of each user in the candidate user set for the commodity category involved in the business activity may be analyzed first, and the 20 users with the highest click conversion rate are selected and taken, together with the seed user set, as positive sample users.
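The selection in these examples can be sketched as a simple ranking. The embodiment does not define the conversion rate; the sketch assumes it is the number of purchases divided by the number of clicks for each user, and the per-user statistics shown are hypothetical.

```python
def top_users_by_conversion(stats, k):
    """Return the k users with the highest purchases-per-click rate, best first."""
    def rate(user):
        clicks, purchases = stats[user]
        return purchases / clicks if clicks else 0.0
    return sorted(stats, key=rate, reverse=True)[:k]

# Hypothetical per-user (clicks, purchases) on the brand involved in the activity.
stats = {
    "u1": (10, 5),
    "u2": (20, 2),
    "u3": (0, 0),
    "u4": (4, 3),
}
seed_users = {"s1", "s2"}
top_converters = top_users_by_conversion(stats, k=2)
positive_samples = seed_users | set(top_converters)
```

Users with no clicks are given a rate of zero rather than being dropped, so they simply rank last.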
Further, in the above alternative embodiment, the first extraction rule may further include: the number of users extracted from the candidate user set is not less than one fifth and not more than one third of the number of seed users. With the first extraction rule set in this way, the number of generalized positive samples can be controlled while ensuring that they are similar to the seed users, so that the seed users are not diluted, which helps improve the model training effect.
Step S205, extracting part of users as negative sample users according to the second extraction rule.
In the prior art, all users other than the seed users are taken as negative samples. This way of selecting negative samples is crude and results in a poor training effect. In view of this, in the embodiment of the present invention, users who are related to the positive samples but differ substantially from them are extracted as negative sample users according to the second extraction rule, so as to improve the training effect of the model.
In an alternative embodiment, the second extraction rule includes: taking as negative sample users those users who have made purchases recently (e.g., within the last half year or the last year) and who have clicked on the brands or items involved in the business activity but have not purchased them. In another alternative embodiment, the second extraction rule includes: taking as negative sample users those users who have made purchases recently and who have clicked on the brands involved in the business activity and on similar brands but have not purchased them.
Step S206, training a first machine learning model according to the user feature data of the positive sample users and the negative sample users to obtain a trained first machine learning model.
The user feature data may be constructed based on user features in one or more of the following dimensions: user portrait (profile) features, features of user behavior toward the target commodity involved in the business activity (such as purchasing, adding to the shopping cart, clicking, and following), features of user behavior toward similar commodities, and word segmentation features of purchased commodities.
Illustratively, the first machine learning model may be an XGBoost model (XGBoost, short for eXtreme Gradient Boosting, is a boosting algorithm). The first machine learning model may also be another machine learning model without affecting the practice of the present invention.
Step S207, screening out an expanded user set from the candidate user set according to the trained first machine learning model.
In this step, the preference degree of each user in the candidate user set for the target commodity may be determined according to the trained first machine learning model; all users whose preference degree is greater than a preset threshold, or a preset number of users with the highest preference degrees, are then taken as the expanded user set. In a specific implementation, the preset threshold and the preset number may be set by the demand party, or may be set flexibly by the party executing the crowd expansion task according to specific business requirements.
Further, the method of the embodiment of the invention may further include the following step: evaluating the quality of the expanded user set screened out based on the trained first machine learning model. In a specific implementation, the evaluation may be based on multiple metrics such as accuracy and recall.
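As an illustration of such an evaluation, the sketch below computes precision and recall of the expanded user set against the users who actually converted after the business activity. Reading the accuracy metric as precision over the expanded set is an assumption, as is the availability of a ground-truth set of converted users; the user identifiers are hypothetical.

```python
def precision_recall(expanded, converted):
    """Precision and recall of the expanded user set against users who actually converted."""
    hits = len(expanded & converted)
    precision = hits / len(expanded) if expanded else 0.0
    recall = hits / len(converted) if converted else 0.0
    return precision, recall

# Hypothetical evaluation data.
expanded_users = {"u1", "u2", "u3", "u4"}
converted_users = {"u2", "u4", "u5"}
precision, recall = precision_recall(expanded_users, converted_users)
```

Here two of the four expanded users converted (precision 0.5) and two of the three converters were reached (recall 2/3).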
In the embodiment of the invention, crowd expansion is realized through the above steps. Compared with the prior art, the method of the embodiment of the invention can improve the training effect of the machine learning model used in crowd expansion and improve the accuracy of crowd expansion. In addition, the embodiment of the invention does not need to rely on users' social networks when performing crowd expansion, which improves the general applicability of crowd expansion.
Fig. 3 is a schematic diagram of the main modules of a data processing apparatus according to a third embodiment of the present invention. As shown in Fig. 3, a data processing apparatus 300 of the embodiment of the present invention includes: a determining module 301, an extraction module 302, a training module 303, and a screening module 304.
The determining module 301 is configured to determine a candidate user set for crowd expansion in response to a trigger of the crowd expansion task.
For example, the data processing apparatus 300 may begin executing the crowd expansion task, that is, begin determining the candidate user set for crowd expansion by means of the determining module 301, after a crowd expansion request submitted by a demand party (such as an advertiser or a marketer) is received. The crowd expansion request may include: information on the business activity for which crowd expansion is needed, and a seed user set related to the business activity. Further, the business activity information may include at least one of the following: the brand identifier of the target commodity involved in the business activity, the category identifier of the target commodity involved in the business activity, and the identifier of the shop involved in the business activity.
In an alternative embodiment, the determining module 301 determining the candidate user set for crowd expansion includes: the determining module 301 acquires the business activity information for which crowd expansion is needed; and the determining module 301 queries the database table according to the business activity information to obtain the candidate user set corresponding to the business activity information. The database table stores candidate user information (such as candidate user identifiers) corresponding to the business activity information.
In a specific example of this alternative embodiment, the business activity information obtained by the determining module 301 is specifically a brand identifier of a target commodity related to the business activity, and then the determining module 301 may query the database table according to the brand identifier of the target commodity to find a candidate user identifier corresponding to the brand identifier, and further construct a candidate user set based on the found candidate user identifier.
In another specific example of this alternative embodiment, the business activity information obtained by the determining module 301 is specifically a category identifier of a target commodity involved in the business activity, and then the determining module 301 may query the database table according to the category identifier of the target commodity to find a candidate user identifier corresponding to the category identifier, and further construct a candidate user set based on the found candidate user identifier.
An extraction module 302, configured to extract a part of users from the candidate user set according to a first extraction rule, and then use the extracted part of users and seed user set as positive sample users; and extracting part of the users as negative sample users according to the second extraction rule.
Considering that the seed user set provided by the demand party may come from a specific strong rule, such as users who have recently purchased goods of the brand or category, directly using only the seed users as positive samples tends to cause the model to overfit. In view of this, in the embodiment of the present invention, the extraction module 302 extracts some users from the candidate user set according to the first extraction rule so that the seed user set is appropriately generalized, which addresses the overfitting problem and enhances the generalization capability of the machine learning model.
In an alternative embodiment, the first extraction rule may include: determining the users to be extracted from the candidate user set based on the purchase conversion rate and/or the click conversion rate. For example, if the candidate user set contains several types of candidate users, the users who have both purchased and clicked on the commodity brand involved in the business activity within the last month may be counted first; the proportion of these users among the candidate users of each type is then analyzed, and some of the candidate users of each type are extracted according to that proportion as a supplement to the positive samples. As another example, if the candidate user set contains several types of candidate users, the users who have purchased or clicked on the commodity category involved in the business activity within the last month may be counted first; the proportion of these users among the candidate users of each type is then analyzed, and some of the candidate users of each type are extracted according to that proportion as a supplement to the positive samples.
Further, in the above alternative embodiment, the first extraction rule may further include: the number of users extracted from the candidate user set is not less than one fifth and not more than one third of the number of seed users. With the first extraction rule set in this way, the number of generalized positive samples can be controlled while ensuring that they are similar to the seed users, so that the seed users are not diluted, which helps improve the model training effect.
In the prior art, all users other than the seed users are taken as negative samples. This way of selecting negative samples is crude and results in a poor training effect. In view of this, in the embodiment of the present invention, the extraction module 302 extracts users who are related to the positive samples but differ substantially from them as negative sample users according to the second extraction rule, so as to improve the training effect of the model.
In an alternative embodiment, the second extraction rule includes: taking as negative sample users those users who have made purchases recently (e.g., within the last half year or the last year) and who have clicked on the brands or items involved in the business activity but have not purchased them. In another alternative embodiment, the second extraction rule includes: taking as negative sample users those users who have made purchases recently and who have clicked on the brands involved in the business activity and on similar brands but have not purchased them.
The training module 303 is configured to train a first machine learning model according to the user feature data of the positive sample users and the negative sample users, so as to obtain a trained first machine learning model.
The user feature data may be constructed based on user features in one or more of the following dimensions: user portrait features, user behavior features on the target commodity related to the business activity (such as purchasing, adding to the shopping cart, clicking, following, etc.), user behavior features on similar commodities, and word segmentation features of purchased commodities.
Illustratively, the first machine learning model may be an XGBoost model (XGBoost, short for eXtreme Gradient Boosting, is a boosting algorithm). The first machine learning model may also be another machine learning model without affecting the practice of the present invention.
A screening module 304, configured to screen out an expanded user set from the candidate user set according to the trained first machine learning model.
Illustratively, the screening module 304 may determine the preference degree of each user in the candidate user set for the target commodity based on the trained first machine learning model; the screening module 304 then takes either all users whose preference degree is greater than a preset threshold, or a preset number of users with the highest preference degrees, as the expanded user set. In a specific implementation, the preset threshold and the preset number may be set by the demand party, or may be flexibly set by the party executing the crowd expansion task according to specific business needs.
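The two screening alternatives described above (threshold-based and top-N) may be sketched as follows. The function name and the representation of preference degrees as a dictionary of scores are assumptions of this sketch; the scores themselves would come from the trained first machine learning model.

```python
import heapq


def screen_expanded_users(preference, threshold=None, top_n=None):
    """Select the expanded user set from model preference scores.

    preference: dict mapping user id -> preference degree (assumed format).
    Exactly one of threshold or top_n must be given.
    """
    if (threshold is None) == (top_n is None):
        raise ValueError("specify exactly one of threshold or top_n")
    if threshold is not None:
        # All users whose preference degree exceeds the preset threshold.
        return {user for user, score in preference.items() if score > threshold}
    # The preset number of users with the highest preference degrees.
    return set(heapq.nlargest(top_n, preference, key=preference.get))
```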
In the embodiment of the invention, crowd expansion is realized by the above device. Compared with the prior art, the device provided by the embodiment of the invention can improve the training effect of the machine learning model used in crowd expansion and improve the accuracy of crowd expansion. In addition, the embodiment of the invention does not need to rely on the social networks of users when performing crowd expansion, which improves the general applicability of crowd expansion.
Fig. 4 is a schematic diagram of main modules of a data processing apparatus according to a fourth embodiment of the present invention. As shown in fig. 4, a data processing apparatus 400 of an embodiment of the present invention includes: a construction module 401, a determination module 402, an extraction module 403, a training module 404, a screening module 405.
A construction module 401, configured to construct a database table, where the database table is configured to store a candidate user set corresponding to a brand identifier or a category identifier of a commodity.
In an alternative embodiment, the candidate user set includes: a short-term interest user set and a medium-long-term interest user set; the short-term interest user set is a set of users interested in the target commodity, screened out based on short-term behavior characteristic data of the users; the medium-long-term interest user set is a set of users interested in the target commodity, screened out based on medium-long-term behavior characteristic data of the users. In the embodiment of the invention, when the candidate user set is determined, not only users with a short-term interest in the target commodity but also users with a medium-to-long-term interest in it are considered, so that both the long-term preferences and the diversity of users are taken into account. This improves the timeliness and accuracy of crowd expansion and the effect of business activities (such as advertisement delivery) based on the expanded crowd.
Illustratively, the short-term interest user set may include: a first short-term interest user set, a second short-term interest user set, and a third short-term interest user set. The first short-term interest user set is a set of users screened from a first user set that has recently (e.g., within the last month) performed a first type of operation behavior on the target commodity, and may be referred to simply as the "high-potential user set of the target commodity"; the second short-term interest user set is a set of users screened from a second user set that has recently performed the first type of operation behavior on commodities similar to the target commodity, and may be referred to simply as the "high-potential user set of similar commodities"; the third short-term interest user set is a set of users screened from a third user set that has recently performed a second type of operation behavior on the target commodity or the similar commodities. The first type of operation behavior may be purchasing, adding to the shopping cart, following, clicking, and the like; the second type of operation behavior may be a search behavior. When the second type of operation behavior is a search behavior, the third short-term interest user set may be referred to simply as the "search recall user set". Furthermore, the short-term interest user set may also include only one or two of the first to third short-term interest user sets without affecting the practice of the present invention.
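For illustration only, the first to third user sets from which the short-term interest user sets are screened may be assembled as in the sketch below. The action labels, the event tuple layout, and the assumption that the events have already been restricted to the recent time window are hypothetical; the subsequent screening step is not specified here and is therefore omitted.

```python
# Hypothetical action labels for the two types of operation behavior.
FIRST_TYPE_ACTIONS = {"purchase", "add_to_cart", "follow", "click"}
SECOND_TYPE_ACTIONS = {"search"}


def build_short_term_pools(events, target_items, similar_items):
    """Return the (first, second, third) user sets before screening.

    events: iterable of (user_id, action, item_id) tuples, assumed to be
            limited to the recent window already.
    """
    first, second, third = set(), set(), set()
    for user, action, item in events:
        on_target = item in target_items
        on_similar = item in similar_items
        if action in FIRST_TYPE_ACTIONS:
            if on_target:
                first.add(user)   # first-type behavior on the target commodity
            if on_similar:
                second.add(user)  # first-type behavior on similar commodities
        elif action in SECOND_TYPE_ACTIONS and (on_target or on_similar):
            third.add(user)       # search behavior on target or similar commodities
    return first, second, third
```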
Illustratively, the medium-long-term interest user set includes: a first medium-long-term interest user set and a second medium-long-term interest user set. The first medium-long-term interest user set is a user set constructed from users whose portraits are similar to the group portrait of the seed user set, and may be referred to simply as the "portrait tag similar user set"; the second medium-long-term interest user set is a user set constructed from users who have not purchased the target commodity recently but have purchased it in the past, and may be referred to simply as the "churn user set". Furthermore, the medium-long-term interest user set may also include only one of the first and second medium-long-term interest user sets without affecting the practice of the present invention.
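As a hedged sketch, the "churn user set" described above may be derived from purchase records as the users who have purchased the target commodity at some time minus those who have purchased it recently. The function name, the record layout, and the 90-day definition of "recently" are assumptions of this sketch.

```python
from datetime import date, timedelta


def build_churn_user_set(purchases, today, recent_days=90):
    """Users who once purchased the target commodity but not recently.

    purchases: iterable of (user_id, purchase_day) for the target commodity,
               where purchase_day is a datetime.date (assumed layout).
    """
    cutoff = today - timedelta(days=recent_days)
    ever_bought, recently_bought = set(), set()
    for user, day in purchases:
        ever_bought.add(user)
        if day >= cutoff:
            recently_bought.add(user)
    return ever_bought - recently_bought
```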
The determining module 402 is configured to determine a candidate user set for crowd expansion in response to triggering of the crowd expansion task.
For example, upon receiving a crowd expansion request submitted by a demand party (such as an advertiser or marketer), the data processing apparatus 400 may begin determining, by the determining module 402, a candidate user set for crowd expansion. The crowd expansion request may include: information on the business activity for which crowd expansion is needed, and a seed user set to which the business activity relates. Further, the business activity information may include at least one of the following: the brand identification of the target commodity related to the business activity, the category identification of the target commodity related to the business activity, and the shop identification related to the business activity.
In an embodiment of the present invention, the determining, by the determining module 402, of a candidate user set for crowd expansion includes: the determining module 402 obtains the business activity information for which crowd expansion is needed; the determining module 402 queries the database table according to the business activity information to obtain the candidate user set corresponding to the business activity information. The database table stores candidate user information (such as candidate user identifications) corresponding to the business activity information.
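The table lookup performed by the determining module 402 may be sketched, under stated assumptions, with Python's built-in `sqlite3` module. The table name `candidate_users` and its columns `brand_id`, `category_id`, and `user_id` are hypothetical; the embodiment only requires that the database table associate candidate users with brand or category identifiers.

```python
import sqlite3


def query_candidate_users(conn, brand_id=None, category_id=None):
    """Look up candidate user ids for the given business activity info.

    Assumes a table candidate_users(brand_id, category_id, user_id);
    this schema is illustrative only.
    """
    clauses, params = [], []
    if brand_id is not None:
        clauses.append("brand_id = ?")
        params.append(brand_id)
    if category_id is not None:
        clauses.append("category_id = ?")
        params.append(category_id)
    if not clauses:
        raise ValueError("brand_id or category_id is required")

    # Parameterized query; users matching either identifier are returned.
    sql = (
        "SELECT DISTINCT user_id FROM candidate_users WHERE "
        + " OR ".join(clauses)
    )
    return {row[0] for row in conn.execute(sql, params)}
```

Because the table is built in advance by the construction module 401, the lookup at task time reduces to a single indexed query rather than a recomputation of the candidate set.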
An extraction module 403, configured to extract a portion of users from the candidate user set according to a first extraction rule and take the extracted users together with the seed user set as positive sample users, and to extract a portion of users as negative sample users according to a second extraction rule.
Considering that the seed user set provided by the demand party may be derived from a specific strong rule (for example, users who have recently purchased goods of the brand or category), directly using the seed users as positive samples tends to cause the model to overfit. In view of this, in the embodiment of the present invention, the extraction module 403 extracts a portion of users from the candidate user set according to the first extraction rule, so that the seed user set is appropriately generalized, thereby addressing the overfitting problem and enhancing the generalization capability of the machine learning model.
In the prior art, all users other than the seed users are taken as negative samples. This selection is coarse and leads to a poor model training effect. In view of this, in the embodiment of the present invention, the extraction module 403 extracts, according to the second extraction rule, users who are related to the positive samples yet differ substantially from them as negative sample users, so as to improve the training effect of the model.
The training module 404 is configured to train the first machine learning model according to the user feature data of the positive sample user and the negative sample user, so as to obtain a trained first machine learning model.
The user feature data may be constructed based on user features in the following dimensions: user portrait features, user behavior features on the target commodity related to the business activity (such as purchasing, adding to the shopping cart, clicking, following, etc.), user behavior features on similar commodities, and word segmentation features of purchased commodities. In a specific implementation, the user feature data may be built in advance; for example, the step of building the user feature data may be executed routinely every day, so that after the crowd expansion task is triggered the user feature data can be obtained directly from the database, which improves the execution efficiency of the crowd expansion task.
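One possible way to assemble the four feature dimensions listed above into a single feature record is sketched below. The feature naming scheme, the action labels, and the use of simple counts for behavior and word segmentation features are assumptions made for illustration; the embodiment does not fix a particular encoding.

```python
from collections import Counter

# Hypothetical labels for the behaviors named in the description.
ACTIONS = ("purchase", "add_to_cart", "click", "follow")


def build_user_features(profile, target_actions, similar_actions,
                        purchased_title_tokens, vocabulary):
    """Combine portrait, behavior, and word segmentation features.

    profile: dict of portrait attributes, e.g. {"age_band": 3}.
    target_actions / similar_actions: lists of action labels performed on the
        target commodity and on similar commodities.
    purchased_title_tokens: tokens from segmenting purchased commodity titles.
    vocabulary: ordered tokens to count (assumed to be fixed in advance).
    """
    features = {}
    for name, value in profile.items():
        features[f"profile:{name}"] = value

    target_counts = Counter(target_actions)
    similar_counts = Counter(similar_actions)
    for action in ACTIONS:
        features[f"target:{action}"] = target_counts[action]
        features[f"similar:{action}"] = similar_counts[action]

    token_counts = Counter(purchased_title_tokens)
    for token in vocabulary:
        features[f"token:{token}"] = token_counts[token]
    return features
```

A record of this kind could be computed by the daily routine job and stored per user, so that the training and screening steps only need to read it.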
Illustratively, the first machine learning model may be an XGBoost model (XGBoost, short for eXtreme Gradient Boosting, is a boosting algorithm). The first machine learning model may also be another machine learning model without affecting the practice of the present invention.
A screening module 405, configured to screen out an expanded user set from the candidate user set according to the trained first machine learning model.
Illustratively, the screening module 405 may determine the preference degree of each user in the candidate user set for the target commodity based on the trained first machine learning model; the screening module 405 then takes either all users whose preference degree is greater than a preset threshold, or a preset number of users with the highest preference degrees, as the expanded user set. In a specific implementation, the preset threshold and the preset number may be set by the demand party, or may be flexibly set by the party executing the crowd expansion task according to specific business needs.
In the embodiment of the invention, crowd expansion is realized by the above device. Compared with the prior art, the device provided by the embodiment of the invention can improve the training effect of the machine learning model used in crowd expansion and improve the accuracy of crowd expansion. In addition, the embodiment of the invention does not need to rely on the social networks of users when performing crowd expansion, which improves the general applicability of crowd expansion.
Fig. 5 illustrates an exemplary system architecture 500 in which a data processing method or data processing apparatus of an embodiment of the present invention may be applied.
As shown in fig. 5, the system architecture 500 may include terminal devices 501, 502, 503, a network 504, and a server 505. The network 504 is used as a medium to provide communication links between the terminal devices 501, 502, 503 and the server 505. The network 504 may include various connection types, such as wired, wireless communication links, or fiber optic cables, among others.
A user may interact with the server 505 via the network 504 using the terminal devices 501, 502, 503 to receive or send messages or the like. Various communication client applications, such as shopping class applications, web browser applications, search class applications, instant messaging tools, mailbox clients, social platform software, etc., may be installed on the terminal devices 501, 502, 503.
The terminal devices 501, 502, 503 may be a variety of electronic devices having a display screen and supporting web browsing, including but not limited to smartphones, tablets, laptop and desktop computers, and the like.
The server 505 may be a server providing various services, such as a background management server that supports crowd expansion tasks submitted by users using the terminal devices 501, 502, 503. After receiving a crowd expansion task, the background management server may analyze and otherwise process it, and feed back the processing result (for example, an expanded user set) to the terminal device.
It should be noted that, the data processing method provided by the embodiment of the present invention is generally executed by the server 505, and accordingly, the data processing apparatus is generally disposed in the server 505.
It should be understood that the number of terminal devices, networks and servers in fig. 5 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
Referring now to FIG. 6, there is illustrated a schematic diagram of a computer system 600 suitable for use in implementing an electronic device of an embodiment of the present invention. The computer system shown in fig. 6 is merely an example, and should not be construed as limiting the functionality and scope of use of embodiments of the present invention.
As shown in fig. 6, the computer system 600 includes a Central Processing Unit (CPU) 601, which can perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM) 602 or a program loaded from a storage section 608 into a Random Access Memory (RAM) 603. In the RAM 603, various programs and data required for the operation of the system 600 are also stored. The CPU 601, ROM 602, and RAM 603 are connected to each other through a bus 604. An input/output (I/O) interface 605 is also connected to bus 604.
The following components are connected to the I/O interface 605: an input portion 606 including a keyboard, a mouse, and the like; an output portion 607 including a display such as a Cathode Ray Tube (CRT) or a Liquid Crystal Display (LCD), a speaker, and the like; a storage section 608 including a hard disk and the like; and a communication section 609 including a network interface card such as a LAN card, a modem, or the like. The communication section 609 performs communication processing via a network such as the internet. A drive 610 is also connected to the I/O interface 605 as needed. A removable medium 611, such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory, is mounted on the drive 610 as needed, so that a computer program read therefrom can be installed into the storage section 608 as needed.
In particular, according to embodiments of the present disclosure, the processes described above with reference to flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method shown in the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network through the communication portion 609, and/or installed from the removable medium 611. The above-described functions defined in the system of the present invention are performed when the computer program is executed by a Central Processing Unit (CPU) 601.
The computer readable medium shown in the present invention may be a computer readable signal medium or a computer readable storage medium, or any combination of the two. The computer readable storage medium can be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples of the computer-readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In the present invention, however, the computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, with the computer-readable program code embodied therein. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, RF, etc., or any suitable combination of the foregoing.
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The modules involved in the embodiments of the present invention may be implemented in software or in hardware. The described modules may also be provided in a processor, which may, for example, be described as: a processor including a determination module, an extraction module, a training module, and a screening module. In some cases the names of these modules do not constitute a limitation on the modules themselves; for example, the determination module may also be described as a "module for determining a candidate set of users".
As another aspect, the present invention also provides a computer-readable medium, which may be contained in the device described in the above embodiments, or may exist alone without being assembled into the device. The computer-readable medium carries one or more programs which, when executed by the device, cause the device to perform the following: in response to the triggering of a crowd expansion task, determining a candidate user set for crowd expansion; extracting a portion of users from the candidate user set according to a first extraction rule, and taking the extracted users together with the seed user set as positive sample users; extracting a portion of users as negative sample users according to a second extraction rule; training a first machine learning model according to the user feature data of the positive sample users and the negative sample users to obtain a trained first machine learning model; and screening out an expanded user set from the candidate user set according to the trained first machine learning model.
The above embodiments do not limit the scope of the present invention. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives can occur depending upon design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present invention should be included in the scope of the present invention.