Disclosure of Invention
The technical problem to be solved by the invention is as follows: provided are an advertisement recall method and device based on alliance flow.
In order to solve the technical problems, the invention adopts the technical scheme that: a federation traffic-based advertisement recall method, comprising:
acquiring an advertisement request which contains the ID of a user and the information of an advertisement;
performing channel distribution on the ID of the user in the advertisement request based on hash coding, and determining a recall channel of the advertisement;
after the advertisement recall channel is determined, filtering out advertisement information with similar files based on file vectors through a similar file filtering model;
judging whether the advertisement information after the shunting is new advertisement information, if so, transmitting the new advertisement information to a fine sorting; if not, inputting the advertisement information into a multi-channel advertisement recall model for processing, wherein the multi-channel advertisement recall model comprises a collaborative filtering model based on users, a matrix decomposition model, a product division random model and a full random model based on stream processing;
matching the advertisements processed by the multi-channel advertisement recall model with data in parallel according to the number of the distributed advertisements to obtain repeated advertisement information;
judging whether the advertisement quantity meets the threshold requirement, if so, returning the advertisement list to the fine sorting, if not, randomly filling, and returning the advertisement list to the fine sorting;
calculating the estimated click rate of each advertisement and the bid price of each click of an advertiser according to behavior information, equipment information, user liveness, geographical position and advertisement context characteristics given to the user by fine sorting to obtain the final bid price ranking of the advertisement;
and (4) according to the bidding and ranking of the advertisements, returning the advertisement ID with successful bidding and completing the advertisement recall.
Further, user similarity optimization is carried out in the user-based collaborative filtering model, the channel exposure and the channel high-version proportion number are increased to describe the users, and single feature dimensions and cross feature dimensions are aggregated from the user dimensions.
Further, the performing channel distribution on the ID of the user in the advertisement request based on the hash code to determine the advertisement recall channel includes: applying an intra-application message pushing and message pushing form;
the application message pushing comprises five channels, namely a matrix decomposition and product division random fusion model channel, a high ECPM strategy and product division random fusion model channel, a new product channel, a product division random algorithm channel and a pure random algorithm channel;
the message pushing mode comprises six channels, namely a high ECPM strategy and product division random fusion model channel, a new product channel, a product division random algorithm channel and a pure random algorithm channel.
Further, after confirming the advertisement recall channel, filtering out advertisement information similar to the file based on the file vector, including,
coding the pattern of the advertisement through a dictionary, and converting characters of the pattern into vectors;
based on the file vector, calculating vector similarity and setting a threshold, sorting completely consistent files according to ECPM, and reserving 50% of files with the highest ECPM; and for the documents with the similarity of more than 95%, the documents with the highest proportion of 30% of ECPM are reserved according to the ECPM sequence.
Further, the above-mentioned judging that the advertisement information after shunting is new advertisement information, if yes, transmitting the new advertisement information to the fine sorting includes,
and the recall system judges whether the historical data exists in the historical data of the model or not, if not, the historical data is judged to be new advertisement information, the new advertisement information is subjected to advertisement cold start strategy processing and channel cold start strategy processing, and then the new advertisement information is transmitted to the fine sorting.
The invention also provides an advertisement recalling device based on the alliance flow, which comprises:
the advertisement request acquisition module is used for acquiring an advertisement request which contains the ID of a user and the information of the advertisement;
the channel distribution module is used for carrying out channel distribution on the ID of the user in the advertisement request based on hash coding and determining a recall channel of the advertisement;
the file filtering module is used for filtering out advertisement information similar to the file on the basis of the file vector through the similar file filtering model after the advertisement recall channel is determined;
the model processing module is used for judging whether the shunted advertisement information is new advertisement information or not, and if so, transmitting the new advertisement information to the fine sorting module; if not, inputting the advertisement information into a multi-channel advertisement recall model for processing, wherein the multi-channel advertisement recall model comprises a collaborative filtering model based on users, a matrix decomposition model, a product division random model and a full random model based on stream processing;
the model integration module is used for matching the advertisements processed by the multi-channel advertisement recall model with the data in parallel according to the number of the distributed advertisements to obtain the repeated advertisement information;
the advertisement threshold value judging module is used for judging whether the advertisement quantity meets the threshold value requirement, returning the advertisement list to the fine sorting if the advertisement quantity meets the threshold value requirement, randomly filling if the advertisement quantity does not meet the threshold value requirement, and returning the advertisement list to the fine sorting;
the advertisement bidding module is used for calculating the estimated click rate of each advertisement and the bid price of each click of the advertiser according to the behavior information, the equipment information, the user liveness, the geographic position and the advertisement context characteristics given to the user by the fine sorting to obtain the final bidding bid price ranking of the advertisement;
and the advertisement returning module is used for ranking according to the bidding price of the advertisement, returning the advertisement ID with successful bidding price and finishing the advertisement recall.
Further, in the model processing module, user similarity optimization is performed on the collaborative filtering model based on the user, the channel exposure and the channel high version ratio number are increased to describe the user, and the single feature dimension and the cross feature dimension are aggregated from the user dimension.
Further, the channel splitting module is configured to split a channel based on a hash code for an ID of a user in the advertisement request, and determine a recall channel of the advertisement, including: applying an intra-application message pushing and message pushing form;
the application message pushing comprises five channels, namely a matrix decomposition and product division random fusion model channel, a high ECPM strategy and product division random fusion model channel, a new product channel, a product division random algorithm channel and a pure random algorithm channel;
the message pushing mode comprises six channels, namely a high ECPM strategy and product division random fusion model channel, a new product channel, a product division random algorithm channel and a pure random algorithm channel.
Further, the file filtering module is used for,
coding the pattern of the advertisement through a dictionary, and converting characters of the pattern into vectors;
based on the file vector, calculating vector similarity and setting a threshold, sorting completely consistent files according to ECPM, and reserving 50% of files with the highest ECPM; and for the documents with the similarity of more than 95%, the documents with the highest proportion of 30% of ECPM are reserved according to the ECPM sequence.
Further, the model processing module judges whether the advertisement information after the shunting is new advertisement information, if so, the new advertisement information is transmitted to the fine sorting module,
and the recall system judges whether the historical data exists in the historical data of the model or not, if not, the historical data is judged to be new advertisement information, the new advertisement information is subjected to advertisement cold start strategy processing and channel cold start strategy processing, and then the new advertisement information is transmitted to the fine sorting.
The invention has the technical effects that: the product-based random model is adopted in the model recall, and the product-based random algorithm is utilized to ensure that each product has creative participation in the bidding, more opportunities are given to the products with high income, the original purely random opportunities are changed into product-based opportunities, each advertiser is ensured to have exposure opportunities, and the fairness of the bidding system is ensured. By utilizing a similar file filtering strategy, the malignant bidding behavior caused by the water injection file is ensured not to occur, and the number of repeated files is greatly reduced compared with the prior art. Through the fusion of the model and the strategy, the effective distribution of the flow is ensured, and the ECPM (thousands of exposure gains) is improved by about 12 percent compared with the standard average whole.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be noted that the description of the invention relating to "first", "second", etc. is for descriptive purposes only and is not to be construed as indicating or implying any relative importance or implicit indication of the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. In addition, technical solutions between various embodiments may be combined with each other, but must be realized by a person skilled in the art, and when the technical solutions are contradictory or cannot be realized, such a combination should not be considered to exist, and is not within the protection scope of the present invention.
Referring to fig. 1, a first embodiment of the present invention is: an advertisement recall method based on alliance flow comprises the following steps:
s10, obtaining an advertisement request, wherein the advertisement request contains the ID of the user and the information of the advertisement;
s20, carrying out channel distribution on the ID of the user in the advertisement request based on hash coding, and determining a recall channel of the advertisement; the method comprises the following steps: applying an intra-application message pushing and message pushing form;
the application message pushing comprises five channels, namely a matrix decomposition and product division random fusion model channel, a high ECPM strategy and product division random fusion model channel, a new product channel, a product division random algorithm channel and a pure random algorithm channel;
the message pushing mode comprises six channels, namely a high ECPM strategy and product division random fusion model channel, a new product channel, a product division random algorithm channel and a pure random algorithm channel.
The interior of the recall system can be shunted according to hash codes, different models can be obtained for different quantities according to the inconsistency of advertisement styles, a plurality of rounds of tests can be carried out in the early period, and an optimal flow distribution strategy is obtained based on data analysis.
a) Message push within application:
i. the 45% flow is distributed to a matrix decomposition and product distribution random fusion model, the matrix decomposition is divided into 50 advertisements, and a random algorithm obtains 150 advertisements.
ii. The 40% flow is distributed to a high ECPM strategy and product distribution random fusion model, the high ECPM strategy is divided into 50 advertisements, and a random algorithm obtains 150 advertisements.
And iii, 5% of flow is distributed to new products, new documentations and direct randomness.
iv, 5% flow is distributed to a product random algorithm, and malignant bidding among products is reduced.
v, 5% flow is distributed to a pure random algorithm to be used as effect comparison of model lifting.
b) Message push mode:
i. the 30% flow is divided into a matrix decomposition model, a high ECPM strategy and a product division random fusion model, the matrix decomposition is divided into 50 advertisements, the high ECPM strategy is divided into 50 advertisements, and a random algorithm obtains 100 advertisements.
ii. The method comprises the steps of distributing 30% of flow to collaborative filtering, randomly fusing a high ECPM strategy and a distributed product model, distributing 50 advertisements based on channel collaborative filtering, distributing 50 advertisements to the high ECPM strategy, and obtaining 100 advertisements by a random algorithm.
And iii, distributing the flow of 25% to a high ECPM strategy and product distribution random fusion model, distributing 50 advertisements to the high ECPM strategy, and obtaining 150 advertisements by a random algorithm.
iv, 5% flow rate is assigned to new product, new case, direct randomization.
v, 5% flow is divided into product random algorithm, and the vicious bidding among products is reduced.
vi, 5% flow is distributed to a pure random algorithm for comparison of the effect of model lifting.
S30, after the advertisement recall channel is determined, filtering out advertisement information with similar files based on file vectors through a similar file filtering model; the method specifically comprises the following steps: coding the pattern of the advertisement through a dictionary, and converting characters of the pattern into vectors; based on the file vector, calculating vector similarity and setting a threshold, sorting completely consistent files according to ECPM, and reserving 50% of files with the highest ECPM; and for the documents with the similarity of more than 95%, the documents with the highest proportion of 30% of ECPM are reserved according to the ECPM sequence. If no matching documentation occurs, it is ignored.
S40, judging whether the advertisement information after the shunting is new advertisement information, if so, transmitting the new advertisement information to a fine sorting; if not, inputting the advertisement information into a multi-channel advertisement recall model for processing, wherein the multi-channel advertisement recall model comprises a collaborative filtering model based on users, a matrix decomposition model, a product division random model and a full random model based on stream processing;
the recall system will first determine if it is a cold start, such as new data, new advertisements, and no historical data in the model:
a) cold starting of the advertisement:
i. the number of cold starts of the creative is judged, and if the advertisements come from are all cold starts, the advertisements are directly processed according to a random algorithm.
ii. If there are old advertisements, the model channel is followed.
b) Channel cold starting; if the channel is cold started, the aid number returned by the model becomes 0.
S50, matching the advertisements processed by the multi-channel advertisement recall model with data in parallel according to the number of the distributed advertisements to obtain repeated advertisement information;
s60, judging whether the advertisement quantity meets the threshold value requirement, if so, returning the advertisement list to the fine sorting, if not, randomly filling, and then, returning the advertisement list to the fine sorting;
s70, calculating the estimated click rate of each advertisement and the bid price of each click of an advertiser according to behavior information, equipment information, user liveness, geographic position and advertisement context characteristics given to the user by fine sorting, and obtaining the final bid price ranking of the advertisement;
and S80, ranking according to the bidding price of the advertisement, returning the advertisement ID with successful bidding price, and completing the advertisement recall.
In a specific embodiment, user similarity optimization is performed on a user-based collaborative filtering model, channel exposure and channel high-version proportion number are increased to describe users, and single feature dimensions and cross feature dimensions are aggregated from user dimensions.
In the embodiment, a product-based random model is adopted in the model recall, and a product-based random algorithm is utilized, so that more opportunities are given to products with high income while each product is initiatively participated in the bidding, the original pure randomness is changed into product-based opportunities, each advertiser is ensured to have an exposure opportunity, and the fairness of a bidding system is ensured. By utilizing a similar file filtering strategy, the malignant bidding behavior caused by the water injection file is ensured not to occur, and the number of repeated files is greatly reduced compared with the prior art. Through the fusion of the model and the strategy, the effective distribution of the flow is ensured, and the ECPM (thousands of exposure gains) is improved by about 12 percent compared with the standard average whole. Based on the assumption that the number of union flow channels is large and the channel users have similarity, the channels are used for replacing the users to calculate the similarity, the calculation difficulty is reduced, and meanwhile flow preferred distribution is achieved.
The number of operational users based on federation traffic can reach 2 million per day, while the model needs to compute full users, and under this time delay pressure, it is not practical to compute full users in the recall phase and do full matches at high QPS. If a clustering algorithm is used, the characteristic information of the whole number of users also needs to be stored on the line, and the condition of cold start of a new user cannot be included, and on the basis, some optimization needs to be carried out on the model.
Recalling the model: a user-based collaborative filtering model (UserCF).
1. Based on the principle that users coming from the same channel are similar, such as users who install k songs of the whole population, the user quality is similar. A layer of improvement is made, and a channel main mark is used for representing a user, so that the calculation amount is reduced, and the occupied ratio of cold start of the user is greatly reduced.
2. Grouping is performed from a channel only, so that the situation that the granularity is too coarse possibly exists, although the recall speed is guaranteed, the recall accuracy cannot be guaranteed, and therefore a feature with the highest feature importance is selected from more than 2 hundred features by using a decision tree algorithm.
The decision tree model training process comprises the following steps:
a) sequentially traversing all the features, and selecting the feature with the largest information gain (the largest degree of information uncertainty reduction);
b) then continuously selecting the features and splitting into 2 groups, continuously and respectively calculating the information gain of each node under the 2 groups of the feature groups, selecting the features with the maximum information gain, and continuously splitting;
c) stopping splitting after threshold conditions (number of leaf nodes, tree depth, number of users on each leaf node) are reached;
and (3) screening the feature flow with the highest importance by the decision tree:
according to the following feature importance formula, traversing all features can obtain feature importance ranking, and the larger the calculated value is, the higher the importance is.
The calculation formula is as follows:
feature_importances=N_t/N*(impurity-N_t_R/N_t*right_impurity-N_t_L/N_t*left_impurity);
where N is the total number of samples, N _ t is the number of samples for the current node, N _ t _ L is the number of samples for the left child of the node, N _ t _ R is the number of samples for the right child of the node, impuity is the degree of non-purity (the kini index or entropy) of the feature node, right _ impuity is the right-side tree degree of non-purity, and 1eft _ impuity is the left-side tree degree of non-purity.
3. The concept of a "user" (hereinafter "user") is replaced with channel + features. Then under the concept of the user, the click willingness (CTR click rate) of the user to the advertisement, the income (ECPM) brought by the advertisement, the click willingness to the advertisement products (different advertisement similar products), the income (ECPM) brought by the advertisement products (different advertisement similar products) and the channel ECPM are calculated. From these 5 dimensions, the user vector is derived.
4. By comparing a plurality of similarity calculation modes, the cosine similarity is finally selected to calculate the similarity of the user, and the following formula is a formula for calculating the cosine similarity by using two vectors A and B:
5. then based on the similarity of the users, the following formula is used for obtaining the recommendation score:
the UserCF algorithm will recommend items to the "user" that are liked by the most similar K "users". The set K is a full user, and comprehensiveness of information is guaranteed.
The formula mainly measures the interest degree of the user u in the item i in the UserCF algorithm: where S (u, K) contains K users with the closest interests to the target "user" u, n (i) is the set of users who have performed the item i, Wuv is the similarity between user u and user v, Rvi represents the interest measure of user v for item i, where ECPM is used as the measure.
6. And after the recommendation scores of the result set are obtained, sorting in a reverse order to obtain the recommended advertisement list of the user.
Recalling the model: and (5) decomposing the model by using a matrix.
1. Calculating a co-occurrence matrix of the user and the advertisement: (similarity of user concept and collaborative filtering for matrix decomposition, which is not described here);
the matrix is filled with interest metrics, ECPM was initially chosen because both revenue and user click interest can be described. However, since some "user" exposures are too high, resulting in a false high ECPM, a layer of weight reduction needs to be applied to high-exposure users using the following formula:
the formula: newEcpm ═ Ecpm (log (base, exposure/scaling));
the base number is 1000;
the scaling is 50, and the value has the best effect after testing a plurality of groups;
when the exposure is less than 5 ten thousand, a layer of weight reduction is carried out, and the ECPM data fluctuation is increased due to low exposure; when the exposure is more than 5 ten thousand, the ECPM is not changed.
2. And decomposing the co-occurrence matrix into two small matrixes of U and V by using a least square method.
3. And multiplying the matrix U and the matrix V to obtain the final recommendation score of each user, and obtaining the recommendation list of each user in a reverse order.
Recalling the model: and (5) dividing a product random model.
In the product-based random stage, the random of the weight is realized, the weight is added for some advertisements (such as new creatives or creatives which need to artificially increase the probability of the participation competition), and the probability of the participation competition in the random part is increased:
1. the weight information of all advertisements is calculated off-line and stored on-line.
2. The request for entering the stochastic model reads the weights of all advertisements from the database.
3. When a request comes, assuming that 1000 advertisements come, 10 products and 200 advertisements are needed at random, there are 20 opportunities for each product, and then advertisements are selected in parallel for a plurality of products.
4. Calculating the number of advertisements that can be selected from each product, and obtaining the weight sum (product division) of the advertisements
5. Using this data, the probability of each advertisement being selected is calculated:
pi is the probability of the advertisement being selected, index is the weight, and the denominator is the sum of the weights of the requested advertisements.
6. A random number k of (0, 1) is generated then let p be the current weight.
7. If the random number K is less than the probability of the first advertisement, the first advertisement is returned.
8. If the random number K is greater than the probability of the first advertisement, then p is the weight of the first advertisement + the weight of the second advertisement.
9. And sequentially circulating until K is less than p, and returning the advertisements with the corresponding number.
10. And then repeated among the remaining advertisements until 20 advertisements for each product are randomly completed.
Recalling the model: full-scale stochastic model based on stream processing
The random without weight of stream processing is realized, and because training sets with required number are obtained randomly in a large number of advertisements in an equal probability manner, an algorithm of stream processing equal probability is needed, and the algorithm rapidly passes through one cycle, thereby reducing cycle times and time complexity.
1. The ad request carries n ads, and assuming that k ads need to be selected from the n ads, an algorithm is needed to ensure that the probability of each ad being selected is k/n.
2. According to the sequence of n advertisements brought by the request, the 1 st to the kth enter an initialized return list re.
3. Then starting with k +1 ads, k +1 ads may replace any of the first k or be static. So the probability of returning the k +1 th advertisement is k/(k +1), while the previous k advertisements can be successfully returned as long as they are not replaced, so the probability of successfully returning is 1 x (1-1/(k +1)) -k/(k +1)
4. Then, each advertisement is processed in turn, and according to the same idea, the probability of each final advertisement is demonstrated to be k/n, and the formula is as follows: (since k +1 has already been demonstrated, the probability of selection of the ith advertisement is also demonstrated below as k/n)
5. And after traversing n advertisements, returning after obtaining the advertisement list re.
Recalling the model: similar document filtering model.
1. And arranging a text dictionary table comprising Chinese, emoji expressions, other languages, full-angle punctuations, half-angle punctuations and the like.
2. And (3) sorting the text mapping, and defining the corresponding conditions of several special conditions:
a) at the beginning of the text: filled uniformly with a notation [ STR ].
b) Ending the case: filled uniformly with a notation [ END ].
c) Dictionary tables cannot be matched: filled uniformly with a notation [ UNK ].
d) Truncation is direct beyond setting the vector length.
e) Not satisfying the set vector length, is filled with the uniform mark symbol [ SEP ].
3. The full-scale file is coded, and the text is corresponding to non-repeated numbers, such as [ STR ] corresponding to 0 and the like.
4. And calculating and storing the similarity of the file vector.
5. A threshold is set based on the result and adjusted according to the result. Completely consistent files are sorted according to the ECPM, and the file with the highest 50% proportion of the ECPM is reserved; and (4) documents with the similarity larger than 95% are sorted according to the ECPM, and documents with the highest proportion of 30% of the ECPM are reserved.
As shown in fig. 2, another embodiment of the present invention is: an advertisement recall apparatus based on federation traffic, comprising:
an advertisementrequest obtaining module 10, configured to obtain an advertisement request, where the advertisement request includes an ID of a user and information of an advertisement;
thechannel distribution module 20 is configured to perform channel distribution on the ID of the user in the advertisement request based on hash coding, and determine a recall channel of the advertisement;
thefile filtering module 30 is configured to filter advertisement information with similar files based on file vectors through a similar file filtering model after determining a recall channel of the advertisement;
themodel processing module 40 is configured to determine whether the shunted advertisement information is new advertisement information, and if so, transmit the new advertisement information to the fine sorting; if not, inputting the advertisement information into a multi-channel advertisement recall model for processing, wherein the multi-channel advertisement recall model comprises a collaborative filtering model based on users, a matrix decomposition model, a product division random model and a full random model based on stream processing;
themodel integration module 50 is used for matching the advertisements processed by the multi-channel advertisement recall model with the data in parallel according to the number of the distributed advertisements to obtain the repeated advertisement information;
the advertisement thresholdvalue judging module 60 judges whether the number of advertisements meets the threshold value requirement, returns the advertisement list to the fine sorting if the number of advertisements meets the threshold value requirement, randomly fills the advertisement list if the number of advertisements does not meet the threshold value requirement, and returns the advertisement list to the fine sorting;
theadvertisement bidding module 70 is used for calculating the estimated click rate of each advertisement and the bidding price of the advertiser in each click according to the behavior information, the equipment information, the user liveness, the geographic position and the advertisement context characteristics given to the user by the fine sorting, so as to obtain the final bidding price ranking of the advertisement;
and anadvertisement return module 80, configured to arrange according to the bidding price of the advertisement, return an advertisement ID with which bidding is successful, and complete advertisement recall.
Further, in themodel processing module 40, user similarity optimization is performed on the collaborative filtering model based on the user, the channel exposure and the channel high-version proportion number are increased to describe the user, and a single feature dimension and a cross feature dimension are aggregated from the user dimension.
Further, thechannel splitting module 20 is configured to split a channel based on a hash code for an ID of a user in an advertisement request, and determine a recall channel of an advertisement, where the channel splitting module includes: applying an intra-application message pushing and message pushing form;
the application message pushing comprises five channels, namely a matrix decomposition and product division random fusion model channel, a high ECPM strategy and product division random fusion model channel, a new product channel, a product division random algorithm channel and a pure random algorithm channel;
the message pushing mode comprises six channels, namely a high ECPM strategy and product division random fusion model channel, a new product channel, a product division random algorithm channel and a pure random algorithm channel.
Further, thedocument filtering module 30 is used for,
coding the pattern of the advertisement through a dictionary, and converting characters of the pattern into vectors;
based on the file vector, calculating vector similarity and setting a threshold, sorting completely consistent files according to ECPM, and reserving 50% of files with the highest ECPM; and for the documents with the similarity of more than 95%, the documents with the highest proportion of 30% of ECPM are reserved according to the ECPM sequence.
Further, in themodel processing module 40, it is determined whether the advertisement information after being distributed is new advertisement information, if yes, the new advertisement information is transmitted to the fine sorting module,
and the recall system judges whether the historical data exists in the historical data of the model or not, if not, the historical data is judged to be new advertisement information, the new advertisement information is subjected to advertisement cold start strategy processing and channel cold start strategy processing, and then the new advertisement information is transmitted to the fine sorting.
It should be noted that, as can be clearly understood by those skilled in the art, for a specific implementation process of the above advertisement recall device based on federation traffic, reference may be made to the corresponding description in the foregoing method embodiment, and for convenience and brevity of description, no further description is provided herein.
The above description is only an embodiment of the present invention, and not intended to limit the scope of the present invention, and all modifications of equivalent structures and equivalent processes performed by the present specification and drawings, or directly or indirectly applied to other related technical fields, are included in the scope of the present invention.