CN109801091B

Info

Publication number: CN109801091B
Application number: CN201711137583.2A
Authority: CN
Inventors: 孙福宁; 孟凡超
Original assignee: Tencent Technology Shenzhen Co Ltd
Current assignee: Tencent Technology Shenzhen Co Ltd
Priority date: 2017-11-16
Filing date: 2017-11-16
Publication date: 2022-12-20
Anticipated expiration: 2037-11-16
Also published as: CN109801091A

Abstract

The application relates to a target user group positioning method and apparatus, a computer device, and a storage medium. The method comprises: acquiring a labeled user group set and an unlabeled user group set; obtaining a mixed set from the labeled user group set and the unlabeled user group set; acquiring features of the user identifiers; performing similarity analysis on the features of the user identifiers in the labeled user group set and in the mixed set, and screening out, from the labeled user group set, the user identifiers whose features are similar to those in the mixed set, to obtain a seed user group set; performing negative sampling on the unlabeled user group set according to the features of the user identifiers in the seed user group set, to obtain negative samples; and taking the seed user group set as positive samples, and locating a target user group in the unlabeled user group set according to the features of the user identifiers of the positive and negative samples. The method improves the accuracy of target user group positioning.

Description

Target user group positioning method and device, computer equipment and storage medium

Technical Field

The present application relates to the field of information processing technologies, and in particular, to a method and an apparatus for locating a group of target users, a computer device, and a storage medium.

Background

With the popularization of smart mobile terminals, social applications can conveniently acquire large amounts of user data through mobile terminals. An advertisement operator can analyze users' needs and preferences from this data and determine the target users of a product, thereby improving the effect of advertisement delivery.

Usually, an advertiser provides a known user population (the seed population). To obtain a better advertisement delivery effect, the advertisement operator needs to find, based on the seed population, an extended population whose interests and needs are similar to those of the seed population, and uses the extended population as the target user population for advertisement delivery. Whether the target population can be accurately located therefore depends on the selection of the seed population. If the seed population is selected improperly, the target population is located inaccurately and the advertisement delivery effect suffers.

Disclosure of Invention

Based on this, it is necessary to provide a target user group positioning method, apparatus, computer device and storage medium for solving the problem of inaccurate target group positioning.

A target user group positioning method comprises the following steps:

acquiring a marked user group set and an unmarked user group set;

obtaining a mixed set according to the labeled user group set and the unlabeled user group set;

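acquiring the characteristics of a user identifier;

carrying out similarity analysis on the characteristics of the user identifications of the labeled user group set and the mixed set, and screening the user identifications with characteristics similar to those in the mixed set from the labeled user group set to obtain a seed user group set;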

according to the characteristics of the user identification in the seed user set, negative sample sampling is carried out on the unmarked user group set to obtain a negative sample;

and taking the seed user group set as a positive sample, and positioning a target user group from the unmarked user group set according to the characteristics of the user identifications of the positive sample and the negative sample.

A target user group positioning apparatus, comprising: a data acquisition module, a mixing module, a characteristic acquisition module, a screening module, a negative sampling module and a target positioning module;

the data acquisition module is used for acquiring a labeled user group set and an unlabeled user group set;

the mixing module is used for obtaining a mixed set according to the marked user group set and the unmarked user group set;

the characteristic acquisition module is used for acquiring the characteristics of the user identification;

the screening module is used for carrying out similarity analysis on the characteristics of the user identifications of the labeled user group set and the mixed set, screening the user identifications with characteristics similar to those in the mixed set from the labeled user group set to obtain a seed user group set;

the negative sampling module is used for sampling a negative sample of the unmarked user group set according to the characteristics of the user identification in the seed user set to obtain a negative sample;

and the target positioning module is used for taking the seed user group set as a positive sample and positioning a target user group from the unmarked user group set according to the characteristics of the user identifications of the positive sample and the negative sample.

A computer device comprising a memory and a processor, the memory storing a computer program which, when executed by the processor, causes the processor to perform the steps of the above method.

A storage medium storing a computer program which, when executed by a processor, causes the processor to perform the steps of the above method.

According to the above target user group positioning method, apparatus, computer device and storage medium, after the labeled user group set is obtained, its user identifiers are screened, and the user identifiers whose features are similar to those in the mixed set are determined to be seed users. Screening by similar and dissimilar features addresses the question of whether the binary classification assumption holds for the seed users with respect to the massive unlabeled user group set, so that the target user group can be located accurately from the seed population, which improves the accuracy of target user group positioning.

Drawings

FIG. 1 is a diagram of an application environment of a target user group positioning method in one embodiment;

FIG. 2 is a schematic flow chart diagram illustrating a method for locating a group of target users in one embodiment;

FIG. 3 is a flow diagram of the steps for obtaining a set of seed users in one embodiment;

FIG. 4 is a flowchart illustrating steps for obtaining a set of seed users based on intersection clusters, according to an embodiment;

FIG. 5 is a flow diagram of the steps for obtaining characteristics of users in one embodiment;

FIG. 6 is a flowchart of the steps of locating a target user population in one embodiment;

FIG. 7 is a flowchart illustrating a method for locating a group of target users in another embodiment;

FIG. 8 is a block diagram of a target user group locator in one embodiment;

FIG. 9 is a block diagram of a target user group locator in one embodiment;

FIG. 10 is a block diagram showing a configuration of a computer device according to an embodiment.

Detailed Description

In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.

FIG. 1 is a schematic diagram of the application environment of a target user group positioning method according to an embodiment. As shown in FIG. 1, the application environment includes a positioning server 101 and a user terminal 103. The positioning server 101 runs the target user group positioning method and, after determining the target user group, delivers information to the user terminals 103 corresponding to the target users. The delivered information may be product advertisements or recommendation information.

In one embodiment, a target user group positioning method is provided. As shown in FIG. 2, the method specifically includes the following steps:

S202: Acquire a labeled user group set and an unlabeled user group set.

The labeled user group set is a set of user identifiers labeled with a preset feature tag. The preset feature tag is used to locate the target user group; that is, it should be a feature tag shared by the target user group. Different target user groups have different preset feature tags. Feature tags are associated with the features of users, which include basic user information and user behavior data. Basic information includes, for example, residence, age and sex. User behavior data is, for example, data on behavior the user performs through a terminal, such as shopping behavior.

The preset feature tag may be determined by analyzing the features of actual users of the product. Taking an advertisement for a certain brand of air conditioner as an example, to locate the target user group of the advertisement, the preset feature tag is determined by analyzing the features of some actual users who already use that brand of air conditioner. The users in the labeled user group set are users who have the features corresponding to the preset feature tag.

The data in the labeled user group set may come from an information delivery platform, that is, the identifiers of users who have clicked on a related delivered advertisement, or from a transaction platform, that is, the identifiers of users who have purchased a related product. The labeled user group set may also be collected by the advertiser and uploaded to the information delivery platform.

The unlabeled user group set is a randomly extracted set of massive user groups, that is, the feature labels of the users in the unlabeled user group set are uncertain.

Both the labeled user group set and the unlabeled user group set are sets of user identifiers. A user identifier identifies a user of the information delivery platform or of an application platform associated with the information delivery platform.

S204: Obtain a mixed set from the labeled user group set and the unlabeled user group set.

The mixed set is obtained by mixing the user identifications of the labeled user group set and the unlabeled user group set. In order to reduce the amount of calculation on the mixed set, in this embodiment, the labeled user group set and the unlabeled user group set are mixed according to a certain proportion to obtain the mixed set. In one embodiment, the mixing ratio is 1:1.

Specifically, the step of obtaining the mixed set from the labeled user group set and the unlabeled user group set includes:

according to the number of user identifiers in the labeled user group set, randomly extracting the same number of user identifiers from the unlabeled user group set, and mixing the extracted user identifiers with the user identifiers of the labeled user group set to obtain the mixed set.
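The 1:1 mixing step above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `build_mixed_set`, the example identifiers, and the choice to skip unlabeled users that already appear in the labeled set are all assumptions made for the example.

```python
import random


def build_mixed_set(labeled_ids, unlabeled_ids, seed=None):
    """Mix the labeled set with an equally sized random draw from the unlabeled set."""
    rng = random.Random(seed)
    # De-duplicate while preserving order.
    labeled = list(dict.fromkeys(labeled_ids))
    labeled_lookup = set(labeled)
    # Assumption: a user already in the labeled set is not drawn a second time.
    pool = [uid for uid in dict.fromkeys(unlabeled_ids) if uid not in labeled_lookup]
    # Draw as many unlabeled users as there are labeled users (1:1 ratio).
    k = min(len(labeled), len(pool))
    return labeled + rng.sample(pool, k)


# Hypothetical identifiers, for illustration only.
labeled = ["u1", "u2", "u3"]
unlabeled = [f"x{i}" for i in range(100)]
mixed = build_mixed_set(labeled, unlabeled, seed=0)
```

Fixing the seed only makes the draw repeatable; any uniform random draw of the same size satisfies the step as described.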

S206: the characteristics of the user identification are obtained.

With the explosive growth of internet big data, the features of users can be obtained through application platforms such as shopping platforms and social platforms. The features include basic user information and user behavior data. Basic information includes, for example, residence, age and sex. User behavior data is, for example, data on behavior the user performs through a terminal, such as shopping behavior and social behavior.

S208: Perform similarity analysis on the features of the user identifiers in the labeled user group set and in the mixed set, and screen out, from the labeled user group set, the user identifiers whose features are similar to those in the mixed set, to obtain a seed user group set.

Specifically, similarity analysis is performed on the features of the user identifiers in the labeled user group set and in the mixed set, the user identifiers in the labeled user group set whose features are similar to those in the mixed set are determined, and these user identifiers are retained as the seed user group set.

In the problem of expanding a user group, the seed users are usually taken as positive samples, negative samples are determined, and the positive and negative samples together are used to decide whether a user tends toward the positive class or the negative class. The problem of expanding a user group is therefore in fact a binary classification problem: whether the features of a user are closer to the features of the positive samples or to those of the negative samples. However, the negative samples usually come from the unlabeled user group set, so whether the binary classification assumption holds for the massive unlabeled user group must be considered. For example, if the seed users are characterized by having a car in the household, the unlabeled user group set may contain users in three different situations with respect to car ownership rather than two. The target population can be accurately located from the seed population only when the binary classification assumption holds for the massive unlabeled user group.

In this embodiment, the user identifiers of the labeled user group set are screened, and the user identifiers whose features are similar to those in the mixed set are determined to be seed users. Screening by similar and dissimilar features addresses the binary classification assumption problem of the seed users with respect to the massive unlabeled user group set, so that the target population can be accurately located from the seed population. In addition, this screening also takes into account the distribution of categories in the unlabeled user group set.

S210: Perform negative sampling on the unlabeled user group set according to the features of the user identifiers in the seed user group set, to obtain negative samples.

Negative sampling refers to using a similarity model to find, in the unlabeled user group set, users who do not have the features of the user identifiers in the seed user group set. In this embodiment, the negative samples may be sampled in equal proportion to the number of user identifiers in the seed user group set, that is, the number of user identifiers in the negative sample set is the same as the number of user identifiers in the seed user group set.

Specifically, the Rocchio algorithm is used to sample from the unlabeled user group set to obtain the negative sample set. The Rocchio algorithm is a relevance feedback algorithm that was introduced with Salton's SMART system around the 1970s and became widely used. All samples in the unlabeled user group set are initially treated as negative samples, denoted U; all samples in the seed user group set are labeled as positive samples, denoted P; and the negative sample set Q is initialized to be empty, i.e., Q = { }. A prototype vector of the positive example and a prototype vector of the negative example are calculated, and for each sample d in U, if the similarity between the prototype vector of the positive example and the feature vector of d is smaller than the similarity between the prototype vector of the negative example and the feature vector of d, the sample d is added to the negative sample set Q.

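Specifically, in the standard Rocchio formulation, the prototype vector of the positive example is calculated as:

$$\vec{c}^{+} = \alpha \frac{1}{|P|} \sum_{\vec{d} \in P} \frac{\vec{d}}{\|\vec{d}\|} - \beta \frac{1}{|U|} \sum_{\vec{d} \in U} \frac{\vec{d}}{\|\vec{d}\|}$$

the prototype vector of the negative example is calculated as:

$$\vec{c}^{-} = \alpha \frac{1}{|U|} \sum_{\vec{d} \in U} \frac{\vec{d}}{\|\vec{d}\|} - \beta \frac{1}{|P|} \sum_{\vec{d} \in P} \frac{\vec{d}}{\|\vec{d}\|}$$

and the similarity comparison is:

$$\mathrm{sim}(\vec{c}^{+}, \vec{d}) < \mathrm{sim}(\vec{c}^{-}, \vec{d})$$

where $\alpha$ and $\beta$ are parameters that adjust the relative influence of the two sets, $\vec{c}^{+}$ is the prototype vector of the positive example, $\vec{c}^{-}$ is the prototype vector of the negative example, $\vec{d}$ is a sample in the set U, the sim function is used to calculate similarity, $\mathrm{sim}(\vec{c}^{+}, \vec{d})$ is the similarity between the prototype vector of the positive example and the feature vector of sample d, and $\mathrm{sim}(\vec{c}^{-}, \vec{d})$ is the similarity between the prototype vector of the negative example and the feature vector of sample d.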
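The Rocchio-based negative sampling of S210 can be sketched as below. This is a minimal illustration under stated assumptions: it uses the commonly published Rocchio prototype form (the weighted mean of length-normalized vectors of one set minus that of the other), cosine similarity as the `sim` function, and the literature-typical values `alpha=16` and `beta=4`. None of these choices is confirmed by the text; the function names and sample vectors are hypothetical.

```python
import math


def normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v] if n > 0 else list(v)


def mean_of_normalized(vectors):
    """Average of the length-normalized vectors in a set."""
    total = [0.0] * len(vectors[0])
    for v in vectors:
        for i, x in enumerate(normalize(v)):
            total[i] += x
    return [t / len(vectors) for t in total]


def cosine(a, b):
    """Cosine similarity, assumed here as the sim function."""
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    if na == 0 or nb == 0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (na * nb)


def rocchio_negatives(P, U, alpha=16.0, beta=4.0):
    """Return the samples of U that are closer to the negative prototype."""
    mean_p = mean_of_normalized(P)
    mean_u = mean_of_normalized(U)
    c_pos = [alpha * p - beta * u for p, u in zip(mean_p, mean_u)]
    c_neg = [alpha * u - beta * p for p, u in zip(mean_p, mean_u)]
    Q = []  # negative sample set, initially empty
    for d in U:
        if cosine(c_pos, d) < cosine(c_neg, d):
            Q.append(d)
    return Q


# Hypothetical two-dimensional feature vectors, for illustration only.
P = [[1.0, 0.0], [0.9, 0.1]]                # seed users (positive samples)
U = [[1.0, 0.05], [0.0, 1.0], [0.1, 0.9]]   # unlabeled users
Q = rocchio_negatives(P, U)
```

In this toy input the first unlabeled vector resembles the seed users and is left out of Q, while the other two are kept as negatives. To match the equal-proportion sampling described in the text, Q could afterwards be reduced to the size of P.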

Because the seed users are an accurate seed user group obtained by screening the labeled user group set, taking this accurate seed user group as positive samples and sampling the unlabeled user group set on that basis yields accurate negative samples.

S212: Take the seed user group set as positive samples, and locate a target user group in the unlabeled user group set according to the features of the user identifiers of the positive and negative samples.

The features of the target user group are highly similar to the user features of the positive samples and only weakly similar to the user features of the negative samples. Whether a user is a target user is determined according to the similarity between the features of each user identifier in the unlabeled user group set and the features of the positive and negative samples, so that a small labeled user group set is expanded into a large target user group.

Specifically, a logistic regression model can be trained on the positive and negative samples to obtain a trained model; the features of the user identifiers in the unlabeled user group set are input into the model for positive/negative prediction, and whether a user is a target user is determined according to the prediction result.
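The logistic regression option can be sketched as follows. This is a minimal pure-Python illustration using batch gradient descent; the learning rate, number of epochs, 0.5 decision threshold, function names and sample vectors are assumptions for the example, and a production system would more likely rely on an existing machine learning library.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_logistic(pos, neg, lr=0.5, epochs=500):
    """Fit weights and bias on positive (label 1) and negative (label 0) samples."""
    X = pos + neg
    y = [1.0] * len(pos) + [0.0] * len(neg)
    dim, n = len(X[0]), len(X)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j in range(dim):
                grad_w[j] += err * xi[j]
            grad_b += err
        # Average-gradient update.
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def is_target_user(model, features, threshold=0.5):
    """Predict whether an unlabeled user falls on the positive side."""
    w, b = model
    return sigmoid(sum(wj * xj for wj, xj in zip(w, features)) + b) >= threshold


# Hypothetical feature vectors, for illustration only.
positive = [[1.0, 0.0], [0.9, 0.2]]   # seed users
negative = [[0.0, 1.0], [0.1, 0.8]]   # sampled negatives
model = train_logistic(positive, negative)
```

Users from the unlabeled set for which `is_target_user` returns `True` would be kept as target users.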

Alternatively, the influential features of the positive samples and of the negative samples can be extracted separately, the features of the user identifiers in the unlabeled user group set are matched against the influential features of the positive samples and of the negative samples, and whether a user is a target user is determined according to the matching result.

According to the above target user group positioning method, after the labeled user group set is obtained, its user identifiers are screened, and the user identifiers whose features are similar to those in the mixed set are determined to be seed users. Screening by similar and dissimilar features addresses the binary classification assumption problem of the seed users with respect to the massive unlabeled user group set, so that the target user group can be located accurately from the seed population, which improves the accuracy of target user group positioning.

FIG. 3 is a flowchart of the steps to obtain a set of seed users, according to one embodiment. As shown in fig. 3, the step of performing similarity analysis on the features of the user identifiers of the labeled user group set and the mixed set, and screening the user identifiers having features similar to those in the mixed set from the labeled user group set to obtain the seed user group set includes the following steps S302 to S304:

S302: Perform feature cluster analysis separately on the user identifiers of the labeled user group set and of the mixed set, to obtain a first clustering result and a second clustering result.

In this embodiment, the similarity between the users in the labeled user group set and those in the mixed set is determined through cluster analysis. After cluster analysis, the labeled user group set is divided into a number of class clusters, and the mixed set is likewise divided into a number of class clusters. A class cluster is a set of users with the same features.

Specifically, DBSCAN (Density-Based Spatial Clustering of Applications with Noise) can be used to perform cluster analysis on the labeled user group set and on the mixed set respectively. The algorithm divides regions of sufficient density into clusters and can find clusters of arbitrary shape in a noisy spatial database; it defines a cluster as the maximal set of density-connected points. The DBSCAN algorithm has the advantages of high clustering speed, effective handling of noise, and the ability to find spatial clusters of any shape. In the DBSCAN algorithm, a scanning radius and a minimum number of contained points are preset; a feature vector is selected, all feature vectors whose distance from it is within the preset scanning radius are found, and the selected feature vector together with all feature vectors within the preset scanning radius is taken as a class cluster.
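The DBSCAN procedure described above can be sketched in pure Python as follows. This is a textbook-style illustration, not the patented implementation: Euclidean distance, the parameter names `eps` (scanning radius) and `min_pts` (minimum number of contained points), and the sample points are assumptions, and the brute-force neighbor search is only suitable for small inputs.

```python
import math


def region_query(points, idx, eps):
    """Indices of all points within eps of points[idx] (including itself)."""
    return [j for j, q in enumerate(points) if math.dist(points[idx], q) <= eps]


def dbscan(points, eps, min_pts):
    """Return one cluster label per point; -1 marks noise."""
    UNVISITED, NOISE = None, -1
    labels = [UNVISITED] * len(points)
    cluster_id = 0
    for i in range(len(points)):
        if labels[i] is not UNVISITED:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE  # may later be claimed as a border point
            continue
        labels[i] = cluster_id
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster_id  # border point: joins but is not expanded
            if labels[j] is not UNVISITED:
                continue
            labels[j] = cluster_id
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)  # core point: keep expanding
        cluster_id += 1
    return labels


# Hypothetical two-dimensional feature vectors, for illustration only.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)]
labels = dbscan(points, eps=1.5, min_pts=3)
```

Here the first three points form one class cluster, the next three form another, and the isolated last point is labeled as noise.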

S304: Obtain the seed user group set according to the user identifiers in the intersection clusters of the first clustering result and the second clustering result.

An intersection cluster is the intersection of two class clusters, one belonging to the first clustering result and the other to the second clustering result, whose overlap reaches a certain proportion. The seed user group set is obtained from the users in the intersection clusters.

FIG. 4 is a flowchart illustrating steps of obtaining a seed user set according to an intersection cluster according to an embodiment. As shown in fig. 4, the step of obtaining the seed user group set according to the users in the intersection cluster of the first clustering result and the second clustering result includes:

S402: Acquire each first class cluster of the first clustering result and each second class cluster of the second clustering result.

The first class clusters are obtained by performing cluster analysis on the features of the user identifiers in the labeled user group set (the candidate seed users). A class cluster is a set of users with the same features; that is, each first class cluster is a set of user identifiers that share the same features after the labeled user group set is clustered.

The second class clusters are obtained by performing cluster analysis on the features of the users in the mixed set (the sample set). A class cluster is a set of users with the same features; that is, each second class cluster is a set of user identifiers that share the same features after the sample set is clustered.

S404: Traverse each first class cluster, and calculate the intersection cluster, that is, the intersection between the users of the first class cluster and the users of each second class cluster that contains the largest number of users.

Specifically, let the labeled user group set from which the seed users are screened be A, the sample set be C, the first class clusters be Ai, and the second class clusters be Cj.

Initialize i = 1.

Calculate the intersection of the first class cluster Ai with each second class cluster Cj of the sample set to obtain a plurality of corresponding intersections B, sort the intersections B in descending order of the number of user identifiers they contain, and take the intersection B with the largest number of user identifiers as the intersection cluster.

S406: If the proportion of the users of the first class cluster that fall within the intersection cluster is greater than a set value, add the users of the first class cluster to the seed user group set.

If the user identifiers in the intersection cluster B account for more than half (more than 50%) of the user identifiers in the first class cluster Ai, the features of the users in the intersection cluster B can be considered highly similar to the features of the users in the first class cluster, and all user identifiers of the first class cluster Ai are added to the seed user group set.

Let i = i + 1, and return to the step of calculating the intersection of the first class cluster Ai with each second class cluster Cj of the sample set to obtain a plurality of corresponding intersections B, until all first class clusters have been traversed.
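Steps S402 to S406 can be sketched with Python sets as below. This is a minimal illustration in which class clusters are represented as sets of user identifiers. The function name `select_seed_users`, the example clusters, and the reading of the set value as a share of the first class cluster (0.5 by default) are assumptions made for the example.

```python
def select_seed_users(first_clusters, second_clusters, threshold=0.5):
    """Keep every first class cluster whose largest overlap with any
    second class cluster covers more than `threshold` of its users."""
    seeds = set()
    for a_i in first_clusters:
        if not a_i:
            continue
        # Intersection cluster B: the largest intersection of Ai with any Cj.
        best = max((a_i & c_j for c_j in second_clusters), key=len, default=set())
        if len(best) / len(a_i) > threshold:
            seeds |= a_i  # all users of Ai join the seed user group set
    return seeds


# Hypothetical class clusters, for illustration only.
first = [{"u1", "u2", "u3"}, {"u4", "u5", "u6", "u7"}]
second = [{"u1", "u2", "x1"}, {"u3", "u4", "x2"}, {"u5", "x3"}]
seeds = select_seed_users(first, second)
```

In this toy input, two of the three users of the first cluster fall in one second class cluster, so the whole cluster is kept; the second cluster overlaps any second class cluster by at most one user out of four and is discarded.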

FIG. 5 is a flow diagram of the steps to obtain characteristics of various users, according to one embodiment. As shown in fig. 5, the step of acquiring the characteristics of each user includes:

s502: geographic location information for a user identification is obtained based on a location service.

The geographical location information includes POI information (Point of Interest) of the user at each time Point. The POI information is used to indicate a point in an electronic map, including a POI point name, a genre, a longitude, a latitude, and the like, which are most core data based on a location service, and the POI is generally represented by a bubble icon on the electronic map, such as a sight spot, a government agency, a company, a mall, a restaurant, and the like on the electronic map. Based on the geographical location information of the user, the activity track of the user can be known. Such as navigation behavior, travel behavior, etc. of the user. The geographic position information of the user is obtained based on the position service, and the method has strong objectivity and reliability.

Location Based Service (LBS) is Location information of a mobile terminal user obtained through a wireless communication network of a telecommunication mobile operator or an external subscription method. Under the support of a Geographic Information System (GIS) platform, a value added service of corresponding service is provided for users.

S504: and constructing a position feature vector of each user identifier according to the geographical position information, wherein the features comprise the position feature vector.

Specifically, geographical location information features are collectively represented as a location feature vector. The position feature vector is data features related to long-term accumulated geographic position information, and comprises the following steps: user travel, check-in, location, etc.
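One possible way to turn accumulated POI records into a position feature vector is sketched below. The text does not specify the encoding, so everything here is an assumption for illustration: the record layout of `(timestamp, poi_category)` pairs, the category list `POI_CATEGORIES`, and the choice of normalized visit frequencies as vector components.

```python
from collections import Counter

# Hypothetical POI categories, chosen for illustration only.
POI_CATEGORIES = ["gas_station", "parking_lot", "mall", "restaurant", "scenic_spot"]


def location_feature_vector(visits, categories=POI_CATEGORIES):
    """Encode one user's POI records as per-category visit frequencies.

    visits: iterable of (timestamp, poi_category) pairs for a single user.
    """
    counts = Counter(category for _, category in visits)
    total = sum(counts[c] for c in categories)
    if total == 0:
        return [0.0] * len(categories)
    return [counts[c] / total for c in categories]


# Hypothetical positioning records for one user.
visits = [
    ("2017-11-01T08:00", "gas_station"),
    ("2017-11-01T09:00", "parking_lot"),
    ("2017-11-03T18:30", "gas_station"),
    ("2017-11-04T14:00", "mall"),
]
vector = location_feature_vector(visits)
```

Other long-term signals named in the text, such as travel and check-in behavior, could be appended as further components of the same vector.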

In this embodiment, by acquiring the geographical location information of the user identifier, the user activity track can be acquired based on the geographical location information of the user, and the target user is determined according to the user activity track, so that the manner of acquiring the user characteristics is widened.

FIG. 6 is a flow diagram of the steps to locate a target user population of one embodiment. As shown in fig. 6, the step of locating the target user group from the unlabeled user group set according to the characteristics of the user identifiers of the positive sample and the negative sample by using the seed user group set as the positive sample includes:

S602: Take the seed user group set as positive samples.

S604: Perform binary classification training according to the features of each user identifier in the positive and negative samples to generate a prediction model.

The problem of expanding a user group is in fact a binary classification problem: whether the features of a user are closer to the features of the positive samples or to those of the negative samples. The prediction model can therefore be obtained by binary classification training on the positive and negative samples. The classification training may use a support vector machine or a logistic regression model. Because the binary classification assumption problem of the seed users with respect to the massive unlabeled user group set has been addressed, the target population can be accurately located from the seed population by using the prediction model obtained through binary classification training.

S606: Use the prediction model to locate the target user group among the users to be tested in the unlabeled user group set.

Specifically, Spark is used to load the prediction model and perform positive/negative prediction on the user identifiers of the massive unlabeled user group set, and all users predicted to be positive are filtered out as target users, so that a small labeled user group set is expanded into a large target user group.

In another embodiment, after the step of locating a target user group from the set of unlabeled user groups according to the characteristics of the user identifications of the positive sample and the negative sample by using the set of seed user groups as the positive sample, the method further includes: and carrying out information delivery on the target user group.

The method expands a small labeled user group set into a large target user group, and information is delivered to the target users based on the expanded target user set. The delivered information can be product advertisements or recommendation information. Because the delivered information matches the features of the target users as closely as possible, the information delivery effect is improved and resource waste is reduced.

The following describes the target user group positioning method of the present application with reference to a specific application scenario, in which a small labeled user group set of car owners is expanded into a large target user group of car owners. FIG. 7 is a flowchart of the target user group positioning method according to this embodiment. It should be understood that, although the steps in the flowchart of FIG. 7 are shown in the order indicated by the arrows, they are not necessarily executed in that order. Unless explicitly stated otherwise, the steps are not strictly limited to the order shown and may be performed in other orders. Moreover, at least some of the steps in FIG. 7 may include multiple sub-steps or stages, which are not necessarily performed at the same time but may be performed at different times, and which are not necessarily performed sequentially but may be performed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.

As shown in fig. 7, a target user group positioning method includes the following steps:

s702: and acquiring a marked user group set and an unmarked user group set.

In this embodiment, the feature tag for labeling the user group set is a car.

S704: according to the following steps of 1: and 1, mixing the marked user group set and the unmarked user group set to obtain a mixed set.

S706: the method comprises the steps of obtaining geographic position information of user identifications based on position service, and constructing position feature vectors of the user identifications according to the geographic position information, wherein the features comprise the position feature vectors.

The geographical location information includes POI information (Point of Interest) of the user at each time Point. The POI information is used to indicate a point in an electronic map, including a POI point name, a genre, a longitude, a latitude, and the like, which are most core data based on a location service, and the POI is generally represented by a bubble icon on the electronic map, such as a sight spot, a government agency, a company, a mall, a restaurant, and the like on the electronic map. Based on the geographical location information of the user, the activity track of the user can be known.

S708: Perform feature clustering analysis on the user identifiers of the labeled user group set and of the mixed set, respectively, to obtain a first clustering result and a second clustering result.

Specifically, the labeled user group set and the mixed set are each clustered according to the position feature vectors. When the labeled user group set consists of car owners, the clustering features may include: directional POI behaviors such as navigation and visits to gas stations and parking lots; the distribution of positioning days and hours on different road levels; travel time on working days and weekends; cross-city travel behaviors; and the like.

S710: Obtain each first cluster of the first clustering result and each second cluster of the second clustering result.

S712: Traverse each first cluster, and compute the intersection cluster that shares the largest number of users between the first cluster and each of the second clusters.

S714: If the number of users in the intersection cluster that belong to the first cluster is larger than a set value, add the users of the first cluster to the seed user group set.

S716: Sample negative samples from the unlabeled user group set according to the features of the user identifiers in the seed user group set.

S718: Take the seed user group set as positive samples.

S720: Perform binary classification training according to the features of each user identifier in the positive samples and the negative samples to generate a prediction model.

S722: Position the target user group among the users to be tested in the unlabeled user group set by using the prediction model.

The prediction model predicts whether each user identifier in the massive unlabeled user group set is positive or negative, and all users predicted as positive are taken as target users, thereby expanding a small labeled user group set into a large target user group of car owners.

S724: Deliver information to the target user group.

The method expands a small labeled user group set into a large target user group. Information is delivered to the expanded group of car owners based on the expanded target user set, so that the delivered information, such as car insurance advertisements, matches the characteristics of the target users as closely as possible, which improves the delivery effect and reduces wasted resources.

According to the target user group positioning method, user feature vectors are extracted from the positioning data of the users, a seed user set and a mixed set are constructed, DBSCAN clustering is performed on each of the two sets, the intersections of the clusters of the two sets are screened, and the users in the intersection clusters that belong to the seed user set are taken as the accurate seed user group. The accurate seed user group is then used as positive samples, and users randomly sampled from the unlabeled user group set are used as negative samples. An LR model is trained on the positive and negative samples and is finally used to predict positive or negative for the massive unlabeled user group set; the unlabeled users predicted as positive form a user group whose behavior is similar to that of the seed user group. The method avoids uneven data distribution and constructs a reasonable binary classification problem. The original seeds and the massive unlabeled users are automatically analyzed and screened through data clustering, and the seed users with more pronounced LBS features are selected, which resolves the binary classification hypothesis problem. When the method is used for advertisement delivery, the exposure and click revenue of the advertisements are ensured.
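The clustering named above can be sketched in a few lines. The following is a minimal pure-Python illustration of DBSCAN, not the implementation of this application: the function name `dbscan`, the parameters `eps` and `min_pts`, the Euclidean distance, and the toy points are all assumptions, since the description does not specify the distance metric or parameter values.

```python
import math

def dbscan(points, eps, min_pts):
    """Return one label per point: 0, 1, ... for clusters, -1 for noise."""
    UNVISITED, NOISE = None, -1
    labels = [UNVISITED] * len(points)

    def neighbors(i):
        # Brute-force neighborhood query (includes the point itself).
        return [j for j, q in enumerate(points)
                if math.dist(points[i], q) <= eps]

    cluster_id = -1
    for i in range(len(points)):
        if labels[i] is not UNVISITED:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE          # may later become a border point
            continue
        cluster_id += 1                # point i is a core point: new cluster
        labels[i] = cluster_id
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster_id  # noise reachable from a core point
            if labels[j] is not UNVISITED:
                continue
            labels[j] = cluster_id
            reachable = neighbors(j)
            if len(reachable) >= min_pts:
                queue.extend(reachable)  # j is also a core point: expand
    return labels

# Toy position feature vectors (hypothetical values).
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)]
labels = dbscan(points, eps=1.5, min_pts=3)
```

In this sketch the first three points and the next three points form two clusters, and the isolated last point is labeled as noise. Running the same function once on the labeled set and once on the mixed set would yield the first and second clustering results.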

FIG. 8 shows a target user group positioning apparatus of one embodiment. As shown in fig. 8, the target user group positioning apparatus includes: a data acquisition module 802, a mixing module 804, a feature acquisition module 806, a screening module 808, a negative sampling module 810, and a target positioning module 812.

The data acquisition module 802 is configured to acquire a labeled user group set and an unlabeled user group set.

The mixing module 804 is configured to obtain a mixed set according to the labeled user group set and the unlabeled user group set.

The feature acquisition module 806 is configured to acquire the features of the user identifiers.

The screening module 808 is configured to perform similarity analysis on the features of the user identifiers in the labeled user group set and the mixed set, and to screen, from the labeled user group set, the user identifiers whose features are similar to those in the mixed set, to obtain a seed user group set.

The negative sampling module 810 is configured to sample negative samples from the unlabeled user group set according to the features of the user identifiers in the seed user group set.

The target positioning module 812 is configured to use the seed user group set as positive samples, and to position a target user group from the unlabeled user group set according to the features of the user identifiers of the positive samples and the negative samples.

According to the target user group positioning apparatus, after the labeled user group set is obtained, its user identifiers are screened, and those whose features are similar to the features in the mixed set are determined to be seed users. With respect to the massive unlabeled user group set, the user identifiers with similar features and those with dissimilar features then form a binary classification hypothesis problem relative to the seed users, so the target user group can be positioned accurately from the seed group, which improves the accuracy of target user group positioning.
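As an illustration of the mixing performed by the mixing module 804, the sketch below draws from the unlabeled set as many user identifiers as the labeled set contains and combines the two, giving a 1:1 mixed set. The function name `build_mixed_set`, the `rng_seed` parameter, and the sample identifiers are hypothetical and serve only to make the example reproducible.

```python
import random

def build_mixed_set(labeled_ids, unlabeled_ids, rng_seed=0):
    """Mix labeled users with an equal number of randomly drawn unlabeled users."""
    rng = random.Random(rng_seed)
    # Exclude identifiers that are already labeled; sort for determinism.
    pool = sorted(set(unlabeled_ids) - set(labeled_ids))
    k = min(len(labeled_ids), len(pool))
    extracted = rng.sample(pool, k)
    return list(labeled_ids) + extracted

# Hypothetical identifiers.
labeled = ["u1", "u2", "u3"]
unlabeled = [f"x{i}" for i in range(100)]
mixed = build_mixed_set(labeled, unlabeled)
```

Here `mixed` holds the three labeled identifiers followed by three identifiers drawn from the unlabeled pool.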

FIG. 9 is a diagram illustrating a target user group positioning apparatus according to another embodiment. As shown in fig. 9, the screening module 808 includes a clustering module and an intersection operation module.

The clustering module is used for respectively carrying out characteristic clustering analysis on the user identifications of the labeled user group set and the mixed set to obtain a first clustering result and a second clustering result;

and the intersection operation module is used for obtaining a seed user group set according to the user identification in the intersection cluster of the first clustering result and the second clustering result.

Specifically, the intersection operation module is configured to: obtain each first cluster of the first clustering result and each second cluster of the second clustering result; traverse each first cluster and compute, among the second clusters, the intersection cluster that shares the largest number of user identifiers with the first cluster; and, if the number of user identifiers in the intersection cluster that belong to the first cluster is larger than a set value, add the user identifiers of the first cluster to the seed user group set.
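The intersection operation described above can be sketched as follows, under one reading of the text: for each first cluster, the largest overlap with any second cluster is found, and the whole first cluster is added to the seed set when that overlap exceeds the set value. The function name `screen_seed_users`, the representation of clusters as Python sets of user identifiers, and the `threshold` value are assumptions for illustration.

```python
def screen_seed_users(first_clusters, second_clusters, threshold):
    """Select seed users from first clusters that overlap strongly with a second cluster."""
    seeds = set()
    for first in first_clusters:
        # Largest overlap between this first cluster and any second cluster.
        best_overlap = max(
            (first & second for second in second_clusters),
            key=len,
            default=set(),
        )
        if len(best_overlap) > threshold:
            seeds |= first  # add every user of the first cluster
    return seeds

# Hypothetical clusters of user identifiers.
first_clusters = [{"a", "b", "c"}, {"d", "e"}]
second_clusters = [{"a", "b", "c", "x", "y"}, {"d", "z"}]
seed_users = screen_seed_users(first_clusters, second_clusters, threshold=2)
```

With these toy clusters only the first cluster passes the threshold, so `seed_users` contains `a`, `b`, and `c`.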

Continuing with FIG. 9, the feature acquisition module 806 includes a position feature acquisition module and a vector construction module.

The position feature acquisition module is used for acquiring the geographic location information of the user identifiers based on the location service;

and the vector construction module is used for constructing the position feature vector of each user identifier according to the geographic location information, wherein the features comprise the position feature vector.
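One possible way to turn POI visit records into a position feature vector is sketched below. The record layout `(user_id, poi_category, day)`, the category list, and the chosen features (share of visits per category plus the number of distinct positioning days) are assumptions made for illustration; the description lists candidate behaviors but does not fix a vector layout.

```python
from collections import defaultdict

# Hypothetical POI categories used as vector dimensions.
CATEGORIES = ("gas_station", "parking_lot", "mall", "restaurant")

def build_position_vectors(records):
    """records: iterable of (user_id, poi_category, day) tuples."""
    counts = defaultdict(lambda: [0] * len(CATEGORIES))
    days = defaultdict(set)
    for user_id, category, day in records:
        if category in CATEGORIES:
            counts[user_id][CATEGORIES.index(category)] += 1
        days[user_id].add(day)
    vectors = {}
    for user_id in days:
        total = sum(counts[user_id]) or 1  # avoid division by zero
        shares = [c / total for c in counts[user_id]]
        # Vector = per-category visit share + number of distinct positioning days.
        vectors[user_id] = shares + [float(len(days[user_id]))]
    return vectors

# Hypothetical visit records.
records = [
    ("u1", "gas_station", "2017-11-01"),
    ("u1", "parking_lot", "2017-11-01"),
    ("u1", "gas_station", "2017-11-02"),
    ("u2", "mall", "2017-11-01"),
]
vectors = build_position_vectors(records)
```

A real feature set would likely add the road-level, time-of-day, and cross-city features mentioned in the description; they are omitted here to keep the sketch short.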

Continuing with FIG. 9, the target positioning module 812 includes a model training module and a prediction module.

The model training module is used for taking the seed user group set as positive samples, performing binary classification training according to the features of each user identifier in the positive samples and the negative samples, and generating a prediction model;

and the prediction module is used for positioning the target user group from the unlabeled user group set by using the prediction model.
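The training and prediction performed by these two modules can be illustrated with a small binary classifier. The sketch below assumes that the LR model mentioned elsewhere in this description is logistic regression and trains it with plain stochastic gradient descent; the function names, learning rate, epoch count, decision threshold of 0.5, and sample vectors are all hypothetical.

```python
import math

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def train_lr(samples, labels, lr=0.1, epochs=500):
    """Fit weights w and bias b by stochastic gradient descent on log loss."""
    w, b = [0.0] * len(samples[0]), 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            p = sigmoid(b + sum(wi * xi for wi, xi in zip(w, x)))
            g = p - y  # gradient of the log loss w.r.t. the logit
            w = [wi - lr * g * xi for wi, xi in zip(w, x)]
            b -= lr * g
    return w, b

def predict_positive(w, b, x, threshold=0.5):
    """Return True when the user is predicted to belong to the target group."""
    return sigmoid(b + sum(wi * xi for wi, xi in zip(w, x))) >= threshold

# Hypothetical feature vectors: seed users (positive) and sampled negatives.
positive_samples = [[2.0, 2.0], [2.5, 1.8], [1.9, 2.4]]
negative_samples = [[-2.0, -2.0], [-1.8, -2.3], [-2.2, -1.7]]
w, b = train_lr(positive_samples + negative_samples, [1, 1, 1, 0, 0, 0])
is_target = predict_positive(w, b, [2.1, 2.0])
```

Applying `predict_positive` to every user in the unlabeled set and keeping those predicted positive gives the expanded target user group.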

With continued reference to fig. 9, the apparatus further comprises a delivery module 814 for delivering information to the target user group.

According to the target user group positioning apparatus, user feature vectors are extracted from the positioning data of the users, a seed user set and a mixed set are constructed, DBSCAN clustering is performed on each of the two sets, the intersections of the clusters of the two sets are screened, and the users in the intersection clusters that belong to the seed user set are taken as the accurate seed user group. The accurate seed user group is then used as positive samples, and users randomly sampled from the unlabeled user group set are used as negative samples. An LR model is trained on the positive and negative samples and is finally used to predict positive or negative for the massive unlabeled user group set; the unlabeled users predicted as positive form a user group whose behavior is similar to that of the seed user group. The apparatus avoids uneven data distribution and constructs a reasonable binary classification problem. The original seeds and the massive unlabeled users are automatically analyzed and screened through data clustering, and the seed users with more pronounced LBS features are selected, which resolves the binary classification hypothesis problem. When the apparatus is used for advertisement delivery, the exposure and click revenue of the advertisements are ensured.
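The negative sampling mentioned above can be sketched as plain random sampling from the unlabeled set. This is a simplification: the description also says sampling is performed according to the features of the seed users but does not give the criterion, so none is modeled here. The function name `sample_negatives` and the parameters `ratio` and `rng_seed` are hypothetical.

```python
import random

def sample_negatives(unlabeled_ids, seed_ids, ratio=1.0, rng_seed=0):
    """Randomly draw negative samples from unlabeled users, excluding seed users."""
    rng = random.Random(rng_seed)
    candidates = sorted(set(unlabeled_ids) - set(seed_ids))
    # Draw ratio * |seed set| negatives, capped by the number of candidates.
    k = min(len(candidates), int(len(seed_ids) * ratio))
    return rng.sample(candidates, k)

# Hypothetical identifiers.
seed_ids = {"s1", "s2"}
unlabeled_ids = ["s1", "a", "b", "c", "d"]
negatives = sample_negatives(unlabeled_ids, seed_ids)
```

Keeping the number of negatives close to the number of positives is one simple way to obtain the balanced data distribution the description aims for.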

A computer device comprising a memory and a processor, the memory storing a computer program, the computer program, when executed by the processor, causing the processor to perform the steps of the target user group localization method of the embodiments described above.

FIG. 10 is a diagram illustrating the internal structure of the computer device in one embodiment. The computer device may specifically be the positioning server 101 in fig. 1. As shown in fig. 10, the computer device includes a processor, a memory, and a network interface connected by a system bus. The memory comprises a non-volatile storage medium and an internal memory. The non-volatile storage medium of the computer device stores an operating system and may also store a computer program that, when executed by the processor, causes the processor to implement the target user group positioning method. The internal memory may also store a computer program which, when executed by the processor, causes the processor to perform the target user group positioning method. Those skilled in the art will appreciate that the architecture shown in fig. 10 is merely a block diagram of some of the structures related to the solution of the present application and does not limit the computer devices to which the solution applies; a particular computer device may include more or fewer components than those shown, may combine certain components, or may have a different arrangement of components.

In one embodiment, the target user group positioning apparatus provided in the present application may be implemented in the form of a computer program, and the computer program may be run on a computer device as shown in fig. 10. The memory of the computer device may store the program modules that make up the target user group positioning apparatus, such as the data acquisition module, the mixing module, and the screening module shown in fig. 8. The computer program constituted by the program modules causes the processor to execute the steps of the target user group positioning method of the embodiments of the present application described in this specification.

For example, the computer device shown in fig. 10 may execute the step of acquiring the labeled user group set and the unlabeled user group set through the data acquisition module of the target user group positioning apparatus shown in fig. 8. The computer device may execute the step of obtaining a mixed set according to the labeled user group set and the unlabeled user group set through the mixing module. The computer device may also perform similarity analysis on the features of the user identifiers of the labeled user group set and the mixed set through the screening module, and screen, from the labeled user group set, the user identifiers whose features are similar to those in the mixed set, to obtain a seed user group set.

A storage medium storing a computer program which, when executed by a processor, causes the processor to perform the steps of the target user group location method as in the embodiments described above.

It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program, which can be stored in a non-volatile computer-readable storage medium and which, when executed, can include the processes of the embodiments of the methods described above. Any reference to memory, storage, database or other medium used in the embodiments provided herein can include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).

The technical features of the above embodiments can be arbitrarily combined. For the sake of brevity, not all possible combinations of the technical features in the above embodiments are described; however, as long as a combination of these technical features involves no contradiction, it should be considered to be within the scope of the present specification.

The above-mentioned embodiments express only several implementations of the present application, and their description is relatively specific and detailed, but they should not be construed as limiting the scope of the present application. It should be noted that a person skilled in the art can make several variations and modifications without departing from the concept of the present application, all of which fall within the scope of protection of the present application. Therefore, the protection scope of the present patent application shall be subject to the appended claims.

Claims

1. A target user group positioning method comprises the following steps:

acquiring a marked user group set and an unmarked user group set;

according to the number of the user identifications in the labeled user group set, randomly extracting the same number of user identifications from the unlabeled user group set in equal proportion, and mixing the extracted user identifications and the user identifications of the labeled user group set to obtain a mixed set;

acquiring the characteristics of a user identifier;

according to the characteristics of the user identification in the seed user group set, negative sample sampling is carried out on the user group set which is not marked, and a negative sample is obtained;

2. The method according to claim 1, wherein the step of performing similarity analysis on the features of the user identifiers of the labeled user group set and the mixed set, and screening the user identifiers with similar features to those in the mixed set from the labeled user group set to obtain a seed user group set comprises:

respectively carrying out characteristic cluster analysis on the user identifications of the labeled user group set and the mixed set to obtain a first cluster result and a second cluster result;

and obtaining a seed user group set according to the user identification in the intersection cluster of the first clustering result and the second clustering result.

3. The method of claim 2, wherein the step of obtaining a seed user group set according to the user identifier in the intersection cluster of the first clustering result and the second clustering result comprises:

acquiring each first cluster of the first clustering result and each second cluster of the second clustering result;

traversing each first cluster, and calculating to obtain an intersection cluster with the maximum intersection quantity of the user identification of the first cluster and the user identification of each second cluster;

and if the number of the user identifications in the intersection cluster that belong to the first cluster is larger than a set value, adding the user identifications of the first cluster to a seed user group set.

4. The method of claim 1, wherein the step of obtaining the characteristics of each user identification comprises:

acquiring geographical location information of a user identifier based on a location service;

and constructing a position feature vector of each user identifier according to the geographical position information, wherein the features comprise the position feature vector.

5. The method of claim 1, wherein the step of locating a target user group from the unlabeled user group set according to the characteristics of the user identities of the positive and negative examples, using the seed user group set as a positive example, comprises:

taking the seed user group set as a positive sample;

performing two-class training according to the characteristics of each user identifier in the positive sample and the negative sample to generate a prediction model;

and positioning a target user group from the set of unlabeled user groups by using the prediction model.

6. The method of claim 1, further comprising, after the step of locating a target user group from the set of unlabeled user groups according to the user-identified features of the positive and negative examples with the set of seed user groups as positive examples:

and carrying out information delivery on the target user group.

7. A target user group location apparatus, comprising: the device comprises a data acquisition module, a mixing module, a characteristic acquisition module, a screening module, a negative sampling module and a target positioning module;

the data acquisition module is used for acquiring a marked user group set and an unmarked user group set;

the mixing module is used for randomly extracting the same number of user identifications from the unmarked user group set in equal proportion according to the number of the user identifications in the marked user group set, and mixing the extracted user identifications and the user identifications of the marked user group set to obtain a mixed set;

the negative sampling module is used for sampling a negative sample of the user group set which is not marked according to the characteristics of the user identification in the seed user group set to obtain a negative sample;

8. The apparatus of claim 7, wherein the screening module comprises: a clustering module and an intersection operation module;

9. The apparatus of claim 8, wherein the intersection operation module is configured to obtain each first cluster of the first clustering result and each second cluster of the second clustering result; traversing each first cluster, and calculating to obtain an intersection cluster with the maximum intersection quantity of the user identification of the first cluster and the user identification of each second cluster; and if the number of the user identifications in the intersection cluster that belong to the first cluster is larger than a set value, adding the user identifications of the first cluster to a seed user group set.

10. The apparatus of claim 7, wherein the feature obtaining module comprises: the system comprises a position feature acquisition module and a vector construction module;

and the vector construction module is used for constructing a position feature vector of each user identifier according to the geographical position information, wherein the features comprise the position feature vector.

11. The apparatus of claim 7, wherein the target location module comprises: the model training module and the prediction module;

the model training module is used for taking the seed user group set as a positive sample, performing classification training according to the characteristics of each user identifier in the positive sample and the negative sample, and generating a prediction model;

the prediction module is used for positioning a target user group from the unmarked user group set by utilizing the prediction model.

12. The apparatus of claim 7, further comprising a delivery module configured to deliver information to the target group of users.

13. A computer device comprising a memory and a processor, the memory storing a computer program that, when executed by the processor, causes the processor to perform the steps of the method according to any one of claims 1 to 6.

14. A storage medium storing a computer program which, when executed by a processor, causes the processor to carry out the steps of the method according to any one of claims 1 to 6.