CN107153654B - Method and device for identifying region to which user belongs - Google Patents

Method and device for identifying region to which user belongs

Info

Publication number
CN107153654B
Authority
CN
China
Prior art keywords
region
probability
suffix
prefix
information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201610121595.5A
Other languages
Chinese (zh)
Other versions
CN107153654A (en)
Inventor
陆青
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Advanced New Technologies Co Ltd
Advantageous New Technologies Co Ltd
Original Assignee
Alibaba Group Holding Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alibaba Group Holding Ltd
Priority to CN201610121595.5A
Publication of CN107153654A
Application granted
Publication of CN107153654B
Legal status: Active (current)
Anticipated expiration

Abstract

The method includes the steps that a server acquires an electronic mailbox of a user, the electronic mailbox is divided into prefix information and suffix information, for each region, the prefix judgment probability that the prefix information appears in the region is determined, the suffix judgment probability that the suffix information appears in the region is determined, the final judgment probability that the electronic mailbox belongs to each region is determined according to the prefix judgment probability and the suffix judgment probability corresponding to each region, and the region to which the user belongs is identified according to the final judgment probabilities. By the above method, even if the suffix of the electronic mailbox does not contain the character symbol indicating the region (e.g., country) or the regional service provided by the provider of the electronic mailbox relates to a plurality of regions, the region to which the user belongs can be effectively identified through the electronic mailbox.

Description

Method and device for identifying region to which user belongs
Technical Field
The present application relates to the field of computer technologies, and in particular, to a method and an apparatus for identifying a region to which a user belongs.
Background
With the continuous development of society, the electronic mailbox has become an important means of communication. In practice, people can also use an electronic mailbox to register accounts on other websites and thereby use the services those websites provide, for example registering a forum account with an electronic mailbox in order to communicate with others in the forum.
Currently, in order to serve users better, websites generally need to know which region a user is in, so that corresponding services can be provided to users in different regions; for example, weather services can be provided for users in different countries.
Since users usually register on a website with their own electronic mailbox, the prior art mainly determines the country to which the electronic mailbox (e.g., xxx@163.com) belongs and, from that, the country to which the user belongs. Specifically, there are two approaches:
The first approach: after acquiring the email box of a user, the server determines the country of the user directly from the suffix of the email box, namely the character part after the @ symbol (for example, hotmail.fr is the suffix of xxx@hotmail.fr). Because fr in xxx@hotmail.fr indicates that the email box is from France, the user of the email box can be determined to be from France.
The second approach: the server counts in advance the regional services provided by the providers of different types of electronic mailboxes. The regional services of each provider generally cover a certain geographical range; for example, the provider of xxx@163.com provides regional services only in China, while the provider of @hotmail provides regional services all over the world. The region of an electronic mailbox is then determined from the geographical range served by its provider.
In the first approach, when the suffix of the electronic mailbox does not contain a character symbol indicating the region (e.g., country), the region to which the electronic mailbox belongs cannot be determined, and thus the region to which the user belongs cannot be determined. In the second approach, when the regional service provided by the provider of the electronic mailbox covers a plurality of regions (for example, providers of international electronic mailboxes such as hotmail and gmail provide regional services in all countries of the world), the region (e.g., country) to which the electronic mailbox belongs cannot be determined, and thus the region to which the user belongs cannot be determined.
Disclosure of Invention
The embodiment of the application provides a method and a device for identifying a region to which a user belongs, which are used for solving the problem that the region to which the user belongs cannot be identified through an electronic mailbox under the condition that a suffix of the electronic mailbox in the prior art does not contain a character symbol indicating the region (such as a country) or regional service provided by a provider of the electronic mailbox relates to a plurality of regions.
The method for identifying the region to which the user belongs provided by the embodiment of the application comprises the following steps:
acquiring an email box of a user;
splitting the electronic mailbox into prefix information and suffix information;
for each region, determining the prefix judgment probability of the prefix information appearing in the region and determining the suffix judgment probability of the suffix information appearing in the region;
determining the final judgment probability of the electronic mailbox belonging to each region according to the prefix judgment probability and the suffix judgment probability corresponding to each region;
and identifying the region to which the user belongs according to the final judgment probabilities.
The device for identifying the region to which the user belongs provided by the embodiment of the application comprises:
the acquisition module is used for acquiring an email box of a user;
the splitting module is used for splitting the electronic mailbox into prefix information and suffix information;
the first determining module is used for determining the prefix judgment probability of the prefix information appearing in the region and determining the suffix judgment probability of the suffix information appearing in the region aiming at each region;
the second determining module is used for determining the final judgment probability that the electronic mailbox belongs to each region according to the prefix judgment probability and the suffix judgment probability corresponding to each region;
and the identification module is used for identifying the region to which the user belongs according to the final judgment probabilities.
The embodiment of the application provides a method and a device for identifying a region to which a user belongs, in the method, a server acquires an electronic mailbox of the user, the electronic mailbox is divided into prefix information and suffix information, for each region, the prefix judgment probability of the prefix information appearing in the region is determined, the suffix judgment probability of the suffix information appearing in the region is determined, the final judgment probability of the electronic mailbox belonging to each region is determined according to the prefix judgment probability and the suffix judgment probability corresponding to each region, and the region to which the user belongs is identified according to each final judgment probability. By the above method, even if the suffix of the electronic mailbox does not contain the character symbol indicating the region (e.g., country) or the regional service provided by the provider of the electronic mailbox relates to a plurality of regions, the region to which the user belongs can be effectively identified through the electronic mailbox.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the application and together with the description serve to explain the application and not to limit the application. In the drawings:
Fig. 1 shows a process for identifying a region to which a user belongs according to an embodiment of the present application;
Fig. 2 is a schematic structural diagram of an apparatus for identifying a region to which a user belongs according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the technical solutions of the present application will be described in detail and completely with reference to the following specific embodiments of the present application and the accompanying drawings. It should be apparent that the described embodiments are only some of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
Fig. 1 is a process for identifying a region to which a user belongs according to an embodiment of the present application, which specifically includes the following steps:
S101: Acquire an email box of the user.
In practical applications, since users usually register on a website with their own electronic mailbox (e.g., ok@163.com), the website usually determines the region to which the electronic mailbox belongs and, from that, the region to which the user belongs, so as to provide corresponding services for users in different regions.
To determine the region to which the user belongs, the electronic mailbox of the user must first be acquired; this can be done by the server or by another device with data processing functions.
In the present application, a region may be a province, a city, or a country. To make the specific implementation steps clear, the present application is described in detail below taking a country as the region.
For example, suppose a website needs to know which country user A is from; the server of the website therefore acquires the email box aabaab@hotmail.com of user A.
S102: Split the electronic mailbox into prefix information and suffix information.
An email box is usually composed in the format xxx@yyy (in this application, xxx before the @ is referred to as prefix information and yyy after the @ as suffix information). When a user registers the email box, the prefix information is usually defined by the user, while the suffix information is set by the provider of the email box. To improve the accuracy of identifying the region (e.g., country) to which the email box belongs, and hence the region to which the user belongs, the present application processes the prefix information and the suffix information separately.
Therefore, after the server acquires the email box of the user, the email box is split into the prefix information and the suffix information.
Following the above example, upon receiving the email address aabaab@hotmail.com of user A, the server splits it into aabaab (the prefix information) and hotmail.com (the suffix information).
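The splitting in step S102 can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function name `split_mailbox`, the choice to split at the last @, and the lowercasing of the suffix are all assumptions made for the example.

```python
def split_mailbox(address: str) -> tuple[str, str]:
    """Split an email address into (prefix information, suffix information).

    Assumption: the address is split at its last '@' and the suffix is
    lowercased so that table lookups are case-insensitive.
    """
    prefix, sep, suffix = address.strip().rpartition("@")
    if not sep or not prefix or not suffix:
        raise ValueError(f"not a valid mailbox: {address!r}")
    return prefix, suffix.lower()


prefix, suffix = split_mailbox("aabaab@hotmail.com")
print(prefix, suffix)  # aabaab hotmail.com
```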
S103: For each region, determine the prefix judgment probability of the prefix information appearing in the region and the suffix judgment probability of the suffix information appearing in the region.
Because ways of thinking, living habits, and social culture differ between regions (e.g., countries), people in different regions choose prefixes differently when registering electronic mailboxes, so the probability that the same prefix information appears differs from region to region. For example, users in China commonly use the pinyin of their Chinese names as prefix information, whereas users in the United States commonly use the letters of their English names. Because Chinese and English names differ, when the prefix information is the pinyin of a Chinese name, the user is more likely to belong to China, although the email box of a user in another country may also contain the pinyin of a Chinese name. In the present application, the prefix judgment probability therefore represents the likelihood that the prefix information appears in a region (e.g., country): the greater the prefix judgment probability, the more likely the prefix information is to appear in the region, and the smaller the prefix judgment probability, the less likely.
In addition, if only the prefix information is considered and the suffix information is not considered, the probability that the prefix information appears in the region (e.g., country) is the probability that the electronic mailbox corresponding to the prefix information belongs to the region, and the greater the prefix judgment probability is, the greater the probability that the electronic mailbox corresponding to the prefix information belongs to the region is, and the smaller the prefix judgment probability is, the smaller the probability that the electronic mailbox corresponding to the prefix information belongs to the region is.
Further, the present application provides a specific way of determining, for each region, the prefix judgment probability of the prefix information appearing in the region: for each region, split the prefix information into a plurality of character strings, look up the probability of each character string appearing in the region in a prefix probability table established in advance for the region, and determine the prefix judgment probability from these probabilities and the Bayesian formula.
When the prefix information is split into character strings, the number of characters in each split character string equals the number of characters in the character strings of the prefix probability table established in advance for the region. The present application provides N-ary splitting for this purpose, where N is the number of characters in the character strings of that prefix probability table and is a positive integer greater than or equal to 1 and less than the number of characters in the prefix information: starting at each character of the prefix information, N consecutive characters are combined into one character string.
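N-ary splitting as described above amounts to a sliding window of width N over the prefix information. The sketch below is illustrative only; the function name `n_ary_split` is hypothetical, and the bound on N follows the condition stated in the text.

```python
def n_ary_split(prefix: str, n: int) -> list[str]:
    """Return every run of n consecutive characters in prefix, in order."""
    # The text requires 1 <= N < number of characters in the prefix.
    if not 1 <= n < len(prefix):
        raise ValueError("n must satisfy 1 <= n < len(prefix)")
    return [prefix[i:i + n] for i in range(len(prefix) - n + 1)]


print(n_ary_split("aabaab", 3))  # ['aab', 'aba', 'baa', 'aab']
```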
In addition, the present application provides a specific calculation for determining, from the probability of each character string appearing in the region and the Bayesian formula, the prefix judgment probability of the prefix information appearing in the region. Specifically, the probability that the prefix information appears given the region is first determined from the probabilities of the individual character strings appearing in the region (in the example that follows, this is the product of the per-string probabilities), and the prefix judgment probability of the prefix information appearing in region_i is then determined according to the formula

$$P(\text{region}_i \mid \text{prefix}) = \frac{P(\text{prefix} \mid \text{region}_i)\,P(\text{region}_i)}{\sum_{j} P(\text{prefix} \mid \text{region}_j)\,P(\text{region}_j)}$$

where P(region_i | prefix) denotes the prefix judgment probability that, given the prefix information, the prefix information appears in region_i; P(prefix | region_i) denotes the probability that the prefix information appears given region_i; P(region_i) denotes the empirical probability of prefix information appearing in region_i; and the denominator is the sum, over all regions, of the product of P(prefix | region_j) and P(region_j).
It should be noted that the empirical probability P(region_i) is calculated from a large amount of known historical email prefix information: it is the fraction of that prefix information which belongs to email boxes of region_i.
Following the above example, and for convenience of description, only two countries are used here (that is, it is assumed that only users in these two countries use the email box; in practice, a prefix probability table must be established for every country in which the email box is used). Assume that the prefix probability table established in advance for the United States is as shown in Table 1:
[Table 1 appears only as an image in the source (BDA0000933783280000064, BDA0000933783280000071) and is not reproduced here.]
TABLE 1
The prefix probability table established in advance for the united kingdom is shown in table 2:
Prefix information    Prefix judgment probability
aaa                   0.5/1
aab                   0.5/1
aba                   1/3
abb                   2/3
baa                   0.5/2.5
bab                   2/2.5
bba                   2/2.5
bbb                   0.5/2.5
TABLE 2
According to the number of characters in the character strings of the prefix probability table established in advance for the United States, the server splits aabaab (the prefix information) by the N-ary splitting described above, here ternary splitting, into the five character strings "aab", "aba", "baa", "aaa", "aab", and determines from Table 1 the probability of each character string appearing in the United States, as shown in Table 3:
Character string    Probability of character string appearing in the United States
aab                 2/3
aba                 3/4
baa                 1/3
aaa                 1/3
aab                 2/3
TABLE 3
From Table 3, the server determines that, given the United States, the probability of the prefix information appearing is the product of the five probabilities, 0.0370, i.e., P(prefix | United States) = 0.0370.
Similarly, the server determines from table 2 the probability of each string appearing in the uk as shown in table 4:
Character string    Probability of character string appearing in the United Kingdom
aab                 0.5/1
aba                 1/3
baa                 0.5/2.5
aaa                 0.5/1
aab                 0.5/1
TABLE 4
From Table 4, the server determines that, given the United Kingdom, the probability of the prefix information appearing is 0.0083, i.e., P(prefix | United Kingdom) = 0.0083.
Assuming that the empirical probability of the prefix information appearing in the United States is 3/5, i.e., P(United States) = 3/5, and that the empirical probability of the prefix information appearing in the United Kingdom is 2/5, i.e., P(United Kingdom) = 2/5, the server determines by the above formula that the prefix judgment probability of the prefix information appearing in the United States is 0.87, i.e., P(United States | prefix) = 0.87, and that the prefix judgment probability of the prefix information appearing in the United Kingdom is 0.13, i.e., P(United Kingdom | prefix) = 0.13.
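The arithmetic of this worked example can be checked with a short sketch. It takes the five per-string probabilities exactly as listed in Tables 3 and 4 and the empirical probabilities 3/5 and 2/5 from the text; the variable and function names are hypothetical. Note that a width-3 sliding window over `aabaab` produces only four strings (aab, aba, baa, aab), whereas the five strings listed in the text correspond to `aabaaab`; the sketch uses the five listed strings so that it reproduces the stated figures.

```python
from fractions import Fraction as F
from math import prod

# Per-string probabilities as listed in Table 3 (United States) and
# Table 4 (United Kingdom); 0.5/1 = 1/2 and 0.5/2.5 = 1/5.
string_probs = {
    "United States": [F(2, 3), F(3, 4), F(1, 3), F(1, 3), F(2, 3)],
    "United Kingdom": [F(1, 2), F(1, 3), F(1, 5), F(1, 2), F(1, 2)],
}
# Empirical probabilities P(region) assumed in the text.
prior = {"United States": F(3, 5), "United Kingdom": F(2, 5)}

# P(prefix | region): product of the per-string probabilities.
likelihood = {region: prod(ps) for region, ps in string_probs.items()}


def prefix_judgment(likelihood, prior):
    """Bayes' rule: P(region | prefix) for every region."""
    evidence = sum(likelihood[r] * prior[r] for r in prior)
    return {r: likelihood[r] * prior[r] / evidence for r in prior}


posterior = prefix_judgment(likelihood, prior)
for region in prior:
    print(region, round(float(likelihood[region]), 4),
          round(float(posterior[region]), 2))
# United States 0.037 0.87
# United Kingdom 0.0083 0.13
```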
The above is a process of determining a prefix determination probability that prefix information appears in each region, and since the prefix information and suffix information are processed separately in the present application, the process of processing suffix information in the present application is described below.
Similarly, since the thinking way, the living habits and the social culture of each region (e.g., country) are not the same, the number of the electronic mailboxes corresponding to each region (e.g., country) registered and used by a certain type of suffix information is generally different, that is, the number of the electronic mailboxes corresponding to the type of suffix information used in some regions is large, and the number of the electronic mailboxes corresponding to the type of suffix information used in some regions is small, in the present application, the suffix judgment probability can be used to indicate the possibility of the suffix information appearing in the region, and the greater the suffix judgment probability is, the greater the possibility of the suffix information appearing in the region is, and the smaller the suffix judgment probability is, the less the possibility of the suffix information appearing in the region is.
Similarly, if only the suffix information is considered without considering the prefix information, the magnitude of the possibility that the suffix information appears in the region (e.g., country) is the magnitude of the possibility that the email corresponding to the suffix information belongs to the region, and the greater the suffix judgment probability is, the greater the likelihood that the email corresponding to the suffix information belongs to the region is, and the smaller the suffix judgment probability is, the smaller the likelihood that the email corresponding to the suffix information belongs to the region is.
Further, the present application provides a suffix judgment probability that suffix information appears in each region, which is determined for each region, and the specific embodiment is as follows: for each region, a suffix determination probability that the suffix information appears in the region is determined in a suffix probability table established in advance for the region.
It should be noted that the above way of determining the prefix judgment probability is not the only one; any method whose result reflects the likelihood that the prefix information appears in the region (e.g., country) may be used. For example, the prefix judgment probability may be determined by character string similarity: for each region, the similarity between the prefix information and each standard character string established in advance for the region is calculated, and the largest similarity is taken as the prefix judgment probability for that region. Likewise, the way of determining the suffix judgment probability is not unique, and is not described in detail here.
Following the above example, assume that a suffix probability table established in advance for the united states is as shown in table 5:
Suffix information    Suffix judgment probability
gmail.com             2/3
hotmail.com           1/2
TABLE 5
A suffix probability table previously established for the united kingdom is shown in table 6:
Suffix information    Suffix judgment probability
gmail.com             1/3
hotmail.com           1/2
TABLE 6
Since the suffix information is hotmail.com, the server determines from Table 5 that the suffix judgment probability of the suffix information appearing in the United States is 1/2, i.e., P(United States | suffix) = 1/2, and determines from Table 6 that the suffix judgment probability of the suffix information appearing in the United Kingdom is 1/2, i.e., P(United Kingdom | suffix) = 1/2.
S104: Determine the final judgment probability of the electronic mailbox belonging to each region according to the prefix judgment probability and the suffix judgment probability corresponding to each region.
Since the electronic mailbox is formed by combining the prefix information and the suffix information, the prefix information and the suffix information jointly determine which region the electronic mailbox belongs to; that is, the prefix judgment probability and the suffix judgment probability jointly determine which region the electronic mailbox belongs to.
Therefore, in the present application, after determining the prefix judgment probability and the suffix judgment probability corresponding to each region, the server can directly determine the final judgment probability that the email box corresponding to both the prefix information and the suffix information belongs to each region. The final judgment probability represents how likely the email box is to belong to a region: for each region, the greater the final judgment probability, the more likely the email box belongs to the region, and the smaller the final judgment probability, the less likely.
In addition, the present application provides a specific calculation of the final judgment probability that the email box corresponding to both the prefix information and the suffix information belongs to each region, using the formula

$$P = \frac{P(\text{region}_i \mid \text{prefix})\,P(\text{region}_i \mid \text{suffix})}{P(\text{region}_i)}$$

where P denotes the final judgment probability that the email box belongs to region_i, P(region_i | prefix) denotes the prefix judgment probability of the prefix information appearing in region_i, P(region_i | suffix) denotes the suffix judgment probability of the suffix information appearing in region_i, and P(region_i) denotes the empirical probability that the email box belongs to region_i.
After determining the prefix judgment probability and the suffix judgment probability corresponding to the United States, the server determines, according to the formula in step S104, that the final judgment probability that the email box aabaab@hotmail.com of user A belongs to the United States is 0.87 × (1/2) / (3/5) = 0.725; similarly, the server determines that the final judgment probability that it belongs to the United Kingdom is 0.13 × (1/2) / (2/5) = 0.1625.
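The final judgment probabilities of this example can be reproduced with the sketch below. It plugs the rounded prefix judgment probabilities (0.87 and 0.13), the suffix judgment probabilities from Tables 5 and 6, and the empirical probabilities from the text into the step S104 formula; the function name `final_probability` is hypothetical.

```python
from fractions import Fraction as F


def final_probability(p_prefix, p_suffix, p_region):
    """P = P(region | prefix) * P(region | suffix) / P(region)."""
    return p_prefix * p_suffix / p_region


# Values taken from the worked example: (prefix judgment probability,
# suffix judgment probability for hotmail.com, empirical probability).
inputs = {
    "United States": (F("0.87"), F(1, 2), F(3, 5)),
    "United Kingdom": (F("0.13"), F(1, 2), F(2, 5)),
}
final = {region: final_probability(*vals) for region, vals in inputs.items()}

for region, p in final.items():
    print(region, float(p))
# United States 0.725
# United Kingdom 0.1625
```

These values are scores used for ranking regions; as the example shows, they need not sum to 1.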
S105: Identify the region to which the user belongs according to the final judgment probabilities.
After determining the final judgment probability that the user's email box belongs to each region (e.g., country), the server identifies the region with the largest final judgment probability as the home region of the user's email box.
After the server determines that the final judgment probability that the email box aabaab@hotmail.com of user A belongs to the United Kingdom is 0.1625 and that the final judgment probability that it belongs to the United States is 0.725, the server identifies the United States as the home region of the email box of user A, and then takes the United States as the region to which user A belongs.
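Step S105 is a selection of the region with the largest final judgment probability. A minimal sketch, with the hypothetical function name `identify_region` and the figures from the example above:

```python
def identify_region(final_probabilities: dict[str, float]) -> str:
    """Return the region whose final judgment probability is largest."""
    if not final_probabilities:
        raise ValueError("no candidate regions")
    return max(final_probabilities, key=final_probabilities.get)


final = {"United States": 0.725, "United Kingdom": 0.1625}
print(identify_region(final))  # United States
```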
By the above method, even if the suffix of the electronic mailbox does not contain the character symbol indicating the region (e.g., country) or the regional service provided by the provider of the electronic mailbox relates to a plurality of regions, the region to which the user belongs can be effectively identified through the electronic mailbox.
It should be noted that, in the process of determining the suffix determination probability of the email address for each region, when the suffix information includes a character symbol indicating a region (e.g., country), the suffix determination probability of the email address for the region may be directly determined to be 1, and the suffix determination probability of the email address for the other regions may be determined to be 0.
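The special case just described can be combined with the table lookup of step S103 roughly as follows. This is only a sketch: the mapping `CC_TO_REGION` is a hypothetical illustration rather than a complete list of country codes, the table contents are taken from Tables 5 and 6, and returning 0 for a suffix missing from a table is an assumption the text does not specify.

```python
# Hypothetical mapping from a country-code label at the end of the suffix
# to a region; for illustration only.
CC_TO_REGION = {"fr": "France", "cn": "China", "uk": "United Kingdom"}


def suffix_judgment(suffix, regions, suffix_tables):
    """Return the suffix judgment probability for every candidate region."""
    suffix = suffix.lower()
    indicated = CC_TO_REGION.get(suffix.rsplit(".", 1)[-1])
    if indicated in regions:
        # The suffix names a region outright: 1 for it, 0 for the others.
        return {r: 1.0 if r == indicated else 0.0 for r in regions}
    # Otherwise look the suffix up in each region's suffix probability table.
    # Assumption: a suffix absent from a table gets probability 0.
    return {r: suffix_tables.get(r, {}).get(suffix, 0.0) for r in regions}


tables = {
    "United States": {"gmail.com": 2 / 3, "hotmail.com": 1 / 2},
    "United Kingdom": {"gmail.com": 1 / 3, "hotmail.com": 1 / 2},
}
print(suffix_judgment("hotmail.com", ["United States", "United Kingdom"], tables))
print(suffix_judgment("hotmail.fr", ["France", "United States"], tables))
```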
In addition, the present application provides a specific way of establishing the prefix probability table that is established in advance for the region in step S103: obtain in advance sample mailboxes whose regions are known; extract the sample prefix information from each sample mailbox; split each piece of extracted sample prefix information into a plurality of character strings; for each character string obtained, extract its preamble characters and determine the ratio of the number of times the character string appears in the region to the number of times its preamble characters appear in the region; take this ratio as the probability of the character string appearing in the region; and establish the prefix probability table corresponding to the region from the probabilities obtained for all the character strings.
It should be noted that, when N-ary splitting is used to split each piece of extracted sample prefix information into character strings, the preamble characters of a character string are the (N-1) consecutive characters starting from its first character.
For example, for the example in steps S101 to S105, it is assumed that the server acquires each sample mailbox known to belong to the united states, extracts sample prefix information in each sample mailbox, splits each extracted sample prefix information into a plurality of character strings according to a ternary splitting method, and determines the number of times each type of character string appears in the united states, as shown in table 7:
Character string    Number of times
aaa                 1
aab                 2
aba                 3
abb                 1
baa                 1
bab                 2
bbb                 1
TABLE 7
The server then extracts the first two characters of each character string (i.e., the preamble characters) and determines the number of occurrences of each type of preamble characters in the United States, as shown in Table 8:
Preamble characters    Number of times
aa                     3
ab                     4
ba                     3
bb                     1
TABLE 8
The server determines the ratio of the number of times the character string appears in the united states to the number of times the preamble character of the character string appears in the united states (i.e., the probability of the character string appearing in the united states), and establishes a prefix probability table corresponding to the united states according to the determined ratios as shown in table 1.
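The construction of the table from counts can be sketched as below, using the Table 7 counts as input. The function name `build_prefix_table` is hypothetical. The sketch assumes that the count of a preamble is the total count of the character strings starting with it, which reproduces Table 8; it does not model the fractional values such as 0.5/1 that appear in Table 2, since the text does not explain how those are obtained.

```python
from collections import Counter
from fractions import Fraction

# Counts of each ternary character string in the United States (Table 7).
table7 = {"aaa": 1, "aab": 2, "aba": 3, "abb": 1, "baa": 1, "bab": 2, "bbb": 1}


def build_prefix_table(string_counts):
    """Return (probability table, preamble counts) from N-ary string counts."""
    preamble_counts = Counter()
    for string, count in string_counts.items():
        preamble_counts[string[:-1]] += count  # first N-1 characters
    table = {
        string: Fraction(count, preamble_counts[string[:-1]])
        for string, count in string_counts.items()
    }
    return table, dict(preamble_counts)


table, preambles = build_prefix_table(table7)
print(preambles)  # {'aa': 3, 'ab': 4, 'ba': 3, 'bb': 1}
print(table["aab"], table["aba"], table["baa"], table["aaa"])  # 2/3 3/4 1/3 1/3
```

The four probabilities printed last match the values listed for those strings in Table 3.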
Likewise, the server determines the number of times each type of string appears in the uk, as shown in table 9:
[Table 9 appears only as an image in the source (BDA0000933783280000121, BDA0000933783280000131) and is not reproduced here.]
TABLE 9
The server then extracts the first two characters of each character string (i.e., the preamble characters) and determines the number of times each type of preamble characters appears in the United Kingdom, as shown in Table 10:
Preamble characters    Number of times
ab                     3
ba                     2
bb                     2
TABLE 10
The server determines the ratio of the number of times the character string appears in the united kingdom to the number of times the preamble of the character string appears in the united kingdom (i.e., the probability of the character string appearing in the united kingdom), and establishes a prefix probability table corresponding to the united kingdom based on the determined ratios as shown in table 2.
In addition, the present application provides another way of establishing the prefix probability table corresponding to each region in advance, as follows: sample mailboxes of known regions are obtained in advance; the sample prefix information in each sample mailbox is extracted; each piece of extracted sample prefix information is split into a plurality of transition strings using (N-1)-character splitting; the transition strings obtained from the sample prefix information are assembled into a transition count matrix for each region; a transition probability matrix for each region is determined from its transition count matrix; and the prefix probability table for each region is established from its transition probability matrix. Each transition string contains one character fewer than the character strings involved in step S103.
For example, continuing the example in steps S101 to S105, assume the server acquires sample mailboxes known to belong to the United States and the United Kingdom, extracts the sample prefix information from each sample mailbox, splits each piece of extracted sample prefix information into a plurality of transition strings using binary (two-character) splitting, and assembles the transition strings into the transition count matrix corresponding to the United States, as shown in Table 11:
[Table 11 is an image in the source and is not reproduced here.]
TABLE 11
The server determines the transition probability matrix corresponding to the United States from Table 11, as shown in Table 12 (rows are the current transition string, columns are the next transition string):
   | aa | ab | ba | bb
aa | 1/3 | 2/3 | 0 | 0
ab | 0 | 0 | 3/4 | 1/4
ba | 1/3 | 2/3 | 0 | 0
bb | 0 | 0 | 0.5/1.5 | 0.5/2.5
TABLE 12
The server establishes the prefix probability table corresponding to the United States from Table 12, as shown in Table 1.
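Similarly, the server assembles the transition strings obtained from the sample prefix information into the transition count matrix corresponding to the United Kingdom, as shown in Table 13 (rows are the current transition string, columns are the next transition string):
   | aa | ab | ba | bb
aa | 0 | 0 | 0 | 0
ab | 0 | 0 | 1 | 2
ba | 0 | 2 | 0 | 0
bb | 0 | 0 | 2 | 0
TABLE 13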
The server determines the transition probability matrix corresponding to the United Kingdom from Table 13, as shown in Table 14:
   | aa | ab | ba | bb
aa | 0.5/1 | 0.5/1 | 0 | 0
ab | 0 | 0 | 1/3 | 2/3
ba | 0.5/2.5 | 2/2.5 | 0 | 0
bb | 0 | 0 | 2/2.5 | 0.5/2.5
TABLE 14
The server establishes the prefix probability table corresponding to the United Kingdom from Table 14, as shown in Table 2.
It should be noted that, when determining the transition probability matrix for each region from its transition count matrix, every zero entry in the transition count matrix is handled as follows: if the last (N-2) characters of the transition string of the row containing the zero are the same as the first (N-2) characters of the transition string of the column containing the zero, the zero is counted as 0.5 occurrences; otherwise, the zero is still counted as 0 occurrences. For example, in Table 13, for the zero in row aa and column ab, the last character a of the row's transition string aa is the same as the first character a of the column's transition string ab, so the zero is counted as 0.5 occurrences; for the zero in row aa and column ba, the last character a of the row's transition string aa differs from the first character b of the column's transition string ba, so the zero is still counted as 0 occurrences.
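The zero-handling rule and the row normalization can be sketched as below, using the United Kingdom counts from Table 13. The function name `transition_probabilities` is hypothetical, and dividing each row by its adjusted total is an assumption inferred from the fractions shown in Table 14.

```python
from fractions import Fraction

STATES = ["aa", "ab", "ba", "bb"]

# Transition count matrix for the United Kingdom, copied from Table 13
# (outer key: current transition string, inner key: next transition string).
uk_counts = {
    "aa": {"aa": 0, "ab": 0, "ba": 0, "bb": 0},
    "ab": {"aa": 0, "ab": 0, "ba": 1, "bb": 2},
    "ba": {"aa": 0, "ab": 2, "ba": 0, "bb": 0},
    "bb": {"aa": 0, "ab": 0, "ba": 2, "bb": 0},
}


def transition_probabilities(counts, states):
    """Raise reachable zero counts to 0.5, then normalize each row."""
    probs = {}
    for row in states:
        adjusted = {}
        for col in states:
            n = Fraction(counts[row][col])
            # A zero becomes 0.5 only when the row's last N-2 characters
            # equal the column's first N-2 characters.
            if n == 0 and row[1:] == col[:-1]:
                n = Fraction(1, 2)
            adjusted[col] = n
        total = sum(adjusted.values())
        probs[row] = {c: (v / total if total else Fraction(0))
                      for c, v in adjusted.items()}
    return probs


uk_probs = transition_probabilities(uk_counts, STATES)
```

With these inputs, row `ba` becomes 0.5/2.5 and 2/2.5 (that is, 1/5 and 4/5), matching the corresponding row of Table 14.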
Further, the present application provides a specific method of establishing in advance the suffix probability table for a region, as referred to in step S103: sample mailboxes of known regions are obtained in advance; the sample suffix information in each sample mailbox is extracted; for each piece of sample suffix information, the probability that it appears in the region is counted; and the suffix probability table corresponding to the region is established from the probabilities counted for each piece of sample suffix information.
When counting, for each piece of sample suffix information, the probability that it appears in a region, the present application first counts the number of times the sample suffix information appears in that region, then counts the total number of times it appears across all regions, and takes the ratio of the former to the latter as the probability that the sample suffix information appears in the region.
For example, continuing the example in steps S101 to S105, assume the sample suffix information extracted from the sample mailboxes known to belong to the United States and the United Kingdom is as shown in Table 15:
[Table 15 is an image in the source and is not reproduced here.]
TABLE 15
The server counts that the sample suffix information gmail.com appears 2 times in the United States and 3 times in total across the United States and the United Kingdom, and takes the ratio of 2 (the number of times gmail.com appears in the United States) to 3 (the total number of times gmail.com appears in the United States and the United Kingdom), i.e., 2/3, as the probability of the sample suffix information gmail.com appearing in the United States. Similarly, the server counts the probability of each piece of sample suffix information appearing in the United States, as shown in Table 5, and the probability of each piece of sample suffix information appearing in the United Kingdom, as shown in Table 6.
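The suffix counting can be illustrated with a short sketch. The sample list below is hypothetical (Table 15 is not available here); it is chosen only so that gmail.com appears twice in the United States and three times overall, as in the text. The suffix `yahoo.co.uk` and the function name `suffix_probability_tables` are likewise illustrative assumptions.

```python
from collections import Counter
from fractions import Fraction

# Hypothetical (region, suffix) samples; not the contents of Table 15.
samples = [
    ("US", "gmail.com"),
    ("US", "gmail.com"),
    ("UK", "gmail.com"),
    ("UK", "yahoo.co.uk"),
]


def suffix_probability_tables(samples):
    """For each region, probability of a suffix = count in region / count in all regions."""
    per_region = Counter(samples)                          # (region, suffix) -> count
    overall = Counter(suffix for _, suffix in samples)     # suffix -> count
    tables = {}
    for (region, suffix), n in per_region.items():
        tables.setdefault(region, {})[suffix] = Fraction(n, overall[suffix])
    return tables


suffix_tables = suffix_probability_tables(samples)
```

In this sketch `suffix_tables["US"]["gmail.com"]` is 2/3, the same ratio worked out in the paragraph above.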
Based on the same idea as the method for identifying a region to which a user belongs described above, an embodiment of the present application further provides an apparatus for identifying a region to which a user belongs, as shown in Fig. 2.
Fig. 2 is a schematic structural diagram of an apparatus for identifying a region to which a user belongs according to an embodiment of the present application, where the apparatus includes:
an obtaining module 201, configured to obtain an email box of a user;
a splitting module 202, configured to split the email box into prefix information and suffix information;
a first determining module 203, configured to determine, for each region, a prefix judgment probability that the prefix information appears in the region, and determine a suffix judgment probability that the suffix information appears in the region;
a second determining module 204, configured to determine, according to the prefix judgment probability and the suffix judgment probability corresponding to each region, a final judgment probability that the email box belongs to each region;
and an identifying module 205, configured to identify the region to which the user belongs according to the final judgment probabilities.
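As a rough illustration of what the splitting module might do, the sketch below splits an address at the `@` sign and breaks a prefix into overlapping three-character strings. Both function names are hypothetical, and treating the ternary split as a sliding window is an assumption rather than something this passage states.

```python
def split_mailbox(address):
    """Split an email address into (prefix, suffix) at the last '@'."""
    prefix, _, suffix = address.rpartition("@")
    if not prefix or not suffix:
        raise ValueError(f"not a valid email address: {address!r}")
    return prefix, suffix


def split_ngrams(text, n=3):
    """Return the overlapping n-character strings of text (empty if text is shorter than n)."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


prefix, suffix = split_mailbox("aabab@gmail.com")
trigrams = split_ngrams(prefix)
```

For the hypothetical address above, the prefix `aabab` yields the strings `aab`, `aba`, and `bab`, each of which could then be looked up in a region's prefix probability table.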
The first determining module 203 is specifically configured to split the prefix information into a plurality of character strings, determine, in a prefix probability table established in advance for the region, the probability that each character string appears in the region, and determine, according to the probability that each character string appears in the region and a Bayesian formula, the prefix judgment probability that the prefix information appears in the region.
The first determining module 203 is specifically configured to obtain sample mailboxes of known regions in advance, extract the sample prefix information in each sample mailbox, split each piece of extracted sample prefix information into a plurality of character strings, extract the preamble characters of each character string, determine the ratio of the number of times the character string appears in the region to the number of times its preamble characters appear in the region as the probability that the character string appears in the region, and establish the prefix probability table corresponding to the region from the probabilities counted for each character string.
The first determining module 203 is specifically configured to determine, in a suffix probability table established in advance for the region, the suffix judgment probability that the suffix information appears in the region.
The first determining module 203 is specifically configured to obtain sample mailboxes of known regions in advance, extract the sample suffix information in each sample mailbox, count, for each piece of sample suffix information, the probability that it appears in the region, and establish the suffix probability table corresponding to the region from the probabilities counted for each piece of sample suffix information.
The second determining module 204 is specifically configured to determine the final judgment probability that the email box belongs to each region by the formula $P = \dfrac{P(\text{region}_i \mid \text{prefix}) \cdot P(\text{region}_i \mid \text{suffix})}{P(\text{region}_i)}$, where: $P$ represents the final judgment probability that the email box belongs to region $i$; $P(\text{region}_i \mid \text{prefix})$ represents the prefix judgment probability that the prefix information appears in region $i$; $P(\text{region}_i \mid \text{suffix})$ represents the suffix judgment probability that the suffix information appears in region $i$; and $P(\text{region}_i)$ represents the empirical probability that an email box belongs to region $i$.
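A minimal sketch of this combination step follows. The probability values are hypothetical, the function names are illustrative, and picking the region with the largest final judgment probability is an assumption about how the identifying module uses the results.

```python
from fractions import Fraction


def final_probability(p_prefix, p_suffix, p_prior):
    """P = P(region_i | prefix) * P(region_i | suffix) / P(region_i)."""
    return p_prefix * p_suffix / p_prior


def identify_region(per_region):
    """per_region maps region -> (prefix prob, suffix prob, empirical prob)."""
    finals = {r: final_probability(*v) for r, v in per_region.items()}
    # Assumption: the region with the largest final probability is selected.
    return max(finals, key=finals.get), finals


# Hypothetical inputs for two regions.
inputs = {
    "US": (Fraction(3, 5), Fraction(2, 3), Fraction(1, 2)),
    "UK": (Fraction(2, 5), Fraction(1, 3), Fraction(1, 2)),
}
region, finals = identify_region(inputs)
```

With these made-up inputs the United States scores 4/5 and the United Kingdom 4/15, so the sketch selects the United States.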
The region includes a country.
In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include volatile memory in a computer-readable medium, such as random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media, including both permanent and non-permanent, removable and non-removable media, may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape or magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information that can be accessed by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media (transitory media), such as modulated data signals and carrier waves.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other like elements in the process, method, article, or apparatus that comprises the element.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The above description is only an example of the present application and is not intended to limit the present application. Various modifications and changes may occur to those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present application should be included in the scope of the claims of the present application.

Claims (14)

1. A method for identifying a region to which a user belongs, the method comprising:
acquiring an email box of a user;
splitting the electronic mailbox into prefix information and suffix information;
for each region, determining the prefix judgment probability of the prefix information appearing in the region and determining the suffix judgment probability of the suffix information appearing in the region;
determining the final judgment probability of the electronic mailbox belonging to each region according to the prefix judgment probability and the suffix judgment probability corresponding to each region;
and identifying the region to which the user belongs according to the final judgment probabilities.
2. The method of claim 1, wherein determining the prefix judgment probability that the prefix information appears in the region specifically comprises:
splitting the prefix information into a plurality of character strings;
determining the probability of each character string appearing in the region in a prefix probability table established in advance for the region;
and determining the prefix judgment probability of the prefix information appearing in the region according to the probability of each character string appearing in the region and a Bayesian formula.
3. The method of claim 2, wherein pre-establishing a prefix probability table for the region comprises:
obtaining all sample mailboxes of known areas in advance;
extracting sample prefix information in each sample mailbox;
splitting each extracted sample prefix information into a plurality of character strings;
extracting the preamble characters of each character string split from the sample prefix information, wherein the preamble characters are the characters of the character string other than the last character;
determining the ratio of the number of times the character string appears in the region to the number of times the preamble characters of the character string appear in the region as the probability of the character string appearing in the region;
and establishing a prefix probability table corresponding to the region according to the probability counted by aiming at each character string split by the sample prefix information.
4. The method according to claim 1, wherein determining the suffix decision probability that the suffix information occurs in the region comprises:
a suffix judgment probability that the suffix information appears in the region is determined in a suffix probability table established in advance for the region.
5. The method of claim 4, wherein pre-establishing a suffix probability table for the region comprises:
obtaining all sample mailboxes of known areas in advance;
extracting sample suffix information in each sample mailbox;
counting the probability of the sample suffix information appearing in the region aiming at each sample suffix information;
and establishing a suffix probability table corresponding to the region according to the probability counted by the suffix information of each sample.
6. The method of claim 1, wherein determining the final judgment probability that the electronic mailbox belongs to each region specifically comprises:
determining the final judgment probability that the electronic mailbox belongs to each region by the formula $P = \dfrac{P(\text{region}_i \mid \text{prefix}) \cdot P(\text{region}_i \mid \text{suffix})}{P(\text{region}_i)}$; wherein:
$P$ represents the final judgment probability that the electronic mailbox belongs to region $i$; $P(\text{region}_i \mid \text{prefix})$ represents the prefix judgment probability that the prefix information appears in region $i$; $P(\text{region}_i \mid \text{suffix})$ represents the suffix judgment probability that the suffix information appears in region $i$; and $P(\text{region}_i)$ represents the empirical probability that an electronic mailbox belongs to region $i$.
7. A method according to any one of claims 1 to 6, wherein the region comprises a country.
8. An apparatus for identifying a region to which a user belongs, the apparatus comprising:
the acquisition module is used for acquiring an email box of a user;
the splitting module is used for splitting the electronic mailbox into prefix information and suffix information;
the first determining module is used for determining the prefix judgment probability of the prefix information appearing in the region and determining the suffix judgment probability of the suffix information appearing in the region aiming at each region;
the second determining module is used for determining the final judgment probability that the electronic mailbox belongs to each region according to the prefix judgment probability and the suffix judgment probability corresponding to each region;
and the identification module is used for identifying the region to which the user belongs according to the final judgment probabilities.
9. The apparatus of claim 8, wherein the first determining module is specifically configured to split the prefix information into a plurality of character strings, determine a probability of each character string appearing in the region in a prefix probability table established in advance for the region, and determine the prefix determination probability of the prefix information appearing in the region according to the probability of each character string appearing in the region and a bayesian formula.
10. The apparatus according to claim 9, wherein the first determining module is specifically configured to obtain sample mailboxes of known regions in advance, extract the sample prefix information in each sample mailbox, split each piece of extracted sample prefix information into a plurality of character strings, extract the preamble characters of each character string split from the sample prefix information, where the preamble characters are the characters of the character string other than the last character, determine the ratio of the number of times the character string appears in the region to the number of times the preamble characters of the character string appear in the region as the probability that the character string appears in the region, and establish the prefix probability table corresponding to the region according to the probability counted for each character string split from the sample prefix information.
11. The apparatus according to claim 9, wherein the first determining module is specifically configured to determine a suffix determination probability that the suffix information is present in the region, in a suffix probability table established in advance for the region.
12. The apparatus according to claim 11, wherein the first determining module is specifically configured to obtain sample mailboxes of known regions in advance, extract sample suffix information in the sample mailboxes, count, for each sample suffix information, probabilities of the sample suffix information appearing in the region, and establish a suffix probability table corresponding to the region according to the probabilities counted for each sample suffix information.
13. The apparatus of claim 9, wherein the second determining module is specifically configured to determine the final judgment probability that the electronic mailbox belongs to each region by the formula $P = \dfrac{P(\text{region}_i \mid \text{prefix}) \cdot P(\text{region}_i \mid \text{suffix})}{P(\text{region}_i)}$; wherein: $P$ represents the final judgment probability that the electronic mailbox belongs to region $i$; $P(\text{region}_i \mid \text{prefix})$ represents the prefix judgment probability that the prefix information appears in region $i$; $P(\text{region}_i \mid \text{suffix})$ represents the suffix judgment probability that the suffix information appears in region $i$; and $P(\text{region}_i)$ represents the empirical probability that an electronic mailbox belongs to region $i$.
14. An apparatus according to any one of claims 8 to 13, wherein the region comprises a country.
CN201610121595.5A | 2016-03-03 | 2016-03-03 | Method and device for identifying region to which user belongs | Active | CN107153654B (en)

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN201610121595.5A CN107153654B (en) | 2016-03-03 | 2016-03-03 | Method and device for identifying region to which user belongs

Publications (2)

Publication Number | Publication Date
CN107153654A (en) | 2017-09-12
CN107153654B (en) | 2020-04-28

Family

ID=59791323

Country Status (1)

Country | Link
CN (1) | CN107153654B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN115099832B (en) * | 2022-06-29 | 2024-07-05 | 广州华多网络科技有限公司 | Abnormal user detection method and device, equipment, medium and product thereof

Citations (3)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN102831246A (en) * | 2012-09-17 | 2012-12-19 | 中央民族大学 | Method and device for classification of Tibetan webpage
CN103064951A (en) * | 2012-12-31 | 2013-04-24 | 南京烽火星空通信发展有限公司 | Region recognition method and device of public opinion information
CN104731977A (en) * | 2015-04-14 | 2015-06-24 | 海量云图(北京)数据技术有限公司 | Phone number data search and classification method

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
US20030216995A1 (en) * | 2002-05-15 | 2003-11-20 | Depauw Thomas | Automated financial system and method

Also Published As

Publication number | Publication date
CN107153654A (en) | 2017-09-12

Legal Events

DateCodeTitleDescription
PB01Publication
SE01Entry into force of request for substantive examination
REGReference to a national code

Ref country code:HK

Ref legal event code:DE

Ref document number:1243803

Country of ref document:HK

GR01Patent grant
TR01Transfer of patent right

Effective date of registration:20200928

Address after:27 Hospital Road, George Town, Grand Cayman ky1-9008

Patentee after:Innovative advanced technology Co.,Ltd.

Address before:27 Hospital Road, George Town, Grand Cayman ky1-9008

Patentee before:Advanced innovation technology Co.,Ltd.

Effective date of registration:20200928

Address after:27 Hospital Road, George Town, Grand Cayman ky1-9008

Patentee after:Advanced innovation technology Co.,Ltd.

Address before:A four-storey 847 mailbox in Grand Cayman Capital Building, British Cayman Islands

Patentee before:Alibaba Group Holding Ltd.

TR01Transfer of patent right
