CN104317784A

Info

Publication number: CN104317784A
Application number: CN201410521299.5A
Authority: CN
Inventors: 李寿山; 黄磊; 周国栋; 王红玲
Original assignee: Suzhou University
Current assignee: Suzhou University
Priority date: 2014-09-30
Filing date: 2014-09-30
Publication date: 2015-01-28

Abstract

The cross-platform user identification method and system disclosed in the present invention take full account of the importance of user messages on social platforms. They identify whether two accounts on different platforms belong to the same user by comparing the personalized information reflected in the messages of the two accounts within a corresponding time period, such as what the user has seen and heard, the user's interests and preferences, writing style, and word-usage habits. Specifically, the method obtains the messages published within a preset time period by two accounts on different platforms, performs word segmentation and feature extraction on the message content of both accounts, and then uses the similarity of the word-segmentation features of the two accounts' messages to identify whether the two accounts belong to the same user. The invention thus addresses the problem of identifying the same user across different social platforms and thereby supports cross-platform data analysis for that user.

Description

Translated from Chinese

A cross-platform user identification method and system

Technical field

The invention belongs to the fields of natural language processing and social networks, and in particular relates to a cross-platform user identification method and system.

Background art

In recent years, with the rapid development of social networks, microblogs (Micro-blog) of various types, such as Sina Weibo, Tencent Weibo, Twitter, and Facebook, have become increasingly popular with users.

Because microblogs have the characteristics of both a broadcast medium and a social network, they have attracted many researchers to analyze microblog data. More and more users now hold microblog accounts on several platforms at once, for example a Sina account and a Tencent account. Studying the same user's microblog data (for example, microblog messages) across platforms allows a more comprehensive analysis and deeper mining of the user's interests and preferences, which helps companies formulate personalized marketing strategies and target advertising precisely. It also makes it easier to compare the same user's motivations and usage habits on different platforms, which provides a better reference for operating social networks or developing new social network products.

However, research on identifying the same user across social platforms is currently almost nonexistent, and it is not possible to tell whether accounts on different platforms belong to the same user. Identifying the same user across different social platforms has therefore become a problem in urgent need of a solution.

Summary of the invention

In view of this, the purpose of the present invention is to provide a cross-platform user identification method and system that solve the problem of identifying the same user on different social platforms and thereby support cross-platform data analysis for that user.

To this end, the present invention discloses the following technical solution:

A cross-platform user identification method, comprising:

obtaining a first message segment of a first user account on a first platform and a second message segment of a second user account on a second platform, wherein the first message segment consists of all messages in the first user account whose publication time falls within a first preset time period, and the second message segment consists of all messages in the second user account whose publication time falls within the first preset time period;

performing word segmentation on the first message segment and the second message segment respectively, to obtain the first message segment in segmented form and the second message segment in segmented form;

performing feature extraction on the segmented first message segment and the segmented second message segment based on preset word-segmentation features, and obtaining a feature similarity value of the first message segment and the second message segment on the basis of the feature extraction;

judging whether the feature similarity value is within a preset reference range of similarity values;

if so, determining that the first user account and the second user account belong to the same user.

In the above method, preferably, performing feature extraction on the segmented first message segment and the segmented second message segment based on preset word-segmentation features, and obtaining the feature similarity value of the first message segment and the second message segment on the basis of the feature extraction, comprises:

performing trigram feature extraction on the segmented first message segment and the segmented second message segment respectively, and obtaining a word inclusion similarity value for the two based on the number of identical trigrams contained in the first message segment and the second message segment;

performing high-frequency word feature extraction on the segmented first message segment and the segmented second message segment respectively, and obtaining a high-frequency word similarity value for the two based on the number of identical high-frequency words contained in the first message segment and the second message segment;

extracting single-character occurrence probabilities from the segmented first message segment and the segmented second message segment respectively, and obtaining a word distribution similarity value for the two based on the occurrence probabilities of the single characters common to the first message segment and the second message segment;

extracting the latent topics of the segmented first message segment and the segmented second message segment respectively, and obtaining a topic similarity value for the two based on the number of identical topics contained in the first message segment and the second message segment.

In the above method, preferably, before feature extraction is performed on the segmented first message segment and the segmented second message segment, the method further comprises filtering the segmented first message segment and the segmented second message segment respectively, the filtering comprising:

removing stop words and low-frequency words from the segmented first message segment;

removing stop words and low-frequency words from the segmented second message segment.
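The filtering step can be illustrated with a minimal Python sketch. The stop-word list, the `min_count` threshold, and the function name `filter_tokens` are assumptions made for illustration; the patent does not specify which stop words are used or what counts as a low-frequency word.

```python
from collections import Counter

# Hypothetical stop-word list; a real system would load a much larger one.
STOP_WORDS = {"的", "了", "是"}

def filter_tokens(tokens, stop_words=STOP_WORDS, min_count=2):
    """Drop stop words, then drop tokens occurring fewer than min_count times."""
    kept = [t for t in tokens if t not in stop_words]
    counts = Counter(kept)
    # min_count is an assumed cutoff for "low-frequency" words.
    return [t for t in kept if counts[t] >= min_count]

tokens = ["我", "的", "游戏", "游戏", "了", "军事", "游戏"]
filtered = filter_tokens(tokens)
print(filtered)
```

The same function would be applied separately to the segmented first and second message segments.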

Preferably, the above method further comprises:

training a maximum entropy classification method for cross-platform user identification in advance, using a set number of message segment sample pairs and the feature similarity of each sample pair, to obtain a maximum entropy classifier, so that the maximum entropy classifier can be used to identify whether the first user account on the first platform and the second user account on the second platform belong to the same user, wherein:

the two message segments in each message segment sample pair belong to two accounts on different platforms, the two accounts being accounts of the same user or accounts of different users, and the publication times of the messages in the sample pair fall within a second preset time period;

the feature similarity comprises the word inclusion similarity, the high-frequency word similarity, the word distribution similarity, and the topic similarity.
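As a rough sketch of this training step: for two classes, a maximum entropy model is equivalent to logistic regression, so the example below fits one by stochastic gradient ascent using only the standard library. The training data are invented toy values, the function names are hypothetical, and treating the word distribution feature as a graded score (larger meaning more similar) is an assumption; the patent does not give its training data or optimizer.

```python
import math

def sigmoid(z):
    # Clamp z so math.exp cannot overflow on extreme inputs.
    z = max(-30.0, min(30.0, z))
    return 1.0 / (1.0 + math.exp(-z))

def train_maxent(samples, labels, epochs=500, learning_rate=0.1):
    """Fit a two-class maximum entropy (logistic regression) model."""
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(epochs):
        for features, label in zip(samples, labels):
            score = bias + sum(w * f for w, f in zip(weights, features))
            error = label - sigmoid(score)  # gradient of the log-likelihood
            bias += learning_rate * error
            weights = [w + learning_rate * error * f
                       for w, f in zip(weights, features)]
    return weights, bias

def predict_same_user(model, features):
    """Return True when the model assigns probability >= 0.5 to 'same user'."""
    weights, bias = model
    score = bias + sum(w * f for w, f in zip(weights, features))
    return sigmoid(score) >= 0.5

# Invented toy sample pairs, one row per pair:
# [word inclusion, high-frequency word, word distribution, topic] similarity.
samples = [[3, 3, 3, 2], [0, 1, 0, 0], [2, 3, 2, 3],
           [1, 0, 1, 0], [3, 2, 3, 3], [0, 0, 0, 1]]
labels = [1, 0, 1, 0, 1, 0]  # 1 = same user, 0 = different users
model = train_maxent(samples, labels)
```

Once trained, `predict_same_user(model, features)` can be called with the four similarity values computed for a new pair of accounts.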

In the above method, preferably, the word distribution similarity value of the first message segment and the second message segment is obtained by calculating their relative entropy D(p||q):

$$ D(p \| q) = \sum_{x \in X} p(x) \log \frac{p(x)}{q(x)} $$

where p and q denote the first message segment and the second message segment respectively, p(x) and q(x) denote the probabilities with which a common single character x occurs in the first message segment and the second message segment respectively, and X denotes the set of single characters common to the first message segment and the second message segment.

In the above method, preferably, the document topic generation model LDA is used to extract the latent topics of the segmented first message segment and the segmented second message segment.

A cross-platform user identification system, comprising:

a message acquisition module, configured to obtain a first message segment of a first user account on a first platform and a second message segment of a second user account on a second platform, wherein the first message segment consists of all messages in the first user account whose publication time falls within a first preset time period, and the second message segment consists of all messages in the second user account whose publication time falls within the first preset time period;

a word segmentation module, configured to perform word segmentation on the first message segment and the second message segment respectively, to obtain the first message segment in segmented form and the second message segment in segmented form;

a feature extraction module, configured to perform feature extraction on the segmented first message segment and the segmented second message segment based on preset word-segmentation features, and to obtain a feature similarity value of the first message segment and the second message segment on the basis of the feature extraction;

a judging module, configured to judge whether the feature similarity value is within a preset reference range of similarity values;

an identification module, configured to identify, when the judgment result is yes, that the first user account and the second user account belong to the same user.

In the above system, preferably, the feature extraction module comprises:

a first extraction unit, configured to perform trigram feature extraction on the segmented first message segment and the segmented second message segment respectively, and to obtain a word inclusion similarity value for the two based on the number of identical trigrams contained in the first message segment and the second message segment;

a second extraction unit, configured to perform high-frequency word feature extraction on the segmented first message segment and the segmented second message segment respectively, and to obtain a high-frequency word similarity value for the two based on the number of identical high-frequency words contained in the first message segment and the second message segment;

a third extraction unit, configured to extract single-character occurrence probabilities from the segmented first message segment and the segmented second message segment respectively, and to obtain a word distribution similarity value for the two based on the occurrence probabilities of the single characters common to the first message segment and the second message segment;

a fourth extraction unit, configured to extract the latent topics of the segmented first message segment and the segmented second message segment respectively, and to obtain a topic similarity value for the two based on the number of identical topics contained in the first message segment and the second message segment.

Preferably, the above system further comprises a filtering module for filtering the segmented first message segment and the segmented second message segment respectively, the filtering module comprising:

a first filtering unit, configured to remove stop words and low-frequency words from the segmented first message segment;

a second filtering unit, configured to remove stop words and low-frequency words from the segmented second message segment.

Preferably, the above system further comprises:

a preprocessing module, configured to train a maximum entropy classification method for cross-platform user identification in advance, using a set number of message segment sample pairs and the feature similarity of each sample pair, to obtain a maximum entropy classifier, so that the maximum entropy classifier can be used to identify whether the first user account on the first platform and the second user account on the second platform belong to the same user, wherein:

As the above solution shows, the cross-platform user identification method and system disclosed in the present invention take full account of the importance of user messages on social platforms. They identify whether two accounts on different platforms belong to the same user by comparing the personalized information reflected in the messages of the two accounts within a corresponding time period, such as what the user has seen and heard, the user's interests and preferences, writing style, and word-usage habits. Specifically, the method obtains the messages published within a preset time period by two accounts on different platforms, performs word segmentation and feature extraction on the message content of both accounts, and then uses the similarity of the word-segmentation features of the two accounts' messages to identify whether the two accounts belong to the same user. The invention thus addresses the problem of identifying the same user across different social platforms and thereby supports cross-platform data analysis for that user.

Brief description of the drawings

To illustrate the technical solutions in the embodiments of the present invention or in the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. The drawings described below are merely embodiments of the present invention; a person of ordinary skill in the art can obtain other drawings from them without creative effort.

FIG. 1 is a flow chart of the cross-platform user identification method disclosed in Embodiment 1 of the present invention;

FIG. 2 is another flow chart of the cross-platform user identification method disclosed in Embodiment 2 of the present invention;

FIG. 3 is yet another flow chart of the cross-platform user identification method disclosed in Embodiment 3 of the present invention;

FIG. 4 is a schematic structural diagram of the cross-platform user identification system disclosed in Embodiment 4 of the present invention;

FIG. 5 is another schematic structural diagram of the cross-platform user identification system disclosed in Embodiment 4 of the present invention;

FIG. 6 is yet another schematic structural diagram of the cross-platform user identification system disclosed in Embodiment 4 of the present invention.

Detailed description

The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings in the embodiments. The described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments that a person of ordinary skill in the art obtains from the embodiments of the present invention without creative effort fall within the scope of protection of the present invention.

Embodiment 1

Embodiment 1 discloses a cross-platform user identification method. Referring to FIG. 1, the method may include the following steps:

S101: Obtain a first message segment of a first user account on a first platform and a second message segment of a second user account on a second platform, wherein the first message segment consists of all messages in the first user account whose publication time falls within a first preset time period, and the second message segment consists of all messages in the second user account whose publication time falls within the first preset time period.

This embodiment describes the method of the present invention using the example of identifying whether a Sina Weibo account and a Tencent Weibo account belong to the same user.

Specifically, the API (Application Programming Interface) provided by Sina Weibo can be used to fetch, from a specified Sina Weibo account, the Sina user messages published within the preset time period, and the API provided by Tencent Weibo can be used to fetch, from a specified Tencent Weibo account, the Tencent user messages published within the preset time period. For example, all message texts posted or reposted by the user of Sina Weibo account userid1 in the last three months are fetched to form text segment 1, and all message texts posted or reposted by the user of Tencent Weibo account userid2 in the last three months are fetched to form text segment 2.
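The time-window selection in S101 can be sketched as follows. This is only an illustration under stated assumptions: the messages are taken to be already fetched as `(posted_at, text)` pairs, the platform API calls themselves are not shown, the function name `build_message_segment` is hypothetical, and "three months" is approximated as 90 days.

```python
from datetime import datetime, timedelta

def build_message_segment(messages, window_start, window_end):
    """Join the texts of all messages posted within [window_start, window_end]."""
    selected = [text for posted_at, text in messages
                if window_start <= posted_at <= window_end]
    return "\n".join(selected)

window_end = datetime(2014, 9, 30)
window_start = window_end - timedelta(days=90)  # assumed length of "three months"

# Invented sample records standing in for the messages of one account.
messages = [
    (datetime(2014, 9, 1), "放假又不知道做什么"),
    (datetime(2014, 3, 1), "posted too early, outside the window"),
    (datetime(2014, 8, 15), "不知道做什么事情"),
]
segment_1 = build_message_segment(messages, window_start, window_end)
print(segment_1)
```

The same window would be applied to the second account so that both segments cover the same preset time period.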

S102: Perform word segmentation on the first message segment and the second message segment respectively, to obtain the first message segment in segmented form and the second message segment in segmented form.

Word segmentation means dividing a Chinese sentence into a sequence of words. For example, "我爱中国" ("I love China") becomes "我 爱 中国" after segmentation.

In this step, the word segmentation software FudanNLP is used to segment the message segments obtained from the two accounts on different platforms (this application uses text segments), for example text segment 1 of Sina Weibo account userid1 and text segment 2 of Tencent Weibo account userid2.
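To make the idea of word segmentation concrete, the sketch below uses greedy forward maximum matching against a tiny dictionary. This is a swapped-in stand-in technique, not the FudanNLP segmenter the patent names; the dictionary, the `max_word_len` limit, and the function name `segment` are all assumptions made for illustration.

```python
def segment(text, dictionary, max_word_len=4):
    """Greedy forward maximum matching over a word dictionary."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking down to one character.
        for length in range(min(max_word_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            # A single character is always accepted as a fallback.
            if length == 1 or candidate in dictionary:
                tokens.append(candidate)
                i += length
                break
    return tokens

# Hypothetical toy dictionary.
toy_dictionary = {"我", "爱", "中国", "知道", "什么", "放假", "事情"}
print(segment("我爱中国", toy_dictionary))  # ['我', '爱', '中国']
```

A production segmenter handles ambiguity and unknown words far better than this greedy rule; the sketch only shows the input and output shape that the later feature extraction steps rely on.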

S103: Perform feature extraction on the segmented first message segment and the segmented second message segment based on preset word-segmentation features, and obtain the feature similarity value of the first message segment and the second message segment on the basis of the feature extraction.

Step S103 includes:

a. Perform trigram feature extraction on the segmented first message segment and the segmented second message segment respectively, and obtain a word inclusion similarity value for the two based on the number of identical trigrams contained in the first message segment and the second message segment.

This step judges how similar two text segments are from the number of identical trigrams they contain: the more identical trigrams, the more similar the segments are considered to be.

A trigram is a structure made up of three consecutive segmented words in the message text.

Step a extracts every trigram contained in each of the two segmented text segments. For example, suppose that after segmentation text A is "放假 又 不 知道 做 什么" ("on holiday and don't know what to do") and text B is "不 知道 做 什么 事情" ("don't know what things to do"). Four trigrams can be extracted from text A: (1) 放假 又 不, (2) 又 不 知道, (3) 不 知道 做, (4) 知道 做 什么. Three trigrams can be extracted from text B: (1) 不 知道 做, (2) 知道 做 什么, (3) 做 什么 事情. The two texts therefore share two identical trigrams.

The word inclusion similarity value of the two text segments is then determined from the number of identical trigrams they contain.

For example, if the two text segments share 0 identical trigrams, their word inclusion similarity value is 0; if the number of identical trigrams is in the range 0 to 50, the value is 1; if it is in the range 50 to 100, the value is 2; and if it is greater than 100, the value is 3.
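Step a can be sketched in Python as below, using the text A and text B example. The function names are hypothetical, and two details are assumptions: identical trigrams are counted as distinct shared trigrams (a set intersection), and the range boundaries 50 and 100 are assigned to the lower bucket, since the text leaves both points open.

```python
def trigrams(tokens):
    """Return the set of all runs of three consecutive tokens."""
    return {tuple(tokens[i:i + 3]) for i in range(len(tokens) - 2)}

def word_inclusion_similarity(tokens_a, tokens_b):
    """Map the number of shared trigrams to a 0-3 similarity value."""
    shared = len(trigrams(tokens_a) & trigrams(tokens_b))
    if shared == 0:
        return 0
    if shared <= 50:   # assumed: boundary value 50 falls in this bucket
        return 1
    if shared <= 100:  # assumed: boundary value 100 falls in this bucket
        return 2
    return 3

text_a = ["放假", "又", "不", "知道", "做", "什么"]
text_b = ["不", "知道", "做", "什么", "事情"]
print(len(trigrams(text_a)), len(trigrams(text_b)))  # 4 3
print(word_inclusion_similarity(text_a, text_b))     # 1
```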

b. Perform high-frequency word feature extraction on the segmented first message segment and the segmented second message segment respectively, and obtain a high-frequency word similarity value for the two based on the number of identical high-frequency words contained in the first message segment and the second message segment.

The more identical high-frequency words two texts contain, the more similar they are.

This step first sorts the words in each text segment in descending order of frequency. For example, the sorted word sequence for "我是我" ("I am me") is "我 是".

It then counts the identical words among a predetermined number of the highest-frequency words at the head of the two sorted sequences, for example among the top 100 high-frequency words of each sequence, and determines the high-frequency word similarity value of the two text segments from that count.

For example, it can be specified that when the number of identical high-frequency words is 0, the high-frequency word similarity value is 0; when the number is in the range 0 to 20, the value is 1; when it is in the range 20 to 50, the value is 2; and when it is in the range 50 to 100, the value is 3.
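A minimal sketch of step b follows. The function names are hypothetical, the boundary values 20 and 50 are assigned to the lower bucket as an assumption, and ties in frequency are broken by first occurrence, which is how `Counter.most_common` orders equal counts.

```python
from collections import Counter

def top_words(tokens, n=100):
    """Return the n most frequent words of a segmented text as a set."""
    return {word for word, _ in Counter(tokens).most_common(n)}

def high_frequency_similarity(tokens_a, tokens_b, n=100):
    """Map the number of shared top-n words to a 0-3 similarity value."""
    shared = len(top_words(tokens_a, n) & top_words(tokens_b, n))
    if shared == 0:
        return 0
    if shared <= 20:  # assumed: boundary value 20 falls in this bucket
        return 1
    if shared <= 50:  # assumed: boundary value 50 falls in this bucket
        return 2
    return 3

tokens_p = ["我", "是", "我"]
tokens_q = ["我", "是", "中国", "人"]
print(Counter(tokens_p).most_common())  # [('我', 2), ('是', 1)]
print(high_frequency_similarity(tokens_p, tokens_q))
```

Here the two short texts share the words "我" and "是", which falls in the lowest non-zero bucket.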

c. Extract single-character occurrence probabilities from the segmented first message segment and the segmented second message segment respectively, and obtain a word distribution similarity value for the two based on the occurrence probabilities of the single characters common to the first message segment and the second message segment.

This step obtains the word distribution similarity value of the two text segments by calculating their relative entropy D(p||q):

$$ D(p \| q) = \sum_{x \in X} p(x) \log \frac{p(x)}{q(x)} $$

where p and q denote the first message segment and the second message segment respectively, p(x) and q(x) denote the probabilities with which a common single character x occurs in the first message segment and the second message segment respectively, and X denotes the set of single characters common to the first message segment and the second message segment.

The relative entropy is also called the KL (Kullback-Leibler divergence) distance. The smaller the KL distance between two texts, the smaller the difference between their character distributions, that is, the more similar the two texts are in their distribution of words.

For example, let p be "我是我" ("I am me") and q be "我是中国人" ("I am Chinese"). In text p, the probability of "我" is p(我) = 2/3 and the probability of "是" is p(是) = 1/3; in text q, every character has probability 1/5. By the formula, the KL distance between p and q is D(p||q) = p(我) * log(p(我) / q(我)) + p(是) * log(p(是) / q(是)).

On this basis, the word distribution similarity value of the two text segments is obtained from their KL distance.
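The KL distance of step c can be computed as in the sketch below, which reproduces the "我是我" / "我是中国人" example. The function names are hypothetical and the natural logarithm is an assumption, since the text does not state the base. Because the sum runs only over characters common to both texts, as the description specifies, the result is not a full Kullback-Leibler divergence over either distribution.

```python
import math
from collections import Counter

def char_probabilities(text):
    """Relative frequency of each character in the text."""
    counts = Counter(text)
    total = sum(counts.values())
    return {ch: count / total for ch, count in counts.items()}

def kl_distance(text_p, text_q):
    """Sum p(x) * log(p(x) / q(x)) over characters common to both texts."""
    p = char_probabilities(text_p)
    q = char_probabilities(text_q)
    shared = p.keys() & q.keys()
    return sum(p[x] * math.log(p[x] / q[x]) for x in shared)

distance = kl_distance("我是我", "我是中国人")
# Same two terms as the worked example: p(我) = 2/3, p(是) = 1/3, q = 1/5.
expected = (2 / 3) * math.log((2 / 3) / (1 / 5)) + (1 / 3) * math.log((1 / 3) / (1 / 5))
print(distance)
```

How the distance is then mapped to a word distribution similarity value is not specified in the text, so that mapping is left out here.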

d. Extract the latent topics of the segmented first message segment and the segmented second message segment respectively, and obtain a topic similarity value for the two based on the number of identical topics contained in the first message segment and the second message segment.

Because the microblog texts that the same user posts on different social media are highly similar, the more identical topics two text segments contain, the more likely it is that they come from the same user.

This step analyzes the content of the two text segments and uses LDA (Latent Dirichlet Allocation, a document topic generation model) to extract their latent topics. For example, suppose that game and military information appears frequently in the microblog posts of text segment A, while entertainment news, Taobao shopping, and the like appear frequently in those of text segment B. After the LDA algorithm is applied, the latent topics of text segment A are games, military affairs, and so on, and the topics of text segment B are entertainment, online shopping, and so on.

The topic similarity value of the two text segments is then determined from the number of identical topics in them. For example, if the number of identical topics is 0, the topic similarity value is 0; if it is 1, the value is 1; if it is 2, the value is 2; if it is 3, the value is 3; and so on.
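The counting part of step d can be sketched as follows. The LDA model itself is not implemented here: the per-segment topic weights are invented values standing in for the output of a trained topic model, and the `threshold` used to decide which topics a segment "contains" is an assumption, as are the function names.

```python
def dominant_topics(topic_weights, threshold=0.2):
    """Topics whose weight in the segment is at least the threshold."""
    return {topic for topic, weight in topic_weights.items()
            if weight >= threshold}

def topic_similarity(weights_a, weights_b, threshold=0.2):
    """Topic similarity value = number of dominant topics shared by both segments."""
    return len(dominant_topics(weights_a, threshold)
               & dominant_topics(weights_b, threshold))

# Invented topic weights, as a topic model might assign them to each segment.
segment_a = {"games": 0.5, "military": 0.4, "entertainment": 0.1}
segment_b = {"entertainment": 0.6, "online shopping": 0.4}
segment_c = {"games": 0.5, "online shopping": 0.5}

print(topic_similarity(segment_a, segment_b))  # 0
print(topic_similarity(segment_a, segment_c))  # 1
```

With these weights, segments A and B share no dominant topic, matching the games/military versus entertainment/online shopping example, while A and the extra segment C share one.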

其中，词包含相似度数值、高频词相似度数值或主题相似度数值等取值0、1、2、3等仅表示两文本段的不同相似程度，数值越大，相似程度越高。Among them, the values 0, 1, 2, 3, etc. of word-containing similarity value, high-frequency word similarity value or topic similarity value only represent different similarities between two text segments, and the larger the value, the higher the similarity degree.

S104：判断所述特征相似度数值是否在预设的相似度数值参考范围内。S104: Determine whether the feature similarity value is within a preset similarity value reference range.

S105：若判断结果为是，则所述第一用户账户及所述第二用户账户属于同一用户。S105: If the determination result is yes, the first user account and the second user account belong to the same user.

This step compares the similarity values actually obtained for the two text segments (the word inclusion, high-frequency word, word distribution, and topic similarity values) with a predefined reference benchmark to determine whether the two text segments, which come from two accounts on different platforms, belong to the same user, thereby achieving cross-platform user identification.

例如，假设预先规定的两个不同平台的账户属于同一用户的参考基准为：各个特征相似度数值均大于2。For example, it is assumed that the pre-specified reference benchmark for accounts belonging to the same user on two different platforms is: the similarity values of each feature are greater than 2.

Thus the two text segments are judged to belong to the same user, and the two accounts are identified as the same user, only when every feature similarity value actually obtained is greater than 2; otherwise, when the reference benchmark is not met, the two accounts are identified as different users.
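The example benchmark above (every feature similarity value greater than 2) can be sketched as a simple check. The feature names, the dictionary layout, and the function name `same_user` are assumptions made for illustration; the threshold of 2 is the example value given in the text, not a fixed part of the method.

```python
# Hypothetical keys for the four feature similarity values.
FEATURES = ("word_inclusion", "high_frequency_word", "word_distribution", "topic")


def same_user(similarities, threshold=2):
    """Return True only if every feature similarity value exceeds the threshold."""
    return all(similarities[name] > threshold for name in FEATURES)


pair_1 = {"word_inclusion": 3, "high_frequency_word": 3, "word_distribution": 4, "topic": 3}
pair_2 = {"word_inclusion": 3, "high_frequency_word": 2, "word_distribution": 4, "topic": 3}
decision_1 = same_user(pair_1)  # all four values exceed 2
decision_2 = same_user(pair_2)  # one value equals 2, so the benchmark is not met
```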

由以上方案可知，本发明公开的跨平台用户识别方法，充分考虑社交平台中用户消息的重要性，通过相应时间段内不同平台的两个账户中用户消息所反映的用户见闻、兴趣、偏好以及写作风格、用词习惯等个性化信息的相似情况，来识别用户是否为同一用户，具体地，本发明方法获取不同平台的两个账户中发布时间在预设时间段内的消息内容，并对两个账户的消息内容进行分词及特征抽取处理，在此基础上，利用两个账户消息的分词特征相似度识别所述不同平台的两个账户是否属于同一用户。可见，本发明解决了不同社交平台同一用户的识别问题，进而为同一用户的跨平台数据分析提供了支持。It can be seen from the above scheme that the cross-platform user identification method disclosed in the present invention fully considers the importance of user messages on social platforms, and uses the user knowledge, interests, preferences, and The similarity of personalized information such as writing style and word usage habits to identify whether the users are the same user. Specifically, the method of the present invention obtains the content of messages published within a preset time period in two accounts on different platforms, and Word segmentation and feature extraction are performed on the message content of the two accounts. On this basis, the word segmentation feature similarity of the two account messages is used to identify whether the two accounts on different platforms belong to the same user. It can be seen that the present invention solves the problem of identifying the same user on different social platforms, and further provides support for cross-platform data analysis of the same user.

Embodiment 2

In Embodiment 2, referring to FIG. 2, the cross-platform user identification method may further include the following step between steps S102 and S103:

S106:分别对所述分词形式的第一消息段及分词形式的第二消息段进行过滤处理。S106: Perform filtering processing on the first message segment in the participle form and the second message segment in the participle form respectively.

其中，该步骤包括：Among them, this step includes:

Stop-word removal and low-frequency-word removal are performed on the word-segmented first message segment, and likewise on the word-segmented second message segment.

Specifically, social platform users tend to publish many messages frequently (microblog users, for example, post a large number of microblog messages), so the collected message text can become very large. To speed up cross-platform user identification, this embodiment removes stop words and low-frequency words (for example, words whose frequency is less than 3) from the text segments of the two accounts on different platforms. Removing words of relatively low reference value reduces the dimensionality of the feature vector and speeds up identification with relatively little effect on identification accuracy.
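The filtering step can be sketched as below. This is a minimal illustration that assumes a message segment is already a list of word tokens; the function name `filter_tokens`, the stop-word set, and the sample tokens are hypothetical, and the default cutoff of 3 follows the example in the text.

```python
from collections import Counter


def filter_tokens(tokens, stop_words, min_freq=3):
    """Drop stop words and words that occur fewer than min_freq times.

    tokens: list of words from one word-segmented message segment.
    stop_words: set of words to discard regardless of frequency.
    """
    counts = Counter(tokens)
    return [t for t in tokens if t not in stop_words and counts[t] >= min_freq]


segment = ["game"] * 3 + ["the"] * 5 + ["rare"]
filtered = filter_tokens(segment, stop_words={"the"})
```

Frequencies are counted within a single segment here; counting over a whole corpus instead would be an equally plausible reading of the text.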

Embodiment 3

本实施例三中，参考图3，所述跨平台用户识别方法还可以包括：In the third embodiment, referring to FIG. 3, the cross-platform user identification method may also include:

S107: Train the maximum entropy classification method for cross-platform user identification in advance, using a set number of message segment sample pairs and the feature similarities of each pair, to obtain a maximum entropy classifier; the classifier is then used to identify whether the first user account on the first platform and the second user account on the second platform belong to the same user.

The two message segments in each message segment sample pair belong to two accounts on different platforms; the two accounts belong either to the same user or to different users; and the messages contained in the sample pair were published within a second preset time period. The feature similarities include word inclusion similarity, high-frequency word similarity, word distribution similarity, and topic similarity.

为了提高跨平台用户识别的准确率，本步骤预先利用设定规模的消息段样本对作为训练样本，对最大熵分类方法进行跨平台用户识别训练，得到最大熵分类器。In order to improve the accuracy of cross-platform user identification, in this step, the message segment sample pairs with a set size are used as training samples in advance, and the maximum entropy classification method is used for cross-platform user identification training to obtain a maximum entropy classifier.

以下通过具体实例对最大熵分类器的构建过程进行描述。The following describes the construction process of the maximum entropy classifier through specific examples.

收集1200个同时具有新浪微博账户和腾讯微博账户的用户中每个用户的所述两种平台账户(即userid)，得到1200个用户的1200个新浪微博userid和1200个腾讯微博userid；收集1200个仅有新浪微博账户的用户中每个用户的新浪微博userid，收集1200个仅有腾讯微博账户的用户中每个用户的腾讯微博userid。并将收集到的userid按照平台不同，构建成两个账户列表：新浪账户列表和腾讯账户列表。Collect the two platform accounts (i.e. userid) of each user among 1200 users who have Sina Weibo account and Tencent Weibo account at the same time, and obtain 1200 Sina Weibo userids and 1200 Tencent Weibo userids of 1200 users ; Collect the Sina Weibo userid of each of the 1200 users who only have a Sina Weibo account, and collect the Tencent Weibo userid of each of the 1200 users who only have a Tencent Weibo account. According to different platforms, the collected userids are constructed into two account lists: Sina account list and Tencent account list.

On this basis, the API interfaces provided by Sina Weibo and Tencent Weibo are used to fetch, for each user in the lists, all microblog messages published in the last three months, yielding a message text segment for each userid. The word segmentation software FudanNLP is then used to segment the text of each userid, and each segmented text is associated with the corresponding userid in the account list, with each line representing the message text segment (in word-segmented form) of one account.

The two text segments that belong to the same user on the two platforms are paired with each other, and the remaining text segments are paired across platforms, giving 1200 same-user text segment sample pairs and 1200 different-user text segment sample pairs.

Word segmentation, stop-word removal, and low-frequency-word removal are applied to each text segment sample pair. Then 1000 same-user pairs and 1000 different-user pairs are selected for feature extraction and calculation of feature similarity values, forming the training samples; the remaining 200 same-user pairs and 200 different-user pairs are processed in the same way to form the test samples. The feature similarity values comprise the word inclusion, high-frequency word, word distribution, and topic similarity values. For details of word segmentation, filtering, feature extraction, and calculation of the feature similarity values in this step, see the description of Embodiment 1; they are not repeated here.
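The construction of sample pairs and the 1000/200 split can be sketched as follows. This is only an illustration: the text does not say how the different-user pairs are matched or how the training pairs are chosen, so pairing by list position and a seeded random shuffle are assumptions, and all names (`build_pairs`, `split_pairs`, the placeholder ids) are hypothetical.

```python
import random


def build_pairs(dual_accounts, sina_only, tencent_only):
    """Build labeled cross-platform pairs (sina_item, tencent_item, label).

    dual_accounts: list of (sina_item, tencent_item) for users with both accounts.
    sina_only / tencent_only: items from users with an account on one platform only.
    Label 1 means same user, label 0 means different users.
    """
    same_user = [(s, t, 1) for s, t in dual_accounts]
    # Assumption: remaining accounts are paired by position across platforms.
    different_user = [(s, t, 0) for s, t in zip(sina_only, tencent_only)]
    return same_user, different_user


def split_pairs(pairs, n_train, seed=0):
    """Shuffle a copy of the pairs and split it into training and test parts."""
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)
    return shuffled[:n_train], shuffled[n_train:]


# Placeholder ids standing in for the collected text segments.
dual = [(f"sina_{i}", f"tencent_{i}") for i in range(1200)]
sina_only = [f"sina_only_{i}" for i in range(1200)]
tencent_only = [f"tencent_only_{i}" for i in range(1200)]

positives, negatives = build_pairs(dual, sina_only, tencent_only)
train_pos, test_pos = split_pairs(positives, 1000)
train_neg, test_neg = split_pairs(negatives, 1000)
```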

On this basis, the training samples and the feature similarities of each training sample pair are used to train the maximum entropy classification method for cross-platform user identification, with two classes (the two text segments belong to the same user, or they do not), thereby constructing the maximum entropy classifier.

The maximum entropy classification method is based on maximum entropy theory from information theory. Its basic idea is to model everything that is known and to assume nothing about what is unknown: it looks for a probability distribution that satisfies all known facts while leaving the unknown factors as random as possible. Compared with the naive Bayes method, its main advantage is that it does not require the features to be conditionally independent of one another, so it is suitable for combining many different kinds of features without having to consider the interactions among them.

在最大熵模型下，预测条件概率P(c|D)的公式如下：Under the maximum entropy model, the formula for predicting the conditional probability P(c|D) is as follows:

$P(c_i \mid D) = \frac{1}{Z(D)} \exp\left( \sum_{k} \lambda_{k,c} F_{k,c}(D, c_i) \right) \qquad (1)$

Here Z(D) is the normalization factor; λ_{k,c} is the weight of the feature function F_{k,c}, and its value is obtained while the base classifier is being constructed; i takes the value 1 or 0; k indexes the features in the feature space (in this application, the individual feature similarities) and ranges from 1 to the size of the feature space; and F_{k,c} is the feature function, defined as:

$F_{k,c}(D, c') = \begin{cases} 1, & n_k(d) > 0 \text{ and } c' = c \\ 0, & \text{otherwise} \end{cases} \qquad (2)$

Here n_k(d) denotes the length of the features contained in the sample pair to be identified; in this application n_k(d) is always greater than 0. c denotes the true result of whether the sample pair belongs to the same user, and c' denotes the result after classification (identification) by the classifier. If the classifier's result matches the true result, F_{k,c} is 1; if it does not, F_{k,c} is 0.

For example, take the word inclusion, high-frequency word, word distribution, and topic similarities as features 1, 2, 3, and 4. All four features are present for every sample to be classified (only their values differ), so n_k(d) = 4 and hence n_k(d) > 0.
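The way formula (1) turns weights into class probabilities can be sketched as follows. This is an illustration only: the weights are invented rather than trained, the function name `maxent_probabilities` is hypothetical, and the feature values are passed in explicitly. Passing 1 for each of the four features reproduces the indicator form of formula (2); passing the similarity values themselves would be an assumption that the source does not state.

```python
import math


def maxent_probabilities(weights, features):
    """Compute P(c | D) for each class c as in formula (1).

    weights: dict mapping class label -> list of weights, one per feature.
    features: list of feature-function values for the sample, one per feature.
    """
    scores = {
        c: sum(w * f for w, f in zip(class_weights, features))
        for c, class_weights in weights.items()
    }
    # Subtract the maximum score before exponentiating for numerical stability.
    top = max(scores.values())
    exp_scores = {c: math.exp(s - top) for c, s in scores.items()}
    z = sum(exp_scores.values())  # normalization factor Z(D)
    return {c: e / z for c, e in exp_scores.items()}


# Invented weights for the two classes: 1 = same user, 0 = different users.
example_weights = {1: [0.5, 0.5, 0.5, 0.5], 0: [0.0, 0.0, 0.0, 0.0]}
probabilities = maxent_probabilities(example_weights, [1, 1, 1, 1])
predicted = max(probabilities, key=probabilities.get)
```

The predicted class is the one with the highest probability; in a trained classifier the weights would come from fitting the training sample pairs rather than being set by hand.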

The test samples described above can then be used to evaluate the classification performance of the constructed classifier. The applicant has verified with actual experimental data that the constructed classifier achieves high classification accuracy, and that the accuracy of classifier-based cross-platform user identification is considerably higher than that of identification without the classifier.

Embodiment 4

本发明实施例四公开一种跨平台用户识别系统，所述系统与实施例一至实施例三公开的跨平台用户识别方法相对应。Embodiment 4 of the present invention discloses a cross-platform user identification system, and the system corresponds to the cross-platform user identification methods disclosed in Embodiment 1 to Embodiment 3.

首先，相应于实施例一，参考图4，所述系统包括消息获取模块100、分词处理模块200、特征抽取模块300、判断模块400及识别模块500。First, corresponding to Embodiment 1, referring to FIG. 4 , the system includes a message acquisition module 100 , a word segmentation processing module 200 , a feature extraction module 300 , a judgment module 400 and a recognition module 500 .

消息获取模块100，用于获取第一平台上第一用户账户的第一消息段，获取第二平台上第二用户账户的第二消息段，其中，所述第一消息段为由所述第一用户账户内发布时间在第一预设时间段内的所有消息组成的消息段，所述第二消息段为由所述第二用户账户内发布时间在第一预设时间段内的所有消息组成的消息段。The message acquiring module 100 is configured to acquire a first message segment of a first user account on a first platform, and acquire a second message segment of a second user account on a second platform, wherein the first message segment is generated by the second A message segment composed of all messages published within the first preset time period in a user account, and the second message segment is all messages published within the first preset time period in the second user account composed of message segments.

分词处理模块200，用于分别对所述第一消息段及所述第二消息段进行分词处理，得到分词形式的第一消息段及分词形式的第二消息段。The word segmentation processing module 200 is configured to respectively perform word segmentation processing on the first message segment and the second message segment to obtain the first message segment in word segment form and the second message segment in word segment form.

特征抽取模块300，用于基于预设的分词特征对所述分词形式的第一消息段及分词形式的第二消息段进行特征抽取，并在特征抽取的基础上获取所述第一消息段与所述第二消息段的特征相似度数值。The feature extraction module 300 is configured to perform feature extraction on the first message segment in the word segment form and the second message segment in the word segment form based on the preset word segmentation feature, and obtain the first message segment and the message segment based on the feature extraction. The feature similarity value of the second message segment.

其中，特征抽取模块300包括第一抽取单元、第二抽取单元、第三抽取单元和第四抽取单元。Wherein, the feature extraction module 300 includes a first extraction unit, a second extraction unit, a third extraction unit and a fourth extraction unit.

判断模块400，用于判断所述特征相似度数值是否在预设的相似度数值参考范围内。A judging module 400, configured to judge whether the feature similarity value is within a preset similarity value reference range.

识别模块500，用于在判断结果为是时，识别出所述第一用户账户及所述第二用户账户属于同一用户。The identification module 500 is configured to identify that the first user account and the second user account belong to the same user when the determination result is yes.

相应于实施例二，参考图5，所述系统还包括用于分别对所述分词形式的第一消息段及分词形式的第二消息段进行过滤处理的过滤模块600，该模块包括第一过滤单元和第二过滤单元。Corresponding to Embodiment 2, with reference to FIG. 5, the system further includes a filtering module 600 for filtering the first message segment in the word segmentation form and the second message segment in the word segmentation form respectively, and the module includes a first filtering unit and a second filter unit.

相应于实施例三，参考图6，所述系统还包括预处理模块700，该模块用于预先利用设定个数的消息段样本对，并基于每个消息段样本对的特征相似度对最大熵分类方法进行跨平台用户识别训练，得到最大熵分类器，以实现采用所述最大熵分类器识别第一平台上第一用户账户与第二平台上第二用户账户是否属于同一用户，其中：Corresponding to Embodiment 3, referring to FIG. 6, the system further includes a preprocessing module 700, which is used to pre-use a set number of message segment sample pairs, and based on the feature similarity of each message segment sample pair to the maximum The entropy classification method conducts cross-platform user identification training to obtain a maximum entropy classifier, so as to realize whether the first user account on the first platform and the second user account on the second platform belong to the same user using the maximum entropy classifier, wherein:

Because the cross-platform user identification system disclosed in Embodiment 4 corresponds to the cross-platform user identification methods disclosed in Embodiments 1 to 3, it is described only briefly; for the related details, refer to the description of the methods in Embodiments 1 to 3, which is not repeated here.

综上所述，本发明充分考虑社交平台中用户消息的重要性，通过相应时间段内不同平台的两个账户中用户消息所反映的用户见闻、兴趣、偏好以及写作风格、用词习惯等个性化信息的相似情况，来识别用户是否为同一用户，并通过预先构建最大熵分类器来提高跨平台用户识别的准确率，解决了不同社交平台同一用户的识别问题，为同一用户的跨平台数据分析提供了支持。In summary, the present invention fully considers the importance of user messages on social platforms, and uses personalities such as user knowledge, interests, preferences, writing styles, and word habits reflected in user messages on two accounts on different platforms within a corresponding period of time. The similarity of information is used to identify whether the users are the same user, and the accuracy of cross-platform user identification is improved by pre-constructing the maximum entropy classifier, which solves the problem of identifying the same user on different social platforms, and provides cross-platform data for the same user Analysis provides support.

It should be noted that the embodiments in this specification are described progressively: each embodiment focuses on how it differs from the others, and for parts that are the same or similar, the embodiments may be referred to one another.

为了描述的方便，描述以上系统时以功能分为各种模块或单元分别描述。当然，在实施本申请时可以把各单元的功能在同一个或多个软件和/或硬件中实现。For the convenience of description, when describing the above system, functions are divided into various modules or units and described separately. Of course, when implementing the present application, the functions of each unit can be implemented in one or more pieces of software and/or hardware.

最后，还需要说明的是，在本文中，诸如第一、第二、第三和第四等之类的关系术语仅仅用来将一个实体或者操作与另一个实体或操作区分开来，而不一定要求或者暗示这些实体或操作之间存在任何这种实际的关系或者顺序。而且，术语“包括”、“包含”或者其任何其他变体意在涵盖非排他性的包含，从而使得包括一系列要素的过程、方法、物品或者设备不仅包括那些要素，而且还包括没有明确列出的其他要素，或者是还包括为这种过程、方法、物品或者设备所固有的要素。在没有更多限制的情况下，由语句“包括一个……”限定的要素，并不排除在包括所述要素的过程、方法、物品或者设备中还存在另外的相同要素。Finally, it should also be noted that in this text, relational terms such as first, second, third, and fourth, etc. are only used to distinguish one entity or operation from another entity or operation, and not Any such actual relationship or order between these entities or operations is necessarily required or implied. Furthermore, the term "comprises", "comprises" or any other variation thereof is intended to cover a non-exclusive inclusion such that a process, method, article, or apparatus comprising a set of elements includes not only those elements, but also includes elements not expressly listed. other elements of or also include elements inherent in such a process, method, article, or device. Without further limitations, an element defined by the phrase "comprising a ..." does not exclude the presence of additional identical elements in the process, method, article or apparatus comprising said element.

The above is only a preferred embodiment of the present invention. It should be noted that a person of ordinary skill in the art can make further improvements and refinements without departing from the principle of the present invention, and these improvements and refinements shall also be regarded as falling within the scope of protection of the present invention.

Claims


1.一种跨平台用户识别方法，其特征在于，包括：1. A cross-platform user identification method, characterized in that, comprising:

2. The method according to claim 1, wherein performing feature extraction on the word-segmented first message segment and the word-segmented second message segment based on the preset word segmentation features, and obtaining the feature similarity value of the first message segment and the second message segment on the basis of the feature extraction, comprises:

3. The method according to claim 1, further comprising, before performing feature extraction on the word-segmented first message segment and the word-segmented second message segment: filtering the word-segmented first message segment and the word-segmented second message segment respectively, the filtering comprising:

4.根据权利要去1所述的方法，其特征在于，还包括：4. The method according to claim 1, further comprising:

5. The method according to claim 2, wherein the word distribution similarity value of the first message segment and the second message segment is obtained by calculating their relative entropy D(p||q);

6. The method according to claim 2, wherein the document topic generation model LDA is used to extract the latent topics of the word-segmented first message segment and the word-segmented second message segment.

7.一种跨平台用户识别系统，其特征在于，包括：7. A cross-platform user identification system, characterized in that it comprises:

8.根据权利要去7所述的系统，其特征在于，所述特征抽取模块包括：8. The system according to claim 7, wherein the feature extraction module includes:

9.根据权利要去7所述的系统，其特征在于，还包括：用于分别对所述分词形式的第一消息段及分词形式的第二消息段进行过滤处理的过滤模块，所述过滤模块包括：9. The system according to claim 7, further comprising: a filter module for filtering the first message segment of the word segmentation form and the second message segment of the word segmentation form respectively, the filtering Modules include:

10.根据权利要去7所述的系统，其特征在于，还包括：10. The system according to claim 7, further comprising: