CN109284384A

Info

Publication number: CN109284384A
Application number: CN201811180356.2A
Authority: CN
Inventors: 龚建
Original assignee: Lazas Network Technology Shanghai Co Ltd
Current assignee: Lazas Network Technology Shanghai Co Ltd
Priority date: 2018-10-10
Filing date: 2018-10-10
Publication date: 2019-01-29
Anticipated expiration: 2038-10-10
Also published as: CN109284384B

Abstract

Translated from Chinese

Embodiments of the present disclosure disclose a text analysis method, an apparatus, an electronic device, and a readable storage medium. The method includes: matching texts against preset keywords to obtain matching texts that match the preset keywords and non-matching texts that do not, wherein the preset keywords belong to multiple target categories, and a matching text that matches a specific preset keyword belongs to the same target category as that keyword; performing word segmentation on each sub-text and generating a text vector for each text according to the word segmentation result; calculating a target text vector for each target category from the text vectors of all matching texts belonging to that category; and calculating the similarity between the text vector of each non-matching text and the target text vectors to determine the category to which the non-matching text belongs. The method can faithfully reflect the opinion tendency of a text and improve the accuracy of text analysis.

Description

Text analysis method, apparatus, electronic device, and readable storage medium

Technical field

The present disclosure relates to the field of computers, and in particular to a text analysis method, an apparatus, an electronic device, and a readable storage medium.

Background

Internet platforms carry a large amount of text, such as user posts and comments. Because these texts are written in natural language, it is difficult to tell which specific topics they lean toward. If big-data analysis could be performed on such user comments, finding texts with specific characteristics would be of great value for discovering users' needs and concerns and for carrying out targeted operations.

Summary of the invention

To solve the problems in the related art, embodiments of the present disclosure provide a text analysis method, an apparatus, an electronic device, and a readable storage medium.

In a first aspect, an embodiment of the present disclosure provides a text analysis method, including:

matching texts against preset keywords to obtain matching texts that match the preset keywords and non-matching texts that do not match the preset keywords, wherein the preset keywords belong to multiple target categories, and a matching text that matches a specific preset keyword belongs to the same target category as that keyword;

performing word segmentation on each sub-text, and generating a text vector for each text according to the word segmentation result;

calculating a target text vector for each target category from the text vectors of all matching texts belonging to that target category; and

calculating the similarity between the text vector of each non-matching text and the target text vectors to determine the category to which the non-matching text belongs.

With reference to the first aspect, in a first implementation of the first aspect, performing word segmentation on each sub-text and generating a text vector for each text according to the word segmentation result includes:

increasing, according to a preset rule, the vector value of each segmented word in a matching text that is identical to a preset keyword.

With reference to the first aspect, in a second implementation of the first aspect, the text vector is a term frequency-inverse document frequency (tf-idf) vector.

With reference to the first aspect, in a third implementation of the first aspect, calculating the target text vector for each target category from the text vectors of all matching texts belonging to that target category includes:

calculating the target text vector of each target category by summing the text vectors of all matching texts belonging to that target category and taking the average.

With reference to the third implementation of the first aspect, in a fourth implementation of the first aspect, calculating the similarity between the text vector of each non-matching text and the target text vectors to determine the category to which the non-matching text belongs includes:

calculating the similarity between the text vector of each non-matching text and each target text vector as a first similarity, and taking the target category of the target text vector with the largest first similarity as the candidate category of that non-matching text.

With reference to the fourth implementation of the first aspect, in a fifth implementation of the first aspect, the method further includes:

calculating the similarity between the text vector of each non-matching text and the average text vector of all non-matching texts as a second similarity;

detecting whether the ratio of the first similarity to the second similarity is greater than a preset threshold; and

in response to detecting that the ratio of the first similarity to the second similarity is greater than the preset threshold, taking the candidate category as the category to which the non-matching text belongs.

In a second aspect, an embodiment of the present disclosure provides a text analysis apparatus, including:

a matching module, configured to match texts against preset keywords to obtain matching texts that match the preset keywords and non-matching texts that do not match the preset keywords, wherein the preset keywords belong to multiple target categories, and a matching text that matches a specific preset keyword belongs to the same target category as that keyword;

a word segmentation module, configured to perform word segmentation on each sub-text and generate a text vector for each text according to the word segmentation result;

a first calculation module, configured to calculate a target text vector for each target category from the text vectors of all matching texts belonging to that target category; and

a second calculation module, configured to calculate the similarity between the text vector of each non-matching text and the target text vectors to determine the category to which the non-matching text belongs.

In a third aspect, an embodiment of the present disclosure provides an electronic device, including a memory and a processor, wherein

the memory is configured to store one or more computer instructions, and the one or more computer instructions are executed by the processor to implement the following steps:

With reference to the third aspect, in a first implementation of the third aspect, performing word segmentation on each sub-text and generating a text vector for each text according to the word segmentation result includes:

With reference to the third aspect, in a second implementation of the third aspect, the text vector is a term frequency-inverse document frequency (tf-idf) vector.

With reference to the third aspect, in a third implementation of the third aspect, calculating the target text vector for each target category from the text vectors of all matching texts belonging to that target category includes:

With reference to the third implementation of the third aspect, in a fourth implementation of the third aspect, calculating the similarity between the text vector of each non-matching text and the target text vectors to determine the category to which the non-matching text belongs includes:

With reference to the fourth implementation of the third aspect, in a fifth implementation of the third aspect, the method further includes:

In a fourth aspect, an embodiment of the present disclosure provides a readable storage medium storing computer instructions which, when executed by a processor, implement the method according to the first aspect or any one of the first to fifth implementations of the first aspect.

The technical solutions provided by the embodiments of the present disclosure may have the following beneficial effects:

According to the technical solutions provided by the embodiments of the present disclosure, texts are matched against preset keywords to obtain matching texts that match the preset keywords and non-matching texts that do not, wherein the preset keywords belong to multiple target categories, and a matching text that matches a specific preset keyword belongs to the same target category as that keyword; word segmentation is performed on each sub-text, and a text vector is generated for each text according to the word segmentation result; a target text vector is calculated for each target category from the text vectors of all matching texts belonging to that category; and the similarity between the text vector of each non-matching text and the target text vectors is calculated to determine the category to which the non-matching text belongs. In this way, non-matching texts that resemble the texts of a target category can be classified even when they contain no obvious preset keyword. The opinion tendency of a text without preset keywords can therefore still be identified, which avoids the defect of pure keyword matching, namely that the semantic information of the text cannot be obtained and misjudgments result. Moreover, the text analysis solution according to the embodiments of the present disclosure can faithfully reflect the opinion tendency of a text and improve the accuracy of text analysis.

It should be understood that the foregoing general description and the following detailed description are exemplary and explanatory only and do not limit the present disclosure.

Description of the drawings

Other features, objects, and advantages of the present disclosure will become more apparent from the following detailed description of non-limiting embodiments, taken in conjunction with the accompanying drawings. In the drawings:

FIG. 1 shows a flowchart of a text analysis method according to an embodiment of the present disclosure;

FIG. 2 shows a flowchart of a text analysis method according to another embodiment of the present disclosure;

FIG. 3 shows a structural block diagram of a text analysis apparatus according to an embodiment of the present disclosure;

FIG. 4 shows a schematic diagram of an example application scenario of a text analysis method according to an embodiment of the present disclosure;

FIG. 5 shows a structural block diagram of an electronic device according to an embodiment of the present disclosure;

FIG. 6 is a schematic structural diagram of a computer system suitable for implementing a text analysis method according to an embodiment of the present disclosure.

Detailed description

Hereinafter, exemplary embodiments of the present disclosure are described in detail with reference to the accompanying drawings so that those skilled in the art can readily implement them. For clarity, parts unrelated to the description of the exemplary embodiments are omitted from the drawings.

In the present disclosure, it should be understood that terms such as "comprising" or "having" are intended to indicate the presence of the features, numbers, steps, acts, components, parts, or combinations thereof disclosed in this specification, and are not intended to exclude the possibility that one or more other features, numbers, steps, acts, components, parts, or combinations thereof exist or are added.

It should also be noted that, where no conflict arises, the embodiments of the present disclosure and the features of the embodiments may be combined with one another. The present disclosure is described in detail below with reference to the accompanying drawings and in conjunction with the embodiments.

FIG. 1 shows a flowchart of a text analysis method according to an embodiment of the present disclosure. As shown in FIG. 1, the text analysis method includes the following steps S101-S104:

In step S101, texts are matched against preset keywords to obtain matching texts that match the preset keywords and non-matching texts that do not match the preset keywords, wherein the preset keywords belong to multiple target categories, and a matching text that matches a specific preset keyword belongs to the same target category as that keyword.

In step S102, word segmentation is performed on each sub-text, and a text vector is generated for each text according to the word segmentation result.

In step S103, a target text vector is calculated for each target category from the text vectors of all matching texts belonging to that target category.

In step S104, the similarity between the text vector of each non-matching text and the target text vectors is calculated to determine the category to which the non-matching text belongs.

In one embodiment of the present disclosure, the text to be analyzed may include multiple pieces of text. For example, a text consisting of multiple online comments can be analyzed, with each comment treated as one piece of text.

The following uses user reviews on a catering O2O platform as an example to illustrate how texts are matched against preset keywords to obtain matching texts and non-matching texts. The following table shows how the catering O2O platform classifies review keywords.

As shown in the table above, the catering O2O platform divides the keywords of negative reviews into 9 categories. All 9 categories fall under the first-level category "negative review". The second-level categories are merchant, logistics, and platform, and the third-level categories are 3 subcategories within each second-level category. For the categories that the O2O platform needs to focus on, some typical keywords are selected for matching against user reviews. There are therefore 9 negative-review keyword categories in total. For example, the third-level keywords for the merchant category include "taste (inauthentic, unpalatable): too salty (太咸), too spicy (太辣), bland (没味), overcooked (太老)…".

In addition, texts that do not match any preset keyword are non-matching texts and form a separate category, namely the non-matching category. In this example, there can therefore be 10 categories in total, including the non-matching category. Those skilled in the art will understand that the 9 target categories above are only an example, and the actual number of target categories may be more or fewer than 9.
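The keyword-matching step described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the category names, the logistics keywords, the substring test, and the "first matching category wins" rule are all assumptions made for the example. Only the four taste keywords come from the text.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical keyword table: category name -> preset keywords.
# Only the "merchant/taste" keywords appear in the source; the rest is invented.
KEYWORDS: Dict[str, List[str]] = {
    "merchant/taste": ["太咸", "太辣", "没味", "太老"],
    "logistics/slow": ["太慢", "超时"],
}


def match_category(text: str, keywords: Dict[str, List[str]]) -> Optional[str]:
    """Return the first category with a keyword occurring in `text`, else None."""
    for category, words in keywords.items():
        if any(word in text for word in words):
            return category
    return None


def split_by_keywords(
    texts: List[str], keywords: Dict[str, List[str]]
) -> Tuple[Dict[str, List[str]], List[str]]:
    """Split texts into per-category matching texts and non-matching texts."""
    matched: Dict[str, List[str]] = {category: [] for category in keywords}
    unmatched: List[str] = []
    for text in texts:
        category = match_category(text, keywords)
        if category is None:
            unmatched.append(text)
        else:
            matched[category].append(text)
    return matched, unmatched


if __name__ == "__main__":
    reviews = ["这家店的味道太咸了", "送得太慢", "还不错"]
    matched, unmatched = split_by_keywords(reviews, KEYWORDS)
    print(matched)
    print(unmatched)
```

A text that contains keywords from several categories is assigned here to the first one found; the source does not say how such ties should be handled.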

In one embodiment of the present disclosure, word segmentation may be performed on each sub-text, and a text vector may be generated for each text according to the word segmentation result. For example, all user review data of an online commerce platform within a period of time (for example, the last 30 days) may be obtained. Each review is treated as an independent text, and all reviews by all users within that period (that is, all texts) form one corpus.

For example, a word segmentation algorithm can be used to separate all the words in each text. A text such as "这家店的味道太咸了" ("The food at this restaurant is too salty") becomes, after word segmentation:

"这家 店 的 味道 太咸 了".

In one embodiment of the present disclosure, a piece of text can be analyzed by segmenting it into words and generating a text vector for it according to the word segmentation result. In one embodiment of the present disclosure, the text vector is a term frequency-inverse document frequency (tf-idf) vector. Tf-idf is a statistical method for evaluating how important a word is to one document within a document set or corpus.

Taking the reviews of the aforementioned online commerce platform as an example again, the number of distinct segmented words across all reviews (that is, all texts) is finite; denote this total by V. Each review by each user can then be represented as a V-dimensional vector in which each dimension is the tf-idf value of the corresponding segmented word. For the segmented text "这家 店 的 味道 太咸 了" above, the text vector is:

[tf-idf("这家"), tf-idf("店"), tf-idf("的"), tf-idf("味道"), tf-idf("太咸"), tf-idf("了"), …].

In one embodiment of the present disclosure, the segmented word "这家" is used as an example to illustrate the formula for the tf-idf value of each segmented word:

tf-idf("这家") = (number of occurrences of "这家" in this text / total number of word occurrences in this text) * log(total number of texts / (number of texts containing "这家" + 1))

where log is the natural logarithm (base e).
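The formula above can be sketched in a few lines of Python. This is an illustrative sketch under assumptions: the function name `tfidf_vectors` is hypothetical, the input is assumed to be already segmented into words, and each vector is stored as a sparse dictionary (word to value) rather than as a dense V-dimensional array, with absent words implicitly zero.

```python
import math
from collections import Counter
from typing import Dict, List


def tfidf_vectors(docs: List[List[str]]) -> List[Dict[str, float]]:
    """Compute tf-idf per the formula above:
    tf = count in text / words in text; idf = ln(N / (texts containing word + 1)).
    """
    n_docs = len(docs)

    # Document frequency: number of texts in which each word occurs at least once.
    doc_freq: Counter = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))

    vectors: List[Dict[str, float]] = []
    for tokens in docs:
        counts = Counter(tokens)
        total = len(tokens)
        vectors.append(
            {
                word: (count / total) * math.log(n_docs / (doc_freq[word] + 1))
                for word, count in counts.items()
            }
        )
    return vectors


if __name__ == "__main__":
    corpus = [
        ["这家", "店", "的", "味道", "太咸", "了"],
        ["味道", "不错"],
        ["送餐", "太慢", "了"],
        ["店", "不错"],
    ]
    for vector in tfidf_vectors(corpus):
        print(vector)
```

Note that with the "+1" in the denominator, a word that occurs in all but one text gets an idf of zero, and a word that occurs in every text gets a slightly negative idf. This follows directly from the formula as stated.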

Those skilled in the art will understand that the above way of determining the vector value of a segmented word in a text is only an example. Following the teachings of the embodiments of the present disclosure, the vector value of a segmented word in a text, such as its tf-idf value, can be determined in various ways.

To emphasize the influence of the category keywords during text analysis, the tf-idf value of any segmented word that appears in the keyword table is increased (for example, multiplied by 2). For example, consider the text vector:

[tf-idf("这家"), tf-idf("店"), tf-idf("的"), tf-idf("味道"), tf-idf("太咸"), tf-idf("了"), …]

After the text of this review is matched against the table above, "太咸" (too salty) matches the keyword "太咸" in the third-level merchant category, so the tf-idf value of "太咸" in this review can be increased, for example multiplied by 2. Those skilled in the art will understand that multiplying by 2 is only an example; the value may instead be increased by a preset amount or multiplied by some other factor. Increasing the vector values of the segmented words that match the keywords of a target category makes the text vector reflect the tendency toward that category more strongly during analysis.
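A minimal sketch of this keyword weighting, assuming the sparse dictionary representation of a text vector (word to tf-idf value) and a hypothetical helper name `boost_keywords`. The factor of 2 is the example value from the text; any other preset rule could be substituted.

```python
from typing import Dict, Set


def boost_keywords(
    vector: Dict[str, float], keywords: Set[str], factor: float = 2.0
) -> Dict[str, float]:
    """Return a copy of `vector` with the values of keyword entries multiplied by `factor`."""
    return {
        word: (value * factor if word in keywords else value)
        for word, value in vector.items()
    }


if __name__ == "__main__":
    review_vector = {"味道": 0.05, "太咸": 0.1}
    print(boost_keywords(review_vector, {"太咸"}))
```

The function returns a new dictionary instead of modifying its input, so the unweighted vector remains available for comparison.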

In one embodiment of the present disclosure, step S102 includes: increasing, according to a preset rule, the vector value of each segmented word in a matching text that is identical to a preset keyword.

In one embodiment of the present disclosure, matching texts that belong to the same target category contain preset keywords of that category, and therefore belong to the same category as those keywords. However, the text vectors of matching texts within one target category may differ from one another. Therefore, when the target text vector of each of the multiple target categories is determined, it can be calculated from the text vectors of all matching texts belonging to that category.

In one embodiment of the present disclosure, step S103 includes: calculating the target text vector of each target category by summing the text vectors of all matching texts belonging to that target category and taking the average.
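The averaging in step S103 can be sketched as below. The helper name `mean_vector` and the sparse dictionary representation are assumptions for illustration. A word missing from a text's dictionary is treated as having value zero in that text, which matches averaging dense V-dimensional vectors.

```python
from collections import defaultdict
from typing import Dict, List


def mean_vector(vectors: List[Dict[str, float]]) -> Dict[str, float]:
    """Element-wise mean of sparse vectors; missing entries count as zero."""
    if not vectors:
        raise ValueError("cannot average an empty list of vectors")
    sums: Dict[str, float] = defaultdict(float)
    for vector in vectors:
        for word, value in vector.items():
            sums[word] += value
    n = len(vectors)
    return {word: total / n for word, total in sums.items()}


if __name__ == "__main__":
    # Two matching texts assumed to belong to the same target category.
    category_vectors = [{"太咸": 0.4, "味道": 0.1}, {"太咸": 0.2}]
    print(mean_vector(category_vectors))
```

Calling this once per target category, on the vectors of that category's matching texts, yields one target text vector per category.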

In one embodiment of the present disclosure, the semantic information of a non-matching text cannot be obtained from the preset keywords, which may lead to misjudging the non-matching text. To solve this problem, the similarity between the text vector of each non-matching text and the target text vectors can be calculated to determine the category to which the non-matching text belongs. For example, the similarity between the text vector of a non-matching text and the target text vectors of multiple categories is calculated; when the similarity between the text vector of the non-matching text and the target text vector of a specific category is greater than a certain threshold, the non-matching text can be considered to belong to that category.

According to the embodiments of the present disclosure, texts are matched against preset keywords to obtain matching texts that match the preset keywords and non-matching texts that do not, wherein the preset keywords belong to multiple target categories, and a matching text that matches a specific preset keyword belongs to the same target category as that keyword; word segmentation is performed on each sub-text, and a text vector is generated for each text according to the word segmentation result; a target text vector is calculated for each target category from the text vectors of all matching texts belonging to that category; and the similarity between the text vector of each non-matching text and the target text vectors is calculated to determine the category to which the non-matching text belongs. In this way, non-matching texts that resemble the texts of a target category can be classified even when they contain no obvious preset keyword. The opinion tendency of a text without preset keywords can therefore still be identified, which avoids the defect of pure keyword matching, namely that the semantic information of the text cannot be obtained and misjudgments result. Moreover, the text analysis solution according to the embodiments of the present disclosure can faithfully reflect the opinion tendency of a text and improve the accuracy of text analysis.

FIG. 2 shows a flowchart of a text analysis method according to another embodiment of the present disclosure. As shown in FIG. 2, this embodiment differs from the one shown in FIG. 1 in that step S104 includes step S201.

In step S201, the similarity between the text vector of each non-matching text and each target text vector is calculated as a first similarity, and the target category of the target text vector with the largest first similarity is taken as the candidate category of that non-matching text.

In one embodiment of the present disclosure, the target category of the target text vector with the largest first similarity is treated only as a candidate category because the first similarity alone may not be sufficient to determine the category to which the non-matching text belongs.

In another embodiment of the present disclosure, the text analysis method shown in FIG. 2 may further include steps S202, S203, and S204.

In step S202, the similarity between the text vector of each non-matching text and the average text vector of all non-matching texts is calculated as a second similarity.

In step S203, it is detected whether the ratio of the first similarity to the second similarity is greater than a preset threshold.

In step S204, in response to detecting that the ratio of the first similarity to the second similarity is greater than the preset threshold, the candidate category is taken as the category to which the non-matching text belongs.

In one embodiment of the present disclosure, for each non-matching text, the cosine similarity between its text vector and the target text vector of each category (for example, the 9 categories in the table above) is calculated, and the category with the largest similarity is recorded as the candidate category, together with the corresponding first similarity a. The cosine similarity between the text vector of this text and the overall average vector of all non-matching texts is then calculated as the second similarity b. When a/b > C, the candidate category is taken as the target category of the text. When a/b ≤ C, the text remains a non-matching text, which may also be called an unclassified text. C is an adjustable threshold: when more texts should be assigned to target categories, C can be lowered; otherwise C can be raised.
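The decision rule with the two similarities a and b can be sketched as follows. This is a minimal illustration under assumptions: vectors are sparse dictionaries, the names `cosine` and `classify` are hypothetical, and the handling of b ≤ 0 (where the ratio a/b is undefined or changes sign) is a choice made for this example that the source does not specify.

```python
import math
from typing import Dict, Optional

Vector = Dict[str, float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity of two sparse vectors; 0.0 if either has zero norm."""
    dot = sum(value * v.get(word, 0.0) for word, value in u.items())
    norm_u = math.sqrt(sum(value * value for value in u.values()))
    norm_v = math.sqrt(sum(value * value for value in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def classify(
    vec: Vector, targets: Dict[str, Vector], background: Vector, threshold: float
) -> Optional[str]:
    """Return the candidate category if a / b > threshold, else None (unclassified).

    a: largest cosine similarity between `vec` and any target text vector.
    b: cosine similarity between `vec` and the mean of all non-matching texts.
    """
    if not targets:
        return None
    candidate, a = max(
        ((category, cosine(vec, target)) for category, target in targets.items()),
        key=lambda pair: pair[1],
    )
    b = cosine(vec, background)
    if b <= 0.0:
        # Assumption (not in the source): with no positive background similarity,
        # accept the candidate only if the first similarity itself is positive.
        return candidate if a > 0.0 else None
    return candidate if a / b > threshold else None


if __name__ == "__main__":
    targets = {"taste": {"salty": 1.0}, "slow": {"late": 1.0}}
    background = {"salty": 1.0, "late": 1.0, "ok": 1.0}
    print(classify({"salty": 1.0}, targets, background, threshold=1.5))
```

Dividing by b normalizes the first similarity against how similar the text is to non-matching texts in general, so a text is only assigned to a category when it resembles that category noticeably more than it resembles the unclassified background.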

FIG. 4 shows a schematic diagram of an example application scenario of the text analysis method according to an embodiment of the present disclosure.

As shown in FIG. 4, in the catering O2O platform scenario, all user review data of the platform for the last 30 days can be obtained. Each review is treated as an independent text, and all user reviews from the 30 days form one corpus. With reference to the table above, each review is matched against the 9 categories of preset keywords to obtain 9 categories of target texts that match the preset keywords, plus the unmatched texts that match none of them. After every review is segmented into words, text vectors are built using the tf-idf method. Once the text vector of each user review has been built, the target text vector of each target category is calculated from the text vectors of all matching texts belonging to that category. The similarity between the text vector of each non-matching text and the target text vectors is then calculated to determine the category to which the non-matching text belongs. In other words, the opinion tendency of an unmatched text (review) can be judged on the basis of similarity.

FIG. 3 shows a structural block diagram of a text analysis apparatus according to an embodiment of the present disclosure.

As shown in FIG. 3, the text analysis apparatus includes a matching module 301, a word segmentation module 302, a first calculation module 303 and a second calculation module 304.

The matching module 301 is configured to match text against preset keywords to obtain matching text that matches the preset keywords and non-matching text that does not match the preset keywords, wherein the preset keywords belong to multiple target categories, and matching text that matches a specific preset keyword belongs to the same target category as that specific keyword.

The word segmentation module 302 is configured to perform word segmentation on each sub-text and to generate a text vector for each piece of text according to the word segmentation result.

The first calculation module 303 is configured to calculate the target text vector of each target category from the text vectors of all matching texts belonging to that target category.

The second calculation module 304 is configured to calculate the similarity between the text vector of each piece of non-matching text and the target text vectors to determine the category to which the non-matching text belongs.

The internal functions and structure of the text analysis apparatus are described above. In one possible design, the text analysis apparatus may be implemented as a text analysis device. As shown in FIG. 5, the device 500 may include a processor 501 and a memory 502.

The memory 502 is used to store a program that supports the text analysis apparatus in executing the text analysis method of any of the foregoing embodiments, and the processor 501 is configured to execute the program stored in the memory 502.

The memory 502 is used to store one or more computer instructions, wherein the one or more computer instructions are executed by the processor 501 to implement the following steps:

In another embodiment of the present disclosure, performing word segmentation on each sub-text and generating a text vector for each piece of text according to the word segmentation result includes:

In another embodiment of the present disclosure, the text vector is a term frequency-inverse document frequency (tf-idf) vector.

In another embodiment of the present disclosure, calculating the target text vector of each target category from the text vectors of all matching texts belonging to each target category includes:

In another embodiment of the present disclosure, calculating the similarity between the text vector of each piece of non-matching text and the target text vector to determine the category to which the non-matching text belongs includes:

In another embodiment of the present disclosure, the one or more computer instructions are further executed by the processor 501 to implement the following steps:

The processor 501 is configured to execute all or some of the foregoing method steps.

The structure of the text analysis device may further include a communication interface, used by the text analysis device to communicate with other devices or with a communication network.

Exemplary embodiments of the present disclosure also provide a computer storage medium for storing the computer software instructions used by the text analysis apparatus, including a program for executing the text analysis method of any of the foregoing embodiments.

As shown in FIG. 6, a computer system 600 includes a central processing unit (CPU) 601, which can execute the various processes of the embodiment shown in FIG. 1 according to a program stored in a read-only memory (ROM) 602 or a program loaded from a storage section 608 into a random access memory (RAM) 603. The RAM 603 also stores the various programs and data required for the operation of the system 600. The CPU 601, the ROM 602 and the RAM 603 are connected to one another through a bus 604. An input/output (I/O) interface 605 is also connected to the bus 604.

The following components are connected to the I/O interface 605: an input section 606 including a keyboard, a mouse and the like; an output section 607 including a display such as a cathode ray tube (CRT) or liquid crystal display (LCD), a speaker and the like; a storage section 608 including a hard disk and the like; and a communication section 609 including a network interface card such as a LAN card or a modem. The communication section 609 performs communication processing via a network such as the Internet. A drive 610 is also connected to the I/O interface 605 as needed. A removable medium 611, such as a magnetic disk, an optical disk, a magneto-optical disk or a semiconductor memory, is mounted on the drive 610 as needed, so that a computer program read from it can be installed into the storage section 608 as needed.

In particular, according to embodiments of the present disclosure, the method described above with reference to FIG. 1 may be implemented as a computer software program. For example, embodiments of the present disclosure include a computer program product comprising a computer program tangibly embodied on a machine-readable medium, the computer program containing program code for executing the method of FIG. 1. In such an embodiment, the computer program may be downloaded and installed from a network via the communication section 609, and/or installed from the removable medium 611.

The flowcharts and block diagrams in the figures illustrate the architecture, functionality and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in a flowchart or block diagram may represent a module, a program segment, or a portion of code, which contains one or more executable instructions for implementing the specified logical function. It should also be noted that, in some alternative implementations, the functions noted in the blocks may occur in an order different from that noted in the figures. For example, two blocks shown in succession may in fact be executed substantially in parallel, or they may sometimes be executed in the reverse order, depending on the functionality involved. It should also be noted that each block of the block diagrams and/or flowcharts, and combinations of blocks in the block diagrams and/or flowcharts, can be implemented by a dedicated hardware-based system that performs the specified functions or operations, or by a combination of dedicated hardware and computer instructions.

The units or modules described in the embodiments of the present disclosure may be implemented in software or in hardware. The described units or modules may also be provided in a processor, and in some cases the names of these units or modules do not constitute a limitation on the units or modules themselves.

In another aspect, the present disclosure also provides a computer-readable storage medium. The computer-readable storage medium may be the one included in the apparatus described in the foregoing embodiments, or it may be a standalone computer-readable storage medium that is not assembled into a device. The computer-readable storage medium stores one or more programs, which are used by one or more processors to perform the methods described in the present disclosure.

The above description covers only preferred embodiments of the present disclosure and an explanation of the technical principles employed. Those skilled in the art should understand that the scope of the invention involved in the present disclosure is not limited to technical solutions formed by the specific combination of the above technical features, and should also cover other technical solutions formed by any combination of the above technical features or their equivalents without departing from the inventive concept. One example is a technical solution formed by replacing the above features with technical features having similar functions that are disclosed in (but not limited to) the present disclosure.