CN107577690B

Info

Publication number: CN107577690B
Application number: CN201710346631.2A
Authority: CN
Inventors: 白鹤; 侯斌; 刘东海; 杨帆; 颜斯泰; 罗亚林; 王云福; 涂红兵; 戴伟琦
Original assignee: China General Nuclear Power Corp; China Nuclear Power Engineering Co Ltd; Shenzhen China Guangdong Nuclear Engineering Design Co Ltd
Current assignee: China General Nuclear Power Corp; China Nuclear Power Engineering Co Ltd; Shenzhen China Guangdong Nuclear Engineering Design Co Ltd
Priority date: 2017-05-17
Filing date: 2017-05-17
Publication date: 2021-01-05
Anticipated expiration: 2037-05-17
Also published as: CN107577690A

Abstract

The invention belongs to the technical field of information processing and provides a recommendation method and a recommendation device for massive information data. The recommendation method includes: obtaining metadata information from an enterprise content management (ECM) system; generating a metadata clustering template according to the metadata set sample space of the metadata information; obtaining the static attribute space of a user according to the user's related information; obtaining a corresponding static massive data template according to the user's static attribute space and the metadata clustering template; monitoring the user's behavior log and, according to the behavior log, obtaining the user's attention words within a preset time; forming a text index according to text analysis of the unstructured documents in the massive data; and searching for the content to be recommended according to the text index, the user's attention words within the preset time, and the static massive data template. The invention effectively solves the problem that users cannot obtain the information they need in a timely and effective manner.

Description

Translated from Chinese

Technical Field

The invention belongs to the technical field of information processing, and in particular relates to a recommendation method and a recommendation device for massive information data.

Background

The content data of nuclear power engineering enterprises is complex, and the number of documents is huge, reaching the millions. These include project engineering documents, technical documents, commercial contracts, correspondence, and technology-transfer materials for the various technical routes (such as the AP1000 and EPR third-generation nuclear power technologies). Because most of the technical data is stored in semi-structured form on an Enterprise Content Management (ECM) platform and the volume of information is enormous, technical personnel cannot obtain relevant knowledge updates in time.

It is therefore necessary to propose a new technical solution to the above problem.

Summary of the Invention

In view of this, embodiments of the present invention provide a recommendation method and a recommendation device for massive information data, aiming to solve the problem that users cannot obtain the information they need in a timely and effective manner.

A first aspect of the embodiments of the present invention provides a recommendation method for massive information data, the recommendation method including:

obtaining metadata information from an enterprise content management (ECM) system;

generating a metadata clustering template according to the metadata set sample space of the metadata information;

obtaining the static attribute space of a user according to the user's related information;

obtaining a corresponding static massive data template according to the user's static attribute space and the metadata clustering template;

monitoring the user's behavior log and, according to the behavior log, obtaining the user's attention words within a preset time;

forming a text index according to text analysis of the unstructured documents in the massive data;

searching for the content to be recommended according to the text index, the user's attention words within the preset time, and the static massive data template.

A second aspect of the embodiments of the present invention provides a recommendation device for massive information data, the recommendation device including:

a metadata information acquisition module, configured to obtain metadata information from an enterprise content management (ECM) system;

a metadata clustering template generation module, configured to generate a metadata clustering template according to the metadata set sample space of the metadata information;

a static attribute space acquisition module, configured to obtain the static attribute space of a user according to the user's related information;

a static massive data template acquisition module, configured to obtain a corresponding static massive data template according to the user's static attribute space and the metadata clustering template;

an attention word acquisition module, configured to monitor the user's behavior log and, according to the behavior log, obtain the user's attention words within a preset time;

a text index forming module, configured to form a text index according to text analysis of the unstructured documents in the massive data;

a recommended content search module, configured to search for the content to be recommended according to the text index, the user's attention words within the preset time, and the static massive data template.

Compared with the prior art, the embodiments of the present invention have the following beneficial effects. A corresponding static massive data template is obtained from the user's static attribute space and the metadata clustering template; the user's behavior log is monitored and the user's attention words within a preset time are obtained from it; and a text index is formed through text analysis of the unstructured documents in the massive data. The content to be recommended can then be found quickly according to the text index, the user's attention words within the preset time, and the static massive data template. The embodiments thus combine static information with dynamic data to push data knowledge to nuclear power professionals quickly, ensuring that they obtain precisely matched, useful information in a timely and effective manner.

Description of the Drawings

To illustrate the technical solutions in the embodiments of the present invention more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. The drawings described below are merely some embodiments of the present invention; a person of ordinary skill in the art can derive other drawings from them without creative effort.

Fig. 1 is a flow chart of the method for pushing massive information data provided by Embodiment 1 of the present invention;

Fig. 2 is a flow chart of the method for pushing massive information data provided by Embodiment 2 of the present invention;

Fig. 3 is a schematic diagram of the composition of the device for pushing massive information data provided by Embodiment 3 of the present invention;

Fig. 4 is a schematic diagram of the composition of the device for pushing massive information data provided by Embodiment 4 of the present invention.

Detailed Description

To make the objectives, technical solutions, and advantages of the present invention clearer, the invention is described in further detail below with reference to the drawings and embodiments. The specific embodiments described here serve only to explain the invention and do not limit it.

The invention implements a recommendation system for massive semi-structured nuclear power information. On the one hand, it uses the concept of a knowledge ontology to perform specialty-based cluster analysis on the structured metadata of technical information and, combined with the technical background and inductive preferences of nuclear power professionals, obtains a static massive data template within the hypothesis space through a massive-data learning and analysis algorithm. On the other hand, it forms a text index through text analysis of the unstructured documents in the massive data, combines the index with the dynamic needs of nuclear power professionals, and performs index retrieval within the static massive data template. This ultimately combines static information with dynamic data to recommend data knowledge to nuclear power professionals.

The invention implements a massive-data matching method between the static data (metadata and text) of massive semi-structured nuclear power technical documents and the needs of nuclear power professionals (static knowledge background and dynamic needs). It includes: configurable basic-information constraints on nuclear power technical documents, together with techniques for analyzing and identifying the background of nuclear power professionals; methods for building the structured metadata clustering template and the static massive data template; dynamic log capture and analysis combined with text analysis; a weighted ranking algorithm for text matching based on an inverted index; and a recommendation scheme for nuclear power professional knowledge that integrates static information with dynamic needs. These techniques meet the information dissemination and re-creation requirements of enterprise knowledge management and ensure that professionals obtain precisely matched, useful information in a timely and effective manner.

Embodiment 1:

Fig. 1 shows the implementation flow of the recommendation method for massive information data provided by Embodiment 1 of the present invention. The flow is detailed as follows:

Step S101: obtain metadata information from the enterprise content management (ECM) system.

In this embodiment, the ECM system may be the content management system of a nuclear power enterprise. It holds a large amount of enterprise content, including but not limited to metadata information, the text content of unstructured files, system access and retrieval logs, and personnel information.

Step S102: generate a metadata clustering template according to the metadata set sample space of the metadata information.

Specifically, the complex metadata structure is simplified to generate the metadata clustering template: a clustering method classifies the content represented by the structured metadata and extracts the core metadata structure.

Step S103: obtain the static attribute space of the user according to the user's related information.

Specifically, the static attribute space of each professional is derived from background information such as specialty, department, projects participated in, project phase, and position, and the static attribute space of each technician is recorded.

Step S104: obtain a corresponding static massive data template according to the user's static attribute space and the metadata clustering template.

Specifically, the nuclear power technical knowledge clusters obtained from the metadata clustering template in step S102 are combined with the professional background analysis data obtained in step S103 to produce the static massive data template.

Step S105: monitor the user's behavior log and, according to the behavior log, obtain the user's attention words within a preset time.

Specifically, the user's points of attention need to be analyzed. The analysis monitors and records the user's behavior log in time order and then mines the log data for the user's behavior and expectations.

First, the content the user has searched for, viewed, and followed, as recorded by the system, is collected. Next, each search is decomposed into several keywords, and the frequency and count of each content unit of interest are recorded against a time factor (time order). The result is the set of the user's recent popular attention words.
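The keyword counting described in step S105 can be sketched as follows. This is a minimal illustration, not the patent's implementation: the log layout (timestamp plus keyword list), the 7-day window, the exponential half-life decay, and the name `recent_attention_words` are all assumptions introduced here; the source only states that frequency and count are recorded against a time factor.

```python
from collections import defaultdict
from datetime import datetime, timedelta

def recent_attention_words(log, now, window_days=7, half_life_days=3.0, top_n=5):
    """Rank keywords from a behavior log by time-decayed frequency.

    log: iterable of (timestamp, [keywords]) pairs, one per search or view.
    Entries outside the window are ignored; newer entries count more.
    The half-life decay is an assumed form of the "time factor".
    """
    window = timedelta(days=window_days)
    scores = defaultdict(float)  # time-weighted frequency per keyword
    counts = defaultdict(int)    # raw number of occurrences per keyword
    for ts, keywords in log:
        age = now - ts
        if age < timedelta(0) or age > window:
            continue
        decay = 0.5 ** (age.total_seconds() / 86400.0 / half_life_days)
        for kw in keywords:
            scores[kw] += decay
            counts[kw] += 1
    # Highest score first; ties broken alphabetically for determinism.
    ranked = sorted(scores, key=lambda kw: (-scores[kw], kw))
    return [(kw, counts[kw], scores[kw]) for kw in ranked[:top_n]]
```

Each returned tuple carries the keyword, its raw count, and its decayed score, so a later ranking stage can use either the count or the recency-weighted value.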

Step S106: form a text index according to text analysis of the unstructured documents in the massive data.

Specifically, information is first obtained from the text collection. The text is analyzed and preprocessed against a nuclear power dictionary: its words are screened and identified, and useless words are removed according to a stop-word list. Feature extraction then weights and ranks the words of the collection according to each word's frequency in the collection and the ratio of the word's occurrences across the texts of the collection to the number of texts; words found in the dictionary receive higher weights. A number of feature words are selected in ranked order to form the feature vector, the massive text is indexed with a MapReduce algorithm, and a feature result and a summary are produced for each document.
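The indexing in step S106 can be illustrated with a small single-process sketch. Several things here are assumptions rather than the patent's method: a plain loop stands in for the MapReduce job, a regular-expression tokenizer stands in for dictionary-based word segmentation (which Chinese text would need), the weight is a common TF-IDF variant, and the names `build_index`, `domain_terms`, and `boost` are hypothetical.

```python
import math
import re
from collections import Counter, defaultdict

def build_index(docs, stop_words=frozenset(), domain_terms=frozenset(),
                boost=2.0, top_k=5):
    """Build an inverted index with TF-IDF-style weights.

    docs: dict of doc_id -> text.
    Returns (index, features): index maps term -> {doc_id: weight};
    features maps doc_id -> its top_k highest-weighted terms.
    Terms in domain_terms (lowercase) get their weight multiplied by boost.
    """
    term_counts = {}
    doc_freq = Counter()
    for doc_id, text in docs.items():
        tokens = [t for t in re.findall(r"\w+", text.lower())
                  if t not in stop_words]
        term_counts[doc_id] = Counter(tokens)
        doc_freq.update(term_counts[doc_id].keys())  # one count per document

    n_docs = len(docs)
    index = defaultdict(dict)
    features = {}
    for doc_id, counts in term_counts.items():
        total = sum(counts.values()) or 1
        weights = {}
        for term, count in counts.items():
            idf = math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0
            weight = (count / total) * idf
            if term in domain_terms:
                weight *= boost  # dictionary words rank higher
            weights[term] = weight
            index[term][doc_id] = weight
        features[doc_id] = sorted(weights, key=lambda t: (-weights[t], t))[:top_k]
    return dict(index), features
```

In a distributed setting the first loop corresponds roughly to a map phase (per-document term counts) and the merge into `index` to a reduce phase keyed by term.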

Step S107: search for the content to be recommended according to the text index, the user's attention words within the preset time, and the static massive data template.

Specifically, dynamic index retrieval builds on the sample space produced by the static data space model algorithm and on the index of the unstructured text; the knowledge information finally recommended is selected by ranking the index results.

The embodiment thus combines static information with dynamic data to push data knowledge to nuclear power professionals quickly, ensuring that they obtain precisely matched, useful information in a timely and effective manner.

Embodiment 2:

Fig. 2 shows the implementation flow of the recommendation method for massive information data provided by Embodiment 2 of the present invention. The flow is detailed as follows:

Step S201: obtain metadata information from the enterprise content management (ECM) system.

This step is the same as step S101; see the description of step S101 for details, which are not repeated here.

Step S202: generate a metadata clustering template according to the metadata set sample space of the metadata information.

This step is the same as step S102; see the description of step S102 for details, which are not repeated here.

Optionally, generating the metadata clustering template according to the metadata set sample space of the metadata information includes:

Step 1: arbitrarily select K objects from the metadata set sample space as the initial cluster centers, where K is an integer greater than zero and each cluster object corresponds to one class of technical documents;

Step 2: calculate the similarity between every object in the metadata set sample space and the K cluster centers, and assign each object to the cluster with which it has the highest similarity;

Step 3: recalculate the center of each cluster from the objects in that cluster, giving K recalculated cluster centers;

Step 4: if any of the K recalculated cluster centers has changed, recalculate the similarity between every object and the K recalculated cluster centers, and assign each object to the cluster with which it has the highest similarity, forming new cluster objects;

Step 5: repeat steps 3 and 4 until the K cluster centers no longer change; these K cluster centers form the metadata clustering template.

The metadata attribute set space can be assembled from independent attribute sets of multiple dimensions. Within the metadata set sample space, K objects are arbitrarily chosen as the initial cluster centers (K may be taken as greater than or equal to the total number of professional specialties). The similarity between each object and the K cluster centers is computed, each object is assigned to the most similar cluster, and a new mean (center) is computed for the objects in each cluster. The similarity between each object and the K new cluster centers is then computed, and each object is reassigned to the most similar cluster, forming new cluster objects. The cluster means are then updated again, that is, the mean of the objects in each cluster is recomputed, until the means no longer change, which finally yields the metadata clustering template.
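Steps 1 to 5 describe the standard K-means procedure, which can be sketched as below. The sketch assumes that each metadata object has already been encoded as a numeric vector and that "similarity" means negative squared Euclidean distance; the source specifies neither, and the function name `kmeans` and the `seed` and `max_iter` parameters are introduced here for illustration.

```python
import random

def kmeans(points, k, seed=0, max_iter=100):
    """Cluster numeric vectors with K-means; returns (centers, labels)."""
    rng = random.Random(seed)
    # Step 1: arbitrarily pick K objects as the initial cluster centers.
    centers = [list(p) for p in rng.sample(points, k)]
    labels = []
    for _ in range(max_iter):
        # Steps 2 and 4: assign each object to the most similar (nearest) center.
        labels = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            for p in points
        ]
        # Step 3: recompute each center as the mean of its members.
        new_centers = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centers.append([sum(col) / len(members) for col in zip(*members)])
            else:
                new_centers.append(centers[c])  # keep the center of an empty cluster
        # Step 5: stop once no center changes.
        if new_centers == centers:
            break
        centers = new_centers
    return centers, labels
```

The returned centers play the role of the metadata clustering template, and `labels[i]` gives the document class of object `i`.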

It should be noted that the static massive data template contains multiple cluster objects. Each cluster object contains knowledge content with the same technical characteristics; that is, one cluster object is one class of technical documents.

Step S203: obtain the static attribute space of the user according to the user's related information.

This step is the same as step S103; see the description of step S103 for details, which are not repeated here.

Step S204: obtain a corresponding static massive data template according to the user's static attribute space and the metadata clustering template.

The user's static attribute space should correspond to the technical characteristic parameters described by the metadata clustering template. The intersection of the two sets of attribute parameters is taken, and the attribute weights are then adjusted according to actual business needs to form the static data model template.

Optionally, each user belongs to one technical interest group, and obtaining the corresponding static massive data template according to the user's static attribute space and the metadata clustering template includes:

calculating, from the attribute parameters in the user's static attribute space and the attribute parameters in the metadata clustering template, the matching relationship between each class of technical documents δ and each technical interest group μ

[formula image missing from the source text]

so as to obtain the static massive data template, where att_i is the i-th attribute parameter in the intersection of the attribute parameters in the user's static attribute space and the attribute parameters in the metadata clustering template, n is the number of attribute parameters in that intersection, Meta(att_i) is the attribute information of att_i in the metadata clustering template, Specialty(att_i) is the attribute information of att_i in the user's static attribute space, and the remaining symbol [image missing from the source text] is the weight of att_i.

For any document δ in the static sample space D of user μ, the static support V(μ,δ) is inversely related to the variance between the attribute information of att_i in the metadata clustering template and its attribute information in the user's static attribute space. This value is multiplied by the importance indicator of att_i [symbol image missing from the source text], that is, its weight. Summing the information over all attributes gives the static support.

The greater the support, the higher the group's level of interest. Documents can therefore be ranked by support to form an attention matrix for each specialty, for use by subsequent modules.
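Because the formula itself is not available in the text, the following sketch shows only one possible form consistent with the description: a weighted sum over the shared attributes in which each term shrinks as the template value and the user value diverge. The expression `w / (1 + diff²)`, the numeric encoding of attribute information, the default weight of 1.0, and the name `static_support` are all assumptions, not the patent's definition of V(μ,δ).

```python
def static_support(meta, specialty, weights):
    """Hypothetical static support between a document class and a user group.

    meta:      dict attribute -> numeric value from the clustering template
    specialty: dict attribute -> numeric value from the user's attribute space
    weights:   dict attribute -> importance weight (defaults to 1.0)
    """
    # Only attributes present on both sides take part (the intersection).
    shared = sorted(set(meta) & set(specialty))
    total = 0.0
    for att in shared:
        diff = meta[att] - specialty[att]
        # Assumed form: inversely related to the squared difference,
        # scaled by the attribute's weight.
        total += weights.get(att, 1.0) / (1.0 + diff * diff)
    return total
```

Evaluating this for every pair of document class and interest group, then sorting each group's row in descending order, would give a ranking of the kind the attention matrix calls for.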

In other respects this step is the same as step S104; see the description of step S104 for details, which are not repeated here.

Step S205: monitor the user's behavior log and, according to the behavior log, obtain the user's attention words within a preset time.

This step is the same as step S105; see the description of step S105 for details, which are not repeated here.

Step S206: form a text index according to text analysis of the unstructured documents in the massive data.

This step is the same as step S106; see the description of step S106 for details, which are not repeated here.

Step S207: search for the content to be recommended according to the text index, the user's attention words within the preset time, and the static massive data template.

Dynamic index retrieval builds on the sample space produced by the static data space model algorithm and on the index of the unstructured text; the knowledge information finally recommended is selected by ranking the index results.

Dynamic index retrieval analysis has two aspects: content support and time support.

Content support covers the sample space in the static massive data template. Every data item in that sample space has a corresponding support value, computed from the metadata of the nuclear power documents. Content support also covers the text index formed through text analysis of the unstructured documents in the massive data; this part is called full-text support and results from the full-text index of the documents.

Time support can be understood as freshness. From the document's perspective, the time at which a document was produced gives its document freshness. In addition, the knowledge content the user has viewed, searched for, downloaded, and followed, as monitored in step S205, is also time-dependent; this part is called attention freshness. Combining the two along the time dimension yields the content of the user's points of attention and the freshness of each.

Finally, the recommended content is computed from the user's latest points of attention and the index order of the sample space.

Optionally, searching for the content to be recommended according to the text index, the user's attention words within the preset time, and the static massive data template includes:

obtaining the frequency [formula image missing from the source text] with which the user's attention words within the preset time appear in the text index, where [symbol image missing from the source text] is the user's j-th attention word within the preset time;

calculating, from that frequency and V(μ,δ), the recommendation strength of each class of technical documents [formula image missing from the source text], where m is the number of the user's attention words within the preset time, [symbol image missing from the source text] is the attention time-freshness weight, [symbol image missing from the source text] is the attention frequency weight, and τ(δ) is the update-time parameter of document δ;

generating the recommended content as a list of the technical documents whose recommendation strength satisfies a preset condition.

The preset time may be a period set by the user, for example one week; it is not limited here. The preset condition may be a recommendation strength greater than a preset threshold, and the corresponding technical documents may be arranged in descending order of recommendation strength.
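The final filtering and ordering can be sketched as follows. The `recommend` function reflects what the text states (keep strengths above a threshold, sort in descending order). The `recommendation_strength` function is a purely hypothetical combination, since the actual formula is not available in the text: the product form, the half-life treatment of the document update time, and all parameter names are assumptions.

```python
def recommendation_strength(static_v, word_freqs, freshness_w, frequency_w,
                            doc_age_days, half_life_days=30.0):
    """Hypothetical score; NOT the patent's formula.

    static_v:     static support V for this document class and user group
    word_freqs:   index frequency of each of the m attention words
    freshness_w:  attention time-freshness weight per attention word
    frequency_w:  attention frequency weight per attention word
    doc_age_days: days since the document was last updated
    """
    attention = sum(fw * qw * f
                    for f, fw, qw in zip(word_freqs, freshness_w, frequency_w))
    doc_freshness = 0.5 ** (doc_age_days / half_life_days)  # assumed decay
    return static_v * attention * doc_freshness

def recommend(strengths, threshold):
    """Return (doc_class, strength) pairs above threshold, strongest first."""
    kept = [(doc, s) for doc, s in strengths.items() if s > threshold]
    return sorted(kept, key=lambda item: (-item[1], item[0]))
```

Keeping the scoring and the list generation separate means the threshold and ordering logic stay valid whatever scoring formula is actually used.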

Step S208: record the content found for recommendation and the static massive data template.

The operation process is recorded: on the one hand the static support vector results, and on the other hand the dynamic demand update process and the dynamic index information.

实施例三：Embodiment three:

图3示出了本发明实施例三提供的海量信息数据的推荐装置的组成示意图，为了便于说明，仅示出了与本发明实施例相关的部分，详述如下：FIG. 3 shows a schematic diagram of the composition of a recommending device for massive information data provided by Embodiment 3 of the present invention. For the convenience of description, only the part related to the embodiment of the present invention is shown, and the details are as follows:

元数据信息获取模块31，用于从企业内容管理系统ECM中获取元数据信息；The metadatainformation acquisition module 31 is used for acquiring metadata information from the enterprise content management system ECM;

元数据聚集模板生成模块32，用于根据所述元数据信息的元数据集样本空间，生成元数据聚类模板；A metadata clusteringtemplate generation module 32, configured to generate a metadata clustering template according to the metadata set sample space of the metadata information;

静态属性空间获取模块33，用于根据用户的相关信息，获取所述用户的静态属性空间；The static attributespace obtaining module 33 is used for obtaining the static attribute space of the user according to the relevant information of the user;

静态海量数据模板获取模块34，用于根据所述用户的静态属性空间和所述元数据聚类模板，获取相应的静态海量数据模板；A static massive datatemplate obtaining module 34, configured to obtain a corresponding static massive data template according to the static attribute space of the user and the metadata clustering template;

关注词获取模块35，用于监控所述用户的行为日志，并根据所述用户的行为日志，获取所述用户在预设时间内的关注词；A concernedword acquisition module 35, configured to monitor the behavior log of the user, and obtain the attention word of the user within a preset time according to the behavior log of the user;

文本索引形成模块36，用于根据海量数据非结构化文档的文本分析，形成文本索引；The textindex forming module 36 is used to form a text index according to the text analysis of unstructured documents of massive data;

推荐内容查找模块37，用于根据所述文本索引、所述用户在预设时间内的关注词以及所述静态海量数据模板，查找所要推荐的内容。The recommendedcontent search module 37 is configured to search for the content to be recommended according to the text index, the concerned words of the user within a preset time, and the static massive data template.

元数据信息获取模块31是海量信息数据的推荐装置与企业内容管理平台的接口模块，负责与核电企业内容管理系统ECM进行数据交互，其中主要包含的企业内容有：元数据信息、非结构化文件文本内容、系统访问及检索相关日志以及人员信息。这些信息将被集中存储在元数据信息获取模块31中，供各模块调用，主要使用者为元数据聚集模板生成模块32。The metadatainformation acquisition module 31 is an interface module between the recommendation device for massive information data and the enterprise content management platform, and is responsible for data interaction with the nuclear power enterprise content management system ECM, which mainly includes enterprise content: metadata information, unstructured files Text content, system access and retrieval related logs, and personnel information. These information will be centrally stored in the metadatainformation acquisition module 31 for each module to call, and the main user is the metadata aggregationtemplate generation module 32 .

In addition, the metadata information acquisition module 31 is also responsible for updating the system integration data.

The recommendation device for massive information data provided by this embodiment can be used in the corresponding recommendation method of Embodiment One above. For details, refer to the description of Embodiment One, which is not repeated here.

Embodiment Four:

FIG. 4 shows a schematic diagram of the composition of the recommendation device for massive information data provided by Embodiment Four of the present invention. For ease of description, only the parts relevant to this embodiment are shown, as detailed below:

A metadata information acquisition module 41, configured to acquire metadata information from the enterprise content management system (ECM);

A metadata clustering template generation module 42, configured to generate a metadata clustering template according to the metadata set sample space of the metadata information;

A static attribute space acquisition module 43, configured to acquire the static attribute space of a user according to the user's relevant information;

A static massive data template acquisition module 44, configured to obtain the corresponding static massive data template according to the user's static attribute space and the metadata clustering template;

An attention word acquisition module 45, configured to monitor the user's behavior log and, from that log, obtain the user's attention words within a preset time;

A text index forming module 46, configured to form a text index from text analysis of the unstructured documents in the massive data;

A recommended content search module 47, configured to search for the content to be recommended according to the text index, the user's attention words within the preset time, and the static massive data template;

A log recording module 48, configured to record the content found for recommendation and the static massive data template.

The metadata clustering template generation module 42 includes:

A selection unit 421, configured to arbitrarily select K objects from the metadata set sample space as the initial cluster centers, where K is an integer greater than zero and each cluster object corresponds to one class of technical documents;

A first calculation unit 422, configured to calculate the similarity between every object in the metadata set sample space and the K cluster centers, and to assign each object to the cluster with which it has the highest similarity;

A second calculation unit 423, configured to recalculate the cluster center of each cluster from the objects in that cluster, thereby recalculating the K cluster centers;

A third calculation unit 424, configured to, if any of the recalculated K cluster centers has changed, recalculate the similarity between every object and the recalculated K cluster centers and assign each object to the cluster with which it has the highest similarity, forming new cluster objects;

A forming unit 425, configured to repeatedly invoke the second calculation unit and the third calculation unit until the K cluster centers no longer change, at which point the K cluster centers form the metadata clustering template.
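The iteration carried out by units 421 to 425 follows the familiar K-means pattern. The following is a minimal sketch under stated assumptions, not the patented implementation: it assumes each metadata object has already been encoded as a numeric vector, and it uses negative squared Euclidean distance as the similarity measure, which the text does not specify. The function names (`similarity`, `assign`, `recompute_centers`, `build_clustering_template`) and the `max_iter` safety cap are hypothetical.

```python
import random
from typing import List, Sequence, Tuple

Vector = Tuple[float, ...]


def similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Assumed similarity: negative squared Euclidean distance (higher is closer)."""
    return -sum((x - y) ** 2 for x, y in zip(a, b))


def assign(objects: List[Vector], centers: List[Vector]) -> List[int]:
    """For each object, return the index of the most similar cluster center."""
    return [
        max(range(len(centers)), key=lambda c: similarity(obj, centers[c]))
        for obj in objects
    ]


def recompute_centers(
    objects: List[Vector], labels: List[int], centers: List[Vector]
) -> List[Vector]:
    """Recalculate each cluster center as the mean of the objects in that cluster."""
    new_centers = []
    for c, old in enumerate(centers):
        members = [o for o, label in zip(objects, labels) if label == c]
        if not members:
            new_centers.append(old)  # assumption: an empty cluster keeps its center
            continue
        new_centers.append(
            tuple(sum(m[d] for m in members) / len(members) for d in range(len(old)))
        )
    return new_centers


def build_clustering_template(
    objects: List[Vector], k: int, seed: int = 0, max_iter: int = 100
) -> Tuple[List[Vector], List[int]]:
    """Return (K cluster centers, cluster label of each object)."""
    rng = random.Random(seed)
    centers = [tuple(o) for o in rng.sample(list(objects), k)]  # selection unit
    labels = assign(objects, centers)                           # first calculation unit
    for _ in range(max_iter):
        new_centers = recompute_centers(objects, labels, centers)  # second unit
        if new_centers == centers:  # forming unit: stop once no center changes
            break
        centers = new_centers
        labels = assign(objects, centers)                       # third calculation unit
    return centers, labels
```

For example, calling `build_clustering_template(vectors, k=2)` on six two-dimensional vectors that form two well-separated groups returns two centers, one per group; in the terms used here, each center would stand for one class of technical documents.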

Each user belongs to one class of technical attention group. The static massive data template acquisition module 44 is specifically configured to calculate, according to the attribute parameters in the user's static attribute space and the attribute parameters in the metadata clustering template, the matching relation between each class of technical documents and each class of technical attention group μ [formula not reproduced] to obtain the static massive data template, where att_i is the i-th attribute parameter in the intersection of the attribute parameters in the user's static attribute space and the attribute parameters in the metadata clustering template, n is the number of attribute parameters in that intersection, Meta(att_i) is the value of att_i in the user's static attribute space, Specialty(att_i) is the value of att_i in the metadata clustering template, and [symbol not reproduced] is the weight of att_i.
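Because the matching formula itself is not reproduced in this text, the following is only an illustrative sketch of the ingredients that are named: the intersection of attribute parameters, the two values of each shared attribute, and a per-attribute weight. The scoring rule (weighted share of shared attributes whose values agree), the function name `match_score`, the default weight of 1.0, and the attribute names in the example are all assumptions and should not be read as the patent's formula.

```python
from typing import Dict, Hashable


def match_score(
    user_attrs: Dict[str, Hashable],
    template_attrs: Dict[str, Hashable],
    weights: Dict[str, float],
) -> float:
    """Hypothetical matching score in [0, 1] between a user and a document class.

    Only attributes present on both sides (the intersection) are considered.
    Each shared attribute contributes its weight when the two values agree.
    """
    shared = set(user_attrs) & set(template_attrs)
    total = sum(weights.get(a, 1.0) for a in shared)  # assumed default weight 1.0
    if not shared or total == 0:
        return 0.0
    agreeing = sum(
        weights.get(a, 1.0) for a in shared if user_attrs[a] == template_attrs[a]
    )
    return agreeing / total


# Example with made-up attribute names and values.
user = {"specialty": "I&C", "reactor_type": "CPR1000", "department": "design"}
template = {"specialty": "I&C", "reactor_type": "EPR", "doc_type": "spec"}
score = match_score(user, template, {"specialty": 3.0, "reactor_type": 1.0})
```

In this example only `specialty` and `reactor_type` are shared, and only `specialty` agrees, so the heavier weight on `specialty` dominates the score.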

The recommended content search module 47 includes:

A frequency acquisition unit 471, configured to obtain the frequency [formula not reproduced] with which the user's attention words within the preset time appear in the text index, where [symbol not reproduced] is the user's j-th attention word within the preset time;

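A recommendation strength calculation unit 472, configured to calculate, from [frequency term not reproduced] and V(μ, δ), the recommendation strength [formula not reproduced] of each class of technical documents, where m is the number of the user's attention words within the preset time, [symbol not reproduced] is the attention-time freshness weight, [symbol not reproduced] is the attention-frequency weight, and τ(δ) is the update-time parameter of document δ;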

A recommended content generation unit 473, configured to generate, according to the recommendation strength of each class of technical documents, the recommended content as a list of the technical documents whose recommendation strength satisfies a preset condition.
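The flow through units 471 to 473 can be sketched as follows. This is a simplified illustration under loud assumptions: the text index is modeled as a toy inverted index (term to per-class occurrence counts), the recommendation strength is taken as the matching value multiplied by the summed attention-word frequency, and the freshness and frequency weights and τ(δ) are left out because their formulas are not reproduced in this text. The names `term_frequencies` and `recommend`, and the "at least `threshold`" preset condition, are hypothetical.

```python
from typing import Dict, List, Tuple

# Toy text index: term -> {document class -> occurrence count}
TextIndex = Dict[str, Dict[str, int]]


def term_frequencies(
    index: TextIndex, attention_words: List[str], doc_class: str
) -> List[int]:
    """Frequency of each attention word in the index entries of one document class."""
    return [index.get(word, {}).get(doc_class, 0) for word in attention_words]


def recommend(
    index: TextIndex,
    attention_words: List[str],
    match: Dict[str, float],
    threshold: float,
) -> List[Tuple[str, float]]:
    """Return (document class, strength) pairs meeting the threshold, strongest first.

    Assumed strength: matching value times total attention-word frequency.
    """
    scored = []
    for doc_class, matching_value in match.items():
        strength = matching_value * sum(
            term_frequencies(index, attention_words, doc_class)
        )
        if strength >= threshold:  # assumed form of the preset condition
            scored.append((doc_class, strength))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Example with made-up terms, document classes, and matching values.
index = {"valve": {"piping": 4, "electrical": 1}, "relay": {"electrical": 5}}
result = recommend(
    index,
    ["valve", "relay"],
    {"piping": 0.5, "electrical": 1.0, "civil": 0.9},
    threshold=1.0,
)
```

A class with a high matching value but no occurrences of the attention words (here `civil`) drops out of the list, which reflects the idea that both the static template and the recent attention words drive the recommendation.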

The recommendation device for massive information data provided by this embodiment can be used in the corresponding recommendation method of Embodiment Two above. For details, refer to the description of Embodiment Two, which is not repeated here.

Those skilled in the art will clearly understand that, for convenience and brevity of description, the above division into functional modules is only an example. In practical applications, the above functions may be allocated to different functional modules as needed; that is, the internal structure of the device may be divided into different functional modules, and these modules may be implemented in hardware or in software. In addition, the specific names of the functional modules serve only to distinguish them from one another and do not limit the protection scope of the present application.

In summary, the embodiments of the present invention fill the gap in recommending structured massive nuclear power information. They effectively combine the characteristics of nuclear power technical documents with the professional attributes and attention information of professional staff, and can adapt to a variety of nuclear power technology routes. The system can dynamically record user attention information and record the related operations in the form of logs. The invention provides an intelligent method for knowledge extraction and matching of nuclear power technical data, which effectively improves the efficiency and accuracy with which nuclear power technical knowledge is disseminated, improves work efficiency, reduces production cost, and is stable and reliable.

Those of ordinary skill in the art will also understand that all or part of the steps in the methods of the above embodiments can be completed by a program instructing the relevant hardware. The program can be stored in a computer-readable storage medium, such as ROM/RAM, a magnetic disk, or an optical disk.

The above are only preferred embodiments of the present invention and are not intended to limit it. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within its protection scope.

Claims

1. A recommendation method for massive information data, characterized by comprising the following steps:

acquiring metadata information from an enterprise content management system (ECM);

generating a metadata clustering template according to the metadata set sample space of the metadata information;

acquiring a static attribute space of a user according to related information of the user;

each user belongs to one class of technical attention group; acquiring a corresponding static massive data template according to the static attribute space of the user and the metadata clustering template, which comprises: calculating, according to the attribute parameters in the static attribute space of the user and the attribute parameters in the metadata clustering template, the matching relation between each class of technical documents and each class of technical attention group μ [formula not reproduced] to obtain the static massive data template, wherein att_i is the i-th attribute parameter in the intersection of the attribute parameters in the static attribute space of the user and the attribute parameters in the metadata clustering template, n is the number of attribute parameters in the intersection, Meta(att_i) is the value of att_i in the static attribute space of the user, Specialty(att_i) is the value of att_i in the metadata clustering template, and [symbol not reproduced] is the weight of att_i;

monitoring the behavior log of the user, and acquiring the attention words of the user within a preset time according to the behavior log of the user;

forming a text index according to text analysis of the unstructured documents in the massive data;

searching for content to be recommended according to the text index, the attention words of the user within the preset time, and the static massive data template;

and recording the content found for recommendation and the static massive data template.

2. The recommendation method according to claim 1, wherein the generating a metadata clustering template according to the metadata set sample space of the metadata information comprises:

step one, arbitrarily selecting K objects from the metadata set sample space as initial cluster centers, wherein K is an integer greater than zero, and each cluster object corresponds to one class of technical documents;

step two, calculating the similarity between all objects in the metadata set sample space and the K cluster centers, and classifying each of the objects into the cluster with which it has the highest similarity;

step three, recalculating the cluster center of each cluster according to the objects in that cluster, so as to recalculate the K cluster centers;

step four, if any of the recalculated K cluster centers has changed, recalculating the similarity between all the objects and the recalculated K cluster centers, and classifying each of the objects into the cluster with which it has the highest similarity, to form new cluster objects;

and step five, repeating step three and step four until the K cluster centers no longer change, wherein the K cluster centers form the metadata clustering template.

3. The recommendation method according to claim 2, wherein the searching for content to be recommended according to the text index, the attention words of the user within the preset time, and the static massive data template comprises:

acquiring the frequency [formula not reproduced] with which the attention words of the user within the preset time appear in the text index, wherein [symbol not reproduced] is the j-th attention word of the user within the preset time;

calculating, according to [frequency term not reproduced] and V(μ, δ), the recommendation strength [formula not reproduced] of each class of technical documents, wherein m is the number of attention words of the user within the preset time, [symbol not reproduced] is the attention-time freshness weight, [symbol not reproduced] is the attention-frequency weight, and τ(δ) is the update-time parameter of document δ;

and generating, according to the recommendation strength of each class of technical documents, recommended content as a list of the technical documents whose recommendation strength satisfies a preset condition.

4. A recommendation device for massive information data, the recommendation device comprising:

a metadata information acquisition module, configured to acquire metadata information from an enterprise content management system (ECM);

a metadata clustering template generation module, configured to generate a metadata clustering template according to the metadata set sample space of the metadata information;

a static attribute space acquisition module, configured to acquire the static attribute space of a user according to the relevant information of the user;

a static massive data template acquisition module, configured to obtain a corresponding static massive data template according to the static attribute space of the user and the metadata clustering template; each user belongs to one class of technical attention group; the static massive data template acquisition module is specifically configured to: calculate, according to the attribute parameters in the static attribute space of the user and the attribute parameters in the metadata clustering template, the matching relation between each class of technical documents and each class of technical attention group μ [formula not reproduced], wherein [symbol not reproduced] is the weight of att_i;

an attention word acquisition module, configured to monitor the behavior log of the user and acquire the attention words of the user within a preset time according to the behavior log of the user;

a text index forming module, configured to form a text index according to text analysis of the unstructured documents in the massive data;

a recommended content search module, configured to search for content to be recommended according to the text index, the attention words of the user within the preset time, and the static massive data template;

and a log recording module, configured to record the content found for recommendation and the static massive data template.

5. The recommendation device of claim 4, wherein the metadata clustering template generation module comprises:

a selecting unit, configured to arbitrarily select K objects from the metadata set sample space as initial cluster centers, where K is an integer greater than zero, and one of the cluster objects corresponds to one class of technical documents;

the first calculating unit is used for calculating the similarity between all objects in the metadata set sample space and K cluster centers and classifying each object in all the objects into a cluster with the highest similarity to the object;

a second calculating unit, configured to recalculate the cluster center of each cluster according to the objects in that cluster, so as to recalculate the K cluster centers;

a third calculating unit, configured to, if any of the recalculated K cluster centers has changed, recalculate the similarity between all the objects and the recalculated K cluster centers, and classify each of the objects into the cluster with which it has the highest similarity, so as to form new cluster objects;

and a forming unit, configured to repeatedly invoke the second calculating unit and the third calculating unit until the K cluster centers no longer change, wherein the K cluster centers form the metadata clustering template.

6. The recommendation device of claim 5, wherein the recommended content search module comprises:

a frequency obtaining unit, configured to obtain the frequency [formula not reproduced] with which the attention words of the user within the preset time appear in the text index, wherein [symbol not reproduced] is the j-th attention word of the user within the preset time;

a recommendation strength calculation unit, configured to calculate, according to [frequency term not reproduced] and V(μ, δ), the recommendation strength [formula not reproduced] of each class of technical documents, wherein m is the number of attention words of the user within the preset time, and [symbol not reproduced] is the attention-time freshness weight;

and a recommended content generating unit, configured to generate, according to the recommendation strength of each class of technical documents, recommended content as a list of the technical documents whose recommendation strength satisfies a preset condition.