Technical Field
The present invention relates to the field of artificial intelligence, and in particular to a briefing generation method and device.
Background
Information technology is a fundamental condition of modern production. Information has gradually become a strategic, tactical, and instrumental resource, and competition among enterprises increasingly takes the form of competition in comprehensive strength centered on information technology. The processing and application of information has become an important part of the modern enterprise.
With the development of information technology, changes in people's lifestyles, and the rise and spread of social media, the question is how to capture the value contained in data resources and provide scientific research briefings that are fast, timely, concise, and of the highest value. This places higher demands on information collection, information processing and generation, and information mining and analysis.
Information collection has shifted from traditional manual collection to automated collection, and crawler frameworks for automated collection are favored by more and more people. However, each website has its own design and needs its own handling, and whenever a website is redesigned the collection code must be updated with it. As a result, fully automated data collection and management has not yet been achieved in most fields in China. As information grows rapidly, traditional collection methods will gradually be phased out, and new collection methods need to be developed. Current information processing and generation generally uses extractive text summarization and generative text summarization. Extractive summarization uses a simple algorithm to extract three sentences as the summary. Generative summarization merges keywords with sentence order: it first iterates over the keywords, then iterates over the sentences in order until it finds the first sentence in which a keyword appears, and uses that sentence as the summary.
Summary of the Invention
The main purpose of the present invention is to provide a briefing generation method and device, aiming to solve the technical problem in the prior art that collecting and writing briefing information usually requires manual screening and manual writing, which wastes a large amount of time.
To achieve the above purpose, the present invention provides a briefing generation method comprising the following steps:
screening out multiple pieces of reference information from network-wide information content according to a preset multi-mode channel feature configuration library;
performing feature extraction on the multiple pieces of reference information to obtain the reference features of each piece of reference information, the reference features including an information title, an information body, an information author, and an information release time;
performing feature expansion on the reference features to obtain an information briefing data set;
performing data processing on the information briefing data set based on a preset feedback neural network to obtain a header, a core, and a trailer, and generating a briefing from the header, the core, and the trailer.
Optionally, screening out multiple pieces of reference information from the network-wide information content according to the preset multi-mode channel feature configuration library includes:
searching according to the monitored information channel feature set in the preset multi-mode channel feature configuration library to obtain web page content;
performing content extraction on the web page content to obtain initial content features;
matching, by means of the preset feedback neural network, the initial content features against the information feature set in the preset multi-mode channel feature configuration library;
learning from the successfully matched web page content through the preset feedback neural network to obtain new features, and updating the information feature set in the preset multi-mode channel feature configuration library according to the new features;
using the successfully matched web page content as reference information.
Optionally, performing feature extraction on the multiple pieces of reference information to obtain the reference features of each piece of reference information, the reference features including the information title, information body, information author, and information release time, includes:
calculating the title weight of each piece of reference information from the information title features in the preset multi-mode channel feature configuration library, and performing feature extraction according to the title weight;
calculating the paragraph link density and text density of the body of each piece of reference information, and performing feature extraction according to the paragraph link density and text density to obtain the information body;
extracting the time from the uniform resource locator (URL) of each piece of reference information using regular expressions to obtain an initial time feature, and formatting the initial time feature to obtain the information release time;
matching the publishing author features in the preset multi-mode channel feature configuration library against the reference information to obtain the information author.
Optionally, after performing feature extraction on the multiple pieces of reference information to obtain the reference features of each piece of reference information, the method further includes:
matching the information body and information title of each piece of reference information against the monitoring tag feature set in the preset multi-mode channel feature configuration library, and determining from the matching result whether the reference information is relevant;
matching the information body and information title of each piece of reference information against the sensitive word tag feature set in the preset multi-mode channel feature configuration library, and determining from the matching result whether the reference information is invalid;
extracting text keywords from the information body and information title of the reference information, and determining from the text keywords whether the reference information meets the quality requirements;
removing the irrelevant reference information, the invalid reference information, and the reference information that does not meet the quality requirements, to obtain the multiple pieces of reference information.
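The three checks above can be sketched as a simple filter. This is only an illustrative sketch: the `Article` type, the function names, plain substring matching, and the quality rule (a minimum number of extracted keywords) are all assumptions made for the example, since the text does not specify how matching or the quality requirement is implemented.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Article:
    # Hypothetical container for one piece of reference information.
    title: str
    body: str
    keywords: List[str] = field(default_factory=list)


def _text(article):
    return article.title + " " + article.body


def is_relevant(article, monitoring_tags):
    # Relevant if any monitoring tag occurs in the title or body.
    return any(tag in _text(article) for tag in monitoring_tags)


def is_invalid(article, sensitive_tags):
    # Invalid if any sensitive-word tag occurs in the title or body.
    return any(tag in _text(article) for tag in sensitive_tags)


def meets_quality(article, min_keywords=2):
    # Assumed quality rule: at least `min_keywords` extracted keywords.
    return len(article.keywords) >= min_keywords


def filter_references(articles, monitoring_tags, sensitive_tags, min_keywords=2):
    # Keep only relevant, valid articles that meet the assumed quality rule.
    return [
        a for a in articles
        if is_relevant(a, monitoring_tags)
        and not is_invalid(a, sensitive_tags)
        and meets_quality(a, min_keywords)
    ]
```

For example, with the monitoring tag "blockchain" and the sensitive word "recruitment", an article titled "Blockchain recruitment fair" would be dropped as invalid even though it matches a monitoring tag.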
Optionally, performing feature expansion on the reference features to obtain the information briefing data set includes:
using the information title of each piece of reference information as an information identifier, and converting the information title, information body, information author, and information release time into an information ID;
taking as feature target values the information title configuration feature, information body configuration feature, information release time configuration feature, and information author configuration feature of each piece of reference information that were successfully matched against the information feature set in the preset multi-mode channel feature configuration library;
generating an information catalog from the information identifier, the feature target values, and the information ID, and forming the information briefing data set from multiple information catalogs.
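A minimal sketch of how one catalog entry might be built is shown below. The text does not say how the four fields are converted into an information ID; hashing them with SHA-256 is an assumption made here, and the function and key names are hypothetical.

```python
import hashlib


def make_info_id(title, body, author, published):
    # Assumption: the ID is a SHA-256 digest of the four fields, joined
    # with a separator that is unlikely to occur in the text itself.
    joined = "\x1f".join([title, body, author, published])
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


def build_catalog_entry(ref, target_values):
    # `ref` is a dict with the four reference features; `target_values`
    # holds the configuration features that matched for this item.
    return {
        "identifier": ref["title"],  # the title serves as the identifier
        "info_id": make_info_id(
            ref["title"], ref["body"], ref["author"], ref["published"]
        ),
        "target_values": dict(target_values),
    }
```

A content-derived ID of this kind has the side benefit that the same article collected twice yields the same ID, which makes duplicates easy to detect.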
Optionally, generating an information catalog from the information identifier, the feature target values, and the information ID, and forming the information briefing data set from multiple information catalogs, includes:
generating the information catalog from the information identifier, the feature target values, and the information ID;
determining multiple keywords from the information body and information title of the reference information, and generating information tags from the keywords;
constructing adjacent word groups from the information body and information title of the reference information, and combining the adjacent word groups to obtain keyword phrases;
updating the information catalog according to the information tags and the keyword phrases to obtain a reference information catalog, and forming the information briefing data set from the reference information catalog.
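The combination of adjacent words into keyword phrases can be illustrated as follows. This sketch assumes the text has already been segmented into tokens and that a keyword set is available; merging runs of two or more adjacent keywords is one common way to form phrases and is not necessarily the exact rule intended here.

```python
def keyword_phrases(tokens, keywords):
    # Merge each run of two or more adjacent keywords into one phrase.
    phrases = []
    run = []
    for tok in tokens:
        if tok in keywords:
            run.append(tok)
            continue
        if len(run) >= 2:
            phrases.append(" ".join(run))
        run = []
    if len(run) >= 2:
        phrases.append(" ".join(run))
    # Remove duplicates while preserving first-seen order.
    return list(dict.fromkeys(phrases))
```

With English tokens the words are joined by a space; for Chinese text the separator would normally be an empty string.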
Optionally, updating the information catalog according to the information tags and the keyword phrases to obtain a reference information catalog, and forming the information briefing data set from the reference information catalog, further includes:
updating the information catalog according to the information tags and the keyword phrases to obtain the reference information catalog;
generating an article brief from the information body of the reference information;
updating the reference information catalog according to the article brief, and forming the information briefing data set from the updated reference information catalog.
Optionally, performing data processing on the information briefing data set based on the preset feedback neural network to obtain the header, the core, and the trailer, and generating the briefing from the header, the core, and the trailer, includes:
generating the header, based on the preset feedback neural network, from the corresponding configured information feature tags and the configured scientific research information unit in the information briefing data set;
generating the core from the corresponding scientific research information configuration in the information briefing data set and from the article briefs, keywords, and information catalogs in the information briefing data set;
generating the trailer from the corresponding configured scope of occurrence of the scientific research information in the information briefing data set;
generating the briefing from the header, the core, and the trailer.
Optionally, generating the core from the corresponding scientific research information configuration in the information briefing data set and from the article briefs, keywords, and information catalogs in the information briefing data set includes:
performing feature extraction according to the corresponding scientific research information configuration in the information briefing data set to obtain a valid catalog;
generating multiple briefing editor's notes from the reference information in the information briefing data set, and combining the multiple editor's notes to obtain a target editor's note;
performing word frequency statistics on the keywords in the information briefing data set, clustering the keywords into word groups, and obtaining the core keyword information of the briefing from the word groups;
generating an article title from the keyword phrases in the information briefing data set, and generating an article lead from the article briefs in the information briefing data set;
obtaining the article source, the article author information, and the author information from the information briefing data set;
generating the core from the valid catalog, the target editor's note, the core keyword information of the briefing, the article lead, the article source, the article author information, and the author information.
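The word frequency step above can be sketched with a simple counter. This example covers only the frequency statistics; the clustering of keywords into word groups is not specified in the text and is left out. The function name and the `top_n` parameter are hypothetical.

```python
from collections import Counter


def core_keywords(keyword_lists, top_n=3):
    # Count how often each keyword occurs across all items in the data set
    # and return the most frequent ones as core keyword candidates.
    counts = Counter(kw for kws in keyword_lists for kw in kws)
    return [kw for kw, _ in counts.most_common(top_n)]
```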
In addition, to achieve the above purpose, the present invention further provides a briefing generation device, which includes:
an information acquisition module, configured to screen out multiple pieces of reference information from network-wide information content according to a preset multi-mode channel feature configuration library;
the information acquisition module being further configured to perform feature extraction on the multiple pieces of reference information to obtain the reference features of each piece of reference information, the reference features including an information title, an information body, an information author, and an information release time;
a briefing generation module, configured to perform feature expansion on the reference features to obtain an information briefing data set;
the briefing generation module being further configured to perform data processing on the information briefing data set based on a preset feedback neural network to obtain a header, a core, and a trailer, and to generate a briefing from the header, the core, and the trailer.
Based on a general configuration library and a tag library, the present invention provides a new, adaptive way of collecting information. Using tag configurations for the topics the user is interested in, together with an effective keyword configuration feature library and a sensitive word configuration library, it obtains valid web pages and content. This shortens the delivery time of scientific research briefings, improves the efficiency of briefing work, and avoids the large waste of time caused by manual screening and manual writing when collecting and writing briefing information.
Brief Description of the Drawings
Figure 1 is a schematic structural diagram of the briefing generation device in the hardware operating environment involved in embodiments of the present invention;
Figure 2 is a schematic flowchart of the first embodiment of the briefing generation method of the present invention;
Figure 3 is a schematic diagram of the complete briefing generation steps in an embodiment of the briefing generation method of the present invention;
Figure 4 is a schematic flowchart of the second embodiment of the briefing generation method of the present invention;
Figure 5 is a schematic flowchart of the third embodiment of the briefing generation method of the present invention;
Figure 6 is a structural block diagram of the first embodiment of the briefing generation device of the present invention.
The realization of the purpose, the functional features, and the advantages of the present invention are further described below with reference to the embodiments and the accompanying drawings.
Detailed Description
It should be understood that the specific embodiments described here are intended only to explain the present invention and not to limit it.
Referring to Figure 1, Figure 1 is a schematic structural diagram of the briefing generation device in the hardware operating environment involved in embodiments of the present invention.
As shown in Figure 1, the briefing generation device may include a processor 1001, such as a central processing unit (CPU), a communication bus 1002, a user interface 1003, a network interface 1004, and a memory 1005. The communication bus 1002 provides connection and communication between these components. The user interface 1003 may include a display and an input unit such as a keyboard; optionally, the user interface 1003 may also include standard wired and wireless interfaces. The network interface 1004 may optionally include standard wired and wireless interfaces (such as a Wireless-Fidelity (Wi-Fi) interface). The memory 1005 may be high-speed random access memory (RAM) or stable non-volatile memory (NVM), such as disk storage. Optionally, the memory 1005 may also be a storage device independent of the processor 1001.
Those skilled in the art will understand that the structure shown in Figure 1 does not limit the briefing generation device, which may include more or fewer components than shown, combine certain components, or arrange the components differently.
As shown in Figure 1, the memory 1005, as a storage medium, may include an operating system, a network communication module, a user interface module, and a briefing generation program.
In the briefing generation device shown in Figure 1, the network interface 1004 is mainly used for data communication with a network server, and the user interface 1003 is mainly used for data interaction with the user. The processor 1001 and the memory 1005 may be provided in the briefing generation device, which calls the briefing generation program stored in the memory 1005 through the processor 1001 and executes the briefing generation method provided by the embodiments of the present invention.
An embodiment of the present invention provides a briefing generation method. Referring to Figure 2, Figure 2 is a schematic flowchart of the first embodiment of the briefing generation method of the present invention.
In this embodiment, the briefing generation method includes the following steps:
Step S10: screen out multiple pieces of reference information from network-wide information content according to a preset multi-mode channel feature configuration library.
It can be understood that the preset multi-mode channel feature configuration library is a database containing several feature sets, which may include a monitored information channel feature set, a monitoring tag feature set, a sensitive word tag feature set, and an information feature set.
It should be noted that a feature library is first built for the preset multi-mode channel feature configuration library. The feature library may be initialized by collecting, through public channels, developments in scientific and technological innovation from inside the group (the research institute, the Guangdong Innovation Institute, the Zhejiang Innovation Institute, the Guangdong company, the Zhejiang company, relevant professional bodies, etc.) and from outside companies and institutions in the industry (such as the China Telecom and China Unicom research institutes, Internet companies, etc.).
The multi-mode channel feature configuration library is then configured. Configure the monitored information channel feature set R1 involved in the task: configure news platform addresses and platform names, such as Jiangsu Development and Reform and CCID Think Tank; configure new-media public account addresses and platform names, such as WeChat official accounts and Weibo.
Configure the monitoring tag feature set R2: configure brand tags, such as Guangdong Innovation Institute, Zhejiang Mobile, and China Telecom Research Institute; configure product tags, such as blockchain, wireless base station, and industrial Internet.
Configure the sensitive word tag feature set R3: configure suspected sensitive word tags, such as summer vacation, internship, and recruitment; confirm the sensitive word tags identified by the algorithm: for tags that the algorithm extracts from articles identified as invalid and whose word frequency is greater than 20, a person confirms whether each tag should be added as a suspected sensitive word tag.
Configure the information feature sets Ri1 to Ri4: web page feature configurations are extracted based on a web page analysis algorithm and a candidate URL ranking algorithm. The information feature sets mainly include information title features, information body features, release time features, and publishing author features, and are configured mainly through matching and regular expressions.
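The confirmation step for the sensitive word tag feature set R3 described above can be sketched as follows: tags whose word frequency across invalid articles exceeds 20 are collected as candidates for manual review. The function name and the input format (one token list per invalid article) are assumptions made for the example; the final decision remains with a human reviewer.

```python
from collections import Counter


def candidate_sensitive_tags(invalid_article_tokens, existing_tags, min_freq=20):
    # Count word frequency over all articles already judged invalid.
    counts = Counter(
        tok for tokens in invalid_article_tokens for tok in tokens
    )
    # Propose words seen more than `min_freq` times that are not yet tags;
    # the result is a review queue, not an automatic update of R3.
    return sorted(
        tok for tok, n in counts.items()
        if n > min_freq and tok not in existing_tags
    )
```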
Here, feature extraction is performed on the information to be obtained; a configuration task is created according to the task to be generated and matched with a task template. Once the configuration task has been created and initialized, its configuration information can be set. The configuration task is integrated into the feature configuration of the current page, so that extraction and annotation can be reconfigured later during operation.
It should be noted that the preset multi-mode channel feature configuration library may be continuously updated: each time new information is collected, features found in that information that are not yet in the library can be added to it.
It should be understood that the network-wide information content may be the information content obtained by using mature crawler technology to search or visit, in a loop, according to the multi-mode channel configuration library R1 configured in step one; the reference information is the information content obtained after screening the network-wide information.
It should be further noted that the preset multi-mode channel feature configuration library may be continuously updated, and that the screening may be: searching according to the monitored information channel feature set in the preset multi-mode channel feature configuration library to obtain web page content; performing content extraction on the web page content to obtain initial content features; matching, by means of the preset feedback neural network, the initial content features against the information feature set in the preset multi-mode channel feature configuration library; learning from the successfully matched web page content through the preset feedback neural network to obtain new features, and updating the information feature set in the preset multi-mode channel feature configuration library according to the new features; and using the successfully matched web page content as reference information.
In a specific implementation, mature crawler technology is used to search or visit in a loop according to the multi-mode channel configuration library configured in step one, returning all links and absolute links contained in the searched object and deriving the structure and content of the web page. Each time an article that conforms to the multi-mode channel feature configuration library is extracted, its status is marked and it is written to a historical information database, so that all information articles are managed uniformly and the same article is not extracted again in later extraction stages.
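The status marking and the historical information database described above can be illustrated with a small SQLite-backed store. This is a sketch under stated assumptions: the class name, the table layout, and the use of the article URL as the deduplication key are all hypothetical, as the text does not describe the database schema.

```python
import sqlite3


class ArticleHistory:
    """Hypothetical history store that prevents repeated extraction."""

    def __init__(self, path=":memory:"):
        self._db = sqlite3.connect(path)
        self._db.execute(
            "CREATE TABLE IF NOT EXISTS history ("
            "url TEXT PRIMARY KEY, status TEXT NOT NULL)"
        )

    def record(self, url, status="extracted"):
        # Insert the article once; a repeated URL is ignored.
        cur = self._db.execute(
            "INSERT OR IGNORE INTO history (url, status) VALUES (?, ?)",
            (url, status),
        )
        self._db.commit()
        # True only when the row was newly inserted.
        return cur.rowcount == 1
```

A crawler loop would call `record(url)` for each extracted article and skip further processing whenever it returns `False`.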
The extracted content is matched against the features in the multi-mode channel feature configuration library through a recursive feedback neural network model to find the unmatched parts of the reference information; the recursive feedback neural network keeps learning these unknown features, which are added to the corresponding feature library. The content of the web page is also compared against the known configuration library, and the website is evaluated according to a website weight algorithm. This determines whether the currently acquired data is valid and continuously improves the information title feature library, the information body feature library, the release time feature library, and the publishing author feature library.
It should be noted that the method of this embodiment is executed by a briefing generation device, which has functions such as data processing, data communication, and program execution. The briefing generation device may be an integrated controller, a control computer, or a similar device, and may of course be any other device with similar functions; this embodiment places no limitation on this.
Step S20: perform feature extraction on the multiple pieces of reference information to obtain the reference features of each piece of reference information, the reference features including an information title, an information body, an information author, and an information release time.
It should be noted that, before feature extraction is performed on the reference information, the data is cleaned to remove redundant fields, unify formats, and so on.
It should be noted that performing feature extraction on the multiple pieces of reference information to obtain the reference features of each piece of reference information, the reference features including the information title, information body, information author, and information release time, includes: calculating the title weight of each piece of reference information from the information title features in the preset multi-mode channel feature configuration library, and performing feature extraction according to the title weight.
For information title extraction, on the first attempt the weights of the corresponding Document Object Model (DOM) nodes are calculated from the configured information title features, the corresponding DOM block is identified by analysis, and the web page title is extracted from it. If the first attempt fails, a news title (which can be understood simply as a kind of information title) may be generated from the extracted page text density and a text feature algorithm. For example, jieba (an open-source library developed by Baidu engineer Sun Junyi) may be used with an existing stop-word list to divide the web page into a suspected title area and a body area; tags are removed from the body area to reduce page noise, so that the candidate body region is extracted accurately. An undirected weighted graph model (TextRank, a graph-based ranking algorithm for text) is used to calculate the weight set of each piece of information, and an improved similarity calculation method extracts the news title from the candidate body region. The title extraction status of the article is marked at the same time.
The paragraph link density and text density of the body of each piece of reference information are calculated, and feature extraction is performed based on the paragraph link density and text density to obtain the information body;
In a specific implementation, markup may be stripped from the body, retaining only the content of the text tags P, div and span, while keeping all the blank positions left after tag removal; the result is divided into line blocks. Taking line numbers as the x-axis, each line together with the k lines around it (either direction is possible; here the threshold is k > 3 and the direction is downward, k being the line-block thickness) is called a line block, and line block i is the line block anchored at line i. Line-block length: the total number of characters in a line block after all whitespace characters (\n, \r, \t, etc.) are removed. Line-block distribution function: with each line as an anchor there are LinesNum(xtext) - K line blocks in total, and the distribution function takes [1, LinesNum(xtext) - K] as the horizontal axis and the length of each line block as the vertical axis.
Using the DOM tree structure, a clustering algorithm analyzes and calculates the paragraph link density and text density of each body paragraph, with the text segmented into words. Jieba is used for word segmentation, after which the words are counted. A recursive feedback neural network model identifies the text density and link density of pages from the same information website to obtain the most effective body content. The text density may be the total number of words in the lines divided by the number of lines; the link density may be the number of words in link text divided by the total number of words. Irrelevant content in the extracted body is replaced using the body-filter declaration features, which yields the valid body. The body extraction status of the article is then marked.
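The line-block and density definitions above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names are hypothetical, word counts are assumed to come from a segmenter such as Jieba, and a line block is assumed to be line i plus the k lines below it, which gives LinesNum - K blocks.

```python
import re

def line_block_lengths(lines, k=4):
    """Length of every line block: line i plus the k lines below it,
    counted after all whitespace characters are removed."""
    stripped = [re.sub(r"\s", "", line) for line in lines]
    # There are len(lines) - k blocks, matching LinesNum(xtext) - K.
    return [sum(len(s) for s in stripped[i:i + k + 1])
            for i in range(len(lines) - k)]

def text_density(words_per_line):
    """Total number of words divided by the number of lines."""
    if not words_per_line:
        return 0.0
    return sum(words_per_line) / len(words_per_line)

def link_density(link_word_count, total_word_count):
    """Number of words inside link text divided by the total number of words."""
    if total_word_count == 0:
        return 0.0
    return link_word_count / total_word_count
```

Under these assumptions, a peak in the list returned by `line_block_lengths` would mark a likely body region, and paragraphs with a high `link_density` would be candidates for removal.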
The time is extracted from the uniform resource locator of each piece of reference information using regular expressions to obtain an initial time feature, and the initial time is formatted to obtain the information release time;
In a specific implementation, regular expressions are used to extract the time from the uniform resource locator (URL) of the information, or from the web page according to the release-time feature library.
For times without year and month features, which mainly take forms such as "X days ago", "X minutes ago", "X hours ago" or "X seconds ago", the following regular expressions are used for cleaning:
(r'\d{1,2}%s\d{1,2}%s\d{1,2}%s','%%H%s%%M%s%%S%s')
(r'\d{1,2}%s\d{1,2}%s','%%H%s%%M%s')
For times without a year feature, which mainly take forms such as "X month X day" (X月X日), the following regular expressions are used for cleaning:
(r'\d{1,2}%s\d{1,2}%s\d{1,2}%s\d{1,2}%s\d{1,2}%s','%%m%s%%d%s%%H%s%%M%s%%S%s')
(r'\d{1,2}%s\d{1,2}%s\d{1,2}%s\d{1,2}%s','%%m%s%%d%s%%H%s%%M%s')
(r'\d{1,2}%s\d{1,2}%s','%%m%s%%d%s')
Other non-standard time features are cleaned with the following regular expressions:
(r'\d{4}%s\d{1,2}%s\d{1,2}%s\d{1,2}%s\d{1,2}%s\d{1,2}%s','%%Y%s%%m%s%%d%s%%H%s%%M%s%%S%s')
(r'\d{4}%s\d{1,2}%s\d{1,2}%sT\d{1,2}%s\d{1,2}%s\d{1,2}%s','%%Y%s%%m%s%%d%sT%%H%s%%M%s%%S%s')
(r'\d{4}%s\d{1,2}%s\d{1,2}%s\d{1,2}%s\d{1,2}%s','%%Y%s%%m%s%%d%s%%H%s%%M%s')
(r'\d{4}%s\d{1,2}%s\d{1,2}%s','%%Y%s%%m%s%%d%s')
(r'\d{2}%s\d{1,2}%s\d{1,2}%s','%%y%s%%m%s%%d%s')
Finally, the release-time extraction status of the article is marked.
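One plausible reading of the (pattern, format) pairs above is that each `%s` is filled with a separator string, after which the pattern locates the time and the format parses it. The sketch below illustrates that reading with two of the templates. The separator sets, the relative-time handling and the function name `parse_publish_time` are assumptions made for illustration and are not given in the source.

```python
import re
from datetime import datetime, timedelta

# Two of the (pattern, format) templates from the text.
TEMPLATES = [
    (r"\d{4}%s\d{1,2}%s\d{1,2}%s\d{1,2}%s\d{1,2}%s\d{1,2}%s",
     "%%Y%s%%m%s%%d%s%%H%s%%M%s%%S%s"),
    (r"\d{4}%s\d{1,2}%s\d{1,2}%s", "%%Y%s%%m%s%%d%s"),
]

# Hypothetical separator sets, keyed by the number of %s slots in a template.
SEPARATORS = {
    6: [("-", "-", " ", ":", ":", ""), ("年", "月", "日", "时", "分", "秒")],
    3: [("-", "-", ""), ("/", "/", ""), ("年", "月", "日")],
}

# Relative forms such as "3天前" (3 days ago); the unit mapping is assumed.
RELATIVE_UNITS = {"天": "days", "小时": "hours", "分钟": "minutes", "秒": "seconds"}

def parse_publish_time(text, now=None):
    """Return a datetime found in text, or None if nothing matches."""
    now = now or datetime.now()
    relative = re.search(r"(\d+)\s*(天|小时|分钟|秒)前", text)
    if relative:
        unit = RELATIVE_UNITS[relative.group(2)]
        return now - timedelta(**{unit: int(relative.group(1))})
    for pattern, fmt in TEMPLATES:
        for seps in SEPARATORS[pattern.count("%s")]:
            regex = pattern % tuple(re.escape(sep) for sep in seps)
            match = re.search(regex, text)
            if not match:
                continue
            try:
                return datetime.strptime(match.group(0), fmt % seps)
            except ValueError:
                continue  # matched digits that do not form a valid date
    return None
```

Ordering the templates from most to least specific means a full timestamp is tried before a bare date, so the time of day is not silently dropped.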
The publishing author features in the preset multi-mode channel feature configuration library are matched against the reference information to obtain the information author. Specifically, the author information may be extracted according to the configured publishing author feature library; if no author information is obtained, the information platform name configured in step one is used and marked with the publishing author feature. The publishing author status of the article is then marked.
It should be further noted that, after feature extraction is performed on the multiple pieces of reference information to obtain the reference features of each, the method further includes: matching the information body and information title of each piece of reference information against the monitoring tag feature set in the preset multi-mode channel feature configuration library, and determining from the match result whether the reference information is relevant; matching the information body and information title of each piece of reference information against the sensitive word tag feature set in the preset multi-mode channel feature configuration library, and determining from the match result whether the reference information is invalid; extracting text keywords from the information body and information title of the reference information, and determining from the text keywords whether the reference information meets the quality requirements; and removing the irrelevant reference information, the invalid reference information and the reference information that does not meet the quality requirements, to obtain the multiple pieces of reference information that are retained.
Understandably, reference information returned by a search may not fit the current need, or may contain sensitive words that cannot be used, so it must be screened by the body and tags of the reference information.
In a specific implementation, the information body and title are matched in a loop against the vocabulary of the monitoring tag feature set to determine whether the information is relevant; text keywords are extracted from the information body and title, based on the monitoring tag feature set, with the Term Frequency-Inverse Word Frequency (TF-IWF) algorithm to judge the quality of the information; and the information body and title are checked in a loop against the sensitive word tag feature set of step one, with a sensitive word filtering algorithm screening out invalid information. After all of the above operations, all valid information articles obtained are retained.
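The three screening passes described above could look roughly like this. It is a sketch under stated assumptions: relevance and sensitivity are reduced to plain substring matching, the article dictionaries and function names are hypothetical, and `tf_iwf` uses one common formulation of TF-IWF (term frequency times the log of total corpus word count over the word's corpus count), which the source does not spell out.

```python
import math
from collections import Counter

def is_relevant(text, monitor_tags):
    """True if any monitoring tag occurs in the text."""
    return any(tag in text for tag in monitor_tags)

def has_sensitive_word(text, sensitive_words):
    """True if any sensitive word occurs in the text."""
    return any(word in text for word in sensitive_words)

def tf_iwf(doc_tokens, corpus_counts):
    """Score each word of one document; corpus_counts must cover the document."""
    total_words = sum(corpus_counts.values())
    counts = Counter(doc_tokens)
    return {
        word: (count / len(doc_tokens)) * math.log(total_words / corpus_counts[word])
        for word, count in counts.items()
    }

def filter_articles(articles, monitor_tags, sensitive_words):
    """Keep articles that are relevant and free of sensitive words."""
    kept = []
    for article in articles:
        text = article["title"] + article["body"]
        if is_relevant(text, monitor_tags) and not has_sensitive_word(text, sensitive_words):
            kept.append(article)
    return kept
```

A quality threshold on the TF-IWF scores could be added as a third condition in `filter_articles`; the source gives no threshold value, so none is assumed here.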
Step S30: Perform feature expansion on the reference features to obtain an information briefing data set.
Understandably, feature expansion is performed because the title, body, author and time of a piece of reference information are quite fixed, whereas the features in a title are extensible: the same meaning can be expressed with many near-synonyms or similar descriptions. Expansion therefore yields an information briefing data set that describes the reference information more comprehensively.
It should be noted that the information briefing data set contains many kinds of feature information, such as keywords, titles, bodies, tags and keyword phrases, but not every feature is useful and valid, so the information briefing data set needs to be cleaned and integrated.
In a specific implementation, the information briefing data set is created with the complete information title as the unique identifier (key). The information title, information body, information author and information time are converted into corresponding IDs and stored as the values for that key in the information briefing data set; that is, each key has one corresponding set of values.
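As a rough illustration of this key/values layout, the data set can be modelled as a dictionary keyed by title. The field names, the sequential numeric ID and the helper `extend_entry` are assumptions for illustration only; the source does not fix an ID scheme.

```python
def build_briefing_dataset(articles):
    """Map each complete title (the unique key) to its values."""
    dataset = {}
    for number, article in enumerate(articles, start=1):
        dataset[article["title"]] = {
            "id": number,  # hypothetical ID: a running number
            "title": article["title"],
            "body": article["body"],
            "author": article["author"],
            "time": article["time"],
        }
    return dataset

def extend_entry(dataset, title, **extra_values):
    """Attach supplementary values (tags, phrases, summary) to an existing key."""
    dataset[title].update(extra_values)
    return dataset[title]
```

Later expansion steps (tags, keyword phrases, article briefings) would then call something like `extend_entry` on the existing key rather than create a new record.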
Step S40: Process the information briefing data set with a preset feedback neural network to obtain a header, a report core and a trailer, and generate the briefing from the header, report core and trailer.
Understandably, the preset feedback neural network may be a recursive feedback neural network, which can vectorize features.
It should be noted that the data processing may vectorize, through the recursive feedback neural network, the information briefing features obtained after word segmentation of the information briefing data set. The vectorized information briefing data set Ri is used as the input parameter and the feature target value Rj in the data set as the reference object; the feature classes Ci to be learned by the recursive feedback neural network are configured, and the computed target value is the output parameter Cj. Initially, the following features are used for feature target learning.
In a specific implementation, feature target learning may follow the steps below:
Hopfield feedback neural network article title feature Ci1:
The title's HTML DOM block is generally marked with "<h>", "<span>", "<div>", etc.
The title is generally located at the HTML DOM Title position.
The title is generally located in HTML DOM tags such as "<h1>-<h3>", "<b>", "<i>" and "<strong>".
The title text is generally longer than 10 characters and shorter than 35 characters.
Hopfield feedback neural network article body feature Ci2:
The article body's HTML DOM block is generally marked with "<table>", "<span>" or "<div>".
The HTML page usually contains tags such as "<P>" and "<BR>", and the body has more than 2 paragraphs.
The article body's HTML DOM generally contains many sentences and many punctuation marks such as "。" and "," (more than 5).
For tags obtained by a known algorithm, the tag density of the article body = 1000 * number of tags / number of characters, and the body density should be between 1% and 5%.
The article body should occupy a large share of the page: text density = len(body markup block) / len(entire page source) should be greater than 30%.
The article body should not contain "previous article/page" or "next article/page"; these must be excluded.
The article body should not contain marker text such as "related links", "related news" or "related reports", and should not contain many hyperlinks; a paragraph with more than 5 links is treated as recommended related content and must be excluded.
The article body should not contain copyright or similar notices; these are down-weighted and excluded.
The body is usually found below the title, below the publication time block, and above the related links block.
Hopfield feedback neural network publication time feature Ci3:
The publication time's HTML DOM block is generally marked with "<td>", "<span>" or "<div>".
The publication time is generally in a time format or timestamp format, and its character length matches that format.
It is likely to contain keywords such as "source", "published" or "time".
Hopfield feedback neural network publishing author feature Ci4:
The publishing author's HTML DOM block is generally marked with "<td>", "<span>" or "<div>".
The publishing author block generally contains common key terms such as "author", "source" and "infor".
It should be emphasized that the publishing author text is generally longer than 2 characters and shorter than 10 characters.
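A few of the rules listed above (title length, author length, paragraph count, punctuation count, link count and excluded marker text) can be expressed as simple predicates. This is only a hand-written approximation of features that the text says are learned by the network. The function names and the (text, link_count) input shape are hypothetical.

```python
# Marker strings taken from the rules above: previous/next article or page,
# related links, related news, related reports.
EXCLUDED_MARKERS = ("上一篇", "下一篇", "上一页", "下一页",
                    "相关链接", "相关新闻", "相关报道")

def looks_like_title(text):
    """Title text: longer than 10 and shorter than 35 characters."""
    return 10 < len(text.strip()) < 35

def looks_like_author(text):
    """Author text: longer than 2 and shorter than 10 characters."""
    return 2 < len(text.strip()) < 10

def select_body_paragraphs(paragraphs):
    """Drop paragraphs with more than 5 links or with excluded marker text.

    paragraphs is a list of (text, link_count) pairs.
    """
    kept = []
    for text, link_count in paragraphs:
        if link_count > 5:
            continue
        if any(marker in text for marker in EXCLUDED_MARKERS):
            continue
        kept.append(text)
    return kept

def looks_like_body(paragraphs):
    """Body: more than 2 kept paragraphs and more than 5 punctuation marks."""
    kept = select_body_paragraphs(paragraphs)
    punctuation = sum(text.count("。") + text.count("，") for text in kept)
    return len(kept) > 2 and punctuation > 5
```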
Assume that step three obtains N pieces of valid information in total. $\omega_{ij}$ denotes the connection weight from article j to article i obtained from the existing rule base, $s_j$ denotes the state of neuron j for the article content obtained by the neural network (taking the value +1 or -1), and $V_j$ denotes the net input of neuron j; when $V_j(t) = 0$, the state of the neuron is considered to remain unchanged.
The state of the entire network can be represented by a column vector: $S = [s_1, s_2, \dots, s_N]^T$
The recursive feedback neural network algorithm is used to extract the relevant features of the full data set and scan the entire model, and the resulting model is converted into a training vector model.
Regular feature vector:
$ij(t) = [R_1(t), R_2(t), \dots, R_i(t)]$
Neural network feature training vector:
$ji(t) = [C_1(t), C_2(t), \dots, C_i(t)]$
The features of the test set are matched with a confusion matrix to obtain the state variables, and the matrix is unrolled into a sequence of length $ij \cdot ji = N^2$ that is fed into the neural network. That is:
$\omega_{ij} = \omega_{ji}, \quad \omega_{ii} = 0$
The connection weight matrix W of the network is therefore an N×N symmetric matrix with a zero diagonal.
Neurons are selected at random and updated serially. If the network starts from the initial state $S(0)$ at time t = 0 and there exists a finite time t after which the state of the network no longer changes, that is, $S(t + \Delta t) = S(t)$ for $\Delta t > 0$, the network is stable.
Since $s_i$ and $s_j$ can only be +1 or -1, and $\omega_{ij}$ and $\theta_i$ are both bounded, the energy E is also bounded, that is: $|E| \le \frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N}|\omega_{ij}| + \sum_{i=1}^{N}|\theta_i|$
When $\Delta E = 0$, then:
(1) if […], completely consistent data is obtained;
(2) if […], fuzzily consistent data is obtained;
(3) if […], completely mismatched data is obtained.
The output neuron computed by the network takes the values 0, -1 and 1, which correspond to "fuzzily consistent", "completely mismatched" and "completely consistent" respectively.
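For readers unfamiliar with discrete Hopfield networks, the serial update and the energy function can be sketched as below. This is the textbook model, not the specific network of this embodiment: the energy is assumed to take the standard form $E = -\frac{1}{2}\sum_i\sum_j \omega_{ij} s_i s_j + \sum_i \theta_i s_i$, and the function names, the sweep limit and the seed are illustrative choices.

```python
import random

def energy(W, s, theta):
    """Standard discrete Hopfield energy for state s, weights W, thresholds theta."""
    n = len(s)
    pairwise = sum(W[i][j] * s[i] * s[j] for i in range(n) for j in range(n))
    return -0.5 * pairwise + sum(theta[i] * s[i] for i in range(n))

def run_hopfield(W, s, theta, max_sweeps=100, seed=0):
    """Update neurons one at a time in random order until the state is stable."""
    rng = random.Random(seed)
    s = list(s)
    n = len(s)
    for _ in range(max_sweeps):
        changed = False
        order = list(range(n))
        rng.shuffle(order)
        for i in order:
            net = sum(W[i][j] * s[j] for j in range(n)) - theta[i]
            if net == 0:
                continue  # net input 0: the state stays unchanged
            new_state = 1 if net > 0 else -1
            if new_state != s[i]:
                s[i] = new_state
                changed = True
        if not changed:
            break  # stable: S(t + dt) == S(t)
    return s
```

With a symmetric weight matrix and a zero diagonal, each serial update can only lower the energy or leave it unchanged, which is why the loop terminates in a stable state.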
The output value generated by the recursive feedback neural network algorithm is then judged to be "completely consistent" with, "fuzzily consistent" with, or "unable to match" the features in the information briefing data set.
Further, a repetition penalty mechanism may be set so that subsequently acquired information does not yield invalid content. If the result is completely consistent, the corresponding features of Ri are taken; if it is fuzzily consistent, the Ri features are partially trusted; and the data are tuned through automated parameter tuning.
Understandably, after the briefing is generated, the information briefing data set may be randomly split into a training set and a test set at a ratio of 7:3. By generating training sets repeatedly and training over multiple cycles, quantitative models can be trained for the multi-mode channel configuration feature set, tag feature set, information title feature set, information body feature set, release time feature set, publishing author feature set and so on. In a specific implementation, the complete logic of briefing generation is shown in Figure 3.
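The 7:3 random split mentioned above can be sketched in a few lines. The function name and the optional seed are assumptions; the source specifies only the ratio and that the split is random.

```python
import random

def split_dataset(keys, train_ratio=0.7, seed=None):
    """Shuffle the keys and split them into (training set, test set)."""
    keys = list(keys)
    random.Random(seed).shuffle(keys)
    cut = int(round(len(keys) * train_ratio))
    return keys[:cut], keys[cut:]
```

Calling it repeatedly with different seeds would give the multiple training sets used across training cycles.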
Understandably, the information briefing data set Ri extracted according to the a priori feature extraction rules is used as the input parameter and the matching feature Rj as the reference object; the feature classes Ci learned by the recursive feedback neural network are used to learn the target value as the output parameter Cj. The feature target values Rj and Cj of Ri and Ci are then compared, and if they are consistent, the Cj features are fed back into the Rj configuration library.
Based on a general configuration library and a tag library, this embodiment provides a new, adaptive way of collecting information. Using the tags the user is interested in, together with the valid keyword feature configuration library and the sensitive word configuration library, it obtains valid web pages and content, shortens the delivery time of scientific research briefing information, and improves the efficiency of briefing work. It thereby avoids the large time cost incurred when briefing information is screened and written by hand.
Referring to Figure 4, Figure 4 is a schematic flowchart of a second embodiment of a briefing generation method of the present invention.
Based on the first embodiment above, step S30 of the briefing generation method of this embodiment includes:
Step S31: Use the information title of each piece of reference information as the information identifier, and convert the information title, information body, information author and information time into an information ID.
Understandably, the information ID may be the number of the reference information.
Step S32: Take as feature target values the information title configuration features, information body configuration features, information release time configuration features and information author configuration features for which each piece of reference information successfully matches the information feature set in the preset multi-mode channel feature configuration library.
Understandably, the preset multi-mode channel feature configuration library contains a feature library. Features that match successfully are used as feature target values, and these target values also become part of the features of the reference information, which enriches the features in the information briefing data set.
In a specific implementation, the information title configuration features, information body configuration features, information release time configuration features and information author configuration features of the valid content obtained through page feature parsing are used as feature target values. They are stored as supplementary values, that is, as newly added content, under the unique key (the information title) that already exists in the information briefing data set, thereby extending the data set.
Further, all data in the information briefing data set are normalized. The features in the data set are converted into the numerical values of a feature matrix, invalid or redundant features are excluded, and the useful features are selected as training data for the model. The recursive feedback neural network is trained on them, so that the features in the preset multi-mode channel feature configuration library are expanded automatically.
Step S33: Generate an information catalog from the information identifier, the feature target values and the information ID, and form the information briefing data set from multiple information catalogs.
It should be noted that generating an information catalog from the information identifier, the feature target values and the information ID, and forming the information briefing data set from multiple information catalogs, includes:
generating an information catalog from the information identifier, the feature target values and the information ID;
determining multiple keywords from the information body and information title of the reference information, and generating information tags from the keywords;
In a specific implementation, the information body and information title in the information briefing data set obtained in step five are combined, and the TextRank algorithm is used to determine 5 to 10 keywords that stably describe the meaning of the information. The corresponding information tags are then generated, that is, scattered words and short phrases that help in understanding the content. The information tags obtained in this way are stored as supplementary values under the unique key (the information title) that already exists in the information briefing data set obtained in step five, thereby extending the data set.
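To make the TextRank step concrete, here is a compact keyword ranker over an already segmented token list. It is a simplified sketch: the co-occurrence window, damping factor and iteration count are conventional defaults rather than values from the source, and stop-word and part-of-speech filtering are omitted.

```python
from collections import defaultdict

def textrank_keywords(tokens, top_k=5, window=3, damping=0.85, iterations=50):
    """Rank words by PageRank on an undirected co-occurrence graph."""
    neighbors = defaultdict(set)
    for i, word in enumerate(tokens):
        # Link each word to the words that follow it inside the window.
        for other in tokens[i + 1:i + window]:
            if other != word:
                neighbors[word].add(other)
                neighbors[other].add(word)
    scores = {word: 1.0 for word in neighbors}
    for _ in range(iterations):
        scores = {
            word: (1 - damping) + damping * sum(
                scores[n] / len(neighbors[n]) for n in neighbors[word])
            for word in neighbors
        }
    # Sort by score (descending), breaking ties alphabetically.
    ranked = sorted(scores, key=lambda word: (-scores[word], word))
    return ranked[:top_k]
```

Setting `top_k` between 5 and 10 matches the number of keywords named in the text.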
constructing adjacent word groups from the information body and information title of the reference information, and combining the adjacent word groups to obtain keyword phrases;
In a specific implementation, the information body and information title in the information briefing data set of step five are combined, and an algorithm constructs adjacent word groups that describe the information and combines them into keyword phrases. Unlike step six, this step focuses on extracting phrases of a certain length rather than scattered words.
The keyword phrases obtained in this way are stored as supplementary values under the unique key (the information title) that already exists in the information briefing data set obtained in step five, thereby extending the data set.
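The source does not name the phrase-construction algorithm. One simple possibility, shown here purely as an assumption, is to merge runs of adjacent keywords in the token sequence into a single phrase. The function name and the `joiner` parameter are hypothetical.

```python
def keyword_phrases(tokens, keywords, min_words=2, joiner=" "):
    """Merge runs of adjacent keywords into phrases of at least min_words words."""
    phrases = []
    current = []
    # The trailing None flushes a run that ends at the last token.
    for token in list(tokens) + [None]:
        if token is not None and token in keywords:
            current.append(token)
            continue
        if len(current) >= min_words:
            phrase = joiner.join(current)
            if phrase not in phrases:
                phrases.append(phrase)
        current = []
    return phrases
```

For Chinese text, `joiner=""` would concatenate the segmented words back into a continuous phrase.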
updating the information catalog with the information tags and the keyword phrases to obtain a reference catalog, and forming the information briefing data set from the reference catalog.
It should be further noted that updating the information catalog with the information tags and the keyword phrases to obtain a reference catalog, and forming the information briefing data set from the reference catalog, further includes:
updating the information catalog with the information tags and the keyword phrases to obtain a reference catalog; generating an article briefing from the information body of the reference information; and updating the reference catalog with the article briefing, the information briefing data set being formed from the updated reference catalog.
In a specific implementation, the GPT-2 model is applied to the information body in the information briefing data set of step five to generate the corresponding article briefing. The resulting article briefing is stored as a supplementary value under the unique key (the information title) that already exists in the information briefing data set obtained in step five, thereby extending the data set.
This embodiment generates scientific research briefings with a feedback neural network based on fused multi-feature tags. It combines large-scale pre-training with diverse knowledge and, through continual learning, keeps absorbing new knowledge about vocabulary, structure and semantics from massive text data, so that the model keeps improving. In actual training, the model can learn and refine itself without any change to the training data, which quickly improves its performance. In addition, expanding the features in the information briefing data set makes it possible to match the content of the reference information and benefits subsequent model training, so the content needed for the briefing is found more quickly and accurately.
Referring to Figure 5, Figure 5 is a schematic flowchart of a second embodiment of a briefing generation method of the present invention.
Based on the first embodiment above, step S40 of the briefing generation method of this embodiment includes:
Step S41: Generate the header, based on the preset feedback neural network, from the corresponding information configuration feature tags and the configured scientific research information unit in the information briefing data set.
It should be noted that the collected information texts are classified, so that users can find the information they need through the tag algorithm.
The classified information is then reassembled. By clicking the "Combine" tag, the user sees the newly composed document, which makes it quick to read the articles and obtain information.
The scientific research information documents to be generated are added to the material library, a briefing template is selected, and the docx algorithm generates a scientific research briefing in Word format with one click. On the configured schedule, the information collected for the tags the user is interested in is automatically turned into information titles and information content by the title algorithm.
A briefing mainly consists of the header, report core, catalog, information body and trailer. The default typeface is black FangSong (仿宋), 12 pt.
In a specific implementation, the header is generated according to the following rules:
Briefing name: generated from the content of the feature tags configured for scientific research information.
Issue number: generated along the starting day, week or month dimension.
Date: generated from the date of the current information.
Unit: generated from the scientific research information unit configured by the user.
An example of a generated briefing header: 紫金院-ICT科研区块链资讯-第1期-2022年08月04日 (Zijin Academy - ICT Scientific Research Blockchain Information - Issue 1 - August 4, 2022).
The header uses red FangSong 14 pt type, and a separator divides the header from the report core.
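The header rules above amount to joining four configured fields. A minimal sketch that reproduces the example header is given below; the function name and parameter names are hypothetical, and the hyphen-separated layout is inferred from the single example in the text.

```python
from datetime import date

def build_header(unit, briefing_name, issue, day):
    """Join unit, briefing name, issue number and date into one header line."""
    date_text = f"{day.year}年{day.month:02d}月{day.day:02d}日"
    return f"{unit}-{briefing_name}-第{issue}期-{date_text}"

print(build_header("紫金院", "ICT科研区块链资讯", 1, date(2022, 8, 4)))
```

The date is assembled by hand rather than with `strftime` so the Chinese characters do not depend on platform locale handling.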
Step S42: Generate the report core from the corresponding scientific research information configuration in the information briefing data set and from the article briefings, keywords and information catalog in the information briefing data set.
It should be noted that generating the report core from the corresponding scientific research information configuration in the information briefing data set and from the article briefings, keywords and information catalog in the information briefing data set includes:
performing feature extraction according to the corresponding scientific research information configuration in the information briefing data set to obtain a valid catalog; generating multiple briefing notes from the reference information in the information briefing data set and combining them into a target note; computing word frequencies for the keywords in the information briefing data set, clustering the keywords into word groups, and obtaining the core keyword information of the briefing from the word groups; generating article titles from the keyword phrases in the information briefing data set and article leads from the article briefings in the information briefing data set; obtaining the article source and article author information from the information briefing data set; and generating the report core from the valid catalog, the target note, the core keyword information of the briefing, the article leads, the article source and the article author information.
In a specific implementation, the report core is generated according to the following rules:
目录生成:根据科研资讯配置的特征标签循环提取步骤五有效资讯目录生成方法生成有效目录生成:Catalog generation: cyclic extraction based on the feature tags configured for scientific research information. Step 5: Effective information catalog generation method generates effective catalog generation:
按语:将目录生成的所有有效资讯依据训练后的模型采取步骤八文章简报生成方法生成多条简报按语,依据训练后的模型生成再组合多条简报按照步骤八生成一条按语。Notes: Use the trained model to generate multiple briefing notes based on all the valid information generated by the catalog using the article briefing generation method in Step 8. Generate multiple briefing notes based on the trained model and then combine multiple briefing notes to generate one note according to Step 8.
词群:通过对资讯目录新闻关键词抽取、词频统计,对多个关键词进行聚类形成词群。用以描述当前简报核心关键词信息。Word groups: By extracting news keywords from the information catalog and counting word frequencies, multiple keywords are clustered to form word groups. Used to describe the core keyword information of the current briefing.
资讯正文:Information text:
标题:循环提取步骤五有效资讯目录,依据训练后的模型生成步骤七关键字短语生成文章标题。Title: Loop to extract the effective information directory in step 5, and generate the article title based on the trained model to generate keyword phrases in step 7.
导语:循环提取步骤五有效资讯目录,依据训练后的模型生成步骤八文章简报生成文章导语。Introduction: Loop to extract the effective information directory in step 5, and generate the article introduction in step 8 based on the trained model.
文章来源:循环提取步骤五有效资讯目录,依据训练后的模型填写根据步骤三:页面特征解析提取的文章来源,文章作者信息,以及作者信息。Article source: Loop to extract the effective information directory in step five, and fill in the article source, article author information, and author information extracted according to step three: page feature analysis based on the trained model.
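The word-group rule above can be illustrated with a minimal sketch. The description does not name a clustering algorithm, so the grouping below is an assumption: keywords that reach a minimum document frequency are linked whenever they occur in the same article, and each connected set of keywords is treated as one word group. The function name `build_word_groups` and the `min_count` parameter are hypothetical.

```python
from collections import Counter
from itertools import combinations


def build_word_groups(article_keywords, min_count=2):
    """Count keyword frequency and group co-occurring keywords.

    article_keywords: list of keyword lists, one list per article.
    Assumption: "clustering" is approximated by connected components
    of the keyword co-occurrence graph (not specified in the source).
    """
    # Document frequency: in how many articles each keyword appears.
    freq = Counter(kw for kws in article_keywords for kw in set(kws))
    kept = {kw for kw, n in freq.items() if n >= min_count}

    # Union-find over the retained keywords.
    parent = {kw: kw for kw in kept}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Link every pair of retained keywords that share an article.
    for kws in article_keywords:
        present = sorted(set(kws) & kept)
        for a, b in combinations(present, 2):
            parent[find(a)] = find(b)

    groups = {}
    for kw in kept:
        groups.setdefault(find(kw), []).append(kw)

    # Order keywords by descending frequency, then alphabetically.
    result = [sorted(g, key=lambda k: (-freq[k], k)) for g in groups.values()]
    result.sort(key=lambda g: (-sum(freq[k] for k in g), g))
    return freq, result
```

Under these assumptions, a keyword that appears in only one article is dropped before grouping, so one-off terms do not appear in the word groups of the briefing.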
The trailer is set in red 12-point FangSong (仿宋) type, and a separator is used to divide the report core from the trailer.
Step S43: generate the trailer according to the corresponding configured occurrence range of the scientific research information in the information briefing data set.
Step S44: generate the briefing from the header, the report core, and the trailer.
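As a rough illustration of step S44, the sketch below assembles a plain-text briefing from a header, a report core (note, word groups, contents, and article bodies), and a trailer. The class names, field names, and the separator string are hypothetical; the description only states that the three parts are combined and that a separator divides the report core from the trailer.

```python
from dataclasses import dataclass

SEPARATOR = "-" * 40  # assumed separator; the source does not specify one


@dataclass
class Article:
    title: str
    lead: str
    source: str
    author: str


@dataclass
class Briefing:
    header: str
    note: str
    word_groups: list  # list of keyword lists
    articles: list     # list of Article
    trailer: str

    def render(self):
        """Return the briefing as header, report core, and trailer."""
        lines = [self.header, SEPARATOR]
        # Report core: note, word groups, contents, then article bodies.
        lines.append("Note: " + self.note)
        groups = "; ".join(", ".join(g) for g in self.word_groups)
        lines.append("Keywords: " + groups)
        lines.append("Contents:")
        for i, article in enumerate(self.articles, 1):
            lines.append(f"  {i}. {article.title}")
        for article in self.articles:
            lines += [
                "",
                article.title,
                article.lead,
                f"Source: {article.source} | Author: {article.author}",
            ]
        # Separator between the report core and the trailer.
        lines += [SEPARATOR, self.trailer]
        return "\n".join(lines)
```

Styling such as the red 12-point trailer type would be applied by whatever document format the briefing is exported to; this sketch covers only the ordering of the parts.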
It is worth noting that, for briefing content push, the departments to which information is pushed can be configured on the page, with information organised by day, week, or month. A subsequent program can then push the latest followed information to the corresponding users on a schedule, so that they can obtain it quickly and conveniently.
It should be noted that both routine briefings (flash reports, daily reports, weekly reports, monthly reports, and so on) and thematic briefings can be produced quickly. A thematic briefing takes a large amount of thematic information as its material and uses the intelligence analysis method module to present, in visual form, the overview, development, impact, and effects of a thematic field. Large-scale data, charts, animations, and other multimedia data can be turned into a special briefing quickly through the briefing production function.
In this embodiment, developments in scientific and technological innovation within the group and at external companies and institutions in the industry are collected through public channels, and information items and briefings are generated automatically. In addition, information and data obtained from public and internal channels are imported to support benchmark comparison analysis in the field of scientific and technological innovation, with results presented as tables and charts. This reduces the work of collecting and writing briefing information and avoids the substantial time cost of manual screening and manual writing.
Referring to Figure 6, Figure 6 is a structural block diagram of a first embodiment of the briefing generation device of the present invention.
As shown in Figure 6, the briefing generation device proposed by the embodiment of the present invention includes:
an information acquisition module, configured to filter multiple pieces of reference information out of network-wide information content according to a preset multi-mode channel feature configuration library;
the information acquisition module being further configured to perform feature extraction on the multiple pieces of reference information to obtain reference features of each piece of reference information, the reference features including an information title, an information body, an information author, and an information release time;
a briefing generation module, configured to perform feature expansion on the reference features to obtain an information briefing data set;
the briefing generation module being further configured to process the information briefing data set based on a preset feedback neural network to obtain a header, a report core, and a trailer, and to generate a briefing from the header, the report core, and the trailer.
Based on a general configuration library and a tag library, this embodiment provides a new, adaptive way of collecting information: valid web pages and content are obtained through the tags the user is interested in, using a valid-keyword configuration feature library and a sensitive-word configuration library. This shortens the time needed to deliver scientific research briefing information, improves the efficiency of briefing work, and avoids the substantial time cost that arises when briefing information has to be screened and written by hand.
In one embodiment, the information acquisition module 10 is further configured to filter multiple pieces of reference information out of network-wide information content according to the preset multi-mode channel feature configuration library;
perform feature extraction on the multiple pieces of reference information to obtain reference features of each piece of reference information, the reference features including an information title, an information body, an information author, and an information release time;
perform feature expansion on the reference features to obtain an information briefing data set;
process the information briefing data set based on the preset feedback neural network to obtain a header, a report core, and a trailer, and generate a briefing from the header, the report core, and the trailer.
In one embodiment, the information acquisition module 10 is further configured to search according to the monitored information channel feature set in the preset multi-mode channel feature configuration library to obtain web page content;
extract content from the web page content to obtain initial content features;
match the initial content features against the information feature set in the preset multi-mode channel feature configuration library using the preset feedback neural network;
have the preset feedback neural network learn from the successfully matched web page content to obtain new features, and update the information feature set in the preset multi-mode channel feature configuration library with the new features;
use the successfully matched web page content as reference information.
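The match-then-update loop described above can be sketched as follows. This is not the preset feedback neural network of the embodiment: as a stand-in, matching is reduced to counting shared terms between a page and the feature set, and "learning" is reduced to adding the most frequent unseen terms of each matched page to the feature set. The function name and the `min_hits` and `max_new` parameters are hypothetical.

```python
import re
from collections import Counter


def select_and_learn(pages, feature_set, min_hits=2, max_new=2):
    """Select matching pages and grow the feature set from them.

    Assumption: a page matches when it shares at least `min_hits`
    terms with the current feature set (a simple stand-in for the
    neural matching step described in the text).
    """
    features = set(feature_set)
    references = []
    for page in pages:
        # Crude content features: lowercase words of 4+ letters.
        tokens = re.findall(r"[a-z]{4,}", page.lower())
        if len(set(tokens) & features) < min_hits:
            continue
        references.append(page)
        # Feedback step: promote the page's most frequent new terms.
        counts = Counter(t for t in tokens if t not in features)
        ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
        features.update(term for term, _ in ranked[:max_new])
    return references, features
```

Because the feature set is updated inside the loop, a later page can match on terms learned from an earlier one, which is the adaptive behaviour the embodiment is aiming for.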
In one embodiment, the information acquisition module 10 is further configured to calculate a title weight for each piece of reference information according to the information title features in the preset multi-mode channel feature configuration library, and to perform feature extraction according to the title weight to obtain the information title;
calculate the paragraph link density and text density of the body of each piece of reference information, and perform feature extraction according to the paragraph link density and text density to obtain the information body;
extract the time from the uniform resource locator (URL) of each piece of reference information using a regular expression to obtain an initial time feature, and format the initial time feature to obtain the information release time;
match the publishing author features in the preset multi-mode channel feature configuration library against the reference information to obtain the information author.
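Two of the extraction steps above lend themselves to a short sketch: deciding whether a paragraph belongs to the body from its link density and text density, and pulling a release date out of a URL with a regular expression. The formulas, thresholds, and the date pattern are assumptions chosen for illustration; the description does not define them.

```python
import re
from datetime import date

# Assumed URL date pattern: YYYY, MM, DD with optional -, / or _ between.
_URL_DATE = re.compile(
    r"(20\d{2})[-/_]?(0[1-9]|1[0-2])[-/_]?(0[1-9]|[12]\d|3[01])"
)


def link_density(text, link_text):
    """Share of a paragraph's visible text that sits inside links."""
    return len(link_text) / len(text) if text else 1.0


def text_density(text, raw_html):
    """Share of a paragraph's markup that is visible text."""
    return len(text) / len(raw_html) if raw_html else 0.0


def is_body_paragraph(text, link_text, raw_html, max_link=0.3, min_text=0.5):
    """Keep paragraphs with few links and a high text-to-markup ratio."""
    return (
        link_density(text, link_text) <= max_link
        and text_density(text, raw_html) >= min_text
    )


def publish_date_from_url(url):
    """Return an ISO date string found in the URL, or None."""
    match = _URL_DATE.search(url)
    if not match:
        return None
    year, month, day = map(int, match.groups())
    try:
        # Formatting step: normalise to YYYY-MM-DD.
        return date(year, month, day).isoformat()
    except ValueError:
        return None  # e.g. 2023-02-31
```

Navigation bars and related-link blocks typically have a link density close to 1, which is why a low threshold separates them from running text in this sketch.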
In one embodiment, the information acquisition module 10 is further configured to match the information body and information title of each piece of reference information against the monitoring tag feature set in the preset multi-mode channel feature configuration library, and determine from the matching result whether the reference information is relevant;
match the information body and information title of each piece of reference information against the sensitive-word tag feature set in the preset multi-mode channel feature configuration library, and determine from the matching result whether the reference information is invalid;
extract text keywords from the information body and information title of the reference information, and determine from the text keywords whether the reference information meets the quality requirements;
remove the irrelevant reference information, the invalid reference information, and the reference information that does not meet the quality requirements, to obtain the multiple pieces of reference information.
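The three filters above (relevance, sensitive words, quality) can be chained as in the following sketch. The matching is plain substring search and the quality test is a hypothetical rule, namely that an item must yield at least `min_keywords` distinct keywords; the description does not state how quality is judged, so both are assumptions.

```python
import re

STOPWORDS = {"with", "that", "this", "from", "have", "were"}  # illustrative


def extract_keywords(text):
    """Very rough keyword extraction: distinct words of 4+ letters."""
    return set(re.findall(r"[a-z]{4,}", text.lower())) - STOPWORDS


def filter_references(items, monitor_tags, sensitive_words, min_keywords=3):
    """Drop irrelevant, invalid, and low-quality reference items.

    items: list of dicts with "title" and "body" keys (assumed shape).
    """
    kept = []
    for item in items:
        text = (item["title"] + " " + item["body"]).lower()
        # 1. Relevance: at least one monitoring tag must appear.
        if not any(tag.lower() in text for tag in monitor_tags):
            continue
        # 2. Validity: no sensitive word may appear.
        if any(word.lower() in text for word in sensitive_words):
            continue
        # 3. Quality: enough distinct keywords (assumed criterion).
        if len(extract_keywords(text)) < min_keywords:
            continue
        kept.append(item)
    return kept
```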
In one embodiment, the briefing generation module 20 is further configured to use the information title of each piece of reference information as an information identifier, and convert the information title, information body, information author, and information time into an information ID;
take, as feature target values, the information title configuration feature, the information body configuration feature, the information release time configuration feature, and the information author configuration feature for which each piece of reference information was successfully matched against the information feature set in the preset multi-mode channel feature configuration library;
generate an information catalog from the information identifier, the feature target values, and the information ID, and form the information briefing data set from multiple information catalogs.
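One way to picture the catalog entry described above is the sketch below. The description says only that the title, body, author, and time are "converted" into an information ID; hashing them with SHA-256 is an assumption made here so that the same item always maps to the same ID. The function names and the dictionary layout are hypothetical.

```python
import hashlib


def make_info_id(title, body, author, published):
    """Derive a stable ID from the four reference features.

    Assumption: the ID is a truncated SHA-256 digest; the source does
    not specify the conversion.
    """
    # A unit separator keeps ("ab", "c") distinct from ("a", "bc").
    payload = "\x1f".join([title, body, author, published]).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:16]


def make_catalog_entry(ref, matched_features):
    """Build one information catalog entry.

    ref: dict with "title", "body", "author", "published" keys.
    matched_features: configuration features that matched this item,
    stored as the feature target values.
    """
    return {
        "identifier": ref["title"],
        "info_id": make_info_id(
            ref["title"], ref["body"], ref["author"], ref["published"]
        ),
        "targets": dict(matched_features),
    }
```

A list of such entries would then play the role of the information briefing data set in the later steps.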
In one embodiment, the briefing generation module 20 is further configured to generate an information catalog from the information identifier, the feature target values, and the information ID;
determine multiple keywords from the information body and information title of the reference information, and generate information tags from the keywords;
construct adjacent word pairs from the information body and information title of the reference information, and combine the adjacent word pairs to obtain keyword phrases;
update the information catalog according to the information tags and the keyword phrases to obtain a reference catalog, and form the information briefing data set from the reference catalog.
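The keyword-phrase step above can be approximated with adjacent word pairs (bigrams). In this minimal sketch, a pair counts as a candidate phrase when neither word is a stopword, and the most frequent pairs are returned; the stopword list, the ranking rule, and the function name are assumptions rather than the method of the embodiment.

```python
import re
from collections import Counter

STOPWORDS = {"the", "a", "an", "of", "and", "in", "to", "for", "is"}


def keyword_phrases(text, top_n=3):
    """Return the most frequent adjacent word pairs as phrases."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    # Adjacent pairs where neither word is a stopword.
    pairs = Counter(
        (a, b)
        for a, b in zip(tokens, tokens[1:])
        if a not in STOPWORDS and b not in STOPWORDS
    )
    # Rank by descending count, then alphabetically for stable output.
    ranked = sorted(pairs.items(), key=lambda kv: (-kv[1], kv[0]))
    return [" ".join(pair) for pair, _ in ranked[:top_n]]
```

For Chinese text, the tokenisation step would need a word segmenter instead of the regular expression used here.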
In one embodiment, the briefing generation module 20 is further configured to update the information catalog according to the information tags and the keyword phrases to obtain a reference catalog;
generate an article summary from the information body of the reference information;
update the reference catalog according to the article summary, and form the information briefing data set from the updated reference catalog.
In one embodiment, the briefing generation module 20 is further configured to generate the header, based on the preset feedback neural network, from the corresponding information configuration feature tags and the configured scientific research information unit in the information briefing data set;
generate the report core from the corresponding scientific research information configuration in the information briefing data set and from the article summaries, keywords, and information catalog in the information briefing data set;
generate the trailer according to the corresponding configured occurrence range of the scientific research information in the information briefing data set;
generate the briefing from the header, the report core, and the trailer.
In one embodiment, the briefing generation module 20 is further configured to perform feature extraction according to the corresponding scientific research information configuration in the information briefing data set to obtain a valid catalog;
generate multiple briefing notes from the reference information in the information briefing data set, and combine the multiple briefing notes to obtain a target note;
compute word frequency statistics over the keywords in the information briefing data set, cluster the keywords into word groups, and obtain the core keyword information of the briefing from the word groups;
generate an article title from the keyword phrases in the information briefing data set, and generate an article lead from the article summary in the information briefing data set;
obtain the article source and the article author information from the information briefing data set;
generate the report core from the valid catalog, the target note, the core keyword information of the briefing, the article lead, the article source, and the article author information.
It should be understood that the above is merely illustrative and does not limit the technical solution of the present invention in any way. In specific applications, those skilled in the art can make settings as needed, and the present invention places no limitation on this.
It should be noted that the workflow described above is merely illustrative and does not limit the scope of protection of the present invention. In practical applications, those skilled in the art can select some or all of it according to actual needs to achieve the purpose of the solution of this embodiment, and no limitation is imposed here.
In addition, it should be noted that, as used herein, the terms "include", "comprise", and any other variants thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or system that includes a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or system. Without further limitation, an element defined by the phrase "includes a ..." does not exclude the presence of additional identical elements in the process, method, article, or system that includes that element.
The serial numbers of the above embodiments of the present invention are for description only and do not indicate the relative merits of the embodiments.
From the above description of the embodiments, those skilled in the art can clearly understand that the methods of the above embodiments can be implemented by software plus a necessary general-purpose hardware platform, or by hardware, although in many cases the former is the better implementation. Based on this understanding, the technical solution of the present invention, in essence or in the part that contributes to the prior art, can be embodied in the form of a software product. The computer software product is stored in a storage medium (such as a read-only memory (Read Only Memory, ROM)/RAM, a magnetic disk, or an optical disc) and includes several instructions that cause a terminal device (which may be a mobile phone, a computer, a server, a network device, or the like) to execute the methods described in the embodiments of the present invention.
It should be understood that, although the steps in the flowcharts of the embodiments of the present application are shown in sequence as indicated by the arrows, these steps are not necessarily executed in that sequence. Unless explicitly stated herein, there is no strict restriction on the order in which these steps are executed, and they may be executed in other orders. Moreover, at least some of the steps in the figures may include multiple sub-steps or stages. These sub-steps or stages are not necessarily completed at the same moment but may be executed at different times, and they are not necessarily executed one after another but may be executed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.
The above are only preferred embodiments of the present invention and do not thereby limit the patent scope of the present invention. Any equivalent structure or equivalent process transformation made using the contents of the description and drawings of the present invention, or any direct or indirect application in other related technical fields, is likewise included within the scope of patent protection of the present invention.
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202311073852.9A, published as CN117407519A (en) | 2023-08-23 | 2023-08-23 | Presentation generation method and device |
| Publication Number | Publication Date |
|---|---|
| CN117407519A (en) | 2024-01-16 |
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202311073852.9A (Pending), published as CN117407519A (en) | Presentation generation method and device | 2023-08-23 | 2023-08-23 |
| Country | Link |
|---|---|
| CN (1) | CN117407519A (en) |
| Date | Code | Title | Description |
|---|---|---|---|
| PB01 | Publication | ||
| SE01 | Entry into force of request for substantive examination | ||