CN110457579A - Web page denoising method and system based on template and classifier working together - Google Patents

Web page denoising method and system based on template and classifier working together

Info

Publication number
CN110457579A
Authority
CN
China
Prior art keywords
node set
node
classifier
template
web page
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910694087.XA
Other languages
Chinese (zh)
Other versions
CN110457579B (en)
Inventor
王运锋
严金承
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sichuan University
Original Assignee
Sichuan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sichuan University
Priority to CN201910694087.XA
Publication of CN110457579A
Application granted
Publication of CN110457579B
Active
Anticipated expiration


Abstract

Translated from Chinese

The invention discloses a web page denoising method and system in which a template and a classifier work together. The denoising method comprises: parsing the acquired original HTML document, deleting irrelevant tag nodes, and generating a simplified DOM tree that meets the requirements; computing the features of each block-level node in the DOM tree of the target web page to obtain the original node set; adding the original node set to the cache node set of the corresponding website and, when the number of elements in the cache node set reaches a preset threshold, triggering the template generation algorithm to update the template node set of that website; filtering the original node set of the target web page with the template node set of the website to which the page belongs to obtain the filtered target web page node set; and classifying the filtered node set with a trained classifier, retaining the nodes classified as main content, and extracting the main content text from them. The invention requires little manual intervention, is efficient, and is suitable for denoising all kinds of topic-type web pages.

Description

Translated from Chinese
Web page denoising method and system based on template and classifier working together

Technical Field

The invention relates to the technical field of web page denoising, and in particular to a web page denoising method and system in which a template and a classifier work together.

Background

With the continuous development of Internet technology, the amount of information on the Internet keeps growing explosively. The vast body of web page information is the main embodiment of Internet information and a natural data mine for many other research fields, including search engines, public opinion analysis, and natural language processing. Besides their main content, however, web pages also carry commercial advertisements, navigation bars, copyright notices, announcements, and other information unrelated to the main content. This information can be called web page noise. Removing the noise from web pages and extracting the main content for use in the fields above has significant research and practical value.

At present, the main web page denoising methods are rule-based, template-based, and visual-content-based. Rule-based methods define heuristic rules in advance and select the text that satisfies them. They only suit certain simple pages, and structurally complex pages require complex heuristic rules, which limits the approach. Template-based methods denoise quickly but usually require manually building a template for the pages of a specific website, so they cannot serve as a general-purpose web page denoiser. In their 2010 paper "Research on a Literature Information Extraction Method Based on HTML Trees and Templates" (《基于HTML树和模板的文献信息提取方法研究》), Li Liwen et al. used web page similarity computation to group pages into classes and built a template for each class. The template relies on the position of the main content: when the main content is spread across several DOM (Document Object Model) nodes, the nearest parent node containing all of that content is chosen as the template, so the extracted main information may contain a large amount of noise, which considerably degrades the denoising result. Visual-content-based methods first divide the page into blocks, then use manual annotation together with neural networks and support vector machines to predict the importance of each block, and finally select the most important block. This approach is computationally expensive and inefficient.

Summary of the Invention

The technical problem to be solved by the invention is to provide a web page denoising method and system in which a template and a classifier work together. A denoising template is generated automatically and used for preprocessing, the classifier then judges each remaining DOM node, and finally the main content is extracted. The invention requires little manual intervention, is efficient, and is suitable for denoising all kinds of topic-type web pages.

To solve the above technical problem, the invention adopts the following technical solution:

A web page denoising method based on a template and a classifier working together, comprising the following steps:

Step 1: Download the target web page and obtain the original HTML document;

Step 2: Parse the original HTML document, delete irrelevant tag nodes, repair the DOM tree, and generate a simplified DOM tree that meets the requirements;

Step 3: Compute the features of each block-level node in the DOM tree of the target web page to obtain the original node set of that page;

Step 4: Generate the template: add the original node set to the cache node set of the corresponding website and, when the number of elements in the cache node set reaches a preset threshold, trigger the template generation algorithm to update the template node set of that website;

Step 5: Filter the original node set of the target web page with the template node set of the website to which the page belongs, and output the filtered target web page node set;

Step 6: Train the classifier: label a number of nodes in advance as either noise or main content, and train the classifier on the labeled nodes until it reaches the predetermined classification performance;

Step 7: Classify the filtered target web page node set with the trained classifier, retain the nodes classified as main content, and extract the main content text from them.

Further, step 1 specifically comprises web page download and web page discovery. Web page download is responsible for downloading target web pages and storing them in a database grouped by the domain name of each page; web page discovery is responsible for finding new web page addresses that meet the requirements and adding them to the to-be-crawled list.
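The per-domain grouping described here can be sketched as follows. This is only an illustration under stated assumptions: the function name `group_by_domain`, the in-memory dictionary standing in for the database, and the example URLs are all hypothetical and not part of the patent.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def group_by_domain(urls):
    """Group page URLs by host name (a stand-in for per-domain database storage)."""
    pages_by_domain = defaultdict(list)
    for url in urls:
        # urlsplit(...).hostname is lower-cased and excludes any port number.
        host = urlsplit(url).hostname or ""
        pages_by_domain[host].append(url)
    return dict(pages_by_domain)


# Hypothetical example URLs.
grouped = group_by_domain([
    "https://news.example.com/a.html",
    "https://news.example.com/b.html",
    "https://gov.example.org/notice/1",
])
```

Keying on the host name matters later in the method, because the template node set is maintained per website.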

Further, step 2 specifically comprises preprocessing and correction. Preprocessing deletes tags that carry no text content, including comments, scripts, and styles; correction repairs the correctable errors of the DOM tree, including mismatched "<>" brackets and mismatched tag pairs.

Further, in step 3 the node features include: the ratio of the node's text length to the document's text length; the node's text length; the ratio of the length of punctuation in the node's text to the node's text length; the ratio of the number of link tags in the node to the number of link tags in the document; the ratio of the number of image tags in the node to the number of image tags in the document; the node weight score; the ratio of link characters in the node to the node's text length; and the ratio of the number of link tags plus image tags in the node to the node's text length.

Further, in step 6 the classifier model is a support vector machine (SVM) or a classification and regression tree (CART).

A web page denoising system based on a template and a classifier working together, comprising a web crawler module, an HTML preprocessing module, a DOM tree feature vector calculation module, a template generation module, a template preprocessing module, a classifier training module, and a classifier prediction module;

The web crawler module downloads the target web page and obtains the original HTML document;

The HTML preprocessing module parses the original HTML document, deletes irrelevant tag nodes, repairs the DOM tree, and generates a simplified DOM tree that meets the requirements;

The DOM tree feature vector calculation module computes the features of each block-level node in the DOM tree of the target web page to obtain the original node set of that page;

The template generation module adds the original node set to the cache node set of the corresponding website and, when the number of elements in the cache node set reaches a preset threshold, triggers the template generation algorithm to update the template node set of that website;

The template preprocessing module filters the original node set of the target web page with the template node set of the website to which the page belongs, and outputs the filtered target web page node set;

The classifier training module trains the classifier: a number of nodes are labeled in advance as either noise or main content, and the classifier is trained on the labeled nodes until it reaches the predetermined classification performance;

The classifier prediction module classifies the filtered target web page node set with the trained classifier, retains the nodes classified as main content, and extracts the main content text from them.

Compared with the prior art, the beneficial effects of the invention are as follows. The template and the classifier work together and denoise in two stages, which gives a good denoising result. In the first stage, the noise shared across the pages of the target website is identified automatically and used as a template to filter noise from the target web page. In the second stage, web page denoising is treated as a classification problem, and a classifier selects the main content. The first stage is fast and needs no manual intervention, and because it has already filtered out part of the noise, it greatly reduces the processing load of the second stage. The invention is widely applicable and is a general denoising method for topic-type web pages.

Brief Description of the Drawings

Fig. 1 is a flowchart of an implementation of the method of the invention.

Fig. 2 is a schematic diagram of the system structure of the invention.

Detailed Description

The invention is described in further detail below with reference to the drawings and specific embodiments.

As shown in Fig. 1, the denoising method of the invention comprises the following steps:

1. Obtain the original HTML document with web crawler technology, which comprises web page download and web page discovery. Web page download is responsible for downloading target web pages and storing them in a database grouped by domain name; web page discovery is responsible for finding new web page addresses that meet the requirements and adding them to the to-be-crawled list.

2. Process the original HTML document, which comprises preprocessing and correction. Preprocessing deletes tags that carry no text content, such as comments, scripts, and styles; correction repairs the correctable errors of the DOM tree, such as mismatched "<>" brackets and mismatched tag pairs. After processing, a simplified DOM tree that meets the requirements is output.
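A minimal sketch of this preprocessing step, using only Python's standard-library `html.parser`. The names `Node`, `SimplifiedDomBuilder`, `simplify`, and `all_text` are illustrative, and the repair strategy (close up to the nearest open ancestor with the same tag, ignore stray end tags) is an assumption; the patent does not say how mismatched tag pairs are repaired.

```python
from html.parser import HTMLParser

SKIPPED_TAGS = {"script", "style"}          # tags whose content is dropped
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class Node:
    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []   # child Node objects
        self.text = []       # text fragments directly inside this node


class SimplifiedDomBuilder(HTMLParser):
    """Builds a small tree without scripts, styles, or comments."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.root = Node("#document")
        self.current = self.root
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self.skip_depth += 1
            return
        if self.skip_depth:
            return
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS:
            self.skip_depth = max(0, self.skip_depth - 1)
            return
        if self.skip_depth:
            return
        # Tolerate mismatched pairs: walk up to the nearest open element with
        # this tag and close it; a stray end tag with no match is ignored.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        if not self.skip_depth and data.strip():
            self.current.text.append(data.strip())

    # Comments are dropped because handle_comment is left as the default no-op.


def simplify(html):
    builder = SimplifiedDomBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def all_text(node):
    """Collect text fragments: a node's own text first, then its children's."""
    parts = list(node.text)
    for child in node.children:
        parts.extend(all_text(child))
    return parts


SAMPLE = ("<div><!-- ad --><script>var x = 1;</script><style>p{}</style>"
          "<p>Hello <b>world</p></div><p>tail</p>")
root = simplify(SAMPLE)
```

In `SAMPLE` the `<b>` element is never closed; the builder closes it implicitly when `</p>` arrives, so the tree stays well-formed.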

3. Compute the features of each block-level node in the DOM tree, store them in the node structure, and output the original node set OriginNodes corresponding to the DOM tree. The features are: the ratio of the node's text length to the document's text length; the node's text length; the ratio of the length of punctuation in the node's text to the node's text length; the ratio of the number of link tags in the node to the number of link tags in the document; the ratio of the number of image tags in the node to the number of image tags in the document; the node weight score; the ratio of link characters in the node to the node's text length; and the ratio of the number of link tags plus image tags in the node to the node's text length. When computing these features, the content of child block-level nodes under a block-level node is excluded, and the feature vector of each block-level node is computed bottom-up.
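The eight features listed above could be computed along these lines. This is a sketch under assumptions: `BlockStats`, `ratio`, and `feature_vector` are hypothetical names, the punctuation set is an illustrative choice, lengths are counted in characters, and the node weight score is passed in as a plain number because the text does not define how it is obtained.

```python
import string
from dataclasses import dataclass

# ASCII punctuation plus some common full-width marks (illustrative choice).
PUNCTUATION = set(string.punctuation) | set(",。!?;:、“”‘’()《》")


@dataclass
class BlockStats:
    text: str            # the block's own text, excluding child block-level nodes
    link_text: str       # text found inside link tags of this block
    link_count: int      # number of link tags in this block
    image_count: int     # number of image tags in this block
    weight: float = 0.0  # "node weight score"; not defined in the source text


def ratio(numerator, denominator):
    """Division that returns 0.0 for an empty denominator."""
    return numerator / denominator if denominator else 0.0


def feature_vector(block, doc_text_len, doc_links, doc_images):
    text_len = len(block.text)
    punct_len = sum(1 for ch in block.text if ch in PUNCTUATION)
    return [
        ratio(text_len, doc_text_len),                         # text share of document
        text_len,                                              # text length
        ratio(punct_len, text_len),                            # punctuation density
        ratio(block.link_count, doc_links),                    # link tag share
        ratio(block.image_count, doc_images),                  # image tag share
        block.weight,                                          # node weight score
        ratio(len(block.link_text), text_len),                 # link character density
        ratio(block.link_count + block.image_count, text_len), # tag density
    ]


body = BlockStats(text="Hello, world.", link_text="", link_count=0, image_count=0)
nav = BlockStats(text="Home News Sports", link_text="Home News Sports",
                 link_count=3, image_count=0)
body_vec = feature_vector(body, doc_text_len=26, doc_links=6, doc_images=2)
nav_vec = feature_vector(nav, doc_text_len=26, doc_links=6, doc_images=2)
```

In this toy input the navigation block has a link character density of 1.0 while the body block has 0.0, which is the kind of contrast a classifier can exploit.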

4. Automatically generate the template for the pages of a site. This module maintains a template node set PatternNodes and a cache node set TempNodes for each website. The original node set OriginNodes is added to the cache node set TempNodes of the corresponding site. Once the number of elements in TempNodes exceeds a set threshold, each node in TempNodes is counted. Nodes whose text repeats frequently usually carry noise such as the site's copyright notice or recurring advertisements; these nodes are added to PatternNodes. This set is the template of the website and records the noise shared by its pages.
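A minimal sketch of the cache-and-count idea, assuming nodes are identified by their text. The class name `TemplateStore`, the two threshold values, counting each text at most once per page, and clearing the cache after each template update are all assumptions made for illustration; the patent does not specify them.

```python
from collections import Counter, defaultdict


class TemplateStore:
    def __init__(self, cache_threshold=6, min_repeats=3):
        self.cache_threshold = cache_threshold   # hypothetical cache size trigger
        self.min_repeats = min_repeats           # hypothetical "high frequency" cutoff
        self.temp_nodes = defaultdict(list)      # site -> cached node texts (TempNodes)
        self.pattern_nodes = defaultdict(set)    # site -> template texts (PatternNodes)

    def add_page(self, site, origin_nodes):
        # Count each text once per page so repetition inside one page
        # is not mistaken for site-wide noise.
        self.temp_nodes[site].extend(set(origin_nodes))
        if len(self.temp_nodes[site]) >= self.cache_threshold:
            counts = Counter(self.temp_nodes[site])
            self.pattern_nodes[site].update(
                text for text, n in counts.items() if n >= self.min_repeats
            )
            self.temp_nodes[site].clear()        # assumption: start a fresh cache
        return self.pattern_nodes[site]


store = TemplateStore(cache_threshold=6, min_repeats=3)
site = "news.example.com"                        # hypothetical site key
store.add_page(site, ["© Example News", "Story one text"])
store.add_page(site, ["© Example News", "Story two text"])
template = store.add_page(site, ["© Example News", "Story three text"])
```

After the third page the cache holds six entries, the copyright line has been seen three times, and it becomes the only member of the template.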

5. Filter out part of the noise in the target web page with the template node set PatternNodes, and output the filtered target web page node set PreNodes, where PreNodes = OriginNodes - PatternNodes.
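The set difference PreNodes = OriginNodes - PatternNodes can be written as an order-preserving filter. In this sketch nodes are represented by their text, which is an assumption, and `filter_with_template` is a hypothetical name.

```python
def filter_with_template(origin_nodes, pattern_nodes):
    """Return PreNodes: the page's nodes that are not part of the site template."""
    return [node for node in origin_nodes if node not in pattern_nodes]


origin_nodes = ["Home | News | Sports", "Story text ...", "© Example News"]
pattern_nodes = {"Home | News | Sports", "© Example News"}
pre_nodes = filter_with_template(origin_nodes, pattern_nodes)
```

A list comprehension is used instead of a plain `set` difference so that the document order of the surviving nodes is kept for later text extraction.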

6. Train an SVM or CART classifier. A number of nodes are labeled in advance as either noise or main content and used as training samples. Training stops once the classifier reaches the predetermined classification performance, and the trained classifier classfer is output.
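To make the training step concrete without external libraries, the sketch below trains a depth-one decision tree (a "stump") using the Gini impurity criterion that CART splits on. It is a deliberately reduced stand-in, not a full CART or SVM implementation. The two-feature samples, the labels, and the `TARGET_ACCURACY` value are made up for illustration.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)


def train_stump(samples, labels):
    """Pick the single (feature, threshold) split with the lowest weighted Gini."""
    majority = int(sum(labels) * 2 >= len(labels))
    best = None
    for idx in range(len(samples[0])):
        values = sorted({s[idx] for s in samples})
        # Candidate thresholds: midpoints between consecutive distinct values.
        for lo, hi in zip(values, values[1:]):
            threshold = (lo + hi) / 2
            left = [y for s, y in zip(samples, labels) if s[idx] <= threshold]
            right = [y for s, y in zip(samples, labels) if s[idx] > threshold]
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if best is None or score < best[0]:
                # Each side of the split predicts its majority label.
                best = (score, idx, threshold,
                        int(sum(left) * 2 >= len(left)),
                        int(sum(right) * 2 >= len(right)))
    if best is None:                      # no feature separates the samples
        return lambda x: majority
    _, split_idx, split_at, left_label, right_label = best
    return lambda x: left_label if x[split_idx] <= split_at else right_label


def accuracy(classifier, samples, labels):
    hits = sum(classifier(s) == y for s, y in zip(samples, labels))
    return hits / len(labels)


# Made-up labeled nodes: [text length, link character ratio]; 1 = main, 0 = noise.
SAMPLES = [[420, 0.02], [310, 0.05], [18, 0.9], [25, 1.0], [12, 0.8], [560, 0.0]]
LABELS = [1, 1, 0, 0, 0, 1]
TARGET_ACCURACY = 0.95                    # hypothetical stopping criterion

classfer = train_stump(SAMPLES, LABELS)
good_enough = accuracy(classfer, SAMPLES, LABELS) >= TARGET_ACCURACY
```

In practice one would evaluate on held-out labeled nodes rather than on the training samples, and would likely use an existing SVM or decision-tree library with all eight features.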

7. Use the classifier classfer to classify the nodes in the filtered target web page node set PreNodes into a noise node set and a main content node set RstNodes, and finally output the text of the nodes in RstNodes.
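The final step can be sketched as a filter followed by a join. The node dictionaries, the function name `extract_main_text`, and the stand-in rule `rule_classfer` are hypothetical; any callable that maps a feature vector to 1 (main content) or 0 (noise) could take its place.

```python
def extract_main_text(pre_nodes, classfer):
    """Keep the nodes classified as main content (RstNodes) and join their text."""
    rst_nodes = [node for node in pre_nodes if classfer(node["features"]) == 1]
    return "\n".join(node["text"] for node in rst_nodes)


def rule_classfer(features):
    # Stand-in for a trained classifier: long text with few link characters.
    text_len, link_char_ratio = features
    return 1 if text_len > 100 and link_char_ratio < 0.5 else 0


pre_nodes = [
    {"text": "Related: A | B | C", "features": [18, 0.9]},
    {"text": "First paragraph of the story.", "features": [420, 0.02]},
    {"text": "Second paragraph of the story.", "features": [310, 0.05]},
]
main_text = extract_main_text(pre_nodes, rule_classfer)
```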

As shown in Fig. 2, the system of the method comprises: a web crawler module 101, a web page preprocessing module 102, a DOM tree feature vector calculation module 103, a template generation module 104, a database system 105, a template preprocessing module 106, a classifier training module 107, and a classifier prediction module 108.

Web crawler module 101: continuously fetches new target web pages that meet the requirements;

Preprocessing module 102: connected to module 101; deletes irrelevant tags from the target web page, repairs mismatched tag pairs, and outputs a simplified DOM tree;

DOM tree feature vector calculation module 103: connected to module 102; computes feature vectors on the simplified DOM tree and outputs the original node set OriginNodes of the target web page;

Template generation module 104: connected to module 103; runs template generation on the original node set OriginNodes and produces the template node set PatternNodes;

Database 105: connected to module 104; persists the generated template node set PatternNodes;

Template preprocessing module 106: connected to module 103, from which it obtains the original node set OriginNodes, and also connected to database 105, which it queries for the template node set PatternNodes of the website to which the target web page belongs. It outputs the filtered target web page node set PreNodes;

Classifier training module 107: trains the classifier classfer;

Classifier prediction module 108: connected to module 106, from which it receives the filtered target web page node set PreNodes, and also connected to module 107, from which it receives the classifier classfer. It uses classfer to split the set into a noise set and a main content set, and outputs the main content.

The technical effect of the invention is verified below through a concrete example.

Step S201: Take a URL (Uniform Resource Locator) from the to-be-crawled queue, download the page, select the URLs in the page that meet the conditions, add them to the to-be-crawled queue, and return to step S201, so that pages are fetched continuously. At the same time, preprocess the page: delete irrelevant tags and repair mismatched tag pairs. Then parse the page into a DOM tree and proceed, in parallel, to S202;
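The queue-driven loop of step S201 can be sketched as a breadth-first crawl. To keep the example runnable offline, pages come from an in-memory dictionary; `FAKE_SITE`, the `fetch` callable, the same-host condition, and the `max_pages` limit are assumptions for illustration, since the patent does not state which URLs "meet the conditions".

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit


class LinkCollector(HTMLParser):
    """Collects the href values of all <a> tags in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def crawl(seed, fetch, allowed_host, max_pages=100):
    queue = deque([seed])        # the to-be-crawled queue
    seen = {seed}                # URLs already queued, to avoid repeats
    pages = {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:         # download failed or page unknown
            continue
        pages[url] = html
        collector = LinkCollector()
        collector.feed(html)
        for href in collector.links:
            link = urljoin(url, href)     # resolve relative links
            if urlsplit(link).hostname == allowed_host and link not in seen:
                seen.add(link)
                queue.append(link)
    return pages


# Hypothetical three-page site used instead of real network access.
FAKE_SITE = {
    "https://news.example.com/":
        '<a href="/a.html">A</a><a href="https://other.example.net/x">X</a>',
    "https://news.example.com/a.html":
        '<a href="/">home</a><a href="b.html">B</a>',
    "https://news.example.com/b.html": "<p>leaf</p>",
}
pages = crawl("https://news.example.com/", FAKE_SITE.get, "news.example.com")
```

A production crawler would replace `FAKE_SITE.get` with a real downloader and add politeness controls such as rate limiting.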

Step S202: Compute, bottom-up, a feature vector for each block-level node of the DOM tree output by step S201. The feature components are: the ratio of the node's text length to the document's text length; the node's text length; the ratio of the length of punctuation in the node's text to the node's text length; the ratio of the number of link tags in the node to the number of link tags in the document; the ratio of the number of image tags in the node to the number of image tags in the document; the node weight score; the ratio of link characters in the node to the node's text length; and the ratio of the number of link tags plus image tags in the node to the node's text length. When computing these features, the content of descendant block nodes is not counted toward the node. Each feature vector is stored in its node, so the whole DOM tree yields an original node set OriginNodes; proceed to step S203. If OriginNodes is to be used for classifier training, also proceed, in parallel, to S204;

Step S203: Add the original node set OriginNodes output by step S202 to the cache maintained for the website to which the target web page belongs. Once the number of elements in the cache reaches the set threshold, extract the shared noise information, add the result to the template node set PatternNodes, and proceed to step S205; otherwise, proceed directly to S205.

Step S204: The original node set OriginNodes must be labeled manually for training the classifier classfer. Training can stop once classfer achieves the performance the system requires. This step is optional: it is needed only when the current classfer does not meet the system requirements and a new classifier must be trained. After training, proceed to S206 to replace the current classfer.

Step S205: Filter the original node set OriginNodes with the template node set PatternNodes. The effect is equivalent to removing part of the noise from the target web page. This noise is usually shared across the target website and includes the site's copyright notice, some advertisements, and site-wide page structure information. The filtered result is the filtered target web page node set PreNodes; proceed to S206.

Step S206: Classify the filtered target web page node set PreNodes with the current classifier classfer, and output the content of the nodes classified as main content.

In this way, 24,334 topic-type web pages were collected from websites including Reference News (参考消息), People's Daily, Sichuan Daily, West China Metropolis Daily, Tencent News, Sohu News, Sina News, Toutiao, ifeng.com (凤凰网), Guangming Online (光明网), Huanqiu.com (环球网), the Sichuan Provincial People's Government, and the Chengdu Municipal People's Government, and were denoised. Random samples of 2,000 pages were drawn several times for inspection; the average denoising accuracy was 98.64% and the average recall was 93.46%. Applying the method to a public opinion analysis system improved the quality of that system's corpus, which matters considerably for the accuracy of public opinion analysis.
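The text does not say how the two figures were computed. One common reading, assumed here, is that the reported accuracy corresponds to precision over extracted nodes and that recall is measured against the true main content nodes. The counts below are invented purely to show the arithmetic and are unrelated to the patent's results.

```python
def precision_recall(true_positive, false_positive, false_negative):
    """Precision and recall from counts of main-content decisions."""
    precision = true_positive / (true_positive + false_positive)
    recall = true_positive / (true_positive + false_negative)
    return precision, recall


# Invented counts: 90 main-content nodes kept correctly, 10 noise nodes kept
# by mistake, 30 main-content nodes wrongly discarded.
precision, recall = precision_recall(true_positive=90,
                                     false_positive=10,
                                     false_negative=30)
```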

Claims (6)

Translated fromChinese
1.一种基于模板和分类器协同工作的网页去噪方法,其特征在于,包括以下步骤:1. A webpage denoising method based on template and classifier cooperative work, is characterized in that, comprises the following steps:步骤1:下载目标网页,获取原始的HTML文档;Step 1: Download the target webpage and get the original HTML document;步骤2:解析原始的HTML文档,删除无关标签节点,修正DOM树,生成符合要求的简化DOM树;Step 2: Parse the original HTML document, delete irrelevant label nodes, correct the DOM tree, and generate a simplified DOM tree that meets the requirements;步骤3:计算目标网页的DOM树中的每个块级节点特征,得到该目标网页的原始节点集合;Step 3: Calculate the feature of each block-level node in the DOM tree of the target webpage to obtain the original node set of the target webpage;步骤4:生成模板,即将原始节点集合加入对应网站的缓存节点集合,当缓存节点集合中元素个数达到预设阈值时,触发模板生成算法,更新对应网站的模板节点集合;Step 4: Generate a template, that is, add the original node set to the cache node set of the corresponding website. When the number of elements in the cache node set reaches the preset threshold, the template generation algorithm is triggered to update the template node set of the corresponding website;步骤5:利用目标网页所属网站的模板节点集合对目标网页的原始节点集合进行过滤处理,输出已过滤目标网页节点集合;Step 5: filter the original node set of the target web page by using the template node set of the website to which the target web page belongs, and output the filtered target web page node set;步骤6:训练分类器,即预先将一些节点标记为噪声和主体两类,并加入已标记节点集合,用该集合中的节点作为训练样本训练分类器,直到分类器达到预定的分类效果;Step 6: Train the classifier, that is, pre-mark some nodes as noise and subject, and add them to the set of marked nodes, and use the nodes in the set as training samples to train the classifier until the classifier achieves the predetermined classification effect;步骤7:用训练好的分类器对已过滤目标网页节点集合中的节点进行分类,保留分类结果为主体内容的节点,从中提取主体内容文本。Step 7: Use the trained classifier to classify the nodes in the filtered target web page node set, keep the nodes whose classification results are the main content, and extract the main content text from 
them.2.如权利要求1所述的基于模板和分类器协同工作的网页去噪方法,其特征在于,所述步骤1具体为:包括网页下载和网页发现;网页下载负责下载目标网页并按目标网页的域名地址的不同,分类存入数据库中,网页发现负责发现符合要求的新网页地址,并将其添加到待爬取列表。2. the webpage denoising method based on template and classifier cooperative work as claimed in claim 1, is characterized in that, described step 1 is specifically: comprise webpage download and webpage discovery; Different domain names and addresses are classified into the database, and webpage discovery is responsible for finding new webpage addresses that meet the requirements and adding them to the list to be crawled.3.如权利要求1所述的基于模板和分类器协同工作的网页去噪方法,其特征在于,所述步骤2具体为:包括预处理和修正;预处理负责删除不包含文本内容的标签,包括注释、脚本和样式,修正即是修正DOM树的可修正错误,包括“<>”匹配错误、标签对匹配错误。3. The web page denoising method based on templates and classifiers as claimed in claim 1, wherein said step 2 is specifically: including preprocessing and correction; preprocessing is responsible for deleting tags that do not contain text content, Including comments, scripts and styles, correction means correcting correctable errors of the DOM tree, including "<>" matching errors and tag pair matching errors.4.如权利要求1所述的基于模板和分类器协同工作的网页去噪方法,其特征在于,所述步骤3中,节点特征包括:节点文本内容长度与文档文本内容长度比值、节点文本内容长度、节点文本内容标点符号长度与节点文本内容长度比值、节点链接标签个数与文档链接标签个数比值、节点图片标签个数与文档图片标签个数比值、节点权重分数、节点内链接字符与文本内容长度比值、节点内链接标签个数加图片标签个数与节点文本内容长度比值。4. 
The webpage denoising method based on cooperative work of a template and a classifier as claimed in claim 1, wherein in step 3 the node features include: the ratio of the node text content length to the document text content length; the node text content length; the ratio of the punctuation length in the node text content to the node text content length; the ratio of the number of link tags in the node to the number of link tags in the document; the ratio of the number of image tags in the node to the number of image tags in the document; the node weight score; the ratio of the length of link characters to the length of the text content; and the ratio of the number of link tags plus the number of image tags in the node to the node text content length.

5. The webpage denoising method based on cooperative work of a template and a classifier as claimed in claim 1, wherein in step 6 the classifier model adopted by the classifier is a support vector machine or a classification and regression tree.

6. A webpage denoising system based on cooperative work of a template and a classifier, characterized by comprising a web crawler module, an HTML preprocessing module, a DOM tree feature vector calculation module, a template generation module, a template preprocessing module, a classifier training module and a classifier prediction module;
the web crawler module is used to download the target webpage and obtain the original HTML document;
the HTML preprocessing module is used to parse the original HTML document, delete irrelevant tag nodes, correct the DOM tree, and generate a simplified DOM tree that meets the requirements;
the DOM tree feature vector calculation module is used to calculate the features of each block-level node in the DOM tree of the target webpage to obtain the original node set of the target webpage;
the template generation module is used to add the original node set to the cache node set of the corresponding website and, when the number of elements in the cache node set reaches a preset threshold, to trigger the template generation algorithm and update the template node set of the corresponding website;
the template preprocessing module is used to filter the original node set of the target webpage with the template node set of the website to which the target webpage belongs, and to output the filtered target webpage node set;
the classifier training module is used to train the classifier, that is, some nodes are labeled in advance as either noise or main content, and the labeled nodes are used as training samples to train the classifier until it reaches a predetermined classification performance;
the classifier prediction module is used to classify the nodes in the filtered target webpage node set with the trained classifier, retain the nodes classified as main content, and extract the main content text from them.
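As an illustration only, the eight node features listed in the claim on node features could be computed along the following lines. This is a minimal sketch, not the patented implementation: the `BlockNode` container, the `node_features` function, the punctuation set and the zero-denominator convention are all assumptions made for the example. The node weight score is taken as a given input because these claims do not say how it is derived, and "link characters" is read here as the text inside link tags.

```python
from dataclasses import dataclass
import string

# Assumed punctuation set: ASCII punctuation plus a few common full-width marks.
PUNCTUATION = set(string.punctuation) | set(",。!?;:、")


@dataclass
class BlockNode:
    """Hypothetical stand-in for one block-level node of the simplified DOM tree."""
    text: str            # all text content inside the node
    link_text: str       # text content that sits inside link tags
    n_links: int         # number of link tags in the node
    n_images: int        # number of image tags in the node
    weight: float = 0.0  # node weight score, assumed to be computed elsewhere


def ratio(numerator: float, denominator: float) -> float:
    """Return numerator/denominator, or 0.0 for an empty denominator (assumed convention)."""
    return numerator / denominator if denominator else 0.0


def node_features(node: BlockNode, doc_text_len: int,
                  doc_links: int, doc_images: int) -> list[float]:
    """Build the eight-element feature vector in the order the claim lists the features."""
    text_len = len(node.text)
    punct_len = sum(1 for ch in node.text if ch in PUNCTUATION)
    return [
        ratio(text_len, doc_text_len),                  # node text / document text
        float(text_len),                                # node text length
        ratio(punct_len, text_len),                     # punctuation / node text
        ratio(node.n_links, doc_links),                 # node links / document links
        ratio(node.n_images, doc_images),               # node images / document images
        node.weight,                                    # node weight score
        ratio(len(node.link_text), text_len),           # link characters / node text
        ratio(node.n_links + node.n_images, text_len),  # (links + images) / node text
    ]


# Usage: one node inside a document with 26 characters, 4 links and 2 images.
node = BlockNode(text="Hello, world!", link_text="world",
                 n_links=1, n_images=0, weight=2.0)
features = node_features(node, doc_text_len=26, doc_links=4, doc_images=2)
```

A vector of this shape is the kind of input that a support vector machine or a classification and regression tree, the two models named in claim 5, would take for the noise versus main content decision.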
Application CN201910694087.XA (priority date 2019-07-30, filing date 2019-07-30): Webpage denoising method and system based on cooperative work of template and classifier. Status: Active. Publication: CN110457579B (en)

Priority Applications (1)

Application Number | Publication | Priority Date | Filing Date | Title
CN201910694087.XA | CN110457579B (en) | 2019-07-30 | 2019-07-30 | Webpage denoising method and system based on cooperative work of template and classifier

Publications (2)

Publication Number | Publication Date
CN110457579A (en) | 2019-11-15
CN110457579B (en) | 2022-03-22

Family

ID=68483966

Family Applications (1)

Application Number | Status | Publication | Priority Date | Filing Date | Title
CN201910694087.XA | Active | CN110457579B (en) | 2019-07-30 | 2019-07-30 | Webpage denoising method and system based on cooperative work of template and classifier

Country Status (1)

Country | Link
CN (1) | CN110457579B (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN101197849A (en)* | 2007-12-21 | 2008-06-11 | 腾讯科技(深圳)有限公司 | Method and device for commuting internet page into wireless application protocol page
US20090063500A1 (en)* | 2007-08-31 | 2009-03-05 | Microsoft Corporation | Extracting data content items using template matching
CN101727498A (en)* | 2010-01-15 | 2010-06-09 | 西安交通大学 | Automatic extraction method of web page information based on WEB structure
US20100169311A1 (en)* | 2008-12-30 | 2010-07-01 | Ashwin Tengli | Approaches for the unsupervised creation of structural templates for electronic documents
CN103744981A (en)* | 2014-01-14 | 2014-04-23 | 南京汇吉递特网络科技有限公司 | System for automatic classification analysis for website based on website content
CN103838823A (en)* | 2014-01-22 | 2014-06-04 | 浙江大学 | Website content accessible detection method based on web page templates
CN107577783A (en)* | 2017-09-15 | 2018-01-12 | 电子科技大学 | Web Page Type Automatic Recognition Method Based on Web Structure Feature Mining

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
万乐等: "基于主题的网页噪音去除机制", 《计算机工程与设计》*

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN110851606A (en)* | 2019-11-18 | 2020-02-28 | 杭州安恒信息技术股份有限公司 | Website clustering method and system based on webpage structure similarity
CN112199613A (en)* | 2020-10-13 | 2021-01-08 | 北京理工大学 | A product URL automatic positioning method integrating DOM topology and text attributes
CN112199613B (en)* | 2020-10-13 | 2023-03-03 | 北京理工大学 | Product URL automatic positioning method integrating DOM topology and text attributes
CN112347353A (en)* | 2020-11-06 | 2021-02-09 | 同方知网(北京)技术有限公司 | Webpage denoising method
CN112347353B (en)* | 2020-11-06 | 2024-05-24 | 同方知网(北京)技术有限公司 | Method for denoising webpage
CN112528205A (en)* | 2020-12-22 | 2021-03-19 | 中科院计算技术研究所大数据研究院 | Webpage main body information extraction method and device and storage medium
CN113254751A (en)* | 2021-06-24 | 2021-08-13 | 北森云计算有限公司 | Method, equipment and storage medium for accurately extracting complex webpage structured information
CN114691137A (en)* | 2022-04-06 | 2022-07-01 | 中国农业银行股份有限公司 | Webpage simplifying method and device and electronic equipment

Also Published As

Publication number | Publication date
CN110457579B (en) | 2022-03-22

Similar Documents

Publication | Title
CN110457579A (en) | Web page denoising method and system based on template and classifier working together
CN103294781B (en) | A kind of method and apparatus for processing page data
CN103678412B (en) | A kind of method and device of file retrieval
CN104182412B (en) | A web crawling method and system
CN103942335B (en) | Construction method of uninterrupted crawler system oriented to web page structure change
CN104679825B (en) | Macroscopic abnormity of earthquake acquisition of information based on network text and screening technique
CN103838823B (en) | Website content accessible detection method based on web page templates
CN107577783A (en) | Web Page Type Automatic Recognition Method Based on Web Structure Feature Mining
CN105653668A (en) | Webpage content analysis and extraction optimization method based on DOM Tree in cloud environment
CN106815307A (en) | Public Culture knowledge mapping platform and its use method
CN107885793A (en) | A kind of hot microblog topic analyzing and predicting method and system
CN101661513A (en) | Detection method of network focus and public sentiment
CN104361081A (en) | WEB document-based automatic abstracting method
CN103853760A (en) | Method and device for extracting contents of bodies of web pages
CN106557565A (en) | A kind of text message extracting method based on website construction
CN102662969A (en) | Internet information object positioning method based on webpage structure semantic meaning
CN104572934B (en) | A method for extracting key content of web pages based on DOM
CN103927400A (en) | Web site product detailed information classification crawling and product information base establishing method
CN103440315B (en) | A kind of Web page cleaning method based on theme
CN112149422A (en) | A dynamic monitoring method of enterprise news based on natural language
US20170235835A1 (en) | Information identification and extraction
CN103559202B (en) | A kind of webpage content extraction apparatus and method
CN110472126A (en) | A kind of acquisition methods of page data, device and equipment
CN115640439A (en) | Method, system and storage medium for network public opinion monitoring
CN114443928B (en) | Web text data crawler method and system

Legal Events

Code | Title
PB01 | Publication
SE01 | Entry into force of request for substantive examination
GR01 | Patent grant
