



Technical Field
The present invention relates to a data processing method and system, and more particularly to a method and system for extracting and processing information on a computer network, especially online news.
Background Art
Today is an era of information explosion. With the rapid development of the Internet, more and more people obtain the latest information through the network.
Almost everyone has the habit of reading newspapers; individuals and enterprises with a pressing need for information, in particular, have to gather what they need from many newspapers. Almost all news can now be read on the Internet, and many people already obtain the latest news this way. However, simply reading news online does not save the time we need: we still have to read through a long article to learn what it is about, or browse many pages before finding the information we need. Moreover, online news is fleeting, and many people need to look up news from many days ago, or even from several months or a year ago. In such cases, simply browsing the web can no longer meet our requirements.
Traditional statistics-based automatic summarization methods generally use mathematical statistics to assign a weight to every word in the document, and the weight is usually computed from the word's frequency of occurrence in the article. Words that occur more frequently receive higher weights, and a word with a high weight is taken to be central to the article.
Sentence weights are derived from the word weights in turn: once every word has been weighted, the weight of each sentence can be computed, and sentences with higher weights better represent the central idea of the article. The highest-weighted sentences can then be used directly to produce the summary.
This method generates summaries very quickly, but words that occur frequently are not necessarily the central idea of the article, and since no grammatical analysis is performed, a summary pieced together from high-weight sentences is relatively hard to read.
However, more acceptable results can be achieved by improving the way weights are assigned and the way central sentences are selected.
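The word-frequency weighting and sentence scoring described above can be illustrated with a short sketch. The class and method names (FrequencySummarizer, summarize, score) are illustrative only, and the sketch assumes the input has already been split into sentences and words; it is not the implementation of the present invention.

import java.util.*;

public class FrequencySummarizer {
    // Weight each word by its raw frequency in the document, then score each
    // sentence by the sum of its word weights and keep the top-ranked sentences.
    public static List<String> summarize(List<List<String>> sentences, int topN) {
        Map<String, Integer> freq = new HashMap<>();
        for (List<String> sentence : sentences) {
            for (String w : sentence) {
                freq.merge(w, 1, Integer::sum);
            }
        }
        List<List<String>> ranked = new ArrayList<>(sentences);
        ranked.sort((a, b) -> Integer.compare(score(b, freq), score(a, freq)));
        List<String> summary = new ArrayList<>();
        for (List<String> s : ranked.subList(0, Math.min(topN, ranked.size()))) {
            summary.add(String.join("", s));
        }
        return summary;
    }

    private static int score(List<String> sentence, Map<String, Integer> freq) {
        int sum = 0;
        for (String w : sentence) {
            sum += freq.getOrDefault(w, 0);
        }
        return sum;
    }
}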
Automatic Chinese word segmentation is a necessary step in building a full-text index. Word segmentation means dividing a sentence or an article into its individual words. Unlike English, Chinese has no explicit segmentation marks; words vary in length, word definitions differ, and phenomena such as polysemy and synonymy exist, so automatic Chinese word segmentation is very difficult.
The more popular word segmentation methods are the following:
Forward maximum matching: the earliest segmentation method. Each time, the longest candidate taken forward from the current position (for example, six characters) is matched against the dictionary; if the match succeeds, segmentation continues from the next position, otherwise the last character is dropped and matching is retried (a minimal sketch of this method is given after this list).
High-frequency priority: this method is based on word-frequency statistics, the combination rules between characters and the treatment of ambiguous divisions. It improves segmentation efficiency, but it cannot resolve ambiguity, and the error rate is not reduced.
Neural network segmentation: works by imitating the parallel, distributed processing of the human brain and building a numerical model. Segmentation knowledge is stored implicitly and in distributed form inside the network, and the internal weights are adjusted through self-learning and training to obtain better segmentation results.
Expert system segmentation: following the expert system approach, this method separates the segmentation knowledge (both common-sense segmentation knowledge and the heuristic knowledge used to resolve ambiguous cuts, i.e. the ambiguity resolution rules) from the inference engine that carries out the segmentation, so that maintenance of the knowledge base and implementation of the inference engine are independent of each other. It can also detect overlapping-ambiguity and combination-ambiguity fields and has a certain self-learning ability.
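A minimal sketch of forward maximum matching follows, assuming a word dictionary held in a Set; the class name ForwardMaxMatch and the parameter maxLen are illustrative, not part of any existing implementation.

import java.util.*;

public class ForwardMaxMatch {
    // At each position take the longest dictionary word (at most maxLen
    // characters) starting there; if nothing matches, emit one character.
    public static List<String> segment(String text, Set<String> dict, int maxLen) {
        List<String> words = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            String match = null;
            int end = Math.min(i + maxLen, text.length());
            for (int j = end; j > i; j--) {        // shrink the candidate from the right
                String cand = text.substring(i, j);
                if (dict.contains(cand)) { match = cand; break; }
            }
            if (match == null) match = text.substring(i, i + 1);
            words.add(match);
            i += match.length();
        }
        return words;
    }
}

For example, with 计算机 and 网络 in the dictionary, segment("计算机网络", dict, 6) would return 计算机, 网络.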
Current full-text indexes generally use an inverted file as the index mechanism; the inverted file stores, for each term, the list of the document numbers in which it occurs.
For text retrieval, the most effective index structure is the inverted file: it is a collection of lists, one record per term t, and each record lists the identifiers of all documents d that contain the term.
The inverted file can be viewed as the transpose of the document-term frequency matrix, turning (d, t) into (t, d), because row-major access is more efficient than column-major access.
The index consists of three parts: the dictionary (invf.dict), the inverted file (invf), and the mapping file between the two (invf.idx). The index file structure is shown in Figure 2.
In the dictionary (invf.dict): for each distinct term t, the term string t, the number f_t of documents containing t, and the total number F_t of occurrences of t in the whole document collection are stored.
In the mapping file (invf.idx): for each distinct term t, a pointer to the start address of the corresponding inverted list is stored.
In the inverted file (invf): for each distinct term t, the identifier d (a sequential number) of every document containing t and the frequency f_{d,t} of t in that document are stored as a list of <d, f_{d,t}> pairs.
Together with the document weight array W_d, this suffices to support both Boolean queries and ranked queries.
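As an illustration of the <d, f_{d,t}> posting lists just described, the following sketch keeps the whole index in memory; a real inverted file would be compressed and stored on disk, and the class names used here are illustrative only.

import java.util.*;

public class InvertedIndex {
    // One posting <d, f_{d,t}>: a document identifier and the term's frequency in it.
    public static class Posting {
        public final int docId;
        public int freq;
        Posting(int docId, int freq) { this.docId = docId; this.freq = freq; }
    }

    // term t -> posting list: the (t, d) view of the document-term frequency matrix
    private final Map<String, List<Posting>> postings = new HashMap<>();

    // Add one document's terms; documents are expected to arrive in increasing docId order.
    public void add(int docId, List<String> terms) {
        for (String t : terms) {
            List<Posting> list = postings.computeIfAbsent(t, k -> new ArrayList<>());
            Posting last = list.isEmpty() ? null : list.get(list.size() - 1);
            if (last != null && last.docId == docId) {
                last.freq++;
            } else {
                list.add(new Posting(docId, 1));
            }
        }
    }

    // f_t is simply the length of the posting list of t.
    public List<Posting> lookup(String term) {
        return postings.getOrDefault(term, Collections.emptyList());
    }
}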
Summary of the Invention
The purpose of the present invention is to provide a method and system for network information extraction and processing which, using computer technology and natural language processing technology, can automatically download the latest news from designated sites every day, extract and classify the content, condense the full text by automatic summarization, store the full text in the system, and build a text index for efficient full-text retrieval later.
To achieve the above purpose, the technical solution of the present invention is as follows:
A method for extracting and processing network information, comprising the following steps:
1. News download step, comprising the following sub-steps:
a URL analysis step: the system specifies certain URLs, and the program automatically derives the final content URLs of the news from them, without building a site-specific URL module for each news website; using URL statistics and URL correlation analysis, a page containing links to the final news content is analyzed statistically to find the useful final URL addresses;
an automatic news page crawling step: all linked pages in the target address that match the URL format are downloaded;
a garbage filtering step: the downloaded news content pages are filtered to remove HTML tags and useless Chinese text, finally producing Chinese vector information;
an information extraction step: information is extracted from the Chinese vectors obtained above; in the first stage the title and body are extracted, and in a later stage feature extraction, correlation analysis, document classification, duplicate removal and so on are applied to the web news content;
2. Automatic summary generation step: word segmentation, feature word analysis and sentence importance analysis are performed, a summary is generated, and the summary is output;
3. Full-text index generation step: a full-text index is built over all news content files that have been downloaded and whose content has been extracted, comprising the following sub-steps:
an input step, in which the next file name is passed in;
an index-check step, in which it is determined whether the file has already been indexed; if so, return to the input step, otherwise proceed to the next step;
a filtering step, in which all garbage and meaningless words are filtered out;
a dictionary segmentation step, in which dictionary-matching word segmentation is performed;
an n-gram segmentation step, in which n-gram segmentation is performed so that words the dictionary segmentation failed to separate completely are still covered;
an update step, in which for each word the relevant index files, including the keyword, date and category indexes, are updated;
4. Hierarchical text classification step: a new document is assigned to one class within a given hierarchy of categories; each document can be assigned to only one class. Each class in the hierarchy is associated with many words and terms, and a given term may carry a large weight at one level of the hierarchy while being a stopword at another level. The feature words of the extracted documents (financial news) are used as the terms and vocabulary in this system. This step comprises a hierarchy training step and a document classification step;
Hierarchy training is the preprocessing for document classification: before classification, the category hierarchy is trained. The function of training is to collect a set of features (feature words) from the training documents and then assign feature weights to each node (category) in the hierarchy; in the document classification algorithm, the feature weights are used to compute category ranks for a new document;
In the document classification step, once the hierarchy has been trained, a document can be classified into a category; classification starts from the root category, and all sub-categories of the root are assigned ranks computed by the following equation:
R_cd = Σ_f N_fd × W_fc, where c is a category, d is a document, f ranges over the features occurring in d, R_cd is the rank of c, N_fd is the number of times f appears in d, and W_fc is the weight of f in category c;
If the ranks of all sub-categories are zero or negative, d remains in the root category; if among the sub-categories there is a category with the largest positive rank, that category is selected; if it is a leaf category, document d is assigned to it; if the selected category is not a leaf, the computation continues over its sub-categories; document d can therefore end up in a leaf category or an internal category.
A system for extracting and processing network information, comprising the following devices:
1. News download device, comprising the following devices:
a URL analysis device: the system specifies certain URLs, and the program automatically derives the final content URLs of the news from them, without building a site-specific URL module for each news website; using URL statistics and URL correlation analysis, a page containing links to the final news content is analyzed statistically to find the useful final URL addresses;
an automatic news page crawling device: all linked pages in the target address that match the URL format are downloaded;
a garbage filtering device: the downloaded news content pages are filtered to remove HTML tags and useless Chinese text, finally producing Chinese vector information;
an information extraction device: information is extracted from the Chinese vectors obtained above; in the first stage the title and body are extracted, and in a later stage feature extraction, correlation analysis, document classification, duplicate removal and so on are applied to the web news content;
2. Automatic summary generation device: performs word segmentation, feature word analysis and sentence importance analysis, generates the summary, and outputs the summary;
3. Full-text index generation device: builds a full-text index over all news content files that have been downloaded and whose content has been extracted, comprising the following devices:
an input device, which passes in the next file name;
an index-check device, which determines whether the file has already been indexed; if so, control returns to the input device, otherwise it proceeds to the next device;
a filtering device, which filters out all garbage and meaningless words;
a dictionary segmentation device, which performs dictionary-matching word segmentation;
an n-gram segmentation device, which performs n-gram segmentation so that words the dictionary segmentation failed to separate completely are still covered;
an update device, which updates for each word the relevant index files, including the keyword, date and category indexes;
4. Hierarchical text classification device: classifies a new document into one class within a given hierarchy of categories; each document can be assigned to only one class. Each class in the hierarchy is associated with many words and terms, and a given term may carry a large weight at one level of the hierarchy while being a stopword at another level. The feature words of the extracted documents (financial news) are used as the terms and vocabulary in this system. It comprises a hierarchy training device and a document classification device;
The hierarchy training device performs the preprocessing for document classification: before classification, the category hierarchy is trained. The function of training is to collect a set of features (feature words) from the training documents and then assign feature weights to each node (category) in the hierarchy; in the document classification algorithm, the feature weights are used to compute category ranks for a new document;
The document classification device, once the hierarchy has been trained, classifies a document into a category. Classification starts from the root category, and all sub-categories of the root are assigned ranks computed by the following equation:
R_cd = Σ_f N_fd × W_fc, where c is a category, d is a document, f ranges over the features occurring in d, R_cd is the rank of c, N_fd is the number of times f appears in d, and W_fc is the weight of f in category c;
If the ranks of all sub-categories are zero or negative, d remains in the root category; if among the sub-categories there is a category with the largest positive rank, that category is selected; if it is a leaf category, document d is assigned to it; if the selected category is not a leaf, the computation continues over its sub-categories; document d can therefore end up in a leaf category or an internal category.
With the above method and system, the latest news page source code can be downloaded automatically every day from the designated sections of the designated web sites; the downloaded HTML code can be analyzed to obtain the valuable news content; the extracted content can be condensed by automatic summarization; the extracted content can be segmented and indexed for retrieval; and the extracted content can be classified automatically.
Description of the Drawings
Figure 1 is a system structure diagram of an existing method and program for automatically downloading network information;
Figure 2 is the index file structure diagram of an existing network information processing method;
Figure 3 is a flowchart of the news download step in the network information extraction and processing method of the present invention;
Figure 4 is the news list page of the news center of People's Daily Online;
Figure 5 is a flowchart of the method for deriving the token stream;
Figure 6 is a page of the China.com finance channel;
Figure 7 is the page at http://www.chinahd.com/news/stock/2002-3/161628.htm;
Figure 8 is the source code of the page in Figure 7;
Figure 9 is a news page from the China.com finance channel;
Figure 10 shows the content information obtained by content analysis of the news page in Figure 9;
Figure 11 is a flowchart of the automatic summary generation method;
Figure 12 is an analysis diagram of the automatic summary generation method;
Figure 13 is the original text of the content storage example;
Figure 14 is the summary automatically generated according to the present invention;
Figure 15 is a flowchart of the full-text index generation step of the present invention;
Figure 16 is a flowchart of the news query step of the present invention.
Detailed Description of the Embodiments
The present invention is described in further detail below with reference to the accompanying drawings and embodiments:
We only handle the processes of automatic download and content analysis and do not build a matching model for each website; instead, a general algorithm for news sites is implemented, which determines which part is the news content from the frequency of the Chinese text and from the frequency and position of the content-intimate HTML tags. This is described in detail in the implementation below.
Since we need content of relatively high accuracy, from which information is extracted and delivered to end users, the robot does not need to perform deep recursive crawling. The specific automatic download method is described later.
For the sake of generality, we do not rely on the web page features of the text; we consider automatic summarization of the pure content based on a background corpus.
A method for extracting and processing network information comprises the following steps:
1. News download step: as shown in Figure 3, automatic news download consists of two parts, URL analysis and source code crawling. Thanks to Java's networking facilities, a connection can be opened to any resource on the Internet and turned into a stream, so that resources on the network can be handled just like local files.
1) URL analysis step:
The system specifies certain URLs, and the program automatically derives the final content URLs of the news from them, without building a site-specific URL module for each news website.
Using URL statistics and URL correlation analysis, a page containing links to the final news content is analyzed statistically to find the useful final URL addresses. For example, the program is given a certain number of URLs that have already been classified; each such URL should be a news list page, i.e. clicking a news link on that page opens a news content page.
Take People's Daily Online as an example: the page shown in Figure 4 is the news list page of its news center.
By analyzing this page, we can conclude that the URL format of the final pages is http://www.people.com.cn/GB/guoii/25/96/20020312/*.html, which is saved to the relevant final-URL-format file.
Token analysis of the HTML:
Making full use of object-oriented design in Java, each HTML source file is treated as an object, and a class named token is defined; a token describes one meaningful string in the HTML, and a urltoken class derived from token describes tokens whose form matches the URL format.
Thus, when the HTML source code is analyzed, each file is treated as an object, and every HTML tag, as well as every string between two HTML tags, is treated as one token.
Attributes of each token:
String tokenstr = null;      // the string value of this token
int tokenloc = 0;            // the position of this token in the original file
int gbnum = 0;               // the number of Chinese characters in this token
boolean iskeentag = false;   // whether this token is entirely a content-intimate tag
float keenvalue = 0;         // the degree of intimacy with the content
A more specific method of the token class:
public boolean ishref() {
    String flag1 = "href=";
    int flag2 = -1;
    if (tokenstr.indexOf(flag1) == flag2)
        return false;
    else
        return true;
}
This method determines whether the token is a URL (href) HTML tag.
In practice, analyzing the HTML source code with this object-oriented design and Java's stream concept, we built a token stream, and the results show that the approach works well:
1. The program structure is very clear, and the object-oriented design is plainly reflected.
2. The analysis works well and achieves high accuracy.
3. There is no need to define special analysis stop marks and the like for each website.
4. Any reasonably standard HTML code can be processed normally.
The method for deriving the token stream is shown in Figure 5.
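A minimal sketch of such a scanner is given below: it walks the HTML source once, turning every tag and every run of text between tags into a token and recording its position and Chinese character count. The class names TokenScanner and Token and the CJK range test are assumptions of this sketch only.

import java.util.*;

public class TokenScanner {
    public static class Token {
        public String tokenstr;   // the string value of this token
        public int tokenloc;      // position of the token in the source file
        public int gbnum;         // number of Chinese characters in the token
    }

    // Every "<...>" tag is one token; every run of text between tags is one token.
    public static List<Token> scan(String html) {
        List<Token> tokens = new ArrayList<>();
        int i = 0;
        while (i < html.length()) {
            int start = i;
            String str;
            if (html.charAt(i) == '<') {
                int close = html.indexOf('>', i);
                if (close < 0) close = html.length() - 1;
                str = html.substring(i, close + 1);
                i = close + 1;
            } else {
                int open = html.indexOf('<', i);
                if (open < 0) open = html.length();
                str = html.substring(i, open);
                i = open;
            }
            Token t = new Token();
            t.tokenstr = str;
            t.tokenloc = start;
            t.gbnum = countChinese(str);
            tokens.add(t);
        }
        return tokens;
    }

    private static int countChinese(String s) {
        int n = 0;
        for (char c : s.toCharArray()) {
            if (c >= 0x4E00 && c <= 0x9FFF) n++;   // CJK unified ideographs
        }
        return n;
    }
}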
For every news section of each site, the following feature items are defined:
the category the section belongs to, such as politics, industry or sports (these categories are also defined by the management module);
the server address the section belongs to, for example news.sina.com.cn;
the current directory the section belongs to (on a well-organized site, the news of one section is usually under one directory);
the path attribute of the section's list page, i.e. whether it is an absolute or a relative path.
URL analysis is implemented mainly by the two classes urlanalyse.class and contentanalyse.class, which carry out the analysis of the token stream.
Main analysis method: urlanalyse.class has a method geturl(String filename) that first converts the source code into a token stream and reads it in, and then puts every URL token of the right format, together with the first following token whose gbnum is not zero, into a cached HashMap; in general, the token with non-zero gbnum following a URL is the title of the news.
For example, the China.com finance channel page is shown in Figure 6.
After URL analysis, the following HashMap is obtained:
http://finance.china.com/zh_cn/news/financenews/10001254/20020506/10255883.html
十年投入6000亿 重庆要打造国际大都市 (600 billion to be invested over ten years: Chongqing to build itself into an international metropolis)
http://finance.china.com/zh_cn/news/financenews/10001254/20020506/10255882.html
1吨油赔四五百元 税控机拒收油票收现金 (a loss of four to five hundred yuan per ton of oil: tax-control machines refuse fuel coupons and take cash only)
http://finance.china.com/zh_cn/news/financenews/10001254/20020506/10255881.html
香港旅游业——经济复苏的一缕春风 (Hong Kong tourism: a breath of spring for the economic recovery)
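The idea of geturl can be sketched as follows, operating on the token strings produced by a scanner such as the one above; it pairs every href token with the first following token that contains Chinese characters, which is normally the headline. The class name UrlAnalyse and the helper names below are illustrative only.

import java.util.*;

public class UrlAnalyse {
    // Pair every link token with the next token containing Chinese characters.
    public static Map<String, String> getUrl(List<String> tokens) {
        Map<String, String> urlToTitle = new LinkedHashMap<>();
        for (int i = 0; i < tokens.size(); i++) {
            String t = tokens.get(i);
            int p = t.indexOf("href=");
            if (p < 0) continue;
            String url = stripHref(t.substring(p + 5));
            for (int j = i + 1; j < tokens.size(); j++) {
                if (chineseCount(tokens.get(j)) > 0) {
                    urlToTitle.put(url, tokens.get(j).trim());   // token after the link: the title
                    break;
                }
            }
        }
        return urlToTitle;
    }

    private static String stripHref(String s) {
        StringBuilder out = new StringBuilder();
        for (char c : s.toCharArray()) {
            if (c == '"' || c == '\'') continue;   // drop surrounding quotes
            if (c == '>' || c == ' ') break;       // end of the attribute value
            out.append(c);
        }
        return out.toString();
    }

    private static int chineseCount(String s) {
        int n = 0;
        for (char c : s.toCharArray()) {
            if (c >= 0x4E00 && c <= 0x9FFF) n++;
        }
        return n;
    }
}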
After obtaining these URL-title pairs, automatic crawling is started, and the source code of every analyzed URL page is fetched.
2) Automatic news page crawling step:
Each time the program starts, all pages linked from the target addresses that match the URL format are downloaded. No information extraction or other analysis is performed during downloading, so as not to add load and slow the download. Pages that have already been downloaded are not downloaded again. Downloading must take encoding factors such as GB and Big5 into account.
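A minimal sketch of fetching one page's source with Java's URL stream facilities is given below; the class name PageDownloader is illustrative, and the caller has to pass the right charset (GB2312, Big5, ...) for the site, as required above.

import java.io.*;
import java.net.*;
import java.nio.charset.Charset;

public class PageDownloader {
    // Open a connection to the page and read its source as text in the given charset.
    public static String fetch(String url, String charsetName) throws IOException {
        URLConnection conn = new URL(url).openConnection();
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), Charset.forName(charsetName)))) {
            StringBuilder sb = new StringBuilder();
            String line;
            while ((line = in.readLine()) != null) {
                sb.append(line).append('\n');
            }
            return sb.toString();
        }
    }
}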
3) Garbage filtering module:
This step performs garbage filtering on the downloaded news content pages, removing the HTML tags and useless Chinese text and finally producing Chinese vector information. It must run in a background thread while downloading proceeds. Later, weights and other information may be added to the resulting Chinese vectors (the weight is determined by the position where the text appears, the surrounding HTML tags and so on, and a certain number of documents are needed for familiarization and training).
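A simple version of such a filter can be sketched with regular expressions: scripts, styles and tags are stripped, and only lines that still contain Chinese characters are kept. The class name GarbageFilter and the exact filtering rules are assumptions of this sketch.

public class GarbageFilter {
    // Strip <script>/<style> blocks and all remaining tags, then keep only
    // the lines that still contain Chinese text.
    public static String filter(String html) {
        String text = html
                .replaceAll("(?is)<script.*?</script>", " ")
                .replaceAll("(?is)<style.*?</style>", " ")
                .replaceAll("(?s)<[^>]+>", " ")
                .replaceAll("&nbsp;", " ");
        StringBuilder out = new StringBuilder();
        for (String line : text.split("\n")) {
            if (line.matches(".*[\\u4e00-\\u9fff].*")) {
                out.append(line.trim()).append('\n');
            }
        }
        return out.toString();
    }
}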
4) Information extraction module:
Information is extracted from the Chinese vectors obtained above; in the first stage the title and body are extracted, and in a later stage feature extraction, correlation analysis, document classification, duplicate removal and so on are applied to the web news content, while keeping the method general and the accuracy high. The first-stage function can be achieved by simple means (e.g. the number of times the words in a**** occur in content b***c**d**). Which block is the body can be judged from the distance between sentences and from the HTML tags before and after (each tag carries a certain weight).
As shown in Figure 7 (source: http://www.chinahd.com/news/stock/2002-3/161628.htm), whose source code is shown in Figure 8, the pieces of content lie very close to each other, and the HTML tags between them are generally things like <p>, &nbsp; and <br> (paragraph, space, line break). The location of the content can therefore be judged from these distances and from the particularity of the tags.
News content extraction
Unlike traditional content extraction methods, we do not build a model for each website; in the program this is implemented mainly by contentanalyse.class and token.class.
The specific method is as follows (a sketch is given after this list):
1. First convert the file whose content is to be extracted into a concrete token stream;
2. Score the token stream according to content intimacy;
3. Take out the contiguous token run in which the number of Chinese (GB) characters is most concentrated and the intimacy is at the same time highest;
4. If the GB character count and the intimacy cannot satisfy the above requirements at the same time, the page is simply cancelled.
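These four steps can be sketched as follows, reusing the Token fields from the scanner sketch above: consecutive text tokens separated only by content-intimate tags such as <p>, <br> or &nbsp; are merged into runs, the run with the most Chinese characters is kept, and the page is cancelled when even that run is too short. The class name, the tag set and the minChars threshold are assumptions of this sketch.

import java.util.*;

public class ContentExtractor {
    private static final Set<String> KEEN_TAGS =
            new HashSet<>(Arrays.asList("<p>", "</p>", "<br>", "&nbsp;"));

    // Return the densest run of Chinese text, or null to signal "cancel".
    public static String extract(List<TokenScanner.Token> tokens, int minChars) {
        StringBuilder best = new StringBuilder();
        int bestCount = 0;
        StringBuilder cur = new StringBuilder();
        int curCount = 0;
        for (TokenScanner.Token t : tokens) {
            String tag = t.tokenstr.trim().toLowerCase();
            if (t.gbnum > 0) {                 // text token with Chinese content
                cur.append(t.tokenstr);
                curCount += t.gbnum;
            } else if (!tag.isEmpty() && !KEEN_TAGS.contains(tag)) {
                // any other tag breaks the current run
                if (curCount > bestCount) { best = cur; bestCount = curCount; }
                cur = new StringBuilder();
                curCount = 0;
            }
        }
        if (curCount > bestCount) { best = cur; bestCount = curCount; }
        return bestCount >= minChars ? best.toString() : null;
    }
}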
For example, a news page of the China.com finance channel is shown in Figure 9.
After content analysis, since China.com pages are relatively well structured, very high accuracy is generally achieved; the specific test data are described in detail later.
The content information obtained by content analysis is shown in Figure 10.
At storage time, all five parts of the news (source, category, downloadtime, title, content) are stored; they are the basis from which the keyword, date and other indexes are built, and they are also the source of the summary.
5) Management step: manages the news data stored on the local machine, e.g. deletion and updating.
2. Automatic summary generation step: the original document is first preprocessed, then word segmentation, feature word analysis and sentence importance analysis are performed, the summary is generated, and the summary is output.
The automatic summarization step can be an independent module; the only API it needs to expose externally is getAbstraction. Its interface prototype is
public String getAbstraction(String FileName, boolean FileMode, int Ratio)
The meaning of the FileName parameter depends on FileMode:
if FileMode = true, FileName is a file name;
otherwise it is the text of the document to be summarized itself.
The FileMode parameter is the mode switch.
Ratio is the extraction ratio, and only integers between 0 and 100 are allowed. The automatic summary generation step is an independent step with its own log and transaction handling module; whether summarization has finished does not affect downloading or indexing.
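The following sketch restates the interface prototype and how a caller would use both modes; the interface name Abstractor, the demo class and the file path are placeholders invented for this illustration.

public class AbstractionDemo {
    // The prototype described above, wrapped in an illustrative interface.
    interface Abstractor {
        String getAbstraction(String FileName, boolean FileMode, int Ratio);
    }

    public static void demo(Abstractor impl) {
        // FileMode = true: the first argument is a file name; keep about 20% of the text.
        String fromFile = impl.getAbstraction("news/content/example.txt", true, 20);
        // FileMode = false: the first argument is the document text itself; keep about 30%.
        String fromText = impl.getAbstraction("......news body text......", false, 30);
        System.out.println(fromFile);
        System.out.println(fromText);
    }
}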
The system flow of the automatic summarization subsystem is shown in Figure 11.
Word segmentation uses a dictionary-free method based on word frequency; the new and the old algorithm share the same idea, with only some changes made to speed up segmentation. The word weight measures how likely a string is to be a word.
P(w) = F(w) × L(w)^c when F(w) > minFreq and L(w) > minLen; otherwise P(w) = 0. Here minFreq is the preset minimum frequency of occurrence of a word (usually ≥ 2), which suppresses strings that are not words; minLen is the preset minimum word length (usually ≥ 1), which keeps low-frequency words from being split apart; and c is a preset constant (usually ≥ 4), which keeps long words from being split apart.
Procedure: the whole text is treated as one string; substrings are taken from the beginning, every substring is weighted, and the highest-weighted ones are taken as words (which involves many useless scans). The system takes one string at a time and uses all files as the background, so the scanning takes comparatively long.
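The weighting P(w) = F(w) × L(w)^c can be sketched as below: all substrings up to a maximum length are counted, and only those above the frequency and length thresholds keep a non-zero weight. Class and parameter names are illustrative.

import java.util.*;

public class NoDictSegmenter {
    // Weight every candidate substring by P(w) = F(w) * L(w)^c; candidates with
    // F(w) <= minFreq or L(w) <= minLen get weight 0 and are dropped.
    public static Map<String, Double> candidateWeights(String text,
            int maxLen, int minFreq, int minLen, int c) {
        Map<String, Integer> freq = new HashMap<>();
        for (int i = 0; i < text.length(); i++) {
            for (int j = i + 1; j <= Math.min(i + maxLen, text.length()); j++) {
                freq.merge(text.substring(i, j), 1, Integer::sum);
            }
        }
        Map<String, Double> weights = new HashMap<>();
        for (Map.Entry<String, Integer> e : freq.entrySet()) {
            int f = e.getValue();
            int len = e.getKey().length();
            if (f > minFreq && len > minLen) {
                weights.put(e.getKey(), f * Math.pow(len, c));
            }
        }
        return weights;
    }
}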
Feature word extraction is based on word frequency, measured relative to the word frequencies of a background knowledge base.
Algorithm: the weight of a candidate feature word is computed from the following quantities:
F(w) is the frequency of occurrence of the word;
L(w) is the length of the word;
numdoc is the number of times the word occurs in the present document;
advnumdoc is its average number of occurrences over all documents;
D is the preset minimum word length.
There are two reasons for modifying the algorithm:
1. The original algorithm must use a large background corpus (BWID), which costs the system much more time and space; the new algorithm instead computes relative statistics from the occurrence counts within the corpus itself.
2. The new algorithm is also theoretically convincing. Because the background corpus is broad, common words occur very frequently, so for them numdoc/advnumdoc is roughly equal; a feature word, on the other hand, usually occurs many times in the present document but not nearly as often in the BWID, so on average numdoc/advnumdoc becomes large and the feature word receives a large weight. Details are shown in Figure 12.
The relationship between sentence importance and summary generation:
The weight of every sentence is computed by a formula based on the following quantities:
Ti is the weight of the words that make up the sentence;
S0 is the total number of words in the sentence;
S1 is the number of clauses in the sentence;
S2 is the number of numerals;
m is an integer constant, usually 1.
The stored original text of an example document is shown in Figure 13.
The article after summarization is shown in Figure 14.
3. Full-text index generation step:
This step builds a full-text index over all news content files that have been downloaded and whose content has been extracted; indexing is carried out in the background in real time. It can also be an independent step, and the only interface parameter it needs is a file name.
The flow of the full-text index generation step is shown in Figure 15 and comprises the following sub-steps:
input step: the next file name is passed in;
index-check step: determine whether the file has already been indexed; if so, return to the input step, otherwise proceed to the next step;
filtering step: filter out all garbage and meaningless words;
dictionary segmentation step: perform dictionary-matching word segmentation;
n-gram segmentation step: perform n-gram segmentation so that words the dictionary segmentation failed to separate completely are still covered;
update step: for each word, update the relevant index files, including the keyword, date and category indexes.
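The n-gram fallback can be as simple as the sketch below, which slides a window of n characters over the text; bigrams (n = 2) are a common choice for Chinese. The class name is illustrative.

import java.util.*;

public class NgramSegmenter {
    // Produce all character n-grams of the text so that strings the dictionary
    // segmentation missed are still indexed.
    public static List<String> ngrams(String text, int n) {
        List<String> grams = new ArrayList<>();
        for (int i = 0; i + n <= text.length(); i++) {
            grams.add(text.substring(i, i + n));
        }
        return grams;
    }
}

For instance, ngrams("网络信息", 2) yields 网络, 络信, 信息.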
4. Hierarchical text classification step: a new document is assigned to one class within a given hierarchy of categories. Each document can be assigned to only one class. Each class in the hierarchy is associated with many words and terms, and the classification algorithm itself is adjusted repeatedly over the hierarchy; a given term may therefore carry a large weight at one level of the hierarchy while being a stopword at another. The feature words of the extracted documents (financial news) are used as the terms and vocabulary in this system.
It comprises two parts, a hierarchy training step and a document classification step; hierarchy training is the preprocessing for document classification, and the category hierarchy is trained before classification.
1. Hierarchy training
The function of training is to collect a set of features (feature words) from the training documents and then assign feature weights to each node (category) in the hierarchy. In the document classification algorithm, the feature weights are used to compute category ranks for a new document.
Training consists of four steps:
1) Collect feature words from the leaf classes;
Within the hierarchy, among the feature words of the training documents (news) of each leaf class, only those that occur more than twice in a single training document or more than ten times in the training document set are collected; these are the words that finally appear in the summaries. The collected feature words represent the characteristics of the leaf class. When a leaf class belongs to a certain training document set, its parent class must also contain the features of that leaf class; the features of a non-leaf class consist of all the features of its child nodes together with the sums of the feature frequencies over all child nodes.
2) Hierarchy optimization step
Optimization resolves the competition between a category node and its parent category. Because a document (news item) can be assigned to only one category in the hierarchical organization of categories, the algorithm must decide the appropriate category for the document when categories compete.
It comprises the following steps:
collection step: collect all the features of a category;
feature check step: determine whether the frequency of a feature in the parent is larger than in this category; if so, go to the next step, otherwise do nothing;
successor lookup step: look up the feature lists of the successors and find the highest and the lowest frequency of the feature among the successors;
ratio check step: determine whether the ratio of the difference between the highest and the lowest frequency to the highest frequency is larger than a threshold; if so, go to the next step, otherwise delete the feature from all successors so that only the parent retains it;
deletion step: delete the feature from every successor except the successor that has the highest frequency of the feature.
This rule finds the common features, i.e. features that the successors of a parent category possess together with their frequencies; whether a feature stays with the successors is decided from its highest and lowest frequency among them. A common feature is deleted from all successors except the one that contains it with the highest frequency. In this way the features and frequencies of all leaf categories are passed up to the categories above them, except to the root category, where they take no part in any document rank computation.
The algorithm cannot simply delete a feature from the parent category while a child category retains it, because the feature may be needed to route a document to the parent category; if the document cannot reach the parent category, it cannot reach the child category either. Divergences at the lower levels (child categories) are therefore passed up to the levels above (parent categories).
3) Assign category feature weights: each feature of a category is given a weight; a higher weight means the feature is more important to the category. In every category all features are assigned weights defined by:
W_fc = λ + (1 − λ) × N_fc / M_c
where f ranges over the existing features, c is the category, W_fc is the weight assigned to the feature, λ is a parameter currently set to 0.4, N_fc is the number of times f occurs in c, and M_c is the maximum frequency of any feature in c.
When a feature occurs only in sibling categories but not in c itself, it is assigned a negative weight, and the feature with its negative weight is added to the feature list of c. The negative weight is defined by:
W_fc = −(λ + (1 − λ) × N_fp / M_c)
where f ranges over the existing features, c is the category, W_fc is the weight assigned to the feature, λ is a parameter currently set to 0.4, N_fp is the number of times f occurs in the parent category of c, and M_c is the maximum frequency of any feature in the parent category of c.
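The positive weight assignment W_fc = λ + (1 − λ) × N_fc / M_c can be sketched as follows for one category, given its feature frequency table; negative weights for features seen only in siblings would be added analogously. The class name and the map-based representation are assumptions of this sketch.

import java.util.*;

public class FeatureWeights {
    // categoryFreq maps each feature f to N_fc; M_c is the largest value in the map.
    public static Map<String, Double> assign(Map<String, Integer> categoryFreq, double lambda) {
        int mc = 0;
        for (int n : categoryFreq.values()) mc = Math.max(mc, n);
        Map<String, Double> weights = new HashMap<>();
        if (mc == 0) return weights;
        for (Map.Entry<String, Integer> e : categoryFreq.entrySet()) {
            weights.put(e.getKey(), lambda + (1 - lambda) * e.getValue() / (double) mc);
        }
        return weights;
    }
}

Calling assign(freq, 0.4) uses the value of λ given above.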
4) Filter the feature list of each category: each category's feature list is filtered, and only the top 200 positive features and the top 200 negative features are kept in the final feature list of the category, whether it is a parent category or a leaf category; the other features are discarded. Limiting the number of features reduces the computational cost of classifying a document.
2. Document classification method: after the hierarchy has been trained, a document can be classified into a category; classification starts from the root category. All sub-categories of the root are assigned ranks computed by the following equation:
R_cd = Σ_f N_fd × W_fc
where c is a category, d is a document, f ranges over the features occurring in d, R_cd is the rank of c, N_fd is the number of times f occurs in d, and W_fc is the weight of f in category c.
If the ranks of all sub-categories are zero or negative, d remains in the root category. If among the sub-categories there is a category with the largest positive rank, that category is selected. If it is a leaf category, document d is assigned to it; if the selected category is not a leaf, the computation continues over its sub-categories. Document d can therefore end up in a leaf category or in an internal category.
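The rank computation and the top-down descent can be sketched as follows; Category, rank and classify are illustrative names, and the document is represented simply as a feature-to-count map.

import java.util.*;

public class HierarchicalClassifier {
    public static class Category {
        public String name;
        public Map<String, Double> featureWeights = new HashMap<>();   // W_fc
        public List<Category> children = new ArrayList<>();
    }

    // R_cd = sum over the features f of d of N_fd * W_fc (negative weights included).
    static double rank(Category c, Map<String, Integer> docFeatures) {
        double r = 0;
        for (Map.Entry<String, Integer> e : docFeatures.entrySet()) {
            Double w = c.featureWeights.get(e.getKey());
            if (w != null) r += e.getValue() * w;
        }
        return r;
    }

    // Start at the root and descend into the best positively ranked child until
    // no child has a positive rank or a leaf category is reached.
    public static Category classify(Category root, Map<String, Integer> docFeatures) {
        Category current = root;
        while (!current.children.isEmpty()) {
            Category best = null;
            double bestRank = 0;
            for (Category child : current.children) {
                double r = rank(child, docFeatures);
                if (r > bestRank) { bestRank = r; best = child; }
            }
            if (best == null) break;   // all children zero or negative: stay here
            current = best;
        }
        return current;
    }
}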
5. News query step: as shown in Figure 16, it comprises the following sub-steps:
submission step: the user submits the query conditions;
search step: the index is searched and a result set is obtained;
return step: the results are returned to the user.
The preceding steps only implement the background work of automatic download, automatic summarization and index building; the news query subsystem implements the interaction with the user, allowing the user to perform news queries in the foreground, including news keyword queries, news category queries, news date queries, news source queries and so on.
6. Log and transaction handling step:
A running program often terminates abnormally, for example through a sudden crash or a sudden power failure.
In such cases the integrity of the background data must be guaranteed; for example, the index must remain complete, so that even if the program is terminated halfway through, the next run can still restore the existing index results and resume indexing from the position where it failed.
Likewise, the downloading and summarization work must be recorded, to avoid repeating work and to save time.
Functions of the log file system:
1. When the URL analysis module of the download thread analyzes URLs, it first reads the count file and loads the two most recent log files, which are used to decide whether a page has already been downloaded.
2. Whenever a news content page is downloaded, its URL is stored in the most recent log file.
3. During indexing, the index position information is read first, then the information of the log files that need to be indexed; the corresponding content files are then indexed, and the index position information in the index log file is updated at the same time.
4. During summarization, the summary position information is read first, then the information of the log files that need to be summarized; the corresponding content files are then summarized, and the summary position information in the summary log file is updated at the same time.
5. Whenever the source code of a file has been downloaded, its content analyzed, its summary produced, or its indexing completed, the work is recorded, so that an accident does not leave the system in an unrecoverable state and duplicate work is avoided.
6. The three threads for downloading, summarization and indexing never stop; even if a task has been completed (for example, summarization has finished), the summarization log file is reloaded and summarization starts again.
7. Management step:
The management step mainly implements data management on the local machine: category management, news source management, index updates after data deletion, log updates, and so on.
String tokenstr=null; The string value int tokenloc=0 of this token of // description; The position int gbnum=0 of // this token in original; The Chinese character quantity boolean iskeentag=false that has among // this token; // whether be the intimate token Float of a content keenvalue=0 fully; // more special the method that has with the intimate degree Token of content: public boolean ishref () { String flag 1=" href="; Int flag2=-1; If (tokenstr. index Of (flag1)==flag2) return false; Else return true;String tokenstr=null; The string value int tokenloc=0 of this token of // description; The position int gbnum=0 of // this token in original; The Chinese character quantity boolean iskeentag=false that has among // this token; // whether be the intimate token Float of a content keenvalue=0 fully; // more special the method that has with the intimate degree Token of content: public boolean ishref () { String flag 1=" href="; Int flag2=-1; If (tokenstr. index Of (flag1)=flag2) return false; Else return true;| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CNA031093388ACN1536483A (en) | 2003-04-04 | 2003-04-04 | Method and system for extracting and processing network information |
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CNA031093388ACN1536483A (en) | 2003-04-04 | 2003-04-04 | Method and system for extracting and processing network information |
| Publication Number | Publication Date |
|---|---|
| CN1536483Atrue CN1536483A (en) | 2004-10-13 |
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CNA031093388APendingCN1536483A (en) | 2003-04-04 | 2003-04-04 | Method and system for extracting and processing network information |
| Country | Link |
|---|---|
| CN (1) | CN1536483A (en) |
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN100336056C (en)* | 2005-01-07 | 2007-09-05 | 清华大学 | Technological term extracting, law-analysing and reusing method based no ripe technogical file |
| CN100399330C (en)* | 2005-03-23 | 2008-07-02 | 腾讯科技(深圳)有限公司 | System for managing World Wide Web media in World Wide Web pages and its implementation method |
| WO2008131597A1 (en)* | 2007-04-29 | 2008-11-06 | Haitao Lin | Search engine and method for filtering agency information |
| CN100433018C (en)* | 2007-03-13 | 2008-11-12 | 白云 | Method for criminating electronci file and relative degree with certain field and application thereof |
| CN100444591C (en)* | 2006-08-18 | 2008-12-17 | 北京金山软件有限公司 | Method and Application System for Obtaining Web Page Keywords |
| CN100462972C (en)* | 2005-12-08 | 2009-02-18 | 国际商业机器公司 | Document-based information and uniform resource locator (URL) management method and device |
| WO2009021429A1 (en)* | 2007-08-13 | 2009-02-19 | Tencent Technology (Shenzhen) Company Limited | Method and device for dealing with the instant messaging information |
| CN100592293C (en)* | 2007-04-28 | 2010-02-24 | 李树德 | Knowledge search engine based on intelligent ontology and implementation method thereof |
| CN101035128B (en)* | 2007-04-18 | 2010-04-21 | 大连理工大学 | Recognition and filtering method of triple webpage text content based on Chinese punctuation marks |
| CN101203847B (en)* | 2005-03-11 | 2010-05-19 | 雅虎公司 | System and method for managing listings |
| CN101231641B (en)* | 2007-01-22 | 2010-05-19 | 北大方正集团有限公司 | A method and system for automatically analyzing the dissemination process of hot topics on the Internet |
| CN1786965B (en)* | 2005-12-21 | 2010-05-26 | 北大方正集团有限公司 | A Method for Extracting Text Information of News Web Pages |
| CN1858737B (en)* | 2006-01-25 | 2010-06-02 | 华为技术有限公司 | Method and system for data search |
| CN101196935B (en)* | 2008-01-03 | 2010-06-09 | 中兴通讯股份有限公司 | System and method for creating index database |
| CN101192220B (en)* | 2006-11-21 | 2010-09-15 | 财团法人资讯工业策进会 | Label construction method and system suitable for resource search |
| CN101140578B (en)* | 2006-09-06 | 2010-12-08 | 鸿富锦精密工业(深圳)有限公司 | System and method for multi-thread analysis of web page data |
| CN101984435A (en)* | 2010-11-17 | 2011-03-09 | 百度在线网络技术(北京)有限公司 | Method and device for distributing texts |
| US7925621B2 (en) | 2003-03-24 | 2011-04-12 | Microsoft Corporation | Installing a solution |
| CN101128819B (en)* | 2004-12-30 | 2011-06-22 | 谷歌公司 | Partial Item Extraction |
| CN102117317A (en)* | 2010-12-28 | 2011-07-06 | 北京航空航天大学 | Blind person Internet system based on voice technology |
| CN102118400A (en)* | 2009-12-31 | 2011-07-06 | 北京四维图新科技股份有限公司 | Data acquisition method and system |
| US7979803B2 (en) | 2006-03-06 | 2011-07-12 | Microsoft Corporation | RSS hostable control |
| US7979856B2 (en) | 2000-06-21 | 2011-07-12 | Microsoft Corporation | Network-based software extensions |
| CN102236654A (en)* | 2010-04-26 | 2011-11-09 | 广东开普互联信息科技有限公司 | Web Invalid Link Filtering Method Based on Content Correlation |
| CN101526938B (en)* | 2008-03-06 | 2011-12-28 | 夏普株式会社 | File processing device |
| CN102385570A (en)* | 2010-08-31 | 2012-03-21 | 国际商业机器公司 | Method and system for matching fonts |
| CN102446191A (en)* | 2010-10-13 | 2012-05-09 | 北京创新方舟科技有限公司 | Method for generating webpage content abstracts and equipment and system adopting same |
| CN102446311A (en)* | 2010-10-15 | 2012-05-09 | 商业对象软件有限公司 | Business intelligence technology for process driving |
| CN101180624B (en)* | 2004-10-28 | 2012-05-09 | 雅虎公司 | Link-based spam detection |
| CN102460437A (en)* | 2009-06-26 | 2012-05-16 | 乐天株式会社 | Information search device, information search method, information search program, and recording medium having information search program recorded thereon |
| CN102521313A (en)* | 2011-12-01 | 2012-06-27 | 北京大学 | Static index pruning method based on web page quality |
| CN101055581B (en)* | 2006-04-13 | 2012-07-04 | Lg电子株式会社 | Document management system and method |
| CN102592039A (en)* | 2011-01-18 | 2012-07-18 | 四川火狐无线科技有限公司 | Interaction method for processing catering and entertainment service data and device and system for realizing same |
| CN101751438B (en)* | 2008-12-17 | 2012-08-22 | 中国科学院自动化研究所 | Theme webpage filter system for driving self-adaption semantics |
| US8280843B2 (en) | 2006-03-03 | 2012-10-02 | Microsoft Corporation | RSS data-processing object |
| CN102812475A (en)* | 2009-12-24 | 2012-12-05 | 梅塔瓦纳股份有限公司 | System And Method For Determining Sentiment Expressed In Documents |
| CN102902757A (en)* | 2012-09-25 | 2013-01-30 | 姚明东 | Automatic generation method of e-commerce dictionary |
| CN102945246A (en)* | 2012-09-28 | 2013-02-27 | 北界创想(北京)软件有限公司 | Method and device for processing network information data |
| CN102955791A (en)* | 2011-08-23 | 2013-03-06 | 句容今太科技园有限公司 | Searching and classifying service system for network information |
| US8429522B2 (en) | 2003-08-06 | 2013-04-23 | Microsoft Corporation | Correlation, association, or correspondence of electronic forms |
| CN103149840A (en)* | 2013-02-01 | 2013-06-12 | 西北工业大学 | Semantic service combination method based on dynamic programming |
| CN103150632A (en)* | 2013-03-13 | 2013-06-12 | 河海大学 | Structuring method for flood control and drought control bulletin generation system based on water conservancy cloud platform |
| CN103488750A (en)* | 2013-09-24 | 2014-01-01 | 长沙裕邦软件开发有限公司 | Implementation method and system of network robot |
| US8661459B2 (en) | 2005-06-21 | 2014-02-25 | Microsoft Corporation | Content syndication platform |
| US8751936B2 (en) | 2005-06-21 | 2014-06-10 | Microsoft Corporation | Finding and consuming web subscriptions in a web browser |
| CN103853834A (en)* | 2014-03-12 | 2014-06-11 | 华东师范大学 | Text structure analysis-based Web document abstract generation method |
| CN104008126A (en)* | 2014-03-31 | 2014-08-27 | 北京奇虎科技有限公司 | Method and device for segmentation on basis of webpage content classification |
| US8892993B2 (en) | 2003-08-01 | 2014-11-18 | Microsoft Corporation | Translation file |
| US8918729B2 (en) | 2003-03-24 | 2014-12-23 | Microsoft Corporation | Designing electronic forms |
| CN104424308A (en)* | 2013-09-04 | 2015-03-18 | 中兴通讯股份有限公司 | Web page classification standard acquisition method and device and web page classification method and device |
| CN104657347A (en)* | 2015-02-06 | 2015-05-27 | 北京中搜网络技术股份有限公司 | News optimized reading mobile application-oriented automatic summarization method |
| CN105005563A (en)* | 2014-04-15 | 2015-10-28 | 腾讯科技(深圳)有限公司 | Abstract generation method and apparatus |
| US9210234B2 (en) | 2005-12-05 | 2015-12-08 | Microsoft Technology Licensing, Llc | Enabling electronic documents for limited-capability computing devices |
| US9229917B2 (en) | 2003-03-28 | 2016-01-05 | Microsoft Technology Licensing, Llc | Electronic form user interfaces |
| CN105760500A (en)* | 2009-11-10 | 2016-07-13 | 启创互联公司 | System, method and computer program for creating and manipulating data structures using an interactive graphical interface |
| CN106383887A (en)* | 2016-09-22 | 2017-02-08 | 深圳市博安达信息技术股份有限公司 | Environment-friendly news data acquisition and recommendation display method and system |
| US10146843B2 (en) | 2009-11-10 | 2018-12-04 | Primal Fusion Inc. | System, method and computer program for creating and manipulating data structures using an interactive graphical interface |
| CN109086361A (en)* | 2018-07-20 | 2018-12-25 | 北京开普云信息科技有限公司 | A kind of automatic abstracting method of webpage article information and system based on mutual information between web page nodes |
| CN112115259A (en)* | 2020-06-17 | 2020-12-22 | 上海金融期货信息技术有限公司 | A feature word-driven text multi-label hierarchical classification method and system |
| CN113190644A (en)* | 2021-05-24 | 2021-07-30 | 浪潮软件科技有限公司 | Method and device for hot updating search engine word segmentation dictionary |
| CN113486279A (en)* | 2021-06-29 | 2021-10-08 | 平安信托有限责任公司 | Automatic news generation method, device, equipment and storage medium |
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US7979856B2 (en) | 2000-06-21 | 2011-07-12 | Microsoft Corporation | Network-based software extensions |
| US8918729B2 (en) | 2003-03-24 | 2014-12-23 | Microsoft Corporation | Designing electronic forms |
| US7925621B2 (en) | 2003-03-24 | 2011-04-12 | Microsoft Corporation | Installing a solution |
| US9229917B2 (en) | 2003-03-28 | 2016-01-05 | Microsoft Technology Licensing, Llc | Electronic form user interfaces |
| US8892993B2 (en) | 2003-08-01 | 2014-11-18 | Microsoft Corporation | Translation file |
| US9239821B2 (en) | 2003-08-01 | 2016-01-19 | Microsoft Technology Licensing, Llc | Translation file |
| US8429522B2 (en) | 2003-08-06 | 2013-04-23 | Microsoft Corporation | Correlation, association, or correspondence of electronic forms |
| US9268760B2 (en) | 2003-08-06 | 2016-02-23 | Microsoft Technology Licensing, Llc | Correlation, association, or correspondence of electronic forms |
| CN101180624B (en)* | 2004-10-28 | 2012-05-09 | 雅虎公司 | Link-based spam detection |
| US8433704B2 (en) | 2004-12-30 | 2013-04-30 | Google Inc. | Local item extraction |
| JP2011129154A (en)* | 2004-12-30 | 2011-06-30 | Google Inc | Local item extraction |
| CN101128819B (en)* | 2004-12-30 | 2011-06-22 | 谷歌公司 | Partial Item Extraction |
| CN100336056C (en)* | 2005-01-07 | 2007-09-05 | 清华大学 | Technological term extracting, law-analysing and reusing method based on ripe technological file |
| CN101203847B (en)* | 2005-03-11 | 2010-05-19 | 雅虎公司 | System and method for managing listings |
| CN100399330C (en)* | 2005-03-23 | 2008-07-02 | 腾讯科技(深圳)有限公司 | System for managing World Wide Web media in World Wide Web pages and its implementation method |
| US9104773B2 (en) | 2005-06-21 | 2015-08-11 | Microsoft Technology Licensing, Llc | Finding and consuming web subscriptions in a web browser |
| US9762668B2 (en) | 2005-06-21 | 2017-09-12 | Microsoft Technology Licensing, Llc | Content syndication platform |
| US9894174B2 (en) | 2005-06-21 | 2018-02-13 | Microsoft Technology Licensing, Llc | Finding and consuming web subscriptions in a web browser |
| US8751936B2 (en) | 2005-06-21 | 2014-06-10 | Microsoft Corporation | Finding and consuming web subscriptions in a web browser |
| US8661459B2 (en) | 2005-06-21 | 2014-02-25 | Microsoft Corporation | Content syndication platform |
| US8832571B2 (en) | 2005-06-21 | 2014-09-09 | Microsoft Corporation | Finding and consuming web subscriptions in a web browser |
| US9210234B2 (en) | 2005-12-05 | 2015-12-08 | Microsoft Technology Licensing, Llc | Enabling electronic documents for limited-capability computing devices |
| CN100462972C (en)* | 2005-12-08 | 2009-02-18 | 国际商业机器公司 | Document-based information and uniform resource locator (URL) management method and device |
| CN1786965B (en)* | 2005-12-21 | 2010-05-26 | 北大方正集团有限公司 | A Method for Extracting Text Information of News Web Pages |
| CN1858737B (en)* | 2006-01-25 | 2010-06-02 | 华为技术有限公司 | Method and system for data search |
| US8768881B2 (en) | 2006-03-03 | 2014-07-01 | Microsoft Corporation | RSS data-processing object |
| US8280843B2 (en) | 2006-03-03 | 2012-10-02 | Microsoft Corporation | RSS data-processing object |
| US7979803B2 (en) | 2006-03-06 | 2011-07-12 | Microsoft Corporation | RSS hostable control |
| CN101055581B (en)* | 2006-04-13 | 2012-07-04 | Lg电子株式会社 | Document management system and method |
| CN100444591C (en)* | 2006-08-18 | 2008-12-17 | 北京金山软件有限公司 | Method and Application System for Obtaining Web Page Keywords |
| CN101140578B (en)* | 2006-09-06 | 2010-12-08 | 鸿富锦精密工业(深圳)有限公司 | System and method for multi-thread analysis of web page data |
| CN101192220B (en)* | 2006-11-21 | 2010-09-15 | 财团法人资讯工业策进会 | Label construction method and system suitable for resource search |
| CN101231641B (en)* | 2007-01-22 | 2010-05-19 | 北大方正集团有限公司 | A method and system for automatically analyzing the dissemination process of hot topics on the Internet |
| CN100433018C (en)* | 2007-03-13 | 2008-11-12 | 白云 | Method for discriminating electronic file and relative degree with certain field and application thereof |
| CN101035128B (en)* | 2007-04-18 | 2010-04-21 | 大连理工大学 | Recognition and filtering method of triple webpage text content based on Chinese punctuation marks |
| CN100592293C (en)* | 2007-04-28 | 2010-02-24 | 李树德 | Knowledge search engine based on intelligent ontology and implementation method thereof |
| WO2008131597A1 (en)* | 2007-04-29 | 2008-11-06 | Haitao Lin | Search engine and method for filtering agency information |
| WO2009021429A1 (en)* | 2007-08-13 | 2009-02-19 | Tencent Technology (Shenzhen) Company Limited | Method and device for dealing with the instant messaging information |
| US8204946B2 (en) | 2007-08-13 | 2012-06-19 | Tencent Technology (Shenzhen) Company Ltd. | Method and apparatus for processing instant messaging information |
| CN101196935B (en)* | 2008-01-03 | 2010-06-09 | 中兴通讯股份有限公司 | System and method for creating index database |
| CN101526938B (en)* | 2008-03-06 | 2011-12-28 | 夏普株式会社 | File processing device |
| CN101751438B (en)* | 2008-12-17 | 2012-08-22 | 中国科学院自动化研究所 | Theme webpage filter system for driving self-adaption semantics |
| CN102460437A (en)* | 2009-06-26 | 2012-05-16 | 乐天株式会社 | Information search device, information search method, information search program, and recording medium having information search program recorded thereon |
| CN102460437B (en)* | 2009-06-26 | 2014-10-15 | 乐天株式会社 | Information search device, information search method, information search program, and storage medium on which information search program has been stored |
| CN105760500A (en)* | 2009-11-10 | 2016-07-13 | 启创互联公司 | System, method and computer program for creating and manipulating data structures using an interactive graphical interface |
| CN105760500B (en)* | 2009-11-10 | 2019-08-09 | 启创互联公司 | System and method for being created using interactive graphics (IG) interface and manipulating data structure |
| US10146843B2 (en) | 2009-11-10 | 2018-12-04 | Primal Fusion Inc. | System, method and computer program for creating and manipulating data structures using an interactive graphical interface |
| CN102812475A (en)* | 2009-12-24 | 2012-12-05 | 梅塔瓦纳股份有限公司 | System And Method For Determining Sentiment Expressed In Documents |
| CN102118400A (en)* | 2009-12-31 | 2011-07-06 | 北京四维图新科技股份有限公司 | Data acquisition method and system |
| CN102118400B (en)* | 2009-12-31 | 2013-07-17 | 北京四维图新科技股份有限公司 | Data acquisition method and system |
| CN102236654A (en)* | 2010-04-26 | 2011-11-09 | 广东开普互联信息科技有限公司 | Web Invalid Link Filtering Method Based on Content Correlation |
| US9218325B2 (en) | 2010-08-31 | 2015-12-22 | International Business Machines Corporation | Quick font match |
| US9002877B2 (en) | 2010-08-31 | 2015-04-07 | International Business Machines Corporation | Quick font match |
| CN102385570A (en)* | 2010-08-31 | 2012-03-21 | 国际商业机器公司 | Method and system for matching fonts |
| CN102446191A (en)* | 2010-10-13 | 2012-05-09 | 北京创新方舟科技有限公司 | Method for generating webpage content abstracts and equipment and system adopting same |
| CN102446311B (en)* | 2010-10-15 | 2016-12-21 | 商业对象软件有限公司 | Process-driven business intelligence |
| CN102446311A (en)* | 2010-10-15 | 2012-05-09 | 商业对象软件有限公司 | Business intelligence technology for process driving |
| CN101984435A (en)* | 2010-11-17 | 2011-03-09 | 百度在线网络技术(北京)有限公司 | Method and device for distributing texts |
| CN101984435B (en)* | 2010-11-17 | 2012-10-10 | 百度在线网络技术(北京)有限公司 | Method and device for distributing texts |
| CN102117317B (en)* | 2010-12-28 | 2012-08-22 | 北京航空航天大学 | Blind person Internet system based on voice technology |
| CN102117317A (en)* | 2010-12-28 | 2011-07-06 | 北京航空航天大学 | Blind person Internet system based on voice technology |
| CN102592039A (en)* | 2011-01-18 | 2012-07-18 | 四川火狐无线科技有限公司 | Interaction method for processing catering and entertainment service data and device and system for realizing same |
| CN102955791A (en)* | 2011-08-23 | 2013-03-06 | 句容今太科技园有限公司 | Searching and classifying service system for network information |
| CN102521313A (en)* | 2011-12-01 | 2012-06-27 | 北京大学 | Static index pruning method based on web page quality |
| CN102902757A (en)* | 2012-09-25 | 2013-01-30 | 姚明东 | Automatic generation method of e-commerce dictionary |
| CN102902757B (en)* | 2012-09-25 | 2015-07-29 | 姚明东 | A kind of Automatic generation method of e-commerce dictionary |
| CN102945246A (en)* | 2012-09-28 | 2013-02-27 | 北界创想(北京)软件有限公司 | Method and device for processing network information data |
| CN103149840A (en)* | 2013-02-01 | 2013-06-12 | 西北工业大学 | Semantic service combination method based on dynamic programming |
| CN103149840B (en)* | 2013-02-01 | 2015-03-04 | 西北工业大学 | Semantic service combination method based on dynamic programming |
| CN103150632B (en)* | 2013-03-13 | 2016-03-16 | 河海大学 | Construction method of flood control and drought control bulletin generation system based on water conservancy cloud platform |
| CN103150632A (en)* | 2013-03-13 | 2013-06-12 | 河海大学 | Structuring method for flood control and drought control bulletin generation system based on water conservancy cloud platform |
| CN104424308A (en)* | 2013-09-04 | 2015-03-18 | 中兴通讯股份有限公司 | Web page classification standard acquisition method and device and web page classification method and device |
| CN103488750A (en)* | 2013-09-24 | 2014-01-01 | 长沙裕邦软件开发有限公司 | Implementation method and system of network robot |
| CN103853834A (en)* | 2014-03-12 | 2014-06-11 | 华东师范大学 | Text structure analysis-based Web document abstract generation method |
| CN103853834B (en)* | 2014-03-12 | 2017-02-08 | 华东师范大学 | Text structure analysis-based Web document abstract generation method |
| CN104008126A (en)* | 2014-03-31 | 2014-08-27 | 北京奇虎科技有限公司 | Method and device for segmentation on basis of webpage content classification |
| CN105005563A (en)* | 2014-04-15 | 2015-10-28 | 腾讯科技(深圳)有限公司 | Abstract generation method and apparatus |
| CN104657347A (en)* | 2015-02-06 | 2015-05-27 | 北京中搜网络技术股份有限公司 | News optimized reading mobile application-oriented automatic summarization method |
| CN106383887A (en)* | 2016-09-22 | 2017-02-08 | 深圳市博安达信息技术股份有限公司 | Environment-friendly news data acquisition and recommendation display method and system |
| CN106383887B (en)* | 2016-09-22 | 2023-04-07 | 深圳博沃智慧科技有限公司 | Method and system for collecting, recommending and displaying environment-friendly news data |
| CN109086361A (en)* | 2018-07-20 | 2018-12-25 | 北京开普云信息科技有限公司 | A kind of automatic abstracting method of webpage article information and system based on mutual information between web page nodes |
| CN109086361B (en)* | 2018-07-20 | 2019-06-21 | 北京开普云信息科技有限公司 | A kind of automatic abstracting method of webpage article information and system based on mutual information between web page nodes |
| CN112115259A (en)* | 2020-06-17 | 2020-12-22 | 上海金融期货信息技术有限公司 | A feature word-driven text multi-label hierarchical classification method and system |
| CN112115259B (en)* | 2020-06-17 | 2024-06-25 | 上海金融期货信息技术有限公司 | Text multi-label hierarchical classification method and system driven by feature words |
| CN113190644A (en)* | 2021-05-24 | 2021-07-30 | 浪潮软件科技有限公司 | Method and device for hot updating search engine word segmentation dictionary |
| CN113190644B (en)* | 2021-05-24 | 2023-01-13 | 浪潮软件科技有限公司 | Method and device for hot updating word segmentation dictionary of search engine |
| CN113486279A (en)* | 2021-06-29 | 2021-10-08 | 平安信托有限责任公司 | Automatic news generation method, device, equipment and storage medium |
| Publication | Title |
|---|---|
| CN1536483A (en) | Method and system for extracting and processing network information |
| CN1096038C (en) | Method and device for file retrieval based on Bayesian network |
| CN1145901C (en) | A Construction Method of Intelligent Decision Support Based on Information Mining |
| Glover et al. | Using web structure for classifying and describing web pages |
| CN1109982C (en) | Hypertext document retrieving apparatus for retrieving hypertext documents relating to each other |
| CN1669029A (en) | System and method for automatically discovering a hierarchy of concepts from a corpus of documents |
| CN1904896A (en) | Structured document processing apparatus, search apparatus, structured document system and method |
| CN1728141A (en) | Phrase-Based Search in Information Retrieval Systems |
| CN1269897A (en) | Methods and/or system for selecting data sets |
| CN1728140A (en) | Phrase-Based Indexing in Information Retrieval Systems |
| CN1535433A (en) | Category based, extensible and interactive system for document retrieval |
| CN1882943A (en) | Systems and methods for search processing using superunits |
| CN1871603A (en) | System and method for processing a query |
| CN1559044A (en) | Information analysis method and device |
| CN101055587A (en) | Search engine retrieving result reordering method based on user behavior information |
| CN1281191A (en) | Information retrieval method and information retrieval device |
| CN100504857C (en) | Filtering method and device for effectively extracting documents desired by searchers using learning data |
| CN115563313A (en) | Semantic retrieval system for literature and books based on knowledge graph |
| CN102200974A (en) | Unified information retrieval intelligent agent system and method for search engine |
| CN1942877A (en) | Information extraction system |
| CN1707476A (en) | Auxiliary translation searching engine system and method thereof |
| CN1786947A (en) | System, method and program for extracting web page core content based on web page layout |
| CN1750002A (en) | Method for providing research result |
| CN118535978A (en) | News analysis method and system based on multimodal large model |
| CN1265209A (en) | System for processing textual inputs using natural language processing techniques |
| Code | Title | Description |
|---|---|---|
| C06 | Publication | |
| PB01 | Publication | |
| C10 | Entry into substantive examination | |
| SE01 | Entry into force of request for substantive examination | |
| C02 | Deemed withdrawal of patent application after publication (patent law 2001) | |
| WD01 | Invention patent application deemed withdrawn after publication | Open date: 2004-10-13 |