



Technical Field
The present invention relates to a data processing method and system, and more particularly to a method and system for extracting and processing information on a computer network, especially online news.
Background Art
Today is an era of information explosion. With the rapid development of the Internet, more and more people obtain the latest information through the network.
Almost everyone has the habit of reading newspapers; individuals and enterprises with a pressing need for information, in particular, have to gather what they need from many newspapers. Almost all news can now be read on the Internet, and many people already obtain the latest news this way. However, simply reading news online does not save the time we need: we still have to read through a long article to learn what it is about, or browse many pages before finding the information we need. Moreover, online news is fleeting, and many people need to look up news from many days ago, or even from several months or a year ago. In such cases, simply browsing the web can no longer meet our requirements.
Traditional statistics-based automatic summarization methods generally use mathematical statistics to assign a weight to every word in the document, and the weight is usually computed from the word's frequency of occurrence in the article. Words that occur more frequently receive higher weights, and a word with a high weight is taken to be central to the article.
Sentence weights are derived from the word weights in turn: once every word has been weighted, the weight of each sentence can be computed, and sentences with higher weights better represent the central idea of the article. The highest-weighted sentences can then be used directly to produce the summary.
This method generates summaries very quickly, but words that occur frequently are not necessarily the central idea of the article, and since no grammatical analysis is performed, a summary pieced together from high-weight sentences is relatively hard to read.
However, more acceptable results can be achieved by improving the way weights are assigned and the way central sentences are selected.
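The word-frequency weighting and sentence scoring described above can be illustrated with a short sketch. The class and method names (FrequencySummarizer, summarize, score) are illustrative only, and the sketch assumes the input has already been split into sentences and words; it is not the implementation of the present invention.

import java.util.*;

public class FrequencySummarizer {
    // Weight each word by its raw frequency in the document, then score each
    // sentence by the sum of its word weights and keep the top-ranked sentences.
    public static List<String> summarize(List<List<String>> sentences, int topN) {
        Map<String, Integer> freq = new HashMap<>();
        for (List<String> sentence : sentences) {
            for (String w : sentence) {
                freq.merge(w, 1, Integer::sum);
            }
        }
        List<List<String>> ranked = new ArrayList<>(sentences);
        ranked.sort((a, b) -> Integer.compare(score(b, freq), score(a, freq)));
        List<String> summary = new ArrayList<>();
        for (List<String> s : ranked.subList(0, Math.min(topN, ranked.size()))) {
            summary.add(String.join("", s));
        }
        return summary;
    }

    private static int score(List<String> sentence, Map<String, Integer> freq) {
        int sum = 0;
        for (String w : sentence) {
            sum += freq.getOrDefault(w, 0);
        }
        return sum;
    }
}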
Automatic Chinese word segmentation is a necessary step in building a full-text index. Word segmentation means dividing a sentence or an article into its individual words. Unlike English, Chinese has no explicit segmentation marks; words vary in length, word definitions differ, and phenomena such as polysemy and synonymy exist, so automatic Chinese word segmentation is very difficult.
The more popular word segmentation methods are the following:
Forward maximum matching: the earliest segmentation method. Each time, the longest candidate taken forward from the current position (for example, six characters) is matched against the dictionary; if the match succeeds, segmentation continues from the next position, otherwise the last character is dropped and matching is retried (a minimal sketch of this method is given after this list).
High-frequency priority: this method is based on word-frequency statistics, the combination rules between characters and the treatment of ambiguous divisions. It improves segmentation efficiency, but it cannot resolve ambiguity, and the error rate is not reduced.
Neural network segmentation: works by imitating the parallel, distributed processing of the human brain and building a numerical model. Segmentation knowledge is stored implicitly and in distributed form inside the network, and the internal weights are adjusted through self-learning and training to obtain better segmentation results.
Expert system segmentation: following the expert system approach, this method separates the segmentation knowledge (both common-sense segmentation knowledge and the heuristic knowledge used to resolve ambiguous cuts, i.e. the ambiguity resolution rules) from the inference engine that carries out the segmentation, so that maintenance of the knowledge base and implementation of the inference engine are independent of each other. It can also detect overlapping-ambiguity and combination-ambiguity fields and has a certain self-learning ability.
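A minimal sketch of forward maximum matching follows, assuming a word dictionary held in a Set; the class name ForwardMaxMatch and the parameter maxLen are illustrative, not part of any existing implementation.

import java.util.*;

public class ForwardMaxMatch {
    // At each position take the longest dictionary word (at most maxLen
    // characters) starting there; if nothing matches, emit one character.
    public static List<String> segment(String text, Set<String> dict, int maxLen) {
        List<String> words = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            String match = null;
            int end = Math.min(i + maxLen, text.length());
            for (int j = end; j > i; j--) {        // shrink the candidate from the right
                String cand = text.substring(i, j);
                if (dict.contains(cand)) { match = cand; break; }
            }
            if (match == null) match = text.substring(i, i + 1);
            words.add(match);
            i += match.length();
        }
        return words;
    }
}

For example, with 计算机 and 网络 in the dictionary, segment("计算机网络", dict, 6) would return 计算机, 网络.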
Current full-text indexes generally use an inverted file as the index mechanism; the inverted file stores, for each term, the list of the document numbers in which it occurs.
For text retrieval, the most effective index structure is the inverted file: it is a collection of lists, one record per term t, and each record lists the identifiers of all documents d that contain the term.
The inverted file can be viewed as the transpose of the document-term frequency matrix, turning (d, t) into (t, d), because row-major access is more efficient than column-major access.
The index consists of three parts: the dictionary (invf.dict), the inverted file (invf), and the mapping file between the two (invf.idx). The index file structure is shown in Figure 2.
In the dictionary (invf.dict): for each distinct term t, the term string t, the number f_t of documents containing t, and the total number F_t of occurrences of t in the whole document collection are stored.
In the mapping file (invf.idx): for each distinct term t, a pointer to the start address of the corresponding inverted list is stored.
In the inverted file (invf): for each distinct term t, the identifier d (a sequential number) of every document containing t and the frequency f_{d,t} of t in that document are stored as a list of <d, f_{d,t}> pairs.
Together with the document weight array W_d, this suffices to support both Boolean queries and ranked queries.
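As an illustration of the <d, f_{d,t}> posting lists just described, the following sketch keeps the whole index in memory; a real inverted file would be compressed and stored on disk, and the class names used here are illustrative only.

import java.util.*;

public class InvertedIndex {
    // One posting <d, f_{d,t}>: a document identifier and the term's frequency in it.
    public static class Posting {
        public final int docId;
        public int freq;
        Posting(int docId, int freq) { this.docId = docId; this.freq = freq; }
    }

    // term t -> posting list: the (t, d) view of the document-term frequency matrix
    private final Map<String, List<Posting>> postings = new HashMap<>();

    // Add one document's terms; documents are expected to arrive in increasing docId order.
    public void add(int docId, List<String> terms) {
        for (String t : terms) {
            List<Posting> list = postings.computeIfAbsent(t, k -> new ArrayList<>());
            Posting last = list.isEmpty() ? null : list.get(list.size() - 1);
            if (last != null && last.docId == docId) {
                last.freq++;
            } else {
                list.add(new Posting(docId, 1));
            }
        }
    }

    // f_t is simply the length of the posting list of t.
    public List<Posting> lookup(String term) {
        return postings.getOrDefault(term, Collections.emptyList());
    }
}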
Summary of the Invention
The purpose of the present invention is to provide a method and system for network information extraction and processing which, using computer technology and natural language processing technology, can automatically download the latest news from designated sites every day, extract and classify the content, condense the full text by automatic summarization, store the full text in the system, and build a text index for efficient full-text retrieval later.
To achieve the above purpose, the technical solution of the present invention is as follows:
A method for extracting and processing network information, comprising the following steps:
1. News download step, comprising the following sub-steps:
a URL analysis step: the system specifies certain URLs, and the program automatically derives the final content URLs of the news from them, without building a site-specific URL module for each news website; using URL statistics and URL correlation analysis, a page containing links to the final news content is analyzed statistically to find the useful final URL addresses;
an automatic news page crawling step: all linked pages in the target address that match the URL format are downloaded;
a garbage filtering step: the downloaded news content pages are filtered to remove HTML tags and useless Chinese text, finally producing Chinese vector information;
an information extraction step: information is extracted from the Chinese vectors obtained above; in the first stage the title and body are extracted, and in a later stage feature extraction, correlation analysis, document classification, duplicate removal and so on are applied to the web news content;
2. Automatic summary generation step: word segmentation, feature word analysis and sentence importance analysis are performed, a summary is generated, and the summary is output;
3. Full-text index generation step: a full-text index is built over all news content files that have been downloaded and whose content has been extracted, comprising the following sub-steps:
an input step, in which the next file name is passed in;
an index-check step, in which it is determined whether the file has already been indexed; if so, return to the input step, otherwise proceed to the next step;
a filtering step, in which all garbage and meaningless words are filtered out;
a dictionary segmentation step, in which dictionary-matching word segmentation is performed;
an n-gram segmentation step, in which n-gram segmentation is performed so that words the dictionary segmentation failed to separate completely are still covered;
an update step, in which for each word the relevant index files, including the keyword, date and category indexes, are updated;
4. Hierarchical text classification step: a new document is assigned to one class within a given hierarchy of categories; each document can be assigned to only one class. Each class in the hierarchy is associated with many words and terms, and a given term may carry a large weight at one level of the hierarchy while being a stopword at another level. The feature words of the extracted documents (financial news) are used as the terms and vocabulary in this system. This step comprises a hierarchy training step and a document classification step;
Hierarchy training is the preprocessing for document classification: before classification, the category hierarchy is trained. The function of training is to collect a set of features (feature words) from the training documents and then assign feature weights to each node (category) in the hierarchy; in the document classification algorithm, the feature weights are used to compute category ranks for a new document;
In the document classification step, once the hierarchy has been trained, a document can be classified into a category; classification starts from the root category, and all sub-categories of the root are assigned ranks computed by the following equation:
R_cd = Σ_f N_fd × W_fc, where c is a category, d is a document, f ranges over the features occurring in d, R_cd is the rank of c, N_fd is the number of times f appears in d, and W_fc is the weight of f in category c;
If the ranks of all sub-categories are zero or negative, d remains in the root category; if among the sub-categories there is a category with the largest positive rank, that category is selected; if it is a leaf category, document d is assigned to it; if the selected category is not a leaf, the computation continues over its sub-categories; document d can therefore end up in a leaf category or an internal category.
A system for extracting and processing network information, comprising the following devices:
1. News download device, comprising the following devices:
a URL analysis device: the system specifies certain URLs, and the program automatically derives the final content URLs of the news from them, without building a site-specific URL module for each news website; using URL statistics and URL correlation analysis, a page containing links to the final news content is analyzed statistically to find the useful final URL addresses;
an automatic news page crawling device: all linked pages in the target address that match the URL format are downloaded;
a garbage filtering device: the downloaded news content pages are filtered to remove HTML tags and useless Chinese text, finally producing Chinese vector information;
an information extraction device: information is extracted from the Chinese vectors obtained above; in the first stage the title and body are extracted, and in a later stage feature extraction, correlation analysis, document classification, duplicate removal and so on are applied to the web news content;
2. Automatic summary generation device: performs word segmentation, feature word analysis and sentence importance analysis, generates the summary, and outputs the summary;
3. Full-text index generation device: builds a full-text index over all news content files that have been downloaded and whose content has been extracted, comprising the following devices:
an input device, which passes in the next file name;
an index-check device, which determines whether the file has already been indexed; if so, control returns to the input device, otherwise it proceeds to the next device;
a filtering device, which filters out all garbage and meaningless words;
a dictionary segmentation device, which performs dictionary-matching word segmentation;
an n-gram segmentation device, which performs n-gram segmentation so that words the dictionary segmentation failed to separate completely are still covered;
an update device, which updates for each word the relevant index files, including the keyword, date and category indexes;
4. Hierarchical text classification device: classifies a new document into one class within a given hierarchy of categories; each document can be assigned to only one class. Each class in the hierarchy is associated with many words and terms, and a given term may carry a large weight at one level of the hierarchy while being a stopword at another level. The feature words of the extracted documents (financial news) are used as the terms and vocabulary in this system. It comprises a hierarchy training device and a document classification device;
The hierarchy training device performs the preprocessing for document classification: before classification, the category hierarchy is trained. The function of training is to collect a set of features (feature words) from the training documents and then assign feature weights to each node (category) in the hierarchy; in the document classification algorithm, the feature weights are used to compute category ranks for a new document;
The document classification device, once the hierarchy has been trained, classifies a document into a category. Classification starts from the root category, and all sub-categories of the root are assigned ranks computed by the following equation:
R_cd = Σ_f N_fd × W_fc, where c is a category, d is a document, f ranges over the features occurring in d, R_cd is the rank of c, N_fd is the number of times f appears in d, and W_fc is the weight of f in category c;
If the ranks of all sub-categories are zero or negative, d remains in the root category; if among the sub-categories there is a category with the largest positive rank, that category is selected; if it is a leaf category, document d is assigned to it; if the selected category is not a leaf, the computation continues over its sub-categories; document d can therefore end up in a leaf category or an internal category.
With the above method and system, the latest news page source code can be downloaded automatically every day from the designated sections of the designated web sites; the downloaded HTML code can be analyzed to obtain the valuable news content; the extracted content can be condensed by automatic summarization; the extracted content can be segmented and indexed for retrieval; and the extracted content can be classified automatically.
Description of the Drawings
Figure 1 is a system structure diagram of an existing method and program for automatically downloading network information;
Figure 2 is the index file structure diagram of an existing network information processing method;
Figure 3 is a flowchart of the news download step in the network information extraction and processing method of the present invention;
Figure 4 is the news list page of the news center of People's Daily Online;
Figure 5 is a flowchart of the method for deriving the token stream;
Figure 6 is a page of the China.com finance channel;
Figure 7 is the page at http://www.chinahd.com/news/stock/2002-3/161628.htm;
Figure 8 is the source code of the page in Figure 7;
Figure 9 is a news page from the China.com finance channel;
Figure 10 shows the content information obtained by content analysis of the news page in Figure 9;
Figure 11 is a flowchart of the automatic summary generation method;
Figure 12 is an analysis diagram of the automatic summary generation method;
Figure 13 is the original text of the content storage example;
Figure 14 is the summary automatically generated according to the present invention;
Figure 15 is a flowchart of the full-text index generation step of the present invention;
Figure 16 is a flowchart of the news query step of the present invention.
Detailed Description of the Embodiments
The present invention is described in further detail below with reference to the accompanying drawings and embodiments:
We only handle the processes of automatic download and content analysis and do not build a matching model for each website; instead, a general algorithm for news sites is implemented, which determines which part is the news content from the frequency of the Chinese text and from the frequency and position of the content-intimate HTML tags. This is described in detail in the implementation below.
Since we need content of relatively high accuracy, from which information is extracted and delivered to end users, the robot does not need to perform deep recursive crawling. The specific automatic download method is described later.
For the sake of generality, we do not rely on the web page features of the text; we consider automatic summarization of the pure content based on a background corpus.
A method for extracting and processing network information comprises the following steps:
1. News download step: as shown in Figure 3, automatic news download consists of two parts, URL analysis and source code crawling. Thanks to Java's networking facilities, a connection can be opened to any resource on the Internet and turned into a stream, so that resources on the network can be handled just like local files.
1) URL analysis step:
The system specifies certain URLs, and the program automatically derives the final content URLs of the news from them, without building a site-specific URL module for each news website.
Using URL statistics and URL correlation analysis, a page containing links to the final news content is analyzed statistically to find the useful final URL addresses. For example, the program is given a certain number of URLs that have already been classified; each such URL should be a news list page, i.e. clicking a news link on that page opens a news content page.
Take People's Daily Online as an example: the page shown in Figure 4 is the news list page of its news center.
By analyzing this page, we can conclude that the URL format of the final pages is http://www.people.com.cn/GB/guoii/25/96/20020312/*.html, which is saved to the relevant final-URL-format file.
Token analysis of the HTML:
Making full use of object-oriented design in Java, each HTML source file is treated as an object, and a class named token is defined; a token describes one meaningful string in the HTML, and a urltoken class derived from token describes tokens whose form matches the URL format.
Thus, when the HTML source code is analyzed, each file is treated as an object, and every HTML tag, as well as every string between two HTML tags, is treated as one token.
Attributes of each token:
String tokenstr = null;      // the string value of this token
int tokenloc = 0;            // the position of this token in the original file
int gbnum = 0;               // the number of Chinese characters in this token
boolean iskeentag = false;   // whether this token is entirely a content-intimate tag
float keenvalue = 0;         // the degree of intimacy with the content
A more specific method of the token class:
public boolean ishref() {
    String flag1 = "href=";
    int flag2 = -1;
    if (tokenstr.indexOf(flag1) == flag2)
        return false;
    else
        return true;
}
This method determines whether the token is a URL (href) HTML tag.
In practice, analyzing the HTML source code with this object-oriented design and Java's stream concept, we built a token stream, and the results show that the approach works well:
1. The program structure is very clear, and the object-oriented design is plainly reflected.
2. The analysis works well and achieves high accuracy.
3. There is no need to define special analysis stop marks and the like for each website.
4. Any reasonably standard HTML code can be processed normally.
The method for deriving the token stream is shown in Figure 5.
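A minimal sketch of such a scanner is given below: it walks the HTML source once, turning every tag and every run of text between tags into a token and recording its position and Chinese character count. The class names TokenScanner and Token and the CJK range test are assumptions of this sketch only.

import java.util.*;

public class TokenScanner {
    public static class Token {
        public String tokenstr;   // the string value of this token
        public int tokenloc;      // position of the token in the source file
        public int gbnum;         // number of Chinese characters in the token
    }

    // Every "<...>" tag is one token; every run of text between tags is one token.
    public static List<Token> scan(String html) {
        List<Token> tokens = new ArrayList<>();
        int i = 0;
        while (i < html.length()) {
            int start = i;
            String str;
            if (html.charAt(i) == '<') {
                int close = html.indexOf('>', i);
                if (close < 0) close = html.length() - 1;
                str = html.substring(i, close + 1);
                i = close + 1;
            } else {
                int open = html.indexOf('<', i);
                if (open < 0) open = html.length();
                str = html.substring(i, open);
                i = open;
            }
            Token t = new Token();
            t.tokenstr = str;
            t.tokenloc = start;
            t.gbnum = countChinese(str);
            tokens.add(t);
        }
        return tokens;
    }

    private static int countChinese(String s) {
        int n = 0;
        for (char c : s.toCharArray()) {
            if (c >= 0x4E00 && c <= 0x9FFF) n++;   // CJK unified ideographs
        }
        return n;
    }
}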
For every news section of each site, the following feature items are defined:
the category the section belongs to, such as politics, industry or sports (these categories are also defined by the management module);
the server address the section belongs to, for example news.sina.com.cn;
the current directory the section belongs to (on a well-organized site, the news of one section is usually under one directory);
the path attribute of the section's list page, i.e. whether it is an absolute or a relative path.
URL analysis is implemented mainly by the two classes urlanalyse.class and contentanalyse.class, which carry out the analysis of the token stream.
Main analysis method: urlanalyse.class has a method geturl(String filename) that first converts the source code into a token stream and reads it in, and then puts every URL token of the right format, together with the first following token whose gbnum is not zero, into a cached HashMap; in general, the token with non-zero gbnum following a URL is the title of the news.
For example, the China.com finance channel page is shown in Figure 6.
After URL analysis, the following HashMap is obtained:
http://finance.china.com/zh_cn/news/financenews/10001254/20020506/10255883.html
十年投入6000亿 重庆要打造国际大都市 (600 billion to be invested over ten years: Chongqing to build itself into an international metropolis)
http://finance.china.com/zh_cn/news/financenews/10001254/20020506/10255882.html
1吨油赔四五百元 税控机拒收油票收现金 (a loss of four to five hundred yuan per ton of oil: tax-control machines refuse fuel coupons and take cash only)
http://finance.china.com/zh_cn/news/financenews/10001254/20020506/10255881.html
香港旅游业——经济复苏的一缕春风 (Hong Kong tourism: a breath of spring for the economic recovery)
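The idea of geturl can be sketched as follows, operating on the token strings produced by a scanner such as the one above; it pairs every href token with the first following token that contains Chinese characters, which is normally the headline. The class name UrlAnalyse and the helper names below are illustrative only.

import java.util.*;

public class UrlAnalyse {
    // Pair every link token with the next token containing Chinese characters.
    public static Map<String, String> getUrl(List<String> tokens) {
        Map<String, String> urlToTitle = new LinkedHashMap<>();
        for (int i = 0; i < tokens.size(); i++) {
            String t = tokens.get(i);
            int p = t.indexOf("href=");
            if (p < 0) continue;
            String url = stripHref(t.substring(p + 5));
            for (int j = i + 1; j < tokens.size(); j++) {
                if (chineseCount(tokens.get(j)) > 0) {
                    urlToTitle.put(url, tokens.get(j).trim());   // token after the link: the title
                    break;
                }
            }
        }
        return urlToTitle;
    }

    private static String stripHref(String s) {
        StringBuilder out = new StringBuilder();
        for (char c : s.toCharArray()) {
            if (c == '"' || c == '\'') continue;   // drop surrounding quotes
            if (c == '>' || c == ' ') break;       // end of the attribute value
            out.append(c);
        }
        return out.toString();
    }

    private static int chineseCount(String s) {
        int n = 0;
        for (char c : s.toCharArray()) {
            if (c >= 0x4E00 && c <= 0x9FFF) n++;
        }
        return n;
    }
}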
After obtaining these URL-title pairs, automatic crawling is started, and the source code of every analyzed URL page is fetched.
2) Automatic news page crawling step:
Each time the program starts, all pages linked from the target addresses that match the URL format are downloaded. No information extraction or other analysis is performed during downloading, so as not to add load and slow the download. Pages that have already been downloaded are not downloaded again. Downloading must take encoding factors such as GB and Big5 into account.
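A minimal sketch of fetching one page's source with Java's URL stream facilities is given below; the class name PageDownloader is illustrative, and the caller has to pass the right charset (GB2312, Big5, ...) for the site, as required above.

import java.io.*;
import java.net.*;
import java.nio.charset.Charset;

public class PageDownloader {
    // Open a connection to the page and read its source as text in the given charset.
    public static String fetch(String url, String charsetName) throws IOException {
        URLConnection conn = new URL(url).openConnection();
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), Charset.forName(charsetName)))) {
            StringBuilder sb = new StringBuilder();
            String line;
            while ((line = in.readLine()) != null) {
                sb.append(line).append('\n');
            }
            return sb.toString();
        }
    }
}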
3) Garbage filtering module:
This step performs garbage filtering on the downloaded news content pages, removing the HTML tags and useless Chinese text and finally producing Chinese vector information. It must run in a background thread while downloading proceeds. Later, weights and other information may be added to the resulting Chinese vectors (the weight is determined by the position where the text appears, the surrounding HTML tags and so on, and a certain number of documents are needed for familiarization and training).
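A simple version of such a filter can be sketched with regular expressions: scripts, styles and tags are stripped, and only lines that still contain Chinese characters are kept. The class name GarbageFilter and the exact filtering rules are assumptions of this sketch.

public class GarbageFilter {
    // Strip <script>/<style> blocks and all remaining tags, then keep only
    // the lines that still contain Chinese text.
    public static String filter(String html) {
        String text = html
                .replaceAll("(?is)<script.*?</script>", " ")
                .replaceAll("(?is)<style.*?</style>", " ")
                .replaceAll("(?s)<[^>]+>", " ")
                .replaceAll("&nbsp;", " ");
        StringBuilder out = new StringBuilder();
        for (String line : text.split("\n")) {
            if (line.matches(".*[\\u4e00-\\u9fff].*")) {
                out.append(line.trim()).append('\n');
            }
        }
        return out.toString();
    }
}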
4) Information extraction module:
Information is extracted from the Chinese vectors obtained above; in the first stage the title and body are extracted, and in a later stage feature extraction, correlation analysis, document classification, duplicate removal and so on are applied to the web news content, while keeping the method general and the accuracy high. The first-stage function can be achieved by simple means (e.g. the number of times the words in a**** occur in content b***c**d**). Which block is the body can be judged from the distance between sentences and from the HTML tags before and after (each tag carries a certain weight).
As shown in Figure 7 (source: http://www.chinahd.com/news/stock/2002-3/161628.htm), whose source code is shown in Figure 8, the pieces of content lie very close to each other, and the HTML tags between them are generally things like <p>, &nbsp; and <br> (paragraph, space, line break). The location of the content can therefore be judged from these distances and from the particularity of the tags.
News content extraction
Unlike traditional content extraction methods, we do not build a model for each website; in the program this is implemented mainly by contentanalyse.class and token.class.
The specific method is as follows (a sketch is given after this list):
1. First convert the file whose content is to be extracted into a concrete token stream;
2. Score the token stream according to content intimacy;
3. Take out the contiguous token run in which the number of Chinese (GB) characters is most concentrated and the intimacy is at the same time highest;
4. If the GB character count and the intimacy cannot satisfy the above requirements at the same time, the page is simply cancelled.
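These four steps can be sketched as follows, reusing the Token fields from the scanner sketch above: consecutive text tokens separated only by content-intimate tags such as <p>, <br> or &nbsp; are merged into runs, the run with the most Chinese characters is kept, and the page is cancelled when even that run is too short. The class name, the tag set and the minChars threshold are assumptions of this sketch.

import java.util.*;

public class ContentExtractor {
    private static final Set<String> KEEN_TAGS =
            new HashSet<>(Arrays.asList("<p>", "</p>", "<br>", "&nbsp;"));

    // Return the densest run of Chinese text, or null to signal "cancel".
    public static String extract(List<TokenScanner.Token> tokens, int minChars) {
        StringBuilder best = new StringBuilder();
        int bestCount = 0;
        StringBuilder cur = new StringBuilder();
        int curCount = 0;
        for (TokenScanner.Token t : tokens) {
            String tag = t.tokenstr.trim().toLowerCase();
            if (t.gbnum > 0) {                 // text token with Chinese content
                cur.append(t.tokenstr);
                curCount += t.gbnum;
            } else if (!tag.isEmpty() && !KEEN_TAGS.contains(tag)) {
                // any other tag breaks the current run
                if (curCount > bestCount) { best = cur; bestCount = curCount; }
                cur = new StringBuilder();
                curCount = 0;
            }
        }
        if (curCount > bestCount) { best = cur; bestCount = curCount; }
        return bestCount >= minChars ? best.toString() : null;
    }
}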
For example, a news page of the China.com finance channel is shown in Figure 9.
After content analysis, since China.com pages are relatively well structured, very high accuracy is generally achieved; the specific test data are described in detail later.
The content information obtained by content analysis is shown in Figure 10.
At storage time, all five parts of the news (source, category, downloadtime, title, content) are stored; they are the basis from which the keyword, date and other indexes are built, and they are also the source of the summary.
5) Management step: manages the news data stored on the local machine, e.g. deletion and updating.
2. Automatic summary generation step: the original document is first preprocessed, then word segmentation, feature word analysis and sentence importance analysis are performed, the summary is generated, and the summary is output.
The automatic summarization step can be an independent module; the only API it needs to expose externally is getAbstraction. Its interface prototype is
public String getAbstraction(String FileName, boolean FileMode, int Ratio)
The meaning of the FileName parameter depends on FileMode:
if FileMode = true, FileName is a file name;
otherwise it is the text of the document to be summarized itself.
The FileMode parameter is the mode switch.
Ratio is the extraction ratio, and only integers between 0 and 100 are allowed. The automatic summary generation step is an independent step with its own log and transaction handling module; whether summarization has finished does not affect downloading or indexing.
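The following sketch restates the interface prototype and how a caller would use both modes; the interface name Abstractor, the demo class and the file path are placeholders invented for this illustration.

public class AbstractionDemo {
    // The prototype described above, wrapped in an illustrative interface.
    interface Abstractor {
        String getAbstraction(String FileName, boolean FileMode, int Ratio);
    }

    public static void demo(Abstractor impl) {
        // FileMode = true: the first argument is a file name; keep about 20% of the text.
        String fromFile = impl.getAbstraction("news/content/example.txt", true, 20);
        // FileMode = false: the first argument is the document text itself; keep about 30%.
        String fromText = impl.getAbstraction("......news body text......", false, 30);
        System.out.println(fromFile);
        System.out.println(fromText);
    }
}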
The system flow of the automatic summarization subsystem is shown in Figure 11.
Word segmentation uses a dictionary-free method based on word frequency; the new and the old algorithm share the same idea, with only some changes made to speed up segmentation. The word weight measures how likely a string is to be a word.
P(w) = F(w) × L(w)^c when F(w) > minFreq and L(w) > minLen; otherwise P(w) = 0. Here minFreq is the preset minimum frequency of occurrence of a word (usually ≥ 2), which suppresses strings that are not words; minLen is the preset minimum word length (usually ≥ 1), which keeps low-frequency words from being split apart; and c is a preset constant (usually ≥ 4), which keeps long words from being split apart.
Procedure: the whole text is treated as one string; substrings are taken from the beginning, every substring is weighted, and the highest-weighted ones are taken as words (which involves many useless scans). The system takes one string at a time and uses all files as the background, so the scanning takes comparatively long.
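The weighting P(w) = F(w) × L(w)^c can be sketched as below: all substrings up to a maximum length are counted, and only those above the frequency and length thresholds keep a non-zero weight. Class and parameter names are illustrative.

import java.util.*;

public class NoDictSegmenter {
    // Weight every candidate substring by P(w) = F(w) * L(w)^c; candidates with
    // F(w) <= minFreq or L(w) <= minLen get weight 0 and are dropped.
    public static Map<String, Double> candidateWeights(String text,
            int maxLen, int minFreq, int minLen, int c) {
        Map<String, Integer> freq = new HashMap<>();
        for (int i = 0; i < text.length(); i++) {
            for (int j = i + 1; j <= Math.min(i + maxLen, text.length()); j++) {
                freq.merge(text.substring(i, j), 1, Integer::sum);
            }
        }
        Map<String, Double> weights = new HashMap<>();
        for (Map.Entry<String, Integer> e : freq.entrySet()) {
            int f = e.getValue();
            int len = e.getKey().length();
            if (f > minFreq && len > minLen) {
                weights.put(e.getKey(), f * Math.pow(len, c));
            }
        }
        return weights;
    }
}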
Feature word extraction is based on word frequency, measured relative to the word frequencies of a background knowledge base.
Algorithm: the weight of a candidate feature word is computed from the following quantities:
F(w) is the frequency of occurrence of the word;
L(w) is the length of the word;
numdoc is the number of times the word occurs in the present document;
advnumdoc is its average number of occurrences over all documents;
D is the preset minimum word length.
There are two reasons for modifying the algorithm:
1. The original algorithm must use a large background corpus (BWID), which costs the system much more time and space; the new algorithm instead computes relative statistics from the occurrence counts within the corpus itself.
2. The new algorithm is also theoretically convincing. Because the background corpus is broad, common words occur very frequently, so for them numdoc/advnumdoc is roughly equal; a feature word, on the other hand, usually occurs many times in the present document but not nearly as often in the BWID, so on average numdoc/advnumdoc becomes large and the feature word receives a large weight. Details are shown in Figure 12.
The relationship between sentence importance and summary generation:
The weight of every sentence is computed by a formula based on the following quantities:
Ti is the weight of the words that make up the sentence;
S0 is the total number of words in the sentence;
S1 is the number of clauses in the sentence;
S2 is the number of numerals;
m is an integer constant, usually 1.
The stored original text of an example document is shown in Figure 13.
The article after summarization is shown in Figure 14.
3. Full-text index generation step:
This step builds a full-text index over all news content files that have been downloaded and whose content has been extracted; indexing is carried out in the background in real time. It can also be an independent step, and the only interface parameter it needs is a file name.
The flow of the full-text index generation step is shown in Figure 15 and comprises the following sub-steps:
input step: the next file name is passed in;
index-check step: determine whether the file has already been indexed; if so, return to the input step, otherwise proceed to the next step;
filtering step: filter out all garbage and meaningless words;
dictionary segmentation step: perform dictionary-matching word segmentation;
n-gram segmentation step: perform n-gram segmentation so that words the dictionary segmentation failed to separate completely are still covered;
update step: for each word, update the relevant index files, including the keyword, date and category indexes.
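The n-gram fallback can be as simple as the sketch below, which slides a window of n characters over the text; bigrams (n = 2) are a common choice for Chinese. The class name is illustrative.

import java.util.*;

public class NgramSegmenter {
    // Produce all character n-grams of the text so that strings the dictionary
    // segmentation missed are still indexed.
    public static List<String> ngrams(String text, int n) {
        List<String> grams = new ArrayList<>();
        for (int i = 0; i + n <= text.length(); i++) {
            grams.add(text.substring(i, i + n));
        }
        return grams;
    }
}

For instance, ngrams("网络信息", 2) yields 网络, 络信, 信息.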
4. Hierarchical text classification step: a new document is assigned to one class within a given hierarchy of categories. Each document can be assigned to only one class. Each class in the hierarchy is associated with many words and terms, and the classification algorithm itself is adjusted repeatedly over the hierarchy; a given term may therefore carry a large weight at one level of the hierarchy while being a stopword at another. The feature words of the extracted documents (financial news) are used as the terms and vocabulary in this system.
It comprises two parts, a hierarchy training step and a document classification step; hierarchy training is the preprocessing for document classification, and the category hierarchy is trained before classification.
1. Hierarchy training
The function of training is to collect a set of features (feature words) from the training documents and then assign feature weights to each node (category) in the hierarchy. In the document classification algorithm, the feature weights are used to compute category ranks for a new document.
Training consists of four steps:
1) Collect feature words from the leaf classes;
Within the hierarchy, among the feature words of the training documents (news) of each leaf class, only those that occur more than twice in a single training document or more than ten times in the training document set are collected; these are the words that finally appear in the summaries. The collected feature words represent the characteristics of the leaf class. When a leaf class belongs to a certain training document set, its parent class must also contain the features of that leaf class; the features of a non-leaf class consist of all the features of its child nodes together with the sums of the feature frequencies over all child nodes.
2) Hierarchy optimization step
Optimization resolves the competition between a category node and its parent category. Because a document (news item) can be assigned to only one category in the hierarchical organization of categories, the algorithm must decide the appropriate category for the document when categories compete.
It comprises the following steps:
collection step: collect all the features of a category;
feature check step: determine whether the frequency of a feature in the parent is larger than in this category; if so, go to the next step, otherwise do nothing;
successor lookup step: look up the feature lists of the successors and find the highest and the lowest frequency of the feature among the successors;
ratio check step: determine whether the ratio of the difference between the highest and the lowest frequency to the highest frequency is larger than a threshold; if so, go to the next step, otherwise delete the feature from all successors so that only the parent retains it;
deletion step: delete the feature from every successor except the successor that has the highest frequency of the feature.
This rule finds the common features, i.e. features that the successors of a parent category possess together with their frequencies; whether a feature stays with the successors is decided from its highest and lowest frequency among them. A common feature is deleted from all successors except the one that contains it with the highest frequency. In this way the features and frequencies of all leaf categories are passed up to the categories above them, except to the root category, where they take no part in any document rank computation.
The algorithm cannot simply delete a feature from the parent category while a child category retains it, because the feature may be needed to route a document to the parent category; if the document cannot reach the parent category, it cannot reach the child category either. Divergences at the lower levels (child categories) are therefore passed up to the levels above (parent categories).
3) Assign category feature weights: each feature of a category is given a weight; a higher weight means the feature is more important to the category. In every category all features are assigned weights defined by:
W_fc = λ + (1 − λ) × N_fc / M_c
where f ranges over the existing features, c is the category, W_fc is the weight assigned to the feature, λ is a parameter currently set to 0.4, N_fc is the number of times f occurs in c, and M_c is the maximum frequency of any feature in c.
When a feature occurs only in sibling categories but not in c itself, it is assigned a negative weight, and the feature with its negative weight is added to the feature list of c. The negative weight is defined by:
W_fc = −(λ + (1 − λ) × N_fp / M_c)
where f ranges over the existing features, c is the category, W_fc is the weight assigned to the feature, λ is a parameter currently set to 0.4, N_fp is the number of times f occurs in the parent category of c, and M_c is the maximum frequency of any feature in the parent category of c.
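The positive weight assignment W_fc = λ + (1 − λ) × N_fc / M_c can be sketched as follows for one category, given its feature frequency table; negative weights for features seen only in siblings would be added analogously. The class name and the map-based representation are assumptions of this sketch.

import java.util.*;

public class FeatureWeights {
    // categoryFreq maps each feature f to N_fc; M_c is the largest value in the map.
    public static Map<String, Double> assign(Map<String, Integer> categoryFreq, double lambda) {
        int mc = 0;
        for (int n : categoryFreq.values()) mc = Math.max(mc, n);
        Map<String, Double> weights = new HashMap<>();
        if (mc == 0) return weights;
        for (Map.Entry<String, Integer> e : categoryFreq.entrySet()) {
            weights.put(e.getKey(), lambda + (1 - lambda) * e.getValue() / (double) mc);
        }
        return weights;
    }
}

Calling assign(freq, 0.4) uses the value of λ given above.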
4) Filter the feature list of each category: each category's feature list is filtered, and only the top 200 positive features and the top 200 negative features are kept in the final feature list of the category, whether it is a parent category or a leaf category; the other features are discarded. Limiting the number of features reduces the computational cost of classifying a document.
2. Document classification method: after the hierarchy has been trained, a document can be classified into a category; classification starts from the root category. All sub-categories of the root are assigned ranks computed by the following equation:
R_cd = Σ_f N_fd × W_fc
where c is a category, d is a document, f ranges over the features occurring in d, R_cd is the rank of c, N_fd is the number of times f occurs in d, and W_fc is the weight of f in category c.
If the ranks of all sub-categories are zero or negative, d remains in the root category. If among the sub-categories there is a category with the largest positive rank, that category is selected. If it is a leaf category, document d is assigned to it; if the selected category is not a leaf, the computation continues over its sub-categories. Document d can therefore end up in a leaf category or in an internal category.
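The rank computation and the top-down descent can be sketched as follows; Category, rank and classify are illustrative names, and the document is represented simply as a feature-to-count map.

import java.util.*;

public class HierarchicalClassifier {
    public static class Category {
        public String name;
        public Map<String, Double> featureWeights = new HashMap<>();   // W_fc
        public List<Category> children = new ArrayList<>();
    }

    // R_cd = sum over the features f of d of N_fd * W_fc (negative weights included).
    static double rank(Category c, Map<String, Integer> docFeatures) {
        double r = 0;
        for (Map.Entry<String, Integer> e : docFeatures.entrySet()) {
            Double w = c.featureWeights.get(e.getKey());
            if (w != null) r += e.getValue() * w;
        }
        return r;
    }

    // Start at the root and descend into the best positively ranked child until
    // no child has a positive rank or a leaf category is reached.
    public static Category classify(Category root, Map<String, Integer> docFeatures) {
        Category current = root;
        while (!current.children.isEmpty()) {
            Category best = null;
            double bestRank = 0;
            for (Category child : current.children) {
                double r = rank(child, docFeatures);
                if (r > bestRank) { bestRank = r; best = child; }
            }
            if (best == null) break;   // all children zero or negative: stay here
            current = best;
        }
        return current;
    }
}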
5. News query step: as shown in Figure 16, it comprises the following sub-steps:
submission step: the user submits the query conditions;
search step: the index is searched and a result set is obtained;
return step: the results are returned to the user.
The preceding steps only implement the background work of automatic download, automatic summarization and index building; the news query subsystem implements the interaction with the user, allowing the user to perform news queries in the foreground, including news keyword queries, news category queries, news date queries, news source queries and so on.
6. Log and transaction handling step:
A running program often terminates abnormally, for example through a sudden crash or a sudden power failure.
In such cases the integrity of the background data must be guaranteed; for example, the index must remain complete, so that even if the program is terminated halfway through, the next run can still restore the existing index results and resume indexing from the position where it failed.
Likewise, the downloading and summarization work must be recorded, to avoid repeating work and to save time.
Functions of the log file system:
1. When the URL analysis module of the download thread analyzes URLs, it first reads the count file and loads the two most recent log files, which are used to decide whether a page has already been downloaded.
2. Whenever a news content page is downloaded, its URL is stored in the most recent log file.
3. During indexing, the index position information is read first, then the information of the log files that need to be indexed; the corresponding content files are then indexed, and the index position information in the index log file is updated at the same time.
4. During summarization, the summary position information is read first, then the information of the log files that need to be summarized; the corresponding content files are then summarized, and the summary position information in the summary log file is updated at the same time.
5. Whenever the source code of a file has been downloaded, its content analyzed, its summary produced, or its indexing completed, the work is recorded, so that an accident does not leave the system in an unrecoverable state and duplicate work is avoided.
6. The three threads for downloading, summarization and indexing never stop; even if a task has been completed (for example, summarization has finished), the summarization log file is reloaded and summarization starts again.
7. Management step:
The management step mainly implements data management on the local machine: category management, news source management, index updates after data deletion, log updates, and so on.
String tokenstr=null; The string value int tokenloc=0 of this token of // description; The position int gbnum=0 of // this token in original; The Chinese character quantity boolean iskeentag=false that has among // this token; // whether be the intimate token Float of a content keenvalue=0 fully; // more special the method that has with the intimate degree Token of content: public boolean ishref () { String flag 1=" href="; Int flag2=-1; If (tokenstr. index Of (flag1)==flag2) return false; Else return true;String tokenstr=null; The string value int tokenloc=0 of this token of // description; The position int gbnum=0 of // this token in original; The Chinese character quantity boolean iskeentag=false that has among // this token; // whether be the intimate token Float of a content keenvalue=0 fully; // more special the method that has with the intimate degree Token of content: public boolean ishref () { String flag 1=" href="; Int flag2=-1; If (tokenstr. index Of (flag1)=flag2) return false; Else return true;| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CNA031093388ACN1536483A (en) | 2003-04-04 | 2003-04-04 | Method and system for extracting and processing network information |
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CNA031093388ACN1536483A (en) | 2003-04-04 | 2003-04-04 | Method and system for extracting and processing network information |
| Publication Number | Publication Date |
|---|---|
| CN1536483Atrue CN1536483A (en) | 2004-10-13 |
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CNA031093388APendingCN1536483A (en) | 2003-04-04 | 2003-04-04 | Method and system for extracting and processing network information |
| Country | Link |
|---|---|
| CN (1) | CN1536483A (en) |
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN100336056C (en)* | 2005-01-07 | 2007-09-05 | 清华大学 | Technological term extracting, law-analysing and reusing method based no ripe technogical file |
| CN100399330C (en)* | 2005-03-23 | 2008-07-02 | 腾讯科技(深圳)有限公司 | System for managing World Wide Web media in World Wide Web pages and its implementation method |
| WO2008131597A1 (en)* | 2007-04-29 | 2008-11-06 | Haitao Lin | Search engine and method for filtering agency information |
| CN100433018C (en)* | 2007-03-13 | 2008-11-12 | 白云 | Method for criminating electronci file and relative degree with certain field and application thereof |
| CN100444591C (en)* | 2006-08-18 | 2008-12-17 | 北京金山软件有限公司 | Method and Application System for Obtaining Web Page Keywords |
| CN100462972C (en)* | 2005-12-08 | 2009-02-18 | 国际商业机器公司 | Document-based information and uniform resource locator (URL) management method and device |
| WO2009021429A1 (en)* | 2007-08-13 | 2009-02-19 | Tencent Technology (Shenzhen) Company Limited | Method and device for dealing with the instant messaging information |
| CN100592293C (en)* | 2007-04-28 | 2010-02-24 | 李树德 | Knowledge search engine based on intelligent ontology and implementation method thereof |
| CN101035128B (en)* | 2007-04-18 | 2010-04-21 | 大连理工大学 | Recognition and filtering method of triple webpage text content based on Chinese punctuation marks |
| CN101203847B (en)* | 2005-03-11 | 2010-05-19 | 雅虎公司 | System and method for managing listings |
| CN101231641B (en)* | 2007-01-22 | 2010-05-19 | 北大方正集团有限公司 | A method and system for automatically analyzing the dissemination process of hot topics on the Internet |
| CN1786965B (en)* | 2005-12-21 | 2010-05-26 | 北大方正集团有限公司 | A Method for Extracting Text Information of News Web Pages |
| CN1858737B (en)* | 2006-01-25 | 2010-06-02 | 华为技术有限公司 | Method and system for data search |
| CN101196935B (en)* | 2008-01-03 | 2010-06-09 | 中兴通讯股份有限公司 | System and method for creating index database |
| CN101192220B (en)* | 2006-11-21 | 2010-09-15 | 财团法人资讯工业策进会 | Label construction method and system suitable for resource search |
| CN101140578B (en)* | 2006-09-06 | 2010-12-08 | 鸿富锦精密工业(深圳)有限公司 | System and method for multi-thread analysis of web page data |
| CN101984435A (en)* | 2010-11-17 | 2011-03-09 | 百度在线网络技术(北京)有限公司 | Method and device for distributing texts |
| US7925621B2 (en) | 2003-03-24 | 2011-04-12 | Microsoft Corporation | Installing a solution |
| CN101128819B (en)* | 2004-12-30 | 2011-06-22 | 谷歌公司 | Partial Item Extraction |
| CN102117317A (en)* | 2010-12-28 | 2011-07-06 | 北京航空航天大学 | Blind person Internet system based on voice technology |
| CN102118400A (en)* | 2009-12-31 | 2011-07-06 | 北京四维图新科技股份有限公司 | Data acquisition method and system |
| US7979803B2 (en) | 2006-03-06 | 2011-07-12 | Microsoft Corporation | RSS hostable control |
| US7979856B2 (en) | 2000-06-21 | 2011-07-12 | Microsoft Corporation | Network-based software extensions |
| CN102236654A (en)* | 2010-04-26 | 2011-11-09 | 广东开普互联信息科技有限公司 | Web Invalid Link Filtering Method Based on Content Correlation |
| CN101526938B (en)* | 2008-03-06 | 2011-12-28 | 夏普株式会社 | File processing device |
| CN102385570A (en)* | 2010-08-31 | 2012-03-21 | 国际商业机器公司 | Method and system for matching fonts |
| CN102446191A (en)* | 2010-10-13 | 2012-05-09 | 北京创新方舟科技有限公司 | Method for generating webpage content abstracts and equipment and system adopting same |
| CN102446311A (en)* | 2010-10-15 | 2012-05-09 | 商业对象软件有限公司 | Business intelligence technology for process driving |
| CN101180624B (en)* | 2004-10-28 | 2012-05-09 | 雅虎公司 | Link-based spam detection |
| CN102460437A (en)* | 2009-06-26 | 2012-05-16 | 乐天株式会社 | Information search device, information search method, information search program, and recording medium having information search program recorded thereon |
| CN102521313A (en)* | 2011-12-01 | 2012-06-27 | 北京大学 | Static index pruning method based on web page quality |
| CN101055581B (en)* | 2006-04-13 | 2012-07-04 | Lg电子株式会社 | Document management system and method |
| CN102592039A (en)* | 2011-01-18 | 2012-07-18 | 四川火狐无线科技有限公司 | Interaction method for processing catering and entertainment service data and device and system for realizing same |
| CN101751438B (en)* | 2008-12-17 | 2012-08-22 | 中国科学院自动化研究所 | Theme webpage filter system for driving self-adaption semantics |
| US8280843B2 (en) | 2006-03-03 | 2012-10-02 | Microsoft Corporation | RSS data-processing object |
| CN102812475A (en)* | 2009-12-24 | 2012-12-05 | 梅塔瓦纳股份有限公司 | System And Method For Determining Sentiment Expressed In Documents |
| CN102902757A (en)* | 2012-09-25 | 2013-01-30 | 姚明东 | Automatic generation method of e-commerce dictionary |
| CN102945246A (en)* | 2012-09-28 | 2013-02-27 | 北界创想(北京)软件有限公司 | Method and device for processing network information data |
| CN102955791A (en)* | 2011-08-23 | 2013-03-06 | 句容今太科技园有限公司 | Searching and classifying service system for network information |
| US8429522B2 (en) | 2003-08-06 | 2013-04-23 | Microsoft Corporation | Correlation, association, or correspondence of electronic forms |
| CN103149840A (en)* | 2013-02-01 | 2013-06-12 | 西北工业大学 | Semantic service combination method based on dynamic programming |
| CN103150632A (en)* | 2013-03-13 | 2013-06-12 | 河海大学 | Structuring method for flood control and drought control bulletin generation system based on water conservancy cloud platform |
| CN103488750A (en)* | 2013-09-24 | 2014-01-01 | 长沙裕邦软件开发有限公司 | Implementation method and system of network robot |
| US8661459B2 (en) | 2005-06-21 | 2014-02-25 | Microsoft Corporation | Content syndication platform |
| US8751936B2 (en) | 2005-06-21 | 2014-06-10 | Microsoft Corporation | Finding and consuming web subscriptions in a web browser |
| CN103853834A (en)* | 2014-03-12 | 2014-06-11 | 华东师范大学 | Text structure analysis-based Web document abstract generation method |
| CN104008126A (en)* | 2014-03-31 | 2014-08-27 | 北京奇虎科技有限公司 | Method and device for segmentation on basis of webpage content classification |
| US8892993B2 (en) | 2003-08-01 | 2014-11-18 | Microsoft Corporation | Translation file |
| US8918729B2 (en) | 2003-03-24 | 2014-12-23 | Microsoft Corporation | Designing electronic forms |
| CN104424308A (en)* | 2013-09-04 | 2015-03-18 | 中兴通讯股份有限公司 | Web page classification standard acquisition method and device and web page classification method and device |
| CN104657347A (en)* | 2015-02-06 | 2015-05-27 | 北京中搜网络技术股份有限公司 | News optimized reading mobile application-oriented automatic summarization method |
| CN105005563A (en)* | 2014-04-15 | 2015-10-28 | 腾讯科技(深圳)有限公司 | Abstract generation method and apparatus |
| US9210234B2 (en) | 2005-12-05 | 2015-12-08 | Microsoft Technology Licensing, Llc | Enabling electronic documents for limited-capability computing devices |
| US9229917B2 (en) | 2003-03-28 | 2016-01-05 | Microsoft Technology Licensing, Llc | Electronic form user interfaces |
| CN105760500A (en)* | 2009-11-10 | 2016-07-13 | 启创互联公司 | System, method and computer program for creating and manipulating data structures using an interactive graphical interface |
| CN106383887A (en)* | 2016-09-22 | 2017-02-08 | 深圳市博安达信息技术股份有限公司 | Environment-friendly news data acquisition and recommendation display method and system |
| US10146843B2 (en) | 2009-11-10 | 2018-12-04 | Primal Fusion Inc. | System, method and computer program for creating and manipulating data structures using an interactive graphical interface |
| CN109086361A (en)* | 2018-07-20 | 2018-12-25 | 北京开普云信息科技有限公司 | A kind of automatic abstracting method of webpage article information and system based on mutual information between web page nodes |
| CN112115259A (en)* | 2020-06-17 | 2020-12-22 | 上海金融期货信息技术有限公司 | A feature word-driven text multi-label hierarchical classification method and system |
| CN113190644A (en)* | 2021-05-24 | 2021-07-30 | 浪潮软件科技有限公司 | Method and device for hot updating search engine word segmentation dictionary |
| CN113486279A (en)* | 2021-06-29 | 2021-10-08 | 平安信托有限责任公司 | Automatic news generation method, device, equipment and storage medium |
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US7979856B2 (en) | 2000-06-21 | 2011-07-12 | Microsoft Corporation | Network-based software extensions |
| US8918729B2 (en) | 2003-03-24 | 2014-12-23 | Microsoft Corporation | Designing electronic forms |
| US7925621B2 (en) | 2003-03-24 | 2011-04-12 | Microsoft Corporation | Installing a solution |
| US9229917B2 (en) | 2003-03-28 | 2016-01-05 | Microsoft Technology Licensing, Llc | Electronic form user interfaces |
| US8892993B2 (en) | 2003-08-01 | 2014-11-18 | Microsoft Corporation | Translation file |
| US9239821B2 (en) | 2003-08-01 | 2016-01-19 | Microsoft Technology Licensing, Llc | Translation file |
| US8429522B2 (en) | 2003-08-06 | 2013-04-23 | Microsoft Corporation | Correlation, association, or correspondence of electronic forms |
| US9268760B2 (en) | 2003-08-06 | 2016-02-23 | Microsoft Technology Licensing, Llc | Correlation, association, or correspondence of electronic forms |
| CN101180624B (en)* | 2004-10-28 | 2012-05-09 | 雅虎公司 | Link-based spam detection |
| US8433704B2 (en) | 2004-12-30 | 2013-04-30 | Google Inc. | Local item extraction |
| JP2011129154A (en)* | 2004-12-30 | 2011-06-30 | Google Inc | Local item extraction |
| CN101128819B (en)* | 2004-12-30 | 2011-06-22 | 谷歌公司 | Partial Item Extraction |
| CN100336056C (en)* | 2005-01-07 | 2007-09-05 | 清华大学 | Technological term extracting, law-analysing and reusing method based on ripe technological file |
| CN101203847B (en)* | 2005-03-11 | 2010-05-19 | 雅虎公司 | System and method for managing listings |
| CN100399330C (en)* | 2005-03-23 | 2008-07-02 | 腾讯科技(深圳)有限公司 | System for managing World Wide Web media in World Wide Web pages and its implementation method |
| US9104773B2 (en) | 2005-06-21 | 2015-08-11 | Microsoft Technology Licensing, Llc | Finding and consuming web subscriptions in a web browser |
| US9762668B2 (en) | 2005-06-21 | 2017-09-12 | Microsoft Technology Licensing, Llc | Content syndication platform |
| US9894174B2 (en) | 2005-06-21 | 2018-02-13 | Microsoft Technology Licensing, Llc | Finding and consuming web subscriptions in a web browser |
| US8751936B2 (en) | 2005-06-21 | 2014-06-10 | Microsoft Corporation | Finding and consuming web subscriptions in a web browser |
| US8661459B2 (en) | 2005-06-21 | 2014-02-25 | Microsoft Corporation | Content syndication platform |
| US8832571B2 (en) | 2005-06-21 | 2014-09-09 | Microsoft Corporation | Finding and consuming web subscriptions in a web browser |
| US9210234B2 (en) | 2005-12-05 | 2015-12-08 | Microsoft Technology Licensing, Llc | Enabling electronic documents for limited-capability computing devices |
| CN100462972C (en)* | 2005-12-08 | 2009-02-18 | 国际商业机器公司 | Document-based information and uniform resource locator (URL) management method and device |
| CN1786965B (en)* | 2005-12-21 | 2010-05-26 | 北大方正集团有限公司 | A Method for Extracting Text Information of News Web Pages |
| CN1858737B (en)* | 2006-01-25 | 2010-06-02 | 华为技术有限公司 | Method and system for data search |
| US8768881B2 (en) | 2006-03-03 | 2014-07-01 | Microsoft Corporation | RSS data-processing object |
| US8280843B2 (en) | 2006-03-03 | 2012-10-02 | Microsoft Corporation | RSS data-processing object |
| US7979803B2 (en) | 2006-03-06 | 2011-07-12 | Microsoft Corporation | RSS hostable control |
| CN101055581B (en)* | 2006-04-13 | 2012-07-04 | Lg电子株式会社 | Document management system and method |
| CN100444591C (en)* | 2006-08-18 | 2008-12-17 | 北京金山软件有限公司 | Method and Application System for Obtaining Web Page Keywords |
| CN101140578B (en)* | 2006-09-06 | 2010-12-08 | 鸿富锦精密工业(深圳)有限公司 | System and method for multi-thread analysis of web page data |
| CN101192220B (en)* | 2006-11-21 | 2010-09-15 | 财团法人资讯工业策进会 | Label construction method and system suitable for resource search |
| CN101231641B (en)* | 2007-01-22 | 2010-05-19 | 北大方正集团有限公司 | A method and system for automatically analyzing the dissemination process of hot topics on the Internet |
| CN100433018C (en)* | 2007-03-13 | 2008-11-12 | 白云 | Method for discriminating electronic file and relative degree with certain field and application thereof |
| CN101035128B (en)* | 2007-04-18 | 2010-04-21 | 大连理工大学 | Recognition and filtering method of triple webpage text content based on Chinese punctuation marks |
| CN100592293C (en)* | 2007-04-28 | 2010-02-24 | 李树德 | Knowledge search engine based on intelligent ontology and implementation method thereof |
| WO2008131597A1 (en)* | 2007-04-29 | 2008-11-06 | Haitao Lin | Search engine and method for filtering agency information |
| WO2009021429A1 (en)* | 2007-08-13 | 2009-02-19 | Tencent Technology (Shenzhen) Company Limited | Method and device for dealing with the instant messaging information |
| US8204946B2 (en) | 2007-08-13 | 2012-06-19 | Tencent Technology (Shenzhen) Company Ltd. | Method and apparatus for processing instant messaging information |
| CN101196935B (en)* | 2008-01-03 | 2010-06-09 | 中兴通讯股份有限公司 | System and method for creating index database |
| CN101526938B (en)* | 2008-03-06 | 2011-12-28 | 夏普株式会社 | File processing device |
| CN101751438B (en)* | 2008-12-17 | 2012-08-22 | 中国科学院自动化研究所 | Theme webpage filter system for driving self-adaption semantics |
| CN102460437A (en)* | 2009-06-26 | 2012-05-16 | 乐天株式会社 | Information search device, information search method, information search program, and recording medium having information search program recorded thereon |
| CN102460437B (en)* | 2009-06-26 | 2014-10-15 | 乐天株式会社 | Information search device, information search method, information search program, and storage medium on which information search program has been stored |
| CN105760500A (en)* | 2009-11-10 | 2016-07-13 | 启创互联公司 | System, method and computer program for creating and manipulating data structures using an interactive graphical interface |
| CN105760500B (en)* | 2009-11-10 | 2019-08-09 | 启创互联公司 | System and method for being created using interactive graphics (IG) interface and manipulating data structure |
| US10146843B2 (en) | 2009-11-10 | 2018-12-04 | Primal Fusion Inc. | System, method and computer program for creating and manipulating data structures using an interactive graphical interface |
| CN102812475A (en)* | 2009-12-24 | 2012-12-05 | 梅塔瓦纳股份有限公司 | System And Method For Determining Sentiment Expressed In Documents |
| CN102118400A (en)* | 2009-12-31 | 2011-07-06 | 北京四维图新科技股份有限公司 | Data acquisition method and system |
| CN102118400B (en)* | 2009-12-31 | 2013-07-17 | 北京四维图新科技股份有限公司 | Data acquisition method and system |
| CN102236654A (en)* | 2010-04-26 | 2011-11-09 | 广东开普互联信息科技有限公司 | Web Invalid Link Filtering Method Based on Content Correlation |
| US9218325B2 (en) | 2010-08-31 | 2015-12-22 | International Business Machines Corporation | Quick font match |
| US9002877B2 (en) | 2010-08-31 | 2015-04-07 | International Business Machines Corporation | Quick font match |
| CN102385570A (en)* | 2010-08-31 | 2012-03-21 | 国际商业机器公司 | Method and system for matching fonts |
| CN102446191A (en)* | 2010-10-13 | 2012-05-09 | 北京创新方舟科技有限公司 | Method for generating webpage content abstracts and equipment and system adopting same |
| CN102446311B (en)* | 2010-10-15 | 2016-12-21 | 商业对象软件有限公司 | Process-driven business intelligence |
| CN102446311A (en)* | 2010-10-15 | 2012-05-09 | 商业对象软件有限公司 | Business intelligence technology for process driving |
| CN101984435A (en)* | 2010-11-17 | 2011-03-09 | 百度在线网络技术(北京)有限公司 | Method and device for distributing texts |
| CN101984435B (en)* | 2010-11-17 | 2012-10-10 | 百度在线网络技术(北京)有限公司 | Method and device for distributing texts |
| CN102117317B (en)* | 2010-12-28 | 2012-08-22 | 北京航空航天大学 | Blind person Internet system based on voice technology |
| CN102117317A (en)* | 2010-12-28 | 2011-07-06 | 北京航空航天大学 | Blind person Internet system based on voice technology |
| CN102592039A (en)* | 2011-01-18 | 2012-07-18 | 四川火狐无线科技有限公司 | Interaction method for processing catering and entertainment service data and device and system for realizing same |
| CN102955791A (en)* | 2011-08-23 | 2013-03-06 | 句容今太科技园有限公司 | Searching and classifying service system for network information |
| CN102521313A (en)* | 2011-12-01 | 2012-06-27 | 北京大学 | Static index pruning method based on web page quality |
| CN102902757A (en)* | 2012-09-25 | 2013-01-30 | 姚明东 | Automatic generation method of e-commerce dictionary |
| CN102902757B (en)* | 2012-09-25 | 2015-07-29 | 姚明东 | A kind of Automatic generation method of e-commerce dictionary |
| CN102945246A (en)* | 2012-09-28 | 2013-02-27 | 北界创想(北京)软件有限公司 | Method and device for processing network information data |
| CN103149840A (en)* | 2013-02-01 | 2013-06-12 | 西北工业大学 | Semantic service combination method based on dynamic programming |
| CN103149840B (en)* | 2013-02-01 | 2015-03-04 | 西北工业大学 | Semantic service combination method based on dynamic programming |
| CN103150632B (en)* | 2013-03-13 | 2016-03-16 | 河海大学 | Construction method of flood control and drought control bulletin generation system based on water conservancy cloud platform |
| CN103150632A (en)* | 2013-03-13 | 2013-06-12 | 河海大学 | Structuring method for flood control and drought control bulletin generation system based on water conservancy cloud platform |
| CN104424308A (en)* | 2013-09-04 | 2015-03-18 | 中兴通讯股份有限公司 | Web page classification standard acquisition method and device and web page classification method and device |
| CN103488750A (en)* | 2013-09-24 | 2014-01-01 | 长沙裕邦软件开发有限公司 | Implementation method and system of network robot |
| CN103853834A (en)* | 2014-03-12 | 2014-06-11 | 华东师范大学 | Text structure analysis-based Web document abstract generation method |
| CN103853834B (en)* | 2014-03-12 | 2017-02-08 | 华东师范大学 | Text structure analysis-based Web document abstract generation method |
| CN104008126A (en)* | 2014-03-31 | 2014-08-27 | 北京奇虎科技有限公司 | Method and device for segmentation on basis of webpage content classification |
| CN105005563A (en)* | 2014-04-15 | 2015-10-28 | 腾讯科技(深圳)有限公司 | Abstract generation method and apparatus |
| CN104657347A (en)* | 2015-02-06 | 2015-05-27 | 北京中搜网络技术股份有限公司 | News optimized reading mobile application-oriented automatic summarization method |
| CN106383887A (en)* | 2016-09-22 | 2017-02-08 | 深圳市博安达信息技术股份有限公司 | Environment-friendly news data acquisition and recommendation display method and system |
| CN106383887B (en)* | 2016-09-22 | 2023-04-07 | 深圳博沃智慧科技有限公司 | Method and system for collecting, recommending and displaying environment-friendly news data |
| CN109086361A (en)* | 2018-07-20 | 2018-12-25 | 北京开普云信息科技有限公司 | A kind of automatic abstracting method of webpage article information and system based on mutual information between web page nodes |
| CN109086361B (en)* | 2018-07-20 | 2019-06-21 | 北京开普云信息科技有限公司 | A kind of automatic abstracting method of webpage article information and system based on mutual information between web page nodes |
| CN112115259A (en)* | 2020-06-17 | 2020-12-22 | 上海金融期货信息技术有限公司 | A feature word-driven text multi-label hierarchical classification method and system |
| CN112115259B (en)* | 2020-06-17 | 2024-06-25 | 上海金融期货信息技术有限公司 | Text multi-label hierarchical classification method and system driven by feature words |
| CN113190644A (en)* | 2021-05-24 | 2021-07-30 | 浪潮软件科技有限公司 | Method and device for hot updating search engine word segmentation dictionary |
| CN113190644B (en)* | 2021-05-24 | 2023-01-13 | 浪潮软件科技有限公司 | Method and device for hot updating word segmentation dictionary of search engine |
| CN113486279A (en)* | 2021-06-29 | 2021-10-08 | 平安信托有限责任公司 | Automatic news generation method, device, equipment and storage medium |
| Publication | Title |
|---|---|
| CN1536483A (en) | Method and system for extracting and processing network information |
| CN1096038C (en) | Method and device for file retrieval based on Bayesian network |
| CN1145901C (en) | A Construction Method of Intelligent Decision Support Based on Information Mining |
| Glover et al. | Using web structure for classifying and describing web pages |
| CN1109982C (en) | Hypertext document retrieving apparatus for retrieving hypertext documents relating to each other |
| CN1669029A (en) | System and method for automatically discovering a hierarchy of concepts from a corpus of documents |
| CN1904896A (en) | Structured document processing apparatus, search apparatus, structured document system and method |
| CN1728141A (en) | Phrase-Based Search in Information Retrieval Systems |
| CN1269897A (en) | Methods and/or system for selecting data sets |
| CN1728140A (en) | Phrase-Based Indexing in Information Retrieval Systems |
| CN1535433A (en) | Category based, extensible and interactive system for document retrieval |
| CN1882943A (en) | Systems and methods for search processing using superunits |
| CN1871603A (en) | System and method for processing a query |
| CN1559044A (en) | Information analysis method and device |
| CN101055587A (en) | Search engine retrieving result reordering method based on user behavior information |
| CN1281191A (en) | Information retrieval method and information retrieval device |
| CN100504857C (en) | Filtering method and device for effectively extracting documents desired by searchers using learning data |
| CN115563313A (en) | Semantic retrieval system for literature and books based on knowledge graph |
| CN102200974A (en) | Unified information retrieval intelligent agent system and method for search engine |
| CN1942877A (en) | Information extraction system |
| CN1707476A (en) | Auxiliary translation searching engine system and method thereof |
| CN1786947A (en) | System, method and program for extracting web page core content based on web page layout |
| CN1750002A (en) | Method for providing research result |
| CN118535978A (en) | News analysis method and system based on multimodal large model |
| CN1265209A (en) | System for processing textual inputs using natural language processing techniques |
| Code | Title | Description |
|---|---|---|
| C06 | Publication | |
| PB01 | Publication | |
| C10 | Entry into substantive examination | |
| SE01 | Entry into force of request for substantive examination | |
| C02 | Deemed withdrawal of patent application after publication (patent law 2001) | |
| WD01 | Invention patent application deemed withdrawn after publication | Open date: 2004-10-13 |