CN104657347A

Info

Publication number: CN104657347A
Application number: CN201510063837.5A
Authority: CN
Inventors: 尹柳; 许欢庆; 郭永福; 陈沛
Original assignee: Beijing Zhongsou Network Technology Co ltd
Current assignee: Beijing Wyatt Network Technology Co. Ltd.
Priority date: 2015-02-06
Filing date: 2015-02-06
Publication date: 2015-05-27

Abstract

The invention relates to an automatic summarization method for news-optimized reading mobile applications. The method comprises the following steps: (1) preprocessing the news webpage content; (2) extracting a text summary; (3) generating the result. Compared with conventional automatic summarization, the method adds HTML (Hypertext Markup Language) formatting and retains pictures and tables, which improves how the summary is displayed and enhances the user's visual experience. Conventional automatic summarization loses semantics; the method expands each selected sentence with its context, and merges runs of unselected sentences and joins them with an ellipsis, which compensates for that loss and improves semantic integrity and continuity. Two user-selectable options are provided, the summary's percentage of the original article and the summary length, which improves flexibility. In a manual check of 100 randomly selected articles, the pass rate reached 99.8 percent.

Description

An automatic summarization method for news-optimized reading mobile applications

Technical field

The present invention relates to an automatic summarization method, and in particular to an automatic summarization method for news-optimized reading mobile applications.

Background art

With the rapid development of the Internet in recent years, a large amount of information reaches people every day in the form of electronic documents. People rely on the Internet more and more to obtain the information they need. Faced with this daily flood of information, they must filter a great deal of it to find what they need, so automatic summarization of documents is becoming increasingly important for obtaining useful information quickly and accurately from massive amounts of electronic text.

As devices have evolved from early PCs to today's smartphones, people have shifted from browsing information only on a traditional PC to browsing on mobile phones. Given the small screen of a mobile phone, the demand for automatic summarization is even more pressing.

Automatic summarization means that a computer program automatically extracts the main ideas of a document and, by recombining and revising the extracted important information, generates a digest that is more concise and easier to understand than the original. Reading a short digest lets a reader grasp the original quickly and easily without reading the full text, which greatly improves the efficiency of obtaining information from electronic text. Current automatic summarization techniques fall into two main classes: statistics-based mechanical summarization and knowledge-based understanding summarization. Mechanical summarization uses statistical methods to obtain the keywords of a document and, combined with heuristic information such as cue words and position, selects suitable sentences from the document; the summary is obtained after polishing. Understanding summarization aims to use various kinds of knowledge and formal theory to generate a digest (a summary or condensation of the original) on the basis of understanding the semantic content of the document.

Mechanical summarization is fast and not restricted to a particular domain, but the summaries it generates are of lower quality: they reflect the content incompletely and contain redundant statements. By comparison, understanding summarization yields better quality, being concise, comprehensive, accurate, and readable. However, it requires the computer not only to understand and generate natural language, but also to represent and organize various kinds of background and domain knowledge. This work is extremely difficult, and little progress has been made so far. Understanding summarization is therefore rarely used and is limited to very narrow applications.

Summary of the invention

To address the deficiencies of the prior art, the present invention proposes an automatic summarization method for news-optimized reading mobile applications. Given the particular constraints of mobile terminals, it designs a formatted automatic summary that makes the user experience more comfortable. The invention generates the summary automatically in combination with HTML styling, retains the pictures and tables of the original text, and expands the important information with its preceding and following context, which improves the integrity and continuity of the content. This avoids summaries that look dull, stiff, and disjointed, and optimizes news reading on mobile terminals.

The object of the invention is achieved by the following technical solution:

An automatic summarization method for news-optimized reading mobile applications, the improvement being that the method comprises:

(1) preprocessing the news webpage content;

(2) extracting a text summary;

(3) generating the result.

Preferably, step (1) comprises:

(1.1) loading the dictionary and the stop words;

(1.2) dividing the news webpage content into blocks according to HTML tags, denoted k_i;

(1.3) splitting each k_i into sentences, using paragraph end marks and full stops as sentence boundaries;

(1.4) extracting the HTML tag h_i and the text s_i of each sentence;

(1.5) recording the corresponding positions of each sentence's tag h_i and text s_i;

(1.6) segmenting the text s_i into words;

(1.7) removing stop words and other noise; the result is denoted word_i.

Further, each word_i is the word sequence that remains after the stop words are removed.

Preferably, step (2) comprises:

(2.1) calculating the co-occurrence similarity sim_{i,j} of word_i and word_j;

(2.2) iterating according to the formula pr_i = 1 - d/m + d * Σ_j sim_{j,i} * pr_j / out_j;

(2.3) sorting the sentences s_i in descending order of their pr_i values to generate the sentence sequence s_k;

where word_i is the word sequence corresponding to sentence text s_i, word_j is the word sequence corresponding to sentence text s_j, sim_{i,j} is the contribution of sentence i to sentence j, d ∈ (0,1), m is the maximum dimension of the matrix, out_j is the out-degree of sentence vertex j, the initial value of pr is 1/m, and the convergence precision is 0.001.
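Steps (2.2) and (2.3) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function name `rank_sentences`, the default d = 0.85, and the iteration cap are hypothetical; the constant term is read as (1 - d)/m, the usual normalized PageRank form, because the flattened text "1-d/m" is ambiguous; and out_j is taken to be the sum of the similarity weights leaving sentence j.

```python
from typing import List


def rank_sentences(sim: List[List[float]], d: float = 0.85,
                   tol: float = 0.001, max_iter: int = 100) -> List[int]:
    """Return sentence indices sorted by descending pr score."""
    m = len(sim)
    # Assumption: out_j is the total similarity weight leaving sentence j,
    # ignoring any self-similarity on the diagonal.
    out = [sum(sim[j][k] for k in range(m) if k != j) for j in range(m)]
    pr = [1.0 / m] * m  # initial value 1/m, as stated in the text
    for _ in range(max_iter):
        new_pr = []
        for i in range(m):
            incoming = sum(sim[j][i] * pr[j] / out[j]
                           for j in range(m) if j != i and out[j] > 0)
            # Assumption: constant term read as (1 - d) / m.
            new_pr.append((1.0 - d) / m + d * incoming)
        converged = max(abs(a - b) for a, b in zip(new_pr, pr)) < tol
        pr = new_pr
        if converged:  # convergence precision 0.001 by default
            break
    # Step (2.3): descending order of pr gives the sequence s_k.
    return sorted(range(m), key=lambda i: pr[i], reverse=True)
```

With a similarity matrix in which one sentence is linked to every other sentence, that sentence is expected to come first in the returned order.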

Preferably, step (3) comprises:

(3.1) taking the first L sentences from s_k;

(3.2) expanding the L sentences taken out with their preceding and following context to obtain the set s_l;

(3.3) re-sorting s_l by order of appearance in the original text to obtain s'_l;

(3.4) using h_i, inserting s'_l at the corresponding positions;

(3.5) if several consecutive sentences are all unselected, i.e. not in the set s'_l, merging them;

(3.6) according to the length or percentage set by the user, checking whether the length of the result of (3.5) meets the limit; if it exceeds the limit, truncating it to obtain the final result.

Compared with the prior art, the beneficial effects of the present invention are:

Compared with ordinary automatic summarization, it adds HTML formatting and retains pictures and tables, which improves the presentation of the digest and enhances the user's visual experience.

Conventional automatic summarization loses semantics. The invention expands sentences with their context, and merges unselected sentences and joins them with an ellipsis, which compensates for the semantic loss of conventional summaries and improves semantic integrity and continuity.

The invention provides two options for the user to choose from, the summary's percentage of the original text and the summary length, which improves flexibility.

In a manual check of 100 randomly selected articles, the pass rate reached 99.8%.

Brief description of the drawings

Fig. 1 is a flowchart of the automatic summarization method for news-optimized reading mobile applications provided by the invention.

Fig. 2 is a structural diagram of the preprocessing module in the automatic summarization method for news-optimized reading mobile applications provided by the invention.

Fig. 3 is a flowchart of the text summary extraction module in the automatic summarization method for news-optimized reading mobile applications provided by the invention.

Fig. 4 is a flowchart of the result generation module in the automatic summarization method for news-optimized reading mobile applications provided by the invention.

Detailed description of the embodiments

The specific embodiments of the invention are described in further detail below with reference to the accompanying drawings.

The automatic summarization method for news-optimized reading mobile applications of the present invention comprises the following steps: preprocessing the news webpage content, extracting the text summary, and generating the result.

Fig. 2 shows the structure of the preprocessing of the news webpage content. Preprocessing first divides the news webpage content into blocks, so that each news article corresponds to a sequence of blocks and each block corresponds to a word sequence. The specific steps are as follows:

1. load the dictionary and the stop words;

2. divide the news webpage content into blocks according to HTML tags, denoted k_i (i ∈ 1, 2, 3, ..., n); if a table is present, extract the table as an independent block k_j; otherwise each span from a start tag to its end tag forms one block k_j;

3. split each k_i (i ≠ j) into sentences, using paragraph end marks and full stops as sentence boundaries;

4. extract the HTML tag h_i (i ∈ 1, 2, 3, ..., m) and the text s_i (i ∈ 1, 2, 3, ..., m) of each sentence;

5. record the corresponding positions of each sentence's tag h_i (i ∈ 1, 2, 3, ..., m) and text s_i (i ∈ 1, 2, 3, ..., m);

6. segment the text s_i (i ∈ 1, 2, 3, ..., m) into words;

7. remove stop words and other noise; the result is denoted word_i (i ∈ 1, 2, 3, ..., m), where each word_i is the word sequence that remains after stop-word removal and denoising.
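The preprocessing steps above can be sketched in Python with the standard library alone. This is an illustration under stated assumptions, not the patented implementation: the names `BlockParser`, `preprocess`, and `STOP_WORDS` are hypothetical, only paragraph and heading tags are treated as blocks (tables and pictures are not handled here), and the regular-expression tokenizer stands in for the dictionary-based word segmentation that Chinese text would require.

```python
import re
from html.parser import HTMLParser

# Placeholder stop-word list; step 1 would load a real list from a file.
STOP_WORDS = {"the", "a", "of", "is", "in"}


class BlockParser(HTMLParser):
    """Collect (tag, text) blocks for a few block-level tags (step 2)."""
    BLOCK_TAGS = {"p", "h1", "h2", "h3"}

    def __init__(self):
        super().__init__()
        self.blocks = []
        self._tag = None
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag in self.BLOCK_TAGS:
            self._tag, self._buf = tag, []

    def handle_data(self, data):
        if self._tag is not None:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if tag == self._tag:
            self.blocks.append((tag, "".join(self._buf).strip()))
            self._tag = None


def preprocess(html):
    """Return one record per sentence: tag h_i, block position, text s_i, word_i."""
    parser = BlockParser()
    parser.feed(html)
    records = []
    for block_index, (tag, text) in enumerate(parser.blocks):
        # Step 3: a sentence ends at a full stop or similar end mark.
        for raw in re.findall(r"[^.!?。！？]+[.!?。！？]?", text):
            sentence = raw.strip()
            if not sentence:
                continue
            # Steps 6 and 7: tokenize, then drop stop words.
            tokens = re.findall(r"\w+", sentence.lower())
            words = [t for t in tokens if t not in STOP_WORDS]
            records.append({"tag": tag, "block": block_index,
                            "text": sentence, "words": words})
    return records
```

Each record keeps the tag and block index next to the sentence text, so a later stage can put selected sentences back into their original markup and position.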

Fig. 3 is the flowchart of the text summary extraction module. The specific steps are as follows:

1. Calculate the co-occurrence similarity sim_{i,j} of word_i and word_j, where word_i is the word sequence corresponding to sentence text s_i, word_j is the word sequence corresponding to sentence text s_j, and sim_{i,j} is the contribution of sentence i to sentence j. Generate an undirected graph matrix from sim_{i,j};

2. Iterate according to the formula pr_i = 1 - d/m + d * Σ_j sim_{j,i} * pr_j / out_j, where d ∈ (0,1), m is the maximum dimension of the matrix, out_j is the out-degree of sentence vertex j (i.e. sentence j), and the convergence precision is 0.001;

3. Sort the sentences s_i in descending order of their pr_i values to generate the sentence sequence s_k.

Note: the formula in step (2.2) comes from the PageRank algorithm (Brin and Page, 1998).
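The text does not say how the co-occurrence similarity of step 1 is computed. As a stand-in only, the sketch below uses the Jaccard overlap of the two sentences' word sets (shared distinct words divided by all distinct words) and fills a symmetric matrix, matching the undirected graph described above. The function name `cooccurrence_matrix` and the choice of Jaccard overlap are assumptions.

```python
def cooccurrence_matrix(word_seqs):
    """Build a symmetric similarity matrix from per-sentence word sequences."""
    sets = [set(words) for words in word_seqs]
    m = len(sets)
    sim = [[0.0] * m for _ in range(m)]
    for i in range(m):
        for j in range(i + 1, m):
            union = sets[i] | sets[j]
            if union:
                # Assumed measure: shared words / all distinct words.
                score = len(sets[i] & sets[j]) / len(union)
                sim[i][j] = sim[j][i] = score  # undirected: same both ways
    return sim
```

The diagonal is left at zero so that a sentence does not contribute to its own score.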

Fig. 4 is the flowchart of the result generation module. The specific steps of the result generation module are as follows:

1. take the first L sentences from s_k, where L ∈ (1, m);

2. expand the L sentences taken out with their preceding and following context to obtain the set s_l;

3. re-sort s_l by order of appearance in the original text to obtain s'_l;

4. using h_i (i ∈ 1, 2, 3, ..., m) and the position information, insert s'_l at the corresponding positions;

5. if several consecutive sentences are all unselected, i.e. not in the set s'_l, merge them and connect with '...';

6. according to the length or percentage set by the user, check whether the length of the result of step 5 meets the limit; if it exceeds the limit, truncate it to obtain the final result.
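The result generation steps can be sketched as follows. This is a simplified illustration, not the patented implementation: the function name `build_summary` and its parameters are hypothetical, the context expansion is assumed to take `window` neighbouring sentences on each side (the text does not give the extent), truncation is by character count, and re-inserting the HTML tags h_i (step 4) is omitted.

```python
def build_summary(sentences, ranked, top_l, window=1,
                  max_chars=None, max_ratio=None):
    """Select, expand, reorder, merge gaps with '...', and truncate."""
    m = len(sentences)
    selected = set()
    # Steps 1 and 2: take the top L sentences and add their neighbours.
    for idx in ranked[:top_l]:
        first, last = max(0, idx - window), min(m - 1, idx + window)
        selected.update(range(first, last + 1))
    parts = []
    gap_open = False
    # Step 3: walking the indices in order restores the original order.
    for i in range(m):
        if i in selected:
            parts.append(sentences[i])
            gap_open = False
        elif not gap_open:
            # Step 5: a run of unselected sentences collapses to one '...'.
            parts.append("...")
            gap_open = True
    summary = " ".join(parts)
    # Step 6: apply the user's length and/or percentage limit.
    limits = []
    if max_chars is not None:
        limits.append(max_chars)
    if max_ratio is not None:
        limits.append(int(max_ratio * sum(len(s) for s in sentences)))
    if limits:
        summary = summary[:min(limits)]
    return summary
```

Here `ranked` is the sentence order produced by the extraction stage, and `max_chars` and `max_ratio` correspond to the two user options, summary length and percentage of the original text.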

Finally, it should be noted that the above embodiments only illustrate the technical solution of the invention and do not limit it. Those of ordinary skill in the art may still modify the specific embodiments of the invention or make equivalent replacements with reference to the above embodiments; any such modification or equivalent replacement that does not depart from the spirit and scope of the invention falls within the scope of the pending claims of the invention.

Claims

1. An automatic summarization method for news-optimized reading mobile applications, characterized in that the method comprises:

(1) preprocessing the news webpage content;

(2) extracting a text summary;

(3) generating the result.

2. The automatic summarization method for news-optimized reading mobile applications as claimed in claim 1, characterized in that step (1) comprises:

(1.1) loading the dictionary and the stop words;

(1.4) extracting the HTML tag h_i and the text s_i of each sentence;

(1.5) recording the corresponding positions of each sentence's tag h_i and text s_i;

(1.6) segmenting the text s_i into words;

(1.7) removing stop words and other noise; the result is denoted word_i.

3. The automatic summarization method for news-optimized reading mobile applications as claimed in claim 2, characterized in that each word_i is the word sequence that remains after the stop words are removed.

4. The automatic summarization method for news-optimized reading mobile applications as claimed in claim 1, characterized in that step (2) comprises:

(2.1) calculating the co-occurrence similarity sim_{i,j} of word_i and word_j;

(2.2) iterating according to the formula pr_i = 1 - d/m + d * Σ_j sim_{j,i} * pr_j / out_j;

5. The automatic summarization method for news-optimized reading mobile applications as claimed in claim 1, characterized in that step (3) comprises:

(3.1) taking the first L sentences from s_k;

(3.3) re-sorting s_l by order of appearance in the original text to obtain s'_l;

(3.4) using h_i, inserting s'_l at the corresponding positions;

(3.5) if several consecutive sentences are all unselected, i.e. not in the set s'_l, merging them;