An automatic summarization method for optimizing news-reading mobile applications
Technical field
The present invention relates to an automatic summarization method, and specifically to an automatic summarization method for optimizing news-reading mobile applications.
Background art
With the rapid development of the Internet in recent years, a large amount of information appears before people every day in the form of electronic documents. People rely more and more on the Internet to obtain the information they need. Faced with the flood of information arriving every day, they must filter a large amount of it to find what they need. To obtain useful information quickly and accurately from massive amounts of electronic information, automatic summarization of documents is becoming increasingly important.
As computing has moved from the early PC to today's smartphone, people have shifted from browsing information solely on traditional PCs to browsing on mobile terminals. Given the small screen of a mobile phone, the demand for automatic summarization is even more urgent.
Automatic summarization means that a computer program automatically extracts the main ideas of a document and, by recombining and polishing the extracted important information, generates a digest that is more concise and easier to understand than the original. By reading a short digest, a reader can grasp the original quickly and easily without reading the full text, which greatly improves the efficiency of obtaining information from electronic texts. Current automatic summarization techniques fall into two main classes: statistics-based mechanical summarization and knowledge-based understanding summarization. Mechanical summarization uses statistical methods to obtain the keywords of a document and, combined with heuristic information such as cue words and position, selects suitable sentences from the document; the summary is obtained after polishing. Understanding summarization seeks to use various kinds of knowledge and formal theory to generate a digest (a summary or condensation of the original) on the basis of understanding the semantic content of the document.
Mechanical summarization is fast and not restricted to any domain, but the summaries it generates are of poor quality: they reflect the content incompletely and contain redundant statements. Compared with mechanical summarization, understanding summarization produces better summaries that are concise, comprehensive, accurate, and readable. However, understanding summarization requires not only that the computer be capable of natural language understanding and generation, but also that various kinds of background and domain knowledge be represented and organized. These tasks are extremely difficult, and little progress has been made so far. Understanding summarization is therefore rarely used and is limited to very narrow applications.
Summary of the invention
In view of the deficiencies of the prior art, the present invention proposes an automatic summarization method for optimizing news-reading mobile applications. Taking into account the particular characteristics of mobile terminals, it provides a formatted automatic summary that improves the comfort of the user experience. The invention generates the summary automatically in combination with HTML styling, retains the pictures and tables of the original text, and expands the important information with its preceding and following context, which improves the completeness and continuity of the content. This avoids summaries whose presentation is dull, stiff, and disjointed, and optimizes news reading on mobile terminals.
The object of the invention is achieved by the following technical solution:
An automatic summarization method for optimizing news-reading mobile applications, characterized in that the method comprises:
(1) preprocessing the news web page content;
(2) extracting the text summary;
(3) generating the result.
Preferably, step (1) comprises:
(1.1) loading a dictionary and stop words;
(1.2) dividing the news web page content into blocks according to HTML tags, denoted $k_i$;
(1.3) splitting each $k_i$ into sentences, where sentences are delimited by paragraph end marks and full stops;
(1.4) extracting the HTML tag $h_i$ and the text $s_i$ of each sentence;
(1.5) recording the corresponding positions of the tag $h_i$ and the text $s_i$ of each sentence;
(1.6) segmenting the text $s_i$ into words;
(1.7) removing stop words and other noise, with the result denoted $word_i$.
Further, each $word_i$ is the word sequence remaining after stop words are removed.
Preferably, step (2) comprises:
(2.1) calculating the co-occurrence similarity $sim_{i,j}$ of $word_i$ and $word_j$;
(2.2) iterating according to the formula $pr_i = \frac{1-d}{m} + d \sum_j sim_{j,i} \cdot pr_j / out_j$;
(2.3) sorting the sentences $s_i$ in descending order of their $pr_i$ values to generate the sentence sequence $s_k$;
wherein $word_i$ is the word sequence corresponding to sentence text $s_i$, $word_j$ is the word sequence corresponding to sentence text $s_j$, $sim_{i,j}$ is the contribution of sentence $i$ to sentence $j$, $d \in (0,1)$, $m$ is the maximum dimension of the matrix, $out_j$ is the out-degree of sentence vertex $j$, the initial value of $pr$ is $1/m$, and the convergence precision is 0.001.
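For reference, the iteration in (2.2) can be written with explicit iteration indices. This is only a sketch: the superscript $(t)$ and the stopping rule based on the largest per-sentence change are assumptions, since the text states only the initial value and a convergence precision of 0.001.

```latex
pr_i^{(0)} = \frac{1}{m}, \qquad
pr_i^{(t+1)} = \frac{1-d}{m} + d \sum_{j} \frac{sim_{j,i}}{out_j}\, pr_j^{(t)}, \qquad
\text{stop when } \max_i \left| pr_i^{(t+1)} - pr_i^{(t)} \right| < 0.001
```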
Preferably, step (3) comprises:
(3.1) taking the first $L$ sentences from $s_k$;
(3.2) expanding each of the $L$ sentences taken with its preceding and following context to obtain the set $s_l$;
(3.3) re-sorting $s_l$ according to the order in the original text to obtain $s'_l$;
(3.4) inserting $s'_l$ at the corresponding positions in combination with $h_i$;
(3.5) if several consecutive sentences are all unselected, that is, not in the set $s'_l$, merging them;
(3.6) according to the length or percentage set by the user, judging whether the length of the result of (3.5) meets the requirement and, if it is exceeded, truncating the text to obtain the final result.
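The length check in (3.6) could be sketched as follows. The function names, the choice to measure length in characters, and the rule that an explicit length takes priority over a percentage are all assumptions made for illustration; the text only states that the user sets a length or a percentage and that an over-long result is truncated.

```python
def summary_budget(original_len, max_len=None, percent=None):
    """Return the maximum summary length in characters (hypothetical helper).

    An explicit max_len wins over percent; with neither set, no limit applies.
    """
    if max_len is not None:
        return max_len
    if percent is not None:
        return int(original_len * percent / 100)
    return original_len


def truncate(summary, budget):
    """Cut the summary to the budget if it is too long, else return it unchanged."""
    return summary if len(summary) <= budget else summary[:budget]


if __name__ == "__main__":
    budget = summary_budget(original_len=200, percent=25)
    print(budget, truncate("abcdef" * 20, budget))
```

This sketch operates on plain text; truncating a summary that already contains HTML tags would need extra care so that no tag is cut in half.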
Compared with the prior art, the beneficial effects of the present invention are:
Compared with general automatic summarization, HTML formatting is added and pictures and tables are retained, which optimizes the presentation of the digest and enhances the user's visual experience.
Traditional automatic summaries suffer from semantic gaps. The present invention extends each selected sentence with its context, merges runs of unselected sentences, and joins them with an ellipsis, which compensates for the semantic gaps of traditional summaries and improves semantic completeness and continuity.
The present invention provides two options, the percentage of the original text that the summary occupies and the summary length, for the user to choose and set, which improves flexibility.
In a manual inspection of 100 randomly selected articles, the pass rate reached 99.8%.
Brief description of the drawings
Fig. 1 is a flowchart of the automatic summarization method for optimizing news-reading mobile applications provided by the invention.
Fig. 2 is a structural diagram of the preprocessing module in the automatic summarization method for optimizing news-reading mobile applications provided by the invention.
Fig. 3 is a flowchart of the text summary extraction module in the automatic summarization method for optimizing news-reading mobile applications provided by the invention.
Fig. 4 is a flowchart of the result generation module in the automatic summarization method for optimizing news-reading mobile applications provided by the invention.
Embodiments
The specific embodiments of the present invention are described in further detail below with reference to the accompanying drawings.
The automatic summarization method for optimizing news-reading mobile applications according to the present invention comprises the following steps: preprocessing the news web page content, extracting the text summary, and generating the result.
Fig. 2 shows the structure of the preprocessing of the news web page content. Preprocessing first divides the news web page content into blocks, so that each paragraph of the news corresponds to one block and each block corresponds to a word sequence. The specific steps are as follows:
1. Load the dictionary and stop words;
2. Divide the news web page content into blocks according to HTML tags, denoted $k_i$ ($i \in \{1, 2, 3, \ldots, n\}$); if a table is present, extract the table as an independent block $k_j$; otherwise, each span from a start tag to its end tag forms one block;
3. Split each $k_i$ ($i \neq j$) into sentences, where sentences are delimited by paragraph end marks and full stops;
4. Extract the HTML tag $h_i$ ($i \in \{1, 2, 3, \ldots, m\}$) and the text $s_i$ ($i \in \{1, 2, 3, \ldots, m\}$) of each sentence;
5. Record the corresponding positions of the tag $h_i$ and the text $s_i$ of each sentence ($i \in \{1, 2, 3, \ldots, m\}$);
6. Segment the text $s_i$ ($i \in \{1, 2, 3, \ldots, m\}$) into words;
7. Remove stop words and other noise, with the result denoted $word_i$ ($i \in \{1, 2, 3, \ldots, m\}$); each $word_i$ is the word sequence remaining after stop-word removal and denoising.
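The preprocessing steps above can be sketched in Python using only the standard library. Several details here are assumptions rather than part of the described method: the small `STOP_WORDS` set stands in for the loaded stop-word list, `BLOCK_TAGS` is a hypothetical choice of block-level tags, and the regular-expression tokenizer stands in for a dictionary-based word segmenter (which Chinese text would require).

```python
import re
from html.parser import HTMLParser

# Hypothetical stop-word list; the described method loads one from a file.
STOP_WORDS = {"the", "a", "an", "of", "to", "in", "and", "is"}
# Tags treated as block boundaries (an assumption for this sketch).
BLOCK_TAGS = {"p", "h1", "h2", "h3", "li", "table"}


class BlockParser(HTMLParser):
    """Collect (tag, text) blocks; a table is kept as one independent block."""

    def __init__(self):
        super().__init__()
        self.blocks = []   # list of (tag, text) pairs
        self._tag = None   # block tag currently open, if any
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if self._tag is None and tag in BLOCK_TAGS:
            self._tag, self._buf = tag, []

    def handle_data(self, data):
        if self._tag is not None:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if tag == self._tag:
            sep = " " if tag == "table" else ""   # keep table cells apart
            text = " ".join(sep.join(self._buf).split())
            if text:
                self.blocks.append((tag, text))
            self._tag = None


def split_sentences(text):
    """Split after sentence-final punctuation (Western and CJK)."""
    found = re.findall(r"[^.!?。！？]+[.!?。！？]?", text)
    return [s.strip() for s in found if s.strip()]


def preprocess(html):
    """Return parallel lists: tags (h_i), sentences (s_i), word sequences (word_i)."""
    parser = BlockParser()
    parser.feed(html)
    tags, sentences, words = [], [], []
    for tag, text in parser.blocks:
        # Table blocks are not split into sentences.
        units = [text] if tag == "table" else split_sentences(text)
        for s in units:
            tokens = re.findall(r"\w+", s.lower())   # stand-in for a real segmenter
            tags.append(tag)
            sentences.append(s)
            words.append([t for t in tokens if t not in STOP_WORDS])
    return tags, sentences, words
```

Because the three lists are parallel, the list index serves as the recorded position that ties each tag to its sentence text.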
Fig. 3 is a flowchart of the text summary extraction module. The specific steps are as follows:
1. Calculate the co-occurrence similarity $sim_{i,j}$ of $word_i$ (the word sequence corresponding to sentence text $s_i$) and $word_j$ (the word sequence corresponding to sentence text $s_j$), where $sim_{i,j}$ is the contribution of sentence $i$ to sentence $j$;
generate an undirected graph matrix from $sim_{i,j}$;
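The text does not give the exact co-occurrence formula, so the following is only an illustrative sketch: it assumes the similarity is the number of distinct shared words divided by the sum of the two distinct-word counts. The function names are hypothetical. The resulting matrix is symmetric, which matches the undirected graph described above.

```python
def cooccurrence_similarity(words_i, words_j):
    """Shared-word similarity between two word sequences (assumed formula)."""
    a, b = set(words_i), set(words_j)
    if not a or not b:
        return 0.0
    return len(a & b) / (len(a) + len(b))


def similarity_matrix(words):
    """Build the symmetric m x m matrix sim, with zeros on the diagonal."""
    m = len(words)
    sim = [[0.0] * m for _ in range(m)]
    for i in range(m):
        for j in range(i + 1, m):
            sim[i][j] = sim[j][i] = cooccurrence_similarity(words[i], words[j])
    return sim


if __name__ == "__main__":
    for row in similarity_matrix([["cat", "sat"], ["cat", "ran"], ["dog"]]):
        print(row)
```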
2. Iterate according to the formula $pr_i = \frac{1-d}{m} + d \sum_j sim_{j,i} \cdot pr_j / out_j$, where $d \in (0,1)$, $m$ is the maximum dimension of the matrix, $out_j$ is the out-degree of sentence vertex $j$ (i.e. sentence $j$), and the convergence precision is 0.001;
3. Sort the sentences $s_i$ in descending order of their $pr_i$ values to generate the sentence sequence $s_k$;
wherein $word_i$ is the word sequence corresponding to sentence text $s_i$, $word_j$ is the word sequence corresponding to sentence text $s_j$, $sim_{i,j}$ is the contribution of sentence $i$ to sentence $j$, and the initial value of $pr$ is $1/m$.
Note: the formula in (2.2) is derived from the PageRank algorithm (Brin and Page, 1998).
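The iteration and the descending sort can be sketched as below. Assumptions are labeled in the code: the default `d = 0.85` is a conventional PageRank damping value (the text only requires $d \in (0,1)$), `out_j` is taken as the sum of row $j$ of the similarity matrix, convergence is measured by the largest per-sentence change, and `max_iter` is a hypothetical safety cap.

```python
def rank_sentences(sim, d=0.85, tol=0.001, max_iter=100):
    """Iterate pr_i = (1 - d)/m + d * sum_j sim[j][i] * pr[j] / out[j]."""
    m = len(sim)
    if m == 0:
        return []
    out = [sum(row) for row in sim]   # assumed: weighted out-degree of vertex j
    pr = [1.0 / m] * m                # initial value 1/m
    for _ in range(max_iter):
        new = []
        for i in range(m):
            s = sum(sim[j][i] * pr[j] / out[j] for j in range(m) if out[j] > 0)
            new.append((1 - d) / m + d * s)
        delta = max(abs(n - p) for n, p in zip(new, pr))
        pr = new
        if delta < tol:               # convergence precision 0.001
            break
    return pr


def ranked_order(pr):
    """Sentence indices sorted by descending pr value."""
    return sorted(range(len(pr)), key=lambda i: pr[i], reverse=True)


if __name__ == "__main__":
    sim = [[0.0, 1.0, 0.0], [1.0, 0.0, 1.0], [0.0, 1.0, 0.0]]
    pr = rank_sentences(sim)
    print(pr, ranked_order(pr))
```

In the small example, the middle sentence is linked to both neighbours and therefore ranks first.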
Fig. 4 is a flowchart of the result generation module. The specific steps of the result generation module are as follows:
1. Take the first $L$ sentences from $s_k$, where $L \in (1, m)$;
2. Expand each of the $L$ sentences taken with its preceding and following context to obtain the set $s_l$;
3. Re-sort $s_l$ according to the order in the original text to obtain $s'_l$;
4. In combination with $h_i$ ($i \in \{1, 2, 3, \ldots, m\}$) and the position information, insert $s'_l$ at the corresponding positions;
5. If several consecutive sentences are all unselected, that is, not in the set $s'_l$, merge them and join with '...';
6. According to the length or percentage set by the user, judge whether the length of the result of step 5 meets the requirement and, if it is exceeded, truncate the text to obtain the final result.
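Steps 1 to 5 of result generation can be sketched as a single function. This is a simplified illustration under stated assumptions: the context window of one sentence on each side is a guess (the text does not say how far to expand), each kept sentence is wrapped in its own tag rather than regrouped per block, and `build_summary` and its parameters are hypothetical names.

```python
def build_summary(tags, sentences, order, top_l, window=1):
    """Assemble an HTML summary from the top-ranked sentences.

    tags/sentences are parallel lists in original order; order lists
    sentence indices by descending rank.
    """
    m = len(sentences)
    selected = set()
    for i in order[:top_l]:
        # Context expansion: also keep `window` neighbours on each side.
        for k in range(max(0, i - window), min(m, i + window + 1)):
            selected.add(k)

    parts = []
    gap_open = False
    for i in range(m):                 # walk in original-text order
        if i in selected:
            parts.append(f"<{tags[i]}>{sentences[i]}</{tags[i]}>")
            gap_open = False
        elif not gap_open:
            parts.append("...")        # one marker per run of unselected sentences
            gap_open = True
    return "".join(parts)


if __name__ == "__main__":
    tags = ["p"] * 6
    sentences = ["s0", "s1", "s2", "s3", "s4", "s5"]
    print(build_summary(tags, sentences, order=[4, 0, 1, 2, 3, 5], top_l=1))
```

Walking the sentences in their original order makes the re-sorting of step 3 implicit, and the `gap_open` flag collapses each run of unselected sentences into a single ellipsis.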
Finally, it should be noted that the above embodiments are intended only to illustrate the technical solution of the present invention and not to limit it. Those of ordinary skill in the art may still modify the specific embodiments of the invention or make equivalent replacements with reference to the above embodiments; any such modification or equivalent replacement that does not depart from the spirit and scope of the invention falls within the scope of the pending claims of the present invention.