Disclosure of Invention
The invention provides an article generating method based on multi-dimensional data, which aims to overcome the defects described in the background art.
In order to achieve the above object, the present invention provides the following technical solutions: the article generating method based on the multi-dimensional data comprises the following steps:
step one, preprocessing text data and establishing a data index base, wherein the data index base is established by constructing an associated semantic chain network, mapping texts, reconstructing semantic communities and acquiring the text recognition rate;
step two, carrying out data retrieval by inputting keywords and searching the data index base, so as to form a keyword-related text set;
step three, generating multiple texts from the text set obtained by the keywords according to a network graph structure, and ranking the texts;
and step four, the operator screens the candidate text sentences according to the sentence ordering of the candidate texts, so as to accurately locate the candidate text sentences.
In a preferred embodiment, the flow for constructing the associated semantic chain network is as follows:
selecting the word set from the text preprocessing stage as the semantic node set, and taking the co-occurrence relation between words as the association rule;
carrying out community division on the associated semantic chain network through the weights of the semantic chains, so as to form single semantic communities each describing the same event or topic; the weight of the semantic chain between word a and word b is calculated from n_ab, the number of texts in which word a and word b occur together, n_a, the number of texts in which word a occurs, and n_b, the number of texts in which word b occurs, where α and β are the weight coefficients of word a and word b.
In a preferred embodiment, the text mapping is performed as follows:
acquiring the mutual information between a text and a semantic community to measure their degree of similarity; the smaller the mutual information, the lower the relevance between the text and the semantic community, and the lower the likelihood that the text is similar to the topic described by the semantic community; the larger the mutual information, the higher the relevance, and the higher the likelihood that the text is similar to the topic described by the semantic community; the mutual information is calculated as follows:
where p(d) is the probability that document d is selected; since texts are independent of one another during the mapping of texts to communities, p(d) is set to 1; p(c_i) is the probability that community c_i is selected; and p(d, c_i) is the joint probability of document d and event semantic community c_i, which is defined by the degree of association between document d and community c_i; this degree of association is calculated by a similarity formula between document d and community c_i.
In a preferred embodiment, the process of reconstructing the semantic community is as follows:
the edge weights of the semantic community are adjusted through a formula that combines the semantic chain weight between nodes a and b after the text mapping is finished, the semantic chain weight between nodes a and b within the community when the community division ends, and the semantic chain weight between nodes a and b in the original associated semantic chain network;
the text recognition rate is obtained by the following steps:
number of partitioned semantic communities
And text quantity->
The closer the ratio is to 1, the closer the result of semantic community division is to the actual situation;
in a preferred embodiment, the keyword relevance text set is generated in the following manner:
and inputting keywords, performing answering segmentation on the input keywords, quickly searching a text set matched with the keywords from a data index library according to the keywords, sequencing the text set, and returning to a list display page.
In a preferred embodiment, the ranking procedure for the candidate text sentences is as follows:
the text sentence scores are obtained, and the texts are sorted according to the text sentence scores; the sorting flow is as follows:
an undirected graph is constructed through the LexRank algorithm, with sentences as nodes and the similarity between nodes as edges; each sentence node is scored for importance according to its degree and the weights of its edges, and the top-scoring sentences are finally selected as the text candidate sentences;
the text candidate sentence scores are calculated as follows:
the text candidate sentences are represented with a vector space model; because the text candidate sentences are generated from short texts, their word sequences can be obtained directly; with words as the smallest linguistic units in the VSM, the set of text sentences can be represented as:
S = {s_1, s_2, ..., s_n}, where s_k is one text sentence in the text sentence set S, whose vector representation is s_k = (w_k1, w_k2, ..., w_km), where w_ki is the TF-ISF value of the i-th term feature item in the k-th text sentence; TF-ISF is calculated as TF-ISF(w, s_k) = tf(w, s_k) · log(N / n), where tf(w, s_k) is the frequency of word w in the text sentence, N is the number of text sentences in the text sentence set, and n is the number of text sentences containing word w;
the similarity of each text sentence in the text sentence set is calculated through cosine similarity and is used as the weight among all nodes in the language network diagram, and the calculation mode is as follows:
the constructed undirected graph G is represented by an n × n matrix A, where n is the total number of sentences in the text sentence set and the diagonal elements A_ii are set to 0, i.e., the self-connections of sentence nodes are ignored;
the semantic representations of the text candidate sentence and of the event are calculated, and the relevance of the text sentence to the event is calculated through a formula in which e is the event semantic community obtained by the community discovery algorithm;
the modified saliency of a sentence node is expressed by the following formula, in which S refers to the whole text sentence set and d is an adjustable parameter in the interval [0, 1]; the larger the value of d, the more weight the relevance between the text sentence and the event semantics is given in the saliency score; with this definition, LexRank can be converted into a matrix form in which A and B are both square matrices, the A matrix represents the similarity between sentence nodes, the B matrix represents the similarity between sentence nodes and the event semantics, and the resulting score vector gives the final scores of the text sentence set.
In a preferred embodiment, the redundancy processing manner of the candidate text sentence is as follows:
two sets are defined: a result set, which is initially empty, and a candidate set, in which each element is a text sentence together with its score from the saliency calculation in the previous step;
the elements of the candidate set are sorted in descending order of score;
the text sentence with the highest score is moved from the candidate set to the result set, and the scores of all text sentences remaining in the candidate set are recalculated according to a formula;
this is repeated until the candidate set is empty or the text reaches a preset condition limit;
finally, sentence ordering is carried out on the candidate texts to form a candidate text sentence recommendation ranking sequence:
when two candidate text sentences both contain time information, they are arranged in chronological order;
when two candidate text sentences both come from the same source text, they are ranked according to their order of appearance in that source text;
when two candidate text sentences neither contain time information nor originate from the same source text, they are sorted by the relevance of the text sentences to the event semantics, with the more relevant candidate text sentence placed before the less relevant one.
In a preferred embodiment, the method for accurately locating the candidate text sentence is as follows:
the availability of the candidate text sentences is checked, and candidate text sentences whose references deviate are removed to form a new candidate text sentence recommendation ranking sequence; at the same time, the features of the removed candidate text sentences are recorded, the operator is asked for the reason for removal and this information is collected, and text sentences matching the removal reason are avoided when keywords are subsequently input for data retrieval.
In the technical scheme, the invention has the technical effects and advantages that:
1. According to the method and device of the invention, related texts can be searched for content and the text sentences scored according to the input keywords, and the text sentences are then ranked according to their similarity to the keywords, so that semantics can be discovered spontaneously from the text set, multiple documents are generated, and articles are produced according to the user's requirements; the user can quickly review the data, the retrieval time is saved, and the data acquisition efficiency is improved.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments of the present invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
In embodiment 1, referring to fig. 1, the article generating method based on multi-dimensional data in this embodiment includes the following steps:
step one, preprocessing text data and establishing a data index base, wherein the data index base is established by constructing an associated semantic chain network, mapping texts, reconstructing semantic communities and acquiring the text recognition rate;
the flow for constructing the associated semantic chain network comprises the following steps:
selecting the word set from the text preprocessing stage as the semantic node set, and taking the co-occurrence relation between words as the association rule (namely, two semantic nodes that occur in the same text are regarded as associated, and the degree of association between semantic nodes is measured through the weight of the semantic chain);
carrying out community division on the associated semantic chain network through the weights of the semantic chains, so as to form single semantic communities each describing the same event or topic; the weight of the semantic chain between word a and word b is calculated from n_ab, the number of texts in which word a and word b occur together, n_a, the number of texts in which word a occurs, and n_b, the number of texts in which word b occurs, where α and β are the weight coefficients of word a and word b;
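The functional form of the semantic chain weight is not reproduced above, so the short Python sketch below should be read as one plausible instantiation only: it counts document frequencies and co-occurrences and combines them with the coefficients α and β; the normalisation used is an assumption, not the claimed formula.

    from collections import Counter
    from itertools import combinations

    def semantic_chain_weights(texts, alpha=0.5, beta=0.5):
        # Assumed weighting: w(a, b) = alpha * n_ab / n_a + beta * n_ab / n_b (illustrative only).
        doc_freq = Counter()   # n_a: number of texts containing word a
        co_freq = Counter()    # n_ab: number of texts containing both a and b
        for words in texts:    # each text is a preprocessed list of words
            unique = set(words)
            doc_freq.update(unique)
            co_freq.update(frozenset(pair) for pair in combinations(sorted(unique), 2))
        weights = {}
        for pair, n_ab in co_freq.items():
            a, b = tuple(pair)
            weights[(a, b)] = alpha * n_ab / doc_freq[a] + beta * n_ab / doc_freq[b]
        return weights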
the text mapping is performed as follows:
acquiring the mutual information between a text and a semantic community to measure their degree of similarity; the smaller the mutual information, the lower the relevance between the text and the semantic community, and the lower the likelihood that the text is similar to the topic described by the semantic community; the larger the mutual information, the higher the relevance, and the higher the likelihood that the text is similar to the topic described by the semantic community; the mutual information is calculated as follows:
where p(d) is the probability that document d is selected; since texts are independent of one another during the mapping of texts to communities, p(d) is set to 1; p(c_i) is the probability that community c_i is selected; and p(d, c_i) is the joint probability of document d and event semantic community c_i, which is defined by the degree of association between document d and community c_i; this degree of association is calculated by a similarity formula between document d and community c_i;
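The mutual information formula itself does not survive in the text above. As a hedged reconstruction only, the standard pointwise form consistent with the definitions just given (p(d) set to 1 and the joint probability taken as the document-community association degree) would be:

    \[
      MI(d, c_i) = p(d, c_i)\,\log\frac{p(d, c_i)}{p(d)\,p(c_i)}, \qquad p(d) = 1, \quad p(d, c_i) \approx \mathrm{sim}(d, c_i)
    \]

where sim(d, c_i) denotes the association degree between document d and community c_i; the symbol sim is introduced here for illustration and is not taken from the original formula.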
the process of reconstructing the semantic community is as follows:
the edge weights of the semantic community are adjusted through a formula that combines the semantic chain weight between nodes a and b in the associated semantic chain network after the text mapping is finished, the semantic chain weight between nodes a and b within the community when the community division ends, and the semantic chain weight between nodes a and b in the original associated semantic chain network;
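The adjustment formula is likewise not reproduced above. One natural form, offered purely as an assumption for illustration, is a weighted combination of the three weights just listed:

    \[
      w_{ab}^{\mathrm{new}} = \lambda_1\, w_{ab}^{\mathrm{map}} + \lambda_2\, w_{ab}^{\mathrm{comm}} + (1 - \lambda_1 - \lambda_2)\, w_{ab}^{\mathrm{orig}}
    \]

where w_map, w_comm and w_orig denote the weight after text mapping, the weight at the end of community division and the weight in the original associated semantic chain network, and λ1, λ2 are mixing coefficients; all of these symbols are hypothetical.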
the text recognition rate is obtained by the following steps:
number of partitioned semantic communities
And text quantity->
The closer the ratio is to 1, the closer the result of semantic community division is to the actual situation;
on the other hand, the purity index is used for verifying the accuracy of the algorithm; the specific calculation formula is as follows;
wherein the method comprises the steps of
Is a weighted harmonic mean of accuracy and recall, get +.>
;
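For concreteness, the recognition-rate check and the purity check can be sketched in Python as below; the balanced F1 form of the harmonic mean is an assumption, since the exact weighting is not specified above.

    def recognition_ratio(num_communities, num_texts):
        # Ratio of divided semantic communities to texts; values near 1 indicate a division close to reality.
        return num_communities / num_texts

    def f_measure(precision, recall, beta=1.0):
        # Weighted harmonic mean of precision and recall; beta = 1 gives the balanced F1 score.
        if precision + recall == 0:
            return 0.0
        return (1 + beta ** 2) * precision * recall / (beta ** 2 * precision + recall)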
step two, data retrieval is carried out by inputting keywords and searching the data index base, so as to form a keyword-related text set;
keywords are input and segmented into words; the text set matching the keywords is quickly retrieved from the data index base according to the keywords, ranked, and returned to a list display page, forming the keyword-related text set;
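The index structure and ranking criterion are not detailed above; the sketch below assumes an inverted index mapping each word to the set of text identifiers containing it, and ranks texts simply by the number of query words they match.

    def retrieve(query, inverted_index):
        # Hypothetical keyword retrieval: segment the query, look up each word in an
        # inverted index (word -> set of text ids), and rank texts by matched-word count.
        hits = {}
        for word in query.split():
            for text_id in inverted_index.get(word, set()):
                hits[text_id] = hits.get(text_id, 0) + 1
        return sorted(hits, key=hits.get, reverse=True)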
step three, generating multiple texts from the text set obtained by the keywords according to a network graph structure, and ranking the texts;
the multiple texts are generated by carrying out multi-dimensional data statistics on the text set (the multi-dimensional statistics cover aspects such as the text source website and media attention); the text set is clustered using a text similarity analysis method, semantics are discovered spontaneously from the text set, and, combined with a multi-document generation method, articles are generated according to the user's requirements;
the text sentence scores are obtained, and the texts are sorted according to the text sentence scores; the sorting flow is as follows:
as shown in fig. 2, an undirected graph is constructed through the LexRank algorithm, with sentences as nodes and the similarity between nodes as edges; each sentence node is scored for importance according to its degree and the weights of its edges, and the top-scoring sentences are finally selected as the text candidate sentences;
the text candidate sentence scores are calculated as follows:
the text candidate sentences are represented with a vector space model; because the text candidate sentences are generated from short texts, their word sequences can be obtained directly (without word segmentation, stop-word removal, or similar operations on the text sentences); with words as the smallest linguistic units in the VSM, the set of text sentences can be represented as:
S = {s_1, s_2, ..., s_n}, where s_k is one text sentence in the text sentence set S, whose vector representation is s_k = (w_k1, w_k2, ..., w_km), where w_ki is the TF-ISF value of the i-th term feature item in the k-th text sentence; TF-ISF is calculated as TF-ISF(w, s_k) = tf(w, s_k) · log(N / n), where tf(w, s_k) is the frequency of word w in the text sentence, N is the number of text sentences in the text sentence set, and n is the number of text sentences containing word w;
the similarity of each text sentence in the text sentence set is calculated through cosine similarity and is used as the weight among all nodes in the language network diagram, and the calculation mode is as follows:
the constructed undirected graph G is represented by an n × n matrix A, where n is the total number of sentences in the text sentence set and the diagonal elements A_ii are set to 0, i.e., the self-connections of sentence nodes are ignored;
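The vector construction and the similarity matrix described above can be sketched as follows; the TF-ISF form tf · log(N/n) matches the definition given earlier, while the function and variable names are illustrative.

    import math

    def tfisf_vectors(sentences):
        # sentences: list of sentences, each a list of words.
        # TF-ISF(w, s) = tf(w, s) * log(N / n), as defined above.
        N = len(sentences)
        vocab = sorted({w for s in sentences for w in s})
        sent_freq = {w: sum(1 for s in sentences if w in s) for w in vocab}
        return [[s.count(w) * math.log(N / sent_freq[w]) for w in vocab] for s in sentences]

    def cosine(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        nu = math.sqrt(sum(a * a for a in u))
        nv = math.sqrt(sum(b * b for b in v))
        return dot / (nu * nv) if nu and nv else 0.0

    def adjacency(vectors):
        # n x n similarity matrix A with zero diagonal (self-connections ignored).
        n = len(vectors)
        return [[cosine(vectors[i], vectors[j]) if i != j else 0.0 for j in range(n)] for i in range(n)]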
the semantic representations of the text candidate sentence and of the event are calculated, and the relevance of the text sentence to the event is calculated through a formula in which e is the event semantic community obtained by the community discovery algorithm;
the modified saliency of a sentence node is expressed by the following formula, in which S refers to the whole text sentence set and d is an adjustable parameter in the interval [0, 1] that adjusts, in the saliency calculation, the proportion between the similarity of a sentence node to the event topic and its similarity to the other sentence nodes; the larger the value of d, the more weight the relevance between the text sentence and the event semantics is given in the saliency score; with this definition, LexRank can be converted into a matrix form in which A and B are both square matrices, the A matrix represents the similarity between sentence nodes, the B matrix represents the similarity between sentence nodes and the event semantics, and the resulting score vector gives the final scores of the text sentence set;
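The matrix form of the modified LexRank is not reproduced above; the sketch below implements one plausible reading of it, often called biased LexRank, in which each sentence's score mixes its relevance to the event (weight d) with scores propagated from similar sentences (weight 1 − d). The combination rule and convergence details are assumptions.

    def biased_lexrank(A, rel, d=0.85, iters=100, tol=1e-6):
        # A: sentence-similarity matrix with zero diagonal; rel[i]: relevance of sentence i to the event.
        n = len(A)
        rel_sum = sum(rel) or 1.0
        deg = [sum(A[j]) or 1.0 for j in range(n)]   # degree of each node (A is symmetric)
        p = [1.0 / n] * n
        for _ in range(iters):
            new_p = [d * rel[i] / rel_sum
                     + (1 - d) * sum(A[i][j] / deg[j] * p[j] for j in range(n))
                     for i in range(n)]
            if max(abs(a - b) for a, b in zip(new_p, p)) < tol:
                return new_p
            p = new_p
        return p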
redundancy processing:
two sets are defined: a result set, which is initially empty, and a candidate set, in which each element is a text sentence together with its score from the saliency calculation in the previous step;
the elements of the candidate set are sorted in descending order of score;
the text sentence with the highest score is moved from the candidate set to the result set, and the scores of all text sentences remaining in the candidate set are recalculated according to a formula;
this is repeated until the candidate set is empty or the text reaches a preset condition limit;
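The rescoring formula used in the candidate-set loop is not reproduced above; the sketch below uses an MMR-style penalty (subtracting the maximum similarity to already-selected sentences) as an assumed stand-in, and the parameter lam is hypothetical.

    def remove_redundancy(scores, sim, max_select=None, lam=0.7):
        # scores[i]: saliency score of sentence i; sim[i][j]: similarity between sentences i and j.
        candidates = dict(enumerate(scores))
        selected = []
        while candidates and (max_select is None or len(selected) < max_select):
            best = max(candidates, key=candidates.get)   # highest-scoring candidate
            selected.append(best)
            del candidates[best]
            for i in list(candidates):                   # rescore the remaining candidates
                penalty = max(sim[i][j] for j in selected)
                candidates[i] = lam * scores[i] - (1 - lam) * penalty
        return selected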
finally, sentence ordering is carried out on the candidate texts to form a candidate text sentence recommendation ranking sequence:
when two candidate text sentences both contain time information, they are arranged in chronological order;
when two candidate text sentences both come from the same source text, they are ranked according to their order of appearance in that source text;
when two candidate text sentences neither contain time information nor originate from the same source text, they are sorted by the relevance of the text sentences to the event semantics, with the more relevant candidate text sentence placed before the less relevant one;
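The three ordering rules can be sketched as a pairwise comparison; each candidate is assumed here to be a dictionary with optional 'time', 'source' and 'position' fields and a 'relevance' field, all of which are hypothetical names.

    from functools import cmp_to_key

    def compare_candidates(a, b):
        # Rule 1: both carry time information -> chronological order.
        if a.get("time") is not None and b.get("time") is not None:
            return (a["time"] > b["time"]) - (a["time"] < b["time"])
        # Rule 2: both come from the same source text -> order of appearance in that text.
        if a.get("source") is not None and a.get("source") == b.get("source"):
            return (a["position"] > b["position"]) - (a["position"] < b["position"])
        # Rule 3: otherwise, higher relevance to the event semantics first.
        return (b["relevance"] > a["relevance"]) - (b["relevance"] < a["relevance"])

    def rank_candidates(candidates):
        return sorted(candidates, key=cmp_to_key(compare_candidates))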
step four, the operator screens the candidate text sentences according to the sentence ordering of the candidate texts, so as to accurately locate the candidate text sentences;
the availability of the candidate text sentences is checked, and candidate text sentences whose references deviate are removed to form a new candidate text sentence recommendation ranking sequence; at the same time, the features of the removed candidate text sentences are recorded, the operator is asked for the reason for removal and this information is collected, and text sentences matching the removal reason are avoided when keywords are subsequently input for data retrieval;
the text searching method and the text searching device can search the content of the related text and acquire the score of the text sentence according to the input keywords, and then sort the text sentence according to the similarity of the keywords, so that the semantics can be found spontaneously from the text set, the generation of multiple documents is performed, articles are generated according to the requirements of users, the users can quickly check the data, the searching time is saved, and the data acquisition efficiency is improved.
The foregoing is merely specific embodiments of the present application, but the scope of the present application is not limited thereto, and any person skilled in the art can easily think about changes or substitutions within the technical scope of the present application, and the changes and substitutions are intended to be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.